Form anti-spam without a captcha: 6 filters that catch 99% of the junk

Honeypot, URL regex, phone length, non-Cyrillic ratio, rate-limit, log. Together they catch almost all form spam with no captcha and no user friction.

A captcha is a tax on conversion. By various measurements it costs 3-8% of real submissions, especially on mobile. From the start of the project I decided against a captcha. Spam is killed by six simple PHP filters, each closing its own class of attacks. Combined, that is about 60 lines of code, zero dependencies, and zero requests to third-party services.

The short version

  • 6 filters: honeypot, URL regex, phone length, non-Cyrillic ratio, rate-limit, log.
  • I do not use a captcha - it costs 3-8% of conversions and ties the form to a third-party service.
  • All filters run on the server (no JavaScript, no dependencies), so bots that arrive without a JS engine cannot bypass them.
  • Spam dropped by about 99% in my metrics; real leads come through without loss.

Filter 1: Honeypot

Add a hidden field to the form, usually with a name like website or url, the ones bots react to most actively. Hide it via CSS (not type="hidden", because bots often skip fields of that type).

<input type="text" name="website" tabindex="-1" autocomplete="off"
       style="position:absolute;left:-9999px;width:0;height:0;visibility:hidden">

Check first on the server:

if (!empty($_POST['website'])) {
    log_spam('honeypot', $_POST);
    exit; // Drop silently, no message
}

Important - drop silently. Do not respond ‘you are a bot’. Bots learn: if your server reacts to a caught honeypot with a redirect or error, they figure out the field is checked. Better to pretend the submission went through: return 200 OK and the normal ‘thank you’ page. The bot thinks it worked and moves on to the next target.
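A minimal sketch of that silent drop, assuming the handler builds its response as a string. The names honeypot_triggered(), thank_you_page() and handle_submission() are hypothetical and not from the original code; the point is only that a caught bot and a real lead receive the same body.

```php
<?php
declare(strict_types=1);

/** True when the hidden field (the article's "website") came back non-empty. */
function honeypot_triggered(array $post, string $field = 'website'): bool
{
    $value = $post[$field] ?? '';
    // A bot may send an array (website[]=x) or a bare "0"; treat both as filled.
    return is_array($value) ? $value !== [] : trim((string) $value) !== '';
}

/** Hypothetical stand-in for the site's real "thank you" page. */
function thank_you_page(): string
{
    return '<!doctype html><title>Thank you</title><p>Thank you, we will call you back.</p>';
}

/** Returns the response body; caught bots get the same body as real leads. */
function handle_submission(array $post): string
{
    if (honeypot_triggered($post)) {
        // log_spam('honeypot', $post) would go here; nothing else is processed.
        return thank_you_page();
    }
    // The remaining filters and the real lead handling would run here.
    return thank_you_page();
}
```

In the form handler this would be called as http_response_code(200); echo handle_submission($_POST); exit; so the status code and the body are identical in both cases.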

In my logs, the honeypot catches 60-80% of all attacks on forms. The simplest and most effective filter.

Filter 2: Regex for URLs in the comment

Most of the remaining spam is messages with links to gambling, replica goods, and escort services. The URL goes directly into the comment or message field.

$message = $_POST['message'] ?? '';
if (preg_match('#https?://|www\.|\.com/|\.ru/|\.net/|\.shop/#i', $message)) {
    log_spam('url_in_message', $_POST);
    exit;
}

This blocks ‘normal’ spam with links. It happens that a legitimate user wants to mention a link - say, ‘here is our site example.com’. Solution - do not use the ‘message’ field as a catch-all. In my form there is a separate ‘site’ field (optional, passes through its own URL validator), and the message field has URLs disallowed by policy.

If your customers often mention sites, loosen the filter to the explicit http:// and https:// prefixes only and drop the www. and domain-suffix alternatives. Then a mention such as www.example.com or example.com/prices passes, but a real link like https://casino.xyz still gets caught.
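The two policies can be put side by side. This is a sketch only: the function names has_url_strict() and has_url_loose() are made up for illustration, while the first pattern is the one from the filter above.

```php
<?php
declare(strict_types=1);

/** Strict policy from the article: any URL-like fragment is rejected. */
function has_url_strict(string $message): bool
{
    return preg_match('#https?://|www\.|\.com/|\.ru/|\.net/|\.shop/#i', $message) === 1;
}

/** Loosened policy: only an explicit http:// or https:// scheme is rejected. */
function has_url_loose(string $message): bool
{
    return preg_match('#https?://#i', $message) === 1;
}
```

Note that the strict pattern matches suffixes only when they are followed by a slash, so a bare example.com passes both variants, while example.com/prices passes only the loose one.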

Filter 3: Phone length

This is common sense. A Russian phone number has exactly 10 digits without the country code or 11 with it. Bots usually enter 4-7 random digits, or as many as 16-20 (imitating an international format with extra junk).

$phone = preg_replace('/\D/', '', $_POST['phone'] ?? '');
$len = strlen($phone);
if ($len < 10 || $len > 11) {
    log_spam('phone_length', $_POST);
    exit;
}

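preg_replace('/\D/', '', ...) strips everything that is not a digit: spaces, dashes, brackets. After that you count the length. Russian and Kazakhstani numbers (+7) are 10-11 digits; other CIS country codes such as +375 or +998 give 12 digits, so raise the upper bound if you serve those countries. For international customers in general, expand to 15 (the E.164 maximum).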

Optional: check that the first digit is 7 or 8 (useful if your bots tend to get the first digit wrong):

if ($len === 11 && !in_array($phone[0], ['7', '8'])) {
    log_spam('phone_country', $_POST);
    exit;
}
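Both phone checks can be folded into one helper. The sketch below is an illustration, not the original code: phone_reject_reason() and its parameters are hypothetical, and the return values reuse the reason strings logged above.

```php
<?php
declare(strict_types=1);

/** Strip everything except digits: spaces, dashes, brackets, the leading plus. */
function phone_digits(string $raw): string
{
    return preg_replace('/\D/', '', $raw) ?? '';
}

/**
 * Returns the rejection reason, or null when the number looks plausible.
 * Defaults mirror the article (10-11 digits, leading 7 or 8 on 11 digits).
 */
function phone_reject_reason(
    string $raw,
    int $min = 10,
    int $max = 11,
    bool $requireRuPrefix = true
): ?string {
    $digits = phone_digits($raw);
    $len = strlen($digits);

    if ($len < $min || $len > $max) {
        return 'phone_length';
    }
    if ($requireRuPrefix && $len === 11 && !in_array($digits[0], ['7', '8'], true)) {
        return 'phone_country';
    }
    return null;
}
```

For an international audience the call would be phone_reject_reason($raw, 10, 15, false), which widens the range to the E.164 maximum and switches off the 7/8 check.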

Filter 4: Non-Cyrillic ratio in comments

The site is Russian-speaking, real customers write in Russian. English-spam bots are filtered by the ratio of Cyrillic to total characters.

$message = $_POST['message'] ?? '';
$len = mb_strlen($message);
if ($len > 5) {
    // Count Cyrillic characters
    preg_match_all('/[\p{Cyrillic}]/u', $message, $matches);
    $cyr = count($matches[0]);
    if ($cyr / $len < 0.3) {
        log_spam('not_cyrillic', $_POST);
        exit;
    }
}

30% Cyrillic is a working threshold. A mixed comment, Russian text with digits and units such as the Russian for ‘apartment cleaning, area 80 m², 2 bathrooms’, passes. Pure English spam gets cut. Pure Russian passes naturally.

For English-language sites you invert the filter and check the Latin ratio. For bilingual sites you can either disable it or raise the allowed non-Cyrillic share from 70% to 80% (a Cyrillic threshold of 20%), but then more spam gets through.
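The ratio check and its inversion can share one function. This is a sketch under a few assumptions: script_share() and fails_script_filter() are hypothetical names, characters are counted with a /u regex rather than mb_strlen() so the snippet does not need the mbstring extension, and malformed UTF-8 is treated as spam.

```php
<?php
declare(strict_types=1);

/** Share of characters in $text that belong to the given Unicode script. */
function script_share(string $text, string $script = 'Cyrillic'): float
{
    // Whitelist the script name because it is interpolated into the pattern.
    if (!in_array($script, ['Cyrillic', 'Latin'], true)) {
        throw new InvalidArgumentException('Unsupported script: ' . $script);
    }
    $total = preg_match_all('/./su', $text);   // code points; false on invalid UTF-8
    if ($total === false || $total === 0) {
        return 0.0;
    }
    $hits = preg_match_all('/\p{' . $script . '}/u', $text);
    return (float) ($hits / $total);
}

/** True when the message should be dropped by the script-ratio filter. */
function fails_script_filter(
    string $text,
    string $script = 'Cyrillic',
    float $minShare = 0.3,
    int $minLength = 5
): bool {
    $total = preg_match_all('/./su', $text);
    if ($total === false) {
        return true;                 // malformed UTF-8: assumed to be junk
    }
    if ($total <= $minLength) {
        return false;                // too short to judge, same as the article
    }
    return script_share($text, $script) < $minShare;
}
```

An English-language site would call fails_script_filter($message, 'Latin'); the threshold and minimum length stay adjustable per site.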

Filter 5: Rate-limit at 5 requests per hour per IP

One user does not submit ten forms per minute. If an IP makes more than 5 requests per hour, it is most likely an attacker testing filters or pushing bulk spam.

On shared hosting, where there is no Redis, the simplest place to store the counters is a file or MySQL. I keep them in MySQL:

CREATE TABLE rate_limit (
    ip VARCHAR(45) NOT NULL,
    ts INT UNSIGNED NOT NULL,
    KEY idx_ip_ts (ip, ts)
);

Before processing the submission:

$ip = $_SERVER['REMOTE_ADDR'];
$hour_ago = time() - 3600;

$pdo->prepare("DELETE FROM rate_limit WHERE ts < ?")->execute([$hour_ago]);

$stmt = $pdo->prepare("SELECT COUNT(*) FROM rate_limit WHERE ip = ? AND ts > ?");
$stmt->execute([$ip, $hour_ago]);
$count = $stmt->fetchColumn();

if ($count >= 5) {
    log_spam('rate_limit', $_POST);
    exit;
}

$pdo->prepare("INSERT INTO rate_limit (ip, ts) VALUES (?, ?)")->execute([$ip, time()]);

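The DELETE above already prunes expired rows on every submission. On a busy form, move that statement into a daily cron job instead; either way the table does not grow indefinitely.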

The catch is users behind NAT (a corporate network, a mobile operator). If 10 people in one office fill in forms, they all share the office IP and can trigger the rate-limit. 5 per hour usually leaves room: a real user rarely submits more than 1-2 times. If you are worried, raise it to 10-20 per hour. Just do not remove it entirely, or a mass attack from a single IP will flood the inbox.
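For hosts without a database, the file-based storage mentioned earlier could look roughly like this. It is not the MySQL code above but an alternative sketch: rate_limit_allow() is a hypothetical name, $dir is assumed to be a writable directory outside the web root, and the function fails open if the file cannot be opened.

```php
<?php
declare(strict_types=1);

/**
 * File-backed sliding window. Returns true and records the hit when $ip has
 * fewer than $limit hits in the last $window seconds; returns false otherwise.
 */
function rate_limit_allow(
    string $ip,
    string $dir,
    int $limit = 5,
    int $window = 3600,
    ?int $now = null
): bool {
    $now  = $now ?? time();
    $path = $dir . '/rl_' . sha1($ip) . '.txt';   // hash keeps IPv6 colons out of file names

    $fh = fopen($path, 'c+');                     // create if missing, do not truncate
    if ($fh === false) {
        return true;                              // fail open: storage trouble must not block leads
    }
    flock($fh, LOCK_EX);                          // serialize concurrent requests from one IP

    // Keep only the timestamps that are still inside the window.
    $hits = [];
    while (($line = fgets($fh)) !== false) {
        $ts = (int) trim($line);
        if ($ts > $now - $window) {
            $hits[] = $ts;
        }
    }

    $allowed = count($hits) < $limit;
    if ($allowed) {
        $hits[] = $now;
    }

    // Rewrite the file with the pruned list.
    ftruncate($fh, 0);
    rewind($fh);
    fwrite($fh, $hits ? implode("\n", $hits) . "\n" : '');
    flock($fh, LOCK_UN);
    fclose($fh);

    return $allowed;
}
```

The comparison mirrors the SQL version: five recorded hits inside the hour means the next request is rejected. Files for idle IPs are never deleted here, so a periodic cleanup of the directory is still needed.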

Filter 6: Log to a protected file

All rejected submissions are written to spam_log.txt. Not the general server log, not the DB (DB is more expensive on writes), but a simple text file:

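function log_spam(string $reason, array $data): void {
    // JSON_INVALID_UTF8_SUBSTITUTE keeps json_encode() from returning false on malformed spam payloads
    $json = json_encode($data, JSON_UNESCAPED_UNICODE | JSON_INVALID_UTF8_SUBSTITUTE);
    $entry = date('c') . ' | ' . $reason . ' | IP ' . $_SERVER['REMOTE_ADDR'] . ' | ' . $json . "\n";
    // LOCK_EX serializes concurrent appends
    file_put_contents(__DIR__ . '/../spam_log.txt', $entry, FILE_APPEND | LOCK_EX);
}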

The log absolutely has to be protected at the web-server level - otherwise anyone can download it and see your spam patterns. For Apache, in the root .htaccess:

<Files "spam_log.txt">
    Require all denied
</Files>

For Nginx, inside the site's server block:

location = /spam_log.txt {
    deny all;
}

What the log gives you. Once a week I open it and look. You see which filters trigger most often: if 90% are honeypot, the rest barely fire because bots do not get past the first check. You see attack patterns: a sudden flood from one IP range, a stream of Chinese comments with links on the same topic. From those patterns you can tune filters or temporarily ban a range in .htaccess.
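The weekly look at which filters fire most can be automated with a few lines. This is a sketch that assumes the line format written by log_spam() above (date | reason | IP | JSON); count_spam_reasons() is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/** Tally log lines per rejection reason, most frequent first. */
function count_spam_reasons(string $logPath): array
{
    $counts = [];
    if (!is_readable($logPath)) {
        return $counts;
    }
    $lines = file($logPath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
    foreach ($lines as $line) {
        // Limit of 4 keeps any " | " inside the JSON payload in the last part.
        $parts = explode(' | ', $line, 4);
        if (count($parts) < 4) {
            continue;                       // skip lines that do not match the format
        }
        $reason = $parts[1];
        $counts[$reason] = ($counts[$reason] ?? 0) + 1;
    }
    arsort($counts);                        // highest count first, keys preserved
    return $counts;
}
```

Running it from the command line with print_r(count_spam_reasons('/path/to/spam_log.txt')) gives the per-filter totals without opening the file by hand; the path is a placeholder.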

After 30 days I rotate spam_log.txt → spam_log.txt.bak and a fresh empty file gets created. I keep the old one for one more period for analysis, then delete it.
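The rotation itself fits in a small function that a cron job could call. This is a sketch, not the original script: rotate_spam_log() is a hypothetical name, and it assumes the age of the log is taken from the date('c') timestamp of its first entry.

```php
<?php
declare(strict_types=1);

/**
 * Rename the log to "<path>.bak" once its first entry is older than
 * $maxAgeDays, then create a fresh empty file. Returns true if it rotated.
 */
function rotate_spam_log(string $path, int $maxAgeDays = 30, ?int $now = null): bool
{
    $now = $now ?? time();
    clearstatcache(true, $path);
    if (!is_file($path) || filesize($path) === 0) {
        return false;
    }

    // The first line starts with date('c'), e.g. 2024-01-01T00:00:00+00:00
    $fh = fopen($path, 'r');
    if ($fh === false) {
        return false;
    }
    $first = fgets($fh);
    fclose($fh);
    if ($first === false) {
        return false;
    }

    $started = strtotime(explode(' | ', $first, 2)[0]);
    if ($started === false || $now - $started < $maxAgeDays * 86400) {
        return false;
    }

    // The previous .bak has been kept for one period; drop it now.
    $bak = $path . '.bak';
    if (is_file($bak)) {
        unlink($bak);
    }
    if (!rename($path, $bak)) {
        return false;
    }
    touch($path);                           // fresh empty log
    return true;
}
```

The .bak file sits next to the log, so it needs the same deny rule at the web-server level.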

Order of filters

Important point: check in the right order, from cheap to expensive, so that you do not run a rate-limit SQL query when the honeypot has already flagged the submission as garbage.

// 1. Honeypot - cheapest
if (!empty($_POST['website'])) { log_spam('honeypot', $_POST); exit; }

// 2. URL regex - also cheap (in-memory)
if (preg_match('#https?://#i', $_POST['message'] ?? '')) { ... }

// 3. Phone length - cheap
if (strlen(preg_replace('/\D/', '', $_POST['phone'] ?? '')) < 10) { ... }

// 4. Non-Cyrillic - slightly more expensive due to UTF-8 regex
preg_match_all('/[\p{Cyrillic}]/u', $_POST['message'] ?? '', $m);
// ...

// 5. Rate-limit - most expensive, needs SQL
$pdo->prepare("SELECT COUNT(*) FROM rate_limit WHERE ...");

That keeps load minimal - most attacks get cut on the first two filters, before the server touches the DB.
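Put together as one function, the ordering might look like the sketch below. It is an illustration rather than the site's actual handler: first_failed_filter() is a hypothetical name, the rate-limit store is passed in as a callback so the example stays independent of MySQL, and characters are counted with a /u regex instead of mb_strlen().

```php
<?php
declare(strict_types=1);

/**
 * Runs the in-memory checks first and the storage-backed rate-limit last.
 * Returns the name of the first filter that rejects, or null if all pass.
 */
function first_failed_filter(array $post, string $ip, callable $rateLimitOk): ?string
{
    // Treat missing or non-string fields as empty strings.
    $text = static fn(string $key): string => is_string($post[$key] ?? null) ? $post[$key] : '';
    $message = $text('message');

    // 1. Honeypot: cheapest
    if (!empty($post['website'])) {
        return 'honeypot';
    }

    // 2. URL regex: in-memory
    if (preg_match('#https?://|www\.|\.com/|\.ru/|\.net/|\.shop/#i', $message)) {
        return 'url_in_message';
    }

    // 3. Phone length: in-memory
    $len = strlen(preg_replace('/\D/', '', $text('phone')) ?? '');
    if ($len < 10 || $len > 11) {
        return 'phone_length';
    }

    // 4. Cyrillic ratio: UTF-8 regex, slightly more expensive
    $total = preg_match_all('/./su', $message);
    if ($total === false) {
        return 'not_cyrillic';              // malformed UTF-8, assumed to be junk
    }
    if ($total > 5 && preg_match_all('/\p{Cyrillic}/u', $message) / $total < 0.3) {
        return 'not_cyrillic';
    }

    // 5. Rate-limit: the only step that touches storage
    if (!$rateLimitOk($ip)) {
        return 'rate_limit';
    }
    return null;
}
```

The caller logs the returned reason and drops the submission silently; a null result means the lead goes on to normal processing.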

What these filters do not close

Six filters kill automated spam. They do not close:

  • Targeted attacks by a human. If someone sits and fills your form by hand with competing offers - filters let it through. But that is a rare and expensive attack. One or two cases a year - resolved in 5 minutes.
  • Lead-generation spam via CRM. Sometimes a ‘lead generation’ contractor mass-registers a client’s contacts on sites of friends with fake data. The honeypot does not fire (a human is filling in the form), so the rate-limit is the only filter that catches it.
  • DDoS on a POST endpoint. That is a server-level problem, not application-level. Solved by nginx limit_req or CDN. Beget CDN does this at the edge for free.

Together the filters cover 99% of automated traffic attacking a normal B2B services site. That is enough for a business that is not being singled out by a determined attacker.

The full case on launching a custom site on shared hosting with PHP 8.4 is in the article ‘50 days of SEO in B2B cleaning’. Form anti-spam is part of week-one work.

Related: CLS 0.377 → 0.002 in a day and OPcache on shared hosting - other fast fixes with big impact.

Frequently asked questions

Why am I against reCAPTCHA and similar ready-made solutions?

Three reasons. First, conversion. By various studies reCAPTCHA v2 costs 3-8% of users, especially on mobile and on older devices. That is money. Second, external dependency. reCAPTCHA is a request to google.com, which from Russia may be blocked or DPI-throttled. If a user is on a corporate VPN or their provider filters traffic, the form simply does not submit, with no clear error. Third, privacy. reCAPTCHA collects a device fingerprint and sends it to Google, which is an extra risk under Russia's 152-FZ for a Russian site.

What is a honeypot and how does it work?

A field in the form, hidden from users via CSS but visible to bots and parsers. Bots usually fill in every field in turn (text inputs, URLs, emails) because they cannot tell visible from hidden. If the field is filled, it is a bot, so drop the submission. Real users do not see the field and do not fill it. A simple honeypot kills 60-80% of automated form spam with zero user-facing friction.

Doesn't a rate-limit block real customers?

With the right thresholds, no. I allow 5 submissions per hour from one IP. Real scenario: a person submits, mistypes the phone, fixes it, resubmits. That is 2. Maybe a third with a clarification in the comment. A sixth submission from the same IP within an hour is already suspicious, and that is the one the limit blocks. If you have an office on NAT with several people who actually submit forms 10 times, raise the limit to 20/hour.

Why a non-Cyrillic filter on comments?

The site is Russian-speaking and the audience is Russia and the CIS. Real customers write in Russian. Spam bots often post in English or Chinese with links to gambling, pharma, and porn. Filter logic: if the comment is over 5 characters and less than 30% of it is Cyrillic, drop it. That does not block Russian comments mixed with digits and units (the Russian for 'need apartment cleaning, area 80 m²') but kills pure English spam. For English-language sites you invert the filter.

Where to store the spam log and why look at it?

In a separate file `spam_log.txt`, protected via `.htaccess`. Why look at it? First, to measure filter effectiveness: how much and what kind of attack is coming. Second, to spot false positives: if a legitimate lead landed in the log, the filter needs tuning. Third, to build a keyword blocklist. After the fact you see patterns (constant mentions of 'crypto', 'replica', 'escort') and can add them to a blacklist. The log must not be public: in `.htaccess`, add the rule `<Files spam_log.txt> Require all denied </Files>`.