Effectual spam filtering for WordPress

Saturday, 30 November, Year 5 d.Tr. | Author: Mircea Popescu

Trilema runs an (ancient) version of WordPress which has meanwhile been modified into oblivion. I've already explained in detail how to process Bitcoin payments cheaply and easily; now let's see if we can do something to alleviate the problem of spam comments.

Trilema does not use Akismet or any other third-party antispam plugin. Nevertheless, to date it has received 18`219 spam comments[1] out of a 96`421 total over five years. That's an 81% ham rate, which I'd say is not too shabby. How is this miracle accomplished ?

I. Change the comment form. Let's examine Trilema's comment form together.[2]

    <form action="http://trilema.com/wp-comments-post.php" method="post"
     id="commentform">
    
    <!--- This is just a decoy ;)
    <p><input type="text" name="author" id="author" value="" size="22"
    tabindex="1" />
    <label for="author"><small>Name (required)</small></label></p>
    <p><input type="text" name="email" id="email" value="" size="22"
    tabindex="2" />
    
    <label for="email"><small>E-Mail (will not be published , required)
    </small></label></p>
    <p><input type="text" name="url" id="url" value="http://pest"
    size="22" tabindex="3" />
    -->
    
    <p><input type="text" name="authore46998f" id="authore46998f"
    value="" size="22" tabindex="1" />
    <label for="author"><small>Name (required)</small></label></p>
    <p><input type="text" name="emaile46998f" id="emaile46998f"
    value="" size="22" tabindex="2" />
    <label for="email"><small>E-Mail
    </small></label></p>
    <p><input type="text" name="urle46998f" id="urle46998f" value=""
    size="22" tabindex="3" />
    
    <label for="url"><small>Website (optional)</small></label></p>
    
    <!--- This is just a decoy ;)
    <p><input type="text" name="author" id="author" value="" size="22"
    tabindex="1" />
    <label for="author"><small>Name (required)</small></label></p>
    <p><input type="text" name="email" id="email" value="" size="22"
    tabindex="2" />
    
    <label for="email"><small>E-Mail (will not be published , required)
    </small></label></p>
    <p><input type="text" name="url" id="url" value="http://pest"
    size="22" tabindex="3" />
    -->

The grey part (the two blocks wrapped in <!--- ... -->) is an HTML comment. It is ignored by the browser, but it is nevertheless a useful countermeasure against the grep-based page scraper.[3] The red part (the e46998f suffix on the real field names) is however the meat and potatoes of this entire thing. Specifically, I bet you that the page you're looking at has a different string there.

This magic is achieved by altering your WordPress in the following manner : in wp-comments-post.php you add

    // "salt" is a placeholder secret [4]
    $suffix = substr(md5(date('Y-m-d') . "salt" . $_SERVER['REMOTE_ADDR']), 0, 6);

    $comment_author = ( isset( $_POST[ 'author'.$suffix ] ) ) ? trim( strip_tags( $_POST[ 'author'.$suffix ] ) ) : null;

and in comments.php[5] you add

    <p><input type="text" name="author<?php $suffix = substr(md5(date('Y-m-d') . "salt" . $_SERVER['REMOTE_ADDR']), 0, 6); echo $suffix; ?>" id="author" value="<?php echo $comment_author; ?>" size="22" tabindex="1" />

The effect of all this is that the forms allowing commenting on any one Trilema page are only usable on the day the page was loaded, by the same machine (strictly speaking, the same IP address) that loaded the page.

This matches pretty well the experience of most legitimate users - when was the last time you switched computers mid-way commenting on a blog by means of saving the page on one machine, then burning it on a CD, copying it on another computer, opening it in a browser there and finally submitting the comment ? or when was the last time you took over a day writing down your comment ? - but it throws a royal monkey wrench into the workings of spam scripts, because on one hand most spam is sent by amateur "SEO experts" running pretty stupid software that doesn't even bother visiting your page, but instead simply assumes your html form parameters will be in the default state and connects directly to wp-comments-post.php with a forged POST request, and on the other hand because the enterprise-level spammer that does occasionally actually scrape pages before spamming them tends to use separate machines for the two tasks, which won't work with our set-up. In either case this simple two-line change of two files is going to cut the spam you receive by orders of magnitude, and the best part is that nobody will likely even notice.
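For what it's worth, the same idea can be wrapped in two small helpers so that author, email and url all get the same treatment. What follows is only a sketch under assumptions: the names comment_field_suffix and read_suffixed_field are made up for illustration and are neither WordPress functions nor what Trilema actually runs, "salt" stands in for your own secret, and 203.0.113.7 is a documentation address rather than a real visitor.

```php
<?php
// Hypothetical helper (not a WordPress function): builds the six-character
// suffix from the day, a secret string and the visitor's address.
function comment_field_suffix($secret, $remote_addr, $now = null)
{
    $day = date('Y-m-d', $now === null ? time() : $now);
    return substr(md5($day . $secret . $remote_addr), 0, 6);
}

// Hypothetical helper: returns the cleaned value of a suffixed form field,
// or null when the field was not submitted under the expected name.
function read_suffixed_field(array $post, $field, $suffix)
{
    $key = $field . $suffix;
    return isset($post[$key]) ? trim(strip_tags($post[$key])) : null;
}

// A visitor loads the page and posts the same day from the same address.
$noon   = mktime(12, 0, 0, 11, 30, 2013);
$suffix = comment_field_suffix('salt', '203.0.113.7', $noon);
$post   = array('author' . $suffix => ' <b>Alice</b> ');
$author = read_suffixed_field($post, 'author', $suffix);

// A script posting blindly to the default field name is simply not read.
$blind  = read_suffixed_field(array('author' => 'Cheap Pills'), 'author', $suffix);
```

Here $author comes out as "Alice" and $blind as null. In the theme template the same comment_field_suffix call would supply the name attribute of each of the three inputs, so the form and the handler cannot drift apart.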

II. Change the link counting rules. Hey, it's your blog, you can change anything you want!

Modify wp-includes/comment.php so that function check_comment includes the following :

    $textchk = apply_filters('comment_text', $comment);
    $linkcount = preg_match_all(
    "|(href\t*?=\t*?['\"]?)?(https?:)?//|i", $textchk, $out);
    $plinkcount = preg_match_all(
    "|(href\t*?=\t*?['\"]?)?(https?:)?//trilema\.com/|i",$textchk,
    $out);
    $linkcount_result = $linkcount - $plinkcount;

And then further down the road in the comment whitelisting section modify the if as follows :

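    if ( ( 1 == $ok_to_comment ) &&
        ( empty($mod_keys) || false === strpos( $email, $mod_keys) ) ) {
        // Whitelisted commenter: approve only if there are no outside links.
        if ($linkcount_result == 0) return 1; else return 0;
    } else {
        // Everyone else: several outside links is spam, the rest is moderated.
        if ($linkcount_result > 1) return "spam"; else return 0;
    }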

What this does is as follows : the first part counts the number of times the string "//" appears in a comment, with or without "http:" or "https:" in front of it, and then counts how many of those are actually references to your own blog (so change trilema\.com to your own domain - don't forget the dot needs to be escaped in regex). If there are more such references than links to your own site, this logically means your commenter is linking outside. The second part then moves comments from unwhitelisted[6] commenters straight into spam if they contain more than one outside link (as written the test is > 1 ; make it > 0 if a single outside link should be enough), whereas it puts comments from whitelisted people that contain outside links into your moderation queue so you can review them. Unwhitelisted users go to your moderation queue by default.
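To see the counting and the branching in isolation, here is a minimal sketch outside of WordPress. The function names outside_link_count and classify_comment are hypothetical, the sample comment text is made up, and the sketch works on the raw text rather than on the output of apply_filters('comment_text', ...), so treat it as an illustration of the logic and not as a drop-in replacement for check_comment.

```php
<?php
// Hypothetical helper: number of link-like "//" occurrences in $text that
// do not point at $own_domain (which must be followed by a slash to count).
function outside_link_count($text, $own_domain)
{
    $any = "(href\t*?=\t*?['\"]?)?(https?:)?//";
    $all = preg_match_all("|" . $any . "|i", $text, $out);
    $own = preg_match_all("|" . $any . preg_quote($own_domain, '|') . "/|i", $text, $out);
    return $all - $own;
}

// Hypothetical helper mirroring the if above: 1 = approve, 0 = hold for
// moderation, "spam" = straight to the spam folder.
function classify_comment($whitelisted, $outside_links)
{
    if ($whitelisted) {
        return $outside_links == 0 ? 1 : 0;
    }
    return $outside_links > 1 ? "spam" : 0;
}

// One link home, one link elsewhere.
$text    = 'See <a href="http://trilema.com/some-post/">this</a> and http://example.com/pills';
$outside = outside_link_count($text, 'trilema.com');
$known   = classify_comment(true, $outside);
$unknown = classify_comment(false, 2);
```

With this sample $outside is 1, so the whitelisted commenter lands in moderation ($known is 0), while an unknown commenter with two outside links gets "spam". Note that a link to the bare domain with no trailing slash is counted as an outside link, same as in the original pattern.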

So there you have it, with this arrangement you'll never see all the nonsense spam posted by first time commenters - fictitious personas like those Steve from Virginia battles in his recent article - but you'll still see all the comments made by legitimate first time users[7], as well as all the links made by your whitelisted tribe.[8] Together with the previous measure, I will be very much surprised if spam comments still give you trouble, even should you live to average fifty comments a day for a five year stretch.

Enjoy blogging!

———
  1. Comes to a little under ten a day, but the best part is that they mostly go straight into the spam folder anyway, so it's not like I have to click on anything.
  2. This can be obtained by loading any page where one can put in comments, clicking "show source" or whatever your browser uses to show the html source of a page and searching for "form".
  3. A page scraper is an automated utility that downloads webpages looking for particular strings. For instance if someone was aiming to spam WordPress blogs, perhaps one way they'd proceed would be to look all over for pages that contain the string

    <input type="text" name="author" id="author" value="

    If the scraper itself was complex enough to parse html and eliminate comments our decoy would do nothing, but because of the humongous volume of data these scripts have to sift through and the memory and CPU constraints of the real world economics of spamming it's more likely the would-be spammer will never know it choked on a html comment.

  4. Replace this string with a well chosen secret string of your own.
  5. Found under whatever directory your theme lives, such as /wp-content/themes/mytheme/.
  6. In this context whitelisted means "user who already has had one comment approved".
  7. For any definition of legitimate that reduces to "doesn't spam links as their first comment".
  8. And if moderating all links whitelisted users post seems undesirable to you, all it takes is changing the first branch of the if to always return 1. Trilema has this system in place because I don't use the rel="nofollow" convention and I like to keep some control of what I end up linking to.
Category: Meta psihoza

3 Responses

  1. Peter Lambert
    Friday, 6 December 2013

    This is a pretty ingenious approach.

    Is this patent pending yet?

  2. Mircea Popescu
    Saturday, 7 December 2013

    Nah. Have at.

  3. Peter Lambert
    Thursday, 3 April 2014

    testing comment feature ... I swear I am an actual human being ...
