I’ve been noticing a pattern in the comment spam being received on a couple of our blogs. See the following comment:

A couple of things I noticed about the pattern:
- the comment spam frequently comes from blogspot.com
- it starts with a hyperlink, followed by a sentence or two of text, usually nonsense
Let’s write some code to detect any HTML that starts with a hyperlink followed by a couple of sentences:
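<?php
// Spam-style comment: starts with a hyperlink, followed by filler text.
$html_bad = '<a href="http://somespam.somedomain.com/">Pussycat Dolls</a> it rose-tinted in the lust rare in your perspective, in your perspective blah. Blah blah yadda yadda.';
// Legitimate-looking comment: the hyperlink appears mid-text, not at the start.
$html_good = 'It rose-tinted in the lust rare in your perspective, in your perspective blah. Blah blah yadda yadda. <a href="http://somespam.somedomain.com/">Pussycat Dolls</a> Blah blah yadda yadda.';
// ^ anchors the match to the start of the string; group 2 captures everything up to the closing >.
preg_match("%^<a href=(\"http://|http://)([^>]+).*%i", $html_bad, $matched);
var_dump($matched);
// Strip the trailing double quote from the captured URL before printing it.
print "<br />URL: <font color='red'><b>" . str_replace("\"", "", $matched[2]) . '</b></font><hr />';
// The same pattern finds no match here, so $matched comes back as an empty array.
preg_match("%^<a href=(\"http://|http://)([^>]+).*%i", $html_good, $matched);
var_dump($matched);
?>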
Notes: the ^ anchors the match to the start of the string, and the second pair of parentheses captures the URL (without the http:// prefix). We could use a more complicated regular expression to also clean up the captured URL, but I did that in a separate step using str_replace. This removes the trailing double quote (").
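The separate cleanup step could also be folded into the pattern itself. Here is a minimal sketch, assuming the href value is either double-quoted or unquoted; the names $pattern and $m are illustrative and not from the snippet above:

```php
<?php
// Hypothetical one-step variant: the optional quote sits outside the capture group,
// and the character class stops at a double quote, '>' or whitespace.
$pattern = '%^<a href="?http://([^">\s]+)%i';

$html_bad = '<a href="http://somespam.somedomain.com/">Pussycat Dolls</a> Blah blah yadda yadda.';

if (preg_match($pattern, $html_bad, $m)) {
    // $m[1] holds the host and path, with no scheme and no trailing quote.
    print $m[1] . "\n";
}
```

Because the quote is never captured, no str_replace call is needed afterwards.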
Possible alterations? Check for https URLs as well. The regular expression above only matches links that begin with http://, so a spam link using https:// would slip through.
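One way to cover https might be to make the s optional with https?. The sketch below assumes a hypothetical helper named leading_link_url(), which is not part of the original snippet, and keeps the scheme inside the captured URL:

```php
<?php
// Hypothetical helper: returns the URL of a hyperlink that opens the comment,
// or null when the comment does not start with a link.
function leading_link_url($html)
{
    // https? matches both http and https; the optional quote stays outside the capture.
    $pattern = '%^<a href="?(https?://[^">\s]+)%i';
    if (preg_match($pattern, $html, $m)) {
        return $m[1];
    }
    return null;
}

// A comment that opens with an https link is flagged; one with the link mid-text is not.
var_dump(leading_link_url('<a href="https://somespam.somedomain.com/">Pussycat Dolls</a> Blah blah.'));
var_dump(leading_link_url('Blah blah. <a href="http://somespam.somedomain.com/">Pussycat Dolls</a>'));
```

Returning null for the no-match case lets the caller tell "no leading link" apart from an empty string.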

Hi,
Nice snippet. What’s the ‘%’ used for? Does it do the same as ‘*.’?
on July 28th, 2006 at 12:19 am