Spider Hunter

18 Sep

More thoughts on the spider catcher script

I just went through my IP database, which now stands at 383,472 IP addresses checked :-), looking at the IP addresses whose percentile score is 0.75. A score of 0.75 means the IP address has never had any cookies, has never sent any referrer data, and has a known good spider user agent. Three of the four criteria for being a spider have been met; the only one missing is that the address is not close enough to an existing spider IP address to get caught.

Two things jumped out at me.

First, the Teoma user agent gets faked more than any other, typically by dial-up or DSL users. So why not use an anti-spam RBL for dynamic IP addresses to help filter these out? The idea is that a corporate search engine would not run its spiders from a dynamic IP address or a dial-up account.

Second, extending the IP address check. I found two or three Google spiders and one or two Inktomi spiders while going through my logs. Those, and the ones near them, will get caught like they are supposed to. But why not use the ARIN data to find the entire range assigned to a specific company, and then bump up the percentile score if an IP address falls in that range?

Both of these have their issues, and I'll go back and forth on whether I want to implement them. I may do it as a secondary check on anything that scores above 0.75, to clear out the faked agents and bump up the real ones. This would create a two-phase spider checker script where anything above 0.75 gets sent through a more thorough script and the ones above 0.90 get sent directly to me to look at. (The ones above 0.98 are already being added automatically.)
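To make the two-phase idea concrete, here is a rough Python sketch of how it might fit together. It is not the actual script. Everything not stated in the post is an assumption for illustration: the four criteria are weighted equally (which is what makes three out of four come out to 0.75), the second phase includes scores of exactly 0.75, a blacklisted dynamic address is zeroed out, the bump is 0.2, the RBL zone `dynamic.rbl.example` is a made-up name, and `192.0.2.0/24` is a documentation range standing in for a range looked up from ARIN.

```python
import ipaddress
import socket

# Thresholds taken from the post; how the script compares against them is assumed.
SECOND_PHASE_THRESHOLD = 0.75
REVIEW_THRESHOLD = 0.90
AUTO_ADD_THRESHOLD = 0.98


def base_score(has_cookies, has_referrer, known_spider_agent, near_known_spider):
    """First phase: fraction of the four spider criteria that are met (equal weights assumed)."""
    met = [not has_cookies, not has_referrer, known_spider_agent, near_known_spider]
    return sum(met) / len(met)


def dnsbl_name(ip, zone):
    """Build the DNS name used to query an RBL: reversed IPv4 octets plus the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_dynamic(ip, zone, resolve=socket.gethostbyname):
    """Treat the address as dynamic/dial-up if the RBL name resolves at all."""
    try:
        resolve(dnsbl_name(ip, zone))
        return True
    except socket.gaierror:
        return False


def in_company_range(ip, networks):
    """True if the address falls inside any CIDR block assigned to the company."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net) for net in networks)


def second_phase(score, dynamic, in_range, bump=0.2):
    """Second phase: clear out faked agents, bump up addresses in a known range."""
    if score < SECOND_PHASE_THRESHOLD:
        return score          # not a candidate for the thorough check
    if dynamic:
        return 0.0            # assumed: a dynamic IP means the agent is faked
    if in_range:
        return min(1.0, score + bump)
    return score


def route(score):
    """Decide what happens to an address after scoring."""
    if score > AUTO_ADD_THRESHOLD:
        return "auto-add"
    if score > REVIEW_THRESHOLD:
        return "manual review"
    return "no action"
```

For example, an address with no cookies, no referrer and a known spider agent scores 0.75 in the first phase. If it is not on the dynamic-IP list and falls inside a known company range, the assumed bump lifts it past 0.90 and `route` sends it for manual review. The `resolve` argument is there only so the DNS lookup can be swapped out when testing without a network.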



© 2008 Spider Hunter