Spider Hunter

29 Jan

Catching Spiders with the IP database

Last night I decided to put the new IP database to the test and see whether it could give me the data I was after. I am mostly interested in tracking search engine spiders, so here is what I found.

The database holds 568,286 IP addresses in total, of which 216,122 have at least 1 visit. First I looked for the IPs with more than 10 visits in the past 20 months: 36,448, which cut things down pretty fast. Next I removed anything that I had time zone, color depth or resolution data for, since those clients were running JavaScript: 35,972. Not much of a drop, but I only started tracking that at the beginning of this year, so this will become more important over time.

Next I looked at the average cookie value for each IP address. In my database a missing value is recorded as a “No Data” entry, so no data really is some data. That means that any IP address with an average cookie of 1 is likely a spider. That cut my number down to 11,709. Next I looked at the average referer data, again looking for an average of 1. That got my number down to 9,468.

Now I started looking at the DNS name of each IP, and here is what I got:
* Googlebot: 758
* Inktomisearch: 1,089
* MSNBot: 230
* Teoma/Ask Jeeves: 66
This is pretty close to the numbers that I have from my own manual tracking, but now I can do this in 1 SQL statement for each spider and in milliseconds, instead of minutes for each IP :-)
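
For anyone who wants to try the same kind of funnel, here is a minimal sketch of what the one-statement-per-spider query could look like. It is not the actual schema behind this post: the `ip_stats` table, its column names, the sample rows and the `count_spider_ips` helper are all assumptions made up for illustration, and it assumes the `visits` column already covers only the 20-month window and that an average of 1 means every visit pointed at the “No Data” entry.

```python
import sqlite3

# Hypothetical schema: one pre-aggregated row per IP address.
# Table and column names are assumptions, not the real database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ip_stats (
        ip          TEXT PRIMARY KEY,
        dns_name    TEXT,
        visits      INTEGER,  -- visits within the 20-month window (assumed)
        has_js_data INTEGER,  -- 1 if time zone, color depth or resolution was seen
        avg_cookie  REAL,     -- 1.0 = every visit mapped to the "No Data" entry
        avg_referer REAL      -- 1.0 = every visit mapped to the "No Data" entry
    )
""")

# Made-up sample rows (documentation IP range, invented host names).
rows = [
    ("192.0.2.1", "crawl-1.googlebot.com",   250, 0, 1.0, 1.0),  # spider
    ("192.0.2.2", "crawl-2.googlebot.com",    40, 0, 1.0, 1.0),  # spider
    ("192.0.2.3", "host3.inktomisearch.com",  90, 0, 1.0, 1.0),  # spider
    ("192.0.2.4", "dsl-4.example.net",        55, 1, 3.2, 2.7),  # browser with JavaScript
    ("192.0.2.5", "crawl-5.googlebot.com",     4, 0, 1.0, 1.0),  # too few visits
    ("192.0.2.6", "cache.example.org",        30, 0, 1.0, 4.5),  # sends referers
]
conn.executemany("INSERT INTO ip_stats VALUES (?, ?, ?, ?, ?, ?)", rows)


def count_spider_ips(conn, dns_pattern):
    """Count likely spider IPs whose DNS name matches a LIKE pattern."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM ip_stats
        WHERE visits > 10          -- more than 10 visits
          AND has_js_data = 0      -- no sign of JavaScript running
          AND avg_cookie = 1       -- never sent a cookie
          AND avg_referer = 1      -- never sent a referer
          AND dns_name LIKE ?      -- spider's DNS name
        """,
        (dns_pattern,),
    ).fetchone()
    return row[0]


for name, pattern in [("Googlebot", "%googlebot%"),
                      ("Inktomisearch", "%inktomisearch%")]:
    print(name, count_spider_ips(conn, pattern))
```

Each spider gets its own call with a different DNS pattern, which matches the "1 SQL statement for each spider" idea. With a real table you would want indexes on the filtered columns to keep it in the millisecond range.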


© 2008 Spider Hunter | Entries (RSS) and Comments (RSS)