Future additions to the spider checker script
After looking at how the spider checker script collects its data, I’m thinking of adding a few things to it.

First, I want to create a non-spider list: IP addresses that are simply not spiders, such as aol.com’s caching computers and altavista’s babelfish. (Is that still around?) When the script looks at an IP address on or near a non-spider, the IP check will be set to 0 by default, or at least that IP address will not increase the score at all.

Next, I want a user agent block list. It will contain user agents for spiders that are not search engine spiders, things like Alexa and some of the image and intellectual property spiders. This also balances the user agent test, since these user agents will bring up a lower score on it.

The reason I want to make these additions is the number of times I end up looking at things with a 0.50 or a 0.75 score to see what they are. A 0.75 score means that the user agent looks good, the http_referrer is good and the http_cookie is good, but no IP anywhere near this address is from a valid spider. This is often a sign of a good spider in a new range, but I keep finding fakes, and I want a way to eliminate the fakes so I don’t look at them again. Also, when I’m looking at the 0.50 scores I’m seeing a whole lot of grub clients, which I simply don’t trust enough to put into the database as known spiders.
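To make the idea concrete, here is a rough Python sketch of how the two new lists could feed into the score. It is not the actual script. It assumes four equally weighted checks worth 0.25 each (which is what a 0.75 score for three passing checks suggests), it treats "near" as "inside the same network range", and every name, address range and user agent token in it is a made-up placeholder.

```python
import ipaddress

# All of the lists below are hypothetical placeholders, not real data.
# The address ranges are reserved documentation ranges.
NON_SPIDER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]       # e.g. caching proxies
KNOWN_SPIDER_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]  # ranges already in the database
BLOCKED_AGENT_TOKENS = ["examplearchiver", "exampleimagebot"]      # spiders that are not search engines
SPIDER_AGENT_TOKENS = ["examplesearchbot"]                         # agents that look like search spiders

WEIGHT = 0.25  # assumed: four checks, equally weighted


def in_any(ip, networks):
    """Return True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def ip_score(ip):
    """IP check: the non-spider list wins, so it can never raise the score."""
    if in_any(ip, NON_SPIDER_NETWORKS):
        return 0.0
    return WEIGHT if in_any(ip, KNOWN_SPIDER_NETWORKS) else 0.0


def agent_score(user_agent):
    """User agent check: a block-listed agent scores 0 even if it looks like a spider."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_AGENT_TOKENS):
        return 0.0
    return WEIGHT if any(token in ua for token in SPIDER_AGENT_TOKENS) else 0.0


def spider_score(ip, user_agent, referrer_ok, cookie_ok):
    """Sum the four checks; referrer_ok and cookie_ok are the results of the existing tests."""
    score = ip_score(ip) + agent_score(user_agent)
    score += WEIGHT if referrer_ok else 0.0
    score += WEIGHT if cookie_ok else 0.0
    return score
```

With this layout, a good-looking visitor from a listed non-spider range tops out at 0.75, and a block-listed user agent from an unknown range tops out at 0.50, so both can be recognised and skipped instead of being checked by hand again.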






