We log visits and visitors on every page hit, and bots are clogging up our database. We can't use a CAPTCHA or similar techniques, because the logging happens before we ask for any human input. We simply want to log page hits by humans only.
Is there a list of known bot IPs out there? Does checking for known bot user agents work?
There is no sure-fire way to catch all bots. A bot could act just like a real browser if someone wanted that.
Most serious bots identify themselves clearly in the user-agent string, so with a list of known bots you can filter out most of them. You can also add to the list the agent strings that some HTTP libraries send by default, to catch bots from people who don't even know how to change the agent string. If you log the agent strings of your visitors for a while, you should be able to pick out the ones to store in the list.
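As a rough sketch of the agent-string check, something like the following could run before each log write. The marker list and the name `is_probable_bot` are only illustrative assumptions, not a complete or authoritative list; you would build the real list from the agent strings you see in your own logs.

```python
# Substrings that suggest a self-identified bot or an HTTP library's
# default agent string. Illustrative only; extend from your own logs.
BOT_MARKERS = (
    "bot", "crawler", "spider",          # common in crawler agent strings
    "python-requests", "curl", "wget",   # defaults of common HTTP tools
    "libwww-perl", "java/", "go-http-client",
)

def is_probable_bot(user_agent):
    """Return True if the agent string looks like a bot.

    A missing or empty agent string is treated as a bot here, which is
    an assumption: ordinary browsers always send one.
    """
    if not user_agent:
        return True
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)
```

You would then skip the database insert whenever `is_probable_bot(request_user_agent)` is true. Plain substring matching is crude and can misfire on an unusual browser string, so it is worth reviewing what gets filtered from time to time.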
You can also make a "bad bot trap" by putting a hidden link on your page that leads to a page that is disallowed in your robots.txt file. Serious bots will not follow the link, and humans can't click on it, so only bots that don't follow the rules will request that page.
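Here is a minimal sketch of how the trap could feed into the logging decision. The path `/bot-trap/`, the class `BotTrap` and the in-memory set are all made up for illustration; a real site would probably keep the flagged addresses in the database and expire them after a while, since IPs get reused.

```python
TRAP_PATH = "/bot-trap/"  # hypothetical path, pick your own

# robots.txt tells well-behaved bots to stay away from the trap.
ROBOTS_TXT = "User-agent: *\nDisallow: /bot-trap/\n"

# Hidden link placed in the page; humans never see or click it.
HIDDEN_LINK = '<a href="/bot-trap/" style="display:none" rel="nofollow">x</a>'

class BotTrap:
    def __init__(self):
        # IPs that requested the trap page despite robots.txt
        self.flagged_ips = set()

    def should_log(self, ip, path):
        """Return True if this hit should be written to the visit log."""
        if path.startswith(TRAP_PATH):
            # Only a rule-ignoring bot gets here: flag it, don't log it.
            self.flagged_ips.add(ip)
            return False
        return ip not in self.flagged_ips
```

Note that this only stops logging from the moment the bot hits the trap. Hits it made earlier are already in the log, so you may want to delete those by IP as well.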