The more traffic your Website gets, the more important efficient handling of that traffic becomes. First you want to be sure that your visitors get a fast response, but you also want to make sure that bogus traffic doesn't suck all your bandwidth. Here is a simple way to keep bogus traffic off your Website.
Oh yeah... those pesky offline browsers (or site stealers; who reads content offline these days anyway?), crawlers looking for email addresses, robots looking for vulnerabilities, and the list goes on.
These programs can crawl your Website repeatedly, requesting the same pages over and over again. They can spider your entire Website, harvesting your content. They look for holes to hack into, try to crack your password-protected areas, and more. Even if your site is small, there is someone out there with bad intentions looking for a way in or trying to steal your content.
So in order to keep the bad guys out, we need to find a way to separate them from everyone else. First we'll look to the User Agent to help us identify these pests. We will inspect the User Agent of each request and tell Apache to deny access to your Website for the specific ones we don't like. Doing this keeps some naughty software off your Website and saves you precious resources like processor time, RAM, and bandwidth. There are a number of programs that qualify for the bad list. I won't bore you with a list in this article; I'll just link to a list I found instead.
You could also go the opposite route and allow access only to User Agents for known Web Browsers, but I don't recommend that option because there are plenty of valid browsers, like those used in cell phones, PDAs, and special software for the disabled, being created and updated every day. Creating an allow list means you would need to constantly be on the lookout for new browsers to add, or valid traffic might be denied. That's bad usability, and we don't want that.
The fix is simple: use Apache's built-in SetEnvIf module (mod_setenvif). Just drop the tidbit below into your Apache configuration file (usually httpd.conf) and restart Apache.
Note: I've only included a few of the hundreds of bad User Agents in this example for clarity.
##--> this denies all requests from bogus User Agents
<Location />
SetEnvIf User-Agent "^EmailSiphon" badUA
SetEnvIf User-Agent "^Mister Pix" badUA
Order Allow,Deny
Allow from all
Deny from env=badUA
</Location>
It's also important to note that the code above should be placed in the root section of your Apache configuration, not in a VirtualHost or other container. The example assumes you want to deny access to all web accessible files on your server, so it applies to all Virtual Hosts (if any) you have set up. You can also limit the rule to a single directory. Note that a <Location> container matches a URL path, so to name a directory by its full filesystem path, use a <Directory> container like so. The SetEnvIf lines from the first example are still needed; only the container changes.
<Directory /var/www/my_personal_website/public_html/>
Order Allow,Deny
Allow from all
Deny from env=badUA
</Directory>
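One caveat: Order, Allow, and Deny are the Apache 2.2 access directives. On Apache 2.4 and later they are only available through the mod_access_compat compatibility module, and the Require directive is the preferred replacement. Here is a rough sketch of the server-wide example in 2.4 style, assuming mod_setenvif and mod_authz_core are loaded. It is not part of the original setup, so test it against your own configuration before relying on it.

```apache
# Sketch for Apache 2.4+ (assumes mod_setenvif and mod_authz_core are loaded)
SetEnvIf User-Agent "^EmailSiphon" badUA
SetEnvIf User-Agent "^Mister Pix" badUA

<Location "/">
    <RequireAll>
        # Let everyone in...
        Require all granted
        # ...except requests flagged by the SetEnvIf lines above
        Require not env badUA
    </RequireAll>
</Location>
```

Running apachectl configtest before restarting is a quick way to catch syntax mistakes.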
Adding more User Agents to deny is just as easy: add another SetEnvIf line to the configuration. Just be aware that Apache uses Regular Expression syntax for matching User Agent names, so the ^ in these examples anchors the match to the start of the User Agent string.
SetEnvIf User-Agent "^New Bad User Agent" badUA
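Because the patterns are regular expressions, a few variations are worth knowing. The lines below are only a sketch: the User Agent names (BadBot, SiteGrabber, Example.Bot) are made up for illustration and are not taken from any real bad list.

```apache
# Case-insensitive match at the start of the User Agent string
# (SetEnvIfNoCase is also part of mod_setenvif)
SetEnvIfNoCase User-Agent "^badbot" badUA

# No ^ anchor: matches the name anywhere in the User Agent string
SetEnvIf User-Agent "SiteGrabber" badUA

# Escape regex metacharacters: \. matches a literal dot
SetEnvIf User-Agent "^Example\.Bot/1\.0" badUA
```

Unanchored patterns catch more variants, but they are also more likely to match a legitimate browser by accident, so anchor with ^ when you can.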
If you don't have access to your server's configuration file, have no fear... this also works in your .htaccess files. All you would need to do is remove the <Location> container and everything should work the same way.
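For reference, a .htaccess version might look like the sketch below. It assumes your host's AllowOverride setting permits these directives (FileInfo for SetEnvIf, Limit for Order, Allow, and Deny); if it doesn't, Apache will either ignore the file or return a server error.

```apache
# .htaccess in the directory you want to protect
# (assumes AllowOverride permits FileInfo and Limit)
SetEnvIf User-Agent "^EmailSiphon" badUA
SetEnvIf User-Agent "^Mister Pix" badUA
Order Allow,Deny
Allow from all
Deny from env=badUA
```

The rules apply to the directory containing the .htaccess file and everything beneath it, and no restart is needed since Apache reads the file on each request.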
Now that's it! You're done. You can move on to more Apache tweaking.
J Cornelius is a software developer, Web developer, and Formula 1 fan in Atlanta, GA. He has a strange affinity for odd numbers, European sports cars, and thoughtful analogies, and is hopelessly addicted to chips & salsa.