Does anyone know how to tell the 'facebookexternalhit' bot to spread its traffic?
Our website gets hammered every 45 - 60 minutes with spikes of approximately 400 requests per second, from 20 to 30 different IP addresses in the Facebook netblocks. Between the spikes the traffic does not disappear, but the load is acceptable. Of course we do not want to block the bot, but these spikes are risky. We'd prefer to see the bot spread its load evenly over time and behave like Googlebot & friends.
I've seen related bug reports (First Bug, Second Bug and Third Bug (#385275384858817)), but could not find any suggestions on how to manage the load.
Per other answers, the semi-official word from Facebook is "suck it". It boggles my mind that they cannot follow Crawl-delay (yes, I know it's not a "crawler", but GET'ing 100 pages in a few seconds is a crawl, whatever you want to call it).
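For bots that do honour it, Crawl-delay is a one-line directive in robots.txt; something like the following would ask for a two-second gap between requests (facebookexternalhit seemingly ignores it, which is the whole problem):

User-agent: facebookexternalhit
Crawl-delay: 2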
Since one cannot appeal to their hubris, and DROP'ing their IP block is pretty draconian, here is my technical solution.
In PHP, execute the following code as quickly as possible for every request.
define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Number of seconds permitted between each hit from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && strpos( $_SERVER['HTTP_USER_AGENT'], 'facebookexternalhit' ) === 0 ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
        $lastTime = fread( $fh, 100 );
        $microTime = microtime( TRUE );
        // compare the current microtime with the microtime of the last access
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            // bail if requests are coming too quickly, with HTTP 503 Service Unavailable
            fclose( $fh );
            header( $_SERVER["SERVER_PROTOCOL"].' 503 Service Unavailable' );
            die;
        } else {
            // write out the microsecond time of this access
            rewind( $fh );
            fwrite( $fh, $microTime );
        }
        fclose( $fh );
    } else {
        // could not open the throttle file; refuse the request outright
        header( $_SERVER["SERVER_PROTOCOL"].' 429 Too Many Requests' );
        die;
    }
}
You can test this from a command line with something like:
$ rm index.html*; wget -U "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)" http://www.foobar.com/; less index.html
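To see the throttle itself in action, you can fire two requests back-to-back (substituting your own host for www.foobar.com) and compare the status codes; the second one should come back as 503:

$ curl -s -o /dev/null -w "%{http_code}\n" -A "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)" http://www.foobar.com/
$ curl -s -o /dev/null -w "%{http_code}\n" -A "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)" http://www.foobar.com/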
Improvement suggestions are welcome... I would guess there might be some concurrency issues with a huge blast.
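One way to harden it against that would be file locking: take a non-blocking exclusive lock on the throttle file, and if another Facebook request already holds it, answer 503 immediately instead of racing on the read/write. A rough, untested sketch along those lines (same temp file and constant as above):

define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Number of seconds permitted between each hit from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && strpos( $_SERVER['HTTP_USER_AGENT'], 'facebookexternalhit' ) === 0 ) {
    $fh = fopen( sys_get_temp_dir().'/facebookexternalhit.txt', 'c+' );
    // refuse the request if the file cannot be opened or another request already holds the lock
    if( !$fh || !flock( $fh, LOCK_EX | LOCK_NB ) ) {
        header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
        die;
    }
    $lastTime = (float) fread( $fh, 100 );
    $microTime = microtime( TRUE );
    if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
        // too soon since the last facebookexternalhit request
        flock( $fh, LOCK_UN );
        fclose( $fh );
        header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
        die;
    }
    // record the time of this access, truncating any stale longer value first
    rewind( $fh );
    ftruncate( $fh, 0 );
    fwrite( $fh, $microTime );
    flock( $fh, LOCK_UN );
    fclose( $fh );
}

The non-blocking lock is deliberate: during a blast, every request except the one holding the lock gets the 503 straight away instead of queuing up behind it.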