file_get_contents() gives me 403 Forbidden

Steven · Jul 27, 2012 · Viewed 27.7k times

I have a partner that has created some content for me to scrape.
I can access the page with my browser, but when I try to use file_get_contents, I get a 403 Forbidden.

I've tried using stream_context_create, but that's not helping - it might be because I don't know what should go in there.

1) Is there any way for me to scrape the data?
2) If not, and if my partner is not allowed to configure the server to give me access, what can I do then?

The code I've tried using:

$opts = array(
  'http'=>array(
    'user_agent' => 'My company name',
    'method'=>"GET",
    'header'=> implode("\r\n", array(
      'Content-type: text/plain;'
    ))
  )
);

$context = stream_context_create($opts);

//Get header content
$_header = file_get_contents($partner_url,false, $context);

Answer

Cleric · Jul 27, 2012

This is not a problem in your script; it's a feature of your partner's web server security.

It's hard to say exactly what's blocking you; most likely it's some sort of block against scraping. If your partner has access to the web server's configuration, that might help pinpoint the cause.
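From your side, one way to narrow it down is to look at what the server actually sends back with the 403. The following is only a rough sketch: `last_status_code()` and `fetch_with_diagnostics()` are hypothetical helper names, and the 10-second timeout is an arbitrary choice. It relies on the `ignore_errors` context option, which makes the HTTP wrapper return the body of an error response, and on `$http_response_header`, which the wrapper fills in after the request.

```php
<?php
// Hypothetical helper: extract the numeric status code from the header lines
// that PHP's HTTP wrapper stores in $http_response_header.
function last_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 403 Forbidden". After redirects
        // there can be several, so keep the last one seen.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical helper: fetch a URL and return status, headers, and body,
// even when the server answers with an error status such as 403.
function fetch_with_diagnostics(string $url): array
{
    $context = stream_context_create([
        'http' => [
            'method'        => 'GET',
            'ignore_errors' => true, // return the body of a 4xx/5xx response instead of failing
            'timeout'       => 10,   // seconds; arbitrary value for this sketch
        ],
    ]);

    // @ suppresses the warning raised when the connection itself fails.
    $body = @file_get_contents($url, false, $context);

    // The HTTP wrapper populates $http_response_header in the local scope;
    // it stays undefined if no response was received at all.
    $headers = $http_response_header ?? [];

    return [
        'status'  => last_status_code($headers),
        'headers' => $headers,
        'body'    => $body === false ? null : $body,
    ];
}

// Example usage ($partner_url is the variable from the question):
// $result = fetch_with_diagnostics($partner_url);
// echo $result['status'], "\n", implode("\n", $result['headers']), "\n";
```

The body of a 403 page sometimes names the rule or product that rejected the request, which is useful information to pass on to your partner.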

What you could do is "fake a web browser" by setting the User-Agent header so that the request imitates a standard web browser.

I would recommend cURL for this; good documentation for it is easy to find.

    // create curl resource
    $ch = curl_init();

    // set url
    curl_setopt($ch, CURLOPT_URL, "example.com");

    // return the transfer as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // send a browser-like User-Agent header
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

    // $output contains the response body, or false on failure
    $output = curl_exec($ch);
    if ($output === false) {
        echo 'cURL error: ' . curl_error($ch);
    }

    // close curl resource to free up system resources
    curl_close($ch);