How to scrape iframe content using cURL

ven · Dec 7, 2011 · Viewed 19.7k times

Goal: I want to scrape the word "Paris" inside an iframe using cURL.

Say you have a simple page containing an iframe:

<html>
<head>
<title>Curl into this page</title>
</head>
<body>

<iframe src="france.html" title="test" name="test"></iframe>

</body>
</html>

The iframe page:

<html>
<head>
<title>France</title>
</head>
<body>

<p>The Capital of France is: Paris</p>

</body>
</html>

My cURL script:

<?php

// 1. initialize

$ch = curl_init();

// 2. The URL containing the iframe

$url = "http://localhost/test/index.html";

// 3. set the options, including the url

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// 4. execute and fetch the resulting HTML output by putting into $output

$output = curl_exec($ch);

// 5. free up the curl handle

curl_close($ch);

// 6. Scrape for a single string/word ("Paris")

preg_match('~The Capital of France is:\s*(.*?)\s*</p>~si', $output, $match);

// 7. Display the scraped string

if ($match) {
    echo "The Capital of France is: " . $match[1];
}

?>

Result = nothing!

Can someone help me find out the capital of France?! ;)

I need an example of:

  1. parsing/grabbing the iframe url
  2. curling the url (as I've done with the index.html page)
  3. parsing for the string "Paris"

Thanks!

Answer

Mike Purcell · Dec 7, 2011

--Edit-- You could load the wrapper page's contents into a string, parse that string for the iframe's src attribute, then load the iframe source into another string.

$baseUrl = 'http://localhost/test/';
$wrapperPage = file_get_contents($baseUrl . 'index.html');

// Capture the value of the iframe's src attribute
$pattern = '/<iframe[^>]*\ssrc="([^"]+)"/i';

if (!preg_match($pattern, $wrapperPage, $matches)) {
    throw new Exception('No match found!');
}

// The src here is relative ("france.html"), so resolve it against the wrapper's URL
$src = $baseUrl . trim($matches[1]);

$iframeContents = file_get_contents($src);

var_dump($iframeContents);
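
The same steps (fetch the wrapper, find the iframe's src, fetch that URL) can also be sketched with cURL and DOMDocument instead of a regular expression. This is only a sketch under stated assumptions: the function names findIframeSrc, resolveIframeUrl, fetchUrl and fetchIframeContents are hypothetical, and the URL resolution is deliberately simplified. It handles absolute http(s) URLs, root-relative paths and same-directory relative paths, which covers the "france.html" case but is not a full RFC 3986 resolver.

```php
<?php
// Hypothetical helper: return the src of the first <iframe> in $html, or null.
function findIframeSrc(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for imperfect HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $iframe = $doc->getElementsByTagName('iframe')->item(0);
    if (!($iframe instanceof DOMElement) || !$iframe->hasAttribute('src')) {
        return null;
    }
    return trim($iframe->getAttribute('src'));
}

// Hypothetical helper: simplified resolution of $src against the wrapper page's URL.
function resolveIframeUrl(string $pageUrl, string $src): string
{
    if (preg_match('~^https?://~i', $src)) {
        return $src; // already absolute
    }
    $parts  = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if (strpos($src, '/') === 0) {
        return $origin . $src; // root-relative path
    }
    // Same-directory relative path: keep everything up to the last slash.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}

// Hypothetical helper: GET a URL with cURL and return the body, or throw on failure.
function fetchUrl(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException("cURL failed for $url: " . curl_error($ch));
    }
    return $body; // the handle is released when $ch goes out of scope
}

// Hypothetical helper tying the steps together.
function fetchIframeContents(string $pageUrl): string
{
    $src = findIframeSrc(fetchUrl($pageUrl));
    if ($src === null) {
        throw new RuntimeException("No iframe with a src found at $pageUrl");
    }
    return fetchUrl(resolveIframeUrl($pageUrl, $src));
}
```

With the pages from the question, calling fetchIframeContents('http://localhost/test/index.html') would request index.html, resolve "france.html" to http://localhost/test/france.html, and return that page's HTML, assuming both files are served from that directory.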

--Original--

Work on your acceptance rate (accept answers to previously answered questions).

The URL you are passing to the cURL handle is the page that wraps the iframe. cURL only returns that page's HTML and does not load the iframe's content, so try setting it to the URL of the iframe itself:

$url = "http://localhost/test/france.html";
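
Once cURL is pointed at france.html, the remaining step is pulling the word out of the returned HTML. Here is a minimal sketch, assuming the iframe page's markup is exactly as shown in the question. The helper name extractCapital is hypothetical, and the page is inlined as a string so the example runs without a server; in the real script that string would be the value returned by curl_exec().

```php
<?php
// Hypothetical helper: return the text after "The Capital of France is:", or null.
function extractCapital(string $html): ?string
{
    // Lazy capture up to the closing </p>; \s* absorbs whitespace on both sides.
    $pattern = '~The Capital of France is:\s*(.*?)\s*</p>~si';
    if (preg_match($pattern, $html, $match) !== 1) {
        return null;
    }
    return $match[1];
}

// The iframe page from the question, standing in for the cURL response body.
$franceHtml = <<<EOT
<html>
<head>
<title>France</title>
</head>
<body>

<p>The Capital of France is: Paris</p>

</body>
</html>
EOT;

$capital = extractCapital($franceHtml);
echo $capital === null ? "No match\n" : "The Capital of France is: $capital\n";
// prints "The Capital of France is: Paris"
```

A regular expression is enough for a fixed sentence like this one. For markup that varies, a DOM parser is usually the more robust choice.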