I want to scrape a web page containing a list of users with addresses, emails, etc. The list is paginated: each page shows 10 users, and clicking the page 2 link loads the second page of users via AJAX and updates the list, and so on for every pagination link.
The website is built with ASP.NET (the pages have the .aspx extension). I don't know anything about ASP.NET or how it handles pagination and AJAX.
I am using Simple HTML DOM (http://sourceforge.net/projects/simplehtmldom/) to scrape the content.
For pages with 10 or fewer users I don't have to simulate the AJAX request that is sent when a user clicks a pagination link, but for paginated pages I simulate the AJAX POST request to get the data from the other pages:
require 'simple_html_dom.php';
$url = 'http://www.example.com/user_list.aspx'; // same URL is used for the GET and for every POST
$html = file_get_html($url);
$viewstate = $html->find("#__VIEWSTATE");
$viewstate = $viewstate[0]->attr['value'];
$eventvalidation = $html->find("#__EVENTVALIDATION");
$eventvalidation = $eventvalidation[0]->attr['value'];
$number_of_pageinations = 3;
$pageNumberCodes = array(
'ctl00$cphMainContent$rdpMembers$ctl01$ctl01',
'ctl00$cphMainContent$rdpMembers$ctl01$ctl02',
'ctl00$cphMainContent$rdpMembers$ctl01$ctl03'
); // this code is added for each page in POST as __EVENTTARGET
for ($i = 0; $i < $number_of_pageinations; $i++) {
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 1120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_POST => true,
CURLOPT_VERBOSE => true,
CURLOPT_POSTFIELDS => urlencode('ctl00%24scriptManager=ctl00%24cphMainContent%24ctl00%24cphMainContent%24rdpMembersPanel%7C' . $pageNumberCodes[0] . '&__EVENTTARGET=' . $pageNumberCodes[0] . '&__EVENTARGUMENT=' . '&__VIEWSTATE=' . $viewstate . '&__EVENTVALIDATION=' . $eventvalidation . "&google=" . '&ctl00%24cphMainContent%24txtZip=' . '&ctl00%24cphMainContent%24cboRadius=Exact' . '&ctl00%24cphMainContent%24txtMemberName=' . '&ctl00%24cphMainContent%24txtCity=Honolulu' . '&ctl00%24cphMainContent%24cboState=HI' . '&ctl00%24cphMainContent%24txtAddress=' . '&ctl00_cphMainContent_rdpMembers_ClientState=' . '&ctl00%24cphMainContent%24ddList=-Select%20field%20to%20sort-' . '&ctl00_cphMainContent_ddList_ClientState=' . '&ctl00_cphMainContent_rdlMembers_ClientState=' . '&ctl00_cphMainContent_ddList_ClientState=' . '&ctl00_cphMainContent_rdlMembers_ClientState=' . '&ctl00_cphMainContent_rdpMembers1_ClientState=' . '&__ASYNCPOST=true' . 'RadAJAXControlID=ctl00_cphMainContent_RadAjaxManager1')
);
$ch = curl_init($url);
curl_setopt_array($ch, $options);
$return = curl_exec($ch);
curl_close($ch);
echo $return;
$newHtml = str_get_html($return);
$viewstate = $newHtml->find("#__VIEWSTATE");
$viewstate = $viewstate[0]->attr['value'];
$eventvalidation = $newHtml->find("#__EVENTVALIDATION");
$eventvalidation = $eventvalidation[0]->attr['value'];
}
This should echo the data from each page, but it always prints the data of the first page. Can anybody point out where I am going wrong and what is missing?
I don't know how ASP.NET handles pagination and AJAX requests, or what __EVENTARGUMENT, __VIEWSTATE and __EVENTVALIDATION are.
In general, to make an ASP.NET web site think that you actually pressed a button (in more general terms, performed a postback), you need to do the following:
1. Get the value of every single INPUT and SELECT element on the page. It might not be required in every scenario, but you should always at least get the values of all hidden fields whose name starts with "__" (such as __VIEWSTATE). You don't really need to know what is written in them, only that their values have to be sent back to the server unchanged.
2. Create a POST request to the server. You need to use a classic POST, avoiding any AJAX requests. With some browser plugins (in Firefox or Chrome) it might be possible to disable XMLHttpRequest, so that you can then intercept the non-AJAX request with a tool like Fiddler.
3. Add every value from #1 to that POST request. There are only two values you need to overwrite: __EVENTTARGET and __EVENTARGUMENT. Leave them empty unless the link or button that you are trying to imitate calls __doPostBack, for example <a href="javascript:__doPostBack('ctl00$login','')">. If it does, parse the values from that call: the first one is the event target (it usually matches the ID of some element on the page), the second is the event argument.
4. If you executed the request correctly, you should get back a full HTML page. If you get a partial response, check that you did not pass the HTTP header that asks for an async result.
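The steps above could be sketched in PHP roughly as follows. This is only an illustration under some assumptions: the function names (collectFormFields, parseDoPostBack, buildPostbackBody, postBack) are made up for this example, PHP's built-in DOMDocument is used instead of Simple HTML DOM so that the sketch has no external dependency, and the URL and $pageNumberCodes in the usage comment are the ones from the question. The real page may need extra handling (cookies, additional fields) that is not shown here.

```php
<?php
// Step 1: collect name => value for every named INPUT and SELECT on the page.
function collectFormFields(string $html): array
{
    libxml_use_internal_errors(true); // real-world markup is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        $type = strtolower($input->getAttribute('type'));
        // Skip unnamed inputs and buttons (we imitate the click via __EVENTTARGET).
        if ($name === '' || in_array($type, ['submit', 'button', 'image', 'reset'], true)) {
            continue;
        }
        // Unchecked checkboxes and radio buttons are not submitted by browsers.
        if (in_array($type, ['checkbox', 'radio'], true) && !$input->hasAttribute('checked')) {
            continue;
        }
        $fields[$name] = $input->getAttribute('value');
    }
    foreach ($doc->getElementsByTagName('select') as $select) {
        $name = $select->getAttribute('name');
        if ($name === '') {
            continue;
        }
        $value = null;
        foreach ($select->getElementsByTagName('option') as $option) {
            $optionValue = $option->hasAttribute('value')
                ? $option->getAttribute('value')
                : $option->textContent;
            if ($value === null) {
                $value = $optionValue; // first option is the default
            }
            if ($option->hasAttribute('selected')) {
                $value = $optionValue; // an explicitly selected option wins
                break;
            }
        }
        $fields[$name] = $value ?? '';
    }
    return $fields;
}

// Step 3 helper: extract [eventTarget, eventArgument] from a __doPostBack link.
function parseDoPostBack(string $href): ?array
{
    $pattern = '/__doPostBack\(\s*\'([^\']*)\'\s*,\s*\'([^\']*)\'\s*\)/';
    return preg_match($pattern, $href, $m) ? [$m[1], $m[2]] : null;
}

// Step 3: overwrite only __EVENTTARGET and __EVENTARGUMENT, then URL-encode
// each name and value exactly once (not the whole body a second time).
function buildPostbackBody(array $fields, string $eventTarget, string $eventArgument = ''): string
{
    $fields['__EVENTTARGET'] = $eventTarget;
    $fields['__EVENTARGUMENT'] = $eventArgument;
    return http_build_query($fields, '', '&');
}

// Step 2: send a classic (non-AJAX) POST and return the response body.
function postBack(string $url, string $body): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body,
    ]);
    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return $response;
}

// Usage sketch (performs real HTTP requests, so it is left commented out):
// $url  = 'http://www.example.com/user_list.aspx';
// $html = file_get_contents($url);
// foreach ($pageNumberCodes as $target) {
//     $html = postBack($url, buildPostbackBody(collectFormFields($html), $target));
//     echo $html;
// }
```

Note that the fields are collected again from each response before the next request, because hidden values such as __VIEWSTATE and __EVENTVALIDATION can change from page to page, and that each request uses a different event target rather than always the first one.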