Scraping data from all ASP.NET pages with AJAX pagination implemented

Subodh Ghulaxe · Feb 8, 2013 · Viewed 12.1k times

I want to scrape a web page that contains a list of users with their addresses, emails, etc. The list is paginated: each page shows 10 users, and clicking the page 2 link loads the second page of users via AJAX and updates the list, and so on for every pagination link.

The website is built with ASP.NET (the pages have the .aspx extension). I don't know anything about ASP.NET or how it handles pagination and AJAX.

I am using Simple HTML DOM (http://sourceforge.net/projects/simplehtmldom/) to scrape the content.

For lists with 10 or fewer users I don't have to simulate the AJAX request that is sent when a user clicks a pagination link.

But for lists with pagination, I simulate the AJAX POST request to get the data from the other pages:

require 'simple_html_dom.php';

$url  = 'http://www.example.com/user_list.aspx'; // the scheme is required; $url is reused by curl_init() below
$html = file_get_html($url);

$viewstate = $html->find("#__VIEWSTATE");
$viewstate = $viewstate[0]->attr['value'];

$eventvalidation        = $html->find("#__EVENTVALIDATION");
$eventvalidation        = $eventvalidation[0]->attr['value'];
$number_of_pageinations = 3;

$pageNumberCodes = array(
    'ctl00$cphMainContent$rdpMembers$ctl01$ctl01',
    'ctl00$cphMainContent$rdpMembers$ctl01$ctl02',
    'ctl00$cphMainContent$rdpMembers$ctl01$ctl03'
); // each code is sent as __EVENTTARGET in the POST for the corresponding page

for ($i = 0; $i < $number_of_pageinations; $i++) {
    $options = array(
        CURLOPT_RETURNTRANSFER => true, // return web page
        CURLOPT_HEADER => false, // don't return headers
        CURLOPT_ENCODING => "", // handle all encodings
        CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7", // who am i
        CURLOPT_AUTOREFERER => true, // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
        CURLOPT_TIMEOUT => 1120, // timeout on response
        CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
        CURLOPT_POST => true,
        CURLOPT_VERBOSE => true,
        // http_build_query() URL-encodes each name and value once; calling urlencode()
        // on the whole string would also encode the "&" and "=" separators.
        CURLOPT_POSTFIELDS => http_build_query(array(
            'ctl00$scriptManager' => 'ctl00$cphMainContent$ctl00$cphMainContent$rdpMembersPanel|' . $pageNumberCodes[$i],
            '__EVENTTARGET' => $pageNumberCodes[$i], // pager link for this iteration
            '__EVENTARGUMENT' => '',
            '__VIEWSTATE' => $viewstate,
            '__EVENTVALIDATION' => $eventvalidation,
            'google' => '',
            'ctl00$cphMainContent$txtZip' => '',
            'ctl00$cphMainContent$cboRadius' => 'Exact',
            'ctl00$cphMainContent$txtMemberName' => '',
            'ctl00$cphMainContent$txtCity' => 'Honolulu',
            'ctl00$cphMainContent$cboState' => 'HI',
            'ctl00$cphMainContent$txtAddress' => '',
            'ctl00_cphMainContent_rdpMembers_ClientState' => '',
            'ctl00$cphMainContent$ddList' => '-Select field to sort-',
            'ctl00_cphMainContent_ddList_ClientState' => '',
            'ctl00_cphMainContent_rdlMembers_ClientState' => '',
            'ctl00_cphMainContent_rdpMembers1_ClientState' => '',
            '__ASYNCPOST' => 'true',
            'RadAJAXControlID' => 'ctl00_cphMainContent_RadAjaxManager1',
        ))
    );
    $ch      = curl_init($url);
    curl_setopt_array($ch, $options);
    $return = curl_exec($ch);
    curl_close($ch);
    echo $return;

    $newHtml = str_get_html($return);

    $viewstate = $newHtml->find("#__VIEWSTATE");
    $viewstate = $viewstate[0]->attr['value'];

    $eventvalidation = $newHtml->find("#__EVENTVALIDATION");
    $eventvalidation = $eventvalidation[0]->attr['value'];
}

This should echo the data from the different pages, but it always prints the data of the first page. Can anybody point out where I am wrong and what is missing? I don't know how ASP.NET manages pagination and AJAX requests, or what __EVENTARGUMENT, __VIEWSTATE and __EVENTVALIDATION are.

Answer

Knaģis · Feb 19, 2013

In general, to make an ASP.NET web site think that you actually pressed a button (in more general terms, performed a postback), you need to do the following:

  1. Get the value of every single INPUT and SELECT element on the page. It might not be required in every scenario, but you should always at least get the values of all hidden fields where the name starts with "__" (such as __VIEWSTATE). You don't really need to know what is written in them - just that the value in them has to be sent back to the server unchanged.

  2. Create a POST request to the server. You need to use the classic POST, avoiding any AJAX requests. Using some browser plugins (in Firefox or Chrome) it might be possible to disable XMLHttpRequest so you can then intercept the non-AJAX request with tools like Fiddler.

  3. Add every value from #1 to that POST request. There are only two values you need to overwrite: __EVENTTARGET and __EVENTARGUMENT. Leave them empty unless the link or button that you are trying to imitate has a handler like <a href="javascript:__doPostBack('ctl00$login','')">. If it does, parse the values from this link: the first one is the event target (it will usually match the ID of some element on the page), the second is the event argument.

  4. If you executed the request correctly, you should get back a full HTML page. If you get a partial response, make sure you did not send the HTTP header that asks for an async result.
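
The steps above can be sketched in PHP roughly as follows. This is a minimal sketch rather than a drop-in solution: the helper names collectFormFields, parseDoPostBack and postBack are made up for illustration, it uses PHP's built-in DOMDocument instead of Simple HTML DOM to stay self-contained, and it assumes PHP 7.1+ with the dom and curl extensions enabled.

```php
<?php
// Hypothetical helpers illustrating steps 1-4; none of these names come from the original post.

/** Step 1: collect name => value for every INPUT and SELECT a browser would submit. */
function collectFormFields(string $html): array
{
    if ($html === '') {
        return array();
    }
    libxml_use_internal_errors(true); // real-world pages are rarely valid markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath  = new DOMXPath($doc);
    $fields = array();

    foreach ($xpath->query('//input[@name]') as $input) {
        $type = strtolower($input->getAttribute('type'));
        // Buttons are only submitted when clicked; the postback is driven by __EVENTTARGET instead.
        if (in_array($type, array('submit', 'button', 'image', 'reset', 'file'), true)) {
            continue;
        }
        // Unchecked checkboxes and radio buttons are not submitted at all.
        if (in_array($type, array('checkbox', 'radio'), true) && !$input->hasAttribute('checked')) {
            continue;
        }
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }

    foreach ($xpath->query('//select[@name]') as $select) {
        $options = $xpath->query('.//option[@selected]', $select);
        if ($options->length === 0) {
            $options = $xpath->query('.//option', $select); // fall back to the first option
        }
        if ($options->length > 0) {
            $option = $options->item(0);
            $fields[$select->getAttribute('name')] = $option->hasAttribute('value')
                ? $option->getAttribute('value')
                : trim($option->textContent);
        }
    }
    return $fields;
}

/** Step 3: extract the event target and argument from a __doPostBack('target','argument') call. */
function parseDoPostBack(string $handler): ?array
{
    if (preg_match('/__doPostBack\(\s*\'([^\']*)\'\s*,\s*\'([^\']*)\'\s*\)/', $handler, $m)) {
        return array('target' => $m[1], 'argument' => $m[2]);
    }
    return null; // not a postback link
}

/** Steps 2 and 4: send a classic (non-AJAX) form POST and return the response body. */
function postBack(string $url, array $fields, string $target, string $argument = ''): string
{
    $fields['__EVENTTARGET']   = $target;
    $fields['__EVENTARGUMENT'] = $argument;
    unset($fields['__ASYNCPOST']); // never ask for a partial (async) response

    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // encodes each name and value once
    ));
    $body = curl_exec($ch);
    return $body === false ? '' : $body;
}
```

With helpers like these, paging would look roughly like this: fetch the list page, call collectFormFields() on it, then call postBack() with the pager's event target (for example one of the ctl00$cphMainContent$rdpMembers$ctl01$... values from the question). Re-run collectFormFields() on every response so that the next request carries the fresh __VIEWSTATE and __EVENTVALIDATION values.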