Nodejs webpage scraping with authentication cookie

mspl · Oct 3, 2015

Lately I've been trying to scrape information from a website (kicktipp) using Node.js, the request module, and cheerio. Since the site requires authentication to view most of its pages, I tried to log in via a POST request and then check whether the user is logged in with the following code (I replaced the credentials with dummy data, but I use real data in my actual script):

var request = require('request');
var cheerio = require('cheerio');

// One cookie jar shared by every request; follow redirects after the login POST
var jar = request.jar();
request = request.defaults({
  jar: jar,
  followAllRedirects: true
});

request.post({
    url: 'http://www.kicktipp.de/info/profil/loginaction',
    headers: { 'content-type': 'application/x-www-form-urlencoded' },
    method: 'post',
    jar: jar,
    body: '[email protected]&passwort=1234567890&_charset_=UTF-8&submitbutton=Anmelden'
}, function(err, res, body){
  if(err) {
    return console.error(err);
  };

  request.get({
    url: 'http://www.kicktipp.de/',
    method: 'get',
    jar: jar
  }, function(err, res, body) {
    if(err) {
      return console.error(err);
    };

    var $ = cheerio.load(body);
    var text = $('.dropdownbox > li > a').text();
    console.log(text);
    var error = $('#kicktipp-content > div.messagebox.errors > p').text();
    console.log(error);
    var cookies = jar.getCookies('http://www.kicktipp.de/');
    console.log(cookies);
  });
});

The parameters sent by the HTML form (as inspected in the browser) look like this:

[email protected]&passwort=1234567890&_charset_=UTF-8&submitbutton=Anmelden

With that script, my cookie jar looks like this:

[ Cookie="JSESSIONID=F650D7F5CD6AF4F6B0944B2190EE2D29.kt213; Path=/; hostOnly=true; aAge=1ms; cAge=179ms" ]

The JSESSIONID is saved successfully, but the user is not logged in: console.log(text) prints Login, whereas it should print Logout if the user were signed in properly.

After inspecting the login request in the browser, I noticed that the browser receives a new cookie every time a page on this domain is requested, via Set-Cookie in the response header, like this:

Set-Cookie: login=bS5zcGxpZXRob2V2ZXJAZ21haWwuY29tOjE0NzU0MDA3MjAxMjA6Mzg1NTI4OGY3ODgzN2FkMzllNTA0NWNkY2ZjMjBjZGM; Domain=.kicktipp.de; Expires=Sun, 02-Oct-2016 09:32:00 GMT; Path=/; HttpOnly

However, I'm not able to (or just don't know how to) get this cookie into my request jar and therefore cannot visit the page as a logged-in user.
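To see what the server actually sends, it can help to break a Set-Cookie header into its parts. With the request module, the raw headers of a response are available as the array res.headers['set-cookie']. The following is a minimal sketch for inspection only: parseSetCookie is a hypothetical helper, not part of request or cheerio, and the token value is a dummy rather than a real login cookie. It does not implement the full cookie specification.

```javascript
// Hypothetical helper: split one Set-Cookie header line into name, value and attributes.
// Inspection aid only; not a complete RFC 6265 parser.
function parseSetCookie(header) {
  var parts = header.split(';').map(function (part) {
    return part.trim();
  });

  // The first segment is "name=value"; the value may itself contain "=".
  var first = parts.shift();
  var eq = first.indexOf('=');
  var cookie = {
    name: eq === -1 ? '' : first.slice(0, eq),
    value: eq === -1 ? first : first.slice(eq + 1),
    attributes: {}
  };

  // Remaining segments are attributes (Domain, Path, Expires) or bare flags (HttpOnly).
  parts.forEach(function (part) {
    if (!part) { return; }
    var i = part.indexOf('=');
    var key = (i === -1 ? part : part.slice(0, i)).toLowerCase();
    cookie.attributes[key] = i === -1 ? true : part.slice(i + 1);
  });

  return cookie;
}

// Dummy token standing in for the real value of the login cookie.
var parsed = parseSetCookie(
  'login=abc123; Domain=.kicktipp.de; Expires=Sun, 02-Oct-2016 09:32:00 GMT; Path=/; HttpOnly'
);
console.log(parsed.name, parsed.attributes.domain, parsed.attributes.httponly);
```

Logging the parsed result for each entry of res.headers['set-cookie'] shows which cookies the server tried to set on a given response, which can then be compared with the contents of the jar.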

Is there anything I'm missing here to stay logged in (or log in to the page at all)? Thanks in advance.

Answer

mspl · Sep 4, 2016

The problem is that this page seems to need a specific cookie that you get on your first page visit (in this case it seems to be a timezone cookie). To get this cookie, you just need to visit the page (using a GET request) before sending the login (POST) request to the server. In this case it is as easy as wrapping another GET request around the code above:

var loginLink = 'http://www.kicktipp.de/info/profil/login';

// creating a clean jar
var j = request.jar();

request.get({url: loginLink, jar: j}, function(err, httpResponse, html) {
  // place POST request and rest of the code here
});
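
Put together, the whole sequence might look roughly like the sketch below: GET the login page, POST the credentials, then GET a page that depends on the login state, all with the same jar. The function name loginAndFetch and its parameters are assumptions made for illustration. The request module is passed in as an argument rather than required inside the block, and formBody is expected to be the URL-encoded form string from the question. The URLs are the ones used above; whether the site still behaves this way is not verified here.

```javascript
// Sketch: visit the login page first, then log in, then fetch a protected page.
// `request` is passed in, e.g. require('request'); loginAndFetch is a hypothetical name.
function loginAndFetch(request, formBody, callback) {
  var jar = request.jar(); // one clean jar shared by all three requests

  // 1. GET the login page so the server can set its initial cookies.
  request.get({
    url: 'http://www.kicktipp.de/info/profil/login',
    jar: jar
  }, function (err) {
    if (err) { return callback(err); }

    // 2. POST the credentials using the same jar.
    request.post({
      url: 'http://www.kicktipp.de/info/profil/loginaction',
      headers: { 'content-type': 'application/x-www-form-urlencoded' },
      body: formBody,
      jar: jar,
      followAllRedirects: true
    }, function (err) {
      if (err) { return callback(err); }

      // 3. GET a page whose content depends on the login state.
      request.get({
        url: 'http://www.kicktipp.de/',
        jar: jar
      }, function (err, res, body) {
        if (err) { return callback(err); }
        callback(null, body, jar);
      });
    });
  });
}
```

It would be called as loginAndFetch(require('request'), formBody, function (err, body, jar) { ... }), after which body can be handed to cheerio.load as in the question to check whether the menu shows Logout.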