How to perform unauthenticated Instagram web scraping in response to recent private API changes?

ReactingToAngularVues · Apr 12, 2018 · Viewed 16.1k times

Months ago, Instagram began rendering their public API inoperable by removing most features and refusing to accept new applications for most permission scopes. Further changes made this week constrict developer options even more.

Many of us have turned to Instagram's private web API to implement the functionality we previously had. One standout, ping/instagram_private_api, manages to rebuild most of the prior functionality. However, with the publicly announced changes this week, Instagram also made underlying changes to their private API, requiring magic variables, user agents, and MD5 hashing to make web scraping requests possible. This can be seen by following the recent releases on the previously linked git repository, and the exact changes needed to continue fetching data can be seen here.

These changes include:

  • Persisting the User Agent & CSRF token between requests.
  • Making an initial request to https://instagram.com/ to grab an rhx_gis magic key from the response body.
  • Setting the X-Instagram-GIS header, which is formed by magically concatenating the rhx_gis key and query variables before passing them through an MD5 hash.

Anything less than this will result in a 403 error. These changes have been implemented successfully in the above repository; however, my attempt in JS continues to fail. In the code below, I am attempting to fetch the first 9 posts from a user timeline. The query parameters which determine this are:

  • query_hash of 42323d64886122307be10013ad2dcc44 (fetch media from the user's timeline).
  • variables.id of any user ID as a string (the user to fetch media from).
  • variables.first, the number of posts to fetch, as an integer.

Previously, this request could be made without any of the above changes by simply GETting from https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables=%7B%22id%22%3A%225380311726%22%2C%22first%22%3A1%7D, as the URL was unprotected.
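
To illustrate, the variables parameter in that URL is just the URL-encoded JSON of the query variables. A minimal sketch, assuming Node's built-in URL class; buildQueryUrl is a name made up here for illustration:

```javascript
// Hypothetical helper: builds the GraphQL query URL from a query hash and
// a plain object of query variables.
function buildQueryUrl(queryHash, variables) {
    const url = new URL('https://www.instagram.com/graphql/query/');
    url.searchParams.set('query_hash', queryHash);
    // The variables are sent as a JSON string; URLSearchParams percent-encodes it.
    url.searchParams.set('variables', JSON.stringify(variables));
    return url.toString();
}

const exampleUrl = buildQueryUrl('42323d64886122307be10013ad2dcc44', {
    id: '5380311726',
    first: 1
});
console.log(exampleUrl);

// Decoding goes the other way: read the parameter back and parse the JSON.
const decoded = JSON.parse(new URL(exampleUrl).searchParams.get('variables'));
console.log(decoded.id, decoded.first);
```

Note that id is passed as a string and first as an integer, matching the parameter list above.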

However, my attempt at reproducing the functionality implemented so successfully in the above repository is not working, and I only receive 403 responses from Instagram. I'm using superagent as my requests library, in a Node environment.

const crypto = require('crypto');
const superagent = require('superagent');

/*
** Retrieve an arbitrary cookie value by a given key.
*/
const getCookieValueFromKey = function(key, cookies) {
    const cookie = cookies.find(c => c.indexOf(key) !== -1);
    if (!cookie) {
        throw new Error('No key found.');
    }
    return (RegExp(key + '=(.*?);', 'g').exec(cookie))[1];
};

/*
** Calculate the value of the X-Instagram-GIS header by md5 hashing together the rhx_gis variable and the query variables for the request.
*/
const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest("hex");
};

/*
** Begin (the awaits below assume this code runs inside an async function)
*/
const userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5';

// Make an initial request to get the rhx_gis string
const initResponse = await superagent.get('https://www.instagram.com/');
const rhxGis = (RegExp('"rhx_gis":"([a-f0-9]{32})"', 'g')).exec(initResponse.text)[1];

const csrfTokenCookie = getCookieValueFromKey('csrftoken', initResponse.header['set-cookie']);

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 9
});

const signature = generateRequestSignature(rhxGis, queryVariables);

const res = await superagent.get('https://www.instagram.com/graphql/query/')
    .query({
        query_hash: '42323d64886122307be10013ad2dcc44',
        variables: queryVariables
    })
    .set({
        'User-Agent': userAgent,
        'X-Instagram-GIS': signature,
        'Cookie': `rur=FRC;csrftoken=${csrfTokenCookie};ig_pr=1`
    });

What else should I try? What makes my code fail, while the code in the repository above works just fine?

Update (2018-04-17)

For at least the 3rd time in a week, Instagram has again updated their API. The change no longer requires the CSRF Token to form part of the hashed signature.

The question above has been updated to reflect this.

Update (2018-04-14)

Instagram has again updated their private graphql API. As far as anyone can figure out:

  • The User Agent no longer needs to be included in the X-Instagram-GIS MD5 calculation.

The question above has been updated to reflect this.

Answer

Alex · Apr 13, 2018

Values to persist

You aren't persisting the User Agent (a requirement) in the first query to Instagram:

const initResponse = await superagent.get('https://www.instagram.com/');

Should be:

const initResponse = await superagent.get('https://www.instagram.com/')
                     .set('User-Agent', userAgent);

This must be persisted in each request, along with the csrftoken cookie.
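
One way to keep both values around is to pull the csrftoken out of the Set-Cookie headers once and reuse a single headers object for every later request. This is a rough sketch, not tested against Instagram: extractCookie and buildSessionHeaders are hypothetical helpers, and the sample Set-Cookie values are invented.

```javascript
// Hypothetical helper: find a cookie by exact name in an array of
// Set-Cookie header strings and return its value, or null if absent.
function extractCookie(name, setCookieHeaders) {
    for (const header of setCookieHeaders || []) {
        // The first "name=value" pair of a Set-Cookie header is the cookie itself.
        const pair = header.split(';')[0];
        const eq = pair.indexOf('=');
        if (eq !== -1 && pair.slice(0, eq).trim() === name) {
            return pair.slice(eq + 1).trim();
        }
    }
    return null;
}

// Hypothetical helper: headers to send on every request after the first.
function buildSessionHeaders(userAgent, csrfToken) {
    return {
        'User-Agent': userAgent,
        'Cookie': `csrftoken=${csrfToken}`
    };
}

// Invented sample data standing in for initResponse.header['set-cookie'].
const sampleSetCookie = [
    'rur=FRC; Path=/',
    'csrftoken=abc123; expires=Thu, 11-Apr-2019 00:00:00 GMT; Path=/; Secure'
];
const csrfToken = extractCookie('csrftoken', sampleSetCookie);
const sessionHeaders = buildSessionHeaders('Mozilla/5.0 (example)', csrfToken);
console.log(sessionHeaders);
```

With superagent, the resulting object can be passed to .set(sessionHeaders) on each request, alongside the X-Instagram-GIS header. Any other cookies you want to send (such as rur or ig_pr from your code) would need to be appended to the Cookie value.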

X-Instagram-GIS header generation

As your question shows, you must generate the X-Instagram-GIS header from two properties: the rhx_gis value, which is found in the response to your initial request, and the query variables of your next request. These must be MD5 hashed together, as shown in your function above:

const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest("hex");
};
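
Keep in mind that the hash is taken over the exact string, so the variables string you sign presumably has to match, character for character, the one you send. A small sketch of that point, using an invented rhx_gis value and page body; extractRhxGis is a hypothetical helper that adds a null check to the regex from the question:

```javascript
const crypto = require('crypto');

// Hypothetical helper: pull the rhx_gis value out of the page body,
// failing with a clear error instead of a TypeError on a missing match.
function extractRhxGis(body) {
    const match = /"rhx_gis":"([a-f0-9]{32})"/.exec(body);
    if (!match) {
        throw new Error('rhx_gis not found in response body');
    }
    return match[1];
}

// Same calculation as in the question: md5 of "<rhx_gis>:<variables>".
function generateRequestSignature(rhxGis, queryVariables) {
    return crypto.createHash('md5')
        .update(`${rhxGis}:${queryVariables}`, 'utf8')
        .digest('hex');
}

// Invented sample body; a real one comes from the initial request.
const sampleBody = '{"rhx_gis":"0123456789abcdef0123456789abcdef"}';
const rhxGis = extractRhxGis(sampleBody);

const compact = JSON.stringify({ id: '123456789', first: 9 });
const spaced = JSON.stringify({ id: '123456789', first: 9 }, null, 1);

// Same data, different strings, so the two signatures differ.
console.log(generateRequestSignature(rhxGis, compact));
console.log(generateRequestSignature(rhxGis, spaced));
```

In practice this means calling JSON.stringify once and using that same string both for the signature and for the variables query parameter, as your code already does.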