wget for fetching Facebook profile/friend pages

rogerchucker · Jul 25, 2011 · Viewed 15.5k times

I am trying to fetch a Facebook user's profile page using "wget", but I keep getting a non-profile page called "browser.php" which has nothing to do with that particular user. The profile page's URL, as I see it in the browser, has the following format:

http://www.facebook.com/user-name

and that's what I have been using as the argument to the wget command:

wget http://www.facebook.com/user-name

I am also interested in using wget to fetch a user's friends' list but even that is giving me the same unhelpful result ("browser.php"):

wget "http://www.facebook.com/user-name?sk=friends&v=friends"

Could someone kindly advise me what I'm doing wrong here? In other words, am I missing some key options for the wget command, or is wget not suited to such a scenario at all?

Any help will be greatly appreciated.

To add context to this query, I need to figure out how to fetch these pages from Facebook using wget as it would then help me write a script/program to look up friends' profile URLs from the HTML source code and then look up some other keywords on them, etc. I am basically hoping that this would help me in doing some kind of selective-crawling (with Facebook's permission of course) of people I am not connected to.

Answer

Soren · Jul 25, 2011

First, Facebook has probably set things up so that certain user agents (e.g. wget) cannot crawl its pages. Such user agents are redirected to a different page, which probably says something like "your browser is not supported". They do that to prevent people from doing exactly what you are doing. However, you can tell wget to identify itself as a different agent using the -U argument (read the wget man page), e.g. wget -U Mozilla http://....
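As a rough sketch of the -U suggestion, the script below wraps wget in a small helper. The helper name fetch_page, the DRY_RUN switch, the User-Agent string and the output file name are all assumptions made for illustration; only -U (user agent) and -O (output file) are actual wget options. The URL is quoted because an unquoted "&" would be treated by the shell as the background operator and cut the URL short. Whether the server then returns the real page is not guaranteed.

```shell
#!/bin/sh
# Hypothetical values: this User-Agent string and the output file name are
# illustrative assumptions, not values taken from the question.
UA='Mozilla/5.0'
URL='http://www.facebook.com/user-name?sk=friends&v=friends'

# Default to a dry run so that nothing is fetched unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}

# fetch_page OUTPUT_FILE URL
# Prints the wget command in dry-run mode, otherwise runs it.
fetch_page() {
    if [ "$DRY_RUN" = "1" ]; then
        printf "wget -U '%s' -O '%s' '%s'\n" "$UA" "$1" "$2"
    else
        # -U sets the User-Agent header; -O names the output file.
        # "$2" stays quoted so '&' in the URL is passed to wget literally.
        wget -U "$UA" -O "$1" "$2"
    fi
}

fetch_page friends.html "$URL"
```

Running the script with DRY_RUN=0 in the environment performs the actual request instead of printing the command.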

Second, Facebook's privacy settings rarely allow you to read much, if any, information unless you are logged in as a user, and probably only as a user who is a friend of the profile you are trying to scrape.

Thirdly, there is a Facebook API which you need to use to crawl and extract information from Facebook. You are likely in violation of the Acceptable Use policy if you try to obtain information in any other way.
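For illustration only, and assuming the API meant here is Facebook's Graph API, a request for a friends list could be built roughly like this. The helper name graph_url and the YOUR_ACCESS_TOKEN placeholder are hypothetical; a real token has to be obtained through Facebook's own authorization flow, and which data the endpoint returns depends on the permissions granted. The script only prints the wget command rather than running it.

```shell
#!/bin/sh
# Assumption: the Graph API base URL and the me/friends path are used as an
# example; check Facebook's current documentation before relying on them.
GRAPH_BASE='https://graph.facebook.com'

# graph_url PATH TOKEN
# Builds a request URL with the access token as a query parameter.
graph_url() {
    printf '%s/%s?access_token=%s\n' "$GRAPH_BASE" "$1" "$2"
}

# YOUR_ACCESS_TOKEN is a placeholder, so the command is printed, not run.
printf "wget -O friends.json '%s'\n" "$(graph_url me/friends YOUR_ACCESS_TOKEN)"
```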