How to programmatically extract information from a web page using the Linux command line?

ysap · Feb 27, 2013

I need to extract the exchange rate of USD to another currency (say, EUR) for a long list of historical dates.

The www.xe.com website provides a historical lookup tool, and with a suitably constructed URL one can get the rate table for a specific date without populating the Date: and From: boxes. For example, the URL http://www.xe.com/currencytables/?from=USD&date=2012-10-15 gives the table of conversion rates from USD to other currencies for Oct. 15th, 2012.

Now, assuming I have a list of dates, I can loop through it, changing the date part of that URL to get the required page for each date. If I can fetch the rate table, then a simple grep EUR will give me the relevant line, and awk can extract the rate itself.
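Concretely, the loop I have in mind is something like the sketch below (dates.txt is a hypothetical file with one YYYY-MM-DD date per line; the awk field number is a placeholder for wherever the rate actually sits in the table; and wget is used for concreteness, though as described below it does not actually get through):

while read -r d; do
    # Fetch the rate table for this date, keep the EUR line,
    # and print the rate field ($2 is a guess at the column).
    wget -qO- "http://www.xe.com/currencytables/?from=USD&date=${d}" \
        | grep EUR \
        | awk '{print $2}'
done < dates.txt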

The question is: how can I get the page(s) using a Linux command-line tool? I tried wget, but it did not do the job.

If not from the CLI, is there an easy and straightforward way to do this programmatically (i.e., one that takes less time than copy-pasting the dates into the browser's address bar)?
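For comparison, curl (the other common command-line fetcher, assuming it is installed) would make the same single-page request as follows; note the quotes, which keep the shell from treating the & as a background operator:

$ curl -s 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'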


UPDATE 1:

When running:

$ wget 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'

I get a file which contains:

<HTML>
<HEAD><TITLE>Autoextraction Prohibited</TITLE></HEAD>
<BODY>
Automated extraction of our content is prohibited.  See <A HREF="http://www.xe.com/errors/noautoextract.htm">http://www.xe.com/errors/noautoextract.htm</A>.
</BODY>
</HTML>

so it seems the server can identify the type of query and blocks wget. Is there any way around this?
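(If the block is keyed on the request headers, then wget's standard --user-agent option would let me send a browser-like identification string instead of the default, e.g.

$ wget -O- --user-agent='Mozilla/5.0' 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'

though whether that is an acceptable thing to do is another matter; see UPDATE 2 below.)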


UPDATE 2:

After reading the response from the wget command and the comments/answers, I checked the ToS of the website and found this clause:

You agree that you shall not:
...
f. use any automatic or manual process to collect, harvest, gather, or extract
   information about other visitors to or users of the Services, or otherwise
   systematically extract data or data fields, including without limitation any
   financial and/or currency data or e-mail addresses;

which, I guess, concludes the efforts on this front.


Now, out of curiosity: if wget simply generates an HTTP request, how does the server know that it came from a command-line tool and not from a browser?
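One way to check what wget actually sends is its --debug flag, which prints the outgoing request headers. Presumably the giveaway is the User-Agent header, which wget sets by default to something like "Wget/1.14 (linux-gnu)" rather than a browser string:

$ wget --debug -O /dev/null 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'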

Answer

Red Cricket · Feb 27, 2013

You need to use -O- to write to STDOUT. Note also that the URL must be quoted; otherwise the shell treats the & as a background operator and drops the date parameter:

wget -O- 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'

But it looks like xe.com does not want you to do automated downloads. I would suggest not doing automated downloads from xe.com.