Searching the Web with the Google search engine is a de facto standard for Internet users. Google provides a basic and an advanced form for preparing a query string for its search engine. If you would rather not use the web form, you can simply send an HTTP GET request to the search URL, with a query string built from the search conditions.
For instance, I can search for results containing the word "hello" by sending an HTTP request to:
http://www.google.com/search?q=hello
I can add another word, e.g. "world", as follows:
http://www.google.com/search?q=hello+world
The search can be made more "complicated" by specifying additional parameters like:
How can I modify the query string to account for the above search parameters?
I carefully examined the answers by Pratik Chowdhury and Robbie Vercammen. They provide links to Web documents that list the textual filters that can be used within the Google search form. Although this is interesting, they don't answer the question. Hence, I studied the problem at length and found the following solution.
Suppose that you need to make a one-off HTTP call (e.g. from a PHP class run via CRON once a month) to Google Search in order to retrieve the search results for a particular query, e.g. all the pages with some words (i.e. "hello" and "world") in your website (i.e. mywebsite.com). Then you can send an HTTP GET request to the following address:
http://www.google.com/search?q=hello+world+site:mywebsite.com
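As a minimal sketch of how such an address could be assembled programmatically (in Python rather than the PHP mentioned above; the helper name `build_search_url` is hypothetical and not part of any Google API), one could rely on the standard library to do the encoding:

```python
from urllib.parse import urlencode

def build_search_url(terms, site=None):
    """Hypothetical helper: build a Google search URL from a list of words."""
    query = " ".join(terms)
    if site:
        # Restrict the results to one website with the site: operator.
        query += f" site:{site}"
    # urlencode turns spaces into '+'; safe=":" keeps the colon of "site:" readable.
    return "http://www.google.com/search?" + urlencode({"q": query}, safe=":")

print(build_search_url(["hello", "world"], site="mywebsite.com"))
# http://www.google.com/search?q=hello+world+site:mywebsite.com
```

The sketch only builds the URL. Actually fetching and parsing the result page is a separate concern.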
The q parameter can contain the whole search query; however, Google also defines a foolproof list of dedicated parameters.
Notice that the AND operator can be represented by the as_q parameter instead.
To get pages containing either "hello" or "world" (i.e. an OR), change the q parameter to:
q=hello+OR+world
while a more compact representation uses the as_oq parameter:
as_oq=hello+world
If one looks for the exact phrase "hello world", the q parameter is:
q="hello+world"
while, again, a more compact representation uses the as_epq parameter:
as_epq=hello+world
If one looks for all the results that do not contain the words "hello" and "world", the q parameter is:
q=-hello+-world
while, again, a more compact representation uses the as_eq parameter:
as_eq=hello+world
Of course, as_q, as_oq, as_epq, as_eq, etc. can be combined in a single search query as usual (i.e. by using the & character). Thus, for instance, I can search for both words "hello" and "world" plus either "programming" or "code" as follows:
q=hello+world&as_oq=programming+code
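The mapping between the four as_* parameters and the search conditions can be sketched as follows. This is only an illustration: the function `advanced_query` and its argument names are hypothetical, and it uses as_q for the AND part, which the text above describes as equivalent to putting the words in q.

```python
from urllib.parse import urlencode

def advanced_query(all_words=(), any_words=(), exact_phrase="", none_words=()):
    """Hypothetical helper: map search conditions onto the as_* parameters."""
    params = {
        "as_q": " ".join(all_words),    # all of these words (AND)
        "as_oq": " ".join(any_words),   # at least one of these words (OR)
        "as_epq": exact_phrase,         # this exact phrase
        "as_eq": " ".join(none_words),  # none of these words
    }
    # Leave out the conditions that were not requested.
    return urlencode({name: value for name, value in params.items() if value})

print(advanced_query(all_words=["hello", "world"], any_words=["programming", "code"]))
# as_q=hello+world&as_oq=programming+code
```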
One can restrict the search to a specific domain (here, mydomain.com) as follows:
as_sitesearch=mydomain.com
However, if you want to exclude a specific domain (e.g., because it is a spam source), you must fall back on the standard operator notation. E.g.:
q=hello+-site:mydomain.com
returns all the pages with the word "hello" that are not on the site mydomain.com.
To search for a specific file type, e.g. a PDF, you can use as_filetype:
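Since the exclusion has to be written inside q itself, a small sketch (with a hypothetical helper name, `exclude_sites_query`) could append one -site: term per unwanted domain before encoding:

```python
from urllib.parse import urlencode

def exclude_sites_query(words, excluded_sites):
    """Hypothetical helper: build a q parameter that filters out some domains."""
    # Each excluded domain becomes a "-site:domain" term next to the search words.
    terms = list(words) + [f"-site:{site}" for site in excluded_sites]
    # safe=":" keeps the colon readable; spaces are encoded as '+'.
    return urlencode({"q": " ".join(terms)}, safe=":")

print(exclude_sites_query(["hello"], ["mydomain.com"]))
# q=hello+-site:mydomain.com
```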
as_filetype=pdf
More complex search parameters can be used, as described in the Google support docs.
For instance, to also get results with a synonym of a word, simply use the ~ operator in front of the word, e.g.
q=~hello
Moreover, if you want to use wildcards, e.g. to get all the exact phrases that start with "hello" and end with "world", you should use the * operator:
q="hello+*+world"
which will probably return something like "hello to the world" and "hello sweet world".
One can also search for specific words inside the page title or in the page URL by using the following keywords (read here for more details):
For instance, the following returns all the pages such that both words "hello" and "world" are in the URL:
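When such a query is built by code rather than typed into a browser, the quotes and the asterisk are normally percent-encoded. As a hedged illustration with Python's standard library (the variable names are just for the example), the round trip looks like this:

```python
from urllib.parse import urlencode, parse_qs

# The raw query as a person would type it, quotes and wildcard included.
raw_query = '"hello * world"'

# urlencode percent-encodes the quotes (%22) and the asterisk (%2A).
encoded = urlencode({"q": raw_query})
print(encoded)
# q=%22hello+%2A+world%22

# Decoding the query string gives back the original text.
decoded = parse_qs(encoded)["q"][0]
print(decoded)
# "hello * world"
```

Both the readable form and the percent-encoded form describe the same query string. The encoded one is simply the safer choice when the request is sent from a script.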
q=allinurl:hello+world
To set the language of the Google GUI page (not that of the results), pass the language code (e.g. en for English, fr for French, it for Italian, etc.) in the hl parameter. In other words, if one searches with the English version of Google, the query string becomes as follows:
http://www.google.com/search?hl=en&q=hello+world+site:mywebsite.com
To restrict the results to a specific language, e.g. Italian, use the lr query parameter:
lr=lang_it
One can also select pages published in a specific geographical region by using the cr parameter. E.g., to find all the pages published in Italy:
cr=countryIT
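Putting the interface language, result language and region parameters together, a complete request URL could be sketched like this (the values are the examples used above; the variable names are only illustrative):

```python
from urllib.parse import urlencode

params = {
    "hl": "en",          # language of the Google interface
    "q": "hello world",  # the search words
    "lr": "lang_it",     # only results written in Italian
    "cr": "countryIT",   # only pages published in Italy
}

# Dictionary order is preserved, so the parameters appear as listed above.
url = "http://www.google.com/search?" + urlencode(params)
print(url)
# http://www.google.com/search?hl=en&q=hello+world&lr=lang_it&cr=countryIT
```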