I am building a small plagiarism-detection system in PHP for practice. I did some research on Google and figured that I could use the Google Custom Search API to build plagiarism-detection software.
I found this question very helpful: [How would you code an anti plagiarism site?]
I have managed to obtain search results from the Google API using the following code:
<?php
ini_set('max_execution_time', 300);
require_once '../../src/Google_Client.php';
require_once '../../src/contrib/Google_CustomsearchService.php';
session_start();

$client = new Google_Client();
$client->setApplicationName('Google CustomSearch PHP Starter Application');
$client->setDeveloperKey('MY_DEVELOPER_KEY');
$search = new Google_CustomsearchService($client);

$to_search = "This is the text that should be searched in Google so that the result I obtain can be used by my code to perform plagiarism analysis";
$result = $search->cse->listCse($to_search, array('cx' => 'MY_SEARCH_ENGINE_ID'));

// Print each returned item once, rather than dumping the whole
// response on every iteration.
for ($i = 0; $i < count($result['items']); $i++) {
    print "<pre>" . print_r($result['items'][$i], true) . "</pre>";
}
?>
From the $result variable I have the [link], [snippet] and [htmlSnippet] fields obtained from the Google search, using the code below:
$result['items'][$i]['snippet'];
$result['items'][$i]['link'];
Here $i is the integer index from the loop.
The problem is that, as you know, I can only send a short keyword or a few lines to Google for searching, not a huge text. So should I substr() the big chunk of text into small lines and then run multiple queries, or should I do something else? The snippet and link values I obtain can then be analysed for plagiarism. Doing this, however, resulted in a huge number of queries, which overflowed the limit of one hundred queries per day.
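To make the chunking idea concrete, here is a rough sketch of what I mean: split the text into sentences and send only the longest few as queries, so a big document does not burn through the free quota. The function name and the limit of five are made up.

```php
<?php
// Sketch: split a long text into sentence-sized queries and keep only
// the longest few, since longer sentences are more distinctive and the
// free tier allows only about 100 queries per day.
function buildQueries($text, $maxQueries = 5)
{
    // Split on sentence-ending punctuation followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // Longest (most distinctive) sentences first.
    usort($sentences, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    return array_slice($sentences, 0, $maxQueries);
}
```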
Please suggest the proper way of doing this. Is the way I am querying Google and then analysing the results against the user's text the correct approach to plagiarism detection?
The way I would do it would be to Google the page title, looking for exact matches. The chances are that if someone stole your content, they used the same title.
From here you can then pull the page with the possibly stolen content and compare.
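Once you have pulled the candidate page, a crude first-pass comparison can be done with PHP's built-in similar_text(). A real checker would compare sentence by sentence; this sketch (the function name is made up) just returns an overall percentage after stripping tags.

```php
<?php
// Sketch: rough similarity score between your text and a fetched page.
// strip_tags() removes the HTML; similar_text() fills $percent with a
// similarity percentage based on common substrings.
function similarityPercent($original, $candidate)
{
    $a = strtolower(strip_tags($original));
    $b = strtolower(strip_tags($candidate));
    similar_text($a, $b, $percent);
    return round($percent, 1);
}
```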
A more sophisticated method would be to search your own content for statistically unlikely words and phrases: words with a lower than average modern usage rate. Then Google for content that contains all of the least likely words. However, this is going to be a lot harder than the first approach, as you will need to build a large database of low-search-result words and excessively used words in Google.
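That database aside, the ranking step itself is simple. Given some word-frequency map built from a corpus beforehand (the tiny map in the test below is purely illustrative), you would pick out the rarest words in your content, treating words missing from the map as rarest of all:

```php
<?php
// Sketch: rank the words in a text by corpus frequency and return the
// rarest ones. Words absent from the frequency map sort first.
function rarestWords($text, array $frequency, $limit = 5)
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    $words = array_values(array_unique($matches[0]));

    usort($words, function ($a, $b) use ($frequency) {
        $fa = isset($frequency[$a]) ? $frequency[$a] : 0;
        $fb = isset($frequency[$b]) ? $frequency[$b] : 0;
        return $fa - $fb;
    });

    return array_slice($words, 0, $limit);
}
```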
A third technique is to search your content for misspelt words, then have your script Google the misspellings and look for matches.
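A quick way to find those misspellings is to diff your words against a dictionary word list. The array in the test below is a stand-in; a real script might load /usr/share/dict/words or use the pspell extension instead.

```php
<?php
// Sketch: flag words absent from a dictionary as likely misspellings
// worth Googling for. $dictionary stands in for a real word list.
function likelyMisspellings($text, array $dictionary)
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    $known = array_flip($dictionary);

    $suspects = array();
    foreach (array_unique($matches[0]) as $word) {
        if (!isset($known[$word])) {
            $suspects[] = $word;
        }
    }
    return $suspects;
}
```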
A fourth, which is preventive only and works best at stopping automated scrapers, is to have your system invent a made-up word: a string of letters and numbers that is unlikely to have any search results at all. Then have the script watch for new search results.
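Generating such a tracer word is a one-liner in PHP; the prefix and length below are arbitrary choices.

```php
<?php
// Sketch: invent a made-up token to hide in your pages. If it ever
// turns up in Google's index on another site, the page was scraped.
function makeTracerToken($bytes = 8)
{
    // 8 random bytes -> 16 hex characters: effectively unique on the web.
    return 'trc' . bin2hex(random_bytes($bytes));
}
```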
A combination of the above would probably make a really brilliant script, and one that I would urge you to release as open source.
Best of luck with your project.