The URLs in the GIZA++ 'readme' file are not valid (http://www.fjoch.com/mkcls.html and http://www.fjoch.com/GIZA++.html). Is there a good tutorial about GIZA++? Or are there any alternatives that have complete documentation?
The following is excerpted from a tutorial I'm putting together for a class. (NB: This assumes you have successfully installed GIZA++-v2 on a *nix system.)
Sample 1 - train.en
I gave him the book .
He read the book .
He loved the book .
Sample 2 - train.fr
Je lui ai donne/ le livre .
Il a lu le livre .
Il aimait le livre .
Use plain2snt.out to get target and source vocabulary files (*.vcb) as well as a sentence pair file (*.snt). From the GIZA++ directory, run:
./plain2snt.out TEXT1 TEXT2
where TEXT1 and TEXT2 are the data files described in step 1 (such as train.en and train.fr above).
This produces four files in the same directory as TEXT1 and TEXT2 (assuming they are in the same directory): TEXT1.vcb, TEXT2.vcb, TEXT1_TEXT2.snt, and TEXT2_TEXT1.snt.
The vocab files contain a unique (integer) ID for each word in the text (NB: not tokenized/lemmatized), the word/string, and the number of times that string occurred. These are separated by a single space character.
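To make the vocab format concrete, here is a minimal Python sketch that parses the contents of a *.vcb file into a lookup table. The function name parse_vcb, the VocabEntry type, and the sample contents (including the specific ID values) are my own illustrations, not actual plain2snt.out output; only the "ID word count" layout comes from the description above.

```python
from typing import Dict, NamedTuple


class VocabEntry(NamedTuple):
    word: str
    count: int


def parse_vcb(text: str) -> Dict[int, VocabEntry]:
    """Parse *.vcb contents into {word_id: VocabEntry(word, count)}."""
    vocab: Dict[int, VocabEntry] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Each line is: <integer ID> <word> <occurrence count>
        word_id, word, count = line.split()
        vocab[int(word_id)] = VocabEntry(word, int(count))
    return vocab


# Hypothetical contents for the English sample; real IDs may differ.
sample_vcb = """\
2 I 1
3 gave 1
4 him 1
5 the 3
6 book 3
7 . 3
8 He 2
9 read 1
10 loved 1
"""

vocab = parse_vcb(sample_vcb)
print(vocab[5])  # the entry for "the", which occurs three times
```

Note that "He" and "." get their own entries because the text is used as-is, with no lowercasing or lemmatization.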
The sentence files contain numbers. For each sentence pair, there are three lines: the first is a count of the number of times that sentence pair occurs in the corpus, and the second and third are each a string of (space-separated) numbers corresponding to the entries for words in the vocab files. Based on the naming convention for *.snt files, the first file is assumed to be the source, and the second is assumed to be the target language. For example, in the file TEXT1_TEXT2.snt, the first line will be a count of the number of times the first sentence pair occurred in the corpus, the second line will be a string of numbers corresponding to words in the TEXT1.vcb file, and the third line will be a string of numbers corresponding to words in the TEXT2.vcb file.
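The three-line layout can be read back with a short Python sketch like the one below. The names read_snt and decode are hypothetical helpers, and the ID-to-word tables and *.snt contents are invented for the three sample sentences; the real IDs assigned by plain2snt.out may differ.

```python
from typing import Dict, Iterator, List, Tuple


def read_snt(text: str) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (count, source_ids, target_ids) for each sentence pair."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per sentence pair")
    for i in range(0, len(lines), 3):
        count = int(lines[i])  # how often this pair occurs
        source_ids = [int(tok) for tok in lines[i + 1].split()]
        target_ids = [int(tok) for tok in lines[i + 2].split()]
        yield count, source_ids, target_ids


def decode(ids: List[int], id_to_word: Dict[int, str]) -> str:
    """Map word IDs back to the original space-separated sentence."""
    return " ".join(id_to_word[i] for i in ids)


# Hypothetical ID tables standing in for TEXT1.vcb and TEXT2.vcb.
en = {2: "I", 3: "gave", 4: "him", 5: "the", 6: "book", 7: ".",
      8: "He", 9: "read", 10: "loved"}
fr = {2: "Je", 3: "lui", 4: "ai", 5: "donne/", 6: "le", 7: "livre",
      8: ".", 9: "Il", 10: "a", 11: "lu", 12: "aimait"}

# Hypothetical TEXT1_TEXT2.snt contents for the three sample pairs.
sample_snt = """\
1
2 3 4 5 6 7
2 3 4 5 6 7 8
1
8 9 5 6 7
9 10 11 6 7 8
1
8 10 5 6 7
9 12 6 7 8
"""

pairs = list(read_snt(sample_snt))
for count, src, tgt in pairs:
    print(count, "|", decode(src, en), "|", decode(tgt, fr))
```

Decoding the pairs this way is a quick sanity check that the source and target lines are in the order you expect before handing the file to GIZA++.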
TEXT1.vcb, TEXT2.vcb, and either of the two *.snt files can be used as input to GIZA++ to produce an alignment. For example:
./GIZA++ -s TEXT1.vcb -t TEXT2.vcb -c TEXT1_TEXT2.snt
But note that when I tried to run this, I had to rename TEXT1_TEXT2.snt to something without an underscore in the name in order to get any proper output.
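If you script this step, the rename workaround can be folded in. The sketch below is only an illustration: underscore_free_copy and giza_command are hypothetical helper names, the underscore issue is just what I observed on my setup, and the script builds the command line without running it. The -s, -t, and -c flags are the ones shown in the command above.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def underscore_free_copy(snt_path: Path) -> Path:
    """Copy snt_path to a sibling file whose name has no underscores."""
    new_path = snt_path.with_name(snt_path.name.replace("_", ""))
    if new_path != snt_path:
        shutil.copyfile(snt_path, new_path)
    return new_path


def giza_command(source_vcb: Path, target_vcb: Path, snt: Path) -> List[str]:
    """Build the GIZA++ argument list (meant to run from the GIZA++ directory)."""
    return ["./GIZA++", "-s", str(source_vcb), "-t", str(target_vcb), "-c", str(snt)]


if __name__ == "__main__":
    # Demo on a throwaway directory with a placeholder *.snt file.
    with tempfile.TemporaryDirectory() as tmp:
        snt = Path(tmp) / "TEXT1_TEXT2.snt"
        snt.write_text("1\n2 3\n2 3\n")
        safe_snt = underscore_free_copy(snt)
        cmd = giza_command(Path(tmp) / "TEXT1.vcb", Path(tmp) / "TEXT2.vcb", safe_snt)
        print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run once the paths point at real files. Copying instead of renaming keeps the original plain2snt.out output intact.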