Counting the number of occurrences of words in a textfile

c algorithm nlp counting

vinc456 · Dec 25, 2008 · Viewed 16.3k times · Source

How could I go about keeping track of the number of times a word appears in a textfile? I would like to do this for every word.

For example, if the input is something like:

"the man said hi to the boy."

Each of "man said hi to boy" would have an occurrence of 1.

"the" would have an occurence of 2.

I was thinking of keeping a dictionary with word/occurrence pairs but I'm not sure how to implement this in C. A link to any similar or related problems with a solution would be great.

EDIT: To avoid rolling out my own hash table I decided to learn how to use glib. Along the way I found an excellent tutorial which walks through a similar problem. http://bo.majewski.name/bluear/gnu/GLib/ch03s03.html

I am awestruck by the number of different approaches, and in particular the simplicity and elegance of the Ruby implementation.

Answer

Yes, a dictionary with word-occurence pairs would work fine, and the usual way to implement such a dictionary would be to use a hash table (or, sometimes, a binary search tree).

You could also use a trie (or its compressed version, "Patricia trie"/Radix trie) whose complexity is asymptotically optimal for this problem, although I suspect that in practice it might be slower than a (good) hash table implementation.

[I really think whether hash tables or tries are better depends on the distribution of words in your input -- e.g. a hash table will need to store each word in its hash bucket (to guard against collisions), while if you have a lot of words with common prefixes, in a trie those common prefixes are shared and need to be stored only once each, but there is still the overhead of all the pointers... if you do happen to try both, I'm curious to know how they compare.]

Counting the number of occurrences of words in a textfile

Answer

Related questions