Text segmentation: dictionary-based word splitting

Question 1

Text segmentation: dictionary-based word splitting

java nlp data-dictionary text-segmentation

Dave Jarvis · Jan 2, 2011 · Viewed 7.3k times · Source

Answer

Answer

Peter Norvig has written some stuff in python.

http://norvig.com/ngrams/ngrams.py

contains a function called segment. It run a Naive Bayes probability of a sequence of words. works well. Can be a good basis for what your trying to accomplish in java.

If you get it converted to java, i'd be interested in seeing the implementation.

Thanks.

Mike

Question 2

Background

Split database column names into equivalent English text to seed a data dictionary. The English dictionary is created from a corpus of corporate documents, wikis, and email. The dictionary (lexicon.csv) is a CSV file with words and probabilities. Thus, the more often someone writes the word "therapist" (in email or on a wiki page) the higher the chance of "therapistname" splits to "therapist name" as opposed to something else. (The lexicon probably won't even include the word rapist.)

Source Code

TextSegmenter.java @ http://pastebin.com/taXyE03L
SortableValueMap.java @ http://pastebin.com/v3hRXYan

Data Files

lexicon.csv - http://pastebin.com/0crECtXY
columns.txt - http://pastebin.com/EtN9Qesr

Problem (updated 2011-01-03)

When the following problem is encountered:

dependentrelationship::end depend ent dependent relationship
end=0.86
ent=0.001
dependent=0.8
relationship=0.9

These possible solutions exist:

dependentrelationship::dependent relationship
dependentrelationship::dep end ent relationship
dependentrelationship::depend ent relationship

The lexicon contains words with their relative probabilities (based on word frequency): dependent 0.8, end 0.86, relationship 0.9, depend 0.3, and ent 0.001.

Eliminate the solution of dep end ent relationship because dep is not in the lexicon (i.e., 75% word usage), whereas the other two solutions cover 100% of words in the lexicon. Of the remaining solutions, the probability of dependent relationship is 0.72 whereas depend ent relationship is 0.00027. We can therefore select dependent relationship as the correct solution.

Question

Given:

// The concatenated phrase or database column (e.g., dependentrelationship).
String concat;

// All words (String) in the lexicon within concat, in left-to-right order; and
// the ranked probability of those words (Double). (E.g., {end, 0.97}
// {dependent, 0.86}, {relationship, 0.95}.)
Map.Entry<String, Double> word;

How would you implement a routine that generates the most likely solution based on lexicon coverage and probabilities? For example:

for( Map.Entry<String, Double> word : words ) {
  result.append( word.getKey() ).append( ' ' );

  // What goes here?

  System.out.printf( "%s=%f\n", word.getKey(), word.getValue() );
}

Thank you!

Text segmentation: dictionary-based word splitting

Background

Source Code

Data Files

Problem (updated 2011-01-03)

Related

Question

Answer

Related questions