I was recently working on a data set that used abbreviations for various words. For example,
wtrbtl = water bottle
bwlingbl = bowling ball
bsktball = basketball
There did not seem to be any consistent convention; for example, vowels were sometimes kept and sometimes dropped. I am trying to build a mapping like the one above between abbreviations and their corresponding words, without a complete corpus or a comprehensive list of terms (i.e. abbreviations could appear that are not explicitly known). For simplicity's sake, say it is restricted to things you would find in a gym, but it could be anything.
Basically, looking only at the left-hand side of the examples, what kind of model could do the same processing our brain does and relate each abbreviation to its corresponding full-text label?
My ideas so far stop at taking the first and last letters and looking those up in a dictionary, then assigning prior probabilities based on context. But since there is a large number of morphemes and no marker that indicates the end of a word, I don't see how it's possible to split them.
UPDATED:
I also had the idea of combining a couple of string-metric algorithms, such as the Match Rating Approach, to determine a set of related terms, and then calculating the Levenshtein distance between each word in that set and the target abbreviation. However, I am still in the dark when it comes to abbreviations of words that are not in a master dictionary. Basically, it comes down to inferring word construction; maybe a Naive Bayes model could help, but I am concerned that any loss of precision caused by the algorithms above would invalidate any model training process.
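To make that idea concrete, here is a rough sketch of what I mean, using the jellyfish library (the tiny gym vocabulary is just a placeholder, not real data):

import jellyfish

# Placeholder vocabulary; in practice this would be whatever master
# dictionary or domain word list happens to be available.
GYM_TERMS = ['water', 'bottle', 'bowling', 'ball', 'basket', 'basketball', 'dumbbell']

def candidate_words(abbrev, vocabulary):
    """Shortlist words the Match Rating Approach flags as similar, then rank
    the shortlist by Levenshtein distance to the abbreviation."""
    related = [w for w in vocabulary
               if jellyfish.match_rating_comparison(abbrev, w)]
    return sorted(related, key=lambda w: jellyfish.levenshtein_distance(abbrev, w))

# candidate_words('bsktball', GYM_TERMS) should put 'basketball' near the top,
# but it can say nothing about words that are missing from GYM_TERMS.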
Any help is appreciated, as I am really stuck on this one.
If you cannot find an exhaustive dictionary, you could build (or download) a probabilistic language model to generate and evaluate candidate sentences for you. It could be a character-level n-gram model or a neural network.
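As a minimal sketch of the first option (generic illustration code, not the notebook mentioned below), a character-level n-gram model with add-one smoothing could look like this:

import math
from collections import defaultdict

class CharNGramModel:
    """Character-level n-gram language model with add-one (Laplace) smoothing.
    score() returns a negative log-probability, so lower means more likely."""

    def __init__(self, order=3, alphabet_size=27):  # 26 letters + space, an assumption
        self.order = order
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, texts):
        for text in texts:
            padded = ' ' * (self.order - 1) + text + ' '
            for i in range(len(padded) - self.order + 1):
                context = padded[i:i + self.order - 1]
                self.counts[context][padded[i + self.order - 1]] += 1

    def score(self, text):
        padded = ' ' * (self.order - 1) + text + ' '
        total = 0.0
        for i in range(len(padded) - self.order + 1):
            context = padded[i:i + self.order - 1]
            char = padded[i + self.order - 1]
            seen = sum(self.counts[context].values())
            total += -math.log((self.counts[context][char] + 1) / (seen + self.alphabet_size))
        return total

# lm = CharNGramModel(order=3)
# lm.fit(open('sports_corpus.txt').read().lower().split('\n'))  # hypothetical corpus file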
For your abbreviations, you can build a "noise model" that predicts the probability of character omissions. It can learn from a corpus (which you have to label manually or semi-manually) that consonants are omitted less frequently than vowels.
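A deliberately simplified, deletion-only version of such a noise model might look like the following; the per-character drop probabilities are invented for illustration and would in practice be estimated from the labelled corpus:

import math

# Assumed drop probabilities: vowels are dropped much more often than consonants.
P_DROP = {c: 0.5 for c in 'aeiou'}
P_DROP.update({c: 0.05 for c in 'bcdfghjklmnpqrstvwxyz'})

def omission_log_prob(full_word, abbreviation):
    """Negative log-probability that the abbreviation arises from the full word
    by character deletions only (a greedy, single-alignment simplification)."""
    total, j = 0.0, 0
    for ch in full_word:
        if j < len(abbreviation) and abbreviation[j] == ch:
            total += -math.log(1 - P_DROP.get(ch, 0.1))  # character kept
            j += 1
        else:
            total += -math.log(P_DROP.get(ch, 0.1))      # character dropped
    return total if j == len(abbreviation) else float('inf')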
With a complex language model and a simple noise model, you can combine them using the noisy-channel approach (see e.g. the article by Jurafsky for more details) to suggest candidate sentences.
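In that framing, the best expansion is the candidate c that minimizes -log P(c) - log P(abbreviation | c). A toy ranker that reuses the two sketches above and scores a hand-supplied candidate list (the notebook described below generates candidates itself, e.g. via beam search) could be:

def rank_expansions(abbreviation, candidates, language_model):
    """Score each candidate phrase as language-model cost plus noise-model cost;
    lower totals are better."""
    scored = {}
    for cand in candidates:
        cost = (language_model.score(cand)
                + omission_log_prob(cand.replace(' ', ''), abbreviation))
        if cost != float('inf'):  # skip candidates the deletion-only noise model rules out
            scored[cand] = round(cost, 1)
    return dict(sorted(scored.items(), key=lambda kv: kv[1]))

# e.g. rank_expansions('bsktball', ['basket ball', 'bowling ball', 'water bottle'], lm)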
Update. I got enthusiastic about this problem and implemented this algorithm:
My solution is implemented in this Python notebook. With trained models, it has an interface like noisy_channel('bsktball', language_model, error_model), which, by the way, returns {'basket ball': 33.5, 'basket bally': 36.0}. The dictionary values are scores of the suggestions (the lower, the better).
With other examples it works worse: for 'wtrbtl' it returns
{'water but all': 23.7,
'water but ill': 24.5,
'water but lay': 24.8,
'water but let': 26.0,
'water but lie': 25.9,
'water but look': 26.6}
For 'bwlingbl' it gives
{'bwling belia': 32.3,
'bwling bell': 33.6,
'bwling below': 32.1,
'bwling belt': 32.5,
'bwling black': 31.4,
'bwling bling': 32.9,
'bwling blow': 32.7,
'bwling blue': 30.7}
However, when trained on a more appropriate corpus (e.g. sports magazines and blogs, maybe with oversampling of nouns), and perhaps with a more generous beam-search width, this model would provide more relevant suggestions.