I'm trying to develop a python algorithm to check if a string could be an abbrevation for another word. For example
fck
is a match for fc kopenhavn
because it matches the first characters of the word. fhk
would not match.fco
should not match fc kopenhavn
because no one irl would abbrevate FC Kopenhavn as FCO.irl
is a match for in real life
.ifk
is a match for ifk goteborg
.aik
is a match for allmanna idrottskluben
.aid
is a match for allmanna idrottsklubben
. This is not a real team name abbrevation, but I guess it is hard to exclude it unless you apply domain specific knowledge on how Swedish abbrevations are formed.manu
is a match for manchester united
.It is hard to describe the exact rules of the algorithm, but I hope my examples show what I'm after.
Update I made a mistake in showing the strings with the matching letters uppercased. In the real scenario, all letters are lowercase so it is not as easy as just checking which letters are uppercased.
This passes all the tests, including a few extra I created. It uses recursion. Here are the rules that I used:
The rest of the abbreviation (the abbrev minus the first letter) must be an abbreviation for:
tests=(
('fck','fc kopenhavn',True),
('fco','fc kopenhavn',False),
('irl','in real life',True),
('irnl','in real life',False),
('ifk','ifk gotebork',True),
('ifko','ifk gotebork',False),
('aik','allmanna idrottskluben',True),
('aid','allmanna idrottskluben',True),
('manu','manchester united',True),
('fz','faz zoo',True),
('fzz','faz zoo',True),
('fzzz','faz zoo',False),
)
def is_abbrev(abbrev, text):
abbrev=abbrev.lower()
text=text.lower()
words=text.split()
if not abbrev:
return True
if abbrev and not text:
return False
if abbrev[0]!=text[0]:
return False
else:
return (is_abbrev(abbrev[1:],' '.join(words[1:])) or
any(is_abbrev(abbrev[1:],text[i+1:])
for i in range(len(words[0]))))
for abbrev,text,answer in tests:
result=is_abbrev(abbrev,text)
print(abbrev,text,result,answer)
assert result==answer