I'm using Python 3.5.2
I have two lists: a list of about 750,000 "sentences" (long strings) and a list of about 20,000 "words" that I would like to delete from those sentences.
So, I have to loop through the 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.
I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter:
compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
Then I loop through my "sentences"
import re

for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put sentence into a growing list
This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.
Is there a way to use the str.replace
method (which I believe is faster), but still require that replacements only happen at word boundaries?
Alternatively, is there a way to speed up the re.sub
method? I have already improved the speed marginally by skipping re.sub
when the length of my word is greater than the length of my sentence, but it's not much of an improvement.
Thank you for any suggestions.
Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.
If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.
If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.
If you save all the banned words into a set, it will be very fast to check if another word is included in that set.
Pack the logic into a function, pass that function as the replacement argument to re.sub,
and you're done!
import re

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)

def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)
Converted sentences are:
' . !
.
GiraffeElephantBoat
sfgsdg sdwerha aswertwe
Note that:

- the search is case-insensitive (thanks to lower())
- replacing a word with "" might leave two spaces (as in your code); see the cleanup sketch below
- with Python 3, \w+ also matches accented characters (e.g. "ångström").
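If those leftover double spaces matter, one possible follow-up pass (not part of the original solution, just a sketch) is to collapse them afterwards:

```python
import re

def squeeze_spaces(sentence):
    # Collapse runs of spaces left behind by the deleted words.
    return re.sub(r' {2,}', ' ', sentence).strip()

print(squeeze_spaces("' .  !"))   # "' . !"
```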
There are a million sentences, banned_words has almost 100,000 words and the script runs in less than 7 s.
In comparison, Liteye's answer needed 160s for 10 thousand sentences.
With n being the total amount of words and m the amount of banned words, the OP's and Liteye's code run in O(n*m).
In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).
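To illustrate why the per-word set lookup counts as O(1) here, a hypothetical micro-benchmark (the names banned_list/banned_set and the sizes are made up for the example) might look like this; list membership scans linearly, set membership is a single hash probe:

```python
import timeit

banned_list = ['word%d' % i for i in range(100000)]
banned_set = set(banned_list)

# 'zzz' is not banned, so the list scan has to walk all 100,000 entries,
# while the set lookup returns after one hash probe.
print(timeit.timeit("'zzz' in banned_list", globals=globals(), number=100))
print(timeit.timeit("'zzz' in banned_set", globals=globals(), number=100))
```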
What's the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?
It's pretty hard to grasp the way the regex engine works, so let's write a simple test.
This code extracts 10**exp random English words into a list. It creates the corresponding regex union and tests it with different words:
import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))
It outputs:
First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']
Union of 10 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 0.7ms
Almost a word : 0.7ms
Union of 100 words
Surely not a word : 0.7ms
First word : 1.1ms
Last word : 1.2ms
Almost a word : 1.2ms
Union of 1000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 9.6ms
Almost a word : 10.1ms
Union of 10000 words
Surely not a word : 1.4ms
First word : 1.8ms
Last word : 96.3ms
Almost a word : 116.6ms
Union of 100000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 1227.1ms
Almost a word : 1404.1ms
So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b'
pattern has:
- O(1) best case
- O(n/2) average case, which is still O(n)
- O(n) worst case

These results are consistent with a simple loop search.
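For comparison, the "simple loop search" referred to here would be something along these lines (a rough illustration, not code from the benchmark); it too has to walk past every earlier word before it can match the last one:

```python
def loop_search(word, words):
    # Linear scan: the cost grows with the position of the match,
    # which is the same O(n) behaviour the union timings show above.
    for candidate in words:
        if candidate == word:
            return True
    return False
```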
A much faster alternative to a regex union is to create the regex pattern from a trie.
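The trie-based pattern is described in detail in the linked answer; the following is only a rough sketch of the idea, not the implementation from that answer. Build a dict-based trie of the banned words, then serialize it into a pattern that factors out shared prefixes, so the engine never has to try thousands of alternatives character by character:

```python
import re

def trie_regex(words):
    # Build a plain dict-based trie; '' marks the end of a word.
    trie = {}
    for word in words:
        node = trie
        for char in word:
            node = node.setdefault(char, {})
        node[''] = True

    def serialize(node):
        end_of_word = '' in node
        branches = [re.escape(char) + serialize(child)
                    for char, child in sorted(node.items()) if char != '']
        if not branches:                      # leaf: only the end-of-word marker
            return ''
        if len(branches) == 1 and not end_of_word:
            return branches[0]
        joined = '(?:%s)' % '|'.join(branches)
        return joined + '?' if end_of_word else joined

    return re.compile(r'\b' + serialize(trie) + r'\b', re.IGNORECASE)

pattern = trie_regex(['baz', 'foo', 'foobar', 'fox'])
print(pattern.pattern)
# something like: \b(?:baz|fo(?:o(?:bar)?|x))\b
print(pattern.sub('', 'The fox ate foobar, not bazooka.'))
# The  ate , not bazooka.
```

Because shared prefixes are merged, the engine decides after a few characters which branch can still match, instead of restarting the comparison for each of the 20,000 alternatives.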