I need to filter rows in a pandas
dataframe so that a specific string column contains at least one of a list of provided substrings. The substrings may have unusual / regex characters. The comparison should not involve regex and is case insensitive.
For example:
lst = ['kdSj;af-!?', 'aBC+dsfa?\-', 'sdKaJg|dksaf-*']
I currently apply the mask like this:
mask = np.logical_or.reduce([df[col].str.contains(i, regex=False, case=False) for i in lst])
df = df[mask]
My dataframe is large (~1mio rows) and lst
has length 100. Is there a more efficient way? For example, if the first item in lst
is found, we should not have to test any subsequent strings for that row.
If you're sticking to using pure-pandas, for both performance and practicality I think you should use regex for this task. However, you will need to properly escape any special characters in the substrings first to ensure that they are matched literally (and not used as regex meta characters).
This is easy to do using re.escape
:
>>> import re
>>> esc_lst = [re.escape(s) for s in lst]
These escaped substrings can then be joined using a regex pipe |
. Each of the substrings can be checked against a string until one matches (or they have all been tested).
>>> pattern = '|'.join(esc_lst)
The masking stage then becomes a single low-level loop through the rows:
df[col].str.contains(pattern, case=False)
Here's a simple setup to get a sense of performance:
from random import randint, seed
seed(321)
# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]
# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]
col = pd.Series(strings)
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)
The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):
%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop
The method in the question took approximately 5 seconds using the same input data.
It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches than the timing will improve.