I have the following pandas DataFrame:
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
order start end value
1 1342 1357 category1
1 1459 1489 category7
1 1572 1601 category23
1 1587 1599 category2
1 1591 1639 category1
....
15 792 813 category13
15 892 913 category5
....
So, there is an order column whose values each span many rows, and then a range/interval from start to end for each row. Each row is then labeled with a certain value (e.g. category1, category2, etc.)
Now I have another dataframe called key_df. It is basically the exact same format:
import pandas as pd
key_df = pd.DataFrame(...)
print(key_df)
order start end value
1 1284 1299 category4
1 1297 1309 category9
1 1312 1369 category3
1 1345 1392 category29
1 1371 1383 category31
....
1 1471 1501 category31
...
My goal is to take the key_df dataframe and check whether its start:end intervals match any of the rows in the original dataframe df. If one does, that row in df should be labeled with the matching key_df row's value. In our example above, the dataframe df would end up like this:
order start end value key_value
1 1342 1357 category1 category29
1 1459 1489 category7 category31
....
This is because, if you look at key_df, the row
1 1345 1392 category29
with interval 1::1345-1392 overlaps the interval 1::1342-1357 in the original df. Likewise, the key_df row:
1 1471 1501 category31
corresponds to the second row in df:
1 1459 1489 category7 category31
I'm not entirely sure
(1) how to accomplish this task in pandas
(2) how to scale this efficiently in pandas
One could begin with an if statement, e.g.
if df.order == key_df.order:
# now check intervals...somehow
but this doesn't take advantage of the dataframe structure. One then must check by interval, i.e. something like (df.start <= key_df.start) & (df.end >= key_df.end)
I'm stuck. What is the most efficient way to match multiple columns against an "interval" in pandas? (Creating a new column if this condition is met is then straightforward.)
You can use merge with boolean indexing, but if the DataFrames are large, scaling is problematic:
df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('','_key'))
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
print (df1)
order start end value start_key end_key value_key
3 1 1342 1357 category1 1345.0 1392.0 category29
4 1 1342 1357 category1 1371.0 1383.0 category31
5 1 1342 1357 category1 1471.0 1501.0 category31
11 1 1459 1489 category7 1471.0 1501.0 category31
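Run end to end, a minimal self-contained version of this looks as follows. The sample rows shown in the question are hard-coded here (the rows hidden behind the ellipses are omitted, since they are not shown):

```python
import pandas as pd

# Sample rows from the question (only the rows that are actually shown)
df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 15, 15],
    'start': [1342, 1459, 1572, 1587, 1591, 792, 892],
    'end':   [1357, 1489, 1601, 1599, 1639, 813, 913],
    'value': ['category1', 'category7', 'category23', 'category2',
              'category1', 'category13', 'category5'],
})
key_df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 1],
    'start': [1284, 1297, 1312, 1345, 1371, 1471],
    'end':   [1299, 1309, 1369, 1392, 1383, 1501],
    'value': ['category4', 'category9', 'category3', 'category29',
              'category31', 'category31'],
})

# Cross-match rows within each order, then keep only pairs where the
# key interval starts and ends at or after the df interval
df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('', '_key'))
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
print(df1)
```

Note that the merge on order alone builds the full per-order cross product before the boolean filter is applied, which is where the scaling problem comes from.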
EDIT by comment:
df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('','_key'))
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
df1 = pd.merge(df, df1, on=['order','start','end', 'value'], how='left')
print (df1)
order start end value start_key end_key value_key
0 1 1342 1357 category1 1345.0 1392.0 category29
1 1 1342 1357 category1 1371.0 1383.0 category31
2 1 1342 1357 category1 1471.0 1501.0 category31
3 1 1459 1489 category7 1471.0 1501.0 category31
4 1 1572 1601 category23 NaN NaN NaN
5 1 1587 1599 category2 NaN NaN NaN
6 1 1591 1639 category1 NaN NaN NaN
7 15 792 813 category13 NaN NaN NaN
8 15 892 913 category5 NaN NaN NaN
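The edited version, again as a self-contained script with the question's sample rows hard-coded, keeps every row of df and fills NaN where no key interval matched:

```python
import pandas as pd

# Sample rows from the question (only the rows that are actually shown)
df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 15, 15],
    'start': [1342, 1459, 1572, 1587, 1591, 792, 892],
    'end':   [1357, 1489, 1601, 1599, 1639, 813, 913],
    'value': ['category1', 'category7', 'category23', 'category2',
              'category1', 'category13', 'category5'],
})
key_df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 1],
    'start': [1284, 1297, 1312, 1345, 1371, 1471],
    'end':   [1299, 1309, 1369, 1392, 1383, 1501],
    'value': ['category4', 'category9', 'category3', 'category29',
              'category31', 'category31'],
})

# Cross-match within each order, keep pairs where the key interval
# starts and ends at or after the df interval
df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('', '_key'))
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]

# Left-merge back onto df so unmatched rows survive with NaN keys
df1 = pd.merge(df, df1, on=['order', 'start', 'end', 'value'], how='left')
print(df1)
```

The final left merge is what restores the five df rows that had no matching key interval; they appear with NaN in the three _key columns, exactly as in the printed output above.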