I'm fairly new to Python and data science. I'm working on the Kaggle Outbrain competition, and all datasets referenced in my code can be found at https://www.kaggle.com/c/outbrain-click-prediction/data.
On to the problem: I have a dataframe with columns ['document_id', 'category_id', 'confidence_level']. I would like to add a fourth column, 'max_cat', that returns the 'category_id' value that corresponds to the greatest 'confidence_level' value for the row's 'document_id'.
import pandas as pd
import numpy
main_folder = r'...filepath\data_location' + '\\'
docs_meta = pd.read_csv(main_folder + r'documents_meta.csv\documents_meta.csv', nrows=1000)
docs_categories = pd.read_csv(main_folder + r'documents_categories.csv\documents_categories.csv', nrows=1000)
docs_entities = pd.read_csv(main_folder + r'documents_entities.csv\documents_entities.csv', nrows=1000)
docs_topics = pd.read_csv(main_folder + r'documents_topics.csv\documents_topics.csv', nrows=1000)
def find_max(row,the_df,groupby_col,value_col,target_col):
    return the_df[the_df[groupby_col]==row[groupby_col]].loc[the_df[value_col].idxmax()][target_col]
test = docs_categories.copy()
test['max_cat'] = test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'))
This gives me the error: KeyError: ('document_id', 'occurred at index document_id')
Can anyone help explain either why this error occurred, or how to achieve my goal in a more efficient manner?
Thanks!
As EdChum answered in the comments, the issue is that apply works column-wise by default (axis=0; see the docs). Each call therefore receives a whole column as a Series indexed by the row labels, so row['document_id'] looks for a row labelled 'document_id' and raises the KeyError.
To apply the function to each row instead, pass axis=1:
test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'), axis=1)
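On the "more efficient manner" part of the question: a row-wise apply re-filters the whole frame for every row, which gets slow on the full dataset. Note also that find_max takes idxmax over the entire frame rather than over the filtered group, so it can still raise a KeyError whenever the globally highest-confidence row belongs to a different document. Below is a minimal sketch of a groupby-based alternative. The toy frame and the helper names best_rows and best_cat are made up for illustration and are not part of the original code.

```python
import pandas as pd

# Toy stand-in for documents_categories.csv (values invented for illustration)
test = pd.DataFrame({
    'document_id':      [1, 1, 2, 2, 2],
    'category_id':      [10, 11, 20, 21, 22],
    'confidence_level': [0.2, 0.9, 0.5, 0.1, 0.7],
})

# Row label of the highest confidence_level within each document_id
best_rows = test.groupby('document_id')['confidence_level'].idxmax()

# category_id found at those rows, re-indexed by document_id
best_cat = test.loc[best_rows].set_index('document_id')['category_id']

# Broadcast the per-document winner back onto every row of that document
test['max_cat'] = test['document_id'].map(best_cat)

print(test)
```

This computes the maximum once per document instead of once per row. If two categories tie on confidence_level, idxmax keeps the first one it encounters.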