Pandas Apply Key Error

Question 1

python pandas group-by keyerror kaggle

user133248 · Oct 10, 2016 · Viewed 34.6k times

Answer

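As EdChum answered in the comments, the issue is that apply works column-wise by default (axis=0; see the docs). The function is therefore called once per column, and each call receives a Series indexed by the row labels rather than by the column names. Looking up 'document_id' on that Series raises the KeyError.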
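The difference can be checked on a small frame. This is a minimal sketch; the two-row DataFrame `df` is made-up illustration data, not part of the Outbrain dataset.

```python
import pandas as pd

# Hypothetical toy frame, only for illustrating what apply passes in
df = pd.DataFrame({"document_id": [1, 2], "confidence_level": [0.3, 0.8]})

# Default (axis=0): x is one column, indexed by the row labels 0 and 1,
# so the column name is not a valid key on x
by_column = df.apply(lambda x: "document_id" in x.index)

# axis=1: x is one row, indexed by the column names,
# so x['document_id'] would work
by_row = df.apply(lambda x: "document_id" in x.index, axis=1)

print(by_column.tolist())
print(by_row.tolist())
```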

To specify that it should be applied to each row instead, axis=1 must be passed:

test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'), axis=1)
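On the efficiency part of the question, a groupby-based approach is one possible alternative to calling find_max once per row. Note also that find_max as written appears to take idxmax over the whole frame rather than over the filtered group, so it may still raise a KeyError with axis=1 whenever the global maximum belongs to a different document. The sketch below is not from the original answer; it uses made-up toy data in place of documents_categories.csv, and the names `best_rows` and `max_cat_by_doc` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for documents_categories.csv
test = pd.DataFrame({
    "document_id": [1, 1, 2, 2, 2],
    "category_id": [10, 11, 20, 21, 22],
    "confidence_level": [0.2, 0.9, 0.5, 0.1, 0.7],
})

# Row label of the highest confidence_level within each document_id
best_rows = test.groupby("document_id")["confidence_level"].idxmax()

# Lookup table: document_id -> category_id of that best row
max_cat_by_doc = test.loc[best_rows].set_index("document_id")["category_id"]

# Broadcast the per-document result back onto every row
test["max_cat"] = test["document_id"].map(max_cat_by_doc)

print(test)
```

This computes the maximum once per document instead of filtering the whole frame for every row, which should scale better on the full dataset.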

Question 2

I'm fairly new to Python and data science. I'm working on the Kaggle Outbrain competition, and all datasets referenced in my code can be found at https://www.kaggle.com/c/outbrain-click-prediction/data.

On to the problem: I have a dataframe with columns ['document_id', 'category_id', 'confidence_level']. I would like to add a fourth column, 'max_cat', that returns the 'category_id' value that corresponds to the greatest 'confidence_level' value for the row's 'document_id'.

import pandas as pd
import numpy

main_folder = r'...filepath\data_location' + '\\'

docs_meta = pd.read_csv(main_folder + 'documents_meta.csv\documents_meta.csv',nrows=1000)
docs_categories = pd.read_csv(main_folder + 'documents_categories.csv\documents_categories.csv',nrows=1000)
docs_entities = pd.read_csv(main_folder + 'documents_entities.csv\documents_entities.csv',nrows=1000)
docs_topics = pd.read_csv(main_folder + 'documents_topics.csv\documents_topics.csv',nrows=1000)

def find_max(row,the_df,groupby_col,value_col,target_col):
   return the_df[the_df[groupby_col]==row[groupby_col]].loc[the_df[value_col].idxmax()][target_col]

test = docs_categories.copy()
test['max_cat'] = test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'))

This gives me the error: KeyError: ('document_id', 'occurred at index document_id')

Can anyone help explain either why this error occurred, or how to achieve my goal in a more efficient manner?

Thanks!
