I have a dataframe that looks like this:
    total  downloaded  avg_rating
id
1       2           2         5.0
2      12          12         4.5
3       1           1         5.0
4       1           1         4.0
5       0           0         0.0
I'm trying to add a new column with the percent difference of two of these columns, but only for rows that do not have a 0 in 'downloaded'.
I'm trying to use a function for this that looks like:
def diff(ratings):
    if ratings[ratings.downloaded > 0]:
        val = (ratings['total'] - ratings['downloaded']) / ratings['downloaded']
    else:
        val = 0
    return val

ratings['Pct Diff'] = diff(ratings)
I'm getting an error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-129-729c09bf14e8> in <module>()
      6     return val
      7
----> 8 ratings['Pct Diff'] = diff(ratings)

<ipython-input-129-729c09bf14e8> in diff(ratings)
      1 def diff(ratings):
----> 2     if ratings[ratings.downloaded > 0]:
      3         val = (ratings['total'] - ratings['downloaded']) / ratings['downloaded']
      4     else:
      5         val = 0

~\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
    953         raise ValueError("The truth value of a {0} is ambiguous. "
    954                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 955                          .format(self.__class__.__name__))
    956
    957     __bool__ = __nonzero__

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Can someone please help me understand what this error means?
Also, would this be a good application for an apply function? Can I use conditions in an apply? How would I use it in this case?
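The error occurs because your function puts a whole dataframe in an if statement, where a single True or False is expected. Inside diff(), the expression
ratings[ratings.downloaded > 0]
returns a subset of the dataframe (the rows where downloaded is greater than 0), not one boolean per row and not a single boolean. Preceding it with if
is ambiguous, because pandas cannot tell whether you mean "any row matches" or "all rows match". The error message reflects this.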
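To make the ambiguity concrete, here is a minimal sketch that rebuilds the dataframe from the question and triggers the same error on purpose. The variable names mask, subset, any_downloaded and all_downloaded are only illustrative.

```python
import pandas as pd

# Same data as in the question.
ratings = pd.DataFrame(
    {"total": [2, 12, 1, 1, 0],
     "downloaded": [2, 12, 1, 1, 0],
     "avg_rating": [5.0, 4.5, 5.0, 4.0, 0.0]},
    index=pd.Index([1, 2, 3, 4, 5], name="id"),
)

mask = ratings["downloaded"] > 0   # boolean Series, one value per row
subset = ratings[mask]             # DataFrame holding only the matching rows

try:
    if subset:                     # asks a DataFrame for a single truth value
        pass
except ValueError as err:
    message = str(err)             # "The truth value of a DataFrame is ambiguous. ..."

# Reducing the mask to a single boolean removes the ambiguity.
any_downloaded = bool(mask.any())  # True if at least one row matches
all_downloaded = bool(mask.all())  # True only if every row matches
```

Neither .any() nor .all() is what you want for a per-row calculation, though. They only show what the if statement would need in order to work; the row-wise condition is better expressed with a boolean mask.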
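You may wish to review Indexing and Selecting Data in the pandas documentation. The solution below first sets the new column to a default value of 0 for every row, then overwrites it only for the rows where downloaded is greater than 0.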
import pandas as pd

df = pd.DataFrame([[2, 2, 5.0], [12, 12, 4.5], [10, 5, 3.0],
                   [20, 2, 3.5], [3, 0, 0.0], [0, 0, 0.0]],
                  columns=['total', 'downloaded', 'avg_rating'])

# Default value for every row; a float so the column dtype matches the results.
df['Pct Diff'] = 0.0

# Overwrite only the rows where 'downloaded' is positive, dividing by 'downloaded' as in the question.
df.loc[df['downloaded'] > 0, 'Pct Diff'] = (df['total'] - df['downloaded']) / df['downloaded']

#    total  downloaded  avg_rating  Pct Diff
# 0      2           2         5.0       0.0
# 1     12          12         4.5       0.0
# 2     10           5         3.0       1.0
# 3     20           2         3.5       9.0
# 4      3           0         0.0       0.0
# 5      0           0         0.0       0.0
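As for apply: yes, you can use conditions inside a function passed to apply. A minimal sketch, assuming the formula from the question (dividing by downloaded) and a helper I have named pct_diff for illustration:

```python
import pandas as pd

df = pd.DataFrame([[2, 2, 5.0], [12, 12, 4.5], [10, 5, 3.0],
                   [20, 2, 3.5], [3, 0, 0.0], [0, 0, 0.0]],
                  columns=['total', 'downloaded', 'avg_rating'])

def pct_diff(row):
    # `row` is a single row (a Series), so this comparison yields one boolean.
    if row['downloaded'] > 0:
        return (row['total'] - row['downloaded']) / row['downloaded']
    return 0.0

# axis=1 passes one row at a time to pct_diff.
df['Pct Diff'] = df.apply(pct_diff, axis=1)
```

This works because the if statement now sees one value at a time rather than a whole dataframe. The trade-off is that apply with axis=1 calls the Python function once per row, so on large dataframes it is usually slower than a boolean mask with .loc.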