I've been reading through the documentation and many explanations and examples use levels
as something taken for granted. Imho the docs lack a bit on a fundamental explanation of the data structure and definitions.
What are levels in a data frame? What are levels in a MultiIndex
index?
I stumbled across this question while analyzing the answer to my own question, but I didn't find the John's answer satisfying enough. After a few experiments though I think I understood the levels and decided to share:
Short answer:
Levels are parts of the index or column.
Long answer:
I think this multi-column DataFrame.groupby
example illustrates the index levels quite nicely.
Let's say we have the time logged on issues report data:
report = pd.DataFrame([
[1, 10, 'John'],
[1, 20, 'John'],
[1, 30, 'Tom'],
[1, 10, 'Bob'],
[2, 25, 'John'],
[2, 15, 'Bob']], columns = ['IssueKey','TimeSpent','User'])
IssueKey TimeSpent User
0 1 10 John
1 1 20 John
2 1 30 Tom
3 1 10 Bob
4 2 25 John
5 2 15 Bob
The index here has only 1 level (there is only one index value identifying every row). The index is artificial (running number) and consists of values form 0 to 5.
Say we want to merge (sum) all logs created by the same user to the same issue (to get the total time spent on the issue by the user)
time_logged_by_user = report.groupby(['IssueKey', 'User']).TimeSpent.sum()
IssueKey User
1 Bob 10
John 30
Tom 30
2 Bob 15
John 25
Now our data index has 2 levels, as multiple users logged time to the same issue. The levels are IssueKey
and User
. The levels are parts of the index (only together they can identify a row in a DataFrame / Series).
Levels being parts of the index (as a tuple) can be nicely observed in the Spyder Variable explorer:
Having levels gives us opportunity to aggregate values within groups in respect to an index part (level) of our choice. E.g. if we want to assign the max time spent on an issue by any user, we can:
max_time_logged_to_an_issue = time_logged_by_user.groupby(level='IssueKey').transform('max')
IssueKey User
1 Bob 30
John 30
Tom 30
2 Bob 25
John 25
Now the first 3 rows have the value 30
, as they correspond to the issue 1
(User
level was ignored in the code above). The same story for the issue 2
.
This can be useful e.g. if we want to find out which users spent most time on every issue:
issue_owners = time_logged_by_user[time_logged_by_user == max_time_logged_to_an_issue]
IssueKey User
1 John 30
Tom 30
2 John 25