Subsetting Hierarchical Index and Hierarchical column names in Pandas (with and without indices)

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP



Subsetting Hierarchical Index and Hierarchical column names in Pandas (with and without indices)



I am a beginner in Python and Pandas, and it has been 2 days since I opened Wes McKinney's book. So, this question might be a basic one.


Python


Pandas



I am using Anaconda distribution (Python 3.6.6) and Pandas 0.21.0. I researched the following threads (https://pandas.pydata.org/pandas-docs/stable/advanced.html, xs function at https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-xs, Select only one index of multiindex DataFrame, Selecting rows from pandas by subset of multiindex, and https://pandas.pydata.org/pandas-docs/stable/indexing.html) before posting this. All of them explain how to subset data.frame using either hierarchical index or hierarchical column, but not both.


xs


data.frame



Here's the data.


import pandas as pd
import numpy as np
from numpy import nan as NA

#Hierarchical index for row and column
data = pd.DataFrame(np.arange(36).reshape(6,6),
index=[['a']*2+['b']*1+['c']*1+['d']*2,
[1, 2, 3, 1, 3, 1]],
columns = [['Title1']*3+['Title2']*3,
['A']*2+['B']*2+['C']*2])

data.index.names = ['key1','key2']
data.columns.names = ['state','color']



Here are my questions:



Question:1 I'd like to access key1 = a, key2 = 1, state = Title1 (column), and color = A (column).
After a few trial and errors, I found that this version works (I really don't know why this works--my hypothesis is that data.loc['a',1] gives an indexed dataframe, which is then subset...and so on):


key1 = a


key2 = 1


state = Title1


color = A


data.loc['a',1]


dataframe


data.loc['a',1].loc['Title1'].loc['A']



Is there a better way to subset above?



Question:2 How do I subset the data after deleting the indices?


data_wo_index = data.reset_index()



I'm relatively comfortable with data.table in R. So, I thought of using http://datascience-enthusiast.com/R/pandas_datatable.html to subset the data using my data.table knowledge.


data.table


data.table



I tried one step at a time, but even the first step (i.e. subsetting key1 = a gave me an error:


key1 = a


data_wo_index[data_wo_index['key1']=='a']



Exception: cannot handle a non-unique multi-index!



I don't know why Pandas is still thinking that there is multi-index. I have already reset it.



Question:3 If I run data.columns command, I get the following output:


data.columns


MultiIndex(levels=[['Title1', 'Title2'], ['A', 'B', 'C']],
labels=[[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]],
names=['state', 'color'])



It seems to me that column names are also indexes. I am saying this because I see MultiIndex class, which is what I see if I run data.index:


MultiIndex


data.index


MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 1, 2, 3, 3], [0, 1, 2, 0, 2, 0]],
names=['key1', 'key2'])



I am unsure why column names are also on object of MultiIndex class. If they are indeed an object of MultiIndex class, then why do we need to set aside a few columns (e.g. key1 and key2 in our example above) as indices, meaning why can't we just use column-based indices? (As a comparison, in data.table in R, we can setkey to whatever columns we want.)


MultiIndex


MultiIndex


key1


key2


data.table



Question 4 Why are column names an object of MultiIndex class? It will be great if someone can offer a theoretical treatment for this.


MultiIndex



As a beginner, I'd really appreciate your thoughts. I have spent 3-4 hours researching this topic and have hit a dead-end.





Why is this tagged R?
– camille
Aug 6 at 2:02





I'm very sorry about this. I tagged it because I did compare R - data.table with pandas.
– watchtower
Aug 6 at 2:47


R - data.table


pandas




2 Answers
2



First off, MultiIndex's can be tricky to work with, so it's worth considering whether they actually provide enough benefit for what you're actually doing (in terms of speed/organisation) to make those hassles worthwhile.


MultiIndex



To answer your question 1, you can subset a MultiIndexed dataframe by providing tuples of the keys you want for each axis. So you first example subset can be done as:


MultiIndex


# We want to use ":" to get all the states, but can't just
# have ":" by itself due to Python's syntax rules
# So pandas provides the IndexSlice object to wrap it in
slicer = pd.IndexSlice
data.loc[('a', 1), (slicer[:], 'A')]



Which gives:


state color
Title1 A 0
A 1
Name: (a, 1), dtype: int32





Thank you so much for your help. I am using indexes because I read somewhere that join and merge are faster with indexes. While McKinney has also written that one can do join and merge using columns. Could you please offer alternatives to indexing?
– watchtower
Aug 6 at 2:46


join


merge


join


merge





The alternative to indexing is basically to leave everything in plain columns, like R does (when not using data.table). Joins may be slower than the indexed versions, but it can be easier to work with, especially if you're used to R.
– Marius
Aug 6 at 3:00


data.table





Thank you once again. Final q: While I haven't run benchmarks between pandas and data.table, from your experience, will pandas be slower than data.table in R?
– watchtower
Aug 6 at 3:05


pandas


data.table


pandas


data.table



Wow seems like a lot of questions ..



Q1 Multiple index I will recommend IndexSlice


IndexSlice


data.loc[pd.IndexSlice['a',1],pd.IndexSlice['Title1','A']]
Out[410]:
state color
Title1 A 0
A 1



Q2 when you reset the index for this complete data frame it will have some issue , I do not think in R you can do that without ftable


R


ftable



Here is the way doing with pandas


data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')]
Out[434]:
state key1 key2 Title1 Title2
color A A B B C C
0 a 1 0 1 2 3 4 5
1 a 2 6 7 8 9 10 11



Q3 I think the column and index multiple level offer 4 dimension, yes you can using one columns or index to represent all just do stack


column


index


stack


data.stack()
Out[436]:
state Title1 Title2
key1 key2 color
a 1 A 0 3
B 1 4
C 2 5
2 A 6 9
B 7 10
C 8 11
b 3 A 12 15
B 13 16
C 14 17
c 1 A 18 21
B 19 22
C 20 23
d 3 A 24 27
B 25 28
C 26 29
1 A 30 33
B 31 34
C 32 35



Q4 MultiIndex is one of type for index , and pandas treat index and columns to index type


MultiIndex


pandas


index


columns



For example


df.index # index but just different type of index
Out[441]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
df.columns # index as well
Out[442]: Index(['A', 'B'], dtype='object')





Thank you so much for your help. I sincerely appreciate it. Could you please give me some resource how I can understand and create data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')] query that you have posted for Q2?
– watchtower
Aug 6 at 2:43


data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')]





@watchtower you can try with Pandas Cookbook
– Wen
Aug 6 at 3:02







By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard