Subsetting Hierarchical Index and Hierarchical column names in Pandas (with and without indices)

Subsetting Hierarchical Index and Hierarchical column names in Pandas (with and without indices)

I am a beginner in Python and Pandas, and it has been 2 days since I opened Wes McKinney's book. So, this question might be a basic one.

Python

Pandas

I am using Anaconda distribution (Python 3.6.6) and Pandas 0.21.0. I researched the following threads (https://pandas.pydata.org/pandas-docs/stable/advanced.html, xs function at https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-xs, Select only one index of multiindex DataFrame, Selecting rows from pandas by subset of multiindex, and https://pandas.pydata.org/pandas-docs/stable/indexing.html) before posting this. All of them explain how to subset data.frame using either hierarchical index or hierarchical column, but not both.

xs

data.frame

Here's the data.

import pandas as pd import numpy as np from numpy import nan as NA #Hierarchical index for row and column data = pd.DataFrame(np.arange(36).reshape(6,6), index=[['a']*2+['b']*1+['c']*1+['d']*2, [1, 2, 3, 1, 3, 1]], columns = [['Title1']*3+['Title2']*3, ['A']*2+['B']*2+['C']*2]) data.index.names = ['key1','key2'] data.columns.names = ['state','color']

Here are my questions:

Question:1 I'd like to access key1 = a, key2 = 1, state = Title1 (column), and color = A (column).
After a few trial and errors, I found that this version works (I really don't know why this works--my hypothesis is that data.loc['a',1] gives an indexed dataframe, which is then subset...and so on):

key1 = a

key2 = 1

state = Title1

color = A

data.loc['a',1]

dataframe

data.loc['a',1].loc['Title1'].loc['A']

Is there a better way to subset above?

Question:2 How do I subset the data after deleting the indices?

data_wo_index = data.reset_index()

I'm relatively comfortable with data.table in R. So, I thought of using http://datascience-enthusiast.com/R/pandas_datatable.html to subset the data using my data.table knowledge.

data.table

I tried one step at a time, but even the first step (i.e. subsetting key1 = a gave me an error:

key1 = a

data_wo_index[data_wo_index['key1']=='a']

Exception: cannot handle a non-unique multi-index!

I don't know why Pandas is still thinking that there is multi-index. I have already reset it.

Question:3 If I run data.columns command, I get the following output:

data.columns

MultiIndex(levels=[['Title1', 'Title2'], ['A', 'B', 'C']], labels=[[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]], names=['state', 'color'])

It seems to me that column names are also indexes. I am saying this because I see MultiIndex class, which is what I see if I run data.index:

MultiIndex

data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]], labels=[[0, 0, 1, 2, 3, 3], [0, 1, 2, 0, 2, 0]], names=['key1', 'key2'])

I am unsure why column names are also on object of MultiIndex class. If they are indeed an object of MultiIndex class, then why do we need to set aside a few columns (e.g. key1 and key2 in our example above) as indices, meaning why can't we just use column-based indices? (As a comparison, in data.table in R, we can setkey to whatever columns we want.)

MultiIndex

key1

key2

data.table

Question 4 Why are column names an object of MultiIndex class? It will be great if someone can offer a theoretical treatment for this.

MultiIndex

As a beginner, I'd really appreciate your thoughts. I have spent 3-4 hours researching this topic and have hit a dead-end.

Why is this tagged R?
– camille
Aug 6 at 2:02

I'm very sorry about this. I tagged it because I did compare R - data.table with pandas.
– watchtower
Aug 6 at 2:47

R - data.table

pandas

2 Answers
2

First off, MultiIndex's can be tricky to work with, so it's worth considering whether they actually provide enough benefit for what you're actually doing (in terms of speed/organisation) to make those hassles worthwhile.

MultiIndex

To answer your question 1, you can subset a MultiIndexed dataframe by providing tuples of the keys you want for each axis. So you first example subset can be done as:

MultiIndex

# We want to use ":" to get all the states, but can't just # have ":" by itself due to Python's syntax rules # So pandas provides the IndexSlice object to wrap it in slicer = pd.IndexSlice data.loc[('a', 1), (slicer[:], 'A')]

Which gives:

state color Title1 A 0 A 1 Name: (a, 1), dtype: int32

Thank you so much for your help. I am using indexes because I read somewhere that join and merge are faster with indexes. While McKinney has also written that one can do join and merge using columns. Could you please offer alternatives to indexing?
– watchtower
Aug 6 at 2:46

join

merge

join

merge

The alternative to indexing is basically to leave everything in plain columns, like R does (when not using data.table). Joins may be slower than the indexed versions, but it can be easier to work with, especially if you're used to R.
– Marius
Aug 6 at 3:00

data.table

Thank you once again. Final q: While I haven't run benchmarks between pandas and data.table, from your experience, will pandas be slower than data.table in R?
– watchtower
Aug 6 at 3:05

pandas

data.table

pandas

data.table

Wow seems like a lot of questions ..

Q1 Multiple index I will recommend IndexSlice

IndexSlice

data.loc[pd.IndexSlice['a',1],pd.IndexSlice['Title1','A']] Out[410]: state color Title1 A 0 A 1

Q2 when you reset the index for this complete data frame it will have some issue , I do not think in R you can do that without ftable

R

ftable

Here is the way doing with pandas

data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')] Out[434]: state key1 key2 Title1 Title2 color A A B B C C 0 a 1 0 1 2 3 4 5 1 a 2 6 7 8 9 10 11

Q3 I think the column and index multiple level offer 4 dimension, yes you can using one columns or index to represent all just do stack

column

index

stack

data.stack() Out[436]: state Title1 Title2 key1 key2 color a 1 A 0 3 B 1 4 C 2 5 2 A 6 9 B 7 10 C 8 11 b 3 A 12 15 B 13 16 C 14 17 c 1 A 18 21 B 19 22 C 20 23 d 3 A 24 27 B 25 28 C 26 29 1 A 30 33 B 31 34 C 32 35

Q4 MultiIndex is one of type for index , and pandas treat index and columns to index type

MultiIndex

pandas

index

columns

For example

df.index # index but just different type of index Out[441]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64') df.columns # index as well Out[442]: Index(['A', 'B'], dtype='object')

Thank you so much for your help. I sincerely appreciate it. Could you please give me some resource how I can understand and create data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')] query that you have posted for Q2?
– watchtower
Aug 6 at 2:43

data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')]

@watchtower you can try with Pandas Cookbook
– Wen
Aug 6 at 3:02

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

DjAv8yPgwg7SGHs2boIkN9Zolx8HvhjtNsWS OjVWvhdyRkbx82L4T Jbn2nOZ,NQTd0W,kyEeG0iZt0Qoh,g0

搜尋此網誌

Sfyjdyy