Subsetting Hierarchical Index and Hierarchical column names in Pandas (with and without indices)
Clash Royale CLAN TAG#URR8PPP
Subsetting Hierarchical Index and Hierarchical column names in Pandas (with and without indices)
I am a beginner in Python
and Pandas
, and it has been 2 days since I opened Wes McKinney's book. So, this question might be a basic one.
Python
Pandas
I am using Anaconda distribution (Python 3.6.6) and Pandas 0.21.0. I researched the following threads (https://pandas.pydata.org/pandas-docs/stable/advanced.html, xs
function at https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-xs, Select only one index of multiindex DataFrame, Selecting rows from pandas by subset of multiindex, and https://pandas.pydata.org/pandas-docs/stable/indexing.html) before posting this. All of them explain how to subset data.frame
using either hierarchical index or hierarchical column, but not both.
xs
data.frame
Here's the data.
import pandas as pd
import numpy as np
from numpy import nan as NA
#Hierarchical index for row and column
data = pd.DataFrame(np.arange(36).reshape(6,6),
index=[['a']*2+['b']*1+['c']*1+['d']*2,
[1, 2, 3, 1, 3, 1]],
columns = [['Title1']*3+['Title2']*3,
['A']*2+['B']*2+['C']*2])
data.index.names = ['key1','key2']
data.columns.names = ['state','color']
Here are my questions:
Question:1 I'd like to access key1 = a
, key2 = 1
, state = Title1
(column), and color = A
(column).
After a few trial and errors, I found that this version works (I really don't know why this works--my hypothesis is that data.loc['a',1]
gives an indexed dataframe
, which is then subset...and so on):
key1 = a
key2 = 1
state = Title1
color = A
data.loc['a',1]
dataframe
data.loc['a',1].loc['Title1'].loc['A']
Is there a better way to subset above?
Question:2 How do I subset the data after deleting the indices?
data_wo_index = data.reset_index()
I'm relatively comfortable with data.table
in R. So, I thought of using http://datascience-enthusiast.com/R/pandas_datatable.html to subset the data using my data.table
knowledge.
data.table
data.table
I tried one step at a time, but even the first step (i.e. subsetting key1 = a
gave me an error:
key1 = a
data_wo_index[data_wo_index['key1']=='a']
Exception: cannot handle a non-unique multi-index!
I don't know why Pandas is still thinking that there is multi-index. I have already reset it.
Question:3 If I run data.columns
command, I get the following output:
data.columns
MultiIndex(levels=[['Title1', 'Title2'], ['A', 'B', 'C']],
labels=[[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]],
names=['state', 'color'])
It seems to me that column names are also indexes. I am saying this because I see MultiIndex
class, which is what I see if I run data.index
:
MultiIndex
data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 1, 2, 3, 3], [0, 1, 2, 0, 2, 0]],
names=['key1', 'key2'])
I am unsure why column names are also on object of MultiIndex
class. If they are indeed an object of MultiIndex
class, then why do we need to set aside a few columns (e.g. key1
and key2
in our example above) as indices, meaning why can't we just use column-based indices? (As a comparison, in data.table
in R, we can setkey to whatever columns we want.)
MultiIndex
MultiIndex
key1
key2
data.table
Question 4 Why are column names an object of MultiIndex
class? It will be great if someone can offer a theoretical treatment for this.
MultiIndex
As a beginner, I'd really appreciate your thoughts. I have spent 3-4 hours researching this topic and have hit a dead-end.
I'm very sorry about this. I tagged it because I did compare
R - data.table
with pandas
.– watchtower
Aug 6 at 2:47
R - data.table
pandas
2 Answers
2
First off, MultiIndex
's can be tricky to work with, so it's worth considering whether they actually provide enough benefit for what you're actually doing (in terms of speed/organisation) to make those hassles worthwhile.
MultiIndex
To answer your question 1, you can subset a MultiIndex
ed dataframe by providing tuples of the keys you want for each axis. So you first example subset can be done as:
MultiIndex
# We want to use ":" to get all the states, but can't just
# have ":" by itself due to Python's syntax rules
# So pandas provides the IndexSlice object to wrap it in
slicer = pd.IndexSlice
data.loc[('a', 1), (slicer[:], 'A')]
Which gives:
state color
Title1 A 0
A 1
Name: (a, 1), dtype: int32
Thank you so much for your help. I am using indexes because I read somewhere that
join
and merge
are faster with indexes. While McKinney has also written that one can do join
and merge
using columns. Could you please offer alternatives to indexing?– watchtower
Aug 6 at 2:46
join
merge
join
merge
The alternative to indexing is basically to leave everything in plain columns, like R does (when not using
data.table
). Joins may be slower than the indexed versions, but it can be easier to work with, especially if you're used to R.– Marius
Aug 6 at 3:00
data.table
Thank you once again. Final q: While I haven't run benchmarks between
pandas
and data.table
, from your experience, will pandas
be slower than data.table
in R?– watchtower
Aug 6 at 3:05
pandas
data.table
pandas
data.table
Wow seems like a lot of questions ..
Q1 Multiple index I will recommend IndexSlice
IndexSlice
data.loc[pd.IndexSlice['a',1],pd.IndexSlice['Title1','A']]
Out[410]:
state color
Title1 A 0
A 1
Q2 when you reset the index for this complete data frame it will have some issue , I do not think in R
you can do that without ftable
R
ftable
Here is the way doing with pandas
data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')]
Out[434]:
state key1 key2 Title1 Title2
color A A B B C C
0 a 1 0 1 2 3 4 5
1 a 2 6 7 8 9 10 11
Q3 I think the column
and index
multiple level offer 4 dimension, yes you can using one columns or index to represent all just do stack
column
index
stack
data.stack()
Out[436]:
state Title1 Title2
key1 key2 color
a 1 A 0 3
B 1 4
C 2 5
2 A 6 9
B 7 10
C 8 11
b 3 A 12 15
B 13 16
C 14 17
c 1 A 18 21
B 19 22
C 20 23
d 3 A 24 27
B 25 28
C 26 29
1 A 30 33
B 31 34
C 32 35
Q4 MultiIndex
is one of type for index , and pandas
treat index
and columns
to index type
MultiIndex
pandas
index
columns
For example
df.index # index but just different type of index
Out[441]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
df.columns # index as well
Out[442]: Index(['A', 'B'], dtype='object')
Thank you so much for your help. I sincerely appreciate it. Could you please give me some resource how I can understand and create
data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')]
query that you have posted for Q2?– watchtower
Aug 6 at 2:43
data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')]
@watchtower you can try with Pandas Cookbook
– Wen
Aug 6 at 3:02
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Why is this tagged R?
– camille
Aug 6 at 2:02