Python dynamic subset in a loop
I have the following dataframe: county names in one column, and dates and values across the remaining columns. The pre-recession max is the maximum a specific county reached over a specific time frame (since not every county experienced its drop in values at the same time). For each row I need to find the time between the date of the minimum value (which is unique to that row) and the date the value rebounded, i.e. the first column after the minimum-value column whose value is equal to or above the pre-recession max.
I am new to Python and to Stack Overflow, and I have spent a week researching online with no success.
Dataframe
Final result
The following code works and evaluates whether each value in df is above 51000. The question is: how do I subset the df dynamically? Thanks.
df
revcols = df.columns.values.tolist()
revcols.reverse()
tmpdf = df > 51000
final=tmpdf[tmpdf.any(axis=1)].idxmax(axis=1)
final
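For illustration, here is a minimal sketch of what subsetting "dynamically" could look like with a per-row threshold taken from the Prerecession Max Value column instead of the fixed 51000 (the tiny frame and column names are assumptions mirroring the sample data above):

import pandas as pd

# small assumed sample mirroring the question's data
df = pd.DataFrame({
    'County': ['County 1', 'County 2'],
    'Prerecession Max Value': [100000, 20000],
    '2007': [90000, 18000],
    '2008': [81000, 21000],
    '2009': [95000, 22000],
})

year_cols = [c for c in df.columns if c.isdigit()]               # the date columns
above = df[year_cols].ge(df['Prerecession Max Value'], axis=0)   # per-row threshold instead of 51000
first_rebound = above[above.any(axis=1)].idxmax(axis=1)          # first column at or above the threshold
print(first_rebound)

This only covers the per-row threshold; the accepted answer below additionally restricts the search to the columns at or after each row's minimum.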
2 Answers
Use:
df = df.set_index(['County','Prerecession Max Value'])
a = df.idxmin(axis=1)
m1 = df.eq(df.min(axis=1), axis=0).cumsum(axis=1).gt(0)
m2 = df.sub(df.index.get_level_values(1), axis=0).ge(0)
b = (m1 & m2).idxmax(axis=1)
d = {'Date of Min': a, 'Date of Max': b}
df = df.assign(**d).reset_index()
print (df)
County Prerecession Max Value 2007 2008 2009 2010 2011 2012
0 County 1 100000 90000 81000 72900 65610 70000 80000
1 County 2 20000 18000 16000 21000 22000 23000 24000
2 County 3 10000 9000 8100 7290 6561 5905 6405
3 County 4 6000 6000 4860 4374 4474 4574 6001
2013 2014 2015 Date of Min Date of Max
0 90000 100000 110000 2010 2014
1 25000 26000 27000 2008 2009
2 6905 12405 13405 2011 2014
3 7000 7500 7900 2009 2012
Setup (the last value of the 2007 column was changed to 6000 to test matching after the year of the minimum value):
import pandas as pd
from io import StringIO
temp=u"""
County;Prerecession Max Value;2007;2008;2009;2010;2011;2012;2013;2014;2015
County 1;100,000;90,000;81,000;72,900;65,610;70,000;80,000;90,000;100,000;110,000
County 2;20,000;18,000;16,000;21,000;22,000;23,000;24,000;25,000;26,000;27,000
County 3;10,000;9,000;8,100;7,290;6,561;5,905;6,405;6,905;12,405;13,405
County 4;6,000;6,000;4,860;4,374;4,474;4,574;6,001;7,000;7,500;7,900"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", thousands=',')
print (df)
County Prerecession Max Value 2007 2008 2009 2010 2011 2012
0 County 1 100000 90000 81000 72900 65610 70000 80000
1 County 2 20000 18000 16000 21000 22000 23000 24000
2 County 3 10000 9000 8100 7290 6561 5905 6405
3 County 4 6000 6000 4860 4374 4474 4574 6001
2013 2014 2015
0 90000 100000 110000
1 25000 26000 27000
2 6905 12405 13405
3 7000 7500 7900
Explanation:
First create a MultiIndex with the non-date columns by DataFrame.set_index:
df = df.set_index(['County','Prerecession Max Value'])
print (df)
2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 90000 81000 72900 65610 70000 80000
County 2 20000 18000 16000 21000 22000 23000 24000
County 3 10000 9000 8100 7290 6561 5905 6405
County 4 6000 6000 4860 4374 4474 4574 6001
2013 2014 2015
County Prerecession Max Value
County 1 100000 90000 100000 110000
County 2 20000 25000 26000 27000
County 3 10000 6905 12405 13405
County 4 6000 7000 7500 7900
For the date of the minimum value per row use DataFrame.idxmin:
print (df.idxmin(axis=1))
County Prerecession Max Value
County 1 100000 2010
County 2 20000 2008
County 3 10000 2011
County 4 6000 2009
dtype: object
Then all values from the per-row minimum onward need to be kept. First compare against the row min values by DataFrame.eq:
print (df.eq(df.min(axis=1), axis=0))
2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 False False False True False False
County 2 20000 False True False False False False
County 3 10000 False False False False True False
County 4 6000 False False True False False False
2013 2014 2015
County Prerecession Max Value
County 1 100000 False False False
County 2 20000 False False False
County 3 10000 False False False
County 4 6000 False False False
Use the cumulative sum per row by DataFrame.cumsum:
print (df.eq(df.min(axis=1), axis=0).cumsum(axis=1))
2007 2008 2009 2010 2011 2012 2013
County Prerecession Max Value
County 1 100000 0 0 0 1 1 1 1
County 2 20000 0 1 1 1 1 1 1
County 3 10000 0 0 0 0 1 1 1
County 4 6000 0 0 1 1 1 1 1
2014 2015
County Prerecession Max Value
County 1 100000 1 1
County 2 20000 1 1
County 3 10000 1 1
County 4 6000 1 1
And compare against 0 by DataFrame.gt:
print (df.eq(df.min(axis=1), axis=0).cumsum(axis=1).gt(0))
2007 2008 2009 2010 2011 2012 2013
County Prerecession Max Value
County 1 100000 False False False True True True True
County 2 20000 False True True True True True True
County 3 10000 False False False False True True True
County 4 6000 False False True True True True True
2014 2015
County Prerecession Max Value
County 1 100000 True True
County 2 20000 True True
County 3 10000 True True
County 4 6000 True True
Then create another mask: subtract the second level of the MultiIndex, selected by Index.get_level_values, using DataFrame.sub:
print (df.index.get_level_values(1))
Int64Index([100000, 20000, 10000, 6000],
dtype='int64', name='Prerecession Max Value')
print (df.sub(df.index.get_level_values(1), axis=0))
2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 -10000 -19000 -27100 -34390 -30000 -20000
County 2 20000 -2000 -4000 1000 2000 3000 4000
County 3 10000 -1000 -1900 -2710 -3439 -4095 -3595
County 4 6000 0 -1140 -1626 -1526 -1426 1
2013 2014 2015
County Prerecession Max Value
County 1 100000 -10000 0 10000
County 2 20000 5000 6000 7000
County 3 10000 -3095 2405 3405
County 4 6000 1000 1500 1900
Then compare against 0 (>=) by DataFrame.ge:
print (df.sub(df.index.get_level_values(1), axis=0).ge(0))
2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 False False False False False False
County 2 20000 False False True True True True
County 3 10000 False False False False False False
County 4 6000 True False False False False True
2013 2014 2015
County Prerecession Max Value
County 1 100000 False True True
County 2 20000 True True True
County 3 10000 False True True
County 4 6000 True True True
Chain both boolean masks by & for AND and get the column name of the first True per row by DataFrame.idxmax:
print ((m1 & m2))
2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 False False False False False False
County 2 20000 False False True True True True
County 3 10000 False False False False False False
County 4 6000 False False False False False True
2013 2014 2015
County Prerecession Max Value
County 1 100000 False True True
County 2 20000 True True True
County 3 10000 False True True
County 4 6000 True True True
print ((m1 & m2).idxmax(axis=1))
County Prerecession Max Value
County 1 100000 2014
County 2 20000 2009
County 3 10000 2014
County 4 6000 2012
dtype: object
Create a dictionary of new columns for assign:
d = {'Date of Min': a, 'Date of Max': b}
df = df.assign(**d)
print (df)
2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 90000 81000 72900 65610 70000 80000
County 2 20000 18000 16000 21000 22000 23000 24000
County 3 10000 9000 8100 7290 6561 5905 6405
County 4 6000 6000 4860 4374 4474 4574 6001
2013 2014 2015 Date of Min Date of Max
County Prerecession Max Value
County 1 100000 90000 100000 110000 2010 2014
County 2 20000 25000 26000 27000 2008 2009
County 3 10000 6905 12405 13405 2011 2014
County 4 6000 7000 7500 7900 2009 2012
And last, reset_index to turn the MultiIndex back into columns.
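Beyond this result, the question also asks for the elapsed time between the two dates. Since the column labels are years, one more line would cover it (an assumed follow-up, not part of the original answer):

# years elapsed between the minimum and the rebound, using the columns created above
df['Years to Rebound'] = df['Date of Max'].astype(int) - df['Date of Min'].astype(int)
print(df[['County', 'Date of Min', 'Date of Max', 'Years to Rebound']])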
Thanks for posting this question. I have come up with a solution to this problem, which is presented below:
I created a csv file named stack.csv with the sample data provided in the problem statement. I added three new columns to this csv, which will hold the calculated values for MinVal_Year, Rebound_Year and TimeDiff.
Initially these columns contain nulls or NaNs.
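As a side note, instead of adding the empty columns to the csv by hand, they could be created in pandas right after loading the file, roughly like this (a sketch using the column names from the code below):

import numpy as np

for col in ['MinVal_Year', 'Rebound_Year', 'TimeDiff']:
    df[col] = np.nan   # each helper column starts out empty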
Now, we can have a look at the solution I have coded:
import pandas as pd

#Loading the CSV file into a data frame
df = pd.read_csv('stack.csv')
#Transposing the county and year columns to create a subset in order to fetch the minimum value for each county
df_subset = df[['county','2007','2008','2009','2010','2011','2012','2013','2014','2015']]
df_subset_transposed = df_subset.T
df_subset_transposed.rename(columns={0:'county1'}, inplace=True)
df_subset_transposed.rename(columns={1:'county2'}, inplace=True)
df_subset_transposed.rename(columns={2:'county3'}, inplace=True)
df_subset_transposed.rename(columns={3:'county4'}, inplace=True)
df_subset_transposed.drop(['county'], inplace=True)
df_subset_transposed.index.names = ['year']
#Year of the minimum value for each county
df.loc[df['county']=='county1', 'MinVal_Year'] = pd.to_numeric(df_subset_transposed['county1']).idxmin()
df.loc[df['county']=='county2', 'MinVal_Year'] = pd.to_numeric(df_subset_transposed['county2']).idxmin()
df.loc[df['county']=='county3', 'MinVal_Year'] = pd.to_numeric(df_subset_transposed['county3']).idxmin()
df.loc[df['county']=='county4', 'MinVal_Year'] = pd.to_numeric(df_subset_transposed['county4']).idxmin()
#Iterating the main data frame county-wise to find which year is the rebound year
j = 0
for i in df['county']:
    if df[df['county']==i]['2007'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.at[j, 'Rebound_Year'] = 2007
    if df[df['county']==i]['2008'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.at[j, 'Rebound_Year'] = 2008
    elif df[df['county']==i]['2009'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.at[j, 'Rebound_Year'] = 2009
    elif df[df['county']==i]['2010'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.at[j, 'Rebound_Year'] = 2010
    elif df[df['county']==i]['2011'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.at[j, 'Rebound_Year'] = 2011
    elif df[df['county']==i]['2012'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.at[j, 'Rebound_Year'] = 2012
    elif df[df['county']==i]['2013'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.at[j, 'Rebound_Year'] = 2013
    elif df[df['county']==i]['2014'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.at[j, 'Rebound_Year'] = 2014
    elif df[df['county']==i]['2015'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.at[j, 'Rebound_Year'] = 2015
    j += 1
#Calculating the number of years elapsed between the year of the minimum value and the rebound year
df['TimeDiff'] = pd.to_numeric(df['Rebound_Year']) - pd.to_numeric(df['MinVal_Year'])
Let's have a look at the key columns in the resulting data frame:
df[['county','prerecession val','MinVal_Year','Rebound_Year','TimeDiff']]
Hope this end-to-end tested solution helps you.
Thank you so much for sharing your feedback. I agree that my solution can be made better and more generic, as it is very specific at this point. There is a lot to learn from jezrael's solution. I will work on making my solution more generic and optimized. Keep posting such questions, as it really helps to fine-tune one's skills.
– Richa Monga
Aug 13 at 8:07
Thank you for trying to solve this problem. My original dataset includes 15K counties/zips (rows) and monthly values (around 250 columns). I had to simplify the task before posting here. My vote goes to jezrael's answer, as it works on the whole dataframe without specifying column names one by one. There is probably a simple tweak to your solution to pass all column names as a list, which, as a beginner, I wouldn't know how to do. I am surprised my question was looked at and addressed so quickly. I knew Stack Overflow is supportive, but I didn't expect it to be this good!
– ADS13
Aug 13 at 0:43
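A rough sketch of the tweak mentioned in this comment, reusing the column names from the second answer's stack.csv (so 'county' and 'prerecession val' are assumptions about that file): build the list of year columns programmatically instead of typing them one by one.

meta_cols = ['county', 'prerecession val', 'MinVal_Year', 'Rebound_Year', 'TimeDiff']
year_cols = [c for c in df.columns if c not in meta_cols]   # everything else is a year column
df_subset = df[['county'] + year_cols]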