Python dynamic subset in a loop

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP



Python dynamic subset in a loop



I have the following dataframe. Multiple county names in a column, and across the table the dates and the values. Pre-recession max is some max that specific County had over a specific time frame (since not every county experienced same drops in values instantaneously). I need to find out the time between the minimum date which is unique to a row and the date when the value rebounded (when the value in the following column after the column with a minimum value became equal or above the pre-recession Max).



I am new to Python, new to stackoverflow, and spent a week researching online with no success.



Dataframe



Final result



The following code works and evaluates all values in df if they are above 51000. The question is: how to subset the df dynamically? Thanks.


df
revcols = df.columns.values.tolist()
revcols.reverse()
tmpdf=tmpdf= df>51000
final=tmpdf[tmpdf.any(axis=1)].idxmax(axis=1)
final




2 Answers
2



Use:


df = df.set_index(['County','Prerecession Max Value'])

a = df.idxmin(axis=1)
m1 = df.eq(df.min(axis=1), axis=0).cumsum(axis=1).gt(0)
m2 = df.sub(df.index.get_level_values(1), axis=0).ge(0)
b = (m1 & m2).idxmax(axis=1)

d = 'Date of Min': a, 'Date of Max':b
df = df.assign(**d).reset_index()
print (df)
County Prerecession Max Value 2007 2008 2009 2010 2011 2012
0 County 1 100000 90000 81000 72900 65610 70000 80000
1 County 2 20000 18000 16000 21000 22000 23000 24000
2 County 3 10000 9000 8100 7290 6561 5905 6405
3 County 4 6000 6000 4860 4374 4474 4574 6001

2013 2014 2015 Date of Min Date of Max
0 90000 100000 110000 2010 2014
1 25000 26000 27000 2008 2009
2 6905 12405 13405 2011 2014
3 7000 7500 7900 2009 2012



Setup: (changed last value of 2007 column to 6000 for test matching after minimal year value)


2007


6000


import pandas as pd

temp=u"""
County;Prerecession Max Value;2007;2008;2009;2010;2011;2012;2013;2014;2015
County 1;100,000;90,000;81,000;72,900;65,610;70,000;80,000;90,000;100,000;110,000
County 2;20,000;18,000;16,000;21,000;22,000;23,000;24,000;25,000;26,000;27,000
County 3;10,000;9,000;8,100;7,290;6,561;5,905;6,405;6,905;12,405;13,405
County 4;6,000;6,000;4,860;4,374;4,474;4,574;6,001;7,000;7,500;7,900"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), sep=";", thousands=',')
print (df)
County Prerecession Max Value 2007 2008 2009 2010 2011 2012
0 County 1 100000 90000 81000 72900 65610 70000 80000
1 County 2 20000 18000 16000 21000 22000 23000 24000
2 County 3 10000 9000 8100 7290 6561 5905 6405
3 County 4 6000 6000 4860 4374 4474 4574 6001

2013 2014 2015
0 90000 100000 110000
1 25000 26000 27000
2 6905 12405 13405
3 7000 7500 7900



Explanation:



First create MultiIndex with no date columns by DataFrame.set_index:


MultiIndex


DataFrame.set_index


df = df.set_index(['County','Prerecession Max Value'])
print (df)
2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 90000 81000 72900 65610 70000 80000
County 2 20000 18000 16000 21000 22000 23000 24000
County 3 10000 9000 8100 7290 6561 5905 6405
County 4 6000 6000 4860 4374 4474 4574 6001

2013 2014 2015
County Prerecession Max Value
County 1 100000 90000 100000 110000
County 2 20000 25000 26000 27000
County 3 10000 6905 12405 13405
County 4 6000 7000 7500 7900



For minimal Date use DataFrame.idxmin:


DataFrame.idxmin


print (df.idxmin(axis=1))
County Prerecession Max Value
County 1 100000 2010
County 2 20000 2008
County 3 10000 2011
County 4 6000 2009
dtype: object



Then need filter all values after minimal values per rows - first compare by min values by DataFrame.eq:


min


DataFrame.eq


print (df.eq(df.min(axis=1), axis=0))

2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 False False False True False False
County 2 20000 False True False False False False
County 3 10000 False False False False True False
County 4 6000 False False True False False False

2013 2014 2015
County Prerecession Max Value
County 1 100000 False False False
County 2 20000 False False False
County 3 10000 False False False
County 4 6000 False False False



Use cumulative sum per rows by DataFrame.cumsum


DataFrame.cumsum


print (df.eq(df.min(axis=1), axis=0).cumsum(axis=1))
2007 2008 2009 2010 2011 2012 2013
County Prerecession Max Value
County 1 100000 0 0 0 1 1 1 1
County 2 20000 0 1 1 1 1 1 1
County 3 10000 0 0 0 0 1 1 1
County 4 6000 0 0 1 1 1 1 1

2014 2015
County Prerecession Max Value
County 1 100000 1 1
County 2 20000 1 1
County 3 10000 1 1
County 4 6000 1 1



And compare by DataFrame.gt:


DataFrame.gt


print (df.eq(df.min(axis=1), axis=0).cumsum(axis=1).gt(0))
2007 2008 2009 2010 2011 2012 2013
County Prerecession Max Value
County 1 100000 False False False True True True True
County 2 20000 False True True True True True True
County 3 10000 False False False False True True True
County 4 6000 False False True True True True True

2014 2015
County Prerecession Max Value
County 1 100000 True True
County 2 20000 True True
County 3 10000 True True
County 4 6000 True True



Then create another mask - subtract second level of MultiIndex selected by Index.get_level_values and DataFrame.sub:


MultiIndex


Index.get_level_values


DataFrame.sub


print (df.index.get_level_values(1))
Int64Index([100000, 20000, 10000, 6000],
dtype='int64', name='Prerecession Max Value')

print (df.sub(df.index.get_level_values(1), axis=0))
2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 -10000 -19000 -27100 -34390 -30000 -20000
County 2 20000 -2000 -4000 1000 2000 3000 4000
County 3 10000 -1000 -1900 -2710 -3439 -4095 -3595
County 4 6000 0 -1140 -1626 -1526 -1426 1

2013 2014 2015
County Prerecession Max Value
County 1 100000 -10000 0 10000
County 2 20000 5000 6000 7000
County 3 10000 -3095 2405 3405
County 4 6000 1000 1500 1900



Then compare by DataFrame.ge for >= with 0:


DataFrame.ge


>=


0


print (df.sub(df.index.get_level_values(1), axis=0).ge(0))
2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 False False False False False False
County 2 20000 False False True True True True
County 3 10000 False False False False False False
County 4 6000 True False False False False True

2013 2014 2015
County Prerecession Max Value
County 1 100000 False True True
County 2 20000 True True True
County 3 10000 False True True
County 4 6000 True True True



Chain both boolean mask by & for AND and get column name of first Trues per row by DataFrame.idxmax:


&


AND


True


DataFrame.idxmax


print ((m1 & m2))
2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 False False False False False False
County 2 20000 False False True True True True
County 3 10000 False False False False False False
County 4 6000 False False False False False True

2013 2014 2015
County Prerecession Max Value
County 1 100000 False True True
County 2 20000 True True True
County 3 10000 False True True
County 4 6000 True True True

print ((m1 & m2).idxmax(axis=1))
County Prerecession Max Value
County 1 100000 2014
County 2 20000 2009
County 3 10000 2014
County 4 6000 2012
dtype: object



Create dictionary of new columns for assign:


assign


d = 'Date of Min': a, 'Date of Max':b
df = df.assign(**d)
print (df)
2007 2008 2009 2010 2011 2012
County Prerecession Max Value
County 1 100000 90000 81000 72900 65610 70000 80000
County 2 20000 18000 16000 21000 22000 23000 24000
County 3 10000 9000 8100 7290 6561 5905 6405
County 4 6000 6000 4860 4374 4474 4574 6001

2013 2014 2015 Date of Min Date of Max
County Prerecession Max Value
County 1 100000 90000 100000 110000 2010 2014
County 2 20000 25000 26000 27000 2008 2009
County 3 10000 6905 12405 13405 2011 2014
County 4 6000 7000 7500 7900 2009 2012



And last reset_index for columns from MultiIndex.


reset_index


MultiIndex



Thanks for posting this question.I have come up with a solution to this problem which has been presented below:



I have created a 'csv' file with sample data provided in the problem statement and named it stack.csv. I have added three new columns in this csv which will hold calculated values for:



Initially there are nulls or NaNs in these columns.



enter image description here



Now,we can have a look at the solution I have coded:




#Loading the CSV file into a data frame
df = pd.read_csv('stack.csv')

#Transposing the county and year columns to create a subset in order to fetch minimum value for each year
df_subset=df[['county','2007','2008','2009','2010','2011','2012','2013','2014','2015']]
df_subset_transposed = df_subset.T
df_subset_transposed.rename(columns=0:'county1', inplace=True)
df_subset_transposed.rename(columns=1:'county2', inplace=True)
df_subset_transposed.rename(columns=2:'county3', inplace=True)
df_subset_transposed.rename(columns=3:'county4', inplace=True)
df_subset_transposed.drop(['county'],inplace=True)
df_subset_transposed.index.names=['year']
df['MinVal_Year'][df['county']=='county1'] = pd.to_numeric(df_subset_transposed[('county1')]).idxmin()
df['MinVal_Year'][df['county']=='county2'] = pd.to_numeric(df_subset_transposed[('county2')]).idxmin()
df['MinVal_Year'][df['county']=='county3'] = pd.to_numeric(df_subset_transposed[('county3')]).idxmin()
df['MinVal_Year'][df['county']=='county4'] = pd.to_numeric(df_subset_transposed[('county4')]).idxmin()

#Iterating the main data frame couny wise to fetch which year is the rebound year
j=0
for i in df['county']:
if df[df['county']==i]['2007'][j] >= df[df['county']==i]['prerecession val'][j]:
df.set_value(j,'Rebound_Year','2007')
if df[df['county']==i]['2008'][j] >= df[df['county']==i]['prerecession val'][j]:
df.set_value(j,'Rebound_Year','2008')
elif df[df['county']==i]['2009'][j] >= df[df['county']==i]['prerecession val'][j]:
df.set_value(j,'Rebound_Year','2009')
elif df[df['county']==i]['2010'][j] >= df[df['county']==i]['prerecession val'][j]:
df.set_value(j,'Rebound_Year','2010')
elif df[df['county']==i]['2011'][j] >= df[df['county']==i]['prerecession val'][j]:
df.set_value(j,'Rebound_Year','2011')
elif df[df['county']==i]['2012'][j] >= df[df['county']==i]['prerecession val'][j]:
df.set_value(j,'Rebound_Year','2012')
elif df[df['county']==i]['2013'][j] >= df[df['county']==i]['prerecession val'][j]:
df.set_value(j,'Rebound_Year','2013')
elif df[df['county']==i]['2014'][j] >= df[df['county']==i]['prerecession val'][j]:
df.set_value(j,'Rebound_Year','2014')
elif df[df['county']==i]['2015'][j] >= df[df['county']==i]['prerecession val'][j]:
df.set_value(j,'Rebound_Year','2015')
j+=1

#Calculating the time difference of number of years elapse between year of minimum value and rebound year
df['TimeDiff']=df['Rebound_Year']-pd.to_numeric(df['MinVal_Year'])



Lets' have look at the key columns in resultant data frame:




df[['county','prerecession val','MinVal_Year','Rebound_Year','TimeDiff']]



enter image description here



Hope this end to end tested solution helps you.





Thank you for trying to solve this problem. My original dataset includes 15K of counties/zips (rows) and monthly values (the number of columns is 250 or so). I had to simplify the task before posting here. My vote goes to jezrael as it worked on the whole dataframe without specifying column names one by one. There probably could be a simple tweak to your solution to pass all column names as a list which as a beginner, I wouldn't know how to do. I am surprised my question was looked at and addressed so quickly. I knew stackoverflow is supportive but didn't expect it to be so good!!
– ADS13
Aug 13 at 0:43





Thank you so much for sharing your feedback. Even I agree with you that my solution can be made better and more generic as it is very specific at this point.There is a lot to learn from jezrael's solution.I will work on making my solution more generic and optimized.Keep posting such questions as it really helps to fine tune one's skills.
– Richa Monga
Aug 13 at 8:07







By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard