Python dynamic subset in a loop

I have the following dataframe. Multiple county names in a column, and across the table the dates and the values. Pre-recession max is some max that specific County had over a specific time frame (since not every county experienced same drops in values instantaneously). I need to find out the time between the minimum date which is unique to a row and the date when the value rebounded (when the value in the following column after the column with a minimum value became equal or above the pre-recession Max).

I am new to Python, new to stackoverflow, and spent a week researching online with no success.

Dataframe

Final result

The following code works and evaluates all values in df if they are above 51000. The question is: how to subset the df dynamically? Thanks.

df revcols = df.columns.values.tolist() revcols.reverse() tmpdf=tmpdf= df>51000 final=tmpdf[tmpdf.any(axis=1)].idxmax(axis=1) final

2 Answers
2

Use:

df = df.set_index(['County','Prerecession Max Value']) a = df.idxmin(axis=1) m1 = df.eq(df.min(axis=1), axis=0).cumsum(axis=1).gt(0) m2 = df.sub(df.index.get_level_values(1), axis=0).ge(0) b = (m1 & m2).idxmax(axis=1) d = 'Date of Min': a, 'Date of Max':b df = df.assign(**d).reset_index() print (df) County Prerecession Max Value 2007 2008 2009 2010 2011 2012 0 County 1 100000 90000 81000 72900 65610 70000 80000 1 County 2 20000 18000 16000 21000 22000 23000 24000 2 County 3 10000 9000 8100 7290 6561 5905 6405 3 County 4 6000 6000 4860 4374 4474 4574 6001 2013 2014 2015 Date of Min Date of Max 0 90000 100000 110000 2010 2014 1 25000 26000 27000 2008 2009 2 6905 12405 13405 2011 2014 3 7000 7500 7900 2009 2012

Setup: (changed last value of 2007 column to 6000 for test matching after minimal year value)

2007

6000

import pandas as pd temp=u""" County;Prerecession Max Value;2007;2008;2009;2010;2011;2012;2013;2014;2015 County 1;100,000;90,000;81,000;72,900;65,610;70,000;80,000;90,000;100,000;110,000 County 2;20,000;18,000;16,000;21,000;22,000;23,000;24,000;25,000;26,000;27,000 County 3;10,000;9,000;8,100;7,290;6,561;5,905;6,405;6,905;12,405;13,405 County 4;6,000;6,000;4,860;4,374;4,474;4,574;6,001;7,000;7,500;7,900""" #after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv' df = pd.read_csv(pd.compat.StringIO(temp), sep=";", thousands=',') print (df) County Prerecession Max Value 2007 2008 2009 2010 2011 2012 0 County 1 100000 90000 81000 72900 65610 70000 80000 1 County 2 20000 18000 16000 21000 22000 23000 24000 2 County 3 10000 9000 8100 7290 6561 5905 6405 3 County 4 6000 6000 4860 4374 4474 4574 6001 2013 2014 2015 0 90000 100000 110000 1 25000 26000 27000 2 6905 12405 13405 3 7000 7500 7900

Explanation:

First create MultiIndex with no date columns by DataFrame.set_index:

MultiIndex

DataFrame.set_index

df = df.set_index(['County','Prerecession Max Value']) print (df) 2007 2008 2009 2010 2011 2012 County Prerecession Max Value County 1 100000 90000 81000 72900 65610 70000 80000 County 2 20000 18000 16000 21000 22000 23000 24000 County 3 10000 9000 8100 7290 6561 5905 6405 County 4 6000 6000 4860 4374 4474 4574 6001 2013 2014 2015 County Prerecession Max Value County 1 100000 90000 100000 110000 County 2 20000 25000 26000 27000 County 3 10000 6905 12405 13405 County 4 6000 7000 7500 7900

For minimal Date use DataFrame.idxmin:

DataFrame.idxmin

print (df.idxmin(axis=1)) County Prerecession Max Value County 1 100000 2010 County 2 20000 2008 County 3 10000 2011 County 4 6000 2009 dtype: object

Then need filter all values after minimal values per rows - first compare by min values by DataFrame.eq:

min

DataFrame.eq

print (df.eq(df.min(axis=1), axis=0)) 2007 2008 2009 2010 2011 2012 County Prerecession Max Value County 1 100000 False False False True False False County 2 20000 False True False False False False County 3 10000 False False False False True False County 4 6000 False False True False False False 2013 2014 2015 County Prerecession Max Value County 1 100000 False False False County 2 20000 False False False County 3 10000 False False False County 4 6000 False False False

Use cumulative sum per rows by DataFrame.cumsum

DataFrame.cumsum

print (df.eq(df.min(axis=1), axis=0).cumsum(axis=1)) 2007 2008 2009 2010 2011 2012 2013 County Prerecession Max Value County 1 100000 0 0 0 1 1 1 1 County 2 20000 0 1 1 1 1 1 1 County 3 10000 0 0 0 0 1 1 1 County 4 6000 0 0 1 1 1 1 1 2014 2015 County Prerecession Max Value County 1 100000 1 1 County 2 20000 1 1 County 3 10000 1 1 County 4 6000 1 1

And compare by DataFrame.gt:

DataFrame.gt

print (df.eq(df.min(axis=1), axis=0).cumsum(axis=1).gt(0)) 2007 2008 2009 2010 2011 2012 2013 County Prerecession Max Value County 1 100000 False False False True True True True County 2 20000 False True True True True True True County 3 10000 False False False False True True True County 4 6000 False False True True True True True 2014 2015 County Prerecession Max Value County 1 100000 True True County 2 20000 True True County 3 10000 True True County 4 6000 True True

Then create another mask - subtract second level of MultiIndex selected by Index.get_level_values and DataFrame.sub:

MultiIndex

Index.get_level_values

DataFrame.sub

print (df.index.get_level_values(1)) Int64Index([100000, 20000, 10000, 6000], dtype='int64', name='Prerecession Max Value') print (df.sub(df.index.get_level_values(1), axis=0)) 2007 2008 2009 2010 2011 2012 County Prerecession Max Value County 1 100000 -10000 -19000 -27100 -34390 -30000 -20000 County 2 20000 -2000 -4000 1000 2000 3000 4000 County 3 10000 -1000 -1900 -2710 -3439 -4095 -3595 County 4 6000 0 -1140 -1626 -1526 -1426 1 2013 2014 2015 County Prerecession Max Value County 1 100000 -10000 0 10000 County 2 20000 5000 6000 7000 County 3 10000 -3095 2405 3405 County 4 6000 1000 1500 1900

Then compare by DataFrame.ge for >= with 0:

DataFrame.ge

>=

0

print (df.sub(df.index.get_level_values(1), axis=0).ge(0)) 2007 2008 2009 2010 2011 2012 County Prerecession Max Value County 1 100000 False False False False False False County 2 20000 False False True True True True County 3 10000 False False False False False False County 4 6000 True False False False False True 2013 2014 2015 County Prerecession Max Value County 1 100000 False True True County 2 20000 True True True County 3 10000 False True True County 4 6000 True True True

Chain both boolean mask by & for AND and get column name of first Trues per row by DataFrame.idxmax:

&

AND

True

DataFrame.idxmax

print ((m1 & m2)) 2007 2008 2009 2010 2011 2012 County Prerecession Max Value County 1 100000 False False False False False False County 2 20000 False False True True True True County 3 10000 False False False False False False County 4 6000 False False False False False True 2013 2014 2015 County Prerecession Max Value County 1 100000 False True True County 2 20000 True True True County 3 10000 False True True County 4 6000 True True True print ((m1 & m2).idxmax(axis=1)) County Prerecession Max Value County 1 100000 2014 County 2 20000 2009 County 3 10000 2014 County 4 6000 2012 dtype: object

Create dictionary of new columns for assign:

assign

d = 'Date of Min': a, 'Date of Max':b df = df.assign(**d) print (df) 2007 2008 2009 2010 2011 2012 County Prerecession Max Value County 1 100000 90000 81000 72900 65610 70000 80000 County 2 20000 18000 16000 21000 22000 23000 24000 County 3 10000 9000 8100 7290 6561 5905 6405 County 4 6000 6000 4860 4374 4474 4574 6001 2013 2014 2015 Date of Min Date of Max County Prerecession Max Value County 1 100000 90000 100000 110000 2010 2014 County 2 20000 25000 26000 27000 2008 2009 County 3 10000 6905 12405 13405 2011 2014 County 4 6000 7000 7500 7900 2009 2012

And last reset_index for columns from MultiIndex.

reset_index

MultiIndex

Thanks for posting this question.I have come up with a solution to this problem which has been presented below:

I have created a 'csv' file with sample data provided in the problem statement and named it stack.csv. I have added three new columns in this csv which will hold calculated values for:

Initially there are nulls or NaNs in these columns.

enter image description here

Now,we can have a look at the solution I have coded:

#Loading the CSV file into a data frame df = pd.read_csv('stack.csv') #Transposing the county and year columns to create a subset in order to fetch minimum value for each year df_subset=df[['county','2007','2008','2009','2010','2011','2012','2013','2014','2015']] df_subset_transposed = df_subset.T df_subset_transposed.rename(columns=0:'county1', inplace=True) df_subset_transposed.rename(columns=1:'county2', inplace=True) df_subset_transposed.rename(columns=2:'county3', inplace=True) df_subset_transposed.rename(columns=3:'county4', inplace=True) df_subset_transposed.drop(['county'],inplace=True) df_subset_transposed.index.names=['year'] df['MinVal_Year'][df['county']=='county1'] = pd.to_numeric(df_subset_transposed[('county1')]).idxmin() df['MinVal_Year'][df['county']=='county2'] = pd.to_numeric(df_subset_transposed[('county2')]).idxmin() df['MinVal_Year'][df['county']=='county3'] = pd.to_numeric(df_subset_transposed[('county3')]).idxmin() df['MinVal_Year'][df['county']=='county4'] = pd.to_numeric(df_subset_transposed[('county4')]).idxmin() #Iterating the main data frame couny wise to fetch which year is the rebound year j=0 for i in df['county']: if df[df['county']==i]['2007'][j] >= df[df['county']==i]['prerecession val'][j]: df.set_value(j,'Rebound_Year','2007') if df[df['county']==i]['2008'][j] >= df[df['county']==i]['prerecession val'][j]: df.set_value(j,'Rebound_Year','2008') elif df[df['county']==i]['2009'][j] >= df[df['county']==i]['prerecession val'][j]: df.set_value(j,'Rebound_Year','2009') elif df[df['county']==i]['2010'][j] >= df[df['county']==i]['prerecession val'][j]: df.set_value(j,'Rebound_Year','2010') elif df[df['county']==i]['2011'][j] >= df[df['county']==i]['prerecession val'][j]: df.set_value(j,'Rebound_Year','2011') elif df[df['county']==i]['2012'][j] >= df[df['county']==i]['prerecession val'][j]: df.set_value(j,'Rebound_Year','2012') elif df[df['county']==i]['2013'][j] >= df[df['county']==i]['prerecession val'][j]: df.set_value(j,'Rebound_Year','2013') elif df[df['county']==i]['2014'][j] >= df[df['county']==i]['prerecession val'][j]: df.set_value(j,'Rebound_Year','2014') elif df[df['county']==i]['2015'][j] >= df[df['county']==i]['prerecession val'][j]: df.set_value(j,'Rebound_Year','2015') j+=1 #Calculating the time difference of number of years elapse between year of minimum value and rebound year df['TimeDiff']=df['Rebound_Year']-pd.to_numeric(df['MinVal_Year'])

Lets' have look at the key columns in resultant data frame:

df[['county','prerecession val','MinVal_Year','Rebound_Year','TimeDiff']]

enter image description here

Hope this end to end tested solution helps you.

Thank you for trying to solve this problem. My original dataset includes 15K of counties/zips (rows) and monthly values (the number of columns is 250 or so). I had to simplify the task before posting here. My vote goes to jezrael as it worked on the whole dataframe without specifying column names one by one. There probably could be a simple tweak to your solution to pass all column names as a list which as a beginner, I wouldn't know how to do. I am surprised my question was looked at and addressed so quickly. I knew stackoverflow is supportive but didn't expect it to be so good!!
– ADS13
Aug 13 at 0:43

Thank you so much for sharing your feedback. Even I agree with you that my solution can be made better and more generic as it is very specific at this point.There is a lot to learn from jezrael's solution.I will work on making my solution more generic and optimized.Keep posting such questions as it really helps to fine tune one's skills.
– Richa Monga
Aug 13 at 8:07

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Sfyjdyy

Python dynamic subset in a loop

Python dynamic subset in a loop

2 Answers
2

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard

Python dynamic subset in a loop

Python dynamic subset in a loop

2 Answers 2

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard

2 Answers
2