Pandas first date condition is met while another condition is active
Clash Royale CLAN TAG#URR8PPP
Pandas first date condition is met while another condition is active
I have a dataframe with a time series of scores. My goal is to detect when the score is larger than a certain threshold th
and then to find when the score goes back to 0. Is quite easy to find each condition separately
th
dates_1 = score > th
dates_2 = np.sign(score[1:]) == np.sign(score.shift(1).dropna())
However, I don't know what's the most pythonic way to override dates_2 so that only dates when an 'active' date_1 has been observed
Perhaps using an auxiliary column 'active' set to 1 whenever score > th
is True and set it to False when the condition for dates_2 is met. That way I can ask for the change in sign AND active == True
. However, that approach requires iteration and I'm wondering if there's a vectorized solution to my problem
score > th
active == True
Any thoughts on how to improve my approach?
Sample data:
date score
2010-01-04 0.0
2010-01-05 -0.3667779798467592
2010-01-06 -1.9641427199568868
2010-01-07 -0.49976215445519134
2010-01-08 -0.7069108074548405
2010-01-11 -1.4624766212523337
2010-01-12 -0.9132777669357441
2010-01-13 0.16204588193577152
2010-01-14 0.958085568609925
2010-01-15 1.4683022129399834
2010-01-19 3.036016680985081
2010-01-20 2.2357911432637345
2010-01-21 2.8827438241030707
2010-01-22 -3.395977874791837
if th = 0.94
date active
2010-01-04 False
2010-01-05 False
2010-01-06 False
2010-01-07 False
2010-01-08 False
2010-01-11 False
2010-01-12 False
2010-01-13 False
2010-01-14 True
2010-01-15 True
2010-01-19 True
2010-01-20 True
2010-01-21 True
2010-01-22 False
2 Answers
2
def alt_cond(s, th):
active = False
for x in s:
active = [x >= th, x > 0][int(active)]
yield active
df.assign(A=[*alt_cond(df.score, 0.94)])
date score A
0 2010-01-04 0.000000 False
1 2010-01-05 -0.366778 False
2 2010-01-06 -1.964143 False
3 2010-01-07 -0.499762 False
4 2010-01-08 -0.706911 False
5 2010-01-11 -1.462477 False
6 2010-01-12 -0.913278 False
7 2010-01-13 0.162046 False
8 2010-01-14 0.958086 True
9 2010-01-15 1.468302 True
10 2010-01-19 3.036017 True
11 2010-01-20 2.235791 True
12 2010-01-21 2.882744 True
13 2010-01-22 -3.395978 False
I used Numba to really speed things up. Still a loop but should be very fast if you can install numba
numba
from numba import njit
@njit
def alt_cond(s, th):
active = False
out = np.zeros(len(s), dtype=np.bool8)
for i, x in enumerate(s):
if active:
if x <= 0:
active = False
else:
if x >= th:
active = True
out[i] = active
return out
df.assign(A=alt_cond(df.score.values, .94))
You can have a dictionary of column names and threshold values and iterate
th = 'score': 0.94
df.join(pd.DataFrame(
np.column_stack([[*alt_cond(df[k], v)] for k, v in th.items()]),
df.index, [f"k_A" for k in th]
))
date score score_A
0 2010-01-04 0.000000 False
1 2010-01-05 -0.366778 False
2 2010-01-06 -1.964143 False
3 2010-01-07 -0.499762 False
4 2010-01-08 -0.706911 False
5 2010-01-11 -1.462477 False
6 2010-01-12 -0.913278 False
7 2010-01-13 0.162046 False
8 2010-01-14 0.958086 True
9 2010-01-15 1.468302 True
10 2010-01-19 3.036017 True
11 2010-01-20 2.235791 True
12 2010-01-21 2.882744 True
13 2010-01-22 -3.395978 False
Tried the non vectorized, works great!
– Yuca
Aug 8 at 20:36
If ever in NY, beers on me :)
– Yuca
Aug 8 at 20:41
(-: no worries. Glad you found it.
– piRSquared
Aug 9 at 13:27
Cleaning up comments.
– piRSquared
Aug 9 at 13:27
I updated the post with one way to do it.
– piRSquared
Aug 14 at 19:21
I'm assuming your data is in a pandas dataframe, and 'date' is your index column. Then this would be the way I'd do it:
th = 0.94 # Threshold value
i = df[df.score>th].index[0] # Check the index for the first condition
df[i:][df.score<0].index[0] # Check the index for the second condition, after the index of the first condition
So use conditional indexing to find the index for the first condition ([df.score>th]
), then check for the second condition ([df.score<0]
), but begin to look from the index found for the first condition ([i:]
)
[df.score>th]
[df.score<0]
[i:]
I ran your code and it doesn't do anything to the original input a, can you include your results?
– Yuca
Aug 8 at 19:49
This can be due to the random data. If none of the data yields a value greater than the threshold value, no output is generated
– Jeroen
Aug 8 at 19:50
I get FutureWarning; 'summary' is deprecated (no idea why this is getting triggered tbh)
– Yuca
Aug 8 at 19:50
I'm running it multiple times, it still not doing anything. Also why generate random data vs loading the dataframe from the clipboard?
– Yuca
Aug 8 at 19:51
As you should! pd.read_clipboard()
– Yuca
Aug 8 at 20:20
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
What is the expected outcome given the data in your example?
– piRSquared
Aug 8 at 19:27