New column based on multiple conditions ignoring missing values







I have the following dataframe with some missing values:


      A     B
0  63.0   9.0
1   NaN  35.0
2  51.0  95.0
3  25.0  11.0
4  91.0   NaN
5   2.0  47.0
6  37.0  10.0
7   NaN  88.0
8  75.0  87.0
9  92.0  21.0
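(For reproducibility, the frame above can be rebuilt as follows; values copied from the listing.)

```python
import numpy as np
import pandas as pd

# Values copied from the listing above.
df = pd.DataFrame({
    "A": [63.0, np.nan, 51.0, 25.0, 91.0, 2.0, 37.0, np.nan, 75.0, 92.0],
    "B": [9.0, 35.0, 95.0, 11.0, np.nan, 47.0, 10.0, 88.0, 87.0, 21.0],
})
```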



I want to create a new column based on conditions on both of the above columns:


df['C'] = numpy.where((df['A']>55) | (df['B']>55), "Yes", "No")



This works but does not take into account missing values:


      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0   No
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  Yes
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  Yes
8  75.0  87.0  Yes
9  92.0  21.0  Yes



To correct for the missing values, I have to run the following code:


df['C'] = numpy.where((df['A'].isnull()) | (df['B'].isnull()), numpy.nan, df['C'])



Then I get the proper new column:


      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  NaN
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  NaN
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  NaN
8  75.0  87.0  Yes
9  92.0  21.0  Yes



Is there a better way, so that I can correct for missing values in a single line of code?




4 Answers



IMO if we want to ignore the NaN rows, let's just drop them. Assignment will align on the indices, resulting in NaNs where the index is missing.


In [317]: df["C"] = (df[["A", "B"]].dropna() > 55).any(axis=1).replace(
     ...:     {False: "No", True: "Yes"})

In [318]: df
Out[318]:
      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  NaN
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  NaN
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  NaN
8  75.0  87.0  Yes
9  92.0  21.0  Yes



Using np.select. The order of the conditions is important here, as np.select chooses the first valid option, so you must have your null check first.




c1 = df.isnull().any(axis=1)
c2 = df.gt(55).any(axis=1)

df['C'] = np.select([c1, c2], [np.nan, 'Yes'], 'No')

      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  nan
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  nan
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  nan
8  75.0  87.0  Yes
9  92.0  21.0  Yes
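To illustrate why the order matters, here is a minimal sketch on a made-up two-row frame where row 1 satisfies both conditions (a value above 55 and a NaN); swapping the conditions would mislabel that row:

```python
import numpy as np
import pandas as pd

# Row 1 satisfies both conditions: A > 55 and B is NaN.
df = pd.DataFrame({"A": [63.0, 91.0], "B": [9.0, np.nan]})

c1 = df.isnull().any(axis=1)  # null check
c2 = df.gt(55).any(axis=1)    # threshold check

# Null check first: row 1 correctly comes out missing
# ('nan', since np.select promotes everything to strings here).
first = np.select([c1, c2], [np.nan, 'Yes'], 'No')

# Threshold check first: row 1 is wrongly labelled 'Yes'.
swapped = np.select([c2, c1], ['Yes', np.nan], 'No')
```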





Nice one man :-) I knew there must be a select method, I just get confused with the order of conditions. LOL thank you for the explanation :-)
– Wen
Aug 6 at 1:09




Chain two any and map




(df>55).any(axis=1).mask(df.isnull().any(axis=1), np.nan).map({True: 'Yes', False: 'No'})
Out[405]:
0    Yes
1    NaN
2    Yes
3     No
4    NaN
5     No
6     No
7    NaN
8    Yes
9    Yes
dtype: object



Or using two np.where




a = (df > 55).any(axis=1)
b = df.isnull().any(axis=1)
np.where(b, np.nan, np.where(a, 'Yes', 'No'))
Out[407]:
array(['Yes', 'nan', 'Yes', 'No', 'nan', 'No', 'No', 'nan', 'Yes', 'Yes'],
dtype='<U32')





just about works for the current df, but breaks on second run, etc
– Ashish Acharya
Aug 6 at 1:02



This becomes a little complicated depending on how many columns you want to do this for.



Here's an alternative with map and mask:




(df[['A', 'B']]
 .gt(55)
 .any(axis=1)
 .map(lambda x: 'Yes' if x else 'No')  # or: .map({True: 'Yes', False: 'No'}.__getitem__)
 .mask(df[['A', 'B']].isna().any(axis=1))
)

0    Yes
1    NaN
2    Yes
3     No
4    NaN
5     No
6     No
7    NaN
8    Yes
9    Yes
dtype: object



This is still a single statement, split across multiple lines for readability.
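If the check has to cover more columns, one way to keep this manageable is to name the column subset once (a sketch; the cols list and the extra column D are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [63.0, np.nan, 51.0],
    "B": [9.0, 35.0, 95.0],
    "D": [1.0, 2.0, 3.0],  # hypothetical extra column, excluded from the check
})

cols = ["A", "B"]  # the columns the condition applies to
sub = df[cols]

# Threshold check on the subset, then blank out rows with missing values.
df["C"] = (
    sub.gt(55)
    .any(axis=1)
    .map({True: "Yes", False: "No"})
    .mask(sub.isnull().any(axis=1))
)
```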





