New column based on multiple conditions ignoring missing values
I have the following dataframe with some missing values:
      A     B
0  63.0   9.0
1   NaN  35.0
2  51.0  95.0
3  25.0  11.0
4  91.0   NaN
5   2.0  47.0
6  37.0  10.0
7   NaN  88.0
8  75.0  87.0
9  92.0  21.0
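For reference, the dataframe can be rebuilt from the values shown above:
import numpy
import pandas
df = pandas.DataFrame({
    'A': [63.0, numpy.nan, 51.0, 25.0, 91.0, 2.0, 37.0, numpy.nan, 75.0, 92.0],
    'B': [9.0, 35.0, 95.0, 11.0, numpy.nan, 47.0, 10.0, 88.0, 87.0, 21.0],
})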
I want to create a new column based on conditions on both of the above columns:
df['C'] = numpy.where((df['A']>55) | (df['B']>55), "Yes", "No")
This works, but it does not take missing values into account:
      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0   No
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  Yes
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  Yes
8  75.0  87.0  Yes
9  92.0  21.0  Yes
To correct for the missing values, I have to run the following code:
df['C'] = numpy.where((df['A'].isnull()) | (df['B'].isnull()), numpy.nan, df['C'])
Then I get the proper new column:
      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  NaN
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  NaN
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  NaN
8  75.0  87.0  Yes
9  92.0  21.0  Yes
Is there a better way, so that I can correct for missing values in a single line of code?
4 Answers
IMO if we want to ignore the NaN rows, let's just drop them. Assignment will align on the indices, resulting in NaNs where the index is missing.
In [317]: df["C"] = (df[["A","B"]].dropna() > 55).any(axis=1).replace(
     ...:     {False: "No", True: "Yes"})
In [318]: df
Out[318]:
      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  NaN
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  NaN
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  NaN
8  75.0  87.0  Yes
9  92.0  21.0  Yes
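The alignment is easiest to see on the intermediate result: dropna removes rows 1, 4 and 7, so the boolean Series only covers the surviving indices, and assigning it back fills the missing indices with NaN. A small sketch (s is an illustrative name):
s = (df[["A","B"]].dropna() > 55).any(axis=1)
# s.index no longer contains 1, 4 and 7, so those rows get NaN on assignment
df["C"] = s.replace({False: "No", True: "Yes"})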
Using np.select. The order of the conditions is important here, as np.select chooses the first valid option, so you must have your null check first.
import numpy as np
c1 = df.isnull().any(axis=1)
c2 = df.gt(55).any(axis=1)
df['C'] = np.select([c1, c2], [np.nan, 'Yes'], 'No')
      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  nan
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  nan
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  nan
8  75.0  87.0  Yes
9  92.0  21.0  Yes
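To see why the null check must come first: a row like index 7 (NaN, 88.0) satisfies both conditions, and np.select takes whichever comes first. A quick contrast, using an illustrative column name:
# Wrong order: c2 is evaluated first, so row 7 gets 'Yes' instead of nan
df['C_wrong'] = np.select([c2, c1], ['Yes', np.nan], 'No')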
Chain two any calls with mask and map:
(df > 55).any(axis=1).mask(df.isnull().any(axis=1), np.nan).map({1: 'Yes', 0: 'No'})
Out[405]:
0 Yes
1 NaN
2 Yes
3 No
4 NaN
5 No
6 No
7 NaN
8 Yes
9 Yes
dtype: object
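Assigning the result back works the same way; note that this variant looks at every column of df, so it is only safe while df holds just the numeric columns A and B (see the comment below about re-running):
df['C'] = (df > 55).any(axis=1).mask(df.isnull().any(axis=1), np.nan).map({1: 'Yes', 0: 'No'})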
Or using two np.where calls:
a = (df > 55).any(axis=1)
b = df.isnull().any(axis=1)
np.where(b, np.nan, np.where(a, 'Yes', 'No'))
Out[407]:
array(['Yes', 'nan', 'Yes', 'No', 'nan', 'No', 'No', 'nan', 'Yes', 'Yes'],
dtype='<U32')
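Note that np.where coerces np.nan to the string 'nan' here, hence the dtype='<U32'. If real NaN values are wanted in the column, one option (a sketch, not part of the original answer, assuming pandas is imported as pd) is to go through a Series and mask afterwards:
df['C'] = pd.Series(np.where(a, 'Yes', 'No'), index=df.index).mask(b)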
just about works for the current df, but breaks on second run, etc. – Ashish Acharya, Aug 6 at 1:02
This becomes a little complicated depending on how many columns you want to do this for. Here's an alternative with map and mask:
(df[['A', 'B']]
 .gt(55)
 .any(axis=1)
 .map(lambda x: 'Yes' if x else 'No')  # or: .map({True: 'Yes', False: 'No'}.__getitem__)
 .mask(df[['A', 'B']].isna().any(axis=1))
)
0 Yes
1 NaN
2 Yes
3 No
4 NaN
5 No
6 No
7 NaN
8 Yes
9 Yes
dtype: object
This is still a single statement, split across multiple lines for readability.
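If more columns need the same treatment, the column list can be factored out once; a sketch, where cols is an assumed variable naming the columns to check:
cols = ['A', 'B']  # extend this list for more columns
sub = df[cols]
df['C'] = (sub.gt(55)
              .any(axis=1)
              .map({True: 'Yes', False: 'No'})
              .mask(sub.isna().any(axis=1)))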
Nice one man :-) I know there must be a select method, I just get confused with the order of conditions. LOL, thank you for the explanation :-) – Wen, Aug 6 at 1:09