New column based on multiple conditions ignoring missing values
I have the following dataframe with some missing values:
      A     B
0  63.0   9.0
1   NaN  35.0
2  51.0  95.0
3  25.0  11.0
4  91.0   NaN
5   2.0  47.0
6  37.0  10.0
7   NaN  88.0
8  75.0  87.0
9  92.0  21.0
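For reference, the dataframe can be rebuilt from the values shown above:
import numpy
import pandas
df = pandas.DataFrame({
    'A': [63.0, numpy.nan, 51.0, 25.0, 91.0, 2.0, 37.0, numpy.nan, 75.0, 92.0],
    'B': [9.0, 35.0, 95.0, 11.0, numpy.nan, 47.0, 10.0, 88.0, 87.0, 21.0],
})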
I want to create a new column based on conditions on both of the above columns:
df['C'] = numpy.where((df['A']>55) | (df['B']>55), "Yes", "No")
This works, but it does not take missing values into account:
      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0   No
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  Yes
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  Yes
8  75.0  87.0  Yes
9  92.0  21.0  Yes
To correct for the missing values, I have to run the following code:
df['C'] = numpy.where((df['A'].isnull()) | (df['B'].isnull()), numpy.nan, df['C'])
Then I get the proper new column:
      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  NaN
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  NaN
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  NaN
8  75.0  87.0  Yes
9  92.0  21.0  Yes
Is there a better way, so that I can correct for missing values in a single line of code?
4 Answers
IMO if we want to ignore the NaN rows, let's just drop them. Assignment will align on the indices, resulting in NaNs where the index is missing.
In [317]: df["C"] = (df[["A","B"]].dropna() > 55).any(axis=1).replace(
     ...:     {False: "No", True: "Yes"})
In [318]: df
Out[318]:
      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  NaN
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  NaN
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  NaN
8  75.0  87.0  Yes
9  92.0  21.0  Yes
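The alignment is easiest to see on the intermediate result: dropna removes rows 1, 4 and 7, so the boolean Series only covers the surviving indices, and assigning it back fills the missing indices with NaN. A small sketch (s is an illustrative name):
s = (df[["A","B"]].dropna() > 55).any(axis=1)
# s.index no longer contains 1, 4 and 7, so those rows get NaN on assignment
df["C"] = s.replace({False: "No", True: "Yes"})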
Using np.select. The order of the conditions is important here, as np.select chooses the first valid option, so you must have your null check first.
import numpy as np
c1 = df.isnull().any(axis=1)
c2 = df.gt(55).any(axis=1)
df['C'] = np.select([c1, c2], [np.nan, 'Yes'], 'No')
      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  nan
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  nan
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  nan
8  75.0  87.0  Yes
9  92.0  21.0  Yes
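To see why the null check must come first: a row like index 7 (NaN, 88.0) satisfies both conditions, and np.select takes whichever comes first. A quick contrast, using an illustrative column name:
# Wrong order: c2 is evaluated first, so row 7 gets 'Yes' instead of nan
df['C_wrong'] = np.select([c2, c1], ['Yes', np.nan], 'No')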
Chain two any calls with mask and map:
(df > 55).any(axis=1).mask(df.isnull().any(axis=1), np.nan).map({1: 'Yes', 0: 'No'})
Out[405]:
0 Yes
1 NaN
2 Yes
3 No
4 NaN
5 No
6 No
7 NaN
8 Yes
9 Yes
dtype: object
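Assigning the result back works the same way; note that this variant looks at every column of df, so it is only safe while df holds just the numeric columns A and B (see the comment below about re-running):
df['C'] = (df > 55).any(axis=1).mask(df.isnull().any(axis=1), np.nan).map({1: 'Yes', 0: 'No'})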
Or using two np.where calls:
a = (df > 55).any(axis=1)
b = df.isnull().any(axis=1)
np.where(b, np.nan, np.where(a, 'Yes', 'No'))
Out[407]:
array(['Yes', 'nan', 'Yes', 'No', 'nan', 'No', 'No', 'nan', 'Yes', 'Yes'],
dtype='<U32')
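Note that np.where coerces np.nan to the string 'nan' here, hence the dtype='<U32'. If real NaN values are wanted in the column, one option (a sketch, not part of the original answer, assuming pandas is imported as pd) is to go through a Series and mask afterwards:
df['C'] = pd.Series(np.where(a, 'Yes', 'No'), index=df.index).mask(b)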
just about works for the current df, but breaks on second run, etc. – Ashish Acharya, Aug 6 at 1:02
This becomes a little complicated depending on how many columns you want to do this for. Here's an alternative with map and mask:
(df[['A', 'B']]
 .gt(55)
 .any(axis=1)
 .map(lambda x: 'Yes' if x else 'No')  # or: .map({True: 'Yes', False: 'No'}.__getitem__)
 .mask(df[['A', 'B']].isna().any(axis=1))
)
0 Yes
1 NaN
2 Yes
3 No
4 NaN
5 No
6 No
7 NaN
8 Yes
9 Yes
dtype: object
This is still a single statement, split across multiple lines for readability.
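If more columns need the same treatment, the column list can be factored out once; a sketch, where cols is an assumed variable naming the columns to check:
cols = ['A', 'B']  # extend this list for more columns
sub = df[cols]
df['C'] = (sub.gt(55)
              .any(axis=1)
              .map({True: 'Yes', False: 'No'})
              .mask(sub.isna().any(axis=1)))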
Nice one man :-) I know there must be a select method, I just get confused with the order of conditions. LOL, thank you for the explanation :-) – Wen, Aug 6 at 1:09