Encoding string features in pandas
I have a dataframe like the following:
train_df
'type', 'manufacturer', 'year', 'num_doors'
sedan, bmw, 2012, 4
couple, audi, 2014, 2
and so on, and a test_df in a similar format.
All the features are categorical features (some string, some int) and I want to encode them as categorical variables.
What's a good way to handle these categorical variables in pandas/sklearn?
Also, once the transformation has been applied to train_df, I want to encode test_df using the same encodings.
pandas.pydata.org/pandas-docs/stable/categorical.html
– John Zwinck
Jan 16 at 2:10
@cᴏʟᴅsᴘᴇᴇᴅ: yeah.. though there are only 10 possible values it can take..
– Fraz
Jan 16 at 2:14
Look up one hot encoding and get_dummies() - both sklearn functions IIRC
– pault
Jan 16 at 2:20
@pault get_dummies() is from pandas, not scikit-learn
– Vivek Kumar
Jan 16 at 5:03
2 Answers
When reading your data, specify dtype='category' to make every single column categorical.
df = pd.read_csv('file.csv', dtype='category')
df
type manufacturer year num_doors
0 sedan bmw 2012 4
1 couple audi 2014 2
df.dtypes
type category
manufacturer category
year category
num_doors category
dtype: object
If you want to convert only a specific subset of columns, something like this would do (note that the dtype name is 'category', not 'categorical'):
f = dict.fromkeys(['type', 'manufacturer', ...], 'category')
Pass f to dtype:
df = pd.read_csv('file.csv', dtype=f)
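To cover the second half of the question (reusing the train encodings on test_df), one possible approach is to build the category dtypes from the training data and cast both frames to them. This is only a sketch: the two small frames, the "suv" value, and the names cat_dtypes, train_codes and test_codes are made up for illustration, and sorting the categories is just one way to fix their order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data in the shape described in the question.
train_df = pd.DataFrame({
    "type": ["sedan", "couple"],
    "manufacturer": ["bmw", "audi"],
    "year": [2012, 2014],
    "num_doors": [4, 2],
})
test_df = pd.DataFrame({
    "type": ["couple", "suv"],  # "suv" never appears in train_df
    "manufacturer": ["bmw", "audi"],
    "year": [2014, 2012],
    "num_doors": [2, 4],
})

# Learn one categorical dtype per column from the training data only.
cat_dtypes = {
    col: CategoricalDtype(categories=sorted(train_df[col].unique()))
    for col in train_df.columns
}

# Cast both frames to the same dtypes so they share one encoding.
train_cat = train_df.astype(cat_dtypes)
test_cat = test_df.astype(cat_dtypes)

# Integer codes are consistent across both frames.
# Values unseen in training become NaN, which shows up as code -1.
train_codes = train_cat.apply(lambda s: s.cat.codes)
test_codes = test_cat.apply(lambda s: s.cat.codes)
```

Because the dtypes come from train_df alone, a value that only shows up in test_df is mapped to missing rather than silently getting a new code, so you can decide how to handle it.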
There are multiple ways to achieve this:
If you can use the development version of scikit-learn (v0.20.dev0), there is a CategoricalEncoder in it that does exactly what you want.
Example:
from sklearn.preprocessing import CategoricalEncoder
enc = CategoricalEncoder(handle_unknown='ignore')
X = pd.read_csv('file.csv')
enc.fit(X)
enc.categories_
# Output:
# [array(['sedan', 'couple'], dtype=object),
#  array(['bmw', 'audi'], dtype=object),
#  array([2012, 2014], dtype=object),
#  ...
If you are unable to use that and want to use the current stable version (<=0.19.1), then you have to use a combination of LabelEncoder + OneHotEncoder to do the same.
The above two work well where you have data split into train and test already.
But if you have all the data at once, then the recommended way is to use get_dummies() from pandas, after which you can split the data into train and test.
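A rough sketch of the get_dummies() route, assuming all rows are available in one frame before splitting. The frame all_df, its values, and the split point are invented for illustration.

```python
import pandas as pd

# Hypothetical combined data (train and test rows together).
all_df = pd.DataFrame({
    "type": ["sedan", "couple", "sedan", "couple"],
    "manufacturer": ["bmw", "audi", "audi", "bmw"],
    "year": [2012, 2014, 2014, 2012],
    "num_doors": [4, 2, 4, 2],
})

# By default get_dummies() only encodes object/category columns;
# passing columns= forces the integer columns to be encoded as well.
encoded = pd.get_dummies(all_df, columns=list(all_df.columns))

# Split afterwards so both parts share exactly the same dummy columns.
train_enc = encoded.iloc[:3]
test_enc = encoded.iloc[3:]
```

Encoding before splitting is what guarantees matching columns; calling get_dummies() separately on each part can give different columns if a value is missing from one of them.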
Update:
Apparently, CategoricalEncoder has been removed from scikit-learn and OneHotEncoder has been given those capabilities. So in the current dev version, OneHotEncoder can do string to one-hot encoding directly, without using LabelEncoder.
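For what it's worth, here is a minimal sketch of that OneHotEncoder usage, assuming a scikit-learn version (0.20 or later) where it accepts string columns. The two frames and the unseen "suv" value are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data in the shape described in the question.
train_df = pd.DataFrame({
    "type": ["sedan", "couple"],
    "manufacturer": ["bmw", "audi"],
    "year": [2012, 2014],
    "num_doors": [4, 2],
})
test_df = pd.DataFrame({
    "type": ["couple", "suv"],  # "suv" never appears in train_df
    "manufacturer": ["bmw", "audi"],
    "year": [2014, 2012],
    "num_doors": [2, 4],
})

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")

# Fit on the training data only, then reuse the fitted encoder on test.
X_train = enc.fit_transform(train_df)  # sparse matrix by default
X_test = enc.transform(test_df)

# Convert to a dense array if you want to inspect the result.
X_test_dense = X_test.toarray()
```

Here enc.categories_ holds the per-column categories learned from train_df, and the row containing "suv" simply has zeros in both "type" columns.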
Do you know what columns are categorical to begin with?
– coldspeed
Jan 16 at 2:07