Encoding string features in pandas




I have a dataframe like the following:


train_df
'type', 'manufacturer', 'year', 'num_doors'
sedan, bmw, 2012, 4
couple, audi, 2014, 2
and so on



and test_df is in a similar format.
All the features are categorical (some are strings, some are ints) and I want to encode them as categorical variables.





What's a good way to handle these categorical variables in pandas/sklearn?
Also, once the transformation has been fitted on train_df, how can I encode test_df with the same encodings?





Do you know what columns are categorical to begin with?
– coldspeed
Jan 16 at 2:07





pandas.pydata.org/pandas-docs/stable/categorical.html
– John Zwinck
Jan 16 at 2:10





@cᴏʟᴅsᴘᴇᴇᴅ: yeah.. though there are only 10 possible values it can take..
– Fraz
Jan 16 at 2:14





Look up one hot encoding and get_dummies() - both sklearn functions IIRC
– pault
Jan 16 at 2:20







@pault get_dummies() is from pandas, not scikit-learn
– Vivek Kumar
Jan 16 at 5:03




2 Answers



When reading your data, specify dtype to be category to make every single column categorical in nature.




df = pd.read_csv('file.csv', dtype='category')
df

type manufacturer year num_doors
0 sedan bmw 2012 4
1 couple audi 2014 2




df.dtypes

type category
manufacturer category
year category
num_doors category
dtype: object



If you want to convert only a specific subset of columns, something like this would do -


# Map each chosen column name to the 'category' dtype
f = dict.fromkeys(['type', 'manufacturer', ...], 'category')



Pass f to dtype.




df = pd.read_csv('file.csv', dtype=f)
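
To carry the encodings from train_df over to test_df, one option is to reuse the categories learned on the training columns. The following is only a minimal sketch under assumptions: the two small frames are made-up stand-ins for the real data, and "wagon" is a hypothetical value that never occurs in training. Values unseen in training become NaN and get the code -1.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Made-up stand-ins for the real train/test frames (assumption).
train_df = pd.DataFrame({
    "type": ["sedan", "couple"],
    "manufacturer": ["bmw", "audi"],
})
test_df = pd.DataFrame({
    "type": ["couple", "wagon"],  # "wagon" never appears in train_df
    "manufacturer": ["audi", "bmw"],
})

cat_cols = ["type", "manufacturer"]

for col in cat_cols:
    # Learn the categories from the training column...
    train_df[col] = train_df[col].astype("category")
    # ...and force the test column onto exactly the same categories.
    shared = CategoricalDtype(categories=train_df[col].cat.categories)
    test_df[col] = test_df[col].astype(shared)

# Integer codes are now consistent across both frames.
train_codes = train_df[cat_cols].apply(lambda s: s.cat.codes)
test_codes = test_df[cat_cols].apply(lambda s: s.cat.codes)
print(train_codes)
print(test_codes)
```

Because both frames share one set of categories per column, the same string always maps to the same integer code in train and test.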



There are multiple ways to achieve this:



If you can use the development version of scikit-learn (v0.20.dev0), then there's a CategoricalEncoder available there, which does exactly what you want.



Example:


from sklearn.preprocessing import CategoricalEncoder
enc = CategoricalEncoder(handle_unknown='ignore')
X = pd.read_csv('file.csv')
enc.fit(X)

enc.categories_
# Output:
# [array(['sedan', 'couple'], dtype=object),
#  array(['bmw', 'audi'], dtype=object),
#  array([2012, 2014], dtype=object),
#  ...
#  ...



If you are unable to use that and want to use the current stable version (<=0.19.1), then you have to use a combination of LabelEncoder + OneHotEncoder to do the same.



The above two work well where you have data split into train and test already.



But if you have all the data at once, then the recommended way is to use get_dummies() from pandas, after which you can split the data into train and test.
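
A rough sketch of that get_dummies() route, assuming the whole dataset fits in one frame. The four rows and the 3/1 split point are invented for illustration only.

```python
import pandas as pd

# Invented rows standing in for the full dataset (assumption).
full_df = pd.DataFrame({
    "type": ["sedan", "couple", "sedan", "couple"],
    "manufacturer": ["bmw", "audi", "audi", "bmw"],
    "year": [2012, 2014, 2014, 2012],
    "num_doors": [4, 2, 4, 2],
})

# Passing `columns` makes get_dummies encode the integer columns too;
# by default it would only encode object/category columns.
encoded = pd.get_dummies(full_df, columns=list(full_df.columns))

# Split only after encoding, so both parts share the same dummy columns.
train_enc = encoded.iloc[:3]
test_enc = encoded.iloc[3:]
print(encoded.columns.tolist())
```

Encoding before splitting is what keeps the train and test columns aligned; calling get_dummies() separately on each part could yield different columns.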



Update:



Apparently, CategoricalEncoder has been removed from scikit and OneHotEncoder has been given those capabilities. So in current dev version, OneHotEncoder can do string to one-hot encoding directly, without using LabelEncoder.
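
With a scikit-learn release where OneHotEncoder accepts strings (0.20 or later), fitting on the training frame and reusing the fitted encoder on the test frame might look like the sketch below. The frames are made-up examples, and "wagon" is a hypothetical category that is absent from training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-ins for the real train/test frames (assumption).
train_df = pd.DataFrame({
    "type": ["sedan", "couple"],
    "manufacturer": ["bmw", "audi"],
    "num_doors": [4, 2],
})
test_df = pd.DataFrame({
    "type": ["couple", "wagon"],  # "wagon" was not seen during fit
    "manufacturer": ["audi", "bmw"],
    "num_doors": [2, 4],
})

# handle_unknown="ignore" encodes unseen categories as all zeros
# for that feature instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_df)

# The output is sparse by default; toarray() makes it dense for printing.
train_enc = enc.transform(train_df).toarray()
test_enc = enc.transform(test_df).toarray()
print(enc.categories_)
print(test_enc)
```

The encoder stores the categories it saw during fit in enc.categories_, so the test frame is always encoded against the training vocabulary.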







