Encoding string features in pandas

I have a dataframe like the following:

train_df

    type    manufacturer  year  num_doors
    sedan   bmw           2012  4
    couple  audi          2014  2
    (and so on)

and test_df in similar format
All the features are categorical features (some string, some int) and I want to encode them as categorical variables.

What's a good way to handle these categorical variables in pandas/sklearn?
Also, once the transformation has been fitted on train_df, how can I encode test_df using the same encodings?

Do you know what columns are categorical to begin with?
– coldspeed
Jan 16 at 2:07

pandas.pydata.org/pandas-docs/stable/categorical.html
– John Zwinck
Jan 16 at 2:10

@cᴏʟᴅsᴘᴇᴇᴅ: yeah.. though there are only 10 possible values it can take..
– Fraz
Jan 16 at 2:14

Look up one hot encoding and get_dummies() - both sklearn functions IIRC
– pault
Jan 16 at 2:20

@pault get_dummies() is from pandas, not scikit-learn
– Vivek Kumar
Jan 16 at 5:03

2 Answers
When reading your data, specify dtype to be category to make every single column categorical in nature.

    df = pd.read_csv('file.csv', dtype='category')
    df

         type manufacturer  year num_doors
    0   sedan          bmw  2012         4
    1  couple         audi  2014         2

    df.dtypes

    type            category
    manufacturer    category
    year            category
    num_doors       category
    dtype: object

If you want to convert only a specific subset of columns, something like this would do -

    # The dtype name pandas expects is 'category', not 'categorical'
    f = dict.fromkeys(['type', 'manufacturer', ...], 'category')

Pass f to dtype.

    df = pd.read_csv('file.csv', dtype=f)
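To also cover the second half of the question (reusing the train encodings on test_df), here is a minimal sketch. It assumes an in-memory CSV in place of 'file.csv' and a made-up test row containing an unseen manufacturer ('kia'); neither comes from the original post. The idea is to cast the test columns to the categorical dtypes learned from the training frame.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for 'file.csv'.
csv_text = (
    "type,manufacturer,year,num_doors\n"
    "sedan,bmw,2012,4\n"
    "couple,audi,2014,2\n"
)

# Read only 'type' and 'manufacturer' as categoricals.
f = dict.fromkeys(['type', 'manufacturer'], 'category')
train_df = pd.read_csv(io.StringIO(csv_text), dtype=f)

# Hypothetical test data; 'kia' never appears in the training data.
test_df = pd.DataFrame({'type': ['sedan'], 'manufacturer': ['kia'],
                        'year': [2012], 'num_doors': [4]})

# Cast each test column to the dtype learned from train_df,
# so both frames share the same category-to-code mapping.
for col in f:
    test_df[col] = test_df[col].astype(train_df[col].dtype)

# Integer codes; a value missing from the train categories becomes NaN (code -1).
test_type_code = test_df['type'].cat.codes.iloc[0]
test_manufacturer_code = test_df['manufacturer'].cat.codes.iloc[0]
```

Values that were not present in the training data end up as missing, so it is worth deciding up front how such rows should be handled.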

There are multiple ways to achieve this:

If you can use the development version of scikit-learn (v0.20.dev0), there is a CategoricalEncoder in it that does exactly what you want.

Example:

    import pandas as pd
    from sklearn.preprocessing import CategoricalEncoder

    enc = CategoricalEncoder(handle_unknown='ignore')
    X = pd.read_csv('file.csv')
    enc.fit(X)
    enc.categories_

    # Output:
    # [array(['sedan', 'couple'], dtype=object),
    #  array(['bmw', 'audi'], dtype=object),
    #  array([2012, 2014], dtype=object),
    #  ...]

If you are unable to use that and want to use the current stable version (<=0.19.1), then you have to use a combination of LabelEncoder + OneHotEncoder to do the same.
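A rough sketch of that two-step combination is below. The small train and test frames are made up for illustration, and `to_codes` is a hypothetical helper, not part of scikit-learn. The sparse result is converted with `.toarray()` so the snippet does not depend on the encoder's dense-output option.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical data, already split into train and test.
train_df = pd.DataFrame({'type': ['sedan', 'couple', 'sedan'],
                         'manufacturer': ['bmw', 'audi', 'audi']})
test_df = pd.DataFrame({'type': ['couple'], 'manufacturer': ['bmw']})

# Step 1: one LabelEncoder per column maps each string to an integer code.
label_encoders = {col: LabelEncoder().fit(train_df[col])
                  for col in train_df.columns}

def to_codes(df):
    """Hypothetical helper: apply the fitted LabelEncoders column by column."""
    return np.column_stack([label_encoders[col].transform(df[col])
                            for col in df.columns])

# Step 2: OneHotEncoder expands the integer codes into indicator columns.
onehot = OneHotEncoder()
train_encoded = onehot.fit_transform(to_codes(train_df)).toarray()
test_encoded = onehot.transform(to_codes(test_df)).toarray()
```

Note that LabelEncoder raises an error for labels it did not see during fit, so this sketch only works when every test value also appears in the training data.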

The above two work well where you have data split into train and test already.

But if you have all the data at once, then the recommended way is to use get_dummies() from pandas, after which you can split the data into train and test.
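A minimal sketch of the get_dummies() route, using made-up rows shaped like the question's data. The 'train'/'test' keys are only there to make the later split easy and are an assumption of this example.

```python
import pandas as pd

# Hypothetical rows in the same format as the question.
train_df = pd.DataFrame({'type': ['sedan', 'couple'],
                         'manufacturer': ['bmw', 'audi'],
                         'year': [2012, 2014],
                         'num_doors': [4, 2]})
test_df = pd.DataFrame({'type': ['sedan'], 'manufacturer': ['audi'],
                        'year': [2014], 'num_doors': [4]})

# Stack everything so get_dummies() sees every category once.
full = pd.concat([train_df, test_df], keys=['train', 'test'])

# Passing columns= makes the integer columns categorical too;
# by default get_dummies() only encodes object/category columns.
encoded = pd.get_dummies(full, columns=list(full.columns))

# Split back into train and test; both now share the same columns.
train_encoded = encoded.loc['train']
test_encoded = encoded.loc['test']
```

Because the encoding is derived from all rows at once, train and test are guaranteed to have identical columns, which is what makes this approach convenient when the data has not been split yet.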

Update:

Apparently, CategoricalEncoder has been removed from scikit-learn and OneHotEncoder has been given those capabilities. So in the current dev version, OneHotEncoder can do string to one-hot encoding directly, without using LabelEncoder.
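As a sketch of what that looks like, assuming a scikit-learn release where OneHotEncoder accepts string columns, and using made-up train and test frames (the 'kia' value is deliberately absent from the training data):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data in the same format as the question.
train_df = pd.DataFrame({'type': ['sedan', 'couple'],
                         'manufacturer': ['bmw', 'audi'],
                         'year': [2012, 2014],
                         'num_doors': [4, 2]})
test_df = pd.DataFrame({'type': ['sedan'], 'manufacturer': ['kia'],
                        'year': [2012], 'num_doors': [4]})

# Fit on the training data only; unknown test values are encoded as all zeros.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_df)

# The same fitted encoder is reused for the test data.
train_encoded = enc.transform(train_df).toarray()
test_encoded = enc.transform(test_df).toarray()
```

With handle_unknown='ignore', the unseen manufacturer produces zeros in both manufacturer columns instead of raising an error.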
