Encode Categorical Features
In machine learning you often have to convert categorical features to numerical ones, since most machine learning algorithms only work with numerical values.
When the data lives in a pandas DataFrame, we encode the categorical features (read: columns) of that DataFrame. For the conversion you can use pandas and scikit-learn methods:
Table of Contents:
- The get_dummies method
- LabelEncoder method
- OneHotEncoder method
- OrdinalEncoder method
- DictVectorizer method
All mentioned scikit-learn methods are called transformers.
Let’s use a subset of the Kaggle wine-reviews dataset:
import pandas as pd
import numpy as np
import io
text=u""" points price country region_1 variety winery
0 96 235.0 US Napa Valley Cabernet Sauvignon Heitz
1 96 110.0 Spain Toro Tinta de Toro Bodega Carmen Rodríguez
2 96 90.0 US Knights Valley Sauvignon Blanc Macauley
3 96 65.0 US Willamette Valley Pinot Noir Ponzi
4 95 66.0 France Bandol Provence red blend Domaine de la Bégude
5 95 73.0 Spain Toro Tinta de Toro Numanthia
6 95 65.0 Spain Toro Tinta de Toro Maurodos
7 95 110.0 Spain Toro Tinta de Toro Bodega Carmen Rodríguez
8 95 65.0 US Chehalem Mountains Pinot Noir Bergström
9 95 60.0 US Sonoma Coast Pinot Noir Blue Farm
10 95 80.0 Italy Collio Friulano Borgo del Tiglio
11 95 48.0 US Ribbon Ridge Pinot Noir Patricia Green Cellars
12 95 48.0 US Dundee Hills Pinot Noir Patricia Green Cellars
13 95 90.0 France Madiran Tannat Vignobles Brumont
14 95 185.0 US Dundee Hills Pinot Noir Domaine Serene"""
df = pd.read_csv(io.StringIO(text),
                 sep=r'\t',
                 engine='python',
                 encoding='utf8')
print(df)
First we identify categorical features (columns).
cc = df.columns[df.dtypes==object].tolist()
cc
Output:
['country', 'region_1', 'variety', 'winery']
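For reference, select_dtypes gives the same list and may read a bit more clearly:
# equivalent way to list the categorical (object dtype) columns
df.select_dtypes(include='object').columns.tolist()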
Remove the NaN values first
Machine learning algorithms generally cannot handle NaN values, so you have to deal with them first. One option is to remove the rows that contain NaN.
Here we first convert the string 'NaN' to np.nan and then use dropna to drop those rows.
df.replace('NaN', np.nan, inplace=True, regex=True)
# print(df.columns)
df.dropna(inplace=True, subset=['points','price','region_1','variety','winery'])
df
Instead of setting the subset parameter to all the columns (df.columns), we may list only the columns we will actually use later.
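For example, assuming we only needed the country and variety columns later on, the call could be reduced to:
# keep rows only where the columns we actually use are present
df = df.dropna(subset=['country', 'variety'])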
The get_dummies method
We will use get_dummies on the whole DataFrame. We can specify which columns to encode, the default prefix separator (prefix_sep) is '_', and drop_first=True is a smart choice with most machine learning algorithms because it removes the redundant first dummy column.
Example:
X = pd.get_dummies(df, columns=cc, prefix_sep='_', drop_first=True)
X.shape, X.columns
Output:
((12, 32),
Index(['points', 'price', 'region_1_Carneros ', 'region_1_Central Coast ',
'region_1_Columbia Valley (WA) ',
'region_1_Conegliano Valdobbiadene Prosecco Superiore ',
'region_1_Livermore Valley ', 'region_1_Mendoza ',
'region_1_Paso Robles ', 'region_1_Puget Sound ',
'region_1_Sonoma Coast ', 'variety_Chardonnay ', 'variety_Glera ',
'variety_Malbec ', 'variety_Merlot ', 'variety_Pinot Noir ',
'variety_Red Blend ', 'variety_Riesling ', 'variety_Sangiovese Grosso ',
'variety_Siegerrebe ', 'variety_Syrah ', 'winery_Burt Street Cellars',
'winery_Calcareous', 'winery_Casisano Colombaio',
'winery_Familia Los Agüeros', 'winery_Mahoney', 'winery_Merryvale',
'winery_Ruggeri & C.', 'winery_Sineann', 'winery_Vihuela',
'winery_Wente', 'winery_Whidbey Island Winery'],
dtype='object'))
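One caveat with get_dummies: encoding a second DataFrame (say, a test set) can produce a different set of dummy columns if it contains different categories. A minimal sketch of a common workaround, using hypothetical df_train and df_test frames, is to reindex the second result to the columns of the first:
# df_train / df_test are hypothetical training and test subsets
X_train = pd.get_dummies(df_train, columns=cc, drop_first=True)
X_test = pd.get_dummies(df_test, columns=cc, drop_first=True)
# align the test dummies to the training columns, filling missing ones with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)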
LabelEncoder method
Example:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df[cc] = df[cc].apply(lambda c: le.fit_transform(c))
df
Output:
points price country region_1 variety winery
0 96 235.0 3 6 0 6
1 96 110.0 2 9 6 2
2 96 90.0 3 4 4 7
3 96 65.0 3 10 2 11
4 95 66.0 0 0 3 5
5 95 73.0 2 9 6 9
6 95 65.0 2 9 6 8
7 95 110.0 2 9 6 2
8 95 65.0 3 1 2 0
9 95 60.0 3 8 2 1
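Note that the lambda above reuses a single LabelEncoder instance, so le ends up fitted only on the last column. If you later need inverse_transform per column, an alternative sketch is to keep a separate encoder for every column (the encoders dict below is our own helper, not part of scikit-learn):
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
# one LabelEncoder per categorical column
encoders = defaultdict(LabelEncoder)
df[cc] = df[cc].apply(lambda c: encoders[c.name].fit_transform(c))
# recover the original labels of, for example, the country column
encoders['country'].inverse_transform(df['country'])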
LabelEncoder works, but for columns with more than two distinct values it may not give optimal results for machine learning, because it introduces an artificial ordering: a category encoded as 4 looks "greater than" one encoded as 2 or 0. This spurious ordinal relation can hurt model performance.
The documentation says this transformer should be used to encode target values y, not the input X.
It may still be used on X if we insist, although better methods exist.
This is why we use OneHotEncoder.
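For completeness, the documented use on a target could look like this minimal sketch with a made-up string target:
from sklearn.preprocessing import LabelEncoder
# hypothetical target labels, just for illustration
y_raw = ['red', 'white', 'red', 'rose']
le = LabelEncoder()
le.fit_transform(y_raw)   # array([0, 2, 0, 1])
le.classes_               # array(['red', 'rose', 'white'], dtype='<U5')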
OneHotEncoder method
OneHotEncoder works almost exactly like the get_dummies function.
Example:
import pandas as pd
import numpy as np
import io
s="""points price country region_1 variety winery
0 96 235.0 US Napa Valley Cabernet Sauvignon Heitz
1 96 110.0 Spain Toro Tinta de Toro Bodega Carmen Rodríguez
2 96 90.0 US Knights Valley Sauvignon Blanc Macauley
3 96 65.0 US Willamette Valley Pinot Noir Ponzi
4 95 66.0 France Bandol Provence red blend Domaine de la Bégude
5 95 73.0 Spain Toro Tinta de Toro Numanthia"""
df = pd.read_csv(io.StringIO(s),
                 sep=r'\t',
                 engine='python',
                 encoding='utf8')
from sklearn.preprocessing import OneHotEncoder
# note: scikit-learn >= 1.2 renamed the sparse parameter to sparse_output
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(df[['country']]), ohe.categories_
Output:
(array([[0., 0., 1.],
[0., 1., 0.],
[0., 0., 1.],
[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]]), [array(['France ', 'Spain ', 'US '], dtype=object)])
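If the encoder can meet categories at transform time that it did not see during fit, handle_unknown='ignore' encodes them as an all-zero row instead of raising an error; a minimal sketch:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit(df[['country']])
# Italy was not seen during fit, so it becomes [0., 0., 0.]
ohe.transform(pd.DataFrame({'country': ['Italy']}))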
You can encode multiple columns at once. Note that we select only the categorical columns, because fitting the encoder on the whole DataFrame would one-hot encode the numeric columns as well:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(df[['country','winery']]), ohe.categories_
Output:
(array([[0., 0., 1., 0., 0., 1., 0., 0., 0.],
[0., 1., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 1.],
[1., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 1., 0.]]),
[array(['France ', 'Spain ', 'US '], dtype=object),
array(['Bodega Carmen Rodríguez', 'Domaine de la Bégude', 'Heitz',
'Macauley', 'Numanthia', 'Ponzi'], dtype=object)])
Using make_column_transformer you can automate the process:
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (OneHotEncoder(sparse=False), ['country', 'winery']),
    remainder='drop')
ct.fit_transform(df)
Output:
array([[0., 0., 1., 0., 0., 1., 0., 0., 0.],
[0., 1., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 1.],
[1., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 1., 0.]])
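To see which original category each output column corresponds to, newer scikit-learn versions expose get_feature_names_out on the fitted transformer (older versions used get_feature_names):
# names of the output columns produced by the column transformer
ct.get_feature_names_out()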
If we want to keep all the non-categorical columns and convert all the categorical ones, we can use the cc list (the list of categorical columns) together with remainder='passthrough':
from sklearn.compose import make_column_transformer
cc = df.columns[df.dtypes==object].tolist()
ct = make_column_transformer(
    (OneHotEncoder(sparse=True), cc),
    remainder='passthrough')
ct.fit_transform(df)
And of course we can keep the output sparse (sparse=True), which saves memory when there are many categories.
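In practice the column transformer usually goes straight into a pipeline with an estimator, so the encoding is refit automatically on whatever data is passed to fit. A minimal sketch, with LogisticRegression purely as a placeholder model and points as a placeholder target:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
model = make_pipeline(ct, LogisticRegression(max_iter=1000))
# X holds the raw (unencoded) columns; the pipeline applies the encoding before fitting
model.fit(df.drop(columns=['points']), df['points'])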
OrdinalEncoder method
This encoder suits categorical features with an inherent order (gradation), such as education level. It also works on numeric columns such as price:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit_transform(df[['price']]), oe.categories_
Output:
(array([[5.],
[4.],
[3.],
[0.],
[1.],
[2.]]), [array([ 65., 66., 73., 90., 110., 235.])])
If you compare with the original data, you can see the prices were ranked in ascending order:
points price country
0 96 235.0 US
1 96 110.0 Spain
2 96 90.0 US
3 96 65.0 US
4 95 66.0 France
5 95 73.0 Spain
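When the categories have a meaningful order that is not alphabetical, you can pass it explicitly via the categories parameter; a minimal sketch with a hypothetical education column:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
edu = pd.DataFrame({'education': ['high school', 'PhD', 'bachelor', 'master']})
oe = OrdinalEncoder(categories=[['high school', 'bachelor', 'master', 'PhD']])
oe.fit_transform(edu)   # array([[0.], [3.], [1.], [2.]])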
DictVectorizer method
DictVectorizer needs to be applied to a list of dicts (one dict per row), so we first convert the DataFrame with to_dict(orient='records').
This extra conversion may not be optimal.
Example:
d = df.to_dict(orient='records')
d
Output:
[{'points ': 96,
'price': 235.0,
'country': 'US ',
'region_1': 'Napa Valley ',
'variety': 'Cabernet Sauvignon ',
'winery': 'Heitz'},
{'points ': 96,
'price': 110.0,
'country': 'Spain ',
'region_1': 'Toro ',
'variety': 'Tinta de Toro ',
'winery': 'Bodega Carmen Rodríguez'},
{'points ': 96,
'price': 90.0,
'country': 'US ',
'region_1': 'Knights Valley ',
'variety': 'Sauvignon Blanc ',
'winery': 'Macauley'},
{'points ': 96,
'price': 65.0,
'country': 'US ',
'region_1': 'Willamette Valley ',
'variety': 'Pinot Noir ',
'winery': 'Ponzi'},
{'points ': 95,
'price': 66.0,
'country': 'France ',
'region_1': 'Bandol ',
'variety': 'Provence red blend ',
'winery': 'Domaine de la Bégude'},
{'points ': 95,
'price': 73.0,
'country': 'Spain ',
'region_1': 'Toro ',
'variety': 'Tinta de Toro ',
'winery': 'Numanthia'}]
Next:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
e = dv.fit_transform(d)
e
Output:
array([[ 0., 0., 1., 96., 235., 0., 0., 1., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 1., 0., 96., 110., 0., 0., 0., 1., 0., 0.,
0., 0., 0., 1., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 96., 90., 0., 1., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 1., 96., 65., 0., 0., 0., 0., 1., 0.,
1., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 1., 0., 0., 95., 66., 1., 0., 0., 0., 0., 0.,
0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 1., 0., 95., 73., 0., 0., 0., 1., 0., 0.,
0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
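The columns of this array can be mapped back to feature names through the vectorizer, via the feature_names_ attribute (or get_feature_names_out in newer scikit-learn versions):
# e.g. ['country=France ', 'country=Spain ', 'country=US ', 'points ', 'price', ...]
dv.feature_names_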
…
tags: encoder - pandas - sklearn & category: machine-learning