Scikit-learn (sklearn) examples
Table of contents:
- What can we do with Scikit-learn?
- Installation
- Dependencies
- Mission as a reference
- Splitting the train and test set
- Datasets
- Estimators
- Pipeline
- Metrics
- Models
- Cross Validation
- Decision Trees
- Ensembles
What can we do with Scikit-learn?
Scikit-learn (also known as sklearn) is the first association for “Machine Learning in Python”. The package helps you solve and analyze classification, regression, and clustering problems. It includes SVMs, decision trees, random forests, gradient boosting, k-means, KNN, and other algorithms, and it uses NumPy and SciPy as its core dependencies.
Installation
According to the official website, the correct way to install it is:
pip install -U scikit-learn
and the correct way to import is:
import sklearn
The names scikit-learn and sklearn refer to the same package.
Dependencies
When I executed:
pip show scikit-learn
I got the following feedback:
Name: scikit-learn
Version: 0.22.2.post1
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: None
Author-email: None
License: new BSD
Location: /usr/local/lib/python3.6/dist-packages
Requires: numpy, scipy, joblib
Required-by: yellowbrick, umap-learn, textgenrnn, sklearn, sklearn-pandas, mlxtend, lucid, lightgbm, librosa, imbalanced-learn, fancyimpute
You can also check:
sklearn.show_versions()
Output I got:
System:
python: 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
executable: /usr/bin/python3
machine: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
pip: 19.3.1
setuptools: 46.1.3
sklearn: 0.22.2.post1
numpy: 1.18.3
scipy: 1.4.1
Cython: 0.29.16
pandas: 1.0.3
matplotlib: 3.2.1
joblib: 0.14.1
Built with OpenMP: True
Mission as a reference
Scikit-learn's mission is to provide simple, efficient, well-documented, and easy-to-use solutions to machine learning problems.
You may treat scikit-learn as a reference for machine learning models, estimators, and terminology.
For instance, you can split the train and test sets in PyTorch or TensorFlow, but it can also be done with sklearn.model_selection.train_test_split.
Splitting the train and test set
Here is a typical example with n samples, f features, and a single target y.
Example:
import numpy as np
from sklearn.model_selection import train_test_split
np.random.seed(13)
n=100 # number of samples
f=10 # number of features
X,y = np.random.rand(f*n).reshape((n,f)), np.random.rand(n)
# X,y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=13)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
Output:
(67, 10) (67,)
(33, 10) (33,)
Datasets
Scikit-learn comes with several predefined datasets that you can load directly.
Example with the wine dataset:
from sklearn.datasets import load_wine
data = load_wine()
data.target[[10, 80, 140]]
list(data.target_names)
Output:
['class_0', 'class_1', 'class_2']
Example Boston dataset 1:
from sklearn.datasets import load_boston
X,y = load_boston(return_X_y=True)
X,y
Here you get the features X and the target y.
Example Boston dataset 2:
from sklearn.datasets import load_boston
data = load_boston(return_X_y=False)
data['data'], data['feature_names'], data['DESCR']
Here you get a Bunch object with the data, the feature names, and the dataset description.
There are also dataset generators, such as make_classification:
Example:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10, n_features=4,n_informative=2, n_redundant=0, random_state=0, shuffle=False)
X,y
Output:
(array([[-2.24080141, 0.50054121, -0.34791215, 0.15634897],
[-1.60020786, -0.51203824, 1.23029068, 1.20237985],
[-2.3748206 , 0.82768797, -0.38732682, -0.30230275],
[-1.43411205, 1.50037656, -1.04855297, -1.42001794],
[-0.98740462, 0.99958558, -1.70627019, 1.9507754 ],
[-1.19536732, 1.27804656, -0.50965218, -0.4380743 ],
[ 0.29484027, -0.79249401, -1.25279536, 0.77749036],
[ 0.6476239 , -0.81753611, -1.61389785, -0.21274028],
[ 2.38363567, 1.56914464, -0.89546656, 0.3869025 ],
[ 1.51783379, 1.22140561, -0.51080514, -1.18063218]]),
array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1]))
Estimators
Estimator is the scikit-learn term for a model. An estimator is an object that learns from data through its fit method. It can be a classifier, a regressor, or a clusterer, or it can perform a transform operation on the data (for example, extracting some columns).
A special case of estimator is called a transform estimator, or transformer, if it supports the transform or fit_transform method.
An estimator may take parameters. The base class for all estimators is BaseEstimator.
An estimator that takes another estimator as a parameter is called a meta-estimator.
One such meta-estimator is pipeline.Pipeline.
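As a minimal sketch of these conventions (the ColumnSelector class and its columns parameter below are made up for illustration, not part of scikit-learn), a custom transformer only needs to inherit from BaseEstimator and TransformerMixin and implement fit and transform:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
class ColumnSelector(BaseEstimator, TransformerMixin):
    """Toy transformer that keeps only the listed columns."""
    def __init__(self, columns=(0,)):
        self.columns = columns  # parameters are stored as given
    def fit(self, X, y=None):
        return self  # nothing to learn in this toy example
    def transform(self, X):
        return np.asarray(X)[:, list(self.columns)]
X = np.arange(12).reshape(4, 3)
ColumnSelector(columns=(0, 2)).fit_transform(X)
Because it inherits from TransformerMixin, fit_transform comes for free, and BaseEstimator provides get_params/set_params so the object plays well with pipelines and grid search.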
Pipeline
To combine multiple estimators into one, use a pipeline. A Pipeline is built from a list of (name, estimator) pairs.
For example:
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
With a pipeline you modify only the features X (the data); the target y should not be modified by the pipeline.
For example, feature selection, normalization, and classification may be combined in one pipeline.
All estimators in a pipeline, except the last one, must have a transform method (i.e., be transformers).
Check the type of the last estimator with its _estimator_type attribute, which is a string:
pipe[-1]._estimator_type
Transform estimators do not have the _estimator_type attribute.
make_pipeline is a shortcut that creates a pipeline and fills in the step names automatically.
Example:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer
p = make_pipeline(Binarizer(), MultinomialNB())
p
Output:
Pipeline(memory=None,
steps=[('binarizer', Binarizer(copy=True, threshold=0.0)),
('multinomialnb',
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
verbose=False)
You index into the pipeline's estimators with:
p[0]
Output:
Binarizer(copy=True, threshold=0.0)
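A pipeline behaves like a single estimator: fit trains every step in order, and predict/score push the data through the same steps. Here is a short sketch of the scaling + feature selection + classification combination mentioned above (the iris dataset and the particular steps are my own choice for illustration):
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=2), SVC())
pipe.fit(X, y)    # fits the scaler and the selector, then the SVC
pipe.score(X, y)  # transforms X with the first two steps, then scores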
Metrics
Scikit-learn provides the following groups of metrics:
- Classification metrics
- Regression metrics
- Multilabel ranking metrics
- Clustering metrics
- Biclustering metrics
- Pairwise metrics
Here we briefly cover common examples of classification and regression metrics.
Classification metrics
f1_score
F1 is an example of a classification metric. It considers both the precision p and the recall r, and it works for binary classification as well as multiclass/multilabel targets.
Calculation:
\[F_1 = \frac{2pr}{p + r}\]
Example:
from sklearn.metrics import f1_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
ma = f1_score(y_true, y_pred, average='macro')
mi = f1_score(y_true, y_pred, average='micro')
we = f1_score(y_true, y_pred, average='weighted')
no = f1_score(y_true, y_pred, average=None)
ma,mi,we,no
Output:
(0.26666666666666666,
0.3333333333333333,
0.26666666666666666,
array([0.8, 0. , 0. ]))
The average parameter is 'binary' by default. The other options are:
- micro
- macro
- weighted
- samples and
- None
accuracy_score
Example:
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3] # prediction
y_true = [0, 1, 2, 3] # true
print(accuracy_score(y_true, y_pred))
print(accuracy_score(y_true, y_pred, normalize=False))
Output:
0.5
2
Regression metrics
mean_absolute_error
Mean absolute error regression loss
\[\text{MAE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \left| y_i - \hat{y}_i \right|\]
Example:
from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, -0.4, 2, 8]
r1 = mean_absolute_error(y_true, y_pred)
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
r2 = mean_absolute_error(y_true, y_pred)
r3 = mean_absolute_error(y_true, y_pred, multioutput='raw_values')
r4 = mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])
r1,r2,r3,r4
Output:
(0.4, 0.75, array([0.5, 1. ]), 0.85)
mean_squared_error
\[\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2\]
Example:
from sklearn.metrics import mean_squared_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
r1 = mean_squared_error(y_true, y_pred)
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
r2 = mean_squared_error(y_true, y_pred)
r1,r2
Output:
(0.375, 0.7083333333333334)
mean_squared_log_error
The mean_squared_log_error function computes a risk metric corresponding to the expected value of the squared logarithmic (quadratic) error or loss.
\[\text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (\log_e (1 + y_i) - \log_e (1 + \hat{y}_i) )^2\]
Example:
from sklearn.metrics import mean_squared_log_error
y_true = [3, 5, 2.5, 7]
y_pred = [2.5, 5, 4, 8]
r1 = mean_squared_log_error(y_true, y_pred)
y_true = [[0.5, 1], [1, 2], [7, 6]]
y_pred = [[0.5, 2], [1, 2.5], [8, 8]]
r2 = mean_squared_log_error(y_true, y_pred)
r1,r2
Output:
(0.039, 0.044)
Models
Commonly used models include LogisticRegression, which despite its name is a classifier, and LinearRegression, which is a regression model.
Example:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X,y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, solver='liblinear', multi_class='auto').fit(X, y)
r1 = clf.predict(X[:2, :])
r2 = clf.predict_proba(X[:2, :])
r3 = clf.score(X, y)
r1,r2,r3
Output:
(array([0, 0]), array([[8.78030305e-01, 1.21958900e-01, 1.07949250e-05],
[7.97058292e-01, 2.02911413e-01, 3.02949242e-05]]), 0.96)
Example:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
# y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)
r1 = reg.score(X, y)
r2 = reg.coef_
r3 = reg.intercept_
r4 = reg.predict(np.array([[3, 5]]))
r1,r2,r3,r4
Output:
(1.0, array([1., 2.]), 3.0000000000000018, array([16.]))
Cross Validation
Once we have a prediction model, we need to evaluate its accuracy on unseen data. While training, we don't have unseen data in advance, so it is common practice to hold out part of the available data as a test set.
This can be done with cross validation applied together with the prediction model.
Here are some common imports for cross validation:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_validate
Example:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
print(cross_val_score(lasso, X, y, cv=3))
Output:
[0.33150734 0.08022311 0.03531764]
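cross_validate works like cross_val_score but returns a dictionary with fit times, score times, and test scores (plus train scores on request). A sketch on the same diabetes subset:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate
diabetes = datasets.load_diabetes()
X, y = diabetes.data[:150], diabetes.target[:150]
lasso = linear_model.Lasso()
cv_results = cross_validate(lasso, X, y, cv=3, return_train_score=True)
sorted(cv_results.keys())  # fit_time, score_time, test_score, train_score
print(cv_results['test_score'])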
Decision Trees
The decision tree estimators you will use most often are:
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
Example:
import numpy as np
from matplotlib import pyplot as plt
np.random.seed(13)
train_data = np.random.normal(size=(100, 2))
train_labels = np.zeros(100)
# adding second class
train_data = np.r_[train_data, np.random.normal(size=(100, 2), loc=2)]
train_labels = np.r_[train_labels, np.ones(100)]
from sklearn.tree import DecisionTreeClassifier
r = 150 # resolution
x = np.linspace(2*train_data[:, 0].min(), train_data[:, 0].max()+1, num=r)
y = np.linspace(2*train_data[:, 1].min(), train_data[:, 1].max()+1, num=r)
xx,yy = np.meshgrid(x, y)
clf_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2) # gini
# training = fitting
clf_tree.fit(train_data, train_labels)
# predicting (with the resolution shape of rXr)
predicted = clf_tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(10,8), dpi=140)
plt.pcolormesh(xx, yy, predicted, cmap='cool')
plt.scatter(train_data[:, 0], train_data[:, 1], c=train_labels, s=100, alpha=0.8,
edgecolors='black', cmap='cool')
Output: a plot of the training points colored by class over the decision regions predicted by the tree.
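DecisionTreeRegressor, imported above, follows the same fit/predict interface but predicts continuous values. A sketch on synthetic data (the sine target is just an illustration):
import numpy as np
from sklearn.tree import DecisionTreeRegressor
np.random.seed(13)
X = np.sort(5 * np.random.rand(80, 1), axis=0)  # a single feature
y = np.sin(X).ravel()                           # continuous target
reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
reg.predict([[1.5], [4.0]])  # piecewise-constant predictions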
Ensembles
Random Forests
The following random forest estimators are frequently used:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
Example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=20)
clf.fit(X, y)
print(clf.feature_importances_)
print(clf.predict([[0, 0, 0, 0]]))
Output:
[0.15733456 0.79341435 0.0222356 0.02701549]
[1]
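RandomForestRegressor works the same way for regression targets; a sketch with a generated regression dataset (make_regression and the parameter values here are only for illustration):
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=4, n_informative=2, random_state=0, shuffle=False)
regr = RandomForestRegressor(max_depth=2, random_state=0, n_estimators=20)
regr.fit(X, y)
print(regr.feature_importances_)
print(regr.predict([[0, 0, 0, 0]]))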
AdaBoost
The core principle of AdaBoost is to fit a sequence of weak learners. Weak learners are models that are slightly better than random guessing (typically small decision trees).
The predictions from the weak learners are combined through a weighted sum to produce the final prediction.
Often in use:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import AdaBoostRegressor
Example:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
X, y = load_iris(return_X_y=True)
clf = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(clf, X, y, cv=5)
scores.mean()
Output:
0.9
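To make the weak-learner idea explicit, you can pass the base estimator yourself; a depth-1 decision tree (a decision stump) is the usual choice. In the scikit-learn version shown above the parameter is called base_estimator (newer releases rename it to estimator):
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1)  # a weak learner
clf = AdaBoostClassifier(base_estimator=stump, n_estimators=100)
cross_val_score(clf, X, y, cv=5).mean()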
GradientBoosting
Gradient boosting can be used for both regression and classification problems in a variety of areas, including web search ranking.
It is an iterative technique in which each new estimator is fit to correct the errors of the previously built ensemble.
Often in use:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
Example:
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)
Output:
0.913
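The iterative nature can be inspected with staged_predict, which yields the ensemble's prediction after each boosting step. A sketch on the same Hastie data (the chosen stages are arbitrary):
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0).fit(X_train, y_train)
staged = list(clf.staged_predict(X_test))  # predictions after 1..100 iterations
[accuracy_score(y_test, s) for s in (staged[0], staged[9], staged[99])]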
Bagging
Bagging stands for bootstrap aggregating.
Bagging is a technique that generates additional training sets by resampling the original dataset with replacement, which should decrease the variance of the predictions.
The mathematical equivalent is combinations with repetition.
It is usually applied to decision trees, but it can be used with any type of method (a tree-based version is sketched after the example below).
Often in use:
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import BaggingRegressor
Example:
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=4,
n_informative=2, n_redundant=0,
random_state=0, shuffle=False)
clf = BaggingClassifier(base_estimator=SVC(),
n_estimators=10, random_state=0).fit(X, y)
clf.predict([[0, 0, 0, 0]])
Output:
array([1])
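Since bagging is most often paired with decision trees, here is the same example with a tree as the base estimator (in this scikit-learn version a decision tree is already the default when base_estimator is not given, so passing it explicitly is only for clarity):
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                        n_estimators=10, random_state=0).fit(X, y)
clf.predict([[0, 0, 0, 0]])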
…
tags: examples - data analysis - machine learning & category: python