sklearn – confusion_matrix

2020-10-29 / tau

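Overview

A confusion matrix lets you examine the accuracy of a machine learning model from several angles. The confusion_matrix function in the sklearn.metrics module performs this tabulation automatically.

Usage

Arguments

confusion_matrix(y_true, y_pred, labels=None, normalize=None)

y_true: Array of ground-truth (correct) target values.
y_pred: Array of predicted target values.
labels: List of target values; specify it to change the order in which the classes appear.
normalize: Returns ratios to a total instead of counts. Specify 'true' for ratios to the totals of the true classes, 'pred' for ratios to the totals of the predicted classes, or 'all' for ratios to the grand total.

Return value

The return value is a 2-D array of shape [n_class, n_class], in which each row corresponds to a true class and each column to a predicted class. The classes are ordered ascending for numbers and lexicographically for strings, and rows and columns use the same order.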
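
The row and column layout can be checked on a small example. The following is a minimal sketch that uses hypothetical three-class toy labels (not the dataset used in this article); each row counts how the samples of one true class were predicted.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, chosen only for illustration
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

mat = confusion_matrix(y_true, y_pred)
print(mat.shape)
# (3, 3)
print(mat)
# [[2 1 0]
#  [0 1 1]
#  [0 0 1]]
```

For instance, the first row shows that of the three samples whose true class is 0, two were predicted as 0 and one as 1.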

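Example

Preparing the data and predicting with a model

We walk through the usage with the Breast Cancer dataset. First, load the cancer data and split it into training and test data. A logistic regression model is fitted and then used to predict the targets of the training data. From here on, the true and predicted targets of the training data are used.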

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

ds = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, stratify=ds.target, random_state=42)

logreg = LogisticRegression(solver='liblinear').fit(X_train, y_train)
y_train_pred = logreg.predict(X_train)

Let us check the contents of the data. Both the true and the predicted data have two classes, 0 and 1, where 0 is defined as malignant and 1 as benign.

np.set_printoptions(threshold=1, edgeitems=3)
print("Target data")
print("Actual data    (size={}):{}".format(y_train.size, y_train))
print("Predicted data (size={}):{}".format(y_train_pred.size, y_train_pred))
print(ds.target_names)

# Actual data    (size=426):[0 1 0 ... 0 0 1]
# Predicted data (size=426):[0 1 0 ... 0 0 1]
# ['malignant' 'benign']

We also create separate arrays in which the numeric 0/1 class labels are replaced by their string names.

y_train_named = np.array([ds.target_names[x] for x in y_train])
y_train_pred_named = np.array([ds.target_names[x] for x in y_train_pred])

print("Actual data    (size={}):{}".format(y_train.size, y_train_named))
print("Predicted data (size={}):{}".format(y_train_pred.size, y_train_pred_named))

# Actual data    (size=426):['malignant' 'benign' 'malignant' ... 'malignant' 'malignant' 'benign']
# Predicted data (size=426):['malignant' 'benign' 'malignant' ... 'malignant' 'malignant' 'benign']

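Basic usage

Getting only the matrix elements

In the basic usage, the true data and the predicted data are passed to confusion_matrix() as array-like collections. Rows and columns are both sorted in ascending order. In the example below, row 1 is true malignant, row 2 is true benign, column 1 is predicted malignant, and column 2 is predicted benign.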

mat = confusion_matrix(y_train, y_train_pred)
print(mat)

# [[148  11]
#  [  9 258]]

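When the classes are represented as strings, they are sorted lexicographically, so both rows and columns appear in the order 'benign', 'malignant'. As a result, both rows and columns are swapped relative to the numeric case.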

mat_named = confusion_matrix(y_train_named, y_train_pred_named)
print(mat_named)

# [[258   9]
#  [ 11 148]]

Changing the order of the elements

The labels argument takes a list that specifies the class order. The examples below override the default ascending order.

print(confusion_matrix(y_train, y_train_pred, labels=[1, 0]))

# [[258   9]
#  [ 11 148]]

print(confusion_matrix(y_train_named, y_train_pred_named,
    labels=['malignant', 'benign']))

# [[148  11]
#  [  9 258]]

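Normalizing the elements: showing ratios

The normalize argument specifies how the ratios to the totals are computed.

With normalize='true', each element is expressed as a ratio to the total of its row (true class). In the example below, every element is divided by its row total, so each row sums to 1.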

mat = confusion_matrix(y_train, y_train_pred, normalize='true')
print(mat)
print(mat.sum(axis=1))

# [[0.93081761 0.06918239]
#  [0.03370787 0.96629213]]
# [1. 1.]

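With normalize='pred', each element is expressed as a ratio to the total of its column (predicted class). In the example below, every element is divided by its column total, so each column sums to 1.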

mat = confusion_matrix(y_train, y_train_pred, normalize='pred')
print(mat)
print(mat.sum(axis=0))

# [[0.94267516 0.04089219]
#  [0.05732484 0.95910781]]
# [1. 1.]

With normalize='all', each element is expressed as a ratio to the sum of all elements. In the example below, the elements sum to 1.

mat = confusion_matrix(y_train, y_train_pred, normalize='all')
print(mat)
print(mat.sum())

# [[0.34741784 0.0258216 ]
#  [0.02112676 0.6056338 ]]
# 1.0
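
The three normalize options amount to dividing the raw counts by row totals, column totals, or the grand total. The sketch below reproduces them by hand with NumPy on hypothetical toy labels (not the article's data) and checks the results against confusion_matrix().

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, chosen so that no row or column total is zero
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

counts = confusion_matrix(y_true, y_pred)

# Divide by row totals (true classes), column totals (predicted classes),
# and the grand total, respectively
by_true = counts / counts.sum(axis=1, keepdims=True)
by_pred = counts / counts.sum(axis=0, keepdims=True)
by_all = counts / counts.sum()

print(np.allclose(by_true, confusion_matrix(y_true, y_pred, normalize='true')))
print(np.allclose(by_pred, confusion_matrix(y_true, y_pred, normalize='pred')))
print(np.allclose(by_all, confusion_matrix(y_true, y_pred, normalize='all')))
```

All three comparisons are expected to print True.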

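Note that with normalize='all', the sum of the diagonal elements is the fraction of correctly classified samples among all samples, and it equals the value returned by the score() method.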

print("Accuracy      :{}".format(mat[0, 0] + mat[1, 1]))
print("Training score:{}".format(logreg.score(X_train, y_train)))

# Accuracy      :0.9530516431924883
# Training score:0.9530516431924883

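Working with a DataFrame

Adding labels

With a pandas DataFrame, the row and column labels are displayed, which makes the matrix easier to read. The row (true) labels are given with index and the column (predicted) labels with columns, both using the same collection.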

mat = confusion_matrix(y_train, y_train_pred)
result_label = ['malignant', 'benign']
df = pd.DataFrame(mat, columns=result_label, index=result_label)
print(df)

#            malignant  benign
# malignant        148      11
# benign             9     258

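Total row and column

Adding row and column totals computed with the DataFrame sum() method makes the table easier to read. With no argument, sum() uses the default axis=0 and returns the per-column totals as a one-dimensional Series. With axis=1, it returns the per-row totals as a one-dimensional Series.

In the example below, the column-wise totals (the total for each predicted class) are first added as the last row. Then the row-wise totals, computed with that new row included (the total for each true class and the grand total), are added as the last column.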

sums_in_col = df.sum()
df.loc['Total'] = sums_in_col

sums_in_row = df.sum(axis=1)
df['Total'] = sums_in_row

print(df)

#            malignant  benign  Total
# malignant        148      11    159
# benign             9     258    267
# Total            157     269    426

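MultiIndex

Using a DataFrame MultiIndex lets the table show which axis is actual and which is predicted, making it clearer still. However, specifying rows, columns, and elements becomes slightly more cumbersome.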

actual_label = ['Actual'] * 2
pred_label = ['Prediction'] * 2
df = pd.DataFrame(mat, columns=[pred_label, result_label], index=[actual_label, result_label])
print(df)

#                  Prediction       
#                   malignant benign
# Actual malignant        148     11
#        benign             9    258

The following example adds the totals in the MultiIndex case.

sums_in_col = df.sum()
df.loc[('Actual', 'Total'), :] = sums_in_col

sums_in_row = df.sum(axis=1)
df[('Prediction', 'Total')] = sums_in_row

df = df.astype('int')
print(df)

#                  Prediction             
#                   malignant benign Total
# Actual malignant        148     11   159
#        benign             9    258   267
#        Total            157    269   426

sklearn.preprocessing

2020-10-09 / tau

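Usage

Some machine learning models, such as neural networks and SVMs, can overfit or lose accuracy when the feature values differ in magnitude or range, so preprocessing is needed to bring the data onto a common scale (see the SVM example and the neural network example).

The scikit-learn preprocessing module provides a variety of classes for preprocessing data. The typical usage is as follows.

Split the data into training data and test data
Pass the training data to the preprocessor's fit() method to prepare the transformation parameters (build the transformation model)
- The fit() method expects a 2-D array in which each column is a feature and each row is a data record
Pass the training data to the transformer's transform() method to preprocess it
Pass the test data to the same transformer's transform() method to preprocess it

The fit() and transform() methods can be called separately, or chained as fit().transform(). Preprocessors also provide a fit_transform() method that combines the two.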
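
As a quick check of the workflow above, the following minimal sketch uses MinMaxScaler on small hypothetical arrays (X_train and X_test here are toy values, not the article's data). It confirms that the three call styles give the same result on the training data, and shows that the test data is transformed with parameters learned from the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: rows are records, columns are features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
X_test = np.array([[2.0, 50.0]])

# Style 1: fit() and transform() called separately
scaler = MinMaxScaler()
scaler.fit(X_train)
a = scaler.transform(X_train)

# Style 2: method chaining, Style 3: fit_transform()
b = MinMaxScaler().fit(X_train).transform(X_train)
c = MinMaxScaler().fit_transform(X_train)

print(np.allclose(a, b) and np.allclose(a, c))

# The test data reuses the scaler fitted on the training data;
# 50 lies above the training maximum of 40, so its scaled value exceeds 1
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```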

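Example

We trace the behavior of MinMaxScaler, one of the scaler classes in preprocessing.

First, import the required libraries and classes, load the Breast Cancer data, and split it into training and test data. The cancer data has 30 features as columns and 569 records; it is split 3:1 into 426 training records and 143 test records.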

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

np.set_printoptions(suppress=True, precision=2, floatmode='fixed')

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test =\
    train_test_split(cancer.data, cancer.target, random_state=1)
print("shape of training data:{}".format(X_train.shape))
print("shape of test data    :{}".format(X_test.shape))

# shape of training data:(426, 30)
# shape of test data    :(143, 30)

Next, create a MinMaxScaler instance and pass the training data X_train to its fit() method to build the transformation model.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)

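In preprocessing, building a model means supplying reference data so that the transformation parameters are computed and stored.

The MinMaxScaler object in this example stores, as 1-D arrays with one element per feature, the minimum of each feature in the dataset (data_min_), the maximum (data_max_), the range given by maximum minus minimum (data_range_), and the reciprocal of the range (scale_).

These parameters are the minimum, maximum, and so on of the 426 records for each of the 30 features. For the first feature, for example, maximum minus minimum is 28.11 − 6.98 = 21.13, which matches the first value of data_range_. Each element of scale_ is the reciprocal of the corresponding element of data_range_.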

print("-----training data characteristics and parameters")
print("mins  :\n{}".format(scaler.data_min_))
print("maxs  :\n{}".format(scaler.data_max_))
print("ranges:\n{}".format(scaler.data_range_))
print("scales:\n{}".format(scaler.scale_))

# -----training data characteristics and parameters
# mins  :
# [  6.98   9.71  43.79 143.50   0.05   0.02   0.00   0.00   0.11   0.05
#    0.12   0.36   0.76   6.80   0.00   0.00   0.00   0.00   0.01   0.00
#    7.93  12.02  50.41 185.20   0.07   0.03   0.00   0.00   0.16   0.06]
# maxs  :
# [  28.11   39.28  188.50 2501.00    0.16    0.29    0.43    0.20    0.30
#     0.10    2.87    4.88   21.98  542.20    0.03    0.14    0.40    0.05
#     0.06    0.03   36.04   49.54  251.20 4254.00    0.22    0.94    1.17
#     0.29    0.58    0.15]
# ranges:
# [  21.13   29.57  144.71 2357.50    0.11    0.27    0.43    0.20    0.20
#     0.05    2.76    4.52   21.22  535.40    0.03    0.13    0.40    0.05
#     0.05    0.03   28.11   37.52  200.79 4068.80    0.15    0.91    1.17
#     0.29    0.42    0.09]
# scales:
# [ 0.05  0.03  0.01  0.00  9.03  3.74  2.34  4.97  5.05 21.97  0.36  0.22
#   0.05  0.00 33.99  7.51  2.53 18.94 19.26 34.55  0.04  0.03  0.00  0.00
#   6.60  1.10  0.85  3.44  2.38 10.71]
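
The relationship between these attributes and the transformation itself can be verified by hand. The sketch below assumes the default feature_range of (0, 1) and uses a small hypothetical array rather than the cancer data; under that assumption the transform is (X − data_min_) × scale_.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 3 records, 2 features
X = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 200.0]])
scaler = MinMaxScaler().fit(X)

# data_range_ is max minus min, and scale_ is its reciprocal
# (this holds for the default feature_range=(0, 1))
print(np.allclose(scaler.data_range_, scaler.data_max_ - scaler.data_min_))
print(np.allclose(scaler.scale_, 1.0 / scaler.data_range_))

# Reproduce transform() manually from the stored parameters
manual = (X - scaler.data_min_) * scaler.scale_
print(np.allclose(manual, scaler.transform(X)))
```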

Transforming X_train with the fitted transformer gives a minimum of 0 and a maximum of 1 for every feature.

X_train_scaled = scaler.transform(X_train)
print("-----scaled training data characteristics")
print("mins  :\n{}".format(X_train_scaled.min(axis=0)))
print("maxs  :\n{}".format(X_train_scaled.max(axis=0)))
print("ranges:\n{}".format(X_train_scaled.max(axis=0) - X_train_scaled.min(axis=0)))

# mins  :
# [0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
#  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
#  0.00 0.00]
# maxs  :
# [1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
#  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
#  1.00 1.00]
# ranges:
# [1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
#  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
#  1.00 1.00]

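When the test data is transformed with the same transformer, the minimum and maximum of the transformed features are not exactly 0 and 1. This is expected, because the minimum and maximum of the test data do not necessarily match those of the training data. Where the maximum of the test data exceeds that of the training data, the transformed maximum exceeds 1.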

X_test_scaled = scaler.transform(X_test)
print("-----scaled test data characteristics")
print("mins  :\n{}".format(X_test_scaled.min(axis=0)))
print("maxs  :\n{}".format(X_test_scaled.max(axis=0)))
print("ranges:\n{}".format(X_test_scaled.max(axis=0) - X_test_scaled.min(axis=0)))

# mins  :
# [ 0.03  0.02  0.03  0.01  0.14  0.04  0.00  0.00  0.15 -0.01 -0.00  0.01
#   0.00  0.00  0.04  0.01  0.00  0.00 -0.03  0.01  0.03  0.06  0.02  0.01
#   0.11  0.03  0.00  0.00 -0.00 -0.00]
# maxs  :
# [0.96 0.82 0.96 0.89 0.81 1.22 0.88 0.93 0.93 1.04 0.43 0.50 0.44 0.28
#  0.49 0.74 0.77 0.63 1.34 0.39 0.90 0.79 0.85 0.74 0.92 1.13 1.07 0.92
#  1.21 1.63]
# ranges:
# [0.92 0.79 0.92 0.88 0.67 1.18 0.88 0.93 0.78 1.04 0.43 0.49 0.44 0.28
#  0.45 0.73 0.77 0.63 1.37 0.38 0.87 0.74 0.83 0.74 0.81 1.11 1.07 0.92
#  1.21 1.63]

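If fit() were run again on the test data and the result applied to the test data, the range would become 0 to 1, but the training data and the test data would then undergo different transformations and the results would be distorted.

Preprocessing models

sklearn.preprocessing provides a wide variety of transformers; here they are organized into categories by purpose.

scaler: scaling

Transforms the magnitude and range of the data to make them uniform.

MinMaxScaler: Normalizes each feature to the range 0 to 1 (linear transformation).
StandardScaler: Standardizes each feature using the mean and variance of the samples (linear transformation).
RobustScaler: Standardizes each feature using its median and quartiles (linear transformation).

normalization: normalizing feature vectors

Makes the norms of the feature vectors uniform. Unlike the scalers, whose purpose is to align ranges, this does not preserve the shape of the original data distribution.

Normalizer: Scales each feature vector to a norm of 1.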
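
A minimal sketch of Normalizer on hypothetical toy vectors follows. It illustrates why the original distribution is not preserved: the vectors (3, 4) and (6, 8) differ in magnitude but map to the same unit vector.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data: each row is one sample's feature vector
X = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

X_norm = Normalizer(norm='l2').fit_transform(X)
print(X_norm)
# [[0.6 0.8]
#  [1.  0. ]
#  [0.6 0.8]]

# Every row now has an L2 norm of 1
print(np.linalg.norm(X_norm, axis=1))
# [1. 1. 1.]
```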

binarize: binarization

Splits feature values into the two values 0 and 1.
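
As an illustration, the sketch below uses the Binarizer class from sklearn.preprocessing on hypothetical toy values with an arbitrarily chosen threshold of 0.5. Values strictly greater than the threshold become 1 and all others become 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical toy data
X = np.array([[0.2, 1.5], [0.8, 0.5]])

# threshold=0.5 is an arbitrary choice for this illustration;
# note that 0.5 itself is not greater than the threshold, so it maps to 0
X_bin = Binarizer(threshold=0.5).fit_transform(X)
print(X_bin)
# [[0. 1.]
#  [1. 0.]]
```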

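encoder: encoding categorical data

Converts data given as categories (gender, day of the week, and so on) into numbers so that a model can handle it.

LabelEncoder: Converts class data given as a 1-D array into numeric labels.
OrdinalEncoder: Converts feature class data given as a 2-D array into numeric labels.
OneHotEncoder: Converts feature class data given as a 2-D array into indicator columns for each feature.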
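
The difference between the three encoders can be seen by encoding the same column with each. This is a minimal sketch on a hypothetical list of city names; OneHotEncoder is left at its default sparse output and converted with toarray() for printing.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

# Hypothetical toy data: a single categorical feature
cities = ["Tokyo", "Osaka", "Kyoto", "Osaka"]

# LabelEncoder takes a 1-D array
le_codes = LabelEncoder().fit_transform(cities)
print(le_codes)
# [2 1 0 1]

# OrdinalEncoder and OneHotEncoder take a 2-D array (records x features)
X = [[c] for c in cities]
oe_codes = OrdinalEncoder().fit_transform(X)
print(oe_codes.ravel())
# [2. 1. 0. 1.]

ohe_codes = OneHotEncoder().fit_transform(X).toarray()
print(ohe_codes)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```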

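Robustness of scaling

MinMaxScaler has a simple computation, but even a few extreme outliers determine the overall range and distort the values of the data the scaling is actually meant for. StandardScaler and RobustScaler are less affected by such outliers. The robustness of these three scalers is examined in a separate article.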
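
The effect of an outlier can be sketched on a hypothetical 1-D feature made of five ordinary values and one extreme value. The code measures how much of the scaled range the five ordinary values still occupy under each scaler; it is only an illustration on made-up numbers, not a general benchmark.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical data: five ordinary values and one extreme outlier (100)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0]).reshape(-1, 1)

spreads = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x).ravel()
    inliers = scaled[:5]  # the five ordinary values
    # Spread of the ordinary values after scaling
    spreads[type(scaler).__name__] = inliers.max() - inliers.min()
    print(type(scaler).__name__, np.round(scaled, 3))

print(spreads)
```

On this toy data the ordinary values are squeezed into the narrowest interval by MinMaxScaler and keep the widest spread under RobustScaler, with StandardScaler in between.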

OneHotEncoder

2020-10-09 / tau

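Overview

OneHotEncoder encodes features that hold class (categorical) data. Whereas LabelEncoder and OrdinalEncoder assign a sequence of numbers to the classes within a feature, OneHotEncoder allocates one column per class and sets a 1 only in the column of the class each record belongs to. The encoded data therefore consists of indicator variables, each of which responds to a single class.

One-hot encoding can also be done with the pandas get_dummies() function.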
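
For comparison, here is a minimal sketch of pandas.get_dummies() on a hypothetical DataFrame. dtype=int is passed explicitly because the default dtype of the indicator columns differs between pandas versions.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "city": ["Tokyo", "Osaka", "Kyoto"],
    "gender": ["Male", "Female", "Male"],
})

# Each categorical column is expanded into <column>_<class> indicator columns
dummies = pd.get_dummies(df, dtype=int)
print(dummies.columns.tolist())
# ['city_Kyoto', 'city_Osaka', 'city_Tokyo', 'gender_Female', 'gender_Male']
print(dummies.values.tolist())
# [[0, 0, 1, 0, 1], [0, 1, 0, 1, 0], [1, 0, 0, 0, 1]]
```

Unlike a fitted OneHotEncoder, get_dummies() does not remember the classes it saw, so the same columns are not guaranteed when it is later applied to new data.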

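Usage

fit(): generating the indicator columns

The following example transforms a dataset of six records with two class features using OneHotEncoder.

Import OneHotEncoder from sklearn.preprocessing
Create an encoder instance
- The default output is a sparse matrix, so the option sparse=False is specified (in scikit-learn 1.2 and later this option is named sparse_output)
Fit the data with the fit() method to prepare the transformer
At this point, the categories_ attribute holds the indicator layout for each feature

In the example below, the first feature has three classes and the second has two, so categories_ is set to a list whose elements are arrays of three and two elements.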

from sklearn.preprocessing import OneHotEncoder

X = [
    ["Tokyo", "Male"],
    ["Tokyo", "Female"],
    ["Osaka", "Male"],
    ["Kyoto", "Female"],
    ["Osaka", "Female"],
    ["Osaka", "Male"]
]

ohe = OneHotEncoder(sparse=False)
ohe.fit(X)

print(ohe.categories_)
print(ohe.categories_[0])
print(ohe.categories_[1])

# [array(['Kyoto', 'Osaka', 'Tokyo'], dtype=object), array(['Female', 'Male'], dtype=object)]
# ['Kyoto' 'Osaka' 'Tokyo']
# ['Female' 'Male']

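transform(): converting to indicator data

The data is transformed with the transformer prepared by fit(). The transformed data is returned as a 2-D ndarray with one column per class across all features. A fit_transform() method that performs fit and transform in one step is also provided.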

X_trans = ohe.transform(X)
print(X_trans)

# [[0. 0. 1. 0. 1.]
#  [0. 0. 1. 1. 0.]
#  [0. 1. 0. 0. 1.]
#  [1. 0. 0. 1. 0.]
#  [0. 1. 0. 1. 0.]
#  [0. 1. 0. 0. 1.]]

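The first three columns of the output correspond to the three cities, and the following two columns to gender. For example, the first row indicates that the city is 'Tokyo', the third element of categories_[0], and the gender is 'Male', the second element of categories_[1].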

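Working with a DataFrame

OneHotEncoder can also handle a pandas.DataFrame. However, transform() and fit_transform() return an ndarray, so the example below converts the result into a DataFrame. Using the encoder instance's categories_ attribute for the columns argument is convenient because the individual class names do not have to be typed in.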

import numpy as np
from pandas import DataFrame

df_X = DataFrame(X, columns=["city", "gender"])
X_trans = ohe.fit_transform(df_X)
df_X_trans = DataFrame(X_trans,
    columns=np.append(ohe.categories_[0], ohe.categories_[1]))

print(df_X)
print()
print(df_X_trans)

#     city  gender
# 0  Tokyo    Male
# 1  Tokyo  Female
# 2  Osaka    Male
# 3  Kyoto  Female
# 4  Osaka  Female
# 5  Osaka    Male
# 
#    Kyoto  Osaka  Tokyo  Female  Male
# 0    0.0    0.0    1.0     0.0   1.0
# 1    0.0    0.0    1.0     1.0   0.0
# 2    0.0    1.0    0.0     0.0   1.0
# 3    1.0    0.0    0.0     1.0   0.0
# 4    0.0    1.0    0.0     1.0   0.0
# 5    0.0    1.0    0.0     0.0   1.0

When numeric data and class data are mixed

Preparing the DataFrame

The following example handles a dataset with two class features and two numeric features as a DataFrame.

import numpy as np
from pandas import DataFrame
from sklearn.preprocessing import OneHotEncoder

X = [
    ["Tokyo", 10000, "Male", 2],
    ["Tokyo", 8000, "Female", 1.5],
    ["Osaka", 9000, "Male", 1.5],
    ["Kyoto", 10000, "Female", 1],
    ["Osaka", 7000, "Female", 1],
    ["Osaka", 8000, "Male", 1.5]
]

df_X = DataFrame(X, columns=["city", "hotel_charge", "gender", "travel_time"])
print(df_X)

#     city  hotel_charge  gender  travel_time
# 0  Tokyo         10000    Male          2.0
# 1  Tokyo          8000  Female          1.5
# 2  Osaka          9000    Male          1.5
# 3  Kyoto         10000  Female          1.0
# 4  Osaka          7000  Female          1.0
# 5  Osaka          8000    Male          1.5

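Preparing the headers for the class data

This step prepares for turning the class data into multiple indicator columns.

Separate the headers of the class features from those of the numeric features
Prepare a DataFrame for the class data by extracting only the class columns from the original data
Create an encoder and run fit_transform()
Afterwards, retrieve the class lists held in the encoder's categories_

These class lists become the headers of the transformed data.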

col_class = ["city", "gender"]
col_num = ["hotel_charge", "travel_time"]

df_X_class = df_X[col_class]
ohe = OneHotEncoder(sparse=False)
X_trans = ohe.fit_transform(df_X_class)

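# Keep col_class (the original class column names) intact and store
# the flattened class names under a separate name for later use
col_class_trans = [cls for ary in ohe.categories_ for cls in ary]
print(col_class_trans)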

# ['Kyoto', 'Osaka', 'Tokyo', 'Female', 'Male']

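Combining the class data and the numeric data

The following steps combine the transformed class columns with the original numeric columns to form the final dataset.

Load the transformed class data (ndarray) into a DataFrame, using the class lists as headers
Add the numeric data from the original data to that DataFrame

This changes the order of the features relative to the original dataset, but the order of the features does not affect the learning process.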

df_X_class_trans = DataFrame(X_trans, columns=col_class_trans)
print(df_X_class_trans)

#    Kyoto  Osaka  Tokyo  Female  Male
# 0    0.0    0.0    1.0     0.0   1.0
# 1    0.0    0.0    1.0     1.0   0.0
# 2    0.0    1.0    0.0     0.0   1.0
# 3    1.0    0.0    0.0     1.0   0.0
# 4    0.0    1.0    0.0     1.0   0.0
# 5    0.0    1.0    0.0     0.0   1.0

df_X_trans = df_X_class_trans.copy()
df_X_trans[col_num] = df_X[col_num]
print(df_X_trans)

#    Kyoto  Osaka  Tokyo  Female  Male  hotel_charge  travel_time
# 0    0.0    0.0    1.0     0.0   1.0         10000          2.0
# 1    0.0    0.0    1.0     1.0   0.0          8000          1.5
# 2    0.0    1.0    0.0     0.0   1.0          9000          1.5
# 3    1.0    0.0    0.0     1.0   0.0         10000          1.0
# 4    0.0    1.0    0.0     1.0   0.0          7000          1.0
# 5    0.0    1.0    0.0     0.0   1.0          8000          1.5

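inverse_transform()

Because df_X_trans = df_X_class_trans.copy() was used above, df_X_class_trans is preserved. Passing it to the encoder's inverse_transform() returns the classes, which were spread over multiple columns, in their original representation.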

print(ohe.inverse_transform(df_X_class_trans))

# [['Tokyo' 'Male']
#  ['Tokyo' 'Female']
#  ['Osaka' 'Male']
#  ['Kyoto' 'Female']
#  ['Osaka' 'Female']
#  ['Osaka' 'Male']]

Transforming new data

When new data is given to a trained model for prediction, the encoding step of preprocessing passes the new data to the already-fitted encoder.

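# Column names of the original dataset (not defined earlier in the listing)
col_original = ["city", "hotel_charge", "gender", "travel_time"]

x = [["Kyoto", 7000, "Male", 0.5]]
df_X = DataFrame(x, columns=col_original)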
print(df_X)

#     city  hotel_charge gender  travel_time
# 0  Kyoto          7000   Male          0.5

df_X_class = df_X[col_class]
X_trans = ohe.transform(df_X_class)

df_X_trans = DataFrame(X_trans, columns=col_class_trans)
df_X_trans[col_num] = df_X[col_num]
print(df_X_trans)

#    Kyoto  Osaka  Tokyo  Female  Male  hotel_charge  travel_time
# 0    1.0    0.0    0.0     0.0   1.0          7000          0.5

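Handling unknown classes

The behavior on encountering a class that was absent at fitting time is specified when the encoder instance is created.

OneHotEncoder(handle_unknown='error'/'ignore')

The default is 'error', which raises an error when an unknown class is encountered. With 'ignore', all indicator columns of that feature are set to 0 for the unknown class.

In the example below, the second record contains "Nagoya", which was not included at fitting time, so columns 1 to 3 of the second row of the transformed data are 0.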

df_X = DataFrame(X, columns=col_original)
df_X_class = df_X[col_class]

ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit(df_X_class)

x = [
    ["Kyoto", 9000, "Female", 1],
    ["Nagoya", 7000, "Male", 0.5]
]
df_X = DataFrame(x, columns=col_original)
df_X_class = df_X[col_class]
print(df_X_class)

#      city  gender
# 0   Kyoto  Female
# 1  Nagoya    Male

X_class_trans = ohe.transform(df_X_class)
print(X_class_trans)

# [[1. 0. 0. 1. 0.]
#  [0. 0. 0. 0. 1.]]

df_X_trans = DataFrame(X_class_trans, columns=col_class_trans)
df_X_trans[col_num] = df_X[col_num]
print(df_X_trans)

#    Kyoto  Osaka  Tokyo  Female  Male  hotel_charge  travel_time
# 0    1.0    0.0    0.0     1.0   0.0          9000          1.0
# 1    0.0    0.0    0.0     0.0   1.0          7000          0.5

When this transformed data is passed back through inverse_transform(), the entries that were unknown classes are converted to None.

print(ohe.inverse_transform(X_class_trans))

# [['Kyoto' 'Female']
#  [None 'Male']]

OrdinalEncoder

2020-10-08 / tau

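Overview

OrdinalEncoder in sklearn.preprocessing converts 2-D data (rows × columns = records × features) into numeric label data.

Create an encoder instance with the constructor
Pass the 2-D source data to the fit() method (a 2-D list, ndarray, or DataFrame is accepted)
The data is converted to numeric labels feature by feature (column by column)
When a feature has n_class categories, its values are converted to integer labels from 0 to n_class−1
Even 1-D data must be reshaped to 2-D before it is converted
All columns are converted; any quantitative numeric columns included are also turned into numeric labels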
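
The reshaping required for 1-D data can be sketched as follows, using a hypothetical array of city names. reshape(-1, 1) turns the 1-D array into a single-feature 2-D array, and ravel() flattens the result again.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical 1-D data: one categorical feature for four records
cities = np.array(["Tokyo", "Osaka", "Kyoto", "Osaka"])

# OrdinalEncoder expects (n_records, n_features), so reshape to 4 x 1
X = cities.reshape(-1, 1)

oe = OrdinalEncoder()
codes = oe.fit_transform(X)

print(codes.shape)
# (4, 1)
print(codes.ravel())
# [2. 1. 0. 1.]
```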

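Usage

fit: setting up the labels

The following example uses six records with three features. All three features are class data, and the fit() method prepares the transformer.

For an encoder, fit() assigns labels to the class data of each feature and prepares the transformer
After fitting, the categories_ attribute is set to a list
categories_ is a list of ndarrays, each holding the distinct class names of one feature
The classes of each feature correspond to the numeric labels 0, 1, 2, … in order from the start of the corresponding array in categories_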

from sklearn.preprocessing import OrdinalEncoder

X = [
    ["Tokyo", "Male", "by air"],
    ["Tokyo", "Male", "by rail"],
    ["Osaka", "Female", "by rail"],
    ["Kyoto", "Female", "by bus"],
    ["Osaka", "Male", "by air"],
    ["Osaka", "Female", "by bus"]
]

oe = OrdinalEncoder()
oe.fit(X)
print(oe.categories_)
print(oe.categories_[0])
print(oe.categories_[1])
print(oe.categories_[2])

# [array(['Kyoto', 'Osaka', 'Tokyo'], dtype=object), array(['Female', 'Male'], dtype=object), array(['by air', 'by bus', 'by rail'], dtype=object)]
# ['Kyoto' 'Osaka' 'Tokyo']
# ['Female' 'Male']
# ['by air' 'by bus' 'by rail']

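transform: converting to labels

Transforming the original data with this transformer's transform() method yields a 2-D array of the same shape as the original data, in which each class value has been converted to a number.

OrdinalEncoder also provides a fit_transform() method.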

X_trans = oe.transform(X)
print(X_trans)

# [[2. 1. 0.]
#  [2. 1. 2.]
#  [1. 0. 2.]
#  [0. 0. 1.]
#  [1. 1. 0.]
#  [1. 0. 1.]]

Even a single record must be passed as a 2-D array (one row by the number of features), and the result is also returned as a 2-D array.

y = [["Kyoto", "Male", "by rail"]]
y_trans = oe.transform(y)
print(y_trans)

# [[0. 1. 2.]]

inverse_transform() converts the numeric labels back to the class data.

print(oe.inverse_transform(y_trans))

# [['Kyoto' 'Male' 'by rail']]

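The categories parameter

The constructor accepts a categories parameter. Use it when the classes of the features are known in advance; the classes must be specified for every feature. Classes that do not appear in the original data may be included.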

oe = OrdinalEncoder(categories=[
    ["Tokyo", "Kyoto", "Osaka", "Nagoya"],
    ["Male", "Female"],
    ["by air", "by bus", "by rail"]])
oe.fit(X)
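
To see what the explicit lists change, the sketch below encodes the same toy data with them. It assumes that labels are assigned in the order given in each list, which is the behavior I would expect for string categories; the output comments follow from that assumption.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [
    ["Tokyo", "Male", "by air"],
    ["Tokyo", "Male", "by rail"],
    ["Osaka", "Female", "by rail"],
    ["Kyoto", "Female", "by bus"],
    ["Osaka", "Male", "by air"],
    ["Osaka", "Female", "by bus"],
]

oe = OrdinalEncoder(categories=[
    ["Tokyo", "Kyoto", "Osaka", "Nagoya"],   # "Nagoya" does not occur in X
    ["Male", "Female"],
    ["by air", "by bus", "by rail"]])
X_trans = oe.fit_transform(X)

# The category order is the one given above, not alphabetical order
print(list(oe.categories_[0]))
print(X_trans)

# [[0. 0. 0.]
#  [0. 0. 2.]
#  [2. 1. 2.]
#  [1. 1. 1.]
#  [2. 0. 0.]
#  [2. 1. 1.]]
```

Compared with the default alphabetical ordering, "Tokyo" is now 0 instead of 2 and "Male" is 0 instead of 1.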

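When numeric data and category data are mixed

When category data and numeric data are mixed, OrdinalEncoder treats every column as categorical, so the numeric data are also converted to labels.

In the example below, the real values in the last column, 1, 1.5, …, 5, are converted to the labels 0, 1, …, 5.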

X = [
    ["Tokyo", "Male", "by air", 1.5],
    ["Tokyo", "Male", "by rail", 3],
    ["Osaka", "Female", "by rail", 3.5],
    ["Kyoto", "Female", "by bus", 5],
    ["Osaka", "Male", "by air", 1],
    ["Osaka", "Female", "by bus", 4]
]

oe = OrdinalEncoder()
oe.fit(X)
X_trans = oe.transform(X)
print(X_trans)

# [[2. 1. 0. 1.]
#  [2. 1. 2. 2.]
#  [1. 0. 2. 3.]
#  [0. 0. 1. 5.]
#  [1. 1. 0. 0.]
#  [1. 0. 1. 4.]]

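In such a case, extract only the categorical columns and transform them. OrdinalEncoder accepts a pandas.DataFrame, so the original data is turned into a DataFrame to make the column operations easier.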

import pandas as pd
df = pd.DataFrame(X, columns=["city", "gender", "transportation", "travel_time"])
print(df)

#     city  gender transportation  travel_time
# 0  Tokyo    Male         by air          1.5
# 1  Tokyo    Male        by rail          3.0
# 2  Osaka  Female        by rail          3.5
# 3  Kyoto  Female         by bus          5.0
# 4  Osaka    Male         by air          1.0
# 5  Osaka  Female         by bus          4.0

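In this example the first three columns are categorical, so they are extracted into a temporary DataFrame and OrdinalEncoder is applied to it. transform() returns an ndarray, which is then written back into the corresponding columns of the original DataFrame.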

df_temp = df[["city", "gender", "transportation"]]
oe.fit(df_temp)
df_trans = oe.transform(df_temp)
print(df_trans)
print()
df[["city", "gender", "transportation"]] = df_trans
print(df)

# [[2. 1. 0.]
#  [2. 1. 2.]
#  [1. 0. 2.]
#  [0. 0. 1.]
#  [1. 1. 0.]
#  [1. 0. 1.]]
#
#    city  gender  transportation  travel_time
# 0   2.0     1.0             0.0          1.5
# 1   2.0     1.0             2.0          3.0
# 2   1.0     0.0             2.0          3.5
# 3   0.0     0.0             1.0          5.0
# 4   1.0     1.0             0.0          1.0
# 5   1.0     0.0             1.0          4.0

The last column is left as it is, and the three columns before it have been converted to label data.
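
The same column-wise encoding can also be sketched with sklearn.compose.ColumnTransformer. This is an alternative technique, not the approach used in this article: the categorical columns are listed explicitly and remainder="passthrough" keeps the other columns unchanged. The names cat_cols, ct and encoded are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(
    [["Tokyo", "Male", "by air", 1.5],
     ["Tokyo", "Male", "by rail", 3],
     ["Osaka", "Female", "by rail", 3.5],
     ["Kyoto", "Female", "by bus", 5],
     ["Osaka", "Male", "by air", 1],
     ["Osaka", "Female", "by bus", 4]],
    columns=["city", "gender", "transportation", "travel_time"])

cat_cols = ["city", "gender", "transportation"]

# Encode only the categorical columns; pass travel_time through untouched
ct = ColumnTransformer(
    [("cat", OrdinalEncoder(), cat_cols)],
    remainder="passthrough")
encoded = ct.fit_transform(df)

# Encoded columns come first, followed by the passed-through column
print(encoded)
```

Unlike the in-place assignment above, this leaves the original DataFrame untouched and returns a new array.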

Normalizer

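2020-10-08 / tau

Overview

Normalizer in the sklearn.preprocessing module rescales each feature vector so that its norm is 1. Specifically, for each sample, feature F_i is converted to F_i^* by the following formula, where the sum runs over all features of that sample.

(1) $\begin{equation*} {F_i}^* = \frac{F_i}{\left( \sum_j {|F_j|}^p \right) ^\frac{1}{p}} \end{equation*}$

The type of norm is specified by a constructor argument. The default is 'l2' (p = 2); 'l1' (p = 1) and 'max' (division by the largest absolute value, the limit p → ∞) can also be specified.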

Normalizer(norm='l2')
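
A minimal sketch of the formula, assuming a small hand-made array: each row is divided by its own norm computed with NumPy, and the result is compared with Normalizer for the three norm types.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Three samples (rows) with two features each
X = np.array([[3.0, 4.0],
              [1.0, -0.5],
              [-1.0, 2.0]])

# Per-sample (row-wise) norms computed by hand
norms = {
    "l1": np.abs(X).sum(axis=1, keepdims=True),
    "l2": np.sqrt((X ** 2).sum(axis=1, keepdims=True)),
    "max": np.abs(X).max(axis=1, keepdims=True),
}

results = {}
for name, norm in norms.items():
    manual = X / norm
    by_sklearn = Normalizer(norm=name).fit_transform(X)
    results[name] = np.allclose(manual, by_sklearn)
    print(name, results[name])

# The first row (3, 4) has L2 norm 5, so it becomes (0.6, 0.8)
print(Normalizer().fit_transform(X)[0])
```

Note that the norm is taken along each row (sample), not along each column (feature) as the scalers do.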

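Behavior

The behavior when Normalizer is applied to two features that follow different normal distributions is shown below.

Unlike the scalers, this is not a similarity (shape-preserving) transformation, so the histogram after transformation (lower left) has a different shape from the one before.

As for the spatial distribution of the data, the default L2 norm moves every sample onto the circle of radius 1.

Enlarging the transformed data gives the following: the points line up on the circle of radius 1 centered at the origin.

The results for the other two choices, the L1 norm and the max norm, are shown below; the points line up on the curve that corresponds to each norm.

The code is as follows. Instead of calling fit() and then transform() separately, it uses fit_transform(), which performs both in one call. (Normalizer is stateless: each sample is divided by its own norm, so fit() does not learn any scale parameters from the data.)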

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
import matplotlib.patches as patch
from sklearn.preprocessing import Normalizer

rnd.seed(0)
x1 = rnd.normal(loc=1, scale=2, size=100)
x2 = rnd.normal(loc=5, scale=1, size=100)
X = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))

X_trans = Normalizer().fit_transform(X)

fig1 = plt.figure(figsize=(9.6, 4.8))

ax1 = fig1.add_subplot(2, 2, 1)
ax2 = fig1.add_subplot(2, 2, 3)
ax3 = fig1.add_subplot(1, 2, 2)

ax1.hist(X[:, 0], ec='k', range=(-5, 10), bins=40, alpha=0.5)
ax1.hist(X[:, 1], ec='k', range=(-5, 10), bins=40, alpha=0.5)

ax2.hist(X_trans[:, 0], range=(-1.2, 1.2), bins=40, ec='k', alpha=0.5)
ax2.hist(X_trans[:, 1], range=(-1.2, 1.2), bins=40, ec='k', alpha=0.5)

ax3.scatter(X[:, 0], X[:, 1], ec='k', fc='w')
ax3.scatter(X_trans[:, 0], X_trans[:, 1], ec='k', fc='gray')
ax3.set_aspect('equal')
ax3.set_xlim(-5, 8)
ax3.set_ylim(-5, 8)

fig2, ax4 = plt.subplots()

ax4.scatter(X_trans[:, 0], X_trans[:, 1], ec='k', fc='gray')
ax4.set_aspect('equal')
ax4.set_xlim(-1.5, 1.5)
ax4.set_ylim(-1.5, 1.5)
ax4.grid()
ax4.spines['top'].set_visible(False)
ax4.spines['right'].set_visible(False)
ax4.spines['bottom'].set_position('zero')
ax4.spines['left'].set_position('zero')
circ = patch.Circle(xy=(0, 0), radius=1, ec='k', fill=False)
ax4.add_patch(circ)

X_trans_l1 = Normalizer('l1').fit_transform(X)
X_trans_max = Normalizer('max').fit_transform(X)

fig3, axes = plt.subplots(1, 2, figsize=(9.6, 4.8))

axes[0].scatter(X_trans_l1[:, 0],X_trans_l1[:, 1], ec='k', fc='gray')
axes[0].plot([0, 1, 0, -1, 0], [1, 0, -1, 0, 1], c='k')
axes[1].scatter(X_trans_max[:, 0],X_trans_max[:, 1], ec='k', fc='gray')
axes[1].plot([1, 1, -1, -1, 1], [1, -1, -1, 1, 1], c='k')

for ax in axes.reshape(-1):
    ax.set_aspect('equal')
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.grid()
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['bottom'].set_position('zero')
    ax.spines['left'].set_position('zero')

plt.show()

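Characteristics

Normalizer is used when only the direction of the feature vector matters. One use might be separating clusters that lie within particular ranges of direction in the feature space, but I find it hard to picture more abstract cases. In fact, a web search does not turn up, at least among the top results, examples that apply Normalizer on the basis of its meaning and the nature of the data.

Note that the transformation by Normalizer is irreversible; unlike the scalers, it has no inverse_transform().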
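
The irreversibility can be illustrated with a hand-made pair of samples (an assumption for illustration): two vectors that point in the same direction but have different lengths become identical after normalization, so the original length cannot be recovered.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two samples pointing in the same direction but with different lengths
X = np.array([[1.0, 2.0],
              [10.0, 20.0]])

X_trans = Normalizer().fit_transform(X)
print(X_trans)

# Both rows collapse onto the same unit vector
print(np.allclose(X_trans[0], X_trans[1]))         # True

# Normalizer does not offer an inverse transformation
print(hasattr(Normalizer(), "inverse_transform"))  # False
```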

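preprocessor – robustness to outliers

2020-10-04 / tau

Some preprocessing algorithms that are applied to data before it is fed to a machine-learning model are sensitive to outliers.

Suppose, for example, that the data are distributed as in the left panel of the figure below (500 random values drawn from a normal distribution with mean 1 and variance 1). If 10 outliers with the value 20 are then added, the overall distribution looks like the right panel.

The results of transforming this data with MinMaxScaler, StandardScaler and RobustScaler are shown below. For StandardScaler and RobustScaler the outliers are not displayed; only the range covering the original normal distribution is shown.

With MinMaxScaler (left), the range including the outliers is mapped to 0 to 1, so the main body of normally distributed data is concentrated in small values near 0. As a result, the main body, which is what should drive the accuracy of learning, may not be separated well enough.

With StandardScaler (middle) and RobustScaler (right), the shape of the main body does not differ much from the original normal distribution, which shows that they are robust.

Next, the number of outliers is increased from 10 to 20 and the same three transformations are applied.

With MinMaxScaler (left), the range is determined only by the value of the outliers, regardless of how many there are, so the original distribution is still squeezed toward 0.

With StandardScaler (middle), the shape of the distribution changes slightly compared with the 10-outlier case, and its range becomes narrower.

With RobustScaler (right), the shape of the original distribution does not change much.

From the above, the three transformers have at least the following characteristics.

MinMaxScaler: outliers can narrow the range occupied by the data that is actually of interest.
StandardScaler: relatively insensitive to outliers, but the distribution of the main body is affected somewhat by their magnitude and frequency.
RobustScaler: robustly preserves the characteristics of the original data unless the number of outliers is extremely large.

The code used to draw the figures above is as follows.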

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

rnd.seed(0)
x = rnd.normal(loc=1, scale=1, size=500)
x1 = np.append(x, [20] * 10)
x2 = np.append(x, [20] * 20)

scaler = MinMaxScaler()
x1_scaled_by_minmax = scaler.fit_transform(x1.reshape(-1, 1))
x2_scaled_by_minmax = scaler.fit_transform(x2.reshape(-1, 1))

scaler = StandardScaler()
x1_scaled_by_standard = scaler.fit_transform(x1.reshape(-1, 1))
x2_scaled_by_standard = scaler.fit_transform(x2.reshape(-1, 1))

scaler = RobustScaler()
x1_scaled_by_robust = scaler.fit_transform(x1.reshape(-1, 1))
x2_scaled_by_robust = scaler.fit_transform(x2.reshape(-1, 1))

fig0, axes = plt.subplots(1, 2, figsize=(12.8, 4.8))
axes[0].hist(x1, ec='k', bins=10, range=(-2, 4))
axes[1].hist(x1, ec='k', bins=40)

fig1, axes = plt.subplots(1, 3, figsize=(18.6, 4.8))

ax = axes[0]
ax.hist(x1_scaled_by_minmax, ec='k', bins=40)
ax.set_title("MinMaxScaler")

ax = axes[1]
ax.hist(x1_scaled_by_standard, ec='k', bins=10, range=(-1.5, 1))
ax.set_title("StandardScaler")

ax = axes[2]
ax.hist(x1_scaled_by_robust, ec='k', bins=10, range=(-2.5, 2.5))
ax.set_title("RobustScaler")

fig2, axes = plt.subplots(1, 3, figsize=(18.6, 4.8))

ax = axes[0]
ax.hist(x2_scaled_by_minmax, ec='k', bins=40)
ax.set_title("MinMaxScaler")

ax = axes[1]
ax.hist(x2_scaled_by_standard, ec='k', bins=10, range=(-1.5, 1))
ax.set_title("StandardScaler")

ax = axes[2]
ax.hist(x2_scaled_by_robust, ec='k', bins=10, range=(-2.5, 2.5))
ax.set_title("RobustScaler")

plt.show()
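
The three characteristics listed above can also be checked numerically rather than visually. The sketch below measures the standard deviation of the non-outlier part after scaling, for 10 and for 20 outliers. The helper body_spread and the use of np.random.default_rng are assumptions for illustration, so the random values differ from those in the figures.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

rng = np.random.default_rng(0)
body = rng.normal(loc=1, scale=1, size=500)   # the data of interest

def body_spread(scaler, n_outliers):
    """Standard deviation of the non-outlier part after scaling."""
    x = np.append(body, [20] * n_outliers).reshape(-1, 1)
    scaled = scaler.fit_transform(x)
    return scaled[:len(body)].std()

spread = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    name = type(scaler).__name__
    spread[name] = (body_spread(scaler, 10), body_spread(scaler, 20))
    print("{:<15} 10 outliers: {:.3f}  20 outliers: {:.3f}".format(
        name, *spread[name]))
```

MinMaxScaler should give the same, very small spread in both cases, because only the value of the outliers matters. The spread under StandardScaler should shrink noticeably as outliers are added, while under RobustScaler it should change much less.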

RobustScaler

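2020-10-04 / tau

Overview

RobustScaler in the sklearn.preprocessing module standardizes each feature using its median (med_i), first quartile (q_1i) and third quartile (q_3i).

(1) $\begin{equation*} {F_i}^* = \frac{F_i - med_i}{q_{3i} - q_{1i}} \end{equation*}$

Behavior

The behavior when RobustScaler is applied to two features that follow different normal distributions is shown below. Features with different magnitudes and ranges end up with roughly the same spread around the origin after transformation.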

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler

rnd.seed(0)
x1 = rnd.normal(loc=2, scale=3, size=100)
x2 = rnd.normal(loc=7, scale=1, size=100)
X = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))

scaler = RobustScaler()
X_transformed = scaler.fit_transform(X)

fig = plt.figure(figsize=(9.6, 4.8))

ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 3)
ax3 = fig.add_subplot(1, 2, 2)

ax1.hist(X[:, 0], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 0")
ax1.hist(X[:, 1], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 1")
ax1.legend(loc='upper left')

ax2.hist(X_transformed[:, 0], range=(-3, 3), bins=40, ec='k', alpha=0.5,
    label="Feature 0")
ax2.hist(X_transformed[:, 1], range=(-3, 3), bins=40, ec='k', alpha=0.5,
    label="Feature 1")
ax2.legend(loc='upper left')

ax3.scatter(X[:, 0], X[:, 1], ec='k', fc='w', label="before transformation")
ax3.scatter(X_transformed[:, 0], X_transformed[:, 1], ec='k', fc='gray',
    label="after transformation")
ax3.set_aspect('equal')
ax3.set_xlim(-7, 10)
ax3.set_ylim(-7, 10)
ax3.set_xlabel("Feature 0")
ax3.set_ylabel("Feature 1")
ax3.legend()

plt.show()

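Let us check RobustScaler's calculation on simple data. The following example applies RobustScaler to eight values, which mimics eight samples with a single feature.

Among the parameters held in the instance, center_ is the median of the feature and scale_ is the third quartile minus the first quartile (the interquartile range); the output confirms that the feature is standardized with these.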

import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([2, 3, 4, 5, 6, 8, 10, 12])
print(np.percentile(x, q=[0, 25, 50, 75, 100]))

scaler = RobustScaler()
x_transformed = scaler.fit_transform(x.reshape(-1, 1))
print(x_transformed.reshape(-1))
print("centers:{}".format(scaler.center_))
print("scales :{}".format(scaler.scale_))

# [ 2.    3.75  5.5   8.5  12.  ]
# [-0.73684211 -0.52631579 -0.31578947 -0.10526316  0.10526316  0.52631579
#   0.94736842  1.36842105]
# centers:[5.5]
# scales :[4.75]
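
The same numbers can be reproduced by hand. The sketch below applies formula (1) with NumPy's median and quartiles and compares the result with RobustScaler; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([2, 3, 4, 5, 6, 8, 10, 12], dtype=float)

# First quartile, median and third quartile
q1, med, q3 = np.percentile(x, [25, 50, 75])

# Formula (1): subtract the median, divide by the interquartile range
manual = (x - med) / (q3 - q1)

by_sklearn = RobustScaler().fit_transform(x.reshape(-1, 1)).reshape(-1)

print(med, q3 - q1)                     # 5.5 4.75
print(np.allclose(manual, by_sklearn))  # True
```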

Characteristics

RobustScaler is robust to outliers, more so than StandardScaler.

StandardScaler

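2020-10-04 / tau

Overview

StandardScaler in the sklearn.preprocessing module standardizes each feature using its sample mean and sample variance.

Specifically, using the sample mean (m_i) and sample variance (v_i) of feature F_i, each feature F_i is converted to F_i^* by the following formula.

(1) $\begin{equation*} {F_i}^* = \frac{F_i -m_i}{\sqrt{v_i}} \end{equation*}$

Behavior

The behavior when StandardScaler is applied to two features that follow different normal distributions is shown below. Features with different magnitudes and ranges end up with roughly the same spread around the origin after transformation.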

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

rnd.seed(0)
x1 = rnd.normal(loc=2, scale=3, size=100)
x2 = rnd.normal(loc=7, scale=1, size=100)
X = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))

scaler = StandardScaler()
X_transformed = scaler.fit_transform(X)

fig = plt.figure(figsize=(9.6, 4.8))
fig.subplots_adjust(wspace=0.3)

ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 3)
ax3 = fig.add_subplot(1, 2, 2)

ax1.hist(X[:, 0], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 0")
ax1.hist(X[:, 1], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 1")
ax1.legend(loc='upper left')

ax2.hist(X_transformed[:, 0], range=(-3, 3), bins=40, ec='k', alpha=0.5,
    label="Feature 0")
ax2.hist(X_transformed[:, 1], range=(-3, 3), bins=40, ec='k', alpha=0.5,
    label="Feature 1")
ax2.legend(loc='upper left')

ax3.scatter(X[:, 0], X[:, 1], ec='k', fc='w', label="before transformation")
ax3.scatter(X_transformed[:, 0], X_transformed[:, 1], ec='k', fc='gray',
    label="after transformation")
ax3.set_aspect('equal')
ax3.set_xlim(-10, 10)
ax3.set_ylim(-10, 10)
ax3.set_xlabel("Feature 0")
ax3.set_ylabel("Feature 1")
ax3.legend()

plt.show()

Let us check StandardScaler's calculation on simple data. The following example applies StandardScaler to five values, which mimics five samples with a single feature.

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([1, 2, 3, 4, 5])

scaler = StandardScaler()
x_transformed = scaler.fit_transform(x.reshape(-1, 1))
print(x_transformed.reshape(-1))

print("mean_ :{}".format(scaler.mean_))
print("var_  :{}".format(scaler.var_))
print("scale_:{}".format(scaler.scale_))

# [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
# mean_ :[3.]
# var_  :[2.]
# scale_:[1.41421356]

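Among the parameters held in the instance, mean_ is the sample mean of the feature and var_ is the sample variance (the form that divides by n, not the unbiased variance). scale_ is the square root of var_.

A calculation confirms that each value is standardized by the following formula.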

(2) $\begin{equation*} {F_i}^* = \frac{F_i - \rm{mean\_}}{\rm{scale\_}} = \frac{F_i - \rm{mean\_}}{\sqrt{\rm{var\_}}} \end{equation*}$
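
As a sketch of that calculation, the block below reproduces formula (2) with NumPy. It relies on NumPy's default ddof=0, which divides by n like var_; the unbiased variance (ddof=1) is printed only for contrast.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([1, 2, 3, 4, 5], dtype=float)

scaler = StandardScaler().fit(x.reshape(-1, 1))

# Variance dividing by n (NumPy's default, ddof=0) matches var_
print(x.var(), scaler.var_[0])          # 2.0 2.0
# The unbiased variance (dividing by n - 1) is different
print(x.var(ddof=1))                    # 2.5

# Formula (2) by hand
manual = (x - x.mean()) / x.std()
by_sklearn = scaler.transform(x.reshape(-1, 1)).reshape(-1)
print(np.allclose(manual, by_sklearn))  # True
```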

Characteristics

StandardScaler is relatively robust to the influence of outliers.

MinMaxScaler

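2020-10-04 / tau

Overview

MinMaxScaler in the sklearn.preprocessing module transforms each feature so that it falls in the range 0 to 1. Specifically, using the minimum (min_i) and maximum (max_i) of feature F_i, each feature F_i is converted to F_i^* by the following formula.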

(1) $\begin{equation*} {F_i}^* = \frac{F_i - min_i}{max_i - min_i} \end{equation*}$
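
A minimal sketch of formula (1) on a hand-made array (an assumption for illustration): the column-wise minimum and maximum are taken with NumPy and the result is compared with MinMaxScaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Three samples (rows) with two features (columns) of different ranges
X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [5.0, 20.0]])

scaler = MinMaxScaler().fit(X)

# Column-wise minimum and maximum learned by fit()
print(scaler.data_min_, scaler.data_max_)

# Formula (1) by hand, feature by feature
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(manual)
print(np.allclose(manual, scaler.transform(X)))   # True
```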

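Behavior

The behavior when MinMaxScaler is applied to two features that follow different normal distributions is shown below. Features with different magnitudes and ranges both end up between 0 and 1 after transformation.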

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

rnd.seed(0)
x1 = rnd.normal(loc=1, scale=1, size=100)
x2 = rnd.normal(loc=3, scale=0.5, size=100)
X = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))

scaler = MinMaxScaler()
X_transformed = scaler.fit_transform(X)

fig = plt.figure(figsize=(9.6, 4.8))

ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 3)
ax3 = fig.add_subplot(1, 2, 2)

ax1.hist(X[:, 0], ec='k', range=(-2, 5), bins=40, alpha=0.5, label="Feature 0")
ax1.hist(X[:, 1], ec='k', range=(-2, 5), bins=40, alpha=0.5, label="Feature 1")
ax1.legend(loc='upper left')

ax2.hist(X_transformed[:, 0], range=(-0.2, 1.2), bins=40, ec='k', alpha=0.5,
    label="Feature 0")
ax2.hist(X_transformed[:, 1], range=(-0.2, 1.2), bins=40, ec='k', alpha=0.5,
    label="Feature 1")
ax2.legend(loc='upper left')

ax3.scatter(X[:, 0], X[:, 1], ec='k', fc='w', label="before transformation")
ax3.scatter(X_transformed[:, 0], X_transformed[:, 1], ec='k', fc='gray',
    label="after transformation")
ax3.set_aspect('equal')
ax3.set_xlim(-2, 5)
ax3.set_ylim(-2, 5)
ax3.set_xlabel("Feature 0")
ax3.set_ylabel("Feature 1")
ax3.legend()

plt.show()

Characteristics

MinMaxScaler is a simple method, but when outliers with extreme values occur, the scaling of the rest of the data can be affected.

scikit-learn – predict_proba

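2020-09-09 / tau

Overview

decision_function() expresses the confidence that each sample belongs to the predicted class, but it depends on the parameters of the hyperplane, and the relationship between its range or magnitude and the degree of confidence is not clear.

predict_proba(), in contrast, gives the probability that each sample belongs to each class as a real number between 0 and 1. For binary classification the resulting array has shape (n_samples, 2).

Behavior of `predict_proba()`

The following shows the probabilities obtained when two-class data generated by make_circles() is classified with gradient boosting. In each two-element row, the first element is the probability of class 0 (blue) and the second that of class 1 (orange); the two sum to 1. The np.set_printoptions(suppress=True) call makes ndarrays always print in fixed-point notation.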

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "orange"])[y]

X_train, X_test, y_train, y_test, y_train_named, y_test_named = \
    train_test_split(X, y, y_named, random_state=0)

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train_named)

pred_prob = gbc.predict_proba(X_test)
np.set_printoptions(suppress=True)
print(pred_prob)

# [[0.01573626 0.98426374]
#  [0.84575653 0.15424347]
#  [0.98112869 0.01887131]
#  .....
#  [0.06307595 0.93692405]
#  [0.93442475 0.06557525]
#  [0.86619957 0.13380043]]
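
The properties described above can be checked directly. The sketch below rebuilds the same model and confirms the shape of the result, that each row sums to 1, and that the columns follow the order of gbc.classes_, so the predicted class is the one with the larger probability.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "orange"])[y]
X_train, X_test, y_train_named, y_test_named = train_test_split(
    X, y_named, random_state=0)

gbc = GradientBoostingClassifier(random_state=0).fit(X_train, y_train_named)
pred_prob = gbc.predict_proba(X_test)

print(pred_prob.shape)                           # (25, 2)
print(gbc.classes_)                              # column order of pred_prob
print(np.allclose(pred_prob.sum(axis=1), 1.0))   # each row sums to 1

# The predicted class is the class whose column holds the larger probability
from_proba = gbc.classes_[pred_prob.argmax(axis=1)]
print(np.array_equal(from_proba, gbc.predict(X_test)))
```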

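Comparison with `decision_function()`

Continuing from the previous code, the following displays the probabilities from predict_proba(), the predicted class, the value of decision_function() and the correct class of each sample side by side. It confirms that the predicted class is the one with the larger probability, and that the prediction agrees with the sign of decision_function().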

from pandas import DataFrame

prob0 = pred_prob[:, 0]
prob1 = pred_prob[:, 1]

data = DataFrame()
data["prob0"] = prob0
data["prob1"] = prob1
data["pred"] = gbc.predict(X_test)
data["dec_func"] = gbc.decision_function(X_test)
data["correct"] = y_test_named
print(data)

#        prob0     prob1    pred  dec_func correct
# 0   0.015736  0.984264  orange  4.135926  orange
# 1   0.845757  0.154243    blue -1.701699    blue
# 2   0.981129  0.018871    blue -3.951061    blue
# .....
# 22  0.063076  0.936924  orange  2.698263  orange
# 23  0.934425  0.065575    blue -2.656733    blue
# 24  0.866200  0.133800    blue -1.867766    blue

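Sorting this data by the class-0 (blue) probability (prob0) and looking at its relationship with decision_function() shows the following.

When the blue-class probability is high, the decision_function confidence is negative with a large absolute value; when the orange-class probability is high, it is positive with a large absolute value.
When a blue-class probability and an orange-class probability are about the same size, the corresponding confidences have about the same absolute value and opposite signs.
The confidence is not linear in the probability.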

print(data.sort_values(by="prob0", ascending=False))

#        prob0     prob1    pred  dec_func correct
# 6   0.999543  0.000457    blue -7.690972    blue
# 10  0.998442  0.001558    blue -6.462560    blue
# 15  0.984817  0.015183    blue -4.172312  orange
# .....
# 0   0.015736  0.984264  orange  4.135926  orange
# 11  0.013521  0.986479  orange  4.289866  orange
# 4   0.013521  0.986479  orange  4.289866  orange

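Plotting the class-0 (blue) probability against the decision_function() confidence gives the figure below, which shows that the confidence is not a linear function of the probability.

To draw it, import matplotlib.pyplot, keep the sorted DataFrame in sorted_data, and add the following.

import matplotlib.pyplot as plt

# Sorted by the class-0 probability, as in the listing above
sorted_data = data.sort_values(by="prob0", ascending=False)

prob = np.array(sorted_data["prob0"])
conf = np.array(sorted_data["dec_func"])
fig = plt.figure()
ax = fig.add_subplot()
ax.plot(prob, conf)
ax.grid()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position(('data', 0.5))
ax.set_xlabel("class-0 probability", loc='left')
ax.set_ylabel("confidence", loc='bottom')
plt.show()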
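
The shape of this curve has a simple explanation, at least for this model. With GradientBoostingClassifier's default loss in a two-class problem, the class-1 probability appears to be the logistic sigmoid of the decision_function() value, which is consistent with the numbers printed above (for example, a confidence of 4.135926 corresponds to 0.984264). The sketch below checks this numerically; it describes this particular setup and is not a general statement about every classifier.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

dec = gbc.decision_function(X_test)        # confidence, one value per sample
prob1 = gbc.predict_proba(X_test)[:, 1]    # probability of class 1

# Logistic sigmoid of the confidence
sigmoid = 1.0 / (1.0 + np.exp(-dec))

print(np.allclose(sigmoid, prob1))
```

Under this relationship a confidence of 0 corresponds to a probability of 0.5, and the probability saturates toward 0 or 1 as the absolute value of the confidence grows.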

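Decision boundary

The following visualizes the probabilities computed by predict_proba(); compared with the decision_function() case, the distribution is intuitively easier to understand.

The values drawn as contours are the class-0 probabilities, taken from column 0 of the predict_proba() result in the statement that assigns proba.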

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

f0min, f0max = -1.5, 1.5
f1min, f1max = -1.75, 1.5

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)

X_train, X_test, y_train, y_test =\
    train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(random_state=0)
gb.fit(X_train, y_train)

fig, axs = plt.subplots(1, 2, figsize=(10, 4.8))
color0, color1 = 'tab:blue', 'tab:orange'

f0 = np.linspace(f0min, f0max, 200)
f1 = np.linspace(f1min, f1max, 200)
f0, f1 = np.meshgrid(f0, f1)
F = np.hstack((f0.reshape(-1, 1), f1.reshape(-1, 1)))

pred = gb.predict(F).reshape(f0.shape)
axs[0].contour(f0, f1, pred, levels=[0.5])
axs[0].contourf(f0, f1, pred, levels=1, colors=[color0, color1], alpha=0.25)

proba = gb.predict_proba(F)[:, 0].reshape(f0.shape)
print(proba.shape)
axs[1].contourf(f0, f1, proba, alpha=0.5, cmap='RdBu')

for ax in axs:
    ax.scatter(X_train[y_train==0][:, 0], X_train[y_train==0][:, 1], marker='o', fc=color0, ec='k', label="Train class 0")
    ax.scatter(X_test[y_test==0][:, 0], X_test[y_test==0][:, 1], marker='^', fc=color0, ec='k', label="Test class 0")
    ax.scatter(X_train[y_train==1][:, 0], X_train[y_train==1][:, 1], marker='o', fc=color1, ec='k', label="Train class 1")
    ax.scatter(X_test[y_test==1][:, 0], X_test[y_test==1][:, 1], marker='^', fc=color1, ec='k', label="Test class 1")

    ax.set_xlim(f0min, f0max)
    ax.set_ylim(f1min, f1max)
    ax.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)

handles, labels = axs[0].get_legend_handles_labels()

fig.legend(handles, labels, ncol=4, loc='upper center')
plt.show()

Three or more classes

Here we apply GradientBoostingClassifier to the three-class iris dataset and look at the output of predict_proba().

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from pandas import DataFrame

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

gbc = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbc.fit(X_train, y_train)

pred_proba = gbc.predict_proba(X_test)
df = DataFrame(pred_proba, columns=iris.target_names)
df["decision"] = np.argmax(pred_proba, axis=1)
df["prediction"] = gbc.predict(X_test)
print(df)

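The output of this code is shown below. A probability is returned for each of the three classes, and each row sums to 1. Unlike decision_function(), whose output is a one-dimensional array only in the two-class case, predict_proba() always returns an array of shape (number of samples, number of classes).

Note that line 17 of the listing, the one that assigns df["decision"], uses argmax to find the class with the highest probability for each sample.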

      setosa  versicolor  virginica  decision  prediction
0   0.102177    0.788400   0.109422         1           1
1   0.783471    0.109367   0.107161         0           0
2   0.098181    0.110059   0.791761         2           2
3   0.102177    0.788400   0.109422         1           1
4   0.103600    0.667239   0.229161         1           1
.....
33  0.783471    0.109367   0.107161         0           0
34  0.783471    0.109367   0.107161         0           0
35  0.101941    0.115024   0.783035         2           2
36  0.102177    0.788400   0.109422         1           1
37  0.783471    0.109367   0.107161         0           0
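The shape difference described above can be checked directly. The following is a minimal sketch rather than part of the original walkthrough: it assumes a two-class toy dataset from make_circles and the iris dataset, fits on the full data for brevity, and uses variable names (gb2, gb3, and so on) chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris, make_circles
from sklearn.ensemble import GradientBoostingClassifier

# Two-class toy data from make_circles and the three-class iris data.
X2, y2 = make_circles(noise=0.25, factor=0.5, random_state=1)
X3, y3 = load_iris(return_X_y=True)

# Illustrative models, fitted on the full data only to inspect output shapes.
gb2 = GradientBoostingClassifier(random_state=0).fit(X2, y2)
gb3 = GradientBoostingClassifier(random_state=0).fit(X3, y3)

# decision_function(): 1-D for two classes, 2-D for three classes.
dec2_shape = gb2.decision_function(X2).shape
dec3_shape = gb3.decision_function(X3).shape

# predict_proba(): (n_samples, n_classes) in both cases.
proba2 = gb2.predict_proba(X2)
proba3 = gb3.predict_proba(X3)

print("decision_function:", dec2_shape, dec3_shape)
print("predict_proba:    ", proba2.shape, proba3.shape)

# Each row of predict_proba() sums to 1.
rows_sum_to_one = (np.allclose(proba2.sum(axis=1), 1.0)
                   and np.allclose(proba3.sum(axis=1), 1.0))
print("rows sum to 1:", rows_sum_to_one)

# argmax returns a column index; classes_ maps that index back to a label.
labels_from_proba = gb3.classes_[np.argmax(proba3, axis=1)]
print("argmax agrees with predict():",
      np.array_equal(labels_from_proba, gb3.predict(X3)))
```

For iris the class labels are 0, 1 and 2, so the argmax index happens to equal the label. When labels are strings or non-consecutive integers, indexing classes_ as above is the safer way to recover them.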
