RobustScaler – TauStation

概要

sklearn.preprocessingモジュールのRobustScalerは、各特徴量の中央値(med_i)と第1-4分位数(q_1i)、第3-4分位数(q_3i)を用いて特徴量を標準化する。

(1) $\begin{equation*} {F_i}^* = \frac{F_i - med_i}{q_{3i} - q_{1i}} \end{equation*}$

挙動

それぞれ異なる正規分布に従う2つの特徴量について、RobustScalerを適用したときの挙動を以下に示す。異なる大きさとレンジの特徴量が、変換後には原点を中心としてほぼ同じような広がりになっているのがわかる。

コードは以下の通りで、データに対してfit()メソッドでスケールパラメーターを決定し、transform()メソッドで変換を行うところを、これらを連続して実行するfit_transform()メソッドを使っている。

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler

rnd.seed(0)
x1 = rnd.normal(loc=2, scale=3, size=100)
x2 = rnd.normal(loc=7, scale=1, size=100)
X = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))

scaler = RobustScaler()
X_transformed = scaler.fit_transform(X)

fig = plt.figure(figsize=(9.6, 4.8))

ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 3)
ax3 = fig.add_subplot(1, 2, 2)

ax1.hist(X[:, 0], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 0")
ax1.hist(X[:, 1], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 1")
ax1.legend(loc='upper left')

ax2.hist(X_transformed[:, 0], range=(-3, 3), bins=40, ec='k', alpha=0.5,
    label="Feature 0")
ax2.hist(X_transformed[:, 1], range=(-3, 3), bins=40, ec='k', alpha=0.5,
    label="Feature 1")
ax2.legend(loc='upper left')

ax3.scatter(X[:, 0], X[:, 1], ec='k', fc='w', label="before transformation")
ax3.scatter(X_transformed[:, 0], X_transformed[:, 1], ec='k', fc='gray',
    label="after transformation")
ax3.set_aspect('equal')
ax3.set_xlim(-7, 10)
ax3.set_ylim(-7, 10)
ax3.set_xlabel("Feature 0")
ax3.set_ylabel("Feature 1")
ax3.legend()

plt.show()

import numpy as np

import numpy.random as rnd

import matplotlib.pyplot as plt

from sklearn.preprocessing import RobustScaler

rnd.seed(0)

x1 = rnd.normal(loc=2, scale=3, size=100)

x2 = rnd.normal(loc=7, scale=1, size=100)

X = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))

scaler = RobustScaler()

X_transformed = scaler.fit_transform(X)

fig = plt.figure(figsize=(9.6, 4.8))

ax1 = fig.add_subplot(2, 2, 1)

ax2 = fig.add_subplot(2, 2, 3)

ax3 = fig.add_subplot(1, 2, 2)

ax1.hist(X[:, 0], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 0")

ax1.hist(X[:, 1], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 1")

ax1.legend(loc='upper left')

ax2.hist(X_transformed[:, 0], range=(-3, 3), bins=40, ec='k', alpha=0.5,

label="Feature 0")

ax2.hist(X_transformed[:, 1], range=(-3, 3), bins=40, ec='k', alpha=0.5,

label="Feature 1")

ax2.legend(loc='upper left')

ax3.scatter(X[:, 0], X[:, 1], ec='k', fc='w', label="before transformation")

ax3.scatter(X_transformed[:, 0], X_transformed[:, 1], ec='k', fc='gray',

label="after transformation")

ax3.set_aspect('equal')

ax3.set_xlim(-7, 10)

ax3.set_ylim(-7, 10)

ax3.set_xlabel("Feature 0")

ax3.set_ylabel("Feature 1")

ax3.legend()

plt.show()

簡単なデータでRobustScalerの計算過程を確認しておく。以下の例では5個のデータにRobustScalerを適用している。これは1つの特徴量を持つ5個のデータを模していることになる。

インスタンス内に保持されたパラメーターのうち、center_は特徴量の標本平均、scale_が第3-4分位数－第1-4分位数となっていて、これらで各特徴量が標準化されているのが確認できる。

import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([2, 3, 4, 5, 6, 8, 10, 12])
print(np.percentile(x, q=[0, 25, 50, 75, 100]))

scaler = RobustScaler()
x_transformed = scaler.fit_transform(x.reshape(-1, 1))
print(x_transformed.reshape(-1))
print("centers:{}".format(scaler.center_))
print("scales :{}".format(scaler.scale_))

# [ 2.    3.75  5.5   8.5  12.  ]
# [-0.73684211 -0.52631579 -0.31578947 -0.10526316  0.10526316  0.52631579
#   0.94736842  1.36842105]
# centers:[5.5]
# scales :[4.75]

import numpy as np

from sklearn.preprocessing import RobustScaler

x = np.array([2, 3, 4, 5, 6, 8, 10, 12])

print(np.percentile(x, q=[0, 25, 50, 75, 100]))

scaler = RobustScaler()

x_transformed = scaler.fit_transform(x.reshape(-1, 1))

print(x_transformed.reshape(-1))

print("centers:{}".format(scaler.center_))

print("scales :{}".format(scaler.scale_))

# [ 2. 3.75 5.5 8.5 12. ]

# [-0.73684211 -0.52631579 -0.31578947 -0.10526316 0.10526316 0.52631579

# 0.94736842 1.36842105]

# centers:[5.5]

# scales :[4.75]

特徴

RobustScalerは異常値に対して頑健であり、StandardScalerより頑健性が高い。

概要

挙動

特徴

コメントを残す コメントをキャンセル

コメントを残すコメントをキャンセル