Iterators cannot be reused

2020-10-06 / tau

An object created as an iterator can be assigned to a variable and iterated over, but it cannot simply be iterated over a second time.

from itertools import repeat

rpt = repeat("Ha", 3)

for x in rpt:
    print(x, end="")
print()
# HaHaHa

for x in rpt:
    print(x)
print()
# nothing displayed

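An iterator is initialized by its __init__() method when the instance is created, and afterwards it keeps the state it had at the moment its use as an iterator ended. Strictly speaking, reuse is not prohibited; rather, the iterator cannot be run again from its initial state.

In the next example, for instance, using the iterator again after the first loop has finished is not impossible. As the output shows, however, the second loop starts from 5, not from 4, the value that triggered the break in the first loop.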

from itertools import count

cnt = count(0)

for x in cnt:
    if x > 3: break
    print(x, end=" ")
print()
# 0 1 2 3 

for x in cnt:
    if x > 7: break
    print(x, end=" ")
print()
# 5 6 7

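This happens because the first loop had already fetched 4 from the iterator when it evaluated the break condition; that value was consumed and discarded, so the next value the iterator yields is 5.

Since reusing an iterator can behave unexpectedly like this, it seems better to avoid it.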
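
One way to avoid the problem is sketched below: create a fresh iterator each time, store the values in a list, or take a bounded slice with itertools.islice() instead of breaking out of a loop. The helper name make_rpt is hypothetical and used only for illustration.

```python
from itertools import count, islice, repeat

def make_rpt():
    # Hypothetical helper: returns a brand-new iterator on every call.
    return repeat("Ha", 3)

first = "".join(make_rpt())
second = "".join(make_rpt())   # a fresh iterator starts from the beginning

# Alternatively, store the values in a list, which can be traversed repeatedly.
values = list(repeat("Ha", 3))
third = "".join(values)
fourth = "".join(values)

# For an infinite iterator such as count(), islice() takes exactly n values,
# so nothing is fetched and then thrown away as with a break.
cnt = count(0)
head = list(islice(cnt, 4))
tail = list(islice(cnt, 4))

print(first, second, third, fourth)
print(head, tail)
```

With islice() the second slice continues from 4, because no value was consumed by a termination check.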

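preprocessor – robustness to outliers

2020-10-04 / tau

Some of the algorithms used to preprocess data before it is fed to a machine learning model are easily affected by outliers.

Suppose, for example, that the data are distributed as in the left panel of the figure below (500 random samples drawn from a normal distribution with mean 1 and variance 1). If 10 outliers with the value 20 are then added, the overall distribution becomes the one in the right panel.

The results of transforming this data with MinMaxScaler, StandardScaler and RobustScaler are shown below. For StandardScaler and RobustScaler the outliers are not drawn; only the range covering the original normal distribution is displayed.

With MinMaxScaler (left), the range including the outliers is mapped to 0 to 1, so the main body of normally distributed data is concentrated in small values near 0. As a result, the main body, which is what should drive the accuracy of learning, may not be separated well enough.

With StandardScaler (center) and RobustScaler (right), the main body keeps roughly the form of the original normal distribution, which shows that these two are robust.

Next, the number of outliers is increased from 10 to 20 and the same three transformations are applied.

With MinMaxScaler (left), the range is determined solely by the outlier value, regardless of how many outliers there are, so the original distribution is still squeezed in near 0.

With StandardScaler (center), the distribution looks slightly different from the 10-outlier case and its range is narrower, because the additional outliers shift the mean and inflate the standard deviation used for scaling.

With RobustScaler (right), the shape of the original distribution has not changed much.

From the above, the three transformers have at least the following characteristics.

MinMaxScaler may narrow the range of the data that actually needs to be analyzed when outliers are present
StandardScaler is not strongly affected by outliers, but the main body of the distribution shifts somewhat depending on their magnitude and frequency
RobustScaler robustly preserves the characteristics of the original data unless the number of outliers is extremely large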
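
The three points above can also be checked numerically rather than by eye. The following is a minimal sketch under the same data assumptions as the figures (500 normal samples plus 10 or 20 outliers of value 20); body_spread is a hypothetical helper that returns the standard deviation of the main body after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.RandomState(0)
body = rng.normal(loc=1, scale=1, size=500)   # main body of the data

def body_spread(scaler, n_outliers):
    """Standard deviation of the main body after scaling body + outliers."""
    x = np.append(body, [20] * n_outliers).reshape(-1, 1)
    scaled = scaler.fit_transform(x).ravel()
    return scaled[:body.size].std()   # the first 500 values are the main body

spread = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    name = type(scaler).__name__
    # Spread of the main body with 10 outliers and with 20 outliers
    spread[name] = (body_spread(scaler, 10), body_spread(scaler, 20))
    print("{:<14}: {:.3f} -> {:.3f}".format(name, *spread[name]))
```

A small, unchanged spread for MinMaxScaler, a shrinking spread for StandardScaler and a nearly constant spread for RobustScaler would correspond to the three observations.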

The code used to draw the figures above is as follows.

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

rnd.seed(0)
x = rnd.normal(loc=1, scale=1, size=500)
x1 = np.append(x, [20] * 10)
x2 = np.append(x, [20] * 20)

scaler = MinMaxScaler()
x1_scaled_by_minmax = scaler.fit_transform(x1.reshape(-1, 1))
x2_scaled_by_minmax = scaler.fit_transform(x2.reshape(-1, 1))

scaler = StandardScaler()
x1_scaled_by_standard = scaler.fit_transform(x1.reshape(-1, 1))
x2_scaled_by_standard = scaler.fit_transform(x2.reshape(-1, 1))

scaler = RobustScaler()
x1_scaled_by_robust = scaler.fit_transform(x1.reshape(-1, 1))
x2_scaled_by_robust = scaler.fit_transform(x2.reshape(-1, 1))

fig0, axes = plt.subplots(1, 2, figsize=(12.8, 4.8))
axes[0].hist(x1, ec='k', bins=10, range=(-2, 4))
axes[1].hist(x1, ec='k', bins=40)

fig1, axes = plt.subplots(1, 3, figsize=(18.6, 4.8))

ax = axes[0]
ax.hist(x1_scaled_by_minmax, ec='k', bins=40)
ax.set_title("MinMaxScaler")

ax = axes[1]
ax.hist(x1_scaled_by_standard, ec='k', bins=10, range=(-1.5, 1))
ax.set_title("StandardScaler")

ax = axes[2]
ax.hist(x1_scaled_by_robust, ec='k', bins=10, range=(-2.5, 2.5))
ax.set_title("RobustScaler")

fig2, axes = plt.subplots(1, 3, figsize=(18.6, 4.8))

ax = axes[0]
ax.hist(x2_scaled_by_minmax, ec='k', bins=40)
ax.set_title("MinMaxScaler")

ax = axes[1]
ax.hist(x2_scaled_by_standard, ec='k', bins=10, range=(-1.5, 1))
ax.set_title("StandardScaler")

ax = axes[2]
ax.hist(x2_scaled_by_robust, ec='k', bins=10, range=(-2.5, 2.5))
ax.set_title("RobustScaler")

plt.show()

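RobustScaler

2020-10-04 / tau

Overview

RobustScaler in the sklearn.preprocessing module standardizes each feature using its median (med_i), first quartile (q_1i) and third quartile (q_3i).

(1) $ {F_i}^* = \frac{F_i - med_i}{q_{3i} - q_{1i}} $

Behavior

The behavior of RobustScaler applied to two features drawn from different normal distributions is shown below. Features with different magnitudes and ranges end up centered on the origin with roughly the same spread after the transformation.

The code is as follows. Instead of determining the scale parameters with fit() and then transforming the data with transform(), it uses fit_transform(), which runs the two in sequence.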

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler

rnd.seed(0)
x1 = rnd.normal(loc=2, scale=3, size=100)
x2 = rnd.normal(loc=7, scale=1, size=100)
X = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))

scaler = RobustScaler()
X_transformed = scaler.fit_transform(X)

fig = plt.figure(figsize=(9.6, 4.8))

ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 3)
ax3 = fig.add_subplot(1, 2, 2)

ax1.hist(X[:, 0], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 0")
ax1.hist(X[:, 1], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 1")
ax1.legend(loc='upper left')

ax2.hist(X_transformed[:, 0], range=(-3, 3), bins=40, ec='k', alpha=0.5,
    label="Feature 0")
ax2.hist(X_transformed[:, 1], range=(-3, 3), bins=40, ec='k', alpha=0.5,
    label="Feature 1")
ax2.legend(loc='upper left')

ax3.scatter(X[:, 0], X[:, 1], ec='k', fc='w', label="before transformation")
ax3.scatter(X_transformed[:, 0], X_transformed[:, 1], ec='k', fc='gray',
    label="after transformation")
ax3.set_aspect('equal')
ax3.set_xlim(-7, 10)
ax3.set_ylim(-7, 10)
ax3.set_xlabel("Feature 0")
ax3.set_ylabel("Feature 1")
ax3.legend()

plt.show()

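Let us check the calculation with simple data. The following example applies RobustScaler to 8 data points, which represent 8 samples with a single feature.

Among the parameters stored in the instance, center_ is the median of the feature and scale_ is the third quartile minus the first quartile; the output confirms that the feature is standardized with these values.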

import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([2, 3, 4, 5, 6, 8, 10, 12])
print(np.percentile(x, q=[0, 25, 50, 75, 100]))

scaler = RobustScaler()
x_transformed = scaler.fit_transform(x.reshape(-1, 1))
print(x_transformed.reshape(-1))
print("centers:{}".format(scaler.center_))
print("scales :{}".format(scaler.scale_))

# [ 2.    3.75  5.5   8.5  12.  ]
# [-0.73684211 -0.52631579 -0.31578947 -0.10526316  0.10526316  0.52631579
#   0.94736842  1.36842105]
# centers:[5.5]
# scales :[4.75]
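
The same result can be reproduced by hand from formula (1). This is a minimal sketch that assumes the median and the quartiles are computed with NumPy's default linear interpolation, which matches the values printed above.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([2, 3, 4, 5, 6, 8, 10, 12], dtype=float)

med = np.median(x)                     # corresponds to center_
q1, q3 = np.percentile(x, [25, 75])    # q3 - q1 corresponds to scale_
manual = (x - med) / (q3 - q1)         # formula (1) applied directly

scaler = RobustScaler()
by_sklearn = scaler.fit_transform(x.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, by_sklearn))
```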

Characteristics

RobustScaler is robust to outliers, more so than StandardScaler.

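StandardScaler

2020-10-04 / tau

Overview

StandardScaler in the sklearn.preprocessing module standardizes each feature using its sample mean and sample variance.

Specifically, each feature F_i is transformed into F_i^* from its sample mean (m_i) and sample variance (v_i) by the following formula.

(1) $ {F_i}^* = \frac{F_i - m_i}{\sqrt{v_i}} $

Behavior

The behavior of StandardScaler applied to two features drawn from different normal distributions is shown below. Features with different magnitudes and ranges end up centered on the origin with roughly the same spread after the transformation.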

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

rnd.seed(0)
x1 = rnd.normal(loc=2, scale=3, size=100)
x2 = rnd.normal(loc=7, scale=1, size=100)
X = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))

scaler = StandardScaler()
X_transformed = scaler.fit_transform(X)

fig = plt.figure(figsize=(9.6, 4.8))
fig.subplots_adjust(wspace=0.3)

ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 3)
ax3 = fig.add_subplot(1, 2, 2)

ax1.hist(X[:, 0], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 0")
ax1.hist(X[:, 1], ec='k', range=(-5, 10), bins=40, alpha=0.5, label="Feature 1")
ax1.legend(loc='upper left')

ax2.hist(X_transformed[:, 0], range=(-3, 3), bins=40, ec='k', alpha=0.5,
    label="Feature 0")
ax2.hist(X_transformed[:, 1], range=(-3, 3), bins=40, ec='k', alpha=0.5,
    label="Feature 1")
ax2.legend(loc='upper left')

ax3.scatter(X[:, 0], X[:, 1], ec='k', fc='w', label="before transformation")
ax3.scatter(X_transformed[:, 0], X_transformed[:, 1], ec='k', fc='gray',
    label="after transformation")
ax3.set_aspect('equal')
ax3.set_xlim(-10, 10)
ax3.set_ylim(-10, 10)
ax3.set_xlabel("Feature 0")
ax3.set_ylabel("Feature 1")
ax3.legend()

plt.show()

Let us check the calculation with simple data. The following example applies StandardScaler to 5 data points, which represent 5 samples with a single feature.

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([1, 2, 3, 4, 5])

scaler = StandardScaler()
x_transformed = scaler.fit_transform(x.reshape(-1, 1))
print(x_transformed.reshape(-1))

print("mean_ :{}".format(scaler.mean_))
print("var_  :{}".format(scaler.var_))
print("scale_:{}".format(scaler.scale_))

# [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
# mean_ :[3.]
# var_  :[2.]
# scale_:[1.41421356]

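Among the parameters stored in the instance, mean_ is the sample mean of the feature and var_ is the sample variance (the form that divides by n, not the unbiased variance). scale_ is the square root of var_.

The calculation confirms that each feature value is standardized by the following formula.

(2) $ {F_i}^* = \frac{F_i - \mathrm{mean\_}}{\mathrm{scale\_}} = \frac{F_i - \mathrm{mean\_}}{\sqrt{\mathrm{var\_}}} $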
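
Formula (2) can be verified by hand with NumPy. The sketch below uses the same five values and contrasts the variance divided by n (ddof=0) with the unbiased variance (ddof=1), which is shown only for comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([1, 2, 3, 4, 5], dtype=float)

mean = x.mean()
var_population = x.var(ddof=0)   # divides by n, matching var_
var_unbiased = x.var(ddof=1)     # divides by n - 1, shown only for contrast
manual = (x - mean) / np.sqrt(var_population)

scaler = StandardScaler()
by_sklearn = scaler.fit_transform(x.reshape(-1, 1)).ravel()

print(var_population, var_unbiased)
print(np.allclose(manual, by_sklearn))
```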

Characteristics

StandardScaler is relatively robust to the influence of outliers.

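MinMaxScaler

2020-10-04 / tau

Overview

MinMaxScaler in the sklearn.preprocessing module transforms each feature so that it falls within the range 0 to 1. Specifically, each feature F_i is transformed into F_i^* from its minimum (min_i) and maximum (max_i) by the following formula.

(1) $ {F_i}^* = \frac{F_i - min_i}{max_i - min_i} $

Behavior

The behavior of MinMaxScaler applied to two features drawn from different normal distributions is shown below. Features with different magnitudes and ranges both end up between 0 and 1 after the transformation.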

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

rnd.seed(0)
x1 = rnd.normal(loc=1, scale=1, size=100)
x2 = rnd.normal(loc=3, scale=0.5, size=100)
X = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))

scaler = MinMaxScaler()
X_transformed = scaler.fit_transform(X)

fig = plt.figure(figsize=(9.6, 4.8))

ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 3)
ax3 = fig.add_subplot(1, 2, 2)

ax1.hist(X[:, 0], ec='k', range=(-2, 5), bins=40, alpha=0.5, label="Feature 0")
ax1.hist(X[:, 1], ec='k', range=(-2, 5), bins=40, alpha=0.5, label="Feature 1")
ax1.legend(loc='upper left')

ax2.hist(X_transformed[:, 0], range=(-0.2, 1.2), bins=40, ec='k', alpha=0.5,
    label="Feature 0")
ax2.hist(X_transformed[:, 1], range=(-0.2, 1.2), bins=40, ec='k', alpha=0.5,
    label="Feature 1")
ax2.legend(loc='upper left')

ax3.scatter(X[:, 0], X[:, 1], ec='k', fc='w', label="before transformation")
ax3.scatter(X_transformed[:, 0], X_transformed[:, 1], ec='k', fc='gray',
    label="after transformation")
ax3.set_aspect('equal')
ax3.set_xlim(-2, 5)
ax3.set_ylim(-2, 5)
ax3.set_xlabel("Feature 0")
ax3.set_ylabel("Feature 1")
ax3.legend()

plt.show()

Characteristics

MinMaxScaler is a simple method, but when an outlier far from the rest of the values occurs, the original data can be affected by it.
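
A small hand calculation illustrates both the formula and the effect of an outlier. In this sketch the data values and the helper minmax are made up for illustration; the helper applies formula (1) directly to a single feature.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_out = np.append(x, 100.0)          # the same data plus one extreme value

def minmax(v):
    # Direct implementation of formula (1) for a single feature.
    return (v - v.min()) / (v.max() - v.min())

scaled = minmax(x)
scaled_out = minmax(x_out)

# Cross-check against MinMaxScaler itself.
by_sklearn = MinMaxScaler().fit_transform(x_out.reshape(-1, 1)).ravel()

print(scaled)
print(scaled_out)
```

Without the outlier the five values are spread evenly over 0 to 1; with it, the same five values are compressed into a narrow band near 0.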

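matplotlib.pyplot – combining graphs in a non-grid layout

2020-10-04 / tau

Normally, when the drawing areas are specified with pyplot.subplots() or Figure.add_subplot(), the axes are laid out on a grid of m rows and n columns.

Sometimes, however, a different layout is wanted: for example, two axes in the first row and one full-width axes in the second row, or one axes spanning two rows in the first column and two axes stacked vertically in the second column.

One way to do this is to change the grid configuration itself in each call to Figure.add_subplot().

The following example places two graphs side by side in the first row and one full-width axes in the second row.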

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 100)

fig = plt.figure()

ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 1, 2)

plt.show()

The next example places one axes occupying two rows in the first column and two axes stacked vertically in the second column.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 100)

fig = plt.figure()

ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 4)

plt.show()
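
As an alternative to mixing grid specifications, the same kind of layout can be described on a single grid with GridSpec slicing. This is a different technique from the one used above, sketched here for the first layout (two axes on top, one full-width axes below); the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, assumed for this sketch only
import matplotlib.pyplot as plt

fig = plt.figure()
gs = fig.add_gridspec(2, 2)          # one 2x2 grid shared by all axes

ax1 = fig.add_subplot(gs[0, 0])      # upper left
ax2 = fig.add_subplot(gs[0, 1])      # upper right
ax3 = fig.add_subplot(gs[1, :])      # whole second row

# Width of each axes in figure coordinates
widths = [ax.get_position().width for ax in (ax1, ax2, ax3)]
print(widths)
```

The slice gs[1, :] plays the role of add_subplot(2, 1, 2) in the earlier example.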

Formatting ndarrays – printoptions

2020-10-02 / tau

Overview

When printing an array, it is easy to get the format specification wrong, for example as follows.

import numpy as np

a = np.array([0.0123, 1.2345, 12.3456789])
print("{:.3f}".format(a))
# TypeError: unsupported format string passed to numpy.ndarray.__format__

To print an array with a format applied to each element, the format method cannot be applied to the array itself; NumPy's set_printoptions is used instead.

`get_printoptions()`

The list of array formatting options is obtained with numpy.get_printoptions(). The options are stored as a dictionary.

import numpy as np

options = np.get_printoptions()

for k, v in options.items():
    print("{:<9}: {}".format(k, v))

# edgeitems: 3
# threshold: 1000
# floatmode: maxprec
# precision: 8
# suppress : False
# linewidth: 75
# nanstr   : nan
# infstr   : inf
# sign     : -
# formatter: None
# legacy   : False

`set_printoptions()`

To set these options individually, pass the key and value to the numpy.set_printoptions() function.

numpy.set_printoptions([key]=[value])
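
Note that set_printoptions() changes the options globally, so they stay in effect for every later print. When the change is needed only temporarily, the numpy.printoptions() context manager can be used instead; the following is a minimal sketch.

```python
import numpy as np

a = np.array([0.0123, 1.2345, 12.3456789])
before = np.get_printoptions()

# Options given here apply only inside the with-block.
with np.printoptions(precision=3, suppress=True):
    inside = np.get_printoptions()["precision"]
    print(a)

# On leaving the block, the previous options are restored.
after = np.get_printoptions()
print(before["precision"], inside, after["precision"])
```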

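Some options that are likely to be used often are summarized below.

Abbreviated display

`threshold` and `edgeitems`

When the total number of elements in the array exceeds the value given as threshold, the display is abbreviated.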

np.set_printoptions(threshold=20)

print(np.arange(20))
# [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

print(np.arange(21))
# [ 0  1  2 ... 18 19 20]

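edgeitems specifies how many elements are shown at the beginning and at the end of each axis (columns and rows) when the display is abbreviated.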

np.set_printoptions(edgeitems=5)

print(np.arange(40))

# [ 0  1  2  3  4 ... 35 36 37 38 39]

With threshold=0, an axis is always abbreviated as soon as its length exceeds twice the edgeitems value (with the default edgeitems=3, from 7 elements upward).

np.set_printoptions(threshold=0, edgeitems=3)

print(np.arange(6))
# [0 1 2 3 4 5]

print(np.arange(7))
# [0 1 2 ... 4 5 6]

The rows of a two-dimensional array are abbreviated under the same condition.

print(np.arange(36).reshape(6, 6))

# [[ 0  1  2  3  4  5]
#  [ 6  7  8  9 10 11]
#  [12 13 14 15 16 17]
#  [18 19 20 21 22 23]
#  [24 25 26 27 28 29]
#  [30 31 32 33 34 35]]

print(np.arange(49).reshape(7, 7))

# [[ 0  1  2 ...  4  5  6]
#  [ 7  8  9 ... 11 12 13]
#  [14 15 16 ... 18 19 20]
#  ...
#  [28 29 30 ... 32 33 34]
#  [35 36 37 ... 39 40 41]
#  [42 43 44 ... 46 47 48]]

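Number format

`suppress`

By default, when the array contains a value of small magnitude it is displayed in exponential notation, and if even one element needs exponential notation, all elements are displayed that way.

Specifying suppress=True forces fixed-point notation.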

import numpy as np

a = np.array([0.0000123, 0.123, 12.3])

print(a)
# [1.23e-05 1.23e-01 1.23e+01]

np.set_printoptions(suppress=True)

print(a)
# [ 0.0000123  0.123     12.3      ]

`precision`

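precision specifies the number of digits of precision: the number of digits after the decimal point in fixed-point notation, and the number of digits after the decimal point of the mantissa in exponential notation.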

import numpy as np

a = np.array([x / 7 for x in [0.1, 1, 10, 100]])
print(a)
# [ 0.01428571  0.14285714  1.42857143 14.28571429]

np.set_printoptions(precision=3)
print(a)
# [ 0.014  0.143  1.429 14.286]

b = np.array([x / 7 for x in [0.01, 1, 10, 100]])
print(b)
# [1.429e-03 1.429e-01 1.429e+00 1.429e+01]

`floatmode`

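floatmode takes a keyword that selects one of several predefined formats.

The behavior of each keyword is checked with the following arrays. The elements of array a all have fewer digits than the precision setting, while array b has an element with more digits than precision, which is rounded by the default precision=8.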

a = np.array([0.1, 0.123, 0.123456])
b = np.array([0.1, 0.12345, 0.123456789])
print("default      :{}".format(a))
print("             :{}".format(b))

# default      :[0.1      0.123    0.123456]
#              :[0.1        0.12345    0.12345679]

`maxprec`

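The default setting. Each element is displayed with as many digits as it needs, up to precision. In both arrays the column width is set by the last element, which has the most digits, and no zero padding is done. Since this is the default, the result is the same as above.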

np.set_printoptions(floatmode='maxprec')
print("maxprec      :{}".format(a))
print("             :{}".format(b))

# maxprec      :[0.1      0.123    0.123456]
#              :[0.1        0.12345    0.12345679]

`maxprec_equal`

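Whereas maxprec does not pad with zeros, maxprec_equal unifies all elements to the number of digits of the most precise element and pads with zeros (the meaning of "equal" is vague; a name like maxprec_zero would have been clearer).

np.set_printoptions(floatmode='maxprec_equal')
print("maxprec_equal:{}".format(a))
print("             :{}".format(b))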

# maxprec_equal:[0.100000 0.123000 0.123456]
#              :[0.10000000 0.12345000 0.12345679]

`fixed`

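The precision of every element is unified to precision, and elements with fewer digits are padded with zeros. In the example below, all elements of both arrays are unified to 8 decimal places and padded with zeros.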

np.set_printoptions(floatmode='fixed')
print("fixed        :{}".format(a))
print("             :{}".format(b))

# fixed        :[0.10000000 0.12300000 0.12345600]
#              :[0.10000000 0.12345000 0.12345679]

`unique`

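precision is ignored; each element keeps as many digits as it needs to be represented uniquely, and the width is unified to that of the most precise element. Note that the last element of array b is not rounded.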

np.set_printoptions(floatmode='unique')
print("unique       :{}".format(a))
print("             :{}".format(b))

# unique       :[0.1      0.123    0.123456]
#              :[0.1         0.12345     0.123456789]

`formatter`

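A format specification string and its format method are passed to set an arbitrary format, as follows.

formatter={'type name': "{:format spec}".format}

Besides 'int' and 'float', the type name 'numpystr' can be used for strings.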

import numpy as np

a = np.array([0.0123, 1.2345, 12.3456789])

np.set_printoptions(formatter={'float' : "{:10.5f}".format})
print(a)

np.set_printoptions(formatter={'float' : "{:15.7e}".format})
print(a)

# [   0.01230    1.23450   12.34568]
# [  1.2300000e-02   1.2345000e+00   1.2345679e+01]
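
The 'int' and 'numpystr' keys work the same way as 'float'. The sketch below passes the formatter to numpy.array2string() rather than set_printoptions(), a choice made here only to avoid changing the global options; the sample arrays are made up for illustration.

```python
import numpy as np

n = np.array([1, 23, 456])
s = np.array(["a", "bc"])

# 'int' applies to integer arrays; zero-pad every element to 5 digits.
int_text = np.array2string(n, formatter={'int': "{:05d}".format})

# 'numpystr' applies to string arrays; wrap every element in angle brackets.
str_text = np.array2string(s, formatter={'numpystr': "<{}>".format})

print(int_text)
print(str_text)
```

A formatter set with set_printoptions() stays active until it is reset, for example with np.set_printoptions(formatter=None).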

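scikit-learn – predict_proba

2020-09-09 / tau

Overview

decision_function() expresses the confidence that each sample belongs to the predicted class, but it depends on the parameters of the hyperplane, and the relationship between its range or magnitude and the degree of confidence is not clear.

In contrast, predict_proba gives, for each sample, the probability of belonging to each class as a real number between 0 and 1. For two-class classification, the resulting array has the shape (n_samples, 2).

Behavior of `predict_proba()`

The following shows the confidence obtained when two-class data generated with make_circles() are classified by Gradient Boosting. In the two-element array for each sample, the first element is the probability of belonging to class 0 (blue) and the second the probability of class 1 (orange); the two sum to 1. Calling np.set_printoptions(suppress=True) before printing forces the ndarray to be displayed in fixed-point notation.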

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "orange"])[y]

X_train, X_test, y_train, y_test, y_train_named, y_test_named = \
    train_test_split(X, y, y_named, random_state=0)

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train_named)

pred_prob = gbc.predict_proba(X_test)
np.set_printoptions(suppress=True)
print(pred_prob)

# [[0.01573626 0.98426374]
#  [0.84575653 0.15424347]
#  [0.98112869 0.01887131]
#  .....
#  [0.06307595 0.93692405]
#  [0.93442475 0.06557525]
#  [0.86619957 0.13380043]]
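
Which column belongs to which class can be read from the classes_ attribute of the fitted model. The following sketch repeats the setup above in a self-contained form and checks the two properties just described: each row sums to 1, and the column with the larger probability is the predicted class.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "orange"])[y]
X_train, X_test, y_train_named, y_test_named = \
    train_test_split(X, y_named, random_state=0)

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train_named)

pred_prob = gbc.predict_proba(X_test)

# Column j of predict_proba() corresponds to gbc.classes_[j].
print(gbc.classes_)

# Each row sums to 1, and the column with the larger probability
# gives the class returned by predict().
row_sums = pred_prob.sum(axis=1)
by_argmax = gbc.classes_[np.argmax(pred_prob, axis=1)]
print(np.allclose(row_sums, 1.0))
print(np.array_equal(by_argmax, gbc.predict(X_test)))
```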

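Comparison with `decision_function()`

Appending the following to the previous code displays, side by side, the probabilities from predict_proba(), the predicted class, the value of decision_function() and the correct class of each sample. The output confirms that the predicted class is the one with the larger probability and that the prediction agrees with the sign of decision_function().

from pandas import DataFrame

prob0 = pred_prob[:, 0]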
prob1 = pred_prob[:, 1]

data = DataFrame()
data["prob0"] = prob0
data["prob1"] = prob1
data["pred"] = gbc.predict(X_test)
data["dec_func"] = gbc.decision_function(X_test)
data["correct"] = y_test_named
print(data)

#        prob0     prob1    pred  dec_func correct
# 0   0.015736  0.984264  orange  4.135926  orange
# 1   0.845757  0.154243    blue -1.701699    blue
# 2   0.981129  0.018871    blue -3.951061    blue
# .....
# 22  0.063076  0.936924  orange  2.698263  orange
# 23  0.934425  0.065575    blue -2.656733    blue
# 24  0.866200  0.133800    blue -1.867766    blue

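Sorting this data by the probability of class 0 (blue) (prob0) and looking at the relationship with decision_function() reveals the following.

When the probability of the blue class is high, the decision_function confidence is negative with a large absolute value; when the probability of the orange class is high, the confidence is positive with a large absolute value
When the blue probability of one sample is about the same as the orange probability of another, their confidence values have about the same absolute value and opposite signs
The confidence is not linear in the probability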

print(data.sort_values(by="prob0", ascending=False))

#        prob0     prob1    pred  dec_func correct
# 6   0.999543  0.000457    blue -7.690972    blue
# 10  0.998442  0.001558    blue -6.462560    blue
# 15  0.984817  0.015183    blue -4.172312  orange
# .....
# 0   0.015736  0.984264  orange  4.135926  orange
# 11  0.013521  0.986479  orange  4.289866  orange
# 4   0.013521  0.986479  orange  4.289866  orange

Plotting the relationship between the class 0 (blue) probability and the decision_function() confidence gives the figure below, which shows that the confidence is not linear in the probability.

For the code, import matplotlib.pyplot and add the following.

# Sort by the class 0 probability so that the line is drawn in order
sorted_data = data.sort_values(by="prob0", ascending=False)
prob = np.array(sorted_data["prob0"])
conf = np.array(sorted_data["dec_func"])
fig = plt.figure()
ax = fig.add_subplot()
ax.plot(prob, conf)
ax.grid()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position(('data', 0.5))
ax.set_xlabel("class-0 probability", loc='left')
ax.set_ylabel("confidence", loc='bottom')
plt.show()
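
The printed values suggest that the curve is a logistic sigmoid: for example, a dec_func of 4.135926 gives 1 / (1 + exp(-4.135926)) ≈ 0.984264, which equals prob1 for that sample. The sketch below checks this for all test samples, assuming GradientBoostingClassifier with its default loss for two-class problems; with other losses or other classifiers the relationship may differ.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)

dec = gbc.decision_function(X_test)
prob1 = gbc.predict_proba(X_test)[:, 1]

# Logistic sigmoid of the decision_function() value
sigmoid = 1.0 / (1.0 + np.exp(-dec))

print(np.allclose(prob1, sigmoid))
```

Since prob0 = 1 - prob1, the class 0 probability plotted above follows the mirrored curve 1 / (1 + exp(dec_func)).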

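Decision boundary

The following visualizes the probability computed by predict_proba(); compared with decision_function(), the distribution is easier to grasp intuitively.

For the values shown in the contour plot, the line that assigns proba takes column 0 of the predict_proba() result, that is, the probability of class 0.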

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

f0min, f0max = -1.5, 1.5
f1min, f1max = -1.75, 1.5

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)

X_train, X_test, y_train, y_test =\
    train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(random_state=0)
gb.fit(X_train, y_train)

fig, axs = plt.subplots(1, 2, figsize=(10, 4.8))
color0, color1 = 'tab:blue', 'tab:orange'

f0 = np.linspace(f0min, f0max, 200)
f1 = np.linspace(f1min, f1max, 200)
f0, f1 = np.meshgrid(f0, f1)
F = np.hstack((f0.reshape(-1, 1), f1.reshape(-1, 1)))

pred = gb.predict(F).reshape(f0.shape)
axs[0].contour(f0, f1, pred, levels=[0.5])
axs[0].contourf(f0, f1, pred, levels=1, colors=[color0, color1], alpha=0.25)

proba = gb.predict_proba(F)[:, 0].reshape(f0.shape)
print(proba.shape)
axs[1].contourf(f0, f1, proba, alpha=0.5, cmap='RdBu')

for ax in axs:
    ax.scatter(X_train[y_train==0][:, 0], X_train[y_train==0][:, 1], marker='o', fc=color0, ec='k', label="Train class 0")
    ax.scatter(X_test[y_test==0][:, 0], X_test[y_test==0][:, 1], marker='^', fc=color0, ec='k', label="Test class 0")
    ax.scatter(X_train[y_train==1][:, 0], X_train[y_train==1][:, 1], marker='o', fc=color1, ec='k', label="Train class 1")
    ax.scatter(X_test[y_test==1][:, 0], X_test[y_test==1][:, 1], marker='^', fc=color1, ec='k', label="Test class 1")

    ax.set_xlim(f0min, f0max)
    ax.set_ylim(f1min, f1max)
    ax.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)

handles, labels = axs[0].get_legend_handles_labels()

fig.legend(handles, labels, ncol=4, loc='upper center')
plt.show()
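
The grid handling in the listing above can be looked at in isolation. The following is a minimal sketch on a tiny 3x2 grid (the grid values and the names g0, g1, values and Z are made up for illustration): meshgrid builds the coordinate matrices, the flattened columns form the (n_points, 2) input a classifier expects, and a per-point result is reshaped back to the grid shape for contourf. It also shows that np.hstack on column vectors and np.c_ produce the same array.

```python
import numpy as np

# Tiny illustrative grid: 3 points along feature 0, 2 points along feature 1.
g0 = np.linspace(0.0, 1.0, 3)
g1 = np.linspace(10.0, 20.0, 2)
f0, f1 = np.meshgrid(g0, g1)          # both have shape (2, 3)

# Flatten the grid into one row per grid point, one column per feature.
F_hstack = np.hstack((f0.reshape(-1, 1), f1.reshape(-1, 1)))
F_c = np.c_[f0.reshape(-1, 1), f1.reshape(-1, 1)]

# Stand-in for a model output: one value per grid point (here just a sum).
values = F_hstack.sum(axis=1)

# Reshape back to the grid so it lines up with f0 and f1 for contour plots.
Z = values.reshape(f0.shape)

print(f0.shape, F_hstack.shape, Z.shape)
```

Because reshape(-1, 1) and reshape(f0.shape) walk the array in the same order, each element of Z ends up at the grid position it was computed for.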

Three or more classes

Apply GradientBoostingClassifier to the three-class iris dataset and look at the output of predict_proba().

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from pandas import DataFrame

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

gbc = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbc.fit(X_train, y_train)

pred_proba = gbc.predict_proba(X_test)
df = DataFrame(pred_proba, columns=iris.target_names)
df["decision"] = np.argmax(pred_proba, axis=1)
df["prediction"] = gbc.predict(X_test)
print(df)

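The output of this code is shown below. A probability is obtained for each of the three classes, and each row sums to 1. Unlike decision_function(), whose result is one-dimensional only in the two-class case, predict_proba() always returns an array of shape (n_samples, n_classes).

Line 17 uses argmax to find, for each sample, the class with the highest probability.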

      setosa  versicolor  virginica  decision  prediction
0   0.102177    0.788400   0.109422         1           1
1   0.783471    0.109367   0.107161         0           0
2   0.098181    0.110059   0.791761         2           2
3   0.102177    0.788400   0.109422         1           1
4   0.103600    0.667239   0.229161         1           1
.....
33  0.783471    0.109367   0.107161         0           0
34  0.783471    0.109367   0.107161         0           0
35  0.101941    0.115024   0.783035         2           2
36  0.102177    0.788400   0.109422         1           1
37  0.783471    0.109367   0.107161         0           0
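
As a quick check of the two statements above, here is a small sketch that uses only the first three rows of the printed table (the array names are my own): each row should sum to 1 up to rounding, and argmax should give the class index shown in the decision column.

```python
import numpy as np

# First three rows of the predict_proba() output printed above
# (columns: setosa, versicolor, virginica).
proba_rows = np.array([
    [0.102177, 0.788400, 0.109422],
    [0.783471, 0.109367, 0.107161],
    [0.098181, 0.110059, 0.791761],
])
target_names = np.array(["setosa", "versicolor", "virginica"])

row_sums = proba_rows.sum(axis=1)          # each close to 1 (up to rounding)
decision = np.argmax(proba_rows, axis=1)   # column index of the largest probability

print(row_sums)
print(decision, target_names[decision])
```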

Axes.set_xlabel/ylabel: axis labels

2020-09-08 / tau

Overview

To label the axes of a graph drawn on an Axes, use set_xlabel() and set_ylabel().

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi)
y = np.sin(x * 2)

fig = plt.figure()
ax = fig.add_subplot()
ax.plot(x, y)
ax.set_xlabel("theta")
ax.set_ylabel("sine")
plt.show()

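Adjusting position and other properties

The label position is adjusted with the loc parameter; the accepted values differ between the x-axis and the y-axis. (This raised an error with matplotlib 3.1.1, so I upgraded to 3.3.1.)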

set_xlabel("labelname", loc='left'/'center'/'right')
set_ylabel("labelname", loc='bottom'/'center'/'top')
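
If upgrading is not an option, something like the following might work. This is only a sketch under two assumptions of mine: that loc is accepted from matplotlib 3.3 on (consistent with the 3.1.1 error noted above), and that passing the Text properties x and horizontalalignment gives a comparable placement on older versions. The Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Major/minor version as a tuple, e.g. (3, 3) for "3.3.1".
mpl_version = tuple(int(p) for p in matplotlib.__version__.split(".")[:2])

fig, ax = plt.subplots()
if mpl_version >= (3, 3):
    # Assumption: loc is available from 3.3 on.
    ax.set_xlabel("theta", loc="right")
else:
    # Assumed equivalent for older versions: position the label Text manually.
    ax.set_xlabel("theta", x=1.0, horizontalalignment="right")

print(ax.get_xlabel(), ax.xaxis.get_label().get_horizontalalignment())
```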

In the example below, rotation is passed to set_ylabel() as a Text property. Left as is, the horizontal label would be cut off at the edge of the figure, so line 8 calls subplots_adjust to widen the left margin.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi)
y = np.sin(x * 2)

fig, ax = plt.subplots()
fig.subplots_adjust(left=0.2)
ax.plot(x, y)
ax.set_xlabel("theta", loc='right')
ax.set_ylabel("sine", loc='top', rotation='horizontal')
plt.show()

scikit-learn – decision_function

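2020-08-25 / tau

Overview

decision_function() returns, for models that classify by means of a hyperplane, a confidence score for each sample being predicted.

For binary classification the result is a 1-D array of shape (n_samples,); for multiclass classification it is a 2-D array of shape (n_samples, n_classes). In the binary case, the sign of each value corresponds to one of the two classes.

Models that have decision_function() include LogisticRegression, SVC and GradientBoostingClassifier; RandomForestClassifier does not have this method.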
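
One rough way to check which estimators offer the method is hasattr on the class. The sketch below does this for the four models mentioned above; the dictionary name is my own, and a class-level hasattr is only a quick check, not a guarantee about every configuration of an estimator.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

models = [LogisticRegression, SVC, GradientBoostingClassifier, RandomForestClassifier]

# Map each class name to whether the class exposes decision_function.
has_decision_function = {m.__name__: hasattr(m, "decision_function") for m in models}

for name, flag in has_decision_function.items():
    print(f"{name:28s} {flag}")
```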

Behavior of `decision_function()`

Let's check the behavior of decision_function() with GradientBoostingClassifier.

First, generate two-class data with make_circles() and relabel the outer class 0 as blue and the inner class 1 as orange.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "orange"])[y]
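
The relabeling relies on NumPy integer-array indexing: each integer in y picks the element at that position from the two-element name array. A tiny sketch with made-up labels (y_small is hypothetical, not the make_circles output):

```python
import numpy as np

y_small = np.array([0, 1, 1, 0, 1])        # hypothetical integer labels
names = np.array(["blue", "orange"])       # index 0 -> "blue", index 1 -> "orange"

# Indexing with an integer array returns one name per label.
y_small_named = names[y_small]
print(y_small_named)
# ['blue' 'orange' 'orange' 'blue' 'orange']
```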

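Next, split the data into training and test sets and fit the model on the training set. The string labels y_named are split together with X and y. Only the splits of X and y_named are used from here on, so y does not strictly need to be included; it is there to show that train_test_split() can split three or more arrays at once.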

X_train, X_test, y_train, y_test, y_train_named, y_test_named = \
    train_test_split(X, y, y_named, random_state=0)

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train_named)
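
What matters when splitting several arrays in one call is that the same rows end up in the same split for every array. The sketch below checks this on small made-up data (X_demo, y_demo and y_demo_named are hypothetical stand-ins for X, y and y_named):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row i of X_demo is [2*i, 2*i + 1]; its label is i % 2.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)
names = np.array(["blue", "orange"])
y_demo_named = names[y_demo]

# Three input arrays give six outputs: a train/test pair for each input.
parts = train_test_split(X_demo, y_demo, y_demo_named, random_state=0)
X_tr, X_te, y_tr, y_te, y_tr_named, y_te_named = parts

# The string labels still match the integer labels row by row.
aligned = np.array_equal(names[y_tr], y_tr_named) and np.array_equal(names[y_te], y_te_named)
print(len(parts), X_tr.shape, X_te.shape, aligned)
```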

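The classes_ attribute of the fitted classifier shows how the classes are represented. Because y_train_named was passed to fit() above, the classes are strings; if y_train is used instead, the classes come back as [0, 1], matching the original y.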

print(gbc.classes_)

# ['blue' 'orange']
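
To see both representations side by side, the following sketch fits the same kind of model once on the integer labels and once on the string labels. It fits on the whole dataset and uses n_estimators=10 purely to keep it short and fast; both are simplifications of mine, not the settings used above.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "orange"])[y]

# n_estimators is reduced only to keep the sketch fast.
clf_int = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)
clf_str = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y_named)

print(clf_int.classes_)   # integer labels
print(clf_str.classes_)   # string labels
```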

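Next, feed the test data to the trained model and put the results of decision_function() and predict() side by side. decision_function() returns a 1-D array with one element per test sample passed to it. Where the value is negative, predict() returns blue (class 0); where it is positive, predict() returns orange (class 1).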

from pandas import DataFrame

data = DataFrame()
data["decision_value"] = gbc.decision_function(X_test)
data["prediction"] = gbc.predict(X_test)
print(data)

#     decision_value prediction
# 0         4.135926     orange
# 1        -1.701699       blue
# 2        -3.951061       blue
# 3        -3.626096       blue
# 4         4.289866     orange
# 5         3.661661     orange
# 6        -7.690972       blue
# 7         4.110017     orange
# 8         1.107539     orange
# 9         3.407822     orange
# 10       -6.462560       blue
# 11        4.289866     orange
# 12        3.901563     orange
# 13       -1.200312       blue
# 14        3.661661     orange
# 15       -4.172312       blue
# 16       -1.230101       blue
# 17       -3.915762       blue
# 18        4.036028     orange
# 19        4.110017     orange
# 20        4.110017     orange
# 21        0.657090     orange
# 22        2.698263     orange
# 23       -2.656733       blue
# 24       -1.867766       blue

To obtain the same result as predict() from the sign of each decision_function() value, proceed step by step as follows.

print(gbc.decision_function(X_test) > 0)
# [ True False False False  True  True False  True  True  True False  True
#   True False  True False False False  True  True  True  True  True False
#  False]

print((gbc.decision_function(X_test) > 0).astype(int))
# [1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 0 0 0 1 1 1 1 1 0 0]

print(gbc.classes_[(gbc.decision_function(X_test) > 0).astype(int)])
# ['orange' 'blue' 'blue' 'blue' 'orange' 'orange' 'blue' 'orange' 'orange'
#  'orange' 'blue' 'orange' 'orange' 'blue' 'orange' 'blue' 'blue' 'blue'
#  'orange' 'orange' 'orange' 'orange' 'orange' 'blue' 'blue']

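Finally, add the labels derived above and the ground truth y_test_named to the data frame and look at the whole table. The predict() results and the labels derived from the sign of decision_function() are identical, and some of them differ from y_test.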

data["decision"] = \
    gbc.classes_[(gbc.decision_function(X_test) > 0).astype(int)]
data["y_test"] = y_test_named

print(data)

#     decision_value prediction decision  y_test
# 0         4.135926     orange   orange  orange
# 1        -1.701699       blue     blue    blue
# 2        -3.951061       blue     blue    blue
# 3        -3.626096       blue     blue    blue
# 4         4.289866     orange   orange  orange
# 5         3.661661     orange   orange  orange
# 6        -7.690972       blue     blue    blue
# 7         4.110017     orange   orange  orange
# 8         1.107539     orange   orange  orange
# 9         3.407822     orange   orange  orange
# 10       -6.462560       blue     blue    blue
# 11        4.289866     orange   orange  orange
# 12        3.901563     orange   orange  orange
# 13       -1.200312       blue     blue  orange
# 14        3.661661     orange   orange  orange
# 15       -4.172312       blue     blue  orange
# 16       -1.230101       blue     blue  orange
# 17       -3.915762       blue     blue    blue
# 18        4.036028     orange   orange  orange
# 19        4.110017     orange   orange  orange
# 20        4.110017     orange   orange    blue
# 21        0.657090     orange   orange  orange
# 22        2.698263     orange   orange  orange
# 23       -2.656733       blue     blue    blue
# 24       -1.867766       blue     blue    blue
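
The rows where the prediction differs from y_test can be counted directly. The sketch below re-enters the prediction and y_test columns of the table above by hand (blue=0, orange=1) rather than using the model; the array names are my own.

```python
import numpy as np

# "prediction" column of the table above, encoded as blue=0, orange=1.
pred = np.array([1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1,
                 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# "y_test" column: identical to the prediction except for four rows.
truth = pred.copy()
truth[[13, 15, 16]] = 1   # rows predicted blue whose true label is orange
truth[20] = 0             # row predicted orange whose true label is blue

mismatch = np.flatnonzero(pred != truth)   # indices of misclassified rows
accuracy = np.mean(pred == truth)          # fraction of correct predictions
print(mismatch, accuracy)
# [13 15 16 20] 0.84
```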

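Meaning of `decision_function()`

The level of decision_function() is the height above the hyperplane, but it varies with the data and the model parameters, so its scale is hard to interpret. This can also be expected from the nonlinear relationship between the predicted probability from predict_proba() and the confidence computed by decision_function().

The figure below shows the decision boundary of GradientBoostingClassifier on the circles data and the distribution of decision_function() values. The contours are tangled and hard to read; predict_proba() is more intuitive.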

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

f0min, f0max = -1.5, 1.5
f1min, f1max = -1.75, 1.5

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)

X_train, X_test, y_train, y_test =\
    train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(random_state=0)

gb.fit(X_train, y_train)

print("X_test.shape           : {}".format(X_test.shape))
print("Decision function shape: {}".format(
    gb.decision_function(X_test).shape))

fig, axs = plt.subplots(1, 2, figsize=(10, 4.8))
color0, color1 = 'tab:blue', 'tab:orange'

f0 = np.linspace(f0min, f0max, 200)
f1 = np.linspace(f1min, f1max, 200)
f0, f1 = np.meshgrid(f0, f1)
F = np.c_[f0.reshape(-1, 1), f1.reshape(-1, 1)]

pred = gb.predict(F).reshape(f0.shape)
axs[0].contour(f0, f1, pred, levels=[0.5])
axs[0].contourf(f0, f1, pred, levels=1, colors=[color0, color1], alpha=0.25)

decision = gb.decision_function(F).reshape(f0.shape)
axs[1].contourf(f0, f1, decision, alpha=0.5, cmap='bwr')

for ax in axs:
    ax.scatter(X_train[y_train==0][:, 0], X_train[y_train==0][:, 1], marker='o', fc=color0, ec='k', label="Train class 0")
    ax.scatter(X_test[y_test==0][:, 0], X_test[y_test==0][:, 1], marker='^', fc=color0, ec='k', label="Test class 0")
    ax.scatter(X_train[y_train==1][:, 0], X_train[y_train==1][:, 1], marker='o', fc=color1, ec='k', label="Train class 1")
    ax.scatter(X_test[y_test==1][:, 0], X_test[y_test==1][:, 1], marker='^', fc=color1, ec='k', label="Test class 1")

    ax.set_xlim(f0min, f0max)
    ax.set_ylim(f1min, f1max)
    ax.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)

handles, labels = axs[0].get_legend_handles_labels()

fig.legend(handles, labels, ncol=4, loc='upper center')
plt.show()
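
The nonlinear relationship mentioned above can be made concrete for this model. As far as I can tell, with the default loss of GradientBoostingClassifier the class-1 probability is the logistic sigmoid of the decision value; the sketch below checks that numerically rather than taking it for granted. The sigmoid is written out with NumPy, and the variable names are my own.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

dec = gb.decision_function(X_test)        # shape (n_samples,)
proba1 = gb.predict_proba(X_test)[:, 1]   # probability of class 1

# Logistic sigmoid of the decision value (assumption: default log-loss model).
sigmoid = 1.0 / (1.0 + np.exp(-dec))

print(np.allclose(proba1, sigmoid))
print(np.round(dec[:3], 3), np.round(proba1[:3], 3))
```

Under that assumption, a decision value of 0 corresponds to a probability of 0.5, and large positive or negative values saturate towards 1 or 0, which is why the decision scale is hard to read as a probability.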

Three or more classes

Apply GradientBoostingClassifier to the three-class iris dataset and look at the output of decision_function().

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from pandas import DataFrame

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

gbc = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbc.fit(X_train, y_train)

dec_func = gbc.decision_function(X_test)
print(dec_func.shape)
df = DataFrame(dec_func, columns=iris.target_names)

df["decision"] = np.argmax(dec_func, axis=1)
df["prediction"] = gbc.predict(X_test)
print(df)

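The output of this code is shown below. In the two-class case the result was a 1-D array, but with three classes it is an array of shape (n_samples, n_classes). Since predict_proba() returns two columns even for two classes, only decision_function() in the two-class case yields a 1-D array.

Line 19 uses argmax to find, for each sample, the column with the largest value, and shows its index as the class number in the "decision" column.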

(38, 3)
      setosa  versicolor  virginica  decision  prediction
0  -1.995715    0.047583  -1.927207         1           1
1   0.061464   -1.907557  -1.927938         0           0
2  -1.990582   -1.876379   0.096867         2           2
3  -1.995715    0.047583  -1.927207         1           1
4  -1.997302   -0.134691  -1.203415         1           1
.....
33  0.061464   -1.907557  -1.927938         0           0
34  0.061464   -1.907557  -1.927938         0           0
35 -1.997122   -1.876379   0.041660         2           2
36 -1.995715    0.047583  -1.927207         1           1
37  0.061464   -1.907557  -1.927938         0           0
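
These decision values and the predict_proba() values printed for the same model and test set in the predict_proba() notes appear to be related by a softmax over each row. The sketch below checks that on the first three rows of both printed tables; it is an observation on the printed numbers, and the softmax helper is my own, not a scikit-learn function.

```python
import numpy as np

# Rows 0-2 of the decision_function() output printed above.
dec_rows = np.array([
    [-1.995715,  0.047583, -1.927207],
    [ 0.061464, -1.907557, -1.927938],
    [-1.990582, -1.876379,  0.096867],
])
# Rows 0-2 of the predict_proba() output for the same model and test set.
proba_rows = np.array([
    [0.102177, 0.788400, 0.109422],
    [0.783471, 0.109367, 0.107161],
    [0.098181, 0.110059, 0.791761],
])

def softmax(z):
    """Row-wise softmax; the row maximum is subtracted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

print(np.round(softmax(dec_rows), 6))
print(np.allclose(softmax(dec_rows), proba_rows, atol=1e-5))
```

Since softmax preserves the ordering within a row, argmax gives the same class for both outputs.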
