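pyplot – Font sizes of graph elements

2020-06-28 / tau

Font size for the whole graph

The base font size is changed through pyplot.rcParams, which is a dictionary-like object rather than a function. The default is font.size=10. The following example enlarges the font size of the whole graph.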

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi)
ys = np.sin(x)
yc = np.cos(x)

plt.rcParams['font.size'] = 15
fig, ax = plt.subplots()
fig.subplots_adjust(left=0.2)

ax.set_title("Axes Title")
ax.plot(x, ys, label="sin x")
ax.plot(x, yc, label="cos x")
ax.set_ylabel("sin/cos")
ax.legend()

plt.show()
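
Assigning to plt.rcParams affects every figure created afterwards. As a hedged sketch that is not part of the original example, matplotlib's rc_context() can limit the change to a single block. The Agg backend is only an assumption so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

size_before = plt.rcParams['font.size']

# Inside the with-block the base font size is 15; it is restored on exit
with plt.rc_context({'font.size': 15}):
    fig, ax = plt.subplots()
    ax.set_title("Axes Title")
    title_size = ax.title.get_fontsize()  # title size derived from the base size

size_after = plt.rcParams['font.size']
print(size_before == size_after, title_size)
plt.close(fig)
```

The title size is resolved relative to the base size when the Axes is created, so elements created inside the block keep the enlarged size.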

Font sizes of individual elements

An example that sets the font size individually for the title, axis label, tick labels, and legend.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi)
ys = np.sin(x)
yc = np.cos(x)

fig, ax = plt.subplots()

ax.set_title("Axes Title", fontsize=20)
ax.plot(x, ys, label="sin x")
ax.plot(x, yc, label="cos x")
ax.set_ylabel("sin/cos", fontsize=10)
ax.tick_params(labelsize=7)
ax.legend(fontsize=7)

plt.show()

Python – Flattening nested loops into a single loop

2020-06-27 / tau

Overview

How to express a double loop like the following with a single loop.

years = [2000, 2010, 2020]
seasons = ["Spring", "Summer", "Autumn", "Winter"]

for year in years:
    for season in seasons:
        print(year, season)

# 2000 Spring
# 2000 Summer
# 2000 Autumn
# 2000 Winter
# 2010 Spring
# 2010 Summer
# 2010 Autumn
# 2010 Winter
# 2020 Spring
# 2020 Summer
# 2020 Autumn
# 2020 Winter

Using a comprehension

Run the double loop inside a list comprehension to build a single list.

years = [2000, 2010, 2020]
seasons = ["Spring", "Summer", "Autumn", "Winter"]

lst = [(year, season) for year in years for season in seasons]

for year, season in lst:
    print(year, season)

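Using `itertools.product`

product() in the itertools library returns an iterator that yields, as tuples, the Cartesian product of the elements of its argument iterables. It is not a list, so it can be traversed only once.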

from itertools import product

years = [2000, 2010, 2020]
seasons = ["Spring", "Summer", "Autumn", "Winter"]

pairs = product(years, seasons)  # avoid shadowing the built-in iter()

for year, season in pairs:
    print(year, season)
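
The following minimal sketch illustrates the two points above: the object returned by product() is consumed by one pass, and it yields pairs in the same order as the nested comprehension. The variable names are only illustrative.

```python
from itertools import product

years = [2000, 2010, 2020]
seasons = ["Spring", "Summer", "Autumn", "Winter"]

pairs = product(years, seasons)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing left to yield

print(len(first_pass), len(second_pass))
# 12 0

# The pairs come out in the same order as the nested comprehension
same = first_pass == [(y, s) for y in years for s in seasons]
print(same)
# True
```

If the pairs are needed more than once, wrapping the call in list() keeps them around.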

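numpy – r_ and c_

2020-06-27 / tau

Overview

numpy.r_ and numpy.c_ are objects that concatenate arrays. r_ concatenates arrays vertically (along the first axis), and c_ concatenates arrays horizontally (as columns). They can be used much like vstack() / hstack() or linspace(), but they have a few quirks.

Arrays and scalars can be mixed in one concatenation
Sequences can be created with a slice that specifies a step or a number of divisions
Can be used in place of vstack() and hstack()

Use them in the same way as vstack() and hstack().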

import numpy as np

a1 = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
a2 = np.array([
    [10, 20, 30],
    [40, 50, 60]
])

print(np.r_[a1, a2])
# [[ 1  2  3]
#  [ 4  5  6]
#  [10 20 30]
#  [40 50 60]]

print(np.c_[a1, a2])
# [[ 1  2  3 10 20 30]
#  [ 4  5  6 40 50 60]]

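About `r_`

When numpy.r_ is used to append a single row to a 2-D array, leaving the row as a 1-D array raises an error saying that the numbers of dimensions differ. It is simpler to use vstack().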

a3 = np.array([10, 20, 30])
#print(np.r_[a1, a3])
# -> ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

print(np.r_[a1, a3.reshape(1, -1)])
# [[ 1  2  3]
#  [ 4  5  6]
#  [10 20 30]]
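
For comparison, a minimal sketch of the vstack() route mentioned above. The arrays are re-created here so that the snippet runs on its own.

```python
import numpy as np

a1 = np.array([[1, 2, 3],
               [4, 5, 6]])
a3 = np.array([10, 20, 30])

# vstack promotes the 1-D array to a single row before stacking
stacked = np.vstack([a1, a3])
print(stacked)
# [[ 1  2  3]
#  [ 4  5  6]
#  [10 20 30]]

# Same result as r_ with an explicit reshape
same = np.array_equal(stacked, np.r_[a1, a3.reshape(1, -1)])
print(same)
# True
```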

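With its default behaviour, r_ simply concatenates 1-D arrays end to end. Mixing arrays and scalar elements is fine. Arrays of strings can also be concatenated, but mixing in bare string elements raises an error, because r_ interprets a string item as a special directive that must come first.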

print(np.r_[[1, 2, 3], 4, 5, [6, 7]])
# [1 2 3 4 5 6 7]

print(np.r_[['A', 'B', 'C'], ['D', 'E']])
# ['A' 'B' 'C' 'D' 'E']

#print(np.r_[['A', 'B', 'C'], 'D', 'E', ['E', 'F']])
# -> ValueError: special directives must be the first entry.

Generating sequences with a slice.

print(np.r_[:10])
# [0 1 2 3 4 5 6 7 8 9]

print(np.r_[4:10:2])
# [4 6 8]

print(np.r_[0.5:5.5:0.5])
# [0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5 5. ]

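Appending 'j' to the third slice argument gives the same behaviour as numpy.linspace(): the value is read as the number of points, and in this case the end value is included.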

print(np.r_[0:10:5j])
# [ 0.   2.5  5.   7.5 10. ]
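
A short sketch that checks the correspondence with linspace() and contrasts it with a real-valued step, which excludes the end value. The variable names are only illustrative.

```python
import numpy as np

by_r = np.r_[0:10:5j]               # 5 points, end value included
by_linspace = np.linspace(0, 10, 5)
print(np.allclose(by_r, by_linspace))
# True

# With a real step the slice behaves like arange, so the end value is excluded
by_step = np.r_[0:10:2.5]
print(by_step)
# [0.  2.5 5.  7.5]
```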

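About `c_`

When numpy.c_ joins a 2-D array with a 1-D array whose length equals the number of rows, the 1-D array is treated as a column vector and one column is appended. This is handier than hstack(), which requires the 1-D array to be reshaped into a column vector first.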

a5 = np.array([10, 20])
print(np.c_[a1, a5])
# [[ 1  2  3 10]
#  [ 4  5  6 20]]
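
The comparison with hstack() can be sketched as follows. np.column_stack() is included as another standard NumPy function that treats 1-D input as a column; it is not mentioned in the original text.

```python
import numpy as np

a1 = np.array([[1, 2, 3],
               [4, 5, 6]])
a5 = np.array([10, 20])

by_c = np.c_[a1, a5]
by_hstack = np.hstack([a1, a5.reshape(-1, 1)])  # explicit column vector
by_column_stack = np.column_stack([a1, a5])     # 1-D input becomes a column

print(np.array_equal(by_c, by_hstack), np.array_equal(by_c, by_column_stack))
# True True
```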

Furthermore, when 1-D arrays with the same number of elements are joined, each is treated as a column vector and they are joined side by side.

b1 = np.array([1, 2, 3])
b2 = np.array([4, 5, 6])
print(np.c_[b1, b2])

# [[1 4]
#  [2 5]
#  [3 6]]

To append column vectors one after another to an initially empty array, prepare it with empty((n, 0), dtype=type).

b0 = np.empty((3, 0), dtype=int)
b0 = np.c_[b0, b1]
print(b0)
# [[1]
#  [2]
#  [3]]

b0 = np.c_[b0, b2]
print(b0)
# [[1 4]
#  [2 5]
#  [3 6]]

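SVM: the kernel method

2020-06-24 / tau

Overview

A transcription of section 2.3.7, "Kernelized Support Vector Machines", of the book "Pythonではじめる機械学習" (the Japanese edition of "Introduction to Machine Learning with Python").

Making linear features nonlinear

Data that a linear model cannot separate

Below is an example in which a linear support vector machine is applied to two-feature, two-class data generated with scikit-learn's make_blobs(). The decision boundary is obtained as follows.

(1) $\begin{gather*} b + w_0 f_0 + w_1 f_1 = 0\\ b \approx -0.2817,\; w_0 \approx 0.1261,\; w_1 \approx -0.0918 \end{gather*}$

Above the decision boundary the value of the polynomial is negative and below it the value is positive, but this boundary clearly fails to separate the two classes. Even for such a simple example, a linear model cannot classify the data properly.

The code below differs from the original in the following respect.

Following a convergence warning, max_iter of LinearSVC is set to a value larger than the default of 1000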

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(centers=4, random_state=8)
y = y % 2

linsvm = LinearSVC(max_iter=5500).fit(X, y)

y_min, y_max = -15, 15
b = linsvm.intercept_
w0 = linsvm.coef_[0][0]
w1 = linsvm.coef_[0][1]
x_lower = -(b + w1 * y_min) / w0
x_upper = -(b + w1 * y_max) / w0

fig, ax = plt.subplots()
X0 = X[y==0]
X1 = X[y==1]
ax.scatter(X0[:, 0], X0[:, 1], marker='o', s=60, ec='k')
ax.scatter(X1[:, 0], X1[:, 1], marker='^', s=60, ec='k')
ax.plot([x_lower, x_upper], [y_min, y_max], linewidth=2, c='tab:green')
ax.set_ylim(y_min, y_max)
ax.set_xlabel("Feature-0")
ax.set_ylabel("Feature-1")
plt.show()

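Adding a nonlinear feature

Here the square of feature 1 is added as a new feature. Each point is now located in a three-dimensional space spanned by the three features, and each belongs to class 0 or 1. Adding the new feature lifts the points along the new axis, and it looks as if a plane could neatly separate the triangle points in the middle from the circle points on both sides.

The image below shows the decision boundary obtained by applying a linear SVM to this dataset. With two features the decision boundary was a straight line; with three features it becomes a plane. As expected, a simple plane separates the two classes. The equation of this decision boundary is as follows.

(2) $\begin{gather*} b + w_0 f_0 + w_1 f_1 + w_2 {f_1}^2 = 0\\ b \approx 1.1734,\; w_0 \approx 0.1301,\; w_1 \approx -0.2203,\; w_2 \approx -0.0597 \end{gather*}$

On the upper side of this plane (the side where f₁² is small) the value of the polynomial is positive, and on the opposite side it is negative.

The code below differs from the original in the following respects.

Following a convergence warning, max_iter of LinearSVC is set to a value larger than the default of 1000
Axes3D is created in a way that suits recent versions
- The original creates a Figure object and passes it to the Axes3D constructor together with the view arguments
- This code creates the Figure and the Axes3D together by specifying projection in the arguments of subplots, and sets the view arguments with view_init()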

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC
from mpl_toolkits.mplot3d import Axes3D

X, y = make_blobs(centers=4, random_state=8)
y = y % 2
X_new = np.hstack((X, X[:, 1].reshape(-1, 1)**2))
X0, X1 = X_new[y==0], X_new[y==1]

linsvc = LinearSVC(max_iter=3700).fit(X_new, y)
intercept = linsvc.intercept_
coef = linsvc.coef_.ravel()

fig, ax = plt.subplots(subplot_kw=dict(projection='3d'))
ax.view_init(elev=-152, azim=-23)

u = np.linspace(X_new[:, 0].min() - 2, X_new[:, 0].max() + 2)  # Feature-0 range
v = np.linspace(X_new[:, 1].min() - 2, X_new[:, 1].max() + 2)  # Feature-1 range
u, v = np.meshgrid(u, v)
w = -(coef[0] * u + coef[1] * v + intercept) / coef[2]

ax.scatter(X0[:, 0], X0[:, 1], X0[:, 2], marker='o', s=40, ec='k')
ax.scatter(X1[:, 0], X1[:, 1], X1[:, 2], marker='^', s=40, ec='k')
ax.plot_wireframe(u, v, w, rstride=8, cstride=8, color='tab:green', alpha=0.5)

ax.set_xlabel("Feature-0")
ax.set_ylabel("Feature-1")
ax.set_zlabel("Feature-1**2")

plt.show()

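Decision boundary in terms of the original features

The example above has three features, but the last one is computed from the second feature f₁, so in effect the decision boundary is determined by two features. Rewriting the planar decision boundary in three-dimensional space as follows confirms this.

(3) $\begin{align*} f_0 &= \frac{-b - w_1 f_1 - w_2 {f_1}^2}{w_0} \\ &= -9.02 +1.69 f_1 + 0.46 {f_1}^2 \\ &= 0.46(f_1 +1.83)^2 - 10.56 \end{align*}$

The figure below draws this as a decision boundary over the two features; the boundary is a quadratic function.

A linear SVM on its own yields only linear decision boundaries, but adding nonlinear features makes more complex decision boundaries possible.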

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC
from mpl_toolkits.mplot3d import Axes3D

X, y = make_blobs(centers=4, random_state=8)
y = y % 2
X_new = np.hstack((X, X[:, 1].reshape(-1, 1)**2))

linsvc = LinearSVC(max_iter=3700).fit(X_new, y)
intercept = linsvc.intercept_

x_min, x_max = -12, 12
y_min, y_max = -15, 15
u = np.linspace(x_min, x_max, 500)
v = np.linspace(y_min, y_max, 500)
u, v = np.meshgrid(u, v)
w = v**2
decision = linsvc.predict(np.c_[u.ravel(), v.ravel(), w.ravel()])

fig, ax = plt.subplots()

ax.scatter(X_new[y==0, 0], X_new[y==0, 1], marker='o', s=60, ec='k')
ax.scatter(X_new[y==1, 0], X_new[y==1, 1], marker='^', s=60, ec='k')
ax.contourf(u, v, decision.reshape(u.shape),
    levels=1, colors=['tab:blue', 'tab:orange'], alpha=0.4)
ax.contour(u, v, decision.reshape(u.shape), levels=1, colors='k')

ax.set_xlim(x_min, x_max)
ax.set_ylim(y_min, y_max)
ax.set_xlabel("Feature-0")
ax.set_ylabel("Feature-1")

plt.show()
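
As a rough numerical check of equation (3), the sketch below recomputes the quadratic coefficients from the rounded values quoted in equation (2). Because the inputs are rounded, the completed-square constants agree with equation (3) only to within rounding. The names c0, c1, c2, h, and k are introduced here for illustration.

```python
import numpy as np

# Rounded coefficients quoted in equation (2)
b, w0, w1, w2 = 1.1734, 0.1301, -0.2203, -0.0597

# f0 = c0 + c1*f1 + c2*f1**2
c0, c1, c2 = -b / w0, -w1 / w0, -w2 / w0
print(round(c0, 2), round(c1, 2), round(c2, 2))
# -9.02 1.69 0.46

# Completing the square: f0 = c2 * (f1 + h)**2 + k
h = c1 / (2 * c2)
k = c0 - c1**2 / (4 * c2)

# Both forms give the same f0 for any f1
f1 = np.linspace(-15, 15, 7)
same = np.allclose(c0 + c1 * f1 + c2 * f1**2, c2 * (f1 + h)**2 + k)
print(same)
# True
```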

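Kernel trick

Overview

In the example above, the square of one feature was used as a new feature. Besides powers of features like this, products of different features could be introduced as interaction terms. However, when the number of features grows, considering products for all their combinations makes the amount of computation explode. The kernel trick is said to be a way of computing the distances (more precisely, the inner products) between data points in the expanded feature space without actually computing the expansion.

Passing on what I have read as is, the following two mappings are the ones widely used with SVMs.

Polynomial kernel: computes all polynomials of the original features up to a specified degree
Gaussian kernel, also called the radial basis function (RBF) kernel: intuitively, considers all polynomials of all degrees, but gives features less importance as the degree increases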
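
The idea of "not actually computing the expansion" can be sketched for the degree-2 polynomial kernel in two dimensions. The feature map phi() below is a hypothetical helper written for this illustration; it is one standard explicit expansion whose inner product equals (x·z + 1)².

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (hypothetical helper)."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

def poly_kernel(x, z, d=2):
    """Polynomial kernel (x . z + 1)^d computed in the original space."""
    return (np.dot(x, z) + 1) ** d

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # inner product in the expanded 6-D space
implicit = poly_kernel(x, z)       # same value without building phi
print(np.isclose(explicit, implicit))
# True
```

The kernel needs only one inner product in the original space, however many expanded features phi() would have.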

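Below is an example in which an SVC using the kernel trick is applied to the forge dataset. The straight line is the decision boundary of LinearSVC, and the curve is the decision boundary of an SVC with the Gaussian (RBF) kernel, whose kernel function has the following form.

(4) $\begin{equation*} k_{\rm rbf}(x_1, x_2) = \exp \left( -\gamma || x_1 - x_2 ||^2 \right) \end{equation*}$

In the arguments of scikit-learn's SVC, kernel='rbf', C=10, and gamma=0.1 are specified.

Whereas the decision boundary of the linear model is a straight line, the decision boundary obtained with the kernel trick is a curve, because nonlinear features are introduced.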

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC, LinearSVC

X = np.array( \
[[ 9.96346605,  4.59676542],
 [11.0329545,  -0.16816717],
 [11.54155807,  5.21116083],
 .....
 [ 9.50169345,  1.93824624],
 [ 9.15072323,  5.49832246],
 [11.563957,    1.3389402 ]]
)
y = np.array( \
[1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
)

f0min, f0max = 7.5, 12.5
f1min, f1max = -1, 6

linsvc = LinearSVC(max_iter=4000).fit(X, y)
intercept = linsvc.intercept_
coef = linsvc.coef_.ravel()

svc = SVC(kernel='rbf', C=10, gamma=0.1).fit(X, y)
u = np.linspace(f0min, f0max, 400)
v = np.linspace(f1min, f1max, 400)
u, v = np.meshgrid(u, v)
pred = svc.predict(np.c_[u.ravel(), v.ravel()]).reshape(u.shape)

fig, ax = plt.subplots()

ax.scatter(X[y==0][:, 0], X[y==0][:, 1],
    marker='o', s=60, fc='tab:blue', ec='k')
ax.scatter(X[y==1][:, 0], X[y==1][:, 1],
    marker='^', s=60, fc='tab:orange', ec='k')

sv_class = y[svc.support_]
ax.scatter(svc.support_vectors_[sv_class==0][:, 0],
           svc.support_vectors_[sv_class==0][:, 1],
           marker='o', s=150, fc='tab:blue', ec='blue', linewidth=3)
ax.scatter(svc.support_vectors_[sv_class==1][:, 0],
           svc.support_vectors_[sv_class==1][:, 1],
           marker='^', s=150, fc='tab:orange', ec='red', linewidth=3)

f1 = lambda f0: -(intercept + coef[0]*f0) / coef[1]
ax.plot([f0min, f0max], [f1(f0min), f1(f0max)])

ax.contour(u, v, pred, levels=[0.5])

ax.set_xlim(f0min, f0max)
ax.set_ylim(f1min, f1max)
ax.tick_params(bottom=False, left=False, labelbottom=False, labelleft=False)
ax.set_xlabel("Feature-0")
ax.set_ylabel("Feature-1")

plt.show()

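A fitted SVC object in scikit-learn has the following attributes related to support vectors.

support_: indices of the support vectors in the dataset (1-D array)
support_vectors_: array of the support vectors (2-D array)

In the code above, the lines from sv_class = y[svc.support_] through the two ax.scatter() calls that follow use these attributes to highlight the support vectors.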
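
The relationship between the two attributes can be confirmed with a small sketch. The dataset here is a synthetic one from make_blobs(), an assumption made so that the snippet is self-contained; it is not the forge data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Small synthetic two-class dataset (an assumption; not the forge data)
X, y = make_blobs(n_samples=30, centers=2, random_state=0)
svc = SVC(kernel='rbf', C=10, gamma=0.1).fit(X, y)

# support_ indexes the rows of X that became support vectors
same = np.allclose(X[svc.support_], svc.support_vectors_)
sv_class = y[svc.support_]  # class label of each support vector
print(same, svc.support_.shape, svc.support_vectors_.shape)
```

Indexing y with support_ is what lets the plotting code choose a marker per class for each support vector.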

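Parameter tuning

The decision boundaries obtained by varying the parameters C and gamma of the SVC model are as follows.

gamma is the inverse of the width of the Gaussian kernel (the width corresponds to σ²; comparing the two forms of the kernel gives gamma = 1/(2σ²)). When this value is small the width is large, and more points are judged to be close to each other. Toward the left gamma is small and the model tries to group data over a wide area, so the decision boundary is coarse; toward the right gamma is large and the model tends to group only points that are near each other.

C is the inverse of the regularization strength. Toward the top C is small and regularization is strong, so the decision boundary is straighter; toward the bottom regularization is weak and the boundary is influenced by individual data points.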
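
The effect of gamma described above can be seen directly from the kernel value for one fixed pair of points. This is a minimal sketch: the two helper functions and the sample points are illustrative assumptions, and the σ form uses gamma = 1/(2σ²).

```python
import numpy as np

def rbf_gamma(x1, x2, gamma):
    """RBF kernel written with gamma."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def rbf_sigma(x1, x2, sigma):
    """The same kernel written with the width sigma."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

x1 = np.array([0.0, 0.0])
x2 = np.array([1.0, 1.0])  # squared distance 2

gammas = [0.1, 1, 10]
# Larger gamma -> smaller kernel value -> the same two points look "farther apart"
values = [rbf_gamma(x1, x2, g) for g in gammas]
matches = [np.isclose(rbf_gamma(x1, x2, g), rbf_sigma(x1, x2, np.sqrt(1 / (2 * g))))
           for g in gammas]
print(values)
print(matches)
```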

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from itertools import product

X = np.array( \
[[ 9.96346605,  4.59676542],
 [11.0329545,  -0.16816717],
 [11.54155807,  5.21116083],
 .....
 [ 9.50169345,  1.93824624],
 [ 9.15072323,  5.49832246],
 [11.563957,    1.3389402 ]]
)
y = np.array( \
[1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
)

f0min, f0max = 7.5, 12.5
f1min, f1max = -1, 6
u = np.linspace(f0min, f0max, 400)
v = np.linspace(f1min, f1max, 400)
u, v = np.meshgrid(u, v)

C_list = [0.1, 1, 1000]
gamma_list = [0.1, 1, 10]
params = product(C_list, gamma_list)

plt.rcParams['font.size'] = 6
fig, axs = plt.subplots(3, 3, figsize=(6.4, 4.8))
axs_1d = axs.ravel()
fig.subplots_adjust(hspace=0.3)

for ax, param in zip(axs_1d, params):
    svc = SVC(kernel='rbf', C=param[0], gamma=param[1]).fit(X, y)
    pred = svc.predict(
            np.hstack([u.ravel().reshape(-1, 1), v.ravel().reshape(-1, 1)])
        ).reshape(u.shape)
    ax.scatter(X[y==0][:, 0], X[y==0][:, 1], marker='o')
    ax.scatter(X[y==1][:, 0], X[y==1][:, 1], marker='^')
    ax.contour(u, v, pred, levels=[0.5])
    ax.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)
    ax.set_title("C={:.1f}, gamma={:.1f}".format(param[0], param[1]))

plt.show()

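Application to the Breast cancer data

The book's application to the Breast cancer data shows that the magnitudes and ranges of the feature data vary widely, that using the feature data as is leads to overfitting, and that accuracy improves when the feature data is preprocessed and normalized.

Characteristics of SVMs

The characteristics of SVMs, summarized as I read them.

Complex decision boundaries can be generated even when the data has only a few features (works well in both low and high dimensions)
Does not work well when the number of samples becomes large (at around 100,000 samples it becomes difficult in terms of run time and memory usage)
Careful data preprocessing and parameter tuning are required
Hard to inspect (it is difficult to understand why a prediction was made)
With RBF, increasing C or gamma makes the model more complex (the two parameters are strongly correlated, so they need to be tuned together)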

Future work: notes

Kernel function

(5) $\begin{equation*} K(\boldsymbol{x}_1, \boldsymbol{x}_2) = \sum_k \phi_k(\boldsymbol{x}_1) \phi_k(\boldsymbol{x}_2) \end{equation*}$

Polynomial kernel

(6) $\begin{equation*} K(\boldsymbol{x}_1, \boldsymbol{x}_2) = (\boldsymbol{x}_1 \cdot \boldsymbol{x}_2 + 1 )^d \end{equation*}$

Gaussian kernel

(7) $\begin{equation*} K(\boldsymbol{x}_1, \boldsymbol{x}_2) = \exp \left(- \frac{||\boldsymbol{x}_1 - \boldsymbol{x}_2 ||^2}{2\sigma^2} \right) \end{equation*}$

Formulation of the SVM

2020-06-24 / tau

Formulation of the SVM

Classification condition of the SVM

Suppose data with two features is divided into two classes, and that, as in the figure below, a single straight line can separate the two classes completely.

Taking the line l as the divider, the classes are distinguished by the following signs.

(1) $\begin{equation*} \left\{ \begin{aligned} a x_1 + b x_2 + c > 0 &\rightarrow \rm{Class1} \\ a x_1 + b x_2 + c < 0 & \rightarrow \rm{Class2} \end{aligned} \right. \end{equation*}$

Now introduce a label variable t_i. t_i indicates which of Class1/2 data point i belongs to, and is defined as t_i = 1 for Class1 and t_i = -1 for Class2.

(2) $\begin{equation*} \left\{ \begin{array}{lll} t_i = 1 & x_i \in \rm{Class1} & (a x_{i1} + b x_{i2} + c > 0) \\ t_i = -1 & x_i \in \rm{Class2} & (a x_{i1} + b x_{i2} + c < 0) \\ \end{array} \right. \end{equation*}$

With this label variable, the class conditions are unified as follows.

(3) $\begin{equation*} t_i (a x_{i1} + b x_{i2} + c) > 0 \end{equation*}$

In an SVM, a, b, and c are determined so that this expression holds for all data points. These are all constraints on a, b, and c; an objective function is also needed to decide how their values should be chosen. An SVM does this by maximizing the margin.

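Margin maximization

Suppose a line l₁ classifies the dataset into Class1/2 as in the figure below. The data points of Class1 and Class2 that are closest to l₁ are called "support vectors". The distance between the lines parallel to l₁ that pass through these support vectors is called the "margin".

If a different line l₂ is chosen instead of l₁, a larger margin can be obtained with different support vectors. The SVM therefore searches for the line l that maximizes this margin subject to equation (3).

Let (x⁺, x⁻) be the pair of support vectors for line l. The distance d from each of them to l is expressed as follows.

(4) $\begin{equation*} d = \frac{|a x^+_1 + b x^+_2 + c|}{\sqrt{a^2 + b^2}} = \frac{|a x^-_1 + b x^-_2 + c|}{\sqrt{a^2 + b^2}} \end{equation*}$

Because line l lies midway between the two parallel lines at the edges of the margin, the numerators above take the same value. Multiplying (a, b, c) by a common positive constant does not change the line, so the coefficients can be rescaled to make this numerator equal to 1. Writing the resulting distance as $\tilde{d}$, maximizing d becomes minimizing $1/\tilde{d}=\sqrt{a^2+b^2}$, which is equivalent to minimizing $a^2+b^2$. Under this scaling the constraint of equation (3) takes the stronger form that the left-hand side is at least 1 for every data point, and the problem becomes the following constrained minimization.

(5) $\begin{align*} \min \; a^2 + b^2 \quad {\rm s.t.} \; t_i (a x_{i1} + b x_{i2} + c) \ge 1 \; (i = 1, \dots, n) \end{align*}$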
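
The quantities in this formulation can be evaluated for a small example. The sketch below uses hypothetical toy data and a hand-picked line, checks the constraint for every point, and computes the point-to-line distances. It does not solve the minimization itself.

```python
import numpy as np

# Hypothetical toy data: Class1 (t = +1) and Class2 (t = -1)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
t = np.array([1, 1, -1, -1])

# Candidate line a*x1 + b*x2 + c = 0, scaled so the closest points give exactly 1
a, b, c = 0.5, 0.5, -1.0

raw = X @ np.array([a, b]) + c
values = t * raw                                  # left-hand side of the constraint
distances = np.abs(raw) / np.sqrt(a**2 + b**2)    # distance of each point to the line

print(values)           # [1.  2.  1.  1.5]
print(distances.min())  # distance of the support vectors, 1 / sqrt(a^2 + b^2)
```

The points whose constraint value is exactly 1 are the support vectors, and twice their distance is the margin.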

Future work

The rest of the formulation
Derivation of the soft margin

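Gradient boosting

2020-06-21 / tau

Overview

Gradient boosting, like random forests, is a technique that strengthens a model by combining multiple decision trees. It differs from a random forest in that it does not use many decision trees from the start but adds them one at a time. Each added tree is a shallow tree of depth about 1 to 5 (a weak learner) and is trained to compensate for the underfitting of the trees before it.

The main parameters of gradient boosting are the number of weak learners (n_estimators) and the learning rate (learning_rate); increasing the learning rate strengthens the correction made by each weak learner and makes the model more complex.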
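
The "add one weak learner at a time to correct the previous fit" idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: it does regression with squared loss on hypothetical 1-D data, the weak learner is a depth-1 tree (stump), and fit_stump() and boost() are names made up for this sketch.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree to residuals r on 1-D inputs x."""
    best = None
    for thr in np.unique(x)[:-1]:  # exclude the maximum so the right side is non-empty
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, lval, rval = best
    return lambda q: np.where(q <= thr, lval, rval)

def boost(x, y, n_estimators=50, learning_rate=0.1):
    """Add stumps one by one, each fitted to the current residuals."""
    pred = np.full_like(y, y.mean(), dtype=float)
    history = []
    for _ in range(n_estimators):
        stump = fit_stump(x, y - pred)          # residual = negative gradient of squared loss
        pred = pred + learning_rate * stump(x)  # small correction scaled by the learning rate
        history.append(np.mean((y - pred) ** 2))
    return pred, history

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

pred, mse_history = boost(x, y)
print(mse_history[0], mse_history[-1])  # training error shrinks as stumps are added
```

A larger learning_rate makes each correction bigger, so the training error falls faster for the same number of stumps.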

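Application to the cancer data

An example that applies GradientBoostingClassifier from Python's scikit-learn to the breast_cancer data. I follow the code in "2.3.6.2 勾配ブースティング回帰木" (gradient boosted regression trees) of "Pythonではじめる機械学習", but the results differ, perhaps because of version differences. I tried explicitly setting or changing several default parameters but have not reproduced the results printed in the book.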

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

ds = load_breast_cancer()
X_train, X_test, y_train, y_test =\
    train_test_split(ds.data, ds.target, random_state=0)

gbcf = GradientBoostingClassifier(random_state=1)
gbcf.fit(X_train, y_train)
print("Training score: {:.3f}".format(gbcf.score(X_train, y_train)))
print("Test score    : {:.3f}".format(gbcf.score(X_test, y_test)))

gbcf = GradientBoostingClassifier(max_depth=1, random_state=0)
gbcf.fit(X_train, y_train)
print("Training score: {:.3f}".format(gbcf.score(X_train, y_train)))
print("Test score    : {:.3f}".format(gbcf.score(X_test, y_test)))

fig, ax = plt.subplots(figsize=(8, 4.8))
fig.subplots_adjust(left=0.3)
ax.barh(ds.feature_names, gbcf.feature_importances_)
ax.set_xlabel("feature importance")
plt.show()

gbcf = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbcf.fit(X_train, y_train)
print("Training score: {:.3f}".format(gbcf.score(X_train, y_train)))
print("Test score    : {:.3f}".format(gbcf.score(X_test, y_test)))

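The code below is what I tried first. Here the test score does not match the 0.958 in the book. Setting min_samples_split=5 gives the same result as the book, but the feature importances and the result of changing learning_rate further on are still not reproduced.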

gbcf = GradientBoostingClassifier(random_state=0)
gbcf.fit(X_train, y_train)
print("Training score: {:.3f}".format(gbcf.score(X_train, y_train)))
print("Test score    : {:.3f}".format(gbcf.score(X_test, y_test)))

# Training score: 1.000
# Test score    : 0.965

The case where strong pruning with max_depth=1 is applied against the overfitting. This result agrees with the book to the three decimal places shown.

gbcf = GradientBoostingClassifier(max_depth=1, random_state=0)
gbcf.fit(X_train, y_train)
print("Training score: {:.3f}".format(gbcf.score(X_train, y_train)))
print("Test score    : {:.3f}".format(gbcf.score(X_test, y_test)))

# Training score: 0.991
# Test score    : 0.972

The result of changing learning_rate from the default of 0.1 to 0.01 also agrees with the book. In this reproduction, the test score is not improved over the default settings.

gbcf = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbcf.fit(X_train, y_train)
print("Training score: {:.3f}".format(gbcf.score(X_train, y_train)))
print("Test score    : {:.3f}".format(gbcf.score(X_test, y_test)))

# Training score: 0.988
# Test score    : 0.965

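Note that the graph for the case with stronger pre-pruning differs greatly from the book. The values on the horizontal axis are roughly doubled, and worst concave points, worst perimeter, and mean concave points account for most of the importance. This differs from the book, where many other features also have reasonably high importance.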

gbcf = GradientBoostingClassifier(max_depth=1, random_state=0)
gbcf.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(8, 4.8))
fig.subplots_adjust(left=0.3)
ax.barh(ds.feature_names, gbcf.feature_importances_)
ax.set_xlabel("feature importance")
plt.show()

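Points to check in the future

Organizing the basic ideas of gradient boosting
Checking the behaviour of gradient boosting on simple cases
Application to regression
Combining different models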

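Overview of random forests

2020-06-20 / tau

Overview

A transcription of the random forest part of "Pythonではじめる機械学習".

A random forest is one of the ensemble methods for decision trees. The idea is to generate multiple different decision trees at random and average them, cancelling out the overfitting of the individual trees.

Example run

Below is an example that classifies the moons dataset with scikit-learn's random forest model RandomForestClassifier, showing the five randomly generated trees and the decision regions of the random forest that averages them.

The implementation code follows; the bodies of draw_decision_field() and draw_tree_boundary() are omitted.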

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patch
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier


def draw_decision_field(clf, ax, x0s, x1s, n_areas=2,
        colors=['tab:blue', 'tab:orange'], alpha=0.5, fill=True):
    ...  # code omitted

def draw_tree_boundary(tree, ax, left, right, bottom, top,
        i_node=0, stop_level=None, n_level=0):
    ...  # code omitted

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test =\
    train_test_split(X, y, stratify=y, random_state=42)

x0_min, x0_max = -1.8, 2.5
x1_min, x1_max = -1.0, 1.8
x0s = np.linspace(x0_min, x0_max, 50)
x1s = np.linspace(x1_min, x1_max, 50)

forest = RandomForestClassifier(n_estimators=5, random_state=2)
forest.fit(X_train, y_train)

fig, axs = plt.subplots(2, 3, figsize=(8, 4.8))
fig.subplots_adjust(hspace=0.3)
axs_1d = axs.reshape(-1)

for i, ax in enumerate(axs_1d[:-1]):
    draw_tree_boundary(tree=forest.estimators_[i].tree_, ax=ax,
        left=x0_min, right=x0_max, bottom=x1_min, top=x1_max)
    ax.set_title("Tree-{}".format(i))

draw_decision_field(forest, axs_1d[-1], x0s, x1s)
axs_1d[-1].set_title("Random Forest")

for ax in axs_1d:
    ax.scatter(X_train[y_train==0][:, 0], X_train[y_train==0][:, 1],
        marker='o', s=15, fc='tab:blue', ec='k')
    ax.scatter(X_train[y_train==1][:, 0], X_train[y_train==1][:, 1],
        marker='^', s=15, fc='tab:orange', ec='k')

    ax.set_xlim(x0_min, x0_max)
    ax.set_ylim(x1_min, x1_max)
    ax.tick_params(bottom=False, left=False, labelbottom=False, labelleft=False)

plt.show()

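The following shows what happens as n_estimators in the example above is increased from 1 to 9. The fine branches are gradually removed and the decision boundary becomes smoother.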

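How random forests work

Overview

The rough procedure of a random forest is as follows.

Training / building the random forest
1. Specify the number of decision trees
2. For each decision tree, randomly generate a dataset from the training data
3. Train each decision tree
Prediction
1. Provide the features of the data to be predicted
2. Each decision tree classifies those features
3. Decide the class by majority vote over the trees' classifications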

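Building the random forest

Number of decision trees

The number of decision trees to build is specified with the n_estimators parameter of RandomForestClassifier.

Generating the data given to each decision tree

The dataset given to each decision tree is set up randomly: some or all of the features are selected, and the data points are drawn at random by bootstrap sampling.

In RandomForestClassifier, the number of features to select is specified with max_features (scikit-learn applies this limit at each split rather than once per tree), and whether the data are drawn by bootstrap sampling is set with bootstrap (=True/False). The random seed used for these is specified with the random_state parameter.

In the example at the beginning, there are only two features, so both are available to every tree; bootstrap is left at its default of True, the number of decision trees is 3, and the random seed is 2.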

forest = RandomForestClassifier(n_estimators=3, random_state=2)
forest.fit(X_train, y_train)
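
As a rough illustration of the data generation step, the sketch below draws one bootstrap sample of row indices per tree with NumPy. The toy training-set size, the seed, and the names `n_trees` and `bootstrap_indices` are made up for this example; it only mirrors the idea and is not how RandomForestClassifier is implemented internally.

```python
import numpy as np

rng = np.random.RandomState(2)   # arbitrary seed, chosen for this sketch only

n_samples = 10                   # size of a toy training set (assumption)
n_trees = 3                      # plays the role of n_estimators

bootstrap_indices = []
for _ in range(n_trees):
    # Draw n_samples row indices with replacement: some rows appear
    # several times, others are left out ("out-of-bag") for this tree.
    idx = rng.randint(0, n_samples, n_samples)
    bootstrap_indices.append(idx)
    oob = np.setdiff1d(np.arange(n_samples), idx)
    print("in-bag:", np.sort(idx), " out-of-bag:", oob)
```

Each tree would then be trained on the rows selected by its own index array.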

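Prediction with a random forest

The class of a data point with a given set of features is predicted as follows: each decision tree judges the class of the given point, and the class of the point is decided by majority vote over those results.

Let us check how a random forest decides the class of an arbitrary point by majority vote in classification. The following is an example in which three classification trees are applied to a moons dataset of 10 points.

The figure below confirms that, in the trained random forest, the class at each point is decided by majority vote.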
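
The vote itself can be sketched with NumPy alone. The per-tree predictions below are hypothetical values typed in by hand, not output of the model above. Note also that, according to its documentation, scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes; for fully grown trees the two usually agree.

```python
import numpy as np

# Hypothetical predictions of 3 trees (rows) for 5 points (columns), classes 0/1.
tree_preds = np.array([
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1],
])

# With two classes and an odd number of trees, the majority class is 1
# exactly when more than half of the trees voted 1.
votes_for_1 = tree_preds.sum(axis=0)
majority = (votes_for_1 > tree_preds.shape[0] / 2).astype(int)

print(votes_for_1)  # number of votes for class 1 at each point
print(majority)     # class chosen by majority vote
```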

Checking with the cancer data

Accuracy

A random forest is applied to the breast_cancer data. 100 decision trees are prepared, and all other parameters are left at their defaults.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

ds = load_breast_cancer()
X_train, X_test, y_train, y_test = \
    train_test_split(ds.data, ds.target, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("Training score: {:.3f}".format(forest.score(X_train, y_train)))
print("Test score    : {:.3f}".format(forest.score(X_test, y_test)))

fig, ax = plt.subplots()
fig.subplots_adjust(left=0.3)
ax.barh(ds.feature_names, forest.feature_importances_)
plt.show()

The output of the first half of this code is shown below: the model fits the training data perfectly and also achieves 97.2% accuracy on the test data.

Training score: 1.000
Test score    : 0.972

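[Note] Effect of the number of decision trees

The test score varies with n_estimators, but in this case increasing the number of decision trees beyond 100 does not improve the test score.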

n_estimators = 10 → 0.951
n_estimators = 50 → 0.965
n_estimators ≥ 100 → 0.972
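
A comparison like the one above could be reproduced with a loop such as the following sketch. It assumes the same split and `random_state=0` as the code earlier in this section; the dictionary name `test_scores` is made up here, and the exact scores may differ between scikit-learn versions, so no output is shown.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

ds = load_breast_cancer()
X_train, X_test, y_train, y_test = \
    train_test_split(ds.data, ds.target, random_state=0)

# Fit one forest per setting and record its test accuracy.
test_scores = {}
for n in [10, 50, 100]:
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest.fit(X_train, y_train)
    test_scores[n] = forest.score(X_test, y_test)
    print("n_estimators = {:3d} -> {:.3f}".format(n, test_scores[n]))
```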

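Feature importances

The output of the second half is displayed as below. Compared with the feature importances from a single decision tree, every feature now has a nonzero importance. worst radius, which was the most important feature for the single decision tree, is still important, but worst perimeter is the most important, and worst concave points and concave points are also important.

Future work

The meaning of feature importances in a random forest
The effect of n_jobs
Differences due to the random_state of RandomForestClassifier
The influence of max_features, max_depth, min_samples_leaf and so on
- The defaults are max_features=sqrt(n_features), max_depth=None (grow fully), min_samples_leaf=1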

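Bootstrap sampling

2020-06-20 / tau

Overview

A method in which new samples are built from a sample obtained from a population and then examined statistically. A large number of different resamples are created from the limited sample data (resampling) and used for estimating population parameters, as data for ensemble machine learning, and so on.

The following example generates several resamples from a 1-D array by drawing with replacement using numpy.random.choice().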

import numpy as np

np.random.seed(0)

a = np.arange(10)

for n in range(5):
    print(np.random.choice(a, len(a)))

# [5 0 3 3 7 9 3 5 2 4]
# [7 6 8 8 1 6 7 7 8 1]
# [5 9 8 9 4 3 0 3 5 0]
# [2 3 8 1 3 3 3 7 0 1]
# [9 9 0 4 7 3 2 7 2 0]

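A bootstrap is the tab or loop at the back of a boot, which is held or hooked to pull the boot on. In the 19th century, pulling oneself up by one's bootstraps was used as a metaphor for something impossible, but in the 20th century it reportedly came to be used as a metaphor for managing by oneself or for self-contained mechanisms. "Boot", meaning starting up a computer, is also short for bootstrap.

Estimating a confidence interval with the bootstrap method

By generating a large number of resamples, statistics such as the confidence interval of a parameter can be obtained directly.

From the 2017 National Health and Nutrition Survey data on height and weight in e-Stat, the mean height of 171.2 cm and standard deviation of 6.0 cm for Japanese people in their 40s are used as the population parameters (the number of data points is 374).

Ten random numbers following a normal distribution are generated from these parameters.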

180.9 167.5 168.  164.8 176.4 157.4 181.7 166.6 173.1 169.7

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

np.random.seed(1)

pop_mean = 171.2
pop_std = 6
sample_size = 10
resample_size = sample_size
n_boots = 1000
conf_prob = np.array([0.025, 0.975])

sample = np.random.normal(pop_mean, pop_std, sample_size)
sample_mean = np.mean(sample)
sample_uvar = np.var(sample, ddof=1)
sample_std = np.sqrt(sample_uvar)

Next, a large number of resamples are generated from the sample dataset by bootstrap sampling, and their means are collected into a single dataset.

resample_means = []
for i in range(n_boots):
    resample = np.random.choice(sample, resample_size)
    resample_means.append(np.mean(resample))

The 95% confidence interval (2.5% to 97.5%) is computed with numpy.percentile().

resample_conf = np.percentile(resample_means, conf_prob*100)

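For comparison, the confidence interval of the mean based on the t distribution is also computed for the original sample. It can be obtained with interval() of scipy.stats.t, but here it is computed plainly from the underlying formula.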

sample_conf = \
    sample_mean + \
    stats.t.ppf(conf_prob, df=sample_size-1) * \
        np.sqrt(sample_uvar / sample_size)
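
For reference, the same interval can also be obtained with scipy.stats.t.interval(). The sketch below regenerates the sample with the same seed and parameters as the code above and checks that the two calculations agree. The names `std_err` and `manual_conf` are introduced here; the confidence level is passed positionally because, as I understand it, the keyword name has changed between SciPy versions.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(1)
sample = np.random.normal(171.2, 6, 10)   # same seed and parameters as above

sample_size = len(sample)
sample_mean = np.mean(sample)
std_err = np.sqrt(np.var(sample, ddof=1) / sample_size)

# Manual calculation from the t quantiles
conf_prob = np.array([0.025, 0.975])
manual_conf = sample_mean + stats.t.ppf(conf_prob, df=sample_size-1) * std_err

# Same interval via t.interval(): confidence level, degrees of freedom, loc, scale
lower, upper = stats.t.interval(0.95, sample_size-1, loc=sample_mean, scale=std_err)

print(manual_conf)
print(lower, upper)
```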

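As a result, the distribution of the original 10 samples and the distribution of the means of the 1000 resamples are as follows; the resample means form a clean bell-shaped distribution.

The various figures for this case are as follows.

The standard deviation of the resample means (square root of the unbiased variance) is 2.186, smaller than that of the population or the sample. This is because it is the spread of the means of many resamples, which has a different meaning from the spread of the population or of the original sample.

Also, the confidence interval obtained by the bootstrap is narrower than the one estimated from the 10 data points with the t distribution. This tendency does not change with the random seed and appears to be a general one.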

Mean and STD
population   : 171.2, 6.000
sample       : 170.6, 7.532
resample mean: 170.6, 2.186
Confidence interval
sample       : 165.23 - 176.01 (10.78)
resample mean: 166.34 - 174.96 ( 8.62)
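
The spread of the resample means can be compared with the usual standard error of the mean. The sketch below repeats the resampling with the same seed and prints both figures; the comparison with s / sqrt(n) is an addition for illustration and is not part of the original calculation. Because each resample is drawn from the 10 observed values, the bootstrap figure is expected to come out close to sqrt((n-1)/n) times the standard error.

```python
import numpy as np

np.random.seed(1)
sample = np.random.normal(171.2, 6, 10)
n = len(sample)

# Bootstrap: standard deviation of the means of 1000 resamples
resample_means = [np.mean(np.random.choice(sample, n)) for _ in range(1000)]
boot_std = np.std(resample_means, ddof=1)

# Classical standard error of the mean, s / sqrt(n)
std_err = np.std(sample, ddof=1) / np.sqrt(n)

print("bootstrap std of means   : {:.3f}".format(boot_std))
print("s / sqrt(n)              : {:.3f}".format(std_err))
print("sqrt((n-1)/n) * s/sqrt(n): {:.3f}".format(np.sqrt((n - 1) / n) * std_err))
```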

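The following is the case where the number of resamples is reduced from 1000 to 100; the distribution is still well shaped and the confidence interval is still narrower than the t-distribution estimate. Increasing the number of resamples to 100,000 or 1,000,000 does not make the variance any smaller, and the confidence interval does not change either.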

Mean and STD
population   : 171.2, 6.000
sample       : 170.6, 7.532
resample mean: 170.9, 2.220
Confidence interval
sample       : 165.23 - 176.01 (10.78)
resample mean: 166.21 - 175.32 ( 9.11)

The code for the calculations and output above is as follows.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

np.random.seed(1)

pop_mean = 171.2
pop_std = 6
sample_size = 10
resample_size = sample_size
n_boots = 1000
conf_prob = np.array([0.025, 0.975])

sample = np.random.normal(pop_mean, pop_std, sample_size)
sample_mean = np.mean(sample)
sample_uvar = np.var(sample, ddof=1)
sample_std = np.sqrt(sample_uvar)

resample_means = []
for i in range(n_boots):
    resample = np.random.choice(sample, resample_size)
    resample_means.append(np.mean(resample))

resample_means_mean = np.mean(resample_means)
resample_means_std = np.sqrt(np.var(resample_means, ddof=1))
resample_conf = np.percentile(resample_means, conf_prob*100)

sample_conf = \
    sample_mean + \
    stats.t.ppf(conf_prob, df=sample_size-1) * \
        np.sqrt(sample_uvar / sample_size)

print ("Mean and STD")
print("population   : {:5.1f}, {:5.3f}".format(pop_mean, pop_std))
print("sample       : {:5.1f}, {:5.3f}".format(sample_mean, sample_std))
print("resample mean: {:5.1f}, {:5.3f}".
    format(resample_means_mean, resample_means_std))
print("Confidence interval")
print("sample       : {:6.2f} - {:6.2f} ({:5.2f})".
    format(sample_conf[0], sample_conf[1], sample_conf[1] - sample_conf[0]))
print("resample mean: {:6.2f} - {:6.2f} ({:5.2f})".
    format(resample_conf[0], resample_conf[1], resample_conf[1] - resample_conf[0]))

fig, axs = plt.subplots(1, 2, figsize=(10, 4.8))
axs[0].hist(sample, ec='k')
axs[1].hist(resample_means, bins=10, ec='k')
plt.show()

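numpy.percentile(): percentiles

2020-06-20 / tau

numpy.percentile() computes the specified percentile values from a given array.

percentile(a, q)
a: the source array from which percentiles are computed.
q: a percentile or an array of percentiles. Percentiles range from 0 to 100; note that they are expressed as percentages. If a 1-D array is given, the values corresponding to each percentile are returned as an array of the same size.

The following is an example. When a percentile falls between elements, the value is interpolated.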

import numpy as np

a = np.arange(11)

print("source = {}".format(a))
print("40 percentile = {}".format(np.percentile(a, 40)))
print("43 percentile = {}".format(np.percentile(a, 43)))

# source = [ 0  1  2  3  4  5  6  7  8  9 10]
# 40 percentile = 4.0
# 43 percentile = 4.3
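
The interpolation can be reproduced by hand. The sketch below assumes NumPy's default linear interpolation, in which the q-th percentile sits at position (n-1)*q/100 in the sorted array; the helper name `percentile_linear` and the small test array are made up for this example.

```python
import numpy as np

def percentile_linear(a, q):
    """Percentile by linear interpolation between neighbouring sorted values."""
    s = np.sort(np.asarray(a, dtype=float))
    pos = (len(s) - 1) * q / 100.0      # fractional index into the sorted array
    lo = int(np.floor(pos))
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] + (s[hi] - s[lo]) * frac

a = np.array([3, 10, 1, 7, 4])
for q in [0, 40, 43, 100]:
    # Print the hand-computed value next to NumPy's result for comparison.
    print(q, percentile_linear(a, q), np.percentile(a, q))
```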

The source array does not need to be sorted.

np.random.shuffle(a)
print(np.percentile(a, 63))

# 6.3

When the percentiles are given as an array:

print(np.percentile(a, [55, 75]))
# [5.5 7.5]

For a two-sided 95% confidence interval, the bounds can be computed as follows.

print(np.percentile(a, [2.5, 97.5]))
# [0.25 9.75]

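Confidence interval for a population proportion

2020-06-18 / tau

Let p be the success probability of a Bernoulli trial. The mean and variance of the random variable X (the number of successes), which follows a binomial distribution when the trial is repeated n times, are given by the following.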

(1) $\begin{align*} E(X) &= np \\ V(X) &= np(1 - p) \end{align*}$

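When the number of trials n is large, by the central limit theorem the following random variable approximately follows the standard normal distribution.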

(2) $\begin{equation*} Z = \frac{X - np}{\sqrt{np(1 - p)}} \end{equation*}$

Divide the numerator and denominator by n, and write $X/n = \hat{p}$ for the proportion observed in the sample.

(3) $\begin{equation*} Z = \frac{\dfrac{X}{n} - p}{\sqrt{\dfrac{p(1 - p)}{n}}} = \frac{\hat{p} - p}{\sqrt{\dfrac{p(1 - p)}{n}}} \end{equation*}$

Since Z follows the standard normal distribution, the confidence interval at confidence level α can be written as follows.

(4) $\begin{equation*} -Z_\alpha = Z\left( \frac{1 - \alpha}{2} \right) \le \frac{\hat{p} - p}{\sqrt{\dfrac{p(1 - p)}{n}}} \le Z\left( \frac{1 + \alpha}{2} \right) = Z_\alpha \end{equation*}$

From this, the confidence interval for p can be written as follows.

(5) $\begin{equation*} \hat{p} - Z_\alpha \sqrt{\dfrac{p(1 - p)}{n}} \le p \le \hat{p} + Z_\alpha \sqrt{\dfrac{p(1 - p)}{n}} \end{equation*}$

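Here the population proportion p appears in the bounds of the confidence interval, but when n is large we can take $\hat{p} = p$ as an approximation and obtain the following.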

(6) $\begin{equation*} \hat{p} - Z_\alpha \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}} \le p \le \hat{p} + Z_\alpha \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}} \end{equation*}$
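
Equation (6) translates directly into code. The sketch below uses only the standard library; the function name `proportion_ci` and the example counts (40 successes in 100 trials) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(x, n, alpha=0.95):
    """Approximate confidence interval for p from x successes in n trials, eq. (6)."""
    p_hat = x / n
    # Z_alpha: the (1 + alpha) / 2 quantile of the standard normal distribution
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

lower, upper = proportion_ci(40, 100)
print("{:.3f} - {:.3f}".format(lower, upper))
```

Since the approximation relies on n being large, the interval should be treated with caution for small n or for p close to 0 or 1.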

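Next, Bernoulli trials with population proportions from 0 to 1.0 were run with different numbers of repetitions, and the mean and standard deviation of the observed proportion were computed.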

import numpy as np
import scipy.stats as stats
import pandas as pd


def p_trials(n, p, m):
    sum_p = []
    for trial in range(m):
        x = stats.uniform.rvs(size=n)
        sum_p.append(len(x[x<p]) / n)
    return np.mean(sum_p), np.std(sum_p, ddof=1)


np.random.seed(0)

p_list = np.arange(0, 1.1, 0.1)
n_list = [10, 20, 30, 50, 100, 1000]
n_trials = 100

mean_results = np.empty((len(p_list), len(n_list)))
std_results = np.empty((len(p_list), len(n_list)))

for cp, p in enumerate(p_list):
    for cn, n in enumerate(n_list):
        mean, std = p_trials(n, p, n_trials)
        mean_results[cp, cn] = mean
        std_results[cp, cn] = std

pd.options.display.precision = 3

df_mean = pd.DataFrame(mean_results, columns=n_list)
df_mean["p"] = p_list
columns = ["p"] + n_list
df_mean = df_mean.loc[:, columns]

df_std = pd.DataFrame(std_results, columns=n_list)
df_std["p"] = p_list
columns = ["p"] + n_list
df_std = df_std.loc[:, columns]

print(df_mean)
print(df_std)

First, the mean of the observed proportion is reasonably accurate even at n = 10 and does not change much with the number of trials.

      p     10     20     30     50    100   1000
0   0.0  0.000  0.000  0.000  0.000  0.000  0.000
1   0.1  0.093  0.102  0.105  0.099  0.097  0.101
2   0.2  0.215  0.194  0.196  0.208  0.206  0.203
3   0.3  0.328  0.287  0.295  0.297  0.299  0.299
4   0.4  0.393  0.384  0.394  0.396  0.407  0.399
5   0.5  0.494  0.491  0.514  0.494  0.497  0.498
6   0.6  0.596  0.609  0.605  0.592  0.598  0.600
7   0.7  0.695  0.714  0.704  0.698  0.694  0.700
8   0.8  0.811  0.807  0.799  0.791  0.793  0.798
9   0.9  0.910  0.904  0.887  0.898  0.903  0.902
10  1.0  1.000  1.000  1.000  1.000  1.000  1.000

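Next, look at the standard deviation of the observed proportion (the square root of the unbiased variance). The closer the population proportion is to 1/2, the larger the spread, and the larger the number of trials n, the smaller the spread. In practice, around n = 50 to 100 the spread is perhaps small enough to use the observed proportion in place of the population proportion.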

      p     10     20     30     50    100   1000
0   0.0  0.000  0.000  0.000  0.000  0.000  0.000
1   0.1  0.090  0.067  0.061  0.041  0.029  0.010
2   0.2  0.120  0.092  0.083  0.053  0.038  0.011
3   0.3  0.162  0.103  0.090  0.068  0.043  0.013
4   0.4  0.145  0.110  0.079  0.074  0.049  0.016
5   0.5  0.148  0.105  0.094  0.060  0.048  0.016
6   0.6  0.150  0.124  0.102  0.069  0.047  0.016
7   0.7  0.127  0.106  0.084  0.060  0.042  0.015
8   0.8  0.117  0.098  0.065  0.052  0.036  0.012
9   0.9  0.089  0.060  0.056  0.043  0.030  0.010
10  1.0  0.000  0.000  0.000  0.000  0.000  0.000
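
For comparison, the standard deviation of the observed proportion implied by equation (3) is sqrt(p(1-p)/n). The sketch below tabulates it for the same p and n as above; it is an added cross-check, not part of the original experiment, and the array name `theory` is made up here.

```python
import numpy as np

p_list = np.linspace(0, 1, 11)
n_list = [10, 20, 30, 50, 100, 1000]

# Theoretical standard deviation of p_hat = X / n: sqrt(p (1 - p) / n).
# Rows correspond to p, columns to n (via broadcasting).
p_col = p_list[:, np.newaxis]
theory = np.sqrt(p_col * (1 - p_col) / np.array(n_list))

np.set_printoptions(precision=3, suppress=True)
print(theory)
```

For example, p = 0.5 and n = 100 gives 0.05, close to the 0.048 observed above.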

The following is a graph of the observed proportion for B(n, 0.5) as n is varied; again the spread drops sharply up to around n = 50.
