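Overfitting: the case of polynomial regression

2020-05-14 / tau

Overview

As an example of overfitting, this post summarizes what happens when the coefficients of a polynomial are estimated by linear regression.

Given several points (x_i, y_i), we vary the number of terms in the following linear model and fit it with LinearRegression from the Python package scikit-learn.

(1) $\begin{equation*} \hat{y} = w_0 + \sum_{j=1}^m w_j x^j \end{equation*}$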

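When the data are few

The example below generates four equally spaced values in [-3, 1], prepares the four points (x, e^x), and fits this dataset while varying the number of polynomial terms (that is, the degree of x) from 1 to 6. For example, n_terms=3 means determining the four coefficients of $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$.

With n_terms=1 the model is a simple linear function and cannot be said to capture the curved relationship in the dataset.
With n_terms=2 the curve fits the points fairly well, but it departs from the true function for x < −1.
With n_terms=3 the number of terms (features) is one less than the number of data points. The curve matches each point almost exactly and looks the most plausible (although it need not match outside the range of the dataset; a finite polynomial in x must diverge from the exponential function somewhere).
With n_terms=4 the number of terms (features) equals the number of data points. The predicted curve passes through every point, but the fit looks forced, and the curve shoots up to the left of the dataset.
With n_terms=5 there are more features than data points. The predicted curve still passes through every point, but it is slightly distorted between the first and second points.
With n_terms=6 the distortion grows.

The code used to produce these results is as follows.

Lines 7-8 define a function that evaluates the polynomial from the intercept, the coefficients, and the values of x.
Line 19 generates n_data=4 values of x, and line 20 computes the exponential function. The code is prepared to add random scatter for later use, but no scatter is applied here.
Lines 23-24 generate the x^n features.
Line 35 fits the linear regression model, using the terms (= degrees) up to the number given by n_terms.
Line 36 uses the fitted intercept and coefficients to compute the values of the predicted curve.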

import numpy as np
import random as rnd
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

def poly(intercept, coef, x):
    return intercept + sum([w * x**(n + 1) for n, w in enumerate(coef)])

rnd.seed(0)
xmin, xmax = -3, 1
xlim_min, xlim_max = -4, 2
ylim_min, ylim_max = -2, 4

n_data = 4
n_features = 20
n_terms_list = [1, 2, 3, 4, 5, 6]

x = np.linspace(xmin, xmax, n_data)
y = np.exp(x) + [rnd.uniform(-0.0, 0.0) for n in range(n_data)]

df = pd.DataFrame(y, columns=['y'])
for n in range(n_features):
    df["x^{}".format(n+1)] = x**(n+1)
print(df)

fig, axs = plt.subplots(2, 3, figsize=(12, 6.4))
axs_1d = axs.reshape(1, -1)[0]

linreg = LinearRegression()

x_graph = np.linspace(xlim_min, xlim_max)

for ax, n_terms in zip(axs_1d, n_terms_list):
    linreg.fit(df.iloc[:, 1:n_terms+1], df['y'])
    y_linreg = poly(linreg.intercept_, linreg.coef_, x_graph)
    ax.scatter(df['x^1'], df['y'], c='r', zorder=10)
    ax.plot(x_graph, y_linreg, c='gray', linewidth=2,
        label="n_terms={}".format(n_terms))

    ax.set_xlim(xlim_min, xlim_max)
    ax.set_ylim(ylim_min, ylim_max)
    ax.set_aspect('equal')
    ax.legend(loc='upper left')

plt.show()
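
As a rough numerical check of the behavior described for the four-point case, the following sketch computes the training R² for each n_terms on the same noise-free points. It is not part of the original script; the helper name training_scores is an assumption made for illustration. A cubic with an intercept has four coefficients and can therefore interpolate four points exactly, so the training score should reach 1 from n_terms=3 onward.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def training_scores(x, y, n_terms_list):
    """Return the training R^2 of a polynomial fit for each number of terms.

    Hypothetical helper, not part of the original script.
    """
    scores = []
    for n_terms in n_terms_list:
        # Feature matrix with columns x^1 ... x^n_terms
        X = np.column_stack([x ** k for k in range(1, n_terms + 1)])
        model = LinearRegression().fit(X, y)
        scores.append(model.score(X, y))
    return scores


x = np.linspace(-3, 1, 4)
y = np.exp(x)
n_terms_list = [1, 2, 3, 4, 5, 6]
scores = training_scores(x, y, n_terms_list)

for n_terms, score in zip(n_terms_list, scores):
    print("n_terms={}: training R^2 = {:.6f}".format(n_terms, score))
```

A training score of 1 only says that the curve passes through the given points; it says nothing about how the curve behaves between or outside them, which is exactly where the plots differ.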

With an outlier

We add a single outlier, far from the others, to the clean exponential data above.

x = np.linspace(xmin, xmax, n_data)
y = np.exp(x) + [rnd.uniform(-0.0, 0.0) for n in range(n_data)]
x = np.append(x, -1)
y = np.append(y, 2)

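Compared with the previous example, the instability, i.e. the degree of oscillation of the curve, is larger.

With more data

We use 10 points and perturb them with random numbers (the random sequence is also changed).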

rnd.seed(1)

.....

n_data = 10
n_features = 20
n_terms_list = [1, 3, 5, 7, 9, 13]

x = np.linspace(xmin, xmax, n_data)
y = np.exp(x) + [rnd.uniform(-0.6, 0.6) for n in range(n_data)]

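From around n_terms=5 the curve starts to wiggle as it tries to fit every point, and the oscillation grows from around the point where the number of features equals the number of data points.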

English terms: machine learning

2020-05-10 / tau

Explanatory / independent variable (説明変数・独立変数): explanatory variable, regressor, independent variable, designed variable

Explained / dependent variable (被説明変数・従属変数): explained variable, regressand, dependent variable, designed variable

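Understanding Ridge regression

2020-04-26 / tau

Definition

Ridge regression adds an L2 regularization term as a penalty to the loss function of multiple regression. The meaning of regularization is covered in detail in a separate post.

The L2 norm is the Euclidean distance from the origin.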

(1) $\begin{equation*} \| \boldsymbol{w} \| _2 = \sqrt{w_1 ^2 + \cdots + w_m^2} \end{equation*}$

Ridge regression, however, uses the sum of squares inside the square root, i.e. the squared norm.

(2) $\begin{equation*} \mathrm{minimize} \quad \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^m w_j^2 \end{equation*}$

Formulation

The function to be minimized is

(3) $\begin{align*} L &= \sum_{i=1}^n ( \hat{y}_i - y_i )^2 + \alpha ({w_1}^2 + \cdots + {w_m}^2) \\ &= \sum ( w_0 + w_1 x_{1i} + \cdots + w_m x_{mi} - y_i )^2 + \alpha ({w_1}^2 + \cdots + {w_m}^2) \end{align*}$

To compute the weights, take the partial derivative with respect to each of them and set it to zero.

(4) $\begin{align*} \frac{\partial L}{\partial w_0} &= 2 \sum (w_0 + w_1 x_{1i} + \cdots + w_m x_{mi} - y_i) = 0 \\ \frac{\partial L}{\partial w_1} &= 2 \sum x_{1i} (w_0 + w_1 x_{1i} + \cdots + w_m x_{mi} - y_i) + 2 \alpha w_1 = 0 \\ \vdots\\ \frac{\partial L}{\partial w_m} &= 2 \sum x_{mi} (w_0 + w_1 x_{1i} + \cdots + w_m x_{mi} - y_i) + 2 \alpha w_m = 0\\ \end{align*}$

The resulting system of equations is as follows.

(5) $\begin{align*} n w_0 + w_1 \sum x_{1i} + \cdots + w_m \sum x_{mi} &= \sum y_i \\ w_0 \sum x_{1i} + w_1 \left( \sum {x_{1i}}^2 + \alpha \right) + \cdots + w_m \sum x_{1i} x_{mi} &= \sum x_{1i} y_i \\ \vdots \\ w_0 \sum x_{mi} + w_1 \sum x_{1i} x_{mi} + \cdots+ w_m \left( \sum {x_{mi}}^2 + \alpha \right) &= \sum x_{mi} y_i \\ \end{align*}$

Writing each sum with the symbol S and subscripts, and putting the system in matrix form, gives the following.

(6) $\begin{equation*} \left[ \begin{array}{cccc} n & S_1 & \cdots & S_m \\ S_1 & S_{11} + \alpha & \cdots & S_{1m} \\ \vdots & \vdots & & \vdots \\ S_m & S_{m1} & \cdots & S_{mm} + \alpha \end{array} \right] \left[ \begin{array}{c} w_0 \\ w_1 \\ \vdots \\ w_m \end{array} \right] = \left[ \begin{array}{c} S_y \\S_{1y} \\ \vdots \\ S_{my} \end{array} \right] \end{equation*}$

Eliminating $w_0$ gives the following system of equations.

(7) $\begin{align*} &\left[ \begin{array}{ccc} ( S_{11} + \alpha ) - \dfrac{{S_1}^2}{n} & \cdots & S_{1m} - \dfrac{S_1 S_m}{n} \\ \vdots & & \vdots \\ S_{m1} - \dfrac{S_m S_1}{n} & \cdots & ( S_{mm} + \alpha )- \dfrac{{S_m}^2}{n} \end{array} \right] \left[ \begin{array}{c} w_1 \\ \vdots \\ w_m \end{array} \right] \\&= \left[ \begin{array}{c} S_{1y} - \dfrac{S_1 S_y}{n} \\ \vdots \\ S_{my} - \dfrac{S_m S_y}{n} \end{array} \right] \end{align*}$

Expressed in terms of variances and covariances, this becomes

(8) $\begin{equation*} \left[ \begin{array}{ccc} V_{11} + \dfrac{\alpha}{n} & \cdots & V_{1m} \\ \vdots & & \vdots \\ V_{m1} & \cdots & V_{mm} + \dfrac{\alpha}{n} \end{array} \right] \left[ \begin{array}{c} w_1 \\ \vdots \\ w_m \end{array} \right] = \left[ \begin{array}{c} V_{1y} \\ \vdots \\ V_{my} \end{array} \right] \end{equation*}$

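Now suppose that two features are in a perfect linear relationship. Writing $x_j = a x_i + b$ (here i and j index features, not data points), the properties of variance and covariance give, for any other feature k,

(9) $\begin{equation*} V_{jj} = a^2V_{ii}, \; V_{ji} = V_{ij} = aV_{ii}, \; V_{jk} = V_{kj} = aV_{ik} = aV_{ki} \end{equation*}$

In such a case ordinary linear regression has no unique solution because of multicollinearity, but applying these relations to equation (8) gives the following coefficient matrix.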

(10) $\begin{align*} \left[ \begin{array}{ccccccc} V_{11} + \dfrac{\alpha}{n} & \cdots & V_{1i} & \cdots & aV_{1i} & \cdots & V_{1m}\\ \vdots && \vdots && \vdots && \vdots\\ V_{i1} & \cdots & V_{ii} + \dfrac{\alpha}{n} & \cdots & aV_{ii} & \cdots & V_{im}\\ \vdots && \vdots && \vdots && \vdots\\ aV_{i1} & \cdots & aV_{ii} & \cdots & a^2V_{ii} + \dfrac{\alpha}{n} & \cdots & aV_{im}\\ \vdots && \vdots && \vdots && \vdots\\ V_{m1} & \cdots & V_{mi} & \cdots & aV_{mi} & \cdots & V_{mm} + \dfrac{\alpha}{n} \end{array} \right] \end{align*}$

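Because α is added to the diagonal elements, the coefficient matrix becomes nonsingular for α > 0 even under strong multicollinearity, and the equations have a solution. In addition, by the regularization effect, choosing a large α keeps the coefficients small.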
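
The effect of the α/n term on the diagonal can be checked numerically. The following is a minimal sketch with made-up data (the values of x1, a, b and alpha are assumptions for illustration): two perfectly collinear features give a singular covariance matrix, and adding α/n to the diagonal makes the determinant positive.

```python
import numpy as np

# Hypothetical data: x2 is an exact linear function of x1
x1 = np.linspace(0.0, 1.0, 10)
a, b = 2.0, 1.0
x2 = a * x1 + b
X = np.column_stack([x1, x2])
n = X.shape[0]

# Population covariance matrix (divide by n), matching V_jk in the text
V = np.cov(X, rowvar=False, bias=True)
det_plain = np.linalg.det(V)

# Add alpha / n to the diagonal, as in equation (8)
alpha = 0.1
V_ridge = V + (alpha / n) * np.eye(2)
det_ridge = np.linalg.det(V_ridge)

print("det without regularization:", det_plain)
print("det with alpha/n on the diagonal:", det_ridge)
```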

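Matrix form

Writing the loss function of equation (3) in matrix form for n data points gives the following (the matrix form of multiple regression is covered in a separate post).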

(11) $\begin{align*} L &= \left( \boldsymbol{Xw} - \boldsymbol{y} \right)^T \left( \boldsymbol{Xw} - \boldsymbol{y} \right) + \alpha \boldsymbol{w}^T \boldsymbol{w} \\ &= \boldsymbol{w}^T \boldsymbol{X}^T \boldsymbol{Xw} - 2\boldsymbol{y}^T \boldsymbol{Xw} + \boldsymbol{y}^T \boldsymbol{y} + \alpha \boldsymbol{w}^T \boldsymbol{w} \end{align*}$

Differentiate this with respect to w to find the value that minimizes L.

(12) $\begin{gather*} \frac{dL}{d\boldsymbol{w}} = 2\boldsymbol{X}^T \boldsymbol{Xw} - 2 \boldsymbol{X}^T \boldsymbol{y} + 2 \alpha \boldsymbol{w} = \boldsymbol{0} \\ \boldsymbol{w} = \left( \boldsymbol{X}^T \boldsymbol{X} + \alpha \boldsymbol{I} \right)^{-1} \boldsymbol{X}^T \boldsymbol{y} \end{gather*}$
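
Equation (12) can be compared against scikit-learn. The sketch below uses synthetic data (all values are assumptions) and Ridge with fit_intercept=False, so that w contains only penalized coefficients. Note that if X contained a column of ones for the intercept, the term $\alpha \boldsymbol{w}^T \boldsymbol{w}$ would also penalize $w_0$, unlike equation (3).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
alpha = 1.0

# Closed form of equation (12): w = (X^T X + alpha I)^(-1) X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Same problem solved by scikit-learn, without an intercept
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)

print("closed form :", w_closed)
print("scikit-learn:", ridge.coef_)
```

Using np.linalg.solve instead of forming the inverse explicitly is the usual choice for numerical stability; the result is the same vector.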

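wave dataset: linear regression

2020-04-05 / tau

A reproduction of applying scikit-learn's linear regression to the wave dataset, as shown in O'Reilly's "Pythonではじめる機械学習" (the Japanese edition of "Introduction to Machine Learning with Python").

With 60 samples of the wave dataset and random_state=42 in train_test_split(), we obtain the same graph as in the book.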

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mglearn.datasets import make_wave

xmin, xmax = -3, 3
ymin, ymax = -3, 3

X_source, y_source = make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X_source, y_source, random_state=42)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

X_test = np.linspace(xmin, xmax, 2).reshape(-1, 1)
y_test = linreg.predict(X_test)

print(linreg.coef_[0], linreg.intercept_)

fig, ax = plt.subplots(figsize=(6.4, 6.4))

ax.scatter(X_source, y_source, s=20)
ax.plot(X_test, y_test, c="tab:orange")

ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid()

ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)

ax.set_aspect('equal')

plt.show()

The fitted coefficient, intercept, and scores also match those in the book.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mglearn.datasets import make_wave

X, y = make_wave(n_samples=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

print("coef_     : {}".format(linreg.coef_))
print("intercept_: {}".format(linreg.intercept_))

print("training score: {:.3f}".format(linreg.score(X_train, y_train)))
print("test score    : {:.3f}".format(linreg.score(X_test, y_test)))

# coef_     : [0.39390555]
# intercept_: -0.031804343026759746
# training score: 0.670
# test score    : 0.659

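Breast cancer dataset: score curves with logistic regression

2020-04-05 / tau

Overview

Results of applying logistic regression to the breast-cancer dataset with scikit-learn's LogisticRegression class.

For the overall workflow of the method, see the separate post on logistic regression with the cancer dataset (from "Pythonではじめる機械学習").

Here we look at how the score (accuracy) changes as the hyperparameter is varied.

Score curves

These are the score curves obtained with scikit-learn's LogisticRegression class as the regularization parameter C is varied. The solver argument of the class selects among several optimization methods, and the curves differed between solvers more than expected. Changing random_state in train_test_split() also makes a difference. The dataset has 569 records, which are further split into training and test data, so a fair amount of variation may be unavoidable at that size.

First, the score curves for the four solvers with random_state=0. L-BFGS is reportedly a quasi-Newton method, so it is understandable that it shows the same tendency as Newton-CG. SAG (Stochastic Average Gradient) seems to work differently and behaves quite unlike the other solvers. The iteration limit is set with max_iter=10000, and even so several warnings about exceeding the iteration limit are issued. Raising the limit by two orders of magnitude changes little.

With random_state=11, liblinear does not change much, but the other three solvers show different tendencies; with sag in particular, the score on the training data is lower than the score on the test data.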

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

ds = load_breast_cancer()

df = pd.DataFrame(ds.data, columns=ds.feature_names)

X_train, X_test, y_train, y_test = \
    train_test_split(df, ds.target, stratify=ds.target, random_state=0)

C_sup = np.linspace(5, -4, 20)
C_val = 10**C_sup

solvers = ['liblinear', 'lbfgs', 'newton-cg', 'sag']


fig, axs = plt.subplots(2, 2, figsize=(8, 8))
axs_1d = axs.reshape(-1)

for ax, solver in zip(axs_1d, solvers):
    train_scores = np.empty(0)
    test_scores = np.empty(0)
    for C in C_val:
        logreg = LogisticRegression(C=C, solver=solver, max_iter=10000)
        logreg.fit(X_train, y_train)
        train_scores = np.append(train_scores, logreg.score(X_train, y_train))
        test_scores = np.append(test_scores, logreg.score(X_test, y_test))

    ax.plot(C_val, train_scores, label="Training scores")
    ax.plot(C_val, test_scores, label="Test scores")

    ax.set_xscale('log')
    ax.set_ylim(0.9, 1)
    ax.grid(True)
    ax.legend()
    ax.set_title(solver)

plt.show()

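Logistic regression: one variable, solved with Excel

2020-04-04 / tau

Example data

Suppose a travel agency has data on the ages of 20 member customers and on whether each chose a spa (SPA) or a leisure land (LSR). Which of the two should be recommended to a new customer?

Logistic regression for this kind of classification problem can be done with Python packages and the like, but here we show a method using Excel.

The flow is as follows: from each customer's categorical choice variable and age, define formulas for the individual likelihoods and the total likelihood; set initial values for the intercept and coefficient; then use Excel's Solver to find the intercept and coefficient that maximize the likelihood.

The source data consist of each customer's age, whether the chosen destination was the spa (SPA) or the leisure land (LSR), and a categorical variable that is 1 if the spa was chosen and 0 if the leisure land was chosen.

Preparing the calculation table

From these data, build a table like the following. The meaning and content of each cell are as follows.

coef: the intercept A and coefficient B of the linear expression; both are initialized to 0 and receive the result of the iterative fit
intercept: data used for computing the intercept; a fixed value of 1 in every row
prob: the probability that is_spa=1 for each customer's age when coef holds the values A and B; computed with the logistic function
- the cell contains the formula =1/(1+EXP(-$A*Y-$B*Z))
- $A and $B denote absolute references, so the same cells are used for every row
LH: the log-likelihood of the observed is_spa value
- the cell contains the formula =X*LN(C)+(1-X)*LN(1-C)
MLE: the sum of LH over all data; after the fit it holds the maximized log-likelihood for this dataset

Iterative fit

Open Solver at the far right of the Data tab (if it is not there, go to File → Options → Add-ins → Go and tick the Solver Add-in).

In the Solver Parameters dialog,

select D above as the objective cell
select the range A:B above as the variable cells
choose "Max" as the target

Pressing the "Solve" button runs the iterative fit and sets A:B to the values that maximize D.

The result in this case is as follows.

coef: -13.6562, 0.234647
MLE: -7.45298

If probability 0.5 (linear expression equal to 0) is taken as the threshold between spa and leisure land, the corresponding age is computed as follows.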

(1) $\begin{equation*} -13.6562 + 0.234647 \times x = 0 \quad \rightarrow \quad x = \frac{13.6562 }{0.234647 } =58.2 \end{equation*}$
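
The quantity that Solver maximizes and the threshold age can be sketched in Python as follows. The function name log_likelihood and the small arrays of ages and choices are assumptions for illustration and are not the 20-customer data of the post; only the two coefficients are taken from the result reported here.

```python
import numpy as np


def log_likelihood(a, b, ages, is_spa):
    """Sum of X*LN(C) + (1-X)*LN(1-C) over all rows, as in the LH and MLE cells."""
    prob = 1.0 / (1.0 + np.exp(-(a + b * ages)))
    return np.sum(is_spa * np.log(prob) + (1 - is_spa) * np.log(1 - prob))


# Hypothetical ages and choices, NOT the data used in the post
ages = np.array([25.0, 38.0, 52.0, 61.0, 70.0])
is_spa = np.array([0, 0, 1, 0, 1])

# With a = b = 0 every probability is 0.5, so the value is 5 * ln(0.5)
ll_initial = log_likelihood(0.0, 0.0, ages, is_spa)

# Threshold age from the reported coefficients: a + b * x = 0
a_fit, b_fit = -13.6562, 0.234647
threshold_age = -a_fit / b_fit

print("log-likelihood at the initial values:", ll_initial)
print("threshold age: {:.1f}".format(threshold_age))
```

Maximizing log_likelihood over a and b with a numerical optimizer would play the role of Excel's Solver in this sketch.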

Applying scikit-learn's LogisticRegression model in Python to the same data (with C=1e5) gives the following result, which is quite close.

intercept_ = [-13.38993211]
coefficient_ = [0.23015561]

Plotting the logistic function with the obtained coefficients gives the figure below, in which the logistic curve appears.

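Boston house-prices dataset

2020-03-25 / tau

Overview

The Boston house-prices dataset was taken from the StatLib library at Carnegie Mellon University and consists of home values and indicators for the areas the homes belong to.

It gives the median value of owner-occupied homes for each of 506 areas of Boston, together with 13 indicators for the area such as the crime rate and the NOx concentration.

This post summarizes how to use the boston data provided by Python's scikit-learn.

Loading the data and its structure

In Python, the data can be loaded with load_boston() in the datasets module of scikit-learn. The data is an object of the Bunch class. Note that load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so the code in this post assumes an older version.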

from sklearn.datasets import load_boston

boston_ds = load_boston()

for key, value in zip(boston_ds.keys(), boston_ds.values()):
    print("{}:\n{}\n".format(key, value))

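The dataset has a dictionary-like structure and contains 13 features for 506 areas, the value of owner-occupied homes in each area in units of $1000, and other items.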

data:
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]

target:
[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 18.4 21.  12.7 14.5 13.2 13.1 13.5 18.9 20.  21.  24.7 30.8 34.9 26.6
 .....
 16.7 12.  14.6 21.4 23.  23.7 25.  21.8 20.6 21.2 19.1 20.6 15.2  7.
  8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
 22.  11.9]

feature_names:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

DESCR:
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

.....

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.


filename:
C:\Users\tomo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv

The keys of the data are as follows.

from sklearn.datasets import load_boston

boston_ds = load_boston()

print(boston_ds.keys())

# dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

Contents of the data

`'data'`: feature dataset

A two-dimensional array holding the 13 indicators for the 506 areas as features. The column index corresponds to the feature number.

data:
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]

`'target'`: home values

Median value of owner-occupied homes in each of the 506 areas, in units of $1000.

target:
[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 18.4 21.  12.7 14.5 13.2 13.1 13.5 18.9 20.  21.  24.7 30.8 34.9 26.6
 .....
 16.7 12.  14.6 21.4 23.  23.7 25.  21.8 20.6 21.2 19.1 20.6 15.2  7.
  8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
 22.  11.9]

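`'feature_names'`: feature names

Names of the 13 features.

CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1: tract bounds the river, 0: otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built before 1940
DIS: weighted distance to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: percentage of lower-status population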

feature_names:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

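`'filename'`: file name

The full path of the CSV file. The first line of the file holds the number of records and the number of features, the second line holds the names of the 13 features and of the target (home value), and the following 506 rows hold 13 feature columns and one target column. The file does not contain the data corresponding to DESCR.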

'C:...\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv'

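`'DESCR'`: description of the dataset

Description of the dataset. It is displayed with its formatting when printed, e.g. print(boston_ds['DESCR']).

506 records
The attributes are 13 numeric/categorical attributes plus the median value, which is usually used as the target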

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

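Using the data

How to retrieve the data

There are two ways to retrieve each item from the boston dataset.

Look it up with the dictionary key (e.g. boston['DESCR'])
Specify the key string as an attribute (e.g. boston.DESCR)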
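
The equivalence of the two access styles comes from the Bunch class itself. The following minimal sketch builds a small Bunch by hand rather than loading the real dataset; the object toy_ds and its contents are hypothetical stand-ins.

```python
from sklearn.utils import Bunch

# Hypothetical stand-in for the object returned by a dataset loader
toy_ds = Bunch(data=[[1.0, 2.0], [3.0, 4.0]], DESCR="toy description")

# Dictionary-style access and attribute-style access return the same object
by_key = toy_ds['DESCR']
by_attr = toy_ds.DESCR

print(by_key == by_attr)
print(list(toy_ds.keys()))
```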

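Retrieving the feature data for all records

From 'data', the 13 features for the 506 records are obtained as a two-dimensional array with 506 rows and 13 columns. The 13 features correspond to the 13 names in 'feature_names'.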

from sklearn.datasets import load_boston

boston_ds = load_boston()

print(boston_ds.data)

# [[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
#  [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
#  [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
#  ...
#  [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
#  [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
#  [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]

Retrieving the data for a single feature

To extract the data of a particular feature for all records, index the array in the form X[:, n].

from sklearn.datasets import load_boston

boston_ds = load_boston()

features = boston_ds.feature_names
X = boston_ds.data
n_feature = 10

feature = X[:, n_feature]

print("feature name : {}".format(features[n_feature]))
print("feature data :\n{}".format(feature))

# feature name : PTRATIO
# feature data :
# [15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 15.2 15.2 15.2 21.
#  21.  21.  21.  21.  21.  21.  21.  21.  21.  21.  21.  21.  21.  21.
#  21.  21.  21.  21.  21.  21.  21.  19.2 19.2 19.2 19.2 18.3 18.3 17.9
#  ...
#  20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.1 20.1
#  20.1 20.1 20.1 19.2 19.2 19.2 19.2 19.2 19.2 19.2 19.2 21.  21.  21.
#  21.  21. ]

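wave dataset: knn

2020-03-22 / tau

Overview

As an example of k-nearest-neighbors regression, these are the results of applying KNeighborsRegressor from scikit-learn to the wave data.

Number of neighbors and behavior of the prediction

We give 10 wave samples as training data and look at how the predictions for two test points change as the number of neighbors is set to 1, 2, and 3.

Number of neighbors = 1

For each of the two test points, the training point whose feature value is closest is selected, and its target value is used directly as the predicted value for the test point.

Number of neighbors = 2

The training points with the nearest and second-nearest feature values are selected, and the mean of their target values becomes the predicted value for the test point.

Number of neighbors = 3

Likewise, the mean of the target values of the three training points nearest to the test point becomes the predicted value.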
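
The averaging rule described for 1, 2, and 3 neighbors can be written out by hand and compared with KNeighborsRegressor. This is a minimal sketch: the training values below are hypothetical and are not the wave dataset, and knn_predict is an illustrative helper name.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D training data, not the wave dataset
X_train = np.array([[-2.5], [-1.8], [-1.1], [-0.4], [0.3], [0.9], [1.6], [2.4]])
y_train = np.array([-1.2, -0.9, -0.7, 0.1, 0.4, 0.8, 1.1, 0.6])
X_test = np.array([[-1.0], [1.0]])
k = 3


def knn_predict(X_train, y_train, X_test, k):
    """Average the targets of the k training points nearest to each test point."""
    preds = []
    for x in X_test:
        dist = np.abs(X_train[:, 0] - x[0])
        nearest = np.argsort(dist)[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)


y_manual = knn_predict(X_train, y_train, X_test, k)
y_sklearn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train).predict(X_test)

print(y_manual)
print(y_sklearn)
```

With the default uniform weights, KNeighborsRegressor takes the plain mean of the neighbors' targets, so the two results should agree as long as there are no ties in distance.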

Code

The code for the calculation above is as follows.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from mglearn.datasets import make_wave

X_train, y_train = make_wave(n_samples=10)

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)

X_test = np.array([[-1], [1]])
y_pred = reg.predict(X_test)

neigh_dist, neigh_ind = reg.kneighbors(X=X_test)
print(neigh_ind)

fig, ax = plt.subplots(figsize=(8.0, 4.8))

xmin, xmax = -3, 3
ymin, ymax = -3, 3

ax.scatter(X_train, y_train, marker='o', s=20)
ax.scatter(X_test, y_pred, marker='*', s=120)

for test, pred, ind in zip(X_test, y_pred, neigh_ind):
    for neigh in ind:
        ax.plot([test, test], [ymin, ymax], c='gray', linestyle='dashed')
        ax.plot(
            [test[0], X_train[neigh, 0]], [pred, y_train[neigh]],
            color='k', linestyle='dotted')

for x, y in zip(X_train, y_train):
    ax.annotate("{:6.3f}".format(y), xy=(x[0] - 0.1, y + 0.08))
for x, y in zip(X_test, y_pred):
    ax.annotate("{:6.3f}".format(y), xy=(x[0] - 0.2, y - 0.3))

ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)

ax.set_xlabel("feature")
ax.set_ylabel("prediction")

plt.show()

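Accuracy of knn

O'Reilly's "Pythonではじめる機械学習" computes the accuracy of KNeighborsRegressor on the wave data: 40 wave samples are generated and split into training and test data, and the R² score on the test data is shown to be 0.83. Running the calculation does give the same value.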

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from mglearn.datasets import make_wave

X_source, y_source = make_wave(n_samples=40)

X_train, X_test, y_train, y_test =\
    train_test_split(X_source, y_source, random_state=0)

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("R^2 score:{:6.3f}".format(reg.score(X_test, y_test)))

# R^2 score: 0.834

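This looks like fairly high accuracy, but varying the random_state argument of train_test_split() makes the score fluctuate as shown below. Some splits give a score below 0.3, while overall the score appears to be around 0.6 to 0.7.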

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from mglearn.datasets import make_wave

X_source, y_source = make_wave(n_samples=40)

reg = KNeighborsRegressor(n_neighbors=3)

print("random_state -> R^2")

for random_state in range(0, 10):
    X_train, X_test, y_train, y_test =\
        train_test_split(X_source, y_source, random_state=random_state)
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)

    print("{} -> {:6.3f}".format(random_state, reg.score(X_test, y_test)))

# random_state -> R^2
# 0 ->  0.834
# 1 ->  0.581
# 2 ->  0.798
# 3 ->  0.281
# 4 ->  0.773
# 5 ->  0.738
# 6 ->  0.554
# 7 ->  0.494
# 8 ->  0.678
# 9 ->  0.801
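
The spread of these ten scores can be summarized with the standard library. The list below is copied from the output above; nothing is recomputed from the wave data.

```python
from statistics import mean, stdev

# R^2 scores copied from the output above (random_state = 0..9).
scores = [0.834, 0.581, 0.798, 0.281, 0.773, 0.738, 0.554, 0.494, 0.678, 0.801]

print("mean : {:.3f}".format(mean(scores)))
print("stdev: {:.3f}".format(stdev(scores)))
print("min  : {:.3f}, max: {:.3f}".format(min(scores), max(scores)))
```

The mean comes out at about 0.65, consistent with the rough 0.6 to 0.7 reading, while the gap between the minimum and maximum shows how much a single split can mislead.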

As a trial, setting make_wave(n_samples=1000) gives the results below; the score settles at about 0.67 (the mean is 0.677).

random_state -> R^2
0 ->  0.679
1 ->  0.662
2 ->  0.682
3 ->  0.672
4 ->  0.680
5 ->  0.697
6 ->  0.712
7 ->  0.682
8 ->  0.661
9 ->  0.641

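Prediction curves

With few training data

Let us look at the prediction curve for 40 wave samples as n_neighbors is varied.

With n_neighbors=1, the curve passes through every training point
The larger n_neighbors is, the smoother the curve becomes
When n_neighbors becomes quite large, the curve approaches a horizontal line
When n_neighbors equals the number of training points, the prediction is a horizontal line (for any feature value, the mean of all training targets is computed)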

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from mglearn.datasets import make_wave

X_train, y_train = make_wave(n_samples=40)
xmin = np.min(X_train[:, 0])
xmax = np.max(X_train[:, 0])
X_test = np.linspace(xmin, xmax, 200).reshape(-1, 1)

fig, axs = plt.subplots(2, 3, figsize=(11, 6.4))
plt.subplots_adjust(hspace=0.4, wspace=0.4)

n_neighbors_list=[1, 2, 8, 16, 32, 40]
axs_1d = axs.reshape(1, -1)[0]

for ax, n_neighbors in zip(axs_1d, n_neighbors_list):
    reg = KNeighborsRegressor(n_neighbors=n_neighbors)
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)

    ax.scatter(X_train[:, 0], y_train, zorder=2, s=20, color='tab:blue')
    ax.plot(X_test, y_pred, zorder=1, color='tab:orange')

    ax.set_title("n_neighbors={}".format(n_neighbors))
    ax.set_xlabel("feature")
    ax.set_ylabel("target")

plt.show()
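
The two extreme cases noted above, n_neighbors=1 and n_neighbors equal to the number of training points, can be checked on a tiny example. This is a from-scratch sketch on hypothetical data; `knn_predict` is an illustrative helper, not part of scikit-learn.

```python
def knn_predict(x_train, y_train, x_query, k):
    """Mean target of the k training points nearest to x_query (1-D feature)."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_query))
    return sum(y_train[i] for i in order[:k]) / k


# Hypothetical training data.
x_train = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_train = [-1.0, 0.5, 0.0, 1.5, 1.0]
n = len(x_train)

# k = 1: evaluated at the training points, the curve reproduces every target.
fit_k1 = [knn_predict(x_train, y_train, x, k=1) for x in x_train]
print(fit_k1)

# k = n: every query uses all points, so the prediction is the overall mean.
grid = [-3.0, -0.3, 0.7, 2.5]
flat = [knn_predict(x_train, y_train, x, k=n) for x in grid]
print(flat)
```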

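With many training data

Next, the number of wave samples is increased to n_samples=200. With more data, the targets can be seen to rise while undulating up and down, as the name suggests. The figure below shows the effect of varying n_neighbors on this data.

Around n_neighbors=10 to 20, the curve is smooth while still reproducing the undulation.

With n_samples=300, of which 200 are assigned to the training data, the scores for different values of n_neighbors are as follows. Accuracy appears to be best around n_neighbors=20.

When some data has been obtained and one simply wants to reproduce predictions from it, setting aside the scientific mechanism behind it, this method may be reasonably useful.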

n_neighbors -> R^2
5 ->  0.754
10 ->  0.788
15 ->  0.789
20 ->  0.792
25 ->  0.777
50 ->  0.737
100 ->  0.613
200 -> -0.022

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from mglearn.datasets import make_wave

X_source, y_source = make_wave(n_samples=300)

X_train, X_test, y_train, y_test =\
    train_test_split(X_source, y_source, train_size=200, random_state=0)

n_neighbors_list = [5, 10, 15, 20, 25, 50, 100, 200]

print("n_neighbors -> R^2")

for n_neighbors in n_neighbors_list:
    reg = KNeighborsRegressor(n_neighbors=n_neighbors)
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)

    print("{} -> {:6.3f}".format(n_neighbors, reg.score(X_test, y_test)))
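
The slightly negative score at n_neighbors=200 is what one would expect when every prediction is the training-set mean. R² is defined as 1 - SS_res / SS_tot, which is 0 for a constant prediction equal to the test-set mean and negative for any other constant. The sketch below illustrates this on hypothetical numbers; `r2_score` here is a hand-written helper, not the scikit-learn function.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets whose training and test means differ.
y_train = [0.0, 1.0, 2.0, 3.0]   # mean 1.5
y_test = [1.0, 2.0, 3.0]         # mean 2.0

# Predicting the training mean everywhere (what k = n_train amounts to).
train_mean = sum(y_train) / len(y_train)
r2_train_mean = r2_score(y_test, [train_mean] * len(y_test))

# Predicting the test mean everywhere gives exactly zero.
r2_test_mean = r2_score(y_test, [2.0] * len(y_test))

print(r2_train_mean, r2_test_mean)
```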

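forge dataset – knn

2020-03-22 / tau

Overview

This post applies the forge data from the mglearn package to the KNeighborsClassifier class of Python's scikit-learn package to check how knn behaves.

It looks at how classification behaves as the number of neighbors is varied, and at learning-rate curves.

Classification behavior by number of neighbors

Number of neighbors = 1

Using the forge data provided by mglearn, the classification of three test points with the number of neighbors set to 1 is shown below. For each test point, the single training point at the smallest distance (Euclidean distance here) is found, and that point's class is assigned to the test point.

Note that the forge scatter plots seen in various places are scaled to the minimum and maximum of feature 0 (horizontal axis) and feature 1 (vertical axis), so the two axes do not share the same scale. Here the vertical and horizontal axes use the same scale so that distances are not visually misleading.

For use in later calculations, the code that draws this graph is shown below.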

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from mglearn.datasets import make_forge

X, y = make_forge()

clfr = KNeighborsClassifier(n_neighbors=1)
clfr.fit(X, y)

col = ['blue', 'red']

test_points = [[9., 4.], [10., 3.], [11., 2.]]
nb_dist, nb_idx = clfr.kneighbors(test_points)
test_pred = clfr.predict(test_points)

fig, ax = plt.subplots()

ax.scatter(X[:, 0][y==0], X[:, 1][y==0], marker='o', c=col[0], label="class-0")
ax.scatter(X[:, 0][y==1], X[:, 1][y==1], marker='^', c=col[1], label="class-1")

ax.legend(loc="lower left")

for pts, cls, ids, dists in zip(test_points, test_pred, nb_idx, nb_dist):
    print(pts)
    ax.scatter(pts[0], pts[1], marker='*', s=150, c=col[cls])
    for id, dst in zip(ids, dists):
        ax.plot([pts[0], X[id, 0]], [pts[1], X[id, 1]], c='gray')
        print(" [{:7.4f}, {:7.4f}] - {:7.4f}".format(X[id, 0], X[id, 1], dst))

plt.show()

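An outline of the code:

Line 5 prepares the forge dataset
Line 7 builds the classifier with the number of neighbors set to 1
Line 8 passes the forge data as training data
Line 12 prepares three test points
Line 13 obtains the indices of each test point's neighbors and the distances to them
Line 14 determines the class of each test point
Lines 18-19 draw the scatter plot of the training data
Line 23 loops in parallel over the test points, their predicted classes, the indices of the neighbors used for the decision, and the distances to those neighbors
- Line 24 prints the coordinates of the test point
- Line 25 draws the test point
- The loop at line 26 processes the neighbors of each test point
  - Line 27 draws a straight line between the test point and the neighbor
  - Line 28 prints the neighbor and its distance from the test point

The output is as follows; one neighbor is determined for each test point.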

[9.0, 4.0]
 [ 8.6749,  4.4757] -  0.5762
[10.0, 3.0]
 [10.2403,  2.4554] -  0.5952
[11.0, 2.0]
 [11.5640,  1.3389] -  0.8689
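
The printed distances can be checked by hand as plain Euclidean distances. The coordinates below are copied from the output above; because they are rounded to four decimals, the recomputed values are only expected to agree to roughly three decimals.

```python
import math

# (test point, nearest training point, distance reported above)
rows = [
    ((9.0, 4.0), (8.6749, 4.4757), 0.5762),
    ((10.0, 3.0), (10.2403, 2.4554), 0.5952),
    ((11.0, 2.0), (11.5640, 1.3389), 0.8689),
]

recomputed = []
for test, neighbor, reported in rows:
    # Euclidean distance between the two 2-D points.
    dist = math.dist(test, neighbor)
    recomputed.append(dist)
    print("{} -> {:.4f} (reported {:.4f})".format(test, dist, reported))
```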

Number of neighbors = 3

Line 7 of the code above is changed to build the classifier with the number of neighbors set to 3.

clfr = KNeighborsClassifier(n_neighbors=3)

In knn in general, when several neighbors are used for a test point, the most frequent class among the neighbors is assigned to the test point (majority vote).

[9.0, 4.0]
 [ 8.6749,  4.4757] -  0.5762
 [ 9.4912,  4.3322] -  0.5930
 [ 8.1062,  4.2870] -  0.9387
[10.0, 3.0]
 [10.2403,  2.4554] -  0.5952
 [ 9.5017,  1.9382] -  1.1729
 [ 8.7337,  2.4916] -  1.3645
[11.0, 2.0]
 [11.5640,  1.3389] -  0.8689
 [10.2403,  2.4554] -  0.8858
 [10.0639,  0.9908] -  1.3765
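
The majority vote itself is a one-liner with `collections.Counter`. The output above does not list the classes of the neighbors, so the labels in this sketch are hypothetical, and `majority_class` is an illustrative helper rather than anything from scikit-learn.

```python
from collections import Counter


def majority_class(neighbor_labels):
    """Return the most frequent class label among the neighbors."""
    label, _count = Counter(neighbor_labels).most_common(1)[0]
    return label


# Hypothetical class labels of the three neighbors of two test points.
print(majority_class([0, 0, 1]))
print(majority_class([1, 0, 1]))
```

With three neighbors and two classes a tie cannot occur, which is why odd neighbor counts are convenient for binary classification.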

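Number of neighbors = 2

If the class of a test point is decided by majority vote among its neighbors, the handling of an even number of neighbors becomes an issue. With KNeighborsClassifier, when an even number of neighbors is evenly split, the test point seems to be assigned to the class with the smallest class number. Indeed, with n_neighbors=2, for the middle of the three test points, (10.0, 3.0), the red point (10.2403, 2.4554) of class-1 at distance 0.5952 is closer than the blue point (9.5017, 1.9382) of class-0 at distance 1.1729, yet the test point is assigned the class of the blue point, whose class number is 0.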

[9.0, 4.0]
 [ 8.6749,  4.4757] -  0.5762
 [ 9.4912,  4.3322] -  0.5930
[10.0, 3.0]
 [10.2403,  2.4554] -  0.5952
 [ 9.5017,  1.9382] -  1.1729
[11.0, 2.0]
 [11.5640,  1.3389] -  0.8689
 [10.2403,  2.4554] -  0.8858

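When an even number of neighbors ties in the vote, one could instead decide by the class of the nearest point, or by the class with the smaller mean distance. Here, however, the class with the smaller number is always chosen, so the results may be slightly biased.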
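
The tie-breaking alternatives mentioned here can be compared on the tied test point (10.0, 3.0). The three functions below are hand-written sketches for illustration, not scikit-learn internals; the labels and distances are those quoted in the text above. scikit-learn's KNeighborsClassifier also accepts weights='distance', which weights each neighbor's vote by inverse distance and is similar in spirit to the third function.

```python
from collections import Counter, defaultdict


def vote_smallest_label(labels, dists):
    """Plain majority vote; ties go to the smallest class label."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(lab for lab, c in counts.items() if c == top)


def vote_nearest_on_tie(labels, dists):
    """Majority vote; ties go to the class of the nearest tied neighbor."""
    counts = Counter(labels)
    top = max(counts.values())
    tied = {lab for lab, c in counts.items() if c == top}
    # Walk neighbors from nearest to farthest, return the first tied class.
    for _dist, lab in sorted(zip(dists, labels)):
        if lab in tied:
            return lab


def vote_inverse_distance(labels, dists):
    """Each neighbor votes with weight 1/distance (assumes distances > 0)."""
    weights = defaultdict(float)
    for lab, dist in zip(labels, dists):
        weights[lab] += 1.0 / dist
    return max(weights, key=weights.get)


# The two neighbors of the test point (10.0, 3.0) reported above.
labels = [1, 0]
dists = [0.5952, 1.1729]

for vote in (vote_smallest_label, vote_nearest_on_tie, vote_inverse_distance):
    print(vote.__name__, "->", vote(labels, dists))
```

Only the first rule assigns class-0 to this point; the two distance-aware rules pick class-1, the class of the closer neighbor.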

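Decision boundary

This section checks how the decision boundary changes with the number of neighbors. The k-nearest neighbors method uses scikit-learn's KNeighborsClassifier class.

The change in the decision boundary as the number of neighbors is varied over 1, 2, 3, … is as follows.

With few neighbors the decision boundary becomes complex so as to fit the training data, while with many neighbors it becomes smooth. In particular, when the number of neighbors equals the number of training points, the class is decided by a majority vote over all training data and the result is the same over the whole region (here the number of neighbors, 26, is even, so the smaller class number, class-0, is chosen).

The code that draws this figure is shown below.

Line 7 defines a function that draws the decision boundary on the Axes passed as an argument
- Line 18 draws the decision boundary using contourf()
Line 21 defines a function that draws a scatter plot colored by class on the Axes passed as an argument
Line 54 treats the 2-D array of Axes as a 1-D array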

import numpy as np
import matplotlib.pyplot as plt
from mglearn.datasets import make_forge
from sklearn.neighbors import KNeighborsClassifier


def draw_decision_boundary(ax, n_neighbors, X, y, X0_field, X1_field):
    clsfr = KNeighborsClassifier(n_neighbors=n_neighbors)

    clsfr.fit(X, y)

    y_predicted = np.empty((len(X1_field), len(X0_field)))

    for row, x1 in enumerate(X1_field):
        for col, x0 in enumerate(X0_field):
            y_predicted[row, col] = clsfr.predict(np.array([[x0, x1]]))[0]

    ax.contourf(X0_field, X1_field, y_predicted, levels=1, alpha=0.5)


def draw_scatter(ax, X0, X1, xlim, ylim):
    ax.scatter(X0[y==0], X1[y==0], marker='o', s=40, label="class-0")
    ax.scatter(X0[y==1], X1[y==1], marker='^', s=40, label="class-1")

    ax.set_xlim(xlim[0], xlim[1])
    ax.set_ylim(ylim[0], ylim[1])

    ax.set_xlabel("feature 0")
    ax.set_ylabel("feature 1")

    ax.tick_params(labelbottom=False, labelleft=False)
    ax.tick_params(bottom=False, left=False)

    ax.legend(loc='lower right')


X, y = make_forge()

X0_scatter = X[:, 0]
X1_scatter = X[:, 1]

n_X0_field, n_X1_field = 20, 20
y_predicted = np.empty((n_X1_field, n_X0_field))

xlim = (7.5, 12.5)
ylim = (-1.5, 6.5)
X0_field = np.linspace(xlim[0], xlim[1], n_X0_field)
X1_field = np.linspace(ylim[0], ylim[1], n_X1_field)

fig, axs = plt.subplots(2, 3, figsize=(9.6, 6.4))
fig.subplots_adjust(hspace= 0.4)

n_neighbors_list = [1, 2, 3, 24, 25, 26]
axs_1d = axs.reshape(1, -1)[0]

for n_neighbors, ax in zip(n_neighbors_list, axs_1d):
    ax.set_title("neighbors={}".format(n_neighbors))
    draw_decision_boundary(ax, n_neighbors, X, y, X0_field, X1_field)
    draw_scatter(ax, X0_scatter, X1_scatter, xlim, ylim)

plt.show()

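k-nearest neighbors – regression

2020-03-22 / tau

Overview

Regression by the k-nearest neighbors method (knn) determines the target value of a test point from the training data near it. The method is simple: there is no real training step. The features and target values of the training set are merely stored, and when a test point is given, its target value is determined from its neighbors. The procedure is as follows.

Import the package
Store the dataset of features and target values
When a test point is given, select its neighbors in feature space
Determine the target value of the test point from the target values of the neighbors

The parameter is the number of neighbors, which can be any value from 1 up to the number of training points.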

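Usage

Procedure

scikit-learn's KNeighborsRegressor class is used as follows.

Import KNeighborsRegressor from sklearn.neighbors
Create a KNeighborsRegressor instance, passing the number of neighbors n_neighbors to the constructor
Train by passing the features and target values of the training data to the fit() method
Predict target values by passing the features of the test data to the predict() method
If needed, obtain neighbor information for the test data with the kneighbors() method

Importing the package

The k-nearest neighbors regressor is imported as follows.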

from sklearn.neighbors import KNeighborsRegressor

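Constructor

KNeighborsRegressor(n_neighbors=n): n is the number of neighbors; the default is 5. Other arguments include the algorithm used to find the neighbors.

Training

The fit() method takes two pieces of training data: the feature set and the target values.

fit(X, y): X is the feature data of the training set, a 2-D array of shape (number of samples × number of features). y is the target data of the training set, with as many elements as there are samples.

Prediction

To predict the target values of test data, pass the features of the test data to the predict() method.

y = predict(X): X is the feature data of the test set, a 2-D array of shape (number of samples × number of features). The return value y holds the predicted target values, with as many elements as there are samples.

Neighbor information

Information about the neighbors of the test data can be obtained with the kneighbors() method.

neigh_dist, neigh_ind = kneighbors(X): takes the test features X and returns information about the neighbors. neigh_dist holds the distance from each test point to each of its neighbors, and neigh_ind holds the index of each neighbor of each test point. Both are 2-D arrays of shape (number of test points × number of neighbors).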

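Example

In the example below, a KNeighborsRegressor instance is prepared with n_neighbors=2.

The fit() method is then given training data with two features and one target value per point. The table below lists five points; the first listing passes only the first four of them (the plotting code further down uses all five), which does not change the predictions because the fifth point (3, 3) is far from both test points. The feature data X_train is expected to be a 2-D array whose rows are samples and whose columns are features, and the target values y_train form a 1-D array with as many elements as there are training points.

Feature 1	Feature 2	Target
-2	-3	-1
-1	-1	0
0	1	1
1	2	2
3	3	3

For these training data, let us look at the output when the two test points (-0.5, -2) and (1, 0) are given as the test features X_test.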

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array(
    [[-2, -3],
     [-1, -1],
     [0, 1],
     [1, 2]])
y_train = np.array([-1, 0, 1, 2])

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

X_test = np.array([[-0.5, -2], [1, 0]])
y_pred = reg.predict(X_test)

neigh_dist, neigh_ind = reg.kneighbors(X=X_test)

print("X_train=\n{}".format(X_train))
print("y_train={}".format(y_train))
print("X_test=\n{}".format(X_test))
print("y_pred={}".format(y_pred))
print("neighbors' distance=\n{}".format(neigh_dist))
print("neighbors' indicies=\n{}".format(neigh_ind))

The output of this code is as follows.

X_train=
[[-2 -3]
 [-1 -1]
 [ 0  1]
 [ 1  2]]
y_train=[-1  0  1  2]
X_test=
[[-0.5 -2. ]
 [ 1.   0. ]]
y_pred=[-0.5  1.5]
neighbors' distance=
[[1.11803399 1.80277564]
 [1.41421356 2.        ]]
neighbors' indicies=
[[1 0]
 [2 3]]

For the predicted target values, the two values -0.5 and 1.5 are returned for the two test points.

From the return values of kneighbors(), the first test point has the two neighbors with indices 1 and 0 at distances 1.118 and 1.803, and the second test point has the neighbors with indices 2 and 3 at distances 1.414 and 2.0.

Distances from the first test point (-0.5, -2)
- X_train[1]=(-1, -1) → $\sqrt{(-0.5)^2+1^2}\approx 1.118$
- X_train[0]=(-2, -3) → $\sqrt{(-1.5)^2+(-1)^2}\approx 1.803$
Distances from the second test point (1, 0)
- X_train[2]=(0, 1) → $\sqrt{(-1)^2+1^2}\approx 1.414$
- X_train[3]=(1, 2) → $\sqrt{0^2+2^2}=2$

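y_pred is the mean of the target values of the two neighbors of each test point.

Target value of the first test point
- mean of y_train[1]=0 and y_train[0]=-1 → -0.5
Target value of the second test point
- mean of y_train[2]=1 and y_train[3]=2 → 1.5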
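
The hand calculation above can be reproduced without scikit-learn. The following is a minimal sketch using only the standard library; `knn_regress` is a hypothetical helper written for this check, and the data are the five training points and two test points of this example.

```python
import math


def knn_regress(X_train, y_train, x, k):
    """Return (prediction, neighbor indices, neighbor distances) for one point."""
    dists = [math.dist(x, p) for p in X_train]
    # Indices of the k training points closest to x, nearest first.
    order = sorted(range(len(X_train)), key=lambda i: dists[i])[:k]
    pred = sum(y_train[i] for i in order) / k
    return pred, order, [dists[i] for i in order]


X_train = [(-2, -3), (-1, -1), (0, 1), (1, 2), (3, 3)]
y_train = [-1, 0, 1, 2, 3]
X_test = [(-0.5, -2), (1, 0)]

results = [knn_regress(X_train, y_train, x, k=2) for x in X_test]
for x, (pred, ind, dist) in zip(X_test, results):
    print(x, "pred =", pred, "indices =", ind,
          "distances =", [round(d, 3) for d in dist])
```

The predictions, indices and distances match the scikit-learn output shown earlier, which confirms that the regressor here is simply averaging the targets of the two nearest points.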

The figure below shows this on the feature plane. The number next to each point is the target value of that data point.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array(
    [[-2, -3],
     [-1, -1],
     [0, 1],
     [1, 2],
     [3, 3]])
y_train = np.array([-1, 0, 1, 2, 3])

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

X_test = np.array([[-0.5, -2], [1, 0]])
y_pred = reg.predict(X_test)

neigh_dist, neigh_ind = reg.kneighbors(X=X_test)

fig, ax = plt.subplots()

ax.scatter(X_train[:, 0], X_train[:, 1], label="X_train")
ax.scatter(X_test[:, 0], X_test[:, 1], marker='*', s=120, label="X_test")

for tests, ind in zip(X_test, neigh_ind):
    for neigh in ind:
        ax.plot(
            [tests[0], X_train[neigh][0]], [tests[1], X_train[neigh][1]],
            color='k', linestyle='dotted')

for x, y in zip(X_train, y_train):
    ax.annotate("{}".format(y), xy=(x[0], x[1]), xytext=(x[0]+0.1, x[1]+0.1))
for x, y in zip(X_test, y_pred):
    ax.annotate("{}".format(y), xy=(x[0], x[1]), xytext=(x[0]+0.1, x[1]+0.1))

ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.legend()

plt.show()

Application examples on various datasets

wave data
