scikit-learn – decision_function

2020-08-25 / tau

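Overview

For classifiers that separate classes with a hyperplane, decision_function() returns a confidence score for each sample being predicted.

For binary classification the result is a 1-D array of shape (n_samples,); for multiclass classification it is a 2-D array of shape (n_samples, n_classes). In the binary case, the sign of each value corresponds to one of the two classes.

Models that provide decision_function() include LogisticRegression, SVC, and GradientBoostingClassifier; RandomForestClassifier does not have this method.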
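The shapes described above can be checked with a short sketch. It assumes LogisticRegression and the iris data purely for illustration (neither choice comes from this article); the binary case is built by keeping only the first two iris classes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Binary problem: keep only classes 0 and 1 (the first 100 samples).
binary = y < 2
clf2 = LogisticRegression(max_iter=1000).fit(X[binary], y[binary])
binary_scores = clf2.decision_function(X[binary])

# Multiclass problem: all three classes.
clf3 = LogisticRegression(max_iter=1000).fit(X, y)
multi_scores = clf3.decision_function(X)

print(binary_scores.shape)  # (100,)
print(multi_scores.shape)   # (150, 3)

# RandomForestClassifier exposes predict_proba() but no decision_function().
print(hasattr(RandomForestClassifier(), "decision_function"))  # False
```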

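Behavior of `decision_function()`

Here the behavior of decision_function() is checked with GradientBoostingClassifier.

First, make_circles() generates two-class data, and the labels are redefined so that the outer class 0 becomes "blue" and the inner class 1 becomes "orange".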

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "orange"])[y]

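Next, the data is split into training and test sets, and the model is trained on the training set. In addition to X and y, the string-valued y_named is also split. Training uses only X and y_named, so y does not strictly need to be included; it is passed here to show that train_test_split() can split three or more arrays at once.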

X_train, X_test, y_train, y_test, y_train_named, y_test_named = \
    train_test_split(X, y, y_named, random_state=0)

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train_named)
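As a small check of the multi-array split, the sketch below repeats the call and confirms that six arrays come back and that their rows stay aligned. The shortened variable names (X_tr, named_tr, and so on) are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "orange"])[y]

# Passing three arrays returns a train/test pair for each of them.
parts = train_test_split(X, y, y_named, random_state=0)
X_tr, X_te, y_tr, y_te, named_tr, named_te = parts

print(len(parts))              # 6
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# The same shuffle is applied to every array, so labels still match.
print(np.array_equal(named_tr, np.array(["blue", "orange"])[y_tr]))  # True
```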

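After training, the classifier's classes_ attribute shows how the classes are represented. Because y_train_named was passed to fit() above, the classes are strings; if y_train is used instead, the classes are returned as [0, 1], matching the original y.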

print(gbc.classes_)

# ['blue' 'orange']
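The difference between the two label representations can be seen side by side in a minimal sketch. For brevity it fits on the full data without a train/test split, and the names gbc_int and gbc_str are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "orange"])[y]

# Same features, two label representations.
gbc_int = GradientBoostingClassifier(random_state=0).fit(X, y)
gbc_str = GradientBoostingClassifier(random_state=0).fit(X, y_named)

print(gbc_int.classes_)  # [0 1]
print(gbc_str.classes_)  # ['blue' 'orange']
```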

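Next, the test data is passed to the trained model, and the results of decision_function() and predict() are placed side by side. decision_function() returns a 1-D array with one element per test sample; predict() gives class 0 (blue) for negative values and class 1 (orange) for positive values.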

from pandas import DataFrame

data = DataFrame()
data["decision_value"] = gbc.decision_function(X_test)
data["prediction"] = gbc.predict(X_test)
print(data)

#     decision_value prediction
# 0         4.135926     orange
# 1        -1.701699       blue
# 2        -3.951061       blue
# 3        -3.626096       blue
# 4         4.289866     orange
# 5         3.661661     orange
# 6        -7.690972       blue
# 7         4.110017     orange
# 8         1.107539     orange
# 9         3.407822     orange
# 10       -6.462560       blue
# 11        4.289866     orange
# 12        3.901563     orange
# 13       -1.200312       blue
# 14        3.661661     orange
# 15       -4.172312       blue
# 16       -1.230101       blue
# 17       -3.915762       blue
# 18        4.036028     orange
# 19        4.110017     orange
# 20        4.110017     orange
# 21        0.657090     orange
# 22        2.698263     orange
# 23       -2.656733       blue
# 24       -1.867766       blue

To obtain the same result as predict() from the sign of each decision_function() value, the values can be processed step by step as follows.

print(gbc.decision_function(X_test) > 0)
# [ True False False False  True  True False  True  True  True False  True
#   True False  True False False False  True  True  True  True  True False
#  False]

print((gbc.decision_function(X_test) > 0).astype(int))
# [1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 0 0 0 1 1 1 1 1 0 0]

print(gbc.classes_[(gbc.decision_function(X_test) > 0).astype(int)])
# ['orange' 'blue' 'blue' 'blue' 'orange' 'orange' 'blue' 'orange' 'orange'
#  'orange' 'blue' 'orange' 'orange' 'blue' 'orange' 'blue' 'blue' 'blue'
#  'orange' 'orange' 'orange' 'orange' 'orange' 'blue' 'blue']

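Finally, the result above and the correct labels y_test_named are added to the earlier data frame to view everything together. The predict() result and the sign-based decision from decision_function() are identical, and some rows differ from y_test.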

data["decision"] = \
    gbc.classes_[(gbc.decision_function(X_test) > 0).astype(int)]
data["y_test"] = y_test_named

print(data)

#     decision_value prediction decision  y_test
# 0         4.135926     orange   orange  orange
# 1        -1.701699       blue     blue    blue
# 2        -3.951061       blue     blue    blue
# 3        -3.626096       blue     blue    blue
# 4         4.289866     orange   orange  orange
# 5         3.661661     orange   orange  orange
# 6        -7.690972       blue     blue    blue
# 7         4.110017     orange   orange  orange
# 8         1.107539     orange   orange  orange
# 9         3.407822     orange   orange  orange
# 10       -6.462560       blue     blue    blue
# 11        4.289866     orange   orange  orange
# 12        3.901563     orange   orange  orange
# 13       -1.200312       blue     blue  orange
# 14        3.661661     orange   orange  orange
# 15       -4.172312       blue     blue  orange
# 16       -1.230101       blue     blue  orange
# 17       -3.915762       blue     blue    blue
# 18        4.036028     orange   orange  orange
# 19        4.110017     orange   orange  orange
# 20        4.110017     orange   orange    blue
# 21        0.657090     orange   orange  orange
# 22        2.698263     orange   orange  orange
# 23       -2.656733       blue     blue    blue
# 24       -1.867766       blue     blue    blue

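Meaning of `decision_function()`

The value of decision_function() is a height relative to the separating hyperplane, but it varies with the data and the model parameters, so its scale is hard to interpret. This is also suggested by the nonlinear relationship between the predicted probability from predict_proba() and the confidence score computed by decision_function().

The following code plots the decision boundary of GradientBoostingClassifier on the circles data together with the distribution of decision_function() values. The contours are intricate and hard to read; intuitively, predict_proba() is easier to understand.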
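One concrete form of that nonlinearity can be checked directly. The sketch below assumes a binary GradientBoostingClassifier with its default loss, for which the class-1 probability is the logistic sigmoid of the raw score. Other models or losses may map scores to probabilities differently, so this is an illustration rather than a general rule.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

scores = gbc.decision_function(X_test)
proba_class1 = gbc.predict_proba(X_test)[:, 1]

# Logistic sigmoid of the raw scores.
sigmoid_scores = 1.0 / (1.0 + np.exp(-scores))

print(np.allclose(sigmoid_scores, proba_class1))  # True
```

Because the sigmoid saturates, a score of 4 and a score of 8 both map to probabilities close to 1, which is one reason the raw scale is hard to read.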

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

f0min, f0max = -1.5, 1.5
f1min, f1max = -1.75, 1.5

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)

X_train, X_test, y_train, y_test =\
    train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(random_state=0)

gb.fit(X_train, y_train)

print("X_test.shape           : {}".format(X_test.shape))
print("Decision function shape: {}".format(
    gb.decision_function(X_test).shape))

fig, axs = plt.subplots(1, 2, figsize=(10, 4.8))
color0, color1 = 'tab:blue', 'tab:orange'

f0 = np.linspace(f0min, f0max, 200)
f1 = np.linspace(f1min, f1max, 200)
f0, f1 = np.meshgrid(f0, f1)
F = np.c_[f0.reshape(-1, 1), f1.reshape(-1, 1)]

pred = gb.predict(F).reshape(f0.shape)
axs[0].contour(f0, f1, pred, levels=[0.5])
axs[0].contourf(f0, f1, pred, levels=1, colors=[color0, color1], alpha=0.25)

decision = gb.decision_function(F).reshape(f0.shape)
axs[1].contourf(f0, f1, decision, alpha=0.5, cmap='bwr')

for ax in axs:
    ax.scatter(X_train[y_train==0][:, 0], X_train[y_train==0][:, 1], marker='o', fc=color0, ec='k', label="Train class 0")
    ax.scatter(X_test[y_test==0][:, 0], X_test[y_test==0][:, 1], marker='^', fc=color0, ec='k', label="Test class 0")
    ax.scatter(X_train[y_train==1][:, 0], X_train[y_train==1][:, 1], marker='o', fc=color1, ec='k', label="Train class 1")
    ax.scatter(X_test[y_test==1][:, 0], X_test[y_test==1][:, 1], marker='^', fc=color1, ec='k', label="Test class 1")

    ax.set_xlim(f0min, f0max)
    ax.set_ylim(f1min, f1max)
    ax.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)

handles, labels = axs[0].get_legend_handles_labels()

fig.legend(handles, labels, ncol=4, loc='upper center')
plt.show()

Three or more classes

Here GradientBoostingClassifier is applied to the three-class iris dataset to look at the output of decision_function().

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from pandas import DataFrame

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

gbc = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbc.fit(X_train, y_train)

dec_func = gbc.decision_function(X_test)
print(dec_func.shape)
df = DataFrame(dec_func, columns=iris.target_names)

df["decision"] = np.argmax(dec_func, axis=1)
df["prediction"] = gbc.predict(X_test)
print(df)

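The output of this code is shown below. In the two-class case the result was a 1-D array, but with three classes it becomes an array whose rows and columns are the number of samples and the number of classes. Since predict_proba() returns a two-column array even for two classes, the 1-D result is specific to decision_function() in the two-class case.

Note that the line `df["decision"] = np.argmax(dec_func, axis=1)` uses argmax to find, for each sample, the column with the largest value, and shows that column index as the class number in "decision".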

(38, 3)
      setosa  versicolor  virginica  decision  prediction
0  -1.995715    0.047583  -1.927207         1           1
1   0.061464   -1.907557  -1.927938         0           0
2  -1.990582   -1.876379   0.096867         2           2
3  -1.995715    0.047583  -1.927207         1           1
4  -1.997302   -0.134691  -1.203415         1           1
.....
33  0.061464   -1.907557  -1.927938         0           0
34  0.061464   -1.907557  -1.927938         0           0
35 -1.997122   -1.876379   0.041660         2           2
36 -1.995715    0.047583  -1.927207         1           1
37  0.061464   -1.907557  -1.927938         0           0
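The two points above (argmax over columns, and the 1-D special case for two classes) can be sketched with string labels as well. Using iris.target_names as labels and restricting to the first two classes for the binary model are assumptions made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
y_named = iris.target_names[iris.target]  # string labels instead of 0/1/2

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, y_named, random_state=42)

gbc = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbc.fit(X_train, y_train)

# Column index of the largest score, mapped back through classes_.
dec_func = gbc.decision_function(X_test)
decision = gbc.classes_[np.argmax(dec_func, axis=1)]
print(dec_func.shape)                                 # (38, 3)
print(np.array_equal(decision, gbc.predict(X_test)))  # True

# Two-class model: decision_function() is 1-D, predict_proba() is 2-D.
binary = iris.target < 2
gbc2 = GradientBoostingClassifier(random_state=0)
gbc2.fit(iris.data[binary], iris.target[binary])
print(gbc2.decision_function(iris.data[binary]).shape)  # (100,)
print(gbc2.predict_proba(iris.data[binary]).shape)      # (100, 2)
```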

scikit-learn – make_circles

2020-08-23 / tau

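Overview

sklearn.datasets.make_circles() generates data for classification. The samples of the two classes are distributed in concentric circles, and both the ratio between the two radii and the spread of the data can be specified.

Format of the returned data

Two arrays, X and y, are returned. X is a 2-D array whose columns are features and whose rows are records. The target y holds one integer class label per record, taking the value 0 or 1.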

from sklearn.datasets import make_circles

X, y = make_circles(n_samples=10, random_state=0)

print("X:\n{}".format(X))
print("y:{}".format(y))

# X:
# [[-0.80901699  0.58778525]
#  [-0.6472136  -0.4702282 ]
#  [ 0.30901699 -0.95105652]
#  [ 0.2472136  -0.76084521]
#  [ 0.30901699  0.95105652]
#  [ 0.2472136   0.76084521]
#  [-0.6472136   0.4702282 ]
#  [-0.80901699 -0.58778525]
#  [ 1.          0.        ]
#  [ 0.8         0.        ]]
# y:[0 1 0 1 0 1 1 0 0 1]

Parameters

sklearn.datasets.make_circles(n_samples=100, *, shuffle=True, noise=None, random_state=None, factor=0.8)

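n_samples: total number of samples; if odd, the inner circle gets one more sample. If given as a two-element tuple, the first element is the number of outer samples and the second the number of inner samples. Default is 100.
shuffle: whether to shuffle the data. If False, the samples are ordered with class 0 in the first part and class 1 in the second. Default is True.
noise: standard deviation of the Gaussian noise. Default is None, meaning no noise.
random_state: seed for the random number sequence used to generate the data. Default is None.
factor: scale factor of the inner circle relative to the outer circle. Default is 0.8.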
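The effect of these parameters can be checked numerically. The sketch below assumes a scikit-learn version that accepts a tuple for n_samples, and relies on noise being left at None so that the points lie exactly on the two circles.

```python
import numpy as np
from sklearn.datasets import make_circles

# 30 outer and 20 inner points; shuffle=False keeps class 0 first.
X, y = make_circles(n_samples=(30, 20), shuffle=False, factor=0.5)

print(np.bincount(y))              # [30 20]
print(y[:30].max(), y[30:].min())  # 0 1

# Without noise, the radius is 1 for the outer class and `factor` for the inner.
radius = np.linalg.norm(X, axis=1)
print(np.allclose(radius[y == 0], 1.0))  # True
print(np.allclose(radius[y == 1], 0.5))  # True
```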

Usage examples

The following example generates the default 100 samples with a scale factor of 0.5 and a noise of 0.15.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

X, y = make_circles(factor=0.5, noise=0.15)

X0 = X[y==0]
X1 = X[y==1]

fig, ax = plt.subplots()
ax.scatter(X0[:, 0], X0[:, 1])
ax.scatter(X1[:, 0], X1[:, 1])

plt.show()

The following example varies the amount of noise.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

noises = [0, 0.1, 0.2, 0.3]  # noise levels chosen for illustration

fig = plt.figure(figsize=(9.6,7.2))
axs = fig.subplots(2, 2)

for ax, noise in zip(axs.reshape(axs.size), noises):
    X, y = make_circles(noise=noise, factor=0.5, random_state=0)

    ax.scatter(X[y==0][:, 0], X[y==0][:, 1])
    ax.scatter(X[y==1][:, 0], X[y==1][:, 1])
    ax.set_title("noise={:3.1f}".format(noise))

fig.subplots_adjust(wspace=0.3, hspace=0.4)

plt.show()

The following example varies the scale factor.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

scale_factors = [0, 0.4, 0.6, 0.8]

fig = plt.figure(figsize=(9.6,7.2))
axs = fig.subplots(2, 2)

for ax, factor in zip(axs.reshape(axs.size), scale_factors):
    X, y = make_circles(noise=0.1, factor=factor, random_state=0)

    ax.scatter(X[y==0][:, 0], X[y==0][:, 1])
    ax.scatter(X[y==1][:, 0], X[y==1][:, 1])
    ax.set_title("factor={:3.1f}".format(factor))

fig.subplots_adjust(wspace=0.3, hspace=0.4)

plt.show()

DecisionTreeClassifier – Tree object, recursive display, and more

2020-05-31 / tau

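Overview

A stock of code written while trying out various things with DecisionTreeClassifier, scikit-learn's decision tree model.

Inspecting the contents of the Tree object

The tree_ attribute of a DecisionTreeClassifier object stores the structure of the decision tree built for the dataset. The following code inspects its contents.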

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

tree = treeclf.tree_

n_nodes = tree.node_count
children_left = tree.children_left
children_right = tree.children_right
features = tree.feature
thresholds = tree.threshold
value = tree.value

print("number of nodes: {}".format(n_nodes))
print("Chlidren(Left) : {}".format(children_left))
print("Chlidren(Right): {}".format(children_right))
print("Features       : {}".format(features))
print("Thresholds     : {}".format(np.round(thresholds, 3)))
print("Values:\n{}".format(value))

print()

for i in range(n_nodes):
    print("Node-{}".format(i), end="")
    print("(Feature[{:2d}]<{:6.3f}):"\
        .format(features[i], thresholds[i]), end="")
    print("LeftNode[{:2d}], RightNode[{:2d}]"\
        .format(children_left[i], children_right[i]))

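The Tree class holds the information for each node in 1-D arrays; to look up a child node, use its node number as the index into the corresponding array. The main attributes of the Tree class are as follows.

node_count: total number of nodes in the tree.
children_left, children_right: 1-D arrays holding the node number of the left / right child of each node.
feature: 1-D array holding the index of the feature used to split each node.
threshold: 1-D array holding the threshold used when splitting each node on the feature given by feature.
value: the number of samples of each class at each node. It is a 3-D array made of one 2-D array per node, each of which contains a single 1-D array with one entry per class.

The output of the code is as follows.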

number of nodes: 7
Chlidren(Left) : [ 1  2 -1 -1  5 -1 -1]
Chlidren(Right): [ 4  3 -1 -1  6 -1 -1]
Features       : [ 1  0 -2 -2  0 -2 -2]
Thresholds     : [ 0.072 -0.643 -2.    -2.     1.536 -2.    -2.   ]
Values:
[[[10. 10.]]

 [[ 1.  9.]]

 [[ 1.  0.]]

 [[ 0.  9.]]

 [[ 9.  1.]]

 [[ 9.  0.]]

 [[ 0.  1.]]]

Node-0(Feature[ 1]< 0.072):LeftNode[ 1], RightNode[ 4]
Node-1(Feature[ 0]<-0.643):LeftNode[ 2], RightNode[ 3]
Node-2(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]
Node-3(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]
Node-4(Feature[ 0]< 1.536):LeftNode[ 5], RightNode[ 6]
Node-5(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]
Node-6(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]

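Parent-child relationships can be traced as follows: element 0 of children_left and children_right shows that the left and right children of node 0 are nodes 1 and 4; likewise, the left and right children of node 1 are nodes 2 and 3, and so on.

value is a little confusing. This array stores the number of samples of each class at each node. In this case the outer array contains seven arrays, one per node; each of these is itself a 2-D array whose single element is the array holding the per-class counts. For example, the class-1 entry of node 3 is accessed as value[3, 0, 1].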
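The indexing described above can be tried with a short sketch. It only inspects shapes and indices, because the numbers stored in value may be per-class counts (as in the output above) or, in newer scikit-learn releases, per-class fractions; that version difference is an assumption to verify against the installed release.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)
tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_

# One (1, n_classes) block per node: shape is (n_nodes, 1, n_classes).
print(tree.value.shape)

# Children of the root node, read from element 0 of each array.
left, right = tree.children_left[0], tree.children_right[0]
print(left, right)

# value[node, 0, k] is the entry for class k at that node.
for node in (left, right):
    print(node, tree.value[node, 0, 0], tree.value[node, 0, 1])
```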

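Displaying the Tree on the console

This code was written to check the structure of a Tree object, as preparation for tasks such as drawing decision boundaries. It prints the structure of the decision tree to the console. Two recursive functions are defined, and the main part simply calls them after the tree has been trained.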

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

def print_node1(tree, i_node, n_level=0):
    print("{}{:2d}-feature:{:2d}"\
        .format("             " * n_level, i_node, tree.feature[i_node]))
    if tree.children_left[i_node] == -1:
        return
    print_node1(
        tree=tree, i_node=tree.children_left[i_node], n_level=n_level+1)
    print_node1(
        tree=tree, i_node=tree.children_right[i_node], n_level=n_level+1)

def print_node2(tree, i_node, n_level=0):
    if tree.children_left[i_node] == -1:
        print("{}{:2d}-feature:{:2d}"\
            .format("             " * n_level, i_node, tree.feature[i_node]))
        return
    print_node2(
        tree=tree, i_node=tree.children_left[i_node], n_level=n_level+1)
    print("{}{:2d}-feature:{:2d}"\
        .format("             " * n_level, i_node, tree.feature[i_node]))
    print_node2(
        tree=tree, i_node=tree.children_right[i_node], n_level=n_level+1)

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

tree = treeclf.tree_

print_node1(tree=tree, i_node=0)
print("-"*40)
print_node2(tree=tree, i_node=0)

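print_node1() prints the tree starting from the root node, indenting one step for each level down. To do so, it prints the parent node first and then calls itself recursively with the left and right child nodes.

The termination condition is that the node is a leaf, that is, a node with no children. A leaf has the following values; here the fact that the left child number is -1 is used.

Child node number: -1
Feature number: -2
Feature threshold: -2.0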
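These leaf markers can be confirmed on a fitted tree. The following is a minimal check using the same data as above; the boolean mask name leaf is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)
tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_

# A node is a leaf when it has no left child.
leaf = tree.children_left == -1

print(np.all(tree.children_right[leaf] == -1))  # True
print(np.all(tree.feature[leaf] == -2))         # True
print(np.all(tree.threshold[leaf] == -2.0))     # True

# Internal nodes always split on a real feature index.
print(np.all(tree.feature[~leaf] >= 0))         # True
```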

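print_node2() displays the decision tree as a branching tree. Moving from the left nodes to the right nodes corresponds to printing from top to bottom on the console. The procedure is:

If the node is a leaf, print its contents and return
If the node is not a leaf,
1. Call the function for the left child
2. Once it returns (all left descendants have been printed), print the node itself
3. Call the function for the right child
4. Once it returns (all right descendants have been printed), return

One argument holds the depth of the current node, and indenting by a number of spaces proportional to that depth expresses the structure of the tree.

The output is as follows.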

 0-feature: 1
              1-feature: 0
                           2-feature:-2
                           3-feature:-2
              4-feature: 0
                           5-feature:-2
                           6-feature:-2
----------------------------------------
                           2-feature:-2
              1-feature: 0
                           3-feature:-2
 0-feature: 1
                           5-feature:-2
              4-feature: 0
                           6-feature:-2
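The two node orders in the output above are a pre-order and an in-order traversal. The sketch below reproduces just the node sequences, using the child arrays copied from the output printed earlier in this article rather than a fitted tree; the function names preorder and inorder are introduced here for illustration.

```python
# Child arrays copied from the printed output earlier in this article.
children_left = [1, 2, -1, -1, 5, -1, -1]
children_right = [4, 3, -1, -1, 6, -1, -1]


def preorder(i_node=0):
    """Parent first, then the left and right subtrees (as in print_node1)."""
    if children_left[i_node] == -1:
        return [i_node]
    return ([i_node]
            + preorder(children_left[i_node])
            + preorder(children_right[i_node]))


def inorder(i_node=0):
    """Left subtree, then the parent, then the right subtree (as in print_node2)."""
    if children_left[i_node] == -1:
        return [i_node]
    return (inorder(children_left[i_node])
            + [i_node]
            + inorder(children_right[i_node]))


print(preorder())  # [0, 1, 2, 3, 4, 5, 6]
print(inorder())   # [2, 1, 3, 0, 5, 4, 6]
```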

Displaying the construction process of the decision tree

Code for drawing, step by step, how nodes are split for the two-feature data generated by make_moons().

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patch
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier


def draw_tree_boundary(tree, ax, left, right, bottom, top,
        i_node=0, stop_level=None, n_level=0):

    if tree.children_left[i_node] == -1 or stop_level == n_level:
        fc =\
            'tab:orange' if np.argmax(tree.value[i_node][0])==0 else 'tab:blue'
        rect = patch.Rectangle(xy=(left, bottom),
            width=right-left, height=top-bottom, fc=fc, alpha=0.2)
        ax.add_patch(rect)
        return

    if tree.feature[i_node] == 0:
        f0 = tree.threshold[i_node]
        ax.plot([f0, f0], [top, bottom])
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=f0, top=top, bottom=bottom,
            i_node=tree.children_left[i_node],
            stop_level=stop_level, n_level=n_level+1,)
        draw_tree_boundary(tree=tree, ax=ax,
            left=f0, right=right, top=top, bottom=bottom,
            i_node=tree.children_right[i_node],
            stop_level=stop_level, n_level=n_level+1)
    else:
        f1 = tree.threshold[i_node]
        ax.plot([left, right], [f1, f1])
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=right, top=f1, bottom=bottom,
            i_node=tree.children_left[i_node],
            stop_level=stop_level, n_level=n_level+1)
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=right, top=top, bottom=f1,
            i_node=tree.children_right[i_node],
            stop_level=stop_level, n_level=n_level+1)


X, y = make_moons(n_samples=20, noise=0.25, random_state=5)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

tree = treeclf.tree_

fig, ax = plt.subplots()

ax.scatter(X[y==0][:, 0], X[y==0][:, 1],
    ec='k', s=60, marker='o', label="Class 0")
ax.scatter(X[y==1][:, 0], X[y==1][:, 1],
    ec='k', s=60, marker='^', label="Class 1")

x0_min, x0_max = -2, 2.5
x1_min, x1_max = -1, 1.5

draw_tree_boundary(tree=tree, i_node=0, ax=ax,
    left=x0_min, right=x0_max, bottom=x1_min, top=x1_max, stop_level=None)

ax.set_xlim(x0_min, x0_max)
ax.set_ylim(x1_min, x1_max)
ax.legend()

plt.show()

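draw_tree_boundary() is a recursive function. If the node is a leaf or lies at the specified stop level, the function fills the region with the color for the node's class. Otherwise it draws the boundary line, choosing a vertical or horizontal line and its start and end positions according to whether the node splits on feature 0 or feature 1, and then calls itself recursively. Passing a positive integer as stop_level limits drawing to that depth.

The main part draws the data as a scatter plot colored by class and calls draw_tree_boundary() for the root node.

Figure: example output.

Figure: the regions being divided step by step as stop_level is increased.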
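The same recursion can be followed without plotting by collecting the rectangles instead of drawing them. The helper collect_regions below is hypothetical (it is not part of the original code) and mirrors the argument names of draw_tree_boundary().

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier


def collect_regions(tree, left, right, bottom, top,
                    i_node=0, stop_level=None, n_level=0):
    """Return (left, right, bottom, top, class_index) per terminal region."""
    if tree.children_left[i_node] == -1 or stop_level == n_level:
        cls = int(np.argmax(tree.value[i_node][0]))
        return [(left, right, bottom, top, cls)]

    thr = tree.threshold[i_node]
    kwargs = dict(tree=tree, stop_level=stop_level, n_level=n_level + 1)
    if tree.feature[i_node] == 0:
        # Split on feature 0: vertical line, left child takes the left part.
        return (collect_regions(left=left, right=thr, bottom=bottom, top=top,
                                i_node=tree.children_left[i_node], **kwargs)
                + collect_regions(left=thr, right=right, bottom=bottom, top=top,
                                  i_node=tree.children_right[i_node], **kwargs))
    # Split on feature 1: horizontal line, left child takes the lower part.
    return (collect_regions(left=left, right=right, bottom=bottom, top=thr,
                            i_node=tree.children_left[i_node], **kwargs)
            + collect_regions(left=left, right=right, bottom=thr, top=top,
                              i_node=tree.children_right[i_node], **kwargs))


X, y = make_moons(n_samples=20, noise=0.25, random_state=5)
tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_

regions = collect_regions(tree, left=-2, right=2.5, bottom=-1, top=1.5)
n_leaves = int(np.sum(tree.children_left == -1))
print(len(regions) == n_leaves)  # True

# The rectangles tile the plotting area exactly.
area = sum((r - l) * (t - b) for l, r, b, t, _ in regions)
print(np.isclose(area, 4.5 * 2.5))  # True

# Stopping at level 1 leaves only the two halves made by the root split.
print(len(collect_regions(tree, -2, 2.5, -1, 1.5, stop_level=1)))  # 2
```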

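Displaying the decision tree as a tree

An example of displaying a decision tree, using an environment that can visualize a DecisionTreeClassifier object.

Environment setup
1. Install the pydotplus package for Python
2. Set up Graphviz
Execution
1. Obtain the decision tree's dot data with sklearn.tree.export_graphviz()
2. Create a Dot object with pydotplus.graph_from_dot_data()
3. Write the graph out as an image with a method such as write_png()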

import numpy as np
import pydotplus as pdp
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=5)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

dot_data = export_graphviz(treeclf, max_depth=None,
    feature_names=["feature-0", "feature-1"],
    class_names=["class-0", "class-1"])
graph = pdp.graph_from_dot_data(dot_data)
graph.write_png("tree.png")

# C:...\atom\app-1.47.0

Because this code was run from within Atom, the image file is written to Atom's directory.
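As a rough sketch of how the output location could be controlled: export_graphviz() returns the DOT source as a string when out_file is None (passed explicitly here, in case the installed version defaults to writing a file), so the string can be inspected before rendering. The output path below is a hypothetical example and is not part of the original code.

```python
import os
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=5)
treeclf = DecisionTreeClassifier(random_state=0).fit(X, y)

# out_file=None makes export_graphviz return the DOT source as a string
dot_data = export_graphviz(treeclf, out_file=None,
    feature_names=["feature-0", "feature-1"],
    class_names=["class-0", "class-1"])

# Show the beginning of the DOT source
print(dot_data[:60])

# Hypothetical output location: building the path explicitly avoids
# depending on the editor's working directory
out_path = os.path.join(os.path.expanduser("~"), "tree.png")
print(out_path)

# With pydotplus and Graphviz installed, the image could then be written with:
# pydotplus.graph_from_dot_data(dot_data).write_png(out_path)
```

The rendering call is left as a comment so that the sketch runs without Graphviz installed.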

scikit-learn – make_moons

2020-05-24 / tau

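Overview

sklearn.datasets.make_moons() generates data for classification. It produces an upward arc and a downward arc that interlock, giving a dataset that cannot be separated by a simple straight line. The number of classes is always two.

Format of the returned data

Two arrays X and y are returned. X is a 2-D array with features as columns and records as rows. The target y holds an integer class label for each record.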

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10, random_state=0)

print("X:\n{}".format(X))
print("y:{}".format(y))

# X:
# [[ 6.12323400e-17  1.00000000e+00]
#  [ 1.70710678e+00 -2.07106781e-01]
#  [-1.00000000e+00  1.22464680e-16]
#  [ 2.00000000e+00  5.00000000e-01]
#  [ 7.07106781e-01  7.07106781e-01]
#  [ 2.92893219e-01 -2.07106781e-01]
#  [ 1.00000000e+00 -5.00000000e-01]
#  [-7.07106781e-01  7.07106781e-01]
#  [ 1.00000000e+00  0.00000000e+00]
#  [ 0.00000000e+00  5.00000000e-01]]
# y:[0 1 0 1 0 1 1 0 0 1]

Usage example

The following example varies the noise parameter.

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

noises = [0, 0.1, 0.2, 0.3]
fig, axs = plt.subplots(2, 2)
axs_1d = axs.reshape(axs.size)

for ax, noise in zip(axs_1d, noises):
    plt.subplots_adjust(hspace=0.4)
    X, y = make_moons(noise=noise, random_state=0)
    ax.scatter(X[y==0][:, 0], X[y==0][:, 1], s=5, marker='o')
    ax.scatter(X[y==1][:, 0], X[y==1][:, 1], s=5, marker='^')
    ax.set_title("noise={}".format(noise))

plt.show()

Parameters

sklearn.datasets.make_moons(n_samples=100, shuffle=True, noise=None, random_state=None)

n_samples: If given as a single number, the total number of data points; if given as a two-element tuple, the number of data points in each class. Default is 100.
shuffle: Whether to shuffle the data. Default is True.
noise: Standard deviation of the Gaussian noise added to the data. Default is no noise.
random_state: Seed for the random number generation used to create the data.
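A minimal sketch of the tuple form of n_samples described above, assuming a scikit-learn version that accepts a two-element tuple (to my knowledge 0.23 or later). The counts 30 and 70 are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons

# Two-element tuple: (points in class 0, points in class 1)
X, y = make_moons(n_samples=(30, 70), noise=0.1, random_state=0)

# X has one row per point and two feature columns
print(X.shape)

# Number of points per class
print(np.bincount(y))
```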

scikit-learn – make_blobs

2020-05-18 / tau

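Overview

sklearn.datasets.make_blobs() generates data for classification. A blob is something like an ink stain, and the name seems to come from the look of the points in a scatter plot.

In typical use, it is called with the total number of data points, the number of features, the number of clusters, and so on, and returns a tuple of the feature array X and the target class data y (depending on the arguments, one more return value is added).

Format of the returned data

The feature array X is a 2-D array with features as columns and records as rows. The target y holds an integer class label for each record.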

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10, random_state=0)

print(X)
print(y)

# [[ 1.12031365  5.75806083]
#  [ 1.7373078   4.42546234]
#  [ 2.36833522  0.04356792]
#  [ 0.87305123  4.71438583]
#  [-0.66246781  2.17571724]
#  [ 0.74285061  1.46351659]
#  [-4.07989383  3.57150086]
#  [ 3.54934659  0.6925054 ]
#  [ 2.49913075  1.23133799]
#  [ 1.9263585   4.15243012]]
# [0 0 1 0 2 2 2 1 1 0]

Usage examples

Use the data directly as input to a scikit-learn model.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)

print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))
# 1.0
# 0.96

Draw a scatter plot, changing the color and marker for each class.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30, centers=3, random_state=10)

markers = ['o', '^', 'v']
fig, ax = plt.subplots()

for cluster, marker in zip(range(3), markers):
    x = X[y==cluster]
    ax.scatter(x[:, 0], x[:, 1], marker=marker)

plt.show()

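Parameters

make_blobs(n_samples, n_features, centers, cluster_std,
           center_box, shuffle, random_state, return_centers)

The main parameters are as follows.

n_samples: If an integer, the total number of samples generated, which becomes the number of rows of the returned X. If an array, its length is the number of clusters and each element is the number of data points in that cluster. Default is 100.
n_features: The number of features, which becomes the number of columns of the returned X. Default is 2.
centers: The number of cluster centers, or the fixed center locations. If n_samples is an integer and centers is not specified (the default None), centers=3 is used. If n_samples is an array, centers must be None or an array of shape [n_centers, n_features].
cluster_std: Standard deviation of the clusters.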
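The array form of n_samples and the extra return value mentioned in the overview can be sketched as follows. This assumes a scikit-learn version that has the return_centers argument (to my knowledge 0.23 or later); the cluster sizes 10, 20 and 30 are arbitrary illustrative values.

```python
import numpy as np
from sklearn.datasets import make_blobs

# n_samples as a list: three clusters with 10, 20 and 30 points.
# return_centers=True adds the cluster centers as a third return value.
X, y, centers = make_blobs(n_samples=[10, 20, 30], n_features=2,
                           random_state=0, return_centers=True)

print(X.shape)         # rows = total number of points, columns = n_features
print(np.bincount(y))  # number of points per cluster
print(centers.shape)   # one row per cluster center
```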

scikit-learn – LogisticRegression

2020-05-17 / tau

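Overview

scikit-learn's LogisticRegression provides a logistic regression model. It is used in the steps below, almost the same way as LinearRegression and other linear models. However, the regularization parameter C given when creating the model instance works in the opposite direction to alpha in Ridge/Lasso: to strengthen regularization, make C smaller (a larger C weakens regularization, which raises accuracy on the training data but increases the risk of overfitting).

The regularization method can be chosen from L1 regularization, L2 regularization, and Elastic net.

Import the LogisticRegression class
Create a model instance, specifying the hyperparameter C, the regularization method, the solver (the optimization algorithm), and so on
Pass the training data to the fit() method to train the model

A trained model is used as follows.

Pass test data to the score() method to compute the score (mean accuracy)
Pass explanatory variables to the predict() method to predict the target
Use the model's parameters through the model instance's attributes
- The intercept is intercept_ and the weight coefficients are coef_ (note the trailing underscore)

Usage example

The following applies LogisticRegression to the breast_cancer dataset. The default solver is 'lbfgs', and it did not converge within the default maximum number of iterations (100), so max_iter=3000 is specified.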

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

ds = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, stratify=ds.target, random_state=42)

logreg = LogisticRegression(max_iter=3000).fit(X_train, y_train)
print("")
print("Training score: {}".format(logreg.score(X_train, y_train)))
print("Test score    : {}".format(logreg.score(X_test, y_test)))
print("Prediction")
for i in range(3):
    print("{} -> {}".format(y_test[i], logreg.predict(X_test[i].reshape(1, -1))))

# Training score: 0.9577464788732394
# Test score    : 0.958041958041958
# Prediction
# 1 -> [1]
# 0 -> [0]
# 1 -> [1]

How to use

The main usage of LogisticRegression is almost the same as that of LinearRegression, so the following focuses on the settings specific to it.

Importing the model class

Import the LogisticRegression class from the sklearn.linear_model module.

from sklearn.linear_model import LogisticRegression

Creating the model instance

In LogisticRegression, the hyperparameter C sets the strength of regularization. Unlike alpha in Ridge/Lasso, C must be made smaller to strengthen regularization. The default is C=1.0.

LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0,
                   fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None,
                   solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False,
                   n_jobs=None, l1_ratio=None)

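Only the parameters specific to LogisticRegression are described below. For parameters shared with LinearRegression, see LinearRegression.

penalty: Specifies the type of norm used in the regularization term: 'l1', 'l2', 'elasticnet', or 'none'. The solvers 'newton-cg', 'sag', and 'lbfgs' support only L2 regularization (or none), and 'elasticnet' is supported only by 'saga'. The default is 'l2'. With 'none', no regularization is applied ('liblinear' does not support 'none').
tol: Tolerance for the stopping criterion of the optimization. Default is 1e-4.
C: Inverse of the regularization strength. Must be a positive float. Default is 1.0.
solver: Chosen from 'newton-cg', 'lbfgs', 'liblinear', 'sag', and 'saga'. The default is 'lbfgs'. 'liblinear' suits small datasets, while 'sag' and 'saga' are faster on large ones. For multiclass problems, 'newton-cg', 'sag', 'saga', and 'lbfgs' handle the multinomial case, while 'liblinear' supports only one-vs-rest. The choice of solver also restricts which penalty can be used, as noted under penalty.
max_iter: Maximum number of iterations for the solver. Default is 100.
random_state: Random seed used when shuffling the data. Used when solver is 'sag', 'saga', or 'liblinear'.
l1_ratio: The Elastic-Net mixing parameter. A value in [0, 1], used only when penalty='elasticnet'.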
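To illustrate the direction in which C acts, the following sketch fits the model with three values of C and compares the size of the learned coefficients. The StandardScaler step and the particular C values are my own additions for the illustration (scaling helps the solver converge) and are not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

ds = load_breast_cancer()
# Assumption for this sketch: standardize the features so that
# the default 'lbfgs' solver converges quickly
X = StandardScaler().fit_transform(ds.data)
y = ds.target

norms = []
for C in [0.01, 1.0, 100.0]:
    logreg = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    # L2 norm of the weight vector: smaller under stronger regularization
    norms.append(np.linalg.norm(logreg.coef_))
    print("C={:<6} |coef|={:.3f}".format(C, norms[-1]))
```

A smaller C should give a smaller coefficient norm, which is the opposite of how alpha behaves in Ridge/Lasso.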

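Training the model

Pass the training data for the features and the target to the fit() method to train the model (that is, to determine the regression coefficients).

logreg.fit(X, y)

X: Array of features. A 2-D array in which each column corresponds to one explanatory variable and the number of rows is the number of data points. When using a single variable, a 1-D array must be converted to a one-column column vector with reshape(-1, 1), or the column must be taken from a 2-D array with a slice ([:, n:n+1]).
y: Array of targets, usually a 1-D array for a single variable.

The third argument, sample_weight, is omitted here.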
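The two ways of obtaining a one-column 2-D array mentioned for X can be sketched as follows. Using the first column of the breast_cancer data is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

ds = load_breast_cancer()

x_1d = ds.data[:, 0]        # a single feature as a 1-D array
X_a = x_1d.reshape(-1, 1)   # 1-D array -> column vector
X_b = ds.data[:, 0:1]       # slicing keeps the array two-dimensional

print(x_1d.shape, X_a.shape, X_b.shape)

# Either form can be passed to fit(); the bare 1-D array cannot
logreg = LogisticRegression(max_iter=3000).fit(X_a, ds.target)
print(logreg.coef_.shape)
```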

Computing the score

Pass the features and the target to the score() method to compute the score (mean accuracy).

logreg.score(X, y)

Other methods

decision_function(X)
densify()
predict_proba(X)
predict_log_proba(X)
sparsify()
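As a sketch of how two of these methods relate in the two-class case: with the default settings, the probability of class 1 from predict_proba() is the logistic sigmoid of the value from decision_function(), and predict() follows the sign of that value. The data split reuses the settings of the usage example above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

ds = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, stratify=ds.target, random_state=42)
logreg = LogisticRegression(max_iter=3000).fit(X_train, y_train)

scores = logreg.decision_function(X_test)  # signed scores, one per sample
proba = logreg.predict_proba(X_test)       # columns follow logreg.classes_

# Sigmoid of the decision score, for comparison with proba[:, 1]
sigmoid = 1.0 / (1.0 + np.exp(-scores))
print(np.allclose(proba[:, 1], sigmoid))

# A positive score corresponds to class 1, a negative one to class 0
print(np.array_equal(logreg.predict(X_test), (scores > 0).astype(int)))
```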

Diabetes dataset

2020-05-16 / tau

Overview

The diabetes data consists of 10 features such as age and sex, together with a numeric measure of diabetes progression one year after those measurements, collected for 442 people. The source is Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.

This post summarizes how to use the diabetes data in Python's scikit-learn.

Getting the data and its structure

In Python, the data can be obtained with load_diabetes() in the sklearn.datasets module. The data is an object of the Bunch class.

from sklearn.datasets import load_diabetes

ds = load_diabetes()

for key, value in ds.items():
    print("{}:\n{}\n".format(key, value))

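The data is structured like a dictionary. It contains an array whose records are the 10 diabetes-related features of the 442 people, an array of numeric values indicating diabetes progression one year after measurement for the same 442 people, and so on.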

data:
[[ 0.03807591  0.05068012  0.06169621 ... -0.00259226  0.01990842
  -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06832974
  -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 ... -0.00259226  0.00286377
  -0.02593034]
 ...
 [ 0.04170844  0.05068012 -0.01590626 ... -0.01107952 -0.04687948
   0.01549073]
 [-0.04547248 -0.04464164  0.03906215 ...  0.02655962  0.04452837
  -0.02593034]
 [-0.04547248 -0.04464164 -0.0730303  ... -0.03949338 -0.00421986
   0.00306441]]

target:
[151.  75. 141. 206. 135.  97. 138.  63. 110. 310. 101.  69. 179. 185.
 118. 171. 166. 144.  97. 168.  68.  49.  68. 245. 184. 202. 137.  85.
 131. 283. 129.  59. 341.  87.  65. 102. 265. 276. 252.  90. 100.  55.
  61.  92. 259.  53. 190. 142.  75. 142. 155. 225.  59. 104. 182. 128.
  52.  37. 170. 170.  61. 144.  52. 128.  71. 163. 150.  97. 160. 178.
  48. 270. 202. 111.  85.  42. 170. 200. 252. 113. 143.  51.  52. 210.
  65. 141.  55. 134.  42. 111.  98. 164.  48.  96.  90. 162. 150. 279.
  92.  83. 128. 102. 302. 198.  95.  53. 134. 144. 232.  81. 104.  59.
 246. 297. 258. 229. 275. 281. 179. 200. 200. 173. 180.  84. 121. 161.
  99. 109. 115. 268. 274. 158. 107.  83. 103. 272.  85. 280. 336. 281.
 118. 317. 235.  60. 174. 259. 178. 128.  96. 126. 288.  88. 292.  71.
 197. 186.  25.  84.  96. 195.  53. 217. 172. 131. 214.  59.  70. 220.
 268. 152.  47.  74. 295. 101. 151. 127. 237. 225.  81. 151. 107.  64.
 138. 185. 265. 101. 137. 143. 141.  79. 292. 178.  91. 116.  86. 122.
  72. 129. 142.  90. 158.  39. 196. 222. 277.  99. 196. 202. 155.  77.
 191.  70.  73.  49.  65. 263. 248. 296. 214. 185.  78.  93. 252. 150.
  77. 208.  77. 108. 160.  53. 220. 154. 259.  90. 246. 124.  67.  72.
 257. 262. 275. 177.  71.  47. 187. 125.  78.  51. 258. 215. 303. 243.
  91. 150. 310. 153. 346.  63.  89.  50.  39. 103. 308. 116. 145.  74.
  45. 115. 264.  87. 202. 127. 182. 241.  66.  94. 283.  64. 102. 200.
 265.  94. 230. 181. 156. 233.  60. 219.  80.  68. 332. 248.  84. 200.
  55.  85.  89.  31. 129.  83. 275.  65. 198. 236. 253. 124.  44. 172.
 114. 142. 109. 180. 144. 163. 147.  97. 220. 190. 109. 191. 122. 230.
 242. 248. 249. 192. 131. 237.  78. 135. 244. 199. 270. 164.  72.  96.
 306.  91. 214.  95. 216. 263. 178. 113. 200. 139. 139.  88. 148.  88.
 243.  71.  77. 109. 272.  60.  54. 221.  90. 311. 281. 182. 321.  58.
 262. 206. 233. 242. 123. 167.  63. 197.  71. 168. 140. 217. 121. 235.
 245.  40.  52. 104. 132.  88.  69. 219.  72. 201. 110.  51. 277.  63.
 118.  69. 273. 258.  43. 198. 242. 232. 175.  93. 168. 275. 293. 281.
  72. 140. 189. 181. 209. 136. 261. 113. 131. 174. 257.  55.  84.  42.
 146. 212. 233.  91. 111. 152. 120.  67. 310.  94. 183.  66. 173.  72.
  49.  64.  48. 178. 104. 132. 220.  57.]

DESCR:
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

feature_names:
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

data_filename:
C:\Users\tomo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\datasets\data\diabetes_data.csv.gz

target_filename:
C:\Users\tomo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\datasets\data\diabetes_target.csv.gz

The keys of the data are as follows.

from sklearn.datasets import load_diabetes

ds = load_diabetes()

print(ds.keys())

# dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])

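Contents of the data

`'data'`: the feature dataset

A 2-D array with the 10 features as columns and the 442 subjects as rows. As described in DESCR, the data is normalized with the sample mean and sample variance, so that for every feature the sum of the data is zero (more precisely, a real number on the order of 1×10^-14 to 1×10^-13) and the sum of squares is 1.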

data:
[[ 0.03807591  0.05068012  0.06169621 ... -0.00259226  0.01990842
  -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06832974
  -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 ... -0.00259226  0.00286377
  -0.02593034]
 ...
 [ 0.04170844  0.05068012 -0.01590626 ... -0.01107952 -0.04687948
   0.01549073]
 [-0.04547248 -0.04464164  0.03906215 ...  0.02655962  0.04452837
  -0.02593034]
 [-0.04547248 -0.04464164 -0.0730303  ... -0.03949338 -0.00421986
   0.00306441]]
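The normalization described above can be checked with a short sketch like the following, which computes the column sums and the column sums of squares. It assumes load_diabetes() is called with its default arguments, so that the scaled data is returned.

```python
import numpy as np
from sklearn.datasets import load_diabetes

ds = load_diabetes()
X = ds.data

print(X.shape)

# Column sums: expected to be zero up to floating-point error
print(np.abs(X.sum(axis=0)).max())

# Column sums of squares: expected to be 1 for every feature
print((X ** 2).sum(axis=0))
```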

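`'target'`: diabetes progression

A numeric value indicating diabetes progression one year after the 10 features were measured, for each of the 442 people. The original text also describes it only as "a measure of disease progression one year after baseline". This data is not normalized.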

target:
[151.  75. 141. 206. 135.  97. 138.  63. 110. 310. 101.  69. 179. 185.
 118. 171. 166. 144.  97. 168.  68.  49.  68. 245. 184. 202. 137.  85.
 131. 283. 129.  59. 341.  87.  65. 102. 265. 276. 252.  90. 100.  55.
.....
  72. 140. 189. 181. 209. 136. 261. 113. 131. 174. 257.  55.  84.  42.
 146. 212. 233.  91. 111. 152. 120.  67. 310.  94. 183.  66. 173.  72.
  49.  64.  48. 178. 104. 132. 220.  57.]

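`'feature_names'`: feature names

The names of the 10 features.

No.  sklearn  R    Description
0    age      age  Age
1    sex      sex  Sex
2    bmi      bmi  BMI (Body Mass Index)
3    bp       map  Mean arterial blood pressure (average blood pressure)
4    s1       tc   Total cholesterol?
5    s2       ldl  LDL ("bad") cholesterol (Low Density Lipoprotein)
6    s3       hdl  HDL ("good") cholesterol (High Density Lipoprotein)
7    s4       tch  Total cholesterol?
8    s5       ltg  Unclear (possibly the log of the serum triglycerides level)
9    s6       glu  Blood sugar (glucose)?

In scikit-learn, the later features are shown only as s1 to s6, and DESCR also says only "six blood serum measurements". In the R dataset, they are given as abbreviations of blood serum measures such as tc and ldl.

tc and tch both appear to relate to total cholesterol, but the difference between them is not clear from the data alone. The two are at least positively correlated, although the scatter is large. Newer versions of scikit-learn's DESCR describe tch as total cholesterol / HDL and ltg as possibly the log of the serum triglycerides level.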

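`'data_filename'`, `'target_filename'`: file names

The full path names of the CSV files are given. They differ from scikit-learn's other datasets in the following two respects.

The data is split into two files, diabetes_data.csv for the features and diabetes_target.csv for the target
The file extension is csv, but the delimiter is a space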

data_filename:
C:...\lib\site-packages\sklearn\datasets\data\diabetes_data.csv.gz

target_filename:
C:...\lib\site-packages\sklearn\datasets\data\diabetes_target.csv.gz

diabetes_data.csv

Each line holds 10 real numbers separated by spaces, and there are 442 lines: the 10 features for each of the 442 people.

3.807590643342410180e-02 5.068011873981870252e-02 6.169620651868849837e-02 2.187235499495579841e-02 -4.422349842444640161e-02 -3.482076283769860309e-02 -4.340084565202689815e-02 -2.592261998182820038e-03 1.990842087631829876e-02 -1.764612515980519894e-02
-1.882016527791040067e-03 -4.464163650698899782e-02 -5.147406123880610140e-02 -2.632783471735180084e-02 -8.448724111216979540e-03 -1.916333974822199970e-02 7.441156407875940126e-02 -3.949338287409189657e-02 -6.832974362442149896e-02 -9.220404962683000083e-02
8.529890629667830071e-02 5.068011873981870252e-02 4.445121333659410312e-02 -5.670610554934250001e-03 -4.559945128264750180e-02 -3.419446591411950259e-02 -3.235593223976569732e-02 -2.592261998182820038e-03 2.863770518940129874e-03 -2.593033898947460017e-02
.....
4.170844488444359899e-02 5.068011873981870252e-02 -1.590626280073640167e-02 1.728186074811709910e-02 -3.734373413344069942e-02 -1.383981589779990050e-02 -2.499265663159149983e-02 -1.107951979964190078e-02 -4.687948284421659950e-02 1.549073015887240078e-02
-4.547247794002570037e-02 -4.464163650698899782e-02 3.906215296718960200e-02 1.215130832538269907e-03 1.631842733640340160e-02 1.528299104862660025e-02 -2.867429443567860031e-02 2.655962349378539894e-02 4.452837402140529671e-02 -2.593033898947460017e-02
-4.547247794002570037e-02 -4.464163650698899782e-02 -7.303030271642410587e-02 -8.141376581713200000e-02 8.374011738825870577e-02 2.780892952020790065e-02 1.738157847891100005e-01 -3.949338287409189657e-02 -4.219859706946029777e-03 3.064409414368320182e-03

diabetes_target.csv

442 lines of real-valued data corresponding to the target y.

1.510000000000000000e+02
7.500000000000000000e+01
1.410000000000000000e+02
.....
1.320000000000000000e+02
2.200000000000000000e+02
5.700000000000000000e+01

`'DESCR'`: description of the dataset

A description of the dataset. It explains that each feature has been standardized.

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

Using the data

How to get each piece of data

There are two ways to take out data such as data and target.

Call it with the dictionary key (e.g. diabetes['data'])
Give the key string as an attribute (e.g. diabetes.data)
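A small sketch of the two access styles listed above, checking that they refer to the same data. The variable name diabetes follows the examples in the list.

```python
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

by_key = diabetes['data']   # dictionary-style access
by_attr = diabetes.data     # attribute-style access

# Both styles return the same underlying array object
print(by_key is by_attr)
print(diabetes['feature_names'] == diabetes.feature_names)
```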

Handling data

Handle it directly as a 2-D array, or as a pandas.DataFrame. To take out specific features, use fancy indexing.

from sklearn.datasets import load_diabetes
from pandas import DataFrame

ds = load_diabetes()
df = DataFrame(ds.data, columns=ds.feature_names)

print(df[['s1', 's4']])

#            s1        s4
# 0   -0.044223 -0.002592
# 1   -0.008449 -0.039493
# 2   -0.045599 -0.002592
# 3    0.012191  0.034309
# 4    0.003935 -0.002592
# ..        ...       ...
# 437 -0.005697 -0.002592
# 438  0.049341  0.034309
# 439 -0.037344 -0.011080
# 440  0.016318  0.026560
# 441  0.083740 -0.039493

scikit-learn – LinearRegression

2020-05-10 / tau

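Overview

scikit-learn's LinearRegression provides the simplest multiple linear regression model.

The model is used in the following steps.

Import the LinearRegression class
Create a model instance
Pass the training data to the fit() method to train the model

A trained model is used as follows.

Pass test data to the score() method to compute the goodness of fit
Pass explanatory variables to the predict() method to predict the target
Use the model's parameters through the model instance's attributes
- The intercept is intercept_ and the weight coefficients are coef_ (note the trailing underscore)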
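The steps above can be sketched as follows. This sketch uses the diabetes dataset with the bmi and bp columns, which is my own choice for illustration and not the data used in the original examples. It also checks that predict() equals intercept_ plus the features multiplied by coef_.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

ds = load_diabetes()
X = ds.data[:, [2, 3]]   # columns 2 and 3: bmi and bp
y = ds.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)

print("Score:{}".format(lr.score(X_test, y_test)))
print("Intercept:{}".format(lr.intercept_))
print("Coefficients:{}".format(lr.coef_))

# The prediction is the linear combination of the learned parameters
manual = lr.intercept_ + X_test @ lr.coef_
print(np.allclose(manual, lr.predict(X_test)))
```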

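Usage examples

Using arrays

The following takes two features from scikit-learn's Boston house prices data, RM (number of rooms per dwelling) and LSTAT (percentage of lower-status population), and applies a linear regression model to them. Note that load_boston() was removed in scikit-learn 1.2, so this code needs an earlier version.

To take out a subset of the features, fancy indexing is used with a list of the two variables' column indices. The feature data X and the target data y are also split into training and test data with train_test_split().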

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

ds = load_boston()

X = ds.data[:, [5, 12]]
y = ds.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)

print("Score:{}".format(lr.score(X_test, y_test)))

print("Prediction for (7, 5):{}".format(lr.predict([[7, 5]])))

print("Intercept:{}".format(lr.intercept_))
print("Coefficients:{}".format(lr.coef_))

# Score:0.5692445415835343
# Prediction for (7, 5):[31.14766768]
# Intercept:-0.6047107435077521
# Coefficients:[ 5.01785312 -0.67451869]

Using a DataFrame

In the following example, the body of the dataset (data) is built into a pandas DataFrame, and the two features RM and LSTAT are selected by column name.

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

ds = load_boston()
df = pd.DataFrame(ds.data, columns=ds.feature_names)

X = df[['RM', 'LSTAT']]
y = ds['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)

print("Score:{}".format(lr.score(X_test, y_test)))

print("Prediction for (7, 5):{}".format(lr.predict([[7, 5]])))

print("Intercept:{}".format(lr.intercept_))
print("Coefficients:{}".format(lr.coef_))

# Score:0.5692445415835343
# Prediction for (7, 5):[31.14766768]
# Intercept:-0.6047107435077521
# Coefficients:[ 5.01785312 -0.67451869]
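
Because load_boston is no longer available in recent scikit-learn releases, the same workflow can be sketched with the bundled diabetes dataset. This is only an illustration under assumptions: the choice of the bmi and bp columns and the input values passed to predict() are arbitrary and do not come from the original example.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the bundled diabetes data and wrap the features in a DataFrame
ds = load_diabetes()
df = pd.DataFrame(ds.data, columns=ds.feature_names)

# Pick two features by name (an arbitrary choice for illustration)
cols = ["bmi", "bp"]
X = df[cols]
y = ds.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)

# Goodness of fit on the held-out data
score = lr.score(X_test, y_test)

# Predict for one hypothetical sample; a single sample is still 2-D
x_new = pd.DataFrame([[0.05, 0.02]], columns=cols)
y_new = lr.predict(x_new)

print("Score:{}".format(score))
print("Prediction:{}".format(y_new))
print("Intercept:{}".format(lr.intercept_))
print("Coefficients:{}".format(lr.coef_))
```

Passing the new sample as a DataFrame with the same column names keeps the input consistent with the data used in fit().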

How to use

Importing the model class

Import the LinearRegression class from the sklearn.linear_model package.

from sklearn.linear_model import LinearRegression

Creating a model instance

LinearRegression has no hyperparameters to tune.

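lr = LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=None)

fit_intercept: Specify False to skip computing the intercept. The default is True, which computes the intercept; specify False when the model should pass through the origin.
normalize: If True, the features X are normalized before training (the mean is subtracted and the result is divided by the L2 norm). The default is False. Ignored when fit_intercept=False. To standardize the explanatory variables, leave this argument False and use sklearn.preprocessing.StandardScaler. This argument was deprecated in scikit-learn 1.0 and removed in 1.2.
copy_X: If True, X is copied; if False, X may be overwritten. The default is True.
n_jobs: The number of jobs used for the computation. The default is None, which is equivalent to 1. It only applies when n_targets > 1.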
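
The recommendation to standardize with StandardScaler rather than normalize can be sketched as a pipeline. This is a minimal illustration on synthetic data; the data, the feature scales, and the use of make_pipeline are assumptions and not part of the original article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): two features on very different scales
rng = np.random.RandomState(0)
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])
y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + 1.0

# Standardize the features first, then fit ordinary least squares
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)

# Plain fit on the raw features for comparison
plain = LinearRegression().fit(X, y)

# Least squares is unaffected by rescaling features, so the predictions
# agree; only the scale of the coefficients differs
same = np.allclose(model.predict(X), plain.predict(X))
print(same)
print(model.named_steps["linearregression"].coef_)
print(plain.coef_)
```

Standardizing mainly makes the coefficients comparable across features; for plain least squares it does not change the fitted predictions.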

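Training the model

Pass the training features and targets to the fit() method to train the model (that is, to determine the regression coefficients).

lr.fit(X, y)

X: The array of features. A 2-D array is expected, where each column corresponds to one explanatory variable and the number of rows is the number of samples. When there is only one variable held in a 1-D array, it must be converted to a single-column array with reshape(-1, 1); when taking one column out of a 2-D array, use a slice ([:, n:n+1]) so that the result stays 2-D.
y: The array of targets, usually a 1-D array for a single target variable.

The third argument, sample_weight, is omitted here.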
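
The shape requirement on X can be checked with a small sketch. The data here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One explanatory variable held in a 1-D array
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Convert to a single-column 2-D array before fitting
X_col = x.reshape(-1, 1)
print(X_col.shape)  # (4, 1)

# Taking one column from a 2-D array with a slice keeps it 2-D
X2 = np.column_stack([x, x ** 2])
X_first = X2[:, 0:1]
print(X_first.shape)  # (4, 1)

lr = LinearRegression()
lr.fit(X_col, y)
print(lr.coef_, lr.intercept_)
```

Indexing with X2[:, 0] instead of X2[:, 0:1] would return a 1-D array, which fit() rejects.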

Computing the goodness of fit

Pass features and targets to the score() method to compute the goodness of fit.

lr.score(X, y)

The return value is a real number indicating the goodness of fit, computed as the coefficient of determination R² of the regression.

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \overline{y})^2}$$
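
The formula can be checked numerically against score(). The following is a sketch on synthetic data (an assumption for illustration), computing RSS and TSS directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic noisy linear data (hypothetical)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = 0.5 * X[:, 0] + rng.normal(scale=0.3, size=50)

lr = LinearRegression().fit(X, y)
y_hat = lr.predict(X)

# Residual sum of squares and total sum of squares
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - rss / tss

print(r2_manual)
print(lr.score(X, y))
```

The two printed values should agree up to floating-point error.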

Prediction with the model

Pass features to the predict() method to obtain the predicted targets.

y_pred = lr.predict(X)

Here the features X are expected to be a 2-D array holding multiple samples, so even a single sample must be passed as a 2-D array.

y_pred = lr.predict([[x1, x2,..., xm]])

The result is returned as a 1-D array over the samples, so even when a single value is predicted it is a 1-D array with one element.
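
A short sketch of the input and output shapes of predict(), using made-up training data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small hypothetical training set with two features
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 13.0, 10.0])
lr = LinearRegression().fit(X, y)

# A single sample is passed as a 2-D array with one row
y_one = lr.predict([[7, 5]])
print(y_one.shape)  # (1,)

# Take the scalar out of the one-element array
value = y_one[0]

# Passing a 1-D array instead raises ValueError
try:
    lr.predict([7, 5])
    raised = False
except ValueError:
    raised = True
print(raised)
```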

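Using the intercept and coefficients

After training with the fit() method, the intercept and the weight coefficients for the features are available as the result of training.

Both are held as attributes of the model instance: the intercept is intercept_, a single real number, and the weight coefficients are coef_, a 1-D array with as many elements as there are features (a 1-D array with one element when there is a single feature).

ic = lr.intercept_
cf = lr.coef_

Note the trailing underscore.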
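
To see how the two attributes relate to predict(), the prediction can be rebuilt by hand as the intercept plus the weighted sum of the features. The data below are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two features (hypothetical)
rng = np.random.RandomState(0)
X = rng.normal(size=(30, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.7 + rng.normal(scale=0.1, size=30)

lr = LinearRegression().fit(X, y)

ic = lr.intercept_   # a single real number
cf = lr.coef_        # 1-D array, one element per feature

# Linear model: y_hat = intercept + X @ coef
manual = ic + X @ cf
matches = np.allclose(manual, lr.predict(X))
print(cf.shape, matches)
```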

Examples

Simple linear regression on the wave dataset

Splitting training and test data: train_test_split()

2020-03-14 / tau

Overview

scikit-learn's train_test_split() function splits the given data into training data and test data in a variety of ways.

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)
print(x)
# [ 1  2  3  4  5  6  7  8  9 10 11 12]

print(train_test_split(x))
# [array([ 7,  2, 12,  5,  3,  9, 11,  8, 10]), array([1, 6, 4])]

x_train, x_test = train_test_split(x)

print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
# x_train:[ 6  1 12  7  3  2 11  5  4]
# x_test :[ 8  9 10]

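The print(train_test_split(x)) call shows that passing an array to train_test_split() returns it split into two arrays.

The following line, x_train, x_test = train_test_split(x), receives that result as the training array and the test array.

By default, train_test_split() splits the array so that the test data is 0.25 of the size of the given array (first size : second size = 3:1). Here x_test has 12 × 0.25 = 3 elements and x_train has 9.

Fixing the random sequence

When the data is split, the elements are chosen at random on every call to train_test_split(), but the selection can be fixed by specifying the random_state parameter.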

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)

x_train, x_test = train_test_split(x, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
# x_train:[11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5]

x_train, x_test = train_test_split(x, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
# x_train:[11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5]

x_train, x_test = train_test_split(x, random_state=1)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
# x_train:[11  2  7  1  8 12 10  9  6]
# x_test :[3 4 5]

x_train, x_test = train_test_split(x, random_state=1)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
# x_train:[11  2  7  1  8 12 10  9  6]
# x_test :[3 4 5]
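
The reproducibility shown above can also be checked programmatically instead of by eye. This is a small sketch with variable names chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)

# Same seed twice, then a different seed
a_train, a_test = train_test_split(x, random_state=0)
b_train, b_test = train_test_split(x, random_state=0)
c_train, c_test = train_test_split(x, random_state=1)

same_seed = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
other_seed = np.array_equal(a_test, c_test)

print(same_seed)   # identical split for the same random_state
print(other_seed)  # a different random_state gives a different split here
```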

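Data size

Specifying the test data size

The size of the test data can be specified with the test_size parameter.

In the following example, the test data ratio is changed from the default 0.25 to 0.3, and the test data has 4 elements (with test_size=0.26 x_test also has 4 elements, so the test data size is computed by rounding up).

When the size is specified as a ratio, test_size must be a real number with 0 < test_size < 1 (specifying 0 or 1.0 raises an error).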

x_train, x_test = train_test_split(x, test_size=0.3, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# x_train:[ 3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5 11]

The size of the test data can also be specified as an actual size (number of elements) instead of a ratio. In that case, specify test_size as an integer of 1 or more.

In the following example, the test data size is specified as 4.

x_train, x_test = train_test_split(x, test_size=4, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# x_train:[ 3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5 11]

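Specifying the training data size

The size of the training data can also be specified, with the train_size parameter.

In the following example train_size=0.8 is given and the training data has 9 elements (the training data size is computed by rounding down).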

x_train, x_test = train_test_split(x, train_size=0.8, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# x_train:[11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5]
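
The rounding behaviour observed so far (test size rounded up, training size rounded down) can be compared against math.ceil and math.floor. This sketch only checks the observed sizes for this 12-element array; it is not a statement about every edge case of the library.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)
n = len(x)

# test_size as a ratio: the number of test elements matches ceil(ratio * n)
test_sizes = {}
for ratio in (0.25, 0.26, 0.3):
    x_train, x_test = train_test_split(x, test_size=ratio, random_state=0)
    test_sizes[ratio] = len(x_test)
    print(ratio, len(x_test), math.ceil(ratio * n))

# train_size as a ratio: the number of training elements matches floor(ratio * n)
x_train, x_test = train_test_split(x, train_size=0.8, random_state=0)
n_train, n_test = len(x_train), len(x_test)
print(0.8, n_train, math.floor(0.8 * n))
```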

The training data size can also be specified as a number of elements.

x_train, x_test = train_test_split(x, train_size=10, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# x_train:[ 5 11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12]

Internal procedure of data selection

Here I noticed that when test_size or train_size is varied with random_state=0, the order in which the test data elements appear does not change.

x_train, x_test = train_test_split(x, test_size=0.2, random_state=0)
# x_train:[11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5]

x_train, x_test = train_test_split(x, test_size=0.3, random_state=0)
# x_train:[ 3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5 11]

x_train, x_test = train_test_split(x, test_size=0.4, random_state=0)
# x_train:[ 9  2  8 10  4  1  6]
# x_test :[ 7 12  5 11  3]

x_train, x_test = train_test_split(x, train_size=10, random_state=0)
# x_train:[ 5 11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12]

x_train, x_test = train_test_split(x, train_size=9, random_state=0)
# x_train:[11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5]

x_train, x_test = train_test_split(x, train_size=8, random_state=0)
# x_train:[ 3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5 11]

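Whether the size is specified with test_size or train_size, and whether as a ratio or as a number of elements, the test data elements always appear in the order 7, 12, 5, ...

The training data, on the other hand, changes when the number of test elements changes, but when the test data is the same the training data is also the same.

In other words, however the size is specified, train_test_split() appears to first convert it to a number of test elements and then pick the test data by a common procedure.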
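
This observation can be tested directly: for a fixed random_state, each smaller test set should be a prefix of the larger ones. The sketch below only verifies the behaviour for this array and seed; it does not rely on any particular internal implementation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)

# Test sets for increasing test sizes with the same seed
tests = [train_test_split(x, test_size=k, random_state=0)[1] for k in (2, 3, 4, 5)]

# Each test set should start with the previous, shorter one
is_prefix = [
    np.array_equal(shorter, longer[:len(shorter)])
    for shorter, longer in zip(tests, tests[1:])
]
print(is_prefix)

# Specifying train_size=9 gives the same split as test_size=3 for 12 elements
tr_a, te_a = train_test_split(x, train_size=9, random_state=0)
tr_b, te_b = train_test_split(x, test_size=3, random_state=0)
same_split = np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)
print(same_split)
```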

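Splitting multiple arrays at once

train_test_split() can also split several arrays at the same time.

In the following example, two arrays are passed as arguments. The result is returned as a list containing, for each input array in turn, its training data followed by its test data.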

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 9)
y = np.arange(11, 19)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
print("y_train:{}".format(y_train))
print("y_test :{}".format(y_test))

# x_train:[2 8 4 1 6 5]
# x_test :[7 3]
# y_train:[12 18 14 11 16 15]
# y_test :[17 13]

This is the typical usage: a dataset of several features per sample and the class data for each sample are split into training and test data at the same time.

import numpy as np
from sklearn.model_selection import train_test_split

x = np.vstack((np.arange(1, 11), np.arange(11, 21))).T
print("original x:\n{}".format(x))

y = np.arange(21, 31)
print("original y:{}".format(y))

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

print("x_train:\n{}".format(x_train))
print("x_test :\n{}".format(x_test))
print("y_train:{}".format(y_train))
print("y_test :{}".format(y_test))

The original data:

original x:
[[ 1 11]
 [ 2 12]
 [ 3 13]
 [ 4 14]
 [ 5 15]
 [ 6 16]
 [ 7 17]
 [ 8 18]
 [ 9 19]
 [10 20]]
original y:[21 22 23 24 25 26 27 28 29 30]

The result of splitting it into training and test data:

x_train:
[[10 20]
 [ 2 12]
 [ 7 17]
 [ 8 18]
 [ 4 14]
 [ 1 11]
 [ 6 16]]
x_test :
[[ 3 13]
 [ 9 19]
 [ 5 15]]
y_train:[30 22 27 28 24 21 26]
y_test :[23 29 25]
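
The same call accepts three or more arrays, and rows that belonged together before the split still belong together afterwards. The following sketch checks this; the third array of string labels is a hypothetical addition for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 11)
y = x + 20
# Hypothetical string labels derived from x (odd/even)
labels = np.array(["even", "odd"])[x % 2]

# Three input arrays give six outputs: train/test for each input, in order
result = train_test_split(x, y, labels, random_state=0)
x_train, x_test, y_train, y_test, l_train, l_test = result

# Corresponding elements stay aligned across all three arrays
aligned_y = np.array_equal(y_train, x_train + 20) and np.array_equal(y_test, x_test + 20)
aligned_l = np.array_equal(l_test, np.array(["even", "odd"])[x_test % 2])

print(len(result), aligned_y, aligned_l)
```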

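Stratification with `stratify`

train_test_split() selects elements at random. As a result, the class distribution can differ among the original data, the training data, and the test data.

In the following example, the ratio of 0 to 1 is 3:5 in the original data, but 1:4 in the training data and 2:1 in the test data. Depending on the case, a particular class may become extremely rare or be missing altogether.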

import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

y_train, y_test = train_test_split(y, test_size=3, random_state=0)
print("y_train:{}".format(y_train))
print("y_test :{}".format(y_test))

# y_train:[1 1 0 1 1]
# y_test :[1 0 0]

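When an array is given to the stratify parameter, the training and test data are split so that their class proportions match those of that array.

In the following example, the earlier array is split so that the 0/1 proportions of each part are similar to those of the original array.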

y_train, y_test = train_test_split(y, test_size=3, stratify=y, random_state=0)
print("y_train:{}".format(y_train))
print("y_test :{}".format(y_test))

# y_train:[0 1 1 0 1]
# y_test :[1 1 0]

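The next example illustrates splitting the feature data x of 9 samples and the class labels y of each sample into training and test data in line with the class distribution.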

import numpy as np
from sklearn.model_selection import train_test_split

x = np.array([10, 10, 10, 11, 11, 11, 11, 11, 11])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])

x_train, x_test, y_train, y_test =\
    train_test_split(x, y, test_size=3, stratify=y, random_state=0)
print("y_train:{}".format(y_train))
print("y_test :{}".format(y_test))
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# y_train:[0 1 1 0 1 1]
# y_test :[1 1 0]
# x_train:[10 11 11 10 11 11]
# x_test :[11 11 10]
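
Counting the classes with np.bincount makes the effect of stratify easier to verify than reading the arrays. This is a small sketch using the same 9-element label array as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 3 samples of class 0 and 6 of class 1 (ratio 1:2)
y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])

y_train, y_test = train_test_split(y, test_size=3, stratify=y, random_state=0)

# Number of samples per class in each part
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("original:{}".format(np.bincount(y)))
print("train   :{}".format(train_counts))
print("test    :{}".format(test_counts))
```

Both parts keep the 1:2 ratio of the original labels.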

Shuffling

By default, train_test_split() selects elements at random when splitting the data, but when shuffle=False is specified the order of the elements is preserved.

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)

x_train, x_test = train_test_split(x, shuffle=False, random_state=0)

print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# x_train:[1 2 3 4 5 6 7 8 9]
# x_test :[10 11 12]
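
With shuffle=False the split amounts to cutting the array at one position, which can be confirmed by comparing with plain slicing. The sketch also shows that stratify cannot be combined with shuffle=False; the label array y used for that check is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)

# Without shuffling, the first 9 elements become training data
x_train, x_test = train_test_split(x, shuffle=False)
same_as_slice = np.array_equal(x_train, x[:9]) and np.array_equal(x_test, x[9:])
print(same_as_slice)

# Stratified splitting requires shuffling, so this combination raises ValueError
y = np.array([0, 1] * 6)
try:
    train_test_split(x, y, shuffle=False, stratify=y)
    raised = False
except ValueError:
    raised = True
print(raised)
```

Keeping the order is useful when the data is ordered in time and the test data should come after the training data.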
