決定木によるクラス分類

2020-06-06 / tau / コメントする

概要

決定木をクラス分類に用いる場合、分類木(classification tree)とも呼ぶ。決定木は、決定境界が単調ではなく混み入っていて線形モデルでは分類が難しい場合でも対応できる。ここでは決定木のクラス分類における性質・挙動を確認する。

決定木の構築過程

2つの特徴量を持ち、2つのクラスのいずれかに属するデータについて、決定木が作られていく過程を追っていく。

データとしては、scikit-learnのmake_moons()で得られる以下のデータを用いる。

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=30, noise=0.25, random_state=9)

fig, ax = plt.subplots()

ax.scatter(X[y==0][:, 0], X[y==0][:, 1],
    ec='k', s=60, marker='o', label="Class 0")
ax.scatter(X[y==1][:, 0], X[y==1][:, 1],
    ec='k', s=60, marker='^', label="Class 1")

ax.set_xlabel("Feature-0")
ax.set_ylabel("Feature-1")
ax.legend()

plt.show()

import matplotlib.pyplot as plt

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=30, noise=0.25, random_state=9)

fig, ax = plt.subplots()

ax.scatter(X[y==0][:, 0], X[y==0][:, 1],

ec='k', s=60, marker='o', label="Class 0")

ax.scatter(X[y==1][:, 0], X[y==1][:, 1],

ec='k', s=60, marker='^', label="Class 1")

ax.set_xlabel("Feature-0")

ax.set_ylabel("Feature-1")

ax.legend()

plt.show()

このデータセットについて、2つの特徴量のいずれかを調節して順次領域を分割していく。このとき、どのように分割するのが最も妥当かということについては、決定木の分割の考え方を参照。また、以下の実行例のコードについてはDecisionTreeClassifierに関するテストプログラムを参照。

まず最初の分割は以下の通りで、特徴量1の値0.272が境界となり、それ以下がクラス0、それより大きいとクラス1が卓越していると判定される。

第2ステップは、第1ステップで分けられたそれぞれの領域を分割する。どちらの領域も特徴量0が境界となっていて、それぞれの領域／ノードの分布状態に応じた特性量によって左右に分けられている。

第3ステップでは、左上の領域は特徴量1で上下に、右下の領域は特徴量0で左右に分割されるが、いずれについても分割後の領域のクラスが同じになっている。利得が最も高くなるように分割しても、領域の中の擾乱クラスのデータが少ない場合はこのようになる。

なお、右上の領域はクラス1のデータが2個のみ、左下の領域クラス0のデータが2個のみと単一のクラスのデータしかないため、これ以上分割されない葉(leaf)となっている。

第4ステップでは、左上と右下の領域がいずれも特徴量0で左右に分割され、今度は分割後のノードが異なるクラスになっている。左上と右下の領域（ノード）がクラス1の葉となっている。

第5ステップでは下方右から2番目の領域を特徴量1で分割している。

分割はここで終了。全てのノードが単一のクラスで構成された純粋な状態となっている。

全体としてはクラス0が左側の上に凸な分布、クラス1が左側の下に凸な分布と判定されていて、make_moons()が意図した分布と合っている。ただし左上に一つクラス1のデータがあるために、本来意図しない領域がクラス1と分割されている。

過学習の抑制

過学習の視覚的な例

決定木の構築過程で、1つのデータの影響で予想と異なるノード区分が発生するのが見られた。別のパターンのデータの例を以下に示す。少ないデータの影響で領域分割が複雑になっていることがわかる。

このような場合、特定のパターンの教師データに対しては適合度が高く、純粋な葉のみで構成される場合には適合度が100%になるが、他のデータに対しての適合度を下げてしまう。いわゆる過学習となってしまう。

決定木の過学習を防ぐための方法として、決定木の階層をあるレベルまでに留める、いわゆる枝刈り／剪定(pruning)という考え方や、葉の切り分けを純粋なレベルより前に留める（ノード内のデータ数が2つ以上複数のデータ数になったら分割を止める）という考え方がある。

これらはscikit-learnのDecisionTreeClassifierでは、枝刈りのうち事前剪定(pre-pruning)のためのパラメーターとしてmax_depth、ノードの最小データ数のパラメーターとしてmin_samples_leafが設定できる。

max_depth～剪定

以下の例は、上記のデータに対してmax_depthを変化させたときの領域分割の様子。このパラメーターはデフォルトではNoneで可能な限り分割を行っていくが、正の整数値を指定すると、その深さまでで分割を止める。分割の深さが少なくなるにしたがってモデルが単純化されていく様子がわかる。

min_samples_leaf～葉の純度の制限

min_samples_leafはデフォルトでは1で、ノードが完全に純粋でない限りデータ数が1個になるまで分割を試みる。この値を変化させて、異なるクラスのデータを含んでいても分割を行わないようにした場合の領域分割の状況を見てみる。

枝刈りに比べてモデルの複雑さは回避しながら、それらしい領域分割になっている。これは、事前枝刈りが各ノードの不純度に関わらず同じレベルで計算を止めるのに対して、葉の純度を個別にコントロールしているために柔軟に分割が進められているためと考えられる。

cancerデータによる確認

剪定による過剰適合の抑制

breast_cancerデータセットに対してDecisionTreeClassifierを適用してクラス分類し、訓練セットとテストセットのスコアを計算する。リーフノードが純粋になるまで木を成長させた場合と、深さ4で事前剪定をした場合のスコアを比較してみる。

import pydotplus as pdp
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_graphviz

ds = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(ds.data, ds.target, stratify=ds.target, random_state=42)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

print("Tree depth: {}".format(clf.get_depth()))
print("Training score:{:6.3f}".format(clf.score(X_train, y_train)))
print("Test score    :{:6.3f}".format(clf.score(X_test, y_test)))

print()

clf4 = DecisionTreeClassifier(max_depth=4, random_state=0)
clf4.fit(X_train, y_train)

print("Tree depth: {}".format(clf4.get_depth()))
print("Training score:{:6.3f}".format(clf4.score(X_train, y_train)))
print("Test score    :{:6.3f}".format(clf4.score(X_test, y_test)))

import pydotplus as pdp

from sklearn.model_selection import train_test_split

from sklearn.datasets import load_breast_cancer

from sklearn.tree import DecisionTreeClassifier, export_graphviz

ds = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(ds.data, ds.target, stratify=ds.target, random_state=42)

clf = DecisionTreeClassifier(random_state=0)

clf.fit(X_train, y_train)

print("Tree depth: {}".format(clf.get_depth()))

print("Training score:{:6.3f}".format(clf.score(X_train, y_train)))

print("Test score :{:6.3f}".format(clf.score(X_test, y_test)))

print()

clf4 = DecisionTreeClassifier(max_depth=4, random_state=0)

clf4.fit(X_train, y_train)

print("Tree depth: {}".format(clf4.get_depth()))

print("Training score:{:6.3f}".format(clf4.score(X_train, y_train)))

print("Test score :{:6.3f}".format(clf4.score(X_test, y_test)))

出力結果は以下の通り。完全な木は7層で、訓練セットに対しては全データに適合しており、テストセットに対しては93.7%の適合率。一方、深さ4で枝刈りをした場合は、訓練セットに対する適合率は下がるがテストセットの適合率は上がり、過剰適合が抑制されている。

Tree depth: 7
Training score: 1.000
Test score    : 0.937

Tree depth: 4
Training score: 0.988
Test score    : 0.951

Tree depth: 7

Training score: 1.000

Test score : 0.937

Tree depth: 4

Training score: 0.988

Test score : 0.951

剪定の深さに対するスコアの変化

事前剪定の深さレベルを変化させたときの、訓練セットとテストセットに対するスコアの変化を確認する。train_test_split()の乱数系列によって結果のパターンが異なるが、概ねmax_depth=4で適合不足と過剰適合のバランスが最もとれているようであり、スコアは0.95程度。

ただし、そもそも決定木の深さがそれほど深くなく、max_depthのバリエーションが数個となるので、線形モデルにおけるハイパーパラメーターのような連続的な曲線は描き難い。

max_depths = np.arange(8, 0, -1)
training_scores = np.empty(max_depths.size)
test_scores = np.empty(max_depths.size)

fig, axs = plt.subplots(2, 2, figsize=(8, 6))
fig.subplots_adjust(hspace=0.4)
ax_1d = axs.reshape(axs.size)

random_state_list = [0, 1, 2, 3]
for random_state, ax in zip(random_state_list, ax_1d):
    X_train, X_test, y_train, y_test =\
        train_test_split(ds.data, ds.target, random_state=random_state)
    for i, max_depth in enumerate(max_depths):
        clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        clf.fit(X_train, y_train)
        training_scores[i] = clf.score(X_train, y_train)
        test_scores[i] = clf.score(X_test, y_test)
    ax.plot(max_depths, training_scores, clip_on=False)
    ax.plot(max_depths, test_scores)
    ax.set_xlim(ax.get_xlim()[::-1])
    ax.set_ylim(0.85, 1)
    ax.set_xlabel("max_depth")
    ax.set_title("random_state for split:{}".format(random_state))

plt.show()

max_depths = np.arange(8, 0, -1)

training_scores = np.empty(max_depths.size)

test_scores = np.empty(max_depths.size)

fig, axs = plt.subplots(2, 2, figsize=(8, 6))

fig.subplots_adjust(hspace=0.4)

ax_1d = axs.reshape(axs.size)

random_state_list = [0, 1, 2, 3]

for random_state, ax in zip(random_state_list, ax_1d):

X_train, X_test, y_train, y_test =\

train_test_split(ds.data, ds.target, random_state=random_state)

for i, max_depth in enumerate(max_depths):

clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)

clf.fit(X_train, y_train)

training_scores[i] = clf.score(X_train, y_train)

test_scores[i] = clf.score(X_test, y_test)

ax.plot(max_depths, training_scores, clip_on=False)

ax.plot(max_depths, test_scores)

ax.set_xlim(ax.get_xlim()[::-1])

ax.set_ylim(0.85, 1)

ax.set_xlabel("max_depth")

ax.set_title("random_state for split:{}".format(random_state))

plt.show()

リーフの最小サンプル数に対するスコアの変化

葉ノードの最小サンプル数を一定値以上とするmin_samples_leafを変化させたときの、訓練セットとテストセットの変化を見てみたが、乱数系列によってけっこうパターンがばらついている。この傾向は、max_depthを変化させても変わらなかった。

特徴量重要度

特徴量重要度の特性

breast_cancerデータセットを深さ4で剪定した決定木によってクラス分類した場合の特徴量重要度は以下のようになる。このグラフを表示するコードや特徴量重要度の計算方法についてはこちらを参照。

                 feature  importance
10          radius error    0.010197
11         texture error    0.048398
14      smoothness error    0.002416
20          worst radius    0.726829
21         worst texture    0.045816
24      worst smoothness    0.014158
26       worst concavity    0.018188
27  worst concave points    0.122113
28        worst symmetry    0.011885

feature importance

10 radius error 0.010197

11 texture error 0.048398

14 smoothness error 0.002416

20 worst radius 0.726829

21 worst texture 0.045816

24 worst smoothness 0.014158

26 worst concavity 0.018188

27 worst concave points 0.122113

28 worst symmetry 0.011885

このグラフと以下の決定木を比べ、重要度が大きい順に調べてみると以下のようなことがわかる。

worst radius(0.726)、第0層の不純度が高いノードを2つのクラスに分割し、分割後のノードの純度が高い
worst concave points(0.122)、第1層の左側のノード、259個のクラス1のデータの大部分を左側の子ノードに切り出しつつ、クラス0のデータも25を4と21と切り分けている
texture error(0.048)、第1層の右側のノード、クラス0のデータを完全に右の子ノードに切り分け、左の子ノードは不純度はゼロ
worst textureは(0.045)、第2層の左から2番目のノード、データ数は少ないが、左の子ノードにクラス1のデータを9/11、右の子ノードにクラス0のデータを18/21と子ノードの純度が高い
radius error(0.010)、第2層の左のノード、クラス1のデータを完全に左の子ノードに切り分け、右の子ノードの不純度はゼロ

特徴量重要度の計算方法や上記の特徴から、その性質は以下のように整理できる。

重要度は0～1の間の値をとる
浅いノードの対象特徴量ほど重要度が高い傾向（ただし、データの分布による可能性あり）
分割後の子ノードの純度が高いほど重要度が高い傾向
重要度の値は、どのクラスの切り分けに効いているかとは無関係

特徴量重要度の大きさは、枝を分離するときに重要な特徴量を示唆するが、その特徴量の大小とクラス分類の関係までは知ることができない。

線形モデルの特徴量係数との対比

特徴量重要度をLogistic回帰における特徴量の係数と比較してみる。以下はL2正則化によるLogistic回帰モデルをbreast_cancerデータに適用した場合の特徴量係数。

worst radiusについては、決定木での重要度が最も高いが、Logistic回帰でも比較的特徴量の重みは大きい。Logistic回帰の場合はこの特徴量がターゲット1（malignant：良性）であることを示唆しているが、決定木の場合にはそのような情報は得られない。

単調でないクラス分類

以下の例では、2つのクラス分類の境界が必ずしも1つではない。

Logistic回帰モデルの場合、一つの直線で分離しようとした結果、境界は赤い線のようになり、スコアも0.66625とかなり低い。

決定木はこのような場合でも分類可能だが、そもそもこのようなケースでは（それが決定木の性質ではなく本質的に）決定境界に対する大小だけでクラス分類を論ずることができない。

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
import graphviz
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.linear_model import LogisticRegression

n0 = 40
n1a = 20
n1b = 20

rnd.seed(0)
f0_0 = rnd.rand(n0) * 4 + 3
f1_0 = rnd.rand(n0) * 1 + 4
f0_1a = rnd.rand(n1a) * 2 + 2
f1_1a = rnd.rand(n1a) * 1 + 2
f0_1b = rnd.rand(n1b) * 2 + 6
f1_1b = rnd.rand(n1b) * 1 + 6

X0 = np.hstack((f0_0.reshape(-1, 1), f1_0.reshape(-1, 1)))
X1a = np.hstack((f0_1a.reshape(-1, 1), f1_1a.reshape(-1, 1)))
X1b = np.hstack((f0_1b.reshape(-1, 1), f1_1b.reshape(-1, 1)))
X = np.vstack((X0, X1a, X1b))

y0 = np.zeros(n0)
y1a = np.ones(n1a)
y1b = np.ones(n1b)
y = np.hstack((y0, y1a, y1b))

clf = DecisionTreeClassifier().fit(X, y)
logreg = LogisticRegression().fit(X, y)
print(logreg.score(X, y))

dot_data= export_graphviz(clf, out_file=None,
    feature_names=["f0", "f1"], class_names=["C0", "C1"], filled=True)
graph = graphviz.Source(dot_data)
graph.render("image", view=True)

f0_min, f0_max = 1, 10
f1_min, f1_max = 1, 8

b = logreg.intercept_
w = logreg.coef_.reshape(logreg.coef_.size)
border_left = (-b - w[0] * f0_min) / w[1]
border_right = (-b - w[0] * f0_max) / w[1]

fig, ax = plt.subplots()
ax.scatter(X[y==0][:, 0], X[y==0][:, 1])
ax.scatter(X[y==1][:, 0], X[y==1][:, 1])
th1 = clf.tree_.threshold[0]
th2 = clf.tree_.threshold[2]
ax.plot([f0_min, f0_max], [th1, th1], c='tab:green', linestyle='dashed')
ax.plot([f0_min, f0_max], [th2, th2], c='tab:green', linestyle='dashed')
ax.plot([f0_min, f0_max], [border_left, border_right],
    c='tab:red', linestyle='dashed')
ax.set_xlim(f0_min, f0_max)
ax.set_ylim(f1_min, f1_max)
plt.show()

import numpy as np

import numpy.random as rnd

import matplotlib.pyplot as plt

import graphviz

from sklearn.tree import DecisionTreeClassifier, export_graphviz

from sklearn.linear_model import LogisticRegression

n0 = 40

n1a = 20

n1b = 20

rnd.seed(0)

f0_0 = rnd.rand(n0) * 4 + 3

f1_0 = rnd.rand(n0) * 1 + 4

f0_1a = rnd.rand(n1a) * 2 + 2

f1_1a = rnd.rand(n1a) * 1 + 2

f0_1b = rnd.rand(n1b) * 2 + 6

f1_1b = rnd.rand(n1b) * 1 + 6

X0 = np.hstack((f0_0.reshape(-1, 1), f1_0.reshape(-1, 1)))

X1a = np.hstack((f0_1a.reshape(-1, 1), f1_1a.reshape(-1, 1)))

X1b = np.hstack((f0_1b.reshape(-1, 1), f1_1b.reshape(-1, 1)))

X = np.vstack((X0, X1a, X1b))

y0 = np.zeros(n0)

y1a = np.ones(n1a)

y1b = np.ones(n1b)

y = np.hstack((y0, y1a, y1b))

clf = DecisionTreeClassifier().fit(X, y)

logreg = LogisticRegression().fit(X, y)

print(logreg.score(X, y))

dot_data= export_graphviz(clf, out_file=None,

feature_names=["f0", "f1"], class_names=["C0", "C1"], filled=True)

graph = graphviz.Source(dot_data)

graph.render("image", view=True)

f0_min, f0_max = 1, 10

f1_min, f1_max = 1, 8

b = logreg.intercept_

w = logreg.coef_.reshape(logreg.coef_.size)

border_left = (-b - w[0] * f0_min) / w[1]

border_right = (-b - w[0] * f0_max) / w[1]

fig, ax = plt.subplots()

ax.scatter(X[y==0][:, 0], X[y==0][:, 1])

ax.scatter(X[y==1][:, 0], X[y==1][:, 1])

th1 = clf.tree_.threshold[0]

th2 = clf.tree_.threshold[2]

ax.plot([f0_min, f0_max], [th1, th1], c='tab:green', linestyle='dashed')

ax.plot([f0_min, f0_max], [th2, th2], c='tab:green', linestyle='dashed')

ax.plot([f0_min, f0_max], [border_left, border_right],

c='tab:red', linestyle='dashed')

ax.set_xlim(f0_min, f0_max)

ax.set_ylim(f1_min, f1_max)

plt.show()

max_features～特徴量選択

DecisionTreeClassifierのコンストラクターのパラメーターの1つ、max_featuresについて。このパラメーターがデフォルトのNoneの場合やautoを指定した場合、n_featuresすなわちすべての特徴量が比較され、最も分離の性能がいいものが選ばれる。一方、このパラメーターに整数を指定すると、分離の際にランダムにその数だけ特徴量が選ばれ、その中で分離の性能がいいものが選ばれる。max_features=1とすると、ランダムに選ばれた特徴量が、その分離性能に関わらず用いられる。

以下は、make_moonsで生成された特徴量数2のデータセットについて、max_featuresを1、2と変えた時の実行結果。max_features=1の場合は、各深さにおいて特徴量1、2がランダムに選ばれる。これは必ずしも最適な分割とならないため、ノイズのように分割が滑らかになっていない。

決定木の特徴量重要度の計算方法

2020-06-05 / tau / コメントする

概要

scikit-learnのDecisionTreeClassificationモデルにfeature_importances_というパラメーターがある。このパラメーターは1次元配列で、特徴量番号に対する重要度が実数で格納されている。

このfeature_importances_について、公式ドキュメントでは以下のように書かれている。

The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

～特徴量の重要度は、対象とする特徴量から得られた基準値の減少分の（正規化された）合計値。ジニ重要度としても知られている。～

と書かれているが、ちょっと曖昧で定義がはっきりしない。ジニ重要度というのは日本語サイトではなかなかヒットしないが、英語では結構取り上げられている。たとえばこちらのサイトでは以下のように引用説明されている。

It is sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.

これを読むと、それはジニ重要度／平均不純度減少量と呼ばれ、ノードの不純度の減少分の重み付き和（重みはそのノードにたどり着いたサンプル数の比率）を決定木全体にわたって平均した値、となる。

定式化

あるノードの不純度をI(t_P)、その左右の子ノードの不純度をI(t_L), (t_R)とし、それぞれのノードのサンプル数をn_P, n_L, n_Rとする(n_P = n_L + n_R)。このとき、ノードt_Pの不純度の減少分の重み付き和は以下のようになる。

(1) $\begin{equation*} \Delta I(t_P) = \frac{n_p}{N} I(t_P) - \frac{n_L}{N} I(t_L) - \frac{n_R}{N} I(t_R) \end{equation*}$

ここでNは全サンプル数。この値を決定木全体にわたって平均したものが特徴量重要度となるので、これをM(t_P)とすると、以下のようになる。

(2) $\begin{equation*} M(t_P) = \frac{\Delta I(t_P)}{\displaystyle \sum_P^{all~nodes} \Delta I(t_P)} \end{equation*}$

なお分母分子でNが共通なので、式(1)においてNで割らずに計算しても結果は同じになる。

以上から、特徴量重要度の計算は以下の手順となる。

決定木の葉ノードを除く各ノードについて以下を計算
1. ノードの不純度とサンプル数を掛けた値(wI)を計算
2. ノードのwIから、左右の子ノードのwIを減じた値gを計算
決定木全体のgの合計でこれを除した値を、そのノードの分割基準となった特徴量の特徴量重要度とする

feature_importances_の内容

この流れを、breast_cancerデータセットに対して以下のコードで確認してみる。パラメーターの設定はO’Reillyの”Pythonではじめる機械学習”の例に合わせていて、深さ4で事前刈込をしている。

import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

ds = load_breast_cancer()
X_train, X_test, y_train, y_test =\
    train_test_split(ds.data, ds.target, stratify=ds.target, random_state=42)

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train, y_train)

print(clf.feature_importances_)

df = DataFrame(
    {'feature':ds.feature_names, 'importance':clf.feature_importances_})
print(df)

fig, ax = plt.subplots()
fig.subplots_adjust(left=0.3)
ax.barh(df.feature, df.importance)
plt.show()

import matplotlib.pyplot as plt

from pandas import DataFrame

from sklearn.model_selection import train_test_split

from sklearn.datasets import load_breast_cancer

from sklearn.tree import DecisionTreeClassifier

ds = load_breast_cancer()

X_train, X_test, y_train, y_test =\

train_test_split(ds.data, ds.target, stratify=ds.target, random_state=42)

clf = DecisionTreeClassifier(max_depth=4, random_state=0)

clf.fit(X_train, y_train)

print(clf.feature_importances_)

df = DataFrame(

{'feature':ds.feature_names, 'importance':clf.feature_importances_})

print(df)

fig, ax = plt.subplots()

fig.subplots_adjust(left=0.3)

ax.barh(df.feature, df.importance)

plt.show()

まず、この決定木のfeature_importances_パラメーターそのものの内容は以下の通り。すべての値を合計すると1.0となる。

[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.01019737 0.04839825
 0.         0.         0.0024156  0.         0.         0.
 0.         0.         0.72682851 0.0458159  0.         0.
 0.0141577  0.         0.018188   0.1221132  0.01188548 0.        ]

[0. 0. 0. 0. 0. 0.

0. 0. 0. 0. 0.01019737 0.04839825

0. 0. 0.0024156 0. 0. 0.

0. 0. 0.72682851 0.0458159 0. 0.

0.0141577 0. 0.018188 0.1221132 0.01188548 0. ]

また、pandas.DataFrameで特徴量名と併せて表示すると以下の通り。

                    feature  importance
0               mean radius    0.000000
1              mean texture    0.000000
2            mean perimeter    0.000000
3                 mean area    0.000000
4           mean smoothness    0.000000
5          mean compactness    0.000000
6            mean concavity    0.000000
7       mean concave points    0.000000
8             mean symmetry    0.000000
9    mean fractal dimension    0.000000
10             radius error    0.010197
11            texture error    0.048398
12          perimeter error    0.000000
13               area error    0.000000
14         smoothness error    0.002416
15        compactness error    0.000000
16          concavity error    0.000000
17     concave points error    0.000000
18           symmetry error    0.000000
19  fractal dimension error    0.000000
20             worst radius    0.726829
21            worst texture    0.045816
22          worst perimeter    0.000000
23               worst area    0.000000
24         worst smoothness    0.014158
25        worst compactness    0.000000
26          worst concavity    0.018188
27     worst concave points    0.122113
28           worst symmetry    0.011885
29  worst fractal dimension    0.000000

feature importance

0 mean radius 0.000000

1 mean texture 0.000000

2 mean perimeter 0.000000

3 mean area 0.000000

4 mean smoothness 0.000000

5 mean compactness 0.000000

6 mean concavity 0.000000

7 mean concave points 0.000000

8 mean symmetry 0.000000

9 mean fractal dimension 0.000000

10 radius error 0.010197

11 texture error 0.048398

12 perimeter error 0.000000

13 area error 0.000000

14 smoothness error 0.002416

15 compactness error 0.000000

16 concavity error 0.000000

17 concave points error 0.000000

18 symmetry error 0.000000

19 fractal dimension error 0.000000

20 worst radius 0.726829

21 worst texture 0.045816

22 worst perimeter 0.000000

23 worst area 0.000000

24 worst smoothness 0.014158

25 worst compactness 0.000000

26 worst concavity 0.018188

27 worst concave points 0.122113

28 worst symmetry 0.011885

29 worst fractal dimension 0.000000

特徴量重要度の計算過程

特徴量重要度の計算過程を視覚的に追ってみる。まず深さ4までの決定木をgraphvizで視覚化すると以下の通り。

この決定木の葉以外のノードについて、gini不純度とサンプル数を掛け合わせた値をwIとして決定木を描きなおすと以下のとおり。また、gは着目するノードとその左右の子ノードのwIの差で、総サンプル数で無次元化しない重みによる情報利得と等価。

(3) $\begin{equation*} g = n_p I(t_P) - n_L I(t_L) - n_R I(t_R) \end{equation*}$

まずmax_depth=1のとき、worst radiusについてg = 138.6130となり、特徴量重要度はこの1つに対して1となる。

次にmax_depth=2とすると、3つのノードについて特徴量とgの値、重要度は以下の通り。

特徴量	g	重要度
worst radius	138.6130	0.809982
wors concave points	23.28798	0.136083
texture error	9.229907	0.053935
	171.13089	1

実際の計算結果は以下の通りで符合している。

                    feature  importance
0               mean radius    0.000000
1              mean texture    0.000000
2            mean perimeter    0.000000
3                 mean area    0.000000
4           mean smoothness    0.000000
5          mean compactness    0.000000
6            mean concavity    0.000000
7       mean concave points    0.000000
8             mean symmetry    0.000000
9    mean fractal dimension    0.000000
10             radius error    0.000000
11            texture error    0.053935
12          perimeter error    0.000000
13               area error    0.000000
14         smoothness error    0.000000
15        compactness error    0.000000
16          concavity error    0.000000
17     concave points error    0.000000
18           symmetry error    0.000000
19  fractal dimension error    0.000000
20             worst radius    0.809981
21            worst texture    0.000000
22          worst perimeter    0.000000
23               worst area    0.000000
24         worst smoothness    0.000000
25        worst compactness    0.000000
26          worst concavity    0.000000
27     worst concave points    0.136084
28           worst symmetry    0.000000
29  worst fractal dimension    0.000000

feature importance

0 mean radius 0.000000

1 mean texture 0.000000

2 mean perimeter 0.000000

3 mean area 0.000000

4 mean smoothness 0.000000

5 mean compactness 0.000000

6 mean concavity 0.000000

7 mean concave points 0.000000

8 mean symmetry 0.000000

9 mean fractal dimension 0.000000

10 radius error 0.000000

11 texture error 0.053935

12 perimeter error 0.000000

13 area error 0.000000

14 smoothness error 0.000000

15 compactness error 0.000000

16 concavity error 0.000000

17 concave points error 0.000000

18 symmetry error 0.000000

19 fractal dimension error 0.000000

20 worst radius 0.809981

21 worst texture 0.000000

22 worst perimeter 0.000000

23 worst area 0.000000

24 worst smoothness 0.000000

25 worst compactness 0.000000

26 worst concavity 0.000000

27 worst concave points 0.136084

28 worst symmetry 0.000000

29 worst fractal dimension 0.000000

同じように計算していき、max_depth=4の時は以下の通り。ただしここで、worst textureが2回登場していることに注意。1つ目は深さ2の左から2番目、もう1つは深さ3の右から2番目で、それぞれのノードが分割されたときのクラスは異なっている。worst textureの重要度を計算する際には、この2つのgを加えている。

特徴量	g	重要度
worst radius	138.6130	0.726829
worst concave points	23.28798	0.122113
texture error	9.229907	0.048398
radius error	1.944615	0.010197
worst texture	8.737504	0.045816
worst concavity	3.168669	0.018188
smoothness error	0.460808	0.002416
worst smoothness	2.7	0.001416
worst symmetry	2.266668	0.011885
	190.7091	1

実際の計算結果は以下の通りで符合している。

                    feature  importance
0               mean radius    0.000000
1              mean texture    0.000000
2            mean perimeter    0.000000
3                 mean area    0.000000
4           mean smoothness    0.000000
5          mean compactness    0.000000
6            mean concavity    0.000000
7       mean concave points    0.000000
8             mean symmetry    0.000000
9    mean fractal dimension    0.000000
10             radius error    0.010197
11            texture error    0.048398
12          perimeter error    0.000000
13               area error    0.000000
14         smoothness error    0.002416
15        compactness error    0.000000
16          concavity error    0.000000
17     concave points error    0.000000
18           symmetry error    0.000000
19  fractal dimension error    0.000000
20             worst radius    0.726829
21            worst texture    0.045816
22          worst perimeter    0.000000
23               worst area    0.000000
24         worst smoothness    0.014158
25        worst compactness    0.000000
26          worst concavity    0.018188
27     worst concave points    0.122113
28           worst symmetry    0.011885
29  worst fractal dimension    0.000000

feature importance

0 mean radius 0.000000

1 mean texture 0.000000

2 mean perimeter 0.000000

3 mean area 0.000000

4 mean smoothness 0.000000

5 mean compactness 0.000000

6 mean concavity 0.000000

7 mean concave points 0.000000

8 mean symmetry 0.000000

9 mean fractal dimension 0.000000

10 radius error 0.010197

11 texture error 0.048398

12 perimeter error 0.000000

13 area error 0.000000

14 smoothness error 0.002416

15 compactness error 0.000000

16 concavity error 0.000000

17 concave points error 0.000000

18 symmetry error 0.000000

19 fractal dimension error 0.000000

20 worst radius 0.726829

21 worst texture 0.045816

22 worst perimeter 0.000000

23 worst area 0.000000

24 worst smoothness 0.014158

25 worst compactness 0.000000

26 worst concavity 0.018188

27 worst concave points 0.122113

28 worst symmetry 0.011885

29 worst fractal dimension 0.000000

matplotlib.pyplot.barh – 横棒グラフ

2020-06-01 / tau / コメントする

概要

barh()は横棒グラフを描く。主要なパラメーターは以下の通り。
barh(y, width, height, left, align, fc, ec, linewidth, xerr, capsize, log)

y, width, height: yは縦方向の座標で棒グラフのラベルをリスト等で指定するのが一般的。widthは棒の長さでこれもリスト等で指定。heightは棒の太さでデフォルトは0.8だが数値／リスト等で指定可。
align: alignはデフォルトで'center'だが、'edge'を指定すると棒の下側がラベルに合わせられる。上を合わせるにはheightに負の値を指定する。
fc, ec, linewidth: fc/colorは棒の塗りつぶし色、ec/edgecolorは縁の色、linewidthは縁の太さ
xerr, capsize: xerrは誤差の範囲でリスト等で指定。capsizeは誤差範囲の両端の直交線の長さ。
log: log=Trueを指定すると横軸が対数スケールになる。

実行例

基本形

基本的な使い方で、第1引数yに縦軸のラベル、第2引数widthに各棒の長さをそれぞれリストで与える。

import numpy as np
import matplotlib.pyplot as plt

subjects = np.array(["math", "physics", "chemistry", "earth science"])
scores = np.array([80, 95, 45, 65])

fig, ax = plt.subplots()
fig.subplots_adjust(left=0.2)

ax.barh(subjects, scores)

plt.show()

import numpy as np

import matplotlib.pyplot as plt

subjects = np.array(["math", "physics", "chemistry", "earth science"])

scores = np.array([80, 95, 45, 65])

fig, ax = plt.subplots()

fig.subplots_adjust(left=0.2)

ax.barh(subjects, scores)

plt.show()

色・枠線の指定

棒の塗りつぶし色と枠線の色・太さを指定。

ax.barh(subjects, scores, fc='tab:green', ec='k', linewidth=3)

1	ax.barh(subjects, scores, fc='tab:green', ec='k', linewidth=3)

高さ・位置

棒の高さ、開始位置を指定し、ラベルに対して棒の下端を合わせている。

heights = [1, 0.8, 0.6, 0.4]
lefts = [0, 5, 10, 15]
ax.barh(subjects, scores, height=heights, left=lefts, align='edge')

heights = [1, 0.8, 0.6, 0.4]

lefts = [0, 5, 10, 15]

ax.barh(subjects, scores, height=heights, left=lefts, align='edge')

誤差

棒の端に誤差範囲を表示。

ax.barh(subjects, scores, xerr=xerr, capsize=8)

1	ax.barh(subjects, scores, xerr=xerr, capsize=8)

対数軸

横軸を対数軸としている。

import matplotlib.pyplot as plt

names = ["Andrew", "Bob", "Charie", "Dick"]
properties = [200000000, 50000000, 8000000, 1000000]

fig, ax = plt.subplots()

ax.barh(names, properties, log=True)

plt.show()

import matplotlib.pyplot as plt

names = ["Andrew", "Bob", "Charie", "Dick"]

properties = [200000000, 50000000, 8000000, 1000000]

fig, ax = plt.subplots()

ax.barh(names, properties, log=True)

plt.show()

pyplot – グラフエリアが切れる・はみ出る

2020-06-01 / tau / コメントする

グラフのラベルがはみ出てしまう場合がある。

このようなときは、pyplotやfigureに対してsubplots_adjust()でleft、bottomなどの引数でマージンを指定する。

import numpy as np
import matplotlib.pyplot as plt

subjects = np.array(["math", "physics", "chemistry", "earth science"])
scores = np.array([95, 92, 83, 75])

fig, ax = plt.subplots()
fig.subplots_adjust(left=0.2)

ax.barh(subjects, scores)

plt.show()

import numpy as np

import matplotlib.pyplot as plt

subjects = np.array(["math", "physics", "chemistry", "earth science"])

scores = np.array([95, 92, 83, 75])

fig, ax = plt.subplots()

fig.subplots_adjust(left=0.2)

ax.barh(subjects, scores)

plt.show()

DecisionTreeClassifier – Treeオブジェクト・再帰表示など

2020-05-31 / tau / コメントする

概要

Scikit-learnの決定木モデル、DecisionTreeClassifierについていろいろ試した際のコードをストック。

Treeオブジェクト内容確認

DecisionTreeClassifierオブジェクトのプロパティーtree_はデータセットに対して生成された決定木の構造が保存されている。以下はその内容を確認するためのコード。

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

tree = treeclf.tree_

n_nodes = tree.node_count
children_left = tree.children_left
children_right = tree.children_right
features = tree.feature
thresholds = tree.threshold
value = tree.value

print("number of nodes: {}".format(n_nodes))
print("Chlidren(Left) : {}".format(children_left))
print("Chlidren(Right): {}".format(children_right))
print("Features       : {}".format(features))
print("Thresholds     : {}".format(np.round(thresholds, 3)))
print("Values:\n{}".format(value))

print()

for i in range(n_nodes):
    print("Node-{}".format(i), end="")
    print("(Feature[{:2d}]<{:6.3f}):"\
        .format(features[i], thresholds[i]), end="")
    print("LeftNode[{:2d}], RightNode[{:2d}]"\
        .format(children_left[i], children_right[i]))

import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_moons

from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)

treeclf.fit(X, y)

tree = treeclf.tree_

n_nodes = tree.node_count

children_left = tree.children_left

children_right = tree.children_right

features = tree.feature

thresholds = tree.threshold

value = tree.value

print("number of nodes: {}".format(n_nodes))

print("Chlidren(Left) : {}".format(children_left))

print("Chlidren(Right): {}".format(children_right))

print("Features : {}".format(features))

print("Thresholds : {}".format(np.round(thresholds, 3)))

print("Values:\n{}".format(value))

print()

for i in range(n_nodes):

print("Node-{}".format(i), end="")

print("(Feature[{:2d}]<{:6.3f}):"\

.format(features[i], thresholds[i]), end="")

print("LeftNode[{:2d}], RightNode[{:2d}]"\

.format(children_left[i], children_right[i]))

Treeクラスはツリー内の各ノードの情報を1次元の配列でもっていて、子ノードを参照するにはノード番号に対応する配列のインデックスを参照する。Treeクラスが持っている主なプロパティーは以下の通り。

node_count: ツリーが持つ全ノード数。
children_left, children_right: 各ノードの左／右の子ノードの番号を格納した1次元配列。
feature: 各ノードを分割する際に使われる特徴量の番号を格納した1次元配列。
threshold: 各ノードをfeatureで示された特性量で分割する際の閾値を格納した1次元配列。
value: 各ノードにおける、各クラスのデータ数。クラス数分のデータを格納した1次元配列1つだけを要素とする2次元配列を、ノード数分だけ集めた3次元配列。

コードの実行結果は以下の通り。

number of nodes: 7
Chlidren(Left) : [ 1  2 -1 -1  5 -1 -1]
Chlidren(Right): [ 4  3 -1 -1  6 -1 -1]
Features       : [ 1  0 -2 -2  0 -2 -2]
Thresholds     : [ 0.072 -0.643 -2.    -2.     1.536 -2.    -2.   ]
Values:
[[[10. 10.]]

 [[ 1.  9.]]

 [[ 1.  0.]]

 [[ 0.  9.]]

 [[ 9.  1.]]

 [[ 9.  0.]]

 [[ 0.  1.]]]

Node-0(Feature[ 1]< 0.072):LeftNode[ 1], RightNode[ 4]
Node-1(Feature[ 0]<-0.643):LeftNode[ 2], RightNode[ 3]
Node-2(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]
Node-3(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]
Node-4(Feature[ 0]< 1.536):LeftNode[ 5], RightNode[ 6]
Node-5(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]
Node-6(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]

number of nodes: 7

Chlidren(Left) : [ 1 2 -1 -1 5 -1 -1]

Chlidren(Right): [ 4 3 -1 -1 6 -1 -1]

Features : [ 1 0 -2 -2 0 -2 -2]

Thresholds : [ 0.072 -0.643 -2. -2. 1.536 -2. -2. ]

Values:

[[[10. 10.]]

[[ 1. 9.]]

[[ 1. 0.]]

[[ 0. 9.]]

[[ 9. 1.]]

[[ 9. 0.]]

[[ 0. 1.]]]

Node-0(Feature[ 1]< 0.072):LeftNode[ 1], RightNode[ 4]

Node-1(Feature[ 0]<-0.643):LeftNode[ 2], RightNode[ 3]

Node-2(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]

Node-3(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]

Node-4(Feature[ 0]< 1.536):LeftNode[ 5], RightNode[ 6]

Node-5(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]

Node-6(Feature[-2]<-2.000):LeftNode[-1], RightNode[-1]

親ノードと子ノードの関係は、たとえばノード0の左右の子ノードはchildren_leftとchildren_rightの0番目の要素からノード1とノード4、ノード1の左右の子ノードはノード2とノード3、という風に追っていくことができる。

valueがややこしい。この配列は各ノードにおけるクラスごとのデータ数を格納している。全体配列の中にこのケースだとノード数に対応する7個の配列が要素として格納されているが、その配列が2次元配列になっていて、その要素の配列がクラスごとのデータを格納した配列になっている。例えば3番目の要素のクラス1の要素を取り出す場合にはvalue[3, 0, 1]と言う風に指定することになる。

Treeのコンソール表示

Treeオブジェクトのツリー構造を確認し、決定境界の描画などの準備とするために書いたコード。決定木の構造をコンソールに表示させる。2つの再帰関数を定義していて、本体は決定木学習後にそれらの関数を呼び出すのみ。

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

def print_node1(tree, i_node, n_level=0):
    print("{}{:2d}-feature:{:2d}"\
        .format("             " * n_level, i_node, tree.feature[i_node]))
    if tree.children_left[i_node] == -1:
        return
    print_node1(
        tree=tree, i_node=tree.children_left[i_node], n_level=n_level+1)
    print_node1(
        tree=tree, i_node=tree.children_right[i_node], n_level=n_level+1)

def print_node2(tree, i_node, n_level=0):
    if tree.children_left[i_node] == -1:
        print("{}{:2d}-feature:{:2d}"\
            .format("             " * n_level, i_node, tree.feature[i_node]))
        return
    print_node2(
        tree=tree, i_node=tree.children_left[i_node], n_level=n_level+1)
    print("{}{:2d}-feature:{:2d}"\
        .format("             " * n_level, i_node, tree.feature[i_node]))
    print_node2(
        tree=tree, i_node=tree.children_right[i_node], n_level=n_level+1)

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

tree = treeclf.tree_

print_node1(tree=tree, i_node=0)
print("-"*40)
print_node2(tree=tree, i_node=0)

from sklearn.datasets import make_moons

from sklearn.tree import DecisionTreeClassifier

def print_node1(tree, i_node, n_level=0):

print("{}{:2d}-feature:{:2d}"\

.format(" " * n_level, i_node, tree.feature[i_node]))

if tree.children_left[i_node] == -1:

return

print_node1(

tree=tree, i_node=tree.children_left[i_node], n_level=n_level+1)

print_node1(

tree=tree, i_node=tree.children_right[i_node], n_level=n_level+1)

def print_node2(tree, i_node, n_level=0):

if tree.children_left[i_node] == -1:

print("{}{:2d}-feature:{:2d}"\

.format(" " * n_level, i_node, tree.feature[i_node]))

return

print_node2(

tree=tree, i_node=tree.children_left[i_node], n_level=n_level+1)

print("{}{:2d}-feature:{:2d}"\

.format(" " * n_level, i_node, tree.feature[i_node]))

print_node2(

tree=tree, i_node=tree.children_right[i_node], n_level=n_level+1)

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)

treeclf.fit(X, y)

tree = treeclf.tree_

print_node1(tree=tree, i_node=0)

print("-"*40)

print_node2(tree=tree, i_node=0)

関数print_node1()は、ツリー構造をルートノードから階層が下がるごとに段下げして表示していく。このため、まず親ノードを表示してから左右の子ノードを引数として再帰呼び出しをしている。

終了条件はノードが子ノードを持たない葉(leaf)であることを利用するが、リーフの時のパラメータは以下の通りで、ここでは左子ノードの番号が−1となることを利用している。

子ノードの番号が−1
特性量の番号が−2
特性量の閾値が−2.0

関数print_node2は、決定木の構造を枝分かれした木の形で表示する。左側のノードから右側に移るのを、コンソール上で上から下に表示していく。手順としては、

リーフノードならノードの内容を出力してリターン
リーフノードでなければ、
1. 左子ノードの処理を呼び出す
2. それが戻ってきたら（左側の全子孫ノードが出力されたら）自身の内容を出力
3. 右子ノードの処理を呼び出す
4. それが戻ってきたら（右側の全子孫ノードが出力されたら）リターン

引数に現在のノードの階層を保持する変数があり、その階層に応じた数のスペースでインデントすることで木の構造を表す。

出力は以下の通り。

 0-feature: 1
              1-feature: 0
                           2-feature:-2
                           3-feature:-2
              4-feature: 0
                           5-feature:-2
                           6-feature:-2
----------------------------------------
                           2-feature:-2
              1-feature: 0
                           3-feature:-2
 0-feature: 1
                           5-feature:-2
              4-feature: 0
                           6-feature:-2

0-feature: 1

1-feature: 0

2-feature:-2

3-feature:-2

4-feature: 0

5-feature:-2

6-feature:-2

----------------------------------------

2-feature:-2

1-feature: 0

3-feature:-2

0-feature: 1

5-feature:-2

4-feature: 0

6-feature:-2

決定木の構築過程の表示

make_monns()による2特性量のデータについて、順次ノードを分割する過程を図で描画するためのコード。

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patch
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier


def draw_tree_boundary(tree, ax, left, right, bottom, top,
        i_node=0, stop_level=None, n_level=0):

    if tree.children_left[i_node] == -1 or stop_level == n_level:
        fc =\
            'tab:orange' if np.argmax(tree.value[i_node][0])==0 else 'tab:blue'
        rect = patch.Rectangle(xy=(left, bottom),
            width=right-left, height=top-bottom, fc=fc, alpha=0.2)
        ax.add_patch(rect)
        return

    if tree.feature[i_node] == 0:
        f0 = tree.threshold[i_node]
        ax.plot([f0, f0], [top, bottom])
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=f0, top=top, bottom=bottom,
            i_node=tree.children_left[i_node],
            stop_level=stop_level, n_level=n_level+1,)
        draw_tree_boundary(tree=tree, ax=ax,
            left=f0, right=right, top=top, bottom=bottom,
            i_node=tree.children_right[i_node],
            stop_level=stop_level, n_level=n_level+1)
    else:
        f1 = tree.threshold[i_node]
        ax.plot([left, right], [f1, f1])
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=right, top=f1, bottom=bottom,
            i_node=tree.children_left[i_node],
            stop_level=stop_level, n_level=n_level+1)
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=right, top=top, bottom=f1,
            i_node=tree.children_right[i_node],
            stop_level=stop_level, n_level=n_level+1)


X, y = make_moons(n_samples=20, noise=0.25, random_state=5)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

tree = treeclf.tree_

fig, ax = plt.subplots()

ax.scatter(X[y==0][:, 0], X[y==0][:, 1],
    ec='k', s=60, marker='o', label="Class 0")
ax.scatter(X[y==1][:, 0], X[y==1][:, 1],
    ec='k', s=60, marker='^', label="Class 1")

x0_min, x0_max = -2, 2.5
x1_min, x1_max = -1, 1.5

draw_tree_boundary(tree=tree, i_node=0, ax=ax,
    left=x0_min, right=x0_max, bottom=x1_min, top=x1_max, stop_level=None)

ax.set_xlim(x0_min, x0_max)
ax.set_ylim(x1_min, x1_max)
ax.legend()

plt.show()

import numpy as np

import matplotlib.pyplot as plt

import matplotlib.patches as patch

from sklearn.datasets import make_moons

from sklearn.tree import DecisionTreeClassifier

def draw_tree_boundary(tree, ax, left, right, bottom, top,

i_node=0, stop_level=None, n_level=0):

if tree.children_left[i_node] == -1 or stop_level == n_level:

fc =\

'tab:orange' if np.argmax(tree.value[i_node][0])==0 else 'tab:blue'

rect = patch.Rectangle(xy=(left, bottom),

width=right-left, height=top-bottom, fc=fc, alpha=0.2)

ax.add_patch(rect)

return

if tree.feature[i_node] == 0:

f0 = tree.threshold[i_node]

ax.plot([f0, f0], [top, bottom])

draw_tree_boundary(tree=tree, ax=ax,

left=left, right=f0, top=top, bottom=bottom,

i_node=tree.children_left[i_node],

stop_level=stop_level, n_level=n_level+1,)

draw_tree_boundary(tree=tree, ax=ax,

left=f0, right=right, top=top, bottom=bottom,

i_node=tree.children_right[i_node],

stop_level=stop_level, n_level=n_level+1)

else:

f1 = tree.threshold[i_node]

ax.plot([left, right], [f1, f1])

draw_tree_boundary(tree=tree, ax=ax,

left=left, right=right, top=f1, bottom=bottom,

i_node=tree.children_left[i_node],

stop_level=stop_level, n_level=n_level+1)

draw_tree_boundary(tree=tree, ax=ax,

left=left, right=right, top=top, bottom=f1,

i_node=tree.children_right[i_node],

stop_level=stop_level, n_level=n_level+1)

X, y = make_moons(n_samples=20, noise=0.25, random_state=5)

treeclf = DecisionTreeClassifier(random_state=0)

treeclf.fit(X, y)

tree = treeclf.tree_

fig, ax = plt.subplots()

ax.scatter(X[y==0][:, 0], X[y==0][:, 1],

ec='k', s=60, marker='o', label="Class 0")

ax.scatter(X[y==1][:, 0], X[y==1][:, 1],

ec='k', s=60, marker='^', label="Class 1")

x0_min, x0_max = -2, 2.5

x1_min, x1_max = -1, 1.5

draw_tree_boundary(tree=tree, i_node=0, ax=ax,

left=x0_min, right=x0_max, bottom=x1_min, top=x1_max, stop_level=None)

ax.set_xlim(x0_min, x0_max)

ax.set_ylim(x1_min, x1_max)

ax.legend()

plt.show()

draw_tree_boundary()関数は再帰関数で、もしそのノードがリーフノードか指定された終了階層の場合はクラスに応じた色で領域を塗りつぶす。リーフノードでなければ、閾値が特性量0の場合と1の場合で境界線の縦横や開始終了位置を変化させて再帰的に関数を呼び出す。引数stop_levelに正の整数を指定することで、その階層までの描画に留めることができる。関数の内容についてはこちらを参照。

本体はデータをクラスごとの色で散布図として描き、ルートノードについてdraw_tree_boundary()を呼び出している。

以下は、実行例。

以下は、stop_levelを順次増やしていって、領域が分割される過程を描いた例。

決定木のツリー表示

DecisionTreeClassificationオブジェクトを可視化する環境によって、決定木を表示する例。

環境構築
1. Pythonでpydotplusパッケージを導入
2. Graphviz環境を構築
実行
1. sklearn.tree.export_graphviz()で決定木のdotデータを得る
2. pydotplus.graph_from_dot_data()でDotオブジェクトを生成
3. write_png()などのメソッドでグラフを画像として書き出す

import numpy as np
import pydotplus as pdp
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=5)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

dot_data = export_graphviz(treeclf, max_depth=None,
    feature_names=["feature-0", "feature-1"],
    class_names=["class-0", "class-1"])
graph = pdp.graph_from_dot_data(dot_data)
graph.write_png("tree.png")

# C:...\atom\app-1.47.0

import numpy as np

import pydotplus as pdp

from sklearn.datasets import make_moons

from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=5)

treeclf = DecisionTreeClassifier(random_state=0)

treeclf.fit(X, y)

dot_data = export_graphviz(treeclf, max_depth=None,

feature_names=["feature-0", "feature-1"],

class_names=["class-0", "class-1"])

graph = pdp.graph_from_dot_data(dot_data)

graph.write_png("tree.png")

# C:...\atom\app-1.47.0

このコードはAtom上でコードを実行したため、Atomのディレクトリーに画像ファイルが書き出される。

決定木の分割の考え方

2020-05-30 / tau / コメントする

決定木の分割の考え方

決定木のデータを特性量によって分割するには、分割後のノードの状態ができるだけうまく分かれていることが必要となる。この「うまく分かれている」状態は、言い換えれば分割後のノード内のデータができるだけ「揃っている」ともいえる。たとえば2クラスの分類をする場合、ノード内に1つのクラスしか含まれていない場合は最も「純度が高い」状態であり、2つのクラスのデータが半分ずつ含まれている場合に最も「純度が低い」状態となる。

このような状態を定量的に表すのにエントロピーとジニ不純度という2つの考え方があるが、ここではそれらを確認する。

エントロピー（平均情報量）とジニ不純度

定義

クラスc = 1~Cに属するデータがノードtに属しており、各クラスごとのデータ数をn_c、データの総数をNとする。この場合、このノードの純度／不純度を表すのに、エントロピー（entropy、平均情報量）とジニ不純度（Gini impurity）という2つの考え方がある。ノードtのエントロピーをI_H(t)、ジニ不純度をI_G(t)と表すと、それぞれの定義は以下の通り。

(1) $\begin{align*} I_H(t) &= - \sum_{c=1}^C p_c(t) \log p_c(t) = - \sum_{c=1}^C \frac{n_c}{N} \log \frac{n_c}{N} \\ I_G(t) &= 1 - \sum_{c=1}^C p_c^2 = 1 - \sum_{c=1}^C \left( \frac{n_c}{N} \right)^2 \end{align*}$

エントロピーの対数の底は何でもいいが、分類するクラス数にすると最も高いエントロピーが1になって都合がよいようだ）。

ジニ不純度については次の表現の方が直感的にわかりやすい。

(2) $\begin{equation*} I_G(t) = \sum_{c=1}^C p_c (1 - p_c) \end{equation*}$

これを展開すると先のI_Gと同じ形になるが、この形だと関数形が上に凸でp_c = 0, 1でI_G = 0となることがわかる。

計算例

クラス数Cのデータについて、あるノード内のデータが全て同じクラスの場合、純度が高い／不純度が低い。

(3) $\begin{align*} I_H(t) &= - \frac{N}{N} \log_C \frac{N}{N} = 0 \\ I_G(t) &= 1 - \left( \frac{N}{N} \right)^2 = 0 \end{align*}$

ノード内で全てのクラスのデータが同じ数ずつある場合、純度が低い／不純度が高い。

(4) $\begin{align*} I_H(t) &= - C \cdot \frac{N/C}{N} \log_C \frac{N/C}{N} = - C \cdot \frac{1}{C} \log_C \frac{1}{C} = 1 \\ I_G(t) &= 1 - C \left( \frac{N/C}{N} \right)^2 = 1 - \frac{1}{C} \end{align*}$

分布

あるノード内にN個の2クラスデータがあり、クラス1のデータ数をnとする。このとき、クラス1のデータの発生率pに対するエントロピー、ジニ不純度の分布は以下のようになる。

(5) $\begin{align*} I_H(p; t) &= - \frac{n}{N} \log_2 \frac{n}{N} - \frac{N-n}{N} \log_2 \frac{N-n}{N} \\ &= -p \log_2 p - (1-p) \log_2 (1-p) \\ I_G(p; t) &= 1 - \left( \frac{n}{N} \right)^2 - \left( \frac{N-n}{N} \right)^2 \\ &= 1 - p^2 - (1-p)^2 \end{align*}$

これらをグラフ化したのが以下の図。p = 0.5で双方最大値をとり、エントロピーは1 、ジニ不純度は0.5。グラフの形状を比較するため、ジニ不純度を2倍した線も入れている。

import numpy as np
import matplotlib.pyplot as plt

eps = 1e-3
p = np.linspace(eps, 1 - eps)
Ih = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
Ig = 1 - p**2 - (1 - p)**2

fig, ax = plt.subplots()

ax.plot(p, Ih, c='tab:blue', label="Entropy", clip_on=False)
ax.plot(p, Ig, c='tab:orange', label="Gini impurity")
ax.plot(p, Ig*2, c='tab:orange', ls='dashed', label="Gini doubled"
    , clip_on=False)

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xticks(np.linspace(0, 1, 11))
ax.set_yticks(np.linspace(0, 1, 11))
ax.grid(True)
ax.set_aspect('equal')
ax.legend()

plt.show()

import numpy as np

import matplotlib.pyplot as plt

eps = 1e-3

p = np.linspace(eps, 1 - eps)

Ih = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

Ig = 1 - p**2 - (1 - p)**2

fig, ax = plt.subplots()

ax.plot(p, Ih, c='tab:blue', label="Entropy", clip_on=False)

ax.plot(p, Ig, c='tab:orange', label="Gini impurity")

ax.plot(p, Ig*2, c='tab:orange', ls='dashed', label="Gini doubled"

, clip_on=False)

ax.set_xlim(0, 1)

ax.set_ylim(0, 1)

ax.set_xticks(np.linspace(0, 1, 11))

ax.set_yticks(np.linspace(0, 1, 11))

ax.grid(True)

ax.set_aspect('equal')

ax.legend()

plt.show()

ノード分割の考え方～利得

親ノードを子ノードに分割するのに最も妥当な分割とするには、分割後の子ノードの純度ができるだけ高くなるような特徴量を探すことになる。ある特徴量について、親ノードから子ノードに分割したときにどれだけ純度が高くなったかを比較する量として、利得（情報利得、gain）が用いられる。

親ノードt_Pにc = 1~Cのクラスのデータがそれぞれn_cずつあるとする。このときの親ノードのエントロピーI_H(t_P)、ジニ不純度I_G(t_P)は式(1)で計算される。

まずエントロピーについて考える。ある特徴量fを定めると左右のノードのデータ分布が決まり、その時の左ノードt_L、右ノードt_Lのエントロピーが以下のように計算される。

(6) $\begin{align*} & I_H(t_L) = -p_c(t_L) \sum_{c=1}^C \log p_c(t_L) = - \sum_{c=1}^C \frac{n_c(t_L)}{N(t_L)} \log \frac{n_c(t_L)}{N(t_L)} \\ & I_H(t_R) = -p_c(t_R) \sum_{c=1}^C \log p_c(t_R) = - \sum_{c=1}^C \frac{n_c(t_R)}{N(t_R)} \log \frac{n_c(t_R)}{N(t_R)} \end{align*}$

このとき、それぞれのノードのエントロピーにノードの重みw_L, w_Rを掛け、これを親ノードのエントロピーから引いた量を利得（gain、情報利得）という。重みを各ノードのデータ数の比率とすると、利得は以下のように計算される。

(7) $\begin{align*} G_H(t_P, f) &= I_H(t_P) - w_L I_H(t_L) - w_R I_H(t_R) \\ &= I_H(t_P) - \frac{n_L}{N} I_H(t_L) - \frac{n_R}{N} I_H(t_R) \end{align*}$

ジニ不純度についても同様の計算ができる。

(8) $\begin{align*} G_G(t_P, f) &= I_G(t_P) - w_L I_G(t_L) - w_R I_G(t_R) \\ &= I_G(t_P) - \frac{n_L}{N} I_G(t_L) - \frac{n_R}{N} I_G(t_R) \end{align*}$

利得は、ある特徴量の値によって分割した後の状態が、分割前の状態に対してどれだけ純度が高くなったかを表す。

決定木のノードを分割するにあたっては、子ノードの純度ができるだけ高くなるように（エントロピー／ジニ不純度が小さくなるように）fを選ぶことになる。

簡単な例

特徴量が1つでデータ数が少ないケースで、利得の計算を確認してみる。

000111と並んでいる場合

クラス0、1がこのように並んでいるとき、左から境界を動かしていったときの左右のノードの不純度、利得について計算した結果は以下の通り。

IHP=1.0, IGP=0.5
[0][0 0 1 1 1]:
  IHL=-0.000, IHR= 0.971, G_H= 0.191
  IGL= 0.000, IGR= 0.480, G_G= 0.100
[0 0][0 1 1 1]:
  IHL=-0.000, IHR= 0.811, G_H= 0.459
  IGL= 0.000, IGR= 0.375, G_G= 0.250
[0 0 0][1 1 1]:
  IHL=-0.000, IHR= 0.000, G_H= 1.000
  IGL= 0.000, IGR= 0.000, G_G= 0.500
[0 0 0 1][1 1]:
  IHL= 0.811, IHR= 0.000, G_H= 0.459
  IGL= 0.375, IGR= 0.000, G_G= 0.250
[0 0 0 1 1][1]:
  IHL= 0.971, IHR= 0.000, G_H= 0.191
  IGL= 0.480, IGR= 0.000, G_G= 0.100

IHP=1.0, IGP=0.5

[0][0 0 1 1 1]:

IHL=-0.000, IHR= 0.971, G_H= 0.191

IGL= 0.000, IGR= 0.480, G_G= 0.100

[0 0][0 1 1 1]:

IHL=-0.000, IHR= 0.811, G_H= 0.459

IGL= 0.000, IGR= 0.375, G_G= 0.250

[0 0 0][1 1 1]:

IHL=-0.000, IHR= 0.000, G_H= 1.000

IGL= 0.000, IGR= 0.000, G_G= 0.500

[0 0 0 1][1 1]:

IHL= 0.811, IHR= 0.000, G_H= 0.459

IGL= 0.375, IGR= 0.000, G_G= 0.250

[0 0 0 1 1][1]:

IHL= 0.971, IHR= 0.000, G_H= 0.191

IGL= 0.480, IGR= 0.000, G_G= 0.100

この場合、当然のことながら真ん中でノードを分割することで2つのクラスがきれいに分かれ、利得もこれを表している。

00100111と並んでいる場合

今度は一部に他のクラスが紛れ込んでいる場合。左のクラス0の集団に1つだけクラス1のデータが含まれているときの挙動を確認する。

IHP=1.0, IGP=0.5
[0][0 1 0 0 1 1 1]:
  IHL=-0.000, IHR= 0.985, G_H= 0.138
  IGL= 0.000, IGR= 0.490, G_G= 0.071
[0 0][1 0 0 1 1 1]:
  IHL=-0.000, IHR= 0.918, G_H= 0.311
  IGL= 0.000, IGR= 0.444, G_G= 0.167
[0 0 1][0 0 1 1 1]:
  IHL= 0.918, IHR= 0.971, G_H= 0.049
  IGL= 0.444, IGR= 0.480, G_G= 0.033
[0 0 1 0][0 1 1 1]:
  IHL= 0.811, IHR= 0.811, G_H= 0.189
  IGL= 0.375, IGR= 0.375, G_G= 0.125
[0 0 1 0 0][1 1 1]:
  IHL= 0.722, IHR= 0.000, G_H= 0.549
  IGL= 0.320, IGR= 0.000, G_G= 0.300
[0 0 1 0 0 1][1 1]:
  IHL= 0.918, IHR= 0.000, G_H= 0.311
  IGL= 0.444, IGR= 0.000, G_G= 0.167
[0 0 1 0 0 1 1][1]:
  IHL= 0.985, IHR= 0.000, G_H= 0.138
  IGL= 0.490, IGR= 0.000, G_G= 0.071

IHP=1.0, IGP=0.5

[0][0 1 0 0 1 1 1]:

IHL=-0.000, IHR= 0.985, G_H= 0.138

IGL= 0.000, IGR= 0.490, G_G= 0.071

[0 0][1 0 0 1 1 1]:

IHL=-0.000, IHR= 0.918, G_H= 0.311

IGL= 0.000, IGR= 0.444, G_G= 0.167

[0 0 1][0 0 1 1 1]:

IHL= 0.918, IHR= 0.971, G_H= 0.049

IGL= 0.444, IGR= 0.480, G_G= 0.033

[0 0 1 0][0 1 1 1]:

IHL= 0.811, IHR= 0.811, G_H= 0.189

IGL= 0.375, IGR= 0.375, G_G= 0.125

[0 0 1 0 0][1 1 1]:

IHL= 0.722, IHR= 0.000, G_H= 0.549

IGL= 0.320, IGR= 0.000, G_G= 0.300

[0 0 1 0 0 1][1 1]:

IHL= 0.918, IHR= 0.000, G_H= 0.311

IGL= 0.444, IGR= 0.000, G_G= 0.167

[0 0 1 0 0 1 1][1]:

IHL= 0.985, IHR= 0.000, G_H= 0.138

IGL= 0.490, IGR= 0.000, G_G= 0.071

左から1つ目2つ目と境界を動かしていくと少しずつ利得が上昇するが、左側のノードにクラス1のデータが入ってきたところでその不純度が跳ね上がり、利得が下がる（8～10行目）。その後再び利得は上昇し、右側のデータがクラス1のみ3つとなった時に利得が最大となっている。このとき左側のノードにクラス1のデータが1つ含まれているが、他の4つのデータがクラスゼロと多いため、不純度は比較的低い。

利得が最大となる時でも、完全にクラスが分かれた時に比べて半分近くの利得だが、これはデータ数の多い左側に異なるクラスのデータが含まれているからと考えられる。

ここまでの計算に使ったコードは以下の通り。

import numpy as np

def impurity(node):
    N = node.size
    n0 = node[node==0].size
    n1 = node[node==1].size
    p0, p1 = n0 / N, n1 / N
    entropy = - (p0 * np.log2(p0) if p0 > 0 else 0) \
              - (p1 * np.log2(p1) if p1 > 0 else 0)
    gini = 1 - p0*p0 - p1*p1

    return entropy, gini

#node = np.array([0, 0, 0, 1, 1, 1])
node = np.array([0, 0, 1, 0, 0, 1, 1, 1])

IHP, IGP = impurity(node)
print("IHP={}, IGP={}".format(IHP, IGP))
for pos in range(1, node.size):
    node_left = node[0:pos]
    node_right = node[pos:node.size]
    node_text = "{}{}".format(node_left, node_right)
    IHL, IGL = impurity(node_left)
    IHR, IGR = impurity(node_right)
    wL = node_left.size / node.size
    wR = node_right.size / node.size
    GH = IHP - wL*IHL - wR*IHR
    GG = IGP - wL*IGL - wR*IGR
    print("{}:".format(node_text))
    print("  IHL={:6.3f}, IHR={:6.3f}, G_H={:6.3f}".format(IHL, IHR, GH))
    print("  IGL={:6.3f}, IGR={:6.3f}, G_G={:6.3f}".format(IGL, IGR, GG))

import numpy as np

def impurity(node):

N = node.size

n0 = node[node==0].size

n1 = node[node==1].size

p0, p1 = n0 / N, n1 / N

entropy = - (p0 * np.log2(p0) if p0 > 0 else 0) \

- (p1 * np.log2(p1) if p1 > 0 else 0)

gini = 1 - p0*p0 - p1*p1

return entropy, gini

#node = np.array([0, 0, 0, 1, 1, 1])

node = np.array([0, 0, 1, 0, 0, 1, 1, 1])

IHP, IGP = impurity(node)

print("IHP={}, IGP={}".format(IHP, IGP))

for pos in range(1, node.size):

node_left = node[0:pos]

node_right = node[pos:node.size]

node_text = "{}{}".format(node_left, node_right)

IHL, IGL = impurity(node_left)

IHR, IGR = impurity(node_right)

wL = node_left.size / node.size

wR = node_right.size / node.size

GH = IHP - wL*IHL - wR*IHR

GG = IGP - wL*IGL - wR*IGR

print("{}:".format(node_text))

print(" IHL={:6.3f}, IHR={:6.3f}, G_H={:6.3f}".format(IHL, IHR, GH))

print(" IGL={:6.3f}, IGR={:6.3f}, G_G={:6.3f}".format(IGL, IGR, GG))

pyplot – グラフの端が枠線で切れる

2020-05-30 / tau / コメントする

pyplotでグラフを描画したとき、軸の端の方でグラフが見切れてしまう。軸の外側も使って線や点をクリップせずに表示させるには、各グラフ描画の引数でclip_on=Falseを指定する。

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, num=200)
ys = np.sin(3*x)
yc = np.sin(3*x - np.pi)

xp = [-np.pi, np.pi]
yp1 = [-1, 1]
yp2 = [1, -1]

fig, ax = plt.subplots()

ax.plot(x, ys, linewidth=4)
ax.plot(x, yc, linewidth=4, clip_on=False)

ax.scatter(xp, yp1, s=80)
ax.scatter(xp, yp2, s=80, clip_on=False)

ax.set_xlim(-np.pi, np.pi)
ax.set_ylim(-1, 1)

plt.show()

import numpy as np

import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, num=200)

ys = np.sin(3*x)

yc = np.sin(3*x - np.pi)

xp = [-np.pi, np.pi]

yp1 = [-1, 1]

yp2 = [1, -1]

fig, ax = plt.subplots()

ax.plot(x, ys, linewidth=4)

ax.plot(x, yc, linewidth=4, clip_on=False)

ax.scatter(xp, yp1, s=80)

ax.scatter(xp, yp2, s=80, clip_on=False)

ax.set_xlim(-np.pi, np.pi)

ax.set_ylim(-1, 1)

plt.show()

DecisionTreeClassifierの可視化環境

2020-05-27 / tau / コメントする

概要

Pythonのscikit-learnで提供される決定木のクラス分類モデルDecisionTreeClassifierの実行結果を可視化する環境について。

Graphvizとgraphvizパッケージ

この方法は、決定木の画像がPDFとして生成され、デフォルトのPDFリーダーが自動的に起動して確認できる。画像ファイルを利用する場合、PDFから切り出すか、以下のpydotplusパッケージを利用する。

import graphviz
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

dot_data = export_graphviz(treeclf, out_file=None,
    feature_names=["f0", "f1"], class_names=["c0", "c1"])
graph = graphviz.Source(dot_data)
graph.render("image", view=True)

import graphviz

from sklearn.datasets import make_moons

from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)

treeclf.fit(X, y)

dot_data = export_graphviz(treeclf, out_file=None,

feature_names=["f0", "f1"], class_names=["c0", "c1"])

graph = graphviz.Source(dot_data)

graph.render("image", view=True)

Graphvizとpydotplosパッケージ

この方法は、決定木の画像がファイルとして生成・保存される。画像を確認するためにファイルが保存されたディレクトリでファイルを開く手順が必要になるが、得られたファイルをそのまま活用することができる。

pydotplusのインストール

pydotplusをインストールする。

> pip install pydotplus

1	> pip install pydotplus

これだけでは次のようなエラーが出る。

pydotplus.graphviz.InvocationException: GraphViz's executables not found

1	pydotplus.graphviz.InvocationException: GraphViz's executables not found

Graphvizのインストール

Graphvizのサイトから実行ファイル(msiファイル)をダウンロード、インストールする。

実行方法1：Graphvizの実行位置を指定

以下のコード例13行目のように、Graphvizの実行プログラムの位置を指定。

import pydotplus as pdp
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

dot_data = export_graphviz(treeclf, out_file=None,
    feature_names=["f0", "f1"], class_names=["c0", "c1"])
graph = pdp.graph_from_dot_data(dot_data)
graph.progs = {'dot': u"C:\\Program Files (x86)\\Graphviz2.38\\bin\\dot.exe"}
graph.write_png("graph.png")

import pydotplus as pdp

from sklearn.datasets import make_moons

from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)

treeclf.fit(X, y)

dot_data = export_graphviz(treeclf, out_file=None,

feature_names=["f0", "f1"], class_names=["c0", "c1"])

graph = pdp.graph_from_dot_data(dot_data)

graph.progs = {'dot': u"C:\\Program Files (x86)\\Graphviz2.38\\bin\\dot.exe"}

graph.write_png("graph.png")

実行方法2：Graphvizへのパスを環境変数に登録

環境変数に上記のGraphvizのパスを指定する。

デスクトップのPCアイコンを右クリック→プロパティ
システム・ウィンドウ→システムの詳細設定
システムのプロパティダイアログ→環境変数ボタン
環境変数ダイアログのシステム環境変数→Pathを指定して編集ボタン
環境変数名の編集ダイアログ→新規ボタン
Graphvizへのパス(例えばC:\Program Files (x86)\Graphviz2.38\bin\)を入力してOK
以下、各ダイアログでOK

環境変数を設定しておくと、毎回パスを指定しなくてよい。

import pydotplus as pdp
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

dot_data = export_graphviz(treeclf, out_file=None,
    feature_names=["f0", "f1"], class_names=["c0", "c1"])
graph = pdp.graph_from_dot_data(dot_data)
graph.write_png("graph.png")

import pydotplus as pdp

from sklearn.datasets import make_moons

from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)

treeclf.fit(X, y)

dot_data = export_graphviz(treeclf, out_file=None,

feature_names=["f0", "f1"], class_names=["c0", "c1"])

graph = pdp.graph_from_dot_data(dot_data)

graph.write_png("graph.png")

dtreeviz

dtreevizのインストール

dtreevizをインストールする。

> pip install dtreeviz

1	> pip install dtreeviz

実行方法

Graphvizの実行方法2で環境変数を追加。

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from dtreeviz.trees import dtreeviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X, y)

viz = dtreeviz(treeclf, X, y, target_name="Classes",
    feature_names=["f0", "f1"], class_names=["c0", "c1"])

viz.view()

from sklearn.datasets import make_moons

from sklearn.tree import DecisionTreeClassifier

from dtreeviz.trees import dtreeviz

X, y = make_moons(n_samples=20, noise=0.25, random_state=3)

treeclf = DecisionTreeClassifier(random_state=0)

treeclf.fit(X, y)

viz = dtreeviz(treeclf, X, y, target_name="Classes",

feature_names=["f0", "f1"], class_names=["c0", "c1"])

viz.view()

scikit-learn – make_moons

2020-05-24 / tau / コメントする

概要

sklearn.datasets.make_moons()はクラス分類のためのデータを生成する。上向き、下向きの弧が相互にかみ合う形で生成され、単純な直線では分離できないデータセットを提供する。クラス数は常に2クラス。

得られるデータの形式

2つの配列X, yが返され、配列Xは列が特徴量、行がレコードの2次元配列。ターゲットyはレコード数分のクラス属性値の整数。

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10, random_state=0)

print("X:\n{}".format(X))
print("y:{}".format(y))

# X:
# [[ 6.12323400e-17  1.00000000e+00]
#  [ 1.70710678e+00 -2.07106781e-01]
#  [-1.00000000e+00  1.22464680e-16]
#  [ 2.00000000e+00  5.00000000e-01]
#  [ 7.07106781e-01  7.07106781e-01]
#  [ 2.92893219e-01 -2.07106781e-01]
#  [ 1.00000000e+00 -5.00000000e-01]
#  [-7.07106781e-01  7.07106781e-01]
#  [ 1.00000000e+00  0.00000000e+00]
#  [ 0.00000000e+00  5.00000000e-01]]
# y:[0 1 0 1 0 1 1 0 0 1]

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10, random_state=0)

print("X:\n{}".format(X))

print("y:{}".format(y))

# X:

# [[ 6.12323400e-17 1.00000000e+00]

# [ 1.70710678e+00 -2.07106781e-01]

# [-1.00000000e+00 1.22464680e-16]

# [ 2.00000000e+00 5.00000000e-01]

# [ 7.07106781e-01 7.07106781e-01]

# [ 2.92893219e-01 -2.07106781e-01]

# [ 1.00000000e+00 -5.00000000e-01]

# [-7.07106781e-01 7.07106781e-01]

# [ 1.00000000e+00 0.00000000e+00]

# [ 0.00000000e+00 5.00000000e-01]]

# y:[0 1 0 1 0 1 1 0 0 1]

利用例

以下の例では、noiseパラメーターを変化させている。

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

noises = [0, 0.1, 0.2, 0.3]
fig, axs = plt.subplots(2, 2)
axs_1d = axs.reshape(axs.size)

for ax, noise in zip(axs_1d, noises):
    plt.subplots_adjust(hspace=0.4)
    X, y = make_moons(noise=noise, random_state=0)
    ax.scatter(X[y==0][:, 0], X[y==0][:, 1], s=5, marker='o')
    ax.scatter(X[y==1][:, 0], X[y==1][:, 1], s=5, marker='^')
    ax.set_title("noise={}".format(noise))

plt.show()

import matplotlib.pyplot as plt

from sklearn.datasets import make_moons

noises = [0, 0.1, 0.2, 0.3]

fig, axs = plt.subplots(2, 2)

axs_1d = axs.reshape(axs.size)

for ax, noise in zip(axs_1d, noises):

plt.subplots_adjust(hspace=0.4)

X, y = make_moons(noise=noise, random_state=0)

ax.scatter(X[y==0][:, 0], X[y==0][:, 1], s=5, marker='o')

ax.scatter(X[y==1][:, 0], X[y==1][:, 1], s=5, marker='^')

ax.set_title("noise={}".format(noise))

plt.show()

パラメーターの指定

sklearn.datasets.make_moons(n_samples=100, shuffle=True, noise=None, random_state=None)

1	sklearn.datasets.make_moons(n_samples=100, shuffle=True, noise=None, random_state=None)

n_samples

1つの数値で与えた場合は全データ数、2要素のタプルで与えた場合はそれぞれのクラスのデータ数。デフォルトは100。
shuffle: データをシャッフルするかどうか。デフォルトはTrue。
noise: データに加えられるノイズの標準偏差。デフォルトはノイズなし。
random_state: データ生成の乱数系列。

線形モデルによる多クラス分類

2020-05-23 / tau / 4件のコメント

概要

この項はO’REILLYの「Pythonではじめる機械学習」の「2.3.3.6　線形モデルによる多クラス分類」を自分なりに理解しやすいようにトレースしたもの。扱いやすい仮想のデータセットを生成し、LinearSVCモデルでこれらを分類する流れを例示している。

例えば特徴量x₁～x_nのデータxをC₁, C₂の2クラスに分類する線形モデルは以下とおり。

(1) $\begin{gather*} y = b + w_1 x_1 + \cdots + w_n x_n \\ \left\{ \begin{array}{ll} y \ge 0 \quad \rightarrow \quad \boldsymbol{x} \in C_1 \\ y < 0 \quad \rightarrow \quad \boldsymbol{x} \in C_2 \end{array} \right. \end{gather*}$

yの符号によってどちらのクラスに分類されるかを決定するが、1つの式で3つ以上のクラスを分類することはできない（ただし一般化線形モデル(GLM)であるLogistic回帰は多クラス分類が可能）。

このような2クラス分類を多クラス分類に拡張する方法の一つが1対その他(one-vs-rest, one-vs-the-rest, 1vR)という考え方で、1つの式によって、あるクラスとその他すべてのクラスを分けようというもの。この式の形は(1)と同じで、yの値は与えられたデータがそのクラスに属する確信度(confidence)を表す。クラスの数だけこの分類器(one-vs-the-rest-classifier)を準備し、あるデータが与えられたとき、最も確信度が高いクラスに属すると考える。たとえばn個の特徴量を持つデータの3クラス分類の場合、次のように3つの分類器を準備し、与えられたデータxはy_cの値が最も大きいクラスに属する。

(2) $\begin{gather*} y_0 = b_0 + w_{01} x_1 + \cdots + w_{0n} x_n \\ y_1 = b_1 + w_{11} x_1 + \cdots + w_{1n} x_n \\ y_2 = b_2 + w_{21} x_1 + \cdots + w_{2n} x_n \end{gather*}$

LinearSVCによる多クラス分類の例

データの準備

準備として、shikit-leran.datasetsのmake_blobs()で、2つの特徴量と3つのクラスのデータセットを生成する。

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=42)

fig, ax = plt.subplots()

f0_min, f0_max = -10, 8
f1_min, f1_max = -10, 15
markers = ['o', '^', 'v']

for cls, marker in zip(range(3), markers):
    x = X[y==cls]
    ax.scatter(x[:, 0], x[:, 1],
        ec='k', marker=marker, label="Class {}".format(cls))

ax.set_xlim(f0_min, f0_max)
ax.set_ylim(f1_min, f1_max)
ax.set_xlabel("Feature 0")
ax.set_ylabel("Feature 1")
ax.legend()

plt.show()

import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=42)

fig, ax = plt.subplots()

f0_min, f0_max = -10, 8

f1_min, f1_max = -10, 15

markers = ['o', '^', 'v']

for cls, marker in zip(range(3), markers):

x = X[y==cls]

ax.scatter(x[:, 0], x[:, 1],

ec='k', marker=marker, label="Class {}".format(cls))

ax.set_xlim(f0_min, f0_max)

ax.set_ylim(f1_min, f1_max)

ax.set_xlabel("Feature 0")

ax.set_ylabel("Feature 1")

ax.legend()

plt.show()

LinearSVCによる学習

学習とモデルの形

scikit-learn.linear_modelのLinearSVC(Linear Support Vector Classification)は多クラス分類のモデルを提供する。このモデルをmake_blobs()で生成したデータで学習させると、3行2列の係数(LinearSVC.coef_)と3要素の切片(LinearSVC.intercept_)を得る。

import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(random_state=42)
df = DataFrame(X, columns=["f0", "f1"])
df['target'] = y

linsvm = LinearSVC().fit(X, y)
w = linsvm.coef_
b = linsvm.intercept_

print("Intercept: {}".format(b))
print("Coefficients(class, feature):\n{}".format(w))

# Intercept: [-1.07745476  0.13140569 -0.08604816]
# Coefficients(class, feature):
# [[-0.17491916  0.23140527]
#  [ 0.47621794 -0.06937226]
#  [-0.18914243 -0.20399679]]

import numpy as np

import matplotlib.pyplot as plt

from pandas import DataFrame

from sklearn.datasets import make_blobs

from sklearn.svm import LinearSVC

X, y = make_blobs(random_state=42)

df = DataFrame(X, columns=["f0", "f1"])

df['target'] = y

linsvm = LinearSVC().fit(X, y)

w = linsvm.coef_

b = linsvm.intercept_

print("Intercept: {}".format(b))

print("Coefficients(class, feature):\n{}".format(w))

# Intercept: [-1.07745476 0.13140569 -0.08604816]

# Coefficients(class, feature):

# [[-0.17491916 0.23140527]

# [ 0.47621794 -0.06937226]

# [-0.18914243 -0.20399679]]

これらの係数の行と切片の要素は分類されるべきクラス、係数の列は特徴量に対応している。クラスに対するインデックスをc = 0, 1, 2、特徴量f₀, f₁に対するインデックスをf= 0, 1とすると、上記の結果は以下のような意味になる。

(3) $\begin{align*} w_{cf} &= \left[ \begin{array}{rrr} -0.17492222 & 0.23140089 \\ 0.4762125 & -0.06936704 \\ -0.18914556 & -0.20399715 \end{array} \right] \\ b_c &= [-1.07745632 \quad 0.13140349 \quad -0.08604899] \end{align*}$

これらの係数、切片を用いたクラス分類の予測式は以下の通りで、LinearSVCではdecision function（決定関数）とされている。

(4) $\begin{equation*} y_c = b_c + w_{c0} \times f_0 + w_{c1} \times f_1 \end{equation*}$

あるデータの特徴量f₀, f₁に対して上記のy_cが正の時にはそのデータはクラスc、負の時にはクラスc以外であると判定される。

coef_やintercept_の値は、実行ごとにわずかに異なる（10⁻⁶くらいのオーダー）。LinearSVCのコンストラクターの引数にrandom_stateが含まれていて、ドキュメントに以下のような記述があった。

The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.The underlying implementation, liblinear, uses a sparse internal representation for the data that will incur a memory copy.

Predict output may not match that of standalone liblinear in certain cases. See differences from liblinear in the narrative documentation.

訓練データに対する決定関数・確信度

データセットの100個の各データに対してy_c (c = 0, 1, 2)を計算した結果は以下の通り。

df['y0'] = b[0] + w[0, 0] * df['f0'] + w[0, 1] * df['f1']
df['y1'] = b[1] + w[1, 0] * df['f0'] + w[1, 1] * df['f1']
df['y2'] = b[2] + w[2, 0] * df['f0'] + w[2, 1] * df['f1']

print(df)

# DataFrame adding Confidences:
#           f0        f1  target        y0        y1        y2
# 0  -7.726421 -8.394957       2 -1.668593 -2.965677  3.087890
# 1   5.453396  0.742305       1 -1.859585  2.676915 -1.268945
# 2  -2.978672  9.556846       0  1.655077 -1.950071 -1.472221
# 3   6.042673  0.571319       1 -2.002228  2.969401 -1.345521
# 4  -6.521840 -6.319325       2 -1.398985 -2.536026  2.436631
# ..       ...       ...     ...       ...       ...       ...
# 95 -3.186120  9.625962       0  1.707357 -2.053656 -1.447083
# 96 -1.478198  9.945566       0  1.482567 -1.262485 -1.835322
# 97  4.478593  2.377221       1 -1.310745  2.099279 -1.418086
# 98 -5.796576 -5.826308       2 -1.411761 -2.224844  2.198878
# 99 -3.348415  8.705074       0  1.522647 -2.067060 -1.228528
# 
# [100 rows x 6 columns]

df['y0'] = b[0] + w[0, 0] * df['f0'] + w[0, 1] * df['f1']

df['y1'] = b[1] + w[1, 0] * df['f0'] + w[1, 1] * df['f1']

df['y2'] = b[2] + w[2, 0] * df['f0'] + w[2, 1] * df['f1']

print(df)

# DataFrame adding Confidences:

# f0 f1 target y0 y1 y2

# 0 -7.726421 -8.394957 2 -1.668593 -2.965677 3.087890

# 1 5.453396 0.742305 1 -1.859585 2.676915 -1.268945

# 2 -2.978672 9.556846 0 1.655077 -1.950071 -1.472221

# 3 6.042673 0.571319 1 -2.002228 2.969401 -1.345521

# 4 -6.521840 -6.319325 2 -1.398985 -2.536026 2.436631

# .. ... ... ... ... ... ...

# 95 -3.186120 9.625962 0 1.707357 -2.053656 -1.447083

# 96 -1.478198 9.945566 0 1.482567 -1.262485 -1.835322

# 97 4.478593 2.377221 1 -1.310745 2.099279 -1.418086

# 98 -5.796576 -5.826308 2 -1.411761 -2.224844 2.198878

# 99 -3.348415 8.705074 0 1.522647 -2.067060 -1.228528

# [100 rows x 6 columns]

たとえばNo.0のデータはクラス2に属するので確信度はy2が正となり、他の2つのクラスに対しては負の値になっている。

上の計算ではintercept_とcoef_を使ってもともとの決定関数の式から確信度を計算したが、LinearSVCのdecition_function()メソッドで同じ結果を得ることができる。たとえばNo.0～2のデータで計算してみると以下の通りで同じ結果。

print("decision_function values for first 3 data")
print(linsvm.decision_function(df.iloc[0:3, 0:2]))

# decision_function values for first 3 data
# [[-1.668593   -2.96567746  3.08789015]
#  [-1.85958483  2.67691534 -1.26894468]
#  [ 1.65507664 -1.95007141 -1.47222084]]

print("decision_function values for first 3 data")

print(linsvm.decision_function(df.iloc[0:3, 0:2]))

# decision_function values for first 3 data

# [[-1.668593 -2.96567746 3.08789015]

# [-1.85958483 2.67691534 -1.26894468]

# [ 1.65507664 -1.95007141 -1.47222084]]

テストデータに対する予測

3つのテストデータを用意してクラス分類をしてみる。

X_test = np.array([[4, -2], [-2, 2], [-6, 5]])
preds = linsvm.predict(X_test)
print("Prediction:")
for x_test, pred in zip(X_test, preds):
    print("{} -> {}".format(x_test, pred))
print()
print("Confidences of 3 points:\n{}".format(linsvm.decision_function(X_test)))

# Prediction:
# [ 4 -2] -> 1
# [-2  2] -> 2
# [-6  5] -> 0
# 
# Confidences of 3 points:
# [[-2.23994194  2.17502199 -0.43462432]
#  [-0.2648059  -0.95977473 -0.11575688]
#  [ 1.12908656 -3.07276329  0.02882249]]

X_test = np.array([[4, -2], [-2, 2], [-6, 5]])

preds = linsvm.predict(X_test)

print("Prediction:")

for x_test, pred in zip(X_test, preds):

print("{} -> {}".format(x_test, pred))

print()

print("Confidences of 3 points:\n{}".format(linsvm.decision_function(X_test)))

# Prediction:

# [ 4 -2] -> 1

# [-2 2] -> 2

# [-6 5] -> 0

# Confidences of 3 points:

# [[-2.23994194 2.17502199 -0.43462432]

# [-0.2648059 -0.95977473 -0.11575688]

# [ 1.12908656 -3.07276329 0.02882249]]

各データとも分類されたクラスに対応する確信度が最も高い。ただし2つ目のデータについては全てのクラスに対する確信度が負の値で、その中で最も値が大きいクラス2に分類されている。

これらを図示すると以下のようになり、クラス2に分類された▼のデータは確かにどのデータにも属していそうな位置にある。

以上のコードをまとめておく。

import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(random_state=42)
df = DataFrame(X, columns=["f0", "f1"])
df['target'] = y

linsvm = LinearSVC().fit(X, y)
w = linsvm.coef_
b = linsvm.intercept_

print("Intercept: {}".format(b))
print("Coefficients(class, feature):\n{}".format(w))

print()

df['y0'] = b[0] + w[0, 0] * df['f0'] + w[0, 1] * df['f1']
df['y1'] = b[1] + w[1, 0] * df['f0'] + w[1, 1] * df['f1']
df['y2'] = b[2] + w[2, 0] * df['f0'] + w[2, 1] * df['f1']

print("DataFrame adding Confidences:\n{}".format(df))

print()

print("decision_function values for first 3 data")
print(linsvm.decision_function(df.iloc[0:3, 0:2]))

print()

X_test = np.array([[4, -2], [-2, 2], [-6, 5]])
preds = linsvm.predict(X_test)
print("Prediction:")
for x_test, pred in zip(X_test, preds):
    print("{} -> {}".format(x_test, pred))
print()
print("Confidences of 3 points:\n{}".format(linsvm.decision_function(X_test)))

fig, ax = plt.subplots()

f0_min, f0_max = -10, 8
f1_min, f1_max = -10, 15
markers = ['o', '^', 'v']

for cls, marker in zip(range(3), markers):
    x = X[y==cls]
    ax.scatter(x[:, 0], x[:, 1],
        ec='k', marker=marker, label="Class {}".format(cls))

ax.scatter(X_test[0][0], X_test[0][1],
    ec='k', c='tab:orange', marker=markers[1], s=80)
ax.scatter(X_test[1][0], X_test[1][1],
    ec='k', c='tab:green', marker=markers[2], s=80)
ax.scatter(X_test[2][0], X_test[2][1],
    ec='k', c='tab:blue', marker=markers[0], s=80)

ax.set_xlim(f0_min, f0_max)
ax.set_ylim(f1_min, f1_max)
ax.set_xlabel("Feature 0")
ax.set_ylabel("Feature 1")
ax.legend()

plt.show()

import numpy as np

import matplotlib.pyplot as plt

from pandas import DataFrame

from sklearn.datasets import make_blobs

from sklearn.svm import LinearSVC

X, y = make_blobs(random_state=42)

df = DataFrame(X, columns=["f0", "f1"])

df['target'] = y

linsvm = LinearSVC().fit(X, y)

w = linsvm.coef_

b = linsvm.intercept_

print("Intercept: {}".format(b))

print("Coefficients(class, feature):\n{}".format(w))

print()

df['y0'] = b[0] + w[0, 0] * df['f0'] + w[0, 1] * df['f1']

df['y1'] = b[1] + w[1, 0] * df['f0'] + w[1, 1] * df['f1']

df['y2'] = b[2] + w[2, 0] * df['f0'] + w[2, 1] * df['f1']

print("DataFrame adding Confidences:\n{}".format(df))

print()

print("decision_function values for first 3 data")

print(linsvm.decision_function(df.iloc[0:3, 0:2]))

print()

X_test = np.array([[4, -2], [-2, 2], [-6, 5]])

preds = linsvm.predict(X_test)

print("Prediction:")

for x_test, pred in zip(X_test, preds):

print("{} -> {}".format(x_test, pred))

print()

print("Confidences of 3 points:\n{}".format(linsvm.decision_function(X_test)))

fig, ax = plt.subplots()

f0_min, f0_max = -10, 8

f1_min, f1_max = -10, 15

markers = ['o', '^', 'v']

for cls, marker in zip(range(3), markers):

x = X[y==cls]

ax.scatter(x[:, 0], x[:, 1],

ec='k', marker=marker, label="Class {}".format(cls))

ax.scatter(X_test[0][0], X_test[0][1],

ec='k', c='tab:orange', marker=markers[1], s=80)

ax.scatter(X_test[1][0], X_test[1][1],

ec='k', c='tab:green', marker=markers[2], s=80)

ax.scatter(X_test[2][0], X_test[2][1],

ec='k', c='tab:blue', marker=markers[0], s=80)

ax.set_xlim(f0_min, f0_max)

ax.set_ylim(f1_min, f1_max)

ax.set_xlabel("Feature 0")

ax.set_ylabel("Feature 1")

ax.legend()

plt.show()

LinearSVCの決定境界

クラスごとのone-vs-restの決定境界

blobsデータは明確に分かれた3つのクラスに分類され、それぞれに対する決定関数の切片と係数が得られた。そこで、各決定関数の決定関数の意思決定境界(decision boundary)を描いてみる。意思決定境界は決定関数の値がゼロとなる線なので、以下の式で表される。

(5) $\begin{gather*} b_c + w_0 f_0 + w_1 f_1 = 0 \quad \rightarrow \quad f_1 = \frac{-(b_c + w_{c0} f_0)}{w_{c1}} \end{gather*}$

3つの決定関数について決定境界を描いたのが以下の結果。

たとえばClass 0の実線は、Class 0の塊とその他(Class1, Class 2)の塊を1対その他で分けている。この線の上側では確信度はプラスで、下側ではマイナスとなっている。

import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

f0_min, f0_max = -10, 8
f1_min, f1_max = -10, 15

X, y = make_blobs(random_state=42)
df = DataFrame(X, columns=["feature-0", "feature-1"])
df['target'] = y

linsvm = LinearSVC().fit(X, y)
w = linsvm.coef_
b = linsvm.intercept_

fig, ax = plt.subplots(figsize=(7.2, 4.8))

markers = ['o', '^', 'v']
line_styles = ['solid', 'dashed', 'dotted']

for c, marker, ls in zip(range(3), markers, line_styles):
    x = X[y==c]
    ax.scatter(x[:, 0], x[:, 1], marker=marker, label="Class {}".format(c))

    f1_left = -(b[c] + w[c, 0] * f0_min) / w[c, 1]
    f1_right = -(b[c] + w[c, 0] * f0_max) / w[c, 1]
    ax.plot([f0_min, f0_max], [f1_left, f1_right],
        linestyle=ls, linewidth=2, label="Class {}".format(c))

ax.set_xlim(f0_min, f0_max)
ax.set_ylim(f1_min, f1_max)
ax.legend(bbox_to_anchor=(1, 1))
fig.tight_layout()

plt.show()

import matplotlib.pyplot as plt

from pandas import DataFrame

from sklearn.datasets import make_blobs

from sklearn.svm import LinearSVC

f0_min, f0_max = -10, 8

f1_min, f1_max = -10, 15

X, y = make_blobs(random_state=42)

df = DataFrame(X, columns=["feature-0", "feature-1"])

df['target'] = y

linsvm = LinearSVC().fit(X, y)

w = linsvm.coef_

b = linsvm.intercept_

fig, ax = plt.subplots(figsize=(7.2, 4.8))

markers = ['o', '^', 'v']

line_styles = ['solid', 'dashed', 'dotted']

for c, marker, ls in zip(range(3), markers, line_styles):

x = X[y==c]

ax.scatter(x[:, 0], x[:, 1], marker=marker, label="Class {}".format(c))

f1_left = -(b[c] + w[c, 0] * f0_min) / w[c, 1]

f1_right = -(b[c] + w[c, 0] * f0_max) / w[c, 1]

ax.plot([f0_min, f0_max], [f1_left, f1_right],

linestyle=ls, linewidth=2, label="Class {}".format(c))

ax.set_xlim(f0_min, f0_max)

ax.set_ylim(f1_min, f1_max)

ax.legend(bbox_to_anchor=(1, 1))

fig.tight_layout()

plt.show()

全体を融合した決定境界

先の図の中で、各クラスの塊の近くでは、そのクラスの決定関数の値はプラスで他はマイナスとなっているが、真ん中の三角形の中や、その対角にある三角形の領域では、複数の確信度がマイナスあるいはプラスとなる。このような場合には、全クラスに対して着目するデータの決定関数値を計算し、その確信度が最も大きいクラスをそのデータのクラスラベルとして与える。

以下の図は、領域内の点について全て確信度を計算し、各点において最も確信度が大きいクラスをその点のクラスとして表現した図である。

各領域の境界が3つの決定関数から導かれた意思決定境界であり、その線上で決定関数の値が等しくなっている。

import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

f0_min, f0_max = -10, 8
f1_min, f1_max = -10, 15

X, y = make_blobs(random_state=42)
df = DataFrame(X, columns=["feature-0", "feature-1"])
df['target'] = y
n_classes = max(y + 1)

linsvm = LinearSVC().fit(X, y)
w = linsvm.coef_
b = linsvm.intercept_

markers = ['o', '^', 'v']
line_styles = ['solid', 'dashed', 'dotted']
colors = ['tab:blue', 'tab:orange', 'tab:green']

fig, ax = plt.subplots()

for f0 in np.linspace(f0_min, f0_max, 75):
    for f1 in np.linspace(f1_min, f1_max, 55):
        conf = [b[c] + w[c, 0] * f0 + w[c, 1] * f1 for c in range(n_classes)]
        ax.scatter(f0, f1, c=colors[np.argmax(conf)], marker='s', s=20, alpha=0.2)

for c, marker, ls in zip(range(3), markers, line_styles):
    x = X[y==c]
    ax.scatter(x[:, 0], x[:, 1], marker=marker, label="Class {}".format(c))

    f1_left = -(b[c] + w[c, 0] * f0_min) / w[c, 1]
    f1_right = -(b[c] + w[c, 0] * f0_max) / w[c, 1]
    ax.plot([f0_min, f0_max], [f1_left, f1_right], linestyle=ls, linewidth=2)

ax.set_xlim(f0_min, f0_max)
ax.set_ylim(f1_min, f1_max)
ax.legend()

plt.show()

import numpy as np

import matplotlib.pyplot as plt

from pandas import DataFrame

from sklearn.datasets import make_blobs

from sklearn.svm import LinearSVC

f0_min, f0_max = -10, 8

f1_min, f1_max = -10, 15

X, y = make_blobs(random_state=42)

df = DataFrame(X, columns=["feature-0", "feature-1"])

df['target'] = y

n_classes = max(y + 1)

linsvm = LinearSVC().fit(X, y)

w = linsvm.coef_

b = linsvm.intercept_

markers = ['o', '^', 'v']

line_styles = ['solid', 'dashed', 'dotted']

colors = ['tab:blue', 'tab:orange', 'tab:green']

fig, ax = plt.subplots()

for f0 in np.linspace(f0_min, f0_max, 75):

for f1 in np.linspace(f1_min, f1_max, 55):

conf = [b[c] + w[c, 0] * f0 + w[c, 1] * f1 for c in range(n_classes)]

ax.scatter(f0, f1, c=colors[np.argmax(conf)], marker='s', s=20, alpha=0.2)

for c, marker, ls in zip(range(3), markers, line_styles):

x = X[y==c]

ax.scatter(x[:, 0], x[:, 1], marker=marker, label="Class {}".format(c))

f1_left = -(b[c] + w[c, 0] * f0_min) / w[c, 1]

f1_right = -(b[c] + w[c, 0] * f0_max) / w[c, 1]

ax.plot([f0_min, f0_max], [f1_left, f1_right], linestyle=ls, linewidth=2)

ax.set_xlim(f0_min, f0_max)

ax.set_ylim(f1_min, f1_max)

ax.legend()

plt.show()