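A bird's-eye view of the iris data

2020-03-13 / tau

Overview

The iris data provides four features, the length and width of the sepal and of the petal, for 150 individuals from three species of iris (setosa, versicolor, virginica). This post surveys the data using a few common kinds of plots.

Feature distributions

Without class separation

First, look at the distribution of each of the four features over all 150 individuals, without distinguishing the three species.

Nothing stands out strongly in this result. If anything, sepal length has a somewhat larger spread and sepal width has a fairly "clean" distribution. For the petal, both length and width show a separate cluster at small values.

Knowing that the data mixes several species, one could at most guess that the isolated petal cluster belongs to one particular species.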

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris_data = load_iris()

feature_names = iris_data['feature_names']
X = iris_data['data']
n_features = X.shape[1]

# 2x2 grid of axes, flattened so it can be indexed by feature number
fig, axs = plt.subplots(2, 2, figsize=(6.4, 4.8))
ax_1d = axs.ravel()

fig.subplots_adjust(hspace=0.4)

# One histogram per feature, with all species pooled together
for feature in range(n_features):
    ax_1d[feature].set_title(feature_names[feature])
    ax_1d[feature].hist(X[:, feature], ec='k')

plt.show()

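With class separation

Next, plot the four features separately for each of the three species.

Some structure now starts to show.

The isolated petal cluster turns out to belong to setosa (hiougi-ayame in Japanese), and the wide spread in sepal length comes from the distributions of the different species overlapping with small offsets from one another.

From these distributions alone, an individual with a petal length below about 2.5 cm, or a petal width below about 0.7 to 0.8 cm, can apparently be identified as setosa. Versicolor and virginica overlap, but splitting at a petal width of about 1.75 cm would separate them for the most part, with a few misclassifications.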

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris_data = load_iris()

feature_names = iris_data['feature_names']
species_names = iris_data['target_names']
X = iris_data['data']
y = iris_data['target']
n_features = X.shape[1]
species = np.unique(y)  # sorted class codes: 0, 1, 2

fig, axs = plt.subplots(2, 2, figsize=(9.6, 7.2))
ax_1d = axs.ravel()

fig.subplots_adjust(hspace=0.4)

colors = ['r', 'y', 'b']
for feature in range(n_features):
    ax_1d[feature].set_title(feature_names[feature])
    feature_data = X[:, feature]
    # Share one bin range across species so the histograms are comparable
    range_max = np.max(feature_data)
    range_min = np.min(feature_data)
    for sp in species:
        ax_1d[feature].hist(feature_data[y==sp],
                            range=(range_min, range_max), bins=10,
                            color=colors[sp], ec='k', alpha=0.5,
                            label=species_names[sp])
    # Draw the legend once per subplot, after all species are plotted
    ax_1d[feature].legend()

plt.show()
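
The thresholds read off the histograms can be checked numerically. The following is a minimal sketch and not part of the original analysis: the cut-off values 2.5 cm and 1.75 cm are the approximate figures mentioned above, and the helper name `classify_by_petal` is made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_data = load_iris()
X = iris_data['data']
y = iris_data['target']

PETAL_LENGTH, PETAL_WIDTH = 2, 3  # column indices in X


def classify_by_petal(X, length_cut=2.5, width_cut=1.75):
    """Hypothetical two-threshold rule read off the histograms.

    Returns 0 (setosa), 1 (versicolor) or 2 (virginica) for each row.
    """
    # Split versicolor / virginica on petal width first
    pred = np.where(X[:, PETAL_WIDTH] < width_cut, 1, 2)
    # Then mark everything with a short petal as setosa
    pred[X[:, PETAL_LENGTH] < length_cut] = 0
    return pred


pred = classify_by_petal(X)
n_errors = int(np.sum(pred != y))
setosa_errors = int(np.sum((pred == 0) != (y == 0)))

print("misclassified individuals: {} of {}".format(n_errors, len(y)))
print("errors involving setosa  : {}".format(setosa_errors))
```

If the reading of the histograms is right, setosa should come out without error and only a handful of versicolor and virginica individuals should fall on the wrong side of the petal width cut.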

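Relationships between pairs of features

Example comparisons

As examples, plot sepal length against sepal width, and sepal length against petal width.

In the plot of sepal length against sepal width, setosa is clearly a separate group, but versicolor and virginica are mixed together and do not look separable. In the earlier histograms, neither sepal length nor sepal width alone could separate versicolor from virginica. A two-dimensional plot can sometimes separate such classes, but in this case it does not.

In the plot of sepal length against petal width, on the other hand, versicolor and virginica can just about be separated. Looking closely, the three groups can largely be separated by petal width alone, regardless of sepal length. This agrees with the petal width histogram shown earlier.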

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris_data = load_iris()
X = iris_data['data']
y = iris_data['target']

data_setosa = X[y==0]
data_versicolor = X[y==1]
data_virginica = X[y==2]

# Column indices: sepal length, sepal width, petal length, petal width
sl, sw, pl, pw = (0, 1, 2, 3)

fig, axs = plt.subplots(1, 2, figsize=(10, 4.8))

# Left: sepal length vs. sepal width; right: sepal length vs. petal width
for ax, (a, b) in zip(axs, [(sl, sw), (sl, pw)]):
    ax.scatter(data_setosa[:, a], data_setosa[:, b], label="setosa")
    ax.scatter(data_versicolor[:, a], data_versicolor[:, b], label="versicolor")
    ax.scatter(data_virginica[:, a], data_virginica[:, b], label="virginica")
    # Label each subplot with its own feature pair
    ax.set_xlabel(iris_data.feature_names[a])
    ax.set_ylabel(iris_data.feature_names[b])
    ax.legend()

plt.show()

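Checking with `scatter_matrix`

With n features there are nC2 = n(n-1)/2 feature pairs like the ones above. The iris data has four features, so there are six possible pairs. pandas' scatter_matrix makes it possible to check all of these pairs at once.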
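
The number of pairs can be confirmed by enumerating them. This is a small sketch using the standard library's `itertools.combinations` and `math.comb` (Python 3.8 or later); the feature names are copied from `feature_names`.

```python
from itertools import combinations
from math import comb

feature_names = ['sepal length (cm)', 'sepal width (cm)',
                 'petal length (cm)', 'petal width (cm)']

# All unordered pairs of distinct features: nC2 = n(n-1)/2
pairs = list(combinations(feature_names, 2))

n = len(feature_names)
print("n = {}, nC2 = {}".format(n, comb(n, 2)))
for a, b in pairs:
    print("{} / {}".format(a, b))
```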

However, scatter_matrix does not appear to be able to split the histograms on the diagonal by species.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris_dataset = load_iris()

iris_dataframe = pd.DataFrame(iris_dataset.data,
                              columns=iris_dataset.feature_names)

pd.plotting.scatter_matrix(iris_dataframe,
                           figsize=(9.6, 9.6),
                           c=iris_dataset.target,
                           hist_kwds={'ec':'gray', 'color':'paleturquoise'})

plt.show()

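Checking with `pairplot`

With seaborn's pairplot, the diagonal can show the frequency or density distribution of each feature. If the target species is given as strings, pairplot colors the points accordingly and also splits the density distributions on the diagonal by species.

The pair plot shows that the three species form fairly clean groups in several of the scatter plots.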

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris_ds = load_iris()
df = pd.DataFrame(iris_ds.data, columns=iris_ds.feature_names)

# Map the integer codes 0, 1, 2 to species names for the hue legend
df['target'] = iris_ds.target_names[iris_ds.target]

g = sns.pairplot(df, hue='target')
g.fig.set_figheight(9.6)
g.fig.set_figwidth(11)

plt.show()

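Relationships among three features

Finally, take three of the four features at a time and show them as 3D scatter plots. In the 2D scatter plots versicolor and virginica overlap slightly; in three dimensions they might separate cleanly.

Some overlap remains even in 3D space, but the groups can be expected to separate more accurately than with only two features.

On reflection, a classification problem with few features, like the iris species, seems fairly easy as long as the distributions can be inspected in one, then two, then three dimensions. Human perception, however, can barely handle three dimensions, so this approach breaks down once the number of features grows.

In the end, machine learning and AI seem to come down to how to carry out discrimination and find correlations across many features, that is, in many dimensions, which humans find hard to perceive and control.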

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import load_iris

iris_ds = load_iris()

X = iris_ds.data
y = iris_ds.target
feature_names = iris_ds.feature_names

sl, sw, pl, pw = (0, 1, 2, 3)
species = (0, 1, 2)

combinations = np.array([
    [sl, sw, pl],
    [sl, sw, pw],
    [sl, pl, pw],
    [sw, pl, pw]
])

fig = plt.figure(figsize=(9.6, 7.2))

ax1 = fig.add_subplot(221, projection='3d')
ax2 = fig.add_subplot(222, projection='3d')
ax3 = fig.add_subplot(223, projection='3d')
ax4 = fig.add_subplot(224, projection='3d')
axs = [ax1, ax2, ax3, ax4]

for ax, comb in zip(axs, combinations):
    f0, f1, f2 = comb[0], comb[1], comb[2]
    xs, ys, zs = X[:, f0], X[:, f1], X[:, f2]
    for sp in species:
        ax.scatter(xs[y==sp], ys[y==sp], zs[y==sp])
    ax.set_xlabel(feature_names[f0])
    ax.set_ylabel(feature_names[f1])
    ax.set_zlabel(feature_names[f2])

plt.show()

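Iris dataset

2020-03-08 / tau

Overview

The Iris dataset describes iris species and their features. It provides many measurements of petal and sepal features for three species of iris.

This post summarizes how to use the iris data that ships with Python's scikit-learn.

Getting the data and its structure

In Python, the data can be obtained with load_iris() in scikit-learn's datasets module. The data is reportedly an object of the Bunch class, but in normal use it seems to behave like a dictionary.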

from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Print every key of the Bunch together with its value
for key, value in iris_dataset.items():
    print("{}:\n{}\n".format(key, value))

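The data is structured like a dictionary and holds, among other things, the feature array for the 150 iris individuals, the species of each individual, and the species names.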

data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 .....
 [6.5 3.  5.2 2. ]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]

target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

target_names:
['setosa' 'versicolor' 'virginica']

DESCR:
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

feature_names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

filename:
C:\Users\tomo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\datasets\data\iris.csv

The keys of the data are as follows.

from sklearn.datasets import load_iris

iris_dataset = load_iris()

print(iris_dataset.keys())

# dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

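Contents of the data

`'data'`: the feature data

A data set whose records are the four features of each of the 150 iris individuals. It is a 2D array whose elements are the per-individual arrays of four features. The column indices (0, 1, 2, 3) correspond to the four features.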

'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       .....
       [6.7, 3. , 5.2, 2.3],
       [6.3, 2.5, 5. , 1.9],
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]])

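`'target'`: codes for the iris species

An array of codes 0 to 2 corresponding to the three iris species. It is a 1D array with one entry for each of the 150 individuals.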

'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

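`'target_names'`: iris species names

The names of the three iris species. Setosa, called hiougi-ayame in Japanese, is somewhat subdued in color and shape, while versicolor and virginica are hard for a layperson to tell apart.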

'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),

The correspondence between species names and codes is as follows.

setosa	0
versicolor	1
virginica	2
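
Because `target_names` is a NumPy array, indexing it with the `target` codes converts every code to its species name in one step. The sketch below rebuilds the two arrays by hand from the values shown above (50 individuals per species, in order) so that it runs without scikit-learn; with the real data, `iris_dataset['target_names'][iris_dataset['target']]` should do the same.

```python
import numpy as np

# Same contents as 'target_names' and 'target' shown above
target_names = np.array(['setosa', 'versicolor', 'virginica'])
target = np.repeat([0, 1, 2], 50)

# Integer-array indexing: one species name per individual
species = target_names[target]

print(species[0], species[50], species[100])

# Number of individuals for each code
for code, count in enumerate(np.bincount(target)):
    print(code, target_names[code], count)
```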

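`'feature_names'`: feature names

In the stored order this key comes after DESCR. These are the features used to classify the iris species.

The names of the four features, the length and width of the sepal and of the petal, are stored as strings that include the unit cm.

'sepal length (cm)'  length of the sepal
'sepal width (cm)'  width of the sepal
'petal length (cm)'  length of the petal
'petal width (cm)'  width of the petal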

'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

The correspondence between feature names and codes (column indices) is as follows.

sepal length (cm)	0
sepal width (cm)	1
petal length (cm)	2
petal width (cm)	3

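`'filename'`: file name

This key also comes after DESCR in the stored order and gives the location of the CSV file. The first line of the file holds the number of records, the number of features, and the species names; it is followed by 150 rows, one per individual, each with four feature columns and one species column. The file contains no data corresponding to feature_names or DESCR.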

'filename': 'C:...lib\\site-packages\\sklearn\\datasets\\data\\iris.csv'

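`'DESCR'`: description of the dataset

A description of the dataset. Printing it, as in print(iris_dataset['DESCR']), displays it with its formatting intact.

There are 150 records (50 for each of the three classes).
The attributes are four numeric attributes plus the class (species).
→ The meaning of "predictive" and the reason "class" is singular were unclear to me. Presumably "predictive attributes" refers to the four input features used to predict the class, and "the class" is singular because it is a single target attribute.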

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Using the data

How to access the data

There are two ways to retrieve each item from the iris dataset.

Look it up with the dictionary key (e.g. iris_dataset['DESCR'])
Use the key string as an attribute name (e.g. iris_dataset.DESCR)
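
Both styles return the same underlying object, because the Bunch returned by load_iris() is a dict subclass that also exposes its keys as attributes. A quick check, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Bunch is a dict subclass, so the usual dict operations work
print(isinstance(iris_dataset, dict))

# Key access and attribute access give the very same objects
same_data = iris_dataset['data'] is iris_dataset.data
same_names = iris_dataset['feature_names'] is iris_dataset.feature_names
print(same_data, same_names)
```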

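Getting the feature data for all records

'data' gives the four features of the 150 individuals as a 2D array of 150 rows by 4 columns. The four columns correspond to the four names in 'feature_names'.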

from sklearn.datasets import load_iris

iris_data = load_iris()

X = iris_data['data']

print(X)

# [[5.1 3.5 1.4 0.2]
#  [4.9 3.  1.4 0.2]
#  [4.7 3.2 1.3 0.2]
#  .....
#  [6.5 3.  5.2 2. ]
#  [6.2 3.4 5.4 2.3]
#  [5.9 3.  5.1 1.8]]

Getting the data for one feature only

To extract one feature for all individuals, index with X[:, n].

from sklearn.datasets import load_iris

iris_data = load_iris()

features = iris_data['feature_names']
X = iris_data['data']
n_feature = 2

feature = X[:, n_feature]

print("feature name : {}".format(features[n_feature]))
print("feature data :\n{}".format(feature))

# feature name : petal length (cm)
# feature data :
# [1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
#  .....
#  5.7 5.2 5.  5.2 5.4 5.1]

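Extracting the data for one class only

To extract only the records of a particular class (here, a species), use condition-based element selection (a boolean mask) on the ndarray.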

from sklearn.datasets import load_iris

iris_data = load_iris()

targets = iris_data['target_names']
features = iris_data['feature_names']
X = iris_data['data']
y = iris_data['target']

n_class = 1
class_data = X[y==n_class]  # the boolean mask keeps only rows of this class

print("data for class {}:\n{}".format(targets[n_class], class_data))

# data for class versicolor:
# [[7.  3.2 4.7 1.4]
#  [6.4 3.2 4.5 1.5]
#  [6.9 3.1 4.9 1.5]
#  .....
#  [6.2 2.9 4.3 1.3]
#  [5.1 2.5 3.  1.1]
#  [5.7 2.8 4.1 1.3]]
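
The row mask and the column index can also be combined in a single indexing expression to pull out one feature for one class. The sketch below uses six rows copied from the printed output above (two per species) instead of the full data set, so it depends only on NumPy.

```python
import numpy as np

# Two rows per species, copied from the printed data
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],   # setosa
              [7.0, 3.2, 4.7, 1.4],
              [6.4, 3.2, 4.5, 1.5],   # versicolor
              [6.5, 3.0, 5.2, 2.0],
              [5.9, 3.0, 5.1, 1.8]])  # virginica
y = np.array([0, 0, 1, 1, 2, 2])

n_class = 1    # versicolor
n_feature = 2  # petal length (cm)

# Rows where y == n_class, column n_feature
petal_length_versicolor = X[y == n_class, n_feature]
print(petal_length_versicolor)  # [4.7 4.5]

# Several classes at once with np.isin
not_setosa = X[np.isin(y, [1, 2])]
print(not_setosa.shape)  # (4, 4)
```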

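k-means clustering

2019-11-10 / tau

Overview

k-means clustering is a clustering method that partitions a given set of data into flat (non-hierarchical) clusters, based on the features of the data and on initial values.

Here its behavior is checked using a KMeansClustering class that implements a simple version of k-means.

Test cases

Basic case

Start with a case where the two clusters are reasonably distinct: points are generated at random inside a circle, and two such groups are placed close to each other.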

x_means[0], y_means[0] = 15, 15
x_means[1], y_means[1] = 20, 15

plot_steps = [0, 1, 2, 4, 6]

As shown below, the overlapping region cannot be helped, but the clustering is quite close to the original groups.

convergion times = 7
[14.567003574164632, 15.215775486294216]
[25.31190419286806, 25.871321239241027]

Changing the initial values

Run again with different initial values for the representative points.

x_means[0], y_means[0] = 25, 25
x_means[1], y_means[1] = 25, 30

plot_steps = [0, 1, 2, 4, 8]

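Even with initial values set far from those above, the clustering comes out the same.

The converged representative points also have exactly the same values as above.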

convergion times = 9
[14.567003574164632, 15.215775486294216]
[25.31190419286806, 25.871321239241027]

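When the clusters are unclear

Judging from the results so far alone, the clustering looks stable even when the initial values are far off.

So next, consider splitting the data into three clusters when the original distribution shows no clear grouping.

Initial values 1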

x_means[0], y_means[0] = 10, 18
x_means[1], y_means[1] = 18, 18
x_means[2], y_means[2] = 25, 18

plot_steps = [0, 1, 4, 8, 12]

convergion times = 13
[11.359345006841108, 15.215511281942952]
[21.33481062455269, 13.05930376628775]
[19.777630465074534, 23.850681586725297]

Initial values 2

Change the initial values from those above.

x_means[0], y_means[0] = 18, 10
x_means[1], y_means[1] = 18, 18
x_means[2], y_means[2] = 18, 25

plot_steps = [0, 1, 4, 8, 10]

The data is the same, but the clusters come out differently.

convergion times = 11
[18.43160436224161, 11.596987217637075]
[12.210575780454143, 19.032933149086162]
[22.418606246186307, 23.158839350112153]

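An extreme case

Next, consider an extreme case in which no clusters can be discerned in the original distribution.

Initial values 1

The initial representative points are lined up vertically, and the data is likewise divided into a lower and an upper cluster.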

x_means[0], y_means[0] = 15, 15
x_means[1], y_means[1] = 15, 20

plot_steps = [0, 1, 3, 5, 8]

convergion times = 9
[19.1770479839392, 13.104941433078057]
[20.361677232475646, 26.9785132838531]

Initial values 2

With exactly the same data but the initial representative points lined up horizontally, the clustering is very different.

x_means[0], y_means[0] = 15, 15
x_means[1], y_means[1] = 20, 15

x_lim = 40
y_lim = 40

plot_steps = [0, 1, 2, 4]

convergion times = 5
[12.933983667295676, 20.22594500107436]
[25.46963894609307, 21.15223763731067]

Three clusters

Finally, try a case in which the clusters in the original data are quite distinct.

Initial values 1

x_means[0], y_means[0] = 10, 15
x_means[1], y_means[1] = 15, 15
x_means[2], y_means[2] = 20, 15

plot_steps = [0, 2, 4, 6, 9]

Even though the initial values start off near one corner, the data separate cleanly into three clusters.

convergion times = 10
[14.610459971900003, 15.428114490958492]
[25.252313933530775, 30.84369233062165]
[34.43506243558404, 14.084017148769334]

Initial values 2

x_means[0], y_means[0] = 25, 25
x_means[1], y_means[1] = 25, 30
x_means[2], y_means[2] = 25, 35

plot_steps = [0, 1, 3, 6, 10]

Even when the positions and arrangement of the initial values differ considerably, the clustering is stable.

convergion times = 11
[34.43506243558404, 14.084017148769334]
[14.665546643275666, 15.615970568228104]
[25.413807412324008, 30.963999916166244]

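Summary

k-means is said to produce solutions that vary with the initial values, but when the clusters are clearly distinct the solution is stable.

Such cases, however, are those with few features, where the distribution is obvious at a glance. When there are many features and the clusters are hard to see at a glance, the result is still likely to depend heavily on the choice of initial values.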
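
One way to put a number on how much two runs differ is to compare their within-cluster sum of squares, the quantity that each k-means step tries to reduce. The sketch below is not part of the original post: the helper name `within_cluster_ss` and the four-point data set are hypothetical and serve only as an illustration.

```python
import numpy as np


def within_cluster_ss(x, y, x_means, y_means, groups):
    """Sum of squared distances from each point to its own representative point."""
    dx = x - x_means[groups]
    dy = y - y_means[groups]
    return float(np.sum(dx**2 + dy**2))


# Hand-made illustration data: four points on a line and two candidate clusterings.
x = np.array([0.0, 2.0, 10.0, 12.0])
y = np.array([0.0, 0.0, 0.0, 0.0])

# Clustering A pairs the two left points and the two right points.
ss_a = within_cluster_ss(x, y, np.array([1.0, 11.0]), np.array([0.0, 0.0]),
                         np.array([0, 0, 1, 1]))
# Clustering B isolates the leftmost point and lumps the other three together.
ss_b = within_cluster_ss(x, y, np.array([0.0, 8.0]), np.array([0.0, 0.0]),
                         np.array([0, 1, 1, 1]))
print(ss_a, ss_b)
```

With the return values of the KMeansClustering class described below, the final state would be `xm[-1]`, `ym[-1]` and `groups[-1]`; the run with the smaller value fits the data more tightly.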

Python3 – KMeansClustering

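2019-11-10 / tau

Overview

The algorithm is simple and proceeds as follows.

Set an initial position for each cluster's representative point, one per cluster
Repeat the following until the positions of the representative points converge
1. For each data point, choose the nearest representative point
2. Compute the centroid of the points that share a representative point and take it as the new position of that representative point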
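
The steps above can be sketched compactly with NumPy broadcasting. This is a minimal illustration and not the class presented in this post: the function name `k_means`, the `max_iter` guard, the tolerance-based stop and the six-point data set are all assumptions made for the example.

```python
import numpy as np


def k_means(points, means, max_iter=100):
    """Sketch of the loop above. points: (n, 2) array, means: (k, 2) array."""
    means = np.asarray(means, dtype=float).copy()
    for _ in range(max_iter):
        # Step 1: squared distance from every point to every representative point.
        d2 = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        groups = d2.argmin(axis=1)
        # Step 2: move each representative point to the centroid of its group
        # (a representative point with no members stays where it is).
        new_means = means.copy()
        for j in range(len(means)):
            members = points[groups == j]
            if len(members) > 0:
                new_means[j] = members.mean(axis=0)
        # Stop once the representative points no longer move.
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, groups


# Two well-separated groups of three points each (hand-made illustration data).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]])
means, groups = k_means(points, [[1.0, 1.0], [9.0, 9.0]])
print(means)
print(groups)
```

Unlike the class in this post, the sketch keeps only the final state and does not record the intermediate steps.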

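This class is a test class that runs k-means clustering on a data set with two features, given initial values for the representative points.

Create an object from the two features x_data and y_data, then call the method with the initial representative points x_means and y_means to obtain the result.

Full code

The full code of the KMeansClustering class is as follows.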

import numpy as np
import sys


class KMeansClustering:
    def __init__(self, x_data, y_data):
        # The two features of the data (1-D ndarrays); never modified.
        self._x = x_data
        self._y = y_data
        # History of the representative points; one row per computation step.
        self._x_means = None
        self._y_means = None
        self._num_data = len(self._x)
        self._num_means = 0
        # History of the cluster assignments; one row per computation step.
        self._groups = np.empty((0, self._num_data), dtype=int)

    def _distance(self, x0, y0, x1, y1):
        # Squared Euclidean distance (sufficient for comparing distances).
        return (x0 - x1)**2 + (y0 - y1)**2

    def _point_converged(self, x0, y0, x1, y1):
        # Converged only if the point did not move at all (exact comparison).
        return True if x0==x1 and y0==y1 else False

    def _all_points_converged(self, x_old, y_old, x_new, y_new):
        for x0, y0, x1, y1 in zip(x_old, y_old, x_new, y_new):
            if not self._point_converged(x0, y0, x1, y1):
                return False
        return True

    def _set_nearest_mean_point(self):
        # For each data point, record the index of the nearest
        # representative point of the latest step.
        groups = np.zeros(self._num_data, int)
        for i, (x, y) in enumerate(zip(self._x, self._y)):
            min_dist = sys.float_info.max
            for j, (xm, ym) in \
                    enumerate(zip(self._x_means[-1], self._y_means[-1])):
                d = self._distance(x, y, xm, ym)
                if d < min_dist:
                    groups[i] = j
                    min_dist = d
        # Append this step's assignment as a new row.
        self._groups = np.vstack((self._groups, groups))

    def _revise_mean_points(self):
        # New representative points: the centroid of each cluster's members.
        x_means_new = np.zeros(self._num_means)
        y_means_new = np.zeros(self._num_means)

        for k in range(self._num_means):
            target = [True if j==k else False for j in self._groups[-1]]
            x_means_new[k] = np.mean(self._x[target])
            y_means_new[k] = np.mean(self._y[target])

        return x_means_new, y_means_new

    def get_result(self, x_means, y_means):
        # Copy the initial values so the caller's arrays are not modified.
        self._x_means = np.array([x_means.copy()])
        self._y_means = np.array([y_means.copy()])
        self._num_means = len(x_means)

        while True:
            self._set_nearest_mean_point()
            xm, ym = self._revise_mean_points()

            # Stop when no representative point moved; otherwise record
            # the new positions and continue.
            if self._all_points_converged( \
                    self._x_means[-1], self._y_means[-1], xm, ym):
                return self._x_means, self._y_means, self._groups
            else:
                self._x_means = np.vstack((self._x_means, xm))
                self._y_means = np.vstack((self._y_means, ym))

        # Not reached: the loop exits only through the return above.
        return self._x_means, self._y_means, self._groups

Usage

Creating the clustering instance

Pass in the data to create an instance of the KMeansClustering class.

clustering = KMeansClustering(x_data, y_data)

The constructor takes the following arguments.

x_data, y_data: arrays (1-D ndarrays) of the features x and y of the data to be clustered.

Running the clustering

Call the clustering method on the instance to obtain the result.

xm, ym, groups = clustering.get_result(x_means, y_means)

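The method takes the following arguments.

x_means, y_means: initial values of the representative points of the k clusters (1-D ndarrays).

The result is returned as the following tuple.

x_means, y_means: arrays of the x and y coordinates of the representative points. The values at every computation step up to convergence are recorded; each row of the 2-D ndarray corresponds to one step.
groups: 2-D ndarray of the cluster (representative point) index each data point belongs to; each row corresponds to one computation step.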
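
As a rough illustration of how these return values can be read, the sketch below uses hand-made arrays that merely have the same layout (k=2, three recorded steps, four data points). The numbers are made up for the example and are not output of the class.

```python
import numpy as np

# Hand-made arrays shaped like the return values (illustrative numbers only).
xm = np.array([[15.0, 20.0], [14.8, 24.0], [14.6, 25.3]])
ym = np.array([[15.0, 15.0], [15.1, 24.5], [15.2, 25.9]])
groups = np.array([[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]])

# Number of recorded computation steps (rows).
n_steps = len(xm)
# Final representative points, one (x, y) row per cluster.
final_points = np.column_stack((xm[-1], ym[-1]))
# Number of data points assigned to each cluster at the last step.
sizes = np.bincount(groups[-1], minlength=xm.shape[1])
print(n_steps)
print(final_points)
print(sizes)
```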

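Example run

The following code tests the KMeansClustering class. It does the following.

Generate two groups of random points scattered within circles and combine them into a single data set
- random_scatter_data is a separately written module that generates a given number of random points inside a circle of given center and radius (its code is listed near the end)
Set the number and positions of the cluster representative points and the parameters for drawing the scatter plots
- The layout is fixed at 2 rows by 3 columns, showing five panels in addition to the initial state
- The script was run once beforehand to check the convergence count on the console; line 33 (plot_steps) then specifies which computation steps to display
Plot the scatter plot of the initial state
Create the KMeansClustering object that performs the clustering and call the method that returns the result (lines 53-54)
Plot the results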

import numpy as np
import matplotlib.pyplot as plt
import random_scatter_data as rsd
from k_means_clustering import KMeansClustering


##################################################
# generate random points
#
np.random.seed(10)

num_data_1 = 100
num_data_2 = 100

x1, y1 = rsd.scatter_in_circle(15, 15, 10, num=num_data_1)
x2, y2 = rsd.scatter_in_circle(25, 25, 10, num=num_data_2)

x_data = np.hstack((x1, x2))
y_data = np.hstack((y1, y2))

##################################################
# settings
#
k = 2
x_means = np.zeros(k)
y_means = np.zeros(k)
x_means[0], y_means[0] = 15, 15
x_means[1], y_means[1] = 20, 15

x_lim = 40
y_lim = 40

plot_steps = [0, 1, 2, 4, 6]
plot_rows = 2
plot_cols = 3

##################################################
# plot initial state with original cluster
#
fig = plt.figure(figsize=(8, 6))

ax = fig.add_subplot(plot_rows, plot_cols, 1, aspect=1.0)
ax.set_xlim(0, x_lim)
ax.set_ylim(0, y_lim)

ax.scatter(x1, y1, s=5)
ax.scatter(x2, y2, s=5)
ax.scatter(x_means, y_means, c='red', marker='x')

##################################################
# create clustering object and get result
#
clustering = KMeansClustering(x_data, y_data)
xm, ym, groups = clustering.get_result(x_means, y_means)

print("convergion times = {}".format(len(xm)))
for n_mean in range(k):
    print("[{}, {}]".format(xm[-1][n_mean], ym[-1][n_mean]))

##################################################
# plot the states of designated steps
#
for n_plot, n_step in enumerate(plot_steps):
    ax = fig.add_subplot(plot_rows, plot_cols, n_plot + 2, aspect=1.0)
    ax.set_xlim(0, x_lim)
    ax.set_ylim(0, y_lim)
    for j in range(k):
        target = [True if jj==j else False for jj in groups[n_step]]
        ax.scatter(x_data[target], y_data[target], s=5)
        ax.scatter(xm[n_step], ym[n_step], marker='x')

plt.show()

Running this code prints the following to the console.

convergion times = 7
[14.567003574164632, 15.215775486294216]
[25.31190419286806, 25.871321239241027]
[Finished in 3.529s]

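Based on this result, the code above was set to draw the 1st, 2nd, 3rd, 5th and 7th computation results; the resulting figure is as follows.

Because the groups are fairly distinct, the representative points settle into position at an early stage.

Class description

`__init__()`: creating an instance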

    def __init__(self, x_data, y_data):
        self._x = x_data
        self._y = y_data
        self._x_means = None
        self._y_means = None
        self._num_data = len(self._x)
        self._num_means = 0
        self._groups = np.empty((0, self._num_data), dtype=int)

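Create an instance by passing the features x_data and y_data of the data to be clustered as ndarrays.

The private members are as follows.

_x, _y: arrays of the two features of the data to be clustered. They are not modified during the computation.
_x_means, _y_means: arrays in which the computed representative points are accumulated. They start as None because the initial values are supplied when the analysis is run.
_num_data: number of data points.
_num_means: number of representative points (that is, the number of clusters). It starts at 0 because the representative points are supplied when the analysis is run.
_groups: array in which the cluster assignment of each data point is accumulated.

`get_result()`: running the analysis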

    def get_result(self, x_means, y_means):
        self._x_means = np.array([x_means.copy()])
        self._y_means = np.array([y_means.copy()])
        self._num_means = len(x_means)

        while True:
            self._set_nearest_mean_point()
            xm, ym = self._revise_mean_points()

            if self._all_points_converged( \
                    self._x_means[-1], self._y_means[-1], xm, ym):
                return self._x_means, self._y_means, self._groups
            else:
                self._x_means = np.vstack((self._x_means, xm))
                self._y_means = np.vstack((self._y_means, ym))

        return self._x_means, self._y_means, self._groups

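Pass the x and y coordinates of the k representative points as arguments and obtain the result.

Arguments

x_means, y_means: initial x and y values of the k representative points, each given as a 1-D ndarray. The arrays passed in are not modified.

Return values

x_means, y_means: arrays holding the computed representative points. Each row corresponds to one computation step.

groups: array holding the cluster of each point at each computation step. Each row corresponds to one computation step.

Processing

Copy the initial representative points into the private members and set the number of representative points
Repeat the following until the positions of all representative points converge
1. For each data point, set the nearest representative point
2. From the data sharing a representative point, compute the new position of that representative point
3. If the newly computed values match the most recently stored ones, end the loop; otherwise append the new values and continue
Return the computed results and finish

Private methods

`_distance()`: distance between two points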

    def _distance(self, x0, y0, x1, y1):
        return (x0 - x1)**2 + (y0 - y1)**2

Returns the distance between two points; here, the squared Euclidean distance.

`_point_converged()`: convergence check

    def _point_converged(self, x0, y0, x1, y1):
        return True if x0==x1 and y0==y1 else False

Judges from the coordinates of two points whether the point's position has converged.

Strictly speaking, the coordinates are floats, so comparing them with '==' is risky; here the check is kept simple, trusting that convergence is fast and reliable.
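
If a tolerance-based check is preferred over the exact comparison, something along the following lines could be used. It is only a sketch: the standalone function `points_converged` and the default tolerance `atol=1e-9` are assumptions for illustration and are not part of the class.

```python
import numpy as np


def points_converged(x_old, y_old, x_new, y_new, atol=1e-9):
    """True when every representative point moved by at most atol per coordinate."""
    return bool(np.allclose(x_old, x_new, rtol=0.0, atol=atol)
                and np.allclose(y_old, y_new, rtol=0.0, atol=atol))


# 0.1 + 0.2 is not exactly 0.3 in floating point.
a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])
exact = bool(a[0] == b[0])            # exact comparison
tolerant = points_converged(a, a, b, b)  # tolerance-based comparison
print(exact, tolerant)
```

The exact comparison probably works in practice because, once the assignments stop changing, the same points are averaged in the same order and the recomputed means come out identical.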

`_all_points_converged()`: convergence check for all points

    def _all_points_converged(self, x_old, y_old, x_new, y_new):
        for x0, y0, x1, y1 in zip(x_old, y_old, x_new, y_new):
            if not self._point_converged(x0, y0, x1, y1):
                return False
        return True

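Judges whether two sets of points, given as arrays, all satisfy the convergence condition.

`_set_nearest_mean_point()`: nearest representative point for each data point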

    def _set_nearest_mean_point(self):
        groups = np.zeros(self._num_data, int)
        for i, (x, y) in enumerate(zip(self._x, self._y)):
            min_dist = sys.float_info.max
            for j, (xm, ym) in \
                    enumerate(zip(self._x_means[-1], self._y_means[-1])):
                d = self._distance(x, y, xm, ym)
                if d < min_dist:
                    groups[i] = j
                    min_dist = d
        self._groups = np.vstack((self._groups, groups))

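Processing

For each data point, find the nearest representative point and record its index in the 1-D array groups
Append this classification result to the _groups array as a new row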
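
The same nearest-point search can also be written without explicit loops. The following is a sketch under the assumption that all inputs are 1-D ndarrays; the standalone function `nearest_mean_index` is hypothetical and not a method of the class.

```python
import numpy as np


def nearest_mean_index(x, y, x_means, y_means):
    """Index of the closest representative point for every data point."""
    # Broadcasting builds an (n_data, n_means) table of squared distances.
    d2 = (x[:, None] - x_means[None, :])**2 + (y[:, None] - y_means[None, :])**2
    # argmin returns the first minimum, matching the strict '<' of the loop version.
    return d2.argmin(axis=1)


# Hand-made illustration data; the last point is equidistant from both means.
x = np.array([0.0, 1.0, 9.0, 5.0])
y = np.array([0.0, 1.0, 9.0, 5.0])
x_means = np.array([0.0, 10.0])
y_means = np.array([0.0, 10.0])
nearest = nearest_mean_index(x, y, x_means, y_means)
print(nearest)
```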

`_revise_mean_points()`: updating the representative points

    def _revise_mean_points(self):
        x_means_new = np.zeros(self._num_means)
        y_means_new = np.zeros(self._num_means)

        for k in range(self._num_means):
            target = [True if j==k else False for j in self._groups[-1]]
            x_means_new[k] = np.mean(self._x[target])
            y_means_new[k] = np.mean(self._y[target])

        return x_means_new, y_means_new

Computes the centroid of the data belonging to each representative point and returns the centroids as the new representative points.
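
One case worth keeping in mind: if a representative point ends up with no data assigned to it, `np.mean` of the empty selection yields nan, and since nan never compares equal to itself, an exact-equality convergence check would not succeed. A possible guard is sketched below; the standalone function and its choice to keep the previous position are assumptions for illustration, not the behavior of the class.

```python
import numpy as np


def revise_mean_points(x, y, groups, x_means_old, y_means_old):
    """Centroid per cluster; a cluster with no members keeps its old position."""
    x_new = np.array(x_means_old, dtype=float)
    y_new = np.array(y_means_old, dtype=float)
    for k in range(len(x_new)):
        target = groups == k          # boolean mask of the members of cluster k
        if target.any():
            x_new[k] = x[target].mean()
            y_new[k] = y[target].mean()
    return x_new, y_new


# Hand-made illustration data in which cluster 1 received no points.
x = np.array([0.0, 2.0, 4.0])
y = np.array([0.0, 2.0, 4.0])
groups = np.array([0, 0, 0])
x_new, y_new = revise_mean_points(x, y, groups, [1.0, 30.0], [1.0, 30.0])
print(x_new, y_new)
```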

random_scatter_data.py

This is a module written for this post to keep things organized; its contents are as follows.

import numpy as np


def scatter_in_circle(x_org=0, y_org=0, r=1, num=100):
    x = np.zeros(num)
    y = np.zeros(num)

    for i in range(num):
        while True:
            xw = np.random.rand() * 2 * r - r
            yw = np.random.rand() * 2 * r - r
            if xw*xw + yw*yw <= r*r:
                x[i], y[i] = xw + x_org, yw + y_org
                break
    return x, y


def scatter_in_2dnormal(x_org=0, y_org=0, r=1, nsigma=1, num=100):
    x = np.random.randn(num) * r/nsigma + x_org
    y = np.random.randn(num) * r/nsigma + y_org
    return x, y
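
For reference, uniform points in a circle can also be drawn without the rejection loop by sampling in polar coordinates. The variant below is a hypothetical alternative and not part of the module; the name `scatter_in_circle_polar` is made up for this sketch.

```python
import numpy as np


def scatter_in_circle_polar(x_org=0, y_org=0, r=1, num=100):
    """Uniform points in a circle without a rejection loop (hypothetical variant)."""
    # Taking the square root of a uniform variate gives radii whose density
    # grows with the radius, which makes the points uniform over the area.
    radius = r * np.sqrt(np.random.rand(num))
    angle = 2 * np.pi * np.random.rand(num)
    return radius * np.cos(angle) + x_org, radius * np.sin(angle) + y_org


np.random.seed(10)
x, y = scatter_in_circle_polar(15, 15, 10, num=100)
print(x.shape, y.shape)
```

Because it consumes random numbers in a different order, this variant does not reproduce the points of scatter_in_circle for the same seed.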
