Ruby – Console output

2020-11-26 / tau

puts

puts prints the contents of an object and ends each call with a newline. When given an array, it prints each element on its own line.

puts("Ruby")
puts("Diamond")
puts(["one", "two", 3])

# Ruby
# Diamond
# one
# two
# 3

print

print prints the contents of an object without appending a newline.

print("Ruby")
print("Diamond")
print(["one", "two", 3])

# RubyDiamond["one", "two", 3]

p

p prints an object in a form that shows its type (strings appear with quotes) and ends each call with a newline.

p("Ruby")
p('Diamond')
p(["one", "two", 3])

# "Ruby"
# "Diamond"
# ["one", "two", 3]

Python – Limiting the number of duplicates of array elements

2020-11-24 / tau

Overview

This is hard to describe in words, so consider a concrete case. Suppose we have the following one-dimensional array.

import numpy as np

np.random.seed(0)
targets = np.random.randint(0, 5, 20)

print(targets)
# [4 0 3 3 3 1 3 2 4 0 0 4 2 1 0 1 1 0 1 4]

print(np.bincount(targets))
# [5 5 2 4 4]

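This array has 20 elements; the values 0 to 4 occur 5, 5, 2, 4, and 4 times respectively, in no particular order.

The goal is to trim the array so that each value occurs at most 3 times.

A typical use case is machine-learning training data where the number of samples varies from target to target and you want to cap the number of samples per target.

If we keep the first three occurrences of each value and discard the rest, the elements whose running count below is 4 or more are the ones dropped.

Value:         4 0 3 3 3 1 3 2 4 0 0 4 2 1 0 1 1 0 1 4
Running count: 1 1 1 2 3 1 4 1 2 2 3 3 2 2 4 3 4 5 5 4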
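
The running count in the table can be reproduced with a short loop. This is only an illustrative sketch: the helper name running_counts is hypothetical, and the array is written out literally instead of being generated from the random seed.

```python
import numpy as np

def running_counts(values):
    """Return, for each element, how many times its value has appeared so far."""
    seen = {}
    counts = []
    for v in values.tolist():
        seen[v] = seen.get(v, 0) + 1
        counts.append(seen[v])
    return np.array(counts)

targets = np.array([4, 0, 3, 3, 3, 1, 3, 2, 4, 0, 0, 4, 2, 1, 0, 1, 1, 0, 1, 4])
counts = running_counts(targets)

# Keep only the elements that are among the first 3 occurrences of their value
kept = targets[counts <= 3]

print(counts)
print(kept)
```

The rest of this post reaches the same result with numpy.where() and a bool mask, which also makes it easy to apply the same selection to a second array.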

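We also want to handle not just one array but a second array whose elements correspond to it, trimming both at the same time. Think of limiting the data by a machine-learning target array while applying the same selection to the array of image data tied to it.

Procedure

Getting the indices for each target

Among the 20 elements of targets, consider id=0 first. Five elements have the value 0, at indices (1, 9, 10, 14, 17). Likewise, five elements have id=1, at indices (5, 13, 15, 16, 18).

Writing out the indices for id=0 to 4 in this way gives the following.

0: 1, 9, 10, 14, 17
1: 5, 13, 15, 16, 18
2: 7, 12
3: 2, 3, 4, 6
4: 0, 8, 11, 19

The index array for each id can be obtained with numpy.where() as follows.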

for id in range(5):
    array = np.where(targets == id)[0]
    print("{}:{}".format(id, array))

# 0:[ 1  9 10 14 17]
# 1:[ 5 13 15 16 18]
# 2:[ 7 12]
# 3:[2 3 4 6]
# 4:[ 0  8 11 19]

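The example above uses range(5) to step the loop variable id through 0 to 4. In general, though, the ids are not necessarily consecutive, and their upper bound is not known in advance.

So we use numpy.unique() to visit every value that appears in targets exactly once. unique() removes duplicates from the array passed to it and returns the values sorted in ascending (or lexicographic) order, so iterating over np.unique(targets) covers each distinct value once.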

print(np.unique(targets))
# [0 1 2 3 4]

for id in np.unique(targets):
    array = np.where(targets == id)[0]
    print("{}:{}".format(id, array))

# 0:[ 1  9 10 14 17]
# 1:[ 5 13 15 16 18]
# 2:[ 7 12]
# 3:[2 3 4 6]
# 4:[ 0  8 11 19]

Limiting the elements to take

Next, we make sure every target has at most 3 elements.

To do so, for each id we pick the 3 elements that appear earliest.

0: 1, 9, 10, 14, 17
1: 5, 13, 15, 16, 18
2: 7, 12
3: 2, 3, 4, 6
4: 0, 8, 11, 19

Taking the first 3 indices of each id is just a slice of the corresponding index array.

for id in np.unique(targets):
    array = np.where(targets == id)[0][:3]
    print("{}:{}".format(id, array))

# 0:[ 1  9 10]
# 1:[ 5 13 15]
# 2:[ 7 12]
# 3:[2 3 4]
# 4:[ 0  8 11]

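This gives the indices in targets to keep so that each value occurs at most 3 times.

Extracting the elements

To limit the number of elements in targets, we keep the elements at the indices selected above and discard the rest. To do this, build a bool array that is True at the positions to keep and False everywhere else, and use it as the index into targets.

Calling this array mask, we prepare it with the same size as targets and all elements False, and then set only the positions of the indices to keep to True.

The code below first creates an all-False bool array of the same size as targets and then, for each id, sets the positions of its first three elements to True (1).

In the first iteration, positions 1, 9, and 10 become True; in the second, positions 5, 13, and 15. With each iteration, more of the positions to keep turn True.

Since False is equivalent to the number 0, numpy.zeros() is used to initialize the bool array. For the same reason, the number 1 is assigned, in place of True, at the positions found by numpy.where().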

mask = np.zeros(targets.size, dtype=bool)
print(mask)

# [False False False False False False False False False False False False
#  False False False False False False False False]

for target in np.unique(targets):
    mask[np.where(targets == target)[0][:3]] = 1
    print(mask)

# [False  True False False False False False False False  True  True False
#  False False False False False False False False]
# [False  True False False False  True False False False  True  True False
#  False  True False  True False False False False]
# [False  True False False False  True False  True False  True  True False
#   True  True False  True False False False False]
# [False  True  True  True  True  True False  True False  True  True False
#   True  True False  True False False False False]
# [ True  True  True  True  True  True False  True  True  True  True  True
#   True  True False  True False False False False]

Finally, index targets with this bool array to get the array of kept elements.

print(targets[mask])
# [4 0 3 3 3 1 2 4 0 0 4 2 1 1]

print(np.bincount(targets[mask]))
# [3 3 2 3 3]

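Operating on other arrays at the same time

The mask can be applied to any array whose first dimension has the same size as targets. For example, an array holding the image data tied to each element of targets can be cut down with the same mask, keeping it consistent with targets.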
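
As a minimal sketch of this, the mask logic can be wrapped in a function and applied to a companion array. The function name limit_per_class and the features array are hypothetical stand-ins (features plays the role of the image data); they do not appear in the procedure above.

```python
import numpy as np

def limit_per_class(targets, max_count):
    """Return a bool mask keeping the first max_count occurrences of each value."""
    mask = np.zeros(targets.shape, dtype=bool)
    for value in np.unique(targets):
        mask[np.where(targets == value)[0][:max_count]] = True
    return mask

targets = np.array([4, 0, 3, 3, 3, 1, 3, 2, 4, 0, 0, 4, 2, 1, 0, 1, 1, 0, 1, 4])

# Hypothetical companion data: one row of 2 values per element of targets
features = np.arange(targets.size * 2).reshape(targets.size, 2)

mask = limit_per_class(targets, 3)
targets_limited = targets[mask]
features_limited = features[mask]   # rows stay aligned with targets_limited

print(targets_limited.shape, features_limited.shape)
print(np.bincount(targets_limited))
```

Because both arrays are indexed with the same mask, row i of features_limited still belongs to element i of targets_limited.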

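PCA – Dimensionality reduction and inverse transform

2020-11-23 / tau

Overview

In principal component analysis (PCA), consider keeping only some of the principal components through dimensionality reduction and then applying the inverse transform.

To state the conclusion first, the following two procedures give the same result.

Transform with all principal components, set the transformed values for the components to be dropped to 0, then inverse-transform
Transform with only the leading principal components that are kept, then inverse-transform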
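
The equivalence can be checked numerically. The sketch below is not the scikit-learn code used in the rest of this post: it swaps in a plain NumPy SVD so that the linear algebra is visible, and the 3-dimensional random data is made up for the check.

```python
import numpy as np

rng = np.random.RandomState(0)
# Made-up data: 200 samples, 3 features with different spreads
X = rng.randn(200, 3) * np.array([3.0, 1.0, 0.3])

# PCA by SVD of the centered data; rows of Vt are the principal axes
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Procedure 1: transform with all components, zero all but the first, invert
Z_full = Xc @ Vt.T
Z_full[:, 1:] = 0
X_rec1 = Z_full @ Vt + mean

# Procedure 2: transform with the first component only, then invert
Z_one = Xc @ Vt[:1].T
X_rec2 = Z_one @ Vt[:1] + mean

print(np.allclose(X_rec1, X_rec2))
```

The zeroed columns contribute nothing to the product in procedure 1, so only the first row of Vt is effectively used, which is exactly what procedure 2 does.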

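A simple example

Procedure using all principal components

Overview

The first example performs the following operations on two-dimensional data.

Transform the data using both principal components
Set the transformed values for the second principal component to 0
Inverse-transform the result

Creating the original data

First create the original data and draw its scatter plot at the upper left.

The original data is a horizontal line with normally distributed noise whose amplitude follows a cosine shape, rotated by 45 degrees.

A single specific point is also defined at an arbitrary position.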

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

np.random.seed(0)

n_data = 200
cs = 1 / np.sqrt(2)
x = np.linspace(-1, 1, n_data)
y = np.random.randn(n_data) * np.cos(x * np.pi / 2) / 6
x, y = cs * x - cs * y, cs * x + cs * y
xp, yp = 0.25, 0.5

X = np.array([x, y]).reshape(-1, 2)
Xp = np.array([xp, yp]).reshape(-1, 2)

fig1, axes1 = plt.subplots(2, 2)
axes1[0, 0].scatter(X[:, 0], X[:, 1], marker='o', s=4)
axes1[0, 0].scatter(Xp[0, 0], Xp[0, 1], marker='x', s=40)

print("Xp original:{}".format(Xp))
# Xp original:[[0.25 0.5 ]]

.....

for axis in axes1.ravel():
    axis.set_aspect('equal')
    axis.set_xlim(-1, 1)
    axis.set_ylim(-1, 1)

plt.show()

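Fitting and transforming the original data

Next, transform the original data with PCA and draw the transformed scatter plot at the upper right.

The diagonal distribution becomes horizontal, because the transform aligns the first principal component with the x-axis.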

.....

pca = PCA().fit(X)
X_trans = pca.transform(X)
Xp_trans = pca.transform(Xp)

axes1[0, 1].scatter(X_trans[:, 0], X_trans[:, 1], marker='o', s=4)
axes1[0, 1].scatter(Xp_trans[0, 0], Xp_trans[0, 1], marker='x', s=40)

print("Xp transformed:{}".format(Xp_trans))
# Xp transformed:[[ 0.53055026 -0.17107022]]

.....

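Removing the second principal component

In the transformed data, set the values for the second principal component to 0 and draw the scatter plot at the lower left.

The vertical component, which corresponds to the second principal component, becomes 0.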

.....

X_trans[:, 1] = 0
Xp_trans[0, 1] = 0
axes1[1, 0].scatter(X_trans[:, 0], X_trans[:, 1], marker='o', s=4)
axes1[1, 0].scatter(Xp_trans[0, 0], Xp_trans[0, 1], marker='x', s=40)

.....

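Inverse transform

Inverse-transform the data whose second principal component was set to 0, and plot it.

Because the second principal component, which is perpendicular to the first, is 0, all points lie on a single straight line.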

X_inv = pca.inverse_transform(X_trans)
Xp_inv = pca.inverse_transform(Xp_trans)
axes1[1, 1].scatter(X_inv[:, 0], X_inv[:, 1], marker='o', s=4)
axes1[1, 1].scatter(Xp_inv[0, 0], Xp_inv[0, 1], marker='x', s=40)

print("Xp inversed:{}".format(Xp_inv))
# Xp inversed:[[0.37112019 0.37919056]]

.....

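Procedure that limits the principal components from the start

Overview

The next example does not modify the transformed data by hand; it follows the steps below.

Create the PCA model with n_components=1
Transform the original data with that PCA model
Inverse-transform the transformed data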

fig2, axes2 = plt.subplots(1, 2)

axes2[0].scatter(X[:, 0], X[:, 1], marker='o', s=4)
axes2[0].scatter(Xp[0, 0], Xp[0, 1], marker='x', s=40)

print("Xp original:{}".format(Xp))
# Xp original:[[0.25 0.5 ]]

pca = PCA(n_components=1).fit(X)
X_trans = pca.transform(X)
Xp_trans = pca.transform(Xp)

print("Xp transformed:{}".format(Xp_trans))
# Xp transformed:[[0.53055026]]

X_inv = pca.inverse_transform(X_trans)
Xp_inv = pca.inverse_transform(Xp_trans)
axes2[1].scatter(X_inv[:, 0], X_inv[:, 1], marker='o', s=4)
axes2[1].scatter(Xp_inv[0, 0], Xp_inv[0, 1], marker='x', s=40)

print("Xp inversed:{}".format(Xp_inv))
# Xp inversed:[[0.37112019 0.37919056]]

for axis in axes2.ravel():
    axis.set_aspect('equal')
    axis.set_xlim(-1, 1)
    axis.set_ylim(-1, 1)

plt.show()

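Summary

The values of Xp printed along the way are exactly the same in the two procedures.

Limiting the principal components from the start lets you reduce dimensionality without having to zero out individual components yourself.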

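LFW dataset – k-nearest neighbors (with PCA transform)

2020-11-23 / tau

Overview

The principal component analysis (PCA) part of the book "Pythonではじめる機械学習" (the Japanese edition of "Introduction to Machine Learning with Python") checks the accuracy of k-nearest neighbors on face images of well-known people (the LFW people dataset).

Load the LFW people data, restricted to people with at least 20 images
Limit each person to at most 50 images (see the separate post on limiting the number of duplicates of array elements for the procedure)
For the resulting 2,063 images, the feature data (X_people) holds each 87×65-pixel image flattened into 5,655 values, and the target data (y_people) holds the person id of each image
Split the image data into training and test data
Predicting on this dataset as is with 1-nn gives a score of about 0.2, which is not very good
Applying 1-nn to the data transformed by PCA with 100 principal components gives a score of 0.31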
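
For reference, here is what 1-nn does when it predicts: each test sample takes the label of the closest training sample. This is a minimal NumPy sketch with tiny made-up data, not the KNeighborsClassifier implementation; the function name predict_1nn is hypothetical.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_test):
    """Label each test row with the label of its nearest training row."""
    # Squared Euclidean distances, shape (n_test, n_train)
    diff = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    return y_train[dist2.argmin(axis=1)]

# Made-up 2-D data with two classes
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
X_test = np.array([[0.2, 0.1], [4.0, 4.5]])
y_test = np.array([0, 1])

pred = predict_1nn(X_train, y_train, X_test)
accuracy = (pred == y_test).mean()
print(pred, accuracy)
```

With 5,655 raw pixel values per image, these distances are dominated by pixel-level differences such as lighting and alignment, which is one plausible reason the raw score is low.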

The code example passes whiten=True when creating the PCA instance. Without it, the score after the transform was 0.23, which is no improvement.
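
Whitening rescales each principal component to unit variance. The sketch below shows that rescaling with a plain NumPy SVD on made-up data; it is an illustration of the idea under that assumption, not the scikit-learn code path.

```python
import numpy as np

rng = np.random.RandomState(0)
# Made-up data whose features have very different spreads
X = rng.randn(500, 4) * np.array([10.0, 5.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Plain PCA scores: per-component standard deviation is S / sqrt(n - 1)
Z = Xc @ Vt.T

# Whitened scores: divide each component by its standard deviation
Z_white = Z / (S / np.sqrt(X.shape[0] - 1))

print(Z.std(axis=0, ddof=1))
print(Z_white.std(axis=0, ddof=1))
```

Since 1-nn compares Euclidean distances, this rescaling changes which neighbor is nearest: without it the first few components dominate the distance, which may be why the score only improved with whiten=True.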

Note that the scores differ from those printed in the book; the image data itself also differs between the book and this run.

Code and results

import numpy as np
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA

ds = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

mask = np.zeros(ds.target.shape, dtype=bool)
for target in np.unique(ds.target):
    mask[np.where(ds.target == target)[0][:50]] = 1

X_people = ds.data[mask]
y_people = ds.target[mask]

X_people /= 255

X_train, X_test, y_train, y_test = train_test_split(
    X_people, y_people, stratify=y_people, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test score (1-nn): {:.2f}".format(knn.score(X_test, y_test)))

pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

knn.fit(X_train_pca, y_train)
print("Test score (1-nn) w/ PCA: {:.2f}".format(knn.score(X_test_pca, y_test)))

print("\nUsed data shape")
print(X_train.shape, X_test.shape)
print(X_train_pca.shape, X_test_pca.shape)
print(y_train.shape, y_test.shape)

# Test score (1-nn): 0.23
# Test score (1-nn) w/ PCA: 0.31
# 
# Used data shape
# (1547, 5655) (516, 5655)
# (1547, 100) (516, 100)
# (1547,) (516,)

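LFW people dataset

2020-11-23 / tau

Overview

The LFW people dataset, available through Scikit-learn, is a collection of face images of well-known people from around the world.

Each person has between 1 and 530 images associated with them.

LFW stands for "Labeled Faces in the Wild"; "in the Wild" apparently carries the nuance of "out in circulation".

Unlike datasets such as Iris or Boston, the data is not stored locally when Scikit-learn is installed. It is downloaded and stored locally the first time it is loaded, and the local copy is used from then on.

[Note]

Changing the resize argument of fetch_lfw_people() appears to trigger the data download again each time, so to save time between runs it is better to settle on one resize value.

Getting the data

The data is loaded with the following steps.

Import sklearn.datasets.fetch_lfw_people
Load the dataset as a Bunch object with fetch_lfw_people()
- The first time fetch_lfw_people() runs, the data is downloaded to the local machine (this takes a few minutes)
- Once downloaded, the local data is used
- The local data is stored in the scikit_learn_data directory under the logged-in user's home directory

Data structure

The dataset is a Bunch object, and its contents can be accessed through dictionary-style keys and values.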

from sklearn.datasets import fetch_lfw_people

ds = fetch_lfw_people()

for key, value in zip(ds.keys(), ds.values()):
    print("{}:\n{}\n".format(key, value))

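The contents are as follows.

data and images hold the face image data as arrays of different shapes
target is the person id of each face image
target_names is the person name for each id
DESCR is a description of the dataset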

data:
[[ 34.        29.333334  22.333334 ...  14.666667  16.        14.      ]
 [158.       160.66667  169.66667  ... 138.66667  135.66667  130.33333 ]
 [ 76.666664  81.        87.666664 ... 191.66667  145.33333   66.      ]
 ...
 [ 38.333332  41.666668  55.       ...  66.        63.333332  54.666668]
 [ 16.666666  24.333334  60.333332 ... 219.       144.        69.      ]
 [ 58.333332  48.        20.       ... 116.       105.666664 143.66667 ]]

images:
[[[ 34.        29.333334  22.333334 ...  20.        26.        31.      ]
  [ 37.333332  32.        25.333334 ...  21.        27.        32.      ]
  [ 33.333332  32.        40.333332 ...  23.333334  28.666666  35.666668]
  ...
  [166.        96.666664  44.666668 ...   9.333333  14.        12.      ]
  [ 64.333336  39.        30.333334 ...  13.        16.        14.      ]
  [ 30.333334  29.        26.333334 ...  14.666667  16.        14.      ]]

 ...

 [[ 58.333332  48.        20.       ...  66.       101.666664  94.666664]
  [ 62.333332  32.666668  26.333334 ...  50.666668  90.       101.666664]
  [ 56.        29.333334  47.333332 ...  55.333332  76.666664 106.666664]
  ...
  [116.333336 107.        95.       ... 113.333336 100.666664  88.333336]
  [116.666664 105.        94.       ... 116.       103.666664 111.666664]
  [116.       103.333336  95.333336 ... 116.       105.666664 143.66667 ]]]

target:
[5360 3434 3807 ... 2175  373 2941]

target_names:
['AJ Cook' 'AJ Lamas' 'Aaron Eckhart' ... 'Zumrati Juma' 'Zurab Tsereteli'
 'Zydrunas Ilgauskas']

DESCR:
.. _labeled_faces_in_the_wild_dataset:

The Labeled Faces in the Wild face recognition dataset
------------------------------------------------------

This dataset is a collection of JPEG pictures of famous people collected
over the internet, all details are available on the official website:

.....

Examples
~~~~~~~~

:ref:`sphx_glr_auto_examples_applications_plot_face_recognition.py`


Contents of the data

`target_names` – the target people

The names of the target people are stored in target_names; as of 2020-11-22 it holds 5,749 unique names.

The index of a name is the id of the target.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_lfw_people

ds = fetch_lfw_people()

print(ds.target_names.shape)
# (5749,)

print(ds.target_names)
# ['AJ Cook' 'AJ Lamas' 'Aaron Eckhart' ... 'Zumrati Juma' 'Zurab Tsereteli'
#  'Zydrunas Ilgauskas']

`target` – the target ids

The target ids are stored in target as a one-dimensional array.

Some people have several different face images, so target holds 13,233 entries in total.

These ids tie each face image to the name of the person.

print(ds.target.shape)
# (13233,)

print(ds.target)
# [5360 3434 3807 ... 2175  373 2941]

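`images` – pixel data of the face images

images holds the data of each face image as a two-dimensional array of pixel values.

The index along its first axis matches the index of target, so the person in each face image can be identified from the corresponding element of target.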

print(ds.images.shape)
print(ds.images)

The data has the following structure: 13,233 images, each stored as a 62×47 grayscale array.

(13233, 62, 47)

The contents of the three-dimensional array are as follows.

(13233, 62, 47)
[[[ 34.        29.333334  22.333334 ...  20.        26.        31.      ]
  [ 37.333332  32.        25.333334 ...  21.        27.        32.      ]
  [ 33.333332  32.        40.333332 ...  23.333334  28.666666  35.666668]
  ...
  [166.        96.666664  44.666668 ...   9.333333  14.        12.      ]
  [ 64.333336  39.        30.333334 ...  13.        16.        14.      ]
  [ 30.333334  29.        26.333334 ...  14.666667  16.        14.      ]]

 [[158.       160.66667  169.66667  ...  74.666664  28.        16.      ]
  [155.66667  156.       163.33333  ...  83.        25.666666  14.      ]
  [146.66667  144.       145.       ...  82.333336  25.666666  14.666667]
  ...
  [118.666664 120.       170.       ... 131.33333  126.666664 125.666664]
  [125.       117.666664 140.33333  ... 133.66667  132.       129.33333 ]
  [128.33333  123.       122.       ... 138.66667  135.66667  130.33333 ]]

 ...

 [[ 58.333332  48.        20.       ...  66.       101.666664  94.666664]
  [ 62.333332  32.666668  26.333334 ...  50.666668  90.       101.666664]
  [ 56.        29.333334  47.333332 ...  55.333332  76.666664 106.666664]
  ...
  [116.333336 107.        95.       ... 113.333336 100.666664  88.333336]
  [116.666664 105.        94.       ... 116.       103.666664 111.666664]
  [116.       103.333336  95.333336 ... 116.       105.666664 143.66667 ]]]

`data` – flattened face image data

data holds the pixel data of each face image flattened to one dimension.

As with images, each row is tied to a person through target.

print(ds.data.shape)
print(ds.data)

The data has the following structure: 13,233 rows, each holding a two-dimensional image flattened to one dimension (62×47 = 2914).

(13233, 2914)

The contents of the two-dimensional array are as follows.

[[ 34.        29.333334  22.333334 ...  14.666667  16.        14.      ]
 [158.       160.66667  169.66667  ... 138.66667  135.66667  130.33333 ]
 [ 76.666664  81.        87.666664 ... 191.66667  145.33333   66.      ]
 ...
 [ 38.333332  41.666668  55.       ...  66.        63.333332  54.666668]
 [ 16.666666  24.333334  60.333332 ... 219.       144.        69.      ]
 [ 58.333332  48.        20.       ... 116.       105.666664 143.66667 ]]
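
The relation between images and data can be sketched with a small stand-in array. The example assumes row-by-row flattening, which is consistent with the printed values (the first row of data starts like the first row of images[0] and ends like its last row). The array here is made up, so nothing is downloaded.

```python
import numpy as np

# Stand-in for ds.images: 3 made-up "images" of 62x47 pixels
n_images, height, width = 3, 62, 47
images = np.arange(n_images * height * width, dtype=np.float32)
images = images.reshape(n_images, height, width)

# Flatten each image into one row, as in ds.data
data = images.reshape(n_images, -1)
print(data.shape)   # (3, 2914)

# A row of data can be turned back into a 2-D image for display
restored = data[1].reshape(height, width)
print(np.array_equal(restored, images[1]))
```

The flat form is what estimators such as PCA and k-nearest neighbors expect, while the 2-D form is what imshow() needs.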

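Data overview

Checking the face images

Let's look at the face image data. Following the example in the book "Pythonではじめる機械学習", this takes the first 10 images from the data restricted to people with at least 20 images and displays them.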

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people

people = fetch_lfw_people(min_faces_per_person=20)

print(people.images[0].shape)

fig, axes_list = plt.subplots(2, 5, figsize=(6.4, 3.2))
fig.subplots_adjust(hspace=0.5)

for target, image, axes in zip(people.target, people.images, axes_list.ravel()):
    axes.imshow(image, 'gray')
    axes.set_title(people.target_names[target], fontsize=8)

plt.show()

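The order of the people matches the book, but the individual face images differ. The data appears to have been added to or updated after the book was written.

A bird's-eye view of the data

Let's tally the whole image set by the number of images per person.

Most people have only one face image, while former President George W. Bush has by far the most.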

df = pd.DataFrame()
df['name'] = ds.target_names
df['counts'] = np.bincount(ds.target)
print(df.sort_values('counts'))

#                    name  counts
# 0               AJ Cook       1
# 3518     Marina Canetti       1
# 3513  Marie-Josee Croze       1
# 3512       Marie Haghal       1
# 3511  Maribel Dominguez       1
# ...                 ...     ...
# 1892  Gerhard Schroeder     109
# 1404    Donald Rumsfeld     121
# 5458         Tony Blair     144
# 1047       Colin Powell     236
# 1871      George W Bush     530
# 
# [5749 rows x 2 columns]

Next, let's count how many people there are for each number of face images.

print(np.bincount(df['counts']))

# [   0 4069  779  291  187  112   55   39   33   26   15   16   10   11
#    10   11    3    8    5    7    5    4    5    3    3    1    2    1
#     2    2    2    2    3    3    0    1    1    1    0    2    0    2
#     2    0    1    0    0    0    1    1    0    0    2    1    0    1
#     0    0    0    0    1    0    0    0    0    0    0    0    0    0
#     0    1    0    0    0    0    0    1    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    1    0    0
#     0    0    0    0    0    0    0    0    0    1    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    1    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    1    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#
# .....この間はすべてゼロ
#
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    1]

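The array above is one-dimensional with 531 elements for 0 to 530. The index is the number of images, and the value is the number of people who have that many images.

More than 4,000 people have a single face image, so the large majority of people have only one. From 2 images upward, the number of people keeps dropping.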
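
Applying bincount twice can be hard to follow, so here is the same idea on a tiny made-up target array. The variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical person ids for 8 images of 5 people
target = np.array([0, 1, 1, 2, 3, 3, 3, 4])

# First bincount: index = person id, value = number of images of that person
images_per_person = np.bincount(target)
print(images_per_person)   # [1 2 1 3 1]

# Second bincount: index = number of images, value = number of people
people_per_count = np.bincount(images_per_person)
print(people_per_count)    # [0 3 1 1]
```

Here three people have 1 image, one person has 2, and one has 3; index 0 is always 0 as long as every person has at least one image.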

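Loading parameters

`resize` – changing the image size (triggers a reload)

The resize argument of fetch_lfw_people() sets the size of the image data.

The default is 0.5, which gives 62×47 images. The book "Pythonではじめる機械学習" uses resize=0.7, which gives 87×65.

On my machine, resize=1.0 raised an error, possibly because of memory constraints.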

from sklearn.datasets import fetch_lfw_people

ds = fetch_lfw_people()

print(ds.target_names.size)
print(ds.target.size)
print(ds.images.shape)

# 5749
# 13233
# (13233, 62, 47)

ds = fetch_lfw_people(resize=0.7)

print(ds.target_names.size)
print(ds.target.size)
print(ds.images.shape)

# 5749
# 13233
# (13233, 87, 65)

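`min_faces_per_person` – minimum number of images per person

When the analysis needs several images per person, this specifies the minimum number of images a person must have.

Only people with at least this many images, and their image data, are extracted.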

from sklearn.datasets import fetch_lfw_people

ds = fetch_lfw_people(min_faces_per_person=0, resize=0.7)

print(ds.target_names.size)
print(ds.target.size)
print(ds.images.shape)

# 5749
# 13233
# (13233, 87, 65)

ds = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

print(ds.target_names.size)
print(ds.target.size)
print(ds.images.shape)

# 62
# 3023
# (3023, 87, 65)

ds = fetch_lfw_people(min_faces_per_person=100, resize=0.7)

print(ds.target_names.size)
print(ds.target.size)
print(ds.images.shape)

# 5
# 1140
# (1140, 87, 65)
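
The effect of min_faces_per_person can be imitated on a small made-up target array. This is a sketch of the idea only, not how scikit-learn implements it; the function name filter_min_faces is hypothetical.

```python
import numpy as np

def filter_min_faces(target, min_faces):
    """Return a mask of images whose person has at least min_faces images."""
    counts = np.bincount(target)
    kept_ids = np.where(counts >= min_faces)[0]
    mask = np.isin(target, kept_ids)
    return mask, kept_ids

# Hypothetical person ids for 8 images of 5 people
target = np.array([0, 1, 1, 2, 3, 3, 3, 4])

mask, kept_ids = filter_min_faces(target, 2)
print(kept_ids)        # [1 3]
print(target[mask])    # [1 1 3 3 3]
```

The same mask could then be applied to the image arrays, in the same way as the mask used for capping the number of images per person.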

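numpy.where – Searching for indices

2020-11-22 / tau

Overview

The main uses of numpy.where() are as follows.

Get the indices of the array elements that satisfy a condition
Choose each element from one of two arrays according to a condition on the array elements

Basic behavior

Picking a value according to a condition

It can be used like a ternary operator, as shown below.

print(np.where(True, "T", "F"))
# T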

print(np.where(False, "T", "F"))
# F

Numbers can be used in place of True/False.

print(np.where(1, "T", "F"))
# T

print(np.where(0, "T", "F"))
# F

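Getting indices from a `bool` array

When a bool array is passed as the only argument, where() returns the indices of the True elements. Numbers can be used in place of True/False.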

print(np.where([True, False, False, True]))
# (array([0, 3], dtype=int32),)

print(np.where([1, 0, 0, 1]))
# (array([0, 3], dtype=int32),)

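When two arrays of the same shape as the bool array are added as the second and third arguments, where() returns an array in which each element is taken from the second argument where the bool array is True and from the third argument where it is False.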

print(np.where([True, False, False, True], [1, 2, 3, 4], ["A", "B", "C", "D"]))
# ['1' 'B' 'C' '4']

print(np.where([1, 0, 0, 1], [1, 2, 3, 4], ["A", "B", "C", "D"]))
# ['1' 'B' 'C' '4']

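In the example above, True (1) is at positions 0 and 3 and False (0) at positions 1 and 2, so positions 0 and 3 of the result take the corresponding elements of the second argument ([1, 2, 3, 4]) and positions 1 and 2 take those of the third argument (["A", "B", "C", "D"]). Because the two source arrays mix integers and strings, every element of the result is a string.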
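
The selection rule can be sketched as an explicit element-wise loop. This is a minimal illustration rather than how NumPy implements it; the arrays cond, x and y are made up for the example and use integers only, so the result keeps a numeric dtype.

```python
import numpy as np

cond = np.array([True, False, False, True])
x = np.array([1, 2, 3, 4])
y = np.array([10, 20, 30, 40])

# np.where(cond, x, y) picks x[i] where cond[i] is true, otherwise y[i].
selected = np.where(cond, x, y)

# The same selection written as an explicit element-wise loop.
expected = np.array([xv if c else yv for c, xv, yv in zip(cond, x, y)])

print(selected)
print(np.array_equal(selected, expected))
```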

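Usage: indices of elements that match a condition

When one element matches

Passing a conditional expression on an array's elements to where() yields the indices of the elements that satisfy the condition.

Note that the return value is a tuple with one ndarray per dimension of the input, so for a 1-D array the indices are stored in the ndarray that is the tuple's first (and only) element.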

import numpy as np

names = np.array(["JPN", "USA", "GBR", "FRA", "JPN", "KOR", "JPN", "CHN"])

print(np.where(names=="FRA"))

# (array([3], dtype=int32),)
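
The tuple wrapper makes more sense with a 2-D input. The sketch below uses a made-up array called grid to show that np.where() then returns one index array per dimension.

```python
import numpy as np

grid = np.array([[0, 5, 0],
                 [7, 0, 9]])

# For a 2-D input the tuple holds one index array per dimension.
rows, cols = np.where(grid > 0)
print(rows)  # row indices of the matching elements
print(cols)  # column indices of the matching elements

# Pairing them up gives the (row, column) position of each match.
positions = list(zip(rows.tolist(), cols.tolist()))
print(positions)
```

For a 1-D array there is only one dimension, hence the one-element tuple seen above.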

That ndarray has a single element, whose value is the index of "FRA".

print(np.where(names=="FRA")[0])

# [3]

To get the index as a number, take the element out of this ndarray.

print(np.where(names=="FRA")[0][0])

# 3

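When several elements match

The array above contains "JPN" three times. When several elements satisfy the condition like this, the indices are returned as an array.

Here too the return value is a one-element tuple, and the array of interest is stored as its first element.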

print(np.where(names=="JPN"))

# (array([0, 4, 6], dtype=int32),)

To use the index array, take the first element of the tuple.

print(np.where(names=="JPN")[0])

# [0 4 6]
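
A typical next step is to use the index array on a second array that runs parallel to the first. In the sketch below the scores array is hypothetical and exists only for illustration. np.flatnonzero() is shown as an alternative that returns the 1-D index array directly.

```python
import numpy as np

names = np.array(["JPN", "USA", "GBR", "FRA", "JPN", "KOR", "JPN", "CHN"])
# Hypothetical parallel array: one score per entry in `names`.
scores = np.array([10, 20, 30, 40, 50, 60, 70, 80])

idx = np.where(names == "JPN")[0]
print(idx)

# Fancy indexing with the index array selects the matching entries.
jpn_scores = scores[idx]
print(jpn_scores)

# np.flatnonzero() returns the same 1-D index array without the tuple wrapper.
print(np.array_equal(idx, np.flatnonzero(names == "JPN")))
```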

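Selecting array elements by condition

Pass a conditional expression on an array together with the two arrays to choose from depending on whether the condition is true or false. This is awkward to put into words, so here is an example.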

x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 20, 30, 40, 50])
print(np.where(x % 2 == 1, x, y))

# [ 1 20  3 40  5]

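Where an element of x is odd, the element at the same position is taken from x; where it is even, it is taken from y. The selected elements form the result array.

Here is another example.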

a = np.arange(10)
print(np.where(a % 2 == 0, a / 2, (a - 1) / 2))

# [0. 0. 1. 1. 2. 2. 3. 3. 4. 4.]

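This example uses the same array for the condition, the true case and the false case. Where an element is even the result holds half of it, and where it is odd the result holds half of the element minus 1. The outcome is an array in which each number appears twice in a row.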
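
As a cross-check, the same values come out of integer floor division. The sketch below is only a comparison and not part of the original example; it confirms that the np.where() result matches a // 2, apart from being float instead of integer.

```python
import numpy as np

a = np.arange(10)

halved = np.where(a % 2 == 0, a / 2, (a - 1) / 2)

# Floor division yields the same values, as integers instead of floats.
print(a // 2)
print(np.array_equal(halved, a // 2))
```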

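numpy – bincount

2020-11-21 / tau

Overview

Specification of the numpy.bincount() function:

Takes an array of non-negative integers as its argument
Counts how many elements in the array share each value
Returns an array indexed from 0 to the largest value in the input, where each element is the number of occurrences of its index
Optionally accepts a weight for each element of the input

Usage

Basic form

Counts equal values in the integer array passed as the argument and returns an array holding the count for each value.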

import numpy as np

a = np.array([0, 1, 1, 2, 2, 2, 3, 3])
print(np.bincount(a))

# [1 2 3 2]

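The result above means: one 0, two 1s, three 2s and two 3s.

When values are skipped

Counts are produced for every integer from 0 to the maximum value in the argument array. A value that does not occur gets a count of 0.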

b = np.array([1, 3, 3, 5, 5, 5])
print(np.bincount(b))

# [0 1 0 2 0 3]

In the example above the counts cover 0 through 5, and 0, 2 and 4 do not occur in the array, so their counts are 0.
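
The length of the result follows from the largest value, which the sketch below checks. It also shows the minlength parameter of np.bincount(), which the text above does not cover; it pads the result when a fixed number of bins is needed.

```python
import numpy as np

b = np.array([1, 3, 3, 5, 5, 5])

counts = np.bincount(b)
# The result has max(b) + 1 entries, one per integer from 0 to max(b).
print(len(counts) == b.max() + 1)

# minlength pads the result with zeros up to the requested number of bins.
padded = np.bincount(b, minlength=8)
print(padded)
```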

Unordered input

The elements of the argument array do not need to be in ascending order.

c = np.array([2, 2, 3, 3, 3, 1])
print(np.bincount(c))

# [0 1 2 3]
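
For comparison, np.unique() with return_counts=True gives the same counts but lists only the values that actually occur. The sketch below is an aside, not part of the original post.

```python
import numpy as np

c = np.array([2, 2, 3, 3, 3, 1])

# np.unique() sorts the distinct values and can return their counts.
values, counts = np.unique(c, return_counts=True)
print(values)
print(counts)

# bincount reports the same counts at the positions of those values.
print(np.array_equal(np.bincount(c)[values], counts))
```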

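Meaning of `weights`

When the weights argument is specified:

Pass a weights array with the same number of elements as the data array.
Instead of adding 1 for each element counted, the weight at that element's position is added.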

w = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
d = np.array([1, 2, 4, 2, 4, 4])
print(np.bincount(d, weights=w))

# [0.  0.1 0.6 0.  1.4]

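The example above works as follows.

0 does not occur, so its total is 0
1 occurs only once, at index 0, and the weight at that position is 0.1
2 occurs at indices 1 and 3, so the weights 0.2 (index 1) and 0.4 (index 3) add up to 0.6
3 does not occur, so its total is 0
4 occurs at indices 2, 4 and 5, so the weights 0.3 (index 2), 0.5 (index 4) and 0.6 (index 5) add up to 1.4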
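
The walkthrough above can be written as a plain accumulation loop. This is a minimal sketch of the idea, not NumPy's implementation; the name totals is introduced here for illustration.

```python
import numpy as np

w = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
d = np.array([1, 2, 4, 2, 4, 4])

# Manual equivalent: add each element's weight to the bin named by its value.
totals = np.zeros(d.max() + 1)
for value, weight in zip(d, w):
    totals[value] += weight

print(totals)
print(np.allclose(totals, np.bincount(d, weights=w)))
```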

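pyplot.imshow – Displaying images

2020-11-21 / tau

Overview

matplotlib.pyplot.imshow() is a function for displaying images. Its input is an image loaded from a file (for example a PIL Image) or an array holding image data.

It can be called directly on pyplot or on a subplot, and also as a method of an Axes object.

Depending on the default range assumed for pixel data and the range of the data actually passed, the result can be unexpected, so for scalar (2-D) data it is safer to set vmin and vmax explicitly.

Displaying an image file

The following code loads a JPEG file and displays it.

Here imshow() is called as a method of the object returned by pyplot.subplot. With a single image, pyplot.imshow() works as well.

One call passes the loaded image as it is, and the other passes the image converted to an array. The shape of the image array is discussed later.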

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

image = Image.open("./coffeemill.jpg")

image_array = np.asarray(image)

plt.subplot(121).imshow(image)
plt.subplot(122).imshow(image_array)
plt.show()

print(image_array.shape)
# (871, 653, 3)

print(image_array)
[[[171 147 111]
  [167 143 107]
  [170 147 113]
  ...
  [ 93  82  86]
  [ 92  81  87]
  [ 93  83  91]]

 [[170 146 110]
  [167 143 107]
  [170 147 113]
  ...
  [ 92  81  87]
  [ 85  76  81]
  [ 77  67  75]]

  .....

 [[ 36  23  15]
  [ 37  24  16]
  [ 41  26  21]
  ...
  [248 196 123]
  [249 197 124]
  [247 195 122]]]

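Displaying an array as an image

Basic form

imshow() accepts an array as its argument.

The following example displays a 2-D array of 2×2 = 4 elements with a colormap specified. The minimum value 0 maps to the blue end of the bwr colormap and the maximum value 255 to the red end, and values in between get the colormap color that corresponds to their magnitude (the default cmap is viridis).

In this example imshow() is called directly from pyplot.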

import numpy as np
import matplotlib.pyplot as plt

im_array = np.array([
    [0, 85],
    [170, 255]
])

plt.imshow(im_array, cmap='bwr')

plt.show()

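Range

When passing an array to imshow() for drawing, pay attention to the range of its values.

By default, imshow() maps the minimum and maximum values in the array to the lower and upper ends of the colormap and maps the values in between linearly.

In this example a 2-D list is passed as the array-like, and imshow() is called from an Axes.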

import numpy as np
import matplotlib.pyplot as plt

images = []
images.append([[0, 0.5, 1]])
images.append([[0, 127, 255]])
images.append([[-1, 0, 1]])
images.append([[1000, 1000.5, 1001]])
print(images)

fig, axs = plt.subplots(2, 2)

for image, ax in zip(images, axs.flatten()):
    ax.imshow(image, cmap='bwr')

plt.show()

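The four arrays have different minimum and maximum values, and each also contains the value midway between them. Although the values differ, in every case the minimum is drawn in blue (the lower end of the bwr colormap), the maximum in red (the upper end) and the middle value in white (any resemblance to the French flag is unintentional).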
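
The reason all four panels look alike can be reproduced with plain NumPy. The helper to_unit_range below is hypothetical. It mimics the default linear min-to-max scaling described above and does not call matplotlib itself.

```python
import numpy as np

images = [
    [[0, 0.5, 1]],
    [[0, 127, 255]],
    [[-1, 0, 1]],
    [[1000, 1000.5, 1001]],
]

def to_unit_range(values):
    """Linearly rescale so the minimum maps to 0 and the maximum to 1."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.min()) / (arr.max() - arr.min())

normalized = [to_unit_range(image) for image in images]
for arr in normalized:
    print(arr)
```

Each array ends up at approximately 0, 0.5 and 1 on the colormap, which corresponds to blue, white and red in bwr.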

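vmin and vmax

When vmin and vmax are set in the arguments of imshow(), they are mapped to the lower and upper ends of the colormap regardless of the values in the array.

The following example draws a two-element array with minimum 0 and maximum 1 using the bwr colormap while varying vmin and vmax.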

import numpy as np
import matplotlib.pyplot as plt

image = [[0, 1]]

fig, axs = plt.subplots(2, 2)

axs[0, 0].imshow(image, cmap='bwr')
axs[0, 0].set_title("Default")
axs[0, 1].imshow(image, cmap='bwr', vmin=0, vmax=2)
axs[0, 1].set_title("vmin=0, vmax=2")
axs[1, 0].imshow(image, cmap='bwr', vmin=-1, vmax=2)
axs[1, 0].set_title("vmin=-1, vmax=2")
axs[1, 1].imshow(image, cmap='bwr', vmin=0.2, vmax=0.8)
axs[1, 1].set_title("vmin=0.2, vmax=0.8")

plt.show()

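Top left is the default, so the minimum 0 is blue (the colormap's lower end) and the maximum 1 is red (the upper end).

Top right has vmin=0, the same as the array minimum, but vmax=2. As a result the array's 0 is blue at the lower end of the colormap and the array's 1 is white at its center.

Bottom left additionally sets vmin=-1, so the array's 0 and 1 take the colors one third and two thirds of the way along the colormap.

Bottom right has vmin and vmax inside the range between the array's minimum and maximum. The array's minimum and maximum therefore fall outside the range and are drawn in blue and red, the colors at the lower and upper ends of the colormap.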
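
The four cases reduce to a small calculation. The function colormap_position below is a hypothetical helper that models the linear mapping and the clamping at both ends; it is a sketch of the arithmetic, not matplotlib's own code.

```python
import numpy as np

def colormap_position(values, vmin, vmax):
    """Position on the colormap: 0 is the lower end, 1 the upper end."""
    arr = np.asarray(values, dtype=float)
    # Values outside [vmin, vmax] are clamped to the ends of the colormap.
    return np.clip((arr - vmin) / (vmax - vmin), 0.0, 1.0)

image = [[0, 1]]

default = colormap_position(image, 0, 1)        # default: array min and max
wide_top = colormap_position(image, 0, 2)       # vmin=0, vmax=2
wide_both = colormap_position(image, -1, 2)     # vmin=-1, vmax=2
narrow = colormap_position(image, 0.2, 0.8)     # vmin=0.2, vmax=0.8

for pos in (default, wide_top, wide_both, narrow):
    print(pos)
```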

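RGB

When the array-like has three dimensions, it is interpreted as RGB/RGBA data.

[rows, cols, 3]: when the third dimension has size 3, the data is treated as RGB. The first and second dimensions are taken as the image's rows and columns, and the three entries along the third dimension are the R, G and B values.
[rows, cols, 4]: when the third dimension has size 4, the data is treated as RGBA. The first and second dimensions are taken as the image's rows and columns; the first three entries along the third dimension are the R, G and B values and the fourth is the alpha (opacity) value.

The R, G, B and A values are expected to lie in 0–255 when the array's dtype is an integer type, and in 0–1 when it is float.

The following example does the following.

Sets the pixel data separately for R, G and B, with an image size of 2 rows × 4 columns → shape=(3, 2, 4)
Prepares four arrays to hold the rearranged pixels
Rearranges the pixels in for loops
Displays the images and prints the data

The pixels of the 3-D array are rearranged with plain for loops; there may be a more elegant way (it would have been nice if the (3, rows, cols) layout had been accepted in the first place).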

# --- --B -G- -GB
# R-- R-B RG- RGB

import numpy as np
import matplotlib.pyplot as plt

image_org = np.array([
    [
        [  0,   0,   0,   0],
        [255, 255, 255, 255]
    ],
    [
        [  0,   0, 255, 255],
        [  0,   0, 255, 255]
    ],
    [
        [  0, 255,   0, 255],
        [  0, 255,   0, 255]
    ]
])

row = image_org.shape[1]
col = image_org.shape[2]

image_int1 = np.empty((row, col, 3), dtype=int)
image_int2 = np.empty((row, col, 3), dtype=int)
image_float1 = np.empty((row, col, 3), dtype=float)
image_float2 = np.empty((row, col, 3), dtype=float)

for m in range(row):
    for n in range(col):
        for rgb in range(3):
            image_int1[m, n, rgb] = image_org[rgb, m, n]
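            # Stored as int: 0/255 -> 0 and 255/255 -> 1, so every pixel is almost black
            image_int2[m, n, rgb] = image_org[rgb, m, n] / 255
            # Floats outside 0..1 make imshow clip the data and emit the warning
            # "Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers)."
            image_float1[m, n, rgb] = image_org[rgb, m, n]
            # Floats already within 0..1 are displayed as intended
            image_float2[m, n, rgb] = image_org[rgb, m, n] / 255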

fig, axs = plt.subplots(2, 2)

axs[0, 0].imshow(image_int1)
axs[0, 0].set_title("dtype=int, 0~255")
axs[0, 1].imshow(image_int2)
axs[0, 1].set_title("dtype=int, 0.0~1.0")
axs[1, 0].imshow(image_float1)
axs[1, 0].set_title("dtype=float, 0.0~255.0")
axs[1, 1].imshow(image_float2)
axs[1, 1].set_title("dtype=float, 0.0~1.0")

plt.show()

print("original image shape  :{}".format(image_org.shape))
print("rearranged image shape:{}".format(image_int1.shape))
print("rearranged image:\n{}".format(image_int1))

# Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
# original image shape  :(3, 2, 4)
# rearranged image shape:(2, 4, 3)
# rearranged image:
# [[[  0   0   0]
#   [  0   0 255]
#   [  0 255   0]
#   [  0 255 255]]
# 
#  [[255   0   0]
#   [255   0 255]
#   [255 255   0]
#   [255 255 255]]]
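
One candidate for a more compact rearrangement is to reorder the axes instead of copying pixel by pixel. The sketch below is a suggestion rather than the post's method; it uses ndarray.transpose() and np.moveaxis() on the same (3, rows, cols) data.

```python
import numpy as np

image_org = np.array([
    [[0, 0, 0, 0], [255, 255, 255, 255]],    # R plane
    [[0, 0, 255, 255], [0, 0, 255, 255]],    # G plane
    [[0, 255, 0, 255], [0, 255, 0, 255]],    # B plane
])

# Move the channel axis from the front to the back: (3, rows, cols) -> (rows, cols, 3).
image_rgb = image_org.transpose(1, 2, 0)
print(image_rgb.shape)

# np.moveaxis() expresses the same reordering by naming source and destination.
print(np.array_equal(image_rgb, np.moveaxis(image_org, 0, -1)))
```

The result holds the same values as image_int1 in the loop version, so it can be passed to imshow() in the same way.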

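When the array passed to imshow() has an integer dtype, the expected pixel range is 0–255.

Top left keeps the original array, where R, G and B are each 0 or 255, so the colors are the intended combinations
Top right supplies 0.0–1.0 values where 0–255 is expected; stored in an integer array they become 0 or 1 (almost zero), so every pixel looks black (it runs as is, with no message)

When the array's dtype is float, the expected pixel range is 0.0–1.0.

Bottom left supplies values from 0 to 255, yet the result matches top left. imshow() clips float values to 0.0–1.0, so 255 becomes 1.0; this is the panel that triggers the "Clipping input data" message
Bottom right supplies pixel data in 0.0–1.0, which is already in the expected range, so it is drawn as intended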
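
The dtype effect can be seen without plotting anything. The sketch below uses a made-up three-value array called pixels to show what happens when 0–255 values are scaled to 0.0–1.0 and stored as integers or floats, and how to convert back.

```python
import numpy as np

pixels = np.array([0, 128, 255])

# Storing 0.0-1.0 values in an integer array truncates them to 0 or 1.
as_int = (pixels / 255).astype(int)
print(as_int)

# Keeping the scaled values as floats preserves the 0.0-1.0 range.
as_float = pixels / 255
print(as_float)

# Scaling back and rounding recovers the original 0-255 integers.
restored = np.round(as_float * 255).astype(np.uint8)
print(restored)
```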

The rearranged array has a layout that is not intuitive to read.

Grayscale

For grayscale, specify cmap='gray'. Omitting vmin and vmax gives practically the same result here, but they are set to be safe.

import numpy as np
import matplotlib.pyplot as plt

row_L = np.arange(0, 256, 8, dtype=int)
row_R = np.arange(256, 0, -8)
row = np.append(row_L, row_R)
n_size = 256 // 8

image_array = np.empty((n_size, n_size), dtype=int)

for r in range(n_size):
    image_array[r, :] = row[:n_size]
    row = np.roll(row, -1)

plt.imshow(image_array, vmin=0, vmax=255, cmap='gray')
plt.show()
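
The row-shifting loop can also be expressed with index broadcasting. The sketch below is an alternative formulation, assuming the same row definition as above; it rebuilds the loop result and checks that the vectorized version matches it.

```python
import numpy as np

row = np.append(np.arange(0, 256, 8), np.arange(256, 0, -8))
n_size = 256 // 8

# Loop version from the text: each row is the previous one shifted left by one.
looped = np.empty((n_size, n_size), dtype=int)
shifted = row.copy()
for r in range(n_size):
    looped[r, :] = shifted[:n_size]
    shifted = np.roll(shifted, -1)

# Broadcasting version: pixel (r, c) takes element (r + c) of the cyclic row.
r_idx = np.arange(n_size)[:, None]
c_idx = np.arange(n_size)[None, :]
vectorized = row[(r_idx + c_idx) % row.size]

print(np.array_equal(looped, vectorized))
```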

matplotlib – Colormaps

2020-11-21 / tau

A list of matplotlib's colormaps, for reference when specifying the cmap argument. The source was linked in the original post.

[Figures: colormap reference charts, sphx_glr_colormaps_001.png through sphx_glr_colormaps_007.png]

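PCA – Boston house-prices dataset

2020-11-20 / tau

Overview

Apply scikit-learn's principal component analysis (PCA) model to the Boston housing prices data and check how it behaves.

The hope was that suitable principal components would be found and a good correlation obtained, but the results were not as good as those for classification data such as the Iris and Breast cancer datasets.

However, the Boston housing prices data concerns social behavior that is more complex than the Iris or cancer data, and its indicators are limited, so this alone does not show that PCA is unsuited to regression-type data.

Because the features of the Boston housing prices data include a categorical variable, one-hot encoding is applied with the pandas get_dummies() function.

Calculation steps

Import the required packages
Prepare the Boston housing prices dataset
Scale / encode the dataset
1. Take out the categorical column and one-hot encode it with get_dummies
2. Standardize the remaining feature data with StandardScaler
3. Join the two to form the preprocessed data
Create an instance of the PCA model
- With the argument n_components=2, compute two principal components
Fit the model to the data with the fit() method
Check the principal components and their contribution ratios
- Check PCA.components_ for the principal components and PCA.explained_variance_ratio_ for the contribution ratios
Transform the data along the principal components with the transform() method
Visualize two principal components against the target in 3-D
Visualize one principal component against the target in 2-D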
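
The core of the steps above (scaling, fitting, inspecting and transforming) can be sketched as follows. The data here is synthetic and generated only for illustration, which is an assumption of this sketch: the Boston dataset loader was removed from scikit-learn in version 1.2, so the original data is not loaded.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 100 samples, 4 correlated features.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=100),
    base[:, 1],
    base[:, 1] - base[:, 0],
])

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
pca.fit(X_scaled)

print(pca.components_.shape)           # one row of loadings per principal component
print(pca.explained_variance_ratio_)   # share of variance explained by each component

X_pca = pca.transform(X_scaled)
print(X_pca.shape)
```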

Preprocessing

One of the features, CHAS, is described as "Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)", so it is a categorical variable taking 0 or 1. This variable is one-hot encoded with the pandas get_dummies() function.

The other data is standardized with StandardScaler.

One-hot encode only the CHAS data
Standardize the data without the CHAS column using StandardScaler
Combine the two with join()

df_chas_encoded = pd.get_dummies(df['CHAS'], prefix='CHAS')
df_wo_chas = df.drop('CHAS', axis=1)
df_wo_chas_scaled = pd.DataFrame(
    StandardScaler().fit_transform(df_wo_chas), columns=df_wo_chas.columns)
df_preprocessed = df_wo_chas_scaled.join(df_chas_encoded)
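
To see what the preprocessing produces, the same four lines can be run on a tiny frame. The DataFrame below is hypothetical, with made-up values and only three columns, and stands in for the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature frame with one 0/1 column (CHAS) and two numeric columns.
df = pd.DataFrame({
    "CRIM": [0.1, 0.3, 0.2, 0.9],
    "RM": [6.5, 5.9, 7.1, 6.0],
    "CHAS": [0.0, 1.0, 0.0, 0.0],
})

df_chas_encoded = pd.get_dummies(df["CHAS"], prefix="CHAS")
df_wo_chas = df.drop("CHAS", axis=1)
df_wo_chas_scaled = pd.DataFrame(
    StandardScaler().fit_transform(df_wo_chas), columns=df_wo_chas.columns)
df_preprocessed = df_wo_chas_scaled.join(df_chas_encoded)

print(df_preprocessed.columns.tolist())
print(df_preprocessed.shape)
```

The scaled columns come first and the two one-hot columns are appended at the end, which matches the feature order in the component table further down.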

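Visualization

2-D

First, look at the result of the 2-D visualization.

For classification, two principal components can be examined in two dimensions, but for regression data the target value also has to be shown, which uses up one axis of the plot. A 2-D plot therefore shows how well a single principal component explains the target.

Another option is to assign the two axes to the two principal components and vary the color or size of each point with the target value, but that is harder to grasp intuitively.

Judging from this result, the outcome is not very clean. When the data was surveyed earlier, no individual feature explained the target well; the result here is about on a par with the correlations for MEDV and LSTAT, which showed at least some relationship.

3-D

So switch to a 3-D visualization and check how well two principal components explain the target.

This does not give a very good result either. Looking at the plot, though, the points appear to split into two large clusters. Apart from the housing price that is the target, there may be groups of a different nature hidden in the combination of features.

Principal components and contribution ratios

Display the two principal components and their contribution ratios.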

Components:
     feature    comp_0    comp_1
0       CRIM  0.251012  0.399530
1         ZN -0.256281  0.436197
2      INDUS  0.346630 -0.109264
3        NOX  0.342782 -0.169987
4         RM -0.189333  0.063747
5        AGE  0.313601 -0.318114
6        DIS -0.321462  0.331591
7        RAD  0.319816  0.381156
8        TAX  0.338513  0.318517
9    PTRATIO  0.205059  0.182159
10         B -0.203023 -0.333020
11     LSTAT  0.309817 -0.027564
12  CHAS_0.0 -0.001091  0.035239
13  CHAS_1.0  0.001091 -0.035239

Explained :[0.50514047 0.1109301 ]

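The contribution ratio of the first principal component is only about 50%, which might explain why no strong correlation appears. Yet in the classification of the Breast cancer dataset the first principal component's contribution ratio was in the 40% range and the classes were still separated clearly. PCA may indeed be poorly suited to regression-type problems.

As for the elements of the principal components, the earlier scatter plot shows the price correlating negatively with the first principal component, so in the first principal component features that lower the price should have positive coefficients and features that raise it should have negative ones.

For example, negative values for ZN and RM make sense, but a negative value for DIS is questionable. Positive values for TAX and PTRATIO also seem the wrong way round.

As noted earlier, the features collected in the Boston housing prices dataset may tend to characterize something other than housing prices.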
