matplotlib.pyplot.barh – Horizontal bar charts

2020-06-01 / tau

Overview

barh() draws a horizontal bar chart. The main parameters are as follows.
barh(y, width, height, left, align, fc, ec, linewidth, xerr, capsize, log)

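y, width, height: y gives the vertical positions of the bars; it is common to pass the bar labels as a list or similar. width is the length of each bar, also given as a list or similar. height is the bar thickness; the default is 0.8, and a scalar or a list can be given.
align: the default is 'center'; with 'edge', the bottom edge of each bar is aligned to its label position. To align the top edge instead, pass a negative height.
fc, ec, linewidth: fc (facecolor; color also works) is the fill color of the bars, ec (edgecolor) is the outline color, and linewidth is the outline width.
xerr, capsize: xerr is the error range, given as a list or similar. capsize is the length of the perpendicular caps at both ends of the error bar.
log: log=True makes the horizontal axis logarithmic.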

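Examples

Basic usage

In the basic form, the first argument y takes the labels for the vertical axis and the second argument width takes the length of each bar, each as a list or array.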

import numpy as np
import matplotlib.pyplot as plt

subjects = np.array(["math", "physics", "chemistry", "earth science"])
scores = np.array([80, 95, 45, 65])

fig, ax = plt.subplots()
fig.subplots_adjust(left=0.2)

ax.barh(subjects, scores)

plt.show()

Colors and outlines

Specify the fill color of the bars and the color and width of their outlines.

ax.barh(subjects, scores, fc='tab:green', ec='k', linewidth=3)

Height and position

Specify the height and start position of each bar, and align the bottom edge of each bar to its label.

heights = [1, 0.8, 0.6, 0.4]
lefts = [0, 5, 10, 15]
ax.barh(subjects, scores, height=heights, left=lefts, align='edge')

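Errors

Show error bars at the bar ends.

xerr = [5, 10, 8, 3]  # example error values (assumed; not given in the original)
ax.barh(subjects, scores, xerr=xerr, capsize=8)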

Logarithmic axis

The horizontal axis is set to a logarithmic scale.

import matplotlib.pyplot as plt

names = ["Andrew", "Bob", "Charie", "Dick"]
properties = [200000000, 50000000, 8000000, 1000000]

fig, ax = plt.subplots()

ax.barh(names, properties, log=True)

plt.show()

pyplot – Plot area cut off or overflowing

2020-06-01 / tau

Axis labels sometimes run off the edge of the figure.

In that case, call subplots_adjust() on pyplot or on the Figure and set the margins with arguments such as left and bottom.

import numpy as np
import matplotlib.pyplot as plt

subjects = np.array(["math", "physics", "chemistry", "earth science"])
scores = np.array([95, 92, 83, 75])

fig, ax = plt.subplots()
fig.subplots_adjust(left=0.2)

ax.barh(subjects, scores)

plt.show()

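pyplot – Plot clipped at the axes frame

2020-05-30 / tau

When drawing with pyplot, lines and markers near the edges of the axes are clipped at the frame. To draw them without clipping, using the area outside the axes as well, pass clip_on=False to each plotting call.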

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, num=200)
ys = np.sin(3*x)
yc = np.sin(3*x - np.pi)

xp = [-np.pi, np.pi]
yp1 = [-1, 1]
yp2 = [1, -1]

fig, ax = plt.subplots()

ax.plot(x, ys, linewidth=4)
ax.plot(x, yc, linewidth=4, clip_on=False)

ax.scatter(xp, yp1, s=80)
ax.scatter(xp, yp2, s=80, clip_on=False)

ax.set_xlim(-np.pi, np.pi)
ax.set_ylim(-1, 1)

plt.show()

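How to use ndarray.reshape()

2020-05-23 / tau

How reshape() works

When reshaping with a.reshape(d₁, ..., d_n):

The result is an n-dimensional array
d₁ × ... × d_n = a.size must hold

Single-element case

Passing a single number to np.array() yields an ndarray (0-dimensional) that nevertheless prints like a plain number.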

import numpy as np

a = np.array(1)
print(a)
print(type(a))
print(a.size)
print(a * 2)

# 1
# <class 'numpy.ndarray'>
# 1
# 2

Calling reshape(1) on it gives a one-element 1-D array.

b = a.reshape(1)
print(b)

# [1]

reshape(1, 1) gives a one-element 2-D array; reshape(1, 1, 1) gives a 3-D array.

c = a.reshape(1, 1)
print(c)

d = a.reshape(1, 1, 1)
print(d)

# [[1]]
# [[[1]]]

Calling reshape(1) on these 2-D and 3-D arrays gives a one-element 1-D array again.

print(c.reshape(1))
print(d.reshape(1))

# [1]
# [1]

Reshaping 1-D arrays

To a 2-D array with one row

Calling reshape(1, -1) on a 1-D array gives a 2-D array with a single row that contains the original array.

import numpy as np

a = np.arange(4)
print(a)

b = a.reshape(1, -1)
print(b)

# [0 1 2 3]
# [[0 1 2 3]]

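To a 2-D array with one column

Calling reshape(-1, 1) on a 1-D array gives a 2-D array with a single column whose entries are the elements of the original array.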

c = a.reshape(-1, 1)
print(c)

# [[0]
#  [1]
#  [2]
#  [3]]

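To arrays of arbitrary shape

Calling reshape(m, n) on a 1-D array gives a 2-D array with m rows and n columns. An error is raised unless m × n equals the array size (one of the two can be given as -1 to have it computed automatically).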

d = a.reshape(2, 2)
print(d)

# [[0 1]
#  [2 3]]

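The size rule and the automatic -1 dimension can be sketched in plain Python. This is only an illustration of the rule described above, not NumPy's implementation; resolve_shape() is a hypothetical helper introduced here.

```python
import math

def resolve_shape(size, shape):
    """Return `shape` with a single -1 entry replaced by the inferred length.

    Hypothetical helper: it only mirrors the rule that the product of the
    dimensions must equal the array size.
    """
    unknown = [i for i, d in enumerate(shape) if d == -1]
    if len(unknown) > 1:
        raise ValueError("only one dimension may be -1")
    # Product of the dimensions that were given explicitly.
    known = math.prod(d for d in shape if d != -1)
    if unknown:
        if known == 0 or size % known != 0:
            raise ValueError(f"cannot reshape size {size} into {shape}")
        resolved = list(shape)
        resolved[unknown[0]] = size // known
        return tuple(resolved)
    if known != size:
        raise ValueError(f"cannot reshape size {size} into {shape}")
    return tuple(shape)

print(resolve_shape(12, (2, -1, 3)))  # (2, 2, 3)
print(resolve_shape(4, (-1, 1)))      # (4, 1)
```

A shape such as (3, -1) for an array of size 4 raises ValueError in this sketch, because 4 is not divisible by 3.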

Reshaping to three or more dimensions is also possible.

e = np.arange(12)
print(e)
print(e.reshape(2, 2, 3))

# [ 0  1  2  3  4  5  6  7  8  9 10 11]
# [[[ 0  1  2]
#   [ 3  4  5]]
# 
#  [[ 6  7  8]
#   [ 9 10 11]]]

Converting to a 1-D array

For an array a of any shape, reshape(a.size) converts it to a 1-D array.

print(b.reshape(b.size))
print(c.reshape(c.size))
print(d.reshape(d.size))
print(e.reshape(e.size))

# [0 1 2 3]
# [0 1 2 3]
# [0 1 2 3]
# [ 0  1  2  3  4  5  6  7  8  9 10 11]

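As a side note, NumPy offers a few equivalent ways to flatten an array besides reshape(a.size). The sketch below compares them; the variable names are chosen here for illustration.

```python
import numpy as np

e = np.arange(12).reshape(2, 2, 3)

flat_a = e.reshape(e.size)  # explicit length, as in the text
flat_b = e.reshape(-1)      # length inferred by NumPy
flat_c = e.ravel()          # returns a view when possible
flat_d = e.flatten()        # always returns a copy

print(flat_b.shape)  # (12,)
```

All four results hold the same values; they differ only in whether the length is spelled out and whether a copy is made.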

Python – itertools

2020-05-21 / tau

Overview

itertools provides tools for building fast, memory-efficient iterators.

The main argument is a collection (list, tuple).

from itertools import cycle

# `item` is used instead of `next` to avoid shadowing the built-in next()
for n, item in enumerate(cycle(['A', 'B', 'C'])):
    print(item, end='')
    if n == 6: break

# ABCABCA

Passing a string has the same effect as passing a list of its individual characters.

for n, item in enumerate(cycle("ABC")):
    print(item, end='')
    if n == 6: break

# ABCABCA

Objects that produce a sequence, such as range(), can be used as well.

for n, item in enumerate(cycle(range(3))):
    print(item, end='')
    if n == 6: break

# 0120120

Infinite iterators

Infinite iterators keep yielding values indefinitely. When one is used in a loop, a termination step such as a break statement is required.
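
Besides break, an infinite iterator can also be bounded with other itertools functions. The following is a minimal sketch using islice() and takewhile(); the variable names are chosen here for illustration.

```python
from itertools import count, islice, takewhile

# islice() takes the first n items of an (infinite) iterator.
first_six = list(islice(count(3, 2), 6))
print(first_six)  # [3, 5, 7, 9, 11, 13]

# takewhile() stops at the first item that fails the predicate.
below_ten = list(takewhile(lambda v: v < 10, count(3, 2)))
print(below_ten)  # [3, 5, 7, 9]
```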

count()

itertools.count(start=0, step=1): starts from the number given as start and yields values increasing by step. If step is omitted, the increment is 1.

from itertools import count

for n, digit in enumerate(count(3, 2)):
    print(digit, end=',')
    if n == 5: break

# 3,5,7,9,11,13,

cycle()

itertools.cycle(p): takes a collection p, yields its elements p0, p1, …, plast, then returns to p0 and repeats.

from itertools import cycle

for n, digit in enumerate(cycle(range(4))):
    print(digit, end=',')
    if n==10: break

# 0,1,2,3,0,1,2,3,0,1,2,

repeat()

itertools.repeat(elem [, n]): yields elem the number of times given by the second argument. If the second argument is omitted, it repeats forever.

from itertools import repeat

for ch in repeat('Ha', 8):
    print(ch, end='')

# HaHaHaHaHaHaHaHa

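When the count is omitted, repeat() is typically paired with something that ends the iteration. The sketch below shows two such pairings, with map() and with islice(); the variable names are illustrative.

```python
from itertools import islice, repeat

# repeat() without a count is infinite; map() stops when range(5) is exhausted.
squares = list(map(pow, range(5), repeat(2)))
print(squares)  # [0, 1, 4, 9, 16]

# An unbounded repeat() can also be truncated explicitly with islice().
three_times = list(islice(repeat('Ha'), 3))
print(three_times)  # ['Ha', 'Ha', 'Ha']
```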

Combinatoric iterators

Combinatoric iterators take the specified number of elements from a collection and yield their Cartesian products, permutations, or combinations.

product()

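itertools.product(p [, repeat=n]): yields, as tuples, the Cartesian product of the elements of collection p with itself, taken repeat times. A tuple may contain the same element more than once, and tuples that differ only in order are treated as distinct. If repeat is omitted, one-element tuples are yielded.

from itertools import product

# `tup` is used instead of `str` to avoid shadowing the built-in str
for tup in product(['A', 'B', 'C'], repeat=2):
    print(tup, end='')

print()

for tup in product(['A', 'B', 'C']):
    print(tup, end='')

# ('A', 'A')('A', 'B')('A', 'C')('B', 'A')('B', 'B')('B', 'C')('C', 'A')('C', 'B')('C', 'C')
# ('A',)('B',)('C',)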

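product() also accepts several different iterables, in which case it behaves like nested for loops. The following is a small sketch; the sizes and colors lists are made-up sample data.

```python
from itertools import product

sizes = ['S', 'M']
colors = ['red', 'blue']

pairs = list(product(sizes, colors))
nested = [(s, c) for s in sizes for c in colors]

print(pairs)
# [('S', 'red'), ('S', 'blue'), ('M', 'red'), ('M', 'blue')]
print(pairs == nested)  # True

# repeat=2 is shorthand for passing the same iterable twice.
same = list(product('AB', repeat=2)) == list(product('AB', 'AB'))
print(same)  # True
```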

permutations()

itertools.permutations(p [, r=n]): yields, as tuples, the permutations of length r of the elements of collection p. No element appears twice in a tuple, while tuples that contain the same elements in a different order are distinct. Note that the second argument is r, not repeat. If r is omitted, full-length permutations of all the elements are yielded.

from itertools import permutations

for tup in permutations("ABC", r=2):
    print(tup, end='')

print()

for tup in permutations("ABC"):
    print(tup, end='')

# ('A', 'B')('A', 'C')('B', 'A')('B', 'C')('C', 'A')('C', 'B')
# ('A', 'B', 'C')('A', 'C', 'B')('B', 'A', 'C')('B', 'C', 'A')('C', 'A', 'B')('C', 'B', 'A')

combinations()

itertools.combinations(p, r): yields, as tuples, the combinations of length r of the elements of collection p. No element appears twice in a tuple, and selections that differ only in order count as the same result. The second argument r is required; omitting it raises a TypeError.

from itertools import combinations

for tup in combinations("ABC", r=2):
    print(tup, end='')

# ('A', 'B')('A', 'C')('B', 'C')

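What happens when r is missing or too large can be checked directly. The sketch below assumes CPython's standard itertools; the variable names are illustrative.

```python
from itertools import combinations

# Omitting r raises a TypeError at call time.
try:
    combinations("ABC")
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)  # TypeError

# If r exceeds the number of elements, nothing is yielded.
too_long = list(combinations("ABC", 4))
print(too_long)  # []
```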

combinations_with_replacement()

itertools.combinations_with_replacement(iterable, r)

Yields combinations in which the same element may be chosen more than once.

The second argument r is required; omitting it raises a TypeError.

from itertools import combinations_with_replacement

for tup in combinations_with_replacement("ABC", r=2):
    print(tup, end='')

# ('A', 'A')('A', 'B')('A', 'C')('B', 'B')('B', 'C')('C', 'C')

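The number of results from the four combinatoric iterators can be cross-checked against the usual counting formulas. This is a small sketch using math.comb and math.perm (Python 3.8 or later); the counts dictionary is introduced here for illustration.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, perm

items = "ABC"
n, r = len(items), 2

counts = {
    "product": len(list(product(items, repeat=r))),
    "permutations": len(list(permutations(items, r))),
    "combinations": len(list(combinations(items, r))),
    "combinations_with_replacement":
        len(list(combinations_with_replacement(items, r))),
}

for name, value in counts.items():
    print(name, value)

# Closed-form counts for n items taken r at a time.
expected = {
    "product": n ** r,
    "permutations": perm(n, r),
    "combinations": comb(n, r),
    "combinations_with_replacement": comb(n + r - 1, r),
}
print(counts == expected)  # True
```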

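Particularly useful functions

chain: useful for concatenating lists

itertools.chain(*iterables): takes multiple iterables and returns a single iterator that yields their contents one after another. The leading '*' in the signature means that any number of iterables can be passed as separate arguments.

The return value is an iterator object.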

from itertools import chain

print(chain([0, 1, 2, 3], [4, 5]))

# <itertools.chain object at 0x028081B0>

Passing it to list() gives the concatenated list.

print(list(chain([0, 1, 2, 3], [4, 5])))

# [0, 1, 2, 3, 4, 5]

Other kinds of iterables, such as range objects, can be mixed into the arguments.

print(list(chain(range(3), [3, 4, 5])))

# [0, 1, 2, 3, 4, 5]

Incidentally, a single iterable is simply passed through as it is.

print(list(chain([0, 1, 2, 3, 4, 5])))

# [0, 1, 2, 3, 4, 5]

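Because chain() returns an iterator rather than a list, it can be consumed only once. The following short sketch shows this; the variable names are illustrative.

```python
from itertools import chain

merged = chain([0, 1, 2], [3, 4])

first_pass = list(merged)   # consumes the iterator
second_pass = list(merged)  # nothing is left

print(first_pass)   # [0, 1, 2, 3, 4]
print(second_pass)  # []
```

If the combined sequence is needed more than once, it is simplest to keep the list produced by the first pass.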

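chain.from_iterable: for flattening 2-D lists

itertools.chain.from_iterable(iterables): takes multiple iterables and returns a single iterator that yields their contents one after another. The absence of a leading '*' means that the argument is a single iterable whose elements are themselves iterables.

For example, it can flatten all elements of a 2-D list (a list of lists) into one dimension. from_iterable() is an alternative constructor of chain, so note how the module is imported and how the constructor is called.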

from itertools import chain

print(list(chain.from_iterable([[0], [1, 2], [3, 4, 5]])))

# [0, 1, 2, 3, 4, 5]

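The relationship between chain() and chain.from_iterable() can be checked with a short sketch: unpacking the outer list with '*' gives the same result, and only one level of nesting is removed. The variable names are illustrative.

```python
from itertools import chain

nested = [[0], [1, 2], [3, 4, 5]]

via_unpacking = list(chain(*nested))
via_from_iterable = list(chain.from_iterable(nested))
print(via_unpacking == via_from_iterable)  # True

# Only one level is flattened; inner lists stay intact.
deeper = [[0, [1, 2]], [3]]
one_level = list(chain.from_iterable(deeper))
print(one_level)  # [0, [1, 2], 3]
```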

A 1-D list raises an error because its elements are not iterable.

print(list(chain.from_iterable([0, 1, 2])))

# TypeError: 'int' object is not iterable

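A list whose elements are ndarrays is flattened into a 1-D list of the array elements.

import numpy as np

print(list(chain.from_iterable([np.array([1]), np.array([2, 3])])))

# [1, 2, 3]  (NumPy 2.x shows the elements as np.int64(1), ...)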

A 2-D ndarray can be flattened as well. Use list() to get the result as a list; to get an array, first convert to a list with list() and then to an array with numpy.array().

ary = np.array([[1, 2], [3, 4]])
print(ary)
# [[1 2]
#  [3 4]]

print(list(chain.from_iterable(ary)))
# [1, 2, 3, 4]

print(np.array(list(chain.from_iterable(ary))))
# [1 2 3 4]

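zip_longest: a zip that pads to the longest argument

itertools.zip_longest(*iterables, fillvalue=None): takes multiple iterables and returns an iterator that groups their elements in order from the start. The result is as long as the longest iterable, and missing values are filled with fillvalue.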

from itertools import zip_longest

iterable1 = "ABCDE"
iterable2 = [1, 2, 3]

for item1, item2 in zip_longest(iterable1, iterable2):
    print(item1, item2)
# A 1
# B 2
# C 3
# D None
# E None

for item1, item2 in zip_longest(iterable1, iterable2, fillvalue=0):
    print(item1, item2)
# A 1
# B 2
# C 3
# D 0
# E 0

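For comparison, the built-in zip() stops at the shortest iterable, whereas zip_longest() pads the shorter one. The sketch below puts the two side by side; the variable names are illustrative.

```python
from itertools import zip_longest

letters = "ABCDE"
numbers = [1, 2, 3]

truncated = list(zip(letters, numbers))
padded = list(zip_longest(letters, numbers, fillvalue=0))

print(truncated)
# [('A', 1), ('B', 2), ('C', 3)]
print(padded)
# [('A', 1), ('B', 2), ('C', 3), ('D', 0), ('E', 0)]
```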

scikit-learn – make_blobs

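2020-05-18 / tau

Overview

sklearn.datasets.make_blobs() generates data for classification. A blob is something like an ink stain; the name seems to come from how the points look in a scatter plot.

In the standard usage, you specify the total number of samples, the number of features, the number of clusters, and so on, and a tuple of the feature array X and the target class data y is returned (depending on the arguments, one more return value is added).

Format of the generated data

The feature array X is a 2-D array whose columns are features and whose rows are records. The target y holds one integer class label per record.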

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10, random_state=0)

print(X)
print(y)

# [[ 1.12031365  5.75806083]
#  [ 1.7373078   4.42546234]
#  [ 2.36833522  0.04356792]
#  [ 0.87305123  4.71438583]
#  [-0.66246781  2.17571724]
#  [ 0.74285061  1.46351659]
#  [-4.07989383  3.57150086]
#  [ 3.54934659  0.6925054 ]
#  [ 2.49913075  1.23133799]
#  [ 1.9263585   4.15243012]]
# [0 0 1 0 2 2 2 1 1 0]

Usage examples

Feed the data directly into a scikit-learn model.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)

print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))
# 1.0
# 0.96

Draw a scatter plot with a different color and marker for each class.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30, centers=3, random_state=10)

markers = ['o', '^', 'v']
fig, ax = plt.subplots()

for cluster, marker in zip(range(3), markers):
    x = X[y==cluster]
    ax.scatter(x[:, 0], x[:, 1], marker=marker)

plt.show()

Parameters

make_blobs(n_samples, n_features, centers, cluster_std,
           center_box, shuffle, random_state, return_centers)

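The main ones:

n_samples: if an integer, the total number of generated samples, which becomes the number of rows of the returned X. If an array, its length is the number of clusters and each element is the number of samples in that cluster. The default is 100.
n_features: the number of features, which becomes the number of columns of the returned X. The default is 2.
centers: the number of cluster centers. If n_samples is an integer and centers is not given (the default None), centers=3 is used. If n_samples is an array, centers must be None or an array of shape [n_centers, n_features].
cluster_std: the standard deviation of the clusters.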

Logistic regression on the cancer dataset, from "Pythonではじめる機械学習"

2020-05-17 / tau

Model accuracy

Apply a logistic regression model, scikit-learn's LogisticRegression, to the breast_cancer dataset and compute the scores on the training and test data.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

ds = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, stratify=ds.target, random_state=42)

logreg = LogisticRegression().fit(X_train, y_train)
print("")
print("Training score: {}".format(logreg.score(X_train, y_train)))
print("Test score    : {}".format(logreg.score(X_test, y_test)))

# Training score: 0.9530516431924883
# Test score    : 0.958041958041958

(Note) Solver warnings and results

Running the code above gave results consistent with the book, but a warning was displayed.

...FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Training score: 0.9530516431924883
Test score    : 0.958041958041958

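At that point the installed scikit-learn was old (0.21.3), and the warning says that the default will change in a future version. Specifying the then-default solver explicitly as solver='liblinear' when creating the instance removed the warning and left the values unchanged.

With solver='lbfgs', a warning appeared saying that the computation did not converge.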

logreg = LogisticRegression(solver='lbfgs').fit(X_train, y_train)

...ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations.
  "of iterations.", ConvergenceWarning)
Training score: 0.9483568075117371
Test score    : 0.951048951048951

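Increasing the iteration limit step by step, the solver still did not converge with a maximum of 2000 iterations, but it converged with 3000 and the warning disappeared.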

logreg = LogisticRegression(solver='lbfgs', max_iter=3000).fit(X_train, y_train)

Training score: 0.9577464788732394
Test score    : 0.958041958041958

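Later, after upgrading scikit-learn to 0.23.0, the default-solver warning no longer appeared, the convergence warning appeared in the same way, and the results were reproduced. Below, liblinear is specified explicitly as the solver and random_state is set to the same value as in the book.

Improving accuracy

The scores above with C=1.0 and liblinear are 0.953 on the training data and 0.958 on the test data, both high. Training and test scores this close suggest possible underfitting, so try a more flexible model by increasing C to 100 (a more flexible model here means one with weaker regularization that fits the training data more closely).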

logreg100 = LogisticRegression(C=100, solver='liblinear').fit(X_train, y_train)
print("Training score: {}".format(logreg100.score(X_train, y_train)))
print("Test score    : {}".format(logreg100.score(X_test, y_test)))

# Training score: 0.9788732394366197
# Test score    : 0.965034965034965

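Both the training and the test scores improve slightly. Increasing C further to 1000 or 10000 hardly changes the scores.

Conversely, making C smaller than 1.0 to strengthen regularization lowers the scores on both the training and the test data.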

logreg001 = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train)
print("Training score: {}".format(logreg001.score(X_train, y_train)))
print("Test score    : {}".format(logreg001.score(X_test, y_test)))

# Training score: 0.9342723004694836
# Test score    : 0.9300699300699301

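The curves of score against C are as follows. Where C is smaller than about 10, regularization is strong and the model underfits; beyond that the scores level off and improve only marginally. Variations of these curves for logistic regression models are summarized in a separate post.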

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

C_pow_min = -4
C_pow_max = 3
C_pow_num = 100
Cs_pows = np.linspace(C_pow_min, C_pow_max, C_pow_num)
Cs = 10**Cs_pows

ds = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, stratify=ds.target, random_state=42)

fig, ax = plt.subplots()

score_trains = np.empty(0)
score_tests = np.empty(0)

for C in Cs:
    logreg = LogisticRegression(C=C, solver='liblinear').fit(X_train, y_train)
    score_trains = np.append(score_trains, logreg.score(X_train, y_train))
    score_tests = np.append(score_tests, logreg.score(X_test, y_test))

ax.plot(Cs, score_trains, label="Training")
ax.plot(Cs, score_tests, label="Test")

ax.set_xscale('log')
ax.legend()

plt.show()

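Feature coefficients

With L2 regularization

Plot the coefficients for the 30 features when LogisticRegression is trained on the breast_cancer dataset. The liblinear solver is used, with L2 regularization by default. The larger C is, the weaker the regularization and the larger the absolute values of the coefficients.

The book draws attention to the third feature, mean perimeter: its sign flips depending on the model, which calls into question how reliably it can be interpreted for classification.

I noticed the following points about the book.

logreg001 is created with C=0.01, but the legend says "C=0.001" (the plot hardly changes either way)
With C=100 for logreg100, the result does not match the book (the distribution changes considerably; for example, worst concave points falls below -8)
With C=20, the distribution is roughly the same as in the book (some small differences remain)

In any case, "Pythonではじめる機械学習" remains a good book that gives beginners a very welcome starting point.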

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as pch
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

ds = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, stratify=ds.target, random_state=42)

logreg1 = LogisticRegression(solver='liblinear').fit(X_train, y_train)
logreg20 = LogisticRegression(C=20, solver='liblinear').fit(X_train, y_train)
logreg001 = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train)

x_scatter = np.linspace(0, 1, len(ds.feature_names))

fig, ax = plt.subplots(figsize=(6.4, 6.4))

ax.scatter(x_scatter, logreg1.coef_, marker='o', c='grey', s=100,
    label="C=1.0")
ax.scatter(x_scatter, logreg001.coef_, marker='1', c='blue', s=100,
    label="C=0.01")
ax.scatter(x_scatter, logreg20.coef_, marker='2', c='red', s = 100,
    label="C=20")
ax.plot([0, 1], [0, 0], c='k', zorder=-100)
ax.add_patch(pch.Arrow(2/30, -4, 0, 3, width=1/30))

ax.set_xticks(x_scatter)
ax.set_xticklabels(ds.feature_names, rotation=90)
ax.set_xlim(0, 1)
ax.legend()

fig.subplots_adjust(bottom=0.3)

plt.show()

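With L1 regularization

Use the same liblinear solver and specify penalty='l1' explicitly. Unlike the L2 case, C=0.001 is given explicitly in the book's code this time, and with C=100 even the computed scores match. Note that the display range is limited with set_ylim(), so some points for C=100 fall outside the frame.

With L1 regularization many coefficients become zero, and a simple model with few features achieves a reasonable score.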

logreg1 = LogisticRegression(solver='liblinear', penalty='l1').\
    fit(X_train, y_train)
logreg100 = LogisticRegression(C=100, solver='liblinear', penalty='l1').\
    fit(X_train, y_train)
logreg001 = LogisticRegression(C=0.001, solver='liblinear', penalty='l1').\
    fit(X_train, y_train)
print("C=0.001")
print(" Training score: {:5.3f}".format(logreg001.score(X_train, y_train)))
print(" Test score    : {:5.3f}".format(logreg001.score(X_test, y_test)))
print("C=1")
print(" Training score: {:5.3f}".format(logreg1.score(X_train, y_train)))
print(" Test score    : {:5.3f}".format(logreg1.score(X_test, y_test)))
print("C=100")
print(" Training score: {:5.3f}".format(logreg100.score(X_train, y_train)))
print(" Test score    : {:5.3f}".format(logreg100.score(X_test, y_test)))

# C=0.001
#  Training score: 0.913
#  Test score    : 0.923
# C=1
#  Training score: 0.960
#  Test score    : 0.958
# C=100
#  Training score: 0.986
#  Test score    : 0.979

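Coefficient signs and class probabilities

The target classes are 0 for malignant and 1 for benign, so a positive coefficient pushes the probability toward benign and a negative coefficient pushes it toward malignant.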
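
The effect of the sign can be illustrated with the logistic function itself. The following is a minimal sketch for a single-feature model; sigmoid() and prob_benign() are hypothetical helpers, and the coefficient values are made up rather than taken from the fitted models.

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear score to P(class 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def prob_benign(x, coef, intercept=0.0):
    # Single-feature model; coef and intercept are made-up illustrative values.
    return sigmoid(coef * x + intercept)

for coef in (+1.5, -1.5):
    low, high = prob_benign(0.0, coef), prob_benign(2.0, coef)
    print(f"coef={coef:+.1f}: P(benign) {low:.3f} -> {high:.3f}")

# coef=+1.5: P(benign) 0.500 -> 0.953
# coef=-1.5: P(benign) 0.500 -> 0.047
```

With a positive coefficient the probability of class 1 (benign) rises as the feature value grows; with a negative coefficient it falls.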

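Looking at worst concavity with L2 regularization, it takes values from negative to zero, yet an overview of the original data shows the benign group having higher values overall, which is contradictory. With L1 regularization, on the other hand, all coefficients are zero at C=0.001, suggesting that the feature does not affect the result.

Weakening the L1 regularization with C=1, 0.5, and 0.1, worst concavity still ends up at zero, while worst texture consistently keeps a negative value. The same tendency is also seen, though only slightly, in area error.

Looking over the cancer data, the benign and malignant distributions of worst texture overlap considerably, and the malignant data has the larger volume. For area error, too, the two classes lie close together, the values are small, and the benign data dominates.

Judging from the histograms, large values of most features appear to suggest benign, yet the logistic regression results show that many features have no effect, and some show the opposite tendency to what the distributions suggest.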

scikit-learn – LogisticRegression

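2020-05-17 / tau

Overview

scikit-learn's LogisticRegression provides a logistic regression model. The outline of its use is the steps below, almost the same as for other linear models such as LinearRegression, except that the regularization parameter C given at instantiation works in the opposite direction to alpha in Ridge/Lasso: to strengthen regularization, make C smaller (a larger C weakens regularization, which raises accuracy on the training data but increases the risk of overfitting).

The regularization method can be chosen from L1, L2, and Elastic-Net.

Import the LogisticRegression class
Create a model instance, specifying the hyperparameter C, the regularization method, the solver (optimization algorithm), and so on
Train it by passing training data to the fit() method

A trained model is used as follows.

Pass test data to the score() method to compute the accuracy
Pass explanatory variables to the predict() method to predict the target
Access the model parameters through the attributes of the model instance
- The intercept is intercept_ and the weight coefficients are coef_ (note the trailing underscore)

Usage example

Below is an example of applying LogisticRegression to the breast_cancer dataset. The default solver is 'lbfgs', and since it did not converge within the default iteration limit (100), max_iter=3000 is specified.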

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

ds = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, stratify=ds.target, random_state=42)

logreg = LogisticRegression(max_iter=3000).fit(X_train, y_train)
print("")
print("Training score: {}".format(logreg.score(X_train, y_train)))
print("Test score    : {}".format(logreg.score(X_test, y_test)))
print("Prediction")
for i in range(3):
    print("{} -> {}".format(y_test[i], logreg.predict(X_test[i].reshape(1, -1))))

# Training score: 0.9577464788732394
# Test score    : 0.958041958041958
# Prediction
# 1 -> [1]
# 0 -> [0]
# 1 -> [1]

Usage

The main usage of LogisticRegression is almost the same as that of LinearRegression; the following focuses on the settings specific to it.

Importing the model class

Import the LogisticRegression class from the sklearn.linear_model package.

from sklearn.linear_model import LogisticRegression

Creating a model instance

In LogisticRegression, the hyperparameter C specifies the strength of regularization. Unlike alpha in Ridge/Lasso, C must be made smaller to strengthen regularization. The default is C=1.0.
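
For reference, the role of C can be read off the L2-penalized objective as it is stated in the scikit-learn user guide, where the labels are coded as y_i ∈ {-1, 1}, w is the weight vector, and c is the intercept (this notation is taken from that guide, not from the text above):

```latex
\min_{w,\,c} \; \frac{1}{2} w^{\top} w
  + C \sum_{i=1}^{n} \log\!\left( \exp\!\left( -y_i \left( X_i^{\top} w + c \right) \right) + 1 \right)
```

Because C multiplies the data-fit term, a small C lets the penalty term dominate (strong regularization), while a large C lets the fit to the training data dominate.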

logreg = LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0,
             fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None,
             solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False,
             n_jobs=None, l1_ratio=None)

以下、RidgeとLassoに特有のパラメーターのみ説明。LinearRegressionと共通のパラメーターはLinearRegressionを参照。

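penalty: Type of norm used in the regularization term, one of 'l1', 'l2', 'elasticnet', 'none'. The solvers 'newton-cg', 'sag' and 'lbfgs' support only L2 regularization, and 'elasticnet' is supported only by 'saga'. The default is 'l2'. With 'none', no regularization is applied ('liblinear' does not support 'none').
tol: Tolerance of the solution in the iterative optimization. The default is 1e-4.
C: Inverse of the regularization strength. Must be a positive float; the default is 1.0.
solver: One of 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'. The default is 'lbfgs'. 'liblinear' suits small datasets, while 'sag' and 'saga' are faster on large ones. For multiclass problems, 'newton-cg', 'sag', 'saga' and 'lbfgs' handle the multinomial loss, whereas 'liblinear' is limited to one-versus-rest. The usable solvers also depend on the type of norm (see penalty).
max_iter: Maximum number of iterations of the iterative optimization. The default is 100.
random_state: Random seed used when shuffling the data. Used when solver is 'sag', 'saga' or 'liblinear'.
l1_ratio: The Elastic-Net mixing parameter. A value in [0, 1], used only when penalty='elasticnet'.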
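
To make the pairing of penalty and solver concrete, here is a small sketch that is not part of the original post. It fits three combinations that the list above describes as compatible and counts the non-zero coefficients. The dataset, the standardization, C=0.1 and l1_ratio=0.5 are all assumptions chosen for illustration. It uses the penalty argument as described in this article; newer scikit-learn releases may warn about it or change it.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# penalty / solver pairs that are compatible according to the parameter list
configs = {
    "l2 + lbfgs": dict(penalty="l2", solver="lbfgs"),
    "l1 + liblinear": dict(penalty="l1", solver="liblinear"),
    "elasticnet + saga": dict(penalty="elasticnet", solver="saga", l1_ratio=0.5),
}

nonzero = {}
for name, params in configs.items():
    model = LogisticRegression(C=0.1, max_iter=10000, **params).fit(X, y)
    # L1-type penalties drive some coefficients exactly to zero
    nonzero[name] = int(np.count_nonzero(model.coef_))
    print("{:18s} non-zero coefficients: {}".format(name, nonzero[name]))
```

The L1 fit is expected to keep fewer features than the L2 fit, which is the usual reason for choosing 'l1' or 'elasticnet'.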

Training the model

Pass the training features and targets to the fit() method to train the model (that is, to determine the regression coefficients).

lr.fit(X, y)

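X: Array of features. A 2-D array in which each column corresponds to one explanatory variable and the number of rows is the number of samples. When there is only one variable and it is stored in a 1-D array, it has to be converted into a single-column vector with reshape(-1, 1) or a slice ([:, n:n+1]).
y: Array of targets, usually a 1-D array for a single target variable.

The third argument, sample_weight, is omitted here.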
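
The shape requirement on X matters most when training on a single feature. The sketch below is not from the original post; it assumes the breast cancer dataset and picks its first column arbitrarily. It shows the two conversions mentioned above producing the same single-column array.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

x1d = X[:, 0]                  # 1-D array, shape (n_samples,): not accepted by fit()
x_col_a = x1d.reshape(-1, 1)   # column vector via reshape
x_col_b = X[:, 0:1]            # column vector via a one-column slice

print(x1d.shape, x_col_a.shape, x_col_b.shape)

# Either column vector can be passed to fit()
lr = LogisticRegression(max_iter=1000).fit(x_col_a, y)
single_feature_score = lr.score(x_col_b, y)
print(single_feature_score)
```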

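Computing the score

Pass features and targets to the score() method to compute the goodness of fit. For a classifier such as LogisticRegression, this is the mean accuracy on the given data.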

lr.score(X, y)
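
As a quick check of what score() returns for a classifier, the following sketch (an illustration added here, reusing the breast cancer setup from the earlier example) compares it with the fraction of correct predictions computed by hand.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

ds = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, stratify=ds.target, random_state=42)

logreg = LogisticRegression(max_iter=3000).fit(X_train, y_train)

score = logreg.score(X_test, y_test)
# Fraction of test samples whose predicted class matches the true class
accuracy = np.mean(logreg.predict(X_test) == y_test)
print(score, accuracy)
```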

Other methods

decision_function(X)
densify()
predict_proba(X)
predict_log_proba(X)
sparsify()
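
The methods listed above are related in a simple way for a binary problem. The sketch below is an illustration added here, not the author's code; the dataset and the standardization are assumptions. decision_function() returns the linear score w·x + b, predict_proba() applies the logistic function to it, predict_log_proba() is its logarithm, and sparsify()/densify() switch coef_ between a SciPy sparse matrix and a NumPy array.

```python
import numpy as np
from scipy import sparse
from scipy.special import expit
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

sample = X[:5]
d = logreg.decision_function(sample)          # linear scores, shape (5,)
proba = logreg.predict_proba(sample)          # shape (5, 2), columns follow logreg.classes_
log_proba = logreg.predict_log_proba(sample)  # log of the probabilities

manual_d = sample @ logreg.coef_[0] + logreg.intercept_[0]
print(np.allclose(d, manual_d))
print(np.allclose(proba[:, 1], expit(d)))
print(np.allclose(log_proba, np.log(proba)))

logreg.sparsify()                             # coef_ becomes a scipy.sparse matrix
is_sparse = sparse.issparse(logreg.coef_)
logreg.densify()                              # coef_ is a dense ndarray again
is_dense = isinstance(logreg.coef_, np.ndarray)
print(is_sparse, is_dense)
```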

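Logistic regression on the forge data (from "Pythonではじめる機械学習")

2020-05-17 / tau

Overview

While tracing the decision boundary of logistic regression on the forge data, following section 2.3.3.5 of the O'Reilly book "Pythonではじめる機械学習" (the Japanese edition of "Introduction to Machine Learning with Python"), I had trouble reproducing the book's figures because of differences in the solver and differences between the original data and the data used in the book. This post records what I found.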

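Decision boundary

Apply LogisticRegression to the forge dataset from mglearn.

When C is very large, that is, when almost no regularization is applied, the model tries to fit the given data as closely as possible, and the score on the data is high. As C becomes smaller the regularization takes effect, and the model appears to fit the overall distribution of the data rather than individual points.

In the C=1 case of the figure above, the slope of the decision boundary is the reverse of that in the right panel of Figure 2-15 of the book. The reason turned out to be as follows.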

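The book does not specify a solver for LogisticRegression(), so the default at the time, solver='liblinear', is used
Running without specifying a solver this time produced the following warning
- FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
- That is, the default solver (currently liblinear) becomes lbfgs in version 0.22, and a solver should be specified to silence the warning
So I created the model instance with LogisticRegression(solver='lbfgs'), which gave the result above
Leaving the solver unspecified, or setting solver='liblinear', gives the same result as the book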
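
The solver dependence can be reproduced without the figures. The sketch below is an assumption-laden illustration, not the author's code: it uses make_blobs as a stand-in for the forge data (the centers and seed are arbitrary) and fits the same model with both solvers. One documented difference is that liblinear also penalizes the intercept (see intercept_scaling in the scikit-learn documentation) while lbfgs does not, which likely contributes to the discrepancy when the data lie far from the origin.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Stand-in for forge (an assumption): two blobs whose features lie far from the origin
X, y = make_blobs(n_samples=30, centers=[[9.0, 1.0], [11.0, 4.0]],
                  cluster_std=1.0, random_state=0)

fitted = {}
for solver in ["liblinear", "lbfgs"]:
    model = LogisticRegression(C=1.0, solver=solver).fit(X, y)
    # Store (intercept, coefficient vector) for comparison
    fitted[solver] = (model.intercept_[0], model.coef_[0])
    print("{:9s} b={:8.3f}  w={}".format(solver, model.intercept_[0], model.coef_[0]))
```

With the same C, the two solvers are minimizing slightly different objectives, so different boundaries do not by themselves indicate a convergence problem.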

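The result with liblinear is shown below. The slope appears to change more dynamically with the degree of regularization than it does with lbfgs.

The slopes in these figures, in turn, differ considerably from Figure 2-16 of the book. On closer inspection, the forge data in that figure seems to contain several points, especially among the circle markers in the lower part, that are not in the original dataset, and this is probably the cause.

The code for these plots is as follows.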

import numpy as np
import matplotlib.pyplot as plt
from mglearn.datasets import make_forge
from sklearn.linear_model import LogisticRegression

X, y = make_forge()

xmin, xmax = 7.5, 12.5
ymin, ymax = -1, 6

C_values = [1e5, 1e2, 1e0, 1e-2]

fig, axs = plt.subplots(2, 2, figsize=(6.4, 6.4))
fig.subplots_adjust(hspace=0.4)

axs_1d = axs.ravel()

for ax, c in zip(axs_1d, C_values):
    logreg = LogisticRegression(C=c, solver='liblinear')
    logreg.fit(X, y)

    b = logreg.intercept_[0]
    w0 = logreg.coef_[0][0]
    w1 = logreg.coef_[0][1]

    x_border = np.linspace(xmin, xmax)
    y_border = (-b - w0 * x_border) / w1

    ax.scatter(X[:, 0][y==1], X[:, 1][y==1], marker='^')
    ax.scatter(X[:, 0][y==0], X[:, 1][y==0], marker='o')

    ax.plot(x_border, y_border, 'k')

    ax.set_xlim(xmin, xmax)
    ax.set_ylim(ymin, ymax)

    ax.set_title("C={:2.2f}(score={:6.3f})".format(c, logreg.score(X, y)))
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
    ax.label_outer()

plt.show()
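
The line y_border = (-b - w0 * x_border) / w1 in the code above comes from setting the linear score b + w0*x0 + w1*x1 to zero and solving for x1. The sketch below is added here for illustration and uses a make_blobs stand-in rather than the forge data (an assumption). It checks that points on this line have a decision score of 0 and a predicted probability of 0.5.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Stand-in data (an assumption), roughly in the same coordinate range as forge
X, y = make_blobs(n_samples=30, centers=[[9.0, 1.0], [11.0, 4.0]],
                  cluster_std=1.0, random_state=0)
logreg = LogisticRegression(C=1.0, solver='liblinear').fit(X, y)

b = logreg.intercept_[0]
w0, w1 = logreg.coef_[0]

# Boundary: b + w0*x0 + w1*x1 = 0  ->  x1 = (-b - w0*x0) / w1
x_border = np.linspace(7.5, 12.5, 5)
y_border = (-b - w0 * x_border) / w1

border_points = np.column_stack([x_border, y_border])
scores = logreg.decision_function(border_points)   # should all be ~0
probs = logreg.predict_proba(border_points)[:, 1]  # should all be ~0.5
print(scores)
print(probs)
```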

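3D view

For two values of C, plot the probability of the blue points over the plane of the two features (solver='lbfgs'). The probability surface clearly becomes gentler when C is small, but how this relates to the classification score on the data is not obvious.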

import numpy as np
import matplotlib.pyplot as plt
from mglearn.datasets import make_forge
from sklearn.linear_model import LogisticRegression
from mpl_toolkits.mplot3d import Axes3D

X, y = make_forge()

xmin, xmax = 7.5, 12.5
ymin, ymax = -1, 6

gx = np.linspace(xmin, xmax, 40)
gy = np.linspace(ymin, ymax, 40)
gx, gy = np.meshgrid(gx, gy)

C_values = [1e3, 1e-1]

fig = plt.figure(figsize=(12, 4.8))

ax0 = fig.add_subplot(121, projection='3d')
ax1 = fig.add_subplot(122, projection='3d')
axs = [ax0, ax1]

for ax, c in zip(axs, C_values):
    logreg = LogisticRegression(C=c, solver='lbfgs')
    logreg.fit(X, y)

    b = logreg.intercept_[0]
    w0 = logreg.coef_[0][0]
    w1 = logreg.coef_[0][1]
    gz = 1/(1 + np.exp(-b - w0*gx - w1*gy))
    gz05 = np.full_like(gz, 0.5)

    y_border_min = (-b - w0 * xmin) / w1
    y_border_max = (-b - w0 * xmax) / w1

    ax.scatter(X[:, 0][y==1], X[:, 1][y==1], 0.5, color='tab:blue')
    ax.scatter(X[:, 0][y==0], X[:, 1][y==0], 0.5, color='tab:red')
    ax.plot_wireframe(gx, gy, gz, color='tab:green', alpha=0.5)
    ax.plot_surface(gx, gy, gz05, color='k', alpha=0.2)
    ax.plot([xmin, xmax], [y_border_min, y_border_max], 0.5)

    ax.set_xlim(xmin, xmax)
    ax.set_ylim(ymin, ymax)

    ax.set_title("C={}".format(c))
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")

plt.show()
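
Why the surface flattens for small C can be made quantitative. Because the derivative of the logistic function at 0 is 1/4, the slope of the probability surface across the decision boundary is ||w||/4, and regularization shrinks ||w||. The sketch below is an illustration added here, with a make_blobs stand-in for the forge data (an assumption). It compares this analytic slope with a finite-difference estimate for the two C values used above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Stand-in data (an assumption), not the forge dataset itself
X, y = make_blobs(n_samples=30, centers=[[9.0, 1.0], [11.0, 4.0]],
                  cluster_std=1.0, random_state=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

steepness = {}
for c in [1e3, 1e-1]:
    logreg = LogisticRegression(C=c, solver='lbfgs', max_iter=10000).fit(X, y)
    b = logreg.intercept_[0]
    w = logreg.coef_[0]
    w_norm = np.linalg.norm(w)

    x0 = -b * w / w_norm**2   # a point on the boundary: w.x0 + b = 0
    n = w / w_norm            # unit normal to the boundary
    h = 1e-4
    # Central difference of the probability along the normal direction
    numeric = (sigmoid(w @ (x0 + h * n) + b) - sigmoid(w @ (x0 - h * n) + b)) / (2 * h)
    steepness[c] = (w_norm / 4, numeric)
    print("C={:g}: ||w||/4 = {:.4f}, numeric slope = {:.4f}".format(c, w_norm / 4, numeric))
```

The slope only describes how sharply the probability changes near the boundary; the classification score depends on where the boundary lies, which is why the two are not directly linked.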

ndarray – extracting rows and columns

2020-05-09 / tau

Example array

Prepare the following array for the examples.

import numpy as np

a = np.arange(30).reshape(6, 5)
print(a)

# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]
#  [15 16 17 18 19]
#  [20 21 22 23 24]
#  [25 26 27 28 29]]

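Extracting a single row or column

Extracting a single row

Specifying only the first index extracts the corresponding row. Omitting the second index is the same as specifying ':' for it.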

print(a[3])

# [15 16 17 18 19]

Extracting a single column

Giving ':' as the first index and an integer as the second extracts the corresponding column. Note that the result is a 1-D array.

print(a[:, 2])

# [ 2  7 12 17 22 27]

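There are two ways to obtain it as a column vector.

The first is the standard idiom reshape(-1, 1). The second argument, 1, sets the number of columns to 1, and setting the first argument to -1 lets the appropriate number of rows be determined from the number of columns and the array size.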

print(a[:, 2].reshape(-1, 1))

# [[ 2]
#  [ 7]
#  [12]
#  [17]
#  [22]
#  [27]]

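The second way is to specify the column deliberately as a one-column slice. As described below, specifying columns with a slice preserves the 2-D shape, and this method takes advantage of that. The example below specifies the "range" that starts and ends at column index 2.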

print(a[:, 2:3])

# [[ 2]
#  [ 7]
#  [12]
#  [17]
#  [22]
#  [27]]
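
The shapes produced by each approach can be compared side by side. The sketch below is added for illustration; the np.newaxis and one-element-list spellings are alternatives not mentioned in the original, shown only because they give the same column vector.

```python
import numpy as np

a = np.arange(30).reshape(6, 5)

col_1d = a[:, 2]                      # integer index: 1-D result
col_reshape = a[:, 2].reshape(-1, 1)  # method 1: reshape
col_slice = a[:, 2:3]                 # method 2: one-column slice
col_newaxis = a[:, 2, np.newaxis]     # alternative: add a trailing axis
col_list = a[:, [2]]                  # alternative: one-element list (fancy indexing)

print(col_1d.shape, col_reshape.shape, col_slice.shape,
      col_newaxis.shape, col_list.shape)
# (6,) (6, 1) (6, 1) (6, 1) (6, 1)
```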

Extracting several consecutive rows or columns

Extracting consecutive rows

Specify the first index as a slice to extract several consecutive rows.

print(a[2:5])

# [[10 11 12 13 14]
#  [15 16 17 18 19]
#  [20 21 22 23 24]]

Extracting consecutive columns

Specify the second index as a slice to extract several consecutive columns.

print(a[:, 1:4])

# [[ 1  2  3]
#  [ 6  7  8]
#  [11 12 13]
#  [16 17 18]
#  [21 22 23]
#  [26 27 28]]
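
Row and column slices can also be combined in one expression to cut out a rectangular block, and a slice may carry a step. This short sketch is an addition for illustration, using the same example array.

```python
import numpy as np

a = np.arange(30).reshape(6, 5)

block = a[2:5, 1:4]       # rows 2..4 and columns 1..3
print(block)
# [[11 12 13]
#  [16 17 18]
#  [21 22 23]]

every_other = a[::2, ::2]  # every second row and every second column
print(every_other)
# [[ 0  2  4]
#  [10 12 14]
#  [20 22 24]]
```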

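Extracting several non-consecutive rows or columns

Extracting non-consecutive rows

Passing a list as the first index extracts the rows whose indices are the elements of the list. This kind of indexing is called fancy indexing.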

print(a[[2, 4]])

# [[10 11 12 13 14]
#  [20 21 22 23 24]]

The list elements do not have to be in ascending order; the rows are extracted in the order given.

print(a[[4, 2]])

# [[20 21 22 23 24]
#  [10 11 12 13 14]]
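
One practical difference from slicing is worth keeping in mind: a slice returns a view that shares memory with the original array, whereas fancy indexing returns a copy. The sketch below is an addition for illustration and demonstrates this with np.shares_memory.

```python
import numpy as np

a = np.arange(30).reshape(6, 5)

rows_slice = a[2:5]      # basic slicing: a view of a
rows_fancy = a[[2, 4]]   # fancy indexing: an independent copy

print(np.shares_memory(a, rows_slice))   # True
print(np.shares_memory(a, rows_fancy))   # False

rows_fancy[0, 0] = -1    # changes only the copy
rows_slice[0, 0] = -2    # writes through to a
print(a[2, 0])           # -2
```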

Extracting non-consecutive columns

Give ':' as the first index and a list as the second to extract the columns corresponding to its elements.

print(a[:, [1, 3]])

# [[ 1  3]
#  [ 6  8]
#  [11 13]
#  [16 18]
#  [21 23]
#  [26 28]]

For columns too, the elements may be in any order.

print(a[:, [3, 1]])

# [[ 3  1]
#  [ 8  6]
#  [13 11]
#  [18 16]
#  [23 21]
#  [28 26]]
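
Selecting non-consecutive rows and columns at the same time needs some care, because passing two lists pairs their elements rather than taking every combination. The sketch below is an addition for illustration. It contrasts the paired result with two ways of getting the full sub-grid: chaining the two selections shown above, or using np.ix_.

```python
import numpy as np

a = np.arange(30).reshape(6, 5)

paired = a[[2, 4], [1, 3]]           # elements (2, 1) and (4, 3) only
print(paired)
# [11 23]

chained = a[[2, 4]][:, [1, 3]]       # rows first, then columns
grid = a[np.ix_([2, 4], [1, 3])]     # same sub-grid in one step
print(grid)
# [[11 13]
#  [21 23]]
```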
