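Understanding Ridge Regression

2020-04-26 / tau

Definition

Ridge regression adds an L2 regularization term, as a penalty, to the loss function of multiple regression. The meaning of regularization is covered in detail in a separate post.

The L2 norm is the Euclidean distance from the origin.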

(1) $\begin{equation*} \| \boldsymbol{w} \| _2 = \sqrt{w_1 ^2 + \cdots + w_m^2} \end{equation*}$

Ridge regression, however, uses the sum of squares under the root, that is, the squared L2 norm.

(2) $\begin{equation*} \mathrm{minimize} \quad \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^m w_j^2 \end{equation*}$

Formulation

The function to minimize is

(3) $\begin{align*} L &= \sum_{i=1}^n ( \hat{y}_i - y_i )^2 + \alpha ({w_1}^2 + \cdots + {w_m}^2) \\ &= \sum ( w_0 + w_1 x_{1i} + \cdots + w_m x_{mi} - y_i )^2 + \alpha ({w_1}^2 + \cdots + {w_m}^2) \end{align*}$

To obtain the weights, take the partial derivative with respect to each one and set it to zero.

(4) $\begin{align*} \frac{\partial L}{\partial w_0} &= 2 \sum (w_0 + w_1 x_{1i} + \cdots + w_m x_{mi} - y_i) = 0 \\ \frac{\partial L}{\partial w_1} &= 2 \sum x_{1i} (w_0 + w_1 x_{1i} + \cdots + w_m x_{mi} - y_i) + 2 \alpha w_1 = 0 \\ \vdots\\ \frac{\partial L}{\partial w_m} &= 2 \sum x_{mi} (w_0 + w_1 x_{1i} + \cdots + w_m x_{mi} - y_i) + 2 \alpha w_m = 0\\ \end{align*}$

The resulting system of equations is as follows.

(5) $\begin{align*} n w_0 + w_1 \sum x_{1i} + \cdots + w_m \sum x_{mi} &= \sum y_i \\ w_0 \sum x_{1i} + w_1 \left( \sum {x_{1i}}^2 + \alpha \right) + \cdots + w_m \sum x_{1i} x_{mi} &= \sum x_{1i} y_i \\ \vdots \\ w_0 \sum x_{mi} + w_1 \sum x_{1i} x_{mi} + \cdots+ w_m \left( \sum {x_{mi}}^2 + \alpha \right) &= \sum x_{mi} y_i \\ \end{align*}$

Writing each sum with the symbol S and subscripts, and putting the system in matrix form, gives the following.

(6) $\begin{equation*} \left[ \begin{array}{cccc} n & S_1 & \cdots & S_m \\ S_1 & S_{11} + \alpha & & S_{1m} \\ \vdots & \vdots & & \vdots \\ S_m & S_{m1} & \cdots & S_{mm} + \alpha \end{array} \right] \left[ \begin{array}{c} w_0 \\ w_1 \\ \vdots \\ w_m \end{array} \right] = \left[ \begin{array}{c} S_y \\S_{1y} \\ \vdots \\ S_{my} \end{array} \right] \end{equation*}$

Eliminating $w_0$ gives the following system.

(7) $\begin{align*} &\left[ \begin{array}{ccc} ( S_{11} + \alpha ) - \dfrac{{S_1}^2}{n} & \cdots & S_{1m} - \dfrac{S_1 S_m}{n} \\ \vdots & & \vdots \\ S_{m1} - \dfrac{S_m S_1}{n} & \cdots & ( S_{mm} + \alpha )- \dfrac{{S_m}^2}{n} \end{array} \right] \left[ \begin{array}{c} w_1 \\ \vdots \\ w_m \end{array} \right] \\&= \left[ \begin{array}{c} S_{1y} - \dfrac{S_1 S_y}{n} \\ \vdots \\ S_{my} - \dfrac{S_m S_y}{n} \end{array} \right] \end{align*}$

Dividing both sides by n and writing the result with variances and covariances gives

(8) $\begin{equation*} \left[ \begin{array}{ccc} V_{11} + \dfrac{\alpha}{n} & \cdots & V_{1m} \\ \vdots & & \vdots \\ V_{m1} & \cdots & V_{mm} + \dfrac{\alpha}{n} \end{array} \right] \left[ \begin{array}{c} w_1 \\ \vdots \\ w_m \end{array} \right] = \left[ \begin{array}{c} V_{1y} \\ \vdots \\ V_{my} \end{array} \right] \end{equation*}$
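As a quick numerical check of this reduction, the sketch below solves the full system (6) and the covariance form (8) on made-up data and compares the results. The data, the value of `alpha`, and all variable names are assumptions for illustration; the variances and covariances are taken with a 1/n divisor, and $w_0$ is recovered from the first row of (6).

```python
import numpy as np

# Hypothetical data: n samples, m features (not from the original post).
rng = np.random.default_rng(0)
n, m, alpha = 50, 3, 2.0
X = rng.normal(size=(n, m))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Equation (6): sums-of-products matrix with alpha on the diagonal,
# except for the intercept row/column.
A = np.column_stack([np.ones(n), X])
penalty = alpha * np.eye(m + 1)
penalty[0, 0] = 0.0
w_full = np.linalg.solve(A.T @ A + penalty, A.T @ y)

# Equation (8): variances/covariances (1/n divisor) with alpha/n on the diagonal.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
V = Xc.T @ Xc / n
V_y = Xc.T @ yc / n
w = np.linalg.solve(V + (alpha / n) * np.eye(m), V_y)

# First row of (6): n*w0 + sum_j S_j*w_j = S_y
w0 = y.mean() - X.mean(axis=0) @ w

print(w_full)
print(w0, w)
```

Both routes should print the same intercept and weights up to rounding.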

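Now suppose that two features $x_i$ and $x_j$ are in a perfect linear relationship. With $x_j = a x_i + b$, the properties of variance and covariance give, for any other feature $x_k$,

(9) $\begin{equation*} V_{jj} = a^2V_{ii}, \; V_{ji} = V_{ij} = aV_{ii}, \; V_{jk} = V_{kj} = aV_{ik} = aV_{ki} \end{equation*}$

In this situation ordinary linear regression has no unique solution because of multicollinearity, but substituting into equation (8) gives the following coefficient matrix.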

(10) $\begin{align*} \left[ \begin{array}{ccccccc} V_{11} + \dfrac{\alpha}{n} & \cdots & V_{1i} & \cdots & aV_{1i} & \cdots & V_{1m}\\ \vdots && \vdots && \vdots && \vdots\\ V_{i1} & \cdots & V_{ii} + \dfrac{\alpha}{n} & \cdots & aV_{ii} & \cdots & V_{im}\\ \vdots && \vdots && \vdots && \vdots\\ aV_{i1} & \cdots & aV_{ii} & \cdots & a^2V_{ii} + \dfrac{\alpha}{n} & \cdots & aV_{im}\\ \vdots && \vdots && \vdots && \vdots\\ V_{m1} & \cdots & V_{mi} & \cdots & aV_{mi} & \cdots & V_{mm} + \dfrac{\alpha}{n} \end{array} \right] \end{align*}$

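Because α is added to the diagonal elements, the coefficient matrix is nonsingular (its determinant is nonzero) even under strong multicollinearity, so the system has a solution. The regularization effect also means that a larger α keeps the coefficients small.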
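The effect of the diagonal term can be seen on a small made-up example. The sketch below builds two perfectly collinear features (with assumed values a = 3, b = 2), forms the covariance matrix of equation (8), and compares its determinant with and without α/n on the diagonal. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = 3.0 * x1 + 2.0            # perfectly collinear with x1 (a = 3, b = 2)

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)
V = Xc.T @ Xc / n              # covariance matrix with a 1/n divisor

alpha = 1.0
V_ridge = V + (alpha / n) * np.eye(2)

det_plain = np.linalg.det(V)        # zero up to rounding error
det_ridge = np.linalg.det(V_ridge)  # strictly positive
print(det_plain, det_ridge)
```

For two features the ridge determinant works out to $(\alpha/n)(1 + a^2)V_{11} + (\alpha/n)^2$, which is positive for any α > 0.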

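Matrix form

Writing the loss function of equation (3) in matrix form for n data points gives the following (the matrix formulation of multiple regression is covered in a separate post).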

(11) $\begin{align*} L &= \left( \boldsymbol{Xw} - \boldsymbol{y} \right)^T \left( \boldsymbol{Xw} - \boldsymbol{y} \right) + \alpha \boldsymbol{w}^T \boldsymbol{w} \\ &= \boldsymbol{w}^T \boldsymbol{X}^T \boldsymbol{Xw} - 2\boldsymbol{y}^T \boldsymbol{Xw} + \boldsymbol{y}^T \boldsymbol{y} + \alpha \boldsymbol{w}^T \boldsymbol{w} \end{align*}$

Differentiate this with respect to w to find the value that minimizes L.

(12) $\begin{gather*} \frac{dL}{d\boldsymbol{w}} = 2\boldsymbol{X}^T \boldsymbol{Xw} - 2 \boldsymbol{X}^T \boldsymbol{y} + 2 \alpha \boldsymbol{w} = \boldsymbol{0} \\ \boldsymbol{w} = \left( \boldsymbol{X}^T \boldsymbol{X} + \alpha \boldsymbol{I} \right)^{-1} \boldsymbol{X}^T \boldsymbol{y} \end{gather*}$
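A minimal NumPy sketch of the closed form in equation (12). It assumes X holds centered features with no column of ones, so the intercept is not penalized; that is an assumption of this sketch, and the function name `ridge_closed_form` and the data are made up. It checks that the gradient in (12) vanishes at the solution and that a larger α shrinks the weights.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha I) w = X^T y without forming the inverse explicitly."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(m), X.T @ y)

# Hypothetical centered data.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4))
X = X - X.mean(axis=0)
y = X @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.2, size=40)
y = y - y.mean()

w_small = ridge_closed_form(X, y, alpha=0.1)
w_large = ridge_closed_form(X, y, alpha=100.0)

# Gradient from equation (12), evaluated at the solution for alpha = 0.1.
grad = 2 * X.T @ X @ w_small - 2 * X.T @ y + 2 * 0.1 * w_small

print(np.linalg.norm(w_small), np.linalg.norm(w_large))
print(np.abs(grad).max())
```

Using `np.linalg.solve` instead of an explicit inverse is a common numerical choice; it is not something the derivation itself requires.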

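wave dataset – linear regression

2020-04-05 / tau

A reproduction of applying scikit-learn's linear regression to the wave dataset, as shown in O'Reilly's "Pythonではじめる機械学習" (Introduction to Machine Learning with Python).

With 60 samples in the wave dataset and random_state=42 in train_test_split(), the result is the same graph as in the book.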

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mglearn.datasets import make_wave

xmin, xmax = -3, 3
ymin, ymax = -3, 3

X_source, y_source = make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X_source, y_source, random_state=42)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

X_test = np.linspace(xmin, xmax, 2).reshape(-1, 1)
y_test = linreg.predict(X_test)

print(linreg.coef_[0], linreg.intercept_)

fig, ax = plt.subplots(figsize=(6.4, 6.4))

ax.scatter(X_source, y_source, s=20)
ax.plot(X_test, y_test, c="tab:orange")

ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid()

ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)

ax.set_aspect('equal')

plt.show()

The fitted coefficient, intercept, and scores also match the book.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mglearn.datasets import make_wave

X, y = make_wave(n_samples=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

print("coef_     : {}".format(linreg.coef_))
print("intercept_: {}".format(linreg.intercept_))

print("training score: {:.3f}".format(linreg.score(X_train, y_train)))
print("test score    : {:.3f}".format(linreg.score(X_test, y_test)))

# coef_     : [0.39390555]
# intercept_: -0.031804343026759746
# training score: 0.670
# test score    : 0.659
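For a regressor, the value returned by `score()` is the coefficient of determination R². The sketch below computes it by hand on made-up data; the helper name `r2_manual` and the data are assumptions for illustration, and `np.polyfit` stands in for the fitted model.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical 1-D data and a least-squares line through it.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])
slope, intercept = np.polyfit(x, y, 1)

r2_fit = r2_manual(y, slope * x + intercept)
r2_mean = r2_manual(y, np.full_like(y, y.mean()))  # always predicting the mean
print(r2_fit, r2_mean)
```

Predicting the mean everywhere gives R² = 0 and a perfect fit gives 1, which helps put the scores of 0.670 and 0.659 in context.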

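Breast cancer dataset – accuracy curves with logistic regression

2020-04-05 / tau

Overview

Results of applying logistic regression to the breast-cancer dataset with scikit-learn's LogisticRegression class.

For the overall workflow, see the separate post on logistic regression with the cancer dataset (following "Pythonではじめる機械学習").

Here we look at how the accuracy (the value returned by score()) changes as a hyperparameter is varied.

Accuracy curves

These are accuracy curves for scikit-learn's LogisticRegression class as the regularization parameter is varied. The class offers several optimization methods through the solver argument, and the accuracy curves differed more than expected between solvers. Changing random_state in train_test_split() also changes the result. The dataset has 569 samples, which are further split into training and test data, so at this size a fair amount of variation may simply be unavoidable.

First, the accuracy curves for the four solvers with random_state=0. L-BFGS is apparently a quasi-Newton method, so it makes sense that it behaves like Newton-CG. SAG (Stochastic Average Gradient) seems to use a rather different calculation and behaves quite differently from the others. The iteration limit is set with max_iter=10000, and even then a few warnings about exceeding the iteration limit appear. Raising the limit by two orders of magnitude does not change the situation much.

With random_state=11, liblinear does not change much, but the other three solvers show different trends; with sag in particular, the training accuracy is lower than the test accuracy.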

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

ds = load_breast_cancer()

df = pd.DataFrame(ds.data, columns=ds.feature_names)

X_train, X_test, y_train, y_test = \
    train_test_split(df, ds.target, stratify=ds.target, random_state=0)

C_sup = np.linspace(5, -4, 20)
C_val = 10**C_sup

solvers = ['liblinear', 'lbfgs', 'newton-cg', 'sag']


fig, axs = plt.subplots(2, 2, figsize=(8, 8))
axs_1d = axs.reshape(-1)

for ax, solver in zip(axs_1d, solvers):
    train_scores = np.empty(0)
    test_scores = np.empty(0)
    for C in C_val:
        logreg = LogisticRegression(C=C, solver=solver, max_iter=10000)
        logreg.fit(X_train, y_train)
        train_scores = np.append(train_scores, logreg.score(X_train, y_train))
        test_scores = np.append(test_scores, logreg.score(X_test, y_test))

    ax.plot(C_val, train_scores, label="Training scores")
    ax.plot(C_val, test_scores, label="Test scores")

    ax.set_xscale('log')
    ax.set_ylim(0.9, 1)
    ax.grid(True)
    ax.legend()
    ax.set_title(solver)

plt.show()
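The grid of C values above is built as `10**np.linspace(5, -4, 20)`. As a small aside, `np.logspace` produces the same grid in one call; the sketch below only checks that equivalence (the name `C_alt` is made up here).

```python
import numpy as np

C_sup = np.linspace(5, -4, 20)   # exponents from 5 down to -4
C_val = 10 ** C_sup              # the grid used in the listing above
C_alt = np.logspace(5, -4, 20)   # same grid in one call

print(np.allclose(C_val, C_alt))
print(C_val.max(), C_val.min())
```

Because the grid runs from large C (weak regularization) to small C (strong regularization), the curves are plotted on a logarithmic x axis.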

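ndarray.reshape – changing the shape of an array

2020-04-05 / tau

Basics

The shape of an array is changed with the reshape() method. reshape() leaves the shape of the original array untouched and returns a new array object (for a contiguous array this is normally a view that shares the same data).

For a range of concrete uses, see the separate post on how to use ndarray.reshape.

The example below reshapes a 1-D array of 6 elements into a 2×3 2-D array, and then into a 3×2 2-D array. Elements are always filled row by row from the top, and from left to right within each row (row-major order).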

import numpy as np

a = np.arange(6)
b = a.reshape(2, 3)
c = b.reshape(3, 2)
print(a)
print(b)
print(c)

# [0 1 2 3 4 5]
# [[0 1 2]
#  [3 4 5]]
# [[0 1]
#  [2 3]
#  [4 5]]
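One point worth checking is what "new array" means here. For a contiguous array like the one above, reshape() returns a view on the same data, so writing through the reshaped array also changes the original. The sketch below illustrates this, with `.copy()` as one way to get independent data.

```python
import numpy as np

a = np.arange(6)
b = a.reshape(2, 3)

print(a.shape, b.shape)          # the original keeps its shape
print(np.shares_memory(a, b))    # the two arrays share one data buffer

b[0, 0] = 99                     # write through the reshaped view
print(a)                         # [99  1  2  3  4  5]

c = a.reshape(2, 3).copy()       # explicit copy: independent data
c[0, 1] = -1
print(a[1])                      # still 1
```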

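Implicit dimension (-1)

When reshaping, setting one dimension to -1 makes NumPy infer it from the total size and the other dimensions.

The example below builds a 2×3×2 3-D array and reshapes it to 3×2×2; the second dimension is given as -1 and is inferred from the first and third.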

import numpy as np

a1 = np.arange(10, 16).reshape(3, 2)
a2 = np.arange(20, 26).reshape(3, 2)
b = np.array([a1, a2])
print(b.ndim, b.shape)
print(b)

c = b.reshape(3, -1, 2)
print(c)

# 3 (2, 3, 2)
# [[[10 11]
#   [12 13]
#   [14 15]]
# 
#  [[20 21]
#   [22 23]
#   [24 25]]]
# [[[10 11]
#   [12 13]]
# 
#  [[14 15]
#   [20 21]]
# 
#  [[22 23]
#   [24 25]]]
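The inferred dimension is simply the total size divided by the product of the dimensions that were given. The sketch below reproduces that arithmetic and shows that a shape which does not divide the size evenly raises `ValueError`; the variable names are illustrative.

```python
import numpy as np

b = np.arange(12).reshape(2, 3, 2)

c = b.reshape(3, -1, 2)
inferred = b.size // (3 * 2)     # 12 elements / (3 * 2) = 2
print(c.shape, inferred)

# 12 elements cannot be split into 5 equal rows, so this fails.
raised = False
try:
    b.reshape(5, -1)
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```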

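This is used, for example, to convert a row-vector array into a column vector. The example below creates a 1-D array and turns it into a column vector by fixing the number of columns at 1 and letting the number of rows be computed with -1.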

import numpy as np

a = np.arange(3)
b = a.reshape(-1, 1)
print(b)

# [[0]
#  [1]
#  [2]]

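A caveat when flattening to 1-D

When flattening a multi-dimensional array or a column vector, specifying 1 row and -1 columns yields a 2-D array that contains the desired 1-D array as its only row. This happens because the arguments to reshape() ask for a 2-D shape of 1 row × n columns.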

import numpy as np

a = np.arange(12).reshape(2, 3, 2)
print(a)
print(a.reshape(1, -1))

# [[[ 0  1]
#   [ 2  3]
#   [ 4  5]]
# 
#  [[ 6  7]
#   [ 8  9]
#   [10 11]]]
# [[ 0  1  2  3  4  5  6  7  8  9 10 11]]

b = np.arange(3).reshape(-1, 1)
print(b)
print(b.reshape(1, -1))

# [[0]
#  [1]
#  [2]]
# [[0 1 2]]

Passing a single integer instead, taken from the array's size attribute, gives a 1-D array with that many elements.

import numpy as np

a = np.arange(12).reshape(2, 3, 2)
print(a.reshape(a.size))

b = np.arange(3).reshape(-1, 1)
print(b.reshape(b.size))

# [ 0  1  2  3  4  5  6  7  8  9 10 11]
# [0 1 2]

Going one step further, passing only -1 makes reshape() work out the total size by itself.

import numpy as np

a = np.arange(12).reshape(2, 3, 2)
print(a.reshape(-1))

b = np.arange(3).reshape(-1, 1)
print(b.reshape(-1))

# [ 0  1  2  3  4  5  6  7  8  9 10 11]
# [0 1 2]

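Besides turning a column vector into a row vector, this is useful with pyplot: when several Axes instances are returned as a rows × columns array and the same settings should be applied to all of them, flatten the array to 1-D and loop over it.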
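The looping pattern can be sketched without matplotlib. Below, `PanelStub` is a hypothetical stand-in for an Axes object that only records a title; with real code the 2×2 array would come from `plt.subplots(2, 2)`.

```python
import numpy as np

class PanelStub:
    """Hypothetical stand-in for a matplotlib Axes; only records a title."""
    def __init__(self):
        self.title = None

    def set_title(self, title):
        self.title = title

# A 2x2 grid of objects, shaped like the array returned by plt.subplots(2, 2).
axs = np.empty((2, 2), dtype=object)
for r in range(2):
    for c in range(2):
        axs[r, c] = PanelStub()

# Flatten to 1-D and apply one setting per panel in a single loop.
titles = ["liblinear", "lbfgs", "newton-cg", "sag"]
for ax, title in zip(axs.reshape(-1), titles):
    ax.set_title(title)

print([[ax.title for ax in row] for row in axs])
```

The flattened array holds the same objects as the grid, so the titles set in the loop show up when the 2×2 array is read back.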

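ndarray – dimensions, shape, and size of an array

2020-04-05 / tau

`ndim` attribute: number of dimensions

The ndim attribute returns the number of dimensions of the array as an integer.

Note that an array holding a single 1-D array, and a column vector, both have ndim 2. It is simplest to think of ndim as the nesting depth of [].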

import numpy as np

a = np.array([1, 2, 3])
print(a.ndim)  # 1

b = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(b.ndim)  # 2

c = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]]
])
print(c.ndim)  # 3

d = np.array([[1, 2, 3]])
print(d.ndim)  # 2

e = np.array([
    [1],
    [2],
    [3],
])
print(e.ndim)  # 2

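`shape` attribute: shape of the array

The shape attribute returns the shape of the array.

For a plain 1-D array the shape is (n,) rather than (1, n). The result is always a tuple, and for a 1-D array that tuple has a single element, the number of elements.

For shapes with ndim=2 the tuple has two elements, shape=(rows, columns). For higher dimensions the sizes are listed from the outermost dimension inward.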

import numpy as np

a = np.array([1, 2, 3])
print(a.shape)  # (3,)

b = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(b.shape)  # (2, 3)

c = np.array([
    [[11, 12, 13, 14],
     [15, 16, 17, 18],
     [19, 20, 21, 22]],
    [[51, 52, 53, 54],
     [55, 56, 57, 58],
     [59, 60, 61, 62]]
])
print(c.shape)  # (2, 3, 4)

d = np.array([[1, 2, 3]])
print(d.shape)  # (1, 3)

e = np.array([
    [1],
    [2],
    [3],
])
print(e.shape)  # (3, 1)

`size` attribute: size of the array

The size attribute gives the total number of elements in the array.

import numpy as np

a = np.array([1, 2, 3])
print(a.size)  # 3

b = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(b.size)  # 6

c = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]]
])
print(c.size)  # 8

d = np.array([[1, 2, 3]])
print(d.size)  # 3

e = np.array([
    [1],
    [2],
    [3],
])
print(e.size)  # 3
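The three attributes are tied together: ndim is the length of the shape tuple, and size is the product of its entries. The short sketch below checks this on a few arrays; the list name `arrays` is made up here.

```python
import numpy as np

arrays = [
    np.array([1, 2, 3]),          # shape (3,)
    np.array([[1, 2, 3]]),        # shape (1, 3)
    np.array([[1], [2], [3]]),    # shape (3, 1)
    np.zeros((2, 3, 4)),          # shape (2, 3, 4)
]

for arr in arrays:
    same_ndim = arr.ndim == len(arr.shape)
    same_size = arr.size == int(np.prod(arr.shape))
    print(arr.shape, arr.ndim, arr.size, same_ndim, same_size)
```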

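Logistic regression

2020-04-04 / tau

Overview

Target problems

Despite the word "regression" in its name, logistic regression performs two-class classification: given several features, it decides which of two classes, 1 or 0, a sample belongs to.

For example, suppose that for 20 member customers of a travel agency we have each customer's age and whether they chose a hot spring (SPA) or a leisure park (LSR), as below, and we want to decide which to recommend to a new customer.

The "is SPA" column in the data above is 1 when the spa was chosen and 0 when the leisure park was chosen. Plotted as a scatter plot it looks as follows.

The tendency for older customers to choose the spa is visible, but the goal is to derive a model that says which of the two a customer will choose given their age.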

(1) $\begin{equation*} (x_1, \ldots, x_m) \rightarrow \left \{ \begin{array}{cc} 1 \\ 0 \end{array} \right. \end{equation*}$

Taking a linear model, suppose that the value of a linear combination of the features determines whether the class is 1 or 0:

(2) $\begin{equation*} b + w_1 x_1 + \cdots + w_m x_m \rightarrow \left \{ \begin{array}{cc} 1 \\ 0 \end{array} \right. \end{equation*}$

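Using linear regression

It may look reasonable to apply linear regression to the data as it is and, for example, separate spa from leisure park at the age where the regression line equals 0.5. However, it is not clear what such a regression line means for this data.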

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

path = os.path.join(os.path.dirname(__file__), "data_spa.csv")
df = pd.read_csv(path)

X = np.array(df['age']).reshape(-1, 1)
y = df['is_spa']

linreg = LinearRegression()
linreg.fit(X, y)
b = linreg.intercept_
w = linreg.coef_[0]
x_border = (0.5 - b) / w
print(x_border)

xl = 0
xr = 100
yl = b + w * xl
yr = b + w * xr

fig, ax = plt.subplots()

ax.scatter(X, y, c='tab:blue')
ax.plot([xl, xr], [yl, yr], c='tab:orange')
ax.plot([x_border, x_border], [0, 1], c='k', linestyle='dashed')

ax.set_xlim(xl, xr)
ax.set_ylim(-0.1, 1.1)
ax.set_yticks(np.linspace(0, 1, 11))
ax.grid(True)

plt.show()

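Assuming a probability model

Instead of plain linear regression, consider the probability of class 1 as a function of the linear combination of the features.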

(3) $\begin{align*} \Pr (y=1) &= f(z) = f( b + w_1 x_1 + \cdots + w_m x_m ) \\ f(0) &= 0.5 \end{align*}$

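The function used for this probability needs the following properties.

For z over (−∞, +∞), its value lies in (0, 1)
Small z → probability close to 0 → more likely to be classified as class 0
Large z → probability close to 1 → more likely to be classified as class 1

Plotted, it has the following shape.

Classification by regression on such a function is called logistic regression, after the functional form used for the probability.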

Formulating logistic regression

Assumed probability function

Let the explanatory variables be $x_1, \ldots, x_m$, and suppose that an event either occurs or does not occur.

Suppose that data $x_{i1}, \ldots, x_{im}$ are observed for subjects $i=1, \ldots, n$. The model is formulated as follows.

(4) $\begin{align*} & z_i = b + w_1 x_{i1} + \cdots + w_m x_{im} = b + {\boldsymbol w}{\boldsymbol x} \\ & y_i = \left\{ \begin{array}{cl} 1 & \textrm{(occurred)}\\ 0 & \textrm{(not occurred)} \end{array} \right. \\ & \Pr (y=1) = p = \frac{1}{1 + e^{- ( b + {\boldsymbol w}{\boldsymbol x} ) }} \end{align*}$

This means that, given $x_{ij} \; (j=1, \ldots, m)$, the event occurs with a probability determined by this linear combination. The expression for the probability is called the logistic function.

The logistic function is also called the sigmoid function. Its value lies in (0, 1), which is why it is used here as a probability.

(5) $\begin{equation*} {\rm logistic} (t) = \sigma (t) = \frac{1}{1 + e^{-t}} = p \end{equation*}$

The inverse of the logistic function is called the logit function.

(6) $\begin{equation*} {\rm logit} (p) = \ln \frac{p}{1 - p} = t \end{equation*}$

The argument of the logarithm in the logit function has the form of odds.
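Equations (5) and (6) can be checked with a few lines of plain Python. The function names below are chosen for illustration; the sketch confirms that the logit undoes the logistic function and that logit(p) is the log of the odds p/(1 − p).

```python
import math

def logistic(t):
    """Equation (5): maps any real t into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Equation (6): log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

for t in (-3.0, 0.0, 1.3):
    p = logistic(t)
    print(t, p, logit(p))   # logit(logistic(t)) recovers t

# Odds of 4 to 1 correspond to p = 0.8, so logit(0.8) = ln 4.
print(logit(0.8), math.log(4.0))
```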

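Maximum likelihood estimation

Suppose we have n×m explanatory values $x_{ij} \; (i=1, \ldots , n; \; j=1, \ldots, m)$ and target data $y_i \; (1 \; \textrm{or} \; 0)$.

The probability of observing this particular pattern of targets, assuming the observations are independent, is the product of the individual probabilities:

(7) $\begin{equation*} L = \prod_{i=1}^n \Pr (y_i = 1) ^{y_i} \Pr (y_i = 0) ^{1 - y_i} \end{equation*}$

This probability is the likelihood function for the parameters $b , w_1, \ldots, w_m$, and its log-likelihood is as follows.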

(8) $\begin{align*} \ln L &= \sum_{i=1}^n \left( y_i \ln \Pr(y_i = 1) + (1 - y_i) \ln \Pr(y_i = 0) \right) \\ &= \sum_{i=1}^n \left( y_i \ln \Pr(y_i = 1) + (1 - y_i) \ln \left(1 - \Pr(y_i = 1) \right) \right) \\ \end{align*}$

Here,

(9) $\begin{align*} \ln \left( 1 - \Pr(y_i = 1) \right) &= \ln \left(1 - \frac{1}{1 + e^{-(b + {\boldsymbol w}{\boldsymbol x})}} \right) \\ &= \ln \frac{e^{-(b + {\boldsymbol w}{\boldsymbol x})}}{1 + e^{-(b + {\boldsymbol w}{\boldsymbol x})}} \\ &= \ln e^{-(b + {\boldsymbol w}{\boldsymbol x})} + \ln \Pr (y_i = 1 ) \\ &= -b - {\boldsymbol w}{\boldsymbol x} + \ln \Pr (y_i = 1 ) \end{align*}$

Using this, the log-likelihood can be rewritten as follows.

(10) $\begin{align*} \ln L &= \sum_{i=1}^n \left( y_i \ln \Pr(y_i = 1) + (1 - y_i) \left( -b - {\boldsymbol w}{\boldsymbol x} + \ln \Pr (y_i = 1 ) \right) \right) \\ &= \sum_{i=1}^n \left( \ln \Pr (y_i = 1 ) + (1 - y_i) (-b - {\boldsymbol w}{\boldsymbol x}) \right) \\ &= \sum_{i=1}^n \left( \ln \frac{1}{1 + e^{-(b + {\boldsymbol w}{\boldsymbol x})}} + (1 - y_i) (-b - {\boldsymbol w}{\boldsymbol x}) \right) \\ &= \sum_{i=1}^n \left( -\ln \left( 1 + e^{-(b + {\boldsymbol w}{\boldsymbol x})} \right) + (1 - y_i) (-b - {\boldsymbol w}{\boldsymbol x}) \right) \\ \end{align*}$

Finding the b and ${\boldsymbol w}$ that maximize this log-likelihood for the given data is normally done numerically.
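As a sanity check on the algebra, the sketch below evaluates the log-likelihood both in the original form (8) and in the simplified form (10) for made-up data and parameters. The function names, the data, and the parameter values are all assumptions for illustration.

```python
import numpy as np

def log_likelihood_eq8(b, w, X, y):
    """Equation (8): sum of y*ln(p) + (1 - y)*ln(1 - p)."""
    p = 1.0 / (1.0 + np.exp(-(b + X @ w)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_likelihood_eq10(b, w, X, y):
    """Equation (10): sum of -ln(1 + e^{-z}) + (1 - y)*(-z), with z = b + w.x."""
    z = b + X @ w
    return np.sum(-np.log1p(np.exp(-z)) + (1 - y) * (-z))

# Hypothetical data: 5 samples, 2 features.
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [2.0, 2.0], [-0.2, -1.2]])
y = np.array([1.0, 0.0, 0.0, 1.0, 1.0])
b, w = 0.3, np.array([0.8, -0.4])

ll8 = log_likelihood_eq8(b, w, X, y)
ll10 = log_likelihood_eq10(b, w, X, y)
print(ll8, ll10)
```

The two printed values should agree to rounding error, and both are negative because every term is the logarithm of a probability.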

The one-variable case

Formulation

The linear expression is

(11) $\begin{equation*} z = b + w x \end{equation*}$

The probability expressed with the logistic function is

(12) $\begin{equation*} \Pr (y=1) = p = \frac{1}{1 + e^{-(b + w x)}} \end{equation*}$

Given n data points $(x_i, y_i)$, the likelihood function for this probability model is

(13) $\begin{equation*} L(b, w) = \prod_{i=1}^n \left( \frac{1}{1 + e^{-(b + wx_i)}} \right)^{y_i} \left( 1 - \frac{1}{1 + e^{-(b + wx_i)}} \right)^{1 - y_i} \end{equation*}$

Its log-likelihood is

(14) $\begin{equation*} \ln L(b, w) = \sum_{i=1}^{n} \left( y_i \ln \frac{1}{1 + e^{-b - w x_i}} + (1 - y_i) \ln \left( 1 - \frac{1}{1 + e^{-b - w x_i}} \right) \right) \end{equation*}$

Given a dataset $(x_i, y_i)$, the task is to compute the b and w that maximize this log-likelihood.
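One simple way to carry out this maximization is gradient ascent on the log-likelihood (14). The sketch below is a minimal version under stated assumptions: the ages and choices are made up (not the data of this post), the feature is standardized so that a fixed step size is safe, and the gradient uses $\partial \ln L / \partial b = \sum (y_i - p_i)$ and $\partial \ln L / \partial w = \sum (y_i - p_i) x_i$. It is not the algorithm scikit-learn uses.

```python
import numpy as np

# Hypothetical ages and choices (1 = spa, 0 = leisure park).
ages = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65], dtype=float)
chose_spa = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)

# Standardize the feature so that a fixed step size of 1 is safe.
mean, std = ages.mean(), ages.std()
x = (ages - mean) / std

b_std, w_std = 0.0, 0.0
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(b_std + w_std * x)))
    # Gradient of the mean log-likelihood with respect to b and w.
    grad_b = np.mean(chose_spa - p)
    grad_w = np.mean((chose_spa - p) * x)
    b_std += grad_b
    w_std += grad_w

# Convert back to the original age scale.
w = w_std / std
b = b_std - w_std * mean / std
x_border = -b / w                # age at which the probability is 0.5
print(b, w, x_border)
```

The made-up labels are mirror-symmetric about the mean age of 42.5, so the fitted 0.5 threshold should land there, which gives an easy check on the result.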

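The LogisticRegression model

The graph at the top of this post is the result of applying scikit-learn's LogisticRegression model to the spa dataset; the code is below.

solver='lbfgs' is passed to the LogisticRegression constructor because, as of April 2020, omitting it produces a warning. There are several optimization algorithms, and the choice apparently depends on the data size and on whether the problem is treated as multinomial or one-vs-rest.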

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

path = os.path.join(os.path.dirname(__file__), "data_spa.csv")
df = pd.read_csv(path)

X = np.array(df['age']).reshape(-1, 1)
y = df['is_spa']

logreg = LogisticRegression(C=1e5, solver='lbfgs')
logreg.fit(X, y)
b = logreg.intercept_[0]
w = logreg.coef_[0][0]
x_border = 0.0 - b / w
print("intercept  : {}".format(b))
print("coefficient: {}".format(w))
print("x_border   : {}".format(x_border))

xl = 0
xr = 100

x_graph = np.linspace(xl, xr)
y_graph = 1 / (1 + np.exp(- b - w * x_graph))

fig, ax = plt.subplots()

ax.scatter(X, y, c='tab:blue')
ax.plot(x_graph, y_graph, c='tab:orange')
ax.plot([x_border, x_border], [0, 1], c='k', linestyle='dashed')

ax.set_xlim(xl, xr)
ax.set_ylim(-0.1, 1.1)
ax.set_yticks(np.linspace(0, 1, 11))
ax.grid(True)

plt.show()

# intercept  : -13.44995797571776
# coefficient: 0.23116842117614878
# x_border   : 58.18250566961732

The same dataset solved with Excel is covered in the post "Logistic regression: one variable, solved with Excel".

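Regularization

The constructor argument C of LogisticRegression sets the regularization as a positive real number (C is the inverse of the regularization strength). Larger values mean weaker regularization; smaller values mean stronger regularization.

Varying C on the spa dataset, the curve barely changes for values from about 1000 upward, which is effectively no regularization. The default C=1 (red line) looks like fairly strong regularization.

Interestingly, these curves cross at almost a single point, where the probability is close to 0.6. That would mean the age and probability at that point do not depend on the degree of regularization, but what this signifies is not yet clear to me.

Meanwhile, the age corresponding to probability 0.5 decreases as regularization gets stronger; using it as the decision threshold, C=0.001 would select the spa for almost all ages.

As C becomes smaller still, the probability curve flattens further, and at C=1e-6 it is 0.5 across all ages; choosing between spa and leisure park becomes a coin toss regardless of age.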
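The flattening of the curve with small C can be reproduced with a hand-written solver. The sketch below minimizes $\frac{1}{2} w^2 + C \sum_i \left( \ln(1 + e^{z_i}) - y_i z_i \right)$ by Newton's method with step halving. Several things here are assumptions: the ages and labels are made up, the intercept is left unpenalized, the feature is standardized, and the function `fit_l2_logistic` is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np

def fit_l2_logistic(x, y, C, n_iter=50):
    """Newton's method for 0.5*w**2 + C * sum(log-loss); returns (b, w)."""
    X = np.column_stack([np.ones_like(x), x])
    P = np.diag([0.0, 1.0])              # the penalty acts on w only, not on b
    theta = np.zeros(2)

    def objective(t):
        z = X @ t
        # log(1 + e^z) - y*z is the negative log-likelihood of one sample
        return 0.5 * t @ P @ t + C * np.sum(np.logaddexp(0.0, z) - y * z)

    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = P @ theta + C * X.T @ (p - y)
        hess = P + C * (X.T * (p * (1 - p))) @ X
        step = np.linalg.solve(hess, grad)
        t = 1.0
        # Halve the step until the objective no longer increases.
        while objective(theta - t * step) > objective(theta) and t > 1e-8:
            t *= 0.5
        theta = theta - t * step
    return theta[0], theta[1]

# Hypothetical ages and choices (1 = spa, 0 = leisure park), standardized.
ages = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65], dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)
x = (ages - ages.mean()) / ages.std()

results = {C: fit_l2_logistic(x, y, C) for C in (100.0, 1.0, 0.01)}
for C, (b, w) in results.items():
    print(C, b, w)
```

The printed slope w shrinks toward zero as C decreases, which corresponds to the increasingly flat probability curves described above.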

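The two-variable case

As a two-variable example, I traced the forge data example from O'Reilly's "Pythonではじめる機械学習" and wrote it up in the following post.

Logistic regression: forge data (from "Pythonではじめる機械学習")

Feature coefficients

According to equation (4), the larger $z_i$ is, the higher the probability that y = 1, and the smaller it is, the higher the probability that y = 0. Assuming a feature $x_i \ge 0$, a positive coefficient $w_i$ increases $z_i$ and a negative one decreases it. If the targets 0 and 1 are assigned to y = 0, 1 and stand for class 0 and class 1, then with a positive coefficient an increase in the feature pushes toward class 1, and with a negative coefficient it pushes toward class 0.

Below is a simple example with two features and two classes. Feature-0 has a positive coefficient, so larger values raise the probability of class 1; Feature-1 has a negative coefficient, so larger values raise the probability of class 0.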

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=100, n_features=2, centers=2, random_state=9)

logreg = LogisticRegression().fit(X, y)
coef = logreg.coef_.reshape(logreg.coef_.size)
f0 = np.linspace(-15, 3)
f1 = (- logreg.intercept_ - coef[0] * f0) / coef[1]
print("intercept:{}".format(logreg.intercept_))
print("coef:{}".format(coef))

fig, ax = plt.subplots()
ax.scatter(X[y==0][:, 0], X[y==0][:, 1], label="Class-0")
ax.scatter(X[y==1][:, 0], X[y==1][:, 1], label="Class-1")
ax.plot(f0, f1, c='g')
ax.set_xlim(-14, 4)
ax.set_ylim(-12, 6)
ax.set_xlabel("Feature-0")
ax.set_ylabel("Feature-1")
ax.legend()
plt.show()

# intercept:[2.11704743]
# coef:[ 0.89424625 -0.66371147]
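The direction of each coefficient's effect can be checked directly with the printed intercept and coefficients. The sketch below plugs them into the logistic function; the helper name `prob_class1` and the sample points are made up for illustration.

```python
import math

# Values printed by the listing above.
intercept = 2.11704743
coef = (0.89424625, -0.66371147)

def prob_class1(f0, f1):
    """Probability of class 1 for a point (Feature-0, Feature-1)."""
    z = intercept + coef[0] * f0 + coef[1] * f1
    return 1.0 / (1.0 + math.exp(-z))

base = prob_class1(-4.0, -2.0)
more_f0 = prob_class1(-3.0, -2.0)   # Feature-0 increased by 1
more_f1 = prob_class1(-4.0, -1.0)   # Feature-1 increased by 1
print(base, more_f0, more_f1)
```

Increasing Feature-0 raises the class 1 probability and increasing Feature-1 lowers it, in line with the signs of the two coefficients.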

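Logistic regression: one variable, solved with Excel

2020-04-04 / tau

Example data

Suppose that for 20 member customers of a travel agency we have each customer's age and whether they chose a hot spring (SPA) or a leisure park (LSR), as below. Which is the better recommendation for a new customer?

Logistic regression for this kind of classification can be done with Python packages, but here the method uses Excel.

The procedure is to define formulas for each customer's likelihood and for the total, based on the categorical variable for the choice and the age, set initial values for the intercept and coefficient, and then use Excel's Solver to find the intercept and coefficient that maximize the likelihood.

The source data are each customer's age, whether the destination chosen was the spa (SPA) or the leisure park (LSR), and a categorical variable that is 1 for the spa and 0 for the leisure park.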

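Preparing the worksheet

Build the following table from this data. The meaning and content of each cell are as follows.

coef: the intercept A and coefficient B of the linear expression; both start at 0 and receive the result of the convergence calculation
intercept: data used for the intercept term, a fixed value of 1 in every row
prob: the probability that is_spa=1 for each customer's age when coef has the values A, B, computed with the logistic function
- The cell contains the formula =1/(1+EXP(-$A*Y-$B*Z))
- $A and $B denote absolute references, so these cells are used for every data row
LH: the log-likelihood contribution of each is_spa value
- The cell contains the formula X*LN(C)+(1-X)*LN(1-C)
MLE: the sum of LH over all data; after the calculation it holds the maximized log-likelihood for this dataset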
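The prob and LH columns translate directly into Python. In the sketch below the ages and choices are made-up placeholders (the real table is not reproduced here). With the initial values A = B = 0 every probability is 0.5, so the starting value of the MLE cell is 20 × ln 0.5 ≈ −13.86 whatever the ages are.

```python
import math

# Hypothetical placeholder data: 20 ages and choices (not the post's table).
ages = list(range(20, 80, 3))        # 20 values: 20, 23, ..., 77
is_spa = [0] * 10 + [1] * 10

A, B = 0.0, 0.0                      # initial intercept and coefficient

def prob(a, b, age):
    """The prob column: =1/(1+EXP(-$A*Y-$B*Z)) with Y = 1 and Z = age."""
    return 1.0 / (1.0 + math.exp(-a * 1.0 - b * age))

def lh(x, c):
    """The LH column: X*LN(C)+(1-X)*LN(1-C)."""
    return x * math.log(c) + (1 - x) * math.log(1 - c)

mle = sum(lh(x, prob(A, B, age)) for age, x in zip(ages, is_spa))
print(mle)                           # starting value before Solver runs
```

Solver then changes A and B to push this sum upward from its starting value.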

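Convergence calculation

Open Solver at the far right of the Data tab (if it is missing, go to File → Options → Add-ins, press the "設定" (Go) button, and tick the Solver Add-in).

In the Solver Parameters dialog:

Set the objective cell to D above
Set the variable cells to the range A:B above
Select "Max" as the target

Press the "Solve" button to run the convergence calculation; A:B are set to the values that maximize D.

The result in this case is as follows.

coef: -13.6562, 0.234647
MLE: -7.45298

Taking probability 0.5 (linear expression equal to 0) as the threshold between spa and leisure park, the corresponding age is calculated as follows.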

(1) $\begin{equation*} -13.6562 + 0.234647 \times x = 0 \quad \rightarrow \quad x = \frac{13.6562 }{0.234647 } =58.2 \end{equation*}$
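The same arithmetic in Python, using the coefficients reported above; the variable names are chosen for illustration. It also confirms that the logistic function evaluates to 0.5 at the computed age.

```python
import math

intercept, coef = -13.6562, 0.234647   # Solver result reported above

x_border = -intercept / coef           # age where intercept + coef * x = 0
p_border = 1.0 / (1.0 + math.exp(-(intercept + coef * x_border)))

print(round(x_border, 1))              # 58.2
print(p_border)
```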

Applying scikit-learn's LogisticRegression model in Python to the same data (C=1e5) gives the following, which is quite close.

intercept_ = [-13.38993211]
coefficient_ = [0.23015561]

Plotting the function with the obtained coefficients gives the figure below, which shows the logistic curve.

Python – file

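2020-03-30 / tau

I tried to get the current directory with getcwd() from the os package, but it did not work as expected in some cases, such as running directly from Atom.

In such cases __file__ gives the location of the running file. os.path.dirname(__file__) gives the path without the file name, and os.path.basename(__file__) gives the file name only.

Depending on how the script is launched, this can raise an error.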

import os

print("__file__:{}".format(__file__))
print("dirname :{}".format(os.path.dirname(__file__)))
print("basename:{}".format(os.path.basename(__file__)))
print("files   :{}".format(os.listdir(os.path.dirname(__file__))))

When run from Atom

__file__:C:\Users\...\dev\python\packages\os\__file__.py
dirname :C:\Users\...\dev\python\packages\os
basename:__file__.py
files   :[..., '__file__.py']

When changing to the directory containing the script (__file__.py) on the command line and typing the file name directly

C:\Users\...\dev\python\packages\os>__file__.py
__file__:C:\Users\...\dev\python\packages\os\__file__.py
dirname :C:\Users\...\dev\python\packages\os
basename:__file__.py
files   :[..., '__file__.py']

When changing to the directory containing the script (__file__.py) on the command line and running it as "python <script name>"

C:\Users\tomo\...\dev\python\packages\os>python __file__.py
__file__:__file__.py
dirname :
basename:__file__.py
Traceback (most recent call last):
  File "__file__.py", line 6, in <module>
    print("files   :{}".format(os.listdir(os.path.dirname(__file__))))
FileNotFoundError: [WinError 3] 指定されたパスが見つかりません。: ''

Same as above, but with the script specified by a relative path

C:\Users\...\dev\python\packages\os>python ./__file__.py
__file__:C:\Users\...\dev\python\packages\os\__file__.py
dirname :C:\Users\...\dev\python\packages\os
basename:__file__.py
files   :[..., '__file__.py']
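One common way to avoid the empty dirname seen in the "python __file__.py" case is to make the path absolute before taking its directory. The sketch below is a minimal illustration; `script_dir` is a hypothetical helper, and inside a real script it would be called as `script_dir(__file__)`.

```python
import os

def script_dir(file_path):
    """Return the absolute directory of file_path, even for a bare file name."""
    return os.path.dirname(os.path.abspath(file_path))

# A bare name like this is what __file__ held in the failing case above.
bare = "__file__.py"
print(repr(os.path.dirname(bare)))   # '' (empty, so os.listdir('') fails)
print(script_dir(bare))              # resolved against the current directory
```

os.path.abspath() resolves a relative name against the current working directory, which matches the script's location in the failing case above because the script was started from its own directory.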

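An overview of the Boston house prices dataset

2020-03-28 / tau

Overview

The Boston house prices dataset consists of prices of owner-occupied homes and indicators for the area each home belongs to. It is used for models that predict a target value from multiple features.

Distribution of each feature

The dataset provides 13 features and the median house price for 506 areas in Boston; first look at the distribution of each on its own. The last one, MEDV, is the median value of owner-occupied homes (in units of $1000).

The feature CHAS is a dummy variable for whether the area borders the Charles River, taking the values 0/1. Some features have values concentrated in a narrow range, or have many data points far from the bulk.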

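import matplotlib.pyplot as plt
# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this needs an older version
from sklearn.datasets import load_boston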

boston_ds = load_boston()

X = boston_ds.data
y = boston_ds.target
feature_names = boston_ds.feature_names

n_features = X.shape[1]

fig, axs = plt.subplots(3, 5, figsize=(13, 7))
plt.subplots_adjust(hspace=0.4)
axs_1d = axs.reshape(1, -11)[0]

for ax, nf in zip(axs_1d, range(n_features)):
    ax.hist(X[:, nf], ec='k')
    ax.set_title(feature_names[nf])

axs_1d[n_features].hist(y, ec='k')
axs_1d[n_features].set_title("MEDV")

axs_1d[-1].axis('off')

plt.show()
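
The plotting code flattens the 3×5 array of Axes into one dimension with `reshape`. The sketch below uses a plain integer array as a stand-in for the Axes grid (an assumption made purely for illustration) to show how `reshape(1, -1)[0]` works, where `-1` lets NumPy infer the length, and that `ravel()` gives the same ordering.

```python
import numpy as np

# Stand-in for the 3 x 5 grid returned by plt.subplots(3, 5)
grid = np.arange(15).reshape(3, 5)

# reshape(1, -1) yields shape (1, 15); [0] picks the single row
flat_a = grid.reshape(1, -1)[0]

# ravel() flattens directly, in the same row-major order
flat_b = grid.ravel()

print(flat_a.shape)                     # (15,)
print(np.array_equal(flat_a, flat_b))   # True
print(flat_a[13], flat_a[-1])           # 13 14
```

Index 13 is the panel used for MEDV and index -1 is the leftover panel that is switched off.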

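Relationship between each feature and price

Look at the relationship between each of the 13 features and the price in scatter plots.

A comparatively clear relationship appears for RM (rooms per dwelling) and LSTAT (share of lower-status population); for both, the distribution of the feature itself is comparatively "well-shaped".

NOX (NOx concentration) also has a fairly smooth distribution, but the scatter plot hardly shows a strong correlation.

AGE (share of old units) and DIS (distance to employment centres) have one-sided distributions that rise or fall monotonically. The direction of the relationship between feature value and price is roughly as expected, but the scatter is large. Interestingly, for every feature the point density looks higher below a certain MEDV value.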

import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

boston_ds = load_boston()

X = boston_ds.data
target = boston_ds.target
feature_names = boston_ds.feature_names

n_features = X.shape[1]

fig, axs = plt.subplots(3, 5, figsize=(13, 7))
fig.subplots_adjust(hspace=0.4, wspace=0.4)

# Flatten the 2D array of Axes
axs_1d = axs.reshape(1, -1)[0]

# Scatter plot of each feature against the median price
for ax, nf in zip(axs_1d, range(n_features)):
    ax.scatter(X[:, nf], target, s=5)
    ax.set_xlabel(feature_names[nf])
    ax.set_ylabel("MEDV")

# Hide the two unused panels
for ax in axs_1d[-2:]:
    ax.axis('off')

plt.show()
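
Visual impressions such as "comparatively clear" can be backed by a number. This is a minimal sketch, assuming the Pearson correlation coefficient is an acceptable summary. It runs on small made-up arrays rather than the real dataset, and the function name `feature_target_corr` is hypothetical.

```python
import numpy as np


def feature_target_corr(X, y, names):
    """Pearson correlation between each column of X and y (hypothetical helper)."""
    return {name: float(np.corrcoef(X[:, j], y)[0, 1])
            for j, name in enumerate(names)}


# Made-up toy data: column 'a' rises with y, column 'b' falls with y
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([2.0 * y + 1.0, -3.0 * y])

for name, r in feature_target_corr(X, y, ["a", "b"]).items():
    print("{:>2}: {:+.3f}".format(name, r))
```

With the real data, `X`, `y` and `names` would be `boston_ds.data`, `boston_ds.target` and `boston_ds.feature_names`.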

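Relationship between two features and price

For RM and LSTAT, whose individual correlations with price were fairly clear, look at the relationship with price in three dimensions.

Because each correlation is fairly clear, the points form a single band in 3D as well.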

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Importing Axes3D registers the '3d' projection on older matplotlib versions
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import load_boston

boston_ds = load_boston()

# Wrap the features in a DataFrame so columns can be picked by name
X_df = pd.DataFrame(boston_ds.data, columns=boston_ds.feature_names)
x = np.array(X_df['RM'])
y = np.array(X_df['LSTAT'])
z = boston_ds.target

fig = plt.figure(figsize=(12, 4.8))

# Left panel
ax1 = fig.add_subplot(121, projection='3d')
ax1.scatter(x, y, z)
ax1.set_xlabel("RM")
ax1.set_ylabel("LSTAT")
ax1.set_zlabel("MEDV")

# Right panel
ax2 = fig.add_subplot(122, projection='3d')
ax2.scatter(x, y, z)
ax2.set_xlabel("RM")
ax2.set_ylabel("LSTAT")
ax2.set_zlabel("MEDV")

plt.show()

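The Boston house-prices dataset

2020-03-25 / tau

Overview

The Boston house-prices dataset was taken from the StatLib library at Carnegie Mellon University. It consists of owner-occupied home prices and indicators describing the area each home belongs to.

For each of 506 areas in Boston, it gives the median price of owner-occupied homes together with 13 indicators for that area, such as the crime rate and NOx concentration.

This post summarizes how to use the boston data in Python's scikit-learn.

Loading the data and its structure

In Python, the data can be loaded with load_boston() in scikit-learn's datasets module (the function was removed in scikit-learn 1.2, so the code here requires an older version). The data is an object of the Bunch class.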

from sklearn.datasets import load_boston

boston_ds = load_boston()

# Bunch behaves like a dict, so items() yields (key, value) pairs
for key, value in boston_ds.items():
    print("{}:\n{}\n".format(key, value))

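The dataset is structured like a dictionary. It holds 13 features for 506 areas, the prices of owner-occupied homes in those areas in units of $1000, and other data.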

data:
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]

target:
[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 18.4 21.  12.7 14.5 13.2 13.1 13.5 18.9 20.  21.  24.7 30.8 34.9 26.6
 .....
 16.7 12.  14.6 21.4 23.  23.7 25.  21.8 20.6 21.2 19.1 20.6 15.2  7.
  8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
 22.  11.9]

feature_names:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

DESCR:
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

.....

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.


filename:
C:\Users\tomo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv

The keys of the data are as follows.

from sklearn.datasets import load_boston

boston_ds = load_boston()

print(boston_ds.keys())

# dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

Contents of the data

`'data'`: feature dataset

A 2D array that stores 13 indicators for 506 areas as features. The column index corresponds to the feature number.

data:
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]

`'target'`: house prices

Median price of owner-occupied homes in each of the 506 areas, in units of $1000.

target:
[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 18.4 21.  12.7 14.5 13.2 13.1 13.5 18.9 20.  21.  24.7 30.8 34.9 26.6
 .....
 16.7 12.  14.6 21.4 23.  23.7 25.  21.8 20.6 21.2 19.1 20.6 15.2  7.
  8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
 22.  11.9]

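`'feature_names'`: feature names

Names of the 13 features.

CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1: tract bounds the river, 0: otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built before 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2 (Bk is the proportion of Black residents by town)
LSTAT: percentage of lower-status population (%)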

feature_names:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

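`'filename'`: file name

The full path of the CSV file. The first line of the file holds the number of records and the number of features, the second line holds the names of the 13 features and the target house price, and the following 506 rows each hold 13 feature columns and 1 target column. The file does not contain the data corresponding to DESCR.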

'C:...\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv'
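
A file laid out this way (a counts line, a header line, then the records) can be read with the standard `csv` module. The sketch below is an illustration under that assumption. It parses a tiny made-up file with two features held in memory, not the real boston_house_prices.csv, and the names `SAMPLE`, `F1`, `F2` and `read_counted_csv` are hypothetical.

```python
import csv
import io

# Made-up miniature file in the same layout: counts, header, records
SAMPLE = """3,2
F1,F2,MEDV
0.1,10.0,24.0
0.2,20.0,21.6
0.3,30.0,34.7
"""


def read_counted_csv(fileobj):
    """Parse a CSV whose first line is 'n_records,n_features' (hypothetical helper)."""
    reader = csv.reader(fileobj)
    n_records, n_features = (int(v) for v in next(reader)[:2])
    header = next(reader)
    rows = [[float(v) for v in row] for row in reader if row]
    # Split each row into the feature columns and the trailing target column
    data = [row[:n_features] for row in rows]
    target = [row[n_features] for row in rows]
    assert len(rows) == n_records, "record count does not match the first line"
    return header[:n_features], data, target


names, data, target = read_counted_csv(io.StringIO(SAMPLE))
print(names)    # ['F1', 'F2']
print(data[0])  # [0.1, 10.0]
print(target)   # [24.0, 21.6, 34.7]
```

Checking the record count against the first line is a cheap guard against a truncated file.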

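`'DESCR'`: description of the dataset

A description of the dataset. Passing it to print(), as in print(boston_ds['DESCR']), displays it formatted.

Number of records: 506
Attributes: 13 numeric/categorical attributes, plus the median value that is usually used as the target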

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

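Using the data

How to access the data

There are two ways to retrieve each piece of data from the boston dataset.

Access it with a dictionary key (e.g. boston_ds['DESCR'])
Specify the key string as a property (e.g. boston_ds.DESCR)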
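
Both access styles work because a Bunch is a dictionary that also exposes its keys as attributes. The following is a rough sketch of that idea with a made-up class `MiniBunch`. It is not scikit-learn's actual implementation, only an illustration of the behavior.

```python
class MiniBunch(dict):
    """Dictionary whose keys can also be read as attributes (illustrative only)."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


ds = MiniBunch(DESCR="toy description", target=[24.0, 21.6])

# The two access styles return the same object
print(ds["DESCR"])
print(ds.DESCR)
print(ds["target"] is ds.target)
```

Key access is the safer choice when a key might collide with a dict method name such as `keys` or `values`.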

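Getting the feature data for all records

From 'data', the 13 features for the 506 records are obtained as a 2D array with 506 rows and 13 columns. The 13 features correspond to the 13 feature names in 'feature_names'.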

from sklearn.datasets import load_boston

boston_ds = load_boston()

print(boston_ds.data)

# [[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
#  [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
#  [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
#  ...
#  [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
#  [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
#  [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]

Getting the data for a specific feature only

To extract the data of all records for a specific feature, index the array as X[:, n].

from sklearn.datasets import load_boston

boston_ds = load_boston()

features = boston_ds.feature_names
X = boston_ds.data
n_feature = 10

feature = X[:, n_feature]

print("feature name : {}".format(features[n_feature]))
print("feature data :\n{}".format(feature))

# feature name : PTRATIO
# feature data :
# [15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 15.2 15.2 15.2 21.
#  21.  21.  21.  21.  21.  21.  21.  21.  21.  21.  21.  21.  21.  21.
#  21.  21.  21.  21.  21.  21.  21.  19.2 19.2 19.2 19.2 18.3 18.3 17.9
#  ...
#  20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.2 20.1 20.1
#  20.1 20.1 20.1 19.2 19.2 19.2 19.2 19.2 19.2 19.2 19.2 21.  21.  21.
#  21.  21. ]
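
Instead of hard-coding the column number, the index can be looked up from the feature name. This is a small sketch, assuming the feature-name order printed by the dataset. The data matrix is a made-up stand-in with the same number of columns, not the real values.

```python
import numpy as np

# Feature names in the order printed by the dataset
feature_names = np.array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                          'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])

# Stand-in data: 4 records x 13 features, filled with consecutive numbers
X = np.arange(4 * 13, dtype=float).reshape(4, 13)

# Find the column index whose name matches, then slice that column
n_feature = int(np.where(feature_names == 'PTRATIO')[0][0])
feature = X[:, n_feature]

print(n_feature)   # 10
print(feature)     # [10. 23. 36. 49.]
```

Looking the index up by name keeps the code readable and avoids off-by-one mistakes when counting columns by hand.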
