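Confidence Interval for a Population Proportion

2020-06-18 / tau

Let p be the success probability of a Bernoulli trial. When the trial is repeated n times, the number of successes X follows a binomial distribution, whose mean and variance are given below.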

(1) $\begin{align*} E(X) &= np \\ V(X) &= np(1 - p) \end{align*}$

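When the number of trials n is large, the central limit theorem implies that the following random variable approximately follows the standard normal distribution.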

(2) $\begin{equation*} Z = \frac{X - np}{\sqrt{np(1 - p)}} \end{equation*}$

Divide the numerator and denominator by n, and write the proportion observed in the sample as $X/n = \hat{p}$.

(3) $\begin{equation*} Z = \frac{\dfrac{X}{n} - p}{\sqrt{\dfrac{p(1 - p)}{n}}} = \frac{\hat{p} - p}{\sqrt{\dfrac{p(1 - p)}{n}}} \end{equation*}$

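Since Z follows the standard normal distribution, the confidence interval at confidence level α can be written as follows, where $Z(q)$ denotes the q-quantile of the standard normal distribution.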

(4) $\begin{equation*} -Z_\alpha = Z\left( \frac{1 - \alpha}{2} \right) \le \frac{\hat{p} - p}{\sqrt{\dfrac{p(1 - p)}{n}}} \le Z\left( \frac{1 + \alpha}{2} \right) = Z_\alpha \end{equation*}$

Solving this for p gives the following confidence interval.

(5) $\begin{equation*} \hat{p} - Z_\alpha \sqrt{\dfrac{p(1 - p)}{n}} \le p \le \hat{p} + Z_\alpha \sqrt{\dfrac{p(1 - p)}{n}} \end{equation*}$

The bounds of this interval still contain the population proportion p, but when n is large we can approximate $p \approx \hat{p}$ and obtain the following.

(6) $\begin{equation*} \hat{p} - Z_\alpha \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}} \le p \le \hat{p} + Z_\alpha \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}} \end{equation*}$
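
As a minimal sketch of equation (6), the interval can be computed directly with scipy.stats. The function name `proportion_ci` and the inputs (40 successes in 100 trials) are illustrative assumptions, not part of the original post.

```python
import math
from scipy import stats


def proportion_ci(successes, n, conf=0.95):
    """Normal-approximation interval of equation (6) for a proportion."""
    p_hat = successes / n
    # Z_alpha is the (1 + alpha) / 2 quantile of the standard normal distribution.
    z = stats.norm.ppf((1 + conf) / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


lower, upper = proportion_ci(successes=40, n=100, conf=0.95)
print(f"{lower:.3f} <= p <= {upper:.3f}")
# 0.304 <= p <= 0.496
```

Because the bounds use the observed proportion in place of p, the interval is only as good as that approximation, which is what the simulation in this post examines.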

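To check this, I ran Bernoulli trials with population proportions from 0 to 1.0 while varying the number of repetitions n, and computed the mean and standard deviation of the observed proportion.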

import numpy as np
import scipy.stats as stats
import pandas as pd


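def p_trials(n, p, m):
    # Repeat m experiments of n Bernoulli(p) trials each, and return the mean
    # and the standard deviation (ddof=1) of the observed proportions.
    observed = []
    for _ in range(m):
        # A uniform random number below p counts as a success.
        x = stats.uniform.rvs(size=n)
        observed.append(len(x[x < p]) / n)
    return np.mean(observed), np.std(observed, ddof=1)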


np.random.seed(0)

p_list = np.arange(0, 1.1, 0.1)
n_list = [10, 20, 30, 50, 100, 1000]
n_trials = 100

mean_results = np.empty((len(p_list), len(n_list)))
std_results = np.empty((len(p_list), len(n_list)))

for cp, p in enumerate(p_list):
    for cn, n in enumerate(n_list):
        mean, std = p_trials(n, p, n_trials)
        mean_results[cp, cn] = mean
        std_results[cp, cn] = std

pd.options.display.precision = 3

df_mean = pd.DataFrame(mean_results, columns=n_list)
df_mean["p"] = p_list
columns = ["p"] + n_list
df_mean = df_mean.loc[:, columns]

df_std = pd.DataFrame(std_results, columns=n_list)
df_std["p"] = p_list
columns = ["p"] + n_list
df_std = df_std.loc[:, columns]

print(df_mean)
print(df_std)

First, the mean of the observed proportion is already reasonably accurate at n = 10 and changes little with the number of trials.

      p     10     20     30     50    100   1000
0   0.0  0.000  0.000  0.000  0.000  0.000  0.000
1   0.1  0.093  0.102  0.105  0.099  0.097  0.101
2   0.2  0.215  0.194  0.196  0.208  0.206  0.203
3   0.3  0.328  0.287  0.295  0.297  0.299  0.299
4   0.4  0.393  0.384  0.394  0.396  0.407  0.399
5   0.5  0.494  0.491  0.514  0.494  0.497  0.498
6   0.6  0.596  0.609  0.605  0.592  0.598  0.600
7   0.7  0.695  0.714  0.704  0.698  0.694  0.700
8   0.8  0.811  0.807  0.799  0.791  0.793  0.798
9   0.9  0.910  0.904  0.887  0.898  0.903  0.902
10  1.0  1.000  1.000  1.000  1.000  1.000  1.000

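Next, look at the standard deviation of the observed proportion (the square root of the unbiased variance). The spread is largest when the population proportion is close to 1/2, and it shrinks as the number of trials n grows. In practice, n = 50 to 100 may give a spread small enough to use the observed proportion in place of the population proportion.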

      p     10     20     30     50    100   1000
0   0.0  0.000  0.000  0.000  0.000  0.000  0.000
1   0.1  0.090  0.067  0.061  0.041  0.029  0.010
2   0.2  0.120  0.092  0.083  0.053  0.038  0.011
3   0.3  0.162  0.103  0.090  0.068  0.043  0.013
4   0.4  0.145  0.110  0.079  0.074  0.049  0.016
5   0.5  0.148  0.105  0.094  0.060  0.048  0.016
6   0.6  0.150  0.124  0.102  0.069  0.047  0.016
7   0.7  0.127  0.106  0.084  0.060  0.042  0.015
8   0.8  0.117  0.098  0.065  0.052  0.036  0.012
9   0.9  0.089  0.060  0.056  0.043  0.030  0.010
10  1.0  0.000  0.000  0.000  0.000  0.000  0.000
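
For comparison with the simulated table above, the theoretical standard deviation of the observed proportion is $\sqrt{p(1 - p)/n}$, which follows from V(X) = np(1 − p). The sketch below tabulates it for the same p and n; the helper name `theoretical_std` is an assumption introduced here.

```python
import numpy as np


def theoretical_std(p, n):
    """Standard deviation of the observed proportion X / n for X ~ B(n, p)."""
    return np.sqrt(p * (1 - p) / n)


p_values = np.linspace(0, 1, 11)
n_values = [10, 20, 30, 50, 100, 1000]

# Same layout as the simulated table: rows are p, columns are n.
print("   p " + "".join(f"{n:>7}" for n in n_values))
for p in p_values:
    row = "".join(f"{theoretical_std(p, n):7.3f}" for n in n_values)
    print(f"{p:4.1f} " + row)
```

The simulated values scatter around these figures because each cell is based on only 100 repetitions.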

The following is a graph of the observed proportion for B(n, 0.5) as n varies; again, the spread drops sharply up to about n = 50.

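Confidence Interval for the Population Variance and Standard Deviation: Chi-Squared Distribution

2020-06-16 / tau

Overview

When a population follows a normal distribution with variance σ², and a sample of size n drawn from it has unbiased variance s², the following χ² follows a chi-squared distribution with n−1 degrees of freedom.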

(1) $\begin{equation*} \chi^2 = \frac{(n - 1) s^2}{\sigma^2} \end{equation*}$

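This fact is used to estimate a confidence interval for the population variance.

Procedure

Compute the unbiased variance s² from n samples drawn from the population.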

(2) $\begin{equation*} s^2 = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \overline{x} )^2 \end{equation*}$

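Choose the intended confidence level α and find the χ² values for n−1 degrees of freedom. The χ² distribution is asymmetric, so a two-sided confidence interval needs both the left value χ²((1−α)/2; n−1) and the right value χ²((1+α)/2; n−1).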

(3) $\begin{align*} {\chi^2}_- &= \chi^2\left(\frac{1 - \alpha}{2}; n - 1 \right) \\ {\chi^2}_+ &= \chi^2\left(\frac{1 + \alpha}{2}; n - 1 \right) \end{align*}$

Use these to set up the confidence interval.

(4) $\begin{equation*} {\chi^2}_- \le \frac{(n - 1) s^2}{\sigma^2} \le {\chi^2}_+ \end{equation*}$

Rearranging this for σ² gives the confidence interval for the population variance.

(5) $\begin{equation*} \frac{(n - 1) s^2}{{\chi^2}_+} \le \sigma^2 \le \frac{(n - 1) s^2}{{\chi^2}_-} \end{equation*}$
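
The procedure can be sketched in a few lines with scipy.stats.chi2.ppf. The function name `variance_ci` is an assumption, and the inputs s² = 56.73 and n = 10 are only illustrative.

```python
import math
from scipy import stats


def variance_ci(s2, n, conf=0.95):
    """Confidence interval for the population variance from equation (5)."""
    df = n - 1
    chi2_lower = stats.chi2.ppf((1 - conf) / 2, df=df)
    chi2_upper = stats.chi2.ppf((1 + conf) / 2, df=df)
    # The larger chi-squared value gives the lower bound of the variance.
    return df * s2 / chi2_upper, df * s2 / chi2_lower


var_lo, var_hi = variance_ci(s2=56.73, n=10)
print(f"{var_lo:.2f} <= variance <= {var_hi:.2f}")
print(f"{math.sqrt(var_lo):.2f} <= std <= {math.sqrt(var_hi):.2f}")
```

Taking square roots of the variance bounds gives the interval for the standard deviation, since the square root is monotonic.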

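Example

From the 2017 National Health and Nutrition Survey data on height and weight published on e-Stat, the mean height of 171.2 cm and standard deviation of 6.0 cm for Japanese people in their 40s are used as the population parameters (the data cover 374 people).

Ten random numbers drawn from a normal distribution with these parameters are shown below.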

180.9 167.5 168.  164.8 176.4 157.4 181.7 166.6 173.1 169.7

The unbiased variance of these data is 56.73. Together with the sample size of 10, this gives the following χ² statistic.

(6) $\begin{equation*} \chi^2 = \frac{(n - 1) s^2}{\sigma^2} = \frac{9 \times 56.73}{\sigma^2} = \frac{510.57}{\sigma^2} \end{equation*}$

The two-sided chi-squared values for a probability of 95% are obtained as follows.

(7) $\begin{align*} {\chi^2}_- &= \chi^2(0.025; 9) = 2.7\\ {\chi^2}_+ &= \chi^2(0.975; 9) = 19.02 \end{align*}$

These give the interval for the χ² statistic.

(8) $\begin{equation*} 2.7 \le \frac{510.57}{\sigma^2} \le 19.02 \end{equation*}$

Rearranging gives the confidence intervals for σ² and σ.

(9) $\begin{gather*} \frac{510.57}{19.02} \le \sigma^2 \le \frac{510.57}{2.7} \\ 26.84 \le \sigma^2 \le 189.1 \\ 5.18 \le \sigma \le 13.75 \end{gather*}$

Note that the unbiased variance s² = 56.73 and its square root s = 7.53 lie not at the center of the confidence interval but well toward the left.

(10) $\begin{align*} &\frac{56.73 - 26.84}{189.1 - 26.84} \approx 0.184 \\ &\frac{7.53 - 5.2}{13.7 - 5.2} \approx 0.274 \end{align*}$

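This comes from the asymmetry of the chi-squared probability density. If the same unbiased variance had been obtained from 100 data points, the chi-squared density would be closer to symmetric, and the estimate should lie nearer the center of the interval. First, the χ² statistic for n = 100 is as follows.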

(11) $\begin{equation*} \chi^2 = \frac{99 \times 56.73}{\sigma^2} \approx \frac{5616}{\sigma^2} \end{equation*}$

The two-sided chi-squared values for a probability of 95% are obtained as follows.

(12) $\begin{align*} {\chi^2}_- &= \chi^2(0.025; 99) = 73.36\\ {\chi^2}_+ &= \chi^2(0.975; 99) = 128.42 \end{align*}$

The confidence intervals for σ² and σ are as follows.

(13) $\begin{gather*} 73.36 \le \frac{5616}{\sigma^2} \le 128.42 \\ \frac{5616}{128.42} \le \sigma^2 \le \frac{5616}{73.36} \\ 43.73 \le \sigma^2 \le 76.55 \\ 6.61 \le \sigma \le 8.75 \end{gather*}$

Looking at where the unbiased variance s² = 56.73 and its square root s = 7.53 fall within the confidence interval, they are now closer to the center.

(14) $\begin{align*} &\frac{56.73 - 43.73}{76.55 - 43.73} \approx 0.396 \\ &\frac{7.53 - 6.61}{8.75 - 6.61} \approx 0.430 \end{align*}$
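
The drift of the estimate toward the center can be checked for any sample size with a small sketch. The helper `relative_position` is a hypothetical name introduced here; it returns 0 at the lower end of the variance interval and 1 at the upper end.

```python
from scipy import stats


def relative_position(s2, n, conf=0.95):
    """Position of s2 inside the variance interval (0 = lower end, 1 = upper end)."""
    df = n - 1
    lower = df * s2 / stats.chi2.ppf((1 + conf) / 2, df=df)
    upper = df * s2 / stats.chi2.ppf((1 - conf) / 2, df=df)
    return (s2 - lower) / (upper - lower)


for n in [10, 30, 100, 1000]:
    print(n, round(relative_position(56.73, n), 3))
```

The position depends only on n and the confidence level, not on the value of s², because both bounds scale with s².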

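Confidence interval versus sample size

The confidence interval of the standard deviation behaves as follows as the sample size increases. Relative to the population standard deviation, the upper part of the interval is wider and the lower part is narrower. The difference shrinks as the sample size grows, but a slight imbalance remains.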

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

np.random.seed(0)

h_pop_mean = 171.2
h_pop_std = 6
h_pop_var = h_pop_std**2

sample_size_list = range(10, 200)
prob_lower = 0.025
prob_upper = 0.975

fig, ax = plt.subplots()

std_cil_list = []
std_cir_list = []
for n in sample_size_list:
    h_smp = stats.norm.rvs(loc=h_pop_mean, scale=h_pop_std, size=n)
    uvar = np.var(h_smp, ddof=1)
    chil = stats.chi2.ppf(prob_lower, df=n-1)
    chir = stats.chi2.ppf(prob_upper, df=n-1)
    std_cil_list.append(np.sqrt((n - 1) * uvar / chir))
    std_cir_list.append(np.sqrt((n - 1) * uvar / chil))

ax.plot(sample_size_list, std_cil_list)
ax.plot(sample_size_list, std_cir_list)
ax.plot(sample_size_list, [h_pop_std]*len(sample_size_list))

ax.set_xlabel("number of samples")
ax.set_ylabel("STD of height(cm)")
ax.set_title("Confidence Interval of STD")

plt.show()

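Chi-Squared Distribution (χ² Distribution)

2020-06-16 / tau / 4 comments

Overview

When X₁, …, X_k are independent random variables that each follow the standard normal distribution, the following statistic follows a chi-squared distribution with k degrees of freedom.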

(1) $\begin{equation*} Z = \sum_{i=1}^k {X_i}^2 \end{equation*}$

Probability density function

For x ≥ 0 the density takes the following form, where Γ is the gamma function.

(2) $\begin{equation*} f(x; k) = \frac{1}{2^{\frac{k}{2}} \Gamma \left(\dfrac{k}{2} \right)} x^{\frac{k}{2} - 1} e^{-\frac{x}{2}} \end{equation*}$

The chi-squared distribution with k degrees of freedom has mean k and variance 2k.
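
As a quick check, the density of equation (2) can be written out by hand and compared with scipy.stats.chi2. The function name `chi2_pdf` and the chosen values of k are assumptions for illustration.

```python
import numpy as np
from scipy import stats
from scipy.special import gamma


def chi2_pdf(x, k):
    """Chi-squared density written out as in equation (2), for x > 0."""
    return x ** (k / 2 - 1) * np.exp(-x / 2) / (2 ** (k / 2) * gamma(k / 2))


x = np.linspace(0.1, 30, 300)
for k in [1, 2, 5, 10]:
    # The hand-written formula should agree with scipy's implementation.
    assert np.allclose(chi2_pdf(x, k), stats.chi2.pdf(x, df=k))
    # Mean and variance reported by scipy: k and 2k.
    print(k, stats.chi2.mean(df=k), stats.chi2.var(df=k))
```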

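Relationship between degrees of freedom and the distribution

The chi-squared probability density for varying degrees of freedom k is shown below.

χ² distribution table

Because the chi-squared distribution is asymmetric, the values for left-tail and right-tail probabilities must be obtained separately. Following scipy.stats.chi2.ppf(), the table below lists, for each number of degrees of freedom (rows), the χ² value at which the cumulative probability equals the probability in the top row.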

	0.005	0.01	0.025	0.05	0.1	0.9	0.95	0.975	0.99	0.995
5	0.412	0.554	0.831	1.145	1.610	9.236	11.070	12.833	15.086	16.750
6	0.676	0.872	1.237	1.635	2.204	10.645	12.592	14.449	16.812	18.548
7	0.989	1.239	1.690	2.167	2.833	12.017	14.067	16.013	18.475	20.278
8	1.344	1.646	2.180	2.733	3.490	13.362	15.507	17.535	20.090	21.955
9	1.735	2.088	2.700	3.325	4.168	14.684	16.919	19.023	21.666	23.589
10	2.156	2.558	3.247	3.940	4.865	15.987	18.307	20.483	23.209	25.188
11	2.603	3.053	3.816	4.575	5.578	17.275	19.675	21.920	24.725	26.757
12	3.074	3.571	4.404	5.226	6.304	18.549	21.026	23.337	26.217	28.300
13	3.565	4.107	5.009	5.892	7.042	19.812	22.362	24.736	27.688	29.819
14	4.075	4.660	5.629	6.571	7.790	21.064	23.685	26.119	29.141	31.319
15	4.601	5.229	6.262	7.261	8.547	22.307	24.996	27.488	30.578	32.801
16	5.142	5.812	6.908	7.962	9.312	23.542	26.296	28.845	32.000	34.267
17	5.697	6.408	7.564	8.672	10.085	24.769	27.587	30.191	33.409	35.718
18	6.265	7.015	8.231	9.390	10.865	25.989	28.869	31.526	34.805	37.156
19	6.844	7.633	8.907	10.117	11.651	27.204	30.144	32.852	36.191	38.582
20	7.434	8.260	9.591	10.851	12.443	28.412	31.410	34.170	37.566	39.997
30	13.787	14.953	16.791	18.493	20.599	40.256	43.773	46.979	50.892	53.672
40	20.707	22.164	24.433	26.509	29.051	51.805	55.758	59.342	63.691	66.766
50	27.991	29.707	32.357	34.764	37.689	63.167	67.505	71.420	76.154	79.490
60	35.534	37.485	40.482	43.188	46.459	74.397	79.082	83.298	88.379	91.952
70	43.275	45.442	48.758	51.739	55.329	85.527	90.531	95.023	100.425	104.215
80	51.172	53.540	57.153	60.391	64.278	96.578	101.879	106.629	112.329	116.321
90	59.196	61.754	65.647	69.126	73.291	107.565	113.145	118.136	124.116	128.299
100	67.328	70.065	74.222	77.929	82.358	118.498	124.342	129.561	135.807	140.169

These values were computed with Python's scipy.stats.chi2.

import numpy as np
import scipy.stats as stats

probs = np.array([0.005, 0.01, 0.025, 0.05, 0.1, 0.9, 0.95, 0.975, 0.99, 0.995])

fmt_header = "{0:>2}{1[0]:>7}{1[1]:>7}{1[2]:>7}{1[3]:>7}{1[4]:>7}" \
                   "{1[5]:>7}{1[6]:>7}{1[7]:>7}{1[8]:>7}{1[9]:>7}"
fmt_data = "{0:2d}{1[0]:7.3f}{1[1]:7.3f}{1[2]:7.3f}{1[3]:7.3f}{1[4]:7.3f}" \
                 "{1[5]:7.3f}{1[6]:7.3f}{1[7]:7.3f}{1[8]:7.3f}{1[9]:7.3f}"

print(fmt_header.format(" ", probs))
for df in range(5, 21):
    print(fmt_data.format(df, stats.chi2.ppf(probs, df=df)))
for df in range(30, 101, 10):
    print(fmt_data.format(df, stats.chi2.ppf(probs, df=df)))

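Confidence Interval for the Population Mean: Unknown Population Variance

2020-06-14 / tau

Overview

This post covers estimating a confidence interval for the population mean when the population variance is unknown.

It uses the fact that the t-value computed from the sample mean, the unbiased variance, and the population mean follows a t distribution. The idea is as follows.

Draw a sample and compute the sample mean $\overline{x}$ and the unbiased variance s²
The t-value, obtained by standardizing the sample mean with the population mean and the unbiased variance, follows a t distribution with n−1 degrees of freedom, where n is the sample size
Set the confidence interval using the t distribution value for n−1 degrees of freedom and confidence level α
Compute the confidence interval for the population mean

Procedure

First, draw n samples x₁, …, x_n from the population and compute their mean and unbiased variance.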

(1) $\begin{align*} \overline{x}_n &= \frac{1}{n}\sum_{i=1}^n x_i \\ {s^2}_n &= \frac{1}{n - 1}\sum_{i=1}^n \left( x_i - \overline{x}_n \right)^2 \end{align*}$

Next, form the following t-value from these quantities.

(2) $\begin{equation*} t = \frac{\overline{X}_n - \mu}{\sqrt{{s^2}_n / n}} \end{equation*}$

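Since this t-value follows a t distribution with n−1 degrees of freedom, set the confidence interval for the intended probability α. A two-sided interval is as follows.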

(3) $\begin{equation*} t\left( p \le \frac{1 - \alpha}{2}; n-1 \right) \le \frac{\overline{X}_n - \mu}{\sqrt{{s^2}_n / n}} \le t\left( p \le \frac{1 + \alpha}{2}; n-1 \right) \end{equation*}$

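Rearrange this into a confidence interval for the mean μ. Because the t distribution is symmetric about zero, the lower quantile equals minus the upper quantile, so both bounds can be written with the upper quantile.

(4) $\begin{equation*} \overline{X}_n - t_{n-1}^{\frac{1+\alpha}{2}} \sqrt{\frac{{s^2}_n}{n}} \le \mu \le \overline{X}_n + t_{n-1}^{\frac{1+\alpha}{2}} \sqrt{\frac{{s^2}_n}{n}} \end{equation*}$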

The t values are computed from the degrees of freedom and the intended probability; examples are tabulated in the t distribution post.
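
The procedure above can be sketched as follows, reading the t value from scipy.stats.t.ppf instead of a table. The function name `mean_ci_unknown_variance` is an assumption, and the inputs (mean 170.6, unbiased variance 56.73, n = 10) are illustrative.

```python
import math
from scipy import stats


def mean_ci_unknown_variance(mean, s2, n, conf=0.95):
    """Two-sided t interval for the population mean."""
    df = n - 1
    # Upper (1 + alpha) / 2 quantile of the t distribution with n - 1 d.o.f.
    t_value = stats.t.ppf((1 + conf) / 2, df=df)
    half_width = t_value * math.sqrt(s2 / n)
    return mean - half_width, mean + half_width


lower, upper = mean_ci_unknown_variance(mean=170.6, s2=56.73, n=10)
print(f"{lower:.1f} <= mu <= {upper:.1f}")
# 165.2 <= mu <= 176.0
```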

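Example

Ten random numbers drawn from a normal distribution with mean 171.2 and standard deviation 6 (the parameters set in the code below) are as follows.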

180.9 167.5 168.  164.8 176.4 157.4 181.7 166.6 173.1 169.7

import numpy as np

np.random.seed(1)

pop_mean = 171.2
pop_std = 6
n_sample = 10

x_sample = np.random.normal(pop_mean, pop_std, n_sample)
np.set_printoptions(precision=1)
print(x_sample)
print("mean = {:5.1f}".format(np.mean(x_sample)))
print("u-var= {:5.2f}".format(np.var(x_sample, ddof=1)))

# [180.9 167.5 168.  164.8 176.4 157.4 181.7 166.6 173.1 169.7]
# mean = 170.6
# u-var= 56.73

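The mean of these data is 170.6 and the unbiased variance is 56.73. From the t distribution table, the t-value for 10 − 1 = 9 degrees of freedom and a two-sided probability of 95% (2.5% in each tail) is 2.262, so the confidence interval for μ is computed as follows.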

(5) $\begin{gather*} 170.6 - 2.262 \sqrt{\frac{56.73}{10}} \le \mu \le 170.6 + 2.262 \sqrt{\frac{56.73}{10}} \\ 165.2 \le \mu \le 176.0 \end{gather*}$

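This interval is wider than in the case of known population variance (166.9 to 174.3). That is natural, since less information is available when the population variance is unknown. In terms of the formula, it follows from the t-value for a given probability being larger than the z-value of the standard normal distribution, and from the unbiased variance of this sample (56.73) being larger than the population variance (36).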

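t Distribution

2020-06-14 / tau

Overview

The t distribution is a continuous probability distribution used in cases such as the following.

Estimating the mean of a normally distributed population when its mean and variance are unknown and the sample size is small
The t-test for the statistical significance of the difference between two means

Let the samples X₁, …, X_n follow a normal distribution with mean μ, and let the sample mean $\overline{X}_n$ and the unbiased variance ${s^2}_n$ be as follows.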

(1) $\begin{align*} \overline{X}_n &= \frac{1}{n} \sum_{i=1}^n X_i \\ {s^2}_n &= \frac{1}{n - 1} \sum_{i=1}^n \left( X_i - \overline{X}_n \right)^2 \end{align*}$

Now consider the following variable (the t-value).

(2) $\begin{equation*} t = \frac{\overline{X}_n - \mu}{\sqrt{{s^2}_n / n}} \end{equation*}$

This t-value is known to follow the probability distribution below with ν = n − 1.

(3) $\begin{equation*} f(t; \nu) = \dfrac{\Gamma \left( \dfrac{\nu + 1}{2}\right) }{\sqrt{\nu \pi} \Gamma \left( {\dfrac{\nu}{2}}\right)} \left( 1 + \dfrac{t^2}{\nu} \right)^{- \dfrac{\nu + 1}{2}} \end{equation*}$

This distribution is called Student's t distribution; Γ is the gamma function.
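
As a sanity check on equation (3), the density can be written out and compared with scipy.stats.t. This is a sketch: the function name `t_pdf` and the chosen degrees of freedom are assumptions for illustration.

```python
import numpy as np
from scipy import stats
from scipy.special import gamma


def t_pdf(t, nu):
    """Student's t density written out as in equation (3)."""
    coef = gamma((nu + 1) / 2) / (np.sqrt(nu * np.pi) * gamma(nu / 2))
    return coef * (1 + t ** 2 / nu) ** (-(nu + 1) / 2)


t = np.linspace(-5, 5, 201)
for nu in [1, 5, 20]:
    # The hand-written formula should agree with scipy's implementation.
    assert np.allclose(t_pdf(t, nu), stats.t.pdf(t, df=nu))
    # Largest gap to the standard normal density; it shrinks as nu grows.
    print(nu, np.max(np.abs(t_pdf(t, nu) - stats.norm.pdf(t))))
```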

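Relationship between degrees of freedom and the distribution

Plot the probability distribution while varying the degrees of freedom ν of the t distribution.

At around 20 degrees of freedom the distribution is already quite close to the standard normal distribution. The t values for which the one-sided probability is 10%, 5%, 2.5%, 1%, and 0.5% at 1 to 20 degrees of freedom are computed below.

t distribution table

The table below lists, for 1 to 20 degrees of freedom, the t values for several one-sided probabilities (the value t such that Pr(T > t) = α).

At about 20 degrees of freedom the shape is quite close to the standard normal distribution, but the values still differ in the second significant digit. Only at about 700 degrees of freedom do the values agree with the standard normal distribution to the third digit.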

ν	0.1	0.05	0.025	0.01	0.005
1	3.078	6.314	12.706	31.821	63.657
2	1.886	2.920	4.303	6.965	9.925
3	1.638	2.353	3.182	4.541	5.841
4	1.533	2.132	2.776	3.747	4.604
5	1.476	2.015	2.571	3.365	4.032
6	1.440	1.943	2.447	3.143	3.707
7	1.415	1.895	2.365	2.998	3.499
8	1.397	1.860	2.306	2.896	3.355
9	1.383	1.833	2.262	2.821	3.250
10	1.372	1.812	2.228	2.764	3.169
11	1.363	1.796	2.201	2.718	3.106
12	1.356	1.782	2.179	2.681	3.055
13	1.350	1.771	2.160	2.650	3.012
14	1.345	1.761	2.145	2.624	2.977
15	1.341	1.753	2.131	2.602	2.947
16	1.337	1.746	2.120	2.583	2.921
17	1.333	1.740	2.110	2.567	2.898
18	1.330	1.734	2.101	2.552	2.878
19	1.328	1.729	2.093	2.539	2.861
20	1.325	1.725	2.086	2.528	2.845
N(0, 1)	1.282	1.645	1.960	2.326	2.576

These values are obtained by calling the t distribution and normal distribution functions in Python's scipy.stats.

import numpy as np
import scipy.stats as stats

probs = np.array([0.1, 0.05, 0.025, 0.01, 0.005])

fmt_header = "{0:>2}{1[0]:>7}{1[1]:>7}{1[2]:>7}{1[3]:>7}{1[4]:>7}"
fmt_data = "{0:2d}{1[0]:7.3f}{1[1]:7.3f}{1[2]:7.3f}{1[3]:7.3f}{1[4]:7.3f}"
fmt_footer = "{0:>2}{1[0]:7.3f}{1[1]:7.3f}{1[2]:7.3f}{1[3]:7.3f}{1[4]:7.3f}"

print(fmt_header.format(" ", probs))
for df in range(1, 21):
    print(fmt_data.format(df, -stats.t.ppf(probs, df=df)))
print()
print(fmt_footer.format("N", -stats.norm.ppf(probs, loc=0, scale=1)))

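Confidence Interval for the Population Mean: Known Population Variance

2020-06-14 / tau / 5 comments

Overview

This post covers estimating a confidence interval for the population mean when the population variance is known.

The idea behind the interval estimate is as follows.

Draw a sample and compute the sample mean $\overline{x}$
With the known variance σ², the sample mean follows the normal distribution N(μ, σ²/n)
Standardize the sample mean with μ and σ²/n, and set the interval for confidence level α on the standard normal distribution
Compute the confidence interval for the population mean μ

Procedure

First, draw n samples x₁, …, x_n from the population and compute their mean.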

(1) $\begin{equation*} \overline{x} = \frac{1}{n}\sum_{i=1}^n x_i \end{equation*}$

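Next, for the variable standardized by the mean and variance, set the confidence interval using the standard normal value z for the intended probability α. A two-sided interval is as follows.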

(2) $\begin{equation*} z\left( p \le \frac{1 - \alpha}{2} \right) \le \frac{\overline{x} - \mu}{\sqrt{\sigma^2 / n}} \le z\left( p \le \frac{1+ \alpha}{2} \right) \end{equation*}$

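Rearrange this into a confidence interval for μ. Because the standard normal distribution is symmetric about zero, the lower quantile equals minus the upper quantile, so both bounds can be written with the upper quantile.

(3) $\begin{align*} \overline{x} - z\left( p \le \frac{1 + \alpha}{2} \right) \sqrt{\frac{\sigma^2}{n}} \le \mu \le \overline{x} + z\left( p \le \frac{1+ \alpha}{2} \right) \sqrt{\frac{\sigma^2}{n}} \end{align*}$

Set the standard normal z for confidence level α and compute the confidence interval for μ. For a two-sided 95% interval, for example, z is 1.96, corresponding to a one-sided probability of 2.5%; see the standard normal table for other values of z.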

(4) $\begin{align*} \overline{x} - 1.96 \sqrt{\frac{\sigma^2}{n}} \le \mu \le \overline{x} + 1.96 \sqrt{\frac{\sigma^2}{n}} \end{align*}$
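
A minimal sketch of this calculation, taking z from scipy.stats.norm.ppf rather than hard-coding 1.96. The function name `mean_ci_known_variance` is an assumption, and the inputs (mean 170.6, σ² = 36, n = 10) are illustrative.

```python
import math
from scipy import stats


def mean_ci_known_variance(mean, sigma2, n, conf=0.95):
    """Two-sided z interval for the population mean with known variance."""
    # Upper (1 + alpha) / 2 quantile of the standard normal distribution.
    z = stats.norm.ppf((1 + conf) / 2)
    half_width = z * math.sqrt(sigma2 / n)
    return mean - half_width, mean + half_width


lower, upper = mean_ci_known_variance(mean=170.6, sigma2=36, n=10)
print(f"{lower:.1f} <= mu <= {upper:.1f}")
# 166.9 <= mu <= 174.3
```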

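Example

Ten random numbers drawn from a normal distribution with mean 171.2 and standard deviation 6 (the parameters set in the code below) are as follows.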

180.9 167.5 168.  164.8 176.4 157.4 181.7 166.6 173.1 169.7

The mean of these data is 170.6. With σ² = 36, a sample size of 10, and 1.96 for the two-sided 95% level, the confidence interval is computed as follows.

(5) $\begin{gather*} 170.6 - 1.96 \sqrt{\frac{36}{10}} \le \mu \le 170.6 + 1.96 \sqrt{\frac{36}{10}} \\ 166.9 \le \mu \le 174.3 \end{gather*}$

[Note] The data above were generated in Python with seed(1).

import numpy as np

np.random.seed(1)

pop_mean = 171.2
pop_std = 6
n_sample = 10

x_sample = np.random.normal(pop_mean, pop_std, n_sample)
np.set_printoptions(precision=1)
print(x_sample)
print("mean = {:5.1f}".format(np.mean(x_sample)))

# [180.9 167.5 168.  164.8 176.4 157.4 181.7 166.6 173.1 169.7]
# mean = 170.6

When I first generated the data with seed(0), the result was as follows, and the 95% confidence interval did not contain the population mean.

[181.8 173.6 177.1 184.6 182.4 165.3 176.9 170.3 170.6 173.7]
mean = 175.6

(6) $\begin{gather*} 175.6 - 1.96 \sqrt{\frac{36}{10}} \le \mu \le 175.6 + 1.96 \sqrt{\frac{36}{10}} \\ 171.9 \le \mu \le 179.3 \end{gather*}$

seed(0) is a commonly used sequence, but since this can happen, it is advisable to try several different random sequences.
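
A 95% interval is expected to miss the population mean about once in twenty samples, so a miss for one particular seed is not surprising. The sketch below counts how often the interval covers the mean over many seeds. It uses numpy's `default_rng` rather than `np.random.seed`, so its samples differ from the ones shown above; the function name `coverage` and the number of seeds are assumptions.

```python
import numpy as np


def coverage(n_seeds, pop_mean=171.2, pop_std=6.0, n_sample=10, z=1.96):
    """Fraction of seeds whose 95% interval contains the population mean."""
    half_width = z * pop_std / np.sqrt(n_sample)
    n_covered = 0
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        sample_mean = rng.normal(pop_mean, pop_std, n_sample).mean()
        # The interval covers the mean when the sample mean is within the half-width.
        if abs(sample_mean - pop_mean) <= half_width:
            n_covered += 1
    return n_covered / n_seeds


print(f"coverage = {coverage(2000):.3f}")
```

The printed fraction should be close to 0.95, which is what the confidence level means.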

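Confidence interval versus sample size

The 95% confidence interval of the mean height as the sample size increases is shown below. The interval width narrows gradually with considerable fluctuation, but beyond a certain sample size no marked reduction in width is seen.

Plotting the $1/\sqrt{n}$ factor that appears in the confidence interval shows why: it falls rapidly up to about n = 20 and only slowly after that. So efforts to narrow the confidence interval pay off only up to about 50 data points.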

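[Addendum]

As pointed out in a comment on this post, this reasoning is not appropriate. The correct approach is to plot the half-width itself, such as $1.96 \sqrt{\sigma^2 / n}$. Thank you for pointing this out.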
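
Following that suggestion, the half-width of the 95% interval can be tabulated directly in centimeters. This is a sketch with σ = 6.0 as in the example; the helper name `half_width` is an assumption.

```python
import numpy as np


def half_width(n, sigma=6.0, z=1.96):
    """Half-width of the two-sided interval for the mean with known sigma."""
    return z * np.sqrt(sigma ** 2 / n)


for n in [10, 20, 50, 100, 200]:
    print(f"n = {n:3d}: half-width = {half_width(n):.2f} cm")
```

Read this way, whether a larger sample is worthwhile depends on the precision needed in centimeters: quadrupling n always halves the half-width.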

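The first graph was computed with the following steps.

Generate normal random numbers from the population mean and standard deviation while varying the sample size
Compute the sample mean for each sample
Compute the upper and lower limits of the confidence interval for the population mean from the sample mean and the population variance, and append them to lists
Plot the results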

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt

rnd.seed(0)

# height of Japanese in 2017(40-49, #374)
hpop_mean = 171.2
hpop_std = 6.0
hpop_var = hpop_std ** 2

sample_size_list = range(10, 200)
z95 = 1.96

fig, ax = plt.subplots()

hsmp_mean_conf_lower95 = []
hsmp_mean_conf_upper95 = []
for nsmp in sample_size_list:
    hsmp = rnd.normal(loc=hpop_mean, scale=hpop_std, size=nsmp)
    hsmp_mean = np.mean(hsmp)
    hsmp_mean_conf_lower95.append(hsmp_mean - z95 * np.sqrt(hpop_var / nsmp))
    hsmp_mean_conf_upper95.append(hsmp_mean + z95 * np.sqrt(hpop_var / nsmp))

ax.plot(sample_size_list, hsmp_mean_conf_upper95,
    linestyle='solid', label="CI(95%)-Upper")
ax.plot(sample_size_list, hsmp_mean_conf_lower95,
    linestyle='dashed', label="CI(95%)-Lower")
ax.plot(sample_size_list, [hpop_mean]*len(sample_size_list),
    linestyle='dashdot', label="Pop mean")

ax.set_xlabel("sample size")
ax.set_ylabel("height(cm)")
ax.set_title("Confidence Interval of Height(40-49)")
ax.legend()

plt.show()

Degrees of Freedom in numpy.var and numpy.std

2020-06-13 / tau

numpy.var and numpy.std return the variance and the standard deviation, respectively, of data given as an array.

numpy.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)

numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False)

These functions take an argument called ddof, which the numpy documentation describes as follows.

ddof : int, optional: Means Delta Degrees of Freedom. The divisor used in calculations is N – ddof, where N represents the number of elements. By default ddof is zero.

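In other words, the variance is computed by dividing by N − ddof. With the default ddof=0, the data are treated as the whole population, and the results are the population variance and the population standard deviation.

With ddof=1, the results are the unbiased variance and its square root.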

import numpy as np

x = np.arange(10)

print(np.var(x))
# 8.25

print(np.var(x, ddof=1))
# 9.166666666666666

n = x.size
print(np.var(x) * n / (n - 1))
# 9.166666666666666

Strictly speaking, however, the square root of the unbiased variance is not an unbiased estimator of the population standard deviation.
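
The size of this bias can be seen in a short simulation. For samples from a normal population, the expected value of s is c4(n)·σ, where c4(n) = √(2/(n−1)) · Γ(n/2) / Γ((n−1)/2) is the standard correction factor. The sketch below assumes a normal population with σ = 6 and n = 10; these values and the function name `c4` are chosen for illustration.

```python
import numpy as np
from scipy.special import gammaln


def c4(n):
    """E[s] / sigma for samples of size n from a normal population."""
    return np.sqrt(2 / (n - 1)) * np.exp(gammaln(n / 2) - gammaln((n - 1) / 2))


rng = np.random.default_rng(0)
n, sigma = 10, 6.0
samples = rng.normal(0.0, sigma, size=(20000, n))
s = np.std(samples, axis=1, ddof=1)  # square root of the unbiased variance

print(f"mean of s / sigma = {s.mean() / sigma:.4f}")
print(f"c4({n})            = {c4(n):.4f}")
```

Both numbers come out slightly below 1, so s underestimates σ on average even though s² is unbiased for σ².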

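Functions for Drawing Decision Boundaries and Class Regions

2020-06-11 / tau

Overview

These are example functions that draw the decision boundary and the class regions in the two-dimensional feature space for a model trained on a dataset with two features.

draw_decision_boundary() draws the decision boundary line, and draw_decision_field() shows the class regions by color.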

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patch
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier


def draw_decision_boundary(clf, ax, x0s, x1s,
        threshold=0, color='k', alpha=1.0):

    y_predicted = np.empty((len(x1s), len(x0s)))

    for row, x1 in enumerate(x1s):
        for col, x0 in enumerate(x0s):
            y_predicted[row, col] = clf.predict(np.array([[x0, x1]]))[0]

    ax.contour(x0s, x1s, y_predicted,
        colors=color, levels=[threshold] , alpha=alpha)


def draw_decision_field(clf, ax, x0s, x1s, n_areas=2,
        colors=['tab:blue', 'tab:orange'], alpha=0.5, fill=True):

    y_predicted = np.empty((len(x1s), len(x0s)))

    for row, x1 in enumerate(x1s):
        for col, x0 in enumerate(x0s):
            y_predicted[row, col] = clf.predict(np.array([[x0, x1]]))[0]
    ax.contourf(x0s, x1s, y_predicted,
        colors=colors, levels=n_areas-1, alpha=alpha)


X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

x0_min, x0_max = -2.0, 2.5
x1_min, x1_max = -1.0, 1.5
x0s = np.linspace(x0_min, x0_max, 50)
x1s = np.linspace(x1_min, x1_max, 50)

fig, axs = plt.subplots(1, 2, figsize=(11, 4.8))
fig.subplots_adjust(wspace=0.3)
axs_1d = axs.reshape(-1)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

for ax in axs_1d:
    ax.scatter(X[y==0][:, 0], X[y==0][:, 1], marker='o')
    ax.scatter(X[y==1][:, 0], X[y==1][:, 1], marker='^')

    ax.set_xlim(x0_min, x0_max)
    ax.set_ylim(x1_min, x1_max)
    ax.set_xlabel("Feature-0")
    ax.set_ylabel("Feature-1")

draw_decision_boundary(knn, axs[0], x0s, x1s, threshold=0.5)
draw_decision_field(knn, axs[1], x0s, x1s, alpha=0.2)

axs[0].set_title("decision boundary")
axs[1].set_title("decision field")

plt.show()

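How to use the functions

Each function by itself requires no special packages (apart from NumPy, which the function bodies use), but several parameters assume objects of particular classes.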

`draw_decision_boundary()`

draw_decision_boundary(clf, ax, x0s, x1s, threshold, color, alpha)

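clf: A trained classifier instance. It must have a predict() method that accepts a two-dimensional array.
ax: The Axes object on which to draw the decision boundary.
x0s, x1s: One-dimensional arrays giving the coordinates of the grid points at which the class is computed.
threshold: The value of the decision boundary, given as a number. The default is 0. When the decision values are class labels (for example 0 and 1), give their midpoint (for example 0.5).
color: The color code of the decision boundary line. The default is 'k' (black).
alpha: The transparency of the boundary line, given as a real number. The default is 1 (opaque).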

`draw_decision_field()`

draw_decision_field(clf, ax, x0s, x1s, n_areas=2, colors, alpha)

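clf: A trained classifier instance. It must have a predict() method that accepts a two-dimensional array.
ax: The Axes object on which to draw the class regions.
x0s, x1s: One-dimensional arrays giving the coordinates of the grid points at which the class is computed.
n_areas: The number of regions, given as an integer. The default is 2 (two regions).
colors: The fill colors of the regions, given as an array of color codes. The default is ['tab:blue', 'tab:orange'].
alpha: The transparency of the fill, given as a real number. The default is 0.5 (semi-transparent).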

Function details

`draw_decision_boundary()`

This uses matplotlib's contour.

def draw_decision_boundary(clf, ax, x0s, x1s,
        threshold=0, color='k', alpha=1.0):

    y_predicted = np.empty((len(x1s), len(x0s)))

    for row, x1 in enumerate(x1s):
        for col, x0 in enumerate(x0s):
            y_predicted[row, col] = clf.predict(np.array([[x0, x1]]))[0]

    ax.contour(x0s, x1s, y_predicted,
        colors=color, levels=[threshold] , alpha=alpha)

`draw_decision_field()`

This uses matplotlib's contourf.

def draw_decision_field(clf, ax, x0s, x1s, n_areas=2,
        colors=['tab:blue', 'tab:orange'], alpha=0.5, fill=True):

    y_predicted = np.empty((len(x1s), len(x0s)))

    for row, x1 in enumerate(x1s):
        for col, x0 in enumerate(x0s):
            y_predicted[row, col] = clf.predict(np.array([[x0, x1]]))[0]
    ax.contourf(x0s, x1s, y_predicted,
        colors=colors, levels=n_areas-1, alpha=alpha)
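
Both functions call predict() once per grid point, which becomes slow on fine grids. As a possible alternative, the whole grid can be predicted in one call with np.meshgrid. In this sketch, `predict_on_grid` and `ThresholdClassifier` are hypothetical names; the latter only stands in for a trained model with a predict() method.

```python
import numpy as np


def predict_on_grid(clf, x0s, x1s):
    """Predict on every grid point with a single predict() call.

    Returns an array of shape (len(x1s), len(x0s)), the layout that
    contour() and contourf() expect for the coordinates x0s and x1s.
    """
    xx0, xx1 = np.meshgrid(x0s, x1s)
    grid = np.column_stack([xx0.ravel(), xx1.ravel()])
    return np.asarray(clf.predict(grid)).reshape(xx0.shape)


class ThresholdClassifier:
    """Hypothetical stand-in for a trained model: class 1 where x0 + x1 > 0."""

    def predict(self, X):
        return (X[:, 0] + X[:, 1] > 0).astype(int)


x0s = np.linspace(-2.0, 2.5, 50)
x1s = np.linspace(-1.0, 1.5, 40)
y_predicted = predict_on_grid(ThresholdClassifier(), x0s, x1s)
print(y_predicted.shape)  # (40, 50)
```

The returned array can be passed to ax.contour() or ax.contourf() in place of the y_predicted array built by the nested loops.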

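Drawing Decision Tree Boundaries

2020-06-10 / tau

Overview

The decision tree section of the book "Pythonではじめる機械学習" (the Japanese edition of "Introduction to Machine Learning with Python") draws the boundaries produced by each successive node split.

The book uses the mglearn package; this is an example that reproduces the figure with my own function.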

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patch
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier


def draw_tree_boundary(tree, ax, left, right, bottom, top,
        i_node=0, stop_level=None, n_level=0):

    if tree.children_left[i_node] == -1 or stop_level == n_level:
        fc =\
            'tab:orange' if np.argmax(tree.value[i_node][0])==0 else 'tab:blue'
        rect = patch.Rectangle(xy=(left, bottom),
            width=right-left, height=top-bottom, fc=fc, alpha=0.2)
        ax.add_patch(rect)
        return

    if tree.feature[i_node] == 0:
        f0 = tree.threshold[i_node]
        ax.plot([f0, f0], [top, bottom])
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=f0, top=top, bottom=bottom,
            i_node=tree.children_left[i_node],
            stop_level=stop_level, n_level=n_level+1,)
        draw_tree_boundary(tree=tree, ax=ax,
            left=f0, right=right, top=top, bottom=bottom,
            i_node=tree.children_right[i_node],
            stop_level=stop_level, n_level=n_level+1)
    else:
        f1 = tree.threshold[i_node]
        ax.plot([left, right], [f1, f1])
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=right, top=f1, bottom=bottom,
            i_node=tree.children_left[i_node],
            stop_level=stop_level, n_level=n_level+1)
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=right, top=top, bottom=f1,
            i_node=tree.children_right[i_node],
            stop_level=stop_level, n_level=n_level+1)


X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

treeclf = \
    DecisionTreeClassifier(max_depth=None, min_samples_leaf=1, random_state=0)
treeclf.fit(X, y)

fig, ax = plt.subplots()

ax.scatter(X[y==0][:, 0], X[y==0][:, 1],
    ec='k', s=60, marker='o', fc='tab:orange', label="Class 0")
ax.scatter(X[y==1][:, 0], X[y==1][:, 1],
    ec='k', s=60, marker='^', fc='tab:blue', label="Class 1")

x0_min, x0_max = -2, 2.5
x1_min, x1_max = -1, 1.5

draw_tree_boundary(tree=treeclf.tree_, i_node=0, ax=ax,
    left=x0_min, right=x0_max, bottom=x1_min, top=x1_max)

ax.set_xlim(x0_min, x0_max)
ax.set_ylim(x1_min, x1_max)
ax.legend()

plt.show()

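Function specification

The arguments of the drawing function draw_tree_boundary() are as follows.

draw_tree_boundary(tree, ax, left, right, bottom, top, i_node=0, stop_level=None, n_level=0)

tree: The tree_ object of the decision tree model to draw.
ax: The target Axes object on which the boundaries are drawn.
left, right, bottom, top: The drawing extent of the current node, in the coordinates of ax.
i_node: The node whose area is drawn. Defaults to 0, which draws the root node (the whole area) and everything below it.
stop_level: The tree depth at which drawing stops. Defaults to None, which draws down to the deepest leaves.
n_level: The current depth, used internally by the recursive calls.

An example call follows. Because stop_level is omitted, the whole tree is drawn, down to the leaf nodes.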

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

treeclf = \
    DecisionTreeClassifier(max_depth=None, min_samples_leaf=1, random_state=0)
treeclf.fit(X, y)

fig, ax = plt.subplots()

ax.scatter(X[y==0][:, 0], X[y==0][:, 1],
    ec='k', s=60, marker='o', fc='tab:orange', label="Class 0")
ax.scatter(X[y==1][:, 0], X[y==1][:, 1],
    ec='k', s=60, marker='^', fc='tab:blue', label="Class 1")

x0_min, x0_max = -2, 2.5
x1_min, x1_max = -1, 1.5

draw_tree_boundary(tree=treeclf.tree_, i_node=0, ax=ax,
    left=x0_min, right=x0_max, bottom=x1_min, top=x1_max)

ax.set_xlim(x0_min, x0_max)
ax.set_ylim(x1_min, x1_max)
ax.legend()

plt.show()
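
The tree argument is the fitted estimator's tree_ attribute, and draw_tree_boundary() reads its children_left, children_right, feature, threshold and value arrays. As a minimal sketch for seeing what those arrays contain, the following prints one line per node. The choice of max_depth=2 is an assumption made only to keep the printout short.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
# max_depth=2 is chosen here only to keep the printout short.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
tree = clf.tree_

n_leaves = 0
for i in range(tree.node_count):
    if tree.children_left[i] == -1:
        # Leaf: no split, so only the class distribution in tree.value matters.
        n_leaves += 1
        print("node {}: leaf, majority class {}".format(
            i, tree.value[i][0].argmax()))
    else:
        # Internal node: samples with feature <= threshold go to the left child.
        print("node {}: split on feature {} at {:.3f} -> children {}, {}".format(
            i, tree.feature[i], tree.threshold[i],
            tree.children_left[i], tree.children_right[i]))
```

Node 0 is the root, which is why i_node defaults to 0, and a child index of -1 marks a leaf.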

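How the function works

The overall flow of the function is as follows.

Starting from the split at the root node, it recursively splits the area and descends to the next level until it reaches a leaf node.
At a leaf node, it fills the area with the color of that node's class and returns to the parent node.
Once every leaf under a node's left child has been processed, it moves on to the right child, and when that is also finished it returns to the parent node.

Tracing the function from the first call:

Called with i_node and stop_level omitted, it draws the whole tree starting from the root node.
If the current node is a leaf (its child index is -1) or the current depth has reached stop_level, it does the following and returns (back to the parent node):
1. Set the face color to tab:orange or tab:blue according to the majority class of the current node
2. Fill the rectangle given by the arguments with that face color
3. Add the filled rectangle to ax
If the current node is not a leaf and the stop depth has not been reached, it does the following according to the feature that splits the current node, then returns (back to the parent node):
1. If the node splits on feature 0
  1. Draw a boundary line from the top to the bottom of the area at the threshold value of feature 0
  2. Process the left child with the left-hand area
  3. On return, process the right child with the right-hand area
2. If the node splits on feature 1
  1. Draw a boundary line from the left to the right of the area at the threshold value of feature 1
  2. Process the left child with the lower area
  3. On return, process the right child with the upper area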

def draw_tree_boundary(tree, ax, left, right, bottom, top,
        i_node=0, stop_level=None, n_level=0):

    if tree.children_left[i_node] == -1 or stop_level == n_level:
        fc =\
            'tab:orange' if np.argmax(tree.value[i_node][0])==0 else 'tab:blue'
        rect = patch.Rectangle(xy=(left, bottom),
            width=right-left, height=top-bottom, fc=fc, alpha=0.2)
        ax.add_patch(rect)
        return

    if tree.feature[i_node] == 0:
        f0 = tree.threshold[i_node]
        ax.plot([f0, f0], [top, bottom])
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=f0, top=top, bottom=bottom,
            i_node=tree.children_left[i_node],
            stop_level=stop_level, n_level=n_level+1,)
        draw_tree_boundary(tree=tree, ax=ax,
            left=f0, right=right, top=top, bottom=bottom,
            i_node=tree.children_right[i_node],
            stop_level=stop_level, n_level=n_level+1)
    else:
        f1 = tree.threshold[i_node]
        ax.plot([left, right], [f1, f1])
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=right, top=f1, bottom=bottom,
            i_node=tree.children_left[i_node],
            stop_level=stop_level, n_level=n_level+1)
        draw_tree_boundary(tree=tree, ax=ax,
            left=left, right=right, top=top, bottom=f1,
            i_node=tree.children_right[i_node],
            stop_level=stop_level, n_level=n_level+1)
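
The same recursion can be followed without plotting. The sketch below collects the rectangles that would be filled instead of drawing them, using a small hand-built object that only mimics the tree_ arrays. The names collect_regions and toy_tree are hypothetical and not part of the original code.

```python
from types import SimpleNamespace


def collect_regions(tree, left, right, bottom, top,
                    i_node=0, stop_level=None, n_level=0):
    """Return a list of (i_node, left, right, bottom, top) for the drawn regions."""
    # Stop at a leaf (child index -1) or when the requested depth is reached.
    if tree.children_left[i_node] == -1 or stop_level == n_level:
        return [(i_node, left, right, bottom, top)]

    t = tree.threshold[i_node]
    if tree.feature[i_node] == 0:
        # Vertical split: the left child gets the left-hand part of the area.
        left_box = (left, t, bottom, top)
        right_box = (t, right, bottom, top)
    else:
        # Horizontal split: the left child gets the lower part of the area.
        left_box = (left, right, bottom, t)
        right_box = (left, right, t, top)

    # Left subtree first, then right subtree, as in draw_tree_boundary().
    regions = collect_regions(tree, *left_box,
                              i_node=tree.children_left[i_node],
                              stop_level=stop_level, n_level=n_level + 1)
    regions += collect_regions(tree, *right_box,
                               i_node=tree.children_right[i_node],
                               stop_level=stop_level, n_level=n_level + 1)
    return regions


# Hypothetical hand-built tree: node 0 splits on feature 0 at 0.5,
# node 1 (its left child) splits on feature 1 at 0.2; nodes 2, 3, 4 are leaves.
toy_tree = SimpleNamespace(
    children_left=[1, 2, -1, -1, -1],
    children_right=[4, 3, -1, -1, -1],
    feature=[0, 1, -2, -2, -2],
    threshold=[0.5, 0.2, -2.0, -2.0, -2.0],
)

full = collect_regions(toy_tree, left=0, right=1, bottom=0, top=1)
depth1 = collect_regions(toy_tree, left=0, right=1, bottom=0, top=1, stop_level=1)
print(full)    # three leaf rectangles: nodes 2, 3 and 4
print(depth1)  # two rectangles: node 1 (not split further) and node 4
```

With stop_level=1 the recursion stops at node 1 even though it is not a leaf, which is how a partially grown tree can be drawn from a fully fitted one.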

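Regression with decision trees

2020-06-09 / tau

Overview

A decision tree used for regression is also called a regression tree. This post looks at the properties and behavior of decision trees in regression.

How a regression tree learns

The following shows how the fit changes when a regression tree is applied to a sine function and the maximum depth (max_depth) is increased step by step.

With a maximum depth of 1, the feature range is split into two regions, and a prediction is learned from the data in each. With a maximum depth of 2, each region is split in two again, giving predictions on four regions. In this way, a maximum depth of n yields at most 2ⁿ regions. In this example, 80 points are prepared as the training set and predictions are made at 1000 points.

At a maximum depth of 6, 2⁶ = 64 is close to the number of training points, but the intervals around the peaks and troughs of the sine curve remain long and leave an error. This is thought to be because the node splits are chosen based on the values of y = sin x, so points near the peaks and troughs, where the values are very close to each other, are slow to be separated.

At a maximum depth of 10, up to 2¹⁰ = 1024 regions are allowed, which is more than enough to give each of the 80 training points its own region, so even points with nearly equal values are separated and the whole curve is fitted.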

import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.tree import DecisionTreeRegressor

x_train = np.linspace(0, np.pi * 2, num=80)
y_train = np.sin(x_train).reshape(-1, 1)
X_train = x_train.reshape(-1, 1)

x_test = np.linspace(0, np.pi * 2, num=1000).reshape(-1, 1)
X_test = x_test.reshape(-1, 1)

depths = [1, 2, 3, 4, 6, 10]

fig, axs = plt.subplots(2, 3, figsize=(12, 6))
fig.subplots_adjust(hspace=0.3)
ax_1d = axs.reshape(-1)

for ax, depth in zip(ax_1d, depths):
    ax.scatter(x_train, y_train, s=10, ec='gray', fc="coral")
    treereg = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    y_pred = treereg.predict(X_test)
    ax.plot(x_test, y_pred, c='blue', linewidth=1)
    ax.set_title("max_depth={}".format(depth))

plt.show()
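
The relation between max_depth and the number of regions can be checked directly. The following is a minimal sketch, assuming the same 80 training points as above, that compares the upper limit 2ⁿ with the number of leaves the fitted tree actually has, obtained from get_n_leaves(). The random_state=0 argument is added here only for reproducibility.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x_train = np.linspace(0, np.pi * 2, num=80)
X_train = x_train.reshape(-1, 1)
y_train = np.sin(x_train)

leaf_counts = {}
for depth in [1, 2, 3, 4, 6, 10]:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=0)
    treereg.fit(X_train, y_train)
    # get_n_leaves() is the number of regions the tree actually uses.
    leaf_counts[depth] = treereg.get_n_leaves()
    print("max_depth={:2d}: limit 2^n={:4d}, leaves={}".format(
        depth, 2 ** depth, leaf_counts[depth]))
```

The leaf count can never exceed the number of training points, so for large depths it is the training set size, not 2ⁿ, that bounds the number of regions.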

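Here we check an intermediate stage of learning, using the state at a maximum depth of 2 (max_depth=2).

Of the boundaries between the four regions (0.517, 3.142, 5.766), the first one, 3.142, is the value of π. This is a natural result, because the sine curve on 0 to 2π is antisymmetric about π. The predicted values (value) of the four regions can also be read off the graph, and they likewise mirror each other about π, with equal magnitude and opposite sign.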

import numpy as np
import matplotlib.pyplot as plt
import graphviz
from sklearn.tree import DecisionTreeRegressor, export_graphviz

x_train = np.linspace(0, np.pi * 2, num=80)
y_train = np.sin(x_train)
X_train = x_train.reshape(-1, 1)

treereg = DecisionTreeRegressor(max_depth=2).fit(X_train, y_train)

x_test = np.linspace(0, np.pi * 2, num=1000)
X_test = x_test.reshape(-1, 1)
y_pred = treereg.predict(X_test)

dot_data = export_graphviz(treereg, out_file=None,
    feature_names=["x"], class_names=["sin"])
graph = graphviz.Source(dot_data)
graph.render("image", view=True)

fig, ax = plt.subplots()
ax.scatter(x_train, y_train, s=20, ec='gray', fc="coral")
ax.plot(X_test, y_pred, c='blue')
ax.set_xlabel("x")
plt.show()
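
Why the first boundary falls at π can be illustrated with a brute-force search. The sketch below assumes the usual squared-error criterion: every midpoint between neighboring x values is tried, each side predicts its own mean, and the threshold with the smallest summed squared error is kept. The helper best_split is hypothetical and is not how scikit-learn is implemented internally, although it should select the same thresholds on this data.

```python
import numpy as np

x = np.linspace(0, np.pi * 2, num=80)
y = np.sin(x)


def best_split(x, y):
    """Scan midpoints between sorted x values; return (threshold, sse) with minimal SSE."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_sse = None, np.inf
    for k in range(1, len(x)):
        y_left, y_right = y[:k], y[k:]
        # Each side predicts its own mean, so the cost is the summed squared deviation.
        sse = ((y_left - y_left.mean()) ** 2).sum() \
            + ((y_right - y_right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = (x[k - 1] + x[k]) / 2, sse
    return best_t, best_sse


threshold, sse = best_split(x, y)
print("first split: {:.3f}".format(threshold))

# Repeat the search on each half to get the depth-2 boundaries.
lower = x <= threshold
print("left split : {:.3f}".format(best_split(x[lower], y[lower])[0]))
print("right split: {:.3f}".format(best_split(x[~lower], y[~lower])[0]))
```

The three printed thresholds can be compared with the boundaries in the tree exported by export_graphviz.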

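Effect of noise and overfitting

Now consider a regression tree fitted to the same sine curve with noise added. The regression curves for maximum depths of 3 and 5 are shown below; with the deeper tree, the model visibly overfits individual data points.

The scores of these models are as follows. At depth 5 the test score is lower than the training score, which indicates overfitting.

At depth 3 the training score is lower than the test score. This is because the training targets contain noise, whereas the test targets y are all noise-free sine values. Reducing the amount of random noise added to the training set removes this reversal.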

depth=3
 training score: 0.944
 test score    : 0.951
depth=5
 training score: 0.979
 test score    : 0.942

The code that produced the results above is as follows.

import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.tree import DecisionTreeRegressor

n_train = 80
n_rand = 10
depth1 = 3
depth2 = 5

rnd.seed(23)

x = np.linspace(0, np.pi * 2, num=n_train)
y = np.sin(x)
for i in range(0, len(x), n_train//n_rand):
    y[i] = y[i] + (rnd.rand() * 2 - 1)
df = DataFrame(x, columns=['x'])
df['y'] = y

x_train = np.array(df['x'])
y_train = np.array(df['y'])
X_train = x_train.reshape(-1, 1)

treereg1 = DecisionTreeRegressor(max_depth=depth1).fit(X_train, y_train)
treereg2 = DecisionTreeRegressor(max_depth=depth2).fit(X_train, y_train)

x_test = np.linspace(0, np.pi * 2, 1000)
y_test = np.sin(x_test)
X_test = x_test.reshape(-1, 1)
y_pred1 = treereg1.predict(X_test)
y_pred2 = treereg2.predict(X_test)

print("depth={}".format(depth1))
print(" training score:{:6.3f}".format(treereg1.score(X_train, y_train)))
print(" test score    :{:6.3f}".format(treereg1.score(X_test, y_test)))
print("depth={}".format(depth2))
print(" training score:{:6.3f}".format(treereg2.score(X_train, y_train)))
print(" test score    :{:6.3f}".format(treereg2.score(X_test, y_test)))

fig, ax = plt.subplots()

ax.scatter(x_train, y_train, s=20, ec='gray', fc="coral")
ax.plot(X_test, y_pred1,
    c='forestgreen', linestyle='dashed', label="depth={}".format(depth1))
ax.plot(X_test, y_pred2, c='blue', label="depth={}".format(depth2))
ax.legend()

plt.show()
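
The score() method of scikit-learn regressors returns the coefficient of determination R². As a rough illustration of why noise in the targets alone can lower the score, the sketch below computes R² by hand and scores the same predictions against clean and noisy targets. The function name r2_score_manual and the small arrays are hypothetical values chosen for illustration.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()          # unexplained variation
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Hypothetical numbers: the same predictions scored against clean and noisy targets.
y_pred = np.array([0.0, 1.0, 0.0, -1.0])
y_clean = np.array([0.0, 1.0, 0.0, -1.0])
y_noisy = np.array([0.0, 1.5, 0.0, -1.0])

r2_clean = r2_score_manual(y_clean, y_pred)
r2_noisy = r2_score_manual(y_noisy, y_pred)
print("clean targets: {:.3f}".format(r2_clean))
print("noisy targets: {:.3f}".format(r2_noisy))
```

A model that follows the underlying curve is penalized by the noise in noisy targets, which is consistent with the depth-3 training score being lower than the noise-free test score.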

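The same data was also fitted while varying the maximum depth of the decision tree. The relationship between the training and test scores indicates underfitting up to depth 3 and overfitting from depth 4 onward, and once the model overfits it is affected by the noise.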
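
A depth sweep of this kind can be sketched as follows. The data generation is intended to mirror the code above (80 points, every 8th point perturbed, seed 23), but the use of RandomState, the depth range of 1 to 10 and random_state=0 are assumptions of this sketch, so the printed scores are not guaranteed to match the original figure.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(23)

x_train = np.linspace(0, np.pi * 2, num=80)
y_train = np.sin(x_train)
# Perturb every 8th training point by uniform noise in [-1, 1).
idx = np.arange(0, len(x_train), 8)
y_train[idx] += rng.rand(len(idx)) * 2 - 1
X_train = x_train.reshape(-1, 1)

x_test = np.linspace(0, np.pi * 2, 1000)
X_test = x_test.reshape(-1, 1)
y_test = np.sin(x_test)  # noise-free targets

train_scores, test_scores = [], []
for depth in range(1, 11):
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=0)
    treereg.fit(X_train, y_train)
    train_scores.append(treereg.score(X_train, y_train))
    test_scores.append(treereg.score(X_test, y_test))
    print("depth={:2d}  training score:{:6.3f}  test score:{:6.3f}".format(
        depth, train_scores[-1], test_scores[-1]))
```

The training score can only rise as the depth grows, so the depth at which the test score stops following it is a practical indicator of where overfitting begins.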

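Limits of decision trees: extrapolation

A decision tree can reproduce the training data it was given perfectly, but it cannot make sensible predictions for data outside the range of the training data. This is confirmed here with the memory price history used in the book "Pythonではじめる機械学習" (the data was borrowed from this site).

With time as x and memory price as y, the logarithm of the price, log y, is roughly linear in x. The following shows the memory price on a logarithmic vertical axis, together with the results of fitting and predicting with linear regression and a decision tree on x and log y. Both models are trained on the data before 2000 and used to predict prices from 2000 onward.

Linear regression cannot reproduce the fine detail of the data, but it predicts the trend reasonably well even outside the training set. The decision tree, in contrast, predicts the training set perfectly, but as soon as it leaves that range it simply returns the value of the last data point inside the range as its prediction.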

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

data_path = \
    r"C:\Users\tomo\GoogleDrive\IT_and_Mobile\dev\python\Machine_Learning"
memory_prices = \
    pd.read_csv(os.path.join(data_path, r"regression_tree\memory_prices.csv"))

data_train = memory_prices[memory_prices.date < 2000]
data_test = memory_prices[memory_prices.date >= 2000]

x_train = np.array(data_train.date)
x_test = np.array(data_test.date)
y_train = np.array(data_train.price)
y_test = np.array(data_test.price)
X_train = x_train.reshape(-1, 1)
X_test = x_test.reshape(-1, 1)

linreg = LinearRegression().fit(X_train, np.log(y_train))
y_pred_linreg_train = np.exp(linreg.predict(X_train))
y_pred_linreg_test = np.exp(linreg.predict(X_test))

treereg = DecisionTreeRegressor().fit(X_train, np.log(y_train))
y_pred_treereg_train = np.exp(treereg.predict(X_train))
y_pred_treereg_test = np.exp(treereg.predict(X_test))

fig, ax = plt.subplots()
ax.plot(data_train.date, data_train.price, c='tab:blue', label="memory prices")
ax.plot(data_test.date, data_test.price, c='tab:blue', linestyle='dashed')
ax.plot(x_train, y_pred_linreg_train, c='tab:green', label="Linear Regression")
ax.plot(x_test, y_pred_linreg_test, c='tab:green', linestyle='dashed')
ax.plot(x_train, y_pred_treereg_train,
    c='tab:red', linestyle='dotted', linewidth=3, label="Decision Tree")
ax.plot(x_test, y_pred_treereg_test, c='tab:red', linestyle='dashed')
ax.set_yscale('log')
ax.legend()
plt.show()
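
The code above depends on a local CSV file. The same behavior can be reproduced without it, as in the following sketch, which uses a synthetic series as a stand-in for the memory prices. The slope, noise level and year range are made-up values, not the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

# Synthetic stand-in for the price series: log(price) falls linearly with the year.
years = np.arange(1960, 2016, dtype=float)
log_price = 20.0 - 0.4 * (years - 1960) + rng.normal(scale=0.5, size=len(years))

# Train on years before 2000, predict from 2000 onward.
train = years < 2000
X_train, y_train = years[train].reshape(-1, 1), log_price[train]
X_test = years[~train].reshape(-1, 1)

linreg = LinearRegression().fit(X_train, y_train)
treereg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

pred_lin = linreg.predict(X_test)
pred_tree = treereg.predict(X_test)

print("last training value :", y_train[-1])
print("tree predictions    :", np.unique(pred_tree))
print("linear predictions  :", pred_lin[:3], "...")
```

Every test year falls to the right of the last split, so the tree returns the value of the rightmost leaf for all of them, while the linear model continues the downward trend.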
