Splitting Data into Training and Test Sets: train_test_split()

1 Overview
2 Fixing the random sequence
3 Data size
4 Splitting multiple arrays at once
5 Stratification (proportional splitting) with stratify
6 Shuffling on or off

Overview

The train_test_split() function in scikit-learn splits the data it is given into training data and test data, with several options that control how the split is made.

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)
print(x)
# [ 1  2  3  4  5  6  7  8  9 10 11 12]

print(train_test_split(x))
# [array([ 7,  2, 12,  5,  3,  9, 11,  8, 10]), array([1, 6, 4])]

x_train, x_test = train_test_split(x)

print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
# x_train:[ 6  1 12  7  3  2 11  5  4]
# x_test :[ 8  9 10]

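Line 8 of the code, print(train_test_split(x)), shows that passing an array to train_test_split() returns it split into two arrays.

Line 11 unpacks that result into a training array and a test array.

By default, train_test_split() splits the array so that the test data is 0.25 of the input size (training size : test size = 3:1). Here x_test has 12 × 0.25 = 3 elements and x_train has the remaining 9. The two calls pick different elements because no random seed is fixed.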
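
The default sizes can be checked without looking at which elements were picked. The following is a minimal sketch that reuses the 12-element array from above; it only inspects the lengths and confirms that the two parts together contain every input element exactly once.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)
x_train, x_test = train_test_split(x)

# The sizes follow the default ratio even though the chosen elements are random.
print(len(x_train), len(x_test))

# Every input element ends up in exactly one of the two parts.
recombined = sorted(np.concatenate([x_train, x_test]).tolist())
print(recombined == x.tolist())
```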

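Fixing the random sequence

The elements are selected at random each time train_test_split() is run, but passing the random_state parameter fixes the selection: the same value always yields the same split, and different values generally yield different splits.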

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)

x_train, x_test = train_test_split(x, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
# x_train:[11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5]

x_train, x_test = train_test_split(x, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
# x_train:[11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5]

x_train, x_test = train_test_split(x, random_state=1)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
# x_train:[11  2  7  1  8 12 10  9  6]
# x_test :[3 4 5]

x_train, x_test = train_test_split(x, random_state=1)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
# x_train:[11  2  7  1  8 12 10  9  6]
# x_test :[3 4 5]
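
Rather than comparing printed arrays by eye, the reproducibility can be tested directly. This is a small sketch under the same setup as above; the variable names first, second, and same_split are only illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)

first = train_test_split(x, random_state=0)
second = train_test_split(x, random_state=0)

# Compare the training parts and the test parts pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```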

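Data size

Specifying the test data size

The size of the test data can be specified with the test_size parameter.

In the following example the test ratio is changed from the default 0.25 to 0.3, which gives a test set of 4 elements. The size is rounded up: 12 × 0.3 = 3.6 becomes 4, and test_size=0.26 (12 × 0.26 = 3.12) also gives 4.

When the size is given as a ratio, test_size must be a real number with 0 < test_size < 1; passing 0 or 1.0 raises an error.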

x_train, x_test = train_test_split(x, test_size=0.3, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# x_train:[ 3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5 11]
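
The rounding rule and the range restriction can both be checked in a few lines. The sketch below compares the observed test size with math.ceil(ratio × length) for a few ratios, then confirms that a ratio of 1.0 is rejected. Catching ValueError is an assumption based on the scikit-learn versions I am aware of; the exact exception message varies between versions.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)

for ratio in (0.25, 0.26, 0.3):
    x_train, x_test = train_test_split(x, test_size=ratio, random_state=0)
    # The observed test size matches the ratio times the length, rounded up.
    print(ratio, len(x_test), math.ceil(ratio * len(x)))

# A ratio outside the open interval (0, 1) is rejected.
try:
    train_test_split(x, test_size=1.0)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```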

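The test data size can also be given as an actual number of elements instead of a ratio. In that case, pass test_size as an integer of 1 or more.

The following example sets the test data size to 4.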

x_train, x_test = train_test_split(x, test_size=4, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# x_train:[ 3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5 11]
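
Since test_size=0.3 and test_size=4 both ask for 4 test elements out of 12, the two forms should produce the same split when the seed is the same. A minimal check, with illustrative variable names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)

by_ratio = train_test_split(x, test_size=0.3, random_state=0)
by_count = train_test_split(x, test_size=4, random_state=0)

# Both calls request 4 test elements, so the parts should match pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(by_ratio, by_count))
print(same_split)
```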

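Specifying the training data size

The training data size can also be specified, using the train_size parameter.

In the following example train_size=0.8 gives a training set of 9 elements. The training size is rounded down: 12 × 0.8 = 9.6 becomes 9, and the remaining 3 elements form the test data.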

x_train, x_test = train_test_split(x, train_size=0.8, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# x_train:[11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5]
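
The two rounding directions fit together: rounding the training size down leaves room for the same test size that rounding the complementary test ratio up would give. The sketch below checks this for train_size=0.8 and test_size=0.2 on the 12-element array; it is a check of this one case, not a general proof.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)

by_train = train_test_split(x, train_size=0.8, random_state=0)
by_test = train_test_split(x, test_size=0.2, random_state=0)

n_train = math.floor(0.8 * len(x))   # 9.6 rounded down
n_test = math.ceil(0.2 * len(x))     # 2.4 rounded up

print(len(by_train[0]), n_train)
print(len(by_test[1]), n_test)
print(n_train + n_test == len(x))
```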

The training data size can also be given as a number of elements.

x_train, x_test = train_test_split(x, train_size=10, random_state=0)
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# x_train:[ 5 11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12]

Internal selection procedure

While varying test_size and train_size with random_state=0, I noticed that the order in which the elements of the test data appear does not change.

x_train, x_test = train_test_split(x, test_size=0.2, random_state=0)
# x_train:[11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5]

x_train, x_test = train_test_split(x, test_size=0.3, random_state=0)
# x_train:[ 3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5 11]

x_train, x_test = train_test_split(x, test_size=0.4, random_state=0)
# x_train:[ 9  2  8 10  4  1  6]
# x_test :[ 7 12  5 11  3]

x_train, x_test = train_test_split(x, train_size=10, random_state=0)
# x_train:[ 5 11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12]

x_train, x_test = train_test_split(x, train_size=9, random_state=0)
# x_train:[11  3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5]

x_train, x_test = train_test_split(x, train_size=8, random_state=0)
# x_train:[ 3  9  2  8 10  4  1  6]
# x_test :[ 7 12  5 11]

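Whether the size is given through test_size or train_size, and whether as a ratio or as a number of elements, the test data always starts with the elements 7, 12, 5, ... in that order.

The training data, by contrast, changes when the number of test elements changes, but whenever the test data is the same, the training data is the same too.

This suggests that train_test_split() first converts whatever size specification it receives into a number of test elements and then selects the test data by one common procedure.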
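
One simple model that would produce exactly this behaviour is to shuffle all indices once and cut the shuffled order in two. The sketch below implements that model with NumPy only. The function split_by_permutation is hypothetical, written for illustration; it is an assumption that fits the observations above, not a statement of how scikit-learn is implemented.

```python
import numpy as np


def split_by_permutation(data, n_test, seed):
    """Hypothetical model: shuffle once, then cut the shuffled order in two."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(data))   # one shuffled order of all indices
    test = data[order[:n_test]]          # the first n_test entries become test data
    train = data[order[n_test:]]         # everything after them becomes training data
    return train, test


x = np.arange(1, 13)

train3, test3 = split_by_permutation(x, n_test=3, seed=0)
train4, test4 = split_by_permutation(x, n_test=4, seed=0)

# A larger test set only extends the smaller one; the existing order is kept.
print(np.array_equal(test4[:3], test3))

# The element that moved into the test set came from the front of the training set.
print(train3[0] == test4[3], np.array_equal(train3[1:], train4))
```

Under this model the test data for any size is a prefix of the same shuffled sequence, which matches the pattern seen in the outputs: growing the test set by one moves the first training element to the end of the test data.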

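Splitting multiple arrays at once

train_test_split() can also split several arrays at the same time.

In the following example two arrays are passed as arguments. The result is a list holding, for each input array in turn, its training data followed by its test data.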

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 9)
y = np.arange(11, 19)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))
print("y_train:{}".format(y_train))
print("y_test :{}".format(y_test))

# x_train:[2 8 4 1 6 5]
# x_test :[7 3]
# y_train:[12 18 14 11 16 15]
# y_test :[17 13]

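This is the typical way the function is used: a dataset holding several features for each individual and the data holding each individual's class are split into training and test data together, so that each row of features stays paired with its class.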

import numpy as np
from sklearn.model_selection import train_test_split

x = np.vstack((np.arange(1, 11), np.arange(11, 21))).T
print("original x:\n{}".format(x))

y = np.arange(21, 31)
print("original y:{}".format(y))

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

print("x_train:\n{}".format(x_train))
print("x_test :\n{}".format(x_test))
print("y_train:{}".format(y_train))
print("y_test :{}".format(y_test))

The original data is:

original x:
[[ 1 11]
 [ 2 12]
 [ 3 13]
 [ 4 14]
 [ 5 15]
 [ 6 16]
 [ 7 17]
 [ 8 18]
 [ 9 19]
 [10 20]]
original y:[21 22 23 24 25 26 27 28 29 30]

Splitting it into training and test data gives:

x_train:
[[10 20]
 [ 2 12]
 [ 7 17]
 [ 8 18]
 [ 4 14]
 [ 1 11]
 [ 6 16]]
x_test :
[[ 3 13]
 [ 9 19]
 [ 5 15]]
y_train:[30 22 27 28 24 21 26]
y_test :[23 29 25]
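
That rows and labels stay paired can be verified programmatically. In this toy data every label equals the first feature plus 20, so the relation must still hold in both parts after the split. The sketch below relies on that property of the example data only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same data as above: 10 individuals, 2 features each, one label each.
x = np.vstack((np.arange(1, 11), np.arange(11, 21))).T
y = np.arange(21, 31)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# The relation label == first feature + 20 survives only if rows and labels
# were selected with the same indices.
train_paired = np.array_equal(y_train, x_train[:, 0] + 20)
test_paired = np.array_equal(y_test, x_test[:, 0] + 20)
print(train_paired, test_paired)
```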

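Stratification (proportional splitting) with `stratify`

train_test_split() selects elements at random, so the class proportions can differ among the original data, the training data, and the test data.

In the following example the ratio of 0s to 1s is 3:5 in the original data, but 1:4 in the training data and 2:1 in the test data. In some cases a particular class may end up extremely rare or missing altogether.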

import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

y_train, y_test = train_test_split(y, test_size=3, random_state=0)
print("y_train:{}".format(y_train))
print("y_test :{}".format(y_test))

# y_train:[1 1 0 1 1]
# y_test :[1 0 0]

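Passing an array to the stratify parameter makes train_test_split() split the data so that the training and test data follow the class proportions of that array as closely as possible.

In the following example the same array is split so that both parts stay close to the original proportion of 0s and 1s.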

y_train, y_test = train_test_split(y, test_size=3, stratify=y, random_state=0)
print("y_train:{}".format(y_train))
print("y_test :{}".format(y_test))

# y_train:[0 1 1 0 1]
# y_test :[1 1 0]

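The next example illustrates splitting feature data x for 9 individuals and the class labels y of those individuals into training and test data in line with the class distribution.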

import numpy as np
from sklearn.model_selection import train_test_split

x = np.array([10, 10, 10, 11, 11, 11, 11, 11, 11])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])

x_train, x_test, y_train, y_test =\
    train_test_split(x, y, test_size=3, stratify=y, random_state=0)
print("y_train:{}".format(y_train))
print("y_test :{}".format(y_test))
print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# y_train:[0 1 1 0 1 1]
# y_test :[1 1 0]
# x_train:[10 11 11 10 11 11]
# x_test :[11 11 10]
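
Counting the classes makes the effect of stratify easier to read than the raw arrays. The sketch below uses np.bincount on the same 9 labels as above; with 3 zeros and 6 ones the 1:2 ratio divides evenly into a test set of 3 and a training set of 6.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])   # 3 zeros, 6 ones (ratio 1:2)

y_train, y_test = train_test_split(y, test_size=3, stratify=y, random_state=0)

# np.bincount returns [number of 0s, number of 1s].
print(np.bincount(y, minlength=2))         # [3 6]
print(np.bincount(y_train, minlength=2))   # [2 4]
print(np.bincount(y_test, minlength=2))    # [1 2]
```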

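Shuffling on or off

By default train_test_split() selects elements at random when splitting, but with shuffle=False it keeps the original order: the leading elements become the training data and the trailing elements become the test data.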

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)

x_train, x_test = train_test_split(x, shuffle=False, random_state=0)

print("x_train:{}".format(x_train))
print("x_test :{}".format(x_test))

# x_train:[1 2 3 4 5 6 7 8 9]
# x_test :[10 11 12]
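
With shuffle=False the result is the same as slicing the array in order, which the sketch below checks. It also checks one related restriction that is not covered in the text above: as far as I know, combining shuffle=False with stratify is rejected with a ValueError, so treat that part as an assumption about the installed scikit-learn version.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 13)
x_train, x_test = train_test_split(x, test_size=3, shuffle=False)

# Without shuffling, the split equals slicing the array in order.
same_as_slices = np.array_equal(x_train, x[:9]) and np.array_equal(x_test, x[9:])
print(same_as_slices)

# Stratified selection needs shuffling, so this combination is rejected.
y = np.array([0, 1] * 6)
try:
    train_test_split(x, shuffle=False, stratify=y)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```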
