Stratified Train/Test Split in scikit-learn

lottogame 2020. 11. 30. 07:40

I need to split my data into a training set (75%) and a test set (25%). I currently do that with the code below:

X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)

However, I'd like to stratify my training dataset. How do I do that? I've been looking into the StratifiedKFold method, but it doesn't let me specify the 75%/25% split and stratify only the training dataset.

[Update for 0.17]

See the documentation for sklearn.model_selection.train_test_split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.25)

[/Update for 0.17]
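
To see what stratify=y does in practice, here is a minimal sketch on toy data. The array shapes, the 80/20 class ratio, and random_state=0 are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Count how many samples of each class landed in each subset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train class counts:", train_counts)
print("test class counts:", test_counts)
```

With stratification, both subsets keep the 80/20 class ratio of the full data set, whatever seed is used.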

There is a pull request for this. However, you can simply do train, test = next(iter(StratifiedKFold(...))) and use the train and test indices if you want.
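
A rough sketch of that workaround with the current sklearn.model_selection API, where the splitter is built first and the data is passed to split(). The arguments n_splits=4, shuffle=True, and random_state=0 are assumptions chosen here so that one fold is 25% of the data; the original answer leaves the arguments elided.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (an assumption for illustration): 100 samples, 80/20 class ratio.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# With 4 folds, each test fold holds one quarter of the data: a 75/25 split.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Take only the first (train, test) pair of index arrays.
train_idx, test_idx = next(iter(skf.split(X, y)))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(len(train_idx), len(test_idx))
```

This only works for test fractions of the form 1/k, which is one reason train_test_split with stratify is usually the more convenient choice.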

TL;DR: Use StratifiedShuffleSplit with test_size=0.25

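Scikit-learn provides two modules for stratified splitting:

StratifiedKFold: This module is useful as a direct k-fold cross-validation operator. It sets up n_folds training/test sets such that the classes are equally balanced in both.

Here is some code (taken directly from the documentation above):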

>>> skf = cross_validation.StratifiedKFold(y, n_folds=2) #2-fold cross validation
>>> len(skf)
2
>>> for train_index, test_index in skf:
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
...    #fit and predict with X_train/test. Use accuracy metrics to check validation performance
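
The snippet above uses the old sklearn.cross_validation interface, which was removed in scikit-learn 0.20. A sketch of the same loop with sklearn.model_selection follows; there, n_folds is called n_splits and the data is passed to split() instead of the constructor. The four-sample arrays and the folds list are assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (an assumption for illustration): four samples, two per class.
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

skf = StratifiedKFold(n_splits=2)  # 2-fold cross-validation
n_folds = skf.get_n_splits(X, y)   # replaces len(skf) from the old API
print(n_folds)

folds = []  # collected only so the result can be inspected afterwards
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    folds.append((y_train, y_test))
    # Fit on X_train/y_train and evaluate on X_test/y_test here.
```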

StratifiedShuffleSplit: This module creates a single training/test set with equally balanced (stratified) classes. Essentially this is what you want with n_iter=1. You can specify the test size here, just as in train_test_split.

Code:

>>> sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
>>> len(sss)
1
>>> for train_index, test_index in sss:
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
>>> # fit and predict with your classifier using the above X/y train/test
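
For readers on a recent scikit-learn, a sketch of the same idea with sklearn.model_selection.StratifiedShuffleSplit, where n_iter is called n_splits and the labels go to split(). The toy data and test_size=0.25 (matching the question's 75/25 split) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (an assumption for illustration): 100 samples, 80/20 class ratio.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# One stratified shuffle split with a 25% test set.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_index, test_index = next(sss.split(X, y))

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
print("TRAIN size:", len(train_index), "TEST size:", len(test_index))
# Fit and predict with your classifier using the X/y train/test arrays above.
```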

Here is an example for continuous/regression data (until this issue on GitHub is resolved):

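import numpy as np
from sklearn.model_selection import train_test_split

# Your bins need to be appropriate for your output values
# e.g. 0 to 50 with 25 bin edges
bins     = np.linspace(0, 50, 25)
# Bin the same target array that is passed to train_test_split
y_binned = np.digitize(y, bins)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y_binned)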
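
One caveat with fixed bin edges is that train_test_split raises an error when a bin ends up with fewer than two samples. The sketch below swaps in a different binning choice, quantile-based edges, so that every bin is equally populated; this is a variation on the approach above, not the original author's method. The synthetic data and the choice of five bins are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.uniform(0, 50, size=200)

# Interior quantile edges give 5 bins with 40 samples each.
edges = np.quantile(y, [0.2, 0.4, 0.6, 0.8])
y_binned = np.digitize(y, edges)  # bin labels 0..4

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y_binned, test_size=0.25, random_state=0
)

# Each bin of the target should be represented equally in the test set.
test_bin_counts = np.bincount(np.digitize(y_test, edges), minlength=5)
print(test_bin_counts)
```

The continuous y values themselves are what gets split; the binned labels are used only to guide the stratification.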

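In addition to the accepted answer by @Andreas Mueller, I just want to add this, as @tangy mentioned above:

StratifiedShuffleSplit most closely resembles train_test_split(stratify=y), with these added features:

It stratifies by default.
By specifying n_splits, it splits the data repeatedly.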
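
A small sketch of the repeated-splitting behaviour described above, assuming toy data with an 80/20 class ratio; n_splits=3 and random_state=0 are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (an assumption for illustration): 100 samples, 80/20 class ratio.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# No stratify argument is needed: stratification on y is built in.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

test_counts = []
for i, (train_index, test_index) in enumerate(sss.split(X, y)):
    counts = np.bincount(y[test_index])
    test_counts.append(counts.tolist())
    print(f"split {i}: test class counts = {counts}")
```

Each iteration draws a fresh random split, and every one of them preserves the class ratio.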

You can do this simply with the train_test_split() method available in scikit-learn:

from sklearn.model_selection import train_test_split 
train, test = train_test_split(X, test_size=0.25, stratify=X['YOUR_COLUMN_LABEL'])

I have also prepared a short GitHub Gist that shows how the stratify option works:

https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9
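
As a quick inline illustration of stratifying a pandas DataFrame by one of its columns, here is a minimal sketch. It is not the content of the Gist; the frame, its column names ("feature" and "label", with "label" standing in for YOUR_COLUMN_LABEL), and random_state=0 are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: "label" plays the role of YOUR_COLUMN_LABEL.
df = pd.DataFrame({
    "feature": range(100),
    "label": ["a"] * 80 + ["b"] * 20,
})

train, test = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=0
)

# Compare label proportions in the full frame and in each subset.
print(df["label"].value_counts(normalize=True))
print(train["label"].value_counts(normalize=True))
print(test["label"].value_counts(normalize=True))
```

All three printouts should show the same 0.8 / 0.2 proportions.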

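from sklearn.model_selection import train_test_split

# train_size is 1 - tst_size - vld_size
tst_size = 0.15
vld_size = 0.15

# First hold out the validation set; df is assumed to be a DataFrame whose target column is named "y".
# stratify keeps the class proportions of y in each subset.
X_train_test, X_valid, y_train_test, y_valid = train_test_split(df.drop("y", axis=1), df["y"], test_size=vld_size, stratify=df["y"], random_state=13903)

# Rescale tst_size so the test set is about 15% of the full data, not 15% of the remainder.
X_train, X_test, y_train, y_test = train_test_split(X_train_test, y_train_test, test_size=tst_size / (1 - vld_size), stratify=y_train_test, random_state=13903)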
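
The same three-way split can be wrapped in a small helper. The function below is a hypothetical sketch (the name train_valid_test_split and its signature are made up here, not a scikit-learn API). It passes absolute sample counts to test_size, which train_test_split accepts as integers, to avoid off-by-one rounding when fractions are rescaled.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_valid_test_split(X, y, valid_size=0.15, test_size=0.15, random_state=None):
    """Hypothetical helper: stratified train/validation/test split."""
    n = len(y)
    n_valid = int(round(valid_size * n))  # absolute validation-set size
    n_test = int(round(test_size * n))    # absolute test-set size

    # First split off the validation set, stratified on y.
    X_rest, X_valid, y_rest, y_valid = train_test_split(
        X, y, test_size=n_valid, stratify=y, random_state=random_state
    )
    # Then split the remainder into train and test, stratified again.
    X_train, X_test, y_train, y_test = train_test_split(
        X_rest, y_rest, test_size=n_test, stratify=y_rest, random_state=random_state
    )
    return X_train, X_valid, X_test, y_train, y_valid, y_test


# Toy data (an assumption for illustration): 200 samples, 80/20 class ratio.
X = np.arange(400).reshape(200, 2)
y = np.array([0] * 160 + [1] * 40)

X_train, X_valid, X_test, y_train, y_valid, y_test = train_valid_test_split(
    X, y, random_state=13903
)
print(len(y_train), len(y_valid), len(y_test))
```

On this toy data the three subsets hold 140, 30, and 30 samples, each with the original 80/20 class ratio.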

Reference URL: https://stackoverflow.com/questions/29438265/stratified-train-test-split-in-scikit-learn
