hyperparameter tuning

2022. 3. 19. 14:42ㆍAI/Big data

hyper-parameter tuning

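Tune the max_depth and min_child_weight parameters with grid search: every combination of the candidate values is scored by 4-fold cross-validated log loss, and the combination with the lowest mean score is kept.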

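import itertools

import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
from xgboost import XGBClassifier

# train_x / train_y are assumed to be the feature and label frames prepared earlier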

params = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1.0, 2.0, 4.0]
}

param_combs = itertools.product(params['max_depth'], params['min_child_weight'])

scores = []
param_sets = []

for max_depth, min_child_weight in param_combs:
    fold_scores = []
    kf = KFold(n_splits=4, shuffle=True, random_state=12345)

    for tr_idxs, va_idxs in kf.split(train_x):
        tr_x, va_x = train_x.iloc[tr_idxs], train_x.iloc[va_idxs]
        tr_y, va_y = train_y.iloc[tr_idxs], train_y.iloc[va_idxs]

        model = XGBClassifier(n_estimators=20, random_state=71, use_label_encoder=False,
          max_depth=max_depth, min_child_weight=min_child_weight)

        model.fit(tr_x, tr_y)

        va_pred = model.predict_proba(va_x)[:, 1]
        logloss = log_loss(va_y, va_pred)
        fold_scores.append(logloss)

    # Average the fold scores for this parameter combination
    scores.append(np.mean(fold_scores))
    param_sets.append((max_depth, min_child_weight))

# Lower log loss is better, so pick the combination with the smallest mean score
best_idx = np.argmin(scores)
best_max_depth, best_min_child_weight = param_sets[best_idx]

print(f'max_depth: {best_max_depth}, min_child_weight: {best_min_child_weight}')

    max_depth: 3, min_child_weight: 1.0

With random_state set to 123456, max_depth 7 and min_child_weight 2.0 come out best instead, so the selected values depend on the seed.
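
The manual loop above can also be written with scikit-learn's GridSearchCV. The following is a minimal sketch, not the post's own code: it assumes a synthetic dataset from make_classification and uses DecisionTreeClassifier as a stand-in (with min_samples_leaf in place of min_child_weight) so that it runs without the Titanic data or xgboost.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic set used in the post).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Same idea as the manual loop: every combination of the listed values is tried.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 2, 4],  # stand-in for min_child_weight
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=71),
    param_grid,
    scoring="neg_log_loss",  # scorers are maximized, so log loss is negated
    cv=KFold(n_splits=4, shuffle=True, random_state=12345),
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean validation log loss of the best combination
```

XGBClassifier follows the scikit-learn estimator interface, so it should be usable as the estimator here with the original max_depth / min_child_weight grid.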

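Ensemble methods

A technique that combines several models to improve performance.
It helps most when each model is already accurate on its own and the models are of different kinds.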
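
A toy illustration of why diversity matters. The labels and the three models' predictions below are invented for the example: each model is wrong on a different pair of samples, so a majority vote corrects every mistake.

```python
# Ground truth and predictions are invented for illustration.
y_true  = [1, 0, 1, 1, 0, 0]
model_a = [0, 1, 1, 1, 0, 0]  # wrong on samples 0 and 1
model_b = [1, 0, 0, 0, 0, 0]  # wrong on samples 2 and 3
model_c = [1, 0, 1, 1, 1, 1]  # wrong on samples 4 and 5


def accuracy(pred, truth):
    """Fraction of predictions that match the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Majority vote: predict 1 when at least two of the three models say 1.
vote = [int(a + b + c >= 2) for a, b, c in zip(model_a, model_b, model_c)]

for name, pred in [("a", model_a), ("b", model_b), ("c", model_c), ("vote", vote)]:
    print(name, round(accuracy(pred, y_true), 3))
```

If the three models made the same mistakes, the vote would simply repeat them, which is why combining near-identical models adds little.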

xgboost + logistic regression model

Preprocessing

Preprocessing needed before the logistic regression model can be used:

There must be no missing values
Categorical values must be one-hot encoded
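
Both requirements can be checked before fitting. This is a small sketch on a toy frame; the helper name check_ready_for_lr is hypothetical and not part of the original post.

```python
import pandas as pd


def check_ready_for_lr(df: pd.DataFrame) -> dict:
    """Report columns that would break a LogisticRegression fit."""
    has_missing = [c for c in df.columns if df[c].isna().any()]
    non_numeric = [c for c in df.columns
                   if not pd.api.types.is_numeric_dtype(df[c])]
    return {"has_missing": has_missing, "non_numeric": non_numeric}


# Toy data invented for the example.
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.28, 8.05],
})
print(check_ready_for_lr(toy))
```

Running the same check on train_x2 and test_x2 after the preprocessing steps should report two empty lists.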

Drop unnecessary columns

from sklearn.preprocessing import OneHotEncoder


train_x2 = train.drop(['Survived'], axis=1)
test_x2 = test.copy()

train_x2 = train_x2.drop(['PassengerId'], axis=1)
test_x2 = test_x2.drop(['PassengerId'], axis=1)

train_x2 = train_x2.drop(['Name', 'Ticket', 'Cabin'], axis=1)
test_x2 = test_x2.drop(['Name', 'Ticket', 'Cabin'], axis=1)

one-hot encoding

cat_cols = ['Sex', 'Embarked', 'Pclass']
# scikit-learn >= 1.2 names this argument sparse_output; older versions use sparse=False
ohe = OneHotEncoder(categories='auto', sparse_output=False)
ohe.fit(train_x2[cat_cols].fillna('NA'))

Build the column names for the one-hot encoded output

ohe_columns = []
for i, c in enumerate(cat_cols):
    ohe_columns += [f'{c}_{v}' for v in ohe.categories_[i]]

ohe_train_x2 = pd.DataFrame(ohe.transform(train_x2[cat_cols].fillna('NA')), columns=ohe_columns)
ohe_test_x2 = pd.DataFrame(ohe.transform(test_x2[cat_cols].fillna('NA')), columns=ohe_columns)

Drop the original columns

train_x2 = train_x2.drop(cat_cols, axis=1)
test_x2 = test_x2.drop(cat_cols, axis=1)

Append the one-hot encoded columns

train_x2 = pd.concat([train_x2, ohe_train_x2], axis=1)
test_x2 = pd.concat([test_x2, ohe_test_x2], axis=1)
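
One thing to watch with this step: pd.concat with axis=1 lines rows up by index label, not by position. The concat above relies on both frames carrying a default 0..n-1 index, which is assumed to be the case if train and test were loaded without a custom index. A sketch with toy data showing what goes wrong otherwise:

```python
import pandas as pd

# A frame whose index is no longer 0..n-1, e.g. after filtering rows (toy data).
left = pd.DataFrame({"Age": [22.0, 38.0, 26.0]}, index=[0, 2, 5])
right = pd.DataFrame({"Sex_male": [1.0, 0.0, 0.0]})  # fresh 0..2 index

# concat with axis=1 joins on index labels, so mismatched labels create NaN rows.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first makes the rows line up by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```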

Filling missing values

Fill missing values with the column mean using fillna

num_cols = ['Age', 'SibSp', 'Parch', 'Fare']
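for col in num_cols:
    # Fill both sets with the training mean so test rows get the value the model was trained around
    col_mean = train_x2[col].mean()
    train_x2[col] = train_x2[col].fillna(col_mean)
    test_x2[col] = test_x2[col].fillna(col_mean)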
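
The same fill can be done with scikit-learn's SimpleImputer, which keeps the "learn on train, apply to test" split explicit. This is a sketch on toy data, not the post's own approach.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data invented for the example.
train_num = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Fare": [10.0, 30.0, np.nan]})
test_num = pd.DataFrame({"Age": [np.nan], "Fare": [5.0]})

# fit() learns the column means from the training data only ...
imputer = SimpleImputer(strategy="mean")
imputer.fit(train_num)

# ... and transform() reuses those means for any frame with the same columns.
train_filled = pd.DataFrame(imputer.transform(train_num), columns=train_num.columns)
test_filled = pd.DataFrame(imputer.transform(test_num), columns=test_num.columns)

print(imputer.statistics_)
print(test_filled)
```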

Convert the fare to log scale

train_x2['Fare'] = np.log1p(train_x2['Fare'])
test_x2['Fare'] = np.log1p(test_x2['Fare'])
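
A quick look at what np.log1p does, using a few invented fare values: it computes log(1 + x), so large fares are compressed and a fare of 0 stays at 0 rather than becoming negative infinity.

```python
import numpy as np

# Invented fares spanning several orders of magnitude, including a zero fare.
fares = np.array([0.0, 7.25, 71.28, 512.33])

logged = np.log1p(fares)     # log(1 + x), so a fare of 0 maps to 0 instead of -inf
restored = np.expm1(logged)  # expm1 is the inverse of log1p

print(np.round(logged, 3))
print(np.allclose(restored, fares))
```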

Ensemble model

from sklearn.linear_model import LogisticRegression


model_xgb = XGBClassifier(n_estimators=20, random_state=71, use_label_encoder=False)
model_xgb.fit(train_x, train_y)
pred_xgb = model_xgb.predict_proba(test_x)[:, 1]

model_lr = LogisticRegression(solver='lbfgs', max_iter=300)
model_lr.fit(train_x2, train_y)
pred_lr = model_lr.predict_proba(test_x2)[:, 1]

pred = pred_xgb * 0.8 + pred_lr * 0.2
pred_label = np.where(pred > 0.5, 1, 0)

submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': pred_label})
submission.to_csv('submission_first_ensemble.csv', index=False)
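
The 0.8 / 0.2 weights above are fixed by hand. One possible way to choose them is to scan candidate weights on held-out validation predictions and keep the one with the lowest log loss. The sketch below assumes hypothetical validation labels and probabilities (va_y, va_pred_a, va_pred_b are made up) and is not how the post picked its weights.

```python
import numpy as np

# Hypothetical validation labels and predicted probabilities from two models.
va_y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
va_pred_a = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.3, 0.3])
va_pred_b = np.array([0.7, 0.4, 0.8, 0.6, 0.2, 0.3, 0.6, 0.5])


def binary_log_loss(y, p, eps=1e-15):
    """Mean binary cross-entropy of probabilities p against labels y."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


# Try weights 0.0, 0.1, ..., 1.0 for model a; model b gets the remainder.
weights = np.linspace(0.0, 1.0, 11)
losses = [binary_log_loss(va_y, w * va_pred_a + (1 - w) * va_pred_b) for w in weights]

best_w = float(weights[int(np.argmin(losses))])
print(f"best weight for model a: {best_w:.1f}, log loss: {min(losses):.4f}")
```

With real models, the validation predictions would come from out-of-fold predictions such as those produced inside a KFold loop.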

encoders

label encoder
- encodes values such as strings as integers (0, 1, ...)
one-hot encoder
- creates one column per possible value and assigns 0 or 1 to each
- e.g.
  - sex -> split into two columns, sex_male and sex_female; for male, only sex_male is 1 (sex_female is 0)
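
The two encoders side by side on a toy sex column. This is a sketch that uses scikit-learn's LabelEncoder and pandas' get_dummies; the data is invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sex = pd.Series(["male", "female", "female", "male"], name="sex")

# Label encoding: one integer per category (classes are sorted, so female=0, male=1).
le = LabelEncoder()
labels = le.fit_transform(sex)
print(list(le.classes_), labels.tolist())

# One-hot encoding: one 0/1 column per category.
onehot = pd.get_dummies(sex, prefix="sex", dtype=int)
print(onehot)
```

Label encoding imposes an arbitrary order on the categories, which tree models tolerate but a linear model such as logistic regression would read as a magnitude; that is why the preprocessing above uses one-hot encoding.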

License: Attribution, NonCommercial, NoDerivatives
