XGBoost

2021. 8. 17. 08:00ㆍAI/Machine learning

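A library that implements the gradient boosting algorithm so that it can run in a distributed environment

It supports regression and classification, and is widely used because of its strong predictive performance and efficient use of resources.

XGBoost is an ensemble algorithm that combines multiple decision trees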

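Characteristics of gradient boosting

GBM (Gradient Boosting Machine)

A predictive model that performs regression and classification

Belongs to the boosting family of ensemble methods

Shows high predictive performance on tabular data

Packages that implement it

LightGBM, CatBoost, XGBoost

GBM is computationally heavy, so an implementation has to use the hardware efficiently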

Drawbacks of gradient boosting

Computational performance (GBM requires a large amount of computation)

decision tree

A tree that reaches its result by making one decision after another
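To make the idea of "one decision after another" concrete, here is a minimal sketch of a hand-written decision tree. The feature names (`x1`, `x2`), the thresholds, and the leaf values are made up for illustration and are not learned from any data.

```python
# A hand-written decision tree stored as nested dicts (hypothetical values).
# Internal nodes compare one feature with a threshold; leaves hold a prediction.
TREE = {
    "feature": "x1", "threshold": 10.0,
    "left": {"leaf": 1},                       # taken when x1 < 10
    "right": {                                 # taken when x1 >= 10
        "feature": "x2", "threshold": 50.0,
        "left": {"leaf": 0},                   # taken when x2 < 50
        "right": {"leaf": 1},                  # taken when x2 >= 50
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, making one decision per internal node."""
    while "leaf" not in node:
        go_left = sample[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


print(predict(TREE, {"x1": 8.0, "x2": 10.0}))   # one decision, then a leaf
print(predict(TREE, {"x1": 30.0, "x2": 10.0}))  # two decisions, then a leaf
```

A real library learns the features and thresholds from the training data instead of having them written by hand.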

Ensemble methods

1) bagging

The outputs of multiple decision trees are combined, for example by averaging, to produce the final result
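Bagging can be sketched in a few lines of plain Python. This is only an illustration under assumptions: each tree is reduced to a one-split regression "stump" on 1-D inputs, each stump is trained on a bootstrap sample (sampling with replacement, the usual bagging recipe), and the toy data, `n_trees`, and `seed` are made up.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on 1-D data by least squares."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:                      # all x identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def fit_bagging(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    # final result: the average of the individual trees' outputs
    return lambda x: sum(tree(x) for tree in trees) / len(trees)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.1, 0.9, 1.0, 1.2, 3.0, 3.1, 2.9, 3.2]      # made-up toy targets
bagged = fit_bagging(xs, ys)
print(bagged(2.0), bagged(7.0))
```

The trees are trained independently of each other, which is the main contrast with boosting.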

2) boosting

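Inputs are fed through a model, and the inputs that were predicted incorrectly are given a larger weight when they are passed to the next tree

This is repeated over multiple decision trees

Each model is given a different weight according to its performance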

y = h1(x) + err1(x)

err1(x) = h2(x) + err2(x)

err2(x) = h3(x) + err3(x)

y = W1*h1(x) + W2*h2(x) + W3*h3(x) + err3(x)
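The reweighting described above (more weight on wrongly predicted inputs, and a per-model weight W based on performance) can be sketched as follows. The post does not name a specific algorithm, so the update rule used here is borrowed from AdaBoost as an assumption; the threshold classifiers, the toy data, and `n_rounds` are also made up for illustration.

```python
import math


def fit_weighted_stump(xs, ys, w):
    """Pick the threshold and sign with the lowest weighted error (labels are +1/-1)."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x >= t else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best


def boost(xs, ys, n_rounds=3):
    n = len(xs)
    w = [1.0 / n] * n                                 # equal weight on every input
    models = []
    for _ in range(n_rounds):
        err, t, sign = fit_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)         # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)       # model weight W: lower error, larger W
        models.append((alpha, t, sign))
        # raise the weight of misclassified inputs, lower the others, renormalize
        w = [wi * math.exp(-alpha * y * (sign if x >= t else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return models


def predict(models, x):
    score = sum(alpha * (sign if x >= t else -sign) for alpha, t, sign in models)
    return 1 if score >= 0 else -1


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]            # no single threshold separates these labels
models = boost(xs, ys)
print([predict(models, x) for x in xs])
```

No single threshold classifies this toy data correctly, but the weighted vote of three of them does.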

Gradient boosting

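A boosting method that uses gradients

model2 predicts the residual (error) left by model1's prediction, and repeating this step gradually reduces the residual (error)

Each individual model is called a weak learner, and the model obtained by combining them is called a strong learner.

Simple decision trees are commonly used as weak learners.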
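The residual-fitting loop can be sketched in plain Python. This is a minimal illustration, not how XGBoost is implemented: the weak learner is a one-split regression stump on 1-D inputs, and `learning_rate`, `n_rounds`, and the toy data are assumptions that do not appear in the text above.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree (stump) on 1-D inputs by least squares."""
    best = None
    for t in sorted(set(xs))[1:]:        # skip the smallest x so both sides are non-empty
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def fit_gradient_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)             # initial model: predict the mean
    preds = [base] * len(xs)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still unexplained
        mse_history.append(sum(r * r for r in residuals) / len(residuals))
        stump = fit_stump(xs, residuals)                 # next weak learner fits the residual
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def strong_learner(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return strong_learner, mse_history


xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]                 # made-up toy regression target
model, mse_history = fit_gradient_boosting(xs, ys)
print(mse_history[0], mse_history[-1])   # training error before the first and last round
```

Each stump alone is a poor predictor, but the sum of many of them tracks the target much more closely than the initial mean prediction.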

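Here, the gradient refers to the following:

When the loss function is the squared error, the residual is the negative gradient.

If the loss function is 1/2(yi - f(xi))^2,

differentiating it with respect to f(xi) gives f(xi) - yi.

The residual yi - f(xi) is the negative of this gradient, so fitting the residual is the same as fitting the negative gradient.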
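The derivative can be checked numerically with a central finite difference. The values of `y` and `f` below are arbitrary and chosen only for illustration.

```python
def loss(y, f):
    """Squared-error loss 1/2 (y - f)^2 for a single sample."""
    return 0.5 * (y - f) ** 2


def numerical_gradient(y, f, eps=1e-6):
    """Central-difference estimate of d(loss)/df."""
    return (loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


y, f = 3.0, 2.2                  # arbitrary target and current prediction
grad = numerical_gradient(y, f)
residual = y - f

print(grad)                      # close to f - y
print(-grad, residual)           # the negative gradient matches the residual
```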

Building a model

GBDT models are widely used when working with structured (tabular) data

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier


train = ...  # Titanic train data (pandas DataFrame)
test = ...   # Titanic test data (pandas DataFrame)


# split features and target; test_x starts as a copy of the test data
train_x = train.drop(['Survived'], axis=1)
train_y = train['Survived']
test_x = test.copy()

train_x = train_x.drop(['PassengerId'], axis=1)
test_x = test_x.drop(['PassengerId'], axis=1)

train_x = train_x.drop(['Name', 'Ticket', 'Cabin'], axis=1)
test_x = test_x.drop(['Name', 'Ticket', 'Cabin'], axis=1)

# label encoding: map each category to an integer (this is not one-hot encoding)
for column_name in ['Sex', 'Embarked']:
    le = LabelEncoder()
    le.fit(train_x[column_name].fillna('NA'))

    train_x[column_name] = le.transform(train_x[column_name].fillna('NA'))
    test_x[column_name] = le.transform(test_x[column_name].fillna('NA'))


model = XGBClassifier(n_estimators=20, random_state=71, use_label_encoder=False)
model.fit(train_x, train_y)

# predicted probability
pred = model.predict_proba(test_x)[:, 1]

# transform pred into (1 or 0)
pred_label = np.where(pred > 0.5, 1, 0)

# output file
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': pred_label})
submission.to_csv('submission_first.csv', index=False)

n_estimators
- number of weak learners (trees) to build
random_state
- random number seed
use_label_encoder
- whether to use the label encoder from scikit-learn to encode the labels (marked as deprecated in the XGBoost documentation)
