
I would like to pre-train a model and then continue training it with another model.

I have a DecisionTreeClassifier and I would like to train it further with an LGBMClassifier. Is it possible to do this in scikit-learn? I have already read this post about it: https://datascience.stackexchange.com/questions/28512/train-new-data-to-pre-trained-model. In that post it says:

As per the official documentation, calling fit() more than once will overwrite what was learned by any previous fit()

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import lightgbm as lgb

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train a Decision Tree classifier
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# Train an LGBM classifier separately
lgbm = lgb.LGBMClassifier()
lgbm = lgbm.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = lgbm.predict(X_test)

3 Answers


Perhaps you are looking for stacked classifiers.

In this approach, the predictions of earlier models are available as features for later models.

Look into scikit-learn's StackingClassifier.

Adapted from the documentation:

from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

# Base estimators whose predictions become features for the final estimator
estimators = [
    ('dtc_model', DecisionTreeClassifier()),
]

# The LGBM classifier is trained on the decision tree's cross-validated predictions
clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LGBMClassifier(),
)
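
A minimal usage sketch (assuming X_train, y_train and X_test from the question's split):

clf.fit(X_train, y_train)        # fits the tree, then the LGBM final estimator on its predictions
y_pred = clf.predict(X_test)     # predictions flow through the base estimator into the final estimator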

Unfortunately this is not possible at present. According to the docs at https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html?highlight=init_model, the init_model argument only lets you continue training when the initial model is itself a LightGBM model (a Booster, an LGBMModel, or a LightGBM model file).

I did try this setup with:

import pickle
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

# Train a decision tree
dtc_model = DecisionTreeClassifier()
dtc_model = dtc_model.fit(X_train, y_train)

# Pickle it to disk
dtc_fn = 'dtc.pickle.db'
pickle.dump(dtc_model, open(dtc_fn, 'wb'))

# Try to continue training with LightGBM, pointing init_model at the pickled tree
lgbm_model = LGBMClassifier()
lgbm_model.fit(X_train_2, y_train_2, init_model=dtc_fn)

And I get:

LightGBMError: Unknown model format or submodel type in model file dtc.pickle.db
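
For contrast, continuing training does work when the starting point is itself a LightGBM model. A minimal sketch (the file name lgbm_init.txt is arbitrary, and X_train_2, y_train_2 stand in for the additional data):

from lightgbm import LGBMClassifier

# Train an initial LightGBM model and save it in LightGBM's native format
first = LGBMClassifier()
first.fit(X_train, y_train)
first.booster_.save_model('lgbm_init.txt')

# Continue training from the saved booster on the new data
second = LGBMClassifier()
second.fit(X_train_2, y_train_2, init_model='lgbm_init.txt')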

As @Ferdy explained in his post, there is no simple way to perform this operation and it is understandable.

Scikit-learn's DecisionTreeClassifier accepts only numerical features and cannot handle NaN values, whereas LGBMClassifier handles both natively.

By looking at how scikit-learn's tree makes its decisions, you can see that all it can perform are splits of the form feature <= threshold.

On the contrary LGBM can perform the following:

  • feature is na
  • feature <= threshold
  • feature in categories
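
A small sketch of that native handling, on toy data invented for illustration (the relaxed min_child_samples / min_data_in_bin settings just allow splits on such a tiny frame):

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier

# Toy frame with missing values and a pandas categorical column
X = pd.DataFrame({
    "num": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0],
    "cat": pd.Categorical(["a", "b", "a", "b", "a", "b", "a", "b"]),
})
y = [0, 0, 1, 1, 0, 1, 0, 1]

# LightGBM can route NaNs at each split and split categorical features natively
clf = LGBMClassifier(min_child_samples=1, min_data_in_bin=1)
clf.fit(X, y)
print(clf.predict(X))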

Splits in a decision tree are selected at each step as the ones that best split the set of items, i.e. they minimize node impurity (Gini) or entropy.

The risk of further training a DecisionTreeClassifier is that you cannot be sure the splits performed in the original tree are still the best ones, since LGBM has additional split capabilities that might, and should, lead to better performance.

I would recommend retraining the model with LGBMClassifier only, as its splits may well differ from those of the original scikit-learn tree.
