External Memory Usage
When you have an exceptionally large dataset that you can't fit into RAM, the external memory feature of the XGBoost library will come to your rescue. This feature trains XGBoost models without loading the entire dataset into RAM.
Using this feature requires minimal effort; you just need to add a cache prefix to the end of the filename, separated by a # sign.
train = xgb.DMatrix('data/wholesale-data.dat.train#train.cache')
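Once the cache-backed DMatrix is ready, training looks exactly the same as with an in-memory one. The following is a minimal sketch; the parameter values and the binary:logistic objective are illustrative assumptions, not taken from the original example.
import xgboost as xgb

# 'train' is the cache-backed DMatrix created above; XGBoost pages data
# from disk, using train.cache as its scratch file
params = {'objective': 'binary:logistic', 'max_depth': 3}  # illustrative values
model = xgb.train(params, train, num_boost_round=50)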
This feature supports only libsvm files. So, we will now convert a dataset loaded in pandas into a libsvm file to be used with the external memory feature.
Note
You might have to do this in batches, depending on how big your dataset is; see the sketch at the end of this section.
from sklearn.datasets import dump_svmlight_file
dump_svmlight_file(X_train, Y_train, 'data/wholesale-data.dat.train', zero_based=True, multilabel=False)
Here, X_train and Y_train are the predictor and target variables, respectively. The libsvm file will get saved as data/wholesale-data.dat.train, ready to be loaded with the cache prefix shown earlier.
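As the earlier note mentions, a dataset too large to hold in pandas all at once can be converted chunk by chunk. The following is a minimal sketch, assuming a hypothetical CSV source (data/wholesale-data.csv) with the target in a column named target; the chunk size and column names are illustrative.
import pandas as pd
from sklearn.datasets import dump_svmlight_file

# Write every chunk to the same libsvm file through a shared binary handle,
# so each dump_svmlight_file call appends to what came before
with open('data/wholesale-data.dat.train', 'wb') as f:
    for chunk in pd.read_csv('data/wholesale-data.csv', chunksize=100000):
        X = chunk.drop(columns=['target'])  # hypothetical target column
        y = chunk['target']
        dump_svmlight_file(X, y, f, zero_based=True, multilabel=False)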