
How to tune hyperparameters on XGBoost

By Juan Navas and Richard Liaw   

💡 This blog post is part 2 in our series on hyperparameter tuning. If you're just getting started, check out part 1, What is hyperparameter tuning?. In part 3, How to distribute hyperparameter tuning using Ray Tune, we'll dive into a hands-on example of how to speed up the tuning task.

In the first article of this series, we learned what hyperparameter tuning is, its importance, and our various options. In this hands-on article, we’ll explore a practical case to explain how to tune hyperparameters on XGBoost. You just need to know some Python to follow along, and we’ll show you how to easily deploy machine learning models and then optimize their performance.

We’ll use the Modified National Institute of Standards and Technology (MNIST) database, which includes 60,000 training samples and 10,000 test samples. Each sample is an image of a handwritten digit, normalized to fit a 28 by 28-pixel box and anti-aliased, which introduces grayscale levels.

The figures below are a couple of samples from the MNIST database:

[Figure: sample digits from the MNIST database]

To build our digit identification model, we’ll use the popular library, XGBoost. Then to tune it, we will use the scikit-learn library, which provides a relatively easy and uniform hyperparameter tuning method. Let’s get started.

Pre-processing the original MNIST files

First, we download the four files in the MNIST data set: train-images-idx3-ubyte and train-labels-idx1-ubyte for the training, and t10k-images-idx3-ubyte and t10k-labels-idx1-ubyte for the test data.

Then, we convert the ubyte files to comma-separated values (CSV) files to input them into the machine learning algorithm.

The function below performs the conversion:

def convert(imgf, labelf, outf, n):
    """Convert n MNIST samples from ubyte files to one CSV file."""
    with open(imgf, "rb") as images, open(labelf, "rb") as labels, \
            open(outf, "w") as out:
        # skip the file headers: 16 bytes for images, 8 bytes for labels
        images.read(16)
        labels.read(8)

        for _ in range(n):
            # each row is the label followed by the 28*28 pixel values
            row = [ord(labels.read(1))]
            for _ in range(28 * 28):
                row.append(ord(images.read(1)))
            # write the row once per image, after all its pixels are read
            out.write(",".join(str(pix) for pix in row) + "\n")

convert("train-images-idx3-ubyte",
        "train-labels-idx1-ubyte",
        "mnist_train.csv", 60000)
convert("t10k-images-idx3-ubyte",
        "t10k-labels-idx1-ubyte",
        "mnist_test.csv", 10000)
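
The two `read` calls at the top of `convert` skip the file headers. As a minimal sketch, assuming the standard IDX layout (big-endian 32-bit integers holding a magic number, the item count and, for image files, the row and column counts), the headers can be parsed instead of skipped, which makes it possible to check a file before converting it. The helper names `read_image_header` and `read_label_header` are hypothetical, and the byte streams below are synthetic stand-ins for the real files.

```python
import io
import struct

IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned bytes, 1 dimension


def read_image_header(stream):
    """Read the 16-byte header of an IDX image file."""
    magic, count, rows, cols = struct.unpack(">IIII", stream.read(16))
    if magic != IMAGE_MAGIC:
        raise ValueError(f"unexpected image magic number: {magic}")
    return count, rows, cols


def read_label_header(stream):
    """Read the 8-byte header of an IDX label file."""
    magic, count = struct.unpack(">II", stream.read(8))
    if magic != LABEL_MAGIC:
        raise ValueError(f"unexpected label magic number: {magic}")
    return count


# Synthetic in-memory headers shaped like the MNIST training files.
fake_images = io.BytesIO(struct.pack(">IIII", IMAGE_MAGIC, 60000, 28, 28))
fake_labels = io.BytesIO(struct.pack(">II", LABEL_MAGIC, 60000))

image_header = read_image_header(fake_images)
label_count = read_label_header(fake_labels)
print(image_header, label_count)
```

With real files, the same helpers would be called on the objects returned by `open(..., "rb")`, and the counts could replace the hard-coded `n` argument.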

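Now we need to name the label column for the machine learning algorithm. The converted CSV files have no header row, so pandas uses the first row of each file as the column names. The label is the first value in every row, which means the label column ends up named '5' in the training set and '7' in the test set (the labels of the first sample in each file). We rename those columns to 'label' and generate new CSV files.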

import pandas as pd

# read the converted files
df_orig_train = pd.read_csv('mnist_train.csv')
df_orig_test = pd.read_csv('mnist_test.csv')

# rename columns
df_orig_train.rename(columns={'5':'label'}, inplace=True)
df_orig_test.rename(columns={'7':'label'}, inplace=True)

# write final version of the csv files
df_orig_train.to_csv('mnist_train_final.csv', index=False)
df_orig_test.to_csv('mnist_test_final.csv', index=False)
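
Reading a headerless CSV this way uses up the first sample of each file as the header row. A hedged alternative, not the approach this article uses, is to pass `header=None` and supply the column names directly. The sketch below runs on a synthetic two-row CSV, and the `pixel0` to `pixel783` column names are an assumption made for illustration.

```python
import io

import pandas as pd

N_PIXELS = 28 * 28

# Hypothetical column names: the label first, then one column per pixel.
column_names = ["label"] + [f"pixel{i}" for i in range(N_PIXELS)]

# Two synthetic rows standing in for mnist_train.csv (label, then 784 pixels).
rows = [[5] + [0] * N_PIXELS, [0] + [255] * N_PIXELS]
csv_text = "\n".join(",".join(str(v) for v in row) for row in rows) + "\n"

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=column_names)

print(df.shape)
print(df["label"].tolist())
```

Both rows survive as data, and the label column already has its final name, so no rename step is needed.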

We have now finished pre-processing the data set and can start building the model.

Building model parameters without tuning hyperparameters

Let’s start by building the model without any hyperparameter tuning. Instead, we will use typically recommended values for our hyperparameters.

To run this code yourself, you’ll need to install NumPy, scikit-learn (imported as sklearn), pandas, and XGBoost using pip, Conda, or another Python package management tool. Note that this isn't intended to be a comprehensive XGBoost tutorial. If you’re new to XGBoost, we recommend starting with the guides and tutorials in the XGBoost documentation.

First, we’ll define a model_mnist function that takes a hyperparameter list as input. This makes it easy to quickly recreate the model with different hyperparameters.

Also, for practical reasons, we will work with the first 1,000 samples from the training dataset. The MNIST training dataset has 60,000 samples with high dimensionality (784 features), which means about 47 million pixel values (60,000 × 784). Building the model once on the complete dataset takes time (in the range of 10-15 minutes on an 8-core CPU), and hyperparameter tuning builds the model many times, so tuning on a single machine would take many hours, or even days. Using a smaller dataset while we’re learning allows us to experiment with different tuning techniques more quickly.

import pandas as pd   # data processing, CSV file I/O (e.g. pd.read_csv)
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn import metrics

def model_mnist(params):

    # Input data files are available in the "./data/" directory.
    train_df = pd.read_csv("./data/mnist_train_final.csv")
    test_df = pd.read_csv("./data/mnist_test_final.csv")

    # limit the dataset size to 1000 samples
    dataset_size = 1000
    train_df = train_df.iloc[0:dataset_size, :]
    test_df = test_df.iloc[0:dataset_size, :]

    y = train_df.label.values
    X = train_df.drop('label', axis=1).values

    # build train and validation datasets
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15)
    print("Shapes - X_train: ", X_train.shape,
          ", X_val: ", X_val.shape, ", y_train: ",
          y_train.shape, ", y_val: ", y_val.shape)

    train_data = xgb.DMatrix(X_train, label=y_train)
    val_data = xgb.DMatrix(X_val, label=y_val)

    n_rounds = 100

    # build the model; the per-round validation metric is stored in results
    results = {}

    model = xgb.train(params,
                      train_data,
                      num_boost_round=n_rounds,
                      evals=[(val_data, 'val')],
                      evals_result=results)

    # let's check how good the model is on the test data
    x_test = test_df.drop('label', axis=1).values
    test_labels = test_df.label.values
    test_data = xgb.DMatrix(x_test)

    # with multi:softmax, predict() returns the predicted class per sample
    predictions = model.predict(test_data)

    accuracy = metrics.accuracy_score(test_labels, predictions)

    return accuracy

if __name__ == '__main__':

    # define a multi-class objective, number of classes = 10 (digits),
    # and the metric as merror (multi-class classification error rate)
    default_params = [
        ("objective", "multi:softmax"),
        ("num_class", 10), ("eval_metric", "merror")
    ]

    accuracy_result = model_mnist(default_params)

    print("accuracy: ", accuracy_result)

Save the above Python code in a .py file (for instance, mnist_model.py) and run it from the command line:

$ python mnist_model.py

We should see the following output:

Shapes - X_train:  (850, 784) , X_val:  (150, 784) , y_train:  (850,) , y_val:  (150,)
[0]	val-merror:0.326667
[1]	val-merror:0.233333
 .
 .
[98]	val-merror:0.166667
[99]	val-merror:0.166667
accuracy:  0.826

The output shows we obtained an accuracy of 82.6 percent (that is, our model correctly classified 82.6 percent of test samples).
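
The `val-merror` values in the log and the final accuracy measure the same thing from opposite sides: merror is the fraction of misclassified samples, so accuracy is one minus the error rate. In the log, merror is computed on the 150 validation samples, while the final accuracy is computed on the 1,000 test samples, which is why the two do not add up to exactly one. The sketch below shows the relationship on made-up labels (illustrative values, not MNIST results).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def merror(y_true, y_pred):
    """Multi-class classification error rate: wrong cases / all cases."""
    return 1.0 - accuracy(y_true, y_pred)


# Made-up labels for illustration only.
y_true = [7, 2, 1, 0, 4, 1, 4, 9, 5, 9]
y_pred = [7, 2, 1, 0, 4, 1, 4, 9, 6, 8]

acc = accuracy(y_true, y_pred)
err = merror(y_true, y_pred)
print(acc, err)
```

Applied to the numbers above, a final val-merror of 0.166667 on 150 validation samples corresponds to 25 misclassified digits, and an accuracy of 0.826 on 1,000 test samples corresponds to 826 correct predictions.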

Tuning hyperparameters

Now we’ll tune our hyperparameters using the random search method. For that, we’ll use the sklearn library, which provides a function specifically for this purpose: RandomizedSearchCV.
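
Conceptually, random search draws a fixed number of hyperparameter combinations at random from the search space, scores each one, and keeps the best. The pure-Python sketch below illustrates that loop only; it is not how RandomizedSearchCV is implemented. `toy_score` is a hypothetical stand-in for cross-validated accuracy, and the sketch samples with replacement for simplicity.

```python
import random


def toy_score(params):
    """Hypothetical scoring function standing in for cross-validated accuracy."""
    # Peaks at max_depth=6 and learning_rate=0.1; purely illustrative.
    depth_penalty = abs(params["max_depth"] - 6) / 20
    rate_penalty = abs(params["learning_rate"] - 0.1)
    return 1.0 - depth_penalty - rate_penalty


def random_search(space, score_fn, n_iter, seed=0):
    """Sample n_iter random combinations and return the best one found."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Pick one value per hyperparameter, independently and uniformly.
        candidate = {name: rng.choice(values) for name, values in space.items()}
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best_params, best_score = candidate, candidate_score
    return best_params, best_score


search_space = {
    "max_depth": [3, 6, 10, 15],
    "learning_rate": [0.01, 0.1, 0.2, 0.3, 0.4],
}

best_params, best_score = random_search(search_space, toy_score, n_iter=25)
print(best_params, best_score)
```

In the real tuning run, the scoring step is the expensive part: each candidate means training and cross-validating an XGBoost model.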

First, we save the Python code below in a .py file (for instance, random_search.py).

import numpy as np
import pandas as pd
import xgboost as xgb

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

def random_search_tuning():
    # Input data files are available in the "./data/" directory.
    train_df = pd.read_csv("./data/mnist_train_final.csv")
    test_df = pd.read_csv("./data/mnist_test_final.csv")
    print(train_df.shape, test_df.shape)

    y = train_df.label.values
    x = train_df.drop('label', axis=1).values

    # define the train set and validation set
    x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.05)
    print("Shapes - X_train: ", x_train.shape,
          ", X_val: ", x_val.shape, ", y_train: ",
          y_train.shape, ", y_val: ", y_val.shape)

    # search space: each hyperparameter maps to its candidate values
    params = {'max_depth': [3, 6, 10, 15],
              'learning_rate': [0.01, 0.1, 0.2, 0.3, 0.4],
              'subsample': np.arange(0.5, 1.0, 0.1),
              'colsample_bytree': np.arange(0.5, 1.0, 0.1),
              'colsample_bylevel': np.arange(0.5, 1.0, 0.1),
              'n_estimators': [100, 250, 500, 750],
              'num_class': [10]
              }

    xgbclf = xgb.XGBClassifier(objective="multi:softmax", tree_method='hist')

    # sample 25 combinations and score each one with cross-validation
    clf = RandomizedSearchCV(estimator=xgbclf,
                             param_distributions=params,
                             scoring='accuracy',
                             n_iter=25,
                             n_jobs=4,
                             verbose=1)

    clf.fit(x_train, y_train)

    # accuracy of the best model on the held-out validation split
    print("Accuracy: ", clf.score(x_val, y_val))

    best_combination = clf.best_params_

    return best_combination

if __name__ == '__main__':

    best_params = random_search_tuning()

    print("Best hyperparameter combination: ", best_params)

Our expected output looks something like this:

Fitting 5 folds for each of 25 candidates, totaling 125 fits
Accuracy:  0.858
Best hyperparameter combination:  {'subsample': 0.9, 
                                   'num_class': 10, 
                                   'n_estimators': 500, 
                                   'max_depth': 10, 
                                   'learning_rate': 0.1, 
                                   'colsample_bytree': 0.6, 
                                   'colsample_bylevel': 0.7}

The accuracy has improved to 85.8 percent. We’ve now found the best hyperparameter combination among the 25 candidates that the random search sampled.
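
It helps to put those 25 candidates in context. Counting the values listed in `params` in random_search.py (each `np.arange(0.5, 1.0, 0.1)` yields five values), the full grid holds 10,000 combinations, so the search evaluated only a small fraction of it. The 125 fits in the log are the 25 candidates multiplied by the 5 cross-validation folds reported there. A quick sketch of that arithmetic:

```python
import math

# Number of candidate values per hyperparameter in the search space above.
# Each np.arange(0.5, 1.0, 0.1) contributes five values (0.5 through 0.9).
values_per_param = {
    "max_depth": 4,
    "learning_rate": 5,
    "subsample": 5,
    "colsample_bytree": 5,
    "colsample_bylevel": 5,
    "n_estimators": 4,
    "num_class": 1,
}

n_iter = 25   # candidates sampled by the random search
n_folds = 5   # cross-validation folds, as reported in the log

total_combinations = math.prod(values_per_param.values())
total_fits = n_iter * n_folds
fraction_sampled = n_iter / total_combinations

print(total_combinations, total_fits, fraction_sampled)
```

An exhaustive grid search over the same space would need 50,000 fits with 5-fold cross-validation, which is why sampling a few candidates is the practical choice on a single machine.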

Next steps

After running our untuned and tuned models, we discovered that our model with tuned hyperparameters has better accuracy. The tuned model can make better predictions on new data, so it can more accurately identify images of digits beyond those in the training set.

In the final article of our three-part series, we’ll explore hands-on how distributed hyperparameter tuning improves our results even more. We’ll use Ray to deploy our Python machine learning model to the cloud for distributed tuning using Ray Tune.
