Author : Kavitha S
XGBoost stands for “Extreme Gradient Boosting”. XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements Machine Learning algorithms under the Gradient Boosting framework. XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree based algorithms are considered best-in-class right now. Tianqi Chen, one of the co-creators of XGBoost, announced (in 2016) that the innovative system features and algorithmic optimizations in XGBoost have rendered it 10 times faster than most sought after machine learning solutions. A truly amazing technique! In this, we will first look at installation procedure of XGBoost and then knowing about the evolution and performance of XGBoost. It’s good to be able to implement it in Python or R, but understanding the nitty-gritties of the algorithm will help you become a better data scientist.
XGBoost is a software library that you can download and install on your machine, then access from a variety of interfaces.
GOALS OF XGBOOST:
The two reasons to use XGBoost are :
Execution Speed : XGBoost was almost always faster than the other benchmarked implementations from R, Python Spark and H2O and it is really faster when compared to the other algorithms
Model Performance: XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems.
EVOLUTION OF XGBOOST:
Decision trees, in their simplest form, are easy-to-visualize and fairly interpretable algorithms but building intuition for the next-generation of tree-based algorithms can be a bit tricky. See below for a simple analogy to better understand the evolution of tree-based algorithms. The chart below for the evolution of tree-based algorithms over the years.
XGBoost and Gradient Boosting Machines (GBMs) are both ensemble tree methods that apply the principle of boosting weak learners (CARTs generally) using the gradient descent architecture. However, XGBoost improves upon the base GBM framework through systems optimization and algorithmic enhancements.
Parallelization: XGBoost approaches the process of sequential tree building using parallelized implementation. This is possible due to the interchangeable nature of loops used for building base learners; the outer loop that enumerates the leaf nodes of a tree, and the second inner loop that calculates the features. This nesting of loops limits parallelization because without completing the inner loop (more computationally demanding of the two), the outer loop cannot be started. Therefore, to improve run time, the order of loops is interchanged using initialization through a global scan of all instances and sorting using parallel threads. This switch improves algorithmic performance by offsetting any parallelization overheads in computation.
Tree Pruning: The stopping criterion for tree splitting within GBM framework is greedy in nature and depends on the negative loss criterion at the point of split. XGBoost uses ‘max_depth’ parameter as specified instead of criterion first, and starts pruning trees backward. This ‘depth-first approach improves computational performance significantly.
Hardware Optimization: This algorithm has been designed to make efficient use of hardware resources. This is accomplished by cache awareness by allocating internal buffers in each thread to store gradient statistics. Further enhancements such as ‘out-of-core’ computing optimize available disk space while handling big data-frames that do not fit into memory.
Regularization: It penalizes more complex models through both LASSO (L1) and Ridge (L2) regularization to prevent overfitting.
Sparsity Awareness: XGBoost naturally admits sparse features for inputs by automatically ‘learning’ best missing value depending on training loss and handles different types of sparsity patterns in the data more efficiently.
Weighted Quantile Sketch: XGBoost employs the distributed weighted Quantile Sketch algorithm to effectively find the optimal split points among weighted datasets.
Cross-validation: The algorithm comes with built-in cross-validation method at each iteration, taking away the need to explicitly program this search and to specify the exact number of boosting iterations required in a single run.
This algorithm goes by lots of different names such as gradient boosting, multiple additive regression trees, stochastic gradient boosting or gradient boosting machines.
Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. A popular example is the AdaBoost algorithm that weights data points that are hard to predict.
Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
Pandas and sklearn libraries are imported. Iris, an inbuilt flower data set is imported. A data frame containing the data and features of the iris data set is defined as df using pandas. df.head() gives the first 10 records. df.info() gives the information of the data frame.
The number of sample, the number of features and the target names are displayed. From sklearn.model_selection, train _ test_split is imported. x_ test, y_ test, x_train and y_train are set using train_test_split with test size as 0.2. xgboost is imported. Train and test set is defined using DMatrix(an internaldata structure of xgboost). The parameters are defined. Epoch value is set as 10.
Model is built using xgboost and the prediction is made and displayed. The accuracy score is calculated using sklearn.metrics and it is displayed. plot_importance from xgboost is imported and the importance of each feature is plotted using matplotlib.
This approach supports both regression and classification predictive modelling problems.
We used Scikit-learn’s ‘Make_Classification’ data package to create a random sample of 1 million data points with 20 features (2 informative and 2 redundant). While tested with several algorithms such as Logistic Regression, Random Forest, standard Gradient Boosting, and XGBoost. As demonstrated in the chart below, XGBoost model has the best combination of prediction performance and processing time compared to other algorithms. Other rigorous benchmarking studies have produced similar results. No wonder XGBoost is widely used in recent Data Science competitions.
XGBoost is a faster algorithm when compared to other algorithms because of its parallel and distributed computing. XGBoost is developed with both deep considerations in terms of systems optimization and principles in machine learning. The goal of this library is to push the extreme of the computation limits of machines to provide a scalable, portable and accurate library.
In this blog, I have introduced you to XGBoost Algorithm, a widely used algorithm that saves resources and time. I have discussed the working of the algorithm and also the different parameters that play an important role in the model’s performance.
GITHUB LINK -