# DATA PREPROCESSING FOR MACHINE LEARNING

Updated: Aug 6, 2021

A decision tree is a machine learning tool for making clear decisions by mapping out the possible outcomes. It has a tree-like control structure and is a supervised learning algorithm used mainly for classification and regression problems.

Decision tree models are simple to comprehend and consist of a root node, splits, decision nodes, leaves, and subtrees. Decision tree algorithms are commonly referred to as CART, which stands for classification and regression trees.

**Root Node**:- The starting, or first, node of the tree. It represents the full population, i.e., it initially holds 100% of the observations, which we then subdivide further.

**Splitting**:- Dividing a node into further sub-parts or sub-regions.

**Decision Node**:- A sub-node that divides into further sub-nodes.

**Leaf**:- A node that does not split any further, also called a terminal node.

**Subtree**:- A subsection of the whole tree.

**Types:-**

Primarily there are 2 types of decision trees.

**Regression Tree**:- Used when the response is a continuous quantitative variable.

**Classification Tree**:- Used when the response is a discrete categorical variable.

**Regression Tree** :-

These trees are built with a top-down, greedy approach. The most important attribute is found at the highest level as the root node, while the least important attributes are found at the lower levels, near the leaf nodes.

The approach is greedy because, at each phase of the tree-building process, the best split available at that phase is chosen, rather than looking ahead and choosing a split that would result in a better tree later on. This top-down, greedy approach is termed recursive binary splitting.

**Steps :-**

Take into account all predictors as well as all possible cut point values.

For each candidate split, compute the RSS (residual sum of squares).

Choose the split with the minimum RSS.

Carry on like this until a stopping criterion is reached.
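The search for a single split can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the linked repository: the function names `rss` and `best_split`, the use of midpoints between distinct values as cut points, and the toy data are all assumptions made for the example.

```python
def rss(values):
    """Residual sum of squares of values around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (feature_index, cut_point, total_rss) for the lowest-RSS split.

    X is a list of rows (lists of numbers) and y a list of numeric responses.
    Returns None when no predictor has two distinct values to split on.
    """
    best = None
    for j in range(len(X[0])):
        # Candidate cut points: midpoints between consecutive distinct values.
        levels = sorted(set(row[j] for row in X))
        for lo, hi in zip(levels, levels[1:]):
            cut = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < cut]
            right = [yi for row, yi in zip(X, y) if row[j] >= cut]
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, cut, total)
    return best


# Toy data, made up for illustration: the first predictor separates
# low responses from high ones, the second is noise.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [10.0, 4.0], [11.0, 7.0], [12.0, 6.0]]
y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8]

feature, cut, total = best_split(X, y)
print(feature, cut, round(total, 4))
```

On this toy data the sketch picks the first predictor with a cut point of 6.5. Recursive binary splitting repeats the same search inside each of the two resulting regions.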

**Stopping criteria :-**

When we run the code, we need to specify the stopping criteria for the tree. By default, programs set some criteria, but we need to understand them so that we can control the size of the tree. One of the major problems with decision trees is overfitting, and this problem can be avoided if we do not let the tree grow beyond a certain point.

There are several ways to control tree growth:-

A minimum number of observations required at an internal node before it can be split.

A minimum number of observations required at a terminal (leaf) node.

A maximum depth, i.e., the highest number of layers the tree may reach.
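The three controls listed above can be sketched with a small recursive tree builder on a single predictor. The parameter names `max_depth`, `min_samples_split`, and `min_samples_leaf` are borrowed from common library conventions and are assumptions here, as are the helper names and the toy data; this is an illustration, not the code from the linked repository.

```python
def rss(values):
    """Residual sum of squares of values around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def grow(xs, ys, depth=0, max_depth=3, min_samples_split=4, min_samples_leaf=2):
    """Grow a regression tree on one predictor, returned as nested dicts."""
    leaf = {"prediction": sum(ys) / len(ys), "n": len(ys)}
    # Stop when the depth limit is hit or the node is too small to split.
    if depth >= max_depth or len(ys) < min_samples_split:
        return leaf
    pairs = list(zip(xs, ys))
    levels = sorted(set(xs))
    best = None
    for lo, hi in zip(levels, levels[1:]):
        cut = (lo + hi) / 2
        left = [(x, y) for x, y in pairs if x < cut]
        right = [(x, y) for x, y in pairs if x >= cut]
        # Skip splits that would create a leaf with too few observations.
        if min(len(left), len(right)) < min_samples_leaf:
            continue
        total = rss([y for _, y in left]) + rss([y for _, y in right])
        if best is None or total < best[0]:
            best = (total, cut, left, right)
    if best is None:
        return leaf
    _, cut, left, right = best
    limits = (max_depth, min_samples_split, min_samples_leaf)
    return {
        "cut": cut,
        "left": grow([x for x, _ in left], [y for _, y in left], depth + 1, *limits),
        "right": grow([x for x, _ in right], [y for _, y in right], depth + 1, *limits),
    }


def depth_of(tree):
    """Number of split layers below this node (0 for a leaf)."""
    if "prediction" in tree:
        return 0
    return 1 + max(depth_of(tree["left"]), depth_of(tree["right"]))


def count_leaves(tree):
    """Number of terminal nodes in the tree."""
    if "prediction" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])


# Toy data, made up for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 2.0, 2.0, 8.0, 8.0, 9.0, 9.0]

stump = grow(xs, ys, max_depth=1)
deeper = grow(xs, ys, max_depth=3)
print(depth_of(stump), count_leaves(stump))
print(depth_of(deeper), count_leaves(deeper))
```

With `max_depth=1` the tree is a single split. With `max_depth=3` growth still stops after two layers on this data, because the resulting nodes hold fewer observations than `min_samples_split` allows.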

**Pruning :-**

There are two issues with decision trees that are too large and have many nodes. First, they are tough to interpret, and second, they overfit the training data, resulting in poor performance on new data. As a result, controlling tree growth is a primary concern. Eliminating splitting nodes that are not needed helps lessen the likelihood of an over-fitted tree. An over-fitted tree model could lead to misclassification in practical applications, hence this approach is very beneficial.

When we predefine a limit, tree growth stops at a certain node, so we never find out whether a worthwhile split was possible beyond it. As a result, we require a method that ensures we do not miss any such splits. Tree pruning is the term for this approach.

In this method, we grow a very large tree and then prune it, i.e., chop off some of the non-beneficial sections of the tree, to get an ideal subtree. Since we are looking for the subtree with the lowest test error rate, we could use cross-validation to estimate the test error rate of every subtree. However, this is computationally costly and could take quite a long time to run. As a result, we apply a technique known as cost complexity pruning, in which we add a cost for each of the tree's terminal nodes to the RSS.

Rather than minimizing RSS alone, we minimize RSS plus alpha times the number of terminal nodes. Here alpha is called the pruning parameter or complexity parameter. When alpha is 0, the penalty vanishes, we minimize only the RSS term, and we obtain the usual full-sized tree. As alpha increases, the penalty for having more terminal nodes grows; hence, alpha controls the size of the tree.
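In standard notation, the quantity being minimized can be written as follows. The symbols are assumptions introduced here: T is a subtree of the large tree T_0, |T| is its number of terminal nodes, R_m is the region of the m-th terminal node, and the hatted y is the mean response in that region.

```latex
\min_{T \subseteq T_0} \; \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha \, |T|
```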

FIG:1
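The effect of alpha can be seen in a small sketch that scores a handful of candidate subtrees by RSS plus alpha times the number of terminal nodes. The subtree sizes and RSS values below are hypothetical numbers chosen for illustration, and `best_subtree` is a name made up for this example.

```python
# Hypothetical (number of terminal nodes, training RSS) pairs for a sequence
# of nested subtrees; larger trees always fit the training data better.
subtrees = [(1, 100.0), (2, 60.0), (4, 30.0), (8, 22.0), (16, 20.0)]


def best_subtree(candidates, alpha):
    """Return the (n_leaves, rss) pair that minimizes rss + alpha * n_leaves."""
    return min(candidates, key=lambda t: t[1] + alpha * t[0])


for alpha in (0.0, 1.0, 5.0, 50.0):
    n_leaves, tree_rss = best_subtree(subtrees, alpha)
    print(f"alpha={alpha}: {n_leaves} terminal nodes, RSS={tree_rss}")
```

With alpha at 0 the largest tree wins because only the RSS counts. As alpha grows, progressively smaller subtrees are selected, down to the single-node tree.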

**Classification Tree:-**

In classification we apply:-

Classification Error rate

Gini Index

Cross Entropy

To begin, recall that in a regression tree we take the mean of the response variable at each leaf node to obtain the predicted value. For classification trees we use the mode instead, which means we assign to each region the class that occurs most often in it. The growing process is comparable to that of regression trees: we apply recursive binary splitting in classification trees, just as we do in regression trees. However, in a regression tree we select the split that gives the lowest RSS, whereas in a classification tree we clearly cannot use RSS, so other criteria are needed when splitting.
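The difference between the two kinds of leaf prediction can be sketched as follows. The function names and sample values are made up for illustration.

```python
from collections import Counter
from statistics import mean


def regression_leaf_prediction(responses):
    """Regression tree: predict the mean response of the leaf."""
    return mean(responses)


def classification_leaf_prediction(labels):
    """Classification tree: predict the most common class (mode) of the leaf.

    On a tie, Counter returns the class that was encountered first.
    """
    return Counter(labels).most_common(1)[0][0]


print(regression_leaf_prediction([2.0, 4.0, 6.0]))
print(classification_leaf_prediction(["pass", "fail", "pass"]))
```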

**Classification Error rate :-** One natural criterion is the classification error rate. First, we take into account all variables and all possible split values. Then, for each resulting region, we assign a class to that region. Finally, we simply calculate the fraction of training observations within the region that do not belong to the most common class.

**Gini Index :-** However, it turns out that the classification error rate is not sensitive enough for tree growth, so we use two alternative metrics instead. The first is the Gini index, which is represented by

FIG:2
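Assuming the standard definition, the Gini index of a node m can be written as below. The notation is introduced here and not taken from the original figure: K is the number of classes and the hatted p is the proportion of training observations at node m that belong to class k. For two classes with proportion P, it reduces to the form on the right.

```latex
G_m = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right)
\qquad \text{(two classes: } G_m = 2P(1 - P) \text{)}
```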

The idea behind the Gini index is easiest to see when we are categorizing two classes, pass or fail, so that P represents the probability of passing in that region, i.e., at a certain node. When P is really low, say 0, the node is exceedingly pure: all of the observations are from a single class, and the Gini index value is zero. The other extreme is when the probability is very high, say 1, in which case all of the observations at that node belong to the pass group. Likewise, there is great purity at this node, and the Gini index value is zero, since one minus one equals zero in the equation. As can be seen, a low Gini value indicates node purity, i.e., that a specific node contains mostly observations of a single class.

**Cross Entropy :-** Another option is cross entropy, which is defined as

FIG:3
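Assuming the standard definition, and using the same notation assumptions as for the Gini index (K classes, with a hatted p for the proportion of class k at node m), the entropy of a node can be written as:

```latex
D_m = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
```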

It is numerically similar to the Gini index and likewise takes a small value when a node is pure. When growing a classification tree, the Gini index or cross entropy is preferred over the classification error rate, since both are more sensitive to node purity.
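The three measures can be compared on a two-class node with a short sketch. The function names are made up for this example, the base-2 logarithm is an assumption (the natural logarithm is also common), and the class counts in the two candidate splits are illustrative numbers.

```python
from math import log2


def classification_error(p):
    """Fraction of observations not in the most common class."""
    return 1 - max(p, 1 - p)


def gini(p):
    """Two-class Gini index, where p is the proportion of one class."""
    return 2 * p * (1 - p)


def cross_entropy(p):
    """Two-class entropy in bits; terms with zero proportion contribute 0."""
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)


def weighted(measure, nodes):
    """Size-weighted average impurity over child nodes given as (n_pass, n_fail)."""
    total = sum(a + b for a, b in nodes)
    return sum((a + b) / total * measure(a / (a + b)) for a, b in nodes)


# All three measures are 0 for a pure node and largest at p = 0.5.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(classification_error(p), 3), round(gini(p), 3),
          round(cross_entropy(p), 3))

# Two candidate splits of 400 pass / 400 fail observations.
split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]  # produces one perfectly pure node
for name, split in (("A", split_a), ("B", split_b)):
    print(name, round(weighted(classification_error, split), 4),
          round(weighted(gini, split), 4),
          round(weighted(cross_entropy, split), 4))
```

Both candidate splits have the same weighted classification error rate, yet the Gini index and cross entropy are lower for split B, which produces a pure node. This is the sense in which they are more sensitive to node purity.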

**Advantages:-**

Trees are simple to explain to others and are, in general, even easier to comprehend than linear regression.

Decision trees mirror human decision making more closely than other regression and classification algorithms do.

Trees can be displayed graphically and are simple to understand, even for non-experts.

Without the use of dummy variables, trees can readily accommodate qualitative predictors.

The effectiveness is unaffected by non-linear relationships between variables.

It works well with both numerical and categorical data sets.

**Disadvantages:-**

The predictive accuracy of trees is generally not as high as that of some other regression and classification methods.

It may create an overly complicated tree, even at a shallow depth.

Because slight changes in the input data can result in an entirely different tree being created, trees are inherently unstable.

Because it is a greedy method, it may not discover the globally optimal tree for a data set.

**Github link :-**

https://github.com/arunish07/DecisionTree.git