Author: Simreeta Saha
Spam used to be regarded as mere garbage: unwanted electronic mail or messages. Nowadays, however, spam often carries virus and spyware attachments, so detecting it has become necessary. In this paper we deal with detecting spam based on a given spam data set.
The dataset looks like
It has 5572 rows of data and two columns: one holds the SMS text and the other states whether the message is spam (SPAM) or not (HAM). We will use Random Forest to classify the messages. The same task can also be done using only the naïve Bayes algorithm. First, we convert the labels in the data set to 0 or 1 so that they fit the classifier, and then we predict.
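A minimal sketch of the label conversion, assuming the labels are the strings "ham" and "spam" and that spam is mapped to 1. The sample rows and the names LABEL_TO_INT and encode_labels are illustrative and are not taken from the data set.

```python
# Hypothetical sample rows in (label, text) form; not from the real data set.
rows = [
    ("ham", "Are we still meeting at five?"),
    ("spam", "WINNER! Claim your free prize now"),
    ("ham", "Call me when you get home"),
]

# Assumed mapping: ham -> 0, spam -> 1.
LABEL_TO_INT = {"ham": 0, "spam": 1}


def encode_labels(rows):
    """Return the texts and their 0/1 labels as two parallel lists."""
    texts = [text for _, text in rows]
    labels = [LABEL_TO_INT[label.strip().lower()] for label, _ in rows]
    return texts, labels


texts, labels = encode_labels(rows)
print(labels)
```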
Finally, after prediction it will look like:
Electronic mail is an efficient and increasingly popular communication method. Concern about the proliferation of unsolicited bulk email, commonly referred to as “spam”, has been steadily increasing. It is therefore extremely helpful to have a model that can detect spam automatically. Here, we use a spam detection model based on random forest that performs parameter optimization and feature selection simultaneously.
Random forest (RF) is an ensemble learning technique that builds an ensemble of CART classification trees using the bagging mechanism. At every node, only a small random subset of the features is considered for the split. RF copes efficiently with a large number of features. RF has two parameters:
The number of features in the random subset at each node (mtry)
The number of trees in the forest (ntree)
RF is easy to use, but its parameters need careful optimization to give good results.
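The roles of the two parameters can be illustrated with a small sketch. It only shows the two sources of randomness (a bootstrap sample of rows per tree, and mtry candidate features per node) and does not grow any trees. The function names and the values of ntree and mtry are assumptions chosen for illustration.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (the bagging step)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(n_features, mtry, rng):
    """Pick mtry distinct feature indices to consider at one node."""
    return rng.sample(range(n_features), mtry)


rng = random.Random(0)      # fixed seed so the sketch is repeatable
n_rows, n_features = 10, 9  # toy sizes
ntree, mtry = 3, 3          # illustrative values, not tuned

for tree in range(ntree):
    rows = bootstrap_sample(n_rows, rng)
    candidates = random_feature_subset(n_features, mtry, rng)
    print(f"tree {tree}: rows={rows} split candidates={candidates}")
```

In a real forest a fresh feature subset is drawn at every node of every tree, not once per tree as in this shortened loop.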
Overall Flow of proposed detection model
Fig 1: Flow-chart of the proposed detection model
Detailed description of phases
1. Parameter Optimization:
Two parameters are optimised, namely mtry and ntree. To get the optimal value of mtry we use the function ‘tuneRF()’, which is provided in the randomForest package of the R project. For ntree we try a range of candidate values heuristically and carry out experiments. The lowest ntree value at which the detection rate is stable and highest is then chosen.
Fig 2: Final mtry values
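The ntree selection rule can be sketched as follows. This is a simplification that treats "stable" as "within a tolerance of the best observed rate". The function name pick_ntree, the tolerance parameter and the detection rates are all made up for illustration and are not results from the experiments.

```python
def pick_ntree(rates, tolerance=0.0):
    """Return the smallest ntree whose rate is within tolerance of the best rate."""
    best = max(rates.values())
    return min(n for n, rate in rates.items() if best - rate <= tolerance)


# Made-up detection rates per ntree value, for illustration only.
rates = {50: 0.941, 100: 0.953, 200: 0.957, 500: 0.957, 1000: 0.956}

print(pick_ntree(rates))
```

A non-zero tolerance trades a small loss in detection rate for a smaller, faster forest.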
2. Building the Initial Spam Detection Model:
We build the preliminary spam detection model using randomForest and all features of the chosen data set. At this stage, we do not use feature selection.
3. Estimation of Detection Rates:
From the previous phase we get the confusion matrix holding the true positive, true negative, false positive and false negative values. Generally, the cost of losing a legitimate message (false positives) is much greater than that of allowing a spam message (false negatives).
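A minimal sketch of how the four counts are obtained, assuming labels are encoded as 1 for spam (the positive class) and 0 for ham. The label lists are toy values and the function name confusion_counts is illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN with spam (1) as the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1   # spam correctly caught
        elif actual == 0 and predicted == 0:
            counts["TN"] += 1   # legitimate message correctly passed
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1   # legitimate message lost
        else:
            counts["FN"] += 1   # spam let through
    return counts


# Toy labels for illustration only.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))
```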
The detection rates (accuracy) are defined as equation (1):
The error rates are defined as equation (2):
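Assuming the standard definitions in terms of the confusion-matrix counts TP, TN, FP and FN (true/false positives/negatives), equations (1) and (2) can be written as:

```latex
\begin{align}
\text{Detection rate (accuracy)} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \\
\text{Error rate} &= \frac{FP + FN}{TP + TN + FP + FN} \tag{2}
\end{align}
```

Under these definitions the two rates always sum to 1. For example, with TP = 2, TN = 4, FP = 1 and FN = 1, the detection rate is 6/8 = 0.75 and the error rate is 2/8 = 0.25.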
These estimated results are then compared with threshold values.
4. Feature Selection:
Here, all features are represented in numerical form so that the data set can be fitted by the RF model and used for feature selection. The features are then ranked in descending order of importance. This reduces the chance of overfitting or underfitting.
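The ranking step can be sketched as below, assuming an importance score per feature is already available from the RF model. The feature names, the scores and the helper names are hypothetical.

```python
# Hypothetical importance scores per feature; not measured values.
importance = {"free": 0.31, "call": 0.12, "txt": 0.22, "the": 0.01, "win": 0.18}


def rank_features(importance):
    """Return feature names sorted by importance, most important first."""
    return sorted(importance, key=importance.get, reverse=True)


def drop_least_important(importance, n_drop=1):
    """Return the ranked features with the n_drop least important removed."""
    ranked = rank_features(importance)
    return ranked[:len(ranked) - n_drop]


print(rank_features(importance))
print(drop_least_important(importance))
```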
5. Rebuilding the Spam Detection Model:
There are two approaches in rebuilding the spam detection model:
I) Only One Parameter Optimization during Overall Feature Selection:
Parameter optimization is performed only once, at the initial modelling, and the optimised parameter values are then reused throughout the rebuilding process. This is less time consuming but may not yield the optimal values.
II) Parameter Optimization in Every Feature Elimination Phase:
Here, the parameters are optimised again in every feature elimination phase. This method is more time consuming but gives parameter values closer to the optimum than the first approach.
Fig 3: mtry and ntree values based on first and second approaches.
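The difference between the two approaches can be sketched as a single elimination loop with a switch. This is an illustration under stated assumptions, not the original implementation: rebuild, toy_tune and toy_importance are hypothetical names, the tuning function only mimics a sqrt-of-feature-count rule for mtry, and the importance score is a placeholder.

```python
def rebuild(features, tune, importance, retune_each_phase, min_features=1):
    """Remove the least important feature one at a time.

    tune(features) -> params and importance(features, params) -> {feature: score}
    are caller-supplied, hypothetical callables.
    """
    features = list(features)
    params = tune(features)              # initial optimization (both approaches)
    history = [(tuple(features), params)]
    while len(features) > min_features:
        scores = importance(features, params)
        features.remove(min(features, key=scores.get))
        if retune_each_phase:            # approach II: optimise again each phase
            params = tune(features)
        history.append((tuple(features), params))
    return history


calls = {"tune": 0}                      # counts how often tuning is run


def toy_tune(features):
    calls["tune"] += 1
    # Stand-in for real tuning: mtry near sqrt(number of features).
    return {"mtry": max(1, round(len(features) ** 0.5)), "ntree": 100}


def toy_importance(features, params):
    return {f: len(f) for f in features}  # placeholder score, not a real measure


words = ["free", "win", "call", "txt"]
rebuild(words, toy_tune, toy_importance, retune_each_phase=False)
print("approach I tuning runs:", calls["tune"])
calls["tune"] = 0
rebuild(words, toy_tune, toy_importance, retune_each_phase=True)
print("approach II tuning runs:", calls["tune"])
```

The tuning counter makes the cost difference visible: the first approach tunes once, the second tunes once per elimination phase in addition to the initial run.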
6. Final Evaluation of the Ex-Model:
Finally, the effectiveness of the model built with simultaneous parameter optimisation and feature selection is checked by five-fold cross-validation.
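A minimal sketch of five-fold cross-validation over row indices. The evaluate callable, which would train on one index set and return the detection rate on the other, is a hypothetical placeholder, as are the function names and the seed.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_rows, evaluate, k=5):
    """Average the rate returned by evaluate(train_idx, test_idx) over k folds."""
    rates = [evaluate(train, test) for train, test in k_fold_indices(n_rows, k)]
    return sum(rates) / len(rates)


# Fold sizes for a data set of 5572 rows.
for train_idx, test_idx in k_fold_indices(5572):
    print(len(train_idx), len(test_idx))
```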
978-0-7695-3967-6/10 $26.00 © 2010 IEEE DOI 10.1109/CISIS.2010.116
L. Breiman, “Random Forests,” Machine Learning, Vol. 45, pp. 5–32, October 2001.
Y. Xie, “An Introduction to Support Vector Machine and Implementation in R,” May 2007.
R Project for Statistical Computing, http://www.r-project.org/
L. Faith Cranor and B. A. LaMacchia, “SPAM!,” Communications of the ACM, Vol. 41(8), Aug 1998, pp. 74–83.