Author : J Kiron
NLP stands for Natural Language Processing. It is a branch of Artificial Intelligence that enables machines to automatically process natural language such as speech and text. Natural language is one of the most important ways in which humans interact, so being able to understand and derive meaning from it automatically is essential when working through large volumes of data. NLP is on the rise because it is a step towards understanding human wants and needs. The computational power available to support this process has also grown, and there is a staggering amount of data ready to be structured and analyzed. Many industries, including healthcare, finance, and entertainment media, use NLP to achieve meaningful analysis.
Machine Learning in NLP
First, we must know the differences between NLP, Machine Learning, Artificial Intelligence and Deep Learning to understand how dependent one technique is on the other ones.
Artificial Intelligence is related to making machines perform tasks that require human-like intelligence. Machine Learning is one such application of AI and Deep Learning is a subset of ML which is applied for advanced learning. NLP is also a part of AI, but it can overlap with ML and DL.
NLP uses machine learning algorithms to understand and find the meaning of text documents that range from short tweets to legal documents. ML and AI aid in improving and automating the text analytics functions of NLP that convert unstructured data into structured information. There are many techniques for turning unstructured data into usable pieces. These techniques help identify sentiment, parts of speech, etc.
Here are some of the popular NLP Machine Learning algorithms:
Random Forest Classification
Support Vector Machine
Hybrid Machine Learning Systems:
Hybrid Machine Learning means exactly that: it is a combination of two or more machine learning algorithms, be they supervised or unsupervised. We may even have used hybrid machine learning systems without knowing it. Such a system combines different algorithms or processes from a wide array of choices. This hybridization can increase the accuracy of the model, since no single model solves every problem completely.
There are many ways to combine different ML models, such as classification + classification, classification + clustering, clustering + clustering, and more. Using hybrid machine learning systems for NLP can be an advantage, since it can improve the ability of the machine to understand the data. Different variations of machine learning algorithms can be used in tandem to solve a given problem.
Let us take a sentiment analysis problem and look at how hybrid machine learning systems affect the accuracy of the model. The dataset contains tweets that talk about the problems of six major US airlines. Our aim is to train a model to categorize the tweets as positive, negative, or neutral in sentiment. We will use a hybrid machine learning system that combines Logistic Regression and Support Vector Machine classification to achieve the expected result. Let us start by importing the necessary libraries:
# Importing necessary packages
%matplotlib inline
import numpy as np
import pandas as pd
import re
import nltk
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier
To load the data, let us download the dataset from Kaggle:
link : https://www.kaggle.com/crowdflower/twitter-airline-sentiment
Upload the downloaded CSV file to your Google Drive. Now mount the drive in Colab using the code snippet below.
from google.colab import drive
drive.mount('/content/drive')
Change your working directory to the folder in the mounted drive where you uploaded the dataset (in Colab, the %cd magic does this).
Let us read the comma separated values and clean the column that contains the tweets.
tweets = pd.read_csv("Tweets.csv", sep=',')
tweets.head()
The data is cleaned by removing all special characters, single characters, and stopwords like 'during', 'out', 'very', 'having', etc. These tokens carry far less meaning than the other keywords, hence this step is essential. The text is also converted to lowercase for uniformity.
# Cleaning the data
text_set = tweets.iloc[:, 10].values      # Selecting the 11th column: text
sentiment_set = tweets.iloc[:, 1].values  # Selecting the 2nd column: airline_sentiment
cleaned_text_set = list()
for input_phrase in range(0, len(text_set)):
    # Replace special (non-word) characters with spaces
    clean_text = re.sub(r'\W', ' ', str(text_set[input_phrase]))
    # Remove single characters surrounded by whitespace
    clean_text = re.sub(r'\s+[a-zA-Z]\s+', ' ', clean_text)
    # Remove a single character at the start of the text
    clean_text = re.sub(r'^[a-zA-Z]\s+', ' ', clean_text)
    # Convert to lowercase for uniformity
    clean_text = clean_text.lower()
    cleaned_text_set.append(clean_text)
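The cleaning steps described above can also be sketched as a small standalone function, which makes it easy to try them on a single tweet. This is only an illustration: the name clean_tweet is hypothetical, the third pattern is assumed to be meant to match a single letter at the start of the text, and the final whitespace collapse and strip are extra steps that are not part of the original loop.

```python
import re

def clean_tweet(text):
    """Apply the cleaning steps to one tweet (illustrative helper, not from the article)."""
    text = re.sub(r"\W", " ", str(text))         # special characters -> spaces
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # single letters between spaces
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)    # single letter at the start (assumed intent)
    text = re.sub(r"\s+", " ", text)             # extra step: collapse repeated whitespace
    return text.lower().strip()                  # extra step: strip the ends

print(clean_tweet("@VirginAmerica I didn't like it!!"))
# virginamerica didn like it
```

Note that the apostrophe in "didn't" is treated as a special character, so the leftover "t" is removed as a single character.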
To remove stopwords, we have to download the list of stopwords from the nltk library. To extract features from the cleaned text, we will use TF-IDF features.
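import nltk
nltk.download("stopwords")
# Keep at most 3000 terms that occur in at least 6 tweets and in at most 80% of all tweets
input_vector = TfidfVectorizer(max_features=3000, min_df=6, max_df=0.8, stop_words=stopwords.words('english'))
# Note: cleaned_text_set now holds the TF-IDF feature matrix instead of the cleaned strings
cleaned_text_set = input_vector.fit_transform(cleaned_text_set).toarray()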
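To give an intuition for what TF-IDF weighting does, here is a minimal pure-Python sketch. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each document, which to my understanding is the scikit-learn default. The function name tfidf and the whitespace tokenization are simplifications introduced for illustration only; they do not reproduce TfidfVectorizer exactly.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative sketch)."""
    tokenized = [doc.split() for doc in docs]  # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: counts[term] * idf[term] for term in counts}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

toy_docs = ["flight delayed again", "flight was great"]
toy_rows = tfidf(toy_docs)
for doc, row in zip(toy_docs, toy_rows):
    print(doc, "->", {t: round(w, 3) for t, w in row.items()})
```

In this toy corpus, "flight" occurs in both documents and therefore receives a lower weight than "delayed", which occurs in only one.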
We have used the 3000 most frequent words in the data as features. Now let us split the data into training and validation sets. An 80-20 split is a common choice.
X_train, X_test, y_train, y_test = train_test_split(cleaned_text_set, sentiment_set, test_size=0.2, random_state=42)
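For readers who want to see what an 80-20 split amounts to, the following is a rough pure-Python sketch of the idea: shuffle the indices with a fixed seed and hold out a fraction for testing. The function name split_train_test is hypothetical, and the shuffle will not reproduce the exact rows that train_test_split selects.

```python
import math
import random

def split_train_test(features, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then hold out a fraction for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_tr = [features[i] for i in train_idx]
    X_te = [features[i] for i in test_idx]
    y_tr = [labels[i] for i in train_idx]
    y_te = [labels[i] for i in test_idx]
    return X_tr, X_te, y_tr, y_te

toy_features = [[i] for i in range(10)]
toy_labels = ["negative" if i % 2 else "positive" for i in range(10)]
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = split_train_test(toy_features, toy_labels)
print(len(toy_X_tr), len(toy_X_te))
# 8 2
```

Fixing the seed plays the same role as random_state=42 in the call above: the split is the same on every run, so the scores are comparable.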
Since we are using a hybrid machine learning system with two algorithms, let us first try each of the algorithms on its own. Note that the scores below are computed on the training data.
# Logistic Regression
lr = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X_train, y_train)
lr_score = lr.score(X_train, y_train)  # accuracy on the training set
print("Logistic Regression Accuracy Score: ", lr_score)

# Support Vector Machine Linear Classification
svc = SVC(kernel='linear')
svc.fit(X_train, y_train)
svc_score = svc.score(X_train, y_train)  # accuracy on the training set
print("SVM Classification Accuracy Score: ", svc_score)
It is time to combine the two learning models. We define each of the two machine learning models twice, which results in a combination of 4 weak learners. We then combine them with the Max Voting Classifier method: the final class prediction of the ensemble model is the class predicted most often by the weak learners.
# Create the sub-models
estimators = []

# Defining 2 Logistic Regression models
# (VotingClassifier fits its own copies, so these .fit calls are not strictly needed)
model11 = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X_train, y_train)
estimators.append(('logistic1', model11))
model12 = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X_train, y_train)
estimators.append(('logistic2', model12))

# Defining 2 Support Vector Classifiers
model21 = SVC(kernel='linear')
estimators.append(('svm1', model21))
model22 = SVC(kernel='linear')
estimators.append(('svm2', model22))

# Defining the ensemble model (hard, i.e. majority, voting by default)
ensemble = VotingClassifier(estimators)
ensemble.fit(X_train, y_train)
#y_pred = ensemble.predict(X_test)
ensemble_score = ensemble.score(X_train, y_train)  # accuracy on the training set
print("Ensemble Score: ", ensemble_score)
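The max voting rule itself is simple enough to sketch in plain Python. The snippet below is an illustration and not the VotingClassifier implementation: the function name majority_vote and the toy predictions are made up, and ties are broken by picking the label that sorts first, which is my assumption about a reasonable rule.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per model, all of the same length."""
    voted = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        # Most votes wins; ties go to the label that sorts first (assumed rule)
        winner = min(counts, key=lambda label: (-counts[label], label))
        voted.append(winner)
    return voted

# Toy predictions from four weak learners for three tweets
toy_logistic1 = ["negative", "neutral", "positive"]
toy_logistic2 = ["negative", "neutral", "positive"]
toy_svm1 = ["negative", "positive", "positive"]
toy_svm2 = ["neutral", "positive", "negative"]

toy_votes = majority_vote([toy_logistic1, toy_logistic2, toy_svm1, toy_svm2])
print(toy_votes)
# ['negative', 'neutral', 'positive']
```

The second tweet shows why the tie-breaking rule matters: with an even number of voters, two classes can receive the same number of votes.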
Now let us run each of the models on the test data:
final_pred_lr = lr.predict(X_test)
final_pred_svc = svc.predict(X_test)
final_pred_ensemble = ensemble.predict(X_test)

# Accuracy score of the final prediction
print("SVM prediction: ", accuracy_score(y_test, final_pred_svc))
print("LR prediction: ", accuracy_score(y_test, final_pred_lr))
print("Ensemble prediction: ", accuracy_score(y_test, final_pred_ensemble))
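The accuracy reported above is simply the fraction of tweets whose predicted sentiment matches the true label. A minimal sketch of that computation, with a hypothetical function name and toy labels, could look like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)

toy_true = ["negative", "neutral", "positive", "negative"]
toy_pred = ["negative", "neutral", "negative", "negative"]
print(accuracy(toy_true, toy_pred))
# 0.75
```

Because accuracy treats all classes equally per sample, a model can score well on an imbalanced dataset while still doing poorly on the rarer classes.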
We can see that the hybrid machine learning model has performed as well as or better than the individual learning models. A hybrid approach to machine learning is therefore a strong option for NLP tasks such as this one.