AUTHOR: Abhinav Kale, Rimpa Poria
An efficient method for training deep neural networks for semi-supervised learning is discussed. We introduce self-ensembling and discuss how it fits into semi-supervised learning. Here, we briefly discuss the different learning paradigms, self-ensembling and its two variants, the Π-Model and temporal ensembling, and related applications with their pros and cons.
Temporal Ensembling was introduced by Timo Aila and Samuel Laine (both from NVIDIA) in 2016. It comes from a relatively simple idea: that we can use an ensemble of the previous outputs of a neural network as an unsupervised target. Experiments have consistently suggested that an ensemble of multiple neural networks provides better predictions than a single network. Most training in practice still operates on a single network, but because of dropout regularization, the predictions made at different epochs correspond to an ensemble prediction of a large number of individual sub-networks. We will study different ensembling techniques in order to understand temporal ensembling.
Supervised learning: the name itself hints at the presence of a supervisor as a teacher. We use supervised learning when we train the machine on well-labeled data, which means that each example is already tagged with its corresponding label. The supervised learning algorithm analyzes the training data and, when provided with a new set of examples, produces the correct outcome for each input.
Unsupervised learning is a machine learning technique where you do not need a supervisor to guide the model. Instead, the model has to work on its own to discover information and patterns. Unsupervised learning mainly deals with unlabeled data: algorithms are left to their own devices to find groups of data that are similar in pattern and to surface interesting structure in the data.
Why do we need Semi-supervised learning?
The biggest difference between supervised and unsupervised machine learning is this: supervised machine learning algorithms are trained on datasets whose labels, added by a machine learning engineer or data scientist, guide the algorithm toward the important features. This is a very costly process, especially when dealing with large volumes of data. Unsupervised machine learning algorithms, on the other hand, are trained on unlabeled data and must determine feature importance on their own from inherent patterns in the data. The most basic disadvantage of unsupervised learning is that its application spectrum is limited.
To counter these disadvantages, the concept of Semi-Supervised Learning was introduced. In this type of learning, the algorithm is trained upon a combination of labeled and unlabeled data. Typically, this combination will contain a very small amount of labeled data and a very large amount of unlabeled data. This is useful for a few reasons. First, the process of labeling massive amounts of data for supervised learning is often time-consuming and expensive, and too much labeling can impose human biases on the model. That means including lots of unlabeled data during the training process actually tends to improve the accuracy of the final model while reducing the time and cost spent building it. You can use unsupervised learning techniques to discover and learn the structure in the input variables. You can also use supervised learning techniques to make best guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data, and use the model to make predictions on new unseen data.
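The "best guess predictions fed back as training data" idea above is often called pseudo-labeling. A minimal sketch, assuming a toy nearest-centroid classifier as the base model (the classifier and all names here are illustrative assumptions, not part of any specific method in the text):

```python
import numpy as np

# Minimal pseudo-labeling sketch. A nearest-centroid classifier is fit on the
# small labeled set, used to "best guess" labels for the unlabeled points, and
# then refit on the combined data. The classifier choice is an assumption.

def fit_centroids(X, y):
    """Return one class centroid (mean point) per label."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, X):
    """Assign each point to the class of the nearest centroid."""
    labels = sorted(centroids)
    d = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in labels])
    return np.array([labels[i] for i in d.argmin(axis=0)])

# Two well-separated clusters: very little labeled data, lots of unlabeled.
rng = np.random.default_rng(0)
X_lab = np.array([[0.0, 0.0], [5.0, 5.0]])
y_lab = np.array([0, 1])
X_unl = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

pseudo = predict(fit_centroids(X_lab, y_lab), X_unl)   # best-guess labels
X_all = np.vstack([X_lab, X_unl])
y_all = np.concatenate([y_lab, pseudo])
final = fit_centroids(X_all, y_all)                    # refit on labeled + pseudo-labeled
```

With only one labeled example per class, the refit centroids are averaged over all forty pseudo-labeled points, so the final model reflects the structure of the unlabeled data as well.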
Self-ensembling is a technique that achieved state-of-the-art results in the area of semi-supervised learning. It comes from a relatively simple idea: that we can use an ensemble of the previous outputs of a neural network as an unsupervised target. The two methods are the Π-Model and temporal ensembling.
Fig 1: Π-Model and temporal ensembling
In the Π-Model, each training input $x_i$ is evaluated twice, under different conditions of dropout regularization and augmentation, resulting in two outputs $z_i$ and $\tilde{z}_i$.
The loss function here includes two components:
Supervised Component: standard cross-entropy between the ground truth $y_i$ and the predicted values $z_i$ (only applied to labeled inputs).
Unsupervised Component: penalization of different outputs for the same input under different augmentation and dropout conditions, by minimizing the mean square difference between $z_i$ and $\tilde{z}_i$. This component is applied to all inputs (labeled and unlabeled).
These two components are combined by summing both components and scaling the unsupervised one using a time-dependent weighting function.
The augmentations and dropouts are always randomly different for each input, resulting in different output predictions. In addition to the augmentations, the inputs are also combined with random Gaussian noise to increase the variability.
The weighting function will be described later, since it is also used by temporal ensembling, but it ramps up from zero and increases the contribution from the unsupervised component, reaching its maximum by 80 epochs. This means that initially the loss and the gradients are mainly dominated by the supervised component (the authors found that this slow ramp-up is important to keep the gradients stable).
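The Π-Model loss described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are assumptions, and the Gaussian shape of the ramp-up (reaching its maximum at 80 epochs) follows the paper's description.

```python
import numpy as np

# Sketch of the Pi-Model loss for one batch. z1, z2 are the two softmax
# outputs of the same inputs under different dropout/augmentation conditions;
# only the loss structure follows the text, the names are illustrative.

def ramp_up(epoch, ramp_len=80, w_max=1.0):
    """Gaussian ramp-up: near zero at epoch 0, reaching w_max by `ramp_len`."""
    t = np.clip(epoch / ramp_len, 0.0, 1.0)
    return w_max * np.exp(-5.0 * (1.0 - t) ** 2)

def pi_model_loss(z1, z2, y, labeled_mask, epoch):
    """y: one-hot targets; labeled_mask: boolean, which rows have labels."""
    # Supervised component: cross-entropy on labeled inputs only.
    ce = -(y[labeled_mask] * np.log(z1[labeled_mask] + 1e-12)).sum(axis=1).mean()
    # Unsupervised component: mean squared difference on ALL inputs.
    mse = ((z1 - z2) ** 2).mean()
    # Combined: supervised + time-weighted unsupervised.
    return ce + ramp_up(epoch) * mse
```

At epoch 0 the weight is exp(-5) ≈ 0.007, so the gradients are dominated by the supervised cross-entropy term, exactly as the slow ramp-up intends.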
One problem with the Π-Model is that it relies on output predictions that can be quite unstable during the training process. To combat this instability, the authors propose temporal ensembling.
In temporal ensembling, the network's current outputs (post-softmax) are compared to a weighted sum of all its previous outputs. Previous outputs are gathered during training: in each epoch, every input is seen once and its corresponding output is memorized to serve as a comparison target later.
Fig 2: Temporal ensembling mechanism
Step 1: To resolve the problem of noisy predictions during training, we first aggregate the past predictions into an ensemble prediction.
Step 2: The current predictions are forced to be close to the ensemble prediction, which is based on the previous predictions of the network. The algorithm stores each prediction vector $z_i$ and, at the end of each epoch, accumulates it into the ensemble vector using the formula $Z_i \leftarrow \alpha Z_i + (1 - \alpha) z_i$, where $\alpha$ is a term that controls how far past predictions influence the temporal ensemble.
Step 3: The vector $Z_i$ contains a weighted average of the previous predictions for each instance, with recent epochs weighted more heavily. Before being used as training targets, the ensemble values need to be corrected for startup bias by dividing them by $(1 - \alpha^t)$, where $t$ is the epoch index: $\tilde{z}_i = Z_i / (1 - \alpha^t)$.
Step 4: Both $Z_i$ and $\tilde{z}_i$ are zero on the first epoch, since no past predictions exist.
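The steps above amount to a few lines of bookkeeping per epoch. A minimal sketch (variable names are assumptions; the update and bias-correction rules follow the formulas in the text):

```python
import numpy as np

# Temporal-ensembling bookkeeping. Z accumulates an exponential moving
# average of the per-epoch predictions; z_hat is the bias-corrected target
# that the next epoch's predictions are pushed toward.

def update_targets(Z, z_epoch, epoch, alpha=0.6):
    """End-of-epoch update: Z <- alpha*Z + (1-alpha)*z, then correct the
    startup bias by dividing by (1 - alpha**t), with epochs counted from 1."""
    Z = alpha * Z + (1.0 - alpha) * z_epoch
    z_hat = Z / (1.0 - alpha ** epoch)
    return Z, z_hat

# Both accumulators start at zero: no past predictions exist yet.
Z = np.zeros((4, 3))                       # one row per input, one col per class
z1 = np.full((4, 3), 1 / 3)                # epoch-1 softmax outputs
Z, z_hat = update_targets(Z, z1, epoch=1)
```

After the first epoch, the bias correction divides $(1-\alpha) z$ by $(1-\alpha)$, so the targets equal the first-epoch predictions exactly; without it, early targets would be shrunk toward zero.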
The advantages of this algorithm, when compared to the Π-Model, are:
Training is approximately 2x faster (only one evaluation per input each epoch).
The training target $\tilde{z}$ is less noisy than in the Π-Model.
The disadvantages of this algorithm, when compared to the Π-Model, are:
The predictions need to be stored across epochs.
A new hyperparameter, $\alpha$, is introduced.
Application of ensembling techniques:
Given train and test splits of a population dataset, an ensembling technique feeds the training data to different machine learning algorithms, such as KNN, linear regression, decision trees, and SVM, so that each algorithm produces an output, and we then take the mean of those outputs. This way we get more accurate results. So the basic application of ensembling techniques is to decrease the error rate and obtain more accurate results with a good fit of the data while avoiding overfitting.
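The mean-of-outputs idea can be sketched in a few lines. The individual "models" below are hypothetical stand-ins for KNN, linear regression, a decision tree, and an SVM (only their predictions are shown, as made-up numbers for illustration):

```python
import numpy as np

# Toy model-averaging ensemble: each base model predicts a value for three
# test points; the ensemble output is the mean of the four predictions,
# which reduces the variance of any single model's errors.

predictions = {
    "knn":               np.array([2.1, 3.9, 6.2]),
    "linear_regression": np.array([1.8, 4.1, 5.9]),
    "decision_tree":     np.array([2.0, 4.3, 6.1]),
    "svm":               np.array([2.1, 3.7, 5.8]),
}
ensemble = np.mean(list(predictions.values()), axis=0)
```

Each model over- or under-shoots in a different direction, and averaging cancels much of that individual error, which is the intuition behind the accuracy gain described above.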