Machine learning models are usually built on the assumption that the data is correct, balanced, and free of hidden information. Real-life problems rarely provide such data. To help models learn more accurately in these scenarios, different approaches to modelling can be used, and two-phase learning is one that has recently gained the attention of practitioners. In this article, we will discuss two-phase machine learning, an approach in which modelling is carried out in two phases to get a more accurate result. The major points that we will discuss here are listed below.
Table of Contents
- What is Two-Phase Learning?
- Basic Approach to Two-Phase Learning Model
- Practical Use-Cases of Two-Phase Learning
- Where is Two-Phase Learning Applied?
What is Two-Phase Learning?
A machine learning approach can be called two-phase if it uses two algorithms together: the first imputes values of a variable in the dataset, and the second predicts for which of those values the imputation can be treated as successful. More formally, two-phase machine learning can be seen as a data preprocessing procedure in which, in the first phase, a machine learning algorithm imputes the values in the data set that we know are incorrect.
Then, in the second phase, another machine learning algorithm produces a conditional measure of the uncertainty of phase one. In other words, the first-phase algorithm supplies imputed values for the critical data points, and the second-phase algorithm determines whether imputing is better than collecting the data again through approaches such as an indirect interview or a call-back.
Let’s understand it with an example. Suppose we are examining a variable from a large, continuous data set, and some values of the variable are wrong because of the circumstances at the time of data collection. The more incorrect values there are, the more imputations are needed for that variable, so the second phase is used to learn the circumstances under which the first-phase algorithm gives good or bad predictions for the incorrect values. Examining the wrong predictions with a second-phase algorithm in this way leads to more correct and effective predictions.
The above image represents the workflow of a two-phase learning approach. If the values of a variable are correct and predictable, they pass straight to the end of the learning process; if not, they need to be learned in the second phase.
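The decision described above, between keeping an imputed value and collecting the data again, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` fields, the `route_records` helper, and the 0.8 threshold are hypothetical names and values chosen for the example, not part of any published method.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Record:
    record_id: int
    imputed_value: float      # phase-one imputation for the suspect variable
    align_probability: float  # phase-two estimate that the imputation is right


def route_records(records: List[Record],
                  threshold: float = 0.8) -> Tuple[List[int], List[int]]:
    """Split record ids into 'keep the imputation' and 'collect again'.

    The threshold is an assumed tuning knob, not a recommended value.
    """
    accept, recollect = [], []
    for rec in records:
        if rec.align_probability >= threshold:
            accept.append(rec.record_id)      # trust the phase-one imputation
        else:
            recollect.append(rec.record_id)   # e.g. schedule a call-back
    return accept, recollect


# Made-up records, purely for illustration.
records = [Record(1, 7.5, 0.93), Record(2, 0.0, 0.41), Record(3, 12.0, 0.80)]
accepted, recollected = route_records(records)
```

In practice the threshold would depend on how costly a wrong imputation is compared with a new interview.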
Basic Approach to Two-Phase Learning Model
Let’s take the random forest as an example. In phase one, we split the data set and use two-thirds of it for training, including the correct values of the target variable along with the other variables in the data set. A random forest can be used for both classification and regression, and it is built from decision trees. Each tree is constructed by repeatedly splitting a node into two child nodes. The first (root) node contains the whole learning sample, and the criterion for splitting is the Gini index, which in its standard form is:
$$Gini(D, X) = \frac{N_1}{N}\,Gini(D_1) + \frac{N_2}{N}\,Gini(D_2)$$
where X is the feature used for the split, D is the node being split, and D1 and D2 are the child nodes of D. N is the number of elements in the parent node, and N1 and N2 are the numbers of elements in the child nodes. The Gini index of a single node D is:
$$Gini(D) = 1 - \sum_{k=1}^{K} p_k^2$$
where p_k is the fraction of elements in the node that belong to class k, and K is the number of classes in the node.
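To make the splitting criterion concrete, here is a minimal sketch of the standard Gini index for a node and for a candidate split. The function names and the toy labels are assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    """Gini index of one node: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def gini_split(left, right):
    """Gini index of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Toy node with three correct and three incorrect records.
parent = ["ok", "ok", "ok", "wrong", "wrong", "wrong"]
pure_split = gini_split(["ok", "ok", "ok"], ["wrong", "wrong", "wrong"])
mixed_split = gini_split(["ok", "wrong", "ok"], ["wrong", "ok", "wrong"])
```

A tree prefers the split with the lower value, so here the split that separates the two classes completely would be chosen over the mixed one.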
In most cases, the incorrect data points make up only a small share of the data. This causes a class imbalance, which can lead the random forest to learn poorly. To deal with this, we can use oversampling or undersampling to balance the classes of the dataset so that the model learns properly and is not biased towards one class.
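As a rough illustration of the oversampling idea, the sketch below duplicates randomly chosen rows of the smaller class until all classes have the same size. The helper name `random_oversample` and the toy rows are assumptions for this example; dedicated libraries offer more refined resampling methods.

```python
import random


def random_oversample(rows, label_of, seed=0):
    """Return a shuffled copy of rows in which every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra rows with replacement to reach the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy data: 8 correct records and 2 incorrect ones.
rows = [("a", "correct")] * 8 + [("b", "incorrect")] * 2
balanced = random_oversample(rows, label_of=lambda row: row[1])
```

Resampling like this is normally applied to the training split only, so that the test split keeps the original class distribution.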
Let’s say we build a random forest of 100 trees, where the number of trees was chosen by testing or by cross-validation. Once the trees in the forest are grown, the outcome of each tree is the majority vote in its terminal node, and the final result from the entire forest is the proportion of votes across all the trees. Let’s take a look at the table below:
The table above represents the possible outcomes of phase one when the trained random forest is applied to the remaining one-third of the data, the test set.
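The voting step of the forest can be sketched as follows. The per-tree predictions are made-up values standing in for the output of 100 fitted trees, and `forest_vote` is a hypothetical helper name.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the winning label and the proportion of votes for each label."""
    votes = Counter(tree_predictions)
    total = len(tree_predictions)
    proportions = {label: count / total for label, count in votes.items()}
    winner = votes.most_common(1)[0][0]
    return winner, proportions


# Assumed predictions from a forest of 100 trees for a single record.
tree_predictions = ["incorrect"] * 63 + ["correct"] * 37
winner, proportions = forest_vote(tree_predictions)
```

The proportion of votes can be read as a rough score for how strongly the forest favours a class, which is useful when a threshold other than a simple majority is wanted.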
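The following measures can be used to assess the phase one algorithm, where TP, FP, FN, and TN denote the numbers of true positives, false positives, false negatives, and true negatives:
- True positive rate: $TP / (TP + FN)$
- False discovery rate: $FP / (TP + FP)$
- Accuracy: $(TP + TN) / (TP + FP + FN + TN)$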
Here, accuracy and the true positive rate can be affected by the distribution of the outcome class, while the false discovery rate is not affected by that distribution. We can also use the following summary statistics.
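- Predicted prevalence: the share of records predicted to be positive, $(TP + FP) / (TP + FP + FN + TN)$, with TP, FP, FN, and TN the counts of true positives, false positives, false negatives, and true negatives
- Balance: how the number of predicted positives compares with the number of observed positives, assumed here to be the ratio $(TP + FP) / (TP + FN)$
So if the balance is 1, that is, if false positives and false negatives cancel out, we can proceed to produce unbiased estimates.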
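The measures listed above can be computed directly from the four cells of a confusion matrix. The sketch below uses the standard definitions of the true positive rate, false discovery rate, and accuracy; the formula used for balance (predicted positives divided by observed positives) is an assumption, and the counts are invented for illustration.

```python
def phase_one_metrics(tp, fp, fn, tn):
    """Summary measures from confusion-matrix counts (assumes non-zero denominators)."""
    total = tp + fp + fn + tn
    return {
        "true_positive_rate": tp / (tp + fn),
        "false_discovery_rate": fp / (tp + fp),
        "accuracy": (tp + tn) / total,
        "predicted_prevalence": (tp + fp) / total,
        "observed_prevalence": (tp + fn) / total,
        # Assumed definition: equals 1 when false positives and
        # false negatives are equal in number.
        "balance": (tp + fp) / (tp + fn),
    }


# Made-up counts for a test set of 200 records.
metrics = phase_one_metrics(tp=40, fp=10, fn=10, tn=140)
```

With these counts the predicted and observed prevalence coincide, which is the situation the text describes as a balance of 1.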
As discussed above, the data set contains some incorrect values, which can be separated from the rest of the data. We can then use a similar random forest to produce predictions for them, and introduce a new variable in the dataset that records the congruence between the predicted values and the observed values.
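Let’s say the name of the new outcome variable is align. Presumably it takes the value 1 when the phase-one prediction agrees with the observed value and 0 otherwise:
$$align = \begin{cases} 1 & \text{if the predicted value equals the observed value} \\ 0 & \text{otherwise} \end{cases}$$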
The variables used in the phase 2 algorithm are the same as those used in phase 1, with the new outcome variable added. Training in phase 2 should again be done on a two-thirds split of the whole data.
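A small sketch of how the phase 2 data could be prepared is given below. It assumes that align is a 0/1 indicator of agreement between the phase-one prediction and the observed value, which is an interpretation of the text rather than a confirmed definition. The record fields and helper names are hypothetical.

```python
import random


def add_align(records):
    """Add an assumed 0/1 'align' outcome: 1 if prediction and observation agree."""
    return [
        {**rec, "align": int(rec["predicted"] == rec["observed"])}
        for rec in records
    ]


def two_thirds_split(records, seed=0):
    """Shuffle a copy of the records and split it into 2/3 training and 1/3 test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Made-up phase-one output for six records.
records = [
    {"id": 1, "observed": "absent", "predicted": "absent"},
    {"id": 2, "observed": "absent", "predicted": "present"},
    {"id": 3, "observed": "present", "predicted": "present"},
    {"id": 4, "observed": "present", "predicted": "absent"},
    {"id": 5, "observed": "absent", "predicted": "absent"},
    {"id": 6, "observed": "present", "predicted": "present"},
]
phase_two_data = add_align(records)
train, test = two_thirds_split(phase_two_data)
```

The phase 2 model would then be trained on `train` with `align` as the outcome and the phase 1 variables as predictors.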
The structure above, built on the random forest algorithm, is one way of doing two-phase modelling. We have seen how it can be applied when the data falls into two classes, one of clean data and one of incorrect data.
Practical Use-Cases of Two-Phase Learning
In the section above, we saw how the two-phase procedure can improve accuracy. In this section, we will look at examples of two-phase learning in practice.
- In the work done by Susie Jentoft and team, a random forest is applied in a two-phase machine learning approach similar to the one above. The data comes from the Norwegian Labour Force Survey (LFS), a large, continuous, rotating panel survey of around 24 000 people per quarter, and the variable studied is partial absence (from work). The results of phase one and phase two are presented in the following tables.
- Results from Phase One:
- Results from Phase Two:
In this data set, the inaccurate values came from incorrect interviews or proxy interviews.
- In another work, researchers propose an AR-based, two-phase learning guide strategy. They study the application of augmented reality (AR) and image positioning technologies to paper textbooks, and two-phase learning is built into the program so that learners are guided through a sequential learning procedure. In this way, wrong results induced in phase one can be corrected in phase two.
- In the research work of Dahun Kim and team, two-phase learning is used to address weakly supervised object localization. Conventional weakly supervised object localization techniques tend to focus only on the most discriminative parts of an image and miss the less prominent ones. In the paper, a fully convolutional network is used in phase one to find the most discriminative parts of an image, and the second phase learns from the image again to find the next most important parts. The image below shows the effect of two-phase learning.
Portion (a) of the image is the input to the model, where the red and yellow crosses mark the locations estimated as most important and less important, respectively. Portion (b) is the heat map from the phase-one network. Portion (c) is the segmentation prediction of the baseline. Portion (d) is the ground-truth segmentation mask. Portion (e) is the heat map from the phase-two network, and portion (f) is the segmentation prediction using the second-phase network.
Where is Two-Phase Learning Applied?
Having gone through the real-life examples above, we can say that the main motive of two-phase learning is to make models more accurate. With that in mind, the two-phase learning model can be applied in the following conditions:
- Inaccurate Data: When the data has been collected with inaccuracies, phase one can be trained on the correct data filtered from the raw data, and phase two can be used to learn about the data that is inaccurate.
- Imbalanced Data: In many real-life situations the classes in the data are imbalanced, which biases conventional learning methods towards the majority class. Here we can make the model learn the different classes separately, in phases.
- Model Incompatibility with Hidden Data: Fully convolutional networks are a strong option for learning from image data, but as the third example shows, a single model can still miss part of the information in the data. In such a scenario, two-phase learning can improve performance by letting the model learn from the whole data without missing the hidden information in it.
Final Words
In this article, we have seen an overview of two-phase learning and how we can approach building a two-phase learning model. Along with this, we have looked at some examples where the two-phase learning approach was applied to get better results, and we have seen in which situations two-phase learning can be applied.