Date: 2/26/2020
Classifiers are commonly used to estimate probabilities and categorical outcomes. In communications systems, receivers are called on to demodulate symbols and map them to data. A common machine learning approach for time series data is the convolutional neural network (CNN). CNNs can be computationally intensive since each node has an associated weight, and processing the data requires a multiplication and an activation function per node. Decision trees are simpler since each decision uses a compare function, and an ensemble of trees can be used for classification of outcomes.
Description of the model building and performance evaluation
The overall process of building a machine learning model consisted of several steps. First, the time series data is read in and normalized to a working maximum value. Time series data samples are converted from serial to parallel for input to the classifier and for training. Features are generated for early samples, on-time samples, and late samples. The samples are grouped by bit time, with the categorical output of the on-time samples used for labels. 24 features are used for the in-phase data and 24 features for the quadrature data, for a total of 48 features in this example. The model is built assuming time synchronization has occurred (carrier lock and bit synchronization).
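The serial-to-parallel feature generation described above can be sketched as follows. This is a minimal illustration, assuming the i0/i1/i2 (and q0/q1/q2) groups correspond to the early, on-time, and late symbol windows of 8 samples each; the function name and shapes are hypothetical, not from the original implementation.

```python
import numpy as np

def build_features(i_samples, q_samples, sps=8):
    """Group serial I/Q sample streams into per-symbol feature rows.

    Each row holds the early, on-time, and late symbol windows
    (3 bit times x 8 samples) for both I and Q: 3*8*2 = 48 features.
    i_samples, q_samples: 1-D arrays of length n_symbols * sps.
    """
    n_sym = len(i_samples) // sps
    # serial-to-parallel: one row of sps samples per symbol
    i_par = np.asarray(i_samples[: n_sym * sps]).reshape(n_sym, sps)
    q_par = np.asarray(q_samples[: n_sym * sps]).reshape(n_sym, sps)
    rows = []
    # symbol k needs its neighbors: k-1 (early), k (on-time), k+1 (late)
    for k in range(1, n_sym - 1):
        rows.append(np.concatenate([i_par[k - 1], i_par[k], i_par[k + 1],
                                    q_par[k - 1], q_par[k], q_par[k + 1]]))
    return np.vstack(rows)  # shape (n_sym - 2, 48)
```

The on-time symbol's categorical value (one of '00', '01', '10', '11') would then be used as the label for each row.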
The classifier was run with three data sets; QAM data was generated with a 3 dB SNR. The first data set was composed of 500 data points (QAM symbols) with an 80% train / 20% test split for initial testing. The second data set was composed of 1,000,000 data points, with 250,000 used for training and 750,000 used for test. The third data set was 1,250,000 symbols, with 250,000 used for training and 1,000,000 used for test. The data input to the classifier is synthesized through simulation, with 8 samples per symbol.
Figure 1 In-phase filtered data shown with binary category
Figure 1 shows the in-phase (real) part of the symbol information after filtering with a root-raised-cosine (RRC) filter with alpha equal to 0.5. The filter kernel is also shown. The input data and the up-sampled version of the data are shown with red dots and blue lines. After the symbol data is generated, AWGN is added to the signal. For the small data set used for test and train, the confusion matrix showed no errors (false positives or false negatives) at an SNR of 3 dB.
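The noise-addition step can be sketched as below: scaling complex AWGN so the result has a target SNR in dB. This is a minimal sketch of the simulation step only; the RRC pulse shaping itself is omitted, and the function name is hypothetical.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add complex AWGN so the result has the requested SNR in dB.

    signal: complex baseband samples (e.g. RRC-filtered QAM symbols).
    """
    rng = np.random.default_rng(rng)
    sig_power = np.mean(np.abs(signal) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    # split noise power equally between the I and Q components
    noise = np.sqrt(noise_power / 2) * (
        rng.standard_normal(len(signal))
        + 1j * rng.standard_normal(len(signal)))
    return signal + noise

# example: QPSK-like symbols, 8 samples per symbol, 3 dB SNR
symbols = (np.array([1, -1, 1, 1]) + 1j * np.array([1, 1, -1, -1])) / np.sqrt(2)
noisy = add_awgn(np.repeat(symbols, 8), 3.0, rng=0)
```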
Figure 2 Constellation plot of IQ symbols with AWGN noise added 3dB SNR
Figure 2 shows the constellation plot of the IQ symbols after filtering and the addition of AWGN. Figure 3 shows the same data (real valued) as a time series (red dots, blue lines), with the categorical output for the IQ data in green, at an SNR of 3 dB.
Figure 3 Time Series In-phase Sampled Data with Categorical Output
The model type selected for the proof of concept studies was a random forest classifier. A random forest is an ensemble of classification trees (more broadly, decision trees) and was developed to reduce the overfitting typically observed for decision trees. Each tree in a random forest is constructed by randomly selecting N training data samples with replacement, where N is the total number of data samples. Moreover, each tree randomly uses only a subset of the features, with the default being the square root of the total number of features M. A tree is grown to the largest extent possible (node purity) without any pruning. To predict a new sample, the values of its features are input into, and flow down, each tree; the final vote is the majority class predicted by the collection of trees. The selection of random samples and random features for each tree results in loosely correlated trees and makes the random forest one of the most effective model types for a variety of ML applications. In general, random forest models show excellent performance metrics, rarely overfit, are able to model non-linear relationships, and can handle large numbers of noisy and correlated features. Moreover, given that some samples remain "out of bag" for a given tree, and a given tree does not utilize all the features, the importance of a feature can be estimated for the random forest; this importance reflects the predictive value of that feature. Thus, random forest models provide both high performance and a level of interpretability.
For this proof of concept study, the number of trees was set to 100 and the number of features was set to the default, which is the square root of the total number of features (sqrt(48) ≈ 7). The performance metrics calculated on the test data were as follows: accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and area under the receiver operating characteristic (ROC) curve. Definitions are shown below. The predicted class and corresponding performance metrics were reported for both the 0.5 probability threshold and the optimal probability threshold derived from the ROC curve. The optimal probability threshold is the threshold corresponding to the maximal difference between the True Positive Rate (Sensitivity) and the False Positive Rate (1 – Specificity); this represents the point on the curve closest to the upper left corner of the plot. In addition to these metrics, the predicted probability was utilized to rank the samples, and as such it can serve as a model-derived confidence score for each symbol decision.
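A minimal sketch of this model configuration using scikit-learn is shown below. The data here is a synthetic stand-in (random features with one informative feature per class), not the simulated QAM data; shapes and the train/test split are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# synthetic stand-in for the 48-feature symbol data (hypothetical)
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 48))
y = rng.integers(0, 4, size=2000)   # categories '00', '01', '10', '11'
X[np.arange(2000), y] += 3.0        # make the classes separable

clf = RandomForestClassifier(
    n_estimators=100,      # 100 trees, as in this study
    max_features="sqrt",   # default: sqrt(48) ~ 7 features per split
    random_state=0,
)
clf.fit(X[:1600], y[:1600])
pred = clf.predict(X[1600:])
cm = confusion_matrix(y[1600:], pred)   # 4x4 categorical confusion matrix
importances = clf.feature_importances_  # per-feature predictive value
```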
Features used:
'i0s0', 'i0s1', 'i0s2', 'i0s3', 'i0s4', 'i0s5', 'i0s6', 'i0s7',
'i1s0', 'i1s1', 'i1s2', 'i1s3', 'i1s4', 'i1s5', 'i1s6', 'i1s7',
'i2s0', 'i2s1', 'i2s2', 'i2s3', 'i2s4', 'i2s5', 'i2s6', 'i2s7',
'q0s0', 'q0s1', 'q0s2', 'q0s3', 'q0s4', 'q0s5', 'q0s6', 'q0s7',
'q1s0', 'q1s1', 'q1s2', 'q1s3', 'q1s4', 'q1s5', 'q1s6', 'q1s7',
'q2s0', 'q2s1', 'q2s2', 'q2s3', 'q2s4', 'q2s5', 'q2s6', 'q2s7'
The prefix ‘i’ or ‘q’ denotes an in-phase or quadrature data sample. The next number denotes the bit time (the early, on-time, or late sample group). The suffix ‘s#’ is the sample number within the symbol.
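The 48 feature names above follow a regular pattern and can be generated in one line:

```python
# i/q prefix, bit time 0-2, sample 0-7 within the symbol,
# in the same order as the list above
names = [f"{c}{b}s{s}" for c in ("i", "q") for b in range(3) for s in range(8)]
```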
Definition of metrics:
Confusion matrices were generated and are shown in the tables below for the categorical outputs of the second and third data sets. The categories represent IQ values of ‘00’, ‘01’, ‘10’, and ‘11’. A zero bit represents ‘-1’ in the IQ constellation.
The confusion matrices shown in Tables 1 and 2 give an indication of the bit error rate (BER) and where the errors occur. The anti-diagonal, where both bits are flipped, shows zero errors. These values represent both bits flipping on the constellation, i.e. a phase change of 180 degrees. For the data sets run at 3 dB SNR, this suggests the classifier would not show errors on a BPSK constellation.
From the above confusion matrix the metrics below can be calculated. For communications systems we are mainly interested in the bit error rate, which can be estimated by:
(N_data_points – TP) / N_data_points
TP = count value where predicted = actual, True Positives
FP = count value where predicted ≠ actual, False Positives
TN = actual 0, predicted 0, True Negatives
FN = actual 1, predicted 0, False Negatives
N = AP + AN, N data points
AP = FN + TP, All positives
AN = TN + FP, All Negatives
PP = FP + TP
PN = FN + TN
ACC = (TN + TP) / N
PREV = AP / N, prevalence (all positives for a category / N data points)
SENS = TP / AP, sensitivity
SPEC = TN / AN, specificity
PPV = TP / PP, positive predictive value
NPV = TN / PN, negative predictive value
Optimization and future work
Machine learning model performance can be improved in various ways. Increasing dataset size and variety can have a profound impact on performance: in general, the more data, the better the model. A larger dataset also enables splitting the data into training, validation, and test sets, supporting optimization on the train/validation data while maintaining a sufficiently large test set for final performance evaluation.

Model performance is also highly dependent on the features utilized. The feature set can be expanded by engineering additional features, and feature selection can improve performance by reducing model variance. Feature selection can be performed via recursive feature elimination in the context of cross validation: features are removed one at a time, the performance of the reduced-feature model is estimated on the validation folds, and the features that yield the best cross-validation performance are selected for the final model build.

Some performance improvement can also be gained from hyperparameter tuning. For a random forest, the number of trees (ntrees) and the number of features to try at each split (max_features) are two hyperparameters that can be tuned in the context of cross validation. In addition, the probability threshold used to assign the positive class can be optimized. Lastly, while random forests typically produce strong results, newer model types such as gradient boosted trees may perform better and can be tested. Classical logistic regression models can also be tested; while they don't always give the best performance, they provide unparalleled interpretability, as each feature weight represents both the magnitude and direction of that feature's effect on the outcome.
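The hyperparameter tuning step described above can be sketched with scikit-learn's cross-validated grid search. The data here is synthetic (one informative feature), and the grid values are illustrative, not the values used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic binary-class stand-in with one informative feature
rng = np.random.default_rng(0)
X = rng.standard_normal((600, 48))
y = rng.integers(0, 2, size=600)
X[:, 0] += y * 3.0

# tune the number of trees and max_features via 3-fold cross validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100],
                "max_features": ["sqrt", 0.25]},
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_   # hyperparameters with best cross-validation score
```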
Finally, models can be combined into ensembles for a further boost in performance. This is a common technique when compute power and speed are not critical, e.g. when the models do not have to perform real-time prediction, or at slower symbol rates.