Thursday, February 27, 2020

Categorical Estimation of QAM symbols from time series data using machine learning


Date: 2/26/2020

Classifiers are routinely used to estimate probabilities and categorical outcomes. In communications systems, receivers are called on to demodulate symbols and map them back to data. A common machine learning algorithm for time series data is the convolutional neural network (CNN). CNNs can be computationally intensive, since each node carries a weight and processing the data requires a multiplication and an activation function per node. Decision trees are simpler: each decision uses a compare operation. Ensembles of trees can be grouped for classification of outcomes.

Description of the model building and performance evaluation

The overall process of building a machine learning model consisted of several steps. First, the time series data is read in and normalized to a working maximum value. The samples are then converted from serial to parallel form for input to the classifier and for training. Features are generated from early, on-time and late samples, grouped by bit time, with the categorical value of the on-time samples used as labels. 24 features are used for the in-phase rail and 24 features for the quadrature rail, for a total of 48 features. The model is built assuming time synchronization has already occurred (carrier lock and bit synchronization).
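The serial-to-parallel grouping described above can be sketched in a few lines of NumPy. This is a hypothetical helper (the original code is not shown), assuming 8 samples per symbol and that each row holds the early, on-time and late symbol windows:

```python
import numpy as np

def extract_features(i_samples, q_samples, sps=8):
    """Group serial I and Q sample streams into per-symbol feature rows.

    Each row holds the early (k-1), on-time (k) and late (k+1) symbol
    windows for both rails: 3 windows x 8 samples x 2 rails = 48 features,
    matching the i#s# / q#s# feature names listed later.
    """
    n_sym = len(i_samples) // sps
    rows = []
    for k in range(1, n_sym - 1):        # edge symbols lack an early/late window
        i_win = i_samples[(k - 1) * sps:(k + 2) * sps]   # 24 in-phase features
        q_win = q_samples[(k - 1) * sps:(k + 2) * sps]   # 24 quadrature features
        rows.append(np.concatenate([i_win, q_win]))
    return np.array(rows)

# quick shape check on a dummy 10-symbol stream
i = np.arange(80.0)
q = -np.arange(80.0)
X = extract_features(i, q)
print(X.shape)   # (8, 48): 10 symbols minus the two edge symbols
```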

The classifier was run with three data sets; the QAM data was generated with a 3 dB SNR. The first data set was composed of 500 data points (QAM symbols) split into train (80%) and test (20%) sets for initial testing. The 2nd data set was composed of 1,000,000 data points, with 250,000 used for training and 750,000 used for test. The 3rd data set was 1,250,000 symbols, with 250,000 used for training and 1,000,000 used for test. The data input to the classifier is synthesized through simulation at 8 samples per symbol.



Figure 1 in-phase filtered data shown with binary category

Figure 1 shows the in-phase (real) part of the symbol information after filtering with an RRC filter with alpha equal to 0.5. The filter kernel is also shown in red dots. The input data and its up-sampled version are shown with red dots and blue lines. After the symbol data is generated, AWGN is added to the signal. For the small data set used for train and test, the confusion matrix showed no errors (false positives or false negatives) at an SNR of 3 dB.
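The pulse shaping and noise addition can be sketched as below. This is a minimal illustration, assuming 4-QAM (QPSK) symbols, an RRC filter with alpha = 0.5 spanning 6 symbols, and a 3 dB SNR; the original simulation details may differ:

```python
import numpy as np

def rrc_kernel(beta=0.5, sps=8, span=6):
    """Root-raised-cosine taps; time is in symbol periods, span in symbols."""
    t = np.arange(-span * sps // 2, span * sps // 2 + 1) / sps
    h = np.empty_like(t)
    for n, tn in enumerate(t):
        if np.isclose(tn, 0.0):
            h[n] = 1.0 + beta * (4 / np.pi - 1)
        elif np.isclose(abs(tn), 1 / (4 * beta)):
            h[n] = (beta / np.sqrt(2)) * (
                (1 + 2 / np.pi) * np.sin(np.pi / (4 * beta))
                + (1 - 2 / np.pi) * np.cos(np.pi / (4 * beta)))
        else:
            h[n] = (np.sin(np.pi * tn * (1 - beta))
                    + 4 * beta * tn * np.cos(np.pi * tn * (1 + beta))) \
                   / (np.pi * tn * (1 - (4 * beta * tn) ** 2))
    return h / np.sqrt(np.sum(h ** 2))    # normalize to unit energy

rng = np.random.default_rng(0)
sps = 8
syms = rng.choice([-1, 1], 500) + 1j * rng.choice([-1, 1], 500)  # 4-QAM symbols
up = np.zeros(len(syms) * sps, dtype=complex)
up[::sps] = syms                           # zero-stuff to 8 samples/symbol
tx = np.convolve(up, rrc_kernel(), mode="same")    # RRC pulse shaping

snr_db = 3.0
sig_pow = np.mean(np.abs(tx) ** 2)
noise_pow = sig_pow / 10 ** (snr_db / 10)
noise = np.sqrt(noise_pow / 2) * (rng.standard_normal(len(tx))
                                  + 1j * rng.standard_normal(len(tx)))
rx = tx + noise                            # received samples at 3 dB SNR
```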


Figure 2 Constellation plot of IQ symbols with AWGN noise added 3dB SNR

Figure 2 shows the constellation plot of the IQ symbols after filtering and addition of AWGN. Figure 3 shows the same data (real valued, red dots and blue lines) as a time series, with the categorical output for the IQ data in green, at an SNR of 3 dB.


Figure 3 Time Series In-phase Sampled Data with Categorical Output


The model type selected for the proof of concept studies was a random forest classifier. A random forest is an ensemble of classification trees (more broadly, decision trees) developed to reduce the overfitting typically observed with single decision trees. Each tree is constructed by randomly selecting N training samples with replacement, where N is the total number of training samples. Moreover, each tree randomly uses only a subset of the features, the default being the square root of the total number of features M. A tree is grown to the largest extent possible (node purity) without any pruning. To predict a new sample, the values of its features flow down each tree, and the final vote is the majority class predicted by the collection of trees. The random selection of both samples and features produces loosely correlated trees and makes the random forest one of the most effective model types for a variety of ML applications. In general, random forest models show excellent performance metrics, rarely overfit, are able to model non-linear relationships, and handle large numbers of noisy and correlated features. Moreover, given that some samples remain “out of bag” for a given tree, and a given tree does not utilize all the features, the importance of each feature can be estimated for the random forest, reflecting the predictive value of that feature. Thus, random forest models enable both high performance and a level of interpretability.
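A minimal sketch of the classifier build using scikit-learn's RandomForestClassifier; the data here is a random stand-in for the 48 extracted features, not the actual simulated symbols:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Random stand-in data: 500 symbols x 48 features, 4 classes ('00'..'11')
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 48))
y = rng.integers(0, 4, size=500)

# 80/20 train/test split, as used for the first data set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 100 trees; max_features='sqrt' tries ~7 of the 48 features per split
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                             oob_score=True, random_state=0)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)          # per-class probabilities, shape (100, 4)
importances = clf.feature_importances_   # per-feature predictive value, shape (48,)
```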

For this proof of concept study the number of trees was set to 100 and the number of features was left at the default, the square root of the total number of features (sqrt(48) ≈ 7). The performance metrics calculated on the test data were as follows: accuracy, sensitivity, specificity, positive predictive value, negative predictive value and area under the receiver operating characteristic curve. Definitions are shown below. The predicted class and corresponding performance metrics were reported for both the 0.5 probability threshold and the optimal probability threshold derived from the ROC curve. The optimal probability threshold is the threshold corresponding to the maximal difference between the true positive rate (sensitivity) and the false positive rate (1 – specificity). This represents the point on the curve that is closest to the upper left corner of the plot. In addition to these metrics, the predicted probability can be used to rank the symbol decisions, serving as a soft confidence score for each demodulated symbol.
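The optimal-threshold selection can be illustrated as below; the scores are synthetic stand-ins, and the quantity maximized (TPR − FPR) is Youden's J statistic:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic stand-in: binary labels with partially separated score distributions
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, size=1000), 0, 1)

auc = roc_auc_score(y_true, scores)
fpr, tpr, thresholds = roc_curve(y_true, scores)
j = tpr - fpr                       # Sensitivity - (1 - Specificity)
best_threshold = thresholds[np.argmax(j)]   # point closest to the upper left corner
```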

Features used:

'i0s0', 'i0s1', 'i0s2', 'i0s3', 'i0s4', 'i0s5', 'i0s6', 'i0s7',
'i1s0', 'i1s1', 'i1s2', 'i1s3', 'i1s4', 'i1s5', 'i1s6', 'i1s7',
'i2s0', 'i2s1', 'i2s2', 'i2s3', 'i2s4', 'i2s5', 'i2s6', 'i2s7',
'q0s0', 'q0s1', 'q0s2', 'q0s3', 'q0s4', 'q0s5', 'q0s6', 'q0s7',
'q1s0', 'q1s1', 'q1s2', 'q1s3', 'q1s4', 'q1s5', 'q1s6', 'q1s7',
'q2s0', 'q2s1', 'q2s2', 'q2s3', 'q2s4', 'q2s5', 'q2s6', 'q2s7'

The prefix ‘i’ or ‘q’ denotes an in-phase or quadrature data sample. The middle number denotes the bit time (the early, on-time and late symbol windows). The suffix ‘s#’ is the sample number within the symbol.


Definition of metrics:

Confusion matrices were generated and are shown in the tables below for the categorical outputs of the 2nd and 3rd data sets. The categories represent IQ values of ‘00’, ‘01’, ‘10’ and ‘11’. A zero bit represents ‘-1’ in the IQ constellation.

The confusion matrices shown in Tables 1 and 2 give an indication of the bit error rate (BER) and where the errors occur. The anti-diagonal entries, where both bits are flipped, show zero errors. These entries represent both bits flipping on the constellation, i.e. a phase change of 180 degrees. For the data sets run at 3 dB SNR, this suggests the classifier would show no errors on a BPSK constellation.







From the above confusion matrices the metrics below can be calculated. For communications systems we are mainly interested in the bit error rate. It can be estimated by:

BER ≈ (N_data_points – TP) / N_data_points

TP = count value where predicted = actual, True Positives
FP = count value where predicted != actual, False Positives
TN = actual 0, predicted 0, True Negatives
FN = actual 1, predicted 0, False Negatives
N = AP + AN, N data points
AP = FN + TP, All positives
AN = TN + FP, All Negatives
PP = FP + TP
PN = FN + TN
ACC = (TN + TP) / N
PREV = All positives for a category / N data points
SENS = TP / AP, sensitivity
SPEC = TN / AN, specificity
PPV = TP / PP, positive predictive value
NPV = TN / PN, negative predictive value
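The definitions above can be computed directly from a multi-class confusion matrix, treating each category one-vs-rest; the matrix values here are illustrative, not taken from the actual runs:

```python
import numpy as np

def metrics_from_confusion(cm):
    """One-vs-rest metrics per category from a KxK confusion matrix
    (rows = actual, columns = predicted), per the definitions above."""
    n = cm.sum()
    per_class = {}
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fn = cm[k, :].sum() - tp      # actual k, predicted something else
        fp = cm[:, k].sum() - tp      # predicted k, actually something else
        tn = n - tp - fn - fp
        per_class[k] = {"SENS": tp / (tp + fn), "SPEC": tn / (tn + fp),
                        "PPV": tp / (tp + fp), "NPV": tn / (tn + fn)}
    acc = np.trace(cm) / n
    err = (n - np.trace(cm)) / n      # (N_data_points - TP) / N_data_points
    return acc, err, per_class

# Illustrative matrix for the four categories '00', '01', '10', '11'
cm = np.array([[90,  5,  5,  0],
               [ 4, 92,  0,  4],
               [ 6,  0, 90,  4],
               [ 0,  5,  5, 90]])
acc, err, per_class = metrics_from_confusion(cm)
print(acc, err)   # 0.905 0.095
```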

Optimization and future work

Machine learning model performance can be improved in various ways. Increasing dataset size and variety can have a profound impact: in general, the more data, the better the performance. A larger dataset also enables splitting the data into training, validation and test sets, allowing various optimization techniques on the train/validation data while maintaining a sufficiently large test set for final performance evaluation.

Model performance is also highly dependent on the features the models use. The feature set can be expanded by engineering additional features. Moreover, feature selection can improve performance by reducing model variance. Feature selection can be performed via recursive feature elimination in the context of cross-validation: features are removed one at a time, the performance of the reduced-feature model is estimated on the validation folds, and the features that give the best cross-validation performance are selected for the final model build.

Some performance improvement can also come from hyperparameter tuning. In the case of random forest, the number of trees (ntrees) as well as the number of features to try (max_features) are two hyperparameters that can be tuned in the context of cross-validation. In addition, the probability threshold used to assign a class can be optimized.

While random forests typically generate excellent results, newer model types such as gradient boosted trees may provide better results and can be tested. Classical logistic regression models can also be tested; while such models don’t always give the best performance, they provide unparalleled interpretability, as each feature weight represents both the magnitude and direction of that feature’s effect on the outcome.

Lastly, models can be combined into ensembles to provide a final boost in performance. This is a common technique when compute power and speed are not critical (e.g. when the models don’t have to perform real-time prediction, or at slower symbol rates).
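Hyperparameter tuning with cross-validation could be sketched as follows using scikit-learn's GridSearchCV; the dataset and grid values are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: one informative feature among 48, binary outcome
rng = np.random.default_rng(3)
X = rng.standard_normal((300, 48))
y = (X[:, 0] + 0.5 * rng.standard_normal(300) > 0).astype(int)

# Tune tree count and features-per-split with 3-fold cross-validation
grid = {"n_estimators": [50, 100], "max_features": ["sqrt", 12]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```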





Monday, October 7, 2019

What is Phase Modulation?




Phase modulation is described by the following equation:

Phase modulated signal:    Xpm(t) = Ac * cos[2*pi*fc*t + m(t)]

Where m(t) is the modulating signal and kp is the modulation index (here applied within m(t)). When digital information is presented as fixed discrete phase offsets of the modulating signal, such as 0, pi/4, 3pi/4, …, the phase modulation is referred to as phase shift keying and is a form of digital modulation. When the modulating signal is continuous, Xpm(t) is a form of linear modulation. For example, if the modulating signal is:

Modulating signal:  m(t) = kp * sin[2*pi*Fpm*t + x(t)]

When m(t) is itself a modulated signal, it is referred to as a sideband, where Fpm is the sideband frequency.

In the figure the carrier Fc is phase modulated with m(t), with x(t) set to zero. The modulating signal is a sine function. Since kp equals pi/2, when sin() equals 1 the phase is advanced by 90 degrees, and when sin() equals -1 the phase is retarded by 90 degrees.

Figure 2 Time Domain Plot of Phase Modulated Signal

Figure 2 shows a phase modulated signal where
  •          kp is pi/2 radians
  •          m(t)  = kp * sin(2pi * Fpm * t),  x(t) = 0
  •          Fc = 1MHz
  •          Fpm = 100KHz
Note how the modulated signal advances by 90 degrees when the modulating signal is high and retards by 90 degrees when it is low.

Taking a 2^16 bin FFT of the real-valued signal gives us positive and negative frequency components. The spectrum of the modulated signal shows carrier peaks at +/-1 MHz and sideband peaks at multiples of 100 kHz away from Fc. Now that we have an understanding of the linear phase modulated signal, how do we get our information out of it?

 
Figure 3 Frequency Domain Plot (65K point FFT) of real valued Phase Modulated Signal

Demodulating Linear PM signals

There are multiple methods of extracting the modulating signal from a PM source. We will show one method using a Hilbert transform and an arctangent calculation with complex math. To obtain the complex-valued (analytic) waveform, we remove the negative-frequency image by performing a Hilbert transform on the modulated waveform. Once the complex-valued waveform is obtained, a frequency rotation or down conversion is performed. A 3-dimensional plot shows the variation in time (z-axis) of the PM signal with respect to the carrier. Both signals rotate around the IQ diagram at a rate of 1 MHz.

The carrier signal is used as our reference. Comparing the modulated signal to the carrier, we notice a difference in time; this is our phase difference. The modulated signal is advanced or retarded for positive or negative modulating values.

 
Figure 4 Time Domain 3d Plot of PM signal and Carrier

We can remove the carrier by translating the modulated signal to zero IF. Doing this stops the carrier and modulation waveforms from rotating on the IQ diagram: the carrier becomes a DC component at zero frequency. After translation to zero frequency the 3d plot looks like this:

 
Figure 5 Modulating Signal m(t) translation to zero frequency (DC)

At this point the signal can be recovered by taking the arctangent of the quadrature and in-phase components:

m(t) = arctan( Q(t) / I(t) )

Note: the signal recovered below contains a slight phase offset with respect to the original signal.


 
Figure 6 Arctan of I/Q signal (Kp=pi/2)