Mini Project 6

Watchanan Chantapakul (wcgzm)


This miniproject has only one question, which is a bit longer than the questions in previous assignments; however, it also builds on the previous assignments. So, at the end of the day, you should be able to "borrow" a lot of your own code from before and finish this assignment quite easily. In this experiment, you are to compare several different classification approaches:

  1. using complete knowledge of the statistics of the data and computing optimum discriminant functions for the classification (Chapter 2);
  2. assuming that you know only the model of the distributions, but not their parameters (Chapter 3);
  3. by first reducing the dimensionality of the data set and then classifying it (Chapter 3);
  4. assuming that you do not know the underlying distributions and employing Parzen Window (Chapter 4);
  5. assuming that you do not know the underlying distributions and employing k-NN (Chapter 4);
  6. assuming that you do not know anything, but the class labels (Chapter 5).

Generate dataset and save it to 'data/mp6.npy' file

Initially, you must create four data sets according to the information below, but next week, you will be given the actual datasets with which you will write your final reports. The datasets that you will create are 'trivial', but they will be useful for debugging your programs. The ones that I will provide will be more challenging.

While using your datasets, the four testing/training data sets described here MUST be kept the same throughout all parts below. That is, you should create your data points ONCE, save them, and use them for all parts and subparts below. Do NOT create different data for each question or you may end up with very strange results!!

The four data sets will be referred to as: 1) Training Data I, with 50 samples in each class; 2) Training Data II, with 500 samples in each class; 3) Testing Data I, also with 500 samples in each class; and finally 4) Testing Data II, with 10,000 samples in each class.¹

All data sets must consist of 5-d points divided into 3 classes with the underlying normal distributions $p(\vec{x} | \omega_i) = \mathsf{N}(\vec{\mu}_i, \Sigma_i)$, where:

$$
\begin{align*}
\vec{\mu}_1 &= \begin{bmatrix}2 & 3 & 1 & 5.5 & 8.7\end{bmatrix}^{\mathsf{T}}\\
\vec{\mu}_2 &= \begin{bmatrix}-4.5 & 6 & -1 & 3 & 10\end{bmatrix}^{\mathsf{T}}\\
\vec{\mu}_3 &= \begin{bmatrix}1.2 & -2.3 & 1.5 & -0.5 & 2.7\end{bmatrix}^{\mathsf{T}}
\end{align*}
$$

$$
\mathbf{\Sigma}_1 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 0.5 & 0 & 0 & 0 \\
0 & 0 & 2.5 & 0 & 0 \\
0 & 0 & 0 & 0.7 & 0 \\
0 & 0 & 0 & 0 & 3.5
\end{bmatrix}
\quad
\mathbf{\Sigma}_2 = \begin{bmatrix}
2 & 0 & 1 & 0.5 & 0 \\
0 & 3.5 & 0 & 0 & 0.6 \\
1 & 0 & 4.5 & 1.2 & 0 \\
0.5 & 0 & 1.2 & 1.6 & 0 \\
0 & 0.6 & 0 & 0 & 2.5
\end{bmatrix}
\quad
\mathbf{\Sigma}_3 = \begin{bmatrix}
4.2 & 0 & 1.3 & 2.5 & 1.4 \\
0 & 5 & 0 & 0 & 3.6 \\
1.3 & 0 & 4.5 & 4.2 & 0 \\
2.5 & 0 & 4.2 & 5.6 & 0 \\
1.4 & 3.6 & 0 & 0 & 7.5
\end{bmatrix}
$$

You will assume that all states of nature are equally probable.
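For reference, here is a minimal NumPy sketch of how such data sets could be generated and saved; the dictionary layout, variable names, and file path are assumptions for illustration, not necessarily the exact code used in this notebook.

```python
# Sketch of dataset generation (assumed layout; adjust to your own conventions).
import os
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

mu = [np.array([2, 3, 1, 5.5, 8.7]),
      np.array([-4.5, 6, -1, 3, 10]),
      np.array([1.2, -2.3, 1.5, -0.5, 2.7])]
sigma = [np.diag([1, 0.5, 2.5, 0.7, 3.5]),
         np.array([[2, 0, 1, 0.5, 0],
                   [0, 3.5, 0, 0, 0.6],
                   [1, 0, 4.5, 1.2, 0],
                   [0.5, 0, 1.2, 1.6, 0],
                   [0, 0.6, 0, 0, 2.5]]),
         np.array([[4.2, 0, 1.3, 2.5, 1.4],
                   [0, 5, 0, 0, 3.6],
                   [1.3, 0, 4.5, 4.2, 0],
                   [2.5, 0, 4.2, 5.6, 0],
                   [1.4, 3.6, 0, 0, 7.5]])]

def make_set(n_per_class):
    """Draw n_per_class samples from each of the three Gaussians."""
    X = np.vstack([rng.multivariate_normal(m, s, n_per_class)
                   for m, s in zip(mu, sigma)])
    y = np.repeat(np.arange(3), n_per_class)
    return X, y

datasets = {'train1': make_set(50), 'train2': make_set(500),
            'test1': make_set(500), 'test2': make_set(10000)}
os.makedirs('data', exist_ok=True)
np.save('data/mp6.npy', datasets, allow_pickle=True)
```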


1 The sizes of the data sets above are not typical to real-life classification problems. The choices above were made solely to create interesting situations for discussion in your report.


* Comment out the code for generating the dataset (run once for the first time)

Load dataset from 'data/mp6.npy' file

Cross-check the defined mean vectors and covariance matrices with the question.

Confusion Matrix

All confusion matrices in this miniproject have the following structure:

|            | (predicted) 1 | (predicted) 2 | (predicted) 3 |
|------------|---------------|---------------|---------------|
| (actual) 1 | -             | -             | -             |
| (actual) 2 | -             | -             | -             |
| (actual) 3 | -             | -             | -             |
Table 1: Confusion matrix format with the actual classes in rows and the predicted classes in columns.
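A minimal sketch of how a confusion matrix in this format could be computed; the function name is illustrative.

```python
# Build a confusion matrix in the Table 1 format:
# rows = actual class, columns = predicted class.
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Accuracy is the trace divided by the total number of samples:
# acc = np.trace(cm) / cm.sum()
```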

Part I

Here, you will first (in a and b) use complete knowledge about the data. Then (in c and d), you will "forget" that you know the means and covariances, and use ML to estimate them.

a) Classify the Testing Data I, using the given statistics above and the Bayes decision rule. Compute the confusion matrix.

We have a priori knowledge that the distributions are Gaussian. So, we can define the discriminant function for the normal density in the following generic form: $$ g_i(\vec{x}) = -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^{\mathsf{T}}\mathbf{\Sigma}_i^{-1}(\vec{x}-\vec{\mu}_i)-\frac{d}{2}\ln (2\pi) - \frac{1}{2} \ln |\mathbf{\Sigma}_i| + \ln P(\omega_i) $$ for $d$-dimensional data with mean $\vec{\mu}_i$, covariance matrix $\mathbf{\Sigma}_i$, and prior probability $P(\omega_i)$ of class $i$.

Since every class has the same number of samples, the priors are equal and the term $\ln P(\omega_i)$ can be discarded. Also, since $d$ is a constant, we can drop the constant term $- \frac{d}{2} \ln (2\pi)$ as well.

We classify a sample $\vec{x}$, based on the Bayes decision rule, to be class $\omega_i$ if $g_i(\vec{x}) > g_j(\vec{x})$, $\forall j \neq i$. This can also be written in the form of argmax as follows: $$ \hat{\omega} = \underset{i}{\mathrm{argmax}} g_i(\vec{x}) $$
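A minimal sketch of this classifier, assuming the known means and covariances are given as lists of NumPy arrays; the constant term and the equal priors are omitted, as discussed above, and the names are illustrative.

```python
# Quadratic discriminant g_i(x) with known parameters and the argmax decision rule.
import numpy as np

def discriminants(X, means, covs):
    """Return an (n_samples, n_classes) matrix of g_i(x) values."""
    scores = []
    for m, S in zip(means, covs):
        S_inv = np.linalg.inv(S)
        diff = X - m                                   # (n, d)
        quad = np.einsum('nd,de,ne->n', diff, S_inv, diff)  # Mahalanobis term
        g = -0.5 * quad - 0.5 * np.log(np.linalg.det(S))
        scores.append(g)
    return np.column_stack(scores)

def bayes_classify(X, means, covs):
    # Assign each sample to the class with the largest discriminant value.
    return np.argmax(discriminants(X, means, covs), axis=1)
```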

b) Repeat part a) for the Testing Data II.

c) Compute the ML estimates ($\hat{\vec{\mu}}_i$ and $\hat{\Sigma}_i$) for each class using the Training Data I and classify the Testing Data II using Bayes decision rule. Compute the confusion matrix.

According to maximum likelihood estimation (MLE) for a Gaussian distribution, the estimated mean vector of class $\omega_i$ can be computed as $$ \hat{\vec{\mu}}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \vec{x}_k $$ where $n_i$ is the number of training samples in class $\omega_i$.

The estimated unbiased covariance matrix of class $\omega_i$ is given by: $$ \hat{\mathbf{\Sigma}}_i = \frac{1}{n_i-1} \sum_{k=1}^{n_i} (\vec{x}_k - \hat{\vec{\mu}}_i) (\vec{x}_k - \hat{\vec{\mu}}_i)^{\mathsf{T}} $$
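A minimal sketch of these ML estimates with NumPy; note that `np.cov` uses the unbiased $1/(n_i-1)$ normalization by default.

```python
# Per-class ML estimates of the mean and (unbiased) covariance.
import numpy as np

def mle_estimates(X, y, n_classes=3):
    means, covs = [], []
    for c in range(n_classes):
        Xc = X[y == c]
        mu_hat = Xc.mean(axis=0)               # sample mean of class c
        sigma_hat = np.cov(Xc, rowvar=False)   # unbiased covariance (1/(n-1))
        means.append(mu_hat)
        covs.append(sigma_hat)
    return means, covs
```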

d) Repeat part c), but this time use the Training Data II for the ML and then classify the Testing Data II again.

e) Make comments on each of the results above and then compare them (’compare’ does not mean to say “this was better than that”, but to say why that was the case).

| Part | Question | Algorithm | Training set | Accuracy on Testing Data I | Accuracy on Testing Data II |
|------|----------|-----------|--------------|----------------------------|-----------------------------|
| I | a, b | BDR | A priori knowledge | 99.87% | 99.89% |
| I | c | BDR | Training Data I | - | 99.81% |
| I | d | BDR | Training Data II | - | 99.88% |
Table 2: Part I classification results

First of all, we use the a priori knowledge about the data given in the instructions, i.e., we know exactly how each sample is drawn from its distribution. In this case, we do not need the training data at all, and the accuracies for questions (a) and (b) are therefore very high on both testing sets. For questions (c) and (d), instead of using the a priori knowledge, we estimate the parameters of the Gaussian distributions using maximum likelihood estimation (MLE). Since the estimated parameters are not the true parameters, the accuracy on the testing sets should drop. For question (c), in which we train on Training Data I (the smaller training set), the accuracy drops from 99.89% to 99.81%. If we instead train on Training Data II (more training samples), the estimated parameters are more accurate, which results in a higher accuracy of 99.88% (versus 99.81% from Training Data I). As expected, the accuracy does not exceed the 99.89% obtained with complete knowledge of the distributions.


Part II

In this part, you will reduce the dimensionality of the data by applying MDA. That is,

a) Using MDA on the Training Data I, find the matrix W such that $\vec{y}$ = $W\vec{x}$. (What is the expected dimension of $\vec{y}$?). For each class, compute the $\hat{\vec{\mu}}_i$ and $\Sigma_i$ of the new reduced-space variable $\vec{y}$.

The scatter matrix of class $i$ is defined by: $$ \mathbf{S}_i = \sum_{\vec{x} \in \mathcal{D}_i} (\vec{x} - \vec{m}_i)(\vec{x} - \vec{m}_i)^{\mathsf{T}} $$

The $d$-dimensional sample mean of class $i$ or $\vec{m}_i$ is given by: $$ \vec{m}_i = \frac{1}{n_i} \sum_{\vec{x} \in \mathcal{D}_i} \vec{x} $$ where $n_i$ is the number of training samples in class $i$.

The within-class scatter matrix is the sum of the scatter matrices of all classes: $$ \mathbf{S}_W = \sum_{i=1}^{C} \mathbf{S}_i $$

The between-class scatter matrix can be computed by: $$ \mathbf{S}_B = \sum_{i=1}^{C} n_i (\vec{m}_i - \vec{m}) (\vec{m}_i - \vec{m})^{\mathsf{T}} $$ where $\vec{m}$ is the total mean vector defined as: $$ \vec{m} = \frac{1}{n} \sum_{\vec{x}} \vec{x} = \frac{1}{n} \sum_{i=1}^{C} n_i \vec{m}_i $$

We then solve for the eigenvalues and eigenvectors of $\mathbf{S}_W^{-1} \mathbf{S}_B$: $$ \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w} $$

For the $C$-class classification problem, we select the eigenvectors corresponding to the largest $C-1$ non-zero eigenvalues as the columns of the weight matrix $\mathbf{W}$.

With the $d$-by-$(C-1)$ weight matrix $\mathbf{W}$, we can project from the $d$-dimensional space to a $(C-1)$-dimensional space by taking the dot product between a sample $\vec{x}$ and the weight matrix $\mathbf{W}$. In this case, the expected dimension of $\vec{y}$ after the projection is $C-1 = 2$ (since $C=3$ is the number of classes for this dataset).
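A minimal sketch of this MDA computation, assuming NumPy arrays `X` (samples) and `y` (labels); this is one possible implementation, not necessarily the exact code used in this notebook.

```python
# MDA: build S_W and S_B, solve the eigenproblem for S_W^{-1} S_B,
# and keep the eigenvectors of the largest C-1 eigenvalues as W.
import numpy as np

def mda_weights(X, y, n_classes=3):
    d = X.shape[1]
    m = X.mean(axis=0)                       # total mean vector
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in range(n_classes):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                 # class mean
        S_W += (Xc - mc).T @ (Xc - mc)       # class scatter matrix
        diff = (mc - m).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)     # between-class contribution
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]   # sort by decreasing eigenvalue
    W = eigvecs[:, order[:n_classes - 1]].real   # d x (C-1)
    return W

# Projection: Y = X @ W   (equivalently y = W^T x for a single sample)
```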

This can also be verified numerically by taking the dot product between a sample $\vec{x}$ and the weight matrix $\mathbf{W}$ and checking the dimension of the result.

The mean vectors and covariance matrices are estimated just as in Part I, but on the transformed samples.

b) Apply the transformation W above to the Testing Data II and classify it using Bayes decision rule. Compute the confusion matrix.

The transformation is simply the matrix-vector product as follows: $$ \vec{y} = \mathbf{W}^{\mathsf{T}} \vec{x} $$

We then classify the transformed sample $\vec{y}$ using the discriminant function from Part I.

c) Repeat part a) above, but this time use the Training Data II to compute the matrix W and the $\hat{\vec{\mu}}_i$ and $\hat{\Sigma}_i$ for each class of the new reduced-space variable $\vec{y}$. Then classify the Testing Data II again. Compute the confusion matrix.

As in part a), the $d$-by-$(C-1)$ weight matrix $\mathbf{W}$ projects each sample into a $(C-1) = 2$ dimensional space, and the mean vectors and covariance matrices are re-estimated on the transformed samples, this time computed from the Training Data II.

d) Comment on the results from this Part II and then compare these results with the results from the previous Part I.

| Part | Question | Algorithm | Training set | Accuracy on Testing Data I | Accuracy on Testing Data II |
|------|----------|-----------|--------------|----------------------------|-----------------------------|
| I | a, b | BDR | A priori knowledge | 99.87% | 99.89% |
| I | c | BDR | Training Data I | - | 99.81% |
| I | d | BDR | Training Data II | - | 99.88% |
| II | b | MDA | Training Data I | - | 99.51% |
| II | c | MDA | Training Data II | - | 99.59% |
Table 3: Part II classification results

Training Data I has a smaller number of samples than Training Data II, which again makes the parameter estimation less accurate. The point of Part II is to apply multiple discriminant analysis (MDA), i.e., a projection from the $d$-dimensional space to a $(C-1)$-dimensional space, which reduces the dimensionality of the feature space. Of course, since the number of features is reduced from 5 to 2 in this case, we lose some accuracy. However, it is a decent trade-off, as the accuracies drop by less than 1% compared to Part I. With the reduced dimensionality, we can apply classification algorithms more efficiently, i.e., at a lower computational cost.


Part III

Now, you will completely forget that you know anything about any of the distributions and/or their parameters and apply a non-parametric approach to classification.

a) First, you will apply Parzen Window to each class in the Testing Data II using the Training Data I and a hypercube window function with $h_n = 0.7$. You must classify the Testing Data II according to the maximum posterior probability. Compute the confusion matrix. Repeat the classification for $h_n = 0.1$ and $h_n = 5$.

Based on this density estimation method, we can estimate an unknown probability density function without knowing its true form or its parameters. A $d$-dimensional hypercube $\mathcal{R}_n$ is the region of interest, since it lets us estimate the class-conditional density of class $i$ at a point $\vec{x}$, $p_n(\vec{x} | \omega_i)$, as: $$ p_n(\vec{x} | \omega_i) = \frac{k_n}{n V_n} $$ where $n$ is the number of training samples used for the estimate and $V_n$ is the volume of the hypercube $\mathcal{R}_n$. By the definition of the hypercube, the volume is simply $$ V_n = h_n^d $$ where $h_n$ is the length of an edge of $\mathcal{R}_n$.

We then define $k_n$, the number of samples that fall inside the $d$-dimensional hypercube $\mathcal{R}_n$ centered at $\vec{x}$, as: $$ k_n = \sum_{j=1}^{n} \varphi \left( \frac{\vec{x} - \vec{x}_j}{h_n} \right) $$

Combining the equations above, the density estimate becomes: $$ p_n(\vec{x} | \omega_i) = \frac{k_n}{n V_n} = \frac{1}{n V_n} \sum_{j=1}^{n} \varphi \left( \frac{\vec{x} - \vec{x}_j}{h_n} \right) $$

Based on the decision rule for the $C$-class classification problem, a sample $\vec{x}$ is assigned the predicted class $\hat{\omega}$ by: $$ \hat{\omega} = \underset{i}{\mathrm{argmax}}\ p_n(\vec{x} | \omega_i)P(\omega_i) $$ In this project the priors are equal, so we can discard the prior term and arrive at the decision rule in terms of class-conditional densities: $$ \hat{\omega} = \underset{i}{\mathrm{argmax}}\ p_n(\vec{x} | \omega_i) $$

$\varphi(\cdot)$ can be any kernel (window) function. The standard hypercube kernel is defined as: $$ \varphi \left(\frac{\vec{x} - \vec{x}_j}{h_n} \right) = \begin{cases} 1 & \text{if}\ |x_\ell - x_{j\ell}| \leq \frac{h_n}{2}\ \text{for every dimension}\ \ell = 1, \dots, d \\ 0 & \text{otherwise} \end{cases} $$
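A minimal sketch of the hypercube Parzen-window estimate and the resulting classifier, assuming one training array per class; the names are illustrative, not the exact code of this notebook.

```python
# Parzen-window density estimation with a hypercube kernel and the
# resulting maximum-density classifier (equal priors).
import numpy as np

def parzen_hypercube_density(x, X_train, h):
    """Estimate p_n(x | class) from the training samples of one class."""
    n, d = X_train.shape
    u = np.abs(x - X_train) / h            # (n, d) scaled coordinate distances
    inside = np.all(u <= 0.5, axis=1)      # a sample counts only if inside in every dimension
    k = inside.sum()
    return k / (n * h ** d)

def parzen_classify(X_test, class_train_sets, h):
    preds = []
    for x in X_test:
        densities = [parzen_hypercube_density(x, Xc, h) for Xc in class_train_sets]
        preds.append(int(np.argmax(densities)))
    return np.array(preds)
```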

b) Repeat part a), but this time use the Training Data II for the Parzen Window and classify the same Testing Data II with, again, $h_n = 0.1$, $h_n = 0.7$, and $h_n = 5$.

c) Repeat part b) using a Gaussian kernel (a Gaussian window function) with $\sigma = 0.1$. Then repeat this part with $\sigma = 0.7$ and $\sigma = 5$.

In this question, we replace the hypercube kernel with a Gaussian kernel, given by $$ \varphi \left(\frac{\vec{x} - \vec{x}_j}{\sigma} \right) = \frac{1}{(\sqrt{2\pi})^d V_n} \exp{\left(-\frac{1}{2}\left\|\frac{\vec{x} - \vec{x}_j}{\sigma}\right\|^2\right)} $$ where $V_n = \sigma^d$.
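A minimal sketch of a Gaussian-window estimate; the normalization used here follows the standard convention and may differ by a constant from the formula above, but since that constant is the same for every class it does not affect the argmax decision.

```python
# Gaussian-kernel Parzen-window density estimate for one class.
import numpy as np

def parzen_gaussian_density(x, X_train, sigma):
    n, d = X_train.shape
    sq_dist = np.sum((x - X_train) ** 2, axis=1) / sigma ** 2   # ||x - x_j||^2 / sigma^2
    phi = np.exp(-0.5 * sq_dist) / ((2 * np.pi) ** (d / 2) * sigma ** d)
    return phi.sum() / n
```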

d) Comment on the results above. Compare them with the results from the previous Parts above (I and II).

| Part | Question | Algorithm | Training set | Accuracy on Testing Data I | Accuracy on Testing Data II |
|------|----------|-----------|--------------|----------------------------|-----------------------------|
| I | a, b | BDR | A priori knowledge | 99.87% | 99.89% |
| I | c | BDR | Training Data I | - | 99.81% |
| I | d | BDR | Training Data II | - | 99.88% |
| II | b | MDA | Training Data I | - | 99.51% |
| II | c | MDA | Training Data II | - | 99.59% |
| III | a | Parzen Window (Hypercube $h_n=0.1$) | Training Data I | - | 33.33% |
| III | a | Parzen Window (Hypercube $h_n=0.7$) | Training Data I | - | 33.39% |
| III | a | Parzen Window (Hypercube $h_n=5.0$) | Training Data I | - | 94.86% |
| III | b | Parzen Window (Hypercube $h_n=0.1$) | Training Data II | - | 33.33% |
| III | b | Parzen Window (Hypercube $h_n=0.7$) | Training Data II | - | 34.19% |
| III | b | Parzen Window (Hypercube $h_n=5.0$) | Training Data II | - | 99.16% |
| III | c | Parzen Window (Gaussian $\sigma=0.1$) | Training Data II | - | 93.14% |
| III | c | Parzen Window (Gaussian $\sigma=0.7$) | Training Data II | - | 94.83% |
| III | c | Parzen Window (Gaussian $\sigma=5.0$) | Training Data II | - | 96.73% |
Table 4: Part III classification results

In the very first setting, we set the width of the hypercube window to $h_n = 0.1$, which is far too small: virtually no test samples fall inside the Parzen windows placed at the training samples. Because the implementation uses an argmax, the classifier then always predicts class $\omega_1$, which results in only 33.33% accuracy. This is the case whether we use Training Data I or Training Data II; the results are identical.

When we increase the width of the hypercube kernel to $h_n = 0.7$, the windows cover just a few test samples, so the accuracy improves only marginally (33.39% with Training Data I). With Training Data II the accuracy is 34.19%: since there are more training samples, a test sample has a slightly higher chance of falling inside one of the windows.

But when we set the width of the hypercube kernel to $h_n = 5.0$, the window is wide enough to cover the regions where samples actually fall, and the classifier performs well on the test set: up to 94.86% accuracy with Training Data I, and 99.16% with the larger Training Data II. Because the Parzen window is built from a finite number of training samples, a small training set leaves many holes at the fringes of the densities (alongside local spikes). In short, more training samples lead to better performance because there are fewer such holes.

Once we switch to the Gaussian kernel, even a standard deviation of $\sigma=0.1$ produces a very high accuracy of 93.14%, because the Gaussian kernel has infinite support (unlike the hypercube kernel with its crisp window). Even if a test sample is far away from the training samples, every training sample still contributes some (possibly tiny) value to $\varphi \left(\frac{\vec{x} - \vec{x}_j}{\sigma} \right)$, and the decision rule only needs the class with the maximum density, however small the values are. Increasing $\sigma$ to 0.7 and 5.0 gives better results, 94.83% and 96.73% respectively, because the wider windows make a test sample more likely to receive substantial weight from the correct class.

Compared with the previous classification methods, the Parzen window appears less accurate. However, we need to keep in mind that this is a non-parametric technique that requires little a priori knowledge; instead, we have to choose a kernel function and its hyperparameter ($h_n$ or $\sigma$). The biggest problem is that we need to tune or search for the best setting, e.g., we do not know in advance what the value of $h_n$ should be, and doing so is prone to overfitting and generalization issues; cross-validation should be used to find a good setting. On the flip side, MLE estimates $\vec{\mu}$ and $\Sigma$ of a Gaussian distribution, and the results from Part I are very high because the assumed distribution matches the true distribution; if the true distribution were not Gaussian and we did not know it, the picture would be different. The same holds for Part II, where MDA provides a dimensionality reduction followed by a BDR.

Also note that, with a large number of training samples, the Parzen window takes longer to run, since classifying each test sample requires going through all training samples.


Part IV

Once again, you will forget that you know anything about any of the distributions and/or their parameters and apply another non-parametric approach.

a) Using the Training Data I, you will classify the Testing Data II using k-Nearest-Neighbour. Use $k_n = \sqrt{n}$.

Recall that the Parzen window method requires us to choose a kernel (window) function and its size. If we do not have that knowledge, we can use the k-nearest-neighbor algorithm instead. Unlike the Parzen window, which fixes the volume $V_n$, k-NN fixes the number of nearest neighbors $k_n$: we grow a cell around a sample $\vec{x}$ until it covers $k_n$ training samples, of which $k_i$ are labeled as class $\omega_i$. The estimated a posteriori probability is then: $$ p_n(\omega_i | \vec{x}) = \frac{k_i}{k_n} $$

To find the nearest neighbors we use a distance function $D$. In this project, the Euclidean distance, i.e., the Minkowski distance with $p=2$ in $d$-dimensional space, is used: $$ D(\vec{a}, \vec{b}) = \left( \sum_{j=1}^{d} (a_j - b_j)^2 \right)^{1/2} $$

Based on the Bayes decision rule, we can classify a sample $\vec{x}$ to be a class $\hat{\omega}$ by computing: $$ \hat{\omega} = \underset{i}{\mathrm{argmax}} p_n(\omega_i | \vec{x}) = \underset{i}{\mathrm{argmax}} k_i $$ which means we classify a sample $\vec{x}$ based on the most frequent class within the grown cell that covers $k_n$ samples.

In this question, since the training data I has 50 samples for each class, this means we have 150 training samples. We can estimate $k_n = \sqrt{n} = \sqrt{150} \approx 12$.
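A minimal sketch of the $k_n$-NN classifier described above, using a brute-force search over all training samples; the names are illustrative.

```python
# k_n-NN classifier: find the k_n nearest training samples (Euclidean distance)
# and vote by the most frequent class label (argmax over k_i).
import numpy as np

def knn_classify(X_test, X_train, y_train, k_n):
    preds = []
    for x in X_test:
        dist = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances
        nearest = np.argsort(dist)[:k_n]                # indices of the k_n nearest samples
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])         # majority vote
    return np.array(preds)

# Part a): k_n = round(np.sqrt(150)) = 12 with Training Data I
```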

b) Repeat part a), but this time use the Training Data II to classify the Testing Data II.

In this question, since the training data II has 500 samples for each class, this means we have 1,500 training samples. We can estimate $k_n = \sqrt{n} = \sqrt{1500} \approx 39$.

c) Repeat part b) using your choice for $k_n = f(n)$.

In this question, we define the number of $k_n$ nearest neighbors to be: $$ k_n = f(n) = \frac{\sqrt{n}}{2} $$

Hence, substituting $n = 1{,}500$, we arrive at: $$ k_n = \frac{\sqrt{n}}{2} = \frac{\sqrt{1500}}{2} \approx 19 $$

d) Comment on the results above. Compare them with the results from the previous Parts above, especially Part III.

| Part | Question | Algorithm | Training set | Accuracy on Testing Data I | Accuracy on Testing Data II |
|------|----------|-----------|--------------|----------------------------|-----------------------------|
| I | a, b | BDR | A priori knowledge | 99.87% | 99.89% |
| I | c | BDR | Training Data I | - | 99.81% |
| I | d | BDR | Training Data II | - | 99.88% |
| II | b | MDA | Training Data I | - | 99.51% |
| II | c | MDA | Training Data II | - | 99.59% |
| III | a | Parzen Window (Hypercube $h_n=0.1$) | Training Data I | - | 33.33% |
| III | a | Parzen Window (Hypercube $h_n=0.7$) | Training Data I | - | 33.39% |
| III | a | Parzen Window (Hypercube $h_n=5.0$) | Training Data I | - | 94.86% |
| III | b | Parzen Window (Hypercube $h_n=0.1$) | Training Data II | - | 33.33% |
| III | b | Parzen Window (Hypercube $h_n=0.7$) | Training Data II | - | 34.19% |
| III | b | Parzen Window (Hypercube $h_n=5.0$) | Training Data II | - | 99.16% |
| III | c | Parzen Window (Gaussian $\sigma=0.1$) | Training Data II | - | 93.14% |
| III | c | Parzen Window (Gaussian $\sigma=0.7$) | Training Data II | - | 94.83% |
| III | c | Parzen Window (Gaussian $\sigma=5.0$) | Training Data II | - | 96.73% |
| IV | a | $k_n$-NN ($k_n = \sqrt{n}$) | Training Data I | - | 98.59% |
| IV | b | $k_n$-NN ($k_n = \sqrt{n}$) | Training Data II | - | 99.39% |
| IV | c | $k_n$-NN ($k_n = \frac{\sqrt{n}}{2}$) | Training Data II | - | 99.58% |
Table 5: Part IV classification results

First of all, we compute $k_n = \sqrt{n} = \sqrt{150} \approx 12$. As can be seen from the results above, $12$-NN yields 98.59% accuracy when we use Training Data I. The accuracy rises to 99.39% when we increase the number of training samples by using Training Data II, i.e., 1,500 samples and therefore $39$-NN. Once we change $k_n$ to 19 based on our own function ($k_n = \frac{\sqrt{n}}{2}$), the accuracy increases to 99.58%. All of these results depend on the choice of $k_n$, which requires empirical experiments to find the best value. But we need to be careful about overfitting and generalization, since a low training error does not guarantee a small test error; cross-validation can mitigate this issue.

Interestingly, if we compare the results from $k_n$-NN and the Parzen window, all of the $k_n$-NN results are better. This suggests that, when the window function and its hyperparameter are unknown, as they are here, $k_n$-NN is the better choice: it fixes the number of nearest neighbors $k_n$, as opposed to the Parzen window, which fixes the volume $V_n$ (and therefore requires choosing the window and its width). In short, $k_n$-NN provides better classification results without any underlying knowledge about the data (e.g., the underlying distributions), and its performance increases as the number of training samples grows. However, as with the Parzen window, a large number of training samples also means a longer run time, since every classification must go through all training samples.


Part V

Once again, you will forget that you know anything about any of the distributions and/or their parameters and apply another non-parametric approach. This time, repeat Part IV above using a linear classifier and the Perceptron criterion – for part a) and c), you obviously cannot pick a value for $k_n$, so, instead, use $\eta = \frac{1}{2}$ for part a) and then use $\eta = \frac{1}{\sqrt{k}}$ for part c), where $k$ is the iteration.

a) Using the Training Data I, you will classify the Testing Data II using the Perceptron criterion. Use $\eta = \frac{1}{2}$.

We classify a sample $\vec{x}$ based on the linear discriminant function $g_i(\vec{x})$ of class $\omega_i$ which is given by $$ g_i(\vec{x}) = \vec{a}_i^{\mathsf{T}}\vec{y} $$ where $\vec{a}$ is a weight vector, and $\vec{y}$ is a sample or feature vector.

Therefore, our decision rule is to assign $\vec{x}$ to class $\omega_i$ if $g_i(\vec{x}) > g_j(\vec{x})$, $\forall j \neq i$.

The Perceptron Criterion function is defined as: $$ J(\vec{a}) = \sum_{\vec{y} \in \mathcal{Y}} \left( - \vec{a}^{\mathsf{T}} \vec{y} \right) $$ where $\mathcal{Y}$ is the set of misclassified samples.

We can derive the gradient of $J$ w.r.t. the weight vector $\vec{a}$ at iteration $k$ by: $$ \nabla J(\vec{a}(k)) = \frac{\partial J(\vec{a}(k))}{\partial \vec{a}(k)} = \sum_{\vec{y} \in \mathcal{Y}} (- \vec{y}) $$

Since the question does not specify the way to initialize weight vectors, we randomly initialize $C$ weight vectors at time step 0 by drawing from the Gaussian distribution $\vec{a}(0) \sim \mathcal{N}(\vec{\mu},\,\mathbf{\Sigma})\,$ where $\vec{\mu} = \vec{0}$ and $\mathbf{\Sigma} = \mathbf{I}$.

Note that the random seed is fixed, so that the result is reproducible.

In terms of training the weight vectors, the update rule is employed based on the delta rule and the above gradient. $$ \begin{align*} \vec{a}(k+1) &= \vec{a}(k) - \eta(k) \nabla J(\vec{a}(k))\\ &= \vec{a}(k) - \eta(k) \sum_{\vec{y} \in \mathcal{Y}} (- \vec{y})\\ &= \vec{a}(k) + \eta(k) \sum_{\vec{y} \in \mathcal{Y}} \vec{y} \end{align*} $$

In this miniproject, the stopping criterion for training with the Perceptron criterion is a maximum number of iterations, $K=10$. This means we train the weights of the model for 10 iterations and then stop.

The learning rate $\eta$ is given by the question (a) to be $\eta = \frac{1}{2}$.

Let us try including the bias $w_0$ in the weight vector $\vec{a}$.

According to the linear discriminant function defined above, $g_i(\vec{x}) = \vec{a}_i^{\mathsf{T}}\vec{y}$, in order to include a bias term $w_0$ in the weights, our weight vector $\vec{a}$ is modified to be: $$ \vec{a} = \begin{bmatrix} w_0\\ w_1\\ w_2\\ w_3\\ w_4\\ w_5 \end{bmatrix} $$ and the feature vector becomes an augmented feature vector, given by:

$$ \vec{y} = \begin{bmatrix} 1\\ x_1\\ x_2\\ x_3\\ x_4\\ x_5 \end{bmatrix} = \begin{bmatrix} 1\\ \vec{x} \end{bmatrix} $$

Everything else is the same.
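A hedged sketch of one possible multiclass Perceptron implementation with augmented features: the batch update above is adapted here so that each misclassified sample pulls its true-class weight vector toward it and pushes the wrongly winning class away. This is an assumption about the implementation, not necessarily the exact code used in this notebook.

```python
# One plausible multiclass Perceptron-criterion trainer (assumed variant):
# each misclassified sample adds +eta*y to its true class and -eta*y to the
# wrongly winning class, for K iterations.
import numpy as np

def train_perceptron(X, t, n_classes=3, K=10, eta_fn=lambda k: 0.5, seed=0):
    rng = np.random.default_rng(seed)                 # fixed seed, reproducible
    Y = np.hstack([np.ones((len(X), 1)), X])          # augmented features [1, x]
    A = rng.standard_normal((n_classes, Y.shape[1]))  # a_i(0) ~ N(0, I)
    for k in range(1, K + 1):
        eta = eta_fn(k)
        scores = Y @ A.T                              # g_i(x) for every sample
        pred = np.argmax(scores, axis=1)
        wrong = pred != t                             # misclassified set
        for y_vec, ti, pi in zip(Y[wrong], t[wrong], pred[wrong]):
            A[ti] += eta * y_vec                      # pull toward true class
            A[pi] -= eta * y_vec                      # push away from wrong class
        if not wrong.any():                           # already separates the training data
            break
    return A

def perceptron_classify(X, A):
    Y = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(Y @ A.T, axis=1)

# Part c): pass eta_fn=lambda k: 1 / np.sqrt(k) for the decaying learning rate.
```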

b) Repeat part a), but this time use the Training Data II to classify the Testing Data II.

The learning rate $\eta$ is given by the question (b) to be $\eta = \frac{1}{2}$ (same as the question (a)).

c) Repeat part b) using $\eta = \frac{1}{\sqrt{k}}$.

The learning rate $\eta$ is given by the question (c) to be $\eta = \frac{1}{\sqrt{k}}$.

Let us try to change our only hyperparameter here, the maximum number of iterations $K \in \{10, 20, 40, 60, 80\}$.

Let us try varying the maximum number of iterations $K \in \{10, 20, 40, 60, 80\}$ for the models with the bias term.

d) Comment on the results above. Compare them with the results from the previous Parts above.

| Part | Question | Algorithm | Training set | Accuracy on Testing Data I | Accuracy on Testing Data II |
|------|----------|-----------|--------------|----------------------------|-----------------------------|
| I | a, b | BDR | A priori knowledge | 99.87% | 99.89% |
| I | c | BDR | Training Data I | - | 99.81% |
| I | d | BDR | Training Data II | - | 99.88% |
| II | b | MDA | Training Data I | - | 99.51% |
| II | c | MDA | Training Data II | - | 99.59% |
| III | a | Parzen Window (Hypercube $h_n=0.1$) | Training Data I | - | 33.33% |
| III | a | Parzen Window (Hypercube $h_n=0.7$) | Training Data I | - | 33.39% |
| III | a | Parzen Window (Hypercube $h_n=5.0$) | Training Data I | - | 94.86% |
| III | b | Parzen Window (Hypercube $h_n=0.1$) | Training Data II | - | 33.33% |
| III | b | Parzen Window (Hypercube $h_n=0.7$) | Training Data II | - | 34.19% |
| III | b | Parzen Window (Hypercube $h_n=5.0$) | Training Data II | - | 99.16% |
| III | c | Parzen Window (Gaussian $\sigma=0.1$) | Training Data II | - | 93.14% |
| III | c | Parzen Window (Gaussian $\sigma=0.7$) | Training Data II | - | 94.83% |
| III | c | Parzen Window (Gaussian $\sigma=5.0$) | Training Data II | - | 96.73% |
| IV | a | $k_n$-NN ($k_n = \sqrt{n}$) | Training Data I | - | 98.59% |
| IV | b | $k_n$-NN ($k_n = \sqrt{n}$) | Training Data II | - | 99.39% |
| IV | c | $k_n$-NN ($k_n = \frac{\sqrt{n}}{2}$) | Training Data II | - | 99.58% |
| V | a | Perceptron Criterion ($\eta=\frac{1}{2}$), $K=25$ | Training Data I | - | 97.98% |
| V | b | Perceptron Criterion ($\eta=\frac{1}{2}$), $K=25$ | Training Data II | - | 98.24% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=25$ | Training Data II | - | 98.22% |
| V | a | Perceptron Criterion ($\eta=\frac{1}{2}$), $K=25$, with bias | Training Data I | - | 98.31% |
| V | b | Perceptron Criterion ($\eta=\frac{1}{2}$), $K=25$, with bias | Training Data II | - | 98.36% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=25$, with bias | Training Data II | - | 98.37% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=10$ | Training Data II | - | 98.01% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=20$ | Training Data II | - | 98.23% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=40$ | Training Data II | - | 98.22% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=60$ | Training Data II | - | 98.25% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=80$ | Training Data II | - | 98.22% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=10$, with bias | Training Data II | - | 98.23% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=20$, with bias | Training Data II | - | 98.46% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=40$, with bias | Training Data II | - | 98.54% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=60$, with bias | Training Data II | - | 98.60% |
| V | c | Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=80$, with bias | Training Data II | - | 98.61% |
Table 6: Part V classification results

In Part V, again, we have no a priori knowledge. Although all we know are the training data and their class labels, we can train weight vectors $\vec{a}_i$ for the linear discriminant functions $g_i(\vec{x})$ by minimizing a criterion function, in this case the Perceptron criterion. According to the results in the table above, the models achieve very promising results without knowing anything else about the data, i.e., this is actually machine learning. We can also conclude that the problem is close to linearly separable, since we only use linear discriminant functions. The stopping criterion for every Perceptron run above is reaching the maximum number of iterations $K$.

We have tried a number of different values of $K$ and found that $K=25$ gives good classification results. It is also interesting to see how the bias term $w_0$ affects the models: with the same setting (the same $K$ and the same initial weights), the models with a bias term yield better results in all cases.

For question (c), we have a dynamic learning rate $\eta(k)$, a function of the current iteration: the larger the iteration $k$, the smaller the learning rate, i.e., the learning rate decreases over time. To see how the dynamic learning rate affects the performance, we varied $K \in \{10, 20, 40, 60, 80\}$. As can be seen from the table above, training a model longer tends to increase the accuracy. Nevertheless, one must be careful: training a model too long may eventually overfit it to the training data. There are several possible remedies, for instance adding regularization or applying cross-validation.


All in all, this is a wonderful MiniProject that wraps up pretty much everything in this course. Different supervised classification techniques, with different knowledge and assumptions, are employed in a series to see how each one works, and the comparisons between them are discussed. And this brings the Introduction to Pattern Recognition and Machine Learning class to an end.