This miniproject has only one question, which is a bit longer than the questions in previous assignments, however it also builds on the previous assignments. So, at the end of the day, you should be able to "borrow" a lot of your own code from before and finish this assignment quite easily. In this experiment, you are to compare various different classification approaches:
%matplotlib inline
import os
import numpy as np
# save np.load
np_load_old = np.load
# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)
import scipy
import scipy.io
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.lines import Line2D
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import display, Math
np.random.seed(7720)
def show(text, ans, precision=4):
    if type(ans) == np.ndarray:
        t = r'\begin{bmatrix} '
        for i in ans:
#             print(i, type(i))
#             print(r' \\ '.join(i))
            if type(i) != np.ndarray:
                t += f'{i:.{precision}f}' + r' \\ '
            else:
                a_str = np.array2string(i, precision=precision, separator=r' & ')
                t += a_str[1:-1]
                t += r' \\ '
        t += r'\end{bmatrix}'
        display(Math(f'{text} = {t}'))
    else:
        display(Math(f'{text} = {ans:.{precision}f}'))
def show_percent(text, ans, precision=2):
    display(Math(f'{text} = {ans:.{precision}f}\%'))
Initially, you must create four data sets according to the information below, but next week, you will be given the actual datasets with which you will write your final reports. The datasets that you will create are 'trivial', but they will be useful for debugging your programs. The ones that I will provide will be more challenging.
While using your datasets, the four testing/training data sets described here MUST be kept the same through out all parts below. That is, you should create your data points ONCE, save them, and use them for all parts and subparts below. Do NOT create different data for each question or you may end up with very strange results!!
The four data sets will be referred to as: 1) Training Data I, with 50 samples in each class; 2) Training Data II, with 500 samples in each class; 3) Testing Data I, also with 500 samples in each class; and finally 4) Testing Data II, with 10000 samples in each class1
All data sets must consist of 5-d points divided in 3-classes with the underlying normal distributions $p(\vec{x} | \omega_i) = \mathsf{N}(\vec{\mu}_i, \Sigma_i)$, where:
$$ \begin{align*} \vec{\mu}_1 &= \begin{bmatrix}2 & 3 & 1 & 5.5 & 8.7\end{bmatrix}^t\\ \vec{\mu}_2 &= \begin{bmatrix}-4.5 & 6 & -1 & 3 & 10\end{bmatrix}^t\\ \vec{\mu}_3 &= \begin{bmatrix}1.2, -2.3, 1.5, -0.5, 2.7\end{bmatrix}^t\\ \end{align*} $$$$ \vec{\Sigma}_1 = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0 & 0 \\ 0 & 0 & 2.5 & 0 & 0 \\ 0 & 0 & 0 & 0.7 & 0 \\ 0 & 0 & 0 & 0 & 3.5 \\ \end{bmatrix} \vec{\Sigma}_2 = \begin{bmatrix} 2 & 0 & 1 & 0.5 & 0 \\ 0 & 3.5 & 0 & 0 & 0.6 \\ 1 & 0 & 4.5 & 1.2 & 0 \\ 0.5 & 0 & 1.2 & 1.6 & 0 \\ 0 & 0.6 & 0 & 0 & 2.5 \\ \end{bmatrix} \vec{\Sigma}_3 = \begin{bmatrix} 4.2 & 0 & 1.3 & 2.5 & 1.4 \\ 0 & 5 & 0 & 0 & 3.6 \\ 1.3 & 0 & 4.5 & 4.2 & 0 \\ 2.5 & 0 & 4.2 & 5.6 & 0 \\ 1.4 & 3.6 & 0 & 0 & 7.5 \\ \end{bmatrix} $$You will assume that all states of nature are equally probable.
* Comment out the code for generating the dataset (run once for the first time)
# C = 3
# data = {}
# data['train1'] = [None] * C
# data['train2'] = [None] * C
# data['test1'] = [None] * C
# data['test2'] = [None] * C
# mu = [
#     np.array([2, 3, 1, 5.5, 8.7]),
#     np.array([-4.5, 6, -1, 3, 10]),
#     np.array([1.2, -2.3, 1.5, -0.5, 2.7]),
# ]
      
# cov = [
#     np.identity(5) * np.array([1, 0.5, 2.5, 0.7, 3.5]),
#     np.array([
#         [2, 0, 1, 0.5, 0],
#         [0, 3.5, 0, 0, 0.6],
#         [1, 0, 4.5, 1.2, 0],
#         [0.5, 0, 1.2, 1.6, 0],
#         [0, 0.6, 0, 0, 2.5]
#     ]),
#     np.array([
#         [4.2, 0, 1.3, 2.5, 1.4],
#         [0, 5, 0, 0, 3.6],
#         [1.3, 0, 4.5, 4.2, 0],
#         [2.5, 0, 4.2, 5.6, 0],
#         [1.4, 3.6, 0, 0, 7.5]
#     ]),
# ]    
# for c in range(C):
#     data['train1'][c] = np.random.multivariate_normal(mu[c], cov[c], 50)
#     data['train2'][c] = np.random.multivariate_normal(mu[c], cov[c], 500)
#     data['test1'][c] = np.random.multivariate_normal(mu[c], cov[c], 500)
#     data['test2'][c] = np.random.multivariate_normal(mu[c], cov[c], 10000)
    
# data['mu'] = mu
# data['cov'] = cov
# np.save('data/mp6.npy', data)
data = np.load('data/mp6.npy').item()
C = len(data['mu'])
Cross-check the defined mean vectors and covariance matrices with the question.
show(r'\vec{\mu}_1^{\mathsf{T}}', data['mu'][0][:, None].T, precision=1)
show(r'\vec{\mu}_2^{\mathsf{T}}', data['mu'][1][:, None].T, precision=1)
show(r'\vec{\mu}_3^{\mathsf{T}}', data['mu'][2][:, None].T, precision=1)
show(r'\Sigma_1', data['cov'][0], precision=1)
show(r'\Sigma_2', data['cov'][1], precision=1)
show(r'\Sigma_3', data['cov'][2], precision=1)
All confusion matrices in this miniproject has the following structure:
| (predicted) 1 | (predicted) 2 | (predicted) 3 | |
|---|---|---|---|
| (actual) 1 | - | - | - | 
| (actual) 2 | - | - | - | 
| (actual) 3 | - | - | - | 
Here, you will first (in a and b) use complete knowledge about the data. Then (in c and d), you will "forget" that you know the means and covariances, and use ML to estimate them.
We have a priori knowledge about the distributions that they are Gaussian distributions. So, we can define the discriminant function for the normal density with the following generic form: $$ g_i(\vec{x}) = -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^{\mathsf{T}}\mathbf{\Sigma}_i^{-1}(\vec{x}-\vec{\mu}_i)-\frac{d}{2}\ln (2\pi) - \frac{1}{2} \ln |\mathbf{\Sigma}_i| + \ln P(\omega_i) $$ for any given $d$-dimensional data, mean $\vec{\mu}_i$, covariance matrix $\mathbf{\Sigma}_i$ and prior probabilities $P(\omega_i)$ of class $i$.
Given the equal number of samples in every class, thus the priors are equal. We then can discard the term $\ln P(\omega_i)$. Also, since $d$ is a constant, we then also get rid of the constant term $- \frac{d}{2} \ln (2\pi)$.
We classify a sample $\vec{x}$, based on the Bayes decision rule, to be class $\omega_i$ if $g_i(\vec{x}) > g_j(\vec{x})$, $\forall j \neq i$. This can also be written in the form of argmax as follows: $$ \hat{\omega} = \underset{i}{\mathrm{argmax}} g_i(\vec{x}) $$
def squared_mahalanobis_distance(x, y, cov):
    a = np.array(x) - np.array(y)
    r2 = a.T @ np.linalg.inv(cov) @ a
    return r2
def discriminant_function(x, mean, cov):
    d = len(x)
    A = -0.5 * squared_mahalanobis_distance(x, mean, cov)
#     B = - ((d/2) * np.log(2 * np.pi))
    C = - (0.5 * np.log(np.linalg.det(cov)))
#     D = np.log(prior)
    return A + C
a_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test1'][c]:
        distances = np.zeros(C)
        for i in range(C):
            distances[i] = discriminant_function(x, data['mu'][i], data['cov'][i])
        pred_class = np.argmax(distances)
        a_confusion_matrix[c][pred_class] += 1
show('\mathrm{[Part\ I]\ a)\ \ Confusion\ matrix}', a_confusion_matrix)
def accuracy(cm):
    return cm.trace() / cm.sum(axis=None) * 100
show_percent(r'\mathrm{[Part\ I]\ a)\ accuracy}', accuracy(a_confusion_matrix))
show_percent(r'\mathrm{[Part\ I]\ a)\ error}', 100 - accuracy(a_confusion_matrix))
b_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        distances = np.zeros(C)
        for i in range(C):
            distances[i] = discriminant_function(x, data['mu'][i], data['cov'][i])
        pred_class = np.argmax(distances)
        b_confusion_matrix[c][pred_class] += 1
show('\mathrm{[Part\ I]\ b)\ \ Confusion\ matrix}', b_confusion_matrix)
show_percent(r'\mathrm{[Part\ I]\ b)\ accuracy}', accuracy(b_confusion_matrix))
show_percent(r'\mathrm{[Part\ I]\ b)\ error}', 100 - accuracy(b_confusion_matrix))
According to the maximum likelihood estimation (MLE) for a Gaussian distribution, the estimated mean vector of class $\omega_i$ can be computed using $$ \vec{\mu}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \vec{x}_k $$ where $n$ is the number of training samples in class $\omega_i$.
The estimated unbiased covariance matrix of class $\omega_i$ is given by: $$ \mathbf{\Sigma}_i = \frac{1}{n_i-1} \sum_{k=1}^{n_i} (\vec{x}_k - \vec{\mu}_i) (\vec{x}_k - \vec{\mu}_i)^{\mathsf{T}} $$
c_MLE_mu = np.full_like(data['mu'], np.nan)
c_MLE_cov = np.full_like(data['cov'], np.nan)
for c in range(C):
    c_MLE_mu[c] = np.mean(data['train1'][c], axis=0)
    c_MLE_cov[c] = np.cov(data['train1'][c].T)
show('\mathrm{[Part\ I]\ c)\ \ MLE\ \mu_1^{\mathsf{T}}}', c_MLE_mu[0][:, None].T)
show('\mathrm{[Part\ I]\ c)\ \ MLE\ \mu_2^{\mathsf{T}}}', c_MLE_mu[1][:, None].T)
show('\mathrm{[Part\ I]\ c)\ \ MLE\ \mu_3^{\mathsf{T}}}', c_MLE_mu[2][:, None].T)
show('\mathrm{[Part\ I]\ c)\ \ MLE\ \Sigma_1}', c_MLE_cov[0])
show('\mathrm{[Part\ I]\ c)\ \ MLE\ \Sigma_2}', c_MLE_cov[1])
show('\mathrm{[Part\ I]\ c)\ \ MLE\ \Sigma_3}', c_MLE_cov[2])
c_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        distances = np.zeros(C)
        for i in range(C):
            distances[i] = discriminant_function(x, c_MLE_mu[i], c_MLE_cov[i])
        pred_class = np.argmax(distances)
        c_confusion_matrix[c][pred_class] += 1
show('\mathrm{[Part\ I]\ c)\ \ Confusion\ matrix}', c_confusion_matrix)
show_percent(r'\mathrm{[Part\ I]\ c)\ accuracy}', accuracy(c_confusion_matrix))
show_percent(r'\mathrm{[Part\ I]\ c)\ error}', 100 - accuracy(c_confusion_matrix))
d_MLE_mu = np.full_like(data['mu'], np.nan)
d_MLE_cov = np.full_like(data['cov'], np.nan)
for c in range(C):
    d_MLE_mu[c] = np.mean(data['train2'][c], axis=0)
    d_MLE_cov[c] = np.cov(data['train2'][c].T)
show('\mathrm{[Part\ I]\ d)\ \ MLE\ \mu_1^{\mathsf{T}}}', d_MLE_mu[0][:, None].T)
show('\mathrm{[Part\ I]\ d)\ \ MLE\ \mu_2^{\mathsf{T}}}', d_MLE_mu[1][:, None].T)
show('\mathrm{[Part\ I]\ d)\ \ MLE\ \mu_3^{\mathsf{T}}}', d_MLE_mu[2][:, None].T)
show('\mathrm{[Part\ I]\ d)\ \ MLE\ \Sigma_1}', d_MLE_cov[0])
show('\mathrm{[Part\ I]\ d)\ \ MLE\ \Sigma_2}', d_MLE_cov[1])
show('\mathrm{[Part\ I]\ d)\ \ MLE\ \Sigma_3}', d_MLE_cov[2])
d_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        distances = np.zeros(C)
        for i in range(C):
            distances[i] = discriminant_function(x, d_MLE_mu[i], d_MLE_cov[i])
        pred_class = np.argmax(distances)
        d_confusion_matrix[c][pred_class] += 1
show('\mathrm{[Part\ I]\ d)\ \ Confusion\ matrix}', d_confusion_matrix)
show_percent(r'\mathrm{[Part\ I]\ d)\ accuracy}', accuracy(d_confusion_matrix))
show_percent(r'\mathrm{[Part\ I]\ d)\ error}', 100 - accuracy(d_confusion_matrix))
| Part | Question | Algorithm | Training set | Acccuracy on Testing data I | Accuracy on Testing data II | 
|---|---|---|---|---|---|
| I | a, b | BDR | A prior knowledge | 99.87% | 99.89% | 
| I | c | BDR | Training data I | - | 99.81% | 
| I | d | BDR | Training data II | - | 99.88% | 
First of all, we use a priori knowledge about the data which was given in the instruction, i.e., we know how each sample is drawn from the distribution. In this case, we do not need the training data at all. The accuracy of the questions (a) and (b) are then obviously high (on the two testing sets). When it comes to the questions (c) and (d), instead of using the a priori knowledge, we need to estimate the parameters of the Guassian distributions using maximum likelihood estimation (MLE). Clearly, since the estimated parameters are not the actual parameters, the accuracies on the testing sets should drop. For the question (c) which we train the model on the training data I (smaller number of training samples), the accuracy drops from 99.89% to 99.81%. But if we train the model on the training data II (higher number of training samples), the estimated parameters are more accurate. This results in a higher accuracy of 99.88% (versus the 99.81% accuracy from the training data I). As we all know, the accuracy shall not exceed the accuracy of 99.89% that is from the complete knowledge about the distributions.
In this part, you will reduce the dimensionality of the data by applying MDA. That is,
Scatter matrix of class $i$ is defined by: $$ \mathbf{S}_i = \sum_{\vec{x} \in \mathcal{D}_i} (\vec{x} - \vec{m}_i)(\vec{x} - \vec{m}_i)^{\mathsf{T}} $$
The $d$-dimensional sample mean of class $i$ or $\vec{m}_i$ is given by: $$ \vec{m}_i = \frac{1}{n_i} \sum_{\vec{x} \in \mathcal{D}_i} \vec{x} $$ where $n_i$ is the number of training samples in class $i$.
Within-class scatter matrix is the summation of scatter matrices from all classes. $$ \mathbf{S}_W = \sum_{i=1}^{C} \mathbf{S}_i $$
def scatter_matrix(X):
    m = X.mean(axis=0)
    S = np.zeros((len(m), len(m)))
    for x in X:
        S += np.outer(x - m, (x - m).T)
    return S
def within_class_scatter(data):
    S_w = np.zeros((data[0].shape[1], data[0].shape[1]))
    for c in range(len(data)):
        S_w += scatter_matrix(data[c])
    return S_w
Between-class scatter matrix can be computed by: $$ \mathbf{S}_B = \sum_{i=1}^{C} n_i (\vec{m}_i - \vec{m}) (\vec{m}_i - \vec{m})^{\mathsf{T}} $$ where $\vec{m}$ is a total mean vector defined as: $$ \vec{m} = \frac{1}{n} \sum_{\vec{x}} \vec{x} = \frac{1}{n} \sum_{i=1}^{C} n_i \vec{m}_i $$
def total_mean_vector(data):
    m = np.zeros(data[0].shape[1])
    n = 0
    for c in range(len(data)):
        n_i = len(data[c])
        m_i = data[c].mean(axis=0)
        n += n_i
        m += n_i * m_i
    m /= n
    return m
def between_class_scatter(data):
    S_B = np.zeros((data[0].shape[1], data[0].shape[1]))
    m = total_mean_vector(data)
    for c in range(len(data)):
        n_i = len(data[c])
        m_i = data[c].mean(axis=0)
        S_B += n_i * np.outer((m_i - m), (m_i - m).T)
    return S_B
def total_scatter(data):
    return within_class_scatter(data) + between_class_scatter(data)
II_a_S_W = within_class_scatter(data['train1'])
show(r'\mathrm{[Part\ II]\ a)\ \ Within-class\ scatter\ matrix}\ \ S_W', II_a_S_W)
II_a_S_W_inv = np.linalg.inv(II_a_S_W)
show(r'\mathrm{[Part\ II]\ a)\ \ Inverse\ of\ within-class\ scatter\ matrix}\ \ S_W^{-1}', II_a_S_W_inv)
II_a_S_B = between_class_scatter(data['train1'])
show(r'\mathrm{[Part\ II]\ a)\ \ Between-class\ scatter\ matrix}\ \ S_B', II_a_S_B)
Solve for the eigenvalues and eigenvectors $$ \mathbf{S_W^{-1}} \mathbf{S_B} \mathbf{w} = \lambda \mathbf{w} $$
II_a_eigenvalues, II_a_eigenvectors = np.linalg.eigh(II_a_S_W_inv.dot(II_a_S_B))
show(r'\mathrm{[Part\ II]\ a)\ \ Eigenvalues}\ \Lambda', II_a_eigenvalues[:, None].T)
show(r'\mathrm{[Part\ II]\ a)\ \ Eigenvectors}\ \Phi', II_a_eigenvectors)
For the $C$-class classification problem, we select the largest $C-1$ non-zero eigenvalues to indicate which columns we should use to form a weight matrix $\mathbf{W}$.
II_a_largest_columns = II_a_eigenvalues.argsort()[-(C-1):][::-1]
show(r'\mathrm{[Part\ II]\ a)\ \ Largest}\ C-1 = 3-1 = 2\ \mathrm{eigenvalues, so\ we\ select\ the\ columns}', II_a_largest_columns[:, None].T, precision=0)
II_a_W = II_a_eigenvectors[:, II_a_largest_columns]
show(r'\mathrm{[Part\ II]\ a)\ \ Weight\ matrix}\ W', II_a_W)
With the $d$-by-$(C-1)$ weight matrix $\mathbf{W}$, we can project from $d$-dimensional space to a $(C-1)$-dimensional space by taking the dot product between a sample $\vec{x}$ with the weight matrix $\mathbf{W}$. In this case, the expected dimension of $\vec{y}$ after the projection of an $\vec{x}$ with $W$ is $C-1 = 2$ (since $C=3$ is the number of classes for this dataset).
This can also be verified by applying the dot product between a sample $\vec{x}$ and the weight matrix $W$ as follows:
x = data['train1'][0][0]
y = (x @ II_a_W)
print("x's shape is", x[:, None].shape)
print("y's shape is", y[:, None].shape)
x's shape is (5, 1) y's shape is (2, 1)
The mean vectors and covariance matrices are the estimated just like in the part I, but we estimate on the transformed samples.
II_a_mu = np.full((C, C-1), np.nan)
II_a_cov = np.full((C, C-1, C-1), np.nan)
for c in range(C):
    II_a_Y = data['train1'][c] @ II_a_W
    II_a_mu[c] = II_a_Y.mean(axis=0)
    II_a_cov[c] = np.cov(II_a_Y.T)
show('\mathrm{[Part\ II]\ a)\ \ MLE\ }\mu_1^{\mathsf{T}}', II_a_mu[0][:, None].T)
show('\mathrm{[Part\ II]\ a)\ \ MLE\ }\mu_2^{\mathsf{T}}', II_a_mu[1][:, None].T)
show('\mathrm{[Part\ II]\ a)\ \ MLE\ }\mu_3^{\mathsf{T}}', II_a_mu[2][:, None].T)
show('\mathrm{[Part\ II]\ a)\ \ MLE\ }\Sigma_1', II_a_cov[0])
show('\mathrm{[Part\ II]\ a)\ \ MLE\ }\Sigma_2', II_a_cov[1])
show('\mathrm{[Part\ II]\ a)\ \ MLE\ }\Sigma_3', II_a_cov[2])
The transformation is simply the matrix-vector product as follows: $$ \vec{y} = \mathbf{W}^{\mathsf{T}} \vec{x} $$
We then classify the transformed sample $\vec{y}$ using the discriminant function
II_a_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        y = x @ II_a_W
        distances = np.zeros(C)
        for i in range(C):
            distances[i] = discriminant_function(y, II_a_mu[i], II_a_cov[i])
        pred_class = np.argmax(distances)
        II_a_confusion_matrix[c][pred_class] += 1
show('\mathrm{[Part\ II]\ b)\ \ Confusion\ matrix}', II_a_confusion_matrix)
show_percent(r'\mathrm{[Part\ II]\ b)\ accuracy}', accuracy(II_a_confusion_matrix))
show_percent(r'\mathrm{[Part\ II]\ b)\ error}', 100 - accuracy(II_a_confusion_matrix))
II_c_S_W = within_class_scatter(data['train2'])
II_c_S_W_inv = np.linalg.inv(II_c_S_W)
II_c_S_B = between_class_scatter(data['train2'])
show(r'\mathrm{[Part\ II]\ c)\ \ Within-class\ scatter\ matrix}\ \ S_W', II_c_S_W)
show(r'\mathrm{[Part\ II]\ c)\ \ Inverse\ of\ within-class\ scatter\ matrix}\ \ S_W^{-1}', II_c_S_W_inv)
show(r'\mathrm{[Part\ II]\ c)\ \ Between-class\ scatter\ matrix}\ \ S_B', II_c_S_B)
II_c_eigenvalues, II_c_eigenvectors = np.linalg.eigh(II_c_S_W_inv.dot(II_c_S_B))
II_c_largest_columns = II_c_eigenvalues.argsort()[-(C-1):][::-1]
show(r'\mathrm{[Part\ II]\ c)\ \ Eigenvalues}\ \Lambda', II_c_eigenvalues[:, None].T)
show(r'\mathrm{[Part\ II]\ c)\ \ Eigenvectors}\ \Phi', II_c_eigenvectors)
show(r'\mathrm{[Part\ II]\ c)\ \ Largest}\ C-1 = 3-1 = 2\ \mathrm{eigenvalues, so\ we\ select\ the\ columns}', II_c_largest_columns[:, None].T, precision=0)
II_c_W = II_c_eigenvectors[:, II_c_largest_columns]
show(r'\mathrm{[Part\ II]\ c)\ \ Weight\ matrix}\ W', II_c_W)
With the $d$-by-$(C-1)$ weight matrix $\mathbf{W}$, we can project from $d$-dimensional space to a $(C-1)$-dimensional space by taking the dot product between a sample $\vec{x}$ with the weight matrix $\mathbf{W}$. In this case, the expected dimension of $\vec{y}$ after the projection of an $\vec{x}$ with $W$ is $C-1 = 2$ (since $C=3$ is the number of classes for this dataset).
This can also be verified by applying the dot product between a sample $\vec{x}$ and the weight matrix $W$ as follows:
x = data['train1'][0][0]
y = (x @ II_c_W)
print("x's shape is", x[:, None].shape)
print("y's shape is", y[:, None].shape)
x's shape is (5, 1) y's shape is (2, 1)
The mean vectors and covariance matrices are the estimated just like in the part I, but we estimate on the transformed samples.
II_c_mu = np.full((C, C-1), np.nan)
II_c_cov = np.full((C, C-1, C-1), np.nan)
for c in range(C):
    Y = data['train2'][c] @ II_c_W
    II_c_mu[c] = Y.mean(axis=0)
    II_c_cov[c] = np.cov(Y.T)
show('\mathrm{[Part\ II]\ c)\ \ MLE\ }\mu_1^{\mathsf{T}}', II_c_mu[0][:, None].T)
show('\mathrm{[Part\ II]\ c)\ \ MLE\ }\mu_2^{\mathsf{T}}', II_c_mu[1][:, None].T)
show('\mathrm{[Part\ II]\ c)\ \ MLE\ }\mu_3^{\mathsf{T}}', II_c_mu[2][:, None].T)
show('\mathrm{[Part\ II]\ c)\ \ MLE\ }\Sigma_1', II_c_cov[0])
show('\mathrm{[Part\ II]\ c)\ \ MLE\ }\Sigma_2', II_c_cov[1])
show('\mathrm{[Part\ II]\ c)\ \ MLE\ }\Sigma_3', II_c_cov[2])
II_c_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        y = x @ II_c_W
        distances = np.zeros(C)
        for i in range(C):
            distances[i] = discriminant_function(y, II_c_mu[i], II_c_cov[i])
        pred_class = np.argmax(distances)
        II_c_confusion_matrix[c][pred_class] += 1
show('\mathrm{[Part\ II]\ c)\ \ Confusion\ matrix}', II_c_confusion_matrix)
show_percent(r'\mathrm{[Part\ II]\ c)\ \ accuracy}', accuracy(II_c_confusion_matrix))
show_percent(r'\mathrm{[Part\ II]\ c)\ \ error}', 100 - accuracy(II_c_confusion_matrix))
| Part | Question | Algorithm | Training set | Acccuracy on Testing data I | Accuracy on Testing data II | 
|---|---|---|---|---|---|
| I | a, b | BDR | A prior knowledge | 99.87% | 99.89% | 
| I | c | BDR | Training data I | - | 99.81% | 
| I | d | BDR | Training data II | - | 99.88% | 
| II | b | MDA | Training data I | - | 99.51% | 
| II | c | MDA | Training data II | - | 99.59% | 
The training data I has smaller number of samples compared to the training data II. Again, this makes the parameter estimation less accurate. The point of Part II is to apply multiple discriminant analysis (MDA), i.e., projection from a $d$-dimensional space to $(C-1)$-dimensional space. This brings about a dimensionality reduction of the feature space. Of course, since the number of features are reduced from 5 to 2 in this case, we would lose some accuracy. However, it is a decent trade-off as the accuracies drop less than 1% when compared to Part I. With the reduced dimensionality, we can apply classifier algorithms more efficient, i.e., it is less computationally expensive.
Now, you will completely forget that you know anything about any of the distributions and/or their parameters and apply a non-parametric approach to classification.
Based on the density estimation method, we can estimate an unknown probability density function without knowing its true density and its parameters. A $d$-dimensional hypercube $\mathcal{R}_n$ is a region that we are interested in since we can measure the probability that a sample $\vec{x}$ will fall in the region of class $i$, or $p_n(\vec{n} | \omega_i)$ which is given by: $$ \begin{align*} p_n(\vec{x} | \omega_i) &= \frac{k_n}{n V_n} \end{align*} $$ where $n$ is the total number of samples, and $V_n$ is the volume of the hypercube $\mathcal{R}_n$. So, the volume calculation is quite straightforward according to the definition of the hypercube, we arrive at: $$ V_n = h_n^d $$ where $h_n$ is the length of an edge of the hypercube $\mathcal{R}_n$.
We then define $k_n$ which is the number of samples that reside in a $d$-dimensional hypercube $\mathcal{R}_n$ as: $$ k_n = \sum_{i=1}^{n} \varphi \left( \frac{\vec{x} - \vec{x}_i}{h_n} \right) $$
We combine all of the above equations together, so we could compute the probability by: $$ p_n(\vec{x} | \omega_i) = \frac{k_n}{n V_n} = \frac{1}{n V_n} \sum_{i=1}^{n} \varphi \left( \frac{\vec{x} - \vec{x}_i}{h_n} \right) $$
Based on the decision rule for $C$-class classification problem, a sample $\vec{x}$ is classified to be the predicted class $\hat{\omega}$ by using: $$ \hat{\omega} = \underset{i}{\mathrm{argmax}} p_n(\vec{x} | \omega_i)P(\omega_i) $$ In this project, the priors are equal, thus we can discard the prior term. So, we arrive at the decision rule in terms of class-conditional probability densities: $$ \hat{\omega} = \underset{i}{\mathrm{argmax}} p_n(\vec{x} | \omega_i) $$
$\varphi(\cdot)$ can be any kernel function. The standard hypercube kernel is defined as: $$ \varphi \left(\frac{\vec{x} - \vec{x}_i}{h_n} \right) = \begin{cases} 1 & \text{if}\ |\vec{x} - \vec{x}_i| \leq \frac{h_n}{2} \\ 0 & \text{otherwise} \end{cases} $$
def is_hypercube(kernel_fx):
    return kernel_fx == 'hypercube'
def is_gaussian(kernel_fx):
    return kernel_fx == 'gaussian'
def parzen_window(training_data, x, h_n, kernel_fx='hypercube'):
    d = x.shape[0]
    V_n = h_n ** d
    C = len(training_data)
    if is_hypercube(kernel_fx):
        p_n = np.full(C, np.nan)
    elif is_gaussian(kernel_fx):
        p_n = np.full((C, d), np.nan)
    for c in range(C):
        n = len(training_data[c])
        k = 0
        for x_i in training_data[c]:
            if is_hypercube(kernel_fx) and np.all(np.abs(x - x_i) < h_n / 2):
                k += 1
            elif is_gaussian(kernel_fx):
                k += np.exp(-0.5 * ((x - x_i) / h_n)**2) / ((np.sqrt(2 * np.pi))**d * V_n)
        p_n[c] = (1 / (n * V_n)) * k
    if is_gaussian(kernel_fx):
        p_n = p_n.sum(axis=1)
    pred_class = np.argmax(p_n)
    return pred_class
III_a_01_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        pred_class = parzen_window(data['train1'], x, h_n=0.1, kernel_fx='hypercube')
        III_a_01_confusion_matrix[c][pred_class] += 1
show(r'\mathrm{[Part\ III]\ a)\ }\ \ h_n=0.1: \mathrm{Confusion\ matrix}', III_a_01_confusion_matrix)
show_percent(r'\mathrm{[Part\ III]\ a)\ }\ \ h_n=0.1: \mathrm{accuracy}', accuracy(III_a_01_confusion_matrix))
show_percent(r'\mathrm{[Part\ III]\ a)\ }\ \ h_n=0.1: \mathrm{error}', 100 - accuracy(III_a_01_confusion_matrix))
III_a_07_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        pred_class = parzen_window(data['train1'], x, h_n=0.7, kernel_fx='hypercube')
        III_a_07_confusion_matrix[c][pred_class] += 1
        
show(r'\mathrm{[Part\ III]\ a)\ }\ \ h_n=0.7: \mathrm{Confusion\ matrix}', III_a_07_confusion_matrix)
show_percent(r'\mathrm{[Part\ III]\ a)\ }\ \ h_n=0.7: \mathrm{accuracy}', accuracy(III_a_07_confusion_matrix))
show_percent(r'\mathrm{[Part\ III]\ a)\ }\ \ h_n=0.7: \mathrm{error}', 100 - accuracy(III_a_07_confusion_matrix))
III_a_5_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        pred_class = parzen_window(data['train1'], x, h_n=5, kernel_fx='hypercube')
        III_a_5_confusion_matrix[c][pred_class] += 1
show(r'\mathrm{[Part\ III]\ a)\ }\ \ h_n=5: \mathrm{Confusion\ matrix}', III_a_5_confusion_matrix)
show_percent(r'\mathrm{[Part\ III]\ a)\ }\ \ h_n=5: \mathrm{accuracy}', accuracy(III_a_5_confusion_matrix))
show_percent(r'\mathrm{[Part\ III]\ a)\ }\ \ h_n=5: \mathrm{error}', 100 - accuracy(III_a_5_confusion_matrix))
III_b_01_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        pred_class = parzen_window(data['train2'], x, h_n=0.1, kernel_fx='hypercube')
        III_b_01_confusion_matrix[c][pred_class] += 1
show(r'\mathrm{[Part\ III]\ b)\ }\ \ h_n=0.1: \mathrm{Confusion\ matrix}', III_b_01_confusion_matrix)
show_percent(r'\mathrm{[Part\ III]\ b)\ }\ \ h_n=0.1: \mathrm{accuracy}', accuracy(III_b_01_confusion_matrix))
show_percent(r'\mathrm{[Part\ III]\ b)\ }\ \ h_n=0.1: \mathrm{error}', 100 - accuracy(III_b_01_confusion_matrix))
III_b_07_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        pred_class = parzen_window(data['train2'], x, h_n=0.7, kernel_fx='hypercube')
        III_b_07_confusion_matrix[c][pred_class] += 1
        
show(r'\mathrm{[Part\ III]\ b)\ }\ \ h_n=0.7: \mathrm{Confusion\ matrix}', III_b_07_confusion_matrix)
show_percent(r'\mathrm{[Part\ III]\ b)\ }\ \ h_n=0.7: \mathrm{accuracy}', accuracy(III_b_07_confusion_matrix))
show_percent(r'\mathrm{[Part\ III]\ b)\ }\ \ h_n=0.7: \mathrm{error}', 100 - accuracy(III_b_07_confusion_matrix))
III_b_5_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        pred_class = parzen_window(data['train2'], x, h_n=5, kernel_fx='hypercube')
        III_b_5_confusion_matrix[c][pred_class] += 1
show(r'\mathrm{[Part\ III]\ b)\ }\ h_n=5: \mathrm{Confusion\ matrix}', III_b_5_confusion_matrix)
show_percent(r'\mathrm{[Part\ III]\ b)\ }\ \ h_n=5: \mathrm{accuracy}', accuracy(III_b_5_confusion_matrix))
show_percent(r'\mathrm{[Part\ III]\ b)\ }\ \ h_n=5: \mathrm{error}', 100 - accuracy(III_b_5_confusion_matrix))
In this question, we replace the hypercube kernel with the Gaussian kernel which is given by $$ \varphi \left(\frac{\vec{x} - \vec{x}_i}{\sigma} \right) = \frac{1}{(\sqrt{2\pi})^d V_n} \exp{\left(-\frac{1}{2}(\frac{\vec{x} - \vec{x}_i}{\sigma})^2\right)} $$ where $V_n = \sigma^d$.
III_c_01_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        pred_class = parzen_window(data['train2'], x, h_n=0.1, kernel_fx='gaussian')
        III_c_01_confusion_matrix[c][pred_class] += 1
show(r'\mathrm{[III]\ c)\ Gaussian\ kernel}\ \sigma=0.1: \mathrm{Confusion\ matrix}', III_c_01_confusion_matrix)
show_percent(r'\mathrm{[III]\ c)\ Gaussian\ kernel}\ \sigma=0.1: \mathrm{accuracy}', accuracy(III_c_01_confusion_matrix))
show_percent(r'\mathrm{[III]\ c)\ Gaussian\ kernel}\ \sigma=0.1: \mathrm{error}', 100 - accuracy(III_c_01_confusion_matrix))
III_c_07_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        pred_class = parzen_window(data['train2'], x, h_n=0.7, kernel_fx='gaussian')
        III_c_07_confusion_matrix[c][pred_class] += 1
        
show(r'\mathrm{[III]\ c)\ Gaussian\ kernel}\ \sigma=0.7: \mathrm{Confusion\ matrix}', III_c_07_confusion_matrix)
show_percent(r'\mathrm{[III]\ c)\ Gaussian\ kernel}\ \sigma=0.7: \mathrm{accuracy}', accuracy(III_c_07_confusion_matrix))
show_percent(r'\mathrm{[III]\ c)\ Gaussian\ kernel}\ \sigma=0.7: \mathrm{error}', 100 - accuracy(III_c_07_confusion_matrix))
III_c_5_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in data['test2'][c]:
        pred_class = parzen_window(data['train2'], x, h_n=5, kernel_fx='gaussian')
        III_c_5_confusion_matrix[c][pred_class] += 1
show(r'\mathrm{[III]\ c)\ Gaussian\ kernel} \sigma=5: \mathrm{Confusion\ matrix}', III_c_5_confusion_matrix)
show_percent(r'\mathrm{[III]\ c)\ Gaussian\ kernel}\ \sigma=5: \mathrm{accuracy}', accuracy(III_c_5_confusion_matrix))
show_percent(r'\mathrm{[III]\ c)\ Gaussian\ kernel}\ \sigma=5: \mathrm{error}', 100 - accuracy(III_c_5_confusion_matrix))
| Part | Question | Algorithm | Training set | Acccuracy on Testing data I | Accuracy on Testing data II | 
|---|---|---|---|---|---|
| I | a, b | BDR | A prior knowledge | 99.87% | 99.89% | 
| I | c | BDR | Training data I | - | 99.81% | 
| I | d | BDR | Training data II | - | 99.88% | 
| II | b | MDA | Training data I | - | 99.51% | 
| II | c | MDA | Training data II | - | 99.59% | 
| III | a | Parzen Window (Hypercube $h_n=0.1$) | Training data I | - | 33.33% | 
| III | a | Parzen Window (Hypercube $h_n=0.7$) | Training data I | - | 33.39% | 
| III | a | Parzen Window (Hypercube $h_n=5.0$) | Training data I | - | 94.86% | 
| III | b | Parzen Window (Hypercube $h_n=0.1$) | Training data II | - | 33.33% | 
| III | b | Parzen Window (Hypercube $h_n=0.7$) | Training data II | - | 34.19% | 
| III | b | Parzen Window (Hypercube $h_n=5.0$) | Training data II | - | 99.16% | 
| III | c | Parzen Window (Gaussian $\sigma=0.1$) | Training data II | - | 93.14% | 
| III | c | Parzen Window (Gaussian $\sigma=0.7$) | Training data II | - | 94.83% | 
| III | c | Parzen Window (Gaussian $\sigma=5.0$) | Training data II | - | 96.73% | 
The very first setting is that we set the width of the hypercube window to be $h_n = 0.1$ which is way too small. So, there are no samples that are in the Parzen windows at different locations of the training samples. The effect of using argmax in the implementation makes the Parzen window predicts only the class $\omega_1$ which results in only 33.33% accuracy. This is the case for using either training data I or training data II. The results are the same.
When we increase the width of the hypercube kernel to $h_n = 0.7$, the Parzen window luckily covers just some test samples, thus it gains just a little more accuracy (33.39%). The result on the testing data II is 34.19%. The reason is that the number of testing samples are larger. So, we have a little bit more chance to reside in the windows.
But when we set the width of the hypercube kernel to $h_n = 5.0$, the window width is large enough to cover areas that samples might fall in. It yields good classifications results on the test set. It can attain the accuracy of up to 94.86% if we use the training data I. Once we increase the number of training samples (utilizing training data II), it even yields a better accuracy of 99.16%. Due to the fact that we endow the Parzen window with the finite number of training samples, small number of training samples would cause many holes at the fringes (along side with local spikes in the densities). To conclude, the more number of training samples can lead to a better model performance because we have fewer holes at the fringes.
Once we switch over to the Gaussian kernel, even the standard deviation $\sigma=0.1$ produces a very high accuracy of 93.14% due to the characteristic of the Gaussian kernel that has infinite support (unlike the hypercube kernel that has a crisp window). Therefore, even a test sample is far away from the training samples, it still contributes and gives us some value for the term $\varphi \left(\frac{\vec{x} - \vec{x}_i}{h_n} \right)$. So, when it comes to the classification based on the decision rule defined above, we are interested in the class that has the maximum probability, even from infinitesimal values. It seems like if we increase $\sigma$ to 0.7 and 5.0, we get better results, 94.83% and 96.73%, respectively. Because their window widths are larger which make a sample to be more probable to fall in the window of the more likely class.
In terms of the comparison between the Parzen window and the previous classification methods, it seems to be less accurate. However, we need to keep in mind that this is a non-parametric technique which does not require much a priori knowledge. Instead, we need to choose a kernel function and its hyperparameter ($h_n$ or $\sigma$). The biggest problem is that we need to fine-tune or search the best setting for the Parzen window, e.g., we do not know what should be the value of $h_n$. But doing so can be prone to overfitting and generalization issue, we should consider using cross validation to find the best setting. On the flip side, MLE estimates $\vec{\mu}$ and $\Sigma$ of a Gaussian distribution. We see the results from Part I are pretty high since, the assumed distribution matches the true distribution. If the true distribution is not Gaussian and we do not know it, this would be another case. It this the same for Part II which is MDA that provides a dimensionality reduction follwed by a BDR.
Also note that, if we use a large number of training samples, the Parzen window takes more time to execute since its computational complexity involves the number of training samples to go through when performing a classification.
Once again, you will forget that you know anything about any of the distributions and/or their parameters and apply another non-parametric approach.
Recall that the parzen window method requires us to choose a kernel function (window function) and its size. However, if we do not have that knowledge, we can utilize the k-nearest-neighbor algorithm instead. Unlike Parzen window that fixes the volume $V_n$, k-NN fixes $k$ which is the number of nearest neighbors. This means we grow a cell at a sample $\vec{x}$ until it covers $k_n$ samples. $k_i$ samples out of $k_n$ samples are labeled as class $\omega_i$. So, we can compute the estimated a posteriori probability by: $$ p_n(\omega_i | \vec{x}) = \frac{k_i}{k_n} $$
The way that we find the nearest neighbors is to use a distance function $D$. In this project, euclidean distance which is a Minkowski distance with $p=2$ in $d$-dimensional space is used. $$ D(\vec{a}, \vec{b}) = \left( \sum_{j=1}^{d} (a_j - b_j)^2 \right)^{1/2} $$
Based on the Bayes decision rule, we can classify a sample $\vec{x}$ to be a class $\hat{\omega}$ by computing: $$ \hat{\omega} = \underset{i}{\mathrm{argmax}} p_n(\omega_i | \vec{x}) = \underset{i}{\mathrm{argmax}} k_i $$ which means we classify a sample $\vec{x}$ based on the most frequent class within the grown cell that covers $k_n$ samples.
def squared_euclidean_distance(a, b):
    return np.sum((a - b)**2)
def k_nearest_neighbors(training_data, x, k):
    distances = []
    for c in range(len(training_data)):
        for y in training_data[c]:
            d = squared_euclidean_distance(x, y)
            distances.append((d, c))
    distances = np.array(distances)
    k_indices = np.argpartition(distances[:, 0], k)[:k]
    unique, counts = np.unique(distances[k_indices, 1].astype(int), return_counts=True)
    pred_class = unique[counts.argmax()]
    return pred_class
In this question, since the training data I has 50 samples for each class, this means we have 150 training samples. We can estimate $k_n = \sqrt{n} = \sqrt{150} \approx 12$.
IV_a_training_data = np.array(data['train1'])
IV_a_N = IV_a_training_data.shape[0] * IV_a_training_data.shape[1]
IV_a_k_n = round(np.sqrt(IV_a_N))
show(r'\mathrm{[Part\ IV]\ a)}\ \ k_n', IV_a_k_n, precision=0)
IV_a_test_data = np.array(data['test2'])
IV_a_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in IV_a_test_data[c]:
        pred_class = k_nearest_neighbors(IV_a_training_data, x, k=IV_a_k_n)
        IV_a_confusion_matrix[c][pred_class] += 1
show(r'\mathrm{[Part\ IV]\ a)}\ \ \mathrm{Confusion\ matrix}', IV_a_confusion_matrix)
show_percent(r'\mathrm{[IV]\ a)}\ \ \mathrm{accuracy}', accuracy(IV_a_confusion_matrix))
show_percent(r'\mathrm{[IV]\ a)}\ \ \mathrm{error}', 100 - accuracy(IV_a_confusion_matrix))
In this question, since the training data II has 500 samples for each class, this means we have 1,500 training samples. We can estimate $k_n = \sqrt{n} = \sqrt{1500} \approx 39$.
IV_b_training_data = np.array(data['train2'])
IV_b_N = IV_b_training_data.shape[0] * IV_b_training_data.shape[1]
IV_b_k_n = round(np.sqrt(IV_b_N))
show(r'\mathrm{[Part\ IV]\ b)}\ \ k_n', IV_b_k_n, precision=0)
IV_b_test_data = np.array(data['test2'])
IV_b_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in IV_b_test_data[c]:
        pred_class = k_nearest_neighbors(IV_b_training_data, x, k=IV_b_k_n)
        IV_b_confusion_matrix[c][pred_class] += 1
show(r'\mathrm{[Part\ IV]\ b)}\ \mathrm{Confusion\ matrix}', IV_b_confusion_matrix)
show_percent(r'\mathrm{[IV]\ b)}\ \ \mathrm{accuracy}', accuracy(IV_b_confusion_matrix))
show_percent(r'\mathrm{[IV]\ b)}\ \ \mathrm{error}', 100 - accuracy(IV_b_confusion_matrix))
In this question, we define the number of $k_n$ nearest neighbors to be: $$ k_n = f(n) = \frac{\sqrt{n}}{2} $$
Hence, we substitute $n = 1,500$, we arrive at: $$ k_n = \frac{\sqrt{n}}{2} = \frac{\sqrt{1500}}{2} \approx 19 $$
IV_c_k_n = round(np.sqrt(IV_b_N) / 2)
show(r'\mathrm{[Part\ IV]\ c)}\ \ k_n', IV_c_k_n, precision=0)
IV_c_confusion_matrix = np.zeros((C, C), dtype='int')
for c in range(C):
    for x in IV_b_test_data[c]:
        pred_class = k_nearest_neighbors(IV_b_training_data, x, k=IV_c_k_n)
        IV_c_confusion_matrix[c][pred_class] += 1
show(r'\mathrm{[Part\ IV]\ c)}\ \mathrm{Confusion\ matrix}', IV_c_confusion_matrix)
show_percent(r'\mathrm{[IV]\ c)}\ \ \mathrm{accuracy}', accuracy(IV_c_confusion_matrix))
show_percent(r'\mathrm{[IV]\ c)}\ \ \mathrm{error}', 100 - accuracy(IV_c_confusion_matrix))
| Part | Question | Algorithm | Training set | Acccuracy on Testing data I | Accuracy on Testing data II | 
|---|---|---|---|---|---|
| I | a, b | BDR | A prior knowledge | 99.87% | 99.89% | 
| I | c | BDR | Training data I | - | 99.81% | 
| I | d | BDR | Training data II | - | 99.88% | 
| II | b | MDA | Training data I | - | 99.51% | 
| II | c | MDA | Training data II | - | 99.59% | 
| III | a | Parzen Window (Hypercube $h_n=0.1$) | Training data I | - | 33.33% | 
| III | a | Parzen Window (Hypercube $h_n=0.7$) | Training data I | - | 33.39% | 
| III | a | Parzen Window (Hypercube $h_n=5.0$) | Training data I | - | 94.86% | 
| III | b | Parzen Window (Hypercube $h_n=0.1$) | Training data II | - | 33.33% | 
| III | b | Parzen Window (Hypercube $h_n=0.7$) | Training data II | - | 34.19% | 
| III | b | Parzen Window (Hypercube $h_n=5.0$) | Training data II | - | 99.16% | 
| III | c | Parzen Window (Gaussian $\sigma=0.1$) | Training data II | - | 93.14% | 
| III | c | Parzen Window (Gaussian $\sigma=0.7$) | Training data II | - | 94.83% | 
| III | c | Parzen Window (Gaussian $\sigma=5.0$) | Training data II | - | 96.73% | 
| IV | a | $k_n$-NN ($k_n = \sqrt{n}$) | Training data I | - | 98.59% | 
| IV | b | $k_n$-NN ($k_n = \sqrt{n}$) | Training data II | - | 99.39% | 
| IV | c | $k_n$-NN ($k_n = \frac{\sqrt{n}}{2}$) | Training data II | - | 99.58% | 
First of all, we compute $k_n = \sqrt{n} = \sqrt{150} = 12$. As can be seen from the above results, $12$-NN yields 98.59% accuracy when we use the training data I. The accuracy raises to 99.39% when we increase the number of training samples, i.e., using training data II. Therefore, we have 1,500 samples which means we use $39$-NN in this case. Once we change $k_n$ to 19 based on our owned defined function ($k_n = \frac{\sqrt(n)}{2}$), the accuracy increases to 99.58%. All of these literally depend on the choice of $k_n$ that needs empirical experiments to find the best $k_n$. But we need to be careful about the overfitting and generalization problems as a low training error does not guarantee a small test error. We can also do cross validation to mitigate this issue.
Interestingly, if we compare the results from $k_n$-NN and the Parzen window, all the results from $k_n$-NN are better than those from the Parzen windows. This can lead to the fact that, when it comes to the window function and its hyperparameter of the Parzen window, they are unknown in this case. Hence, $k_n$-NN is a better solution compared to the Parzen window. Because $k_n$-nearest neighbors fixes the number of nearest neighbors $k_n$, as opposed to the Parzen window that fixes the volume $V$ (i.e., it requires the window and the width). To conclude, it seems to provide a better classification results without knowing the underlying knowledge about the data (e.g., the underlying distributions). The performance of an $k_n$-NN will be increase as the number of training samples goes up. However, also note that, if we use a large number of training samples, it takes more time to run since its computational complexity involves the number of training samples to go through when performing a classification.
Once again, you will forget that you know anything about any of the distributions and/or their parameters and apply another non-parametric approach. This time, repeat Part IV above using a linear classifier and the Perceptron criterion – for part a) and c), you obviously cannot pick a value for $k_n$, so, instead, use $\eta = \frac{1}{2}$ for part a) and then use $\eta = \frac{1}{\sqrt{k}}$ for part c), where $k$ is the iteration.
V_a_training_data = data['train1']
V_a_test_data = data['test2']
d = len(V_a_training_data[0][0])
We classify a sample $\vec{x}$ based on the linear discriminant function $g_i(\vec{x})$ of class $\omega_i$ which is given by $$ g_i(\vec{x}) = \vec{a}_i^{\mathsf{T}}\vec{y} $$ where $\vec{a}$ is a weight vector, and $\vec{y}$ is a sample or feature vector.
Therefore, our decision rule to assign $\vec{x}$ to the class $\omega_i$ if $g_i(\vec{x}) > g_j(\vec{x})$, $\forall j \neq i$.
The Perceptron Criterion function is defined as: $$ J(\vec{a}) = \sum_{\vec{y} \in \mathcal{Y}} \left( - \vec{a}^{\mathsf{T}} \vec{y} \right) $$ where $\mathcal{Y}$ is the set of misclassified samples.
We can derive the gradient of $J$ w.r.t. the weight vector $\vec{a}$ at iteration $k$ by: $$ \nabla J(\vec{a}(k)) = \frac{\partial J(\vec{a}(k))}{\partial \vec{a}(k)} = \sum_{\vec{y} \in \mathcal{Y}} (- \vec{y}) $$
def perceptron_criterion(Y):
    Y = np.array(Y)
    return (-Y).sum(axis=0)
Since the question does not specify the way to initialize weight vectors, we randomly initialize $C$ weight vectors at time step 0 by drawing from the Gaussian distribution $\vec{a}(0) \sim \mathcal{N}(\vec{\mu},\,\mathbf{\Sigma})\,$ where $\vec{\mu} = \vec{0}$ and $\mathbf{\Sigma} = \mathbf{I}$.
Note that the random seed is fixed, so that the result is reproducible.
np.random.seed(7720)
V_a_weights = np.random.normal(0, 1, (C, d))
show(r'\mathrm{[Part\ V]\ a)\ \ initial\ weight\ vector\ of\ class 1\ :\ }\vec{a}_1^{\mathsf{T}}(0)', V_a_weights[0][:, None].T)
show(r'\mathrm{[Part\ V]\ a)\ \ initial\ weight\ vector\ of\ class 2\ :\ }\vec{a}_2^{\mathsf{T}}(0)', V_a_weights[1][:, None].T)
show(r'\mathrm{[Part\ V]\ a)\ \ initial\ weight\ vector\ of\ class 2\ :\ }\vec{a}_3^{\mathsf{T}}(0)', V_a_weights[2][:, None].T)
In terms of training the weight vectors, the update rule is employed based on the delta rule and the above gradient. $$ \begin{align*} \vec{a}(k+1) &= \vec{a}(k) - \eta(k) \nabla J(\vec{a}(k))\\ &= \vec{a}(k) - \eta(k) \sum_{\vec{y} \in \mathcal{Y}} (- \vec{y})\\ &= \vec{a}(k) + \eta(k) \sum_{\vec{y} \in \mathcal{Y}} \vec{y} \end{align*} $$
In this miniproject, the stopping criterion for training the Perceptron criterion function is the maximum iteration $K=10$. This means we train the weights of the model 10 iterations and then stop.
def update_weights(a, learning_rate, gradient):
    a -= learning_rate * gradient
    return a
The learning rate $\eta$ is given by the question (a) to be $\eta = \frac{1}{2}$.
V_a_eta = 1/2
V_a_max_epoch = 25
def perceptron_train(weights, training_data, eta, max_epoch, bias=False):
    C = len(weights)
    for epoch in range(1, max_epoch+1):
#     print(f'[Part V] training epoch {epoch}')
        for c in range(C):
            Y = []
            for x in training_data[c]:
                if bias:
                    x = np.concatenate(([1], x))
                linear_discriminant_functions = np.full(C, np.nan)
                for i in range(C):
                    linear_discriminant_functions[i] = x @ weights[i]
                pred_class = linear_discriminant_functions.argmax()
                if pred_class != c:
                    Y.append(x)
            gradient = perceptron_criterion(Y)
            eta = 1 / np.sqrt(epoch) if eta is None else eta
            weights[c] = update_weights(weights[c], eta, gradient)
    return weights
V_a_weights = perceptron_train(V_a_weights, V_a_training_data, V_a_eta, max_epoch=V_a_max_epoch)
show(r'\mathrm{[Part\ V]\ a)}\ \ \ \mathrm{trained\ weight\ vector\ of\ class\ 1}\ : \vec{a}_1(' + str(V_a_max_epoch) +')', V_a_weights[0][:, None].T)
show(r'\mathrm{[Part\ V]\ a)}\ \ \ \mathrm{trained\ weight\ vector\ of\ class\ 2}\ : \vec{a}_2(' + str(V_a_max_epoch) +')', V_a_weights[1][:, None].T)
show(r'\mathrm{[Part\ V]\ a)}\ \ \ \mathrm{trained\ weight\ vector\ of\ class\ 3}\ : \vec{a}_3(' + str(V_a_max_epoch) +')', V_a_weights[2][:, None].T)
def perceptron_predict(weights, data, bias=False):
    C = len(weights)
    cm = np.zeros((C, C), dtype='int')
    for c in range(C):
        Y = []
        for x in data[c]:
            if bias:
                x = np.concatenate(([1], x))
            linear_discriminant_functions = np.full(C, np.nan)
            for i in range(C):
                linear_discriminant_functions[i] = x @ weights[i]
            pred_class = linear_discriminant_functions.argmax()
            if pred_class != c:
                Y.append(x)
            cm[c][pred_class] += 1
    return cm
V_a_training_confusion_matrix = perceptron_predict(V_a_weights, V_a_training_data)
show(r'\mathrm{[Part\ V]\ a)}\ \mathrm{Training\ Confusion\ matrix}', V_a_training_confusion_matrix)
V_a_confusion_matrix = perceptron_predict(V_a_weights, V_a_test_data)
show(r'\mathrm{[Part\ V]\ a)}\ \mathrm{Test\ Confusion\ matrix}', V_a_confusion_matrix)
show_percent(r'\mathrm{[Part\ V]\ a)}\ \ \mathrm{Test\ accuracy}', accuracy(V_a_confusion_matrix))
show_percent(r'\mathrm{[Part\ V]\ a)}\ \ \mathrm{Test\ error}', 100 - accuracy(V_a_confusion_matrix))
Let us try including the bias $w_0$ to the weight vector $\vec{a}$.
According to the defined linear discriminant function above, $g_i(\vec{x}) = \vec{a}_i^{\mathsf{T}}\vec{y}$, in order to include a bias term $w_0$ to the weights, our weight vector $\vec{a}$ is modified to be: $$ \vec{a} = \begin{bmatrix} w_0\\ w_1\\ w_2\\ w_3\\ w_4\\ w_5 \end{bmatrix} $$ , and the feature vector becomes an augmented feature vector which is given by:
$$ \vec{y} = \begin{bmatrix} 1\\ x_1\\ x_2\\ x_3\\ x_4\\ x_5 \end{bmatrix} = \begin{bmatrix} w_0\\ \\ \vec{x}\\ \\ \end{bmatrix} $$Everything else is the same.
np.random.seed(7720)
V_a_weights_with_bias = np.random.normal(0, 1, (C, d+1))
V_a_weights_with_bias = perceptron_train(V_a_weights_with_bias, V_a_training_data, V_a_eta, max_epoch=V_a_max_epoch, bias=True)
V_a_confusion_matrix_with_bias = perceptron_predict(V_a_weights_with_bias, V_a_test_data, bias=True)
show(r'\mathrm{[Part\ V]\ a)\ \ with\ bias}\ \ w_0\ ,K=' + str(V_a_max_epoch) + ':\ \mathrm{Test\ Confusion\ matrix}', V_a_confusion_matrix_with_bias)
show_percent(r'\mathrm{[Part\ V]\ a)\ \ with\ bias}\ \ w_0\ ,K=' + str(V_a_max_epoch) + ':\mathrm{Test\ accuracy}', accuracy(V_a_confusion_matrix_with_bias))
np.random.seed(7720)
V_b_weights = np.random.normal(0, 1, (C, d))
V_b_max_epoch = 25
show(r'\mathrm{[Part\ V]\ b)\ \ initial\ weight\ vector\ of\ class 1\ :\ }\vec{a}_1^{\mathsf{T}}(0)', V_b_weights[0][:, None].T)
show(r'\mathrm{[Part\ V]\ b)\ \ initial\ weight\ vector\ of\ class 2\ :\ }\vec{a}_2^{\mathsf{T}}(0)', V_b_weights[1][:, None].T)
show(r'\mathrm{[Part\ V]\ b)\ \ initial\ weight\ vector\ of\ class 2\ :\ }\vec{a}_3^{\mathsf{T}}(0)', V_b_weights[2][:, None].T)
The learning rate $\eta$ is given by the question (b) to be $\eta = \frac{1}{2}$ (same as the question (a)).
V_b_training_data = data['train2']
V_b_test_data = data['test2']
d = len(V_b_training_data[0][0])
V_b_eta = 1/2
V_b_weights = perceptron_train(V_b_weights, V_b_training_data, V_b_eta, max_epoch=V_b_max_epoch)
V_b_training_confusion_matrix = perceptron_predict(V_b_weights, V_b_training_data)
show(r'\mathrm{[Part\ V]\ b)}\ \mathrm{Training\ Confusion\ matrix}', V_b_training_confusion_matrix)
show(r'\mathrm{[Part\ V]\ b)}\ \ \ \mathrm{trained\ weight\ vector\ of\ class\ 1}\ : \vec{a}_1(' + str(V_b_max_epoch) +')', V_b_weights[0][:, None].T)
show(r'\mathrm{[Part\ V]\ b)}\ \ \ \mathrm{trained\ weight\ vector\ of\ class\ 2}\ : \vec{a}_2(' + str(V_b_max_epoch) +')', V_b_weights[1][:, None].T)
show(r'\mathrm{[Part\ V]\ b)}\ \ \ \mathrm{trained\ weight\ vector\ of\ class\ 3}\ : \vec{a}_3(' + str(V_b_max_epoch) +')', V_b_weights[2][:, None].T)
V_b_confusion_matrix = perceptron_predict(V_b_weights, V_b_test_data)
show(r'\mathrm{[Part\ V]\ b)}\ \mathrm{Test\ Confusion\ matrix}', V_b_confusion_matrix)
show_percent(r'\mathrm{[Part\ V]\ b)}\ \ \mathrm{Test\ accuracy}', accuracy(V_b_confusion_matrix))
show_percent(r'\mathrm{[Part\ V]\ b)}\ \ \mathrm{Test\ error}', 100 - accuracy(V_b_confusion_matrix))
np.random.seed(7720)
V_b_weights_with_bias = np.random.normal(0, 1, (C, d+1))
V_b_weights_with_bias = perceptron_train(V_b_weights_with_bias, V_b_training_data, V_b_eta, max_epoch=25, bias=True)
V_b_confusion_matrix_with_bias = perceptron_predict(V_b_weights_with_bias, V_b_test_data, bias=True)
show(r'\mathrm{[Part\ V]\ b)\ \ with\ bias}\ \ w_0\ ,K=' + str(V_b_max_epoch) + ':\ \mathrm{Test\ Confusion\ matrix}', V_b_confusion_matrix_with_bias)
show_percent(r'\mathrm{[Part\ V]\ b)\ \ with\ bias}\ \ w_0\ ,K=' + str(V_b_max_epoch) + ':\mathrm{Test\ accuracy}', accuracy(V_b_confusion_matrix_with_bias))
np.random.seed(7720)
V_c_weights = np.random.uniform(0, 1, (C, d))
V_c_max_epoch = 25
show(r'\mathrm{[Part\ V]\ c)\ \ initial\ weight\ vector\ of\ class 1\ :\ }\vec{a}_1^{\mathsf{T}}(0)', V_c_weights[0][:, None].T)
show(r'\mathrm{[Part\ V]\ c)\ \ initial\ weight\ vector\ of\ class 2\ :\ }\vec{a}_2^{\mathsf{T}}(0)', V_c_weights[1][:, None].T)
show(r'\mathrm{[Part\ V]\ c)\ \ initial\ weight\ vector\ of\ class 2\ :\ }\vec{a}_3^{\mathsf{T}}(0)', V_c_weights[2][:, None].T)
V_c_training_data = data['train2']
V_c_test_data = data['test2']
d = len(V_c_training_data[0][0])
The learning rate $\eta$ is given by the question (c) to be $\eta = \frac{1}{\sqrt{k}}$.
V_c_weights = perceptron_train(V_c_weights, V_c_training_data, eta=None, max_epoch=V_c_max_epoch)
show(r'\mathrm{[Part\ V]\ c)}\ \ \ \mathrm{trained\ weight\ vector\ of\ class\ 1}\ : \vec{a}_1(' + str(V_c_max_epoch) +')', V_c_weights[0][:, None].T)
show(r'\mathrm{[Part\ V]\ c)}\ \ \ \mathrm{trained\ weight\ vector\ of\ class\ 2}\ : \vec{a}_2(' + str(V_c_max_epoch) +')', V_c_weights[1][:, None].T)
show(r'\mathrm{[Part\ V]\ c)}\ \ \ \mathrm{trained\ weight\ vector\ of\ class\ 3}\ : \vec{a}_3(' + str(V_c_max_epoch) +')', V_c_weights[2][:, None].T)
V_c_training_confusion_matrix = perceptron_predict(V_c_weights, V_c_training_data)
show(r'\mathrm{[Part\ V]\ c)}\ \mathrm{Training\ Confusion\ matrix}', V_c_training_confusion_matrix)
V_c_confusion_matrix = perceptron_predict(V_c_weights, V_c_test_data)
show(r'\mathrm{[Part\ V]\ c)}\ \mathrm{Test\ Confusion\ matrix}', V_c_confusion_matrix)
show_percent(r'\mathrm{[Part\ V]\ c)}\ \ K=' + str(V_c_max_epoch) +'\ :\mathrm{Test\ accuracy}', accuracy(V_c_confusion_matrix))
show_percent(r'\mathrm{[Part\ V]\ c)}\ \ K=' + str(V_c_max_epoch) +'\ :\mathrm{Test\ error}', 100 - accuracy(V_c_confusion_matrix))
np.random.seed(7720)
V_c_weights_with_bias = np.random.normal(0, 1, (C, d+1))
V_c_weights_with_bias = perceptron_train(V_c_weights_with_bias, V_c_training_data, eta=None, max_epoch=V_c_max_epoch, bias=True)
V_c_confusion_matrix_with_bias = perceptron_predict(V_c_weights_with_bias, V_c_test_data, bias=True)
show(r'\mathrm{[Part\ V]\ c)\ \ with\ bias}\ \ w_0\ ,K=' + str(V_c_max_epoch) + ':\mathrm{Test\ Confusion\ matrix}', V_c_confusion_matrix_with_bias)
show_percent(r'\mathrm{[Part\ V]\ c)\ \ with\ bias}\ \ w_0\ ,K=' + str(V_c_max_epoch) + ':\mathrm{Test\ accuracy}', accuracy(V_c_confusion_matrix_with_bias))
Let us try to change our only hyperparameter here, the maximum number of iterations $K \in \{10, 20, 40, 60, 80\}$.
for max_iteration in (10, 20, 40, 60, 80):
    np.random.seed(7720)
    V_c_vary_weights = np.random.uniform(0, 1, (C, d))
    V_c_vary_weights = perceptron_train(V_c_vary_weights, V_c_training_data, eta=None, max_epoch=max_iteration)
    V_c_vary_confusion_matrix = perceptron_predict(V_c_vary_weights, V_c_test_data)
    show(r'\mathrm{[Part\ V]\ c)}\ \ K=' + str(max_iteration) + '\ :\mathrm{Test\ Confusion\ matrix}', V_c_vary_confusion_matrix)
    show_percent(r'\mathrm{[V]\ c)}\ \ K=' + str(max_iteration) + '\ :\mathrm{Test\ accuracy}', accuracy(V_c_vary_confusion_matrix))
Let us try varing the maximum number of iterations $K \in \{10, 20, 40, 60, 80\}$ for the models with the bias term.
for max_iteration in (10, 20, 40, 60, 80):
    np.random.seed(7720)
    V_c_vary_weights_with_bias = np.random.uniform(0, 1, (C, d+1))
    V_c_vary_weights_with_bias = perceptron_train(V_c_vary_weights_with_bias, V_c_training_data, eta=None, max_epoch=max_iteration, bias=True)
    V_c_vary_confusion_matrix_with_bias = perceptron_predict(V_c_vary_weights_with_bias, V_c_test_data, bias=True)
    show(r'\mathrm{[Part\ V]\ c)\ \ with\ bias}\ \ w_0\ ,K=' + str(max_iteration) + ':\mathrm{Test\ Confusion\ matrix}', V_c_vary_confusion_matrix_with_bias)
    show_percent(r'\mathrm{[Part\ V]\ c)\ \ with\ bias}\ \ w_0\ ,K=' + str(max_iteration) + ':\mathrm{Test\ accuracy}', accuracy(V_c_vary_confusion_matrix_with_bias))
| Part | Question | Algorithm | Training set | Acccuracy on Testing data I | Accuracy on Testing data II | 
|---|---|---|---|---|---|
| I | a, b | BDR | A prior knowledge | 99.87% | 99.89% | 
| I | c | BDR | Training data I | - | 99.81% | 
| I | d | BDR | Training data II | - | 99.88% | 
| II | b | MDA | Training data I | - | 99.51% | 
| II | c | MDA | Training data II | - | 99.59% | 
| III | a | Parzen Window (Hypercube $h_n=0.1$) | Training data I | - | 33.33% | 
| III | a | Parzen Window (Hypercube $h_n=0.7$) | Training data I | - | 33.39% | 
| III | a | Parzen Window (Hypercube $h_n=5.0$) | Training data I | - | 94.86% | 
| III | b | Parzen Window (Hypercube $h_n=0.1$) | Training data II | - | 33.33% | 
| III | b | Parzen Window (Hypercube $h_n=0.7$) | Training data II | - | 34.19% | 
| III | b | Parzen Window (Hypercube $h_n=5.0$) | Training data II | - | 99.16% | 
| III | c | Parzen Window (Gaussian $\sigma=0.1$) | Training data II | - | 93.14% | 
| III | c | Parzen Window (Gaussian $\sigma=0.7$) | Training data II | - | 94.83% | 
| III | c | Parzen Window (Gaussian $\sigma=5.0$) | Training data II | - | 96.73% | 
| IV | a | $k_n$-NN ($k_n = \sqrt{n}$) | Training data I | - | 98.59% | 
| IV | b | $k_n$-NN ($k_n = \sqrt{n}$) | Training data II | - | 99.39% | 
| IV | c | $k_n$-NN ($k_n = \frac{\sqrt{n}}{2}$) | Training data II | - | 99.58% | 
| V | a | The Perceptron Criterion ($\eta=\frac{1}{2}$), $K=25$ | Training data I | - | 97.98% | 
| V | b | The Perceptron Criterion ($\eta=\frac{1}{2}$), $K=25$ | Training data II | - | 98.24% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=25$ | Training data II | - | 98.22% | 
| V | a | The Perceptron Criterion ($\eta=\frac{1}{2}$), $K=25$, with bias | Training data I | - | 98.31% | 
| V | b | The Perceptron Criterion ($\eta=\frac{1}{2}$), $K=25$, with bias | Training data II | - | 98.36% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=25$, with bias | Training data II | - | 98.37% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=10$ | Training data II | - | 98.01% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=20$ | Training data II | - | 98.23% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=40$ | Training data II | - | 98.22% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=60$ | Training data II | - | 98.25% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=80$ | Training data II | - | 98.22% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=10$, with bias | Training data II | - | 98.23% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=20$, with bias | Training data II | - | 98.46% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=40$, with bias | Training data II | - | 98.54% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=60$, with bias | Training data II | - | 98.60% | 
| V | c | The Perceptron Criterion ($\eta(k)=\frac{1}{\sqrt{k}}$), $K=80$, with bias | Training data II | - | 98.61% | 
In Part V, again, we do not know anything about the a priori knowledge. Although, all we know is the training data, we can train weight vectors $\vec{a}_i$ for linear discriminant functions $g_i(\vec{x})$ by minimizing a criterion function which in this case is the Perceptron criterion function. According to the results in the table above, the models achieve very promising results without knowing anything, i.e., this is actually a machine learning. Also, we can conclude that the problem is quite linearly separable since we only utilize the linear discriminant functions. The stopping criterion for all the Perceptron criterion above is stopping when the iteration $k$ is equal to the maximum number of iterations $K$.
We have tried a bunch of different $K$, and found out that $K=25$ is a good value to get good classification results. It is also interesting to see how the bias term $w_0$ would affect the performance of the models. With the same setting (i.e., the same $K$ and the same intial weights), we can see that the models with bias term $w_0$ yield better results in all cases.
For the question (c), we have the dynamic learning rate, which is a function of the current iteration, $\eta(k)$. This means the larger current iteration $k$ resulting in the smaller learning rate $\eta$, i.e., the learning rate decreases over time. In order to see how the dynamic learning rate has an effect to the performance of a model, we have varied $K \in \{10, 20, 40, 60, 80\}$. As can be seen from the table above, if we let a model trains further, the accuracy tend to increase. Nevertheless, one must be careful, if we train a model too much, it will eventually be overfit to the training data. There are many possible solutions to this issue, for instance, adding reguarlization, or applying cross validation.
All in all, this is actually a really wonderful MiniProject that wraps up pretty much everything about this course. Different supervised classification techniques with different knowledge and assumptions are employed in a series to see how each one works. The comparisons between them are also discussed. And this comes an end of this Introduction to Pattern Recognition and Machine Learning class.