Mini Project 5

Watchanan Chantapakul (wcgzm)


This question is to demonstrate that: 1) when the selected model is poor, the maximum-likelihood classifier does not produce satisfactory results; and 2) proper transformation of the data can compensate for poor models. The dataset [1] used for this question is divided into training [2] and test [3] data, with each one consisting of 3 classes in a 2D feature space.

The problem is now 1-D and, again, if you inspect the data, a Gaussian distribution is a more suitable pdf to describe all three classes. Assume then that each class has $p(r|\omega_i) = \mathcal{N}(\mu_i, \sigma^2)$ with $\mu_i$ unknown and variance $\sigma^2 = 0.25$ for all three classes. Let the only prior knowledge about $\mu_i$ be $p(\mu_i) = \mathcal{N}(\mu_0 = 0, \sigma_0^2 = 100)$ and compute the Bayes estimates for $\mu_i$ and the posterior distribution $p(\mu_i|D_i)$ of all three classes. Next compute $p(r|\omega_i,D_i) = \int p(r|\mu_i) p(\mu_i|D_i) \,d\mu_i$ and use this density estimate to classify the test data and compute the test error using a confusion matrix.

Do not forget to comment on the results from each step above.

[1] http://vigir.missouri.edu/~gdesouza/ece7720/test_train_data_class3.mat
[2] Access the training data via TrainPts = Data.train.
[3] Access the test data via TestPts = Data.test.
[4] http://vigir.missouri.edu/~gdesouza/ece7720/Lecture_Notes/Lecture10.pdf


Dataset

Load data from the test_train_data_class3.mat file. A customized wrapper around scipy's loadmat function is needed here, because the original loadmat does not convert nested MATLAB structs into Python dictionaries.
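
A minimal sketch of such a wrapper is shown below. It assumes only that the file stores a struct named Data with train and test fields (as stated in the footnotes); the exact inner layout of those fields may still need further unpacking.

```python
import scipy.io as sio

def load_nested_mat(filename):
    """Load a .mat file, converting nested MATLAB structs into Python dicts."""
    def to_dict(obj):
        # scipy returns mat_struct objects; their fields are listed in `_fieldnames`.
        if hasattr(obj, '_fieldnames'):
            return {name: to_dict(getattr(obj, name)) for name in obj._fieldnames}
        return obj

    raw = sio.loadmat(filename, struct_as_record=False, squeeze_me=True)
    return {key: to_dict(value) for key, value in raw.items()
            if not key.startswith('__')}

data = load_nested_mat('test_train_data_class3.mat')
TrainPts, TestPts = data['Data']['train'], data['Data']['test']
```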

The stored Data variable contains two subsets: train and test.

We dig deeper into the number of classes and the number of dimensions.

Data Visualization

Maximum Likelihood Estimation (MLE)

Assume that the class-conditional densities $p(\mathbf{x}|\omega_i)$ are Gaussian with unknown parameters: the means $\mathbf{\mu}_i$ and the covariance matrices $\mathbf{\Sigma}_i$. We can estimate these parameters from the given training data using maximum likelihood estimation (MLE).

From the training data, we have $C=3$ classes. Thus, there are 3 sets, $D_1, D_2,$ and $D_3$. All samples in each set are i.i.d. (independent and identically distributed). Also, we use the samples in $D_i$ only to estimate $\theta_i$ without interfering with other classes.

There are $n_i$ samples, $x_1, \dots, x_{n_i}$, in $D_i$.

Log-likelihood $\ln p(x_k | \mathbf{\theta})$

Since the data are in a 2D space, the class-conditional densities are multivariate Gaussian distributions with parameters $\vec{\theta} = [\theta_1\ \ \theta_2]^{\mathsf{T}} = [\mu_i\ \ \Sigma_i]^{\mathsf{T}}$.

Our log-likelihood $\ln p(x_k | \vec{\theta})$ of a single sample $x_k$ is then defined as:

$$ \ln p(x_k | \vec{\theta}) = -\frac{1}{2}(x_k-\mu_i)^\mathsf{T}\Sigma_i^{-1}(x_k-\mu_i) -\frac{d}{2}\ln (2\pi) - \frac{1}{2} \ln |\Sigma_i| $$

The log-likelihood $l$ of the dataset $D_i = \{x_1, \dots, x_n\}$ (writing $n = n_i$ for brevity) is: $$ l(\vec{\theta}) = \sum_{k=1}^{n} \ln p(x_k | \vec{\theta}) $$

We arrive at the gradients or the derivatives of the log-likelihood $l$ w.r.t. $\vec{\theta} = [\theta_1\ \theta_2]^{\mathsf{T}}$ as follows: $$ \nabla_{\vec{\theta}} l = \sum_{k=1}^{n} \nabla_{\vec{\theta}} \ln p(x_k | \vec{\theta}) $$

To estimate $\vec{\theta}$, we set the gradients to be zero to obtain the maximum likelihood estimate of $\vec{\theta}$, $$ \nabla_{\vec{\theta}} l = \vec{0} $$

$$ \nabla_{\vec{\theta}} l = \begin{bmatrix} \frac{\partial l}{\partial \theta_1}\\ \frac{\partial l}{\partial \theta_2} \end{bmatrix} $$

Estimate $\theta_1 = \mu$

$$ \frac{\partial l}{\partial \theta_1} = \frac{\partial l}{\partial \mu} = \frac{\partial}{\partial \mu} \left( \sum_{k=1}^{n} -\frac{1}{2}(x_k-\mu_i)^\mathsf{T}\Sigma_i^{-1}(x_k-\mu_i) \right) = \vec{0} $$
$$ \frac{\partial}{\partial \mu} \left( \sum_{k=1}^{n} (x_k-\mu_i)^\mathsf{T}\Sigma_i^{-1}(x_k-\mu_i) \right) = \vec{0} $$
$$ \frac{\partial}{\partial \mu} \left( \sum_{k=1}^{n} (x_k^{\mathsf{T}}\Sigma_i^{-1}-\mu_i^{\mathsf{T}}\Sigma_i^{-1}) (x_k-\mu_i) \right) = \vec{0} $$
$$ \frac{\partial}{\partial \mu} \left( \sum_{k=1}^{n} (x_k^{\mathsf{T}}\Sigma_i^{-1}x_k - 2\mu_i^{\mathsf{T}}\Sigma_i^{-1}x_k + \mu_i^{\mathsf{T}} \Sigma_i^{-1} \mu_i) \right) = \vec{0} $$
$$ \sum_{k=1}^{n} (0 - 2\Sigma_i^{-1}x_k + 2 \Sigma_i^{-1} \mu_i) = \vec{0} $$
$$ \sum_{k=1}^{n} (- \Sigma_i^{-1}x_k + \Sigma_i^{-1} \mu_i) = \vec{0} $$
$$ \sum_{k=1}^{n} (\Sigma_i^{-1} \mu_i - \Sigma_i^{-1}x_k) = \vec{0} $$
$$ \sum_{k=1}^{n} \left(\Sigma_i^{-1} (\mu_i - x_k) \right) = \vec{0} $$
$$ \Sigma_i^{-1} \sum_{k=1}^{n} \left(\mu_i - x_k \right) = \vec{0} $$

Multiplying both sides by $\Sigma_i$, $$ \sum_{k=1}^{n} \left(\mu_i - x_k \right) = \vec{0} $$

$$ n \mu_i - \sum_{k=1}^{n} x_k = \vec{0} $$
$$ \mu_i = \frac{1}{n} \sum_{k=1}^{n} x_k $$
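
This estimator is simply the sample mean of each class. A minimal sketch is shown below; the container train_by_class in the usage comment is hypothetical and stands for a dict mapping each class label to its $(n_i, 2)$ array of training samples.

```python
import numpy as np

def mle_mean(X):
    """Maximum-likelihood estimate of the mean: the sample average over the rows of X."""
    # X has shape (n, d): n samples with d features each.
    return X.mean(axis=0)

# Hypothetical usage, assuming train_by_class[i] is the (n_i, 2) array of class i:
# mu = {i: mle_mean(X) for i, X in train_by_class.items()}
```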

Estimate $\theta_2 = \Sigma$

The derivation below yields the biased maximum-likelihood estimate of the covariance matrix $\Sigma_i$.

$$ \frac{\partial l}{\partial \theta_2} = \frac{\partial l}{\partial \Sigma} = \frac{\partial}{\partial \Sigma} \left( \sum_{k=1}^{n} \left( -\frac{1}{2}(x_k-\mu_i)^\mathsf{T}\Sigma_i^{-1}(x_k-\mu_i) - \frac{1}{2} \ln |\Sigma_i| \right) \right) = \vec{0} $$
$$ \frac{\partial}{\partial \Sigma} \left( \sum_{k=1}^{n} \left( (x_k-\mu_i)^\mathsf{T}\Sigma_i^{-1}(x_k-\mu_i) + \ln |\Sigma_i| \right) \right) = \vec{0} $$

Since the determinant of the inverse of a matrix is the reciprocal of its determinant, $\ln |\Sigma_i| = -\ln |\Sigma_i^{-1}|$, and we can equivalently differentiate with respect to $\Sigma_i^{-1}$:

$$ \frac{\partial}{\partial \Sigma^{-1}} \left( \sum_{k=1}^{n} \left( (x_k-\mu_i)^\mathsf{T}\Sigma_i^{-1}(x_k-\mu_i) - \ln |\Sigma_i^{-1}| \right) \right) = \vec{0} $$
$$ \sum_{k=1}^{n} \left( -(x_k-\mu_i)(x_k-\mu_i)^\mathsf{T} + \Sigma_i \right) = \vec{0} $$
$$ - \sum_{k=1}^{n} \left((x_k-\mu_i)(x_k-\mu_i)^\mathsf{T} \right) + n\Sigma_i = \vec{0} $$

Thus, we can estimate a covariance matrix $\Sigma_i$ with

$$ \Sigma_i = \frac{1}{n} \sum_{k=1}^{n} \left((x_k-\mu_i)(x_k-\mu_i)^\mathsf{T} \right) $$

Cross-check with NumPy's built-in function np.cov
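
A small sketch of this cross-check on synthetic data; np.cov expects variables in rows, and bias=True selects the $1/n$ normalization used above.

```python
import numpy as np

def mle_cov(X, mu):
    """Biased (maximum-likelihood) covariance estimate: normalize by n, not n - 1."""
    diff = X - mu                       # shape (n, d)
    return diff.T @ diff / X.shape[0]   # shape (d, d)

# Cross-check on synthetic data: np.cov(..., bias=True) normalizes by N as well.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
assert np.allclose(mle_cov(X, X.mean(axis=0)), np.cov(X.T, bias=True))
```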

Classification

Mahalanobis Distance

$$ r = \sqrt{(\vec{x} - \vec{y})^{\mathsf{T}} \mathbf{\Sigma}^{-1} (\vec{x} - \vec{y})} $$

In this case, we can substitute $\vec{y}$ with the mean and the covariance matrix of class $i$, so we get:

$$ r = \sqrt{(\vec{x} - \vec{\mu}_i)^{\mathsf{T}} \mathbf{\Sigma_i}^{-1} (\vec{x} - \vec{\mu}_i)} $$

Classification Rule

We predict the class of a sample $\vec{x}$ based on the minimum Mahalanobis distance:

$$ \hat{y}(\vec{x}) = \operatorname*{argmin}_{i} r_i(\vec{x}) $$
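
A sketch of this rule is given below; means and cov_invs are hypothetical lists holding $\mu_i$ and $\Sigma_i^{-1}$ for classes $1, \dots, C$ in order.

```python
import numpy as np

def mahalanobis(x, mu, cov_inv):
    """Mahalanobis distance between sample x and the class described by (mu, cov_inv)."""
    diff = x - mu
    return np.sqrt(diff @ cov_inv @ diff)

def classify_mahalanobis(x, means, cov_invs):
    """Assign x to the class (labels 1..C) with the smallest Mahalanobis distance."""
    distances = [mahalanobis(x, mu, ci) for mu, ci in zip(means, cov_invs)]
    return int(np.argmin(distances)) + 1
```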

|            | (predicted) 1 | (predicted) 2 | (predicted) 3 |
|------------|---------------|---------------|---------------|
| (actual) 1 | 1             | 9             | 60            |
| (actual) 2 | 0             | 0             | 94            |
| (actual) 3 | 0             | 0             | 46            |

Table 1: Confusion matrix of the model with the estimated parameter $\mu_i$ and biased $\Sigma_i$ using Maximum Likelihood Estimation, and classifying based on Mahalanobis distance

The confusion matrix obtained using the Mahalanobis distance alone does not look promising: the test samples are mostly classified as class 3.

The reason is that this rule takes only the means $\mu_i$ and the inverses of the covariance matrices $\Sigma_i^{-1}$ into account. All three means are very close together, so the dominant factor becomes the covariance matrix of class 3, which covers most of the area, as shown in the scatter plot above.

Thus, we move on to the discriminant function as our classification criterion.

Discriminant function $g_i(\cdot)$

Since we have no prior knowledge, the priors $P(\omega_i)$ are taken to be equal (uninformative), i.e., $P(\omega_i) = \frac{1}{3}$ for each of the three classes.

From Mini Project 2, the discriminant function in its generic form, valid for any $d$-dimensional data with given mean, covariance matrix, and prior probability, is: $$ g_i(x) = -\frac{1}{2}(x-\mu_i)^\mathsf{T}\Sigma_i^{-1}(x-\mu_i)-\frac{d}{2}\ln (2\pi) - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i) $$

Note: the functions below are from Mini Project 2.

Classification Rule

We predict the class of a sample $\vec{x}$ based on the following equation:

$$ \hat{y}(\vec{x}) = \operatorname*{argmax}_{i} g_i(\vec{x}) $$
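
A sketch of the discriminant-based rule under the same assumptions about the containers (means, covs, and priors are hypothetical lists indexed in class order):

```python
import numpy as np

def discriminant(x, mu, cov, prior):
    """Gaussian discriminant g_i(x) for a d-dimensional sample x."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * d * np.log(2.0 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

def classify_discriminant(x, means, covs, priors):
    """Assign x to the class (labels 1..C) with the largest discriminant value."""
    scores = [discriminant(x, mu, cov, p) for mu, cov, p in zip(means, covs, priors)]
    return int(np.argmax(scores)) + 1
```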

Confusion Matrix

A confusion matrix is a simple table that summarizes the performance of a classifier: each row corresponds to an actual class and each column to a predicted class.

|            | (predicted) 1 | (predicted) 2 | (predicted) 3 |
|------------|---------------|---------------|---------------|
| (actual) 1 | 59            | 11            | 0             |
| (actual) 2 | 15            | 61            | 18            |
| (actual) 3 | 0             | 7             | 39            |

Table 2: Confusion matrix of the model with the estimated parameter $\mu_i$ and $\Sigma_i$ using Maximum Likelihood Estimation, and classifying based on discriminant function

According to the confusion matrix shown above, most of the samples are classified correctly and therefore lie on the diagonal of the confusion matrix, while the misclassified samples lie off the diagonal.

No samples of class 1 are misclassified as class 3, and vice versa. The reason is that the class 1 samples are compact and lie in the innermost region, as opposed to the class 3 samples in the outermost region; when class 1 samples are misclassified, they therefore end up in class 2 rather than class 3.

We can compute the accuracy from the confusion matrix $\mathbf{CM}$, where $N$ is the total number of test samples, as follows:

$$ \mathrm{Accuracy} = \frac{\operatorname{tr}(\mathbf{CM})}{N} = \frac{\sum_{i=1}^{C} \mathbf{CM}_{ii}}{\sum_{i=1}^{C}\sum_{j=1}^{C} \mathbf{CM}_{ij}} = \frac{1}{N} \sum_{i=1}^{C} \mathbf{CM}_{ii} $$

We can also calculate the error by subtracting the accuracy from 1.

$$ \mathrm{Error} = 1 - \mathrm{Accuracy} $$
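
One way to compute these quantities, assuming the class labels are integers $1, \dots, C$ (a sketch, not the original notebook code):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, C=3):
    """CM[i, j] counts samples whose actual class is i+1 and predicted class is j+1."""
    CM = np.zeros((C, C), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        CM[actual - 1, predicted - 1] += 1
    return CM

def accuracy(CM):
    """Fraction of correctly classified samples: trace divided by the total count."""
    return np.trace(CM) / CM.sum()

# error = 1 - accuracy(CM)
```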

We achieve a test accuracy of $75.71\%$, which corresponds to a test error of $24.29\%$.

Unbiased Covariance Matrix

In the above case, the estimated covariance matrices are biased by nature. It is interesting to see how unbiased covariance matrices differ from the biased ones. We can estimate an unbiased covariance matrix $\Sigma_i$ with

$$ \Sigma_i = \frac{1}{n-1} \sum_{k=1}^{n} \left((x_k-\mu_i)(x_k-\mu_i)^\mathsf{T} \right) $$
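
The only change relative to the biased estimator is the normalization, which is also NumPy's default behavior (a brief sketch):

```python
import numpy as np

def unbiased_cov(X, mu):
    """Unbiased covariance estimate: normalize by n - 1 instead of n."""
    diff = X - mu
    return diff.T @ diff / (X.shape[0] - 1)

# Matches NumPy's default normalization: np.cov(X.T)  (ddof=1 by default)
```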

|            | (predicted) 1 | (predicted) 2 | (predicted) 3 |
|------------|---------------|---------------|---------------|
| (actual) 1 | 59            | 11            | 0             |
| (actual) 2 | 15            | 61            | 18            |
| (actual) 3 | 0             | 7             | 39            |

Table 3: Confusion matrix of the model with the estimated parameter $\mu_i$ and unbiased $\Sigma_i$ using Maximum Likelihood Estimation, and classifying based on discriminant function

The results from using biased and unbiased covariance matrices differ only slightly. Again, no samples of class 1 are misclassified as class 3, and vice versa.

The only test samples that are affected are from class 2.

The test accuracy obtained with the unbiased covariance matrices increases by $0.48\%$ (from $75.71\%$ to $76.19\%$), so the test error decreases by the same amount.

Bayesian Estimation

Here, we transform the data by converting them from Cartesian Coordinates $(x, y)$ to Polar Coordinates $(r, \theta)$ using the cart2pol function below.
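
The cart2pol code cell is not reproduced here; a minimal version consistent with the description (returning $r$ and $\theta$) would be:

```python
import numpy as np

def cart2pol(x, y):
    """Convert Cartesian coordinates (x, y) to polar coordinates (r, theta)."""
    r = np.hypot(x, y)           # radius
    theta = np.arctan2(y, x)     # angle in radians, in (-pi, pi]
    return r, theta
```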

Let's visualize the transformed training samples in polar coordinates; the scatter plot is shown below.

We discard the $\theta$ dimension and keep only $r$, because $\theta$ cannot be used to distinguish the samples of the three classes.

The data are now one-dimensional, but for the sake of visualization an artificial height is added to each sample so that overlapping samples remain visible.

Assume that each class has the following properties:

- Known variance $\sigma^2 = 0.25$
- Prior $p(\mu_i) = \mathcal{N}(\mu_0 = 0, \sigma_0^2 = 100)$

The Desired Class-conditional Density $p(r | \omega_i, D_i)$

$$ p(r|\omega_i,D_i) = \int p(r|\mu_i) p(\mu_i|D_i) \,d\mu_i $$

As we know from the question, $$ p(r|\mu_i) \sim \mathcal{N}(\mu_i, \sigma^2) $$

This means that the a posteriori density $p(\mu_i | \mathcal{D}_i)$ is the only term that we still need to derive.

$$ \begin{align*} p(\mu | \mathcal{D}) &= \frac{p(\mathcal{D} | \mu)\, p(\mu)}{\int p(\mathcal{D} | \mu)\, p(\mu) \,d\mu}\\ &= \frac{\left[ \prod_{k=1}^{n} p(x_k | \mu) \right] p(\mu)}{\int \left[ \prod_{k=1}^{n} p(x_k | \mu) \right] p(\mu) \,d\mu}\\ &= \alpha \left[ \prod_{k=1}^{n} p(x_k | \mu) \right] p(\mu)\\ \end{align*} $$

Here, the constant $\alpha$ is a normalization factor that absorbs the integral in the denominator; it can be treated as a constant because the denominator does not depend on $\mu$.

$$ \begin{align*} p(\mu | \mathcal{D}) &= \alpha \left[ \prod_{k=1}^{n} p(x_k | \mu) \right] p(\mu)\\ &= \alpha \left[ \prod_{k=1}^{n} \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2} \left( \frac{x_k - \mu}{\sigma} \right)^2} \right] \frac{1}{\sqrt{2 \pi} \sigma_0} e^{-\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2}\\ &= \alpha \left( \frac{1}{\sqrt{2 \pi} \sigma} \right)^{n} \frac{1}{\sqrt{2 \pi} \sigma_0} \left[ \prod_{k=1}^{n} e^{-\frac{1}{2} \left( \frac{x_k - \mu}{\sigma} \right)^2} \right] e^{-\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2}\\ &= \alpha' \left[ \prod_{k=1}^{n} e^{-\frac{1}{2} \left( \frac{x_k - \mu}{\sigma} \right)^2} \right] e^{-\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2}\\ &= \alpha' \exp{\left( -\frac{1}{2} \left[ \sum_{k=1}^{n} \left( \frac{x_k - \mu}{\sigma} \right)^2 + \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right] \right)}\\ &= \alpha' \exp{\left( -\frac{1}{2} \left[ \sum_{k=1}^{n} \frac{(x_k - \mu)^2}{\sigma^2} + \frac{(\mu - \mu_0)^2}{\sigma_0^2} \right] \right)}\\ &= \alpha' \exp{\left( -\frac{1}{2} \left[ \sum_{k=1}^{n} \frac{x_k^2 - 2 x_k \mu + \mu^2}{\sigma^2} + \frac{\mu^2 - 2 \mu \mu_0 + \mu_0^2}{\sigma_0^2} \right] \right)}\\ &= \alpha' \exp{\left( -\frac{1}{2} \left[ \frac{\sum_{k=1}^{n}x_k^2}{\sigma^2} - \frac{2 \mu \sum_{k=1}^{n} x_k}{\sigma^2} + \frac{n\mu^2}{\sigma^2} + \frac{\mu^2}{\sigma_0^2} - \frac{2 \mu \mu_0}{\sigma_0^2} + \frac{\mu_0^2}{\sigma_0^2} \right] \right)}\\ &= \alpha' \exp{\left( -\frac{1}{2} \left[ \left( \frac{n\mu^2}{\sigma^2} + \frac{\mu^2}{\sigma_0^2} \right) - \left( \frac{2 \mu \sum_{k=1}^{n} x_k}{\sigma^2} + \frac{2 \mu \mu_0}{\sigma_0^2} \right) + \left( \frac{\sum_{k=1}^{n}x_k^2}{\sigma^2} + \frac{\mu_0^2}{\sigma_0^2} \right) \right] \right)}\\ &= \alpha'' \exp{\left( -\frac{1}{2} \left[ \left( \frac{n\mu^2}{\sigma^2} + \frac{\mu^2}{\sigma_0^2} \right) - \left( \frac{2 \mu \sum_{k=1}^{n} x_k}{\sigma^2} + \frac{2 \mu \mu_0}{\sigma_0^2} \right) \right] \right)}\\ &= \alpha'' \exp{\left( -\frac{1}{2} \left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\mu^2 - 2 \mu\left(\frac{\sum_{k=1}^{n} x_k}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \right] \right)}\\ \end{align*} $$

$p(\mu|\mathcal{D})$ is an exponential of a quadratic function of $\mu$ (terms in $\mu^2$ and $\mu$), so it is again a normal density, which we write as:

$$ p(\mu|\mathcal{D}) \sim \mathcal{N}(\mu_n, \sigma_n^2) $$

where $\mu_n$ is the estimated mean, and $\sigma_n^2$ is the estimated variance.

$$ \begin{align*} p(\mu | \mathcal{D}) &= \frac{1}{\sqrt{2 \pi} \sigma_n} \exp{\left( -\frac{1}{2} \left( \frac{(\mu - \mu_n)^2}{\sigma_n^2} \right) \right)}\\ &= \frac{1}{\sqrt{2 \pi} \sigma_n} \exp{\left( -\frac{1}{2} \left( \frac{\mu^2 - 2 \mu \mu_n + \mu_n^2}{\sigma_n^2} \right) \right)}\\ &= \frac{1}{\sqrt{2 \pi} \sigma_n} \exp{\left( -\frac{1}{2} \left( \frac{\mu^2}{\sigma_n^2} - \frac{2 \mu \mu_n}{\sigma_n^2} + \frac{\mu_n^2}{\sigma_n^2} \right) \right)}\\ &= \frac{1}{\sqrt{2 \pi} \sigma_n} \exp{\left( -\frac{1}{2} \left( \left[ \frac{1}{\sigma_n^2} \right] \mu^2 - 2 \mu \left[ \frac{\mu_n}{\sigma_n^2} \right] + \frac{\mu_n^2}{\sigma_n^2} \right) \right)}\\ \end{align*} $$

Thus, we can estimate $\mu_n$ and $\sigma_n^2$ by equating the above equation with a normal distribution.

$$ \begin{align*} p(\mu | \mathcal{D}) &= \frac{1}{\sqrt{2 \pi} \sigma_n} \exp{\left( -\frac{1}{2} \left( \left[ \frac{1}{\sigma_n^2} \right] \mu^2 - 2 \mu \left[ \frac{\mu_n}{\sigma_n^2} \right] + \frac{\mu_n^2}{\sigma_n^2} \right) \right)}\\ p(\mu | \mathcal{D}) &= \alpha'' \exp{\left( -\frac{1}{2} \left( \left[ \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right]\mu^2 - 2 \mu \left[\frac{\sum_{k=1}^{n} x_k}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right] \right) \right)}\\ \end{align*} $$

Comparing the coefficients of the quadratic term $\mu^2$ in the exponent on the left-hand and right-hand sides gives us:

$$ \begin{align*} \frac{1}{\sigma_n^2} &= \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\\ \frac{1}{\sigma_n^2} &= \frac{n \sigma_0^2 + \sigma^2}{\sigma^2 \sigma_0^2}\\ \sigma_n^2 &= \frac{\sigma^2 \sigma_0^2}{n \sigma_0^2 + \sigma^2} \end{align*} $$

We can also compare the coefficients of the linear term $\mu$ in the exponent on the left-hand and right-hand sides. This gives us:

$$ \begin{align*} \frac{\mu_n}{\sigma_n^2} &= \frac{\sum_{k=1}^{n} x_k}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\\ &= \frac{1}{\sigma^2} \sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\\ &= \frac{1}{\sigma^2} \frac{n}{n} \sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\\ &= \frac{1}{\sigma^2} n \left[ \frac{1}{n} \sum_{k=1}^{n} x_k \right] + \frac{\mu_0}{\sigma_0^2}\\ &= \frac{1}{\sigma^2} n \bar{x_n} + \frac{\mu_0}{\sigma_0^2}\\ \frac{\mu_n}{\sigma_n^2} &= \frac{n}{\sigma^2} \bar{x_n} + \frac{\mu_0}{\sigma_0^2}\\ \mu_n &= \sigma_n^2 \left( \frac{n}{\sigma^2} \bar{x_n} + \frac{\mu_0}{\sigma_0^2} \right)\\ &= \frac{n \sigma_n^2}{\sigma^2} \bar{x_n} + \frac{\sigma_n^2}{\sigma_0^2} \mu_0 \\ &= \frac{n}{\sigma^2} \left[ \frac{\sigma^2 \sigma_0^2}{n \sigma_0^2 + \sigma^2} \right] \bar{x_n} + \frac{1}{\sigma_0^2} \left[ \frac{\sigma^2 \sigma_0^2}{n \sigma_0^2 + \sigma^2} \right] \mu_0 \\ \mu_n &= \frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2} \bar{x_n} + \frac{\sigma^2}{n \sigma_0^2 + \sigma^2} \mu_0 \\ \end{align*} $$

where the sample mean $\bar{x}_n$ is:

$$ \bar{x}_n = \frac{1}{n} \sum_{k=1}^{n} x_k $$

Substituting the derived $\sigma_n^2$ gives the final form of $\mu_n$ above, which no longer depends on $\sigma_n^2$.
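
A sketch of these Bayes estimates for one class is given below; r stands for the array of radii of that class's training samples, and the variable names are illustrative.

```python
import numpy as np

SIGMA2 = 0.25                 # known class variance sigma^2
MU0, SIGMA0_2 = 0.0, 100.0    # prior mean mu_0 and variance sigma_0^2

def bayes_estimate(r):
    """Posterior parameters (mu_n, sigma_n^2) of p(mu | D) for one class's radii r."""
    n = len(r)
    r_bar = np.mean(r)                                      # sample mean
    sigma_n2 = SIGMA2 * SIGMA0_2 / (n * SIGMA0_2 + SIGMA2)  # posterior variance
    mu_n = (n * SIGMA0_2 * r_bar + SIGMA2 * MU0) / (n * SIGMA0_2 + SIGMA2)
    return mu_n, sigma_n2
```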

Next, the desired class-conditional density $p(r|\mathcal{D})$ will be derived based on the a posteriori density $p(\mu|\mathcal{D})$.

Note: although we write $p(r|\mathcal{D})$, its full form is $p(r|\omega_i, \mathcal{D}_i)$; we drop the class subscripts for simplicity.

Using both the a posteriori density $p(\mu_i | \mathcal{D})$ and the conditional density $p(r | \mu_i)$, we arrive at: $$ \begin{align*} p(r|\mathcal{D}) &= \int p(r|\mu_i) \cdot p(\mu_i|\mathcal{D}) \,d\mu_i\\ &= \int \frac{1}{\sqrt{2 \pi} \sigma} \exp \left[ -\frac{1}{2} \left( \frac{r - \mu}{\sigma} \right)^2 \right] \cdot \frac{1}{\sqrt{2 \pi} \sigma_n} \exp \left[ -\frac{1}{2} \left( \frac{\mu - \mu_n}{\sigma_n} \right)^2 \right] \,d\mu\\ \end{align*} $$

Pulling the terms that do not depend on $\mu$ out of the integral and completing the square in $\mu$, we get

$$ \begin{align*} p(r|\mathcal{D}) &= \frac{1}{2 \pi \sigma \sigma_n} \exp \left[ -\frac{1}{2} \frac{(r - \mu_n)^2}{\sigma^2 + \sigma_n^2} \right] \cdot \int \exp \left[ -\frac{1}{2} \frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n^2} \left( \mu - \frac{\sigma_n^2 r + \sigma^2 \mu_n}{\sigma^2 + \sigma_n^2} \right)^2 \right] \,d\mu\\ \end{align*} $$

The remaining integral is a Gaussian integral whose value does not depend on $r$, so $p(r|\mathcal{D})$, viewed as a function of $r$, is a normal density with mean $\mu_n$ and variance $\sigma^2 + \sigma_n^2$.

$$ p(r|\mathcal{D}) \sim \mathcal{N}(\mu_n, \sigma^2 + \sigma_n^2) $$

Classification

We use the estimated class-conditional densities to classify the test data: the class $i$ with the maximum class-conditional density at $r$ is selected.

$$ \hat{y}(r) = \operatorname*{argmax}_{i} p(r|\omega_i, \mathcal{D}_i) $$
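
A sketch of this classification step is shown below; params is a hypothetical list of the per-class estimates $(\mu_n, \sigma_n^2)$ in class order.

```python
import numpy as np
from scipy.stats import norm

SIGMA2 = 0.25  # known class variance sigma^2

def classify_bayes(r, params):
    """Assign r to the class (labels 1..C) maximizing p(r|w_i, D_i) = N(mu_n, sigma^2 + sigma_n^2)."""
    densities = [norm.pdf(r, loc=mu_n, scale=np.sqrt(SIGMA2 + sigma_n2))
                 for mu_n, sigma_n2 in params]
    return int(np.argmax(densities)) + 1
```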

Confusion Matrix

|            | (predicted) 1 | (predicted) 2 | (predicted) 3 |
|------------|---------------|---------------|---------------|
| (actual) 1 | 66            | 4             | 0             |
| (actual) 2 | 10            | 67            | 17            |
| (actual) 3 | 0             | 3             | 43            |

Table 4: Confusion matrix of the model with the estimated parameter $\mu_i$ using Bayesian Estimation, and classifying based on the desired class-conditional densities

According to the confusion matrix shown above, most of the samples are classified correctly and lie on the diagonal of the confusion matrix, while the misclassified samples lie off the diagonal.

Again, no samples of class 1 are misclassified as class 3, and vice versa.

This time, with the parameters $\mu_i$ estimated by Bayesian estimation, the test accuracy reaches $83.81\%$; the test error is then equal to $16.19\%$.