Mini Project 4

Watchanan Chantapakul (wcgzm)


Question 1

Consider Gaussian density models in different dimensions.

(a)

Write a program to find the maximum-likelihood values $\hat{\mu}$ and $\hat{\sigma}^2$. Apply your program individually to each of the three features $x_i$ of category $\omega_1$ in the table above.

Here, the unbiased estimator of the population variance is used.
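A minimal sketch of this step (the data table itself is not reproduced here), assuming the $\omega_1$ samples are stored in a hypothetical NumPy array `w1` of shape $(n, 3)$ with one feature per column:

```python
import numpy as np

def mle_1d(x):
    """Maximum-likelihood estimates for a 1-D Gaussian.

    ddof=1 gives the unbiased estimator of the population variance,
    matching the choice stated above.
    """
    mu_hat = np.mean(x)
    sigma2_hat = np.var(x, ddof=1)
    return mu_hat, sigma2_hat

# Hypothetical usage, one feature (column) at a time:
# for i in range(3):
#     print(mle_1d(w1[:, i]))
```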

(b)

Modify your program to apply to two-dimensional Gaussian data $p(x) \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})$. Apply your program to each of the three possible pairings of two features for $\omega_1$.

samples1 consists of features 1 and 2 for $\omega_1$

samples2 consists of features 1 and 3 for $\omega_1$

samples3 consists of features 2 and 3 for $\omega_1$
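A dimension-agnostic sketch of the estimator, again assuming the hypothetical `w1` array from part (a); the three pairings above are just column selections:

```python
import numpy as np

def gaussian_mle(samples):
    """Mean vector and covariance matrix for d-dimensional Gaussian data.

    samples: (n, d) array. np.cov with rowvar=False treats rows as
    observations and divides by n - 1 (unbiased, as in part (a)).
    """
    mu_hat = samples.mean(axis=0)
    Sigma_hat = np.cov(samples, rowvar=False)
    return mu_hat, Sigma_hat

# Hypothetical pairings (0-based columns of w1):
# samples1 = w1[:, [0, 1]]   # features 1 and 2
# samples2 = w1[:, [0, 2]]   # features 1 and 3
# samples3 = w1[:, [1, 2]]   # features 2 and 3
# mu1, Sigma1 = gaussian_mle(samples1)
```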

(c)

Modify your program to apply to three-dimensional Gaussian data. Apply your program to the full three-dimensional data for $\omega_1$.

We can reuse the function from sub-question (b), since it handles any dimensionality.
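With the `gaussian_mle` sketch above, the 3-D case is a single call on the full (hypothetical) array:

```python
# Full three-dimensional data for omega_1:
# mu_hat, Sigma_hat = gaussian_mle(w1)
```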

(d)

Assume your three-dimensional model is separable, so that $$ \mathbf{\Sigma} = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \sigma_3^2). $$ Write a program to estimate the mean and the diagonal components of $\mathbf{\Sigma}$. Apply your program to the data in $\omega_2$.
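A sketch under the same assumptions, with a hypothetical `w2` array of shape $(n, 3)$ holding the $\omega_2$ samples; the separable model just runs the 1-D estimator per column and assembles a diagonal matrix:

```python
import numpy as np

def separable_gaussian_mle(samples):
    """Mean and diagonal covariance for a separable Gaussian model.

    Equivalent to applying the 1-D estimator of part (a) to each
    feature and setting Sigma = diag(sigma_1^2, ..., sigma_d^2).
    """
    mu_hat = samples.mean(axis=0)
    var_hat = samples.var(axis=0, ddof=1)  # unbiased, as before
    return mu_hat, np.diag(var_hat)

# Hypothetical usage on the omega_2 data:
# mu_hat, Sigma_hat = separable_gaussian_mle(w2)
```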

(e) Compare your results for the mean of each feature $\mu_i$ calculated in the above ways. Explain why they are the same or different.

Since the maximum-likelihood mean is computed independently for each dimension, the per-feature means can be stacked to form the higher-dimensional mean.

Questions a, b, and c are about $\omega_1$. We can structure the mean as follows:

Question a) 1-d feature

$$ \mu_1^{(1)} = -0.0709 $$
$$ \mu_2^{(1)} = -0.6047 $$
$$ \mu_3^{(1)} = -0.9110 $$

Question b) 2-d feature

$$ \vec{\mu}_1^{(2)} = \begin{bmatrix} \mu_1^{(1)}\\ \mu_2^{(1)} \end{bmatrix} = \begin{bmatrix} -0.0709\\ -0.6047 \end{bmatrix} $$
$$ \vec{\mu}_2^{(2)} = \begin{bmatrix} \mu_1^{(1)}\\ \mu_3^{(1)} \end{bmatrix} = \begin{bmatrix} -0.0709\\ -0.9110 \end{bmatrix} $$
$$ \vec{\mu}_3^{(2)} = \begin{bmatrix} \mu_2^{(1)}\\ \mu_3^{(1)} \end{bmatrix} = \begin{bmatrix} -0.6047\\ -0.9110 \end{bmatrix} $$

Question c) 3-d feature

$$ \vec{\mu}^{(3)} = \begin{bmatrix} \mu_1^{(1)}\\ \mu_2^{(1)}\\ \mu_3^{(1)} \end{bmatrix} = \begin{bmatrix} -0.0709\\ -0.6047\\ -0.9110 \end{bmatrix} $$

Question d), however, concerns $\omega_2$, so its mean values differ from those in questions a), b), and c).

(f) Compare your results for the variance of each feature $\sigma_i^2$ calculated in the above ways. Explain why they are the same or different.

Just like the mean, the 1-D variances constitute the diagonal of the higher-dimensional covariance matrices. We can see the pattern as follows:

Question a) 1-d feature

$$ \sigma_1^2 = 1.0069 $$
$$ \sigma_2^2 = 4.6675 $$
$$ \sigma_3^2 = 5.0466 $$

Question b) 2-d feature

$$ \mathbf{\Sigma_1} = \begin{bmatrix} \sigma_1^2 & \sigma_{12}\\ \sigma_{21} & \sigma_2^2\\ \end{bmatrix} = \begin{bmatrix} 1.0069 & 0.6309\\ 0.6309 & 4.6675\\ \end{bmatrix} $$
$$ \mathbf{\Sigma_2} = \begin{bmatrix} \sigma_1^2 & \sigma_{13}\\ \sigma_{31} & \sigma_3^2\\ \end{bmatrix} = \begin{bmatrix} 1.0069 & 0.4379\\ 0.4379 & 5.0466\\ \end{bmatrix} $$
$$ \mathbf{\Sigma_3} = \begin{bmatrix} \sigma_2^2 & \sigma_{23}\\ \sigma_{32} & \sigma_3^2\\ \end{bmatrix} = \begin{bmatrix} 4.6675 & 0.8152\\ 0.8152 & 5.0466\\ \end{bmatrix} $$

Question c) 3-d feature

$$ \mathbf{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13}\\ \sigma_{21} & \sigma_2^2 & \sigma_{23}\\ \sigma_{31} & \sigma_{32} & \sigma_3^2\\ \end{bmatrix} = \begin{bmatrix} 1.0069 & 0.6309 & 0.4379\\ 0.6309 & 4.6675 & 0.8152\\ 0.4379 & 0.8152 & 5.0466\\ \end{bmatrix} $$

Question d)

Since we use the data in $\omega_2$ instead of $\omega_1$, the variances naturally differ from those in questions a), b), and c). In addition, the covariance matrix $\mathbf{\Sigma}$ in question d) is constrained to be diagonal, so it contains only variances, not covariances.

Question 2

Consider a one-dimensional model of a triangular density governed by two scalar parameters: $$ p(x|\mathbf{\theta}) \equiv T(\mu, \delta) = \begin{cases} \frac{\delta - |x - \mu|}{\delta^2} & \text{for $|x - \mu| < \delta$} \\ 0 & \text{otherwise,} \end{cases} $$ where $\mathbf{\theta} = \begin{pmatrix} \mu \\ \delta \end{pmatrix} $. Write a program to calculate the density $p(x|\mathcal{D})$ via Bayesian methods (Eq. 25) and apply it to the $x_2$ feature of category $\omega_2$. Assume your priors on the parameters are uniform throughout the range of the data. Plot your resulting posterior density $p(x|\mathcal{D})$.


I will show two different methods to tackle Question 2.

  1. Estimating one parameter at a time, and then using the estimated parameters to compute the desired class-conditional density $p(x|D)$
  2. Deriving a closed form of the desired class-conditional density $p(x|D)$ with double integrals

Get the feature $x_2$ from $\omega_2$
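The data table is not reproduced here; a sketch assuming a hypothetical $(n, 3)$ NumPy array `w2` of $\omega_2$ samples with features as columns:

```python
import numpy as np

# Hypothetical stand-in: w2 would hold the omega_2 samples from the table.
# x2 = w2[:, 1]   # second column = feature x_2
```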

[METHOD 1]

Triangular Density

Try plotting a triangular density
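A minimal sketch of the density and its plot (the parameter values here are purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def triangular_pdf(x, mu, delta):
    """Triangular density T(mu, delta) as defined in the question."""
    x = np.asarray(x, dtype=float)
    dist = np.abs(x - mu)
    return np.where(dist < delta, (delta - dist) / delta**2, 0.0)

xs = np.linspace(-2.0, 2.0, 500)
plt.plot(xs, triangular_pdf(xs, mu=0.0, delta=1.0))
plt.xlabel('$x$')
plt.ylabel('$p(x|\\theta)$')
plt.title('Triangular density $T(0, 1)$')
plt.show()
```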

Uniform Distribution

Try plotting a uniform distribution
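Similarly, a sketch of the uniform density, plotted over the data range used later for the prior:

```python
import numpy as np
import matplotlib.pyplot as plt

def uniform_pdf(x, a, b):
    """Uniform density U(a, b)."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

xs = np.linspace(-0.1, 0.8, 500)
plt.plot(xs, uniform_pdf(xs, 0.054, 0.69))
plt.xlabel('$x$')
plt.ylabel('$p(x)$')
plt.show()
```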

Bayesian estimation (BE)

Unlike the maximum-likelihood approach, which considers $\theta$ to be fixed, Bayesian estimation treats $\theta$ as a random variable. So we need to find the distribution of this random variable, i.e., the parameter distribution.

We assume that the parametric form of the density $p(x|\mathbf{\theta})$ is known. In this case, $p(x|\mathbf{\theta})$ is a triangular density function. But we want to find the value of the parameter vector $\mathbf{\theta}$.

We can compute $p(x|D)$ from $p(x|\theta)$ and $p(\theta|D)$ as given by:

$$\begin{align} p(x|D) &= \int p(x, \theta | D) \,d\theta\\ &= \int p(x | \theta, D) p(\theta | D) \,d\theta\\ &= \int p(x | \theta) p(\theta | D) \,d\theta\\ \end{align}$$

$p(\theta | D)$ is estimated by using Bayes formula.

$$\begin{align} p(\theta|D) &= \frac{p(D|\theta)p(\theta)}{p(D)}\\ \end{align}$$

The training data set $D = \{x_1, \dots, x_n\}$ has $n$ samples. All samples are i.i.d. (independent and identically distributed). Thus, we can compute $p(D|\theta)$ as follows:

$$\begin{align} p(D|\theta) &= \prod_{k=1}^{n} p(x_k|\theta)\\ \end{align}$$

Substituting $p(D|\theta)$ back into $p(\theta|D)$, we get:

$$\begin{align} p(\theta|D) &= \frac{p(D|\theta)p(\theta)}{p(D)}\\ &= \frac{\prod_{k=1}^{n} p(x_k|\theta)\, p(\theta)}{p(D)}\\ \end{align}$$

where

$$\begin{align} p(D) &= \int p(D|\theta)p(\theta)\,d\theta\\ \end{align}$$

The prior density $p(\theta)$ is given by the question to be a uniform distribution over the range of the data. This means the prior is uninformative.

$$\begin{align} p(\theta) &= \mathcal{U}(0.054, 0.69) =\begin{cases} \frac{1}{0.69-0.054} & \text{for $0.054 \leq \theta \leq 0.69$} \\ 0 & \text{otherwise,} \end{cases} \\ \end{align}$$

We also know $p(x|\theta)$ as it is provided by the question.

The range of the input data is $[0.054, 0.69]$.

1) Estimate $\mu$ by fixing $\delta = 2$

So, our $\theta$ is just $\mu$. Assume that the width of the likelihood (triangular) density is $\delta = 2$.

A Priori Density $p(\theta) = p(\mu)$

We use an uninformative (uniform) prior throughout the range of the data.

Check that the a priori density $p(\theta)$ sums to $1$.

Compute $p(\theta|D) = p(\mu|D)$
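A grid-based sketch of this computation, reusing `triangular_pdf` and the hypothetical `x2` from above; the grid resolution is an arbitrary choice:

```python
import numpy as np

mu_grid = np.linspace(0.054, 0.69, 1000)  # prior support: range of the data
delta_fixed = 2.0

# p(mu|D) is proportional to prod_k p(x_k|mu, delta); the uniform prior is
# constant over the grid, so it only enters through the normalization.
likelihood = np.ones_like(mu_grid)
for xk in x2:
    likelihood *= triangular_pdf(xk, mu_grid, delta_fixed)

d_mu = mu_grid[1] - mu_grid[0]
posterior_mu = likelihood / (likelihood.sum() * d_mu)  # integrates to ~1
```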

Check that the posterior density $p(\theta|D)$ sums to $1$.

Posterior density $p(x|D)$ after estimating $\mu$

Below is the plot of $p(x|D)$ with the estimated $\hat{\mu}$ but the fixed $\delta$.
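A sketch of the numerical integration $p(x|D) = \int p(x|\mu, \delta)\, p(\mu|D)\, d\mu$ over the same grid:

```python
# Approximate the integral over mu by a Riemann sum on mu_grid:
x_grid = np.linspace(0.054, 0.69, 1000)
p_x_given_D = np.array([
    np.sum(triangular_pdf(x, mu_grid, delta_fixed) * posterior_mu) * d_mu
    for x in x_grid
])
x_peak = x_grid[np.argmax(p_x_given_D)]  # should lie near the sample mean
```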

Check that the desired class-conditional density $p(x|D)$ sums to $1$.

We can also verify that the peak of $p(x|D)$ should be around the mean.

The maximum of $p(x|D)$ is at $x = 0.4670$, which is close to the sample mean of $0.4299$.

2) Estimate $\delta$ by fixing $\mu$

So, our $\theta$ is $\delta$ now. Assume that the center of the likelihood (triangular) density is $\mu = 0.49$, based on the maximum of the posterior density $p(\mu|D)$.

A Priori Density $p(\theta) = p(\delta)$

Check that the a priori density $p(\theta)$ sums to $1$.

Compute $p(\theta|D) = p(\delta|D)$
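The same grid recipe applies with $\theta = \delta$ and $\mu$ fixed; the grid for $\delta$ below is a hypothetical choice spanning roughly the data range:

```python
delta_grid = np.linspace(1e-3, 0.69, 1000)  # hypothetical support for delta
mu_fixed = 0.49

lik = np.ones_like(delta_grid)
for xk in x2:
    lik *= triangular_pdf(xk, mu_fixed, delta_grid)

d_delta = delta_grid[1] - delta_grid[0]
posterior_delta = lik / (lik.sum() * d_delta)
delta_hat = delta_grid[np.argmax(posterior_delta)]
```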

Check that the posterior density $p(\theta|D)$ sums to $1$.

Posterior density $p(x|D)$ after estimating $\delta$

Check that the desired class-conditional density $p(x|D)$ sums to $1$.

The estimated parameters

We can also get the estimated parameter vector:

$$ \begin{align} \vec{\theta} &= \begin{pmatrix} \hat{\mu}\\ \hat{\delta} \end{pmatrix} = \begin{pmatrix} 0.4900\\ 0.5440 \end{pmatrix} \end{align} $$

[METHOD 2] Estimating two parameters at the same time

We can compute $p(x|D)$ from $p(x|\theta)$ and $p(\theta|D)$ as given by:

$$\begin{align} p(x|D) &= \int p(x, \theta | D) \,d\theta\\ &= \int p(x | \theta, D) p(\theta | D) \,d\theta\\ &= \int p(x | \theta) p(\theta | D) \,d\theta\\ \end{align}$$

Since the parameter vector $\vec{\theta}$ has two parameters, $\mu$ and $\delta$, then the desired class-conditional density is:

$$ p(x|D) = \int \int p(x | \mu, \delta) p(\mu, \delta | D) \,d\delta \,d\mu\\ $$

The prior $p(\vec{\theta})$ on the parameters is given by the question to be uniform throughout the range of the data. Treating the term $p(\vec{\theta}|D)$, i.e. $p(\mu, \delta|D)$, as just a constant $\alpha$, the equation reduces to:

$$\begin{align} p(x|D) &= \int \int p(x | \mu, \delta) \mathbf{\alpha} \,d\delta\,d\mu\\ &= \alpha \int \int p(x | \mu, \delta) \,d\delta\,d\mu\\ \end{align}$$

The likelihood density $p(x|\mu, \delta)$ is a triangular density. We can substitute its definition into $p(x|D)$ as follows:

$$\begin{align} p(x|D) &= \alpha \int \int_{|x - \mu|}^{\delta_{\mathrm{max}}} \frac{\delta - |x - \mu|}{\delta^2} \,d\delta\,d\mu\\ &= \alpha \int \int_{|x - \mu|}^{\delta_{\mathrm{max}}} \left( \frac{1}{\delta} - \frac{|x - \mu|}{\delta^2} \right) \,d\delta\,d\mu\\ &= \alpha \int \left( \ln{\delta} + \frac{|x - \mu|}{\delta} \right) \Big|_{\delta=|x - \mu|}^{\delta_{\mathrm{max}}} \,d\mu\\ &= \alpha \int \left( \left[ \ln{\delta_{\mathrm{max}}} - \ln{|x - \mu|} \right] + \left[ |x - \mu| \left( \frac{1}{\delta_{\mathrm{max}}} - \frac{1}{|x - \mu|} \right) \right] \right) \,d\mu\\ &= \alpha \int_{x_{\mathrm{min}}}^{x_{\mathrm{max}}} \left( \left[ \ln{\delta_{\mathrm{max}}} - \ln{|x - \mu|} \right] + \left[ |x - \mu| \left( \frac{1}{\delta_{\mathrm{max}}} - \frac{1}{|x - \mu|} \right) \right] \right) \,d\mu\\ \end{align}$$

The sign of $x - \mu$ depends on whether $\mu$ lies below or above $x$, so the absolute value resolves differently on each interval and the integral splits into two parts:

$$\begin{align} p(x|D) &= \alpha \int_{x_{\mathrm{min}}}^{x} \left( \left[ \ln{\delta_{\mathrm{max}}} - \ln{(x - \mu)} \right] + \left[ (x - \mu) \left( \frac{1}{\delta_{\mathrm{max}}} - \frac{1}{(x - \mu)} \right) \right] \right) \,d\mu\\ &+ \alpha \int_{x}^{x_{\mathrm{max}}} \left( \left[ \ln{\delta_{\mathrm{max}}} - \ln{(\mu - x)} \right] + \left[ (\mu - x) \left( \frac{1}{\delta_{\mathrm{max}}} - \frac{1}{(\mu - x)} \right) \right] \right) \,d\mu\\ \end{align}$$

Next, solving the outer integrals:

$$\begin{align} p(x|D) &= \alpha \left[ \mu \left(\ln \delta_{max} - 1 \right) + (x - \mu) \ln (x - \mu) - (x - \mu) - \frac{(x - \mu)^{2}}{2\delta_{max}} \right]_{\mu = x_{min}}^{x}\\ &+ \alpha \left[ \mu \left(\ln \delta_{max} - 1 \right) - (\mu - x) \ln (\mu - x) + (\mu - x) + \frac{(\mu - x)^{2}}{2\delta_{max}} \right]_{\mu = x}^{x_{max}}\\ \end{align}$$

Plugging in the integration bounds (using $\delta_{max} = x_{max} - x_{min}$, the full range of the data) and moving $\alpha$ to the other side, we get: $$\begin{align} \frac{p(x|D)}{\alpha} &= \delta_{max} \left( \ln \delta_{max} - 1 \right) - (x_{max} - x) \ln (x_{max} - x) - (x - x_{min}) \ln (x - x_{min})\\ &+ \delta_{max} + \frac{(x_{max} - x)^{2} + (x - x_{min})^{2}}{2\delta_{max}} \end{align}$$
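A sketch evaluating this closed form on a grid, guarding the $t \ln t$ terms at the endpoints (where $t \ln t \to 0$):

```python
import numpy as np

def p_x_given_D_over_alpha(x, x_min, x_max):
    """Closed-form p(x|D)/alpha derived above, with delta_max = x_max - x_min."""
    d_max = x_max - x_min
    a = x_max - x
    b = x - x_min
    # t*log(t) -> 0 as t -> 0; the inner where keeps log's argument positive.
    tlnt = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)
    return (d_max * (np.log(d_max) - 1.0) - tlnt(a) - tlnt(b)
            + d_max + (a**2 + b**2) / (2.0 * d_max))

xs = np.linspace(0.054, 0.69, 1000)
vals = p_x_given_D_over_alpha(xs, 0.054, 0.69)
x_hat = xs[np.argmax(vals)]  # location of the maximum
```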

Since the desired class-conditional density $p(x|D)$ is known only up to the constant $\alpha$, it does not sum to $1$ over the grid of $x$ values.

We can also find $\hat{x} = 0.3740$, the value of $x$ that maximizes $\frac{p(x|D)}{\alpha}$.