{
"cells": [
{
"cell_type": "markdown",
"id": "7fe21a3f",
"metadata": {},
"source": [
"# Discrete Data and the Multinomial Distribution\n",
"\n",
"- **[1]** (##) We consider IID data $D = \\{x_1,x_2,\\ldots,x_N\\}$ obtained from tossing a $K$-sided die. We use a *binary selection variable*\n",
"$$x_{nk} \\equiv \\begin{cases} 1 & \\text{if $x_n$ lands on $k$th face}\\\\\n",
" 0 & \\text{otherwise}\n",
"\\end{cases}\n",
"$$\n",
"with probabilities $p(x_{nk} = 1)=\\theta_k$. \n",
" (a) Write down the probability for the $n$th observation $p(x_n|\\theta)$ and derive the log-likelihood $\\log p(D|\\theta)$. \n",
" (b) Derive the maximum likelihood estimate for $\\theta$.\n",
"> See lecture notes (on class homepage). \n",
" (a) $p(x_n|\\theta) = \\prod_k \\theta_k^{x_{nk}} \\quad \\text{subject to} \\quad \\sum_k \\theta_k = 1$.\n",
"$$\\ell(\\theta) = \\sum_k m_k \\log \\theta_k$$\n",
"where $m_k = \\sum_n x_{nk}$. \n",
" (b) $\\hat \\theta = \\frac{m_k}{N}$, the *sample proportion*.\n",
"\n",
"- **[2]** (#) In the notebook, Laplace's generalized rule of succession (the probability that we throw the $k$th face at the next toss) was derived as \n",
"$$\\begin{align*}\n",
"p(x_{\\bullet,k}=1|D) = \\frac{m_k + \\alpha_k }{ N+ \\sum_k \\alpha_k}\n",
"\\end{align*}$$\n",
"Provide an interpretation of the variables $m_k,N,\\alpha_k,\\sum_k\\alpha_k$.\n",
"\n",
"> $m_k$ is the total number of occurances that we threw $k$ eyes, $\\alpha_k$ is the prior pseudo counts representing the number of observations in the $k$th that we assume to have seen already. $\\sum_k m_k = N $ is the total number of rolls and $\\sum_k \\alpha_k $ is the total number of prior pseudo rolls.\n",
"\n",
"\n",
"\n",
"- **[3]** (##) Show that Laplace's generalized rule of succession can be worked out to a prediction that is composed of a prior prediction and data-based correction term. \n",
"\n",
"> $$\\begin{align*}\n",
"p(x_{\\bullet,k}=1|D) &= \\frac{m_k + \\alpha_k }{ N+ \\sum_k \\alpha_k} \\\\\n",
"&= \\frac{N}{N+\\sum_k \\alpha_k} \\frac{m_k}{N} + \\frac{\\sum_k \\alpha_k}{N+\\sum_k \\alpha_k}\\frac{\\alpha_k}{\\sum_k\\alpha_k} \\\\\n",
"&= \\underbrace{\\frac{\\alpha_k}{\\sum_k\\alpha_k}}_{\\text{prior prediction}} + \\underbrace{\\frac{N}{N+\\sum_k \\alpha_k} \\cdot \\underbrace{\\left(\\frac{m_k}{N} - \\frac{\\alpha_k}{\\sum_k\\alpha_k}\\right)}_{\\text{prediction error}}}_{\\text{data-based correction}}\n",
"\\end{align*}$$\n",
"\n",
"- **[4]** (#) Verify that \n",
" (a) the categorial distribution is a special case of the multinomial for $N=1$. \n",
" (b) the Bernoulli is a special case of the categorial distribution for $K=2$. \n",
" (c) the binomial is a special case of the multinomial for $K=2$.\n",
"\n",
"> (a) The probability mass function of a multinomial distribution is $p(D_m|\\mu) =\\frac{N!}{m_1! m_2!\\ldots m_K!} \\,\\prod_k \\mu_k^{m_k}$ over the data frequencies $D_m=\\{m_1,\\ldots,m_K\\}$ with the constraint that $\\sum_k \\mu_k = 1$ and $\\sum_k m_k=N$. Setting $N=1$ we see that $p(D_m|\\mu) \\propto \\prod_k \\mu_k^{m_k}$ with $\\sum_k m_k=1$, making the sample space one-hot coded given by the categorical distribution. \n",
"> (b) When $K=2$, the constraint for the categorical distribution takes the form $m_1=1-m_2$ leading to $p(D_m|\\mu) \\propto \\mu_1^{m_1}(1-\\mu_1)^{1-m_1}$ which is associated with the Bernoulli distribution. \n",
"> (c) Plugging $K=2$ into the multinomial distribution leads to $p(D_m|\\mu) =\\frac{N!}{m_1! m_2!}\\mu_1^{m_1}\\left(\\mu_2^{m_2}\\right)$ with the constraints $m_1+m_2=N$ and $\\mu_1+\\mu_2=1$. Then plugging the constraints back in we obtain $p(D_m|\\mu) = \\frac{N!}{m_1! (N-m1)!}\\mu_1^{m_1}\\left(1-\\mu_1\\right)^{N-m_1}$ as the binomial distribution.\n",
"\n",
"\n",
"- **[5]** (###) Determine the mean, variance and mode of a Beta distribution.\n",
"> The Beta distribution is given by $\\frac{\\Gamma(\\alpha+\\beta)}{\\Gamma(\\alpha)\\Gamma(\\beta)}x^{\\alpha-1}(1-x)^{\\beta-1}$. Define $\\frac{\\Gamma(\\alpha)\\Gamma(\\beta)}{\\Gamma(\\alpha+\\beta)} \\triangleq \\mathcal{B}(\\alpha,\\beta)$, which is the normalization constant. Notice that this definition makes $\\int_0^1 x^{\\alpha-1}(1-x)^{\\beta-1}\\mathrm{d}x = \\mathcal{B}(\\alpha,\\beta)$. Together with $\\Gamma(x+1) = x\\Gamma(x)$ we can use these identities to obtain the requested statistics:\n",
"$$\\begin{align*}\n",
"\\mathbb{E}[x] &= \\frac{1}{\\mathcal{B}(\\alpha,\\beta)}\\int_0^1 x x^{\\alpha-1}(1-x)^{\\beta-1}\\mathrm{d}x \\\\\n",
"&= \\frac{1}{\\mathcal{B}(\\alpha,\\beta)}\\int_0^1x^{\\alpha}(1-x)^{\\beta-1}\\mathrm{d}x \\\\\n",
"&= \\frac{\\mathcal{B}(\\alpha+1,\\beta)}{\\mathcal{B}(\\alpha,\\beta)} \\\\\n",
"&= \\frac{\\Gamma(\\alpha+1)\\Gamma(\\alpha+\\beta)}{\\Gamma(\\alpha)\\Gamma(\\alpha+\\beta+1)}\\\\\n",
"&= \\frac{\\alpha \\Gamma(\\alpha)\\Gamma(\\alpha+\\beta) }{(\\alpha+\\beta)\\Gamma(\\alpha)\\Gamma(\\alpha+\\beta)}\\\\\n",
"&= \\frac{\\alpha}{\\alpha+\\beta} \\\\\n",
"\\mathbb{V}[x] &= \\mathbb{E}[x^2] - \\mathbb{E}[x]^2 \\\\\n",
"&= \\frac{1}{\\mathcal{B}(\\alpha,\\beta)}\\int_0^1 x^2 x^{\\alpha-1}(1-x)^{\\beta-1}\\mathrm{d}x - \\frac{\\alpha^2}{(\\alpha+\\beta)^2} \\\\\n",
"&= \\frac{\\mathcal{B}(\\alpha+2,\\beta)}{\\mathcal{B}(\\alpha,\\beta)} - \\frac{\\alpha^2}{(\\alpha+\\beta)^2} \\\\\n",
"&= \\frac{\\alpha}{\\alpha+\\beta}\\left(\\frac{\\alpha+1}{\\alpha+\\beta+1} - \\frac{\\alpha}{\\alpha+\\beta}\\right) \\\\\n",
"&= \\frac{\\alpha\\beta}{(\\alpha+\\beta)^2(\\alpha+\\beta+1)}\n",
"\\end{align*}$$\n",
"If $\\alpha=\\beta$, then the Beta distribution is identical to a uniform distribution, which doesn't have a unique mode. If one of the parameters is $<1$, then the mode is at one of the edges. When both parameters are $>1$, then the mode is well-defined and is within the interior of the distribution. Assuming the parameters are $>1$ we can evaluate the mode as\n",
"$$\\begin{align*}\n",
"\\nabla_x x^{\\alpha-1}(1-x)^{\\beta-1} &= 0\\\\\n",
"\\frac{\\alpha-1}{\\beta-1} &= \\frac{x}{1-x} \\\\\n",
"\\alpha-1 &= x(\\alpha+\\beta-2) \\\\\n",
"\\Rightarrow x_{mode} &= \\frac{\\alpha-1}{\\alpha+\\beta-2}.\n",
"\\end{align*}$$\n",
"\n",
"- **[6]** (###) Consider a data set of binary variables $D=\\{x_1,x_2,\\ldots,x_N\\}$ with a Bernoulli distribution $\\mathrm{Ber}(x_k|\\mu)$ as data generating distribution and a Beta prior for $\\mu$. Assume that you make $n$ observations with $x=1$ and $N-n$ observations with $x=0$. Now consider a new draw $x_\\bullet$. We are interested in computing $p(x_\\bullet|D)$. Show that the mean value for $p(x_\\bullet|D)$ lies in between the prior mean and Maximum Likelihood estimate.\n",
"\n",
"> In the lectures we have seen that $p(x_\\bullet =1|D) = \\frac{a+n}{a+b+N}$, where $a$ and $b$ are parameters of the Beta prior. The ML estimate is $\\frac{n}{N}$ and the prior mean is $\\frac{a}{a+b}$. To show that the prediction lies in between ML and prior estimate, we will try to write the prediction as a convex combination of the latter two. That is we want to solve for $\\lambda$ \n",
"$$\\begin{align*}\n",
"(1-\\lambda) \\frac{n}{N} + \\lambda\\frac{a}{a+b} &= \\frac{a+n}{a+b+N} \\\\\n",
"\\lambda &= \\frac{1}{1+\\frac{N}{a+b}} \n",
"\\end{align*}$$\n",
"Since $a,b$ and $N$ are positive, it follows that $0<\\lambda <1$. This means the prediction is a convex combination of prior and ML estimates and thus lies in between the two.\n",
"\n",
"- **[7]** Consider a data set $D = \\{(x_1,y_1), (x_2,y_2),\\dots,(x_N,y_N)\\}$ with 1-of-$K$ notation for the discrete classes, i.e.,\n",
"$$\\begin{equation*} y_{nk} = \\begin{cases} 1 & \\text{if $y_n$ in $k$th class} \\\\\n",
" 0 & \\text{otherwise} \n",
" \\end{cases}\n",
"\\end{equation*}$$\n",
"together with class-conditional distribution $p(x_n| y_{nk}=1,\\theta) = \\mathcal{N}(x_n|\\mu_k,\\Sigma)$ and multinomial prior $p(y_{nk}=1) = \\pi_k$. \n",
" (a) Proof that the joint log-likelihood is given by\n",
"$$\\begin{equation*}\n",
"\\log p(D|\\theta) = \\sum_{n,k} y_{nk} \\log \\mathcal{N}(x_n|\\mu_k,\\Sigma) + \\sum_{n,k} y_{nk} \\log \\pi_k\n",
"\\end{equation*}$$\n",
" > $$\\begin{align*}\n",
" \\log p(D|\\theta) &= \\sum_n \\log \\prod_k p(x_n,y_{nk}|\\theta)^{y_{nk}} \\\\\n",
" &= \\sum_{n,k} y_{nk} \\log p(x_n,y_{nk}|\\theta)\\\\\n",
" &= \\sum_{n,k} y_{nk} \\log \\mathcal{N}(x_n|\\mu_k,\\Sigma) + \\sum_{n,k} y_{nk} \\log \\pi_k\n",
" \\end{align*}$$ \n",
" \n",
" (b) Show now that the MLE of the *class-conditional* mean is given by\n",
"$$\\begin{equation*}\n",
" \\hat \\mu_k = \\frac{\\sum_n y_{nk} x_n}{\\sum_n y_{nk}} \n",
"\\end{equation*}\n",
"$$\n",
"\n",
"\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8e5ede6",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.6.3",
"language": "julia",
"name": "julia-1.6"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}