This page reviews the concepts of random variables (rv's) and probability density functions (pdfs). It describes Kullback-Leibler (KL) Divergence and Maximum Likelihood (ML) estimation, as well as multivariate probability densities and the effect of linear transformations on multivariate probability density functions.

## RVs and PDFs

A random variable $X$ can be thought of as an ordinary variable $x$, together with a rule for assigning to every set $\mathcal{D}$ a probability that the variable takes a value in that set, $P(X \in \mathcal{D})$, which in our case will be defined in terms of the probability density function:

$$P(X \in \mathcal{D}) = \int_{\mathcal{D}} p_{X}(x)\,dx$$

That is, the probability that $X \in \mathcal{D}$ is given by the integral of the probability density function over $\mathcal{D}$. So a (continuous) random variable can be thought of as a variable and a pdf. When the values taken by a random variable are discrete, e.g. 0 or 1, then the distribution associated with the random variable is referred to as a probability *mass* function, or pmf. Here we will be concerned primarily with signals taking values in a continuous range. Continuous random variables are often taken to be Gaussian, in which case the associated probability density function is the Gaussian, or Normal, distribution,

$$p_{X}(x) \;=\; \mathcal{N}(x; \mu, \sigma^2) \;\triangleq\; (2\pi)^{-1/2}\sigma^{-1} \exp\!\big(-\tfrac{1}{2}\,\sigma^{-2}(x-\mu)^2\big)$$

![500px](Npdf.png)

The Gaussian density is defined by two parameters: the location, or mean, $\mu$, and the scale, or variance, $\sigma^2$.
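As a quick numerical sketch (not part of the original notes), the defining relation $P(X \in \mathcal{D}) = \int_{\mathcal{D}} p_X(x)\,dx$ can be checked directly for a Gaussian density by integrating the pdf over an interval; the function names here are illustrative:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """N(x; mu, sigma^2) as defined above."""
    sigma = math.sqrt(sigma2)
    return (2 * math.pi) ** -0.5 / sigma * math.exp(-0.5 * (x - mu) ** 2 / sigma2)

def prob_in_interval(a, b, mu=0.0, sigma2=1.0, n=10_000):
    """P(a <= X <= b) by trapezoidal integration of the pdf over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (gaussian_pdf(a, mu, sigma2) + gaussian_pdf(b, mu, sigma2))
    total += sum(gaussian_pdf(a + i * h, mu, sigma2) for i in range(1, n))
    return total * h

# For a standard Gaussian, P(-1 <= X <= 1) is about 0.6827 ("one sigma").
print(round(prob_in_interval(-1.0, 1.0), 4))
```

Integrating over a very wide interval gives approximately 1, as it must for any pdf.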

### Confidence intervals and p values

An example of using the density function to calculate probabilities is the computation of confidence intervals and $p$ values. For example, a symmetric 95% confidence interval about the mean of a Gaussian is a region $\mathcal{D} \triangleq [\mu-R,\,\mu+R]$ such that

$$P(\mu-R\le X \le \mu+R) = \int_{\mu-R}^{\mu+R} p_X(x)\,dx \;=\; 0.95$$

Similarly, a (one-sided) $p$ value or score for an observation $x_0 > 0$, given a probability density function $p_0(x)$, is given by

$$P(X \ge x_0) = \int_{x_0}^{\infty} p_0(x)\,dx \;=\; p$$

This gives the probability that the random variable takes a value in the tail region, defined (after the observation) as the set of values with positive magnitude at least as great as the observed value, given that the probability density is $p_0(x)$. (A two-sided $p$ value concerning the magnitude would include the integral from $-\infty$ to $-x_0$ as well.) A low $p$ value can be used as evidence that the probability density function $p_0(x)$ is not the true probability density function $p_X(x)$, i.e. to reject the null hypothesis that $p_0(x)$ is the probability density function, or model, associated with $X$, on the grounds that if it were the correct model, then an event of very low probability would have occurred.

Note that the value of a pdf at any point is not a probability value. Probabilities for continuous random variables are only associated with *regions*, and are only determined by integrating the pdf.
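These two quantities can be computed for the Gaussian case from the closed-form CDF $\Phi(z) = \tfrac{1}{2}(1 + \mathrm{erf}(z/\sqrt{2}))$; the following sketch (illustrative names, standard Gaussian assumed) finds the 95% interval radius $R$ by bisection and evaluates a one-sided $p$ value:

```python
import math

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def one_sided_p(x0, mu=0.0, sigma=1.0):
    """P(X >= x0): the upper-tail probability under the null density p_0."""
    return 1.0 - gaussian_cdf(x0, mu, sigma)

def interval_radius(coverage=0.95, mu=0.0, sigma=1.0):
    """Find R by bisection so that P(mu - R <= X <= mu + R) = coverage."""
    lo, hi = 0.0, 10.0 * sigma
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        mass = gaussian_cdf(mu + mid, mu, sigma) - gaussian_cdf(mu - mid, mu, sigma)
        lo, hi = (mid, hi) if mass < coverage else (lo, mid)
    return 0.5 * (lo + hi)

print(round(interval_radius(0.95), 3))   # about 1.96 for sigma = 1
print(round(one_sided_p(1.96), 3))       # about 0.025 (one tail of 5%)
```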

### Model Comparison and Posterior Likelihood Evaluation

Related to the idea of $p$ values is testing the "goodness of fit" of a model. The model is defined in terms of a probability distribution, and the fit of the model is defined in terms of the fit of the model probability distribution to the actual probability distribution. Bayes' Rule is often used to calculate the probability that a certain model, say $M_{i}$ from a set of $n$ models, $M_1,\ldots,M_n$, generated an observation $x_0$:

$$P(M_i \mid x_0) \;=\; \frac{p(x_0 \mid M_i)\, P(M_i)}{\sum_{j=1}^{n} p(x_0 \mid M_j)\, P(M_j)}$$
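A minimal sketch of this posterior computation (the two Gaussian models and the observation are hypothetical, chosen only for illustration):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    return (2 * math.pi * sigma2) ** -0.5 * math.exp(-0.5 * (x - mu) ** 2 / sigma2)

def model_posteriors(x0, models, priors):
    """Bayes' Rule: P(M_i | x0) = p(x0 | M_i) P(M_i) / sum_j p(x0 | M_j) P(M_j)."""
    likelihoods = [gaussian_pdf(x0, mu, s2) for (mu, s2) in models]
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Two Gaussian models with equal priors; an observation near the second
# model's mean assigns most of the posterior mass to that model.
models = [(0.0, 1.0), (3.0, 1.0)]
post = model_posteriors(2.5, models, [0.5, 0.5])
print([round(q, 3) for q in post])   # -> [0.047, 0.953]
```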

## Model Fitting – KL Divergence and ML Estimation

A probability density with a set of parameters can be thought of as a class or set of probability density functions, for example the set of all Gaussian densities with $-\infty < \mu < \infty$, $\sigma^2 > 0$. Fitting a model to an observed data set can be thought of as looking for the particular density, in the class of densities defined by the model, that "best fits" the distribution of the data. One way of defining the distance between two densities, as a measure of fit, is the Kullback-Leibler divergence:

$$D\big(p(x)\,\|\,p_0(x)\big) \;\triangleq\; \int p(x) \log\frac{p(x)}{p_0(x)}\,dx$$

where $p_0(x)$ is a model density, and $p(x)$ is the true density. The KL divergence is non-negative, and zero if and only if the densities are the same. Note, however, that it is not symmetric in the densities. If we expand the KL divergence as stated, we get

$$D\big(p(x)\,\|\,p_0(x)\big) \;=\; \int -p(x)\log p_0(x)\,dx \;-\; h(X) \;\ge\; 0$$

where $h(X) \triangleq -\int p(x)\log p(x)\,dx$ is the (differential) entropy of $X$. This shows that the KL divergence can be viewed as the excess entropy, or minimal coding rate, imposed by assuming that the distribution of $X$ is $p_0$.

Writing the KL divergence in this way also shows its relationship to Maximum Likelihood (ML) estimation with independent samples. In this case, the ML problem, assuming a model $M_0$ with parameters $\theta$, for the random variable $X$, is to maximize

$$L(\{x_1,\ldots,x_T\} \mid M_0) = \sum_{t=1}^T \log p_0(x_t\,;\theta)$$

But by the law of large numbers,

$$\frac{1}{T}\sum_{t=1}^T \log p_0(x_t\,;\theta) \;\xrightarrow{T\to\infty}\; \int p(x)\log p_0(x\,;\theta)\,dx$$

So in fact,

$$\arg\max_{\theta}\, \frac{1}{T}\sum_{t=1}^T \log p_0(x_t\,;\theta) \;\to\; \arg\max_{\theta}\, \Big[-D\big(p(x)\,\|\,p_0(x\,;\theta)\big) - h(X)\Big] \;=\; \arg\min_{\theta}\, D\big(p(x)\,\|\,p_0(x\,;\theta)\big)$$

and we see that as $T \to \infty$, ML estimation is equivalent to determining the density, in the class of densities defined by the variation of the parameter $\theta$, that is closest in KL divergence to the true density $p(x)$.
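This can be illustrated numerically for the Gaussian class, where the ML parameters have a closed form (the sample mean and the biased sample variance). The sketch below (function names and the synthetic data are illustrative, not from the original notes) draws samples from a known Gaussian and checks that the ML fit scores at least as well as an arbitrary alternative parameter choice:

```python
import math
import random

def ml_gaussian_fit(samples):
    """ML estimates for a Gaussian model: maximizing sum_t log p_0(x_t; mu, sigma^2)
    over (mu, sigma^2) yields the sample mean and the (biased) sample variance."""
    T = len(samples)
    mu = sum(samples) / T
    sigma2 = sum((x - mu) ** 2 for x in samples) / T
    return mu, sigma2

def avg_log_likelihood(samples, mu, sigma2):
    """(1/T) sum_t log p_0(x_t; mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - 0.5 * (x - mu) ** 2 / sigma2 for x in samples) / len(samples)

random.seed(0)
data = [random.gauss(1.0, 2.0) for _ in range(100_000)]  # true mu = 1, sigma^2 = 4

mu_hat, sigma2_hat = ml_gaussian_fit(data)

# As T grows, the ML parameters approach the KL-minimizing ones (here the truth),
# and the ML fit scores at least as well as any other parameter choice.
assert avg_log_likelihood(data, mu_hat, sigma2_hat) >= avg_log_likelihood(data, 0.0, 1.0)
print(round(mu_hat, 2), round(sigma2_hat, 1))
```

With $T = 10^5$ samples the estimates land close to the true $(\mu, \sigma^2) = (1, 4)$, consistent with the limiting argument above.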

## Multivariate Probability Densities and Independence

As in the univariate case, multivariate RVs are defined by rules for assigning probabilities to the events that the multivariate random variable (i.e. random vector) takes a value in some multidimensional set:

$$P\big((X_1,\ldots,X_n)^T \in \mathcal{D}\big) = \int_{\mathcal{D}} p_{X_1,\ldots,X_n}(\mathbf{x})\,d\mathbf{x}$$

A set of random variables is defined to be independent if its joint probability density function factorizes into the product of the "marginal" densities:

$$p_{X_1,\ldots,X_n}(x_1,\ldots,x_n) \;=\; \prod_{i=1}^n p_{X_i}(x_i)$$

In the case of a random vector with independent components, the probability that the vector takes a value in a hypercubic set is simply the product of the probabilities that the individual components lie in the regions defining the respective sides of the hypercube:

$$P\big((X_1,\ldots,X_n)^T \in \mathcal{D}_1 \times \cdots \times \mathcal{D}_n\big) = \prod_{i=1}^n P(X_i \in \mathcal{D}_i)$$
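As a small numerical sketch (the hypercube sides are illustrative), for a vector of independent standard Gaussian components the hypercube probability reduces to a product of one-dimensional interval probabilities, each computable from the Gaussian CDF:

```python
import math

def interval_prob(a, b, mu=0.0, sigma=1.0):
    """P(a <= X_i <= b) for X_i ~ N(mu, sigma^2), via the Gaussian CDF."""
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

# For independent components, the probability of landing in a hypercube
# D_1 x ... x D_n factorizes into the product of the per-side probabilities.
sides = [(-1.0, 1.0), (0.0, 2.0), (-0.5, 0.5)]
p_cube = math.prod(interval_prob(a, b) for a, b in sides)
print(round(p_cube, 4))
```

Without independence no such factorization holds, and the joint density must be integrated over the hypercube directly.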

## Probability Densities of Linear Transformations of RVs

If $a \ne 0$ is a fixed real number, and $S$ is a random variable with pdf $p_S(s)$, then the random variable defined by $X = aS$ has pdf

$$p_{X}(x) = \frac{1}{|a|}\, p_S\!\left(\frac{x}{a}\right) = |a|^{-1} p_S(a^{-1}x)$$

If $\mathbf{A}$ is an invertible $n \times n$ matrix, and $\mathbf{s}$ is a random vector with pdf $p_{\mathbf{s}}(\mathbf{s})$, then the probability density of the random vector $\mathbf{x}$ produced by the linear transformation $\mathbf{x} = \mathbf{A}\mathbf{s}$ is given by the formula

$$p_{\mathbf{x}}(\mathbf{x}) = |\det \mathbf{A}|^{-1}\, p_{\mathbf{s}}(\mathbf{A}^{-1}\mathbf{x})$$

If $\mathbf{A}$ is not square, but rather is "undercomplete", then PCA analysis can readily identify an orthonormal basis for the $r$-dimensional subspace in which the data resides, and subsequent processing, e.g. ICA, can generally be carried out in the reduced $r$-dimensional space with a square $r \times r$ linear transformation. If there is additional non-negligible noise in the undercomplete or complete (square) case,

$$\mathbf{x} = \mathbf{A}\mathbf{s} + \mathbf{\nu}$$

with $\mathbf{\Sigma} = E\{\mathbf{\nu}\mathbf{\nu}^T\}$, then the problem essentially becomes an "overcomplete" one with

$$\tilde{\mathbf{A}} \triangleq \big[\mathbf{A}\;\;\mathbf{\Sigma}^{1/2}\big]$$

If the matrix $\mathbf{A}$ is "overcomplete" with $n > m$, then the pdf of $\mathbf{x}$ cannot generally be determined in closed form unless $\mathbf{s}$ is Gaussian. We will consider the overcomplete case in another section.
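The scalar change-of-variables formula can be sanity-checked in the Gaussian case, where the answer is known independently: if $S \sim \mathcal{N}(0,1)$, then $X = aS \sim \mathcal{N}(0, a^2)$. A minimal sketch (illustrative names, standard Gaussian source assumed):

```python
import math

def p_s(s):
    """Source density: standard Gaussian."""
    return (2 * math.pi) ** -0.5 * math.exp(-0.5 * s * s)

def p_x(x, a):
    """Density of X = a*S via the change-of-variables formula |a|^{-1} p_s(x/a)."""
    return p_s(x / a) / abs(a)

# For a Gaussian source, X = a*S is N(0, a^2); the formula reproduces that pdf,
# including for negative a (hence the absolute value).
a = -2.0
for x in (-1.0, 0.0, 3.0):
    direct = (2 * math.pi * a * a) ** -0.5 * math.exp(-0.5 * x * x / (a * a))
    assert abs(p_x(x, a) - direct) < 1e-12
print("change-of-variables formula verified")
```

The matrix version works the same way, with $|\det \mathbf{A}|$ playing the role of $|a|$ as the volume-scaling factor of the transformation.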