This page reviews the concepts of random variables (RVs) and probability density functions (pdfs). It describes Kullback-Leibler (KL) divergence and Maximum Likelihood (ML) estimation, as well as multivariate probability densities and the effect of linear transformations on multivariate probability density functions.

## Random variables and probability density functions

A random variable $X$ can be thought of as an ordinary variable $x$, together with a rule for assigning to every set $B$ a probability $P(X \in B)$ that the variable takes a value in that set, which in our case will be defined in terms of the probability density function $p(x)$:

$$P(X \in B) = \int_B p(x)\,dx$$

That is, the probability that $X \in B$ is given by the integral of the probability density function over $B$. So a (continuous) random variable can be thought of as a variable and a pdf.

When the values taken by a random variable are discrete, e.g. 0 or 1, the distribution associated with the random variable is referred to as a probability *mass* function, or pmf. Here we will be concerned primarily with signals taking values in a continuous range.

Continuous random variables are often taken to be Gaussian, in which case the associated probability density function is the Gaussian, or Normal, distribution,

$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

![500px](Npdf.png)

The Gaussian density is defined by two parameters: the location, or mean, $\mu$, and the scale, or variance, $\sigma^2$.

### Confidence intervals and p-values
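As a minimal sketch of how a probability is obtained from a density, the following Python snippet integrates a Gaussian pdf numerically over an interval $B = [a, b]$ and compares the result with the difference of cumulative distribution values. The parameter values, the interval, and the helper name `interval_probability` are illustrative assumptions, not part of the original text.

```python
from statistics import NormalDist


def interval_probability(pdf, a, b, steps=10_000):
    """Approximate P(a <= X <= b) by midpoint-rule integration of pdf."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width


# Hypothetical example: X ~ N(mu=1, sigma=2) and the set B = [0, 3].
X = NormalDist(mu=1.0, sigma=2.0)
a, b = 0.0, 3.0

p_numeric = interval_probability(X.pdf, a, b)
p_exact = X.cdf(b) - X.cdf(a)  # the same integral, via the cumulative distribution

print(f"numerical integral of the pdf: {p_numeric:.6f}")
print(f"cdf difference:                {p_exact:.6f}")
```

The two numbers should agree to several decimal places, since both evaluate the same integral of the density over $[a, b]$.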

An example of using the density function to calculate probabilities is the computation of confidence intervals and $p$-values. A confidence interval $[a, b]$ at level $1 - \alpha$ is chosen such that,

$$P(a \le X \le b) = \int_a^b p(x)\,dx = 1 - \alpha$$

Similarly, a (one-sided) $p$-value or score for an observation $x_0$, given a probability density function $p(x)$, is given by,

$$P(X \ge x_0) = \int_{x_0}^{\infty} p(x)\,dx$$

This gives the probability that the random variable takes a value in the tail region, defined (after the observation) as the set of values with positive magnitude at least as great as the observed value, given that the probability density is $p(x)$. (A two-sided $p$-value concerning the magnitude $|x_0|$ would include the integral from $-\infty$ to $-|x_0|$ as well.) A low $p$-value can be used as evidence that the probability density function $p(x)$ is not the true probability density function $p^*(x)$, i.e. to reject the null hypothesis that $p(x)$ is the probability density function, or model, associated with $X$, on the grounds that if it were the correct model, then an event of very low probability would have occurred.

Note that the value of a pdf at any point is not a probability value. Probabilities for continuous random variables are only associated with *regions*, and are only determined by integrating the pdf.
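The quantities in this subsection can be sketched in a few lines of Python using the standard library's `statistics.NormalDist`. The null model (a standard Gaussian), the level $\alpha = 0.05$, and the observation `x0 = 2.5` are assumptions chosen only for illustration.

```python
from statistics import NormalDist

model = NormalDist(mu=0.0, sigma=1.0)  # hypothetical null model p(x)

# Central interval [lo, hi] at level 1 - alpha: the pdf integrates to 0.95 over it.
alpha = 0.05
lo = model.inv_cdf(alpha / 2)
hi = model.inv_cdf(1 - alpha / 2)
coverage = model.cdf(hi) - model.cdf(lo)

# One-sided p-value: integral of p(x) from x0 to +infinity.
x0 = 2.5
p_one_sided = 1.0 - model.cdf(x0)

# Two-sided p-value: also include the integral from -infinity to -|x0|.
p_two_sided = (1.0 - model.cdf(abs(x0))) + model.cdf(-abs(x0))

# A density value is not a probability: for a narrow Gaussian it exceeds 1.
narrow = NormalDist(mu=0.0, sigma=0.1)
density_at_mean = narrow.pdf(0.0)

print(f"interval: [{lo:.3f}, {hi:.3f}], coverage = {coverage:.3f}")
print(f"one-sided p-value: {p_one_sided:.5f}")
print(f"two-sided p-value: {p_two_sided:.5f}")
print(f"pdf value at the mean of N(0, 0.1^2): {density_at_mean:.3f}")
```

Because the Gaussian is symmetric about its mean, the two-sided value here is twice the one-sided value; for an asymmetric density the two tail integrals would differ.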

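## Kullback-Leibler divergence and maximum likelihood

A probability density $p(x; \theta)$ with a set of parameters $\theta$ can be thought of as a class or set of probability density functions, for example the set of all Gaussian densities with $\theta = (\mu, \sigma^2)$, $\mu \in \mathbb{R}$, $\sigma^2 > 0$. Fitting a model to an observed data set can be thought of as looking for the particular density, in the class of densities defined by the model, that "best fits" the distribution of the data.

One way of defining the distance between two densities, as a measure of fit, is the Kullback-Leibler divergence:

$$D(p^* \,\|\, p) = \int p^*(x) \log \frac{p^*(x)}{p(x)}\,dx$$

where $p$ is a model density, and $p^*$ is the true density. The KL divergence is non-negative, and zero if and only if the two densities are the same. Note, however, that it is not symmetric in the densities.

If we write out the KL divergence as stated, we get,

$$D(p^* \,\|\, p) = -\int p^*(x) \log p(x)\,dx - H(p^*)$$

where $H(p^*) = -\int p^*(x) \log p^*(x)\,dx$ is the entropy of $p^*$. This shows that the KL divergence can be viewed as the excess entropy, i.e. the coding rate in excess of the minimal rate $H(p^*)$, imposed by assuming that the distribution of $x$ is $p$ rather than $p^*$.

Writing the KL divergence in this way also shows its relationship to Maximum Likelihood (ML) estimation with independent samples. In this case, the ML problem, assuming a model $p(x; \theta)$ with parameters $\theta$ for the random variable $X$, and given independent samples $x_1, \ldots, x_N$, is to maximize the log likelihood:

$$L(\theta) = \sum_{n=1}^{N} \log p(x_n; \theta)$$

But by the law of large numbers, we have,

$$\frac{1}{N} \sum_{n=1}^{N} \log p(x_n; \theta) \;\to\; \int p^*(x) \log p(x; \theta)\,dx \quad \text{as } N \to \infty$$

So in fact,

$$\frac{1}{N} L(\theta) \;\to\; -D\big(p^* \,\|\, p(\cdot\,; \theta)\big) - H(p^*)$$

and since $H(p^*)$ does not depend on $\theta$, we see that as $N \to \infty$, ML estimation is equivalent to determining the density closest to the true density in KL divergence, within the class of densities defined by the variation of the parameter $\theta$.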
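The relationship between the KL divergence and the average log likelihood can be checked numerically. The sketch below assumes a known true density $p^*$ (standard Gaussian) and a deliberately mismatched Gaussian model $p$; both choices, the integration range, and the helper name `integrate` are illustrative assumptions. It computes $D(p^* \| p)$ and $H(p^*)$ by numerical integration, then compares $-(D + H)$ with the average log likelihood of the model on samples drawn from $p^*$.

```python
import math
import random
from statistics import NormalDist

true_density = NormalDist(mu=0.0, sigma=1.0)  # p*(x), assumed known here
model = NormalDist(mu=0.5, sigma=1.5)         # p(x), a mismatched model


def integrate(f, a=-12.0, b=12.0, steps=20_000):
    """Midpoint-rule integral of f over [a, b]; the tails beyond are negligible."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# D(p* || p) = integral of p*(x) log(p*(x) / p(x)) dx
kl = integrate(
    lambda x: true_density.pdf(x) * math.log(true_density.pdf(x) / model.pdf(x))
)

# H(p*) = -integral of p*(x) log p*(x) dx
entropy = integrate(
    lambda x: -true_density.pdf(x) * math.log(true_density.pdf(x))
)

# Average log likelihood of the model on N independent samples from p*.
rng = random.Random(0)
N = 100_000
samples = [rng.gauss(true_density.mean, true_density.stdev) for _ in range(N)]
avg_loglik = sum(math.log(model.pdf(x)) for x in samples) / N

print(f"D(p*||p)               = {kl:.4f}")
print(f"-(D(p*||p) + H(p*))    = {-(kl + entropy):.4f}")
print(f"average log likelihood = {avg_loglik:.4f}")
```

The last two printed values should be close for large `N`, which is the law-of-large-numbers statement above; changing the model parameters toward those of the true density drives the divergence toward zero.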

## Multivariate probability densities

As in the univariate case, multivariate RVs are defined by rules for assigning probabilities to the events that the multivariate random variable (i.e. random vector) takes a value in some multidimensional set. A set of random variables $X_1, \ldots, X_n$ is defined to be independent if its joint probability density function factorizes into the product of the "marginal" densities:

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_i(x_i)$$

In the case of a random vector with independent components, the probability that the vector takes a value in a hypercubic set is simply the product of the probabilities that the individual components lie in the region defining the respective side of the hypercube:

$$P(a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n) = \prod_{i=1}^{n} \int_{a_i}^{b_i} p_i(x_i)\,dx_i$$
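The factorization can be illustrated with two independent Gaussian components. In this sketch the marginal parameters, the box limits, and the helper names `joint_pdf` and `box_probability` are assumptions made for the example. The joint density is integrated numerically over a rectangular box and compared with the product of the two one-dimensional interval probabilities.

```python
from statistics import NormalDist

# Hypothetical independent components X1 ~ N(0, 1) and X2 ~ N(1, 2^2).
X1 = NormalDist(mu=0.0, sigma=1.0)
X2 = NormalDist(mu=1.0, sigma=2.0)


def joint_pdf(x1, x2):
    """Joint density of independent components: the product of the marginals."""
    return X1.pdf(x1) * X2.pdf(x2)


def box_probability(a1, b1, a2, b2, steps=200):
    """Midpoint-rule integral of the joint pdf over [a1, b1] x [a2, b2]."""
    h1 = (b1 - a1) / steps
    h2 = (b2 - a2) / steps
    total = 0.0
    for i in range(steps):
        x1 = a1 + (i + 0.5) * h1
        for j in range(steps):
            x2 = a2 + (j + 0.5) * h2
            total += joint_pdf(x1, x2)
    return total * h1 * h2


a1, b1 = -1.0, 0.5  # side of the box along x1
a2, b2 = 0.0, 3.0   # side of the box along x2

p_joint = box_probability(a1, b1, a2, b2)
p_product = (X1.cdf(b1) - X1.cdf(a1)) * (X2.cdf(b2) - X2.cdf(a2))

print(f"2-D integral of the joint pdf: {p_joint:.6f}")
print(f"product of side probabilities: {p_product:.6f}")
```

For dependent components the joint density would not be a product of marginals, and the two-dimensional integral would generally differ from the product of the side probabilities.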

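## Probability densities of linear transformations

If $a$ is a fixed nonzero real number, and $S$ is a random variable with pdf $p_s(s)$, then a random variable $X$ defined by $x = a s$ has pdf,

$$p_x(x) = \frac{1}{|a|}\, p_s\!\left(\frac{x}{a}\right)$$

If $\mathbf{A}$ is an invertible matrix, and $\mathbf{s}$ is a random vector with pdf $p_{\mathbf{s}}(\mathbf{s})$, then the probability density of the random vector $\mathbf{x}$, produced by the linear transformation $\mathbf{x} = \mathbf{A}\mathbf{s}$, is given by the formula,

$$p_{\mathbf{x}}(\mathbf{x}) = |\det \mathbf{A}|^{-1}\, p_{\mathbf{s}}(\mathbf{A}^{-1}\mathbf{x})$$

If $\mathbf{A}$ is not square, but rather is "undercomplete" ($n \times r$ with $r < n$), then PCA can readily identify an orthonormal basis for the $r$-dimensional subspace in which the data resides, and subsequent processing, e.g. ICA, can generally be carried out in the reduced $r$-dimensional space with a square $r \times r$ linear transformation.

If there is additional non-negligible noise in the undercomplete or complete (square) case, with $\mathbf{x} = \mathbf{A}\mathbf{s} + \boldsymbol{\nu}$, then the problem essentially becomes an "overcomplete" one, with the noise components acting as additional sources.

If the matrix $\mathbf{A}$ is "overcomplete" ($n \times m$ with $m > n$), then the pdf of $\mathbf{x}$ cannot generally be determined in closed form unless $\mathbf{s}$ is Gaussian. We will consider the overcomplete case in another section.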
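The two change-of-variables formulas for the scalar and the invertible square case can be sketched in Python as follows. The source density (independent standard Gaussian components), the scale `a`, the $2 \times 2$ matrix `A`, and all function names are assumptions chosen for illustration. The scalar result is compared with the known density of a scaled Gaussian, and the vector result is checked by confirming that the transformed density integrates to approximately one.

```python
from statistics import NormalDist

p_s = NormalDist(mu=0.0, sigma=1.0)  # assumed source density for each component

# Scalar case: x = a * s, so p_x(x) = p_s(x / a) / |a|.
a = -3.0


def p_x_scalar(x):
    return p_s.pdf(x / a) / abs(a)


# Vector case: x = A s with an invertible 2 x 2 matrix A (hypothetical values).
A = [[2.0, 1.0],
     [0.5, 1.5]]
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det_A, -A[0][1] / det_A],
         [-A[1][0] / det_A, A[0][0] / det_A]]


def p_s_vec(s1, s2):
    """Joint source density with independent components."""
    return p_s.pdf(s1) * p_s.pdf(s2)


def p_x_vec(x1, x2):
    """p_x(x) = p_s(A^{-1} x) / |det A|."""
    s1 = A_inv[0][0] * x1 + A_inv[0][1] * x2
    s2 = A_inv[1][0] * x1 + A_inv[1][1] * x2
    return p_s_vec(s1, s2) / abs(det_A)


def integrate_2d(f, lo=-12.0, hi=12.0, steps=300):
    """Midpoint-rule integral of f over the square [lo, hi] x [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        for j in range(steps):
            total += f(lo + (i + 0.5) * h, lo + (j + 0.5) * h)
    return total * h * h


# A scaled standard Gaussian is Gaussian with standard deviation |a|.
scalar_reference = NormalDist(mu=0.0, sigma=abs(a)).pdf(1.7)
total_mass = integrate_2d(p_x_vec)

print(f"p_x(1.7) from the formula: {p_x_scalar(1.7):.6f}")
print(f"p_x(1.7) reference value:  {scalar_reference:.6f}")
print(f"integral of p_x over the plane (approx.): {total_mass:.6f}")
```

The factor $|\det \mathbf{A}|^{-1}$ is what keeps the transformed density normalized; omitting it would scale the integral by $|\det \mathbf{A}|$.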