## **Basis Vectors**

Originally we are given the recorded data in the channel space, say with $n$ channels and $T$ samples (i.e. time points, frames). The data can be thought of as a collection of $T$ vectors in $n$-dimensional space, each of which in the case of EEG is a snapshot of the electric potential at the electrodes (relative to a given reference) at a particular time point. The data can also be thought of as a collection of $n$ time series, or channel vectors, in $T$-dimensional space; or as a collection of spatiotemporal data segments (each e.g. an $n \times M$ matrix) in $nM$-dimensional space. As we are concerned here with instantaneous ICA, we'll primarily think of the data as a set of $T$ vectors in $n$-dimensional space, disregarding the temporal order of the vectors.

ICA is a type of linear representation of data in terms of a set of "basis" vectors. Since we're working here in channel space, the vectors we're interested in will be in $\mathbb{R}^n$. To illustrate, in the following we'll use a three-dimensional example, say recorded using three channels. The data then is given to us in three-dimensional vector space.

![700px](R3_2.png)

Each of these data points is a vector in three-dimensional space. In general, any point in $n$-dimensional space can be represented as a linear combination of any $n$ vectors that are linearly independent. For example, let's take the vectors $\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3$, collected as the columns of the matrix $\mathbf{A} = [\mathbf{a}_1\ \mathbf{a}_2\ \mathbf{a}_3]$. Linear independence means that no vector in the set can be formed as a linear combination of the others, i.e. each vector branches out into a new dimension, and they do not all lie in a zero-volume subspace of $\mathbb{R}^3$. Equivalently, there is no nonzero vector $\mathbf{c}$ that can multiply $\mathbf{A}$ to produce the zero vector:

$$\mathbf{A}\mathbf{c} = \mathbf{0} \;\Rightarrow\; \mathbf{c} = \mathbf{0}$$

Mathematically, this is true if and only if:

$$\det \mathbf{A} \neq 0$$

So for example any data vector, $\mathbf{x} \in \mathbb{R}^3$, can be represented in terms of three linearly independent basis vectors, $\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3$, and a (unique) coefficient vector, $\mathbf{c} = [c_1\ c_2\ c_3]^T$:

$$\mathbf{x} = c_1\mathbf{a}_1 + c_2\mathbf{a}_2 + c_3\mathbf{a}_3 = \mathbf{A}\mathbf{c}$$

A linear representation of the data is a fixed basis set, $\mathbf{A}$, that is used to represent each data point:

$$\mathbf{x}_t = \mathbf{A}\mathbf{c}_t, \quad t = 1, \ldots, T$$

If we collect the data vectors into a matrix $\mathbf{X} \triangleq [\mathbf{x}_1 \cdots \mathbf{x}_T]$ and the coefficient vectors into a matrix $\mathbf{C} \triangleq [\mathbf{c}_1 \cdots \mathbf{c}_T]$, then we can write,

$$\mathbf{X} = \mathbf{A}\mathbf{C}$$

where $\mathbf{X}$ is the $n \times T$ data matrix, $\mathbf{A}$ is the $n \times n$ matrix of basis vectors, and $\mathbf{C}$ is the $n \times T$ coefficient (or loading, or weight) matrix, with $\mathbf{c}_t$ giving the "coordinates" of the point $\mathbf{x}_t$ in the coordinate space represented by the basis $\mathbf{A}$.

We have assumed thus far that the data itself is "full rank", i.e. that there exists a set of $n$ data vectors that are linearly independent. It may happen, however, that the data do not lie in the "full volume" of $\mathbb{R}^n$, but rather occupy a subspace of smaller dimension. In three dimensions, for example, all of the data might exist in a two-dimensional subspace.

![400px](R3_3.png) ![400px](R3_4.png)

The data is still represented as points or vectors in three-dimensional space, with three coordinates, but in fact only two coordinates are required (once a "center" point has been fixed in the subspace). Even if the data does not lie *exactly* in a subspace, it may be the case that one of the dimensions (directions) is just numerical noise. Eliminating such extraneous dimensions can lead to more efficient and stable subsequent processing of the data.

To understand how the data occupies the space volumetrically and, in the case of data that is not full rank, to determine which subspace the data lies in, we will use Principal Component Analysis, described in the next section.

## **Principal Component Analysis (PCA)**

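Let the data be represented by an $n \times T$ matrix $\mathbf{X}$, with the data vectors contained in the columns. Let us also assume that the data is "zero mean", i.e. that the mean of each channel (row of $\mathbf{X}$) has been removed (subtracted from the row), so that:

$$\frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_t = \mathbf{0}$$

Now, one way to determine the rank of the data is to examine the covariance matrix, or matrix of channel correlations, which is defined by,

$$\boldsymbol{\Sigma} \triangleq \frac{1}{T}\mathbf{X}\mathbf{X}^T$$

The $n \times n$ matrix $\boldsymbol{\Sigma}$ has the same rank, or intrinsic dimensionality, as the matrix $\mathbf{X}$. If we perform an eigen-decomposition of $\boldsymbol{\Sigma}$, we get,

$$\boldsymbol{\Sigma} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T = \sum_{i=1}^{n}\lambda_i\,\mathbf{q}_i\mathbf{q}_i^T$$

where $\lambda_i$ and $\mathbf{q}_i$ are the eigenvalues and eigenvectors respectively. Since $\boldsymbol{\Sigma}$ is symmetric and "positive semidefinite", all the eigenvalues are real and non-negative. $\boldsymbol{\Sigma}$ (and thus $\mathbf{X}$) is full rank if and only if all $\lambda_i > 0$.

If some of the eigenvalues are zero, then the data is not full rank, and the rank $r$ is equal to the number of nonzero eigenvalues. In this case, the data lies entirely in the $r$-dimensional subspace spanned by the eigenvectors corresponding to the nonzero eigenvalues. Ordering the eigenvalues from largest to smallest, and writing $\mathbf{Q}_r$ for the first $r$ eigenvectors and $\boldsymbol{\Lambda}_r$ for the corresponding eigenvalues, we have,

$$\boldsymbol{\Sigma} = \mathbf{Q}_r\boldsymbol{\Lambda}_r\mathbf{Q}_r^T$$

and,

$$\mathbf{X} = \mathbf{Q}_r\mathbf{C}$$

where $\mathbf{X}$ is the $n \times T$ data matrix, $\mathbf{Q}_r$ is the $n \times r$ matrix of basis vectors, and $\mathbf{C}$ is the $r \times T$ coefficient matrix, with $\mathbf{c}_t$ giving the "coordinates" of the point $\mathbf{x}_t$ in the $r$-dimensional space of the nonzero eigenvectors.

The data is reduced in dimension from $n$ to $r$ by "projecting" onto the $r$-dimensional space,

$$\mathbf{C} = \mathbf{Q}_r^T\mathbf{X}$$

Analysis may be conducted on the reduced data $\mathbf{C}$, e.g. ICA may be performed, giving results in $r$-dimensional space. The coordinates in the original $n$-dimensional data space are then given by simply multiplying the $r$-dimensional vectors by $\mathbf{Q}_r$. The $n \times n$, rank $r$, matrix $\mathbf{Q}_r\mathbf{Q}_r^T$ in this case is called a "projection matrix", projecting the data in the full space onto the subspace spanned by the first $r$ eigenvectors.

## **Singular Value Decomposition (SVD)**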
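Before turning to the SVD, the eigenvalue-based rank estimate and projection described in the PCA section can be sketched numerically. This is only an illustrative sketch using NumPy: the synthetic rank-2 data, the variable names, and the relative threshold used to decide which eigenvalues count as "zero" are all assumptions made for the example, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, r_true = 3, 1000, 2

# Hypothetical data: two latent time series mixed into three channels,
# so the data occupy a 2-D subspace of the 3-D channel space.
A = rng.standard_normal((n, r_true))
X = A @ rng.standard_normal((r_true, T))
X = X - X.mean(axis=1, keepdims=True)      # remove each channel's mean

Sigma = (X @ X.T) / T                      # n x n covariance matrix
lam, Q = np.linalg.eigh(Sigma)             # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]              # reorder: largest eigenvalue first
lam, Q = lam[order], Q[:, order]

# Assumed tolerance: eigenvalues below 1e-10 of the largest count as zero.
tol = 1e-10 * lam.max()
r = int(np.sum(lam > tol))                 # estimated rank

Q_r = Q[:, :r]                             # n x r basis of the data subspace
C = Q_r.T @ X                              # r x T reduced (projected) data
X_back = Q_r @ C                           # coordinates back in channel space
P = Q_r @ Q_r.T                            # n x n, rank-r projection matrix

print("eigenvalues:", lam)
print("estimated rank:", r)
```

With real recordings the small eigenvalues are rarely exactly zero, so the choice of threshold is a judgment call rather than a fixed rule.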

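A related decomposition, called the Singular Value Decomposition (SVD), can be performed directly on the data matrix itself to produce a linear representation (of possibly reduced rank). The SVD decomposes the data matrix into,

$$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$$

where $\mathbf{X}$ is the $n \times T$ data matrix, $\mathbf{U}$ is the $n \times r$ matrix of orthonormal (orthogonal and unit norm) "left eigenvectors", $\mathbf{S}$ is the $r \times r$ diagonal matrix of strictly positive "singular values", and $\mathbf{V}$ is the $T \times r$ matrix of orthonormal "right eigenvectors". From the SVD, we see that,

$$\mathbf{X}\mathbf{X}^T = \mathbf{U}\mathbf{S}\mathbf{V}^T\mathbf{V}\mathbf{S}\mathbf{U}^T = \mathbf{U}\mathbf{S}^2\mathbf{U}^T = T\,\boldsymbol{\Sigma}$$

where $\boldsymbol{\Sigma} = \frac{1}{T}\mathbf{X}\mathbf{X}^T = \mathbf{Q}_r\boldsymbol{\Lambda}_r\mathbf{Q}_r^T$ is the covariance matrix with its nonzero eigenvalues $\boldsymbol{\Lambda}_r$ and corresponding eigenvectors $\mathbf{Q}_r$, so that $\mathbf{U} = \mathbf{Q}_r$ and $\mathbf{S} = (T\boldsymbol{\Lambda}_r)^{1/2}$. The SVD directly gives the linear representation:

$$\mathbf{X} = \mathbf{A}\mathbf{C}, \quad \mathbf{A} = \mathbf{U}, \quad \mathbf{C} = \mathbf{S}\mathbf{V}^T$$

The vectors in $\mathbf{U}$ are orthonormal, and the rows of $\mathbf{C}$ are orthogonal (since $\mathbf{S}$ is diagonal, and $\mathbf{V}$ is orthonormal). The SVD gives the unique linear representation of the data matrix (up to the signs of the basis vectors) such that the columns of $\mathbf{A}$ are orthonormal and the rows of $\mathbf{C}$ are orthogonal. (This holds only if the singular values are all distinct; a subspace determined by equal singular values does not have a unique orthonormal basis, allowing for arbitrary cancelling rotations of the left and right eigenvectors in this subspace.)

Having the rows of $\mathbf{C}$ be orthogonal, i.e. uncorrelated, is a desirable feature of the representation, but having the basis vectors be orthonormal is overly restrictive in many cases of interest, like EEG. However, if we only require the rows of $\mathbf{C}$ to be orthogonal, then we lose the uniqueness of the representation, since for any orthonormal matrix $\mathbf{R}$, and any full rank diagonal matrix $\mathbf{D}$, we have,

$$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T = \left(\mathbf{U}\mathbf{S}\mathbf{R}^T\mathbf{D}^{-1}\right)\left(\mathbf{D}\mathbf{R}\mathbf{V}^T\right)$$

where the rows of the new coefficient matrix $\mathbf{D}\mathbf{R}\mathbf{V}^T$ are still orthogonal, but the new basis vectors, in the columns of $\mathbf{U}\mathbf{S}\mathbf{R}^T\mathbf{D}^{-1}$, are in general no longer orthogonal.

A linear representation of the data,

$$\mathbf{X} = \mathbf{A}\mathbf{C}$$

implies that the coefficients can be recovered from the data using the inverse of $\mathbf{A}$ (or, in the case of rank-deficient data, where $\mathbf{A}$ is $n \times r$ with $r < n$, any left inverse, like the pseudoinverse):

$$\mathbf{C} = \mathbf{A}^{-1}\mathbf{X} \triangleq \mathbf{W}\mathbf{X}$$

## **PCA and Sphering**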
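As a numerical check on the SVD relations of the preceding section, the following NumPy sketch verifies that the squared singular values divided by the number of samples match the covariance eigenvalues, that the coefficient rows are orthogonal, and that an arbitrary rotation and rescaling gives another valid representation with orthogonal coefficient rows. The random data, the rotation `R`, and the scaling `D` are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 4, 500

# Hypothetical full-rank, zero-mean data (n channels, T samples).
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, T))
X = X - X.mean(axis=1, keepdims=True)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
C = np.diag(s) @ Vt                          # coefficient matrix, rows orthogonal

# Covariance eigenvalues, reordered to descending for comparison with s**2 / T.
lam = np.linalg.eigvalsh(X @ X.T / T)[::-1]

# Non-uniqueness: any orthonormal R and full-rank diagonal D give another
# representation X = A_new @ C_new whose coefficient rows are still orthogonal.
R, _ = np.linalg.qr(rng.standard_normal((n, n)))
D = np.diag(rng.uniform(0.5, 2.0, size=n))
A_new = U @ np.diag(s) @ R.T @ np.linalg.inv(D)
C_new = D @ R @ Vt

# Coefficients recovered from the data with a (pseudo)inverse of the basis.
W = np.linalg.pinv(A_new)

print("s^2 / T:    ", s**2 / T)
print("eigenvalues:", lam)
```

Here `np.linalg.pinv` is used so that the same line would also work for a non-square basis matrix with full column rank.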

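We have seen that the SVD representation is one linear representation of the data matrix. In terms of the general form $\mathbf{X} = (\mathbf{U}\mathbf{S}\mathbf{R}^T\mathbf{D}^{-1})(\mathbf{D}\mathbf{R}\mathbf{V}^T)$, with $\mathbf{R}$ orthonormal and $\mathbf{D}$ diagonal and full rank, the SVD puts,

$$\mathbf{R} = \mathbf{I}, \quad \mathbf{D} = \mathbf{I}$$

where $\mathbf{I}$ is the identity matrix. Another representation, which we call "sphering", puts (assuming full rank data, so that $\mathbf{U}$ is square),

$$\mathbf{R} = \mathbf{U}, \quad \mathbf{D} = \sqrt{T}\,\mathbf{I}$$

This latter representation has certain advantages. We can show, e.g., that the sphering transformation leaves the data changed as little as possible among all "whitening" transformations, i.e. those that leave the resulting rows of the coefficient matrix uncorrelated with unit average power, $\frac{1}{T}\mathbf{C}\mathbf{C}^T = \mathbf{I}$. This is equivalent to taking $\mathbf{D} = \sqrt{T}\,\mathbf{I}$. Writing the covariance matrix as $\boldsymbol{\Sigma} = \frac{1}{T}\mathbf{X}\mathbf{X}^T = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$, so that $\mathbf{U} = \mathbf{Q}$ and $\mathbf{S} = (T\boldsymbol{\Lambda})^{1/2}$, let the general form of a "whitening" decorrelating transformation, then, be:

$$\mathbf{W} = \mathbf{R}\boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T$$

for arbitrary orthonormal matrix $\mathbf{R}$. We measure the distance of the transformed data from the original data by the sum of the squared errors:

$$\|\mathbf{X} - \mathbf{W}\mathbf{X}\|_F^2 = T\,\mathrm{tr}\!\left[(\mathbf{I} - \mathbf{W})\,\boldsymbol{\Sigma}\,(\mathbf{I} - \mathbf{W})^T\right]$$

Writing $\mathbf{W}$ in the general form of the decorrelating transformation, we get,

$$\frac{1}{T}\|\mathbf{X} - \mathbf{W}\mathbf{X}\|_F^2 = \mathrm{tr}(\boldsymbol{\Lambda}) + n - 2\,\mathrm{tr}\!\left(\mathbf{Q}^T\mathbf{R}\,\boldsymbol{\Lambda}^{1/2}\right) \;\geq\; \mathrm{tr}(\boldsymbol{\Lambda}) + n - 2\,\mathrm{tr}\!\left(\boldsymbol{\Lambda}^{1/2}\right) = \sum_{i=1}^{n}\left(\sqrt{\lambda_i} - 1\right)^2$$

The inequality holds because the diagonal elements of the orthonormal matrix $\mathbf{Q}^T\mathbf{R}$ cannot exceed one. Equality is achieved in the last inequality if and only if $\mathbf{R} = \mathbf{Q}$. The resulting minimal squared error is the same squared error that would result from simply normalizing the variance of each coordinate in the eigenvector basis, which is equivalent to applying the transformation $\boldsymbol{\Lambda}^{-1/2}$ to the rotated data $\mathbf{Q}^T\mathbf{X}$. We shall refer to this particular whitening transformation,

$$\mathbf{W} = \mathbf{Q}\boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T = \boldsymbol{\Sigma}^{-1/2}$$

as the inverse of the "square root" of the covariance matrix $\boldsymbol{\Sigma}$. It is the unique symmetric positive definite matrix $\mathbf{W}$ satisfying $\mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^T = \mathbf{I}$.

Remarks: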

In what follows, $\boldsymbol{\Sigma} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$ denotes the covariance matrix of the data and its eigen-decomposition. We can view this result as saying that the sphering matrix $\boldsymbol{\Sigma}^{-1/2} = \mathbf{Q}\boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T$ is the whitening matrix that leaves the data closest to the original, viewed either as a collection of channel vectors, or as a collection of channel time series. We have found in practice, performing ICA on EEG data, that using the (symmetric) sphering matrix as an initialization of $\mathbf{W}$ for ICA generally yields the best results and the quickest convergence, especially in comparison with the PCA whitening matrix $\boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T$. This suggests that the former whitening transformation produces more independent components than the latter. This is confirmed empirically in our mutual information computations.

Why should the sphering matrix $\boldsymbol{\Sigma}^{-1/2}$ produce more independent time series and a better starting point for ICA than the whitening matrix $\boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T$? In the case of EEG, this is likely due to the fact that the EEG sensor electrodes are spread out at distances of the same order as the distance between the EEG sources. Thus the sources tend to have a much larger effect on a relatively small number of sensors, rather than a moderate effect on all of the sensors.

The whitening matrix $\boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T$, in projecting the data onto the eigenvectors of the covariance matrix, produces time series that are each mixtures of all of the channels, and in this sense more mixed than the original data, in which the sources distribute over a relatively small number of channels.

The sphering matrix $\mathbf{Q}\boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T$, on the other hand, rotates the transformed data back into its original coordinates, and produces time series that are closest to the original data, which was relatively independent at the start.

By leaving the data in the eigenvector coordinate system, the whitening matrix $\boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T$ forces the ICA algorithm to "undo" a great deal of mixing in the time series. As a starting point for iterative algorithms, this makes the problem more difficult (in terms of potential local optima) and more time consuming (since the starting point is farther from the ICA optimum).
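The least-squares part of the remarks above can be checked numerically. The sketch below, written with NumPy on hypothetical random data, builds the PCA whitening matrix and the symmetric sphering matrix from the same eigen-decomposition, confirms that both whiten the data, and compares how far each moves the data from the original. It illustrates only the "closest to the original data" property; it does not test ICA convergence or independence, and all variable names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 4, 2000

# Hypothetical full-rank, zero-mean data (n channels, T samples).
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, T))
X = X - X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / T                          # covariance matrix
lam, Q = np.linalg.eigh(Sigma)               # Sigma = Q @ diag(lam) @ Q.T

W_pca = np.diag(lam ** -0.5) @ Q.T           # PCA whitening: Lambda^{-1/2} Q^T
W_sph = Q @ W_pca                            # sphering: Q Lambda^{-1/2} Q^T

# Another whitening matrix, using an arbitrary orthonormal rotation R.
R, _ = np.linalg.qr(rng.standard_normal((n, n)))
W_rand = R @ W_pca


def cov(Y):
    """Covariance of zero-mean data with samples in the columns."""
    return Y @ Y.T / Y.shape[1]


def sq_err(W):
    """Sum of squared differences between the data and the transformed data."""
    return np.sum((X - W @ X) ** 2)


print("error, sphering:       ", sq_err(W_sph))
print("error, PCA whitening:  ", sq_err(W_pca))
print("error, random rotation:", sq_err(W_rand))
```

All three transformations produce identity covariance; they differ only in the rotation applied after rescaling, and the sphering choice gives the smallest squared error.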

## **EEG Data Reference and Re-referencing**

EEG data is recorded as a potential difference between the electrode location and the reference. Biosemi active recordings use a reference that is separate from the scalp electrodes. If data is recorded with a specific electrode reference, then the data essentially includes a "zero" channel corresponding to the signal at the reference location relative to itself.

A commonly used reference is the "average reference", which consists essentially of subtracting the mean scalp potential at each time point from the recorded channel potential. Let the $n$-dimensional vector of all ones be denoted,

$$\mathbf{e} \triangleq [1\ 1\ \cdots\ 1]^T$$

If the data is denoted $\mathbf{X}$, then average referenced data is equivalent to,

$$\mathbf{X}_{\mathrm{avg}} = \left(\mathbf{I} - \tfrac{1}{n}\mathbf{e}\mathbf{e}^T\right)\mathbf{X}$$

The average reference reduces the rank of the data because the referencing matrix is rank $n-1$ (note that if you include the original reference when computing average reference, average reference does not reduce the rank of the data). In particular, the vector $\mathbf{e}$ is in the "null space" of the referencing matrix:

$$\left(\mathbf{I} - \tfrac{1}{n}\mathbf{e}\mathbf{e}^T\right)\mathbf{e} = \mathbf{0}$$

The left-hand side is transformed as

$$\left(\mathbf{I} - \tfrac{1}{n}\mathbf{e}\mathbf{e}^T\right)\mathbf{e} = \mathbf{e} - \mathbf{e}\,\frac{\mathbf{e}^T\mathbf{e}}{n}$$

Here, the $1/n$ is key since $(\mathbf{e}^T\mathbf{e})/n = 1$. Therefore,

$$\mathbf{e} - \mathbf{e}\,\frac{\mathbf{e}^T\mathbf{e}}{n} = \mathbf{e} - \mathbf{e} = \mathbf{0}$$

Re-referencing to a specific channel or channels can be represented similarly. Let the vector with one in the $j$th position be denoted

$$\mathbf{e}_j \triangleq [0\ \cdots\ 0\ 1\ 0\ \cdots\ 0]^T$$

Suppose e.g. that the mastoid electrode numbers are $j$ and $k$. Then the linked mastoid re-reference is equivalent to:

$$\mathbf{X}_{\mathrm{lm}} = \left(\mathbf{I} - \tfrac{1}{2}\mathbf{e}\,(\mathbf{e}_j + \mathbf{e}_k)^T\right)\mathbf{X}$$

Again, however, $\mathbf{e}$ is in the null space of this referencing matrix, showing that the rank is $n-1$. Any referencing matrix will be rank deficient, and will thus leave the data rank deficient by one dimension.

In addition to referencing, EEG pre-processing usually includes high-pass filtering (to reduce non-stationarity caused by slow drifts). Linear filtering (such as high-pass, low-pass, band-pass, FIR, IIR, etc.) can be represented as a matrix multiplication of the data on the right by a large $T \times T$ matrix $\mathbf{F}$ whose columns are time-shifted versions of each other. The combined referencing and filtering operations can be represented, for the average reference for example, as:

$$\mathbf{Y} = \left(\mathbf{I} - \tfrac{1}{n}\mathbf{e}\mathbf{e}^T\right)\mathbf{X}\,\mathbf{F}$$

The resulting referenced and filtered matrix should remain rank deficient by one. However, when referencing is done first, reducing the rank by one, and then filtering is performed, it may happen that the rank of the data increases so that it becomes essentially full rank again. This is apparently due to numerical effects of multiplying (in effect) by a $T \times T$ matrix $\mathbf{F}$.

To summarize, re-referencing should reduce the rank of the data, relegating it to an $(n-1)$-dimensional subspace of the $n$-dimensional channel space. However, subsequent filtering of the rank-reduced referenced data *may* increase the rank of the data again (so that the minimum singular value is significantly larger than zero). In this case, numerical noise in the vector (direction) $\mathbf{e}$ is essentially added back into the data as an independent component.
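The rank argument for referencing can be verified with a small NumPy sketch. The random data, the channel count, and the two "mastoid" indices below are hypothetical and chosen only for illustration; the filtering step is not simulated, since the rank increase described above is a numerical effect that depends on the particular filter and data.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 5, 400
X = rng.standard_normal((n, T))              # hypothetical full-rank data

e = np.ones((n, 1))                          # vector of all ones
M_avg = np.eye(n) - (e @ e.T) / n            # average-reference matrix

j, k = 1, 3                                  # hypothetical mastoid channels (0-based)
e_jk = np.zeros((n, 1))
e_jk[[j, k]] = 1.0                           # e_j + e_k
M_lm = np.eye(n) - 0.5 * (e @ e_jk.T)        # linked-mastoid reference matrix

X_avg = M_avg @ X
X_lm = M_lm @ X

rank = np.linalg.matrix_rank
print("rank of X:            ", rank(X))
print("rank of M_avg, M_lm:  ", rank(M_avg), rank(M_lm))
print("rank of X_avg, X_lm:  ", rank(X_avg), rank(X_lm))

# The smallest singular value is a practical check for rank deficiency:
# after referencing it should be at numerical-noise level.
s = np.linalg.svd(X_avg, compute_uv=False)
print("singular values of X_avg:", s)
```

Repeating the singular value check after each preprocessing step (for example after filtering) is one way to see whether the smallest singular value has moved away from zero.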