Multivariate assessment of event-related potentials with the t-CWT method

Background Event-related brain potentials (ERPs) are usually assessed with univariate statistical tests although they are essentially multivariate objects. Brain–computer interface applications are a notable exception to this practice, because they are based on multivariate classification of single-trial ERPs. Multivariate ERP assessment can be facilitated by feature extraction methods. One such method is t-CWT, a mathematical-statistical algorithm based on the continuous wavelet transform (CWT) and Student’s t-test. Results This article begins with a geometric primer on some basic concepts of multivariate statistics as applied to ERP assessment in general and to the t-CWT method in particular. Further, it presents for the first time a detailed, step-by-step, formal mathematical description of the t-CWT algorithm. A new multivariate outlier rejection procedure based on principal component analysis in the frequency domain is presented as an important pre-processing step. The MATLAB and GNU Octave implementation of t-CWT is also made publicly available for the first time as free and open source code. The method is demonstrated on some example ERP data obtained in a passive oddball paradigm. Finally, some conceptually novel applications of the multivariate approach in general and of the t-CWT method in particular are suggested and discussed. Conclusions Hopefully, the publication of both the t-CWT source code and its underlying mathematical algorithm along with a didactic geometric introduction to some basic concepts of multivariate statistics would make t-CWT more accessible to both users and developers in the field of neuroscience research. Electronic supplementary material The online version of this article (doi:10.1186/s12868-015-0185-z) contains supplementary material, which is available to authorized users.


Background
An event-related brain potential (ERP) is extracted from an electroencephalogram (EEG) [1]. Since the EEG is a stochastic process, the ERP is a multivariate statistical object, as well. It is a set of random curves (one curve per EEG channel), and a random curve cannot be simply represented by a single univariate feature (or a curve parameter) without loosing a lot of potentially useful information. Nevertheless, ERPs are usually represented by univariate "components", which most often are prominent peaks of the curves. An ERP component is usually assessed by its peak value or by the area under the curve in a narrow time window around the peak. This method has the huge * Correspondence: bostanov@uni-tuebingen.de Institute of Medical Psychology and Behavioral Neurobiology, University of Tübingen, Gartenstr. 29, 72074 Tübingen, Germany advantage that it is both simple and visually and intuitively clear. For instance, peak latency can be interpreted as brain processing time related to an ERP-triggering event. More sophisticated methods for ERP component extraction from single trials exist as well [2]. The statistical assessment of ERP components is usually based on analysis of variance (ANOVA) [3]. Such assessment is sufficient for many applications. Some inherent limitations of this approach have been addressed by "mass univariate analysis" methods based on permutation/randomization tests [4,5].
There are, however, cases in which univariate assessment is not sufficient. For instance, ERP-based braincomputer interfaces (BCI) require full-fledged multivariate assessment of the ERP curves [6]. The t-CWT [7], a feature extraction method based on the continuous wavelet transform (CWT) [8,9] and Student's t-test, was first introduced as one possible multivariate solution to the problem of single-trial ERP classification at the International BCI Competition 2003, where it was compared to a variety of other feature extraction and classification methods ranging from simple and weak to advanced and powerful ones like, e.g., Support Vector Machines (SVM) [6,10]. t-CWT was a winner on two of the six ERP datasets provided for the competition [6]; these two were obtained in two different BCI paradigms, "P300 Speller" and "Self regulation of Slow Cortical Potentials (Biofeedback)" [7].
Apart from BCI applications, it has been shown how t-CWT can be used for detection and quantification of ERPs in other paradigms as well, e.g., for individual (clinical) diagnostics [11,12]. In the latter application, t-CWT was combined with a multivariate randomization test based on Hotelling's T 2 -statistic. Similarly to the univariate randomization tests mentioned above [4], which provide an effective correction for multiple univariate comparisons, the multivariate randomization test [11] corrects the accumulation of chance bias arising from performing both t-CWT feature extraction and hypothesis testing on the same ERP dataset.
A comparison of t-CWT to classical univariate ERP assessment methods based on peak picking and area computation showed a clear advantage of multivariate t-CWT features over univariate measures [11]. More recently, t-CWT has been systematically evaluated in comparison to several other ERP component detection methods ranging from simple univariate peak picking to t max randomization tests [4] performed on band pass filtered EEG and on t-CWT features [13]. This evaluation was performed with both simulated and real ERP data at different levels of signal-to-noise ratio. Sensitivity, specificity, positive and negative predictive values were assessed, and, as a result, t-CWT showed superior to all other methods, especially at low signal-to-noise ratios. It should be noted, however, that, in that study, t-CWT was used only as a feature extraction method and only mass univariate analysis (t max ), but no multivariate statistical assessment of the obtained features was done. In another recent study, t-CWT feature extraction was compared to two other wavelet-based ERP component detection methods using the spikelet technique and wavelet asymmetry, respectively, with the result that both t-CWT and the novel wavelet asymmetry method showed a marked advantage over the spikelet technique in terms of detection accuracy: 83 % and 91 % vs. 43 %, correspondingly [14]. In the latter study, however, only a single ERP component of interest, the N160, was detected without any multivariate or mass univariate analyses of the whole ERP curves.
Although t-CWT features may well be interpreted as ERP components [7,11,14], the current article is mainly focused on t-CWT applications in which, as in the case of BCI, the whole (multivariate) difference between two ERP curves is of primary importance, while differences in single components are only of secondary interest.
The goals of the this article are: (a) to provide a didactic geometric primer on some basic concepts of multivariate statistics as applied to ERP assessment in general and to the t-CWT method in particular; (b) to present for the first time a detailed, step-by-step, formal description of the t-CWT algorithm and to make its MATLAB and GNU Octave [15] implementation publicly available as free and open source code [16,17] (Additional file 1) released under the GNU General Public License, version 3 (GPLv3, full text available in the t-CWT documentation [17]); (c) to demonstrate the t-CWT method in the assessment of example ERP data [18] obtained in a passive oddball paradigm [19,20]; (d) to suggest and discuss conceptually novel applications of the multivariate approach in general and of the t-CWT method in particular for the purpose of hypothesis testing rather than for BCI and single-trial classification.
An important new pre-processing step in the revised version of the t-CWT algorithm presented here is the novel multivariate outlier rejection procedure based on principal component analysis (PCA, see below). Other changes in the algorithm were aimed at simplification for the sake of easier understanding. For instance, the time-dependent filtering module [7] was removed from the current version and the discrete wavelet transform (DWT) [8,21] was replaced by a discrete Fourier transform (DFT) [22]. Visualization was simplified as well by plotting the onedimensional linear discriminant function (LDF, see below) instead of the two-dimensional t-value scalogram [7].

Multivariate statistics: a geometric primer
Multivariate statistical analysis [23] cannot be substituted by a multitude of univariate analyses, because the latter cannot address adequately the covariance of the data. Even if the effect of the correlations between the variables on multiple univariate tests is taken into account by a proper correction of the overall α-level, e.g., by a permutation test [4], the univariate approach still misses a lot of information contained in the covariance matrix [23, p. 8].
In this section, some basic concepts of multivariate analysis are represented as geometric notions. This approach is meant as an aid to better understanding of these concepts through there association with familiar images from Euclidean space.

ERPs as points in a vector space
Consider a sample of N ERP trials obtained from K EEG channels, in the time interval (epoch) 0 ≤ t < T, where t is the time relative to a triggering event. For each channel k = 1, 2, . . . K , and each single trial n = 1, 2, . . . N , the ERP voltage curve v k n (t) can be represented by a row vec- , where t l = (l − 1)T /L, l = 1, 2, . . . L are the equidistant sampling time points whose number L is obtained from the original sampling frequency R 0 : The entire K -channel ERP trial can be represented by the vector Thus, the ERP is represented by a random vector v in a K × L-dimensional space, and the ERP sample is represented by the N -by-K × L matrix Note that this notation is different from the standard notation in which vectors are represented by columns [23, p. 7].
Here, vectors are represented by rows, as in the MATLAB and GNU Octave [15] implementation of t-CWT [16,17].

Mahalanobis distance and Hotelling's test
where v A and v B are the mean vectors (representing the average ERP curves). Geometrically, v A and v B represent two points in the K × L-dimensional space. Hence, we can test H 0 by testing the hypothesis that the distance between these two points is zero. Thus, the problem of multivariate hypothesis testing can be reduced to the geometric problem of defining and measuring distance in this vector space and then performing a univariate test on this (random) distance. Like in the univariate case, the scale for measuring distance is provided by the variance or, more precisely, by its square root, the standard deviation (like in Student's ttest). There are, however, two problems in the multivariate case: (a) we have different variances corresponding to the different vector components, and (b) these variables are correlated. The latter problem can be solved by principal component analysis (PCA). The principal component transform (PCT) represented by its corresponding matrix T p is an orthogonal transformation, i.e., a kind of rotation of the coordinate axes, such that, in the new coordinate system, the covariance matrix has a diagonal form [23, pp. 343-344]. The transformation of the vector components is given by The covariance matrix S is diagonalized by this transformation, i.e., the correlations between the new variables v i p , i = 1, 2, . . . K × L are all zero. Because T p is an orthogonal transformation, the inverse transformation (i.e., the rotation back to the original axes) is given by the transpose and, correspondingly, the diagonalized covariance matrix S p is given by The diagonal elements (the eigenvalues) of S p are the squared standard deviations σ i p 2 of the uncorrelated variables v i p . Normalizing these variables by the transformation we land in a familiar K × L-dimensional Euclidean space where the scale is the same in all directions, i.e., the constant density ellipsoids [23, p. 40] of the multivariate normal distribution of x are spheres (the procedure (4-7) is known as PCA "sphering" or "whitening"). In Euclidean space, the squared distance D 2 between two points x A and x B is simply given by the Pythagorean theorem Substituting (7) into (8), we can compute the distance between v p A and v p B In (9), we use the fact that S −1 p is a diagonal matrix with eigenvalues σ i p −2 . Substituting the right parts of (5) and (6) into (9), we obtain In (10), we use the property that S −1 is diagonalized by the same transformation as S. The expression (10) the Mahalanobis distance (10) is expressed as a LDF value: The column vector d defines a direction in space. The pro- and it is assigned to the population B otherwise. The rule (14) is optimal because it minimizes the classification error rate. Moreover, the LDF value difference v x − 1 2 (v A + v B ) d is proportional to the distance from the point v x to the separation plane (13) and can be seen as a measure of the affiliation of v x with A or B.
The optimal LDA classification rule (14) is based on the assumption that the a priori probability p A that an ERP trial of unknown affiliation belongs to A is equal to the a priori probability p B that it belongs to B. If p A = p B , the optimal LDA classification rule (14) is generalized as follows [23, and to B otherwise, i.e., the separation plane (13) is shifted in the direction of v A if p A < p B , and it is shifted in the direction of v B if p A > p B . Note that (15) is a generalization of (14) because the logarithmic term is zero when p A = p B . The LDF value v x d has an interesting property that should be noted, because it could be important for some interesting applications. Since the LDF d is obtained from the sample V while v x is drawn from a different sample V X , the LDF value v x d has univariate normal distribution [23, p. 45]. This property can be used to reduce the multivariate ERP v x to the univariate random variable v x d that exclusively and fully reflects the (multivariate) ERP difference between two particular experimental conditions, A and B. The important point here is that v x d can be used for other purposes beside classification. For instance, if the A-B structure of V X is known, the LDF value v x d can be subjected to Student's t-test in order to test the hypothesis about the mean A-B difference within V X . Note the difference to Hotelling's T 2 -test: while in the Mahalanobis distance (12), the LDF d is computed from the sample that is tested, in v x d it is computed from a different sample. This means that the multivariate structure of the A-B difference derived from Vis imposed on V X in order to assess the (univariate) magnitude of the A-B difference within V X . This procedure provides a methodologically important alternative to Hotelling's T 2 -test, because it implies a conceptually novel approach to multivariate ERP assessment which could be useful, e.g., in clinical applications (see the "Discussion" section).

The problem of dimensionality
The multivariate ERP model presented above is idealized. It can only work with few EEG channels, low timesampling rates, short epochs and large number of ERP trials. In most applications, however, the dimensionality of the vector space is very large and exceeds the number of trials. For instance, if the number of channels is K = 32, the sampling rate (1) is R 0 = 500 Hz, and the epoch length is T = 1 s, then the number of time points is L = R 0 T =500, and we have K × L = 16,000 dimensions. If the number of trials is, e.g., N = 1,000, the rank of the covariance matrix S is N − 1 = 999 [23, pp. 9,406], which means that S is singular (i.e. S −1 does not exist, because S has at least 15,001 zero eigenvalues), and (9-15) cannot be applied. Even if N − 1 > K × L and S is not singular, it is still necessary to reduce the dimensionality of the model because of (a) the loss of statistical power caused by the inclusion of noise variables in the D 2 -sum (9), and (b) the computational problems associated with too many variables and too small eigenvalues.
A standard solution to the dimensionality problem is given by PCA (see above): the variables v i p with small variances σ i p 2 are simply deleted from the model according to a certain criterion, e.g., all eigenvalues greater than the average are retained, or the largest eigenvalues explaining a certain proportion of the total variance are retained [23, pp. 347-348].
The t-CWT method provides a further solution to the dimensionality problem. It uses explicitly the special fact that the random vectors represent ERP curves, i.e., continuous functions of time. The continuous wavelet transform (CWT) [8,9] and Student's t-test are used to extract certain "features" of the ERP curves, which build the "feature space" whose dimensionality is substantially smaller than that of the original space. All the standard multivariate procedures described above (PCA, Hottelling's test, LDA) can then be performed in this feature space.
Here, it is interesting to mention an alternative, recently proposed, dimensionality reduction approach based on Effect-Matched Spatial (EMS) filtering [24]. While t-CWT reduces the dimensionality of the ERP time curves, independently for each single channel, the EMS filtering method reduces the number of channels (to one), independently for each single time point, thus representing the multichannel ERP by a single "surrogate time-course". Thus, in a certain sense, EMS filtering can be seen as complementary to t-CWT and, for particular purposes, the two methods can be used in combination with each other.

The t-CWT method
The t-CWT method has already been described in the continuous notation [7]. Here, it is presented in the discrete vector and matrix notation as well, because this discrete representation is the one that is used in the computational algorithm [16,17].

CWT
The CWT [8,9] w k n (s, t) of the EEG signal v k n (t) of the kth channel of the nth ERP trial is given by where ψ(t) is a wavelet function which is (a) well localized in both time and frequency, and (b) has a zero mean: The approximate position of ψ (τ − t)/s in time is determined by the time shift t, while the scale s, which is the inverse frequency, defines the approximate position in the frequency domain (Fig. 1a). The CWT (16) is a linear transformation, which means that it can be represented by a matrix T w such that where the random vectors v and w represent the ERP in the time domain and in the time-frequency domain, respectively, and the matrices V and W represent the corresponding single-trial samples. The coefficients of T w in (18) are obtained by substituting v k n (t) and ψ (τ − t)/s in (16) with their respective discrete representations as in (2) and converting the integrals into corresponding sums. Note, however, that the CWT (16) is highly redundant and the wavelet space defined by w is much "larger" than the original vector space defined by v. Note also that T w is actually a block diagonal matrix built from K identical blocks, one per channel. The t-CWT computer application [16,17] uses only one CWT block which is applied to each channel.

t-CWT
Now, consider again two experimental conditions A and B M and N , respectively, of K -channel ERPs represented by the random curves v k A m (t) and v k B n (t), where: k = 1, 2, . . . K ; m = 1, 2, . . . M; n = 1, 2, . . . N ; and 0 ≤ t < T. The corresponding CWTs are computed by (16). Then, Student's two-sample t-value t k (s, t) is computed for each k and each scale-time point (s, t) from the corresponding CWT values w k A m (s, t) and w k B n (s, t): where w k A (s, t) and w k B (s, t) are the sample means and σ k AB (s, t) is the pooled standard deviation computed from the corresponding sums of squares (SS).
In the next step, each of the points (s kj , t kj ), at which the functions t k (s, t) reach a local extremum, are detected. In the last step, we define the t-CWT vector samples W A and W B by their respective components, the t-CWT features Finding the local extrema of a function of two variables is an analytical operation, but its result can be represented by a simple projection in the wavelet space, i.e. selecting the vector components that correspond to the points (s kj , t kj ) and discarding all other dimensions. This projection can be represented by the matrix T w which is obtained from T w in (18) by deleting the columns corresponding to the discarded space dimensions. The t-CWT vectors are obtained by substituting T w in (18) where V is the "total" ERP sample comprising the subsamples V A and V B , and v is the corresponding random vector.

Methods
This section provides a detailed, step-by-step, formal delineation of the t-CWT algorithm. In the brief intuitive descriptions published before [7,11], most of the details were omitted. Here, a rigorous mathematical delineation of all steps is presented for the first time.

Frequency domain representation
The first pre-processing step is based on a frequency domain representation of the ERPs by a discrete Fourier transform (DFT [22]; note the difference to the previous version of t-CWT [7] in which time-dependent filtering and DWT [8,21] were performed instead of DFT). The dimensionality of the vector space is reduced by deleting all frequencies larger than 2f c , where f c is a cutoff frequency defined by a cutoff scale This is done as follows. First, we compute the orthogo- Geometrically, the DFT (23) can be seen as a rotation of the axes, analogical to the PCT (4). Note, however, that T f is actually a block diagonal matrix built from zeros and K identical DFT blocks, one per channel. (The t-CWT computer application uses only one copy of the DFT block which is applied to each channel.) As next, we retain only those columns of T f that correspond to frequencies f j fulfilling the cutoff condition A "reduced" matrixT f is obtained from T f by deleting all columns corresponding to frequencies f j > 2f c . The reduced DFT is then given bŷ As in (5), the "inverse" transform (i.e., the rotation back to the time domain axes) is given bŷ Note, however, that, sinceT f is not a square matrix,v and V are filtered versions of v and V.
In order to smooth the cutoff, the vector components corresponding to the frequencies f j of the last octave f c < f j ≤ 2f c are attenuated gradually. This is done by multiplication with a diagonal matrix R f whose diagonal elements are given by the values of an envelope function Similarly, the vector components in the time domain, v and V, can also be multiplied with an appropriate window function before DFT. This is done by left multiplication ofT f with a diagonal matrix R t whose diagonal elements are given by the values of the corresponding envelope function. The current t-CWT implementation uses a modified Tukey window [25] defined by the envelope function where T in is the fade-in time and T out is the fade-out time.
Chaining all transformations together, we obtaiñ whereT f is defined bỹ For some purposes (e.g. visualization), it might be useful to represent the filtered ERP back in the time domain by the "inverse" transform (26): The number of frequency components per channel is where T is the length of the time interval. The dimension of the frequency domain space (i.e., the number of rows ofT f orT f ) is then KN F , where K is the number of EEG channels. Substituting (22) in (31) we obtain the following approximation: In the t-CWT software [16,17], the time-to-frequency domain transformation (28) is implemented by the function tcwt_t2f; the preceding computation of the transformation matrixT f (22)(23)(24)(25)(26)(27)(28)(29) is implemented by the function tcwt_prm2mat (Table 1).

PCA and outlier detection
The multivariate outlier detection procedure proposed here is heavily based on PCA not only for computational reasons (due to small eigenvalues), but also because a multivariate outlier can strongly influence the dimensionality of the model, producing "fake" dimensions that survive PCA unless the outlier is excluded from the computation of the covariance matrix [23, p. 373].
An important distinction should be made at this point. PCA, as commonly used in ERP applications [26,27] is performed in the time domain and it usually includes an additional rotation of the axes [28] following the initial one (4). This second rotation is aimed at obtaining more meaningful components, which, however, are not uncorrelated. In t-CWT, where PCA is only used in the pre/postprocessing (see below), it does not include any additional rotations and the resulting components remain uncorrelated.
Consider an ERP represented via (28) in the frequency domain by the random vectorṽ f and the single-trial sam-pleṼ f , the latter comprising two subsamplesṼ f A and V f B corresponding to two different experimental conditions;Ṽ f A andṼ f B may in turn comprise subsamples of trials obtained from different individuals. PCT is performed according to (4), using the total covariance matrix obtained directly fromṼ f (i.e. ignoring the subsample Then, components (represented by columns of V p and T p ) corresponding to eigenvalues smaller than a certain cutoff value (defined by one of the criteria mentioned above) are temporarily removed. The remaining Q p variables v i p , i = 1, 2, . . . Q p , are normalized by (7). From the normalized variables x i , for each n, the Mahalanobis distance D n from the n-th single-trial ERP x n to the total mean x is computed according to (8) With growing number of trials, each of the terms of the sum in (34) rapidly converges to the square of a standard normally distributed random variable. Hence, D 2 is approximately χ 2 -distributed. With growing Q p , the square root of a χ 2 -distributed random variable converges rapidly to a normal distribution as well, which means that D = √ D 2 is approximately normally distributed. The nth ERP trial is temporarily marked as an outlier if where D is the mean, σ D is the standard deviation of D, and C is a heuristically chosen coefficient, (usually C ≥ 2.5). The steps described above are repeated iteratively. Trials marked as outliers (represented by rows of V p ) are excluded from the PCA and the computation of D and σ D in the next iteration, but then they are tested again by (35) together with all other trials. Principal components are also excluded for one iteration only until their number remains unchanged through two consecutive iterations. After that, the PCA criterion is not applied any more and PCT is performed further with a fixed number of components. This is done in order to facilitate convergence. Also, in order to prevent oscillatory behavior, if the number of marked outliers does not increase after the current iteration, the outliers detected in the previous iteration are marked again together with those detected in the current iteration. The procedure ends when the set of detected outliers does not change any more (i.e. the same trials are marked in two consecutive iterations).
IfṼ f comprises different individual datasets, a whole dataset is marked as an outlier (with all its trials) if a certain percentage of its trials are already marked. This criterion is applied, however, only if the number of single-trial outliers does not increase at the end of the current iteration. Note that, like single trials, whole data sets excluded at a certain step, can nevertheless be included again later. As a result of the procedure described above, both rows ofṼ f representing single-trial outliers and columns of T p representing "noise components" or "outlier components" are deleted. The reduction of dimensionality is thus represented by the reduced PCTT p and the corresponding "reduced" ERP matricesV p andv p wherê We use the "inverse" PCTT T p to represent the dimensionality reduction in the frequency domaiñ From (36) and (37), we finally obtaiñ Note that in (38) we assume that all rows ofṼ f corresponding to outliers have already been deleted. It is important to emphasize that, the principal components obtained by this procedure are identical with those which would be obtained if PCA were performed in the time domain using the filtered ERP,ṽ (30). This is so, because the principal component axes that diagonalize the covariance matrix are unique (although their representation and the corresponding PCT depend on the representation of the input sample). The frequency domain representation is solely a matter of computational convenience due to dimensionality reduction by frequency filtering. Note also that although the dimensionality of the model is further reduced by the statistical "PCA filtering" (38), the dimensionality of the frequency domain representation remains unchanged.
The above procedure can be additionally applied to each of the two subsamplesṼ f A andṼ f B separately, using the principal components obtained from the whole samplẽ V f . The only difference is that no PCA is done any more (because the components have already been fixed).
It is important to note that the pre-processing procedures described above can be used independently from the t-CWT feature extraction (described below). For instance, the ERP sampleṼ p f (filtered and free from outliers) can be represented back in the time domain by (26): and then used as input for other assessment procedures.
In this article, the representation (39) is used solely for visualization purposes (see Fig. 2).
In the t-CWT software [16,17], the PCA-based multivariate outlier detection procedure (33-38) is implemented by the function tcwt_f2pc; outlier detection for each experimental condition separately with fixed PCT obtained by tcwt_f2pc is implemented by the function tcwt_pc2cnd2ri (Table 1).

Log-grid sampling
In (18), the number of rows of the CWT matrix T w is equal to the number of components of w, which is equal to the number of sampling points in the s-tplane of the wavelet ψ (τ − t)/s in (16). This number can be significantly reduced by using the log-grid introduced in [7] instead of a regular sampling grid. The vertices (s g , t g,h ) of the log-grid (Fig. 1b) are defined by where S 0 is some unit scale, R is the scale-invariant sampling rate measured in points per scale (pps), and g and h take integer values (including negative and zero). The scale invariance can be expressed by the two properties: s g+R = 2s g (i.e., we have R grid lines per octave), and t g,h+R = t g,h + s g (i.e., on each grid line, we have R sampling points per time interval of length s g ). The special case R = 1 yields the DWT with its dyadic structure [8,21]. The t-CWT application uses only the part of the infinite log-grid (40) confined by the rectangle In (41), the minimal scale S c /2 corresponds to the maximal frequency 2f c in (24). For a large number of log-grid vertices N G , a good approximation is given by Thus, the number of CWT sampling points is significantly reduced, compared to the number of vertices of a rectilinear grid with a regular spacing defined in the same rectangle (41), e.g., by the original sampling frequency R 0 applied to both axes as in (1). Note that the sampling frequency R of the log-grid can be chosen to correspond to the original sampling rate R 0 by setting R = S c R 0 , but this is not necessary and R can as well take an independent value R = S c R 0 .
In the t-CWT software [16,17], the log-grid (s g , t g,h ) (40-41) is implemented by a matrix of vertices computed by the function tcwt_prm2mat (Table 1).

CWT and t-CWT from the frequency domain
As in (18), we define the CWTT w of the filtered ERP defined by (26) aŝ Substituting (26) in (43), we obtain wherẽ The rows of the inverse DFT T T f are the discrete vector representations of the basic EEG oscillations sin(2πf j t) and cos(2πf j t) with frequencies f j = j/T , where j = 0, 1, 2... such that f j ≤ 2f c according to (24). Hence,T w is computed by (16) using the log-grid sampling (40-42). The convolution integrals are represented by the corresponding sums with the original sampling rate R 0 .
Like T w (18),T w is a block diagonal matrix built from K identical blocks (one per channel). The size of each block is N F ×N G where N F and N G are the number of frequency components (31,32) and the number of log-grid vertices (42), respectively. From (32) and (42) we obtain the following approximation: The current implementation of t-CWT uses a Mexican Hat wavelet defined by Note that (47) differs from the standard definition of the Mexican Hat, ψ(t) = (1 − t 2 ) exp(−t 2 /2). The unity scale defined by (47) is four times larger than the standard. This is done for convenience: defined in this way, the scale corresponds better to the durations of the ERP waves matched by the wavelet and to the periods of the oscillations sin(2πf j t) and cos(2πf j t) in (45). In order to exclude the outliers detected above, we apply the obtained CWTT w to the "reduced" matricesṽ p f and  Table 3  The t-CWT features are computed by (19,20) and the t-CWT matrixT w is obtained fromT w by retaining only the columns that represent the t-CWT features (20). Substi-tutingT w in (48) withT w , we obtain In the t-CWT software [16,17], the computation of the CWT matrixT w (16)(17)(18)(43)(44)(45) is implemented by the function tcwt_prm2mat; the computations of the t-CWT scalogram t k (s, t) (19), the t-CWT extrema and the t-CWT matrixT w (49) are implemented by the functions tcwt_f2cwss and tcwt_f2x (Table 1).

Post-processing in the feature space
The t-CWT features w are still strongly correlated, because one and the same ERP component is found in more than one EEG channel represented by at least one extremum in each channel's sub-scalogram. Furthermore, not all such sets of extrema represent significant ERP components. For these reasons, the dimensionality of the feature space is reduced further by PCA and step-down selection of principal components. Finally, the LDF is computed in the reduced feature space.

PCA and step-down test
PCT is performed in the feature space according to (4) and the set of components is reduced according to one of the PCA criteria mentioned above. Then, a subset of components, "selected principal components" (SPC), is selected by a step-down test [23, pp. 111, 177, 217] based on the natural ordering of the components (sorted by eigenvalue). The "reduced" matricesT p ,Ŵ p andŵ p are obtained by deleting the columns corresponding to the eliminated components in T p , W p and w p , respectively:

LDA
The LDF d w is computed in the reduced feature space as in (11): whereŵ p A andŵ p B are the respective means of the two subsamplesŴ p A andŴ p B , andŜ p AB is the pooled covariance matrix. The LDA separation plane is defined as in (13-15) bŷ Now, the transformations (28), (38), (49), and (51) can be chained together in order to represent the LDF d w and the LDF valueŵ p d w back in the frequency domain and in the time domain: where and Finally, it might be useful to mention also the continuous time domain representation of the LDF value: where K is the number of channels, T is the length of the time window, and v k (t) and d k (t) are the continuous representations of v and d respectively (see Fig. 2).
In the t-CWT software [16,17], the PCA-based stepdown reduction of the t-CWT features (50-51) and the computation of the LDF d w (52) and d f (55) are implemented by the function tcwt_x2ld (Table 1).

Algorithm summary with links to the t-CWT software
In Table 1, the processing steps delineated above are summarized and linked to the corresponding functions and input/output data files defined by the t-CWT software [16] (for a detailed description of these files and functions, see the t-CWT software documentation [17]). Output file names displayed in the last column of Table 1 are provided with references to corresponding mathematical variables and equations defined in this article. The wildcard symbol " * " denotes an ERP dataset name and indicates that the corresponding t-CWT functions accept a whole list of such names as an argument and then iterate over the list, thus processing multiple ERP datasets in a single call (see the pseudocode at the bottom of Table 1).

Computational demands
The most computationally demanding procedures are the PCA and the t-CWT including CWT and extremum detection from a scalogram.

PCA
The number of matrix elements of the PCT is K 2 N 2 F . The covariance matrix has the same number of elements. Using (32) we obtain the approximate total number of elements of both matrices: Both the memory and the processing time required for PCA are approximately proportional to N P . Each doubleprecision matrix element needs eight bytes of memory.
The processing time depends on the central processing unit (CPU), but, as a rule of thumb, one microsecond per matrix element can be used for rough estimation of the time needed for one iteration of the PCA-based multivariate outlier detection procedure (see above).

t-CWT
The approximate number of (non-zero) CWT matrix elements per channel is given by (46). The amount of memory consumed by t-CWT is proportional only to this number and does not depend on the number of channels K , because the t-CWT application [16,17] performs CWT on one channel at a time. Furthermore, if the processed dataset is too big, t-CWT does not transform all trials at once, but processes smaller blocks of trials (the block size is controlled by an input parameter), thus limiting the memory demand. The time t-CWT needs to process one scalogram (comprising K sub-scalograms) is approximately proportional to the total number of non-zero CWT matrix elements N W = KN F N G . From (46) we obtain the following approximation: Again, as a rule of thumb, one microsecond per matrix element can be used for rough estimation of the time t-CWT needs to process one scalogram (with K channels). The t-CWT software package [16,17] includes the function tcwt_prm2info that evaluates both the exact values of N P and N W and their approximations computed by (58, 59) for a given set of input parameters. Further, tcwt_prm2info makes a rough estimate of the computational demands based on the simple assumption of eight bytes of working memory and one microsecond of processing time per matrix element. All this is done in real time, without actually creating the corresponding matrices and can be very useful in the planning stage of a t-CWT project when the available computational resources must be taken into account.

Example
Consider, e.g., the following settings: K = 64, T = 1 s, S c = 50 ms (f c = 20 Hz), and R = 15 pps. Then we would have the following (exact) numbers obtained with tcwt_prm2info: N F = 81, N G = 13,276, N P = 53,747,712, and N W = 68,822,784. The memory used for PCA would be about 430 MB and for CWT (transforming 1,000 trials at ones), about 116 MB. The estimated processing time for one PCA iteration would be about 54 s, and for one CWT scalogram, about 69 s. On a powerful hardware (see below), however, these computational times can be significantly shorter (see Table 2).

Example: oddball paradigm
In this section, the t-CWT is demonstrated on example ERP data [18] obtained in a passive oddball paradigm [19,20]. Since these datasets have already been described in detail elsewhere [11], only the most important information about the experiment is provided here.

Datasets
ERP datasets were obtained from 36 healthy participants (right-handed, mean age = 27 years, 20 females) in a passive oddball task [19,20], in which 255 standard and 45 deviant stimuli were presented at a constant rate in a randomized sequence. The standard and the deviant stimuli were 50-ms-long, 75-dB-loud sine tones, with a frequency of 1.3 and 0.8 kHz, respectively; the interstimulus interval was 0.8 s. The participants were instructed just to listen attentively to all tones (passive task). Digitized EEG [time resolution: 2 ms/step (500 Hz), voltage resolution: 0.1678 microvolts/step] was continuously recorded from nine scalp positions according to the 10-20 system: Fz, Cz, Pz, F3, F4, C3, C4, P3, and P4. All electrodes were referenced to the linked mastoids. Electrical eye activity was recorded by bipolar acquisition from the following sites: the two lateral orbital rims for horizontal eye movements, and FP2 and a site below the right eye for vertical eye movements and eye blinks. The first nine datasets (in alphabetical order) were excluded for technical reasons (in order to reduce the whole dataset archive to 100 MB for online storage and download purposes). The remaining 27 datasets were processed with t-CWT.

Data processing
Before feeding the data into the t-CWT processor, the raw EEG curves were segmented and corrected for eye blink and eye movement artifacts, by a standard procedure [29,30]. The epoch length was 1 s, starting 100 ms before stimulus onset. Then, the datasets were checked for series of more than one deviant trials. Only the first trials of such series were retained, the following deviant trials as well as the first subsequent standard trial were deleted. The first ten trials of each dataset were also deleted. As a result of this reduction, in each dataset, remained 242 standard trials and 38 deviant trials. Thus, the a priori probabilities (15, 53) were p s = 86.4 % for standard trials and p d = 13.6 % for deviant trials. The datasets were converted from ASCII format to the internal t-CWT format by the function tcwt_ascii2tmat [16,17] and the epochs were reduced to windows of interest starting at stimulus onset and ending 600 ms later. The signals in these windows were then referenced to the 100 ms pre-stimulus baseline. These signals were processed by t-CWT.
In its current implementation [16,17], t-CWT starts with a call to the function tcwt_prm2info which gives a rough estimate of the memory demands and the compu- tational time (see above). The next call is to the function tcwt_prm2mat that creates and saves to a file those transformation matrices which do not depend on the data, but are functions of global input parameters only (see Table 1). These matrices are the DFTT f (29) and the CWTT w (45), and they depend on the original sampling rate R 0 (1), the length of the time window T (1), the fade-in and the fadeout times T in and T out of the window function (27), the log-grid sampling rate R (40), and the cutoff scale S c (22). While R 0 is defined by the time resolution setting of the EEG amplifier (in this case, R 0 = 500 Hz), the other parameters can be varied to achieve best LDA classification results. In the current example, T , T in and T out were kept fixed: T = 600 ms, T in = 20 ms, T out = 200 ms, while R and S c took the following values: R = 10, 15, 20, and 25 pps; S c = 250, 100, 50, 40, and 30 ms. Note that these values of S c correspond to cutoff frequencies f c = 4, 10, 20, 25, and 33.3 Hz, respectively (22). After computingT f andT w , the data were represented in the frequency domain by (29) using the function tcwt_t2f ( Table 1). The outlier rejection procedure described above (33)(34)(35)(36)(37)(38) was performed on each dataset, first on the whole dataset (by the function tcwt_f2pc), then on each of the two subsets, standard and deviant (by the function tcwt_pc2cnd2ri, see Table 1). The greatest eigenvalues explaining a certain percentage P v of the variance were retained after PCT. Different values of P v were tried in order to minimize the LDA classification error rates (see below): P v = 95, 97, and 99 %. The outlier criterion was defined by setting C = 2.7 in (35). (Other values of C ranging from 2.5 to 9.9 were tried as well, but without any substantial effect on the results; see the Limitations section in the "Discussion" section.) As a next step, the t-CWT features w were computed (for each dataset) as described in (43-49) using the function tcwt_f2x (Table 1). The obtained set of t-CWT features was reduced further by PCAs (50) and step-down tests (51), implemented by the function tcwt_x2ld ( Table 1). The same PCA criterion with the same value of P v as in the outlier rejection procedure was applied. The overall α-level for the step-down test was set to α sd = 0.3. Finally, the LDFs d of the individual datasets were obtained by (52), also implemented by the function tcwt_x2ld (Table 1).
In the "individual hold-out" method, all of the above steps but the first one (outlier rejection from the whole dataset) were performed on a dataset obtained from the original one by excluding one single trial. The LDF thus obtained was used to classify the excluded trial as standard or deviant by (14) and (15). The error rates were obtained by repeating the procedure for each single trial and each dataset and comparing the result with the true category of each trial (standard or deviant). The hold-out method is very efficient, because it is almost unbiased and it uses the whole available statistical power, but it is also a computationally demanding procedure. In the t-CWT software [16,17], the individual hold-out method is implemented by the function tcwt_f2holdout.
In the "individual split-half" method, all steps were performed on the first half of the trials of each dataset (extracted by the function tcwt_f2split [16,17]). The LDFs obtained from these "training datasets" were then applied to the second halves of the datasets, the "test datasets". This method is quick and simple, but it has considerable loss of statistical power as a major disadvantage.
"Individual biased" error rates were also computed in order to demonstrate the bias resulting from using the same dataset (without excluding any trials) for both training and testing.
In order to demonstrate multivariate hypothesis testing, only the last 9 deviant trials and 50 standard trials of each dataset were used as training dataset, while the first half of the trials were used as test dataset. The SPCs obtained from each training dataset by the step-down method were applied to each test dataset and subjected to Hotelling's T 2 -test. Both the individual split-half error rates and the biased error rates, as well as Hotelling's T 2 -tests were performed with the function tcwt_f2stats [16,17].
Finally, all individual datasets were pooled together into one large group dataset (using the function tcwt_f2pool [16,17]). Outlier rejection (using tcwt_f2pc and tcwt_ pc2cnd2ri) was performed with the average eigenvalue as a PCA criterion in order to emphasize group features and to suppress individual and/or oscillatory features. The singletrial outlier criterion was defined by setting C = 2.5 in (35). (Again, experimenting with other values of C had virtually no effect on the results; see the Limitations section in the "Discussion" section.) The dataset outlier criterion was defined by setting the minimum number of trials retained to 50 % of the trials in the dataset. t-CWT, PCA (again, with the average eigenvalue criterion), step-down test, and LDA were performed as above (using tcwt_f2x and tcwt_x2ld). A variation of the hold-out method, the "group hold-out" was used to obtain classification error rates for the individual datasets: instead of excluding one single trial at a time, one whole individual dataset was excluded at each iteration. Each error rate obtained in this way is based on a group LDF applied to the respective excluded dataset.
The t-CWT software [16,17], provides the function tcwt_ri2ri1out that creates systematically hold-out indices of the pooled group dataset which are specially designed for the implementation of the group hold-out method. In a group hold-out index corresponding to a given individual dataset, all single trials belonging to this dataset are marked as "outliers". The functions tcwt_pc2cnd2ri, tcwt_f2x, tcwt_x2ld, and tcwt_f2stats can use these index files to compute the corresponding group hold-out t-CWT features, group hold-out LDFs, and group hold-out error rates. The hold-out mode of operation of these functions is very similar to their normal mode outlined in the pseudocode at the bottom of Table 1 (i.e., they iterate through the list of individual datasets, loading in each iteration the corresponding group hold-out index file and other group hold-out input files, performing the respective transformations and computations, and saving the results in corresponding group hold-out output files).
For all obtained error rates (individual hold-out, individual split-half, individual biased, and group hold-out), binomial distribution p-values were computed (by tcwt_f2stats and tcwt_f2holdout) to test the hypotheses whether these error rates were better (i.e., smaller) than the chance classification error rates defined by the a priori probabilities p d = 50 % or p d = 13.6 % (see above).

Results and discussion
In this section, some example results are presented and discussed. These results were obtained with the t-CWT method from the example datasets [18] described above.

Results from the example oddball data
Group averages of individual hold-out classification error rates for different values of the cutoff scale S c , the loggrid sampling rate R, and the percentage of variance P v explained by PCA are displayed in Table 2. Most of these results are visualized by bar plots in Fig. 3. Both total errors (tot) and errors for each category of trials, standard (std) and deviant (dev), are displayed. The total error rates computed by taking into account the a priori oddball probabilities p s = 86.4 % and p d = 13.6 % (15) were, as expected, much smaller than those computed for p s = p d =50 % (14). Note, however, that the error rates for deviant trials increased when the a priori probabilities were taken into account. It is also interesting to point out the following observation: while for p s = 86.4 % and p d = 13.6 %, the total error (Fig. 3f) was decreased by the error reduction in the classification of deviant trials (Fig.  3e), the corresponding total decrease in the case of equal a priori probabilities p s = p d =50 % (Fig. 3c), was caused by the decline of error of classification of standard trials (Fig. 3a).
Both the numbers in Table 2 and the plots in Fig.  3 demonstrate how the quality of the LDF d (52-56) can be optimized by minimizing the errors of classification through systematic variation of the values of the different input parameters S c , P v , R, etc.. This optimization can be very important for different applications of the t-CWT method. While this is obvious for cases in which classification is the ultimate goal of the application (as, e.g., in BCI), other applications aimed at multivariate hypothesis testing may also use optimized LDFs for their purposes (see the "Discussion" section).
Hold-out Error Rates as Functions of S c and P v for the known a priori oddball probabilities p s = 86.4%, p d = 13.6% Error Rate (%) In Table 2, N F denotes the number of frequency components per channel (31,32), and N G is the number of log-grid vertices (42). The number of non-zero CWT matrix elements (ME) is given by KN F N G , where K is the number of channels. Table 2 shows that the ME ratio (N F N G S 2 c )/(R 2 T 2 ) (see 46) is approximately constant. Table 2 also shows how the computational time depends on the number of non-zero CWT matrix elements KN F N G which is a function of S c , R, P v , and the number of channels K . This function is defined by (59) which was confirmed Hotelling's T 2 -tests were performed with only 9 deviant and 50 standard trials (of each dataset) by the results displayed in Table 2. The computational demands as functions of S c , f c and R are presented in graphical form in Fig. 3g-i. The total processing time for the hold-out method is the product of the processing time per non-zero CWT element, the number of such elements, and the number of scalograms. For the hold-out method, as applied to the example data, the latter was equal to the total number of trials = 27 datasets × 280 trials per dataset = 7,560. As the last column of Table 2 shows, the processing time per matrix element is approximately constant. Note, however, that this value of less than a half microsecond was obtained with MATLAB and certain hardware (see above), and that the corresponding value for GNU Octave and/or other hardware may be larger.
The results for S c = 40 ms, R = 15 pps, and P v = 99 % are presented in more detail. The latter values appeared to be optimal, because, as Table 2 and Fig. 3i show, larger values of R increased substantially the computational demands with little improvement of the results.
The average ERP curves are displayed in Fig. 2a for the whole group (grand average) and in Fig. 2b for one individual participant (ID='GIM'). Student's t-test results for the difference between the ERP responses to deviant vs. standard tones are displayed in Fig. 2c for the whole group and in Fig. 2d for the participant 'GIM'. Note that, since these t-values were not corrected for multiple comparisons, they cannot be used for inference about the statistical significance of the difference between the two ERP curves at each point in time [4,5]. It is interesting, however, to compare the forms of these t-value curves with the forms of the corresponding LDFs. The LDF obtained with the t-CWT method is displayed in Fig. 2e for the whole group and in Fig. 2f for the participant 'GIM'.
The grand average plots show that the ERP response to deviant stimuli relative to the response to standards was dominated by the Mismatch Negativity (peaking about 200 ms post-stimulus) [32], the P3 (300 ms) [19,20], and the Negative Slow Wave (400 ms) [33]. The individual LDF displayed in Fig. 2f shows, however, some additional oscillatory features which are not quite discernible in the time domain (Fig. 2b, d) and which, nevertheless, improve substantially the LDA classification results obtained with this LDF for the participant 'GIM' compared to those computed for the same participant with the group LDF displayed in Fig. 2e (see Table 3).
The different LDA error rates as well as the p-values obtained from Hotelling's T 2 -test (with 9 deviant and 50 standard trials) for each participant, for S c = 40 ms, R = 15 pps, and P v = 99 % are displayed in Table 3. Total error rates (tot) were tested by the binomial cumulative distribution function whether they were significantly smaller than the a priori probability for deviant (dev) trials p d = 13.6 %. The comparison between the different methods shows an advantage of the classifications based on individual t-CWT features over those based on group features. Furthermore, the individual hold-out method provided lower error rates than the split-half method. This can be explained by the split-half method's loss of statistical power (less that 20 deviant trials remained after splitting a dataset). Table 3 also shows the number of t-CWT extrema N X obtained with each method for each data set and the number of SPCs used for classification. While the error rates obtained with the individual biased method are incorrect, the numbers of t-CWT features N X and the numbers of SPCs N P are the correct numbers obtained from the whole datasets. They can also be seen as good approximations for the corresponding values obtained with the holdoutmethod, in which, of course, they vary with each iteration (and are not displayed in Table 3 for this reason). These numbers show that the originally very large dimensionality of the data was reduced to less than 200 (correlated) t-CWT extrema which in turn were reduced further to a few (12 or less, uncorrelated) SPCs. The results from Hotelling's T 2 -test (Table 3, last column) show that, for the most individual datasets, 9 deviants and 50 standards were enough for the ERP difference to reach statistical significance. Table 3 shows that all error rates vary largely among the different individual datasets. While the inherent EEG noise in the data is responsible for one portion of the error, participants' inattention to the stimuli provides an additional source of systematic error. Thus, error rates can be used as inverse measures of attention.

Discussion
A much more straightforward approach, however, would be to use directly the LDF value vd (52-57) to measure the ERP difference between the two experimental con-ditions and to interpret this difference as a measure of participants' attention to the stimuli. This could be done in different ways as described in the following three cases.
In the first case, both v and d are obtained from the same ERP sample V, comprising two subsamples V A and V B obtained under two different experimental conditions A and B (A-B design). Then, the mean difference LDF value v A − v B d is exactly the mean Mahalanobis distance (10)(11)(12) computed in the feature space. This same Mahalanobis distance is used in Hotelling's T 2 -test, which, however, should not be used directly in this case, because the t-CWT features are extracted from the very sample that is tested. As already mentioned in the Background section, a multivariate randomization test based on Hotelling's T 2statistic can be used instead [11].
In the second case, the LDF d is computed from the same sample as v, but, this time, using t-CWT features obtained from a training dataset, V Z (by an A-B design within Z). In this case, Hotelling's T 2 -test may be used, as this was done above in the assessment of the example data.
The third case, in which both the t-CWT features and the LDF d Z are computed from a training dataset V Z , while v is drawn from a test dataset V, is the most important and will be discussed in more detail. As already mentioned above, this is the usual scenario in a typical BCI application, which was also demonstrated above by the splithalf method in the assessment of classification error rates. But the representation of the Mahalanobis distance (12) suggests another usage of the LDF apart from single trial classification. Namely, we can construct a new estimator D 2 Z of the Mahalanobis distance by using d Z instead of d: Since d Z is computed from the training dataset V Z , it can be treated as constant in all tests and analyses concerning the test dataset V.
) are all (univariate) normally distributed and can be subjected to standard univariate tests. It should also be noted that in (60) a whole complex pattern recognition scheme derived from V Z (PCA, t-CWT, etc.) is imposed on V by simple multiplication. The significance of (60) reaches, however, beyond mere computational convenience, because it conveys a whole concept of multivariate ERP assessment. While, in most BCI applications, this concept is clearly the most effective for single trial classification, and, therefore, also the standard one, it has not been used in other ERP applications for testing hypotheses about mean ERP differences. (Note that mass univariate analyses [4,5], which have become popular recently, represent a different concept.) In the following, some general ideas about possible applications of this multivariate concept are presented.
The two kinds of t-CWT features, group and individual, that can be extracted from the data suggest two different kinds of diagnostic applications of the t-CWT method: within-subject and between-subject. Consider an ERP paradigm testing a certain cognitive function in a group of individuals under three different conditions: X , experimental condition (e.g. under the influence of a drug); Y, control condition (e.g. placebo); and Z, standard condition (no substances administered). Assume that the cognitive function of interest is reflected by the ERP difference between two sub-conditions, A and B, (A-B design, as above). Now, the individual t-CWT features, SPCs, and LDFs obtained from Z can be used to assess the difference between X and Y by means of Student's two-sample (X -Y) t-test of the Mahalanobis distance (60) and/or by comparison of classification error rates. This is an example of a within-subject application of the t-CWT method. Note that the t-CWT method described above for the case of an A-B design can be easily extended to the case of only one sub-condition, A, when the ERP v is compared to 0 (A-0 design).
As an example of a between-subject application, consider an ERP paradigm testing a certain cognitive function in three different samples of individuals: X , a sample of patients suffering a certain disorder which has the impairment of this cognitive function as a symptom; Y, a sample of healthy individuals; and Z, a mixed sample of patients and healthy individuals. In this case, the group t-CWT features, SPCs, and LDFs obtained from Z can be used to assess the difference between X and Y. Further, the t-CWT method can be applied directly to the ERP difference between X and Y and the resulting t-CWT features, SPCs, and LDFs can be used for classification and diagnostics of individuals from the general population (i.e. to determine whether an individual suffers from this particular symptom or not). The greatest problem in the case of a between-subject comparison would be the substantial increase in variance due to individual differences.
Finally, in both within-subject and between-subject applications, the LDF value (54-57) can be used as a measure of the magnitude of the (differential) ERP response in each of the conditions X , Y, and Z in order to investigate its relationship with other behavioral measures or with the amplitudes of single (classical) ERP components.

Limitations
All of the above ideas about possible applications of the t-CWT method assume the existence of a cleverly designed ERP paradigm that tests exclusively (changes in) a particular (set of) cognitive function(s) of interest without the resulting ERP differences being affected by (changes in) other cognitive functions. In many cases, however, this assumption might not be true. This is a conceptual limitation, not only of the t-CWT method, but of any multivariate ERP assessment method. In certain cases, the ampli-tudes of single ERP components might be better measures of particular cognitive functions of interest.
A purely technical limitation of the t-CWT method is imposed by the computational demands of its current software implementation [16,17]. For instance, the usage of dense electrode arrays combined with long time windows and small cutoff scales, might result in practical unusability of the application, if no access to adequate computational resources (e.g. a high performance computing cluster [31]) is provided. In spite of the good scalability of the application (e.g. the computationally demanding CWT is performed one channel at a time, the maximum number of single trials processed by CWT is controlled by an input parameter, thus limiting working memory usage), the increase of computational time can be handled only by parallel execution of t-CWT jobs on several powerful CPUs.
A notable limitation of the current study is the lack of evidence for (or against) the usefulness of the novel PCAbased multivariate outlier detection procedure introduced above. The results of the (within-subject) assessment of the example oddball data were not sensitive to variations in the number of outliers detected and excluded by the procedure, depending on the value of the input parameter C (35). Whether zero or as much as 20 % of the trials were rejected, the statistical assessments were practically not affected by these changes and the corresponding results remained virtually the same. It should be mentioned, however, that some unpublished evidence already exists, suggesting that the outlier detection procedure could be crucial in a between-subject design. This evidence comes from the (unfinished) t-CWT assessment of ERP data obtained in a previous study [34].

Conclusions
In the present article, some basic concepts of multivariate statistics were introduced as geometric notions. ERPs were defined as random vectors (points) in a metric space, in which the distance between two points was derived in a natural way from the covariance of the data. PCT and DFT were introduced as rotations in this space. LDA classification was described as computing a LDF vector, building a separation plane perpendicular to this vector, and assigning single-trial ERP points to two different categories according to their position relative to this dividing plane. CWT and t-CWT were also defined as linear transformations represented by their respective matrices. All these mathematical constructs were used to provide for the first time a detailed, step-by-step, formal description of the t-CWT algorithm. Its MATLAB and GNU Octave implementation was also made publicly available as free and open source code released for the first time under GPLv3 [16,17].
A new multivariate outlier rejection procedure based on PCA in the frequency domain was introduced. The timedependent filtering and the DWT used in the previous version of t-CWT [7] were replaced in the current version by simple uniform filtering via DFT. This was done solely for simplification for the sake of easier understanding and not for improvement of the method. In fact, a new version of t-CWT is planned including a more flexible procedure for time-dependent filtering based on PCA and DWT.
It should be noted that t-CWT is essentially a feature extraction method and t-CWT based classification does not necessarily imply LDA as a post-processing procedure. Classification can be performed using other methods as well, e.g., SVM [10]. Hypotheses can be tested by both mass-univariate [13] and multivariate [11] permutation/randomization tests. Moreover, as already mentioned above, t-CWT can also be used in combination with other dimensionality reduction methods, e.g. EMS filtering [24]. On the other hand, the PCA-based multivariate outlier detection introduced here, can be used independently from t-CWT as a pre-processing procedure in other assessment algorithms. It is also important to emphasize that although t-CWT feature extraction can be computationally demanding, taking several seconds or even minutes for large scalograms (with many channels), the t-CWT features, once computed, can be applied practically instantly in real time applications (e.g., BCI) via LDA or other proper classification method.
The t-CWT method was demonstrated on example ERP data [18] obtained in a passive oddball paradigm. Both group and individual t-CWT features were extracted from the data and were used for LDA classification of single trials and for testing mean ERP differences for each individual dataset via Hotelling's T 2 -test. Different methods for estimation of classification errors were introduced and compared with each other.
Finally, new ideas for further applications of the multivariate approach in general and of t-CWT method in particular were introduced on a conceptual level in the Discussion. Some of these ideas will be tested soon in a randomized clinical trial where ERPs are used for assessment of the sustained mindful attention developed by training in a course of mindfulness-based cognitive therapy for recurrent depression [34].