
Introduction to Machine Learning for Brain Imaging

Steven Lemm (a,c,*), Benjamin Blankertz (a,b,c), Thorsten Dickhaus (a), Klaus-Robert Müller (a,c)

(a) Berlin Institute of Technology, Dept. of Computer Science, Machine Learning Group, Berlin, Germany
(b) Fraunhofer FIRST, Intelligent Data Analysis Group, Berlin, Germany
(c) Bernstein Focus: Neurotechnology, Berlin, Germany
(*) Corresponding author: steven.lemm@cs.tu-berlin.de

Abstract

Machine learning and pattern recognition algorithms have in the past years developed to become a workhorse in brain imaging and the computational neurosciences, as they are instrumental for mining vast amounts of neural data of ever increasing measurement precision and for detecting minuscule signals from an overwhelming noise floor. They provide the means to decode and characterize task-relevant brain states and to distinguish them from non-informative brain signals. While this machinery has undoubtedly helped to gain novel biological insights, it also holds the danger of unintentional abuse. Ideally, machine learning techniques should be usable by any non-expert; unfortunately, they typically are not. Overfitting and other pitfalls may occur and lead to spurious and nonsensical interpretation. The goal of this review is therefore to provide an accessible and clear introduction to the strengths, and also the inherent dangers, of machine learning usage in the neurosciences.

Keywords: pattern recognition, machine learning, model selection, cross validation

1. Introduction

The past years have seen an immense rise of interest in the decoding of brain states as measured invasively with multi-unit arrays and electrocorticography (ECoG) or non-invasively with functional Magnetic Resonance Imaging (fMRI), electroencephalography (EEG), or near-infrared spectroscopy (NIRS). Instrumental to this development has been the use of modern machine learning and pattern recognition algorithms (e.g. Bishop (1995); Vapnik (1995); Duda et al. (2001); Hastie et al. (2001); Müller et al. (2001); Schölkopf and Smola (2002); Rasmussen and Williams (2005)). Clearly, in this development the field of real-time decoding for brain-computer interfacing (BCI) has been a technology motor (e.g. Dornhege et al. (2007); Kübler and Kotchoubey (2007); Kübler and Müller (2007); Wolpaw (2007); Birbaumer (2006); Pfurtscheller et al. (2005); Curran and Stokes (2003); Wolpaw et al. (2002); Kübler et al. (2001)). As a result, a number of novel data-driven approaches specific to neuroscience have emerged: (a) dimension reduction and projection methods (e.g. Hyvärinen et al. (2001); Ziehe et al. (2000); Parra and Sajda (2003); Morup et al. (2008); von Bünau et al. (2009); Blanchard et al. (2006); Bießmann et al. (2009)), (b) classification methods (e.g. Müller et al. (2001); Müller et al. (2003); Parra et al. (2002, 2003); Lal et al. (2004); Blankertz et al. (2006a, 2008b); Müller et al. (2008); Parra et al. (2008); Tomioka et al. (2007); Tomioka and Müller (2010)), (c) spatio-temporal filtering algorithms (e.g. Fukunaga (1972); Koles (1991); Ramoser et al. (2000); Lemm et al. (2005); Blankertz et al. (2008b); Dornhege et al. (2006); Tomioka and Müller (2010); Parra et al. (2008); Dornhege et al. (2007); Blankertz et al. (2007)), (d) measures for determining synchrony, coherence or causal relations in data (e.g. Meinecke et al. (2005); Brunner et al. (2006); Nolte et al. (2004); Marzetti et al. (2008); Nolte et al. (2008, 2006); Gentili et al.
(2009)) and (e) algorithms for assessing and counteracting non-stationarity in data (e.g. Shenoy et al. (2006); Sugiyama et al. (2007); von Bünau et al. (2009); Blankertz et al. (2008a); Krauledat et al. (2007)). Moreover, a new generation of source localization techniques is now in use (e.g. Haufe et al. (2008, 2009)) and has been successfully applied in BCI research (e.g. Pun et al. (2006); Noirhomme et al. (2008); Grosse-Wentrup et al. (2007)).

Despite this multitude of powerful novel data analytical methods, this review will place its focus mainly on the essentials for decoding brain states: we will introduce the mathematical concepts of classification and model selection, and explicate how to properly apply them to neurophysiological data. To this end, the paper first reviews the mathematical and algorithmic principles, starting from a rather abstract level and regardless of the scope of application. The subsequent discussion of the practicalities and 'tricks' in the spirit of Orr and Müller (1998) will then link to the analysis of brain imaging data. Unlike existing reviews on this topic (cf. Wolpaw et al. (2002); Kübler and Müller (2007); Dornhege et al. (2007); Haynes and Rees (2006); Pereira et al. (2009)), we will elaborate on common pitfalls and how to avoid them when applying machine learning to brain imaging data. A final note: while the paper introduces the main ideas of algorithms, we do not attempt a full treatment of the available literature. Rather, we present a somewhat biased view, mainly drawing from the authors' work and providing, to the best of our knowledge, links and references to related studies and further reading. We sincerely hope that it will nevertheless be useful to the reader.

The paper outline is the following: after a brief introduction, we discuss machine learning principles in Section 2; Section 3 then presents feature extraction and dimensionality reduction methods, both supervised and unsupervised. The main contributions of this review are a detailed account of how to validate and select models (Section 4), and the elaboration of the practical aspects of model evaluation and the most common pitfalls specific to brain imaging data (Section 5).

2. Learning to classify

Neuroscientific experiments often aim at contrasting specific brain states. Typically, the experimenter chooses a neurophysiological paradigm that maximizes the contrast between the brain states (note that there may be more than two states), while controlling for task-unrelated processes. After recording brain imaging data, the goal of analysis is to find significant differences in the spatial and the temporal characteristics of the data that contrast the different states as accurately as possible. While simple statistics such as grand averages (averaging across trials and subjects) may help for model building, more sophisticated machine learning techniques have become increasingly popular due to their great modeling power. Note that the methods presented in this paper are likewise applicable to multivariate fMRI voxel time series, to single-trial responses in fMRI or EEG, and to brain imaging data from any other spatio-temporal or spectral domain.

Formally, the scenario of discrimination of brain states can be cast into a so-called classification problem, where in a data-driven manner a classifier is computed that partitions a set of observations into subsets with distinct statistical characteristics (see Fig. 1).

Figure 1: Left: Typical work flow for machine learning based classification of brain imaging data. First, brain imaging data are acquired according to the chosen neurophysiological paradigm. Then the data are preprocessed, e.g., by artifact rejection or bandpass filtering. The machine learning based approach comprises the reduction of the dimensionality by extraction of meaningful features and the final classification of the data in the feature space. Right: example of a linear classification of time series in the time-frequency domain. In particular, the linear classifier partitions an appropriately chosen low dimensional feature space.

Note, however, that not only paradigms with known physiological connotation but also novel scenarios can be scrutinized, where (1) a new paradigm can be explored with respect to its neurophysiological signatures, and (2) a hypothesis about the underlying task-relevant brain processes is generated automatically by the learning machine. Feeding this information back to the experimenter may lead to a refinement of the initial paradigm, such that, in principle, a better understanding of the brain processes may follow. In this sense, machine learning may not only be the technology of choice for a generic modeling of a neuroscientific experiment, it can also be of great use in a semi-automatic exploration loop for testing new neurophysiological paradigms.

2.1. Some theoretical background

Let us start with the general notion of the learning problems that we consider in this paper. A classifier is a function which partitions a set of objects into several classes, for instance, recordings of brain activity during either auditory, visual, or cognitive processing into the three distinct modalities (classes).[1] Thus, based on a set of observations, the machine learning task of classification is to find a rule which assigns an observation x to one of several classes. Here, x denotes a vector of N-dimensional neuroimaging data. In the simplest case there are only two different classes. Hence, a classifier can be formalized as a decision function f : R^N -> {-1, +1} that assigns an observation x to one of the classes, denoted by -1 and +1, respectively. Typically, the set of possible decision functions f is constrained by the scientist to a (parameterized) class of functions F, e.g., to the class of linear functions. Note that a linear decision function f corresponds to a separating hyperplane (e.g., see Fig. 1) that is parameterized by its normal vector w and a bias term b. Here, the label y is predicted through

(2.1)  y = f(x; w, b) = sgn(w^T x + b).

[1] The classes could also correspond to complex brain states as in mind reading paradigms (see Haynes and Rees (2006)) or brain states such as attention, workload, emotions, etc.

Then, based on a set of observed input-output relations (x_1, y_1), ..., (x_n, y_n) ∈ R^N × {-1, +1}, learning can be formally described as the task of selecting the parameter values (w, b), and hence the decision function f ∈ F, such that f will correctly classify unseen examples x. Here, the observed data (x_i, y_i) are assumed to be independently and identically distributed (i.i.d.) according to an unknown distribution P(x, y) that reflects the relationship between the objects x and the class labels y, e.g., between the recorded brain activity and the paradigm-specific mental states. However, in order to find the optimal decision function one needs to specify a suitable loss function that evaluates the goodness of the partitioning. One of the most commonly used loss functions for classification is the so-called 0/1-loss (see Smola and Schölkopf (1998) for a discussion of other loss functions)

(2.2)  l(y, f(x)) = 0 if y = f(x), and 1 otherwise.

Given a particular loss function, the best decision function f one can obtain is the one minimizing the expected risk (also often called generalization error)

(2.3)  R[f] = ∫ l(y, f(x)) dP(x, y)

under the unknown distribution P(x, y). As the underlying probability distribution P(x, y) is unknown, the expected risk cannot be minimized directly. Therefore, we have to try to estimate the minimum of (2.3) based on the information available, such as the training sample and properties of the function class F. A straightforward approach is to approximate the risk in (2.3) by the empirical risk, i.e., the averaged loss on the training sample

(2.4)  R_emp[f] = (1/n) Σ_{i=1}^{n} l(y_i, f(x_i)),

and to minimize the empirical risk with respect to f. It is possible to give conditions on the learning machine which ensure that asymptotically (as the number of observations n -> ∞) the minimum of the empirical risk converges towards the minimum of the expected risk. Consequently, with an infinite amount of data the decision function f that minimizes the empirical risk will also minimize the expected risk. However, for small sample sizes this approximation is rather coarse and large deviations are possible. As a consequence, overfitting might occur, where the decision function f learns details of the sample rather than global properties of P(x, y) (see Figs. 2 and 4).

Figure 2: Illustration of the overfitting dilemma: Given only a small sample (left), either the solid or the dashed hypothesis might be true, the dashed one being more complex, but also having a smaller empirical error. Only with a large sample are we able to see which decision reflects the true distribution more closely. If the dashed hypothesis is correct the solid would underfit (middle); if the solid were correct the dashed hypothesis would overfit (right). From Müller et al. (2001).

Under such circumstances, simply minimizing the empirical error (2.4) will in general not yield a small generalization error. One way to avoid the overfitting dilemma is to restrict the complexity of the function f (Vapnik (1995)). The intuition, which will be formalized in the following, is that a "simple" (e.g., linear) function that explains most of the data is preferable over a complex one (Occam's razor, cf. MacKay (2003)). This is typically realized by adding a regularization term (e.g., Kimeldorf and Wahba (1971); Tikhonov and Arsenin (1977); Poggio and Girosi (1990); Cox and O'Sullivan (1990)) to the empirical risk, i.e.,

(2.5)  R_reg[f] = R_emp[f] + λ ||Tf||^2.

Here, an appropriately chosen regularization operator ||Tf||^2 penalizes high complexity or other undesired properties of a function f; λ introduces an additional parameter to the model (often called hyperparameter) that needs to be estimated as well.
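As a small, purely illustrative sketch (not from the paper), the regularized risk (2.5) can be written down directly for a linear decision function (2.1) with the 0/1-loss (2.2) and the simple choice ||Tf||^2 = ||w||^2; the weight λ is exactly the hyperparameter whose selection is discussed next.

```python
import numpy as np

def regularized_risk(w, b, X, y, lam):
    """Empirical 0/1-loss of f(x) = sign(w^T x + b) plus a simple penalty lam * ||w||^2."""
    emp_risk = np.mean(np.sign(X @ w + b) != y)   # eq. (2.4) with the 0/1-loss (2.2)
    return emp_risk + lam * np.dot(w, w)          # eq. (2.5) with ||Tf||^2 = ||w||^2
```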
The estimation of this parameter, and hence taking model complexity into account, raises the problem of model selection (e.g., Akaike (1974); Poggio and Girosi (1990); Moody (1992); Murata et al. (1994)), i.e., how to find the optimal complexity of a function or, accordingly, the appropriate function class (in Section 4 we will discuss the practical aspects of model selection). As a brief remark, Linear Discriminant Analysis (LDA) employs regularization through shrinkage (Stein (1956); Friedman (1989)), while neural networks use early stopping (Amari et al. (1997)), weight decay regularization or asymptotic model selection criteria (e.g., the network information criterion (Murata et al. (1994); Akaike (1974)); see also Bishop (1995); Orr and Müller (1998)). Support vector machines (SVMs), on the other hand, regularize according to the kernel being used (Smola et al. (1998)) and limit their capacity according to Vapnik-Chervonenkis (VC) theory (Vapnik (1995)). We will now briefly outline a few algorithms.

Figure 3: Schematic illustration of model selection. The solid line represents the empirical error, the dashed line the expected error. With higher complexity, the ability of the model to overfit the sample data increases, visible from a low empirical and an increasing expected error. The task of model selection is to determine the model with the smallest generalization error.

2.2. Linear Classification

For computing the parameters of a linear decision function (cf. (2.1) and Fig. 1), namely the normal vector w and the bias b, we will in the following discuss different linear methods: Linear Discriminant Analysis (LDA), including procedures for regularizing LDA, as well as linear programming machines (LPM).

2.2.1. Linear Discriminant Analysis, Fisher's Discriminant and Regularization

In the case of LDA the two classes are assumed to be normally distributed with different means but an identical full rank covariance matrix. Suppose the true means μ_i (i = 1, 2) and the true covariance matrix Σ are known; then the normal vector w of the Bayes optimal separating hyperplane of the LDA classifier is given as

w = Σ^{-1} (μ_1 - μ_2).

In order to compute w for real data, the means and covariances need to be approximated empirically, see Section 2.2.2.

A more general framework is the well-known Fisher's Discriminant Analysis (FDA), which maximizes the so-called Rayleigh coefficient

(2.6)  J(w) = (w^T S_B w) / (w^T S_W w),

where the within-class scatter is S_W = Σ_{i=1,2} S_i with S_i = Σ_{x ∈ C_i} (x - μ_i)(x - μ_i)^T, the class means are defined as μ_i = (1/n_i) Σ_{x ∈ C_i} x, and n_i is the number of patterns in class C_i. The between-class scatter is S_B = (1/2) Σ_{i=1,2} (μ - μ_i)(μ - μ_i)^T, where μ = (1/2) Σ_{i=1,2} μ_i. A solution to (2.6) can be found by solving the corresponding generalized eigenvalue problem (cf. Golub and van Loan (1996)). Considering only two classes, FDA and LDA can be shown to yield the same classifier solution. However, both methods can be extended for the application to multiple classes.

Although it is common wisdom that linear methods such as LDA and FDA are less likely to overfit, we would like to stress that they also require careful regularization, and not only for numerical reasons. Here, regularization is less needed to avoid the typical overfitting problems due to excessive complexity encountered for nonlinear methods (see Fig. 2 and Fig. 4, right). Rather, regularization helps to limit the influence of outliers that can distort linear models (see Fig. 4, center). However, if possible, a removal of outliers prior to learning is to be preferred (e.g., Schölkopf et al. (2001); Harmeling et al. (2006)).

Figure 4: The problem of finding a maximum margin "hyper-plane" on reliable data (left), data with an outlier (middle) and with a mislabeled pattern (right). The solid line shows the resulting decision line, whereas the dashed line marks the margin area. In the middle and on the right the original decision line is plotted with dots. The hard margin implies noise sensitivity, because only one pattern can spoil the whole estimation of the decision line. Figure from Rätsch et al. (2001).

A mathematical programming formulation of regularized Fisher Discriminant analysis (RFDA) as a linearly constrained, convex optimization problem was introduced in Mika et al. (2001) as

(2.7)  min_{w,b,ξ}  (1/2) ||w||_2^2 + (C/n) ||ξ||_2^2
       s.t.  y_i (w^T x_i + b) = 1 - ξ_i,  i = 1, ..., n,
             ξ_i ≥ 0,

where ||w||_2 denotes the ℓ2-norm (||w||_2^2 = w^T w) and C is a model parameter that controls the amount of constraint violations introduced by the slack variables ξ_i. The constraints y_i (w^T x_i + b) = 1 - ξ_i ensure that the class means are projected to the corresponding class labels ±1. Minimizing the length of w maximizes the margin between the projected class means relative to the intra-class variance. Note that (2.7) can be the starting point for further mathematical programming formulations of classifiers, such as the sparse FDA, which uses an ℓ1-norm regularizer ||w||_1 = Σ_n |w_n| (cf. Mika et al. (2003)).
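As a minimal sketch of the LDA solution above (variable names and the fit/predict interface are illustrative, not from the paper), the discriminant direction w = Σ^{-1}(μ_1 - μ_2) can be estimated from labeled data as follows; as discussed next, the plain empirical covariance used here becomes unreliable in high dimensions.

```python
import numpy as np

def lda_train(X, y):
    """Two-class LDA: X is (n_samples, n_features), y contains +1 / -1 labels."""
    mu1, mu2 = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
    Xc = np.vstack([X[y == 1] - mu1, X[y == -1] - mu2])  # class-centered data
    Sigma = Xc.T @ Xc / (len(X) - 2)                      # pooled empirical covariance
    w = np.linalg.solve(Sigma, mu1 - mu2)                 # w = Sigma^{-1} (mu1 - mu2)
    b = -w @ (mu1 + mu2) / 2                              # boundary between projected means
    return w, b

def lda_predict(X, w, b):
    return np.sign(X @ w + b)                             # decision function (2.1)
```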
2.2.2. Linear Discriminant Analysis with Shrinkage

The optimality statement for LDA depends crucially on the never fulfilled assumption that the true class distributions are known. Rather, means and covariance matrices of the distributions have to be estimated from the training data. The standard estimator for a covariance matrix is the empirical covariance, which is unbiased and has good properties under usual conditions. But in the extreme case of high-dimensional data with only few data points, which is typically encountered in neuroimaging, the empirical estimation may become imprecise, because the number of unknown parameters that have to be estimated is quadratic in the number of dimensions. As substantiated in Blankertz et al. (2010), this results in a systematic error: large eigenvalues of the original covariance matrix are estimated too large, and small eigenvalues are estimated too small.

Shrinkage is a common remedy for this systematic bias (Stein (1956)) of the estimated covariance matrices (e.g., Friedman (1989)): the empirical covariance matrix Σ̂ is replaced by

(2.8)  Σ̃(γ) := (1 - γ) Σ̂ + γ ν I

for a tuning parameter γ ∈ [0, 1], with ν defined as the average eigenvalue of Σ̂ and I being the identity matrix. Then the following holds: Σ̃ and Σ̂ have the same eigenvectors; extreme eigenvalues (large or small) are modified (shrunk or elongated) towards the average ν; γ = 0 yields unregularized LDA, and γ = 1 assumes spherical covariance matrices. Using LDA with such a modified covariance matrix is termed regularized LDA or LDA with shrinkage.

For a long time, complex or time-consuming methods have been used to select the shrinkage parameter γ, e.g., by means of cross validation. Recently, a surprisingly simple analytic method to calculate the optimal shrinkage parameter for certain directions of shrinkage was found (Ledoit and Wolf (2004); see also Vidaurre et al. (2009) for the first application to brain imaging data). The optimal value only depends on the sample-to-sample variance of the entries of the empirical covariance matrix (and on the values of Σ̂ itself). Denote by (x_k)_i and (μ̂)_i the i-th element of the vectors x_k and μ̂, respectively, denote by s_ij the element in the i-th row and j-th column of Σ̂, and define

z_ij(k) = ((x_k)_i - (μ̂)_i) ((x_k)_j - (μ̂)_j).

Then the optimal parameter γ for shrinkage towards the identity (as defined by (2.8)) can be calculated as (Schäfer and Strimmer (2005))

γ* = n Σ_{i,j=1}^{d} var_k(z_ij(k)) / [ (n - 1)^2 ( Σ_{i≠j} s_ij^2 + Σ_i (s_ii - ν)^2 ) ].
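A sketch of shrinkage LDA, assuming scikit-learn is available: its LedoitWolf estimator implements the same flavour of analytic shrinkage towards ν·I (the exact formula may differ in detail from the γ* given above), so no cross-validation of γ is needed. The function below is illustrative only.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def shrinkage_lda_train(X, y):
    """LDA with an analytically chosen shrinkage of the pooled covariance."""
    mu1, mu2 = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
    Xc = np.vstack([X[y == 1] - mu1, X[y == -1] - mu2])      # class-centered data
    lw = LedoitWolf(assume_centered=True).fit(Xc)             # analytic gamma (Ledoit-Wolf)
    Sigma_shrunk = lw.covariance_                              # (1 - gamma) Sigma_hat + gamma nu I
    w = np.linalg.solve(Sigma_shrunk, mu1 - mu2)
    b = -w @ (mu1 + mu2) / 2
    return w, b, lw.shrinkage_                                 # shrinkage_ is the chosen gamma
```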
2.2.3. Linear Programming Machines

Finally, we would like to introduce so-called Linear Programming Machines (LPMs, Bennett and Mangasarian (1992); Tibshirani (1994, 1996); Hastie et al. (2001); Müller et al. (2001); Rätsch et al. (2002)). Here, the slack variables ξ corresponding to the estimation error incurred, as well as the parameters w, are optimized to yield a sparse regularized solution

(2.9)  min_{w,b,ξ}  (1/2) ||w||_1 + (C/n) ||ξ||_1
       s.t.  y_i (w^T x_i + b) ≥ 1 - ξ_i,  i = 1, ..., n,
             ξ_i ≥ 0.

LPMs achieve sparse solutions (i.e., most entries of w become zero) by explicitly optimizing the ℓ1-norm in the objective function instead of the ℓ2-norm, which is known to yield non-sparse solutions. Due to this property, LPM and sparse FDA are excellent tools for variable selection. In other words, while solving the classification problem, the user is not only supplied with a good classifier but also with the list of variables that are relevant for solving the classification task (Blankertz et al. (2006a); Müller et al. (2001); Blankertz et al. (2002); Lal et al. (2004); Tomioka and Müller (2010)).

2.3. Beyond linear classifiers

Kernel based learning has taken the step from linear to nonlinear classification in a particularly interesting and efficient[2] manner: a linear algorithm is applied in some appropriate (kernel) feature space. While this idea, first described in Schölkopf et al. (1998), is simple, it is yet very powerful, as all beneficial properties (e.g., optimality) of linear classification are maintained[3], but at the same time the overall classification is nonlinear in input space, since feature space and input space are nonlinearly related. A cartoon of this idea can be found in Fig. 5, where the classification in input space requires some complicated non-linear (multi-parameter) ellipsoid classifier. An appropriate feature space representation, in this case polynomials of second order, supplies a convenient basis in which the problem can most easily be solved by a linear classifier.

[2] By virtue of the so-called 'kernel trick' (Vapnik (1995)).
[3] As we do linear classification in this feature space.

Figure 5: Two dimensional classification example. Using the second order monomials x1^2, sqrt(2) x1 x2 and x2^2 as features, a separation in feature space can be found using a linear hyperplane (right). In input space this construction corresponds to a non-linear ellipsoidal decision boundary (left). From Müller et al. (2001).

However, by virtue of the kernel trick the input space does not need to be explicitly mapped to a high dimensional feature space by means of a non-linear function Φ : x ↦ Φ(x). Instead, kernel based methods take advantage of the fact that most linear methods only require the computation of dot products. Hence, the trick in kernel based learning is to substitute an appropriate kernel function in the input space for the dot products in the feature space, i.e., a function

(2.10)  k : R^N × R^N -> R   such that   k(x, x') = Φ(x)^T Φ(x').

More precisely, kernel-based methods are restricted to the particular class of kernel functions k that correspond to dot products in feature spaces and hence map the input space to the corresponding feature space only implicitly. Commonly used kernels are, for instance, the Gaussian kernel

(2.11)  k(x, x') = exp( -||x - x'||^2 / (2σ^2) ),  σ > 0,

the polynomial kernel

(2.12)  k(x, x') = (x^T x' + c)^d,

and the sigmoid kernel

(2.13)  k(x, x') = tanh( κ (x^T x') + θ ),  κ > 0, θ < 0.

Examples of kernel-based learning machines are, among others, Support Vector Machines (SVMs) (Vapnik (1995); Müller et al. (2001)), Kernel Fisher Discriminant (Mika et al. (2003)) and Kernel Principal Component Analysis (Schölkopf et al. (1998)).
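To make the kernel trick concrete, the following small sketch (illustrative only) evaluates the Gaussian kernel (2.11) on two batches of data; a kernel method then works with this matrix instead of any explicit feature map.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Kernel matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2)), eq. (2.11)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```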
2.4. Support Vector Machines

In order to illustrate the application of the kernel trick, let us consider the example of the SVM (Vapnik (1995); Müller et al. (2001)). Here, the primal optimization problem of a linear SVM is given, similar to (2.7) and (2.9), as

(2.14)  min_{w,b,ξ}  (1/2) ||w||_2^2 + C Σ_{i=1}^{n} ξ_i
        s.t.  y_i (w^T x_i + b) ≥ 1 - ξ_i,  i = 1, ..., n,
              ξ_i ≥ 0.

However, in order to apply the kernel trick we rather use the dual of (2.14), i.e.,

(2.15)  max_α  Σ_{i=1}^{n} α_i - (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j (x_i^T x_j)
        s.t.  C ≥ α_i ≥ 0,  i = 1, ..., n,   and   Σ_{i=1}^{n} α_i y_i = 0.

To construct a nonlinear SVM in the input space, one (implicitly) maps the inputs x to the feature space by a non-linear feature map Φ(x) and computes an optimal hyperplane (with threshold) in feature space. To this end, one substitutes Φ(x_i) for each training example x_i in (2.15). As the x_i only occur in dot products, one can apply the kernel trick and substitute a kernel k for the dot products, i.e., k(x_i, x_j) = Φ(x_i)^T Φ(x_j) (cf. Boser et al. (1992); Guyon et al. (1993)). Hence, the non-linear formulation of the SVM is identical to (2.15), with the exception that the kernel k(x_i, x_j) substitutes for the dot product (x_i^T x_j). Advantageously, many of the α_i will vanish; the samples associated with the non-zero coefficients are referred to as support vectors (SVs). As the weight vector in the primal formulation reads w = Σ_i y_i α_i Φ(x_i), we obtain the nonlinear decision function as

(2.16)  f(x) = sgn( w^T Φ(x) + b )
             = sgn( Σ_{i: α_i ≠ 0} y_i α_i Φ(x_i)^T Φ(x) + b )
             = sgn( Σ_{i: α_i ≠ 0} y_i α_i k(x_i, x) + b ).

Here, the sum in (2.16) is a sparse expansion, as it only runs over the set of SVs. Note that while the α_i are determined from the quadratic optimization problem (2.15)[4], the threshold b can be computed by exploiting the fact that for all support vectors x_i with 0 < α_i < C the slack variable ξ_i is zero, and hence

(2.17)  Σ_{j=1}^{n} y_j α_j k(x_i, x_j) + b = y_i.

[4] A selection of Open Source software for SVMs can be found on www.kernel-machines.org.

If one uses an optimizer that works with the double dual (see, e.g., Vanderbei and Shanno (1997); Boyd and Vandenberghe (2004)), one can also recover the value of the primal variable b directly from the corresponding double dual variable. Note that Braun et al. (2008) have recently observed that the excellent generalization typically obtained when using SVMs in high dimensional applications with few samples is due to their very economic representation in kernel Hilbert space: given the appropriate kernel, only a very low dimensional subspace is task relevant.

2.5. k-nearest neighbors

A further well-known non-linear algorithm for classification is the so-called k-nearest neighbor method. Here, every unseen point x is compared through a distance function dist(x, x_i) to all points x_i (i = 1, ..., n) of the training set. The k minimal distances are computed and the majority over the corresponding labels y_i is taken as the resulting label for x (note that this simple rule also holds for multiclass problems). This strategy provides a very simple local approximation of the conditional density function. The k-nearest neighbor method is known to work well if a reasonable distance (typically the Euclidean one) is available and if the number of data points in the training set is not huge. The hyperparameter k can be selected, e.g., by cross-validation; if the data is noisy, a large k should be chosen.

2.6. Application to brain imaging data

The predominant methods both for fMRI and for EEG/MEG analysis are so far mostly linear; however, nonlinear methods can easily be included in the analysis by adding these model classes to the model selection loop, as we will discuss in detail in a later section. For a brief discussion of linear versus nonlinear methods see, e.g., Müller et al. (2003). In EEG studies mainly LDA, shrinkage/regularized LDA, sparse Fisher discriminant, and linear programming machines are in use (e.g., Dornhege et al. (2007)). Here, it was observed that through proper preprocessing the class conditional distributions become Gaussian with a very similar covariance structure (cf. Blankertz et al. (2002, 2006a); Dornhege et al. (2007)). Under such circumstances, theory suggests that LDA would be the optimal classifier. However, changes in the underlying distribution, the use of multimodal features (e.g., Dornhege et al. (2004)), or the presence of outliers may require to proceed to nonlinear methods (see the discussion in Müller et al. (2003)).
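As a minimal sketch of such a nonlinear method, the snippet below fits a Gaussian-kernel SVM of the kind described in Section 2.4, assuming scikit-learn is available; the data, C and gamma values are purely illustrative and would have to be chosen by model selection.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))                                  # 200 synthetic "trials"
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(200))     # nonlinear labeling rule

clf = SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X, y)
# training accuracy is an optimistic number; see Section 4 for proper validation
print("support vectors per class:", clf.n_support_, "training accuracy:", clf.score(X, y))
```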
The driving insight in fMRI analysis was to go from univariate analysis tools that correlate single voxels with behavior to a full multivariate correlation analysis (e.g., Hansen et al. (1999); Haxby and et al (2001); Strother et al. (2002); Cox and Savoy (2003); Kamitani and Tong (2005); Haynes and Rees (2006); Kriegeskorte et al. (2006); Mitchell et al. (2008)). A main argument for using especially SVMs and LPMs was their well-known benign properties in the case where the number of input dimensions in x is high while the number of samples is low (e.g., Vapnik (1995); Müller et al. (2001); Braun et al. (2008)). This particular very unbalanced situation is an important issue in fMRI, since the number of voxels is of the order ten-thousands while the number of samples rarely exceeds a few hundreds (e.g., Hansen et al. (1999); Strother et al. (2002); Cox and Savoy (2003)). Physiological priors that allow, e.g., to define regions of interest, have led to further specialized analysis tools like the search-light method. Here, the signal to noise ratio is improved by discarding some potentially noisy and task unrelated areas while enhancing the interesting bit of information spatially (e.g., Kriegeskorte et al. (2006)). In some cases improvements through the use of nonlinear kernel functions have been reported, e.g., LaConte et al. (2005) studied polynomial kernels. i: αi ,0 Here, the sum in (2.16) is a sparse expansion as it only runs over the set of SVs. Note, while the αi are determined from quadratic optimization problem 4 (2.15), the threshold b can be computed by exploiting the fact that for all support vectors xi with C ≥ αi ≥ 0, the slack variable ξi is zero, and hence (2.17) n X y j α j · k(xi , x j ) + b = yi . j=1 If one uses an optimizer that works with the double dual (see, e.g., Vanderbei and Shanno (1997); Boyd and Vandenberghe (2004)), one can also recover the value of the primal variable b directly from the corresponding double dual variable. Note that recently Braun et al. (2008) has observed that the excellent generalization that is typically observed when using SVMs in high dimensional applications with few samples is due to its very economic representation in Kernel Hilbert space. Given the appropriate kernel, only a very low dimensional subspace is task relevant. 2.5. k-nearest neighbors A further well-known non-linear algorithm for classification is the so-called k-nearest neighbor method. Here, every unseen point x is compared through a distance function dist(x, xi ) to all points xi (i = 1, . . . , n) of the training set. The k minimal distances are computed and the majority over the corresponding labels yi is taken as a resulting label for x (note that this simple rule also holds for multiclass problems). This strategy provides a very simple local approximation of the conditional density function. The k-nearest neighbor method 4 A selection of Open Source software for SVMs can be found on www.kernel-machines.org. 7 Other priors have been derived from the paradigm. For example, for understanding the neural basis of word representation, Mitchell et al. (2008) used word cooccurances in language inferred from a large scale document corpus, to derive a codebook of fMRI patterns corresponding to brain activity for representing words. These patterns could be superimposed for out of sample forecasting, thus predicting the most probable way a certain novel word is represented in the fMRI pattern. Figure 6: Essential difference between PCA and ICA. 
The left panel shows a mixture of two super-Gaussian signals in the observation coordinates, along with the estimated PCA axes (green) and ICA axes (red). Projecting the observed data to these axes reveals, that PCA did not properly identify the original independent variables (central), while ICA has well identified the original independent data variables (right). 3. Dimensionality Reduction So far, we did not much concern about the domain of the input data and solely assumed the classifiers to be applicable to the data (x, y). In practice and particularly in the domain of computational neuroscience, the input data x often exhibit an adverse ratio between its dimensionality and the number of samples. For example, a typical a common task for EEG-based Brain-Computer Interfaces (BCI) requires the classification of one second intervals of brain activity, recorded at sampling frequencies up to 1 kHz, from possibly 128 electrodes. Hence, the dimensionality of the input data x amounts approximately to 105 , while the number of training samples is typically rather small, up to a few hundred samples. Moreover, the data is contaminated by various sources of interfering noise, while the discriminative information that is the task relevant part of the data is often concentrated in a low dimensional subspace. Consequently, in order to make the classification task feasible the dimensionality of the data needs to be significantly reduced, and informative features have to be extracted (see also Fig. 1). Thus, feature extraction likewise involves spatial, spectral, and temporal preprocessing of the input. Nevertheless, the extraction of relevant features is a highly paradigm specific task and also differs for the various recording techniques due to their temporal and spatial resolutions. Moreover, the extraction of task relevant features should be facilitated by incorporating prior neurophysiological knowledge, e.g., about the cognitive processes underlying the specific paradigm. In return, purely data driven feature extraction methods can potentially provide new findings about the involved cognitive processing and might therefore contribute to the generation of neurophysiological hypotheses (Blankertz et al. (2006a)). Thus, feature extraction has to be considered not just as a data analytical but rather as a heavily interdisciplinary endeavor. The common approaches for dimensionality reduction can be subdivided into two main categories. On the one hand side there are variants of factor models, such as the well-known Principle Component Analysis (PCA), Independent Component Analysis (ICA) (cf. Comon (1994); Bell and Sejnowski (1995); Ziehe and Müller (1998); Hyvärinen et al. (2001)), non-negative matrix factorization, archetypal analysis, sparse PCA (Lee and Seung (2000); Zou et al. (2004); Schölkopf et al. (1998)) or Non-Gaussian Component Analysis (Blanchard et al. (2006)); that perform a factorization of the input data x in a purely unsupervised manner, i.e., without using the class information. The application of these methods serves several purposes (a) dimensionality reduction by projecting onto a few (meaningful) factors, (b) removal of interfering noise from the data to increase the signal-to-noise ratio of the signals of interest, (c) removal of nonstationary effects in data or (d) grouping of effects. Notwithstanding their general applicability, unsupervised factor analysis often requires manual identification of the factors of interest. 
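As an illustration of such unsupervised decompositions, the following sketch (assuming scikit-learn; the mixed signals are synthetic and purely illustrative) contrasts PCA and ICA on a linear mixture of two independent sources, in the spirit of Fig. 6; recall that ICA recovers sources only up to order and scale.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 2000)
S = np.c_[np.sign(np.sin(7 * 2 * np.pi * t)),   # source 1: square wave
          rng.laplace(size=t.size)]             # source 2: super-Gaussian noise
A = np.array([[1.0, 0.6], [0.4, 1.0]])          # mixing matrix
X = S @ A.T                                     # observed mixtures

S_pca = PCA(n_components=2).fit_transform(X)                       # decorrelates only
S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)   # seeks independence
```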
The second class of methods, namely supervised methods, make explicit use of the class labels in order to find a transformation of the input data to a reduced set of features with high task relevance. For example, the Common Spatial Pattern (CSP) (cf. Fukunaga (1972); Koles (1991); Ramoser et al. (2000)) algorithm and derivatives thereof (Blankertz et al. (2008b); Lemm et al. (2005); Dornhege et al. (2006); Tomioka and Müller (2010)) is widely used to extract discriminative oscillatory features. In the following, we will briefly describe two frequently used dimensionality reduction methods. First we will briefly introduce the unsupervised Independent Component Analysis, and subsequently discuss the supervised CSP algorithm. 3.1. Independent component analysis Under the often valid assumption that the electric fields of different bioelectric current sources superim8 pose linearly, the measured neurophysiological data can be modeled as a linear combination of component vectors. For instance, in case of independent component analysis (ICA) it is assumed that the observed signals x(t) are a linear mixture of M ≤ N mutually independent sources s(t), i.e., data are modeled as a linear combination of component vectors: (3.18) Figure 7: Essential steps of CSP: The blue and green ellipsoids refer to the two class conditional covariance matrices along with the principal axes, while the mutual covariance matrix is depicted in gray. Left: original data. Central: data distribution after whitening. Right: after a final rotation the variance along the horizontal direction is maximal for the green class, while it is minimal for the blue class and vice versa along the vertical direction. x(t) = As(t), where A ∈ RN×M denotes the linear mixing matrix. In this case (3.18) is invertible and ICA decomposes the observed data x(t) into independent components y(t) by estimating the inverse decomposition matrix W  A−1 , such that y(t) = W x(t). There exists a vast number of ICA algorithms that can solve the task of estimating the mixing matrix A. They only differ in the particular choice of a so-called index funtion and the respective numerical procedures to optimize this function. In general, the index function employs a statistical property that takes on its extreme values if the projected sources are independent. Most research conducted in the field of ICA uses higher-order statistics for the estimation of the independent components (Comon (1994); Hyvärinen et al. (2001)). For instance, the Jade algorithm (Cardoso and Souloumiac (1993)) is based on the joint diagonalization of matrices obtained from ”parallel slices” of the 4th -order cumulant tensor. Although this algorithm performs very efficiently on low dimensional data, it becomes computational infeasible for high dimensional problems, as the effort for storing and processing the 4th -order cumulants is O(m4 ) in the number of sources. As a remedy for this problem Hyvärinen and Oja (1997) developed a general fix-point iteration algorithm termed FastICA, that optimizes a contrast function measuring the distance of the source probability distributions from a Gaussian distribution. Note that ICA can recover the original sources s(t) only up to scaling and permutation. Fig. 6 sketches the essential difference between ICA and the well known PCA method. directions (spatial filters) that maximize the signal variance for one class, while simultaneously minimizing the signal variance for the opposite class. 
To be more precise, let Σ1 and Σ2 denote the two class conditional signal covariance matrices. The spatial CSP filters w are obtained as the generalized Eigenvectors of the following system (3.19) Σ1 w = λΣ2 w. A solution to (3.19) is typically derived in two steps: first the data are whitened with respect to the mutual covariance matric Σ1 + Σ2 ; secondly a terminal rotation aligns the principal axes with the coordinate axes (see Fig. 7). However, the interpretation of filter matrix W is two-fold, the rows of W are the spatial filters, whereas the columns of W −1 can be seen as the common spatial patterns, i.e., the time-invariant coupling coefficients of each source with the different sensors. For a detailed discussion on the relation between spatial patterns and spatial filters see (Blankertz et al. (2010)). Originally the CSP algorithm was conceived for discriminating between two classes, but has also been extended to multi-class problems (Dornhege et al. (2004); Grosse-Wentrup and Buss (2008)). Further extension of CSP were proposed in Lemm et al. (2005); Dornhege et al. (2006); Tomioka and Müller (2010); Farquhar (2009); Li and Guan (2006) with the goal of simultaneously optimizing discriminative spatial and spectral filters. For a comprehensive overview of optimizing spatial filters we refer to Blankertz et al. (2008b). 3.2. Common spatial pattern Unlike unsupervised methods such as PCA and ICA, the common spatial pattern (CSP) algorithm (Fukunaga (1972)) makes explicit use of the label information in order to calculate discriminative spatial filters that emphasize differences in signal power of oscillatory processes (Koles and Soong (1998)). To illustrate the basic idea of CSP: suppose we observe two class distributions in a high-dimensional space, the CSP algorithm finds 4. Cross-validation and Model selection Given the data sample, the task of model selection is to choose a statistical model from a set of potential models (the function class), which may have produced the data with maximum likelihood, i.e., to choose the model which resembles the underlying functional relationship 9 and correspondingly f (x|D) denotes the model prediction at the instance x. Suppose further that the error of the model at an instance (x, y) from the sample space is measured by a given loss function err = l(y, f (x|D)), e.g., by the mean squared error, or the 0/1-loss as it is commonly used for classification. Based on this notation, we will introduce the prevalent concepts for assessing the generalization error of a statistical model given a sample set. All of the following concepts have in common that they are based on a holdout strategy. A holdout strategy generally splits the sample set in two independent, complementary subsets. One subset, commonly referred to as training set, is solely used for fitting the model, i.e., to estimate the model parameters, such as, the normal vector of the separating hyperplane of an SVM. In contrast, the second subset is exclusively used for the purpose of validating the estimated model on an independent data set and is therefore termed validation set. Formally, let Dv ⊂ D denote the holdout validation set of size nv , and define Dt = D \ Dv as the complementary sample set for training. The estimated validation error is defined as 1 X l(yi , f (xi |Dt )). (4.20) errv = nv i∈D best. The set of potential models is in general not restricted, although in practice limitations are imposed by the preselection of a model class by the scientists. 
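Returning to the CSP computation described in Section 3.2, a compact sketch of the two-class case is given below (assuming SciPy; shapes and names are illustrative, and full-rank covariances are assumed). Equivalently to (3.19), the filters are computed against the composite covariance Σ1 + Σ2, which combines the whitening and rotation steps into one generalized eigendecomposition.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(X1, X2, n_filters=6):
    """X1, X2: band-passed trials of shape (trials, channels, samples) for each class."""
    cov = lambda X: np.mean([x @ x.T / x.shape[1] for x in X], axis=0)
    S1, S2 = cov(X1), cov(X2)                       # class-conditional spatial covariances
    evals, evecs = eigh(S1, S1 + S2)                # S1 w = lambda (S1 + S2) w
    order = np.argsort(evals)                       # small eigenvalues favor class 2, large class 1
    pick = np.r_[order[:n_filters // 2], order[-n_filters // 2:]]
    return evecs[:, pick]                           # columns are the spatial filters w
```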
A typical setting may, for example, comprise models from distinct classes, such as linear and non-linear models; but may also solely consist of non-linear models that differ in the employed SV-kernel function which needs to be selected. On the other hand, model selection is also frequently used to determine the optimal value of model hyperparameters5 . However, in all these cases the general task remains the same: the expected out-ofsample performances of the different models need to be evaluated on the common basis of the given data sample. As discussed previously, complex models can potentially better adapt to details of the data. On the other hand an excessively complex model will tend to overfit, i.e., will rather fit to the noise than to the underlying functional relationship. Hence, its out-of-sample performance will deteriorate (see Fig. 2 for an illustration of overfitting). However, overfitting not only occurs when a model has too many degrees of freedom, in relation to the amount of data available. Also simple models tend to overfit, if the influence of outliers is not treated appropriately (see Fig. 4). However, in any case the training error does not provide an unbiased estimate of the model performance and hence can not be used to select the best model. At this point we would like to stress, that an unbiased estimation of the model performance is one of the most fundamental issues in statistical data analysis, as it provides the answer to: How will the model generalize on new, previously unseen data and hence how accurately it will perform in practice. In terms of machine learning, this quantity is called the generalization error or expected risk of a model (cf. (2.3)), which applies likewise to a regression or classification model. So model selection in essence reduces to reliably estimating the ability of a model to generalize well to new unseen data and to pick the model with the smallest expected error (see Fig. 3). v Note, the learned model as well as the accuracy of the estimated generalization error will depend on the particularly chosen partition of the original sample into training and validation set and especially on the size of these sets. For example, the more instances we leave for validation, the less samples remain for training and hence the model becomes less accurate. Consequently, a bias is introduced to the estimated model error. On the contrary, using fewer instances for validating the model will increase the variance of the estimated model error. 4.1.1. Cross-validation To trade off between bias and variance several approaches have been proposed. On the one hand side, there is a multitude of cross-validation (CV) schemes, where the process of splitting the sample in two is repeated several times using different partitions of the sample data. Subsequently, the resulting validation errors are averaged across the multiple rounds of CV. The miscellaneous CV schemes differ by the way they split the sample data in two. The most widely used method is K-fold CV. Here, the sample data is randomly divided into K disjoint subsets D1 , . . . , DK of approximately equal size. The model is then trained K times, using all of the data subsamples except for one, which is left out as validation set. In particular, in the k-th run Dk 4.1. Estimation of the Generalization Error In order to present a universally applicable conceptual framework of estimating the generalization error, let D = {(x1 , y1 ), . . . 
, (xn , yn )} represent our original sample set of n labeled instances. Moreover, let f (·|D) be the model that has been learned on the sample set 5 This might simultaneously be the regularization strength C of a SVM and kernel width σ of say a Gaussian kernel. 10 the original sample D. The bootstrap sample b is then used for training, while the set left out that is D \ b serves as validation set. Using a sampling procedure with replacement, the expected size of the validation set is n(1 − 1n )n ≈ 0.368n. Correspondingly, the training set, which is of size n, has ≈ 0.632n unique observations which leads to an overestimation of the prediction error. The .632 bootstrap estimator Efron and Tibshirani (1993) corrects for this, by adding the underestimated resubstitution error, is selected as validation set, while the union of the remaining K −1 subsamples, i.e., D\Dk serves as training set. The K-fold CV-error is then the averaged error of the K estimated models, where each model is evaluated separately on its corresponding validation set K (4.21) errCV = 1XX l (yi , f (xi |D \ Dk )) . n k=1 i∈D k Note, that the cross-validation error is still a random number that depends on the particular partition of the data into the K folds. Therefore, it would be highly desirable to perform a complete K-fold CV to estimate its mean and variance by averaging across all possible partitions of the data into K folds. In practice this is computationally intractable even for small samples. Nevertheless, repeating the K-fold cross-validation several times can additionally reduce the variance of the estimator at lower computational costs. An alternative cross-validation scheme is the socalled leave-one-out cross-validation (LOO-CV). As the name already indicates, LOO-CV uses all but a single data point of the original sample for training the model. The estimated model is then validated on the single observation left out. This procedure is repeated, until each data point once served as validation set. In particular, in the i-th run the validation set corresponds to Di = (xi , yi ), while the model is trained on the complement D \ Di . The LOO-CV estimator is defined as the averaged error errboot = (4.23) X 1X 0.632 · l (yi , f (xi |b )) B b i<b +0.368 · n X l (yi , f (xi |D)) . i=1 However, the standard bootstrap estimate is an upwardly biased estimator of the model accuracy. In particular, it can become to overly optimistic for excessively complex models that are capable to highly overfit the data. In the context of a feature subset selection experiment for regression, a comprehensive comparison of different schemes for estimating the generalization error has been conducted in Breiman and Spector (1992). Among others schemes, they compared leave-one-out cross-validation, K-fold cross-validation for various K, stratified version of cross-validation and bias corrected bootstrap on artificially generated data. For the task of model selection they concluded ten-fold crossvalidation to outperform the other methods. n (4.22) errLOO = 1X l (yi , f (xi |D \ Di )) . n i=1 5. Practical Hints for Model Evaluation Note that LOO-CV actually performs a complete n-fold CV. However, since the model has to be trained n times, LOO-CV is computational demanding. On the other hand, it leads to an almost unbiased estimate of the generalization error, but at the expense of an increased variance of the estimator. 
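A minimal K-fold cross-validation loop corresponding to the estimator in (4.21) can be written explicitly; the sketch below assumes a scikit-learn-style classifier with fit/predict and uses the 0/1-loss, with all names being illustrative.

```python
import numpy as np
from sklearn.base import clone

def k_fold_cv_error(estimator, X, y, K=10, seed=0):
    """Estimate the misclassification rate by K-fold cross-validation (cf. eq. 4.21)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]
        trn = np.hstack([folds[j] for j in range(K) if j != k])
        model = clone(estimator).fit(X[trn], y[trn])              # fit only on training folds
        errors.append(np.mean(model.predict(X[val]) != y[val]))   # 0/1-loss on the held-out fold
    return float(np.mean(errors))
```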
In general, cross-validation schemes provide a nearly unbiased estimate of the generalization error, at the cost of significant variability, particularly for discontinuous loss functions Efron and Tibshirani (1997). In order to achieve a good compromise between bias and variance the use of 10-fold or 5-fold CV are often recommended. 5.1. Models with Hyperparameters Many preprocessing and classification methods have one or more hyperparameters that need to be adjusted to the data by means of model selection. Examples are the kernel width of an Gaussian kernel, the regularization parameter λ of the regularized LDA, but also the number of neighbors in a k-nearest neighbor approach, or the number of principal components to be selected. According to the previously introduced general concept of model selection, those hyperparameters have to be selected by means of an unbiased estimate of the generalization performance, i.e., have to be evaluated on a validation set that is independent of data used for training. Since the selection of the hyperparameter is an integral part of the model, the model selection scheme and therefore the estimation of the validation error becomes part of the model itself. Consequently, the cross-validation 4.1.2. Bootstrapping In case of discontinuous error functions bootstrap methods Efron and Tibshirani (1993) may smooth over possible discontinuity. A bootstrap sample b is created by sampling n instances with replacement from 11 Here, Dk denotes the kth validation set of the outer CVloop, while we use the short hand notation D\k for the corresponding outer loop training set D \ Dk . However, the main difference in the above equation compared to the ordinary K-fold CV is that the model fCV (·|D\k ) refers to the model that has been selected via the inner K-fold CV over the data set D\k , i.e., (5.25) K X  X    l yi , f xi |D\k \ D\kl , fCV ·|D\k := argmin error that has been used for adjusting the hyperparameter of a model becomes a biased estimate of the overall model performance, as it has been minimized by the model selection. Consequently, to estimate the generalization error of the entire model (including the hyperparameter selection) another independent data set is required. To emphasize the severeness of this issue, let us consider the following illustrative example. Given a fixed data set the classification performance of linear discriminant analysis (without hyperparameters) needs to be compared with the performance of an SVM with a Gaussian kernel. Cross-validation is performed for LDA and also for an SVM using a Gaussian kernel for various combinations of the hyperparameters, i.e., the kernel width and the regularization parameter. One can expect the SVM to yield better results than LDA for some combinations, while it will perform worse for others. Nevertheless, it cannot be concluded that the SVM with the ‘optimal selection’ of its hyperparameters outperforms the LDA solution. Obviously, as it has been minimized by the model selection procedure, the CV-error of the selected SVM is to optimistic about the model performance. Note that the CV-error is a random number that exhibits a certain degree of variability. Accordingly, repeated sampling from the distribution of the CV error will favor models with a larger set of parameter combinations (here, the SVM model). In particular, the smaller CV error for some parameter combinations of the SVM could just be a lucky coincidence induced by repeatedly evaluating the SVM model. 
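To make this issue concrete: a typical hyperparameter search evaluates many SVM configurations by an (inner) cross-validation and keeps the best one; the score of the selected configuration is then an optimistic estimate, which is why the nested scheme described next is needed. A sketch assuming scikit-learn, with illustrative synthetic data and grid values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv).fit(X_train, y_train)

# best_score_ was maximized over the whole grid, so it is upwardly biased as a
# performance estimate; report performance only on data never touched by the search.
print(search.best_params_, search.best_score_)
```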
The severeness of biasing the results depends on several factors, e.g., how much complexity can be controlled by the parameters, dimensionality of the features, and the methods of selecting the parameters. f ∈F with F denoting the set of models corresponding to different value of the hyperparameter and D\k l denoting the lth validation set of the inner CV loop. In order to distinguish more easily between the different layers of crossvalidation, one often refers to the holdout set Dk of the outer loop as test sets. Note, in each round of the outer CV loop the model fCV (·|D\k ) and hence the hyperparameter that is selected can be different. Consequently, nested CV will provide a probability distribution how often a hyperparameter had been selected by the model selection scheme rather than a particular value. On the other hand nested CV gives an unbiased estimate of the generalization performance of the complete model (including selection of the hyperparameters). 5.2. Cross-Validation for Dependent Data Another practical issue when estimating the generalization error of a model is the validity of the assumption about the independence of the training and the validation data. Obviously, if all instances are distributed independently and identically then arbitrary splits into training and validation set will yield stochastically independent samples and the prerequisite is trivially fulfilled. However, often the experimental design induces dependencies between samples. In such a case, special care is required when splitting the sample in order to ensure the aforementioned independence assumption. For example, a frequently used paradigm in human brain research divides the course of an experiment into blocks of different experimental conditions. We say that an experiment has a block design, if each block comprises several single-trials all belonging to the same condition, see Fig. 8-a. In such a setting, special care has to be taken in validating the classification of single-trials according to the experimental conditions. Typically, samples within such a block are likely to be stochastically dependent, while stochastic independence can be generally assumed for samples belonging to different blocks. To Nested Cross-Validation: In order to obtain an unbiased estimate of the generalization performance of the complete model (including selection of the hyperparameter), an additional data set is required which is independent from both, the training and the validation data. To this end, a nested cross-validation scheme is most appropriate. Algorithmically, it can be described as two nested loops of cross-validation. In the inner loop, the hyperparameter as part of the model has to be selected according to inner CV error, while in the outer loop, the selected models are evaluated with respect to their generalization ability on an independent validation set. The outer loop CV-error is similar to (4.21) K (5.24) errCV = l=1 i∈D\k l  1XX l yi , fCV xi |D\k . n k=1 i∈D k 12 (a) block #1 block #2 block #3 block #4 block #5 block #6 block #7 cond #1 cond #2 cond #1 cond #2 cond #1 cond #2 cond #1 Cov((xi , yi ), (xi+ j , yi+ j )) (measured in a suitable norm) are of negligible magnitude for all | j| > h. That is, samples that are further apart than h samples are assumed to be uncorrelated. A so-called h-block cross-validation scheme is a modification of the LOO-CV scheme. 
Similar to LOO-CV it uses each instance (xi , yi ) once as validation set, but unlike LOO-CV it leaves an h-block (of size 2h + 1) of the neighboring h samples from each side of the ith sample out of the training set. Following the notation in Racine (2000), we denote the h-block by (x(i:h) , y(i:h) ), thus the training set of the ith iteration corresponds to D(−i:h) := D \ (x(i:h) , y(i:h) ). The generalization error of the model is consequently obtained by the cross validation function (b) Training set Test set Figure 8: Schema of a block design experiment. (a) Alternation between the two conditions that are to be discriminated is on a longer time scale, compared to shorter segments for which classification is evaluated. This setting requires special treatment for validation. (b) Applying cross validation on the epochs violates the assumption that samples in the training set are independent from the ones in the test set. Due to slowly changing nonstationarities in the brain signals, a trial in the test set is very likely to be classified correctly, if trials from the same block are in the training set. n (5.26) illustrate this, let us consider a neuroimaging recording. Here, typically many slowly varying processes of background activity exist, and hence neighboring trials within a single block are dependent with respect to the information they share about the state of these slow cortical processes. Consequently, applying a simple holdout strategies as they are conventionally employed by generic LOO or K-fold CV, to block design data will most likely violate the assumption of independence between the training and validation data, see Fig. 8-b. In order to demonstrate the possible effect of applying a standard CV scheme to block-wise data, we will conduct an illustrative experiment to demonstrate the failure of a generic CV, that severely underestimates the generalization performance. Moreover, we show that the block CV scheme introduced below yields an accurate estimate of the generalization error. However, before we present this example, we will formally introduce a cross-validation scheme that is tailored to the particualar needs of block-wise data. errh = 1X l(yi , f (xi |D(−i:h) )). n i=1 Although the latter policy resolves the problem of dependencies, it still has one major drawback. As it was shown in Shao (1993) for the particular choice of the mean squared error as loss function, due to the small size of the validation sets the h-block cross-validation is inconsistent for the important model class of linear functions. This means, even asymptotically (i.e., for n → ∞) h-block cross-validation does not reliably select the true model f ∗ from a class of linear models. As worked out in Racine (2000), a way out of this pitfall is a technique called hv-cross validation. Heuristically spoken, hv-cross validation enlarges the validation sets sufficiently. To this end, for a “sufficiently large” v each validation set is expanded by v additional observations from either side, yielding Di = (xi:v , yi:v ). Hence, each validation set Di is of size nv = 2v + 1. Moreover, in order to take care of the dependencies, h observations on either side of (x(i:v) , y(i:v) ) are additionally removed to form the training data. Hence, the training data of the ith iteration is D(−i:(h+v)) := D \ (x(i:(h+v)) , y(i:(h+v)) ). Now, the cross-validation function (5.27) n−v X X 1 errhv = ky j − f (x j |D(−i:(h+v)) k2 nv (n − 2v) i=v j∈D 5.2.1. 
5.3. Caveats in Applying Cross-Validation

Preprocessing the data prior to the application of cross-validation also requires particular care, in order to avoid biasing the estimate of the generalization error of the model. Here, in order to adhere to the independence assumption of training and validation set, any parameters of the preprocessing need to be estimated solely on the training set and not on the test set. This holds for the estimation of principal and independent component analysis, but likewise for simpler preprocessing strategies such as normalization of the data. If cross-validation is used, the estimation of the preprocessing parameters has to be performed within the cross-validation loop on each training set separately and afterwards transferred to the corresponding test and validation data, respectively. In order to achieve a stringently sound validation, it is for example inappropriate to perform ICA on the entire data, select desirable components, and use features extracted from those components as input to a classifier whose performance is then evaluated by cross-validation. Although the bias induced by unsupervised preprocessing techniques will usually not be as severe, it can result in improper model selection and overoptimistic results. In contrast, strong overfitting may occur if a preprocessing method which uses class label information is applied to the whole data set.

5.4. Model Evaluation Allowing for Feature Selection

Feature selection is widely used in order to decrease the dimensionality of the feature vectors and thus to facilitate classification. Typically, feature selection methods are supervised, i.e., they exploit the label information of the data. A simple strategy is, e.g., to select those features that have a large separability index such as the Fisher score (Müller et al. (2004)). A more sophisticated strategy is to use classifiers for feature selection, as, e.g., in Müller et al. (2004); Lal et al. (2004); Blankertz et al. (2006a); Tomioka and Müller (2010). For the same reasons as laid out in Section 5.3, it is vital for a sound validation that such feature selection methods are not performed on the whole data. Again, cross-validation has to be used to evaluate the entire model including the feature selection scheme, rather than just cross-validating the final classification step. More specifically, feature selection has to be performed within the cross-validation loop on each training set.
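As a minimal illustration (again not taken from the original text), the following scikit-learn sketch keeps normalization and supervised feature selection inside the cross-validation loop by wrapping them in a pipeline. The synthetic data, the univariate F-score criterion (standing in for a Fisher-score-type separability index), the number of selected features and the LDA classifier are placeholder choices.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Placeholder data: 200 trials with 500 features, e.g. voxels or channel-time bins.
rng = np.random.RandomState(0)
X = rng.randn(200, 500)
y = rng.randint(0, 2, 200)

# Normalization and (supervised) feature selection are part of the model: the pipeline
# guarantees that their parameters are estimated on each training fold only and then
# merely applied to the corresponding test fold.
clf = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("lda", LinearDiscriminantAnalysis()),
])

scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))
print("cross-validated accuracy: %.2f" % scores.mean())
```

The incorrect variant would fit the scaler and the feature selector once on the full data set and cross-validate only the final classifier on the preselected features, thereby leaking label information from the test folds into the model.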
5.5. Model Evaluation Allowing for Outlier Rejection

It requires a prudent approach to fairly evaluate the performance of models which employ outlier rejection schemes. While the rejection of outliers from the training data set is unproblematic, their exclusion from the test set is not. Here, two issues have to be considered. (1) The rejection criterion must not use the labels of the test data, and all parameters (such as thresholds) have to be estimated on the training data. (2) For evaluation, a metric has to be used that takes into account the rejection of trials. Obviously, the rejection of test samples in a classification task means a reduced amount of information compared to a method that obtains the same classification accuracy on all test samples. See Ferrez and Millán (2005) for an application of a performance measure based on Shannon's information theory in the context of samples rejected based on the detection of an error-related brain response. An outlier rejection method may use label information (e.g., when class-conditional Mahalanobis distances are used), but only training data may be used to determine the parameters of the method, as in the estimation of the covariance matrix for the Mahalanobis distance.

5.6. Loss Functions Allowing for Unbalanced Classes

The classification performance is always evaluated by some loss function, see Section 4.1. Typical examples are the 0-1-loss (i.e., the average number of misclassified samples) and the area under the receiver operating characteristic (ROC) curve (Fawcett (2006)). When using the misclassification rate, it must be assured that the classes have approximately the same number of samples. Otherwise, a performance measure that respects the sizes of the classes (or the priors one has about their frequency in the application case) has to be employed. In oddball paradigms, for example, the task is to discriminate brain responses to an attended rare stimulus from responses to a frequent stimulus; a typical ratio of frequent-to-rare stimuli is 85:15. In such a setting, an uninformative classifier which always predicts the majority class would obtain an accuracy of 85%. Accordingly, a different loss function needs to be employed. Denoting the number of samples in class i by n_i, the normalized error can be calculated as a weighted average, where errors committed on samples of class i are weighted by N/n_i with N = Σ_k n_k.

5.7. Regarding Nonstationarities

It is worth noting that any cross-validation scheme implicitly relies on the assumption that the samples are identically distributed. In the context of neurophysiological data this translates into an assumption of stationarity. Unfortunately, nonstationarities are ubiquitous in neuroimaging data (e.g. Shenoy et al. (2006)). Accordingly, the characteristics of brain signals and, in particular, the feature distributions often change slowly with time. Therefore, a model fitted to data from the beginning of an experiment may not generalize well to data towards the end of the experiment. This detrimental effect is obscured when estimating the generalization performance by cross-validation, since training samples are drawn from the full time course of the experiment. Accordingly, the classifier is, so to speak, prepared for the nonstationarities that occur within the data. Therefore, it is advisable to compare the results of cross-validation with a so-called chronological validation, in which the (chronologically) first half of the data is used for training and the second half for testing. If the data contain nonstationarities and the classifier fails to be invariant against them, the chronological validation will predict a substantially worse performance than cross-validation. This situation indicates that the method will suffer from nonstationarity during online operation. In general, one can address nonstationarity by (a) constructing invariant features (e.g. Blankertz et al. (2008a)), (b) tracking nonstationarity (e.g. Schlögl et al. (2009); Vidaurre and Blankertz (2010); Vidaurre et al. (2010)), (c) modeling nonstationarity and adapting CV schemes (Sugiyama et al. (2007)), or (d) projecting to stationary subspaces (von Bünau et al. (2009)).
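A simple check of this kind can be scripted in a few lines. The sketch below (an illustrative scikit-learn example with synthetic data and an arbitrary classifier, not the authors' implementation) reports the standard cross-validation estimate next to a chronological split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data: 300 trials in chronological order, 30 features each.
rng = np.random.RandomState(0)
X = rng.randn(300, 30)
y = rng.randint(0, 2, 300)

clf = LogisticRegression(max_iter=1000)

# Standard cross-validation: training samples cover the whole time course,
# so slow drifts are partly "seen" during training.
cv_acc = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=1)).mean()

# Chronological validation: train on the first half, test on the second half.
half = len(y) // 2
chrono_acc = clf.fit(X[:half], y[:half]).score(X[half:], y[half:])

print("cross-validation: %.2f, chronological split: %.2f" % (cv_acc, chrono_acc))
# A markedly lower chronological score hints at nonstationarities the model
# does not cope with, and thus at degraded performance during online operation.
```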
6. Conclusion

Decoding brain states is a difficult data-analytic endeavor, e.g., due to the unfavorable signal-to-noise ratio, the vast dimensionality of the data, and the high trial-to-trial variability. In the past, machine learning and pattern recognition have provided significant contributions to alleviate these issues and thus have had their share in many of the recent exciting developments in the neurosciences. In this work, we have introduced some of the most common algorithmic concepts, first from a theoretical viewpoint and then from a practical neuroscience data analyst's perspective. Our main original contribution in this review is a clear account of the typical pitfalls, see Table 1. Due to space constraints, the level of mathematical sophistication and the number of algorithms described are limited. However, a detailed account was given of the problems of hyperparameter choice and model selection, where a proper cross-validation procedure is essential for obtaining realistic results that maintain their validity out-of-sample. Moreover, it should be emphasized that the proper inclusion of physiological a priori knowledge is helpful, as it can provide the learning machines with representations that are more useful for prediction than the raw data itself. We consider such a close interdisciplinary interaction between paradigm and computational model as essential.

Table 1: Hall of Pitfalls. An (incomplete) list of the most prominent sources of error that one needs to take into consideration when applying machine learning methods to brain imaging data.
- Preprocessing the data based on global statistics of the entire data (e.g., normalization using the global mean and variance): Section 5.3
- Global rejection of artifacts or outliers prior to the analysis (resulting in a simplified test set): Section 5.5
- Global extraction or selection of features (illegitimate use of information about the test data): Section 5.4
- Simultaneously selecting model parameters and estimating the model performance by cross-validation on the same data (yielding a too optimistic estimate of the generalization error): Section 5.1
- Insufficient model evaluation for paradigms with block design: Section 5.2
- Neglecting unbalanced class frequencies: Section 5.6
- Disregarding effects of nonstationarity: Section 5.7

We would like to end by remarking that unforeseen progress in algorithmic development has been spurred by the availability of high-quality data with a clear task description. This has allowed a number of interested non-(computational) neuroscientists to study experimental data even without access to expensive measurement technology (Sajda et al. (2003); Blankertz et al. (2004, 2006b)). The authors express their hope that this development will extend also beyond EEG data.
Acknowledgement

We would like to thank our former co-authors for allowing us to use prior published material in this review. The studies were partly supported by the Bundesministerium für Bildung und Forschung (BMBF), Fkz 01IB001A/B and 01GQ0850, by the German Science Foundation (DFG, contract MU 987/3-2), and by the European Union under the PASCAL2 Network of Excellence, ICT-216886. This publication only reflects the authors' views. Funding agencies are not liable for any use that may be made of the information contained herein.

Author contributions

SL and KRM produced the draft for the part on the machine learning methods; SL worked it out in detail and particularly elaborated on the mathematical foundations of model selection and cross-validation. BB contributed the part on pitfalls in model evaluation. TD contributed the part on cross-validation for dependent data. All authors read and approved the manuscript.

References

Akaike, H., 1974. A new look at the statistical model identification. IEEE Trans. Automat. Control 19 (6), 716–723.
Amari, S., Murata, N., Müller, K.-R., Finke, M., Yang, H., 1997. Asymptotic statistical theory of overtraining and cross-validation. IEEE Trans Neural Netw 8 (5), 985–996.
Bell, A. J., Sejnowski, T. J., 1995. An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7 (6), 1129–1159.
Bennett, K., Mangasarian, O., 1992. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software 1, 23–34.
Bießmann, F., Meinecke, F. C., Gretton, A., Rauch, A., Rainer, G., Logothetis, N., Müller, K.-R., 2009. Temporal kernel canonical correlation analysis and its application in multimodal neuronal data analysis. Machine Learning 79 (1-2), 5–27.
Birbaumer, N., Mar 2006. Brain-computer-interface research: coming of age. Clin Neurophysiol 117, 479–483.
Bishop, C., 1995. Neural Networks for Pattern Recognition. Oxford University Press.
Blanchard, G., Sugiyama, M., Kawanabe, M., Spokoiny, V., Müller, K.-R., 2006. In search of non-Gaussian components of a high-dimensional distribution. Journal of Machine Learning Research 7, 247–282.
Blankertz, B., Curio, G., Müller, K.-R., 2002. Classifying single trial EEG: Towards brain computer interfacing. In: Dietterich, T. G., Becker, S., Ghahramani, Z. (Eds.), Advances in Neural Inf. Proc. Systems (NIPS 01). Vol. 14. pp. 157–164.
Blankertz, B., Dornhege, G., Krauledat, M., Müller, K.-R., Curio, G., 2007. The non-invasive Berlin Brain-Computer Interface: Fast acquisition of effective performance in untrained subjects. Neuroimage 37 (2), 539–550.
Blankertz, B., Dornhege, G., Lemm, S., Krauledat, M., Curio, G., Müller, K.-R., 2006a. The Berlin Brain-Computer Interface: Machine learning based detection of user specific brain states. J Universal Computer Sci 12 (6), 581–607.
Blankertz, B., Kawanabe, M., Tomioka, R., Hohlefeld, F., Nikulin, V., Müller, K.-R., 2008a. Invariant common spatial patterns: Alleviating nonstationarities in brain-computer interfacing. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (Eds.), Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, pp. 113–120.
Blankertz, B., Lemm, S., Treder, M. S., Haufe, S., Müller, K.-R., 2010. Single-trial analysis and classification of ERP components – a tutorial. Neuroimage, in press.
Blankertz, B., Müller, K.-R., Curio, G., Vaughan, T. M., Schalk, G., Wolpaw, J. R., Schlögl, A., Neuper, C., Pfurtscheller, G., Hinterberger, T., Schröder, M., Birbaumer, N., 2004. The BCI competition 2003: Progress and perspectives in detection and discrimination of EEG single trials. IEEE Trans Biomed Eng 51 (6), 1044–1051.
Blankertz, B., Müller, K.-R., Krusienski, D., Schalk, G., Wolpaw, J. R., Schlögl, A., Pfurtscheller, G., del R. Millán, J., Schröder, M., Birbaumer, N., 2006b. The BCI competition III: Validating alternative approaches to actual BCI problems. IEEE Trans Neural Syst Rehabil Eng 14 (2), 153–159.
Blankertz, B., Tomioka, R., Lemm, S., Kawanabe, M., Müller, K.-R., Jan. 2008b. Optimizing spatial filters for robust EEG single-trial analysis. IEEE Signal Process Mag 25 (1), 41–56.
Boser, B., Guyon, I., Vapnik, V., 1992. A training algorithm for optimal margin classifiers. In: Haussler, D. (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory. pp. 144–152.
Boyd, S., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press, Cambridge, UK.
Braun, M. L., Buhmann, J., Müller, K.-R., Aug 2008. On relevant dimensions in kernel feature spaces. Journal of Machine Learning Research 9, 1875–1908.
Breiman, L., Spector, P., 1992. Submodel selection and evaluation in regression: The x-random case. International Statistical Review 60, 291–319.
Brunner, C., Scherer, R., Graimann, B., Supp, G., Pfurtscheller, G., Dec 2006. Online control of a brain-computer interface using phase synchronization. IEEE Trans Biomed Eng 53, 2501–2506.
Cardoso, J.-F., Souloumiac, A., 1993. Blind beamforming for non gaussian signals. IEE Proceedings-F 140, 362–370.
Comon, P., 1994. Independent component analysis, a new concept? Signal Process. 36 (3), 287–314.
Cox, D., O'Sullivan, F., 1990. Asymptotic analysis of penalized likelihood and related estimates. The Annals of Statistics 18 (4), 1676–1695.
Cox, D., Savoy, R., 2003. Functional magnetic resonance imaging (fMRI) 'brain reading': Detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage 19, 261–270.
Curran, E. A., Stokes, M. J., 2003. Learning to control brain activity: A review of the production and control of EEG components for driving brain-computer interface (BCI) systems. Brain Cogn 51, 326–336.
Dornhege, G., Blankertz, B., Curio, G., Müller, K.-R., Jun. 2004. Boosting bit rates in non-invasive EEG single-trial classifications by feature combination and multi-class paradigms. IEEE Trans Biomed Eng 51 (6), 993–1002.
Dornhege, G., Blankertz, B., Krauledat, M., Losch, F., Curio, G., Müller, K.-R., 2006. Combined optimization of spatial and temporal filters for improving brain-computer interfacing. IEEE Trans Biomed Eng 53 (11), 2274–2281.
Dornhege, G., del R. Millán, J., Hinterberger, T., McFarland, D., Müller, K.-R. (Eds.), 2007. Toward Brain-Computer Interfacing. MIT Press, Cambridge, MA.
Duda, R., Hart, P. E., Stork, D. G., 2001. Pattern classification, 2nd Edition. John Wiley & Sons.
Efron, B., Tibshirani, R., 1997. Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association 92 (438), 548–560.
Efron, B., Tibshirani, R. J., 1993. An introduction to the Bootstrap. Vol. 57 of Monographs on Statistics and Applied Probability. Chapman and Hall.
Farquhar, J., Nov 2009. A linear feature space for simultaneous learning of spatio-spectral filters in BCI. Neural Netw 22, 1278–1285.
Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recognition Letters 27 (8), 861–874.
Ferrez, P., Millán, J., 2005. You are wrong! – automatic detection of interaction errors from brain waves.
In: 19th International Joint Conference on Artificial Intelligence. pp. 1413–1418. Friedman, J. H., 1989. Regularized discriminant analysis. J Amer Statist Assoc 84 (405), 165–175. Fukunaga, K., 1972. Introduction to Statistical Pattern Recognition. Academic Press, New York. Gentili, R. J., Bradberry, T. J., Hatfield, B. D., Contreras-Vidal, J. L., 2009. Brain biomarkers of motor adaptation using phase synchronization. Conf Proc IEEE Eng Med Biol Soc 1, 5930–5933. Golub, G., van Loan, C., 1996. Matrix Computations, 3rd Edition. John Hopkins University Press, Baltimore, London. Grosse-Wentrup, M., Buss, M., Aug 2008. Multiclass common spatial patterns and information theoretic feature extraction. IEEE Trans Biomed Eng 55, 1991–2000. Grosse-Wentrup, M., Gramann, K., Buss, M., 2007. Adaptive spatial filters with predefined region of interest for EEG based braincomputer-interfaces. In: Schölkopf, B., Platt, J., Hoffman, T. (Eds.), Advances in Neural Information Processing Systems 19. MIT Press, pp. 537–544. Guyon, I., Boser, B., Vapnik, V., 1993. Automatic capacity tuning of very large VC-dimension classifiers. In: Hanson, S., Cowan, J., Giles, C. (Eds.), Advances in Neural Information Processing Systems. Vol. 5. Morgan Kaufmann, San Mateo, CA, pp. 147–155. Hansen, L., Larsen, J., Nielsen, F., Strother, S., Rostrup, E., Savoy, R., Lange, N., Sidtis, J., Svarer, C., Paulson, O., 1999. Generalizable patterns in neuroimaging: how many principal components? Neuroimage 9 (5), 534–544. Harmeling, S., Dornhege, G., Tax, D., Meinecke, F. C., Müller, K.-R., 2006. From outliers to prototypes: ordering data. Neurocomputing 69 (13–15), 1608–1618. Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer. Haufe, S., Nikulin, V. V., Ziehe, A., Müller, K.-R., Nolte, G., Aug 2008. Combining sparsity and rotational invariance in EEG/MEG source reconstruction. Neuroimage 42 (2), 726–738. Haufe, S., Nikulin, V. V., Ziehe, A., Müller, K.-R., Nolte, G., 2009. Estimating vector fields using sparse basis field expansions. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (Eds.), Advances in Neural Information Processing Systems 21. MIT Press, Cambridge, MA, pp. 617–624. Haxby, J., et al, 2001. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293, 2425¢??–2430. Haynes, J., Rees, G., 2006. Decoding mental states from brain activity in humans. Nat Neurosci 7, 523–534. Hyvärinen, A., Karhunen, J., Oja, E., 2001. Independent Component Analysis. Wiley. Hyvärinen, A., Oja, E., 1997. A fast fixed-point algorithm for independent component analysis. Neural Computation 9 (7), 1483–92. Kamitani, Y., Tong, F., 2005. Decoding the visual and subjective contents of the human brain. Nat Neurosci 8 (5), 679¢??685. Kimeldorf, G., Wahba, G., 1971. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic. 33, 82–95. Koles, Z. J., 1991. The quantitative extraction and topographic mapping of the abnormal components in the clinical EEG. Electroencephalogr Clin Neurophysiol 79 (6), 440–447. Koles, Z. J., Soong, A. C. K., 1998. EEG source localization: implementing the spatio-temporal decomposition approach. Electroencephalogr Clin Neurophysiol 107, 343–352. Krauledat, M., Shenoy, P., Blankertz, B., Rao, R. P. N., Müller, K.R., 2007. Adaptation in CSP-based BCI systems. In: Dornhege, G., del R. Millán, J., Hinterberger, T., McFarland, D., Müller, K.R. (Eds.), Toward Brain-Computer Interfacing. MIT Press, Cambridge, MA, pp. 305–309. 
Kriegeskorte, N., Goebel, R., Bandettini, P., Mar 2006. Informationbased functional brain mapping. Proc Natl Acad Sci U S A 103, 3863–3868. Kübler, A., Kotchoubey, B., Dec 2007. Brain-computer interfaces in the continuum of consciousness. Curr Opin Neurol 20, 643–649. Kübler, A., Kotchoubey, B., Kaiser, J., Wolpaw, J., Birbaumer, N., 2001. Brain-computer communication: Unlocking the locked in. Psychol Bull 127 (3), 358–375. Kübler, A., Müller, K.-R., 2007. An introducton to brain computer interfacing. In: Dornhege, G., del R. Millán, J., Hinterberger, T., McFarland, D., Müller, K.-R. (Eds.), Toward Brain-Computer Interfacing. MIT press, Cambridge, MA, pp. 1–25. LaConte, S., Strother, S., Cherkassky, V., Anderson, J., Hu, X., 2005. Support vector machines for temporal classification of block design fmri data. NeuroImage 26, 317¢??–329. Lal, T., Schröder, M., Hinterberger, T., Weston, J., Bogdan, M., Birbaumer, N., Schölkopf, B., 2004. Support vector channel selection in BCI. IEEE Trans Biomed Eng 51 (6), 1003–1010. Ledoit, O., Wolf, M., 2004. A well-conditioned estimator for largedimensional covariance matrices. J Multivar Anal 88 (2), 365–411. Lee, D. D., Seung, H. S., 2000. Algorithms for non-negative matrix factorization. In: NIPS. pp. 556–562. Lemm, S., Blankertz, B., Curio, G., Müller, K.-R., 2005. Spatiospectral filters for improving classification of single trial EEG. IEEE Trans Biomed Eng 52 (9), 1541–1548. Li, Y., Guan, C., 2006. An extended EM algorithm for joint feature extraction and classification in brain-computer interfaces. Neural Comput 18 (11), 2730–2761. MacKay, D. J. C., 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, available from http://www.inference.phy.cam.ac.uk/mackay/itila/. Marzetti, L., Gratta, C. D., Nolte, G., 2008. Understanding brain connectivity from EEG data by identifying systems composed of interacting sources. NeuroImage 42 (1), 87 – 98. Meinecke, F. C., Ziehe, A., Kurths, J., Müller, K.-R., 2005. Measuring Phase Synchronization of Superimposed Signals. Physical Review Letters 94 (8), 084102. Mika, S., Rätsch, G., Müller, K.-R., 2001. A mathematical programming approach to the Kernel Fisher algorithm. In: Leen, T., Dietterich, T., Tresp, V. (Eds.), Advances in Neural Information Processing Systems. Vol. 13. MIT Press, pp. 591–597. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., Müller, 17 K.-R., May 2003. Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces. IEEE Transaction on Pattern Analysis and Machine Intelligence 25 (5), 623–628. Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K., Malave, V. L., Mason, R. A., Just, M. A., 2008. Predicting human brain activity associated with the meanings of nouns. Science 320, 1191. Moody, J., 1992. The effective number of parameters: An analysis of generalization and regularization in non-linear learning systems. In: J. Moody, S. H., Lippman, R. (Eds.), Advances in Neural information processings systems. Vol. 4. Morgan Kaufman, San Mateo, CA, pp. 847–854. Morup, M., Hansen, L., Arnfred, S., Lim, L.-K., Madsen, K., 2008. Shift invariant multilinear decomposition of neuroimaging data. Neuroimage 42 (4), 1439–1450. Müller, K.-R., Anderson, C. W., Birch, G. E., 2003. Linear and nonlinear methods for brain-computer interfaces. IEEE Trans Neural Syst Rehabil Eng 11 (2), 165–169. Müller, K.-R., Krauledat, M., Dornhege, G., Curio, G., Blankertz, B., 2004. 
Machine learning techniques for brain-computer interfaces. Biomed Tech 49 (1), 11–22. Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B., May 2001. An introduction to kernel-based learning algorithms. IEEE Neural Networks 12 (2), 181–201. Müller, K.-R., Tangermann, M., Dornhege, G., Krauledat, M., Curio, G., Blankertz, B., 2008. Machine learning for real-time single-trial EEG-analysis: From brain-computer interfacing to mental state monitoring. J Neurosci Methods 167 (1), 82–90. Murata, N., Amari, S., Yoshizawa, S., 1994. Network information criterion — determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks 5, 865–872. Noirhomme, Q., Kitney, R. I., Macq, B., May 2008. Single-trial EEG source reconstruction for brain-computer interface. IEEE Trans Biomed Eng 55, 1592–1601. Nolte, G., Bai, O., Wheaton, L., Mari, Z., Vorbach, S., Hallett, M., 2004. Identifying true brain interaction from eeg data using the imaginary part of coherency. Clinical Neurophysiology 115, 2292– 2307. Nolte, G., Meinecke, F. C., Ziehe, A., Müller, K.-R., 2006. Identifying interactions in mixed and noisy complex systems. Physical Review E 73, 051913. Nolte, G., Ziehe, A., Nikulin, V., Schlögl, A., Krämer, N., Brismar, T., Müller, K.-R., June 2008. Robustly Estimating the Flow Direction of Information in Complex Physical Systems. Physical Review Letters 100, 234101. Orr, G., Müller, K.-R. (Eds.), 1998. Neural Networks: Tricks of the Trade. No. 1524 in LNCS. Springer Heidelberg. Parra, L., Alvino, C., Tang, A., Pearlmutter, B., Yeung, N., Osman, A., Sajda, P., Jun 2003. Single-Trial Detection in EEG and MEG: Keeping it Linear. Neurocomputing 52-54, 177–183. Parra, L., Alvino, C., Tang, A. C., Pearlmutter, B. A., Yeung, N., Osman, A., Sajda, P., 2002. Linear spatial integration for single trial detection in encephalography. Neuroimage 7 (1), 223–230. Parra, L., Christoforou, C., Gerson, A., Dyrholm, M., Luo, A., Wagner, M., Philiastides, M., Sajda, P., 2008. Spatiotemporal linear decoding of brain state. IEEE Signal Process Mag 25 (1), 107–115. Parra, L., Sajda, P., 2003. Blind source separation via generalized eigenvalue decomposition. J Mach Learn Res 4 (7-8), 1261–1269. Pereira, F., Mitchell, T., Botvinick, M., 2009. Machine learning classifiers and fmri: A tutorial overview. Neuroimage 45 (1, Supplement 1), S199 – S209, mathematics in Brain Imaging. Pfurtscheller, G., Neuper, C., Birbaumer, N., 2005. Human BrainComputer Interface. In: Riehle, A., Vaadia, E. (Eds.), Motor Cortex in Voluntary Movements. CRC Press, New York, Ch. 14, pp. 367–401. Poggio, T., Girosi, F., 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978–982. Pun, T., Alecu, T. I., Chanel, G., Kronegg, J., Voloshynovskiy, S., Jun 2006. Brain-computer interaction research at the Computer Vision and Multimedia Laboratory, University of Geneva. IEEE Trans Neural Syst Rehabil Eng 14, 210–213. Racine, J., 2000. Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. J. Econom. 99 (1), 39–61. Ramoser, H., Müller-Gerking, J., Pfurtscheller, G., 2000. Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Trans Rehabil Eng 8 (4), 441–446. Rasmussen, C., Williams, C., 2005. Gaussian Processes for Machine Learning. MIT Press. Rätsch, G., Mika, S., Schölkopf, B., Müller, K.-R., Sep. 2002. Constructing boosting algorithms from SVMs: An application to oneclass classification. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (9), 1184–1199. Rätsch, G., Onoda, T., Müller, K.-R., Mar. 2001. Soft margins for AdaBoost. Machine Learning 42 (3), 287–320. Sajda, P., Gerson, A., Müller, K.-R., Blankertz, B., Parra, L., 2003. A data analysis competition to evaluate machine learning algorithms for use in brain-computer interfaces. IEEE Trans Neural Syst Rehabil Eng 11 (2), 184–185. Schäfer, J., Strimmer, K., 2005. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology 4 (1). Schlögl, A., Vidaurre, C., Müller, K.-R., 2009. Adaptive methods in BCI research – an introductory tutorial. In: Allison, B., Graimann, B., Pfurtscheller, G. (Eds.), The Frontiers Collection. Springer, in press. Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., Williamson, R., 2001. Estimating the support of a high-dimensional distribution. Neural Computation 13 (7), 1443–1471. Schölkopf, B., Smola, A., 2002. Learning with Kernels. MIT Press, Cambridge, MA. Schölkopf, B., Smola, A., Müller, K.-R., 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10 (5), 1299–1319. Shao, J., 1993. Linear model selection by cross-validation. J. Am. Stat. Assoc. 88 (422), 486–494. Shenoy, P., Krauledat, M., Blankertz, B., Rao, R. P. N., Müller, K.-R., 2006. Towards adaptive classification for BCI. J Neural Eng 3 (1), R13–R23. Smola, A., Schölkopf, B., 1998. On a kernel–based method for pattern recognition, regression, approximation and operator inversion. Algorithmica 22, 211–231. Smola, A., Schölkopf, B., Müller, K.-R., 1998. The connection between regularization operators and support vector kernels. Neural Networks 11, 637–649. Stein, C., 1956. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proc. 3rd Berkeley Sympos. Math. Statist. Probability 1, 197-206 (1956). Strother, S., Anderson, J., Hansen, L., annd R. Kustra, U. K., Siditis, J., Frutiger, S., Muley, S., LaConte, S., Rottenberg, D., 2002. The quantitative evaluation of functional neuroimaging experiments: The npairs data analysis framework. NeuroImage 15, 747¢??771. Sugiyama, M., Krauledat, M., Müller, K.-R., 2007. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8, 1027–1061. Tibshirani, R., June 1994. Regression selection and shrinkage via the LASSO. Tech. rep., Department of Statistics, University of Toronto, ftp://utstat.toronto.edu/pub/tibs/lasso.ps. Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. 18 J Royal Statist Soc B 58 (1), 267–288. Tikhonov, A., Arsenin, V., 1977. Solutions of Ill-posed Problems. W.H. Winston, Washington, D.C. Tomioka, R., Aihara, K., Müller, K.-R., 2007. Logistic regression for single trial EEG classification. In: Schölkopf, B., Platt, J., Hoffman, T. (Eds.), Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, pp. 1377–1384. Tomioka, R., Müller, K. R., 2010. A regularized discriminative framework for EEG based communication. Neuroimage 49, 415–432. Vanderbei, R., Shanno, D., 1997. An interior point algorithm for nonconvex nonlinear programming. Tech. Rep. SOR-97-21, Statistics and Operations Research, Princeton University. Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, New York. Vidaurre, C., Blankertz, B., 2010. Towards a cure for BCI illiteracy. Brain Topogr 23, 194–198, open Access. 
Vidaurre, C., Krämer, N., Blankertz, B., Schlögl, A., 2009. Time domain parameters as a feature for EEG-based brain computer interfaces. Neural Networks 22, 1313–1319.
Vidaurre, C., Sannelli, C., Müller, K.-R., Blankertz, B., 2010. Machine-learning based co-adaptive calibration. Neural Comput, in press.
von Bünau, P., Meinecke, F. C., Király, F., Müller, K.-R., 2009. Finding stationary subspaces in multivariate time series. Physical Review Letters 103, 214101.
Wolpaw, J., Mar 2007. Brain-computer interfaces as new brain output pathways. J Physiol 579, 613–619.
Wolpaw, J. R., Birbaumer, N., McFarland, D. J., Pfurtscheller, G., Vaughan, T. M., 2002. Brain-computer interfaces for communication and control. Clin Neurophysiol 113 (6), 767–791.
Ziehe, A., Müller, K.-R., 1998. TDSEP – an efficient algorithm for blind separation using time structure. In: Niklasson, L., Bodén, M., Ziemke, T. (Eds.), Proceedings of the 8th International Conference on Artificial Neural Networks, ICANN'98. Perspectives in Neural Computing. Springer Verlag, Berlin, pp. 675–680.
Ziehe, A., Müller, K.-R., Nolte, G., Mackert, B.-M., Curio, G., January 2000. Artifact reduction in magnetoneurography based on time-delayed second-order correlations. IEEE Trans Biomed Eng 47 (1), 75–87.
Zou, H., Hastie, T., Tibshirani, R., 2006. Sparse principal component analysis. Journal of Computational and Graphical Statistics 15 (2), 265–286.