Academia.eduAcademia.edu

Direct information transfer rate optimisation for SSVEP-based BCI

2018, Journal of Neural Engineering

arXiv:1907.10509v1 [eess.SP] 19 Jul 2019 This is the version of the article before peer review or editing, as submitted by an author to Journal of Neural Engineering. IOP Publishing Ltd is not responsible for any errors or omissions in this version of the manuscript or any version derived from it. The Version of Record is available online at https://doi.org/10.1088/1741-2552/aae8c7. Direct information transfer rate optimisation for SSVEP-based BCI Anti Ingel Institute of Computer Science, University of Tartu, Tartu, Estonia E-mail: antiingel@gmail.com Ilya Kuzovkin Institute of Computer Science, University of Tartu, Tartu, Estonia E-mail: ilya.kuzovkin@gmail.com Raul Vicente Institute of Computer Science, University of Tartu, Tartu, Estonia E-mail: raulvicente@gmail.com June 2018 Abstract. In this work, a classification method for SSVEP-based BCI is proposed. The classification method uses features extracted by traditional SSVEP-based BCI methods and finds optimal discrimination thresholds for each feature to classify the targets. Optimising the thresholds is formalised as a maximisation task of a performance measure of BCIs called information transfer rate (ITR). However, instead of the standard method of calculating ITR, which makes certain assumptions about the data, a more general formula is derived to avoid incorrect ITR calculation when the standard assumptions are not met. This allows the optimal discrimination thresholds to be automatically calculated and thus eliminates the need for manual parameter selection or performing computationally expensive grid searches. The proposed method shows good performance in classifying targets of a BCI, outperforming previously reported results on the same dataset by a factor of 2 in terms of ITR. The highest achieved ITR on the used dataset was 62 bit/min. The proposed method also provides a way to reduce false classifications, which is important in real-world applications. 1. Introduction In this work, a classification method for steady-state visual evoked potential (SSVEP) based brain-computer interface (BCI) is proposed. Brain-computer interface is a direct communication channel between the brain and some external device. SSVEP-based BCI uses visual stimuli, called targets, to elicit a response in the brain of the user and then Direct information transfer rate optimisation for SSVEP-based BCI 2 extract features from the recorded brain activity and, finally, classifies the result as one of the commands that the user can send to the external device. Often, the visual stimuli used in SSVEP-based BCIs are multiple squares displayed on a computer screen, each flickering with different frequency [1]. Different flickering frequencies elicit different brain responses and there are many feature extraction methods to evaluate the presence of this signal in a recorded electroencephalography (EEG) signal. The novelty of the classification method proposed in this work is that it is based on information transfer rate (ITR) maximisation. ITR is a standard measure of performance for BCIs [2]. It combines the accuracy and the speed of the classifier into a single number which shows how much information is transferred by the BCI in one unit of time. Therefore, maximising ITR maximises the amount of information that the BCI can transfer in fixed time interval. Often, for SSVEP-based BCIs with similar classification rules, parameters like discrimination threshold are either chosen manually or found using grid search approaches [3–7]. This work proposes classification method that finds discrimination thresholds so that they maximise ITR thus improving the performance of the BCI. It is faster than grid-search approach and in addition provides a way to compare different feature extraction methods and parameters of the BCI without having to worry that the results are affected by the manual choice of discrimination threshold. The proposed method also provides a way to combine multiple feature extraction methods using linear discriminant analysis (LDA). LDA is used to project feature vectors produced by multiple feature extraction methods into lower dimensional space after which the classification algorithm can be used the same way as for one feature extraction method with distances to the decision borders as new features. Combining multiple feature extraction methods together with the threshold selection described above, the present approach achieves more than two times larger ITR than previously reported comparable result. The comparison was made to the works in which the same SSVEP dataset was used as in this work to make the comparison as fair as possible. 2. Methods 2.1. SSVEP dataset for offline experiments The performance of the proposed method was tested using publicly available SSVEP dataset by Bakardijan et al.‡ which contains SSVEP data from four subjects [8]. The dataset contains EEG recordings of the subjects’ brain responses to 8 Hz, 14 Hz and 28 Hz flickering. The visual stimuli were displayed on a computer monitor with 170 Hz refresh rate, placed approximately 90 cm away from the subjects eyes. For each subject, the dataset contains five trials for each flickering frequency. Each trial consists of about 25 seconds of EEG data, 10 seconds of which is without the visual ‡ http://www.bakardjian.com/work/ssvep_data_Bakardjian.html Direct information transfer rate optimisation for SSVEP-based BCI 3 stimulation. EEG data in the dataset was recorded using BIOSEMI EEG system with 128 channels and 256 Hz sampling rate. However, in this work only the data from two channels—O1 and O2—were used. These channels were chosen to make the results more comparable to the previous works in which similar channels were used (see Section 3.2). The exact choice of O1 and O2 was made to increase comparability to the results where consumer-grade EEG device Emotiv EPOC was used and for Emotiv EPOC, channels O1 and O2 are most suitable for SSVEP detection. 2.2. Feature extraction methods In this work, two feature extraction methods were used • Power spectral density analysis (PSDA) [9] • Canonical correlation analysis (CCA) [10] Both of these methods extract a feature for each target, that is for every possible command of the BCI. In general, the larger the feature value for a target, the more likely the target is the user’s choice. The same holds for other feature extraction methods, like Minimum energy combination [11], Likelihood ratio test [12] and Continuous wavelet transform [13]. Thus these feature extraction methods could also be used with the proposed classification method. 2.2.1. Power spectral density analysis Power spectral density analysis (PSDA) [9] feature extraction method is based on estimating the power spectrum of the EEG signal using fast fourier transform (FFT). Since each of the flickering targets elicits brain response with the same fundamental frequency as the flickering itself, the estimated power of this frequency serves as a good feature for classifying which of the targets is the user looking at. But since brain signals are inherently very noisy, single power value at certain time point is not enough to reliably classify the target. Thus the feature is calculated from a fixed-size signal segment after every fixed time step. In the experiments of this paper, window length of 1 second and time step of 0.125 seconds were used. In addition to the power of fundamental frequency of the flickering, also the powers of its second and third harmonic were used as features. To obtain one feature per target, all the powers obtained for one target can be added together. However, in the experiments of this work, LDA dimensionality reduction, described in Section 2.3, was used to find one feature for each target. These features are then used to classify the target. 2.2.2. Canonical correlation analysis Canonical correlation analysis (CCA) [10] feature extraction method is inherently multidimensional, meaning that it calculates single feature value for each target for multichannel EEG signal. This is not the case with the PSDA method, which calculates powers for each channel separately. CCA uses canonical correlation as a feature value, which is a correlation between two sets of signals—one Direct information transfer rate optimisation for SSVEP-based BCI 4 set is the multichannel EEG signal and the other is set of reference signals, which mimic the expected behaviour of the brain’s response to the stimulation. Each target has its own set of reference signals and reference signals also contain signals for the second and third harmonic of the flickering frequency. The feature extraction method calculates canonical correlation of each set of reference signals to the multichannel EEG signal. As before, the larger the correlation is, the more likely the corresponding target is the user’s choice. 2.3. Combining multiple feature extraction methods To make it possible for the BCI to use multiple feature extraction methods at the same time and thus improve performance of the BCI, a dimensionality reduction and classification method called linear discriminant analysis (LDA) was used. LDA projects the input data to a lower-dimensional space so that the variance between the classes is maximised with respect to the variance within the classes. To classify a new sample, LDA projects the new sample to the lower-dimensional space and finds which class mean is closest to it. This classification rule defines decision borders for the classes and the distance from the decision border to a new sample can be interpreted as confidence score for corresponding class. Therefore, to combine multiple feature extraction methods, LDA is used to find the lower-dimensional space into which the original features are projected. Then their distances to decision borders are used as new features. This combines the information from all the original features into single new feature per target. Having one feature per target is needed because this is required by the classification rule described in section 3.1.1. 3. Results 3.1. Classification method This section describes the derivation of the proposed classification method that is based on ITR maximisation. It follows the description of the method given in [14]. First, the classification rule is described, and then a formula for estimating ITR from discrimination thresholds is derived. Finally, it is shown that the formula is differentiable and thus gradient descent can be used to find optimal discrimination thresholds. For more details about the derivation see Appendix C and Appendix D. Implementation of the algorithm can be found in Appendix E. 3.1.1. Classification rule Assuming that there are n targets {1, . . . , n}, a feature extraction method extracts n features {f1 , . . . , fn }. The classification rule uses a discrimination threshold tj for each feature fj to classify a sample as follows: A sample is classified as class k if for this sample fk ≥ tk and fj < tj for j ∈ {1, 2 . . . , n} \ {k}. Direct information transfer rate optimisation for SSVEP-based BCI 5 Or in other words, a sample is classified as class k if the feature value for this class is over the threshold and all the other feature values are below their thresholds. It can be seen that this rule might not classify all the samples. If multiple features are over their thresholds, or none of the features are, then classification is not made. This case is viewed as the BCI not being confident enough and thus it avoids making a prediction. This is useful when not making a prediction is less costly than making a wrong prediction. 3.1.2. Finding the thresholds Finding the thresholds {t1 , . . . , tn } will be formalised as a maximisation task, where the function to be maximised is a performance measure of the BCI. The performance measure ITR(~t ) is a function of thresholds ~t = (t1 , . . . , tn )T and thus the maximisation task is arg max ITR(~t ) (1) ~t which was solved in this work using gradient descent algorithm. The performance measure itself is defined similarly to standard ITR, which is used to evaluate SSVEPbased BCIs, but the assumptions [2] about precisions, the probabilities of errors and predictions is not made. In particular, the standard assumptions are • all predictions are equally likely • if choosing a wrong target, all wrong targets are equally likely to be chosen • the accuracy of successfully classifying a target has to be the same for every target. Making these assumptions while they are not met can result in incorrect ITR calculation and can cause the maximisation algorithm to return non-optimal thresholds. The performance measure is desired to be a continuous function of thresholds and will be defined as ITR(~t ) = MI(~t ) · 60 MDT(~t ) (2) where MI(~t ) is the mutual information between random variables P and C, which model the predicted class and the correct class respectively, and MDT(~t ) is the mean detection time in seconds. Mutual information MI(~t ) shows how many bits of information is transferred with one prediction and is multiplied with the amount of predictions made in one minute, thus giving units of bits per minute. 3.1.3. Calculating mutual information By denoting the event of predicting class i by Pi and the event of correct class being j by Cj , the mutual information between random variables P and C as defined previously is   n X n X P(P i ∩ Cj ) . (3) P(Pi ∩ Cj ) log2 MI(~t ) = I(P, C) = P(Pi ) · P(Cj ) i=1 j=1 Direct information transfer rate optimisation for SSVEP-based BCI 6 When using the assumptions discussed in 3.1.2 the formula (3) can be simplified to standard ITR [2] calculation formula in units of bits per prediction   1−P (4) ITRS = log2 N + P log2 P (1 − P ) log2 N −1 where N is the number of targets and P is the classification accuracy. Therefore the proposed ITR calculation method is a generalisation of the standard method. Next we show that the mutual information is indeed a function of thresholds. If the feature value for class k is modelled with random variable Fk , the event of predicting class i can be represented as      ^   Pi = ω ∈ Ω : Fi (ω) ≥ ti ∧ Fj (ω) < tj (5)   j∈{1,...,n}\{i} where Ω denotes the sample space. From this, by making the assumption that the features are conditionally independent, conditioned on the correct class C, the conditional probability of predicting a class can be calculated as Y  P(Pi | Ck ) = 1 − FFi |Ck (ti ) FFj |Ck (tj ) (6) j∈{1,...,n}\{i} where FFℓ |Ck denotes the cumulative distribution function (CDF) of the conditional distribution Fℓ | Ck . It was empirically observed in this work that skew normal distribution models well the conditional distributions of features (see Appendix B). Thus by using least squares method for fitting skew normal distribution to the data, the required CDFs can be calculated. Therefore, after having fit skew normal distributions to the data, the probability P(Pi | Ck ) only depends on the thresholds ~t. Now, by using known rules from probability theory, all the probabilities required in calculating mutual information can be estimated Number of samples from class j Number of samples n X P(Pi ) = P(Pi | Cj )P(Cj ) P(Cj ) = (7) (8) j=1 P(Pi ∩ Cj ) = P(Pi | Cj )P(Cj ). (9) Thus mutual information can indeed be represented as a continuous function of thresholds and hence its gradient can be calculated. 3.1.4. Calculating mean detection time So far it has been shown that the amount of information transferred with one prediction can be calculated as a function of thresholds. Now we show that the mean detection time can also be represented as function of thresholds. Direct information transfer rate optimisation for SSVEP-based BCI 7 Since it is desired that the performance measure would reflect the performance of the online BCI, a method for estimating online MDT is proposed. First, let us discuss how the implemented BCI splits the incoming signal. The feature extraction methods always use w seconds of data at a time. Whether there is an overlap between consecutive windows depends on whether the last prediction was successful, that is prediction was made. If a prediction is made, there is no overlap between windows. In this case the time step between consecutive feature extractions is w. If a prediction is not made, there is w − s seconds of overlap between windows. In this case s is the time step between consecutive feature extractions. By making the assumption that the probability of making a prediction is the same in every time step and that the probability of making a prediction will be the same in online case as it was offline, the MDT can be calculated as   1 S −1 ·s (10) MDT(~t ) = w + P ( ni=1 Pi ) where s is multiplied with the expected number of failures before a successful prediction. MDT is a continuous function of thresholds, because ! n n [ X P Pi = P(Pi ) (11) i=1 i=1 and we have already shown that P(Pi ) is a continuous function of thresholds. Thus the performance measure as shown in equation (2) is a continuous function of thresholds. 3.1.5. Gradient Descent Finally, thresholds which maximise the ITR directly are found using gradient ascent (GA) algorithm. GA iteratively updates the threshold values by moving in the direction of greatest increase of ITR using the rule ~tn+1 = ~tn + µ · ∇ITR(~tn ) (12) where ∇ITR is the gradient of ITR and µ is step size parameter. For the calculation of the gradient, see Appendix C and Appendix D. GA stops if ITR improves less than 10−6 . Since ITR(~t ) is not concave, GA has to be run multiple times, each time obtaining a local maximum and the final result will be the largest from these local maxima. In this work, 20 random initialisations were used for GA. 3.2. Comparison of the proposed method to previous results The performance of the proposed method was tested using SSVEP dataset by Bakardijan et al. [8] and the results were compared to two other works [3, 7] where the same dataset was used. Detailed comparison of these works can be seen in Appendix A. In the work by Demir et al. [7], a feature extraction method called bio-inspired filter bank (BIFB) is introduced. The method is in essence very similar to PSDA method. In their method, power spectral density of the EEG signal is estimated and triangular 8 Direct information transfer rate optimisation for SSVEP-based BCI Table 1. Comparison of CCA, PSDA feature extraction methods and their combination, using proposed classifier. CCA + PSDA (using LDA) CCA PSDA S1 S2 S3 S4 Avg S1 S2 S3 S4 Avg S1 S2 S3 S4 Avg ITR proposed 64.62 53.73 33.19 5.65 39.30 55.6 47.33 13.71 14.47 32.78 43.41 28.2 16.68 5.06 23.34 ITR standard 62.33 48.31 27.61 1.74 35.00 54.52 44.08 16.12 9.91 31.16 40.93 22.9 11.18 2.84 19.46 Accuracy 0.94 0.89 0.82 0.55 0.80 0.94 0.9 0.74 0.63 0.80 0.87 0.76 0.62 0.49 0.68 Mean detection time 1.14 1.22 1.56 1.72 1.41 1.33 1.37 1.9 1.61 1.55 1.33 1.41 1.29 1.56 1.4 Number of predictions 790 606 308 348 513 1008 857 270 433 642 995 718 1231 479 856 Table 2. Comparison of proposed classification method, proposed method which always makes a prediction and random forest. Proposed method Proposed method: Always predicts Random Forest S1 S3 S3 S4 Avg S1 S2 S3 S4 Avg S1 S2 S3 S4 Avg ITR proposed 64.62 53.73 33.19 5.65 39.30 41.91 31.62 19.32 1.8 23.67 40.4 31.01 19.08 2.46 23.24 ITR standard 62.33 48.31 27.61 1.74 35.00 38.31 26.94 13.12 0.02 19.60 38.63 26.03 15.62 0.97 20.31 Accuracy 0.94 0.89 0.82 0.55 0.8 0.79 0.72 0.6 0.35 0.62 0.79 0.71 0.63 0.48 0.65 Mean detection time 1.14 1.22 1.56 1.72 1.41 1 1 1 1 1 1 1 1 1 1 Number of predictions 790 606 308 348 513 1695 1695 1695 1695 1695 1695 1695 1695 1695 1695 filters are used on the result to make the powers of different frequencies more comparable. They also compare their method to CCA feature extraction method. Only the data for Oz electrode is used in their work. In the work by Jukiewicz and Cysewska-Sobusiak [3] PSDA method was used for feature extraction and two classification methods were compared—bilinear separation and SVM. They only used electrodes O1, O2 and Oz and two classes 8 Hz and 14 Hz. These two works were chosen, because similarly to this work, the classifier was allowed to not make predictions for some samples. As can be seen from Table 2, the possibility to not make predictions results in better performance. Overview of the results of other works where the same dataset was used for testing SSVEP-based BCI can be seen in Appendix A. The proposed classifier was also compared to Random Forest to see how well it performs compared to standard machine learning algorithm. Detailed comparison can be seen in Table 2 and comparison to other works and to random forest can be seen in Table 3. All the tables contain both, the standard ITR and proposed ITR. The difference between them is that standard ITR uses formula (4) to estimate the amount of information sent by one prediction, while proposed ITR uses formula (3). Table 1 contains the comparison of PSDA and CCA feature extraction methods to their combination. In this work, classifiers were evaluated using five-fold cross-validation, each fold corresponding to data from separate recording trial. 9 Direct information transfer rate optimisation for SSVEP-based BCI Table 3. Proposed classifier compared to Random Forest and related work. Feature Classification Classes extraction Window MDT (s) Accuracy length (s) ITR ITR standard proposed PSDA + CCA Proposed method 3 1 1.41 80% 35.00 39.30 This PSDA + CCA Random Forest 3 1 1 64% 20.46 23.73 work PSDA Proposed method 3 1 1.4 68% 19.46 23.34 PSDA Random Forest 3 1 1 53% 9.64 12.41 [7] BIFB No ML 3 N/A 8.4 88% 8.2 - [3] PSDA Bilinear separation 2 1 N/A 74% 16 - 4. Discussion From the results it can be seen that allowing the classifier to not make predictions when it is not confident enough results in better performance. This can be seen as a trade-off between accuracy and mean detection time: allowing the classifier to not make predictions results in better accuracy but worse mean detection time. According to ITR, which is a combination of both of these measures, allowing to not make predictions results in better performance. Fair comparison between different methods was possible thanks to the proposed classification method, which tries to find optimal discrimination threshold for each of the methods and thus the results are not affected by manual parameter selection. The proposed method outperforms both, the Random Forest classifier and the previously published comparable results in terms of ITR. If one values accuracy over ITR, the performance measure can be modified to take this into account and therefore a classifier with higher accuracy can be obtained. By combining multiple feature extraction methods, the achieved ITR is more than two times larger than the ones reported in related work. For one of the subjects, ITR of over 60 bit/min was achieved in this work. For further comparison to related work see [14]. For future work, the classifier will be tested online as the functionality for doing that is already implemented in our software. This is required because in the current work online ITR and MDT were estimated and might differ from the actual values. Online testing should also provide information whether the learned discrimination thresholds can be used in long periods of time or the thresholds need to be re-learned frequently, which would decrease usability of this method. In our offline experiments, however, the proposed method clearly outperforms the previous comparable results. 5. Conclusion The main conclusions of the work are the following • Proposed classifier outperforms previous results (Table 3). Direct information transfer rate optimisation for SSVEP-based BCI 10 • Combining feature extraction methods outperforms the same methods used individually (Table 1). • Allowing the classifier to not make predictions when it is not confident enough results in better performance (Tables 2). In this work, a classification method for SSVEP-based BCI based on ITR optimisation was proposed. This enabled fair comparison of different feature extraction methods by finding the best discrimination threshold for each method. This clearly gives better results than manually chosen thresholds, as has been done in previous works, and is faster than grid-search method. By combining the features of different feature extraction methods using LDA, the proposed classification method can be used with multiple feature extraction methods, which increases the overall performance. Finally, by allowing the classifier to not make predictions when it is not confident enough is shown to be a good way to decrease the proportion of false-positive predictions. 6. Acknowledgments The authors thank the financial support from The Estonian Research Council through the personal research grant PUT1476. We also acknowledge funding by the European Regional Development Fund through the Estonian Center of Excellence in IT, EXCITE. 7. References [1] Zhu D, Bieger J, Molina G G and Aarts R M 2010 A survey of stimulation methods used in SSVEP-based BCIs Computational Intelligence and Neuroscience 2010 1–13 [2] Wolpaw J R, Ramoser H, McFarland D J and Pfurtscheller G 1998 EEG-Based Communication: Improved Accuracy by Response Verification IEEE Transactions on Rehabilitation Engineering 6 326–333 [3] Jukiewicz M and Cysewska-Sobusiak A 2015 Implementation of Bilinear Separation algorithm as a classification method for SSVEP-based brain-computer interface Measurement Automation Monitoring 61 51–53 [4] Karnati V B R, Verma U S G, Amerineni J S and Shah V H 2014 Frequency detection in Medium and High frequency SSVEP based Brain Computer Interface systems by scaling of sine-curve fit amplitudes 2014 First International Conference on Networks Soft Computing (ICNSC2014) pp 300–303 [5] Velchev Y, Radev D and Radeva S 2016 Features Extraction Based on Subspace Methods with Application to SSVEP BCI International Journal of Emerging Engineering Research and Technology 4 52–58 [6] Anindya S F, Rachmat H H and Sutjiredjeki E 2016 A prototype of SSVEP-based BCI for home appliances control 2016 1st International Conference on Biomedical Engineering (IBIOMED) pp 1–6 [7] Demir A F, Arslan H and Uysal I 2016 Bio-inspired Filter Banks for SSVEP-based Brain-computer Interfaces 2016 IEEE International Conference on Biomedical and Health Informatics (BHI) (Las Vegas, NV, USA) [8] Bakardjian H, Tanaka T and Cichocki A 2010 Optimization of SSVEP brain responses with application to eight-command Brain–Computer Interface Neuroscience letters 469 34–38 Direct information transfer rate optimisation for SSVEP-based BCI 11 [9] Cheng M, Gao X, Gao S and Xu D 2002 Design and implementation of a brain-computer interface with high transfer rates Biomedical Engineering, IEEE Transactions on 49 1181–1186 [10] Lin Z, Zhang C, Wu W and Gao X 2007 Frequency Recognition Based on Canonical Correlation Analysis for SSVEP-based BCIs Biomedical Engineering 54 [11] Friman O, Volosyak I and Graser A 2007 Multiple channel detection of steady-state visual evoked potentials for brain-computer interfaces Biomedical Engineering, IEEE Transactions on 54(4) 742–750 [12] Zhang Y, Dong L, Zhang R, Yao D, Zhang Y and Xu P 2014 An Efficient Frequency Recognition Method Based on Likelihood Ratio Test for SSVEP-Based BCI Computational and Mathematical Methods in Medicine 2014 [13] Zhang Z, Li X and Deng Z 2010 A CWT-based SSVEP classification method for braincomputer interface system 2010 International Conference on Intelligent Control and Information Processing pp 43–48 [14] Ingel A 2017 Machine Learning in VEP-based BCI Master’s thesis University of Tartu [15] Yehia A G, Eldawlatly S and Taher M 2015 Principal component analysis-based spectral recognition for SSVEP-based Brain-Computer Interfaces 2015 Tenth International Conference on Computer Engineering Systems (ICCES) pp 410–415 12 Direct information transfer rate optimisation for SSVEP-based BCI Appendix A. Related work Table A1. Results of the related articles [14]. In each of these works the dataset by Bakardijan et al. [8] was used. Article Feature extraction Window Classification Preprocessing Classes MUSIC Gaussian Channels kernel, 0.5 N/A 8 Hz, 1 14 Hz 2 SVM linear PSDA SVM RBF Windowed 8 Hz, sinc filter, 14 Hz, Blackman 28 Hz N/A Accuracy (%) 63.2 All N/A 4 one-vs-one [6] Subjects MDT (s) SVM, [5] length/ 75 91.1 94.3 All O1, O2, POz, Oz 65 71.67 window BIFB No machine learning (but requires [7] PSDA finding 7.5 1 100 7.8 2 100 Bandpass 10 3 86.7 filter, 8.33 4 66.7 Hamming 10 1 66.7 9 2 15 3 15 4 6.7 4 1 73.3 4 2 60 5 3 66.7 3 4 66.7 1 1 93 8 Hz, 1 All O1, O2, ∼74 14 Hz 1 1 Oz 90 1 All window 8 Hz, 14 Hz, 28 Hz subject specific parameters) CCA N/A Bilinear [3] PSDA separation N/A SVM Least [4] square sine fitting No machine learning 8 Hz AMUSE 14 Hz 0.5 28 Hz Two subjects Oz 66.7 60 ∼71 8 subject 86 specific 83 channels 92 1 92.43 PSDA 2 87.39 + PCA 3 94.56 4 Subject 80.50 2 specific 71.16 Average 3 from the 76.76 over time 4 selection: 60.53 windows 1 P7, P3, 69.98 up to 4 s 2 Pz, P4, 58.35 3 P8, O1, 79.80 4 Oz, O2 47.17 Multiway LDA (and CCA grid search Bandbass 8 Hz, for subject filter, CAR, 14 Hz, Multi- specific MAF 28 Hz set parameters) [15] CCA CCA 76.61 1 1 62.69 2 59.52 3 65.51 4 56.04 13 Direct information transfer rate optimisation for SSVEP-based BCI Appendix B. Skew normal distribution fit to data C=3 C=2 C=1 fF1 |C fF2 |C fF3 |C 8 6 4 2 08 8 6 4 2 08 8 6 4 2 08 6 4 2 80 6 4 2 08 6 4 2 80 6 4 2 0 6 4 2 0 6 4 2 0 0 0.5 1 0 0.5 1 0 0.5 1 Figure B1. Histograms of features given class (blue) for subject 1. Green line denotes skew normal distribution fit to the data and red lines indicate possible thresholds [14]. F̄F2 |C C=3 C=2 C=1 F̄F1 |C 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 0 0.5 1 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 0 0.5 F̄F3 |C 1 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 0 0.5 1 Figure B2. Complementary cumulative distribution function of the corresponding probability density function in Figure B1 [14]. 14 Direct information transfer rate optimisation for SSVEP-based BCI Appendix C. ITR calculation when predictions can be skipped In this appendix, we derive the formula for calculating ITR taking into account the fact that the classifier might not make a prediction at each time step. Therefore, in addition to the events of prediction class i and correct class being j denoted by Pi and Cj respectively, let us denote the event of making a prediction as M. Furthermore, let us assume that the number of classes is n. Then M= n [ (C.1) Pi i=1 and since predicting different classes are mutually exclusive events, it holds that ! n n [ X P(M) = P Pi = P(Pi ) (C.2) i=1 i=1 where P(Pi ) can be calculated using equation (8). The previous probability is required to calculate P(Pi | M) = P(Pi ) P(Pi ∩ M) = P(M) P(M) (C.3) which is needed to calculate ITR. We also need to condition Cj on the event M S P (Cj ∩ ( ni=1 Pi )) P(Cj | M) = P(M) Sn P ( i=1 (Cj ∩ Pi )) = P(M) n 1 X = P(Cj ∩ Pi ) P(M) i=1 n 1 X = P(Pi | Cj )P(Cj ) P(M) i=1 (C.4) where P(Cj ) can be calculated using equation (7). The last probability needed to calculate ITR is given by P(Pi | Cj , M) = P(Pi ∩ Cj ∩ M) P(Pi ∩ M | Cj )P(Cj ) P(Pi | Cj ) = = P(Cj ∩ M) P(Cj ∩ M) P(M | Cj ) (C.5) where P(M | Cj ) = n X P(Pi | Cj ). (C.6) i=1 Finally, ITR can be calculated as I(P |M, C|M) = n X n X i=1 j=1 P(Pi | Cj , M)P(Cj | M) log2  P(Pi | Cj , M) P(Pj | M)  . (C.7) 15 Direct information transfer rate optimisation for SSVEP-based BCI Appendix D. Deriving gradient of ITR In this appendix the formula for calculating gradient of ITR is derived. Let again feature value for class i be modelled by random variable Fi and denote the event of k being the correct class by Ck . First note that probability density function fFi |Ck of the random variable Fi | Ck is the derivative of the corresponding cumulative density function FFi |Ck dFFi |Ck = fFi |Ck . dti This allows us to calculate ( Q −fFi |Ck (ti ) · j∈{1,...,n}\{i} FFj |Ck (tj ) ∂ Q P(Pi | Ck ) = F Fi |Ck (ti ) · fFℓ |Ck (tℓ ) · j∈{1,...,n}\{i,ℓ} FFj |Ck (tj ) ∂tℓ (D.1) if i = ℓ if i = 6 ℓ (D.2) that is needed later. To make differentiation of ITR easier, the following property will be used I(P |M, C|M) = H(P |M) − H((P |M) | (C|M)) (D.3) where H(P |M) = − n X P(Pi |M) log2 P(Pi |M) (D.4) P(Cj | M)H(P | Cj , M). (D.5) i=1 H((P |M) | (C|M)) = n X j=1 The corresponding partial derivatives are n X ln(P(Pi | M)) + 1 ∂ ∂ H(P | M) = − P(Pi | M) (D.6) ∂tℓ ln 2 ∂tℓ i=1  n  X ∂ ∂ ∂ H((P |M) | (C|M)) = H(P | Cj , M) P(Ck | M) + P(Ck | M) H(P | Cj , M) ∂tℓ ∂tℓ ∂tℓ j=1 (D.7) and the partial derivatives for the probabilities are n X ∂ ∂ P(Pi ) = P(Cj ) P(Pi | Cj ) ∂tℓ ∂tℓ j=1 (D.8) n X ∂ ∂ P(M) = P(Pj ) ∂tℓ ∂t ℓ j=1 (D.9) P(M) ∂t∂ℓ P(Pi ) − P(Pi ) ∂t∂ℓ P(M) ∂ P(Pi | M) = (D.10) ∂tℓ (P(M))2   P  Pn ∂ n ∂ P(M) P(P ∩ C ) − P(P ∩ C ) P(M) j k j k ∂tℓ j=1 ∂tℓ j=1 ∂ P(Ck | M) = (D.11) ∂tℓ (P(M))2 n X ∂ ∂ P(M | Cj ) = P(Pi | Cj ) (D.12) ∂tℓ ∂t ℓ i=1 Direct information transfer rate optimisation for SSVEP-based BCI P(M | Cj ) ∂t∂ℓ P(Pi | Cj ) − P(Pi | Cj ) ∂t∂ℓ P(M | Cj ) ∂ . P(Pi | Cj , M) = ∂tℓ (P(M | Cj ))2 16 (D.13) Therefore we are able to calculate ∂ ∂ ∂ I(P |M, C|M) = H(P | M) − H(P | C, M). ∂tℓ ∂tℓ ∂tℓ (D.14) Next, the partial derivative of MDT will be calculated s ∂ ∂ MDT = − · P(M) 2 ∂tℓ (P(M)) ∂tℓ (D.15) which together with the previous result can be used to get   ∂ ∂ 60 ∂ ITR = MDT I(P |M, C|M) − I(P |M, C|M) MDT . ∂tℓ ∂tℓ ∂tℓ MDT2 (D.16) Finally, the required gradient is ∇ITR =  ∂ ∂ ITR, . . . , ITR ∂t1 ∂tn T . (D.17) Appendix E. Implementation of the algorithm The implementation of the proposed classification algorithm can be found in Github repository§. The detailed instruction on how to run the scrip is in README.md file in the repository. § https://github.com/antiingel/ITR-optimisation