Article
Domain Adaptation Principal Component Analysis:
Base Linear Method for Learning with Out-of-Distribution Data
Evgeny M. Mirkes 1 , Jonathan Bac 2,3,4 , Aziz Fouché 2,3,4 , Sergey V. Stasenko 5 , Andrei Zinovyev 2,3,4, *
and Alexander N. Gorban 1, *
1 School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK
2 Institut Curie, PSL Research University, 75005 Paris, France
3 Institut National de la Santé et de la Recherche Médicale (INSERM), U900, 75012 Paris, France
4 CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75005 Paris, France
5 Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University,
603000 Nizhniy Novgorod, Russia
* Correspondence: andrei.zinovyev@curie.fr or zinovyev@gmail.com (A.Z.); ag153@le.ac.uk (A.N.G.)
Abstract: Domain adaptation is a popular paradigm in modern machine learning which aims at
tackling the problem of divergence (or shift) between the labeled training and validation datasets
(source domain) and a potentially large unlabeled dataset (target domain). The task is to embed
both datasets into a common space in which the source dataset is informative for training while
the divergence between source and target is minimized. The most popular domain adaptation
solutions are based on training neural networks that combine classification and adversarial learning
modules, frequently making them both data-hungry and difficult to train. We present a method
called Domain Adaptation Principal Component Analysis (DAPCA) that identifies a linear reduced
data representation useful for solving the domain adaptation task. The DAPCA algorithm introduces positive and negative weights between pairs of data points and generalizes the supervised extension
of principal component analysis. DAPCA is an iterative algorithm that solves a simple quadratic
optimization problem at each iteration. The convergence of the algorithm is guaranteed, and the
number of iterations is small in practice. We validate the suggested algorithm on previously proposed
benchmarks for solving the domain adaptation task. We also show the benefit of using DAPCA in analyzing single-cell omics datasets in biomedical applications. Overall, DAPCA can serve as a practical preprocessing step in many machine learning applications leading to reduced dataset representations, taking into account possible divergence between source and target domains.
Keywords: principal component analysis; machine learning; domain adaptation; out-of-distribution generalization; transfer learning; single cell data analysis
Figure 1. The idea behind domain adaptation learning. The source domain has labels and can be used to construct a classifier. The target domain, where the classifier is supposed to work, does not have labels. It is suggested to find a common representation of the two domains such that their distributions maximally match each other, and simultaneously to build an efficient classifier using this representation and the available labels.
This representation should be insensitive to the differences between the data distribu-
tions underlying source and target domains and, at the same time, should not hinder the
classification task in the labeled source domain. The key question in domain adaptation-
based learning is the definition of the objective functional: how to measure the difference between the probability distributions of the source and the target domain samples. One possible
approach consists of adversarial training [1,5]:
• Select a family of classifiers in data space;
• Choose the best classifier from this family for separating the source domain samples
from the target ones;
• Use the error of this classifier as the objective function to maximize (a large classification error means that the samples are indistinguishable by the selected family of classifiers).
In domain adaptation, one usually talks about two complementary subsystems that ideally must be trained simultaneously. The first one is a classifier that labels a feature vector as either source or target and whose error is maximized. The second one is a
feature generator that learns features that are as informative as possible for the classification
task. Theoretical foundations of domain adaptation based on H-divergence between
source and target domains and its estimates from finite datasets have been suggested
in [5]. Here we understand domain adaptation as an approach to a more general out-
of-distribution (OOD) generalization problem [6], and understand OOD as the situation
where the unlabeled dataset has a distribution different from the labeled one.
One of the most popular applications of domain adaptation in computer vision was implemented using the neural network framework known as Domain-Adversarial Neural Networks (DANN) [1,7], based on the principle, outlined above, of combining classification and adversarial learning modules. It is known that adversarial learning using
neural networks is computationally heavy and data hungry. Therefore, one can ask whether there exists a simple baseline linear or quasi-linear method for solving the supervised domain adaptation task which would be easier to compute with a small sample size. To the best of our knowledge, such a method has not been suggested so far. This situation contrasts with other areas of machine learning where baseline linear methods pre-existed their neural network-based generalizations (as trivial examples, linear regression pre-existed the sigmoidal multilayer perceptron, and principal component analysis (PCA) pre-existed neural network-based autoencoders).
The adversarial approach outlined above to reduce the shift between domains is not the
only one that can be exploited for this purpose. Methods for aligning multidimensional data
point clouds are well known in machine learning, and they can be used for solving domain
adaptation tasks even without considering labels in the source domain. In particular,
various generalizations of PCA or other matrix factorization approaches computing a
joint linear representation of two and more datasets are widely exploited in machine
learning [8–11]. Other linear methods have been suggested, such as Transfer Component Analysis (TCA), which minimizes the maximum mean discrepancy (MMD) distance [12] between linear projections of the source and the target datasets [3], and the subspace alignment method [13].
Correlation Alignment for Unsupervised Domain Adaptation (CORAL) aligns the original feature distributions of the source and target domains, rather than the bases of lower-dimensional subspaces, and is claimed to be a “frustratingly easy” yet effective approach to domain adaptation in many applications [14]. The computational simplicity of CORAL allows it to be introduced as a component of the loss function in training neural network-based classifiers, so that a deep transferable data representation can be obtained [15]. The MMD measure can also be used for this purpose, as in the Joint Adaptation Networks (JAN) framework, where the joint maximum mean discrepancy (JMMD) criterion is optimized. A
family of methods was suggested for searching for linear projections that are domain-invariant (i.e., that mix the domains) and that optimize the class compactness of data points projected from the source and the target domains [16]. This methodology uses labels in the source domain and introduces pseudo-labels in the target domain, and it was shown to be superior to TCA. Other methods, based on computing reciprocal data point neighborhood relations or on the application of optimal transport theory, have recently become popular in various domains such as single-cell data science, notably for the data integration task [17,18].
In this study, we suggest a novel base linear method called Domain Adaptation Prin-
cipal Component Analysis (DAPCA) for dealing with the problem of domain adaptation.
It generalizes the Supervised PCA algorithm to the domain adaptation problem. The ap-
proach was first outlined in the context of one- and few-shot learning problems [19]. It relies on the definition of weights between pairs of data points, both within the source and the target domains and between them, such that the projections of the data vectors onto the eigenvectors of a simple quadratic form serve as good features with respect to domain adaptation.
The number of such features is supposed to be smaller than the total number of variables
in the data space: therefore, the method also represents a form of dimensionality reduction.
The set of weights can depend on the features selected for representation: therefore, the
base quadratic optimization method is accompanied by iterations such that at each iteration
a simple quadratic optimization task is solved. As with many quasi-quadratic optimization
iterative algorithms, convergence is guaranteed and, in practice, the number of iterations
can be made relatively small.
There exist several linear domain adaptation methods, each of which is characterized
by specific features: for example, some of them produce low-dimensional embedding of the
source and target datasets, and some of them do not. A summary with a short description
of their working principles is provided in Table 1.
Table 1. Summary and comparison of linear Domain Adaptation methods. PCA and SPCA do not
solve the domain adaptation task but are listed here for convenience of comparison.
(Table columns: Method Name; Reference; Principle; Optimization-Based; Low-Dimensional Embedding; Uses Class Labels in Source.)
2. Background
2.1. Principal Component Analysis with Weighted Pairs of Observations
Principal Component Analysis is one of the most used machine learning methods, with applications in all domains of science (e.g., [22,23]). The classical formulation of the PCA problem goes back to Pearson, who introduced it in 1901. It is based on the minimization of the mean squared distance from the data points to their projections onto a hyperplane defined by an orthonormal vector basis [20]. An alternative but equivalent (because of the Pythagorean theorem) definition of principal components is based on the maximization of the variance of the projections onto a hyperplane. This definition became the leading textbook definition [24]. The third equivalent definition is the maximization of the mean squared pairwise distance between the projections of the data points onto a hyperplane.
All these PCA definitions can lead to useful generalizations [25]. A generalization of the third above-mentioned definition, obtained by introducing weights for each pair of projections, was proposed in [19]. Let X = {x_1, ..., x_N} be a dataset of N points and let P be the orthogonal projector onto a q-dimensional linear subspace; the third definition requires maximizing the scattering of the projections:

H = \frac{1}{2}\sum_{i,j=1}^{N} \| P x_i - P x_j \|^2 = \frac{1}{2}\sum_{i,j=1}^{N} \| P(x_i - x_j) \|^2 \to \max.   (1)
For q = 1, the scattering of projections (1) on a straight line with the normalized basis vector e is

H = \frac{1}{2}\sum_{i,j=1}^{N} (x_i - x_j, e)^2 = N\left(\sum_{i=1}^{N} (x_i, e)^2 - N(\mu, e)^2\right) = N^2\, (e, Qe),   (2)
where µ is the mean vector of the dataset X, and the coefficients of the quadratic form (e, Qe) are the elements of the sample covariance matrix Q.
For an orthonormal basis {e_1, ..., e_q} of a q-dimensional plane in the data space, the maximum scattering of the data projections (1) is achieved when e_1, ..., e_q are the eigenvectors of Q corresponding to the q largest eigenvalues of Q (taking into account possible multiplicity) λ_1 ≥ λ_2 ≥ ... ≥ λ_q. This is precisely the standard PCA.
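To make this construction concrete, the following minimal NumPy sketch (ours, not the authors' code; the function name pca_components is an assumption) computes standard principal components exactly as described: eigenvectors of the sample covariance matrix Q sorted by decreasing eigenvalue.

```python
import numpy as np

def pca_components(X, q):
    """Top-q principal components of X (n_samples x n_features): eigenvectors
    of the sample covariance matrix Q sorted by decreasing eigenvalue."""
    Xc = X - X.mean(axis=0)                      # center the data
    Q = Xc.T @ Xc / X.shape[0]                   # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Q)         # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1][:q]        # q largest eigenvalues first
    return eigvecs[:, order], eigvals[order]
```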
In practice, users are usually interested in solving an applied problem, such as classification or regression, rather than in dimensionality reduction, which usually plays an auxiliary role. The first principal components might not align with the features that are most informative from the classification point of view. Therefore, ignoring a certain number of the first principal
components has become a common practice in many applications. For example, the first
principal components are frequently associated with technical artifacts in the analysis of
omics datasets in bioinformatics, and removing them might improve the downstream anal-
ysis [29,30]. Sometimes it is necessary to remove more than ten first principal components
to increase the signal/noise ratio [31].
Principal components can be significantly enriched in terms of the information they
hold for the classification task if we modify the optimization problem (1) and include
additional information in the principal component definition. One way of doing this is to introduce a weight W_ij for each pair of data points [19]:
H_W = \frac{1}{2}\sum_{i,j=1}^{N} W_{ij}\, \| P(x_i - x_j) \|^2 \to \max.   (3)
Figure 2. Illustration of the Domain Adaptation PCA (DAPCA) principle. (A) PCA, Supervised PCA
and DAPCA provide three different ways to reduce the data dimensionality by a linear projection.
DAPCA considers both labeled and unlabeled datasets and computes such projection that the
projection distributions would be as similar as possible. (B) Minimizing the quadratic functional
for finding each linear projection can be interpreted as introducing repulsive and attractive forces
between data point projections. Of course, the data points (shown as 3D spheres) do not actually repel or attract each other and remain fixed; therefore, the terms ‘repulsion’ and ‘attraction’ are used in quotes in this figure.
PCA can be interpreted as a result of effective repulsion between all data point projection pairs. In
projection onto the Supervised PCA plane, the scattering within a data point class is minimized while
the scattering between the classes is maximized. This can be interpreted as the effective attraction of
data point projections for the data points of the same class. In DAPCA, four types of effective “forces”
exist between data point projections: repulsive in source and target datasets, attractive between data
points of the same class in the source dataset, attractive between the data points in the target and the
closest data points in the source dataset.
Following the same logic as for (1), we consider the projection of (3) on a 1D subspace with the normalized basis vector e and define a new quadratic form with coefficients q^W_{lm}:

H_W = \sum_{lm}\left[\sum_{i}\left(\sum_{r} W_{ir}\right) x_{il} x_{im} - \sum_{ij} W_{ij}\, x_{il} x_{jm}\right] e_l e_m = \sum_{lm} q^W_{lm}\, e_l e_m.   (4)
For q-dimensional planes, the maximum of H_W (4) is achieved when the plane is spanned by the q eigenvectors of the matrix Q^W = (q^W_{lm}) that correspond to the q largest eigenvalues of Q^W (taking into account possible multiplicity) λ_1 ≥ λ_2 ≥ ... ≥ λ_q [19]. The difference compared to the standard PCA problem is that, starting from some q, some eigenvalues can become negative, as clarified below.
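The following NumPy sketch (ours, not the authors' implementation) assembles the matrix Q^W of Eq. (4) for a given symmetric weight matrix W and returns its leading eigenvectors; note that, unlike in standard PCA, some of the returned eigenvalues may be negative.

```python
import numpy as np

def weighted_pca_components(X, W, q):
    """Top-q eigenvectors of the matrix Q^W from Eq. (4).
    X: N x d data matrix, W: symmetric N x N weight matrix."""
    row_sums = W.sum(axis=1)                          # sum_r W_ir
    QW = X.T @ (row_sums[:, None] * X) - X.T @ W @ X  # q^W_lm of Eq. (4)
    QW = (QW + QW.T) / 2                              # symmetrize numerically
    eigvals, eigvecs = np.linalg.eigh(QW)
    order = np.argsort(eigvals)[::-1][:q]             # eigenvalues may be negative
    return eigvecs[:, order], eigvals[order]
```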
There are several methods to assign weights in the matrix W:
• Supervised PCA for a regression task. In case the target attribute of the data points is a set of real values t = {t_1, ..., t_N}, t_i ∈ R, the choice of weights in Supervised PCA can be adapted accordingly. We can require that pairs of data points with similar values of the target attribute have smaller weights, and that pairs with very different target attribute values have larger weights. One of the simplest choices of the weight matrix in this case is W_ij = (t_i − t_j)^2 (see the sketch after this list).
• Supervised PCA for any supervised task. In principle, the weights W_ij can be a function of any standard similarity measure between data points. The closer the desired outputs are, the smaller the weights should be. The weights can change sign (from repulsion of the projections, W_ij > 0, to attraction, W_ij < 0) or change the strength of the projection repulsion.
• Semi-supervised PCA was defined for a mixture of labeled and unlabeled data [27]. In
this case, different weights can be assigned to the different types of pairs of data points
(both labeled in the same class, both labeled from different classes, one labeled and
one unlabeled data point, both unlabeled). One of the simplest ideas here can be that
projections of unlabeled data points effectively repulse (have positive weights), while
the labeled and unlabeled projections do not interact (have zero weights).
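As an illustration of the first and third items above, a minimal sketch of two such weighting schemes is given below (our code; the function names and the particular semi-supervised convention shown, with unlabeled pairs repulsing with weight α and labeled-unlabeled pairs getting zero weight, follow the simple choices described in the text).

```python
import numpy as np

def regression_spca_weights(t):
    """Supervised PCA weights for a real-valued target: W_ij = (t_i - t_j)^2."""
    t = np.asarray(t, dtype=float)
    return (t[:, None] - t[None, :]) ** 2

def semisupervised_weights(labeled_mask, alpha=1.0):
    """Simple semi-supervised choice: unlabeled-unlabeled pairs repulse
    (weight alpha > 0), labeled-unlabeled pairs do not interact (weight 0);
    the labeled-labeled block would be filled by a supervised rule."""
    m = np.asarray(labeled_mask, dtype=bool)
    W = np.zeros((m.size, m.size))
    unlabeled = ~m
    W[np.ix_(unlabeled, unlabeled)] = alpha
    np.fill_diagonal(W, 0.0)
    return W
```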
The choice of the number of retained components for further analysis is a nontrivial
question even for the classic PCA [32]. The most popular methods are based on evaluating
the fraction of (un)explained variance or, equivalently, the mean squared error of the data
approximation by the PCA hyperplane for different q. These methods take into account
only the measure of approximation. However, in the case of Supervised PCA, the number of
components needs to be optimized with respect to the final classification or regression task.
For weighted PCA where some of the weights are negative, some of the eigenvalues can also
become negative. Let us have k positive eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λk > 0 and d − k non-
positive ones 0 ≥ λ_{k+1} ≥ ... ≥ λ_d. Increasing the number of used principal components above k increases the accuracy of the dataset approximation but does not increase the value of the target function H_W (4), so the data features defined by principal components of order greater than k are not useful for the downstream classification task. Therefore, the standard practice is to use only the eigenvectors that correspond to non-negative eigenvalues [33].
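Given the eigenvalues of Q^W (e.g., as returned by the weighted PCA sketch above), the number k of components worth keeping follows directly; the helper name below is ours.

```python
import numpy as np

def n_useful_components(eigenvalues):
    """Number k of components that increase the target functional H_W:
    only eigenvectors of Q^W with positive eigenvalues contribute."""
    return int(np.sum(np.asarray(eigenvalues) > 0))
```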
If the classes are strongly unbalanced, the repulsion and attraction of projections from the smaller class will play a negligible role. Changing the value of α cannot fix this imbalance in the relative influence of attraction in the two classes.
Therefore, it appears reasonable to normalize the weights taking into account the
class sizes:
W_{ij} = \begin{cases} \dfrac{1}{2 N_p N_r}, & \text{if } L_p = l_i \neq l_j = L_r, \\[2mm] -\dfrac{\alpha}{N_r (N_r - 1)}, & \text{if } l_i = l_j = L_r, \end{cases}   (6)
where Nr is the number of data points of the class with label Lr . Weight matrix (6) equili-
brates the strengths of projection attraction within each class and the repulsion of projections
between two different classes.
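A direct, unoptimized sketch of the weight matrix (6) is shown below (our code; labels can be any hashable values, and α is the within-class attraction strength).

```python
import numpy as np

def balanced_class_weights(labels, alpha=1.0):
    """Weight matrix of Eq. (6): repulsion 1/(2*N_p*N_r) between different
    classes and attraction -alpha/(N_r*(N_r - 1)) within a class."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    size = dict(zip(classes, counts))
    n = labels.size
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            Ni, Nj = size[labels[i]], size[labels[j]]
            if labels[i] != labels[j]:
                W[i, j] = 1.0 / (2 * Ni * Nj)          # between-class repulsion
            else:
                W[i, j] = -alpha / (Ni * (Ni - 1))     # within-class attraction
    return W
```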
More generally, attraction and repulsion between data point projections can be fine-
tuned using a priori knowledge about the expected similarity between class labels. For
example, this can be the case of ordinal class labels (where there exists a meaningful ranking
of class labels). Let us consider the most general form of coefficients of attraction in one
class and repulsion in different classes:
\Delta = \begin{pmatrix} \delta_{11} & \delta_{12} & \dots & \delta_{1n} \\ \delta_{21} & \delta_{22} & \dots & \delta_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \delta_{n1} & \delta_{n2} & \dots & \delta_{nn} \end{pmatrix}.   (7)
This matrix allows us to define the weight matrix in the following form:
W_{ij} = \begin{cases} \dfrac{\delta_{pr}}{2 N_p N_r}, & \text{if } L_p = l_i \neq l_j = L_r, \\[2mm] \dfrac{\delta_{rr}}{N_r (N_r - 1)}, & \text{if } l_i = l_j = L_r. \end{cases}   (8)
The features should be chosen in order to achieve the maximum performance of the best classifier C_1 from the family with respect to distinguishing labels in X and, at the same time, to minimize the performance of the best classifier C_opt from the family in distinguishing points of X from points of Y. In the simplest case, this means that the vectors f_k(X) and f_k(Y) should be similar in some reasonable metric for every k, but, strictly speaking, this does not have to be the general case.
In this study, we are interested in finding a set of linear features {f_k} that are optimal for the domain adaptation task. At the same time, we do not assume that the family of classifiers should be restricted to linear ones. Indeed, one of the most important applications of linear domain adaptation is to define a restricted set of features {f_k, k = 1, ..., q}, each obtained as a weighted sum of the initial data variables χ_1, ..., χ_m, which can then be used for training a non-linear classifier (see examples below).
As usual, from general considerations, we expect that a set of linear functions f_k that is optimal with respect to the domain adaptation problem should approximate the initial dataset X sufficiently well. This means that a reasonable approach to finding the optimal functions f_k should be based on some kind of adaptation of the PCA problem.
3. Methods
3.1. Semi-Supervised PCA for a Joint Data Set
The main result of this study is the introduction of a novel linear algorithm for domain adaptation, representing a generalization of Supervised PCA to the case when one has a labeled source dataset X and an unlabeled target dataset Y. As described above, we look for a common linear representation of X and Y in which their multivariate distributions would be as similar as possible, while the accuracy of the classification task (using an appropriate, and not necessarily linear, classifier) for X in this representation remains acceptable.
We have a labeled source dataset X = {x_i} with N_X data points, a set of labels l_i for each point in X, and an unlabeled target dataset Y = {y_i} with N_Y points. The semi-supervised PCA for the joint dataset is defined by the functional (4) computed with the block weight matrix

W = \begin{pmatrix} W^{XX} & W^{XY} \\ W^{YX} & W^{YY} \end{pmatrix},

where W^{XX} is the matrix of SPCA (8), W^{XY} = (W^{YX})^\top are zero matrices (W^{XY}_{ij} = 0 for all i, j), and W^{YY} = \beta J_{N_Y N_Y}, where \beta > 0 is the coefficient of repulsion for the target dataset Y.
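A minimal sketch of assembling this block weight matrix is given below (the helper name joint_weight_matrix is ours; the diagonal of the target block has no effect because terms with i = j cancel in (4)).

```python
import numpy as np

def joint_weight_matrix(W_XX, n_target, beta=1.0):
    """Block weight matrix for the joint dataset [X; Y]: the SPCA matrix W_XX
    for the labeled block, zero cross blocks, and beta * J (matrix of ones)
    for the target block."""
    n_source = W_XX.shape[0]
    W = np.zeros((n_source + n_target, n_source + n_target))
    W[:n_source, :n_source] = W_XX
    W[n_source:, n_source:] = beta                 # beta * (matrix of ones)
    return W
```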
The modified algorithm is characterized by an increased computational time compared to simple Supervised PCA (see Appendix A). The vector w^S (A1) becomes larger, but all additional terms have the same value \beta/(N_Y - 1), and the time required for this is T_\times, which is negligible compared to the other summands. The first summand in (4) requires a longer summation (additional time N_Y d^2 (2T_\times + T_+)). We also need to calculate the vector s_Y = \sum_{y \in Y} y (additional time d N_Y T_+). The last additional calculation is the computation of the matrix Y^\top W^{YY} Y and its addition to the result (additional time d^2 T_\times and d^2 T_+). Overall, the computational time of the semi-supervised version of PCA is the SPCA time plus the additional terms listed above.
\tilde{\mu}_X = \frac{1}{N_X}\sum_{\tilde{x}\in \tilde{X}}\tilde{x}, \qquad \tilde{\mu}_Y = \frac{1}{N_Y}\sum_{\tilde{y}\in \tilde{Y}}\tilde{y},
where the weights W_ir are assigned following the same rules as in semi-supervised PCA (10), and φ > 0 is the attraction coefficient between the mean points of the data samples in X and Y.
For computing the matrix Q^W (13), the accelerated algorithm (A4)–(A11) can be used.
The main advantage of TCA is its low computational complexity. In addition to the semi-supervised PCA (4), it is necessary to calculate only the vector s_X = \sum_{\tilde{x}\in\tilde{X}} \tilde{x} (additional time d N_X T_+), the vectors of means \tilde{\mu}_X = s_X/N_X and \tilde{\mu}_Y = s_Y/N_Y (additional time 2 d T_\times), one more matrix \tilde{\mu}_Y^\top \tilde{\mu}_Y (additional time d^2 T_\times), and to add these matrices to the result (additional time d^2 T_+). Therefore, the computational time required for TCA is only marginally larger than that of the semi-supervised PCA.
where k is the number of nearest neighbours, kNN(y) is the set of k labeled nearest neighbours of a data point y ∈ Y, and γ is the effective attraction coefficient between the projection of y ∈ Y and the projection of each data point x ∈ kNN(y).
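A sketch of building the target-to-source attraction block with a kNN search is shown below, using scikit-learn's NearestNeighbors; dividing the attraction weight by k is our assumption and may differ from the exact normalization used by the authors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def target_source_attraction(X_proj, Y_proj, k=5, gamma=1.0):
    """Cross-block attraction weights between each target point and its k
    nearest labeled neighbours in the source domain (negative = attraction)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_proj)
    _, idx = nn.kneighbors(Y_proj)               # idx[j] = kNN(y_j) in X
    W_YX = np.zeros((Y_proj.shape[0], X_proj.shape[0]))
    for j, neighbors in enumerate(idx):
        W_YX[j, neighbors] = -gamma / k          # attraction weight (assumed -gamma/k)
    return W_YX
```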
However, the matching between the data points in two domains using the kNN
approach can be strongly affected by the differences between X and Y, including the
simplest translations. Here we deal with a sort of “chicken or egg” problem. To define the
neighbors between a data point in Y and the data points in X, one has to know the best
representation of both datasets, so they would be as similar as possible. On the other hand,
to find this representation using DAPCA, we need to know the “true” data point neighbors.
As usual, this problem can be approached by iterations. For the first iteration, we define the nearest neighbors in the initial data variables (alternatively, one can use any other suitable metric, such as a reduced PCA-based representation). This gives us the q-dimensional plane of principal components (the eigenvectors of Q^W) with the orthogonal projector P_1 onto it. Afterward, for each target sample y ∈ Y we find the k nearest neighbors kNN(y) among the source samples x ∈ X in the projection onto this plane.
This definition of the neighbors leads to a new W_ij, which we use to find the new projector P_2 and to define the new nearest neighbors. Afterward, we iterate. The iterations are guaranteed to converge in a finite number of steps because the functional H_W (4) increases at each step (similarly to k-means and other splitting-based algorithms). In practice, it is convenient to use an early stopping criterion, which can already produce a useful feature set. Our experiments show that the typical number of iterations can be below 10.
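Putting the pieces together, the iteration described above can be sketched as follows, reusing the helper sketches defined earlier; this is an illustration of the scheme under our assumptions, not the reference DAPCA implementation from the authors' repository.

```python
import numpy as np

def dapca_iterations(X, Y, W_base, q, k=5, gamma=1.0, max_iter=10):
    """Alternate between projecting the joint data and re-assigning the
    target-to-source nearest neighbours until the assignment stabilizes."""
    Z = np.vstack([X, Y])
    nX = X.shape[0]
    V = None                                       # first pass: original variables
    prev_mask = None
    for _ in range(max_iter):
        Xp, Yp = (X, Y) if V is None else (X @ V, Y @ V)
        W_YX = target_source_attraction(Xp, Yp, k=k, gamma=gamma)
        W = W_base.copy()
        W[nX:, :nX] = W_YX                         # attraction target -> source
        W[:nX, nX:] = W_YX.T                       # keep W symmetric
        V, _ = weighted_pca_components(Z, W, q)    # solve the quadratic problem
        mask = W_YX < 0                            # current neighbour assignment
        if prev_mask is not None and np.array_equal(mask, prev_mask):
            break                                  # assignment unchanged: stop
        prev_mask = mask
    return V
```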
Since the DAPCA algorithm is iterative, estimating its computational complexity is difficult. Of note, using the accelerated algorithm for calculating the matrix W allows us to compute the constant part of the matrix Q^W, which corresponds to the semi-supervised PCA, only once, and then to recalculate at each iteration only the part of Q^W related to W^{XY} (see Appendix A).
4. Results
4.1. Neural Architectures Used to Validate DAPCA on Digit Image Data
We used ready-made PyTorch implementations of the neural network-based classifiers from [1], downloaded from https://github.com/vcoyette/DANN (accessed on 23 September 2021).
Figure 3. Toy 3D dataset used to test the DAPCA algorithm. (A) Configuration of data points of
two classes in the source domain (green and yellow data points) and in the target domain (grey data
points). The target domain distribution differs from the source domain by a shift along the second
coordinate (the degree of the shift is different for two classes of the source domain), by the different
balance of class composition and by the different variance scales within each class. (B) Application of
three flavors of PCA, showing the projections onto the first two principal components (on the left)
and the histogram of projections on the first principal component (on the right). (C) Comparing the
accuracy of predicted labels in the target domain and the self-consistency of domain adaptation by
DAPCA for a range of key DAPCA parameters.
We applied the three flavors of PCA described above: PCA, Supervised PCA (SPCA), and Domain Adaptation PCA (DAPCA). For each flavor, we computed the first two principal components (out of three possible). Neither the standard PCA nor SPCA aligned the source and the target domains, as expected. SPCA produced better-separated classes in the source domain. DAPCA applied with parameters α = 1, γ = 100 produced a representation of the
data in which both source and target domains were well aligned and at the same time the
class labels in the source domain were well separated (see Figure 3B).
DAPCA results were stable in a large interval of the parameters α, γ (Figure 3C).
We also found that the number of nearest neighbors in the kNN graph is not a sensitive
parameter. The number of iterations of the DAPCA algorithm producing the correct
alignment of the source and target datasets was approximately ten.
We used the simplest support vector classifier (SVC) in order to predict labels in the
target domain using known labels in the source domain. The classifier was trained using
the linear features computed by DAPCA. Since in the toy example we knew the hidden
labels in the target domain by design, we could estimate both accuracy and self-consistency
measures of domain adaptation as described in the Methods section (Figure 3C). The
pattern of computed self-consistency in a range of parameter values α, γ was informative
for anticipating the balanced accuracy of the prediction (correlation coefficient around 0.75).
Combinations of parameters leading to large self-consistency values corresponded to large prediction accuracy. However, the opposite was not true: small values of self-consistency could correspond to either high or low accuracy.
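A sketch of this evaluation step is shown below, assuming Xs_proj and Yt_proj hold the source and target data projected onto the DAPCA components and labels_target holds the hidden ground-truth labels of the toy example (all variable names are placeholders).

```python
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

# Xs_proj, Yt_proj: source/target data projected onto the DAPCA components
# (e.g., Xs_proj = X_source @ V); labels_source are known, labels_target are
# the hidden ground-truth labels of the toy example.
clf = SVC(kernel="linear").fit(Xs_proj, labels_source)
pred = clf.predict(Yt_proj)
print("balanced accuracy on target:", balanced_accuracy_score(labels_target, pred))
```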
Using the same toy example, we compared several linear domain adaptation methods,
listed in Table 1. The toy example is designed in such a way that the projections on the first principal component do not separate the classes well in either the source or the target domain. In addition, the covariance matrices and the class balance are not exactly the same
in the source and the target. As a result, those linear methods of domain adaptation that
do not take into account class labeling information struggled to align the 2D projection
distributions, and DAPCA was the only method that resulted in a good alignment of
two classes, see Figure 4. The Python notebook with the code of this test is provided at
https://github.com/mirkes/DAPCA.
Figure 4. Comparison of linear domain adaptation methods, using the two-class toy example from Figure 3. For CORAL, projections on all three dimensions are shown together with PCA, because CORAL does not reduce the data dimensionality. Therefore, CORAL accuracy was computed in the full feature space (marked as CORAL in the table) and after reducing the dimensionality of the merged source and target datasets transformed by CORAL (marked as CORAL+PCA). The accuracy of the domain adaptation task was estimated with known ground-truth target domain labels, using the standard Support Vector Classifier implementation in sklearn, run with default parameters. The bold font indicates the maximum accuracy.
The Amazon reviews dataset represents a set of items of various kinds (books, kitchen, dvd, electronics), characterized by text reviews and an annotated binary sentiment score (positive or negative). The text of a review is represented as a vector in a multi-dimensional space by using an embedding method that produces numerical features that can be ordered by their information importance. In
our experiments, we took the first 1000 features from a small Amazon reviews subset ob-
tained from https://github.com/GRAAL-Research/domain_adversarial_neural_network,
accessed on 10 December 2021. We trained a simple logistic regression either on the full set
of 1000 features or using the reduced dataset with PCA, SPCA or DAPCA. The regression
was trained using the labels for one item type as a source domain and then tested using
the items of another type as a target domain. In most pairwise comparisons between item
types, DAPCA provided the best set of features for the classification of items in the target
domain, see Figure 5.
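The evaluation protocol can be sketched as follows; the variable names Xs, ys, Xt, yt, and V are placeholders for the source/target review embeddings, the sentiment labels, and the 200 retained components, respectively.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Xs, ys: source reviews (1000 embedding features) with sentiment labels;
# Xt, yt: reviews of the target item type; V: 200 components from PCA, SPCA
# or DAPCA (all names are placeholders for this sketch).
clf = LogisticRegression(max_iter=1000).fit(Xs @ V, ys)
print("target-domain accuracy:", accuracy_score(yt, clf.predict(Xt @ V)))
```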
Figure 5. Validating DAPCA using Amazon review dataset. Source and target lines indicate the
performance of the prediction separately on the source and target domains (without domain adapta-
tion). Other lines correspond to the performance of logistic regression trained on different features:
all features (FULL), PCA, SPCA, and DAPCA (200 top components were taken for each method).
DAPCA parameters used here were α = 0, γ = 1, kNN = 5.
We visualized the learned representations using the Uniform Manifold Approximation and Projection (UMAP) method. Namely, we compared three visualizations of the digit image representations:
(1) one obtained by training the CNN part of the neural network and using the source
domain only; (2) one obtained by applying DAPCA on top of (1), using the target domain
without labels; (3) one obtained through application of the full DANN architecture, using
both source domain with labels and target domain without labels. The results are shown in
Figure 6C,D.
Figure 6. Validation of DAPCA in digit image classification using two distinct domains. (A) the
original DANN adversarial learning-based architecture for solving the domain adaptation task. The
image is adapted with permission from [1]. (B) Simplified DAPCA-based architecture for domain
adaptation. The domain adaptation is performed for the features recorded from the last layer of
the neural network before applying the last classification step, which can be replaced with logistic
regression. (C,D) Computing the domain adaptation benefit for several architectures: CNN: no
domain adaptation, CNN/DAPCA: as shown in panel (B), DANN: adversarial learning-based
domain adaptation. UMAP visualizations of internal image representations from the source and
the target domains are shown on the plot. The text reports the accuracy of classification from these
representations using logistic regression. “Max” specifies the maximum achievable accuracy when
the CNN classifier is trained directly on the target domain with known labels.
We quantified the domain adaptation benefit b of each architecture relative to the theoretical maximal performance achievable if all labels in the target domain were known. Thus, in the task with the MNIST dataset as the source and MNIST-M as the target domain, DAPCA resulted in b = 8% compared to b = 79% obtained by the DANN architecture. In another example (SVHN as the source and MNIST as the target domain), DAPCA resulted in b = 23% while DANN resulted in b = 57%. Such modest performance of DAPCA compared to DANN can be explained by the fact that most of the learning in the DANN architecture from Figure 6A happens in the convolutional layers of the feature extractor, and this learning uses examples from the target domain. The DAPCA-based domain adaptation shown in Figure 6B does not use the examples from the target domain at all for learning the image representation, so the result is not surprising. On the other
hand, we can document a measurable and significant benefit from applying DAPCA to the domain adaptation task at the very late layers of the neural network. This means that, potentially, a variant of DAPCA can be used as a layer on top of a trainable convolutional feature extractor (similarly to the well-known Deep CORAL approach [15]), but building such an architecture is beyond the scope of the current study.
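A sketch of the pipeline of Figure 6B is given below, assuming a feature_extractor module and data loaders prepared as in the DANN reference implementation (these names are assumptions), and reusing the DAPCA helper sketches from the Methods section; dense weight matrices and q = 50 are used here only for illustration.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def extract_features(model, loader, device="cpu"):
    """Collect the outputs of the CNN feature extractor for a whole loader."""
    model.eval()
    feats, labels = [], []
    with torch.no_grad():
        for x, y in loader:
            feats.append(model(x.to(device)).cpu().numpy())
            labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# feature_extractor, source_loader, target_loader are assumed to come from the
# DANN reference setup; DAPCA helpers are the sketches defined in Section 3.
Fs, ys = extract_features(feature_extractor, source_loader)
Ft, _ = extract_features(feature_extractor, target_loader)
W_base = joint_weight_matrix(balanced_class_weights(ys), Ft.shape[0])
V = dapca_iterations(Fs, Ft, W_base, q=50)           # 50 DAPCA features (arbitrary)
clf = LogisticRegression(max_iter=1000).fit(Fs @ V, ys)
target_predictions = clf.predict(Ft @ V)
```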
Of note, training the DANN architecture shown in Figure 6A is rather computationally heavy (tens of hours on a CPU), while the DAPCA-based domain adaptation shown in Figure 6B requires much less time (a few minutes on a CPU).
We repeated the same benchmark in the context of a small sample size by using subsampled digit image datasets (only 3000 images for training and testing in each domain). The qualitative conclusions remained unchanged: the simple DAPCA-based solution was less performant than the full-scale DANN architecture, although the difference between the corresponding performances was less striking.
In the lung dataset [37], cells were sorted into four compartments (immune, endothelial, epithelial, or stromal). Sorted cell libraries were prepared using the 10X Genomics 3’ Single Cell V2 protocol, then pooled and sequenced on a NovaSeq 6000 (Illumina). According to the authors, reads were demultiplexed using Cell Ranger v2.0, and cells with fewer than 500 genes or 1000 UMIs were discarded, ending up with 65,667 valid cells.
We preprocessed the three raw count matrices independently. First, each cell was normalized to 10,000 counts, and a log(1 + x) transformation was applied according to existing preprocessing standards. To reduce noise, we pooled each cell with the average of its 5 nearest neighbors (using the Euclidean metric in the space of the dataset's 30 first principal components). Eventually, we selected the 10,000 most variable genes in each dataset, ending up with three preprocessed expression matrices, each expressed in a 10,000-feature space.
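The preprocessing of one sample can be sketched with scanpy as follows; adata is assumed to be an AnnData object with raw counts, and details such as whether a cell is counted among its own neighbors follow the simplest kNN behavior rather than the authors' exact code.

```python
import numpy as np
import scanpy as sc
from sklearn.neighbors import NearestNeighbors

# adata: AnnData object with the raw counts of one sample (placeholder name).
sc.pp.normalize_total(adata, target_sum=1e4)      # 10,000 counts per cell
sc.pp.log1p(adata)                                # log(1 + x) transform
sc.pp.pca(adata, n_comps=30)                      # 30 first principal components
nn = NearestNeighbors(n_neighbors=5).fit(adata.obsm["X_pca"])
_, idx = nn.kneighbors(adata.obsm["X_pca"])       # 5 nearest neighbors per cell
X = adata.X.toarray() if hasattr(adata.X, "toarray") else np.asarray(adata.X)
adata.X = X[idx].mean(axis=1)                     # pool each cell with its neighbors
sc.pp.highly_variable_genes(adata, n_top_genes=10000)
adata = adata[:, adata.var["highly_variable"]].copy()
```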
Lung tissue contains a complex, hierarchical population of cells of various types and states organized into different compartments. Strong differences in the specific genes expressed in each of these compartments cause cell-associated gene expression vectors to form more or less compact clusters in the multi-dimensional gene space (see Figure 7A,B). We can see that the data point clouds corresponding to the three lung datasets do not overlap, even when looking at cells from different datasets associated with the same compartment. We suggest using DAPCA instead of standard PCA to carry out dimensionality reduction, taking into account the author-provided cell annotations (endothelial, stromal, epithelial and immune).
We set the first dataset to be the source domain, as it contains the largest number of cells,
and we consider the union of the two other datasets to be the target domain. DAPCA is
aimed at finding a low-dimensional linear projection (with only a few tens of features) in
which the multivariate distribution of projections from different samples would appear
as similar as possible, while cells with labels related to different compartments would be
maximally separated.
Application of DAPCA in this context is shown in Figure 7A. Visual inspection of the resulting projections onto the first 30 components extracted by PCA, SPCA and DAPCA, followed by visualization using UMAP, shows that the DAPCA projections overlap much better than the PCA and SPCA projections (Figure 7A, top panels). At the same time, the separation
between cell types remains well preserved (Figure 7A, bottom panels).
In order to quantify the effect of domain adaptation, we trained a simple kNN classifier
(k = 20) to predict the dataset of origin of each cell within the DAPCA representation. We
expect the classifier to perform poorly when domain adaptation is successful, meaning
that the source and target datasets are indistinguishable. It also makes sense to normalize
the performance of such classifier with respect to its baseline level accuracy which can be
estimated by randomly permuting the labels of the datasets. Both absolute and normalized
accuracies are shown in Figure 7C. Comparison between PCA, SPCA and DAPCA using this strategy confirms that DAPCA outperforms the two other methods at making cells less distinguishable with respect to their dataset of origin.
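A sketch of this evaluation is shown below; the exact normalization used in the paper may differ, and here we report (accuracy − baseline)/(1 − baseline), which is close to zero when the datasets are indistinguishable.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def mixing_score(Z, batch_labels, k=20, n_perm=10, seed=0):
    """Accuracy of a kNN classifier predicting the dataset of origin from the
    reduced representation Z, plus a normalized score relative to the chance
    level estimated by permuting the batch labels."""
    rng = np.random.default_rng(seed)
    clf = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(clf, Z, batch_labels, cv=5).mean()
    baseline = np.mean([
        cross_val_score(clf, Z, rng.permutation(batch_labels), cv=5).mean()
        for _ in range(n_perm)
    ])
    return acc, (acc - baseline) / (1.0 - baseline)
```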
We also observe that DAPCA does not merge cells belonging to different compartments equally well (Figure 7C). For instance, domain adaptation applied to endothelial cells appears to be close to the theoretical optimal performance. On the other hand, domain adaptation applied to the cells from the stromal compartment was less successful. This could be explained by the high heterogeneity within the cells annotated as stromal, which are
grouped into four different clusters. We followed this first analysis step by extracting the
subparts of the datasets corresponding to the stromal cells and we applied PCA, SPCA,
and DAPCA to this subset of cells. In order to apply SPCA and DAPCA we defined new
labels in Dataset 1, which serves as the source domain, by clustering it with the standard Louvain
clustering algorithm (these clusters are shown in color in Figure 7B) such that the clusters
hypothetically correspond to major subpopulations within the stromal cell compartment.
Four such subpopulations were identified. The DAPCA-based domain adaptation in this case shows performance close to optimal (Figure 7D). Interestingly, three out of four clusters seemed to match and at least partially mix with the cells from the target domain (Datasets 2 and 3). One of the clusters appeared to remain specific to the
source domain (Dataset 1) and could correspond to a subpopulation of lung cells specific to Dataset 1, located in the stromal compartment.
Figure 7. Application of DAPCA for the task of integrating single-cell datasets (three healthy lung
tissue samples, in this case, the count data is used with permission from [37] using the publicly
available URL https://www.synapse.org/Synapse:syn21041850/files/, accessed on 20 December
2022). (A) The result of the global application of DAPCA to all data points in three domains. Top
panel: UMAP visualizations on top of 30 components extracted by PCA, SPCA, and DAPCA with
colors corresponding to the major cell type annotations. Bottom panel: same as the top panel but
with colors corresponding to three different samples. A cluster of data points from Sample 1 is
marked by a red star which appears to be dataset-specific in the PCA projection. This cluster becomes
well-integrated in the target domain in the DAPCA projection. (B) Application of DAPCA locally
to a subpart of the cell populations in three samples (only stromal cells). The labels in the source
domain are defined here through Louvain clustering of the source domain (blue, orange, green, and
red colors). The panel “After DAPCA” shows the UMAP visualization on top of 30 components
computed by DAPCA, from which one can determine the existence of a sample-specific cluster of
cells (green color) in the source domain (Sample 1) that does not match any other clusters in the
target domain. (C,D) Measuring the performance of the domain adaptation task for the global and local applications of DAPCA, respectively. The suffix “_n” indicates the normalized performance computed in the way described in the text. The smaller the accuracy of the kNN classifier trying to distinguish between the samples, the better the domain adaptation task was solved. In particular, a close-to-zero normalized performance of the classifier indicates theoretically maximal domain adaptation, as could be achieved by permuting the labels corresponding to the samples.
Overall, we can conclude that DAPCA can be used as a tool for simultaneously inte-
grating scRNA-seq datasets from different origins as well as reducing their dimensionality,
as long as cell annotations are available for at least one dataset. We furthermore showed that DAPCA is able to preserve cell–cell similarity in a biological sense, meaning that cells with similar compartments and expression profiles remain close to one another after applying the algorithm. Compared to other widely used techniques, DAPCA is based on
linear dimensionality reduction which does not tend to overfit the data integration task. In
particular, it naturally allows one to consider the existence of specific parts in the source
or in target domains that can have specific biological properties and should not be easily
matched between the source and the target domains. In addition, we show that DAPCA
transformation of the data can be computed locally with respect to a subpart of the data
point cloud which might lead to better performance than the global domain adaptation.
5. Discussion
In this paper, we suggest a novel base linear method for solving the problem of domain
adaptation which can serve as a preprocessing step for the application of more sophisticated
and non-linear machine learning approaches. The method is named Domain Adaptation Principal Component Analysis (DAPCA) because it represents a generalization of principal component analysis. As input, DAPCA takes a pair of datasets (source X and target Y), one of which is labeled (the source) and the other is not (the target). Formally, one of these datasets
can be empty. If the target domain Y is empty then DAPCA degenerates to the supervised
PCA in the source domain X. If the source domain X is empty, DAPCA degenerates to the
standard PCA in the target domain. If the source domain X contains only one label then
DAPCA represents a specific version of consensus PCA which can be used to solve the
data integration task. The classical domain adaptation problem (which is sometimes called
unsupervised in the sense that no label information is available for Y) can be extended to
the semi-supervised case (where partial information on labels in Y is known). DAPCA
can be easily adapted to this situation, too, by introducing the proper weighting schema
between pairs of data points.
Many modern datasets are characterized by a large number of variables, so that the corresponding data point clouds formally exist in a high- or very high-dimensional space. A typical step in analyzing such datasets is dimensionality reduction to a more manageable number of dimensions (e.g., a few tens or even 3–4). For example, this is the typical
case of omics data in the field of biology, including single-cell data [38,39]. If this number
is close to an estimated intrinsic dimensionality [40,41] of the data, then this step does
not lead to a significant loss of information. The reduction is frequently made through
the use of the classical PCA. DAPCA allows a user to easily replace this step when the
divergence between the source and the target datasets is suspected. In addition, it takes into
account the labeling data. The iterative DAPCA also helps to resolve the classical distance concentration difficulty (a manifestation of the curse of dimensionality): in really high-dimensional distributions, the kNN search may be affected by the distance concentration phenomenon, in which most of the distances are close to the median value [42]. It was shown that the use of fractional norms or quasinorms does not save the situation [43]. However, dimensionality reduction may help to overcome this difficulty.
DAPCA is based on a data point matching step, in which for each point from the target dataset one has to indicate the most similar data points (with respect to an appropriate metric) from the source dataset. In the current implementation, the simplest kNN approach is used for this purpose, but this step can be made more sophisticated. Some ideas can be borrowed from known methods of data fusion in machine learning, such as the use of mutual (reciprocal) nearest neighbors or the application of optimal transport-based algorithms for matching the points of two finite data point clouds [18].
Supervised PCA and DAPCA can also be used as fast preprocessing steps for un-
supervised non-linear methods of data analysis and data approximation, enabling them
to take into account the data point labeling information. Therefore, they can make other
methods at least partially supervised. For example, elastic principal graphs [44,45], self-
organizing maps [46], UMAP [47], t-SNE [48], or Independent Component Analysis [29]
can directly benefit from DAPCA or SPCA as preprocessing steps. Such an approach can
find applications in many domains, such as bioinformatics or single-cell data science [35].
As expected, our study shows that neural-network-based classifiers equipped with an adversarial module that tries to distinguish the source from the target domain (such as DANN) achieve better performance than the linear and more constrained DAPCA approach when tested on imaging data. This is partly explained by the fact that the
convolutional layers of DANN are trained on the information from both source and target
domains, while in our comparison DAPCA used the image representation trained on the
source domain only. Linear methods such as DAPCA are deterministic, computationally
efficient, reproducible, and relatively easily explainable. Therefore, the linear approaches
occupy a niche in those machine learning applications where such features are more
important than the maximum possible accuracy. Training neural networks and especially
choosing their architectures remains an art, requiring intuition, experience, and a lot of
computational resources, but this can lead to superior results in terms of accuracy. In a sense, DAPCA stands in the same relation to DANN as PCA does to auto-associative neural networks (neural-network-based autoencoders) [49]. However, PCA was introduced almost a century before neural-network-based autoencoders, while a standard, fully deterministic and computationally efficient linear approach to domain adaptation, based on optimization and using labels from the source domain, has been lacking. Introducing Domain Adaptation PCA fills this gap.
DAPCA, like any other method of domain adaptation, has certain limitations in some
data analysis scenarios. The application of DAPCA requires the user to specify the values
of several hyperparameters (the strength of the attraction force between the points of
the same class, the attraction force between the domains, and the number of the nearest
neighbors as the most important ones). Even though recommended values of these parameters can be taken from practice, some fine-tuning might still be required in a concrete application. Therefore, in simple situations, other, simpler linear approaches for domain adaptation might perform similarly to DAPCA while being more convenient in applications. When an essentially non-linear encoding of the input data
is needed (as in the case of the image data analysis), neural-network-based architectures
might be a preferable choice on the other side of the regularity-flexibility trade-off. DAPCA
is, by design, more difficult to integrate as a component into more complex deep classifiers,
compared to some other linear domain adaptation approaches. Enabling this option can be
an important direction for future work.
Nevertheless, we have clearly demonstrated that applying DAPCA might be preferable
to other methods in certain scenarios. For example, we showed that its application would
be beneficial when both source and target domains are characterized by important sources
of variance that do not coincide with the subspace where the best separation of classes is achieved. Such a situation is rather typical in analyzing omics data in biology, where the first principal components are frequently associated with technical factors or with biological factors irrelevant to the sample classification.
Therefore, we are confident that DAPCA can be a useful addition to the toolbox of domain adaptation methods, and there definitely exist niches where we expect the application of DAPCA to be preferred over other existing methods of domain adaptation.
Author Contributions: Conceptualization, E.M.M., A.Z. and A.N.G.; methodology,
E.M.M., A.Z. and A.N.G.; software, E.M.M., J.B., A.F., A.Z. and S.V.S.; validation, A.Z. and A.F.; data
curation, A.Z. and E.M.M.; writing—original draft preparation, E.M.M. and A.Z.; writing—review
and editing, E.M.M., A.Z. and A.N.G.; visualization, E.M.M. and A.Z.; supervision, A.Z. and A.N.G.
All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the French government under the management of Agence
Nationale de la Recherche as part of the “Investissement d’avenir” program, reference ANR-19-P3IA-
0001 (PRAIRIE 3IA Institute). Part of this project was supported in 2020–2021 by the Ministry of
Science and Higher Education of the Russian Federation (Project No. 075-15-2021-634).
Institutional Review Board Statement: Not applicable.
Data Availability Statement: Only publicly available datasets were analyzed in this study. The Ama-
zon reviews dataset was obtained from https://github.com/GRAAL-Research/domain_adversarial_
neural_network. The single cell transcriptomic count data was obtained from https://www.synapse.
org/Synapse:syn21041850/files/. The digit images data was obtained using instructions from
https://github.com/vcoyette/DANN.
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or
in the decision to publish the results.
Appendix A

where ⊤ denotes the transposed matrix. Calculation of this matrix requires N T_\times + (N − 1) T_+ for each element of the matrix X^\top W (there are Nd such elements) and N T_\times + (N − 1) T_+ for each element of the matrix Q^W_2 (there are d^2 elements in this matrix). In total, to calculate the matrix Q^W_2 it is necessary to perform (Nd + d^2)(N T_\times + (N − 1) T_+) operations. Finally, the number of operations needed to calculate the matrix Q^W is N(N − 1) T_+ + d^2 (2N T_\times + (N − 1) T_+) + …
Let us reorder the elements of the matrix X with respect to the labels l_i. The matrix X can be decomposed as

X = \begin{pmatrix} X^1 \\ X^2 \\ \vdots \\ X^n \end{pmatrix}.   (A4)
Each matrix X^r contains data points of class r only: l(x) = L_r for all x ∈ X^r. Now we can decompose the matrix W into a block representation:

W = \begin{pmatrix} W^{11} & W^{12} & \dots & W^{1n} \\ W^{21} & W^{22} & \dots & W^{2n} \\ \vdots & \vdots & \ddots & \vdots \\ W^{n1} & W^{n2} & \dots & W^{nn} \end{pmatrix}.   (A5)
This means that for the calculation of the vector w^S it is necessary to use n T_\times + (n − 1) T_+ operations for each class. In total, it is necessary to use n(n T_\times + (n − 1) T_+). Since the number of classes is usually essentially smaller than the number of data points, we can state that n(n T_\times + (n − 1) T_+) ≪ N^2 T_+.
Let us consider the calculation of Q^W_2 (A2):

Q^W_2 = X^\top W X = \sum_{ij} (X^i)^\top W^{ij} X^j = \sum_{ij} \delta_{ij}\, (X^i)^\top J_{N_i N_j} X^j.   (A8)

Let us calculate the vector s_r of sums of all cases of class r over all attributes:

s_r = \sum_{x \in X^r} x.   (A9)

Q^W_2 = \sum_{i} \delta_{ii}\, s_i^\top s_i + \sum_{i<j} \delta_{ij}\, (s_i^\top s_j + s_j^\top s_i).   (A11)
Now let us calculate the number of operations needed to compute the matrix Q^W_2 through (A11). Calculation of one vector (A9) requires d(N_i − 1) T_+; for all vectors we need time d(N − n) T_+. One summand of the form (A10) requires time N_i (N_j + 1) T_\times, and the summation of all matrices requires time n^2 d^2 T_+. If we consider all summands with the same first index, then the required time is \sum_j N_i (N_j + 1) T_\times = N_i (N + n) T_\times. The time required to calculate all summands (A10) is therefore N(N + n) T_\times. Since the number of classes is negligible in comparison with the number of observations, we can finally estimate the time needed to calculate the matrix Q^W by this modified algorithm.
References
1. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-adversarial
training of neural networks. J. Mach. Learn. Res. 2016, 17, 2030–2096. [CrossRef]
2. You, K.; Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Universal Domain Adaptation. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [CrossRef]
3. Pan, S.J.; Tsang, I.W.; Kwok, J.T.; Yang, Q. Domain Adaptation via Transfer Component Analysis. IEEE Trans. Neural Netw. 2011,
22, 199–210. [CrossRef] [PubMed]
4. Farahani, A.; Voghoei, S.; Rasheed, K.; Arabnia, H.R. A Brief Review of Domain Adaptation. In Advances in Data Science and
Information Engineering; Stahlbock, R., Weiss, G.M., Abou-Nasr, M., Yang, C.Y., Arabnia, H.R., Deligiannidis, L., Eds.; Springer
International Publishing: Cham, Switzerland, 2021; pp. 877–894. [CrossRef]
5. Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; Vaughan, J.W. A theory of learning from different domains. Mach.
Learn. 2010, 79, 151–175. [CrossRef]
6. Shen, Z.; Liu, J.; He, Y.; Zhang, X.; Xu, R.; Yu, H.; Cui, P. Towards Out-Of-Distribution Generalization: A Survey. arXiv 2021,
arXiv:2108.13624.
7. Chen, M.; Xu, Z.E.; Weinberger, K.Q.; Sha, F. Marginalized Denoising Autoencoders for Domain Adaptation. In Proceed-
ings of the 29th International Conference on Machine Learning, ICML 2012, icml.cc /Omnipress, Edinburgh, Scotland, UK,
26 June–1 July 2012.
8. Hardoon, D.R.; Szedmak, S.; Shawe-Taylor, J. Canonical Correlation Analysis: An Overview with Application to Learning
Methods. Neural Comput. 2004, 16, 2639–2664. [CrossRef] [PubMed]
9. Neuenschwander, B.E.; Flury, B.D. Common Principal Components for Dependent Random Vectors. J. Multivar. Anal. 2000,
75, 163–183. [CrossRef]
10. Paige, C.C.; Saunders, M.A. Towards a Generalized Singular Value Decomposition. SIAM J. Numer. Anal. 2006, 18, 398–405.
[CrossRef]
11. Liu, J.; Wang, C.; Gao, J.; Han, J. Multi-view clustering via joint nonnegative matrix factorization. In Proceedings of the 13th SIAM
International Conference on Data Mining, Austin, TX, USA, 2–4 May 2013; pp. 252–260. [CrossRef]
12. Borgwardt, K.M.; Gretton, A.; Rasch, M.J.; Kriegel, H.P.; Schölkopf, B.; Smola, A.J. Integrating structured biological data by
Kernel Maximum Mean Discrepancy. Bioinformatics 2006, 22, e49–e57. [CrossRef]
13. Fernando, B.; Habrard, A.; Sebban, M.; Tuytelaars, T. Unsupervised Visual Domain Adaptation Using Subspace Alignment. In
Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2960–2967.
[CrossRef]
14. Sun, B.; Feng, J.; Saenko, K. Correlation Alignment for Unsupervised Domain Adaptation. In Domain Adaptation in Computer
Vision Applications; Csurka, G., Ed.; Springer International Publishing: Cham, Switzerland, 2017; pp. 153–171. [CrossRef]
15. Sun, B.; Saenko, K. Deep CORAL: Correlation Alignment for Deep Domain Adaptation. In Computer Vision—ECCV 2016
Workshops; Hua, G.; Jégou, H., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 443–450.
16. Liang, J.; He, R.; Sun, Z.; Tan, T. Aggregating Randomized Clustering-Promoting Invariant Projections for Domain Adaptation.
IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1027–1042. [CrossRef]
17. Haghverdi, L.; Lun, A.T.; Morgan, M.D.; Marioni, J.C. Batch effects in single-cell RNA-sequencing data are corrected by matching
mutual nearest neighbors. Nat. Biotechnol. 2018, 36, 421–427. [CrossRef]
18. Peyré, G.; Cuturi, M. Computational Optimal Transport: With Applications to Data Science. Found. Trends® Mach. Learn. 2019,
11, 355–607. [CrossRef]
19. Gorban, A.N.; Grechuk, B.; Mirkes, E.M.; Stasenko, S.V.; Tyukin, I.Y. High-dimensional separability for one-and few-shot learning.
Entropy 2021, 23, 1090. [CrossRef]
20. Pearson, K. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572.
[CrossRef]
21. Barshan, E.; Ghodsi, A.; Azimifar, Z.; Zolghadri Jahromi, M. Supervised principal component analysis: Visualization, classification
and regression on subspaces and submanifolds. Pattern Recognit. 2011, 44, 1357–1371. [CrossRef]
22. Rao, C.R. The Use and Interpretation of Principal Component Analysis in Applied Research. Sankhyā: Indian J. Stat. Ser. A 1964,
26, 329–358.
23. Giuliani, A. The application of principal component analysis to drug discovery and biomedical data. Drug Discov. Today 2017,
22, 1069–1076. [CrossRef]
24. Jolliffe, I.T. Principal Component Analysis; Springer: New York, NY, USA, 1986. [CrossRef]
25. Gorban, A.; Kégl, B.; Wunch, D.; Zinovyev, A. (Eds.) Principal Manifolds for Data Visualisation and Dimension Reduction; Lecture
Notes in Computational Science and Engineering; Springer: Berlin, Germany, 2008; p. 340. [CrossRef]
26. Koren, Y.; Carmel, L. Robust linear dimensionality reduction. IEEE Trans. Vis. Comput. Graph. 2004, 10, 459–470. [CrossRef]
27. Song, Y.; Nie, F.; Zhang, C.; Xiang, S. A unified framework for semi-supervised dimensionality reduction. Pattern Recognit. 2008,
41, 2789–2799. [CrossRef]
28. Gorban, A.N.; Mirkes, E.M.; Zinovyev, A. Supervised PCA. 2016. Available online: https://github.com/Mirkes/SupervisedPCA
(accessed on 9 September 2016).
29. Sompairac, N.; Nazarov, P.V.; Czerwinska, U.; Cantini, L.; Biton, A.; Molkenov, A.; Zhumadilov, Z.; Barillot, E.; Radvanyi, F.;
Gorban, A.; et al. Independent component analysis for unraveling the complexity of cancer omics datasets. Int. J. Mol. Sci. 2019,
20, 4414. [CrossRef]
30. Hicks, S.C.; Townes, F.W.; Teng, M.; Irizarry, R.A. Missing data and technical variability in single-cell RNA-sequencing
experiments. Biostatistics 2018, 19, 562–578. [CrossRef]
31. Krumm, N.; Sudmant, P.H.; Ko, A.; O’Roak, B.J.; Malig, M.; Coe, B.P.; Quinlan, A.R.; Nickerson, D.A.; Eichler, E.E.; Project, N.E.S.;
et al. Copy number variation detection and genotyping from exome sequence data. Genome Res. 2012, 22, 1525–1532. [CrossRef]
[PubMed]
32. Cangelosi, R.; Goriely, A. Component retention in principal component analysis with application to cDNA microarray data. Biol.
Direct 2007, 2, 1–21. [CrossRef] [PubMed]
33. Gorban, A.N.; Mirkes, E.M.; Tyukin, I.Y. How deep should be the depth of convolutional neural networks: A backyard dog case
study. Cogn. Comput. 2020, 12, 388–397. [CrossRef]
34. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A Kernel Two-Sample Test. J. Mach. Learn. Res. 2012,
13, 723–773.
35. Lähnemann, D.; Köster, J.; Szczurek, E.; McCarthy, D.J.; Hicks, S.C.; Robinson, M.D.; Vallejos, C.A.; Campbell, K.R.; Beerenwinkel,
N.; Mahfouz, A.; et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020, 21, 1–35. [CrossRef] [PubMed]
36. Argelaguet, R.; Cuomo, A.S.; Stegle, O.; Marioni, J.C. Computational principles and challenges in single-cell data integration.
Nat. Biotechnol. 2021, 39, 1202–1215. [CrossRef] [PubMed]
37. Travaglini, K.J.; Nabhan, A.N.; Penland, L.; Sinha, R.; Gillich, A.; Sit, R.V.; Chang, S.; Conley, S.D.; Mori, Y.; Seita, J.; et al. A
molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 2020, 587, 619–625. [CrossRef]
38. Tsuyuzaki, K.; Sato, H.; Sato, K.; Nikaido, I. Benchmarking principal component analysis for large-scale single-cell RNA-
sequencing. Genome Biol. 2020, 21, 9. [CrossRef]
39. Cuccu, A.; Francescangeli, F.; De Angelis, M.L.; Bruselles, A.; Giuliani, A.; Zeuner, A. Analysis of Dormancy-Associated
Transcriptional Networks Reveals a Shared Quiescence Signature in Lung and Colorectal Cancer. Int. J. Mol. Sci. 2022, 23, 9869.
[CrossRef]
40. Bac, J.; Mirkes, E.M.; Gorban, A.N.; Tyukin, I.; Zinovyev, A. Scikit-dimension: A python package for intrinsic dimension
estimation. Entropy 2021, 23, 1368. [CrossRef]
41. Facco, E.; D’Errico, M.; Rodriguez, A.; Laio, A. Estimating the intrinsic dimension of datasets by a minimal neighborhood
information. Sci. Rep. 2017, 7, 12140. [CrossRef] [PubMed]
42. Pestov, V. Is the k-NN classifier in high dimensions affected by the curse of dimensionality? Comput. Math. Appl. 2013,
65, 1427–1437. [CrossRef]
43. Mirkes, E.M.; Allohibi, J.; Gorban, A.N. Fractional Norms and Quasinorms Do Not Help to Overcome the Curse of Dimensionality.
Entropy 2020, 22, 1105. [CrossRef] [PubMed]
44. Gorban, A.N.; Sumner, N.R.; Zinovyev, A.Y. Topological grammars for data approximation. Appl. Math. Lett. 2007, 20, 382–386.
[CrossRef]
45. Albergante, L.; Mirkes, E.; Bac, J.; Chen, H.; Martin, A.; Faure, L.; Barillot, E.; Pinello, L.; Gorban, A.; Zinovyev, A. Robust and
scalable learning of complex intrinsic dataset geometry via ElPiGraph. Entropy 2020, 22, 296. [CrossRef]
46. Akinduko, A.A.; Mirkes, E.M.; Gorban, A.N. SOM: Stochastic initialization versus principal components. Inf. Sci. 2016,
364–365, 213–221. [CrossRef]
47. McInnes, L.; Healy, J.; Saul, N.; Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw.
2018, 3, 861. [CrossRef]
48. van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
49. Kramer, M.A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 1991, 37, 233–243.
[CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.