CONTRIBUTED RESEARCH ARTICLES 167

Variants of Simple Correspondence Analysis
by Rosaria Lombardo and Eric J. Beh

Abstract This paper presents the R package CAvariants (Lombardo and Beh, 2017). The package
performs six variants of correspondence analysis on a two-way contingency table. The main function
that shares the same name as the package – CAvariants – allows the user to choose (via a series of
input parameters) from six different correspondence analysis procedures. These include the classical
approach to (symmetrical) correspondence analysis, singly ordered correspondence analysis, doubly
ordered correspondence analysis, non symmetrical correspondence analysis, singly ordered non
symmetrical correspondence analysis and doubly ordered non symmetrical correspondence analysis.
The code provides the flexibility for constructing either a classical correspondence plot or a biplot
graphical display. It also offers other important features that help assess the
reliability of the graphical representations, such as the inclusion of algebraically derived elliptical
confidence regions. This paper provides R functions that elaborate more fully on the code presented
in Beh and Lombardo (2014).

Introduction
Computational procedures for detecting the association between two or more categorical variables are
important aspects of statistical theory and practice. In particular, correspondence analysis provides a
quick and simple graphical summary of how categories and variables are associated with one another.
The theoretical aspects of the analysis are well documented in the statistical and allied disciplines;
see, for example, Benzécri (1973), Greenacre (1984), Lebart et al. (1984), Beh (2004a), Nishisato (2007),
and Beh and Lombardo (2014). Despite the necessity for programs and functions that allow their user
to perform correspondence analysis, the availability for many of the varied approaches is generally
limited. Commercially available statistical software, such as MATLAB, Minitab, SAS and SPSS provide
a means of carrying out correspondence analysis, although their procedures often provide only the
most basic of features as part of their output. Generally nothing beyond the calculation of principal
inertia values, profile coordinates, contributions to inertia and a two-dimensional correspondence plot
is provided. Other popular statistical languages, such as R, provide some packages for performing
simple and multiple correspondence analysis of the classical (symmetrical) type (Murtagh, 2005;
Nenadic and Greenacre, 2007; Alberti, 2015; De Leeuw, 2006; De Leeuw and Mair, 2009a; Ringrose,
2012; Kostov et al., 2015). Nevertheless, at present, no popular statistical packages provide functions
to perform ordered variants of symmetrical and non symmetrical correspondence analysis.

Overview of correspondence analysis in R


Since the mid-2000s the programming environment of R has proven to be extremely popular in all
areas of theoretical and applied statistics. This is due in part to the free availability of the program
from the Comprehensive R Archive Network (CRAN; http://CRAN.R-project.org/), the versatility
of the coding environment and the ever increasing number of packages that are now available on the
CRAN.
Various R packages have received a great deal of attention for their contribution to the computing
of correspondence analysis (CA). One of the first is the MASS package (Venables and Ripley, 2002;
Ripley, 2016). It provides the user with a means of performing simple and multiple correspondence
analysis with the option of including supplementary points onto a display. More recently the ca
package of Nenadic and Greenacre (2007) includes functions for performing simple, multiple and joint
correspondence analysis using two and three dimensions for the graphical displays. Supplementary
points were incorporated into the R code of Murtagh (2005) while the anacor package of De Leeuw and
Mair (2009a) allows the user to perform classical and canonical correspondence analysis with missing
values (De Leeuw, 2006; De Leeuw and Mair, 2009b). Further, one may refer to the CA or MCA functions
in the FactoMineR package by Lê et al. (2008). For lexical tables, the CaGalt function incorporated into
the FactoMineR package by Kostov et al. (2015) may be used. Another recent package – cabootcrs – by
Ringrose (2012) checks the reliability of association by superimposing onto a plot bootstrap confidence
regions. The CAinterprTools package by Alberti (2015) makes use of graphical features to enrich a
visual interpretation of CA results. Alternatively, De Leeuw and Mair (2009a) prepared the homals
package for performing Gifi’s approach to correspondence analysis. In addition, Clavel et al. (2014)
presented the dualScale package for dual scaling (i.e., multiple correspondence analysis) of multiple
choice data.

The R Journal Vol. 8/2, December 2016 ISSN 2073-4859

Variants of correspondence analysis

package CA NSCA MCA JCA SOCA DOCA SONSCA DONSCA CCA CNSCA DCA
ade4 x x x x x
anacor x x x
ca x x x
cabootcrs x
CAinterprTools x x
CAvariants x x x x x x
cncaGUI x x
dualScale x x
ExPosition x x x
FactoMineR x x
homals x x
MASS x x
PTAk x
vegan x x x

Table 1: R packages and some CA variants. CA: simple CA; NSCA: non symmetrical CA; MCA:
multiple CA; JCA: joint CA; SOCA: singly ordered CA; DOCA: doubly ordered CA; SONSCA: singly
ordered NSCA; DONSCA: doubly ordered NSCA; CCA: canonical CA; CNSCA: canonical NSCA;
DCA: discriminant CA.

Baxter and Cool (2010) and Alberti (2015) provide a good overview of correspondence
analysis using R with an archaeological focus. Another R based package that can be downloaded freely
from CRAN is ExPosition (Beaton et al., 2014). It is written by Hervé Abdi and his team and performs
a variety of different multivariate data analysis techniques, including correspondence analysis and
multiple correspondence analysis. Abdi’s group has also been responsible for other variations of
correspondence analysis including multi-block discriminant correspondence analysis (Williams et al., 2010)
and discriminant correspondence analysis (Abdi, 2007). Furthermore, another suite of R functions that
enables the user to perform a variety of correspondence analysis techniques is vegan (Oksanen et al.,
2016), which was developed primarily for vegetation ecologists. It includes functions that provide
the user with a large array of techniques to choose from including classic correspondence analysis,
canonical correspondence analysis and detrended correspondence analysis. One may also consider
the ade4 package (Dray and Dufour, 2007; Chessel et al., 2004; Thioulouse et al., 1997), which also
includes non symmetric correspondence analysis, to analyze ecological and environmental data in
the framework of numerous Euclidean exploratory methods. Further, the cncaGUI package (Librero
et al., 2015) performs canonical correspondence analysis and canonical non symmetrical correspondence
analysis, providing inferential results by using bootstrap methods. The PTAk package
(Leibovici, 2010, 2015) includes functions for multiway data decomposition and, in particular, it also
allows simple correspondence analysis and a generalization of correspondence analysis for k-way
tables. Lastly, but certainly not least, the R code of Murtagh (2005) for performing simple and multiple
correspondence analysis may also be considered.
An overview of the broad areas of correspondence analysis that these packages cover is summa-
rized in Table 1. While non symmetrical correspondence analysis for nominal variables is included in
some of the R packages on the CRAN that perform correspondence analysis, the remaining ordered
variants have not yet been made available in any R package. However, fragments of R code for some
of these CA variants are available in Beh and Lombardo (2014). Therefore, this paper provides a
comprehensive description of R code that enhances, beyond the classics, the type of correspondence
analysis that one may use. The advantage of these variants is that they enable the user to incorporate
categorical predictor/response associations and the ordinal structure of a variable. For ordered
variables we can easily identify any linear and non-linear sources of association that may exist in the
data. The ordered variants also provide a visualization of non-linear trends of association; the classical
approaches to correspondence analysis do not encompass these features.
The theoretical aspects underlying all six variants of correspondence analysis considered in
this paper can be found in Beh and Lombardo (2014) and Lombardo et al. (2016). However, here we
will provide the reader with a brief overview of the theoretical aspects of these analyses. We also
describe how the algebraic confidence ellipses for polynomial biplots can be derived; this aspect of the
analysis has not been described elsewhere.


Some theory

Symmetrical and non symmetrical correspondence analysis

Consider a two-way contingency table $N$ of dimension $I \times J$ such that it is a cross-classification of two
variables consisting of $I$ row categories and $J$ column categories, respectively. Denote the matrix of the
joint relative frequencies by $P = (p_{ij})$ so that $\sum_{i=1}^{I} \sum_{j=1}^{J} p_{ij} = 1$. Let $p_{i\bullet} = \sum_{j=1}^{J} p_{ij}$ and $p_{\bullet j} = \sum_{i=1}^{I} p_{ij}$
be the $i$th marginal row proportion and the $j$th marginal column proportion, general elements of the
diagonal matrices, $D_I$ and $D_J$, respectively.
There are many ways that correspondence analysis can be performed and Nishisato (2007, Chapter
2) provides an excellent overview of some of them. Here, we present the chi-squared statistic expressed
in terms of the weighted sum-of-squares of the centered column profiles since this alternative expression
of $X^2$ is useful when comparing symmetrical correspondence analysis with its non symmetrical
variant. Therefore, consider the chi-squared statistic of N which is defined as

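$$X^2 = n \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{p_{\bullet j}}{p_{i\bullet}} \left( \frac{p_{ij}}{p_{\bullet j}} - p_{i\bullet} \right)^2 = n \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{p_{\bullet j}}{p_{i\bullet}} \pi_{ij}^2 ,$$

where $\Pi = (\pi_{ij})$ is the $I \times J$ matrix of centered column profiles. In this case, the weight matrices in $\Re^I$
and $\Re^J$ are defined by the elements of the matrices $D_I^{-1}$ and $D_J$, respectively.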
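As a numerical illustration of the expression above, the following base R sketch computes $X^2$ from the centered column profiles and compares it with the usual Pearson statistic. The 3 x 4 table `N` and the names `p_row`, `p_col` and `Pi` are hypothetical and chosen only for this example; they are not taken from the paper or the package.

```r
# Hypothetical 3 x 4 table of counts (illustrative only, not data from the paper)
N <- matrix(c(20, 10,  5,  3,
               8, 15, 12,  6,
               4,  9, 17, 14), nrow = 3, byrow = TRUE)

n     <- sum(N)
P     <- N / n          # joint relative frequencies p_ij
p_row <- rowSums(P)     # p_i. (row margins)
p_col <- colSums(P)     # p_.j (column margins)

# Centered column profiles: pi_ij = p_ij / p_.j - p_i.
Pi <- sweep(P, 2, p_col, "/") - p_row

# X^2 = n * sum_ij (p_.j / p_i.) * pi_ij^2
X2 <- n * sum(outer(1 / p_row, p_col) * Pi^2)

# Cross-check against the Pearson statistic computed by base R
X2_ref <- unname(chisq.test(N, correct = FALSE)$statistic)
print(c(X2 = X2, X2_ref = X2_ref))
```

The two values agree because $(p_{\bullet j}/p_{i\bullet})\,\pi_{ij}^2 = (p_{ij} - p_{i\bullet} p_{\bullet j})^2 / (p_{i\bullet} p_{\bullet j})$.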
Suppose we now treat the column variable as a predictor variable and the row variable as its
response variable. When such an asymmetric association structure exists between the two categorical
variables one may consider non symmetrical correspondence analysis (Lauro and D’Ambra, 1984;
D’Ambra and Lauro, 1989; Kroonenberg and Lombardo, 1999). To quantify this asymmetric association,
consider the Goodman-Kruskal (1954) tau index

$$\tau = \frac{\sum_{i=1}^{I} \sum_{j=1}^{J} p_{\bullet j} \left( \frac{p_{ij}}{p_{\bullet j}} - p_{i\bullet} \right)^2}{1 - \sum_{i=1}^{I} p_{i\bullet}^2} = \frac{\sum_{i=1}^{I} \sum_{j=1}^{J} p_{\bullet j} \pi_{ij}^2}{1 - \sum_{i=1}^{I} p_{i\bullet}^2} = \frac{\tau_{num}}{\tau_{den}} .$$

For this asymmetric case, the weight matrices are $\mathbf{I}$ (an $I \times I$ identity matrix) and $D_J$, respectively.
Notice that the denominator can be treated as a constant term since it does not depend on the predictor
variable. For this reason it can be neglected without losing any information about the structure
of the association. Therefore $\tau_{num}$ is the measure of association considered in non symmetrical
correspondence analysis.
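The tau index can be checked numerically in the same way. The sketch below, again on a hypothetical table, computes $\tau_{num}$ and $\tau$ from the centered column profiles and verifies the numerator against the algebraically equivalent form $\sum_i \sum_j p_{ij}^2 / p_{\bullet j} - \sum_i p_{i\bullet}^2$. All object names are illustrative assumptions.

```r
# Hypothetical 3 x 4 table of counts (rows = response, columns = predictor)
N <- matrix(c(20, 10,  5,  3,
               8, 15, 12,  6,
               4,  9, 17, 14), nrow = 3, byrow = TRUE)

P     <- N / sum(N)
p_row <- rowSums(P)     # p_i.
p_col <- colSums(P)     # p_.j

# Centered column profiles: pi_ij = p_ij / p_.j - p_i.
Pi <- sweep(P, 2, p_col, "/") - p_row

# Numerator and denominator of the Goodman-Kruskal tau index
tau_num <- sum(sweep(Pi^2, 2, p_col, "*"))   # sum_ij p_.j * pi_ij^2
tau_den <- 1 - sum(p_row^2)
tau     <- tau_num / tau_den

# Equivalent expanded form of the numerator, used here only as a check
tau_num_alt <- sum(sweep(P^2, 2, p_col, "/")) - sum(p_row^2)
print(c(tau_num = tau_num, tau = tau))
```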
In order to graphically depict the association or the prediction of the rows given the columns in a
low dimensional space, we may consider the generalized singular value decomposition of the centered
column matrix using the suitable weight matrices (Kroonenberg and Lombardo, 1999).
Suppose we consider a general framework for the symmetrical and non symmetrical variants of
CA (Lombardo et al., 2016) that considers generic weight matrices, $V_I$ and $W_J$, in $\Re^I$ and $\Re^J$. This
general framework may be defined by considering the weighted centered column profile matrix

$$\tilde{\Pi} = V^{1/2} \, \Pi \, W^{1/2} .$$

Therefore, symmetric (or classical) correspondence analysis may be performed by considering
$V = D_I^{-1}$ and $W = D_J$, while non symmetrical correspondence analysis is defined when $V = \mathbf{I}$ and
$W = D_J$. Doing so leads to the generalized singular value decomposition (GSVD) of $\tilde{\Pi}$,

$$\mathrm{GSVD}(\tilde{\Pi}) = A \Lambda B^T ,$$

where the left and right singular vectors are $A = (a_{im})$ and $B = (b_{jm})$, respectively. These quantities
have the orthonormality properties with metrics $D_I^{-1}$ or $\mathbf{I}$ (identity) (in $\Re^I$, depending on the symmetric
or asymmetric relationship between the rows and columns) and $D_J$ (in $\Re^J$), respectively. As usual, the
elements of the diagonal matrix of singular values, $\Lambda = \mathrm{diag}(\lambda_m)$, are arranged in descending order.
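A minimal base R sketch of this framework follows. It forms the weighted matrix $\tilde{\Pi}$ for both choices of $V$ and applies the ordinary `svd()` to it, which is a simplification assumed here for illustration rather than the package's own routine. The table and the helper `weighted_profiles()` are hypothetical.

```r
# Hypothetical 3 x 4 table of counts (illustrative only)
N <- matrix(c(20, 10,  5,  3,
               8, 15, 12,  6,
               4,  9, 17, 14), nrow = 3, byrow = TRUE)

P     <- N / sum(N)
p_row <- rowSums(P)
p_col <- colSums(P)
Pi    <- sweep(P, 2, p_col, "/") - p_row   # centered column profiles

# Pi_tilde = V^{1/2} Pi W^{1/2}, with V diagonal (given by its diagonal v)
# and W = D_J (hypothetical helper, not a CAvariants function)
weighted_profiles <- function(v) (sqrt(v) * Pi) %*% diag(sqrt(p_col))

svd_ca   <- svd(weighted_profiles(1 / p_row))         # V = D_I^{-1}: symmetrical CA
svd_nsca <- svd(weighted_profiles(rep(1, nrow(P))))   # V = identity: non symmetrical CA

# Sum of squared singular values = total inertia of each variant
inertia_ca   <- sum(svd_ca$d^2)     # equals X^2 / n
inertia_nsca <- sum(svd_nsca$d^2)   # equals tau_num
print(round(c(CA = inertia_ca, NSCA = inertia_nsca), 4))
```

`svd()` returns the singular values in descending order, matching the arrangement of $\Lambda$ described above.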


Ordered symmetrical and non symmetrical correspondence analysis

When both variables are ordered, we adapt SVD by using basis vectors for the row and column spaces
by performing the bivariate moment decomposition (BMD) on the matrix $\tilde{\Pi}$. The BMD of $\tilde{\Pi}$ is expressed
as

$$\mathrm{BMD}(\tilde{\Pi}) = A Z B^T ,$$

where $A$ and $B$ are the row and column polynomial matrices defined by Emerson (1968), respectively,
and $Z$ is the matrix of the generalized correlations (Rayner and Beh, 2009). The construction of
polynomials $A$ and $B$ requires the specification of a priori scores, $s_X(i)$ and $s_Y(j)$ (defined by mi and
mj in CAvariants, respectively), to reflect the ordinal structure of the row and column variables. These
polynomials are orthonormal with respect to the weight matrices. For the analysis of nominal variables,
when a symmetrical association between the variables is considered, the weights in $\Re^I$ and $\Re^J$ are
$D_I^{-1}$ and $D_J$, respectively. When an asymmetric association is considered, the weights are given by $\mathbf{I}$
and $D_J$, respectively.
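To make the idea of weighted orthonormal polynomials concrete, the sketch below builds them by a weighted QR (Gram-Schmidt) step on the powers of the scores. This is a swapped-in technique that yields polynomials with the same orthonormality property; it is not Emerson's recurrence and not the package's code. The function `orth_poly()` and the example weights are hypothetical, and the scores are assumed to be in increasing order.

```r
# Orthonormal polynomials with respect to weights w (hypothetical helper).
# Column 1 is the trivial constant polynomial; column v + 1 has degree v.
orth_poly <- function(scores, w) {
  k <- length(scores)
  V <- outer(scores, 0:(k - 1), "^")   # columns 1, s, s^2, ..., s^(k-1)
  Q <- qr.Q(qr(sqrt(w) * V))           # orthonormal in the ordinary metric
  B <- Q / sqrt(w)                     # now t(B) %*% diag(w) %*% B = identity
  sweep(B, 2, sign(B[k, ]), "*")       # sign convention: positive at the largest score
}

# Illustrative column margins and natural scores 1, 2, 3, 4
p_col <- c(0.26, 0.28, 0.28, 0.18)
s_col <- 1:4
B     <- orth_poly(s_col, p_col)

# The linear polynomial is the standardized score under the weights
mu  <- sum(p_col * s_col)
sdv <- sqrt(sum(p_col * (s_col - mu)^2))
print(round(cbind(linear = B[, 2], standardized = (s_col - mu) / sdv), 4))
```

Natural scores correspond to the default `mi = NULL` and `mj = NULL` described for CAvariants; other increasing scores can be passed in the same way.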
When only one of the two variables consists of ordered categories, rather than considering the
BMD or the GSVD of $\tilde{\Pi}$, one may consider instead its hybrid decomposition (HD) (Beh, 2001, 2008;
Lombardo et al., 2016). This method of decomposition consists of singular vectors for the nominal
variable and orthogonal polynomials for the ordered variable. Consider the case, as does the package
CAvariants, where the column variable consists of ordered categories and the row variable consists of
nominal categories. Then the HD of $\tilde{\Pi}$ takes the form

$$\mathrm{HD}(\tilde{\Pi}) = A Z B^T ,$$

where $A$ is the column matrix of singular vectors for the nominal row categories and $B$ is the column
matrix of orthogonal polynomials for the ordered column categories. The generic elements of $Z$, $z_{mv}$,
are the hybrid generalized correlations; for further details on these elements see Beh and Lombardo
(2014) and Lombardo et al. (2016).

Generalized correlations in ordered CA variants

The generalized correlation matrix, $Z$, in the BMD of $\tilde{\Pi}$ reflects the various sources of association
between the variables and is derived using orthogonal polynomials (Best and Rayner, 1996; Beh, 1997;
Rayner and Beh, 2009). For example, when the row and column scores are defined as consecutive
integers such that $s_X(i) = i$ for $i = 1, \ldots, I$ and $s_Y(j) = j$ for $j = 1, \ldots, J$, then $z_{11}$ is Pearson’s
product moment correlation of $N$. A simple generalization of this correlation is $z_{12}$, which is a measure
of the correlation between any change in the location of the row categories and dispersion of the
column categories. For this reason, $z_{12}$ is a generalized correlation describing the linear-by-quadratic
association between the row and column categories.
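For ordered CA variants, the total inertia is

$$\mathrm{Inertia}(\tilde{\Pi}) = \sum_{u=1}^{I-1} \sum_{v=1}^{J-1} z_{uv}^2 ,$$

which can also be written in matrix form as

$$\mathrm{Inertia}(\tilde{\Pi}) = \mathrm{trace}\left( Z^T Z \right) = \mathrm{trace}\left( Z Z^T \right) = \mathrm{trace}\left( \Lambda^2 \right) .$$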

From the matrix of generalized correlations $Z$, we can obtain the inertia of each polynomial axis
by considering the sum-of-squares of $z_{uv}$ over either $u$ or $v$. Using BMD or HD, the symmetric
and asymmetric measures of association ($X^2$ and $\tau$) can be partitioned into polynomial components
that reflect various sources of variation for each of the categories. The inertias of the polynomial
components will henceforth be referred to as sources of inertia and are akin to the principal inertia
values in (symmetrical or non symmetrical) correspondence analysis.
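These properties can be verified numerically for the doubly ordered symmetrical case. The sketch below assumes natural scores, builds the polynomials with a hypothetical QR-based helper `orth_poly()` (not Emerson's recurrence), and computes the generalized correlations in the form $Z = A^T P B$ using the non-trivial polynomials only. That form is an assumption of this illustration; the table is also hypothetical.

```r
# Orthonormal polynomials with respect to weights w (hypothetical helper,
# scores assumed increasing); column 1 is the trivial constant polynomial
orth_poly <- function(scores, w) {
  k <- length(scores)
  Q <- qr.Q(qr(sqrt(w) * outer(scores, 0:(k - 1), "^")))
  B <- Q / sqrt(w)
  sweep(B, 2, sign(B[k, ]), "*")
}

# Hypothetical 3 x 4 table with ordered rows and columns
N <- matrix(c(20, 10,  5,  3,
               8, 15, 12,  6,
               4,  9, 17, 14), nrow = 3, byrow = TRUE)
n <- sum(N); P <- N / n
p_row <- rowSums(P); p_col <- colSums(P)

# Non-trivial row and column polynomials for natural scores
A <- orth_poly(seq_len(nrow(N)), p_row)[, -1, drop = FALSE]
B <- orth_poly(seq_len(ncol(N)), p_col)[, -1, drop = FALSE]

Z <- t(A) %*% P %*% B            # (I-1) x (J-1) generalized correlations

total_inertia <- sum(Z^2)        # equals X^2 / n
row_inertia   <- rowSums(Z^2)    # z_{u.}^2, one value per row polynomial
col_inertia   <- colSums(Z^2)    # z_{.v}^2, one value per column polynomial

# z_11 compared with Pearson's correlation of the integer-scored observations
x <- rep(as.vector(row(N)), times = as.vector(N))
y <- rep(as.vector(col(N)), times = as.vector(N))
print(c(z11 = Z[1, 1], pearson = cor(x, y)))
```

The row and column inertias generally differ axis by axis, yet both sets sum to the same total inertia.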
A formal statistical test of the $X^2$ or $\tau$ index can be made. To test the statistical significance of the
total inertia in the symmetrical and non symmetrical case, we can compare the chi-squared statistic,
or the C-statistic, $C = \tau \cdot (n - 1) \cdot (I - 1)$ (Light and Margolin, 1971), with the $\chi^2$ distribution with
$(I - 1)(J - 1)$ degrees of freedom; see, for example, Beh and Lombardo (2014) for further details.
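Both tests can be sketched in a few lines of base R. The table and object names below are hypothetical; the C-statistic follows the formula given above, and both statistics are referred to the chi-squared distribution with $(I - 1)(J - 1)$ degrees of freedom.

```r
# Hypothetical 3 x 4 table of counts (rows = response, columns = predictor)
N <- matrix(c(20, 10,  5,  3,
               8, 15, 12,  6,
               4,  9, 17, 14), nrow = 3, byrow = TRUE)
n <- sum(N); n_row <- nrow(N); n_col <- ncol(N)
P <- N / n
p_row <- rowSums(P); p_col <- colSums(P)
Pi <- sweep(P, 2, p_col, "/") - p_row      # centered column profiles

df <- (n_row - 1) * (n_col - 1)

# Symmetrical case: Pearson's chi-squared statistic
X2   <- n * sum(outer(1 / p_row, p_col) * Pi^2)
p_X2 <- pchisq(X2, df = df, lower.tail = FALSE)

# Non symmetrical case: C = tau * (n - 1) * (I - 1)
tau    <- sum(sweep(Pi^2, 2, p_col, "*")) / (1 - sum(p_row^2))
C_stat <- tau * (n - 1) * (n_row - 1)
p_C    <- pchisq(C_stat, df = df, lower.tail = FALSE)

print(c(X2 = X2, p_X2 = p_X2, C = C_stat, p_C = p_C))
```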

Unequal inertias of the row and column polynomials. When considering the BMD of $\tilde{\Pi}$, the total
inertia of the row and column spaces ($\Re^I$ and $\Re^J$, respectively) will be identical. However, the inertia
associated with each of the row and column polynomials will often be different. For the row categories,


there are $I - 1$ row inertia values – one for each of the axes – where the inertia of the $u$th polynomial
axis is $z_{u\bullet}^2$. Similarly, there are $J - 1$ column inertia values – one for each of the axes – where the inertia
of the $v$th axis is $z_{\bullet v}^2$. For this reason, we recommend constructing polynomial biplots for the ordered
variants of correspondence analysis instead of the traditional correspondence plots constructed using
principal coordinates. See Beh and Lombardo (2014) and Lombardo et al. (2016) for more details on
these features.
For the HD of $\tilde{\Pi}$, the interpretation and properties of the $Z$ matrix are a mixture (or hybrid) of
$\Lambda$ from the GSVD and $Z$ from the BMD. When considering the space $\Re^J$, calculating
$\sum_{m=1}^{M^*} z_{m1}^2 = z_{\bullet 1}^2$
gives the location component of the ordered columns and represents the principal inertia for this
variable along the first polynomial axis. Similarly in $\Re^I$, computing $\sum_{v=1}^{J-1} z_{1v}^2 = z_{1\bullet}^2 = \lambda_1^2$ yields the
principal inertia of the first principal axis for the nominal row variable. Like BMD, HD yields different
sets of inertia values for each axis in the $\Re^I$ and $\Re^J$ spaces.
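The hybrid structure can be illustrated for the singly ordered symmetrical case. The sketch below assumes that the hybrid generalized correlations can be computed as $Z = A^T \tilde{\Pi} \tilde{B}$, where $A$ holds the left singular vectors of $\tilde{\Pi}$ and $\tilde{B} = D_J^{1/2} B$ holds the non-trivial column polynomials rescaled to be orthonormal in the ordinary metric. This form, the helper `orth_poly()` and the table are assumptions of the example, not the package's implementation.

```r
# Orthonormal polynomials with respect to weights w (hypothetical helper,
# scores assumed increasing); column 1 is the trivial constant polynomial
orth_poly <- function(scores, w) {
  k <- length(scores)
  Q <- qr.Q(qr(sqrt(w) * outer(scores, 0:(k - 1), "^")))
  B <- Q / sqrt(w)
  sweep(B, 2, sign(B[k, ]), "*")
}

# Hypothetical 3 x 4 table: nominal rows, ordered columns
N <- matrix(c(20, 10,  5,  3,
               8, 15, 12,  6,
               4,  9, 17, 14), nrow = 3, byrow = TRUE)
n <- sum(N); P <- N / n
p_row <- rowSums(P); p_col <- colSums(P)
Pi <- sweep(P, 2, p_col, "/") - p_row

# Weighted centered column profiles for the symmetrical case
Pi_tilde <- ((1 / sqrt(p_row)) * Pi) %*% diag(sqrt(p_col))

sv <- svd(Pi_tilde)
M  <- min(dim(N)) - 1                        # number of non-trivial principal axes
A  <- sv$u[, seq_len(M), drop = FALSE]       # singular vectors for the nominal rows
B_tilde <- sqrt(p_col) * orth_poly(seq_len(ncol(N)), p_col)[, -1, drop = FALSE]

Z <- t(A) %*% Pi_tilde %*% B_tilde           # M x (J-1) hybrid correlations

row_axis_inertia <- rowSums(Z^2)             # should equal lambda_m^2
col_axis_inertia <- colSums(Z^2)             # inertia of each column polynomial
print(round(rbind(from_Z = row_axis_inertia, from_svd = sv$d[seq_len(M)]^2), 5))
```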

Polynomial biplots and elliptical confidence regions


When constructing a polynomial biplot, the ordered row and column categories can be displayed
in a single plot since the row and column coordinates are computed with respect to the same set of
polynomial axes. For example, in a polynomial row metric preserving (or row isometric) biplot, the
column standard polynomial coordinates are
 
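$$G = B \quad \left( g_{jv} = \beta_{jv} \right) ,$$

while the principal polynomial coordinates for the row categories are

$$F = A Z = \tilde{\Pi} W_J B \quad \left( f_{iv} = \sum_{u} \alpha_{iu} z_{uv} = \sum_{j=1}^{J} w_{\bullet j} \tilde{\pi}_{ij} \beta_{jv} \right) .$$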

In practice, the coordinates for both the row and column categories are computed using the same
orthonormal polynomial axes, i.e., the column polynomials.
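For the doubly ordered symmetrical case, the coordinates of a row isometric polynomial biplot can be sketched as follows. The example assumes natural scores, $Z = A^T P B$ with non-trivial polynomials, and a hypothetical QR-based helper `orth_poly()`. It also checks numerically that $F = AZ$ coincides with the row profiles projected onto the column polynomials, $D_I^{-1} P B$, an identity used here only as a cross-check.

```r
# Orthonormal polynomials with respect to weights w (hypothetical helper,
# scores assumed increasing); column 1 is the trivial constant polynomial
orth_poly <- function(scores, w) {
  k <- length(scores)
  Q <- qr.Q(qr(sqrt(w) * outer(scores, 0:(k - 1), "^")))
  B <- Q / sqrt(w)
  sweep(B, 2, sign(B[k, ]), "*")
}

# Hypothetical 3 x 4 table with ordered rows and columns
N <- matrix(c(20, 10,  5,  3,
               8, 15, 12,  6,
               4,  9, 17, 14), nrow = 3, byrow = TRUE)
P <- N / sum(N)
p_row <- rowSums(P); p_col <- colSums(P)

A <- orth_poly(seq_len(nrow(N)), p_row)[, -1, drop = FALSE]
B <- orth_poly(seq_len(ncol(N)), p_col)[, -1, drop = FALSE]
Z <- t(A) %*% P %*% B

G_coord <- B          # column standard polynomial coordinates
F_coord <- A %*% Z    # row principal polynomial coordinates

# Cross-check: row profiles projected onto the column polynomials
F_alt <- sweep(P, 1, p_row, "/") %*% B

# Weighted sum of squares of F along each axis = column polynomial inertia
axis_inertia <- colSums(p_row * F_coord^2)
print(round(F_coord, 4))
```

Both sets of coordinates refer to the same column polynomial axes, which is why they can share a single display.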
The plot method for objects returned by CAvariants provides the user with the option of constructing
parametric (or algebraic) elliptical confidence regions for all six CA variants, not only
for the nominal CA variants as originally proposed by Beh (2010). We compute the semi-major and
semi-minor axis lengths of the elliptical region for the row and column categories. Here, we provide
the ellipse axes lengths for the ordered symmetric variants of correspondence analysis. For example,
the semi-major axis length of the confidence ellipse for the $i$th row category is

$$x_{i(\alpha)} = z_{11}^2 \sqrt{ \frac{\chi^2_{\alpha}}{n \times \mathrm{trace}(Z^T Z)} \left( \frac{1}{p_{i\bullet}} - \sum_{m=3}^{I-1} a_{im}^2 \right) } , \qquad (1)$$

while the semi-minor axis length for this row is

$$y_{i(\alpha)} = z_{22}^2 \sqrt{ \frac{\chi^2_{\alpha}}{n \times \mathrm{trace}(Z^T Z)} \left( \frac{1}{p_{i\bullet}} - \sum_{m=3}^{I-1} a_{im}^2 \right) } . \qquad (2)$$

Similar semi-axis lengths can also be derived for the column categories and for the non-symmetrical
CA variants. Furthermore, note that ellipsoids can be constructed for three- or higher-dimensional
correspondence plots by considering the input parameter M > 2 in the plot method. For further details
on this issue see Beh (2010) and Beh and Lombardo (2014).
Unlike the confidence circles of Lebart et al. (1984) and the more computationally intensive
bootstrap techniques proposed in the literature (Markus, 1994; Linting et al., 2007; Ringrose, 2012;
Greenacre, 1984; Lombardo and Ringrose, 2012), constructing confidence ellipses in this manner takes
into consideration the contribution of the ith row principal polynomial coordinate in dimensions
higher than the second. In fact, since all I dimensions are reflected in the semi-major and semi-minor
axis lengths, all of the contribution that a point has to the symmetrical or asymmetrical association can
be accounted for in a two-dimensional plot using equations (1) and (2). Additional information on how
to compute the p-values of each category point can be found by considering a similar theoretical
development of the p-values described in Beh and Lombardo (2014, 2015) for a correspondence
analysis of a contingency table with nominal variables.


An overview of the CAvariants package


The primary function discussed in this paper is CAvariants. It allows the user to select which
analysis to perform from a suite of six correspondence analysis techniques. These include symmetrical
(or classical) correspondence analysis, non symmetrical correspondence analysis and their ordered
variants, described in Beh and Lombardo (2014, 2015).
The six variations of simple correspondence analysis included in the package CAvariants are:
• The classical approach to simple correspondence analysis (the default analysis which is defined
by the input parameter catype = "CA").
• Two-way, or doubly ordered, symmetrical correspondence analysis (the user can perform this
analysis by defining the input parameter catype = "DOCA").
• One-way, or singly ordered, correspondence analysis for tables of symmetrically related vari-
ables, where the column variable is ordered (the user can perform this analysis by defining the
input parameter catype = "SOCA").
• Non symmetrical correspondence analysis where the nominal column variable is a predictor of
the nominal row variable (the user can perform this analysis by defining the input parameter
catype = "NSCA").
• Two-way, or doubly ordered, non symmetrical correspondence analysis where the ordered
column variable is a predictor of the ordered row variable (the user can perform this analysis by
defining the input parameter catype = "DONSCA").
• One-way, or singly ordered, non symmetrical correspondence analysis, where the ordered
column variable is a predictor of the nominal row variable (the user can perform this analysis
by defining the input parameter catype = "SONSCA").
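A hedged usage sketch of selecting one of the six analyses is given below. It uses only the argument names `Xtable` and `catype` quoted in this paper; the installed version of the package may differ in other arguments and in the structure of the returned object. The table and its labels are hypothetical, and the call is skipped when the package is not installed.

```r
# Hypothetical 3 x 4 contingency table with labelled categories
Xtable <- matrix(c(20, 10,  5,  3,
                    8, 15, 12,  6,
                    4,  9, 17, 14), nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("row", 1:3), paste0("col", 1:4)))

# The six values of catype listed in the paper
catypes <- c("CA", "DOCA", "SOCA", "NSCA", "DONSCA", "SONSCA")

# Validate the requested analysis before calling the package
catype <- match.arg("DOCA", catypes)

if (requireNamespace("CAvariants", quietly = TRUE)) {
  # Doubly ordered symmetrical correspondence analysis with default scores
  res <- CAvariants::CAvariants(Xtable, catype = catype)
  print(class(res))
} else {
  message("Package 'CAvariants' is not installed; skipping the call.")
}
```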
The input parameters of the function CAvariants are:
• The two-way contingency table, Xtable.
• The assigned ordered scores for the row categories. By default, mi = NULL which gives consecu-
tive integer valued (natural) scores.
• The assigned ordered scores for the column categories. By default, mj = NULL which gives
consecutive integer valued (natural) scores.
• The horizontal polynomial or principal axis. By default firstaxis = 1.
• The vertical polynomial or principal axis. By default lastaxis = 2.
• The input parameter for specifying what variant of correspondence analysis is considered. By
default catype = "CA"; other possible values are catype = "SOCA", catype = "DOCA", catype
= "NSCA", catype = "SONSCA" and catype = "DONSCA".
• The input parameter, ellcomp, ensures that the characteristics of the algebraic confidence ellipses
are computed and stored. When ellcomp = TRUE (which is the default), the output includes
the characteristics of the ellipses. The eccentricity of the confidence ellipses is summarized
by the quantity eccentricity. This is the distance between the center of the ellipse and either of its two
foci, which can be thought of as a measure of how much the conic section deviates from being
circular (when it is equal to zero the region is circular). The semi-major axis length
of the ellipse for each row and column point is given by HL Axis 1 while HL Axis 2 gives the
semi-minor axis length of the points along the second axis. The area of the ellipse for each row
and column category is given by Area while the p-value of each category is defined by P-value.
• The number of axes Mell considered in determining the structure of the elliptical confidence
regions. By default, Mell = min(nrow(Xtable),ncol(Xtable)), i.e., the rank of the data matrix.
• The confidence level, alpha, of the elliptical regions. By default, alpha = 0.05.
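The ellipse characteristics described in this list follow from the two semi-axis lengths by elementary geometry. The sketch below is a generic illustration and not the package's code: the function `ellipse_summary()` and its argument names are hypothetical, and it takes the eccentricity to be the center-to-focus distance, as the description above states.

```r
# Summaries of an ellipse with semi-major axis a and semi-minor axis b
# (hypothetical helper; vectorized over categories)
ellipse_summary <- function(a, b) {
  stopifnot(is.numeric(a), is.numeric(b), a >= b, b >= 0)
  data.frame(
    semi_major     = a,
    semi_minor     = b,
    focal_distance = sqrt(a^2 - b^2),   # distance from the center to either focus
    area           = pi * a * b
  )
}

# Two illustrative regions: an elongated ellipse and a circle
out <- ellipse_summary(a = c(5, 2), b = c(3, 2))
print(out)
```

A focal distance of zero, as in the second row, corresponds to a circular region.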
To visually portray and assess the statistical significance of the categories to the association between
the variables of a contingency table, the plot method can be called by the user. As well as displaying
the classic correspondence plot or biplot, this function allows one to superimpose onto the plot
algebraically derived elliptical confidence regions for each of the principal coordinates (Lebart et al.,
1984; Beh, 2010; Lombardo and Ringrose, 2012; Beh and Lombardo, 2015) for all CA variants. The
plot method for “CAvariants” objects offers other features that the user may
utilize. Some of these are applicable to all of the analyses and some are applicable to only a few. The
input parameters of the plot method for “CAvariants” objects are:

• The name of the output object, for example say res, used with the main function CAvariants.
• The horizontal polynomial or principal axis, firstaxis. By default, firstaxis = 1.

• The vertical polynomial or principal axis, lastaxis. By default, lastaxis = 2.


• The size of characters, cex, displayed on the correspondence plot or biplot. By default, cex =
0.8.
• The parameter, cex.lab that specifies the size of character labels of axes in graphical displays.
By default, cex.lab = 0.8.
• The scaling parameter, prop, for specifying the limits of the plotting area. By default, prop = 1.
• The type of graphical display required (either a classical correspondence plot or a biplot). The
user can look at a classical correspondence plot by defining the input parameter plottype =
"classic". When plottype = "biplot", it produces biplot graphical displays, or polynomial
biplots in case of an ordered analysis. Note that for an ordered analysis only polynomial biplots
are suitable. In particular for the singly ordered variants, only row isometric polynomial biplots
make sense, as we assume that the ordered variable is the column variable (the column coordi-
nates are standard polynomial coordinates and the row coordinates are principal polynomial
coordinates). By default, plottype = "biplot".
• For a biplot, one may specify that it be a row-isometric biplot (biptype = "row") or a column-
isometric biplot (biptype = "column"). This feature is available for the nominal symmetrical
and the non symmetrical correspondence analyses. By default, a row-isometric biplot, biptype
= "row", is produced.
• The parameter for scaling the biplot coordinates, scaleplot, originally proposed in Section
2.3.1 of Gower et al. (2011) and described on page 135 of Beh and Lombardo (2014). By default,
scaleplot = 1.
• The parameter posleg for specifying the position of the legend when portraying trends of
ordered categories in ordered variants of correspondence analysis. By default, posleg =
"topleft".
• The parameter pos for specifying the position of point symbols in the graphical displays. By
default, pos = 2.
• The logical parameter, ell which specifies whether algebraic confidence ellipses are to be
included in the plot or not. Setting the input parameter to ell = TRUE will allow the user to
assess the statistical significance of each category to the association between the variables. The
ellipses will be included when the plot is constructed using principal coordinates (being either
row and column principal coordinates or row and column principal polynomial coordinates).
By default, this input parameter is set to ell = FALSE. See also the input parameter ellcomp
of the function CAvariants for a description of the numeric characteristics of the confidence
ellipses (eccentricity, area, etc.), as well as the input parameter ellprint of the print method
for “CAvariants” objects for getting a print of these parameters.
• The number of axes Mell considered when portraying the elliptical confidence regions. By
default, it is equal to Mell = min(nrow(Xtable),ncol(Xtable)), i.e., the rank of the data matrix.
This parameter is identical to the input parameter Mell of the function CAvariants.
• The confidence level of the elliptical regions. By default, alpha = 0.05.

The print method for “CAvariants” objects is included in the package CAvariants and displays
the main results of the analysis specified by the user. The results displayed depend on the type of
analysis being performed. The principal inertia values, total inertia and p-values are included as part
of its output when catype = "CA", catype = "SOCA" or catype = "DOCA" and are based on Pearson’s
chi-squared statistic. The Goodman Kruskal tau-index is the association measure of interest when
catype = "NSCA", catype = "SONSCA" or catype = "DONSCA". When an ordered analysis is specified
– such as when catype = "DOCA", catype = "SOCA", catype = "SONSCA" or catype = "DONSCA" – a
table describing the significant polynomial components of inertia will also be reported.
The input parameters of the print method for “CAvariants” objects are:

• The name of the output object, for example say res, used with the main function CAvariants.
• The number of dimensions, printdims, that are used to generate the correspondence plot, or
biplot, and for summarizing the numerical output of the analysis. By default, printdims = 2.
• The flag parameter, ellprint, controls whether the characteristics of the confidence ellipses (eccentricity,
semi-axis lengths, area, p-values) are displayed. By default, ellprint = TRUE.
• The number of axes, Mell, used for the construction of the confidence ellipses. By default,
it is equal to its maximum value, Mell = min(nrow(Xtable),ncol(Xtable)), i.e., the rank of
the data matrix. This input parameter is identical to the parameter Mell of both, function
CAvariants and the plot method for “CAvariants” objects.

The R Journal Vol. 8/2, December 2016 ISSN 2073-4859


• The level of significance used for the construction of the elliptical regions, alpha. By default,
alpha = 0.05.
• The minimum number of decimal places, digits, used for displaying the numerical summaries
of the analysis. By default, digits = 3.

In general, this function produces the following output:

• The two-way contingency table, Xtable.


• The matrix of row weights, Row weights: Imass. These weights depend on the type of analysis
performed.
• The matrix of column weights, Column weights: Jmass. These weights are equal to the data
column margins for all types of analysis performed.
• The total inertia, Total inertia, of the analysis performed. For example, when considering the
variants of non symmetrical correspondence analysis, the numerator of the Goodman-Kruskal
tau index, the associated C-statistic and its p-value are produced.
• The inertia values, their percentage contribution to the total inertia and the cumulative percent
inertias of the row and column space Inertias. When performing an ordered correspondence
analysis, this output summary describes both the row and column spaces for each principal or
polynomial axis. When catype is "CA" or "NSCA", the associated inertia values in the row and
column spaces are identical.
• The generalized correlation matrix, Generalized correlation matrix, when performing an
ordered correspondence analysis, that is, when catype is "DOCA", "DONSCA", "SOCA" or "SONSCA".
• The row principal coordinates, Row principal coordinates, when catype is "CA" or "NSCA".
• The column principal coordinates, Column principal coordinates, when catype is "CA" or
"NSCA".
• The row standard coordinates, Row standard coordinates, when catype is "CA" or "NSCA".
• The column standard coordinates, Column standard coordinates, when catype is "CA" or
"NSCA".
• The row principal polynomial coordinates, Row principal polynomial coordinates, when
performing an ordered correspondence analysis.
• The column principal polynomial coordinates, Column principal polynomial coordinates,
when performing a doubly ordered correspondence analysis.
• The Row standard polynomial coordinates, i.e., standard polynomial coordinates for the row
categories when performing a doubly ordered correspondence analysis.
• The Column standard polynomial coordinates, i.e., standard polynomial coordinates for the
column categories when performing an ordered correspondence analysis.
• The Euclidean distance of the row categories from the origin of the plot, Row distances from
the origin of the plot.
• The Euclidean distance of the column categories from the origin of the plot, Column distances
from the origin of the plot.
• The polynomial components of the total inertia and their p-values, Polynomial components.
The total inertia of the column space is partitioned to identify polynomial components when
catype is "SOCA" or "SONSCA". When catype is "DOCA" or "DONSCA", the total inertia of both the
row and column spaces is partitioned to identify polynomial components.
• The inner product, Inner product, of the biplot coordinates (concerning the first two axes when
firstaxis = 1 and lastaxis = 2).
• When the input flag parameter is ellprint = TRUE, the printed output also includes the eccentricity
of the confidence ellipses; the semi-major axis length of the ellipse for each row and column point,
HL Axis 1; the semi-minor axis length of the ellipse for each row and column point, HL Axis 2;
the area of the ellipse for each row and column point, Area; and the p-value for each row
and column point, P-value. See also the parameter ellcomp of the function CAvariants for a
detailed description of these parameters.

Furthermore, the package CAvariants contains a summary method for the objects returned by CAvariants.
This method provides the list of the names of the output objects and a selection of the main output
objects described in the print method for objects returned by CAvariants.

Numerical outputs

As an example of the complete set of numerical results that is obtained from performing a particular
variant of correspondence analysis, consider the case where a singly ordered non symmetrical
correspondence analysis is performed on the data table shopdataM available in the package CAvariants.
This object is the contingency table being analyzed and is described more fully in Section Application.
The output of the main function CAvariants, applied to shopdataM, is stored in an object called res,
which is obtained using
R> res <- CAvariants(shopdataM, catype = "SONSCA")
The results are available in the following entries which can be obtained using
R> names(res)
which gives
 [1] "Xtable"      "rows"        "cols"        "r"           "rowlabels"
 [6] "collabels"   "Rprinccoord" "Cprinccoord" "Rstdcoord"   "Cstdcoord"
[11] "tauden"      "tau"         "inertiasum"  "inertias"    "inertias2"
[16] "comps"       "catype"      "mj"          "mi"          "pcc"
[21] "Jmass"       "Imass"       "Trend"       "Z"           "ellcomp"
[26] "risell"      "Mell"
These results may be printed to the screen by using
R> print(res)
while a summary of each of these numerical features is produced by using
R> summary(res)

Application
To demonstrate the application of a variant of simple correspondence analysis available in the
CAvariants package, we present the following example. We shall confine our attention to the non
symmetrical correspondence analysis of a singly ordered contingency table. The contingency table
that we are examining is concerned with shoplifting in The Netherlands and summarizes, in part,
the results of a survey of the Dutch Central Bureau of Statistics (Israëls, 1987). The data consider a
sample of 20819 men who were suspected of shoplifting in Dutch stores between 1977 and 1978. The
predictor variable consists of the age groups of the perpetrators (less than 12 years, 12 to 14 years,
15 to 17 years, 18 to 20 years, 21 to 29 years, 30 to 39 years, 40 to 49 years, 50 to 64 years, 65 years
and over) while the response variable of the table consists of the items stolen. These items are
clothing, clothing accessories, tobacco and/or provisions, stationery (labelled stationary in the
output), books, records, household goods, candy, toys, jewelry, perfume, hobby and/or tools
and other items. For an extensive description of this example, and the application of correspondence
analysis, see Lombardo et al. (2016).
After choosing the suitable variant of correspondence analysis, we create the object res, which
contains the complete results of the analysis, by running the command
R> res <- CAvariants(shopdataM, catype = "SONSCA")
Calling print(res) returns, as part of its output, the following numerical features:
RESULTS for SONSCA Correspondence Analysis

Data Table:
            M12<  M13 M16 M19 M25 M35 M45 M57 M65+
clothing      81  138 304 384 942 359 178 137   45
accessories   66  204 193 149 297 109  53  68   28
tobacco      150  340 229 151 313 136 121 171  145
stationary   667 1409 527  84  92  36  36  37   17
books         67  259 258 146 251  96  48  56   41
records       24  272 368 141 167  67  29  27    7
household     47  117  98  61 193  75  50  55   29
candy        430  637 246  40  30  11   5  17   28
toys         743  684 116  13  16  16   6   3    8
jewelry      132  408 298  71 130  31  14  11   10
perfumes      32   57  61  52 111  54  41  50   28
hobby        197  547 402 138 280 200 152 211  111
other        209  550 454 252 624 195  88  90   34

Row Weights: Imass


clothing accessories tobacco stationary books records household
clothing 1 0 0 0 0 0 0
accessories 0 1 0 0 0 0 0
tobacco 0 0 1 0 0 0 0
stationary 0 0 0 1 0 0 0
books 0 0 0 0 1 0 0
records 0 0 0 0 0 1 0
household 0 0 0 0 0 0 1
candy 0 0 0 0 0 0 0
toys 0 0 0 0 0 0 0
jewelry 0 0 0 0 0 0 0
perfumes 0 0 0 0 0 0 0
hobby 0 0 0 0 0 0 0
other 0 0 0 0 0 0 0
candy toys jewelry perfumes hobby other
clothing 0 0 0 0 0 0
accessories 0 0 0 0 0 0
tobacco 0 0 0 0 0 0
stationary 0 0 0 0 0 0
books 0 0 0 0 0 0
records 0 0 0 0 0 0
household 0 0 0 0 0 0
candy 1 0 0 0 0 0
toys 0 1 0 0 0 0
jewelry 0 0 1 0 0 0
perfumes 0 0 0 1 0 0
hobby 0 0 0 0 1 0
other 0 0 0 0 0 1

Column Weights: Jmass


12< 13 16 19 25 35 45 57 65+
12< 0.137 0.00 0.000 0.0000 0.000 0.0000 0.0000 0.0000 0.0000
13 0.000 0.27 0.000 0.0000 0.000 0.0000 0.0000 0.0000 0.0000
16 0.000 0.00 0.171 0.0000 0.000 0.0000 0.0000 0.0000 0.0000
19 0.000 0.00 0.000 0.0808 0.000 0.0000 0.0000 0.0000 0.0000
25 0.000 0.00 0.000 0.0000 0.166 0.0000 0.0000 0.0000 0.0000
35 0.000 0.00 0.000 0.0000 0.000 0.0665 0.0000 0.0000 0.0000
45 0.000 0.00 0.000 0.0000 0.000 0.0000 0.0394 0.0000 0.0000
57 0.000 0.00 0.000 0.0000 0.000 0.0000 0.0000 0.0448 0.0000
65+ 0.000 0.00 0.000 0.0000 0.000 0.0000 0.0000 0.0000 0.0255

Total inertia 0.038

Inertias, percent inertias and cumulative percent inertias of the row space

inertia inertiapc cuminertiapc


1 0.0300 79.88 79.88
2 0.0037 9.86 89.74
3 0.0032 8.44 98.18
4 0.0003 0.92 99.10
5 0.0003 0.67 99.77
6 0.0001 0.17 99.94
7 0.0000 0.05 99.99
8 0.0000 0.01 100.00
Inertias, percent inertias and cumulative percent inertias of the column space

inertia2 inertiapc2 cuminertiapc2


1 0.0225 59.83 59.83
2 0.0096 25.58 85.41
3 0.0028 7.33 92.74
4 0.0019 5.18 97.92
5 0.0003 0.82 98.74
6 0.0003 0.74 99.48
7 0.0001 0.35 99.83
8 0.0001 0.17 100.00

Predictability Index for Variants of Non symmetrical Correspondence Analysis:

Numerator of Tau Index predicting the rows given the column categories

[1] 0.038

Tau Index predicting the rows given the column categories

[1] 0.041

C-statistic 10331.51 and p-value 0

Polynomial Components of Inertia

** Column Components **
Component Value P-value
Location 6181.536 0
Dispersion 2642.363 0
Cubic 757.192 0
Error 750.418 0
** C-Statistic ** 10331.509 0

Generalized correlation matrix of Hybrid Decomposition


v1 v2 v3 v4 v5 v6 v7 v8
m1 -0.147 0.084 0.018 -0.030 0.011 0.005 -0.005 0.003
m2 -0.028 -0.034 -0.032 0.024 0.005 -0.010 0.003 0.001
m3 -0.013 -0.037 0.036 -0.016 -0.004 0.006 -0.006 0.002
m4 -0.001 0.002 0.006 0.014 -0.010 0.005 -0.001 -0.001
m5 -0.001 -0.001 -0.007 -0.006 -0.007 0.009 -0.004 -0.004
m6 0.000 0.000 -0.001 0.000 0.000 -0.001 -0.006 0.005
m7 0.000 0.000 0.000 -0.001 -0.002 -0.003 -0.001 -0.002
m8 0.000 0.000 0.000 0.000 -0.001 0.000 0.001 0.001
m9 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

Column standard polynomial coordinates = column polynomial axes


Axis 1 Axis 2
M12< -1.232 1.352
M13 -0.759 0.142
M16 -0.285 -0.652
M19 0.188 -1.029
M25 0.661 -0.991
M35 1.135 -0.536
M45 1.608 0.336
M57 2.081 1.624
M65+ 2.554 3.328

Row principal polynomial coordinates


Axis 1 Axis 2
clothing 0.072 -0.056
accessories 0.017 -0.014
tobacco 0.039 0.017
stationary -0.084 0.033
books 0.012 -0.012
records 0.000 -0.021
household 0.015 -0.004
candy -0.045 0.027
toys -0.067 0.049
jewelry -0.017 -0.006
perfumes 0.014 0.000
hobby 0.030 0.015
other 0.014 -0.030

Column distances from the origin of the plot


Axis 1 Axis 2
M12< 0.057 0.002
M13 0.027 0.000
M16 0.000 0.000
M19 0.027 0.002
M25 0.046 0.004
M35 0.041 0.000
M45 0.031 0.005
M57 0.021 0.022
M65+ 0.010 0.047

Row distances from the origin of the plot


Axis 1 Axis 2
clothing 0.005 0.003
accessories 0.000 0.000
tobacco 0.001 0.000
stationary 0.007 0.001
books 0.000 0.000
records 0.000 0.000
household 0.000 0.000
candy 0.002 0.001
toys 0.005 0.002
jewelry 0.000 0.000
perfumes 0.000 0.000
hobby 0.001 0.000
other 0.000 0.001

Inner product of coordinates (first two axes when 'firstaxis=1' and 'lastaxis=2')
M12< M13 M16 M19 M25 M35 M45 M57 M65+
clothing 0.111 0.097 0.014 -0.112 -0.150 -0.118 -0.065 -0.012 0.044
accessories 0.029 0.022 0.002 -0.024 -0.031 -0.027 -0.019 -0.011 -0.002
tobacco 0.057 0.017 -0.010 -0.003 0.004 -0.023 -0.059 -0.094 -0.122
stationary -0.132 -0.089 -0.003 0.089 0.114 0.110 0.098 0.085 0.064
books 0.023 0.015 0.000 -0.014 -0.018 -0.018 -0.018 -0.017 -0.015
records 0.011 0.008 0.000 -0.008 -0.010 -0.010 -0.008 -0.006 -0.004
household 0.023 0.014 0.000 -0.013 -0.016 -0.018 -0.018 -0.018 -0.017
candy -0.074 -0.049 -0.001 0.048 0.061 0.061 0.055 0.049 0.039
toys -0.122 -0.070 0.004 0.061 0.074 0.088 0.101 0.113 0.115
jewelry -0.021 -0.013 0.000 0.012 0.015 0.016 0.016 0.016 0.015
perfumes 0.021 0.010 -0.001 -0.008 -0.009 -0.013 -0.018 -0.023 -0.026
hobby 0.048 0.007 -0.012 0.010 0.021 -0.011 -0.055 -0.098 -0.135
other 0.026 0.030 0.007 -0.039 -0.054 -0.036 -0.010 0.017 0.043

Eccentricity of ellipses
[1] 0.757

Ellipse axes, Area, p-values of rows


HL Axis 1 HL Axis 2 Area P-value
clothing 0.013 0.009 0 0.000
accessories 0.010 0.007 0 0.000
tobacco 0.011 0.007 0 0.000
stationary 0.010 0.007 0 0.000
books 0.012 0.008 0 0.000
records 0.008 0.005 0 0.000
household 0.015 0.010 0 0.000
candy 0.013 0.008 0 0.000
toys 0.011 0.007 0 0.000
jewelry 0.013 0.008 0 0.000
perfumes 0.015 0.010 0 0.297
hobby 0.011 0.007 0 0.000
other 0.011 0.007 0 0.000

Ellipse axes, Area, p-values of columns


HL Axis 1 HL Axis 2 Area P-value
M12< 0.034 0.022 0.002 0
M13 0.020 0.013 0.001 0
M16 0.020 0.013 0.001 0
M19 0.023 0.015 0.001 0
M25 0.025 0.016 0.001 0
M35 0.026 0.017 0.001 0
M45 0.031 0.020 0.002 0
M57 0.046 0.030 0.004 0
M65+ 0.070 0.046 0.010 0
The total inertia of the data, defined by the Goodman-Kruskal tau index (which may also be referred
to as the index of predictability) when performing a non symmetrical correspondence analysis, is
τ = 0.0414; in the output this is reported under Tau Index predicting the rows given the column
categories. To determine whether this index is statistically significant, we compute the C-statistic and
find that it is equal to 10331.5 (with 96 degrees of freedom). Therefore, with a p-value that is less than
0.0001, the age of the perpetrators is a statistically significant predictor of the items that are stolen.
The Goodman-Kruskal tau index and the statistical significance of the C-statistic are summarized as
part of the output, together with the partition of the C-statistic, which identifies significant sources of
variation in the ordered column categories. Indeed, we can look at the inertia explained by each
polynomial axis to see how the ordered analysis differs from its non-ordered counterpart. We can see
that the most dominant contribution to the total inertia of the data is due to the component associated
with the linear polynomial of the columns. This location component is 6182 and explains 59.8% of the
total inertia. The next most dominant is the dispersion component of 2642, which reflects that 25.6% of
the variation in the column categories is due to their difference in dispersion. Similarly, the cubic
component is 757 and accounts for about 7% of the column variation. Even though the remaining
higher order components are, taken together, statistically significant (their associated p-value is less
than 0.001), they will not be taken into consideration since polynomials with degree higher than three
(and more commonly, four) provide limited information about the association structure and variation
of the variables. Hence, in this output, the components beyond the cubic are reported collectively as
the error polynomial component. Note that the first two components (linear and dispersion) explain
85.4% of the total inertia, so the first two polynomial axes will provide a sufficient graphical display of
the variation of the categories. Furthermore, with the specification of ellprint = TRUE in the print
method for “CAvariants” objects, the output includes the eccentricity value of the ellipses, the
semi-axis lengths of the ellipse for each of the categories, the area of each ellipse and the associated
p-values.
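
As a rough check of these figures, the tau index and the C-statistic can be recomputed in base R from the contingency table printed in the output. The sketch below is not code from the CAvariants package: the object names (N, tau_num, C_stat and so on) are illustrative, the counts are retyped from the Data Table output, and the relation C = (n - 1)(I - 1)τ with (I - 1)(J - 1) degrees of freedom is the usual Light and Margolin form, assumed here to be what the package reports.

```r
# Shoplifting counts retyped from the "Data Table" output
# (rows = stolen items, columns = ordered age groups).
N <- matrix(c(
   81,  138, 304, 384, 942, 359, 178, 137,  45,
   66,  204, 193, 149, 297, 109,  53,  68,  28,
  150,  340, 229, 151, 313, 136, 121, 171, 145,
  667, 1409, 527,  84,  92,  36,  36,  37,  17,
   67,  259, 258, 146, 251,  96,  48,  56,  41,
   24,  272, 368, 141, 167,  67,  29,  27,   7,
   47,  117,  98,  61, 193,  75,  50,  55,  29,
  430,  637, 246,  40,  30,  11,   5,  17,  28,
  743,  684, 116,  13,  16,  16,   6,   3,   8,
  132,  408, 298,  71, 130,  31,  14,  11,  10,
   32,   57,  61,  52, 111,  54,  41,  50,  28,
  197,  547, 402, 138, 280, 200, 152, 211, 111,
  209,  550, 454, 252, 624, 195,  88,  90,  34
), nrow = 13, byrow = TRUE,
dimnames = list(
  c("clothing", "accessories", "tobacco", "stationary", "books", "records",
    "household", "candy", "toys", "jewelry", "perfumes", "hobby", "other"),
  c("M12<", "M13", "M16", "M19", "M25", "M35", "M45", "M57", "M65+")))

n     <- sum(N)        # sample size
P     <- N / n         # joint proportions
p_row <- rowSums(P)    # margins of the response (items)
p_col <- colSums(P)    # margins of the predictor (age)

# Numerator of tau: sum_j p_.j * sum_i (p_ij / p_.j - p_i.)^2,
# written in its algebraically equivalent short form.
tau_num <- sum(sweep(P^2, 2, p_col, "/")) - sum(p_row^2)
tau     <- tau_num / (1 - sum(p_row^2))

# C-statistic and its chi-squared p-value (assumed Light-Margolin form).
n_row   <- nrow(N)
n_col   <- ncol(N)
C_stat  <- (n - 1) * (n_row - 1) * tau
df      <- (n_row - 1) * (n_col - 1)
p_value <- pchisq(C_stat, df, lower.tail = FALSE)

# Share of the C-statistic carried by each column component
# (component values copied from the printed output).
comps <- c(Location = 6181.536, Dispersion = 2642.363,
           Cubic = 757.192, Error = 750.418)
share <- round(100 * comps / sum(comps), 2)

print(round(c(numerator = tau_num, tau = tau, C = C_stat, df = df), 4))
print(share)
```

If the assumptions hold, the printed values should agree with the package output up to rounding, and the shares reproduce the percentages quoted for the location, dispersion and cubic components.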

Polynomial biplot: Portraying the predictability

When an ordered analysis is performed, the trend plots of the row and column categories are depicted.
For example, when performing a singly ordered NSCA, the variation, or trend, of the row categories is
examined by observing how it is affected by the ordered column categories when using a polynomial
transformation. Figure 1 shows a parabolic trend of the row category clothing. This trend highlights
that there is a greater propensity to steal clothing by people aged 25 to 45 years than those of a younger,
or older, age. Figure 2 provides an alternative visual display of these trends and is constructed by
depicting the row (items) categories using principal coordinates and the column (age) categories using
standard coordinates. Hence a row isometric biplot is constructed. Since the analysis also incorporates
the ordered nature of the column categories and the nominal structure of the row categories, Figure 2
is referred to as the row isometric polynomial biplot of the data.
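
The parabolic trend described for clothing can already be seen in the raw data, before any polynomial smoothing. The following base R sketch computes the centered column profile of clothing, taken here to be the proportion of clothing thefts within each age group minus the overall proportion of clothing thefts. The counts and column totals are retyped from the Data Table output and the variable names are illustrative; Figure 1 itself plots a reconstruction of such profiles from the first two polynomials, which this sketch does not attempt.

```r
ages       <- c("M12<", "M13", "M16", "M19", "M25", "M35", "M45", "M57", "M65+")
clothing   <- c(81, 138, 304, 384, 942, 359, 178, 137, 45)            # clothing row of the table
col_totals <- c(2845, 5622, 3554, 1682, 3446, 1385, 821, 933, 531)    # column sums of the table
n          <- sum(col_totals)                                         # total sample size

# Column profile for clothing: share of clothing among thefts in each age group.
profile <- clothing / col_totals

# Centered profile: deviation from the overall share of clothing thefts.
centered <- setNames(profile - sum(clothing) / n, ages)

print(round(centered, 3))
peak <- names(which.max(centered))   # age group with the largest positive deviation
print(peak)
```

Positive values indicate age groups in which clothing is stolen more often than its overall share would suggest, and negative values the reverse.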
The trend plot of Figure 1 and the polynomial biplot given by Figure 2 can be obtained using the
following command:
R> plot(res, plottype = "biplot", biptype = "row", scaleplot = 5, pos = 1)
When the first two polynomial axes are used to construct the biplot of Figure 2, the resulting configu-
ration has a parabolic shape. Observe that the explained inertia of the polynomial axes is as follows:
The first polynomial axis accounts for 59.8% of the inertia and the second polynomial axis for 25.6% of
the inertia. We can therefore see that the novelty of the polynomial biplot is based on the polynomial
representation of the predictor variable. The first linear polynomial axis represents the deviation from
the mean centered profile accounting for the ordered structure of the age groups, which is reflected in
the correct ordering of the age categories along the first polynomial axis. The second polynomial axis
shows a parabolic shape of the categories with positive concavity. Furthermore, note that the left-hand
side of the first axis is dominated by the young age groups with adolescents and young adults at the
center of the display (who steal items consistent with the average number of thefts of all items). The
mid-adult and older age groups are on the right-hand side of Figure 2.

Figure 1: Trend of rows: A selection of rows of the centered column profile table reconstructed by
using the first two polynomials. (Horizontal axis: Age group scores; vertical axis: Increase in
Predictability; rows shown: clothing, accessories, tobacco, stationary, toys, other.)

Figure 2: Row-isometric polynomial biplot of singly ordered NSCA of shoplifting data: first two
polynomial components, Stolen goods and Age. (Axis 1: 59.83%; Axis 2: 25.58%.)

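
The column standard polynomial coordinates reported in the output can be approximated outside the package. The sketch below builds orthogonal polynomials on the natural scores 1, ..., 9 with respect to the column masses, using a weighted Gram-Schmidt step rather than the recurrence of Emerson (1968). The choice of natural scores, the helper name weighted_orthopoly and the retyped column totals are assumptions made for illustration, not the package's internal code.

```r
# Column totals of the shoplifting table; dividing by n gives the column masses (diagonal of Jmass).
col_totals <- c(2845, 5622, 3554, 1682, 3446, 1385, 821, 933, 531)
w          <- col_totals / sum(col_totals)
scores     <- seq_along(w)            # assumed natural scores 1, ..., 9

# Hypothetical helper: polynomials of degree 1..degree that are orthonormal
# with respect to the weights w, obtained by Gram-Schmidt on 1, x, x^2, ...
weighted_orthopoly <- function(x, w, degree = 2) {
  B <- matrix(1, nrow = length(x), ncol = degree + 1)   # column 1: constant polynomial
  for (d in seq_len(degree)) {
    v <- x^d
    for (k in seq_len(d)) {
      v <- v - sum(w * v * B[, k]) * B[, k]   # remove the projection on each lower degree
    }
    B[, d + 1] <- v / sqrt(sum(w * v^2))      # scale to unit weighted norm
  }
  B[, -1, drop = FALSE]                       # drop the constant column
}

A <- weighted_orthopoly(scores, w, degree = 2)
dimnames(A) <- list(
  c("M12<", "M13", "M16", "M19", "M25", "M35", "M45", "M57", "M65+"),
  c("Axis 1", "Axis 2"))
print(round(A, 3))
```

The first column is simply the standardized score, which is why the age groups are equally spaced and correctly ordered along the first polynomial axis; the second column is the quadratic term that produces the parabolic shape.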
The magnitude of the coordinates indicates the importance of the first two polynomial components
for modeling the trends of the items. In particular, we see that the first two polynomial coordinates are
sufficient to model the trends for most stolen goods. The reliability of the graphical representation can
be assessed by constructing elliptical confidence regions for the row categories, which are depicted
using row principal polynomial coordinates. These ellipses can be obtained using the plot method for
“CAvariants” objects such that

R> plot(res, scaleplot = 1, ell = TRUE, alpha = 0.05)

Figure 3: 95% confidence ellipses in the row isometric polynomial biplot of singly ordered NSCA of
the shoplifting data: Stolen goods and Age. (Axis 1: 59.83%; Axis 2: 25.58%.)


Figure 3 gives the 95% confidence ellipses for the row categories. They are constructed so that the
weights of the semi-axes are expressed in terms of the hybrid generalized correlations rather than the
squared singular values associated with each axis, and so that the information contained in all of the
dimensions is reflected; for the plot method for “CAvariants” objects, this means M = 8. Because the
scale of the coordinates makes the ellipses difficult to see in this figure, we can focus our attention
on the points closer to the origin of Figure 3 by specifying that

R> plot(res, scaleplot = 1, ell = TRUE, alpha = 0.05, prop = 60)


Zooming closer to the origin gives the configuration of points shown in Figure 4. It
shows that the confidence region for perfumes overlaps the origin. This means that all of the
items, except perfumes, are important contributors to the asymmetric association since their confidence
ellipses do not overlap with the origin of the plot.
The contribution of all items to the association structure is also reflected in the p-values that are
summarized as part of the output of the print method for “CAvariants” objects with M = 8. They appear
in the last column of the table titled Ellipse axes, Area, p-values of rows, where alpha = 0.05.
These results show that the only row category that is not statistically significant is perfumes, as
expected from its ellipse, with a p-value of 0.297. If we now consider the age of the males in the
sample, and specify M = 8 when constructing confidence ellipses and calculating p-values, the last
column of the table titled Ellipse axes, Area, p-values of columns shows that all age groups are
useful predictors of the items that are stolen.
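
The ellipse summaries in the output are linked by elementary geometry, which gives a quick consistency check. Assuming that HL Axis 1 and HL Axis 2 are the semi-major and semi-minor axis lengths a and b, as described for the print method, the area of an ellipse is πab and its eccentricity is sqrt(1 - (b/a)^2). The sketch below applies these textbook formulas to the rounded column values retyped from the output; it does not reproduce how CAvariants derives the semi-axes themselves, and the object name ell is illustrative.

```r
# Semi-axis lengths for the column (age) points, retyped from the printed output.
ell <- data.frame(
  a = c(0.034, 0.020, 0.020, 0.023, 0.025, 0.026, 0.031, 0.046, 0.070),  # HL Axis 1
  b = c(0.022, 0.013, 0.013, 0.015, 0.016, 0.017, 0.020, 0.030, 0.046),  # HL Axis 2
  row.names = c("M12<", "M13", "M16", "M19", "M25", "M35", "M45", "M57", "M65+")
)

ell$area         <- pi * ell$a * ell$b            # area of each ellipse
ell$eccentricity <- sqrt(1 - (ell$b / ell$a)^2)   # 0 = circle, close to 1 = very elongated

print(round(ell, 3))
```

Because the inputs are rounded to three decimals, the recovered eccentricities scatter slightly around the single value reported under Eccentricity of ellipses, while the areas can be compared directly with the Area column.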

Conclusion
There are many freely downloadable programs/code available for performing classical correspondence
analysis. For example, the R code of Nenadic and Greenacre (2007) and De Leeuw and Mair (2009a)
may be considered for performing simple and joint correspondence analysis. However, the CAvariants
package provides variants of correspondence analysis which are not offered by other correspondence
analysis R packages on CRAN. To the best of these authors’ knowledge, CAvariants is the only package
available that provides the user with the option of performing six variants of two-way correspondence
analysis and, in particular, ordered symmetrical and non symmetrical correspondence analysis variants.
Indeed, symmetrical correspondence analysis for ordered variables was implemented in SPLUS by
Beh (2004b) and has been easily adapted for R.
Subsequent versions of the function may allow for more flexibility by giving the user more tools
to assess the reliability of graphical results. These may include bootstrap confidence regions to
complement the algebraic regions developed by these authors, or three-dimensional polynomial
biplots. Beh and Lombardo (2014) and Lombardo et al. (2016) describe the theoretical aspects of
these variants of correspondence analysis for two-way contingency tables in detail and also provide
fragments of R code to undertake the relevant calculations. This paper, by contrast, has described the
CAvariants package by demonstrating the applicability of one variant and providing new insight into
the development of elliptical regions for ordered variants of correspondence analysis.

Figure 4: A zoomed view of the origin of the row-isometric polynomial biplot given by Figure 3.
(Panel title: 95% Confidence Ellipses; Axis 1: 59.83%; Axis 2: 25.58%.)

Bibliography
H. Abdi. Discriminant correspondence analysis. In N. J. Salkind, editor, Encyclopedia of Measurement
and Statistics, pages 270–275. Sage Publications, Inc., 2007. [p168]

G. Alberti. CAinterprTools: An R package to help interpreting correspondence analysis results.
SoftwareX, 1–2:26–31, 2015. doi: 10.1016/j.softx.2015.07.001. [p167, 168]

M. J. Baxter and H. E. M. Cool. Correspondence analysis in R for archaeologists: An educational
account. Archeologia e Calcolatori, 21:211–228, 2010. [p168]

D. Beaton, C. R. C. Fatt, and H. Abdi. An ExPosition of multivariate analysis with the singular value
decomposition in R. Computational Statistics & Data Analysis, 72:176–189, 2014.
doi: 10.1016/j.csda.2013.11.006. [p168]

E. J. Beh. Simple correspondence analysis of ordinal cross-classifications using orthogonal polynomials.
Biometrical Journal, 39:589–613, 1997. doi: 10.1002/bimj.4710390507. [p170]

E. J. Beh. Partitioning Pearson’s chi-squared statistic for singly ordered two-way contingency tables.
The Australian and New Zealand Journal of Statistics, 43:327–333, 2001. doi: 10.1111/1467-842x.00179.
[p170]

E. J. Beh. Simple correspondence analysis: A bibliographic review. International Statistical Review, 72:
257–284, 2004a. doi: 10.1111/j.1751-5823.2004.tb00236.x. [p167]

E. J. Beh. S-PLUS code for ordinal correspondence analysis. Computational Statistics, 19:593–612, 2004b.
doi: 10.1007/bf02753914. [p181]

E. J. Beh. Simple correspondence analysis of nominal-ordinal contingency tables. Journal of Applied
Mathematics and Computer Sciences, 8:1–12, 2008. doi: 10.1155/2008/218140. [p170]


E. J. Beh. Elliptical confidence regions for simple correspondence analysis. Journal of Statistical Planning
and Inference, 140:2582–2588, 2010. doi: 10.1016/j.jspi.2010.03.018. [p171, 172]

E. J. Beh and R. Lombardo. Correspondence Analysis: Theory, Practice and New Strategies. John Wiley &
Sons, 2014. doi: 10.1002/9781118762875. [p167, 168, 170, 171, 172, 173, 182]

E. J. Beh and R. Lombardo. Confidence regions and p-values for classical and non-symmetric cor-
respondence analysis. Communications in Statistics – Theory and Methods, 44:95–114, 2015. doi:
10.1080/03610926.2013.768665. [p171, 172]

J. P. Benzécri. Analyse des Données. Dunod, Paris, 1973. [p167]

D. J. Best and J. C. W. Rayner. Nonparametric analysis for doubly ordered two-way contingency tables.
Biometrics, 52:1153–1156, 1996. doi: 10.2307/2533077. [p170]

D. Chessel, A. B. Dufour, and J. Thioulouse. The ade4 package I: One-table methods. R News, 4(1):
5–10, 2004. URL https://www.R-project.org/doc/Rnews/Rnews_2004-1.pdf. [p168]

J. G. Clavel, S. Nishisato, and A. Pita. dualScale: Dual Scaling Analysis of Multiple Choice Data, 2014.
URL https://CRAN.R-project.org/package=dualScale. R package version 0.9.1. [p167]

L. D’Ambra and N. C. Lauro. Non-symmetrical correspondence analysis for three-way contingency ta-
ble. In R. Coppi and S. Bolasco, editors, Multiway Data Analysis, pages 301–315. Elsevier, Amsterdam,
1989. [p169]

J. De Leeuw. Correspondence analysis in R. www.cuddyvalley.org/psychoR/ca, 2006. [p167]

J. De Leeuw and P. Mair. Simple and canonical correspondence analysis using the R package anacor.
Journal of Statistical Software, 31(5):1–18, 2009a. doi: 10.18637/jss.v031.i01. [p167, 181]

J. De Leeuw and P. Mair. Gifi methods for optimal scaling in R: The package homals. Journal of
Statistical Software, 31(4):1–20, 2009b. doi: 10.18637/jss.v031.i04. [p167]

S. Dray and A. B. Dufour. The ade4 package: Implementing the duality diagram for ecologists. Journal
of Statistical Software, 22(4):1–20, 2007. doi: 10.18637/jss.v022.i04. [p168]

P. L. Emerson. Numerical construction of orthogonal polynomials from a general recurrence formula.
Biometrics, 24:696–701, 1968. doi: 10.2307/2528328. [p170]

J. Gower, S. Lubbe, and N. le Roux. Understanding Biplots. John Wiley & Sons, Chichester, 2011. doi:
10.1002/9780470973196. [p173]

M. Greenacre. Theory and Application of Correspondence Analysis. London Academic Press, London,
1984. [p167, 171]

A. Israëls. Eigenvalue Techniques for Qualitative Data. DSWO Press, Leiden, 1987. [p175]

B. Kostov, M. Bécue-Bertaut, and F. Husson. Correspondence analysis on generalised aggregated
lexical tables (CA-GALT) in the FactoMineR package. The R Journal, 7(1):109–117, 2015. URL
https://journal.r-project.org/archive/2015-1/kostov-becuebertaut-husson.pdf. [p167]

P. M. Kroonenberg and R. Lombardo. Nonsymmetric correspondence analysis: A tool for analysing
contingency tables with a dependence structure. Multivariate Behavioral Research Journal, 34:367–397,
1999. doi: 10.1207/s15327906mbr3403_4. [p169]

N. C. Lauro and L. D’Ambra. L’analyse non symmetrique des correspondances. In E. Diday, editor,
Data Analysis and Informatics III, pages 433–446. Elsevier, Amsterdam, 1984. [p169]

S. Lê, J. Josse, and F. Husson. FactoMineR: An R package for multivariate analysis. Journal of Statistical
Software, 25(1):1–18, 2008. doi: 10.18637/jss.v025.i01. [p167]

L. Lebart, A. Morineau, and K. M. Warwick. Multivariate Descriptive Statistical Analysis. John Wiley &
Sons, New-York, USA, 1984. [p167, 171, 172]

D. G. Leibovici. Spatio-temporal multiway decomposition using principal tensor analysis on k-modes:
The R package PTAk. Journal of Statistical Software, 34(10):1–34, 2010. doi: 10.18637/jss.v034.i10.
[p168]

D. G. Leibovici. Principal tensor analysis on k modes. http://c3s2i.free.fr/, 2015. [p168]


A. B. N. Librero, P. Willems, and P. G. Villardon. cncaGUI: Canonical Non-Symmetrical Correspondence
Analysis in R, 2015. URL https://CRAN.R-project.org/package=cncaGUI. R package version 1.0.
[p168]

R. J. Light and B. H. Margolin. An analysis of variance for categorical data. Journal of the American
Statistical Association, 66(335):534–544, 1971. doi: 10.1080/01621459.1971.10482297. [p170]

M. Linting, J. J. Meulman, P. F. J. Groenen, and A. J. Van der Kooij. Stability of nonlinear principal
components analysis: An empirical study using the balanced bootstrap. Psychological Methods, 12(3):
359–379, 2007. doi: 10.1037/1082-989x.12.3.359. [p171]

R. Lombardo and E. J. Beh. CAvariants: Correspondence Analysis Variants, 2017.
URL https://CRAN.R-project.org/package=CAvariants. R package version 3.4. [p167]

R. Lombardo and T. J. Ringrose. Bootstrap confidence regions in non-symmetrical correspondence
analysis. Electronic Journal of Applied Statistical Analysis, 5:413–417, 2012.
doi: 10.1080/00949655.2011.579968. [p171, 172]

R. Lombardo, E. J. Beh, and P. M. Kroonenberg. Modelling trends in ordered correspondence analysis
using orthogonal polynomials. Psychometrika, 81:325–349, 2016. doi: 10.1007/s11336-015-9448-y.
[p168, 169, 170, 171, 175, 182]

M. T. Markus. Bootstrap Confidence Regions in Non-Linear Multivariate Analysis. DSWO Press, 1994.
[p171]

F. Murtagh. Correspondence Analysis and Data Coding with Java and R. Chapman & Hall/CRC, Boca
Raton, FL, 2005. doi: 10.1201/9781420034943. [p167, 168]

O. Nenadic and M. Greenacre. Correspondence analysis in R, with two- and three-dimensional
graphics: The ca package. Journal of Statistical Software, 20:1–13, 2007. doi: 10.18637/jss.v020.i03.
[p167, 181]

S. Nishisato. Multidimensional Nonlinear Descriptive Analysis. Taylor & Francis Group, LLC, 2007. [p167,
169]

J. Oksanen, F. G. Blanchet, M. Friendly, R. Kindt, P. Legendre, D. McGlinn, P. R. Minchin, R. B. O’Hara,
G. L. Simpson, P. Solymos, M. H. H. Stevens, E. Szoecs, and H. Wagner. vegan: Community Ecology
Package, 2016. URL https://CRAN.R-project.org/package=vegan. R package version 2.4-1. [p168]

J. C. W. Rayner and E. J. Beh. Towards a better understanding of correlation. Statistica Neerlandica, 63:
324–333, 2009. doi: 10.1111/j.1467-9574.2009.00425.x. [p170]

T. J. Ringrose. Bootstrap confidence regions for correspondence analysis. Journal of Statistical Computa-
tion and Simulation, 83:1397–1413, 2012. doi: 10.1080/00949655.2011.579968. [p167, 171]

B. Ripley. MASS: Support Functions and Datasets for Venables and Ripley’s MASS, 2016.
URL https://CRAN.R-project.org/package=MASS. R package version 7.3-45. [p167]

J. Thioulouse, D. Chessel, S. Dolédec, and J. M. Olivier. ADE-4: A multivariate analysis and graphical
display software. Statistics and Computing, 7:75–83, 1997. doi: 10.1023/a:1018513530268. [p168]

W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer-Verlag, 4th edition, 2002.
doi: 10.1007/978-0-387-21706-2. [p167]

J. L. Williams, H. Abdi, R. French, and B. J. Orange. A tutorial on multi-block discriminant
correspondence analysis (MUDICA): A new method for analyzing discourse data from clinical populations.
Journal of Speech Language and Hearing Research, 53:1372–1393, 2010. [p168]

Rosaria Lombardo
Department of Economics, University of Naples Campania “Luigi Vanvitelli”
via Gran Priorato di Malta, Capua 81043
Italy
rosaria.lombardo@unina2.it

Eric J. Beh
School of Mathematical & Physical Sciences, University of Newcastle
University Drive, Callaghan, NSW, 2308 Australia
eric.beh@newcastle.edu.au
