Zenon Gniazdowski NR 27
57-82
DOI: 10.26348/znwwsi.27.57
Zenon Gniazdowski*
Abstract
The article investigates the possibility of measuring the strength of a linear corre-
lation relationship between nominal data and numerical data. Correlation coeffi-
cients for variables coded with real numbers as well as for variables coded with
complex numbers were studied. For variables coded with real numbers, unam-
biguous measures of real linear correlation were obtained. In the case of complex
coding, it has been observed that the obtained complex correlation coefficients
change with the permutation of the phases in the complex numbers used to code
classes of elements with equal cardinalities. It was found that a necessary condi-
tion for linear correlation is the possibility of linear ordering of a set with data.
Since linear order is not possible in the set of complex numbers, complex correla-
tion coefficients cannot be used as a measure of linear correlation. In the event of
such a situation, a substitute action was suggested that would prevent equal cardi-
nality of classes of identical elements contained in the set with nominal data. This
action would consist in the correction of data, analogous to the correction during
preprocessing or cleaning of data containing missing or outlier values.
Keywords — nominal data, numerical data, numerical coding of nominal data, complex ran-
dom variable, correlation coefficient, complex correlation, complex least squares method
1 Introduction
In classical statistics, the χ2 test is used to test the correlation between nominal variables. A
contingency table is created for nominal data. On its basis, the χ2 statistic can be used to assess
the significance of the correlation between two nominal variables. Cramér's V coefficient,
also estimated using the χ2 statistic, can be used to measure the strength of this relationship [1].
On the other hand, for variables measured on at least an ordinal scale, the Pearson correlation
coefficient or the Spearman rank correlation coefficient is used [1][2].
* E-mail: zgniazdowski@wwsi.edu.pl
In [3] a different approach to the problem was proposed. In this approach, nominal data is
given a numerical interpretation. Since a random variable measured on a nominal scale takes
k different values, each of the k subsets of identical values will be called a class later in this
article. Depending on the cardinality of different classes, the elements of a given class will be
coded with real numbers or complex numbers. If there are no classes with equal cardinality,
then a given class will be assigned a real number which is a function of the cardinality of the
class. If there are m classes of equal cardinality in the considered set of values, then each of
these classes is assigned a complex number whose modulus is a function of the cardinality of a
given class, and the phase is equal to the phase of one of the m roots of unity. The phases are
chosen arbitrarily. However, it is required that the phases for each of the classes with identical
cardinalities be different.
The properties of numerical coding make it possible to cluster or classify nominal data using
algorithms specific to numerical data [3]. It remains an open question whether the numerical
coding of nominal data mentioned here can be used to assess the level of correlation between
coded nominal random variables. This article will attempt to answer this question. For this
purpose, the possibility of measuring the strength of the correlation between a variable measured
on the nominal scale and a variable measured on the nominal scale or on a stronger (at least
ordinal) scale will be analyzed. The starting point for this analysis will be two observations:
• The correlation coefficient between two random variables has an interpretation of the
cosine of the angle between the vectors containing the random components of these
variables [4].
• Nominal data can be coded with real or complex numbers. Since in a real and complex
vector space it is possible to define the scalar product and the Euclidean norm, it is also
possible to calculate the cosine of the angle between the vectors [3].
The examples will show the influence of coding proposed in the paper [3] on the possibility of
identifying linear correlation between nominal variables. First, examples of correlation analysis
for nominal variables described by real numbers will be shown. Next, the possibility of
correlation analysis for nominal variables described by complex numbers will be examined and
discussed.
The analysis will be carried out on artificial data, specially prepared for the purposes of this
work (see footnote 1). Sample input data for the analysis will be provided in the form of a contingency table.
A contingency table provides a compact representation of data consisting of many records.
Since the aim of this article is to attempt to assess the possibility of measuring the strength
of correlation between a variable measured on a nominal scale and a variable measured on a
nominal or stronger scale, the analyzed data should be selected in such a way that at least one
variable can be interpreted as if it had been measured on a scale stronger than nominal, and
therefore at least on an ordinal scale. For this reason, an additional restriction is imposed on the
contingency table. It is required that after numerical coding of the nominal variables, the second
variable may be treated as if it were measured on a scale stronger than the nominal scale:
Footnote 1: To present the discussed problems, the author of this article uses artificially prepared data, because so
far he has not encountered non-trivial nominal data sets that would contain classes of identical elements
with equal cardinalities.
On the Analysis of Correlation Between Nominal Data and Numerical Data
• The first variable will be constructed in such a way that it can be coded using real num-
bers or complex numbers. In the case of coding with complex numbers, the first variable
will contain at least two classes with identical cardinalities. This will be manifested
by the fact that at least two rows in the contingency table will have the same sums of
elements.
• The second variable will always be constructed in such a way that it can only be coded
with real numbers. This will only be possible if the variable does not contain classes
with equal cardinalities. This will be manifested by the fact that in the columns of the
contingency table the sums of elements will always be different in pairs. This approach
will allow the second variable to be treated as if it were measured on a scale with a
possible linear order relation, i.e. on a scale stronger than the nominal scale.
Since the second variable will be coded with real numbers with a well-defined linear order
relation, the analyzed problem will become equivalent to the problem of examining the strength
of the linear correlation relationship between a random variable measured in a nominal scale
and a random variable measured in at least an ordinal scale. Thanks to this, the conclusions
resulting from the study of correlations for two nominal variables will also be appropriate for
the case of correlation between a variable measured in a nominal scale and a variable measured
in one of the stronger scales: ordinal, interval or ratio.
Finally, a remark on the adopted notation is in order. In some mathematical formulas,
a line will appear over the letter denoting the variable. If a given subsection concerns the
geometric interpretation of correlations between random variables, the line will be placed over
the capital letter and will represent the average value of the random variable. On the other
hand, when a subsection deals with the definition of a scalar product for vectors containing
complex numbers, or concerns correlation for complex random variables, then the line will be
over lowercase letters and will indicate the conjugate of a complex number.
2 Preliminaries
The preliminaries present some ideas to which the author will refer later in this article.
Here, the following concepts will be introduced: binary relation, measurement scales, the χ2
statistic, strength of the correlation relationship for nominal data, numerical coding of nominal
data, geometric interpretation of a correlation, as well as the complex least squares method.
In the set of complex numbers C, it is not possible to define the relation in the above way. Here
it is possible to define the weaker relation, that is, a partial order relation:
Table 2: Contingency table for nominal data
        X    Y    Total
A       3    0    3
B       2    2    4
C       0    2    2
Total   5    4    9
The nominal scale assumes the classification of data into different classes. For data measured
on a nominal scale, it can be said that the two measured values are equivalent or different. It
cannot be said that one value precedes another. This means that nominal data cannot be sorted
in any way.
Measurement on the ordinal scale is more precise than measurement on the nominal scale.
The ordinal scale allows you to order a set according to the degree to which the elements of the
set have certain features, but does not give information about the magnitude of the differences
between these elements.
The interval scale makes it possible not only to order objects in terms of the degree of pos-
sessing a certain feature, but also gives the ability to determine the distance between objects.
An interval scale is a continuous numerical scale that has no absolute zero. Zero on the mea-
surement scale is set arbitrarily. An example of interval scale measurement is a temperature
measurement in degrees Celsius or degrees Fahrenheit. It is known how much one measure-
ment result is greater than another. However, it is impossible to say how many times one result
is greater than the other. For example, for temperature measurement on the Celsius scale, it
can be said that the temperature of 27 degrees is 18 degrees higher than the temperature of 9
degrees. On the other hand, it cannot be said that a temperature of 27 degrees is three times
higher than a temperature of 9 degrees.
The ratio scale has all the features of an interval scale. Its additional feature is that this
scale has absolute zero. Therefore, with regard to the measurement results on a ratio scale, it
can be said how many times one measurement result is greater than another. An example of a
ratio scale is the Kelvin temperature scale.
The concept of measurement scales is closely related to two types of relations.
For elements of a set measured on a nominal scale, an equivalence relation can be defined.
For elements of a set measured on an ordinal, interval or ratio scale, a linear order relation
can be defined. Thanks to the linear order relation, data measured on ordinal, interval, and ratio
scales can also be ranked.
2.3 The study of the relationship between nominal data in classical statistics
An example of a contingency table for nominal data (Table 2) is given. The table describes the
interdependence of two nominal random variables V1 and V2 . The first variable (V1 ) takes three
values: {A, B, C}. The second variable (V2 ) takes two values: {X, Y }. Table 3 shows the
random variables V1 and V2 reconstructed from the contingency table (Table 2).
Table 3: Random variables V1 and V2 reconstructed from the contingency table (Table 2)
V1   V2
A    X
A    X
A    X
B    X
B    X
B    Y
B    Y
C    Y
C    Y
Table 4: Expected frequencies estimated for the contingency table (Table 2)
        X       Y       Total
A       1.667   1.333   3
B       2.222   1.778   4
C       1.111   0.889   2
Total   5       4       9
2.3.1 Statistics χ2
Individual cells of the contingency table ($O_{ij}$) count the observed frequencies of pairs of
different nominal values. Based on the observed frequencies, the relevant elements ($E_{ij}$) in the
expected frequency table (Table 4) can be estimated [1]. The element $E_{ij}$ is equal to the product
of the sum of the elements in the i-th row of the contingency table ($\sum_j O_{ij}$) and the sum of the
elements in the j-th column of the contingency table ($\sum_i O_{ij}$), divided by the sum of all elements
of the contingency table ($\sum_{ij} O_{ij}$):
$$E_{ij} = \frac{\sum_j O_{ij} \cdot \sum_i O_{ij}}{\sum_{ij} O_{ij}}. \tag{3}$$
The estimated table of expected frequencies is shown in Table 4. Now, based on the contents of
the table of observed frequencies and the table of expected frequencies, the value of the statistic
χ2 can be estimated [1]:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}. \tag{4}$$
For the contingency table considered here (Table 2), the statistic χ2 is 4.95. In classical statis-
tics, the value of the χ2 statistic can be used to test the significance of the correlation between
the two nominal variables. To do this, a null hypothesis should be formulated, and it should then
be checked whether this hypothesis can be rejected.
$$df = (r - 1) \times (c - 1). \tag{5}$$
If the probability p is small enough, the null hypothesis can be rejected. Assuming that the
significance level α = 0.1, the null hypothesis can be rejected when the obtained probability p
is less than α. For the example in Table 2, the estimated probability p is 0.0841. This means
that at the significance level α = 0.1, the null hypothesis should be rejected. At the same time,
the alternative hypothesis, which says that both variables are significantly correlated, should be
accepted.
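The calculation for Table 2 can be retraced with a short script. This is a minimal sketch rather than the author's implementation: the function names `expected_frequencies` and `chi_square` are chosen here for illustration, and the p-value uses the closed form exp(−χ2/2), which holds only for two degrees of freedom (a general routine would be needed for other table sizes).

```python
import math

# Observed frequencies from Table 2 (rows A, B, C; columns X, Y).
observed = [[3, 0],
            [2, 2],
            [0, 2]]

def expected_frequencies(table):
    """E_ij = (row sum i) * (column sum j) / (grand total), as in formula (3)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

def chi_square(table):
    """Sum of (O_ij - E_ij)^2 / E_ij over all cells, as in formula (4)."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, expected)
               for o, e in zip(o_row, e_row))

chi2 = chi_square(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)   # formula (5)

# Assumption: closed form of the chi-square survival function, valid for df = 2 only.
p_value = math.exp(-chi2 / 2) if df == 2 else None

print(round(chi2, 2), df, round(p_value, 3))   # 4.95 2 0.084
```

The expected frequencies returned by `expected_frequencies(observed)` agree with Table 4 to three decimal places.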
The χ2 test also has its limitations. When using this test, it is required that the expected
frequencies be not less than five [1]. Some sources state that expected frequencies should not be
less than ten [2][7]. If the expected numbers are too small, Yates’ correction or Fisher’s exact
test [1][2][7] is used for contingency tables of size 2 × 2. For tables of larger sizes, Fisher’s
exact test is a combinatorial problem of high computational complexity. For such a case, a
graph algorithm was proposed that extends the feasibility limits of Fisher’s exact test [8].
$$R = \frac{n + 1}{2}. \tag{8}$$
The rank of elements in the class with higher cardinality will be greater than the rank of elements
belonging to the class of lower cardinality. Hence, it can be seen that in the set with ranks
defined by the formula (8) a linear order relation (1) is possible.
In the above expression $i = \sqrt{-1}$, φ = 2πj/k is the phase arbitrarily assigned to the successive
(j-th) class, and R is the rank calculated according to the formula (8), the same as in the case
of classes with different cardinalities. In the presented concept, R is the modulus of the complex
rank, depending on the cardinality of a given nominal value in the set. This approach assigns
identical rank moduli to classes with identical cardinalities, and distinguishes classes with
identical cardinalities by their different phases.
Unfortunately, unlike in a set with real ranks, a linear order relation is not possible in a set
with complex ranks. Here only the partial order relation (2) is possible.
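The coding rule can be sketched in a few lines. The function name `code_nominal` is hypothetical, and the order in which phases are handed out (order of first appearance, starting with j = 0) is only one of the k! arbitrary assignments; the modulus follows formula (8) and the phase follows φ = 2πj/k.

```python
import cmath
from collections import Counter

def code_nominal(values):
    """Code nominal values with real or complex ranks.

    Modulus: R = (n + 1) / 2, where n is the class cardinality (formula (8)).
    Phase: for k classes sharing a cardinality, the j-th of them receives
    phi = 2*pi*j/k. Assumption: j follows the order of first appearance.
    """
    counts = Counter(values)                 # class -> cardinality
    by_cardinality = {}
    for label, n in counts.items():
        by_cardinality.setdefault(n, []).append(label)

    codes = {}
    for n, labels in by_cardinality.items():
        modulus = (n + 1) / 2
        k = len(labels)
        for j, label in enumerate(labels):
            if k == 1:
                codes[label] = modulus       # unique cardinality: real rank
            else:
                codes[label] = cmath.rect(modulus, 2 * cmath.pi * j / k)
    return [codes[v] for v in values]

v1 = ["A"] * 3 + ["B"] * 4 + ["C"] * 2       # first variable of Table 3
v2 = ["X"] * 5 + ["Y"] * 4                   # second variable of Table 3
print(code_nominal(v1))   # [2.0, 2.0, 2.0, 2.5, 2.5, 2.5, 2.5, 1.5, 1.5]
print(code_nominal(v2))   # [3.0, 3.0, 3.0, 3.0, 3.0, 2.5, 2.5, 2.5, 2.5]
```

For a variable with two classes of four elements each, the same function returns codes with modulus 2.5 and phases 0 and π, up to floating-point rounding in the imaginary part.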
Based on the above definition of the scalar product, the Euclidean norm can also be defined:
$$\|x\| = \sqrt{(x, x)}. \tag{11}$$
Since the real number is a special case of a complex number, the above formulas for the scalar
product, the Euclidean norm and the metric are also valid when coding nominal variables with
real numbers. Moreover, the ability to define a metric allows the use of numerical coding for the
purposes of clustering and classification [3].
In the numerator of the above formula there is the scalar product of the vectors x and y, and
in the denominator there is the product of the lengths of these vectors. This quotient expresses
the cosine of the angle between the two vectors. This means that the correlation coefficient is
identical to the cosine of the angle between the random components of random variables [4]:
$$R(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} = \frac{x \cdot y}{\|x\| \cdot \|y\|} = \cos \angle(x, y). \tag{17}$$
The constant $a + b\bar{X}$ represents the mean value of variable X after a linear transformation.
The random component x becomes a new random component bx after the transformation of
the variable X. Since the vector bx is parallel to the vector x, and the correlation coefficient
is formally identical to the cosine of the angle between the vectors representing the random
components of two random variables, this linear transformation does not change the modulus
of the correlation coefficient.
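The geometric reading of formula (17) and the effect of a linear transformation can be checked numerically. The sketch below uses hypothetical data and hypothetical helper names (`random_component`, `correlation`); it is meant only to illustrate the statement above for real-valued variables.

```python
import math

def random_component(values):
    """Subtract the mean, leaving the random component of the variable."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def correlation(x_values, y_values):
    """Correlation as the cosine of the angle between centered vectors (17)."""
    x = random_component(x_values)
    y = random_component(y_values)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical data, chosen only for illustration.
X = [1.0, 2.0, 4.0, 7.0, 8.0]
Y = [2.0, 1.0, 5.0, 6.0, 9.0]

r = correlation(X, Y)
r_scaled = correlation([3.0 + 2.0 * v for v in X], Y)     # a = 3, b = 2
r_flipped = correlation([3.0 - 2.0 * v for v in X], Y)    # a = 3, b = -2

print(math.isclose(r, r_scaled))      # True: positive b keeps the coefficient
print(math.isclose(r, -r_flipped))    # True: negative b only flips its sign
```

In both cases the modulus of the coefficient is unchanged, because bx stays parallel to x.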
As noted in the introduction, in the formula above, $\bar{x}$ and $\bar{y}$ are conjugates of the complex
numbers x and y.
V1 V2
2 3
2 3
2 3
2.5 3
2.5 3
2.5 2.5
2.5 2.5
1.5 2.5
1.5 2.5
Table 6: Contingency table for nominal data with a random variable V1 containing two
classes (A and B) with identical cardinalities
X Y Total
A 3 1 4
B 2 2 4
Total 5 3 8
V1 V2
A X
A X
A X
A Y
B X
B X
B Y
B Y
3.2.1 Example No. 2 – nominal variable with two classes with identical cardinalities
Table 6 shows an example of a contingency table in which the first nominal variable contains
two classes with identical cardinalities. These are classes containing values A and B. Table 7
Table 8: Data from Table 7 after complex coding of variables V1 and V2 . In the columns
V11 and V12 two phase permutations are taken into account
V11 V12 V2
−2.5 + 0i 2.5 + 0i 3
−2.5 + 0i 2.5 + 0i 3
−2.5 + 0i 2.5 + 0i 3
−2.5 + 0i 2.5 + 0i 2
2.5 + 0i −2.5 + 0i 3
2.5 + 0i −2.5 + 0i 3
2.5 + 0i −2.5 + 0i 2
2.5 + 0i −2.5 + 0i 2
Table 9: Contingency table for nominal data with a random variable containing three
classes (A, B and C) with identical cardinalities
X Y Total
A 1 4 5
B 2 3 5
C 3 2 5
D 1 2 3
Total 7 11 18
shows the reconstructed random variables V1 and V2 . Since the variable V1 takes the value of A
four times and also takes the value of B four times, the modulus of the numbers used to code these
values is 2.5. The arbitrarily adopted phases that distinguish the values
of A and B after their complex coding are φ = 0° and φ = 180°, respectively.
Since the phase for complex coding is chosen arbitrarily, the question arises about the
influence of the selected phase on the value of the correlation coefficient. For two classes with
identical cardinalities, the phases can be chosen in 2! = 2 ways. Both ways of coding the random
variable V1 are shown in the first two columns of Table 8 (V11 and V12 ). The third column
shows the variable V2 after coding. For the data from Table 8, the estimated correlation
coefficients are R (V11 , V2 ) = −0.258 and R (V12 , V2 ) = 0.258, respectively.
The correlation coefficients for different phase permutations are shown in Figure 1. Com-
paring both values of the estimated correlation coefficients, it should be noted that their values
result from the arbitrarily assigned phase in the coding of the nominal variable with two classes
with identical cardinalities. It can be seen that both values of the correlation coefficient are
located on the real axis symmetrically with respect to the center of the coordinate system. This
means that the obtained solution is not unambiguous.
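The two coefficients can be reproduced directly from the columns of Table 8. The following is a minimal sketch: the function name `complex_correlation` is hypothetical, and the scalar product is assumed to be the sum of x_i times the conjugate of y_i (the opposite convention would return the complex conjugate of the result, which makes no difference here because the values are real).

```python
import math

def complex_correlation(x_values, y_values):
    """Cosine-type coefficient for possibly complex data.

    Assumption: scalar product (x, y) = sum(x_i * conj(y_i)).
    """
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    x = [complex(v) - mean_x for v in x_values]
    y = [complex(v) - mean_y for v in y_values]
    dot = sum(a * b.conjugate() for a, b in zip(x, y))
    norm_x = math.sqrt(sum(abs(a) ** 2 for a in x))
    norm_y = math.sqrt(sum(abs(b) ** 2 for b in y))
    return dot / (norm_x * norm_y)

# Columns of Table 8.
v11 = [-2.5] * 4 + [2.5] * 4
v12 = [2.5] * 4 + [-2.5] * 4
v2 = [3, 3, 3, 2, 3, 3, 2, 2]

r1 = complex_correlation(v11, v2)
r2 = complex_correlation(v12, v2)
print(round(r1.real, 3), round(r2.real, 3))   # -0.258 0.258
```

Swapping the two phases only changes the sign of the first variable, so the two results are mirror images with respect to zero.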
3.2.2 Example No. 3 – nominal variable with three classes with identical cardinalities
Another example of data is presented in Table 9. Variable V1 contains three classes of elements
with equal cardinality. These are classes A, B and C. There is also a class D with cardinality
other than classes A, B and C. Classes A, B and C contain 5 elements each. Class D contains
3 elements. The nominal value of D was coded with the real number 2 = (3 + 1) /2. The
nominal values A, B and C are assigned the modulus 3 = (5 + 1)/2. The arbitrary phases
that distinguish the coded values of A, B and C can take the three values φ = 0°,
φ = 120° and φ = 240°. In Table 10, the first six columns show the coded first
variable for all 3! = 6 phase permutations. The second index digit in the variable name V1 is
used to distinguish the permutations. The last column of Table 10 shows the variable V2 after
coding.
For the data in Table 10, six complex correlation coefficients between the V1 and V2 vari-
ables were estimated, for different phase permutations attributed to different values of the V1
variable belonging to classes having identical cardinalities. The obtained results are presented
in the complex plane in Figure 2. It can be seen that the obtained correlation coefficients are
distributed in the complex plane symmetrically with respect to the point with the coordinate
0.013 lying on the real axis. In contrast to the previous example, the center of symmetry for
all obtained correlation coefficients is outside the central point of the complex plane. This is
because in the current example, the variable V1 , in addition to the three equal-cardinality classes
containing the values A, B, and C, also has a class D with a cardinality different from classes
A, B, and C. As in the previous example, the solution obtained is not unambiguous.
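The six coefficients can be recomputed from the contingency table (Table 9). The sketch below rests on assumptions that are not spelled out in this section: the helper name `complex_correlation` is hypothetical, the scalar product is taken as the sum of x_i times the conjugate of y_i, and the second variable is coded as X → 4 = (7 + 1)/2 and Y → 6 = (11 + 1)/2 according to formula (8).

```python
import cmath
import math
from itertools import permutations

def complex_correlation(x_values, y_values):
    """Assumption: scalar product (x, y) = sum(x_i * conj(y_i))."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    x = [complex(v) - mean_x for v in x_values]
    y = [complex(v) - mean_y for v in y_values]
    dot = sum(a * b.conjugate() for a, b in zip(x, y))
    norm_x = math.sqrt(sum(abs(a) ** 2 for a in x))
    norm_y = math.sqrt(sum(abs(b) ** 2 for b in y))
    return dot / (norm_x * norm_y)

# Table 9: counts of (V1 class, V2 class) pairs.
table = {"A": {"X": 1, "Y": 4},
         "B": {"X": 2, "Y": 3},
         "C": {"X": 3, "Y": 2},
         "D": {"X": 1, "Y": 2}}

y_code = {"X": 4.0, "Y": 6.0}                       # real ranks of V2
phases = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]    # 0, 120 and 240 degrees

coefficients = []
for perm in permutations(phases):
    # Classes A, B, C share modulus 3; class D keeps the real rank 2.
    x_code = {label: cmath.rect(3.0, phi) for label, phi in zip("ABC", perm)}
    x_code["D"] = 2.0 + 0j
    xs, ys = [], []
    for row, cells in table.items():
        for col, count in cells.items():
            xs += [x_code[row]] * count
            ys += [y_code[col]] * count
    coefficients.append(complex_correlation(xs, ys))

centroid = sum(coefficients) / len(coefficients)
for c in coefficients:
    print(f"{c.real:+.3f} {c.imag:+.3f}i")
print(round(centroid.real, 3))   # 0.013
```

Under these assumptions the mean of the six coefficients lies on the real axis at about 0.013, in line with the center of symmetry reported for Figure 2.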
Table 10: Data from the Table 9 after coding variables V1 and V2 . The first six columns
show the representations of the V1 variable obtained for different phase permutations
3.2.3 Example No. 4 – nominal variable with four classes with identical cardinalities
Another example of data is shown in Table 11. The variable V1 contains four classes with equal
cardinality, containing the values of U , W , X and Y . This variable also contains a class of
Z values whose cardinality is different from the cardinality of classes U , W , X and Y . The
variable V1 was coded using complex numbers. Since the variable V1 contains four classes
with the same cardinality, 4! = 24 different phase permutations are possible. For all these
permutations, Figure 3 shows all possible values of correlation coefficients between variable V1
and variable V2 in the complex plane. Since the variable V1 contains a class of Z values with a
cardinality different from the cardinality of classes U , W , X and Y , the
estimated correlation coefficients are again not symmetric with respect to the center of the coordinate
system. The results are distributed symmetrically around a point with the coordinate 0.542 on
the real axis. In this case, too, no unambiguous solution was obtained.
3.2.4 Example No. 5 – nominal variable with five classes with identical cardinalities
Another example of nominal data is shown in Table 12. In the considered example, the first
variable takes each of the values from the set {A, B, C, D, E} 450 times. For five classes
with equal cardinality, the number of different codings resulting from the phase permutations
is 5! = 120. For all these cases, complex correlation coefficients were estimated between
both variables. 120 different correlation coefficients were obtained, which are presented in
the complex plane in Figure 4. Since all classes contained in the set of values of the first
variable have equal cardinality, the obtained correlation coefficients are distributed
symmetrically with respect to the center of the coordinate system.
Table 11: Contingency table for data with random variable V1 containing four classes
(U , W , X and Y ) with the same cardinality
A B C D Total
U 45 20 13 12 90
W 15 45 20 10 90
X 5 8 50 27 90
Y 8 10 17 55 90
Z 10 30 50 450 540
Total 83 113 150 554 900
Table 12: Contingency table for data with random variable V1 containing five classes
(A, B, C, D and E) with the same cardinality
X Y Z Total
A 200 120 130 450
B 250 100 100 450
C 50 80 320 450
D 170 130 150 450
E 300 100 50 450
Total 970 530 750 2250
arbitrarily chosen phase equal to the phase of one of the k roots of unity. Therefore, k classes of
equal cardinality can be assigned phases in k! ways. This means that a nominal random variable
can be represented in k! ways. For these k! possible representations of the first random variable
and a unique representation of the second random variable, k! different complex correlation
coefficients are obtained. In the complex plane, these correlation coefficients are symmetrical
about some point on the real axis.
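The position of this point on the real axis can be motivated by a short averaging argument. The following is a sketch and not a result stated in the article: the symbols $z_c$, $r$, $S_c$, $T$ and $R_\sigma$ are introduced here for illustration only, the scalar product is assumed to be $\sum_i x_i \bar{y}_i$, the second variable is assumed to be coded with real numbers, and a single group of $k$ classes with equal cardinality is assumed.

```latex
% z_c    : code of class c of the first variable
% S_c    : sum of the random components y_i over the records belonging to class c
% T      : the set of k classes with equal cardinality (common modulus r)
% \sigma : one of the k! assignments of phases to the classes in T
R_\sigma = \frac{\sum_c z_c S_c}{\|x\| \, \|y\|},
\qquad
z_c = r \, e^{2 \pi i \, \sigma(c) / k} \quad (c \in T).

% \|x\| and \|y\| do not depend on \sigma. Each class in T receives each phase
% (k-1)! times, and the k roots of unity sum to zero, hence
\frac{1}{k!} \sum_\sigma R_\sigma
  = \frac{\sum_{c \notin T} z_c S_c}{\|x\| \, \|y\|} \in \mathbb{R}.
```

If every class belongs to $T$, the right-hand side is zero, which agrees with the centers reported for Examples 2 and 5; with an additional class of different cardinality it is a non-zero real number, as in Examples 3 and 4. The argument concerns only the mean of the $k!$ coefficients and does not by itself establish the symmetry of the whole set.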
In real space, the correlation coefficient is identical to the cosine of the angle between the
random components of two random variables. Calculating the cosine of an angle in a complex
space and treating it as a correlation coefficient, due to ambiguity, does not lead to a satisfactory
result.
4.1 Searching for invariants with respect to phase permutations in complex coding of nominal data
Since the measure of linear correlation cannot be the cosine of the angle between the vectors
representing the random components of complex random variables, an attempt was also made to
find alternative correlation measures that would be invariant with respect to phase permutations
between classes with equal cardinalities. For this purpose, a linear correlation between the V2
variable and its model $\hat{V}_2 = f(V_1)$ was considered. First, tests were carried out for the linear
model f (V1 ), and then for the non-linear model.
As a result of the research, it was found that the correlation between the variable and its
linear model is not invariant with respect to phase permutations. Therefore, it cannot be consid-
ered as a practical measure of the correlation between a nominal random variable and a random
variable measured on a nominal scale or on a stronger measurement scale.
Table 13: Comparison between the correlation coefficients $R(V_1, V_2)$ and $R(V_2, \hat{V}_2)$
for the coded data from Table 10, obtained for 3! = 6 different phase permutations
No.   $R(V_1, V_2)$   $|R(V_1, V_2)|$   $\hat{V}_2 = b_0 + b_1 V_1$   $R(V_2, \hat{V}_2)$
successive roots of degree m of unity. For different phase permutations, different coefficients
in the identified polynomial will also be obtained. The question remains whether the
correlation coefficients between the values calculated using the $\hat{V}_2$ model and the numerical values
representing the V2 variable change for different phase permutations.
The problem posed in this way was tested for many different data sets in which there were
classes with equal cardinality. Variables with two, three, four and five
classes with the same cardinalities were tested. For each data set, for all phase permutations,
non-linear models $\hat{V}_2 = f(V_1)$ were identified using the least squares method. For two, three,
four and five identical cardinalities, 2! = 2, 3! = 6, 4! = 24 and 5! = 120 polynomials were
identified, respectively. In all tested cases, regardless of the phase permutations for the coded
nominal values of the random variable, the linear correlation coefficient $R(V_2, \hat{V}_2)$ between
the non-linear polynomial model $\hat{V}_2 = f(V_1)$ and the variable V2 did not change with successive
permutations. This means that the measure obtained here is invariant with respect to phase
Table 14: Comparison of correlation coefficients between variable V2 and its non-linear
model $\hat{V}_2 = f(V_1)$ for the coded data from Table 10, obtained for 3! = 6 different
phase permutations
No.   $\hat{V}_2 = b_0 + b_1 V_1 + b_2 V_1^2 + b_3 V_1^3$   $R(V_2, \hat{V}_2)$
1   (5.074 + 0.036i) + (0.067 − 0.038i) V1 + (0.022 + 0.013i) V1^2 + (0.005 − 0.001i) V1^3   0.31
2   (5.389 + 0.073i) + (0.000 − 0.077i) V1 + (0.000 + 0.026i) V1^2 − (0.007 + 0.003i) V1^3   0.31
3   (5.389 − 0.073i) + (0.000 + 0.077i) V1 + (0.000 − 0.026i) V1^2 − (0.007 − 0.003i) V1^3   0.31
4   (5.074 − 0.036i) + (0.067 + 0.038i) V1 + (0.022 − 0.013i) V1^2 + (0.005 + 0.001i) V1^3   0.31
5   (5.705 + 0.036i) − (0.067 + 0.038i) V1 − (0.022 − 0.013i) V1^2 − (0.019 + 0.001i) V1^3   0.31
6   (5.705 − 0.036i) − (0.067 − 0.038i) V1 − (0.022 + 0.013i) V1^2 − (0.019 − 0.001i) V1^3   0.31
permutations in the complex coding of a random variable. Unfortunately, the measure found
here captures the non-linear correlation between the variables V1 and V2 . For this reason, in those
cases where a linear measure of correlation is expected, the non-linear correlation coefficient
above cannot be used.
Table 14 shows examples of results obtained for the data in Table 9. In the dataset, variable
V1 contained four different classes, three of which had equal cardinality. Hence, the variable
V2 could be modeled using a third degree polynomial. For three classes with equal cardinality,
3! = 6 different polynomials were obtained.
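The invariance shown in Table 14 can be checked with a small script. This is a sketch under stated assumptions, not the author's procedure: the third degree polynomial is fitted through the normal equations solved by Gaussian elimination, the helper names (`solve`, `fit_polynomial`, `correlation_modulus`) are hypothetical, and the data are rebuilt from Table 9 with V2 coded as X → 4 and Y → 6.

```python
import cmath
import math
from itertools import permutations

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small complex system."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0j] * n
    for r in reversed(range(n)):
        tail = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - tail) / a[r][r]
    return x

def fit_polynomial(xs, ys, degree):
    """Complex least squares via the normal equations (A^H A) b = A^H y."""
    rows = [[x ** p for p in range(degree + 1)] for x in xs]
    m = degree + 1
    gram = [[sum(r[i].conjugate() * r[j] for r in rows) for j in range(m)]
            for i in range(m)]
    rhs = [sum(r[i].conjugate() * y for r, y in zip(rows, ys)) for i in range(m)]
    return solve(gram, rhs)

def correlation_modulus(u, v):
    """Modulus of the cosine-type coefficient between u and v."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    dot = sum(a * complex(b).conjugate() for a, b in zip(du, dv))
    return abs(dot) / math.sqrt(sum(abs(a) ** 2 for a in du) *
                                sum(abs(b) ** 2 for b in dv))

counts = {"A": (1, 4), "B": (2, 3), "C": (3, 2), "D": (1, 2)}   # Table 9: (X, Y)
phases = [2 * math.pi * j / 3 for j in range(3)]

results = []
for perm in permutations(phases):
    code = {label: cmath.rect(3.0, phi) for label, phi in zip("ABC", perm)}
    code["D"] = 2.0 + 0j
    xs = [code[k] for k, (nx, ny) in counts.items() for _ in range(nx + ny)]
    ys = [y for nx, ny in counts.values() for y in [4.0] * nx + [6.0] * ny]
    b = fit_polynomial(xs, ys, 3)
    model = [sum(bp * x ** p for p, bp in enumerate(b)) for x in xs]
    results.append(correlation_modulus(ys, model))

print([round(r, 2) for r in results])   # [0.31, 0.31, 0.31, 0.31, 0.31, 0.31]
```

With four distinct codes and four coefficients, the fitted polynomial can match the mean of V2 within each class of V1, which is a plausible reason why the result does not depend on how the phases are assigned.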
5 Conclusions
The aim of the work was to examine the possibility of measuring the strength of the linear cor-
relation relationship between two random variables measured on a nominal scale, coded with
real numbers or complex numbers. The research was conducted with the assumption that the
second random variable will always be coded with real numbers. This assumption caused the
analyzed problem to become equivalent to the problem of testing the strength of a linear cor-
relation relationship between a random variable measured on a nominal scale and a random
variable measured at least on an ordinal scale. The calculations made use of the fact that the
correlation coefficient has an interpretation of the cosine of the angle between the vectors con-
taining the random components of the analyzed variables. Since for vectors of numbers (both
real and complex) Euclidean norms and the scalar product can be calculated, it is also possible
to calculate the cosine of the angle between vectors, and consequently also to estimate the
correlation coefficient between them. The research made it possible to draw several conclusions
that may be useful in the analysis of linear correlation for nominal data, and which
are presented in more detail in the following subsections.
5.1 Study of the correlation relationship between nominal data coded with real numbers
A set of data measured on a nominal scale may be coded using real numbers if there are no
classes with identical cardinalities among the classes of identical elements contained in this set.
In this case, correlations were studied for two variables, each of which could be coded with
real numbers. Coding with real numbers introduced a linear order to the nominal data sets. As
a consequence, unambiguous real measures of linear correlation were obtained in all analyzed
cases.
5.2 Study of the correlation relationship between nominal data coded with complex numbers
According to the adopted assumptions, correlations between two random variables were inves-
tigated, one of which was coded with real numbers, and the other had to be coded with complex
numbers. A set of data measured on a nominal scale cannot be coded with real numbers if there
are at least two classes with identical cardinalities among the classes of identical elements con-
tained in this set. For two random variables, one of which contained complex numbers and the
other contained real numbers, complex correlation coefficients were obtained, which changed
with the permutation of phases in complex numbers coding classes of elements with equal car-
dinality. The solution to the problem of finding linear correlation turned out to be ambiguous.
Acknowledgments
The author would like to express his gratitude to Dr. Leszek Rudak for valuable comments that
helped to improve the content of the article.
References
[1] H. M. Blalock, Social Statistics. McGraw-Hill, 1960.
[2] P. Francuz and R. Mackiewicz, Liczby nie wiedzą, skąd pochodzą. Przewodnik po
metodologii i statystyce nie tylko dla psychologów. Lublin: Wydawnictwo KUL, 2007.
[3] Z. Gniazdowski and M. Grabowski, “Numerical Coding of Nominal Data,” Zeszyty
Naukowe WWSI, vol. 9, no. 12, pp. 53–61, 2015. [Online]. Available:
http://doi.org/10.26348/znwwsi.12.53
[4] Z. Gniazdowski, “Geometric interpretation of a correlation,” Zeszyty Naukowe WWSI,
vol. 7, no. 9, pp. 27–35, 2013. [Online]. Available: http://doi.org/10.26348/znwwsi.9.27
[5] Z. Gniazdowski, “O relacjach i algorytmach,” in Zbiór wykładów wszechnicy popołudniowej:
Algorytmika i programowanie. Zastosowania informatyki. Warszawska Wyższa Szkoła
Informatyki, 2011, pp. 265–286. [Online]. Available:
http://akademickaseriawwsi.wwsi.edu.pl/ksiazki/5/O_relacjachi_algorytmach.pdf
[6] S. S. Stevens, “On the theory of scales of measurement,” Science, vol. 103, no. 2684,
pp. 677–680, 1946. [Online]. Available:
https://psychology.okstate.edu/faculty/jgrice/psyc3120/Stevens_FourScales_1946.pdf
[7] StatSoft, “Elektroniczny podręcznik statystyki,” 2011. [Online]. Available:
https://www.statsoft.pl/textbook/stbasic.html