Benedetto-Puzzle Basil Ep 388




Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954
Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH,
UK

Journal of Quantitative
Linguistics
Publication details, including instructions for authors
and subscription information:
http://www.tandfonline.com/loi/njql20

The Puzzle of Basil’s Epistula 38: A Mathematical Approach to
a Philological Problem
Dario Benedetto (a), Mirko Degli Esposti (b) & Giulio Maspero (c)
(a) Università Sapienza, Roma
(b) Università di Bologna
(c) University of the Holy Cross, Rome, Italy
Published online: 12 Nov 2013.

To cite this article: Dario Benedetto , Mirko Degli Esposti & Giulio Maspero (2013)
The Puzzle of Basil’s Epistula 38: A Mathematical Approach to a Philological Problem,
Journal of Quantitative Linguistics, 20:4, 267-287, DOI: 10.1080/09296174.2013.830549

To link to this article: http://dx.doi.org/10.1080/09296174.2013.830549

Journal of Quantitative Linguistics, 2013
Vol. 20, No. 4, 267–287, http://dx.doi.org/10.1080/09296174.2013.830549

The Puzzle of Basil’s Epistula 38: A Mathematical
Approach to a Philological Problem*
Dario Benedetto¹, Mirko Degli Esposti² and Giulio Maspero³
¹Università Sapienza, Roma; ²Università di Bologna; ³University of the Holy Cross, Rome,
Italy

ABSTRACT

We present and explore a real case in authorship attribution (A.A.) by
combining the traditional philological approach with novel mathematical
techniques. The problem involves the extensive productions of Basil of
Caesarea and his brother Gregory of Nyssa, two influential 4th-century
Christian theologians, and the attribution of specific disputed works in
their corpora. Our method rests on two similarity (pseudo-)distances, based
respectively on the statistics of n-grams and on zip-like algorithms, and on
a new ranking/voting system that allows us to infer the attribution from the
values of the distances between the unknown texts and the texts of the
training corpus. The main results are, on the one hand, the attribution of
the letter to one of the two authors with 97% precision and, on the other,
the strong agreement of the numerical explorations with both the philological
analysis and the results known so far for all the works in the two corpora.

1. INTRODUCTION

Research in authorship attribution seems to be a perfect meeting point for
different sciences: information theory, statistical computation, mathematical
models and, more generally, quantitative analysis share an object of study
with very different disciplines, such as literary studies, philology, history
and the humanities in general (Stamatatos, 2009). As when drilling into the
earth, the deeper you go the closer you get to someone else who started from
a different and even opposite position on the surface. That seems to be the
case for this paper, which analyses a real and concrete philological problem
in authorship attribution, i.e. a letter written in Greek during the 4th
century AD by either one of two brothers, both among the most important
theologians of that time. The interaction between philological and
mathematical sciences in the analysis of the problem occurs not only at the
level of the definition of the case and of the discussion of the results, but
at the very level of the development of the methods and of the experimental
setting.

*Address correspondence to: Mirko Degli Esposti, Dipartimento di Matematica,
Università di Bologna, Piazza di Porta S. Donato 5, 40126 Bologna, Italy.
Email: mirko.degliesposti@unibo.it

© 2013 Taylor & Francis

Here, in fact, we combine recent mathematical tools, mostly developed
in the area of information theory, with a solid and traditional philological
approach that will guide and inspire the experimental setting.
The specific authorship attribution problem we consider here is introduced
and explained in Section 2. Section 3 introduces the whole digital corpus
used in the present study, and Section 4 gives a detailed description of the
cleaning and coding procedure that has been applied to each text of the
corpus prior to any numerical investigation.
Section 5 introduces the two quantitative methods used to perform
authorship attribution: as discussed there, both are developments of two
methods used in a project concerning the attribution of Antonio Gramsci’s
papers (Basile et al., 2008). Essentially, each method defines a kind of
similarity distance between texts: given any pair of texts, each method
produces a positive number that we interpret as the distance between the
texts. A small distance means that the two texts are quite similar (either
in the argument/topic or in the author’s style, see below), whereas a large
distance means a high degree of dissimilarity.
While the theoretical backbones of our methods rely on asymptotic
results (infinite sequences), in practice we face finite texts with
non-homogeneous lengths. We attack this problem in Section 6, where we
construct a reference corpus with a homogeneous distribution of text
lengths. There we also introduce a voting system that allows us to define
an efficient threshold for attribution.
In Section 7 we then discuss the efficiency and the stability of our
methods in discriminating between the two authors. Finally, in Section 8 we
discuss the authorship of the disputed letter, leaving final remarks and
future developments to Section 9.

2. THE AUTHORSHIP ATTRIBUTION PROBLEM

Basil of Caesarea (Rousseau, 1998) and his brother Gregory of Nyssa
(Meredith, 1999) were two important bishops and theologians of the 4th
century. Their thought and teachings were fundamental for the definition of
Christian doctrine on the Trinity, i.e. on God as one divine nature and three
divine persons. They had to face Eunomius of Cyzicus, who tried to explain
the mystery of God as revealed in the Gospel by having recourse to certain
philosophical doctrines, viz. Neoplatonism. The controversy lasted almost 30
years and many works were devoted to it. Epistula 38 (Ep. 38) is one of them.
It is a letter that was transmitted in Basil’s epistolary corpus (Fedwick,
1993) but that has also been attributed to his brother Gregory of Nyssa, as a
dogmatic treatise addressed to their brother Peter. The history of the
discussion on the authorship of Ep. 38 is marked by three main moments:

(1) In 1972 Hübner challenged the traditional attribution, affirming that
Gregory was the real author (Hübner, 1972).
(2) Although this conclusion was occasionally debated (Fedwick, 1978), in
1996 Drecoll (Tübingen University) moved Ep. 38 back to Basil
(Drecoll, 1996).
(3) More recently, Zachhuber (Oxford University) argued again that
the author is Gregory (Zachhuber, 2003).

The work has been extensively analysed and studied from the philological,
philosophical and theological perspectives. The aim of the present paper
is to investigate its authorship by means of mathematical methods
and numerical computations.
Ep. 38 is in some ways a perfect authorship attribution problem: it is
known in advance that the work belongs to either Basil or Gregory, i.e. it is
a genuine two-class classification problem. Moreover, the statistical approach
is expected to give good results, as the surviving output of both authors is
extensive.
But there is also another characteristic of the problem that raises hope of
getting good results: both Basil and Gregory of Nyssa produced extensive
works against Eunomius, discussing the same subjects and having recourse to
practically the same quotations and vocabulary. It seems reasonable
to suppose that these works can be effective touchstones for analysing Ep. 38.
In fact, if we apply statistical methods and develop a kind of distance
between different texts, the differences between Ep. 38 and Basil’s or
Gregory’s works against Eunomius should be due principally to
their personal styles, because the contents are very similar. Moreover, Ep.
38 has the same subject as the works against Eunomius: this is another
point in favour of a quantitative approach to author style recognition
such as the one presented here.

3. THE CORPUS

The Corpus used in the present study is composed of all the known works
by Basil and Gregory of Nyssa. The digitized texts can be found in the
Thesaurus Linguae Graecae (TLG) (http://www.tlg.uci.edu/). All the works
have been included, the spurious and dubious ones as well.
Since the main objective of this study is to develop a reliable
authorship attribution method for exploring the spurious and dubious works,
the whole corpus has been divided into three sets with specific tasks in mind:

(1) A reference set: One dogmatic work by Basil and one set of the three
books by Gregory of Nyssa against Eunomius. From now on we will
denote them by B0 and G0, respectively. The (character) lengths of
these two texts, after the coding explained in the next section, are

$$|B_0| = 172342$$

and

$$|G_0| = 1017314,$$

i.e. G0 is almost six times B0.

(2) A large controlled corpus composed of the works and letters both by
Basil¹ and by Gregory². Basil’s letters with undisputed authorship
were used (Fedwick, 1993) and divided into three groups according
to their length: the first includes the letters longer than 2500
characters (L); the letters with lengths between 1250 and 2500
characters form the second group (M); and the letters shorter than 1250
characters constitute the third (S). Correspondingly, 57 works and 25
letters by Gregory³ with well-established attribution have also been
included in this set. This set has been used, as we will soon explain, to
verify the efficiency and the stability of our method.
(3) A set of almost 100 dubious texts such as Ep. 38.

¹ Y. Courtonne’s edition has been used: Saint Basile. Lettres, Les Belles
Lettres, Paris 1957–1966.
² G. Pasquali’s edition has been used: Gregorii Nysseni Opera (= GNO),
vol. VIII/2, Brill, Leiden (1959).

A first important step is to define a unique coding and a common cleaning
procedure for all texts used in the experiments. Coding and cleaning of
texts is a crucial issue in automatic authorship attribution and, even if it
is often (voluntarily or not) ignored, we believe it deserves a detailed
description, which we give in the next section.

4. CLEANING AND CODING

The digital texts of ancient authors are typically the final versions of an
unknown number of intermediate copies of the originals (Canfora, 2002).
The history of a single text might determine different frequencies of some
words, characters, or of a whole class of characters. Moreover, the digitali-
zation of the text can arbitrarily change the frequency of “new line” and
“carriage return”, and different editorial policies can change the frequency
of numbers, capital letters and punctuation signs.
For instance, in the Corpus we are analysing, Πατρός appears only in
the digital version of Basil’s Corpus, whereas πατρός appears almost exclu-
sively in the digital version of the texts handed on as attributed to Gregory.
Even if the frequency of capital letters is quite small, this feature can cause,
in the quantitative methods, a bias linked to the history of the transmission
of the text.
Moreover, the digital version of the whole corpus we are analysing here
has another quite problematic feature: it uses (via the UTF-8 coding) the
polytonic orthography of ancient Greek, which has a distinct symbol for
each letter with accents or breathings (for example ɛ appears in nine
versions). A typical phrase appears as

³ Of Gregory’s letters, we can use only 25 from the Pasquali collection,
because 26 and 30 are not by Gregory, while 27 and 28 are spurious.

The whole Corpus consists of more than 10 million characters, distributed
among 233 different symbols. The histogram of the relative frequencies is
shown in Figure 1.

Fig. 1. Frequencies of characters in the corpus. Frequencies in UTF-8 (top),
frequencies in iso8859-7 lowercase (bottom).

Only 29 characters have frequencies greater than 1%, whereas the total
frequency of the remaining rare characters is not negligible: it is greater
than 16%, indicating that we will likely encounter rare characters while
analysing an arbitrary text. The presence of several variants of the same
character has severe consequences for both quantitative attribution methods
we use (see the next section). Both methods in fact exploit redundancies
between moderately short sequences of characters (usually 6–15), and the
existence of different variants of the same character weakens the statistics
of redundancies, preventing the discovery of similarities between different
words that might constitute part of the style of an author. Moreover,
depending on the particular historical period, a version of a given text
might be produced with more or less consistency with respect to the grammar
rules that establish the use of specific variants of a character. In this
case too, the differences detected by the methods cannot be traced back to
the author’s features.
All these kinds of differences, even if minimal, can strongly obfuscate
the stylistic features of the authors and deeply affect the attributions. In
order to avoid these effects, we have eliminated all the characteristics of
the texts that might have been introduced after the author’s creation of the
original text. In particular:

• We have eliminated any critical apparatus (such as philological comments,
  variants, citations), any enumeration and any line terminator, and we have
  forced any run of spaces to be just one space character.
• We have reduced the alphabet by transliterating all texts into the
  iso8859-7 coding of the modern Greek alphabet, which has a much
  simpler orthography; the reduction has been obtained through the
  iconv conversion software, distributed with the GNU C library, used
  with the option TRANSLIT, which selects the closest character among
  the available ones.
• Finally, we have reduced all characters to lower case.

We did not eliminate punctuation symbols because their presence or
absence does not seem to affect attributions. For example, this procedure
converts the previous phrase in polytonic Greek into
ɛι μɛν ɛβoυλoντo παντɛς, ɛφ oυς τo oνoμα τoυ θɛoυ και σωτηρoς ημων
ιɛσoυ.
After the conversion, the whole corpus contains only 30 distinct characters,
and a few of them have low frequencies: the punctuation symbols (except the
comma), and some rare letters (e.g. β, ζ, κ, φ, χ, ψ).
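
The cleaning steps listed above can be illustrated with a minimal Python sketch. It is not the iconv-based pipeline described in the text: as a stand-in it uses Unicode decomposition to drop accents and breathings, so its output may differ in detail from the iso8859-7 transliteration. The function name clean_greek is hypothetical.

```python
import re
import unicodedata


def clean_greek(text: str) -> str:
    """Reduce polytonic Greek to bare lowercase letters with single spaces.

    Assumption: Unicode NFD decomposition stands in for the paper's
    `iconv ... TRANSLIT` conversion to iso8859-7.
    """
    # Split every character into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (accents, breathings, iota subscript).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Lowercase, then collapse line terminators and runs of spaces.
    return re.sub(r"\s+", " ", stripped.lower()).strip()
```

Punctuation is left untouched, in line with the choice stated above.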
It is worth noticing that while the alphabet reduction could in principle
increase word sense ambiguity, for our methods the impact on the attribution
is minimal. In fact our methods are not based on words but on character
sequences, and are able to capture not just given words but the
surrounding context, suppressing the possible negative effect of sense
ambiguity. It is also worth mentioning that words in different contexts can
still remain distinguishable even after extreme alphabet reduction. For
example, in Basile et al. (in print) we used a severe coding onto the
10-character alphabet given by the T9 coding without degrading the
efficiency of an automatic plagiarism detection algorithm.

5. THE METHODS

A first crucial assumption behind all of our mathematical techniques is that
a text should be thought of as a sequence of symbols chosen from an alphabet,
while the author is interpreted as a source of literary texts. Assuming that
the text is “just” a symbol sequence means taking into consideration
neither the semantic content of the text nor its linguistic, syntactic or
grammatical aspects: letters of the alphabet, punctuation marks and blank
spaces between words are just abstract symbols, without any hierarchy.
Moreover, at least for the two methods used in this paper, the word as a
basic constituent of the text has no more meaning than any other aggregate of
symbols, and its role as a higher-level unit with respect to the single
character is taken by the (character) n-grams.
Let us give some examples to clarify:

• By monogram (1-gram) we mean just one single symbol of the alphabet.
• By bigram (2-gram) we mean a sequence of two symbols, for
  example “τo” but also “ς ” (i.e. ς followed by a blank space).
• By trigram (3-gram) we mean a sequence of three symbols, for
  example “αντ” but also “π o”.
• By n-gram we mean any sequence of n symbols; for example
  “o παντɛς” is an 8-gram.
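
As an illustration of these definitions, a short Python sketch that counts the overlapping character n-grams of a string, blank spaces included. The helper name char_ngrams is hypothetical and not taken from the paper.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character n-gram of `text`.

    Spaces and punctuation are treated as ordinary symbols.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

A text of length L contains L - n + 1 overlapping n-grams, so an 8-character string yields exactly one 8-gram.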

The two methods we have implemented here are developments of two
methods that have already been used in a project concerning the attribution
of Antonio Gramsci’s papers (Basile et al., 2008). Essentially, each method
defines a kind of similarity distance between texts: given any pair of texts,
each method produces a positive number that we interpret as the distance
between the texts. A small distance means that the two texts are quite
similar (either in the argument/topic or in the author’s style, see below),
whereas a large distance means a high degree of dissimilarity. Let us
briefly describe both algorithms.

5.1 n-gram Distance


For more details and references we refer to Basile et al. (2008). The first
method, based on n-grams, is probably one of the simplest possible measures
on a text, and it has a relatively short history in the published literature:
after a first experiment based on bigram frequencies presented by Bennett
(1976), Keselj et al. published in 2003 a paper in which n-gram frequencies
were used to define a similarity distance between texts (see also Clement &
Sharp, 2003). Here we present the version of the similarity distance
introduced and discussed in Basile et al. (2008): we call ω an arbitrary
n-gram, and we denote by fX(ω) and fY(ω) the relative frequencies with which
ω occurs in texts X and Y, respectively. Dn(X) is the n-gram dictionary of
the text X, that is, the set of all n-grams which have non-zero frequency in
X (similarly for Y), and we define the n-gram distance between text X and
text Y as

$$d_n(X, Y) := \frac{1}{|D_n(X)| + |D_n(Y)|} \sum_{\omega \in D_n(X) \cup D_n(Y)} \left( \frac{f_X(\omega) - f_Y(\omega)}{f_X(\omega) + f_Y(\omega)} \right)^2 \qquad (5.1)$$

Here, |Dn(X)| and |Dn(Y)| are the numbers of different n-grams in the two
dictionaries and the sum is taken over all different n-grams occurring in the
two texts.
Note that in the previous formula, in contrast with what happens for the
Euclidean distance, each term of the sum is weighted with the inverse of the
square of the sum of the frequencies of that particular n-gram. In this way
rare words, i.e. n-grams with lower frequencies, give a larger contribution
to the sum.
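
The n-gram distance defined above can be written down directly. The following Python sketch follows formula (5.1) term by term; the function names are hypothetical, and raising an error for texts shorter than n is an assumption of this sketch, not something specified in the text.

```python
from collections import Counter


def relative_frequencies(text: str, n: int) -> dict:
    """Relative frequency of each character n-gram occurring in `text`."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text is shorter than n")
    return {gram: c / total for gram, c in counts.items()}


def ngram_distance(x: str, y: str, n: int) -> float:
    """n-gram distance d_n(X, Y) of formula (5.1)."""
    fx = relative_frequencies(x, n)
    fy = relative_frequencies(y, n)
    total = 0.0
    # Sum over the union of the two n-gram dictionaries.
    for gram in fx.keys() | fy.keys():
        a = fx.get(gram, 0.0)
        b = fy.get(gram, 0.0)
        total += ((a - b) / (a + b)) ** 2
    # Normalise by |D_n(X)| + |D_n(Y)|.
    return total / (len(fx) + len(fy))
```

With this normalisation, identical texts are at distance 0 and texts that share no n-gram are at distance 1, since every term of the sum then equals 1.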

5.2 LZEW
The second method we use to estimate the similarity between texts is based
on data compression and its role in the estimation of the entropy of a
source. Data compression is nowadays a very well established field of
information theory, thanks to the founding papers published by Ziv, Lempel
and their co-workers in the 1970s (cf., among others, Lempel & Ziv, 1976,
1977, 1978) and the review paper (Wyner et al., 1998). They proposed a
variety of compression algorithms (the family of LZ algorithms), based on
the idea of a clever parsing (subdivision) of the symbolic sequence, i.e.
splitting it up into pieces so that this separation can then be used to
produce a shorter, equivalent version of the string itself. This was a major
advance in the field, since it was the first example of a compressor that
does not operate on a fixed number of characters at a time, but is allowed
to vary the length of the encoded substrings according to the “size” of the
redundancies that the specific sequence presents; indeed, such algorithms
are still at the base of the most common zipping software in everyday use on
our computers.
In 1993 Ziv and Merhav proposed a method to estimate the relative
entropy (or Kullback-Leibler divergence) between a given pair of information
sources (Ziv & Merhav, 1993). The relative entropy is basically a
measure of the similarity between the information emitted by the sources,
and they proved that a modified version of an LZ algorithm (where the
sub-sequences for a sequence are searched in another sequence) can be
used to approximate the relative entropy between the two sources that
generated such sequences. This important result was used in various
subsequent studies, among which Benedetto et al. (2002) and Basile et al.
(2008), to deal with problems of text classification and clustering.
Following the idea of the LZ77 algorithm, we estimate the similarity of two
sequences x and y in the following way:

(1) We sequentially parse x, starting from the first character, in such a
way that each string of the parsing is the longest possible sequence
that occurs at least once in y.
(2) The length of each substring and its position (index) in y (counted
from the end of y) are stored.
(3) The coding data, given by the lengths and the positions of the
subsequences, are then sequentially compressed.

For example, consider the two sequences y = aaabbababbbaaba and
x = ababaabbbaa.
The text x is parsed into the substrings abab aabb baa, of lengths 4, 4, 3
and indices 10, 14, 5 respectively; then these numbers are suitably encoded.
It should be clear that the longer the common sequences, the fewer
the bits we need to reconstruct x given y.
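
The parsing in steps (1) and (2) can be sketched in a few lines of Python. This is only an illustration of the greedy parsing, not the authors' implementation: the function name cross_parse is hypothetical, the choice of the occurrence closest to the end of y (via rfind) is an assumption, and the compression step (3) is omitted.

```python
def cross_parse(x: str, y: str) -> list:
    """Greedily parse x into the longest substrings that occur in y.

    Returns (substring, length, index) triples, where index is the
    distance of the match start from the end of y. A character of x
    that never occurs in y is returned with index None.
    """
    phrases = []
    i = 0
    while i < len(x):
        length, start = 0, -1
        # Extend the candidate while it still occurs somewhere in y.
        while i + length < len(x):
            pos = y.rfind(x[i:i + length + 1])
            if pos < 0:
                break
            start, length = pos, length + 1
        if length == 0:
            phrases.append((x[i], 1, None))
            i += 1
        else:
            phrases.append((x[i:i + length], length, len(y) - start))
            i += length
    return phrases
```

Applied to the two sequences of the example, the sketch returns the substrings abab, aabb and baa with indices 10, 14 and 5.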
In gzip a match shorter than three characters is always ignored, and
a match can never be longer than 258 characters. Characters
not belonging to an accepted match are stored directly with a
corresponding 8-bit coding, whereas lengths larger than two are coded in
groups with integers between 257 and 271, using some extra bits to
distinguish between lengths with the same code. In the same way, positions of
the matches between 1 and 2^15 are coded, similarly to gzip, using integer
numbers between 0 and 29 (the code of the position) and some extra bits.
Finally, length codes (together with characters with no match) and index
codes are independently compressed using a Huffman coding. This is basically
the entropic algorithm (called BCL) already used in Basile et al.
(2008) for the attribution of Gramsci’s articles. As we can see from the
description, there is a limitation due to the coding of the indexes.
Some of the texts, from both Basil’s and Gregory’s corpora, are much
longer than 2^15 characters. For this reason, again inspired by gzip and some
of its variants (such as LZMA), we have modified the parsing procedure and
the coding of the indexes. In particular, consecutive strings in the parsing
of x share one character: each new match begins with the last character of
the previous match.
Let σ be this first character of a new match. As before, we have to store
the position of the match in y. Instead of storing the number of characters
between the match and the end of y, we store as index of the position the
number of occurrences of σ in y lying between the start of the match and the
end of y. For example, consider again the sequences x and y above. Our
algorithm first returns the character a; then the position of the match abab
is indicated by the value 5, because abab is the substring starting with the
fifth a from the end; then the match baab is found (starting from the final
b of the previous match), and its index is 2. Finally, the last match,
bbbaa, has index 4.
In this way, the number of matches is in general larger than that of BCL,
but the numbers needed to express the positions of the matches are usually
much smaller (in particular for texts over large alphabets). Let us note that
we can reduce some of these numbers: in the example, the last match bbbaa
cannot start with the second b from the end of y because this b is also the
last character of a copy of the previous match. So we can specify the
position of the new match by counting only the characters with which the
match can possibly start, given that the previous match is maximal. In
this case we can specify the position with the index 3 and not 4.
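
The modified parsing and the character-count index can be sketched as follows. This Python fragment only illustrates the basic indexing described above (it returns 4, not the refined value 3, for the last match) and leaves out the gzip-style coding and the arithmetic encoder. The function name overlapped_parse and the rule of taking the occurrence closest to the end of y are assumptions of the sketch.

```python
def overlapped_parse(x: str, y: str) -> list:
    """Parse x against y so that each match starts on the last
    character of the previous match.

    Returns ("literal", char) and ("match", substring, index) items;
    index counts the occurrences of the first character of the match
    in y, from the match start to the end of y.
    """
    if not x:
        return []
    out = [("literal", x[0])]
    i = 0  # position in x of the character shared with the next match
    while i < len(x) - 1:
        length, start = 0, -1
        # Longest substring of x starting at i that occurs in y.
        while i + length < len(x):
            pos = y.rfind(x[i:i + length + 1])
            if pos < 0:
                break
            start, length = pos, length + 1
        if length < 2:
            # No match extends beyond x[i]: emit the next character as is.
            out.append(("literal", x[i + 1]))
            i += 1
        else:
            index = y.count(x[i], start)
            out.append(("match", x[i:i + length], index))
            i += length - 1
    return out
```

On the example sequences this yields the literal a followed by the matches abab, baab and bbbaa with indices 5, 2 and 4.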
We encode all these numbers exactly as gzip does and, finally, we
compress the list of characters and lengths, and the list of the positions,
using an arithmetic encoding conditioned on the first character of the match.
This algorithm, which we call LZEW, is used here for the first time,
but we have previously tested the method on the Gramsci corpus used in
Basile et al. (2008), for which LZEW gives the same results as BCL (but note
that the lengths of the texts of the Gramsci corpus are all smaller than
2^15). This leads to the implementation of a true compressor program which
effectively shows better compression rates than gzip, as shown in Table 1.

Table 1. Comparison of the compression rate between gzip and LZEW. For each
literary work, we show the author (left), the dimension (Dim) in characters
and the compression rate (i.e. bits/character) using gzip and LZEW,
respectively.

Author          Title                            Dim      gzip   LZEW
G. Galilei      La bilancetta                    8936     3.04   2.85
N. Machiavelli  Favola di Belfagor arcidiavolo   19625    3.30   3.11
D. Alighieri    Quaestio de aqua et terra        29766    2.85   2.69
G. Galilei      Sidereus Nuncius                 73346    2.88   2.70
D. Alighieri    De Vulgari Eloquentia            82765    3.02   2.84
G. Galilei      Trattati di fortificazione       121893   2.58   2.38
E. Salgari      I pirati della Malesia           370949   2.85   2.45
G. Leopardi     Zibaldone                        5772133  2.84   2.35

The details of this method are designed to optimize the compression
ratio in order to obtain a good estimate of the relative entropy between
texts. Because of this specific optimization, the complexity of the
implementation naturally increases. The interested reader can compare our
method with the simpler one introduced by Ziv and Merhav (1993),
even if it turns out to be less accurate for our aims: the authors estimate
the similarity between the texts x and y with

$$\frac{N \log |y|}{|x|}$$

where |x| and |y| are the lengths of the two texts, respectively, and N is
the cardinality of the parsing of x in y, as described at the beginning of
this section.
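
A minimal Python sketch of this simpler estimate is given below. It counts the phrases of the greedy parsing of x in y and evaluates N log|y| / |x|. The function names are hypothetical and the use of base-2 logarithms is an assumption, since the text does not fix the base.

```python
import math


def count_phrases(x: str, y: str) -> int:
    """Number N of phrases when x is parsed into longest substrings of y.

    A character of x that does not occur in y counts as one phrase.
    """
    n_phrases, i = 0, 0
    while i < len(x):
        length = 1
        # Grow the phrase while the longer candidate still occurs in y.
        while i + length < len(x) and x[i:i + length + 1] in y:
            length += 1
        n_phrases += 1
        i += length
    return n_phrases


def ziv_merhav_estimate(x: str, y: str) -> float:
    """Evaluate N * log2(|y|) / |x| (log base 2 is an assumption)."""
    return count_phrases(x, y) * math.log2(len(y)) / len(x)
```

For the sequences x = ababaabbbaa and y = aaabbababbbaaba used earlier, N = 3, |y| = 15 and |x| = 11.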

6. THE PROBLEM OF THE SIZE OF THE REFERENCE TEXTS

As is common in many authorship attribution (A.A.) problems, we
have constructed a reference corpus made of texts with known attribution,
assuming (or hoping) that this corpus contains all the stylistic features
needed to discriminate efficiently between the two authors.
Here, as we said, following a precise philological idea that will be
discussed later in the paper, we have chosen one dogmatic work of
Basil, B0, and three works of Gregory of Nyssa, G0. We recall that the
(character) lengths of these two texts are |B0| = 172 342 and
|G0| = 1 017 314, i.e. G0 is almost six times B0.
A key point is that our methods are both quite sensitive to the length of
the reference texts (in particular the entropic one): longer texts are usually
richer and contain several different expressions, producing an artificial
effect of attracting more unknown texts, leading to smaller similarity
distances. In order to construct a reference corpus with a homogeneous
distribution of text lengths, we split B0 and G0 into pieces of approximately
the same length. In this way we have transformed the problem of managing
texts of different sizes into the problem of managing a different number of
texts of the same size.
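
One possible way to carry out such a split is sketched below in Python. The target fragment length is a hypothetical parameter (the text does not state the value used), and cutting at arbitrary character positions rather than at word or sentence boundaries is an assumption of the sketch.

```python
def split_into_fragments(text: str, target_length: int) -> list:
    """Split `text` into pieces of approximately `target_length` characters.

    The number of pieces is the nearest integer to len(text) / target_length
    (at least 1), and the pieces differ in length by at most one character.
    """
    n_pieces = max(1, round(len(text) / target_length))
    bounds = [round(j * len(text) / n_pieces) for j in range(n_pieces + 1)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]
```

For instance, a text of 1 017 314 characters split with a target of 172 342 characters gives six fragments, while a text of 172 342 characters stays in one piece.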

We analyse an unknown text X by calculating its similarity distance (using
either one of the two methods) from each of the NG fragments of G0 and the
NB fragments of B0. We then put these values in an ordered list,
from the closest fragment to the farthest one, and extract the rank of the
authors G = Gregory and B = Basil. Just to give an example, if for instance
NG = 6 and NB = 1 there are only seven possible final rankings for the
ordered distances:
BGGGGG
GBGGGG
GGBGGG
GGGBGG
GGGGBG
GGGGGB
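
Building such a rank from the computed distances is straightforward. The Python sketch below assumes the distances from X to the fragments of G0 and B0 are already available as two lists; the function name rank_string is hypothetical, and breaking ties in favour of G is an arbitrary choice of the sketch.

```python
def rank_string(distances_to_g: list, distances_to_b: list) -> str:
    """Order all reference fragments by distance from the unknown text
    and return the sequence of their author labels, closest first."""
    labelled = [(d, "G") for d in distances_to_g]
    labelled += [(d, "B") for d in distances_to_b]
    # Stable sort on the distance only: ties keep G before B.
    labelled.sort(key=lambda pair: pair[0])
    return "".join(label for _, label in labelled)
```

The resulting string is exactly the kind of sequence listed above.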

If now, for example, the output is given by the first sequence, we attribute
the text X to B, while if we observe the last one we surely attribute the
text to G, but it is not clear where to put the threshold. Moreover, if
NG ≠ 1 and NB ≠ 1, we also have to understand how to order the ranks for
attribution. For instance, is the rank BGGB more or less Basilian than GBBG?
This is why we need an efficient voting algorithm that allows us to order
the rankings and define a threshold for attribution. This is really a crucial
step that consistently increases the performance of the attribution with
respect to the trivial first-nearest-author attribution, which only looks at
the item at the top of the rank.
We have already solved this problem in Basile et al. (2008), but in the
special homogenous case NG = NB. Here we want now to summarize the
previous result and extend it to the more general case NG ≠ NB.
Let us denote the rank with $c_1 \ldots c_N$, where $c_i \in \{G, B\}$ is the
author of the $i$th fragment in the rank, and $N = N_G + N_B$. Let us
consider a text $g^*$ of the author G. We can model the attribution procedure
assuming that the similarity distance with a fragment $g$ of author G is a
random positive variable $d(g^*, g)$, and that the distance $d(g^*, b)$ with
a fragment $b$ of the author B is another random positive variable. Through
a monotonic transformation of the distances, we can assume that $d(g^*, g)$
is uniformly distributed in $[0, 1]$. Our main assumption is that with this
transformation the distribution function of the random variable $d(g^*, b)$
turns into a power law: $P(d(g^*, b) < z) = z^{1+\beta}$, with $\beta > 0$.
Moreover, we also assume that different distance values are independent.
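The model can be explored by simulation. The sketch below is only an illustration under the stated assumptions: distances from fragments of G are drawn uniformly, and distances from fragments of B are drawn by inverting the distribution function $z^{1+\beta}$. The function names, the value of $\beta$ and the number of trials are choices made for the example. With one fragment per author, the estimated frequency of G in first position should approach $(1+\beta)/(2+\beta)$.

```python
import random


def sample_rank(n_g, n_b, beta, rng):
    """Draw one ranking for a text of G under the stated assumptions."""
    # Distances from fragments of G: uniform on [0, 1].
    draws = [(rng.random(), "G") for _ in range(n_g)]
    # Distances from fragments of B: P(d < z) = z**(1 + beta), sampled by
    # inverting the distribution function.
    draws += [(rng.random() ** (1.0 / (1.0 + beta)), "B") for _ in range(n_b)]
    draws.sort(key=lambda pair: pair[0])
    return "".join(label for _, label in draws)


def fraction_g_first(n_g, n_b, beta, trials=20000, seed=0):
    """Estimate how often a fragment of G is the nearest one."""
    rng = random.Random(seed)
    hits = sum(sample_rank(n_g, n_b, beta, rng)[0] == "G" for _ in range(trials))
    return hits / trials


print(fraction_g_first(1, 1, beta=1.0))
```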

Let $c_i$ be the value of the $i$th position in the ranking, i.e. $c_i = G$
if the $i$th text is of the author G, and $c_i = B$ otherwise. We define

$$m_B(k) = \sum_{i=1}^{k} \chi\{c_i = B\} \quad \text{and} \quad m_G(k) = \sum_{i=1}^{k} \chi\{c_i = G\}$$

(where $\chi$ is the characteristic function), with $m_G(k) + m_B(k) = k$.



The value $m_G(k)$ is the number of texts of class G which appear in the rank
up to position $k$. With these assumptions, we can calculate explicitly the
probability of observing the rank $c_1 \ldots c_N$ if $X \in G$ (i.e. if X is
a text of G). The calculation involves the evaluation of some multiple
integrals, as described in detail in Basile et al. (2008) for the case
$N_G = N_B$, and yields the formula:

$$P(c_1 c_2 \ldots c_N \mid X \in G) = (1 + \beta)^{N_B} \frac{1}{\prod_{k=1}^{N} \left(k + \beta\, m_B(k)\right)}$$
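The closed form can be checked numerically. The sketch below reads the formula as $(1+\beta)^{N_B} / \prod_{k=1}^{N}(k + \beta\, m_B(k))$, the form consistent with the first-order expansion that follows. It further assumes that the formula gives the probability of one particular ordering of the individual fragments, so that each sequence of labels corresponds to $N_G!\,N_B!$ such orderings; this weighting is an assumption of the example, not a statement from the text. Under that reading the probabilities of all label sequences should sum to 1, and for $\beta = 0$ each ordering has probability $1/N!$.

```python
from itertools import permutations
from math import factorial


def rank_probability(rank, beta):
    """(1 + beta)**N_B / prod_{k=1..N} (k + beta * m_B(k)) for a text of G."""
    m_b = 0
    denominator = 1.0
    for k, label in enumerate(rank, start=1):
        if label == "B":
            m_b += 1  # m_B(k): number of B labels among the first k positions
        denominator *= k + beta * m_b
    return (1.0 + beta) ** rank.count("B") / denominator


def total_probability(n_g, n_b, beta):
    """Sum over all distinct label sequences, weighted by N_G! * N_B!."""
    sequences = set(permutations("G" * n_g + "B" * n_b))
    weight = factorial(n_g) * factorial(n_b)
    return weight * sum(rank_probability(seq, beta) for seq in sequences)


print(rank_probability("GGB", 0.0), 1 / factorial(3))
print(total_probability(2, 1, 0.7))
```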

A first-order expansion in $\beta$ yields:

$$P(c_1 c_2 \ldots c_N \mid X \in G) \simeq \frac{1}{N!} \left( 1 + \beta N_B - \beta \sum_{k=1}^{N} \frac{m_B(k)}{k} \right)$$

We can now define the indices $I_B$ and $I_G$ as

$$I_G = \sum_{k=1}^{N} \frac{m_G(k)}{k} - N_G, \qquad I_B = \sum_{k=1}^{N} \frac{m_B(k)}{k} - N_B$$

namely

$$P(c_1 c_2 \ldots c_N \mid X \in G) \simeq \frac{1}{N!} (1 - \beta I_B).$$

We can now make the same assumption for the case $X = b^* \in B$; in
particular, assuming the law $z^{1+\gamma}$ for the distribution function of
$d(b^*, g)$ with respect to $d(b^*, b)$, we obtain:

$$P(c_1 c_2 \ldots c_N \mid X \in B) \simeq \frac{1}{N!} (1 - \gamma I_G).$$

Note that $I_G + I_B = 0$, since $m_G(k) + m_B(k) = k$ for every $k$. Now we
can use a maximum likelihood method (which in this case corresponds to a
Bayesian method with the dubious texts equi-distributed between B and G) and
attribute X to B if

$$P(c_1 c_2 \ldots c_N \mid X \in B) > P(c_1 c_2 \ldots c_N \mid X \in G) \iff -\gamma I_G > -\beta I_B \iff I_B > 0.$$

In the same way, we choose attribution G if $I_G = -I_B > 0$.
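The voting rule is simple enough to be sketched in a few lines of Python. The code below computes $I_G$ and $I_B$ from a ranking written as a string of labels and applies the sign test on $I_B$. The function names are hypothetical, and returning `None` when $I_B = 0$ is an assumption of this sketch, since that case is not discussed in the text.

```python
def indices(rank):
    """Return (I_G, I_B) for a ranking given as a string of 'G'/'B' labels."""
    m_g = m_b = 0
    sum_g = sum_b = 0.0
    for k, label in enumerate(rank, start=1):
        if label == "G":
            m_g += 1
        else:
            m_b += 1
        sum_g += m_g / k
        sum_b += m_b / k
    return sum_g - m_g, sum_b - m_b  # after the loop m_g = N_G and m_b = N_B


def attribute(rank):
    """Attribute to B if I_B > 0, to G if I_B < 0; None marks the case I_B = 0."""
    _, i_b = indices(rank)
    if i_b > 0:
        return "B"
    if i_b < 0:
        return "G"
    return None


for rank in ("BGGB", "GBBG", "GGBGGGG", "GGGBGGG"):
    i_g, i_b = indices(rank)
    print(rank, round(i_b, 3), attribute(rank))
```

With these definitions BGGB has $I_B = 1/3$ and GBBG has $I_B = -1/3$, so the first rank is the more Basilian of the two. In the case $N_G = 6$, $N_B = 1$, the threshold falls between the third and the fourth position of the single B fragment.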

7. RESULTS: DISCRIMINATING BETWEEN BASIL AND GREGORY

The length used in the splitting of the reference texts has been selected
through simple empirical considerations that suggest combining analyses at
two different scales:

• A large scale where the work of Basil of Caesarea is left untouched.
In this way we have cuts about 170 000 characters long, and the
corpus of Gregory of Nyssa is divided into six equal parts.
• A small scale with cuts about 11 000 characters long. With this choice
we have exactly 16 parts for Basil of Caesarea and 94 parts for Gregory
of Nyssa.
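The splitting itself can be sketched as follows. This is only an illustration: the text does not state how the cut boundaries are chosen, so the sketch assumes plain character counts and contiguous pieces, and the function name and the toy corpus are hypothetical.

```python
def split_equal(text, n_parts):
    """Split text into n_parts contiguous pieces whose lengths differ by at most 1."""
    base, extra = divmod(len(text), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        parts.append(text[start:end])
        start = end
    return parts


# Toy stand-in for a reference corpus; a real corpus would be read from file.
corpus = "x" * 170_000
fragments = split_equal(corpus, 16)
print(len(fragments), min(map(len, fragments)), max(map(len, fragments)))
```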

The choice of two very different scales of comparison can be motivated as
follows: comparing any unknown text with pieces that are 170 000 characters
long enhances the weight of rare stylistic features, whereas using cuts of
11 000 characters amplifies frequently used stylistic patterns. In order to
correlate the different information arising from the analyses at the two
scales, both the entropic distance and the n-gram distance with n = 11 have
been implemented at these scales, yielding four different attributions for
each disputed text. We note that values of n close to n = 11 give very
similar results. This reflects a quite important characteristic of our
method, namely its stability. At the other extreme, very small values of n
give quite confusing data, while larger values of n correspond to sequences
of words that are too long and rarely repeat across the corpus, seriously
degrading the efficiency of the n-gram distance.

Table 2. The corpus.

Category             No. of texts
Basil works          43
Basil epistulae L    89
Basil epistulae M    92
Basil epistulae S    138
Gregory works        57
Gregory epistulae    25
Total                444
Summarizing, we have selected the four methods LZWE-170, LZWE-11
(the entropic method with $B_0$ and $G_0$ split into texts of size 170 000
and 11 000, respectively), and N-11-170, N-11-11 (the n-gram method with
n = 11 and with $B_0$ and $G_0$ split as before).
We have applied these four methods to the whole Basil and Gregory
corpus. We recall that the texts have been divided into six categories, as
explained in Section 3 and summarized in Table 2.
For each given text, each single method returns (following the voting
procedure explained in Section 6) an attribution either to Basil (B) or to
Gregory (G). Combining all the methods, we can have five different
outcomes: all four methods return the same attribution (B or G), exactly
three methods agree on the same attribution (B or G), or two methods
return B while the other two attribute the text to G.

Fig. 2. Results of the attribution. In the first graphic: black B = 4/G = 0; dark grey B = 3/G
= 1; white: B = 2/G = 2; light grey B = 1/G = 3; grey B = 0/G = 4. In the second: black
means B, grey G, white tie.

Our global attribution strategy was to consider a text attributed if and
only if at least three methods attributed it to the same author. For example,
any text with a tied output such as BBGG or BGBG from the four methods was
considered as not attributed.
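The combination rule can be written down directly. In this sketch the four single-method attributions are passed as a string of four labels; the function name and the use of `None` for an unattributed text are assumptions of the example.

```python
from collections import Counter


def combine_votes(votes):
    """Return 'B' or 'G' if at least three of the four methods agree, else None."""
    if len(votes) != 4 or any(v not in ("B", "G") for v in votes):
        raise ValueError("expected four votes, each 'B' or 'G'")
    author, count = Counter(votes).most_common(1)[0]
    return author if count >= 3 else None


for votes in ("GGGG", "BGBB", "BBGG", "BGBG"):
    print(votes, combine_votes(votes))
```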
The following aggregated results refer only to texts that are not too short,
i.e. we have excluded the Basil M and S letters, which will be analysed and
discussed later on.
To present the results, for $A \in \{B, G\}$, TP(A) denotes the true
positives for A, namely the number of texts by A correctly attributed to
author A, T(A) indicates the number of texts by the author A, and P(A) the
total number of texts attributed to author A. Table 3 shows the standard
indices from information retrieval that we have used to measure the
efficiency of our method.
We remark that if we accept as true attributions only the cases in which all
four methods agree, the overall precision clearly increases (up to 97% for
both authors), but with a natural decrease of both the recall (to around
70–80%) and the F-score (below 90%).
We believe that in this specific case (but probably also in other similar
situations) the wrong attributions (false positives) returned by our
quantitative approach offer new suggestions to philologists for a deeper
investigation of the questioned texts, and can be counted among the
successes of the method.
For example, in the case studied there is only one letter by Gregory that
has been attributed to Basil by three of the four methods: this
misattribution can be explained by the fact that this letter is an exposition
of the faith written for a synod of bishops in Neocaesarea or in Sebasteia in
380, and its style may have been influenced by this circumstance. For a more
detailed philological analysis of this and other wrong attributions we refer
to Maspero et al. (in print).
In Figure 2 a detailed description of the attributions for each specific
class of texts is shown.
As expected, the results degrade as the size of the sample decreases. In
fact, for group M of 92 letters (with a number of characters between 1250 and
2500), we have 10 errors and 23 ties, so that the number of correct
attributions is only 59 (64%). The outcome is even worse for the last group S
of 134 letters with a size less than 1250 characters: 25 errors, 25 ties and
only 50 of correct attributions. The decrease in effectiveness for smaller
texts confirms the importance of good statistical samples and can be
considered as a solid indication of the coherence and meaningfulness of our
results.

Table 3. Efficiency of the method.

Index        Definition         Meaning
Recall       r = TP(A)/T(A)     the sensitivity of the test
Precision    p = TP(A)/P(A)     the positive predictive value of the test
F-score      f = 2rp/(r + p)    the harmonic mean of r and p

Here are the results:

             Basil    Gregory
Recall       0.87     0.90
Precision    0.96     0.93
F-score      0.91     0.91
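The three indices of Table 3 can be computed as in the following sketch. The function names are hypothetical; the recall and precision values are those reported in Table 3, and the loop simply recomputes the F-score from them.

```python
def recall(tp, t):
    """r = TP(A) / T(A)."""
    return tp / t


def precision(tp, p):
    """p = TP(A) / P(A)."""
    return tp / p


def f_score(r, p):
    """Harmonic mean of recall and precision."""
    return 2 * r * p / (r + p)


# Recall and precision reported in Table 3.
for author, r, p in (("Basil", 0.87, 0.96), ("Gregory", 0.90, 0.93)):
    print(author, round(f_score(r, p), 2))
```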
We end this section showing the recall and the precision for each one of
the four methods, again excluding the small and the medium letters. It is
worth remarking that while the n-gram methods show a higher recall for
Gregory, the entropic methods maximize their recall for Basil. At this
moment, even if we can argue that this is linked to the more peculiar style
of Gregory’s language, we do not have a full and convincing explanation of
this interesting phenomenon.

8. WHO WROTE EPISTULA 38?

Ep. 38, like many other works, is not included in the previous computations,
because its authorship has been disputed over time. The good results of our
strategy on the controlled corpus of already attributed works suggest that
the answer to the attribution problem for this letter should be meaningful.
It should be stressed that the method has been designed with Ep. 38 and its
literary genre in mind, because it belongs to the same genre as the works
against Eunomius used as points of reference to measure the distances. In
the end, it is attributed to Gregory by all the methods. The high precision
(97%) in the cases of complete agreement of the four methods suggests that
the answer can be trusted. The letter is 18 083 characters long, i.e.
sufficiently long to have good statistics, and the attribution remains
stable if we divide it into smaller sections.

The effectiveness of the computations is confirmed by the results for
Basil’s letters 16 and 189, which are known to be in reality parts of
works written by Gregory (see footnote 4). This means that they are in the
same situation as Ep. 38, but the correct answer is known in advance. Our
methods attribute these two letters, transmitted in Basil’s corpus, to
Gregory, giving the right result.
Moreover, the analysis of the works which are not correctly attributed
seems to give interesting results. Only four works in Gregory’s corpus are
wrongly attributed, but for three of them one can find in the literature
philological reasons to doubt their authorship. This means that the “errors”
made by our method seem to have a philological meaning, suggesting an
even greater precision. The remaining mistakes, one in Gregory’s corpus
and one in Basil’s corpus, can be explained by their literary genre and
contents, which are very different from those of the works written against
Eunomius by both authors and used as comparison to measure the different
distances. This result seems to suggest that the combination of philological
and numerical techniques in the design of the method is very effective. As
far as we know, our method is the first one able to analyse a real
philological problem in authorship attribution of ancient Greek works and to
give a clear result, in agreement with what was previously known in the
scientific literature (Maspero & Leal, 2010).

9. CONCLUSIONS

We have analysed and compared the two corpora formed by all the works
written by two Greek authors of the 4th century, Basil and Gregory of Nyssa.
Through the combination of two different methods and of different scales and
parameters we have been able to correctly attribute the works examined with
an overall precision of 96% (Basil) and 93% (Gregory). In case of perfect
agreement of all the methods used, the precision grows to 97% for both
authors. The experimental setting was designed in such a way as to study the
attribution of a specific letter, Ep. 38, which has been transmitted in the
corpora of both authors and has been extensively discussed from the
philological and historical perspectives. We could attribute the letter in a
very clear way to Gregory of Nyssa, with unanimous agreement among all our
methods. This answer has been confirmed by the other results obtained, even
by the few errors in the attributions, which are coherent with what was
previously known at the philological level.

As future developments of this research we are considering the possibility
and effectiveness of analysing the inner structure of some of the studied
works, as well as the comparison of these results with analogous problems
concerning other Greek authors of the same time, in particular Gregory of
Nazianzus, who also wrote extensively on the same subject and against the
same Eunomius. This case is particularly interesting, because in Gregory of
Nazianzus’ corpus there are some works with disputed attribution, just like
Ep. 38.

Footnote 4: Basil’s letter 189 coincides with part of one of Gregory’s
Trinitarian tracts, Ad Eustathium, De Sancta Trinitate (GNO III/1, 3–16), and
Basil’s letter 16 is known to be part of the 8th chapter of Gregory’s Contra
Eunomium III (GNO II, 226–228).

REFERENCES

Basile, C., Benedetto, D., Caglioti, E., & Degli Esposti, M. (2008). An example of mathe-
matical authorship attribution. Journal of Mathematical Physics, 49, 125211–1251120.
Basile, C., Benedetto, D., Caglioti, E., Cristadoro, G., & Degli Esposti, M. (2009). A plagia-
rism detection procedure in three steps: Selection, matches and “squares”. In Uncover-
ing Plagiarism, Authorship and Social Software Misuse and 1st International
Competition on Plagiarism Detection. San Sebastian, Spain, 10 September 2009,
Aachen: CEUR Workshop Proceedings ISSN 1613–0073, vol. 502, pp. 19–232
Benedetto, D., Caglioti, E., & Loreto, V. (2002). Language trees and zipping. Physical
Review Letters, 88(4), 48702.
Bennet, W. R. (1976). Scientific and engineering problem-solving with the computer. Engle-
wood Cliffs, NJ: Prentice-Hall.
Canfora, L. (2002). Il copista come autore. Palermo: Sellerio.
Clement, R., & Sharp, D. (2003). Ngram and Bayesian classification of documents for topic
and authorship. Literary and Linguistic Computing, 18(4), 423–447.
Drecoll, V. H. (1996). Die Entwicklung der Trinitätslehre des Basilius von Cäsarea. Sein
Weg vom Homöusianer zum Neonizäner. Göttingen: Vandenhoeck & Ruprecht.
Fedwick, P. J. (1978). Commentary of Gregory of Nyssa or the 38th letter of Basil of
Caesarea, OrChrP, 44, 31–51. J. Hammerstaedt, Zur Echtheit von Basiliusbrief 38, in
E. Dassmann & K. Thraede (Eds), Tesserae. Festschrift für Josef Engemann, Jahrbuch
für Antikes Christentum. Ergänzungsband 18, Münster 1991, 416–419; W.-D. Haus-
child, Basilius von Caesarea. Briefe. Eingeleitet, übersetzt und erläutert. Erster Teil,
Bibliothek der Griechischen Literatur 32, Stuttgart 1990.
Fedwick, P. J. (1993). Bibliotheca Basiliana Universalis I. Turnhout: Brepols.
Hübner, R. (1972). Gregor von Nyssa als Verfasser der sog. Ep. 38 des Basilius. Zum
unterschiedlichen Verständnis der ousia bei den kappadozischen Brüdern. In J. Fontaine
& Ch. Kannengiesser (Eds), Epektasis. Mélanges patristiques offerts au Cardinal
Jean Daniélou (pp. 463–490). Paris: Beauchesne.

Keselj, V., Peng, F., Cercone, N., & Thomas, C. (2003). N-gram-based author profiles for
authorship attribution. In V. Keselj & T. Endo (Eds), Proceedings of the Conference
Pacific Association for Computational Linguistics PACLING’03 (pp. 255–264).
Halifax: Dalhousie University.
Lempel, A., & Ziv, J. (1976). On the complexity of finite sequences. IEEE Transactions on
Information Theory, 22(1), 75–81.
Lempel, A., & Ziv, J. (1977). A universal algorithm for sequential data compression. IEEE
Transactions on Information Theory, 23(3), 337–343.
Lempel, A., & Ziv, J. (1978). Compression of individual sequences via variable-rate coding.
IEEE Transactions on Information Theory, 24(5), 530–536.

Maspero, G., & Leal, J. (2010). Revisiting Tertullian’s Authorship of the Passio Perpetuae
through Quantitative Analysis. In P. Grzybek, E. Kelih, & J. Mačutek (Eds), Text and Language.
Structures - Functions - Interrelations - Quantitative Perspectives (pp. 99–108).
Wien: Praesens Verlag.
Maspero, G., Benedetto, D., & Degli Esposti, M. (2013 in print). Who wrote Basil’s Epistula
38? A Possible Answer through Quantitative Analysis. In J. Leemans & M. Cassin
(Eds), Gregory of Nyssa’s Contra Eunomium III. Proceedings of the Twelfth International
Gregory of Nyssa Colloquium (Leuven, 14–17 September 2010). Leuven: Brill.
Meredith, A. (1999). Gregory of Nyssa. London, New York: Routledge.
Rousseau, Ph. (1998). Basil of Caesarea. Berkeley/Los Angeles/London: University of
California Press.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the
American Society for Information Science and Technology, 60(3), 538–556.
Wyner, A. D., Ziv, J., & Wyner, A. J. (1998). On the role of pattern matching in information
theory. IEEE Transactions on Information Theory, 44(6), 2045–2056.
Zachhuber, J. (2003). Nochmals: Der “38. Brief” des Basilius von Cäsarea als Werk des Gre-
gor von Nyssa. Zac, 7, 73–90.
Ziv, J., & Merhav, N. (1993). A measure of relative entropy between individual sequences
with application to universal classification. IEEE Transactions on Information Theory,
39(4), 1270–1279.

ONLINE REFERENCE

http://www.tlg.uci.edu/. The digital library has been developed by the University of
California, Irvine.
