Benedetto-Puzzle Basil Ep 388
Journal of Quantitative
Linguistics
Publication details, including instructions for authors
and subscription information:
http://www.tandfonline.com/loi/njql20
To cite this article: Dario Benedetto , Mirko Degli Esposti & Giulio Maspero (2013)
The Puzzle of Basil’s Epistula 38: A Mathematical Approach to a Philological Problem,
Journal of Quantitative Linguistics, 20:4, 267-287, DOI: 10.1080/09296174.2013.830549
Downloaded by [Australian National University] at 16:00 08 January 2015
Journal of Quantitative Linguistics, 2013
Vol. 20, No. 4, 267–287, http://dx.doi.org/10.1080/09296174.2013.830549
Italy
ABSTRACT
We present and explore a real case in authorship attribution (A.A.) by combining the tradi-
tional philological approach with novel mathematical techniques. The problem involves the
extensive productions of Basil of Caesarea and his brother Gregory of Nyssa, two influential
4th century Christian theologians, and the attribution of specific and discussed works in their
corpora. Our novel method relies on two similarity (pseudo) distances, based, respectively,
on the statistics of n-grams and on zip-like algorithms, and on a new ranking/voting system
that allows us to infer the attribution from the values of the distances between the unknown
texts and the texts of the training corpus. The main results are, on the one hand, the
attribution of the letter with 97% precision to one of the two authors and, on the other,
the strong agreement of the numerical explorations with both the philological analysis and
the results known so far for all the works in the two corpora.
1. INTRODUCTION
The work has been extensively analyzed and studied from the philologi-
cal, philosophical and theological perspectives. The aim of the present paper
is to investigate its authorship having recourse to mathematical methods
and numerical computations.
Ep. 38 is in some ways a perfect authorship attribution problem: it is
known in advance that the work belongs to either Basil or Gregory, i.e. it is
a genuine two-class classification problem. Moreover, the statistical approach
is expected to give good results, as the productions of these authors are
conspicuous.
But there is also another characteristic of the problem that raises hope of
getting good results: both Basil and Gregory of Nyssa have produced exten-
sive works against Eunomius, discussing the same subjects and practically
having recourse to the same quotations and vocabulary. It seems reasonable
to suppose that these works can be effective touchstones to analyse Ep. 38.
In fact, if we apply statistical methods and develop a kind of distance
between different texts, the differences between Ep. 38 and Basil’s or
270 D. BENEDETTO ET AL.
3. THE CORPUS
The Corpus used in the present study is composed of all the known works
by Basil and Gregory of Nyssa. The digitized texts can be found in the
Thesaurus Linguae Graecae (TLG) (http://www.tlg.uci.edu/). All the works
have been included, including the spurious and dubious ones.
Since the main objective of this study is to develop a reliable authorship
attribution method for exploring the spurious and dubious works, the
whole corpus has been divided into three sets with specific tasks in mind:
(1) A reference set: One dogmatic work by Basil and one set of the three
books by Gregory of Nyssa against Eunomius. From now on we will
denote them by B0 and G0, respectively. The (character) lengths of
these two texts, after the coding explained in the next section are
\[ |B_0| = 172342 \quad\text{and}\quad |G_0| = 1017314 \]
1
Y. Courtonne’s edition has been used: Saint Basile. Lettres, Les Belles Lettres, Paris
1957–1966.
2
G. Pasquali’s edition has been used: Gregorii Nysseni Opera (= GNO), vol. VIII/2, Brill,
Leiden, (1959).
THE PUZZLE OF BASIL’S EPISTULA 38 271
form the second group (M) and the letters shorter than 1250
constitute the third one (S). Correspondingly, also 57 works and 25
letters by Gregory3 with well known attribution have been included
in this set. This set has been used, as we will soon explain, in order
to verify the efficiencies and the stability of our method.
(3) A set of almost 100 dubious texts such as Ep. 38.
procedure for all texts used in the experiments. Coding and cleaning of
texts is a crucial issue in automatic authorship attribution and even if it is
often (voluntarily or not) ignored, we believe it deserves a detailed descrip-
tion and we discuss it in the next section.
The digital texts of ancient authors are typically the final versions of an
unknown number of intermediate copies of the originals (Canfora, 2002).
The history of a single text might determine different frequencies of some
words, characters, or of a whole class of characters. Moreover, the digitali-
zation of the text can arbitrarily change the frequency of “new line” and
“carriage return”, and different editorial policies can change the frequency
of numbers, capital letters and punctuation signs.
For instance, in the Corpus we are analysing, Πατρός appears only in
the digital version of Basil’s Corpus, whereas πατρός appears almost exclu-
sively in the digital version of the texts handed on as attributed to Gregory.
Even if the frequency of capital letters is quite small, this feature can cause,
in the quantitative methods, a bias linked to the history of the transmission
of the text.
Moreover, the digital version of the whole corpus we are analysing here
has another quite problematic feature: it uses (via the UTF8 coding) the
polytonic orthography of the ancient Greek, which has different symbols for
any letter with accents or breathings (for example ɛ appears in nine ver-
sions). A typical phrase appears as
3
About Gregory’s letters, we can use only 25 from the Pasquali collection, because 26 and
30 are not by Gregory, while 27 and 28 are spurious.
Only 29 characters have frequencies greater than 1%, whereas the total
frequency of the other rare characters is not negligible, namely it is greater
than 16%, indicating that we will likely encounter rare characters while ana-
lysing an arbitrary text. The presence of several variants of the same charac-
ter has severe consequences on both quantitative attribution methods we use
(see the next section). Both methods in fact exploit redundancies between
not too long sequences of characters (usually 6–15) and the existence of
different variants of the same character diminishes the statistics of redun-
dancies preventing the discovery of similarities between different words that
might constitute part of the style of an author. Moreover, depending on the
particular historical period, a version of a given text might be produced
with more or less consistency with respect to given grammar rules that
establish the use of specific variants of a character. Also in this case, differ-
ences detected by the methods cannot lead back to author’s features.
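As a rough illustration of the kind of normalization these observations motivate, the following Python sketch collapses case, accent and breathing variants of each letter, together with differences in whitespace and line breaks. It is not the coding actually used in the study: the function name normalize_greek and the use of Unicode NFD decomposition are assumptions made only for this example.

```python
import unicodedata


def normalize_greek(text: str) -> str:
    """Collapse case, accent and breathing variants of each Greek letter.

    Illustrative only: this is an assumed cleaning step, not the paper's coding.
    """
    # NFD splits a precomposed letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks removes accents, breathings and iota subscripts.
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lower-casing removes editorial capitalisation (e.g. Πατρός vs πατρός).
    lowered = base_only.lower()
    # Collapse runs of spaces, new lines and carriage returns into single spaces.
    return " ".join(lowered.split())
```

With this sketch, normalize_greek("Πατρός") and normalize_greek("πατρός") both return "πατρος", so the capitalisation difference between the two digital corpora no longer distinguishes them.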
All these kinds of differences, even if minimal, can strongly obfuscate
the stylistic features of the authors and deeply affect the attributions. In
order to avoid these effects, we have eliminated all the characteristics in the
texts that might have been introduced after the author’s creation of the
original text. In particular:
5. THE METHODS
The two methods we have implemented here are developments of two methods
that have already been used in a project concerning the attribution of
Antonio Gramsci’s papers (Basile et al., 2008). Essentially, each method
defines a kind of similarity distance between texts: given any pair of texts,
each method produces a positive number that we interpret as the distance
between the texts. A small distance means that the two texts are quite
similar (either in the argument/topic or in the author’s style, see below),
whereas a large distance means a high degree of dissimilarity. Let us briefly
describe both algorithms.
respectively. Dn(X) is the n-gram dictionary of the text X, that is, the set
of all n-grams which have non-zero frequency in X (similarly for Y) and
we define what we will call the n-gram distance between text X and text Y
as
\[
d_n(X, Y) := \frac{1}{|D_n(X)| + |D_n(Y)|}
\sum_{\omega \in D_n(X) \cup D_n(Y)}
\left( \frac{f_X(\omega) - f_Y(\omega)}{f_X(\omega) + f_Y(\omega)} \right)^{2}
\tag{5.1}
\]
Here, |Dn(X)| and |Dn(Y)| are the numbers of different n-grams in the two
dictionaries and the sum is taken over all different n-grams occurring in the
two texts.
Note that in the previous formula, in contrast with what happens for the
Euclidian distance, each term of the sum is weighted with the inverse of the
square of the sum of the frequencies of that particular n-gram. In this way
rare words, i.e. n-grams with lower frequencies, give a larger contribution
to the sum.
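A minimal Python sketch of the n-gram distance may make the weighting concrete. It assumes that the frequency of an n-gram is its relative frequency among all character n-grams of the text (the definition is not reproduced in this passage), and the function names ngram_freqs and ngram_distance are hypothetical.

```python
from collections import Counter


def ngram_freqs(text: str, n: int) -> dict:
    """Relative frequency of every character n-gram of `text` (assumed definition)."""
    total = len(text) - n + 1
    if n < 1 or total <= 0:
        raise ValueError("text must contain at least one n-gram")
    counts = Counter(text[i:i + n] for i in range(total))
    return {gram: c / total for gram, c in counts.items()}


def ngram_distance(x: str, y: str, n: int) -> float:
    """Sum of squared relative frequency differences over the union of the two
    dictionaries, divided by the sum of the dictionary sizes."""
    fx, fy = ngram_freqs(x, n), ngram_freqs(y, n)
    total = 0.0
    for gram in fx.keys() | fy.keys():
        a, b = fx.get(gram, 0.0), fy.get(gram, 0.0)
        # Each term lies in [0, 1]; it equals 1 when the n-gram occurs in one text only.
        total += ((a - b) / (a + b)) ** 2
    return total / (len(fx) + len(fy))
```

Under these assumptions, two identical texts are at distance 0 and two texts with no n-gram in common are at distance 1; an n-gram that is rare in both texts but occurs unevenly contributes as much as a frequent one, which is the effect described above.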
5.2 LZEW
The second method we use to estimate the similarity between texts is based
on data compression and its role in the estimation of the entropy of a
source. Data compression is nowadays a very well established field of
information theory, thanks to the founding papers published by Ziv, Lempel and
their co-workers in the 1970s (cf., among others, Lempel & Ziv, 1976,
1977, 1978) and the review paper (Wyner et al., 1998). They proposed a
variety of compression algorithms (the family of LZ algorithms), based on
the idea of a clever parsing (subdivision) of the symbolic sequence, i.e.
splitting it up into pieces so that this separation can then be used to produce a
shorter, equivalent version of the string itself. This was a major advance in the
field, since it was the first example of a compressor that does not operate on
a fixed number of characters at a time, but is allowed to vary the length of the
encoded substrings according to the “size” of the redundancies that the specific
sequence presents; indeed, such algorithms are still at the base of the most
common zipping software in everyday use on our computers.
In 1993 Ziv and Merhav proposed a method to estimate the relative
entropy (or Kullback-Leibler divergence) between a given pair of information
sources (Ziv & Merhav, 1993). The relative entropy is basically a
measure of the similarity between the information emitted by the sources,
and they proved that a modified version of an LZ algorithm (where the
have both the first and last character in common. Namely, each new match
begins with the last character of the previous match.
Let σ be this first character of a new match. As before, we have to store
the position of the match in y. Instead of storing the number of characters
between the match and the end of y, we store as index of the position the number
of words in y that start with σ and lie between the match and the end of y.
For example, consider again the sequences x and y above. Our algorithm first
returns the character a, then the position of the match abab is indicated by the
value 5, just because abab is the substring starting with the fifth a from the
end; then the match baab is found (starting from the final b of the previous
match), and its index is 2. Finally, the last match, bbbaa, has index 4.
In this way, the number of matches is in general larger than that of BCL,
but the numbers we need to express the positions of the matches are usually
much smaller (in particular for texts from large alphabets). Let us note that
we can reduce some of these numbers: in the example, the last match bbbaa
cannot start with the second b from the end of y because this b is also the
last character of a copy of the previous match. So we can specify the
position of the new match counting only the possible characters with which the
match can start, according to the fact that the previous match is maximal. In
this case we can specify the position with the index 3 and not 4.
We codify all these numbers exactly as gzip does, and finally, we
compress the list of characters and lengths, and the list of the positions,
using an arithmetic encoding conditioned on the first character of the match.
This algorithm, now called LZEW, has been used here for the first time,
but we have previously tested the method with the Gramsci corpus used in
Basile et al. (2008), for which LZEW gives the same results as BCL (but note
that the lengths of the texts of the Gramsci Corpus are all smaller than 2^15).
This leads to the implementation of a true compressor program which
effectively shows better compression rates with respect to gzip, as shown in
Table 1.
Table 1. Comparison of the compression rate between gzip and LZEW. For each literary
work, we show the author (left), the dimension (Dim) in characters and the compression rate
(i.e. bytes/character) using gzip and LZEW, respectively.
The details of this method are designed to optimize the compression
ratio in order to obtain a good estimate of the relative entropy between
texts. Because of this specific optimization, the complexity of the imple-
mentation naturally increases. The interested reader can compare our
method with the simpler one introduced by Ziv and Merhav (1993),
even if it turns out to be less accurate for our aims: the authors estimate the
similarity between the texts x and y with
\[ N \log |y| \,/\, |x| \]
where |x| and |y| are the lengths of the two texts, respectively, and N is just
the cardinality of the parsing of x in y, as described at the beginning of this
section.
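The simpler estimate can be sketched in a few lines of Python. This is only an illustration of the cross-parsing idea, not the LZEW implementation: it assumes that x is parsed greedily into the longest substrings that occur anywhere in y, that a character absent from y forms a phrase of length one, and that the logarithm is taken in base 2. The names cross_parse_count and zm_estimate are hypothetical.

```python
import math


def cross_parse_count(x: str, y: str) -> int:
    """Number of phrases N when x is cut greedily into the longest
    substrings that also occur in y (quadratic-time sketch)."""
    i, phrases = 0, 0
    while i < len(x):
        length = 1
        # Extend the candidate phrase while it is still found in y.
        while i + length <= len(x) and x[i:i + length] in y:
            length += 1
        # length - 1 is the longest match; unseen characters advance by one.
        i += max(length - 1, 1)
        phrases += 1
    return phrases


def zm_estimate(x: str, y: str) -> float:
    """N * log2(|y|) / |x|, read here as bits per character of x (assumed base 2)."""
    if not x or not y:
        raise ValueError("both texts must be non-empty")
    return cross_parse_count(x, y) * math.log2(len(y)) / len(x)
```

The fewer phrases are needed to cover x with pieces of y, the smaller the estimate, so a text x that shares long character sequences with y is judged closer to it.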
GGBGGG
GGGBGG
GGGGBG
GGGGGB
If now, for example, the output is given by the first sequence, we attribute
the text X to B, while if we observe the last one we surely attribute the
text to G, but it is not clear where to put the threshold. Moreover, if NG ≠ 1
and NB ≠ 1, we also have to understand how to order the ranks for attribution.
For instance, is the rank BGGB more or less Basilian than GBBG?
This is why we need an efficient voting algorithm that allows us to order
the rankings and define a threshold for attribution. This is really a crucial
step that consistently increases the performance of the attribution with
respect to the trivial first nearest author attribution, which only looks at the
item at the top of the rank.
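To fix ideas, the rank and the trivial first nearest author attribution can be sketched as follows. The input format (a list of author label and distance pairs for the reference fragments) and the function names are assumptions for this example, and the distances used in the usage note are invented.

```python
def rank_string(distances):
    """Order the reference fragments by increasing distance from the unknown
    text and return the sequence of their author labels, e.g. 'BGG'."""
    ordered = sorted(distances, key=lambda pair: pair[1])
    return "".join(label for label, _ in ordered)


def nearest_author(distances):
    """Trivial attribution: the author of the closest reference fragment."""
    return rank_string(distances)[0]
```

For instance, rank_string([("G", 0.4), ("B", 0.1), ("G", 0.3)]) gives "BGG" and nearest_author returns "B", although two of the three fragments belong to G; this is the kind of information that looking only at the top of the rank discards.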
We have already solved this problem in Basile et al. (2008), but only in the
special homogeneous case NG = NB. Here we summarize the
previous result and extend it to the more general case NG ≠ NB.
Let us denote the rank with c1 ... cN, where ci ∈ {G, B} is the author of
the ith fragment in the rank, and N = NG + NB. Let us consider a text g* of
the author G. We can model the attribution procedure assuming that the
similarity distance with a fragment g of author G is a random positive
variable d(g*, g), and that also the distance d(g*, b) with a fragment b of the
author B is again another random positive variable. Through a monotonic
transformation of the distances, we can assume that d(g*, g) is uniformly
distributed in [0,1]. Our main assumption is that with this transformation
the other distribution function for the random variable d(g*, b) turns into a
power law: P(d(g*, b) < z) = z^(1+β), with β > 0. Moreover, we also assume
that different distance values are independent.
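The model can be explored numerically. The sketch below draws one rank for a text of author G under the stated assumptions: distances from fragments of G are uniform on [0,1], distances from fragments of B have distribution function z^(1+β), and all draws are independent. The function name sample_rank and the parameter values in the usage note are illustrative choices, not values taken from the study.

```python
import random


def sample_rank(n_g: int, n_b: int, beta: float, rng: random.Random) -> str:
    """Draw one rank string for a text by author G under the stated model."""
    # d(g*, g): uniform on [0, 1].
    draws = [("G", rng.random()) for _ in range(n_g)]
    # d(g*, b): distribution function z**(1 + beta); inverse-transform
    # sampling maps a uniform u to u**(1 / (1 + beta)).
    draws += [("B", rng.random() ** (1.0 / (1.0 + beta))) for _ in range(n_b)]
    # The rank lists the fragments from the nearest to the farthest.
    draws.sort(key=lambda pair: pair[1])
    return "".join(label for label, _ in draws)
```

For example, sample_rank(5, 1, 2.0, random.Random(0)) returns a string of five G's and one B. Repeating the draw many times gives an empirical distribution over ranks, and larger values of β push the fragments of B towards the end of the rank.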
Let ci be the value of the ith position in the ranking, i.e. ci = G if the
ith text is of the author G, and ci = B otherwise. We define
\[
m_B(k) = \sum_{i=1}^{k} \chi\{c_i = B\}
\quad\text{and}\quad
m_G(k) = \sum_{i=1}^{k} \chi\{c_i = G\}
\]
The value mG(k) is the number of texts of class G which appear in the rank
up to position k. With these assumptions, we can calculate explicitly the
probability of observing the rank c1…cN if X ∈ G (i.e. if X is a text of G). The
calculations involve the evaluation of some multiple integrals, as described in
detail in Basile et al. (2008) for the case NG = NB, and yield the formula:
\[
P(c_1 c_2 \dots c_N \mid x \in G) = (1 + \beta)^{N_B} \,
\frac{1}{\prod_{k=1}^{N} \bigl( k + \beta\, m_G(k) \bigr)}
\]
namely
\[
P(c_1 c_2 \dots c_N \mid x \in G) \simeq \frac{1}{N!} \, (1 - \beta I_B).
\]
We can now make the same assumption for the case X = b* ∈ B; in
particular, assuming the law z^(1+γ) for the distribution function of d(b*, g)
with respect to d(b*, b), we obtain:
\[
P(c_1 c_2 \dots c_N \mid x \in B) \simeq \frac{1}{N!} \, (1 - \gamma I_G).
\]
The length used in the splitting of the reference texts has been selected
through simple empirical considerations that suggest combining analyses at
two different scales:
                     No. of texts
Basil works                    43
Basil epistulae L              89
Basil epistulae M              92
Basil epistulae S             138
Gregory works                  57
Gregory epistulae              25
Total                         444
enhances the weight of rare stylistic features, whereas using cuts of 11 000
characters amplifies frequently used stylistic patterns. In order to correlate
the different information arising from the analyses at the two different scales,
both the entropic distance and the n-gram distance with n = 11 have been
implemented at these scales, yielding four different attributions for each
disputed text. We note that values of n close to n = 11 give very similar
results. This reflects a quite important characteristic of our method, namely
its stability. On the other extreme, very small values of n give quite confus-
Fig. 2. Results of the attribution. In the first graphic: black B = 4/G = 0; dark grey B = 3/G
= 1; white: B = 2/G = 2; light grey B = 1/G = 3; grey B = 0/G = 4. In the second: black
means B, grey G, white tie.
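The four attributions obtained for each text (two distances at two scales) can be tallied as in the following sketch, which reports the B/G split used in the first graphic and flags the case of perfect agreement. The function name tally_attributions and the returned tuple are assumptions made for this example; how the second graphic is derived from the four attributions is not specified here.

```python
def tally_attributions(votes):
    """Count the attributions to B and to G among the individual methods
    and report whether all of them agree."""
    b = sum(1 for v in votes if v == "B")
    g = sum(1 for v in votes if v == "G")
    if b + g != len(votes):
        raise ValueError("every vote must be 'B' or 'G'")
    unanimous = b == 0 or g == 0
    return b, g, unanimous
```

For example, tally_attributions(["G", "G", "G", "G"]) gives (0, 4, True), the B = 0/G = 4 case, while tally_attributions(["B", "G", "B", "G"]) gives (2, 2, False).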
wrongly attributed, but for three of them one can find in the literature philological
reasons to doubt their authorship. This means that the “errors”
made by our method seem to have a philological meaning, suggesting an
even greater precision. The remaining mistakes, one in Gregory’s corpus
and one in Basil’s corpus, can be explained with their literary genre and
contents, very different with respect to the works written against Eunomius
by both authors and used as comparison to measure the different distances.
This result seems to suggest that the combination of philological and
numerical techniques in the design of the method is very effective. As far
as we know, our method is the first one able to analyse a real philological
problem in authorship attribution of ancient Greek works and to give a clear
result, in agreement with what was previously known in scientific literature
(Maspero & Leal, 2010).
9. CONCLUSIONS
We can conclude that we have analysed and compared the two corpora
formed by all the works written by two Greek authors of the 4th century,
Basil and Gregory of Nyssa. Through the combination of two different
methods and of different scales and parameters we have been able to
correctly attribute the works examined with an overall precision of 96%
(Basil) and 93% (Gregory). In case of perfect agreement of all the methods
used the precision grows to 97% for both authors. The experimental setting
was designed in such a way as to study the attribution of a specific letter,
Ep. 38, which has been transmitted in the corpora of both authors and has
been extensively discussed from the philological and historical perspectives.
We could attribute the letter in a very clear way to Gregory of Nyssa, get-
4
Basil’s letter 189 coincides with part of one of Gregory’s Trinitarian tracts, Ad Eustathium,
De Sancta Trinitate (GNO III/1, 3– 16), and Basil’s letter 16 is known to be part of the 8th
chapter of Gregory’s Contra Eunomium III (GNO II, 226–228).
ting a unanimous agreement by all our methods. This answer has been con-
firmed by the other results obtained, even by the few errors in the attribu-
tions, which are coherent with what was previously known at the
philological level.
As future developments of this research we are considering the possibility
and effectiveness of the analysis of the inner structure of some of the studied
works, as well as the comparison of these results with analogous problems
concerning other Greek authors of the same time, in particular Gregory of
Nazianzus, who also extensively wrote on the same subject and against the
same Eunomius. This case is particularly interesting, because in Gregory of
Nazianzus’ Corpus there are some works with disputed attribution just as
Ep. 38.
REFERENCES
Basile, C., Benedetto, D., Caglioti, E., & Degli Esposti, M. (2008). An example of mathe-
matical authorship attribution. Journal of Mathematical Physics, 49, 125211–1251120.
Basile, C., Benedetto, D., Caglioti, E., Cristadoro, G., & Degli Esposti, M. (2009). A plagia-
rism detection procedure in three steps: Selection, matches and “squares”. In Uncover-
ing Plagiarism, Authorship and Social Software Misuse and 1st International
Competition on Plagiarism Detection. San Sebastian, Spain, 10 September 2009,
Aachen: CEUR Workshop Proceedings ISSN 1613–0073, vol. 502, pp. 19–232
Benedetto, D., Caglioti, E., & Loreto, V. (2002). Language trees and zipping. Physical
Review Letters, 88(4), 48702.
Bennet, W. R. (1976). Scientific and engineering problem-solving with the computer. Engle-
wood Cliffs, NJ: Prentice-Hall.
Canfora, L. (2002). Il copista come autore. Palermo: Sellerio.
Clement, R., & Sharp, D. (2003). Ngram and Bayesian classification of documents for topic
and authorship. Literary and Linguistic Computing, 18(4), 423–447.
Drecoll, V. H. (1996). Die Entwicklung der Trinitätslehre des Basilius von Cäsarea. Sein
Weg vom Homöusianer zum Neonizäner. Göttingen: Vandenhoeck & Ruprecht.
Fedwick, P. J. (1978). Commentary of Gregory of Nyssa or the 38th letter of Basil of
Caesarea, OrChrP, 44, 31–51. J. Hammerstaedt, Zur Echtheit von Basiliusbrief 38, in
E. Dassmann & K. Thraede (Eds), Tesserae. Festschrift für Josef Engemann, Jahrbuch
für Antikes Christentum. Ergänzungsband 18, Münster 1991, 416–419; W.-D. Haus-
child, Basilius von Caesarea. Briefe. Eingeleitet, übersetzt und erläutert. Erster Teil,
Bibliothek der Griechischen Literatur 32, Stuttgart 1990.
Fedwick, P. J. (1993). Bibliotheca Basiliana Universalis I. Turnhout: Brepols.
Hübner, R. (1972). Gregor von Nyssa als Verfasser der sog. Ep. 38 des Basilius. Zum
unterschiedlichen Verständnis der ousia bei den kappadozischen Brüdern. In J. Fon-
taine & Ch. Kannengiesser (Eds), Epektasis Mélanges patristiques offerts au Cardinal
Jean Daniélou (pp. 463–490). Paris: Beauchesne.
Keselj, V., Peng, F., Cercone, N., & Thomas, C. (2003). N-gram-based author profiles for
authorship attribution. In V. Keselj & T. Endo (Eds), Proceedings of the Conference
Pacific Association for Computational Linguistics PACLING’03 (pp. 255–264).
Halifax: Dalhousie University.
Lempel, A., & Ziv, J. (1976). On the complexity of finite sequences. IEEE Transactions on
Information Theory, 22(1), 75–81.
Lempel, A., & Ziv, J. (1977). A universal algorithm for sequential data compression. IEEE
Transactions on Information Theory, 23(3), 337–343.
Lempel, A., & Ziv, J. (1978). Compression of individual sequences via variable-rate coding.
IEEE Transactions on Information Theory, 24(5), 530–536.
Maspero, G., & Leal, J. (2010). Revisiting Tertullian’s Authorship of the Passio Perpetuae
through Quantitative Analysis. In P. Grzybek, E. Kelih, & J. Mačutek (Eds), Text and
Language. Structures - Functions - Interrelations - Quantitative Perspectives (pp. 99–108).
Wien: Praesens Verlag.
Maspero, G., Benedetto, D., & Degli Esposti M. (2013 in print). Who wrote Basil’s Epistula
38? A Possible Answer through Quantitative Analysis. In J. Leemans & M. Cassin
(Eds), Gregory of Nyssa’s Contra Eunomium III. Proceedings of the Twelfth Interna-
tional Gregory of Nyssa Colloquium (Leuven, 14–17 September 2010). Leuven: Brill
Meredith, A. (1999). Gregory of Nyssa. London, New York: Routledge.
Rousseau, Ph. (1998). Basil of Caesarea. Berkeley/Los Angeles/London: University of
California Press.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the
American Society for Information Science and Technology, 60(3), 538–556.
Wyner, A. D., Ziv, J., & Wyner, A. J. (1998). On the role of pattern matching in information
theory. IEEE Transactions on Information Theory, 44(6), 2045–2056.
Zachhuber, J. (2003). Nochmals: Der “38. Brief” des Basilius von Cäsarea als Werk des Gre-
gor von Nyssa. Zac, 7, 73–90.
Ziv, J., & Merhav, N. (1993). A measure of relative entropy between individual sequences
with application to universal classification. IEEE Transactions on Information Theory,
39(4), 1270–1279.
ONLINE REFERENCE