I was struck by a data scientist who wanted to analyse the text of the New Testament using software. What could be revealed merely by using statistics to count the frequency of the words in each book? Surely one would need to know more about the authorship, as we, or rather scholars, have discerned it?
We are informed that a number of the New Testament books may have been written pseudepigraphically, that is, written by one person in the name of another. This was customary at the time and a sign of the worthiness of the implied author.
Here below we can see that there is an implied linkage between Mark and both Matthew and Luke, with John’s Gospel standing apart. Furthermore, Paul is believed to have written 1 Thessalonians, Galatians, Philemon, Philippians, Romans and both letters to the Corinthians, but not the other letters connected with his name. The book of Revelation is also noted to be quite separate.
So what of the statistical analysis? The code, written in R, was taken from a site named Learning Machines which used the Greek New Testament Gospels from the Project Gutenberg site. The results, below, depict a close relationship between the Gospels as noted above.
If we look further at the remainder of the New Testament, using the plaintext downloads from http://sblgnt.com, we may be able to see some patterns within it. It should be noted that this kind of analysis has been criticised for the Greek data it uses. The criticism is that many, if not all, of the available Greek texts are ‘modern’ Greek rather than the Koine Greek in which the New Testament is thought to have been written. This may introduce errors. Furthermore, the Greek language is rich and diverse, so translation can imply things that did not exist in the original.
Initially the software used a clustering algorithm which looks at the various frequencies. Those statistics revealed the following:
Here we can see that the Gospels, at the bottom of the graph, are discerned as separate from the remainder of the texts, even from the Book of Acts. A collection of books noted as Pauline in origin is seen together, although 2 Thessalonians is clearly linked to 1 Thessalonians, its perceived earlier letter. Nevertheless, clustering is only one way of presenting this data. If we instead model the word frequencies seen in each book and ask which principal components account for most of the variation between books, we can look at the major ‘drivers’ of that variation. Here we look at the first two components, PC1 and PC2.
Interestingly, Revelation is markedly offset and surprisingly different from every other Greek text. The Gospels are also offset, to the right of the graph. On the central vertical dotted line, a familiar group of books appears to congregate: Philemon, Romans, Galatians, 1 Corinthians, Philippians and 1 Thessalonians.
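For readers who want to try the two steps described above, here is a minimal sketch of the general idea, not the code used for the graphs. That analysis was written in R and is not reproduced here. This sketch is in plain Python and rests on several assumptions of my own: the texts are three invented toy samples of transliterated function words (not passages from any book), the clustering uses Euclidean distance with average linkage, and the principal components are found by simple power iteration. All function and variable names are hypothetical.

```python
from collections import Counter
import math

def relative_frequencies(text, vocabulary):
    """Share of each vocabulary word among the whitespace-separated tokens."""
    tokens = text.split()
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]

def distance(a, b):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(profiles):
    """Average-linkage agglomerative clustering; returns the merges in order."""
    groups = {name: [name] for name in profiles}
    merges = []
    while len(groups) > 1:
        names = sorted(groups)
        best = None
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                pairs = [(x, y) for x in groups[a] for y in groups[b]]
                d = sum(distance(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        groups[a + "+" + b] = groups.pop(a) + groups.pop(b)
    return merges

def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def dominant_eigenpair(matrix, iterations=200):
    """Power iteration on a symmetric matrix, from a fixed non-uniform start."""
    start = [float(i + 1) for i in range(len(matrix))]
    norm = math.sqrt(sum(x * x for x in start))
    vector = [x / norm for x in start]
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    value = sum(v * p for v, p in zip(vector, mat_vec(matrix, vector)))
    return value, vector

def principal_components(rows, k=2):
    """Return (variances, scores) for the first k principal components."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    p = len(means)
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    variances, axes = [], []
    for _ in range(k):
        value, axis = dominant_eigenpair(cov)
        variances.append(value)
        axes.append(axis)
        # Deflate: remove the component just found before seeking the next one.
        cov = [[cov[i][j] - value * axis[i] * axis[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum(x * a for x, a in zip(row, axis)) for axis in axes]
              for row in centered]
    return variances, scores

# Invented toy samples (assumption): NOT passages from any real book.
texts = {
    "doc_a": "kai kai kai ho ho en",
    "doc_b": "kai kai kai ho ho de",
    "doc_c": "de de de en en ho",
}
vocabulary = ["kai", "ho", "en", "de"]
profiles = {name: relative_frequencies(text, vocabulary)
            for name, text in texts.items()}

merges = cluster(profiles)
names = sorted(profiles)
variances, scores = principal_components([profiles[name] for name in names])

print([(a, b) for a, b, _ in merges])
for name, (pc1, pc2) in zip(names, scores):
    print(f"{name}: PC1={pc1:+.3f} PC2={pc2:+.3f}")
```

In this toy data, doc_a and doc_b have nearly identical profiles, so they merge first and sit on the same side of PC1, while doc_c sits on the opposite side. The sign of each component is arbitrary, so only relative positions matter. With real books the vocabulary would be far larger, and an established statistics library would be a better choice than hand-written linear algebra.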
What can we conclude? It is not obvious which books were written by one author! That said, the analysis does allow some consideration of the authorship of the New Testament, and it adds to the discussion. Much more work is required to establish ‘good’ data, as in any statistical exercise. Nevertheless, some interesting revelations are noted, especially from software which does not purport to use any knowledge of Greek grammar.