TY - JOUR
T1 - A comparative assessment of the difficulty of authorship attribution in Greek and in English
AU - Juola, Patrick
AU - Mikros, George K.
AU - Vinsick, Sean
N1 - Publisher Copyright: © 2018 ASIS&T
PY - 2019/1
Y1 - 2019/1
AB - Authorship attribution is an important problem in text classification, with many applications and a substantial body of research activity. Among the research findings are that many different methods will work, including a number of methods that are superficially language-independent (such as an analysis of the most common “words” or “character n-grams” in a document). Since all languages have words (and all written languages have characters), such methods could (in theory) work on any language. However, it is not clear that the methods that work best on, for example, English would also work best on other languages. It is not even clear that the same level of performance is achievable in different languages, even under identical conditions. Unfortunately, it is very difficult to achieve “identical conditions” in practice. A new corpus, developed by George Mikros, provides very tight controls not only for author but also for topic, thus enabling a direct comparison of performance levels between the two languages, Greek and English. We compare a number of different methods head-to-head on this corpus, and show that, overall, performance on English is higher than performance on Greek, often highly significantly so.
UR - http://www.scopus.com/inward/record.url?scp=85056757270&partnerID=8YFLogxK
U2 - 10.1002/asi.24073
DO - 10.1002/asi.24073
M3 - Article
AN - SCOPUS:85056757270
SN - 2330-1635
VL - 70
SP - 61
EP - 70
JO - Journal of the Association for Information Science and Technology
JF - Journal of the Association for Information Science and Technology
IS - 1
ER -