Cross-linguistic authorship attribution and gender profiling. Machine translation as a method for bridging the language gap

George Mikros*, Dimitris Boumparis

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

This study explores the feasibility of cross-linguistic authorship attribution and author gender identification using Machine Translation (MT). Computational stylistics experiments were conducted on a Greek blog corpus translated into English using Google's Neural MT. A Random Forest algorithm was employed for authorship attribution and gender profiling, using different feature groups [Author's Multilevel N-gram Profiles, quantitative linguistics (QL), and cross-lingual word embeddings (CLWE)] in both original and translated texts. Results indicate that MT is a viable method for converting a multilingual corpus into one language for authorship attribution and gender profiling research, with considerable accuracy when the training and testing datasets are in the same language. In the pure cross-linguistic scenario, CLWE and QL features yielded accuracies above the baselines.

Original language: English
Pages (from-to): 954-967
Number of pages: 14
Journal: Digital Scholarship in the Humanities
Volume: 39
Issue number: 3
Early online date: Jun 2024
Publication status: Published - 1 Sept 2024

Keywords

  • Authors' Multilevel N-gram Profiles
  • Machine Translation
  • author profiling
  • authorship attribution
  • lexical diversity
  • multilingual word embeddings
