Abstract
The aim of this study is to perform authorship attribution and author gender identification in a corpus of blogs written in Modern Greek. More specifically, the corpus contains 20 bloggers equally divided by gender (10 males and 10 females), with 50 blog posts from each author (1,000 posts in total and an overall size of 406,460 words). In this corpus we calculated a number of standard stylometric variables (e.g. word length statistics and various vocabulary “richness” indices) and the 300 most frequent word and character n-grams (character and word unigrams, bigrams, trigrams). Support Vector Machines (SVM) were trained on these data, and gender prediction accuracy in a 10-fold cross-validation experiment reached 82.6%, a result that is comparable to current state-of-the-art author profiling systems. Authorship attribution accuracy reached 85.4%, an equally satisfying result given the large number of candidate authors (n=20).
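
The feature extraction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify tokenisation, normalisation, or which richness indices were used, so whitespace tokenisation, relative frequencies, and the type-token ratio are assumptions made here for illustration, and all function names and the toy posts are hypothetical. The resulting vectors are the kind of input one might then pass to an SVM classifier.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams (assumed whitespace tokenisation)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def most_frequent(docs, extractor, n, k=300):
    """Pick the k most frequent n-grams across the whole corpus."""
    counts = Counter()
    for doc in docs:
        counts.update(extractor(doc, n))
    return [gram for gram, _ in counts.most_common(k)]


def ngram_features(doc, vocab, extractor, n):
    """Relative frequency of each vocabulary n-gram within one document."""
    counts = Counter(extractor(doc, n))
    total = sum(counts.values()) or 1  # avoid division by zero on short texts
    return [counts[gram] / total for gram in vocab]


def stylometric_features(doc):
    """Mean word length and type-token ratio (one simple richness index)."""
    tokens = doc.lower().split()
    if not tokens:
        return [0.0, 0.0]
    mean_length = sum(len(t) for t in tokens) / len(tokens)
    type_token_ratio = len(set(tokens)) / len(tokens)
    return [mean_length, type_token_ratio]


# Toy posts standing in for blog posts (hypothetical data, not from the corpus).
posts = ["καλημέρα σε όλους", "καλημέρα και καλή εβδομάδα"]
vocab = most_frequent(posts, char_ngrams, n=2, k=300)
vector = ngram_features(posts[0], vocab, char_ngrams, 2) + stylometric_features(posts[0])
```

Selecting the vocabulary from the whole corpus keeps every document's vector the same length, which is what a fixed-dimension classifier needs. In a cross-validation setting the vocabulary would normally be chosen from the training folds only.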
Original language | English |
---|---|
Number of pages | 12 |
Publication status | Published - 2012 |
Externally published | Yes |
Event | QUALICO 2012 - Belgrade, Serbia. Duration: 26 Apr 2012 → 29 Apr 2012 |
Conference
Conference | QUALICO 2012 |
---|---|
Country/Territory | Serbia |
City | Belgrade |
Period | 26/04/12 → 29/04/12 |