Authorship attribution and gender identification in Greek blogs

Research output: Contribution to conference › Paper › peer-review

Abstract

This study addresses authorship attribution and author gender identification in a corpus of blogs written in Modern Greek. The corpus contains 20 bloggers divided equally by gender (10 males and 10 females), with 50 blog posts from each author (1,000 posts in total, with an overall size of 406,460 words). From this corpus we calculated a number of standard stylometric variables (e.g. word length statistics and various vocabulary “richness” indices) and the 300 most frequent word and character n-grams (character and word unigrams, bigrams, and trigrams). Support Vector Machines (SVM) were trained on these data. In a 10-fold cross-validation experiment, gender prediction accuracy reached 82.6%, a result comparable to current state-of-the-art author profiling systems. Authorship attribution accuracy reached 85.4%, an equally satisfying result given the large number of candidate authors (n=20).
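The feature types named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: all function names are hypothetical, the type-token ratio stands in for the unspecified vocabulary “richness” indices, and only character n-grams are counted (word n-grams would follow the same pattern). The SVM training and cross-validation steps are left out because they would need an external machine-learning library.

```python
from collections import Counter
import re
import statistics


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Python's \\w is Unicode-aware by default, so Greek letters are matched.
    """
    return re.findall(r"\w+", text.lower())


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def stylometric_features(text):
    """Word length statistics plus one simple vocabulary richness index.

    Assumption: type-token ratio is used here only as an example index;
    the abstract does not say which indices the study computed.
    """
    tokens = tokenize(text)
    if not tokens:
        return {"mean_word_length": 0.0,
                "stdev_word_length": 0.0,
                "type_token_ratio": 0.0}
    lengths = [len(t) for t in tokens]
    return {
        "mean_word_length": statistics.mean(lengths),
        "stdev_word_length": statistics.pstdev(lengths),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


def top_ngrams(posts, n, k=300):
    """Return the k most frequent character n-grams across all posts."""
    counts = Counter()
    for post in posts:
        counts.update(char_ngrams(post.lower(), n))
    return [gram for gram, _ in counts.most_common(k)]


def ngram_vector(text, vocabulary, n):
    """Relative frequency of each vocabulary n-gram in one text."""
    counts = Counter(char_ngrams(text.lower(), n))
    total = sum(counts.values()) or 1  # avoid division by zero on short texts
    return [counts[gram] / total for gram in vocabulary]


# Toy usage with two invented posts (not taken from the study's corpus).
posts = ["Καλημέρα σε όλους", "Καλημέρα κόσμε"]
vocabulary = top_ngrams(posts, n=2, k=300)
vectors = [ngram_vector(p, vocabulary, n=2) for p in posts]
features = [stylometric_features(p) for p in posts]
```

Each post ends up as a fixed-length numeric vector. Vectors of this kind, concatenated with the stylometric values, are the sort of input an SVM classifier could be trained on.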
Original language: English
Number of pages: 12
Publication status: Published - 2012
Externally published: Yes
Event: QUALICO 2012 - Belgrade, Serbia
Duration: 26 Apr 2012 to 29 Apr 2012

Conference

Conference: QUALICO 2012
Country/Territory: Serbia
City: Belgrade
Period: 26/04/12 to 29/04/12
