TY - GEN
T1 - Detecting automatically-generated Arabic tweets
AU - Almerekhi, Hind
AU - Elsayed, Tamer
N1 - Publisher Copyright: © Springer International Publishing Switzerland 2015.
PY - 2015
Y1 - 2015
AB - Recently, Twitter, one of the most widely known social media platforms, has been infiltrated by several automation programs, commonly known as “bots”. Bots can easily be abused to spread spam and to hinder information extraction applications by posting large numbers of automatically generated tweets that occupy a good portion of the continuous stream of tweets. This problem heavily affects users in the Arab region because of recent political events, as automated tweets can disrupt communication and waste the time needed to filter them out. To mitigate this problem, this work addresses the classification of Arabic tweets as automated or manual. We propose four categories of features: formality, structural, tweet-specific, and temporal. Our experimental evaluation over about 3.5k randomly sampled Arabic tweets shows that classification based on individual categories of features outperforms the baseline unigram-based classifier in terms of classification accuracy. Additionally, combining tweet-specific and unigram features improves classification accuracy to 92%, a significant improvement over the baseline classifier and a very strong reference baseline for future studies.
KW - Arabic microblogs
KW - Automated tweets
KW - Bots
KW - Crowdsourcing
KW - Tweet classification
UR - http://www.scopus.com/inward/record.url?scp=84958043745&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-28940-3_10
DO - 10.1007/978-3-319-28940-3_10
M3 - Conference contribution
AN - SCOPUS:84958043745
SN - 9783319289397
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 123
EP - 134
BT - Information Retrieval Technology - 11th Asia Information Retrieval Societies Conference, AIRS 2015, Proceedings
A2 - Scholer, Falk
A2 - Zuccon, Guido
A2 - Geva, Shlomo
A2 - Sun, Aixin
A2 - Joho, Hideo
A2 - Zhang, Peng
PB - Springer Verlag
T2 - 11th Asia Information Retrieval Societies Conference, AIRS 2015
Y2 - 2 December 2015 through 4 December 2015
ER -