TY - GEN
T1 - Urdu word segmentation
AU - Durrani, Nadir
AU - Hussain, Sarmad
PY - 2010
Y1 - 2010
N2 - Word Segmentation is the foremost obligatory task in almost all the NLP applications where the initial phase requires tokenization of input into words. Urdu is amongst the Asian languages that face word segmentation challenge. However, unlike other Asian languages, word segmentation in Urdu not only has space omission errors but also space insertion errors. This paper discusses how orthographic and linguistic features in Urdu trigger these two problems. It also discusses the work that has been done to tokenize input text. We employ a hybrid solution that performs an n-gram ranking on top of rule based maximum matching heuristic. Our best technique gives an error detection of 85.8% and overall accuracy of 95.8%. Further issues and possible future directions are also discussed.
AB - Word Segmentation is the foremost obligatory task in almost all the NLP applications where the initial phase requires tokenization of input into words. Urdu is amongst the Asian languages that face word segmentation challenge. However, unlike other Asian languages, word segmentation in Urdu not only has space omission errors but also space insertion errors. This paper discusses how orthographic and linguistic features in Urdu trigger these two problems. It also discusses the work that has been done to tokenize input text. We employ a hybrid solution that performs an n-gram ranking on top of rule based maximum matching heuristic. Our best technique gives an error detection of 85.8% and overall accuracy of 95.8%. Further issues and possible future directions are also discussed.
UR - http://www.scopus.com/inward/record.url?scp=79952062201&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:79952062201
SN - 1932432655
SN - 9781932432657
T3 - NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference
SP - 528
EP - 536
BT - NAACL HLT 2010 - Human Language Technologies
T2 - 2010 Human Language Technologies Conference ofthe North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010
Y2 - 2 June 2010 through 4 June 2010
ER -