TY - GEN
T1 - Bidirectional LSTMs - CRFs networks for Bangla POS tagging
AU - Alam, Firoj
AU - Chowdhury, Shammur Absar
AU - Noori, Sheak Rashed Haider
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2017/2/21
Y1 - 2017/2/21
N2 - Part-of-speech (POS) information is one of the fundamental components of the natural language processing pipeline; it helps in extracting higher-level information such as named entities, discourse, and the syntactic structure of a sentence. For some languages, such as English, Dutch, and Chinese, POS tagging is considered a solved problem because tagging systems reach high accuracy (97%). Significant effort has gone into making data for such languages publicly accessible and into organizing evaluation campaigns. By comparison, there have been far fewer efforts for Bangla (ethnonym: Bangla; exonym: Bengali). In this paper, we present a knowledge-poor approach to POS tagging, which we evaluated using a publicly accessible dataset from LDC. Our motivation was to avoid relying on existing resources, such as a lexicon or a named entity recognizer, because they are not publicly available and are difficult to develop. We did not use any handcrafted features; instead, we employed distributed representations of words and characters. We designed the system using Long Short-Term Memory (LSTM) neural networks followed by Conditional Random Fields (CRFs), and included a pre-trained word embedding model. We obtained promising results, with an accuracy of 86.0%.
KW - Bangla
KW - Deep learning
KW - POS tagging
UR - http://www.scopus.com/inward/record.url?scp=85016184561&partnerID=8YFLogxK
U2 - 10.1109/ICCITECHN.2016.7860227
DO - 10.1109/ICCITECHN.2016.7860227
M3 - Conference contribution
AN - SCOPUS:85016184561
T3 - 19th International Conference on Computer and Information Technology, ICCIT 2016
SP - 377
EP - 382
BT - 19th International Conference on Computer and Information Technology, ICCIT 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th International Conference on Computer and Information Technology, ICCIT 2016
Y2 - 18 December 2016 through 20 December 2016
ER -