Abstract
Building a generic part-of-speech (POS) tagger without a lexicon (dictionary) depends on the language and on the characteristics of its grammar, in both its morphological and syntactic systems. The Arabic language has a valuable and important feature, called diacritics: marks placed above or below the letters of an Arabic word. This paper presents a novel algorithm that assigns the correct POS tag to words belonging to the verb or noun class in an Arabic text. The algorithm is based on the pattern (wazn) of the word rather than on a huge manually tagged lexicon from which large amounts of training data can be extracted. To evaluate the accuracy of the algorithm, an experiment was run on a data set of 5,000 words belonging to the noun and verb classes. The algorithm achieved an accuracy of 91%.
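The idea of tagging by pattern can be illustrated with a minimal sketch. It is not the algorithm from the paper, whose pattern inventory and matching rules are not given in this abstract. Everything here is an assumption made for illustration: the function names `wazn_signature` and `tag`, the three-entry `PATTERN_TAGS` table, and the simplification that every letter other than alef counts as a root consonant. The sketch reduces a fully diacritized word to a signature of consonant slots, short vowels, and long alef, then looks that signature up.

```python
# Hypothetical sketch: names and tables below are illustrative assumptions,
# not taken from the paper.

# Arabic diacritics (combining marks), transliterated to Latin symbols.
FATHA, DAMMA, KASRA, SUKUN = "\u064E", "\u064F", "\u0650", "\u0652"
ALEF = "\u0627"
MARKS = {FATHA: "a", DAMMA: "u", KASRA: "i", SUKUN: "o"}

# Toy pattern table (assumed); a real system would need many more patterns.
PATTERN_TAGS = {
    "CaCaCa": "VERB",  # fa'ala, e.g. kataba "he wrote"
    "CiCaAC": "NOUN",  # fi'aal, e.g. kitaab "book"
    "CaACiC": "NOUN",  # faa'il, e.g. kaatib "writer"
}


def wazn_signature(word):
    """Map a diacritized word to a simplified pattern signature.

    Diacritics become a/u/i/o, alef becomes A, and every other
    character is treated as a consonant slot C (a simplification).
    """
    out = []
    for ch in word:
        if ch in MARKS:
            out.append(MARKS[ch])
        elif ch == ALEF:
            out.append("A")
        else:
            out.append("C")
    return "".join(out)


def tag(word):
    """Return the tag for the word's signature, or UNKNOWN if unlisted."""
    return PATTERN_TAGS.get(wazn_signature(word), "UNKNOWN")


# kaf+fatha, ta+fatha, ba+fatha (kataba)
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
# kaf+kasra, ta+fatha, alef, ba (kitaab)
kitaab = "\u0643\u0650\u062A\u064E\u0627\u0628"

print(wazn_signature(kataba), tag(kataba))
print(wazn_signature(kitaab), tag(kitaab))
```

The sketch only works on diacritized input: without the vowel marks, kataba and many nouns built on the same three consonants would collapse to the same signature, which is why the diacritics matter for a lexicon-free approach.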
Original language | English |
---|---|
Title of host publication | 2008 International Conference on Computer Engineering and Systems, ICCES 2008 |
Publisher | IEEE Explore |
Pages | 119-124 |
Number of pages | 6 |
ISBN (Print) | 9781424421152 |
Publication status | Published - 2008 |
Event | 2008 International Conference on Computer Engineering and Systems, ICCES 2008 - Cairo, Egypt; Duration: 25 Nov 2008 → 27 Nov 2008 |
Conference
Conference | 2008 International Conference on Computer Engineering and Systems, ICCES 2008 |
---|---|
Country/Territory | Egypt |
City | Cairo |
Period | 25/11/08 → 27/11/08 |
Keywords
- Arabic language
- Diacritics
- Morphological
- Part-of-speech (POS)
- Syntactical
- Tag set