Pattern-based algorithm for part-of-speech tagging Arabic text

Shihadeh Alqrainy*, Hasan Muaidi AlSerhan, Aladdin Ayesh

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

7 Citations (Scopus)

Abstract

Building a generic Part-of-Speech (POS) tagger system without a lexicon (dictionary) depends on the language and the characteristics of its grammar, both the morphological and the syntactical systems of that language. The Arabic language has a valuable and important feature, called diacritics, which are marks placed above and below the letters of an Arabic word. This paper presents a novel algorithm to assign the correct POS tag to words belonging to the verb or noun class in an Arabic text. The algorithm is based on the pattern (wazn) of the word instead of a huge manually tagged lexicon from which large amounts of training data can be extracted. To evaluate the accuracy of the algorithm, an experiment was run on a data set of 5,000 words belonging to the noun and verb classes. The algorithm achieved an accuracy of 91%.
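The abstract does not spell out the matching rules, so the following is only a minimal sketch of what tagging by pattern (wazn) could look like, not the authors' implementation. It assumes fully diacritized input written in Buckwalter-style transliteration (a, i, u for short vowels, o for sukun, ~ for shadda), uses the letters f, E, l as placeholders for the three root consonants, and relies on a tiny hypothetical pattern table (`PATTERNS`) chosen purely for illustration.

```python
import re

# Hypothetical, illustrative pattern table: (wazn, tag).
# f, E, l stand for the three root consonants; everything else is literal.
PATTERNS = [
    ("faEala", "VERB"),    # perfect verb, e.g. kataba
    ("yafoEulu", "VERB"),  # imperfect verb, e.g. yakotubu
    ("faAEil", "NOUN"),    # active participle, e.g. kaAtib
    ("mafoEuwl", "NOUN"),  # passive participle, e.g. makotuwb
]

# A root slot matches any single character that is not a diacritic
# (short vowel, sukun, or shadda) in this transliteration.
ROOT_SLOT = "[^aiuo~]"


def compile_pattern(wazn):
    """Turn a wazn template into a regex that matches a whole word."""
    parts = [ROOT_SLOT if ch in "fEl" else re.escape(ch) for ch in wazn]
    return re.compile("".join(parts))


COMPILED = [(compile_pattern(wazn), tag) for wazn, tag in PATTERNS]


def tag_word(word):
    """Return the tag of the first matching pattern, or None if none match."""
    for regex, tag in COMPILED:
        if regex.fullmatch(word):
            return tag
    return None
```

Under these assumptions, `tag_word("kataba")` yields a verb tag and `tag_word("kaAtib")` a noun tag without consulting any lexicon; the diacritics are what let the two templates be told apart. A word that matches no template is left untagged (`None`).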

Original language: English
Title of host publication: 2008 International Conference on Computer Engineering and Systems, ICCES 2008
Publisher: IEEE Explore
Pages: 119-124
Number of pages: 6
ISBN (Print): 9781424421152
Publication status: Published - 2008
Externally published: Yes
Event: 2008 International Conference on Computer Engineering and Systems, ICCES 2008 - Cairo, Egypt
Duration: 25 Nov 2008 to 27 Nov 2008

Conference

Conference: 2008 International Conference on Computer Engineering and Systems, ICCES 2008
Country/Territory: Egypt
City: Cairo
Period: 25/11/08 to 27/11/08

Keywords

  • Arabic language
  • Diacritics
  • Morphological
  • Part-of-speech (POS)
  • Syntactical
  • Tag set
