Pattern-based algorithm for part-of-speech tagging Arabic text

Shihadeh Alqrainy*, Hasan Muaidi AlSerhan, Aladdin Ayesh

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding > Published conference contribution

    5 Citations (Scopus)

    Abstract

    Building a generic Part-of-Speech (POS) tagger without a lexicon (dictionary) depends on the language and the characteristics of its grammar, in both its morphological and its syntactical systems. The Arabic language has a valuable and important feature, called diacritics: marks placed above and below the letters of an Arabic word. This paper presents a novel algorithm that assigns the correct POS tag to words belonging to the verb or noun class in an Arabic text. The algorithm is based on the pattern (wazn) of the word rather than on a huge manually tagged lexicon from which large amounts of training data can be extracted. To evaluate the accuracy of the algorithm, an experiment was run on a data set of 5,000 words belonging to the noun and verb classes. The algorithm achieved an accuracy of 91%.
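The abstract does not list the patterns or describe the matching procedure, so the following is only a minimal sketch of the general idea of tagging by pattern (wazn), not the authors' algorithm. It assumes that words are fully diacritized and written in Buckwalter transliteration (to keep the example ASCII-only), that `C` is a hypothetical placeholder for any root letter, and that the four patterns shown (well-known Arabic templates such as fa'ala and maf'uul) stand in for a much larger table. The names `PATTERNS`, `matches_pattern`, and `tag_word` are illustrative.

```python
from typing import Optional

# Buckwalter symbols for diacritics: short vowels, sukun, shadda, tanween.
DIACRITICS = set("aiuo~FNK")

# Illustrative pattern table (an assumption, not taken from the paper).
# "C" stands for any root letter; every other symbol must match literally.
PATTERNS = [
    ("CaCaCa", "VERB"),    # fa'ala, e.g. kataba "he wrote"
    ("yaCoCuCu", "VERB"),  # yaf'ulu, e.g. yakotubu "he writes"
    ("CaACiC", "NOUN"),    # faa'il, e.g. kaAtib "writer"
    ("maCoCuwC", "NOUN"),  # maf'uul, e.g. makotuwb "written"
]


def matches_pattern(word: str, pattern: str) -> bool:
    """Return True if the diacritized word fits the pattern symbol by symbol."""
    if len(word) != len(pattern):
        return False
    for symbol, slot in zip(word, pattern):
        if slot == "C":
            # A root slot accepts any letter, but never a diacritic.
            if symbol in DIACRITICS:
                return False
        elif symbol != slot:
            return False
    return True


def tag_word(word: str) -> Optional[str]:
    """Return the tag of the first matching pattern, or None if none matches."""
    for pattern, tag in PATTERNS:
        if matches_pattern(word, pattern):
            return tag
    return None


if __name__ == "__main__":
    for w in ["kataba", "kaAtib", "makotuwb", "yakotubu", "fiy"]:
        print(w, tag_word(w))
```

In this sketch the diacritics do the disambiguating work: `kataba` and `kaAtib` share the same root letters, and only the vowel marks and the inserted alif separate the verb pattern from the noun pattern. Words that fit no pattern, such as the particle `fiy`, are left untagged.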

    Original language: English
    Title of host publication: 2008 International Conference on Computer Engineering and Systems, ICCES 2008
    Publisher: IEEE Explore
    Pages: 119-124
    Number of pages: 6
    ISBN (Print): 9781424421152
    DOIs
    Publication status: Published - 2008
    Event: 2008 International Conference on Computer Engineering and Systems, ICCES 2008 - Cairo, Egypt
    Duration: 25 Nov 2008 - 27 Nov 2008

    Conference

    Conference: 2008 International Conference on Computer Engineering and Systems, ICCES 2008
    Country/Territory: Egypt
    City: Cairo
    Period: 25/11/08 - 27/11/08

    Keywords

    • Arabic language
    • Diacritics
    • Morphological
    • Part-of-speech (POS)
    • Syntactical
    • Tag set
