Pattern-based algorithm for part-of-speech tagging Arabic text

Shihadeh Alqrainy; Hasan Muaidi AlSerhan; Aladdin Ayesh

doi:10.1109/ICCES.2008.4772979

Shihadeh Alqrainy^*, Hasan Muaidi AlSerhan, Aladdin Ayesh

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

7 Citations (Scopus)

Abstract

Building a generic Part-of-Speech (POS) tagger without a lexicon (dictionary) depends on the language and the characteristics of its grammar, in both its morphological and syntactic systems. The Arabic language has a valuable feature, diacritics: marks placed above and below the letters of an Arabic word. This paper presents a novel algorithm that assigns the correct POS tag to words belonging to the verb or noun class in an Arabic text. The algorithm relies on the pattern (wazn) of the word rather than on a huge manually tagged lexicon from which large amounts of training data can be extracted. To evaluate its accuracy, an experiment was run on a data set of 5,000 words belonging to the noun and verb classes. The algorithm achieved an accuracy of 91%.

Original language	English
Title of host publication	2008 International Conference on Computer Engineering and Systems, ICCES 2008
Publisher	IEEE Explore
Pages	119-124
Number of pages	6
ISBN (Print)	9781424421152
DOIs	https://doi.org/10.1109/ICCES.2008.4772979
Publication status	Published - 2008
Externally published	Yes
Event	2008 International Conference on Computer Engineering and Systems, ICCES 2008 - Cairo, Egypt. Duration: 25 Nov 2008 → 27 Nov 2008

Conference

Conference	2008 International Conference on Computer Engineering and Systems, ICCES 2008
Country/Territory	Egypt
City	Cairo
Period	25/11/08 → 27/11/08

Keywords

Arabic language
Diacritics
Morphological
Part-Of-speech(POS)
Syntactical
Tag set

Access to Document

10.1109/ICCES.2008.4772979

Cite this

Alqrainy, S, AlSerhan, HM & Ayesh, A 2008, Pattern-based algorithm for part-of-speech tagging Arabic text. in 2008 International Conference on Computer Engineering and Systems, ICCES 2008, 4772979, IEEE Explore, pp. 119-124, 2008 International Conference on Computer Engineering and Systems, ICCES 2008, Cairo, Egypt, 25/11/08. https://doi.org/10.1109/ICCES.2008.4772979

@inproceedings{8c65c2a642fa4109bde5b24e05402a75,

title = "Pattern-based algorithm for part-of-speech tagging Arabic text",

abstract = "Building a generic Part-of-Speech (POS) tagger without a lexicon (dictionary) depends on the language and the characteristics of its grammar, in both its morphological and syntactic systems. The Arabic language has a valuable feature, diacritics: marks placed above and below the letters of an Arabic word. This paper presents a novel algorithm that assigns the correct POS tag to words belonging to the verb or noun class in an Arabic text. The algorithm relies on the pattern (wazn) of the word rather than on a huge manually tagged lexicon from which large amounts of training data can be extracted. To evaluate its accuracy, an experiment was run on a data set of 5,000 words belonging to the noun and verb classes. The algorithm achieved an accuracy of 91%.",

keywords = "Arabic language, Diacritics, Morphological, Part-Of-speech(POS), Syntactical, Tag set",

author = "Shihadeh Alqrainy and AlSerhan, {Hasan Muaidi} and Aladdin Ayesh",

year = "2008",

doi = "10.1109/ICCES.2008.4772979",

language = "English",

isbn = "9781424421152",

pages = "119--124",

booktitle = "2008 International Conference on Computer Engineering and Systems, ICCES 2008",

publisher = "IEEE Explore",

note = "2008 International Conference on Computer Engineering and Systems, ICCES 2008 ; Conference date: 25-11-2008 Through 27-11-2008",

}

TY - GEN

T1 - Pattern-based algorithm for part-of-speech tagging Arabic text

AU - Alqrainy, Shihadeh

AU - AlSerhan, Hasan Muaidi

AU - Ayesh, Aladdin

PY - 2008

Y1 - 2008

N2 - Building a generic Part-of-Speech (POS) tagger without a lexicon (dictionary) depends on the language and the characteristics of its grammar, in both its morphological and syntactic systems. The Arabic language has a valuable feature, diacritics: marks placed above and below the letters of an Arabic word. This paper presents a novel algorithm that assigns the correct POS tag to words belonging to the verb or noun class in an Arabic text. The algorithm relies on the pattern (wazn) of the word rather than on a huge manually tagged lexicon from which large amounts of training data can be extracted. To evaluate its accuracy, an experiment was run on a data set of 5,000 words belonging to the noun and verb classes. The algorithm achieved an accuracy of 91%.

AB - Building a generic Part-of-Speech (POS) tagger without a lexicon (dictionary) depends on the language and the characteristics of its grammar, in both its morphological and syntactic systems. The Arabic language has a valuable feature, diacritics: marks placed above and below the letters of an Arabic word. This paper presents a novel algorithm that assigns the correct POS tag to words belonging to the verb or noun class in an Arabic text. The algorithm relies on the pattern (wazn) of the word rather than on a huge manually tagged lexicon from which large amounts of training data can be extracted. To evaluate its accuracy, an experiment was run on a data set of 5,000 words belonging to the noun and verb classes. The algorithm achieved an accuracy of 91%.

KW - Arabic language

KW - Diacritics

KW - Morphological

KW - Part-Of-speech(POS)

KW - Syntactical

KW - Tag set

UR - http://www.scopus.com/inward/record.url?scp=67649544246&partnerID=8YFLogxK

U2 - 10.1109/ICCES.2008.4772979

DO - 10.1109/ICCES.2008.4772979

M3 - Published conference contribution

AN - SCOPUS:67649544246

SN - 9781424421152

SP - 119

EP - 124

BT - 2008 International Conference on Computer Engineering and Systems, ICCES 2008

PB - IEEE Explore

T2 - 2008 International Conference on Computer Engineering and Systems, ICCES 2008

Y2 - 25 November 2008 through 27 November 2008

ER -
