Malaysian Journal of Computer Science (ISSN 0127-9084)
Indexing Page

Article Information
Title: Exhaustive Affix Stripping And A Malay Word Register To Solve Stemming Errors And Ambiguity Problem In Malay Stemmers
Author(s): Salhana Amad Darwis, Rukaini Abdullah, Norisma Idris
Journal: Malaysian Journal of Computer Science (ISSN 0127-9084)
Volume: 25, No. 4
Keywords: Malay language, stemming, Malay Language stemmers, Malay word register, ambiguity problem, under stemming, over stemming
Abstract: Stemmers, or word stemming algorithms, reduce a derivative word to its root word by removing all of its affixes. The complexity of Malay Language (ML) morphological rules and of the Malay lexicon makes stemming Malay words difficult. There is no fixed method for determining which affix should be removed from a derivative word to produce the correct root word. Furthermore, a derivative word could contain one or more valid root words. Previous Malay Language Stemmers (MLS) still produce stemming errors. Regardless of the approach used, they rely on the first affix matched or the first root word found. Hence, some words were under-stemmed or over-stemmed, while words with many valid root words were not stemmed to the correct root word. This multiple-root-word problem, or ambiguity problem, has never been addressed by previous MLS. To solve the over-stemming and under-stemming errors, we propose an approach that exhaustively strips all matched affixes to ensure that a valid root word is extracted. In addition, we propose the use of a Malay Word Register to address the ambiguity problem of determining the correct root word. We tested the proposed approach with words from newspaper articles, a Malay translation of the Quran, history essays, and words incorrectly stemmed by the previous stemmers. The results show that the stemmer achieves 99.8% accuracy. There were no stemming errors; the remaining inaccuracy is due to the approach taken to the ambiguity problem.
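
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The affix lists, the root dictionary, the register entry, and the function names candidate_roots and stem are all hypothetical and far smaller than anything a real stemmer would need. The sketch also assumes that the Malay Word Register can be modelled as a lookup from an ambiguous derivative word to its intended root, which the abstract does not specify.

```python
# Hypothetical, deliberately tiny resources (assumptions, not from the paper).
PREFIXES = ("ber", "be", "di", "me", "ter")
SUFFIXES = ("kan", "an", "i")
ROOTS = {"beri", "ikan", "ajar", "makan"}
# Assumed register model: ambiguous derivative word -> intended root.
WORD_REGISTER = {"berikan": "beri"}


def candidate_roots(word):
    """Try every matching affix instead of stopping at the first match,
    and collect every dictionary root reachable by stripping."""
    seen, roots, stack = set(), set(), [word]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in ROOTS:
            roots.add(current)
        for prefix in PREFIXES:
            if current.startswith(prefix) and len(current) > len(prefix):
                stack.append(current[len(prefix):])
        for suffix in SUFFIXES:
            if current.endswith(suffix) and len(current) > len(suffix):
                stack.append(current[:-len(suffix)])
    return roots


def stem(word):
    """Return a root for word, consulting the register when several fit."""
    roots = candidate_roots(word)
    if not roots:
        return word  # no known root found: leave the word unchanged
    if len(roots) == 1:
        return next(iter(roots))
    # Ambiguous: prefer the register; the longest-root fallback is an
    # arbitrary choice made for this sketch only.
    return WORD_REGISTER.get(word, min(roots, key=lambda r: (-len(r), r)))


print(sorted(candidate_roots("berikan")))
print(stem("berikan"), stem("diajar"), stem("makanan"))
```

Here "berikan" can be read as "ber" + "ikan" or as "beri" + "kan", so exhaustive stripping yields two candidate roots and the register entry decides between them. A stemmer that stops at the first matched affix would return whichever reading its rule order happened to reach first.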
