Farasa: A Fast and Furious Segmenter For Arabic: June 2016
All content following this page was uploaded by Nadir Durrani on 15 June 2016.
ing: Palisky); or to long words with more than four segmentations, such as "wlmfAj}thmA" (ولمفاجئتهما), segmented as "w+l+mfAj}+t+hmA" (و+ل+مفاجئ+ت+هما), meaning "and to surprise both of them". Perhaps adding larger gazetteers of foreign names would help reduce the first kind of errors. For the second type of errors, the classifier generates the correct segmentation, but it often receives a slightly lower score than the incorrect segmentation. Perhaps adding more features can help correct such errors.
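The error analysis above observes that the correct segmentation is sometimes generated but scored slightly below an incorrect candidate. A minimal sketch of this kind of linear (SVM-style) ranking over candidate segmentations follows; the feature names and weights are purely illustrative assumptions, not Farasa's actual model:

```python
# Hypothetical linear ranker over candidate segmentations of one word.
# Feature names and weights are illustrative only, not Farasa's model.

def score(candidate, weights):
    """Dot product of a candidate's feature vector with learned weights."""
    return sum(weights.get(f, 0.0) * v for f, v in candidate["features"].items())

def rank(candidates, weights):
    """Return candidates sorted best-first by linear score."""
    return sorted(candidates, key=lambda c: score(c, weights), reverse=True)

# Illustrative weights a trained ranker might assign.
weights = {"leading_w_is_conj": 1.2, "stem_in_lexicon": 2.0, "num_affixes": -0.3}

candidates = [
    {"seg": "w+l+mfAj}+t+hmA",
     "features": {"leading_w_is_conj": 1, "stem_in_lexicon": 1, "num_affixes": 4}},
    {"seg": "wlmfAj}thmA",
     "features": {"stem_in_lexicon": 0, "num_affixes": 0}},
]

best = rank(candidates, weights)[0]["seg"]  # the fully segmented candidate wins
```

A near-miss of the kind described in the text corresponds to two candidates whose scores differ only marginally, which is why additional features could flip the decision.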
3 Machine Translation

Setup: We trained Statistical Machine Translation (SMT) systems for Arabic↔English to compare Farasa with Stanford and MADAMIRA.³ The com- […] transliterate the OOV words. We used the standard tune and test sets provided by the IWSLT shared task to evaluate the systems. In each experiment, we simply changed the segmentation pipeline to try a different segmentation. We used the ATB scheme for MADAMIRA, which has previously been shown to outperform its alternatives (S2 and D3) (Sajjad et al., 2013).

³ Release-01292014-1.0 was used in the experiments.
⁴ We used mkcls to cluster the data into 50 clusters.

Results: Table 2 compares the Arabic-to-English SMT systems using the three segmentation tools. Farasa performs better than Stanford's Arabic segmenter, giving an improvement of +0.25, but slightly worse than MADAMIRA (-0.10). The differences are not statistically significant. For efficiency, Farasa is faster than Stanford and MADAMIRA by a factor of 5 and 50, respectively.⁵
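The setup above holds everything fixed per experiment except the segmentation step. A sketch of that experiment structure is below; the driver interface and the stand-in segmenter are hypothetical, named only for illustration:

```python
# Hypothetical experiment driver: the only thing that changes between
# runs is the segmentation step (all names here are illustrative).

def run_experiment(corpus, segment):
    """Segment the corpus; the real setup would then train and score SMT."""
    return [segment(sentence) for sentence in corpus]

def farasa_like(sentence):
    # Stand-in segmenter: splits one known clitic, for illustration only.
    return sentence.replace("wktb", "w+ktb")

segmenters = {"farasa": farasa_like, "none": lambda s: s}

outputs = {name: run_experiment(["wktb Alwld"], seg)
           for name, seg in segmenters.items()}
```

Keeping the corpus, tuning, and scoring identical across runs is what makes the BLEU differences attributable to the segmenter alone.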
The run-time of MADAMIRA makes it cumbersome to run on bigger corpora like the multiUN (UN) corpus (Eisele and Chen, 2010), which contains roughly 4M sentences. This factor becomes even more daunting when training a segmented target-side language model for the English-to-Arabic system. Table 3 shows the results of the English-to-Arabic systems. In this case, Stanford performs significantly worse than the others. MADAMIRA performs slightly better than Farasa. However, as before, Farasa is more than an order of magnitude faster.

Seg        iwslt12  iwslt13  Avg   Time
MADAMIRA   19.6     19.1     19.4  1781
Stanford   17.4     17.2     17.3   692
Farasa     19.2     19.3     19.3    66

Table 3: English-to-Arabic Machine Translation, BLEU scores and Time (in seconds)

Stemming   MAP         P@10        Time
Words      0.20        0.34        -
MADAMIRA   0.26 w,s    0.39 w      21:27:21
Stanford   0.22 w      0.37        03:43:25
Farasa     0.28 w,s,m  0.43 w,s,m  00:15:26

Table 4: Retrieval Results in MAP and P@10 and Processing Time (in hh:mm:ss). For statistical significance, w = better than words, s = better than Stanford, and m = better than MADAMIRA.
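The speed claim follows directly from the Time column of Table 3; a quick check of the reported factors:

```python
# Segmentation times (seconds) taken from Table 3.
times = {"MADAMIRA": 1781, "Stanford": 692, "Farasa": 66}

speedup_vs_stanford = times["Stanford"] / times["Farasa"]   # ~10.5x
speedup_vs_madamira = times["MADAMIRA"] / times["Farasa"]   # ~27x
```

Both ratios exceed a factor of ten, consistent with the "more than an order of magnitude" characterization for the English-to-Arabic setting.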
4 Information Retrieval

Setup: We also used extrinsic IR evaluation to determine the quality of stemming compared to MADAMIRA and the Stanford segmenter. We performed experiments on the TREC 2001/2002 cross-language track collection, which contains 383,872 Arabic newswire articles (59.6 million words) and 75 topics with their relevance judgments (Oard and Gey, 2002). This is presently the best available large Arabic information retrieval test collection. We used Mean Average Precision (MAP) and precision at 10 (P@10) as the measures of goodness for this retrieval task. Going down a retrieved ranked list from the top, Average Precision (AP) is the average of the precision values computed at every relevant document found. P@10 is the same as MAP, but the ranked list is restricted to 10 results. We used SOLR (ver. 5.6)⁶ to perform all experimentation. SOLR uses a tf-idf ranking model. We used a paired two-tailed t-test with a p-value less than 0.05 to ascertain statistical significance. For the experimental setups, we performed letter normalization, where we conflated: variants of "alef"; "ta marbouta" and "ha"; "alef maqsoura" and "ya"; and the different forms of "hamza". Unlike MT, Arabic IR performs better with more elaborate segmentation, which improves matching of core units of meaning, namely stems. For MADAMIRA, we used the D3 scheme, where all affixes are segmented. The Stanford tokenizer only provides the ATB tokenization scheme. Farasa was used with its default scheme, where all affixes are segmented.

Results: Table 4 summarizes the retrieval results for using words without stemming and for using MADAMIRA, Stanford, and Farasa for stemming. The table also indicates statistical significance and reports the processing time that each of the segmenters took to process the entire document collection. As can be seen from the results, Farasa significantly outperformed using raw words, MADAMIRA, and Stanford. Farasa was an order of magnitude faster than Stanford and two orders of magnitude faster than MADAMIRA.

5 Analysis

The major advantage of using Farasa is speed, without loss in accuracy. This mainly results from the optimizations described earlier in Section 2, which include caching and limiting the context used for building the feature vector. The Stanford segmenter uses a third-order (i.e., 4-gram) Markov CRF model (Green and DeNero, 2012) to predict the correct segmentation. On the other hand, MADAMIRA bases its segmentation on the output of a morphological analyzer, which provides a list of possible analyses (independent of context) for each word. Both the text and the analyses are passed to a feature modeling component, which applies an SVM and language models to derive predictions for the word segmentation (Pasha et al., 2014). This hierarchy could explain the slowness of MADAMIRA versus the other tokenizers.
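The two optimizations credited for Farasa's speed — caching and a limited feature context — can be sketched roughly as follows; the one-word context window and the feature template are illustrative assumptions, not Farasa's exact design:

```python
from functools import lru_cache

# Sketch of cached, limited-context feature extraction for segmentation.
# The context window and feature template are illustrative assumptions.

@lru_cache(maxsize=100_000)
def features(word, prev_word, next_word):
    """Features depend only on the word and one neighbor on each side,
    so a repeated (word, context) triple is computed once and cached."""
    return (
        ("word", word),
        ("prefix2", word[:2]),
        ("suffix2", word[-2:]),
        ("prev", prev_word),
        ("next", next_word),
    )

def featurize(tokens):
    """Featurize a sentence; the cache makes repeated contexts cheap."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return [features(padded[i], padded[i - 1], padded[i + 1])
            for i in range(1, len(padded) - 1)]

feats = featurize(["w", "ktb", "x", "w", "ktb", "x"])
```

Because real corpora repeat words in repeated local contexts far more often than this toy sentence, the cache hit rate, and hence the saving, grows with corpus size.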
⁵ Time is the average of 10 runs on a machine with 8 Intel i7-3770 cores, 16 GB RAM, and 7 Seagate disks in software RAID 5, running Linux 3.13.0-48.
⁶ http://lucene.apache.org/solr/
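The letter normalization and the ranked-list measures used in the IR experiments of Section 4 can be sketched as follows; the exact conflation table is a common Arabic IR convention assumed here, not taken from the paper's code:

```python
# Conflation table in the spirit of Section 4's letter normalization
# (the exact mapping is an assumption, a common Arabic IR convention).
NORM = str.maketrans({
    "أ": "ا", "إ": "ا", "آ": "ا",  # alef variants -> bare alef
    "ة": "ه",                       # ta marbouta -> ha
    "ى": "ي",                       # alef maqsoura -> ya
    "ؤ": "ء", "ئ": "ء",            # seated hamza forms -> bare hamza
})

def normalize(text):
    """Apply letter normalization before indexing and querying."""
    return text.translate(NORM)

def average_precision(ranked, relevant):
    """AP: mean of the precision values at each relevant document found."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def precision_at_10(ranked, relevant):
    """P@10: fraction of the top 10 results that are relevant."""
    return sum(doc in relevant for doc in ranked[:10]) / 10

ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
ap = average_precision(ranked, relevant)   # (1/2 + 2/4) / 2 = 0.5
p10 = precision_at_10(ranked, relevant)    # 2 / 10 = 0.2
```

MAP is then the mean of these AP values over the 75 topics of the test collection.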
6 Conclusion

In this paper we introduced Farasa, a new Arabic segmenter which uses an SVM for ranking. We compared our segmenter with the state-of-the-art segmenters MADAMIRA and Stanford on standard MT and IR tasks, and demonstrated that Farasa is significantly better (in terms of accuracy) than both on the IR tasks and on par with MADAMIRA on the MT tasks. We found Farasa to be orders of magnitude faster than both. Farasa has been made available for use⁷ and will be added to Moses for Arabic tokenization.

⁷ http://alt.qcri.org/tools/farasa/

References

Mohammed Aljlayl and Ophir Frieder. 2002. On arabic search: improving the retrieval effectiveness via a light stemming approach. In CIKM-2002, pages 340–347.

Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, and Josef van Genabith. 2011. An open-source finite state morphological transducer for modern standard arabic. In Workshop on Finite State Methods and Natural Language Processing, pages 125–133.

Kenneth Beesley, Tim Buckwalter, and Stuart Newton. 1989. Two-level finite-state analysis of arabic morphology. In Proceedings of the Seminar on Bilingual Computing in Arabic and English, pages 6–7.

Kenneth R. Beesley. 1996. Arabic finite-state morphological analysis and generation. In ACL, pages 89–94.

Alexandra Birch, Matthias Huck, Nadir Durrani, Nikolay Bogoychev, and Philipp Koehn. 2014. Edinburgh SLT and MT system description for the IWSLT 2014 evaluation. In Proceedings of the 11th International Workshop on Spoken Language Translation, IWSLT '14, Lake Tahoe, CA, USA.

Tim Buckwalter. 2002. Buckwalter Arabic morphological analyzer version 1.0.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT Evaluation Campaign. IWSLT-14.

Colin Cherry and George Foster. 2012. Batch Tuning Strategies for Statistical Machine Translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, HLT-NAACL'12, pages 427–436, Montréal, Canada.

Kareem Darwish and Douglas W. Oard. 2007. Adapting morphology for arabic information retrieval. In Arabic Computational Morphology, pages 245–262.

Kareem Darwish, Walid Magdy, and Ahmed Mourad. 2012. Language processing for arabic microblog retrieval. In ACM CIKM-2012, pages 2427–2430.

Kareem Darwish, Ahmed Abdelali, and Hamdy Mubarak. 2014. Using stem-templates to improve arabic pos and gender/number tagging. In LREC-2014.

Kareem Darwish. 2002. Building a shallow arabic morphological analyzer in one day. In Computational Approaches to Semitic Languages, ACL-2002, pages 1–8.

Mona Diab. 2009. Second generation amira tools for arabic processing: Fast and robust tokenization, pos tagging, and base phrase chunking. In Intl. Conference on Arabic Language Resources and Tools.

Nadir Durrani, Alexander Fraser, and Helmut Schmid. 2013. Model with minimal translation units, but decode with phrases. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1–11, Atlanta, Georgia, June. Association for Computational Linguistics.

Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014a. Investigating the Usefulness of Generalized Word Representations in SMT. In COLING'14, pages 421–432, Dublin, Ireland.

Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp Koehn. 2014b. Integrating an Unsupervised Transliteration Model into Statistical Machine Translation. In EACL'14, pages 148–153, Gothenburg, Sweden.

Nadir Durrani. 2007. Typology of word and automatic word segmentation in Urdu text corpus. Master's thesis, National University of Computer and Emerging Sciences, Lahore, Pakistan, August.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of NAACL'13.

Andreas Eisele and Yu Chen. 2010. MultiUN: A Multilingual Corpus from United Nation Documents. In LREC-2010, Valleta, Malta, May.

Michel Galley and Christopher D. Manning. 2008. A Simple and Effective Hierarchical Phrase Reordering Model. In EMNLP-2008, pages 848–856, Honolulu, Hawaii, October.

Spence Green and John DeNero. 2012. A class-based agreement model for generating accurately inflected translations. In ACL-2012.

Nizar Habash, Owen Rambow, and Ryan Roth. 2009. Mada+Tokan: A toolkit for arabic tokenization, diacritization, morphological disambiguation, pos tagging, stemming and lemmatization. In MEDAR, pages 102–109.

Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Sixth Workshop on Statistical Machine Translation, EMNLP-2011, pages 187–197, Edinburgh, Scotland, United Kingdom, July.

Liang Huang and David Chiang. 2007. Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, ACL'07, pages 144–151, Prague, Czech Republic.

Thorsten Joachims. 2006. Training linear svms in linear time. In ACM SIGKDD-2006, pages 217–226. ACM.

Shereen Khoja. 2001. Apt: Arabic part-of-speech tagger. In NAACL, pages 20–25.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL-2007, Prague, Czech Republic.

Shankar Kumar and William J. Byrne. 2004. Minimum Bayes-Risk Decoding for Statistical Machine Translation. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, HLT-NAACL'04, pages 169–176, Boston, Massachusetts, USA.

Will Monroe, Spence Green, and Christopher D. Manning. 2014. Word segmentation of informal arabic with domain adaptation. ACL, Short Papers.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL-2002, Philadelphia, PA, USA.

Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan M. Roth. 2014. Madamira: A fast, comprehensive tool for morphological analysis and disambiguation of arabic. In LREC-2014, Reykjavik, Iceland.

Hassan Sajjad, Francisco Guzmán, Preslav Nakov, Ahmed Abdelali, Kenton Murray, Fahad Al Obaidli, and Stephan Vogel. 2013. QCRI at IWSLT 2013: Experiments in Arabic-English and English-Arabic spoken language translation. In IWSLT-13, December.