Top 10 cited articles in NLP

Top 10 Cited Articles in Natural Language Computing
International Journal on Natural Language Computing (IJNLC)
http://airccse.org/journal/ijnlc/index.html
ISSN: 2278-1307 [Online]; 2319-4111 [Print]
Google Scholar: https://scholar.google.com/citations?user=A5tqIdoAAAAJ&hl=en

AN IMPROVED APRIORI ALGORITHM FOR ASSOCIATION RULES
Mohammed Al-Maolegi, Bassam Arkok
Computer Science, Jordan University of Science and Technology, Irbid, Jordan
ABSTRACT
Several algorithms exist for mining association rules. One of the most popular is Apriori,
which extracts frequent itemsets from a large database and derives association rules from them
for knowledge discovery. This paper points out a limitation of the original Apriori algorithm:
it wastes time scanning the whole database when searching for frequent itemsets. The paper
presents an improvement that reduces this wasted time by scanning only some of the
transactions. Experiments with several groups of transactions and several values of minimum
support, run on both the original Apriori and our implementation of the improved Apriori, show
that the improved algorithm reduces the time consumed by 67.38% compared with the original,
making the Apriori algorithm more efficient.
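The idea of counting a candidate itemset against only some transactions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `frequent_itemsets` and the specific shortcut (keeping a transaction-id list per item and checking each candidate only in the transactions that contain its least frequent item) are assumptions made for illustration. `min_support` is taken to be an absolute transaction count.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Single full scan: remember which transaction ids contain each item.
    tids = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tids.setdefault(item, set()).add(tid)

    # Frequent 1-itemsets come straight from the id lists.
    current = {frozenset([i]): len(s) for i, s in tids.items() if len(s) >= min_support}
    result = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for cand in candidates:
            # Reduced scan (illustrative assumption): look only at the
            # transactions that contain the candidate's rarest item.
            rarest = min(cand, key=lambda i: len(tids[i]))
            count = sum(1 for tid in tids[rarest] if cand <= transactions[tid])
            if count >= min_support:
                current[cand] = count
        result.update(current)
        k += 1
    return result
```

Calling `frequent_itemsets(list_of_item_sets, 3)` returns a dictionary keyed by `frozenset`. The candidate counting loop touches only the id list of one item per candidate instead of the whole database, which is where the time saving in this sketch comes from.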
KEYWORDS
Apriori, Improved Apriori, Frequent itemset, Support, Candidate itemset, Time consuming.
FULL TEXT: http://airccse.org/journal/ijnlc/papers/3114ijnlc03.pdf
VOLUME URL: http://airccse.org/journal/ijnlc/vol3.html

AUTHORS
Mohammed Al-Maolegi obtained his Master's degree in computer science from Jordan
University of Science and Technology (Jordan) in 2014. He received his B.Sc. in
computer information systems from Mutah University (Jordan) in 2010. His research
interests include software engineering, software metrics, data mining and
wireless sensor networks.

Bassam Arkok obtained his Master's degree in computer science from Jordan University of
Science and Technology (Jordan) in 2014. He received his B.Sc. in computer science
from Alhodidah University (Yemen). His research interests include software engineering,
software metrics, data mining and wireless sensor networks.

NAMED ENTITY RECOGNITION USING HIDDEN MARKOV MODEL (HMM)
Sudha Morwal1, Nusrat Jahan2 and Deepti Chopra3
1 Associate Professor, Banasthali University, Jaipur, Rajasthan-302001
2 M.Tech (CS), Banasthali University, Jaipur, Rajasthan-302001
3 M.Tech (CS), Banasthali University, Jaipur, Rajasthan-302001
ABSTRACT:
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP),
which is a branch of artificial intelligence. It has many applications, mainly in machine
translation, text-to-speech synthesis, natural language understanding, information extraction,
information retrieval and question answering. The aim of NER is to classify words into
predefined categories such as location name, person name, organization name, date and time.
In this paper we describe in detail a Hidden Markov Model (HMM) based machine learning
approach to identifying named entities. The main idea behind using an HMM to build the
NER system is that it is language independent, so the system can be applied to any
language domain. In our NER system the states are not fixed: they are dynamic, and users
can define them according to their interest. The corpus used by our NER system is also
not domain specific.
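An HMM tagger of the kind described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system from the paper: the function names `train_hmm` and `viterbi`, the count-based training, and the add-one smoothing are all choices made here. The tag set is read from the training data rather than hard-coded, which mirrors the statement that the states are not fixed.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Count start, transition and emission events from [(word, tag), ...] sentences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)  # the states are whatever tags occur in the corpus
    return start, trans, emit, tags, len(vocab)


def _logp(count, total, outcomes):
    # Add-one smoothing (an assumption of this sketch) avoids log(0).
    return math.log((count + 1) / (total + outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for a list of words."""
    if not words:
        return []
    start, trans, emit, tags, vocab_size = model
    n_tags = len(tags)
    total_start = sum(start.values())

    def e(tag, word):  # emission log-probability, one extra slot for unseen words
        return _logp(emit[tag][word], sum(emit[tag].values()), vocab_size + 1)

    def t(a, b):  # transition log-probability
        return _logp(trans[a][b], sum(trans[a].values()), n_tags)

    # Each column maps tag -> (best log-score so far, back-pointer to previous tag).
    best = [{tag: (_logp(start[tag], total_start, n_tags) + e(tag, words[0]), None)
             for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] + t(p, tag))
            column[tag] = (best[-1][prev][0] + t(prev, tag) + e(tag, word), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda x: best[-1][x][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]
```

Because nothing in the code refers to a particular language or tag inventory, retraining on a differently annotated corpus changes the states without any code change.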
KEYWORDS
Named Entity Recognition (NER), Natural Language Processing (NLP), Hidden Markov
Model (HMM).

AUTHORS
Sudha Morwal is an active researcher in the field of Natural Language
Processing. She currently works as Associate Professor in the Department of
Computer Science at Banasthali University (Rajasthan), India. She holds an
M.Tech (Computer Science), NET and M.Sc (Computer Science), and her PhD is in
progress at Banasthali University (Rajasthan), India.

Nusrat Jahan received her B.Tech degree in Computer Science and Engineering from R.N.
Modi Engineering College, Kota, Rajasthan in 2010. She is currently pursuing her
M.Tech degree in Computer Science and Engineering at Banasthali University,
Rajasthan. Her subjects of interest include Natural Language Processing and
Information Retrieval.

Deepti Chopra received her B.Tech degree in Computer Science and Engineering from
Rajasthan College of Engineering for Women, Jaipur, Rajasthan in 2011. She is currently
pursuing her M.Tech degree in Computer Science and Engineering at Banasthali
University, Rajasthan. Her subject of research includes Natural Language Processing.

SENTIMENT ANALYSIS FOR MODERN STANDARD ARABIC AND COLLOQUIAL
Hossam S. Ibrahim1, Sherif M. Abdou2 and Mervat Gheith
1 Computer Science Department, Institute of Statistical Studies and Research (ISSR), Cairo
University, EGYPT
2 Information Technology Department, Faculty of Computers and Information, Cairo
University, EGYPT
ABSTRACT
The rise of social media such as blogs and social networks has fueled interest in sentiment
analysis. With the proliferation of reviews, ratings, recommendations and other forms of online
expression, online opinion has turned into a kind of virtual currency for businesses looking to
market their products, identify new opportunities and manage their reputations; many are
therefore now looking to the field of sentiment analysis. In this paper, we present a
feature-based, sentence-level approach for Arabic sentiment analysis. Our approach uses a
lexicon of Arabic idioms and sayings as a key resource for improving the detection of sentiment
polarity in Arabic sentences, together with a novel and rich set of linguistically motivated
features (contextual intensifiers, contextual shifters and negation handling) and syntactic
features for conflicting phrases, which enhance the sentiment classification accuracy.
Furthermore, we introduce an automatically expandable, wide-coverage polarity lexicon of Arabic
sentiment words. The lexicon is built from a seed of gold-standard sentiment words that were
manually collected and annotated; it expands automatically and detects the sentiment
orientation of new sentiment words using a synset aggregation technique and free online Arabic
lexicons and thesauruses. Our data focus on modern standard Arabic (MSA) and Egyptian dialectal
Arabic tweets and microblogs (hotel reservations, product reviews, etc.). The experimental
results using our resources and techniques with an SVM classifier indicate high performance
levels, with accuracies of over 95%.
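How lexicon polarity, contextual intensifiers and negation handling might be turned into numeric features for a classifier is sketched below in Python. Everything in it is an assumption made for illustration: the function name `sentence_features`, the two-token context window, the multiplicative intensifier weights, and the four-value feature layout are not taken from the paper, and the word lists are supplied by the caller.

```python
def sentence_features(tokens, polarity, negators, intensifiers, window=2):
    """Return [positive score, negative score, negation hits, intensifier hits].

    polarity:     dict mapping a sentiment word to +1 or -1
    negators:     set of negation words
    intensifiers: dict mapping an intensifier to a multiplier such as 2.0
    window:       how many preceding tokens count as context (hypothetical choice)
    """
    pos = neg = 0.0
    negation_hits = intensifier_hits = 0
    for i, token in enumerate(tokens):
        if token not in polarity:
            continue
        score = float(polarity[token])
        context = tokens[max(0, i - window):i]
        # Contextual intensifiers scale the strength of the sentiment word.
        for word in context:
            if word in intensifiers:
                score *= intensifiers[word]
                intensifier_hits += 1
        # A negator in the context flips the polarity.
        if any(word in negators for word in context):
            score = -score
            negation_hits += 1
        if score > 0:
            pos += score
        else:
            neg += -score
    return [pos, neg, negation_hits, intensifier_hits]
```

The resulting vectors could be passed, one per sentence, to any standard classifier; the abstract reports using an SVM, which this sketch does not include.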
KEYWORDS
Sentiment Analysis, opinion mining, social network, sentiment lexicon, modern standard Arabic,
colloquial, natural language processing

REFERENCES
[1] A. Shoukry and A. Rafea, "Sentence-level Arabic sentiment analysis," in Collaboration
Technologies and Systems (CTS) International Conference, Denver, CO, USA, 2012, pp. 546-
550.
[2] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using machine
learning techniques," in Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP), 2002, pp. 79–86.
[3] D. Davidiv, O. Tsur, and A. Rappoport, "Enhanced Sentiment Learning Using Twitter Hash-
tags and Smileys," in Proceedings of the 23rd International Conference on Computational
Linguistics (Coling2010), Beijing, China, 2010, pp. 241–249.
[4] L. Barbosa and J. Feng, "Robust Sentiment Detection on Twitter from Biased and Noisy Data
" in Proceedings of the 23rd International Conference on Computational Linguistics (Coling),
2010.
[5] P. Turney, "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised
Classification of Reviews," in Proceedings of the 40th Annual Meeting on Association for
Computational Linguistics ACL '02, Stroudsburg, PA, USA, 2002, pp. 417-424.
[6] V. Hatzivassiloglou and K. R. McKeown, "Predicting the semantic orientation of adjectives,"
in Proceedings of the Joint ACL / EACL Conference, 1997, pp. 174–181.
[7] B. Pang and L. Lee, "Opinion mining and sentiment analysis," Foundations and Trends in
Information Retrieval vol. 2, pp. 1–135, 2008.
[8] M. Hu and B. Liu, "Mining and summarizing customer reviews " in Proceedings of the ACM
SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2004, pp. 168–177.
[9] B. Liu, "Sentiment Analysis and Subjectivity," in Handbook of Natural Language Processing,
Second ed: CRC Press, Taylor and Francis Group, 2010.
[10] P. Alexander and P. Patrick, "Twitter as a Corpus for Sentiment Analysis and Opinion
Mining " in Proceedings of the Seventh conference on International Language Resources and
Evaluation (LREC'10), European Language Resources Association ELRA, Valletta, Malta, 2010.
[11] C. Scheible and H. Schütze, "Bootstrapping Sentiment Labels For Unannotated Documents
With Polarity PageRank," in Proceedings of the Eight International Conference on Language
Resources and Evaluation (LREC 2012), Istambol-Turki, 2012.
[12] C. Manning and D. Klein, "Optimization, maxent models, and conditional estimation
without magic," in Proceedings of the 2003 Conference of the North American Chapter of the
Association for Computational Linguistics on Human Language Technology, 2003, p. 8.

[13] A. Abbasi, H. Chen, and A. Salem, "Sentiment Analysis in Multiple Languages: Feature
Selection for Opinion Classification in Web Forums," ACM Transactions on Information
Systems, vol. 26, 2008.
[14] E. Riloff and J. Wiebe, "Learning extraction patterns for subjective expressions," in
Proceedings of the Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2003.
[15] E. Riloff, J. Wiebe, and T. Wilson, "Learning subjective nouns using extraction pattern
bootstrapping," in Proceedings of the Conference on Natural Language Learning (CoNLL),
2003, pp. 25–32.
[16] M. Abdul-Mageed and M. Diab, "Subjectivity and Sentiment Annotation of Modern
Standard Arabic Newswire," in Proceedings of the Fifth Law Workshop (LAW V), Association
for Computational Linguistics, Portland, Oregon, 2011, pp. 110–118.
[17] M. Abdul-Mageed, M. Diab, and M. Korayem, "Subjectivity and sentiment analysis of
modern standard Arabic," in Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics, 2011.
[18] M. Abdul-Mageed, K. Sandra, and M. Diab, "SAMAR: A System for Subjectivity and
Sentiment Analysis of Arabic Social Media," in Proceedings of the 3rd Workshop on
Computational Approaches to Subjectivity and Sentiment Analysis, Jeju,Republic of Korea,
2012, pp. 19–28.
[19] A. Mourad and K. Darwish, "Subjectivity and Sentiment Analysis of Modern Standard
Arabic and Arabic Microblogs," in Proceedings of the 4th Workshop on Computational
Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA), Atlanta, Georgia,
2013, pp. 55–64.
[20] M. Korayem, D. Crandall, and M. Abdul-Mageed, "Subjectivity and Sentiment Analysis of
Arabic: A Survey," in Advanced Machine Learning Technologies and Applications,
Communications in Computer and Information Science series 322, (Springer), AMLTA, 2012,
pp. 128-139.
[21]M. Abdul-Mageed and M. Diab, "AWATIF: A multi-genre corpus for Arabic subjectivity
and sentiment analysis," in Proceedings of the 8th International Conference on Language
Resources and Evaluation (LREC), Istanbul, Turkey, 2012a.
[22] M. Rushdi-Saleh, M. Mart´ın-Valdivia, L. Ure˜na-L´opez, and J. Perea-Ortega, "Oca:
Opinion corpus for Arabic," Journal of the American Society for Information Science and
Technology, vol. 62, pp. 2045–2054, 2011.
[23] M. Elarnaoty, S. AbdelRahman, and A. Fahmy, "A Machine Learning Approach for
Opinion Holder Extraction Arabic Language," CoRR, abs/1206.1011, vol. 3, 2012.

[24] M. Abdul-Mageed and M. Diab, "SANA: A Large Scale Multi-Genre, Multi-Dialect
Lexicon for Arabic Subjectivity and Sentiment Analysis," in Proceedings of The 9th edition of
the Language Resources and Evaluation Conference (LREC ), Reykjavik, Iceland, 2014.
[25] E. Refaee and V. Rieser, "An Arabic Twitter Corpus for Subjectivity and Sentiment
Analysis," in Proceedings of The 9th edition of the Language Resources and Evaluation
Conference (LREC 2014), Reykjavik, Iceland, 2014.
[26] M. Elmahdy, G. Rainer, M. Wolfgang, and A. Slim, "Survey on common Arabic language
forms from a speech recognition point of view," in proceeding of International conference on
Acoustics (NAG-DAGA), Rotterdam, Netherlands, 2009, pp. 63-66.
[27] J. C. Carletta, "Assessing agreement on classification tasks: the KAPPA statistic "
Computational Linguistics, vol. 22, pp. 249- 254, 1996.
[28] B. Liu, Sentiment Analysis and Opinion Mining Morgan &Claypool Publishers, 2012.
:sayings Colloquial [‫ا‬B‫ا‬ ‫الحرف‬ ‫حسب‬ ‫ومرتبة‬ ‫مشروحة‬ :‫العالمية‬ ‫مثال‬B‫موضوعى‬ ‫كشاف‬ ‫مع‬ ‫المثل‬ ‫من‬ ‫ول‬ ,Basha.
[29] an annotated and arranged by the first letter of ideals with the Scout TOPICAL]. Egypt: Al-
Ahram Foundation - Al-Ahram Center for Translation and Publishing, 1986.
[30] A. Saalan, ‫مثال‬ ‫الشعبية‬ ‫المصرية‬B‫موسوعة‬ ‫]ا‬ Encyclopedia of Egyptian popular sayings], First ed.
Egypt: Dar-alafkalarabia press, 2003. Egyptian, sayings Colloquial [ ‫ا‬ ‫النوادر‬
,‫الشعبية‬ ‫القصص‬ ,‫لعربية‬
‫ا‬B‫المصرى‬ ‫الفولكلور‬ ,‫العامية‬ ‫مثال‬ ,Husain.
F] 31[ folklore]. Egypt: General Egyptian Book Organization GEBO, 1984.
[32] G. Taher. (2006). ‫دراسة‬ ‫علمية‬
-
‫مثال‬ ‫الشعبية‬ P‫موسوعة‬ ‫]ا‬ Encyclopedia of public sayings - a
scientific study]. Available: http://books.google.com.eg/books?id=2CR_EKTjxRgC
[33] PROz. (2014). PROz website for Arabic Idioms/Maxims/Sayings (Jan 2014). Available:
http://www.proz.com/glossary-translations/
[34] M. Diab, "Towards an optimal POS tag set for Modern Standard Arabic processing," in
Proceedings of Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria,
2007.
[35] O. F. Zaidan and C. Callison-Burch, "Arabic dialect identification," Computational
Linguistics, vol. 40, pp. 171-202, March 2014 2012.
[36] H. S. Ibrahim, S. M. Abdou, and M. Gheith, "Automatic expandable large-scale sentiment
lexicon of Modern Standard Arabic and Colloquial," in 16th International Conference on
Intelligent Text Processing and Computational Linguistics (CICLING), Cairo - Egypt, 2015.
[37] M. Sharifi and W. Cohen. (2008, May, 2014). “Finding domain specifc polar words for
sentiment classification. Available: http://www.cs.cmu.edu/~mehrbod/polarity_08.pdf

[38] J. YI, T. NASUKAWA, R. BUNESCU, and W. NIBLACK, "Sentiment analyzer:
Extracting sentiments about a given topic using natural language processing techniques " in
Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), 2003, pp. 427–
434.
[39] Z. Fei, J. LIU, and G. WU, "Sentiment classification using phrase patterns," in Proceedings
of the 4th IEEE International Conference on Computer Information Technology, 2004, pp.
1147–1152.
[40] T. Joachims. (2008, Jan-2013). SVM-light: Support vector machine. Available:
http://svmlight.joachims.org/

SURVEY OF MACHINE TRANSLATION SYSTEMS IN INDIA
G V Garje1 and G K Kharate2
1 Department of Computer Engineering and Information Technology, PVG's College of
Engineering and Technology, Pune, India
2 Principal, Matoshri College of Engineering and Research Centre, Nashik, India
ABSTRACT
Work in the area of machine translation has been going on for the last few decades, but
promising translation work began in the early 1990s owing to advanced research in Artificial
Intelligence and Computational Linguistics. India is a multilingual and multicultural country
with a population of over 1.25 billion and 22 constitutionally recognized languages, which
are written in 12 different scripts. This necessitates automated machine translation systems
from English to Indian languages and among Indian languages, so that people can exchange
information in their local language. Many usable machine translation systems have been
developed, and others are under development, in India and around the world. The paper
focuses on the different approaches used in the development of machine translation systems
and briefly describes some of these systems along with their features, domains and
limitations.
KEYWORDS
Machine Translation, Example-based MT, Transfer-based MT, Interlingua-based MT
REFERENCES

[1] Sitender & Seema Bawa, (2012) “Survey of Indian Machine Translation Systems”,
International Journal Computer Science and Technolgy, Vol. 3, Issue 1, pp. 286-290, ISSN :
0976-8491 (Online) | ISSN : 2229-4333 (Print)
[2] Sanjay Kumar Dwivedi & Pramod Premdas Sukhadeve, (2010) “Machine Translation
System in Indian Perspectives”, Journal of Computer Science 6 (10): 1082-1087, ISSN 1549-
3636, © 2010 Science
[3] John Hutchins, (2005) “Current commercial machine translation systems and computer-
based translation tools: system types and their uses”, International Journal of Translation
vol.17, no.1-2, pp.5-38.
[4] Vishal Goyal & Gurpreet Singh Lehal, (2009) “Advances in Machine Translation
Systems”, National Open Access Journal, Volume 9, ISSN 1930-2940
http://www.languageinindia.
[5] Latha R. Nair & David Peter S., (2012) “Machine Translation Systems for Indian
Languages”, International Journal of Computer Applications (0975 – 8887) Volume 39–
No.1
[6] Vishal Goyal & Gurpreet Singh Lehal, (2010) “Web Based Hindi to Punjabi Machine
Translation System”, International Journal of Emerging Technologies in Web Intelligence,
Vol. 2, no. 2, pp. 148-151, ACADEMY PUBLISHER
[7] Shachi Dave, Jignashu Parikh & Pushpak Bhattacharyya, (2002) “Interlingua-based
English-Hindi Machine Translation and Language Divergence”, Journal of Machine
Translation, pp. 251-304.
[8] Sudip Naskar & Shivaji Bandyopadhyay, (2005) “Use of Machine Translation in India:
Current status” AAMT Journal, pp. 25-31.
[9] Sneha Tripathi & Juran Krishna Sarkhel, (2010) “Approaches to Machine Translation”,
International journal of Annals of Library and Information Studies, Vol. 57, pp. 388-393
[10] Gurpreet Singh Josan & Jagroop Kaur, (2011) “Punjabi To Hindi Statistical Machine
Transliteration”, International Journal of Information Technology and Knowledge
Management , Volume 4, No. 2, pp. 459-463.
[11] S. Bandyopadhyay, (2004) "ANUBAAD - The Translator from English to Indian
Languages", in proceedings of the VIIth State Science and Technology Congress. Calcutta.
India. pp. 43-51
[12] R.M.K. Sinha & A. Jain, (2002) “AnglaHindi: An English to Hindi Machine-Aided
Translation System”, International Conference AMTA(Association of Machine Translation
in the Americas)
[13] Murthy. K, (2002) “MAT: A Machine Assisted Translation System”, In Proceedings of
Symposium on Translation Support System( STRANS-2002), IIT Kanpur. pp. 134-139.
[14] Lata Gore & Nishigandha Patil, (2002) “English to Hindi - Translation System”, In
proceedings of Symposium on Translation Support Systems. IIT Kanpur. pp. 178-184.
[15] Kommaluri Vijayanand, Sirajul Islam Choudhury & Pranab Ratna
“VAASAANUBAADA - Automatic Machine Translation of Bilingual Bengali-Assamese
News Texts”, in proceedings of Language Engineering Conference-2002, Hyderabad, India

© IEEE Computer Society.
[16] Bharati, R. Moona, P. Reddy, B. Sankar, D.M. Sharma & R. Sangal, (2003) “Machine
Translation: The Shakti Approach”, Pre-Conference Tutorial, ICON-2003.
[17] S. Mohanty & R. C. Balabantaray, (2004) “English to Oriya Translation System
(OMTrans)” cs.pitt.edu/chang/cpol/c087.pdf
[18] Ananthakrishnan R, Kavitha M, Jayprasad J Hegde, Chandra Shekhar, Ritesh Shah,
Sawani Bade & Sasikumar M., (2006) “MaTra: A Practical Approach to Fully- Automatic
Indicative EnglishHindi Machine Translation”, In the proceedings of MSPIL-06.
[19] G. S. Josan & G. S. Lehal, (2008) “A Punjabi to Hindi Machine Translation System”, in
proceedings of COLING-2008: Companion volume: Posters and Demonstrations,
Manchester, UK, pp. 157-160.
[20] Sanjay Chatterji, Devshri Roy, Sudeshna Sarkar & Anupam Basu, (2009) “A Hybrid
Approach for Bengali to Hindi Machine Translation”, In proceedings of ICON-2009, 7th
International Conference on Natural Language Processing, pp. 83-91.
[21] Vishal Goyal & Gurpreet Singh Lehal, (2011) “Hindi to Punjabi Machine Translation
System”, in proceedings of the ACL-HLT 2011 System Demonstrations, pages 1–6, Portland,
Oregon, USA, 21 June 2011.
[22] Ankit Kumar Srivastava, Rejwanul Haque, Sudip Kumar Naskar & Andy Way, (2008)
“The MATREX (Machine Translation using Example): The DCU Machine Translation
System for ICON 2008”, in Proceedings of ICON-2008: 6th International Conference on
Natural Language Processing, Macmillan Publishers, India,
http://ltrc.iiit.ac.in/proceedings/ICON-2008.
[23] hutchinsweb.me.uk/Nutshell-2005.pdf
[24] John Hutchins “Historical survey of machine translation in Eastern and Central Europe”,
Based on an unpublished presentation at the conference on Crosslingual Language
Technology in service of an integrated multilingual Europe, 4-5 May 2012, Hamburg,
Germany. (www.hutchinsweb.me.uk/Hamburg-2012.pdf)
[25] Sampark: Machine Translation System among Indian languages (2009)
http://tdildc.in/index.php?option=com_vertical&parentid=74, http://sampark.iiit.ac.in/
[26] Akshar Bharti, Chaitanya Vineet, Amba P. Kulkarni & Rajiv Sangal, (1997)
”ANUSAARAKA: Machine Translation in stages’, Vivek, a quarterly in Artificial
Intelligence, Vol. 10, No. 3, NCST Mumbai, pp. 22-25
[27] Akshar Bharti, Chaitanya Vineet, Amba P. Kulkarni & Rajiv Sangal, (2001)
”ANUSAARAKA: overcoming the language barrier in India”, published in Anuvad:
approaches to Translation
[28] Hemant Darabari, (1999) “Computer Assisted Translation System- An Indian
Perspective”, in proceedings of MT Summit VII, Thialand [29] R. Mahesh K. Sinha & Anil
Thakur, (2005) “Machine Translation of Bi-lingual Hindi-English (Hinglish) Text”, in
proceedings of 10th Machine Translation Summit organized by Asia-Pacific Association for
Machine Translation (AAMT), Phuket, Thailand
[30] Parameswari K, Sreenivasulu N.V., Uma Maheshwar Rao G & Christopher M, (2012)
“Development of Telugu-Tamil Bidirectional Machine Translation System: A special focus

on case divergence”, in proceedings of 11th International Tamil Internet conference, pp 180-
191
[31] Salil Badodekar, (2004) “Translation Resources, Services and Tools for Indian
Languages”, a report of Centre for Indian Language Technology, IITB,
http://www.cfilt.iitb.ac.in/Translationsurvey/survey.pdf
[32] Ananthakrishnan R, Kavitha M, Jayprasad J Hegde, Chandra Shekhar, Ritesh Shah,
Sawani Bade & Sasikumar M, (2006) “MaTra: A Practical Approach to Fully-Automatic
Indicative EnglishHindi Machine Translation”, in proceedings of the first national
symposium on Modelling and shallow parsing of Indian languages (MSPIL-06) organized by
IIT Bambay, 202.141.152.9/clir/papers/matra_mspil06.pdf
[33] CDAC Mumbai, (2008) “MaTra: an English to Hindi Machine Translation System”, a
report by CDAC Mumbai formerly NCST.
[34] Sanjay Chatterji, Praveen Sonare, Sudeshna Sarkar & Anupam Basu, (2011) “Lattice
Based Lexical Transfer in Bengali Hindi Machine Translation Framework”, in Proceedings
of ICON2011: 9th International Conference on Natural Language Processing, Macmillan
Publishers, India. Also accessible from ltrc.iiit.ac.in/proceedings/ICON-2011.
[35] R. Ananthakrishnan, Jayprasad Hegde, Pushpak Bhattacharyya, Ritesh Shah & M.
Sasikumar, (2008) “Simple Syntactic and Morphological Processing Can Help English-Hindi
Statistical Machine Translation”, in proceedings of International Joint Conference on NLP
(IJCNLP08), Hyderabad, India.
[36] Yanjun Ma, John Tinsley, Hany Hassan, Jinhua Du & Andy Way, (2008) “Exploiting
Alignment Techniques in MATREX: the DCU Machine Translation System for IWSLT
2008’, in proceedings of IWSLT 2008, Hawaii, USA
[37] projects.uptuwatch.com/cs-it/anubharti-an-hybrid-example-based-approach-for-
machine-aidedtrapnslation/
[38] Sugata Sanyal & Rajdeep Borgohain, (2013) “Machine Translation Systems in India”,
Cornel University Library, arxiv.org/ftp/arxiv/papers/1304/1304.7728.pdf [39] Antony P. J.,
(2013) “Machine Translation Approaches and Survey for Indian Languages”, International
journal of Computational Linguistics and Chinese Language Processing Vol. 18, No. 1, pp.
47-78.
[40] Manoj Jain & Om P. Damani, (2009) “English to UNL (Interlingua) Enconversion”, in
proceedings of 4th Language and Translation Conference (LTC-09).
[41] Smriti Singh, Mrugank Dalal, Vishal Vachhani, Pushpak Bhattacharyya & Om P.
Damani, (2007) “Hindi Generation from Interlingua (UNL)”, in proceedings of MT Summit,
2007
[42] language.worldofcomputing.net
[43] sampark.iiit.ac.in [44] www.cdacmumbai.in/xlit [
45] www.cdacmumbai.in/rupantar
[46] translationjournal.net/journal/29computers.htm
[47] www.cfilt.iitb.ac.in/resources/surveys/MT-Literature%20Survey-2012-Somya.pdf
[48] www.cdacmumbai.in/e-ilmt

[49] www.iiit.net/ltrc/Anusaaraka/anu_home.html
[50] cdac.in/html/aai/mantra.asp
[51] translate.google.com/about/intl/en_ALL/
AUTHORS
G V Garje (gvg_comp@pvgcoet.ac.in) completed his ME in Computer Science and
Engineering at NITTR, Chandigarh, India in 1998. He is currently working as
Associate Professor and Head of the Computer Engineering and Information
Technology Department at Pune Vidyarthi Griha's College of Engineering and
Technology, Pune. He is presently Chairman, Board of Studies in Information
Technology, University of Pune, and was formerly Chairman, Board of Studies in
Computer Engineering, University of Pune. He is pursuing his Ph.D. at the
University of Pune, Maharashtra, India, in the area of Machine Translation. His
areas of research are NLP and Machine Translation, specifically the
English-Marathi language pair. He has developed a tool for translating simple
English interrogative sentences into Marathi, funded by the University of Pune.
His areas of interest are Data Structures, Operating Systems and Software
Architecture.

G K Kharate (gkkharate@rediffmail.com) completed his Ph.D. in Electronics and
Telecommunication Engineering at the University of Pune and his ME in
Electronics and Communication at Walchand College of Engineering, Sangali,
Maharashtra. He is currently Principal at Matoshri College of Engineering and
Research Centre, Nashik, Maharashtra. He is Dean, Faculty of Engineering, and a
Member of the Management Council, University of Pune. He is a former Chairman,
Board of Studies in Electronics Engineering, University of Pune. His areas of
research are Image Processing, Pattern Recognition and Artificial Intelligence.
His areas of interest are Digital Electronics, Computer Networks, Image
Processing and Natural Language Processing.

RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI
Deepti Bhalla, Nisheeth Joshi and Iti Mathur
Apaji Institute, Banasthali University, Rajasthan, India
ABSTRACT
Machine transliteration has emerged as a very important research area in the field of
machine translation. Transliteration basically aims to preserve the phonological structure
of words. Proper transliteration of named entities plays a very significant role in
improving the quality of machine translation. In this paper we perform machine
transliteration for the English-Punjabi language pair using a rule-based approach. We have
constructed rules for syllabification, the process of extracting or separating the
syllables of a word. We calculate probabilities for named entities (proper names and
locations). For words that do not fall under the category of named entities, separate
probabilities are calculated from relative frequencies using a statistical machine
translation toolkit known as MOSES. Using these probabilities we transliterate our input
text from English to Punjabi.
KEYWORDS
Machine Translation, Machine Transliteration, Name entity recognition, Syllabification
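
The relative-frequency estimate mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes a list of already aligned (English syllable, Punjabi unit) pairs; the function names and the toy alignments are hypothetical and are not taken from the paper, whose own rules and MOSES configuration are not given here.

```python
from collections import Counter, defaultdict

def relative_frequencies(aligned_pairs):
    """Estimate P(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    table = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        table[src][tgt] = n / source_counts[src]
    return table

def transliterate(syllables, table):
    """Pick the most probable target unit for each source syllable."""
    out = []
    for syl in syllables:
        candidates = table.get(syl)
        # Unknown syllables are passed through unchanged in this sketch.
        out.append(max(candidates, key=candidates.get) if candidates else syl)
    return "".join(out)

# Hypothetical toy alignments (illustrative only, not the paper's data).
pairs = [("ra", "ਰਾ"), ("ra", "ਰਾ"), ("ra", "ਰ"), ("m", "ਮ")]
table = relative_frequencies(pairs)
result = transliterate(["ra", "m"], table)
```

Here the syllable "ra" maps to its most frequent aligned unit, and a syllable with no entry in the table is copied through as-is.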

REFERENCES
[1] Kamal Deep and Vishal Goyal, (2011) “Development of a Punjabi to English
transliteration system”. In International Journal of Computer Science and Communication
Vol. 2, No. 2, pp. 521-526.
[2] Shubhangi Sharma, Neha Bora and Mitali Halder, (2012) “English-Hindi Transliteration
using Statistical Machine Translation in different Notation” International Conference on
Computing and Control Engineering (ICCCE 2012).
[3] Kamal Deep, Dr.Vishal Goyal, (2011) “Hybrid Approach for Punjabi to English
Transliteration System” International Journal of Computer Applications (0975 – 8887)
Volume 28– No.1.
[4] Jasleen Kaur, Gurpreet Singh Josan, (2011) “Statistical Approach to Transliteration from
English to Punjabi”, In Proceeding of International Journal on Computer Science and
Engineering (IJCSE), Vol. 3 Issue 4, p1518.
[5] Er. Sheilly Padda, Rupinderdeep Kaur, Er. Nidhi, (2012) “Punjabi Phonetic: Punjabi Text
to IPA Conversion” International Journal of Emerging Technology and Advanced
Engineering Website: www.ijetae.com ISSN 2250-2459, Volume 2, Issue 10.
[6] Gurpreet Singh Josan, Gurpreet Singh Lehal, (2010) “A Punjabi to Hindi Machine
Transliteration System” Computational Linguistics and Chinese Language Processing Vol.
15, No. 2, pp. 77-102.
[7] Manikrao L Dhore, Shantanu K Dixit, Tushar D Sonwalkar, (2012) “Hindi to English
Machine Transliteration of Named Entities using Conditional Random Fields.” International
Journal of Computer Applications;6/15/2012, Vol. 48, p31.
[8] Musa, Hafiz, Rabith A.kadir, Azreen Azman, M.taufik Abadullah, (2011) "Syllabification
algorithm based on syllable rules matching for Malay language." Proceedings of the 10th
WSEAS international conference on Applied computer and applied computational science.
World Scientific and Engineering Academy and Society (WSEAS).
[9] To download IRSTLM toolkit http://www.statmt.org
[10] Jenny Rose Finkel, Trond Grenager, and Christopher Manning, (2005) Incorporating
Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings
of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005),
pp. 363-370.
[11] Daniel Jurafsky, James H. Martin Speech and Language processing An Introduction to
speech Recognition, natural language processing, and computational linguistics.
AUTHORS
Deepti Bhalla is pursuing her M.Tech in Computer Science from Banasthali
University, Rajasthan and is working as a Research Assistant in English-Indian
Languages Machine Translation System Project sponsored by TDIL
Programme, DEITY. Her interest is in Machine Translation, specifically the
English-Punjabi language pair. She has developed various tools for Punjabi language
processing. Her current research interests include Natural Language
Processing and Machine Translation.
Nisheeth Joshi is a researcher working in the area of Machine Translation.
He has been primarily working on the design and development of evaluation
metrics in Indian languages. Besides this he is also actively involved in the
development of MT engines for English to Indian Languages. He is one of
the experts empanelled with TDIL Programme, Department of Electronics and
Information Technology (DEITY), Govt. of India, a premier organization which oversees
Language Technology Funding and Research in India. He has several publications in various
journals and conferences and also serves on the Programme Committees and
Editorial Boards of several conferences and journals.
Iti Mathur is an assistant professor at Banasthali University. Her primary
area of research is computational semantics and ontological engineering.
Besides this she is also involved in the development of MT engines for
English to Indian Languages. She is one of the experts empanelled with TDIL
Programme, Department of Electronics and Information Technology (DEITY), Govt. of India, a
premier organization which oversees Language Technology Funding and Research in India.
She has several publications in various journals and conferences and also serves on the
Programme Committees and Editorial Boards of several conferences and journals.

HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
P H Rathod1, M L Dhore2, R M Dhore3
1,2 Department of Computer Engineering, Vishwakarma Institute of Technology, Pune
3 Pune Vidyarthi Griha’s College of Engineering and Technology, Pune
ABSTRACT
Language transliteration is one of the important areas in NLP. Transliteration is very useful
for converting named entities (NEs) written in one script to another script in NLP
applications such as Cross Lingual Information Retrieval (CLIR), Multilingual Voice Chat
Applications and Real Time Machine Translation (MT). The most important requirement of a
transliteration system is to preserve the phonetic properties of the source language after
transliteration into the target language. In this paper, we propose named entity
transliteration for the Hindi to English and Marathi to English language pairs using a Support
Vector Machine (SVM). In the proposed approach, the source named entity is segmented into
transliteration units, so the transliteration problem can be viewed as a sequence labeling
problem. The phonetic units are classified using the polynomial kernel function
of the SVM. The proposed approach uses the phonetics of the source language
and n-grams as its two features for transliteration.
KEYWORDS
Machine Transliteration, n-gram, Support Vector Machine, Syllabification.
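
To make the sequence-labeling view in the abstract concrete, the following Python sketch builds n-gram context features for one transliteration unit and evaluates a polynomial kernel, K(x, y) = (x . y + c)^d, on two such feature vectors. The feature names, the segmentation and the kernel parameters are assumptions chosen for illustration; the paper's actual feature set and SVM settings are not stated in the abstract, and no trained classifier is included.

```python
def ngram_features(units, i, n=2):
    """Sparse features for unit i: the unit itself plus n-1 neighbours on each side."""
    feats = {f"cur={units[i]}": 1.0}
    for k in range(1, n):
        left = units[i - k] if i - k >= 0 else "<s>"
        right = units[i + k] if i + k < len(units) else "</s>"
        feats[f"prev{k}={left}"] = 1.0
        feats[f"next{k}={right}"] = 1.0
    return feats

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """K(x, y) = (x . y + coef0) ** degree for sparse feature dicts."""
    dot = sum(v * y.get(k, 0.0) for k, v in x.items())
    return (dot + coef0) ** degree

# Hypothetical segmentation of a named entity into transliteration units.
units = ["ra", "me", "sh"]
a = ngram_features(units, 0)
b = ngram_features(units, 1)
k_aa = polynomial_kernel(a, a)
k_ab = polynomial_kernel(a, b)
```

Each unit becomes one classification instance, so a kernel SVM can assign it a target-side label using both the unit and its neighbours.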

REFERENCES
[1] Padariya Nilesh, Chinnakotla Manoj, Nagesh Ajay, Damani Om P.(2008) “Evaluation of
Hindi to English, Marathi to English and English to Hindi”, IIT Mumbai CLIR at FIRE.
[2] Saha Sujan Kumar, Ghosh P. S, Sarkar Sudeshna and Mitra Pabitra (2008) “Named entity
recognition in Hindi using maximum entropy and transliteration.”
[3] BIS (1991) “Indian standard code for information interchange (ISCII)”, Bureau of Indian
Standards, New Delhi.
[4] Joshi R K, Shroff Keyur and Mudur S P (2003) “A Phonemic code based scheme for
effective processing of Indian languages”, National Centre for Software Technology,
Mumbai, 23rd Internationalization and Unicode Conference, Prague, Czech Republic, pp 1-17.
[5] Arbabi M, Fischthal S M, Cheng V C and Bart E (1994) “Algorithms for Arabic name
transliteration”, IBM Journal of Research and Development, pp 183-194.
[6] Knight Kevin and Graehl Jonathan (1997) “Machine transliteration”, In proceedings of
the 35th annual meetings of the Association for Computational Linguistics, pp 128-135.
[7] Stalls Bonnie Glover and Kevin Knight (1998) “Translating names and technical terms in
Arabic text.”
[8] Al-Onaizan Y, Knight K (2002) “Machine translation of names in Arabic text”,
Proceedings of the ACL conference workshop on computational approaches to Semitic
languages.
[9] Jaleel Nasreen Abdul and Larkey Leah S. (2003) “Statistical transliteration for English-
Arabic cross language information retrieval”, In Proceedings of the 12th international
conference on information and knowledge management, pp 139 – 146.
[10] Jung S. Y., Hong S., S., Paek E.(2003) “English to Korean transliteration model of
extended Markov window”, In Proceedings of the 18th Conference on Computational
Linguistics, pp 383–389.
[11] Ganapathiraju M., Balakrishnan M., Balakrishnan N., Reddy R. (2005) “OM: One Tool
for Many (Indian) Languages”, ICUDL: International Conference on Universal Digital
Library, Hangzhou.
[12] Malik M G A (2006) “Punjabi Machine Transliteration”, Proceedings of the 21st
International Conference on Computational Linguistics and the 44th annual meeting of the
ACL, pp 1137–1144.
[13] Sproat R.(2002) “Brahmi scripts, In Constraints on Spelling Changes”, Fifth
International Workshop on Writing Systems, Nijmegen, The Netherlands.
[14] Sproat R.(2003) “A formal computational analysis of Indic scripts”, In International
Symposium on Indic Scripts: Past and Future, Tokyo.
[15] Sproat R.(2004) “A computational theory of writing systems, In Constraints on Spelling
Changes”, Fifth International Workshop on Writing Systems, Nijmegen, The Netherlands.
[16] Kopytonenko M., Lyytinen K., and Kärkkäinen T. (2006) “Comparison of phonological
representations for the grapheme-to-phoneme mapping, In Constraints on Spelling Changes”,
Fifth International Workshop on Writing Systems, Nijmegen, The Netherlands.
[17] Ganesh S, Harsha S, Pingali P, and Verma V (2008) “Statistical transliteration for cross
language information retrieval using HMM alignment and CRF”, In Proceedings of the
Workshop on CLIA, Addressing the Needs of Multilingual Societies.
[18] Sumaja Sasidharan, Loganathan R, and Soman K P (2009) “English to Malayalam
Transliteration Using Sequence Labeling Approach” International Journal of Recent Trends
in Engineering, Vol. 1, No. 2, pp 170-172
[19] Oh Jong-Hoon, Kiyotaka Uchimoto, and Kentaro Torisawa (2009) “Machine
transliteration using target-language grapheme and phoneme: Multi-engine transliteration
approach”, Proceedings of the Named Entities Workshop ACL-IJCNLP Suntec,
Singapore,AFNLP, pp 36–39
[20] Antony P.J, Soman K.P (2010) “Kernel Method for English to Kannada Transliteration”,
Conference on Machine Learning and Cybernetics, pp 11-14
[21] Ekbal A. and Bandyopadhyay S. (2007) “A Hidden Markov Model based named entity
recognition system: Bengali and Hindi as case studies”, Proceedings of 2nd International
conference in Pattern Recognition and Machine Intelligence, Kolkata, India, pp 545–552.
[22] Ekbal A. and Bandyopadhyay S. (2008) “Bengali named entity recognition using
support vector machine”, In Proceedings of the IJCNLP-08 Workshop on NER for South and
South East Asian languages, Hyderabad, India, pp 51–58.
[23] Ekbal A. and Bandyopadhyay S. (2008), “Development of Bengali named entity tagged
corpus and its use in NER system”, In Proceedings of the 6th Workshop on Asian Language
Resources.
[24] Ekbal A. and Bandyopadhyay S. (2008) “A web-based Bengali news corpus for named
entity recognition”, Language Resources & Evaluation, vol. 42, pp 173–182.
[25] Ekbal A. and Bandyopadhyay S.(2008) “Improving the performance of a NER system
by postprocessing and voting”, In Proceedings of Joint IAPR International Workshop on
Structural Syntactic and Statistical Pattern Recognition, Orlando, Florida, pp 831–841.
[26] Ekbal A. and Bandyopadhyay S.(2009) “Bengali Named Entity Recognition using
Classifier Combination”, In Proceedings of Seventh International Conference on Advances in
Pattern Recognition, pp 259–262.
[27] Ekbal A. and Bandyopadhyay S. (2009) “Voted NER system using appropriate
unlabelled data”, In Proceedings of the Named Entities Workshop, ACL-IJCNLP.
[28] Ekbal A. and Bandyopadhyay S. (2010) “ Named entity recognition using appropriate
unlabeled data, post-processing and voting”, In Informatica, Vol 34, No. 1, pp 55-76.
[29] Chinnakotla Manoj K., Damani Om P., and Satoskar Avijit (2010) “Transliteration for
Resource-Scarce Languages”, ACM Trans. Asian Lang. Inform, Article 14, pp 1-30.
[30] Kishorjit Nongmeikapam (2012) “Transliterated SVM Based Manipuri POS Tagging”,
Advances in Computer Science and Engineering and Applications, pp 989-999
[31] K.P.Sonam, V. Ajay, R. Laganatha.(2009) “Machine Learning with SVM and Other
Kernel Methods”, Machine Learning Book, PHI.
[32] Koul Omkar N. (2008) “Modern Hindi Grammar”, Dunwoody Press

[33] Walambe M. R. (1990) “Marathi Shuddalekhan”, Nitin Prakashan, Pune.
[34] Walambe M. R. (1990) “Marathi Vyakran”, Nitin Prakashan, Pune.
[35] Dhore M L, Dixit S K and Dhore R M (2012) “Hindi and Marathi to English NE
Transliteration Tool using Phonology and Stress Analysis”, 24th International Conference on
Computational Linguistics, Proceedings of COLING Demonstration Papers, at IIT Bombay,
pp 111-118
AUTHORS
P H Rathod (pravin.rathod@vit.edu) has completed BE in Information Technology, from
Government College of Engineering, Karad, Maharashtra, India, in 2008.
Recently he has completed ME in Computer Science and Engineering from
Vishwakarma Institute of Technology, Pune, India in 2013. Currently he is
working as Assistant Professor in Department of Computer Engineering at
Vishwakarma Institute of Technology, Pune. He has his interest in Machine
Translation and Machine Transliteration specifically in Devanagari-English Language Pairs.
His current areas of research are Mobile Ad hoc Networks, Internet Routing Algorithms,
Computer Networking, Machine Translation and Transliteration.
M. L. Dhore (manikrao.dhore@vit.edu) has completed ME in Computer
Science and Engineering from NITTR, Chandigarh, India in 1998. Currently
he is working as Associate Professor in Department of Computer Engineering
at Vishwakarma Institute of Technology, Pune. Presently he is pursuing his
Ph.D. from University of Solapur, Maharashtra, India, in the area of
Computational Linguistics. He has his interest in Machine Translation and
Machine Transliteration specifically in Marathi-English and Hindi-English Language Pairs.
He has developed the tools for Devanagari to English Machine Transliteration for the online
web based commercial applications. His current areas of research are Internet Routing
Algorithms, Computer Networking, Machine Translation and Transliteration.
Ruchi M Dhore (ruchidhore93@gmail.com) is a third-year student of
Computer Engineering at Pune Vidyarthi Griha’s College of Engineering and
Technology, Pune, Maharashtra, India. She is a scholar student of her college
and has secured distinction every year in the University of Pune examinations.
She is very good at programming and has won prizes in state-level and
national-level competitions. Her areas of research interest include Text
Processing and Pattern Searching. She would like to build her career in the development of
language processing tools for the Marathi language.

HYBRID PART-OF-SPEECH TAGGER FOR NON-VOCALIZED ARABIC TEXT
Meryeme Hadni1, Said Alaoui Ouatik1, Abdelmonaime Lachkar2 and Mohammed Meknassi1
1 FSDM, Sidi Mohamed Ben Abdellah University (USMBA), Morocco
2 E.N.S.A, Sidi Mohamed Ben Abdellah University (USMBA), Morocco
ABSTRACT
Part of speech tagging (POS tagging) has a crucial role in different fields of natural language
processing (NLP) including Speech Recognition, Natural Language Parsing, Information
Retrieval and Multi Words Term Extraction. This paper proposes an efficient and accurate
POS tagging technique for the Arabic language using a hybrid approach. Due to ambiguity,
the Arabic rule-based method suffers from misclassified and unanalyzed words. To
overcome these two problems, we propose a Hidden Markov Model (HMM) integrated with
the Arabic rule-based method. Our POS tagger generates a set of three POS tags: Noun, Verb,
and Particle. The proposed technique uses the contextual information of the words
together with a variety of features that help to predict the POS classes. To
evaluate its accuracy, the proposed method has been trained and tested with two corpora: the
Holy Quran Corpus and the Kalimat Corpus for undiacritized Classical Arabic. The
experimental results demonstrate the efficiency of our method for Arabic POS tagging. With the
Holy Quran Corpus, the accuracy rates obtained are 97.6%, 96.8% and 94.4% for our Hybrid
Tagger, the HMM Tagger and the Rule-Based Tagger, respectively. With the
Kalimat Corpus, we obtained 98%, 97.40% and 94.60% for our Hybrid Tagger, the HMM Tagger
and the Rule-Based Tagger, respectively.
KEY WORDS
Part-Of-Speech Tagger, Natural Language Applications, Natural Language Parsing, Hidden
Markov Model, Multi Words Term Extraction, Speech Recognition.
FULL TEXT : http://airccse.org/journal/ijnlc/papers/2613ijnlc01.pdf
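
One way to picture the combination of a rule-based tagger with an HMM described in the abstract is the Python sketch below. It is only an illustration under stated assumptions: the rule tagger is a hypothetical callable that returns a tag or None for unanalyzed words, the HMM parameters are add-one smoothed counts, and the fallback is a greedy bigram choice rather than full Viterbi decoding. The paper's actual integration scheme is not specified in the abstract, and the toy words are romanized placeholders.

```python
from collections import Counter, defaultdict

TAGS = ("Noun", "Verb", "Particle")

def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def prob(counter, key, size):
    """Add-one smoothed relative frequency."""
    return (counter[key] + 1) / (sum(counter.values()) + size)

def hybrid_tag(words, rule_tagger, trans, emit, vocab_size):
    """Use the rule-based tag when available, otherwise fall back to HMM scores."""
    tags, prev = [], "<s>"
    for w in words:
        tag = rule_tagger(w)
        if tag is None:  # word left unanalyzed by the rules
            tag = max(TAGS, key=lambda t: prob(trans[prev], t, len(TAGS))
                                          * prob(emit[t], w, vocab_size))
        tags.append(tag)
        prev = tag
    return tags

# Hypothetical romanized toy data (not from the Quran or Kalimat corpora).
train = [[("fi", "Particle"), ("bayt", "Noun")],
         [("kataba", "Verb"), ("fi", "Particle"), ("kitab", "Noun")]]
trans, emit = train_hmm(train)
rules = lambda w: "Particle" if w in {"fi", "min"} else None
tags = hybrid_tag(["kataba", "min", "bayt"], rules, trans, emit, vocab_size=4)
```

In this toy run the rules tag "min" directly, while the two words they leave unanalyzed are resolved by the smoothed transition and emission scores.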

REFERENCE
[1] http://en.wikipedia.org/wiki/Part-of-speech_tagging.
[2] L.Van Guilder, (1995) “Automated Part of Speech Tagging: A Brief Overview” Handout
for LING361, Georgetown University.
[3] H. Halteren, J.Zavrel & Walter Daelemans (2001).Improving Accuracy in NLP Through
Combination of Machine Learning Systems. Computational Linguistics. 27(2): 199–229.
[4] DeRose & J.Steven (1990) "Stochastic Methods for Resolution of Grammatical Category
Ambiguity in Inflected and Uninflected Languages." PhD.Dissertation. Providence, RI:
Brown University Department of Cognitive and Linguistic Sciences.
[5] N. kumar Kumar, Anikel Dalal & Uma Sawant (2006) “Hindi part of speech tagging and
chunking”, NLPAI machine learning contest.
[6] M. Mohseni, H. Motalebi, B. Minaei-bidgoli & M. Shokrollahi-far (2008) “A farsi part-
of-speech tagger based on markov”. In the proceedings of ACM symposium on Applied
computing, Brazil.
[7] S. Jabbari &B. Allison(2007)“Persian Part of Speech Tagging”, In the Proceedings of
Workshop on Computational Approaches to Arabic Script-Based Languages (CAASL-2),
USA.
[8] E. Brill (1995) “Transformation-Based Error-Driven Learning and Natural Language
Processing: A case Study in Part of Speech Tagging”, Computational Linguistics, USA.
[9] M. Hepple (2000), ”Independence and Commitment: Assumptions for Rapid Training and
Execution of Rule-based Part of-Speech Taggers”, In Proceedings of the 38th Annual
Meeting of the Association for Computational Linguistics (ACL). Hong Kong.
[10] T. Brants (2000), “TNT – a Statistical Part-of-Speech Tagger”, In the Proceedings of 6th
conference on applied natural language processing (ANLP), USA.
[11] K. Megerdoomian (2004), “Developing a Persian part-of speech tagger”, In the
Proceedings of first Workshop on Persian Language and computer, Iran .
[12] Khoja, S.( 2001) “ APT: Arabic part-of-speech tagger”. Proceeding of the Student
Workshop at the 2nd Meeting of the NAACL, (NAACL’01), Carnegie Mellon University,
Pennsylvania, pp: 1- 6. http://zeus.cs.pacificu.edu/shereen/NAACL.pdf
[13] Freeman A (2001), “Brill’s POS tagger and a morphology parser for Arabic”, In
ACL’01 Workshop on Arabic language processing.
[14] Maamouri M, Cieri C. (2002). “Resources for Arabic Natural Language Processing at
the LDC”, Proceedings of the International Symposium on the Processing of Arabic,Tunisia,
pp.125-146.
[15] Diab M., Hacioglu K. and Jurafsky D. (2004), “Automatic Tagging of Arabic Text:
From Raw Text to Base Phrase Chunks”. proc. of HLTNAACL’04: 149–152.
[16] Banko M, Moore R. C. (2004). “Part of Speech Tagging in Context”, Proc of the 20th
international conference on Computational Linguistics, Switzerland.

[17] Tlili-Guiassa Y. (2006) “Hybrid Method for Tagging Arabic Text”. Journal of Computer
Science 2 (3): 245-248.
[18] L. Young-Suk, K. Papineni & S. Roukos ( 2003), “Language Model Based Arabic Word
Segmentation,” in Proceedings of the Annual Meeting on Association for Computational
Linguistics, Japan, pp. 399- 406.
[19] A.T Al-Taani & S. Abu-Al-Rub (2009),”A rule-based approaches for tagging non-
vocalized Arabic words”. The International Arab Journal of Information Technology,
Volume6 (3): 320-328.
[20] T. Brants (2000),” TnT: A statistical part of speech tagger”, Proceedings of the 6th
Conference on Applied Natural Language Processing, Apr. 29- May 04, Association for
Computational Linguistics Morristown, New Jersey, USA., pp: 224-231.
[21] NLTK, Natural Language Toolkit. http://www.nltk.org/Home
[22] Quranic Arabic Corpus: http://corpus.quran.com
[23] Quran Tagset: http://corpus.quran.com/documentation/tagset.jsp
[24] N. Habash & O. Rambow (2005), “Arabic Tokenization, Part-of-Speech Tagging and
Morphological Disambiguation in One Fell Swoop,” in Proceedings of the Annual Meeting
on Association for Computational Linguistics, Michigan, pp. 573-580.
[25] http://sibawayh.emi.ac.ma/web/s/?q=node/79
[26] http://bit.ly/16jO3Ks
[27] http://www.alwatan.com/
[28] F. Al Shamsi & A.Guessoum(2006),” A Hidden Markov Model–Based POS Tagger for
Arabic”, 8es Journées internationales d’Analyse statistique des Données Textuelles (JADT).
[29] M. Albared & O.Nazlia(2010),” Automatic Part of Speech Tagging for Arabic: An
Experiment Using Bigram Hidden Markov Model “,Springer-Verlag Berlin Heidelberg,
LNAI 6401, pp. 361– 370.
[30] Y.O. Mohamed Elhadj(2009),” Statistical Part-of-Speech Tagger for Traditional Arabic
Texts”, Journal of Computer Science 5 (11): 794-800.
AUTHORS
Miss Meryeme Hadni is a PhD student in the Laboratory of Computer and Modelization,
Faculty of Sciences, University Sidi Mohamed Ben Abdellah (USMBA), Fez,
Morocco. She has also presented papers at several national and
international conferences.
Pr. Abdelmonaime LACHKAR received his PhD degree from the USMBA,
Morocco in 2004. He is Professor and Computer Engineering Program Coordinator
at (E.N.S.A, FES), and the Head of the Systems Architecture and Multimedia Team
(LSIS Laboratory) at Sidi Mohamed Ben Abdellah University, Fez, Morocco. His
current research interests include Arabic Natural Language Processing ANLP,
Arabic Web Document Clustering and Categorization, Arabic Information Retrieval Systems,
Arabic Text Summarization, Arabic Ontologies development and usage, Arabic Semantic
Search Engines (SSEs).

Pr. Said Alaoui Ouatik is working as a Professor in the Department of Computer
Science, Faculty of Science Dhar EL Mahraz (FSDM), Fez, Morocco. His
research interests include high-dimensional indexing and content-based retrieval,
Arabic document categorization, and 2D/3D shape indexing and retrieval in large
3D object databases.
Mohammed Meknassi received his Ph.D. degree in computer sciences from Montreal
University in 1993. Since 1993, he has been a professor of computer sciences. He teaches
and conducts research in the following fields: Parallel Processing,
Distributed Computing, Operating Systems and Image Processing. He is a member
of the research unit Systems Image and Multimedia (SIM), attached to the
laboratory Computer Sciences, Statistics and Quality (LISQ). He is the head of the Computer
Sciences Department in the Faculty of Sciences Dhar El Mahraz of Fez.

HYBRID APPROACHES FOR AUTOMATIC VOWELIZATION OF ARABIC TEXTS
Mohamed Bebah1, Chennoufi Amine2, Mazroui Azzeddine3 and Lakhouaja Abdelhak4
1 Arab Center for Research and Policy Studies, Doha, Qatar
2 Faculty of Sciences/University Mohamed I, Oujda, Morocco
3
4
ABSTRACT
Hybrid approaches for automatic vowelization of Arabic texts are presented in this article.
The process is made up of two modules. In the first, a morphological analysis of the text
words is performed using the open source morphological analyzer AlKhalil Morpho Sys.
The output for each word, analyzed out of context, is its set of possible vowelizations. The
integration of this analyzer in our vowelization system required the addition of a lexical
database containing the most frequent words in the Arabic language. The second module uses a
statistical approach based on two hidden Markov models (HMMs) to eliminate the
ambiguities. For the first HMM, the unvowelized Arabic words are the observed
states and the vowelized words are the hidden states. The observed states of the second
HMM are identical to those of the first, but the hidden states are the lists of possible diacritics
of the word without its Arabic letters. Our system uses the Viterbi algorithm to select the optimal
path among the solutions proposed by AlKhalil Morpho Sys. Our approach opens an
important way to improve the performance of automatic vowelization of Arabic texts for
other uses in automatic natural language processing.
KEYWORDS
Arabic language, Automatic vowelization, morphological analysis, hidden Markov model,
corpus
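
The disambiguation step described in the abstract, choosing one vowelized form per word from the analyzer's candidates with the Viterbi algorithm, can be sketched in Python as follows. This is a simplified illustration: the candidate lists stand in for the analyzer output, the log-probability tables and the floor value for unseen events are assumptions, and the emission term is omitted because each candidate already corresponds to the observed unvowelized word. The romanized forms are hypothetical placeholders.

```python
import math

FLOOR = math.log(1e-6)  # assumed back-off score for unseen events

def viterbi(candidates, start_logp, trans_logp):
    """candidates[i] lists the possible vowelized forms of word i.
    Returns the highest-scoring sequence under a first-order model."""
    best = {c: (start_logp.get(c, FLOOR), [c]) for c in candidates[0]}
    for options in candidates[1:]:
        step = {}
        for c in options:
            # Best predecessor for candidate c at this position.
            prev, (score, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + trans_logp.get((kv[0], c), FLOOR),
            )
            step[c] = (score + trans_logp.get((prev, c), FLOOR), path + [c])
        best = step
    return max(best.values(), key=lambda sp: sp[0])[1]

# Hypothetical candidates and probabilities (illustrative only).
candidates = [["kataba", "kutiba"], ["al-waladu", "al-walada"]]
start = {"kataba": math.log(0.4), "kutiba": math.log(0.6)}
trans = {
    ("kataba", "al-waladu"): math.log(0.9),
    ("kataba", "al-walada"): math.log(0.1),
    ("kutiba", "al-waladu"): math.log(0.1),
    ("kutiba", "al-walada"): math.log(0.2),
}
path = viterbi(candidates, start, trans)
```

In this toy case the first word's locally most probable form is not on the best path, which is the situation where decoding the whole sequence matters.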

REFERENCE
[1] Debili, Fathi & Hadhemi Achour (1998) Voyellation automatique de l’arabe. In
Proceedings of the workshop on Computation approaches to Semitic languages, COLING-
ACL ’98, pages 42–49.
[2] Maamouri, Mohamed, Ann Bies, and Seth Kulick. (2006) Diacritization: a challenge to
Arabic treebank annotation and parsing. In Proceedings of the British Computer Society
Arabic NLP/MT Conference.
[3] Zitouni, Imed, Jeffrey S. Sorensen, and Ruhi Sarikaya. (2006) Maximum entropy based
restoration of arabic diacritics. In Proceedings of the 21st International Conference on
Computational Linguistics and 44th Annual Meeting of the Association for Computational
Linguistics. Workshop on Computational approaches to Semitic Languages, Sydney,
Australia. July 2006, pages 577– 584.
[4] Vergyri, Dimitra & Katrin Kirchhoff. (2004) Automatic diacritization of arabic for
acoustic modeling in speech recognition. In Proceedings of the Workshop on Computational
Approaches to Arabic Script-based Languages. COLING, Geneva, pages 66–73.
[5] Messaoudi, Abdel, Lori Lamel, and Jean-Luc Gauvain. (2004) The limsi rt04 b arabic
system. In Proceedings DARPA RT04, Palisades NY.
[6] Elshafei, Moustafa, Husni Al-Muhtaseb, and Mansour Alghamdi. (2006) Machine
generation of arabic diacritical marks. In The 2006 World Congress in Computer Science
Computer Engineering, and Applied Computing. Las Vegas, USA., pages 128–133.
[7] Emam, Ossama and Volker Fischer. (2005) Hierarchical approach for the statistical
vowelization of arabic text. Technical report, IBM Corporation Intellectual Property Law,
Austin, TX, US.
[8] Schlippe, Tim, ThuyLinh Nguyen, and Stephan Vogel. (2008) Diacritization as a
machine translation problem and as a sequence labeling problem. In 8th AMTA conference,
Hawaii, pages 21–25.
[9] Gal, Yaakov. (2002) An hmm approach to vowel restoration in arabic and hebrew. In
Proceedings of the Workshop on Computational Approaches to Semitic Languages-
Philadelphia- Association for Computational Linguistics, pages 27–33.
[10] Nelken, Rani and Stuart M. Shieber. (2005) Arabic diacritization using weighted finite-
state transducers. In Proceedings of the ACL 2005 Workshop On Computational Approaches
To Semitic Languages, Ann Arbor, Michigan, USA,, pages 79–86.
[11] Habash, Nizar and Owen Rambow. (2007) Arabic diacritization through full
morphological tagging. In Proceeding NAACL-Short ’07 Human Language Technologies
2007: The Conference of the North American Chapter of the Association for Computational
Linguistics - Companion Volume - Short Papers Rochester - New York- USA, pages 53–56.
[12] Bebah, Mohamed Ould Abdallahi Ould, Abdelouafi Meziane, Azzeddine Mazroui, and
Abdelhak Lakhouaja. (2012) Approche morpho-statistique pour la voyellation des texts
arabes. Journal of Computer Science and Engineering, 5(1).
[13] Bebah, Mohamed Ould Abdallahi Ould, Abdelouafi Meziane, Azzeddine Mazroui, and
Abdelhak Lakhouaja. (2011) Alkhalil morpho sys. In 7th International Computing
Conference in Arabic, May 31- June 2, 2011, Riyadh, Saudi Arabia.

[14] El-Sadany, T and M Hashish. (1988) Semi-automatic vowelization of arabic verbs. In
10th NC Conference, Jeddah, Saudi Arabia.
[15] Manning, Chris and Hinrich Schutze. (1999) Foundations of statistical natural language
processing. Massachusetts Institute of Technology Press - Library of Congress Cataloging in
publication Information.
[16] Deltour, Amelie. (2003) Methodes statistiques pour la voyellisation des texts arabes.
Master’s thesis, ENSIMAG-Karlsruhe University.
AUTHORS
Mohamed Ould Abdallahi Ould Bebah Researcher at Doha Institute for
Graduate Studies since 2013. "Doctorat" in Computer Sciences, Mohamed I
University, Oujda, Morocco, 2013. "DESA" in "Numerical Analysis, Computer
science and Signal Processing" from Mohamed I University, 2005. Member of
Arabic NLP unit, LaRI Laboratory, Mohamed I University since 2005. Member
of Language Studies unit at the Center of Social and Human Studies and
Researches (CERHSO) in Oujda since 2005. Member of Arabic Language Engineering
Society in Morocco (ALESM) since 2012.
Amine CHENNOUFI Master in Computer Sciences from Mohamed I
University, Oujda, Morocco, 2010. Engineering degree in Meteorology from the
National School of Meteorology (ENM) in Toulouse, France, 1994. Since
January 2011, he has been preparing his PhD thesis in Arabic Natural Language
Processing within the Computer Research Laboratory (LaRI). His research
interests are mainly in automatic vowelization of the Arabic language. Professionally, he is
in charge of the Meteorological Centre of Oujda Airport.
Azzeddine Mazroui "Doctorat d’Etat" in Numerical Analysis, University
Mohammed I Morocco, 2000. PHD in Probability and Statistics, Pierre & Marie
Curie University France, 1993. Professor of mathematics and Computer Sciences
in University Mohammed I. Member of Computer Research Laboratory (LaRI).
Director of the ANLP unit in the LaRI laboratory
Abdelhak Lakhouaja "Doctorat d’Etat" in Computer Sciences, University
Mohammed I Morocco, 2000. Professor of Computer Sciences in University
Mohammed I. Member of Computer Research Laboratory (LaRI). Cofounder of
the ANLP unit in the LaRI laboratory.

WORD SENSE DISAMBIGUATION USING WSD SPECIFIC WORDNET OF POLYSEMY WORDS
Udaya Raj Dhungana1, Subarna Shakya2, Kabita Baral3 and Bharat Sharma4
1, 2, 4 Department of Electronics and Computer Engineering, Central Campus, IOE, Tribhuvan University, Lalitpur, Nepal
3 Department of Computer Science, GBS, Lamachaur, Kaski, Nepal
ABSTRACT
This paper presents a new model of WordNet that is used to identify the correct sense
of a polysemy word based on clue words. The clue words are the related words for each sense
of a polysemy word, as well as for each single sense word. The conventional
WordNet organizes nouns, verbs, adjectives and adverbs into sets of synonyms
called synsets, each expressing a different concept. In contrast to the structure of WordNet,
we developed a new model of WordNet that organizes the different senses of polysemy
words, as well as the single sense words, based on the clue words. These clue words
are used to identify the correct meaning of the polysemy word in the given context
using knowledge based Word Sense Disambiguation (WSD) algorithms. A clue word
can be a noun, verb, adjective or adverb.
KEYWORDS
Word Sense Disambiguation, WordNet, Polysemy Words, Synset, Hypernymy, Context
word, Clue Word
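
The idea of selecting a sense by matching context words against per-sense clue words can be sketched in Python as a simple overlap count, in the spirit of Lesk-style knowledge based WSD. The clue-word entries and the scoring rule below are hypothetical illustrations; the structure of the authors' WSD specific WordNet and their exact algorithm are not given in the abstract.

```python
def disambiguate(context_words, sense_clues):
    """Return the sense whose clue words overlap most with the context, plus all scores."""
    context = {w.lower() for w in context_words}
    scores = {sense: len(context & clues) for sense, clues in sense_clues.items()}
    return max(scores, key=scores.get), scores

# Hypothetical clue-word entries for the polysemy word "bank".
bank = {
    "financial institution": {"money", "loan", "deposit", "account"},
    "river side": {"river", "water", "shore", "fishing"},
}
sentence = "He sat on the bank of the river fishing"
sense, scores = disambiguate(sentence.split(), bank)
```

The context words "river" and "fishing" match the clue words of the second sense, so that sense is selected.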

REFERENCES
[1] N. Ide and J. Véronis, “Word sense disambiguation: The state of the art,” Computational
Linguistics, pp. 1–40, 1998.
[2] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, “Introduction to
wordnet: An on-line lexical database,” International Journal of Lexicography, 1998.
[3] U. R. Dhungana and S. Shakya, “Word sense disambiguation in nepali language,” in The
Fourth International Conference on Digital Information and Communication Technology and
Its Application (DICTAP2014), Bangkok, Thailand, 2014, pp. 46–50.
[4] M. E. Lesk, “Automatic sense disambiguation using machine readable dictionaries: How
to tell a pine cone from an ice cream cone,” in SIGDOC Conference, Toronto, Ontario,
Canada, 1986.
[5] S. Banerjee and T. Pedersen, “An adapted lesk algorithm for word sense disambiguation
using wordnet,” in Third International Conference on Intelligent Text Processing and
Computational Linguistics, Gelbukh, 2002.
[6] M. Sinha, M. K. Reddy, P. Bhattacharyya, P. Pandey, and L. Kashyap, “Hindi word sense
disambiguation,” Master’s thesis, Indian Institute of Technology Bombay, Mumbai, India,
2004.
[7] N. Shrestha, A. V. H. Patrick, and S. K. Bista, “Resources for nepali word sense
disambiguation,” in IEEE International conference on Natural Language Processing and
Knowledge Engineering (IEEE NLP-KE’08), Beijing, China, 2008.
[8] P. Bhattacharyya, P. Pande, and L. Lupu, “Hindi wordnet,” Indian Institute of
Technology Bombay, Mumbai, India, Tech. Rep., 2008.
[9] N. Shrestha, A. V. H. Patrick, and S. K. Bista, “Nepali word sense disambiguation using
lesk algorithm,” Master’s thesis, Kathmandu University, Dhulikhel, Kavre, Nepal, 2004.

AN UNSUPERVISED APPROACH TO DEVELOP STEMMER
Mohd. Shahid Husain
Department of Information Technology, Integral University, Lucknow
ABSTRACT
This paper presents an unsupervised approach to developing a stemmer for the Urdu and
Marathi languages. In the last few years, a wide range of information in Indian regional
languages has been made available on the web in the form of e-data. However, access to
these data repositories is very low because efficient search engines and retrieval systems
supporting these languages are very limited. Hence, automatic information processing and
retrieval has become an urgent requirement. To train the system, training datasets taken
from CRULP [22] and the Marathi corpus [23] are used. For generating suffix rules, two
different approaches are proposed: frequency-based stripping and length-based stripping.
The evaluation was made on 1200 words extracted from the Emille corpus. The experimental
results show that, for Urdu, the frequency-based suffix generation approach gives a
maximum accuracy of 85.36%, whereas the length-based suffix stripping algorithm gives a
maximum accuracy of 79.76%. For Marathi, the system gives 63.5% accuracy with
frequency-based stripping and achieves a maximum accuracy of 82.5% with length-based
suffix stripping.
KEYWORDS
Stemming, Morphology, Urdu stemmer, Marathi stemmer, Information retrieval.
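
The abstract does not spell out the two suffix-stripping strategies, so the following Python sketch shows only one plausible reading. Everything in it is an assumption for illustration: the function names `learn_suffixes` and `stem`, the `min_stem`, `max_suffix` and `min_count` parameters, and the use of a small English word list instead of Urdu or Marathi data. Suffix rules are learned without supervision by counting word endings in a word list; a word is then stemmed by removing either the most frequent matching suffix or the longest matching suffix.

```python
from collections import Counter


def learn_suffixes(words, min_stem=2, max_suffix=4, min_count=2):
    """Count candidate suffixes over distinct words; keep the frequent ones.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    counts = Counter()
    for w in set(words):
        for k in range(1, max_suffix + 1):
            if len(w) - k >= min_stem:  # leave a stem of at least min_stem chars
                counts[w[-k:]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def stem(word, suffixes, min_stem=2, by="frequency"):
    """Strip one learned suffix, chosen by frequency or by length."""
    candidates = [
        s for s in suffixes
        if word.endswith(s) and len(word) - len(s) >= min_stem
    ]
    if not candidates:
        return word
    if by == "frequency":
        # Most frequent suffix wins; longer suffix breaks ties.
        best = max(candidates, key=lambda s: (suffixes[s], len(s)))
    else:
        # Longest suffix wins; higher frequency breaks ties.
        best = max(candidates, key=lambda s: (len(s), suffixes[s]))
    return word[: -len(best)]


# Toy English word list, used only to make the behaviour easy to follow.
corpus = [
    "walking", "walked", "walks", "talking", "talked", "talks",
    "jumping", "jumped", "jumps", "playing", "played", "plays",
]
rules = learn_suffixes(corpus)
by_frequency = stem("walking", rules)            # strips "ing"
by_length = stem("walking", rules, by="length")  # strips "king"
```

On this toy list the frequency criterion removes "ing" and the length criterion removes the longer but rarer "king", which shows how the two criteria can disagree and why their accuracy may differ from one language to another.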

REFERENCES
[1] Rizvi, J. et al. “Modeling case marking system of Urdu-Hindi languages by using
semantic information”. Proceedings of the IEEE International Conference on Natural
Language Processing and Knowledge Engineering (IEEE NLP-KE '05). 2005.
[2] Butt, M. King, T. “Non-Nominative Subjects in Urdu: A Computational Analysis”.
Proceedings of the International Symposium on Non-nominative Subjects, Tokyo, December,
pp. 525-548, 2001.
[3] Savoy, J. “Stemming of French words based on grammatical categories”. Journal of the
American Society for Information Science, 44(1), 1-9, 1993.
[4] Lovins Julie Beth: Development of a stemming algorithm. Mechanical Translation and
Computational Linguistics 11:22–31. (1968)
[5] Mokhtaripour, A., Jahanpour, S. “Introduction to a New Farsi Stemmer”. Proceedings of
CIKM Arlington VA, USA, 826-827, 2006.
[6] R. Wicentowski. "Multilingual Noise-Robust Supervised Morphological Analysis using
the Word Frame Model." In Proceedings of Seventh Meeting of the ACL Special Interest
Group on Computational Phonology (SIGPHON), pp. 70-77, 2004.
[7] Rizvi, Hussain M. “Analysis, Design and Implementation of Urdu Morphological
Analyzer”. SCONEST, 1-7, 2005.
[8] Krovetz, R. “View Morphology as an Inference Process”. In the Proceedings of 5th
International Conference on Research and Development in Information Retrieval, 1993.
[9] Porter, M. “An Algorithm for Suffix Stripping”. Program, 14(3): 130-137, 1980.
[10] Thabet, N. “Stemming the Qur’an”. In the Proceedings of the Workshop on
Computational Approaches to Arabic Script-based Languages, 2004.
[11] Paik, Pauri. “A Simple Stemmer for Inflectional Languages”. FIRE 2008.
[12] Sharifloo, A.A., Shamsfard M. “A Bottom up Approach to Persian Stemming”. IJCNLP,
2008
[13] Croft and Xu. “Corpus-Based Stemming Using Co occurrence of Word Variants”. ACM
Transactions on Information Systems (61-81), 1998.
[14] Kumar, A. and Siddiqui, T. “An Unsupervised Hindi Stemmer with Heuristics
Improvements”. In Proceedings of the Second Workshop on Analytics for Noisy
Unstructured Text Data, 2008.
[15] Kumar, M. S. and Murthy, K. N. “Corpus Based Statistical Approach for Stemming
Telugu”. Creation of Lexical Resources for Indian Language Computing and Processing
(LRIL), C-DAC, Mumbai, India, 2007.
[16] Qurat-ul-Ain Akram, Asma Naseer, Sarmad Hussain. “Assas-Band, an Affix-Exception-
List Based Urdu Stemmer”. Proceedings of ACL-IJCNLP 2009.
[17] http://en.wikipedia.org/wiki/Urdu
[18] http://www.bbc.co.uk/languages/other/guide/urdu/steps.shtml
[19] http://www.andaman.org/BOOK/reprints/weber/rep-weber.htm
[20] Natural Language processing and Information Retrieval by Tanveer Siddiqui, U S
Tiwary.
[21] Information retrieval: data structure and algorithms by William B. Frakes, Ricardo
Baeza-Yates.
[22] http://www.crulp.org/software/ling_resources.htm
[23] Marathi Corpus, http://www.cfilt.iitb.ac.in/marathi_Corpus/ , IIT Powai, Mumbai.
AUTHORS
Mohd. Shahid Husain received his M.Tech. from the Indian Institute of Information
Technology (IIIT-A), Allahabad, with Intelligent System as specialization. He is currently
pursuing a Ph.D. and working as an assistant professor in the Department of
Information Technology, Integral University, Lucknow.
