Experimental Evaluation of Open Source Data Mining Tools (WEKA and Orange)

Article in International Journal of Engineering Trends and Technology · August 2020
DOI: 10.14445/22315381/IJETT-V68I8P206S



International Journal of Engineering Trends and Technology (IJETT) – Volume 68 Issue 8 - Aug 2020

Ritu Ratra¹, Preeti Gulia²
¹ Research Scholar, Department of Computer Science and Applications, Maharshi Dayanand University, Rohtak, Haryana, India
² Assistant Professor, Department of Computer Science and Applications, Maharshi Dayanand University, Rohtak, Haryana, India

Abstract: Nowadays, every organisation can manage large datasets at minimal cost. To obtain useful information, however, the large volume of stored data must actually be put to use. Data mining is an ongoing process of searching for patterns and collecting useful information from large datasets for future use. Data mining is important in areas such as education, the military, e-business and healthcare. The main objective of the data mining process is to examine data from various sources in different ways and then assemble it to obtain useful information. This is done with the help of various tools and techniques. A number of data mining tools are available that can help researchers evaluate data. These tools act as an interface that receives the data and extracts meaningful patterns from large datasets. Selecting the best tool for a given requirement is not an easy task. To find the best data mining tool for classification problems, the tools must be compared on the basis of different parameters. In this paper, the data mining tools WEKA and Orange are analysed on the basis of such parameters. The main objective of this comparison is to help researchers select the more suitable of these two tools.

Keywords: Classification, Naïve Bayes, Random Forest tree, WEKA, Orange, Precision, Recall.

I. INTRODUCTION
At present, data is growing day by day along many dimensions. It is very difficult for a person to analyse a large volume of data for sound decision making. Hence, data mining is needed to extract valuable and useful information from the available data. Data mining is the process of finding the most useful knowledge in the large volume of data available in databases or data repositories. Classification is one of the most important problems in data mining; it consists of finding rules that divide the given data into different classes. These classes are predefined. Vast amounts of data of different types are available in the digital world. Processing that data manually is a very time-consuming task. So, there is a need for automated tools that can help the researcher convert messy data into useful information. Over the past few years, many data mining software tools have been developed to address this problem. Some of them are freely available as open-source tools. The open-source practice of sharing implementations of different machine learning algorithms can be highly beneficial for the whole field [10].

In this paper, a comparative study is conducted among classification algorithms, namely Random Forest, K-Nearest Neighbour and Naïve Bayes, using the WEKA and Orange tools. The evaluation metrics Precision and Recall are used to analyse the performance of both tools with these classification algorithms. The following classification algorithms have been used for the experimentation:
• Naïve Bayes: Naïve Bayes classifiers are a group of simple probabilistic algorithms based on Bayes' theorem. They apply the theorem with strong independence assumptions between the features.
• K-Nearest Neighbour: It is a simple classifier that stores all available cases and then classifies new cases based on a similarity measure, e.g., distance functions.
• Random forest: It is closely related to the decision tree classifier, but it adds randomness to the model when the trees are built. It can produce good results even without hyperparameter tuning. It builds several decision trees and then combines them to produce a more stable prediction.

To handle huge volumes of data, several tools are available to the user. Moreover, it is not easy to include all features in a single tool. That is why a number of different tools have been introduced [8][10]. In this paper, two data mining tools, WEKA and Orange, are compared. These tools have different characteristics, functionality and capabilities. Researchers can use

ISSN: 2231-5381 http://www.ijettjournal.org Page 30



these according to the requirements of their research activities. These tools are continuously upgraded with new features as users' needs change. Dealing with the complexity of huge data remains difficult.

The rest of the paper is organised as follows: Section II describes open source software, Section III presents the comparative study of WEKA and Orange on the basis of parametric comparison and experimental analysis, and Section IV discusses conclusions and future scope.

II. OPEN SOURCE SOFTWARE

Open source software is computer software whose source code is publicly available to users under a license. Under this license the copyright holder permits users to use the software, inspect and modify it, and distribute it to anyone. Open source software is cheap and flexible because it is developed by a group of contributors rather than a single programmer. The common open-source licenses are the GNU General Public License (GPL) (GNU.org, 2015a; GNU.org, 2015b), the Mozilla Public License (MPL), the Berkeley Software Distribution (BSD) license, the Netscape Public License (NPL) and the GNU Lesser General Public License (LGPL) [10].

Many open-source data mining tools are available, such as KNIME, RapidMiner, Orange, WEKA and R. These tools bundle a set of techniques and algorithms that are helpful for data analytics. Researchers can use them for classification, clustering and visualization of data. They are also useful for regression analysis, predictive analytics, etc. Each tool offers its own functionality to help users with their work. In this paper, the WEKA and Orange tools are described.

A. WEKA: WEKA is a popular toolkit for learning and applying machine learning algorithms. It was originally developed at the University of Waikato in New Zealand. WEKA is a data mining tool that supports data pre-processing. Attribute selection is a notable feature of WEKA; it can improve the effectiveness and accuracy of models built on the selected data. WEKA comes with these interfaces: a command-line interface (CLI), Explorer, Experimenter, Knowledge Flow and the WEKA workbench. Explorer is used to define the data source, prepare data, select algorithms and visualize results. The Experimenter is helpful for comparing different algorithms on the same dataset.

In WEKA, secondary data can be used for analysis. A researcher can apply an algorithm to a dataset and analyse the results to make decisions about the data; predictions can also be generated for new instances. Although this tool supports many model evaluation metrics, it lacks many data survey and visualization methods [6]. WEKA is oriented more towards classification and regression and less towards descriptive statistics and clustering methods. WEKA has limited support for big data and semi-supervised learning [11]. WEKA is freely available for download. Popular features of WEKA are shown in Figure 1.

As shown in the figure, the best-known features of WEKA are as follows. It is an open source data mining tool based on the Java language. It is easy for beginners to understand and use, and it can run and compare several algorithms. It can perform different data mining activities including data preprocessing, clustering, classification, association rules and knowledge discovery. A number of built-in features make WEKA easy to use. Researchers can use it for analysis without any knowledge of programming.
A. WEKA: WEKA is a popular toolkit for learning
the machine learning algorithm. It was originally

• Open source data mining tool
• Java-based tool
• Platform independent
• Data preprocessing, classification rules, regression, clustering, association rules, visualization, feature selection and improving knowledge discovery
• No programming or coding required
• Provides access to SQL databases
• Provides various machine learning algorithms for data mining tasks

Figure 1: Features of WEKA


B. Orange

Orange is also a freely available open source data mining tool. It is useful for exploratory data analysis and visualization. It provides a platform for experiment selection. Orange is very effective when innovation, reliability or quality is involved [12], [13]. Orange Canvas provides a visual programming interface with a well-structured view of the different features. These features are depicted in Figure 2.
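
The visual programming model behind Orange Canvas, in which widgets are wired output to input, can be illustrated with a small sketch. The code below is plain Python and does not use Orange's actual API; the `Widget` class, the toy threshold learner and the sample rows are all hypothetical and exist only to show the dataflow idea.

```python
class Widget:
    """Hypothetical canvas node: a named function plus its input connections."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.inputs = []

    def connect(self, upstream):
        """Wire the output of `upstream` into this widget; returns self for chaining."""
        self.inputs.append(upstream)
        return self

    def run(self):
        """Evaluate upstream widgets first, then apply this widget's function."""
        return self.func(*(w.run() for w in self.inputs))


def train_threshold(rows):
    """Toy learner: split at the midpoint between the two class means."""
    def mean(label):
        values = [x for x, y in rows if y == label]
        return sum(values) / len(values)

    cut = (mean("a") + mean("b")) / 2
    return lambda x: "a" if x < cut else "b"


# Toy "File" widget: emits (feature, label) rows instead of reading a real file.
file_widget = Widget("File", lambda: [(1.0, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")])

# "Learner" receives the data; "Predictions" receives both the model and the data.
learner = Widget("Learner", train_threshold).connect(file_widget)
predict = Widget("Predictions", lambda model, rows: [model(x) for x, _ in rows])
predict.connect(learner).connect(file_widget)

print(predict.run())  # ['a', 'a', 'b', 'b']
```

Each widget only knows its upstream connections, so a workflow is assembled by connecting nodes rather than by writing a script, which is the property the canvas exposes graphically.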

Figure 2: Features of Orange tool (diagram with ORANGE at the centre, surrounded by: Visualizations, Classification, Unsupervised learning, Regression, Evaluation, Prototype implementations, Visualization using Qt, Association)

Figure 2 depicts different features of Orange data mining. Data visualization, classification, evaluation, unsupervised learning, association, visualization using Qt and prototype implementations are some well-known features of Orange. Orange's cross-platform interface is built on Qt, a UI framework in which developers can write applications using C++ or a CSS- and JavaScript-like language. An Orange workflow is represented visually using widgets, for example for reading a file or training an SVM classifier. Every widget is self-explanatory, i.e. a short description of it is available within the interface. To build a program, widgets are first placed on the canvas and then their inputs and outputs are connected. Orange offers fewer widgets than some other tools.

III. COMPARATIVE ANALYSIS

The activity of a tool is measured by the frequency of its updates and the time of its latest update. When two tools are compared, it is necessary to compare them both parametrically and experimentally; only then can reliable results be achieved. We therefore start with the parametric comparison and then analyse the experimental results.

A. Parametric Comparison: In the parametric comparison, the characteristics of the tools are taken from previously available sources. These characteristics are listed in Table I. Some characteristics are common to both tools; for example, both offer a Graphical User Interface (GUI) and a command line [18], [19].


Table I: General Characteristics of Open-Source DM Tools WEKA and Orange

Parameters | WEKA | ORANGE
Company Name | University of Waikato, New Zealand | University of Ljubljana, Slovenia
Source | http://www.cs.waikato.ac.nz/ml/weka/ | http://orange.biolab.si
Programming language | Java | C++, Python
Release date | 1993 | 1996
License | GNU General Public License | Open-source, GNU GPLv3
Availability | Open Source | Open Source
Current Version | 3.8 | 3.24.1
Areas | Machine learning, data visualization, time series and analysis, text mining, fraud detection | Marketing, Direct Mail Financial Service, Manufacturing, Health Care, Military
Portability | Cross Platform | Cross Platform
GUI/Command line | Both | Both

B. Technical comparison of WEKA and Orange

To make a technical comparison between these tools, the two free data mining and knowledge discovery tools are first downloaded. The datasets to be used are then specified, and some classification algorithms are selected to test the performance of the tools. Precision and Recall are among the most popular model evaluation metrics, and they are used for the comparison in this paper.

1) Precision: Precision is the positive predictive value. It is defined as the average probability of relevant retrieval.

Precision = Number of true positives / (Number of true positives + Number of false positives)

2) Recall: Recall is the average probability of complete retrieval.

Recall = Number of true positives / (Number of true positives + Number of false negatives)

Figure 3: Precision and Recall [15]

Data set: The Heart Disease dataset is used for this work. It is taken from the UCI Machine Learning Repository, and the Cleveland heart disease dataset is selected for the study. It has 303 instances and 76 attributes.

The comparison between the tools is shown in Table II and Table III.

Table II: Comparative study of WEKA and Orange tool, Precision Metric

Classifier | WEKA (%) | Orange (%)
Naïve Bayes | 83.7 | 82.4
Random Forest | 81.8 | 77.9
k-nearest | 75.3 | 58.0
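
The Precision and Recall definitions used in this section can be checked with a few lines of code. The following is a minimal sketch in plain Python; the function name `precision_recall` and the sample labels are illustrative assumptions and are not taken from the paper's experiments. It computes both metrics for a single class treated as positive; the paper does not state how per-class values were averaged in WEKA or Orange, so no averaging is attempted here.

```python
from collections import Counter


def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) with `positive` treated as the positive class."""
    pairs = Counter(zip(y_true, y_pred))
    tp = sum(n for (t, p), n in pairs.items() if t == positive and p == positive)
    fp = sum(n for (t, p), n in pairs.items() if t != positive and p == positive)
    fn = sum(n for (t, p), n in pairs.items() if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels (1 = disease present, 0 = absent); not data from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

p, r = precision_recall(y_true, y_pred, positive=1)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.750 recall=0.750
```

In this toy case there are 3 true positives, 1 false positive and 1 false negative, so both metrics equal 3/4.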


Table III: Comparative study of WEKA and Orange tool, Recall Metric

Classifier | WEKA (%) | Orange (%)
Naïve Bayes | 83.7 | 80.6
Random Forest | 81.9 | 73.4
k-nearest | 75.2 | 54.7

The Naïve Bayes classifier is well suited when the dimensionality of the input data is high, and it is widely applied in artificial intelligence. In the comparative study on the heart disease dataset, Naïve Bayes in Orange gives a precision of 82.4% and a recall of 80.6%, while in WEKA the precision is 83.7% and the recall is 83.7%. WEKA therefore achieves better precision and recall than Orange with the Naïve Bayes classifier. The same holds for the Random Forest and k-nearest classifiers. With Random Forest, the precision in Orange is 77.9% and the recall is 73.4%, while in WEKA the precision is 81.8% and the recall is 81.9%. With the k-nearest algorithm, the precision in Orange is 58% and the recall is 54.7%, while in WEKA the precision is 75.3% and the recall is 75.2%.

IV. CONCLUSION AND FUTURE STUDY
This paper presents a study of two open source data mining tools, WEKA and Orange, along with their features. Both tools have their own merits and demerits. The paper compares the tools through experimental analysis and through their parameters. The comparative study is based on a specific dataset and set of algorithms, and the results may vary with different datasets or algorithms. The comparative analysis is helpful for learning about and selecting data mining tools according to the application area. From the experimental study, it is concluded that WEKA performed better than Orange on this dataset for all three classifiers tested. WEKA has most of the features desired in a fully functional and user-friendly platform for classification problems, so WEKA can be recommended for classification problems in data mining. In future work, different datasets and different problems such as clustering and association rule mining will be studied using these tools.

ACKNOWLEDGEMENTS
The authors are thankful to the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/heart+Disease) for providing the dataset.

REFERENCES

[1] Jović, A., Brkić, K., & Bogunović, N. (2014). "An overview of free software tools for general data mining". Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention, (May), 26–30. Retrieved from http://www.zemris.fer.hr/~ajovic/articles/MIPRO 2014_final.pdf
[2] Alcalá-Fdez, J., Sánchez, L., & García, S. (2009). "KEEL: a software tool to assess evolutionary algorithms for data mining problems". Soft Computing. Retrieved from http://link.springer.com/article/10.1007/s00500-008-0323-y
[3] Collier, K., Carey, B., & Marjaniemi, C. (1999). "A Methodology for Evaluating and Selecting Data Mining Software", 00(c), 1–11.
[4] Sonnenburg, S., Braun, M., & Ong, C. (2007). "The need for open source software in machine learning", 8, 2443–2466. Retrieved from http://researchcommons.waikato.ac.nz/handle/10289/3928
[5] Chen, X., Ye, Y., Williams, G., & Xu, X. (2007). "A survey of open source data mining systems". Emerging Technologies in Knowledge Discovery and Data Mining, (60603066), 3–14. Retrieved from http://link.springer.com/chapter/10.1007/978-3-540-770183_2
[6] Jović, A., Brkić, K., & Bogunović, N. (2014). "An overview of free software tools for general data mining". Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention, (May), 26–30. Retrieved from http://www.zemris.fer.hr/~ajovic/articles/MIPRO 2014_final.pdf
[7] Kalpana Rangra, Dr. K. L. Bansal, "Comparative Study of Data Mining Tools", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 6, 2014.
[8] Dr. Anil Sharma, Balrajpreet Kaur, "A Research Review on Comparative Analysis of Data Mining Tools, Techniques and Parameters", International Journal of Advanced Research in Computer Science, ISSN No. 0976-5697, Volume 8, No. 7, July – August 2017.
[9] I. H. Witten, E. Frank, M. A. Hall, "Data Mining: Practical Machine Learning Tools and Techniques", 3rd ed., Morgan Kaufmann, Elsevier: USA, 2011.
[10] Predictive Analytics [Online]. Available from: http://www.predictiveanalyticstoday.com/top-softwarefor-text-analysis-text-mining-text-analytics/
[11] Jović, A., Brkić, K., & Bogunović, N. (2014). "An overview of free software tools for general data mining". Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention, (May), 26–30. Retrieved from http://www.zemris.fer.hr/~ajovic/articles/MIPRO 2014_final.pdf
[12] http://www.kdnuggets.com/2015/12/top-7-newfeatures-orange-3.html/2
[13] Orange Data Mining, "Orange Data Mining Library Documentation Release 3".
[14] http://orange.biolab.si/
[15] http://Precision%20and%20recall%20-%20Wikipedia.PDF
[16] M. Hall, E. Frank, G. Holmes, B. Reutemann, I. H. Witten, "The WEKA Data Mining Software: An Update", SIGKDD Explorations, 2009.
[17] A. Wahbeh, "A Comparison Study between Data Mining Tools over some Classification Methods", International Journal of Artificial Intelligence, 2012.
[18] Swasti Singhal, Monika Jena, "A Study on WEKA Tool for Data Preprocessing, Classification and Clustering", International Journal of Innovative Technology and Exploring Engineering (IJITEE), Volume 2, Issue 6, 2013.
[19] http://www.ionos.com>digitalguide
[20] http://www.google.com
[21] Venkateswarlu Pynam, R. Roje Spanadna, Kolli Srikanth, "An Extensive Study of Data Analysis Tools (Rapid Miner,
Weka, R Tool, Knime, Orange)", SSRG International Journal of Computer Science and Engineering (SSRG-IJCSE), Volume 5, Issue 9, September 2018, ISSN: 2348-8387, pp. 4-11.
[22] http://opensourceforu.com/2017/03/top-10-open-source-datamining-tools/
[23] Nurdatillah Hasim, Norhaidah Abu Haris, "A Study of Open-Source Data Mining Tools for Forecasting", IMCOM '15, January 08-10, 2015, Bali, Indonesia.
[24] Witten, I. H., & Frank, E. (2005). "Data Mining: Practical Machine Learning Tools and Techniques" (2nd ed., p. 525).
[25] Sonnenburg, S., Braun, M., & Ong, C. (2007). "The need for open source software in machine learning", 8, 2443–2466. Retrieved from http://researchcommons.waikato.ac.nz/handle/10289/3928
[26] 12 data mining tools and techniques [Online]. Available: https://www.invensis.net/blog/data-processing/12-datamining-tools-techniques
[27] A. Kumar, et al., "Data mining: various issues and challenges for future", IJETA, 2014.
[28] H. Nasereddin, "New Technique to Deal with Dynamic Data Mining in the Database", IJRRAS, December 2012.
[29] J. Demšar and B. Zupan, "Orange: Data Mining Fruitful and Fun - A Historical Perspective", 2012.
[30] C. Shah, A. Jivani, "Comparison of data mining classification algorithms for breast cancer prediction", 4th ICCCNT, IEEE, 2013.
[31] P. Kakkar, A. Parashar, "Comparison of different clustering Algorithm using WEKA tool", International Journal of Advanced Research in Technology, Engineering and Science, 2014.
[32] N. Chauhan and N. Gautam, "Parametric comparison of data mining tools", IJATES, 2015.
[33] A. Gupta, N. Chetty, S. Shukla, "A classification method to classify High Dimensional data", IEEE, 2015.
