Farrell, D 2016 DataExplore: An Application for General Data Analysis in Research and Education. Journal of Open Research Software, 4: e9, DOI: http://dx.doi.org/10.5334/jors.94
SOFTWARE METAPAPER
DataExplore is an open source desktop application for data analysis and plotting intended for use in both
research and education. It is intended primarily for non-programmers who need to carry out relatively
advanced table manipulation. Common tasks that might not be familiar to spreadsheet users, such as
table pivot, merge and join functionality, are included as core elements. New columns can be created using
arithmetic expressions and pre-defined functions. Table filtering may be done with simple
boolean queries. The other primary feature is rapid dynamic plot creation from selected data. Multiple
plots from various selections and data sources can also be composed using a grid layout. It is thus possible
to create publication quality plots. A plugin system allows the addition of features, with several plugins
already available by default. The program is written in Python and is based on the PyData suite of Python
libraries.
(1) Overview
Introduction
Recent years have seen a rapid growth in the importance of data handling and analysis in the sciences. Such is the complexity and volume of data that it has given rise to data scientist specialisations within many fields. In the biological sciences, the data analysis task is often assigned to a bioinformatician. Such expert skills are essential when the complexity of the task is too much for experimentalists unfamiliar with advanced computational techniques. However, there is a danger of over-reliance upon data analysts, particularly in cases where an analysis can be done with relatively basic computational skills.

Spreadsheets are widely used in scientific research. They have also advanced a great deal in sophistication [1] since their introduction and are by now a standard tool for anyone dealing with numerical data. In the sciences there has however been a tendency to rely on the spreadsheet for tasks that it was not originally designed for [2]. Though advanced features like pivot tables are available, many general users are not aware of them and make the worksheet more complicated than it needs to be. Even if a spreadsheet can perform a task using a macro, it is often much more complex to accomplish than it would be with a few lines of code [3]. Spreadsheets have another more serious problem in that they make reproducible analysis very difficult. This arises out of multiple factors. The opaque way in which cell based formulae are often used makes it hard to track calculations. The use of conditional formatting to present results makes it almost impossible to interpret them in another format. Also, because spreadsheets may be used for data entry and analysis at the same time, ad hoc changes to the raw data are encouraged. Finally, statistical analyses using the most common commercial product, Excel, have been criticized [4]. These problems make reproducible science difficult.

For the general scientific user these limitations can be partly overcome by using other tools to complement spreadsheets. Many researchers use separate plotting packages such as GraphPad Prism [5] and statistical tools like SPSS [6] for analysis. However this gives rise to another problem: these are commercial applications. They are very expensive and the source code is closed. This has serious implications for reproducibility. The tendency to use commercial software is highly prevalent in academia even though there are some viable open source solutions available. This may partly be a problem of general awareness on the part of the user. Veusz [7] and SciDAVis [8] are good examples of free plotting packages that compare favourably with commercial products, though they do not seem to be widely known. Commercial tools are also frequently rather feature heavy and complex for the general user, with entire courses devoted to teaching them.
Art. e9, p. 2 of 8 Farrell: DataExplore
Scientists in certain data intensive subject areas are now beginning to adapt to scripting languages like R and Python [9]. These are a much better foundation to build future skills on and, since they are open platforms, they allow users to publish full end-to-end instructions that anyone in the world can reproduce for free. They also facilitate workflows with large data [10]. Adoption of these scripting tools is easier said than done because of the intimidating nature of programming to many. This is one reason R might not easily gain traction with experimentalists, since it requires at least some programming skill. RStudio [11] goes a long way to address this issue as it provides a user friendly environment for newer users. Though much progress has been made on new web based tools, it is still a challenge to build highly interactive applications inside the browser. There is therefore still space for graphical desktop tools to provide a familiar complement to spreadsheets for non-programmers.

Objectives
DataExplore is intended for rapid exploratory analysis of tabulated data. Quick transformation and visualization of data are core features. The use of the Python PyData stack [12] as the back-end means a large number of well tested algorithms are already available. The dichotomy between programming and tools with a graphical interface is usually a sharp one [13], with users preferring one or the other approach. DataExplore is also intended to help bridge this gap by making readily available processing steps normally familiar only to data analysts. The main objectives of the software are:

• allow quick exploration and visualization of a data set
• allow a familiar graphical interface but implement more advanced table analysis features than currently accessible in spreadsheets
• help to bridge the gap between graphical interface and command driven or programmatic approaches to data analysis
• scale to medium sized datasets, i.e. a table of the order of 1–5 million rows that will fit in the memory of most computers
• allow publication quality plots to be made easily and encourage clear scientific visualization [14]

Implementation and architecture
Methodology
The core R data structure is called data.frame, a versatile matrix structure that stores multiple data types [15]. It has been replicated in Python as a core component of the Pandas library [16]. This has opened the way to much more convenient R style data analysis in Python. Pandas DataFrame structures, which use the efficient ndarray data container class in numpy [17], are now well integrated into other Python data analysis libraries, creating a very useful ecosystem. These libraries are often grouped together as part of the PyData stack [18]. DataExplore is based on using DataFrames to present tabulated data and on matplotlib [19] for plotting. Matplotlib is a very well established plotting library for Python and produces publication-quality figures in a variety of hard copy formats and interactive environments across platforms.

In some plotting packages like Veusz, SciDAVis and mjograph [20], plots are designed by the addition of multiple plot elements to which data is attached. DataExplore is more data centric, like RStudio, and plots are generated dynamically from the currently selected data and chosen options. The idea is that rows and columns can quickly be chosen or tables edited and new plots generated instantly with minimal mouse clicks. Plots cannot currently be interactively edited, though this is an option that could be added later.

Architecture
The software is written in Python and makes extensive use of the PyData libraries. These form an ecosystem of libraries that can provide a complete solution to data analysis, from visualization to machine learning. The graphical interface is built with Tkinter/ttk, the standard graphical tool kit for Python. Like other Python packages, the library is broken into modules which contain classes grouped by function. Table, plotting and dialog widgets are in their own modules as shown in Figure 1. The core class is the pandastable widget, which is a Tkinter canvas object. This is used to display a Pandas DataFrame via a model class that carries out changes to the DataFrame based on user interaction and stores some additional data about the table. This widget is designed to be re-used in any Tkinter application. The DataExplore application module itself is built around the table widget, a plot viewer module and several plugins.

User interface
The application consists essentially of a table and associated plot viewer, shown in Figure 2. Multiple sets of tables can be loaded and saved as single projects. For certain functions a child table or sub-table is created below the main one. This may be to store the results of a table manipulation such as an aggregation, or to paste in another table so that it can be joined to the main one. Another use would be to paste a portion of the selected data and plot it. The sub-table can be created and discarded as needed.

Unlike a spreadsheet, the focus is not on data entry. Though individual cell entry is possible, users are encouraged to keep their original data separate and unchanged. Results can be exported to csv or other formats if required. This is important to robust analysis. An undo/redo feature is not yet implemented but will likely be useful in the future when more complex series of processing steps need to be experimented with.

Plot options are laid out in a set of tabbed control panels below the plot, allowing the user to switch quickly between basic and other plotting modes. Currently a 3D plot mode and grid layout options are also available. Table functions are accessed either from the right toolbar (see Figure 2), the right-click context menu inside the table or from the main menu. Dialogs such as plugin interfaces are usually placed below the table.
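The division of labour described under Architecture, where a model object holds the DataFrame and applies edits on behalf of the widget, can be illustrated with a small sketch. This is not the pandastable code: the class name `TableModelSketch`, its methods, the `meta` dictionary and the sample data are all hypothetical and exist only to show the idea. It uses only standard Pandas calls (`DataFrame.copy`, `.loc` assignment and `DataFrame.eval`).

```python
import pandas as pd


class TableModelSketch:
    """Hypothetical stand-in for a table model; not pandastable's actual API."""

    def __init__(self, df):
        # Work on a copy so the caller's original data stays unchanged.
        self.df = df.copy()
        # Extra state about the table kept alongside the data (illustrative only).
        self.meta = {"show_index": False}

    def set_value(self, row, column, value):
        # A cell edit coming from the user interface is applied to the DataFrame.
        self.df.loc[row, column] = value

    def add_column(self, name, expression):
        # Create a new column from an arithmetic expression over existing columns.
        self.df[name] = self.df.eval(expression)


# Made-up example data with mixed column types.
raw = pd.DataFrame({
    "sample": ["a", "b", "c"],
    "mass": [1.0, 2.0, 4.0],
    "volume": [2.0, 2.0, 2.0],
})

model = TableModelSketch(raw)
model.set_value(0, "mass", 3.0)
model.add_column("density", "mass / volume")

print(model.df)
print(model.df.dtypes)  # each column keeps its own data type
```

Keeping the edits inside the model, and working on a copy, mirrors the advice in the text to leave the original data separate and unchanged.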
Figure 1: Outline of pandastable library modules. The graphical user interface is a scaffolding using these modules.
Features
Import of text files
Import of csv and general plain text formats is a standard feature of Pandas using the read_csv method, which supports many options. The most essential of these are available via the import dialog, accessible from the toolbar or by right-clicking anywhere in the table and using the context menu.

Row and column indexes
The index is a fundamental feature of the underlying DataFrame. It performs the central role of data alignment and of getting and setting subsets of the table. A more novel aspect is the use of “hierarchical” indexing. This is essentially a way of representing data with an arbitrary number of dimensions in a 2D table. In our program the use of multi-indexes is mostly implicit in the way the program works, but it opens the door to adding more useful functionality later on. For now the index can be displayed or hidden in the table and columns can be turned into indexes. This is useful for plotting since the index is often the implied x-axis.

Table filtering
Currently filtering of the table is done using a quite simple string query method. An entry box is used to enter the query and the table is updated accordingly. The main table is restored when the filters are cleared. The syntax is straightforward to learn for beginners and may be useful for teaching logical AND/OR/NOT row-wise operations.

Table manipulation
Common transformations such as transpose, aggregation, pivot and merge are supported. Results are mostly placed in the sub-table so as not to overwrite the main table. The sub-table can also be plotted from or copied into the main table or another sheet. For operations involving two tables (like concatenate or merge) the second dataset is loaded into the sub-table (by importing or pasting), which can then be joined to the main table.

Plotting behaviour
The design is oriented around quick generation of plots from the current selections. This means that the current plot is constantly overridden. However plots can be saved and recalled in the current session if required. To produce multiple plots in one figure a grid layout mode is used. Changing the number of rows/columns makes a finer grid and adjusting the row/column spans allows a variety of sub plot combinations to be created. When the user wants to add a new sub plot they simply select the row and column location at which to add it. An example is shown in Figure 3. It is also possible to use this method to make inset plots by overlaying them on the main plot.
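For readers who want to see what the table filtering and table manipulation features correspond to in code, the following is a minimal sketch using plain Pandas. It assumes the string queries behave like `DataFrame.query`, which the text does not state explicitly, and the data, column names and variable names are invented for illustration.

```python
import pandas as pd

# Made-up "main table".
main = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "group": ["control", "treated", "control", "treated"],
    "day": [1, 1, 2, 2],
    "value": [0.5, 1.5, 0.7, 2.1],
})

# Row-wise boolean filter written as a string, combining AND and NOT.
filtered = main.query("value > 0.4 and not (group == 'control' and day == 1)")

# Aggregation: mean value per group.
means = main.groupby("group")["value"].mean()

# Pivot: one row per day, one column per group.
pivoted = main.pivot_table(index="day", columns="group", values="value")

# Merge: a second table (the role played by the sub-table) joined on a shared key.
extra = pd.DataFrame({"sample": ["s1", "s2", "s3"], "batch": ["A", "A", "B"]})
merged = main.merge(extra, on="sample", how="left")

print(filtered)
print(means)
print(pivoted)
print(merged)
```

Each step returns a new DataFrame and leaves `main` untouched, which is the same reason the application places results in a sub-table rather than overwriting the main table.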
Figure 3: A figure generated directly from the application using the Titanic data. This uses the grid layout mode to
combine multiple sub plots together.
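A grid layout with row and column spans, plus an inset overlaid on a main plot, can be reproduced directly in matplotlib. The sketch below is only an illustration of the idea and is not taken from the DataExplore source; the grid size, the data and the output file name `grid_layout.png` are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(8, 6))
grid = GridSpec(2, 3, figure=fig)  # 2 rows x 3 columns

# Each sub plot is placed by its row/column location and span.
ax_wide = fig.add_subplot(grid[0, 0:2])    # row 0, spanning columns 0-1
ax_tall = fig.add_subplot(grid[0:2, 2])    # column 2, spanning both rows
ax_small1 = fig.add_subplot(grid[1, 0])    # single cell
ax_small2 = fig.add_subplot(grid[1, 1])    # single cell

x = [0, 1, 2, 3, 4]
ax_wide.plot(x, [v ** 2 for v in x])
ax_tall.barh(["a", "b", "c"], [3, 1, 2])
ax_small1.scatter(x, [4, 3, 2, 1, 0])
ax_small2.hist([1, 1, 2, 3, 3, 3, 4], bins=4)

# Inset plot: a small axes overlaid on the wide plot,
# positioned as [x0, y0, width, height] in axes-fraction coordinates.
inset = ax_wide.inset_axes([0.1, 0.5, 0.3, 0.4])
inset.plot(x, x)

fig.savefig("grid_layout.png", dpi=100)
```

A finer grid (more rows or columns) gives more freedom in choosing spans, at the cost of smaller individual cells.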
Categorical plots
Data can be grouped and plotted either by grouping by categorical columns in the plot dialog or by performing a groupby-aggregate step and plotting the resulting table. The factor plots plugin provides even more advanced plotting capabilities. Factor plots allow multiple comparisons to be made in a single graph. That is, you can split data by more than one variable along an axis or between plots. In seaborn these dimensions are called row/col (the plot dimensions), x, y (axes) and hue (grouping/color within plots). These concepts are illustrated in the seaborn documentation on factor plotting [21] and on the blog.

and the table updated to reflect the changes immediately. This should prove a useful way to teach coding skills in a familiar environment.

Documentation and usage
Documentation is provided in the form of a wiki on github at https://github.com/dmnfarrell/pandastable/wiki/. Specific case studies/tutorials along with links to screencasts and details of new features can be viewed on the blog at http://dmnfarrell.github.io/. The case studies provide a visual guide through real world examples. This blog will be kept up to date as the program is further developed.
provide their own plugins or extend the core application. The project can also be forked without restriction.

Competing Interests
The authors declare that they have no competing interests.

Acknowledgements
Thanks to Prof. Stephen Gordon for supporting work on this project.

References
1. Weathington, J 2015 5 things every data scientist should know about Excel. Available at http://www.techrepublic.com/article/5-things-every-data-scientist-should-know-about-excel/ [Accessed: 16-Jan-2016].
2. Burns, P 2014 Spreadsheet Addiction. Available at http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction/.
3. Moffitt, C 2014 Common Excel Tasks Demonstrated in Pandas. Available at http://pbpython.com/excel-pandas-comp.html.
4. McCullough, B D and Heiser, D A 2008 On the accuracy of statistical procedures in Microsoft Excel 2007. Comput Stat Data Anal, 52(10): 4570–4578. DOI: http://dx.doi.org/10.1016/j.csda.2008.03.004
5. GraphPad Software 2015 GraphPad Prism version 6.0. Available at http://www.graphpad.com/scientific-software/prism/.
6. IBM 2015 IBM SPSS Statistics. Available at http://www-01.ibm.com/software/analytics/spss/.
7. Sanders, J 2015 Veusz. Available at http://home.gna.org/veusz/.
8. Benkert, T, Franke, K and Standish, R 2007 SciDAVis. Available at http://scidavis.sourceforge.net/.
9. Buffalo, V 2015 Bioinformatics Data Skills. O’Reilly Media.
10. Heller, M 2015 Learn to crunch big data with R. Available at http://www.infoworld.com/article/2880360/big-data/learn-to-crunch-big-data-with-r.html [Accessed: 07-Jan-2016].
11. RStudio Inc. 2015 RStudio: Integrated Development for R. Boston MA.
12. Oliphant, T E 2007 (May) Python for Scientific Computing. Comput Sci Eng, 9(3): 10–20. DOI: http://dx.doi.org/10.1109/MCSE.2007.58
13. Ward, N 2013 Excel, SPSS, Minitab or R? Available at https://learnandteachstatistics.wordpress.com/2013/02/11/excel-spss-minitab-or-r/ [Accessed: 09-Sep-2015].
14. Rougier, N P, Droettboom, M and Bourne, P E 2014 Ten Simple Rules for Better Figures. PLoS Comput Biol, 10(9): e1003833. DOI: http://dx.doi.org/10.1371/journal.pcbi.1003833
15. Data science retreat 2013 R: the good parts. Available at http://blog.datascienceretreat.com/post/69789735503/r-the-good-parts [Accessed: 08-Jan-2016].
16. Mckinney, W 2015 Pandas, Python Data Analysis Library. Available at http://pandas.pydata.org/.
17. van der Walt, S, Colbert, S C and Varoquaux, G 2011 (March) The NumPy Array: A Structure for Efficient Numerical Computation. Comput Sci Eng, 13(2): 22–30. DOI: http://dx.doi.org/10.1109/MCSE.2011.37
18. The PyData Community 2015 PyData. Available at http://pydata.org/downloads/.
19. Hunter, J D 2007 (May) Matplotlib: A 2D Graphics Environment. Comput Sci Eng, 9(3): 90–95. DOI: http://dx.doi.org/10.1109/MCSE.2007.55
20. Tanahashi, M 2014 mjograph. Available at http://www.ochiailab.dnj.ynu.ac.jp/mjograph/.
21. Waskom, M 2015 Seaborn factorplot documentation. Available at http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.factorplot.html [Accessed: 12-Jan-2016].
22. Statsmodels Developers 2015 Statsmodels. Available at http://statsmodels.sourceforge.net/.
23. Waskom, M 2012 Seaborn. Available at http://stanford.edu/~mwaskom/software/seaborn/. DOI: http://dx.doi.org/10.1109/MCSE.2007.53
24. Pérez, F and Granger, B E 2007 (May) IPython: a System for Interactive Scientific Computing. Comput Sci Eng, 9(3): 21–29.
25. Kaggle 2012 Titanic: Machine Learning from Disaster. Available at https://www.kaggle.com/c/titanic [Accessed: 16-Jan-2016].
26. Farrell, D, Shaughnessy, R G, Britton, L, MacHugh, D E, Markey, B and Gordon, S V 2015 The Identification of Circulating MiRNA in Bovine Serum and Their Potential as Novel Biomarkers of Early Mycobacterium avium subsp paratuberculosis Infection. PLoS One, 10(7): e0134310. DOI: http://dx.doi.org/10.1371/journal.pone.0134310
27. Project Jupyter 2015 Jupyter Notebook. Available at http://jupyter.org/ [Accessed: 21-Jan-2016].
28. Travis CI Community 2011 Travis continuous integration. Available at https://travis-ci.org/.
29. Furuhashi, S 2008 MessagePack. Available at http://msgpack.org/index.html.
30. Tuininga, A 2014 cx_Freeze. Available at http://cx-freeze.sourceforge.net/.
How to cite this article: Farrell, D 2016 DataExplore: An Application for General Data Analysis in Research and Education.
Journal of Open Research Software, 4: e9, DOI: http://dx.doi.org/10.5334/jors.94
Copyright: © 2016 The Author(s). This is an open-access article distributed under the terms of the Creative Commons
Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited. See http://creativecommons.org/licenses/by/4.0/.