Data Warehouse and Decision Support On Integrated Crop Big Data
Vuong M. Ngo
E-mail: [email protected]
Nhien-An Le-Khac
E-mail: [email protected]
arXiv:2003.04470v1 [cs.DB] 10 Mar 2020
M-Tahar Kechadi
E-mail: [email protected]
Keywords: Data warehouse, decision support, crop Big Data, smart agriculture.
Reference to this paper should be made as follows: Ngo, V.M., Le-Khac, N.A. and
Kechadi, M.T. (2020) ‘Data Warehouse and Decision Support on Integrated Crop Big
Data’, The International Journal ..., Vol. x, No. x, pp.xxx–xxx.
Biographical notes: Vuong M. Ngo received the B.E., M.E. and PhD degrees in computer science from HCMC University of Technology in 2004, 2007 and 2013, respectively. He is currently a Senior Researcher at the School of Computer Science, UCD. Previously, he held positions as CIO, Vice-Dean and Head of Department of Information Technology at universities in Vietnam. His research interests include information retrieval, sentiment analysis, data mining, graph matching and data warehousing.
M-Tahar Kechadi was awarded the PhD and Master's degrees in computer science by University of Lille 1, France. He joined the UCD School of Computer Science in 1999. He is currently Professor of Computer Science at UCD. His research interests span the areas of data mining, data analytics, distributed data mining, heterogeneous distributed systems, grid and cloud computing, cybersecurity, and digital forensics. He is a Principal Investigator at the Insight Centre for Data Analytics and in the CONSUS project. He is a member of IEEE and ACM.
The International Journal, Vol. x, No. x, 2020 2
5) predict and adopt climate risks (Han et al., 2017). However, the datasets used in the mentioned studies are small. Besides, they focused on using visualisation techniques to help end-users understand and interpret their data.

Recently, many papers have been published on exploiting intelligent algorithms on sensor data to improve agricultural economics (Pantazi, 2016; Park et al., 2016; Hafezalkotob et al., 2018; Udiasa et al., 2018; Rupnik et al., 2019). In Pantazi (2016), the authors predicted crop yield by using self-organising maps, namely supervised Kohonen networks, counter-propagation artificial networks and XY-fusion. In Park et al. (2016), drought conditions were predicted by using three rule-based machine learning methods, namely random forest, boosted regression trees, and Cubist. To select the best olive harvesting machine, the authors in Hafezalkotob et al. (2018) applied target-based techniques to the main criteria, which are cost, vibration, efficiency, suitability, damage, automation, work capacity, ergonomics, and safety. To provide optimal management of nutrients and water, the paper Udiasa et al. (2018) exploited a multi-objective genetic algorithm to implement an E-Water system. This system enhanced food crop production at river basin level. Finally, in Rupnik et al. (2019) the authors predicted pest population dynamics by using time series clustering and structural change detection, which detected groups of different pest species. However, the proposed solutions are not scalable enough to handle agricultural Big Data; they present weaknesses in one of the following aspects: data integration, data schema, storage capacity, security and performance.

From a Big Data point of view, the papers Kamilaris et al. (2018) and Schnase et al. (2017) have proposed smart agricultural frameworks. In Kamilaris et al. (2018), the authors used Hive to store and analyse sensor data about land, water and biodiversity, which can help increase food production with less environmental impact. In Schnase et al. (2017), the authors moved toward a notion of climate analytics-as-a-service by building a high-performance, scalable data management and analytics platform based on modern cloud infrastructures, such as Amazon Web Services, Hadoop, and Cloudera. However, the two papers did not discuss how to build and implement a DW for precision agriculture.

The proposed approach, inspired by Schulze et al. (2007), Schuetz et al. (2018), Nilakanta et al. (2008) and Ngo et al. (2018), introduces ways of building an agricultural data warehouse (ADW). In Schulze et al. (2007), the authors extended the entity-relationship concept to model operational and analytical data, called the multi-dimensional entity-relationship model. They also introduced new representation elements and showed how the model can be extended to an analytical schema. In Schuetz et al. (2018), a relational database and an RDF triple store were proposed to model the overall datasets. The data is loaded into the DW in RDF format, and cached in the RDF triple store before being transformed into relational format. The actual data used for analysis was contained in the relational database. However, as the schemas used in Schulze et al. (2007) and Schuetz et al. (2018) were based on entity-relationship models, they cannot deliver the high performance that is the key feature of a data warehouse.

In Nilakanta et al. (2008), a star schema model was used. All data marts created by the star schemas are connected via some common dimension tables. However, a star schema is not enough to represent complex agricultural information, and it is difficult to create new data marts for data analytics. The number of dimensions of the DW proposed in Nilakanta et al. (2008) is very small: only three dimensions, Species, Location, and Time. Moreover, that DW concerns livestock farming. Overcoming the disadvantages of the star schema, the authors of Ngo et al. (2018) and Ngo and Kechadi (2020) proposed a constellation schema for an agricultural DW architecture in order to satisfy the quality criteria. However, they did not describe how to design and implement their DW.

3 Crop Big Data

3.1 Crop Datasets

The datasets were primarily obtained from an agronomy company, which extracted them from its operational data storage systems, research results, and field trials. In particular, we were given real-world agricultural datasets on iFarms, Business-to-Business (B2B) sites, technology centres and demonstration farms. These datasets were collected from several European countries and are presented in Figures 1 and 2 (Origin report, 2018). They describe more than 112 distribution points, 73 demonstration farms, 32 formulation and processing facilities, 12.7 million hectares of direct farm customer footprint and 60,000 trial units.

Figure 1: Data from UK and Ireland.
tables, views, indexes, and synonyms, which consist of some fact and dimension tables (Oracle document, 2017). The DW schema can be designed based on the model of the source data and the user requirements. There are three kinds of models, namely star, snowflake and fact constellation. Given its various uses, the ADW schema needs to have more than one fact table and should be flexible. So, the constellation schema, also known as the galaxy schema, should be used to design the ADW schema.

We developed a constellation schema for ADW; it is partially described in Figure 3. It includes a few fact tables and many dimension tables. The FieldFact fact table contains data about agricultural operations on fields. The Order and Sale fact tables contain data about farmers' trading operations. The key dimension tables are connected to their fact table. Some dimension tables are connected to more than one fact table, such as Crop and Farmer. Besides, the CropState, Inspection, Site, and WeatherReading dimension tables are not connected to any fact table. The CropState and Inspection tables are used to support the Crop table, while the Site and WeatherReading tables support the Field and WeatherStation tables. The FieldFact fact table saves the most important facts about the field: yield, water volume, fertiliser quantity, nutrient quantity, spray quantity and pest number. In the Order and Sale tables, the important facts needed by farm management are quantity and price.

Figure 4: Field and Crop dimension tables

Figure 5: Soil and Pest dimension tables

Table 1 Descriptions of other dimension tables

No. | Dim. tables | Particular attributes
1 | Business | BusinessID, Name, Address, Phone, Mobile, Email
2 | CropState | CropStateID, CropID, StageScale, Height, MajorStage, MinStage, MaxStage, Diameter, MinHeight, MaxHeight, CropCoveragePercent
3 | Farmer | FarmerID, Name, Address, Phone, Mobile, Email
4 | Fertiliser | FertiliserID, Name, Unit, Status, Description, GroupName
5 | Inspection | InspectionID, CropID, Description, ProblemType, Severity, ProblemNotes, AreaValue, AreaUnit, Order, Date, Notes, GrowthStage
6 | Nutrient | NutrientID, NutrientName, Date, Quantity
7 | OperationTime | OperationTimeID, StartDate, EndDate, Season
8 | Plan | PlanID, PName, RegisNo, ProductName, ProductRate, Date, WaterVolume
9 | Product | ProductID, ProductName, GroupName
10 | Site | SiteID, FarmerID, SiteName, Reference, Country, Address, GPS, CreatedBy
11 | Spray | SprayID, SprayProductName, ProductRate, Area, Date, WaterVol, ConfDuration, ConfWindSpeed, ConfDirection, ConfHumidity, ConfTemp, ActivityType
12 | Supplier | SupplierID, Name, ContactName, Address, Phone, Mobile, Email
13 | Task | TaskID, Desc, Status, TaskDate, TaskInterval, CompDate, AppCode
14 | TransTime | TransTimeID, OrderDate, DeliverDate, ReceivedDate, Season
15 | Treatment | TreatmentID, TreatmentName, FormType, LotCode, Rate, ApplCode, LevlNo, Type, Description, ApplDesc, TreatmentComment
16 | WeatherReading | WeatherReadingID, WeatherStationID, ReadingDate, ReadingTime, AirTemperature, Rainfall, SPLite, RelativeHumidity, WindSpeed, WindDirection, SoilTemperature, LeafWetness
17 | WeatherStation | WeatherStationID, StationName, Latitude, Longitude, Region
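As an illustration, the constellation layout above (multiple fact tables sharing conformed dimension tables) can be sketched in miniature. The table names follow Figure 3 and Table 1, but the column subsets, types and sample rows below are hypothetical simplifications, not the paper's actual DDL:

```python
import sqlite3

# Miniature constellation ("galaxy") schema sketch: two fact tables
# (FieldFact and Sale) sharing the Crop and Farmer dimension tables.
# All sample values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Crop   (CropID   INTEGER PRIMARY KEY, CropName TEXT);
CREATE TABLE Farmer (FarmerID INTEGER PRIMARY KEY, Name TEXT);
-- Fact table 1: agricultural operations on fields.
CREATE TABLE FieldFact (CropID   INTEGER REFERENCES Crop(CropID),
                        FarmerID INTEGER REFERENCES Farmer(FarmerID),
                        Yield REAL);
-- Fact table 2: trading operations, sharing the same dimensions.
CREATE TABLE Sale (CropID   INTEGER REFERENCES Crop(CropID),
                   FarmerID INTEGER REFERENCES Farmer(FarmerID),
                   Quantity REAL, Price REAL);
INSERT INTO Crop   VALUES (1, 'Winter wheat');
INSERT INTO Farmer VALUES (1, 'J. Smith');
INSERT INTO FieldFact VALUES (1, 1, 10.2);
INSERT INTO Sale      VALUES (1, 1, 500.0, 92.5);
""")

# Because both fact tables share the Crop dimension, field operations
# and trading operations can be analysed together in one query.
row = conn.execute("""
    SELECT c.CropName, ff.Yield, s.Quantity
    FROM FieldFact ff
    JOIN Sale s ON s.CropID = ff.CropID
    JOIN Crop c ON c.CropID = ff.CropID
""").fetchone()
print(row)  # ('Winter wheat', 10.2, 500.0)
```

The shared dimensions are what distinguishes a constellation from a set of independent star schemas: new data marts can reuse Crop and Farmer rather than duplicating them.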
The dimension tables contain details on each instance of an object involved in crop yield or farm management. Figure 4 describes the attributes of the Field and Crop dimension tables. The Field table contains information about the name, area, co-ordinates (the longitude and latitude of the centre point of the field), geometry (a collection of points describing the shape of the field) and the site identifier of the site that the field belongs to. The Crop table contains information about the name, estimated yield of the crop (estYield), BBCH Growth Stage Index (BbchScale), harvest equipment and its weight. These provide useful information for crop harvesting.

Figure 5 describes the attributes of the Soil and Pest dimension tables. The Soil table contains information about the pH value (a measure of acidity and alkalinity), minerals (nitrogen, phosphorus, potassium, magnesium and calcium), texture (texture label and percentage of silt, clay and sand), cation exchange capacity (CEC) and organic matter. Besides, information about recommended nutrients and testing dates is also included in this table. The Pest table contains the name, type, density, coverage and detected dates of pests. For the remaining dimension tables, their main attributes are described in Table 1.

4 ADW Architecture

A DW is a federated repository for all the data that an enterprise can collect through multiple heterogeneous data sources, internal or external. The authors in Golfarelli and Rizzi (2009) and Inmon (2005) defined a DW as a collection of methods, techniques, and tools used to conduct data analyses, make decisions and improve information resources. A DW is defined around key subjects and involves data cleaning, data integration and data consolidation. Besides, it must show its evolution over time and is not volatile.

The general architecture of a typical DW system includes four separate and distinct modules: Raw Data, Extraction-Transformation-Loading (ETL), Integrated Information and Data Mining (Kimball and Ross, 2013), as illustrated in Figure 6. The Raw Data (source data) module is originally stored in various storage systems (e.g. SQL, sheets, flat files, ...). The raw data often requires cleansing, correcting noise and outliers, and dealing with missing values. Then it needs to be integrated and consolidated before being loaded into the DW storage through the ETL module.

The Integrated Information module is a logically centralised repository, which includes the DW storage, data marts, data cubes and the OLAP engine. The DW storage is organised, stored and accessed using a suitable schema defined by the metadata. It can be either directly accessed or used to create data marts, which are usually oriented to a particular business function or an enterprise department. A data mart partially replicates the DW storage's contents and is a subset of the DW storage. Besides, the data is extracted in the form of a data cube before it is analysed in the Data Mining module. A data cube is a data structure that allows advanced analysis of data according to the multiple dimensions that define a given problem. The data cubes are manipulated by the OLAP engine. The DW storage, data marts and data cubes are described by the metadata, i.e. data used to define other data. Finally, the Data Mining module contains a set of techniques, such as machine learning, heuristic, and statistical methods, for data analysis and knowledge extraction at multiple levels of abstraction.

5 ETL and OLAP

The ETL module contains Extraction, Transformation, and Loading tools that can merge heterogeneous schemata, and extract, cleanse, validate, filter, transform and prepare the data to be loaded into a DW. The extraction operation reads and retrieves raw data from multiple, different types of source systems and stores it in a temporary staging area. During this operation, the data goes through multiple checks to detect and correct corrupted and/or inaccurate records, such as duplicate data, missing data, inconsistent values and wrong values. The transformation operation structures, converts or enriches the extracted data and presents it in a specific DW format. The loading operation writes the transformed data into the DW storage. The ETL implementation is complex, consuming a significant amount of time and resources. Most DW projects therefore use existing ETL tools, which are classified into two groups. The first is the commercial, well-known group and includes tools such as Oracle Data Integrator, SAP Data Integrator and IBM InfoSphere DataStage. The second group is known for its open-source tools, such as Talend, Pentaho and Apatar.

OLAP is a category of software technology that provides insight and understanding of data in multiple dimensions through fast, consistent, interactive access, management and analysis of the data. By using roll-up (consolidation), drill-down, slice-dice and pivot (rotation) operations, OLAP performs multidimensional analysis over a wide variety of possible views of the information, providing complex calculations, trend analysis and sophisticated data modelling quickly. OLAP systems are divided into three categories: 1) Relational OLAP (ROLAP), which uses a relational or extended-relational database management system to store and manage the data warehouse; 2) Multidimensional OLAP (MOLAP), which uses array-based multidimensional storage engines for multidimensional views of data, rather than a relational database, and often requires pre-processing to create data cubes; 3) Hybrid OLAP (HOLAP), which is a combination of both ROLAP and MOLAP, using both relational and multidimensional techniques to inherit the higher scalability of ROLAP and the faster computation of MOLAP.
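To make the roll-up and slice operations concrete, here is a small self-contained sketch over an in-memory fact set. The dimensions (crop, year, region) and all values are hypothetical illustrations, not the paper's data:

```python
from collections import defaultdict

# Hypothetical miniature fact set: (crop, year, region, yield in tons).
facts = [
    ("wheat",  2016, "IE", 10.0),
    ("wheat",  2017, "IE", 12.0),
    ("barley", 2017, "UK",  8.0),
]

def roll_up(facts, dim):
    """Aggregate the measure upward along one dimension
    (0 = crop, 1 = year, 2 = region)."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[dim]] += row[3]
    return dict(totals)

def slice_(facts, dim, value):
    """Fix one dimension to a single value, yielding a sub-cube."""
    return [row for row in facts if row[dim] == value]

print(roll_up(facts, 0))       # {'wheat': 22.0, 'barley': 8.0}
print(slice_(facts, 1, 2017))  # the two rows of the 2017 slice
```

A MOLAP engine precomputes such aggregates into cube arrays, whereas a ROLAP engine derives them from relational tables at query time; the HOLAP compromise discussed next mixes the two.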
In the context of agricultural Big Data, HOLAP is more suitable than both ROLAP and MOLAP because: 1) ROLAP has quite slow performance and does not meet all the users' needs, especially when performing complex calculations; 2) MOLAP is not capable of handling detailed data and requires all calculations to be performed during data cube construction; 3) HOLAP inherits the advantages of both ROLAP and MOLAP, allowing users to store large volumes of detailed information and perform complex calculations within reasonable response times.

6 Quality Criteria

The accuracy of data mining and analysis techniques depends on the quality of the DW. As mentioned in Adelman and Moss (2000) and Kimball and Ross (2013), to build an efficient ADW, the quality of the DW should meet the following important criteria:

1. Making information easily accessible.
2. Presenting consistent information.
3. Integrating data correctly and completely.
4. Adapting to change.
5. Presenting and providing the right information at the right time.
6. Being a secure bastion that protects the information assets.
7. Serving as the authoritative and trustworthy foundation for improved decision making. The analytics tools need to provide the right information at the right time.
8. Achieving benefits, both tangible and intangible.
9. Being accepted by DW users.

The above criteria must be formulated as measurements. For example, for the 8th criterion, one needs to determine quality indicators for benefits, such as improved fertiliser management, cost containment, risk reduction, better or faster decisions, and efficient information transactions. For the last criterion, a user satisfaction survey should be used to find out how well a given DW satisfies its users' expectations.

7 ADW Implementation

Currently, there are many popular large-scale database systems that can implement DWs: Redshift (Amazon document, 2018), Mesa (Gupta et al., 2016), Cassandra (Hewitt and Carpenter, 2016; Neeraj, 2015), MongoDB (Chodorow, 2013; Hows et al., 2015) and Hive (Du, 2018; Lam et al., 2016). In Ngo et al. (2019), the authors analysed the most popular NoSQL databases, which fulfil most of the aforementioned criteria. The advantages, disadvantages, as well as similarities and differences between Cassandra, MongoDB and Hive were investigated carefully in the context of ADW. It was reported that Hive is a good choice, as it can be paired with MongoDB to implement the proposed ADW, for the following reasons:

1. Hive is based on Hadoop, which is the most powerful cloud computing platform for Big Data. Besides, HQL is similar to SQL, which is familiar to the majority of users. Hive supports high storage capacity, business intelligence and data science better than MongoDB or Cassandra. These Hive features are useful for implementing ADW.

2. Hive does not offer real-time performance, so it needs to be combined with MongoDB or Cassandra to improve its performance.

3. MongoDB is more suitable than Cassandra to complement Hive because: 1) MongoDB supports join operations, full-text search, ad-hoc queries and secondary indexes, which are helpful for interacting with users; Cassandra does not support these features; 2) MongoDB has the same master-slave structure as Hive, which makes them easy to combine, while the structure of Cassandra is peer-to-peer; 3) Hive and MongoDB are more reliable and consistent, so the combination of both adheres to the CAP theorem.
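The division of labour argued for above, MongoDB serving low-latency requests and Hive handling heavy analytical queries, can be sketched conceptually as follows. The class, the routing rule and the stand-in dictionaries are illustrative assumptions, not part of the paper's implementation; a real deployment would use pymongo and a Hive client instead:

```python
# Conceptual sketch: route requests either to MongoDB (real-time)
# or to Hive (complex analytical calculations), mirroring the
# Hive + MongoDB pairing discussed above. Stores are stand-ins.

class ADWRouter:
    def __init__(self):
        self.mongodb = {}  # low-latency store: user data, logs, cached results
        self.hive = {}     # high-capacity store: historical, analytical data

    def route(self, query_kind):
        # Real-time lookups go to MongoDB; everything heavier goes to Hive.
        realtime_kinds = ("user", "log", "sensor", "cached")
        return "mongodb" if query_kind in realtime_kinds else "hive"

router = ADWRouter()
print(router.route("sensor"))     # mongodb
print(router.route("aggregate"))  # hive
```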
The ADW implementation is illustrated in Figure 7, which contains three modules, namely Integrated Information, Products and Raw Data. The Integrated Information module includes two components: MongoDB and Hive. MongoDB receives real-time data, such as user data, logs, sensor data or queries from the Products module (a web application, web portal or mobile app). Besides, results which need to be obtained in real-time are transferred from MongoDB to Products. Hive stores the online data and sends the processed data to MongoDB. Queries involving complex calculations are sent directly to Hive.

In the Raw Data module, almost all data in the Operational Databases or External Data components is loaded into Cassandra; that is, we use Cassandra as the raw data storage. Given the diverse formats of the raw data (image, video, natural language and SQL data), Cassandra is better suited to storing them than SQL databases. In the idle times of the system, the updated raw data in Cassandra is imported into Hive through ELT.

... for testing. Every group has 5 queries and uses one, two or more commands (see Table 2). Moreover, every query uses the operators And, Or, ≥, Like, Max, Sum and Count to express complex queries.

Table 2 Command combinations of queries

Group | Commands
G1 | Where
G2 | Where, Group by
G3 | Where, Left (Right) Join
G4 | Where, Union
G5 | Where, Order by
G6 | Where, Left (Right) Join, Order by
G7 | Where, Group by, Having
G8 | Where, Group by, Having, Order by
G9 | Where, Group by, Having, Left (Right) Join, Order by
G10 | Where, Group by, Having, Union, Order by

The difference in runtime between MySQL and ADW for a query q_i is calculated as Times_qi = RT_qi^MySQL / RT_qi^ADW, where RT_qi^MySQL and RT_qi^ADW are the average runtimes of query q_i on MySQL and ADW, respectively. Moreover, for each group G_i, the difference in runtime between MySQL and ADW is Times_Gi = RT_Gi^MySQL / RT_Gi^ADW, where RT_Gi = Average(RT_qi) is the average runtime of group G_i on MySQL or ADW.
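The two ratios defined above can be computed directly from per-query runtimes. The runtime values below are made-up placeholders, not the paper's measurements:

```python
# Times_qi = RT_qi^MySQL / RT_qi^ADW per query, and Times_Gi from the
# per-group average runtimes, as defined above. Runtimes (in seconds)
# are hypothetical placeholders.

def times_per_query(rt_mysql, rt_adw):
    return {q: rt_mysql[q] / rt_adw[q] for q in rt_mysql}

def times_per_group(rt_mysql, rt_adw, groups):
    # groups maps a group name to the list of its member queries
    result = {}
    for g, queries in groups.items():
        avg_mysql = sum(rt_mysql[q] for q in queries) / len(queries)
        avg_adw = sum(rt_adw[q] for q in queries) / len(queries)
        result[g] = avg_mysql / avg_adw
    return result

rt_mysql = {"q1": 600.0, "q2": 900.0}
rt_adw = {"q1": 200.0, "q2": 300.0}
print(times_per_query(rt_mysql, rt_adw))                         # {'q1': 3.0, 'q2': 3.0}
print(times_per_group(rt_mysql, rt_adw, {"G1": ["q1", "q2"]}))   # {'G1': 3.0}
```

Note that Times_Gi is the ratio of averages, not the average of per-query ratios; the two coincide only in special cases such as this one.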
[Figure 8 (chart): different times (Times_qi) between MySQL and ADW for every query.]

Figure 8 describes the time difference between MySQL and ADW for every query. Although running on one computer, with a large data volume ADW is faster than MySQL on 46 out of 50 queries. MySQL is faster for three queries (the 12th, 13th and 18th), belonging to groups 3 and 4. The two systems returned the same time for the 24th query, from group 5. Within each query group, for a fair performance comparison, the queries combine fact tables and dimension tables randomly. This makes complex queries take more time, and the time difference is significant. When varying the sizes and structures of the tables, the difference is very significant; see Figure 8.

[Figure 10 (chart): average runtimes of MySQL and ADW in every group, with the per-group differences Times_Gi.]

Figure 10 presents the average runtimes of the 10 query groups on MySQL and ADW. The mean runtime of a reading query is 687.8 seconds on MySQL and 216.1 seconds on ADW; that is, ADW is 3.19 times faster. In the future, by deploying the ADW solution on cloud or distributed systems, we believe that the performance gap over MySQL will be even larger.

9 Application for Decision Making

Having proposed ADW and studied its performance on real agricultural data, we illustrate some example queries to show how information can be extracted from ADW. These queries incorporate inputs on crop, yield, pest, soil, fertiliser, inspection, farmer, businessman and operation time to reduce labour and fertiliser inputs, improve farmer services and disease treatment, and increase yields. This information could not be extracted if Origin's 29 separate datasets had not been integrated into ADW. The data integration through ADW actually improves the value of crop management data over time for better decision-making.

Example 3: List crops and their fertiliser and treatment information, where the crops were cultivated and harvested in 2017, with Yield > 10 tons/ha, and attacked by the 'black twitch' pest. Besides, the soil in the field has PH > 6 and Silt <= 50 mg/l.
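A hedged sketch of how Example 3 might be phrased against the schema of Figure 3 follows. The table and column names echo Table 1 and Figures 3-5 where possible, but the join keys, the simplified tables and the miniature data are assumptions for illustration:

```python
import sqlite3

# Miniature rendering of Example 3: crops harvested in 2017 with
# Yield > 10 t/ha, attacked by 'black twitch', on soil with PH > 6
# and Silt <= 50. Tables are simplified stand-ins for the ADW schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Crop (CropID INTEGER PRIMARY KEY, CropName TEXT);
CREATE TABLE Soil (SoilID INTEGER PRIMARY KEY, PH REAL, Silt REAL);
CREATE TABLE Pest (PestID INTEGER PRIMARY KEY, PestName TEXT);
CREATE TABLE FieldFact (CropID INTEGER, SoilID INTEGER, PestID INTEGER,
                        Yield REAL, HarvestYear INTEGER);
INSERT INTO Crop VALUES (1, 'Winter wheat'), (2, 'Barley');
INSERT INTO Soil VALUES (1, 6.5, 40.0), (2, 5.5, 60.0);
INSERT INTO Pest VALUES (1, 'black twitch'), (2, 'aphid');
INSERT INTO FieldFact VALUES (1, 1, 1, 11.0, 2017), (2, 2, 2, 12.0, 2017);
""")
rows = conn.execute("""
    SELECT c.CropName, ff.Yield
    FROM FieldFact ff
    JOIN Crop c ON ff.CropID = c.CropID
    JOIN Soil s ON ff.SoilID = s.SoilID
    JOIN Pest p ON ff.PestID = p.PestID
    WHERE ff.HarvestYear = 2017 AND ff.Yield > 10
      AND p.PestName = 'black twitch' AND s.PH > 6 AND s.Silt <= 50
""").fetchall()
print(rows)  # [('Winter wheat', 11.0)]
```

The query illustrates the point of the integrated schema: yield, pest, soil and harvest-time conditions from previously separate datasets are answered by a single join over the fact table.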
[Chart residue: runtimes of MySQL and ADW; recoverable tick values 2,188.4, 2,000 and 1,192.]

... crop.EstYield >= 1 and crop.EstYield <= 10
GROUP BY crop.cropname
HAVING sum1 > 100;
Bendre, M. R. et al. (2015). Big data in precision agriculture: Weather forecasting for future farming. In International Conference on Next Generation Computing Technologies (NGCT). IEEE.

Cao, T. et al. (2012). Semantic search by latent ontological features. International Journal of New Generation Computing, Springer, SCI, 30(1):53–71.

Chodorow, K. (2013). MongoDB: The definitive guide, 2nd edition (powerful and scalable data storage). O'Reilly Media.

Golfarelli, M. and Rizzi, S. (2009). Data warehouse design: modern principles and methodologies. McGraw-Hill Education.

Gupta, A. et al. (2016). Mesa: a geo-replicated online data warehouse for Google's advertising system. Communications of the ACM, 59(7):117–125.

Gutierreza, F. et al. (2019). A review of visualisations in agricultural decision support systems: An HCI perspective. Computers and Electronics in Agriculture, 163.

Hafezalkotob, A. et al. (2018). A decision support system for agricultural machines and equipment selection: A case study on olive harvester machines. Computers and Electronics in Agriculture, 148:207–216.

Han, E. et al. (2017). Climate-Agriculture-Modeling and Decision Tool (CAMDT): a software framework for climate risk management in agriculture. Environmental Modelling & Software, 95:102–114.

Helmer, S. et al. (2015). A similarity measure for weaving patterns in textiles. In The 38th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 163–172.

Hewitt, E. and Carpenter, J. (2016). Cassandra: the definitive guide, 2nd edition (distributed data at web scale). O'Reilly Media.

Hows, D. et al. (2015). The definitive guide to MongoDB, 3rd edition (a complete guide to dealing with big data using MongoDB). Apress.

Huang, Y. et al. (2013). Estimation of cotton yield with varied irrigation and nitrogen treatments using aerial multispectral imagery. International Journal of Agricultural and Biological Engineering, 6(2):37–41.

Ngo, V. et al. (2011). Discovering latent concepts and exploiting ontological features for semantic text search. In The 5th Int. Joint Conference on Natural Language Processing, ACL, pages 571–579.

Ngo, V. et al. (2018). An efficient data warehouse for crop yield prediction. In The 14th International Conference on Precision Agriculture (ICPA-2018), pages 3:1–3:12.

Ngo, V. M. et al. (2019). Designing and implementing data warehouse for agricultural big data. In The 8th International Congress on BigData (BigData-2019), pages 1–17. Springer-LNCS, Vol. 11514.

Ngo, V. M. and Kechadi, M. T. (2020). Crop knowledge discovery based on agricultural big data integration. In The 4th International Conference on Machine Learning and Soft Computing (ICMLSC), pages 1–5. ACM.

Nilakanta, S. et al. (2008). Dimensional issues in agricultural data warehouse designs. Computers and Electronics in Agriculture, 60(2):263–278.

Oliver, D. M. et al. (2017). Design of a decision support tool for visualising E. coli risk on agricultural land using a stakeholder-driven approach. Land Use Policy, 66:227–234.