4-5 Spatial Big Data - 1
Lecture 04-05
Spatial Big Data
Introduction
Big Data are “data sets that are so big they cannot be handled efficiently by
common database management systems” (Dasgupta, 2013).
Big Data have volumes of 100 terabytes to petabytes, come in structured and
unstructured formats, and arrive as a constant flow of data (Davenport, 2014).
Spatial Big Data represents Big Data in the form of spatial layers and
attributes.
There is no standard threshold for the minimum size of Big Data or Spatial Big
Data, although in 2013 big data was considered to be one petabyte (1,000
terabytes) or larger (Dasgupta, 2013).
Big Data are getting unbelievably large.
More video is captured each day than was produced in the first 50 years of
television.
The amount of data available today exceeds 2.8 zettabytes (2.8 trillion
gigabytes).
Spatial big data are spatial data that challenge current computing systems in
terms of management, processing, or analysis.
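The sizes quoted above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, assuming decimal (SI) units in which each step is a factor of 1,000; the constant and function names are illustrative and not taken from any cited source.

```python
# Decimal (SI) storage units, expressed in bytes.
GIGABYTE = 10**9
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21


def convert(amount, from_unit, to_unit):
    """Convert an amount from one unit to another (units given in bytes)."""
    return amount * from_unit / to_unit


# One petabyte expressed in terabytes (the 2013 size quoted above).
petabyte_in_tb = convert(1, PETABYTE, TERABYTE)

# 2.8 zettabytes expressed in gigabytes.
zettabytes_in_gb = convert(2.8, ZETTABYTE, GIGABYTE)

print(f"1 PB   = {petabyte_in_tb:,.0f} TB")
print(f"2.8 ZB = {zettabytes_in_gb:,.0f} GB")
```

A zettabyte is 10^21 bytes and a gigabyte is 10^9 bytes, so 2.8 zettabytes is 2.8 x 10^12 gigabytes, which matches the "2.8 trillion gigabytes" figure.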
FRAMINGHAM, Mass., November 9, 2015 – The Big Data market continues to
exhibit strong momentum as businesses accelerate their transformation into
data-driven companies.
This momentum is driving strong growth in big data-related infrastructure,
software, and services.
A new forecast from International Data Corporation (IDC) sees the big data
technology and services market growing at a compound annual growth rate
(CAGR) of 23.1% over the forecast period ending in 2019, with annual spending
reaching $48.6 billion in 2019.
And a new IDC Special Study examines spending on big data solutions in
greater detail across 19 vertical industries and eight big data technologies.
"The ever-increasing appetite of businesses to embrace emerging big data-
related software and infrastructure technologies while keeping the
implementation costs low has led to the creation of a rich ecosystem of new
and incumbent suppliers,"
Spatial Big Data Enablers
Technological advances (Lidar and Satellite imagery)
Volunteered Geographic Information (VGI) (OpenStreetMap)
GPS
Geo-enabled Social Media (Twitter, Flickr, ...)
Inexpensive storage of large volumes of data
Inexpensive compute power
Next Generation Analytics
Moving from off-line to in-line embedded analytics
Need to explain what happened
Need to predict what will happen
Operating on
Data at rest – stored someplace
Data in motion – streaming
Multiple disparate data sources
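The difference between data at rest and data in motion can be illustrated with a minimal Python sketch. It assumes a hypothetical list of sensor readings; `batch_mean` and `streaming_mean` are illustrative names, not part of any particular analytics platform. The batch version needs the whole stored dataset, while the streaming version updates its answer as each reading arrives and keeps only a count and a running mean.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Data at rest: the full dataset is stored, so it is summed in one go."""
    return sum(values) / len(values)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Data in motion: update the mean incrementally as readings arrive.

    Only a running count and mean are kept; the readings are not stored.
    """
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count
        yield mean


# Hypothetical sensor readings, used both as a stored batch and as a stream.
readings = [21.0, 22.5, 19.5, 23.0]

running = list(streaming_mean(iter(readings)))
print("batch mean:    ", batch_mean(readings))
print("streaming mean:", running)
```

After the last reading, the streaming estimate equals the batch result, but an answer was available after every reading along the way.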
So, we know that “big data” is BIG…
Sources of Spatial Big Data
Sources of Spatial Big Data include:
GPS, including GPS-enabled devices
Satellite remote sensing
Aerial surveying
Radar
Lidar
Sensor networks
Digital cameras
Location of readings of RFID
Mobile devices
Internet of things
Satellites
Drones
Vehicles
Geosocial networking services
A significant portion of big data is in fact spatial big data.
Where is this Big Data coming from?
It’s from the Mobile Planet and Internet of Everything…
It’s User-Generated Content…
It’s Sensor Data…
It’s all these “Smart” “Things”…
Spatial Big Data vs Traditional Datasets
Data characteristic   Big Data                      Traditional analytics
Type of data          Unstructured formats          Formatted in columns and rows
Volume of data        100 terabytes to petabytes    10s of terabytes or less
Flow of data          Continual flow                Static pool of data
Analytical methods    Machine learning              Hypothesis-based
Primary purpose       Data-based products           Internal decision support and services
➢ Traditional datasets could be quite large, but they were typically formatted
in spreadsheets or databases, tended to be static, and were designed to test
hypotheses.
➢ Big Data has the 5 Vs and can use machine learning, which derives
solutions by seeing what works in big datasets.
➢ In statistical terms, the Big Data approach is exploratory.
Five V’s of Spatial Big Data
Volume
Satellite imagery covers the globe, so its volume is vast.
Sensors are expanding worldwide at a rapid rate.
Digital cameras now number several billion, largely through spatially referenced
cell phones.
One estimate indicates that 2.5 quintillion bytes (2.5 x 10^18) are generated
daily worldwide (www.ibm.com).
Variety
Spatial data are built from 2-D or 3-D points configured as vector features or
raster imagery. This is entirely different from conventional big data, which is
alphanumeric or pixel-based (similar to raster, but not vector).
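The vector and raster forms can be contrasted with a small Python sketch. It assumes planar coordinates in arbitrary map units and a hypothetical `rasterize` helper (not from any GIS library) that converts a list of vector points into a raster grid of per-cell counts.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # vector form: an explicit (x, y) coordinate pair


def rasterize(points: List[Point], x_min: float, y_min: float,
              cell_size: float, n_cols: int, n_rows: int) -> List[List[int]]:
    """Count vector points per raster cell.

    The grid starts at (x_min, y_min); row 0 is the row with the lowest y.
    Points that fall outside the grid extent are ignored.
    """
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        col = int((x - x_min) // cell_size)
        row = int((y - y_min) // cell_size)
        if 0 <= col < n_cols and 0 <= row < n_rows:
            grid[row][col] += 1
    return grid


# Hypothetical vector points; the last one lies outside the grid extent.
points = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (2.9, 1.1), (5.0, 5.0)]

# Raster form: a 3-column by 2-row grid of 1 x 1 cells holding counts.
grid = rasterize(points, x_min=0.0, y_min=0.0, cell_size=1.0, n_cols=3, n_rows=2)
for row in grid:
    print(row)
```

The vector form keeps exact coordinates for each feature, while the raster form stores one value per cell, so its size depends on the extent and cell size rather than on the number of features.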
Velocity
Velocity is very high, since imagery and sensor readings are transmitted
electronically at close to the speed of light.
Veracity
Data veracity is the degree to which data is accurate, precise and trusted.
Data is often viewed as certain and reliable.
The reality of problem spaces, data sets and operational environments is that
data is often uncertain, imprecise and difficult to trust. The following are
illustrative examples of data veracity:
Attribute veracity
For attribute (non-spatial) data, do the data meet data quality tests?
Spatial veracity
For vector data (data based on points, lines, and polygons), the quality varies. It
depends on whether the points were determined by GPS, entered manually, or come from
unknown sources. Resolution and projection issues can also alter veracity.
For geocoded points, there may be errors in the address tables and in the point-location
algorithms associated with addresses.
For raster data (imagery based on pixels), veracity depends on the accuracy of the
recording instruments in satellites or aerial devices, and on timeliness.
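The attribute and spatial checks described above can be sketched as simple data quality tests. This is a minimal Python sketch under stated assumptions: records are plain dictionaries, coordinates are latitude and longitude in decimal degrees stored under the hypothetical keys `lat` and `lon`, and `veracity_issues` is an illustrative helper rather than part of any library. Real quality tests would also cover accuracy, projection, and timeliness, which cannot be checked from a single record.

```python
from typing import Dict, List

Record = Dict[str, object]


def veracity_issues(record: Record, required: List[str]) -> List[str]:
    """Return a list of data quality problems found in one record."""
    issues = []

    # Attribute veracity: every required attribute must be present and non-empty.
    for name in required:
        if record.get(name) in (None, ""):
            issues.append(f"missing attribute: {name}")

    # Spatial veracity: coordinates must be numeric and within valid ranges.
    lat, lon = record.get("lat"), record.get("lon")
    if not isinstance(lat, (int, float)) or not -90 <= lat <= 90:
        issues.append("latitude missing or outside [-90, 90]")
    if not isinstance(lon, (int, float)) or not -180 <= lon <= 180:
        issues.append("longitude missing or outside [-180, 180]")

    return issues


# Hypothetical records: the second has an empty name and an invalid latitude.
good = {"id": 1, "name": "Well A", "lat": 34.05, "lon": -118.25}
bad = {"id": 2, "name": "", "lat": 134.05, "lon": -118.25}

print(veracity_issues(good, ["id", "name"]))
print(veracity_issues(bad, ["id", "name"]))
```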
Big Data Analytic Techniques
Big data analytics examines large amounts of data to uncover hidden
patterns, correlations and other insights.
With today’s technology, it’s possible to analyze your data and get answers
from it almost immediately, an effort that is slower and less efficient with
more traditional business intelligence solutions.
Big data analytics helps organizations harness their data and use it to identify
new opportunities. That, in turn, leads to smarter business moves, more
efficient operations, higher profits and happier customers.
Big data brings about the following advantages:
Cost reduction.
Big data technologies such as Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data – plus they can identify
more efficient ways of doing business.
Faster, better decision making.
With the speed of Hadoop and in-memory analytics, combined with the ability to
analyze new sources of data, businesses are able to analyze information immediately –
and make decisions based on what they’ve learned.
New products and services.
With the ability to gauge customer needs and satisfaction through analytics comes the
power to give customers what they want.
Traditional Big Data Analytic Techniques
Common techniques include:
Classification
Clustering
Regression
Simulation
Anomaly Detection
Numerical Forecasting
Optimization
Geographic Mapping
Big Data Analytic Platforms
What is enabling them?
Lower Cost
Greater Storage (HD and RAM)
Faster Input / Output Operations
Faster Processing
Increased Bandwidth
Cloud / Distributed Computing
New Data Management Tools (Hadoop, etc.)
New Technologies (Spark, etc.)
Ease-of-Use (Browser-based, etc.)
Techniques for Handling Big Data
Spatial data distribution
Large datasets are split into smaller datasets and distributed across a collection of machines.
(This is not the same as a statistical distribution, in which values are ordered from smallest
to largest and graphs and charts show both the values and how frequently they appear.)
Parallel processing
A mode of operation in which a process is split into parts, which are executed
simultaneously on different processors attached to the same computer.
For big data, a collection of machines processes the smaller datasets, and the partial
results are then combined.
Fault tolerance
The property that enables a system to continue operating properly in the event of the
failure of (or one or more faults within) some of its components.
Copies of the partitioned data are made to ensure that if a machine fails, the dataset can
still be processed.
Commodity hardware
Using standard hardware that is not dependent upon exotic architectures, topologies, or
data storage (e.g., RAID)
Scalability
Algorithms and frameworks that can be easily scaled to run on larger collections of
machines in order to address larger datasets
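The techniques above (distribution, parallel processing, and fault tolerance) can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not how Hadoop or Spark are implemented: points are partitioned by grid cell, each partition is assigned to several hypothetical workers, partitions are processed concurrently, and the partial results are combined. All function names are illustrative. Threads are used only to show the structure; a real system would use separate processes or machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def partition_by_cell(points: List[Point], cell_size: float) -> Dict[Cell, List[Point]]:
    """Spatial data distribution: split points into one partition per grid cell."""
    parts = defaultdict(list)
    for x, y in points:
        parts[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return dict(parts)


def assign_replicas(cells: List[Cell], n_workers: int, copies: int) -> Dict[Cell, List[int]]:
    """Fault tolerance: place each partition on `copies` workers, round-robin.

    The workers are distinct as long as copies <= n_workers.
    """
    return {cell: [(i + k) % n_workers for k in range(copies)]
            for i, cell in enumerate(cells)}


def survives_failure(replicas: Dict[Cell, List[int]], failed_worker: int) -> bool:
    """True if every partition still has a copy on a worker that did not fail."""
    return all(any(w != failed_worker for w in workers)
               for workers in replicas.values())


def count_points(part: List[Point]) -> int:
    """The per-partition task; here it simply counts the points."""
    return len(part)


def parallel_count(parts: Dict[Cell, List[Point]], n_workers: int) -> int:
    """Parallel processing: run the task per partition, then combine the results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(count_points, parts.values()))
    return sum(partial)


# Hypothetical points falling into four different 1 x 1 grid cells.
points = [(0.2, 0.3), (0.8, 0.1), (1.4, 0.6), (2.5, 2.5), (2.7, 2.1), (0.1, 2.9)]
parts = partition_by_cell(points, cell_size=1.0)
replicas = assign_replicas(list(parts), n_workers=3, copies=2)

print("partitions:", len(parts))
print("total points:", parallel_count(parts, n_workers=3))
print("survives loss of worker 0:", survives_failure(replicas, failed_worker=0))
```

With two copies of each partition, any single worker can fail without losing data; with only one copy, the partitions held by the failed worker would be unavailable. Scaling up then means adding workers and partitions rather than changing the algorithm.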
Challenges of Spatial Big Data
Retaining computational efficiency: desktop PCs and similar computational systems
often cannot handle large volumes of data or data with rapid velocity.
Availability of data vs availability of spatial technologies to manage, analyze
and disseminate the results.
Storing Spatial Big Data in the cloud.
Adding new data to Spatial Big Data, or changing old data, means repartitioning
is needed.
Security and integration concerns.
Spatial Big Data is considered to be structured and unstructured datasets
with massive data volumes that cannot be easily captured, stored,
manipulated, analyzed, managed and presented by traditional hardware.
Algorithms and methods: the 3 V's (volume, variety, velocity) challenge the
traditional algorithms and methods used to make sense of all the data.
Database design to handle variety of data as well as volume (storage) and
velocity (reading/writing speed).
Quickly geovisualizing data of such variety is challenging.
Network limitations on transferring data with large volume or rapid velocity.
Summary: Spatial Big Data and Analytics
Big Data refers to huge data-sets that overflow ordinary data management
systems.
The 5 V’s that define big data are Volume, Variety, Velocity, Veracity, and
Value.
Spatial Big Data is Big Data that is spatially referenced, so in addition to
common analytics techniques, mapping and spatial analytics can be applied.
Ordinary, small-data approaches will not work, because most of the
traditional techniques cannot perform exploration of massive data sets.
Big Data methods allow multidimensional screening and “data mining” to
locate parts of the mass that are showing interesting relationships, trends, or
comparisons.
Those interesting parts of a Big Data Set can be sorted into small data-sets
that can have the more powerful traditional analysis methods applied to
them.
Success needs to be studied from a management and organizational
standpoint to understand what works managerially and results in profits and
other benefits.