
International Journal of Scientific and Research Publications, Volume 3, Issue 6, June 2013 1

ISSN 2250-3153

Big Data Landscape


Shubham Sharma

Banking Product Development Division, Oracle Financial Services Software Ltd.


Bachelor of Technology Information Technology, Maharishi Markandeshwar Engineering College

Abstract- “Big Data” has become a major source of innovation across enterprises of all sizes. Data is being produced at an ever increasing rate. This growth in data production is driven by increased use of media, fast-developing organizations, and the proliferation of the web and the systems connected to it. Having a lot of data is one thing; being able to store it, analyze it, and visualize it in a real-time environment is a whole different ball game. New technologies are accumulating more data than ever; therefore many organizations are looking for optimal ways to make better use of their data. In a broader sense, organizations analyzing big data need to view data management, analysis, and decision-making in terms of “industrialized” flows and processes rather than discrete stocks of data or events. To handle these aspects of large quantities of data, various open platforms have been developed.

Index Terms- Big Data, Landscape, Open Platforms, Technologies, Tools

I. INTRODUCTION
In 2012 Gartner defined Big Data as follows: “Big Data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization”. Using a big data platform allows one to address the full spectrum of big data challenges. These platforms make use of traditional technologies that are best suited for structured and repeatable tasks and combine them with complementary new technologies that address speed and flexibility and are ideal for unstructured analysis as well as data exploration and discovery.
Open platforms are software systems with a fully documented external application programming interface that allows the software to be used in ways other than the original programmer intended, without modifying the source code. Open platforms are based on open standards, which does not mean they are open source. Big data open platforms are based on similar concepts, and the various platforms discussed here provide visualization and discovery of large data sets, monitor big data systems, and speed time to value with analytical and industry-specific modules.

Exquisite Example: “THE GOD PARTICLE”
An exquisite example of an enormous data generator is the Large Hadron Collider, whose roughly 150 million sensors deliver data at a rate of about 150 million petabytes per year, or nearly 500 exabytes per day. To put the numbers in perspective, this is equivalent to 5×10^20 bytes per day, almost 200 times more than all the other sources in the world combined. Handling this huge volume of data is hard with existing data management technologies; hence technology transitions have become imminent.

II. TECHNOLOGY TRANSITION
With the introduction of Big Data platforms there has been a change in the analytic techniques of organizations. The focus of organizations has moved from orthodox methods, such as trend analysis and forecasting on historic data, to complementary and far richer data visualization techniques. More interest is being shown in scenario simulation and development than in standardized reporting techniques. Analytics is emerging as a key to enhancing business processes.

Figure 1: Technology Transition (source: “Big Data, Analytics and the Path from Insights to Value”, MIT Sloan Management Review, Winter 2011)


III. CLASSIFICATION OF BIG DATA TOOLS


The Big Data tools landscape is growing rapidly, and the tools can be broadly classified into the following areas:
1. Data Analysis
2. Databases/Data warehousing
3. Operational
4. MultiValue Databases
5. Business Intelligence
6. Data Mining
7. Key Value
8. Document Store
9. Graphs
10. Grid Solutions
11. Object Databases
12. Multi Model
13. XML databases
14. Big Data Search.

There are many products available in each category, each with its own special features to meet specific requirements.

Figure 2: Big Data Landscape

IV. BIG DATA LANDSCAPE

In order to plan a big data architecture, it is important to understand the current big data landscape and how it can be incorporated into existing infrastructure. In traditional data management structures, structured information was fed into an enterprise integration tool, which transferred the collected structured data into data warehouses or operational units; different analytical capabilities were then used to explore the data. The new data management structures that make up the big data landscape, by contrast, are designed to meet velocity, volume, value, and variety requirements. To handle these large data sets, new architectures have been formed that incorporate multi-node parallel processing techniques.
The big data landscape can be further classified by processing requirements, with different strategies proposed for batch processing and real-time processing.


Different technologies through which we can harness big data include:
1. Relational Database Management Systems
2. Massively Parallel Processing
3. MapReduce
4. NoSQL
5. Cassandra
6. Complex Event Processing

Relational Database Management Systems
Databases are now using massively parallel processing techniques, which break data into small slots and operate on them across multiple machines to achieve faster processing. Databases are also acquiring columnar architectures to allow the storage of unstructured data.

Massively Parallel Processing
The data is distributed among a number of nodes for faster processing. The processing is done in parallel on each machine, and the output is collected to deduce the required result. This technology requires knowledge of SQL and expensive hardware to work on.
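A minimal sketch of the MPP idea described above, assuming the data has already been split into several SQLite shard files (the shard names, table layout, and query are illustrative and not from the paper): each shard runs the same SQL aggregate in parallel, and a coordinator simply combines the partial results.

import sqlite3
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["sales_shard0.db", "sales_shard1.db", "sales_shard2.db"]  # hypothetical shard files

def create_demo_shard(path, rows):
    """Create a small shard so the example is self-contained."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    con.execute("DELETE FROM sales")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    con.commit()
    con.close()

def partial_sum(path):
    """Run the same aggregate on one shard (one 'node')."""
    con = sqlite3.connect(path)
    total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0] or 0.0
    con.close()
    return total

if __name__ == "__main__":
    demo = [[("east", 10.0), ("west", 5.0)],
            [("east", 7.5)],
            [("west", 2.5), ("east", 1.0)]]
    for path, rows in zip(SHARDS, demo):
        create_demo_shard(path, rows)

    # Each shard is queried in parallel; the coordinator only merges partial sums.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(partial_sum, SHARDS))
    print("total sales:", sum(partials))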
MapReduce
MapReduce also uses the concept of multiple nodes and parallel processing. It consists of two functions:
• Map: separates information over multiple nodes, which are then processed in parallel.
• Reduce: combines the result sets into a final response.
Massively parallel processing uses SQL queries, whereas MapReduce uses Java and does not need expensive dedicated platforms.
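A toy, in-memory sketch of the two functions named above (the dataset and field names are invented for illustration): map emits a (key, value) pair from each record, and reduce combines all values that share a key, here the maximum temperature seen per city.

from collections import defaultdict

records = [
    "delhi,41", "mumbai,33", "delhi,44", "pune,29", "mumbai,35",  # toy input records
]

def map_fn(record):
    """Emit a (key, value) pair from one input record."""
    city, temp = record.split(",")
    return city, int(temp)

def reduce_fn(key, values):
    """Combine all values for one key into a final result."""
    return key, max(values)

# "Shuffle": group mapped values by key, as the framework would between phases.
grouped = defaultdict(list)
for key, value in map(map_fn, records):
    grouped[key].append(value)

results = [reduce_fn(k, vs) for k, vs in grouped.items()]
print(results)  # e.g. [('delhi', 44), ('mumbai', 35), ('pune', 29)]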
NoSQL
NoSQL database-management systems are unlike relational database-management systems in that they do not use SQL as their query language. The idea behind these systems is that they are better for handling data that doesn't fit easily into tables. They dispense with the overhead of indexing, schemas, and ACID transactional properties to create large, replicated data stores for running analytics on inexpensive hardware, which is useful for dealing with unstructured data.
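A minimal illustration of these ideas (all names are hypothetical): documents with different shapes are stored without a schema, and each document is routed to one of several "nodes" by hashing its key, the way a NoSQL store shards a large data set.

import hashlib

NUM_NODES = 3
nodes = [dict() for _ in range(NUM_NODES)]  # each dict stands in for one storage node

def node_for(key):
    """Pick a node by hashing the key (simple hash-based sharding)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

def put(key, document):
    """Store a schema-less document; no table layout or index is required."""
    nodes[node_for(key)][key] = document

def get(key):
    return nodes[node_for(key)].get(key)

put("user:1", {"name": "asha", "follows": ["user:2", "user:7"]})
put("user:2", {"name": "ravi", "location": "pune"})      # different fields, no schema change
put("event:9", {"type": "click", "ts": 1370000000})

print(get("user:2"))
print([len(n) for n in nodes])  # how the documents spread across the nodes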
Hive
Databases like Hadoop's file store make ad hoc query and analysis difficult, as programming the map/reduce functions that are required can be hard. Realizing this when working with Hadoop, Facebook created Hive, which converts SQL queries into map/reduce jobs to be executed using Hadoop.

Vendors
There is scarcely a vendor that doesn't have a big-data plan in train, with many companies combining their proprietary database products with the open-source Hadoop technology as their strategy to tackle velocity, variety and volume. Many of the early big-data technologies came out of open source, posing a threat to traditional IT vendors that have packaged their software and kept their intellectual property close to their chests. However, the open-source nature of the trend has also provided an opportunity for traditional IT vendors, because enterprise and government often find open-source tools off-putting. Therefore, traditional vendors have welcomed Hadoop with open arms, packaging it into their own proprietary systems so they can sell the result to enterprises as more comfortable and familiar packaged solutions.

Cloudera
Cloudera was founded in 2008 by employees who worked on Hadoop at Yahoo and Facebook. It contributes to the Hadoop open-source project, offering its own distribution of the software for free. It also sells a subscription-based Hadoop distribution for the enterprise, which includes production support and tools to make it easier to run Hadoop.
Since its creation, various vendors have chosen Cloudera's Hadoop distribution for their own big-data products. In 2010, Teradata was one of the first to jump on the Cloudera bandwagon, with the two companies agreeing to connect the Hadoop distribution to Teradata's data warehouse so that customers could move information between the two. Around the same time, EMC made a similar arrangement for its Greenplum data warehouse. SGI and Dell signed agreements with Cloudera from the hardware side in 2011, while Oracle and IBM joined the party in 2012.

Hortonworks
Cloudera rival Hortonworks was birthed by key architects from the Yahoo Hadoop software engineering team. In June 2012, the company launched a high-availability version of Apache Hadoop, the Hortonworks Data Platform, on which it collaborated with VMware, the goal being to target companies deploying Hadoop on VMware's vSphere.
Teradata has also partnered with Hortonworks to create products that "help customers solve business problems in new and better ways".

Teradata
Teradata made its move out of the "old-world" data-warehouse space by buying Aster Data Systems and Aprimo in 2011. Teradata wanted Aster's ability to manage "a variety of diverse data that is not structured", such as web applications, sensor networks, social networks, genomics, video and photographs.
Teradata has now gone to market with the Aster Data nCluster, a database using MPP and MapReduce. Visualization and analysis are enabled through the Aster Data visual-development environment and suite of analytic modules. The Hadoop connector, enabled by its agreement with Cloudera, allows for a transfer of information between nCluster and Hadoop.

Oracle
Oracle made its big-data appliance available earlier this year: a full rack of 18 Oracle Sun servers with 864GB of main memory; 216 CPU cores; 648TB of raw disk storage; 40Gbps InfiniBand connectivity between nodes and engineered systems; and 10Gbps Ethernet connectivity.
The system includes Cloudera's Apache Hadoop distribution and manager software, as well as an Oracle NoSQL database and a distribution of R (an open-source statistical computing and graphics environment).


It integrates with Oracle's 11g database, the idea being that customers can use Hadoop MapReduce to create optimised datasets to load and analyze in the database.
The appliance costs US$450,000, which puts it at the high end of big-data deployments, and not at the test-and-development end, according to analysts.

IBM
IBM combined Hadoop and its own patents to create IBM InfoSphere BigInsights and IBM InfoSphere Streams as the core technologies for its big-data push.
The BigInsights product, which enables the analysis of large-scale structured and unstructured data, "enhances" Hadoop to "withstand the demands of your enterprise", according to IBM. It adds administrative, workflow, provisioning and security features to the open-source distribution. Meanwhile, the Streams product has a more complex event-processing focus, allowing the continuous analysis of streaming data so that companies can respond to events.
IBM has partnered with Cloudera to integrate its Hadoop distribution and Cloudera Manager with IBM BigInsights. Like Oracle's big-data product, IBM's BigInsights links to: IBM DB2;
its Netezza data-warehouse appliance (a high-performance, massively parallel advanced analytics platform that can crunch petascale data volumes); its InfoSphere Warehouse; and its Smart Analytics System.

V. PLATFORM TECHNOLOGY THAT HANDLES BIG DATA

VoltDB
VoltDB is a system well suited to high-performance OLTP environments. Beyond in-memory data processing and SQL support, it performs sequential processing of data partitioned on the basis of stored procedures and reduces lock and communication overhead, helping to build a high-speed OLTP system through horizontal partitioning of table data.

Figure 3: VoltDB architecture.

As Figure 3 shows, a task that needs to operate on only one partition is executed sequentially within the corresponding partition, while a task that spans several partitions is handled by a coordinator. If many operations must be processed across several partitions, performance may suffer for large rows and data sizes.

SAP HANA
SAP HANA is an in-memory store from SAP. Its characteristic is to organize a system optimized for analysis tasks such as OLAP. If all data sits in system memory, maximizing CPU utilization is crucial, and the key point is to reduce bottlenecks between memory and the CPU cache. To minimize cache misses, it is advantageous for the data to be processed within a given time to be laid out consecutively, which means a column-oriented table configuration can be favorable for many OLAP analyses.
There are many advantages to the column-oriented table configuration; typical examples are a high data compression ratio and high processing speed. Values from the same data domain compress better on their own than when several domains are combined together. Moreover, the configuration makes it possible to reduce CPU operations through lightweight compression, such as RLE (run-length encoding) or dictionary encoding, and to execute desired operations on compressed data without a decompression step. The following figure shows a brief comparison of row-oriented and column-oriented methods.

Figure 4: A comparison of row-oriented and column-oriented methods.
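A small sketch of the lightweight column compression mentioned above (the column values are invented): the same column is encoded with run-length encoding and with dictionary encoding, and a predicate is then evaluated directly on the dictionary-encoded form without decompressing it.

def rle_encode(column):
    """Run-length encoding: consecutive repeats become (value, count) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def dict_encode(column):
    """Dictionary encoding: store each distinct value once, plus small integer codes."""
    dictionary = {}
    codes = []
    for value in column:
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes

country = ["IN", "IN", "IN", "US", "US", "IN", "IN", "DE", "DE", "DE"]

print(rle_encode(country))            # [['IN', 3], ['US', 2], ['IN', 2], ['DE', 3]]

dictionary, codes = dict_encode(country)
print(dictionary, codes)

# Filter "country = 'IN'" on the compressed form: compare integer codes only,
# without rebuilding the original strings.
target = dictionary["IN"]
matching_rows = [i for i, code in enumerate(codes) if code == target]
print(matching_rows)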


Greenplum
The Greenplum database is a shared-nothing MPP system built on PostgreSQL. Data can be stored in either row-oriented or column-oriented form according to the operations that apply to it. Data is stored on each server in segments, and availability is provided by segment-level replication using log shipping. The query engine, also developed from PostgreSQL, is configured to execute basic SQL operations (hash join, hash aggregation) or map-reduce programs, so that it can effectively process parallel queries or map-reduce-style programs. Each processing node is connected through a software-based data switch component.

Figure 6: Greenplum architecture.

Vertica
Vertica is a database specialized for OLAP, which stores data on disk in column form. Its shared-nothing MPP structure comprises a write-optimized store for loading data fast, a compressed read-optimized store, and a tuple mover that manages the flow of data between the two. Figure 5 below helps in understanding the Vertica structure.

Figure 5: Vertica structure.
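A toy sketch of the write-optimized/read-optimized split described above (the class and method names are made up): new rows land in an unsorted write buffer, and a "tuple mover" periodically moves them into a sorted, read-optimized store that analytic scans prefer.

class TinyColumnStore:
    """Toy model of a write-optimized store (WOS), a read-optimized store (ROS),
    and a tuple mover that shifts data from one to the other."""

    def __init__(self):
        self.wos = []   # unsorted, cheap to append to
        self.ros = []   # sorted, what analytic scans read

    def insert(self, row):
        self.wos.append(row)            # writes are fast: no sorting, no reorganization

    def move_tuples(self):
        """Tuple mover: drain the write buffer into the sorted read store."""
        self.ros = sorted(self.ros + self.wos)
        self.wos = []

    def scan(self):
        # Queries see both stores, but most data should already be in the ROS.
        return sorted(self.ros + self.wos)

store = TinyColumnStore()
for value in [7, 3, 9, 1]:
    store.insert(value)
store.move_tuples()                      # would normally run in the background
store.insert(5)
print(store.scan())                      # [1, 3, 5, 7, 9]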

IBM Netezza Data Warehouse
The IBM Netezza data warehouse has a two-tier architecture consisting of SMP and MPP, called AMPP (Asymmetric Massively Parallel Processing).
A host with an SMP structure handles query execution planning and result aggregation, while S-Blade nodes with an MPP structure handle query execution.
Each S-Blade is connected to disk and to a special data processor called an FPGA (Field Programmable Gate Array).
Each S-Blade and the host are connected over a network that uses IP addresses.
Unlike other systems, the FPGA performs filtering for data compression, records, or columns; in transaction processing it enables filtering and transformation functions, such as visibility checks, while data is being retrieved from disk into memory for real-time processing. When processing large data, it adheres to the principle of processing close to the data source, which is to reduce as much unnecessary data as possible from transmission by performing data operations where the data is located.



Figure 7: IBM Netezza data architecture.
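A back-of-the-envelope illustration of the "processing close to the data source" principle above (the record layout and sizes are invented): filtering on the storage side ships only the matching rows, while filtering on the client side ships every row first and discards most of them afterwards.

import json

# Pretend this table lives on a storage node.
rows = [{"id": i, "region": "east" if i % 10 == 0 else "west", "amount": i * 1.5}
        for i in range(10_000)]

def ship(records):
    """Bytes that would cross the network if these records were sent as JSON."""
    return sum(len(json.dumps(r)) for r in records)

# Filter at the source (Netezza-style: the FPGA/S-Blade discards rows before transfer).
pushed_down = [r for r in rows if r["region"] == "east"]
bytes_pushed = ship(pushed_down)

# Filter at the client: every row is transferred, then most are thrown away.
bytes_naive = ship(rows)
client_side = [r for r in rows if r["region"] == "east"]

print(f"rows kept: {len(pushed_down)} of {len(rows)}")
print(f"bytes shipped with pushdown: {bytes_pushed:,}")
print(f"bytes shipped without:       {bytes_naive:,}")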

In addition, the companies and organizations that developed these parallel DBMSs have been taken over by IT conglomerates, and development now continues in appliance form. The acquirers and acquisition dates of the aforementioned parallel DBMSs are shown in the following table:

Acquirer     Database                        Year
SAP          Sybase                          2010
HP           Vertica                         2011
IBM          Netezza                         2010
Oracle       Essbase (Hyperion Solutions)    2007
Teradata     Aster Data                      2011
EMC          Greenplum                       2010

Table 1: Companies that acquired parallel RDBMS vendors.

NoSQL
In an RDBMS, scaling out while supporting ACID (Atomicity, Consistency, Isolation, and Durability) is almost impossible. For storage, data has to be divided across several devices; to satisfy ACID over data divided in this way, you have to use complicated locking and replication methods, which leads to performance degradation.


NoSQL, a general term for a new class of storage system, has emerged in order to simplify data models for easy definition of shards, which are the basis of distribution, and to relax requirements (eventual consistency) in a distributed replication environment or to loosen isolation constraints.
Since NoSQL is covered many times in our DevPlatform Blogs and there are many places to obtain information, we will not go over the individual NoSQL products.
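A toy sketch of the relaxed consistency mentioned above (the replica layout and the delay model are invented): a write is acknowledged by the primary replica immediately, the other replicas are updated later, and a read served by a lagging replica can return a stale value until replication catches up.

import random

replicas = [{} for _ in range(3)]   # replica 0 acts as the primary
pending = []                        # replication queue: (replica index, key, value)

def write(key, value):
    """Acknowledge after the primary write; replicate to the others asynchronously."""
    replicas[0][key] = value
    for i in (1, 2):
        pending.append((i, key, value))

def replicate_one():
    """Apply one queued replication step (simulates the async background process)."""
    if pending:
        i, key, value = pending.pop(0)
        replicas[i][key] = value

def read(key):
    """Reads go to a random replica, so they may see stale data."""
    return random.choice(replicas).get(key)

write("cart:42", ["book"])
print("right after write:", read("cart:42"))   # may be None if a lagging replica answers
while pending:
    replicate_one()
print("after replication:", read("cart:42"))   # now every replica returns ['book']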

Processing Aspects
The key point of parallel processing is divide and conquer: data is divided into independent pieces and processed in parallel. Just imagine matrix multiplication, where each operation can be divided and processed separately. Big data processing means dividing a problem into several small operations and combining them into a single result. If the operations depend on one another, it is impossible to make the best use of parallelism, so data must be stored and processed with these factors in mind.
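A small divide-and-conquer sketch along the lines of the matrix-multiplication remark above (the matrix sizes and the row-wise split are illustrative): each worker process computes an independent block of rows of A·B, and the partial results are concatenated into the final product.

from concurrent.futures import ProcessPoolExecutor

def multiply_rows(args):
    """Compute one independent block of rows of A*B; no block needs any other block."""
    a_rows, b = args
    return [[sum(a_rows[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a_rows))]

def parallel_matmul(a, b, workers=2):
    chunk = (len(a) + workers - 1) // workers
    blocks = [(a[i:i + chunk], b) for i in range(0, len(a), chunk)]  # divide
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(multiply_rows, blocks)                   # conquer in parallel
    return [row for block in partials for row in block]              # combine

if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6], [7, 8]]
    b = [[1, 0], [0, 1]]
    print(parallel_matmul(a, b))   # multiplying by the identity returns a unchanged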

Map-Reduce
The most widely known technology for handling large data is a distributed data-processing framework of the Map-Reduce type, such as Apache Hadoop.
Data processing via the Map-Reduce method has the following characteristics:
It operates on ordinary computers with built-in hard disks, not on special storage. The computers are only weakly coupled, so the system can be expanded to hundreds or thousands of machines.
Since many computers participate in processing, system errors and hardware errors are assumed to be normal circumstances rather than exceptions.
With the simplified and abstracted basic operations of Map and Reduce, many complicated problems can be solved, and programmers who are not familiar with parallel programming can easily perform parallel processing of data.
It supports high throughput by using many computers.
The following figure displays the execution flow of the Map-Reduce method. Data stored in HDFS is divided among the available map workers and expressed (mapped) as key-value data, and the results are stored on local disk. The data is then compiled by the reduce workers, which generate a result file.
Depending on the characteristics of the data storage, locality is exploited: the gap between the node processing the data and the location of the source data is reduced by placing workers (based on the network switch) where the data is stored. Each worker can be implemented in various languages through a streaming interface (standard in/out).

Figure 8: Map-Reduce execution.
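A sketch of the streaming interface mentioned above (the file names and the word-count task are illustrative, and the exact Hadoop Streaming invocation is assumed rather than taken from the paper): the same script acts as mapper or reducer, reading records from standard input and writing key-value lines to standard output, which is how workers written in other languages plug into the framework.

#!/usr/bin/env python3
"""Word count over stdin/stdout, in the style of a streaming mapper/reducer.
Run locally as:  cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
"""
import sys
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    """Input arrives sorted by key; sum the counts for each word."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if mode == "map" else reducer)(sys.stdin)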
Apache Hive
Apache Hive helps to analyze large data by using a query language called HiveQL over data sources such as HDFS or HBase. Its architecture is divided into a Map-Reduce-oriented execution engine, a metastore holding metadata about the data storage, and an execution part that receives queries from users or applications.

Figure 11: HIVE architecture.
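A hedged sketch of submitting HiveQL (the table name, query, and connection details are invented, and it assumes the third-party PyHive package plus a reachable HiveServer2 instance; none of this comes from the paper): the query reads like SQL, and Hive turns it into map/reduce jobs behind the scenes.

# pip install pyhive  (third-party package; assumed here, not part of the standard library)
from pyhive import hive

# Hypothetical HiveServer2 endpoint.
connection = hive.connect(host="hive.example.org", port=10000, username="analyst")
cursor = connection.cursor()

# HiveQL looks like SQL; Hive compiles it into map/reduce jobs over data in HDFS.
cursor.execute("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM sales_events
    WHERE year = 2013
    GROUP BY region
    ORDER BY revenue DESC
""")

for region, orders, revenue in cursor.fetchall():
    print(region, orders, revenue)

cursor.close()
connection.close()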
To support extension by users, it allows user-defined functions at the scalar, aggregation, and table level.

Analysis Aspects
We have reviewed systems that store big data and the procedural and declarative technologies that express and carry out the processing of large data. Finally, let us look at technology that analyzes big data.
The process of finding meaning in data is called KDD (Knowledge Discovery in Databases). KDD stores data and processes and analyzes all or part of the data of interest in order to extract meaningful value, or to discover previously unknown facts and ultimately turn them into knowledge. For this, various technologies are applied in combination, such as artificial intelligence, machine learning, statistics, and databases.

GNU R
GNU R is a software environment comprising a programming language specialized for statistical analysis and graphics (visualization), together with packages. The language is optimized for statistical calculation, handling vector and matrix data smoothly. Desired statistical libraries are easy to obtain thanks to the R package site known as CRAN (Comprehensive R Archive Network). It can be regarded as the leading open-source tool in the field of statistics.
In the past, R loaded the data to be processed into the memory of a single computer and analyzed it using a single CPU; there has since been much progress, driven by the ever-increasing volume of data to be processed.


REFERENCES
[1] Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz (December 21, 2010), “Big Data, Analytics and the Path from Insights to Value”.
[2] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers (May 2011), “Big Data: The Next Frontier for Innovation, Competition, and Productivity”.
[3] Richard Winter (December 2011), “Big Data: Business Opportunities, Requirements and Oracle’s Approach” (PDF).
[4] Michael Stonebraker, Nabil Hachem, Pat Helland, “The End of an Architectural Era (It’s Time for a Complete Rewrite)”, VLDB 2007 (PDF).
[5] Michael Stonebraker et al., “One Size Fits All? – Part 2: Benchmarking Results”, CIDR 2007 (PDF).
[6] Daniel J. Abadi et al., “Integrating Compression and Execution in Column-Oriented Database Systems”, SIGMOD ’06 (PDF).
[7] “VoltDB | Lightning Fast, Rock Solid”.
[8] “SAP HANA”.
[9] “Real-Time Analytics Platform | Big Data Analytics | MPP Data Warehouse”.
[10] “Greenplum is driving the future of Big Data analytics”.
[11] “Data Warehouse Appliance, Data Warehouse Appliances, and Data Warehousing from Netezza”.
[12] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, CACM, Jan. 2008 (PDF).
[13] Mihai Budiu (March 2008), “Cluster Computing with Dryad”, MSR-SVC LiveLabs (PPT).
[14] Hung-chih Yang et al., “Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters”, SIGMOD ’07 (PPT).
[15] “Welcome to Apache Hadoop”.
[16] “The R Project for Statistical Computing”.

AUTHORS
First Author – Shubham Sharma, Bachelor of Technology Information Technology, Maharishi Markandeshwar Engineering College; Associate Consultant, Banking Products Development, Oracle Financial Services Software Ltd., [email protected].

