Making Big Data, small

MAKING BIG DATA, SMALL
Using distributed systems for processing, analysing and managing
huge data sets

Marcin Jedyk
Software Professional’s Network, Cheshire Datasystems Ltd

WARM-UP QUESTIONS
 How many of you have heard about Big Data before?
 How many about NoSQL?

 Hadoop?

AGENDA
 Intro – motivation, goal and ‘not about…’
 What is Big Data?
 NoSQL and systems classification
 Hadoop & HDFS
 MapReduce & live demo
 HBase

AGENDA
 Pig
 Building Hadoop cluster

 Conclusions

 Q&A

MOTIVATION
 Data is everywhere – why not analyse it?
 With Hadoop and NoSQL systems, building
distributed systems is easier than before
 Relying on software & cheap hardware rather
than expensive hardware works better!

GOAL
 To explain basic ideas behind Big Data
 To present different approaches towards BD

 To show that Big Data systems are easy to build

 To show you where to start with such systems

WHAT IS IT NOT ABOUT?
 Not a detailed lecture on a single system
 Not about advanced techniques in Big Data

 Not only about technology – but also about its
application

WHAT IS BIG DATA?
 Data characterised by 3 Vs:
 Volume

 Variety

 Velocity

 The interesting ones: variety & velocity

WHAT IS BIG DATA
 Data of high velocity: can’t store it? Process it on
the fly!
 Data of high variety: doesn’t fit into a relational
schema? Don’t use a schema, use NoSQL!
 Data which is impractical to process on a single
server
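
A minimal Python sketch of "process on the fly", assuming a hypothetical stream of (host, bytes) events. Only running totals are kept; the raw events are never stored. The function names and the event shape are illustrative assumptions, not part of any particular system.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def running_totals(events: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Consume events one at a time and keep only per-host totals."""
    totals: Dict[str, int] = defaultdict(int)
    for host, nbytes in events:
        totals[host] += nbytes  # update the aggregate, then drop the event
    return dict(totals)


def event_source() -> Iterator[Tuple[str, int]]:
    """Hypothetical generator standing in for a live feed."""
    yield ("10.0.0.1", 1234)
    yield ("10.0.0.2", 1230)
    yield ("10.0.0.1", 900)


stream_totals = running_totals(event_source())
print(stream_totals)
```

Memory use here grows with the number of distinct hosts, not with the number of events, which is the point of processing a fast stream without storing it.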

NO-SQL
 Goes hand in hand with Big Data
 NoSQL – an umbrella term for non-relational
databases and data stores
 It’s not always possible to replace an RDBMS with
NoSQL! (the opposite is also true)

NO-SQL
 NoSQL DBs are built around different principles
 Key-value stores: Redis, Riak
 Document stores: e.g. MongoDB – each record is a
document; each entry carries its own metadata (JSON-like,
BSON)
 Table stores: e.g. HBase – data persisted in multiple
columns (even millions), billions of rows and multiple
versions of records
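
The three data models above can be sketched with plain Python dictionaries. This is a toy illustration only: the keys, field names and the `latest` helper are assumptions made for the example and do not reflect the API of Redis, MongoDB or HBase.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"session:42": "alice"}

# Document store: each record carries its own field names (JSON-like).
documents = [
    {"_id": 1, "name": "alice", "tags": ["admin"]},
    {"_id": 2, "name": "bob", "email": "bob@example.com"},  # different fields
]

# Table store: row key -> column -> {timestamp: value}.
table = {
    "row-1": {
        "info:name": {1: "alice", 2: "alicia"},  # two versions of one cell
        "info:city": {1: "Chester"},
    },
}


def latest(row: str, column: str) -> str:
    """Return the newest version of a cell in the toy table."""
    versions = table[row][column]
    return versions[max(versions)]


print(latest("row-1", "info:name"))
```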

HADOOP
 Existed before the ‘Big Data’ buzzword emerged
 A simple idea – MapReduce

 A primary purpose – to crunch tera- and
petabytes of data
 HDFS as the underlying distributed file system

HADOOP – ARCHITECTURE BY EXAMPLE
 Imagine you need to process 1TB of logs
 What would you need?

 A server!

 But 1TB is quite a lot of data… we want it
quicker!
 OK, what about a distributed environment?
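
A rough back-of-the-envelope estimate of why one server is slow for 1TB. The 100 MB/s sequential read speed is an assumption chosen for illustration, not a measured figure, and the model ignores network and scheduling overhead.

```python
TB = 1024 ** 4               # bytes in 1 TB (binary)
ASSUMED_DISK_MB_PER_S = 100  # assumption: sequential read speed per node


def scan_hours(total_bytes: int, nodes: int) -> float:
    """Hours to read the data once if it is spread evenly over `nodes`."""
    bytes_per_second = ASSUMED_DISK_MB_PER_S * 1024 ** 2 * nodes
    return total_bytes / bytes_per_second / 3600


for n in (1, 10, 100):
    print(f"{n:>3} node(s): {scan_hours(TB, n):.2f} h")
```

Under these assumptions a single node needs close to three hours just to read the data, while the time drops in proportion to the number of nodes reading in parallel.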

 So what about that Hadoop stuff?
 Each node can: store data & process it (DataNode
& TaskTracker)

 How about allocating jobs to slaves? We need a
JobTracker!

 How about HDFS, how data blocks are
assembled into files?
 NameNode does it.

 NameNode – manages HDFS metadata, doesn’t
deal with files directly
 JobTracker – schedules, allocates and monitors
job execution on slaves – TaskTrackers
 TaskTracker – runs MapReduce operations
 DataNode – stores blocks of HDFS – default
replication level for each block: 3
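
A toy model of the metadata a NameNode keeps: which DataNodes hold a replica of each block. The round-robin placement and the node names are assumptions for illustration; real HDFS placement is rack-aware. The 64MB block size is taken from the MapReduce example later in this deck.

```python
from itertools import cycle
from typing import Dict, List

BLOCK_SIZE = 64 * 1024 ** 2  # 64 MB, as used in this deck's example
REPLICATION = 3              # default replication level from the slide


def place_blocks(file_size: int, datanodes: List[str]) -> Dict[int, List[str]]:
    """Map each block index to the DataNodes holding one of its replicas.

    Assumes at least REPLICATION DataNodes, so replicas land on
    distinct nodes.
    """
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    ring = cycle(datanodes)
    return {b: [next(ring) for _ in range(REPLICATION)] for b in range(n_blocks)}


block_map = place_blocks(200 * 1024 ** 2, ["dn1", "dn2", "dn3", "dn4"])
print(block_map)
```

A 200MB file becomes four blocks, each stored three times, so losing any single DataNode leaves at least two replicas of every block.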

HADOOP - LIMITATIONS
 DataNodes & TaskTrackers are fault tolerant
 NameNode & JobTracker are NOT! (workarounds
exist for this problem)
 HDFS deals nicely with large files, doesn’t do
well with billions of small files

MAP_REDUCE
 MapReduce – parallelisation approach
 Two main stages:
 Map – do an actual bit of work, e.g. extract info
 Reduce – summarise, aggregate or filter outputs from
the Map operation
 For each job, multiple Map and Reduce operations
– each may run on a different node = parallelism

MAP_REDUCE – AN EXAMPLE
 Let’s process 1TB of raw logs and extract traffic by
host.
 After a job is submitted, JobTracker allocates tasks
to slaves – possibly divided into 64MB splits =
16384 Map operations!
 Map – analyse logs and return them as a set of
<key,value> pairs
 Reduce – merge the output of Map operations
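
The figure of 16384 Map operations follows from simple arithmetic, assuming 1TB means 1024^4 bytes and each Map operation receives exactly one 64MB split:

```python
TOTAL_BYTES = 1024 ** 4       # 1 TB of logs
SPLIT_BYTES = 64 * 1024 ** 2  # 64 MB per input split

map_tasks = TOTAL_BYTES // SPLIT_BYTES
print(map_tasks)  # 16384
```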

 Take a look at a mocked log extract:
[IP – bandwidth]
10.0.0.1 – 1234
10.0.0.1 – 900
10.0.0.2 – 1230
10.0.0.3 – 999

 It’s important to define the key, in this case the IP. One Map operation returns:
<10.0.0.1;2134>
<10.0.0.2;1230>
<10.0.0.3;999>
 Now, assume another Map operation returned:
<10.0.0.1;1500>
<10.0.0.3;1000>
<10.0.0.4;500>

 Now, Reduce will merge those results:
<10.0.0.1;3634>
<10.0.0.2;1230>
<10.0.0.3;1999>
<10.0.0.4;500>
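
The Map and Reduce steps of this example can be sketched in plain Python. This is a single-process illustration of the idea, not Hadoop's API; the function names are assumptions, and the second Map output is copied from the slide rather than computed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def map_logs(lines: Iterable[str]) -> List[Pair]:
    """Map: parse 'IP <separator> bandwidth' lines, sum bandwidth per IP."""
    local: Dict[str, int] = defaultdict(int)
    for line in lines:
        fields = line.split()
        ip, bandwidth = fields[0], int(fields[-1])
        local[ip] += bandwidth
    return sorted(local.items())


def reduce_pairs(map_outputs: Iterable[List[Pair]]) -> Dict[str, int]:
    """Reduce: merge the <key,value> pairs from all Map operations."""
    merged: Dict[str, int] = defaultdict(int)
    for pairs in map_outputs:
        for ip, bandwidth in pairs:
            merged[ip] += bandwidth
    return dict(merged)


first_map = map_logs([
    "10.0.0.1 - 1234",
    "10.0.0.1 - 900",
    "10.0.0.2 - 1230",
    "10.0.0.3 - 999",
])
second_map = [("10.0.0.1", 1500), ("10.0.0.3", 1000), ("10.0.0.4", 500)]
traffic_by_host = reduce_pairs([first_map, second_map])
print(traffic_by_host)
```

Because addition is associative, each Map operation can pre-sum its own split and Reduce only has to add the partial totals per key.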

MAP_REDUCE
 Selecting a key is important
 It’s possible to define a composite key, e.g.
IP+date
 For more complex tasks, it’s possible to chain
MapReduce jobs
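
A small sketch of a composite key and of chaining two jobs, again in plain Python rather than Hadoop's API. The records, dates and function names are hypothetical. Job 1 totals traffic per (IP, date); job 2 takes job 1's output and finds the busiest day per IP.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical records: (ip, date, bandwidth)
records = [
    ("10.0.0.1", "2012-06-01", 1234),
    ("10.0.0.1", "2012-06-01", 900),
    ("10.0.0.1", "2012-06-02", 1500),
    ("10.0.0.2", "2012-06-01", 1230),
]


def job1_traffic_per_ip_and_date(
    rows: Iterable[Tuple[str, str, int]],
) -> Dict[Tuple[str, str], int]:
    """Job 1: composite key (ip, date) -> total bandwidth."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for ip, date, bandwidth in rows:
        totals[(ip, date)] += bandwidth
    return dict(totals)


def job2_peak_day_per_ip(
    daily: Dict[Tuple[str, str], int],
) -> Dict[str, Tuple[str, int]]:
    """Job 2: re-key job 1's output by ip and keep the busiest day."""
    peak: Dict[str, Tuple[str, int]] = {}
    for (ip, date), total in daily.items():
        if ip not in peak or total > peak[ip][1]:
            peak[ip] = (date, total)
    return peak


daily_totals = job1_traffic_per_ip_and_date(records)
peak_days = job2_peak_day_per_ip(daily_totals)
print(peak_days)
```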

HBASE
 Another layer on top of Hadoop/HDFS
 A distributed data store

 Not a replacement for RDBMS!

 Can be used with MapReduce

 Good for unstructured data – no need to worry
about exact schema in advance

PIG – HBASE ENHANCEMENT
 HBase lacks a proper query language
 Pig – a high-level scripting language (Pig Latin) that
makes life easier for HBase users

 Translates queries into MapReduce jobs

 When working with Pig or HBase, forget what
you know about SQL – it makes your life easier

BUILDING HADOOP CLUSTER
 Post-production servers are OK
 Don’t take ‘cheap hardware’ too literally
 Good connection between nodes is a must!
 >=1Gbps between nodes
 >=10Gbps between racks
 1 disk per CPU core
 More RAM, more caching!

FINAL CONCLUSIONS
 Hadoop and NoSQL-like DB/DS scale very well
 Hadoop is ideal for crunching huge data sets

 Does very well in production environments

 The cluster of slaves is fault tolerant; NameNode
and JobTracker are not!

EXTERNAL RESOURCES
 Trending Topic – built on Wikipedia access logs:
http://goo.gl/BWWO1
 Building a web crawler with Hadoop:
http://goo.gl/xPTlJ
 Analysing adverse drug events:
http://goo.gl/HFXAx
 Moving average for large data sets:
http://goo.gl/O4oml

EXTERNAL RESOURCES – USEFUL LINKS
http://www.slideshare.net/fullscreen/jpatanooga/la-hug-dec-2011-recommendation-talk/1
https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://hstack.org/hbase-performance-testing/
http://www.theregister.co.uk/2012/06/12/hortonworks_data_platform_one/
http://wiki.apache.org/hadoop/MachineScaling
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
http://www.cloudera.com/resource-types/video/
http://hstack.org/why-were-using-hbase-part-2/
