MAKING BIG DATA, SMALL
Using distributed systems for processing, analysing and managing
huge data sets


    Marcin Jedyk
    Software Professional’s Network, Cheshire Datasystems Ltd
WARM-UP QUESTIONS
 How many of you have heard about Big Data before?
 How many about NoSQL?

 Hadoop?
AGENDA
 Intro – motivation, goal and ‘not about…’
 What is Big Data?
 NoSQL and systems classification
 Hadoop & HDFS
 MapReduce & live demo
 HBase
AGENDA
 Pig
 Building Hadoop cluster

 Conclusions

 Q&A
MOTIVATION
 Data is everywhere – why not analyse it?
 With Hadoop and NoSQL systems, building
  distributed systems is easier than before
 Relying on software & cheap hardware rather
  than expensive hardware works better!
GOAL
 To explain basic ideas behind Big Data
 To present different approaches towards BD

 To show that Big Data systems are easy to build

 To show you where to start with such systems
WHAT IS IT NOT ABOUT?
 Not a detailed lecture on a single system
 Not about advanced techniques in Big Data

 Not only about technology – but also about its
  application
WHAT IS BIG DATA?
   Data characterised by 3 Vs:
     Volume

     Variety

     Velocity

   The interesting ones: variety & velocity
WHAT IS BIG DATA
 Data of high velocity: cannot store? Process on
  the fly!
 Data of high variety: doesn’t fit into relational
  schema? Don’t use schema, use NoSQL!
 Data which is impractical to process on a single
  server
NO-SQL
 Hand in hand with Big Data
 NoSQL – an umbrella term for non-relational
  databases or data stores
 It’s not always possible to replace RDBMS with
  NoSQL! (opposite is also true)
NO-SQL
   NoSQL DBs are built around different principles
     Key-value stores: Redis, Riak
     Document stores: e.g. MongoDB – record as a
      document; each entry has its own metadata (JSON-like,
      BSON)
     Table stores: e.g. HBase – data persisted in multiple
      columns (even millions), billions of rows and multiple
      versions of records
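The three storage models can be sketched with plain Python structures. This is only an illustration of the shapes involved: the keys, field names and timestamps are hypothetical, and no real Redis, MongoDB or HBase client is used.

```python
# Hypothetical records illustrating the three storage models named above.
# Plain Python structures stand in for the real stores.

# Key-value store: an opaque value looked up by a single key.
key_value = {"session:42": b"user=alice;ttl=3600"}

# Document store: each record is a self-describing, JSON-like document,
# and two documents in one collection need not share the same fields.
documents = [
    {"_id": 1, "name": "alice", "emails": ["alice@example.com"]},
    {"_id": 2, "name": "bob", "last_login": "2012-06-12"},
]

# Table store: row key -> column -> {timestamp: value}, so a cell can keep
# several versions and different rows may carry different columns.
table = {
    "row-10.0.0.1": {
        "traffic:bytes": {1339500000: 1234, 1339503600: 900},
    },
}


def latest(cell):
    """Return the most recent version of a cell (highest timestamp)."""
    return cell[max(cell)]


newest = latest(table["row-10.0.0.1"]["traffic:bytes"])
```

The nested dictionary for the table store is the point of the sketch: versions live inside the cell, which is why a table store can keep multiple versions of a record without a schema change.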
HADOOP
 Existed before ‘Big Data’ buzzword emerged
 A simple idea – MapReduce

 A primary purpose – to crunch tera- and
  petabytes of data
 HDFS as underlying distributed file system
HADOOP – ARCHITECTURE BY EXAMPLE
 Imagine you need to process 1TB of logs
 What would you need?

 A server!
HADOOP – ARCHITECTURE BY EXAMPLE
 But 1TB is quite a lot of data… we want it
  quicker!
 OK, what about a distributed environment?
HADOOP – ARCHITECTURE BY EXAMPLE
   So what about that Hadoop stuff?
     Each node can: store data & process it (DataNode
      & TaskTracker)
HADOOP – ARCHITECTURE BY EXAMPLE
   How about allocating jobs to slaves? We need a
    JobTracker!
HADOOP – ARCHITECTURE BY EXAMPLE
 How about HDFS: how are data blocks
  assembled into files?
 NameNode does it.
HADOOP – ARCHITECTURE BY EXAMPLE
 NameNode – manages HDFS metadata, doesn’t
  deal with files directly
 JobTracker – schedules, allocates and monitors
  job execution on slaves – TaskTrackers
 TaskTracker – runs MapReduce operations
 DataNode – stores blocks of HDFS – default
  replication level for each block: 3
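A toy model of the NameNode's bookkeeping may make the division of labour clearer: the NameNode only records which DataNodes hold which block, while the DataNodes hold the bytes. The function name, the node names and the round-robin placement are assumptions made for brevity; real HDFS chooses replica locations differently.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the block size used in the MapReduce example
REPLICATION = 3                # default replication level from the slide


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Toy NameNode: map each block index to the DataNodes holding a replica.

    Placement is plain round-robin (an assumption, not the HDFS policy).
    """
    n_blocks = -(-file_size // block_size)  # ceiling division
    ring = itertools.cycle(datanodes)
    return {i: [next(ring) for _ in range(replication)] for i in range(n_blocks)}


# A 200MB file on a hypothetical four-node cluster needs four blocks.
metadata = plan_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
```

Losing any single DataNode in this sketch still leaves two replicas of every block, which is the property the replication level buys.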
HADOOP - LIMITATIONS
 DataNodes & TaskTrackers are fault tolerant
 NameNode & JobTracker are NOT! (workarounds
  for this problem exist)
 HDFS deals nicely with large files, doesn’t do
  well with billions of small files
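One way to see why small files hurt is to estimate the metadata the NameNode must keep in memory. The sketch below assumes roughly 150 bytes per file or block object, a commonly quoted rule of thumb that does not come from these slides; the function and its parameters are hypothetical.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not a figure from the slides


def namenode_bytes(n_files, blocks_per_file=1, bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode memory: one object per file plus one per block."""
    return n_files * (1 + blocks_per_file) * bytes_per_object


# A billion single-block files versus a million files of 16 blocks each.
small_files = namenode_bytes(1_000_000_000)
large_files = namenode_bytes(1_000_000, blocks_per_file=16)
```

Under these assumptions the billion small files need about 300GB of NameNode memory, while the million large files need under 3GB.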
MAP_REDUCE
 MapReduce – parallelisation approach
 Two main stages:
     Map – do an actual bit of work, e.g. extract info
     Reduce – summarise, aggregate or filter outputs from
      Map operation
   For each job, multiple Map and Reduce operations
    – each may run on different node = parallelism
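The two stages can be sketched in a few lines of single-process Python. This is a minimal illustration of the idea, not Hadoop's API: map_reduce, count_words and the sample input are hypothetical, and everything runs on one machine, whereas on a cluster each split could be handled by a different node.

```python
from collections import defaultdict


def map_reduce(splits, map_fn, reduce_fn):
    """Run map_fn on every split, group emitted pairs by key, reduce each group."""
    grouped = defaultdict(list)
    for split in splits:  # on a cluster, each split could go to a different node
        for key, value in map_fn(split):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def count_words(line):
    """Map: emit <word, 1> for every word in the line."""
    for word in line.split():
        yield word, 1


# Reduce: add up the 1s emitted for each word.
counts = map_reduce(["big data", "big cluster"], count_words, lambda key, values: sum(values))
```

The grouping step in the middle is what lets the Map and Reduce functions stay independent of each other and of where they run.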
MAP_REDUCE – AN EXAMPLE
 Let’s process 1TB of raw logs and extract traffic by
  host.
 After submitting a job, JobTracker allocates tasks
  to slaves – possibly divided into 64MB packs =
  16384 Map operations!
 Map – analyse logs and return the results as a set of
  <key,value> pairs
 Reduce – merge the output of Map operations
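The 16384 figure can be reproduced with a short calculation, assuming binary units (1TB = 1024^4 bytes, 64MB = 64 × 1024^2 bytes) and one Map operation per pack. The helper name is hypothetical.

```python
def map_task_count(input_bytes, split_bytes):
    """Number of Map operations if each one handles one split (ceiling division)."""
    return -(-input_bytes // split_bytes)


TB = 1024 ** 4
MB = 1024 ** 2

tasks = map_task_count(1 * TB, 64 * MB)
```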
MAP_REDUCE – AN EXAMPLE
  Take a look at mocked log extract:
[IP – bandwidth]
10.0.0.1 – 1234
10.0.0.1 – 900
10.0.0.2 – 1230
10.0.0.3 – 999
MAP_REDUCE – AN EXAMPLE
 It’s important to define the key, in this case the IP
<10.0.0.1;2134>
<10.0.0.2;1230>
<10.0.0.3;999>
 Now, assume another Map operation returned:
<10.0.0.1;1500>
<10.0.0.3;1000>
<10.0.0.4;500>
MAP_REDUCE – AN EXAMPLE
Now, Reduce will merge those results:
<10.0.0.1;3634>
<10.0.0.2;1230>
<10.0.0.3;1999>
<10.0.0.4;500>
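The merge can be checked with a small sketch that parses the mocked log extract, builds the first Map output, and then sums it with the second Map output per IP. The function names are hypothetical and the code runs in one process; it only illustrates the arithmetic of the example.

```python
from collections import defaultdict

# The mocked log extract, one "IP – bandwidth" record per line.
LOG = """10.0.0.1 – 1234
10.0.0.1 – 900
10.0.0.2 – 1230
10.0.0.3 – 999"""


def map_traffic(lines):
    """Map: emit one <IP, total bandwidth> pair per IP seen in this split."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()  # first token is the IP, last is the bandwidth
        totals[parts[0]] += int(parts[-1])
    return dict(totals)


def reduce_traffic(map_outputs):
    """Reduce: sum the per-IP totals from every Map output."""
    merged = defaultdict(int)
    for output in map_outputs:
        for ip, total in output.items():
            merged[ip] += total
    return dict(merged)


first = map_traffic(LOG.splitlines())
second = {"10.0.0.1": 1500, "10.0.0.3": 1000, "10.0.0.4": 500}  # the other Map output
result = reduce_traffic([first, second])
```

An IP that appears in only one Map output, such as 10.0.0.2 or 10.0.0.4, passes through the Reduce step with its total unchanged.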
MAP_REDUCE
 Selecting a key is important
 It’s possible to define a composite key, e.g.
  IP+date
 For more complex tasks, it’s possible to chain
  MapReduce jobs
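A composite key and a chained job can both be sketched with tuples as keys. The records, dates and helper below are hypothetical: the first job sums traffic per (IP, date), and the second job consumes the first job's output and re-keys it by date alone.

```python
from collections import defaultdict

# Hypothetical (IP, date, bandwidth) records.
records = [
    ("10.0.0.1", "2012-06-11", 1234),
    ("10.0.0.1", "2012-06-12", 900),
    ("10.0.0.2", "2012-06-11", 1230),
    ("10.0.0.1", "2012-06-11", 100),
]


def sum_by_key(pairs):
    """Group <key, value> pairs by key and sum the values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Job 1: composite key (IP, date) gives traffic per host per day.
per_host_day = sum_by_key(((ip, day), size) for ip, day, size in records)

# Job 2, chained: takes job 1's output and re-keys it by date only.
per_day = sum_by_key((day, total) for (ip, day), total in per_host_day.items())
```

Choosing the key decides what ends up grouped together, which is why the same records can answer two different questions here.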
HBASE
 Another layer on top of Hadoop/HDFS
 A distributed data store

 Not a replacement for RDBMS!

 Can be used with MapReduce

 Good for unstructured data – no need to worry
  about exact schema in advance
PIG – HBASE ENHANCEMENT
 HBase lacks a proper query language
 Pig – makes life easier for HBase users

 Translates queries into MapReduce jobs

 When working with Pig or HBase, forget what
  you know about SQL – it makes your life easier
BUILDING HADOOP CLUSTER
 Post production servers are ok
 Don’t take ‘cheap hardware’ too literally
 Good connection between nodes is a must!
 >=1Gbps between nodes
 >=10Gbps between racks
 1 disk per CPU core
 More RAM, more caching!
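A rough calculation shows why the link speed matters. The sketch assumes a decimal terabyte (10^12 bytes), a fully utilised link and no protocol overhead, so the real figures would be worse; the function name is hypothetical.

```python
def transfer_seconds(size_bytes, link_gbps):
    """Seconds to move size_bytes over a link of link_gbps, ignoring overhead."""
    return size_bytes * 8 / (link_gbps * 1e9)


TB = 10 ** 12  # decimal terabyte (an assumption; network rates are decimal)

seconds_1g = transfer_seconds(TB, 1)    # node-to-node link from the slide
seconds_10g = transfer_seconds(TB, 10)  # rack-to-rack link from the slide
```

Under these assumptions, moving 1TB takes about 8000 seconds (over two hours) at 1Gbps and about 800 seconds at 10Gbps, which is why it pays to process data on the node that already stores it.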
FINAL CONCLUSIONS
 Hadoop and NoSQL-like DB/DS scale very well
 Hadoop ideal for crunching huge data sets

 Does very well in production environment

 Cluster of slaves is fault tolerant, NameNode
  and JobTracker are not!
EXTERNAL RESOURCES
 Trending Topic – built on Wikipedia access logs:
  http://goo.gl/BWWO1
 Building web crawler with Hadoop:
  http://goo.gl/xPTlJ
 Analysing adverse drug events:
  http://goo.gl/HFXAx
 Moving average for large data sets:
  http://goo.gl/O4oml
EXTERNAL RESOURCES – USEFUL LINKS
http://www.slideshare.net/fullscreen/jpatanooga/la-hug-dec-2011-recommendation-talk/1
https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://hstack.org/hbase-performance-testing/
http://www.theregister.co.uk/2012/06/12/hortonworks_data_platform_one/
http://wiki.apache.org/hadoop/MachineScaling
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
http://www.cloudera.com/resource-types/video/
http://hstack.org/why-were-using-hbase-part-2/
QUESTIONS?