BDA 01 - Introduction
Lesson 1: Introduction
Tools
Lesson 1: Introduction
Big Data???
Analysis???
Business Intelligence???
Contents
What's Big Data?
Using the Google Trends site, check how the popularity of the search term “Big Data” has
declined compared with similar search terms such as “machine learning” and “artificial intelligence”.
Popularity of ‘Big Data’ Search Term on Google
What exactly is Big Data?
Characteristics of Big Data: The x V’s
The 4Vs of Big Data
History of Big Data
Examples of Big Data
Articles talking about Big Data volumes
Very Big Data at Netflix
Sources of Big Data
Some sources
E.g. Google Maps traffic
Applications of Big Data
Industries Using Big Data
Online Retail, Search, Finance, Manufacturing, Automobile, Medicine
What Do They Do with the Data?
1. Amazon uses its massive data to build recommendation systems
Big Data Tools and Ecosystem
What Do We Want to Do with Big Data?
• Store, manage and retrieve
• Analyze and visualize
• Build models (ML)
The Big Data tools, methodologies, frameworks and ecosystems address these tasks.
Most technologies combine several aspects. For instance, Hadoop provides
both storage and analysis capabilities.
Tools and Platforms
• Cloud providers
• Hadoop, HDFS, MapReduce
• Apache Spark
• NoSQL databases
How Some of these Components Look in a Full System
Summary on Big Data
• Big Data, when it comes to size, is a relative term, but you will know
you have a large dataset if you can’t process it on a single computer
or you can’t handle it with traditional software such as Excel.
• The Big Data technology landscape is now dominated by cloud
providers such as AWS, Google and Azure. Other notable
vendors include Databricks, Cloudera/Hortonworks and MapR.
• Big Data has penetrated many industries, including government.
Further Reading on Big Data Basics
Big Data and Parallel Processing
What is Parallel Processing?
• Parallel computing is a style of computation that carries out multiple operations
simultaneously, using one machine (by means of multi-threading) or many machines.
It works on the principle that large problems can often be
divided into smaller problems, and that these smaller problems can be executed
concurrently.
• Task parallelism - This form of parallelism covers the execution of computer
programs across multiple processors on the same machine or on multiple machines. It focuses on
executing different operations in parallel to make full use of the available computing
resources, that is, processors and memory.
• Data parallelism - This form of parallelism focuses on distributing subsets of a data set
across multiple processors. The same operation is
performed by each processor on its own subset of the distributed
data.
Most Big Data frameworks use both data and task parallelism.
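The two forms of parallelism described above can be sketched with Python's standard library. This is a minimal illustration, not how any particular Big Data framework implements them. The helper names `chunked` and `sum_of_squares` are made up for this example, and a thread pool is used only to keep the sketch short and portable. For CPU-heavy work in CPython one would typically swap in `ProcessPoolExecutor` under an `if __name__ == "__main__":` guard.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, n_chunks):
    """Split items into n_chunks contiguous, roughly equal slices (hypothetical helper)."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The single operation that every worker applies to its own chunk."""
    return sum(x * x for x in chunk)


numbers = list(range(1, 1001))

# Data parallelism: the SAME operation runs on DIFFERENT subsets of the data,
# and the partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_of_squares, chunked(numbers, 4)))
data_parallel_total = sum(partial_sums)

# Task parallelism: DIFFERENT operations are submitted at the same time,
# here all reading the same data.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {
        "min": pool.submit(min, numbers),
        "max": pool.submit(max, numbers),
        "total": pool.submit(sum, numbers),
    }
    task_results = {name: future.result() for name, future in futures.items()}

print(data_parallel_total)
print(task_results)
```

The split, apply, combine shape of the data-parallel part is the same idea that frameworks such as MapReduce and Spark apply across many machines.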
Similar Concepts to Parallelism
• Distributed computing
• Concurrent processing
Why Parallel Processing?
Parallel vs. Linear Processing
[Figure: a problem is split into Instruction-1, Instruction-2, ..., Instruction-N; linear processing runs them one after another, parallel processing runs them side by side, and each path ends in Output or Error]
Advantages of Parallel Processing
Vertical Scaling Vs. Horizontal Scaling
Eventually vertical scaling fails
Horizontal Scaling is Better
A Cluster of computers
Concurrency and Parallelism in Python
Before we turn to the big guns (Big Data frameworks) to handle our Big
Data, we will look at how to achieve simple parallelism with vanilla Python.
Concurrency and Parallelism in Python
Summary of Different Concurrency Types in Python
CPU-Bound and I/O-Bound Programs in Python
I/O Bound Vs. CPU Bound Processes
How to Achieve Concurrency in Python
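As one minimal sketch of concurrency in vanilla Python, the example below uses a thread pool for I/O-bound work. The function `fake_download` and the file names are hypothetical stand-ins: `time.sleep` simulates waiting on a network or disk, which is when threads can overlap usefully even in CPython. For CPU-bound work, a process pool (`multiprocessing` or `ProcessPoolExecutor`) is usually the better fit.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(name, delay=0.1):
    """Hypothetical stand-in for an I/O-bound call; sleeping mimics waiting on I/O."""
    time.sleep(delay)
    return f"{name}: done"


names = ["a.csv", "b.csv", "c.csv", "d.csv"]

# Sequential version: each wait happens one after another.
start = time.perf_counter()
sequential = [fake_download(n) for n in names]
sequential_seconds = time.perf_counter() - start

# Threaded version: the waits overlap, so the total time is close to one delay.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(names)) as pool:
    threaded = list(pool.map(fake_download, names))  # map keeps input order
threaded_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

Both versions return the same results in the same order; only the elapsed time differs.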
How to Solve Big Data Problems in Python
Working with data in a distributed fashion is inherently
difficult, so make sure that you exhaust all other options
before jumping into Spark, Hadoop or other Big
Data frameworks.
Advice on Tackling a Big Data Problem
Some questions to ask yourself before you jump to the big guns:
1. Can I optimize pandas to solve the problem? If you are using pandas for
data munging, you may be able to optimize how it loads large datasets, depending
on the nature of your problem.
2. How about drawing a sample from the large dataset? Depending on your
use case, drawing a sample out of a large dataset may or may not work.
Just be careful that you sample correctly.
3. Can I use simple Python parallelism to solve the problem on my laptop?
Sometimes the data isn't that big, but you need to run more intense
computations on it. In that case, multiprocessing can help.
4. Can I use a big data framework on my laptop? For some tasks, even with a
25GB dataset, frameworks like Spark and Dask can work on a single
laptop.
5. Do I need to build a cluster? Take time to think about which distribution of
Hadoop to use, which vendors to use, and whether you will put the cluster in
the cloud or on-premise. You will need input from IT people for this one.
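For question 2, one way to "sample correctly" from data that is too large to load at once is reservoir sampling, which reads the data once and gives every record the same chance of ending up in the sample. The sketch below is an illustration under assumptions: `reservoir_sample` is a name chosen for this example, and the generator of `record-i` strings stands in for lines streamed from a large file.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length.

    Only k rows are kept in memory at any time (Algorithm R).
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep row i with probability k / (i + 1) by drawing a slot in [0, i].
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Hypothetical stream standing in for a file that does not fit in memory.
stream = (f"record-{i}" for i in range(100_000))
sample = reservoir_sample(stream, k=1000, seed=42)
print(len(sample))
```

Uniform sampling like this is only appropriate when a simple random sample suits the analysis. Grouped or time-ordered data may call for a stratified or otherwise structured sample instead.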
Exercises
• Analysis with Python
• Computer repair time
• Observing the repair times at a computer repair center, we have the following data table
(observation number, followed by repair time in days):
No. Days   No. Days   No. Days   No. Days   No. Days   No. Days   No. Days
  1   18     8   29    15   11    22   12    29   16    36   11    43   13
  2   15     9   10    16   14    23   34    30   14    37   10    44    8
  3   17    10   14    17   13    24   29    31   15    38   13    45   10
  4    9    11   17    18   16    25   13    32    7    39   14    46   13
  5   37    12   12    19   13    26   19    33   40    40    9    47   16
  6   15    13   13    20   15    27   12    34   16    41   18    48    9
  7    8    14   12    21   16    28   15    35   11    42    8    49    9
• Decide on an appointment time to give the customer when there is a new
computer repair request.
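A possible starting point for this exercise, offered as a sketch rather than a model answer: load the table, then look at summary statistics and at the share of past repairs that finished within a candidate number of days. The 16-day candidate and the helper name `share_done_within` are assumptions made for illustration, and choosing the actual appointment time is left to you.

```python
import statistics

# Repair-time table from the exercise, copied row by row.
# Each row alternates: observation number, repair time in days.
TABLE = """
1 18 8 29 15 11 22 12 29 16 36 11 43 13
2 15 9 10 16 14 23 34 30 14 37 10 44 8
3 17 10 14 17 13 24 29 31 15 38 13 45 10
4 9 11 17 18 16 25 13 32 7 39 14 46 13
5 37 12 12 19 13 26 19 33 40 40 9 47 16
6 15 13 13 20 15 27 12 34 16 41 18 48 9
7 8 14 12 21 16 28 15 35 11 42 8 49 9
"""

tokens = [int(t) for t in TABLE.split()]
# Pair each observation number with its repair time, ordered by observation number.
observations = sorted(zip(tokens[0::2], tokens[1::2]))
repair_days = [days for _, days in observations]

mean_days = statistics.mean(repair_days)
median_days = statistics.median(repair_days)
# quantiles(n=10) returns the 9 decile cut points; the last approximates the 90th percentile.
p90_days = statistics.quantiles(repair_days, n=10)[-1]


def share_done_within(days):
    """Fraction of past repairs that finished within the given number of days."""
    return sum(d <= days for d in repair_days) / len(repair_days)


print(f"n={len(repair_days)}, mean={mean_days:.1f}, median={median_days}, p90={p90_days:.1f}")
print(f"share finished within 16 days: {share_done_within(16):.0%}")
```

The mean (about 15 days) sits above the median (13 days) because a handful of repairs took 29 to 40 days, so a promise based on the mean or median alone would be missed fairly often. Looking at a high percentile is one way to account for that.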