
Big Data Analysis and Business Intelligence

Lesson 1: Introduction

Dr. Le, Hai Ha


Objectives
• Provide fundamental and essential knowledge of Big Data, Big Data analysis, and advanced machine learning techniques to transform business data into business knowledge.
• The computing environment, including hardware, software, and architecture, is also covered.
• Basic knowledge for designing and deploying a business Big Data project is also supplied.
• Workload:
• Theory: 2 lessons (in class; the first 2 lessons are lectures, the last lesson is for exercises, using a computer)
• Exercise: 1 lesson
• Lab: 1 hour
• Self-study: 6 hours (at home; do homework/project)

• Documents and communication via MS Teams, code: wrwpns6


Textbooks
[1] Leskovec, J., Rajaraman, A., & Ullman, J. D. (2020). Mining of Massive Datasets. Cambridge University Press.
[2] Erl, T., Khattak, W., & Buhler, P. (2016). Big Data Fundamentals: Concepts, Drivers & Techniques. Prentice Hall Press.
[3] Daniel, J. (2019). Data Science with Python and Dask. Simon and Schuster.
[4] Marin, I., Shukla, A., & Sarang, V. K. (2019). Big Data Analysis with Python: Combine Spark and Python to Unlock the Powers of Parallel Computing and Machine Learning. Packt Publishing Ltd.
[5] Dean, J. (2014). Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners. John Wiley & Sons.
[6] Schmarzo, B. (2013). Big Data: Understanding How Data Powers Big Business. John Wiley & Sons.
[7] Stanford CS246: Mining Massive Data Sets (course materials)
Tools

Lesson 1: Introduction

Big Data???
Analysis???

Business Intelligence???
Contents

• What is Big Data?
• Sources of Big Data
• Applications of Big Data
• Big Data tools and ecosystem
• Big Data and parallel processing
• How to work with large datasets in Python
What's Big Data?

1. How big is big?
2. What makes Big Data different from small data?
3. Is there such a thing as small data?
4. What are the characteristics of Big Data?
Exercise: Tracking the Big Data Hype

Using the Google Trends site, check how the popularity of the search term “Big Data” has declined compared with similar search terms: “machine learning” and “artificial intelligence”.
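
The same comparison can also be done programmatically. Below is a minimal sketch using the third-party pytrends package (an unofficial Google Trends client; assumed to be installed with `pip install pytrends`, and its exact behavior may change since the API is unofficial):

```python
# Compare search interest for three terms over the full Trends history.
from pytrends.request import TrendReq

pytrends = TrendReq()
pytrends.build_payload(
    ["Big Data", "machine learning", "artificial intelligence"],
    timeframe="all",  # full history since 2004
)
interest = pytrends.interest_over_time()  # pandas DataFrame, one column per term
print(interest.tail())
```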

Popularity of ‘Big Data’ Search Term on Google

[Chart: Google Trends interest over time for the term “Big Data”]
What exactly is Big Data?

• Big Data refers to larger, more complex data sets, especially from new data sources. These data sets are characterized by greater variety (text, images, audio, video, etc.), arriving in increasing volumes and with higher velocity.
• These data sets are so voluminous that traditional data processing software just can’t manage them.
Characteristics of Big Data: The x V’s

[Diagrams: the 4 Vs, the 5 Vs, and the 10 Vs of Big Data]
Characteristics of Big Data: The x V’s
• Volume: The amount of data matters. With Big Data, we are often talking about very large volumes of data, although the question remains: how big is big?
• Velocity: The fast rate at which data is received and (perhaps) acted on.
• Variety: The many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of Big Data, data comes in new unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional preprocessing to derive meaning and support metadata.
• Veracity: The data is often noisy.
• Value: What value you can get from which data, and how Big Data yields better results from stored data.
• Variability: To what extent, and how fast, is the structure of your data changing? And how often does the meaning or shape of your data change?
• Visualization: Using charts and graphs to visualize large amounts of complex data is much more effective in conveying meaning than spreadsheets and reports chock-full of numbers and formulas.
• Validity: The accuracy of the data for its intended use.
• Volatility: The time considerations placed on a particular data set, i.e., how long the data remains valid and worth storing.
• Vulnerability: The potential harm in sharing data, e.g., our shopping data.
History of Big Data

• Previously, large datasets were available through satellite images and other sources.
• The explosion in dataset sizes came with the rise of internet companies such as Facebook, YouTube, and others around 2005.
• MapReduce was created at Google, and the Hadoop framework followed in 2006 as an open-source implementation, specifically to deal with large datasets.
• Since then, the advent of IoT, more mobile devices, and increased access to the internet have contributed to ever-increasing datasets.
Examples of Big Data

Articles talking about Big Data volumes:

● This old article (https://cloudtweaks.com/2015/03/how-much-data-is-produced-every-day/) has some interesting stats, BUT there is no way to know how accurate it is
● Yet another article (https://seedscientific.com/how-much-data-is-created-every-day/)

[Chart: Very Big Data at Netflix]
Sources of Big Data
Some sources

• Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
• E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users’ buying trends can be traced.
• Weather stations: Weather stations and satellites produce very large amounts of data, which are stored and processed to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this they store the data of their millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
E.g., Google Maps traffic

[Image: live traffic view in Google Maps]
Applications of Big Data

Industries using Big Data: online retail, search, finance, manufacturing, automobile, medicine, telecom, entertainment, insurance.

What do they do with the data?

1. Amazon uses its massive data to build recommendation systems
2. Netflix uses data to predict customer viewing behavior
3. Google uses search data to improve search results and optimize the ranking of results
4. Google uses its data for advertising
5. And more and more
Big Data Tools and Ecosystem
What Do We Want to Do with Big Data?

• Store, manage, and retrieve data
• Analyze and visualize data
• Build models (ML)

Big Data tools, methodologies, frameworks, and ecosystems address these tasks. Most technologies combine several aspects; for instance, with Hadoop you have both storage and analysis capabilities.
Tools and Platforms

Cloud providers:
● Data storage and management
● Compute for data processing, analysis, and visualization
● Examples: AWS, Google Cloud Platform (GCP), Microsoft Azure

NoSQL databases:
● Store large datasets at scale
● Store data in JSON instead of relational tables
● NoSQL database types include pure document databases, key-value stores, wide-column databases, and graph databases
● Examples: MongoDB, CouchDB, Cassandra, and Redis
● The data lake is a relatively new architecture in data storage

Hadoop, HDFS, Apache Spark, MapReduce:
● Provide distributed storage
● Data analytics, machine learning
● Parallel data processing
● Example vendors: Cloudera, Hortonworks, Databricks
How Some of these Components Look in a Full System

Summary on Big Data

• Big Data, when it comes to size, is a relative term, BUT you will know you have a large dataset if you can’t process it on a single computer or can’t use traditional software such as Excel to handle the data
• The Big Data technology landscape is now dominated by cloud providers such as AWS, Google, Azure, and others. Other notable vendors include Databricks, Cloudera/Hortonworks, and MapR
• Big Data has penetrated many industries, including government
Further Reading on Big Data Basics

1. This PDF book on Big Data (https://drive.google.com/file/d/1F8YwClvPiHVIO3vPD4KLENof977tvf0F/view) is a good resource to look through
2. If you want to understand data lakes and data warehouses more, take a look at this eBook (https://drive.google.com/file/d/11qB0qkTf0zJ9R6plUWr5VZAoujufUiER/view)
Big Data and Parallel Processing

What is Parallel Processing?
• Parallel computing is a style of computation in which multiple operations are carried out simultaneously, using one machine (by means of multi-threading) or many machines. Parallel computing works on the principle that large problems can often be divided into smaller problems, and these smaller problems can be executed concurrently.
• Task parallelism: This form of parallelism covers the execution of computer programs across multiple processors on the same or multiple machines. It focuses on executing different operations in parallel to fully utilize the available computing resources in the form of processors and memory.
• Data parallelism: This form of parallelism focuses on distributing data sets across multiple computation programs. Here, the same operations are performed in parallel on different subsets of the distributed data (see the sketch below).

Most Big Data frameworks utilize both data and task parallelism.
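
As a toy illustration of data parallelism in plain Python (my sketch, not from the slides): the same operation is applied to different chunks of the data in separate processes, and the partial results are then combined.

```python
# Data parallelism sketch: split the data, apply the same operation
# (a sum) to each chunk in its own process, then combine the results.
from concurrent.futures import ProcessPoolExecutor

def chunk_sum(chunk):
    # Same operation, different slice of the data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        partials = list(ex.map(chunk_sum, chunks))
    print(sum(partials))  # 499999500000, same as sum(data)
```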

Similar Concepts to Parallelism

• Distributed computing
• Concurrent processing

Why Parallel Processing?

[Diagram: a single node containing Compute (CPU) and Storage]

• In this single-node setup, if you have too much data and/or too many computations to perform, it’s difficult to scale
• Parallel or distributed processing is the cornerstone of working with Big Data
Parallel vs. Linear Processing

[Diagram, linear/sequential processing: the problem is executed as Instruction-1 → Instruction-2 → … → Instruction-N, one at a time; an error sends the run back, and the output appears only after the last instruction]

[Diagram, parallel processing: the problem is split so that Instruction-1 … Instruction-N run at the same time, and their results are combined into the output]
Advantages of Parallel Processing

• Faster: Parallel processing can process large datasets in a fraction of the time
• Less compute per node: Lower memory and compute requirements per machine, as compute instructions are distributed to smaller execution nodes
• Scalability: More execution nodes can be added to or removed from the execution network depending on the problem
Vertical Scaling vs. Horizontal Scaling

[Diagram, vertical scaling: a single node’s compute grows ever larger; eventually vertical scaling fails]

Horizontal Scaling is Better

[Diagram, horizontal scaling: many nodes, each with its own Compute and Storage, form a cluster of computers]
Concurrency and Parallelism in Python

Before we turn to the big guns (Big Data frameworks) to handle our Big Data, we will look at how to achieve simple parallelism with vanilla Python
Concurrency and Parallelism in Python

• In Python, strictly speaking, concurrency and parallelism are different
• Concurrency: processing happens on a single processor, although it’s still possible to make progress on multiple tasks through some interleaving tricks
• Parallelism: Python creates several processes which run simultaneously on different processors
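
A toy comparison of the two (my sketch, assuming CPython with the GIL): threads give concurrency but not CPU parallelism, while processes achieve true parallelism on a multi-core machine.

```python
# Time the same CPU-bound work under a thread pool vs a process pool.
# On CPython, the GIL keeps the threaded version near-sequential, while
# the process version spreads the work across cores.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    # Pure CPU work: sum of squares up to n.
    return sum(i * i for i in range(n))

def timed(executor_cls, label):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(burn, [2_000_000] * 4))
    print(f"{label}: {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads (concurrency, GIL-limited)")
    timed(ProcessPoolExecutor, "processes (true parallelism)")
```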

35
Summary of Different Concurrency Types in Python

[Table: summary of concurrency types in Python]
CPU-Bound and I/O-Bound Programs in Python

• I/O-bound problems cause your program to slow down because it frequently must wait for input/output (I/O) from some external resource. For example, scraping data from many websites is an I/O-bound problem: what limits the program from running faster is the time spent interacting with the file system and the network.

• CPU-bound problems are classes of programs that do significant computation without talking to the network or accessing files. These are CPU-bound programs because the resource limiting their speed is the CPU, not the network or the file system.
I/O-Bound vs. CPU-Bound Processes
How to Achieve Concurrency in Python

• There are several libraries that tackle this problem in Python, depending on whether the problem is I/O- or CPU-bound, and also on the expressiveness of the library
• asyncio: For I/O-bound programs, such as downloading data from the web, this package is useful for running concurrent tasks (see the sketch below)
• multiprocessing: A built-in Python library for speeding up CPU-bound programs by creating many processes to take advantage of multiple cores on your computer
• Other libraries: joblib and concurrent.futures provide similar functionality at a higher level
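
A minimal, self-contained asyncio sketch (my illustration; asyncio.sleep stands in for a real network call so the example runs without internet access):

```python
# Run five simulated downloads concurrently on a single thread.
import asyncio
import time

async def fetch(url: str) -> str:
    await asyncio.sleep(1)  # pretend this is a 1-second network wait
    return f"contents of {url}"

async def main():
    urls = [f"https://example.com/page{i}" for i in range(5)]
    # gather schedules all five coroutines at once, so the total wall
    # time is about 1 second instead of about 5 run sequentially.
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    print(len(pages), "pages fetched")

start = time.perf_counter()
asyncio.run(main())
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```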

How to Solve Big Data Problems in Python

Working with data in a distributed fashion is inherently difficult, so make sure you exhaust all other options before jumping into Spark, Hadoop, or other Big Data frameworks
Advice on Tackling a Big Data Problem
Some questions to ask yourself before you jump to the big guns

1. Can I optimize pandas to solve the problem? If you are using pandas for data munging, you can optimize it to load large datasets, depending on the nature of your problem (see the sketch after this list).
2. How about drawing a sample from the large dataset? Depending on your use case, drawing a sample out of a large dataset may or may not work. Just be careful that you sample correctly.
3. Can I use simple Python parallelism to solve the problem on my laptop? Sometimes the data isn't that big, but you need to run more intense computations on the smaller data; multiprocessing can help.
4. Can I use a Big Data framework on my laptop? For some tasks, even with a 25 GB dataset, frameworks like Spark and Dask can work on a single laptop.
5. Do I need to build a cluster? Take time to think about which distribution of Hadoop to use, which vendors to use, and whether you will put the cluster in the cloud or on-premises. You will need input from IT people for this one.
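
For item 1, here is a minimal sketch of two common pandas optimizations (the file name and column names are hypothetical): reading a large CSV in chunks and down-casting column dtypes so each chunk uses less memory.

```python
# Stream a large CSV in 1M-row chunks with compact dtypes,
# aggregating as we go instead of holding everything in memory.
import pandas as pd

dtypes = {"user_id": "int32", "category": "category", "amount": "float32"}

total = 0.0
for chunk in pd.read_csv("big_sales.csv", dtype=dtypes,
                         chunksize=1_000_000):
    total += chunk["amount"].sum()

print(f"total amount: {total:.2f}")
```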

Exercises
• Analysis with Python
• Computer repair time

• Observing the repair times at a computer repair center, we have the following data table:

No  Time   No  Time   No  Time   No  Time   No  Time   No  Time   No  Time
 1   18     8   29    15   11    22   12    29   16    36   11    43   13
 2   15     9   10    16   14    23   34    30   14    37   10    44    8
 3   17    10   14    17   13    24   29    31   15    38   13    45   10
 4    9    11   17    18   16    25   13    32    7    39   14    46   13
 5   37    12   12    19   13    26   19    33   40    40    9    47   16
 6   15    13   13    20   15    27   12    34   16    41   18    48    9
 7    8    14   12    21   16    28   15    35   11    42    8    49    9

• Time is in days. Decide on an appointment time to give the customer when a new computer repair request arrives.
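
One possible starting point in Python (my sketch, not the official solution): load the 49 repair times and base the promised date on a high percentile, so that most repairs finish before the appointment.

```python
# Summary statistics for the 49 observed repair times (in days).
import numpy as np

times = np.array([
    18, 15, 17,  9, 37, 15,  8,   # observations 1-7
    29, 10, 14, 17, 12, 13, 12,   # observations 8-14
    11, 14, 13, 16, 13, 15, 16,   # observations 15-21
    12, 34, 29, 13, 19, 12, 15,   # observations 22-28
    16, 14, 15,  7, 40, 16, 11,   # observations 29-35
    11, 10, 13, 14,  9, 18,  8,   # observations 36-42
    13,  8, 10, 13, 16,  9,  9,   # observations 43-49
])

print("mean:", round(float(times.mean()), 2), "days")
print("median:", float(np.median(times)), "days")
print("90th percentile:", float(np.percentile(times, 90)), "days")
# Quoting roughly the 90th percentile means about 9 out of 10 repairs
# would be finished by the promised date.
```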
