ECS765P_W1_Introduction to Big Data
Weekly Content
(lectures and labs)
Aims of the module
● To provide a rigorous understanding of big data processing technologies and their applications
● To provide a practical understanding of how to design and develop big data solutions
● To allow students to critically engage with current and future developments in the field of data science
Module resources
MRJob
Module Assessments
Exam: 50%
● On average, ~30-40 questions of variable difficulty
● Covers the lectures, labs, practical exercises and reading materials
Coursework: 50%
● Quiz: 20% (some questions depend on completing the lab and its questions)
Released: tentatively at the end of the 3rd week (i.e., after Lab 2)
Deadline: after 11th lecture (3 Apr 5PM)
Any number of attempts is allowed, but only the last attempt counts
No feedback/grade is given for the attempts until the quiz closes
● Individual Project: 30% (~6 weeks to complete; start as soon as you can to avoid disappointment)
Released: tentatively by the end of week 5 / early week 6
Deadline: End of week 11 (5th Apr 3PM)
Academic Misconduct
Refer to: https://arcs.qmul.ac.uk/students/student-appeals/academic-misconduct/
Academic misconduct includes, but is not limited to, the following:
• Plagiarism;
• Examination offences;
• The use, or attempted use, of ghost writing services for any part of assessment;
• The submission of work, or sections of work, for assessment in more than one module or
assignment (including work previously submitted for assessment at another institution);
• The fraudulent reporting of source material, experimental results, research, or other
investigative work;
• Collusion in the preparation or production of submitted work.
Policy: https://arcs.qmul.ac.uk/media/arcs/policyzone/academic/Academic-Misconduct-Policy-(2022-23).pdf
Penalties: https://arcs.qmul.ac.uk/media/arcs/Academic-misconduct-penalty-guidance.pdf
Lab Arrangement
The labs are in-person in the Information Technology Lab (ITL) building.
Could you think of some notable differences between data and information?
Data vs Information
Key difference:
● “Data is raw, unorganised facts that need to be processed to become meaningful.”
● “Information is a set of data which has been processed in a meaningful way according to the given requirements.”
● Data: facts, observations, perceptions, numbers, characters, symbols, images, etc.
● Information: processed data that possesses context, relevance, and purpose
Source: https://www.guru99.com/difference-information-data.html
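A tiny illustration of the distinction, using made-up sensor readings (the values and variable names are hypothetical):

# Raw, unorganised values: data.
readings_celsius = [18.2, 19.1, 22.4, 25.0, 23.8, 20.5]

# Processing the data with context and purpose turns it into information.
average = sum(readings_celsius) / len(readings_celsius)
peak = max(readings_celsius)
print(f"Average temperature today: {average:.1f} C (peak {peak:.1f} C)")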
Why Big Data?
A conventional data engineering system might not be suitable for a particular application because of
insufficient:
● Storage: Data does not fit on the hard disk.
● I/O: The speed at which data can be retrieved does not meet the volume of requests.
● Processing speed: The speed at which instructions are executed by the processing unit is too low to
produce on time the desired results for the volume of data or volume of requests.
● Memory: The volume of data that needs processing does not fit in the RAM.
Without meaningful processing, these large amounts of collected and stored data are USELESS.
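As a concrete instance of the memory limit above, here is a minimal sketch of stream processing in Python: the file is read line by line so only one record sits in RAM at a time (the file name and the "ERROR" marker are hypothetical):

def count_error_lines(path: str) -> int:
    errors = 0
    with open(path) as f:
        for line in f:  # only one line is held in memory at a time
            if "ERROR" in line:
                errors += 1
    return errors

print(count_error_lines("huge_server.log"))  # works even if the log exceeds RAM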
Notable Open Datasets
● Kaggle – over 14,000 datasets
● Microsoft Academic Graph
● Data.gov – U.S. Government’s open data
● Pushshift.io – full Reddit dataset
● Common Crawl – 7 years of web page data
● YouTube-8M Dataset – a large-scale labelled video dataset that consists of millions of YouTube videos
● Data4Good.io – over 1TB of compressed network data
Big Data Parallel Processing
● Datasets are too large and complex, making it impractical to analyse them on a single computing node
● SOLUTION: multiple computing nodes analyse the data to produce information in a reasonable time
● CAUTION: consider problems and challenges associated with parallel computing!!!
Could you think of what kinds of problems arise from processing data in parallel?
[Figure: Von Neumann architecture. Source: https://en.wikipedia.org/wiki/Von_Neumann_architecture]
Limitations of Single Processor
● We have reached the practical limits of the computing power a single processor can deliver
● We cannot make processors much faster or much bigger (a silicon manufacturing process limitation)
● Moore’s law states that the number of transistors per chip will continue to increase
More transistors are great, but at clock speeds above ~4GHz chips tend to overheat
● Instead, modern chips contain an increasing number of processing cores
These are known as Multicore Chips
Beyond conventional computing: Scaling up
The performance of our data processing systems can be improved by
increasing the processing, storage and I/O of a single machine. This is
known as scaling up or vertical scaling.
[Figure: concurrent vs parallel execution of tasks on processing units]
Parallelism
● Parallelism means that an application divides a task into smaller sub-tasks.
● The sub-tasks can be processed in parallel, for instance on multiple CPUs at the exact same time.
● The application must run more than one thread, and each thread must run on a separate CPU, CPU core, or graphics card GPU core (a minimal sketch follows the figure below).
[Figure: sub-tasks running as separate threads on separate processing units]
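A minimal sketch of parallelism in Python, assuming a trivial squaring function as the sub-task; multiprocessing.Pool spreads the sub-tasks across worker processes:

from multiprocessing import Pool

def square(n: int) -> int:  # the sub-task each worker executes
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # four workers, ideally on four cores
        results = pool.map(square, range(10))  # sub-tasks run in parallel
    print(results)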
Challenges of Parallel Processing
Several challenges hinder the efficient use of parallel processors (the goal is high resource utilisation):
● Many algorithms are hard to divide into subtasks (or cannot be divided at all → out of luck).
Some problem areas are much easier to parallelise than others
● Subtasks might use results from each other, so coordinating the different tasks might be difficult
Lack of proper coordination can result in the task failing or producing wrong results (see the sketch below)
● The communication network is the main BOTTLENECK
The data exchange between the processors can overwhelm the shared network
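A minimal sketch of the coordination problem: several threads updating a shared counter without a lock, a classic race condition (the counter and thread counts are made up for illustration):

import threading

counter = 0

def increment(times: int) -> None:
    global counter
    for _ in range(times):
        counter += 1  # read-modify-write is not atomic, so updates can be lost

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Without a lock this often prints less than the expected 400000
# (exact behaviour varies across Python versions).
print(counter)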
Parallelization Example
Image processing → for instance, sharpening an image
● We have a big photograph (e.g., satellite images, medical images, …).
● We divide it into tiles (square patches) and sharpen each tile in parallel
● Then we adjust the pixels at the edges so that they match up
This is still parallelisable because handling edge pixels can be done as an independent second phase
Also, edge pixels are a small fraction of the total, and changing them does not affect the other pixels
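A minimal sketch of the tiling idea, assuming a NumPy array as the photograph and a trivial brightness tweak standing in for a real sharpening filter (the second, edge-matching phase is omitted):

import numpy as np
from multiprocessing import Pool

def process_tile(tile):
    # Stand-in for a sharpening kernel: any per-tile operation works here.
    return np.clip(tile * 1.2, 0, 255)

def split_into_tiles(image, tile_size):
    h, w = image.shape
    return [image[r:r + tile_size, c:c + tile_size]
            for r in range(0, h, tile_size)
            for c in range(0, w, tile_size)]

if __name__ == "__main__":
    image = np.random.randint(0, 256, (512, 512)).astype(np.float32)
    tiles = split_into_tiles(image, 128)
    with Pool() as pool:
        processed = pool.map(process_tile, tiles)  # each tile handled in parallel
    print(f"{len(processed)} tiles processed")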
Tricky Parallelisation Example
Path search – searching for the best/shortest route between two points A and B, without exhaustive search
● How do we divide the task into smaller ones? Find intermediate points?
● How do the processors select these intermediate points independently?
The processors need to communicate among themselves, but what are the intermediate results?
● How can we guarantee that the solution is the best/shortest one?
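To see why this is hard, here is a minimal sequential Dijkstra sketch (the graph is made up): each step pops the globally closest unvisited node, and that global minimum is a serial dependency that resists naive parallelisation:

import heapq

def dijkstra(graph, start):
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)  # the global minimum: a serial bottleneck
        if d > dist.get(node, float("inf")):
            continue  # stale entry, already improved
        for neighbour, weight in graph[node]:
            new_dist = d + weight
            if new_dist < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_dist
                heapq.heappush(heap, (new_dist, neighbour))
    return dist

graph = {"A": [("B", 2), ("C", 5)], "B": [("C", 1), ("D", 4)],
         "C": [("D", 1)], "D": []}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 2, 'C': 3, 'D': 4}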
Examples of parallelism: Simulation
We often need to simulate physical events:
● The problem: what happens when lightning strikes a plane?
● Planes are too expensive to conduct a large number of tests on them, but we need to ensure they are
safe to fly
● Simulation is a necessity: but it is computationally difficult
There are a lot of similar tasks (nuclear safety, for example) where direct testing is impossible or difficult
[Figure: the Big Data pipeline — Data Sources → Ingestion → Storage → Processing → Output]
Data Sources
A data source is any mechanism that produces data. Examples of data sources include:
● Mobile and web apps
● Websites
● IoT devices
● Databases
● Output of other big data pipelines
Ingestion
The ingestion stage gathers and pre-processes the incoming data from a variety of data sources and makes the data readily usable by the subsequent stages.
• Ingested data can be stored or directly processed
• Data is moved from the original sources across a network, hence communication protocols such as HTTP, MQTT or FTP can be used during ingestion
• Data ingestion can include transformation operations on raw data (a minimal sketch follows)
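A minimal ingestion sketch using only the standard library, assuming a hypothetical HTTP endpoint that serves newline-delimited JSON records; it pulls each record, applies a small transformation, and writes it out for the storage stage:

import json
import urllib.request

SOURCE_URL = "https://example.com/events"  # hypothetical data source

def ingest(url, out_path):
    with urllib.request.urlopen(url) as response, open(out_path, "w") as out:
        for raw_line in response:  # records arrive over HTTP, one per line
            record = json.loads(raw_line)
            record["source"] = url  # example transformation: tag provenance
            out.write(json.dumps(record) + "\n")

ingest(SOURCE_URL, "ingested_events.jsonl")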
Storage
After ingestion, data are usually stored on the target processing nodes for future processing.
• Alternatively, due to the high volumes of data, distributed storage solutions are used
• Distributed storage needs to provide flexibility and fast retrieval of high volumes of data (see the sketch below)
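One idea behind fast retrieval in distributed storage, sketched minimally: hash each record's key to pick its home node, so any record can be located without a scan (the node names are made up):

import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]

def node_for(key):
    # The hash of the key deterministically picks the storage node.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for key in ["user:42", "user:43", "sensor:7"]:
    print(key, "->", node_for(key))  # same key always maps to the same node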
Processing
During weeks 2 to 5, we will cover Apache Hadoop and Spark, Big Data solutions that consist of:
• Processing capabilities (based on the MapReduce programming model)
• Hadoop Framework (YARN scheduler, Hadoop Distributed File System (HDFS), ...)
• Spark Framework (In-memory processing)
• Spark Programming
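As a taste of the MapReduce programming model, here is the classic word count written with MRJob (listed above under module resources): mappers emit (word, 1) pairs and the reducer sums the counts per word:

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum all the 1s emitted for this word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Saved as word_count.py, it can be run locally with: python word_count.py input.txt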
Week 5: Ingestion, Storage and Reliability
The remainder of the module focuses on further Big Data processing stages, namely
• Weeks 9 and 10: Stream processing (ingestion and processing of stream data)
• Week 11: Graph processing (graph databases, large-scale graph processing)