ECS765P_W1_Introduction to Big Data

The ECS640U/ECS765P Big Data Processing module, led by Dr. Ahmed M. A. Sayed, aims to provide a comprehensive understanding of big data technologies and their applications through lectures and labs. The course includes assessments such as an exam and coursework, with a focus on practical skills using Python and relevant packages. Key topics include big data concepts, parallel processing, and the big data pipeline, with resources and support available through QMPlus.


ECS640U/ECS765P Big Data Processing

Introduction to Big Data


Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science

Credit: Joseph Doyle, Jesus Carrion, Felix Cuadrado, …


Tentative Module Organization
Module link on QMPlus: https://qmplus.qmul.ac.uk/course/view.php?id=24180
Lectures (2 hours): Monday 15:00 – 17:00 in Fogg Lecture Theatre (Fogg-LT)
Labs (2 hours), starting from Week 2: Tuesday 11:00 – 13:00 in TB building (ground floor)
Check the timetable for up-to-date info: https://timetables.qmul.ac.uk/default.aspx
On QMPlus – General tab: Student forum (https://qmplus.qmul.ac.uk/mod/forum/view.php?f=123242) and
announcements (https://qmplus.qmul.ac.uk/mod/forum/view.php?f=123241)
Frequently Asked Questions (FAQ – check here first): https://qmplus.qmul.ac.uk/mod/page/view.php?id=2488189
Lecturer (Module Organizer)
● Dr. Ahmed M. A. Sayed ([email protected])
Teaching Fellow
● Jianshu Qiao ([email protected])
Demonstrators
● Amarja Shivraj Pawar ([email protected])
QM+ page
[Screenshot of the QM+ module page: the Quiz and Coursework appear here, along with Lecture Recordings, a general area with various useful items, and the Weekly Content (lectures and labs)]
Aims of the module
● To provide a rigorous understanding of big data processing technologies and their applications
● To provide a practical understanding of how to design and develop big data solutions
● To allow students to critically engage with current and future developments in the field of data science
Module resources

● Big Data Fundamentals: https://www.oreilly.com/library/view/big-data-fundamentals/9780134291185/
● Hadoop: The Definitive Guide: https://www.oreilly.com/library/view/hadoop-the-definitive/9780596521974/
● Spark: The Definitive Guide: https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/
QMUL Library links:
● https://search.library.qmul.ac.uk/iii/encore/record/C__Rb1796247__Sbig%20data%20fundamentals__Orightresult__U__X7?lang=eng&suite=def
● https://search.library.qmul.ac.uk/iii/encore/record/C__Rb1840075__Sspark%20the%20guide__Orightresult__U__X2?lang=eng&suite=def
● https://www-vlebooks-com.ezproxy.library.qmul.ac.uk/Product/Index/475742?page=0
Programming in the module
We will mostly be using the Python programming language, in particular:
● The mrjob package for MapReduce
● The pyspark package for Spark
Module Assessments
Exam: 50%
● On average ~30–40 questions of variable difficulty
● Covers the lectures, labs, practical exercises and reading materials
Coursework: 50%
● Quiz: 20% (some questions depend on completing the lab and its questions)
Released: tentatively at the end of the 3rd week (i.e., after Lab 2)
Deadline: after the 11th lecture (3 Apr, 5PM)
Any number of attempts is allowed, but only the last attempt counts
No feedback/grade is given for any attempt until the quiz closes
● Individual Project: 30% (~6 weeks to complete; start as soon as you can to avoid disappointment)
Released: tentatively by the end of week 5 / early in week 6
Deadline: end of week 11 (5th Apr, 3PM)
Academic Misconduct
Refer to: https://arcs.qmul.ac.uk/students/student-appeals/academic-misconduct/
Academic misconduct includes, but is not limited to, the following:
• Plagiarism;
• Examination offences;
• The use, or attempted use, of ghost writing services for any part of assessment;
• The submission of work, or sections of work, for assessment in more than one module or
assignment (including work previously submitted for assessment at another institution);
• The fraudulent reporting of source material, experimental results, research, or other
investigative work;
• Collusion in the preparation or production of submitted work.
Policy: https://arcs.qmul.ac.uk/media/arcs/policyzone/academic/Academic-Misconduct-Policy-(2022-23).pdf
Penalties: https://arcs.qmul.ac.uk/media/arcs/Academic-misconduct-penalty-guidance.pdf
Lab Arrangement
The labs are in-person in the Information Technology Lab (ITL) building.

EECS Jupyter Hub Cluster:
• Each student with an ITS account can access the hub.
• To use the cluster, you just need an internet connection and a browser!
• Access the hub at http://jhub.eecs.qmul.ac.uk/ or https://hub.comp-teach.qmul.ac.uk/

A video with instructions is available on the module’s QMPlus page (https://qmplus.qmul.ac.uk/mod/kalvidres/view.php?id=2488192)
Useful instructions/commands: https://qmplus.qmul.ac.uk/mod/url/view.php?id=2488193

For any technical issues, please refer to http://support.eecs.qmul.ac.uk or contact eecs-systems-[email protected]
Lecture Outline

● What is Big Data?


● Parallel Processing
● The Big Data Pipeline
● What will we learn?
● The Word-Count Problem
What is Big Data?
Many different definitions
● “Big data is a term used to refer to data sets that are too large or complex for traditional data processing
application software to adequately deal with” (Wikipedia)
● “Big data is high-volume, high-velocity and/or high-variety (3Vs) data assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making and process automation.” (Gartner)

Can you think of some notable differences between data and information?
Data vs Information
Key difference:
● “Data is a raw and unorganized fact(s) that need to be processed to make it meaningful”
● “Information is a set of data which have been processed in a meaningful way according to the given
requirements.”
● Data: facts, observations, perceptions, numbers, characters, symbols, images, etc.
● Information: processed data which includes data that possess context, relevance, and purpose

Data → Data Processing → Information

Source: https://www.guru99.com/difference-information-data.html
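As a toy illustration of this distinction (the readings below are made up, not from the module), raw measurements only become information once processed into a meaningful summary:

```python
# Raw data: unorganized facts (hypothetical sensor readings, in °C)
readings = [21.5, 23.0, 22.1, 24.8, 20.9]

# Processing adds context and purpose, turning data into information
average = sum(readings) / len(readings)
peak = max(readings)
print(f"Average temperature: {average:.2f} °C, peak: {peak} °C")
```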
Why Big Data?
A conventional data engineering system might not be suitable for a particular application because of
insufficient:
● Storage: Data does not fit in the hard disk.
● I/O: The speed at which data can be retrieved does not meet the volume of requests.
● Processing speed: The speed at which instructions are executed by the processing unit is too low to produce the desired results on time for the volume of data or volume of requests.
● Memory: The volume of data that needs processing does not fit in the RAM.

In all these cases (and others), we need Big Data technologies.


Example: Big Data at Netflix
Big Data at Netflix
● 100 million users
● 125+ million hours of video watched each day
● 4,000 different devices
● 700+ billion events a day
● 60 petabytes of data
More information at the Netflix Technology Blog: https://medium.com/@NetflixTechBlog

Without meaningful processing, these large amounts of collected and stored data are USELESS
Notable Open Datasets
● Kaggle – over 14,000 datasets
● Microsoft Academic Graph
● Data.gov – U.S. Government’s open data
● Pushshift.io – full Reddit dataset
● Common Crawl – 7 years of web page data
● YouTube-8M Dataset – a large-scale labelled video dataset consisting of millions of YouTube videos
● Data4Good.io – over 1TB of compressed network data
Big Data Parallel Processing
● Datasets are too large and complex, making it impractical to analyse them on a single computing node
● SOLUTION: multiple computing nodes analyse the data to produce information in a reasonable time
● CAUTION: consider the problems and challenges associated with parallel computing!!!

Can you think of what kinds of problems arise from processing data in parallel?
Contents

● What is Big Data?


● Parallel Processing
● The Big Data Pipeline
● What will we learn?
● The Word-Count Problem
Traditional Sequential Program Execution
● Basic model based on the John von Neumann architecture from 1945.
● One instruction is fetched, decoded and executed at a time
● As processor speed increases, more instructions can be executed in the same amount of time.
● For instance, the metric FLOPS (floating-point operations per second) is used to measure the speed of computers (or supercomputers).

Source: https://en.wikipedia.org/wiki/Von_Neumann_architecture
Limitations of Single Processor
● We have reached the practical limits of the computing power a single processor can deliver
● We cannot make processors much faster, or much bigger (silicon manufacturing process limitations)
● Moore’s law states that the number of transistors per chip will continue to increase
More transistors are great, but if the clock speed is > 4 GHz the chips tend to melt with heat
● Instead, modern chips contain an increasing number of processing cores
These are known as multicore chips
Beyond conventional computing: Scaling up
The performance of our data processing systems can be improved by
increasing the processing, storage and I/O of a single machine. This is
known as scaling up or vertical scaling.

Up until the early 2000s, single-processor speeds increased steadily. Physical limits were then reached, leading to computer architectures consisting of multiple cores (processors):
● Any laptop nowadays will have a minimum of 2 cores (mine has 10)
● The IBM Power System E980 has 192 cores.
Beyond conventional computing: Scaling out

A second option to increase the performance of our data engineering system is to add more machines. This is known as scaling out or horizontal scaling.

A set of machines that work together as a single system is known as a server cluster (or data center), and each individual machine is known as a node. Cluster nodes are interconnected through fast networks.
Parallel Processing
● Using more than one processor in parallel to perform a calculation or solve a problem.
● The calculation is divided into tasks, which are sent to different processors at the same time.
● Processors can be different cores in the same machine, and/or different machines linked by a network
● Processes need to be SYNCHRONIZED and have some method of COMMUNICATION

What is the difference between concurrent and parallel processing?


Concurrent vs Parallel Execution Models
● Concurrent: if the node has only a single CPU, the applications or tasks do not progress at exactly the same time, but more than one app/task is virtually in progress at a time on the CPU.
This involves context switching between apps/tasks
● Parallel: the node has more than one CPU or CPU core, and makes progress on more than one task simultaneously.
Tasks are ideally independent; otherwise, some form of synchronization is needed

[Diagram: Concurrent – a single processing unit alternates between Task 1 and Task 2. Parallel – Processing Unit 1 runs Task 1 while Processing Unit 2 runs Task 2 at the same time.]
Parallelism
● Parallelism means that an application divides its tasks into smaller sub-tasks.
● The sub-tasks can be processed in parallel, for instance on multiple CPUs at the exact same time.
● The app must have more than one thread running, and each thread must run on a separate CPU / CPU core / graphics card GPU core.

[Diagram: a parent task on Processing Unit 1 splits into Sub-task 1 (Thread 1) and Sub-task 2 (Thread 2); Sub-task 2 runs on Processing Unit 2, and the parent task resumes once both finish.]
Challenges of Parallel Processing
Several challenges hinder the efficient use of parallel processors (the goal is high resource utilization)
● Many algorithms are hard to divide into subtasks (or cannot be divided at all → bad luck).
Some problem areas are much easier than others to parallelise
● Subtasks might use each other’s results, so coordinating the different tasks can be difficult
A lack of proper coordination can result in the task failing or producing wrong results
● The communication network is the main BOTTLENECK
The data exchanged between the processors can overwhelm the shared network
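The coordination point can be illustrated with a shared counter (a minimal Python sketch, not from the slides): several threads update shared state, and a lock keeps the read-modify-write sequence from interleaving and losing updates.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        # counter += 1 is a read-modify-write; without the lock, updates
        # from different threads could interleave and be lost
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000: proper coordination keeps the result correct
```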
Parallelization Example
Image processing → for instance, sharpening an image
● We have a big photograph (e.g., satellite images, medical images, …).
● We divide it into tiles (square patches), and sharpen each tile in parallel
● Then we adjust the pixels at the edges so that they match up
This is still parallelizable, because handling edge pixels can be done as an independent second phase
Also, edge pixels are a small part of the total, and changing them does not affect the other pixels
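The tiling step can be sketched as follows (a minimal illustration; the image dimensions and tile size are made up). Each rectangle is an independent unit of work that could be handed to a separate worker:

```python
def tiles(width, height, tile):
    """Yield (x, y, w, h) rectangles covering a width x height image."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            # Edge tiles may be smaller than the nominal tile size
            yield (x, y, min(tile, width - x), min(tile, height - y))

# A hypothetical 1000x600 image split into 256-pixel tiles
patches = list(tiles(1000, 600, 256))
print(len(patches))  # 12 independent units of work
```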
Tricky Parallelisation Example
Path search – search for the best/shortest route between two points A and B, without exhaustive search
● How do we divide the task into smaller ones? Find intermediate points?
● How do the processors select these intermediate points independently?
The processors need to communicate among themselves, but what are the intermediate results?
● How can we guarantee that the solution is the best/shortest one?
Examples of parallelism: Simulation
We often need to simulate physical events:
● The problem: what happens when lightning strikes a plane?
● Planes are too expensive to conduct a large number of tests on, but we need to ensure they are safe to fly
● Simulation is a necessity: but it is computationally difficult
There are many similar problems (nuclear safety, for example) where direct testing is impossible or difficult

How do you think simulations could be parallelized?


Examples of parallelism: Weather Prediction
● We would like to predict real-world events (such as the weather)
● This often involves very long and data intensive calculations, just like simulation does
● Challenging both for algorithm design and implementation
Examples of parallelism: Data Analysis
● Sentiment analysis: marketers analyse large amounts of data collected from online social platforms such as Twitter, Facebook, etc., in order to find out how their products are doing
● Text and URL crawling: Google analyses lots of web pages in order to support Google Search
● Bio-informatics: biologists store large amounts of DNA data, and they have to work out what it means
● All these applications are computationally demanding, and they also use large amounts of data
Back to Netflix Example
● Netflix applies data analysis to maximize the effectiveness of their video catalogue
● Traditionally, there was a limited set of movie genres (let’s say 100)
● Netflix manages more than 70,000 genres
The genres are combined into categories using data collected from the millions of users that actively consume content through the service
To achieve this, the large datasets about users are processed in parallel to produce timely results
● With that information, they can provide recommendations considerably more effectively
Contents

● What is Big Data?


● Parallelism
● The Big Data Pipeline
● What will we learn?
● The Word-Count Problem

Time for a quick QUIZ and a break!


The Big Data Pipeline
The Big Data Pipeline consists of a sequence of three data actions: ingestion, storage and processing.
• Data sources are the input to the pipeline.
• The output can be another big data pipeline, a small data analysis, a query system, a visualisation or an action.

Data Sources → Output
Data Sources

Data Sources → Ingestion → Storage → Processing → Output

A data source is any mechanism that produces data. Examples of data sources include:
● Mobile and web apps
● Websites
● IoT devices
● Databases
● Output of other big data pipelines
Ingestion

Data Sources → Ingestion → Storage → Processing → Output

The ingestion stage gathers and pre-processes the incoming data from a variety of data sources and makes the data readily usable by the onward stages.
• Ingested data can be stored or directly processed
• Data is moved from the original sources across a network, hence communication protocols such as HTTP, MQTT or FTP can be used during ingestion
• Data ingestion can include transformation operations on raw data
Storage

Data Sources → Ingestion → Storage → Processing → Output

After ingestion, data is usually stored on the target processing nodes for future processing.
• Alternatively, due to the high volumes of data, distributed storage solutions are used
• Distributed storage needs to provide flexibility and fast retrieval of high volumes of data.
Processing

Data Sources → Ingestion → Storage → Processing → Output

Processing runs the algorithms intended to process the big data.
• Data can come directly from the ingestion stage (bypassing storage) or from storage
• The results of the processing stage can be stored back in the storage system or used as the input of another pipeline
• Processing can be batch, stream or graph processing
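The stages above can be sketched as plain functions (a schematic illustration with made-up event records, not a real framework):

```python
def ingest(source):
    # Ingestion: gather and pre-process raw records (strip whitespace, drop empties)
    return [line.strip() for line in source if line.strip()]

def store(records, warehouse):
    # Storage: persist ingested records for later processing
    warehouse.extend(records)
    return warehouse

def process(warehouse):
    # Processing: count events per type (a stand-in for a real analysis)
    counts = {}
    for record in warehouse:
        counts[record] = counts.get(record, 0) + 1
    return counts

events = ["click", " click ", "", "purchase", "click"]  # made-up event records
output = process(store(ingest(events), []))
print(output)  # {'click': 3, 'purchase': 1}
```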
Contents

● What is Big Data?


● Parallelism
● The Big Data Pipeline
● What will we learn?
● The Word-Count Problem
Weeks 2-4: Apache Hadoop

Data Sources → Ingestion → Storage → Processing → Output

During weeks 2 to 5, we will cover Apache Hadoop and Spark, Big Data solutions that consist of:
• Processing capabilities (based on the MapReduce programming model)
• The Hadoop framework (YARN scheduler, Hadoop Distributed File System (HDFS), ..)
• The Spark framework (in-memory processing)
• Spark programming
Week 5: Ingestion, Storage and Reliability

Data Sources → Ingestion → Storage → Processing → Output

Weeks 6 and 8 focus on Big Data solutions for
• Ingestion: specific ingestion solutions for Apache Hadoop will be presented
• Storage technologies, including distributed file systems and NoSQL data stores
• Reliability in Big Data processing frameworks.
Different types of data sources will also be discussed.
Weeks 6-11: Processing

Data Sources → Ingestion → Storage → Processing → Output

The remainder of the module focuses on the Big Data processing stages, namely
• Weeks 9 and 10: stream processing (ingestion and processing of stream data)
• Week 11: graph processing (graph databases, large-scale graph processing)
Contents

● What is Big Data?


● Parallelism
● The Big Data Pipeline
● The Word-Count Problem
Our first parallel program
● Task: count the number of occurrences of each word in one document
● Input: text document
● Output: a sequence of (word, count) pairs:
The 56
School 23
Queen 10
● Collection stage: not applicable in this case
● Ingestion stage: move the file to data storage/warehouse with an applicable protocol, e.g. HTTP/FTP
● Preparation stage: clean up by removing characters which might confuse the algorithm, e.g. quotation marks, special characters, etc.
Example Program Input - Text File
QMUL has been ranked 9th among multi-faculty institutions in the UK, according to tables published today in the Times Higher
Education.
A total of 154 institutions were submitted for the exercise.
The 2008 RAE confirmed Queen Mary to be one of the rising stars of the UK research environment and the REF 2014 shows that this
upward trajectory has been maintained.
Professor Simon Gaskell, President and Principal of Queen Mary, said: “This is an outstanding result for Queen Mary. We have built
upon the progress that was evidenced by the last assessment exercise and have now clearly cemented our position as one of the UK’s
foremost research-led universities. This achievement is derived from the talent and hard work of our academic staff in all disciplines,
and the colleagues who support them.”
The Research Excellence Framework (REF) is the system for assessing the quality of research in UK higher education institutions.
Universities submit their work across 36 panels of assessment. Research is judged according to quality of output (65 per cent),
environment (15 per cent) and, for the first time, the impact of research (20 per cent).
How to solve the problem?
How to solve the problem on a single processor?

# input: text string with the complete text
words = text.split()
count = dict()
for word in words:
    if word in count:
        count[word] = count[word] + 1
    else:
        count[word] = 1

For example:

Text → “the good the bad and the ugly” (after the preparation stage: lower-cased, punctuation removed)

words → [‘the’, ‘good’, ‘the’, ‘bad’, ‘and’, ‘the’, ‘ugly’]

count → {‘the’: 3, ‘good’: 1, ‘bad’: 1, ‘and’: 1, ‘ugly’: 1}
Parallelising the problem
Splitting the workload into subtasks:
● Split sentences/lines into words
● Count all the occurrences of each word
Partition the file(s) and create threads which process each partition in parallel → Parallelism

What do we do with the intermediate results?


● Merge into a single collection
● This possibly requires parallelism too
● One model for executing this is MapReduce, which is the subject of next week’s lecture
