ECS765P_W1_Introduction to Big Data
Weekly Content
(lectures and labs)
Aims of the module
● To provide a rigorous understanding of big data processing technologies and their applications
● To provide a practical understanding of how to design and develop big data solutions
● To allow students to critically engage with current and future developments in the field of data science
Module resources
MRJob
Module Assessments
Exam: 50%
● On average, ~30-40 questions of variable difficulty
● Covers the lectures, labs, practical exercises and reading materials
Coursework: 50%
● Quiz: 20% (some questions depend on completing the lab and its questions)
Released: tentatively at the end of the 3rd week (i.e., after Lab 2)
Deadline: after 11th lecture (3 Apr 5PM)
Any number of attempts is allowed, but only the last attempt counts
No feedback/grade is given for the attempts until the quiz closes
● Individual Project: 30% (~6 weeks to complete; start as soon as you can to avoid disappointment)
Released: tentatively by the end of week 5 / early week 6
Deadline: End of week 11 (5th Apr 3PM)
Academic Misconduct
Refer to: https://arcs.qmul.ac.uk/students/student-appeals/academic-misconduct/
Academic misconduct includes, but is not limited to, the following:
• Plagiarism;
• Examination offences;
• The use, or attempted use, of ghost writing services for any part of assessment;
• The submission of work, or sections of work, for assessment in more than one module or
assignment (including work previously submitted for assessment at another institution);
• The fraudulent reporting of source material, experimental results, research, or other
investigative work;
• Collusion in the preparation or production of submitted work.
Policy: https://arcs.qmul.ac.uk/media/arcs/policyzone/academic/Academic-Misconduct-Policy-(2022-23).pdf
Penalties: https://arcs.qmul.ac.uk/media/arcs/Academic-misconduct-penalty-guidance.pdf
Lab Arrangement
The labs are in-person in the Information Technology Lab (ITL) building.
Could you think of some notable differences between data and information?
Data vs Information
Key difference:
● “Data is raw, unorganised facts that need to be processed to become meaningful.”
● “Information is a set of data which has been processed in a meaningful way according to the given requirements.”
● Data: facts, observations, perceptions, numbers, characters, symbols, images, etc.
● Information: processed data that possesses context, relevance, and purpose
Source: https://www.guru99.com/difference-information-data.html
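A tiny illustration of the distinction, using made-up sensor readings (the values and variable names are hypothetical):

# Raw, unorganised values: data.
readings_celsius = [18.2, 19.1, 22.4, 25.0, 23.8, 20.5]

# Processing the data with context and purpose turns it into information.
average = sum(readings_celsius) / len(readings_celsius)
peak = max(readings_celsius)
print(f"Average temperature today: {average:.1f} C (peak {peak:.1f} C)")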
Why Big Data?
A conventional data engineering system might not be suitable for a particular application because of
insufficient:
● Storage: Data does not fit on the hard disk.
● I/O: The speed at which data can be retrieved does not meet the volume of requests.
● Processing speed: The speed at which instructions are executed by the processing unit is too low to
produce on time the desired results for the volume of data or volume of requests.
● Memory: The volume of data that needs processing does not fit in the RAM.
Without meaningful processing, these large amounts of collected and stored data are USELESS.
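As a concrete instance of the memory limit above, here is a minimal sketch of stream processing in Python: the file is read line by line so only one record sits in RAM at a time (the file name and the "ERROR" marker are hypothetical):

def count_error_lines(path: str) -> int:
    errors = 0
    with open(path) as f:
        for line in f:  # only one line is held in memory at a time
            if "ERROR" in line:
                errors += 1
    return errors

print(count_error_lines("huge_server.log"))  # works even if the log exceeds RAM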
Notable Open Datasets
● Kaggle – over 14,000 datasets
● Microsoft Academic Graph
● Data.gov – U.S. Government’s open data
● Pushshift.io – full Reddit dataset
● Common Crawl – 7 years of web page data
● YouTube-8M Dataset – a large-scale labelled video dataset that consists of millions of YouTube videos
● Data4Good.io – over 1TB of compressed network data
Big Data Parallel Processing
● Datasets are too large and complex, making it impractical to analyse them on a single computing node
● SOLUTION: multiple computing nodes analyse the data to produce information in a reasonable time
● CAUTION: consider problems and challenges associated with parallel computing!!!
Could you think of what kinds of problems arise from processing data in parallel?
[Figure: Von Neumann architecture. Source: https://en.wikipedia.org/wiki/Von_Neumann_architecture]
Limitations of Single Processor
● We have reached the practical limits of the computing power a single processor can deliver
● We cannot make processors much faster or much bigger (a silicon manufacturing process limitation)
● Moore’s law states that the number of transistors per chip will continue to increase
More transistors are great, but at clock speeds above ~4GHz chips tend to overheat
● Instead, modern chips contain an increasing number of processing cores
These are known as Multicore Chips
Beyond conventional computing: Scaling up
The performance of our data processing systems can be improved by
increasing the processing, storage and I/O of a single machine. This is
known as scaling up or vertical scaling.
[Figure: concurrent vs parallel execution of tasks on processing units]
Parallelism
● Parallelism means that an application divides a task into smaller sub-tasks.
● The sub-tasks can be processed in parallel, for instance on multiple CPUs at the exact same time.
● The application must run more than one thread, and each thread must run on a separate CPU, CPU core, or graphics card GPU core (a minimal sketch follows the figure below).
[Figure: sub-tasks running as separate threads on separate processing units]
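A minimal sketch of parallelism in Python, assuming a trivial squaring function as the sub-task; multiprocessing.Pool spreads the sub-tasks across worker processes:

from multiprocessing import Pool

def square(n: int) -> int:  # the sub-task each worker executes
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # four workers, ideally on four cores
        results = pool.map(square, range(10))  # sub-tasks run in parallel
    print(results)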
Challenges of Parallel Processing
Several challenges hinder the efficient use of parallel processors (the goal is high resource utilisation):
● Many algorithms are hard to divide into subtasks (or cannot be divided at all → out of luck).
Some problem areas are much easier to parallelise than others
● Subtasks might use results from each other, so coordinating the different tasks might be difficult
Lack of proper coordination can result in the task failing or producing wrong results (see the sketch below)
● The communication network is the main BOTTLENECK
The data exchange between the processors can overwhelm the shared network
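A minimal sketch of the coordination problem: several threads updating a shared counter without a lock, a classic race condition (the counter and thread counts are made up for illustration):

import threading

counter = 0

def increment(times: int) -> None:
    global counter
    for _ in range(times):
        counter += 1  # read-modify-write is not atomic, so updates can be lost

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Without a lock this often prints less than the expected 400000
# (exact behaviour varies across Python versions).
print(counter)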
Parallelization Example
Image processing → for instance, sharpening an image
● We have a big photograph (e.g., satellite images, medical images, …).
● We divide it into tiles (square patches) and sharpen each tile in parallel
● Then we adjust the pixels at the edges so that they match up
This is still parallelisable because handling edge pixels can be done as an independent second phase
Also, edge pixels are a small fraction of the total, and changing them does not affect the other pixels
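A minimal sketch of the tiling idea, assuming a NumPy array as the photograph and a trivial brightness tweak standing in for a real sharpening filter (the second, edge-matching phase is omitted):

import numpy as np
from multiprocessing import Pool

def process_tile(tile):
    # Stand-in for a sharpening kernel: any per-tile operation works here.
    return np.clip(tile * 1.2, 0, 255)

def split_into_tiles(image, tile_size):
    h, w = image.shape
    return [image[r:r + tile_size, c:c + tile_size]
            for r in range(0, h, tile_size)
            for c in range(0, w, tile_size)]

if __name__ == "__main__":
    image = np.random.randint(0, 256, (512, 512)).astype(np.float32)
    tiles = split_into_tiles(image, 128)
    with Pool() as pool:
        processed = pool.map(process_tile, tiles)  # each tile handled in parallel
    print(f"{len(processed)} tiles processed")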
Tricky Parallelisation Example
Path search – searching for the best/shortest route between two points A and B, without exhaustive search
● How do we divide the task into smaller ones? Find intermediate points?
● How do the processors select these intermediate points independently?
The processors need to communicate among themselves, but what are the intermediate results?
● How can we guarantee that the solution is the best/shortest one?
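To see why this is hard, here is a minimal sequential Dijkstra sketch (the graph is made up): each step pops the globally closest unvisited node, and that global minimum is a serial dependency that resists naive parallelisation:

import heapq

def dijkstra(graph, start):
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)  # the global minimum: a serial bottleneck
        if d > dist.get(node, float("inf")):
            continue  # stale entry, already improved
        for neighbour, weight in graph[node]:
            new_dist = d + weight
            if new_dist < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_dist
                heapq.heappush(heap, (new_dist, neighbour))
    return dist

graph = {"A": [("B", 2), ("C", 5)], "B": [("C", 1), ("D", 4)],
         "C": [("D", 1)], "D": []}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 2, 'C': 3, 'D': 4}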
Examples of parallelism: Simulation
We often need to simulate physical events:
● The problem: what happens when lightning strikes a plane?
● Planes are too expensive to conduct a large number of tests on them, but we need to ensure they are
safe to fly
● Simulation is a necessity: but it is computationally difficult
There are a lot of similar tasks (nuclear safety, for example) where direct testing is impossible or difficult
[Figure: the Big Data pipeline — Data Sources → Ingestion → Storage → Processing → Output]
Data Sources
A data source is any mechanism that produces data. Examples of data sources include:
● Mobile and web apps
● Websites
● IoT devices
● Databases
● Output of other big data pipelines
Ingestion
The ingestion stage gathers and pre-processes the incoming data from a variety of data sources and makes the data readily usable by the subsequent stages.
• Ingested data can be stored or directly processed
• Data is moved from the original sources across a network, hence communication protocols such as HTTP, MQTT or FTP can be used during ingestion
• Data ingestion can include transformation operations on raw data (a minimal sketch follows)
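A minimal ingestion sketch using only the standard library, assuming a hypothetical HTTP endpoint that serves newline-delimited JSON records; it pulls each record, applies a small transformation, and writes it out for the storage stage:

import json
import urllib.request

SOURCE_URL = "https://example.com/events"  # hypothetical data source

def ingest(url, out_path):
    with urllib.request.urlopen(url) as response, open(out_path, "w") as out:
        for raw_line in response:  # records arrive over HTTP, one per line
            record = json.loads(raw_line)
            record["source"] = url  # example transformation: tag provenance
            out.write(json.dumps(record) + "\n")

ingest(SOURCE_URL, "ingested_events.jsonl")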
Storage
After ingestion, data are usually stored on the target processing nodes for future processing.
• Alternatively, due to the high volumes of data, distributed storage solutions are used
• Distributed storage needs to provide flexibility and fast retrieval of high volumes of data (see the sketch below)
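One idea behind fast retrieval in distributed storage, sketched minimally: hash each record's key to pick its home node, so any record can be located without a scan (the node names are made up):

import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]

def node_for(key):
    # The hash of the key deterministically picks the storage node.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for key in ["user:42", "user:43", "sensor:7"]:
    print(key, "->", node_for(key))  # same key always maps to the same node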
Processing
During weeks 2 to 5, we will cover Apache Hadoop and Spark, Big Data solutions that consist of:
• Processing capabilities (based on the MapReduce programming model)
• Hadoop Framework (YARN scheduler, Hadoop Distributed File System (HDFS), ...)
• Spark Framework (In-memory processing)
• Spark Programming
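As a taste of the MapReduce programming model, here is the classic word count written with MRJob (listed above under module resources): mappers emit (word, 1) pairs and the reducer sums the counts per word:

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum all the 1s emitted for this word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Saved as word_count.py, it can be run locally with: python word_count.py input.txt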
Week 5: Ingestion, Storage and Reliability
The remainder of the module focuses on further Big Data processing stages, namely
• Weeks 9 and 10: Stream processing (ingestion and processing of stream data)
• Week 11: Graph processing (graph databases, large-scale graph processing)