
Introduction to

Data Science

Solomon Teferra Abate


@ SIS, AAU
Topics of Data Engineering

3.1. Preparing data


3.2. Crawling/Collecting data
from different sources using Python and R.
3.3. Introduction to Parallel Computing,
Map Reduce and Hadoop EcoSystem
3.4. Data Driven Application Design
and Deployment
Data Preparation
• Data analytics requires collecting a huge amount of data
(available in data warehouses, databases, or elsewhere) to
achieve the intended objective.
– Data analytics (using data mining and machine learning) starts by
understanding the business or problem domain in order to gain
the business knowledge
• Business knowledge guides the process towards useful results,
and enables the recognition of those results that are useful.
– Based on the business knowledge data related to the business
problem are identified from the data source
– Once we collect the data, the next task is data understanding,
where we need to understand well the type of data we are using
for analysis and to identify the problems observed within the
data.
• Before feeding data to data analytics, we have to make sure of
the quality of the data.
4
Data Quality Measures
• The following are a set of multidimensional data quality
measures:
– Accuracy (free from errors and outliers)
– Completeness (no missing attributes and values)
– Consistency (no inconsistent values and attributes)
– Timeliness (appropriateness of the data for the purpose it is
required)
– Reliability (acceptability)- believability
– Interpret-ability (easy to understand)
• Most of the data in the real world is of poor quality:
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
5
Why low quality Data?
• Collecting the required data is challenging
– Real-life data is heterogeneous
– Data sources are distributed, and so is the nature of the data
– As a result, real-world data is low in quality

• There are different factors that affect the quality of data


– You didn’t collect it yourself
– It was probably created for some other use, and then you came
along wanting to integrate it
– People make mistakes (typos)
– People are busy (“this is good enough”) and rarely organize data
systematically using structured formats

6
Why low quality Data?
Technical determinants
• Lack of guidelines to fill out the data sources and reporting forms
• Data collection and reporting forms are not standardized
• Complex design of data collection and reporting tools
Behavioral determinants
• Personnel not trained in the use of data sources & reporting forms
• Misunderstanding of how to compile data, use tally sheets, and prepare
reports
• Math errors occur during data consolidation from data sources, affecting
report preparation
Organizational determinants
• Lack of a reviewing process before report submission to the next level
• Organization incentivizes reporting high performance
• Absence of culture of information use

7
Problems with data
• Some data have problems of their own that need to be
cleaned:
– Outliers: misleading data that do not fit most of the data/facts
– Missing data: attribute values may be absent and need to be
replaced with estimates
– Irrelevant data: attributes in the database that are not of
interest to the task at hand
– Noisy data: attribute values that may be invalid or incorrect, e.g.
typographical errors
– Inconsistent data, duplicate data, etc.
• Other data are problematic only when we integrate them
– Everyone has their own way of structuring and formatting data,
based on what is convenient for them
– Such differences arise because the data come from different sources
• Producing good-quality data requires passing through
different data pre-processing tasks
8
Major Tasks in Data Pre-processing
• Data cleaning: to get rid of bad data
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies

• Data integration
– Integration of data from multiple sources, such as databases, data
warehouses, or files

• Data reduction
– obtains a reduced representation of the data set that is much
smaller in volume, yet produces almost the same or similar
results.
o Dimensionality reduction
o Numerosity/size reduction
o Data compression

• Data transformation
– Normalization
– Discretization
9
Data Cleaning: Redundancy
• Duplicate or redundant data are data problems that require data
cleaning

• What’s wrong here?


ID Name City State

1 Ministry of Transportation Addis Ababa Addis Ababa

2 Ministry of Finance Addis Ababa Addis Ababa


3 Ministry of Finance Addis Ababa Addis Ababa

• How to clean it: manually or automatically?

10
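As a minimal sketch of automatic cleaning, the pandas snippet below flags and drops records that repeat the same Name/City/State even though their IDs differ, mirroring the Ministry of Finance example above. The DataFrame contents are hypothetical, and drop_duplicates is only one possible approach.

```python
import pandas as pd

# Hypothetical records following the slide's example table
df = pd.DataFrame({
    "ID":    [1, 2, 3],
    "Name":  ["Ministry of Transportation", "Ministry of Finance", "Ministry of Finance"],
    "City":  ["Addis Ababa", "Addis Ababa", "Addis Ababa"],
    "State": ["Addis Ababa", "Addis Ababa", "Addis Ababa"],
})

# Flag rows that repeat the same Name/City/State (ignoring the surrogate ID)
dupes = df[df.duplicated(subset=["Name", "City", "State"], keep="first")]
print(dupes)

# Automatic cleaning: keep the first occurrence, drop the rest
clean = df.drop_duplicates(subset=["Name", "City", "State"], keep="first")
print(clean)
```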
Data Cleaning: Incomplete Data
• The dataset may lack certain attributes of interest
– Is it enough to have a patient demographic profile and the
address of a region to predict the vulnerability (or exposure) of a
given region to an outbreak?

• The dataset may contain only aggregate data, e.g., a traffic
police car accident report:
– this many accidents occurred on this day in this region

No. of accidents  Date  Address
3 Oct 23, 2012 Addis Ababa

2 Oct 12, 2011 Amhara region

11
Data Cleaning: Missing Data
• Data is not always available, lacking attribute values. E.g.,
Occupation=“ ”
 many tuples have no recorded value for several attributes, such
as customer income in sales data
ID Name City State
1 Ministry of Transportation Addis Ababa Addis Ababa
2 Ministry of Finance ? Addis Ababa
3 Office of Foreign Affairs Addis Ababa Addis Ababa


Missing data may be due to:

– values being inconsistent with other recorded data and thus deleted

– data not being entered due to misunderstanding, or not being considered
important at the time of entry

– failure to register history or changes of the data
12
Data Cleaning: Missing Data
There are different methods for treating missing values.
• Ignore the tuple (record) with the missing value: this is
usually done when the class label is missing (assuming the
task is classification).
– not effective when the percentage of missing values per
attribute varies considerably
• Fill in the missing value manually: this method is
tedious, time-consuming, and often infeasible.
• Use a global constant to fill in the missing value: e.g., a
label such as “unknown” or a new class.
• Use the attribute’s mean or mode to fill in the
missing value: replace the missing values with the
attribute’s mean (for numeric attributes) or mode (most
frequent value, for nominal attributes).
– Use the most probable value to fill in the missing value automatically
• calculated, say, using the Expectation Maximization (EM) algorithm
13
Example: Missing Values Handling Method

Attribute Name    Data type   Handling method

Sex               Nominal     Replace by the mode value.
Age               Numeric     Replace by the mean value.
Religion          Nominal     Replace by the mode value.
Height            Numeric     Replace by the mean value.
Marital status    Nominal     Replace by the mode value.
Job               Nominal     Replace by the mode value.
Weight            Numeric     Replace by the mean value.
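A minimal pandas sketch of the rule in the table above, on a small hypothetical DataFrame: numeric attributes are filled with the mean and nominal attributes with the mode.

```python
import pandas as pd

# Hypothetical records with missing values; attribute names taken from the table above
df = pd.DataFrame({
    "Sex":    ["F", "M", None, "M"],
    "Age":    [25, None, 40, 31],
    "Weight": [55.0, 70.0, None, 64.0],
    "Job":    ["teacher", None, "nurse", "teacher"],
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())      # numeric attribute -> mean
    else:
        df[col] = df[col].fillna(df[col].mode()[0])   # nominal attribute -> mode
print(df)
```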
Predict missing value using EM
• Solves estimation with incomplete data:
– obtain an initial estimate for the parameter using the mean value,
– use the estimate to calculate a value for the missing data, and
– continue the process iteratively until convergence (|μi+1 − μi| ≤ θ).
• E.g., out of six data items, the known values are {1, 5, 10, 4};
estimate the two missing data items.
– Let EM converge when two successive estimates differ by at most 0.05,
and let our initial guess of the two missing values be 3.
• The algorithm stops when the last two estimates are only about
0.05 apart.
• Thus, our estimate for the two missing items is about 4.97.
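The toy loop below reproduces this EM-style estimate in plain Python: the two missing items start at 3 and are repeatedly replaced by the recomputed mean until two successive estimates differ by at most 0.05. Variable names are illustrative.

```python
known = [1, 5, 10, 4]     # observed values from the example above
estimate = 3.0            # initial guess for each of the two missing items
theta = 0.05              # convergence threshold

while True:
    # Recompute the mean of all six items using the current estimate
    new_estimate = (sum(known) + 2 * estimate) / 6
    if abs(new_estimate - estimate) <= theta:   # stop when estimates differ by <= 0.05
        estimate = new_estimate
        break
    estimate = new_estimate

print(round(estimate, 2))  # converges near 4.98, in line with the slide's ~4.97
```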
Data Cleaning: Noisy Data
• Noisy: containing noise, errors, or outliers
– Noise: random error or variance in a measured variable
– e.g., Salary=“−10” (an error)
• Typographical errors are errors that corrupt data
• Say ‘green’ is written as ‘rgeen’

• Incorrect attribute values may be due to


– faulty data collection instruments (e.g.: OCR)
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention

19
Data Cleaning: How to catch Noisy Data
• Manually check all data: tedious and often infeasible
• Sort data by frequency
– ‘green’ is more frequent than ‘rgeen’
– Works well for categorical data

• Use, say, numerical constraints to catch corrupt data


• Weight can’t be negative
• People can’t have more than 2 parents
• Salary can’t be less than Birr 300???

• Use statistical techniques to Catch Corrupt Data


– Check for outliers (the case of the 8 meters man)
– Check for correlated outliers using n-gram (“pregnant male”)
• People can be male
• People can be pregnant
• People can’t be male AND pregnant
20
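A small pandas sketch of the checks above, on hypothetical data: frequency counts expose the rare spelling ‘rgeen’, simple constraints catch a negative weight and a salary below Birr 300, and a domain plausibility rule catches the 8-metre man.

```python
import pandas as pd

# Hypothetical records containing typos and impossible values
df = pd.DataFrame({
    "colour": ["green", "green", "rgeen", "blue", "green"],
    "weight": [62.0, -5.0, 70.0, 80.0, 55.0],     # kg
    "salary": [4500, 250, 3200, 5100, 6000],      # Birr
})

# 1. Sort categorical values by frequency: rare spellings like 'rgeen' stand out
print(df["colour"].value_counts())

# 2. Numerical constraints: weight can't be negative, salary can't be below Birr 300
violations = df[(df["weight"] < 0) | (df["salary"] < 300)]
print(violations)

# 3. Outlier check against domain knowledge (the '8-metre man' case)
heights = pd.Series([1.62, 1.75, 8.00, 1.58])    # metres
print(heights[heights > 2.5])
```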
Data Integration
• Data integration
– combines data from multiple sources (database, data
warehouse, files & sometimes from non-electronic
sources) into a coherent store
– Because of the use of different sources, data that is
fine on its own may become problematic when we
want to integrate it
• Some of the issues are:
– Different formats and structures
– Conflicting and redundant data
– Data at different levels

21
Data Integration: Formats
• Not everyone uses the same format
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world entities from
multiple data sources: e.g., A.cust-id vs B.cust-#
• Dates are especially problematic:
– 12/19/97
– 19/12/97
– 19/12/1997
– 19-12-97
– Dec 19, 1997
– 19 December 1997
– 19th Dec. 1997
• Are you frequently writing money as:
– Birr 200, Br. 200, 200 Birr, …
22
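One possible way to normalize such mixed date formats, assuming the python-dateutil package and the sample strings below; dayfirst=True resolves ambiguous numeric dates in favour of day-first, which fits the 19/12 examples above.

```python
from dateutil import parser

# Hypothetical date strings in several of the formats listed above
raw_dates = ["19/12/97", "19/12/1997", "19-12-97", "Dec 19, 1997", "19 December 1997"]

# Normalize everything to ISO 8601 (YYYY-MM-DD)
iso_dates = [parser.parse(d, dayfirst=True).date().isoformat() for d in raw_dates]
print(iso_dates)   # every entry becomes '1997-12-19'
```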
Data Integration: Inconsistent
• Inconsistent data: containing discrepancies in codes or
names, which is also the problem of lack of standardization /
naming conventions. e.g.,
– Age=“26” vs. Birthday=“03/07/1986”
– Some use “1,2,3” for rating; others “A, B, C”

• Discrepancy between duplicate records


ID  Name                        City         State
1   Ministry of Transportation  Addis Ababa  Addis Ababa region
2   Ministry of Finance         Addis Ababa  Addis Ababa administration
3   Office of Foreign Affairs   Addis Ababa  Addis Ababa regional administration

23
Data Integration: Inconsistent
Attribute name   Current values                       New value
Job status       “no work”, “job less”, “Jobless”     Unemployed
Marital status   “not married”, “single”              Unmarried
Education level  “uneducated”, “no education level”   Illiterate
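A minimal sketch of applying a mapping table like the one above with pandas; the DataFrame rows and the extra values ("employed", "married", "degree") are hypothetical.

```python
import pandas as pd

# Mappings mirroring the table above: several spellings collapse to one standard value
job_map     = {"no work": "Unemployed", "job less": "Unemployed", "Jobless": "Unemployed"}
marital_map = {"not married": "Unmarried", "single": "Unmarried"}
edu_map     = {"uneducated": "Illiterate", "no education level": "Illiterate"}

df = pd.DataFrame({
    "Job status":      ["no work", "Jobless", "employed"],
    "Marital status":  ["single", "not married", "married"],
    "Education level": ["uneducated", "degree", "no education level"],
})

df["Job status"]      = df["Job status"].replace(job_map)
df["Marital status"]  = df["Marital status"].replace(marital_map)
df["Education level"] = df["Education level"].replace(edu_map)
print(df)
```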
Data Integration: different structure
What’s wrong here? No data type constraints

ID     Name                        City         State
1234   Ministry of Transportation  Addis Ababa  AA
GCR34  Ministry of Finance         Addis Ababa  AA

Name                       ID     City         State
Office of Foreign Affairs  GCR34  Addis Ababa  AA
25
Data Integration: Data that Moves
• Be careful of taking snapshots of a moving target
• Example: Let’s say you want to store the price of a shoe in
Ethiopia and the price of a shoe in Kenya. Can we use the same
currency (say, US$) or each country’s currency?
– You can’t store it all in the same currency (say, US$) because the
exchange rate changes frequently
– The price in the foreign currency stays the same
– You must keep the data in the foreign currency and use the current
exchange rate to convert
• The same applies to ‘Age’ (store the birth date and compute the age when needed)

26
Data at different level of detail than needed
• If it is at a finer level of detail, you can sometimes bin it
• Example
– I need age ranges of 20-30, 30-40, 40-50, etc.
– Imported data contains birth date
– No problem! Divide data into appropriate categories

• Sometimes you cannot bin it


• Example
– I need age ranges 20-30, 30-40, 40-50 etc.
– Data is of age ranges 25-35, 35-45, etc.
– What to do?
• Ignore age ranges because you aren’t sure
• Make educated guess based on imported data (e.g.,
assume that # people of age 25-35 are average # of people
of age 20-30 & 30-40)

27
Data Integration: Conflicting Data
• Detecting and resolving data value conflicts
–For the same real world entity, attribute values from
different sources are different
–Possible reasons: different representations, different scales,
e.g., American vs. British units
• weight measurement: KG or pound
• Height measurement: meter or inch
• Information source #1 says that Alex lives in Bahirdar
– Information source #2 says that Alex lives in Mekele
• What to do?
– Use both (He lives in both places)
– Use the most recently updated piece of information
– Use the “most trusted” information
– Flag row to be investigated further by hand

28
Handling Redundant Data
• Redundant data occur often when integration of multiple
databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue, age
• Redundant attributes may be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve speed and quality of the analysis
Example: Co-Variance
• Suppose two stocks A and B have the following values in one
week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry
trends, will their prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4

– E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6

– Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

• Thus, A and B rise together since Cov(A, B) > 0.

30
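The same calculation in a few lines of plain Python. Note that this is the population covariance, dividing by n as on the slide; numpy.cov divides by n-1 by default.

```python
# Weekly prices of stocks A and B from the example above
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]

n = len(A)
mean_a = sum(A) / n                        # E(A) = 4
mean_b = sum(B) / n                        # E(B) = 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b   # E(AB) - E(A)E(B)
print(cov)                                 # 4.0 -> positive, so A and B rise together
```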
Data Reduction Strategies
• Why data reduction?
–A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the
complete data set.
• Data reduction: obtains a reduced representation of the
data set that is much smaller in volume but yet produces
the same (or almost the same) analytical results

• Data reduction strategies


– Dimensionality reduction,
• Select best attributes or remove unimportant attributes
– Numerosity reduction
• Reduce data volume by choosing alternative, smaller forms of
data representation
– Data compression

31
Data Reduction: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse

• Dimensionality reduction
– Helps to eliminate irrelevant attributes that contain no information
useful for the data analysis task at hand, and to reduce noise
• E.g., is a student’s ID relevant to predicting the student’s GPA?
– Helps to avoid redundant attributes, which contain duplicate
information found in one or more other attributes
• E.g., the purchase price of a product & the amount of sales tax paid
– Reduce time and space required in data mining
– Allow easier visualization

• Method: attribute subset selection


– One method of reducing the dimensionality of data is selecting
the best attributes
32
Heuristic Search in Attribute Selection
Commonly used heuristic attribute selection methods:
– Best step-wise attribute selection:
• Start with empty set of attributes, {}
• The best single-attribute is picked first, {Ai}
• Then combine best attribute with the remaining to select the best combined two
attributes, {AiAj}, then three attributes {AiAjAk},…
• The process continues until the performance of the combined attributes starts to
decline
– Example: Given ABCDE attributes, we can start with {}, and then compare and
select attribute with best accuracy, say {B}. Then combine it with others, {[BA]
[BC][BD][BE]} & compare and select those with best accuracy, say {BD}, then
combine with the rest, {[BDA][BDC][BDE]}, select those with best accuracy or
ignore if accuracy start decreasing

– Step-wise attribute elimination:

• Start with the full set of attributes
• Eliminate the worst-performing attribute
• Repeat the process as long as the performance of the remaining attributes
improves
– Example: Given attributes ABCDE, we can start with {ABCDE}, then compare
the accuracy of the (n-1)-attribute subsets {[ABCD][ABCE][ABDE][ACDE][BCDE]}.
If {ABCE} performed best, drop attribute {D}. Again compare the accuracy of the
(n-2)-attribute subsets {[ABC][ABE][ACE][BCE]} and, based on accuracy, drop the
attribute that hurts accuracy.
33
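A rough sketch of best step-wise (forward) selection, assuming scikit-learn, the iris dataset as a stand-in, and a decision tree with 5-fold cross-validated accuracy as the evaluation criterion; the stopping rule is "add attributes until accuracy stops improving", as described above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_attrs = X.shape[1]

def accuracy(attrs):
    """Cross-validated accuracy using only the selected attribute columns."""
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, attrs], y, cv=5).mean()

selected, remaining, best_score = [], list(range(n_attrs)), 0.0
while remaining:
    # Try adding each remaining attribute and keep the best single addition
    candidate = max(remaining, key=lambda a: accuracy(selected + [a]))
    score = accuracy(selected + [candidate])
    if score <= best_score:        # stop when performance no longer improves
        break
    selected.append(candidate)
    remaining.remove(candidate)
    best_score = score

print("Selected attribute indices:", selected, "accuracy:", round(best_score, 3))
```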
Data Reduction: Numerosity Reduction
• Different methods can be used, including Clustering and sampling
• Clustering
– Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
– There are many choices of clustering definitions and clustering
algorithms
• Sampling
– Obtain a small sample s to represent the whole data set N
– Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
– Key principle: Choose a representative subset of the data using
suitable sampling technique
• A sample may be smaller than, equal to, or larger than a class’s
population. For instance:
– Total instances of class 1 = 10,000 and class 2 = 1,000. Suppose a
sample of 5,000 records is required, with 3,000 from class 1 and
2,000 from class 2. How to select the samples? Apply sampling
without replacement for class 1 and with replacement for class 2
(since 2,000 > 1,000).
35
Simple Random Sampling: With or without Replacement

(Figure: the raw data sampled by SRSWOR, simple random sampling without
replacement, and by SRSWR, simple random sampling with replacement.)

36
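A small pandas sketch of the class-wise sampling scenario above (hypothetical data): class 1 can be sampled without replacement (SRSWOR), while class 2 needs replacement (SRSWR) because 2,000 samples are requested from only 1,000 records.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 10,000 class-1 rows and 1,000 class-2 rows
df = pd.concat([
    pd.DataFrame({"cls": [1] * 10_000, "x": range(10_000)}),
    pd.DataFrame({"cls": [2] * 1_000,  "x": range(1_000)}),
])

# Class 1: 3,000 of 10,000 -> SRSWOR (without replacement) is enough
class1 = df[df.cls == 1].sample(n=3_000, replace=False, random_state=0)

# Class 2: 2,000 needed but only 1,000 exist -> SRSWR (with replacement) is required
class2 = df[df.cls == 2].sample(n=2_000, replace=True, random_state=0)

sample = pd.concat([class1, class2])
print(sample["cls"].value_counts())
```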
Data Transformation
• A function that maps the entire set of values of a
given attribute to a new set of replacement values
such that each old value can be identified with one
of the new values
• Methods for data transformation
– Normalization
– Discretization
– Generalization: Concept hierarchy climbing

37
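As an illustration of normalization, a minimal min-max scaling function (the income values are made up); discretization is illustrated by the binning example that follows.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map each value of an attribute into the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

incomes = [12_000, 35_000, 47_000, 98_000]      # hypothetical attribute values
print(min_max_normalize(incomes))               # 12,000 -> 0.0 ... 98,000 -> 1.0
```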
Simple Discretization: Binning
• Equal-width (distance) partitioning
–Divides the range into N intervals of equal size (uniform grid)
–if A and B are the lowest and highest values of the attribute, the
width of intervals for N bins will be:
W = (B − A)/N
– This is the most straightforward approach, but outliers may dominate
the presentation
• Skewed data is not handled well

• Equal-depth (frequency) partitioning


–Divides the range into N bins, each containing approximately
same number of samples
–Good data scaling
–Managing categorical attributes can be tricky

38
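A short pandas sketch contrasting the two partitioning schemes on a small hypothetical age list: pd.cut gives equal-width bins (stretched by the outlier 70), while pd.qcut gives equal-depth bins with roughly the same number of samples each.

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 70])  # skewed by 70

# Equal-width: N = 4 bins, each of width W = (B - A)/N; the outlier stretches the bins
equal_width = pd.cut(ages, bins=4)
print(equal_width.value_counts().sort_index())

# Equal-depth (frequency): 4 bins with roughly the same number of samples each
equal_depth = pd.qcut(ages, q=4)
print(equal_depth.value_counts().sort_index())
```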
Topics of Data Engineering

3.1. Preparing data


3.2. Crawling/Scraping/Collecting data
from different sources using Python and R.
3.3. Introduction to Parallel Computing,
Map Reduce and Hadoop EcoSystem
3.4. Data Driven Application Design
and Deployment
Data Crawling/Scraping/Collecting
• Read on: Data Crawling, Data Scraping, and Data Collection
• Describe their relation and differences
• Why should we do Data Crawling, Data Scraping, and Data
Collection?

40
Data Crawling
• Data crawling is the process of finding and downloading web
pages or documents from the web.
• For example, you might want to crawl the entire web or a
specific domain to find relevant information for a search
engine or a web scraper.
• Data crawling can be done by using a program or a bot that
can follow the links and URLs of the web pages, and store
them in a database or a file.
• Data crawling can be useful for discovering new or updated
data sources, or for creating a web archive.

41
Data Scraping
• Data scraping is the process of extracting specific data from a
web page or a document.
• For example, you might want to scrape the names and prices
of products from an e-commerce site, or the ratings and
reviews of movies from a streaming platform.
• Data scraping can be done manually, by copying and pasting
the data, or automatically, by using a script or a tool that can
parse the HTML or XML code of the web page.
• Data scraping can be useful for collecting data for analysis,
research, or comparison.

42
Challenges of Data Scraping/Crawling
• Data scraping and data crawling can be subject to a variety of
challenges:

legal and ethical issues,

technical difficulties, and

quality issues.
• It's important to respect the data owner's rights and permissions, and
avoid any violations of the law.
• Some webpages or documents may have dynamic, complex, or
encrypted content that can make data scraping or crawling difficult or
impossible.
• To overcome these challenges, you may need to use advanced
techniques, such as browser automation, proxies, or APIs.
Additionally, some webpages or documents may have inaccurate,
incomplete, or outdated data that can affect the reliability and validity
of your results.
• To ensure quality data, you may need to use data cleaning, validation,
or verification methods.
43
Tools for Data Scraping/Crawling
• Data scraping and crawling can be done using a variety of tools and
frameworks, depending on the specific needs and preferences.
• Scrapy, for example, is a Python framework for building data
crawling and scraping applications that can handle large-scale and
concurrent data extraction
• BeautifulSoup is a Python library for parsing and extracting data
from HTML and XML documents
• Selenium is a framework for automating web browsers that can be
used to scrape or crawl dynamic or interactive content.
• Requests is a Python library for sending and receiving HTTP
requests that can be used to scrape or crawl static or simple
content.
• These tools support various features, depending on the tool, such as
pipelines, middleware, spiders, selectors, and items (Scrapy), or
headers, cookies, parameters, authentication, and sessions (Requests).
44
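A hedged example of scraping product names and prices with Requests and BeautifulSoup: the URL, the "product" and "price" CSS classes, and the h2/span structure are assumptions about the target page, not a real site's markup.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target: an e-commerce page whose products sit in <div class="product"> blocks
url = "https://example.com/products"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
for product in soup.find_all("div", class_="product"):      # class names are assumptions
    name = product.find("h2").get_text(strip=True)
    price = product.find("span", class_="price").get_text(strip=True)
    print(name, price)
```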
Crawling the Web
Web pages
• Few thousand characters long
• Served through the internet using the hypertext transport
protocol (HTTP)
• Viewed at the client end using ‘browsers’
Crawler
• To fetch the pages to the computer
• At the computer
 Automatic programs can analyze hypertext documents

45
HTML:HyperText Markup Language
 Lets the author
• specify layout and typeface
• embed diagrams
• create hyperlinks.
 A hyperlink is expressed as an anchor tag with an HREF attribute
 HREF names another page using a Uniform Resource Locator
(URL),
• URL =
 protocol field (“HTTP”) +
 a server hostname (“www.cse.iitb.ac.in”) +
 file path (/, the `root' of the published file system).

46
HTTP(hypertext transport protocol)
 Built on top of the Transport Control Protocol (TCP)
 Steps(from client end)
• resolve the server host name to an Internet address (IP)
 Use Domain Name Server (DNS)
 DNS is a distributed database of name-to-IP mappings maintained at a set
of known servers
• contact the server using TCP
 connect to default HTTP port (80) on the server.
 Enter the HTTP requests header (E.g.: GET)
 Fetch the response header
– MIME (Multipurpose Internet Mail Extensions)
– A meta-data standard for email and Web content transfer
 Fetch the HTML page

47
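The steps above can be traced with Python's standard socket module; example.com and the request path are placeholders.

```python
import socket

host = "example.com"                        # placeholder host
ip = socket.gethostbyname(host)             # 1. DNS: resolve host name to an IP address
sock = socket.create_connection((ip, 80))   # 2. TCP: connect to the default HTTP port 80

# 3. Send the HTTP request header (GET)
request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
sock.sendall(request.encode())

# 4. Fetch the response header (including the MIME Content-Type) and the HTML page
response = b""
while chunk := sock.recv(4096):
    response += chunk
sock.close()

header, _, body = response.partition(b"\r\n\r\n")
print(header.decode())
```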
Crawl “all” Web pages?
 Problem: no catalog of all accessible URLs on the
Web.
 Solution:
• start from a given set of URLs
• Progressively fetch and scan them for new outlinking
URLs
• fetch these pages in turn…..
• Submit the text in page to a text indexing system
• and so on……….

48
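A minimal crawler sketch following this procedure, assuming Requests and BeautifulSoup; the seed URL and the 20-page cap are arbitrary, and the fetched dict stands in for handing pages to a text indexing system.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

frontier = ["https://example.com/"]            # hypothetical seed set of URLs
seen, fetched, max_pages = set(frontier), {}, 20

while frontier and len(fetched) < max_pages:
    url = frontier.pop(0)                      # take the next URL to fetch
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    fetched[url] = html                        # hand the page text to an indexer here

    # Scan the page for new out-linking URLs and add unseen ones to the frontier
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"])
        if link.startswith("http") and link not in seen:
            seen.add(link)
            frontier.append(link)

print(f"Fetched {len(fetched)} pages")
```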
Crawling procedure
 Simple
• Great deal of engineering goes into industry-strength
crawlers
• Industry crawlers crawl a substantial fraction of the Web
• E.g.: Alta Vista, Northern Lights, Inktomi
 No guarantee that all accessible Web pages will be
located in this fashion
 Crawler may never halt
• pages will be added continually even as it is running.

49
Crawling overheads
 Delays involved in
• Resolving the host name in the URL to an IP address
using DNS
• Connecting a socket to the server and sending the
request
• Receiving the requested page in response
 Solution: Overlap the above delays by
• fetching many pages at the same time

50
Anatomy of a crawler.
 Page fetching threads
• Starts with DNS resolution
• Finishes when the entire page has been fetched
 Each page
• stored in compressed form to disk/tape
• scanned for outlinks
 Work pool of outlinks
• maintain network utilization without overloading it
 Dealt with by load manager
 Continue till the crawler has collected a sufficient
number of pages.

51
Anatomy of a crawler (diagram).

52
Collection of data that does not exist
• Survey

Survey Method: Person, Paper, or Online

Survey Question Types:

Multiple-choice-type questions

Rank-order-type questions

Rating or open-ended questions (eg. Likert Scale)

Dichotomous (closed-ended) questions: Yes/No questions

Tools:

Microsoft’s Office 365,

SmartSurvey,

Google Forms,

SurveyMonkey

Pros and Cons of Surveys

53
Collection of data that does not exist
• Interviews and Focus Groups

Why interviews and/or focus groups

Procedures:

Agreement

Ice breaker

Honest opinion

Plan

Analyzing Interview Data

Pros and Cons of Interviews and Focus Groups
• Log and Diary Data
• Analysis Methods:

Quantitative

Qualitative

Mixed
54
Topics of Data Engineering

3.1. Preparing data


3.2. Crawling data from different sources
using Python and R.
3.3. Introduction to Parallel Computing,
Map Reduce and Hadoop EcoSystem
3.4. Data Driven Application Design
and Deployment
Parallel Computing
• Parallel vs Serial Computing
• What is Parallel Computing (PC)?
• Why do we need to use PC?

56
Serial Computing
• A problem is broken into a discrete series of instructions
• Instructions are executed sequentially one after another
• Executed on a single processor
• Only one instruction may execute at any moment in time

This was a huge waste of hardware resources and time

only one part of the hardware is active for a particular
instruction at any point in time

As problem statements got heavier and bulkier, so did the
amount of time needed to execute them.
• Examples of processors are Pentium 3 and Pentium 4.

57
Parallel Computing
• Parallel computing is the simultaneous use of multiple
computer resources to solve a computational problem

A problem is broken into discrete parts that can be solved
concurrently

Each part is further broken down to a series of instructions

Instructions from each part execute simultaneously on different
processors

An overall control/coordination mechanism is employed

58
Why Parallel Computing
• Save time and/or money

Many resources working together will reduce the time and
cut potential costs.
• Solve complex problems

The problem will be broken into a number of simpler ones
• Create collaborative work environment

Collaborators can solve a specific problem that is part of the bigger
problem while working together with the others

59
Types of Parallel Computing
• Bit-level:

It increases the processor word size

Consider a scenario where an 8-bit processor must compute the sum of
two 16-bit integers. It must first sum up the 8 lower-order bits, then add
the 8 higher-order bits. A 16-bit processor can perform the operation
with just one instruction.
• Instruction-level:

Without instruction-level parallelism, a processor issues at most one
instruction per clock cycle. Instructions can be re-ordered and grouped
so that they execute concurrently without affecting the result of the
program.
• Task Parallelism:

Task parallelism employs the decomposition of a task into subtasks and
then allocating each of the subtasks for execution.
• Data-level:

It is parallelization across multiple processors focusing on distributing
the data. It can be applied on regular data structures like arrays and
matrices by working on each element in parallel
60
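A small data-level parallelism sketch using Python's multiprocessing.Pool: the same operation is applied to each element of an array, split across four worker processes. The function and data are illustrative.

```python
from multiprocessing import Pool

def square(x):
    # The same operation is applied independently to each data element
    return x * x

if __name__ == "__main__":
    data = list(range(1_000_000))
    with Pool(processes=4) as pool:            # 4 worker processes share the work
        results = pool.map(square, data, chunksize=10_000)
    print(results[:5], "...", results[-1])
```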
MapReduce for Parallel Computing
• MapReduce refers to two separate and distinct tasks that Hadoop programs
perform.

Map job: takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key/value
pairs).

Reduce job: takes the output from a map as input and combines those
data tuples into a smaller set of tuples.

As the sequence of the name MapReduce implies, the reduce job is
always performed after the map job.
• MapReduce programming offers several benefits:

Scalability: Businesses can process petabytes of data stored in the
Hadoop Distributed File System (HDFS).

Flexibility: Hadoop enables easier access to multiple sources of data and
multiple types of data.

Speed: With parallel processing and minimal data movement, Hadoop
offers fast processing of massive amounts of data.

Simple: Developers can write code in a choice of languages, including
Java, C++ and Python.
61
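A single-machine sketch of the MapReduce idea in plain Python (not actual Hadoop): a map phase emits (word, 1) key/value pairs, a shuffle groups them by key, and a reduce phase combines each group into a smaller set of tuples.

```python
from collections import defaultdict
from itertools import chain

documents = ["data science is fun",
             "big data needs parallel computing",
             "hadoop stores big data"]

# Map job: each document is converted into (word, 1) key/value pairs
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = list(chain.from_iterable(map_phase(d) for d in documents))

# Shuffle: group the values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce job: combine the tuples for each key into a smaller set of tuples
reduced = {word: sum(counts) for word, counts in groups.items()}
print(reduced)   # e.g. {'data': 3, 'big': 2, ...}
```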
Topics of Data Engineering

3.1. Preparing data


3.2. Crawling data from different sources
using Python and R.
3.3. Introduction to Parallel Computing,
Map Reduce and Hadoop EcoSystem
3.4. Data Driven Application Design
and Deployment
Data Driven Application
• Data-driven apps operate on a diverse set of data

Multi-modal: spatial, documents, sensor, transactional, etc.

From multiple different sources

Often in real-time
• They may use Machine Learning in real time to:

make recommendations to customers

detect fraudulent transactions
• They may use Graph analytics:

to identify influencers in a community and target them with
specific promotions
• They may use spatial data to keep track of deliveries
• They may use IoT streaming data or Blockchain data to create
real-time information/services

63
Data Driven Application
• When building data-driven apps, we need to leverage a lot of
data processing and machine learning algorithms
These apps are frequently deployed on multiple platforms,
including mobile devices as well as standard web browsers
 They need a flexible, scalable and reliable deployment
platform
Given the demands on these apps, they need to be
continuously developed to adapt to new use cases or user
needs, and all updates must happen online as they have to be
available 24×7.

64
