Data Science
Why Low-Quality Data?
• Collecting the required data is challenging
  – Real-life data is heterogeneous
  – Data sources are distributed, and so is the nature of the data
• As a result, real-world data is often low in quality
Why Low-Quality Data?
Technical determinants
• Lack of guidelines for filling out data sources and reporting forms
• Data collection and reporting forms are not standardized
• Complex design of data collection and reporting tools
Behavioral determinants
• Personnel not trained in the use of data sources and reporting forms
• Misunderstanding of how to compile data, use tally sheets, and prepare reports
• Math errors during data consolidation from data sources, affecting report preparation
Organizational determinants
• Lack of a review process before reports are submitted to the next level
• Organization incentivizes reporting high performance
• Absence of a culture of information use
Problems with Data
• Some data have problems of their own that need to be cleaned:
  – Outliers: misleading data that do not fit most of the data/facts
  – Missing data: attribute values may be absent and need to be replaced with estimates
  – Irrelevant data: attributes in the database that are not of interest to the task at hand
  – Noisy data: attribute values that are invalid or incorrect, e.g., typographical errors
  – Inconsistent data, duplicate data, etc.
• Other data are problematic only when we integrate them
  – Everyone had their own way of structuring and formatting data, based on what was convenient for them
  – Such differences appear when data comes from different sources
• Producing good-quality data therefore requires several data pre-processing tasks
Major Tasks in Data Pre-processing
• Data cleaning: to get rid of bad data
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of data from multiple sources, such as databases, data warehouses, or files
• Data reduction
  – Obtains a reduced representation of the data set that is much smaller in volume, yet produces almost the same results
    o Dimensionality reduction
    o Numerosity/size reduction
    o Data compression
• Data transformation
  – Normalization
  – Discretization
Data Cleaning: Redundancy
• Duplicate or redundant data is a data problem that requires data cleaning
Data Cleaning: Incomplete Data
• The dataset may lack certain attributes of interest
  – Is it enough to have only patients' demographic profiles and their regions' addresses to predict the vulnerability (or exposure) of a given region to an outbreak?
Data Cleaning: Missing Data
• Data is not always available: some attribute values are missing, e.g., Occupation=" "
  – Many tuples have no recorded value for several attributes, such as customer income in sales data

  ID | Name                       | City        | State
  1  | Ministry of Transportation | Addis Ababa | Addis Ababa
  2  | Ministry of Finance        | ?           | Addis Ababa
  3  | Office of Foreign Affairs  | Addis Ababa | Addis Ababa

• Missing data may be due to:
  – values that were inconsistent with other recorded data and thus deleted
  – data not entered due to misunderstanding, or not considered important at the time of entry
  – history or changes of the data not being registered
Data Cleaning: Missing Data
There are different methods for treating missing values (see the imputation sketch after this list).
• Ignore the tuple with the missing value: usually done when the class label is missing (assuming the task is classification).
  – Not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious, time-consuming, and often infeasible for large data sets.
• Use a global constant to fill in the missing value: e.g., replace all missing values with a label such as "unknown".
• Use the attribute's mean or mode to fill in the missing value: replace the missing values with the attribute's mean (for numeric attributes) or mode (most frequent value, for nominal attributes).
• Use the most probable value to fill in the missing value automatically
  – calculated, say, using the Expectation Maximization (EM) algorithm
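A minimal sketch of mean/mode imputation, assuming pandas is available; the toy DataFrame and its column names are purely illustrative:

import pandas as pd

# Hypothetical data: fill numeric attributes with the mean, nominal ones with the mode
df = pd.DataFrame({
    "Age": [25, None, 40, 31],                      # numeric attribute
    "Job": ["nurse", "driver", None, "nurse"],      # nominal attribute
})

df["Age"] = df["Age"].fillna(df["Age"].mean())      # mean of 25, 40, 31 = 32
df["Job"] = df["Job"].fillna(df["Job"].mode()[0])   # most frequent value: "nurse"
print(df)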
Example: Missing Values Handling Method

Attribute Name  | Data type | Handling method
Sex             | Nominal   | Mode (most frequent value)
Age             | Numeric   | Mean
Religion        | Nominal   | Mode (most frequent value)
Height          | Numeric   | Mean
Marital status  | Nominal   | Mode (most frequent value)
Job             | Nominal   | Mode (most frequent value)
Weight          | Numeric   | Mean
Data Cleaning: How to Catch Noisy Data
• Manually check all data: tedious and usually infeasible
• Sort data by frequency (see the sketch below)
  – 'green' is more frequent than 'rgeen', so the rare spelling is likely a typo
  – Works well for categorical data
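A minimal sketch of the frequency-sort idea, using only the Python standard library; the colour values are made up for illustration:

from collections import Counter

colours = ["green", "green", "blue", "green", "rgeen", "blue"]

# Sort values by frequency: rare spellings stand out as likely typos
for value, count in Counter(colours).most_common():
    print(value, count)
# green 3
# blue 2
# rgeen 1   <- rare value, probably a typo of "green"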
Data Integration: Formats
• Not everyone uses the same format
• Schema integration
  – Integrate metadata from different sources
  – Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id vs. B.cust-#
• Dates are especially problematic (see the parsing sketch below):
  – 12/19/97
  – 19/12/97
  – 19/12/1997
  – 19-12-97
  – Dec 19, 1997
  – 19 December 1997
  – 19th Dec. 1997
• Do you always write amounts of money the same way?
  – Birr 200, Br. 200, 200 Birr, …
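A minimal sketch, using only the Python standard library, of normalizing several of the date formats above to ISO 8601; the format list is illustrative, not exhaustive, and ambiguous cases (e.g., 12/19/97 vs. 19/12/97) still need a policy decision:

from datetime import datetime

FORMATS = ["%d/%m/%Y", "%d/%m/%y", "%d-%m-%y", "%m/%d/%y", "%b %d, %Y", "%d %B %Y"]

def normalize_date(text):
    # Try each known format in turn; return an ISO date or None for manual review
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(normalize_date("Dec 19, 1997"))      # 1997-12-19
print(normalize_date("19 December 1997"))  # 1997-12-19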
Data Integration: Inconsistent
• Inconsistent data contain discrepancies in codes or names, often caused by a lack of standardization or naming conventions, e.g.,
  – Age=“26” vs. Birthday=“03/07/1986”
  – Some use “1, 2, 3” for ratings; others use “A, B, C”
Data Integration: Inconsistent

Attribute       | Current values                     | New value
Job status      | “no work”, “job less”, “Jobless”   | Unemployed
Marital status  | “not married”, “single”            | Unmarried
Education level | “uneducated”, “no education level” | Illiterate
Data Integration: Different Structure
• What’s wrong here? No data type constraints
  [Example table with columns ID, Name, City, State omitted]
Data Integration: Data that Moves
• Be careful about taking snapshots of a moving target
• Example: Suppose you want to store the price of a shoe in Ethiopia and the price of a shoe in Kenya. Should we use a common currency (say, US$) or each country’s own currency?
  – You can’t store it all in the same currency (say, US$) because the exchange rate changes frequently
  – The price in the local currency stays the same
  – So keep the data in the local currency and use the current exchange rate to convert when needed
• The same applies to ‘Age’: store the birth date and derive the age when it is needed
Data at a Different Level of Detail than Needed
• If the data is at a finer level of detail, you can sometimes bin it
• Example
  – I need age ranges of 20-30, 30-40, 40-50, etc.
  – The imported data contains birth dates
  – No problem! Compute ages and divide them into the appropriate categories
Data Integration: Conflicting Data
• Detecting and resolving data value conflicts
  – For the same real-world entity, attribute values from different sources differ
  – Possible reasons: different representations, different scales, e.g., metric vs. British units
    • Weight measured in kg or pounds
    • Height measured in meters or inches
• Information source #1 says that Alex lives in Bahirdar
  – Information source #2 says that Alex lives in Mekele
• What to do?
  – Use both (he lives in both places)
  – Use the most recently updated piece of information
  – Use the “most trusted” information
  – Flag the row to be investigated further by hand
Handling Redundant Data
• Redundant data often arise when integrating multiple databases
  – Object identification: the same attribute or object may have different names in different databases
  – Derivable data: one attribute may be “derived” from attributes in another table, e.g., annual revenue, age
• Redundant attributes may be detected by correlation analysis and covariance analysis
• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve the speed and quality of the analysis
Example: Covariance
• Suppose two stocks A and B have the following prices over one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
  – E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
  – E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
  – Cov(A, B) = E(A·B) - E(A)·E(B) = (2·5 + 3·8 + 5·10 + 4·11 + 6·14)/5 - 4 × 9.6 = 42.4 - 38.4 = 4
  – Since Cov(A, B) > 0, the prices of A and B tend to rise and fall together
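The same calculation with NumPy (assumed available), as a quick check:

import numpy as np

a = np.array([2, 3, 5, 4, 6])     # prices of stock A
b = np.array([5, 8, 10, 11, 14])  # prices of stock B

# Population covariance E(AB) - E(A)E(B); bias=True divides by N instead of N - 1
print(np.cov(a, b, bias=True)[0, 1])  # 4.0 -> positive: A and B move together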
Data Reduction Strategies
• Why data reduction?
  – A database or data warehouse may store terabytes of data, so complex data analysis may take a very long time to run on the complete data set
• Data reduction: obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data Reduction: Dimensionality Reduction
• Curse of dimensionality
  – As dimensionality increases, the data becomes increasingly sparse
• Dimensionality reduction
  – Helps to eliminate irrelevant attributes and reduce noise: attributes that contain no information useful for the data analysis task at hand
    • E.g., is a student’s ID relevant to predicting the student’s GPA?
  – Helps to avoid redundant attributes: attributes that duplicate information contained in one or more other attributes
    • E.g., the purchase price of a product & the amount of sales tax paid
  – Reduces the time and space required in data mining
  – Allows easier visualization
[Figure: sampling the raw data for numerosity reduction: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
Data Transformation
• A function that maps the entire set of values of a
given attribute to a new set of replacement values
such that each old value can be identified with one
of the new values
• Methods for data transformation (a normalization sketch follows)
  – Normalization
  – Discretization
  – Generalization: concept hierarchy climbing
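A minimal sketch of one common transformation, min-max normalization to [0, 1]; plain Python with an illustrative value list:

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Map each value v to new_min + (v - old_min) * (new_max - new_min) / (old_max - old_min)
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

print(min_max_normalize([200, 300, 400, 600, 1000]))
# [0.0, 0.125, 0.25, 0.5, 1.0]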
Simple Discretization: Binning
• Equal-width (distance) partitioning
  – Divides the range into N intervals of equal size (a uniform grid)
  – If A and B are the lowest and highest values of the attribute, the width of the intervals for N bins is W = (B - A)/N (see the sketch below)
  – This is the most straightforward approach, but outliers may dominate the presentation
    • Skewed data is not handled well
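A minimal sketch of equal-width binning with W = (B - A)/N; plain Python, with illustrative ages and bin count:

def equal_width_bins(values, n_bins):
    a, b = min(values), max(values)
    w = (b - a) / n_bins                        # W = (B - A) / N
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # values equal to the maximum go into the last bin
        idx = min(int((v - a) / w), n_bins - 1)
        bins[idx].append(v)
    return bins

ages = [21, 23, 24, 25, 29, 33, 34, 35, 46]
print(equal_width_bins(ages, 5))
# W = (46 - 21)/5 = 5, so the bin boundaries are 21, 26, 31, 36, 41, 46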
Topics of Data Engineering
Data Crawling
• Data crawling is the process of finding and downloading web
pages or documents from the web.
• For example, you might want to crawl the entire web or a
specific domain to find relevant information for a search
engine or a web scraper.
• Data crawling can be done by using a program or a bot that
can follow the links and URLs of the web pages, and store
them in a database or a file.
• Data crawling can be useful for discovering new or updated
data sources, or for creating a web archive.
Data Scraping
• Data scraping is the process of extracting specific data from a
web page or a document.
• For example, you might want to scrape the names and prices
of products from an e-commerce site, or the ratings and
reviews of movies from a streaming platform.
• Data scraping can be done manually, by copying and pasting
the data, or automatically, by using a script or a tool that can
parse the HTML or XML code of the web page.
• Data scraping can be useful for collecting data for analysis,
research, or comparison.
Challenges of Data Scraping/Crawling
• Data scraping and data crawling can be subject to a variety of challenges:
  – legal and ethical issues,
  – technical difficulties, and
  – quality issues.
• It is important to respect the data owner’s rights and permissions, and to avoid any violations of the law.
• Some web pages or documents may have dynamic, complex, or encrypted content that can make data scraping or crawling difficult or impossible.
  – To overcome these challenges, you may need to use advanced techniques such as browser automation, proxies, or APIs.
• Some web pages or documents may have inaccurate, incomplete, or outdated data that can affect the reliability and validity of your results.
  – To ensure quality data, you may need to use data cleaning, validation, or verification methods.
Tools for Data Scraping/Crawling
• Data scraping and crawling can be done using a variety of tools and
frameworks, depending on the specific needs and preferences.
• Scrapy, for example, is a Python framework for building data
crawling and scraping applications that can handle large-scale and
concurrent data extraction
• BeautifulSoup is a Python library for parsing and extracting data
from HTML and XML documents
• Selenium is a framework for automating web browsers that can be
used to scrape or crawl dynamic or interactive content.
• Requests is a Python library for sending and receiving HTTP
requests that can be used to scrape or crawl static or simple
content.
• These tools support different features: Scrapy, for example, offers pipelines, middleware, spiders, selectors, and items, while Requests handles headers, cookies, parameters, authentication, and sessions (see the scraping sketch below).
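A minimal scraping sketch with BeautifulSoup (beautifulsoup4 assumed installed); the HTML snippet and its class names are made up for illustration:

from bs4 import BeautifulSoup

html = """
<div class="product"><span class="name">Shoe</span><span class="price">200 Birr</span></div>
<div class="product"><span class="name">Bag</span><span class="price">350 Birr</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.select("div.product"):
    # Extract the specific fields of interest from each product block
    name = product.select_one("span.name").text
    price = product.select_one("span.price").text
    print(name, price)   # e.g. "Shoe 200 Birr"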
Crawling the Web
Web pages
• Typically a few thousand characters long
• Served over the internet using the HyperText Transfer Protocol (HTTP)
• Viewed at the client end using browsers
Crawler
• Fetches the pages to a local computer
• At that computer, automatic programs can analyze the hypertext documents
HTML: HyperText Markup Language
Lets the author
• specify layout and typeface
• embed diagrams
• create hyperlinks
Hyperlinks are expressed as an anchor tag with an HREF attribute.
HREF names another page using a Uniform Resource Locator (URL):
• URL =
  protocol field (“http”) +
  a server hostname (“www.cse.iitb.ac.in”) +
  a file path (/, the ‘root’ of the published file system)
HTTP (HyperText Transfer Protocol)
Built on top of the Transmission Control Protocol (TCP)
Steps (from the client end; see the socket sketch below):
• Resolve the server host name to an Internet address (IP)
  – Use the Domain Name System (DNS)
  – DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
• Contact the server using TCP
  – Connect to the default HTTP port (80) on the server
  – Send the HTTP request header (e.g., GET)
  – Fetch the response header
    • MIME (Multipurpose Internet Mail Extensions)
    • A meta-data standard for email and Web content transfer
  – Fetch the HTML page
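A minimal sketch of these steps using raw sockets; Python assumed, and "example.com" is just an illustrative host (in practice a library such as Requests does this for you):

import socket

host = "example.com"

# 1. Resolve the server host name to an IP address via DNS
ip = socket.gethostbyname(host)

# 2. Contact the server using TCP on the default HTTP port (80)
with socket.create_connection((ip, 80)) as sock:
    # 3. Send the HTTP request header (a simple GET)
    request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    sock.sendall(request.encode())

    # 4. Fetch the response (headers followed by the HTML page)
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

headers, _, body = response.partition(b"\r\n\r\n")
print(headers.decode(errors="replace"))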
Crawl “all” Web pages?
Problem: there is no catalog of all accessible URLs on the Web.
Solution (a minimal sketch follows):
• Start from a given set of URLs
• Progressively fetch and scan them for new out-linking URLs
• Fetch these pages in turn
• Submit the text in each page to a text indexing system
• and so on…
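A minimal crawling sketch under these assumptions: requests and beautifulsoup4 are installed, the seed URL is illustrative, and there is no politeness or robots.txt handling:

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=50):
    seen = set(seed_urls)
    frontier = deque(seed_urls)     # URLs waiting to be fetched
    pages = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                # skip unreachable pages

        pages[url] = resp.text      # hand the text to a text indexing system here

        # Scan the fetched page for new out-linking URLs
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# pages = crawl(["https://example.com"])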
Crawling procedure
The procedure itself is simple, but
• A great deal of engineering goes into industry-strength crawlers
• Industry crawlers crawl a substantial fraction of the Web
• E.g.: Alta Vista, Northern Lights, Inktomi
There is no guarantee that all accessible Web pages will be located in this fashion.
The crawler may never halt
• pages are added continually even as it runs.
Crawling overheads
Delays are involved in
• Resolving the host name in the URL to an IP address using DNS
• Connecting a socket to the server and sending the request
• Receiving the requested page in response
Solution: overlap these delays by
• fetching many pages at the same time (see the sketch below)
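A minimal sketch of overlapping these delays with a thread pool (requests assumed installed; the URLs are illustrative):

from concurrent.futures import ThreadPoolExecutor

import requests

urls = [f"https://example.com/page{i}" for i in range(20)]

def fetch(url):
    try:
        return url, requests.get(url, timeout=5).text
    except requests.RequestException:
        return url, None            # record failures instead of crashing

# Threads let slow DNS lookups, connections, and responses overlap
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(fetch, urls))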
Anatomy of a crawler
Page-fetching threads
• Start with DNS resolution
• Finish when the entire page has been fetched
Each page
• is stored in compressed form to disk/tape
• is scanned for outlinks
Work pool of outlinks
• maintains network utilization without overloading it
• dealt with by a load manager
Continue until the crawler has collected a sufficient number of pages.
[Figure: anatomy of a crawler]
Collection of data that does not exist
• Survey
  Survey method: in person, on paper, or online
  Survey question types:
  – Multiple-choice questions
  – Rank-order questions
  – Rating questions (e.g., Likert scale) or open-ended questions
  – Dichotomous (closed-ended) questions: Yes/No questions
  Tools:
  – Microsoft’s Office 365,
  – SmartSurvey,
  – Google Forms,
  – SurveyMonkey
  Pros and cons of surveys
Collection of data that does not exist
• Interviews and Focus Groups
Why interviews and/or focus groups
Procedures:
Agreement
Ice breaker
Honest opinion
Plan
Analyzing Interview Data
Pros and Cons of Interviews and Focus Groups
• Log and Diary Data
• Analysis Methods:
Quantitative
Qualitative
Mixed
Topics of Data Engineering
Serial Computing
• A problem is broken into a discrete series of instructions
• Instructions are executed sequentially, one after another
• Execution is on a single processor
• Only one instruction may execute at any moment in time
  – This is a huge waste of hardware resources and time: only one part of the hardware is active for a particular instruction at any point in time
  – As problems became heavier and bulkier, so did the time needed to execute them
• Examples of such processors are the Pentium 3 and Pentium 4.
Parallel Computing
• Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem
  – A problem is broken into discrete parts that can be solved concurrently
  – Each part is further broken down into a series of instructions
  – Instructions from each part execute simultaneously on different processors
  – An overall control/coordination mechanism is employed
Why Parallel Computing
• Save time and/or money
  – Many resources working together reduce the time and cut potential costs
• Solve complex problems
  – The problem is broken into a number of simpler ones
• Create a collaborative work environment
  – Collaborators can each solve a specific part of the bigger problem while working together with the others
Types of Parallel Computing
• Bit-level:
  – Increases the processor word size. Consider a scenario where an 8-bit processor must compute the sum of two 16-bit integers: it must first sum the 8 lower-order bits and then the 8 higher-order bits, while a 16-bit processor can perform the operation with a single instruction.
• Instruction-level:
  – Without it, a processor can issue at most one instruction per clock cycle. Instructions can be re-ordered and grouped so that they execute concurrently without affecting the result of the program.
• Task parallelism:
  – Decomposes a task into subtasks and then allocates each subtask for execution.
• Data-level:
  – Parallelization across multiple processors focusing on distributing the data. It can be applied to regular data structures like arrays and matrices by working on each element in parallel (see the sketch below).
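A minimal sketch of data-level parallelism using Python's multiprocessing module; the data and worker count are illustrative:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    data = list(range(1_000_000))
    # The same operation is applied to the elements in parallel by 4 worker processes
    with Pool(processes=4) as pool:
        results = pool.map(square, data)
    print(results[:5])   # [0, 1, 4, 9, 16]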
MapReduce for Parallel Computing
• MapReduce refers to two separate and distinct tasks that Hadoop programs perform (a toy illustration follows).
  – Map job: takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
  – Reduce job: takes the output from a map as input and combines those data tuples into a smaller set of tuples.
  – As the order of the name MapReduce implies, the reduce job is always performed after the map job.
• MapReduce programming offers several benefits:
  – Scalability: businesses can process petabytes of data stored in the Hadoop Distributed File System (HDFS).
  – Flexibility: Hadoop enables easier access to multiple sources and multiple types of data.
  – Speed: with parallel processing and minimal data movement, Hadoop offers fast processing of massive amounts of data.
  – Simplicity: developers can write code in a choice of languages, including Java, C++, and Python.
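A toy, single-machine illustration of the map and reduce steps (plain Python, not Hadoop): word count over a few lines of text.

from collections import defaultdict

lines = ["big data needs big tools", "data tools for big data"]

# Map: break each line into (key, value) tuples
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group's values into a smaller set of tuples
reduced = {word: sum(counts) for word, counts in groups.items()}
print(reduced)  # {'big': 3, 'data': 3, 'needs': 1, 'tools': 2, 'for': 1}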
Topics of Data Engineering
Data-Driven Applications
• When building data-driven apps, we need to leverage a lot of data processing and machine learning algorithms
  – These apps are frequently deployed on multiple platforms, including mobile devices as well as standard web browsers
  – They need a flexible, scalable, and reliable deployment platform
  – Given the demands on these apps, they need to be continuously developed to adapt to new use cases or user needs, and all updates must happen online, as the apps have to be available 24×7