Data Science
Why Low-Quality Data?
• Collecting the required data is challenging
  – Real-life data is heterogeneous
  – Data sources are distributed, and so is the nature of the data
• As a result, real-world data is often low in quality
Why Low-Quality Data?
Technical determinants
• Lack of guidelines for filling out data sources and reporting forms
• Data collection and reporting forms are not standardized
• Complex design of data collection and reporting tools
Behavioral determinants
• Personnel not trained in the use of data sources and reporting forms
• Misunderstanding of how to compile data, use tally sheets, and prepare reports
• Math errors during data consolidation from data sources, affecting report preparation
Organizational determinants
• Lack of a review process before reports are submitted to the next level
• Organization incentivizes reporting high performance
• Absence of a culture of information use
Problems with Data
• Some data have problems of their own that need to be cleaned:
  – Outliers: misleading data that do not fit most of the data/facts
  – Missing data: attribute values may be absent and need to be replaced with estimates
  – Irrelevant data: attributes in the database that are not of interest to the task at hand
  – Noisy data: attribute values that are invalid or incorrect, e.g., typographical errors
  – Inconsistent data, duplicate data, etc.
• Other data are problematic only when we integrate them
  – Everyone had their own way of structuring and formatting data, based on what was convenient for them
  – Such differences appear when data comes from different sources
• Producing good-quality data therefore requires several data pre-processing tasks
Major Tasks in Data Pre-processing
• Data cleaning: to get rid of bad data
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of data from multiple sources, such as databases, data warehouses, or files
• Data reduction
  – Obtains a reduced representation of the data set that is much smaller in volume, yet produces almost the same results
    o Dimensionality reduction
    o Numerosity/size reduction
    o Data compression
• Data transformation
  – Normalization
  – Discretization
Data Cleaning: Redundancy
• Duplicate or redundant data is a data problem that requires data cleaning
Data Cleaning: Incomplete Data
• The dataset may lack certain attributes of interest
  – Is it enough to have only patients' demographic profiles and their regions' addresses to predict the vulnerability (or exposure) of a given region to an outbreak?
Data Cleaning: Missing Data
• Data is not always available: some attribute values are missing, e.g., Occupation=" "
  – Many tuples have no recorded value for several attributes, such as customer income in sales data

  ID | Name                       | City        | State
  1  | Ministry of Transportation | Addis Ababa | Addis Ababa
  2  | Ministry of Finance        | ?           | Addis Ababa
  3  | Office of Foreign Affairs  | Addis Ababa | Addis Ababa

• Missing data may be due to:
  – values that were inconsistent with other recorded data and thus deleted
  – data not entered due to misunderstanding, or not considered important at the time of entry
  – history or changes of the data not being registered
Data Cleaning: Missing Data
There are different methods for treating missing values (see the imputation sketch after this list).
• Ignore the tuple with the missing value: usually done when the class label is missing (assuming the task is classification).
  – Not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious, time-consuming, and often infeasible for large data sets.
• Use a global constant to fill in the missing value: e.g., replace all missing values with a label such as "unknown".
• Use the attribute's mean or mode to fill in the missing value: replace the missing values with the attribute's mean (for numeric attributes) or mode (most frequent value, for nominal attributes).
• Use the most probable value to fill in the missing value automatically
  – calculated, say, using the Expectation Maximization (EM) algorithm
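A minimal sketch of mean/mode imputation, assuming pandas is available; the toy DataFrame and its column names are purely illustrative:

import pandas as pd

# Hypothetical data: fill numeric attributes with the mean, nominal ones with the mode
df = pd.DataFrame({
    "Age": [25, None, 40, 31],                      # numeric attribute
    "Job": ["nurse", "driver", None, "nurse"],      # nominal attribute
})

df["Age"] = df["Age"].fillna(df["Age"].mean())      # mean of 25, 40, 31 = 32
df["Job"] = df["Job"].fillna(df["Job"].mode()[0])   # most frequent value: "nurse"
print(df)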
Example: Missing Values Handling Method

Attribute Name  | Data type | Handling method
Sex             | Nominal   | Mode (most frequent value)
Age             | Numeric   | Mean
Religion        | Nominal   | Mode (most frequent value)
Height          | Numeric   | Mean
Marital status  | Nominal   | Mode (most frequent value)
Job             | Nominal   | Mode (most frequent value)
Weight          | Numeric   | Mean
Data Cleaning: How to Catch Noisy Data
• Manually check all data: tedious and usually infeasible
• Sort data by frequency (see the sketch below)
  – 'green' is more frequent than 'rgeen', so the rare spelling is likely a typo
  – Works well for categorical data
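A minimal sketch of the frequency-sort idea, using only the Python standard library; the colour values are made up for illustration:

from collections import Counter

colours = ["green", "green", "blue", "green", "rgeen", "blue"]

# Sort values by frequency: rare spellings stand out as likely typos
for value, count in Counter(colours).most_common():
    print(value, count)
# green 3
# blue 2
# rgeen 1   <- rare value, probably a typo of "green"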
Data Integration: Formats
• Not everyone uses the same format
• Schema integration
  – Integrate metadata from different sources
  – Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id vs. B.cust-#
• Dates are especially problematic (see the parsing sketch below):
  – 12/19/97
  – 19/12/97
  – 19/12/1997
  – 19-12-97
  – Dec 19, 1997
  – 19 December 1997
  – 19th Dec. 1997
• Do you always write amounts of money the same way?
  – Birr 200, Br. 200, 200 Birr, …
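A minimal sketch, using only the Python standard library, of normalizing several of the date formats above to ISO 8601; the format list is illustrative, not exhaustive, and ambiguous cases (e.g., 12/19/97 vs. 19/12/97) still need a policy decision:

from datetime import datetime

FORMATS = ["%d/%m/%Y", "%d/%m/%y", "%d-%m-%y", "%m/%d/%y", "%b %d, %Y", "%d %B %Y"]

def normalize_date(text):
    # Try each known format in turn; return an ISO date or None for manual review
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(normalize_date("Dec 19, 1997"))      # 1997-12-19
print(normalize_date("19 December 1997"))  # 1997-12-19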
Data Integration: Inconsistent
• Inconsistent data contain discrepancies in codes or names, often caused by a lack of standardization or naming conventions, e.g.,
  – Age=“26” vs. Birthday=“03/07/1986”
  – Some use “1, 2, 3” for ratings; others use “A, B, C”
Data Integration: Inconsistent

Attribute       | Current values                     | New value
Job status      | “no work”, “job less”, “Jobless”   | Unemployed
Marital status  | “not married”, “single”            | Unmarried
Education level | “uneducated”, “no education level” | Illiterate
Data Integration: Different Structure
• What’s wrong here? No data type constraints
  [Example table with columns ID, Name, City, State omitted]
Data Integration: Data that Moves
• Be careful about taking snapshots of a moving target
• Example: Suppose you want to store the price of a shoe in Ethiopia and the price of a shoe in Kenya. Should we use a common currency (say, US$) or each country’s own currency?
  – You can’t store it all in the same currency (say, US$) because the exchange rate changes frequently
  – The price in the local currency stays the same
  – So keep the data in the local currency and use the current exchange rate to convert when needed
• The same applies to ‘Age’: store the birth date and derive the age when it is needed
Data at a Different Level of Detail than Needed
• If the data is at a finer level of detail, you can sometimes bin it
• Example
  – I need age ranges of 20-30, 30-40, 40-50, etc.
  – The imported data contains birth dates
  – No problem! Compute ages and divide them into the appropriate categories
Data Integration: Conflicting Data
• Detecting and resolving data value conflicts
  – For the same real-world entity, attribute values from different sources differ
  – Possible reasons: different representations, different scales, e.g., metric vs. British units
    • Weight measured in kg or pounds
    • Height measured in meters or inches
• Information source #1 says that Alex lives in Bahirdar
  – Information source #2 says that Alex lives in Mekele
• What to do?
  – Use both (he lives in both places)
  – Use the most recently updated piece of information
  – Use the “most trusted” information
  – Flag the row to be investigated further by hand
Handling Redundant Data
• Redundant data often arise when integrating multiple databases
  – Object identification: the same attribute or object may have different names in different databases
  – Derivable data: one attribute may be “derived” from attributes in another table, e.g., annual revenue, age
• Redundant attributes may be detected by correlation analysis and covariance analysis
• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve the speed and quality of the analysis
Example: Covariance
• Suppose two stocks A and B have the following prices over one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
  – E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
  – E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
  – Cov(A, B) = E(A·B) - E(A)·E(B) = (2·5 + 3·8 + 5·10 + 4·11 + 6·14)/5 - 4 × 9.6 = 42.4 - 38.4 = 4
  – Since Cov(A, B) > 0, the prices of A and B tend to rise and fall together
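The same calculation with NumPy (assumed available), as a quick check:

import numpy as np

a = np.array([2, 3, 5, 4, 6])     # prices of stock A
b = np.array([5, 8, 10, 11, 14])  # prices of stock B

# Population covariance E(AB) - E(A)E(B); bias=True divides by N instead of N - 1
print(np.cov(a, b, bias=True)[0, 1])  # 4.0 -> positive: A and B move together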
Data Reduction Strategies
• Why data reduction?
  – A database or data warehouse may store terabytes of data, so complex data analysis may take a very long time to run on the complete data set
• Data reduction: obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data Reduction: Dimensionality Reduction
• Curse of dimensionality
  – As dimensionality increases, the data becomes increasingly sparse
• Dimensionality reduction
  – Helps to eliminate irrelevant attributes and reduce noise: attributes that contain no information useful for the data analysis task at hand
    • E.g., is a student’s ID relevant to predicting the student’s GPA?
  – Helps to avoid redundant attributes: attributes that duplicate information contained in one or more other attributes
    • E.g., the purchase price of a product & the amount of sales tax paid
  – Reduces the time and space required in data mining
  – Allows easier visualization
[Figure: sampling the raw data for numerosity reduction: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
Data Transformation
• A function that maps the entire set of values of a
given attribute to a new set of replacement values
such that each old value can be identified with one
of the new values
• Methods for data transformation (a normalization sketch follows)
  – Normalization
  – Discretization
  – Generalization: concept hierarchy climbing
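A minimal sketch of one common transformation, min-max normalization to [0, 1]; plain Python with an illustrative value list:

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Map each value v to new_min + (v - old_min) * (new_max - new_min) / (old_max - old_min)
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

print(min_max_normalize([200, 300, 400, 600, 1000]))
# [0.0, 0.125, 0.25, 0.5, 1.0]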
Simple Discretization: Binning
• Equal-width (distance) partitioning
  – Divides the range into N intervals of equal size (a uniform grid)
  – If A and B are the lowest and highest values of the attribute, the width of the intervals for N bins is W = (B - A)/N (see the sketch below)
  – This is the most straightforward approach, but outliers may dominate the presentation
    • Skewed data is not handled well
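A minimal sketch of equal-width binning with W = (B - A)/N; plain Python, with illustrative ages and bin count:

def equal_width_bins(values, n_bins):
    a, b = min(values), max(values)
    w = (b - a) / n_bins                        # W = (B - A) / N
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # values equal to the maximum go into the last bin
        idx = min(int((v - a) / w), n_bins - 1)
        bins[idx].append(v)
    return bins

ages = [21, 23, 24, 25, 29, 33, 34, 35, 46]
print(equal_width_bins(ages, 5))
# W = (46 - 21)/5 = 5, so the bin boundaries are 21, 26, 31, 36, 41, 46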
Topics of Data Engineering
Data Crawling
• Data crawling is the process of finding and downloading web
pages or documents from the web.
• For example, you might want to crawl the entire web or a
specific domain to find relevant information for a search
engine or a web scraper.
• Data crawling can be done by using a program or a bot that
can follow the links and URLs of the web pages, and store
them in a database or a file.
• Data crawling can be useful for discovering new or updated
data sources, or for creating a web archive.
Data Scraping
• Data scraping is the process of extracting specific data from a
web page or a document.
• For example, you might want to scrape the names and prices
of products from an e-commerce site, or the ratings and
reviews of movies from a streaming platform.
• Data scraping can be done manually, by copying and pasting
the data, or automatically, by using a script or a tool that can
parse the HTML or XML code of the web page.
• Data scraping can be useful for collecting data for analysis,
research, or comparison.
Challenges of Data Scraping/Crawling
• Data scraping and data crawling can be subject to a variety of challenges:
  – legal and ethical issues,
  – technical difficulties, and
  – quality issues.
• It is important to respect the data owner’s rights and permissions, and to avoid any violations of the law.
• Some web pages or documents may have dynamic, complex, or encrypted content that can make data scraping or crawling difficult or impossible.
  – To overcome these challenges, you may need to use advanced techniques such as browser automation, proxies, or APIs.
• Some web pages or documents may have inaccurate, incomplete, or outdated data that can affect the reliability and validity of your results.
  – To ensure quality data, you may need to use data cleaning, validation, or verification methods.
Tools for Data Scraping/Crawling
• Data scraping and crawling can be done using a variety of tools and
frameworks, depending on the specific needs and preferences.
• Scrapy, for example, is a Python framework for building data
crawling and scraping applications that can handle large-scale and
concurrent data extraction
• BeautifulSoup is a Python library for parsing and extracting data
from HTML and XML documents
• Selenium is a framework for automating web browsers that can be
used to scrape or crawl dynamic or interactive content.
• Requests is a Python library for sending and receiving HTTP
requests that can be used to scrape or crawl static or simple
content.
• These tools support different features: Scrapy, for example, offers pipelines, middleware, spiders, selectors, and items, while Requests handles headers, cookies, parameters, authentication, and sessions (see the scraping sketch below).
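A minimal scraping sketch with BeautifulSoup (beautifulsoup4 assumed installed); the HTML snippet and its class names are made up for illustration:

from bs4 import BeautifulSoup

html = """
<div class="product"><span class="name">Shoe</span><span class="price">200 Birr</span></div>
<div class="product"><span class="name">Bag</span><span class="price">350 Birr</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.select("div.product"):
    # Extract the specific fields of interest from each product block
    name = product.select_one("span.name").text
    price = product.select_one("span.price").text
    print(name, price)   # e.g. "Shoe 200 Birr"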
Crawling the Web
Web pages
• Typically a few thousand characters long
• Served over the internet using the HyperText Transfer Protocol (HTTP)
• Viewed at the client end using browsers
Crawler
• Fetches the pages to a local computer
• At that computer, automatic programs can analyze the hypertext documents
HTML: HyperText Markup Language
Lets the author
• specify layout and typeface
• embed diagrams
• create hyperlinks
Hyperlinks are expressed as an anchor tag with an HREF attribute.
HREF names another page using a Uniform Resource Locator (URL):
• URL =
  protocol field (“http”) +
  a server hostname (“www.cse.iitb.ac.in”) +
  a file path (/, the ‘root’ of the published file system)
HTTP (HyperText Transfer Protocol)
Built on top of the Transmission Control Protocol (TCP)
Steps (from the client end; see the socket sketch below):
• Resolve the server host name to an Internet address (IP)
  – Use the Domain Name System (DNS)
  – DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
• Contact the server using TCP
  – Connect to the default HTTP port (80) on the server
  – Send the HTTP request header (e.g., GET)
  – Fetch the response header
    • MIME (Multipurpose Internet Mail Extensions)
    • A meta-data standard for email and Web content transfer
  – Fetch the HTML page
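A minimal sketch of these steps using raw sockets; Python assumed, and "example.com" is just an illustrative host (in practice a library such as Requests does this for you):

import socket

host = "example.com"

# 1. Resolve the server host name to an IP address via DNS
ip = socket.gethostbyname(host)

# 2. Contact the server using TCP on the default HTTP port (80)
with socket.create_connection((ip, 80)) as sock:
    # 3. Send the HTTP request header (a simple GET)
    request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    sock.sendall(request.encode())

    # 4. Fetch the response (headers followed by the HTML page)
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

headers, _, body = response.partition(b"\r\n\r\n")
print(headers.decode(errors="replace"))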
Crawl “all” Web pages?
Problem: there is no catalog of all accessible URLs on the Web.
Solution (a minimal sketch follows):
• Start from a given set of URLs
• Progressively fetch and scan them for new out-linking URLs
• Fetch these pages in turn
• Submit the text in each page to a text indexing system
• and so on…
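A minimal crawling sketch under these assumptions: requests and beautifulsoup4 are installed, the seed URL is illustrative, and there is no politeness or robots.txt handling:

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=50):
    seen = set(seed_urls)
    frontier = deque(seed_urls)     # URLs waiting to be fetched
    pages = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                # skip unreachable pages

        pages[url] = resp.text      # hand the text to a text indexing system here

        # Scan the fetched page for new out-linking URLs
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# pages = crawl(["https://example.com"])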
Crawling procedure
The procedure itself is simple, but
• A great deal of engineering goes into industry-strength crawlers
• Industry crawlers crawl a substantial fraction of the Web
• E.g.: Alta Vista, Northern Lights, Inktomi
There is no guarantee that all accessible Web pages will be located in this fashion.
The crawler may never halt
• pages are added continually even as it runs.
Crawling overheads
Delays are involved in
• Resolving the host name in the URL to an IP address using DNS
• Connecting a socket to the server and sending the request
• Receiving the requested page in response
Solution: overlap these delays by
• fetching many pages at the same time (see the sketch below)
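A minimal sketch of overlapping these delays with a thread pool (requests assumed installed; the URLs are illustrative):

from concurrent.futures import ThreadPoolExecutor

import requests

urls = [f"https://example.com/page{i}" for i in range(20)]

def fetch(url):
    try:
        return url, requests.get(url, timeout=5).text
    except requests.RequestException:
        return url, None            # record failures instead of crashing

# Threads let slow DNS lookups, connections, and responses overlap
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(fetch, urls))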
Anatomy of a crawler
Page-fetching threads
• Start with DNS resolution
• Finish when the entire page has been fetched
Each page
• is stored in compressed form to disk/tape
• is scanned for outlinks
Work pool of outlinks
• maintains network utilization without overloading it
• dealt with by a load manager
Continue until the crawler has collected a sufficient number of pages.
[Figure: anatomy of a crawler]
Collection of data that does not exist
• Survey
  Survey method: in person, on paper, or online
  Survey question types:
  – Multiple-choice questions
  – Rank-order questions
  – Rating questions (e.g., Likert scale) or open-ended questions
  – Dichotomous (closed-ended) questions: Yes/No questions
  Tools:
  – Microsoft’s Office 365,
  – SmartSurvey,
  – Google Forms,
  – SurveyMonkey
  Pros and cons of surveys
Collection of data that does not exist
• Interviews and Focus Groups
Why interviews and/or focus groups
Procedures:
Agreement
Ice breaker
Honest opinion
Plan
Analyzing Interview Data
Pros and Cons of Interviews and Focus Groups
• Log and Diary Data
• Analysis Methods:
Quantitative
Qualitative
Mixed
Topics of Data Engineering
Serial Computing
• A problem is broken into a discrete series of instructions
• Instructions are executed sequentially, one after another
• Execution is on a single processor
• Only one instruction may execute at any moment in time
  – This is a huge waste of hardware resources and time: only one part of the hardware is active for a particular instruction at any point in time
  – As problems became heavier and bulkier, so did the time needed to execute them
• Examples of such processors are the Pentium 3 and Pentium 4.
Parallel Computing
• Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem
  – A problem is broken into discrete parts that can be solved concurrently
  – Each part is further broken down into a series of instructions
  – Instructions from each part execute simultaneously on different processors
  – An overall control/coordination mechanism is employed
Why Parallel Computing
• Save time and/or money
  – Many resources working together reduce the time and cut potential costs
• Solve complex problems
  – The problem is broken into a number of simpler ones
• Create a collaborative work environment
  – Collaborators can each solve a specific part of the bigger problem while working together with the others
Types of Parallel Computing
• Bit-level:
  – Increases the processor word size. Consider a scenario where an 8-bit processor must compute the sum of two 16-bit integers: it must first sum the 8 lower-order bits and then the 8 higher-order bits, while a 16-bit processor can perform the operation with a single instruction.
• Instruction-level:
  – Without it, a processor can issue at most one instruction per clock cycle. Instructions can be re-ordered and grouped so that they execute concurrently without affecting the result of the program.
• Task parallelism:
  – Decomposes a task into subtasks and then allocates each subtask for execution.
• Data-level:
  – Parallelization across multiple processors focusing on distributing the data. It can be applied to regular data structures like arrays and matrices by working on each element in parallel (see the sketch below).
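A minimal sketch of data-level parallelism using Python's multiprocessing module; the data and worker count are illustrative:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    data = list(range(1_000_000))
    # The same operation is applied to the elements in parallel by 4 worker processes
    with Pool(processes=4) as pool:
        results = pool.map(square, data)
    print(results[:5])   # [0, 1, 4, 9, 16]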
MapReduce for Parallel Computing
• MapReduce refers to two separate and distinct tasks that Hadoop programs perform (a toy illustration follows).
  – Map job: takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
  – Reduce job: takes the output from a map as input and combines those data tuples into a smaller set of tuples.
  – As the order of the name MapReduce implies, the reduce job is always performed after the map job.
• MapReduce programming offers several benefits:
  – Scalability: businesses can process petabytes of data stored in the Hadoop Distributed File System (HDFS).
  – Flexibility: Hadoop enables easier access to multiple sources and multiple types of data.
  – Speed: with parallel processing and minimal data movement, Hadoop offers fast processing of massive amounts of data.
  – Simplicity: developers can write code in a choice of languages, including Java, C++, and Python.
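A toy, single-machine illustration of the map and reduce steps (plain Python, not Hadoop): word count over a few lines of text.

from collections import defaultdict

lines = ["big data needs big tools", "data tools for big data"]

# Map: break each line into (key, value) tuples
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group's values into a smaller set of tuples
reduced = {word: sum(counts) for word, counts in groups.items()}
print(reduced)  # {'big': 3, 'data': 3, 'needs': 1, 'tools': 2, 'for': 1}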
Topics of Data Engineering
Data-Driven Applications
• When building data-driven apps, we need to leverage a lot of data processing and machine learning algorithms
  – These apps are frequently deployed on multiple platforms, including mobile devices as well as standard web browsers
  – They need a flexible, scalable, and reliable deployment platform
  – Given the demands on these apps, they need to be continuously developed to adapt to new use cases or user needs, and all updates must happen online, as the apps have to be available 24×7