
BIG DATA

CS-585
Unit-1: Lecture 4
Contents

● Filtering Big Data
● Need for Big Data Standards, Analytics, Architecture, and Adoption
Filtering of the Data

● Filtering refers to the process of defining, detecting, and correcting errors in given data, in order to minimize the impact of errors in the input data on succeeding analysis.
● Generally, the filters are presented as mathematical formulae or pseudocode so that they can be implemented in a language of choice.
Structure for Filtering of Data

Categories / processes / steps in filtering:

● Error Measurement and Estimation
● Inconsistent Data
  – Duplicate data
  – Contradictive data
  – Error codes
  – Values out of bound
  – Outliers
● Missing Data
  – Linear interpolation
  – Polynomial interpolation
  – Statistical model curve fitting
● Evaluation of Quality of Estimates
  – Cross validation
  – Comparison of estimated and observed data

Measurement Error and Estimation

● In the chain of data acquisition, measurement error is the first kind of error that appears. When something is measured, there will almost always be a deviation between the true value and the one obtained, due to imperfections of the measuring device.
● You will need an exact sensor/instrument for calibration, as well as the one you are testing. You can then use the mean value of your (precise) reference sensor to estimate the bias. The bias value can then be subtracted from all of the collected samples from the tested sensor to estimate its precision.
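As an illustration (not from the slides), a minimal Python sketch of this calibration step, assuming hypothetical co-located readings from a precise reference sensor and the sensor under test:

```python
# A minimal sketch of bias estimation against a reference sensor,
# using hypothetical arrays of co-located readings.
import numpy as np

reference = np.array([20.1, 20.0, 19.9, 20.0, 20.1])   # precise reference sensor
tested    = np.array([20.6, 20.4, 20.5, 20.7, 20.5])   # sensor under test

# Bias: systematic offset of the tested sensor relative to the reference mean.
bias = tested.mean() - reference.mean()

# Remove the bias, then use the spread of the corrected samples as a
# simple estimate of the tested sensor's precision.
corrected = tested - bias
precision = corrected.std(ddof=1)

print(f"bias = {bias:.2f}, precision (std) = {precision:.2f}")
```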
Inconsistent Data

● Inconsistent data can be of many kinds. All kinds have in common that the data is objectively erroneous. Types include duplicate data, contradictive data, error codes, and values out of bound.
● That is, we know enough about the system that the measurement is a part of. For example, the instrument might deliver an error code where the sampled value should have been.
● If we sample positions of cars, we know that an error has occurred if one car is reported to be at two places at one time.
Duplicate Data

● When data is transmitted and stored, duplicates of records sometimes appear, for different reasons.
● The data will appear as clones, that is, copies of identical data.
● The solution is simple: just remove all but one of the cloned records in the dataset. It is important to distinguish between clones and representative samples.
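A minimal sketch of clone removal (not from the slides), assuming records arrive as hypothetical (timestamp, sensor id, value) tuples:

```python
# Keep the first occurrence of every identical record, preserving order.
records = [
    ("2024-01-01 10:00", "s1", 21.5),
    ("2024-01-01 10:05", "s1", 21.7),
    ("2024-01-01 10:00", "s1", 21.5),   # exact clone of the first record
]

seen, deduplicated = set(), []
for rec in records:
    if rec not in seen:
        seen.add(rec)
        deduplicated.append(rec)

print(deduplicated)   # the clone is dropped; representative samples are untouched
```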
Contradictive Data

● It is data that contradicts itself.
● Consider the following example, where we add the following error to the temperature series introduced in the earlier figure.
● We have two contradictive samples, both with order attribute 8. We know the sample number is directly correlated to time, and we know that the temperature was measured by one sensor. Since we also know that one sensor cannot have two different temperatures at one point in time, we can say that the values are contradictive.
● If we want to clean the dataset, we have to remove one of those samples. The hard question to answer is: which one of the samples is the correct one, and which one should be removed?
● To solve the problem, again we could use knowledge about the system. We know that the temperature is a continuous variable and should not vary with high frequencies.
● Therefore we linearly interpolate the two neighboring values and take the one that deviates least in temperature from the interpolated estimate.
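A minimal sketch of this resolution step, using hypothetical sample numbers and temperatures (the slide's figure is not reproduced here):

```python
# Two contradictive values share order attribute 8; keep the one closest to
# the linear interpolation of the neighboring samples.
samples = {6: 14.2, 7: 14.6, 8: None, 9: 15.4}   # position 8 is contested
candidates = [15.1, 21.3]                        # the two contradictive values at 8

# Interpolate between the neighboring samples 7 and 9.
estimate = (samples[7] + samples[9]) / 2.0

# Keep the candidate that deviates least from the interpolated estimate.
kept = min(candidates, key=lambda v: abs(v - estimate))
samples[8] = kept
print(f"interpolated estimate = {estimate:.1f}, kept value = {kept}")
```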
Error Codes

● Error codes are data of another kind than the data collected.
● The codes are generated by the software involved in the data collection process and indicate when some part of the collection system is malfunctioning.
● The error codes could have their own channel (such as their own attribute), or they could come as a part of the ordinary data.
● The figure shows a series of air temperatures recorded by a weather station at the side of a road. This is an example where the data representing the measured physical quantity (in this case temperature) and the error code use the same channel.
● In this case a legend explicitly telling the error code may be redundant, since a temperature of constantly -99 C for several samples is unlikely enough to speak for itself.
● There could be other cases where the codes are less obvious.
● The temperature values during the malfunction must be considered missing. Depending on the time span of the malfunction and the availability of redundant data, the chances to make a correction, by filling in the gap, may vary.
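A minimal sketch of treating an in-band error code such as -99 as missing data, assuming a hypothetical temperature series:

```python
# Replace error-code samples with NaN so that downstream statistics and
# gap-filling treat them as missing rather than as real temperatures.
import numpy as np

ERROR_CODE = -99.0
temps = np.array([4.2, 3.9, -99.0, -99.0, 3.5, 3.6])

cleaned = np.where(temps == ERROR_CODE, np.nan, temps)
print(cleaned)                 # [ 4.2  3.9  nan  nan  3.5  3.6]
print(np.nanmean(cleaned))     # mean computed over valid samples only
```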
Values out of Bound

● It is simply a matter of data that do not match the physical quantity that is supposed to be measured.
● For example, a negative magnitude might be valid if a temperature is measured, but not for precipitation. Such a value must be considered “missing”.
Outliers

● “An outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism”: the statistician Douglas Hawkins.
● When you collect data, you will often be able, by intuition, to sort out extraordinary records (or series of records) just by having a swift look at the data.
● The reason for humans’ ability to recognize extraordinary information is probably the fact that this kind of information is often of extraordinary importance.
● The information is extraordinary in the sense that it seems not to follow the background pattern and seems to be very rare or improbable.
Outliers

● These outliers (also sometimes called anomalies) can be sorted into two different subgroups:
  – those that are natural and interesting, and
  – those caused by malfunctioning instruments (where no error code is delivered).
● The first group will contribute data that improves the succeeding analysis or model building. The latter will contribute errors that make the succeeding results less accurate.
● If a collected value is very unlikely, it can by itself cause the mean or the standard deviation to drift significantly. Therefore it is an important part of the data filtering process to remove those values.
● There are two ways to handle such outliers (see the sketch after this list):
  – A density-based approach for detecting outliers, where the moving standard deviation and/or mean is calculated for the nearby n values. It is ensured that the values remain within the calculated distribution (+/-).
  – A model-based approach, where a theoretical model is constructed that reflects the behavior of your dataset. Here a regression model learns from previous examples how the traffic flow varies over the day for a location along a road. It is important that the learning is done from data that we somehow know is correct. Later, incoming data will be compared with what the model predicts. If the data deviates more than a set threshold, it will be considered faulty.
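A minimal sketch of the density-based approach, using a hypothetical traffic series and assumed window size n and threshold k:

```python
# Flag a value as an outlier if it falls outside mean +/- k standard
# deviations of its n nearest neighbors in the series.
import numpy as np

def moving_outliers(values, n=5, k=3.0):
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    half = n // 2
    for i in range(len(values)):
        # Neighboring window, excluding the value under test itself.
        window = np.concatenate([values[max(0, i - half):i],
                                 values[i + 1:i + 1 + half]])
        if window.size < 2:
            continue
        mu, sd = window.mean(), window.std(ddof=1)
        flags[i] = sd > 0 and abs(values[i] - mu) > k * sd
    return flags

traffic = [310, 305, 298, 900, 301, 295, 307]   # 900 is a suspicious spike
print(moving_outliers(traffic, n=4, k=3.0))
```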
Missing Data

● In a dataset, data could be missing for two reasons: either it has never been present, or it has been removed because it was considered faulty for some reason. If data was never present, there are two sub-cases:
  – Either the data was collected as a series, where the missing data is easily detected as a gap in the series of independent attributes,
  – Or the data has a sporadic nature, like precipitation. In this case the gaps could be harder to detect. One way to make the gaps visible is to also report “no-sporadic-event-occurred”.
● Sometimes the intended use of the filtered data requires a complete dataset. There are different degrees of completeness; sometimes an uninterrupted series of data is sufficient, but sometimes data is needed “between” the uninterrupted records. This means that methods will be needed not only to fill in data where records are missing, but also to fill in data between the records that are present. There are various methods to generate the in-between data:
  – Linear interpolation
  – Polynomial interpolation
  – Statistical curve fitting
Linear Interpolation

● Sometimes there is a need for knowing what happens in between known data. This could be formulated as estimation of the value of the dependent attribute where there is no corresponding value for the independent attribute stored in the records.
● Interpolation is a group of methods dealing with this problem, where linear interpolation is the simplest form. Linear interpolation adds information by “binding the known data points together” with straight lines.
Linear Interpolation

● The formula for finding the value (Ye) at a given point X in between X1 and X2 is:

  Ye = Y1 + (X - X1) * (Y2 - Y1) / (X2 - X1)
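A minimal sketch of this formula in Python, with hypothetical sample points:

```python
def linear_interpolate(x, x1, y1, x2, y2):
    """Estimate Ye at x, where x1 <= x <= x2."""
    return y1 + (x - x1) * (y2 - y1) / (x2 - x1)

# Known temperature records at sample numbers 4 and 8; estimate sample 6.
print(linear_interpolate(6, 4, 12.0, 8, 16.0))   # -> 14.0
```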
Polynomial Interpolation

● It can be proven mathematically that, if we have n data points, there is exactly one polynomial of degree at most n-1 going through all the data points.
● Polynomial refers to mathematical functions that have the following pattern:

  p(x) = a0 + a1*x + a2*x^2 + ... + an*x^n

  – where the a's are constants and the degree is n.
  – The given data points are defined in terms of (X0, Y0), (X1, Y1), ..., (Xn, Yn).
  – To make the curve cross all the points, we form the following set of equations.
Polynomial Interpolation

● The earlier set of equations can be written in matrix form:

  [y] = [X][a]
  => [a] = [X]^-1 [y]
Polynomial interpolation: example

● The data set contains 4 points, so n = 4 (degree 3). After inserting the values of time into x and temperature into y, the simultaneous equation set can be solved for the coefficients, as in the sketch below.
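The worked matrices from the original slide are not reproduced here; a minimal sketch of the same computation, assuming hypothetical time and temperature values:

```python
# Build the Vandermonde-style system [y] = [X][a] for 4 points and solve
# [a] = [X]^-1 [y] for a degree-3 polynomial.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # time (assumed values)
y = np.array([12.0, 14.5, 13.8, 15.2])  # temperature (assumed values)

# Row i is [1, x_i, x_i^2, x_i^3], matching a0 + a1*x + a2*x^2 + a3*x^3.
X = np.vander(x, N=4, increasing=True)
a = np.linalg.solve(X, y)               # preferable to forming the inverse explicitly

print(a)                                # coefficients a0..a3
print(np.polyval(a[::-1], 2.5))         # interpolated temperature at time 2.5
```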
Statistical Model Curve Fitting

● There is sometimes a need for knowing what happens in between known data. This could be formulated as estimation of the dependent variable where there is no corresponding independent variable.
● Statistical modeling, such as general regression, uses historical data both for filling in missing data and for modeling; there is actually no difference between the two.
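A minimal sketch of this idea, fitting a simple regression curve to hypothetical historical samples and using it to estimate a missing value:

```python
# Least-squares fit of a quadratic; unlike interpolation it does not have to
# pass exactly through every point, which makes it more robust to noise.
import numpy as np

hours = np.array([0, 1, 2, 3, 5, 6, 7, 8], dtype=float)   # hour 4 is missing
temps = np.array([2.1, 2.0, 2.4, 3.0, 4.9, 5.6, 6.0, 6.1])

coeffs = np.polyfit(hours, temps, deg=2)
print(np.polyval(coeffs, 4.0))   # model-based estimate for the missing hour
```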
Visualization filters

●The Range, List, Date, and Expression filter types are specific to either a visualization,
canvas, or project. Filter types are automatically determined based on the data
elements you choose as filters.
– Range filters - Generated for data elements that are number data types and
that have an aggregation rule set to something other than none. Range filters
are applied to data elements that are measures, and that limit data to a range
of contiguous values, such as revenue of $100,000 to $500,000. Or you can
create a range filter that excludes (as opposed to includes) a contiguous range
of values. Such exclusive filters limit data to noncontiguous ranges (for
example, revenue less than $100,000 or greater than $500,000).
– List filters - Applied to data elements that are text data types and number
data types that aren’t aggregable.
– Date filters - Use calendar controls to adjust time or date selections. You can
either select a single contiguous range of dates, or you can use a date range
filter to exclude dates within the specified range.
– Expression filters - Let you define more complex filters using SQL expressions.
Some Terms Attached with Big Data Filters

● Bloom Filters
● Content-Based Filtering
● Collaborative Filtering
Collaborative Filtering

● Goal: predict what movies/books/… a person may be interested in, on the basis of:
  – Past preferences of the person
  – Other people with similar past preferences
  – The preferences of such people for a new movie/book/…
● One approach is based on repeated clustering (see the sketch after this slide):
  – Cluster people on the basis of preferences for movies
  – Then cluster movies on the basis of being liked by the same clusters of people
  – Again cluster people based on their preferences for (the newly created clusters of) movies
  – Repeat the above till equilibrium
● The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest.
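A minimal sketch of the repeated-clustering approach described above, assuming a small hypothetical 0/1 user-by-movie preference matrix; the choice of scikit-learn's KMeans as the clustering step is an assumption, not prescribed by the slides:

```python
# Alternately cluster users and movies until the assignments stabilize.
import numpy as np
from sklearn.cluster import KMeans

prefs = np.array([      # rows = users, columns = movies, 1 = liked
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

# Step 1: cluster people on the basis of their raw movie preferences.
user_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(prefs)

for _ in range(5):   # repeat until (approximate) equilibrium
    # Step 2: cluster movies by how much each user cluster likes them.
    movie_profiles = np.array([prefs[user_labels == c].mean(axis=0)
                               for c in range(2)]).T
    movie_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(movie_profiles)
    # Step 3: re-cluster users by their preference for each movie cluster.
    user_profiles = np.array([prefs[:, movie_labels == c].mean(axis=1)
                              for c in range(2)]).T
    user_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(user_profiles)

print(user_labels, movie_labels)   # co-clusters used to predict unseen preferences
```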
Everyday Examples of Collaborative
Filtering...

•Bestseller lists
•Top 40 music lists
•The “recent returns” shelf at the library
•Unmarked but well-used paths through the woods
•The printer room at work
•Many weblogs
•“Read any good books lately?”
•....
•Common insight: personal tastes are correlated:
–If Alice and Bob both like X and Alice likes Y then Bob is
more likely to like Y
–especially (perhaps) if Bob knows Alice
Collaborative + Content Filtering as Classification (Basu, Hirsh, Cohen, AAAI 1998)

● Classification task: map a (user, movie) pair into {likes, dislikes}
● Training data: known likes/dislikes
● Test data: active users
● Features: any properties of the user/movie pair

User  | Demographics | Airplane (comedy) | Matrix (action) | Room with a View (romance) | ... | Hidalgo (action)
Joe   | 27, M, 70k   | 1                 | 1               | 0                          | ... | 1
Carol | 53, F, 20k   | 1                 | 1               | 0                          | ... |
Kumar | 25, M, 22k   | 1                 | 0               | 0                          | ... | 1
Ua    | 48, M, 81k   | 0                 | 1               | ?                          | ?   | ?
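A minimal sketch of this classification framing, using a hypothetical subset of the table above (demographics reduced to age and income, genres one-hot encoded) and a decision tree as one possible classifier; this is not the actual Basu, Hirsh, and Cohen method:

```python
# Each (user, movie) pair becomes a feature vector mapped to likes (1) / dislikes (0).
from sklearn.tree import DecisionTreeClassifier

users  = {"Joe": [27, 70], "Carol": [53, 20], "Kumar": [25, 22], "Ua": [48, 81]}      # age, income (k)
movies = {"Airplane": [1, 0, 0], "Matrix": [0, 1, 0], "Room with a View": [0, 0, 1]}  # comedy, action, romance

train = [("Joe", "Airplane", 1), ("Joe", "Matrix", 1), ("Joe", "Room with a View", 0),
         ("Carol", "Airplane", 1), ("Carol", "Matrix", 1), ("Carol", "Room with a View", 0),
         ("Kumar", "Airplane", 1), ("Kumar", "Matrix", 0), ("Kumar", "Room with a View", 0)]

X = [users[u] + movies[m] for u, m, _ in train]   # user features + movie (content) features
y = [label for _, _, label in train]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predict for the active user "Ua" on a movie she has not rated yet.
print(model.predict([users["Ua"] + movies["Room with a View"]]))
```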
Need for Standards in Big Data

● Technologies for streaming, storing, and querying big data have matured to the point where the computer industry can usefully establish standards.
● As in other areas of engineering, standardization allows practitioners to port their learnings across a multitude of solutions, and to more easily employ different technologies together; standardization also allows solution providers to take advantage of sub-components to expeditiously build more compelling solutions with broader applicability.
● Areas of growth that would benefit from standards are:
  – Stream processing
  – Storage engine interfaces
  – Querying
  – Benchmarks
  – Security and governance
  – Metadata management
  – Deployment (including cloud / as-a-service options)
  – Integration with other fast-growing technologies, such as AI and blockchain
Need For Big Data Standards
Big Data Landscape
Big Data Standards Timeline (Historic)
NIST Big Data Interoperability Framework

● Goal: develop a consensus-based reference architecture that is vendor-neutral, technology- and infrastructure-agnostic, to enable any stakeholder to perform analytics processing for their given data sources without worrying about the underlying computing environment.
● Seven volumes:
  – Volume 1, Definitions
  – Volume 2, Taxonomies
  – Volume 3, Use Cases and General Requirements
  – Volume 4, Security and Privacy
  – Volume 5, Architectures White Paper Survey
  – Volume 6, Reference Architecture
  – Volume 7, Standards Roadmap
● Latest versions available: http://bigdatawg.nist.gov/V1_output_docs.php
● Published October 2015 as NIST SP 1500-n
Volume 1: Definitions

● Define a common vocabulary for multiple audiences
● Set the landscape and issues around big data
  – Defined a number of related terms
● Two key aspects:
  – Focused on characteristics (the Vs)
  – Focused on the need for scalable architectures
● Issues:
  – Definitions need to be more normative
● Definition: Big Data consists of extensive datasets, primarily in the characteristics of volume, variety, velocity, and/or variability, that require a scalable architecture for efficient storage, manipulation, and analysis.
Volume 2: Taxonomies

● Define actors and roles as used within the reference architecture
● Start to define data characteristics
● Issues:
  – Our eyes were bigger than our stomach: how do you not do a taxonomy of all of computing?
Volume 3: Big Data Use Cases and Requirements

● Built from 51 responses to the Use Case survey (general template with 26 fields), in the following categories:
  – Deep Learning and Social Media (6)
  – Government Operations (4)
  – Commercial (8)
  – Defense (3)
  – Healthcare and Life Sciences (10)
  – The Ecosystem for Research (4)
  – Astronomy and Physics (5)
  – Earth, Environmental and Polar Science (10)
  – Energy (1)
● Responses were decomposed and then aggregated into 34 general requirements across 6 categories:
  – Data Source Requirements (3)
  – Transformation Provider Requirements (3)
  – Data Consumer Requirements (6)
  – Security and Privacy Requirements (2)
  – Lifecycle Management Requirements (9)
  – Other Requirements (5)
● Detailed requirements are all traceable to the general requirements
● Issues:
  – We didn’t know what we didn’t know; the general template was overly simplistic
  – Additional use cases needed
Volume 4: Security and Privacy

● Recognized early on as a key concern requiring a more complete treatment
● Describes:
  – S&P issues particular to Big Data
  – Some S&P-specific use cases
  – An S&P taxonomy
  – A mapping of S&P use cases to the Reference Architecture
● Issues:
  – Some problems are just hard
  – Need more use cases (or S&P requirements derived from the existing ones)
Volume 5: Architecture White Paper Survey

● Designed to determine if there are common elements to Big Data architecture
● Built from a survey call
  – 10 responses, from industry (8) and academia (2)
● Was sufficient to develop a comparative view and identify key roles and functional components
  – Helped to scope the top-level roles in the Reference Architecture
● Issues:
  – Sample set was too small
Volume 6: Reference Architecture

● Had to be vendor-neutral and technology-agnostic, applicable to a variety of business and deployment models.
● The goals:
  – To illustrate and understand the various Big Data components, processes, and systems, in the context of an overall Big Data conceptual model;
  – To provide a technical reference for U.S. Government departments, agencies and other consumers to understand, discuss, categorize and compare Big Data solutions; and
  – To facilitate the analysis of candidate standards for interoperability, portability, reusability, and extendibility.
● Mapped use case categories to Reference Architecture components and fabrics
● Defined 7 top-level roles and 5 sub-roles
  – Two roles presented as fabrics
● Issues:
  – Hard to describe an architecture without being able to mention technologies
  – Terminology came back to bite us
  – Current architecture is not really normative
  – Too much in one diagram (mixed views)
Volume 7: Standards Roadmap

● Goals:
  – Document an understanding of what standards are available or under development for Big Data
  – Perform a gap analysis and document the findings
  – Identify what possible barriers may delay or prevent adoption of Big Data
  – Document vision and recommendations
● Also designed to be a summary document
● Surveyed major SDO and consortium standards
  – Developed criteria for “relevant to Big Data”
  – Mapped standards to Reference Architecture roles, as users or implementers of the standard
● Issues:
  – An exhaustive documentation of Big Data standards is bigger than the available resources; almost every standard deals with data
  – Initial direction of the document was more a technology roadmap, but you can’t do a roadmap of technologies without mentioning technologies
ISO/IEC 20546, Information Technology – Big Data – Overview and Vocabulary

● Scope: This International Standard provides an overview of Big Data, along with a set of terms and definitions. It provides a terminological foundation for Big Data-related standards.
● Schedule:
  – Current: 1st WD available
  – CD: Oct 2016
  – Publication: Oct 2018
ISO/IEC 20547, Information technology – Big Data Reference Architecture
Thank You

Wish you a prosperous career with Big Data Analytics


References

● https://www.oreilly.com/ideas/its-time-to-establish-big-data-standards