
BIG DATA

CS-585
Unit-1: Lecture 4
Contents

● Filtering Big Data
● Need for Big Data Standards, Analytics, Architecture, and Adoption
Filtering of the Data

● Filtering refers to the process of defining, detecting, and correcting errors in given data, in order to minimize the impact of errors in the input data on succeeding analysis.
● Generally, the filters are presented as mathematical formulae or pseudocode so that they can be implemented in a language of choice.
Structure for Filtering of Data

Categories / processes / steps in filtering:

● Error Measurement and Estimation
● Inconsistent Data
  – Duplicate data
  – Contradictive data
  – Error codes
  – Values out of bound
  – Outliers
● Missing Data
  – Linear interpolation
  – Polynomial interpolation
  – Statistical model curve fitting
● Evaluation of Quality of Estimates
  – Cross validation
  – Comparison of estimated and observed data

Measurement Error and Estimation

● In the chain of data acquisition, measurement error is the first kind of error that appears. When something is measured, there will almost always be a deviation between the true value and the one obtained, due to imperfections of the measuring device.
● You will need an exact sensor/instrument for calibration, as well as the one you are testing. You can then use the mean value of your (precise) reference sensor to estimate the bias. The bias value can then be subtracted from all of the collected samples from the tested sensor to estimate its precision.
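As an illustration (not from the slides), a minimal Python sketch of this calibration step, assuming hypothetical co-located readings from a precise reference sensor and the sensor under test:

```python
# A minimal sketch of bias estimation against a reference sensor,
# using hypothetical arrays of co-located readings.
import numpy as np

reference = np.array([20.1, 20.0, 19.9, 20.0, 20.1])   # precise reference sensor
tested    = np.array([20.6, 20.4, 20.5, 20.7, 20.5])   # sensor under test

# Bias: systematic offset of the tested sensor relative to the reference mean.
bias = tested.mean() - reference.mean()

# Remove the bias, then use the spread of the corrected samples as a
# simple estimate of the tested sensor's precision.
corrected = tested - bias
precision = corrected.std(ddof=1)

print(f"bias = {bias:.2f}, precision (std) = {precision:.2f}")
```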
Inconsistent Data

● Inconsistent data can be of many kinds. All kinds have in common that the data is objectively erroneous. Types include duplicate data, contradictive data, error codes, and values out of bound.
● That is, we know enough about the system that the measurement is a part of. For example, the instrument might deliver an error code where the sampled value should have been.
● If we sample positions of cars, we know that an error has occurred if one car is reported to be at two places at one time.
Duplicate Data

● When data is transmitted and stored, duplicates of records sometimes appear, for different reasons.
● The data will appear as clones, that is, copies of identical data.
● The solution is simple: just remove all but one of the cloned records in the dataset. It is important to distinguish between clones and representative samples.
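A minimal sketch of clone removal (not from the slides), assuming records arrive as hypothetical (timestamp, sensor id, value) tuples:

```python
# Keep the first occurrence of every identical record, preserving order.
records = [
    ("2024-01-01 10:00", "s1", 21.5),
    ("2024-01-01 10:05", "s1", 21.7),
    ("2024-01-01 10:00", "s1", 21.5),   # exact clone of the first record
]

seen, deduplicated = set(), []
for rec in records:
    if rec not in seen:
        seen.add(rec)
        deduplicated.append(rec)

print(deduplicated)   # the clone is dropped; representative samples are untouched
```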
Contradictive Data

● It is data that contradicts itself.
● Consider the following example, where we add the following error to the temperature series introduced in the earlier figure.
● We have two contradictive samples, both with order attribute 8. We know the sample number is directly correlated to time, and we know that the temperature was measured by one sensor. Since we also know that one sensor cannot have two different temperatures at one point in time, we can say that the values are contradictive.
● If we want to clean the dataset, we have to remove one of those samples. The hard question to answer is: which one of the samples is the correct one, and which one should be removed?
● To solve the problem, again we could use knowledge about the system. We know that the temperature is a continuous variable and should not vary with high frequencies.
● Therefore we linearly interpolate the two neighboring values and take the one that deviates least in temperature from the interpolated estimate.
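A minimal sketch of this resolution step, using hypothetical sample numbers and temperatures (the slide's figure is not reproduced here):

```python
# Two contradictive values share order attribute 8; keep the one closest to
# the linear interpolation of the neighboring samples.
samples = {6: 14.2, 7: 14.6, 8: None, 9: 15.4}   # position 8 is contested
candidates = [15.1, 21.3]                        # the two contradictive values at 8

# Interpolate between the neighboring samples 7 and 9.
estimate = (samples[7] + samples[9]) / 2.0

# Keep the candidate that deviates least from the interpolated estimate.
kept = min(candidates, key=lambda v: abs(v - estimate))
samples[8] = kept
print(f"interpolated estimate = {estimate:.1f}, kept value = {kept}")
```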
Error Codes

● Error codes are data of another kind than the data collected.
● The codes are generated by the software involved in the data collection process and indicate when some part of the collection system is malfunctioning.
● The error codes could have their own channel (such as their own attribute), or they could come as a part of the ordinary data.
● The figure shows a series of air temperatures recorded by a weather station at the side of a road. This is an example where the data representing the measured physical quantity (in this case temperature) and the error code use the same channel.
● In this case a legend explicitly telling the error code may be redundant, since a temperature of constantly -99 C for several samples is unlikely enough to speak for itself.
● There could be other cases where the codes are less obvious.
● The temperature values during the malfunction must be considered missing. Depending on the time span of the malfunction and the availability of redundant data, the chances to make a correction, by filling in the gap, may vary.
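A minimal sketch of treating an in-band error code such as -99 as missing data, assuming a hypothetical temperature series:

```python
# Replace error-code samples with NaN so that downstream statistics and
# gap-filling treat them as missing rather than as real temperatures.
import numpy as np

ERROR_CODE = -99.0
temps = np.array([4.2, 3.9, -99.0, -99.0, 3.5, 3.6])

cleaned = np.where(temps == ERROR_CODE, np.nan, temps)
print(cleaned)                 # [ 4.2  3.9  nan  nan  3.5  3.6]
print(np.nanmean(cleaned))     # mean computed over valid samples only
```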
Values out of Bound

● It is simply a matter of data that do not match the physical quantity that is supposed to be measured.
● For example, a negative magnitude might be valid if a temperature is measured, but not for precipitation. Such a value must be considered “missing”.
Outliers

● “An outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism”: the statistician Douglas Hawkins.
● When you collect data, you will often be able, by intuition, to sort out extraordinary records (or series of records) just by having a swift look at the data.
● The reason for humans’ ability to recognize extraordinary information is probably the fact that this kind of information is often of extraordinary importance.
● The information is extraordinary in the sense that it seems not to follow the background pattern and seems to be very rare or improbable.
Outliers

● These outliers (also sometimes called anomalies) can be sorted into two different subgroups:
  – those that are natural and interesting, and
  – those caused by malfunctioning instruments (where no error code is delivered).
● The first group will contribute data that improves the succeeding analysis or model building. The latter will contribute errors that make the succeeding results less accurate.
● If a collected value is very unlikely, it can by itself cause the mean or the standard deviation to drift significantly. Therefore it is an important part of the data filtering process to remove those values.
● There are two ways to handle such outliers (see the sketch after this list):
  – A density-based approach for detecting outliers, where the moving standard deviation and/or mean is calculated for the nearby n values. It is ensured that the values remain within the calculated distribution (+/-).
  – A model-based approach, where a theoretical model is constructed that reflects the behavior of your dataset. Here a regression model learns from previous examples how the traffic flow varies over the day for a location along a road. It is important that the learning is done from data that we somehow know is correct. Later, incoming data will be compared with what the model predicts. If the data deviates more than a set threshold, it will be considered faulty.
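A minimal sketch of the density-based approach, using a hypothetical traffic series and assumed window size n and threshold k:

```python
# Flag a value as an outlier if it falls outside mean +/- k standard
# deviations of its n nearest neighbors in the series.
import numpy as np

def moving_outliers(values, n=5, k=3.0):
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    half = n // 2
    for i in range(len(values)):
        # Neighboring window, excluding the value under test itself.
        window = np.concatenate([values[max(0, i - half):i],
                                 values[i + 1:i + 1 + half]])
        if window.size < 2:
            continue
        mu, sd = window.mean(), window.std(ddof=1)
        flags[i] = sd > 0 and abs(values[i] - mu) > k * sd
    return flags

traffic = [310, 305, 298, 900, 301, 295, 307]   # 900 is a suspicious spike
print(moving_outliers(traffic, n=4, k=3.0))
```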
Missing Data

● In a dataset, data could be missing for two reasons: either it has never been present, or it has been removed because it was considered faulty for some reason. If data was never present, there are two sub-cases:
  – Either the data was collected as a series, where the missing data is easily detected as a gap in the series of independent attributes,
  – Or the data has a sporadic nature, like precipitation. In this case the gaps could be harder to detect. One way to make the gaps visible is to also report “no-sporadic-event-occurred”.
● Sometimes the intended use of the filtered data requires a complete dataset. There are different degrees of completeness; sometimes an uninterrupted series of data is sufficient, but sometimes data is needed “between” the uninterrupted records. This means that methods will be needed not only to fill in data where records are missing, but also to fill in data between the records that are present. There are various methods to generate the in-between data:
  – Linear interpolation
  – Polynomial interpolation
  – Statistical curve fitting
Linear Interpolation

● Sometimes there is a need for knowing what happens in between known data. This could be formulated as estimation of the value of the dependent attribute where there is no corresponding value for the independent attribute stored in the records.
● Interpolation is a group of methods dealing with this problem, where linear interpolation is the simplest form. Linear interpolation adds information by “binding the known data points together” with straight lines.
Linear Interpolation

● The formula for finding the value (Ye) at a given point X in between X1 and X2 is:

  Ye = Y1 + (X - X1) * (Y2 - Y1) / (X2 - X1)
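A minimal sketch of this formula in Python, with hypothetical sample points:

```python
def linear_interpolate(x, x1, y1, x2, y2):
    """Estimate Ye at x, where x1 <= x <= x2."""
    return y1 + (x - x1) * (y2 - y1) / (x2 - x1)

# Known temperature records at sample numbers 4 and 8; estimate sample 6.
print(linear_interpolate(6, 4, 12.0, 8, 16.0))   # -> 14.0
```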
Polynomial Interpolation

● It can be proven mathematically that, if we have n data points, there is exactly one polynomial of degree at most n-1 going through all the data points.
● Polynomial refers to mathematical functions that have the following pattern:

  p(x) = a0 + a1*x + a2*x^2 + ... + an*x^n

  – where the a's are constants and the degree is n.
  – The given data points are defined in terms of (X0, Y0), (X1, Y1), ..., (Xn, Yn).
  – To make the curve cross all the points, we form the following set of equations.
Polynomial Interpolation

● The earlier set of equations can be written in matrix form:

  [y] = [X][a]
  => [a] = [X]^-1 [y]
Polynomial interpolation: example

● The data set contains 4 points, so n = 4 (degree 3). After inserting the values of time into x and temperature into y, the simultaneous equation set can be solved for the coefficients, as in the sketch below.
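The worked matrices from the original slide are not reproduced here; a minimal sketch of the same computation, assuming hypothetical time and temperature values:

```python
# Build the Vandermonde-style system [y] = [X][a] for 4 points and solve
# [a] = [X]^-1 [y] for a degree-3 polynomial.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # time (assumed values)
y = np.array([12.0, 14.5, 13.8, 15.2])  # temperature (assumed values)

# Row i is [1, x_i, x_i^2, x_i^3], matching a0 + a1*x + a2*x^2 + a3*x^3.
X = np.vander(x, N=4, increasing=True)
a = np.linalg.solve(X, y)               # preferable to forming the inverse explicitly

print(a)                                # coefficients a0..a3
print(np.polyval(a[::-1], 2.5))         # interpolated temperature at time 2.5
```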
Statistical Model Curve Fitting

● There is sometimes a need for knowing what happens in between known data. This could be formulated as estimation of the dependent variable where there is no corresponding independent variable.
● Statistical modeling, such as general regression, uses historical data both for filling in missing data and for modeling; there is actually no difference between the two.
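A minimal sketch of this idea, fitting a simple regression curve to hypothetical historical samples and using it to estimate a missing value:

```python
# Least-squares fit of a quadratic; unlike interpolation it does not have to
# pass exactly through every point, which makes it more robust to noise.
import numpy as np

hours = np.array([0, 1, 2, 3, 5, 6, 7, 8], dtype=float)   # hour 4 is missing
temps = np.array([2.1, 2.0, 2.4, 3.0, 4.9, 5.6, 6.0, 6.1])

coeffs = np.polyfit(hours, temps, deg=2)
print(np.polyval(coeffs, 4.0))   # model-based estimate for the missing hour
```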
Visualization filters

●The Range, List, Date, and Expression filter types are specific to either a visualization,
canvas, or project. Filter types are automatically determined based on the data
elements you choose as filters.
– Range filters - Generated for data elements that are number data types and
that have an aggregation rule set to something other than none. Range filters
are applied to data elements that are measures, and that limit data to a range
of contiguous values, such as revenue of $100,000 to $500,000. Or you can
create a range filter that excludes (as opposed to includes) a contiguous range
of values. Such exclusive filters limit data to noncontiguous ranges (for
example, revenue less than $100,000 or greater than $500,000).
– List filters - Applied to data elements that are text data types and number
data types that aren’t aggregable.
– Date filters - Use calendar controls to adjust time or date selections. You can
either select a single contiguous range of dates, or you can use a date range
filter to exclude dates within the specified range.
– Expression filters - Let you define more complex filters using SQL expressions.
Some Terms Attached with Big Data Filters

● Bloom Filters
● Content-Based Filtering
● Collaborative Filtering
Collaborative Filtering

● Goal: predict what movies/books/… a person may be interested in, on the basis of:
  – Past preferences of the person
  – Other people with similar past preferences
  – The preferences of such people for a new movie/book/…
● One approach is based on repeated clustering (see the sketch after this slide):
  – Cluster people on the basis of preferences for movies
  – Then cluster movies on the basis of being liked by the same clusters of people
  – Again cluster people based on their preferences for (the newly created clusters of) movies
  – Repeat the above till equilibrium
● The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest.
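A minimal sketch of the repeated-clustering approach described above, assuming a small hypothetical 0/1 user-by-movie preference matrix; the choice of scikit-learn's KMeans as the clustering step is an assumption, not prescribed by the slides:

```python
# Alternately cluster users and movies until the assignments stabilize.
import numpy as np
from sklearn.cluster import KMeans

prefs = np.array([      # rows = users, columns = movies, 1 = liked
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

# Step 1: cluster people on the basis of their raw movie preferences.
user_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(prefs)

for _ in range(5):   # repeat until (approximate) equilibrium
    # Step 2: cluster movies by how much each user cluster likes them.
    movie_profiles = np.array([prefs[user_labels == c].mean(axis=0)
                               for c in range(2)]).T
    movie_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(movie_profiles)
    # Step 3: re-cluster users by their preference for each movie cluster.
    user_profiles = np.array([prefs[:, movie_labels == c].mean(axis=1)
                              for c in range(2)]).T
    user_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(user_profiles)

print(user_labels, movie_labels)   # co-clusters used to predict unseen preferences
```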
Everyday Examples of Collaborative
Filtering...

•Bestseller lists
•Top 40 music lists
•The “recent returns” shelf at the library
•Unmarked but well-used paths through the woods
•The printer room at work
•Many weblogs
•“Read any good books lately?”
•....
•Common insight: personal tastes are correlated:
–If Alice and Bob both like X and Alice likes Y then Bob is
more likely to like Y
–especially (perhaps) if Bob knows Alice
Collaborative + Content Filtering as Classification (Basu, Hirsh, Cohen, AAAI 1998)

● Classification task: map a (user, movie) pair into {likes, dislikes}
● Training data: known likes/dislikes
● Test data: active users
● Features: any properties of the user/movie pair

User  | Demographics | Airplane (comedy) | Matrix (action) | Room with a View (romance) | ... | Hidalgo (action)
Joe   | 27, M, 70k   | 1                 | 1               | 0                          | ... | 1
Carol | 53, F, 20k   | 1                 | 1               | 0                          | ... |
Kumar | 25, M, 22k   | 1                 | 0               | 0                          | ... | 1
Ua    | 48, M, 81k   | 0                 | 1               | ?                          | ?   | ?
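A minimal sketch of this classification framing, using a hypothetical subset of the table above (demographics reduced to age and income, genres one-hot encoded) and a decision tree as one possible classifier; this is not the actual Basu, Hirsh, and Cohen method:

```python
# Each (user, movie) pair becomes a feature vector mapped to likes (1) / dislikes (0).
from sklearn.tree import DecisionTreeClassifier

users  = {"Joe": [27, 70], "Carol": [53, 20], "Kumar": [25, 22], "Ua": [48, 81]}      # age, income (k)
movies = {"Airplane": [1, 0, 0], "Matrix": [0, 1, 0], "Room with a View": [0, 0, 1]}  # comedy, action, romance

train = [("Joe", "Airplane", 1), ("Joe", "Matrix", 1), ("Joe", "Room with a View", 0),
         ("Carol", "Airplane", 1), ("Carol", "Matrix", 1), ("Carol", "Room with a View", 0),
         ("Kumar", "Airplane", 1), ("Kumar", "Matrix", 0), ("Kumar", "Room with a View", 0)]

X = [users[u] + movies[m] for u, m, _ in train]   # user features + movie (content) features
y = [label for _, _, label in train]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predict for the active user "Ua" on a movie she has not rated yet.
print(model.predict([users["Ua"] + movies["Room with a View"]]))
```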
Need for Standards in Big Data

● Technologies for streaming, storing, and querying big data have matured to the point where the computer industry can usefully establish standards.
● As in other areas of engineering, standardization allows practitioners to port their learnings across a multitude of solutions, and to more easily employ different technologies together; standardization also allows solution providers to take advantage of sub-components to expeditiously build more compelling solutions with broader applicability.
● Areas of growth that would benefit from standards are:
  – Stream processing
  – Storage engine interfaces
  – Querying
  – Benchmarks
  – Security and governance
  – Metadata management
  – Deployment (including cloud / as-a-service options)
  – Integration with other fast-growing technologies, such as AI and blockchain
Need For Big Data Standards
Big Data Landscape
Big Data Standards Timeline (Historic)
NIST Big Data Interoperability Framework

● Goal: develop a consensus-based reference architecture that is vendor-neutral, technology- and infrastructure-agnostic, to enable any stakeholder to perform analytics processing for their given data sources without worrying about the underlying computing environment.
● Seven volumes:
  – Volume 1, Definitions
  – Volume 2, Taxonomies
  – Volume 3, Use Cases and General Requirements
  – Volume 4, Security and Privacy
  – Volume 5, Architectures White Paper Survey
  – Volume 6, Reference Architecture
  – Volume 7, Standards Roadmap
● Latest versions available: http://bigdatawg.nist.gov/V1_output_docs.php
● Published October 2015 as NIST SP 1500-n
Volume 1: Definitions

● Define a common vocabulary for multiple audiences
● Set the landscape and issues around big data
  – Defined a number of related terms
● Two key aspects:
  – Focused on characteristics (the Vs)
  – Focused on the need for scalable architectures
● Issues:
  – Definitions need to be more normative
● Definition: Big Data consists of extensive datasets, primarily in the characteristics of volume, variety, velocity, and/or variability, that require a scalable architecture for efficient storage, manipulation, and analysis.
Volume 2: Taxonomies

● Define actors and roles as used within the reference architecture
● Start to define data characteristics
● Issues:
  – Our eyes were bigger than our stomach: how do you not do a taxonomy of all of computing?
Volume 3: Big Data Use Cases and Requirements

● Built from 51 responses to the Use Case survey (general template with 26 fields), in the following categories:
  – Deep Learning and Social Media (6)
  – Government Operations (4)
  – Commercial (8)
  – Defense (3)
  – Healthcare and Life Sciences (10)
  – The Ecosystem for Research (4)
  – Astronomy and Physics (5)
  – Earth, Environmental and Polar Science (10)
  – Energy (1)
● Responses were decomposed and then aggregated into 34 general requirements across 6 categories:
  – Data Source Requirements (3)
  – Transformation Provider Requirements (3)
  – Data Consumer Requirements (6)
  – Security and Privacy Requirements (2)
  – Lifecycle Management Requirements (9)
  – Other Requirements (5)
● Detailed requirements are all traceable to the general requirements
● Issues:
  – We didn’t know what we didn’t know; the general template was overly simplistic
  – Additional use cases needed
Volume 4: Security and Privacy

● Recognized early on as a key concern requiring a more complete treatment
● Describes:
  – S&P issues particular to Big Data
  – Some S&P-specific use cases
  – An S&P taxonomy
  – A mapping of S&P use cases to the Reference Architecture
● Issues:
  – Some problems are just hard
  – Need more use cases (or S&P requirements derived from the existing ones)
Volume 5: Architecture White Paper Survey

● Designed to determine if there are common elements to Big Data architecture
● Built from a survey call
  – 10 responses, from industry (8) and academia (2)
● Was sufficient to develop a comparative view and identify key roles and functional components
  – Helped to scope the top-level roles in the Reference Architecture
● Issues:
  – Sample set was too small
Volume 6: Reference Architecture

● Had to be vendor-neutral and technology-agnostic, applicable to a variety of business and deployment models.
● The goals:
  – To illustrate and understand the various Big Data components, processes, and systems, in the context of an overall Big Data conceptual model;
  – To provide a technical reference for U.S. Government departments, agencies and other consumers to understand, discuss, categorize and compare Big Data solutions; and
  – To facilitate the analysis of candidate standards for interoperability, portability, reusability, and extendibility.
● Mapped use case categories to Reference Architecture components and fabrics
● Defined 7 top-level roles and 5 sub-roles
  – Two roles presented as fabrics
● Issues:
  – Hard to describe an architecture without being able to mention technologies
  – Terminology came back to bite us
  – Current architecture is not really normative
  – Too much in one diagram (mixed views)
Volume 7: Standards Roadmap

● Goals:
  – Document an understanding of what standards are available or under development for Big Data
  – Perform a gap analysis and document the findings
  – Identify what possible barriers may delay or prevent adoption of Big Data
  – Document vision and recommendations
● Also designed to be a summary document
● Surveyed major SDO and consortium standards
  – Developed criteria for “relevant to Big Data”
  – Mapped standards to Reference Architecture roles, as users or implementers of the standard
● Issues:
  – An exhaustive documentation of Big Data standards is bigger than the available resources; almost every standard deals with data
  – Initial direction of the document was more a technology roadmap, but you can’t do a roadmap of technologies without mentioning technologies
ISO/IEC 20546, Information Technology – Big Data – Overview and Vocabulary

● Scope: This International Standard provides an overview of Big Data, along with a set of terms and definitions. It provides a terminological foundation for Big Data-related standards.
● Schedule:
  – Current: 1st WD available
  – CD: Oct 2016
  – Publication: Oct 2018
ISO/IEC 20547, Information technology – Big Data Reference Architecture
Thank You

Wish you a prosperous career with Big Data Analytics


References

● https://www.oreilly.com/ideas/its-time-to-establish-big-data-standards