BDA - Lecture 4
CS-585
Unit-1: Lecture 4
Contents
Inconsistent Data
• Duplicate data
• Contradictive data
• Outliers
• Linear interpolation
• Cross validation
• Comparison of estimated and observed data
Inconsistent Data
That is, we know enough about the system that the measurement is a part of. For example, the instrument might deliver an error code where the sampled value should have been.
If we sample positions of cars, we know that an error has occurred if one car is reported to be at two places at one time.
When data is transmitted and stored,
duplicates of records sometimes
appear, for different reasons.
● Error codes are data of a different kind than the data collected.
● The codes are generated by the software involved in the data collection process and indicate when some part of the collection system is malfunctioning.
● The error codes could have a channel of their own (like a separate attribute) or they could come as part of the ordinary data.
● The figure shows a series of air temperatures recorded by a weather station at the side of a road. This is an example where the data representing the measured physical quantity (in this case temperature) and the error code use the same channel.
● In this case a legend explicitly stating the error code may be redundant, since a temperature constantly at -99 °C for several samples is unlikely enough to speak for itself.
● There could be other cases where the codes are less obvious.
● The temperature values during the malfunction must be considered missing. Depending on the time span of the malfunction and the availability of redundant data, the chances to make a correction by filling in the gap may vary.
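A minimal sketch, assuming an in-band error code such as the -99 shown in the figure shares the temperature channel; the function name and code value are illustrative. It simply marks those samples as missing so a later gap-filling step can handle them.

```python
ERROR_CODE = -99.0   # assumed in-band error code, as in the figure

def mask_error_codes(samples):
    """Replace in-band error codes with None so they are treated as missing data."""
    return [None if value == ERROR_CODE else value for value in samples]

temps = [12.1, 12.3, -99.0, -99.0, 12.6]
print(mask_error_codes(temps))   # [12.1, 12.3, None, None, 12.6]
```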
Values out of bound
It is simply a matter of data that do not match the physical quantity that is supposed to be measured. For example, a negative magnitude might be valid if a temperature is measured, but not precipitation. Such a value must be considered as “missing”.
Outliers
● “An outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism” (the statistician Douglas Hawkins).
● When you collect data, you will often be able to sort out extraordinary records (or series of records) by intuition, just by having a swift look at the data.
● The reason for humans’ ability to recognize extraordinary information is probably the fact that this kind of information is often of extraordinary importance.
● The information is extraordinary in the sense that it seems not to follow the background pattern and seems to be very rare or improbable.
Outliers
• These outliers (also sometimes called anomalies) can be sorted into two different sub-groups:
• those that are natural and interesting, and
• those caused by malfunctioning instruments (where no error code is delivered).
• The first group will contribute data that improves the succeeding analysis or model building. The latter will contribute errors that make the succeeding results less accurate.
• If a collected value is very unlikely, it can by itself cause the mean or the standard deviation to drift significantly. Therefore it is an important part of the data filtering process to remove such values.
• There are two ways to handle such outliers (a sketch of the first is shown after this list):
• A density-based approach for detecting outliers, where the moving standard deviation and/or mean is calculated over the nearby n values. It is then checked that new values remain within the calculated distribution (+/-).
• A model-based approach, where a theoretical model is constructed that reflects the behavior of your dataset. For example, a regression model learns from previous examples how the traffic flow varies over the day for a location along a road. It is important that the learning is done from data that we somehow know is correct. Later incoming data is then compared with what the model predicts; if the data deviates more than a set threshold, it is considered faulty.
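A minimal sketch of the density-based idea, assuming a plain Python list of numeric samples; the function name, window size, and the k-sigma threshold are illustrative choices, not values prescribed by the lecture.

```python
import statistics

def flag_outliers(samples, window=10, k=3.0):
    """Flag values outside mean +/- k*std of the previous `window` samples."""
    flagged = []
    for i, value in enumerate(samples):
        history = samples[max(0, i - window):i]
        if len(history) < 3:          # not enough context yet, accept the value
            flagged.append(False)
            continue
        mean = statistics.mean(history)
        std = statistics.stdev(history)
        # Values outside the moving distribution are marked as suspected outliers.
        flagged.append(std > 0 and abs(value - mean) > k * std)
    return flagged

temps = [12.1, 12.3, 12.2, 12.4, 12.3, 12.5, -99.0, 12.6, 12.4]
print(flag_outliers(temps, window=5, k=3.0))
```

In practice one would typically exclude already-flagged values from the moving window, so a long malfunction does not inflate the local standard deviation.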
Missing Data
• In a dataset, data could be missing for two reasons: either it has never been present, or it has been removed because it was considered faulty for some reason. If data was never present, there are two sub-cases:
• Either the data was collected as a series, where the missing data is easily detected as a gap in the series of independent attributes (see the sketch below),
• Or the data has a sporadic nature, like precipitation. In this case the gaps could be harder to detect. One way to make the gaps visible is to also report “no-sporadic-event-occurred”.
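A minimal sketch of gap detection in a regularly sampled series, assuming plain timestamps in seconds and a known sampling interval; the function name and the 10-minute step are illustrative.

```python
def find_gaps(timestamps, expected_step=600):
    """Return (start, end) pairs where consecutive samples are further apart than expected."""
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > expected_step:
            gaps.append((prev, curr))
    return gaps

# Samples every 10 minutes (600 s), with one record missing between 1200 and 2400.
ts = [0, 600, 1200, 2400, 3000]
print(find_gaps(ts))   # [(1200, 2400)]
```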
• Sometimes the intended use of the filtered data requires a
complete dataset. There are different degrees of completeness;
sometimes an uninterrupted series of data is sufficient, but
sometimes data is needed “between” the uninterrupted records.
This means that methods will be needed not only to fill in data
where records are missing, but also to fill in data between the
records that are present. There are various methods to generate the
in-between data.
• Linear interpolation
• Polynomial interpolation
• Statistical Curve Fitting
Linear Interpolation
● Ultimately [y] = [X][a]
● => [a] = [X]^(-1)[y]
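A minimal sketch of linear interpolation between two known samples, assuming the coefficients a0, a1 of y = a0 + a1*x are solved from the two surrounding points; the function name and sample values are illustrative.

```python
def linear_interpolate(x0, y0, x1, y1, x):
    """Fit y = a0 + a1*x through (x0, y0) and (x1, y1), then evaluate at x."""
    a1 = (y1 - y0) / (x1 - x0)   # slope
    a0 = y0 - a1 * x0            # intercept
    return a0 + a1 * x

# Temperature was 4.0 C at t=10 and 6.0 C at t=20; estimate the missing value at t=15.
print(linear_interpolate(10, 4.0, 20, 6.0, 15))   # 5.0
```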
Polynomial interpolation: example
● The data set contains 4 points, so n = 4 (degree 3). After inserting the values of time in x and temperature in y, the simultaneous equation set becomes:
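A minimal sketch of the same setup using numpy: four (time, temperature) points determine a unique degree-3 polynomial. The sample values are illustrative, not the ones from the slide’s worked example.

```python
import numpy as np

# Four (time, temperature) samples -> a unique degree-3 polynomial through them.
t = np.array([0.0, 1.0, 2.0, 3.0])
temp = np.array([4.0, 5.5, 5.0, 7.0])

# Build the simultaneous equation set [y] = [X][a] with a Vandermonde matrix and solve for [a].
X = np.vander(t, N=4, increasing=True)   # columns: 1, t, t^2, t^3
a = np.linalg.solve(X, temp)             # coefficients a0..a3

# Evaluate the polynomial at an in-between time, e.g. t = 1.5.
print(a @ np.array([1.0, 1.5, 1.5**2, 1.5**3]))
```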
Example
Statistical Model Curve Fitting
●The Range, List, Date, and Expression filter types are specific to either a visualization,
canvas, or project. Filter types are automatically determined based on the data
elements you choose as filters.
– Range filters - Generated for data elements that are number data types and
that have an aggregation rule set to something other than none. Range filters
are applied to data elements that are measures, and that limit data to a range
of contiguous values, such as revenue of $100,000 to $500,000. Or you can
create a range filter that excludes (as opposed to includes) a contiguous range
of values. Such exclusive filters limit data to noncontiguous ranges (for
example, revenue less than $100,000 or greater than $500,000).
– List filters - Applied to data elements that are text data types and number
data types that aren’t aggregable.
– Date filters - Use calendar controls to adjust time or date selections. You can
either select a single contiguous range of dates, or you can use a date range
filter to exclude dates within the specified range.
– Expression filters - Let you define more complex filters using SQL expressions.
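A minimal sketch, in pandas, of the same four filter ideas applied to a data frame (the lecture describes them in a BI-tool context); the column names, thresholds, and dates are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "region":  ["East", "West", "North", "South"],
    "revenue": [90000, 250000, 600000, 400000],
    "date":    pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-15", "2023-04-20"]),
})

# Range filter: keep revenue in a contiguous range ($100,000 to $500,000).
range_f = df[(df["revenue"] >= 100000) & (df["revenue"] <= 500000)]

# List filter: keep rows whose text attribute is in a chosen list.
list_f = df[df["region"].isin(["East", "West"])]

# Date filter: keep a contiguous range of dates.
date_f = df[(df["date"] >= "2023-02-01") & (df["date"] <= "2023-03-31")]

# Expression filter: a more complex, SQL-like condition (the exclusive range).
expr_f = df.query("revenue < 100000 or revenue > 500000")

print(len(range_f), len(list_f), len(date_f), len(expr_f))
```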
Bloom Filters
Some Terms Attached with Big Data Filters
• Content-Based Filtering
• Collaborative Filtering
Collaborative Filtering
•Goal: predict what movies/books/… a person may
be interested in, on the basis of
–Past preferences of the person
–Other people with similar past preferences
–The preferences of such people for a new movie/book/…
•One approach is based on repeated clustering (a sketch follows below)
–Cluster people on the basis of their preferences for movies
–Then cluster movies on the basis of being liked by the same clusters of people
–Again cluster people based on their preferences for (the newly created clusters of) movies
–Repeat the above until equilibrium
•The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
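A minimal sketch of the repeated-clustering idea using scikit-learn’s KMeans on a small user–movie ratings matrix; the matrix values, cluster counts, and iteration count are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows = users, columns = movies; 1 = liked, 0 = not liked.
ratings = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
])

user_labels = np.zeros(ratings.shape[0], dtype=int)   # start with one user cluster

for _ in range(3):   # repeat a few times (until assignments stabilize in practice)
    # Describe each movie by how much each user cluster likes it, then cluster movies.
    n_user_clusters = user_labels.max() + 1
    movie_features = np.array([
        [ratings[user_labels == c, m].mean() for c in range(n_user_clusters)]
        for m in range(ratings.shape[1])
    ])
    movie_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(movie_features)

    # Describe each user by their preference for each movie cluster, then cluster users.
    user_features = np.array([
        [ratings[u, movie_labels == c].mean() for c in range(2)]
        for u in range(ratings.shape[0])
    ])
    user_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(user_features)

print("user clusters:", user_labels)
print("movie clusters:", movie_labels)
```

A new movie can then be recommended to a user if that user’s cluster tends to like the movie cluster it falls into.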
Everyday Examples of Collaborative
Filtering...
•Bestseller lists
•Top 40 music lists
•The “recent returns” shelf at the library
•Unmarked but well-used paths through the woods
•The printer room at work
•Many weblogs
•“Read any good books lately?”
•....
•Common insight: personal tastes are correlated:
–If Alice and Bob both like X and Alice likes Y then Bob is
more likely to like Y
–especially (perhaps) if Bob knows Alice
Collaborative + Content Filtering
As Classification (Basu, Hirsh, Cohen, AAAI98)
Classification task: map (user,movie) pair into {likes,dislikes}
Training data: known likes/dislikes
Test data: active users
Features: any properties of the user/movie pair

                   Airplane  Matrix  Room with a View  ...  Hidalgo
Joe    27,M,70k       1        1            0                  1
Carol  53,F,20k       1        1            0
...
Kumar  25,M,22k       1        0            0                  1
Ua     48,M,81k       0        1            ?           ?      ?
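A minimal sketch of this classification framing using scikit-learn, with user demographics and a movie id as features of each (user, movie) pair; the feature encoding and the tiny training set are illustrative assumptions, not the setup from the cited paper.

```python
from sklearn.tree import DecisionTreeClassifier

# Each (user, movie) pair is described by user features (age, gender, income)
# plus a movie id; the label is 1 = likes, 0 = dislikes.
# Gender: 0 = F, 1 = M; movies: 0 = Airplane, 1 = Matrix, 2 = Room with a View.
X_train = [
    [27, 1, 70000, 0], [27, 1, 70000, 1], [27, 1, 70000, 2],   # Joe
    [53, 0, 20000, 0], [53, 0, 20000, 1], [53, 0, 20000, 2],   # Carol
    [25, 1, 22000, 0], [25, 1, 22000, 1], [25, 1, 22000, 2],   # Kumar
]
y_train = [1, 1, 0,
           1, 1, 0,
           1, 0, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Active user Ua (48, M, 81k): predict whether they like "Room with a View" (movie id 2).
print(clf.predict([[48, 1, 81000, 2]]))
```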
Need for Standards in Big Data
Technologies for streaming, storing, and querying big data have matured to the point where standards are needed in areas such as:
– Stream processing
– Storage engine interfaces
– Querying
– Benchmarks
– Security and governance
– Metadata management
– Deployment (including cloud / as a service options)
– Integration with other fast-growing technologies, such as AI and blockchain
Need For Big Data Standards
Big Data Landscape
Big Data Standards Timeline (Historic)
NIST Big Data Interoperability Framework :
Goal: Develop a consensus-based reference architecture that is vendor-neutral and technology- and infrastructure-agnostic, to enable any stakeholder to perform analytics processing on their given data sources without worrying about the underlying computing environment.
● Seven Volumes
– Volume 1, Definitions
– Volume 2, Taxonomies
– Volume 3, Use Cases and General Requirements
– Volume 4, Security and Privacy
– Volume 5, Architectures White Paper Survey
– Volume 6, Reference Architecture
– Volume 7, Standards Roadmap
● Latest versions available:
http://bigdatawg.nist.gov/V1_output_docs.php
● Published October 2015 as NIST SP 1500-n
Volume 1: Definitions
• The Goals:
• Issues
ISO/IEC 20547, Information technology – Big Data Reference Architecture
– Current: 1st WD Available
– CD: Oct 2016
– Publication: Oct 2018
Thank You
● https://www.oreilly.com/ideas/its-time-to-establish-big-data-standards