Lecture2_VA_Handling_Data

The document outlines the course CS 661: Big Data Visual Analytics taught by Soumya Dutta at IIT Kanpur, covering key concepts in visual design, visual variables, and the importance of scalability in big data. It emphasizes the role of visual analytics in data processing and interaction, along with challenges such as handling noisy data and data normalization. Additionally, it discusses techniques for data augmentation and reduction to manage large datasets effectively.

Uploaded by

Swaraj Sonavane
Big Data Visual Analytics (CS 661)

Instructor: Soumya Dutta


Department of Computer Science and Engineering
Indian Institute of Technology Kanpur (IITK)
email: [email protected]
Announcements
• To get a quicker response from me, please email my CSE address rather than my IITK address
• My CSE email: [email protected]

IITK CS661: Big Data Visual Analytics: Soumya Dutta 2


Acknowledgements
• Some of the following slides are adapted from the excellent course
materials made available by:
• Prof. Klaus Mueller (State University of New York at Stony Brook)
• Prof. Tamara Munzner (University of British Columbia)



Visual Design and Visual
Variables



Key Visual Representations
• Gestalt Principles
• The tendency to perceive elements as belonging to a group, based on
certain visual properties
• Pre-attentiveness
• Certain low level visual aspects are recognized before conscious
awareness
• Visual variables
• The different visual aspects that can be used to encode information



Gestalt Principles
• “Gestalt” is German for “unified whole”
• Grasp the "totality" of something before worrying about the details
• Proximity, similarity, closure, multistability, …

[Figure: Rubin’s vase]
What do you see in this figure?
Pre-attentiveness
• Also called pop-out



Visual Variables
• Two planar variables
• Spatial dimensions (X and Y)
• Six Retinal variables
• Size
• Color
• Shape
• Orientation
• Texture
• Brightness
• Each retinal variable allows one more data variable to be encoded on top of the two planar dimensions


Visual Variables

[Figure: examples of each visual variable. Panels: Planar, Size, Brightness, Shape, Texture, Color, Orientation]


Takeaways
• Planar position is the strongest visual variable
• Maps to proximity
• Provides an intuitive organization of information
• Things close together are perceptually grouped together (Gestalt)
• Size and brightness are good secondary visual variables for encoding relative magnitude
• Color is a good visual variable for labeling
• Texture can label as well, but it supports pop-out only weakly
• Shape provides only limited pop-out


Considerations with Scalability for Big Data
• Must be scalable to
• Number of data points
• Number of dimensions
• Data sources
• Diversity of data sources (heterogeneity)
• Number of users

Visual Analytics can help!



What is Visual Analytics
• Visualization plus...
• Data processing (analytics)
• Intelligent computing (AI, machine learning)
• Interaction (HCI)
• Pattern discovery
• Storytelling and sensemaking
• Behavioral psychology (cognitive science, human factors)

Visual Analytics is the process of analytical reasoning, often supported by a highly interactive visual interface/tool


Visual Information Seeking Mantra
• Ben Shneiderman’s Mantra: Overview first, zoom and filter, then details-on-demand!
• Overview first
• Zoom
• Filter
• Details on demand
Another Paradigm: Focus + Context
• Focus + Context:
• A single view that shows the focused information directly within its context
• Maintains continuity and does not require the viewer to shift back and forth
• But: there is distortion!

Source: https://www.youtube.com/watch?v=acsFQvv4B0Q


Use of Visualization
• Visual Perception
• Fast screening of large amounts of data
• Pattern recognition
• High-level cognition
• Interaction
• Direct manipulation of data and visualization (Human in the loop)
• Two-way communication

Humans are important!


But Humans are imperfect too!!



Humans Are Imperfect
• Humans tend to overlook/ignore non-focused (and unexpected)
objects even when they are very close and obvious
• Humans also have limited working memory
• Fine details are quickly forgotten when focus changes
• Need to preserve temporal context



Humans Are Imperfect
• Spot the difference: Change blindness

[Figures: two spot-the-difference image pairs. Sources: Google, Wikipedia]
Human Limitations for Visualization
• The Magic Number Seven (7 ± 2) for visualization
• Not more than 7 ± 2 segments in a pie chart
• Not more than 7 ± 2 colors in a line chart
• and so on …

Miller, G. A. (1956). "The magical number seven, plus or minus two: Some limits on our capacity for processing information." Psychological Review, 63(2), 81-97.
Example of Visual Complexity

Do we really need the background grid? Maybe not!



Handling Data



What Do We Do After Getting the Raw Data?
• Real-world data can be dirty!
• Data cleaning (wrangling)
• Missing values
• Noisy data
• Deal with outliers
• Standardize/normalize
• Resolve inconsistency
• Fuse/merge

[Figure: Data Cleaning Cycle. Source: https://blog.insycle.com/data-cleaning-hubspot]
Missing Data: Why?
• Data may not always be available or complete!
• Missing data may be due to
• Equipment malfunction
• Data that was inconsistent with other recorded data and was therefore deleted
• Data not entered due to misunderstanding
• Certain data not being considered important at the time of entry
• Many other reasons


Missing Data: How to Handle?
• How would you estimate the missing value for a dataset?
• Ignore or put in a default value
• Manually fill in (can be tedious or infeasible for large data)
• Use the available value of the nearest neighbor
• Average over all the values
• Use probabilistic methods (regression, Bayesian, decision tree)
• Use AI/ML models to predict missing data
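
A minimal sketch of three of the simpler strategies above (default value, average, nearest neighbor), using only the Python standard library. The function names are hypothetical, missing values are assumed to be marked with None, and "nearest neighbor" is assumed here to mean the closest observed value by position in the sequence rather than by similarity in feature space.

```python
from statistics import mean

def fill_default(values, default=0.0):
    """Replace every missing entry (None) with a fixed default value."""
    return [default if v is None else v for v in values]

def fill_mean(values):
    """Replace missing entries with the average of the observed values."""
    observed = [v for v in values if v is not None]
    avg = mean(observed)
    return [avg if v is None else v for v in values]

def fill_nearest(values):
    """Replace each missing entry with the closest observed value by index.

    Ties are resolved in favor of the earlier (left) neighbor.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    filled = []
    for i, v in enumerate(values):
        if v is None:
            nearest = min(known, key=lambda j: (abs(j - i), j))
            v = values[nearest]
        filled.append(v)
    return filled

readings = [2.0, None, 4.0, 6.0, None]
print(fill_default(readings))  # [2.0, 0.0, 4.0, 6.0, 0.0]
print(fill_mean(readings))     # [2.0, 4.0, 4.0, 6.0, 4.0]
print(fill_nearest(readings))  # [2.0, 2.0, 4.0, 6.0, 6.0]
```

Averaging keeps the overall mean unchanged but shrinks the variance, which is one reason the model-based options in the list may be preferable for larger gaps.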



Data Normalization and Standardization
• Sometimes we like to have all variables on the same scale
• Min-max normalization

• Standardization

• Clipping tails and outliers
• Set all values beyond ±3σ to the value at ±3σ
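
The formulas for these three operations are not present in the extracted text. The standard textbook definitions are sketched below, assuming $x$ is one value of a variable, $x_{\min}$ and $x_{\max}$ are its observed minimum and maximum, $\mu$ is its mean, and $\sigma$ is its standard deviation (the quantity used in the three-sigma clipping rule).

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\qquad \text{(min-max normalization, maps to } [0, 1] \text{)}

z = \frac{x - \mu}{\sigma}
\qquad \text{(standardization, zero mean and unit variance)}

x_{\text{clip}} = \min\bigl(\max(x,\ \mu - 3\sigma),\ \mu + 3\sigma\bigr)
\qquad \text{(clipping at } \pm 3\sigma \text{)}
```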



Normalization



Standardization



Robust Scaling

• IQR = Q3 − Q1
• Difference between the 75th percentile and the 25th percentile of the data
• Robust to outliers
• Relies on the median and IQR, which are barely affected by extreme values
• Ensures that most of the data falls within a consistent range after scaling
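
The scaling formula itself is not present in the extracted text. The commonly used definition, consistent with the bullets above, is given here as an assumption:

```latex
x' = \frac{x - \operatorname{median}(x)}{\mathrm{IQR}},
\qquad \mathrm{IQR} = Q_3 - Q_1
```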



Comparison Among Different Methods of Scaling

[Figure panels: Raw Data, Min-max normalization, Standardization, Robust Scaling. Source: https://www.geeksforgeeks.org/standardscaler-minmaxscaler-and-robustscaler-techniques-ml/]
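
To make the comparison concrete, here is a small standard-library sketch of the scaling methods plus three-sigma clipping. The function names are hypothetical, the population standard deviation is assumed, and the quartiles use the "inclusive" convention of `statistics.quantiles`; other quartile conventions would give slightly different robust-scaling results.

```python
from statistics import mean, median, pstdev, quantiles

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def robust_scale(values):
    """Subtract the median and divide by the interquartile range (Q3 - Q1)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

def clip_3_sigma(values):
    """Set every value beyond mean +/- 3 standard deviations to that limit."""
    mu, sigma = mean(values), pstdev(values)
    lo, hi = mu - 3 * sigma, mu + 3 * sigma
    return [min(max(v, lo), hi) for v in values]

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # one extreme value
print(min_max(data))       # the four small values are squeezed close to 0
print(standardize(data))
print(robust_scale(data))  # [-1.0, -0.5, 0.0, 0.5, 48.5]

spiky = [0.0] * 20 + [100.0]
print(max(clip_3_sigma(spiky)))  # smaller than 100.0: the spike is pulled in
```

With the single extreme value, min-max scaling compresses the ordinary values into a narrow band near 0, while robust scaling keeps them spread out and leaves only the outlier far away.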


Noisy Data
• Noise = Random error in a measured variable
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitations
• Inconsistency in naming conventions


Noisy Data: What to Do?
• Binning
• Replace data with bin centers
• Clustering
• Detect and remove outliers
• Semi-automated method
• Combined human and computer inspection
• Detect suspicious values and check them manually
• Regression
• Smooth data by fitting to a regression function
• Outliers are not always noise! Be careful!
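
Two of the smoothing options above, binning and regression, can be sketched in a few lines of plain Python. The function names are hypothetical; equal-width bins and a straight-line (least-squares) fit are assumed, although other binning schemes and regression functions are equally valid.

```python
def smooth_by_bin_centers(values, n_bins):
    """Equal-width binning: replace each value with the center of its bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return list(values)  # all values identical, nothing to smooth
    smoothed = []
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        smoothed.append(lo + (idx + 0.5) * width)
    return smoothed

def smooth_by_linear_regression(xs, ys):
    """Least-squares line fit: replace each y with the fitted value at its x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]

print(smooth_by_bin_centers([0.0, 1.0, 4.0, 5.0, 9.0, 10.0], n_bins=2))
# [2.5, 2.5, 2.5, 7.5, 7.5, 7.5]

noisy_y = [1.2, 2.9, 5.1, 6.8]   # roughly y = 2x + 1 with small noise
print(smooth_by_linear_regression([0.0, 1.0, 2.0, 3.0], noisy_y))
```

Both methods trade detail for stability: a genuine but rare event is flattened just like noise, which is the risk the last bullet warns about.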



Deal with Small Data → Data Augmentation
• Can you invent meaningful new data?
• Data Augmentation
• Strategy to artificially synthesize new data from existing data
• Common techniques (for images) are
• Rotations
• Translations
• Zooms
• Flips
• Color perturbations
• Crops
• Adding noise by jittering
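
Several of these image operations can be illustrated on a tiny grayscale image stored as a list of rows. This is a dependency-free sketch with hypothetical function names; pixel values are assumed to be integers in 0..255, and real pipelines would normally use an image library instead.

```python
import random

def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def rotate_90_clockwise(img):
    """Rotate by 90 degrees clockwise (rows become columns)."""
    return [list(col) for col in zip(*img[::-1])]

def translate_right(img, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left with `fill`."""
    return [[fill] * shift + row[:len(row) - shift] for row in img]

def crop(img, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

def jitter(img, amount, rng):
    """Add uniform integer noise in [-amount, amount], clamped to 0..255."""
    return [[min(255, max(0, px + rng.randint(-amount, amount))) for px in row]
            for row in img]

image = [[1, 2, 3],
         [4, 5, 6]]
print(flip_horizontal(image))         # [[3, 2, 1], [6, 5, 4]]
print(rotate_90_clockwise(image))     # [[4, 1], [5, 2], [6, 3]]
print(translate_right(image, 1))      # [[0, 1, 2], [0, 4, 5]]
print(crop(image, 0, 1, 2, 2))        # [[2, 3], [5, 6]]
print(jitter(image, 2, random.Random(0)))
```

Each transformed copy keeps the original label, so a small labeled set can be expanded several times over.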



Synthetic Data Generation for Imbalanced Classification
• When data has a severe imbalance in class representation
• If you use such data for ML model training, the model will perform poorly on the minority class
• SMOTE (Synthetic Minority Oversampling Technique) can help
• A data augmentation method

[Figure: Imbalanced Data]


SMOTE: Synthetic Data Generation for Imbalanced Classification
• How do we generate samples for the minority class?
1. Randomly under-sample the majority class
2. Select a minority class instance (x) at random and find its k nearest minority class neighbors
3. Select one of the k neighbors at random, say (y)
4. Generate the synthetic instance as a convex combination of the two chosen instances x and y
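
The four steps can be sketched as follows. This is an illustrative standard-library version, not a reference implementation: the function names are hypothetical, points are assumed to be tuples of floats, Euclidean distance is assumed for the neighbor search, and the minority class is assumed to contain at least two points.

```python
import math
import random

def random_undersample(majority, n_keep, rng):
    """Step 1: keep a random subset of the majority class."""
    return rng.sample(majority, n_keep)

def smote_sample(minority, k, rng):
    """Steps 2-4: create one synthetic minority point."""
    # Step 2: pick a minority instance x and find its k nearest minority neighbors.
    x = rng.choice(minority)
    others = [p for p in minority if p is not x]
    neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
    # Step 3: pick one of those neighbors at random.
    y = rng.choice(neighbors)
    # Step 4: convex combination x + lam * (y - x) with lam in [0, 1).
    lam = rng.random()
    return tuple(xi + lam * (yi - xi) for xi, yi in zip(x, y))

rng = random.Random(42)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(rng.uniform(3, 6), rng.uniform(3, 6)) for _ in range(100)]

kept_majority = random_undersample(majority, n_keep=20, rng=rng)
synthetic_minority = [smote_sample(minority, k=2, rng=rng) for _ in range(16)]
print(len(kept_majority), len(minority) + len(synthetic_minority))
```

Because every synthetic point lies on the line segment between two existing minority points, the new samples stay inside the region the minority class already occupies.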



SMOTE: Synthetic Data Generation for Imbalanced Classification
• Example:

[Figure panels: Imbalanced Data; SMOTE + random under-sampling. Source: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/]


Data Augmentation for Visualization
• Generate new samples according to the data distributions
• Cluster the data (outliers may form clusters!)
• The size of each cluster represents its percentage in the population
• Randomize new samples – bigger clusters get more samples

Augmentation rate ~ Cluster size
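
A rough sketch of cluster-proportional augmentation, assuming the clustering has already been done and is given as a dictionary from a cluster label to its member points. The function name, the Gaussian jitter around an existing member, and the `jitter` parameter are illustrative assumptions, not part of the slide.

```python
import random

def augment_by_cluster(clusters, n_new, rng, jitter=0.05):
    """Draw n_new synthetic points.

    A cluster is chosen with probability proportional to its size, then a
    random member of that cluster is perturbed with small Gaussian noise.
    """
    labels = list(clusters)
    sizes = [len(clusters[label]) for label in labels]
    new_samples = []
    for label in rng.choices(labels, weights=sizes, k=n_new):
        base = rng.choice(clusters[label])
        point = tuple(c + rng.gauss(0.0, jitter) for c in base)
        new_samples.append((label, point))
    return new_samples

rng = random.Random(0)
clusters = {
    "large": [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(90)],
    "small": [(rng.gauss(5.0, 0.5), rng.gauss(5.0, 0.5)) for _ in range(10)],
}
synthetic = augment_by_cluster(clusters, n_new=1000, rng=rng)
counts = {label: sum(1 for l, _ in synthetic if l == label) for label in clusters}
print(counts)  # the "large" cluster receives roughly nine times as many samples
```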


Deal with Big Data → Data Reduction!
• Purpose
• Reduce the data to a size that can be feasibly stored without losing important information
• Reduce the data so a mining algorithm can be feasibly run

• Alternatives
• Buy more storage
• Buy more computers or faster ones
• Develop more efficient algorithms

• In practice, all of these are happening at the same time
• But data size and complexity are growing faster
• So, data reduction is important!
Data Reduction: How?
• Summarization (later in the course)
• Binning
• Distribution-based
• Clustering
• Sampling (later in the course)
• Systematic/Regular
• Random
• Stratified
• Adaptive/Data-driven
• Importance-driven
• Cluster-based
• Dimension reduction (later in the course)
• AI/ML techniques (later in the course)

[Figure labels: Big Data, Summary Data, Sampling, AI/ML model]
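
As a preview of the sampling options listed above, here is a minimal standard-library sketch of systematic, random, and stratified sampling. The function names and the `key`/`fraction` parameters are hypothetical, and each stratum is assumed to keep at least one item.

```python
import random

def systematic_sample(data, step, start=0):
    """Systematic/regular sampling: keep every `step`-th item."""
    return data[start::step]

def random_sample(data, n, rng):
    """Simple random sampling of n items without replacement."""
    return rng.sample(data, n)

def stratified_sample(data, key, fraction, rng):
    """Sample the same fraction from every group (stratum) defined by `key`."""
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        n = max(1, round(fraction * len(members)))  # keep at least one per stratum
        sample.extend(rng.sample(members, n))
    return sample

rng = random.Random(7)
print(systematic_sample(list(range(10)), step=3))   # [0, 3, 6, 9]
print(len(random_sample(list(range(100)), 10, rng)))

records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
subset = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)
print(sum(1 for r in subset if r[0] == "a"), sum(1 for r in subset if r[0] == "b"))
```

Stratified sampling guarantees that the rare group "b" is represented, whereas a plain random sample of the same size could miss it entirely.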
