Big Data Visual Analytics (CS 661)

Instructor: Soumya Dutta

Department of Computer Science and Engineering
Indian Institute of Technology Kanpur (IITK)
email: [email protected]
Announcements
• For a quicker response, please write to my CSE email address rather than my IITK address:
• My CSE email: [email protected]

IITK CS661: Big Data Visual Analytics: Soumya Dutta 2

Acknowledgements
• Some of the following slides are adapted from the excellent course materials made available by:
• Prof. Klaus Mueller (State University of New York at Stony Brook)
• Prof. Tamara Munzner (University of British Columbia)

Visual Design and Visual Variables

Key Visual Representations
• Gestalt Principles
• The tendency to perceive elements as belonging to a group, based on certain visual properties
• Pre-attentiveness
• Certain low-level visual aspects are recognized before conscious awareness
• Visual variables
• The different visual aspects that can be used to encode information

Gestalt Principles
• “Gestalt” is German for “unified whole”
• Grasp the "totality" of something before worrying about the details
• Proximity, similarity, closure, multistability, …

Rubin’s vase
What do you see in this figure?

Pre-attentiveness
• Also called pop-out


Visual Variables
• Two planar variables
• Spatial dimensions (X and Y)
• Six Retinal variables
• Size
• Color
• Shape
• Orientation
• Texture
• Brightness
• Each retinal variable can encode one more data variable in addition to the two planar ones

Visual Variables

[Figure: examples of each visual variable. Panels: Planar, Size, Brightness, Shape, Texture, Color, Orientation]

Takeaways
• The planar variables (position) are the strongest visual variables
• Maps to proximity
• Provides an intuitive organization of information
• Things close together are perceptually grouped together (Gestalt)
• Size and brightness are good secondary visual variables to encode relative magnitude
• Color is a good visual variable for labeling
• Texture can do this as well, but it does not support pop-out much
• Shape provides only limited pop-out

Considerations with Scalability for Big Data
• Must be scalable to
• Number of data points
• Number of dimensions
• Data sources
• Diversity of data sources (heterogeneity)
• Number of users


Visual Analytics can help!


What is Visual Analytics
• Visualization plus...
• Data processing (analytics)
• Intelligent computing (AI, machine learning)
• Interaction (HCI)
• Pattern discovery
• Storytelling and sensemaking
• Behavioral psychology (cognitive science, human factors)

Visual Analytics is the process of analytical reasoning, often supported by a highly interactive visual interface/tool

Visual Information Seeking Mantra
• Ben Shneiderman’s Mantra: Overview first, zoom and filter, then details-on-demand!
• Overview first
• Zoom
• Filter
• Details on demand

Another Paradigm: Focus + Context
• Focus + Context:
• One single view which shows information in direct context
• Maintains continuity and does not require the viewer to shift back and forth between views
• But: there is distortion!

Source: https://www.youtube.com/watch?v=acsFQvv4B0Q

Use of Visualization
• Visual Perception
• Fast screening of a lot of data
• Pattern recognition
• High-level cognition
• Interaction
• Direct manipulation of data and visualization (Human in the loop)
• Two-way communication

Humans are important!

But Humans are imperfect too!!


Humans Are Imperfect
• Humans tend to overlook or ignore non-focused (and unexpected) objects even when they are very close and obvious
• Humans also have limited working memory
• Fine details are quickly forgotten when focus changes
• Need to preserve temporal context

Humans Are Imperfect
• Spot the difference: Change blindness

[Figures: two spot-the-difference image pairs. Sources: Google, Wikipedia]

Human Limitations for Visualization
• The Magic Number Seven (7 ± 2) for visualization
• Not more than 7 ± 2 segments in a pie chart
• Not more than 7 ± 2 colors in a line chart
• and so on …

Miller, G. A. (1956). "The magical number seven, plus or minus two: Some limits on our capacity for processing information". Psychological Review, 63(2), 81–97.

Example of Visual Complexity

Do we really need the background grid? Maybe not!


Handling Data


What Do We Do After Getting the Raw Data?
• Real-world data can be dirty!
• Data cleaning (wrangling)
• Missing values
• Noisy data
• Deal with outliers
• Standardize/normalize
• Resolve inconsistency
• Fuse/merge

[Figure: Data Cleaning Cycle. Source: https://blog.insycle.com/data-cleaning-hubspot]

Missing Data: Why?
• Data may not always be available or complete!
• Missing data may be due to
• Equipment malfunction
• Data that was inconsistent with other recorded data and was therefore deleted
• Data not entered due to misunderstanding
• Certain data not being considered important at the time of entry
• Many other reasons

Missing Data: How to Handle?
• How would you estimate the missing value for a dataset?
• Ignore or put in a default value
• Manually fill in (can be tedious or infeasible for large data)
• Use the available value of the nearest neighbor
• Average over all the values
• Use probabilistic methods (regression, Bayesian, decision tree)
• Use AI/ML models to predict missing data
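
Two of the simpler strategies listed above, averaging and copying from the nearest neighbor, can be sketched in a few lines. This is only an illustration under assumptions that are not in the slides: missing entries are marked with None, the data is a single ordered sequence (for example a time series), and "nearest" means nearest by position in that sequence rather than nearest in feature space. The function names are hypothetical.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    avg = mean(observed)
    return [avg if v is None else v for v in values]


def fill_with_nearest(values):
    """Replace None entries with the closest observed value by position.

    Assumption: neighbors are defined by index distance; a tie between
    an equally distant left and right neighbor goes to the left one.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observed values to copy from")
    filled = []
    for i, v in enumerate(values):
        if v is None:
            nearest = min(known, key=lambda k: (abs(k - i), k))
            v = values[nearest]
        filled.append(v)
    return filled


readings = [10.0, None, 14.0, None, None, 20.0]
mean_filled = fill_with_mean(readings)        # gaps become 44/3
nearest_filled = fill_with_nearest(readings)  # [10.0, 10.0, 14.0, 14.0, 20.0, 20.0]
```

Mean filling keeps the overall average unchanged but flattens local structure, while nearest-neighbor filling preserves local structure at the cost of repeating values; which trade-off is acceptable depends on the data.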


Data Normalization and Standardization
• Sometimes we want all variables on the same scale
• Min-max normalization: x' = (x − min) / (max − min), which maps the values to [0, 1]
• Standardization: z = (x − μ) / σ, which gives zero mean and unit standard deviation
• Clipping tails and outliers
• Set all values beyond ±3σ to the value at ±3σ
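
The three operations on this slide can be written out directly. The sketch below uses the usual textbook definitions, x' = (x − min) / (max − min) for min-max normalization and z = (x − mean) / standard deviation for standardization; the function names, the use of the population standard deviation, and the sample data are assumptions made for illustration.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant data carries no scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def clip_tails(values, k=3.0):
    """Set every value beyond mean ± k*sigma to the nearest of those limits."""
    mu, sigma = mean(values), pstdev(values)
    lo, hi = mu - k * sigma, mu + k * sigma
    return [min(max(v, lo), hi) for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std. dev. 2
normalized = min_max_normalize(data)
z_scores = standardize(data)

with_outlier = [10.0] * 20 + [100.0]
clipped = clip_tails(with_outlier)  # only the last value is pulled in
```

Note that the clipping limits are computed from data that still contains the outlier, so a single extreme value widens the limits it is tested against.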

Normalization


Standardization


Robust Scaling
• Scaled value: x' = (x − median) / IQR
• IQR = Q3 − Q1
• Difference between the 75th percentile and the 25th percentile of the data
• Robust to outliers
• Relies on the median and IQR, which are largely insensitive to extreme values
• Ensures that most of the data falls within a consistent range after scaling
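
A minimal sketch of robust scaling, assuming the common definition x' = (x − median) / IQR. The quartile convention (the "inclusive" method of Python's statistics.quantiles) and the sample data are choices made here for illustration; other quartile conventions give slightly different numbers.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Center on the median and divide by the interquartile range (Q3 - Q1)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined")
    med = median(values)
    return [(v - med) / iqr for v in values]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0]
scaled = robust_scale(data)
# Median is 5, Q1 is 3, Q3 is 7, so the first eight values land in
# [-1.0, 0.75] while the outlier stays far away at 248.75.
```

The outlier is not removed or compressed. What changes is that it no longer dictates the scale of the remaining points, which is the sense in which the method is robust.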

Comparison Among Different Methods of Scaling

[Figure: the same data under four treatments. Panels: Raw Data, Min-max normalization, Standardization, Robust Scaling. Source: https://www.geeksforgeeks.org/standardscaler-minmaxscaler-and-robustscaler-techniques-ml/]

Noisy Data
• Noise = Random error in a measured variable
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitations
• Inconsistencies in naming conventions

Noisy Data: What to Do?
• Binning
• Replace data with bin centers
• Clustering
• Detect and remove outliers
• Semi-automated method
• Combined human and computer inspection
• Detect suspicious values and check them manually
• Regression
• Smooth data by fitting it to a regression function
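
Two of the smoothing options above, binning and regression, are easy to sketch. The version below assumes equal-width bins and a straight-line least-squares fit; both are illustrative choices, since the slides do not say which binning scheme or regression function is intended, and the function names are hypothetical.

```python
def smooth_by_binning(values, n_bins):
    """Replace each value with the center of its equal-width bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    width = (hi - lo) / n_bins
    smoothed = []
    for v in values:
        # The maximum value would index one past the last bin, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        smoothed.append(lo + (idx + 0.5) * width)
    return smoothed


def smooth_by_linear_regression(xs, ys):
    """Replace each y with the value of the least-squares line at its x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]


noisy = [0.0, 1.0, 2.0, 7.0, 8.0, 10.0]
binned = smooth_by_binning(noisy, n_bins=2)  # [2.5, 2.5, 2.5, 7.5, 7.5, 7.5]

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
fitted = smooth_by_linear_regression(xs, ys)  # points on y = 1.09 + 1.94x
```

Binning trades resolution for stability, since every value in a bin becomes identical, whereas regression keeps a distinct value per point but assumes the chosen function family describes the underlying trend.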

Deal with Small Data → Data Augmentation
• Can you invent meaningful new data?
• Data Augmentation
• Strategy to artificially synthesize new data from existing data
• Common techniques (for images):
• Rotations
• Translations
• Zooms
• Flips
• Color perturbations
• Crops
• Adding noise by jittering
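
Several of these image augmentations can be illustrated without any imaging library by treating a grayscale image as a list of pixel rows. This is a toy sketch: the nested-list representation, the 0 to 255 pixel range, and all function names are assumptions for illustration, and real pipelines would normally rely on an image or deep learning library.

```python
import random


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate by 90 degrees clockwise (rows become columns)."""
    return [list(col) for col in zip(*image[::-1])]


def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns (0 <= shift <= width), padding with `fill`."""
    return [[fill] * shift + row[:len(row) - shift] for row in image]


def crop(image, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def jitter(image, amount, rng):
    """Add uniform integer noise in [-amount, amount], clamped to 0..255."""
    return [
        [min(255, max(0, px + rng.randint(-amount, amount))) for px in row]
        for row in image
    ]


image = [
    [0, 50, 100],
    [150, 200, 250],
]
augmented = [
    flip_horizontal(image),             # [[100, 50, 0], [250, 200, 150]]
    rotate_90_clockwise(image),         # [[150, 0], [200, 50], [250, 100]]
    translate_right(image, 1),          # [[0, 0, 50], [0, 150, 200]]
    crop(image, 0, 1, 2, 2),            # [[50, 100], [200, 250]]
    jitter(image, 10, random.Random(0)),
]
```

Each transform yields a new training example that keeps the label of the original image, which is only valid when the label is invariant to that transform (a flipped digit "6" is no longer a "6", for instance).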

Synthetic Data Generation for Imbalanced Classification
• Needed when the data has a severe imbalance in class representation
• An ML model trained on such data tends to perform poorly on the minority class
• SMOTE (Synthetic Minority Oversampling Technique) can help
• A data augmentation method

[Figure: Imbalanced Data]

SMOTE: Synthetic Data Generation for Imbalanced Classification
• How do we generate samples for the minority class?
1. Randomly under-sample the majority class
2. Select a minority class instance (x) at random and find its k nearest minority class neighbors
3. Select one of the k neighbors at random, say (y)
4. Generate a synthetic instance as a convex combination of the two chosen instances x and y
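
The four steps above can be sketched in plain Python. This is a simplified illustration, not the reference SMOTE implementation: points are tuples of floats, distance is Euclidean, the interpolation weight is drawn uniformly from [0, 1), and the function names and sample data are hypothetical.

```python
import math
import random


def random_undersample(majority, n_keep, rng):
    """Step 1: keep a random subset of the majority class."""
    return rng.sample(majority, n_keep)


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]


def smote_like_oversample(minority, n_new, k, rng):
    """Steps 2 to 4: interpolate between minority points and their neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))            # step 2: pick x at random
        x = minority[i]
        others = minority[:i] + minority[i + 1:]
        y = rng.choice(k_nearest(x, others, k))     # step 3: pick a neighbor y
        t = rng.random()                            # weight in [0, 1)
        # Step 4: convex combination (1 - t) * x + t * y
        synthetic.append(tuple(xi + t * (yi - xi) for xi, yi in zip(x, y)))
    return synthetic


rng = random.Random(42)
majority = [(float(i % 10), float(i // 10)) for i in range(100)]
minority = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]

kept_majority = random_undersample(majority, 20, rng)
synthetic = smote_like_oversample(minority, n_new=16, k=2, rng=rng)
balanced_minority = minority + synthetic  # 20 minority points vs. 20 majority
```

Because every synthetic point lies on a line segment between two existing minority points, the new samples stay inside the region the minority class already occupies, which is why they are less likely to be unrealistic than purely random points.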

SMOTE: Synthetic Data Generation for Imbalanced Classification
• Example:

[Figure panels: Imbalanced Data; SMOTE + random under-sampling. Source: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/]

Data Augmentation for Visualization
• Generate new samples according to the data distributions
• Cluster the data (outliers may form clusters!)
• The size of each cluster represents its percentage in the population
• Randomize new samples – bigger clusters get more samples

Augmentation rate ∝ cluster size

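
One way to read the recipe above is sketched here: clusters are assumed to be computed already, each new sample picks a cluster with probability proportional to its size, and the sample itself is an existing member of that cluster perturbed by small Gaussian noise. The perturbation step, the function name, and the data are assumptions; the slides only state that bigger clusters receive more samples.

```python
import random
from collections import Counter


def augment_by_cluster(clusters, n_new, spread, rng):
    """Draw n_new synthetic points, choosing clusters in proportion to size.

    clusters: dict mapping a cluster label to a list of points (tuples).
    spread:   standard deviation of the Gaussian noise added to a member.
    Returns a list of (label, point) pairs.
    """
    labels = list(clusters)
    sizes = [len(clusters[label]) for label in labels]
    new_points = []
    for label in rng.choices(labels, weights=sizes, k=n_new):
        base = rng.choice(clusters[label])
        point = tuple(c + rng.gauss(0.0, spread) for c in base)
        new_points.append((label, point))
    return new_points


clusters = {
    "big": [(0.01 * i, 0.0) for i in range(90)],      # 90% of the data
    "small": [(5.0 + 0.01 * i, 5.0) for i in range(10)],  # 10% of the data
}
new_samples = augment_by_cluster(clusters, n_new=1000, spread=0.05,
                                 rng=random.Random(1))
counts = Counter(label for label, _ in new_samples)
# counts["big"] should be roughly nine times counts["small"]
```

Sampling in proportion to cluster size keeps the augmented data's distribution close to the original, so small clusters (possibly outliers) remain visible without being exaggerated.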
Deal with Big Data → Data Reduction!
• Purpose
• Reduce the data to a size that can be feasibly stored without losing important information
• Reduce the data so that a mining algorithm can be feasibly run
• Alternatives
• Buy more storage
• Buy more computers or faster ones
• Develop more efficient algorithms
• In practice, all of these are happening at the same time
• But the growth of data volume and complexity is faster
• So, data reduction is important!

Data Reduction: How?
• Summarization (later in the course)
• Binning
• Distribution-based
• Clustering
• Sampling (later in the course)
• Systematic/Regular
• Random
• Stratified
• Adaptive/Data-driven
• Importance-driven
• Cluster-based
• Dimension reduction (later in the course)
• AI/ML techniques (later in the course)

[Figure: Big Data reduced to Summary Data, Sampling, or an AI/ML model]
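
As a preview of the sampling options, the first three (systematic, random, stratified) can be sketched as follows. The record layout, the function names, and the rule of keeping at least one item per stratum are illustrative assumptions; the adaptive, importance-driven, and cluster-based variants are not shown.

```python
import random
from collections import defaultdict


def systematic_sample(items, step, start=0):
    """Keep every `step`-th item, beginning at index `start`."""
    return items[start::step]


def random_sample(items, n, rng):
    """Keep n items chosen uniformly at random without replacement."""
    return rng.sample(items, n)


def stratified_sample(items, key, fraction, rng):
    """Sample the same fraction from every group defined by `key`.

    Assumption: each group contributes at least one item, so that small
    groups do not disappear from the reduced data.
    """
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        n = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n))
    return sample


rng = random.Random(7)
records = [(i, "a" if i < 80 else "b") for i in range(100)]  # 80 "a", 20 "b"

every_tenth = systematic_sample(records, step=10)   # 10 records
ten_random = random_sample(records, 10, rng)        # 10 records, any mix
ten_stratified = stratified_sample(records, key=lambda r: r[1],
                                   fraction=0.1, rng=rng)  # 8 "a" and 2 "b"
```

With this particular ordering, systematic sampling also happens to return 8 "a" and 2 "b" records, but only stratified sampling guarantees the class proportions regardless of how the records are ordered.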
