Unit 4
PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE
Subject Name: DATA EXPLORATION AND VISUALIZATION    Subject Code: AD3301
Lecture no: 4    Date:    Day:
1. Attendance (2 Mins)
3. Revision (7 Mins): Making sense of data, Comparing EDA with classical and Bayesian analysis.
4. Objective (1 Min): Scatterplots visualize the relationship between two variables, displaying data points on a Cartesian plane.
5. Content (35 Mins): Interpreting Scatterplots - Types of Resistant Lines.
6. Questions by Students (3 Mins): Introduction to Resistant Lines (Robust Regression), Creating Scatterplots with Resistant Lines.
Remarks:
Faculty Incharge
PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE
Subject Name: DATA EXPLORATION AND VISUALIZATION    Subject Code: AD3301
Lecture no: 5    Date:    Day:
1. Attendance (2 Mins)
3. Revision (7 Mins): Making sense of data, Comparing EDA with classical and Bayesian analysis.
4. Objective (1 Min): In Exploratory Data Analysis (EDA), transformation involves applying mathematical functions to data to better meet analytical objectives.
8. Outcome (1 Min): The student should be able to understand the visual aids for Exploratory Data Analysis.
Remarks:
Faculty Incharge
Technical Terms
Analyzing and visualizing variables one at a time is not enough. To draw meaningful conclusions during exploratory data analysis, we need to understand how the variables in a dataset interact with each other. There are numerous ways to analyze this relationship visually; one of the most common is the scatterplot. Scatterplots, however, come with certain limitations, which we will see in later sections. Quantitatively, covariance and correlation are used to describe the relationship between variables.
Scatterplots
In the graph above, it is easy to see that there seems to be a positive relationship between the two variables, i.e. as one increases, the other increases as well. A scatterplot with a negative relationship, i.e. one in which one variable decreases as the other increases, may take the form of Image 2.
A scatterplot with no apparent relationship between the two variables would take the form of Image 3:
import numpy as np
import matplotlib.pyplot as plt

# Generate two independent random variables (no relationship between them)
a = np.random.rand(1000) * 70
b = np.random.rand(1000) * 100

# Plot b against a as small points
plt.plot(a, b, 'o', markersize=2, color='brown')
plt.xlabel('a')
plt.ylabel('b')
plt.show()
In the above image, not all points are visible because many of them overlap. To overcome this, we add a small amount of random noise to the data, called "jitter"; the process is naturally called jittering, and it allows for a somewhat clearer visualization of the overlapping points.
As seen in Image 5, more data points are now visible. However, jitter should be used only for visualization purposes and should be avoided in analysis.
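As a minimal sketch of the idea (this example is illustrative and not taken from the lecture code), discrete data can be jittered with small uniform noise before plotting:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Discrete data: many points share exactly the same coordinates and overlap
x = rng.integers(0, 10, size=1000)
y = rng.integers(0, 10, size=1000)

# Add small uniform noise ("jitter") purely for display purposes
x_jittered = x + rng.uniform(-0.3, 0.3, size=x.size)
y_jittered = y + rng.uniform(-0.3, 0.3, size=y.size)

plt.plot(x_jittered, y_jittered, 'o', markersize=2, color='brown')
plt.xlabel('x (jittered)')
plt.ylabel('y (jittered)')
plt.show()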
There can be an overlap of data in the case of continuous variables as well, where
overlapping points can hide in the dense part of the data and outliers may be given
disproportionate emphasis as seen in Image 1. This is called Saturation.
Covariance
In the above example, we can clearly see that as x increases, y increases too and
hence we get a positive covariance. Now, let’s consider that x and y have units. x
is height in ‘cm’ and y is weight in ‘lbs’. The unit for covariance would then be
cm-lbs. Whatever that means!
Covariance can practically take any value, whereas correlation always lies in the range -1 to 1. Covariance does not tell us how strong the relationship is, only the direction of the relationship. For these reasons it is also difficult to interpret covariance, and to overcome these disadvantages we use correlation.
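As a minimal illustration (the height and weight values below are made up), covariance and correlation can be computed with NumPy:

import numpy as np

# Hypothetical sample data: height in cm (x) and weight in lbs (y)
x = np.array([150, 155, 160, 165, 170, 175, 180])
y = np.array([110, 120, 132, 145, 155, 170, 182])

# Sample covariance: average product of the deviations from the means
cov_xy = np.cov(x, y)[0, 1]

# Pearson correlation: covariance divided by the product of the standard deviations
corr_xy = np.corrcoef(x, y)[0, 1]

print(f"Covariance (cm-lbs): {cov_xy:.2f}")   # unit-dependent, hard to interpret
print(f"Correlation: {corr_xy:.2f}")          # unitless, always between -1 and 1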
Correlation
Both these methods of calculating correlation, Pearson and Spearman, involve transforming the data in the variables being compared into a standard, comparable format. Let's see what transformations are done in each of these methods.
Pearson Correlation
But again, Pearson correlation does come with certain disadvantages. This method does not work well if there are outliers in the data, as it is strongly affected by them. Pearson correlation works well if the change in variable y with respect to variable x is linear, i.e. when the change happens at a constant rate, and when x and y are both somewhat normally distributed or when the data is on an interval scale.
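Since the notes refer to the Pearson formula later on, here is a small sketch (with illustrative data) that computes it directly from its definition, the covariance divided by the product of the standard deviations, and checks the result against NumPy:

import numpy as np

def pearson(x, y):
    # Pearson correlation: cov(x, y) / (std(x) * std(y))
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Illustrative data (same values used in the rank example below)
x = [23, 98, 56, 1, 0, 56, 1999, 12]
y = [5, 92, 88, 45, 2, 54, 90, 1]

print(pearson(x, y))             # heavily influenced by the outlier 1999
print(np.corrcoef(x, y)[0, 1])   # same value computed by NumPy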
Spearman Rank Correlation
In this method, each value is replaced by its rank within its variable. Consider x = [23,98,56,1,0,56,1999,12],
Corresponding Rankx = [4,7,5,2,1,6,8,3]
Similarly, for y = [5,92,88,45,2,54,90,1],
Corresponding Ranky = [3,8,6,4,2,5,7,1]
Looking at Rankx and Ranky, the advantage of this method is apparent: neither Rankx nor Ranky contains any outliers. Even if the actual data has outliers, each outlier is converted into a rank that is nothing but the relative position of the number in the dataset. Hence, this method is robust against outliers. This method also solves the problem of data distributions, since the distribution of the ranks is always uniform. We then calculate the Pearson correlation of Rankx and Ranky using the formula seen in the Pearson correlation section.
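A minimal sketch of this idea, ranking the values and then applying Pearson to the ranks, checked against SciPy's implementation (assuming SciPy is available; note that rankdata assigns tied values their average rank, whereas the ranks listed above break the tie for 56 arbitrarily):

import numpy as np
from scipy import stats

x = [23, 98, 56, 1, 0, 56, 1999, 12]
y = [5, 92, 88, 45, 2, 54, 90, 1]

# Replace each value by its rank, then apply the Pearson formula to the ranks
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)
spearman_manual = np.corrcoef(rank_x, rank_y)[0, 1]

# SciPy computes the same quantity directly
spearman_scipy, _ = stats.spearmanr(x, y)

print(spearman_manual, spearman_scipy)  # the two values agree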
Spearman correlation is preferred in the following case:
1. When x changes as y does, but not necessarily at a constant rate, i.e. when there is a monotonic but not necessarily linear relationship between the two variables.
Taking the following baseline example with Car Accidents by Body Type and whether drinking has been involved:
The "Percentage Of" can be displayed for each column using the Value tab of the visualization properties:
This will result in the following view for the same table (as above), showing the percentage of values:
The above setting will only show the percentage with respect to the same column (the distribution within the same column, black arrow below) and not the percentage of Body Type (green arrow below)!
The properties section does offer the possibility of changing the column the percentage relates to, e.g. Body Type, but this is not calculated correctly within DV (no further investigation of this issue):
As mentioned at the beginning, for some graphs, like bar graphs, there is no option to display the data as a percentage of values, since this is not the best visualization to represent distributions; bar graphs will therefore only show the absolute values, like:
In order to calculate the correct sum for the occurrences, the formula for decoding the attribute field into 1 and 0 has to be applied in the data preparation step (on the Prepare tab). Otherwise, the SUM will not calculate correctly when using "My Calculations", since it will not apply the calculation to each row, but only to the aggregated result, as clearly shown below with the bottom red box:
This is because the CASE WHEN formula is not evaluated row by row, but only applied after aggregation. Hence, the decoding of values for calculating the percentage must be placed within the data preparation step:
After this preparation, the formula can be defined as shown below (please note the multiplication by 1.00 to convert to a decimal number that represents the percentage):
This gives the intended result for correct Percentage Of values within a Bar Chart
and Table View:
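The same computation can be sketched outside the DV tool. The following hypothetical pandas version (column names body_type, drinking and the sample rows are illustrative, not the original dataset) decodes the drinking flag to 1/0 at the preparation step and then aggregates the percentages:

import pandas as pd

# Hypothetical accident records; in DV this would be the prepared dataset
df = pd.DataFrame({
    "body_type": ["Sedan", "Sedan", "SUV", "SUV", "Truck", "Truck", "Sedan", "SUV"],
    "drinking":  ["Yes",   "No",    "Yes", "No",  "No",    "Yes",   "No",    "No"],
})

# Preparation step: decode the attribute into 1/0 row by row (the CASE WHEN equivalent)
df["drinking_flag"] = (df["drinking"] == "Yes").astype(int)

# Aggregation step: share of accidents involving drinking per body type, as a percentage
summary = df.groupby("body_type")["drinking_flag"].mean() * 100.0
print(summary)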
4.3 ANALYZING CONTINGENCY TABLES
The most common way to represent the relationship between two categorical variables is by using contingency tables, or, as some statisticians call them, cross tables.
Imagine you are an investment manager and you manage stocks, bonds and real
estate investments for three different investors.
Each of the investors has a different idea of risk. Hence, their money is allocated
in a different way among the three asset classes.
A contingency table representing all the data looks like the following.
In the picture below, you can clearly see the rows showing the type of investment
that’s been made and the columns with each investor’s allocation.
It is a good practice to calculate the totals of each row and column because it is
often useful in further analysis.
Notice that the subtotals of the rows give us total investment in stocks, bonds and
real estate.
On the other hand, the subtotals of the columns give us the holdings of each
investor.
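As a hedged illustration (the allocation numbers below are made up, since the original table is not reproduced here), such a contingency table with row and column totals can be built in pandas:

import pandas as pd

# Hypothetical allocations for three investors across three asset classes
data = {
    "Investor A": [96, 181, 88],
    "Investor B": [185, 3, 152],
    "Investor C": [39, 29, 142],
}
table = pd.DataFrame(data, index=["Stocks", "Bonds", "Real Estate"])

# Row totals: total investment per asset class; column totals: each investor's holdings
table["Total"] = table.sum(axis=1)
table.loc["Total"] = table.sum(axis=0)
print(table)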
How to Visualize it
Once we have created a contingency table, we can proceed by visualizing the data
onto a plane.
A very useful chart in such cases is a variation of the bar chart called the side-
by-side bar chart. It represents the holdings of each investor in the different types
of assets. As you can see in the picture below, stocks are in green, bonds are in
red and real estate is in blue.
Why it is Called Side-by-Side Bar Chart
The name of this type of chart comes from the fact that for each investor, the
categories of assets are represented side by side. In this way, we can easily
compare asset holdings for a specific investor or among investors.
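Continuing the hypothetical numbers from the sketch above, a side-by-side bar chart can be produced directly from a pandas DataFrame (this is an illustrative example, not the chart from the original notes):

import pandas as pd
import matplotlib.pyplot as plt

# Same hypothetical allocations, asset classes as rows and investors as columns
holdings = pd.DataFrame(
    {"Investor A": [96, 181, 88], "Investor B": [185, 3, 152], "Investor C": [39, 29, 142]},
    index=["Stocks", "Bonds", "Real Estate"],
)

# One group of bars per investor; within each group the asset classes sit side by side
holdings.T.plot(kind="bar", color=["green", "red", "blue"], rot=0)
plt.ylabel("Amount invested")
plt.title("Asset holdings per investor")
plt.show()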
Important: All graphs are very easy to create and read after you have:
Finally, we would like to conclude with a very important graph – the scatter plot.
It is used when representing two numerical variables. For this example, we will
be looking at the reading and writing SAT scores of 100 individuals.
Second, our vertical axis shows the writing scores, while the horizontal axis contains the reading scores.
Third, there are 100 students, and their results correspond to specific points on the graph. Each point gives us information about a particular student's performance.
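A small illustrative sketch of such a plot, using randomly generated scores (the real 100-student dataset is not included in these notes):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulated SAT scores for 100 students: reading and writing are positively related,
# clipped to the valid 200-800 score range
reading = np.clip(rng.normal(500, 100, size=100), 200, 800)
writing = np.clip(reading + rng.normal(0, 60, size=100), 200, 800)

plt.scatter(reading, writing, s=15)
plt.xlabel("Reading score")
plt.ylabel("Writing score")
plt.title("Reading vs. writing SAT scores (simulated)")
plt.show()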
Data Collection
The first step in batch processing is data collection. This involves gathering and
consolidating data from various sources into a centralized repository. The data is
usually stored in files or databases, ready for further processing.
Data Processing
Data Storage
After processing, the results are stored back into the database or files. This
transformed data is then used for reporting, analysis, or as input for other systems.
Batch processing offers several advantages that make it a preferred choice for handling large volumes of data. For example, a large CSV file can be processed in manageable chunks with pandas:
Python
import pandas as pd

def process_batch(batch_data):
    # Placeholder transformation; replace with your own processing logic
    processed_data = batch_data.dropna()
    return processed_data

chunk_size = 100000  # Adjust the chunk size as per your system's memory constraints
csv_file = "large_data.csv"  # Replace with your CSV file's path

# Read and process the file one chunk at a time instead of loading it all at once
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
    processed_chunk = process_batch(chunk)
For larger datasets that do not fit on a single machine, the same batch idea can be expressed with PySpark:
Python
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("BatchProcessing").getOrCreate()

def process_batch(batch_data):
    # Placeholder transformation; replace with your own processing logic
    processed_data = batch_data.dropna()
    return processed_data

csv_file = "large_data.csv"  # Replace with your CSV file's path
df = spark.read.format("csv").option("header", "true").load(csv_file)
processed_df = process_batch(df)  # Spark partitions the data across the cluster itself
Scatter plots usually represent lots and lots of observations. When interpreting
a scatter plot, a statistician is not expected to look into single data points. He
would be much more interested in getting the main idea of how the data is
distributed.
First Observation
Second Observation
We already mentioned that scores can be anywhere between 200 and 800. Well,
500 is the average score one can get, so it makes sense that a lot of people fall
into that area.
Third Observation
There is a group of people with both very high writing and reading scores.
The exceptional students tend to be excellent at both components.
This is less true for bad students as their performance tends to deviate when
performing different tasks.
Fourth Observation
Finally, we have Jane from a few paragraphs ago. She is far away from every
other observation as she scored above average on reading but poorly on writing.
We call this observation an outlier as it goes against the logic of the whole
dataset.
To sum up, after reading this tutorial, representing the relationship between 2
variables should be like a walk in the park for you. If the variables are categorical,
creating a contingency table should be a priority. After doing that, a side-by-side
bar chart will be a great way to visualize the data.
On the other hand, if the variables are numerical, a scatter plot will get the job done most of the time. It is extremely useful because it is quite easy to make observations based on it, and it is a great starting point for more complex analyses.
So, these are the basics when it comes to visualizing 2 variables. Now, you are
ready to dive into the heart of descriptive statistics. The first concept which you
can master revolves around the measures of central tendency: mean, median, and mode.
4.6 TRANSFORMATIONS
From a general perspective, data transformation helps businesses take raw data
(structured or unstructured) and transform it for further processing, including
analysis, integration, and visualization. All teams within a company’s structure
benefit from data transformation, as low-quality unmanaged data can negatively
impact all facets of business operations. Some additional benefits of data
transformation include:
Data integration
Before examining the various ways to transform data, it is important to take a step
back and look at the data integration process. Data integration processes multiple
types of source data into integrated data, during which the data undergoes
cleaning, transformation, analysis, loading, etc. With that, we can see that data
transformation is simply a subset of data integration.
Batch integration
ETL integration
Similar to ELT, ETL data processing involves data integration through extraction,
transformation, and loading. ETL integration is the most common form of data
integration and utilizes batch integration techniques.
ELT integration
ELT data processing involves data integration through extraction, loading, and
transformation. Similar to real-time integration, ELT applies open-source
tools and cloud technology, making this method best for organizations that need
to transform massive amounts of data at a relatively quick pace.
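As a rough illustration of the ordering difference only (the function names below are placeholders, not a real API), ETL transforms data before loading it into the warehouse, while ELT loads the raw data first and transforms it inside the target system:

# Placeholder functions standing in for real extract/transform/load steps
def extract():
    return [{"name": " Acme ", "revenue": "1200"}]

def transform(rows):
    return [{"name": r["name"].strip(), "revenue": int(r["revenue"])} for r in rows]

def load(rows, target):
    target.extend(rows)

# ETL: transform on the way in, so the warehouse only ever holds clean data
warehouse = []
load(transform(extract()), warehouse)

# ELT: load the raw data first, then transform inside the target system
data_lake = []
load(extract(), data_lake)
data_lake[:] = transform(data_lake)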
Real-time integration
One of the more recent data integration methods, real-time integration, processes
and transforms data upon collection and extraction. This method utilizes CDC
(Change Data Capture) techniques, among others, and is helpful for data
processing that requires near-instant use.
These same concepts utilized in data integration have also been applied to the
individual steps within the larger integration process, such as data transformation.
More specifically, both batch data processing and cloud technology, utilized in
real-time integration, have been crucial in developing successful data
transformation processes and data transformation tools. Now, let’s take a closer
look at the types of data transformation processes.
First party data (data you collect yourself about your company and your
customers) is rapidly growing in value. Your ability to transform and use
that data to drive decisions and strategies will increasingly become the
source of competitive advantage.
- Rich Edwards, CEO of Mindspan Systems
As many companies turn to cloud-based systems (IBM reports that 81% of companies use multiple cloud-based systems), end-users of that data are also looking for more versatile methods to transform data. Interactive data transformation, also referred to as real-time data transformation, uses concepts similar to those seen in real-time integration and ELT processing.
In addition to the various types of data transformation, developers can also utilize
a variety of transformation languages to transform formal language text into a
more useful and readable output text. There are four main types of data
transformation languages: macro languages, model transformation languages,
low-level languages, and XML transformation languages.
The most commonly used languages in data transformation include ATL, AWK, identity transform, QVT, TXL, XQuery, and XSLT. Ultimately, before deciding
what transformation method and language to use, data scientists must consider
the source of the data, the type of data being transformed, and the project’s
objective.
Now that we’ve covered the bigger picture of how data transformation fits into
the larger picture of data integration, we can examine the more granular steps in
data transformation itself. Firstly, it is important to note that while it's possible to
transform data manually, today, companies rely on data transformation tools to
partially or fully transform their data. Either way, manual and automated data
transformation involves the same steps detailed below.
1. Data discovery and parsing
The first step in the data transformation process involves data discovery and data parsing. Data discovery and data parsing are processes that involve collecting data, consolidating data, and reorganizing data for specific market insights and business intelligence. At Coresignal, we can offer you parsed, ready-to-use data.
2. Data mapping and translation
Once you have profiled your data and decided how you want to transform your
data, you can perform data mapping and translation. Data mapping and translation
refer to the process of mapping, aggregating, and filtering said data so it can be
further processed. For example, in batch transformation, this step would help
filter and sort the data in batches so executable code can run smoothly.
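A hedged pandas sketch of this mapping and filtering idea (the column names and mapping table are purely illustrative):

import pandas as pd

# Illustrative raw records with inconsistent country codes
raw = pd.DataFrame({
    "company": ["Acme", "Globex", "Initech", "Umbrella"],
    "country": ["US", "USA", "DE", "Germany"],
    "employees": [120, 5400, 87, 0],
})

# Mapping step: translate source values into a single canonical format
country_map = {"US": "United States", "USA": "United States",
               "DE": "Germany", "Germany": "Germany"}
raw["country"] = raw["country"].map(country_map)

# Filtering and sorting step: keep usable rows and order them for downstream processing
prepared = raw[raw["employees"] > 0].sort_values("employees", ascending=False)
print(prepared)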
3. Code generation
This step involves code generation, in which developers work with executable coding languages such as SQL, Python, R, or other executable instructions. During this stage, developers work closely with transformation technologies, also known as code generators. Code generators provide developers with a visual design environment and can run on multiple platforms, making them a favorite among developers.
4. Code execution
Now that the code has been developed, it can be run against your data. Also known as code execution, this step is the last stage the data passes through before reaching human end-users.
5. Data review
Once the code has been executed against the data, the result is ready for review. Similar to a quality assurance check, the purpose of this step is to make sure the data has been transformed properly. This step is iterative, in that end-users of the data are responsible for reporting any errors they find in the transformed data to the developers, so that edits to the code can be made.
The recent advancements in big data have required businesses to look elsewhere when storing, processing, and analyzing their data. Moreover, the increasing variety of data sources has also contributed to the strain being placed on data warehouses. In particular, while companies acquire powerful raw data from data types such as firmographic data, employee data, and social media data, these same data types typically produce very large file sizes. Consequently, companies have been searching for alternative methods.
This search has greatly impacted data integration processes, specifically data
transformation. That is, companies have been transitioning from traditional data
integration processes, such as ETL methods, to cloud-based integration
processes, such as ELT and real-time integration.
In the past, many companies have relied on local servers for data storage, making
ETL integration the preferred method. However, due to the significant increase
in digital communication and business operations in 2020, global data creation is
now modeled at a CAGR of 23%, according to Businesswire. Subsequently, the
upward trend in global data creation has put a strain on local servers and data
storage, and many businesses are looking elsewhere for cloud-based solutions.
Questions:
3. Explain why row or column percentages are used when comparing two
variables.
14. How would you assess the consistency of results across multiple
batches?
18. How can you identify outliers in a scatterplot, and what effect do they
have on a resistant line?
19. Describe how a scatterplot can indicate the strength and direction of a
relationship between variables.
20. Explain the difference between a least squares regression line and a
resistant line.