
LESSON PLAN

UNIT IV BIVARIATE ANALYSIS 8

Relationships between Two Variables - Percentage Tables - Analyzing Contingency Tables - Handling Several Batches - Scatterplots and Resistant Lines - Transformation.
PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION        Lecture no: 1
Subject Code: AD3301                                    Date: 12.08.2024    Day: 1
Unit 4: Analyzing contingency tables                    Hour:
Topic covered: Percentage Tables

S.No  Time     Structure
1     2 Mins   Attendance
2     2 Mins   Technical Terms
3     7 Mins   Revision: Data and data collection
4     1 Min    Objective: To develop dynamic percentage tables that accurately reflect the distribution of data across different categories, ensuring clarity and usability.
5     35 Mins  Content: Business & Finance - Education - Technology - Human Resources
6     3 Mins   Questions by Students: What is the distribution of a categorical variable? How do two categorical variables relate to each other?
7     3 Mins   Revision and Questions: What is exploratory data analysis? How are different groups represented in the dataset?
8     1 Min    Outcome: The student should be able to understand the basics of EDA and its basic terms.
9     1 Min    Next Class: Analyzing contingency tables

Remarks:

Faculty Incharge
PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION        Lecture no: 2
Subject Code: AD3301                                    Date:    Day:
Unit 4: EXPLORATORY DATA ANALYSIS                       Hour:
Topic covered: Analyzing contingency tables

S.No  Time     Structure
1     2 Mins   Attendance
2     2 Mins   Technical Terms
3     7 Mins   Revision: Fundamentals of exploratory data analysis
4     1 Min    Objective: Analyzing contingency tables allows you to assess the relationship between two categorical variables by observing how the frequency distribution of one variable varies with the other.
5     35 Mins  Content: Analyzing contingency tables involves examining the relationship between two categorical variables by assessing the joint distribution of their categories.
6     3 Mins   Questions by Students: What is the relationship between two categorical variables? How can I interpret the percentages in a contingency table?
7     3 Mins   Revision and Questions: What do the margins of the contingency table tell me? How can I identify the most common combinations of categories?
8     1 Min    Outcome: The student should be able to understand data science and the significance of EDA.
9     1 Min    Next Class: Handling several batches

Remarks:

Faculty Incharge
PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION        Lecture no: 3
Subject Code: AD3301                                    Date:    Day:
Unit 4: EXPLORATORY DATA ANALYSIS                       Hour:
Topic covered: Handling several batches

S.No  Time     Structure
1     2 Mins   Attendance
2     2 Mins   Technical Terms
3     7 Mins   Revision: Data science process and significance of EDA
4     1 Min    Objective: Efficiently process multiple batches by ensuring consistent quality control and timely completion, optimizing the workflow for enhanced productivity.
5     35 Mins  Content: Batch Organization - Data Integration - Performance and Efficiency
6     3 Mins   Questions by Students: How do I combine data from multiple batches? How do I analyze and compare data from different batches?
7     3 Mins   Revision and Questions: How do I handle inconsistencies between batches?
8     1 Min    Outcome: The student should be able to understand making sense of data and comparing EDA with classical and Bayesian analysis.
9     1 Min    Next Class: Scatterplots and resistant lines

Remarks:

Faculty Incharge
PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION        Lecture no: 4
Subject Code: AD3301                                    Date:    Day:
Unit 4: EXPLORATORY DATA ANALYSIS                       Hour:
Topic covered: Scatterplots and resistant lines

S.No  Time     Structure
1     2 Mins   Attendance
2     2 Mins   Technical Terms
3     7 Mins   Revision: Making sense of data, comparing EDA with classical and Bayesian analysis
4     1 Min    Objective: Scatterplots visualize the relationship between two variables, displaying data points on a Cartesian plane.
5     35 Mins  Content: Interpreting Scatterplots - Types of Resistant Lines
6     3 Mins   Questions by Students: Introduction to resistant lines (robust regression); creating scatterplots with resistant lines.
7     3 Mins   Revision and Questions: What are the tools used to perform exploratory data analysis?
8     1 Min    Outcome: The student should be able to understand software tools for exploratory data analysis.
9     1 Min    Next Class: Transformation

Remarks:

Faculty Incharge
PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION        Lecture no: 5
Subject Code: AD3301                                    Date:    Day:
Unit 4: EXPLORATORY DATA ANALYSIS                       Hour:
Topic covered: Transformation

S.No  Time     Structure
1     2 Mins   Attendance
2     2 Mins   Technical Terms
3     7 Mins   Revision: Making sense of data, comparing EDA with classical and Bayesian analysis
4     1 Min    Objective: In exploratory data analysis (EDA), transformation involves applying mathematical functions to data to better meet analytical objectives.
5     35 Mins  Content: The goal is to make data more suitable for analysis by addressing issues like skewness, variance, or non-linearity.
6     3 Mins   Questions by Students: Why should I use transformations on my data? What are the effects of common transformations on the data?
7     3 Mins   Revision and Questions: How can I determine if a transformation is successful? How do I back-transform data to its original scale?
8     1 Min    Outcome: The student should be able to understand the visual aids for exploratory data analysis.
9     1 Min    Next Class:

Remarks:

Faculty Incharge
Technical Terms

1. Covariance
   Literal Meaning: Relationship between two variables.
   Technical Meaning: A measure of the relationship between two random variables and the extent to which they change together.

2. Scatter Plot
   Literal Meaning: A graph representation used to compare two variables.
   Technical Meaning: A type of plot or mathematical diagram using Cartesian coordinates to display values of two variables for a set of data.

3. Data Transformation
   Literal Meaning: Changing the form of data.
   Technical Meaning: The application of a deterministic mathematical function to each point in a data set; that is, each data point zi is replaced with the transformed value yi = f(zi), where f is a function.

4. Histogram
   Literal Meaning: A diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.
   Technical Meaning: A graph showing frequency distributions, i.e. the number of observations within each given interval.

5. Visualization
   Literal Meaning: The act of forming a picture of somebody/something in your mind.
   Technical Meaning: The process of finding trends and correlations in data by representing it pictorially is called data visualization.

6. Batch Processing
   Literal Meaning: A large volume of data is processed in batches.
   Technical Meaning: The method computers use to periodically complete high-volume, repetitive data jobs.
UNIT IV BIVARIATE ANALYSIS
4.1 Relationship between Two Variables
4.2 Percentage Tables
4.3 Analyzing Contingency Tables
4.4 Handling Several Batches
4.5 Scatterplots and Resistant Lines
4.6 Transformations

4.1 RELATIONSHIP BETWEEN TWO VARIABLES

Analyzing and visualizing variables one at a time is not enough. To draw sound conclusions when performing exploratory data analysis, we need to understand how the variables in a dataset interact with each other. There are numerous ways to analyze this relationship visually; one of the most common is the scatterplot. However, scatterplots come with certain limitations, which we will see in later sections. Quantitatively, covariance and correlation are used to describe the relationship between variables.

Scatterplots

A scatterplot is one of the most common visual forms for comprehending the relationship between two variables at a glance. In its simplest form, it is nothing but a plot of Variable A against Variable B: one is plotted on the x-axis and the other on the y-axis.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.style.use('seaborn-whitegrid')

df = pd.read_csv('weight-height.csv')
df.head()

plt.plot(df.Height, df.Weight, 'o', markersize=2, color='brown')
plt.xlabel('Height')
plt.ylabel('Weight')

Image 1: Scatterplot Height vs Weight: Positive Relationship

In the above graph, it's easy to see that there seems to be a positive relationship between the two variables, i.e. as one increases the other increases as well. A scatterplot with a negative relationship, i.e. as one variable increases the other decreases, may take the form of Image 2.

# Just for demonstration purposes, 'a' and 'b' are synthetic data
import numpy as np
import matplotlib.pyplot as plt

a = np.random.rand(100) * 70
b = 100 - a

plt.plot(a, b, 'o', markersize=2, color='brown')
plt.xlabel('a')
plt.ylabel('b')
Image 2: Scatterplot a Vs b: Negative Relationship

A scatterplot with no apparent relationship between the two variables would take
the form of Image 3:

import numpy as np
import matplotlib.pyplot as plt

a = np.random.rand(1000) * 70
b = np.random.rand(1000) * 100

plt.plot(a, b, 'o', markersize=2, color='brown')
plt.xlabel('a')
plt.ylabel('b')

Image 3: Scatterplot a Vs b: No apparent relationship


x = [1,1,1,1,2,2,2,2,3,3,3,3]
y = [10,15,15,15,16,16,20,20,20,25,25,25]

plt.plot(x, y, 'o', markersize=5, color='brown')
plt.xlabel('X')
plt.ylabel('Y')

Image 4: Scatterplot X vs Y: Discrete Variables

In the above image, not all points are visible because several of them overlap. To overcome this, we add a small amount of random noise to the data, called "jitter". The process, naturally called jittering, allows for a somewhat clearer visualization of those overlapped points.

def Jitter(values, jitter):
    n = len(values)
    return np.random.uniform(-jitter, +jitter, n) + values

y1 = Jitter(y, 0.9)
plt.plot(x, y1, 'o', markersize=4, color='brown')
plt.xlabel('X')
plt.ylabel('Y1')
Image 5: Scatterplot X vs Y1: With Jitter added

As seen in Image 5, more data points are now visible. However, jitter should be
used only for visualization purposes and should be avoided for analysis purposes.

There can be an overlap of data in the case of continuous variables as well, where
overlapping points can hide in the dense part of the data and outliers may be given
disproportionate emphasis as seen in Image 1. This is called Saturation.
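A common way to reduce saturation in a dense scatterplot is to draw the points with partial transparency, so that heavily overlapped regions appear darker. The snippet below is a minimal sketch using matplotlib's alpha parameter; it assumes the same weight-height.csv file used in the earlier example.

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('weight-height.csv')  # same file as in the earlier example

# alpha < 1 makes each point translucent, so dense regions show up darker
plt.plot(df.Height, df.Weight, 'o', markersize=2, alpha=0.2, color='brown')
plt.xlabel('Height')
plt.ylabel('Weight')
plt.show()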

Scatterplots come with their own disadvantages: they do not provide a quantitative measurement of the relationship, but simply show how the variables change together. We also can't use scatterplots to display the relationship between more than two variables. Covariance and correlation address both of these problems.

Covariance

Covariance measures how two variables vary together. A positive covariance means that the variables vary together in the same direction, a negative covariance means they vary in opposite directions, and a covariance of 0 means that the variables do not vary together linearly. In other words, if there are two variables X and Y, positive covariance means a larger value of X tends to go with a larger value of Y, and negative covariance means a larger value of X tends to go with a smaller value of Y.

Mathematically, Cov(x,y) is given by

Cov(x,y) = (1/n) * Σ (dxi * dyi)

where dxi = xi - xmean and dyi = yi - ymean. Note that this is the formula for the covariance of a population; when calculating the covariance of a sample, 1/n is replaced by 1/(n-1). Why that is so is beyond the scope of this material.

Let's understand this with an example:

Consider,
x = [34, 56, 78, 23]
y = [20, 45, 91, 16]
=> xmean = 47.75
=> ymean = 43
=> Sum of (dxi*dyi) = (34-47.75)*(20-43) + (56-47.75)*(45-43) + (78-47.75)*(91-43) + (23-47.75)*(16-43) = 2453
=> Cov(x,y) = 2453/4 = 613.25

In the above example, we can clearly see that as x increases, y increases too and
hence we get a positive covariance. Now, let’s consider that x and y have units. x
is height in ‘cm’ and y is weight in ‘lbs’. The unit for covariance would then be
cm-lbs. Whatever that means!

Covariance can practically take any value, so it tells us the direction of the relationship but not exactly how strong it is, and this makes covariance difficult to interpret. Correlation, which always lies in the range -1 to 1, overcomes these disadvantages.
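As a quick check, the covariance from the worked example above can be reproduced with NumPy; this is a minimal sketch in which np.cov with bias=True uses the population formula (dividing by n), matching the hand calculation.

import numpy as np

x = np.array([34, 56, 78, 23])
y = np.array([20, 45, 91, 16])

# Population covariance: divide by n (bias=True); the sample version divides by n-1
cov_xy = np.cov(x, y, bias=True)[0, 1]
print(cov_xy)  # 613.25, matching the hand calculation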

Correlation

Correlation also provides quantitative information regarding the relationship between variables. Measuring correlation can be challenging if the variables have different units or if their data distributions differ from each other. Two methods of calculating correlation help with these issues: 1) Pearson correlation and 2) Spearman rank correlation.

Both these methods of calculating correlation involve transforming the data in the
variables being compared to some standard comparable format. Let’s see what
transformations are done in both these methods.

Pearson Correlation

Pearson correlation involves transforming each of the values in the variables to a standard score or Z score, i.e. finding how many standard deviations each value is from the mean, and then averaging the corresponding products of the standard scores.

Z score = (Xi - Xmean) / Sigma, where Sigma is the standard deviation.

Suppose we have two variables 'x' and 'y':

Z score of x, i.e. Zx = (x - xmu) / Sx
where xmu is the mean and Sx is the standard deviation of x.

Translating this into the Pearson correlation (p):
=> pi = Zxi * Zyi
=> pi = ((xi - xmean) * (yi - ymean)) / (Sx * Sy)
=> p = mean of the pi values
=> p = (sum of all values of pi) / n
=> p = (summation (xi - xmean) * (yi - ymean)) / (Sx * Sy * n)

As seen above, (summation (xi - xmean) * (yi - ymean)) / n is actually Cov(x,y),
so we can rewrite the Pearson correlation (p) as Cov(x,y) / (Sx * Sy).

NOTE: Here, pi is not the same as the mathematical constant Pi (22/7).
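The same relationship can be verified numerically; the sketch below is a minimal illustration in NumPy, using the population standard deviation so that it matches the Cov(x,y)/(Sx*Sy) form derived above.

import numpy as np

x = np.array([34, 56, 78, 23])
y = np.array([20, 45, 91, 16])

# Pearson correlation as covariance divided by the product of standard deviations
cov_xy = np.cov(x, y, bias=True)[0, 1]
p = cov_xy / (x.std() * y.std())    # population std (ddof=0) to match the 1/n covariance

print(p)
print(np.corrcoef(x, y)[0, 1])      # NumPy's built-in Pearson correlation, same value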

Pearson correlation ‘p’ will always be in the range of -1 to 1. A positive value of


‘p’ means as ‘x’ increases ‘y’ increases too, negative means as ‘x’ increases ‘y’
decreases and 0 means there is no apparent linear relationship between ‘x’ and ‘y’.
Note that a zero Pearson correlation doesn’t imply ‘no relationship’, it simply
means that there isn’t a linear relationship between ‘x’ and ‘y’.

A Pearson correlation of p = 1 means a perfect positive linear relationship; a value of 0.5 or 0.4 implies a positive relationship that is not as strong. The magnitude of the Pearson correlation determines the strength of the relationship.

But again, Pearson correlation comes with certain disadvantages. It does not work well if there are outliers in the data, since it is strongly affected by them. Pearson correlation works well when the change in variable y with respect to variable x is linear, i.e. when the change happens at a constant rate, when x and y are both roughly normally distributed, and when the data is on an interval scale.

These disadvantages of Pearson correlation can be overcome using the Spearman


Rank Correlation.

Spearman Rank Correlation


In the Spearman method, we transform each of the values in both variables to its
corresponding rank in the given variable and then calculate the Pearson correlation
of the ranks.

Consider x = [23,98,56,1,0,56,1999,12],
Corresponding Rankx = [4,7,5,2,1,6,8,3]
Similarly, for y = [5,92,88,45,2,54,90,1],
Corresponding Ranky = [3,8,6,4,2,5,7,1]

Looking at Rankx and Ranky, the advantage of this method is apparent. Neither Rankx nor Ranky contains outliers: even if the actual data has outliers, each outlier is converted into a rank, which is nothing but the relative position of the number in the dataset. Hence, this method is robust against outliers. It also solves the problem of differing data distributions, since the distribution of the ranks is always uniform. We then calculate the Pearson correlation of Rankx and Ranky using the formula seen in the Pearson correlation section.
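A minimal sketch of this rank-then-correlate idea is shown below, using SciPy; scipy.stats.spearmanr performs the same rank transformation and Pearson calculation internally, so it should agree with the manual version (note that rankdata assigns tied values their average rank, slightly different from the arbitrary tie-breaking in the hand-worked ranks above).

import numpy as np
from scipy import stats

x = np.array([23, 98, 56, 1, 0, 56, 1999, 12])
y = np.array([5, 92, 88, 45, 2, 54, 90, 1])

# Convert each value to its rank, then take the Pearson correlation of the ranks
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)
manual_spearman = np.corrcoef(rank_x, rank_y)[0, 1]

# SciPy's built-in Spearman rank correlation
builtin_spearman, p_value = stats.spearmanr(x, y)

print(manual_spearman, builtin_spearman)  # the two values should match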

The Spearman rank method works well:

1. When x changes as y does, but not necessarily at a constant rate i.e. when there

is a non-linear relationship between x and y

2. When x and y have different data distributions or non-normal distribution

3. If you want to avoid the effect of outliers

4. When data is on an ordinal scale

Spearman should be avoided when there is a chance of ranks overlapping.


There are many other methods used to determine the relationship between variables.

4.2 PERCENTAGE TABLES

Displaying data as a "percentage of" can be defined within the visualization properties for Tables and Pivot Tables. However, the percentage-of option does not exist for all visualizations, such as Bar Graphs, and thus requires these percentage values to be calculated. Other visuals, like the Pie Chart or Donut, do support displaying the data as a percentage of the total.

Take the following baseline example of car accidents by Body Type and whether drinking was involved:

The percentage-of can be displayed for each column using the Value tab of the visualization properties. This results in the following view of the same table (as above), showing the percentage-of values.

The above setting will only show the percentage-of with respect to the same column (the distribution within the same column, black arrow below) and not as a percentage of Body Type (green arrow below). The property section does provide the option to change the relation for displaying the percentage to other columns, e.g. Body Type, but this is not calculated correctly within DV (no further investigation of this issue):

As mentioned at the beginning, for some graphs like Bar Graphs there is no option to display the data as percentage-of values, since this is not the best visualization for representing distributions; Bar Graphs will therefore only show the absolute values.

In order to display the correct percentage-of in a table or in a graph, it needs to be calculated as a percentage beforehand. The formula for calculating the percentage is:

percentage value = SUM(occurrences) / COUNT(overall)

In order to calculate the correct sum of the occurrences, the formula for decoding the attribute field into 1 and 0 has to be applied in the data preparation step (on the Prepare tab). Otherwise, the SUM will not calculate correctly when using "My Calculations", since the calculation is not applied on each row but only on the aggregated result, as shown below with the bottom red box.

This is because the CASE WHEN formula is not evaluated row by row, but only applied after aggregation. Hence, the decoding of values for calculating the percentage must be placed within the data preparation step. After this preparation, the formula can be defined as shown below (note the multiplication by 1.00 to convert to a decimal number that represents the percentage):

This gives the intended result for correct Percentage Of values within a Bar Chart
and Table View:
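Outside of a BI tool, the same kind of percentage table can be produced directly in pandas; the sketch below is a minimal illustration with hypothetical accident data (the body_type and drinking columns are made up for this example) using pd.crosstab with the normalize option to get column or row percentages.

import pandas as pd

# Hypothetical data for illustration: accident records with body type and drinking flag
df = pd.DataFrame({
    'body_type': ['Sedan', 'Sedan', 'SUV', 'SUV', 'SUV', 'Truck', 'Truck', 'Sedan'],
    'drinking':  ['Yes',   'No',    'No',  'Yes', 'No',  'No',    'Yes',   'No'],
})

counts = pd.crosstab(df['body_type'], df['drinking'])                       # raw frequencies
col_pct = pd.crosstab(df['body_type'], df['drinking'],
                      normalize='columns') * 100                            # percentage within each column
row_pct = pd.crosstab(df['body_type'], df['drinking'],
                      normalize='index') * 100                              # percentage within each body type

print(counts)
print(col_pct.round(1))
print(row_pct.round(1))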
4.3 ANALYZING CONTINGENCY TABLES

The most common way to represent the joint distribution of two categorical variables is a contingency table, or, as some statisticians call it, a cross table.

Imagine you are an investment manager and you manage stocks, bonds and real
estate investments for three different investors.

Each of the investors has a different idea of risk. Hence, their money is allocated
in a different way among the three asset classes.

Using a Contingency Table

A contingency table representing all the data looks like the following.

In the picture below, you can clearly see the rows showing the type of investment
that’s been made and the columns with each investor’s allocation.
It is a good practice to calculate the totals of each row and column because it is
often useful in further analysis.

Notice that the subtotals of the rows give us total investment in stocks, bonds and
real estate.

On the other hand, the subtotals of the columns give us the holdings of each
investor.
How to Visualize it

Once we have created a contingency table, we can proceed by visualizing the data
onto a plane.

A very useful chart in such cases is a variation of the bar chart called the side-by-side bar chart. It represents the holdings of each investor in the different types of assets. As you can see in the picture below, stocks are in green, bonds are in red and real estate is in blue.
Why it is Called Side-by-Side Bar Chart

The name of this type of chart comes from the fact that for each investor, the
categories of assets are represented side by side. In this way, we can easily
compare asset holdings for a specific investor or among investors.
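A minimal sketch of this investor example in pandas and matplotlib is shown below; the allocation figures are hypothetical, since the original table is only shown as a picture, but the structure (asset classes as rows, investors as columns) follows the description above.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical allocations (in $1000s): rows are asset classes, columns are investors
table = pd.DataFrame(
    {'Investor A': [96, 181, 88],
     'Investor B': [185, 3, 152],
     'Investor C': [39, 29, 142]},
    index=['Stocks', 'Bonds', 'Real Estate'])

# Row and column totals, often useful for further analysis
totals = table.copy()
totals['Total'] = totals.sum(axis=1)
totals.loc['Total'] = totals.sum(axis=0)
print(totals)

# Side-by-side bar chart: for each investor, the asset categories appear next to each other
table.T.plot(kind='bar', color=['green', 'red', 'blue'])
plt.ylabel('Holdings')
plt.show()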

Important: All graphs are very easy to create and read after you have:

• identified the type of data you are dealing with


• decided on the best way to visualize it.

How to Visualize Numerical Variables?

Finally, we would like to conclude with a very important graph – the scatter plot.

It is used when representing two numerical variables. For this example, we will
be looking at the reading and writing SAT scores of 100 individuals.

So, let’s take a look at the graph before analyzing it.

Analyzing the Graph


1. First, SAT scores by component range from 200 to 800 points. That is why our
data is bound within the range of 200 to 800.

1. Second, our vertical axis shows the writing scores, while the horizontal axis
contains reading scores.

1. Third, there are 100 students and their results correspond to a specific point on
the graph. Each point gives us information about a particular student’s
performance.

For example, the point in the picture below is Jane.


It is evident that she scored 300 on writing and 550 on the reading part.

4.4 HANDLING SEVERAL BATCHES


Batch processing is a data processing technique where data is collected,
processed, and stored in batches, rather than being processed in real-time. It
allows organizations to process vast amounts of data efficiently and cost-
effectively. Batch processing involves three key steps: data collection, data
processing, and data storage.

Data Collection

The first step in batch processing is data collection. This involves gathering and
consolidating data from various sources into a centralized repository. The data is
usually stored in files or databases, ready for further processing.

Batch processing is particularly well-suited for scenarios where data accumulates


over time and can be processed in discrete chunks. For example, financial
institutions often use batch processing to reconcile transactions at the end of the
day or generate monthly reports based on accumulated data.

Data Processing

Once the data is collected, it is processed in batches. Unlike real-time processing,


where data is processed immediately as it arrives, batch processing waits for a
predetermined amount of data to accumulate before initiating the processing task.
This approach enables optimizations and efficiencies during data processing.
Batch processing is designed to handle large volumes of data efficiently, allowing
organizations to process data in parallel and take advantage of distributed
computing resources. This scalability is critical when dealing with big data
workloads that surpass the processing capabilities of a single machine.

Data Storage

After processing, the results are stored back into the database or files. This
transformed data is then used for reporting, analysis, or as input for other systems.

Advantages of Batch Processing

Batch processing offers several advantages that make it a preferred choice for
handling large volumes of data:

• Scalability: Batch processing is highly scalable, making it ideal for


organizations dealing with enormous datasets. As data volume grows,
batch processing can handle the load efficiently by processing data in
manageable chunks.
• Cost-Effectiveness: By processing data in scheduled or periodic batches,
organizations can optimize resource utilization, reducing costs associated
with real-time processing systems.
• Reduced Complexity: Batch processing systems are often simpler to
design, implement, and maintain compared to real-time systems. This
simplicity makes it easier to troubleshoot and debug potential issues.
• Fault Tolerance: Batch processing can be made fault-tolerant by
implementing mechanisms to handle errors gracefully. If a job fails during
processing, it can be rerun without affecting the entire system.
• Consistency: Since batch processing operates on fixed datasets, it ensures
consistent results over time. This consistency is essential for tasks like
financial reporting or data reconciliation.
Implementing Batch Processing

Now, let us dive into some practical examples of batch-processing


implementation using Python and Apache Spark. We’ll explore how to process
large CSV files, perform data transformations, and store the results back in a
database.

Batch Processing with Python


For this example, we’ll use Python and Pandas, a powerful data manipulation
library. We assume that you have Python and Pandas installed on your system.

Python

import pandas as pd

# Function to process each batch
def process_batch(batch_data):
    # Perform data transformations here
    processed_data = batch_data.apply(lambda x: x * 2)  # Example transformation: doubling the values
    return processed_data

# Read the CSV file in chunks
chunk_size = 100000                   # Adjust the chunk size as per your system's memory constraints
csv_file = "large_data.csv"           # Replace with your CSV file's path
output_file = "processed_data.csv"    # Replace with the desired output file path

# Read the CSV file in batches and process each batch
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
    processed_chunk = process_batch(chunk)
    processed_chunk.to_csv(output_file, mode="a", header=False, index=False)

In this example, we define a function process_batch() to perform data


transformations on each batch. The CSV file is read in chunks using Pandas’
read_csv() function, and each batch is processed using the process_batch()
function. The processed data is then appended to an output file.

Batch Processing with Apache Spark

Apache Spark is a distributed computing framework that excels at processing


large-scale datasets. It provides built-in support for batch processing. For this
example, we’ll use PySpark, the Python API for Apache Spark.
Ensure you have Apache Spark and PySpark installed on your system before
proceeding.

Python

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("BatchProcessing").getOrCreate()

# Define the data processing logic as a function
def process_batch(batch_data):
    # Perform data transformations here using Spark DataFrame operations
    processed_data = batch_data.selectExpr("col1 * 2 as col1", "col2 + 10 as col2")  # Example transformations
    return processed_data

csv_file = "hdfs://path/to/large_data.csv"   # Replace with your CSV file's HDFS path
output_dir = "hdfs://path/to/output_dir/"    # Replace with the desired output directory path

# Load the CSV data into a Spark DataFrame
df = spark.read.format("csv").option("header", "true").load(csv_file)

# Apply the transformation; Spark executes it on the DataFrame's partitions in parallel
processed_df = process_batch(df)

# Write the processed data back to storage in CSV format
processed_df.write.mode("append").csv(output_dir)

In this PySpark example, we create a SparkSession and define the process_batch() function, which performs data transformations using Spark DataFrame operations. The CSV data is read into a Spark DataFrame, the transformation is applied, and Spark processes the DataFrame's partitions (batches) in parallel. The processed data is then written back to storage in CSV format.

4.5 SCATTERPLOTS AND RESISTANT LINES

Scatter plots usually represent lots and lots of observations. When interpreting a scatter plot, a statistician is not expected to look at single data points; they are much more interested in getting the main idea of how the data is distributed.

First Observation

The first thing we see is that there is an obvious uptrend.


This is because lower writing scores are usually obtained by students with lower
reading scores. Similarly, higher writing scores have been achieved by students
with higher reading scores. This is because the two tasks are closely related.

Second Observation

We notice a concentration of students in the middle of the graph with scores in


the region of 450 to 550 on both reading and writing.

We already mentioned that scores can be anywhere between 200 and 800. Well,
500 is the average score one can get, so it makes sense that a lot of people fall
into that area.
Third Observation

There is a group of people with both very high writing and reading scores.
The exceptional students tend to be excellent at both components.

This is less true for bad students as their performance tends to deviate when
performing different tasks.

Fourth Observation

Finally, we have Jane from a few paragraphs ago. She is far away from every
other observation as she scored above average on reading but poorly on writing.
We call this observation an outlier as it goes against the logic of the whole
dataset.
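Although this section's examples focus on reading scatterplots, a resistant (robust) line can also be fitted and compared with ordinary least squares. The sketch below is a minimal illustration using SciPy's Theil-Sen estimator (scipy.stats.theilslopes), which is based on medians of pairwise slopes and is therefore much less affected by an outlier such as Jane's point. The data here is synthetic, standing in for the SAT scores, which are only shown as a picture.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in for the reading/writing scores, plus one outlier like "Jane"
rng = np.random.default_rng(0)
reading = rng.uniform(200, 800, 100)
writing = 0.8 * reading + 100 + rng.normal(0, 40, 100)
reading = np.append(reading, 550)
writing = np.append(writing, 300)   # the outlier

# Resistant line (Theil-Sen): based on the median of pairwise slopes
ts_slope, ts_intercept, _, _ = stats.theilslopes(writing, reading)

# Ordinary least squares line for comparison
ls_slope, ls_intercept, _, _, _ = stats.linregress(reading, writing)

xs = np.linspace(200, 800, 100)
plt.plot(reading, writing, 'o', markersize=3, color='brown')
plt.plot(xs, ts_slope * xs + ts_intercept, label='Resistant (Theil-Sen) line')
plt.plot(xs, ls_slope * xs + ls_intercept, '--', label='Least squares line')
plt.xlabel('Reading score')
plt.ylabel('Writing score')
plt.legend()
plt.show()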

The Proper Ways to Visualize 2 Variables

To sum up, after reading this tutorial, representing the relationship between 2
variables should be like a walk in the park for you. If the variables are categorical,
creating a contingency table should be a priority. After doing that, a side-by-side
bar chart will be a great way to visualize the data.

On the other hand, if the variables are numerical, a scatter plot will get the job done most of the time. It is extremely useful because it is quite easy to make observations based on it, and it is a great starting point for more complex analyses.

So, these are the basics when it comes to visualizing 2 variables. Now, you are ready to dive into the heart of descriptive statistics. The first concept you can master revolves around the measures of central tendency: mean, median, and mode.

4.6 TRANSFORMATIONS

From a general perspective, data transformation helps businesses take raw data
(structured or unstructured) and transform it for further processing, including
analysis, integration, and visualization. All teams within a company’s structure
benefit from data transformation, as low-quality unmanaged data can negatively
impact all facets of business operations. Some additional benefits of data
transformation include:

• Improved data organization and management


• Increased computer and end-user accessibility
• Enhanced data quality and reduced errors
• Greater application compatibility and faster data processing

Data integration

Before examining the various ways to transform data, it is important to take a step
back and look at the data integration process. Data integration processes multiple
types of source data into integrated data, during which the data undergoes
cleaning, transformation, analysis, loading, etc. With that, we can see that data
transformation is simply a subset of data integration.

Data integration as a whole involves extraction, transformation, cleaning, and


loading. Over time, data scientists have combined and rearranged these steps,
consequently creating four data integration processes: batch, ETL, ELT, and real-
time integration.

Batch integration

Another common method is batch data integration, which involves moving


batches of stored data through further transformation and loading processes. This
method is mainly used for internal databases, large amounts of data, and data that
is not time-sensitive.

ETL integration

Similar to ELT, ETL data processing involves data integration through extraction,
transformation, and loading. ETL integration is the most common form of data
integration and utilizes batch integration techniques.

ELT integration

ELT data processing involves data integration through extraction, loading, and
transformation. Similar to real-time integration, ELT applies open-source
tools and cloud technology, making this method best for organizations that need
to transform massive amounts of data at a relatively quick pace.

Real-time integration
One of the more recent data integration methods, real-time integration, processes
and transforms data upon collection and extraction. This method utilizes CDC
(Change Data Capture) techniques, among others, and is helpful for data
processing that requires near-instant use.

These same concepts utilized in data integration have also been applied to the
individual steps within the larger integration process, such as data transformation.
More specifically, both batch data processing and cloud technology, utilized in
real-time integration, have been crucial in developing successful data
transformation processes and data transformation tools. Now, let’s take a closer
look at the types of data transformation processes.

First party data (data you collect yourself about your company and your
customers) is rapidly growing in value. Your ability to transform and use
that data to drive decisions and strategies will increasingly become the
source of competitive advantage.
- Rich Edwards, CEO of Mindspan Systems

Types of data transformation

Batch data transformation

Batch data transformation, also known as bulk data transformation, involves


transforming data in groups over a period of time. Traditional batch data
transformation involves manual execution with scripted languages such as SQL
and Python and is now seen as somewhat outdated.

More specifically, batch transformation involves ETL data integration, in which


the data is stored in one location and then transformed and moved in smaller
batches over time. It is important to note the significance of batch data
transformation on many data integration processes, such as web application
integration, data warehousing, and data virtualization. When applied to other data
integration processes, the concepts and logistics within batch data transformation
can improve the overall integration process.

Interactive data transformation

As many companies turn to cloud-based systems (IBM even reports that 81% of companies use multiple cloud-based systems), end users of that data are also looking for more versatile methods to transform it. Interactive data transformation, also referred to as real-time data transformation, uses concepts similar to those seen in real-time integration and ELT processing.

Interactive data transformation is an expansion of batch transformation. However,


the steps are not necessarily linear. Gaining traction for its accessible end-user
visual interface, interactive data transformation takes previously generated and
inspected code to identify outliers, patterns, and errors within the data. It then
sends this information to a graphical user interface for human end-users to
quickly visualize trends, patterns, and more, within the data.

Data transformation languages

In addition to the various types of data transformation, developers can also utilize
a variety of transformation languages to transform formal language text into a
more useful and readable output text. There are four main types of data
transformation languages: macro languages, model transformation languages,
low-level languages, and XML transformation languages.

The most commonly used codes in data transformation include ATL, AWK,
identity transform, QVT, TXL, XQuery, and XSLT. Ultimately, before deciding
what transformation method and language to use, data scientists must consider
the source of the data, the type of data being transformed, and the project’s
objective.

The data transformation process

Now that we've covered how data transformation fits into the larger picture of data integration, we can examine the more granular steps in data transformation itself. Firstly, it is important to note that while it is possible to transform data manually, today companies rely on data transformation tools to partially or fully transform their data. Either way, manual and automated data transformation involve the same steps detailed below.

1. Data discovery and parsing

The first step in the data transformation process involves data discovery and data
parsing. Data discovery and data parsing are processes that involve collecting
data, consolidating data, and reorganizing data for specific market insights and
business intelligence. At Coresignal, we can offer you parsed, ready-to-use data.
2. Data mapping and translation

Once you have profiled your data and decided how you want to transform your
data, you can perform data mapping and translation. Data mapping and translation
refer to the process of mapping, aggregating, and filtering said data so it can be
further processed. For example, in batch transformation, this step would help
filter and sort the data in batches so executable code can run smoothly.

3. Programming and code creation

The programming step involves code generation, in which developers work with executable languages such as SQL, Python, or R, or other executable instructions. During this stage, developers work closely with transformation technologies, also known as code generators. Code generators provide developers with a visual design environment and can run on multiple platforms, making them a favorite among developers.

4. Transforming the data

Now that the code is developed, it can be run against your data. Also known as
code execution, this step is the last stage the data passes through before reaching
human end-users.
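As a concrete, hypothetical illustration of this execution step, grounded in the EDA-style transformations mentioned earlier in this unit (for example, reducing skewness), the sketch below applies a log transformation to a skewed numeric column with pandas and NumPy; the revenue column name and values are made up for the example.

import numpy as np
import pandas as pd

# Hypothetical right-skewed data: a 'revenue' column in a prepared dataset
df = pd.DataFrame({'revenue': [120, 150, 180, 210, 450, 900, 5200, 18000]})

# Log transformation reduces right skew; log1p handles zero values safely
df['revenue_log'] = np.log1p(df['revenue'])

# Back-transform to the original scale when reporting results
df['revenue_back'] = np.expm1(df['revenue_log'])

print(df)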

5. Reviewing the data

Once the code has executed, the data is ready for review. Similar to a quality assurance check, the purpose of this step is to make sure the data has been transformed properly. It is important to note that this step is iterative: end users of the data are responsible for reporting any errors they find in the transformed data to the developers so that edits to the code can be made.

Data extraction and transformation have an effect on other business


activities. When data is transformed into a more readable format, data
analysis can be completed more quickly and accurately than before. Not only
does this have an effect on employee morale, but it also has an impact on
company decision-making.
- Brian Stewart, CTO of ProsperoWeb

ETL vs. ELT

The recent advancements in big data have required businesses to look elsewhere when storing, processing, and analyzing their data. Moreover, the increasing variety of data sources has also contributed to the strain being placed on data warehouses. In particular, while companies acquire powerful raw data from data types such as firmographic data, employee data, and social media data, these same data types typically come in very large file sizes. Consequently, companies have been searching for alternative methods.

This search has greatly impacted data integration processes, specifically data
transformation. That is, companies have been transitioning from traditional data
integration processes, such as ETL methods, to cloud-based integration
processes, such as ELT and real-time integration.

In the past, many companies have relied on local servers for data storage, making
ETL integration the preferred method. However, due to the significant increase
in digital communication and business operations in 2020, global data creation is
now modeled at a CAGR of 23%, according to Businesswire. Subsequently, the
upward trend in global data creation has put a strain on local servers and data
storage, and many businesses are looking elsewhere for cloud-based solutions.

What is data transformation in ETL?

ETL, which stands for extraction, transformation, and loading, is a data


integration process that involves extracting data from various external sources,
often from third-party data providers, transforming the data into the appropriate
structure, and then loading that data into a company’s database. The ETL process
is considered the most common integration process compared to ELT, ETM, and
EMM transformation processes.

Data transformation within ETL occurs in the transformation step; however, it is


closely linked to the extraction and loading stages. Traditionally, data
transformation within the ETL method utilizes batch transformation with linear
steps, including discovery, mapping, programming, code execution, and data
review.

Questions:

1. What is the purpose of a percentage table in analyzing relationships between two variables?
2. How would you convert a contingency table into a percentage table?
3. Explain why row or column percentages are used when comparing two variables.
4. What is the advantage of using percentage tables over raw frequency tables?
5. How can percentage tables help in identifying trends or patterns between variables?
6. What is a contingency table, and how is it used in statistics?
7. Describe the concept of independence between two variables in the context of a contingency table.
8. How do you calculate the expected frequencies in a contingency table for a chi-square test?
9. What is the significance of the chi-square test when analyzing contingency tables?
10. Explain how marginal totals are used in contingency tables.
11. What does it mean to handle several batches in statistical analysis?
12. Why is it important to consider the variation between batches when analyzing data?
13. Describe a method for comparing means across several batches.
14. How would you assess the consistency of results across multiple batches?
15. Explain how batch effects can influence the outcome of a statistical analysis.
16. What is the purpose of a scatterplot in analyzing the relationship between two variables?
17. Define a resistant line and explain its importance in regression analysis.
18. How can you identify outliers in a scatterplot, and what effect do they have on a resistant line?
19. Describe how a scatterplot can indicate the strength and direction of a relationship between variables.
20. Explain the difference between a least squares regression line and a resistant line.
