Class 2 - Extraction, Transformation and Load (ETL)

Tecnológico de Monterrey - CCM

TC3006C
Advanced Artificial Intelligence for Data Science I

Module 2: Statistics for Data Science

Extraction, Transformation and Load (ETL)

Professor: Jesús Manuel Vázquez Nicolás
Introduction

● ETL stands for Extraction, Transformation, and Load.

● It is a process used to move and prepare data from original sources to storage systems.

● Main Objective: Collect data from multiple sources, process it, and load it into a target
environment, such as a data warehouse.



[Figure: ETL process pipeline overview, from "The Ultimate Guide To Setting-Up An ETL (Extract, Transform, and Load) Process Pipeline"]



The Importance of ETL in Data Science

● Quality and Consistency: ETL ensures that data is consistent and of high quality, essential
for reliable analysis.

● Data Integration: It facilitates the combination of data from multiple sources, providing a
unified view.

● Preparation for Analysis: Raw data is rarely ready for analysis. ETL prepares it to be easily
consumed by analysis tools and models.



What is Data Extraction?

Data Extraction is an essential phase in any data analysis process and represents the first
stage in the ETL process. Extraction specifically refers to the act of obtaining and collecting
data from various sources to be processed and analyzed.

Data can come from multiple sources, depending on the domain and purpose of the analysis.
Common examples of data sources include databases (SQL, NoSQL), files (CSV, Excel, JSON,
XML), APIs, logging systems, ERPs, CRMs, among others.

During the extraction, it is crucial to ensure that the data is obtained completely and without
alterations. Corrupt or incomplete data can bias or affect subsequent analyses.



Data Extraction with Python

Pandas: This is the most widely used Python library for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series, and it can read from a variety of file formats, such as CSV, Excel, JSON, and SQL, among others.
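
As a minimal sketch (the file names below are illustrative placeholders), reading tabular data with Pandas might look like this:

    import pandas as pd

    # Read tabular data from different file formats (file names are placeholders)
    df_csv = pd.read_csv("sales.csv")        # comma-separated values
    df_excel = pd.read_excel("sales.xlsx")   # Excel workbook (needs the openpyxl package)
    df_json = pd.read_json("sales.json")     # JSON records

    print(df_csv.head())                     # quick look at the first rows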

SQLAlchemy: This is a Python library that provides a complete set of enterprise-level operations for SQL. It offers a way to handle SQL and interact with SQL databases in a Pythonic manner.
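
A minimal sketch of querying a SQL database through SQLAlchemy and Pandas; the connection string and table name are placeholders:

    import pandas as pd
    from sqlalchemy import create_engine

    # The connection string and table name are placeholders for illustration
    engine = create_engine("sqlite:///example.db")

    # Pandas can read query results directly through a SQLAlchemy engine
    customers = pd.read_sql("SELECT * FROM customers", con=engine)
    print(customers.shape)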

Requests: This is a Python library for making HTTP requests. It is useful for extracting data from the web, especially when working with APIs.
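
A minimal sketch of pulling JSON from a web API with Requests; the URL is hypothetical:

    import requests

    # Hypothetical endpoint; most web APIs return JSON payloads
    response = requests.get("https://api.example.com/v1/users", timeout=10)
    response.raise_for_status()   # raise an exception on HTTP errors

    data = response.json()        # parse the JSON body into Python objects
    print(len(data), "records extracted")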



Data Transformation: An Introduction
Data transformation is a series of operations that change data from its original format/structure
to a format/structure required for further analysis or to feed a machine learning model. This step
is critical in the data preparation process, as the performance and accuracy of data analysis
techniques and machine learning algorithms depend largely on the quality of the prepared data.



Python Tools for Data Transformation
Python offers several libraries to assist with data transformation. Some of the most common
ones include Pandas for data manipulation and transformation, NumPy for numerical
computation, and Scikit-learn for data imputation and other feature transformations. These
libraries provide robust tools that simplify many complex transformation operations.
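
As a small sketch of how these libraries cooperate (the column is illustrative): Pandas holds the table, Scikit-learn imputes the missing value, and NumPy applies a numerical transform:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Illustrative data with one missing value
    df = pd.DataFrame({"income": [35000, np.nan, 52000, 61000]})

    # Scikit-learn: fill the missing value with the column mean
    imputer = SimpleImputer(strategy="mean")
    df["income"] = imputer.fit_transform(df[["income"]]).ravel()

    # NumPy: log transform to reduce skew
    df["log_income"] = np.log1p(df["income"])
    print(df)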



Focus on Data Quality

Data quality is vital in data analysis. The quality of your insights can only be as good as the quality of your data. Data cleaning, transformation, and enrichment are the critical steps to improve data quality, which includes accuracy, completeness, consistency, and relevance.



Data Cleaning
Data cleaning involves finding and correcting errors in the data, such as duplicates, spelling
mistakes, and anomalies. This is an essential part of the data transformation process. With
clean data, analysis and predictions are more reliable and errors are not propagated through
the data pipeline.
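
A small sketch of these cleaning steps in Pandas (the data and column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({"city": ["Monterrey", " monterrey", "CDMX", "CDMX"],
                       "sales": [120, 120, 95, 95]})

    # Normalize text to catch casing and whitespace inconsistencies
    df["city"] = df["city"].str.strip().str.lower()

    # Remove exact duplicate rows
    df = df.drop_duplicates()
    print(df)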



Data Encoding
Data encoding is the process of converting data from one form to another, primarily for the
purpose of model compatibility and data analysis. Machine learning models, for example,
require numerical input, making it necessary to transform categorical or textual data into a
numerical format.

● Label Encoding: This method involves converting each category of a column into a unique
integer. For example, for the column "Color" with values "Red", "Blue", "Green", the
encoded values might be 1, 2, and 3, respectively.

● One-Hot Encoding: In this method, for each unique value in a column, a new binary column
(0 or 1) is created. Using the previous "Color" example, this would result in three new
columns: "Is_Red", "Is_Blue", and "Is_Green".

● Binary Encoding: This combines ideas from label and one-hot encoding. It first assigns an integer to each category and then represents that integer in binary, with each binary digit stored in its own column.
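
A minimal sketch of the first two encodings using Pandas and Scikit-learn; the "Color" column mirrors the example above (note that Scikit-learn assigns 0-based integers):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

    # Label encoding: each category becomes a unique integer (0-based)
    df["Color_label"] = LabelEncoder().fit_transform(df["Color"])

    # One-hot encoding: one binary column per category
    one_hot = pd.get_dummies(df["Color"], prefix="Is")
    df = pd.concat([df, one_hot], axis=1)
    print(df)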





Data Cleaning
One of the most common data cleaning activities is dealing with missing values. Proper handling
of missing values is essential because it can significantly influence the outcomes of analyses or
predictive models.

When dealing with missing values, various strategies can be adopted, such as:

● Deletion: If a row (record) has a significant number of missing values, it might be entirely
removed.
● Imputation: Replace missing values with derived values, which could be the mean, median,
or mode of the column.
● Assigning a default value: For instance, in an "income" column, a missing value might be
replaced with "0", assuming that the absence of information means no income.
● Using models to predict missing values: Models like linear regression, k-nearest neighbors,
among others, can be used to estimate the missing value based on other available data.
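
The first three strategies can be sketched in Pandas as follows (column names and thresholds are illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 40, 33],
                       "income": [52000, 61000, np.nan, np.nan]})

    # Deletion: drop rows that have fewer than 2 non-missing values
    df_reduced = df.dropna(thresh=2)

    # Imputation: replace missing ages with the column median
    df["age"] = df["age"].fillna(df["age"].median())

    # Default value: assume a missing income means no income
    df["income"] = df["income"].fillna(0)
    print(df)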



Data Normalization
Data normalization is the process of changing the values in numeric columns in the dataset to a
common scale, without distorting differences in the ranges of values. Normalization is required
when features have different ranges, as many machine learning algorithms are sensitive to the
scale of the data.
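
A minimal sketch with Scikit-learn scalers (the column is illustrative): min-max scaling maps values to the [0, 1] range, while standardization centers them to zero mean and unit variance:

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    df = pd.DataFrame({"salary": [28000, 35000, 52000, 120000]})

    # Min-max normalization: rescale values to [0, 1]
    df["salary_minmax"] = MinMaxScaler().fit_transform(df[["salary"]]).ravel()

    # Standardization (z-score): zero mean, unit variance
    df["salary_zscore"] = StandardScaler().fit_transform(df[["salary"]]).ravel()
    print(df)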



Data Discretization
Data discretization refers to the process of converting a continuous set of values into a finite set
by defining certain intervals or "bins". Essentially, it's about segmenting a continuous scale to
create discrete categories. This process can be useful for simplifying analysis or when aiming to
turn numerical data into categories that aid in interpretation and analysis.

Discretization can be beneficial for several reasons:

● Some machine learning algorithms, especially classifiers, might perform better with
categorical features than with continuous numerical features.

● It simplifies the data and can make patterns more evident.

● It can help reduce the impact of minor fluctuations or outliers.



Data Discretization
Example:

Imagine having a dataset with the exact age of users. For certain analyses, knowing whether someone is 23 or 24 might not be useful, but it would be useful to know whether they are a teenager, a young adult, middle-aged, or a senior. You could discretize age into categories like:

● Teenager (13-19 years)

● Young Adult (20-35 years)

● Middle-Aged Adult (36-55 years)

● Senior (56+ years)
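
This binning can be sketched with pd.cut; the cut points follow the categories above and the sample ages are illustrative:

    import pandas as pd

    ages = pd.Series([15, 23, 41, 67, 30])

    # Bin continuous ages into the discrete categories defined above
    age_groups = pd.cut(
        ages,
        bins=[13, 19, 35, 55, 120],
        labels=["Teenager", "Young Adult", "Middle-Aged Adult", "Senior"],
    )
    print(age_groups)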



Feature Engineering

Feature Engineering is the process of selecting, modifying, or creating new variables (features) from existing data to enhance the performance of machine learning models. It is a combination of domain knowledge, creativity, and experimentation, often making a significant impact on the performance and accuracy of models.



Feature Engineering
It can involve operations such as calculating the mean of certain values, extracting parts of a
date (like the month or day of the week), or even combining multiple features to form a new
one.
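
A minimal sketch of these operations in Pandas (the columns are illustrative):

    import pandas as pd

    df = pd.DataFrame({"order_date": ["2024-01-15", "2024-02-03"],
                       "quantity": [3, 5],
                       "unit_price": [10.0, 4.5]})

    # Extract parts of a date as new features
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["order_month"] = df["order_date"].dt.month
    df["order_dayofweek"] = df["order_date"].dt.dayofweek

    # Combine existing features into a new one
    df["total_amount"] = df["quantity"] * df["unit_price"]
    print(df)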



How to identify transformation needs?
Here are some strategies for identifying data cleaning and transformation needs in the ETL process:

● Initial Inspection: A quick visualization and descriptive analysis can reveal obvious issues
such as missing values, outliers, or data in inconsistent formats.

● Business Rules: Establish rules based on business knowledge. For instance, in an employee dataset, negative ages or extremely high salaries might be invalid.

● Consistency Verification: Check that data is consistent over time or between data sources.
If one source indicates a rise in sales and another shows a decline for the same period,
there are inconsistencies.



How to identify transformation needs?

● Data History: If you have access to previous versions of the same dataset, it can be useful
to review changes over time and detect possible errors.

● External Sources: Compare your data with trusted external sources to spot discrepancies.

● Communication with Experts: Engage with experts in the field or topic related to the data.
They might offer valuable insights about the validity and quality of the data.

● Automation: Specialized tools can be programmed to detect common issues or erroneous patterns in the data, as sketched below.
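
The "Initial Inspection", "Business Rules", and "Automation" strategies above can be sketched as a few routine Pandas checks; the DataFrame and the "age" column are assumed for illustration:

    import pandas as pd

    def quick_inspection(df: pd.DataFrame) -> None:
        """Print basic diagnostics for an already-extracted DataFrame."""
        print(df.describe(include="all"))        # descriptive statistics
        print(df.isna().sum())                   # missing values per column
        print(df.duplicated().sum(), "duplicate rows")
        # Simple business rule: flag negative ages (column name is illustrative)
        if "age" in df.columns:
            print((df["age"] < 0).sum(), "rows with a negative age")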



Last stage: Loading
The "Loading" stage in the ETL process involves taking the transformed data and placing it into
a final storage system. The objective of this stage is to ensure that data is available for queries,
analytics, and other operations in an optimized and structured environment.



Considerations During Loading

● Performance: Efficiency is key, especially when handling large volumes of data.

● Data Integrity: Ensure that data is loaded without errors or duplications.

● Security: Ensure that data is transmitted and stored securely.

● Frequency: Will it be loaded once, daily, or in real-time?



Loading Methods
● Full Load: All data in the target system is erased and reloaded.
● Incremental Load: Only new or changed data is loaded.
● Real-time Load: Data is loaded as soon as it becomes available.
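
A minimal sketch of a full load versus an incremental load using Pandas and SQLAlchemy; the table name and connection string are placeholders:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///warehouse.db")   # placeholder target system
    df = pd.DataFrame({"id": [1, 2], "amount": [10.5, 20.0]})

    # Full load: replace the target table with the new data
    df.to_sql("sales", con=engine, if_exists="replace", index=False)

    # Incremental load: append only the new or changed records
    new_rows = pd.DataFrame({"id": [3], "amount": [7.25]})
    new_rows.to_sql("sales", con=engine, if_exists="append", index=False)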



Loading
Once data is clean and in the desired format, the final step is loading it into the target storage system.

Common storage options:


● Databases: SQL (e.g., PostgreSQL, MySQL, SQLite) or NoSQL (e.g., MongoDB,
Cassandra).
● Data Lakes: Massive storage repositories like Amazon S3, Azure Data Lake, among
others.
● Data Warehouses: Systems optimized for analytics like Google BigQuery, Amazon
Redshift, or Snowflake.

Considerations:
● Ensure a secure and stable connection.
● Manage concurrency and capacity of the target system.
● Monitor and manage potential errors or failures during loading.



References

1. Grus, J. (2019). Data science from scratch: First Principles with Python. O’Reilly Media.

2. Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python. O’Reilly Media.

3. Nield, T. (2022). Essential math for data science: Take Control of Your Data with Fundamental
Linear Algebra, Probability, and Statistics. O’Reilly Media.

4. Reis, J., & Housley, M. (2022). Fundamentals of Data Engineering: Plan and Build Robust
Data Systems. O’Reilly Media.
