Class 2 - Extraction, Transformation and Load (ETL)
TC3006C - Advanced Artificial Intelligence for Data Science I
● ETL is a process used to move and prepare data from original sources to storage systems.
● Main Objective: Collect data from multiple sources, process it, and load it into a target
environment, such as a data warehouse.
(Figure: The Ultimate Guide To Setting-Up An ETL (Extract, Transform, and Load) Process Pipeline)
● Quality and Consistency: ETL ensures that data is consistent and of high quality, essential
for reliable analysis.
● Data Integration: It facilitates the combination of data from multiple sources, providing a
unified view.
● Preparation for Analysis: Raw data is rarely ready for analysis. ETL prepares it to be easily
consumed by analysis tools and models.
Data Extraction is an essential phase in any data analysis process and represents the first
stage in the ETL process. Extraction specifically refers to the act of obtaining and collecting
data from various sources to be processed and analyzed.
Data can come from multiple sources, depending on the domain and purpose of the analysis.
Common examples of data sources include databases (SQL, NoSQL), files (CSV, Excel, JSON,
XML), APIs, logging systems, ERPs, CRMs, among others.
During extraction, it is crucial to ensure that the data is obtained completely and without alteration. Corrupt or incomplete data can bias or otherwise compromise subsequent analyses.
Pandas: This is the most widely used Python library for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series, and it can read from a variety of file formats, such as CSV, Excel, JSON, and SQL, among others.
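For example, a minimal sketch of reading data into Pandas from a few common formats (the file names and the SQLite connection string are placeholders, not part of any specific dataset):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical file names; replace them with real paths.
df_csv = pd.read_csv("sales.csv")        # comma-separated values
df_xlsx = pd.read_excel("sales.xlsx")    # Excel workbook (requires openpyxl)
df_json = pd.read_json("sales.json")     # JSON records

# Reading from a SQL database through SQLAlchemy (illustrative connection string).
engine = create_engine("sqlite:///sales.db")
df_sql = pd.read_sql("SELECT * FROM sales", engine)

print(df_csv.head())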
Requests: This is a Python library for making HTTP requests. It is useful for extracting data from the web, especially when working with APIs.
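A minimal sketch of extracting records from a JSON API with Requests (the endpoint, parameters, and response shape are assumptions for illustration):

import requests
import pandas as pd

url = "https://api.example.com/v1/orders"          # hypothetical endpoint
response = requests.get(url, params={"page": 1}, timeout=10)
response.raise_for_status()                        # stop early on HTTP errors

records = response.json()                          # assumes a JSON list of records
df = pd.DataFrame(records)
print(df.head())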
● Label Encoding: This method involves converting each category of a column into a unique
integer. For example, for the column "Color" with values "Red", "Blue", "Green", the
encoded values might be 1, 2, and 3, respectively.
● One-Hot Encoding: In this method, for each unique value in a column, a new binary column
(0 or 1) is created. Using the previous "Color" example, this would result in three new
columns: "Is_Red", "Is_Blue", and "Is_Green".
● Binary Encoding: This is a combination of label and one-hot encoding. It first assigns an integer value to each category and then converts that integer into its binary representation (a short sketch of these encodings follows this list).
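A short Pandas sketch of label and one-hot encoding on the "Color" example (Binary Encoding is not built into Pandas; the third-party category_encoders package offers a BinaryEncoder):

import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# Label Encoding: each category becomes an integer (the mapping order is arbitrary).
df["Color_label"] = df["Color"].astype("category").cat.codes

# One-Hot Encoding: one binary column per category (Is_Blue, Is_Green, Is_Red).
one_hot = pd.get_dummies(df["Color"], prefix="Is")
df = pd.concat([df, one_hot], axis=1)

print(df)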
When dealing with missing values, various strategies can be adopted (a Pandas sketch follows this list), such as:
● Deletion: If a row (record) has a significant number of missing values, it might be entirely
removed.
● Imputation: Replace missing values with derived values, which could be the mean, median,
or mode of the column.
● Assigning a default value: For instance, in an "income" column, a missing value might be
replaced with "0", assuming that the absence of information means no income.
● Using models to predict missing values: Models like linear regression, k-nearest neighbors,
among others, can be used to estimate the missing value based on other available data.
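A brief Pandas sketch of these strategies on a toy DataFrame (the column names, deletion threshold, and default value are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [52000, 61000, np.nan, np.nan],
    "city": ["Monterrey", "CDMX", "Puebla", None],
})

# Deletion: keep only rows with at least 2 non-missing values.
df = df.dropna(thresh=2)

# Imputation: replace missing ages with the column median (mean or mode work similarly).
df["age"] = df["age"].fillna(df["age"].median())

# Default value: assume a missing income means no income.
df["income"] = df["income"].fillna(0)

# Model-based imputation could use, for example, sklearn.impute.KNNImputer.
print(df)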
Discretization
● Some machine learning algorithms, especially classifiers, might perform better with
categorical features than with continuous numerical features.
Imagine having a dataset with users' exact ages. For certain analyses, knowing whether someone is 23 or 24 might not be useful, but knowing whether they are a teenager, a young adult, middle-aged, or a senior would be. You could discretize age into categories like these, as sketched below.
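One possible sketch with pandas.cut (the bin edges and labels below are illustrative, not a fixed standard):

import pandas as pd

ages = pd.Series([15, 23, 37, 52, 68], name="age")

# Illustrative bin edges; adjust the ranges to the needs of the analysis.
bins = [12, 19, 35, 60, 120]
labels = ["teenager", "young adult", "middle-aged", "senior"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(pd.concat([ages, age_group.rename("age_group")], axis=1))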
● Initial Inspection: A quick visualization and descriptive analysis can reveal obvious issues such as missing values, outliers, or data in inconsistent formats (a short Pandas sketch follows this list).
● Consistency Verification: Check that data is consistent over time or between data sources.
If one source indicates a rise in sales and another shows a decline for the same period,
there are inconsistencies.
● Data History: If you have access to previous versions of the same dataset, it can be useful
to review changes over time and detect possible errors.
● External Sources: Compare your data with trusted external sources to spot discrepancies.
● Communication with Experts: Engage with experts in the field or topic related to the data.
They might offer valuable insights about the validity and quality of the data.
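As an example, a quick initial-inspection sketch with Pandas (the dataset name is hypothetical):

import pandas as pd

df = pd.read_csv("sales.csv")        # hypothetical dataset

print(df.describe(include="all"))    # ranges, means, and candidate outliers
print(df.isna().sum())               # missing values per column
print(df.duplicated().sum())         # duplicated records
print(df.dtypes)                     # inconsistent formats often show up as wrong dtypes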
Load Considerations:
● Ensure a secure and stable connection.
● Manage concurrency and capacity of the target system.
● Monitor and manage potential errors or failures during loading, as sketched below.
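A minimal loading sketch with Pandas and SQLAlchemy (the connection string, table name, and error handling are illustrative; a real pipeline would also need the appropriate database driver and secure credential management):

import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv("clean_sales.csv")   # transformed data, hypothetical file

# Illustrative connection string; keep credentials out of source code in practice.
engine = create_engine("postgresql://user:password@host:5432/warehouse")

try:
    # chunksize limits memory use and the load placed on the target system.
    df.to_sql("sales", engine, if_exists="append", index=False, chunksize=1000)
except Exception as exc:
    # Monitor and handle failures during the load step.
    print(f"Load failed: {exc}")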