
Data Cleaning

Data cleaning in SQL is the process of identifying and correcting errors or inconsistencies in data to enhance its quality and accuracy, involving tasks like handling missing values, removing duplicates, standardizing text, correcting inconsistent data, changing data types, and addressing date format issues. Various SQL functions such as COALESCE(), IFNULL(), DISTINCT, and ROW_NUMBER() are utilized for these tasks, each serving specific purposes like replacing NULL values or eliminating duplicate records. The document provides detailed examples and explanations of these functions and their applications in data cleaning scenarios.

DATA CLEANING

Created By: Muhammad Umar Hanif

WHAT IS DATA CLEANING?

Data cleaning in SQL involves identifying and correcting errors or inconsistencies in data to improve its quality and accuracy. It includes tasks like:

+ Handling missing values
+ Removing duplicates
+ Standardizing text
+ Correcting inconsistent data
+ Changing data types
+ Handling date format issues
+ Enforcing data integrity
+ Handling numeric data

This process ensures that the data is clean, consistent, and ready for analysis.

01. Handle Missing Values:

The SQL functions

+ COALESCE()
+ IFNULL()
+ ISNULL()

are used to handle missing or NULL values in databases. Let's explore how these functions work with a common example.

1.1 - COALESCE():

Returns the first non-NULL value from the list of arguments.

Example: Suppose you have a table called employees with columns for base_salary, bonus, and total_compensation, where some values in bonus and total_compensation might be NULL.

    id | name  | base_salary | bonus | total_compensation
    1  | John  | 50000       | NULL  | NULL
    2  | Sarah | 60000       | 5000  | NULL
    3  | David | 45000       | NULL  | NULL
    4  | Alice | 70000       | 7000  | 77000

Case 01:

    SELECT name,
           base_salary,
           COALESCE(bonus, 0) AS bonus,
           COALESCE(total_compensation, base_salary + COALESCE(bonus, 0)) AS total_compensation
    FROM employees;

    name  | base_salary | bonus | total_compensation
    John  | 50000       | 0     | 50000
    Sarah | 60000       | 5000  | 65000
    David | 45000       | 0     | 45000
    Alice | 70000       | 7000  | 77000

Explanation:

+ If bonus is NULL, it is replaced with 0.
+ If total_compensation is NULL, it is replaced with base_salary + bonus.

Case 02:

    SELECT name,
           base_salary,
           COALESCE(bonus, base_salary, 0) AS bonus,
           COALESCE(total_compensation, base_salary + COALESCE(bonus, 0)) AS total_compensation
    FROM employees;

    name  | base_salary | bonus | total_compensation
    John  | 50000       | 50000 | 50000
    Sarah | 60000       | 5000  | 65000
    David | 45000       | 45000 | 45000
    Alice | 70000       | 7000  | 77000

Explanation: Here, COALESCE() returns the value from bonus if it is available, or base_salary if bonus is missing. If both are missing, it returns 0. Inside the total_compensation expression, bonus still refers to the original column rather than the new alias, so the totals for John and David are computed with a bonus of 0.

1.2 - IFNULL():

IFNULL() (available in MySQL and SQLite, among others) checks whether a value is NULL and, if it is, replaces it with a value you choose. If the value isn't NULL, it returns the original value.

Example: Suppose you have a table called employees with columns for base_salary, bonus, and total_compensation, where some values in bonus and total_compensation might be NULL.

    id | name  | base_salary | bonus | total_compensation
    1  | John  | 50000       | NULL  | NULL
    2  | Sarah | 60000       | 5000  | NULL
    3  | David | 45000       | NULL  | NULL
    4  | Alice | 70000       | 7000  | 77000

    SELECT name,
           base_salary,
           IFNULL(bonus, 0) AS bonus,
           IFNULL(total_compensation, base_salary + IFNULL(bonus, 0)) AS total_compensation
    FROM employees;

    name  | base_salary | bonus | total_compensation
    John  | 50000       | 0     | 50000
    Sarah | 60000       | 5000  | 65000
    David | 45000       | 0     | 45000
    Alice | 70000       | 7000  | 77000

Explanation:

+ If bonus is NULL, it is replaced with 0.
+ If total_compensation is NULL, it is replaced with base_salary + bonus.

Difference Between COALESCE() and IFNULL() in SQL:

+ COALESCE(): Can take multiple arguments and returns the first non-NULL value.
+ IFNULL(): Only takes two arguments; if the first one is NULL, it returns the second.

1.3 - ISNULL():

The ISNULL() function is used to check if a value is NULL. In SQL Server it returns a specified replacement value if the original value is NULL, which makes it similar to IFNULL(). (In MySQL, ISNULL() takes a single argument and only tests whether it is NULL.) IFNULL() and ISNULL() are simpler and more focused than COALESCE(), but support only two arguments.

Example: Suppose you have a table called employees with columns for base_salary, bonus, and total_compensation, where some values in bonus and total_compensation might be NULL.

    id | name  | base_salary | bonus | total_compensation
    1  | John  | 50000       | NULL  | NULL
    2  | Sarah | 60000       | 5000  | NULL
    3  | David | 45000       | NULL  | NULL
    4  | Alice | 70000       | 7000  | 77000

    SELECT name,
           base_salary,
           ISNULL(bonus, 0) AS bonus,
           ISNULL(total_compensation, base_salary + ISNULL(bonus, 0)) AS total_compensation
    FROM employees;

    name  | base_salary | bonus | total_compensation
    John  | 50000       | 0     | 50000
    Sarah | 60000       | 5000  | 65000
    David | 45000       | 0     | 45000
    Alice | 70000       | 7000  | 77000

Here, ISNULL() checks if the bonus is NULL. If it is, it returns 0; otherwise it returns the original bonus.

02. Remove Duplicates:

In SQL, both DISTINCT and ROW_NUMBER() can be used to remove duplicates, but they work in different ways and serve different purposes. Let's go through each method with an example and explain the difference between them.

2.1 - DISTINCT:

The DISTINCT keyword removes duplicate rows from the result set. It compares all the columns in the SELECT list and returns only unique rows.

Example: Suppose you have the following table of employee records with some duplicate entries:

    employee_id | name  | department | salary
    1           | John  | HR         | 5000
    2           | Sarah | IT         | 6000
    3           | John  | HR         | 5000
    4           | Mike  | Sales      | 4500

You want to remove the duplicate rows (same name, department, and salary).

Query:

    SELECT DISTINCT name, department, salary
    FROM employees;

    name  | department | salary
    John  | HR         | 5000
    Sarah | IT         | 6000
    Mike  | Sales      | 4500

Here, DISTINCT has removed the duplicate row of John. DISTINCT is a keyword that applies to the whole SELECT list, not a function applied to a single column.

2.2 - ROW_NUMBER():

The ROW_NUMBER() function assigns a unique number to each row within a partition of data, based on an ORDER BY clause. You can use this to identify and remove duplicates by keeping only the first occurrence of each set of duplicates.

Example: If you want to keep only one row for each employee, treating rows with the same name and department as duplicates, you can use ROW_NUMBER().
    WITH RankedEmployees AS (
        SELECT employee_id,
               name,
               department,
               salary,
               ROW_NUMBER() OVER (PARTITION BY name, department ORDER BY employee_id) AS row_num
        FROM employees
    )
    SELECT employee_id, name, department, salary
    FROM RankedEmployees
    WHERE row_num = 1;

    employee_id | name  | department | salary
    1           | John  | HR         | 5000
    2           | Sarah | IT         | 6000
    4           | Mike  | Sales      | 4500

Here, ROW_NUMBER() assigns a unique number to each row within each partition of name and department. We only keep the rows where row_num = 1, effectively removing duplicates.

Key Differences between DISTINCT and ROW_NUMBER():

+ DISTINCT:
    + Removes duplicates by considering all specified columns.
    + It does not allow you to control which duplicate to keep.
    + Simpler to use if you just want to remove duplicates based on exact matches.
+ ROW_NUMBER():
    + Provides more flexibility by assigning a unique number to each row.
    + Allows you to remove duplicates based on more complex logic, such as deciding which duplicate to keep (based on ordering).
    + Useful when you need to retain more control over how duplicates are handled (e.g., keeping the latest or the earliest record based on another column).

When to Use Which?

+ Use DISTINCT when you simply want to eliminate exact duplicate rows based on specific columns.
+ Use ROW_NUMBER() when you need to remove duplicates but want more control over which duplicates to keep, especially when duplicates exist in a more complex way.

03. STANDARDIZE TEXT:

In SQL, functions like LOWER(), UPPER(), and TRIM() are commonly used to standardize text data. They help ensure that text is consistent, which is important for comparisons, storage, and presentation.

3.1 - LOWER():

Converts all characters in a string to lowercase.

Input Table:

    name
    John Doe
    SARAH SMITH
    MiKe JOHN

    SELECT LOWER(name) AS standardized_name
    FROM employees;

    standardized_name
    john doe
    sarah smith
    mike john

3.2 - UPPER():

Converts all characters in a string to uppercase.
Input Table:

    name
    John Doe
    SARAH SMITH
    MiKe JOHN

    SELECT UPPER(name) AS standardized_name
    FROM employees;

    standardized_name
    JOHN DOE
    SARAH SMITH
    MIKE JOHN

3.3 - TRIM():

Removes leading and trailing spaces from a string.

Example: Suppose you have a table named customer_feedback, which contains customer reviews. Some of these reviews have leading and trailing spaces, which can affect data analysis and reporting. (The extra spaces are not visible in the table below.)

    feedback_id | review
    1           | Great product!
    2           | Excellent service!
    3           | Average quality.
    4           | Not satisfied with the service.
    5           | Would buy again!

    SELECT feedback_id, TRIM(review) AS cleaned_review
    FROM customer_feedback;

    feedback_id | cleaned_review
    1           | Great product!
    2           | Excellent service!
    3           | Average quality.
    4           | Not satisfied with the service.
    5           | Would buy again!

Explanation:

+ Feedback IDs: Remain unchanged.
+ Cleaned Reviews: The TRIM() function removes any leading and trailing spaces from each review.

This ensures that the feedback is clean and ready for further analysis, such as sentiment analysis or reporting.

04. CORRECT INCONSISTENT DATA:

Correcting inconsistent data in SQL often involves string manipulation functions such as SUBSTR() and CONCAT(). Let's create a scenario where we need to use SUBSTR() and CONCAT() together to correct inconsistent data.

Example: Imagine you have a table named products, which stores product codes in inconsistent formats. Some product codes may have leading or trailing spaces, and some may have additional characters that need to be standardized.

    product_id | product_code
    1          | ABC-123
    2          | def456
    3          | ghi-789
    4          | JKL-0-001
    5          | MNO_234

Objective

1. Remove any leading or trailing spaces from the product codes.
2. Ensure all product codes follow a standard format: the code should start with "PROD-", followed by a numeric part extracted from the existing product code.
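The two objectives above can be tried out end to end with a small script. This is only a sketch under stated assumptions: it runs on SQLite through Python's sqlite3 module rather than on the database the document targets, the padding spaces in the sample rows are invented for illustration, and it takes "the numeric part" to mean the run of digits at the end of the code. It relies on SQLite's two-argument RTRIM(), which strips a set of characters, so the same expression may not carry over to other databases unchanged.

```python
import sqlite3

# In-memory SQLite database; the table and rows mirror the products example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_code TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [
        (1, "  ABC-123 "),  # padding is an assumption; the source table does not show it
        (2, "def456"),
        (3, "ghi-789  "),
        (4, "JKL-0-001"),
        (5, " MNO_234"),
    ],
)

# Objective 1: TRIM() strips the surrounding spaces.
# Objective 2: RTRIM(code, '0123456789') removes the trailing digits, so its
# length marks where the numeric suffix begins; SUBSTR() then extracts it.
conn.execute(
    """
    UPDATE products
    SET product_code = 'PROD-' || SUBSTR(
        TRIM(product_code),
        LENGTH(RTRIM(TRIM(product_code), '0123456789')) + 1
    )
    """
)

standardized = conn.execute(
    "SELECT product_id, product_code FROM products ORDER BY product_id"
).fetchall()
for row in standardized:
    print(row)
```

Because the suffix is located from the right-hand side, codes without a hyphen (def456, MNO_234) and codes with more than one hyphen (JKL-0-001) are handled the same way as ABC-123.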
SQL Query Using SUBSTR() and CONCAT()

The approach involves trimming the spaces, extracting the relevant part of the product code using SUBSTR(), and concatenating it into a standardized format using CONCAT().

    UPDATE products
    SET product_code = CONCAT('PROD-', SUBSTR(TRIM(product_code), INSTR(TRIM(product_code), '-') + 1));

    product_id | product_code
    1          | PROD-123
    2          | PROD-456
    3          | PROD-789
    4          | PROD-001
    5          | PROD-234

Explanation of the Query

1. TRIM(product_code): This removes leading and trailing spaces from each product code.
2. INSTR(TRIM(product_code), '-') + 1: This finds the position of the first - in the trimmed product code and adds 1 to get the starting position of the numeric part.
3. SUBSTR(..., INSTR(...) + 1): This extracts the substring starting from the character immediately after the -, which gives us the numeric part of the product code.
4. CONCAT('PROD-', ...): This concatenates "PROD-" with the extracted numeric part to create the standardized product code.

Note that this query produces the table above only for codes with a single hyphen directly before the numeric part (ABC-123, ghi-789). When a code has no hyphen, INSTR() returns 0 and SUBSTR() returns the whole code, and for JKL-0-001 the text after the first hyphen is 0-001, so those rows need additional handling to reach the format shown.

05. CHANGE DATA TYPES:

You can use CAST() and CONVERT() in SQL to change data types of columns or values, and they are often used for converting between string, numeric, and date formats. Below is an example that demonstrates both CAST() and CONVERT() functions.

Example Scenario

We have a table sales with columns for sale_id, sale_amount, and sale_date. You want to:

1. Convert the sale_id (which is an integer) into a string for some report.
2. Convert sale_date (which is a DATETIME) into a VARCHAR, but in a specific format: dd/mm/yyyy.

    sale_id | sale_amount | sale_date
    1       | 1000.50     | 2024-10-01 14:30:00
    2       | 1500.00     | 2024-10-02 09:00:00
    3       | 750.25      | 2024-10-03 16:45:00

SQL Query Using Both CAST() and CONVERT():

    SELECT CAST(sale_id AS VARCHAR(10)) AS sale_id_string,
           CAST(sale_amount AS DECIMAL(10, 2)) AS sale_amount_decimal,
           CONVERT(VARCHAR(10), sale_date, 103) AS sale_date_formatted
    FROM sales;

Explanation of the Query:

1. CAST(sale_id AS VARCHAR(10)):
    + Converts the sale_id (an integer) into a VARCHAR of up to 10 characters.
    + This is a simple, straightforward type conversion with no extra formatting options.
2. CAST(sale_amount AS DECIMAL(10, 2)):
    + Converts sale_amount to a DECIMAL(10, 2), ensuring the amount is represented with two decimal places.
    + This shows that CAST() can handle numeric type conversions as well.
3. CONVERT(VARCHAR(10), sale_date, 103):
    + Converts the sale_date (a DATETIME type) to a string in dd/mm/yyyy format.
    + The third parameter 103 is a style number in SQL Server that specifies the exact format (103 represents dd/mm/yyyy).
    + CONVERT() is used here because it allows formatting options that CAST() does not.

    sale_id_string | sale_amount_decimal | sale_date_formatted
    1              | 1000.50             | 01/10/2024
    2              | 1500.00             | 02/10/2024
    3              | 750.25              | 03/10/2024

Key Differences Between CAST() and CONVERT():

1. CAST():
    + Basic Usage: Converts one data type to another.
    + Simple: Mostly used when you don't need specific formatting.
    + Portable: Works across many SQL databases (ANSI SQL compliant).

   Example: Converting an integer to a string:

    SELECT CAST(sale_id AS VARCHAR(10)) FROM sales;

2. CONVERT():
    + Versatile: Allows additional formatting, particularly with DATETIME types.
    + Specific to SQL Server: The three-argument form with a style number offers flexibility for converting and formatting dates, numbers, etc.

   Example: Converting DATETIME to string with a specific format:

    SELECT CONVERT(VARCHAR(10), sale_date, 103) FROM sales;

Summary:

+ Use CAST() when you need simple, straightforward data type conversion that is portable across different database systems.
+ Use CONVERT() in SQL Server when you need to apply specific formatting, especially for DATETIME values or when you need more control over the output format.

06. HANDLE DATE FORMAT ISSUES:

When handling date format issues in SQL, particularly in MySQL, we use functions like STR_TO_DATE(), EXTRACT(), NOW(), and DATE_FORMAT() to manipulate and extract dates from various formats.
Let's explore these functions with a practical example using a table named orders.

Example Scenario:

You have an orders table where:

+ order_id stores the order identification numbers.
+ order_date stores the date as a string in inconsistent formats (e.g., DD/MM/YYYY, MM-DD-YYYY).
+ You need to:
    1. Convert these string-formatted dates into actual DATE types.
    2. Extract specific parts of the date (like year or month).
    3. Format the date into a more user-friendly format for reporting purposes.
    4. Get the current date for comparison purposes.

    order_id | order_date | amount
    1        | 26/09/2024 | 100
    2        | 09-27-2024 | 150
    3        | 28-09-2024 | 200

SQL Queries Demonstrating STR_TO_DATE(), EXTRACT(), NOW(), and DATE_FORMAT():

Query 1: STR_TO_DATE() to Convert Strings into Dates

+ STR_TO_DATE() is used to convert strings into proper DATE types by specifying the input format.

    SELECT order_id,
           STR_TO_DATE(order_date, '%d/%m/%Y') AS formatted_date_1,
           STR_TO_DATE(order_date, '%m-%d-%Y') AS formatted_date_2
    FROM orders;

    order_id | formatted_date_1 | formatted_date_2
    1        | 2024-09-26       | NULL
    2        | NULL             | 2024-09-27
    3        | 2024-09-28       | NULL

+ For order_id = 1, the date '26/09/2024' is converted using '%d/%m/%Y'.
+ For order_id = 2, the date '09-27-2024' is converted using '%m-%d-%Y'.
+ order_id = 3 is shown as converted by '%d/%m/%Y'. The sample value '28-09-2024' is hyphen-separated, so this result assumes the value is stored with slashes ('28/09/2024'); a hyphenated DD-MM-YYYY value would need the format '%d-%m-%Y' instead.

Query 2: EXTRACT() to Extract Specific Parts of the Date

    SELECT order_id,
           EXTRACT(YEAR FROM STR_TO_DATE(order_date, '%d/%m/%Y')) AS order_year,
           EXTRACT(MONTH FROM STR_TO_DATE(order_date, '%d/%m/%Y')) AS order_month
    FROM orders
    WHERE order_id = 1;

    order_id | order_year | order_month
    1        | 2024       | 9

For order_id = 1, the year 2024 and the month 9 are extracted from the date '26/09/2024'.

Query 3: NOW() to Get the Current Date and Time

    SELECT NOW() AS current_datetime
    FROM orders
    LIMIT 1;

    current_datetime
    2024-10-01 12:30:00

This returns the current system date and time at the time of query execution. In this case, it is assumed to be 2024-10-01 12:30:00.
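Query 1 leaves every row with one NULL column, because each string matches only one format. A common follow-up is to wrap the attempts in COALESCE() so that the first format that parses wins. The sketch below shows that shape under stated assumptions: it runs on SQLite through Python's sqlite3 module, and because SQLite has no STR_TO_DATE(), a small Python helper (str_to_date, a stand-in written for this example, not MySQL's implementation) is registered under that name. A third format, '%d-%m-%Y', is added here to cover the hyphenated day-first value in row 3.

```python
import sqlite3
from datetime import datetime


def str_to_date(value, fmt):
    """Stand-in for MySQL's STR_TO_DATE(): ISO date string, or None if no match."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, fmt).date().isoformat()
    except ValueError:
        return None


conn = sqlite3.connect(":memory:")
# SQLite has no STR_TO_DATE(), so the Python helper is registered under that name.
conn.create_function("STR_TO_DATE", 2, str_to_date)

conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "26/09/2024", 100), (2, "09-27-2024", 150), (3, "28-09-2024", 200)],
)

# COALESCE() keeps the first format that parses; the order of the formats matters.
parsed = conn.execute(
    """
    SELECT order_id,
           COALESCE(
               STR_TO_DATE(order_date, '%d/%m/%Y'),
               STR_TO_DATE(order_date, '%m-%d-%Y'),
               STR_TO_DATE(order_date, '%d-%m-%Y')
           ) AS order_date_clean
    FROM orders
    ORDER BY order_id
    """
).fetchall()
for row in parsed:
    print(row)
```

The order of the formats is a design choice with consequences: a value such as '09-10-2024' is valid both month-first and day-first, and this sketch would read it as month-first. When a column mixes both conventions, the format usually has to be decided from the data source rather than from the string alone.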
Query 4: DATE_FORMAT() to Format Dates

    SELECT order_id,
           DATE_FORMAT(STR_TO_DATE(order_date, '%d/%m/%Y'), '%M %d, %Y') AS formatted_order_date
    FROM orders
    WHERE order_id = 1;

    order_id | formatted_order_date
    1        | September 26, 2024

The date '26/09/2024' is reformatted as 'September 26, 2024'.

Complete Query Combining All Functions:

    SELECT order_id,
           STR_TO_DATE(order_date, '%d/%m/%Y') AS formatted_date1,
           STR_TO_DATE(order_date, '%m-%d-%Y') AS formatted_date2,
           EXTRACT(YEAR FROM STR_TO_DATE(order_date, '%d/%m/%Y')) AS order_year,
           EXTRACT(MONTH FROM STR_TO_DATE(order_date, '%d/%m/%Y')) AS order_month,
           DATE_FORMAT(STR_TO_DATE(order_date, '%d/%m/%Y'), '%M %d, %Y') AS formatted_date3,
           NOW() AS current_datetime
    FROM orders;

    order_id | formatted_date1 | formatted_date2 | order_year | order_month | formatted_date3    | current_datetime
    1        | 2024-09-26      | NULL            | 2024       | 9           | September 26, 2024 | 2024-10-01 12:30:00
    2        | NULL            | 2024-09-27      | NULL       | NULL        | NULL               | 2024-10-01 12:30:00
    3        | 2024-09-28      | NULL            | 2024       | 9           | September 28, 2024 | 2024-10-01 12:30:00

+ For order_id = 1, formatted_date1 is populated because it's in '%d/%m/%Y' format.
+ For order_id = 2, formatted_date2 is populated because it's in '%m-%d-%Y' format.
+ The current system date and time (NOW()) are the same for all rows.

07. ENFORCE DATA INTEGRITY:

Data integrity can be enforced using constraints like CHECK and FOREIGN KEY. Here's an example to demonstrate how these constraints work.

CHECK Constraint:

The CHECK constraint ensures that a condition must be true for each row in the table. For example, let's create a Customers table where the age must be between 18 and 100.

01. Creating the Customers Table with CHECK Constraint:

    CREATE TABLE Customers (
        CustomerID INT PRIMARY KEY,
        FirstName  VARCHAR(50),
        LastName   VARCHAR(50),
        Age        INT,
        Email      VARCHAR(100),
        CHECK (Age >= 18 AND Age <= 100)
    );

02. Creating the Orders Table with FOREIGN KEY Constraint

    CREATE TABLE Orders (
        OrderID    INT PRIMARY KEY,
        OrderDate  DATE,
        CustomerID INT,
        Amount     DECIMAL(10, 2),
        FOREIGN KEY (CustomerID) REFERENCES Customers (CustomerID)
    );

03. Inserting Data into Customers Table

    INSERT INTO Customers (CustomerID, FirstName, LastName, Age, Email)
    VALUES (1, 'John', 'Doe', 30, '[email protected]'),
           (2, 'Jane', 'Smith', 25, '[email protected]'),
           (3, 'Emily', 'Johnson', 45, '[email protected]');

    CustomerID | FirstName | LastName | Age | Email
    1          | John      | Doe      | 30  | [email protected]
    2          | Jane      | Smith    | 25  | [email protected]
    3          | Emily     | Johnson  | 45  | [email protected]

04. Attempt to Insert Invalid Data (Will Fail Due to CHECK Constraint)

    INSERT INTO Customers (CustomerID, FirstName, LastName, Age, Email)
    VALUES (4, 'Jake', 'Williams', 16, '[email protected]');

Error Message:

    Error: CHECK constraint violation on Age. Age must be between 18 and 100.

The exact wording of the error depends on the database.

05. Inserting Data into Orders Table

    INSERT INTO Orders (OrderID, OrderDate, CustomerID, Amount)
    VALUES (1001, '2024-10-16', 1, 250.75),
           (1002, '2024-10-17', 2, 100.00);

    OrderID | OrderDate  | CustomerID | Amount
    1001    | 2024-10-16 | 1          | 250.75
    1002    | 2024-10-17 | 2          | 100.00

06. Attempt to Insert Invalid Order (Will Fail Due to FOREIGN KEY Constraint)

    INSERT INTO Orders (OrderID, OrderDate, CustomerID, Amount)
    VALUES (1003, '2024-10-18', 4, 150.00);

Error Message:

    Error: FOREIGN KEY violation. CustomerID 4 does not exist in the Customers table.

Summary

+ The CHECK constraint on the Customers table ensures that only customers with valid ages (18-100) are inserted.
+ The FOREIGN KEY constraint on the Orders table ensures that orders can only reference valid customers from the Customers table.

By enforcing these constraints, the database maintains integrity and prevents invalid or inconsistent data from being entered.

08. HANDLE NUMERIC VALUES:

You can handle numeric values using the functions ROUND(), CEIL(), FLOOR(), and ABS() in SQL. Here is a single dataset with examples of how each function works.

Consider the following example with table sales_data:

    sale_id | sale_amount
    1       | 234.567
    2       | -78.423
    3       | 456.789
    4       | 123.001
    5       | -65.999

I. ROUND() Function

The ROUND() function is used to round a number to a specified number of decimal places.

    SELECT sale_id,
           sale_amount,
           ROUND(sale_amount, 2) AS rounded_amount_2_decimals,
           ROUND(sale_amount, 0) AS rounded_to_nearest_integer
    FROM sales_data;

    sale_id | sale_amount | rounded_amount_2_decimals | rounded_to_nearest_integer
    1       | 234.567     | 234.57                    | 235
    2       | -78.423     | -78.42                    | -78
    3       | 456.789     | 456.79                    | 457
    4       | 123.001     | 123.00                    | 123
    5       | -65.999     | -66.00                    | -66

+ ROUND(sale_amount, 2): Rounds the number to 2 decimal places.
+ ROUND(sale_amount, 0): Rounds the number to the nearest integer.

II. CEIL() Function

The CEIL() (Ceiling) function rounds a number up to the nearest integer, regardless of the decimal part. For negative numbers, "up" means toward zero, so -78.423 becomes -78.

    SELECT sale_id,
           sale_amount,
           CEIL(sale_amount) AS ceiling_value
    FROM sales_data;

    sale_id | sale_amount | ceiling_value
    1       | 234.567     | 235
    2       | -78.423     | -78
    3       | 456.789     | 457
    4       | 123.001     | 124
    5       | -65.999     | -65

+ CEIL() rounds the numbers up to the nearest integer.

III. FLOOR() Function

The FLOOR() function rounds a number down to the nearest integer. For positive numbers this drops the decimal part; for negative numbers it moves away from zero, so -78.423 becomes -79.

    SELECT sale_id,
           sale_amount,
           FLOOR(sale_amount) AS floor_value
    FROM sales_data;

    sale_id | sale_amount | floor_value
    1       | 234.567     | 234
    2       | -78.423     | -79
    3       | 456.789     | 456
    4       | 123.001     | 123
    5       | -65.999     | -66

+ FLOOR() rounds the numbers down to the nearest integer.

IV. ABS() Function

The ABS() function returns the absolute (non-negative) value of a number.

    SELECT sale_id,
           sale_amount,
           ABS(sale_amount) AS absolute_value
    FROM sales_data;

    sale_id | sale_amount | absolute_value
    1       | 234.567     | 234.567
    2       | -78.423     | 78.423
    3       | 456.789     | 456.789
    4       | 123.001     | 123.001
    5       | -65.999     | 65.999

Summary

+ ROUND(): Rounds the number to a specified number of decimal places.
+ CEIL(): Rounds the number up to the nearest integer.
+ FLOOR(): Rounds the number down to the nearest integer.
+ ABS(): Returns the absolute (non-negative) value of a number.

These functions help you clean, format, and manipulate numerical data in SQL effectively.
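The four functions summarized above can be compared side by side on the same sales_data rows. The following is a minimal sketch, assuming SQLite driven from Python's sqlite3 module rather than the database the document uses. ROUND() and ABS() are SQLite built-ins; CEIL() and FLOOR() are not present in every SQLite build, so Python's math.ceil and math.floor are registered under those names here, which is an assumption made for this example.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# CEIL() and FLOOR() are not available in every SQLite build, so Python's
# math.ceil and math.floor are registered under those names (an assumption
# made so that this sketch runs anywhere).
conn.create_function("CEIL", 1, math.ceil)
conn.create_function("FLOOR", 1, math.floor)

conn.execute("CREATE TABLE sales_data (sale_id INTEGER, sale_amount REAL)")
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?)",
    [(1, 234.567), (2, -78.423), (3, 456.789), (4, 123.001), (5, -65.999)],
)

# One query applying all four functions to every row.
rows = conn.execute(
    """
    SELECT sale_id,
           ROUND(sale_amount, 2) AS rounded_2,
           CEIL(sale_amount)     AS ceiling_value,
           FLOOR(sale_amount)    AS floor_value,
           ABS(sale_amount)      AS absolute_value
    FROM sales_data
    ORDER BY sale_id
    """
).fetchall()
for row in rows:
    print(row)
```

The negative rows are the ones worth checking: CEIL() moves -78.423 toward zero while FLOOR() moves it away from zero, which is easy to get backwards when cleaning refunds or adjustments.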
