Query, filter, sort, and operators in SQL
SQL is a standard language for accessing and manipulating data. We use it to return rows and columns from a table, aggregate and summarize data, calculate values, and join data from different tables. In this section, we will see how SQL is used in data analysis tasks and how it solves a wide range of data access, manipulation, and summarization problems.
ANSI SQL is the internationally recognized standard version of the language. Each DBMS (database management system) makes modifications to it, adding features for performance and customization that are specific to that DBMS. The core of the ANSI standard, however, is understood by virtually every relational DBMS.
All versions support core statements and clauses such as SELECT, UPDATE, DELETE, INSERT, and WHERE. While the prominent DBMS vendors extend the language as they see fit, most base their SQL dialects on the ANSI standard. Oracle holds the largest share of the relational database management system (RDBMS) market, followed by MySQL, which Oracle itself has owned since it acquired Sun Microsystems.
Query, Filter, Sorting, and Operators
Relational Database
A relational database stores data in tables that relate to each other. Relational databases still hold the majority of the market.
Logical Relational Model
A logical model determines how data are related. It is typically built by data architects or data administrators, who need to understand what the database will be used for. In this process, entities, relationships, and cardinalities are identified, and a logical model with their definitions is created from them.
Physical Relational Model
The physical model is the set of SQL statements that create one or more tables. For each table, we define columns, indexes, relationships, etc.
Database
Schemas
Some DBMSs use the term schema in place of database. Depending on the DBMS, the two are the same concept (in MySQL, for example, CREATE SCHEMA is a synonym for CREATE DATABASE), while in other systems we can have multiple schemas within the same database.
CREATE SCHEMA `cap02`;
CREATE TABLE `cap02`.`tb_ship` (
`ship_name` VARCHAR(50) NULL,
`month_year` VARCHAR(10) NULL,
`risk_classification` VARCHAR(15) NULL,
`compliance_index` VARCHAR(45) NULL,
`risk_score` INT NULL,
`season` VARCHAR(200) NULL);
Load table
Right-click the tb_ship table and select Table Data Import Wizard. In the next step, we can load the data into an existing table or create a new one, and we can choose to truncate (clear) the table before importing. If we have changed the table and want to reload the data, truncating first removes the old records before the new data is loaded.
Anatomy of a Generic SQL Query
SELECT * FROM cap02.tb_ship;
SELECT allows us to select data from one or more tables. If we use * and apply no filter, the query returns all columns of all rows.
Select specific columns from the table
SELECT ship_name
FROM cap02.tb_ship;
SELECT ship_name, month_year
FROM cap02.tb_ship;
We return only the data we need. We can also apply operations to each column to produce precisely the data we want. SQL is a data query language built around the SELECT statement, the most basic statement in the language: it reads data without changing the table.
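To try these statements without a MySQL server, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. The three rows are invented for illustration (they are not the ANVISA data), the helper make_db is hypothetical, and compliance_index is stored as a number here as a simplifying assumption.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory table shaped like cap02.tb_ship and load the given rows."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database named cap02 so the schema-qualified name works.
    conn.execute("ATTACH DATABASE ':memory:' AS cap02")
    conn.execute(
        "CREATE TABLE cap02.tb_ship (ship_name TEXT, month_year TEXT, "
        "risk_classification TEXT, compliance_index REAL, risk_score INTEGER, season TEXT)"
    )
    conn.executemany("INSERT INTO cap02.tb_ship VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn

# Invented sample rows, for illustration only.
rows = [
    ("Ship Alpha", "04/2018", "A", 100.0, 0, "Season 2017/2018"),
    ("Ship Bravo", "12/2016", "D", 72.5, 1250, "Season 2016/2017"),
    ("Ship Charlie", "01/2017", "B", 95.0, 310, "Season 2016/2017"),
]
conn = make_db(rows)

# SELECT * returns every column; naming columns returns only those columns.
all_columns = conn.execute("SELECT * FROM cap02.tb_ship").fetchall()
two_columns = conn.execute("SELECT ship_name, month_year FROM cap02.tb_ship").fetchall()

print(len(all_columns[0]), "columns per row with SELECT *")
print(two_columns)
```

Neither query changes the table: both only read from it.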
Applying Filters
In some situations, we want to filter the query, returning only the rows that meet a criterion (filter).
For example, ANVISA classifies vessels into four levels, A to D, according to their scores.
We want to return all records whose risk classification is D, the highest-risk category in the ANVISA classification.
The SELECT DISTINCT command will return the unique values of the risk rating:
SELECT DISTINCT risk_classification
FROM cap02.tb_ship;
SELECT ship_name, season
FROM cap02.tb_ship
WHERE risk_classification = 'D';
Based on the ANVISA dataset, these ships had the worst risk rating (D) over the past few seasons.
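The two queries above can be sketched on a toy table. This assumes Python's built-in sqlite3 module in place of MySQL, invented rows rather than the ANVISA data, and a hypothetical helper make_db.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory table shaped like cap02.tb_ship and load the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS cap02")
    conn.execute(
        "CREATE TABLE cap02.tb_ship (ship_name TEXT, month_year TEXT, "
        "risk_classification TEXT, compliance_index REAL, risk_score INTEGER, season TEXT)"
    )
    conn.executemany("INSERT INTO cap02.tb_ship VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn

# Invented sample rows, for illustration only.
conn = make_db([
    ("Ship Alpha", "04/2018", "A", 100.0, 0, "Season 2017/2018"),
    ("Ship Bravo", "12/2016", "D", 72.5, 1250, "Season 2016/2017"),
    ("Ship Charlie", "01/2017", "B", 95.0, 310, "Season 2016/2017"),
    ("Ship Delta", "02/2018", "D", 80.0, 640, "Season 2017/2018"),
])

# DISTINCT collapses repeated values: D appears twice in the table but once here.
classes = sorted(
    row[0] for row in conn.execute("SELECT DISTINCT risk_classification FROM cap02.tb_ship")
)

# WHERE keeps only the rows that satisfy the condition.
worst = sorted(
    conn.execute(
        "SELECT ship_name, season FROM cap02.tb_ship WHERE risk_classification = 'D'"
    ).fetchall()
)

print(classes)
print(worst)
```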
Sort the result
SELECT ship_name, risk_classification, season
FROM cap02.tb_ship
WHERE risk_classification = 'D'
ORDER BY ship_name;
The query returns the three columns listed in SELECT for the rows where the risk classification is the worst (D), sorted by ship name. We are slicing the table and adding criteria to return precisely the data we need.
Logical Operators
Consulting our data dictionary, we find a risk score variable: the sum of the risk values of each item in the inspection script. The risk score reflects the absence or failure of controls, so high-scoring ships are not adequately implementing security controls:
SELECT ship_name, risk_classification, risk_score, season
FROM cap02.tb_ship
WHERE risk_classification = 'D'
ORDER BY ship_name;
The higher the score, the fewer controls the company that manages the ships implemented. We will apply one more filter where we return all ships of risk rating D that have a risk score greater than 1,000:
SELECT ship_name, risk_classification, risk_score, season
FROM cap02.tb_ship
WHERE risk_classification = 'D' AND risk_score > 1000
ORDER BY ship_name;
We are slicing the table, returning only the data that meets the criteria we choose. This is one of the most significant advantages of SQL: the table initially had more than 460 records, and the filters narrow it down to the few we need.
The logical operator AND requires both the first and the second criterion to be true for a row to be returned. If either one is false, that row is left out of the result. We use AND when we want to be sure that both criteria hold.
If we want rows where at least one of the criteria is true, even if the other is not met, we use OR:
SELECT ship_name, risk_classification, risk_score, season
FROM cap02.tb_ship
WHERE risk_classification = 'D' OR risk_score > 1000
ORDER BY ship_name;
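The difference between AND and OR is easy to see on a small table. The sketch below assumes Python's built-in sqlite3 module instead of MySQL; the rows are invented, and the helpers make_db and names are hypothetical.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory table shaped like cap02.tb_ship and load the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS cap02")
    conn.execute(
        "CREATE TABLE cap02.tb_ship (ship_name TEXT, month_year TEXT, "
        "risk_classification TEXT, compliance_index REAL, risk_score INTEGER, season TEXT)"
    )
    conn.executemany("INSERT INTO cap02.tb_ship VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn

# Invented sample rows, for illustration only.
conn = make_db([
    ("Ship Alpha", "04/2018", "A", 100.0, 0, "Season 2017/2018"),
    ("Ship Bravo", "12/2016", "D", 72.5, 1250, "Season 2016/2017"),
    ("Ship Charlie", "01/2017", "D", 80.0, 640, "Season 2016/2017"),
    ("Ship Delta", "02/2018", "C", 85.0, 1100, "Season 2017/2018"),
])

def names(condition):
    """Return the ship names that satisfy the given WHERE condition, sorted by name."""
    sql = f"SELECT ship_name FROM cap02.tb_ship WHERE {condition} ORDER BY ship_name"
    return [row[0] for row in conn.execute(sql)]

# AND: a row must satisfy both conditions.
both = names("risk_classification = 'D' AND risk_score > 1000")
# OR: a row needs to satisfy only one of them.
either = names("risk_classification = 'D' OR risk_score > 1000")

print(both)    # only the D ship that also scores above 1000
print(either)  # every D ship, plus the C ship that scores above 1000
```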
Continuing with logical operators
Returning to the data dictionary, we have the compliance index, the percentage of items in the inspection script that the vessel met. The higher the index, the more items the vessel met.
SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.tb_ship
WHERE risk_classification = 'D'
ORDER BY ship_name;
For risk classification D, the compliance index tends to be lower. We can verify this by changing the filter to A:
SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.tb_ship
WHERE risk_classification = 'A'
ORDER BY ship_name;
Alternatively, and even better, we can check whether a vessel belongs to risk classification A or B and has a compliance index greater than 90%. With a single statement, we return two risk classification categories:
SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.tb_ship
WHERE risk_classification IN ('A', 'B')
AND compliance_index > 90
ORDER BY ship_name;
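IN with a list of values is shorthand for a chain of OR comparisons on the same column. The sketch below checks this on invented rows, assuming Python's built-in sqlite3 module instead of MySQL and a numeric compliance_index; the helpers make_db and names are hypothetical.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory table shaped like cap02.tb_ship and load the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS cap02")
    conn.execute(
        "CREATE TABLE cap02.tb_ship (ship_name TEXT, month_year TEXT, "
        "risk_classification TEXT, compliance_index REAL, risk_score INTEGER, season TEXT)"
    )
    conn.executemany("INSERT INTO cap02.tb_ship VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn

# Invented sample rows, for illustration only.
conn = make_db([
    ("Ship Alpha", "04/2018", "A", 100.0, 0, "Season 2017/2018"),
    ("Ship Bravo", "12/2016", "D", 72.5, 1250, "Season 2016/2017"),
    ("Ship Charlie", "01/2017", "B", 95.0, 310, "Season 2016/2017"),
    ("Ship Delta", "02/2018", "B", 88.0, 520, "Season 2017/2018"),
    ("Ship Echo", "03/2018", "C", 93.0, 400, "Season 2017/2018"),
])

def names(condition):
    """Return the ship names that satisfy the given WHERE condition, sorted by name."""
    sql = f"SELECT ship_name FROM cap02.tb_ship WHERE {condition} ORDER BY ship_name"
    return [row[0] for row in conn.execute(sql)]

with_in = names("risk_classification IN ('A', 'B') AND compliance_index > 90")
# The same filter written with OR; the parentheses keep the OR grouped before the AND.
with_or = names(
    "(risk_classification = 'A' OR risk_classification = 'B') AND compliance_index > 90"
)

print(with_in)
print(with_in == with_or)
```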
It is common, when we access a web page, that not all data is loaded at once. Only a portion is loaded so as not to overload the database, since each query consumes computational resources, and so it may be necessary to limit the number of rows returned.
Let us sort the data now by compliance index:
SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.tb_ship
WHERE risk_classification IN ('A', 'B') AND compliance_index > 90
ORDER BY compliance_index;
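The first vessels returned are those with a compliance index of 100, that is, vessels that met all safety requirements according to ANVISA. Keep in mind that ORDER BY sorts in ascending order by default; 100 appears first here because compliance_index was created as VARCHAR and is therefore sorted as text. On a numeric column, ORDER BY compliance_index DESC would be needed to list the highest values first. However, we do not want all records, only the first ten, keeping the same ordering: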
SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.tb_ship
WHERE risk_classification IN ('A', 'B') AND compliance_index > 90
ORDER BY compliance_index
LIMIT 10;
These are the ten vessels at the top of the list. The result takes the first ten rows according to the compliance index ordering, so if we change the ORDER BY, the result will be different. For example, if we add the ship name to ORDER BY:
SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.tb_ship
WHERE risk_classification IN ('A', 'B') AND compliance_index > 90
ORDER BY compliance_index, ship_name
LIMIT 10;
The query reads the ships table, applies the filter, and sorts the result. Only after sorting does it return the first ten rows. LIMIT 10 is the last clause applied, after everything else has been executed and ordered, and it only limits how much of the result we see. LIMIT is not a filter!
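A small sketch of why LIMIT is not a filter: the WHERE clause decides which rows qualify, and LIMIT only cuts the sorted list. It assumes Python's built-in sqlite3 module instead of MySQL, invented rows, a numeric compliance_index (so DESC puts the highest values first), and the hypothetical helpers make_db and first_two.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory table shaped like cap02.tb_ship and load the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS cap02")
    conn.execute(
        "CREATE TABLE cap02.tb_ship (ship_name TEXT, month_year TEXT, "
        "risk_classification TEXT, compliance_index REAL, risk_score INTEGER, season TEXT)"
    )
    conn.executemany("INSERT INTO cap02.tb_ship VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn

# Invented sample rows, for illustration only.
conn = make_db([
    ("Ship Alpha", "04/2018", "A", 100.0, 0, "Season 2017/2018"),
    ("Ship Bravo", "12/2016", "D", 72.5, 1250, "Season 2016/2017"),
    ("Ship Charlie", "01/2017", "B", 95.0, 310, "Season 2016/2017"),
    ("Ship Foxtrot", "02/2018", "B", 91.0, 450, "Season 2017/2018"),
    ("Ship Golf", "03/2018", "A", 100.0, 0, "Season 2017/2018"),
])

where = "risk_classification IN ('A', 'B') AND compliance_index > 90"

def first_two(order_by):
    """Apply the same filter, sort as requested, then keep only the first two rows."""
    sql = f"SELECT ship_name FROM cap02.tb_ship WHERE {where} ORDER BY {order_by} LIMIT 2"
    return [row[0] for row in conn.execute(sql)]

# The filter alone decides how many rows qualify.
matching = conn.execute(f"SELECT COUNT(*) FROM cap02.tb_ship WHERE {where}").fetchone()[0]

top_by_index = first_two("compliance_index DESC, ship_name")
top_by_name = first_two("ship_name")

print(matching)      # rows that pass the filter
print(top_by_index)  # first two rows when sorted by index, highest first
print(top_by_name)   # first two rows when sorted by name: a different pair
```

The same four rows pass the filter in both cases; only the ordering changes which two of them LIMIT lets through.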
In practice, what we did was apply conditional filtering with multiple conditions.
In April 2018, did any vessel have a compliance rate of 100% and a risk score equal to 0?
SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.tb_ship
WHERE compliance_index = 100 AND risk_score = 0 AND month_year = '04/2018'
ORDER BY compliance_index;
We can have multiple conditions; it always depends on what we want to return. In this case, we were very specific by adding more conditions. Piling up conditions can solve our problem, but a poorly built query can also overload the database and cause problems for other people. So, we have to apply filters judiciously.
Subquery - solving problems in different ways
SQL has a feature that makes it a little challenging: we can reach the same result in several different ways, yet one of them is usually best in terms of performance and does not strain the database. For example:
SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.tb_ship
WHERE compliance_index IN (SELECT compliance_index
FROM cap02.tb_ship
WHERE compliance_index > 90)
AND risk_score = 0
AND month_year = '04/2018'
ORDER BY compliance_index;
In practice, we are using the concept of a subquery. We took the compliance index condition and moved it into a subquery in the WHERE clause, using the IN operator with a query inside a query. After the parentheses, we kept the same conditions as before.
In this case, there is a considerable difference in performance. We can see it in the execution plan, which shows how the DBMS engine interprets the commands and how the query is executed in the database. Comparing the execution plan of the first, original query with the plan of the subquery version makes the difference visible.
The second query also solved our problem, but it took a more tortuous path, one that can strain the database and the application. On a dataset with millions of records, the subquery would most likely hurt performance. So, we have to build SQL queries that solve our problem with good performance!
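As a rough way to compare the two approaches, the sketch below runs a direct condition and the equivalent IN subquery on invented rows and prints the plan for each. It assumes Python's built-in sqlite3 module, whose command is EXPLAIN QUERY PLAN (MySQL uses EXPLAIN); the plan text depends on the engine and version, so it will not match what MySQL shows. The helper make_db is hypothetical.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory table shaped like cap02.tb_ship and load the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS cap02")
    conn.execute(
        "CREATE TABLE cap02.tb_ship (ship_name TEXT, month_year TEXT, "
        "risk_classification TEXT, compliance_index REAL, risk_score INTEGER, season TEXT)"
    )
    conn.executemany("INSERT INTO cap02.tb_ship VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn

# Invented sample rows, for illustration only.
conn = make_db([
    ("Ship Alpha", "04/2018", "A", 100.0, 0, "Season 2017/2018"),
    ("Ship Bravo", "04/2018", "D", 72.5, 1250, "Season 2017/2018"),
    ("Ship Charlie", "04/2018", "B", 95.0, 310, "Season 2017/2018"),
    ("Ship Golf", "05/2018", "A", 100.0, 0, "Season 2017/2018"),
])

tail = "AND risk_score = 0 AND month_year = '04/2018' ORDER BY compliance_index"
direct_sql = f"SELECT ship_name FROM cap02.tb_ship WHERE compliance_index > 90 {tail}"
subquery_sql = (
    "SELECT ship_name FROM cap02.tb_ship WHERE compliance_index IN "
    f"(SELECT compliance_index FROM cap02.tb_ship WHERE compliance_index > 90) {tail}"
)

direct_rows = conn.execute(direct_sql).fetchall()
subquery_rows = conn.execute(subquery_sql).fetchall()

# The last column of each plan row is a human-readable description of one step.
direct_plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + direct_sql)]
subquery_plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + subquery_sql)]

print(direct_rows == subquery_rows)
print("direct:", direct_plan)
print("subquery:", subquery_plan)
```

Both queries return the same rows; what differs is the work the engine plans to do to get there.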
IN Operator Tip
Avoid the IN operator and subqueries whenever a simpler, better-performing alternative is available!
1. Which vessels have a risk score equal to 310?
SELECT *
FROM cap02.tb_ship
WHERE risk_score = 310;
2. Which vessels have an A risk rating and a compliance index greater than or equal to 95%?
SELECT *
FROM cap02.tb_ship
WHERE risk_classification = 'A'
AND compliance_index >= 95;
3. Which vessels have a C or D risk rating and a compliance index less than or equal to 95%?
SELECT *
FROM cap02.tb_ship
WHERE (risk_classification = 'C'
OR risk_classification = 'D')
AND compliance_index <= 95;
or
SELECT *
FROM cap02.tb_ship
WHERE risk_classification IN ('C','D')
AND compliance_index <= 95;
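When OR and AND are mixed, AND is evaluated first, so grouping matters. The sketch below shows how the result changes with and without parentheses. It assumes Python's built-in sqlite3 module instead of MySQL, invented rows, a numeric compliance_index, and the hypothetical helpers make_db and names.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory table shaped like cap02.tb_ship and load the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS cap02")
    conn.execute(
        "CREATE TABLE cap02.tb_ship (ship_name TEXT, month_year TEXT, "
        "risk_classification TEXT, compliance_index REAL, risk_score INTEGER, season TEXT)"
    )
    conn.executemany("INSERT INTO cap02.tb_ship VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn

# Invented sample rows, for illustration only.
conn = make_db([
    ("Ship Alpha", "04/2018", "A", 100.0, 0, "Season 2017/2018"),
    ("Ship Hotel", "01/2017", "C", 99.0, 120, "Season 2016/2017"),
    ("Ship India", "02/2017", "D", 80.0, 900, "Season 2016/2017"),
    ("Ship Juliet", "03/2017", "C", 90.0, 400, "Season 2016/2017"),
])

def names(condition):
    """Return the ship names that satisfy the given WHERE condition, sorted by name."""
    sql = f"SELECT ship_name FROM cap02.tb_ship WHERE {condition} ORDER BY ship_name"
    return [row[0] for row in conn.execute(sql)]

# Without parentheses this reads as: C, OR (D AND index <= 95).
ungrouped = names(
    "risk_classification = 'C' OR risk_classification = 'D' AND compliance_index <= 95"
)
# With parentheses the index condition applies to both classifications.
grouped = names(
    "(risk_classification = 'C' OR risk_classification = 'D') AND compliance_index <= 95"
)
# IN expresses the grouped version without needing parentheses.
with_in = names("risk_classification IN ('C', 'D') AND compliance_index <= 95")

print(ungrouped)  # includes Ship Hotel, a C ship with an index above 95
print(grouped)
print(with_in)
```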
4. Which vessels have an A risk rating or a risk score of 0?
SELECT *
FROM cap02.tb_ship
WHERE risk_classification = 'A'
OR risk_score = 0;
5. Which vessels were inspected in December 2016?
SELECT *
FROM cap02.tb_ship
WHERE season LIKE '%December 2016';
In this case, we want to filter on a set of characters within a specific column. LIKE filters based on a string pattern in which % matches any sequence of characters, so this query returns every row whose season value ends with December 2016. As with IN, if a more performant option is available, we prefer it over LIKE.
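The position of % changes what LIKE matches. The sketch below compares two patterns on invented season values (the real format of the season column is not shown in this text, so these strings are an assumption), using Python's built-in sqlite3 module instead of MySQL and the hypothetical helpers make_db and names.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory table shaped like cap02.tb_ship and load the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS cap02")
    conn.execute(
        "CREATE TABLE cap02.tb_ship (ship_name TEXT, month_year TEXT, "
        "risk_classification TEXT, compliance_index REAL, risk_score INTEGER, season TEXT)"
    )
    conn.executemany("INSERT INTO cap02.tb_ship VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn

# Invented sample rows; the season strings are made up to exercise the patterns.
conn = make_db([
    ("Ship Alpha", "12/2016", "A", 100.0, 0, "December 2016"),
    ("Ship Bravo", "12/2016", "D", 72.5, 1250, "Inspected in December 2016"),
    ("Ship Charlie", "12/2016", "B", 95.0, 310, "December 2016, second visit"),
    ("Ship Delta", "01/2017", "C", 85.0, 700, "January 2017"),
])

def names(pattern):
    """Return the ship names whose season matches the LIKE pattern, sorted by name."""
    sql = "SELECT ship_name FROM cap02.tb_ship WHERE season LIKE ? ORDER BY ship_name"
    return [row[0] for row in conn.execute(sql, (pattern,))]

# Leading % only: the value must end with the text.
ends_with = names("%December 2016")
# % on both sides: the text may appear anywhere in the value.
contains = names("%December 2016%")

print(ends_with)
print(contains)
```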