0% found this document useful (0 votes)
1 views

An_Introduction_to_SQL_1731971471

Uploaded by

Talal Ben
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
1 views

An_Introduction_to_SQL_1731971471

Uploaded by

Talal Ben
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 57

An introduction to SQL

Alkistis Pourtsidou
Queen Mary, University of London

Image credit: Hayden Planetarium, 2014


WHAT IS SQL?

✦ SQL stands for Structured Query Language

✦ Used to query (“talk to”) a database server

✦ Data manipulation, database creation

✦ Almost all companies use databases to store their data

✦ Have a look at the Indeed Job Search Engine…tens of


thousands of jobs mentioning SQL!

✦ https://www.indeed.co.uk/
WHY DO WE NEED DATABASES?

✦ Concurrency: multiple simultaneous changes to data

✦ Data changes regularly

✦ Large data sets but only need subsets

✦ Sharing large data sets

✦ Rapid queries

✦ Data web interfaces (dynamic data)


WAYS TO USE SQL

✦ Standard console command (e.g. mysql -u user -p dbname)

✦ GUI interfaces often available

✦ Interfaces to many programming languages (Python, R, …)

✦ SQLite - use SQL without a database server: this is what we


are going to use in our tutorial

✦ PostgreSQL - the world’s most advanced open source


database: we’ll see how that works later on
MORE ABOUT DATABASES

✦ A database server can contain many databases

✦ Basically, databases are collections of tables with rows


(observations) and columns (variables)

✦ Limited mathematical operations available

✦ Very good at combining information from several related tables

✦ We’ll explore the above (and more) in detail


duction to
SQL Finding Your Way Around
EXPLORING the Server
THE SERVER
✦ A given server can support multiple databases

✦ EachSince
database contains
a single servermany tables many databases, each
can support
containing many tables, with each table having a variety
✦ Each table contains many columns
of columns, it’s easy to get lost when you’re working with
databases.
✦ But keeping These
things commands
under control is will help figure out what’s
straightforward!
available:
I SHOW DATABASES;
I SHOW TABLES IN database;
I SHOW COLUMNS IN table;
I DESCRIBE table; - shows the columns and their
types
duction to
SQL Variable Types
VARIABLE TYPES
✦ SQLSQL
supports a variety
supports a veryoflarge
different
numberformats for storing
of di↵erent information
formats for
internal storage of information.

Numeric
I INTEGER, SMALLINT, BIGINT
I NUMERIC(w,d), DECIMAL(w,d) - numbers with width
w and d decimal places
I REAL, DOUBLE PRECISION - machine and database
dependent
I FLOAT(p) - floating point number with p binary
digits of precision
VARIABLE TYPES
uction to
SQL Variable Types (cont’d)
✦ SQL supports a variety of different formats for storing information

Character
I CHARACTER(L) - a fixed-length character of length L
I CHARACTER VARYING(L) or VARCHAR(L) - supports
maximum length of L
Binary
I BIT(L), BIT VARYING(L) - like corresponding
characters
I BINARY LARGE OBJECT(L) or BLOB(L)
Temporal
I DATE
I TIME
I TIMESTAMP
SQL: THE BASICS
✦ Enough intro! Let’s dive into SQL with hands-on examples

✦ We will use SQLite3, which is part of Python, for details see


https://www.pythoncentral.io/introduction-to-sqlite-in-python/

✦ SQLite is an embedded SQL database engine. It doesn’t have a


separate server process, which makes it really easy to use,
immediately.
✦ Using Python (in the form of a Jupyter notebook) to run our SQL
code allows us to use Pandas for importing our results and make
everything look nice and clear!
SQL: THE BASICS
SQL: THE BASICS

✦ Let’s first define a function that takes our query, stored as a string, as an input.

✦ Then shows the result as a formatted data frame (we’ll see this in action in a
bit…)
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS

✦ As we saw, the database has two tables, TRIPS and


STATIONS. We will first work with the TRIPS table. Let’s see
what kind of information it contains:
SQL: THE BASICS

✦ See all columns:


SQL: THE BASICS

✦ See specific columns:


SQL: THE BASICS
SQL: THE BASICS

✦ DESC: Descending
SQL: THE BASICS

✦ The longest trip lasts


a bit less than 3
hours.
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS

✦ So far we’ve been looking at queries pulling data from the


TRIPS table

✦ But as you might remember there’s also the STATIONS table

✦ The STATIONS table contains information about every station


in the Hubway network

✦ It also includes an id column referenced by the TRIPS table

✦ So these tables can be combined to extract useful information

✦ Let’s have a look…


SQL: THE BASICS

ID is a unique identifier for each station, corresponding to the


start_station and end_station columns in the TRIPS table
SQL: THE BASICS
✦ Let’s say we want to know which station is the most
popular starting point

✦ For that we need to combine information from both the TRIPS


and STATIONS tables

✦ We will use the JOIN command


SQL: THE BASICS

✦ We will use SELECT to return the station column from the stations table using the
table.column syntax, i.e. stations.station in our case
✦ We also return the COUNT of the number of rows from the trips table

✦ To tell the database how the stations and trips tables are connected, we’ll use JOIN and
ON.
✦ JOIN specifies which tables should be connected

✦ ON specifies which columns in each table are related

✦ INNER JOIN means rows will only be returned where there is a much in the columns
specified by ON
✦ Tables are connected ON trips.start_station = stations.id

✦ Then we group by the station column so that COUNT will give the number of trips for
each station separately
✦ Finally we ORDER BY descending order
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
✦ Let’s slightly expand this query to see which are the most
popular round-trip stations:
EXERCISES/TASKS

✦ Code up the queries we just learned in Jupyter and reproduce


the results

✦ How many trips lasted more than half an hour? (this induces
extra charges)

✦ Which bike was used for the least total time?

✦ Did registered or casual users take more round trips?

✦ Pick up any publicly available database and play with it!


PostgreSQL

✦ PostgreSQL: “the world’s most advanced open source relational


database
✦ Active development for 30 years now!

✦ www.postgresql.org

✦ Installation: For MAC OS I strongly recommend using


Postgress.app, see https://www.calhoun.io/how-to-install-
postgresql-9-6-on-mac-os-x/
✦ For other systems, see https://www.dataquest.io/blog/sql-
intermediate/ and https://www.systems.ethz.ch/sites/default/files/
ex1a_postgresql_jupyter_setup.pdf (not tested…)
PostgreSQL

✦ First we need to create new user, database, and tables

✦ Follow the instructions in https://www.dataquest.io/blog/sql-


intermediate/ to run psql, create a new user named ‘oracle’ (or
another name of your preference) and a new database

✦ The new database contains consumer complaints

✦ It has two tables: one for bank account complaints and one
for credit card complaints
PostgreSQL

✦ We need to populate these with actual data!

✦ We will use data from here https://data.world/dataquest/bank-


and-credit-card-complaints

✦ Again, follow the instructions in https://www.dataquest.io/blog/


sql-intermediate/ to load the data

✦ They are CSV files

✦ They have identical fields: complaint_id, date_received,


product, …, issue, consumer_complaint_narrative, etc.
PostgreSQL
✦ Before having a look and playing with the data, we need to
create two helper functions

✦ One to run queries and one to run commands


PostgreSQL
✦ Before having a look and playing with the data, we need to
create two helper functions

✦ One to run queries and one to run commands


PostgreSQL
✦ OK, now let’s test everything works OK.

✦ First let’s see how the credit card complaints table looks like.
PostgreSQL
✦ Then let’s get the number of records using the COUNT
function

✦ Works well! (try the bank account complaints table too)


PostgreSQL
✦ How to deal with NULL values

✦ Let’s see how many records in each table have null values for
the consumer complaint narrative field

✦ When comparing a column to null (no value), we cannot use


arithmetic operators. Instead we use IS NULL / IS NOT NULL.
PostgreSQL: Views
✦ So we just saw a large amount of records had null values for
the consumer complaint narrative field.

✦ Instead of having to filter on this field later, we’ll create a view


with this subset only.

✦ Syntax is simple:
PostgreSQL: Views
✦ Let’s have a look:
PostgreSQL: String Concatenation

✦ Extremely useful, combines two or more strings (text values)


together to form a single string

✦ For example say we have a “month” field and a “year” field but
we need to show “month-year” instead

✦ Syntax:

✦ Let’s try it out with our credit card complaints table

✦ Let’s select complaint_id, product, company, and concatenate


separated by a hyphen
PostgreSQL: String Concatenation
PostgreSQL: Subqueries
✦ Subqueries (“inline views”) create a mini view within a single
query

✦ The best way to understand how they work is via an example:


TASKS

✦ Read about UNION/UNION ALL, and put them in action using


different views of the banking data

✦ Same with INTERSECT/EXCEPT

✦ Explore subqueries, for example by reproducing the


“Subqueries in action” examples in https://www.dataquest.io/
blog/sql-intermediate/

✦ Go through the SQL/Pandas tutorial in https://


www.dataquest.io/blog/python-pandas-databases/
REFERENCES

✦ Introduction to SQL https://www.stat.berkeley.edu/~spector/


sql.pdf

✦ https://www.w3schools.com/sql/

✦ http://www.sql-tutorial.net/

✦ https://www.kaggle.com/learn/sql

✦ https://www.dataquest.io/blog/sql-basics/

✦ https://www.dataquest.io/blog/sql-intermediate/

✦ https://www.dataquest.io/blog/python-pandas-databases/

You might also like