Advanced DBMS Viva :: New Edition

The document discusses several questions related to databases and database management systems (DBMS). It addresses issues with traditional file-based systems that make DBMS superior, examples of open source and commercial relational DBMS, common database models, how to choose a database model, what entity-relationship (ER) modeling is, what NoSQL databases are, ACID properties of transactions, levels of data abstraction, differences between columnar and row-based databases, differences between online transaction processing (OLTP) and online analytical processing (OLAP), what normalization and de-normalization are, what data warehousing is, and types of database locks.


General Questions

#1 What are the issues of traditional file-based systems that make DBMS a superior alternative?

One main issue is access. In the absence of indexing, the only option is a full scan of the file, which is extremely slow.

The other issue is redundancy and inconsistency. Files often contain duplicate and redundant data, and if you change one copy, you likely make the others inconsistent. Keeping files consistent is very expensive.

Another issue is lack of concurrency control. In a file-based system, one operation might lock an entire file, while in a DBMS, multiple operations are allowed to work on a single file concurrently.

Data isolation, integrity checking, atomicity of transactions, and security problems are some other issues with traditional file-based systems for which DBMSs provide good solutions.

#2 What are some examples of open source and commercial Relational DBMSs?
For open source RDBMSs, three popular options are MySQL, PostgreSQL, and SQLite. For commercial RDBMSs, you can mention Oracle, Microsoft SQL Server, IBM DB2, and Teradata.

#3 What is a database model, and what are a few common database models?

A database model is the logical structure of a database, describing the relationships and constraints that determine how data is stored and accessed. Some common database models include:

 Relational model

 Hierarchical model

 Entity-Relationship (ER) model

 Document model

 Object-Oriented (OO) model

#4 How do you choose a database model?

To some extent it depends on your application; each database model has its own strengths. For example, the document model is suitable for text or semi-structured data. On the other hand, if you have atomic data, the relational model is your best option.

It also depends on which DBMS you use. Many DBMSs are built to work with only one particular model, so the user has no other choice.

#5 What is ER modeling?

Entity-Relationship is a form of modeling that tries to imitate the relationships that exist among entities in the real world. In ER modeling, entities are aspects of the real world, e.g. an event, a location, or a person, and relationships, as the name suggests, are the associations between these entities.

In ER modeling, every entity has attributes, which can be viewed as the characteristics of that real-world object. For example, if employee is an entity, then the name of that employee is one of its attributes.

As an example of ER modeling, we can model one form of relationship among employees as follows: two entities, i.e. supervisors and employees, and a relationship, i.e. supervise. You can model your entire organization like this.

ER model of employees

#6 What is NoSQL?

NoSQL refers to a group of databases that are built for specific data models, e.g. graphs, documents, key-value pairs, and wide columns. Unlike relational databases, NoSQL databases have flexible schemas. NoSQL databases are widely recognized for their ease of development, functionality, and performance at scale. Unlike SQL databases, many NoSQL databases can be scaled horizontally across hundreds or thousands of servers.

NoSQL systems are considered very young compared to traditional relational databases. However, thanks to many innovations and performance improvements, their popularity is on the rise.

Besides all the benefits of these systems, it is worth mentioning that NoSQL databases do not generally provide the same level of data consistency as relational databases. This is because many NoSQL systems sacrifice ACID properties in favor of speed and scalability.


#7 What are the ACID properties of transactions?

ACID stands for Atomicity, Consistency, Isolation, and Durability. To maintain the consistency of a database before and after a transaction, these four conditions must be met. Below, I briefly describe each one.
Atomicity: Also known as the “all or nothing” rule. Either all parts of a transaction are stored or none of them is; no partial transaction is allowed. For example, if a transaction is to take money from one account and deposit it into another, all parts of it must complete for the database to stay consistent. If we apply the transaction partially, we make the database inconsistent.

Consistency: There is no consensus over the definition of this term. In general, you can look at it this way: the database was consistent before the transaction, and it must stay consistent after the transaction too.

Isolation: We run many transactions concurrently, and the intermediate state of each transaction must stay invisible to the others. For example, in the fund-transfer transaction described under atomicity, other transactions must see the money either in one account or in the other, never in neither. In other words, if transactions are fully isolated from each other, it must appear as if they run serially rather than concurrently.

Durability: When a transaction successfully commits, its effects must persist (be stored on disk) and cannot be undone, even in the event of a crash.
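As a quick illustration of atomicity, here is a minimal sketch using Python's sqlite3 module; the accounts table, names, and the simulated failure are all made up for the example. The transfer either applies both legs or neither.

```python
import sqlite3

# Illustrative schema: one table of accounts with balances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            # Simulate a crash between the two legs of the transfer
            if amount > 100:
                raise RuntimeError("simulated failure")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except RuntimeError:
        return False

transfer(conn, "alice", "bob", 150)   # fails; the debit is rolled back
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name='alice'").fetchone()[0]
print(balance)  # 100 -- no partial transaction was applied
```

Because the failure happens inside the transaction, the earlier debit is undone and the database stays consistent.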

#8 What are the different levels of data abstraction?


Data abstraction in DBMS is the process of hiding irrelevant details from users. In general, there are three levels of data abstraction:

 Physical level: the lowest level, describing how the data is physically stored. This level is managed by the DBMS, and its details are typically hidden from system admins, developers, and users.

 Conceptual or logical level: describes the database and the relationships between different fields. Developers and system admins work at this level.

 External or VIEW level: describes only part of the database. For example, the result of a query is a VIEW-level data abstraction. Users typically work at this level, and the details of the table schema and its physical storage are hidden from them.

#9 What is the difference between columnar and row-based databases?

Row-based databases store the data on disk row by row, whereas columnar databases store the data column by column. Each method has its own advantages: the former is fast and efficient for operations on rows, and the latter is fast and efficient for operations on columns, e.g. aggregating large volumes of data for a subset of columns.

Typically, the operations that need the whole row are write operations like INSERT, DELETE, and UPDATE. The operations that need columns are typically read operations like SELECT, GROUP BY, JOIN, etc.

In general, columnar databases are ideal for analytical operations and row-based databases are ideal for transaction processing.

#10 What are OLTP and OLAP and their differences?

OLTP and OLAP are both online processing systems. OLTP stands for “Online Transaction Processing”, a system that manages transaction-oriented applications; OLAP stands for “Online Analytical Processing”, a system that manages analytical queries.

The major difference between the two systems is that OLTP is a write-heavy system while OLAP is a read-heavy system. This difference has a major impact on their implementation. For example, it is very important for OLTP systems to adopt proper concurrency control, while this is not a major concern for read-heavy workloads. Another difference is that OLTP queries are generally simple and return a relatively small number of records, while OLAP queries are very complex and involve many intricate joins and aggregations.

Another difference is that, due to the real-time nature of OLTP systems, they often follow a decentralized architecture to avoid single points of failure, while OLAP systems often have a centralized architecture.

Also, in the majority of DBMSs, OLTP systems use row-based storage and OLAP systems use columnar storage.

#11 What are normalization and de-normalization?

Normalization is a process that organizes the data into multiple tables to minimize redundancy. De-normalization is the opposite process: it combines the normalized tables into one table so that data retrieval becomes faster. The main advantages of normalization are better use of disk space and easier maintenance of database integrity.

JOIN is the operation that allows us to reverse the normalization and create a de-normalized form of the data.
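The idea can be sketched with Python's sqlite3 module; the customers/orders tables and their contents are made up for the example. The normalized tables store each customer name exactly once, and a JOIN produces the de-normalized form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized design: customer details live in one table; orders reference
# customers by id, so each name is stored only once.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'pen'), (12, 2, 'lamp');
""")

# JOIN reverses the normalization: the customer name is repeated on
# every matching order row in the result.
rows = conn.execute("""
    SELECT c.name, o.item
    FROM orders o JOIN customers c ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()
print(rows)  # [('Ada', 'book'), ('Ada', 'pen'), ('Grace', 'lamp')]
```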

#12 What is Data Warehousing?

It is the process of collecting (extracting, transforming, and loading) data from heterogeneous sources and storing it in one database. You can consider the data warehouse a central repository into which data flows from transactional systems and other relational databases. It can correlate broad business data to give executives greater insight into an organization's performance. The data warehouse is the core of business intelligence, a system for data analysis and reporting.

This database is maintained separately from the standard operational databases. They are two separate systems: the operational databases are optimized to update real-time data quickly and accurately, while the warehouse is mostly suited for offline operations that give a long-range view of the data over time.
Data Warehousing

Concurrency control

Concurrency control is the procedure in a DBMS that ensures simultaneous operations do not conflict with each other.

#13 What are database locks and their types?

In general, it is fair to say that locks are mostly used to ensure that only one user/session is allowed to update a particular piece of data. Here I describe two types of locks: shared locks (S) and exclusive locks (X). These locks can be held on a table, a page, an index key, or an individual row.

Shared lock: When an operation requests a shared lock on a table and it is granted, that table becomes open to reading. This lock can be shared with other read operations, which can read the table at the same time.
Exclusive lock: When an operation requests an exclusive lock on a table and it is granted, it holds the exclusive right to write to the table. Other operations that request access to the locked table will be blocked.
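A toy shared/exclusive lock can be sketched in Python with a condition variable. This is only an illustration of the compatibility rules (any number of S holders together; X must be alone), not a real DBMS lock manager:

```python
import threading

class SharedExclusiveLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # current holders of the shared (S) lock
        self._writer = False   # whether the exclusive (X) lock is held

    def acquire_shared(self):
        with self._cond:
            while self._writer:            # S is incompatible with X
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:  # X is incompatible with S and X
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

lock = SharedExclusiveLock()
lock.acquire_shared()
lock.acquire_shared()     # a second reader is admitted immediately
print(lock._readers)      # 2
lock.release_shared()
lock.release_shared()
lock.acquire_exclusive()  # succeeds once no readers remain
lock.release_exclusive()
```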

Lock compatibility matrix

There is another related concept for locks, called Intent (I) locks. We have Intent Shared (IS) and Intent Exclusive (IX) locks. These locks allow more granular concurrency control. Technically, we do not need them; S and X locks are enough, but they are helpful for query optimization. More details about intent locks are typically beyond the scope of even advanced questions.

#14 What is “lock escalation”?

Database locks can exist on rows, pages, whole tables, or indexes. While a transaction is in progress, the locks it holds take up resources. Lock escalation is when the system consolidates multiple locks into a higher-level one (for example, consolidating multiple row locks into a page lock, or multiple page locks into a table lock), typically to recover the resources taken up by large numbers of fine-grained locks.

#15 What is “lock contention”?


Lock contention occurs when multiple operations request an exclusive lock on the same table. In this scenario, operations must wait in a queue. If you run into chronic lock contention, it means that some parts of your database are hot; you must divide those data blocks further to allow more operations to obtain exclusive locks at the same time. Lock contention can be a bottleneck for scaling up a database.

#16 What is “deadlock”?

A deadlock is a situation where some transactions wait indefinitely for each other to give up locks. Typically, there are two approaches to address this issue. One is the lazy way: do nothing, and if a deadlock happens, detect it and restart one operation to disentangle it. The other approach is proactive: prevent deadlocks from ever happening.
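The proactive approach can be sketched in Python: if every transaction acquires its locks in one global order, the circular wait required for a deadlock cannot form. The two locks, the ordering scheme, and the worker threads are all made up for the illustration.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
# A global ordering over locks; every thread must honor it.
ORDER = {id(lock_a): 0, id(lock_b): 1}

def acquire_in_order(*locks):
    # Sort by the global order before acquiring, regardless of how the
    # caller listed the locks. This rules out circular waits.
    for lk in sorted(locks, key=lambda l: ORDER[id(l)]):
        lk.acquire()

def release_all(*locks):
    for lk in locks:
        lk.release()

results = []

def worker(name, first, second):
    acquire_in_order(first, second)   # both threads really lock a, then b
    results.append(name)
    release_all(first, second)

t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))  # reversed on purpose
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(results))  # ['t1', 't2'] -- both finished, no deadlock
```

Without the ordering, two threads grabbing the locks in opposite order could each hold one lock and wait forever for the other.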

#17 What are isolation levels?

Isolation is the third letter in the ACID properties. With this property, our goal is to make all transactions completely isolated from each other (serializable). However, some applications do not need full isolation, so we define other isolation levels that are less stringent than full isolation. In general, five isolation levels are defined.
Read Uncommitted: No locks at all. Concurrent transactions can read the uncommitted data of other transactions and can also write over it. In database vernacular, we say the DBMS allows dirty reads.

Read Committed: At this level, the DBMS does not allow dirty reads; each transaction holds a read/write lock on the current row and releases it only when it commits the changes. This isolation level still allows non-repeatable reads, which means a transaction may get a different value when it reads the same row twice.

Repeatable Read: As you saw earlier, the problem with the “read committed” isolation level is non-repeatable reads. To avoid them, each transaction must hold a read lock on the rows it reads and a write lock on the rows it writes (e.g. inserts, updates, and deletes) until it commits the changes. This level of isolation is repeatable read.

However, there is still one scenario at this isolation level that can make the database inconsistent. If we insert or delete rows in a table and then run a range query, the results will be inconsistent: the same query in one transaction can return two different results. This scenario is known as a ‘phantom read’.
Serializable: This is the highest isolation level. As you saw in “repeatable read”, phantom reads can still happen. To prevent them, we must hold the lock on the entire table and not just the rows.

Snapshot: This isolation level is different from the others I have described so far. The others were based on locks and blocking; this one does not use locks. At this isolation level, when a transaction modifies (i.e. inserts, updates, or deletes) a row, the committed version of the modified row is copied to a temporary database (tempdb) and receives a version number. This is also known as row versioning. Then, if another session tries to read the modified object, the committed version of that object is returned to it from tempdb.

If what I described for snapshot isolation sounds fundamentally different from the other isolation levels, it is because it really is. The other isolation levels are based on a pessimistic concurrency control model, while snapshot isolation is based on an optimistic model. The optimistic model assumes conflicts are rare, so it does not prevent them and instead handles them if they occur. This differs from the pessimistic model, which ensures that no conflict happens whatsoever.

Access methods
Access methods are organization techniques or data structures
that support fast access to subsets of rows/columns. Some of the
most common data structures are variants of hash tables and B-
trees.

#18 What is hashing and what are its advantages and disadvantages?

Hashing is a lookup technique: basically, a way to map keys to values. A hash function converts a string of characters into a (usually shorter) fixed-length value, which can then be used as an index to store the original element.

With a good hash function, hashing can be used to index and retrieve items in a database in constant time, which is faster than other lookup techniques.

Advantages:

 A hash table is an ideal data structure for point lookups (a.k.a. equality queries), especially when the database is large, because regardless of the input size, you can search, insert, and delete data in constant time.

Disadvantages:
 There are situations where hashing is not necessarily the best option. For example, for small data, the cost of a good hash function makes hashing more expensive than a simple sequential search.

 Another situation is the range scan operation (a.k.a. range query); for this operation, a B+ tree is the ideal data structure.

 Another is searching for a sub-string or prefix match; hashing is basically useless for these operations.

 Another disadvantage of hashing is scalability. The performance of a hash table degrades as the database grows (more collisions and a higher cost of collision resolution, e.g. adding more buckets or rehashing the existing items).
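The point-lookup strength and the range-query weakness can both be seen with Python's built-in dict standing in for a hash index; the keys and values are made up for the example.

```python
# A dict as a stand-in for a hash index: keys hash to buckets, values are rows.
index = {key: f"row-{key}" for key in [42, 7, 19, 3, 88]}

# Point lookup (equality query): one hash computation, no scan.
print(index[19])          # row-19

# Range query: a hash index keeps no key ordering, so we must touch
# every entry -- exactly the case where a B+ tree does better.
in_range = sorted(k for k in index if 5 <= k <= 50)
print(in_range)           # [7, 19, 42]
```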

#19 What is a B+ tree and what are its advantages and disadvantages?

A B+ tree is a data structure from the family of B-trees. This data structure and its variants are very popular for indexing purposes. It is a self-balancing tree with mechanisms in place to ensure that the nodes are at least half full. In a B+ tree, the data is stored in the leaf nodes, and the leaf nodes are sequentially linked to each other. These sequential links allow sequential access to the data without traversing the tree structure, which makes range scans fast and efficient.

A B+ tree allows searches, sequential access, insertions, and deletions in logarithmic time.

Typically, in database systems, the B+ tree is compared against the hash table, so here I explain the advantages and disadvantages of B+ trees relative to hash tables. The advantages of the B+ tree are in range queries and in searching for substrings using the LIKE command. On the other hand, for equality queries, a hash index is better than a B+ tree. Another advantage of the B+ tree is that it can easily grow with the data, which makes it more suitable for storing large amounts of data on disk.

One common follow-up question is: what is the difference between a B+ tree and a Binary Search Tree (BST)? A B+ tree is a generalization of the BST that allows tree nodes to have more than two children.

If someone asks you about the difference between a B tree and a B+ tree, you can mention two things. First, in a B+ tree, the records are stored only in the leaves, and the internal nodes store the keys (pointers). In a B tree, by contrast, both keys and records can be stored in internal and leaf nodes. Second, the leaf nodes of a B+ tree are linked together, while in a B tree they are not.
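The linked leaf level of a B+ tree behaves like a sorted sequence, so a sorted list plus binary search (Python's bisect) is enough to sketch why range scans are cheap; the keys are made up for the example.

```python
import bisect

# The "leaf level": keys kept in sorted order, readable sequentially.
keys = sorted([42, 7, 19, 3, 88, 55, 21])

def range_scan(keys, lo, hi):
    # Descend to the first qualifying key in O(log n)...
    start = bisect.bisect_left(keys, lo)
    out = []
    # ...then walk sequentially, as a B+ tree walks its linked leaves.
    for k in keys[start:]:
        if k > hi:
            break
        out.append(k)
    return out

print(range_scan(keys, 5, 50))  # [7, 19, 21, 42]
```

A hash table cannot do this walk at all, because its keys have no order.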

#20 What is the difference between clustered and non-clustered indexes?

Indexes are used to speed up query processing. Without them, a DBMS needs to perform a full table scan, which is very slow.

A clustered index relates to the physical storage of the data. Basically, we ask the DBMS to sort the rows by a column and physically store them in that order. As a consequence, we can have only one clustered index per table. A clustered index lets us retrieve data very fast because it provides fast sequential scans. You can either create a customized clustered index or let the DBMS automatically create one for you using the primary key.

In contrast, a non-clustered index is not related to physical storage. These indexes are sorted based on a column and stored somewhere separate from the table. You can imagine them as a lookup table with two columns: one is a sorted form of one of the table's columns, and the other is the corresponding row's physical address (row locator). With these indexes, if we look for a record, we first search for its key in the lookup table, then follow the row address to fetch all the records associated with it.
In summary, non-clustered indexes are slower than clustered indexes because they involve an extra lookup step. Moreover, since we need to store these lookup tables, we need extra storage space. The other difference is that we can have only one clustered index per table, while we can have as many non-clustered indexes as we want.

Operator execution

#21 What are correlated and non-correlated sub-queries?
In terms of inter-dependency, there are two types of sub-queries. In one, the inner query depends on values from the outer query; we call these “correlated” sub-queries. In the other, the inner and outer queries are independent; we call these “non-correlated” sub-queries.

Note that correlated sub-queries can be very slow, because the inner query must run once for every row of the outer query.
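Both kinds can be sketched with Python's sqlite3 module; the emp table and its data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER);
INSERT INTO emp VALUES ('a','sales',10), ('b','sales',20),
                       ('c','eng',100), ('d','eng',90);
""")

# Non-correlated: the inner query runs once, independently of the outer rows.
non_corr = conn.execute("""
    SELECT name FROM emp
    WHERE salary > (SELECT AVG(salary) FROM emp)
    ORDER BY name
""").fetchall()
print(non_corr)   # [('c',), ('d',)] -- above the company-wide average of 55

# Correlated: the inner query references e.dept, so conceptually it is
# re-evaluated for every row of the outer query.
corr = conn.execute("""
    SELECT name FROM emp e
    WHERE salary > (SELECT AVG(salary) FROM emp WHERE dept = e.dept)
    ORDER BY name
""").fetchall()
print(corr)       # [('b',), ('c',)] -- above their own department's average
```

The two queries return different rows precisely because the correlated version re-computes the average per department of the current outer row.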

#22 What are different JOIN algorithms?

There are three major algorithms to perform a JOIN. Here I describe them briefly and mention their advantages.
 Nested Loop: It compares all values of outer and inner
tables against each other. It is the only algorithm that is
capable of cross join (many-to-many joins). It serves as
a fallback option in the absence of better algorithms.

 Hash Join: This is the most versatile join method. In a nutshell, it builds an in-memory hash table of the smaller of its two inputs, and then reads the larger input and probes the in-memory hash table to find matches. Hash joins can only be used to compute equi-joins. A hash join is typically more efficient than nested loops, except when the probe side of the join is very small.

 Sort-Merge Join: This algorithm first sorts both tables on the join attributes. It then finds the first match and advances through the two tables, merging the rows with matching attributes.
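The build/probe structure of a hash join can be sketched in plain Python; the relations, key names, and data are made up for the example.

```python
from collections import defaultdict

def hash_join(small, large, small_key, large_key):
    # Build phase: hash every row of the smaller relation by its join key.
    table = defaultdict(list)
    for row in small:
        table[row[small_key]].append(row)
    # Probe phase: stream the larger relation and look up matches.
    # Only equality on the key can be answered this way (equi-join).
    out = []
    for row in large:
        for match in table.get(row[large_key], []):
            out.append({**match, **row})
    return out

depts = [{"dept_id": 1, "dept": "eng"}, {"dept_id": 2, "dept": "sales"}]
emps  = [{"name": "a", "dept_id": 1}, {"name": "b", "dept_id": 2},
         {"name": "c", "dept_id": 1}]

joined = hash_join(depts, emps, "dept_id", "dept_id")
print([(r["name"], r["dept"]) for r in joined])
# [('a', 'eng'), ('b', 'sales'), ('c', 'eng')]
```

Note how each row of the larger input costs only one hash lookup, which is why this beats the nested loop when both inputs are sizable.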

#23 What is a stored procedure?

You can consider it a semi-program: a set of SQL statements that performs a specific task. If the task is common, then instead of writing the query each time, we can store it in a procedure and execute the procedure when we need it. Below is the simple structure of a procedure:
CREATE PROCEDURE <Procedure-Name> AS
BEGIN
<SQL STATEMENT>
END
After creating a procedure, whenever we need it, we can execute it using the EXECUTE command:
EXECUTE <Procedure-Name>

Stored procedures have many benefits. The major one is reusability of SQL code: if the procedure is in common use, it helps avoid writing the same code multiple times. Another benefit arises with distributed databases, where it reduces the amount of information sent over the network.

#24 What is a database trigger?

A trigger is a stored procedure that runs automatically before or after an event occurs. These events can be DML, DDL, or DCL statements, or database operations such as LOGON/LOGOFF. The general trigger syntax is as below:
CREATE [ OR ALTER ] TRIGGER [ Trigger_name ]
[BEFORE | AFTER | INSTEAD OF]
{[ INSERT ] | [ UPDATE ] | [ DELETE ]}
ON [table_name]
AS
{SQL Statement}

Some applications of triggers are: checking the validity of transactions, enforcing referential integrity, event logging, automatically generating derived columns, and security authorizations before/after user LOGON.

If you are asked about the difference between a trigger and a stored procedure, you can mention that triggers cannot be called on their own: they are invoked by events. In contrast, a stored procedure is an independent query and can be called independently.

Query planning and optimization

Query planning and optimization is a feature of many RDBMSs. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

#25 What is an execution plan?

An execution plan is a description of the operations that the database engine plans to perform to run the query efficiently. You can look at it as a view into your DBMS's query optimizer, the software component that finds the most efficient way to implement a query. The execution plan is a primary means of troubleshooting a poorly performing query.

Reading an execution plan, understanding it, and troubleshooting based on it is an art.

#26 What is query optimization?


SQL is a declarative language, not a procedural one: you tell the DBMS what you want, but not how to get it. It is up to the DBMS to figure that out. The DBMS can adopt multiple query strategies to get correct results, but these execution plans incur different costs. It is the query optimizer's job to compare these strategies and pick the one with the least expected cost. The cost in this context is a weighted combination of I/O and processing costs, where the I/O cost is the cost of accessing index and data pages from disk.

#27 What are a few best practices to improve query performance?

This is a general question, and in practice there are many ways to improve query performance. Here I mention only a few:

 Avoid multiple joins in a single query

 Use joins instead of sub-queries.

 Use stored procedures for frequently used and more complex queries.

 Use WHERE expressions to limit the size of your results as much as possible.
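A quick way to watch an optimizer react is SQLite's EXPLAIN QUERY PLAN, here driven from Python. The table and index names are made up, and the exact plan wording varies by SQLite version, so the comments show typical output rather than a guaranteed string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")

# Without an index, the WHERE clause forces a full table scan.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT val FROM t WHERE id = 42").fetchall()
print(plan_before[0][3])   # e.g. 'SCAN t'

# After adding an index, the optimizer switches to an index search.
conn.execute("CREATE INDEX idx_t_id ON t(id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT val FROM t WHERE id = 42").fetchall()
print(plan_after[0][3])    # e.g. 'SEARCH t USING INDEX idx_t_id (id=?)'
```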

Crash Recovery
Crash recovery is the process of rolling the database back to a consistent and usable state after a crash. This is done by undoing incomplete transactions and redoing committed transactions that were still in memory when the crash occurred.

#28 What is Write-Ahead Log (WAL)?

In DBMSs, the de facto technique for recovery management and for maintaining Atomicity and Durability is the Write-Ahead Log (WAL). With WAL, all changes are written to a log first, and the log itself must be written to stable storage before the changes are allowed to be physically applied; that is why it is called a write-ahead log. It is a simple technique that guarantees that when we come back from a crash, we can still figure out what we were doing before the crash and pick up where we left off.
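SQLite exposes this mechanism directly: switching its journal mode to WAL makes it append changes to a write-ahead log file before they reach the main database file. A minimal sketch in Python follows; the database lives in a throwaway temp file because WAL mode needs a real file, not an in-memory database.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)

# Turn on write-ahead logging; the pragma reports the active mode.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)   # wal

conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()
# The log lives alongside the database file until it is checkpointed.
print(os.path.exists(path + "-wal"))   # True
```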

#29 What is a checkpoint?

Checkpoints are relevant to log-based recovery systems. To restore a database after a crash, we must redo the log records. But if we redo all the log records from the beginning, it takes forever to recover the database, so after a while we must be able to ignore some of them. Checkpoints are the points before which log records can be ignored for recovery purposes. As you can see, by using checkpoints, the DBMS reduces the amount of work needed to restart a database in the event of a crash.
Distributed systems

#30 What is a distributed database?

A distributed database is a collection of multiple interconnected databases that are spread physically across various locations. In almost all cases, these physically separated databases have a shared-nothing architecture and are independent from each other. However, the DBMS integrates the data logically so that it appears as one single database to the user/application.

#31 What is database partitioning?

Partitioning is a process where very large tables are divided into multiple smaller, manageable pieces. Some benefits of partitioning include faster queries, faster data loads, and faster deletion of old data. The benefits of partitioning are constrained by the choice of the partition key and the granularity.

There are two ways a table can be partitioned: horizontally and vertically. Vertical partitioning puts different columns into different partitions, whereas horizontal partitioning puts subsets of rows into different partitions based on a partition key. For example, a company's sales records can be horizontally partitioned by sale date.
#32 What is database sharding?

Sharding, in essence, is a horizontal partitioning architecture. Each shard has the same schema and columns but different rows, and the shards are independent from each other.

The main benefit of sharding is scalability. With an automatic sharding architecture, you can simply add more machines to your stack whenever needed, reducing the load on existing machines and allowing more traffic and faster processing. This is very appealing to applications in the growth stage.


SQL related questions


#33 What are different types of SQL statements?

SQL statements are high-level instructions, and each statement is responsible for a specific task. These statements can generally be classified into five categories:

 Data Definition Language (DDL)
* This family of SQL commands is used to define the database schema.
* Examples include CREATE, DROP, ALTER
 Data Manipulation Language (DML)
* This family of SQL commands is used to modify the data inside a table.
* Examples include INSERT, UPDATE, DELETE

 Data Query Language (DQL)
* This family of SQL commands performs queries on existing tables.
* Examples include SELECT

 Data Control Language (DCL)
* This family of SQL commands deals with rights and permissions.
* Examples include GRANT, REVOKE

 Transaction Control Language (TCL)
* This family of SQL commands deals with transactions.
* Examples include COMMIT, ROLLBACK, SAVEPOINT
* You only need these commands if you have OLTP operations.

#34 What is the difference between DDL and DML?

They are closely related. DDL is responsible for defining the structure of the table: basically, what is allowed to enter the table and what is not. DDL can be regarded as a set of rules that shape the table structure (schema). After DDL defines the schema, it is DML's job to fill the table with data.
#35 What is the difference between scalar and aggregate functions?

Both kinds of functions return a single value; the difference is the input. Scalar functions operate on a single value, while aggregate functions operate on a set of values. An example clarifies the difference: functions like ISNULL(), ISNUMERIC(), and LEN() are scalar functions; they take a single value and return a single value. On the other hand, AVG(), MAX(), and SUM() are aggregate functions: they take multiple values and output a single value.

#36 What is a database VIEW?

A VIEW is a virtual table. It does not store the result of your SQL
query; it stores the query itself for future reference. You can look at
it as a named query, which you can refer to later by its name
rather than writing the whole query again. One advantage of a
VIEW is that instead of creating a brand new table to store the
result of your query, you create a VIEW and save disk space.
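A minimal sketch with sqlite3 (the `it_staff` view and the sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (name TEXT, dept TEXT)")
cur.executemany("INSERT INTO employee VALUES (?, ?)",
                [("Alice", "IT"), ("Bob", "HR"), ("Carol", "IT")])

# The view stores only the query text, not the result rows
cur.execute("CREATE VIEW it_staff AS SELECT name FROM employee WHERE dept = 'IT'")

# Referencing the view by name re-runs the underlying query
cur.execute("SELECT name FROM it_staff ORDER BY name")
print([row[0] for row in cur.fetchall()])  # ['Alice', 'Carol']
```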

#37 What is the difference between VIEW and


materialized VIEW?

The results of a VIEW are not stored on disk, so every time a
VIEW is queried, we get up-to-date results. In a materialized
VIEW, things are different: we store the results on disk and we
put some mechanism in place to keep them updated (a.k.a. VIEW
maintenance).

A materialized VIEW is beneficial when the VIEW is accessed
frequently, because we don't need to re-run the query every time.
Nonetheless, a materialized VIEW has storage cost and update
overhead.
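SQLite has no native materialized views, but the idea can be simulated by persisting a query's result in a real table and refreshing it on demand. This is only a sketch of the concept (table names, data, and the `refresh_region_totals` helper are all invented here); real engines such as PostgreSQL or Oracle provide this natively:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sale VALUES (?, ?)",
                 [("east", 100.0), ("west", 50.0), ("east", 25.0)])

# Simulated materialized view: persist the query result in a real table
def refresh_region_totals(conn):
    conn.execute("DROP TABLE IF EXISTS region_totals")
    conn.execute("CREATE TABLE region_totals AS "
                 "SELECT region, SUM(amount) AS total FROM sale GROUP BY region")

refresh_region_totals(conn)
get = lambda: conn.execute(
    "SELECT total FROM region_totals WHERE region = 'east'").fetchone()[0]
print(get())  # 125.0

# Until we refresh, the stored result goes stale: that is the update overhead
conn.execute("INSERT INTO sale VALUES ('east', 10.0)")
print(get())  # still 125.0
refresh_region_totals(conn)
print(get())  # 135.0
```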

#38 What is Common Table Expressions (CTE)?

A CTE can be regarded as a temporary VIEW or 'inline' VIEW. In
other words, a CTE is a temporary result set that you can reference
within another SELECT, INSERT, UPDATE, or DELETE statement. The
reason I say a CTE is a temporary VIEW is that it can only be used
by the query attached to it and cannot be referenced anywhere
else.

A CTE is defined with the WITH operator. You can define multiple
CTEs with only one WITH, which allows you to simplify intricate
queries.
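A simple blueprint of a CTE, run through sqlite3 (the `high_paid` CTE and the sample salaries are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (name TEXT, salary REAL)")
cur.executemany("INSERT INTO employee VALUES (?, ?)",
                [("Alice", 90.0), ("Bob", 60.0), ("Carol", 80.0)])

# The CTE 'high_paid' exists only for the single SELECT attached to it
cur.execute("""
    WITH high_paid AS (
        SELECT name, salary FROM employee WHERE salary > 70
    )
    SELECT name FROM high_paid ORDER BY name
""")
print([row[0] for row in cur.fetchall()])  # ['Alice', 'Carol']
```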

Not my favorites, irritating, but they are often asked
#39 What are the differences between DROP and
TRUNCATE commands?
DROP removes the table altogether and cannot be rolled back.
TRUNCATE, on the other hand, removes all the rows of the table.
With TRUNCATE, the table definition (schema) still exists, and we
can INSERT data in the future if we desire.

#40 What are the differences between DELETE and


TRUNCATE commands?

DELETE and TRUNCATE belong to two different categories of
SQL commands: the former belongs to DML and the latter belongs
to DDL. In other words, the former operates at the row level while the
latter operates at the table level. TRUNCATE deletes all the rows of
the table at once. DELETE can do the same, but it takes more
time because it deletes the rows one by one. If you use TRUNCATE, you
can't roll back the data, but with DELETE you can roll back the rows
you deleted.

#41 What are the differences between PRIMARY KEY


and FOREIGN KEY?

These keys are very important tools for cross-referencing different
tables, and they also help with referential integrity. However, they are
different in nature. A table has only one primary key. A primary key
1) can uniquely identify a record in the table; 2) cannot accept
NULL values; and 3) in most DBMSs, by default, is the clustered
index. A table, unlike with the primary key, can have more than one
foreign key. A foreign key 1) references a primary key in another
table; 2) cannot, by itself, uniquely identify a record in its own
table; 3) can accept NULL values; and 4) is not automatically indexed;
it is up to the user to create an index for it.

#42 What is the difference between WHERE and


HAVING clause?

Both clauses are used to limit the result set by providing
conditions to filter rows. However, there is one difference: the
WHERE clause scans the raw data (row by row) to check the conditions
and filter it, while HAVING scans the aggregated results to check the
conditions and filter them. For this reason, HAVING comes after
GROUP BY in a SQL query. In summary, WHERE filters the raw data,
but HAVING filters the processed data.
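The difference can be sketched in one query via sqlite3 (table and figures are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sale (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sale VALUES (?, ?)",
                [("east", 40.0), ("east", 70.0), ("west", 90.0), ("west", 5.0)])

# WHERE filters raw rows before grouping; HAVING filters the aggregated groups
cur.execute("""
    SELECT region, SUM(amount) AS total
    FROM sale
    WHERE amount > 10          -- drops the 5.0 row before aggregation
    GROUP BY region
    HAVING SUM(amount) > 100   -- drops groups whose total is too small
""")
print(cur.fetchall())  # [('east', 110.0)]
```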

#43 What is functional dependency?

The attributes of a table are functionally dependent if one of them
uniquely identifies the other. It becomes clearer with an
example. Suppose we have an employee table with attributes Id,
Name, and Age. Here, employee Id uniquely identifies employee
Name, because if we know the employee Id, we can uniquely determine
his/her name. A functional dependency is denoted by an arrow (→),
so in this case we can write Id → Name.

#44 What are different normalization types?


There are many levels of normalization. As I explained earlier, the
goal of normalization is to avoid redundancy and dependency.
However, this goal cannot be achieved in one step; each
normalization step (type) brings us closer to the goal.

We start with the un-normalized form of data (UNF). In UNF, cells
can have multiple values (they are non-atomic). If we divide those
values into multiple cells so that each table cell has a single
value, and also remove duplicate rows so that all the records
are unique, then we have transformed UNF into first normal form (1NF).

The next step is to introduce primary and foreign keys. In this
step, we divide the 1NF tables further, create new tables, and
connect them through primary and foreign keys. If, in these new
tables, all non-key attributes are fully functionally dependent on the
primary key, then we have reached second normal form (2NF).

Although with 2NF we have reduced redundancy and dependency
significantly, there is still a lot of room to improve. The next step is
to remove transitive functional dependencies, which basically
means a situation where changing a non-key column might
cause another non-key column to change. To disentangle these
non-key columns, we must create separate tables. When we have
no transitive functional dependencies, we have reached third
normal form (3NF).

In almost all databases, 3NF is the point beyond which we cannot
decompose the database into higher forms of
normalization. However, in some complex databases, there are
situations where you can achieve higher forms of normalization.

The higher forms of normalization are Boyce-Codd Normal Form
(BCNF), 4NF, 5NF, and 6NF, which I prefer not to cover. If you
are interested in learning more about these higher forms of
normalization, you can read about them on the Wikipedia page for
database normalization.

#45 What are different integrity rules?

There are two types of integrity rules that, if we obey them, let us
maintain database consistency: the entity integrity rule and the
referential integrity rule. The entity rule is related to the primary
key: if we make a column the primary key, it cannot have any NULL
value. The referential rule is related to foreign keys: a foreign key
must either be NULL or its value must match a primary key value in
another table.

#46 What is DML Compiler?

It is a translator. It transforms DML queries from high-level
statements into low-level instructions executable by the query
evaluation engine.
#47 What is cursor?

A cursor is a tool that can be employed by the user to return results
in a row-by-row manner. It is in contrast with the typical SELECT
statement, which returns the entire result set. Typically we are
interested in the complete set of rows, but there are applications,
especially interactive and online ones, that cannot always
work effectively with the entire result set as a unit. These
applications prefer to work on the data row by row, and cursors allow
them to perform this row-by-row operation.
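The original worked example is not reproduced here; as a rough stand-in, Python's sqlite3 cursor shows the same open/fetch/close life cycle, pulling one row at a time instead of materializing the whole result set (the table and names are invented for the sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT)")
conn.executemany("INSERT INTO employee VALUES (?)",
                 [("Alice",), ("Bob",), ("Carol",)])

# Open a cursor over the query, then consume the result one row at a time
cur = conn.cursor()
cur.execute("SELECT name FROM employee ORDER BY name")
while True:
    row = cur.fetchone()       # advances the cursor by one row
    if row is None:            # no more rows: we are past the last one
        break
    print(row[0])
cur.close()
```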

#48 What is cardinality?

Cardinality in the context of database systems means
"uniqueness". For example, when we say a column has low
cardinality, it means it has few distinct values, i.e. many duplicates.
