MODULE 3 For BSIT 3B
Module 3
Data and Business Intelligence
LEARNING OBJECTIVES
3-2 Explain logical database design and the relational database model.
3-5 Analyze the four major components and functions of a data warehouse and their use for
business.
3-7 Explain business analytics and describe its role in the decision-making process.
3-8 Examine the advantages and challenges of big data and predictive analytics for a business.
3-10 Explain key features of Tableau and Power BI as two popular business intelligence and
visualization platforms.
I. Databases
In the past, data was stored in a series of files that were called “flat files” because they were not
arranged in a hierarchy, and there was no relationship among these files. The problem with this
flat file organization was that the same data could be stored in more than one file, creating data
redundancy. Some of the advantages of a database over a flat file system are as follows:
More information can be generated from the same data.
Complex requests can be handled more easily.
Data redundancy is eliminated or minimized.
Data management is improved.
Storage space is reduced.
In a database, files are accessed by using a sequential, random, or indexed sequential method.
In a sequential access file structure, records in files are organized and processed in
numerical or sequential order, typically the order in which they were entered. Records are
organized based on what is known as a “primary key.”
In a random access file structure, records can be accessed in any order, regardless of their
physical locations in storage media. This method of access is fast and very effective when a
small number of records needs to be processed daily or weekly. To achieve this speed, these
records are often stored on magnetic disks. Disks are random access devices, whereas tapes
are sequential access devices.
With the indexed sequential access method (ISAM), records can be accessed sequentially or
randomly, depending on the number being accessed. ISAM, as the name suggests, uses an
index structure and has two parts: the indexed value and a pointer to the disk location of the
record matching the indexed value. Retrieving a record requires at least two disk accesses,
once for the index structure and once for the actual record. Because every record needs to be
indexed, if the file contains a huge number of records, the index is also quite large. Therefore,
an index is more useful when the number of records is small. Access speed with this method
is fast, so it is recommended when records must be accessed frequently.
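The two-part index described above (an indexed value plus a pointer to the record's location) can be sketched in a few lines of Python; the record layout and key values here are invented for illustration, with a list position standing in for a disk location.

```python
# Illustrative ISAM-style index: maps a primary key (the indexed
# value) to the position of the matching record in a record list,
# which stands in for disk locations in this sketch.
records = [
    {"id": 101, "name": "Alice"},
    {"id": 205, "name": "Bob"},
    {"id": 307, "name": "Carol"},
]

# Build the index: one entry per record, as the text notes.
index = {rec["id"]: pos for pos, rec in enumerate(records)}

def read_random(key):
    """Random access: one index lookup, then one 'disk' access."""
    return records[index[key]]

def read_sequential():
    """Sequential access: records in the order they were entered."""
    return [rec["name"] for rec in records]

print(read_random(205)["name"])   # Bob
print(read_sequential())          # ['Alice', 'Bob', 'Carol']
```

Note how retrieving one record really does take two lookups, one in the index and one in the record store, mirroring the two disk accesses described above.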
Before designing a database, one needs to know the two ways information is viewed in a
database. The physical view involves how data is stored on and retrieved from storage media,
such as hard disks, magnetic tapes, or CDs. For each database, there is only one physical view of
data. The logical view involves how information appears to users and how it can be organized
and retrieved.
The first step in database design is defining a data model, which determines how data is created,
represented, organized, and maintained. A data model usually includes the following three
components:
Data structure—describes how data is organized and the relationship among records
Operations—describes methods, calculations, and so forth that can be performed on data,
such as updating and querying data
Integrity rules—defines the boundaries of a database, such as maximum and minimum
values allowed for a field, constraints (limits on what type of data can be stored in a field),
and access methods
Many data models are used. In a hierarchical model, the relationships among records form a
treelike structure (hierarchy). Records are called nodes, and relationships among records are
called branches. The node at the top is called the root, and every other node (called a child) has a
parent. Nodes with the same parents are called twins or siblings.
The network model is similar to the hierarchical model, but records are organized differently. For example, a network model could link an invoice number, a customer number, and a method of payment. Unlike the hierarchical model, each record in the network model can have multiple parent and child records.
A relational model uses a two-dimensional table of rows and columns of data. Rows are
records (also called tuples), and columns are fields (also referred to as attributes). To begin
designing a relational database, one must define the logical structure by defining each table
and the fields in it. The collection of these definitions is stored in the data dictionary. The
data dictionary can also store other definitions, such as data types for fields, default values
for fields, and validation rules for data in each field.
In a relational database, every record must be uniquely identified by a primary key. Student
ID numbers, Social Security numbers, account numbers, and invoice numbers are examples of
primary keys. To establish relationships among tables so data can be linked and retrieved
more efficiently, a primary key for one table can appear in other tables. In this case, it is
called a foreign key. To improve database efficiency, a process called normalization is used,
which eliminates redundant data and ensures that only related data is stored in a table.
Normalization can go through several stages, from first normal form (1NF) to fifth normal
form (5NF). Typically, however, only stages 1NF through 3NF are used.
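Normalization and the primary key/foreign key link can be illustrated with a small sketch; the table and field names below are hypothetical, not from the text.

```python
# Hypothetical flat table: the customer name is repeated on every
# invoice, which is the redundancy normalization removes.
invoices_flat = [
    {"invoice_no": 1, "cust_no": 10, "cust_name": "Acme"},
    {"invoice_no": 2, "cust_no": 10, "cust_name": "Acme"},
    {"invoice_no": 3, "cust_no": 20, "cust_name": "Zenith"},
]

# Normalized: customers keyed by the primary key cust_no; invoices
# keep cust_no as a foreign key instead of repeating the name.
customers = {row["cust_no"]: row["cust_name"] for row in invoices_flat}
invoices = [{"invoice_no": r["invoice_no"], "cust_no": r["cust_no"]}
            for r in invoices_flat]

# The foreign key links the tables back together on retrieval.
def customer_of(invoice_no):
    inv = next(i for i in invoices if i["invoice_no"] == invoice_no)
    return customers[inv["cust_no"]]

print(customer_of(2))  # Acme
```

After the split, each customer name is stored exactly once, and every invoice still reaches it through the foreign key.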
Data stored in a relational model is retrieved from tables by using operations that pick and
combine data from one or more tables. There are several operations, such as select, project,
join, intersect, union, and difference. A select operation searches data in a table and retrieves
records based on certain criteria (also called conditions). A project operation pares down a
table by eliminating columns (fields) according to certain criteria. A join operation combines
two tables based on a common field (e.g., the primary key in the first table and the foreign key
in the second table).
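As a rough sketch of these three operations, the functions below work on tables represented as lists of dictionaries; the table contents and field names are invented for illustration.

```python
# Two small illustrative tables.
students = [
    {"student_id": 1, "name": "Ana", "major": "IT"},
    {"student_id": 2, "name": "Ben", "major": "CS"},
]
grades = [
    {"student_id": 1, "grade": 90},
    {"student_id": 2, "grade": 85},
]

def select(table, condition):
    """Select: retrieve the rows that satisfy certain criteria."""
    return [row for row in table if condition(row)]

def project(table, fields):
    """Project: pare the table down to the listed columns."""
    return [{f: row[f] for f in fields} for row in table]

def join(left, right, key):
    """Join: combine two tables on a common field (e.g. a key)."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

it_students = select(students, lambda r: r["major"] == "IT")
result = project(join(it_students, grades, "student_id"), ["name", "grade"])
print(result)  # [{'name': 'Ana', 'grade': 90}]
```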
A. Database Engine
A database engine, the heart of DBMS software, is responsible for data storage, manipulation,
and retrieval. It converts logical requests from users into their physical equivalents (e.g.,
reports) by interacting with other components of the DBMS (usually the data manipulation
component). Because more than one logical view of data is possible, the database engine can
retrieve and return data to users in many different ways.
B. Data Definition
The data definition component is used to create and maintain the data dictionary and define
the structure of files in a database. Any changes to a database’s structure, such as adding a
field, deleting a field, changing a field’s size, or changing the data type stored in a field, are
made with this component.
C. Data Manipulation
The data manipulation component is used to add, delete, modify, and retrieve records from a
database. Typically, a query language is used for this component. Many query languages are
available, but Structured Query Language (SQL) and query by example (QBE) are two of the
most widely used. Structured Query Language (SQL) is a standard fourth-generation query
language used by many DBMS packages, such as Oracle Database 11g and Microsoft SQL
Server. SQL consists of several keywords specifying actions to take. The basic format of an
SQL query is as follows: SELECT field FROM table or file WHERE conditions
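This basic SELECT–FROM–WHERE format can be tried directly with Python's built-in sqlite3 module; the table and data below are invented for illustration.

```python
import sqlite3

# In-memory database; the sales table is hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("North", 500), ("South", 300), ("North", 200)])

# SELECT field FROM table WHERE conditions
rows = con.execute(
    "SELECT amount FROM sales WHERE region = 'North'").fetchall()
print(rows)  # [(500,), (200,)]
```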
D. Application Generation
E. Data Administration
Recent trends in database design and use include data-driven Web sites, natural language
processing, distributed databases, and object-oriented databases. In addition to these trends,
advances in artificial intelligence and natural language processing will have an impact on
database design and use, such as improving user interfaces.
A. Data-Driven Web Sites
With the popularity of e-commerce applications, data-driven Web sites are used more widely
to provide dynamic content. A data-driven Web site acts as an interface to a database,
retrieving data for users and allowing users to enter data in the database. Without this feature,
Web site designers must edit the HTML code every time a Web site’s data contents change.
This type of site is called a “static” Web site. A data-driven Web site, on the other hand,
changes automatically because it retrieves content from external dynamic data sources, such
as MySQL, Microsoft SQL Server, Microsoft Access, Oracle, IBM DB2, and other databases.
A data-driven Web site improves access to information, so users’ experiences are more
interactive, and it reduces the support and overhead needed to maintain static Web sites. Data-
driven Web sites are useful for the following applications, among others:
E-commerce sites that need frequent updates
News sites that need regular updating of content
Forums and discussion groups
B. Distributed Databases
The database types discussed so far use a central database for all users of an information
system. However, in some situations, a distributed database management system
(DDBMS), in which data is stored on multiple servers placed throughout an organization, is
preferable. Some of the reasons an organization would choose a distributed database are as
follows:
Local storage of data decreases response time but increases communication costs.
Several small integrated systems might cost less than one large server.
Most importantly, a distributed database is not limited by the data’s physical location.
There are three approaches to setting up a DDBMS, although these approaches can be
combined:
Fragmentation—the fragmentation approach addresses how tables are divided among multiple locations. Horizontal fragmentation breaks a table into subsets of rows, storing each subset, with all its fields (columns), in a different location. Vertical fragmentation stores subsets of columns in different locations. Mixed fragmentation, which combines vertical and horizontal fragmentation, stores only site-specific data in each location.
Replication—with the replication approach, each site stores a copy of data in the
organization’s database.
Allocation—the allocation approach combines fragmentation and replication.
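A minimal sketch of fragmentation and replication, using an invented customer table with Python lists standing in for the sites:

```python
# A small customer table, distributed across hypothetical sites.
table = [
    {"cust_no": 1, "name": "Acme",   "region": "East", "balance": 100},
    {"cust_no": 2, "name": "Zenith", "region": "West", "balance": 250},
]

# Horizontal fragmentation: split by rows (here, by region), each
# fragment keeping all columns.
east_site = [r for r in table if r["region"] == "East"]
west_site = [r for r in table if r["region"] == "West"]

# Vertical fragmentation: each site stores a subset of columns,
# repeating the primary key so the fragments can be rejoined.
site_a = [{"cust_no": r["cust_no"], "name": r["name"]} for r in table]
site_b = [{"cust_no": r["cust_no"], "balance": r["balance"]} for r in table]

# Replication: each site simply stores a full copy of the data.
replica = list(table)

print(len(east_site), len(site_a), len(replica))  # 1 2 2
```

Allocation would then decide, per site, which of these fragments and copies to store, based on what each site uses most often.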
Security issues are more challenging in a distributed database because of multiple access
points from both inside and outside the organization. Security policies, scope of user access,
and user privileges must be clearly defined, and authorized users must be identified.
C. Object-Oriented Databases
The relational model discussed previously is designed to handle homogenous data organized
in a field-and-record format. Representing more complex data relationships sometimes is not
possible with a relational database. To address these problems, object-oriented databases
were developed. Like object-oriented programming, this data model represents real-world
entities with database objects. An object consists of attributes (characteristics describing an
entity) and methods (operations or calculations) that can be performed on the object’s data.
Grouping objects along with their attributes and methods into a class is called encapsulation,
which essentially means grouping related items into a single unit. Encapsulation helps handle
more complex types of data, such as images and graphs. Object-oriented databases can also
use inheritance, which means new objects can be created faster and more easily by entering
new data in attributes. This data model expands on the relational model by supporting more
complex data management, so modeling real-world problems is easier. In contrast to the query
languages used to interact with a relational database, interaction with an object-oriented
database takes place via methods, which are called by sending a message to an object.
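Encapsulation and inheritance can be sketched with ordinary Python classes; the Account example below is hypothetical, not from the text.

```python
# Encapsulation: attributes and methods grouped into one class.
class Account:
    def __init__(self, owner, balance):
        self.owner = owner        # attribute
        self.balance = balance    # attribute

    def deposit(self, amount):    # method operating on the object's data
        self.balance += amount
        return self.balance

# Inheritance: a new object type reuses Account's attributes and
# methods, adding only what is new.
class SavingsAccount(Account):
    def __init__(self, owner, balance, rate):
        super().__init__(owner, balance)
        self.rate = rate

    def add_interest(self):
        return self.deposit(self.balance * self.rate)

acct = SavingsAccount("Ana", 1000, 0.05)
acct.add_interest()
print(acct.balance)  # 1050.0
```

Interaction happens only through the methods, which is what "sending a message to an object" amounts to in this model.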
V. Data Warehouses
A data warehouse is a collection of data from a variety of sources used to support decision-
making applications and generate business intelligence. Data warehouses store multidimensional
data, so they are sometimes called “hypercubes.” Typically, data in a data warehouse is
described as having the following characteristics in contrast to data in a database:
Subject oriented—focused on a specific area, such as the home-improvement business or a
university, whereas data in a database is transaction/function oriented
Integrated—comes from a variety of sources, whereas data in a database usually does not
Time variant—categorized based on time, such as historical information, whereas a database keeps only recent activity
Type of data—captures aggregated data, whereas data in a database captures raw
transaction data
Purpose—used for analytical purposes, whereas data in a database is used for capturing
and managing transactions
Designing and implementing a data warehouse is a complex task, but specific software is
available to help. Oracle, IBM, Microsoft, Teradata, SAS, and Hewlett-Packard are market
leaders in data-warehousing platforms. Exhibit 3.6 shows a data warehouse configuration with
four major components: input; extraction, transformation, and loading (ETL); storage; and
output.
A. Input
Data can come from a variety of sources, including external data sources, databases,
transaction files, enterprise resource planning (ERP) systems, and customer relationship
management (CRM) systems. ERP systems collect, integrate, and process data that can be
used by all functional areas in an organization. CRM systems collect and process customer
data to provide information for improving customer service. Together, these data sources
provide the input a data warehouse needs to perform analyses and generate reports.
B. ETL
Extraction, transformation, and loading (ETL) refers to the processes used in a data
warehouse. Extraction means collecting data from a variety of sources and converting it into a
format that can be used in transformation processing. Transformation processing is done to
make sure data meets the data warehouse’s needs. Its tasks include the following:
Selecting only certain columns or rows to load
Translating coded values, such as replacing Yes with 1 and No with 2
Performing select, project, and join operations on data
Sorting and filtering data
Aggregating and summarizing data before loading it in the data warehouse
Loading is the process of transferring data to the data warehouse. Depending on the
organization’s needs and the data warehouse’s storage capacity, loading might overwrite
existing data or add collected data to existing data.
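The transformation tasks listed above (translating coded values, selecting columns, aggregating before loading) can be sketched as follows; the source rows and field names are invented for illustration.

```python
# Hypothetical source rows, as they might arrive from a transaction file.
source = [
    {"region": "North", "returning": "Yes", "amount": 500},
    {"region": "South", "returning": "No",  "amount": 300},
    {"region": "North", "returning": "Yes", "amount": 200},
]

# Transformation: translate coded values (Yes -> 1, No -> 2, as in
# the text's example) and select only the columns to load.
coded = {"Yes": 1, "No": 2}
transformed = [{"region": r["region"],
                "returning": coded[r["returning"]],
                "amount": r["amount"]} for r in source]

# Aggregation before loading: summarize sales per region.
warehouse = {}
for row in transformed:
    warehouse[row["region"]] = warehouse.get(row["region"], 0) + row["amount"]

# Loading here simply keeps the aggregated dict as the 'warehouse'.
print(warehouse)  # {'North': 700, 'South': 300}
```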
C. Storage
D. Output
The databases discussed so far support online transaction processing (OLTP) to generate
reports such as the following:
Which product generated the highest sales last month?
Which region generated the lowest sales last month?
Which salespersons increased sales by more than 30 percent last quarter?
Data warehouses, however, use online analytical processing and data-mining analysis to generate reports.
Online Analytical Processing (OLAP)
Online analytical processing (OLAP) generates business intelligence. It uses multiple sources of information and provides multidimensional analysis, such as viewing data based on time, product, and location. Sometimes, this analysis is called slicing and
dicing data. OLAP allows one to analyze information that has been summarized in
multidimensional views. OLAP tools are used to perform trend analysis and sift through
massive amounts of statistics to find specific information. These tools usually have a “drill-
down” feature for accessing multilayer information.
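Slicing and drill-down over a multidimensional view can be sketched on a tiny fact set; the dimensions follow the text's time/product/location example, but the data is invented.

```python
# A tiny three-dimensional fact set: time, product, location.
facts = [
    {"month": "Jan", "product": "Paint", "store": "East", "sales": 40},
    {"month": "Jan", "product": "Tools", "store": "West", "sales": 60},
    {"month": "Feb", "product": "Paint", "store": "East", "sales": 50},
]

def slice_by(dimension, value):
    """Fix one dimension (a 'slice') and total across the others."""
    return sum(f["sales"] for f in facts if f[dimension] == value)

def drill_down(dimension):
    """Break the overall total down one layer along a dimension."""
    out = {}
    for f in facts:
        out[f[dimension]] = out.get(f[dimension], 0) + f["sales"]
    return out

print(slice_by("product", "Paint"))  # 90
print(drill_down("month"))           # {'Jan': 100, 'Feb': 50}
```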
Data-Mining Analysis
Data-mining analysis is used to discover patterns and relationships. The following are
typical questions one can answer by using data-mining tools:
Which customers are likely to respond to a new product?
Which customers are likely to respond to a new ad campaign?
What product should be recommended to this customer based on his or her past
buying patterns?
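The third question, recommending a product from past buying patterns, can be sketched with a simple co-occurrence count; the baskets below are invented, and real data-mining tools use far more sophisticated methods.

```python
# Past purchases per customer; the baskets are illustrative.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam"},
]

def recommend(item):
    """Recommend the product most often bought alongside `item`."""
    counts = {}
    for basket in baskets:
        if item in basket:
            for other in basket - {item}:
                counts[other] = counts.get(other, 0) + 1
    return max(counts, key=counts.get)

print(recommend("butter"))  # bread
```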
Decision-Making Reports
A data warehouse can generate all types of information as well as reports used for decision
making. The following are examples of what a data warehouse can allow one to do:
Analyze large amounts of historical data quickly.
Assist management in making well-informed business decisions.
Manage a high demand for information from many users with different needs and
decision-making styles.
A data mart is usually a smaller version of a data warehouse, used by a single department or
function. Data marts focus on business functions for a specific user group in an organization,
such as a data mart for the Marketing Department. Data marts have the following advantages
over data warehouses:
Access to data is often faster because of their smaller size.
Response time for users is improved.
They are easier to create because they are smaller and often less complex.
They are less expensive.
Users are targeted better, because a data mart is designed for a specific department or
division.
Data marts, however, usually have more limited scope than data warehouses, and consolidating
information from different departments or functional areas (such as sales and production) is more
difficult.
Business analytics (BA) uses data and statistical methods to gain insight into the data and
provide decision makers with information they can act on. Compared to business intelligence
(BI), BA is more forward looking; it tells the user what is going to happen in the future rather
than what has happened in the past.
BI can help determine what happened, what the problem is, and what decisions need to be made
based on the available data. BA can help determine why this is happening, what it will mean if
the sales trend continues, and what actions should be taken. BI uses dashboards, scorecards,
OLAP, and query reports to support decision-making activities, whereas BA uses statistical
analysis, data mining tools, and predictive modeling.
There are several types of BA methods. Three popular ones are descriptive, predictive, and
prescriptive analytics. Descriptive analytics reviews past events, analyzes the data, and provides
a report indicating what happened in a given period and how to prepare for the future. Thus, it is
a reactive strategy. Predictive analytics, as the name indicates, is a proactive strategy; it prepares
a decision maker for future events. Prescriptive analytics goes beyond descriptive and predictive
analytics by recommending a course of action that a decision maker should follow and showing
the likely outcome of each decision.
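The contrast between descriptive and predictive analytics can be sketched with a small example: a descriptive summary of past sales next to a least-squares trend line projected one period ahead. The sales figures are invented for illustration.

```python
# Monthly sales figures (illustrative).
sales = [100, 110, 120, 130]

# Descriptive analytics: report what happened in the period.
average = sum(sales) / len(sales)

# Predictive analytics: fit y = a + b*x by least squares and
# project one period ahead (month index 4).
n = len(sales)
xs = range(n)
mean_x = sum(xs) / n
b = sum((x - mean_x) * (y - average) for x, y in zip(xs, sales)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = average - b * mean_x
forecast = a + b * n

print(average)   # 115.0
print(forecast)  # 140.0
```

Prescriptive analytics would go one step further, comparing the projected outcomes of several candidate actions and recommending one.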
Within the general domain of analytics, two other terms are commonly used: Web analytics and mobile analytics. Web analytics collects and analyzes Web data in order to assess the efficiency and effectiveness of a Web site. Mobile analytics, on the other hand, is concerned with the mobile Web and mobile apps; it assesses and measures traffic across mobile devices and the apps those devices run.
The value that BA offers organizations is huge. Organizations that want to take full advantage of
it, however, will need a chief analytics officer (CAO). This individual should sift through the
enterprise-wide data in order to generate BI. Several major software vendors offer predictive analytics platforms.
There has been an explosion in data growth throughout the business world in recent years. In
fact, it has been estimated that the volume of business data worldwide doubles every 1.2 years.
Big data is data so voluminous that conventional computing methods are not able to efficiently
process and manage it. There are five dimensions to big data, known as the 5 Vs: volume, variety, velocity, veracity, and value.
Volume—the sheer quantity of transactions, measured in petabytes or exabytes
Variety—the combination of structured data and unstructured data
Velocity—the speed with which the data has to be gathered and processed
Veracity—the trustworthiness and accuracy of the data
Value—the value that the collected data brings to the decision-making process
Many industries could benefit from big data analytics and gain a competitive advantage in the
following areas: retail, financial services, advertising and public relations, government,
manufacturing, media and telecommunications, energy, and healthcare.
Many technologies and applications have contributed to the growth and popularity of big data.
Mobile and wireless technology, the popularity of social networks, and the enhanced power
and sophistication of smartphones and handheld devices are among the key factors. The most
commonly used platform for big data analytics is the open-source Apache Hadoop, which
uses the Hadoop Distributed File System (HDFS) to manage storage. Distributed databases,
including NoSQL and Cassandra, are also commonly associated with big data projects.
Big data analytics is able to provide key decision makers with unprecedented decision-making
support and great potential for gaining a competitive advantage. However, this powerful tool
could reveal and expose certain information that puts some people’s privacy at risk. Some of
the risks associated with this technology are discrimination, privacy breaches and embarrassments, unethical actions based on interpretations, loss of anonymity, and few, if any, legal protections.
One of the main applications of databases and database systems is database marketing. So what is database marketing? For our purposes, we define database marketing as using an organization's database of customers and potential customers in order to promote the products or services that the organization offers.
The main goal of a successful database marketing campaign is to use the information within
the database to implement marketing strategies that eventually increase profits and enhance
the competitiveness of the organization in the marketplace. The key motivation behind
database marketing is establishing, building, and maintaining a long-term relationship with customers.
Tableau is a data visualization tool used for generating business intelligence. It is used to analyze data for generating trends using graphs and charts. Tableau users can connect to a variety of data sources and build interactive charts and dashboards from them.
Power BI from Microsoft is a platform that allows a user to analyze and visualize data from different sources with different formats. Advantages of Power BI include its integration with other Microsoft products, such as Excel, and its support for interactive dashboards and reports.
Key Terms
A database is a collection of related data that is stored in a central location or in multiple locations.
A data hierarchy is the structure and organization of data, which involves fields, records, and
files.
A database management system (DBMS) is software for creating, storing, maintaining, and
accessing database files. A DBMS makes using databases more efficient.
In a sequential access file structure, records in files are organized and processed in numerical
or sequential order, typically the order in which they were entered.
In a random access file structure, records can be accessed in any order, regardless of their
physical locations in storage media. This method of access is fast and very effective when a
small number of records need to be processed daily or weekly.
With the indexed sequential access method (ISAM), records can be accessed sequentially or
randomly, depending on the number being accessed. For a small number, random access is used,
and for a large number, sequential access is used.
The physical view involves how data is stored on and retrieved from storage media, such as hard
disks, magnetic tapes, or CDs.
The logical view involves how information appears to users and how it can be organized and
retrieved.
A data model determines how data is created, represented, organized, and maintained. It usually
contains data structure, operations, and integrity rules.
In a hierarchical model, the relationships between records form a tree-like structure (hierarchy).
Records are called nodes, and relationships between records are called branches. The node at the
top is called the root, and every other node (called a child) has a parent. Nodes with the same
parents are called twins or siblings.
The network model is similar to the hierarchical model, but records are organized differently.
Unlike the hierarchical model, each record in the network model can have multiple parent and
child records.
A relational model uses a two-dimensional table of rows and columns of data. Rows are records
(also called tuples), and columns are fields (also referred to as attributes).
The data dictionary stores definitions, such as data types for fields, default values for fields, and validation rules for data in each field.
A primary key uniquely identifies every record in a relational database. Examples include
student ID numbers, account numbers, Social Security numbers, and invoice numbers.
A foreign key is a field in a relational table that matches the primary key column of another
table. It can be used to cross-reference tables.
Normalization improves database efficiency by eliminating redundant data and ensures that
only related data is stored in a table.
With query by example (QBE), you request data from a database by constructing a statement
made up of query forms. With current graphical databases, you simply click to select query
forms instead of having to remember keywords, as you do with SQL. You can add AND, OR,
and NOT operators to the QBE form to fine-tune the query.
Create, read, update, and delete (CRUD) refers to the range of functions that data administrators use when determining who has permission to perform certain operations.
Database administrators (DBA), found in large organizations, design and set up databases,
establish security measures, develop recovery procedures, evaluate database performance, and
add and fine-tune database functions.
A data-driven Web site acts as an interface to a database, retrieving data for users and allowing
users to enter data in the database.
The fragmentation approach to a distributed DBMS addresses how tables are divided among
multiple locations. There are three variations: horizontal, vertical, and mixed.
The replication approach to a distributed DBMS has each site store a copy of the data in the
organization’s database.
The allocation approach to a distributed DBMS combines fragmentation and replication, with
each site storing the data it uses most often.
In object-oriented databases, both data and their relationships are contained in a single object.
An object consists of attributes and methods that can be performed on the object’s data.
Encapsulation refers to the grouping into a class of various objects along with their attributes
and methods—meaning, grouping related items into a single unit. This helps handle more
complex types of data, such as images and graphs.
Inheritance refers to new objects being created faster and more easily by entering new data in
attributes.
A data warehouse is a collection of data from a variety of sources used to support decision-
making applications and generate business intelligence.
Extraction, transformation, and loading (ETL) refers to the processes used in a data
warehouse. It includes extracting data from outside sources, transforming it to fit operational
needs, and loading it into the end target (database or data warehouse).
Online analytical processing (OLAP) generates business intelligence. It uses multiple sources
of information and provides multidimensional analysis, such as viewing data based on time,
product, and location.
A data mart is usually a smaller version of a data warehouse, used by a single department or
function.
Business analytics (BA) uses data and statistical methods to gain insight into the data and
provide decision makers with information they can act on.
Big data is data so voluminous that conventional computing methods are not able to efficiently
process and manage it.
Tableau is a data visualization tool used for generating business intelligence. It is used to
analyze data for generating trends using graphs and charts.
Power BI from Microsoft is a platform that allows a user to analyze and visualize data from different sources with different formats.