2017 Information Management Unit 3 DBMS
INFORMATION MANAGEMENT
DBMS – HDBMS, NDBMS, RDBMS, OODBMS, Query Processing, SQL, Concurrency Management,
Data warehousing and Data Mart.
Functions of DBMS
Create databases
Create tables
Create supporting structures
Read database data
Modify database data (insert, update, delete)
Maintain database structures
Enforce rules
Control concurrency
Provide security
Perform backup and recovery
Advantages of DBMS
The database management system has promising potential advantages, which are explained below:
a) Controlling Redundancy: In a file system, each application has its own private files, which cannot be shared
between multiple applications. This can often lead to considerable redundancy in the stored data, which results in
wastage of storage space. With a centralized database, most of this can be avoided. Not all redundancy can or
should be eliminated: sometimes there are sound business and technical reasons for maintaining
multiple copies of the same data. In a database system, however, this redundancy can be controlled.
b) For example: In a college database, there may be a number of applications such as General Office, Library,
Account Office, Hostel, etc.
c) Integrity can be enforced: Integrity of data means that data in the database is always accurate, such that incorrect
information cannot be stored in the database. In order to maintain the integrity of data, some integrity constraints are
enforced on the database. A DBMS should provide capabilities for defining and enforcing the constraints.
d) Inconsistency can be avoided: When the same data is duplicated and a change made at one site is not
propagated to the other site, inconsistency arises and the two entries regarding the same data will not agree.
At such times the data is said to be inconsistent. So, if the redundancy is removed, the chance of having inconsistent
data is also removed.
e) Data can be shared: As explained earlier, the data about Name, Class, Father_name, etc. of General_Office is
shared by multiple applications in a centralized DBMS, unlike in a file system, so new applications can be
developed to operate against the same stored data. The applications may be developed without having to create any
new stored files.
f) Standards can be enforced: Since a DBMS is a central system, standards can be enforced easily, whether at
company level, department level, national level or international level. Standardized data is very helpful during
migration or interchange of data. A file system is an independent system, so standards cannot be easily enforced
on multiple independent applications.
g) Restricting unauthorized access: When multiple users share a database, it is likely that some users will not be
authorized to access all information in the database. For example, account office data is often considered
confidential, and hence only authorized persons are allowed to access such data. In addition, some users may be
permitted only to retrieve data, whereas others are allowed both to retrieve and to update. Hence, the type of access
operation (retrieval or update) must also be controlled. Typically, users or user groups are given account numbers
protected by passwords, which they can use to gain access to the database. A DBMS should provide a security and
authorization subsystem, which the DBA uses to create accounts and to specify account restrictions. The DBMS
should then enforce these restrictions automatically.
h) Solving Enterprise Requirements rather than Individual Requirements: Since many types of users with varying levels of
technical knowledge use a database, a DBMS should provide a variety of user interfaces. The overall requirements of
the enterprise are more important than the individual user requirements. So, the DBA can structure the database
system to provide an overall service that is "best for the enterprise".
i) Providing Backup and Recovery: A DBMS must provide facilities for recovering from hardware or software
failures. The backup and recovery subsystem of the DBMS is responsible for recovery. For example, if the
computer system fails in the middle of a complex update program, the recovery subsystem is responsible for
making sure that the database is restored to the state it was in before the program started executing.
j) Cost of developing and maintaining system is lower: It is much easier to respond to unanticipated requests
when data is centralized in a database than when it is stored in a conventional file system. Although the initial cost
of setting up a database can be large, the cost of developing and maintaining application programs is far
lower than for similar service using conventional systems. The productivity of programmers can be higher when using
non-procedural languages that have been developed with DBMS than when using procedural languages.
k) Data Model can be developed: The centralized system is able to represent complex data and interfile
relationships, which results in better data modeling properties. The data modeling properties of the relational model are
based on Entities and their Relationships, which is discussed in detail in chapter 4 of the book.
l) Concurrency Control: DBMS systems provide mechanisms to give multiple users concurrent access to
data.
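Points (c) and (i) above can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in DBMS; the `student` table, its columns and the age range are made-up assumptions for illustration only. The first part shows a constraint rejecting bad data, the second shows a half-finished update being rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    """
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        age     INTEGER CHECK (age BETWEEN 15 AND 100)  -- hypothetical rule
    )
    """
)
conn.execute("INSERT INTO student VALUES (1, 'Asha', 19)")
conn.commit()

# Integrity: a row that violates the CHECK constraint is rejected by the DBMS.
try:
    conn.execute("INSERT INTO student VALUES (2, 'Ravi', 7)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Recovery: if an update fails midway, the database returns to its earlier state.
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE student SET age = 20 WHERE roll_no = 1")
        raise RuntimeError("simulated failure in the middle of the update")
except RuntimeError:
    pass

age_after_failure = conn.execute(
    "SELECT age FROM student WHERE roll_no = 1"
).fetchone()[0]
print(rejected, age_after_failure)
```

The application code contains no validation of its own here; the rule lives in the database definition, which is the sense in which a DBMS "enforces" integrity.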
Disadvantages of DBMS: The disadvantages of the database approach are summarized as follows:
a. Complexity : The provision of the functionality that is expected of a good DBMS makes the DBMS an extremely
complex piece of software. Database designers, developers, database administrators and end-users must
understand this functionality to take full advantage of it. Failure to understand the system can lead to bad design
decisions, which can have serious consequences for an organization.
b. Size : The complexity and breadth of functionality makes the DBMS an extremely large piece of software,
occupying many megabytes of disk space and requiring substantial amounts of memory to run efficiently.
c. Performance: Typically, a file-based system is written for a specific application, such as invoicing. As a result,
performance is generally very good. However, the DBMS is written to be more general, to cater for many
applications rather than just one. The effect is that some applications may not run as fast as they used to.
d. Higher impact of a failure: The centralization of resources increases the vulnerability of the system. Since all
users and applications rely on the availability of the DBMS, the failure of any component can bring operations to
a halt.
e. Cost of DBMS: The cost of DBMS varies significantly, depending on the environment and functionality
provided. There is also the recurrent annual maintenance cost.
f. Additional Hardware costs: The disk storage requirements for the DBMS and the database may necessitate the
purchase of additional storage space. Furthermore, to achieve the required performance it may be necessary to
purchase a larger machine, perhaps even a machine dedicated to running the DBMS. The procurement of
additional hardware results in further expenditure.
g. Cost of Conversion: In some situations, the cost of the DBMS and extra hardware may be insignificant compared
with the cost of converting existing applications to run on the new DBMS and hardware. This cost also includes
the cost of training staff to use these new systems and possibly the employment of specialist staff to help with
conversion and running of the system. This cost is one of the main reasons why some organizations feel tied to
their current systems and cannot switch to modern database technology.
Organizational DBMS
Organizational database systems typically:
Support several users simultaneously
Include more than one application
Involve multiple computers
Are complex in design
Have many tables
Have many databases
Conventional Data Processing techniques:
Compared with conventional data processing techniques, a DBMS offers the following:
- It reduces data redundancy and inconsistency by minimizing isolated files
- It can’t eliminate data redundancy as a whole, but can help control it
- It uncouples data and programs, enabling data to stand on their own
- Access and availability of information increase
- Program development and maintenance costs decrease
- Users and programmers can perform ad hoc queries of data in the database
- It enables the organization to centrally manage the data, their use, and security through the use of a data dictionary
Relational DBMS
- Contemporary DBMSs use different database models
- The most popular type is the relational DBMS
- A relational DBMS represents data as two-dimensional tables (called relations)
- Tables are also referred to as files
- Each table contains data on an entity and its attributes
- Example: Microsoft Access is a relational DBMS
- Each element of data for each entity is stored as a separate field
- Each field represents an attribute for that entity
- Fields in a relational database are also called columns
- The actual information about a single entity (for example, one supplier) that resides in a table is called a row
- Rows are referred to as records or as tuples
- When a field uniquely identifies each record, so that it can be retrieved, updated or sorted, it is called a key field
- Every table in a relational database has one field designated as its primary key
- The key field is the unique identifier for all the information in any row of the table
- The primary key cannot be duplicated
A primary key is a column or a set of columns that uniquely identifies a row in a table. A primary key should be
short, stable and simple. A foreign key is a field (or collection of fields) in a table whose value is required to
match the value of the primary key of a second table.
Relational databases work on the principle that each table has a key field that uniquely identifies each row, and
that these key fields can be used to connect one table of data to another.
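A minimal sketch of how key fields connect two tables, using Python's built-in `sqlite3` module. The `supplier` and `part` tables and their sample rows are assumptions made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE supplier (
        supplier_id INTEGER PRIMARY KEY,   -- primary key: unique per row
        name        TEXT NOT NULL
    );
    CREATE TABLE part (
        part_id     INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        -- foreign key: must match a primary key value in supplier
        supplier_id INTEGER NOT NULL REFERENCES supplier(supplier_id)
    );
    INSERT INTO supplier VALUES (1, 'Acme'), (2, 'Bharat Tools');
    INSERT INTO part VALUES (10, 'Bolt', 1), (11, 'Hammer', 2);
    """
)

# The key fields connect one table to the other.
rows = conn.execute(
    """
    SELECT part.description, supplier.name
    FROM part JOIN supplier ON part.supplier_id = supplier.supplier_id
    ORDER BY part.part_id
    """
).fetchall()

# A part that points at a non-existent supplier is refused.
try:
    conn.execute("INSERT INTO part VALUES (12, 'Saw', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(rows, orphan_rejected)
```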
3.4.2 The relational database is popular for two major reasons
1. Relational databases can be used with little or no training.
2. Database entries can be modified without redefining the entire structure.
3.4.3 Properties of Relational Tables
In a relational database, tables have the following properties.
Values are atomic.
Each row is unique.
Column values are of the same kind.
The sequence of columns is insignificant.
The sequence of rows is insignificant.
Each column has a unique name.
Distinguish between DBMS & RDBMS. Explain the advantages & disadvantages of both.
3.5 OODBMS – Object oriented Database Management System
An object-oriented database is a collection of objects held in persistent storage.
It is quite similar to the object-oriented languages. It can be described as the fifth-
generation database technology, which began to develop in the mid-1980s. Real-world
entities are represented as objects in the Object Oriented Data Model.
This model builds on the functionality of object-oriented programming, but it
involves more than the storage of programming language objects. Object DBMSs extend the
semantics of C++ and Java. They provide full-featured database programming capability
while retaining native language compatibility. They add database functionality to object
programming languages. This approach unifies application and database
development into a consistent data model and language environment. Applications require less
code, use more natural data modeling, and code bases are easier to maintain. Object
developers can write complete database applications with a modest amount of additional
effort.
Object-oriented databases use small, reusable chunks of software called objects. The
objects themselves are stored in the object-oriented database. Each object consists of two
elements:
1. A piece of data (e.g., sound, video, text, or graphics).
2. Instructions or software programs, called methods, that say what to do with the data.
3.5.1 Disadvantages of Object-oriented databases
Object-oriented databases have these disadvantages.
Object-oriented databases are more expensive to develop.
Most organizations are unwilling to abandon their existing databases and convert to object-oriented ones.
The benefits of object-oriented databases are nevertheless compelling. The ability to mix and match
reusable objects provides strong multimedia capability.
3.6 Object-Relational Model (Hybrid Model): It is also a relational data model but with
object orientation in it. It reduces the gap between the conceptual data modeling techniques
and object-relational mapping.
3.8.2 History
1970 -- Dr. Edgar F. "Ted" Codd of IBM, known as the father of relational databases,
described a relational model for databases.
1974 -- Structured Query Language appeared.
1978 -- IBM worked to develop Codd's ideas and produced a prototype named System R.
1986 -- SQL was standardized by ANSI. The first commercial relational database was
released by Relational Software, which later became Oracle.
3.8.3 SQL Process
When you execute an SQL command on any RDBMS, the system determines the best
way to carry out your request, and the SQL engine figures out how to interpret the task.
Various components are involved in this process. These components include the Query
Dispatcher, Optimization Engines, the Classic Query Engine and the SQL Query Engine.
The classic query engine handles all non-SQL queries, but the SQL query engine won't handle
logical files.
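One way to watch the "system determines the best way" step is to ask a database for its query plan. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module; SQLite is only one example engine, and the `customer` table and index name are invented for the illustration. The exact plan wording differs between SQLite versions, so the code just prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_customer_city ON customer (city)")

def plan_for(sql):
    """Return the optimizer's chosen plan as a list of readable steps."""
    # Each plan row ends with a human-readable description of one step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Filtering on the indexed column lets the optimizer use the index ...
plan_with_index = plan_for("SELECT name FROM customer WHERE city = 'Pune'")
# ... while a pattern match on an unindexed column needs a full table scan.
plan_full_scan = plan_for("SELECT name FROM customer WHERE name LIKE '%a%'")

print(plan_with_index)
print(plan_full_scan)
```

The same SQL text says only what is wanted; which access path is used is decided by the engine.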
3.8.4 SQL Commands
The standard SQL commands to interact with relational databases are CREATE, SELECT,
INSERT, UPDATE, DELETE and DROP. These commands can be classified into groups
based on their nature.
3.8.4.1 DDL - Data Definition Language
Command   Description
CREATE    Creates a new table, a view of a table, or other object in the database.
ALTER     Modifies an existing database object, such as a table.
DROP      Deletes an entire table, a view of a table, or other object in the database.
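The commands named in this section can be tried end to end with Python's built-in `sqlite3` module. This is a sketch only: the `employee` table and its rows are made up, and other RDBMS products accept slightly different syntax for some statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define and change the structure.
cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("ALTER TABLE employee ADD COLUMN salary REAL")

# Data manipulation: insert, update, delete and read rows.
cur.execute("INSERT INTO employee (id, name, salary) VALUES (1, 'Meera', 30000)")
cur.execute("INSERT INTO employee (id, name, salary) VALUES (2, 'John', 25000)")
cur.execute("UPDATE employee SET salary = 32000 WHERE id = 1")
cur.execute("DELETE FROM employee WHERE id = 2")
remaining = cur.execute("SELECT id, name, salary FROM employee").fetchall()

# DDL again: remove the whole table, structure and data together.
cur.execute("DROP TABLE employee")
tables_left = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(remaining, tables_left)
```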
3.9.1 Lock based protocols: Database systems equipped with lock-based
protocols use a mechanism by which a transaction cannot read or write data until it
has acquired an appropriate lock on it. Locks are of two kinds:
Binary Locks: a lock on a data item can be in two states; it is either locked or unlocked.
Shared/exclusive: this type of locking mechanism differentiates locks based on their
use. If a lock is acquired on a data item to perform a write operation, it is an exclusive lock,
because allowing more than one transaction to write to the same data item would lead the
database into an inconsistent state. Read locks are shared because no data value is being
changed.
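The shared/exclusive rule can be sketched as a toy lock table. This is an illustration under simplifying assumptions, not how any particular DBMS implements locking: the class `LockTable`, the mode letters 'S' and 'X', and the transaction names are all hypothetical, and a refused request simply returns `False` instead of waiting.

```python
class LockTable:
    """Toy lock manager: records which transactions hold which lock on each item."""

    def __init__(self):
        self.holders = {}  # item -> (mode, set of transaction ids)

    def acquire(self, txn, item, mode):
        """Request a lock in mode 'S' (shared/read) or 'X' (exclusive/write)."""
        if item not in self.holders:
            self.holders[item] = (mode, {txn})
            return True
        held_mode, owners = self.holders[item]
        if mode == "S" and held_mode == "S":
            owners.add(txn)  # any number of readers may share the item
            return True
        if owners == {txn}:
            # The sole owner may re-request, or upgrade a shared lock to exclusive.
            new_mode = "X" if "X" in (mode, held_mode) else "S"
            self.holders[item] = (new_mode, owners)
            return True
        return False  # conflict: a real system would make the caller wait

    def release(self, txn, item):
        """Give up txn's lock on item, if it holds one."""
        if item in self.holders:
            mode, owners = self.holders[item]
            owners.discard(txn)
            if not owners:
                del self.holders[item]

locks = LockTable()
read_1 = locks.acquire("T1", "A", "S")   # first reader
read_2 = locks.acquire("T2", "A", "S")   # second reader shares the lock
write_3 = locks.acquire("T3", "A", "X")  # writer conflicts with the readers
locks.release("T1", "A")
locks.release("T2", "A")
write_3_retry = locks.acquire("T3", "A", "X")  # succeeds once readers are gone
print(read_1, read_2, write_3, write_3_retry)
```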
3.9.1.1 Types of lock protocols
Simplistic: Simplistic lock-based protocols allow a transaction to obtain a lock on every
object before a 'write' operation is performed. As soon as the 'write' has been done, the transaction
may unlock the data item.
Pre-claiming: In this protocol, a transaction evaluates its operations and creates a list
of data items on which it needs locks. Before starting execution, the transaction requests
from the system all the locks it needs. If all the locks are granted, the transaction
executes and releases all the locks when all its operations are over. If not all the locks
are granted, the transaction rolls back and waits until all locks are granted.
Two Phase Locking - 2PL: This locking protocol divides the transaction execution phase
into three parts.
1. When the transaction starts executing, it seeks grants for the locks it needs as it
executes.
2. The transaction has acquired all its locks and no other lock is required. The transaction
keeps executing its operations. As soon as the transaction releases its first lock, the
third part starts.
3. The transaction cannot demand any new lock; it only releases the acquired locks.
Two phase locking has two phases: a growing phase, in which all locks are being acquired by the
transaction, and a shrinking phase, in which locks held by the transaction are being
released. To claim an exclusive (write) lock, a transaction must first acquire a shared (read)
lock and then upgrade it to an exclusive lock.
Strict Two Phase Locking: The first phase of Strict-2PL is the same as in 2PL. After acquiring
all locks in the first phase, the transaction continues to execute normally. But in contrast to 2PL,
Strict-2PL does not release a lock as soon as it is no longer required; it holds all locks until
the commit state arrives. Strict-2PL releases all locks at once at the commit point.
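The two-phase rule (no new locks after the first release) and the strict variant (release everything at commit) can be sketched as follows. The class and method names are hypothetical, and the sketch tracks only one transaction's own phases; it does not model conflicts between transactions.

```python
class TwoPhaseTransaction:
    """Tracks the growing and shrinking phases of a single transaction."""

    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False  # becomes True at the first release

    def lock(self, item):
        # Growing phase only: once a lock has been released, no new lock is allowed.
        if self.shrinking:
            raise RuntimeError(f"{self.name}: cannot lock {item!r} after releasing a lock")
        self.locks.add(item)

    def unlock(self, item):
        # Plain 2PL: the first release starts the shrinking phase.
        self.shrinking = True
        self.locks.discard(item)

    def commit(self):
        # Strict 2PL: every lock is held until commit and then released at once.
        self.shrinking = True
        released = sorted(self.locks)
        self.locks.clear()
        return released

# Plain 2PL: acquiring after a release breaks the protocol.
t1 = TwoPhaseTransaction("T1")
t1.lock("A")
t1.lock("B")
t1.unlock("A")
try:
    t1.lock("C")
    violation_detected = False
except RuntimeError:
    violation_detected = True

# Strict 2PL: no unlock calls before commit.
t2 = TwoPhaseTransaction("T2")
t2.lock("A")
t2.lock("B")
released_at_commit = t2.commit()
print(violation_detected, released_at_commit)
```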
3.9.2 Time stamp based protocols: The most commonly used concurrency protocol is the time-
stamp based protocol. This protocol uses either the system time or a logical counter as
a time-stamp. Lock based protocols manage the order between conflicting pairs among
transactions at the time of execution, whereas time-stamp based protocols start working as
soon as a transaction is created.
Every transaction has a time-stamp associated with it, and the ordering is determined by the
age of the transaction. A transaction created at clock time 0002 would be older than all other
transactions that come after it. For example, any transaction 'y' entering the system at 0004
is two seconds younger, and priority may be given to the older one. In addition, every data
item is given the latest read and write time-stamps. This lets the system know when the last read
and write operations were made on the data item.
3.9.2.1 Time-stamp ordering protocol: The timestamp-ordering protocol ensures
serializability among transactions in their conflicting read and write operations. It is the
responsibility of the protocol system that the conflicting pair of tasks should be
executed according to the timestamp values of the transactions.
Time-stamp of transaction Ti is denoted as TS(Ti).
Read time-stamp of data-item X is denoted by R-timestamp(X).
Write time-stamp of data-item X is denoted by W-timestamp(X).
Timestamp ordering protocol works as follows:
If a transaction Ti issues read(X) operation:
If TS(Ti) < W-timestamp(X)
o Operation rejected.
If TS(Ti) >= W-timestamp(X)
o Operation executed.
All data-item Timestamps updated.
If a transaction Ti issues write(X) operation:
If TS(Ti) < R-timestamp(X)
o Operation rejected.
If TS(Ti) < W-timestamp(X)
o Operation rejected and Ti rolled back.
Otherwise, operation executed.
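The checks can be written out as a short sketch. The read check mirrors the rule listed above; the write check follows the usual basic timestamp-ordering rule (reject if a younger transaction has already read or written the item). The names `DataItem`, `read` and `write` are hypothetical, and a rejected operation just returns `False` rather than restarting the transaction.

```python
class DataItem:
    """A data item X with its R-timestamp(X) and W-timestamp(X)."""

    def __init__(self):
        self.r_ts = 0      # timestamp of the youngest transaction that read X
        self.w_ts = 0      # timestamp of the youngest transaction that wrote X
        self.value = None

def read(ts, item):
    """Transaction with timestamp ts issues read(X)."""
    if ts < item.w_ts:
        return False       # rejected: a younger transaction already wrote X
    item.r_ts = max(item.r_ts, ts)  # executed: update the read timestamp
    return True

def write(ts, item, value):
    """Transaction with timestamp ts issues write(X)."""
    if ts < item.r_ts or ts < item.w_ts:
        return False       # rejected: a younger transaction already used X
    item.value = value     # executed: update the value and write timestamp
    item.w_ts = ts
    return True

x = DataItem()
w_by_4 = write(4, x, "new")        # accepted, W-timestamp(X) becomes 4
r_by_2 = read(2, x)                # older reader arrives too late: rejected
r_by_5 = read(5, x)                # younger reader: accepted, R-timestamp(X) = 5
w_by_4_again = write(4, x, "late") # rejected, since transaction 5 already read X
print(w_by_4, r_by_2, r_by_5, w_by_4_again)
```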
Data Warehouse
Data warehousing is the process of constructing and using a data warehouse. It is a process
of transforming data into information and making it available to users in a timely enough
manner to make a difference. A data warehouse is constructed by integrating data from
multiple heterogeneous sources and supports analytical reporting, structured and/or ad hoc
queries, and decision making. Data warehousing involves data cleaning, data integration,
and data consolidation. A data warehouse serves both data management and data analysis. Its main goal
is to integrate enterprise-wide corporate data into a single repository from which users can
easily run queries.
• The data are selected from various sources and then integrated and stored in a
single, particular format.
• Data warehouses contain current detailed data, historical detailed data, lightly and highly
summarized data, and metadata.
• Current and historical data are voluminous because they are stored at the highest level of
detail.
• Lightly and highly summarized data are necessary to save processing time when users
request them and are readily accessible.
• Metadata are “data about data”. They are important for designing, constructing, retrieving, and
controlling the warehouse data.
• Technical metadata include where the data come from, how the data were changed,
how the data are organized, how the data are stored, who owns the data, who is
responsible for the data and how to contact them, who can access the data , and the
date of last update.
• Business metadata include what data are available, where the data are, what the data
mean, how to access the data, predefined reports and queries, and how current the data
are.
A producer wants to know….
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?
Features of Data warehousing: A data warehouse is a single, complete and consistent store
of data obtained from a variety of different sources and made available to end users in a way
they can understand and use in a business context.
• Subject Oriented: Data that gives information about a particular subject instead of
about a company's ongoing operations.
• Integrated: Data that is gathered into the data warehouse from a variety of sources
and merged into a coherent whole.
• Time-variant: All data in the data warehouse is identified with a particular time
period.
• Non-volatile: Data is stable in a data warehouse. More data is added but data is never
removed. This enables management to gain a consistent picture of the business.
• Data warehousing is combining data from multiple and usually varied sources into one
comprehensive and easily manipulated database.
• Common accessing systems of data warehousing include queries, analysis and
reporting.
• Because data warehousing creates one database in the end, the number of sources can
be anything you want it to be, provided that the system can handle the volume, of
course.
• The final result, however, is homogeneous data, which can be more easily
manipulated.
• It is a relational or multidimensional database management system designed to
support management decision making.
• A data warehouse is a copy of transaction data specifically structured for querying
and reporting.
• It is a technique for assembling and managing data from various sources for the purpose of
answering business questions, thus enabling decisions that were not previously possible.
3.10.1 Benefits / Business advantages of Data warehouse: There are decision support
technologies that help utilize the data available in a data warehouse. These technologies help
executives to use the warehouse quickly and effectively. They can gather data, analyze it,
and take decisions based on the information present in the warehouse. The information
gathered in a warehouse can be used in any of the following domains:
Tuning Production Strategies - The product strategies can be well tuned by repositioning
the products and managing the product portfolios by comparing the sales quarterly or yearly.
Customer Analysis - Customer analysis is done by analyzing the customer's buying
preferences, buying time, budget cycles, etc.
Operations Analysis - Data warehousing also helps in customer relationship management,
and making environmental corrections. The information also allows us to analyze business
operations.
High returns on investment.
Increased productivity of corporate decision-makers.
It provides business users with a “customer-centric” view of the company’s heterogeneous
data by helping to integrate data from sales, service, manufacturing and distribution, and
other customer-related business systems.
It provides added value to the company’s customers by allowing them to access better
information when data warehousing is coupled with internet technology.
It consolidates data about individual customers and provides a repository of all customer
contacts for segmentation modeling, customer retention planning, and cross sales analysis.
It removes barriers among functional areas by offering a way to reconcile views from
multiple areas, thus providing a look at activities that cross functional lines.
It reports on trends across multidivisional, multinational operating units, including trends or
relationships in areas such as merchandising, production planning etc.
Strategic uses of data warehousing
Industry | Functional areas of use | Strategic use
Airline | Operations; marketing | Crew assignment, aircraft development, mix of fares, analysis of route profitability, frequent flyer program promotions
Banking | Product development; Operations; marketing | Customer service, trend analysis, product and service promotions, reduction of IS expenses
Credit card | Product development; marketing | Customer service, new information service, fraud detection
Health care | Operations | Reduction of operational expenses
Investment and Insurance | Product development; Operations; marketing | Risk management, market movements analysis, customer tendencies analysis, portfolio management
Retail chain | Distribution; marketing | Trend analysis, buying pattern analysis, pricing policy, inventory control, sales promotions, optimal distribution channel
Telecommunications | Product development; Operations; marketing | New product and service promotions, reduction of IS budget, profitability analysis
Personal care | Distribution; marketing | Distribution decisions, product promotions, sales decisions, pricing policy
Public sector | Operations | Intelligence gathering
All measures in the fact table are related to all the dimensions that fact table is related
to. In other words, they all have the same level of granularity. A star schema can be
simple or complex. A simple star consists of one fact table; a complex star can have
more than one fact table. Let's look at an example: Assume a data warehouse keeps
store sales data, and the different dimensions are time, store, product, and customer.
In this case, the figure on the left represents our star schema. The lines between two
tables indicate that there is a primary key / foreign key relationship between the two
tables. Note that different dimensions are not related to one another.
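The store-sales example can be sketched with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_time` and so on), column names and sample rows are assumptions chosen for illustration; only the four dimensions come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table per dimension; dimensions are not related to one another.
    CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_store    (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);

    -- The fact table sits in the middle and points at every dimension.
    CREATE TABLE fact_sales (
        time_id     INTEGER REFERENCES dim_time(time_id),
        store_id    INTEGER REFERENCES dim_store(store_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        amount      REAL    -- the measure
    );

    INSERT INTO dim_time     VALUES (1, 2017, 1), (2, 2017, 2);
    INSERT INTO dim_store    VALUES (1, 'Pune'), (2, 'Mumbai');
    INSERT INTO dim_product  VALUES (1, 'Pen'), (2, 'Notebook');
    INSERT INTO dim_customer VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO fact_sales VALUES
        (1, 1, 1, 1, 10.0),
        (1, 1, 2, 2, 50.0),
        (2, 2, 2, 1, 40.0);
    """
)

# A typical star query: join the fact table to one dimension and aggregate.
sales_by_city = conn.execute(
    """
    SELECT s.city, SUM(f.amount)
    FROM fact_sales AS f JOIN dim_store AS s ON f.store_id = s.store_id
    GROUP BY s.city
    ORDER BY s.city
    """
).fetchall()
print(sales_by_city)
```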
Snowflake Schema: The snowflake schema is an extension of the star schema, where
each point of the star explodes into more points. In a star schema, each dimension is
represented by a single dimensional table, whereas in a snowflake schema, that
dimensional table is normalized into multiple lookup tables, each representing a level
in the dimensional hierarchy.
Sample snowflake schema
For example, the Time Dimension that consists of 2 different hierarchies:
1. Year → Month → Day
2. Week → Day
There are 4 lookup tables in this snowflake schema: a lookup table for year, a lookup table for
month, a lookup table for week, and a lookup table for day. Year is connected to Month,
which is then connected to Day. Week is only connected to Day. A sample snowflake
schema illustrating the above relationships in the Time Dimension is shown to the right.
The main advantage of the snowflake schema is the improvement in query
performance due to minimized disk storage requirements and joining smaller lookup
tables. The main disadvantage of the snowflake schema is the additional maintenance
effort needed due to the increased number of lookup tables.
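The four lookup tables of the Time Dimension can be sketched with Python's built-in `sqlite3` module. Table and column names and the single sample date are assumptions for illustration; only the Year → Month → Day and Week → Day links come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lookup_year  (year_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE lookup_month (month_id INTEGER PRIMARY KEY, month INTEGER,
                               year_id INTEGER REFERENCES lookup_year(year_id));
    CREATE TABLE lookup_week  (week_id INTEGER PRIMARY KEY, week_of_year INTEGER);
    -- Day is the lowest level and belongs to both hierarchies.
    CREATE TABLE lookup_day   (day_id INTEGER PRIMARY KEY, day_of_month INTEGER,
                               month_id INTEGER REFERENCES lookup_month(month_id),
                               week_id  INTEGER REFERENCES lookup_week(week_id));

    INSERT INTO lookup_year  VALUES (1, 2017);
    INSERT INTO lookup_month VALUES (1, 12, 1);
    INSERT INTO lookup_week  VALUES (1, 52);
    INSERT INTO lookup_day   VALUES (1, 25, 1, 1);
    """
)

# Resolving one day now needs a join per level of the hierarchy.
row = conn.execute(
    """
    SELECT y.year, m.month, d.day_of_month, w.week_of_year
    FROM lookup_day AS d
    JOIN lookup_month AS m ON d.month_id = m.month_id
    JOIN lookup_year  AS y ON m.year_id  = y.year_id
    JOIN lookup_week  AS w ON d.week_id  = w.week_id
    """
).fetchone()
print(row)
```

The extra joins are the maintenance and query cost that normalizing the dimension brings; a star schema would keep all four attributes in one time table.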
Slowly Changing Dimension: This is a common issue facing data warehousing
practitioners.
Conceptual Data Model: What is a conceptual data model, its features, and an example of
this type of data model.
Logical Data Model: What is a logical data model, its features, and an example of this type
of data model.
Physical Data Model: What is a physical data model, its features, and an example of this
type of data model.
Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data
model.
Data Integrity: What is data integrity and how it is enforced in data warehousing.
What is OLAP: Definition of OLAP.
o OLTP: Online Transaction Processing
o OLTP systems are tuned for known transactions and workloads
o OLTP systems are used to “run” a business
o Special data organization, access methods and implementation methods are needed to
support data warehouse queries (typically multidimensional queries)
o e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of
December
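The example query above (average amount spent on phone calls between 9AM and 5PM in Pune during December) can be expressed in SQL. The sketch below runs it with Python's `sqlite3` module on a tiny made-up `calls` table; the table layout and every row in it are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune",   "2017-12-01 10:15:00", 12.0),
        ("Pune",   "2017-12-05 16:40:00", 8.0),
        ("Pune",   "2017-12-05 21:00:00", 30.0),  # outside 9AM-5PM
        ("Pune",   "2017-11-20 11:00:00", 50.0),  # not December
        ("Mumbai", "2017-12-03 12:00:00", 99.0),  # different city
    ],
)

# Three dimensions are restricted at once: place, month and time of day.
avg_amount = conn.execute(
    """
    SELECT AVG(amount) FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started_at) = '12'
      AND strftime('%H', started_at) >= '09'
      AND strftime('%H', started_at) <  '17'
    """
).fetchone()[0]
print(avg_amount)
```

On a real warehouse the same question would scan a very large number of rows, which is why such queries are kept away from the OLTP systems.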
OLTP vs Data Warehouse:
OLTP | Data Warehouse (DSS)
Application oriented | Subject oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current, up to date | Snapshot data
Isolated data | Integrated data
Clerical user | Knowledge user (manager)
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/update access | Mostly read (batch update)
No data redundancy | Redundancy present
Database size 100 MB - 100 GB | Database size 100 GB - few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets
OLTP systems are used to “run” a business | The data warehouse helps to “optimize” the business
Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have a different
view of the role between data warehouse and data mart.
Factless Fact Table: A fact table without any fact may sound silly, but there are real life
instances when a factless fact table is useful in data warehousing.
Junk Dimension: Discusses the concept of a junk dimension: When to use it and why it is
useful.
Conformed Dimension: Discusses the concept of a conformed dimension: What is it and
why is it important.
3.10.4 Data flow
Inflow: The processes associated with the extraction, cleansing, and loading of the data
from the source systems into the data warehouse.
Upflow: The processes associated with adding value to the data in the warehouse through
summarizing, packaging, and distribution of the data.
Downflow: The processes associated with archiving and backing-up of data in the
warehouse.
3.10.5 Tools and Technologies
The critical steps in the construction of a data warehouse:
Extraction
Cleansing
Transformation
After the critical steps, loading the results into the target system can be carried out either by
separate products or by a single product from one of these categories:
code generators
database data replication tools
dynamic transformation engines
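The critical steps (extraction, cleansing, transformation) followed by loading can be sketched as four small functions. This is a hand-written illustration, not one of the tool categories listed above; the function names, the record layout and the cleansing rules are all assumptions.

```python
def extract(source_rows):
    """Extraction: pull raw records from a source system (here, a plain list)."""
    return list(source_rows)

def cleanse(rows):
    """Cleansing: drop records with a missing amount and trim stray whitespace."""
    cleaned = []
    for row in rows:
        if row.get("amount") in (None, ""):
            continue  # unusable record
        cleaned.append({"customer": row["customer"].strip(), "amount": row["amount"]})
    return cleaned

def transform(rows):
    """Transformation: convert to one warehouse format (title-case names, float amounts)."""
    return [
        {"customer": r["customer"].title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, target):
    """Loading: append the prepared records to the target system (here, a list)."""
    target.extend(rows)
    return target

raw = [
    {"customer": "  asha ", "amount": "120.50"},
    {"customer": "ravi", "amount": ""},       # dropped during cleansing
    {"customer": "MEERA", "amount": "80"},
]
warehouse = load(transform(cleanse(extract(raw))), [])
print(warehouse)
```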
For the various types of meta-data and the day-to-day operations of the data warehouse, the
administration and management tools must be capable of supporting those tasks:
Monitoring data loading from multiple sources
Data quality and integrity checks
Managing and updating meta-data
Monitoring database performance to ensure efficient query response times and resource
utilization
Auditing data warehouse usage to provide user chargeback information
Replicating, subsetting, and distributing data
Maintaining efficient data storage management
Purging data
Archiving and backing-up data
Implementing recovery following failure
Virtual Warehouse: A view over an operational data warehouse is known as a virtual
warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires
excess capacity on operational database servers.
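A database view gives the flavour of a virtual warehouse: nothing is copied, and every query against the view is answered from the operational tables at that moment. The sketch uses Python's `sqlite3` module; the `orders` table, the view name and the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Operational table, updated by day-to-day transactions.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'West', 100.0), (2, 'West', 50.0), (3, 'East', 70.0);

    -- The "virtual" summary: a stored query, not a stored copy of the data.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
    """
)

before = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
conn.execute("INSERT INTO orders VALUES (4, 'East', 30.0)")
after = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
print(before, after)
```

Because the view is recomputed against the operational server each time, analysis queries compete with transactions for the same resources, which is why excess capacity is needed.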
3.11 Data Mart
A data mart is a simple form of a data warehouse that is focused on a single subject (or
functional area), such as sales, finance or marketing. Data marts are often built and
controlled by a single department within an organization. Given their single-subject focus,
data marts usually draw data from only a few sources. The sources could be internal
operational systems, a central data warehouse, or external data. Data marts contain a subset
of organization-wide data that is valuable to specific groups of people in an organization. In
other words, a data mart contains only data that is specific to a particular group. For
example, the marketing data mart may contain only data related to items, customers, and
sales. Data marts are confined to subjects.
• A data mart is a scaled down version of a data warehouse that focuses on a particular
subject area.
• A data mart is a subset of an organizational data store, usually oriented to a specific
purpose or major data subject, that may be distributed to support business needs.
• A data mart is often implemented as the first step in proving the usefulness of data
warehousing technologies for solving business problems.
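The marketing example above can be sketched as follows: a data mart is populated as a subject-specific subset of a larger warehouse. The table names wh_sales, wh_payroll, and marketing_mart, as well as all columns and values, are assumptions made for illustration. In practice a data mart may live in a separate database, which this sketch does not attempt to show.

```python
import sqlite3


def build_marketing_mart():
    """Build a hypothetical warehouse, then derive a marketing data mart from it."""
    conn = sqlite3.connect(":memory:")
    # Warehouse tables covering more than one subject area.
    conn.execute(
        "CREATE TABLE wh_sales (item TEXT, customer TEXT, amount REAL, shipping_cost REAL)"
    )
    conn.execute("CREATE TABLE wh_payroll (employee TEXT, salary REAL)")
    conn.executemany(
        "INSERT INTO wh_sales VALUES (?, ?, ?, ?)",
        [
            ("pen", "Alice", 15.0, 2.0),
            ("book", "Bob", 24.0, 3.5),
            ("pen", "Carol", 6.0, 2.0),
        ],
    )
    conn.execute("INSERT INTO wh_payroll VALUES ('Dana', 5000.0)")
    # The data mart keeps only what the marketing group needs:
    # items, customers, and sales amounts. Payroll data is left out.
    conn.execute(
        "CREATE TABLE marketing_mart AS SELECT item, customer, amount FROM wh_sales"
    )
    return conn


def mart_columns(conn):
    """Return the column names of the data mart table."""
    return [row[1] for row in conn.execute("PRAGMA table_info(marketing_mart)")]


if __name__ == "__main__":
    conn = build_marketing_mart()
    print(mart_columns(conn))
    print(conn.execute("SELECT * FROM marketing_mart ORDER BY customer").fetchall())
```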
Reasons for creating a data mart
• Easy access to frequently needed data
• Creates a collective view for a group of users
• Improves end-user response time
• Ease of creation in less time
• Lower cost than implementing a full Data warehouse
• Potential users are more clearly defined than in a full Data warehouse
The following figure shows a graphical representation of data marts.
5. What do you understand by data warehousing & Data Mart? What are the advantages of Data warehousing?
(Unit – 3)
A database is a structured collection of data. It can be anything from a list of names in a text file to a relational
database. It is commonly confused with the database management system. For example, MySQL is a relational
database management system, whereas the data you store in it is a database, so saying ‘I use MySQL as my
database’ is, strictly speaking, incorrect.
A data warehouse is a structured collection of [ideally] all the organisation’s data.
The concept of a data warehouse is not difficult to understand. Basically the idea is to create a permanent storage
space for the data needed to support reporting, analysis, and other BI functions. Although it may seem wasteful to
store data in multiple places (source systems and the data warehouse), the many advantages of doing so more than
justify the effort and expense.
Data warehouses reside on servers dedicated to this function running a database management system (DBMS)
such as SQL Server and using Extract, Transform, and Load (ETL) software such as SQL Server Integration
Services (SSIS) to pull data from the source systems and into the data warehouse.
Benefits of a Data Warehouse and BI solution: Once a data warehouse is in place and populated with data, it will
become a part of a BI solution that will deliver benefits to business users in many ways:
End user creation of reports: The creation of reports directly by end users is much easier to accomplish in a BI
environment. They can also create much more useful reports because of the power and capability of BI tools
compared to a source application. And moving the creation of reports to a BI system increases consistency and
accuracy and usually reduces cost
Ad-hoc reporting and analysis: Since the data warehouse eliminates the need for BI tools to compete with the
transactional source systems, users can analyze data faster and generate reports more easily, and slice-and-dice in
ways they could never do before. The Microsoft BI toolset vastly improves the ability to analyze data
Dynamic presentation through dashboards: Managers want access to an interactive display of up-to-date
critical management data. That is accomplished via dashboards, which are sophisticated displays that show
information in creative and highly graphical forms, much like the instrument panel on an automobile
Drill-down capability: Users can drill down into the details underlying the summaries on dashboards and
reports. This allows users to slice and dice to find underlying problems
Support for regulations: Sarbanes-Oxley and other related regulations have requirements that transactional
systems are sometimes not able to support. With a data warehouse, the necessary data can be retained as long as
the law requires
Metadata creation: Descriptions of the data can be stored with the data warehouse to make it a lot easier for
users to understand the data in the warehouse. This will make report creation much simpler for the end-user
Support for operational processes: A data warehouse can help support business needs, such as the ability to
consolidate financial results within a complex company that uses different software for different divisions
Data mining: Once you have built out a data warehouse, there are data mining tools that you can use to help find
hidden patterns using automatic methodologies. While reporting tools can tell you where you have been, data
mining tools can tell you where you are going
Security: A data warehouse makes it much easier to provide secure access to those that have a legitimate need to
specific data and to exclude others
Analytical tool support: There are many vendors who have analytical tools (e.g., QlikView, Tableau) that allow
business units to slice and dice the data and create reports and dashboards. These tools will all work best when
extracting data from a data warehouse
This long list of benefits is what makes BI based on a data warehouse an essential management tool for companies.
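The drill-down capability listed among the benefits can be illustrated with a small sketch: a summary query of the kind a dashboard might show, followed by a detail query for one of its rows. The sales table, its columns, and the figures are hypothetical.

```python
import sqlite3

# Hypothetical sales rows: (region, product, amount).
SALES = [
    ("North", "pen", 100.0),
    ("North", "book", 40.0),
    ("South", "pen", 25.0),
]


def make_conn():
    """Create an in-memory table holding the sample sales rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", SALES)
    return conn


def summary_by_region(conn):
    """Summary level: total sales per region."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()


def drill_down(conn, region):
    """Detail level: the individual product rows behind one region's total."""
    return conn.execute(
        "SELECT product, amount FROM sales WHERE region = ? ORDER BY product",
        (region,),
    ).fetchall()


if __name__ == "__main__":
    conn = make_conn()
    print(summary_by_region(conn))
    print(drill_down(conn, "North"))
```

The second query returns the rows that make up one figure in the first, which is the essence of drilling down from a summary into its details.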
A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data
mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data
warehouses have an enterprise-wide scope, the information in data marts pertains to a single department. While
transactional databases are designed to be updated, data warehouses and data marts are read-only for their users.
A data mart is structured to allow easy user access.