DBMS (UNIT 5)

 File Organization & Data warehousing
 File & Record Concept
 Fixed and variable sized Records
 Types of Single level Index
 Multilevel Indexes
 Dynamic Multilevel Indexes using B trees
 Data warehousing: Introduction
 Basic concepts
 Data warehouse architecture
 Various models
 Basic operations

 File Organization refers to the logical relationships among various records that
constitute the file, particularly with respect to the means of identification and access
to any specific record.
 In simple terms, storing the records of a file in a certain order is called file organization.
 File Structure refers to the format of the label and data blocks and of any logical
control record.
 The Objectives of File Organization
 It speeds up the selection of records.
 Operations such as insert, delete, and update on records become
faster and easier.
 It prevents duplicate records from being introduced by these operations.
 It helps store the records efficiently at minimal storage cost.

 Various methods have been introduced to organize files. Each method has
advantages and disadvantages with respect to access or selection, so it is up to
the programmer to choose the file organization method best suited to the
requirements.
 Some types of file organization are:
 Sequential File Organization
 Heap File Organization
 Hash File Organization
 B+ Tree File Organization
 Clustered File Organization
 ISAM (Indexed Sequential Access Method)

 A Database Management System (DBMS) stores data in the form of tables, is
typically designed using the ER model, and aims to guarantee the ACID properties. For
example, the DBMS of a college has tables for students, faculty, etc.
 A Data Warehouse is separate from the DBMS. It stores a huge amount of data, which
is typically collected from multiple heterogeneous sources such as files, DBMSs, etc.
 The goal is to produce statistical results that may help in decision making. For
example, a college might want to quickly see how the
placement of CS students has improved over the last 10 years, in terms of
salaries, counts, etc.

 An ordinary database typically stores MBs to GBs of data, and does so for a specific
purpose.
 For data at TB scale, storage shifts to a Data Warehouse. Besides this, a
transactional database does not lend itself to analytics.
 To perform analytics effectively, an organization keeps a central Data Warehouse
to closely study its business by organizing, understanding, and using its historical
data for making strategic decisions and analyzing trends.

 Better business analytics: A data warehouse stores all the past data and records
of the company in one place, which makes that data easier for the company to
understand and analyze.
 Faster queries: A data warehouse is designed to handle large analytical queries, so it
typically runs them faster than an operational database.
 Improved data quality: The data gathered from different sources is stored and
analyzed in the warehouse, which does not alter it or add data of its own, so
data quality is maintained. Any data quality issue that does arise is handled by
the data warehouse team.
 Historical insight: The warehouse stores all historical data about the business,
so it can be analyzed at any time to extract insights.

 A file is a collection of records. Using the primary key, we can access the
records. The type and frequency of access are determined by the type of file
organization used for a given set of records.
 File organization is a logical relationship among various records. It
defines how file records are mapped onto disk blocks.
 File organization describes the way in which the records are stored in
blocks and the blocks are placed on the storage medium.
 One approach to mapping the database to files is to use several files and
store records of only one fixed length in any given file. An alternative approach is to
structure the files so that they can hold records of multiple lengths.
 Files of fixed-length records are easier to implement than files of
variable-length records.

 File organization should allow optimal selection of records, i.e., records should be
selected as fast as possible.
 Insert, delete, and update transactions on the records should be quick and
easy.
 Duplicate records should not be introduced as a result of insert, update, or delete.
 Records should be stored efficiently to minimize the cost of storage.

 There are various methods of file organization. Each method has pros
and cons with respect to access or selection. The
programmer chooses the file organization method best suited to the
requirements.

 Sequential file organization
 Heap file organization
 Hash file organization
 B+ tree file organization
 Indexed sequential access method (ISAM)
 Cluster file organization

 In relational databases, a record is a group of related data held within the same
structure. More specifically, a record is a grouping of fields within a table that
references one particular object. The term record is frequently used synonymously
with row.
 For example, a customer record may include items such as first name, physical
address, email address, date of birth and gender.
 A record is also known as a tuple.

 Fixed-length records means that every record in the file is stored with the same, preset length.
Unless the block size is a multiple of the record size, some records cross block
boundaries and are split across more than one block.
 Fixed-length records cause the following two problems:
 A record stored partly in one block and partly in another requires access to
all the blocks containing its parts in order to read or write it.
 It is difficult to delete a record in such a file organization, because the space
freed by the deleted record must either be filled with another record of the file
or be marked so that it is ignored.
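
The fixed-length layout described above can be sketched in Python as follows. This is a minimal illustration, not a real storage engine: the record layout (a 4-byte id, a 20-byte name, an 8-byte balance), the helper names, and the sample rows are all assumptions made for the example. It shows that record i always starts at byte i * record size, and one common way to handle deletion (moving the last record into the freed slot).

```python
import io
import struct

# Hypothetical layout: 4-byte int id, 20-byte name, 8-byte float balance.
# "<" disables padding, so every record is exactly 32 bytes.
RECORD = struct.Struct("<i20sd")

def write_record(f, slot, rec_id, name, balance):
    """Store a record in the given slot; its offset is slot * record size."""
    f.seek(slot * RECORD.size)
    f.write(RECORD.pack(rec_id, name.encode("utf-8"), balance))

def read_record(f, slot):
    """Read the record stored in the given slot."""
    f.seek(slot * RECORD.size)
    rec_id, raw_name, balance = RECORD.unpack(f.read(RECORD.size))
    return rec_id, raw_name.rstrip(b"\x00").decode("utf-8"), balance

def delete_record(f, slot, count):
    """Delete a record by moving the last record into the freed slot.

    Returns the new record count.
    """
    last = count - 1
    if slot != last:
        f.seek(last * RECORD.size)
        data = f.read(RECORD.size)
        f.seek(slot * RECORD.size)
        f.write(data)
    f.truncate(last * RECORD.size)
    return last

f = io.BytesIO()  # stands in for a file on disk
write_record(f, 0, 1, "Asha", 100.0)
write_record(f, 1, 2, "Ravi", 250.5)
write_record(f, 2, 3, "Meena", 75.0)
second = read_record(f, 1)
count = delete_record(f, 0, 3)
moved = read_record(f, 0)
```

Moving the last record keeps the file compact but changes record order; marking the slot as free (a free list) is the usual alternative when order matters.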

 Variable-length records are records that vary in size, so the number of
records that fits in a block is not fixed. Variable-length records
arise in a database system in the following ways:
 Storage of multiple record types in a file.
 Record types that allow repeating fields such as multisets or arrays.
 Record types that allow variable lengths for one or more fields.
 Variable-length records pose the following two problems:
 How to represent a single record so that its individual
attributes can be extracted easily.
 How to store variable-length records within a block so that
a record in the block can be extracted easily.
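
One possible answer to the first problem is a length-prefixed encoding, sketched below in Python. The format (a 2-byte field count followed by a 2-byte length before each field) and the function names are assumptions chosen for illustration; real systems use other layouts, such as offset tables or slotted pages.

```python
import struct

def encode_record(fields):
    """Encode a list of strings as: field count, then (length, bytes) per field."""
    out = [struct.pack("<H", len(fields))]
    for value in fields:
        raw = value.encode("utf-8")
        out.append(struct.pack("<H", len(raw)))
        out.append(raw)
    return b"".join(out)

def decode_record(buf, offset=0):
    """Decode one record starting at offset; return (fields, next_offset)."""
    (count,) = struct.unpack_from("<H", buf, offset)
    offset += 2
    fields = []
    for _ in range(count):
        (length,) = struct.unpack_from("<H", buf, offset)
        offset += 2
        fields.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return fields, offset

# Two records of different sizes and field counts packed back to back in one block.
block = encode_record(["1", "Asha", "CS"]) + encode_record(
    ["2", "Ravi Kumar", "Mechanical", "Pune"]
)
first, next_offset = decode_record(block)
second, end_offset = decode_record(block, next_offset)
```

Because each record carries its own lengths, the reader can find where one record ends and the next begins without any fixed record size.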

 Indexing is used to optimize the performance of a database by minimizing the
number of disk accesses required when a query is processed.
 The index is a type of data structure. It is used to locate and access the data in a
database table quickly.
 Index structure:
 An index is created on one or more columns of a table.

 The first column of the index is the search key, which contains a copy of the
primary key or candidate key of the table. The search key values are
stored in sorted order so that the corresponding data can be accessed easily.
 The second column of the index is the data reference. It contains a set of
pointers holding the address of the disk block where the value of the particular
key can be found.

 Ordered indices
 The indices are usually sorted to make searching faster. The indices which are
sorted are known as ordered indices.
 Primary Index
 If the index is created on the basis of the primary key of the table, then it is
known as primary indexing. Primary key values are unique to each record, so there is
a 1:1 relation between index entries and records.
 As primary keys are stored in sorted order, the searching
operation is quite efficient.
 The primary index can be classified into two types: Dense index and Sparse index.

 Dense index
 The dense index contains an index record for every search key value in the data file. This
makes searching faster.
 The number of records in the index table is the same as the number of records in
the main table.
 It needs more space to store the index records themselves. Each index record holds the search key
and a pointer to the actual record on the disk.
 Sparse index
 An index record appears for only some of the items in the data file. Each index record points to a block.
 Instead of pointing to every record in the main table, the index points to
records in the main table at intervals.
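
The difference between the two can be sketched in Python. The data, the block size of 3, and the function names below are assumptions for illustration. The dense index has one entry per key; the sparse index has one entry per block, so a lookup finds the block first and then scans inside it.

```python
from bisect import bisect_right

# Data file: records sorted by key, grouped into blocks of 3 (assumed block size).
records = [(k, f"row-{k}") for k in (5, 10, 15, 20, 25, 30, 35, 40, 45)]
BLOCK = 3
blocks = [records[i:i + BLOCK] for i in range(0, len(records), BLOCK)]

# Dense index: one entry per search key -> (block number, offset in block).
dense = {
    key: (b, o)
    for b, blk in enumerate(blocks)
    for o, (key, _) in enumerate(blk)
}

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [blk[0][0] for blk in blocks]

def dense_lookup(key):
    """Follow the index entry straight to the record."""
    pos = dense.get(key)
    if pos is None:
        return None
    b, o = pos
    return blocks[b][o][1]

def sparse_lookup(key):
    """Find the last block whose first key <= key, then scan that block."""
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, row in blocks[b]:
        if k == key:
            return row
    return None
```

Here the dense index holds 9 entries and the sparse index only 3, at the cost of a short scan inside the block.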

 Clustering Index
 A clustering index is defined on an ordered data file. Sometimes the index is
created on non-primary key columns, which may not be unique for each record.
 In this case, to identify records faster, two or more columns are grouped to obtain
a unique value, and the index is created on them. This method is called a clustering
index.
 Records with similar characteristics are grouped, and indexes are
created for these groups.
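
A minimal Python sketch of the grouping idea follows. It assumes the common textbook form of a clustering index, in which the data file is ordered on a non-unique column and the index keeps one entry per distinct value pointing to the first record of that group. The column (dept), the sample rows, and the function name are hypothetical.

```python
# Data file ordered on a non-unique column (dept); sample rows are made up.
data_file = sorted([
    ("CS", "Asha"), ("EE", "Ravi"), ("CS", "Meena"),
    ("ME", "John"), ("EE", "Sara"), ("CS", "Li"),
])

# Clustering index: one entry per distinct dept -> position of its first record.
cluster_index = {}
for pos, (dept, _) in enumerate(data_file):
    cluster_index.setdefault(dept, pos)

def lookup(dept):
    """Jump to the first record of the group, then read until the group ends."""
    start = cluster_index.get(dept)
    if start is None:
        return []
    names = []
    for d, name in data_file[start:]:
        if d != dept:
            break
        names.append(name)
    return names
```

Because records of the same group sit next to each other, one index entry per group is enough to reach all of them.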

 Secondary Index
 In sparse indexing, as the size of the table grows, the size of the mapping also
grows. These mappings are usually kept in primary memory so that address
fetches are fast. The actual data is then read from secondary memory using
the address obtained from the mapping. If the mapping grows too large, fetching the
address itself becomes slower, and the sparse index is no longer efficient.
Secondary indexing is introduced to overcome this problem.
 In secondary indexing, another level of indexing is introduced to reduce the size
of the mapping. Large ranges of the column values are selected initially so
that the first-level mapping stays small. Each range is then further
divided into smaller ranges. The first-level mapping is stored in primary
memory so that address fetches are fast. The second-level mapping and the
actual data are stored in secondary memory (hard disk).

 As the database grows, its indices also grow. A single-level index
may become too large to keep in main memory, and searching it then
requires multiple disk accesses.
 Multilevel indexing divides the index into smaller blocks so
that the outermost level fits in a single block. The outer blocks point to
inner blocks, which in turn point to the data blocks. The outer level can easily be kept
in main memory with little overhead.
 In Relational Database Management Systems (RDBMS), indexes are essential
data structures that allow faster data retrieval by reducing the number of disk
accesses required to retrieve data. However, traditional indexes can become inefficient
as the database size grows. Multilevel indexes provide a solution to this problem
by dividing the index into smaller, manageable pieces.
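
A two-level lookup can be sketched in Python as follows. The keys, the fan-out of 10 entries per index block, the block address strings, and the function name are all assumptions for illustration. The small outer level selects one inner index block, and only that block is searched.

```python
from bisect import bisect_right

keys = list(range(0, 1000, 10))  # 100 sorted search keys (made-up data)
FANOUT = 10                      # index entries per block (assumption)

# Inner level: index blocks of (key, data block address); addresses are hypothetical.
inner = [
    [(k, f"blk-{k // 10}") for k in keys[i:i + FANOUT]]
    for i in range(0, len(keys), FANOUT)
]

# Outer level: the first key of every inner block; small enough to keep in memory.
outer = [blk[0][0] for blk in inner]

def lookup(key):
    """Return (address, index_blocks_read); address is None if the key is absent."""
    i = bisect_right(outer, key) - 1
    if i < 0:
        return None, 1
    block = inner[i]
    block_keys = [k for k, _ in block]
    j = bisect_right(block_keys, key) - 1
    if block_keys[j] != key:
        return None, 2
    return block[j][1], 2
```

With 100 keys, the outer level has only 10 entries, and a lookup touches one outer block and one inner block instead of scanning the whole index.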

 Indexing helps to optimize the performance of a database. It minimizes the
number of disk accesses required when a query is processed. It is a data structure
technique which is used to quickly locate and access the data in a database.
 Indexing uses two components: the search key (or candidate key)
and the data reference (or pointer).

 A B-tree is a self-balancing tree data structure.
 It stores and maintains data in sorted form: within every node, the keys in the
subtree to the left of a key are smaller than that key, and the keys in the subtree
to its right are larger.
 It makes searching efficient and supports search, insertion, and deletion in
logarithmic time. It allows nodes with more than two children.
 The B-tree is used for implementing multilevel indexing.
 Every node of the B-tree stores key values, each with a data pointer pointing
to the block in the disk file containing that key.

 Every node has at most m children, where m is the order of the B-tree.
 A node with K children contains K-1 keys.
 Every non-leaf node except the root must have at least ⌈m/2⌉ child nodes.
 The root must have at least 2 children unless it is a leaf.
 All leaves of a B-tree are at the same level.
 Unlike a binary search tree, a B-tree grows upward: insertion
happens at a leaf node, and the height increases only when the root splits.
 The time complexity of search, insertion, and deletion in a B-tree is O(log n),
where n is the number of elements in the B-tree.
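
The properties above can be seen in a small Python sketch of B-tree search and insertion. It is an illustration under stated assumptions, not a production index: it uses the minimum-degree formulation (parameter t, so the order is m = 2t), splits full nodes on the way down, stores keys only (no data pointers), and omits deletion. The class and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort

class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []       # sorted keys in this node
        self.children = []   # child nodes (empty for a leaf)
        self.leaf = leaf

class BTree:
    def __init__(self, t=2):
        self.t = t           # minimum degree: at most 2t children, 2t-1 keys
        self.root = BTreeNode()

    def search(self, key, node=None):
        """Return True if key is in the tree."""
        if node is None:
            node = self.root
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if node.leaf:
            return False
        return self.search(key, node.children[i])

    def _split_child(self, parent, i):
        """Split the full child parent.children[i]; its median moves up."""
        t = self.t
        full = parent.children[i]
        right = BTreeNode(leaf=full.leaf)
        median = full.keys[t - 1]
        right.keys = full.keys[t:]
        full.keys = full.keys[:t - 1]
        if not full.leaf:
            right.children = full.children[t:]
            full.children = full.children[:t]
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, right)

    def insert(self, key):
        if len(self.root.keys) == 2 * self.t - 1:
            # Root is full: the tree grows upward by one level.
            new_root = BTreeNode(leaf=False)
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        node = self.root
        while not node.leaf:
            i = bisect_right(node.keys, key)
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        insort(node.keys, key)  # insertion always lands in a leaf

def leaf_depths(node, depth=1):
    """Collect the depth of every leaf, to check that they are all equal."""
    if node.leaf:
        return [depth]
    return [d for c in node.children for d in leaf_depths(c, depth + 1)]

tree = BTree(t=2)
inserted = [10, 20, 5, 6, 12, 30, 7, 17, 3, 8, 25, 1]
for k in inserted:
    tree.insert(k)
```

After the inserts, every key is found by `search`, and `leaf_depths(tree.root)` contains a single distinct value, matching the rule that all leaves are at the same level.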


 A data warehouse is a collection of data from different heterogeneous sources, organised
under a unified schema. There are two approaches to constructing a data warehouse:
the top-down approach and the bottom-up approach, explained below.
 1. Top-down approach:

 External Sources –
An external source is a source from which data is collected, irrespective of the type of
data. The data can be structured, semi-structured, or unstructured.
 Stage Area –
Since the data extracted from the external sources does not follow a particular
format, it needs to be validated before it is loaded into the data warehouse. For
this purpose, it is recommended to use an ETL tool.
 E (Extract): Data is extracted from the external data source.
 T (Transform): Data is transformed into the standard format.
 L (Load): Data is loaded into the data warehouse after it has been transformed into
the standard format.

 Data Warehouse –
After cleansing, the data is stored in the data warehouse, which acts as the central repository. It
actually stores the metadata, while the actual data gets stored in the data
marts. Note that the data warehouse stores the data in its purest form in this top-down
approach.
 Data Marts –
A data mart is also part of the storage component. It stores the information of a particular
function of an organisation that is handled by a single authority. An organisation can have as many
data marts as it has functions. We can also
say that a data mart contains a subset of the data stored in the data warehouse.
 Data Mining –
Data mining is the practice of analysing the large volume of data present in the data warehouse. It is
used to find the hidden patterns present in the database or the data warehouse
with the help of data mining algorithms. The top-down approach is defined by Inmon: the
data warehouse is a central repository for the complete organisation, and data marts
are created from it after the complete data warehouse has been created.

 2. Bottom-up approach:
 First, the data is extracted from external sources (as in the top-down
approach).
 Then, the data goes through the staging area (as explained above) and is loaded into
data marts instead of the data warehouse. The data marts are created first and
provide reporting capability. Each data mart addresses a single business area.
 These data marts are then integrated into the data warehouse.

 Data arriving at a data warehouse is often unstructured and contains a lot of
unwanted and dirty data.
 Dirty data refers to incomplete and noisy data containing errors.
 To make this data structured and noise-free, dirty data needs to be removed. This
helps convert data into useful information and can be achieved using
certain data warehouse operations.
 These operations are a combination of ETL (Extraction, Transformation,
Loading) operations along with data cleaning and data refresh operations.


 Data Cleaning
 In data cleaning, inconsistencies are removed and noisy data containing errors is
rectified.
 For example: removal of redundant (duplicate) data.
 Data Refresh
 In the data refresh operation, the data in the data warehouse is refreshed by propagating the data from
the multiple sources and updating it periodically. This is done because the data inside the source databases
is updated frequently, and refreshing brings the same data into the data warehouse.
 Extraction of Data
 Data obtained after cleaning and refresh is still unstructured and unorganized. The data extraction
process organises it and enables users to extract and retrieve the relevant data.
This is helpful if a user wants to mine the data.


 Transformation of data
 Data obtained from heterogeneous databases has the native structure of its
source database, which might differ from the structure of the data
warehouse. So, data from heterogeneous databases is transformed to
match the structure of the data warehouse.
 Data Loading
 Data loading is responsible for loading the data into its target data
repository, which might be a database, a data mart, a data warehouse, etc.
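
The operations above can be put together in a small Python sketch. Everything specific here is an assumption for illustration: the two in-memory sources, their field layouts, the employee table, and the function names. The in-memory SQLite database merely stands in for the warehouse.

```python
import sqlite3

# Two hypothetical sources with different native structures.
source_a = [
    {"id": 1, "name": "Asha ", "salary": "50000"},
    {"id": 2, "name": "Ravi", "salary": "62000"},
    {"id": 2, "name": "Ravi", "salary": "62000"},   # duplicate row
]
source_b = [("3", "MEENA", 58000.0), ("4", None, 40000.0)]  # one incomplete row

def extract():
    """Extraction: pull rows from every source as (id, name, salary)."""
    for row in source_a:
        yield row["id"], row["name"], row["salary"]
    yield from source_b

def transform(rows):
    """Transformation: convert rows to the warehouse's types and format."""
    for rid, name, salary in rows:
        if name is None:
            continue  # cleaning: drop incomplete rows
        yield int(rid), name.strip().title(), float(salary)

def clean(rows):
    """Cleaning: remove redundant (duplicate) rows by id."""
    seen = set()
    for row in rows:
        if row[0] not in seen:
            seen.add(row[0])
            yield row

def load(rows, conn):
    """Loading: write the rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS employee "
        "(id INTEGER PRIMARY KEY, name TEXT, salary REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO employee VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stands in for the data warehouse
load(clean(transform(extract())), conn)
loaded = conn.execute(
    "SELECT id, name, salary FROM employee ORDER BY id"
).fetchall()
```

Using `INSERT OR REPLACE` means that running the pipeline again overwrites existing rows with current source values, which is one simple way to model the data refresh operation.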
