
THIRD 20PMHS012 20/09/2021

ASSIGNMENT
Q1. What is the process of Data Analysis?

Ans:
Data analysis is defined as the process of cleaning, transforming, and
modelling data to discover useful information for business decision-
making. The purpose of data analysis is to extract useful information
from data and to make decisions based upon that analysis.

A simple example of data analysis: whenever we make a decision in our
day-to-day life, we think about what happened last time or what will
happen if we choose that particular option. This is nothing but analysing
our past or future and making decisions based on it. For that, we gather
memories of our past or dreams of our future, and that is nothing but
data analysis. When an analyst does the same thing for business
purposes, it is called Data Analysis.

The data analysis process is nothing but gathering information by
using a proper application or tool which allows you to explore the data
and find a pattern in it. Based on that information and data, you can
make decisions or reach final conclusions.

Data Analysis consists of the following phases:

• Data Requirement Gathering

• Data Collection

• Data Processing

• Data Cleaning

• Data Analysis

• Data Interpretation

• Data Visualization


1. Data Requirement Gathering :


• In this stage, first of all, we have to think about why we want to
do this data analysis. We need to find out the purpose or aim of
doing the analysis of the data.
• We have to decide which type of data analysis we want to do.
• In this phase, you have to decide what to analyse and how to
measure it; we have to understand why we are investigating and
what measures we have to use to do this analysis.

2. Data Collection :
• After requirement gathering, we will get a clear idea about what
things we have to measure and what our findings should be.
• Now it’s time to collect our data based on requirements. Once we
collect our data, we must remember that the collected data must be
processed or organised for analysis.
• As we collect data from various sources, we must keep a
log with the collection date and source of the data.

3. Data Processing :
• The data that is collected must be processed or organised for
analysis.
• This includes structuring the data as required for the relevant
Analysis Tools.
• For example, the data might have to be placed into rows and
columns in a table within a Spreadsheet or Statistical Application.
• A Data Model might have to be created.

4. Data Cleaning :
• Whatever data is collected may not be useful or may be irrelevant to
our aim of analysis, hence it should be cleaned.
• The data which is collected may contain duplicate records, white
spaces or errors. The data should be cleaned and made error-free.
• This phase must be done before analysis because, based on data
cleaning, our output of analysis will be closer to our expected
outcome.


5. Data Analysis :
• Once the data is collected, cleaned, and processed, it is ready for
Analysis.
• As we manipulate the data, we may find we have the exact information
we need, or we might need to collect more data.
• During this phase, we can use data analysis tools and software
which help us to understand, interpret, and derive
conclusions based on the requirements.

6. Data Interpretation :
• After analysing our data, it’s finally time to interpret our results.
• We can choose the way to express or communicate our data
analysis: either simply in words, or maybe in a table or
chart.
• Then we use the results of our data analysis process to decide our best
course of action for our intended project.

7. Data Visualization :
• Data visualizations are very common in our day-to-day life; they
often appear in the form of charts and graphs.
• In other words, data is shown graphically so that it is easier for
the human brain to understand and process.
• Data visualization is often used to discover unknown facts and
trends. By observing relationships and comparing datasets, we can
find meaningful information (a minimal plotting sketch follows below).
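As an illustration of this phase, the short Python sketch below plots a simple bar chart so that a pattern in the data is easier to see. The monthly revenue figures are made up for the example and are not from the document; pandas and matplotlib are assumed to be available.

```python
# A minimal visualization sketch (illustrative data, not from the document).
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical monthly revenue figures gathered during the analysis phase.
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [120, 150, 90, 180],
})

# A bar chart makes the month-to-month comparison immediately visible.
sales.plot(kind="bar", x="month", y="revenue", legend=False)
plt.ylabel("Revenue (in thousands)")
plt.title("Monthly revenue")
plt.tight_layout()
plt.show()
```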

(Figure: the data analysis cycle, from Data Requirement through Data Collection, Data Processing, Data Cleaning, Data Analysis, and Data Interpretation to Data Visualization.)

Q2. What is the difference between Data Mining and Data Analytics?

Ans:

• Definition: Data Mining is discovering patterns in a large set of data; Data Analytics is extracting and organising data to draw conclusions that can be used to make informed decisions.

• Scope of coverage: Data Mining draws on machine learning, statistics and database systems; Data Analytics spans data mining, data analytics, computer science and non-technical tools.

• Synonyms: Data Mining is also called knowledge discovery in databases (KDD); Data Analytics covers descriptive, predictive and explanatory analysis, etc.

• Purpose: Data Mining aims at finding patterns in raw data sets; Data Analytics aims at testing hypotheses and supporting business decisions.

• Work Profile: Data Mining is done by a single person, a specialist; Data Analytics involves a larger team.

• Output: Data Mining produces data patterns; Data Analytics produces verified hypotheses and deep insight into the data.

• Data Structure: Data Mining works on very structured data; Data Analytics works on structured and unstructured data.

• Examples: Data Mining: the relationship between beer and diapers. Data Analytics: "Time series study of e-cigarette usage in the last 8 years."

• Team Size: in Data Mining one person can do the job; Data Analytics requires a team.

• Data Science: Data Mining is a subset of data analytics; Data Analytics is a subset of data science.

Q3. What is data cleansing and what are the best ways to practice data
cleansing?

Ans:
Data cleansing or data cleaning is the process of identifying and
removing (or correcting) inaccurate records from a dataset, table, or
database. It refers to recognising unfinished, unreliable, inaccurate, or
non-relevant parts of the data and then restoring, remodelling, or
removing the dirty or crude data.
Data cleaning techniques may be performed as batch processing
through scripting or interactively with data cleansing tools.
After cleaning, a dataset should be uniform with other related
datasets in the operation. The discrepancies identified or eliminated may
have been caused by user entry mistakes, by corruption in
storage or transmission, or by different data dictionary definitions of
similar items in different stores.

Best ways to practice data cleansing

1. Develop a data quality plan:

It is essential to first understand where the majority of errors occur
so that the root cause can be identified and a plan built to manage it.
Remember that effective data cleaning practices will have an overarching
impact throughout an organisation, so it is important to remain as open
and communicative as possible. A plan needs to include:
• Responsible: a C-level executive, the Chief Data Officer (CDO) if the
company has already appointed such an executive. Additionally,
business and technical owners need to be assigned for different
data.
• Metrics: ideally, data quality should be summarizable as a single
number on a 1-100 scale. While different data can have different
data quality, having an overall number can help the organisation
measure its continuous improvement. This overall number can give
more weight to data that are critical to the company's success,
helping prioritise data quality initiatives that impact important
data.


• Actions: a clear set of actions should be identified to kick off the
data quality plan. Over time, these actions will need to be updated
as data quality changes and as company priorities change.

2. Correct data at the source :

If data can be fixed before it becomes an erroneous (or duplicated)
entry in the system, it saves hours of time and stress down the line. For
example, if your forms are overcrowded and require too many fields to
be filled, you will get data quality issues from those forms. Given that
businesses are constantly producing more data, it is crucial to fix data at
the source.

3. Measure data accuracy :


Invest in the time, tools, and research necessary to measure the
accuracy of your data in real time. If you need to purchase a data quality
tool to measure data accuracy, review the selection criteria for choosing
the right data quality tool.

4. Manage data and duplicates :

If some duplicates do sneak past your new entry practices, be sure
to actively detect and remove them. After removing any duplicate
entries, it is important to also consider the following (see the sketch after
this list):
• Standardising: confirming that the same type of data exists in each
column.
• Normalising: ensuring that all data is recorded consistently.
• Merging: when data is scattered across multiple datasets, merging
is the act of combining relevant parts of those datasets to create a
new file.
• Aggregating: sorting data and expressing it in a summary form.
• Filtering: narrowing down a dataset to only include the
information we want.
• Scaling: transforming data so that it fits within a specific scale such
as 0-100 or 0-1.
• Removing: removing duplicate and outlier data points to prevent
a bad fit in linear regression.
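A minimal pandas sketch of a few of these steps (standardising text, dropping duplicates, and scaling a column to 0-100) is shown below. The column names and values are hypothetical, chosen only to illustrate the idea.

```python
# Illustrative cleaning steps with pandas (hypothetical data and column names).
import pandas as pd

df = pd.DataFrame({
    "customer": ["Alice ", "alice", "Bob", "Bob"],
    "spend":    [100.0,    100.0,  250.0, 250.0],
})

# Standardising / normalising: trim whitespace and use a consistent case.
df["customer"] = df["customer"].str.strip().str.title()

# Removing: drop exact duplicate rows left after standardisation.
df = df.drop_duplicates()

# Scaling: map the spend column onto a 0-100 scale.
span = df["spend"].max() - df["spend"].min()
df["spend_scaled"] = (df["spend"] - df["spend"].min()) / span * 100

print(df)
```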


5. Append data :
Append is a process that helps organisations to define and
complete missing information. Reliable third-party sources are often one
of the best options for managing this practice.

Data Cleaning Techniques

As is the case with many other actions, ensuring the cleanliness of
big data presents its own unique set of considerations. Subsequently,
a number of techniques have been developed to assist in
cleaning big data:
1. Conversion tables: when certain data issues are already known (for
example, that the names included in a dataset are written in several
ways), the data can be sorted by the relevant key and lookups can be
used to make the conversion.
2. Histograms: these allow for the identification of values that occur
less frequently and may be invalid (a small frequency-count sketch
follows this list).
3. Tools: every day major vendors are coming out with new and better
tools to manage big data and the complexities that can accompany it.
4. Algorithms: spell-check or phonetic algorithms can be useful,
but they can also make wrong suggestions.
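As a small illustration of the histogram technique, the pandas sketch below counts value frequencies so that rare, possibly invalid spellings stand out. The city names and the review threshold are made up for the example.

```python
# Frequency counts highlight rare values that may be typos (hypothetical data).
import pandas as pd

cities = pd.Series(["Delhi", "Delhi", "Mumbai", "Mumbai", "Mumbai", "Mumbia"])

counts = cities.value_counts()
print(counts)            # 'Mumbia' appears only once and is probably a typo

# Flag values that occur fewer than, say, 2 times for manual review.
suspect = counts[counts < 2].index.tolist()
print("Values to review:", suspect)
```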

Best practices in data cleaning

1. Consider your data in the most holistic way possible, thinking
about not only who will be doing the analysis but also who will be
using the results derived from it.
2. Increased controls on database inputs can ensure that cleaner data is
what ends up being used in the system.
3. Choose software solutions that are able to highlight and potentially
even resolve faulty data before it becomes problematic.
4. In the case of large datasets, be sure to limit your sample size in
order to minimise prep time and accelerate performance.
5. Spot-check throughout to prevent any errors from being replicated.


Q4. When do you think you should retrain a model? Is it dependent on
the data?

Ans:
The knowledge embedded in a model is a frozen snapshot of a real-
world process imperfectly captured in data. The required change may be
complex, but the reasoning is simple:
As the real world and the engineering around that snapshot
change, our model needs to keep up with reality in order to meet the
performance metrics achieved in development.
Some retraining schedules present themselves naturally during
model development, for example when the model depends on a data
source that is updated periodically. Many changes in data, engineering, or
business can be difficult or impossible to predict and communicate.
Changes anywhere in the model dependency pipeline can degrade
model performance; it's an inconvenience, but not unsolvable.

Yes, retraining a model does depend on data. Let us consider a
user interacting with a website application. The application may capture
clicks to represent this interaction, and these clicks may be monetised
under specific business rules. This data can be used to build a model to
predict the lifetime value of that customer or the risk of losing that
customer's business (churn).
Another real-world scenario could be someone purchasing a
vehicle. This process gets captured as a purchase order, which gets
modelled and stored in a database. Business definitions are then used to
calculate profit. From this, we can build a model to predict profit per car
or the length of time required to make a sale.
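A minimal sketch of a data-driven retraining trigger is given below. The baseline metric, the drift threshold, and the function names are assumptions made for illustration, not something prescribed in the text.

```python
# Sketch: retrain when live performance drifts too far from the development baseline.
# BASELINE_AUC, THRESHOLD and the function names are hypothetical.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

BASELINE_AUC = 0.85   # metric achieved during development
THRESHOLD = 0.05      # acceptable drop before retraining

def evaluate(model, X_recent, y_recent):
    """Score the deployed model on recently collected, labelled data."""
    return roc_auc_score(y_recent, model.predict_proba(X_recent)[:, 1])

def maybe_retrain(model, X_recent, y_recent):
    """Retrain on fresh data if performance has degraded beyond the threshold."""
    current_auc = evaluate(model, X_recent, y_recent)
    if BASELINE_AUC - current_auc > THRESHOLD:
        model = LogisticRegression(max_iter=1000).fit(X_recent, y_recent)
    return model
```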


Q5. Can you mention a few problems that data analysts usually
encounter while performing the analysis?

Ans:
A few problems that data analysts usually encounter while performing
analysis are:

1. The amount of data being collected:


An organisation may receive information on every incident and
interaction that takes place on a daily basis, leaving analysts with
thousands of interlocking data sets.
There is a need for a data system that automatically collects and
organises information. Manually performing this process is far too time-
consuming and unnecessary in today's environment.
An automated system will allow employees to use the time spent
processing data to act on it instead.

2. Collecting meaningful and real-time data:


With so much data available, it's difficult to dig down and access
the insights that are needed most. When employees are overwhelmed,
they may not fully analyse data, or they may only focus on the measures
that are easiest to collect instead of those that truly add value.
In addition, if an employee has to manually sift through data, it
can be impossible to gain real-time insights on what is currently
happening. Outdated data can have significant negative impacts on
decision-making.
A data system that collects, organises and automatically alerts
users of trends will help solve this issue. Employees can input their goals
and easily create a report that provides the answers to their most
important questions. With real-time reports and alerts, decision-makers
can be confident they are basing any choices on complete and accurate
information.


3. Visual representation of data:


To be understood and impactful, data often needs to be visually
presented in graphs or charts. While these tools are incredibly useful, it's
difficult to build them manually. Taking the time to pull information
from multiple areas and put it into a reporting tool is frustrating and
time-consuming.
Strong data systems enable report building at the click of a
button. Employees and decision-makers will have access to the real-time
information they need in an appealing and educational format.

4. Data from multiple sources:


The issue with multiple sources is analysing data across multiple,
disjointed sources. Different pieces of data are often housed in different
systems. Employees may not always realise this, leading to incomplete
or inaccurate analysis. Manually combining data is time-consuming and
can limit insights to what is easily viewed.
With a comprehensive and centralised system, employees will
have access to all types of information in one location. Not only does this
free up time spent accessing multiple sources, it allows cross-
comparisons and ensures data is complete.

5. Inaccessible data:
Moving data into one centralised system has little impact if it is not
easily accessible to the people that need it. Decision-makers and risk
managers need access to all of an organisation’s data for insights on what
is happening at any given moment, even if they are working off-site.
Accessing information should be the easiest part of data analytics.
An effective database will eliminate any accessibility issues.
Authorised employees will be able to securely view or edit data from
anywhere, illustrating organisational changes and enabling high-speed
decision making.


6. Poor quality data:


Nothing is more harmful to data analytics than inaccurate data.
Without good input, output will be unreliable. A key cause of inaccurate
data is manual errors made during data entry. This can lead to significant
negative consequences if the analysis is used to influence decisions.
Another issue is asymmetrical data: when information in one system
does not reflect the changes made in another system, leaving it outdated.
A centralised system eliminates these issues. Data can be input
automatically with mandatory or drop-down fields, leaving little room
for human error. System integrations ensure that a change in one area is
instantly reflected across the board.

7. Pressure from the top:


As risk management becomes more popular in organisations, CFOs
and other executives demand more results from risk managers. They
expect higher returns and a large number of reports on all kinds of data.
With a comprehensive analysis system, risk managers can go
above and beyond expectations and easily deliver any desired analysis.
They’ll also have more time to act on insights and further the value of
the department to the organisation

8. Lack of support:
Data analytics can’t be effective without organisational support,
both from the top and lower-level employees. Risk managers will be
powerless in many pursuits if executives don’t give them the ability to
act. Other employees play a key role as well: if they do not submit data
for analysis or their systems are inaccessible to the risk manager, it will
be hard to create any actionable information.
Emphasise the value of risk management and analysis to all
aspects of the organisation to get past this challenge. Once other
members of the team understand the benefits, they're more likely to
cooperate. Implementing change can be difficult, but using a centralised
data analysis system allows risk managers to easily communicate
results and effectively achieve buy-in from multiple stakeholders.

9. Confusion or anxiety:
Users may feel confused or anxious about switching from
traditional data analysis methods, even if they understand the benefits
of automation. Nobody likes change, especially when they are
comfortable and familiar with the way things are done.
To overcome this HR problem, it's important to illustrate how
changes in analytics will actually streamline the role and make it more
meaningful and fulfilling. With comprehensive data analytics, employees
can eliminate redundant tasks like data collection and report building
and spend time acting on insights instead.

10. Budget:
Another challenge risk managers regularly face is budget. Risk is
often a small department, so it can be difficult to get approval for
significant purchases such as an analytics system.
Risk managers can secure budget for data analytics by measuring
the return on investment of a system and making a strong business
case for the benefits it will achieve.

11. Shortage of skills:


Some organisations struggle with analysis due to a lack of talent.
This is especially true in those without formal risk departments.
Employees may not have the knowledge or capability to run in-depth
data analysis.
This challenge is mitigated in two ways: by addressing analytical
competency in the hiring process and by having an analysis system that is
easy to use. The first solution ensures skills are on hand, while the
second simplifies the analysis process for everyone, regardless of skill
level.

12. Scaling data analysis:


Finally, analytics can be hard to scale as an organisation and the
amount of data it collects grows. Collecting information and creating
reports becomes increasingly complex. A system that can grow with the
organisation is crucial to manage this issue.
While overcoming these challenges may take some time, the
benefits of data analysis are well worth the effort. Improve one's
organisation today and consider investing in a data analytics system.

Q6. Mention the name of the framework developed by Apache for
processing large datasets for an application in a distributed computing
environment?

Ans:
The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows
for the distributed processing of large data sets across clusters of
computers using simple programming models.
It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Rather than rely
on hardware to deliver high availability, the library itself is designed to
detect and handle failures at the application layer, thus delivering a highly
available service on top of a cluster of computers, each of which may be
prone to failures.
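As a small, hedged illustration of Hadoop's programming model, the sketch below shows a word-count mapper and reducer written in the Hadoop Streaming style. In practice these would live in separate mapper.py and reducer.py scripts, and the exact streaming command depends on the installation.

```python
# Word-count mapper/reducer in the Hadoop Streaming style (illustrative sketch).
# Typically run roughly as:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
import sys

def mapper():
    # Emit "word\t1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key; sum the counts per word.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")
```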



Q7. How can you highlight cells with negative values in Excel?

Ans:
Steps to highlight cells with negative values in Excel:
1. Select the cells in which you want to highlight the negative numbers
in red.
2. Go to Home → Conditional Formatting → Highlight Cell Rules →
Less Than.
3. In the Less Than dialog box, specify the value as "0" below which the
formatting should be applied.
4. Click OK.

(Screenshots: the sample data set, steps 1 and 2, step 3, and the final result with negative values highlighted.)
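If the same effect is needed programmatically, openpyxl can apply an equivalent conditional-formatting rule. The sketch below is a rough counterpart of the manual steps above; the sample values, cell range, and output file name are hypothetical.

```python
# Highlight negative values in red via conditional formatting (openpyxl sketch).
from openpyxl import Workbook
from openpyxl.styles import PatternFill
from openpyxl.formatting.rule import CellIsRule

wb = Workbook()
ws = wb.active
for value in [10, -4, 25, -7]:          # sample data in column A
    ws.append([value])

red_fill = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")
ws.conditional_formatting.add(
    "A1:A4",
    CellIsRule(operator="lessThan", formula=["0"], fill=red_fill),
)
wb.save("negatives_highlighted.xlsx")   # hypothetical output file
```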


Q8. How can you clear all the formatting without actually removing
the cell contents?

Ans:
Steps to clear all the formatting without actually removing the cell
contents in Excel:
1. Highlight the portion of the spreadsheet from which you want to
remove formatting.
2. Click the Home tab.
3. Select Clear from the Editing portion of the Home tab.
4. From the drop-down menu of the Clear button, select Clear Formats.

(Screenshots: removing the negative-value highlight formatting applied in the previous question, showing step 1, steps 2 and 3, and the final result.)


Q9. What is a Print Area and how can you set it in Excel ?

Ans:
A print area is one or more ranges of cells that you designate to
print when you don't want to print the entire worksheet. When you print
a worksheet after defining a print area, only the print area is printed.
Following are the steps to set a print area in Excel:
1. On the worksheet, select the cells that you want to define as the print
area.
2. On the Page Layout tab, in the Page Setup group, click Print Area,
and then click Set Print Area.

Then our print area gets selected. Each time we go to print, only those
specified cells will be printed.
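For completeness, the same setting can be made from code with openpyxl, as in this small sketch; the cell range and output file name are placeholders chosen for illustration.

```python
# Set a print area programmatically with openpyxl (illustrative sketch).
from openpyxl import Workbook

wb = Workbook()
ws = wb.active

# Only cells A1:D10 will be printed for this worksheet.
ws.print_area = "A1:D10"

wb.save("report_with_print_area.xlsx")  # hypothetical output file
```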

Q10. What is the default port for SQL?

Ans:
By default, SQL Server will attempt to use TCP port 1433.
If that port is unavailable, it will automatically choose another port.
If this is the case, that port will need to be opened through the firewall
instead.
By default, the typical ports used by SQL Server and associated
database engine services are: TCP 1433, 4022, 135, 1434 and UDP 1434.
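As an example of where the port matters in practice, the hedged connection sketch below passes port 1433 explicitly. The driver name, server, database, and credentials are placeholders, not real values.

```python
# Connecting to SQL Server on the default TCP port 1433 (placeholder credentials).
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver,1433;"        # host,port (1433 is the default)
    "DATABASE=mydb;"
    "UID=myuser;PWD=mypassword;"
)
cursor = conn.cursor()
cursor.execute("SELECT @@VERSION")
print(cursor.fetchone()[0])
conn.close()
```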


Q11. What do you mean by DBMS? What are its different types?

Ans:
A database management system (DBMS) is a collection of programs
that enables you to store, modify, and extract information from a
database.
There are many different types of DBMSs, ranging from small
systems that run on personal computers to huge systems that run on
mainframes. The DBMS acts as an interface between the application
program and the data in the database.
The following are examples of database applications:
• Computerised library systems
• Automated teller machines
• Flight reservation systems
• Computerised parts inventory systems

Types of database management systems are:

1. Centralised Database:
It is the type of database that stores data at a centralised database
system. It comforts the users to access the stored data from different
locations through several applications. These applications contain the
authentication process to let users access data securely. An example of a
Centralised database can be Central Library that carries a central
database of each library in a college/university.


2. Distributed Database:
Unlike a centralised database system, in distributed systems data
is distributed among different database systems of an organisation.
These database systems are connected via communication links.
Such links help the end-users to access the data easily. Examples of
distributed databases are Apache Cassandra, HBase, Ignite, etc.
We can further divide a distributed database system into:
• Homogeneous DDB: database systems which execute on the
same operating system, use the same application process and
carry the same hardware devices.
• Heterogeneous DDB: database systems which execute on
different operating systems, under different application procedures,
and carry different hardware devices.

3. Relational Database:
This database is based on the relational data model, which stores
data in the form of rows (tuples) and columns (attributes), which together
form a table (relation). A relational database uses SQL for storing,
manipulating, and maintaining the data. E. F. Codd introduced the
relational model in 1970.
Each table in the database carries a key that makes the data unique
from others. Examples of relational databases are MySQL, Microsoft
SQL Server, Oracle, etc.


4. NoSQL Database:
Non-SQL/Not Only SQL is a type of database that is used for
storing a wide range of data sets. It is not a relational database as it stores
data not only in tabular form but in several different ways.
It came into existence when the demand for building modern
applications increased. Thus, NoSQL presented a wide variety of
database technologies in response to the demands. We can further divide
a NoSQL database into the following four types:
• Key-value storage: It is the simplest type of database storage where
it stores every single item as a key (or attribute name) holding its
value, together.
• Document-oriented Database: A type of database used to store
data as JSON-like document. It helps developers in storing data by
using the same document-model format as used in the application
code.
• Graph Databases: It is used for storing vast amounts of data in a
graph-like structure. Most commonly, social networking websites
use the graph database.
• Wide-column stores: It is similar to the data represented in
relational databases. Here, data is stored in large columns together,
instead of storing in rows.


5. Cloud Database:
A type of database where data is stored in a virtual environment
and executes over the cloud computing platform. It provides users with
various cloud computing services (SaaS, PaaS, IaaS, etc.) for accessing the
database. There are numerous cloud platforms, but the best options are:
• Amazon Web Services (AWS)
• Microsoft Azure
• Kamatera
• PhoenixNAP
• ScienceSoft
• Google Cloud SQL, etc.
6. Object-oriented Databases
The type of database that uses the object-based data model
approach for storing data in the database system. The data is represented
and stored as objects which are similar to the objects used in the object-
oriented programming language.

7. Hierarchical Databases
It is the type of database that stores data in the form of parent-
child relationship nodes. It organises data in a tree-like
structure.
Data gets stored in the form of records that are connected via links.
Each child record in the tree has only one parent. On the other
hand, each parent record can have multiple child records.


8. Network Databases:
It is the database that typically follows the network data model.
Here, the representation of data is in the form of nodes connected via
links between them.
Unlike the hierarchical database, it allows each record to have
multiple children and parent nodes to form a generalised graph
structure

9. Personal Database
Collecting and storing data on the user's system defines a personal
database. This database is basically designed for a single user.

10. Operational Database

The type of database which creates and updates the database in
real time. It is basically designed for executing and handling the daily
data operations in several businesses. For example, an organisation uses
operational databases for managing per-day transactions.

11. Enterprise Database


Large organisations or enterprises use this database for managing a
massive amount of data. It helps organisations to increase and improve
their efficiency. Such a database allows simultaneous access to users.

Q12. What is Normalization? Explain different types of Normalization
with advantages.

Ans:
Normalization is the process of minimising redundancy from a
relation or set of relations. Redundancy in a relation may cause insertion,
deletion and update anomalies, so normalization helps to minimize the
redundancy in relations. Normal forms are used to eliminate or reduce
redundancy in database tables.
Types of Normalisation :
1. First Normal Form (1NF)
The First normal form simply says that each cell of a table should
contain exactly one value. Let us take an example. Suppose we are
storing the courses that a particular instructor takes; we could store two
courses in a single cell against the instructor's name.
Here, the issue is that in the first row we are storing 2 courses
against Prof. George. This isn't the optimal way, since that's not how
SQL databases are designed to be used. A better method would be to
store the courses separately, one course per row.
This way, if we want to edit some information related to CS101, we do
not have to touch the data corresponding to CS154. Also, observe that
each row stores unique information.
Advantage:
There is no repetition. This is the First Normal Form.



2. Second Normal Form (2NF)


For a table to be in second normal form, the following 2 conditions
are to be met:
1. The table should be in the first normal form.
2. The primary key of the table should be composed of exactly one
column (that is, no attribute should depend on only part of a
composite primary key).

Consider an enrollment table where the first column is the student name
and the second column is the course taken by the student. Clearly, the
student name column isn't unique, as there are 2 entries corresponding
to the name 'Rahul' in row 1 and row 3. Similarly, the course code
column is not unique, as there are 2 entries corresponding to course
code CS101 in row 2 and row 4. However, the tuple (student name,
course code) is unique, since a student cannot enroll in the same course
more than once. So, these 2 columns, when combined, form the primary
key for the database.
As per the second normal form definition, our enrollment table
above isn't in the second normal form. To achieve the same (1NF to 2NF),
we can break it into 2 tables: one table where the second column is a
unique enrollment number for the student, and a second table that
attaches each of these enrollment numbers to a course code. These
2 tables together provide us with the exact same information as our
original table.
Advantage:
There is a primary key of the table composed of exactly one column.



3. Third Normal Form (3NF)


For a table to be in Third normal form, the following 2 conditions
are to be met:
1. The table should be in the second normal form.
2. There should not be any transitive functional dependency (a non-prime
attribute should not depend on another non-prime attribute).

In our example, when we changed the name of the professor, we also had to
change the department column (because of their dependency). This is not
desirable, since someone who is updating the database may remember to
change the name of the professor but may forget to update the
department value. This can cause inconsistency in the database.
Third normal form avoids this by breaking the data into separate tables:
in the course table, the third column is the ID of the professor who is
taking the course, and in a separate professor table we store the details of
the professor against his/her ID. This way, whenever we want to reference
the professor somewhere, we don't have to put the other details of the
professor in that table again. We can simply use the ID.

Advantage:
There is no transitive functional dependency in the database. In this way
inconsistency in the database can be avoided. (A small SQL sketch of this
split follows.)
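A small sqlite3 sketch of the professor/course split described above is shown below; the table and column names are chosen for illustration and are not taken from the document.

```python
# Splitting course data into 3NF-style tables (illustrative schema, sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Professor details live in one table, keyed by an ID ...
cur.execute("""CREATE TABLE professor (
                   prof_id    INTEGER PRIMARY KEY,
                   name       TEXT,
                   department TEXT)""")

# ... and the course table only references that ID, avoiding the
# course -> professor -> department transitive dependency.
cur.execute("""CREATE TABLE course (
                   course_code TEXT PRIMARY KEY,
                   prof_id     INTEGER REFERENCES professor(prof_id))""")

cur.execute("INSERT INTO professor VALUES (1, 'Prof. George', 'Computer Science')")
cur.execute("INSERT INTO course VALUES ('CS101', 1)")

# Changing the professor's name now needs a single update, keeping data consistent.
cur.execute("UPDATE professor SET name = 'Prof. George Smith' WHERE prof_id = 1")
for row in cur.execute("""SELECT c.course_code, p.name, p.department
                          FROM course c JOIN professor p ON c.prof_id = p.prof_id"""):
    print(row)
conn.close()
```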


4. Boyce-Codd Normal Form (BCNF)


Boyce-Codd Normal form is a stronger generalisation of third
normal form. A table is in Boyce-Codd Normal form if and only if at least
one of the following conditions is met for each functional dependency
A → B:
1. A is a superkey.
2. It is a trivial functional dependency.

Advantage:
• A is a superkey: this means that other columns may depend only on a
superkey. Basically, if a set of columns (B) can be determined knowing
some other set of columns (A), then A should be a superkey. A superkey
determines each row uniquely.
• It is a trivial functional dependency: this means that there should be
no non-trivial dependency on anything other than a superkey. For instance,
we saw how the professor's department was dependent on the professor's
name. This may create integrity issues, since someone may edit the
professor's name without changing the department. This may lead to an
inconsistent database.

5. Fourth normal form


A table is said to be in fourth normal form if it does not contain two or
more independent multivalued dependencies describing the relevant entity.

6. Fifth normal form


A table is in fifth Normal Form if:
1. It is in fourth normal form.
2. It cannot be subdivided into any smaller tables without losing
some form of information.



Q13. How are records represented and organized in a file?

Ans:
There are several ways of organizing records in files:

• Heap file organization. Any record can be placed anywhere in the
file where there is space for the record. There is no ordering of
records.

• Sequential file organization. Records are stored in sequential order,
based on the value of the search key of each record.

• Hashing file organization. A hash function is computed on some
attribute of each record. The result of the function specifies in which
block of the file the record should be placed (a toy sketch of this idea
follows the list).

• Clustering file organization. Records of several different relations
can be stored in the same file. Related records of the different
relations are stored on the same block so that one I/O operation
fetches related records from all the relations.
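A toy sketch of the hashing idea mentioned in the list above is given below: a hash of the key attribute decides which block a record belongs to. The block count and record layout are made up for illustration.

```python
# Toy hashed file organisation: the hash of the key picks the block (illustrative).
NUM_BLOCKS = 4
blocks = [[] for _ in range(NUM_BLOCKS)]   # each block holds a list of records

def block_for(key):
    # The hash function computed on the key attribute selects the block.
    return hash(key) % NUM_BLOCKS

def insert(record):
    blocks[block_for(record["id"])].append(record)

def lookup(key):
    # Only one block has to be searched for an equality lookup on the key.
    return [r for r in blocks[block_for(key)] if r["id"] == key]

insert({"id": 101, "name": "Alice"})
insert({"id": 205, "name": "Bob"})
print(lookup(205))
```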

Q14. Discuss the advantages and disadvantages of using the following file
organizations.

Ans:
a) Files of unordered records (heap files) :
Advantages:
• Insertion is simple; records are added at the end of the file.
• Efficient when retrieving a large proportion of the records.
• Effective for bulk loading of data.

Disadvantages:
• Deletion leaves unused blank spaces in the file.
• Sorting may take a long time.
• Retrieval requires a linear search and is inefficient.

b) Files of ordered records (sequential files) :

Advantages:
• Finding a record in a sequential file is very efficient, because all records
are stored in order.
• It is fast and efficient when dealing with large volumes of data that
need to be processed periodically.
Disadvantages:
• Locating, storing, modifying, deleting, or adding records in the file
requires rearranging the file.
• Searching on a key other than the ordering key is not possible, as records
are not arranged to support such access.
• This method is too slow to handle applications requiring immediate
updating or responses.

c) Files of hashed records :

Advantages:
• Direct access to the data.
• A hash function or randomising function is utilised.
• Best if an equality search is needed on the hash key.
Disadvantages:
• Records are not arranged for keys other than the hash key, so other
searches are inefficient.
• Collisions can occur when two or more records hash to the same index.
• Space needs to be reserved for the file before storage, which can waste
space on the disk.

Q15. State different types of indexes.

Ans:
Indexing in a database is defined based on its indexing attributes.
Three main types of indexing methods are:

• Primary Indexing
• Secondary Indexing
• Clustering Index


• Primary Indexing:
A primary index is an ordered file of fixed-length records with two
fields. The first field is the same as the primary key, and the second
field points to the specific data block. In the primary index, there is
always a one-to-one relationship between the entries in the index
table.

The primary indexing in DBMS is further divided into two
types:
1. Dense Index : In a dense index, an index record is created for every
search-key value in the database. This helps you to search faster but
needs more space to store index records.
2. Sparse Index : It is an index record that appears for only some of
the values in the file. A sparse index helps you to resolve the issues of
dense indexing in DBMS. In this indexing technique, a range
of index columns stores the same data block address, and when data
needs to be retrieved, the block address is fetched (a toy sketch of
both ideas follows).
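The toy sketch below contrasts the two ideas: a dense index keeps one entry per search-key value, while a sparse index keeps one entry per block and scans within the block. The records and block size are hypothetical.

```python
# Dense vs sparse index over a sorted list of records (illustrative sketch).
records = [(5, "A"), (12, "B"), (19, "C"), (27, "D"), (33, "E"), (41, "F")]
BLOCK_SIZE = 2                      # records per block (made up)

# Dense index: an entry for every search-key value -> its position.
dense_index = {key: pos for pos, (key, _) in enumerate(records)}

# Sparse index: one entry per block, keyed by the first key in that block.
sparse_index = {records[i][0]: i for i in range(0, len(records), BLOCK_SIZE)}

def sparse_lookup(key):
    # Find the last block whose first key <= key, then scan that block.
    anchors = [k for k in sparse_index if k <= key]
    if not anchors:
        return None
    start = sparse_index[max(anchors)]
    for k, value in records[start:start + BLOCK_SIZE]:
        if k == key:
            return value
    return None

print(records[dense_index[19]][1])  # dense: direct position lookup -> 'C'
print(sparse_lookup(27))            # sparse: locate block, then scan -> 'D'
```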

(Figures: dense index and sparse index.)

• Secondary Indexing:
The secondary index in DBMS can be generated by a field which
has a unique value for each record, and it should be a candidate key. It
is also known as a non-clustering index.
This two-level database indexing technique is used to reduce the
mapping size of the first level. For the first level, a large range of
numbers is selected; because of this, the mapping size always remains
small.

Let's understand secondary indexing with a database index
example.
In a bank account database, data is stored sequentially by acc_no; you
may want to find all accounts of a specific branch of ABC bank.
Here, you can have a secondary index in DBMS for every search
key. An index record points to a bucket that contains pointers to
all the records with that specific search-key value.

(Figure: secondary indexing.)

• Clustering Indexing:
In a clustered index, records themselves are stored in the index and
not pointers. Sometimes the index is created on non-primary key
columns which might not be unique for each record.
In such a situation, you can group two or more columns to get the
unique values and create an index, which is called a clustered index.
This also helps you to identify the record faster.
Let's assume that a company recruited many employees in various
departments. In this case, a clustering index in DBMS should be
created for all employees who belong to the same department.
It is considered as a single cluster, and the index points to the
cluster as a whole. Here, Department_no is a non-unique key.

Q16. What is a file? Explain different operations performed on a file.

Ans: A file is a collection of records. File activity specifies the percentage of
actual records processed in a single run. File volatility addresses the
frequency of record changes; these properties help in designing storage
that is more efficient on disk than on tape.
• Using the primary key, we can access the records. The type and
frequency of access can be determined by the type of file organisation
which was used for a given set of records.
• File organisation is a logical relationship among various records. This
method defines how file records are mapped onto disk blocks.
• File organisation is used to describe the way in which the records are
stored in terms of blocks, and the blocks are placed on the storage
medium.
• The first approach to map the database to files is to use several
files and store only one fixed-length record type in any given file. An
alternative approach is to structure our files so that they can contain
records of multiple lengths. Files of fixed-length records are easier to
implement than files of variable-length records.

Operations on database files can be broadly classified into two categories:

• Update Operations : update operations change the data values by
insertion, deletion, or update.
• Retrieval Operations : retrieval operations, on the other hand, do
not alter the data but retrieve them after optional conditional
filtering.
There are several operations which can be done on files (a minimal
Python sketch follows the list):
1. Open − A file can be opened in one of two modes, read mode or
write mode.
2. Locate − Every file has a file pointer, which tells the current position
where the data is to be read or written. This pointer can be adjusted
accordingly. Using the find (seek) operation, it can be moved forward or
backward.
3. Read − By default, when files are opened in read mode, the file
pointer points to the beginning of the file. There are options where
the user can tell the operating system where to locate the file pointer
at the time of opening a file.


4. Write − The user can open a file in write mode, which enables
them to edit its contents. This can be deletion, insertion, or modification.
The file pointer can be located at the time of opening or can be
dynamically changed if the operating system allows it.

5. Close − This is the most important operation from the operating
system's point of view. When a request to close a file is generated, the
operating system
• removes all the locks (if in shared mode),
• saves the data (if altered) to the secondary storage media, and
• releases all the buffers and file handlers associated with the file.
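A minimal Python sketch of these operations (open, locate via seek, read, write, and close) is given below; the file name is hypothetical.

```python
# Basic file operations: open, locate (seek), read, write, close (illustrative).
path = "records.txt"                 # hypothetical file

# Open in write mode and add some content.
f = open(path, "w")
f.write("record-1\nrecord-2\n")
f.close()                            # close releases buffers and handles

# Open in read mode; the file pointer starts at the beginning.
f = open(path, "r")
print(f.read())                      # read the whole file

# Locate: move the pointer back to the start and read again.
f.seek(0)
print(f.readline().strip())          # -> record-1
f.close()
```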

Q17. What is an index file? What is the relationship between files
and indexes?

Ans:
Indexing is a data structure technique to efficiently retrieve records
from the database files based on some attributes on which the indexing
has been done. Indexing in database systems is similar to what we see in
books.
It is near-universal for databases to provide indexes for tables.
Indexes provide a way, given a value for the indexed field, to find the
record or records with that field value. An index can be on either a key
field or a non-key field.
Relationship between files and indexes:
• The file is a collection of records, while indexes provide a way, given a
value for the indexed field, to find the record or records with that field
value.
• So the file contains the records, and the index is used to provide the
location of records in the file.
• The index table contains a search key and a pointer to the records stored
in the file (a toy sketch of this follows).
• Search key: an attribute or set of attributes that is used to look up the
records in the file.
• Pointer: contains the address of where the record is stored in
memory.
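The toy sketch below illustrates the search-key/pointer relationship: an index maps each key to the byte offset of its record inside the data file. The file name and record format are made up for the example.

```python
# Index file idea: search key -> byte offset of the record in the data file.
data_path = "students.dat"           # hypothetical data file

index = {}                           # search key -> pointer (byte offset)
with open(data_path, "w") as f:
    for key, name in [(101, "Rahul"), (102, "Priya"), (103, "George")]:
        index[key] = f.tell()        # remember where this record starts
        f.write(f"{key},{name}\n")

# To fetch a record, look up its offset in the index and seek straight to it.
with open(data_path, "r") as f:
    f.seek(index[102])
    print(f.readline().strip())      # -> 102,Priya
```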

Q19. What is the purpose of normalization in DBMS?

Ans:
Purpose of normalization in DBMS:
• It is used to remove duplicate data and database anomalies from
the relational table.
• Normalization helps to reduce redundancy and complexity by
examining the new data types used in the table.
• It is helpful to divide the large database table into smaller tables and
link them using relationships.
• It avoids duplicate data and repeating groups in a table.
• It reduces the chances for anomalies to occur in a database.
• It helps to reduce anomalies like data redundancy, insert anomalies,
update anomalies and delete anomalies.

Q20. Why is the use of DBMS recommended? Explain by listing some
of its major advantages.

Ans:
• Controlled Redundancy: DBMS supports a mechanism to control
redundancy of data inside the database by integrating all the data into
a single database and as data is stored at only one place, the duplicity
of data does not happen
• Data Sharing: Sharing of data among multiple users simultaneously
can also be done in DBMS as the same database will be shared among
all the users and by different application programs
• Backup and Recovery Facility: DBMS minimizes the pain of creating
the backup of data again and again by providing a feature of ‘backup
and recovery’ which automatically creates the data backup and
restores the data whenever required
• Enforcement of Integrity Constraints: Integrity constraints are very
important to be enforced on the data, so that the refined data after
applying the constraints is stored in the database, and this is
handled by the DBMS.
• Independence of Data: It simply means that you can change the
structure of the data without affecting the structure of any of the
application programs.
