20PMHS012_RH
ASSIGNMENT
Q1. What is the process of Data Analysis?
Ans:
Data analysis is defined as a process of cleaning, transforming, and
modelling data to discover useful information for business decision-
making. The purpose of data analysis is to extract useful information
from data and to take decisions based upon that analysis.
• Data Requirement Gathering
• Data Collection
• Data Processing
• Data Cleaning
• Data Analysis
• Data Interpretation
• Data Visualization
2. Data Collection :
• After requirement gathering, we get a clear idea about what
things we have to measure and what our findings should be.
• Now it's time to collect our data based on the requirements. Once we
collect our data, we must remember that the collected data must be
processed or organised for analysis.
• As we collect data from various sources, we must keep a
log with the collection date and the source of the data.
3. Data Processing :
• The data that is collected must be processed or organised for
analysis.
• This includes structuring the data as required for the relevant
Analysis Tools.
• For example, the data might have to be placed into rows and
columns in a table within a Spreadsheet or Statistical Application.
• A Data Model might have to be created.
4. Data Cleaning :
• Whatever data is collected may not be useful, or may be irrelevant to
our aim of analysis, hence it should be cleaned.
• The data which is collected may contain duplicate records, white
spaces, or errors. The data should be cleaned and made error-free.
• This phase must be done before analysis because, based on data
cleaning, our output of analysis will be closer to our expected
outcome.
5. Data Analysis :
• Once the data is collected, cleaned, and processed, it is ready for
Analysis.
• As we manipulate the data, we may find we have the exact information
we need, or we might need to collect more data.
• During this phase, we can use data analysis tools and software
which will help us to understand, interpret, and derive
conclusions based on the requirements.
6. Data Interpretation :
• After analysing our data, it's finally time to interpret our results.
• We can choose the way to express or communicate our data
analysis: simply in words, or perhaps a table or
chart.
• Then we use the results of our data analysis process to decide our best
course of action for our intended project.
7. Data Visualization :
• Data visualizations are very common in day-to-day life; they
often appear in the form of charts and graphs.
• In other words, data is shown graphically so that it will be easier for
the human brain to understand and process it.
• Data visualization is often used to discover unknown facts and
trends. By observing relationships and comparing datasets, we can
find meaningful information.
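The whole pipeline can be sketched in a few lines of Python; the dataset, field names, and summary statistic below are invented purely for illustration:

```python
# A minimal sketch of the phases above applied to a toy dataset.
from statistics import mean

# Collection: raw records gathered from some source.
raw = [
    {"city": "Pune ", "sales": "120"},
    {"city": "Pune ", "sales": "120"},   # duplicate record
    {"city": "Delhi", "sales": "95"},
]

# Processing: structure the records into (city, sales) rows.
rows = [(r["city"].strip(), int(r["sales"])) for r in raw]

# Cleaning: drop duplicate records (white space already trimmed above).
rows = list(dict.fromkeys(rows))

# Analysis: derive a summary statistic from the cleaned rows.
avg_sales = mean(sales for _, sales in rows)

# Interpretation: express the result in words.
print(f"Average sales across {len(rows)} cities: {avg_sales}")
```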
[Figure: data analysis process cycle, with labels Data Requirement, Data Visualization, Data Cleaning]
Q3. What is data cleansing and what are the best ways to practice data
cleansing?
Ans:
Data cleansing or data cleaning is the process of identifying and
removing (or correcting) inaccurate records from a dataset, table, or
database, and refers to recognising unfinished, unreliable, inaccurate, or
non-relevant parts of the data and then restoring, remodelling, or
removing the dirty or crude data.
Data cleaning techniques may be performed as batch processing
through scripting or interactively with data cleansing tools.
After cleaning, a dataset should be uniform with other related
datasets in the operation. The discrepancies identified or eliminated may
have been caused by user entry mistakes, by corruption in
storage or transmission, or by differing data dictionary descriptions of
similar items in various stores.
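As a hedged sketch of batch cleansing through scripting, the Python function below applies three illustrative rules (trim white space, drop unfinished records, drop exact duplicates); it is not a complete cleansing toolkit:

```python
# Toy cleansing pass over (name, email) tuples of strings.
def cleanse(records):
    seen, cleaned = set(), []
    for rec in records:
        rec = tuple(field.strip() for field in rec)   # remove white space
        if any(field == "" for field in rec):         # unfinished record
            continue
        if rec in seen:                               # duplicate record
            continue
        seen.add(rec)
        cleaned.append(rec)
    return cleaned

dirty = [("Alice ", "alice@x.com"), ("Alice", "alice@x.com"), ("Bob", "")]
print(cleanse(dirty))   # [('Alice', 'alice@x.com')]
```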
5. Append data :
Append is a process that helps organisations to define and
complete missing information. Reliable third-party sources are often one
of the best options for managing this practice.
Ans:
The knowledge embedded in a model is a frozen snapshot of a real-
world process imperfectly captured in data. The required change may be
complex, but the reasoning is simple.
As the real world and the engineering around that snapshot
change, our model needs to keep up with reality in order to meet the
performance metrics achieved in development.
Some retraining schedules present themselves naturally during
model development, for example when the model depends on a data source
that is updated periodically. Many changes in data, engineering, or
business can be difficult or impossible to predict and communicate.
Changes anywhere in the model dependency pipeline can degrade
model performance; it's an inconvenience, but not unsolvable.
THIRD 20PMHS012 20/09/2021 9
Q5. Can you mention a few problems that data analysts usually
encounter while performing analysis?
Ans:
A few problems that data analysts usually encounter while performing
analysis are:
5. Inaccessible data:
Moving data into one centralised system has little impact if it is not
easily accessible to the people that need it. Decision-makers and risk
managers need access to all of an organisation’s data for insights on what
is happening at any given moment, even if they are working off-site.
Accessing information should be the easiest part of data analytics.
An effective database will eliminate any accessibility issues.
Authorised employees will be able to securely view or edit data from
anywhere, illustrating organisational changes and enabling high-speed
decision making.
8. Lack of support:
Data analytics can’t be effective without organisational support,
both from the top and lower-level employees. Risk managers will be
powerless in many pursuits if executives don’t give them the ability to
act. Other employees play a key role as well: if they do not submit data
for analysis or their systems are inaccessible to the risk manager, it will
be hard to create any actionable information.
Emphasise the value of risk management and analysis to all
aspects of the organisation to get past this challenge. Once other
members of the team understand the benefits, they're more likely to
cooperate. Implementing change can be difficult, but using a centralised
data analysis system allows risk managers to easily communicate
results and effectively achieve buy-in from multiple stakeholders.
9. Confusion or anxiety:
Users may feel confused or anxious about switching from
traditional data analysis methods, even if they understand the benefits
of automation. Nobody likes change, especially when they are
comfortable and familiar with the way things are done.
To overcome this HR problem, it's important to illustrate how
changes in analytics will actually streamline the role and make it more
meaningful and fulfilling. With comprehensive data analytics, employees
can eliminate redundant tasks like data collection and report building
and spend time acting on insights instead.
10. Budget:
Another challenge risk managers regularly face is budget. Risk is
often a small department, so it can be difficult to get approval for
significant purchases such as an analytics system.
Risk managers can secure budget for data analytics by measuring
the return on investment of a system and making a strong business
case for the benefits it will achieve.
Ans:
The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows
for the distributed processing of large data sets across clusters of
computers using simple programming models.
It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Rather than rely
on hardware to deliver high-availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly-
available service on top of a cluster of computers, each of which may be
prone to failures.
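Hadoop itself is a Java framework, so the following is only a toy Python illustration of the MapReduce-style "simple programming model" it popularised, run here on a single machine:

```python
# Word count, the classic MapReduce example, on two tiny "documents".
from collections import Counter
from itertools import chain

docs = ["big data big clusters", "data nodes"]

# Map: each document independently emits (word, 1) pairs; in a real
# cluster these map calls run in parallel on many machines.
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

# Reduce: sum the counts for each word.
counts = Counter()
for word, n in mapped:
    counts[word] += n

print(counts["data"])   # 2
```

In an actual Hadoop cluster the framework also shuffles the intermediate pairs to the reducers and restarts failed tasks, which is the fault tolerance the paragraph above describes.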
Q7. How can you highlight cells with negative values in Excel?
Ans:
Steps to highlight cells with negative values in Excel
1. Select the cells in which you want to highlight the negative numbers
in red.
2. Go to Home → Conditional Formatting → Highlight Cell Rules →
Less Than
3. In the Less Than dialog box, specify the value as “0”, below which the
formatting should be applied.
4. Click OK.
[Screenshots: data set, Step 3 dialog, final result]
Q8. How can you clear all the formatting without actually removing
the cell contents?
Ans:
Steps to clear all the formatting without actually removing the cell
contents in Excel
1. Highlight the portion of the spreadsheet from which you want to
remove formatting.
2. Click the Home tab.
3. Select Clear from the Editing portion of the Home tab.
4. From the drop down menu of the Clear button, select Clear Formats.
[Screenshots: Steps 2 and 3, final result]
Q9. What is a Print Area and how can you set it in Excel ?
Ans:
A print area is one or more ranges of cells that you designate to
print when you don't want to print the entire worksheet. When you print
a worksheet after defining a print area, only the print area is printed.
Following are the steps to set a print area in Excel:
1. On the worksheet, select the cells that you want to define as the print
area.
2. On the Page Layout tab, in the Page Setup group, click Print Area,
and then click Set Print Area.
Then our print area gets selected. Each time we go to print, only those
specified cells will be printed.
Ans:
By default, SQL Server will attempt to use TCP port 1433.
If that port is unavailable, it will automatically choose another port.
If this is the case, that port will need to be opened through the firewall
instead.
By default, the typical ports used by SQL Server and associated
database engine services are: TCP 1433, 4022, 135, 1434 and UDP 1434.
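A quick way to see whether such a port is reachable is a plain TCP connection attempt; this Python sketch uses an illustrative host and timeout:

```python
# Probe a TCP port; returns True only if a connection can be opened.
import socket

def port_open(host, port, timeout=1.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("127.0.0.1", 1433))
```

A False result can mean the port is closed, filtered by the firewall, or the host is unreachable, so it is a diagnostic hint rather than proof of the server's configuration.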
Q11. What do you mean by DBMS? What are its different types?
Ans:
A database management system (DBMS) is a collection of programs
that enables you to store, modify, and extract information from a
database.
There are many different types of DBMSs, ranging from small
systems that run on personal computers to huge systems that run on
mainframes. The DBMS acts as an interface between the application
program and the data in the database.
The following are examples of database applications:
• Computerised library systems
• Automated teller machines
• Flight reservation systems
• Computerised parts inventory system
1. Centralised Database:
It is the type of database that stores data in a centralised database
system. It allows users to access the stored data from different
locations through several applications. These applications contain an
authentication process to let users access data securely. An example of a
centralised database is a central library that carries a central
database of each library in a college/university.
2. Distributed Database:
Unlike a centralised database system, in distributed systems data
is distributed among different database systems of an organisation.
These database systems are connected via communication links.
Such links help end-users access the data easily. Examples of
distributed databases are Apache Cassandra, HBase, Ignite, etc.
We can further divide a distributed database system into:
• Homogeneous DDB: database systems which execute on the
same operating system, use the same application process, and
carry the same hardware devices.
• Heterogeneous DDB: database systems which execute on
different operating systems, under different application procedures,
and carry different hardware devices.
3. Relational Database:
This database is based on the relational data model, which stores
data in the form of rows (tuples) and columns (attributes) that together
form a table (relation). A relational database uses SQL for storing,
manipulating, as well as maintaining the data. E.F. Codd proposed the
relational model in 1970.
Each table in the database carries a key that makes the data unique
from others. Examples of relational databases are MySQL, Microsoft
SQL Server, Oracle, etc.
4. NoSQL Database:
Non-SQL/Not Only SQL is a type of database that is used for
storing a wide range of data sets. It is not a relational database, as it stores
data not only in tabular form but in several different ways.
It came into existence when the demand for building modern
applications increased. Thus, NoSQL presented a wide variety of
database technologies in response to those demands. We can further divide
a NoSQL database into the following four types:
• Key-value storage: It is the simplest type of database storage where
it stores every single item as a key (or attribute name) holding its
value, together.
• Document-oriented Database: A type of database used to store
data as JSON-like documents. It helps developers store data by
using the same document-model format as used in the application
code.
• Graph Databases: It is used for storing vast amounts of data in a
graph-like structure. Most commonly, social networking websites
use the graph database.
• Wide-column stores: It is similar to the data represented in
relational databases. Here, data is stored in large columns together,
instead of storing in rows.
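Two of these four models can be caricatured with Python's standard library alone; real systems (for example Redis for key-value stores, MongoDB for document stores) add persistence, distribution, and query languages on top of the same ideas:

```python
# Toy key-value and document stores; key names are invented.
import json

# Key-value storage: every single item is a key holding its value.
kv_store = {}
kv_store["user:42:name"] = "Rahul"

# Document-oriented: the value is a whole JSON-like document.
doc_store = {}
doc_store["user:42"] = json.dumps({"name": "Rahul", "courses": ["CS101"]})

print(kv_store["user:42:name"])                      # Rahul
print(json.loads(doc_store["user:42"])["courses"])   # ['CS101']
```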
5. Cloud Database:
A type of database where data is stored in a virtual environment
and executes over the cloud computing platform. It provides users with
various cloud computing services (SaaS, PaaS, IaaS, etc.) for accessing the
database. There are numerous cloud platforms, but the best options are:
• Amazon Web Services(AWS)
• Microsoft Azure
• Kamatera
• PhoenixNAP
• ScienceSoft
• Google Cloud SQL, etc.
[Figure: Cloud Database]
6. Object-oriented Databases
The type of database that uses the object-based data model
approach for storing data in the database system. The data is represented
and stored as objects which are similar to the objects used in the object-
oriented programming language.
7. Hierarchical Databases
It is the type of database that stores data in the form of parent-
child relationship nodes. Here, it organises data in a tree-like
structure.
Data gets stored in the form of records that are connected via links.
Each child record in the tree will contain only one parent. On the other
hand, each parent record can have multiple child records.
8. Network Databases:
It is the database that typically follows the network data model.
Here, the representation of data is in the form of nodes connected via
links between them.
Unlike the hierarchical database, it allows each record to have
multiple child and parent nodes, forming a generalised graph
structure.
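The structural difference between the two models can be shown with a toy mapping (record names invented): a hierarchical child has exactly one parent, while a network record may have several:

```python
# Hierarchical model: each child record points to a single parent (tree).
hierarchical = {
    "Savings": "Accounts",
    "Current": "Accounts",
    "Accounts": "Bank",
}

# Network model: a record may point to multiple parents (graph).
network = {
    "JointAccount": ["CustomerA", "CustomerB"],
}

print(hierarchical["Savings"])   # Accounts
print(network["JointAccount"])   # ['CustomerA', 'CustomerB']
```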
9. Personal Database
Collecting and storing data on the user's system defines a Personal
Database. This database is basically designed for a single user.
Operational Database
Enterprise Database
Ans:
Normalization is the process of minimising redundancy in a
relation or set of relations. Redundancy in a relation may cause insertion,
deletion, and update anomalies. So, it helps to minimise the
redundancy in relations. Normal forms are used to eliminate or reduce
redundancy in database tables.
Types of Normalisation :
1. First Normal Form (1NF)
The first normal form simply says that each cell of a table should
contain exactly one value. Let us take an example. Suppose we are
storing the courses that a particular instructor takes; we can store it like
this:
Here, the issue is that in the first row we are storing 2 courses
against Prof. George. This isn't the optimal way, since that's not how
SQL databases are designed to be used. A better method would be to
store the courses separately. For instance:
Here, the first column is the student name and the second column
is the course taken by the student. Clearly, the student name column isn't
unique, as we can see that there are 2 entries corresponding to the name
‘Rahul’ in row 1 and row 3. Similarly, the course code column is not
unique, as we can see that there are 2 entries corresponding to course
code CS101 in row 2 and row 4. However, the tuple (student name,
course code) is unique, since a student cannot enroll in the same course
more than once. So, these 2 columns, when combined, form the primary
key for the table.
As per the second normal form definition, our enrollment table
above isn't in the second normal form. To achieve it (1NF to 2NF),
we can break it into 2 tables:
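That 1NF-to-2NF split can be sketched with sqlite3 from Python's standard library; the table and column names below are invented to match the running example:

```python
# Split the single enrollment table into two: enrollments and courses.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE enrollment (student TEXT, course_code TEXT)")
cur.execute("CREATE TABLE course (course_code TEXT PRIMARY KEY, professor TEXT)")

cur.executemany("INSERT INTO enrollment VALUES (?, ?)",
                [("Rahul", "CS101"), ("Rajat", "CS101"), ("Rahul", "CS102")])
cur.executemany("INSERT INTO course VALUES (?, ?)",
                [("CS101", "Prof. George"), ("CS102", "Prof. Atkins")])

# The professor is now stored once per course, not once per enrollment.
cur.execute("""SELECT e.student, c.professor
               FROM enrollment e JOIN course c ON e.course_code = c.course_code
               WHERE e.course_code = 'CS101' ORDER BY e.student""")
print(cur.fetchall())   # [('Rahul', 'Prof. George'), ('Rajat', 'Prof. George')]
```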
Here, the third column is the ID of the professor who's taking the
course.
Advantage:
There is no transitive functional dependency in the database; in this
way, inconsistency in the database can be avoided.
For Boyce-Codd Normal Form (BCNF), for every functional dependency
A → B, at least one of the following should hold:
• A is a superkey: this means that only on a superkey column
should it be the case that there is a dependency of other columns.
Basically, if a set of columns (B) can be determined knowing some
other set of columns (A), then A should be a superkey. A superkey
basically determines each row uniquely.
• It is a trivial functional dependency: this means that, apart from the
case above, there should be no non-trivial dependency. For instance,
we saw how the professor's department was dependent on the professor's
name. This may create integrity issues, since someone may edit the
professor's name without changing the department. This may lead to an
inconsistent database.
Ans:
There are several ways of organizing records in files:
a) Files of unordered records (Heap Files) :
Advantages:
• Insertion is simple; records are added at the end of the file.
• Easier for retrievals of a large proportion of the records.
• Effective for bulk loading data.
Disadvantages:
• Deletion leaves blank, untapped spaces in the file.
• Sorting may take a long time.
• Retrieval requires a linear search and is inefficient.
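Heap-file behaviour can be mimicked with a plain Python list: appending is cheap, but every retrieval is a linear scan over all records:

```python
# A heap file sketched as a list of (key, value) records.
heap_file = []

def insert(record):
    heap_file.append(record)        # records added at end of file

def find(key):
    for record in heap_file:        # linear search, O(n) per lookup
        if record[0] == key:
            return record
    return None

for r in [(1, "a"), (2, "b"), (3, "c")]:
    insert(r)
print(find(2))   # (2, 'b')
```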
Ans:
Indexing in a database is defined based on its indexing attributes.
Three main types of indexing methods are:
• Primary Indexing
• Secondary Indexing
• Clustering Indexing
• Primary Indexing:
A primary index is an ordered file of fixed-length records with two
fields. The first field is the same as the primary key, and the second
field points to the specific data block. In the primary index, there is
always a one-to-one relationship between the entries in the index
table.
• Secondary Indexing:
The secondary index in a DBMS can be generated by a field which
has a unique value for each record, and it should be a candidate key. It
is also known as a non-clustering index.
This two-level database indexing technique is used to reduce the
mapping size of the first level. For the first level, a large range of
numbers is selected; because of this, the mapping size always remains
small.
[Figure: Secondary Indexing]
• Clustering Indexing:
In a clustered index, the records themselves are stored in the index,
not pointers. Sometimes the index is created on non-primary-key
columns, which might not be unique for each record.
In such a situation, you can group two or more columns to get
unique values and create an index, which is called a clustered index.
This also helps you to identify records faster.
Let's assume that a company recruited many employees in various
departments. In this case, a clustering index in the DBMS should be
created for all employees who belong to the same department.
It is considered a single cluster, and the index pointers point to the
cluster as a whole. Here, Department_no is a non-unique key.
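A clustering index over a non-unique department key can be sketched as a mapping from each key value to the positions of all records in its cluster (data values invented):

```python
# Records as (emp_id, dept_no); dept_no is the non-unique cluster key.
from collections import defaultdict

employees = [("E1", 10), ("E2", 20), ("E3", 10)]

cluster_index = defaultdict(list)
for pos, (emp, dept) in enumerate(employees):
    cluster_index[dept].append(pos)    # index entry -> record positions

print(cluster_index[10])   # [0, 2]
```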
Q17. What is an index file? What is the relationship between files
and indexes?
Ans:
Indexing is a data structure technique to efficiently retrieve records
from the database files based on some attributes on which the indexing
has been done. Indexing in database systems is similar to what we see in
books.
It is near-universal for databases to provide indexes for tables.
Indexes provide a way, given a value for the indexed field, to find the
record or records with that field value. An index can be on either a key
field or a non-key field.
Relationship between files and indexes:
• A file is a collection of records, while indexes provide a way, given a
value for the indexed field, to find the record or records with that field
value.
• So the file contains the records, and the index is used to provide the
location of records in the file.
• The index table contains a search key and a pointer to the records stored
in the file.
• Search key: an attribute or set of attributes that is used to look up the
records in a file.
• Pointer: contains the address of where the record is stored in
memory.
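The search-key/pointer idea can be sketched in a few lines of Python, with a list offset standing in for a disk address:

```python
# Data file as a list of (course_code, title) records.
data_file = [("CS101", "Intro"), ("CS205", "Databases"), ("CS310", "ML")]

# Index table: search key -> pointer (here, the record's position).
index = {key: pos for pos, (key, _) in enumerate(data_file)}

def lookup(key):
    # Follow the pointer instead of scanning the whole file.
    return data_file[index[key]] if key in index else None

print(lookup("CS205"))   # ('CS205', 'Databases')
```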
Ans:
Purpose of normalization in a DBMS:
• It is used to remove duplicate data and database anomalies from
the relational table.
• Normalization helps to reduce redundancy and complexity by
examining new data types used in the table.
• It is helpful to divide the large database table into smaller tables and
link them using relationships.
• It avoids duplicate data and repeating groups in a table.
• It reduces the chances for anomalies to occur in a database.
• It helps to reduce anomalies like data redundancy, insert anomalies,
update anomalies, and delete anomalies.