Data Analytics Unit 1
UNIT-1
Data Architecture Design and Data Management
In the past, data was small and manageable, easily stored on a single
computer. Today, data volumes have exploded, with around 2.5 quintillion
bytes generated daily, far exceeding the roughly 19 exabytes of earlier eras.
Most of this data is generated by social media sites like Facebook,
Instagram, and Twitter, as well as by e-commerce, hospitals, schools, and
banks. Traditional storage methods cannot manage such large and messy data,
so Big Data was created to handle it.
Big Data involves collecting large data sets from sources like social media,
GPS, and sensors, and then analyzing them to find useful patterns using
tools like SAS, Microsoft Excel, R, Python, Tableau, RapidMiner, and
KNIME. Before analysis, a data architect needs to design the data structure.
Data architecture design is crucial for planning how data systems interact.
For instance, if a data architect needs to integrate data from two systems,
data architecture provides a clear model of how these systems will connect
and work together.
Physical model – The physical model holds the database design details, such
as which type of database technology will be suitable for the architecture.
Data collection involves gathering and storing large amounts of data, which
can be in various forms like text, video, audio, or images. It's the first step in
big data analysis, where data is collected from valid sources before analyzing
patterns or information.
Raw data, initially not useful, becomes valuable through cleaning and
analysis, turning into actionable knowledge for various fields. Data
collection aims to gather rich information, starting with defining the data
type and source. Data is categorized as qualitative (non-numerical, focusing
on behavior) or quantitative (numerical, analyzed with scientific tools).
The data is further divided into two main types:
Primary data
Secondary data
1. Primary data: Data that is raw, original, and extracted directly
from official sources is known as primary data. This type of data is
collected directly through techniques such as questionnaires,
interviews, and surveys. The data collected must match the demands and
requirements of the target audience on which the analysis is performed;
otherwise it becomes a burden in data processing. A few methods of
collecting primary data:
LSD – Latin Square Design is an experimental design that is similar
to CRD and RBD but is arranged in rows and columns. It is an NxN
arrangement with an equal number of rows and columns, in which each
letter occurs exactly once in every row and every column. Hence
differences can be found with fewer errors in the experiment. A Sudoku
puzzle is an example of a Latin square design.
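As an illustration (not from the source), a minimal Python sketch can build an NxN Latin square by cyclically shifting a list of treatment labels, so that each label occurs exactly once in every row and every column:

import string

def latin_square(n):
    # Hypothetical treatment labels A, B, C, ... used purely for illustration.
    labels = list(string.ascii_uppercase[:n])
    # Row i is the label list cyclically shifted by i positions, so every
    # label appears exactly once per row and once per column.
    return [labels[i:] + labels[:i] for i in range(n)]

for row in latin_square(4):
    print(" ".join(row))
# A B C D
# B C D A
# C D A B
# D A B C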
2. Secondary data: Secondary data is data that has already been
collected and is reused for some valid purpose. This type of data is
derived from previously recorded primary data, and it comes from two
kinds of sources: internal and external.
Internal source: This type of data can easily be found within the
organization, such as market records, sales records, transactions, customer
data, accounting resources, etc. The cost and time needed to obtain internal
data are low.
External sources:
Sensor data: With the advancement of IoT devices, the sensors in these
devices collect data that can be used for sensor-data analytics to track
the performance and usage of products.
Web traffic: Thanks to fast and cheap internet access, data in many formats
uploaded by users on different platforms can be collected, with their
permission, for data analysis. Search engines also provide data on the
keywords and queries searched most often.
Data Management
Data management is a system for collecting and analyzing raw data to help
people and organizations use it effectively while following policies and
regulations.
The first step in data management is collecting data from various sources in
its raw form, whether structured or unstructured. This data must be stored
securely and organized, with the right storage technology chosen based on
the data volume. Next, the data is processed by cleaning, aggregating, and
enhancing it to make it meaningful. Ensuring data accuracy and reliability
involves using validation rules and error-checking processes.
To keep data secure and private, measures such as encryption and access
control are implemented to prevent unauthorized access and data loss. Data
should also be analyzed using techniques like data mining, machine
learning, and visualization. Different data management lifecycles help
organizations meet business and regulatory requirements, manage
metadata, and provide detailed information about data, the mining process,
and data usage to ensure effective management.
Data Management Responsibilities and Roles in IT industry
Data Manager: Data managers are responsible for overseeing the whole data
management strategy. They define data-handling policies and standards,
ensuring data quality, accuracy, and compliance with regulations.
Chief Data Officer: CDOs hold a strategic role in the IT field and
oversee data-related activities, defining data management strategies that
support business goals and objectives.
Challenges in data management include:
Security and Privacy: Unauthorized access to sensitive data, for example
through hacking, can cause data breaches that expose confidential
information and may lead to financial losses for an organization.
Data Quality: Poor data quality and duplicate data, often stemming from
errors during data collection, lead to incorrect decision-making. Duplicate
data also occupies valuable storage and creates confusion during the
analysis process.
Data quality
Accuracy means the data must be correct and free of errors. Completeness
refers to having all necessary values recorded. Consistency ensures that
data is uniform and free from discrepancies.
Data often faces challenges such as inaccuracy, which can arise from faulty
collection tools, human errors, or inconsistencies in naming conventions.
For instance, if sales records have incorrect pricing information, decisions
based on this data could be flawed. Incompleteness occurs when some data
attributes are missing or not recorded, such as missing sale details.
Believability is about how much users trust the data. Even if data is
accurate now, past errors might lead users to distrust it. For example, if a
database had numerous errors previously, users might still doubt its
reliability.
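As a rough sketch of how such quality checks might be automated (pandas code on a made-up sales table; the column names are assumptions, not from the source), completeness, duplication, and inconsistent naming can each be measured directly:

import pandas as pd

# Hypothetical sales records with typical quality problems.
sales = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "price":    [19.99, 5.00, 5.00, None, -3.50],   # missing and negative prices
    "region":   ["North", "South", "South", "South", "east"],
})

completeness = 1 - sales["price"].isna().mean()   # share of recorded prices
duplicates = sales.duplicated().sum()             # fully repeated rows
bad_prices = (sales["price"] < 0).sum()           # accuracy check: negative prices
regions = sales["region"].str.title().unique()    # consistency: normalize naming

print(f"price completeness: {completeness:.0%}, duplicates: {duplicates}, "
      f"negative prices: {bad_prices}")
print("normalized regions:", regions)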
Data preprocessing
Data integration combines data from multiple sources, which may involve
dealing with inconsistencies like different names for the same attribute
(e.g., "customer id" vs. "cust id"). The resolution of semantic
heterogeneity, metadata, correlation analysis, tuple duplication detection,
and data conflict detection contribute to smooth data integration.
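As a brief, hypothetical illustration of correlation analysis for redundancy detection during integration (the column names below are invented for this sketch), strongly correlated attribute pairs can be flagged as candidates for removal:

import pandas as pd

# Hypothetical merged table in which two columns carry almost the same information.
df = pd.DataFrame({
    "price_usd": [10.0, 20.0, 30.0, 40.0, 50.0],
    "price_eur": [9.2, 18.5, 27.4, 36.9, 46.1],
    "quantity":  [5, 3, 8, 2, 7],
})

# Pearson correlations between numeric attributes; values near +/-1.0 suggest
# that one attribute is redundant given the other.
corr = df.corr()
redundant_pairs = corr.abs().gt(0.95) & corr.abs().lt(1.0)

print(corr.round(2))
print(redundant_pairs)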
Data transformation routines convert the data into appropriate forms for
mining. For example, in normalization, attribute data are scaled so as to
fall within a small range such as 0.0 to 1.0. Data discretization transforms
numeric data by mapping values to interval or concept labels. Such methods
can be used to automatically generate concept hierarchies for the data,
which allows for mining at multiple levels of granularity. Discretization
techniques include binning, histogram analysis, cluster analysis, decision
tree analysis, and correlation analysis. For nominal data, concept
hierarchies may be generated based on schema definitions as well as the
number of distinct values per attribute.
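For instance, a minimal sketch (pandas on made-up values; not code from the source) of min-max normalization into the range 0.0 to 1.0 and of discretization into interval labels by equal-width binning might look like this:

import pandas as pd

values = pd.Series([10, 25, 40, 55, 100], name="amount")   # hypothetical attribute values

# Min-max normalization: rescale the attribute into the range 0.0 to 1.0.
normalized = (values - values.min()) / (values.max() - values.min())

# Discretization by equal-width binning: map each value to an interval label.
bins = pd.cut(values, bins=3, labels=["low", "medium", "high"])

print(pd.DataFrame({"amount": values, "normalized": normalized, "bin": bins}))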
1. Data Cleaning: The data can have many irrelevant and missing parts.
To handle this, data cleaning is performed. It involves handling
missing data, noisy data, etc.
Missing Data: This situation arises when some values are missing from the
dataset. It can be handled in various ways, a few of which are sketched in
the example after this list.
Ignore the Tuple: This method involves discarding the entire
record (tuple) if it contains any missing values. This approach is
generally ineffective, particularly if the tuple has other valuable
attributes or if the missing values are unevenly distributed
across attributes.
Fill in the Missing Value Manually: This is a labor-intensive
method where missing values are filled in by hand. It can be
impractical for large datasets with numerous missing values.
Use a Global Constant: All missing values are replaced with a
constant value, such as "Unknown" or a specific number (e.g.,
−∞). While simple, this approach can lead to misleading
interpretations since the constant might be seen as a valid,
meaningful value by the mining algorithm.
Use a Measure of Central Tendency: Missing values are
replaced with the mean or median of the attribute. The mean is
used for symmetric distributions, while the median is better for
skewed distributions.
Use the Mean or Median of the Same Class: This approach is
similar to the previous one but is done within the context of a
specific class. For example, in a classification problem, missing
values might be filled in with the average value for all samples
in the same class.
Use the Most Probable Value: The missing value is estimated
using more sophisticated methods like regression, Bayesian
inference, or decision tree induction, which predict the value
based on other attributes in the dataset.
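The following minimal pandas sketch (a made-up table, not from the source) illustrates a few of the options above: ignoring tuples, filling with a global constant, filling with the attribute mean, and filling with the mean of the same class:

import pandas as pd

# Hypothetical records with missing income values.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 40.0],
})

dropped = df.dropna()                                  # ignore the tuple
constant = df["income"].fillna(-1)                     # use a global constant
mean_fill = df["income"].fillna(df["income"].mean())   # attribute mean
class_fill = df.groupby("class")["income"].transform(  # mean of the same class
    lambda s: s.fillna(s.mean()))

print(pd.DataFrame({"original": df["income"],
                    "attribute_mean": mean_fill,
                    "class_mean": class_fill}))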
2. Data integration
Data integration combines data from various sources into one unified
dataset, reducing redundancies and inconsistencies to improve data
mining accuracy and efficiency. For example, in merging customer data
from different systems, "customer id" in one database may need to be
matched with "cust number" in another. This ensures that the same
customer is not misidentified or duplicated.
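A minimal sketch of this kind of schema matching and merging (hypothetical tables and column names) might rename the mismatched key, join the two sources, and drop duplicated tuples:

import pandas as pd

# Two hypothetical customer tables that name the same key differently.
crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cai"]})
orders = pd.DataFrame({"cust_number": [2, 3, 3, 4], "amount": [20, 35, 15, 50]})

# Resolve the naming mismatch, then integrate the sources on the shared key.
orders = orders.rename(columns={"cust_number": "customer_id"})
merged = crm.merge(orders, on="customer_id", how="outer")

# Remove any duplicate tuples introduced by the integration.
merged = merged.drop_duplicates()
print(merged)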
3. Data reduction: Data reduction obtains a reduced representation of the
data set that is much smaller in volume yet produces nearly the same
analytical results. For nominal data, hierarchies among attributes are
often implicit within the database schema and can be automatically defined
at the schema definition level.
It's important to ensure that the time spent on data reduction does
not outweigh the time saved by working with a smaller dataset. The
goal is to make data analysis more efficient without compromising the
quality of the results.
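As one loose illustration of data reduction (numerosity reduction by simple random sampling on synthetic data; the source does not prescribe this particular method), a small sample can stand in for the full table while yielding nearly the same summary statistics:

import numpy as np
import pandas as pd

# Hypothetical large table; a reduced representation keeps analysis cheap.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "age": rng.integers(18, 80, size=100_000),
    "amount": rng.normal(100, 20, size=100_000),
})

# Numerosity reduction: keep a 1% simple random sample of the rows.
sample = big.sample(frac=0.01, random_state=0)

# The reduced data should give nearly the same analytical results.
print("full mean:", round(big["amount"].mean(), 2))
print("sample mean:", round(sample["amount"].mean(), 2))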
Questions
2. In real-world data, tuples with missing values for some attributes are
a common occurrence. Describe various methods for handling this
problem.
3. Suppose that the data for analysis includes the attribute age, and that
the age values for the data tuples are (in increasing order): 13, 15, 16,
16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70. Answer the following. (a) Use smoothing by bin means to
smooth the above data, using a bin depth of 3. Illustrate your steps.
Comment on the effect of this technique for the given data. (b) How might
you determine outliers in the data? (c) What other methods are there for
data smoothing?
5. What are the value ranges of the following normalization methods? (a)
min-max normalization (b) z-score normalization (c) normalization by
decimal scaling
6. Use the two methods below to normalize the following group of data:
200, 300, 400, 600, 1000 (a) min-max normalization by setting min =
0 and max = 1 (b) z-score normalization
References:
https://jcsites.juniata.edu/faculty/rhodes/ml/datapreprocessing.htm (Data Preprocessing)
https://www.futurelearn.com/info/courses/data-analytics-python-statistics-and-analytics-fundamentals/0/steps/186574 (Types and sources of data)