
Data Analytics (U21PC701CS)

UNIT-1

Table of Contents

Data Architecture Design and Data Management
Different Sources of Data for Data Analysis
Data Management
Data quality
Data preprocessing
Questions

Data Architecture Design and Data Management

In the past, data was small and manageable, easily stored on a single
computer. Today, data volumes have exploded, with roughly 2.5 quintillion
bytes generated every day. Most of this data comes from social media sites
such as Facebook, Instagram, and Twitter, as well as from e-commerce,
hospitals, schools, and banks. Traditional storage methods cannot manage
this large and messy data, so Big Data techniques were developed to handle it.

Big Data involves collecting large data sets from sources like social media,
GPS, and sensors, and then analyzing them to find useful patterns using
tools like SAS, Microsoft Excel, R, Python, Tableau, RapidMiner, and
KNIME. Before analysis, a data architect needs to design the data structure.

Data architecture design is a set of standards that includes policies, rules,
and models for managing data. It covers what data is collected, where it
comes from, and how it is organized, stored, used, and secured in systems
and data warehouses for future analysis.

Data is a key part of enterprise architecture, helping businesses
successfully execute their strategies.

Data architecture design is crucial for planning how data systems interact.
For instance, if a data architect needs to integrate data from two systems,
data architecture provides a clear model of how these systems will connect
and work together.

Data architecture specifies the data structures used for managing and
preprocessing data. It includes three main models that are integrated to
form the complete system:

 Physical model – The physical model holds the database design, such as
which database technology is suitable for the architecture.

 Conceptual model – A business-level model that uses the Entity
Relationship (ER) model to describe entities, their attributes, and the
relationships between them.

 Logical model – A model in which problems are represented in logical
form, such as rows and columns of data, classes, XML tags, and other
DBMS techniques.

A data architect is responsible for the design, creation, management, and
deployment of the data architecture and defines how data is to be stored
and retrieved; other decisions are made by internal bodies.

Factors that influence Data Architecture:

Data architecture is influenced by business needs, policies, technology,
economics, and data processing requirements, all of which affect how data
is managed.

 Business Requirements: Factors like business expansion, system
performance, data management, transaction management, and
storing data in warehouses.
 Business Policies: Rules set by the organization and government for
processing data.
 Technology in Use: Includes past data architecture designs and
existing licensed software and database technologies.
 Business Economics: Factors such as growth, losses, interest rates,
loans, market conditions, and overall costs.
 Data Processing Needs: Involves data mining, handling large
transactions, database management, and preprocessing.

Different Sources of Data for Data Analysis

Data collection involves gathering and storing large amounts of data, which
can be in various forms like text, video, audio, or images. It's the first step in
big data analysis, where data is collected from valid sources before analyzing
patterns or information.

Raw data, initially not useful, becomes valuable through cleaning and
analysis, turning into actionable knowledge for various fields. Data
collection aims to gather rich information, starting with defining the data
type and source. Data is categorized as qualitative (non-numerical, focusing
on behavior) or quantitative (numerical, analyzed with scientific tools).

The actual data is then further divided mainly into two types known
as:

Primary data

Secondary data

1. Primary data: Data that is raw, original, and extracted directly from
official sources is known as primary data. This type of data is collected
directly through techniques such as questionnaires, interviews, and
surveys. The data collected must match the demands and requirements of
the target audience on which the analysis is performed; otherwise it
becomes a burden in data processing. A few methods of collecting primary
data:

Interview method: Data is collected by interviewing the target audience;
the person asking the questions is the interviewer and the person
answering is the interviewee. Basic business- or product-related questions
are asked and recorded as notes, audio, or video, and this data is stored
for processing. Interviews can be structured or unstructured, such as
personal interviews or formal interviews by telephone, face to face, email,
etc.

Survey method: The survey method is a research process in which a list of
relevant questions is asked and the answers are recorded as text, audio,
or video. Surveys can be conducted both online and offline, for example
through website forms and email. The survey answers are then stored for
data analysis. Examples are online surveys or surveys through social media
polls.

Observation method: The observation method is a method of data
collection in which the researcher keenly observes the behavior and
practices of the target audience using a data collection tool and stores the
observed data as text, audio, video, or another raw format. In this method,
data is collected directly by posing a few questions to the participants, for
example by observing a group of customers and their behavior towards
the products. The data obtained is then sent for processing.

Experimental method: The experimental method is the process of
collecting data by performing experiments, research, and investigation.
The most frequently used experimental designs are CRD, RBD, LSD, and FD.

CRD – Completely Randomized Design is a simple experimental design
used in data analytics which is based on randomization and replication.
It is mostly used for comparing treatments in experiments.

RBD – Randomized Block Design is an experimental design in which the
experiment is divided into small units called blocks. Random experiments
are performed on each of the blocks and the results are analysed using a
technique known as analysis of variance (ANOVA). RBD originated in the
agriculture sector.

LSD – Latin Square Design is an experimental design similar to CRD and
RBD but arranged in rows and columns. It is an N x N arrangement with an
equal number of rows and columns, in which each treatment (letter)
occurs exactly once in every row and every column. This makes differences
easy to find with fewer errors in the experiment. A Sudoku puzzle is an
example of a Latin square design.

FD – Factorial Design is an experimental design in which two or more
factors, each with several possible levels, are studied together; trials are
performed for every combination of factor levels so that the effects of the
factors and their interactions can be derived.
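As a rough illustration of two of these designs, the short Python sketch below is purely illustrative: the factors, levels, and treatment labels are invented for the example. It enumerates all trials of a small factorial design with itertools.product and builds a simple cyclic Latin square in which each treatment appears exactly once in every row and column.

    from itertools import product

    # Full factorial design (FD): every combination of factor levels is a trial.
    # The factors and levels here are hypothetical examples.
    factors = {
        "temperature": ["low", "high"],
        "fertilizer": ["A", "B", "C"],
    }
    trials = list(product(*factors.values()))
    print(f"{len(trials)} factorial trials:", trials)

    # Latin Square Design (LSD): an N x N arrangement in which each treatment
    # appears exactly once in every row and every column (cyclic construction).
    def latin_square(treatments):
        n = len(treatments)
        return [[treatments[(row + col) % n] for col in range(n)] for row in range(n)]

    for row in latin_square(["A", "B", "C", "D"]):
        print(row)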

2. Secondary data: Secondary data is data that has already been collected
and is reused for another valid purpose. This type of data is derived from
previously recorded primary data and has two types of sources: internal
and external.

Internal source: This type of data can easily be found within the
organization, such as market records, sales records, transactions, customer
data, accounting resources, etc. The cost and time needed to obtain
internal sources are low.

External source: Data that cannot be found inside the organization and
must be obtained through external third-party resources is external source
data. The cost and time consumption are higher because this involves a
huge amount of data. Examples of external sources are government
publications, news publications, the Registrar General of India, the
Planning Commission, the International Labour Bureau, syndicate
services, and other non-governmental publications.

Other sources:

Sensors data: With the advancement of IoT devices, the sensors of these
devices collect data which can be used for sensor data analytics to track
the performance and usage of products.

Satellite data: Satellites collect large amounts of images and data, in
terabytes on a daily basis, through surveillance cameras, and this data can
be used to extract useful information.

Web traffic: With fast and cheap internet access, data in many formats
uploaded by users on different platforms can be collected, with their
permission, for data analysis. Search engines also provide data on the
keywords and queries that are searched most often.

Data Management

Data management includes tasks like extracting, storing, transferring,
processing, and securing data in a cost-effective manner. Its goal is to
efficiently manage and protect data, enabling easy creation, access,
deletion, and updates. Good data management is essential for business
growth and productivity. For large volumes of data, like big data, advanced
tools and technologies such as Hadoop, Scala, Tableau, and AWS are needed.
Effective data management also requires training employees and ongoing
maintenance by DBAs, data analysts, and data architects.

An efficient data management system involves the collection, filtering, and
organization of data to support an organization’s goals and decision-making.
It is considered crucial in the IT sector for running business applications
and providing analytical insights. This process comprises various functions
to ensure data accessibility. The key concepts of data management, its
importance, and the risks and challenges involved in data handling are
explored below.

Data management is a system for collecting and analyzing raw data to help
people and organizations use it effectively while following policies and
regulations. The first step is gathering data from various sources in its raw
form, whether structured or unstructured. This data must be sorted and
organized securely, with the right storage technology chosen based on the
data volume. Next, the data is processed by cleaning, aggregating, and
enhancing it to make it meaningful. Ensuring data accuracy and reliability
involves using validation rules and error-checking processes.

To keep data secure and private, measures such as encryption and access
control are implemented to prevent unauthorized access and data loss. Data
should also be analyzed using techniques like data mining, machine
learning, and visualization. Different data management lifecycles help
organizations meet business and regulatory requirements, manage
metadata, and provide detailed information about data, the mining process,
and data usage to ensure effective management.

Importance of Data Management

In today’s data-driven world, data management has become a paramount
concern, involving data organization, storage, processing, and protection.
It increases data accuracy and accessibility to ensure user reliability. Here
are some key reasons that make data management very important:

Informed Decision-Making Process: Data is the most important component
for businesses and organizations because they base their important
decisions on it. A proper data management process ensures that
decision-makers have direct access to up-to-date information, which helps
them make effective choices.

Data Quality and Efficiency: A well-managed data set leads to streamlined
processes, which helps maintain data quality and efficiency and reduces
the risk of errors and poor decision-making.

Compliance and Customer Trust: Many organisations are subject to strict
regulations and must maintain a proper data management process.
Effective processes also ensure that client data is handled responsibly.

Strategy Development and Innovation: In the modern context, data is a
valuable asset that helps organisations identify trends, potential
opportunities, and challenges, and understand market trends and customer
behaviour. Effective data management also allows previous data to be
analysed to identify patterns, which leads to the development of new
products and solutions.

Long-term Sustainability: Proper data management helps organisations
plan for the long run. It supports efficient master data management by
reducing redundancies, data duplication, and unnecessary storage costs.

Competitive Advantage: Proper data management enables organisations to
explore market trends, customer behaviours, and other insights that can
help them outperform competitors.

Data Management Responsibilities and Roles in IT industry

In the IT field, data management involves various job roles and
responsibilities to process data properly. Different data management roles
collaborate with users to handle the different aspects of data management.
Here are some common data management jobs in the IT industry:

Data Manager: They are responsible for overseeing the whole data
management strategy. They define data handling policies and standards,
ensuring data quality, accuracy, and regulatory compliance.

Database Administrator: This role is related to the database management
system. The main work is to manage and maintain the databases that store
the relevant data, ensuring overall performance and security.

Data Architect: Data architects design the structure and architecture of
databases and whole data systems. This includes data models and schemas,
with relationships developed between the data sets to meet business
requirements.

Data Analyst: Data analysts perform data analysis and data visualisation,
analysing current trends and patterns.

Data Scientist: They utilise statistical data processing, machine learning
techniques, and algorithms to solve complex problems. They mostly
collaborate with business and technical teams to deploy models into
production.

Data Security Analyst: They are responsible for implementing and
managing security measures to protect data from breaches and
unauthorised access. They monitor data access and enforce security
policies in collaboration with the IT and security teams.

Chief Data Officers: CDOs hold a strategic role in the IT field, overseeing
data-related activities and defining data management strategies to achieve
business goals and objectives.

Risks and Challenges in Data Management

While effective data management can produce significant benefits, there
are also many risks and challenges related to it. Here are some of them:

Security and Privacy: Unauthorised access to sensitive data, for example
through hacking, can cause data breaches that expose confidential
information and lead to financial losses for an organisation.

Data Quality: Poor data quality and duplicate data often stem from errors
during data collection and lead to incorrect decision-making. Duplicate
data also occupies valuable storage and creates confusion during the
analysis process.

Data Governance: Lack of data ownership and access control can lead to
inconsistent data management, which creates security risks and
compromises the security of data.

Data Integration Process: Integrating data from various sources is
difficult, as the sources use different formats and complex structures.
Poor integration disrupts decision-making and the data analysis process.

Data Scaling: Data management systems need to scale with increasing
data loads while maintaining performance, which poses technical
challenges.

Data Lifecycle Management: Organisations need to be transparent about
their data retention policies, which determine how long data is processed
and which data needs to be deleted. Secure data disposal is also needed to
prevent unauthorised access.

Data Analysis: Analysing complex and varied data sets requires advanced
analytics tools. Developing actionable insights also requires a proper
understanding of the business context and relevant domain knowledge.

Data quality

Data quality is defined in terms of accuracy, completeness, consistency,
timeliness, believability, and interpretability. These qualities are assessed
based on the intended use of the data.

Accuracy means the data must be correct and free of errors. Completeness
refers to having all necessary values recorded. Consistency ensures that
data is uniform and free from discrepancies.

Data often faces challenges such as inaccuracy, which can arise from faulty
collection tools, human errors, or inconsistencies in naming conventions.

For instance, if sales records have incorrect pricing information, decisions
based on this data could be flawed. Incompleteness occurs when some data
attributes are missing or not recorded, such as missing sale details.

Timeliness is also critical; outdated or incomplete data, if not updated
promptly, can negatively impact decision-making. For example, calculating
sales bonuses based on incomplete monthly data could result in incorrect
bonus distributions.

Believability is about how much users trust the data. Even if data is
accurate now, past errors might lead users to distrust it. For example, if a
database had numerous errors previously, users might still doubt its
reliability.

Interpretability refers to how easily data can be understood. If a database
uses complex accounting codes that users don’t understand, even accurate
and complete data can be seen as low quality.
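As a rough, hedged illustration of how some of these dimensions can be checked in practice, the Python sketch below assumes pandas is available; the column names, values, and the "prices must be positive" rule are invented for the example. It reports missing-value rates (completeness), duplicate rows (consistency), and out-of-range values (accuracy).

    import pandas as pd

    # Hypothetical sales records; columns and values are invented for illustration.
    sales = pd.DataFrame({
        "item":  ["pen", "pen", "book", "desk", None],
        "price": [1.5, 1.5, -3.0, 120.0, 9.99],
    })

    # Completeness: share of missing values per column.
    print(sales.isna().mean())

    # Consistency: exact duplicate records.
    print("duplicate rows:", sales.duplicated().sum())

    # Accuracy (assumed validity rule): prices must be positive.
    print("invalid prices:", (sales["price"] <= 0).sum())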

Data preprocessing

Data preprocessing is an important step in the data mining process. It
refers to the cleaning, transforming, and integrating of data in order to
make it ready for analysis. The goal of data preprocessing is to improve the
quality of the data and to make it more suitable for the specific data
mining task. Some common steps in data preprocessing include:

Data cleaning routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data. Data
cleaning is usually performed as an iterative two-step process consisting of
discrepancy detection and data transformation.

Data integration combines data from multiple sources, which may involve
dealing with inconsistencies like different names for the same attribute (e.g.,
"customer id" vs. "cust id"). The resolution of semantic heterogeneity,
metadata, correlation analysis, tuple duplication detection, and data conflict
detection contribute to smooth data integration.

Data transformation routines convert the data into appropriate forms for
mining. For example, in normalization, attribute data are scaled so as to
fall within a small range such as 0.0 to 1.0. Data discretization transforms
numeric data by mapping values to interval or concept labels. Such methods
can be used to automatically generate concept hierarchies for the data,
which allows for mining at multiple levels of granularity. Discretization
techniques include binning, histogram analysis, cluster analysis, decision
tree analysis, and correlation analysis. For nominal data, concept
hierarchies may be generated based on schema definitions as well as the
number of distinct values per attribute.

Data reduction techniques obtain a reduced representation of the data
while minimizing the loss of information content. These include methods of
dimensionality reduction, numerosity reduction, and data compression.

Preprocessing in Data Mining:

Data preprocessing is a data mining technique used to transform raw data
into a useful and efficient format.

1. Data Cleaning: The data can have many irrelevant and missing parts.
To handle this, data cleaning is performed. It involves handling missing
data, noisy data, etc.

 Missing Data: This situation arises when some values are missing in the
data. It can be handled in various ways (a short code sketch follows these
options):
 Ignore the Tuple: This method involves discarding the entire
record (tuple) if it contains any missing values. It is not very
effective unless the tuple contains several attributes with missing
values, and it is especially poor when the proportion of missing
values varies considerably across attributes.

 Fill in the Missing Value Manually: This is a labor-intensive
method where missing values are filled in by hand. It can be
impractical for large datasets with numerous missing values.
 Use a Global Constant: All missing values are replaced with a
constant value, such as "Unknown" or a specific number (e.g.,
−∞). While simple, this approach can lead to misleading
interpretations since the constant might be seen as a valid,
meaningful value by the mining algorithm.
 Use a Measure of Central Tendency: Missing values are
replaced with the mean or median of the attribute. The mean is
used for symmetric distributions, while the median is better for
skewed distributions.
 Use the Mean or Median of the Same Class: This approach is
similar to the previous one but is done within the context of a
specific class. For example, in a classification problem, missing
values might be filled in with the average value for all samples
in the same class.
 Use the Most Probable Value: The missing value is estimated
using more sophisticated methods like regression, Bayesian
inference, or decision tree induction, which predict the value
based on other attributes in the dataset.
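A minimal sketch of some of these options follows, assuming pandas is available; the data, the class column, and the constant value are hypothetical and chosen only for illustration.

    import pandas as pd

    # Hypothetical data with missing income values; names are invented for illustration.
    df = pd.DataFrame({
        "class":  ["A", "A", "B", "B", "B"],
        "income": [30000, None, 52000, None, 58000],
    })

    # Use a global constant (here -1, an arbitrary marker) for missing values.
    filled_const = df["income"].fillna(-1)

    # Use a measure of central tendency: the mean of the whole attribute.
    filled_mean = df["income"].fillna(df["income"].mean())

    # Use the mean of the same class (group-wise imputation).
    filled_class_mean = df.groupby("class")["income"].transform(
        lambda s: s.fillna(s.mean())
    )

    print(filled_const)
    print(filled_mean)
    print(filled_class_mean)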

 Noisy Data: Noisy data is meaningless data that cannot be interpreted by
machines. It can be generated by faulty data collection, data entry errors,
etc. It can be handled in the following ways:

 Binning Method: This method works on sorted data in order to
smooth it. The data is divided into segments (bins) of equal size
and each bin is handled separately: all values in a bin can be
replaced by the bin mean, or by the nearest bin boundary value
(a short code sketch follows this list).
 Regression: Here data can be smoothed by fitting it to a
regression function. The regression used may be linear (having
one independent variable) or multiple (having multiple
independent variables).
 Clustering: This approach groups similar data into clusters.
Values that fall outside the clusters may be treated as outliers.
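A minimal sketch of smoothing by bin means and by bin boundaries, assuming NumPy is available; the sample values and bin depth are arbitrary, and the data length is assumed to divide evenly into bins.

    import numpy as np

    # Sorted data to be smoothed; the values are an arbitrary example.
    data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    depth = 3  # equal-frequency bins of 3 values each

    bins = data.reshape(-1, depth)                    # one row per bin
    by_means = np.repeat(bins.mean(axis=1), depth)    # smoothing by bin means

    # Smoothing by bin boundaries: each value snaps to the nearer bin edge.
    lo, hi = bins[:, [0]], bins[:, [-1]]
    by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

    print(by_means)
    print(by_bounds)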

2. Data integration

Data integration combines data from various sources into one unified
dataset, reducing redundancies and inconsistencies to improve data
mining accuracy and efficiency. For example, in merging customer data
from different systems, "customer id" in one database may need to be
matched with "cust number" in another. This ensures that the same
customer is not misidentified or duplicated.

Schema integration can be challenging when attributes have different
names or formats. For instance, one database might record discounts at
the order level, while another records them at the item level, potentially
leading to errors if not corrected before integration.

Redundancies can be detected using correlation analysis, which measures
how one attribute relates to another. For example, if annual revenue can
be derived from sales and discounts, the redundancy can be identified.
Tuple duplication also needs to be addressed, such as when a purchase
order database contains multiple entries for the same customer with
different addresses due to data entry errors.
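As a rough illustration of correlation-based redundancy detection, the Python sketch below assumes pandas is available; the column names, values, and the 0.95 threshold are invented for the example, with annual_revenue deliberately derived from sales so that it shows up as redundant.

    import pandas as pd

    # Hypothetical merged data; "annual_revenue" is derivable from sales and
    # discounts, so it should appear highly correlated (i.e., potentially redundant).
    df = pd.DataFrame({
        "sales":          [100, 200, 300, 400, 500],
        "discounts":      [10, 20, 30, 40, 50],
        "annual_revenue": [90, 180, 270, 360, 450],
    })

    corr = df.corr()
    print(corr)

    # Flag attribute pairs whose absolute correlation exceeds an assumed threshold.
    threshold = 0.95
    pairs = [
        (a, b)
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if abs(corr.loc[a, b]) > threshold
    ]
    print("possibly redundant pairs:", pairs)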

Data value conflicts arise when the same attribute is represented
differently across systems. For example, weight might be recorded in
metric units in one system and in British imperial units in another, or
room prices might differ due to varying currencies and included services.
Resolving these discrepancies ensures consistent and accurate data
integration.

3. Data Transformation: Data are transformed and consolidated into forms
appropriate for mining, for example by performing summary or aggregation
operations. Common techniques include the following (a brief code sketch
of some of them follows the list):

 Smoothing, which works to remove noise from the data.
Techniques include binning, regression, and clustering.
 Attribute construction (or feature construction), where new
attributes are constructed and added from the given set of
attributes to help the mining process.
 Aggregation, where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube
for data analysis at multiple abstraction levels.
 Normalization, where the attribute data are scaled so as to fall
within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
 Discretization, where the raw values of a numeric attribute
(e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.)
or conceptual labels (e.g., youth, adult, senior). The labels, in
turn, can be recursively organized into higher-level concepts,
resulting in a concept hierarchy for the numeric attribute.
 Concept hierarchy generation for nominal data, where
attributes such as street can be generalized to higher-level
concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be
automatically defined at the schema definition level.
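A minimal sketch of normalization, discretization, and aggregation, assuming pandas is available; the sales figures, dates, ages, and age-group boundaries are hypothetical and chosen only for illustration.

    import pandas as pd

    # Hypothetical daily sales with customer ages; values are invented for illustration.
    df = pd.DataFrame({
        "date":  pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10", "2024-02-25"]),
        "sales": [200.0, 300.0, 400.0, 600.0],
        "age":   [13, 25, 42, 70],
    })

    # Normalization: min-max scaling of sales into [0.0, 1.0].
    s = df["sales"]
    df["sales_minmax"] = (s - s.min()) / (s.max() - s.min())

    # Normalization: z-score scaling (mean 0, sample standard deviation 1).
    df["sales_zscore"] = (s - s.mean()) / s.std()

    # Discretization: replace raw ages with conceptual labels.
    df["age_group"] = pd.cut(df["age"], bins=[0, 18, 60, 120],
                             labels=["youth", "adult", "senior"])

    # Aggregation: roll daily sales up to monthly totals.
    monthly = df.groupby(df["date"].dt.to_period("M"))["sales"].sum()

    print(df)
    print(monthly)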

4. Data Reduction: Data reduction techniques are used to create a smaller
representation of a dataset that maintains its essential characteristics.
This allows for more efficient data mining while still producing similar
analytical results. The main strategies for data reduction include
dimensionality reduction, numerosity reduction, and data compression.

 Dimensionality reduction involves decreasing the number of
variables or attributes in the data. This can be done using
methods like wavelet transforms and principal components
analysis (PCA), which transform the data into a smaller space
(see the sketch after this list). Another method, called attribute
subset selection, identifies and removes irrelevant or redundant
attributes, further simplifying the dataset.
 Numerosity reduction replaces the original data with a more
compact form. This can be done using parametric methods,
which create a model to estimate the data and store only the
model's parameters. Examples include regression and log-linear
models. Nonparametric methods offer alternative approaches,
such as using histograms to summarize data, clustering similar
data points together, sampling a representative subset of the
data, or aggregating data into cubes for easier analysis.
 Data compression applies transformations to shrink the
dataset's size. If the original data can be perfectly reconstructed
from the compressed version, it is called lossless compression. If
only an approximation of the original data can be reconstructed,
with some loss of information, it is referred to as lossy
compression. Techniques for dimensionality reduction and
numerosity reduction can also be considered forms of data
compression.
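A minimal sketch of dimensionality reduction with PCA, assuming NumPy and scikit-learn are available; the synthetic dataset (five attributes driven by two latent factors) is fabricated just to show that two principal components retain most of the variance.

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical dataset: 100 samples with 5 correlated attributes.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(100, 2))                      # two latent factors
    noise = rng.normal(scale=0.05, size=(100, 5))
    X = base @ rng.normal(size=(2, 5)) + noise            # 5 columns driven by 2 factors

    # Dimensionality reduction: project onto the top 2 principal components.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (100, 2)
    print(pca.explained_variance_ratio_)  # most variance kept by 2 components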

It's important to ensure that the time spent on data reduction does
not outweigh the time saved by working with a smaller dataset. The
goal is to make data analysis more efficient without compromising the
quality of the results.

Questions

1. Data quality can be assessed in terms of accuracy, completeness, and
consistency. Propose two other dimensions of data quality.

2. In real-world data, tuples with missing values for some attributes are
a common occurrence. Describe various methods for handling this
problem.

3. Suppose that the data for analysis includes the attribute age. The age
values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20,
20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46,
52, 70. Answer the following: (a) Use smoothing by bin means to smooth
the above data, using a bin depth of 3. Illustrate your steps. Comment on
the effect of this technique for the given data. (b) How might you
determine outliers in the data? (c) What other methods are there for data
smoothing?

4. Discuss issues to consider during data integration.

5. What are the value ranges of the following normalization methods? (a)
min-max normalization (b) z-score normalization (c) normalization by
decimal scaling

6. Use the two methods below to normalize the following group of data:
200, 300, 400, 600, 1000 (a) min-max normalization by setting min =
0 and max = 1 (b) z-score normalization

7. Use a flow chart to summarize the following procedures for attribute
subset selection: (a) stepwise forward selection (b) stepwise backward
elimination (c) a combination of forward selection and backward
elimination

8. Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215. Partition them into
three bins by each of the following methods: (a) equal-frequency
partitioning (b) equal-width partitioning

Reference:

https://jcsites.juniata.edu/faculty/rhodes/ml/datapreprocessing.htm (Data Preprocessing)

https://www.futurelearn.com/info/courses/data-analytics-python-statistics-and-analytics-fundamentals/0/steps/186574 (Types and sources of data)

