Data Analytics Unit 1
UNIT-1
Data Architecture Design and Data Management
In the past, data was small and manageable, easily stored on a single
computer. Today, data volumes have exploded, with around 2.5 quintillion
bytes generated daily, far exceeding the roughly 19 exabytes of earlier eras.
Most of this data is generated by social media sites like Facebook,
Instagram, and Twitter, as well as by e-commerce, hospitals, schools, and
banks. Traditional storage methods cannot manage such large and messy data,
so Big Data was created to handle it.
Big Data involves collecting large data sets from sources like social media,
GPS, and sensors, and then analyzing them to find useful patterns using
tools like SAS, Microsoft Excel, R, Python, Tableau, RapidMiner, and
KNIME. Before analysis, a data architect needs to design the data structure.
Data architecture design is crucial for planning how data systems interact.
For instance, if a data architect needs to integrate data from two systems,
data architecture provides a clear model of how these systems will connect
and work together.
Physical model – The physical model holds the database design details, such
as which type of database technology will be suitable for the architecture.
Data collection involves gathering and storing large amounts of data, which
can be in various forms like text, video, audio, or images. It's the first step in
big data analysis, where data is collected from valid sources before analyzing
patterns or information.
Raw data, initially not useful, becomes valuable through cleaning and
analysis, turning into actionable knowledge for various fields. Data
collection aims to gather rich information, starting with defining the data
type and source. Data is categorized as qualitative (non-numerical, focusing
on behavior) or quantitative (numerical, analyzed with scientific tools).
The data is further divided into two main types:
Primary data
Secondary data
1. Primary data: Data that is raw, original, and extracted directly
from official sources is known as primary data. This type of data is
collected directly through techniques such as questionnaires,
interviews, and surveys. The data collected must match the demands and
requirements of the target audience on which the analysis is performed;
otherwise it becomes a burden in data processing. A few methods of
collecting primary data:
LSD – Latin Square Design is an experimental design that is similar
to CRD and RBD but is arranged in rows and columns. It is an NxN
arrangement with an equal number of rows and columns, in which each
letter occurs exactly once in every row and every column. Hence
differences can be found with fewer errors in the experiment. A Sudoku
puzzle is an example of a Latin square design.
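As an illustration (not from the source), a minimal Python sketch can build an NxN Latin square by cyclically shifting a list of treatment labels, so that each label occurs exactly once in every row and every column:

import string

def latin_square(n):
    # Hypothetical treatment labels A, B, C, ... used purely for illustration.
    labels = list(string.ascii_uppercase[:n])
    # Row i is the label list cyclically shifted by i positions, so every
    # label appears exactly once per row and once per column.
    return [labels[i:] + labels[:i] for i in range(n)]

for row in latin_square(4):
    print(" ".join(row))
# A B C D
# B C D A
# C D A B
# D A B C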
2. Secondary data: Secondary data is data that has already been
collected and is reused for some valid purpose. This type of data is
derived from previously recorded primary data, and it comes from two
kinds of sources: internal and external.
Internal source: This type of data can easily be found within the
organization, such as market records, sales records, transactions, customer
data, accounting resources, etc. The cost and time needed to obtain internal
data are low.
External sources:
Sensor data: With the advancement of IoT devices, the sensors in these
devices collect data that can be used for sensor-data analytics to track
the performance and usage of products.
Web traffic: Thanks to fast and cheap internet access, data in many formats
uploaded by users on different platforms can be collected, with their
permission, for data analysis. Search engines also provide data on the
keywords and queries searched most often.
Data Management
Data management is a system for collecting and analyzing raw data to help
people and organizations use it effectively while following policies and
regulations.
The first step in data management is collecting data from various sources in
its raw form, whether structured or unstructured. This data must be stored
securely and organized, with the right storage technology chosen based on
the data volume. Next, the data is processed by cleaning, aggregating, and
enhancing it to make it meaningful. Ensuring data accuracy and reliability
involves using validation rules and error-checking processes.
To keep data secure and private, measures such as encryption and access
control are implemented to prevent unauthorized access and data loss. Data
should also be analyzed using techniques like data mining, machine
learning, and visualization. Different data management lifecycles help
organizations meet business and regulatory requirements, manage
metadata, and provide detailed information about data, the mining process,
and data usage to ensure effective management.
Data Management Responsibilities and Roles in IT industry
Data Manager: Data managers are responsible for overseeing the whole data
management strategy. They define data-handling policies and standards,
ensuring data quality, accuracy, and compliance with regulations.
Chief Data Officer: CDOs hold a strategic role in the IT field and
oversee data-related activities, defining data management strategies that
support business goals and objectives.
Challenges in data management include:
Security and Privacy: Unauthorized access to sensitive data, for example
through hacking, can cause data breaches that expose confidential
information and may lead to financial losses for an organization.
Data Quality: Poor data quality and duplicate data, often stemming from
errors during data collection, lead to incorrect decision-making. Duplicate
data also occupies valuable storage and creates confusion during the
analysis process.
Data quality
Accuracy means the data must be correct and free of errors. Completeness
refers to having all necessary values recorded. Consistency ensures that
data is uniform and free from discrepancies.
Data often faces challenges such as inaccuracy, which can arise from faulty
collection tools, human errors, or inconsistencies in naming conventions.
For instance, if sales records have incorrect pricing information, decisions
based on this data could be flawed. Incompleteness occurs when some data
attributes are missing or not recorded, such as missing sale details.
Believability is about how much users trust the data. Even if data is
accurate now, past errors might lead users to distrust it. For example, if a
database had numerous errors previously, users might still doubt its
reliability.
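As a rough sketch of how such quality checks might be automated (pandas code on a made-up sales table; the column names are assumptions, not from the source), completeness, duplication, and inconsistent naming can each be measured directly:

import pandas as pd

# Hypothetical sales records with typical quality problems.
sales = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "price":    [19.99, 5.00, 5.00, None, -3.50],   # missing and negative prices
    "region":   ["North", "South", "South", "South", "east"],
})

completeness = 1 - sales["price"].isna().mean()   # share of recorded prices
duplicates = sales.duplicated().sum()             # fully repeated rows
bad_prices = (sales["price"] < 0).sum()           # accuracy check: negative prices
regions = sales["region"].str.title().unique()    # consistency: normalize naming

print(f"price completeness: {completeness:.0%}, duplicates: {duplicates}, "
      f"negative prices: {bad_prices}")
print("normalized regions:", regions)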
Data preprocessing
Data integration combines data from multiple sources, which may involve
dealing with inconsistencies like different names for the same attribute
(e.g., "customer id" vs. "cust id"). The resolution of semantic
heterogeneity, metadata, correlation analysis, tuple duplication detection,
and data conflict detection contribute to smooth data integration.
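As a brief, hypothetical illustration of correlation analysis for redundancy detection during integration (the column names below are invented for this sketch), strongly correlated attribute pairs can be flagged as candidates for removal:

import pandas as pd

# Hypothetical merged table in which two columns carry almost the same information.
df = pd.DataFrame({
    "price_usd": [10.0, 20.0, 30.0, 40.0, 50.0],
    "price_eur": [9.2, 18.5, 27.4, 36.9, 46.1],
    "quantity":  [5, 3, 8, 2, 7],
})

# Pearson correlations between numeric attributes; values near +/-1.0 suggest
# that one attribute is redundant given the other.
corr = df.corr()
redundant_pairs = corr.abs().gt(0.95) & corr.abs().lt(1.0)

print(corr.round(2))
print(redundant_pairs)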
Data transformation routines convert the data into appropriate forms for
mining. For example, in normalization, attribute data are scaled so as to
fall within a small range such as 0.0 to 1.0. Data discretization transforms
numeric data by mapping values to interval or concept labels. Such methods
can be used to automatically generate concept hierarchies for the data,
which allows for mining at multiple levels of granularity. Discretization
techniques include binning, histogram analysis, cluster analysis, decision
tree analysis, and correlation analysis. For nominal data, concept
hierarchies may be generated based on schema definitions as well as the
number of distinct values per attribute.
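For instance, a minimal sketch (pandas on made-up values; not code from the source) of min-max normalization into the range 0.0 to 1.0 and of discretization into interval labels by equal-width binning might look like this:

import pandas as pd

values = pd.Series([10, 25, 40, 55, 100], name="amount")   # hypothetical attribute values

# Min-max normalization: rescale the attribute into the range 0.0 to 1.0.
normalized = (values - values.min()) / (values.max() - values.min())

# Discretization by equal-width binning: map each value to an interval label.
bins = pd.cut(values, bins=3, labels=["low", "medium", "high"])

print(pd.DataFrame({"amount": values, "normalized": normalized, "bin": bins}))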
1. Data Cleaning: The data can have many irrelevant and missing parts.
To handle this, data cleaning is performed. It involves handling
missing data, noisy data, etc.
Missing Data: This situation arises when some values are missing from the
dataset. It can be handled in various ways, a few of which are sketched in
the example after this list.
Ignore the Tuple: This method involves discarding the entire
record (tuple) if it contains any missing values. This approach is
generally ineffective, particularly if the tuple has other valuable
attributes or if the missing values are unevenly distributed
across attributes.
Fill in the Missing Value Manually: This is a labor-intensive
method where missing values are filled in by hand. It can be
impractical for large datasets with numerous missing values.
Use a Global Constant: All missing values are replaced with a
constant value, such as "Unknown" or a specific number (e.g.,
−∞). While simple, this approach can lead to misleading
interpretations since the constant might be seen as a valid,
meaningful value by the mining algorithm.
Use a Measure of Central Tendency: Missing values are
replaced with the mean or median of the attribute. The mean is
used for symmetric distributions, while the median is better for
skewed distributions.
Use the Mean or Median of the Same Class: This approach is
similar to the previous one but is done within the context of a
specific class. For example, in a classification problem, missing
values might be filled in with the average value for all samples
in the same class.
Use the Most Probable Value: The missing value is estimated
using more sophisticated methods like regression, Bayesian
inference, or decision tree induction, which predict the value
based on other attributes in the dataset.
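The following minimal pandas sketch (a made-up table, not from the source) illustrates a few of the options above: ignoring tuples, filling with a global constant, filling with the attribute mean, and filling with the mean of the same class:

import pandas as pd

# Hypothetical records with missing income values.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 40.0],
})

dropped = df.dropna()                                  # ignore the tuple
constant = df["income"].fillna(-1)                     # use a global constant
mean_fill = df["income"].fillna(df["income"].mean())   # attribute mean
class_fill = df.groupby("class")["income"].transform(  # mean of the same class
    lambda s: s.fillna(s.mean()))

print(pd.DataFrame({"original": df["income"],
                    "attribute_mean": mean_fill,
                    "class_mean": class_fill}))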
2. Data integration
Data integration combines data from various sources into one unified
dataset, reducing redundancies and inconsistencies to improve data
mining accuracy and efficiency. For example, in merging customer data
from different systems, "customer id" in one database may need to be
matched with "cust number" in another. This ensures that the same
customer is not misidentified or duplicated.
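A minimal sketch of this kind of schema matching and merging (hypothetical tables and column names) might rename the mismatched key, join the two sources, and drop duplicated tuples:

import pandas as pd

# Two hypothetical customer tables that name the same key differently.
crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cai"]})
orders = pd.DataFrame({"cust_number": [2, 3, 3, 4], "amount": [20, 35, 15, 50]})

# Resolve the naming mismatch, then integrate the sources on the shared key.
orders = orders.rename(columns={"cust_number": "customer_id"})
merged = crm.merge(orders, on="customer_id", how="outer")

# Remove any duplicate tuples introduced by the integration.
merged = merged.drop_duplicates()
print(merged)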
3. Data reduction: Data reduction obtains a reduced representation of the
data set that is much smaller in volume yet produces nearly the same
analytical results. For nominal data, hierarchies among attributes are
often implicit within the database schema and can be automatically defined
at the schema definition level.
It's important to ensure that the time spent on data reduction does
not outweigh the time saved by working with a smaller dataset. The
goal is to make data analysis more efficient without compromising the
quality of the results.
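As one loose illustration of data reduction (numerosity reduction by simple random sampling on synthetic data; the source does not prescribe this particular method), a small sample can stand in for the full table while yielding nearly the same summary statistics:

import numpy as np
import pandas as pd

# Hypothetical large table; a reduced representation keeps analysis cheap.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "age": rng.integers(18, 80, size=100_000),
    "amount": rng.normal(100, 20, size=100_000),
})

# Numerosity reduction: keep a 1% simple random sample of the rows.
sample = big.sample(frac=0.01, random_state=0)

# The reduced data should give nearly the same analytical results.
print("full mean:", round(big["amount"].mean(), 2))
print("sample mean:", round(sample["amount"].mean(), 2))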
Questions
2. In real-world data, tuples with missing values for some attributes are
a common occurrence. Describe various methods for handling this
problem.
3. Suppose that the data for analysis includes the attribute age, and that
the age values for the data tuples are (in increasing order): 13, 15, 16,
16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70. Answer the following. (a) Use smoothing by bin means to
smooth the above data, using a bin depth of 3. Illustrate your steps.
Comment on the effect of this technique for the given data. (b) How might
you determine outliers in the data? (c) What other methods are there for
data smoothing?
5. What are the value ranges of the following normalization methods? (a)
min-max normalization (b) z-score normalization (c) normalization by
decimal scaling
6. Use the two methods below to normalize the following group of data:
200, 300, 400, 600, 1000 (a) min-max normalization by setting min =
0 and max = 1 (b) z-score normalization
References:
https://jcsites.juniata.edu/faculty/rhodes/ml/datapreprocessing.htm (Data Preprocessing)
https://www.futurelearn.com/info/courses/data-analytics-python-statistics-and-analytics-fundamentals/0/steps/186574 (Types and sources of data)