Unit-1-1
By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way
to see and understand trends, outliers, and patterns in data.
The importance of data visualization is simple: it helps people see, interact with, and better understand data. Whether simple or complex, the right visualization can get everyone on the same page, regardless of their level of expertise.
1. Simplifies Complex Data: Transforms large datasets into a more understandable format.
2. Reveals Patterns and Trends: Helps in identifying trends, correlations, and outliers.
3. Enhances Data Analysis: Makes data analysis more efficient and insightful.
Chart: Information presented in a tabular or graphical form, with data displayed along two axes. Can take the form of a graph, diagram, or map.
Geospatial: A visualization that shows data in map form using different shapes and colors to show
the relationship between pieces of data and specific locations.
Common types of visualization include:
1. Charts: Examples include bar charts and pie charts.
2. Graphs: Examples include line graphs and scatter plots.
3. Maps:
o Choropleth Maps: Use color gradients to represent data values across geographical regions.
Popular data visualization tools:
Power BI: Integrates with various data sources for interactive reports.
D3.js: JavaScript library for producing dynamic, interactive data visualizations in web browsers.
Google Data Studio: Free tool for creating dashboards and reports.
Relationship between Data Visualization and Other Fields
Data visualization is a multidisciplinary field that intersects with various other domains, enhancing the way
information is interpreted and communicated. Here’s a look at how data visualization interacts with and
benefits different fields:
1. Statistics
Relationship:
Enhancement of Statistical Analysis: Data visualization tools are used to illustrate statistical
findings, making complex data more accessible and understandable.
Exploratory Data Analysis (EDA): Visual techniques help in identifying patterns, trends, and
outliers in data, which are crucial for statistical analysis.
2. Computer Science
Relationship:
Algorithms and Programming: Data visualization requires efficient algorithms to process and
render data effectively.
3. Business Intelligence
Relationship:
Decision Support Systems: Data visualization is a key component of BI tools, helping businesses
make data-driven decisions.
Performance Metrics: Visual dashboards display key performance indicators (KPIs) and metrics in
an easily digestible format.
4. Healthcare
Relationship:
Medical Imaging: Visualization techniques are used to interpret complex medical images (e.g.,
MRIs, CT scans).
Epidemiology: Visualizing data helps track the spread of diseases and the effectiveness of
interventions.
5. Environmental Science
Relationship:
Climate Data Analysis: Visualization helps in understanding and communicating climate change
data and environmental impacts.
Geospatial Analysis: Maps and geographic visualizations are used to study environmental
phenomena and resource distribution.
6. Finance
Relationship:
Market Analysis: Financial data visualization aids in analyzing stock market trends and investment
performance.
Risk Management: Visualization tools help in assessing and communicating financial risks.
7. Education
Relationship:
Interactive Learning Tools: Visual aids and interactive dashboards are used in educational settings
to facilitate learning and engagement.
8. Social Sciences
Relationship:
Survey Data Analysis: Visualization helps in interpreting data from social science research, such as
surveys and experiments.
Behavioral Studies: Visual tools are used to analyze and present findings in psychology and
sociology.
9. Journalism
Relationship:
Data Journalism: Visualizations are used to tell compelling stories with data, making complex
information accessible to a broad audience.
Infographics: Journalists use infographics to summarize and highlight key points in their articles.
10. Marketing
Relationship:
Customer Insights: Visualization tools analyze customer data to understand behavior and
preferences.
Campaign Performance: Marketers use dashboards to track and visualize the performance of
marketing campaigns.
Steps to Create a Data Visualization
1. Define Your Objective
o Purpose: Understand why you need to visualize the data. Is it to identify trends, make decisions, or communicate findings?
2. Collect and Prepare the Data
o Gather Data: Collect the necessary data from various sources such as databases, spreadsheets, or APIs.
o Clean Data: Clean the data by handling missing values, removing duplicates, and
correcting errors. Ensure the data is in a suitable format for analysis.
o Explore Data: Use statistical methods and exploratory data analysis (EDA) to understand
the data’s structure, patterns, and relationships.
o Identify Key Metrics: Determine the key metrics and dimensions that are most relevant to
your objectives.
3. Choose the Right Visualization Type
o Match Data to Visualization: Select the most appropriate type of visualization based on
the data and the insights you want to convey. Common types include bar charts, line
graphs, scatter plots, pie charts, histograms, heatmaps, and more.
o Consider Complexity: For complex data sets, consider using advanced visualizations like
treemaps, network diagrams, or interactive dashboards.
4. Design the Visualization
o Create Layout: Design a clear and logical layout for your visualization. Organize the
elements in a way that guides the viewer’s eye to the most important information.
o Use Colors and Styles: Use color, shapes, and styles effectively to highlight key insights and
make the visualization aesthetically pleasing.
o Add Labels and Annotations: Include titles, axis labels, legends, and annotations to
provide context and make the visualization self-explanatory.
5. Build the Visualization
o Choose Tools: Select appropriate tools and software for creating the visualization. Popular tools include those listed earlier, such as Power BI, D3.js, and Google Data Studio.
o Create Visualization: Use the chosen tool to create your visualization, applying the design
principles you’ve planned.
Def: Scatter plots are graphs that present the relationship between two variables in a data set. They represent data points on a two-dimensional plane. The independent variable or attribute is plotted on the X-axis, while the dependent variable is plotted on the Y-axis. These plots are often called scatter graphs or scatter diagrams.
A scatter plot is a diagram where each value in the data set is represented by a dot.
Creating a scatter plot involves plotting points on a two-dimensional graph based on a pair of numerical
data. Here's a pseudo code outline for generating a scatter plot:
Scatter Plot
1. Initialize Data
o Define the paired arrays X and Y holding the numerical values.
2. Setup Plot
o Create a canvas and set up the axes and their ranges.
3. Plot Points
o Iterate over the data points and plot each (X, Y) coordinate on the canvas.
4. Add Labels
o Add axis labels and a title to the scatter plot for clarity.
5. Display Plot
o Render the finished canvas.
BEGIN
    INITIALIZE X[], Y[]                  // paired data arrays
    SETUP Canvas, AxisRanges
    FOR i = 0 TO LENGTH(X) - 1
        PLOT_POINT(X[i], Y[i])
    END FOR
    ADD_LABELS(Title, XLabel, YLabel)
    CALL DisplayCanvas()
END
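As a concrete counterpart to the pseudocode, here is a minimal Python sketch using Matplotlib (the X and Y values are invented for illustration):

import matplotlib.pyplot as plt

# Paired numerical data (invented sample values)
X = [1, 2, 3, 4, 5]
Y = [2.1, 4.3, 3.8, 6.0, 5.5]

plt.scatter(X, Y)                      # plot each (X, Y) pair as a dot
plt.xlabel('Independent variable (X)')
plt.ylabel('Dependent variable (Y)')
plt.title('Scatter Plot')
plt.show()                             # display the canvas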
A data foundation refers to the fundamental infrastructure, processes, and strategies that lay the
groundwork for effectively collecting, managing, storing, organizing, and leveraging enterprise data.
A robust data foundation ensures that the data is accurate, reliable, and prepared for analysis, which is
crucial for generating meaningful insights and making informed decisions.
Key components of a data foundation include:
1. Data Collection: Gathering data from relevant sources.
2. Data Storage: Keeping data in databases, warehouses, or data lakes.
3. Data Cleaning: Ensuring the data is free from errors and inconsistencies.
4. Data Integration: Combining data from different sources to provide a unified view.
5. Data Preparation: Transforming data into a format suitable for analysis and visualization.
Data Collection
Data Sources
Data Storage
1. Databases: Organized stores of structured data, such as relational databases.
2. Data Warehouses: Centralized repositories for storing large volumes of data from multiple sources.
3. Data Lakes: Storage systems that hold raw data in its native format.
o Examples: Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage.
Data Cleaning
Data Integration
1. Merging Datasets: Combining data from different sources into a single dataset.
2. ETL (Extract, Transform, Load): Extracting data from sources, transforming it into the desired format, and loading it into a storage system (a minimal ETL sketch follows this list).
3. APIs: Using Application Programming Interfaces to integrate data from different systems.
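A minimal pandas sketch of the ETL pattern is shown below; the file names are hypothetical placeholders:

import pandas as pd

# Extract: read raw data from a source file (hypothetical path)
raw = pd.read_csv('sales_raw.csv')

# Transform: clean column names and drop duplicate rows
raw.columns = [c.strip().lower() for c in raw.columns]
clean = raw.drop_duplicates()

# Load: write the result to the target storage (hypothetical file)
clean.to_csv('sales_clean.csv', index=False)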
Data Preparation
Data: Data is defined as facts, figures, or information that is stored in or used by a computer. An example of data is information collected for a research paper; an email is another example.
Types of Data
Understanding the different types of data (in statistics, marketing research, or data science) allows you to pick the data type that most closely matches your needs and goals.
Qualitative Data
As the name suggests, Qualitative Data describes the features of the data in statistics. Qualitative Data is also called Categorical Data, as it categorizes the data into various categories. Qualitative data includes attributes such as the gender of people or their family names in a sample of population data.
Nominal Data
Ordinal Data
Nominal Data
Nominal data is a type of data that consists of categories or names that cannot be ordered or ranked. Nominal data is often used to categorize observations into groups, and the groups are not comparable. In other words, nominal data has no inherent order or ranking. Examples of nominal data include gender (male, female), race (White, Black, Asian), religion (Hinduism, Christianity, Islam, Judaism), and blood type (A, B, AB, O).
Nominal data can be represented using frequency tables and bar charts, which display the number or
proportion of observations in each category. For example, a frequency table for gender might show the
number of males and females in a sample of people.
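A quick sketch of building such a frequency table and bar chart in Python (the sample labels are invented):

import pandas as pd
import matplotlib.pyplot as plt

gender = pd.Series(['Male', 'Female', 'Female', 'Male', 'Female'])

freq = gender.value_counts()   # frequency table: count per category
print(freq)

freq.plot(kind='bar')          # bar chart of the category counts
plt.ylabel('Count')
plt.show()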
Nominal data is analyzed using non-parametric tests, which do not make any assumptions about the underlying distribution of the data. Common non-parametric tests for nominal data include the Chi-Squared Test and Fisher's Exact Test. These tests are used to compare the frequency or proportion of observations in different categories.
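For example, a Chi-Squared test on a small contingency table can be run with SciPy; the counts below are invented for illustration:

from scipy.stats import chi2_contingency

# Invented frequency table: rows = gender, columns = blood type A / B
observed = [[30, 10],
            [20, 25]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p-value = {p:.3f}")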
Ordinal Data
Ordinal data is a type of data that consists of categories that can be ordered or ranked. However, the
distance between categories is not necessarily equal. Ordinal data is often used to measure subjective
attributes or opinions, where there is a natural order to the responses. Examples of ordinal data include
education level (Elementary, Middle, High School, College), job position (Manager, Supervisor, Employee),
etc.
Ordinal data can be represented using bar charts and line charts. These displays show the order or ranking of the categories, but they do not imply that the distances between categories are equal.
Ordinal data is analyzed using non-parametric tests, which make no assumptions about the underlying
distribution of the data. Common non-parametric tests for ordinal data include the Wilcoxon Signed-Rank
test and Mann-Whitney U test.
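As an illustration, a Mann-Whitney U test comparing two groups of ordinal ratings can be run with SciPy (the 1-5 scale ratings are invented):

from scipy.stats import mannwhitneyu

# Invented satisfaction ratings (ordinal 1-5 scale) for two groups
group_a = [3, 4, 2, 5, 4, 3]
group_b = [2, 1, 3, 2, 4, 2]

stat, p = mannwhitneyu(group_a, group_b)
print(f"U = {stat}, p-value = {p:.3f}")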
Quantitative Data is the type of data that represents numerical values. It is also called Numerical Data. This data type is used to represent measurements such as height, weight, and length. Quantitative data is further classified into two categories:
Discrete Data
Continuous Data
Discrete Data
Discrete data is a type of data in statistics that takes only discrete, single values. These values can be counted as whole numbers. Examples of discrete data include the number of students in a class or the number of books on a shelf.
Continuous Data
Continuous data is the type of quantitative data that represents values within a continuous range; a variable in the data set can take any value between the bounds of that range. Examples of continuous data include temperature ranges, height, and weight.
Key Elements
1. Attributes/Fields:
o These are individual pieces of data within a record. For example, in a dataset of customer
information, fields might include CustomerID, Name, Age, and PurchaseAmount.
2. Data Types:
o Each attribute has a data type, such as integer, float, string, or date. Proper data types
ensure that the data is correctly interpreted and manipulated.
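A small sketch of these key elements in pandas, using the field names mentioned above (the record values are invented):

import pandas as pd

customers = pd.DataFrame({
    'CustomerID': [101, 102, 103],             # integer attribute
    'Name': ['Asha', 'Ravi', 'Meera'],         # string attribute
    'Age': [34, 28, 45],                       # integer attribute
    'PurchaseAmount': [250.75, 99.99, 480.5],  # float attribute
})
print(customers.dtypes)  # shows the data type inferred for each field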
1. Tabular Structure
Description: Data is organized into rows (records) and columns (attributes), as in a spreadsheet or database table.
Use Cases: Best for simple data visualizations like bar charts, line graphs, or heat maps.
2. Hierarchical Structure
Description: Data is organized in a tree-like structure with parent-child relationships.
Example: Organizational charts or file systems.
3. Network/Graph Structure
Description: Data consists of nodes (entities) and edges (relationships) that connect the nodes.
Example: Social networks, transportation networks, or web pages connected via hyperlinks.
4. Geospatial Structure
Description: Data is associated with geographical locations, often with latitude and longitude
coordinates.
Example: Population density by region, meteorological data, or crime rates across cities.
Use Cases: When geographic context is important for understanding the data.
5. Temporal Structure
Description: Data is structured around time, where each data point is connected to a specific point
in time.
Example: Time-series data, such as stock prices over time or website traffic trends.
6. Matrix Structure
Description: Data is structured in a two-dimensional grid where rows and columns intersect to form cells.
Example: A correlation matrix, often visualized as a heatmap.
7. Textual/Unstructured Data
Description: Text data, which does not follow a fixed format or structure.
Use Cases: Useful in natural language processing (NLP) and sentiment analysis.
8. Multi-dimensional Structure
Description: Data with more than two dimensions (e.g., features, categories) is represented.
Example: A dataset with multiple attributes like age, gender, income, and education level.
Use Cases: Analyzing multi-dimensional datasets where several factors play a role.
Mapping: Data is mapped to visual attributes like position, size, color, shape, and orientation.
Rendering: Visualization is rendered based on mapped attributes, using tools like D3.js, Tableau, or
Python’s Matplotlib.
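A minimal Matplotlib sketch of the mapping step: position encodes two variables, marker size a third, and color a fourth (all values invented):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]               # mapped to horizontal position
y = [10, 14, 8, 20]            # mapped to vertical position
size = [40, 120, 80, 200]      # third variable mapped to marker size
value = [0.2, 0.5, 0.8, 0.9]   # fourth variable mapped to color

plt.scatter(x, y, s=size, c=value, cmap='viridis')
plt.colorbar(label='value')    # legend for the color mapping
plt.show()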
Data Preprocessing:
Data preprocessing is the process of converting raw data into an understandable format.
Data preprocessing is a crucial step in data visualization, as it prepares raw data for analysis and
visualization. The goal is to clean and transform the data to make it suitable for the intended analysis.
1. Data Cleaning:
This step removes noise and handles missing values. Noisy data can be smoothed using the following techniques (a regression-smoothing sketch follows this list):
1. Binning:
This method works on sorted data values: the data is distributed into bins, and the values in each bin are smoothed by the bin mean, median, or boundaries.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters; outliers either go undetected or fall outside the clusters.
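A minimal sketch of smoothing by linear regression with NumPy (the noisy y values are invented):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.2, 3.9, 6.1, 8.2, 9.8, 12.3])  # noisy observations

# Fit y = a*x + b and replace the raw values with the fitted ones
a, b = np.polyfit(x, y, deg=1)
y_smooth = a * x + b
print(y_smooth)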
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (e.g., -1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels (a short sketch of normalization and discretization follows this list).
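A short pandas/scikit-learn sketch of normalization and discretization (the ages, bins, and labels are invented):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

values = pd.DataFrame({'Age': [18, 25, 40, 63, 75]})

# Normalization: scale Age into the range 0.0 to 1.0
values['Age_scaled'] = MinMaxScaler().fit_transform(values[['Age']]).ravel()

# Discretization: replace raw ages with conceptual levels
values['Age_group'] = pd.cut(values['Age'],
                             bins=[0, 30, 60, 100],
                             labels=['Young', 'Middle-aged', 'Senior'])
print(values)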
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of data
analysis and to avoid overfitting of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature
selection is often performed to remove irrelevant or redundant features from the dataset. It can
be done using various techniques such as correlation analysis, mutual information, and principal
component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original features
are high-dimensional and complex. It can be done using techniques such as PCA, linear
discriminant analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used
to reduce the size of the dataset while preserving the important information. It can be done using
techniques such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often used to reduce the size of the dataset by replacing similar data points with a representative centroid. It can be done using techniques such as k-means, hierarchical clustering, and density-based clustering. A combined sketch of these reduction techniques follows this list.
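A combined scikit-learn sketch of feature extraction, sampling, and clustering for data reduction (the random data stands in for a real dataset):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 10))  # stand-in high-dimensional dataset

# Feature extraction: project 10 features down to 2 principal components
reduced = PCA(n_components=2).fit_transform(data)

# Sampling: keep a random 25% subset of the rows
sample = reduced[rng.choice(len(reduced), size=50, replace=False)]
print(sample.shape)

# Clustering: summarize the points by 3 representative centroids
kmeans = KMeans(n_clusters=3, n_init=10).fit(reduced)
print(kmeans.cluster_centers_)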
1. Data Collection
o Objective: Gather data from various sources, such as databases, APIs, or
spreadsheets.
o Example: Collect sales data from a company's sales database.
2. Data Cleaning
o Objective: Identify and correct errors, inconsistencies, and missing values in the
dataset.
o Tasks:
Remove Duplicates: Ensure there are no duplicate records.
Handle Missing Values: Fill, interpolate, or remove missing data.
Correct Errors: Fix any inconsistencies or inaccuracies.
import pandas as pd

# Sample data (invented values, with gaps and a duplicate to clean)
df = pd.DataFrame({
    'Date': ['2024-01-01', None, '2024-01-03', '2024-01-03'],
    'Product': ['A', None, 'B', 'B'],
    'Sales': [100.0, None, 150.0, 150.0]
})

df = df.drop_duplicates()                        # remove duplicate records
df['Date'] = df['Date'].ffill()                  # forward-fill missing dates
df['Product'] = df['Product'].fillna('Unknown')  # label missing products
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())  # impute with the mean
3. Data Transformation
Objective: Convert data into a format suitable for analysis and visualization.
Tasks:
o Normalization/Scaling: Adjust the range of data values.
o Encoding Categorical Variables: Convert categorical data into numerical format.
o Date/Time Conversion: Ensure date and time data are in the correct format.
Example:
from sklearn.preprocessing import StandardScaler, LabelEncoder
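The import above only loads the classes; here is a minimal sketch of applying them (the small df mirrors the cleaned sample from step 2, and the derived column names Sales_scaled and Product_code are illustrative):

import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder

df = pd.DataFrame({'Product': ['A', 'B', 'Unknown', 'A'],
                   'Sales': [100.0, 150.0, 150.0, 200.0]})

# Normalization/Scaling: zero mean, unit variance for Sales
df['Sales_scaled'] = StandardScaler().fit_transform(df[['Sales']]).ravel()

# Encoding: convert the categorical Product column to integer codes
df['Product_code'] = LabelEncoder().fit_transform(df['Product'])
print(df)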
4. Data Aggregation
Objective: Summarize data by grouping it, for example totals per month or per category.
Example: see the combined snippet below, after the Feature Engineering tasks.
5. Feature Engineering
Objective: Create new features or variables that can provide additional insights.
Tasks:
o Create Derived Variables: Generate new columns based on existing data.
o Binning: Group numerical data into bins or categories.
Example (derive a Month feature, then aggregate sales by month):
df['Month'] = df['Date'].dt.to_period('M')
monthly_sales = df.groupby('Month')['Sales'].sum().reset_index()
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Load dataset (invented sample values for illustration)
df = pd.DataFrame({
    'Date': ['2024-01-05', '2024-01-20', '2024-02-10', '2024-02-25', '2024-03-15'],
    'Sales': [120.0, 180.0, 150.0, 210.0, 170.0]
})
df['Date'] = pd.to_datetime(df['Date'])

# Normalize the Sales column
scaler = StandardScaler()
df['Sales'] = scaler.fit_transform(df[['Sales']]).ravel()

# Derive a Month feature
df['Month'] = df['Date'].dt.to_period('M')

# Aggregate data
monthly_sales = df.groupby('Month')['Sales'].sum().reset_index()

# Visualization
plt.figure(figsize=(10, 6))
plt.plot(monthly_sales['Month'].astype(str), monthly_sales['Sales'], marker='o')
plt.xlabel('Month')
plt.ylabel('Normalized Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()