
UNIT-I

INTRODUCTION AND DATA FOUNDATION


Introduction:
Data visualization is the graphical representation of information and data.

By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way
to see and understand trends, outliers, and patterns in data.

Importance of Data Visualization:

The importance of data visualization is simple: it helps people see, interact with, and better understand
data. Whether simple or complex, the right visualization can bring everyone on the same page, regardless
of their level of expertise.

1. Simplifies Complex Data: Transforms large datasets into a more understandable format.

2. Reveals Patterns and Trends: Helps in identifying trends, correlations, and outliers.

3. Enhances Data Analysis: Makes data analysis more efficient and insightful.

4. Improves Decision Making: Facilitates quicker and better decision-making.

5. Communication Tool: Aids in communicating data-driven insights clearly and effectively.

Types of Data Visualizations

 Chart: Information presented in graphical form with data displayed along two axes. Can take the form of a graph, diagram, or map.

 Table: A set of figures displayed in rows and columns.


 Graph: A diagram of points, lines, segments, curves, or areas that represents certain variables in
comparison to each other, usually along two axes at a right angle.

 Geospatial: A visualization that shows data in map form using different shapes and colors to show
the relationship between pieces of data and specific locations.

1. Charts:

Examples:

o Bar Charts: Compare quantities across categories.

o Line Charts: Show trends over time.

o Pie Charts: Represent proportions within a whole.

o Histogram: Display the distribution of a dataset.

o Scatter Plots: Show relationships between two variables.

2. Graphs:

Examples:

o Network Graphs: Display relationships and interconnections.

o Flowcharts: Illustrate processes or workflows.

o Tree Maps: Represent hierarchical data.

3. Maps:

o Heat Maps: Show data density on a geographical map.

o Choropleth Maps: Use color gradients to represent data values across geographical
regions.

Tools and Software for Data Visualization

 Microsoft Excel: Basic charts and graphs.

 Tableau: Advanced interactive visualizations.

 Power BI: Integrates with various data sources for interactive reports.

 D3.js: JavaScript library for producing dynamic, interactive data visualizations in web browsers.

 Google Data Studio: Free tool for creating dashboards and reports.

Relationship between Data Visualization and Other Fields

Data visualization is a multidisciplinary field that intersects with various other domains, enhancing the way
information is interpreted and communicated. Here’s a look at how data visualization interacts with and
benefits different fields:

1. Statistics

Relationship:

 Enhancement of Statistical Analysis: Data visualization tools are used to illustrate statistical
findings, making complex data more accessible and understandable.

 Exploratory Data Analysis (EDA): Visual techniques help in identifying patterns, trends, and
outliers in data, which are crucial for statistical analysis.

2. Computer Science

Relationship:

 Algorithms and Programming: Data visualization requires efficient algorithms to process and
render data effectively.

 Human-Computer Interaction (HCI): Focuses on designing user-friendly visualization tools that facilitate interaction and understanding.

3. Business Intelligence (BI)

Relationship:

 Decision Support Systems: Data visualization is a key component of BI tools, helping businesses
make data-driven decisions.

 Performance Metrics: Visual dashboards display key performance indicators (KPIs) and metrics in
an easily digestible format.

4. Healthcare

Relationship:

 Medical Imaging: Visualization techniques are used to interpret complex medical images (e.g.,
MRIs, CT scans).

 Epidemiology: Visualizing data helps track the spread of diseases and the effectiveness of
interventions.

5. Environmental Science

Relationship:

 Climate Data Analysis: Visualization helps in understanding and communicating climate change
data and environmental impacts.

 Geospatial Analysis: Maps and geographic visualizations are used to study environmental
phenomena and resource distribution.

6. Finance

Relationship:

 Market Analysis: Financial data visualization aids in analyzing stock market trends and investment
performance.
 Risk Management: Visualization tools help in assessing and communicating financial risks.

7. Education

Relationship:

 Interactive Learning Tools: Visual aids and interactive dashboards are used in educational settings
to facilitate learning and engagement.

 Curriculum Development: Data visualization assists educators in analyzing student performance and curriculum effectiveness.

8. Social Sciences

Relationship:

 Survey Data Analysis: Visualization helps in interpreting data from social science research, such as
surveys and experiments.

 Behavioral Studies: Visual tools are used to analyze and present findings in psychology and
sociology.

9. Journalism

Relationship:

 Data Journalism: Visualizations are used to tell compelling stories with data, making complex
information accessible to a broad audience.

 Infographics: Journalists use infographics to summarize and highlight key points in their articles.

Benefits:

 Increased reader engagement and comprehension.

 Effective communication of complex stories.

10. Marketing

Relationship:

 Customer Insights: Visualization tools analyze customer data to understand behavior and
preferences.

 Campaign Performance: Marketers use dashboards to track and visualize the performance of
marketing campaigns.

Data Visualization Process:


Data visualization is a process that transforms raw data into graphical representations to help
communicate insights and findings effectively. Here's a step-by-step guide to the data visualization process:

1. Define Your Objectives:

o Purpose: Understand why you need to visualize the data. Is it to identify trends, make
decisions, or communicate findings?

2. Collect and Prepare Data:

o Gather Data: Collect the necessary data from various sources such as databases,
spreadsheets, or APIs.
o Clean Data: Clean the data by handling missing values, removing duplicates, and
correcting errors. Ensure the data is in a suitable format for analysis.

3. Understand the Data:

o Explore Data: Use statistical methods and exploratory data analysis (EDA) to understand
the data’s structure, patterns, and relationships.

o Identify Key Metrics: Determine the key metrics and dimensions that are most relevant to
your objectives.

4. Choose the Right Visualization Type:

o Match Data to Visualization: Select the most appropriate type of visualization based on
the data and the insights you want to convey. Common types include bar charts, line
graphs, scatter plots, pie charts, histograms, heatmaps, and more.

o Consider Complexity: For complex data sets, consider using advanced visualizations like
treemaps, network diagrams, or interactive dashboards.

5. Design the Visualization:

o Create Layout: Design a clear and logical layout for your visualization. Organize the
elements in a way that guides the viewer’s eye to the most important information.

o Use Colors and Styles: Use color, shapes, and styles effectively to highlight key insights and
make the visualization aesthetically pleasing.

o Add Labels and Annotations: Include titles, axis labels, legends, and annotations to
provide context and make the visualization self-explanatory.

6. Implement the Visualization:

o Choose Tools: Select appropriate tools and software for creating the visualization. Popular
tools include:

 Excel and Google Sheets: For simple charts and graphs.

 Tableau and Power BI: For interactive and complex dashboards.

 Python Libraries (Matplotlib, Seaborn, Plotly): For customizable and advanced visualizations.

 R (ggplot2, Shiny): For statistical and customized visualizations.

o Create Visualization: Use the chosen tool to create your visualization, applying the design
principles you’ve planned.

7. Present and Share:

o Present: Share the visualization in presentations, reports, or online platforms, ensuring it is accessible to your intended audience.

o Provide Context: Include a narrative or explanation to guide viewers through the visualization and highlight the key takeaways.

Pseudocode Conventions - The Scatter Plot


Pseudocode is a method of describing a process, program, or algorithm using a natural language such as English. It is not the code itself, but rather a description of what the code should do. In other words, it serves as a detailed yet understandable step-by-step plan or blueprint from which a program can be written. It is like a rough draft of a program or an algorithm before it is implemented in a programming language.

Definition: A scatter plot is a graph that presents the relationship between two variables in a dataset. It represents data points on a two-dimensional plane. The independent variable (or attribute) is plotted on the X-axis, while the dependent variable is plotted on the Y-axis. These plots are often called scatter graphs or scatter diagrams.

A scatter plot is a diagram where each value in the data set is represented by a dot.

Day of the week    Sales in $
1                  250
2                  280
3                  380
4                  260
5                  300
6                  240
7                  180

Creating a scatter plot involves plotting points on a two-dimensional graph based on pairs of numerical data. Here's a pseudocode outline for generating a scatter plot:
Scatter Plot

1. Initialize Data

o Define arrays/lists for X and Y coordinates.

2. Setup Plot

o Create a canvas or graph where the scatter plot will be drawn.

3. Plot Points

o Iterate over the data points and plot each (X, Y) coordinate on the canvas.

4. Add Labels and Titles

o Add axis labels and a title to the scatter plot for clarity.

5. Display Plot

o Render the scatter plot on the screen.

Here's the detailed pseudocode:

BEGIN
    // Step 1: Initialize Data
    DECLARE List X = [x1, x2, x3, ..., xn]
    DECLARE List Y = [y1, y2, y3, ..., yn]

    // Step 2: Setup Plot
    CALL CreateCanvas(width, height)
    SET CanvasTitle = "Scatter Plot"
    SET XAxisLabel = "X-Axis"
    SET YAxisLabel = "Y-Axis"

    // Step 3: Plot Points
    FOR i FROM 0 TO LENGTH(X) - 1 DO
        PLOT_POINT(X[i], Y[i])
    END FOR

    // Step 4: Add Labels and Titles
    SET Canvas.XLabel = XAxisLabel
    SET Canvas.YLabel = YAxisLabel
    SET Canvas.Title = CanvasTitle

    // Step 5: Display Plot
    CALL DisplayCanvas()
END
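
As a concrete follow-up, here is a minimal Python sketch of the same steps, plotting the weekly sales table above with matplotlib (the styling choices are illustrative):

import matplotlib.pyplot as plt

# Step 1: Initialize data from the weekly sales table above
days = [1, 2, 3, 4, 5, 6, 7]
sales = [250, 280, 380, 260, 300, 240, 180]

# Steps 2-3: Plot each (day, sales) pair as a dot
plt.scatter(days, sales)

# Step 4: Add labels and a title for clarity
plt.xlabel('Day of the week')
plt.ylabel('Sales in $')
plt.title('Scatter Plot')

# Step 5: Render the plot on screen
plt.show()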

Introduction to Data Foundation:


Building a strong data foundation is essential for any data-driven initiative, including data visualization.

A data foundation refers to the fundamental infrastructure, processes, and strategies that lay the
groundwork for effectively collecting, managing, storing, organizing, and leveraging enterprise data.

A robust data foundation ensures that the data is accurate, reliable, and prepared for analysis, which is
crucial for generating meaningful insights and making informed decisions.

Key Components of Data Foundation

1. Data Collection: The process of gathering data from various sources.

2. Data Storage: Storing the collected data in a structured manner.

3. Data Cleaning: Ensuring the data is free from errors and inconsistencies.

4. Data Integration: Combining data from different sources to provide a unified view.

5. Data Preparation: Transforming data into a format suitable for analysis and visualization.

Data Collection

Data Sources

1. Internal Sources: Data generated within an organization.

o Examples: Operational databases, CRM systems, financial records.

2. External Sources: Data obtained from outside the organization.

o Examples: Market research reports, social media data, third-party datasets.

Data Collection Methods

1. Surveys and Questionnaires: Gathering data directly from respondents.

2. Interviews: Collecting detailed information through personal or group interviews.

3. Observations: Recording data based on observed behaviors or events.

4. Transactional Data: Capturing data from transactions or activities.


5. Web Scraping: Extracting data from websites using automated tools.

Data Storage

Types of Data Storage Systems

1. Databases: Structured storage systems for organizing and retrieving data.

o Examples: SQL databases (MySQL, PostgreSQL), NoSQL databases (MongoDB, Cassandra).

2. Data Warehouses: Centralized repositories for storing large volumes of data from multiple
sources.

o Examples: Amazon Redshift, Google BigQuery, Snowflake.

3. Data Lakes: Storage systems that hold raw data in its native format.

o Examples: Hadoop, Azure Data Lake, AWS Lake Formation.

4. Cloud Storage: Scalable storage solutions provided by cloud service providers.

o Examples: Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage.

Data Cleaning

Common Data Cleaning Tasks

1. Handling Missing Data: Filling in or removing missing values.

2. Removing Duplicates: Identifying and removing duplicate records.

3. Correcting Errors: Fixing inaccuracies in the data.

4. Standardizing Data: Ensuring consistency in data formats and values.

Data Integration

Data Integration Techniques

1. Merging Datasets: Combining data from different sources into a single dataset.

2. Joining Tables: Linking tables based on common keys (see the pandas sketch after this list).

3. ETL (Extract, Transform, Load): Extracting data from sources, transforming it into the desired
format, and loading it into a storage system.

4. APIs: Using Application Programming Interfaces to integrate data from different systems.
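
To make merging and joining concrete, here is a minimal pandas sketch; the table names and values are made up for illustration:

import pandas as pd

# Two illustrative tables that share a common key, CustomerID (values are made up)
customers = pd.DataFrame({'CustomerID': [1, 2, 3],
                          'Name': ['Asha', 'Ravi', 'Meena']})
orders = pd.DataFrame({'CustomerID': [1, 2, 2],
                       'Amount': [250, 120, 300]})

# Join the tables on the common key to produce a unified view
merged = pd.merge(customers, orders, on='CustomerID', how='inner')
print(merged)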

Data Preparation

Data Preparation Steps

1. Data Transformation: Converting data into a suitable format for analysis.

o Techniques: Normalization, aggregation, encoding categorical variables.

2. Data Enrichment: Enhancing data with additional information.

o Examples: Adding geolocation data, appending demographic information.

3. Data Validation: Ensuring data accuracy and completeness.

o Techniques: Cross-checking with other data sources, validating against known benchmarks.

Tools for Data Handling

Data Collection Tools

 Survey Tools: SurveyMonkey, Google Forms.

 Web Scraping Tools: BeautifulSoup, Scrapy.

Data Storage Tools

 Database Management Systems: MySQL, MongoDB.

 Cloud Storage Solutions: Amazon S3, Google Cloud Storage.

Data Cleaning Tools

 Data Cleaning Software: OpenRefine, Trifacta.

 Programming Libraries: Pandas (Python), dplyr (R).

Data Integration Tools

 ETL Tools: Talend, Apache NiFi.

 API Management Tools: Postman, Swagger.

Data Preparation Tools

 Data Transformation Libraries: Pandas (Python), DataWrangler.

 Data Validation Tools: Great Expectations, DataCleaner.

DATA: Data is defined as facts, figures, or information that is stored in or used by a computer. An example of data is information collected for a research paper; another example of data is an email.

Types of Data

Understanding the different types of data (in statistics, marketing research, or data science) allows you to pick the data type that most closely matches your needs and goals.

Qualitative Data (Categorical Data)

As the name suggests, qualitative data describes the qualities or characteristics of the data. Qualitative data is also called categorical data because it sorts the data into categories. It includes data such as the gender of people and their family names in a sample of population data.

Qualitative data is further divided into two categories:

 Nominal Data

 Ordinal Data

Nominal Data

Nominal data is a type of data that consists of categories or names that cannot be ordered or ranked. Nominal data is often used to categorize observations into groups, and the groups are not comparable. In other words, nominal data has no inherent order or ranking. Examples of nominal data include gender (male or female), race (White, Black, Asian), religion (Hinduism, Christianity, Islam, Judaism), and blood type (A, B, AB, O).

Nominal data can be represented using frequency tables and bar charts, which display the number or
proportion of observations in each category. For example, a frequency table for gender might show the
number of males and females in a sample of people.
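
As a minimal sketch, such a frequency table can be produced with pandas; the sample values are made up:

import pandas as pd

# Illustrative gender column from a sample of people (values are made up)
gender = pd.Series(['Male', 'Female', 'Female', 'Male', 'Female'])

# Frequency table: count of observations in each category
print(gender.value_counts())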

Nominal data is analyzed using non-parametric tests, which do not make any assumptions about the
underlying distribution of the data. Common non-parametric tests for nominal data include Chi-Squared
Tests and Fisher’s Exact Tests. These tests are used to compare the frequency or proportion of
observations in different categories.
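
For illustration, a Chi-Squared test on a small made-up frequency table can be run with scipy's chi2_contingency:

from scipy.stats import chi2_contingency

# Illustrative 2x2 frequency table (made-up counts):
# rows = gender, columns = two response categories
observed = [[30, 10],
            [20, 25]]

# Test whether the category proportions differ between the groups
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)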

Ordinal Data

Ordinal data is a type of data that consists of categories that can be ordered or ranked. However, the
distance between categories is not necessarily equal. Ordinal data is often used to measure subjective
attributes or opinions, where there is a natural order to the responses. Examples of ordinal data include
education level (Elementary, Middle, High School, College), job position (Manager, Supervisor, Employee),
etc.

Ordinal data can be represented using bar charts and line charts. These displays show the order or ranking of the categories, but they do not imply that the distances between categories are equal.

Ordinal data is analyzed using non-parametric tests, which make no assumptions about the underlying
distribution of the data. Common non-parametric tests for ordinal data include the Wilcoxon Signed-Rank
test and Mann-Whitney U test.
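
For illustration, here is a Mann-Whitney U test comparing ordinal ratings from two made-up groups, using scipy:

from scipy.stats import mannwhitneyu

# Illustrative ordinal ratings (1 = lowest, 5 = highest) for two groups; values are made up
group_a = [3, 4, 2, 5, 4, 3]
group_b = [2, 1, 3, 2, 2, 3]

# Test whether one group tends to have higher ratings than the other
stat, p_value = mannwhitneyu(group_a, group_b)
print(stat, p_value)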

Quantitative Data (Numerical Data)

Quantitative data is data that represents numerical values; it is also called numerical data. It is used to represent measurable quantities such as height, weight, and length. Quantitative data is further classified into two categories:

 Discrete Data

 Continuous Data

Discrete Data
Discrete data is a type of data in statistics that takes only distinct, separate values, which can be counted as whole numbers. Examples of discrete data include:

 Number of students in a class

 Marks of the students in a class test

 Number of members in a family, etc.

Continuous Data

Continuous data is quantitative data that represents values in a continuous range; a variable in the dataset can take any value within the range of the data. Examples of continuous data include:

 Temperature ranges

 Salary ranges of workers in a factory, etc.

Structure of Data Within Records


Within a record, data is organized as a single unit, typically corresponding to a row in a table or a record in
a database. Each record consists of multiple fields or attributes, each containing a piece of data.

Key Elements

1. Attributes/Fields:

o These are individual pieces of data within a record. For example, in a dataset of customer
information, fields might include CustomerID, Name, Age, and PurchaseAmount.

2. Data Types:

o Each attribute has a data type, such as integer, float, string, or date. Proper data types
ensure that the data is correctly interpreted and manipulated.
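
As a small illustration, one record with typed fields can be sketched in pandas; the field names follow the customer example above, and the values and inferred dtypes are assumptions:

import pandas as pd

# One record (one row); each column is a field with its own data type
record = pd.DataFrame({'CustomerID': [101],
                       'Name': ['Asha'],
                       'Age': [29],
                       'PurchaseAmount': [149.50]})

# Inspect the data type of each field: integer, string, integer, float
print(record.dtypes)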

Structure of Data in Data Visualization:


In data visualization, the structure of data plays a crucial role in determining how the data is represented
and visualized. There are various ways data can be structured depending on the type of visualization, the
purpose, and the characteristics of the data itself. Here are some common structures:

1. Tabular Structure

 Description: Data is arranged in rows and columns, much like a spreadsheet.

 Example: A table with columns like "Date," "Sales," and "Region."

 Use Cases: Best for simple data visualizations like bar charts, line graphs, or heat maps.

 Visualization: Bar chart, line chart, scatter plot.

2. Hierarchical Structure

 Description: Data is organized in a tree-like format, with parent-child relationships.


 Example: Organization charts, file directory structures, or any dataset with levels of categorization
(e.g., family trees).

 Use Cases: Useful for visualizing relationships and categorizations.

 Visualization: Tree maps, dendrograms, sunburst charts.

3. Network/Graph Structure

 Description: Data consists of nodes (entities) and edges (relationships) that connect the nodes.

 Example: Social networks, transportation networks, or web pages connected via hyperlinks.

 Use Cases: Shows relationships and interactions between entities.

 Visualization: Network graphs, force-directed graphs, radial graphs.

4. Geospatial Structure

 Description: Data is associated with geographical locations, often with latitude and longitude
coordinates.

 Example: Population density by region, meteorological data, or crime rates across cities.

 Use Cases: When geographic context is important for understanding the data.

 Visualization: Choropleth maps, heat maps, point maps.

5. Temporal Structure

 Description: Data is structured around time, where each data point is connected to a specific point
in time.

 Example: Time-series data, such as stock prices over time or website traffic trends.

 Use Cases: Analyzing changes and trends over time.

 Visualization: Line charts, area charts, Gantt charts.

6. Matrix Structure

 Description: Data is structured in a two-dimensional grid where rows and columns intersect to
form cells.

 Example: A confusion matrix in machine learning or correlation matrices between variables.

 Use Cases: Often used when comparing relationships between variables.

 Visualization: Heatmaps, correlation plots.

7. Textual/Unstructured Data

 Description: Text data, which does not follow a fixed format or structure.

 Example: Documents, social media posts, or articles.

 Use Cases: Useful in natural language processing (NLP) and sentiment analysis.

 Visualization: Word clouds, text graphs, frequency distributions.

8. Multi-dimensional (n-D) Structure

 Description: Data with more than two dimensions (e.g., multiple features or categories) is represented.

 Example: A dataset with multiple attributes like age, gender, income, and education level.
 Use Cases: Analyzing multi-dimensional datasets where several factors play a role.

 Visualization: Parallel coordinate plots, 3D scatter plots, radar charts.

Data Flow in Visualization

 Raw Data: Initially in a structured or unstructured format.

 Transformation: Data is cleaned, transformed, and possibly aggregated.

 Mapping: Data is mapped to visual attributes like position, size, color, shape, and orientation (see the sketch below).

 Rendering: Visualization is rendered based on mapped attributes, using tools like D3.js, Tableau, or
Python’s Matplotlib.
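
A minimal matplotlib sketch of the mapping step, with made-up values mapped to position, size, and color:

import matplotlib.pyplot as plt

# Made-up raw values
x = [1, 2, 3, 4]              # mapped to horizontal position
y = [10, 14, 9, 17]           # mapped to vertical position
size = [40, 80, 120, 60]      # mapped to marker size
value = [0.2, 0.5, 0.8, 0.3]  # mapped to color

# Render the mapped attributes as a scatter plot
plt.scatter(x, y, s=size, c=value, cmap='viridis')
plt.colorbar(label='Mapped value')
plt.title('Mapping Data to Visual Attributes')
plt.show()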

Data Preprocessing:
Data preprocessing is the process of converting raw data into an understandable format.

Data preprocessing is a crucial step in data visualization, as it prepares raw data for analysis and
visualization. The goal is to clean and transform the data to make it suitable for the intended analysis.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
Raw data can have many irrelevant and missing parts. Data cleaning is done to handle these parts. It involves handling of missing data, noisy data, etc.

 (a) Missing Data:

This situation arises when some values are missing from the dataset. It can be handled in various ways. Some of them are:

1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.

2. Fill the missing values:
There are various ways to do this task. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value.

 (b) Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole dataset is divided into segments of equal size, and each segment is handled separately. All data in a segment can be replaced by the segment's mean, or boundary values can be used to complete the task (see the sketch after this list).

2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers either go undetected or fall outside the clusters.
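
As a minimal sketch of the binning method (smoothing by bin means) with numpy, using made-up sorted values:

import numpy as np

# Made-up values, sorted so they can be split into equal-size segments
data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26]))

# Divide into three equal-size bins and replace each value with its bin's mean
bins = np.split(data, 3)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 7.  7.  7. 19. 19. 19. 25. 25. 25.]
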
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following techniques:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0); see the sketch after this list.

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.

3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
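
As a minimal sketch of normalization (min-max scaling) and discretization with pandas, using made-up values:

import pandas as pd

# Made-up numeric values
values = pd.Series([10, 20, 35, 50, 90])

# Normalization: min-max scaling into the range 0.0 to 1.0
normalized = (values - values.min()) / (values.max() - values.min())

# Discretization: replace raw values with interval levels
levels = pd.cut(values, bins=3, labels=['low', 'medium', 'high'])

print(normalized.tolist())
print(levels.tolist())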

3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the dataset while preserving the important information. This is done to improve the efficiency of data analysis and to avoid overfitting the model. Common data reduction steps include the following (a sketch of PCA follows this list):

1. Feature Selection:
This involves selecting a subset of relevant features from the dataset, often to remove irrelevant or redundant features. It can be done using techniques such as correlation analysis, mutual information, and principal component analysis (PCA).

2. Feature Extraction:
This involves transforming the data into a lower-dimensional space while preserving the important information. It is often used when the original features are high-dimensional and complex, and can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).

3. Sampling:
This involves selecting a subset of data points from the dataset to reduce its size while preserving the important information. It can be done using techniques such as random sampling, stratified sampling, and systematic sampling.

4. Clustering:
This involves grouping similar data points together into clusters and replacing similar data points with a representative centroid. It can be done using techniques such as k-means, hierarchical clustering, and density-based clustering.
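
As a minimal sketch of feature extraction, here is PCA with scikit-learn on a made-up dataset:

import numpy as np
from sklearn.decomposition import PCA

# Made-up dataset: 5 samples with 3 features each
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.2],
              [2.2, 2.9, 0.3],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1]])

# Reduce the three features to two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (5, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component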

Example: Preparing Data for Visualization

Let’s walk through a complete example of preprocessing a dataset for visualization:

Date        Product    Sales    Revenue
1/1/2024    A          100      2000
1/2/2024    B          150      3000
1/3/2024    A          200      4000
1/4/2024    C          50       1000
1/5/2024    B          300      6000
1/6/2024    A          250      5000

Key Steps in Data Preprocessing for Visualization

1. Data Collection
o Objective: Gather data from various sources, such as databases, APIs, or
spreadsheets.
o Example: Collect sales data from a company's sales database.
2. Data Cleaning
o Objective: Identify and correct errors, inconsistencies, and missing values in the
dataset.
o Tasks:
 Remove Duplicates: Ensure there are no duplicate records.
 Handle Missing Values: Fill, interpolate, or remove missing data.
 Correct Errors: Fix any inconsistencies or inaccuracies.

import pandas as pd

# Sample data with duplicates and missing values
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-02', None, '2024-01-04', '2024-01-05', '2024-01-05'],
    'Product': ['A', 'B', 'A', 'C', None, 'B'],
    'Sales': [100, 150, 200, None, 300, 300],
    'Revenue': [2000, 3000, 4000, 1000, 6000, 6000]
})

# Drop duplicate rows
df = df.drop_duplicates()

# Fill missing values
df['Date'] = df['Date'].ffill()                       # forward-fill missing dates
df['Product'] = df['Product'].fillna('Unknown')       # label missing products
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())  # impute with the column mean

3. Data Transformation

 Objective: Convert data into a format suitable for analysis and visualization.
 Tasks:
o Normalization/Scaling: Adjust the range of data values.
o Encoding Categorical Variables: Convert categorical data into numerical format.
o Date/Time Conversion: Ensure date and time data are in the correct format.

Example:
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Convert 'Date' to datetime type
df['Date'] = pd.to_datetime(df['Date'])

# Normalize 'Sales' and 'Revenue' to zero mean and unit variance
scaler = StandardScaler()
df[['Sales', 'Revenue']] = scaler.fit_transform(df[['Sales', 'Revenue']])

# Encode the categorical 'Product' column as integers
le = LabelEncoder()
df['Product'] = le.fit_transform(df['Product'])

4. Data Aggregation

 Objective: Summarize and group data for easier visualization.


 Tasks:
o Group By: Aggregate data based on certain columns.
o Compute Summary Statistics: Calculate totals, averages, or other statistics.

Example:

# Group by 'Product' and calculate total 'Sales'
product_sales = df.groupby('Product')['Sales'].sum().reset_index()

5. Feature Engineering
 Objective: Create new features or variables that can provide additional insights.
 Tasks:
o Create Derived Variables: Generate new columns based on existing data.
o Binning: Group numerical data into bins or categories.

Example:

# Add a 'Month' column derived from the date
df['Month'] = df['Date'].dt.to_period('M')

# Compute monthly sales totals
monthly_sales = df.groupby('Month')['Sales'].sum().reset_index()

Data Preprocessing Complete Code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Load dataset
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05', '2024-01-06'],
    'Product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Sales': [100, 150, 200, 50, 300, 250],
    'Revenue': [2000, 3000, 4000, 1000, 6000, 5000]
})

# Convert 'Date' to datetime type
df['Date'] = pd.to_datetime(df['Date'])

# Normalize 'Sales' and 'Revenue'
scaler = StandardScaler()
df[['Sales', 'Revenue']] = scaler.fit_transform(df[['Sales', 'Revenue']])

# Add 'Month' column
df['Month'] = df['Date'].dt.to_period('M')

# Aggregate data
monthly_sales = df.groupby('Month')['Sales'].sum().reset_index()

# Visualization
plt.figure(figsize=(10, 6))
plt.plot(monthly_sales['Month'].astype(str), monthly_sales['Sales'], marker='o')
plt.xlabel('Month')
plt.ylabel('Normalized Sales')
plt.title('Monthly Sales Trend')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
