Session1-DataCharacteristics

The document provides an overview of machine learning, its processes, and its applications in various fields such as healthcare and finance. It discusses the importance of data analytics, including descriptive, diagnostic, predictive, and prescriptive analytics, as well as exploratory data analysis (EDA) techniques. Additionally, it highlights the differences between structured, unstructured, and semi-structured data, and emphasizes the role of data processing in deriving insights from raw data.


Demystifying Machine Learning

Dr. J.V. Benifa


Associate Dean,
Faculty of AI and Data Science & Lead- BigML Labs,
Indian Institute of Information Technology Kottayam
(Under Ministry of Education, Govt. of India)
bigml@iiitkottayam
Artificial Intelligence
• The need for intelligence in real-world applications stems from
the complex and challenging nature of problems that humans
encounter in various domains.
• Machine learning is used as a foundational technology in the
development of AI systems.
• AI systems must recognize and interpret complex patterns, understand natural
language, detect objects or faces in images, make recommendations, and more.

➢Handling Complexity
➢Decision-Making and Optimization
➢Adaptability to Changing Environments
➢Automation and Efficiency
What is Machine Learning?
• Machine learning is a set of methods that can automatically
detect patterns in data.
• These uncovered patterns are then used to predict future
data, or to perform other kinds of decision-making under
uncertainty.
• The key premise is learning from data!!
• Addresses the problem of analyzing huge bodies of data so
that they can be understood.
• Providing techniques to automate the analysis and
exploration of large, complex data sets.
• Tools, methodologies, and theories for revealing patterns in
data – critical step in knowledge discovery.
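The premise of learning patterns from data can be sketched with a toy example: fit a line to noisy observations, then use the learned parameters to predict unseen inputs. NumPy's `polyfit` stands in here for the learning algorithm; the data is invented.

```python
import numpy as np

# Toy data: outputs follow y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, size=x.size)

# "Learning" is fitting parameters that explain the data:
# np.polyfit finds the slope and intercept minimising squared error.
slope, intercept = np.polyfit(x, y, 1)

# The uncovered pattern can now predict future data.
prediction = slope * 12 + intercept
print(round(slope, 1), round(intercept, 1))  # close to 2.0 and 1.0
```

The key point matches the bullets above: the program was never told the rule y = 2x + 1; it detected the pattern automatically from data.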
Machine Learning Process
Examples
• Machine learning plays a key role in many areas of science, finance and industry:
• Predict whether a patient, hospitalized due to a heart attack, will have a second
heart attack.
• The prediction is to be based on demographic, diet and clinical measurements
for that patient.
• Predict the price of a stock in 6 months from now, on the basis of company
performance measures and economic data.
• Identify the numbers in a handwritten ZIP code, from a digitized image.
• Estimate the amount of glucose in the blood of a diabetic person, from the
infrared absorption spectrum of that person’s blood.
• Identify the risk factors for prostate cancer, based on clinical and demographic
variables.
The Modelling Process

1. Define Business Problem
2. Define Hypotheses
3. Collect Data
4. Analyze Data
5. Develop Predictive Model
6. Optimize Model
7. Determine Best Fit
8. Utilize Model/Score New Data
9. Monitor Model
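As a rough illustration, a few of these steps (collect, analyze, model, score) can be sketched in plain Python; every value, threshold, and the "running mean" model are invented purely for illustration, not a prescribed method.

```python
# Hypothetical walk through part of the modelling process.
history = [4.2, 5.1, 3.8, 4.9, 5.0]              # collect data
cleaned = [v for v in history if v is not None]  # analyze/clean data

# Develop a (trivially simple) predictive model: the running mean.
model_mean = sum(cleaned) / len(cleaned)

# Score new data: flag readings far from the learned mean
# (the 2.0 threshold is an arbitrary illustrative choice).
new_reading = 9.7
is_anomaly = abs(new_reading - model_mean) > 2.0
print(round(model_mean, 2), is_anomaly)  # 4.6 True
```

Real pipelines add the remaining steps: optimizing the model, comparing candidates for best fit, and monitoring the deployed model over time.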
Background of Learning Process

Data Science & AI Research Group, IIIT Kottayam


Machine Learning Vs Deep Learning
Data and Exploratory Analysis

Dr. J.V. Bibal Benifa


Associate Dean,
Faculty of AI and Data Science & Lead- BigML Labs,
Indian Institute of Information Technology Kottayam
(Under Ministry of Education, Govt. of India)

bigml@iiitkottayam
Contents

1. Introduction to Data

2. Data Analytics

3. Exploratory Data Analysis


DATA
• Data refers to distinct pieces of information, typically formatted in a
specific way, that can be measured, collected, reported, and
analyzed.
• It encompasses various forms such as numbers, text, sound,
images, or any other format.
• Data can be generated by humans, machines, or a combination of
both, and it can be stored in structured or unstructured formats.
Data vs Information
Types of Data – Based on nature of data
Types of Data – Based on how data is organized
Structured Data
Structured data is organized and designed in a specific way to make it easily
readable and understandable by both humans and machines.
Advantages of Structured Data
• Easy to understand and use
• Consistency
• Efficient storage and retrieval
Disadvantages of Structured Data
• Inflexibility
• Limited complexity
• Limited context
Examples:
• Customer names and contact information
• Transaction records in finance
• Patient records and diagnostic reports in healthcare
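For instance, structured data maps naturally onto a table with named, consistently typed columns. A hypothetical customer table in pandas (all names and values invented):

```python
import pandas as pd

# Structured data fits a fixed schema: named columns, consistent types.
customers = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "email": ["asha@example.com", "ravi@example.com"],
    "last_purchase": [1200.50, 89.99],
})

# The schema makes storage and retrieval straightforward:
big_spenders = customers[customers["last_purchase"] > 100]
print(big_spenders["name"].tolist())  # ['Asha']
```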
Unstructured Data
• Unstructured data refers to information that does not have a predefined data
model or structure, making it challenging to collect, process, and analyze
using traditional data management tools.
Advantages of Unstructured Data:
• The data is not constrained by a fixed schema
• Very flexible due to the absence of a schema.
• Data is portable
• It is very scalable.
Disadvantages of Unstructured Data :
• It is difficult to store and manage unstructured data due to lack of schema and
structure.
• Indexing the data is difficult and error-prone due to unclear structure.
• Ensuring the security of data is a difficult task.
Semi-structured Data
• Semi-structured data is a type of data that is not purely structured, but also
not completely unstructured. It contains some level of organization or
structure, but does not conform to a rigid schema or data model, and may
contain elements that are not easily categorized or classified.
Advantages of Semi-structured Data:
• The data is not constrained by a fixed schema
• Flexible i.e. Schema can be easily changed.
• Data is portable
• It is possible to view structured data as semi-structured data.
Disadvantages of Semi-structured data:
• Lack of a fixed, rigid schema makes the data difficult to store.
• Interpreting relationships between data elements is difficult.
• Queries are less efficient as compared to structured data.
• Complexity
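A common example of semi-structured data is JSON: records share key-value organization, but fields may vary per record and need not follow one rigid schema (the records below are invented):

```python
import json

# Two records with some shared organization but differing fields.
records = [
    '{"name": "Asha", "email": "asha@example.com"}',
    '{"name": "Ravi", "phone": "+91-000", "tags": ["vip"]}',
]
parsed = [json.loads(r) for r in records]

# Queries must tolerate missing fields -- the flexibility and the
# cost of semi-structured data in one line:
emails = [r.get("email", "unknown") for r in parsed]
print(emails)  # ['asha@example.com', 'unknown']
```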
Lifecycle of Data:
Data analytics

• Data analytics is the collection, transformation, and organization of
data in order to draw conclusions, make predictions, and drive
informed decision-making.
• Data analytics is a multidisciplinary field that employs a wide range of
analysis techniques, including math, statistics, and computer science,
to draw insights from data sets.
• Data analytics is a broad term that includes everything from simply
analyzing data to theorizing ways of collecting data and creating the
frameworks needed to store it.
Data Analytics using AI
1. Descriptive Analytics
Descriptive analytics focuses on
summarizing historical data to
identify trends and patterns.

2. Diagnostic Analytics
Diagnostic analytics is used to
identify the root causes or reasons
behind a trend or anomaly
observed in the data.
Data Analytics using AI
3. Predictive Analytics
Predictive analytics helps forecast future events based on historical
data and trends

4. Prescriptive Analytics
Prescriptive analytics suggests actions to take based on the data
analysis and predictions, providing recommendations on the best
course of action.
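The four types can be contrasted on a toy sales series. The "prediction" here is a naive linear extrapolation and the "prescription" a hard-coded rule, both purely for illustration:

```python
# Toy monthly sales figures.
sales = [100, 110, 120, 130]

# Descriptive: summarize what happened historically.
average = sum(sales) / len(sales)            # 115.0

# Predictive: forecast the next value from the historical trend
# (naive linear extrapolation; real forecasting is far richer).
step = (sales[-1] - sales[0]) / (len(sales) - 1)
forecast = sales[-1] + step                  # 140.0

# Prescriptive: recommend an action based on the prediction.
action = "increase stock" if forecast > average else "hold"
print(average, forecast, action)
```

Diagnostic analytics would sit between the first two steps, asking *why* sales rose each month.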
Exploratory Data Analysis:
• Exploratory Data Analysis (EDA) is the method of studying and
exploring data sets to understand their key characteristics:
analyzing and visualizing data to uncover patterns, locate
outliers, and identify relationships between variables.
• EDA is normally carried out as a preliminary step before
undertaking more formal statistical analyses or modeling.
Key aspects of EDA:
• Distribution of Data: Examining the distribution of data points to understand their
range, central tendencies (mean, median), and dispersion (variance, standard
deviation).
• Graphical Representations: Utilizing charts such as histograms, box plots, scatter
plots, and bar charts to visualize relationships within the data and distributions of
variables.
• Outlier Detection: Identifying unusual values that deviate from other data points.
Outliers can influence statistical analyses and might indicate data entry errors or
unique cases.
• Correlation Analysis: Checking the relationships between variables to understand
how they might affect each other. This includes computing correlation coefficients
and creating correlation matrices.
• Handling Missing Values: Detecting and deciding how to address missing data
points, whether by imputation or removal, depending on their impact and the
amount of missing data.
• Summary Statistics: Calculating key statistics that provide insight into data trends.
• Testing Assumptions: Many statistical tests and models assume the data meet
certain conditions. EDA helps verify these assumptions.
Python libraries for EDA

• Pandas: Provides extensive functions for data manipulation and
analysis, including data structure handling and time series
functionality.
• Matplotlib: A plotting library for creating static, interactive, and
animated visualizations in Python.
• Seaborn: Built on top of Matplotlib, it provides a high-level interface
for drawing attractive and informative statistical graphics.
• Plotly: An interactive graphing library for making interactive plots; it
offers more sophisticated visualization capabilities.
Introduction to Data Processing Steps in EDA:
• Data processing is essential to derive insights and value from raw
data.
• Key stages:
• Data Collection
• Data Cleaning
• Data Transformation
• Data Integration
• Data Aggregation
• Data Visualization
Data Collection
• Understand the Problem and the Data
• Key Points:
• Identify the business goal or research question.
• Understand variables and their representation.
• Determine data types (numerical, categorical,
text).
• Address known data quality issues or domain-specific concerns.
Data Cleaning
Import and Inspect the Data
• Key Points:
• Load data into your analysis
environment (e.g., Python, R).
• Check data size (rows, columns)
and inspect for missing values.
• Identify data types and spot
inconsistencies or errors.
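A sketch of this inspection step with pandas, loading a small CSV from a string instead of a file (the columns are invented):

```python
import io
import pandas as pd

# A tiny CSV with one missing temperature value.
raw = io.StringIO("id,city,temp\n1,Kochi,31\n2,Delhi,\n3,Pune,28\n")
df = pd.read_csv(raw)

print(df.shape)                  # (3, 3): rows and columns
print(df.isna().sum()["temp"])   # 1 missing value to handle
print(df.dtypes["id"])           # spot unexpected types early
```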
Data Transformation
• Definition: Changing the format,
structure, or values of data for analysis.
• Objective: Convert data into usable
forms, like scaling or aggregating.
• Common Techniques: Normalization,
aggregation, creating new variables.
• Use Case: Creating a column for crime
description length.
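The description-length use case above might look like this in pandas (column names are illustrative):

```python
import pandas as pd

# Derive a new column: the length of each crime description.
crimes = pd.DataFrame({"description": ["theft", "vandalism reported"]})
crimes["desc_length"] = crimes["description"].str.len()
print(crimes["desc_length"].tolist())  # [5, 18]
```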
Data Integration
• Definition: Combining data from
different sources into one cohesive
dataset.
• Objective: Create unified datasets for
comprehensive analysis.
• Common Techniques: Merging datasets,
concatenating datasets.
• Use Case: Merging crime data with
weather data to analyze trends.
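A toy version of the crime-plus-weather merge, assuming both tables share a date key:

```python
import pandas as pd

crime = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"],
                      "incidents": [5, 3]})
weather = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"],
                        "rain_mm": [0.0, 12.5]})

# Inner join on the shared key yields one cohesive dataset.
merged = crime.merge(weather, on="date", how="inner")
print(merged.columns.tolist())  # ['date', 'incidents', 'rain_mm']
```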
Data Aggregation
• Definition: Summarizing or grouping
data into categories.
• Objective: Reduce datasets by
summarizing key statistics.
• Common Techniques: Grouping by
categories, calculating summary
statistics.
• Use Case: Aggregating crime data by
year.
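The by-year aggregation use case with a pandas `groupby`, on invented counts:

```python
import pandas as pd

df = pd.DataFrame({"year": [2023, 2023, 2024],
                   "incidents": [4, 6, 3]})

# Group by category and compute a summary statistic per group.
by_year = df.groupby("year")["incidents"].sum()
print(by_year.loc[2023], by_year.loc[2024])  # 10 3
```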
Data Filtering
• Definition: Selecting data subsets based on criteria.
• Objective: Focus analysis on relevant data points.
• Common Techniques: Filtering rows/columns based on conditions.
• Use Case: Filtering crime data for a specific year.
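Row filtering by condition in pandas, matching the use case above (toy data):

```python
import pandas as pd

df = pd.DataFrame({"year": [2023, 2024, 2024],
                   "type": ["theft", "fraud", "theft"]})

# Keep only the rows relevant to the analysis.
only_2024 = df[df["year"] == 2024]
print(len(only_2024))  # 2
```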
Data Reshaping
• Definition: Changing the layout of a dataset (pivoting, melting).
• Objective: Rearrange data for easier analysis or visualization.
• Common Techniques: Pivot tables, melting, transposing data.
• Use Case: Creating pivot tables for crimes by year.
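The crimes-by-year pivot table sketched with pandas `pivot_table`, reshaping long-format records into a wide table (toy data):

```python
import pandas as pd

long = pd.DataFrame({
    "year": [2023, 2023, 2024, 2024],
    "type": ["theft", "fraud", "theft", "fraud"],
    "count": [4, 2, 3, 5],
})

# Pivot: crime types become rows, years become columns.
wide = long.pivot_table(index="type", columns="year", values="count")
print(wide.loc["theft", 2024])  # 3.0
```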
Data Visualization
• Definition: Graphical
representation of data to identify
trends, patterns, and outliers.
• Objective: Visually understand the
distribution of data, correlations,
and variable relationships.
• Common Techniques: Bar plots,
histograms, scatter plots, and
heatmaps.
• Use Case: Visualizing crime types
using bar charts or trends over
time.
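A bar chart of crime types with Matplotlib, rendered off-screen via the Agg backend so no display is needed (counts are invented):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import matplotlib.pyplot as plt

# Bar chart of incidents per crime type.
types = ["theft", "fraud", "assault"]
counts = [40, 25, 10]

fig, ax = plt.subplots()
ax.bar(types, counts)
ax.set_xlabel("Crime type")
ax.set_ylabel("Incidents")
fig.savefig("crime_types.png")
```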
Other tools:
Use Case 1:
Netflix - Personalized Recommendations
• Challenge: Netflix needed to improve user experience and
engagement by providing personalized content recommendations.
• Solution: Netflix used data analytics to analyze users' viewing history,
preferences, and behavior patterns.
• Outcome: Enhanced recommendations using machine learning
algorithms, leading to higher user retention and engagement.
Increased time spent on the platform and more content
consumption by users.
• Impact: The recommendation system is a key factor in Netflix's
success, with over 80% of watched content driven by
personalized recommendations.
Use Case 2:
Walmart - Inventory Optimization
• Challenge: Walmart struggled with inventory management across
thousands of stores.
• Solution: By applying data analytics, Walmart analyzed sales data,
inventory levels, and supply chain data to predict demand.
• Outcome: Optimized inventory management and reduced stockouts.
Improved supply chain efficiency and reduced operational costs.
• Impact: Data analytics helped Walmart increase operational efficiency
and improve customer satisfaction by ensuring product availability.
Summary
• Data: Raw facts and figures that are collected and stored for analysis.
• Data Analysis: The process of inspecting, cleaning, transforming, and
modeling data to extract useful insights and support decision-making.
• Exploratory Data Analysis (EDA): The initial step of data analysis focused
on summarizing main characteristics of data, often using visual methods,
before applying formal modeling.
• EDA Tools: Common techniques in EDA include visualizations (histograms,
scatter plots) and statistical summaries (mean, median, standard
deviation).
• Purpose of Data Analysis: To uncover patterns, identify trends, test
hypotheses, and derive actionable insights for informed decision-making.