Data Mining Group Project

Uploaded by

s285153

Data Mining

Presented By:

Chantelle Chifamba(301270348)
Khalid Dawd(301144241)
AbdulMujeeb Adesoye(301208797)
Jatinder Dosanjh
Data Mining
• Data mining is the process of automatically discovering useful information in
large data repositories.
• Human analysts may take weeks to discover useful information.
• Much of the data is never analyzed at all.

[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999.]
Largest databases in 2007
• Largest database in the world: World Data Centre for Climate (WDCC)
operated by the Max Planck Institute and German Climate Computing
Centre
• 220 terabytes of data on climate research and climatic trends,
• 110 terabytes worth of climate simulation data.
• 6 petabytes worth of additional information stored on tapes.
• AT&T
• 323 terabytes of information
• 1.9 trillion phone call records
• Google
• 91 million searches per day,
• After a year's worth of searches, this figure amounts to more than 33 trillion database entries.
What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

What is Data Mining?
– Certain names are more prevalent in certain locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Discover groups of similar documents on the Web
Origins of Data Mining
• Draws ideas from: machine learning/AI, statistics, and database
systems

[Figure: Data Mining shown at the intersection of Statistics, Machine Learning, and Database systems.]
Data Mining Tasks
Data mining tasks are generally divided into two major categories:

• Predictive tasks [Use some attributes to predict unknown or future values of other attributes.]
• Classification
• Regression
• Deviation Detection

• Descriptive tasks [Find human-interpretable patterns that describe the data.]
• Association Discovery
• Clustering
Predictive Data Mining or Supervised Learning

Predictive Data Mining is a type of advanced analytics that uses historical data, statistical modeling, Data Mining techniques, and Machine Learning to make predictions about future outcomes. Predictive analytics is used by businesses to find patterns in data and identify risks and opportunities.
Example problem
(Adapted from Leslie Kaelbling's example in the MIT courseware)

• Imagine that I'm trying to predict whether my neighbor is going to drive into work, so I can ask for a ride.

• Whether she drives into work seems to depend on the following attributes of the day:
• temperature,
• expected precipitation,
• day of the week,
• what she's wearing.
Memory
• Okay. Let's say we observe our neighbor on three days:

Temp  Precip  Day  Shop  Clothes  Outcome
25    None    Sat  No    Casual   Walk
-5    Snow    Mon  Yes   Casual   Drive
15    Snow    Mon  Yes   Casual   Walk
Memory

• Now, we find ourselves on a snowy, -5-degree Monday, when the neighbor is wearing casual clothes and going shopping.
• Do you think she's going to drive?

Temp  Precip  Day  Clothes  Outcome
25    None    Sat  Casual   Walk
-5    Snow    Mon  Casual   Drive
15    Snow    Mon  Casual   Walk
-5    Snow    Mon  Casual   ?
Memory

• The standard answer in this case is "yes".
• This day is just like one of the ones we've seen before, and so it seems like a good bet to predict "yes."
• This is about the most rudimentary form of learning, which is just to memorize the things you've seen before.

Temp  Precip  Day  Clothes  Outcome
25    None    Sat  Casual   Walk
-5    Snow    Mon  Casual   Drive
15    Snow    Mon  Casual   Walk
-5    Snow    Mon  Casual   Drive
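
Learning by memorization can be sketched as a plain lookup table. This is only an illustration of the idea on the slide: the name `predict_by_memory` is hypothetical, and the dictionary holds the three observed days from the table above.

```python
# The three days observed so far: (temp, precip, day, clothes) -> outcome.
OBSERVED = {
    (25, "None", "Sat", "Casual"): "Walk",
    (-5, "Snow", "Mon", "Casual"): "Drive",
    (15, "Snow", "Mon", "Casual"): "Walk",
}


def predict_by_memory(day):
    """Return the remembered outcome for an identical day, or None if unseen."""
    return OBSERVED.get(day)


today = (-5, "Snow", "Mon", "Casual")
prediction = predict_by_memory(today)  # matches the second observed day
```

A day that was never observed returns `None`, which is exactly the limitation of pure memorization.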
Averaging

• One strategy would be to predict the majority outcome.
• The neighbor walked more times than she drove in this situation, so we might predict "walk".

Temp  Precip  Day  Clothes  Outcome
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Drive
25    None    Sat  Casual   Drive
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Walk
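
The majority-outcome strategy can be sketched with a simple count. The list below copies the outcome column of the eight identical days above; the variable names are illustrative only.

```python
from collections import Counter

# Outcomes observed on eight days with identical attributes (from the table).
outcomes = ["Walk", "Walk", "Drive", "Drive", "Walk", "Walk", "Walk", "Walk"]

counts = Counter(outcomes)
# most_common(1) returns a one-element list holding the top (outcome, count) pair.
majority, votes = counts.most_common(1)[0]  # "Walk" with 6 of the 8 votes
```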
Generalization

• Dealing with previously unseen cases
• Will she walk or drive?

Temp  Precip  Day  Clothes  Outcome
22    None    Fri  Casual   Walk
3     None    Sun  Casual   Walk
10    Rain    Wed  Casual   Walk
30    None    Mon  Casual   Drive
20    None    Sat  Formal   Drive
25    None    Sat  Casual   Drive
-5    Snow    Mon  Casual   Drive
27    None    Tue  Casual   Drive
24    Rain    Mon  Casual   ?

• We might plausibly make any of the following arguments:
– She's going to walk because it's raining today and the only other time it rained, she walked.
– She's going to drive because she has always driven on Mondays…
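
The two competing arguments can be checked against the table by counting outcomes for each attribute value of the new day. This is a minimal sketch; `outcomes_given` is a hypothetical helper, and the rows are the eight observed days from the table.

```python
from collections import Counter

# Observed days: (temp, precip, day, clothes, outcome).
DAYS = [
    (22, "None", "Fri", "Casual", "Walk"),
    (3, "None", "Sun", "Casual", "Walk"),
    (10, "Rain", "Wed", "Casual", "Walk"),
    (30, "None", "Mon", "Casual", "Drive"),
    (20, "None", "Sat", "Formal", "Drive"),
    (25, "None", "Sat", "Casual", "Drive"),
    (-5, "Snow", "Mon", "Casual", "Drive"),
    (27, "None", "Tue", "Casual", "Drive"),
]
ATTRIBUTES = ("temp", "precip", "day", "clothes")


def outcomes_given(attribute, value):
    """Count outcomes over the observed days where `attribute` equals `value`."""
    idx = ATTRIBUTES.index(attribute)
    return Counter(row[-1] for row in DAYS if row[idx] == value)


rain_evidence = outcomes_given("precip", "Rain")  # one rainy day: she walked
monday_evidence = outcomes_given("day", "Mon")    # two Mondays: she drove both
```

Both rules are consistent with the data yet disagree about the new day, which is why generalization needs some further assumption about which attributes matter.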
Descriptive Data Mining

• Descriptive Data Mining tasks are used to find patterns that describe the data and to extract new, significant information from a data set. An example of a Descriptive Data Mining task is a retailer attempting to identify products that are purchased together.
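
The retailer example can be sketched by counting how often pairs of products appear in the same basket. The baskets and the `min_support` threshold below are made up for illustration and are not from the slides; real association discovery also evaluates rule confidence, which this sketch omits.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (illustrative data only).
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# Count every unordered product pair that occurs together in a basket.
pair_counts = Counter()
for basket in BASKETS:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep the pairs bought together in at least `min_support` baskets.
min_support = 3
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
```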

Clustering
• Given a set of data points, each having a set of attributes, and a
similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
• Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
E.g. Euclidean distance-based clustering in 3-D space: intracluster distances are minimized and intercluster distances are maximized.
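
To make the intracluster and intercluster distances concrete, here is a small sketch with two made-up groups of 3-D points. The points and function names are assumptions for illustration; `math.dist` (Python 3.8+) computes the Euclidean distance.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance between two points

# Two hypothetical clusters of 3-D points (illustrative data only).
cluster_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cluster_b = [(10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0)]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def mean_intracluster_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    return mean(dist(p, q) for p, q in combinations(cluster, 2))


def mean_intercluster_distance(first, second):
    """Average distance between points taken from two different clusters."""
    return mean(dist(p, q) for p, q in product(first, second))


intra_a = mean_intracluster_distance(cluster_a)
intra_b = mean_intracluster_distance(cluster_b)
inter = mean_intercluster_distance(cluster_a, cluster_b)
```

In a good clustering, `intra_a` and `intra_b` are small compared with `inter`.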
Data Mining Tools
MonkeyLearn | No-code text mining tools
RapidMiner | Drag and drop workflows or data mining in Python
Oracle Data Mining | Predictive data mining models
IBM SPSS Modeler | A predictive analytics platform for data scientists
H2O | Open-source library offering data mining in Python
Orange | Open-source data mining toolbox

MonkeyLearn
MonkeyLearn supports various data mining tasks, from detecting topics,
sentiment, and intent, to extracting keywords and named entities.
MonkeyLearn’s text mining tools are already being used to automate
ticket tagging and routing in customer support, automatically detect
negative feedback in social media, and deliver fine-grained insights
that lead to better decision making.
With MonkeyLearn, you can also connect your analyzed data
to MonkeyLearn Studio, a customizable data visualization dashboard
that makes it even easier to detect trends and patterns in your data.
RapidMiner
RapidMiner is a free, open-source data science platform that features hundreds of algorithms for data preparation, machine learning, deep learning, text mining, and predictive analytics.
Its drag-and-drop interface and pre-built models allow non-programmers to intuitively create predictive workflows for specific use cases, like fraud detection and customer churn. Meanwhile, programmers can take advantage of RapidMiner’s R and Python extensions to tailor their data mining.

Oracle Data Mining
Oracle Data Mining is a component of Oracle Advanced Analytics that
enables data analysts to build and implement predictive models. It
contains several data mining algorithms for tasks like classification,
regression, anomaly detection, prediction, and more.
With Oracle Data Mining, you can build models that help you predict
customer behavior, segment customer profiles, detect fraud, and
identify the best prospects to target. Developers can use a Java API to
integrate these models into business intelligence applications to help
them discover new trends and patterns.
IBM SPSS Modeler
IBM SPSS Modeler is a data mining solution, which allows data
scientists to speed up and visualize the data mining process. Even
users with little or no programming experience can use advanced
algorithms to build predictive models in a drag-and-drop interface.
With IBM’s SPSS Modeler, data science teams can import vast amounts
of data from multiple sources and rearrange it to uncover trends and
patterns. The standard version of this tool works with numerical data
from spreadsheets and relational databases. To add text analytics
capabilities, you need to install the premium version.
H2O
H2O is an open-source machine learning platform, which aims to make AI technology accessible to everyone. It supports the most common ML algorithms and offers AutoML functions to help users build and deploy machine learning models in a fast and simple way, even if they are not experts.
H2O can be integrated through an API, available in all major programming languages, and uses distributed in-memory computing, which makes it ideal when analyzing huge datasets.

Orange
Orange is a free, open-source data science toolbox for developing,
testing, and visualizing data mining workflows.
It is a component-based software, with a large collection of pre-built
machine learning algorithms and text mining add-ons. It also has
extended functionalities for bioinformaticians and molecular
biologists.
Orange also allows for interactive data visualization, offering numerous
graphics like silhouette plots and sieve diagrams, and non-
programmers can perform data mining tasks through visual
programming in a drag-and-drop interface. Developers, meanwhile,
can opt to mine data in Python.
Applications Of Data Mining
Fraud Detection
Predict fraudulent cases in credit card
transactions.
• Approach:
• Use credit card transactions and the
information associated with them as
attributes, e.g.
– when does a customer buy,
– what does he buy,
– where does he buy, etc.
• Label some past transactions as fraud or fair
transactions. This forms the class attribute.
• Learn a model for the class of the
transactions.
• Use this model to detect fraud by observing
credit card transactions on an account.
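
The approach above (label past transactions, learn a model, apply it to new transactions) can be sketched as follows. The slides do not name a specific model; this sketch swaps in a small categorical naive Bayes classifier with add-one smoothing as one possible choice. All transactions, attribute values, and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical labeled history: (when, what, where) -> class attribute.
HISTORY = [
    (("day", "groceries", "home_city"), "fair"),
    (("day", "fuel", "home_city"), "fair"),
    (("evening", "groceries", "home_city"), "fair"),
    (("day", "clothing", "home_city"), "fair"),
    (("night", "electronics", "abroad"), "fraud"),
    (("night", "jewelry", "abroad"), "fraud"),
    (("evening", "electronics", "abroad"), "fraud"),
    (("night", "electronics", "home_city"), "fair"),
]


def train(history):
    """Count class labels and per-class attribute values."""
    class_counts = Counter(label for _, label in history)
    # value_counts[label][position][value] = occurrences in the history
    value_counts = defaultdict(lambda: defaultdict(Counter))
    vocab = defaultdict(set)  # distinct values seen at each attribute position
    for features, label in history:
        for i, value in enumerate(features):
            value_counts[label][i][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict(model, features):
    """Return the class with the highest smoothed naive Bayes score."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # class prior
        for i, value in enumerate(features):
            # Add-one smoothing so unseen values do not zero out the score.
            score *= (value_counts[label][i][value] + 1) / (n + len(vocab[i]))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(HISTORY)
suspicious = predict(model, ("night", "electronics", "abroad"))
routine = predict(model, ("day", "groceries", "home_city"))
```

With this toy history, the night-time electronics purchase abroad is classed as fraud and the daytime grocery purchase as fair. A production system would use far more data and attributes.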
Assessing Credit Risk

• Situation: A person applies for a loan

• Task: Should a bank approve the loan?
• People who have the best credit don’t need the loans
• People with the worst credit are not likely to repay
• The bank’s best customers are in the middle

• Banks develop credit models using a variety of data mining methods.
• Mortgage and credit card proliferation are the results of being able to "successfully" predict if a person is likely to default on a loan.
• Widely deployed in many countries.
Benefits Of Data Mining Applications
• It helps companies gather reliable information
• It’s an efficient, cost-effective solution compared to other data
applications
• It helps businesses make profitable production and operational
adjustments
• Data mining works with both new and legacy systems
• It helps businesses make informed decisions
• It helps detect credit risks and fraud
• It helps data scientists analyze enormous amounts of data quickly
• Data scientists can use the information to detect fraud, build risk
models, and improve product safety
