
Data Mining Group Project

Uploaded by

s285153
Copyright
© All Rights Reserved

Data Mining

Presented By:

Chantelle Chifamba(301270348)
Khalid Dawd(301144241)
AbdulMujeeb Adesoye(301208797)
Jatinder Dosanjh
Data Mining
• Data mining is the process of automatically discovering useful information in
large data repositories.
• Human analysts may take weeks to discover useful information.
• Much of the data is never analyzed at all.

[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]
Largest databases in 2007
• Largest database in the world: World Data Centre for Climate (WDCC), operated by the Max Planck Institute and the German Climate Computing Centre
  – 220 terabytes of data on climate research and climatic trends
  – 110 terabytes of climate simulation data
  – 6 petabytes of additional information stored on tapes
• AT&T
  – 323 terabytes of information
  – 1.9 trillion phone call records
• Google
  – 91 million searches per day
  – After a year's worth of searches, this figure amounts to more than 33 billion database entries (91 million × 365)
What is (not) Data Mining?
What is not Data Mining?
  – Looking up a phone number in a phone directory
  – Querying a Web search engine for information about “Amazon”

What is Data Mining?
  – Discovering that certain names are more prevalent in certain locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
  – Discovering groups of similar documents on the Web
Origins of Data Mining
• Draws ideas from machine learning/AI, statistics, and database systems

[Figure: diagram relating Data Mining to Statistics, Machine Learning, and Database systems]
Data Mining Tasks
Data mining tasks are generally divided into two major categories:

• Predictive tasks [Use some attributes to predict unknown or future values of other attributes.]
  – Classification
  – Regression
  – Deviation Detection

• Descriptive tasks [Find human-interpretable patterns that describe the data.]
  – Association Discovery
  – Clustering
Predictive Data Mining or Supervised Learning

Predictive Data Mining is a type of advanced analytics that uses historical data, statistical modeling, Data Mining techniques, and Machine Learning to make predictions about future outcomes. Predictive analytics is used by businesses to find patterns in data and identify risks and opportunities.
Example problem
(Adapted from Leslie Kaelbling's example in the MIT courseware)

• Imagine that I'm trying to predict whether my neighbor is going to drive into work, so I can ask for a ride.

• Whether she drives into work seems to depend on the following attributes of the day:
  – temperature,
  – expected precipitation,
  – day of the week,
  – what she's wearing.
Memory
• Okay. Let's say we observe our neighbor on three days:

Temp  Precip  Day  Shop  Clothes  Outcome
25    None    Sat  No    Casual   Walk
-5    Snow    Mon  Yes   Casual   Drive
15    Snow    Mon  Yes   Casual   Walk
Memory

• Now, we find ourselves on a snowy, -5-degree Monday, when the neighbor is wearing casual clothes and going shopping.
• Do you think she's going to drive?

Temp  Precip  Day  Clothes  Outcome
25    None    Sat  Casual   Walk
-5    Snow    Mon  Casual   Drive
15    Snow    Mon  Casual   Walk
-5    Snow    Mon  Casual   ?
Memory

• The standard answer in this case is "yes".
  – This day is just like one of the ones we've seen before, and so it seems like a good bet to predict "yes."

• This is about the most rudimentary form of learning, which is just to memorize the things you've seen before.

Temp  Precip  Day  Clothes  Outcome
25    None    Sat  Casual   Walk
-5    Snow    Mon  Casual   Drive
15    Snow    Mon  Casual   Walk
-5    Snow    Mon  Casual   Drive
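
Learning by memorization can be sketched as a plain lookup table. This is only an illustrative sketch: the names `observations` and `predict_by_memory` are made up for this example, and the rows are the three observed days from the table above.

```python
# Past observations: (temp, precip, day, clothes) -> outcome.
# The rows mirror the three observed days on the slide.
observations = {
    (25, "None", "Sat", "Casual"): "Walk",
    (-5, "Snow", "Mon", "Casual"): "Drive",
    (15, "Snow", "Mon", "Casual"): "Walk",
}


def predict_by_memory(day_attributes):
    """Return the outcome seen on an identical day, or None if the day is new."""
    return observations.get(day_attributes)


# A day identical to one already seen: the memorized outcome is returned.
prediction = predict_by_memory((-5, "Snow", "Mon", "Casual"))

# A day never seen before: pure memorization has nothing to say.
unseen = predict_by_memory((24, "Rain", "Mon", "Casual"))
```

The second call shows the limit of this strategy: a day that differs in any attribute gets no prediction at all.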
Averaging

• One strategy would be to predict the majority outcome.
  – The neighbor walked more times than she drove in this situation, so we might predict "walk".

Temp  Precip  Day  Clothes  Outcome
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Drive
25    None    Sat  Casual   Drive
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Walk
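
The majority-outcome strategy can be sketched in a few lines. The function name `majority_outcome` is an assumption for illustration; the list holds the eight outcomes from the table above, all observed on identical days.

```python
from collections import Counter

# Outcomes of the eight identical days (25, None, Sat, Casual) in the table.
outcomes = ["Walk", "Walk", "Drive", "Drive", "Walk", "Walk", "Walk", "Walk"]


def majority_outcome(outcomes):
    """Return the outcome that occurs most often in the list."""
    counts = Counter(outcomes)
    return counts.most_common(1)[0][0]


majority = majority_outcome(outcomes)
```

With six walks against two drives, the majority prediction for this kind of day is "Walk".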
Generalization
• Dealing with previously unseen cases
• Will she walk or drive?

Temp  Precip  Day  Clothes  Outcome
22    None    Fri  Casual   Walk
3     None    Sun  Casual   Walk
10    Rain    Wed  Casual   Walk
30    None    Mon  Casual   Drive
20    None    Sat  Formal   Drive
25    None    Sat  Casual   Drive
-5    Snow    Mon  Casual   Drive
27    None    Tue  Casual   Drive
24    Rain    Mon  Casual   ?

• We might plausibly make any of the following arguments:
  – She's going to walk because it's raining today and the only other time it rained, she walked.
  – She's going to drive because she has always driven on Mondays…
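
The two competing arguments can be reproduced by generalizing from one attribute at a time. This is a sketch under assumptions: `predict_from_attribute` is a hypothetical helper, not a standard algorithm, and it simply takes the majority outcome among past days that share one attribute value with the new day.

```python
from collections import Counter

# Past days from the slide: (temp, precip, day, clothes, outcome).
history = [
    (22, "None", "Fri", "Casual", "Walk"),
    (3, "None", "Sun", "Casual", "Walk"),
    (10, "Rain", "Wed", "Casual", "Walk"),
    (30, "None", "Mon", "Casual", "Drive"),
    (20, "None", "Sat", "Formal", "Drive"),
    (25, "None", "Sat", "Casual", "Drive"),
    (-5, "Snow", "Mon", "Casual", "Drive"),
    (27, "None", "Tue", "Casual", "Drive"),
]

# Position of each attribute inside a row.
COLUMNS = {"temp": 0, "precip": 1, "day": 2, "clothes": 3}


def predict_from_attribute(attribute, value):
    """Majority outcome among past days sharing one attribute value."""
    index = COLUMNS[attribute]
    matches = [row[4] for row in history if row[index] == value]
    if not matches:
        return None  # No past day shares this value.
    return Counter(matches).most_common(1)[0][0]


# New day: 24 degrees, Rain, Monday, Casual.
by_precip = predict_from_attribute("precip", "Rain")  # argues for walking
by_day = predict_from_attribute("day", "Mon")         # argues for driving
```

The two rules disagree on the same day, which is the point of the slide: the data alone does not say which attribute to generalize from.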
Descriptive Data Mining

• Descriptive Data Mining tasks are used to find patterns that describe the data and to extract new, significant information from a data set. An example of a descriptive task is a retailer attempting to identify products that are purchased together.
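
As a rough illustration of the retailer example, the sketch below counts how often pairs of products appear in the same basket. The baskets and the name `pair_counts` are invented for this example; real association discovery would also look at measures such as support and confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of products per purchase.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical order, e.g. ("bread", "milk").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
```

In this toy data, bread and milk are bought together in three of the four baskets, more often than any other pair.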


Clustering
• Given a set of data points, each having a set of attributes, and a
similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
• Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
E.g. Euclidean distance-based clustering in 3-D space: intracluster distances are minimized and intercluster distances are maximized.
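
The idea of small intracluster and large intercluster distances can be checked numerically. The sketch below assumes two hand-made clusters of 3-D points (hypothetical data) and uses `math.dist` from the Python standard library (Python 3.8 or later) for the Euclidean distance; it evaluates a given grouping and is not a clustering algorithm.

```python
import math
from itertools import combinations, product

# Two hypothetical clusters of points in 3-D space.
cluster_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cluster_b = [(10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0)]


def mean_intracluster_distance(points):
    """Average Euclidean distance between all pairs inside one cluster."""
    distances = [math.dist(p, q) for p, q in combinations(points, 2)]
    return sum(distances) / len(distances)


def mean_intercluster_distance(first, second):
    """Average Euclidean distance between points of two different clusters."""
    distances = [math.dist(p, q) for p, q in product(first, second)]
    return sum(distances) / len(distances)


intra_a = mean_intracluster_distance(cluster_a)
intra_b = mean_intracluster_distance(cluster_b)
inter_ab = mean_intercluster_distance(cluster_a, cluster_b)
```

For a good clustering, `intra_a` and `intra_b` should be much smaller than `inter_ab`, as they are for these two well-separated groups.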
Data Mining Tools
MonkeyLearn | No-code text mining tools

RapidMiner | Drag and drop workflows or data mining in Python

Oracle Data Mining | Predictive data mining models

IBM SPSS Modeler | A predictive analytics platform for data scientists

H2O | Open-source library offering data mining in Python

Orange | Open-source data mining toolbox


MonkeyLearn
MonkeyLearn supports various data mining tasks, from detecting topics,
sentiment, and intent, to extracting keywords and named entities.
MonkeyLearn’s text mining tools are already being used to automate
ticket tagging and routing in customer support, automatically detect
negative feedback in social media, and deliver fine-grained insights
that lead to better decision making.
With MonkeyLearn, you can also connect your analyzed data
to MonkeyLearn Studio, a customizable data visualization dashboard
that makes it even easier to detect trends and patterns in your data.
RapidMiner
RapidMiner is a free, open-source data science platform that features hundreds of algorithms for data preparation, machine learning, deep learning, text mining, and predictive analytics.

Its drag-and-drop interface and pre-built models allow non-programmers to intuitively create predictive workflows for specific use cases, like fraud detection and customer churn. Meanwhile, programmers can take advantage of RapidMiner’s R and Python extensions to tailor their data mining.


Oracle Data Mining
Oracle Data Mining is a component of Oracle Advanced Analytics that
enables data analysts to build and implement predictive models. It
contains several data mining algorithms for tasks like classification,
regression, anomaly detection, prediction, and more.
With Oracle Data Mining, you can build models that help you predict
customer behavior, segment customer profiles, detect fraud, and
identify the best prospects to target. Developers can use a Java API to
integrate these models into business intelligence applications to help
them discover new trends and patterns.
IBM SPSS Modeler
IBM SPSS Modeler is a data mining solution, which allows data
scientists to speed up and visualize the data mining process. Even
users with little or no programming experience can use advanced
algorithms to build predictive models in a drag-and-drop interface.
With IBM’s SPSS Modeler, data science teams can import vast amounts
of data from multiple sources and rearrange it to uncover trends and
patterns. The standard version of this tool works with numerical data
from spreadsheets and relational databases. To add text analytics
capabilities, you need to install the premium version.
H2O
H2O is an open-source machine learning platform that aims to make AI technology accessible to everyone. It supports the most common ML algorithms and offers AutoML functions to help users build and deploy machine learning models quickly and simply, even if they are not experts.

H2O can be integrated through an API, available in all major programming languages, and uses distributed in-memory computing, which makes it well suited to analyzing huge datasets.


Orange
Orange is a free, open-source data science toolbox for developing,
testing, and visualizing data mining workflows.
It is component-based software, with a large collection of pre-built
machine learning algorithms and text mining add-ons. It also has
extended functionalities for bioinformaticians and molecular
biologists.
Orange also allows for interactive data visualization, offering numerous
graphics like silhouette plots and sieve diagrams. Non-programmers
can perform data mining tasks through visual programming in a
drag-and-drop interface. Developers, meanwhile, can opt to mine
data in Python.
Applications Of Data Mining
Fraud Detection
Predict fraudulent cases in credit card
transactions.
• Approach:
• Use credit card transactions and the
information associated with them as
attributes, e.g.
– when does a customer buy,
– what does he buy,
– where does he buy, etc.
• Label some past transactions as fraud or fair
transactions. This forms the class attribute.
• Learn a model for the class of the
transactions.
• Use this model to detect fraud by observing
credit card transactions on an account.
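
The approach above (attributes, labelled past transactions, a learned model, then detection) can be sketched with a one-nearest-neighbour classifier. Everything here is an assumption for illustration: the attributes (hour of day, amount, distance from home in km), the data, and the names `labelled` and `classify` are hypothetical, and a real system would at least scale the attributes and use far more data.

```python
import math

# Hypothetical labelled transactions:
# (hour of day, amount, km from home) -> class attribute ("fair" or "fraud").
labelled = [
    ((14, 40.0, 2.0), "fair"),
    ((10, 25.0, 5.0), "fair"),
    ((19, 60.0, 1.0), "fair"),
    ((3, 900.0, 4000.0), "fraud"),
    ((2, 1200.0, 3500.0), "fraud"),
]


def classify(transaction):
    """Label a new transaction with the class of its closest labelled example.

    The "model" is simply the stored examples; closeness is the Euclidean
    distance on the raw, unscaled attributes.
    """
    nearest_attributes, nearest_label = min(
        labelled, key=lambda example: math.dist(example[0], transaction)
    )
    return nearest_label


# New transactions observed on an account.
suspicious = classify((4, 1000.0, 3800.0))
ordinary = classify((12, 30.0, 3.0))
```

A late-night, high-value purchase far from home lands next to the fraud examples, while a small daytime purchase nearby lands next to the fair ones.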
Assessing Credit Risk

• Situation: Person applies for a loan

• Task: Should a bank approve the loan?
  – People who have the best credit don’t need the loans
  – People with the worst credit are not likely to repay
  – Bank’s best customers are in the middle

• Banks develop credit models using a variety of data mining methods.
  – Mortgage and credit card proliferation are the results of being able to "successfully" predict if a person is likely to default on a loan.
  – Widely deployed in many countries.
Benefits Of Data Mining Applications
• It helps companies gather reliable information
• It’s an efficient, cost-effective solution compared to other data
applications
• It helps businesses make profitable production and operational
adjustments
• Data mining uses both new and legacy systems
• It helps businesses make informed decisions
• It helps detect credit risks and fraud
• It helps data scientists analyze enormous amounts of data quickly
• Data scientists can use the information to detect fraud, build risk
models, and improve product safety
