John - Fields - HW1 Data Mining

This document provides a summary of Chapter 1 of the book "Introduction to Data Mining" by Tan, Steinbach, and Kumar. The chapter introduces data mining as the process of automatically discovering useful information from large data repositories. It discusses how data mining was driven by challenges related to data scalability, dimensionality, heterogeneity, distribution, and non-traditional analysis. The chapter concludes with examples of typical data mining tasks like classification and association analysis.

Uploaded by

Satya Narayan Shukla


Fields 1

John Fields

Dr. Ami Gates

IST707 - Assignment #1

7/11/19

Introduction

The field of data mining has grown rapidly with the advancement of technology and the ability to gather and store large quantities of information and retrieve it quickly from computers. Utilizing this “big data” to provide insight and make predictions has benefits across a wide variety of societal and business areas such as medicine, science and engineering. Chapter 1 of Introduction to Data Mining (Tan et al.) provides an excellent overview of the topic of data mining that includes information on the benefits, background, definitions and applications of data mining.

Data mining combines traditional analysis methods from statistics, operations, and other fields with new algorithms to provide insights that help people make better decisions. The authors define data mining as “…the process of automatically discovering useful information in large data repositories.” They also emphasize that not every information-discovery task counts as “data mining”; the exercises in the Analysis and Models section of this assignment contrast data mining with other information-related activities.

The development of data mining was driven by five specific challenges related to data

and information:

1. Scalability (for massive data sets)

2. High Dimensionality (for data with many different attributes)

3. Heterogeneous and Complex Data (for different types of data)

4. Data Ownership and Distribution (for data in different locations)

5. Non-traditional Analysis (for the automation of hypothesis generation and evaluation)

To overcome these challenges, data mining has evolved from being one part of Knowledge Discovery in Databases (see Figure 1 below) to focusing on “…data preprocessing, mining, and postprocessing” by using techniques from statistics, artificial intelligence, machine learning, optimization, visualization, and many other areas. The tasks performed as part of data mining are typically either predictive or descriptive. Predictive tasks involve the prediction of a target or dependent variable (Y) based on the explanatory or independent variables (X’s). Descriptive tasks derive patterns, such as clusters, associations, trends, and anomalies, that summarize the underlying relationships in the data. The first chapter concludes with examples of classification (the iris flower data set) and association analysis (market basket data) to show some typical data mining tasks, followed by a discussion of the scope and organization of the book.

Figure 1. Overview of the steps in the Knowledge Discovery in Databases (KDD) process
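Association analysis on market basket data, one of the typical tasks mentioned above, rests on two simple measures: the support of an itemset and the confidence of a rule. The following is a minimal sketch of both, not the textbook's implementation. The five transactions are made up for illustration, and the function names `support` and `confidence` are chosen here rather than taken from any library.

```python
from fractions import Fraction

# Toy market-basket data, made up for illustration (an assumption, not real sales data).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return Fraction(hits, len(baskets))


def confidence(antecedent, consequent, baskets):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    base = support(antecedent, baskets)
    if base == 0:
        raise ValueError("antecedent never occurs, so confidence is undefined")
    return support(set(antecedent) | set(consequent), baskets) / base


if __name__ == "__main__":
    # Rule under test: {diapers} -> {beer}
    print("support:", support({"diapers", "beer"}, transactions))
    print("confidence:", confidence({"diapers"}, {"beer"}, transactions))
```

Exact fractions are used instead of floats so that the two measures can be compared without rounding noise. A real association-rule miner would also search over candidate itemsets, which this sketch does not attempt.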

Analysis and Models

Task 1 - 1.7 Exercises - review data mining concepts and tasks

1. Discuss whether or not each of the following activities is a data mining task.

(a) Dividing the customers of a company according to their gender.

This is not a data mining task. A simple database query is required to

complete this task.

(b) Dividing the customers of a company according to their profitability.

This is not a data mining task. A simple database query would provide the

desired information and then a finance/accounting task is required to

determine the groupings to use to divide the customers.

(c) Computing the total sales of a company.

This is not a data mining task. This is a finance/accounting task to

sum the sales of the company for financial reporting.

(d) Sorting a student database based on student identification numbers.

This is not a data mining task. A database query could be used to collect

this information and sort by student ID numbers.

(e) Predicting the outcomes of tossing a (fair) pair of dice.

This is not a data mining task. Because the dice are fair, the outcome distribution follows from a probability calculation rather than from mining data: each face of a die has a 1 out of 6 chance, and the probability of each total for the pair can be enumerated directly.

(f) Predicting the future stock price of a company using historical records.

This is a data mining task. A model could be built that includes the

various factors that influence the price of a stock. Time-series analysis

could then be performed to predict the expected future stock price.

(g) Monitoring the heart rate of a patient for abnormalities.

This is a data mining task. Data could be collected on normal and

abnormal heart conditions to determine the threshold where a notification

should be triggered. This could be accomplished with an algorithm for

outlier detection or by classifying the condition as normal or abnormal.

(h) Monitoring seismic waves for earthquake activities.

This is a data mining task. As in (g) above, data could be collected on

normal and abnormal earthquake activity. A classification could then be

performed to label the activity as normal or abnormal.

(i) Extracting the frequencies of a sound wave.

This is not a data mining task. It is signal processing: the frequencies are computed directly from the recorded sound wave, for example with a Fourier transform, rather than discovered as patterns in a data repository.
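To illustrate why exercise (i) is a direct computation rather than a mining task, the sketch below extracts the dominant frequency of a sampled signal with a naive discrete Fourier transform. It is only an illustration under stated assumptions: the test signal is synthetic, the helper names `dft_magnitudes` and `dominant_frequency` are hypothetical, and a practical system would use an optimized FFT routine instead of this O(n^2) loop.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive O(n^2) discrete Fourier transform; returns |X[k]| for k = 0 .. n // 2."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest non-constant component of the signal."""
    mags = dft_magnitudes(samples)
    # Skip k = 0, which only measures the constant (DC) offset.
    k = max(range(1, len(mags)), key=mags.__getitem__)
    return k * sample_rate / len(samples)


if __name__ == "__main__":
    rate = 64  # samples per second (assumed for this synthetic example)
    # One second of a 5 Hz tone plus a weaker 12 Hz tone.
    signal = [
        math.sin(2 * math.pi * 5 * t / rate) + 0.5 * math.sin(2 * math.pi * 12 * t / rate)
        for t in range(rate)
    ]
    print(dominant_frequency(signal, rate))
```

With one second of signal, each DFT bin is 1 Hz wide, so both tones fall exactly on a bin and the stronger 5 Hz component is the one reported.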

Task                                        Data Mining?   Data Mining Technique
(a) Divide customers based on gender        No
(b) Divide customers by profitability       No
(c) Compute total sales of the company      No
(d) Sort students by ID number              No
(e) Predict outcome of a fair pair of dice  No
(f) Predict stock price                     Yes            Time-series forecasting
(g) Monitor heart rate                      Yes            Classification and outlier detection
(h) Monitor seismic waves for earthquakes   Yes            Classification and outlier detection
(i) Extract sound wave frequencies          No
Table 1. Comparison of data related tasks
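The "No" for task (e) can be checked with plain arithmetic. The sketch below enumerates all 36 equally likely outcomes of two fair six-sided dice and tallies the probability of each total, with no data collected or mined. The function name `pair_sum_distribution` is chosen here for illustration only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def pair_sum_distribution(sides=6):
    """Exact probability of each total when two fair dice with `sides` faces are tossed."""
    faces = range(1, sides + 1)
    # Every ordered pair (a, b) is equally likely, so counting pairs gives the distribution.
    counts = Counter(a + b for a, b in product(faces, repeat=2))
    outcomes = sides ** 2
    return {total: Fraction(count, outcomes) for total, count in sorted(counts.items())}


if __name__ == "__main__":
    for total, prob in pair_sum_distribution().items():
        print(total, prob)
```

The most likely total is 7, with probability 6/36, while 2 and 12 each occur with probability 1/36.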

2. Suppose that you are employed as a data mining consultant for an Internet search engine

company. Describe how data mining can help the company by giving specific examples of how

techniques, such as clustering, classification, association rule mining, and anomaly detection can

be applied.

Internet search engine companies (e.g. Google) have pioneered the use of data mining by providing a "free" service in exchange for behavioral data from users, such as the queries they type into the search engine. This data can be analyzed in a variety of ways to understand user behavior, which in turn supports revenue-generating activities such as advertising, up-selling and cross-selling. The table below shows how different data mining techniques can be used to improve the user experience and provide additional value-added data to an internet search company.

Data Mining Technique     Benefit to Company                          Benefit to Users
Clustering                Group together similar users to target      More relevant search results
                          for advertising
Classification            Predict characteristics such as age and     More relevant search results
                          interests (e.g. sports, news)
Association Rule Mining   Information for advertising to up-sell      Receive information on complementary
                          or cross-sell                               products and services
Anomaly Detection         Determine if the user is really a           Protects confidential information and
                          malicious bot or potential threat           provides user confidence in the
                                                                      security of their data
Table 2. Benefits of Data Mining Techniques for Internet Search Engines
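As one concrete reading of the anomaly detection row, the sketch below flags clients whose request rate is far from the typical rate, which is one simple way a search engine might spot a bot. It is a sketch under assumptions, not a description of any company's system: the client names and request counts are invented, the function `flag_anomalies` is hypothetical, and the 0.6745 scaling with a 3.5 cutoff is a common rule of thumb for the median-based (modified) z-score.

```python
from statistics import median


def flag_anomalies(rates, threshold=3.5):
    """Return the keys whose value has a modified z-score above `threshold`.

    The modified z-score uses the median and the median absolute deviation (MAD),
    so a single extreme value does not distort the baseline it is compared against.
    """
    values = list(rates.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so the score is undefined
    return sorted(
        client for client, v in rates.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


if __name__ == "__main__":
    # Hypothetical searches per minute for seven clients (made-up numbers).
    requests_per_minute = {
        "client_a": 12, "client_b": 9, "client_c": 11, "client_d": 10,
        "client_e": 13, "client_f": 240, "client_g": 8,
    }
    print(flag_anomalies(requests_per_minute))
```

A median-based score is used here instead of the mean and standard deviation because, with only a handful of clients, one extreme value inflates the standard deviation enough to hide itself.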

3. For each of the following data sets, explain whether or not data privacy is an important

issue.

(a) Census data collected from 1900-1950.

Yes, census information could be a data privacy issue, since someone born in 1950 would be only 69 years old in 2019, so many of the people recorded are still living. The census contains information on race, ethnicity, medical conditions, and finances, which some people might be uncomfortable sharing via websites such as Ancestry.com. Other data in the census, such as a mother's maiden name, could be used to circumvent security questions.

(b) IP addresses and visit times of web users who visit your website.
Yes, this information could present data privacy concerns. There are numerous recent examples of websites being hacked and user information being released, which hackers can use to impersonate or blackmail users.

(c) Images from Earth-orbiting satellites.

No, this does not present a concern today with the current technology. However, if more detailed data becomes available in the future, this could be a cause for concern. An example is Google Maps Street View, which must blur or remove images that show license plates, faces, etc. If satellites can provide this level of resolution in the future, then it could cause data privacy concerns.

(d) Names and addresses of people from the telephone book.

Names and addresses in the telephone book don't present data privacy issues. This information has long been publicly available; the main downside is automated robo-calling, which has made telemarketing a bigger nuisance.

(e) Names and email addresses collected from the Web.

Names and email addresses on the web are also not a data privacy issue. Similar

to (d) above, it is more of a nuisance now that this information is more readily

available and can be utilized by unscrupulous marketers, but it doesn't pose an

increased data privacy risk.

Task 2 - Google Flu Trends - practice your critical thinking and writing

In the NY Times article from 2014, Google Flu Trends: The Limits of Big Data, the authors review the challenges of using internet searches to track potential outbreaks of the flu. The Centers for Disease Control and Prevention (CDC) has traditionally used manual methods to collect this information from health care providers, which caused a delay of several weeks in receiving the results. Google Flu Trends used the faster method of Google search data to predict flu outbreaks, but the article discusses how it overstated flu cases by 30-50% in the period from 2012-2014. As the title suggests, there are limits to the use of big data, and the NY Times is critical of Google for not having been more careful about how it used this new technology.

The second article, In Defense of Google Flu Trends from The Atlantic (also published in 2014), has a more positive opinion of the role of Google Flu Trends. The author provides additional information on how combining Google Flu Trends with CDC data provides better predictions than either source alone. This article also quotes Google sources who refer back to earlier documents in which Google warned against using Google Flu Trends data as a stand-alone source for predictions.

A review of both articles provides interesting insights on the potential “hype” around data mining and the responsibility of companies and data scientists to use these new predictive capabilities cautiously. Although Google provided a defense which pointed back to earlier documentation on the use of Google Flu Trends, the company appeared to be more interested in capitalizing on the early positive press than in warning about the potentially misleading results. Many high-tech companies, like Google and Facebook, are learning that they must be very careful about how they deploy new technology to ensure it is fully tested and secure. Companies need to be more open about disclosing the risks and potential issues of new data mining capabilities such as Google Flu Trends. If they are not responsible in regulating how they operate, the recent government and public outcry could lead to new laws and regulations.

Results

The first topic in the analysis section above reviewed the different types of data mining activities, with a summary of the different techniques shown in Table 1. This information can be used as a technical reference for determining the types of techniques to apply to different tasks. Similarly, Table 2 provides a review of how data mining can be used for internet searches and the benefits for the company and users. In future assignments, additional technical details will be provided in the Results section to give a comprehensive review of the techniques used.

The second topic in the analysis section was a review of two articles related to the capabilities and challenges of Google Flu Trends. These articles highlight the promise and challenges of applying data mining techniques to global health problems such as the annual battle with flu outbreaks. There are lessons here for tech companies like Google and the data scientists who develop and deploy these new capabilities. As the great superhero Spider-Man stated, “With great power comes great responsibility.”

Conclusion

The book, Introduction to Data Mining (Tan et al.), is an excellent introduction to the topic of data mining and explains the concepts in concise language with easy-to-understand examples. Exercise Question 1 in section 1.7 provides many thought-provoking questions which help the reader to understand the concepts behind data mining (what it is and what it is not). Question 2 helps the reader to see the value of data science by placing them as a consultant at a fictitious internet search company and asking how different techniques could provide more value to the company and to users. Finally, Question 3 explores the very relevant topic of data privacy, which continues to be a major issue for data science as information continues to become easier to collect and more readily available.

The articles on Google Flu Trends review the very relevant topic of the use of big data/data mining in important areas such as health care. These stories offer different viewpoints on the value and challenges of utilizing these powerful new techniques and available information to help us better predict issues such as flu trends. However, as with any new technology, there is a learning curve, and companies and data scientists need to be very cautious to ensure these capabilities are deployed ethically and responsibly.

Works Cited

Tan, Pang-Ning, et al. Introduction to Data Mining. Pearson Education, Inc., 2019.

Fayyad, Usama, et al. “The KDD Process for Extracting Useful Knowledge from Volumes of Data.” Communications of the ACM, vol. 39, no. 11, 1996, pp. 27–34, doi:10.1145/240455.240464.

“Google Flu Trends: The Limits of Big Data.” New York Times, 28 Mar. 2014.

“In Defense of Google Flu Trends.” The Atlantic, 27 Mar. 2014.
