John - Fields - HW1 Data Mining
John Fields
IST707 - Assignment #1
7/11/19
Introduction
The field of data mining has grown rapidly with the advancement of technology and the
ability to gather and store large quantities of information and retrieve it quickly from computers.
Utilizing this “big data” to provide insight and make predictions has benefits across a wide
variety of societal and business areas such as medicine, science, and engineering. Chapter 1 of
Introduction to Data Mining (Tan et al.) provides an excellent overview of the topic, covering
the benefits, background, definitions, and applications of data mining.
Data mining combines traditional analysis methods from statistics, operations, and other
fields with new algorithms to provide new insights to help people make better decisions. The
authors define data mining as “…the process of automatically discovering useful information in
large data repositories.” The authors also emphasize that not all tasks related to information
discovery are considered to be “data mining” and the exercises at the end of this assignment will
The development of data mining was driven by five specific challenges related to data
and information:
evaluation)
To overcome these challenges, data mining has evolved from being a part of Knowledge
machine learning, optimization, visualization, and many other areas. The tasks performed
as part of data mining typically fall into two categories: predictive and descriptive.
Predictive tasks involve the prediction of a target or dependent variable (Y) based on the
explanatory or independent variables (X’s). Descriptive tasks derive patterns, such as
clusters, trends, and anomalies, that summarize relationships in the data. The first chapter
concludes with examples of Classification
(iris flower data set) and Association analysis (market basket data) to show some typical
data mining tasks and a discussion on the scope and organization of the book.
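To make the association analysis example more concrete, the short Python sketch below computes the two measures commonly used to evaluate a market basket rule: support and confidence. The transactions and the function names (count, support, confidence) are hypothetical and were made up for illustration; they are not taken from the book.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical market basket data (illustrative only, not from the textbook).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "eggs"}),
    frozenset({"milk", "butter", "cereal"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"bread", "milk", "cereal"}),
]


def count(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item in itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0
    both = count(frozenset(antecedent) | frozenset(consequent), transactions)
    return both / base


if __name__ == "__main__":
    print("support({bread, milk}) =", support({"bread", "milk"}, TRANSACTIONS))
    print("confidence(bread -> milk) =",
          confidence({"bread"}, {"milk"}, TRANSACTIONS))
```

With this made-up data, bread and milk appear together in three of the five baskets, and three of the four baskets that contain bread also contain milk. A rule such as "bread -> milk" would be considered interesting only if both measures exceed thresholds chosen by the analyst.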
process
1. Discuss whether or not each of the following activities is a data mining task.
This is not a data mining task. A simple database query would provide the
This is not a data mining task. A database query could be used to collect
This is not a data mining task since you could use statistical
methods (sampling, means and central limit theorem) to show that there is
(f) Predicting the future stock price of a company using historical records.
This is a data mining task. A model could be built that includes the
This is not a data mining task. This information could be queried from the
2. Suppose that you are employed as a data mining consultant for an Internet search engine
company. Describe how data mining can help the company by giving specific examples of how
techniques, such as clustering, classification, association rule mining, and anomaly detection can
be applied.
Internet search engine companies (e.g. Google) have pioneered the use of data mining to provide
a "free" service in exchange for users supplying behavioral data such as the information that they
type in a search engine. This data can be utilized in a variety of ways to understand the
behaviors of users, which can support revenue-generating activities such as advertising,
up-selling, and cross-selling. The chart below shows how different data mining techniques can be
used to improve the user experience and provide additional value-added data to an internet
search company.
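As one illustration of the anomaly detection technique named in the question, the Python sketch below flags days on which the number of ad clicks from a single account is unusually far from that account's average. This is a simple z-score rule and only a rough sketch; the click counts, the function name flag_anomalies, and the threshold of 2.0 are all assumptions made for this example and do not describe how any real search engine detects click fraud.

```python
from statistics import mean, pstdev
from typing import List


def flag_anomalies(counts: List[float], threshold: float = 2.0) -> List[int]:
    """Return the positions whose value lies more than `threshold`
    population standard deviations away from the mean."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]


if __name__ == "__main__":
    # Hypothetical daily ad-click counts for one account (illustrative only).
    daily_clicks = [100, 102, 98, 101, 99, 100, 400, 97]
    print("Suspicious days (0-based):", flag_anomalies(daily_clicks))
```

A z-score rule is easy to explain to a client, but the mean and standard deviation are themselves pulled toward extreme values, so a production system would likely need a more robust method.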
3. For each of the following data sets, explain whether or not data privacy is an important
issue.
Yes, census information could be a data privacy issue since someone born in 1950
race, ethnicity, medical conditions, and financial information which some people
the census, such as your mother's maiden name, could be used to circumvent
security questions.
(b) IP addresses and visit times of web users who visit your website.
Yes, this information could present data privacy concerns. There are numerous
recent examples of web sites being hacked and user information being released
No, this does not present a concern today with the current technology. However,
if more detailed data becomes available in the future, this could be a cause for concern.
An example is Google Maps Street View, which must blur or obscure
images that show license plates, faces, etc. If satellites can provide this level of
Names and addresses in the telephone book don't present data privacy issues.
This information has been available publicly and the downside is the automated
Names and email addresses on the web are also not a data privacy issue. Similar
to (d) above, it is more of a nuisance now that this information is more readily
Task 2 - Google Flu Trends - practice your critical thinking and writing
In the NY Times article from 2014, Google Flu Trends: The Limits of Big Data, the
authors review the challenges of using internet searches to track potential outbreaks of
the flu. The Centers for Disease Control and Prevention (CDC) has used manual methods in the past to collect
this information from health care providers, which caused a delay of several weeks in receiving the
results. The article discusses how Google Flu Trends, which uses the faster method of Google
search data to predict flu outbreaks, overstated flu cases by 30-50% in the period from
2012-2014. As the title suggests, there are limits to the use of big data, and the NY Times is
critical of Google for not having been more careful about how it used this new technology.
The second article from the Atlantic, In Defense of Google Flu Trends (also published in
2014), takes a more positive view of the role of Google Flu Trends. The author provides
additional information on how combining Google Flu Trends with CDC data provides better
predictions than either source alone. This article also quotes Google sources who
refer back to documents that show that Google Flu Trends made warnings earlier about using
A review of both articles provides interesting insights into the potential “hype” around data
mining and the responsibility of companies and data scientists to use these new predictive
capabilities cautiously. Although Google provided a defense
which pointed back to earlier documentation on the use of Google Flu Trends, the company
appeared to be more interested in utilizing the early positive press than in warning about the
potentially misleading results. Many high-tech companies, such as Google and Facebook, are
learning that they must be very careful about how they deploy new technology to ensure it is
fully tested and secure. Companies need to be more open about disclosing the risks and potential
issues of new data mining capabilities such as Google Flu Trends, or the recent government and
public outcry could lead to new laws and regulations if companies are not responsible in
Results
The first topic in the analysis section above reviewed the different types of data mining
activities, with a summary of the different techniques shown in Table 1. This information can be
used as a technical reference for determining which techniques to apply to different tasks.
Similarly, Table 2 provides a review of how data mining can be used for internet searches and
the benefits for the company and users. In future assignments, additional technical details will
be provided in the Results section of these assignments to provide a comprehensive review of the
techniques used.
The second topic in the analysis section was a review of various articles related to the
capabilities and limitations of Google Flu Trends. These articles highlight the promise and
challenges of applying data mining techniques to global health problems such as
annual flu outbreaks. There are lessons here for tech companies like Google and the
data scientists who develop and deploy these new capabilities. As the great super-hero Spider
Conclusion
The book, Introduction to Data Mining (Tan et al.), is an excellent introduction to the
topic of data mining and explains the concepts with concise language and easy-to-understand
examples. Exercise Question 1 in Section 1.7 also provides many thought-provoking questions
which help the reader understand the concepts behind data mining (what it is and what it is
not). Question 2 helps the reader see the value of data science by using a fictitious internet
search company to explore how a consultant would use different techniques to provide more value to
the company and to users. Finally, Question 3 explores the very relevant topic of data privacy,
which continues to be a major issue for data science as information continues to become easier to
The articles on Google Flu Trends review the use of big
data/data mining in important areas such as health care. These stories offer different viewpoints
on the value and challenges of utilizing these powerful new techniques and available information
to help us better predict issues such as flu trends. However, as with any new technology, there
is a learning curve and companies/data scientists need to be very cautious to ensure these
Works Cited
Tan, Pang-Ning, et al. Introduction to Data Mining. Pearson Education, Inc., 2019.
Fayyad, Usama, et al. “The KDD Process for Extracting Useful Knowledge from Volumes of
Data.” Communications of the ACM, vol. 39, no. 11, 1996, pp. 27–34,
doi:10.1145/240455.240464.
“Google Flu Trends: The Limits of Big Data.” New York Times, 28 Mar. 2014.