Statistical Data Mining Using SAS Applications, Second Edition (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series) by George Fernandez, ISBN 1439810753
Statistical
Data Mining
Using SAS
Applications
Second Edition
SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A
PUBLISHED TITLES
UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX DECOMPOSITIONS
David Skillicorn
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY, AND APPLICATIONS
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT
David Skillicorn
MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO CONCEPTS AND THEORY
Zhongfei Zhang and Ruofei Zhang
NEXT GENERATION OF DATA MINING
Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
DATA MINING FOR DESIGN AND MARKETING
Yukio Ohsawa and Katsutoshi Yada
THE TOP TEN ALGORITHMS IN DATA MINING
Xindong Wu and Vipin Kumar
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, SECOND EDITION
Harvey J. Miller and Jiawei Han
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
Ashok N. Srivastava and Mehran Sahami
BIOLOGICAL DATA MINING
Jake Y. Chen and Stefano Lonardi
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
Vagelis Hristidis
TEMPORAL DATA MINING
Theophano Mitsa
RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS, AND APPLICATIONS
Bo Long, Zhongfei Zhang, and Philip S. Yu
KNOWLEDGE DISCOVERY FROM DATA STREAMS
João Gama
STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION
George Fernandez
Statistical
Data Mining
Using SAS
Applications
Second Edition
George Fernandez
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
Preface
Acknowledgments
About the Author
1. Data Mining: A Gentle Introduction
1.1 Introduction
1.2 Data Mining: Why It Is Successful in the IT World
1.2.1 Availability of Large Databases: Data Warehousing
1.2.2 Price Drop in Data Storage and Efficient Computer Processing
1.2.3 New Advancements in Analytical Methodology
1.3 Benefits of Data Mining
1.4 Data Mining: Users
1.5 Data Mining: Tools
1.6 Data Mining: Steps
1.6.1 Identification of Problem and Defining the Data Mining Study Goal
1.6.2 Data Processing
1.6.3 Data Exploration and Descriptive Analysis
1.6.4 Data Mining Solutions: Unsupervised Learning Methods
1.6.5 Data Mining Solutions: Supervised Learning Methods
1.6.6 Model Validation
1.6.7 Interpret and Make Decisions
1.7 Problems in the Data Mining Process
1.8 SAS Software the Leader in Data Mining
1.8.1 SEMMA: The SAS Data Mining Process
1.8.2 SAS Enterprise Miner for Comprehensive Data Mining Solution
1.9 Introduction of User-Friendly SAS Macros for Statistical Data Mining
1.9.1 Limitations of These SAS Macros
1.10 Summary
References
2. Preparing Data for Data Mining
2.1 Introduction
2.2 Data Requirements in Data Mining
2.3 Ideal Structures of Data for Data Mining
2.4 Understanding the Measurement Scale of Variables
2.5 Entire Database or Representative Sample
2.6 Sampling for Data Mining
2.6.1 Sample Size
2.7 User-Friendly SAS Applications Used in Data Preparation
2.7.1 Preparing PC Data Files before Importing into SAS Data
2.7.2 Converting PC Data Files to SAS Datasets Using the SAS Import Wizard
2.7.3 EXLSAS2 SAS Macro Application to Convert PC Data Formats to SAS Datasets
2.7.4 Steps Involved in Running the EXLSAS2 Macro
2.7.5 Case Study 1: Importing an Excel File Called “Fraud” to a Permanent SAS Dataset Called “Fraud”
2.7.6 SAS Macro Applications—RANSPLIT2: Random Sampling from the Entire Database
2.7.7 Steps Involved in Running the RANSPLIT2 Macro
2.7.8 Case Study 2: Drawing Training (400), Validation (300), and Test (All Left-Over Observations) Samples from the SAS Data Called “Fraud”
2.8 Summary
References
Objective
The objective of the second edition of this book is to introduce statistical data min-
ing concepts, describe methods in statistical data mining from sampling to decision
trees, demonstrate the features of user-friendly data mining SAS tools and, above
all, allow the book users to download compiled data mining SAS (Version 9.0 and
later) macro files and help them perform complete data mining. The user-friendly
SAS macro approach integrates the statistical and graphical analysis tools available
in SAS systems and provides complete statistical data mining solutions without
writing SAS program codes or using the point-and-click approach. Step-by-step
instructions for using SAS macros and interpreting the results are emphasized in
each chapter. Thus, by following the step-by-step instructions and downloading
the user-friendly SAS macros described in the book, data analysts can perform
complete data mining analysis quickly and effectively.
Coverage:
The following types of analyses can be performed using the user-friendly SAS macros.
◾◾ Biostatistics
◾◾ Research methods in public health
◾◾ Advanced business statistics
◾◾ Applied statistical methods
◾◾ Research methods
◾◾ Advanced data analysis
Additionally the following new features are included in the SAS-specific macro
application:
I. Chapter 2
a. Converting PC data files to SAS data (EXLSAS2 macro)
−− All numeric (m) and categorical variables (n) in the Excel file are converted to
X1-Xm and C1-Cn, respectively. However, the original column names will be
used as the variable labels in the SAS data. This new feature helps to maximize
the power of the user-friendly SAS macro applications included in the book.
−− Options for renaming any X1-Xm or C1-Cn variables in a SAS data step are
available in the EXLSAS2 macro application.
−− Using the SAS ODS graphics features in version 9.2, frequency distribution
displays of all categorical variables are generated when WORD,
HTML, PDF, or TXT is selected as the output file format.
b. Randomly splitting data (RANSPLIT2)
−− Many different sampling methods such as simple random sampling, stratified
random sampling, systematic random sampling, and unrestricted random
sampling are implemented using the SAS SURVEYSELECT procedure.
II. Chapter 3
a. Frequency analysis (FREQ2)
−− For one-way frequency analysis, the Gini and Entropy indexes are
reported automatically.
−− Confidence interval estimates for percentages in frequency tables are
automatically generated using the SAS SURVEYFREQ procedure. If
survey weights are specified, then these confidence interval estimates are
adjusted for survey sampling and design structures.
b. Univariate analysis (UNIVAR2)
−− If survey weights are specified, then the reported confidence interval
estimates are adjusted for survey sampling and design structures using
SURVEYMEAN procedure.
III. Chapter 4
a. PCA and factor analysis (FACTOR2)
−− PCA and factor analysis can be performed using the covariance matrix.
−− Estimation of the Cronbach coefficient alpha and its 95% confidence
interval when performing latent factor analysis.
−− Factor pattern plots (New 9.2: statistical graphics feature) before and
after rotation.
−− Assessing the significance and the nature of factor loadings (New 9.2:
statistical graphics feature).
−− Confidence interval estimates for factor loading when ML factor analysis
is used.
b. Disjoint cluster analysis (DISJCLUS2)
IV. Chapter 5
a. Multiple linear regressions (REGDIAG2)
−− Variable screening step using GLMSELECT and best candidate model
selection using AICC and SBC.
V. Chapter 6
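As an illustration of the one-way frequency indexes listed under FREQ2 above, the following is a minimal Python sketch, not the macro's own code. This excerpt does not give the formulas FREQ2 uses, so the sketch assumes the common definitions: the Gini index as 1 minus the sum of squared level proportions, and the entropy index as the Shannon entropy in bits. The function name `one_way_indexes` and the sample data are hypothetical.

```python
import math
from collections import Counter


def one_way_indexes(values):
    """Return (gini, entropy) for the one-way frequency table of `values`.

    Assumed definitions (not taken from the book):
      gini    = 1 - sum(p_k ** 2)
      entropy = -sum(p_k * log2(p_k))
    where p_k is the proportion of observations in level k.
    """
    counts = Counter(values)                      # one-way frequency table
    n = sum(counts.values())
    proportions = [c / n for c in counts.values()]
    gini = 1.0 - sum(p * p for p in proportions)
    entropy = -sum(p * math.log2(p) for p in proportions)
    return gini, entropy


# Hypothetical categorical variable with two equally frequent levels.
levels = ["yes", "no", "yes", "no"]
gini, entropy = one_way_indexes(levels)
print(f"Gini index: {gini:.3f}, entropy: {entropy:.3f} bits")
```

Under these definitions both indexes are zero when every observation falls in a single level and grow as the levels become more evenly spread.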
Potential Audience
◾◾ This book is suitable for SAS data analysts, who need to apply data mining
techniques using existing SAS modules for successful data mining, without
investing a lot of time in buying new software products, or spending time on
additional software learning.
◾◾ Graduate students in business, health sciences, biological, engineering, and
social sciences can successfully complete data analysis projects quickly using
these SAS macros.
◾◾ Big business enterprises can use data mining SAS macros in pilot studies
involving the feasibility of conducting a successful data mining endeavor
before investing big bucks on full-scale data mining using SAS EM.
◾◾ Finally, any SAS users who want to impress their boss can do so with quick and
complete data analysis, including fancy reports in PDF, RTF, or HTML format.
Additional Resources
Book’s Web site: A Web site has been set up at http://www.cabnr.unr.edu/gf/dm.
Users can find information on downloading the sample data files used in
the book, along with additional reading materials. Users are also encouraged to visit this
page for information on any errors in the book, SAS macro updates, and links to
additional resources.
George Fernandez
University of Nevada-Reno
* This was originally an acronym for statistical analysis system. Since its founding and adoption
of the term as its trade name, the SAS Institute, headquartered in North Carolina, has consid-
erably broadened its scope.
1.1 Introduction
Data mining, or knowledge discovery in databases (KDD), is a powerful infor-
mation technology tool with great potential for extracting previously unknown
and potentially useful information from large databases. Data mining automates
the process of finding relationships and patterns in raw data and delivers results
that can either be utilized in an automated decision support system or assessed by
decision makers. Many successful enterprises practice data mining for intelligent
decision making [1]. Data mining allows the extraction of nuggets of knowledge
from business data that can help enhance customer relationship management
(CRM) [2] and can help estimate the return on investment (ROI) [3]. Using powerful
advanced analytical techniques, data mining enables institutions to turn raw
data into valuable information and thus gain a critical competitive advantage.
With data mining, the possibilities are endless. Although data mining appli-
cations are popular among forward-thinking businesses, other disciplines that
maintain large databases could reap the same benefits from properly carried out
data mining. Some of the potential applications of data mining include charac-
terizations of genes in animal and plant genomics, clustering and segmentations
in remote sensing of satellite image data, and predictive modeling in wildfire inci-
dence databases.
The purpose of this chapter is to introduce data mining concepts, provide some
examples of data mining applications, list the most commonly used data min-
ing techniques, and briefly discuss the data mining applications available in the
SAS software. For a thorough discussion of data mining concepts, methods, and
applications, see references [4–6].
defined as any centralized data repository that makes it possible to extract archived
operational data and overcome inconsistencies between different data formats.
Thus, data mining and knowledge discovery from large databases become feasible
and productive with the development of cost-effective data warehousing.
A successful data warehousing operation should have the potential to integrate
data wherever it is located and whatever its format. It should provide the busi-
ness analyst with the ability to quickly and effectively extract data tables, resolve
data quality problems, and integrate data from different sources. If the quality of
the data is questionable, then business users and decision makers cannot trust the
results. In order to fully utilize data sources, data warehousing should allow you
to make use of your current hardware investments, as well as provide options for
growth as your storage needs expand. Data warehousing systems should not limit
customer choices, but instead should provide a flexible architecture that accommo-
dates platform-independent storage and distributed processing options.
Data quality is a critical factor for the success of data warehousing projects.
If business data is of an inferior quality, then the business analysts who query the
database and the decision makers who receive the information cannot trust the
results. Quality at the level of individual records is necessary to ensure that the data is
accurate, up to date, and consistently represented in the data warehouse.
analysis, provide superior analytical depth. Thus, quality data mining is now fea-
sible with the availability of advanced analytical solutions.
1.6.2 Data Processing
The key to successful data mining is using the right data. Preparing data for mining
is often the most time-consuming aspect of any data mining endeavor. A typical
data structure suitable for data mining should contain observations (e.g., custom-
ers and products) in rows and variables (demographic data and sales history) in
columns. Also, the measurement levels (interval or categorical) of each variable in
the dataset should be clearly defined. The steps involved in preparing the data for
data mining are as follows:
◾◾ Multiple linear regressions (MLRs): In MLR, the association between the two
sets of variables is described by a linear equation that predicts the continuous
response variable from a function of predictor variables.
◾◾ Logistic regressions: It allows a binary or an ordinal variable as the response
variable and allows the construction of more complex models rather than
straight linear models.
◾◾ Neural net (NN) modeling: It can be used for both prediction and classification.
NN models enable the construction, training, and validation of multilayer
feed-forward network models for modeling large data and complex interactions
with many predictor variables. NN models usually contain more parameters
than a typical statistical model, the results are not easily interpreted,
and no explicit rationale is given for the prediction. All variables are treated
as numeric, and all nominal variables are coded as binary. Relatively more
training time is needed to fit the NN models.
◾◾ Classification and regression tree (CART): These models are useful in
generating binary decision trees by repeatedly splitting subsets of the
dataset, using all predictor variables, into two child nodes, beginning
with the entire dataset. The goal is to produce subsets of the data
that are as homogeneous as possible with respect to the target variable.
Continuous, binary, and categorical variables can be used as response
variables in CART.
◾◾ Discriminant function analysis: This is a classification method used to deter-
mine which predictor variables discriminate between two or more natu-
rally occurring groups. Only categorical variables are allowed to be the
response variable, and both continuous and ordinal variables can be used as
predictors.
◾◾ CHAID decision tree (Chi-square Automatic Interaction Detector): This is a
classification method used to study the relationships between a categorical
response measure and a large series of possible predictor variables, which may
interact among one another. For qualitative predictor variables, a series of chi-
square analyses are conducted between the response and predictor variables
to see if splitting the sample based on these predictors leads to a statistically
significant discrimination in the response.
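The chi-square analysis that the CHAID description above relies on can be sketched in a few lines of Python. This is only an illustration of the Pearson chi-square statistic for one candidate predictor. It is not a CHAID implementation, which would also merge predictor categories and adjust significance levels for multiple testing. The function name and the counts are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table.

    `table` is a list of rows: one row per predictor level,
    one column per response level, each cell an observed count.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if predictor and response were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two predictor levels,
# columns are the two response levels.
table = [[30, 10],
         [10, 30]]
chi2 = chi_square_statistic(table)
degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
print(f"chi-square = {chi2:.2f} on {degrees_of_freedom} degree(s) of freedom")
```

A large statistic relative to the chi-square distribution with the stated degrees of freedom suggests that splitting on this predictor separates the response levels. A statistic near zero suggests that the split adds little.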
1.6.6 Model Validation
Validating models obtained from training datasets against independent validation
datasets is an important requirement in data mining to confirm the usability of the
developed model. Model validation assesses the quality of the model fit and protects
against overfitted or underfitted models. Thus, it could be considered the most
important step in the model-building sequence.
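The idea of checking a model against an independent validation dataset can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the book's SAS macro code. It fits a one-predictor least-squares line on a small training set and compares the mean squared error on the training and validation sets. All function names and data values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(model, xs, ys):
    """Average squared prediction error of `model` on (xs, ys)."""
    intercept, slope = model
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical training and validation observations.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.2, 7.8]
valid_x = [5.0, 6.0]
valid_y = [10.1, 11.8]

model = fit_line(train_x, train_y)                    # fit on training data only
train_mse = mean_squared_error(model, train_x, train_y)
valid_mse = mean_squared_error(model, valid_x, valid_y)
print(f"training MSE = {train_mse:.4f}, validation MSE = {valid_mse:.4f}")
```

A validation error far above the training error points to an overfitted model, while large errors on both sets point to an underfitted one.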
Sample your data by extracting a portion of a large dataset big enough to contain
the significant information, and yet small enough to manipulate quickly.
Explore your data by searching for unanticipated trends and anomalies in order
to gain understanding and ideas.
Modify your data by creating, selecting, and transforming the variables to focus
on the model selection process.
Model your data by allowing the software to search automatically for a combina-
tion of data that reliably predicts a desired outcome.
Assess your data by evaluating the usefulness and reliability of the findings from
the data mining process.
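The Sample step above can be illustrated with a simple random split into training, validation, and test partitions. The sketch below is in Python and is not the RANSPLIT2 macro or the SAS SURVEYSELECT procedure. The function name, the seed, and the partition sizes are hypothetical choices made for illustration.

```python
import random


def split_sample(records, n_train, n_valid, seed=2010):
    """Shuffle a copy of `records` and cut it into three partitions.

    The first `n_train` shuffled records form the training set, the next
    `n_valid` form the validation set, and all left-over records form
    the test set. A fixed seed keeps the split reproducible.
    """
    if n_train + n_valid > len(records):
        raise ValueError("training and validation sizes exceed the data size")
    shuffled = list(records)                 # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Hypothetical dataset of 1,000 record identifiers.
records = list(range(1000))
train, valid, test = split_sample(records, n_train=400, n_valid=300)
print(len(train), len(valid), len(test))
```

Every record lands in exactly one partition, so the validation and test sets stay independent of the data used to fit the model.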
By assessing the results gained from each stage of the SEMMA process, you can
determine how to model new questions raised by the previous results, and thus proceed
back to the exploration phase for additional refinement of the data. The SAS
data mining solution integrates everything you need for discovery at each stage of
the SEMMA process. These data mining tools indicate patterns or exceptions and
mimic human abilities for comprehending spatial, geographical, and visual information
sources. Complex mining techniques are carried out in a totally code-free
environment, allowing you to concentrate on the visualization of the data, the discovery
of new patterns, and new questions to ask.
not using the SAS Enterprise Miner, but they are licensed to use SAS BASE, STAT,
and GRAPH modules. Thus, these user-friendly SAS macro applications for data
mining are targeted at this group of customers. Also, providing the complete SAS
codes for performing comprehensive data mining solutions is not very effective
because a majority of the business and statistical analysts are not experienced SAS
programmers. Quick results from data mining are not feasible since many hours
of code modification and debugging program errors are required if the analysts are
required to work with SAS program code.
◾◾ Users can perform comprehensive data mining tasks by inputting the macro
parameters in the macro-call window and by running the SAS macro.
◾◾ SAS code required for performing data exploration, model fitting, model
assessment, validation, prediction, and scoring are included in each macro.
Thus, complete results can be obtained quickly by using these macros.
◾◾ Experience in SAS output delivery system (ODS) is not required because
options for producing SAS output and graphics in RTF, WEB, and PDF are
included within the macros.
◾◾ Experience in writing SAS program code or SAS macros is not required to
use these macros.
◾◾ SAS-enhanced data mining software Enterprise Miner is not required to run
these SAS macros.
◾◾ All SAS macros included in this book use the same simple user-friendly format.
Thus, minimum training time is needed to master the usage of these macros.
◾◾ Regular updates to the SAS macros will be posted on the book's Web site. Thus,
readers can always use the updated features in the SAS macros by downloading
the latest versions.
1.10 Summary
Data mining is a journey: a continuous effort to combine your enterprise knowledge
with the information you extract from the data you have acquired. This
chapter briefly introduces the concepts and applications of data mining techniques,
the secret and intelligent weapon that unleashes the power in your data.
SAS Institute, the industry leader in analytical and decision support solutions,
provides the powerful software called Enterprise Miner to perform complete data mining
solutions. However, many small businesses and academic institutions do not have
a license for that application, although they are licensed for SAS BASE, STAT,
and GRAPH. As an alternative to the point-and-click menu interface modules,
user-friendly SAS macro applications for performing several statistical data mining
tasks are included in this book. Instructions are given in the book for downloading
and applying these user-friendly SAS macros to produce quick and complete
data mining solutions.
References
1. SAS Institute Inc., Customer success stories at http://www.sas.com/success/ (last accessed 10/07/09).
2. SAS Institute Inc., Customer relationship management (CRM) at http://www.sas.com/solutions/crm/index.html (last accessed 10/07/09).
3. SAS Institute Inc., SAS Enterprise Miner product review at http://www.sas.com/products/miner/miner_review.pdf (last accessed 10/07/09).
4. Two Crows Corporation, Introduction to Data Mining and Knowledge Discovery, 3rd ed., 1999 at http://www.twocrows.com/intro-dm.pdf.
5. Berry, M. J. A. and Linoff, G. S., Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons, New York, 1997.
6. Berry, M. J. A. and Linoff, G. S., Mastering Data Mining: The Art and Science of Customer Relationship Management, 2nd ed., John Wiley & Sons, New York, 1999.
7. SAS Institute Inc., The Power to Know at http://www.sas.com.
8. SAS Institute Inc., Data Mining Using Enterprise Miner Software: A Case Study Approach, 1st ed., Cary, NC, 2000.
9. SAS Institute Inc., The Enterprise Miner, http://www.sas.com/products/miner/index.html (last accessed 10/07/09).
10. SAS Institute Inc., The Enterprise Miner standalone tutorial, http://www.cabnr.unr.edu/gf/dm/em.pdf (last accessed 10/07/09).