
BUSINESS

ANALYTICS FOR
MANAGERS
Session I

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read & write
- H.G.Wells
Course Objective(s) & Outcome(s)
Course Objective(s): This course will cover the basic concepts of big data and
methodologies for analyzing structured, semi-structured and unstructured data, with
emphasis on the association between data science and business needs. The course is
intended for first-year management students coming from backgrounds in engineering,
commerce, arts, computer science, statistics, mathematics, economics and management.
This course seeks to present the student with a wide range of data analytic techniques and
is structured around the broad contours of the different types of data analytics, namely:
descriptive, inferential, predictive, and prescriptive analytics.
Course Outcome(s): By the time the student completes the academic requirements, he/ she
will be able to:
• Obtain, clean/process and transform data.
• Analyse and interpret data using an ethically responsible approach.
• Use appropriate models of analysis, assess the quality of input, derive insight from
results, and investigate potential issues.
• Apply computing theory, languages and algorithms, as well as mathematical and
statistical models, and the principles of optimization to appropriately formulate and use
data analyses.
• Formulate and use appropriate models of data analysis to answer business-related
questions.
• Interpret data findings effectively to any audience, orally, visually and in written form.
Syllabus
Unit I : Introduction to Business Analytics and Data
Types of Digital Data: Structured Data, Unstructured Data, and Semi-Structured Data;
Introduction to Big Data; Overview of Business Analytics; Skills Required for a Business
Analyst; Functional Applications of Business Analytics in Management.
Introduction to R Programming; Data Manipulation in R: Vectors, Basic Math, and Matrix
Operations; Summarizing Data: Numerical and Graphical Summaries; Data Visualization in
R; Data Transformation; Data Import Techniques in R; Time Series and Spatial Graphs;
Graphs for Categorical Responses and Panel Data.
Unit II: Descriptive and Prescriptive Analytics
Basic Data Summaries: Measures of Central Tendency, Measures of Dispersion, and
Measures of Skewness and Kurtosis; Slicing and Filtering of data; Subsets of Data; Overview
of Exploratory and Confirmatory Factor Analysis; Unsupervised Learning: Clustering and
Segmentation - K-means Clustering and Association Rule Mining – Market Basket Analysis.
Discussion using one caselet for each concept using R (wherever applicable).
Unit III: Predictive and Diagnostic Analytics
Machine Learning: Building Regression Models – Simple Linear and Multiple Linear
Regression Analysis using Ordinary Least Squares Method; Supervised Learning –
Regression and Classification Techniques: Logistic Regression Analysis; Linear Discriminant
Analysis; Decision Trees; Unstructured Data Analytics: Overview of Text Mining and Web
Mining.
Discussion using one caselet for each concept using R (wherever applicable).
Suggested Readings
• A Ohri (2012), “R for Business Analytics”, ISBN 978-1-4614-4342-1(eBook), DOI 10.1007/978-1-4614-4343-
8, Springer New York-Heidelberg Dordrecht London, Springer Science, New York.
• Arnab K.Laha (2015), “How to Make The Right Decision”, Random House Publishers India Pvt. Ltd., Gurgaon,
Haryana, India.
• Joseph F. Hair, William C. Black, Barry J. Babin and Rolph E. Anderson (2015), “Multivariate Data
Analysis”, Pearson Education, New Delhi, India.
• Jared P. Lander (2013), “R for Everyone: Advanced Analytics and Graphics”, Pearson Education Inc., New
Jersey, USA.
• Johannes Ledolter (2013), “Data Mining and Business Analytics with R”, John Wiley & Sons, Inc., New
Jersey, USA.
• Prasad R N and Acharya Seema (2013), “Fundamentals of Business Analytics”, Wiley India Pvt. Ltd., New
Delhi, India.
• Glyn Davis and Branko Pecar (2013), “Business Statistics using Excel”, Oxford University Press, New Delhi.
• Halady Rao Purba (2013), “Business Analytics an Application Focus”, PHI Learning Private Limited, New
Delhi.
• Jank Wolfgang (2011), “Business Analytics for Managers”, SpringerScience + Business Media, ISBN 978-1-
4614-0405-7.
• Subhashini Sharma Tripathi, “Learn Business Analytics in Six Steps Using SAS and R”, ISBN-13 (pbk):
978-1-4842-1002-4 ISBN-13 (electronic): 978-1-4842-1001-7, Bangalore, Karnataka, India.
• Dr. Umesh R. Hodeghatta and Umesha Nayak, “Business Analytics Using R - A Practical Approach”, ISBN-
13 (pbk): 978-1-4842-2513-4 ISBN-13 (electronic): 978-1-4842-2514-1, DOI 10.1007/978-1-4842-2514-1,
Bangalore, Karnataka, India.
• Thomas A. Runkler, “Data Analytics Models and Algorithms for Intelligent Data Analysis”, Springer, ISBN
978-3-8348-2588-9 ISBN 978-3-8348-2589-6 (eBook) DOI 10.1007/978-3-8348-2589-6.
• Bhasker Gupta, “Interview Questions in Business Analytics”, Apress, ISBN-13 (pbk): 978-1-4842-0600-3
ISBN-13 (electronic): 978-1-4842-0599-0, DOI 10.1007/978-1-4842-0599-0.
List of Journals
• Journal of Retailing - Elsevier
• Journal of Business Research - Elsevier
• Industrial Management & Data Systems- Emerald
Learning objectives
1. Digital Data Formats
2. Genesis
3. Data Science/ Analytics (Business Solutions)
4. Overview of Analytics
5. Types of Business Analytics
6. Foundations of R (A Free Analytic Tool) – Managing Qualitative
Data
7. Text Mining using R Software for Windows
1. Analyzing Online Government Reports
2. Analyzing Online Research Articles
3. Analyze Interviews
Prasanta Chandra Mahalanobis
India's first Big Data Man
• Public use of statistics started in India with Prasanta Chandra Mahalanobis (PCM)
• PCM pursued his education at the Brahmo Boys School in Calcutta and later joined
Presidency College in the same city. Following this, he left for the University of London
for higher studies
• In 1915, when Mahalanobis's ship from England to India was delayed, he spent
time in the library of King's College, Cambridge, where he found Biometrika, a leading
journal on theoretical statistics of the time. A physics student's sudden interest led
to India's rise in the field of statistics
• In 1931, he set up the Indian Statistical Institute (ISI) as a registered society
• He introduced the concept of pilot surveys & advocated the usefulness of
sampling methods
• Early surveys began between 1937-1944 which included consumer expenditure,
tea-drinking habits, public opinion, crop acreage and plant disease
• He served as the Chairman of the UN Sub-Commission on Sampling from 1947-
1951 & was appointed the honorary statistical adviser to the GoI in 1949
• For his pioneering work, he was awarded the Padma Vibhushan in 1968
• The eminent scientist breathed his last on June 28, 1972
1. Digital Data Formats
• Data has seen exponential growth since the advent of the computer & internet
• Digital data can be classified into 3 forms:
  • Unstructured
  • Semi-structured
  • Structured
• According to Merrill Lynch, 80-90% of business data is either unstructured or semi-structured
• Gartner also estimates that unstructured data constitutes 80% of the whole enterprise data
(Chart: approximate split of enterprise data – 80% unstructured data, 10% semi-structured data, 10% structured data)
Structured Data
• Data conforms to some specification
• A well-defined schema enables efficient data processing, improved storage and navigation of content
• Defines the type & structure of data, & their relations
• Limitation – difficult to subsequently extend a previously defined database schema that already contains content

Semi-structured Data
• Data is collected in an ad-hoc manner before it is known how it will be stored and managed
• Does not conform to a data model but has some structure
• Not all the information collected will have identical structure
• Advantage – ability to accommodate variations in structure

Unstructured Data
• Does not conform to a data model
• It is not in a form which can be used easily by a computer program
• Advantage – no additional effort on its classification is necessary
• About 80-90% of an organization's data is in this format
"GoodLife HealthCare Group, one of India's leading healthcare groups, began
its operations in the year 2000 in a small town off the south-east coast of India,
with just one tiny hospital building with 25 beds. Today, the group owns 20 multi-
specialty healthcare centers across all the major cities of India. The group has
witnessed some major successes & attributes them to its focus on assembly-line
operations & standardization. GoodLife HealthCare offers the following
facilities: emergency care 24 x 7, support groups, and support & help through call
centers. The group believes in making a "Dent in Global Healthcare". A few of
its major milestones are listed below in chronological order:
• Year 2000 – the birth of the GoodLife HealthCare Group, functioning initially from a tiny hospital building with
25 beds
• Year 2002 – built a low-cost hospital with 200 beds in India
• Year 2004 – gained a foothold in other cities of India
• Year 2005 – the total number of healthcare centers owned by the group touched the 20 mark
• The next 5 years saw the group's dominance in the form of it setting up a GoodLife HealthCare Research
Institute to conduct research in molecular biology and genetic disorders
• Year 2010 witnessed the group winning the award for the "Best HealthCare Organization of the Decade"

Case Study: GoodLife HealthCare Group – Research Questions
• What data is present in the system?
• How is it stored?
• How important is the information?
• How can this information enhance healthcare services?
Structured Data
• For every patient who visits the hospital, a nurse makes electronic records & stores them in a relational database
• Nurse Mr. Nandu records the body temperature & BP of Mr. Prem (patient) & stores the details in the hospital database
• Dr. Dev, who is treating Mr. Prem, searches the database & is able to locate the desired information easily because the hospital data is structured & stored in a relational database

Semi-structured Data
• Dr. Vishnu, Neurologist at GoodLife HealthCare, usually gets a blood test done for migraine patients as he believes that patients with migraine have a high platelet count. He makes a note of the diagnosis in the conclusion section of the report.
• Dr. Mamatha searches the database for one of her patients with a similar health condition, but with no luck!
• Mr. Prem's blood test reports are not successfully updated in the medical system database as they were in the semi-structured format

Unstructured Data
• Dr. Sami, Dr. Raj & Dr. Rahul work at the medical facility of GoodLife. Over the past few days, Dr. Sami & Dr. Raj have been exchanging long emails about a particular case of gastro-intestinal problem.
• Dr. Raj, using a particular combination of drugs, has successfully cured the disorder in his patients. He has written an email about this combination of drugs to Dr. Sami & Dr. Rahul.
• Mr. Prem visits Dr. Rahul with a similar case of gastro-intestinal disorder. Dr. Rahul quickly searches the organization's database for the process, but with no luck, as the email conversation has not been successfully updated into the medical system database since it fell in the unstructured format

GoodLife HealthCare Patient Index Card (structured): Patient ID <>, Date <>, Doctor <>, Nurse Name <>, Patient Name <>, Patient Age <>, Body Temperature <>, Blood Pressure <>
GoodLife HealthCare – Blood Test Report (semi-structured): Patient Name <>, Patient Age <>, WBC Count <>, Hemoglobin <>, RBC Count <>, Platelet Count <>, Conclusion <notes>
Exponential Growth of Data
• Until 2005 humans had created 130 Exabytes of data
2010 – 1,200 Exabytes
2015 – 7,900 Exabytes
2020 - 40,900 Exabytes
• It is estimated that the digital bits captured each year in India will
grow from 127 exabytes to 2.9 zettabytes between 2012 and 2020
Note: -
• 1 Byte (smallest unit of storage)
• 1 Kilobyte = 1000 Bytes
• 1 Megabyte = 1000 Kilobytes
• 1 Gigabyte = 1000 Megabytes
• 1 Terabyte = 1000 Gigabytes
• 1 Petabyte = 1000 Terabytes
• 1 Exabyte = 1000 Petabytes
2. Genesis
• Business users are more interested in action & outcomes
• Big data - Collection of large/ vast data sets
• India has
• Over 200,000 - Factories
• 29 million - Estimated public & private sector employment
• Globally, India stands in Top 5 for mobile consumers & social media usage:
• Over 1.2 billion - Population
• Over 890 million - Mobile subscribers
• 213 million - Internet subscribers
• 115 million - Facebook users
• 24 million - LinkedIn users
• Quantifying & analyzing data is virtually impossible using conventional
databases & computing technology
• Levels of Support - Big data initiative requires 3 levels of support
1. Infrastructure – Designing the architecture, providing the enterprise hardware & cloud solutions,
assisting the management for big data enterprise etc.
2. Software development – Big data software platforms such as R, SAS, Hadoop, NoSQL, MapReduce,
Pig, Python and related big data software tools are essential
3. Analytics – Once the software delivers the results there has to be insights derived from the
numbers
• Big data through a single analytical model analyzes data, bringing together
both structured data (sales & transactional records) & unstructured data
(social media comments, audio, & video)
• Organizations started turning to self-service data discovery & data
visualization products such as Qlik (founded in 1993), Spotfire (founded in
1996) & Tableau Software (founded in 2003) to analyze data
• Big-data Challenges
1. Data discovery & data visualization tools available, though mature, aren’t always
suitable for non-technical business users
2. Capture, data curation, search, sharing, storage, transfer, analysis, visualization,
querying & information privacy
3. Predictive insights are typically available only in high-value pockets of businesses
as data science talent remains scarce
• Big Data Security - Big data is not inherently secure. It provides a
consolidated view of the enterprise. Analytics identifies threats in real time
1. Security management – Real-time security data can be combined with big data analytics
2. Identity & access management – Allows the enterprise to adapt identity controls for secure
access on demand
3. Fraud detection & prevention – Analyzes massive amounts of behavioral data to instantly
differentiate between legitimate and fraudulent activity
4. Governance, risk management, & compliance – Unifies & enables access to real-time
business data, promoting smarter decision-making that mitigates risk
• Bain & Company study revealed that early adopters of big data analytics
have a significant lead over their competitors. After surveying more than 400
companies, Bain determined that those companies with the best analytics
capabilities come out the clear winners in their market segments:
• Twice as likely to be in the top 25 per cent in financial performance
• 5 times more likely to make strategic business decisions faster than their competitors
• 3 times more likely to execute strategic decisions as planned
• Twice as likely to use data more frequently, when making decisions
• McKinsey Global Institute Report (May 2017) - McKinsey surveyed more
than 500 executives, across the spectrum of industries, regions, and sizes
• More than 85% acknowledged they were somewhat effective at their data analytics initiatives
• Digitization is uneven among companies, sectors & economies. Leaders are reaping benefits
• Innovations in digitization, analytics, artificial intelligence, & automation are creating
performance and productivity opportunities for business & economy
• By 2018, the USA alone could face a shortage of 140,000-190,000 people with deep analytical
skills as well as 1.5 million managers/ analysts with the know-how to use big data analysis to make effective decisions
• NASSCOM (National Association of Software & Services Companies)
Report
• Big data analytics sector in India is expected to witness eight-fold growth to reach $16 billion
by 2025 from the current level of $2 billion
• India will have 32 per cent share in the global market
• Identified 'marketing analytics' as a core growth area for the industry
• Industry is expected to grow at an average of 25% every year till 2020 to reach $1.2 billion. It
is at present estimated at $200 million (Rs. 1000 crore)
• Data analytics segment would require huge manpower over next five years
Industries Using Analytics for Business
Facts about Big Data and Analytics
• Data & analytics underpin six disruptive models, and certain characteristics make
individual domains susceptible
Most Popular Analytic Tools
(in the Business World)
1. MS Excel: Excellent reporting & dashboarding tool. The latest versions of Excel
can handle tables with up to 1 million rows, making it a powerful yet versatile
tool
2. SAS: 5000 pound gorilla of the analytics world. Most commonly used
software in the Indian analytics market despite its monopolistic pricing
3. SPSS Modeler (Clementine): A data mining software tool by SPSS Inc., an
IBM company. This tool has an intuitive GUI & its point-and-click modelling
capabilities are very comprehensive
4. Statistica: Provides data analysis, data management, data mining, & data
visualization procedures. The GUI is not the most user-friendly, takes a little
more time to learn but is a competitively priced product
5. R: It is an open source programming language & software environment for
statistical computing and graphics.
6. Salford Systems: Provides a host of predictive analytics & data mining
tools for businesses. The company specializes in classification &
regression tree algorithms. The software is easy to use & learn
7. KXEN: Drives automated analytics. Their products, largely based on
algorithms developed by the Russian mathematician Vladimir Vapnik, are
easy to use, fast & can work with large amounts of data
8. Angoss: Like Salford Systems, Angoss has developed its products around
classification & regression decision tree algorithms. The tools are easy to
learn & use, and the results are easy to understand & explain. The GUI is
user friendly and many new features have been added
9. MATLAB: Allows matrix manipulations, plotting of functions & data,
implementation of algorithms & creation of user interfaces. There are many
add-on toolboxes that extend MATLAB. MATLAB is not free software;
however, there are clones like Octave and Scilab which are free and have
similar functionality
10. Weka: Weka (Waikato Environment for Knowledge Analysis) is a popular
suite of machine learning software, developed at the University of Waikato,
New Zealand. Weka, along with R, is amongst the most popular open
source software
3. Data Science/ Analytics
(Business Solutions)
Data Science
1. Practice of various scientific fields, their algorithms, approaches & processes
2. Using programming languages & software frameworks
3. Aiming to extract knowledge, insights & recommendations from data &
4. Deliver them to business users & consumers in consumable applications

Relevant Scientific Fields (not exhaustive)
1. Statistics
2. Computer Science
3. Operations Research
4. Engineering
5. Applied Mathematics
6. Domain Specific Knowledge

Note: In 2018 and beyond, we'll see a growing list of "smart" capabilities powered by machine learning & artificial intelligence
Relevant Scientific Fields (not exhaustive)
• Predictive / Statistical / Machine Learning: Clustering, Classification, Neural Networks, Regression
• Computer Science: Search, Sorting, Merging, Encryption
• Operations Research / Applied Mathematics: Linear & Non-linear Optimization
• Engineering: Signal Processing

Relevant Languages & Software Frameworks (not exhaustive)
• Open Source – Programming Languages: R, Python, Scala; Software Frameworks: Hadoop, Spark
• Proprietary – IBM Tools: SPSS, DSX, Machine Learning, CPLEX; Non-IBM Tools: SAS, Matlab
Hyper-Radical Personalization
• Determine what is of interest to a specific user in a specific context and present it to them when needed
• Retail, Finance, Telecom: Recommend other products and services
• Media: Recommend relevant or similar content

Resource Allocation
• Determine how to use resources while respecting operational constraints
• Finance: Portfolio investment
• CPG, Retail: Supply chain management, inventory allocation
• Industrial: Manufacturing scheduling
• Travel and Transportation: Scheduling of trucks, planes and ships

Predictions and Classifications
• Predict and classify events and determine their impact/importance
• Retail and CPG: Predict weather impact on sales
• Healthcare: Diagnose and recommend therapy and drugs
• Travel and Transportation: Predict and classify weather impact on movement speed
Patterns, Anomalies, Trends
• Determine when a pattern appears or is broken
• Finance, Government: Fraud detection and fraud scoring
• Healthcare: Fraud detection and fraud scoring
• Media: Anticipate user requests

Price and Product Optimization
• Determine the right price and product for each customer given the context
• Travel and Transportation: Update prices based on availability, weather, competition, trends
• CPG and Retail: Stock and present the right product mix by store, warehouse, or on-line

Unstructured Data and Natural Language
• Consume natural language and determine intentions, context etc.
Applications of Machine
Learning
1. Facial Recognition – Facebook
2. Virtual Reality Headsets
3. Voice Recognition – iPhone
4. Reinforcement Learning - Robot Dogs
5. Amazon, Netflix, Audible
6. Medicine
7. Space – through maps
8. Explore new Territories
4. OVERVIEW OF ANALYTICS
Descriptive, Diagnostic, Predictive
and Prescriptive Analytics
Supervised Learning - You have a target, a value or a class, to predict. The
model is trained on historical data and uses it to forecast future values.
Hence the model is supervised: it knows what to learn
• Example: we wish to predict the revenue of a store from different inputs (day of the week,
advertising, promotion)
• Supervised learning groups together different techniques which all share the same principles
• The training dataset contains input data (your predictors) and the value you want to predict
• The model uses the training data to learn a link between the inputs and the output. The underlying
idea is that the training data can be generalized and that the model can be used on new data
with some accuracy
• Some supervised learning algorithms include: linear & logistic regression, support vector
machines, Naive Bayes, neural networks, gradient boosting, classification trees & random forests
• Supervised learning is often used for expert systems in image recognition, speech recognition,
forecasting, and in some specific business domains. A minimal R sketch of this idea is given below.
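To make the idea concrete, here is a minimal R sketch using simulated (hypothetical) store data and ordinary least squares via lm(); the variable names (advertising, promotion, revenue) are illustrative only, not part of the course material.
set.seed(1)
advertising <- runif(50, 0, 100)                         # simulated predictor: advertising spend
promotion   <- rbinom(50, 1, 0.3)                        # simulated predictor: promotion on/off
revenue     <- 200 + 3*advertising + 50*promotion + rnorm(50, sd = 20)  # target with noise
store <- data.frame(revenue, advertising, promotion)
model <- lm(revenue ~ advertising + promotion, data = store)   # train on "historical" data
summary(model)                                                  # the learned link between inputs & output
predict(model, newdata = data.frame(advertising = c(40, 80), promotion = c(0, 1)))  # forecast for new inputs
The predict() step is exactly the generalization to new data described above.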
Unsupervised Learning - You have unlabelled data and look for patterns or
groups in the data. Instead of doing it manually, unsupervised machine
learning will automatically discriminate different clients. Unsupervised
algorithms can be split into different categories:
• For example, you want to cluster clients according to the type of products they order, how
often they purchase your product, their last visit etc.
• Clustering algorithms, such as K-means, hierarchical clustering or mixture models. These
algorithms try to discriminate and separate the observations into different groups
• Dimensionality reduction algorithms (which are mostly unsupervised) such as PCA, ICA or
autoencoders. These algorithms find the best representation of the data with fewer dimensions
• Anomaly detection to find outliers in the data, i.e. observations which do not follow the data set
patterns
• Most of the time, unsupervised learning algorithms are used to pre-process the data during
exploratory analysis or to pre-train supervised learning algorithms. A minimal clustering sketch in R is given below.
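A minimal clustering sketch in R, assuming a hypothetical client table with two illustrative variables; kmeans() from base R does the segmentation.
set.seed(2)
clients <- data.frame(
  orders_per_year  = c(rpois(25, 5),  rpois(25, 20)),        # simulated purchase frequency
  avg_basket_value = c(rnorm(25, 30, 5), rnorm(25, 80, 10))  # simulated spend per order
)
fit <- kmeans(scale(clients), centers = 2, nstart = 25)      # discriminate two client segments
table(fit$cluster)                                           # how many clients fall in each cluster
aggregate(clients, by = list(cluster = fit$cluster), FUN = mean)  # profile each segment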
Reinforcement Learning - You want to attain an objective. For example, you
want to find the best strategy to win a game with specified rules. Once
these rules are specified, reinforcement learning techniques will play this
game many times to find the best strategy
• Reinforcement learning algorithms follow a circular set of steps:
given its own and the environment's states, the agent chooses the action which will
maximize its reward, or explores a new possibility. These actions change
the environment's and the agent's states. They are also interpreted to give a
reward to the agent. By performing this loop many times, the agent improves
its behaviour
• Reinforcement learning already performs well on 'small' dynamic systems and is
definitely an area to follow in the years to come. A tiny sketch of this loop in R is given below.
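A tiny sketch of the explore/exploit loop, assuming a hypothetical three-action "game" with fixed win probabilities (an epsilon-greedy bandit); it is only meant to illustrate the choose-action, receive-reward, update cycle described above, not any specific algorithm from the syllabus.
set.seed(3)
true_reward <- c(0.2, 0.5, 0.8)      # unknown win probability of each of three actions
Q <- rep(0, 3); N <- rep(0, 3)       # the agent's estimated value & play count per action
epsilon <- 0.1                        # probability of exploring a new possibility
for (t in 1:1000) {
  a <- if (runif(1) < epsilon) sample(3, 1) else which.max(Q)  # explore or exploit
  r <- rbinom(1, 1, true_reward[a])                            # the environment returns a reward
  N[a] <- N[a] + 1
  Q[a] <- Q[a] + (r - Q[a]) / N[a]                             # update the estimate from the reward
}
round(Q, 2)   # after many plays the estimates approach the true rewards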
Descriptive Analytics: What is happening?
• Uses data aggregation & mining techniques to provide insight into the past
• Standard reporting and dashboards: What happened? How does it compare to our plan? What is happening now?
• Ad-hoc reporting: How many? How often? Where?
• Analysis/query/drill-down: What exactly is the problem? Why is it happening?

Diagnostic Analytics: Why is it happening?
• Discovers the root cause of the problem – "Why did it happen?"
• Characterized by techniques such as drill-down, data discovery, data mining and correlations

Predictive Analytics: What is likely to happen in future based on previous trends & patterns?
• Historical patterns are used to predict specific outcomes
• Uses statistical models/ forecasting techniques to predict

Prescriptive Analytics: What do I need to do?
• Analyzes data – what has happened, why it has happened & what might happen?
• Helps the user determine the best course of action & strategies
• Applies advanced analytical techniques (optimization & simulation algorithms) to advise on possible outcomes & make specific recommendations
Simple Mapping for the Data Science Portfolio

DEVELOPMENT
Private Cloud
• Predictive: SPSS Modeler, SPSS Statistics, DSX Local*
• Prescriptive: CPLEX Optimization Studio (COS), Decision Optimization Center (DOC), DSX Local*
Public Cloud
• Predictive: Data Science Experience (RStudio, Jupyter Notebooks, Watson Machine Learning**)
• Prescriptive: Data Science Experience (RStudio, Jupyter Notebooks)

DEPLOYMENT
Private Cloud
• Predictive: SPSS C&DS
• Prescriptive: CPLEX Optimization Studio (COS), Decision Optimization Center (DOC)
Public Cloud
• Predictive: Watson Machine Learning (on Bluemix)
• Prescriptive: Decision Optimization on Cloud (DOcplexcloud)
Case Study Discussion
Mr.Jones works at Center for Disease Control (CDC) and his job is to
analyze the data gathered from around the country to improve their
response time during flu season. CDC wants to examine the data of
past events (geographical spread of flu last winter – 2012) gathered to
better prepare the state for next winter (2013).
December 2000 happened to be a bad year for the flu epidemic. A new strain of the virus
has wreaked havoc. A drug company produced a vaccine that was effective in
combating the virus. But, the problem was that the company could not produce them fast
enough to meet the demand.
Government had to prioritize its shipments. Government had to wait a considerable
amount of time to gather the data from around the country, analyze it, and take
action. The process was slow and inefficient. The contributing factors included, not
having fast enough computer systems capable of gathering and storing the data
(velocity), not having computer systems that can accommodate the volume of the data
pouring in from all of the medical centers in the country (volume), and not having
computer systems that can process images, i.e, x-rays (variety).
Because of the havoc created by the flu epidemic in 2000 and the government's inability
to rise to the occasion, there was a huge loss of lives.
US Government did not want the scenario in 2000 to repeat. So they decided to
adopt Big Data Technology in handling this flu epidemic.
The report generated by a BI tool (Descriptive Analytics) shows the State of New
York to have the most outbreaks. An interactive visualization tool presented a map
depicting the concentration of flu & vaccine distribution in different states of the
United States last winter. Visually, a direct correlation is detected between the
intensity of the flu outbreak and the late shipment of vaccines. It is noticed that the
shipments of vaccine for the state of New York were delayed last year. This
gives a clue to further investigate the case to determine if the correlation is
causal using Diagnostic Analytics (discovery). Ms.Linda, a data scientist
applies Predictive Analytics to create a model and apply it to data (demand,
vaccine production rate, quantity etc.) in order to identify causal relationships,
correlations, weigh the effectiveness of decisions taken so far and to prepare in
tackling potential problems foreseen in the coming months. Prescriptive
Analytics integrates our tried-and-true predictive models into our repeatable
processes to yield desired outcomes.
Big Data technology solved the velocity-volume-variety problem. The Center
for Disease Control may receive the data from hospitals & doctors in real-time
and Data Analytics Software that sits on the top of Big Data computer system
could generate actionable items that can give the Government the agility it
needs in times of crises.
Software Requirements
• R (3.5.1) - https://cran.r-project.org/bin/windows/base/
• RStudio Desktop (1.1.453) - https://www.rstudio.com/products/rstudio/download/
• Datasets - https://www.superdatascience.com/machine-learning/
FOUNDATIONS
WITH R
Managing Data
With R
Introduction to R
• 3 core skills of data science
• data manipulation
• data visualization and
• machine learning
• The R statistical programming language, a free open source platform, was
developed by R. Gentleman and R. Ihaka at the University of Auckland during the 1990s, as an open implementation of the S language created at Bell Labs
• Mastering core skills of data science will be easier in R, hence R is becoming
the lingua franca for data science. R is
• An open source programming language
• an interpreter
• high-level data analytics and statistical functions are available
• An effective data visualization tool
• Google & Facebook, considered two of the best companies to work for in our modern
economy, have data scientists using R
• R is also the tool of choice for data scientists at Microsoft, who apply machine learning
to data from Bing, Azure, Office, and the Sales, Marketing and Finance departments
• Beyond tech giants like Google, Facebook, and Microsoft, R is in use at a wide
range of companies, including Bank of America, Ford, TechCrunch, Uber, and Trulia
• R is a case sensitive language
• Commands can be entered one at a time at the command prompt (>) or can
run a set of commands from a source file
• Comments - start with a #
• Terminate the current R session - use q(); the Escape key interrupts a running command
• Inbuilt help in R - ?command or >help(command)
• View(dataset) - Invoke a spreadsheet-style data viewer on a matrix-
like R object
• Commands are separated either by a semi-colon (;) or by a newline
• Elementary commands can be grouped into one compound expression by
braces ({ })
• If a command is not complete, R will prompt + on subsequent lines
• Results of calculations can be stored in objects using the assignment
operators: <- or =
• Vertical arrow keys on the keyboard can be used to scroll forward & backward
through a command history
• Basic functions are available by default. Other functions are contained in
packages as (built-in & user-created functions) that can be attached as
needed. They are kept in memory during an interactive session
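A minimal console sketch of the points above (the values are arbitrary, for illustration only):
x <- 5                  # assignment with <-
y = 3                   # assignment with =
x + y; x * y            # two commands on one line, separated by a semicolon
{                       # elementary commands grouped into one compound expression
  z <- x + y
  z^2
}
# ?mean                 # inbuilt help for the mean() function
# q()                   # terminate the current R session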
• RStudio - Created by a team led by JJ Allaire, whose previous products
include ColdFusion and Windows Live Writer
• The usual R Studio screen has four (4) windows
1. Console
2. Environment and history
3. Files, plots, packages and help
4. The R script(s) and data view
• Console – where you can type commands & see output
• Environment/ Workspace – tab shows all the active objects
• History – tab shows a list of commands used so far
• Files – tab shows all the files and folders in your default workspace
• Plots - will show all your graphs
• Packages - tab will list a series of packages or add-ons needed to run certain
processes
• Help – tab to see additional information
• R script - keeps record of your work
• To send one line place the cursor at the desired line & press ctrl+Enter
• To run a file of code, press ctrl+shift+s or run tab
• To terminate the execution of command press ctrl+c
• ctrl+1 moves the cursor to the text editor area; ctrl+2 moves it to the console
• Primary feature of R Studio is projects. A project is a collection of files
Contd...
Operators in R
• An operator performs specific mathematical/ logical manipulations
• R is rich in built-in operators
• R provides the following types of operators: Arithmetic, Relational, Logical, Assignment & Miscellaneous operators

Order of the operations
1. Exponentiation
2. Multiplication & Division in the order in which the operators are presented
3. Addition & Subtraction in the order in which the operators are presented
4. The mod operator (%%) & the integer division operator (%/%) have the same priority as the normal division operator (/) in calculations
5. Basic order of operations in R: Parenthesis (), Exponents (^), Multiplication, Division, Addition & Subtraction (PEMDAS)
6. Operations placed between parentheses are carried out first

R Arithmetic Operators
Operator     Description                                         Example
x + y        y added to x                                        2 + 3 = 5
x - y        y subtracted from x                                 8 - 2 = 6
x * y        x multiplied by y                                   2 * 3 = 6
x / y        x divided by y                                      10 / 5 = 2
x ^ y        x raised to the power y                             2 ^ 3 = 8
x %% y       Remainder of x divided by y (x mod y)               7 %% 3 = 1
x %/% y      x divided by y but rounded down (integer divide)    7 %/% 3 = 2
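A quick console check of the precedence rules listed above:
2 + 3 * 4^2       # 50 : exponent first, then multiplication, then addition
(2 + 3) * 4^2     # 80 : parentheses are carried out first
10 / 5 * 2        # 4  : / and * applied in the order presented (left to right)
7 %% 3            # 1  : remainder (mod)
7 %/% 3           # 2  : integer division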
If x <- c(1.5,2.5,3.5,4.5,5.5) & y <- c(1,2,3,4,5)

Relational Operators
Operator     Description                                              Output
x == y       Returns TRUE if x exactly equals y                       FALSE FALSE FALSE FALSE FALSE
x > y        Returns TRUE if x is larger than y                       TRUE TRUE TRUE TRUE TRUE
x < y        Returns TRUE if x is smaller than y                      FALSE FALSE FALSE FALSE FALSE
x >= y       Returns TRUE if x is larger than or exactly equal to y   TRUE TRUE TRUE TRUE TRUE
x <= y       Returns TRUE if x is smaller than or exactly equal to y  FALSE FALSE FALSE FALSE FALSE
x != y       Returns TRUE if x differs from y                         TRUE TRUE TRUE TRUE TRUE

Logical Operators
Operator     Description                                                                   Output
x & y        Returns the element-wise result of x AND y                                    TRUE TRUE TRUE TRUE TRUE
x | y        Returns the element-wise result of x OR y                                     TRUE TRUE TRUE TRUE TRUE
!x           Returns NOT x                                                                 FALSE FALSE FALSE FALSE FALSE
xor(x, y)    Returns the element-wise exclusive OR of x and y                              FALSE FALSE FALSE FALSE FALSE
x && y       Takes the first element of both vectors and gives TRUE only if both are TRUE  TRUE
x || y       Takes the first element of both vectors and gives TRUE if one of them is TRUE TRUE
Assignment Operators
Operator     Description           Example
x = y        Left assignment       x=c(1,2,3,4,5); x → 1 2 3 4 5
x <- y       Left assignment       x<-c(1,2,3,4,5); x → 1 2 3 4 5
x <<- y      Left assignment       x<<-c(1,2,3,4,5); x → 1 2 3 4 5
y -> x       Right assignment      c(6,7,8,9,10)->p; p → 6 7 8 9 10
y ->> x      Right assignment      c(6,7,8,9,10)->>p; p → 6 7 8 9 10

Miscellaneous Operators
Operator     Description                                                   Example
:            Creates a series of numbers in sequence for a vector          A <- 1:5; print(A*A) → 1 4 9 16 25
%in%         Used to identify if an element belongs to a vector            a<-1:5; b<-5:10; t<-c(5,10,15,20,25); print(a%in%t) → FALSE FALSE FALSE FALSE TRUE; print(b%in%t) → TRUE FALSE FALSE FALSE FALSE TRUE
%*%          Matrix multiplication (here, multiplying a matrix with its transpose)   M = matrix(c(1,2,3,4), nrow = 2, ncol = 2, byrow = TRUE); x = t(M); NewMatrix = M %*% t(M)
Data Types in R
• In any programming language, we use variables to store information
• Based on the data type of a variable, the operating system allocates memory & decides what can be stored in the reserved memory
• Variables are assigned R-objects & the data type of the R-object becomes the data type of the variable
• Frequently used R-objects:
1. Vectors
2. Data Frames
3. Lists
4. Matrices
5. Arrays
6. Factors
• The 6 simplest data types of atomic vectors:
1. Numeric
2. Integer
3. Complex
4. Character (String)
5. Logical (TRUE/ FALSE)
6. Raw
• Other R-objects are built upon atomic vectors
Dealing with missing values
• One of the most important problems in statistics is incomplete data sets
• To deal with missing values, R uses the reserved keyword NA, which stands
for Not Available
• We can use NA as a valid value, so we can assign it as a value as well
• is.na() - Tests whether a value is NA. Returns TRUE if the value is NA
x<-NA is.na(x) TRUE
• is.nan() - Tests whether a value is not an NA. Returns TRUE if the value is
not an NA
x<-NA is.nan(x) FALSE

>is.na(mtcars$mpg)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE [17] FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> is.nan(mtcars$cyl)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE [17] FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1. Vectors – c()
• One-dimensional arrays that can hold numeric/ character/ logical data
• Constants/ one-element vectors – also called scalars – hold constants
• Example: - x <- 2.414, h <- TRUE, cn <- 2 + 5i, name = ‘IPE’, v <- charToRaw(‘IPE’)

• Types of Vectors
1. Numeric Vectors – contains all kinds of numbers
2. Integer Vectors – contains integer values only
3. Logical Vectors – contains logical values (TRUE and/ or FALSE)
4. Character Vectors – contains text

Atomic vectors   Description                                          Example
Logical          Produces TRUE/ FALSE values                          class(h) → "logical"
Numeric          Handles positive & negative decimals including 0     Marks=c(23.2,32.4,12.6,04.2,49.1); Marks → 23.2 32.4 12.6 4.2 49.1
Integer          Handles positive & negative integers including 0     Marks=c(23L,32L,12L,04L,49L); Marks → 23 32 12 4 49
Complex          Handles complex data items                           class(cn) → "complex"
Character        Handles string/ character data items                 name="IPE"; class(name) → "character"
Raw              Returns a raw vector of bytes                        v = charToRaw('IPE'); v → 49 50 45; class(v) → "raw"
Function Description Example
marks=c(10,20,30,40,50)
c() Creates a vector
marks 10 20 30 40 50
smarks=c(56,89,48,12,38)
amarks=c(78L,48L,59L,15L,26L)
c() Combines vectors
final = c(smarks,amarks)
final [1] 56 89 48 12 38 78 48 59 15 26
class(x) Gives the data type of the R object x class(final) [1] "numeric"
nchar(x) Finds the length of a character/ numeric nchar(name) [1] 3
length(x) Returns the length of the vector length(final) [1] 10
seq(x)           Used to create a simple sequence of integers         seq(from=1, to=20, by = 2) → 1 3 5 7 9 11 13 15 17 19
Replicates the vector said number of rep(1:4, each = 2, len = 10)
rep(x)
times [1] 1 1 2 2 3 3 4 4 1 1
is.numeric(x) Tests whether x is an numeric vector is.numeric(final) [1] TRUE
is.integer(x) Tests whether x is an integer vector is.integer(final) [1] FALSE
is.character(x) Tests whether x is a character vector is.character(name) [1] TRUE

o We can easily edit a vector by using indices in single/ multiple locations
Example: smarks=c(56,89,48,12,38); smarks → 56 89 48 12 38; smarks[2]=100; smarks → 56 100 48 12 38; smarks[c(2,4,5)]=100; smarks → 56 100 48 100 100
o Vector operations are complicated when operating on 2 vectors of unequal length. The shorter vector's elements are repeated, in order, until they have been matched up with every element of the longer vector. If the longer vector's length is not a multiple of the shorter one's, a warning is given
Example: x<-1:8; y<-10:13; x+y → 11 13 15 17 15 17 19 21
Function Description Example
Tests whether all the resulting elements are all(smarks>amarks)
all()
TRUE [1] FALSE
any(smarks>amarks)
any() Tests whether any element is TRUE
[1] TRUE
Gives difference between that value & the next diff(final)
diff(x)
value in the vector [1] 33 -41 -36 26 40 -30 11 -44 11
sum(final)
sum(x) Calculates the sum of all values in vector x
[1] 469
prod(final)
prod(x) Calculates the product of all values in vector x
[1] 9.398024e+15
min(final)
min(x) Gives the minimum of all values in x
[1] 12
max(final)
max(x) Gives the maximum of all values in x
[1] 89
cumsum(final)
cumsum(x) Gives the cumulative sum of all values in x
[1] 56 145 193 205 243 321 369 428 443 469
cumprod(x)   Gives the cumulative product of all values in x                                           cumprod(final) → 5.600000e+01 4.984000e+03 2.392320e+05 2.870784e+06 1.090898e+08 8.509004e+09 4.084322e+11 2.409750e+13 3.614625e+14 9.398024e+15
cummin(x)    Gives the minimum of all values in x from the start of the vector up to that position     cummin(final) → 56 56 48 12 12 12 12 12 12 12
cummax(x)    Gives the maximum of all values in x from the start of the vector up to that position     cummax(final) → 56 89 89 89 89 89 89 89 89 89
vector=c(10,20,30,40,50,60,70,80,90,100); vector1=vector
vector[vector>70]<-NA
vector → 10 20 30 40 50 60 70 NA NA NA
sum(is.na(vector)) → 3 # Checks for NA's
Command Description Example
max(vector,na.rm=FALSE) NA
max(x,na.rm=FALSE) Shows the maximum value
max(vector,na.rm=TRUE) 70
Shows the minimum value of min(vector,na.rm=FALSE) NA
min(x,na.rm=FALSE)
the vector min(vector,na.rm=TRUE) 10
Gives the length of the vector
including NA values. na.rm
length(x) length(vector) 10
instruction doesn’t work with
length(na.omit(x)) length(na.omit(vector)) 7
length(). na.omit() strips out NA
items
mean(vector,na.rm=FALSE) NA
mean(x,na.rm=FALSE) Shows the arithmetic mean
mean(vector,na.rm=TRUE) 40
median(vector,na.rm=FALSE) NA
median(x,na.rm=FALSE) Shows the median
median(vector,na.rm=TRUE) 40
sd(vector,na.rm=FALSE) NA
sd(x,na.rm=FALSE) Shows the standard deviation
sd(vector,na.rm=TRUE) 21.60247
var(vector,na.rm=FALSE) NA
var(x,na.rm=FALSE) Shows the variance
var(vector,na.rm=TRUE) 466.6667
Shows the median absolute mad(vector,na.rm=FALSE) NA
mad(x,na.rm=FALSE)
deviation mad(vector,na.rm=TRUE) 29.652
Summary Commands with Multiple Results: - Produce several values
vector=c(10,20,30,40,50,60,70,80,90,100)
• log(vector)
2.302585 2.995732 3.401197 3.688879 3.912023 4.094345 4.248495 4.382027
4.499810 4.605170
• summary(vector)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.0 32.5 55.0 55.0 77.5 100.0
• quantile(vector)
0% 25% 50% 75% 100%
10.0 32.5 55.0 77.5 100.0
• fivenum(vector)
10 30 55 80 100
2. Data Frames – data.frame()
• Data frame is a list of vectors of equal length, where each column can contain
different modes of data
• Data frames are tabular data objects created using the data.frame() function
x=1:5
jdate=c("30-01-2017","28-01-2017","16-01-2017","02-02-2017","05-02-2017")
age=c(25,34,28,52,74)
diabetes=c('Type 1','Type 2','Type 2','Type 1','Type 2')
status=c('Poor','Improved','Excellent','Poor','Improved')
Diabetologist = data.frame(x,jdate,age,diabetes,status)
diabetesstatus=data.frame(PID=x,Join_Date=jdate,Age=age,Type=diabetes,Status=status)
View(diabetesstatus) #Views the contents of the dataset created
names(diabetesstatus) #Views the names assigned to the columns of the data frame
Function Description Example
nrow(x) Displays no. of rows in the data frame nrow(diabetesstatus) 5
Displays no. of columns in the data
ncol(x) ncol(diabetesstatus) 5
frame
Displays no. of rows & columns of the
dim(x) dim(diabetesstatus) 5 5
data frame
rownames(diabetesstatus) =
rownames(x) Assign row names
c('One','Two','Three','Four','Five')
rownames(x) <- NULL Set back to the generic index rownames(diabetesstatus) <- NULL
Prints only first 6 rows of the data
head(x) head(diabetesstatus)
frame
tail(x) Prints last 6 rows of the data frame tail(diabetesstatus)
To access single variable in a data
dataframe$colname diabetesstatus$Status
frame $ argument is used
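A brief sketch of slicing/filtering the diabetesstatus data frame built above (subset() is shown as an equivalent alternative):
diabetesstatus[2, ]                              # the second row
diabetesstatus[, c("Age", "Status")]             # selected columns only
diabetesstatus[diabetesstatus$Age > 30, ]        # rows where Age exceeds 30
subset(diabetesstatus, Type == "Type 2", select = c(PID, Age))   # the same idea with subset()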
Summarizing a more complicated object
statsvector=c(10,20,30,40,50,60,70,80,90,100)
accountsvector=c(50,60,70,80,90,10,20,30,40,50)
frenchvector=c(30,50,100,98,70,20,40,50,70,10)
studentroaster=data.frame(statsvector,accountsvector,frenchvector)
Command Description Example
The largest value of the entire data frame is
max(frame) max(studentroaster) 100
returned
The smallest value of the entire data frame is
min(frame) min(studentroaster) 10
returned
sum(frame) The sum of the entire data frame sum(studentroaster) 1588
The Tukey summary values for the entire data fivenum(studentroaster$statsvector)
fivenum(frame)
frame is returned [1] 10 30 55 80 100
The number of columns in the data frame is
length(frame) length(studentroaster) 3
returned
summary(frame) Gives summary for each column summary(studentroaster)
rowMeans(frame) Returns the mean of each row rowMeans(studentroaster)
rowSums(frame) Returns the sum of each row rowSums(studentroaster)
colMeans(frame) Returns the mean of each column colMeans(studentroaster)
colSums(frame) Returns the sum of each column colSums(studentroaster)
apply(studentroaster,1,mean,na.rm=TRUE)
Enables to apply function to rows/ columns of
apply(x,MARGIN, apply(studentroaster,2,mean,na.rm=TRUE)
matrix /data frame. Margin is either 1or 2
FUN) apply(studentroaster,1,median,na.rm=TRUE)
where 1 is for rows and 2 is for columns
apply(studentroaster,2,median,na.rm=TRUE)
sapply(x, FUN, sapply(studentroaster, mean, na.rm=TRUE)
na.rm=TRUE) sapply(studentroaster, sd, na.rm=TRUE)
3. Lists – list()
• List is the most complex of the R data types which can contain many different
types of elements inside it like vectors, functions and even another list inside it
x=1:5
jdate=c("30-01-2017","28-01-2017","16-01-2017","02-02-2017","05-02-2017")
age=c(25,34,28,52,74)
diabetes=c('Type 1','Type 2','Type 2','Type 1','Type 2')
status=c('Poor','Improved','Excellent','Poor','Improved')
patientroster=list(x,jdate,age,diabetes,status) # Creates a list
patientroster=list(PatientID=x,JoiningDate=jdate,AgeofthePatient=age,Disease=diabetes,C
urrentStatus=status) # Name the objects in the list
names(patientroster) #Views the names assigned to the elements of the list
Command Description Example
The largest value of the entire data frame is
mean(patientroster$AgeofthePatient) 42.6
returned
max(patientroster$AgeofthePatient) The smallest value of the entire data frame is 74
min(patientroster$AgeofthePatient) returned 25
summary(patientroster) Returns the summary of the list
The number of columns in the data frame is
length(patientroster) 5
returned
lapply(patientroster,mean,na.rm=TRUE) List apply specifically works on list objects
sapply(patientroster,mean,na.rm=TRUE) Resulting output is infact a matrix
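A brief sketch of accessing elements of the patientroster list created above:
patientroster$AgeofthePatient      # extract one component by name with $
patientroster[["Disease"]]         # extract by name with [[ ]]
patientroster[[3]]                 # extract by position
lapply(patientroster, length)      # apply a function to every component of the list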
4. Matrices – matrix()
• Two-dimensional rectangular data set - created using a vector input to the matrix function -
indices for dimensions are separated by a comma – specifies index for row before comma & the
index for column after comma
• Every element, must be of same type, most commonly all numeric's
• Similar to vectors - element-by-element addition, multiplication, subtraction, division & equality
• Syntax: matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE)
where data: the data vector nrow: desired number of rows
ncol: desired number of columns byrow: logical. If `FALSE' matrix is filled by columns
Function Description
C <- matrix(1:30,ncol=5,byrow=TRUE) Creating Matrix - Arranging elements row-wise
C <- matrix(1:30,ncol=5,byrow=FALSE) Creating Matrix - Arranging elements column-wise
C[1:2,2:3] Extract subset of the matrix - values of rows (1 & 2) & columns (2 & 3)
C[1:2,] Extract complete rows (1 & 2)
C[,3:5] Extract complete columns (3, 4 & 5)
C[,-3:-5] Drops values in a vector by using a negative value for the index
C[-c(1,3),] Drops the first & third rows of the matrix
C[3, ,drop=FALSE] To get the deleted third row returned as a matrix
C[3,2]<-4 Replacing a value in the matrix
C[1:2,4:5]=c(50,100,50,100) Replace a subset of values within the matrix by another matrix
Change fourth row values with the specified values by not specifying
C[4,]<-c(1,2,3,4,5)
the other dimension
rowSums(C)
Returns with the sums of each row and column
colSums(C)
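A short runnable sketch tying the matrix commands above together:
C <- matrix(1:30, ncol = 5, byrow = TRUE)   # a 6 x 5 matrix filled row-wise
C[1:2, 2:3]                                 # rows 1-2, columns 2-3
C[-c(1, 3), ]                               # drop rows 1 and 3
C[3, 2] <- 4                                # replace a single value
rowSums(C); colSums(C)                      # row totals and column totals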
Data of a single vector happens to be split into rows & columns
cosmeticsexpenditure=matrix(c(10,20,30,40,50,60,70,80,90,100,110,120),ncol=4)
colnames(cosmeticsexpenditure)=c("YMen","YWomen","SCM","SCW")

Command Description Example


Returns mean of the second mean(cosmeticsexpenditure[,2])
mean(x[,2])
column 50
Returns mean of the second mean(cosmeticsexpenditure[2,])
mean(x[2,])
row 65
rowMeans(cosmeticsexpenditure)
rowMeans(matrix_name) Returns the mean of each row
55 65 75
rowSums(cosmeticsexpenditure)
rowSums(matrix_name) Returns the sum of each row
220 260 300
Returns the mean of each colMeans(cosmeticsexpenditure)
colMeans(matrix_name)
column YMen YWomen SCM SCW 20 50 80 110
Returns the mean of each colSums(cosmeticsexpenditure)
colSums(matrix_name)
column YMen YWomen SCM SCW 60 150 240 330

apply(x,MARGIN,FUN)   Works equally well for a matrix object as it does for a data frame   apply(cosmeticsexpenditure,2,mean) → YMen YWomen SCM SCW: 20 50 80 110; apply(cosmeticsexpenditure,1,mean) → 55 65 75
5. Arrays – array()
• While matrices are confined to two dimensions, arrays can be of any number
of dimensions
• Arrays have 2 very important features:
  • They contain only a single type of value
  • They have dimensions
• The dimensions of an array determine its type: an array with 2 dimensions is a matrix; an array with more than 2 dimensions is an array
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
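A brief sketch of indexing the array created above:
a <- array(c('green', 'yellow'), dim = c(3, 3, 2))
a[1, 2, 1]      # row 1, column 2 of the first 3x3 slice
a[, , 2]        # the entire second 3x3 slice
dim(a)          # 3 3 2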
6. Factors – factor()
• Factors store the vector along with the distinct values of the elements in the
vector as labels
• The labels are always character irrespective of whether it is numeric or
character or Boolean etc. in the input vector
• Factors are created using the factor() function

• The nlevels() function gives the count of levels
• Example:
iphone_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object
factor_iphone <- factor(iphone_colors)
# Print the factor
print(factor_iphone) green green yellow red red red green Levels: green red yellow
print(nlevels(factor_iphone)) 3
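A brief sketch of inspecting the factor created above:
levels(factor_iphone)       # the distinct labels: "green" "red" "yellow"
summary(factor_iphone)      # count of observations per level
as.integer(factor_iphone)   # the underlying integer codes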
AT & T – NETWORK
MANAGEMENT SYSTEM
Using R Software Tool
AT&T's System T is a network-management system developed by AT&T that receives data
from telemetry events, such as alarms, facility performance information, and diagnostic
messages, and forwards them to operators for further action. The system has been tested
and failure data has been collected, Ehrlich et al. (1993). The failures data comes from one
of three releases of a large medical record system, consisting of 188 software components,
Stringfellow & Andrews (2002). This study uses failure data from the last drop of system test
and after release. The table below shows the failures and the inter-failure as well as
cumulative number of failures by week in the last drop of system test for the first release
(in CPU units).

First release
Failure Index    Inter-Failure Time    Cumulative Failure Time
1                5.5                   5.5
2                1.83                  7.33
3                2.75                  10.08
4                70.89                 80.97
5                3.94                  84.91
6                14.98                 99.89
7                3.47                  103.36
8                9.96                  113.32
9                11.39                 124.71
10               19.88                 144.59
11               7.81                  152.4
12               14.6                  166.99
13               11.41                 178.41
14               18.94                 197.35
15               65.3                  262.69
16               0.04                  262.69
17               125.67                388.36
18               82.69                 471.05
19               0.46                  471.5
20               31.61                 503.11
21               129.31                632.42
22               47.6                  680.02
Second release (cumulative number of failures by week in the last drop of system test)
Failure Index    Inter-Failure Time    Cumulative Failure Time
1                90                    90
2                17                    107
3                19                    126
4                19                    145
5                26                    171
6                17                    188
7                1                     189
8                1                     190
9                0                     190
10               0                     190
11               2                     192
12               0                     192
13               0                     192
14               0                     192
15               11                    203
16               0                     203
17               1                     204

Third release (cumulative number of failures by week in the last drop of system test)
Failure Index    Inter-Failure Time    Cumulative Failure Time
1                9                     9
2                5                     14
3                8                     21
4                7                     28
5                25                    53
6                3                     56
7                2                     58
8                5                     63
9                7                     70
10               5                     75
11               1                     76
12               0                     76
13               1                     77
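A minimal sketch of reading the first-release failure data into R and plotting cumulative failure time (the vector below copies the inter-failure column of the first table; the cumulative column is recomputed with cumsum() and may differ from the printed table by rounding):
inter_failure <- c(5.5, 1.83, 2.75, 70.89, 3.94, 14.98, 3.47, 9.96, 11.39, 19.88,
                   7.81, 14.6, 11.41, 18.94, 65.3, 0.04, 125.67, 82.69, 0.46,
                   31.61, 129.31, 47.6)
release1 <- data.frame(index = seq_along(inter_failure),
                       inter_failure = inter_failure,
                       cumulative = cumsum(inter_failure))
plot(release1$index, release1$cumulative, type = "b",
     xlab = "Failure index", ylab = "Cumulative failure time (CPU units)",
     main = "System T - first release")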
Descriptive Analytics
• Several user-contributed packages offer functions for descriptive statistics,
including Hmisc, pastecs, & psych, doBy
• These packages aren’t included in the base distribution, hence they need to
be installed
1. Hmisc: Contains functions useful for data analysis, high-level graphics,
utility operations, functions for computing sample size, importing datasets,
etc. describe(mtcars)
• Prints a concise statistical summary (no. of variables & observations, the no. of
missing & unique values, the mean, quantiles, & the 5 highest & lowest values)
• install.packages("Hmisc") – Installs the package Hmisc
• library("Hmisc") – Loads & attaches the package

2. doBy: - Provides functions for descriptive statistics by group. It contains


function called summaryBy()- summaryBy(mpg+hp+wt~am,data=mtcars,FUN=mean)
• Categorizes observations on am data item & calculates mean of mpg, hp & wt data
items
• Variables on the left side of the ~ are the numeric variables to be analysed, &
variables on the right side are categorical/ grouping variables
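A minimal usage sketch combining the two packages above on the built-in mtcars data (it assumes both packages have already been installed with install.packages(), as shown earlier):
library(Hmisc)
describe(mtcars[, c("mpg", "hp", "wt")])                          # concise per-variable summary
library(doBy)
summaryBy(mpg + hp + wt ~ am, data = mtcars, FUN = c(mean, sd))   # group means & SDs by transmission (am)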
3. Summarize: – One more version of summarize for producing stratified
summary statistics - summarize(X = mtcars$mpg, by = mtcars$cyl, FUN = mean)
• Summarizes the mean of the variable mpg for the unique cyl (4,6, & 8) values
• summarize(X = mtcars$mpg, by = llist(mtcars$cyl,mtcars$gear), FUN = mean) - Summarizes
the mean of the variable mpg for the unique cyl (4,6, & 8) & gear (3,4, & 5) values
• summarize(X=mtcars$mpg,by=llist(mtcars$cyl,mtcars$gear),summary) - Summarizes
(minimum, maximum, mean, median, 1st quartile, 3rd quartile) the variable mpg for the unique
cyl (4, 6, & 8) & gear (3, 4, & 5) values
• summaryBy(cyl+gear~mpg,data=mtcars,FUN=summary)

4. Psych: - Also offers functions for descriptive statistics. It contains function


called describe() that provides the number of non-missing observations,
mean, standard deviation, median, trimmed mean, median absolute
deviation, minimum, maximum, range, skew, kurtosis, & standard error of
the mean
• myvars <- c("mpg","hp","wt"); describe(mtcars[myvars])
• Displays number of non-missing observations, mean, standard deviation, median, trimmed
mean, median absolute deviation, minimum, maximum, range, skew, kurtosis, & standard
error of the mean )
• describeBy(mtcars[myvars],list(am=mtcars$am)) – Similar to describe() but stratified by one or
more grouping variables
• describeBy(iris,list(iris$Species))
Note: - Packages Hmisc and psych both provide describe() function. How does R know which one to use?
Simply, the package last loaded masks the function of the older one. If you want the Hmisc version, type
Hmisc::describe(mtcars)
5. aggregate() - Descriptive Statistics by Groups – Focus is usually on
descriptive statistics of each group, rather than the total sample. Obtains
descriptive statistics by group. Will not work if the dataset has NOT been loaded
using attach()
• aggregate(mtcars$mpg,by=list(Cylinder=mtcars$cyl),mean)
• myvars = c(“mpg”,”hp”,”wt”)
• aggregate(mtcars[myvars],by=list(Cylinder=mtcars$cyl),mean)
• aggregate(mtcars,by=list(Cylinder=mtcars$cyl),mean)
• aggregate(mtcars$mpg,by=list(AM=mtcars$am,Gear=mtcars$gear,Cylinder=mtcar
s$cyl),mean)
• aggregate(mtcars[myvars],by=list(AM=mtcars$am,Gear=mtcars$gear,Cylinder=mt
cars$cyl),mean)
• aggregate(mtcars,by=list(AM=mtcars$am,Gear=mtcars$gear,Cylinder=mtcars$cyl)
,mean)
• aggregate(formula=mpg~cyl, data = mtcars, FUN=mean) - Shows the cyl(cylinder-
wise) aggregated mean of mpg
• aggregate(formula=mpg~cyl, data=mtcars, FUN=sd) - Shows the cylinder-wise
standard deviation of mpg
• aggregate(wt~gear, data=mtcars, FUN=sd) – Shows the gear-wise standard
deviation of wt
6. Pastecs (Package for Analysis of Space-Time Ecological Series): - Includes
stat.desc() that provides wide range of descriptive statistics
stat.desc(mtcars$mpg,basic=TRUE,desc=TRUE,norm=TRUE,p=0.95)
basic=TRUE (default) – displays no. of values, no. of null values, no. of
missing values, minimum, maximum, range (max-min) & sum of non-missing
values
desc=TRUE (default) – displays median, mean, standard error of the mean,
95% confidence interval for the mean, variance, standard deviation, &
coefficient of variation (standard deviation divided by the mean)
norm=TRUE (not the default) – normal distribution statistics are displayed,
including skewness & kurtosis. Displays the skewness coefficient g1
(skewness), its significance criterion skew.2SE (i.e., g1/(2·SEg1); if skew.2SE
> 1, then the skewness is significantly different from zero), the kurtosis coefficient g2
(kurtosis), its significance criterion kurt.2SE (same remark as for skew.2SE),
the statistic of a Shapiro-Wilk test of normality (normtest.W) and its
associated probability (normtest.p)
The p option sets the confidence level used for the confidence interval of the mean
• A positive skewness indicates that the size of
the right-handed tail is larger than the left-
handed tail
• A negative skewness indicates that the left-
hand tail will typically be longer than the right-
hand tail
• The rule of thumb for Skewness
• If the skewness is between -0.5 and 0.5, the data are
fairly symmetrical
• If the skewness is between -1 and – 0.5 or between
0.5 and 1, the data are moderately skewed
• If the skewness is less than -1 or greater than 1, the
data are highly skewed
• Kurtosis decreases as the tails become lighter
& It increases as the tails become heavier
• If the kurtosis is close to 0, then a normal distribution is often
assumed. These are called mesokurtic distributions
• If the kurtosis is less than zero, then the distribution has light
tails and is called a platykurtic distribution
• If the kurtosis is greater than zero, then the distribution has
heavier tails and is called a leptokurtic distribution
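The rules of thumb above can be checked quickly in R. A minimal sketch, assuming the psych package discussed earlier is installed (its skew() and kurtosi() functions return the sample skewness and the excess kurtosis):
library(psych)
sk <- skew(mtcars$mpg)     # sample skewness of mpg
ku <- kurtosi(mtcars$mpg)  # excess kurtosis (approximately 0 for a normal distribution)
if (abs(sk) <= 0.5) {
  cat("mpg is fairly symmetrical (skew =", round(sk, 2), ")\n")
} else if (abs(sk) <= 1) {
  cat("mpg is moderately skewed (skew =", round(sk, 2), ")\n")
} else {
  cat("mpg is highly skewed (skew =", round(sk, 2), ")\n")
}
cat("excess kurtosis =", round(ku, 2),
    ifelse(ku < 0, "(platykurtic, lighter tails)", "(leptokurtic, heavier tails)"), "\n")
For mtcars$mpg this reports a moderate right skew, consistent with the interpretation given in Caselet 1 below.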
Caselet 1 - mtcars (Motor Trend Car Road Tests) Dataset
The data was extracted from the 1974 Motor Trend US magazine,
and comprises fuel consumption and 10 aspects of automobile
design and performance for 32 automobiles (1973–74 models).
The description of the 11 numeric variables with the 32
observations in the data frame are as follows:
1. [,1] mpg – Miles/ (US) gallon
2. [,2] cyl – Number of Cylinders
3. [,3] disp – Displacement (cu.in.)
4. [,4] hp – Gross Horsepower
5. [,5] drat – Rear Axle Ratio
6. [,6] wt – Weight (1000 lbs)
7. [,7] qsec – ¼ Mile Time
8. [,8] vs – V/S Engine Shape
9. [,9] am –Transmission (0=automatic, 1=manual)
10.[,10] gear – Number of Forward Gears
11.[,11] carb – Number of Carburetors
Prepare a managerial report.
1. stat.desc(mtcars$mpg,basic=TRUE,desc=TRUE,norm=TRUE,p=0.95)
2. mt <- mtcars[c("mpg", "hp", "wt", "am")]
summary(mt)
stat.desc(mt,basic = TRUE,desc = TRUE,norm = TRUE,p=0.95)
3. vars <- c("mpg", "hp", "wt")
head(mtcars[vars])
summary(mtcars[vars]) #Provides minimum, maximum, quartiles, mean &
median for numerical variables
stat.desc(mtcars[vars])
stat.desc(mtcars[vars],norm=TRUE) #Surprisingly, the base installation
doesn’t provide functions for skew and kurtosis, but you can add your own
Interpretation
For cars in this sample, the mean mpg is 20.1, with a standard deviation of 6.0.
The distribution is skewed to the right (+0.61) and somewhat flatter than a
normal distribution (–0.37). The majority of the automobiles considered in the Motor
Trend Car Road Tests (mtcars) dataset have mileage below the mean, while a few have much higher mileage.
Caselet 2 – Cars93 Dataset (In-built)
The Cars93 data frame has data of 93 Cars on Sale in the USA in 1993 arranged in 93
rows and 27 columns. The descriptions of the variables in the data set are as follows:
1. Manufacturer - Manufacturer
2. Model - Model
3. Type – A factor with levels “Small", "Sporty", "Compact", "Midsize", "Large“ & "Van"
4. Min.Price – Minimum Price ($1000): Price for a basic version
5. Price – Mid-range Price ($1000): Average of Min.Price & Max.Price
6. Max.Price – Maximum Price ($1000): Price for a premium version
7. MPG.city – City MPG (miles per US gallon by EPA rating)
8. MPG.highway – Highway MPG
9. AirBags – Air Bags standard. Factor: none, driver only, or driver & passenger
10. DriveTrain – Drive train type: rear wheel, front wheel or 4WD; (factor)
11. Cylinders – No. of cylinders (missing for Mazda RX-7, which has a rotary engine)
12. EngineSize - Engine size (litres)
13. Horsepower - Horsepower (maximum)
14. RPM - RPM (revs per minute at maximum horsepower)
15. Rev.per.mile - Engine revolutions per mile (in highest gear)
16. Man.trans.avail - Is a manual transmission version available? (yes or no, Factor)
17. Fuel.tank.capacity - Fuel tank capacity (US gallons)
18. Passengers - Passenger capacity (persons)
19. Length - Length (inches)
20. Wheelbase - Wheelbase (inches)
21. Width - Width (inches)
22. Turn.circle - U-turn space (feet)
23. Rear.seat.room - Rear seat room (inches) (missing for 2-seater vehicles)
24. Luggage.room - Luggage capacity (cubic feet) (missing for vans)
25. Weight - Weight (pounds)
26. Origin - Of non-USA or USA company origins? (factor)
27. Make - Combination of Manufacturer and Model (character)
Assignment
1. Load the data set Cars93 with data(Cars93, package="MASS") and
set randomly any 5 observations in the variables Horsepower and
Weight to NA (missing values)
2. Calculate the arithmetic mean & the median of the variables
Horsepower and Weight
3. Calculate the standard deviation and the interquartile range of the
variable Price
1. Load the data and set missing values to NA
data(Cars93, package="MASS")
Cars93[sample(1:nrow(Cars93),5), c("Horsepower","Weight")] <- NA
2. Calculate mean & median
c(mean(Cars93$Horsepower, na.rm=TRUE), median(Cars93$Horsepower,
na.rm=TRUE))
c(mean(Cars93$Weight, na.rm=TRUE), median(Cars93$Weight, na.rm=TRUE))
3. Calculate standard deviation & inter-quartile range
c(sd(Cars93$Price),IQR(Cars93$Price)) # no missing values
Classification of Dependence Methods
Multivariate methods – Are some of the variables dependent on others? If yes, use dependence
methods; if no, use interdependence methods.
Dependence Methods
• Appropriate when one or more of the variables can be identified as dependent variables and the
remaining as independent variables
• A category of multivariate statistical techniques
• Explain or predict dependent variable(s) on the basis of 2 or more independent variables
• Tools: Multiple regression analysis, Discriminant analysis, ANOVA & MANOVA
• Common judgement: "Is a person a good or a poor credit risk based on age, income and marital status?"
Interdependence Methods
• A category of multivariate statistical techniques
• A set of interdependent relationships is examined – the analysis involves either the independent or the
dependent variables separately
• Give meaning to a set of variables or seek to group things together
• Tools: Cluster analysis, Factor analysis, and Multidimensional scaling
• Example: a researcher might use interdependence techniques to identify & classify similar cities on the
basis of population size, income distribution, race & ethnic distribution, & consumption of a
manufacturer's product so as to select comparable test markets
Dependence Methods – How many variables are dependent?
• One dependent variable
  • Metric – Multiple regression analysis
  • Nonmetric – Multiple discriminant analysis
• Several dependent variables
  • Metric – Multivariate analysis of variance
  • Nonmetric – Conjoint analysis
• Multiple independent & dependent variables
Interdependence Methods – Are inputs metric?
• Metric inputs – Factor analysis, Cluster analysis, Metric multidimensional scaling
• Non-metric inputs – Nonmetric multidimensional scaling
Diagnostic Analytics
• R can perform correlation using 2 functions: cor() & cor.test()
• cor() function - cor(x, use = , method =)
• x – Matrix or data frame
• use – Specifies the handling of missing data. Options are all.obs (assumes no missing data –
missing data will produce an error), everything (any correlation involving a case with missing
values will be set to missing), complete.obs (list-wise deletion), pairwise.complete.obs
(pairwise deletion)
• method – Specifies the type of correlation. Options are the Pearson, Kendall & Spearman rank
correlations, abbreviated as "pearson" (default), "kendall", or "spearman"
• Pearson product-moment correlation assesses the degree of linear relationship between 2
quantitative variables. Spearman's rank-order correlation coefficient assesses the degree of
relationship between 2 rank-ordered variables. Kendall's tau is also a non-parametric measure
of rank correlation
• Default options are use="everything" and method="pearson"
• To execute cor(), x (and optionally y) must be a numeric vector, matrix, or data frame
• The inputs must be numeric
• Correlations are provided by the cor() function, & scatter plots are generated by
scatterplotMatrix() function or corrgram() function
• Interpretation of corrgram() – A blue color & hashing that goes from lower left to upper right
represent a positive correlation between the 2 variables that meet at that cell. Conversely, a
red color & hashing that goes from the upper left to lower right represent a negative
correlation. The darker & more saturated the color, the greater the magnitude of the correlation.
Weak correlations, near zero, appear washed out
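As a quick, self-contained illustration of the options described above (the variable selection here is arbitrary and the corrgram panel functions are just one possible choice), assuming the corrgram package is installed:
library(corrgram)
vars <- c("mpg", "hp", "wt")
# Pearson correlations with pairwise deletion of missing values
round(cor(mtcars[vars], use = "pairwise.complete.obs", method = "pearson"), 2)
# Spearman rank-order correlations of the same variables
round(cor(mtcars[vars], method = "spearman"), 2)
# Shaded correlogram: blue cells indicate positive, red cells negative correlations
corrgram(mtcars[vars], lower.panel = panel.shade, upper.panel = panel.pie)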
Caselet 3 - Prestige Dataset (In-built)
The Prestige data set (Prestige.txt) consists of 102 observations on 6 variables. The descriptions of the
variables in the data set are as follows:
1. education: The average number of years of education for occupational incumbents
2. income: The average income of occupational incumbents, in dollars
3. women: The percentage of women in the occupation
4. prestige: The average prestige rating for the occupation
5. census: The code of the occupation used in the survey
6. type: Professional and managerial(prof), white collar(wc), blue collar(bc), or
missing(NA)
Is there any association between the variables education & prestige? If there is
an association, what is the strength of the association? Prepare a managerial
report.
install.packages("car")
library(car)
# Correlation Matrix of Multivariate sample
test = cor(Prestige[1:4])
#Graphically plot the association
boxplot(Prestige$prestige~Prestige$type)
scatterplotMatrix(Prestige[1:4], spread=FALSE, smoother.args=list(lty=2), main='Scatter Plot')
install.packages("corrgram")
library(corrgram)
corrgram(Prestige[1:4])
Interpretation
Professional & managerial (prof) occupations tend to have higher
prestige than bc and wc occupations. Education & prestige
have a high positive correlation: the more the
education, the more the prestige.
Education & income also have a positive
correlation, and higher income is associated with higher
prestige.
Caselet 4 - mtcars (Motor Trend Car Road Tests) Dataset
The data was extracted from the 1974 Motor Trend US magazine,
and comprises fuel consumption and 10 aspects of automobile
design and performance for 32 automobiles (1973–74 models).
The description of the 11 numeric variables with the 32
observations in the data frame are as follows:
1. [,1] mpg – Miles/ (US) gallon
2. [,2] cyl – Number of Cylinders
3. [,3] disp – Displacement (cu.in.)
4. [,4] hp – Gross Horsepower
5. [,5] drat – Rear Axle Ratio
6. [,6] wt – Weight (1000 lbs)
7. [,7] qsec – ¼ Mile Time
8. [,8] vs – V/S Engine Shape
9. [,9] am –Transmission (0=automatic, 1=manual)
10.[,10] gear – Number of Forward Gears
11.[,11] carb – Number of Carburetors
Prepare a managerial report.
Research questions
1. Calculate the correlation coefficient between MPG & remaining
variables.
2. Is a car with automatic or manual transmission better in term of miles
per gallons (mpg)?
3. Quantify the mpg difference between automatic & manual
transmission
Calculate the correlation coefficient between MPG & the remaining variables
round(cor(mtcars)[-1, 1], 2)
install.packages("corrgram")
library(corrgram)
corrgram(mtcars[1:6])
Conclusion
1. MPG is mainly related to vehicle weight
and number of cylinders
2. According to the correlation coefficients,
"wt", "cyl" and "disp" show the strongest
correlations with "mpg"
Caselet 5 - women Dataset (In-built)
Data-set women provides the height & weight for a set of 15 women
aged 30-39 on 2 variables Height and Weight.
1. [,1] height – numeric height (in)
2. [,2] weight – numeric weight (lbs)
Suppose we wish to predict weight from height. How can we identify
overweight or underweight individuals?
# Correlation Matrix of Multivariate sample
test = cor(women)
Interpretation
1. From the output our suspicion is confirmed. Height and weight
have a strong positive correlation (0.9954948)
2. Multiple R-squared value (R2=0.991) indicates that the model
accounts for 99.1% of the variance in weights (because of
heights)
# Graphical Correlation Matrix
1. plot(women, xlab = "Height (in)", ylab = "Weight (lb)", main = "women data:
American women aged 30-39")
2. install.packages("car")
library(car)
scatterplotMatrix(women[1:2], spread=FALSE, smoother.args=list(lty=2),
main='Scatter Plot')
3. scatterplot(weight ~ height, data=women, legend=list(coords="topleft"))
4. install.packages("corrgram")
library(corrgram)
corrgram(women)
Caselet 6 - Longley Dataset (In-built)
This is a macroeconomic data frame which consists of 7 economic
variables, observed yearly from 1947 to 1962 (n=16).
1. GNP.deflator - GNP implicit price deflator (1954=100)
2. GNP - Gross National Product
3. Unemployed - number of unemployed
4. Armed.Forces - number of people in the armed forces
5. Population – 'non-institutionalized' population ≥ 14 years of age
6. Year - the year (time)
7. Employed - number of people employed
The data items of the data set are known to be highly collinear. Examine
and prepare a managerial report.
# Correlation Matrix of Multivariate sample
test = cor(longley)
# Graphical Correlation Matrix
symnum(test) # highly correlated
stat.desc(longley$Unemployed,basic=TRUE,desc=TRUE)
pairs(longley, panel = panel.smooth, main = "Longley data", col = 3 +
(longley$Unemployed > 400))
Caselet 7 - swiss Dataset (In-built)
Switzerland, in 1888, was entering a period known as the demographic
transition; i.e., its fertility was beginning to fall from the high level typical of
underdeveloped countries. Standardized fertility measure and socio-economic
indicators for each of 47 French-speaking provinces of Switzerland at about
1888 are arranged as a data frame with 6 variables, each of which is presented
in percent.
1. Fertility – lg, ‘common standardized fertility measure’
2. Agriculture - % of males involved in agriculture as occupation
3. Examination - % draftees receiving highest mark on army examination
4. Education - % education beyond primary school for draftees
5. Catholic – % catholic (as opposed to protestant)
6. Infant.Mortality – live births who live less than 1 year
Examine the data and prepare a managerial report.
# Correlation Matrix of Multivariate sample
test = cor(swiss)
# Graphical Correlation Matrix
symnum(test)
pairs(swiss, panel = panel.smooth, main = "swiss data", col = 3 +
(swiss$Catholic > 50))
cor.test() function
cor.test(var1, var2, method = c("pearson", "kendall", "spearman"))
• cor.test() tests for association/correlation between paired samples using
Pearson's product moment correlation coefficient r/ Kendall's τ/ Spearman's ρ
• It returns both the correlation coefficient and the significance level(or p-value)
of the correlation
• Example: -
my_data <- mtcars
install.packages("ggpubr")
library("ggpubr")
res <- cor.test(my_data$wt, my_data$mpg, method = "pearson")
res
ggscatter(my_data, x = "mpg", y = "wt", add = "reg.line", conf.int = TRUE, cor.coef =
TRUE, cor.method = "pearson", xlab = "Miles/(US) gallon", ylab = "Weight (1000 lbs)")
• t is the t-test statistic value (t = -9.559); df is the degrees of freedom (df = 30);
p-value is the significance level of the t-test (p-value = 1.294 × 10^-10); conf.int is
the confidence interval of the correlation coefficient at 95% (conf.int = [-0.9338, -0.7441]);
sample estimates is the correlation coefficient (cor = -0.87).
• Interpretation: - The p-value of the test is 1.294 × 10^-10, which is less than the
significance level alpha = 0.05. We can conclude that wt and mpg are significantly
correlated, with a correlation coefficient of -0.87 and a p-value of 1.294 × 10^-10.
Caselet 8 – Cars93 Dataset (In-built)
The Cars93 data frame has data of 93 Cars on Sale in the USA in 1993 arranged in 93
rows and 27 columns. The descriptions of the variables in the data set are as follows:
1. Manufacturer - Manufacturer
2. Model - Model
3. Type – A factor with levels “Small", "Sporty", "Compact", "Midsize", "Large“ & "Van"
4. Min.Price – Minimum Price ($1000): Price for a basic version
5. Price – Mid-range Price ($1000): Average of Min.Price & Max.Price
6. Max.Price – Maximum Price ($1000): Price for a premium version
7. MPG.city – City MPG (miles per US gallon by EPA rating)
8. MPG.highway – Highway MPG
9. AirBags – Air Bags standard. Factor: none, driver only, or driver & passenger
10. DriveTrain – Drive train type: rear wheel, front wheel or 4WD; (factor)
11. Cylinders – No. of cylinders (missing for Mazda RX-7, which has a rotary engine)
12. EngineSize - Engine size (litres)
13. Horsepower - Horsepower (maximum)
14. RPM - RPM (revs per minute at maximum horsepower)
15. Rev.per.mile - Engine revolutions per mile (in highest gear)
16. Man.trans.avail - Is a manual transmission version available? (yes or no, Factor)
17. Fuel.tank.capacity - Fuel tank capacity (US gallons)
18. Passengers - Passenger capacity (persons)
19. Length - Length (inches)
20. Wheelbase - Wheelbase (inches)
21. Width - Width (inches)
22. Turn.circle - U-turn space (feet)
23. Rear.seat.room - Rear seat room (inches) (missing for 2-seater vehicles)
24. Luggage.room - Luggage capacity (cubic feet) (missing for vans)
25. Weight - Weight (pounds)
26. Origin - Of non-USA or USA company origins? (factor)
27. Make - Combination of Manufacturer and Model (character)
Assignment - Calculate the correlation matrix (Pearson) of the variables
Horsepower, Weight and Price. Use both cor() and cor.test() functions
1. Calculate the correlation matrix using cor() function
cor(Cars93[,c("Horsepower","Weight","Price")], use ="complete.obs")
2. Graphically plot the correlation
boxplot(MPG.highway ~ Origin, col = "red", data = Cars93, main = "Box plot of
MPG by origin")
plot(Cars93$Price ~ Cars93$Horsepower,main="Price given
Horsepower",xlab="Horsepower", ylab="Price")
3. Calculate the correlation matrix using cor.test() function
cor.test(Cars93[,13],Cars93[,5])
mydata=Cars93
res=cor.test(mydata$Horsepower,mydata$Price); res
Interpretation
The p-value of the test is 0.00, which is
less than the significance level (i.e., p ≤
0.05). We can conclude that Horsepower
and Price are significantly correlated with a
correlation coefficient of 0.79.
FORECASTING
NUMERIC DATA
Using R Software Tool
Predictive Analytics
• Simplest form of regression is where you have 2 variables – a
response variable & a predictor variable
• It essentially describes a straight line that goes through the data where
a is the y-intercept & b is the slope
y  a  bx   SSE   y  ŷ    y  a  b1X1  b 2 X 2 
2 2

b
 ( x  x)( y  y )
i i

 ( x  x) i
2

a  y b
• In R, the basic function for fitting linear model is lm(). The format is
myfit <- lm(formula,data)
• The formula is typically written as Y ~ x1 + x2 + .... + xk
• ~ separates the response variable on the left from the predictor variables on the
right
• Predictor variables are separated by + sign
• Other symbols can be used to modify the formula in various ways
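A brief sketch of some of these formula modifiers, shown on mtcars purely for illustration (the particular models are not part of the caselets that follow):
lm(mpg ~ wt + hp, data = mtcars)       # two predictors
lm(mpg ~ wt:hp, data = mtcars)         # interaction term only
lm(mpg ~ wt * hp, data = mtcars)       # shorthand for wt + hp + wt:hp
lm(mpg ~ wt + I(wt^2), data = mtcars)  # I() protects arithmetic inside a formula
lm(mpg ~ ., data = mtcars)             # "." stands for all remaining variables
lm(mpg ~ . - carb, data = mtcars)      # all remaining variables except carb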
Each of these functions is applied to the object returned by the lm()
function in order to generate additional information based on the
fitted model
Function & its syntax – Explanation
• lm(formula, data) – Fits the model & shows the coefficients for the regression, i.e., the intercept & the slope.
In the formula, the response goes on the left of the ~ and the predictor(s) on the right
• summary(fitness) – Displays detailed results for the fitted model
• names(fitness) – Lists the components of the result object, from which more information can be extracted
• fitness$coefficients, fitness$coef, coef(fitness) – List the model parameters (intercept & slopes) for the fitted model
• confint(fitness) – Provides confidence intervals for the model parameters. Default settings produce
95% confidence intervals, i.e., at 2.5% & 97.5%
• confint(fitness, parm='(Intercept)', level=0.9) – Alters the interval using the level = instruction, specifying the
interval as a proportion. You can also choose which parameters to display by using the parm =
instruction & placing the parameter names in quotes
• fitted(fitness) – Lists the predicted (fitted) values of the model, i.e., the predicted y value for each x value
• residuals(fitness), resid(fitness) – List the residuals, i.e., the differences between the observed & the fitted values
• formula(fitness) – Accesses the formula used in the linear model
• fitness$call – The complete call to the lm() command
• plot() – Generates diagnostic plots for evaluating the fit of a model
• predict() – Uses a fitted model to predict response values for a new dataset
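As a minimal sketch of how the extractor functions in the table are used together (the model here, mpg regressed on wt, is chosen only for demonstration):
fit <- lm(mpg ~ wt, data = mtcars)  # fit a simple linear model
summary(fit)                        # detailed results
coef(fit)                           # intercept & slope
confint(fit, level = 0.95)          # 95% confidence intervals for the parameters
head(fitted(fit))                   # first few predicted mpg values
head(residuals(fit))                # first few residuals (observed - fitted)
predict(fit, data.frame(wt = 3), interval = "confidence", level = 0.95)  # prediction for a 3000 lb car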
Caselet 9 - women Dataset (In-built)
Data-set women provides the height & weight for a set of 15
women aged 30-39 on 2 variables Height and Weight. Suppose
we wish to predict weight from height. How can we identify
overweight or underweight individuals?
1. [,1] height – numeric height (in)
2. [,2] weight – numeric weight (lbs)
fitness=lm(weight~height, data=women)
summary(fitness)
fitted(fitness)
residuals(fitness)
Output - weight = -87.52 + 3.45 x Height
plot(women$height, women$weight, xlab="Height(in inches)", ylab="Weight (in
pounds)",pch=19)
abline(fitness)
a <- data.frame(height = 170)
result <- predict(fitness, a, level=0.95, interval="confidence")
print(result)
Interpretation
• From the output our suspicion is confirmed. Height and weight have a strong
positive correlation (0.9954948)
• The regression coefficient 3.45 indicates that there is an expected increase
of 3.45 pounds of weight for every 1 inch increase in height
• Multiple R-squared value (R2=0.991) indicates that the model accounts for
99.1% of the variance in weights
• The residual standard error (1.53 pounds) is the average error in predicting
weight from height using this model
• The F-statistic tests whether the predictor variables, taken together, predict
the response variable above chance levels. Here, F-test is equivalent to the
t-test for the regression coefficient for height
• aov() command is a special case of linear modelling, with the command
being a “wrapper” for the lm() command. summary() command gets the
result in a sensible layout
• fittness.lm=lm(weight~height,data=women)
• fittness.aov=aov(weight~height,data=women)
• summary(fittness.aov) #generates a classic ANOVA table
• summary(fittness.lm)
area = c(3L,4L,6L,4L,2L,5L); sales = c(6,8,9,5,4.5,9.5)
realestate=data.frame(area, sales)
Caselet 10 – Real Estate
realestatelm=lm(sales~area, data=realestate)
summary(realestatelm)
Output - sales = 2 + 1.25 x area (residual standard error 1.311)
fitted(realestatelm)
residuals(realestatelm)
plot(area, sales, xlab='Local Area of the Plot(in 1000 sq.
ft.)', ylab='Sales(in $100,000)', pch=19)
abline(realestatelm)
a = data.frame(area = 20)
predictsales = predict(realestatelm, a, level=0.95,
interval="confidence")
print(predictsales)
Interpretation
– The regression coefficient 1.25 indicates an expected increase of 1.25 (i.e., $125,000) in sales for every 1
unit (1,000 sq. ft.) increase in the area of the plot
– The multiple R-squared value (R2=0.6944) indicates that the model accounts for 69.4% of the
variance in sales
– The residual standard error (1.311, in $100,000 units) is the average error in predicting sales from the area of
the plot using this model
– When the area is 20K sq. ft., the plot is predicted to sell for $27L
– At the 95% confidence level, the price of a 20K sq. ft. plot will lie between $8.5L and $45.4L
Caselet 11 - mtcars (Motor Trend Car Road Tests) Dataset
The data was extracted from the 1974 Motor Trend US magazine,
and comprises fuel consumption and 10 aspects of automobile
design and performance for 32 automobiles (1973–74 models).
The description of the 11 numeric variables with the 32
observations in the data frame are as follows:
1. [,1] mpg – Miles/ (US) gallon
2. [,2] cyl – Number of Cylinders
3. [,3] disp – Displacement (cu.in.)
4. [,4] hp – Gross Horsepower
5. [,5] drat – Rear Axle Ratio
6. [,6] wt – Weight (1000 lbs)
7. [,7] qsec – ¼ Mile Time
8. [,8] vs – V/S Engine Shape
9. [,9] am –Transmission (0=automatic, 1=manual)
10.[,10] gear – Number of Forward Gears
11.[,11] carb – Number of Carburetors
Prepare a managerial report.
Research questions
1. Is a car with automatic or manual transmission better in term of miles
per gallons (mpg)?
2. Quantify the mpg difference between automatic & manual
transmission
Calculate the correlation coefficient between MPG & the remaining 10 variables
round(cor(mtcars)[-1, 1], 2)
Interpretation: - According to the correlation coefficients, “wt”, “cyl”
and “disp” show the strongest correlations with “mpg”
Is a car with automatic or manual transmission better in terms of mpg?
## Model 1: MPG ~ am (automatic - am = 0 or manual - am = 1)
fit0 <- lm(mpg ~ am, data = mtcars); summary(fit0)
Interpretation
1. With manual transmission MPG increases by 7.245 miles/Gallon
2. Here p-value ≤ 0.05, indicating that the difference for manual transmission is
significant
3. Adjusted R2 is 0.3385. Clearly, the coefficients obtained here is biased without
considering other variables
## Model 2: MPG ~ weight + number of cylinders + displacement
fit1 <- lm(mpg ~ wt + cyl + disp, data = mtcars); summary(fit1)
Interpretation
1. We see that the adjusted R2 is 0.8147
2. P-values show that both weight and number of cylinders have significant linear
relationships with MPG (since p-value ≤ 0.05), but displacement does not
## Model 3: MPG ~ weight + horsepower + number of cylinders + displacement + transmission
fit2 <- lm(mpg ~ wt + hp + cyl + disp + am, data = mtcars);
summary(fit2)
Interpretation
1. We see that the adjusted R2 is 0.827
2. P-values show that both weight &
horsepower have significant linear
relationships with MPG (since p-
value ≤ 0.05), but number of
cylinders, displacement &
transmission do not
## Model 4: MPG ~ all variables
fit3 <- lm(mpg ~ ., data = mtcars); summary(fit3)
Interpretation
1. We see that the adjusted R2 is 0.807
2. None of the variables considered
have significant linear relationship
with MPG since p-value ≥ 0.05
## Model 5: Multiple linear regression with interactions - impact of automobile
weight & horsepower on mileage
mreg_int = lm(mpg~hp + wt + hp:wt, data=mtcars)
summary(mreg_int)
Interpretation
• The interaction between horsepower & car weight is significant (since p ≤ 0.05) i.e., the
relationship between miles per gallon & horsepower varies by car weight
• The model for predicting mpg is given by mpg = 49.81 – 0.12 x hp – 8.22 x wt + 0.03 x hp x wt
## Model 6: Predict – Simple & Multiple Linear Regression
• slin_reg = lm(mpg ~ wt, data = mtcars) #simple linear regression of mpg on weight
• mlin_reg1 = lm(mpg ~ wt + cyl + disp, data = mtcars) #multiple linear regression (same formula as Model 2)
• a = data.frame(wt=10) #Predict Simple Linear Regression
• result = predict(slin_reg, a, level=0.95, interval="confidence")
• print(result)
• a = data.frame(wt=10, cyl=10, disp=200) #Predict Multiple Linear Regression
• result = predict(mlin_reg1, a, level=0.95, interval="confidence")
• print(result)
Interpretation
• When the wt of the automobile is 10 the mpg of the vehicle reduces further to -16.15 with its
confidence limits as -8.33 and -23.9
• When the wt is 10, no. of cylinders is 10, and displacement is 200 then the mpg reduces further
to -11.6
## Model 7: Graphical Representation
install.packages("car")
library(car)
scatterplotMatrix(mtcars, spread=FALSE, smoother.args=list(lty=2), main='Scatter
Plot Matrix - mtcars')
Conclusion
1. MPG is mainly related to vehicle weight
2. Manual transmission is better than auto transmission on MPG
3. With the given data set, we are unable to correctly quantify the difference between the two types of
transmissions on MPG
Polynomial Regression – Special case of multiple linear regression
• Linear relationship between two variables x and y is one of the
most common, effective and easy assumptions to make when
trying to figure out their relationship
• Sometimes however, the true underlying relationship is more
complex than that, and this is when polynomial regression
comes in to help
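As a side note, polynomial terms can be written either with I() terms, as in the caselets below, or with poly(..., raw = TRUE). A minimal sketch on the built-in women data, shown only to confirm that the two parameterisations give the same fit:
fit_I    <- lm(weight ~ height + I(height^2), data = women)         # explicit I() terms
fit_poly <- lm(weight ~ poly(height, 2, raw = TRUE), data = women)  # raw polynomial via poly()
all.equal(fitted(fit_I), fitted(fit_poly))                          # TRUE - identical fitted values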
Caselet 12 – Position_Salaries (User-defined Dataset)
The data set provides the Position, Level and Salary for a set of 10 employees. Suppose
Mr. Raghu, who has been working with the XYZ company as a Region Manager for the past 2
years, wishes to shift to a new organization & has applied for a vacant Partner position,
which he is due for in the next 2 years, demanding a salary of more than his current pay
of Rs.160000/-. Is Mr. Raghu being reasonable?
1. [,1] Position – Designation
2. [,2] Level – coding for designation
3. [,3] Salary – Salary of the employee
View(Position_Salaries)
dataset = Position_Salaries[2:3]
dataset
#Polynomial Regression Model
poly_reg=lm(formula=Salary~Level + I(Level^2) + I(Level^3) + I(Level^4), data=dataset)
summary(poly_reg)
#Visualizing Polynomial Results (2 Methods – Plot & ggplot)
plot(dataset$Level, dataset$Salary, xlab="Level", ylab="Salary")
lines(dataset$Level,fitted(poly_reg))
ggplot() + #install ggplot2 package & corresponding library
geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)), colour = 'blue') +
ggtitle('Truth or Bluff (Polynomial Regression)') + xlab('Level') + ylab('Salary')
#Predicting Result with Polynomial Regression Model
y_pred = predict(poly_reg, data.frame(Level=6.5)) #the I(Level^2), I(Level^3) & I(Level^4) terms are computed automatically
y_pred
Interpretation
Mr. Raghu seems reasonable: after 2 years, i.e., at level 6.5, the predicted salary is Rs.158862.5/-,
which is almost Rs.160000/-. Hence his pay could be fixed at nothing less than Rs.160000/-.
Caselet 13 - women Dataset (In-built)
Data-set women provides the height & weight for a set of 15 women aged
30-39 on 2 variables Height and Weight. Suppose we wish to predict weight
from height. How can we identify overweight or underweight individuals?
1. [,1] height – numeric height (in)
2. [,2] weight – numeric weight (lbs)
Simple Linear Regression
fit1 <- lm(weight ~ height, data=women)
summary(fit1)
plot(women$height, women$weight, xlab="Height (in inches)",
ylab="Weight (in pounds)")
abline(fit1)
Weight = −87.52 + 3.45 × Height
• From the output our suspicion is confirmed. Height and weight have a strong positive
correlation (0.9954948)
• The regression coefficient 3.45 indicates that there is an expected increase of 3.45
pounds of weight for every 1 inch increase in height
• Multiple R-squared value (R2=0.991) indicates that the model accounts for 99.1% of
the variance in weights
• The residual standard error (1.53 pounds) is the average error in predicting weight
from height using this model
Polynomial Regression
• In the Women dataset we can improve the prediction using a regression with a quadratic term
(that is, X2 )
• We can fit a quadratic equation using the statement
fit2 <- lm(weight ~ height + I(height^2), data=women)
summary(fit2)
height^2 adds a height-squared term to the prediction equation. The I() function treats the contents
within the parentheses as a regular R arithmetic expression, rather than as part of the model formula
plot(women$height,women$weight, xlab="Height (in inches)", ylab="Weight (in lbs)")
lines(women$height,fitted(fit2))
• From this new analysis, the prediction equation is
Weight = 261.88 - 7.348 × Height + 0.083 × Height²
Interpretation
• Both regression coefficients are significant
at the p < 0.0001 level
• Amount of variance accounted for has
increased to 99.9 percent
• The significance of the squared term (t =
13.89, p < .001) suggests that inclusion of
the quadratic term improves the model fit
• Close look at the plot of fit2 shows a curve
that indeed provides a better fit
• We can fit a cubic equation using the statement
fit3 <- lm(weight ~ height + I(height^2) + I(height^3), data=women)
summary(fit3)
ggplot() + #requires the ggplot2 package
geom_point(aes(x = women$height, y = women$weight), colour = 'red') +
geom_line(aes(x = women$height, y = predict(fit3, newdata = women)),
colour = 'blue') +
ggtitle('Polynomial Regression for Women dataset') + xlab('Height') + ylab('Weight')
• From this new analysis, the prediction equation is
Weight = -896.74 + 46.41 × Height - 0.74 × Height² + 0.004 × Height³
• To predict the weight when height is 100 using the polynomial regression model, we use the
predict() function
y_pred = predict(fit3, data.frame(height=100)) #the I(height^2) & I(height^3) terms are computed automatically
y_pred
Interpretation
• Regression coefficients are significant at the p < 0.0001 level
• The amount of variance accounted for has increased further, to nearly 100 percent
• t = 3.94, p < 0.01 suggests that inclusion of the cubic term improves the model fit
• When height=100 the predicted weight on applying Polynomial regression model is 535
Overview of Non-Linear Models - Decision Tree Regression
• One of the most intuitive ways to create a predictive model – using the concept of a
tree. Tree-based models often also known as decision tree models successfully
handle both regression & classification type problems
• Linear regression models and logistic regression fail in situations where the
relationship between features and outcome is non-linear or where the features are
interacting with each other
• Non-linear models include: nonlinear least squares, splines, decision trees, random
forests and generalized additive models (GAMs)
• A relatively modern technique for fitting nonlinear models is the decision tree. Tree-
based learning algorithms are considered to be among the best & most widely used
supervised learning methods
• Decision tree is a model with a straightforward structure that allows to predict output
variable, based on a series of rules arranged in a tree-like structure
• Output variable that we can model/ predict can be categorical or numerical
• Decision Trees are non-parametric supervised learning method that are often used for
classification (dependent variable – categorical) and regression (dependent variable –
continuous)
• For a regression tree, the predicted response for an observation is given by the mean/
average response of the training observations that belong to the same terminal node,
while for classification tree predicted response for an observation is given by the
mode/ class response of the training observations that belong to same terminal node
Decision trees can produce nonlinear decision surfaces
Source: https://www.youtube.com/watch?v=DCZ3tsQIoGU
• Root Node represents the entire population or sample. It further gets divided into two or
more homogeneous sets.
• Splitting is a process of dividing a node into two or more sub-nodes.
• When a sub-node splits into further sub-nodes, it is called a Decision Node.
• Nodes that do not split is called a Terminal Node or a Leaf.
• When you remove sub-nodes of a decision node, this process is called Pruning. The
opposite of pruning is Splitting.
• A sub-section of an entire tree is called Branch.
• A node, which is divided into sub-nodes is called a parent node of the sub-nodes; whereas
the sub-nodes are called the child of the parent node.
• Decision trees are nonlinear predictors. The extent of the nonlinearities depends on the
number of splits in the tree
• To build a regression tree, we use recursive binary splitting ( a greedy & top-down
algorithm) to grow a large tree on the training data, stopping only when each terminal
node has fewer than some minimum number of observations, which minimizes
the Residual Sum of Squares (RSS)
• Beginning at the top, split the tree into 2 branches, creating a partition of 2 spaces.
You then carry out this splitting multiple times & choose the split that
minimizes the (current) RSS
• Next, we can apply cost complexity pruning to the large tree in order to obtain a
sequence of best subtrees, as a function of α. We can use K-fold cross-validation to
choose α. Divide the training observations into K folds to estimate the test error rate of
the subtrees. Our goal is to select the one that leads to the lowest error rate
• The anova method leads to regression trees; it is the default method if y is a simple
numeric vector, i.e., not a factor, matrix, or survival object
• To decide which attribute should be tested first, simply find the one with the highest
information gain. Then recurse…
• Limitations
• Decision trees generally do not have the same level of predictive accuracy as other
approaches, since they aren't quite robust. A small change in the data can cause a large
change in the final estimated tree
• Prone to over-fitting & unstable at times. Loses a lot of information while trying to categorize a continuous
variable. Not sensitive to skewed distributions
• Advantages
• Simple to understand & interpret. Can be displayed graphically
• Requires little data preparation. Useful in data exploration
• Able to handle both numerical and categorical data
• Non-parametric method, thus it does not need any assumptions on the sample space
• Closely mirror human decision-making compared to other regression and classification
approaches
• Tree based methods empower predictive models with high accuracy, stability and ease
of interpretation
• Applications of Decision Trees
• Direct Marketing – While marketing products and services, a business should track the
products and services offered by competitors, as this identifies the best combination of
products and marketing channels that target specific sets of consumers
• Customer Retention – Decision trees help organizations keep their valuable
customers and get new ones by providing good quality products, discounts, and gift
vouchers. These can also analyze buying behaviors of the customers and know their
satisfaction levels
• Fraud Detection – Fraud is a major problem for many industries. Using a classification
tree, a business can detect fraud beforehand and can drop fraudulent customers
• Diagnosis of Medical Problems – Classification trees identify patients who are at risk
of suffering from serious diseases such as cancer and diabetes
1. Grow the Tree - There are multiple packages available to
implement decision tree such as ctree, rpart, tree etc.
rpart(formula, data=, method=,control=)
formula outcome ~ predictor1+predictor2+predictor3+etc.
data=specifies the data frame
method="class" for a classification tree and “anova" for a regression tree
control=optional parameters for controlling tree growth
2. Examine the Results
printcp(fit) - display cp table
plotcp(fit) - plot cross-validation results
rsq.rpart(fit) - plot approximate R-squared and relative error for different splits (2
plots). labels are only appropriate for the "anova" method
print(fit) - print results
summary(fit) - detailed results including surrogate splits
plot(fit) - plot decision tree
rpart.plot(fit) – another form of plotting decision tree
text(fit) - label the decision tree plot
post(fit, file=) - create postscript plot of decision tree
3. Prune Tree - Prune tree to avoid over-fitting the data
prune(fit, cp= ) - prune the tree to the desired size
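Putting the three steps together, a minimal end-to-end sketch using the kyphosis data that ships with rpart (also used in the syntax examples below); pruning back to the cp value with the lowest cross-validated error is one common convention, not the only possible rule:
library(rpart)
# 1. Grow a classification tree
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
# 2. Examine the results
printcp(fit)   # cp table
plotcp(fit)    # cross-validation error versus cp
summary(fit)   # detailed results
# 3. Prune back to the cp with the lowest cross-validated error (xerror)
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pfit <- prune(fit, cp = best.cp)
plot(pfit, uniform = TRUE); text(pfit, use.n = TRUE)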
Growing a Tree
• Features to choose
• Conditions for splitting
• Knowing when to stop
• Pruning
Common Decision Tree Algorithms
• Gini Index
• Chi-square
• Information Gain
• Reduction in Variance
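As a small worked illustration of the Information Gain criterion listed above, using hypothetical class counts (not taken from any dataset in these notes):
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
parent   <- entropy(c(10, 10) / 20)           # parent node with 10 "yes" & 10 "no" cases -> 1 bit
left     <- entropy(c(8, 2) / 10)             # left child after the candidate split: 8 "yes", 2 "no"
right    <- entropy(c(2, 8) / 10)             # right child after the candidate split: 2 "yes", 8 "no"
children <- (10/20) * left + (10/20) * right  # weighted average entropy of the children
parent - children                             # information gain, about 0.28 bits
The attribute whose split yields the largest gain is tested first, and the procedure then recurses on each child node.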
• Regression & Classification trees can be generated through the rpart
package
rpart package uses recursive partitioning algorithm. It helps the user identify the
structure of data; & develop decision rules for decision trees
Syntax: - rpart(formula, data, method,…)
where formula is a formula describing the predictor and response variables with no
interaction terms
data is the name of the data set used
method If y is a survival object, then method = "exp" is assumed, if y has 2
columns then method = "poisson" is assumed, if y is a factor
then method = "class" is assumed, otherwise method = "anova" is
assumed
Example: - fit = rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
• The R package party is also used to create decision trees through its
function ctree() which is used to create & analyze decision tree
Syntax: - ctree(formula, data)
where formula is a formula describing the predictor and response variables
data is the name of the data set used
Example: - fit = ctree(Kyphosis ~ Age + Number + Start, data=kyphosis)
plot(fit, main="Conditional Inference Tree for Kyphosis")
Plotting Options
• The plot function for rpart uses the general plot function. By default, this
leaves space for axes, legends or titles on the bottom, left, and top
• Simplest labeled plot is called by using plot & text without changing any
defaults
par(mar = rep(0.2, 4))
plot(fit, uniform = TRUE)
text(fit, use.n = TRUE, all = TRUE)
• uniform = TRUE - plot has uniform stem lengths
• use.n = TRUE - specifying number of subjects at each node
• all = TRUE - Labels on all the nodes, not just the terminal nodes
• Fancier plots can be created by modifying the branch option, which controls
the shape of branches that connect a node to its children.
par(mar = rep(0.2, 4))
plot(fit, uniform = TRUE, branch = 0.2, compress = TRUE, margin = 0.1)
text(fit, all = TRUE, use.n = TRUE, fancy = TRUE, cex= 0.9)
• compress - attempts to improve overlapping of some nodes
• fancy - creates the ellipses and rectangles, and moves the splitting rule to the midpoints of the
branches
• Margin - shrinks the plotting region slightly so that the text boxes don’t run over the edge of
the plot
Caselet 14 – Position_Salaries (User-defined Dataset)
The data set provides the Position, Level and Salary for a set of 10 employees. Suppose
Mr. Raghu, who has been working with the XYZ company as a Region Manager for the past 2
years, wishes to shift to a new organization & has applied for a vacant Partner position,
which he is due for in the next 2 years, demanding a salary of more than his current pay
of Rs.160000/-. Is Mr. Raghu being reasonable?
1. [,1] Position – Designation
2. [,2] Level – coding for designation
3. [,3] Salary – Salary of the employee
#Fit without a minsplit control (default), then with minsplit = 1
regressor = rpart(formula = Salary ~ ., data = dataset)
regressor = rpart(formula = Salary ~ ., data = dataset, control = rpart.control(minsplit = 1))
ggplot() + geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(regressor, newdata = dataset)), colour = 'blue') +
ggtitle('Truth or Bluff (Decision Tree Regression)') + xlab('Level') + ylab('Salary')
dataset = read.csv('Position_Salaries.csv') # Importing the dataset
dataset = dataset[2:3]
View(dataset)
install.packages("rpart") #Creating Decision Tree
library(rpart)
install.packages('ggplot2') #Plotting the model
library(ggplot2)
# Fitting the Regression Model to the dataset
regressor=rpart(formula=Salary~., data=dataset, control=rpart.control(minsplit = 1))
# Visualising the Decision Tree Regression Model results
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(regressor, newdata = dataset)), colour =
'blue') + ggtitle('Truth or Bluff (Decision Tree Regression)') + xlab('Level') +
ylab('Salary')
# Visualizing the Regression Model results (for higher resolution and smoother curve)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level =
x_grid))), colour = 'blue') + ggtitle('Truth or Bluff (Regression Model)') + xlab('Level') +
ylab('Salary')
y_pred = predict(regressor, data.frame(Level = 6.5)) # Predicting a new result
y_pred
Interpretation
The average of the salaries in the interval between levels 6.5 and 8.5 is 250000
dollars. For the employee of interest to us, whose level is 6.5, the predicted
salary is therefore 250000.
Caselet 15 – Prostate Cancer (in-built Dataset)
Prostate cancer dataset is a data frame with 97 observations on 10 variables. A
short description of each of the variables is listed below
lcavol - log cancer volume
lweight- log prostate weight
age - in years
lbph - log of the amount of benign prostatic hyperplasia
svi - seminal vesicle invasion
lcp - log of capsular penetration
gleason - a numeric vector
pgg45 - percent of Gleason score 4 or 5
lpsa - response
train - a logical vector
The last column indicates which 67 observations were used as the "training set"
and which 30 as the test set. Given the data, examine the correlation between
the level of prostate-specific antigen & a number of clinical measures in men
who were about to receive a radical prostatectomy.
1. Ensure that the packages & libraries are installed
install.packages("rpart") #Classification & Regression Trees
library(rpart)
install.packages("partykit") #Tree Plots
library(partykit)
install.packages(“MASS") #Breast & pima Indian data
library(MASS)
install.packages(“ElemStatLearn") #Prostate data
library(ElemStatLearn)
2. We will first apply regression on prostate data. It involves calling the
dataset, coding the gleason score as an indicator variable using the
ifelse() function & creating the test & train sets. Train set will be pros.train
& the test set will be pros.test
data(prostate)
prostate$gleason = ifelse(prostate$gleason == 6,0,1)
pros.train = subset(prostate, train==TRUE)[,1:9]
pros.test = subset(prostate, train==FALSE)[,1:9]
3. To build a regression tree on the train data, lets use rpart() function
tree.pros = rpart(lpsa~., data=pros.train)
4. Call this object using print() function & cptable & then examine the error
per split in order to determine the optimal number of splits in the tree
print(tree.pros$cptable)
5. Examine the splits graphically
dev.off() #To capture the plot completely as a screenshot
plotcp(tree.pros)
Interpretation
• First column labelled CP is the cost complexity parameter
• Second column, nsplit, is the number of splits in the tree
• rel error stands for relative error & is the RSS for the number of splits divided by the RSS for no
splits
• Both xerror & xstd are based on ten-fold cross-validation, xerror being the average error & xstd
the standard deviation of the cross-validation process
• We can see that while 5 splits produced the lowest error on the full dataset, 4 splits
produced a slightly lower error using cross-validation
Interpretation
• Plot shows us the relative error by the tree size with the corresponding error
bars
• Horizontal line on the plot is the upper limit of the lowest standard error
• Selecting a tree size of 5 which is 4 splits, we can build a new tree object
where xerror is minimized
6. We can build a new tree object where xerror is minimized by pruning our
tree. First create an object for cp associated with the pruned tree from the
table. Then the prune() function handles the rest.
cparam = min(tree.pros$cptable[5,])
prune.tree.pros = prune(tree.pros, cp=cparam)
7. We can plot and compare the full & pruned trees. Tree plots produced by
partykit package are much better than those produced by the party
package. Simply use the as.party() function as a wrapper in plot()
plot(as.party(tree.pros))
Now use as.party() function for the pruned tree
plot(as.party(prune.tree.pros))
Interpretation: - In the next slide we note that splits are exactly the same in 2 trees with the
exception of the last split, which includes the variable age for the full tree. Interestingly, both
the first & second splits in the tree are related to the log of cancer volume (lcavol). Plots are
quite informative as they show the splits, nodes, observations per node & boxplots of the
outcome that we are trying to predict
8. Lets examine how well the pruned tree performs on the test data. Lets
create an object of the predicted values using the predict() function &
incorporate the test data. Then, calculate the errors & finally mean of the
squared errors
party.pros.test = predict(prune.tree.pros, newdata=pros.test)
rpart.resid = party.pros.test - pros.test$lpsa #Calculate the residuals
mean(rpart.resid^2) #Calculate MSE
Caselet 16 – iris Dataset (In-built)
Anderson collected & measured hundreds of irises in an effort to study variation
between & among the different species. There are 260 species of iris. This data
set focuses of three of them (Iris setosa, Iris virginica and Iris versicolor). Four
features were measured on 50 samples for each species: sepal width, sepal
length, petal width, and petal length. Anderson published it in "The irises of the
Gaspe Peninsula", which originally inspired Fisher to develop LDA
• iris is a data frame with 150 cases (rows) and 5 variables (columns) named
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
• iris3 gives the same data arranged as a 3-dimensional array of size 50 by 4
by 3, as represented by S-PLUS. The first dimension gives the case number
within the species subsample, the second the measurements with
names Sepal L., Sepal W., Petal L., and Petal W., and the third the species.
In this example we are going to try to predict the Sepal.Length
1. In order to build our decision tree, first we need to install the correct package
head(iris)
install.packages("rpart")
library(rpart)
2. Next we are going to create our tree. Since we want to predict Sepal.Length – that
will be the first element in our fit equation
fit <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width+ Species,
method="anova", data=iris )
3. Note the method in this model is anova. This means we are going to try to predict a
number value. If we were doing a classifier model, the method would be class
4. Now let’s plot out our model
plot(fit, uniform=TRUE, main="Regression Tree for Sepal Length")
text(fit, use.n=TRUE, cex = .6)
library(party) #Alternate method: ctree() from the party package
outtree = ctree(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)
plot(outtree)
5. Note the splits are marked – like the top split is Petal.Length < 4.25. Also, at the
terminating point of each branch, you see an n which specified the number of
elements from the data file that fit at the end of that branch
6. Finally now that we know the model is good, let’s make a prediction
testData <-data.frame (Species = 'setosa', Sepal.Width = 4, Petal.Length =1.2,
Petal.Width=0.3)
predict(fit, testData, method = "anova")
Interpretation: - The model predicts our Sepal.Length will be approx 5.17
Case Study
Predicting Complex Skill Learning
• SkillCraft1 Master Table Dataset
• https://archive.ics.uci.edu/ml/datasets/SkillCraft1+Master+
Table+Dataset
• Mastering Predictive Analytics with R – James D.Miller,
Rui Miguel Forte – Second Edition – Packt Publishing –
August 2017 - Page 226
Overview of Non-Linear Models – Random Forest Regression
• To greatly improve models predictive ability, we can produce numerous trees
& combine the results
• Random forest technique does this by applying two different tricks in model
development
1. Use of bootstrap aggregation or bagging – An individual tree is built on a
random sample of the dataset, roughly two-thirds of the total observations
(remaining one-third are referred to as out-of-bag (oob)). This is repeated dozens/
hundreds of times & the results are averaged. Each of these trees is grown & not
pruned based on any error measure, & this means that the variance of each of
these individual trees is high
2. Concurrently with the random sample of the data, i.e., bagging, It also takes a
random sample of the input features at each split.
• In the randomForest package, we will use the default random no. of the
predictors that are sampled, which, for classification problems, is the square
root of the total predictors & for regression, it is the total no. of predictors
divided by three
• Number of predictors the algorithm randomly chooses at each split can be
changed via. the model tuning process
• We will use the randomForest package. General syntax to create a random
forest is to use the randomForest() function & specify the formula & dataset
as the 2 primary arguments
• Recall that, for regression, default variable sample per tree iteration is p/3,
where p is equal to the number of predictor variables in the data frame
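A minimal sketch of these arguments on the built-in mtcars data (the mtry and ntree values below are illustrative, not tuned):
library(randomForest)
set.seed(123)
rf.default <- randomForest(mpg ~ ., data = mtcars)  # regression default: mtry = floor(p/3) predictors per split
rf.tuned   <- randomForest(mpg ~ ., data = mtcars, mtry = 4, ntree = 300)  # overriding mtry & ntree
rf.default                # oob MSE & % variance explained
plot(rf.tuned)            # oob error as trees are added
which.min(rf.tuned$mse)   # number of trees giving the lowest oob MSE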
Caselet 17 – Prostate Cancer (in-built Dataset)
Prostate cancer dataset is a data frame with 97 observations on 10 variables. A
short description of each of the variables is listed below
lcavol - log cancer volume
lweight- log prostate weight
age - in years
lbph - log of the amount of benign prostatic hyperplasia
svi - seminal vesicle invasion
lcp - log of capsular penetration
gleason - a numeric vector
pgg45 - percent of Gleason score 4 or 5
lpsa - response
train - a logical vector
The last column indicates which 67 observations were used as the "training set"
and which 30 as the test set. Given the data, examine the correlation between
the level of prostate-specific antigen & a number of clinical measures in men
who were about to receive a radical prostatectomy.
1. Ensure that the packages & libraries are installed
install.packages(“randomForest") # Random Forest Regression
library(randomForest)
install.packages("partykit") # Tree Plots
library(partykit)
install.packages(“MASS") # Breast & pima Indian data
library(MASS)
install.packages(“ElemStatLearn") # Prostate data
library(ElemStatLearn)
2. We will first apply regression on prostate data. It involves calling the
dataset, coding the gleason score as an indicator variable using the
ifelse() function & creating the test & train sets. Train set will be pros.train
& the test set will be pros.test
data(prostate)
prostate$gleason = ifelse(prostate$gleason == 6,0,1)
pros.train = subset(prostate, train==TRUE)[,1:9]
pros.test = subset(prostate, train==FALSE)[,1:9]
3. To build a random forest regression on the train data, lets use
randomForest() function
rf.pros = randomForest(lpsa~., data=pros.train)
4. Call this object using print() function
print(rf.pros)
5. Examine the splits graphically
dev.off() #To capture the plot completely as a screenshot
plot(rf.pros)
Interpretation
• Call of the rf.pros object shows us that the random forest generated 500
different trees & sampled 2 variables at each split
• Result of MSE of 0.68 & nearly 53 per cent of the variance explained
• Lets see if we can improve on the default number of trees. Too many trees
may lead to over fitting
• How much is too many depends on the data. 2 things that can help out are:
1. Plot rf.pros
2. Ask for the minimum MSE
Interpretation
• Plot shows the MSE by the number of trees in the model
• We can see that as the trees are added, significant improvement in MSE
occurs early on & then result in flat lines just before 100 trees are built in the
forest
6. We can identify the specific & optimal
tree with the which.min() function
which.min(rf.pros$mse)
7. We can try 193 trees in the random forest by just specifying ntree=193 in
the model
set.seed(123)
rf.pros.2=randomForest(lpsa~., data=pros.train, ntree=193)
print(rf.pros.2)
Interpretation
• We can see that MSE & Variance explained have both improved slightly
• Lets see one more plot before testing the model. If we are combining the results
of 193 trees that are built using bootstrapped samples & only 2 random
predictors, we will need a way to determine the drivers of the outcome
8. One tree alone cannot be used to paint this picture, but we can produce
a variable importance plot & corresponding list. Y-axis is the list of variables
in descending order of importance & x-axis is the % of improvement in
MSE
varImpPlot(rf.pros.2, main="Variable Importance Plot - PSA Score")
Interpretation
Consistent with the single tree, lcavol is the most important variable & lweight is the second-
most important variable
9. To examine the raw numbers, use the importance() function
importance(rf.pros.2)
10. Now, lets examine how it does on the test dataset
rf.pros.test = predict(rf.pros.2, newdata=pros.test)
rf.pros.test
rf.resid = rf.pros.test - pros.test$lpsa #Calculate the residuals
mean(rf.resid^2) #Calculate MSE
CLASSIFICATION
Using R Software Tool
Classification Trees
• Classification trees operate under the same principal as regression trees
except the splits are not determined by the RSS but an error rate
• Building classification trees using the CART methodology continues the
notion of recursively splitting up groups of data points in order to minimize
some error function
• What we like to use is a measure for node purity that would score nodes
based on whether they contain data points primarily belonging to one of the
output classes
1. One possible measure of node purity is the Gini index (see the sketch after this list). To calculate the Gini index at a
particular node in a tree, we compute the ratio of the number of data points labeled as class k
over the total number of data points as an estimate of the probability of a data point
belonging to class k at the node in question. The Gini index for a completely pure node (a node
with only one class) is 0. For a binary output with equal proportions of the 2 classes, the Gini
index is 0.5
• Another commonly used criterion is deviance. All nodes that have the same proportion of data
points across the different classes will have the same value of the Gini index, but if they have
different numbers of observations, they will have different values of deviance
• Apart from the error measure, the CART methodology for building a classification tree is exactly parallel to that for
building a regression tree. The tree is post-pruned using the same cost-complexity approach
outlined for regression trees, but after replacing the SSE as the error function with either the
Gini index or deviance
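To make the Gini calculation concrete, here is a minimal R sketch; the function name gini_index and the example label vectors are illustrative placeholders, not part of the case study below
gini_index = function(labels) {
  p = table(labels) / length(labels)  # estimated class probabilities at the node
  1 - sum(p^2)                        # Gini index: 0 for a completely pure node
}
gini_index(c("benign","benign","benign","benign"))        # pure node -> 0
gini_index(c("benign","benign","malignant","malignant"))  # 50/50 binary node -> 0.5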
CASE STUDY
Biopsy Data on Breast Cancer Patients
Dr. William H. Wolberg from the University of Wisconsin commissioned the
Wisconsin Breast Cancer Data in 1990. His goal of collecting the data was to
identify whether a tumor biopsy was malignant or benign. His team collected the
samples using Fine Needle Aspiration (FNA). If a physician identifies the tumor
through examination or imaging of an area of abnormal tissue, then the next step is
to collect a biopsy. FNA is a relatively safe method of collecting the tissue &
complications are rare. Pathologists examine the biopsy & attempt to determine
the diagnosis (malignant or benign). As you can imagine, this is not a trivial
conclusion. Benign breast tumors are not dangerous as there is no risk of the
abnormal growth spreading to other body parts. If a benign tumor is large
enough, surgery might be needed to remove it. On the other hand, a malignant
tumor requires medical intervention. The level of treatment depends on a
number of factors but most likely will require surgery, which can be followed by
radiation and/ or chemotherapy. Therefore, the implications of a misdiagnosis can
be extensive. A false positive for malignancy can lead to costly & unnecessary
treatment, subjecting the patient to a tremendous emotional & physical burden.
On the other hand, a false negative can deny a patient the treatment that they
need, causing the cancer to spread & leading to premature death. Early treatment
intervention in breast cancer patients can greatly improve their survival.
Our task is to develop the best possible diagnostic machine learning algorithm
in order to assist the patient's medical team in determining whether the tumor is
malignant or not.
Dr. William H. Wolberg obtained the breast cancer database from the University of
Wisconsin Hospitals, Madison. The biopsies of breast tumours for 699 patients
up to 15 July 1992 were assessed, with each of nine attributes scored on a
scale of 1 to 10, and the outcome also known. There are 699 rows and 11
columns. This data frame contains the following columns:
ID - sample code number (not unique)
V1 - clump thickness
V2 - uniformity of cell size
V3 - uniformity of cell shape
V4 - marginal adhesion
V5 - single epithelial cell size
V6 - bare nuclei (16 values are missing)
V7 - bland chromatin
V8 - normal nucleoli
V9 - mitoses
class - "benign" or "malignant"
1. Ensure that the packages & libraries are installed. The data
frame is available in the R MASS package under the biopsy
name
install.packages("MASS") #Breast & Pima Indian data
library(MASS)
install.packages("reshape2") #Plotting the data
library(reshape2)
install.packages("ggplot2") #Plotting the data
library(ggplot2)
install.packages("corrplot") #Graphically represent correlation
library(corrplot)
install.packages("rpart") #Recursive partitioning trees (used in step 7)
library(rpart)
install.packages("partykit") # Tree Plots
library(partykit)
classify_biopsy_1 = biopsy
str(classify_biopsy_1) #Examine the underlying structure of the data
Interpretation
• An examination of the data structure shows that our features are integers & the
outcome is a factor. No transformation of the data to a different structure is needed
• Depending on the package that we are using to analyse data, the outcome needs to
be numeric, which is 0 or 1

2. We now get rid of the ID column


classify_biopsy_1$ID=NULL #Delete column ID
View(classify_biopsy_1)
Rename the variables & confirm that the code has worked as intended
names(classify_biopsy_1)=c('Thickness','u.size','u.shape','adhesion','s.size','nuclei','chromatin','normal_nucleoli','mitoses','class')
names(classify_biopsy_1)
3. Delete the missing observations. As there are only 16 observations
with missing data, it is safe to get rid of them as they account for only
about 2 per cent of the total observations. In deleting these observations, a
new working data frame is created with the na.omit() function
classify_biopsy_2 = na.omit(classify_biopsy_1)
4. There are a number of ways through which we can understand the
data visually in a classification problem
• Boxplots are a simple way to understand the distribution of data at
a glance
• ggplot2 and lattice packages are effective ways to present data
visually
• Use ggplot2 with the additional package reshape2. After loading
the packages, create a data frame using the melt() function which
allows creation of a matrix of boxplots, allowing us to easily
conduct the visual inspection
biop.m=melt(classify_biopsy_2, id.var='class')
ggplot(data=biop.m, aes(x=class, y=value)) + geom_boxplot() +
facet_wrap(~variable,ncol=3)
Interpretation
• The white boxes span the lower & upper quartiles of the data
• The dark line cutting across the box is the median value
• Lines extending from the boxes (the whiskers) terminate at the maximum & minimum
values, outliers notwithstanding. The black dots constitute the outliers
• By inspecting the plot & applying some judgement, it is safe to assume that the nuclei feature
(v6) will be important given the separation of the median values & corresponding distributions.
Conversely, there appears to be little separation of the mitoses (v9) feature by class & it will likely
be an irrelevant feature
5. With all our features quantitative, we can also do a correlation
analysis using corrplot package
bc = cor(classify_biopsy_2[,1:9])
corrplot.mixed(bc)

Interpretation
• The correlation coefficients indicate collinearity (collinearity, also referred to as
multicollinearity, generally occurs when there are high correlations between 2 or
more predictor variables; in other words, one predictor variable can be used to predict
another). In particular, the features uniform shape & uniform size are highly correlated
6. The final task in data preparation will be the creation of our train &
test datasets
• In machine learning we should not be concerned with how well we can
predict the current observations, but should focus more on how well we can
predict the observations that were not used in order to create the
algorithm
• So, we create & select the best algorithm using the training data that
maximizes our predictions on the test set
• There are a number of ways to proportionally split our data into train & test
sets: 50/50, 60/40, 70/30, 80/20 & so forth. The data split you select should
be based on your experience & judgement
• In this case, let's use a 70/30 split
set.seed(123) #Random number generator
index_sample = sample(2, nrow(classify_biopsy_2), replace=TRUE,
prob=c(0.7,0.3))
biop.train = classify_biopsy_2[index_sample==1,] #Training data set
biop.test = classify_biopsy_2[index_sample==2,] #Test data set
str(biop.test[,10]) #Confirm the structure of Test data set
7. Create the tree & examine the table for the optimal number of splits
set.seed(123) #Random number generator
tree.biop = rpart(class~.,data=biop.train)
print(tree.biop$cptable)
cp=min(tree.biop$cptable[3,]) #xerror is minimum in row 3 i.e., cptable[3,]
8. Prune the tree, plot the full & prune trees
prune.tree.biop=prune(tree.biop,cp=cp)
plot(as.party(tree.biop))
plot(as.party(prune.tree.biop))

Interpretation
• The cross-validation error is at a minimum with only 2 splits (row 3)
• Examination of the tree plots shows that the uniformity of the cell size is the
first split, then nuclei
• The full tree had an additional split at the cell thickness
9. Check how it performs on test dataset
Predict the test observations using type="class" in the predict() function
rparty.test=predict(prune.tree.biop, newdata=biop.test, type="class")
dim(biop.test)
table(rparty.test, biop.test$class)
10. Calculate the accuracy of prediction
(136+64)/209
[1] 0.9569378

Interpretation
• The basic tree with just 2 splits gets us almost 96 per cent accuracy
• This still falls short of the 97.6 per cent achieved with logistic regression
• It encourages us to believe that we can improve on this with the upcoming
methods, starting with random forests
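Rather than typing the counts by hand, the accuracy can also be read straight off the confusion matrix from step 9; a small sketch (conf_mat is an illustrative name):
conf_mat = table(rparty.test, biop.test$class)  # confusion matrix of predicted vs actual
sum(diag(conf_mat)) / sum(conf_mat)             # correct classifications / total observations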
Random Forest Classification
• A random forest is an ensemble learning approach to supervised learning
• Multiple predictive models are developed, & the results are aggregated to
improve classification rates
• The algorithm for a random forest involves sampling cases & variables to
create a large number of decision trees. Each case is classified by each
decision tree. The most common classification for that case is then used as
the outcome
• Assume that N is the number of cases in the training sample &
M is the number of variables. Algorithm is as follows:
1. Grow a large number of decision trees by sampling N cases with replacement from
the training set
2. Sample m < M at each node. These variables are considered candidates for
splitting in that node. The value m is the same for each node
3. Grow each tree fully without pruning
4. Terminal nodes are assigned to a class based on the mode of cases in that node
5. Classify new cases by sending them down all the trees & taking a vote – majority
rules
• An out-of-bag (oob) error estimate is obtained by classifying the
cases that are not selected when building a tree
• Random forests also provide a measure of variable importance
• Random forests are grown using the randomForest() function in
the randomForest package
• The default number of trees is 500, the default number of
variables sampled at each node is sqrt(M), & the minimum node
size is 1
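As a hedged illustration of those defaults, the call below writes them out explicitly; train_df is a placeholder data frame with a factor outcome named class, not an object from the case study
p = ncol(train_df) - 1                       # number of candidate predictor variables (M)
rf_fit = randomForest(class ~ ., data = train_df,
                      ntree = 500,           # default number of trees
                      mtry = floor(sqrt(p)), # default variables sampled at each split (classification)
                      nodesize = 1)          # default minimum size of terminal nodes (classification)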
1. Ensure that the packages & libraries are installed. The data
frame is available in the R MASS package under the biopsy
name
install.packages("MASS") #Breast & Pima Indian data
library(MASS)
install.packages("randomForest") #Create random forest
library(randomForest)
classify_biopsy_rf1 = biopsy
str(classify_biopsy_rf1) #Examine the underlying structure of the data
2. We now get rid of the ID column
classify_biopsy_rf1$ID=NULL #Delete column ID
View(classify_biopsy_rf1)
3. Delete the missing observations. As there are only 16 observations
with missing data, it is safe to get rid of them as they account for only
about 2 per cent of the total observations. In deleting these observations, a
new working data frame is created with the na.omit() function
classify_biopsy_rf2 = na.omit(classify_biopsy_rf1)
4. The final task in data preparation will be the creation of our train &
test datasets
• In machine learning we should not be concerned with how well we can predict the
current observations, but should focus more on how well we can predict the
observations that were not used in order to create the algorithm
• So, we create & select the best algorithm using the training data that
maximizes our predictions on the test set
• There are a number of ways to proportionally split our data into train & test sets:
50/50, 60/40, 70/30, 80/20 & so forth. The data split you select should be based on
your experience & judgement
• In this case, let's use a 70/30 split
set.seed(123) #Random number generator
index_sample = sample(2, nrow(classify_biopsy_rf2), replace=TRUE,
prob=c(0.7,0.3))
rf.biop.train = classify_biopsy_rf2[index_sample==1,] #Training data set
rf.biop.test = classify_biopsy_rf2[index_sample==2,] #Test data set
str(rf.biop.test[,10]) #Confirm the structure of Test data set
5. Create the random forest
set.seed(123) #Random number generator
rf.biop = randomForest(class~., data=rf.biop.train)
print(rf.biop)

Interpretation
• The OOB error rate is 3.38%
• This is considering all the 500 trees factored into the analysis with 3
variables at each split
6. Plot the Error by trees
plot(rf.biop)

Interpretation
• The plot shows that the minimum error & standard error are at their lowest with
relatively few trees
7. Let’s now pull the exact number of trees with minimum error &
standard error using which.min() function
which.min(rf.biop$err.rate[,1]) #OOB error rate is the first column of err.rate

8. Only 125 trees are needed to optimize the model accuracy


set.seed(123)
rf.biop.2 = randomForest(class~., data=rf.biop.train, ntree=125)
print(rf.biop.2)
9. Check how it performs on test dataset
We can predict the test observations using type="response" in the predict()
function
pred_rf.biop.test = predict(rf.biop.2, newdata=rf.biop.test, type="response")
table(pred_rf.biop.test, rf.biop.test$class)
10. Calculate the accuracy of prediction
(138+67)/209

Interpretation
• The train set error is below 3 per cent
• The model performs even better on the test set, where only 4
observations out of 209 were misclassified & none were false positives
11. Let's have a look at the variable importance plot
varImpPlot(rf.biop.2)

Interpretation
• This plot shows each variable's contribution to the mean decrease in the Gini index
• This is rather different from the splits of the single tree
• Building random forests is a potentially powerful technique that is useful not only for
prediction but also for feature selection
Classification – Logistic Regression
• Logistic regression belongs to a class of models known as Generalized
Linear Models (GLMs)
• Binary Logistic Regression is useful when we are predicting a binary
outcome from a set of predictors x. The predictors can be continuous,
categorical or a mix of both
• Logistic regression is a method for fitting a regression curve, y = f(x), when
y consists of proportions or probabilities, or binary coded (0,1--
failure, success) data
• The categorical variable y, in general, can assume different values
• The logistic curve is an S-shaped or sigmoid
curve
• Logistic regression fits b0 and b1, the regression
coefficients (which are 0 and 1, respectively, for
the graph depicted)
• This curve is not linear
• Logit transform makes it linear logit(y) = b0 + b1x
• Hence, logistic regression is linear regression on
the logit transform of y, where y is the proportion
(or probability) of success at each value of x
• Avoid fitting a traditional least-squares
regression, as neither the normality nor the
homoscedasticity assumption is satisfied
• The basic syntax for the glm() function is
glm(formula, data, family)
where formula is the symbol presenting the relationship between the variables,
data is the data set giving the values of these variables, & family is the R object
used to specify the details of the model; its value is binomial for logistic regression
• The logistic function is given by
y = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))
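A minimal base-R sketch of this relationship, assuming purely illustrative coefficients b0 and b1 (not estimated from any data in these slides):
b0 = -3; b1 = 0.5                            # hypothetical coefficients
x = seq(-5, 15, by = 0.5)
y = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))    # S-shaped (sigmoid) probabilities
plot(x, y, type = "l")                       # the logistic curve
plot(x, log(y / (1 - y)), type = "l")        # logit(y) = b0 + b1*x, a straight line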
CASE STUDY
Biopsy Data on Breast Cancer Patients
Dr. William H. Wolberg from the University of Wisconsin commissioned the
Wisconsin Breast Cancer Data in 1990. His goal of collecting the data was to
identify whether a tumor biopsy was malignant or benign. His team collected the
samples using Fine Needle Aspiration (FNA). If a physician identifies the tumor
through examination or imaging of an area of abnormal tissue, then the next step is
to collect a biopsy. FNA is a relatively safe method of collecting the tissue &
complications are rare. Pathologists examine the biopsy & attempt to determine
the diagnosis (malignant or benign). As you can imagine, this is not a trivial
conclusion. Benign breast tumors are not dangerous as there is no risk of the
abnormal growth spreading to other body parts. If a benign tumor is large
enough, surgery might be needed to remove it. On the other hand, a malignant
tumor requires medical intervention. The level of treatment depends on a
number of factors but most likely will require surgery, which can be followed by
radiation and/ or chemotherapy. Therefore, the implications of a misdiagnosis can
be extensive. A false positive for malignancy can lead to costly & unnecessary
treatment, subjecting the patient to a tremendous emotional & physical burden.
On the other hand, a false negative can deny a patient the treatment that they
need, causing the cancer to spread & leading to premature death. Early treatment
intervention in breast cancer patients can greatly improve their survival.
Our task is to develop the best possible diagnostic machine learning algorithm
in order to assist the patient's medical team in determining whether the tumor is
malignant or not.
Dr. William H. Wolberg obtained the breast cancer database from the University of
Wisconsin Hospitals, Madison. The biopsies of breast tumours for 699 patients
up to 15 July 1992 were assessed, with each of nine attributes scored on a
scale of 1 to 10, and the outcome also known. There are 699 rows and 11
columns. This data frame contains the following columns:
ID - sample code number (not unique)
V1 - clump thickness
V2 - uniformity of cell size
V3 - uniformity of cell shape
V4 - marginal adhesion
V5 - single epithelial cell size
V6 - bare nuclei (16 values are missing)
V7 - bland chromatin
V8 - normal nucleoli
V9 - mitoses
class - "benign" or "malignant"
Business Understanding
The goal of collecting the data is to identify whether a tumor
biopsy was malignant or benign. Implications of a misdiagnosis
can be extensive. A false positive for malignancy can lead to
costly and unnecessary treatment, subjecting the patient to a
tremendous emotional and physical burden. On the other hand, a
false negative can deny a patient the treatment that they need,
causing the cancer to spread and leading to premature death.
Early treatment intervention in breast cancer patients can greatly
improve their survival.
Our task here is to develop the best possible diagnostic machine
learning algorithm in order to assist the patient’s medical team in
determining whether the tumor is malignant or not
1. Ensure that the packages & libraries are installed. The data frame is
available in the R MASS package under the biopsy name
install.packages("MASS") #Breast & Pima Indian data
library(MASS)
install.packages("car") #VIF Statistics
library(car)
2. Make a copy of the original dataset. Get rid of ID column
lr_biopsy=biopsy
lr_biopsy$ID = NULL #Delete column ID
3. Delete the missing observations. As there are only 16 observations with
missing data, it is safe to get rid of them as they account for only about 2 per cent of
the total observations. In deleting these observations, a new working data
frame is created with the na.omit() function
logreg_biopsy = na.omit(lr_biopsy)
names(logreg_biopsy)=c('Thickness','u.size','u.shape','adhesion','s.size','nuclei','chromatin','normal_nucleoli','mitoses','class')
names(logreg_biopsy)
4. The final task in data preparation will be the creation of our train & test
datasets
index_lr = sample(2, nrow(logreg_biopsy), replace=TRUE, prob=c(0.7,0.3))
train = logreg_biopsy[index_lr==1,] #Training data set
test = logreg_biopsy[index_lr==2,] #Test data set
5. For implementing Logistic Regression, R installation comes with the glm()
function that fits the generalized linear models, with family=binomial
set.seed(123) #Random number generator
lr.fit = glm(class~., family=binomial, data=train)
6. The summary() function allows us to inspect the coefficients & their p-values
summary(lr.fit)
Interpretation
• The higher the absolute value of the z-statistic, the more likely it is that this particular
feature is significantly related to our output variable
• From the model summary, we see that Thickness (3.280, 0.001039) & nuclei
(3.344, 0.000826) are the strongest predictors for breast cancer. Both thickness &
nuclei are statistically significant
• P-values next to the z-statistic express this as a probability
• A number of input features have relatively high p-values, indicating that they are
probably not good indicators of breast cancer in the presence of the other features
• The logistic regression coefficients give the change in the log odds of the outcome for
a one unit increase in the predictor variable
• With class coded as benign = 0 & malignant = 1, for every one unit change in
Thickness the log odds of malignancy increase by 0.5252
• For a one unit increase in nuclei, the log odds of malignancy increase by 0.4057

7. The confint() function obtains confidence intervals for the coefficient estimates at
the 95 per cent confidence level
confint(lr.fit)
8. We cannot interpret the coefficients of logistic regression as the change in Y
based on a one-unit change in X. We can, however, exponentiate the coefficients
and interpret them as odds ratios
exp(coef(lr.fit))
Interpretation
• If the value is greater than 1, it indicates that as the feature increases, the odds of
the outcome increase. Conversely, a value less than 1 means that as the feature
increases, the odds of the outcome decrease
• Here, all the features except u.size increase the odds

9. To put it all in one table, we use cbind to bind the coefficients and confidence
intervals column-wise
options(scipen=999) #Remove scientific notation
exp(cbind(OR=coef(lr.fit), confint(lr.fit)))

Interpretation
For a one unit increase in Thickness, the odds of the tumor being malignant
increase by a factor of 1.69, & for a one unit increase in nuclei, the odds
increase by a factor of 1.50
10. Multicollinearity, a potential issue noted during data exploration, is checked with
the vif() function
vif(lr.fit)
Interpretation: - None of the values are greater than the VIF rule-of-thumb value of 5,
so collinearity does not seem to be a problem
NOTE
• Collinearity, or excessive correlation among explanatory variables, can complicate or
prevent the identification of an optimal set of explanatory variables for a statistical model
• A simple approach to identify collinearity among explanatory variables is the use of
variance inflation factors (VIF)
• A VIF is calculated for each explanatory variable and those with high values are removed.
The definition of 'high' is somewhat arbitrary but values in the range of 5-10 are commonly used

11. We can also use predicted probabilities to understand the model. Predicted
probabilities can be computed for both categorical and continuous predictor
variables. train$prob tells R that we want to create a new variable in the dataset
#Create a vector of predicted probabilities
train$prob = predict(lr.fit, type="response")
print(train$prob) #Inspect the predicted probabilities
12. The contrasts() function allows us to confirm the coding assigned to
the variable class (benign as 0 & malignant as 1)
contrasts(train$class)

13. Create a meaningful table of the fit model as a confusion matrix
• We will need to produce a vector that codes the predicted probabilities as either
benign or malignant
• Using the rep() function, a vector is created with all the values set to benign & a
total of 474 observations, which matches the number in the training set
• Then, we will code all the values as malignant where the predicted probability
was greater than 50 per cent
dim(train)
train$predict = rep("benign",474)
print(train$predict)
train$predict[train$prob>0.5]="malignant"
print(train$predict)
14. The table() produces our confusion matrix
table(train$predict, train$class)
Rows signify the predictions & columns signify the actual values. Diagonal
elements are the correct classifications. Top right value 7, is the no. of false
negatives & the bottom left value, 8, is the number of false positives
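As a hedged sketch, these cells and the accuracy can also be pulled out of the table programmatically (conf_train is an illustrative name; the row/column labels follow the coding confirmed in step 12):
conf_train = table(train$predict, train$class)  # rows = predicted, columns = actual
false_neg = conf_train["benign", "malignant"]   # predicted benign, actually malignant
false_pos = conf_train["malignant", "benign"]   # predicted malignant, actually benign
accuracy  = sum(diag(conf_train)) / sum(conf_train)
c(false_neg = false_neg, false_pos = false_pos, accuracy = accuracy)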

15. The mean() function shows us what percentage of the observations were
predicted correctly
mean(train$predict == train$class)
Interpretation - It seems that we have done a fairly good job, with almost a 97
per cent prediction rate on the training set
16. Confusion matrix for test set is similar to the training set
test$probs = predict(lr.fit, newdata=test, type="response")
dim(test)
test$predict = rep('benign',209)
print(test$predict)
test$predict[test$probs>0.5]='malignant'
print(test$predict)
table(test$predict,test$class) #Confusion Matrix
mean(test$predict==test$class)

Interpretation
• It appears that we have done pretty well in creating a model with all the features
• A roughly 98 per cent prediction rate is quite impressive
Estimate the probability of a patient suffering from malignant breast
cancer with a score for Thickness (V1) as 12 and nuclei (V6) as 12
• Apply the function glm() that describes class by Thickness (V1) and nuclei (V6)
lr = glm(class~V1+V6, data=lr_biopsy, family=binomial)
• We then wrap the test parameters inside a data frame
predictdata=data.frame(V1=12,V6=12) #Thickness code = 12 and nuclei code = 12
• Now we apply the function predict() to the generalized linear model lr along with
predictdata. We will have to select the response prediction type in order to obtain the
predicted probability.
predict(lr,predictdata,type="response")

Interpretation
• For a patient with a V1 score of 12 and a V6
score of 12, the predicted probability of the
tumor being malignant (the class coded as 1) is
about 99.9 per cent
• For a patient with a V1 score of 5 and a V6
score of 7, the predicted probability of
malignancy is about 91 per cent
mtcars (Motor Trend Car Road Tests) Dataset
The data was extracted from the 1974 Motor Trend US magazine,
and comprises fuel consumption and 10 aspects of automobile
design and performance for 32 automobiles (1973–74 models).
The description of the 11 numeric variables with the 32
observations in the data frame are as follows:
1. [,1] mpg – Miles/ (US) gallon
2. [,2] cyl – Number of Cylinders
3. [,3] disp – Displacement (cu.in.)
4. [,4] hp – Gross Horsepower
5. [,5] drat – Rear Axle Ratio
6. [,6] wt – Weight (1000 lbs)
7. [,7] qsec – ¼ Mile Time
8. [,8] vs – Engine (0 = V-shaped, 1 = straight)
9. [,9] am –Transmission (0=automatic, 1=manual)
10.[,10] gear – Number of Forward Gears
11.[,11] carb – Number of Carburetors
Prepare a managerial report.
In the mtcars dataset, the transmission mode (automatic or manual) is
described by the column am, which is a binary value (0 or 1). Let's
create a logistic regression model between the column vs and 2 other
columns - wt and disp (multiple predictors)
#Select some columns from mtcars
sample=subset(mtcars, select=c(mpg, vs, am, cyl, wt, hp, disp))
print(head(sample))
vsdata = glm(formula= vs ~ wt + disp, data=sample, family=binomial(link="logit"))
print(summary(vsdata))


Interpretation
• weight influences vs positively, while displacement has a slightly negative effect
• We also see that the coefficient of weight is non-significant (p > 0.05), while the coefficient
of displacement is significant
• The estimates (coefficients of the predictors weight and displacement) are now in units
called logits
• The logistic regression coefficients give the change in the log odds of the outcome for a
one unit increase in the predictor variable
• For every one unit change in wt, the log odds of vs increases by 1.62635 and for every one
unit change in disp, the log odds of vs decrease by 0.03443

• exp(coef(vsdata))
Now we can say that for a one unit increase in wt, the odds of vs increase by a factor
of 5.085, while for a one unit increase in disp, the odds of vs change by a factor of 0.966
(i.e., they decrease slightly)
• The data and logistic regression model can be plotted with ggplot2 or base graphics:
install.packages("ggplot2")
library(ggplot2)
ggplot(vsdata, aes(x=wt, y=vs)) + geom_point() + stat_smooth(method="glm",
method.args=list(family="binomial"), se=FALSE)
par(mar = c(4, 4, 1, 1)) # Reduce some of the margins so that the plot fits better
• Estimate the probability of a vehicle being fitted with a manual
transmission if it has a 120hp engine and weighs 2800 lbs
• Apply the function glm() that describes the transmission type (am) by the horsepower (hp)
and weight (wt)
amglm=glm(formula=am~hp+wt, data=sample, family=binomial)
• We then wrap the test parameters inside a data frame
predictdata = data.frame(hp=120, wt=2.8)
• Now we apply the function predict() to the generalized linear model amglm along with
predictdata. We will have to select response prediction type in order to obtain the predicted
probability.
predict(amglm, predictdata, type="response")
• Interpretation - For an automobile with 120hp engine and 2800 lbs weight, the probability
of it being fitted with a manual transmission is about 64%.
• We want to create a model that helps us to predict the probability of a
vehicle having a V engine or a straight engine given a weight of 2100
lbs and engine displacement of 180 cubic inches
• model <- glm(formula= vs ~ wt + disp, data=mtcars, family=binomial)
• summary(model)
• Interpretation - We see from the estimates of the coefficients that weight influences vs
positively, while displacement has a slightly negative effect
• newdata = data.frame(wt = 2.1, disp = 180)
• predict(model, newdata, type="response")
• Interpretation - The predicted probability is 0.24
Classification/ Dimension Reduction – Linear Discriminant
Analysis (LDA)
• Discriminant analysis is also known as Fisher Discriminant Analysis (FDA)
• It can be an effective alternative to logistic regression when the classes are well-
separated
• Linear Discriminant analysis (LDA) is a multivariate classification (and dimension
reduction) technique that separates objects into two or more mutually exclusive
groups based on measurable features of those objects. The measurable features are
sometimes called predictors or independent variables, while the classification group is
the response or what is being predicted
• It is an appropriate technique when the dependent variable is categorical (nominal or
non-metric) and the independent variables are metric. The single dependent variable
can have two, three or more categories
• DA uses data where the classes are known beforehand to create a model
that may be used to predict future observations
• Types of Discriminant Analysis
1. Linear Discriminant Analysis (LDA) - assumes that the covariance of the
independent variables is equal across all classes
2. Quadratic Discriminant Analysis (QDA) - does not assume equal covariance
across the classes
• Both LDA and QDA require
• Number of independent variables should be less than the sample size
• Assumes multivariate normality among the independent variables, i.e.,
independent variables come from a normal (or Gaussian) distribution
• DA uses Bayes' theorem in order to determine the probability of class
membership for each observation
• Fit a linear discriminant analysis with the function lda()
• Advantage: - Elegantly simple
• Limitation: - Assumption that the observations of each class are said to have
a multivariate normal distribution & there is a common covariance across the
classes
• The process of attaining the posterior probabilities goes through
following steps:
• Collect data with a known class membership
• Calculate the prior probabilities – this represents the proportion of the sample that
belongs to each class
• Calculate the mean for each feature by their class
• Calculate the variance-covariance matrix for each feature. For Linear
Discriminant Analysis (LDA) this is a pooled matrix of all the classes, giving
us a linear classifier; for Quadratic Discriminant Analysis (QDA) a separate
variance-covariance matrix is created for each class
• Estimate the normal distribution for each class
• Compute the discriminant function that is the rule for the classification of a new
object
• Assign an observation to a class based on the discriminant function
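These posterior probabilities can be inspected directly from a fitted model; a hedged sketch where fit, train_df and the factor column class are placeholders rather than objects from the case study that follows
library(MASS)
fit = lda(class ~ ., data = train_df)   # hypothetical training data frame
pred = predict(fit)
head(pred$class)      # predicted class for each observation
head(pred$posterior)  # posterior probability of each class (one column per class)
head(pred$x)          # linear discriminant score(s)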
CASE STUDY
Biopsy Data on Breast Cancer Patients
Dr. William H. Wolberg from the University of Wisconsin commissioned the
Wisconsin Breast Cancer Data in 1990. His goal of collecting the data was to
identify whether a tumor biopsy was malignant or benign. His team collected the
samples using Fine Needle Aspiration (FNA). If a physician identifies the tumor
through examination or imaging of an area of abnormal tissue, then the next step is
to collect a biopsy. FNA is a relatively safe method of collecting the tissue &
complications are rare. Pathologists examine the biopsy & attempt to determine
the diagnosis (malignant or benign). As you can imagine, this is not a trivial
conclusion. Benign breast tumors are not dangerous as there is no risk of the
abnormal growth spreading to other body parts. If a benign tumor is large
enough, surgery might be needed to remove it. On the other hand, a malignant
tumor requires medical intervention. The level of treatment depends on a
number of factors but most likely will require surgery, which can be followed by
radiation and/ or chemotherapy. Therefore, the implications of a misdiagnosis can
be extensive. A false positive for malignancy can lead to costly & unnecessary
treatment, subjecting the patient to a tremendous emotional & physical burden.
On the other hand, a false negative can deny a patient the treatment that they
need, causing the cancer to spread & leading to premature death. Early treatment
intervention in breast cancer patients can greatly improve their survival.
Our task is to develop the best possible diagnostic machine learning algorithm
in order to assist the patient's medical team in determining whether the tumor is
malignant or not.
Dr. William H. Wolberg obtained the breast cancer database from the University of
Wisconsin Hospitals, Madison. The biopsies of breast tumours for 699 patients
up to 15 July 1992 were assessed, with each of nine attributes scored on a
scale of 1 to 10, and the outcome also known. There are 699 rows and 11
columns. This data frame contains the following columns:
ID - sample code number (not unique)
V1 - clump thickness
V2 - uniformity of cell size
V3 - uniformity of cell shape
V4 - marginal adhesion
V5 - single epithelial cell size
V6 - bare nuclei (16 values are missing)
V7 - bland chromatin
V8 - normal nucleoli
V9 - mitoses
class - "benign" or "malignant"
1. Ensure that the packages & libraries are installed. The data frame is
available in the R MASS package under the biopsy name
install.packages("MASS") #Breast & Pima Indian data
library(MASS)
install.packages("car") #VIF Statistics
library(car)
da_biopsy=biopsy
str(da_biopsy) #Examine the underlying structure of the data
2. We now get rid of the ID column
da_biopsy$ID = NULL #Delete column ID
View(da_biopsy)
3. Rename the variables & confirm that the code has worked as intended
names(da_biopsy)=c('Thickness','u.size','u.shape','adhesion','s.size','nuclei','chromatin','normal_nucleoli','mitoses','class')
names(da_biopsy)
4. Delete the missing observations. As there are only 16 observations with
missing data, it is safe to get rid of them as they account for only about 2 per cent of
the total observations. In deleting these observations, a new working data
frame is created with the na.omit() function
lda_biopsy = na.omit(da_biopsy) #Delete observations with missing data
5. The final task in data preparation will be the creation of our train & test
datasets
• In machine learning we should not be concerned with how well we can predict the
current observations, but should focus more on how well we can predict the
observations that were not used in order to create the algorithm
• So, we create & select the best algorithm using the training data that
maximizes our predictions on the test set
• There are a number of ways to proportionally split our data into train & test sets:
50/50, 60/40, 70/30, 80/20 & so forth. The data split you select should be based on
your experience & judgement
• In this case, let's use a 70/30 split
set.seed(123) #Random number generator
index_lda = sample(2, nrow(lda_biopsy), replace=TRUE, prob=c(0.7,0.3))
lda.train = lda_biopsy[index_lda==1,] #Training data set
lda.test = lda_biopsy[index_lda==2,] #Test data set
dim(lda.train)
dim(lda.test)
6. R installation comes with the lda() function that performs Linear
Discriminant Analysis
lda.fit = lda(class~., data=lda.train)
lda.fit
Interpretation
• Prior probabilities of groups are approximately 64 per cent for benign & 36 per cent for
malignancy
• Group means is the average of each feature by their class
• Coefficients of linear discriminants are the standardized linear combination of the features that
are used to determine an observation's discriminant score
• The higher the score, the more likely the classification is malignant
7. The plot() function will provide us with a histogram and/ or the densities of
the discriminant scores
plot(lda.fit, type="both")
Interpretation: - We see that there is some overlap in the groups, indicating that
there will be some incorrectly classified observations
8. The predict() function available with LDA provides a list of 3 elements
(class, posterior, & x).
• The class element is the prediction of benign or malignant
• posterior is the probability score of x being in each class
• x is the linear discriminant score
lda.predict1= predict(lda.fit) #Training dataset prediction
lda.train$lda = lda.predict1$class
table(lda.train$lda, lda.train$class)
mean(lda.train$lda==lda.train$class)
Interpretation: - We see that the LDA model has performed much worse than the logistic
regression model on the training dataset
lda.predict2= predict(lda.fit, newdata=lda.test) #Test dataset prediction
lda.test$lda = lda.predict2$class
table(lda.test$lda, lda.test$class)
mean(lda.test$lda==lda.test$class)
Interpretation: - Better performance than on the training dataset. Still did not perform as well
as the logistic regression (96% against 98% with logistic regression)
iris Dataset (In-built)
Anderson collected & measured hundreds of irises in an effort to study variation
between & among the different species. There are 260 species of iris. This data
set focuses on three of them (Iris setosa, Iris virginica and Iris versicolor). Four
features were measured on 50 samples of each species: sepal width, sepal
length, petal width, and petal length. Anderson published it in "The irises of the
Gaspe Peninsula", which originally inspired Fisher to develop LDA
• iris is a data frame with 150 cases (rows) and 5 variables (columns) named
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
• iris3 gives the same data arranged as a 3-dimensional array of size 50 by 4
by 3, as represented by S-PLUS. The first dimension gives the case number
within the species subsample, the second the measurements with
names Sepal L., Sepal W., Petal L., and Petal W., and the third the species.
1. Ensure that the packages & libraries are installed. The iris data frame is
built into R; the lda() function is available in the MASS package
install.packages("MASS") #lda function
library(MASS)
install.packages("klaR") #Classification & Visualization functions
library(klaR)
da_iris=iris
View(da_iris)
2. Data preparation will be the creation of our train & test datasets
• In machine learning we should not be concerned with how well we can predict the
current observations, but should focus more on how well we can predict the
observations that were not used in order to create the algorithm
• So, we create & select the best algorithm using the training data that
maximizes our predictions on the test set
• There are a number of ways to proportionally split our data into train & test sets:
50/50, 60/40, 70/30, 80/20 & so forth. The data split you select should be based on
your experience & judgement
• In this case, let's use a 60/40 split (prob=c(0.6,0.4) below)
set.seed(123) #Random number generator
ind = sample(2, nrow(da_iris), replace=TRUE, prob=c(0.6,0.4))
train = da_iris[ind==1,] #Training data set
test = da_iris[ind==2,] #Test data set
3. R installation comes with the lda() function (in the MASS package) that performs
Linear Discriminant Analysis
lda_fit = lda(Species~., data=train)
lda_fit
Interpretation
• Prior probabilities of groups show πi, the probability of randomly selecting an observation from
class i out of the total training set
• Because there are 50 observations of each species in the original data set (150 observations
in total), we know that the prior probabilities should be close to 33.3% for each class
• Group means μi shows the mean value of each of the independent variables for each class i
• The Coefficients of linear discriminants are the coefficients for each discriminant
• Linear discriminant (LD1) is the linear combination:
(0.36∗Sepal.Length)+(2.22∗Sepal.Width)+(−1.78∗Petal.Length)+(−3.97∗Petal.Width)
• The Proportions of trace describes the proportion of between-class variance that is explained by
successive discriminant functions. As you can see LD1 explains 99% of the variance
4. Plot the data using the basic plot function plot()
plot(lda_fit, col = as.integer(train$Species))
You can see that there are three distinct groups with some overlap between
virginica and versicolor
Plot the observations illustrating the separation between groups as well as
overlapping areas that are potential for mix-ups when predicting classes
plot(lda_fit, dimen = 1, type = "b")
• Using the partimat function from the klaR package provides an alternate way to plot
the linear discriminant functions
partimat(Species ~ ., data=train, method="lda")
• Partimat() outputs an array of plots for every combination of two variables. Think of
each plot as a different view of the same data. Colored regions delineate each
classification area. Any observation that falls within a region is predicted to be from a
specific class. Each plot also includes the apparent error rate for that view of the data
5. Next let’s evaluate the prediction accuracy of our model. First we’ll run the
model against the training set to verify the model fit by using the command
predict. The table output is a confusion matrix with the actual species as the
row labels and the predicted species at the column labels.
lda.pred1= predict(lda_fit) #Training dataset prediction
train$lda = lda.pred1$class
table(train$lda, train$Species)
mean(train$lda==train$Species)
Interpretation: - The total number of correctly predicted observations is the sum of
the diagonal. So this model fit the training data correctly for almost every
observation. Verifying the training set doesn’t prove accuracy, but a poor fit to the
training data could be a sign that the model isn’t a good one. Now let’s run our test
set against this model to determine its accuracy.
lda.pred2= predict(lda_fit, newdata = test) #Test dataset prediction
test$lda = lda.pred2$class
table(test$lda, test$Species)
mean(test$lda==test$Species)
When Sepal.Length=5.1,Sepal.Width=3.5,Petal.Length=1.4, Petal.Width=0.2,
what will be the class of this flower?
testdata=data.frame(Sepal.Length=5.1,Sepal.Width=3.5,Petal.Length=1.4,
Petal.Width=0.2)
predict(lda_fit, testdata)$class
Interpretation
• Based on the earlier plots, it makes sense that a few iris versicolor and iris virginica
observations may be mis-categorized
• Overall the model performs very well with the testing set with an accuracy of 96.7%
• For Sepal.Length=5.1,Sepal.Width=3.5,Petal.Length=1.4, Petal.Width=0.2, Setosa is
the class of the new flower
Dr.Shaheen M.Sc. (CS), Ph. D. (CSE)
Institute of Public Enterprise, Shamirpet Campus, Hyderabad – 500 101
Email: [email protected] Mobile: + (91)98666 66620
• Segmentation is a fundamental
requirement when dealing with
customer background/ sales
transactions, & similar data
• Appropriate segmentation of customer
data can provide insights to enhance a
company's performance by identifying
the most valuable customers or the
customers that are likely to leave the
company

• Cluster: - A set of points in a dataset that are similar to each other but are
dissimilar to other points in the dataset
[Illustration: the same data points grouped into 2 clusters and into 3 clusters]
Objectives of this section
1. What is clustering?
2. How do K-means & Hierarchical Clustering work?
3. Write code on real data.


• Unsupervised Machine Learning Methods: - Focus on revealing the
hidden structure of unlabelled data. Unsupervised learning focuses on 2
main areas: Clustering and Dimension Reduction
• Clustering: - Technique used to group similar objects together in the same group
(cluster). Clustering analysis does not use any label information, but simply uses
the similarity between data features to group them into clusters
• Dimension Reduction: - Technique that focuses on removing irrelevant &
redundant data to reduce the computational cost & avoid over-fitting. Dimension
reduction can be divided into 2 parts – feature extraction & feature selection
• The most popular & powerful supervised/ unsupervised learning techniques
include: Hierarchical clustering, k-Means, k-Nearest Neighbor (kNN), Support Vector
Machine (SVM), and multi-layer feed-forward Neural Networks (NNs)
• Supervised techniques - kNN, SVM, and NN
• Unsupervised clustering techniques – Hierarchical and k-Means

• 2 most popular approaches are: k-means clustering & Hierarchical clustering


To perform a cluster analysis in R, generally, the data should be
prepared as follows:
1. Rows are observations (individuals) and columns are variables
2. Any missing value in the data must be removed or estimated using
na.omit() function
3. The data must be standardized (i.e., scaled) using scale() function to
make variables comparable. Recall that, standardization consists of
transforming the variables such that they have mean zero and standard
deviation as one

Steps for Implementing Cluster Analysis


1. Choose appropriate attributes – Select variables important for
identifying & understanding differences among groups of observations
within the data
Example:- In a study of depression, we may want to assess one or more of
the following: psychological & physical symptoms, age, duration, timings of
episodes, no. of hospitalizations, social & work history, gender, socio-
economic status, family medical history, & response to previous treatments
2. Scale the data – Scale the data, as the variables with the largest range
will have the greatest impact on the results, which is undesirable
• The most popular approach is to standardize each variable to a mean of 0 & a standard
deviation of 1
• Other approaches include dividing each variable by its maximum value
• Or subtracting the variable's mean & dividing by the variable's median absolute
deviation
df1=apply(mydata, 2, function(x) {(x-mean(x))/sd(x)})
df2=apply(mydata, 2, function(x) {x/max(x)})
df3=apply(mydata, 2, function(x) {(x-mean(x))/mad(x)})
Note: - We will use the scale() function to standardize the variables to a mean of 0 &
standard deviation of 1, equivalent to snippet df1
3. Screen for outliers – Many clustering techniques are sensitive to outliers.
• Screen & remove uni-variate outliers using functions from the outliers package
• mvoutlier package contains functions that are used to identify multivariate outliers

4. Calculate distances – The most popular measure of the distance
between 2 observations is the Euclidean distance (Manhattan,
Canberra, asymmetric binary, maximum & Minkowski measures are
also available)
5. Select a clustering algorithm – First select hierarchical/ partitioning
approach & then select specific clustering algorithm
• Hierarchical Clustering is useful for smaller problems (150 observations or less) &
where nested hierarchy of groupings is desired
• Partitioning Method can handle much larger problems but requires no. of clusters
to be specified in advance
6. Obtain one or more cluster solutions – Uses the method (s) selected in
step 5
7. Determine the number of clusters present – Decide how many clusters
are present in the data. Many approaches have been proposed for
extracting clusters. The NbClust() function in the NbClust package
provides 30 different indices to help you make this decision (see the sketch after these steps)
8. Obtain a final clustering algorithm – Once no. of clusters has been
determined, a final clustering is performed to extract that no. of subgroups
9. Visualize the results – Results of hierarchical clustering are usually
represented as a dendrogram. Partitioning results are typically visualized
using a bi-variate cluster plot
10. Interpret the clusters – Once a cluster solution has been obtained,
interpret the clusters. It is accomplished by obtaining summary statistics for
each variable by cluster
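A hedged sketch of the NbClust() call mentioned in step 7, assuming a scaled numeric data frame named scaled_df (a placeholder):
install.packages("NbClust")
library(NbClust)
nb = NbClust(scaled_df, distance = "euclidean", min.nc = 2, max.nc = 10,
             method = "kmeans", index = "all")  # evaluates many cluster-validity indices
nb$Best.nc  # number of clusters recommended by each index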
Hierarchical Clustering (HC) - Hierarchical clustering can be divided
into two main types: agglomerative & divisive
• Agglomerative clustering: Also known as AGNES (Agglomerative Nesting).
Works in a bottom-up manner i.e., each object is initially considered as a
single-element cluster (leaf). At each step of the algorithm, the two clusters
that are the most similar are combined into a new bigger cluster (nodes). This
procedure is iterated until all points are member of just one single big cluster.
The result is a tree which can be plotted as a dendrogram.
• Divisive hierarchical clustering: Also known as DIANA (Divisive Analysis).
Works in a top-down manner. The algorithm is an inverse order of AGNES. It
begins with the root, in which all objects are included in a single cluster. At
each step of iteration, the most heterogeneous cluster is divided into two. The
process is iterated until all objects are in their own cluster
• Note: - Agglomerative clustering is good at identifying small clusters. Divisive
hierarchical clustering is good at identifying large clusters
• We measure the (dis)similarity of observations using distance measures (i.e.,
Euclidean distance, Manhattan distance, etc.). In R, the Euclidean distance is
used by default to measure the dissimilarity between each pair of
observations. The distance matrix decides which clusters to merge/split
• Clustering of the data objects is obtained by cutting the dendrogram at the
desired level, then each connected component forms a cluster
Agglomerative Approach (bottom-up)
Initialization: Each object is a cluster
Iteration: Merge the two clusters which are most similar to each other, until all
objects are merged into a single cluster
Divisive Approach (top-down)
Initialization: All objects stay in one cluster
Iteration: Select a cluster and split it into two sub-clusters, until each leaf cluster
contains only one object
[Illustration: objects a, b, c, d, e successively merged (a,b -> ab; d,e -> de; c,de -> cde;
ab,cde -> abcde) in the agglomerative direction, and split in the opposite order in the
divisive direction]
• In addition to the distance measure, we need to specify the linkage between
the groups of observations
• How do we measure the dissimilarity between two clusters of observations? A
number of different cluster agglomeration methods (i.e., linkage methods)
have been developed to answer this question (a short sketch comparing them in R
follows this list). The most common methods are:
1. Maximum or complete linkage clustering: It computes all pairwise dissimilarities
between the elements in cluster 1 and the elements in cluster 2, and considers the
largest value (i.e., maximum value) of these dissimilarities as the distance between
the two clusters. It tends to produce more compact clusters
2. Minimum or single linkage clustering: It computes all pairwise dissimilarities
between the elements in cluster 1 and the elements in cluster 2, and considers the
smallest of these dissimilarities as a linkage criterion. It tends to produce long,
“loose” clusters
3. Mean or average linkage clustering: It computes all pairwise dissimilarities between
the elements in cluster 1 and the elements in cluster 2, and considers the average
of these dissimilarities as the distance between the two clusters
4. Centroid linkage clustering: It computes the dissimilarity between the centroid for
cluster 1 (a mean vector of length p variables) and the centroid for cluster 2
5. Ward’s minimum variance method: It minimizes the total within-cluster variance. At
each step the pair of clusters with minimum between-cluster distance are merged
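A hedged sketch of how these linkage choices are passed to hclust(); here d is a Euclidean distance matrix computed on a scaled copy of the built-in USArrests data (used again in the worked example later):
d = dist(scale(USArrests))                          # dissimilarity matrix
for (m in c("single", "complete", "average", "centroid", "ward.D2")) {
  hc = hclust(d, method = m)                        # same data, different linkage criterion
  plot(hc, main = paste("Linkage:", m), cex = 0.6)  # compare the resulting dendrograms
}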
[Illustrations: linkage methods of clustering - single linkage (minimum distance),
complete linkage (maximum distance), average linkage (average distance), centroid
distance & Ward's procedure between Cluster 1 and Cluster 2; example groupings and
dendrograms of six points under single-link, complete-link, average-link and
centroid-distance clustering]
• There are different functions available in R for computing hierarchical
clustering. The commonly used functions are:
• hclust [in stats package] and agnes [in cluster package] for agglomerative
hierarchical clustering (HC)
• diana [in cluster package] for divisive HC

• The height of the fusion, provided on the vertical axis, indicates the
(dis)similarity between two observations. The higher the height of the fusion,
the less similar the observations are
• The height of the cut to the dendrogram controls the number of clusters
obtained. In order to identify sub-groups (i.e. clusters), we can cut the
dendrogram with cutree() function

• Advantages/ Limitations
• Hierarchical structures are informative but are not suitable for large
datasets
• Algorithm imposes hierarchical structure on data, even when it is not
appropriate
• Crucial question - How many clusters are to be considered?
• The complexity of hierarchical clustering is higher than k-means
Agglomerative Hierarchical Clustering (AHC)
1. We can perform AHC with the hclust() function in R
2. First we compute the dissimilarity values with the dist() function & then feed
these values into hclust(), specifying the agglomeration method to be used
(i.e. "complete", "average", "single", "ward.D")
3. We can then plot the dendrogram
USArrests Dataset (In-built)
USArrests dataset in R contains statistics, in arrests per 100,000
residents for assault, murder, and rape in each of the 50 US states in
1973. It includes also the percent of the population living in urban areas.
It’s a data frame which contains 50 observations on 4 variables:
[,1] Murder - numeric - Murder arrests (per 100,000)
[,2] Assault - numeric - Assault arrests (per 100,000)
[,3] UrbanPop – numeric - Percent urban population
[,4] Rape - numeric - Rape arrests (per 100,000)
1. Ensure that the packages & libraries are installed
install.packages("cluster") #agnes() function
library(cluster)
install.packages("tidyverse") #data manipulation
library(tidyverse)
install.packages("factoextra") #clustering visualization
library(factoextra)
install.packages("dendextend") #for comparing two dendrograms
library(dendextend)
install.packages("purrr") #for map_dbl() function
library(purrr)
install.packages("dplyr") #for mutate() function
library(dplyr)
install.packages(“ape") #for mutate() function
library(ape)
2. Examine the underlying structure of the data and display descriptive
statistics of dataset
US_hc=USArrests #Make a copy of the original dataset
summary(US_hc) #Descriptive Statistics
3. Remove any missing value using na.omit() that might be present in
the data. Standardize/ scale the data using the R function scale()
US_hc=na.omit(US_hc) #Omit Missing Values
US_hc=scale(US_hc) #Normalize independent variables
4. Compute the dissimilarity values with dist() function. Feed these
values into hclust() as input & specify the agglomeration method to
be used (i.e. “complete”, “average”, “single”, “ward.D”)
d = dist(US_hc, method = "euclidean") # Dissimilarity matrix
hc1 = hclust(d, method="complete") #Complete Linkage Method for HC
plot(hc1, cex = 0.6, hang = -1) #Plot the obtained dendrogram
plot(as.phylo(hc1), type = "fan", cex = 1)
5. Alternatively, we can use agnes() function. This function behaves
similarly; however, we can also get the agglomerative coefficient,
which measures the amount of clustering structure found (values
closer to 1 suggest strong clustering structure)
hc2=agnes(US_hc, method = "complete") #Compute agglomerative clustering
hc2$ac #Agglomerative coefficient
6. To find which hierarchical linkage method can identify stronger
clustering structures
m = c( "average", "single", "complete", "ward")
names(m) = c( "average", "single", "complete", "ward")
# Function to compute coefficient
ac = function(x) { agnes(US_hc, method = x)$ac }
map_dbl(m, ac)
Interpretation: - Here we see that Ward’s method identifies the strongest clustering
structure of the four methods assessed
7. Visualize the dendrogram with Ward's method
hc3 = agnes(US_hc, method="ward") #Ward's Linkage Method for HC
plot(hc3, cex = 0.6, hang = -1, main = "Dendrogram of Agnes") #Plot
8. Visualize the dendrogram with Divisive Clustering method
hc4 = diana(US_hc) #Compute divisive HC using diana() - cluster package
hc4$dc # Divisive coefficient; amount of clustering structure found
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana") #Plot
plot(as.phylo(hc4), type = "fan", cex = 1)
9. Identify the sub-groups by cutting the dendrogram with cutree()
function
hc5 = hclust(d, method = "ward.D2" ) # Ward's method
sub_grp = cutree(hc5, k = 4) # Cut tree into 4 groups
table(sub_grp) #Number of members in each cluster
rownames(US_hc)[sub_grp == 1] #Get the names for the members of cluster 1
rownames(US_hc)[sub_grp == 2] #Get the names for the members of cluster 2
rownames(US_hc)[sub_grp == 3] #Get the names for the members of cluster 3
rownames(US_hc)[sub_grp == 4] #Get the names for the members of cluster 4
10. We can add the cutree output to our original data
USArrests %>% mutate(cluster = sub_grp) %>% head
11. It’s also possible to draw the dendrogram with a border around the 4
clusters. The parameter border is used to specify the border colors
for the rectangles
plot(hc5, cex = 0.6)
rect.hclust(hc5, k = 4, border = 2:5)
K-Means Clustering
It classifies the input data points into k
number of clusters based on their inherent
distance from each other. The principle is to
minimize the sum of squares of the
distances between data & the
corresponding cluster centroids
K-Means Clustering Algorithm
Input: Set of N items & number k of centroids
Output: k clusters
1. Place k points into the space represented by the
items that are being clustered. These points
represent initial cluster centroids
2. Assign each object to the cluster that has the
closest centroid
3. When all objects have been assigned, recalculate
the positions of the k centroids
4. Repeat Steps 2 and 3 until the centroids no longer
change
Note: - Different initial positions of centroids yield
different final clusters
• k-means cluster analysis, is available in R through the kmeans()
function
• How many Clusters (k)?
One should choose a number of clusters so that adding another cluster
doesn’t give much better modeling of the data. More precisely, if one plots
the percentage of variance explained by the clusters against the number of
clusters, the first clusters will add much information (explain a lot of
variance), but at some point the marginal gain will drop, giving an angle in
the graph. The number of clusters is chosen at this point, hence the “elbow
criterion”. The number of clusters for our case is clearly 3
• Additionally, the kmeans() function returns some components that let us know how
compact each cluster is and how different the clusters are among themselves.
These can be inspected directly on a fitted kmeans object, as sketched after this list.
1. betweenss. The between-cluster sum of squares. In an optimal segmentation,
one expects this value to be as high as possible, since we would like to have
heterogeneous clusters
2. withinss. Vector of within-cluster sum of squares, one component per cluster. In
an optimal segmentation, one expects these values to be as low as possible for
each cluster, since we would like to have homogeneity within the clusters
3. tot.withinss. Total within-cluster sum of squares
4. totss. The total sum of squares
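As a minimal sketch (our addition, using the built-in USArrests data that appears
elsewhere in these slides), these components can be read directly off the object
returned by kmeans():
km = kmeans(scale(USArrests), centers = 3) #Fit k-means on standardized data
km$totss #Total sum of squares
km$withinss #Within-cluster sum of squares, one value per cluster
km$tot.withinss #Total within-cluster sum of squares (= sum(km$withinss))
km$betweenss #Between-cluster sum of squares (= totss - tot.withinss)
km$betweenss/km$totss #Share of total variation explained by the clustering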
Advantages
• k-means technique is fast
• Doesn't require calculating all of the distances between each observation and
every other observation
• Efficiently deals with very large data sets
Disadvantages
• If the data is rearranged, a different solution is generated every time (see the
note below)
• This procedure fails if you don't know exactly how many clusters you should
have in the first place
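Note: - A common way to make the k-means solution reproducible and less sensitive
to the random initial centroids (a suggestion of ours, not part of the original slides) is
to fix the seed and request several random starts via the nstart argument; kmeans()
then keeps the run with the lowest total within-cluster sum of squares:
set.seed(123) #Fix the random starting centroids
km = kmeans(scale(USArrests), centers = 3, nstart = 25) #Keep the best of 25 random starts
km$tot.withinss #The retained solution minimizes this value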
wine Dataset (In-built)
The dataset contains 13 chemical measurements (Alcohol, Malic acid, Ash,
Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoids
phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines
and Proline) and one variable Class, the label, for the cultivar or plant variety,
on 178 Italian wine samples. This data is the result of chemical analysis of
wines grown in the same region in Italy but derived from three different
cultivars.
Using Wine dataset cluster different types of wines.
1. Ensure that the packages & libraries are installed. The
library rattle is loaded in order to use the dataset wine
install.packages("rattle") #For wine dataset
library(rattle)
install.packages("cluster") #For plotting data
library(cluster)
data(wine, package='rattle')
head(wine)
str(wine) #Examine the underlying structure of the data
wine_clust=wine #Make a copy of the dataset wine

2. Remove any missing value using na.omit() that might be present in the data.
Standardize/ scale the data using the R function scale()
wine_clust=na.omit(wine_clust) #Omit Missing Values
wine_scale = scale(wine_clust[,-1]) #Normalize independent variables (drop the Class label)
3. In order to determine the number of clusters, plot the percentage of variance
explained by the clusters against the number of clusters. Initially additional clusters
add information, but after some point the marginal gain drops, giving an angle in
the graph (elbow criterion)
wssplot = function(data, nc=15, seed=1234)
{
wss = (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc)
{
set.seed(seed)
wss[i] = sum(kmeans(data, centers=i)$withinss)
}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
}
wssplot(wine_scale, nc=6)

4. Apply the kmeans() function to perform k-means clustering
kmeans_fit = kmeans(wine_scale, 3) # No. of clusters - k = 3
5. Plot the cluster solution as a 2 dimensional graph
clusplot(wine_scale, kmeans_fit$cluster, main='2D representation of the Cluster solution',
color=TRUE, shade=TRUE, labels=2, lines=0)
6. To evaluate the clustering performance we build a confusion matrix
table(wine[,1],kmeans_fit$cluster)
How to choose the right number of expected clusters (k)? The wss (within sum of
squares) is drawn according to the number of clusters. The location of a bend (knee)
in the plot is generally considered as an indicator of the appropriate number of clusters.
install.packages("factoextra") #fviz_nbclust() function
library(factoextra)
install.packages("cluster")
library(cluster)
install.packages("purrr")
library(purrr)
data(USArrests)
df = na.omit(USArrests) #Delete Missing Values
df <- scale(df) #Standardize the variables
#Function fviz_nbclust() is in factoextra package
fviz_nbclust(df, kmeans, method="wss") + geom_vline(xintercept=4, linetype=2)
set.seed(123)
km.res = kmeans(df, 4)
print(km.res)
#Compute the mean of each of the variables in the clusters
aggregate(USArrests, by=list(cluster=km.res$cluster), mean)
fviz_cluster(km.res, data = df)
#Cluster Analysis - wine Dataset
#K-means Clustering Method
install.packages("factoextra") #fviz_nbclust() function
library(factoextra)
kwineclustering=wine[,-1]
kwineclustering=na.omit(kwineclustering)
kwineclustering=scale(kwineclustering)
fviz_nbclust(kwineclustering, kmeans, method="wss")
fviz_nbclust(kwineclustering, kmeans, method="wss") +
geom_vline(xintercept=5, linetype=2)
set.seed(123)
kmeansres=kmeans(kwineclustering,4)
print(kmeansres)
#Compute the mean of each of the variables in the clusters
aggregate(kwineclustering, by=list(cluster=kmeansres$cluster), mean)
fviz_cluster(kmeansres, data = kwineclustering)
Principal Component Analysis (PCA)
• Cluster analysis provides us with the groupings of similar observations. PCA
used for dimension reduction improves the understanding of our data by
grouping the correlated variables
• Many datasets have variables that are highly correlated with each other
• PCA is a data-reduction technique that transforms larger number of
correlated variables into a much smaller set of uncorrelated variables called
principal components
• PCA is particularly powerful in dealing with multi-collinearity and variables
that outnumber the sample size (n)
• PCA is an unsupervised learning technique (means that it is performed
on a set of variables X1, X2, …, Xp with no associated response variable
Y) used to reduce dimensions of data with minimum loss of information
• PCA is a method of extracting important variables (in form of
components) from a large set of variables available in a dataset, thereby
providing better visualization
• PCA is the process of finding the principal components
• A component is a normalized linear combination of the features
• First principal component in a dataset is the linear combination that
captures the maximum variance in the data
• Second component is created by selecting another linear combination
that maximizes the variance with the constraints that its direction is
perpendicular to the first component
• Subsequent components (equal to the no. of variables) would follow the
same rule
• Applications: - Face recognition, image compression etc.
• In base R, PCA can be performed with the princomp() or prcomp() functions; the
psych package provides the principal() function used in the worked examples later
in this unit (a quick base-R sketch follows)
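As a quick base-R sketch (our illustration on the built-in USArrests data; it is not
part of the original worked examples):
pr = prcomp(USArrests, scale. = TRUE) #scale. = TRUE standardizes the variables
summary(pr) #Standard deviation & proportion of variance of each component
pr$rotation #Loadings - one column per principal component
head(pr$x) #Principal component scores for each observation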
Explain how dimensions/ principal components are found?
• First principal component of the data set X1,X2,...,Xp is the linear combination
of the features Z1 = ϕ11X1 + ϕ21X2 + ... + ϕp1Xp that has the largest variance,
where ϕ1 is the first principal component loading vector, with
elements ϕ11, ϕ21, …, ϕp1. The ϕ are normalized
• After the first principal components Z1 features are determined, we can find
the second principal component Z2
• Second principal component is the linear combination of X1,X2,...,Xp that has
maximal variance of all linear combinations that are uncorrelated with Z1.
The second principal component scores z12,z22,…,zn2 take the form
Z2=ϕ12X1+ϕ22X2+...+ϕp2Xp
• This proceeds until all principal components are computed
• To calculate loadings, we must find the ϕ vector that maximizes the variance.
From linear algebra, eigenvector corresponding to the largest eigenvalue of
the covariance matrix is the set of loadings that explains the greatest
proportion of the variability
• To calculate principal components
• cov() - Calculates the covariance matrix
• eigen() - Calculate eigenvalues of the matrix, that contains both ordered
eigenvalues ($values) and the corresponding eigenvector matrix ($vectors)
The goal of PCA is to explain most of the variability in the data with a
smaller number of variables than the original data set
1. Prepare the data – Standardize each variable
2. Select PCA model if it is a better fit for your research goals
3. Decide how many components
4. Extract the components
5. Rotate the components
6. Interpret the results
7. Compute component scores

Selecting the number of components to extract - Several criteria are available for
deciding how many components to retain in a PCA. They include:
1. Basing the number of components on prior experience and theory
2. Selecting the number of components needed to account for some threshold
cumulative amount of variance in the variables (for example, 80 percent)
3. Selecting the number of components to retain by examining eigenvalues of
the k x k correlation matrix among the variables
• Most common approach is based on the eigenvalues. Each component is
associated with an eigenvalue. The first PC is associated with largest
eigenvalue, second PC with second-largest eigenvalue, and so on
• Kaiser–Harris criterion suggests retaining components with eigenvalues > 1.
Components with eigenvalues < 1 explain less variance
• Cattell Scree test plots eigenvalues against their component numbers. Plots
typically demonstrate a bend/ elbow, & the components above this sharp break
are retained
• Finally, you can run simulations, extracting eigenvalues from random data
matrices of the same size as the original matrix. If an eigenvalue based on real
data is larger than the average corresponding eigenvalues from a set of random
data matrices, that component is retained
• You can assess all three eigenvalue criteria at the same time via the
fa.parallel() function
Extracting Principal Components - principal() function will perform PCA
starting with either raw data matrix/ a correlation matrix
principal(r, nfactors=, rotate=, scores=)
where r is a correlation matrix or a raw data matrix
nfactors specifies no. of principal components to extract (1 by default)
rotate indicates rotation to be applied (varimax by default)
scores specifies whether or not to calculate principal component scores (false
by default)
Rotating Principal Components - Whenever two or more components
have been extracted, you can rotate the solution to make it more
interpretable
• Rotations are a set of mathematical techniques for transforming the
component loading matrix into one that’s more interpretable
• Rotation methods differ with regard to whether the resulting components
remain uncorrelated (orthogonal rotation) or are allowed to correlate (oblique
rotation)
• The most popular orthogonal rotation is the varimax rotation, which attempts
to purify the columns of the loading matrix, so that each component is
defined by a limited set of variables (that is, each column has a few large
loadings and many very small loadings)

Obtaining Principal Component Scores
• The principal component scores are saved in the scores element of the
object returned by the principal() function when the option scores=TRUE
Harman23.cor Dataset (In-built)
The dataset Harman23.cor contains data on 8 body
measurements for 305 girls aged between 7 and 17. In this case,
the dataset consists of the correlations among the variables rather
than the original data
1. Ensure that the packages & libraries are installed
install.packages("psych") #fa.parallel() function
library(psych)
2. Examine the underlying structure of the data and display descriptive
statistics of dataset
Harman_pca=Harman23.cor #Make a copy of the original dataset
3. You need to identify the correlation matrix (the cov component of the
Harman23.cor object) and specify the sample size (n.obs)
fa.parallel(Harman_pca$cov, n.obs=302, fa="pc", n.iter=100,
show.legend = FALSE, main="Scree plot with parallel analysis")
Interpretation: - You can see from the plot presented in the next slide that
a two-component (2) solution is suggested
4. Extract the 2 principal components
pc = principal(Harman_pca$cov, nfactors=2, rotate="none")
pc
Interpretation: - On examining PC1 & PC2 columns, we observe that the first
component accounts for 58 % variance in the physical measurements, while the
second component accounts for 22 %. Together, the two components account for
81 percent of the variance
5. Rotate principal components using popular orthogonal rotation i.e., varimax
rotation method
rc = principal(Harman23.cor$cov, nfactors=2, rotate="varimax")
rc
Interpretation: - The column names change from PC to RC to denote
rotated components. Looking at column RC1, we see that the first
component is primarily defined by the first four variables (length variables).
The loadings in the column RC2 indicate that the second component is
primarily defined by variables 5 through 8 (volume variables). Note that the
two components are still uncorrelated and that together, they still explain
the variables equally well.
6. Obtain principal component scores. The principal() function makes
it easy to obtain scores for each participant on this derived variable
rc = principal(Harman23.cor$cov, nfactors=2, rotate="varimax")
round(unclass(rc$weights), 2)
The component scores are obtained using the formulas
PC1 = 0.28*height + 0.30*arm.span + 0.30*forearm + 0.29*lower.leg -
0.06*weight - 0.08*bitro.diameter - 0.10*chest.girth - 0.04*chest.width
and
PC2 = -0.05*height - 0.08*arm.span - 0.09*forearm - 0.06*lower.leg
+0.33*weight + 0.32*bitro.diameter + 0.34*chest.girth + 0.27*chest.width
Interpretation
These equations assume that the physical measurements have been
standardized (mean=0, sd=1). Note that the weights for PC1 tend to be
around 0.3 or 0. The same is true for PC2. As a practical matter, you
could simplify your approach further by taking the first composite
variable as the mean of the standardized scores for the first four
variables. Similarly, you could define the second composite variable as
the mean of the standardized scores for the second four variables
USJudgeRatings Dataset
(In-built)
USJudgeRatings contains
lawyers’ ratings of state
judges in the US Superior
Court. The data frame
contains 43 observations on
12 numeric variables. The
variables are listed in
adjacent table
1. Can you summarize the 11
evaluative ratings (INTG to
RTEN) with a smaller
number of composite
variables?
2. If so, how many will you
need and how will they be
defined?
1. The data are in raw score format and there are no missing values.
Therefore, your next decision is deciding how many principal
components
2. Ensure that the packages & libraries are installed
install.packages("psych") #fa.parallel() function
library(psych)
3. Examine the underlying structure of the data and display descriptive
statistics of dataset
data(USJudgeRatings)
USJR_pca=USJudgeRatings[,-1] #Copy of the dataset without the CONT variable (column 1)
summary(USJR_pca) #Descriptive Statistics
4. Assess three eigenvalue criteria at the same time via the
fa.parallel() function. For the 11 ratings (dropping the CONT
variable), the necessary code is as follows:
fa.parallel(USJR_pca, fa="pc", n.iter=100, show.legend=FALSE,
main="Scree plot with parallel analysis")
Interpretation: - The plot in the next slide displays the scree test based on the
observed eigenvalues. A single component is appropriate for summarizing this
dataset as there is only one eigenvalue > 1 (i.e., horizontal line at y=1)
5. Extract the first principal component
pc = principal(USJR_pca, nfactors=1)
pc
Interpretation: - Here we input raw data without the CONT variable. Column PC1
contains the component loadings, which are the correlations of the observed
variables with the principal component(s). We can see that each variable correlates
highly with the first component (PC1). It therefore appears to be a general
evaluative dimension. Column h2 contains the component communalities—the
amount of variance in each variable explained by the components. u2 column
contains the component uniquenesses, the amount of variance not accounted for by
the components (or 1–h2). Row SS loadings containing eigenvalues associated
with the components are the standardized variance associated with a particular
component (in this case, the value for the first component is 10.13). Row
Proportion Var represents the amount of variance accounted for by each
component. Here you see that the first principal component accounts for 92 percent
of the variance in the 11 variables
6. Rotate principal components using popular orthogonal rotation i.e., varimax
rotation method
USJR.cov = cov(USJR_pca) #Covariance matrix
rc = principal(USJR.cov, nfactors=1, rotate="varimax")
rc
plot(rc)
7. Obtain principal component scores. The principal() function makes
it easy to obtain scores for each participant on this derived variable
pc = principal(USJR_pca, nfactors=1, scores=TRUE)
head(pc$scores)
Interpretation: - The principal component scores are saved in the scores
element of the object returned by the principal() function when the option
scores=TRUE.
USArrests Dataset (In-built)
USArrests dataset in R contains statistics, in arrests per 100,000
residents for assault, murder, and rape in each of the 50 US states in
1973. It includes also the percent of the population living in urban areas.
It’s a data frame which contains 50 observations on 4 variables:
[,1] Murder - numeric - Murder arrests (per 100,000)
[,2] Assault - numeric - Assault arrests (per 100,000)
[,3] UrbanPop – numeric - Percent urban population
[,4] Rape - numeric - Rape arrests (per 100,000)
1. Ensure that the packages & libraries are installed
install.packages("tidyverse") #data manipulation & visualization
library(tidyverse)
install.packages("gridExtra") #Plot arrangement
library(gridExtra)
install.packages("dendextend") #for comparing two dendrograms
library(dendextend)
install.packages("purrr") #for map_dbl() function
library(purrr)
install.packages("dplyr") #for mutate() function
library(dplyr)
install.packages("ape") #for as.phylo() function
library(ape)
2. Examine the underlying structure of the data and display descriptive
statistics of dataset
data(USArrests)
US_pca=USArrests #Make a copy of the original dataset
apply(US_pca,2,var)
summary(US_pca) #Descriptive Statistics
3. It is usually beneficial for each variable to be centered at zero for
PCA, as it makes comparing each principal component to the mean
straightforward. Standardizing each variable will fix this issue
apply(USArrests, 2, var) #Compute variance for each variable (column)
US_pca= apply(USArrests, 2, scale) #Normalize data
head(US_pca)
summary(US_pca) #Descriptive Statistics
4. Calculate principal components. Start with the cov() function to
calculate the covariance matrix, followed by the eigen() function to
calculate the eigenvalues of the matrix
US.cov = cov(US_pca) #Covariance matrix
US.eigen = eigen(US.cov) #Eigen values of the matrix
str(US.eigen) #Ordered eigenvalues & eigenvector matrix
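Continuing this sketch (our addition, following directly from the eigen() output above),
the proportion of variance explained (PVE) by each component is its eigenvalue
divided by the sum of all eigenvalues:
PVE = US.eigen$values / sum(US.eigen$values) #Proportion of variance explained
round(PVE, 3)
cumsum(round(PVE, 3)) #Cumulative proportion - helps decide how many PCs to keep
US.eigen$vectors #Columns are the principal component loading vectors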
Text Mining
• According to Merrill Lynch, 80% - 90% of business data is unstructured/
semi-structured data
• Gartner estimates that unstructured data constitutes 80% of the whole
enterprise data
[Pie chart: Unstructured data 80%, Semi-structured data 10%, Structured data 10%]
• Anyone seeking to find insights in the
data must develop the capability to
process & analyze text
• Using R, we can extract powerful
information in the textual data
• There are many different methods to use
in text mining
• Assuming data to be available from Twitter, a customer call centre, scraped
off the web, or whatever, & contained in some sort of text file or files
Analyzing Online Research Articles
##https://www.youtube.com/watch?v=BPjgwdqHM8g
install.packages("tm")
library(tm)
options(header=FALSE, stringsAsFactors=FALSE, fileEncoding="latin1")
#Set Directory
setwd("C:/Users/HAPPY/Desktop/University of Kolkatta - 4-5 February 2019")
#Read Text File & build corpus - collection of documents
text = readLines("Machine_Learning_Techniques.txt")
corpus = Corpus(VectorSource(text))
text
inspect(corpus[8])
#Clean the Data
corpus = tm_map(corpus,tolower)
corpus = tm_map(corpus,removePunctuation)
corpus = tm_map(corpus,removeNumbers)
stopwords("english")
cleanset = tm_map(corpus, removeWords,stopwords("english"))
inspect(corpus[8])
inspect(cleanset[8])
#Editing - Post plotting the graph
cleanset = tm_map(cleanset, removeWords, c("fig","per"))
cleanset = tm_map(cleanset, gsub, pattern = "claims", replacement = "claim")
#Whereever words are removed whitespaces are added - remove them
cleanset = tm_map(cleanset, stripWhitespace)
#Needed for tm version 0.6.0 and newer
cleanset = tm_map(cleanset, PlainTextDocument)
#Build document term matrix
dtm = TermDocumentMatrix(cleanset,control = list(minWordLength=c(1,Inf)))
#Inspect frequently repeated words more than 2 times
findFreqTerms(dtm,lowfreq = 2)
#Create bar plot of frequent terms
termFrequency = rowSums(as.matrix(dtm))
termFrequency = subset(termFrequency, termFrequency>=15)
install.packages("ggplot2")
library(ggplot2)
#The bar plot itself is not shown in the original listing; a minimal base-R version:
barplot(termFrequency, las=2, col="grey", ylab="Frequency")
#Building Word Cloud
install.packages("wordcloud")
library(wordcloud)
#Calculate the frequency of words & sort it on frequency in a descending order
m = as.matrix(dtm)
wordFreq = sort(rowSums(m), decreasing = TRUE)
#To make the graphs reproducible
set.seed(375)
grayLevels = gray((wordFreq + 10) / (max(wordFreq) + 10))
#With gray levels
wordcloud(words = names(wordFreq), freq=wordFreq,max.words = 100,min.freq =
5,random.order=F,colors = grayLevels)
#With colors
wordcloud(words = names(wordFreq), freq=wordFreq,max.words = 100,min.freq =
10,random.order=F,colors = rainbow(20))
wordcloud(words = names(wordFreq), freq=wordFreq,max.words = 100,min.freq =
10,random.order=F,colors = brewer.pal(6,"Dark2"))
#Scale function specified the maximum and minimum size of the words in the
word cloud
wordcloud(words = names(wordFreq), scale=c(6,.3),freq=wordFreq,max.words =
100,min.freq = 10,random.order=F,colors = rainbow(20))
#20% of the words will be rotated using the rot.per argument
wordcloud(words = names(wordFreq), rot.per=0.2,
scale=c(6,.3),freq=wordFreq,max.words = 100,min.freq = 10,random.order=F,colors =
rainbow(20))
To explain this concept we will consider President Obama's State of the Union
speeches. We are curious as to what can be uncovered and, in particular, whether
and how his message has changed over time (2010 to 2015).
Perhaps this will serve as a blueprint to analyze any politician's speech in order to
prepare an opposing candidate for a debate/ speech of their own.
The 2 main analytical goals are to build topic models on the 6 State of the Union
speeches & then compare the first speech in 2010 with the most recent speech in
2015 on sentence-based textual measures, such as sentiment & dispersion
Sno | State of the Union Address | Date | URL
1 | First State of the Union Address | January 27, 2010 | https://www.cbsnews.com/news/obamas-state-of-the-union-key-quotes/
2 | Second State of the Union Address | January 25, 2011 | https://www.cbsnews.com/news/state-of-the-union-2011-jobs-and-more-jobs/
3 | Third State of the Union Address | January 24, 2012 | https://www.theguardian.com/world/2012/jan/25/state-of-the-union-address-full-text
4 | Fourth State of the Union Address | February 12, 2013 | https://www.theatlantic.com/politics/archive/2013/02/obamas-2013-state-of-the-union-speech-full-text/273089/
5 | Fifth State of the Union Address | January 28, 2014 | https://www.washingtonpost.com/politics/full-text-of-obamas-2014-state-of-the-union-address/2014/01/28/e0c93358-887f-11e3-a5bd-844629433ba3_story.html?noredirect=on&utm_term=.a9b189cc43e9
6 | Sixth State of the Union Address | January 20, 2015 | http://time.com/3675705/full-text-state-union-2015/
1. Ensure that the necessary packages & respective libraries are
installed
• The primary package that we use is tm – text mining package

• We will also need SnowballC for stemming of the words, RColorBrewer for
color palettes in wordclouds, & the wordcloud package
install.packages("tm")
library(tm)
install.packages("SnowballC")
library(SnowballC)
install.packages("wordcloud")
library(wordcloud)
install.packages("RColorBrewer")
library(RColorBrewer)

Note: - The data files are available for download at
https://github.com/datameister66/data. Please ensure you put the text files
into a separate directory, because they will all go into our corpus for analysis.
2. We should put the 6 files in a folder (Text_Mining) to hold the
documents that will form the corpus by setting the working directory
using setwd().
3. Begin to create the corpus by first creating an object (path) to the speeches &
checking the files for their number & names:
name=file.path("C:/Users/HAPPY/Desktop/Text_Mining") # Create the corpus
length(dir(name))
dir(name)
4. We will create the corpus docs with the VCorpus() function, wrapped around
the DirSource() function, which is also part of the tm package:
docs = VCorpus(DirSource(name))
docs
5. We will now begin the text transformations (Change capital letters to lowercase,
Remove numbers, Remove punctuation, Remove stop words – and, is, not etc.,
Remove excess whitespace, Word stemming – family & families - famili, Word
replacement – management & leadership)using the tm_map() function from the
tm package:
docs=tm_map(docs,tolower)
docs=tm_map(docs,removeNumbers)
docs=tm_map(docs,removePunctuation)
docs=tm_map(docs,removeWords, stopwords("english"))
docs=tm_map(docs,stripWhitespace)
docs=tm_map(docs,stemDocument)
6. After completing the transformations and removal of other words, make
sure documents are plain text, put it in a document-term matrix, & check the
dimensions
docs = tm_map(docs, PlainTextDocument)
dtm = DocumentTermMatrix(docs)
dim(dtm)
7. Remove the sparse terms with the removeSparseTerms() function. Specify
a number between 0 & 1 where the higher the number, the higher the
percentage of sparsity in the matrix. So, with 6 documents, by specifying
0.51 as the sparsity number, the resulting matrix would have words that
occurred in at least three documents, as follows:
dtm=removeSparseTerms(dtm,0.51)
dim(dtm)
8. Name the rows of the matrix so that we know which document belongs to
which year. Using the inspect() function, you can examine the matrix (6
rows & 5 columns).
rownames(dtm) = c("2010","2011","2012","2013","2014","2015")
inspect(dtm[1:6, 1:5])
9. Move on to exploring word frequencies by creating an object with the column
sums, sorted in descending order. The default order is ascending, so putting -
in front of freq will change it to descending. Using findFreqTerms(), we can see
what words occurred at least 100 times.
freq=colSums(as.matrix(dtm))
ord=order(-freq) #Descending order - Default ascending order
freq[tail(ord)] #Least frequent terms
freq[head(ord)] #Most frequent terms
findFreqTerms(dtm,100) #Words occurring at least 100 times
10. We can find associations within words (correlation) with the findAssocs()
function using 0.9 as the correlation cutoff as follows:
findAssocs(dtm, "job", corlimit=0.9) #Find associations with the word "job",
#using 0.9 as the correlation cutoff
11. For visual portrayal, we can produce wordclouds and a bar chart. We will do
two wordclouds to show the different ways to produce them: one with a
minimum frequency & the other by specifying the max. no. of words to include.
The first one with minimum frequency also includes code to specify the color,
scale syntax determines min & max word size by frequency. In this case, the
minimum frequency is 50
wordcloud(names(freq), freq, min.freq=50, scale=c(3, .5),colors=brewer.pal(6,
"Dark2")) #visual portrayal can be produced - wordclouds & bar chart
wordcloud(names(freq), freq, max.words=30) #Capturing 30 most frequent words
12. Code below will show you how to produce a bar chart for the 10 most frequent
words in base R:
freq = sort(colSums(as.matrix(dtm)), decreasing=TRUE) #Bar Chart
wf = data.frame(word=names(freq), freq=freq)
wf = wf[1:10,]
barplot(wf$freq, names=wf$word, main="Word Frequency",
xlab="Words",ylab="Counts", ylim=c(0,250))
13. Code We will now move on to the building of topic models using the
topicmodels package, which offers the LDA() function. The question now is how
many topics to create. It seems logical to solve for three or four, so we will try
both. Using the terms() function produces a list of an ordered word frequency
for each topic
set.seed(123)
lda3 = LDA(dtm, k=3, method="Gibbs") #topicmodels package
topics(lda3) #How many topics to create? It seems logical to solve for 3 or 4
lda4 = LDA(dtm, k=4, method="Gibbs") #topicmodels package
topics(lda4)
terms(lda3,10)
terms(lda4,10)
###TEXT MINING

#Install the packages
install.packages("tm")
library(tm)
install.packages("SnowballC")
library(SnowballC)
install.packages("wordcloud")
library(wordcloud)
install.packages("RColorBrewer")
library(RColorBrewer)
install.packages("topicmodels")
library(topicmodels)
name=file.path("C:/Users/HAPPY/Desktop/Text_Mining") # Create the corpus
length(dir(name))
dir(name)
docs=VCorpus(DirSource(name)) #tm package
docs
docs=tm_map(docs,tolower)
docs=tm_map(docs,removeNumbers)
docs=tm_map(docs,removePunctuation)
docs=tm_map(docs,removeWords, stopwords("english"))
docs=tm_map(docs,stripWhitespace)
docs=tm_map(docs,stemDocument)
docs = tm_map(docs, PlainTextDocument)
dtm = DocumentTermMatrix(docs)
dim(dtm)
dtm=removeSparseTerms(dtm,0.51)
dim(dtm)
rownames(dtm)=c("2010","2011","2012","2013","2014","2015")
inspect(dtm[1:6,1:5])
#Word Frequency & topic models
freq=colSums(as.matrix(dtm))
ord=order(-freq) #Descending order - Default ascending order
freq[tail(ord)] #Least frequent terms
freq[head(ord)] #Most frequent terms
findFreqTerms(dtm,100) #Words occurring at least 100 times
findAssocs(dtm, "job", corlimit=0.9) #Find associations with the word "job",
#using 0.9 as the correlation cutoff
wordcloud(names(freq), freq, min.freq=50, scale=c(3, .5),colors=brewer.pal(6,
"Dark2")) #visual portrayal can be produced - wordclouds & bar chart
wordcloud(names(freq), freq, max.words=30) #Capturing 30 most frequent words
freq = sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wf = data.frame(word=names(freq), freq=freq) #Bar Chart
wf = wf[1:10,]
barplot(wf$freq, names=wf$word, main="Word Frequency",
xlab="Words",ylab="Counts", ylim=c(0,250))

set.seed(123)
lda3 = LDA(dtm, k=3, method="Gibbs") #topicmodels package
topics(lda3) #How many topics to create? It seems logical to solve for 3 or 4
lda4 = LDA(dtm, k=4, method="Gibbs") #topicmodels package
topics(lda4)
terms(lda3,10)
terms(lda4,10)
Dr.Shaheen M.Sc. (CS), Ph. D. (CSE)
Institute of Public Enterprise, Shamirpet Campus, Hyderabad – 500 101
Email: [email protected] Mobile: + (91)98666 66620
Unit I - Motivation; Foundations with R; Managing Data with R; Data
Visualization; Linear Algebra and Matrix Computing; Dimensionality
Reduction; Lazy Learning: Classification using Nearest Neighbors;
Probabilistic Learning: Classification using Naïve Bayes; Divide and
Conquer – Classification Using Decision Trees.
Unit II - Forecasting Numeric Data: Regression Models: Simple Linear
Regression, Multiple Linear Regression, Polynomial Regression, Support
Vector Regression (SVR), Decision Tree Regression, and Random Forest
Regression; Classification: Logistic Regression, K-Nearest Neighbours
(K-NN), Support Vector Machine (SVM), Kernel SVM, Naïve Bayes,
Decision Tree Classification, and Random Forest Classification;
Association Rule Learning: Apriori Algorithm; Clustering: K-means and
Hierarchical Clustering.
Unit III - Dimensionality Reduction: Principal Component Analysis, Linear
Discriminant Analysis, Kernel PCS; Why Machine Learning?;
Applications of Machine Learning; Specialized Machine Learning: Data
Formats and Optimization of Computation; Variable/ Feature Selection;
Regularized Linear Modeling and Controlled Variable Selection; Big
Longitudinal Data Analysis; Reinforcement Learning: Thompson
Sampling; Deep Learning; Text Mining and Natural Language
Processing.
Support Vector Machines (SVM)
• Logistic Regression & Discriminant Analysis (Classification
techniques), determined the probability of a predicted
observation (categorical response)
• Now, lets delve into two nonlinear techniques: K-Nearest
Neighbors (KNN) and Support Vector Machines (SVM)
• These methods can be used for continuous outcomes in
addition to classification problems though this section is only
limited to the latter
K-Nearest Neighbours (KNN)
• In previous efforts, models estimated coefficients/ parameters
for each included feature
• KNN has no parameters. So, it is called instance-based
learning. The labeled examples (inputs and corresponding
output labels) are stored & no action is taken until a new input
pattern demands an output value
• This method is commonly called lazy learning as no specific
model parameters are produced
Pima.tr & Pima.te
Datasets

The data, originally collected by the National Institute of Diabetes and
Digestive and Kidney Diseases (NIDDK), is in 2 different data frames
consisting of 532 observations & 8 input features along with a binary
outcome (type - Yes/No).
The patients in this study were of Pima Indian descent from
South Central Arizona. The NIDDK data shows that since past 30
years, research has helped scientists to prove that obesity is a
major risk factor in the development of diabetes. The Pima Indians
were selected for the study as one-half of the adult Pima Indians
have diabetes and 95 percent of those with diabetes are
overweight. The analysis will focus on adult women only. Diabetes
was diagnosed according to the WHO criteria and was of the type
of diabetes that is known as type 2.
Our task is to examine and predict those individuals that have
diabetes or the risk factors that could lead to diabetes in this
population. Data frame contains the following columns:
1. np - number of pregnancies
2. glu - glucose concentration in an oral glucose tolerance test
3. bp - blood pressure (mm Hg)
4. skin – triceps skin fold thickness (mm)
5. bmi – body mass index (weight in kg / (height in m)^2)
6. ped – diabetes pedigree function
7. age - age in years
8. type - Yes or No, for diabetic according to WHO criteria

• Datasets are contained in the R package, MASS.
• Let's combine the datasets Pima.tr & Pima.te into one data frame,
instead of using these as separate train and test sets.
1. Ensure that necessary packages & respective libraries are
loaded
install.packages("class") #k-nearest neighbors
library(class)
install.packages("kknn") #weighted k-nearest neighbors
library(kknn)
install.packages("e1071") #SVM
library(e1071)
install.packages("caret") #Select tuning parameters
library(caret)
install.packages("MASS") #Contains the data
library(MASS)
install.packages("reshape2") #Assist in creating boxplots
library(reshape2)
install.packages("ggplot2") #Create boxplots
library(ggplot2)
install.packages("kernlab") #Assist with SVM feature selection
library(kernlab)
install.packages("pROC")
library(pROC)
2. Load the datasets and check their structure, ensuring that they are the
same
data(Pima.tr) #200 observations on 8 variables
str(Pima.tr)
data(Pima.te) #332 observations on 8 variables
str(Pima.te)
3. Combine the datasets (Pima.tr & Pima.te) into a single data frame using the
rbind() function, as both have similar data structures
pima = rbind(Pima.tr, Pima.te) #Combine (Pima.tr & Pima.te)
str(pima)
4. A boxplot layout using the ggplot() function of the ggplot2 package is a quite
effective way of graphical representation. Specify the data to use (response
variable as x & its value as y), the type of plot, and plot the series in 2 columns
pima.melt = melt(pima, id.var="type")
ggplot(data=pima.melt, aes(x=type, y=value)) + geom_boxplot() +
facet_wrap(~variable, ncol=2) #ggplot2 package
Interpretation: - As expected, the fasting glucose appears to be significantly
higher in the patients currently diagnosed with diabetes
5. Standardize the values using scale() function leaving the response
variable (type) & re-plot. Let’s call the new data frame pima.scale.
Now, we will need to include the response variable (type) in the data
frame, as follows:
pima.scale = as.data.frame(scale(pima[,-8]))
str(pima.scale)
pima.scale$type = pima$type
pima.scale.melt = melt(pima.scale, id.var="type") #Repeat the boxplot
ggplot(data=pima.scale.melt, aes(x=type, y=value)) + geom_boxplot() +
facet_wrap(~variable, ncol=2)
Interpretation: - In addition to glucose, it appears that the other features
may differ by type, in particular, age
Note: - On scaling a data frame, it automatically becomes a matrix. Using
the as.data.frame() function, convert it back to a data frame
6. Check the Correlation between variables except for response
variable (type)
cor(pima.scale[-8])
Interpretation: - There are a couple of correlations above 0.5 to point out:
npreg/age & skin/bmi
7. Check the ratio of Yes and No for response variable (type)
table(pima.scale$type)
set.seed(502)
ind = sample(2, nrow(pima.scale), replace=TRUE, prob=c(0.7,0.3))
train = pima.scale[ind==1,]; test = pima.scale[ind==2,]
str(train); str(test)
Note: - It is important to ensure a balanced split in the data. A good rule of thumb is
at least a 2:1 ratio in the possible outcomes (He and Wa, 2013). The ratio is 2:1 so
we can create the train and test sets with our usual syntax using a 70/30 split
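A quick check (our addition, not in the original slides) that the 70/30 split keeps
roughly the same Yes/No proportions in both sets:
round(prop.table(table(train$type)), 2) #Class proportions in the training set
round(prop.table(table(test$type)), 2) #Class proportions in the test set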

[Boxplots of the features by type: before scaling vs. after scaling]
8. Using caret package identify k. Create a grid of inputs for the
experiment, with k ranging from 2-20 by an increment of 1. This is
easily done with the expand.grid() and seq() functions. Incorporate
cross-validation in the selection of the parameter, creating an object
called control & utilizing the trainControl() function from the caret
package
grid1 = expand.grid(.k=seq(2,20, by=1)) #caret package
control = trainControl(method="cv") #Cross Validation
set.seed(502)
knn.train = train(type~., data=train, method="knn", trControl=control,
tuneGrid=grid1)
knn.train

Interpretation: - Calling the object provides us with the k parameter that we are
seeking, which is k=17. Accuracy tells us the percentage
of observations that the model classified correctly. Kappa statistic is
commonly used to provide a measure of how well can two
evaluators classify an observation correctly
9. The Percent of agreement is the rate that the evaluators agreed on
the class (accuracy) and Percent of chance agreement is the rate
that the evaluators randomly agreed on. The higher the statistic, the
better they performed with the maximum agreement being one. To
do this, we will utilize the knn() function from the class package. With
this function, we will need to specify at least four items. These would
be the train inputs, the test inputs, correct labels from the train set,
and k.
grid1 = expand.grid(.k=seq(2,20, by=1)) #caret package
knn.test = knn(train[,-8], test[,-8], train[,8], k=17) #class package
Let's examine the confusion matrix & calculate accuracy & kappa.
Accuracy is obtained by simply dividing the correctly classified observations by
the total observations
table(knn.test, test$type)
(77+28)/147 #(77+26+16+28)=147
prob.agree = (77+28)/147 # Calculate Kappa - accuracy
prob.chance = ((77+26)/147) * ((77+16)/147); prob.chance
kappa = (prob.agree - prob.chance) / (1 - prob.chance); kappa
Interpretation: - The test-set accuracy works out to roughly 71 percent
((77+28)/147), which is slightly less than the accuracy achieved on the train data alone
Interpretation: - The kappa statistic is 0.49
for the test data set. Altman (1991) provides a
heuristic to assist us in the interpretation of
the statistic. Our kappa statistic is only
moderate, with an accuracy just over 70
percent on the test set
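Note: - Since the caret package is already loaded, an alternative (our suggestion,
not part of the original workflow) is to let confusionMatrix() report accuracy and
kappa in a single call:
confusionMatrix(knn.test, test$type) #caret package - accuracy, kappa & related statistics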

After specifying a random seed, we will create the train set object
with train.kknn(). This function asks for the maximum number of k
values (kmax), the Minkowski distance parameter (1 corresponds to
absolute/Manhattan distance and 2 to Euclidean distance), and the
kernel. For this model, kmax will be set to 25 and distance will be 2
set.seed(123)
kknn.train = train.kknn(type~., data=train, kmax=25, distance=2,
kernel=c("rectangular", "triangular", "epanechnikov"))
A nice feature of the package is the ability to plot and compare the
results, as follows:
plot(kknn.train)
#1. Install the packages
install.packages("class") #k-nearest neighbors
library(class)
install.packages("kknn") #weighted k-nearest neighbors
library(kknn)
install.packages("e1071") #SVM
library(e1071)
install.packages("caret") #Select tuning parameters
library(caret)
install.packages("MASS") #Contains the data
library(MASS)
install.packages("reshape2") #melt() function for the boxplots
library(reshape2)
install.packages("ggplot2") #Create boxplots
library(ggplot2)
#2. Combine the datasets
data(Pima.tr) #200 observations for 8 variables
str(Pima.tr)
data(Pima.te) #332 observations for 8 variables
str(Pima.te)
pima = rbind(Pima.tr, Pima.te) #Combine (Pima.tr & Pima.te)
str(pima)
#3. Plot the graph - Scale the data
pima.melt = melt(pima, id.var="type")
ggplot(data=pima.melt, aes(x=type, y=value)) + geom_boxplot() +
facet_wrap(~variable, ncol=2) #ggplot2 package
pima.scale = as.data.frame(scale(pima[,-8]))
str(pima.scale)
pima.scale$type = pima$type
pima.scale.melt = melt(pima.scale, id.var="type") #Repeat the boxplot
ggplot(data=pima.scale.melt, aes(x=type, y=value)) + geom_boxplot() +
facet_wrap(~variable, ncol=2)

#4. Check for Correlation between variables except response variable (type)
cor(pima.scale[-8])

#5. Check the ratio of Yes and No in our response & create training & test datasets
table(pima.scale$type)
set.seed(502)
ind = sample(2, nrow(pima.scale), replace=TRUE, prob=c(0.7,0.3))
train = pima.scale[ind==1,]
test = pima.scale[ind==2,]
str(train)
str(test)
#6. knn Modelling
grid1 = expand.grid(.k=seq(2,20, by=1))
control = trainControl(method="cv") #caret package
set.seed(502)
knn.train = train(type~., data=train, method="knn",
trControl=control,tuneGrid=grid1)
knn.train
knn.test = knn(train[,-8], test[,-8], train[,8], k=17)
table(knn.test, test$type)
(77+28)/147
#calculate Kappa
prob.agree = (77+28)/147 #accuracy
prob.chance = ((77+26)/147) * ((77+16)/147)
prob.chance
kappa = (prob.agree - prob.chance) / (1 - prob.chance)
kappa

set.seed(123)
#install package kknn
kknn.train = train.kknn(type~., data=train, kmax=25, distance=2,
kernel=c("rectangular", "triangular", "epanechnikov"))
plot(kknn.train)
knn.train
More Classification Techniques – K-Nearest Neighbors &
Support Vector Machines
• K-Nearest Neighbours (KNN) & Support Vector Machines (SVM) are among the
more sophisticated classification techniques covered in this unit
Overview of Association Rule Mining &
Sentiment Analysis
• Text is a vast source of data for business

• Examples: - Earnings announcements, Communications to shareholders,
Press releases, Restaurant reviews, Supreme Court opinions, and so on
• Text data is extremely high dimensional
• Analysis of phrase counts from text documents is the current state of the art
• Considerable pre-processing of text is needed before one can obtain
frequency information on words and before one can start the statistical
analysis
• Information retrieval and the appropriate “tokenization” of the
information are very important
• Step 1: When faced with a raw text document, stem the words
• This means that one cuts words to their root: for example, “tax” from
taxing, taxes, and taxation
• Porter stemming algorithm is used for removing common
morphological and inflexional endings from English words
• Step 2: Search the text documents for a list of stop words
containing irrelevant words marked for removal
• Examples: If, and, but, who, what, the, they, their, a, or, and so on

• Step 3: Remove words that are extremely rare
• A reasonable rule removes words with relative frequencies below 0.5%
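A minimal sketch of how such a rare-word filter might be applied (our illustration;
the 0.5% threshold follows the rule above, and dtm is assumed to be a
DocumentTermMatrix like the one built in the earlier tm examples):
m = as.matrix(dtm)
rel_freq = colSums(m) / sum(m) #Relative frequency of each term in the corpus
dtm_trimmed = dtm[, which(rel_freq >= 0.005)] #Keep terms with relative frequency >= 0.5%
dim(dtm_trimmed)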
Overview of Association Rules
support(A ⇒ B) = P(A ∩ B)
confidence(A ⇒ B) = P(A ∩ B) / P(A)
lift(A ⇒ B) = P(A ∩ B) / (P(A) × P(B))
Association Rule Mining
 In data mining, association rule mining is a popular and well
researched method for discovering interesting relations between
variables in large databases
– The rule found in the sales data of a supermarket would indicate that if a
customer buys onions & potatoes together, he/ she is likely to buy chicken
– Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also
purchasing one of the 3 types of candy bars
– Customers who purchase maintenance agreements are likely to purchase large
appliances
• Association rules are rules presenting association or correlation
between item sets
• The three most widely used measures for selecting interesting rules
are support, confidence and lift
• Support - Percentage of cases in the data that contains both A and B
• Confidence - Percentage of cases containing A that also contain B
• Here we have 5 customers. Each customer is given a trolley & their purchases
are as follows:
Customer | Items Purchased
1 | Orange Juice, Soda
2 | Milk, Orange Juice, Window Cleaner
3 | Orange Juice, Detergent
4 | Orange Juice, Detergent, Soda
5 | Window Cleaner, Soda
• Now let's form a matrix to analyze the above data & conclude inferences
• Simple patterns derived from the observation:
               | Orange Juice | Window Cleaner | Milk | Soda | Detergent
Orange Juice   |      4       |       1        |  1   |  2   |    2
Window Cleaner |      1       |       2        |  1   |  1   |    0
Milk           |      1       |       1        |  1   |  0   |    0
Soda           |      2       |       1        |  0   |  3   |    1
Detergent      |      2       |       0        |  0   |  1   |    2
• Orange Juice & Soda are more likely purchased together than any other 2 items
• Detergent is never purchased with Milk or Window Cleaner
• Milk is never purchased with Soda or Detergent
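As a worked check using the five transactions above (our addition), consider the
rule {Orange Juice} ⇒ {Soda}:
support(Orange Juice ⇒ Soda) = 2/5 = 0.40 (2 of the 5 baskets contain both items)
confidence(Orange Juice ⇒ Soda) = (2/5) / (4/5) = 0.50 (Orange Juice appears in 4 baskets, 2 of which also contain Soda)
lift(Orange Juice ⇒ Soda) = 0.50 / (3/5) ≈ 0.83 (Soda appears in 3 of the 5 baskets; a lift below 1 means the pair co-occurs slightly less often than expected if the items were independent)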
How good is an Association Rule?
• The 3 measures above (support, confidence & lift) are the constraints on which
association rules are evaluated
• arules - provides the infrastructure for representing, manipulating, and
analyzing transaction data and patterns
• arulesViz – extends package arules. Various visualization
techniques for association rules & itemsets are included
• apriori() – Present in the arules package. It employs level-
wise search for frequent item-sets
Groceries Dataset – Contains 1 month (30 days) of
real-world point-of-sale transaction data from a typical
grocery outlet. The data set contains 9835 transactions
& the items are aggregated to 169 categories
install.packages("arules")
library(arules)
install.packages("arulesViz")
library(arulesViz)
data(Groceries)
head(Groceries)
#Display transactions as item Matrix in sparse format with 9835 rows and
#169 columns & a density of 0.02609146
summary(Groceries)
#Employs level-wise search for frequent item-sets
rules=apriori(Groceries, parameter=list(support=0.002,confidence=0.5))
#Displays set of 1098 rules
rules
inspect(head(sort(rules, by="lift"), 2))
#Plot the data
plot(rules)
#Display Support, Confidence, and the Lift values after applying AR
head(quality(rules))
plot(rules, method="grouped")
Interpretation
1. From the scatter plot, it can be seen that rules with high lift
have relatively low support. Most interesting rules reside
on support-confidence border
2. From the plot, it is seen that the most interesting rules
according to 'lift' appear at the top-center. There are
3 rules containing "hard cheese" and "butter", with
"whipped/ sour cream" as the consequent