Data Science and Predictive Analytics
ANALYTICS FOR
MANAGERS
Session I
Statistical thinking will one day be as necessary for efficient citizenship as the ability to read & write
- H.G.Wells
Course Objective(s) & Outcome(s)
Course Objective(s): This course will cover the basic concepts of big data and
methodologies for analyzing structured, semi-structured and unstructured data, with
emphasis laid on the link between data science and business needs. The
course is intended for first-year management students coming from backgrounds in
engineering, commerce, arts, computer science, statistics, mathematics, economics and
management. This course seeks to present the student with a wide range of data
analytic techniques and is structured around the broad contours of the different types of
data analytics, namely: descriptive, inferential, predictive, and prescriptive analytics.
Course Outcome(s): By the time the student completes the academic requirements, he/she
will be able to:
• Obtain, clean/process and transform data.
• Analyse and interpret data using an ethically responsible approach.
• Use appropriate models of analysis, assess the quality of input, derive insight from
results, and investigate potential issues.
• Apply computing theory, languages and algorithms, as well as mathematical and
statistical models, and the principles of optimization to appropriately formulate and use
data analyses.
• Formulate and use appropriate models of data analysis to answer business-related
questions.
• Interpret data findings effectively to any audience, orally, visually and in written form
Syllabus
Unit I : Introduction to Business Analytics and Data
Types of Digital Data: Structured Data, Unstructured Data, and Semi-Structured Data;
Introduction to Big Data; Overview of Business Analytics; Skills Required for a Business
Analyst; Functional Applications of Business Analytics in Management.
Introduction to R Programming; Data Manipulation in R: Vectors, Basic Math, and Matrix
Operations; Summarizing Data: Numerical and Graphical Summaries; Data Visualization in
R; Data Transformation; Data Import Techniques in R; Time Series and Spatial Graphs;
Graphs for Categorical Responses and Panel Data.
Unit II: Descriptive and Prescriptive Analytics
Basic Data Summaries: Measures of Central Tendency, Measures of Dispersion, and
Measures of Skewness and Kurtosis; Slicing and Filtering of data; Subsets of Data; Overview
of Exploratory and Confirmatory Factor Analysis; Unsupervised Learning: Clustering and
Segmentation - K-means Clustering and Association Rule Mining – Market Basket Analysis.
Discussion using one caselet for each concept using R (wherever applicable).
Unit III: Predictive and Diagnostic Analytics
Machine Learning: Building Regression Models – Simple Linear and Multiple Linear
Regression Analysis using Ordinary Least Squares Method; Supervised Learning –
Regression and Classification Techniques: Logistic Regression Analysis; Linear Discriminant
Analysis; Decision Trees; Unstructured Data Analytics: Overview of Text Mining and Web
Mining.
Discussion using one caselet for each concept using R (wherever applicable).
Suggested Readings
• A Ohri (2012), “R for Business Analytics”, ISBN 978-1-4614-4342-1(eBook), DOI 10.1007/978-1-4614-4343-
8, Springer New York-Heidelberg Dordrecht London, Springer Science, New York.
• Arnab K.Laha (2015), “How to Make The Right Decision”, Random House Publishers India Pvt. Ltd., Gurgaon,
Haryana, India.
• Joseph F. Hair, William C. Black, Barry J. Babin and Rolph E. Anderson (2015), “Multivariate Data
Analysis”, Pearson Education, New Delhi, India.
• Jared P. Lander (2013), “R for Everyone: Advanced Analytics and Graphics”, Pearson Education Inc., New
Jersey, USA.
• Johannes Ledolter (2013), “Data Mining and Business Analytics with R”, John Wiley & Sons, Inc., New
Jersey, USA.
• Prasad R N and Acharya Seema (2013), “Fundamentals of Business Analytics”, Wiley India Pvt. Ltd., New
Delhi, India.
• Glyn Davis and Branko Pecar (2013), “Business Statistics using Excel”, Oxford University Press, New Delhi.
• Halady Rao Purba (2013), “Business Analytics an Application Focus”, PHI Learning Private Limited, New
Delhi.
• Jank Wolfgang (2011), “Business Analytics for Managers”, SpringerScience + Business Media, ISBN 978-1-
4614-0405-7.
• Subhashini Sharma Tripathi, “Learn Business Analytics in Six Steps Using SAS and R”, ISBN-13 (pbk):
978-1-4842-1002-4 ISBN-13 (electronic): 978-1-4842-1001-7, Bangalore, Karnataka, India.
• Dr. Umesh R. Hodeghatta and Umesha Nayak, “Business Analytics Using R - A Practical Approach”, ISBN-
13 (pbk): 978-1-4842-2513-4 ISBN-13 (electronic): 978-1-4842-2514-1, DOI 10.1007/978-1-4842-2514-1,
Bangalore, Karnataka, India.
• Thomas A. Runkler, “Data Analytics Models and Algorithms for Intelligent Data Analysis”, Springer, ISBN
978-3-8348-2588-9 ISBN 978-3-8348-2589-6 (eBook) DOI 10.1007/978-3-8348-2589-6.
• Bhasker Gupta, “Interview Questions in Business Analytics”, Apress, ISBN-13 (pbk): 978-1-4842-0600-3
ISBN-13 (electronic): 978-1-4842-0599-0, DOI 10.1007/978-1-4842-0599-0.
List of Journals
• Journal of Retailing - Elsevier
• Journal of Business Research - Elsevier
• Industrial Management & Data Systems- Emerald
Learning objectives
1. Digital Data Formats
2. Genesis
3. Data Science/ Analytics (Business Solutions)
4. Overview of Analytics
5. Types of Business Analytics
6. Foundations of R (A Free Analytic Tool) – Managing Qualitative
Data
7. Text Mining using R Software for Windows
1. Analyzing Online Government Reports
2. Analyzing Online Research Articles
3. Analyze Interviews
Prasanta Chandra Mahalanobis
India's first Big Data Man
• Public use of statistics started in India with Prasanta Chandra Mahalanobis (PCM)
• PCM pursued his education at Brahmo Boys school in Calcutta; later joined
Presidency College in the same city. Following this, he left to University of London
for higher studies
• In 1915, when Mahalanobis's ship from England to India was delayed, he spent
time in the library of King’s College, Cambridge, where he found Biometrika, a leading
journal on theoretical statistics of the time. A physics student's sudden interest led
to India's rise in the field of statistics
• In 1931, he set up the Indian Statistical Institute (ISI) as a registered society
• He introduced the concept of pilot surveys & advocated the usefulness of
sampling methods
• Early surveys began between 1937-1944 which included consumer expenditure,
tea-drinking habits, public opinion, crop acreage and plant disease
• He served as the Chairman of the UN Sub-Commission on Sampling from 1947-
1951 & was appointed the honorary statistical adviser to the GoI in 1949
• For his pioneering work, he was awarded the Padma Vibhushan in 1968
• The eminent scientist breathed his last on June 28, 1972
1. Digital Data Formats
• Data has seen exponential growth since the advent of the computer & internet
• Digital data can be classified into 3 forms
  • Unstructured
  • Semi-structured
  • Structured
• According to Merrill Lynch, 80-90% of business data is either unstructured or semi-structured
• Gartner also estimates that unstructured data constitutes 80% of the whole enterprise data
[Pie chart: unstructured data 80%, semi-structured data 10%, structured data 10%]
Structured Data
• Data conforms to some specification
• A well-defined schema enables efficient data processing, improved storage and navigation of content
• Defines the type & structure of data, & their relations
• Limitation - difficult to subsequently extend a previously defined database schema that already contains content
Semi-structured Data
• Data is collected in an ad-hoc manner before it is known how it will be stored and managed
• Does not conform to a data model but has some structure
• Not all the information collected will have identical structure
• Advantage - ability to accommodate variations in structure
Unstructured Data
• Does not conform to a data model
• It is not in a form which can be used easily by a computer program
• Advantage - no additional effort on its classification is necessary
• About 80-90% of an organization's data is in this format
GoodLife HealthCare Group, one of India’s leading healthcare groups, began
its operations in the year 2000 in a small town off the south-east coast of India,
with just one tiny hospital building with 25 beds. Today, the group owns 20 multi-
specialty healthcare centers across all the major cities of India. The group has
witnessed some major successes & attributes them to its focus on assembly-line
operations & standardization. GoodLife HealthCare offers the following
facilities: emergency care 24 x 7, support groups, and support & help through call
centers. The group believes in making a “Dent in Global Healthcare”. A few of
its major milestones are listed below in chronological order:
• Year 2000 – the birth of the GoodLife HealthCare Group, functioning initially from a tiny hospital building with
25 beds
• Year 2002 – built a low-cost hospital with 200 beds in India
• Year 2004 – gained a foothold in other cities of India
• Year 2005 – the total number of healthcare centers owned by the group touched the 20 mark
• The next 5 years saw the group's dominance in the form of it setting up the GoodLife HealthCare Research
Institute to conduct research in molecular biology and genetic disorders
• Year 2010 – the group won the award for the “Best HealthCare Organization of the Decade”
• Globally, India stands in Top 5 for mobile consumers & social media usage:
• Over 1.2 billion - Population
• Over 890 million - Mobile subscribers
• 213 million - Internet subscribers
• 115 million - Facebook users
• 24 million - LinkedIn users
• Quantifying & analyzing data is virtually impossible using conventional
databases & computing technology
• Levels of Support - Big data initiative requires 3 levels of support
1. Infrastructure – Designing the architecture, providing the enterprise hardware & cloud solutions,
assisting the management for big data enterprise etc.
2. Software development – Big data software platforms such as R, SAS, Hadoop, NoSQL,
MapReduce, Pig, Python and related big data software tools are essential
3. Analytics – Once the software delivers the results there has to be insights derived from the
numbers
• Big data through a single analytical model analyzes data, bringing together
both structured data (sales & transactional records) & unstructured data
(social media comments, audio, & video)
• Organizations started turning to self-service data discovery & data
visualization products such as Qlik (founded in 1993), Spotfire (founded in
1996) & Tableau Software (founded in 2003) to analyze data
• Big-data Challenges
1. Data discovery & data visualization tools available, though mature, aren’t always
suitable for non-technical business users
2. Capture, data curation, search, sharing, storage, transfer, analysis, visualization,
querying & information privacy
3. Predictive insights are typically available only in high-value pockets of businesses
as data science talent remains scarce
• Big Data Security - Big data is not inherently secure. It provides a
consolidated view of the enterprise, and analytics can identify threats in real time
1. Security management – Real-time security data can be combined with big data analytics
2. Identity & access management – Allows the enterprise to adapt identity controls for secure
access on demand
3. Fraud detection & prevention – Analyzes massive amounts of behavioral data to instantly
differentiate between legitimate and fraudulent activity
4. Governance, risk management, & compliance – Unifies & enables access to real-time
business data, promoting smarter decision-making that mitigates risk
• Bain & Company study revealed that early adopters of big data analytics
have a significant lead over their competitors. After surveying more than 400
companies, Bain determined that those companies with the best analytics
capabilities come out the clear winners in their market segments:
• Twice as likely to be in the top 25 per cent in financial performance
• 5 times more likely to make strategic business decisions faster than their competitors
• 3 times more likely to execute strategic decisions as planned
• Twice as likely to use data more frequently, when making decisions
• McKinsey Global Institute Report (May 2017) - McKinsey surveyed more
than 500 executives, across the spectrum of industries, regions, and sizes
• More than 85% acknowledged they were somewhat effective at their data analytics initiatives
• Digitization is uneven among companies, sectors & economies. Leaders are reaping benefits
• Innovations in digitization, analytics, artificial intelligence, & automation are creating
performance and productivity opportunities for business & economy
• By 2018, the USA alone could face a shortage of 140,000-190,000 people with deep analytical
skills, as well as 1.5 million managers/analysts with the know-how to use data analysis to make effective decisions
• NASSCOM (National Association of Software & Services Companies)
Report
• Big data analytics sector in India is expected to witness eight-fold growth to reach $16 billion
by 2025 from the current level of $2 billion
• India will have 32 per cent share in the global market
• Identified 'marketing analytics' as a core growth area for the industry
• The marketing analytics segment is expected to grow at an average of 25% every year till 2020 to reach $1.2 billion. It
is at present estimated at $200 million (Rs. 1,000 crore)
• Data analytics segment would require huge manpower over next five years
Industries Using analytics for business
Facts about big data and Analytics
Data & analytics underpin six disruptive models, and certain characteristics make
individual domains susceptible
Most Popular Analytic Tools
(in the Business World)
1. MS Excel: An excellent reporting & dashboarding tool. The latest versions of Excel
can handle tables with up to 1 million rows, making it a powerful yet versatile
tool
2. SAS: 5000 pound gorilla of the analytics world. Most commonly used
software in the Indian analytics market despite its monopolistic pricing
3. SPSS Modeler (Clementine): A data mining software tool by SPSS Inc., an
IBM company. This tool has an intuitive GUI & its point-and-click modelling
capabilities are very comprehensive
4. Statistica: Provides data analysis, data management, data mining, & data
visualization procedures. The GUI is not the most user-friendly, takes a little
more time to learn but is a competitively priced product
5. R: It is an open source programming language & software environment for
statistical computing and graphics.
6. Salford Systems: Provides a host of predictive analytics & data mining
tools for businesses. The company specializes in classification &
regression tree algorithms. The software is easy to use & learn
7. KXEN: Drives automated analytics. Their products, largely based on
algorithms developed by the Russian mathematician Vladimir Vapnik, are
easy to use, fast & can work with large amounts of data
8. Angoss: Like Salford Systems, Angoss has developed its products around
classification & regression decision tree algorithms. The tools are easy to
learn & use, besides the results being easy to understand & explain. The GUI is
user friendly and many new features have been added
9. MATLAB: Allows matrix manipulations, plotting of functions & data,
implementation of algorithms & creation of user interfaces. There are many
add-on toolboxes that extend MATLAB. MATLAB is not free software;
however, there are clones like Octave and Scilab which are free and have
similar functionality
10. Weka: Weka (Waikato Environment for Knowledge Analysis) is a popular
suite of machine learning software, developed at the University of Waikato,
New Zealand. Weka, along with R, is amongst the most popular open
source software
3. Data Science/ Analytics
(Business Solutions)
Data Science
1. Practice of various scientific fields, their algorithms, approaches & processes
2. Using programming languages & software frameworks
3. Aiming to extract knowledge, insights & recommendations from data &
4. Deliver them to business users & consumers in consumable applications
Hyper - Radical Personalization
• Determine what is of interest to a specific user in a
specific context and present it to them when needed
• Retail, Finance, Telecom: Recommend other products
and services
• Media: Recommend relevant or similar content
Resource Allocation
• Determine how to use resources while respecting
operational constraints
• Finance: Portfolio Investment
• CPG, Retail: Supply Chain Management, Inventory
Allocation
• Industrial: Manufacturing scheduling
• Travel and Transportation: Scheduling of trucks,
trains, ships and airplanes
Predictive & Prescriptive Analytics Tools
• SPSS C&DS
• Watson Machine Learning (on Bluemix)
• R (3.5.1) - https://cran.r-project.org/bin/windows/base/
• Datasets - https://www.superdatascience.com/machine-learning/
FOUNDATIONS
WITH R
Managing Data
With R
Introduction to R
• 3 core skills of data science
• data manipulation
• data visualization and
• machine learning
• The R statistical programming language, a free open-source platform, was developed by
Ross Ihaka and Robert Gentleman at the University of Auckland during the 1990s, as an
implementation of the S language that originated at Bell Labs
• Mastering core skills of data science will be easier in R, hence R is becoming
the lingua franca for data science. R is
• An open source programming language
• an interpreter
• high-level data analytics and statistical functions are available
• An effective data visualization tool
• Google & Facebook, considered two of the best companies to work for in the modern
economy, have data scientists using R
• R is also the tool of choice for data scientists at Microsoft, who apply machine learning
to data from Bing, Azure, Office, and the Sales, Marketing and Finance departments
• Beyond tech giants like Google, Facebook, and Microsoft, R is widely in use at a wide
range of companies, including Bank of America, Ford, TechCrunch, Uber, and Trulia
• R is a case sensitive language
• Commands can be entered one at a time at the command prompt (>) or can
run a set of commands from a source file
• Comments - start with a #
• Terminate the current R session - use q(); the Esc key interrupts a running command
• Inbuilt help in R - ?command or help(command)
• View(dataset) - Invoke a spreadsheet-style data viewer on a matrix-
like R object
• Commands are separated either by a semi-colon (;) or by a newline
• Elementary commands can be grouped into one compound expression by
braces ({ })
• If a command is not complete, R will prompt + on subsequent lines
• Results of calculations can be stored in objects using the assignment
operators: <- or =
• Vertical arrow keys on the keyboard can be used to scroll forward & backward
through a command history
• Basic functions are available by default. Other functions are contained in packages
(both built-in & user-created) that can be attached as needed; attached packages are
kept in memory during an interactive session
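A minimal console sketch illustrating these basics (the object names are illustrative):
sales <- c(120, 95, 143)        # assignment with <- ; R is case sensitive, so sales and Sales differ
Sales = sum(sales)              # assignment with = also works
mean(sales); Sales              # two commands on one line, separated by a semicolon
{
  n <- length(sales)            # a compound expression grouped by braces
  Sales / n                     # the value of the last expression is returned
}
?mean                           # inbuilt help, equivalent to help(mean)
# View(mtcars)                  # spreadsheet-style data viewer (opens in a separate pane)
# q()                           # ends the current session (commented out here)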
• Rstudio - Created by a team led by JJ Allaire whose previous products
include ColdFusion and Windows Live Writer
• The usual R Studio screen has four (4) windows
1. Console
2. Environment and history
3. Files, plots, packages and help
4. The R script(s) and data view
• Console – where you can type commands & see output
• Environment/ Workspace – tab shows all the active objects
• History – tab shows a list of commands used so far
• Files – tab shows all the files and folders in your default workspace
• Plots - will show all your graphs
• Packages - tab will list a series of packages or add-ons needed to run certain
processes
• Help – tab to see additional information
• R script - keeps record of your work
• To send one line place the cursor at the desired line & press ctrl+Enter
• To run a file of code, press ctrl+shift+s or run tab
• To terminate the execution of command press ctrl+c
• ctrl+1 moves the cursor to the text editor area; ctrl+2 moves it to the console
• Primary feature of R Studio is projects. A project is a collection of files
Contd...
• An operator performs Order of the operations
O specific mathematical/ 1. Exponentiation
logical manipulations 2. Multiplication & Division in the order in which
p the operators are presented
• Rich in built-in operators
e 3. Addition & Subtraction in the order in which the
• R provides following types operators are presented
r of operators 4. Mod operator (%%) & integer division operator
(% / %) have same priority as the normal
a • Arithmetic Operators operator (/) in calculations
t • Relational Operators 5. Basic order of operations in R: Parenthesis (),
• Logical Operators Exponents (^), Multiplication, Division, Addition
o & Subtraction (PEMDAS)
• Assignment Operators
r • Miscellaneous Operators
6. Operations put between parenthesis is carried
out first
s
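A small sketch illustrating this order of operations (the numbers are arbitrary):
2 + 3 * 4 ^ 2       # exponent, then multiplication, then addition: 50
(2 + 3) * 4 ^ 2     # parentheses first: 80
17 %% 5             # mod operator: 2
17 %/% 5            # integer division: 3
-2 ^ 2              # ^ is applied before the unary minus: -4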
Operators in R
Operator | Description | Example
x + y | y added to x | 2 + 3 = 5
x - y | y subtracted from x | 8 - 2 = 6
x * y | x multiplied by y | 2 * 3 = 6
x / y | x divided by y | 10 / 5 = 2
x ^ y | x raised to the power y | 2 ^ 3 = 8
x > y | Returns TRUE if x is larger than y (element-wise for vectors) | TRUE TRUE TRUE TRUE TRUE
x >= y | Returns TRUE if x is larger than or exactly equal to y | TRUE TRUE TRUE TRUE TRUE
Miscellaneous Operators
Operator | Description | Example
: | Creates a series of numbers in sequence for a vector | A <- 1:5; print(A*A) gives 1 4 9 16 25
%in% | Used to identify if an element belongs to a vector | a <- 1:5; b <- 5:10; t <- c(5,10,15,20,25); print(a %in% t) gives FALSE FALSE FALSE FALSE TRUE; print(b %in% t) gives TRUE FALSE FALSE FALSE FALSE TRUE
%*% | Matrix multiplication, e.g. multiplying a matrix with its transpose | M = matrix(c(1,2,3,4), nrow = 2, ncol = 2, byrow = TRUE); NewMatrix = M %*% t(M)
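A short sketch exercising the relational and miscellaneous operators listed above (the vectors are invented for illustration):
x <- c(1, 4, 9); y <- c(2, 4, 8)
x > y                                   # element-wise comparison: FALSE FALSE TRUE
x >= y                                  # FALSE TRUE TRUE
A <- 1:5                                # the : operator builds a sequence
A * A                                   # 1 4 9 16 25
3 %in% A                                # TRUE - membership test
M <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
M %*% t(M)                              # matrix multiplication with the transpose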
Data Types in R
• In any programming language, we use variables to store information
• Based on the data type of a variable, the operating system allocates memory & decides what can be stored in the reserved memory
• Variables are assigned R-objects & the data type of the R-object becomes the data type of the variable
• Frequently used R-objects:
  1. Vectors
  2. Data Frames
  3. Lists
  4. Matrices
  5. Arrays
  6. Factors
• The six simplest data types of atomic vectors:
  1. Numeric
  2. Integer
  3. Complex
  4. Character (String)
  5. Logical (TRUE/FALSE)
  6. Raw
• Other R-objects are built upon atomic vectors
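A minimal sketch creating each of the frequently used R-objects (the values are illustrative):
v  <- c(2.5, 3.7, 4.1)                  # numeric vector
i  <- c(2L, 5L, 9L)                     # integer vector
s  <- c("IPE", "Hyderabad")             # character vector
l  <- list(v, s, TRUE)                  # a list can mix types
m  <- matrix(1:6, nrow = 2)             # matrix
df <- data.frame(id = i, score = v)     # data frame
f  <- factor(c("low", "high", "low"))   # factor
class(v); class(df); class(f)           # "numeric" "data.frame" "factor"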
Dealing with missing values
• One of the most important problems in statistics is incomplete data sets
• To deal with missing values, R uses the reserved keyword NA, which stands
for Not Available
• We can use NA as a valid value, so we can assign it as a value as well
• is.na() - Tests whether a value is NA. Returns TRUE if the value is NA
x<-NA is.na(x) TRUE
• is.nan() - Tests whether a value is NaN (Not a Number, e.g. the result of 0/0). Returns FALSE
for an NA that is not NaN
x<-NA is.nan(x) FALSE
>is.na(mtcars$mpg)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE [17] FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> is.nan(mtcars$cyl)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE [17] FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
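A small sketch of how NA values propagate and how they are commonly excluded (the vector is invented for illustration):
x <- c(12, NA, 7, 9)
mean(x)                 # NA - the missing value propagates
mean(x, na.rm = TRUE)   # 9.33, computed on the non-missing values only
sum(is.na(x))           # 1 - count of missing values
x[!is.na(x)]            # 12 7 9 - drop the missing value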
1. Vectors – c()
• One-dimensional arrays that can hold numeric/ character/ logical data
• Constants (one-element vectors), also called scalars, hold constant values
• Example: x <- 2.414, h <- TRUE, cn <- 2 + 5i, name = 'IPE', v <- charToRaw('IPE')
• Types of Vectors
1. Numeric Vectors – contains all kinds of numbers
2. Integer Vectors – contains integer values only
3. Logical Vectors – contains logical values (TRUE and/ or FALSE)
4. Character Vectors – contains text
o We can easily edit the vector by using indices in single/ multiple locations
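A minimal sketch of creating typed vectors and editing them by index (the values are illustrative):
scores <- c(65, 72, 80, 91)          # numeric vector
labels <- c("A", "B", "C")           # character vector
flags  <- c(TRUE, FALSE, TRUE)       # logical vector
scores[2]                            # access a single location: 72
scores[2] <- 75                      # edit a single location
scores[c(1, 4)] <- c(60, 95)         # edit multiple locations at once
scores                               # 60 75 80 95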
apply(x, MARGIN, FUN) works equally well for a matrix as it does for a data frame object
apply(cosmeticsexpenditure, 2, mean)   # column means
YMen YWomen SCM SCW
  20     50  80 110
apply(cosmeticsexpenditure, 1, mean)   # row means
55 65 75
apply(cosmeticsexpenditure, 1, mean)   # row means when the data contains a missing value
55 75 NA
5. Arrays – array()
• While matrices are confined to two dimensions, arrays can be of any number
of dimensions
• Arrays have 2 very important features
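A minimal sketch of creating and indexing an array (the dimensions and values are arbitrary):
a <- array(1:24, dim = c(3, 4, 2))   # 3 rows, 4 columns, 2 layers
dim(a)                               # 3 4 2
a[2, 3, 1]                           # row 2, column 3 of the first layer: 8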
Interpretation
For cars in this sample, the mean mpg is 20.1, with a standard deviation of 6.0.
The distribution is skewed to the right (+0.61) and somewhat flatter than a
normal distribution (–0.37). The majority of the automobiles considered in the Motor
Trend Car Road Tests (mtcars) dataset have lower mileage than the mean, while a
few high-mileage cars pull the mean upwards.
Caselet 2 – Cars93 Dataset (In-built)
The Cars93 data frame has data on 93 cars on sale in the USA in 1993, arranged in 93
rows and 27 columns. The variables in the data set are described as follows:
1. Manufacturer - Manufacturer
2. Model - Model
3. Type – A factor with levels "Small", "Sporty", "Compact", "Midsize", "Large" & "Van"
4. Min.Price – Minimum Price ($1000): Price for a basic version
5. Price – Mid-range Price ($1000): Average of Min.Price & Max.Price
6. Max.Price – Maximum Price ($1000): Price for a premium version
7. MPG.city – City MPG (miles per US gallon by EPA rating)
8. MPG.highway – Highway MPG
9. AirBags – Air Bags standard. Factor: none, driver only, or driver & passenger
10. DriveTrain – Drive train type: rear wheel, front wheel or 4WD; (factor)
11. Cylinders – No. of cylinders (missing for Mazda RX-7, which has a rotary engine)
12. EngineSize - Engine size (litres)
13. Horsepower - Horsepower (maximum)
14. RPM - RPM (revs per minute at maximum horsepower)
15. Rev.per.mile - Engine revolutions per mile (in highest gear)
16. Man.trans.avail - Is a manual transmission version available? (yes or no, Factor)
17. Fuel.tank.capacity - Fuel tank capacity (US gallons)
18. Passengers - Passenger capacity (persons)
19. Length - Length (inches)
20. Wheelbase - Wheelbase (inches)
21. Width - Width (inches)
22. Turn.circle - U-turn space (feet)
23. Rear.seat.room - Rear seat room (inches) (missing for 2-seater vehicles)
24. Luggage.room - Luggage capacity (cubic feet) (missing for vans)
25. Weight - Weight (pounds)
26. Origin - Of non-USA or USA company origins? (factor)
27. Make - Combination of Manufacturer and Model (character)
Assignment
1. Load the data set Cars93 with data(Cars93,package=“MASS”) and
set randomly any 5 observations in the variables Horsepower and
Weight to NA (missing values)
2. Calculate the arithmetic mean & the median of the variables
Horsepower and Weight
3. Calculate the standard deviation and the interquartile range of the
variable Price
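One possible solution sketch for this assignment (the seed and the na.rm handling are assumptions, not part of the task statement):
data(Cars93, package = "MASS")
set.seed(123)                                     # for reproducibility (assumed)
idx <- sample(1:nrow(Cars93), 5)                  # pick any 5 rows at random
Cars93$Horsepower[idx] <- NA                      # set them to missing
Cars93$Weight[idx] <- NA
mean(Cars93$Horsepower, na.rm = TRUE); median(Cars93$Horsepower, na.rm = TRUE)
mean(Cars93$Weight, na.rm = TRUE); median(Cars93$Weight, na.rm = TRUE)
sd(Cars93$Price); IQR(Cars93$Price)               # Price has no missing values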
Overview of Multivariate Methods (classification diagram)
• Dependence methods are classified by the number of dependent variables (one vs. several)
and by whether the dependent variable(s) are metric or non-metric
• Interdependence methods include factor analysis, cluster analysis, and metric/
non-metric multidimensional scaling
Diagnostic Analytics
• R can perform correlation using 2 functions: cor() & cor.test()
• cor() function - cor(x, use = , method =)
• x – Matrix or data frame
• use – Specifies the handling of missing data. Options are all.obs (assumes no missing data –
missing data will produce an error), everything (any correlation involving a case with missing
values will be set to missing), complete.obs (list-wise deletion), pairwise.complete.obs
(pairwise deletion)
• method – Specifies the type of correlation. Options are the Pearson, Kendall & Spearman rank
correlations, abbreviated as "pearson" (default), "kendall", or "spearman"
• The Pearson product-moment correlation assesses the degree of linear relationship between 2
quantitative variables. Spearman's rank-order correlation coefficient assesses the degree of
relationship between 2 rank-ordered variables. Kendall's tau is also a non-parametric measure
of rank correlation
• Default options are use = "everything" and method = "pearson"
• To execute the cor() function one must supply a matrix/data frame for x (and optionally a second variable y)
• The inputs must be numeric
• Correlations are provided by the cor() function, & scatter plots are generated by
scatterplotMatrix() function or corrgram() function
• Interpretation of corrgram() – A blue color & hashing that goes from lower left to upper right
represent a positive correlation between the 2 variables that meet at that cell. Conversely, a
red color & hashing that goes from the upper left to lower right represent a negative
correlation. The darker and more saturated the color, the greater the magnitude of the correlation.
Weak correlations, near zero, appear washed out
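A small sketch of how the use = argument changes the treatment of missing values (the data frame is invented for illustration):
d <- data.frame(x = c(1, 2, 3, 4, NA),
                y = c(2, 4, 6, 8, 10),
                z = c(5, 3, NA, 1, 0))
cor(d)                                    # NAs appear wherever a missing value is involved
cor(d, use = "complete.obs")              # list-wise deletion: only rows 1, 2 and 4 are used
cor(d, use = "pairwise.complete.obs")     # each pair of variables uses its own complete rows
cor(d$x, d$y, use = "complete.obs", method = "spearman")   # rank correlation for one pair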
Caselet 3 - Prestige Dataset (In-built)
Prestige.txt consists of 102 observations on 6 variables. The variables in the data
set are described as follows:
1. education: The average number of years of education for occupational incumbents
2. income: The average income of occupational incumbents, in dollars
3. women: The percentage of women in the occupation
4. prestige: The average prestige rating for the occupation
5. census: The code of the occupation used in the survey
6. type: Professional and managerial(prof), white collar(wc), blue collar(bc), or
missing(NA)
Is there any association between the variables education & prestige? If there is
an association, what is the strength of the association? Prepare a managerial
report.
install.packages("car")
library(car)
# Correlation Matrix of Multivariate sample
test = cor(Prestige[1:4])
#Graphically plot the association
boxplot(Prestige$prestige~Prestige$type)
scatterplotMatrix(Prestige[1:4], spread=FALSE, smoother.args=list(lty=2), main='Scatter Plot')
install.packages("corrgram")
library(corrgram)
corrgram(Prestige[1:4])
Interpretation
Professional and managerial occupations (prof) tend to have higher prestige
than blue-collar (bc) and white-collar (wc) occupations.
Education & prestige have a high positive correlation: the more
the education, the higher the prestige.
Education & income also have a positive correlation, and higher
income goes with higher prestige.
Caselet 4 - mtcars (Motor Trend Car Road Tests) Dataset
The data was extracted from the 1974 Motor Trend US magazine,
and comprises fuel consumption and 10 aspects of automobile
design and performance for 32 automobiles (1973–74 models).
The description of the 11 numeric variables with the 32
observations in the data frame are as follows:
1. [,1] mpg – Miles/ (US) gallon
2. [,2] cyl – Number of Cylinders
3. [,3] disp – Displacement (cu.in.)
4. [,4] hp – Gross Horsepower
5. [,5] drat – Rear Axle Ratio
6. [,6] wt – Weight (1000 lbs)
7. [,7] qsec – ¼ Mile Time
8. [,8] vs – V/S Engine Shape
9. [,9] am –Transmission (0=automatic, 1=manual)
10.[,10] gear – Number of Forward Gears
11.[,11] carb – Number of Carburetors
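The analysis slides for this caselet are not reproduced here; a minimal correlation sketch in the style of the other caselets, assuming the task is to examine associations among the mtcars variables:
data(mtcars)
round(cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")]), 2)   # Pearson correlation matrix
pairs(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")],
      panel = panel.smooth, main = "mtcars data")
# install.packages("corrgram")
library(corrgram)
corrgram(mtcars)                                               # blue = positive, red = negative correlation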
Interpretation (women dataset: height vs. weight)
1. From the output our suspicion is confirmed. Height and weight
have a strong positive correlation (0.9954948)
2. Multiple R-squared value (R2=0.991) indicates that the model
accounts for 99.1% of the variance in weights (because of
heights)
# Graphical Correlation Matrix
1. plot(women, xlab = "Height (in)", ylab = "Weight (lb)", main = "women data:
American women aged 30-39")
2. install.packages("car")
library(car)
scatterplotMatrix(women[1:2], spread=FALSE, smoother.args=list(lty=2),
main='Scatter Plot')
3. scatterplot(weight ~ height, data=women, legend=list(coords="topleft"))
4. install.packages("corrgram")
library(corrgram)
corrgram(women)
Caselet 6 - Longley Dataset (In-built)
This is a macroeconomic data frame which consists of 7 economic
variables, observed yearly from 1947 to 1962 (n = 16).
1. GNP.deflator - GNP implicit price deflator (1954=100)
2. GNP - Gross National Product
3. Unemployed - number of unemployed
4. Armed.Forces - number of people in the armed forces
5. Population – non-institutionalized population ≥ 14 years of age
6. Year - the year (time)
7. Employed - number of people employed
The data items of the data set are known to be highly collinear. Examine
and prepare a managerial report.
# Correlation Matrix of Multivariate sample
test = cor(longley)
# Graphical Correlation Matrix
symnum(test) # highly correlated
library(pastecs)   # stat.desc() comes from the pastecs package
stat.desc(longley$Unemployed, basic = TRUE, desc = TRUE)
pairs(longley, panel = panel.smooth, main = "Longley data", col = 3 +
(longley$Unemployed > 400))
Caselet 7 - swiss Dataset (In-built)
Switzerland, in 1888, was entering a period known as the demographic
transition; i.e., its fertility was beginning to fall from the high level typical of
underdeveloped countries. Standardized fertility measure and socio-economic
indicators for each of 47 French-speaking provinces of Switzerland at about
1888 are arranged as a data frame with 6 variables, each of which is presented
in percent.
1. Fertility – Ig, ‘common standardized fertility measure’
2. Agriculture - % of males involved in agriculture as occupation
3. Examination - % draftees receiving highest mark on army examination
4. Education - % education beyond primary school for draftees
5. Catholic – % catholic (as opposed to protestant)
6. Infant.Mortality – live births who live less than 1 year
Examine the data and prepare a managerial report.
# Correlation Matrix of Multivariate sample
test = cor(swiss)
# Graphical Correlation Matrix
symnum(test)
pairs(swiss, panel = panel.smooth, main = "swiss data", col = 3 +
(swiss$Catholic > 50))
cor.test() function
cor.test(var1, var2, method = c("pearson", "kendall", "spearman"))
• cor.test() tests for association/correlation between paired samples using
Pearson's product moment correlation coefficient r/ Kendall's τ/ Spearman's ρ
• It returns both the correlation coefficient and the significance level(or p-value)
of the correlation
• Example: -
my_data <- mtcars
install.packages("ggpubr")
library("ggpubr")
res <- cor.test(my_data$wt, my_data$mpg, method = "pearson")
res
ggscatter(my_data, x = "mpg", y = "wt", add = "reg.line", conf.int = TRUE, cor.coef =
TRUE, cor.method = "pearson", xlab = "Miles/(US) gallon", ylab = "Weight (1000 lbs)")
• t is the t-test statistic value (t = -9.559); df is the degrees of freedom (df = 30);
p-value is the significance level of the t-test (p-value = 1.294 x 10^-10); conf.int is
the confidence interval of the correlation coefficient at 95% (conf.int = [-0.9338, -
0.7441]); sample estimates is the correlation coefficient (cor.coeff = -0.87).
• Interpretation: - The p-value of the test is 1.294 x 10^-10, which is less than the
significance level alpha = 0.05. We can conclude that wt and mpg are significantly
correlated, with a correlation coefficient of -0.87 and p-value of 1.294 x 10^-10.
Caselet 8 – Cars93 Dataset (In-built)
The Cars93 data frame has data on 93 cars on sale in the USA in 1993, arranged in 93
rows and 27 columns. The variables are described in Caselet 2 above.
Assignment - Calculate the correlation matrix (Pearson) of the variables
Horsepower, Weight and Price. Use both cor() and cor.test() functions
1. Calculate the correlation matrix using cor() function
cor(Cars93[,c("Horsepower","Weight","Price")], use ="complete.obs")
2. Graphically plot the correlation
boxplot(MPG.highway ~ Origin, col = "red", data = Cars93, main = "Box plot of
MPG by origin")
plot(Cars93$Price ~ Cars93$Horsepower,main="Price given
Horsepower",xlab="Horsepower", ylab="Price")
3. Calculate the correlation matrix using cor.test() function
cor.test(Cars93[,13],Cars93[,5])
mydata=Cars93
res=cor.test(mydata$Horsepower,mydata$Price); res
Interpretation
The p-value of the test is 0.00, which is
less than the significance level (i.e., p ≤
0.05). We can conclude that Horsepower
and Price are significantly correlated with a
correlation coefficient of 0.79.
FORECASTING
NUMERIC DATA
Using R Software Tool
Predictive Analytics
• Simplest form of regression is where you have 2 variables – a
response variable & a predictor variable
• It essentially describes a straight line that goes through the data where
a is the y-intercept & b is the slope
y = a + bx                               (simple linear regression)
y = a + b1X1 + b2X2                      (multiple linear regression)
SSE = Σ(y − ŷ)²                          (sum of squared errors)
b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²          (least-squares estimate of the slope)
a = ȳ − b x̄                              (least-squares estimate of the intercept)
• In R, the basic function for fitting linear model is lm(). The format is
myfit <- lm(formula,data)
• The formula is typically written as Y ~ x1 + x2 + .... + xk
• ~ separates the response variable on the left from the predictor variables on the
right
• Predictor variables are separated by + sign
• Other symbols can be used to modify the formula in various ways
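A few illustrative formulas using these symbols (the variables refer to the mtcars dataset used in later caselets):
lm(mpg ~ wt, data = mtcars)                # one predictor
lm(mpg ~ wt + hp, data = mtcars)           # two predictors
lm(mpg ~ hp + wt + hp:wt, data = mtcars)   # : adds an interaction term
lm(mpg ~ hp * wt, data = mtcars)           # * is shorthand for hp + wt + hp:wt
lm(mpg ~ ., data = mtcars)                 # . means all other variables in the data frame
lm(mpg ~ wt + I(wt^2), data = mtcars)      # I() protects arithmetic inside a formula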
Each of these functions is applied to the object returned by the lm()
function in order to generate additional information based on the
fitted model
Function & its Syntax | Explanation
lm(formula, data) | Fits the model; the result shows the coefficients for the regression, i.e., the intercept & the slope. We place the response on the left of the ~ and the predictors on the right
summary(fitness) | Displays detailed results for the fitted model
names(fitness) | To extract more information from the result object
fitness$coefficients, fitness$coef, coef(fitness) | Lists the model parameters (intercept & slopes) for the fitted model
confint(fitness) | Provides confidence intervals for the model parameters. Default settings produce 95% confidence intervals, i.e., at 2.5% & 97.5%
confint(fitness, parm = '(Intercept)', level = 0.9) | Alter the interval using the level = instruction, specifying the interval as a proportion. You can also choose which coefficients to display by using the parm = instruction & placing the names in quotes
fitted(fitness) | Lists the predicted (fitted) values of the model, i.e., the y value predicted for each x value
residuals(fitness), resid(fitness) | Lists the residual values, i.e., the differences between the observed & fitted values
formula(fitness) | Accesses the formula used in the linear model
fitness$call | The complete call to the lm() command
plot() | Generates diagnostic plots for evaluating the fit of a model
predict() | Uses a fitted model to predict response values for a new dataset
Caselet 9 - women Dataset (In-built)
Data-set women provides the height & weight for a set of 15
women aged 30-39 on 2 variables Height and Weight. Suppose
we wish to predict weight from height. How can we identify
overweight or underweight individuals?
1. [,1] height – numeric height (in)
2. [,2] weight – numeric weight (lbs)
fitness=lm(weight~height, data=women)
summary(fitness)
fitted(fitness)
residuals(fitness)
Output - weight = -87.52 + 3.45 x Height
plot(women$height, women$weight, xlab="Height(in inches)", ylab="Weight (in
pounds)",pch=19)
abline(fitness)
a <- data.frame(height = 170)
result <- predict(fitness, a, level=0.95, interval="confidence")
print(result)
Interpretation
• From the output our suspicion is confirmed. Height and weight have a strong
positive correlation (0.9954948)
• The regression coefficient 3.45 indicates that there is an expected increase
of 3.45 pounds of weight for every 1 inch increase in height
• Multiple R-squared value (R2=0.991) indicates that the model accounts for
99.1% of the variance in weights
• The residual standard error (1.53 pounds) is the average error in predicting
weight from height using this model
• The F-statistic tests whether the predictor variables, taken together, predict
the response variable above chance levels. Here, F-test is equivalent to the
t-test for the regression coefficient for height
• aov() command is a special case of linear modelling, with the command
being a “wrapper” for the lm() command. summary() command gets the
result in a sensible layout
• fittness.lm=lm(weight~height,data=women)
• fittness.aov=aov(weight~height,data=women)
• summary(fittness.aov) #generates a classic ANOVA table
• summary(fittness.lm)
area = c(3L, 4L, 6L, 4L, 2L, 5L); sales = c(6, 8, 9, 5, 4.5, 9.5)
realestate = data.frame(area, sales)
Caselet 10 – Real Estate
realestatelm=lm(sales~area, data=realestate)
summary(realestatelm)
Output - sales = 2 + 1.25 x area (residual standard error = 1.311)
fitted(realestatelm)
residuals(realestatelm)
plot(area, sales, xlab='Local Area of the Plot(in 1000 sq.
ft.)', ylab='Sales(in $100,000)', pch=19)
abline(realestatelm)
a = data.frame(area = 20)
predictsales = predict(realestatelm, a, level=0.95,
interval="confidence")
print(predictsales)
Interpretation
– The regression coefficient 1.25 indicates an expected increase of 1.25 units of sales (i.e., $125,000)
for every 1,000 sq. ft. increase in area
– The Multiple R-squared value (R2 = 0.6944) indicates that the model accounts for 69.4% of the
variance in sales
– The residual standard error (1.311, in units of $100,000) is the average error in predicting sales from
the area of the plot using this model
– When the area is 20,000 sq. ft., the predicted sales value is 27 units of $100,000 (i.e., $2.7 million)
– At the 95% confidence level, the predicted sales value for a 20,000 sq. ft. plot lies between 8.5 and 45.4 units of $100,000
Caselet 11 - mtcars (Motor Trend Car Road Tests) Dataset
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel
consumption and 10 aspects of automobile design and performance for 32 automobiles
(1973-74 models). The 11 numeric variables are described in Caselet 4 above.
Prepare a managerial report.
Research questions
1. Is a car with automatic or manual transmission better in term of miles
per gallons (mpg)?
2. Quantify the mpg difference between automatic & manual
transmission
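The code for Model 1 is not shown on the slides; a minimal sketch that reproduces the figures interpreted below, assuming MPG is regressed on transmission type alone:
## Model 1: MPG ~ transmission (am)
fit <- lm(mpg ~ am, data = mtcars)
summary(fit)                            # coefficient for am is about 7.245; adjusted R-squared is about 0.3385
boxplot(mpg ~ am, data = mtcars, names = c("automatic", "manual"),
        ylab = "Miles per gallon")      # visual comparison of the two groups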
Interpretation
1. With manual transmission MPG increases by 7.245 miles/Gallon
2. Here p-value ≤ 0.05, indicating that the difference for manual transmission is
significant
3. Adjusted R2 is 0.3385. Clearly, the coefficients obtained here is biased without
considering other variables
## Model 2: MPG ~ weight + number of cylinders + displacement
fit1 <- lm(mpg ~ wt + cyl + disp, data = mtcars); summary(fit1)
Interpretation
1. We see that the adjusted R2 is 0.8147
2. P-values show that both weight and number of cylinders have significant linear
relationships with MPG (since p-value ≤ 0.05), but displacement does not
## Model 3: MPG ~ weight + horsepower + number of cylinders + displacement + transmission
fit2 <- lm(mpg ~ wt + hp + cyl + disp + am, data = mtcars)
summary(fit2)
Interpretation
1. We see that the adjusted R2 is 0.827
2. P-values show that both weight &
horsepower have significant linear
relationships with MPG (since p-
value ≤ 0.05), but number of
cylinders, displacement &
transmission do not
## Model 4: MPG ~ all variables
fit3 <- lm(mpg ~ ., data = mtcars); summary(fit3)
Interpretation
1. We see that the adjusted R2 is 0.807
2. None of the variables considered
have significant linear relationship
with MPG since p-value ≥ 0.05
## Model 5: Multiple linear regression with interactions - impact of automobile
weight & horsepower on mileage
mreg_int = lm(mpg~hp + wt + hp:wt, data=mtcars)
summary(mreg_int)
Interpretation
• The interaction between horsepower & car weight is significant (since p ≤ 0.05) i.e., the
relationship between miles per gallon & horsepower varies by car weight
• The model for predicting mpg is given by mpg = 49.81 – 0.12 x hp – 8.22 x wt + 0.03 x hp x wt
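The prediction code behind the next interpretation is not shown on the slides; a minimal sketch that reproduces the quoted figures, assuming a simple model mpg ~ wt for the first prediction and Model 2 (fit1) for the second:
## Model 6: Predicting MPG for new data
fit_wt <- lm(mpg ~ wt, data = mtcars)
predict(fit_wt, data.frame(wt = 10), interval = "confidence")   # fitted value is about -16.15
predict(fit1, data.frame(wt = 10, cyl = 10, disp = 200))        # fitted value is about -11.6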
Interpretation
• When the wt of the automobile is 10 the mpg of the vehicle reduces further to -16.15 with its
confidence limits as -8.33 and -23.9
• When the wt is 10, no. of cylinders is 10, and displacement is 200 then the mpg reduces further
to -11.6
## Model 7: Graphical Representation
install.packages("car")
library(car)
scatterplotMatrix(mtcars, spread=FALSE, smoother.args=list(lty=2), main='Scatter
Plot Matrix - mtcars')
Conclusion
1. MPG is mainly related to vehicle weight
2. Manual transmission is better than auto transmission on MPG
3. With the given data set, we are unable to correctly quantify the difference between the two types of
transmissions on MPG
Polynomial Regression – Special case of multiple linear regression
• Linear relationship between two variables x and y is one of the
most common, effective and easy assumptions to make when
trying to figure out their relationship
• Sometimes however, the true underlying relationship is more
complex than that, and this is when polynomial regression
comes in to help
Caselet 12 – Position_Salaries (User-defined Dataset)
Data-set provides the Position, Level and Salary for a set of 10 employees. Suppose
Mr.Raghu currently working with the XYZ company as Region Manager since past 2
years wishes to shift to a new organization & applied for a vacant position for Partner
which he is due in next 2 years with a demand for a salary of more than his current pay
of Rs.160000/- Is Mr.Raghu reasonable?
1. [,1] Position – Designation
2. [,2] Level – coding for designation
3. [,3] Salary – Salary of the employee
View(Position_Salaries)
dataset = Position_Salaries[2:3]
dataset
#Polynomial Regression Model
poly_reg=lm(formula=Salary~Level + I(Level^2) + I(Level^3) + I(Level^4), data=dataset)
summary(poly_reg)
#Visualizing Polynomial Results (2 Methods – Plot & ggplot)
plot(dataset$Level, dataset$Salary, xlab = "Level", ylab = "Salary")
lines(dataset$Level,fitted(poly_reg))
ggplot() + #install ggplot2 package & corresponding library
geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)), colour = 'blue') +
ggtitle('Truth or Bluff (Polynomial Regression)') + xlab('Level') + ylab('Salary')
#Predicting Result with Polynomial Regression Model
y_pred = predict(poly_reg, data.frame(Level = 6.5))   # the I(Level^2) to I(Level^4) terms are computed from Level automatically
y_pred
Interpretation
Mr.Raghu seems reasonable as after 2 years i.e., at level 6.5 he is already earning Rs.158862.5/-
which is almost Rs.160000/-. Hence his pay could be fixed with nothing less than Rs.160000/-.
Caselet 13 - women Dataset (In-built)
Data-set women provides the height & weight for a set of 15 women aged
30-39 on 2 variables Height and Weight. Suppose we wish to predict weight
from height. How can we identify overweight or underweight individuals?
1. [,1] height – numeric height (in)
2. [,2] weight – numeric weight (lbs)
• From the output our suspicion is confirmed. Height and weight have a strong positive
correlation (0.9954948)
• The regression coefficient 3.45 indicates that there is an expected increase of 3.45
pounds of weight for every 1 inch increase in height
• Multiple R-squared value (R2=0.991) indicates that the model accounts for 99.1% of
the variance in weights
• The residual standard error (1.53 pounds) is the average error in predicting weight
from height using this model
Simple Linear Regression
Polynomial Regression
• In the Women dataset we can improve the prediction using a regression with a quadratic term
(that is, X2 )
• We can fit a quadratic equation using the statement
fit2 <- lm(weight ~ height + I(height^2), data=women)
summary(fit2)
height^2 adds a height-squared term to the prediction equation. The I() function treats the contents
within the parentheses as a regular R (arithmetic) expression rather than as part of the formula syntax
plot(women$height,women$weight, xlab="Height (in inches)", ylab="Weight (in lbs)")
lines(women$height,fitted(fit2))
• From this new analysis, the prediction equation is
Weight = 261.88 - 7.348 × Height + 0.083 × Height²
Interpretation
• Both regression coefficients are significant
at the p < 0.0001 level
• Amount of variance accounted for has
increased to 99.9 percent
• The significance of the squared term (t =
13.89, p < .001) suggests that inclusion of
the quadratic term improves the model fit
• Close look at the plot of fit2 shows a curve
that indeed provides a better fit
• We can fit a cubic equation using the statement
fit3 <- lm(weight ~ height + I(height^2) + I(height^3), data=women)
summary(fit3)
ggplot() +
geom_point(aes(x = women$height, y = women$weight), colour = 'red') +
geom_line(aes(x = women$height, y = predict(fit3, newdata = women)),
colour = 'blue') +
ggtitle('Polynomial Regression for Women dataset') + xlab('Height') + ylab('Weight')
• From this new analysis, the prediction equation is
Weight = -896.74 + 46.41 × Height - 0.74 × Height² + 0.004 × Height³
• To predict the weight when height is 100 using the polynomial regression model we use the
predict() function
y_pred = predict(fit3, data.frame(height = 100))   # the I(height^2) and I(height^3) terms are computed from height automatically
y_pred
Interpretation
• Regression coefficients are significant at the p < 0.0001 level
• The amount of variance accounted for has increased further, to nearly 100 per cent
• t = 3.94, p < 0.01 suggests that inclusion of the cubic term improves the model fit
• When height = 100, the predicted weight from the polynomial regression model is 535
Overview of Non-Linear Models - Decision Tree Regression
• One of the most intuitive ways to create a predictive model – using the concept of a
tree. Tree-based models often also known as decision tree models successfully
handle both regression & classification type problems
• Linear regression models and logistic regression fail in situations where the
relationship between features and outcome is non-linear or where the features are
interacting with each other
• Non-linear models include: nonlinear least squares, splines, decision trees, random
forests and generalized additive models (GAMs)
• A relatively modern technique for fitting nonlinear models is the decision tree. Tree-
based learning algorithms are considered to be among the best & most widely used
supervised learning methods
• Decision tree is a model with a straightforward structure that allows to predict output
variable, based on a series of rules arranged in a tree-like structure
• Output variable that we can model/ predict can be categorical or numerical
• Decision Trees are non-parametric supervised learning method that are often used for
classification (dependent variable – categorical) and regression (dependent variable –
continuous)
• For a regression tree, the predicted response for an observation is given by the mean/
average response of the training observations that belong to the same terminal node,
while for classification tree predicted response for an observation is given by the
mode/ class response of the training observations that belong to same terminal node
• Root Node represents the entire population or sample. It further gets divided into two or
more homogeneous sets.
• Splitting is a process of dividing a node into two or more sub-nodes.
• When a sub-node splits into further sub-nodes, it is called a Decision Node.
• Nodes that do not split are called Terminal Nodes or Leaves.
• When you remove sub-nodes of a decision node, this process is called Pruning. The
opposite of pruning is Splitting.
• A sub-section of an entire tree is called Branch.
• A node, which is divided into sub-nodes is called a parent node of the sub-nodes; whereas
the sub-nodes are called the child of the parent node.
• Decision trees are nonlinear predictors. The extent of nonlinearities depends on a
number of splits in the tree
• To build a regression tree, we use recursive binary splitting ( a greedy & top-down
algorithm) to grow a large tree on the training data, stopping only when each terminal
node has fewer than some minimum number of observations, which minimizes
the Residual Sum of Squares (RSS)
• Beginning at the top, split the tree into 2 branches, creating a partition of 2 spaces.
You then carry out this particular split of the tree multiple times & choose the split that
features minimizing the (current) RSS
• Next, we can apply cost complexity pruning to the large tree in order to obtain a
sequence of best subtrees, as a function of α. We can use K-fold cross-validation to
choose α. Divide the training observations into K folds to estimate the test error rate of
the subtrees. Our goal is to select the one that leads to the lowest error rate
• The anova method leads to regression trees; it is the default method if y is a simple
numeric vector, i.e., not a factor, matrix, or survival object
• To decide which attribute should be tested first, simply find the one with the highest
information gain. Then recurse…
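A minimal sketch of growing and pruning a regression tree with rpart on a built-in dataset (the choice of mtcars and of the rule for picking the complexity parameter are illustrative):
# install.packages("rpart")
library(rpart)
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")     # grow the regression tree
printcp(fit)                                               # CP table: rel error and cross-validated xerror per split
plotcp(fit)                                                # visual aid for choosing the complexity parameter
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)                        # prune back to the subtree with the lowest xerror
plot(pruned, uniform = TRUE); text(pruned, use.n = TRUE)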
• Limitations
• Decision trees generally do not have the same level of predictive accuracy as other
approaches, since they aren't quite robust. A small change in the data can cause a large
change in the final estimated tree
• Over fitting & unstable at times. Loses a lot of information while trying to categorize continuous
variable. Not sensitive to skewed distribution
• Advantages
• Simple to understand & interpret. Can be displayed graphically
• Requires little data preparation. Useful in data exploration
• Able to handle both numerical and categorical data
• Non-parametric method, thus it does not need any assumptions on the sample space
• Closely mirror human decision-making compared to other regression and classification
approaches
• Tree based methods empower predictive models with high accuracy, stability and ease
of interpretation
• The R package party is also used to create decision trees through its
function ctree(), which is used to create & analyze a decision tree
Syntax: - ctree(formula, data)
where formula is a formula describing the predictor and response variables and
data is the name of the data set used
Example: - library(party)                        # ctree() comes from the party package
data(kyphosis, package = "rpart")                # the kyphosis data ships with rpart
fit = ctree(Kyphosis ~ Age + Number + Start, data = kyphosis)
plot(fit, main = "Conditional Inference Tree for Kyphosis")
Plotting Options
• The plot function for rpart uses the general plot function. By default, this
leaves space for axes, legends or titles on the bottom, left, and top
• Simplest labeled plot is called by using plot & text without changing any
defaults
par(mar = rep(0.2, 4))
plot(fit, uniform = TRUE)
text(fit, use.n = TRUE, all = TRUE)
• uniform = TRUE - plot has uniform stem lengths
• use.n = TRUE - specifying number of subjects at each node
• all = TRUE - Labels on all the nodes, not just the terminal nodes
• Fancier plots can be created by modifying the branch option, which controls
the shape of branches that connect a node to its children.
par(mar = rep(0.2, 4))
plot(fit, uniform = TRUE, branch = 0.2, compress = TRUE, margin = 0.1)
text(fit, all = TRUE, use.n = TRUE, fancy = TRUE, cex= 0.9)
• compress - attempts to improve overlapping of some nodes
• fancy - creates the ellipses and rectangles, and moves the splitting rule to the midpoints of the
branches
• Margin - shrinks the plotting region slightly so that the text boxes don’t run over the edge of
the plot
Caselet 14 – Position_Salaries (User-defined Dataset)
Data-set provides the Position, Level and Salary for a set of 10 employees. Suppose
Mr.Raghu currently working with the XYZ company as Region Manager since past 2
years wishes to shift to a new organization & applied for a vacant position for Partner
which he is due in next 2 years with a demand for a salary of more than his current pay
of Rs.160000/- Is Mr.Raghu reasonable?
1. [,1] Position – Designation
2. [,2] Level – coding for designation
3. [,3] Salary – Salary of the employee
regressor = rpart(formula = Salary ~ ., data = dataset)
regressor = rpart(formula = Salary ~ ., data = dataset, control = rpart.control(minsplit = 1))   # allow splits even on very small nodes
Interpretation
• First column labelled CP is the cost complexity parameter
• Second column, nsplit, is the number of splits in the tree
• rel error stands for relative error & is the RSS for the number of splits divided by the RSS for no
splits
• Both xerror & xstd are based on ten-fold cross-validation, xerror being the average error & xstd
the standard deviation of the cross-validation process
• We can see that while 5 splits produced the lowest relative error on the full dataset, 4 splits
produced a slightly lower xerror under cross-validation
Interpretation
• Plot shows us the relative error by the tree size with the corresponding error
bars
• Horizontal line on the plot is the upper limit of the lowest standard error
• Selecting a tree size of 5 which is 4 splits, we can build a new tree object
where xerror is minimized
6. We can build a new tree object where xerror is minimized by pruning our
tree. First create an object for cp associated with the pruned tree from the
table. Then the prune() function handles the rest.
cparam = min(tree.pros$cptable[5,])
prune.tree.pros = prune(tree.pros, cp=cparam)
7. We can plot and compare the full & pruned trees. Tree plots produced by
partykit package are much better than those produced by the party
package. Simply use the as.party() function as a wrapper in plot()
plot(as.party(tree.pros))
Now use as.party() function for the pruned tree
plot(as.party(prune.tree.pros))
Interpretation: - In the next slide we note that splits are exactly the same in 2 trees with the
exception of the last split, which includes the variable age for the full tree. Interestingly, both
the first & second splits in the tree are related to the log of cancer volume (lcavol). Plots are
quite informative as they show the splits, nodes, observations per node & boxplots of the
outcome that we are trying to predict
8. Let's examine how well the pruned tree performs on the test data. Let's
create an object of the predicted values using the predict() function &
incorporate the test data. Then, calculate the errors & finally the mean of the
squared errors
party.pros.test = predict(prune.tree.pros, newdata=pros.test)
rpart.resid = party.pros.test - pros.test$lpsa #Calculate the residuals
mean(rpart.resid^2) #Calculate MSE
Caselet 16 – iris Dataset (In-built)
Anderson collected & measured hundreds of irises in an effort to study variation
between & among the different species. There are 260 species of iris. This data
set focuses on three of them (Iris setosa, Iris virginica and Iris versicolor). Four
features were measured on 50 samples for each species: sepal width, sepal
length, petal width, and petal length. Anderson published it in "The irises of the
Gaspe Peninsula", which originally inspired Fisher to develop LDA
• iris is a data frame with 150 cases (rows) and 5 variables (columns) named
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
• iris3 gives the same data arranged as a 3-dimensional array of size 50 by 4
by 3, as represented by S-PLUS. The first dimension gives the case number
within the species subsample, the second the measurements with
names Sepal L., Sepal W., Petal L., and Petal W., and the third the species.
In this example we are going to try to predict the Sepal.Length
1. In order to build our decision tree, first we need to install the correct package
head(iris)
install.packages("rpart")
library(rpart)
2. Next we are going to create our tree. Since we want to predict Sepal.Length – that
will be the first element in our fit equation
fit <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width+ Species,
method="anova", data=iris )
3. Note the method in this model is anova. This means we are going to try to predict a
number value. If we were doing a classifier model, the method would be class
4. Now let’s plot out our model
plot(fit, uniform=TRUE, main="Regression Tree for Sepal Length")
text(fit, use.n=TRUE, cex = .6)
outtree = ctree(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width+ Species, data
= iris) #Alternate Method
plot(outtree)
5. Note the splits are marked - for example, the top split is Petal.Length < 4.25. Also, at the
terminating point of each branch, you see an n which specifies the number of
elements from the data file that end up in that terminal node
6. Finally now that we know the model is good, let’s make a prediction
testData <-data.frame (Species = 'setosa', Sepal.Width = 4, Petal.Length =1.2,
Petal.Width=0.3)
predict(fit, testData, method = "anova")
Interpretation: - The model predicts our Sepal.Length will be approx 5.17
Case Study
Predicting Complex Skill Learning
• https://archive.ics.uci.edu/ml/datasets/SkillCraft1+Master+
Table+Dataset
• Mastering Predictive Analytics with R – James D.Miller,
Rui Miguel Forte – Second Edition – Packt Publishing –
August 2017 - Page 226
Overview of Non-Linear Models – Random Forest Regression
• To greatly improve a model's predictive ability, we can produce numerous trees
& combine the results
• Random forest technique does this by applying two different tricks in model
development
1. Use of bootstrap aggregation or bagging – An individual tree is built on a
random sample of the dataset, roughly two-thirds of the total observations
(remaining one-third are referred to as out-of-bag (oob)). This is repeated dozens/
hundreds of times & the results are averaged. Each of these trees is grown & not
pruned based on any error measure, & this means that the variance of each of
these individual trees is high
2. Concurrently with the random sampling of the data (i.e., bagging), it also takes a
random sample of the input features at each split
• In the randomForest package, we will use the default random number of
predictors that are sampled, which, for classification problems, is the square
root of the total number of predictors & for regression is the total number of
predictors divided by three
• The number of predictors the algorithm randomly chooses at each split can be
changed via the model tuning process
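For instance, a hedged sketch of how the number of predictors sampled at each split could be set explicitly (train_df and the response y are placeholders; mtry = 4 is purely illustrative):
library(randomForest)
# Default mtry is floor(p/3) for regression and floor(sqrt(p)) for classification;
# it can be overridden directly when the model is built
rf_custom = randomForest(y ~ ., data = train_df, mtry = 4)
# tuneRF() in the same package searches over candidate mtry values using the OOB error, e.g.
# tuneRF(x = train_df[, setdiff(names(train_df), "y")], y = train_df$y)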
• We will use the randomForest package. General syntax to create a random
forest is to use the randomForest() function & specify the formula & dataset
as the 2 primary arguments
• Recall that, for regression, default variable sample per tree iteration is p/3,
where p is equal to the number of predictor variables in the data frame
Caselet 17 – Prostate Cancer (in-built Dataset)
Prostate cancer dataset is a data frame with 97 observations on 10 variables. A
short description of each of the variables is listed below
lcavol - log cancer volume
lweight- log prostate weight
age - in years
lbph - log of the amount of benign prostatic hyperplasia
svi - seminal vesicle invasion
lcp - log of capsular penetration
gleason - a numeric vector
pgg45 - percent of Gleason score 4 or 5
lpsa - response
train - a logical vector
The last column indicates which 67 observations were used as the "training set"
and which 30 as the test set. Given the data, examine the correlation between
the level of prostate-specific antigen & a number of clinical measures in men
who were about to receive a radical prostatectomy.
1. Ensure that the packages & libraries are installed
install.packages("randomForest") # Random Forest Regression
library(randomForest)
install.packages("partykit") # Tree Plots
library(partykit)
install.packages("MASS") # Breast & pima Indian data
library(MASS)
install.packages("ElemStatLearn") # Prostate data
library(ElemStatLearn)
2. We will first apply regression on the prostate data. It involves calling the
dataset, coding the gleason score as an indicator variable using the
ifelse() function & creating the test & train sets using the train column. The train
set will be pros.train & the test set will be pros.test
data(prostate)
prostate$gleason = ifelse(prostate$gleason == 6,0,1)
pros.train = subset(prostate, train==TRUE)[,1:9]
pros.test = subset(prostate, train==FALSE)[,1:9]
3. To build a random forest regression on the train data, lets use
randomForest() function
rf.pros = randomForest(lpsa~., data=pros.train)
4. Call this object using print() function
print(rf.pros)
5. Examine the splits graphically
dev.off() #Close any open graphics device so that the next plot starts fresh
plot(rf.pros)
Interpretation
• Call of the rf.pros object shows us that the random forest generated 500
different trees & sampled 2 variables at each split
• Result of MSE of 0.68 & nearly 53 per cent of the variance explained
• Let's see if we can improve on the default number of trees. Too many trees
may lead to overfitting
• How much is too many depends on the data. 2 things that can help out are:
1. Plot rf.pros
2. Ask for the minimum MSE
Interpretation
• Plot shows the MSE by the number of trees in the model
• We can see that as the trees are added, significant improvement in MSE
occurs early on & then result in flat lines just before 100 trees are built in the
forest
6. We can identify the specific & optimal
tree with the which.min() function
which.min(rf.pros$mse)
7. We can try 193 trees in the random forest by just specifying ntree=193 in
the model
set.seed(123)
rf.pros.2=randomForest(lpsa~., data=pros.train, ntree=193)
print(rf.pros.2)
Interpretation
• We can see that MSE & Variance explained have both improved slightly
• Let's see one more plot before testing the model. If we are combining the results
of 193 trees that are built using bootstrapped samples & only 2 random
predictors, we will need a way to determine the drivers of the outcome
8. Only one tree alone cannot be used to paint this picture but we can produce
a variable importance plot & corresponding list. Y-axis is the list of variables
in descending order of importance & x-axis is the % of improvement in
MSE
varImpPlot(rf.pros.2, main="Variable Importance Plot - PSA Score")
Interpretation
Consistent with the single tree, lcavol is the most important variable & lweight is the second-
most important variable
9. To examine the raw numbers, use the importance() function
importance(rf.pros.2)
10. Now, let's examine how it does on the test dataset
rf.pros.test = predict(rf.pros.2, newdata=pros.test)
rf.pros.test
rf.resid = rf.pros.test - pros.test$lpsa #Calculate the residuals
mean(rf.resid^2) #Calculate MSE
CLASSIFICATION
Using R Software Tool
Classification Trees
• Classification trees operate under the same principle as regression trees,
except that the splits are determined not by the RSS but by an error rate
• Building classification trees using the CART methodology continues the
notion of recursively splitting up groups of data points in order to minimize
some error function
• What we would like is a measure of node purity that scores nodes
based on whether they contain data points primarily belonging to one of the
output classes
1. One possible measure of node purity is the Gini index. To calculate the Gini index at a
particular node in a tree, we compute the ratio of the number of data points labeled as class k
over the total number of data points as an estimate of the probability of a data point
belonging to class k at the node in question. The Gini index for a completely pure node (a node
with only one class) is 0. For a binary output with equal proportions of the 2 classes, the Gini
index is 0.5
• Another commonly used criterion is deviance. All nodes that have the same proportion of data
points across the different classes will have the same value of the Gini index, but if they have
different numbers of observations, they will have different values of deviance
• Beyond this, the CART methodology for building a classification tree is exactly parallel to that for
building a regression tree. The tree is post-pruned using the same cost-complexity approach
outlined for regression trees, but after replacing the SSE as the error function with either the
Gini index or the deviance
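A minimal sketch of these two node-purity measures for a single node (node_classes is an assumed vector of the class labels falling in the node; the function names are illustrative):
# Gini index of a node: 1 - sum_k p_k^2 (0 for a pure node, 0.5 for a balanced binary node)
gini_index = function(node_classes) {
  p = table(node_classes) / length(node_classes)
  1 - sum(p^2)
}
# Deviance of a node: -2 * sum_k n_k * log(p_k); depends on the counts, not just the proportions
node_deviance = function(node_classes) {
  n_k = table(node_classes)
  p = n_k / length(node_classes)
  -2 * sum(n_k * log(p))
}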
CASE STUDY
Biopsy Data on Breast Cancer Patients
Dr. William H. Wolberg from the University of Wisconsin commissioned the
Wisconsin Breast Cancer Data in 1990. His goal of collecting the data was to
identify whether a tumor biopsy was malignant or benign. His team collected the
samples using Fine Needle Aspiration (FNA). If a physician identifies the tumor
through examination or imaging an area of abnormal tissue, then the next step is
to collect a biopsy. FNA is a relatively safe method of collecting the tissue &
complications are rare. Pathologists examine the biopsy & attempt to determine
the diagnosis (malignant or benign). As you can imagine, this is not a trivial
conclusion. Benign breast tumors are not dangerous as there is no risk of the
abnormal growth spreading to the other body parts. If a benign tumor is large
enough, surgery might be needed to remove it. On the other hand, a malignant
tumor requires medical intervention. The level of treatment depends on a
number of factors but most likely will require surgery, which can be followed by
radiation and/ or chemotherapy. Therefore, the implications of a misdiagnosis can
be extensive. A false positive for malignancy can lead to costly & unnecessary
treatment, subjecting the patient to a tremendous emotional & physical burden.
On the other hand, a false negative can deny a patient the treatment that they
need, causing the cancer to spread & leading to premature death. Early treatment
intervention in breast cancer patients can greatly improve their survival.
Our task is to develop the best possible diagnostic machine learning algorithm
in order to assist the patient's medical team in determining whether the tumor is
malignant or not.
Dr. William H. Wolberg obtained breast cancer database from the University of
Wisconsin Hospitals, Madison & the biopsies of breast tumours for 699 patients
up to 15 July 1992 were assessed; with each of nine attributes scored on a
scale of 1 to 10, and the outcome also known. There are 699 rows and 11
columns. This data frame contains the following columns
ID - sample code number (not unique)
V1 - clump thickness
V2 - uniformity of cell size
V3 - uniformity of cell shape
V4 - marginal adhesion
V5 - single epithelial cell size
V6 - bare nuclei (16 values are missing)
V7 - bland chromatin
V8 - normal nucleoli
V9 - mitoses
class - "benign" or "malignant"
1. Ensure that the packages & libraries are installed. The data
frame is available in the R MASS package under the biopsy
name
install.packages("MASS") #Breast & pima Indian data
library(MASS)
install.packages("reshape2") #Plotting the data
library(reshape2)
install.packages("ggplot2") #Plotting the data
library(ggplot2)
install.packages("corrplot") #Graphically represent correlation
library(corrplot)
install.packages("partykit") # Tree Plots
library(partykit)
classify_biopsy_1 = biopsy
str(classify_biopsy_1) #Examine the underlying structure of the data
Interpretation
• An examination of the data structure shows that our features are integers & the
outcome is a factor. No transformation of the data to a different structure is needed
• Depending on the package that we are using to analyse data, the outcome needs to
be numeric, which is 0 or 1
Interpretation
• The correlation coefficients indicate collinearity (collinearity, also referred to as
multicollinearity, generally occurs when there are high correlations between 2 or
more predictor variables; in other words, one predictor variable can be used to predict
another), in particular between the features of uniform shape & uniform size
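The intermediate preparation steps that produce the working data frame classify_biopsy_2 are not shown above; a minimal sketch, mirroring the clean-up used for the other biopsy case studies in this deck (drop the ID column, optionally rename the features, and remove the 16 observations with missing values):
classify_biopsy_1$ID = NULL #Delete column ID
names(classify_biopsy_1) = c('Thickness','u.size','u.shape','adhesion','s.size','nuclei',
'chromatin','normal_nucleoli','mitoses','class') #Optional renaming, as done later in this deck
classify_biopsy_2 = na.omit(classify_biopsy_1) #Drop the 16 incomplete observations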
6. The final task in data preparation will be the creation of our train &
test datasets
• In machine learning we should not be concerned with how well we can
predict the current observations, but should focus on how well we can
predict observations that were not used to create the algorithm
• So, we create & select the best algorithm using the training data such that it
maximizes our predictions on the test set
• There are a number of ways to proportionally split our data into train & test
sets: 50/50, 60/40, 70/30, 80/20 & so forth. The data split you select should
be based on your experience & judgement
• In this case, let's use a 70/30 split
set.seed(123) #Random number generator
index_sample = sample(2, nrow(classify_biopsy_2), replace=TRUE,
prob=c(0.7,0.3))
biop.train = classify_biopsy_2[index_sample==1,] #Training data set
biop.test = classify_biopsy_2[index_sample==2,] #Test data set
str(biop.test[,10]) #Confirm the structure of Test data set
7. Create the tree & examine the table for the optimal number of splits
set.seed(123) #Random number generator
tree.biop = rpart(class~.,data=biop.train)
print(tree.biop$cptable)
cp=min(tree.biop$cptable[3,]) #xerror is minimum in row 3 i.e., cptable[3,]
8. Prune the tree, plot the full & prune trees
prune.tree.biop=prune(tree.biop,cp=cp)
plot(as.party(tree.biop))
plot(as.party(prune.tree.biop))
Interpretation
• The cross-validation error is at a minimum with only 2 splits (row 3)
• Examination of the tree plots shows that the uniformity of the cell size is the
first split, then nuclei
• The full tree had an additional split at the cell thickness
9. Check how it performs on the test dataset
Predict the test observations using type="class" in the predict() function
rparty.test = predict(prune.tree.biop, newdata=biop.test, type="class")
dim(biop.test)
table(rparty.test, biop.test$class)
10. Calculate the accuracy of prediction
(136+64)/209
[1] 0.9569378
Interpretation
• The basic tree with just 2 splits gets us almost 96 per cent accuracy
• This still falls short of 97.6 per cent with logistic regression
• It encourages us to believe that we can improve on this with the upcoming
methods, starting with random forests
Random Forest Classification
• A random forest is an ensemble learning approach to supervised learning
• Multiple predictive models are developed, & the results are aggregated to
improve classification rates
• The algorithm for a random forest involves sampling cases & variables to
create a large number of decision trees. Each case is classified by each
decision tree. The most common classification for that case is then used as
the outcome
• Assume that N is the number of cases in the training sample &
M is the number of variables. Algorithm is as follows:
1. Grow a large number of decision trees by sampling N cases with replacement from
the training set
2. Sample m < M at each node. These variables are considered candidates for
splitting in that node. The value m is the same for each node
3. Grow each tree fully without pruning
4. Terminal nodes are assigned to a class based on the mode of cases in that node
5. Classify new cases by sending them down all the trees & taking a vote – majority
rules
• An out-of-bag (oob) error estimate is obtained by classifying the
cases that are not selected when building a tree
• Random forests also provide a measure of variable importance
• Random forests are grown using the randomForest() function in
the randomForest package
• The default number of trees is 500, the default number of
variables sampled at each node is sqrt(M), & the minimum node
size is 1
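A hedged sketch of how these defaults could be overridden when the forest is grown (df and the factor response class are placeholders; the specific values are illustrative):
library(randomForest)
rf_model = randomForest(class ~ ., data = df,
                        ntree = 1000,  # default number of trees is 500
                        mtry = 4,      # default for classification is floor(sqrt(M))
                        nodesize = 5)  # default minimum node size for classification is 1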
1. Ensure that the packages & libraries are installed. The data
frame is available in the R MASS package under the biopsy
name
install.packages("MASS") #Breast & pima Indian data
library(MASS)
install.packages("randomForest") #Create random forest
library(randomForest)
classify_biopsy_rf1 = biopsy
str(classify_biopsy_rf1) #Examine the underlying structure of the data
2. We now get rid of the ID column
classify_biopsy_rf1$ID=NULL #Delete column ID
View(classify_biopsy_rf1)
3. Delete the missing observations. As there are only 16 observations
with missing data, it is safe to get rid of them as they account for only about 2
per cent of the total observations. In deleting these observations, a
new working data frame is created with the na.omit() function
classify_biopsy_rf2 = na.omit(classify_biopsy_rf1)
4. The final task in data preparation will be the creation of our train &
test datasets
• As discussed earlier, we hold out a test set so that we can evaluate how well the
model predicts observations that were not used to create it
• In this case, let's again use a 70/30 split
set.seed(123) #Random number generator
index_sample = sample(2, nrow(classify_biopsy_rf2), replace=TRUE,
prob=c(0.7,0.3))
rf.biop.train = classify_biopsy_rf2[index_sample==1,] #Training data set
rf.biop.test = classify_biopsy_rf2[index_sample==2,] #Test data set
str(rf.biop.test[,10]) #Confirm the structure of Test data set
5. Create the random forest
set.seed(123) #Random number generator
rf.biop = randomForest(class~., data=rf.biop.train)
print(rf.biop)
Interpretation
• The OOB error rate is 3.38%
• This is considering all the 500 trees factored into the analysis with 3
variables at each split
6. Plot the Error by trees
plot(rf.biop)
Interpretation
• The plot shows that the minimum error & standard error are lowest with quite
a few trees
7. Let’s now pull the exact number of trees with minimum error &
standard error using which.min() function
which.min(rf.biop$err.rate[,1]) #OOB is the first variable of the object err.rate
Interpretation
• Train set error is below 3 per cent
• Model even performs better on the test set where we have only 4
observations misclassified out of 209 & none were false positives
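The intermediate steps that produce rf.biop.2 and the test-set confusion matrix are not reproduced on these slides; a minimal sketch of what they would roughly look like, reusing the objects created above (rf.biop.2 is the name used in the next step; rf.biop.test.pred is an illustrative name):
set.seed(123)
rf.biop.2 = randomForest(class~., data=rf.biop.train,
                         ntree=which.min(rf.biop$err.rate[,1])) #Forest with the minimum-OOB-error tree count
print(rf.biop.2) #Train set (OOB) performance
rf.biop.test.pred = predict(rf.biop.2, newdata=rf.biop.test, type="response")
table(rf.biop.test.pred, rf.biop.test$class) #Test set confusion matrix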
11. Let's have a look at the variable importance plot
varImpPlot(rf.biop.2)
Interpretation
• The importance of this plot is each variables contribution to the mean decrease in the Gini index
• This is rather different from the splits of the single tree
• Building random forests is potentially a powerful technique that not only has predictive ability but
also in feature selection
Classification – Logistic Regression
• Logistic regression belongs to a class of models known as Generalized
Linear Models (GLMs)
• Binary Logistic Regression is useful when we are predicting a binary
outcome from a set of predictors x. The predictors can be continuous,
categorical or a mix of both
• Logistic regression is a method for fitting a regression curve, y = f(x), when
y consists of proportions or probabilities, or binary coded (0,1--
failure, success) data
• The categorical variable y, in general, can assume different values
• The logistic curve is an S-shaped or sigmoid
curve
• Logistic regression fits b0 and b1, the regression
coefficients (which are 0 and 1, respectively, for
the graph depicted)
• This curve is not linear
• Logit transform makes it linear logit(y) = b0 + b1x
• Hence, logistic regression is linear regression on
the logit transform of y, where y is the proportion
(or probability) of success at each value of x
• Avoid fitting a traditional least-squares
regression here, as neither the normality nor the
homoscedasticity assumption is satisfied
• Basic syntax for the glm function is
glm(formula, data, family)
where formula is the symbol presenting the relationship between the variables
data is the data set giving the values of these variables
family is the R object used to specify the details of the model; its value is binomial for logistic regression
• The logistic function is given by
y = e^(b0 + b1x) / (1 + e^(b0 + b1x))
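As a quick illustration of the logistic function itself (the coefficients b0 and b1 and the inputs x below are arbitrary illustrative values, not estimates from any dataset):
b0 = -3; b1 = 0.8 #illustrative coefficients
x = seq(0, 10, by = 0.5) #illustrative predictor values
y = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)) #logistic (sigmoid) function; equivalently plogis(b0 + b1*x)
plot(x, y, type="l", ylab="probability") #the characteristic S-shaped curve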
CASE STUDY
Biopsy Data on Breast Cancer Patients
The business context and data description for this case study are the same as given
earlier: the Wisconsin Breast Cancer biopsy data commissioned by Dr. William H. Wolberg
(699 patients, nine features V1-V9 scored on a scale of 1 to 10, 16 missing values of V6,
and the outcome class = "benign" or "malignant"). Our task remains to develop the best
possible diagnostic machine learning algorithm to assist the patient's medical team in
determining whether the tumor is malignant or not.
Business Understanding
The goal of collecting the data is to identify whether a tumor
biopsy was malignant or benign. Implications of a misdiagnosis
can be extensive. A false positive for malignancy can lead to
costly and unnecessary treatment, subjecting the patient to a
tremendous emotional and physical burden. On the other hand, a
false negative can deny a patient the treatment that they need,
causing the cancer to spread and leading to premature death.
Early treatment intervention in breast cancer patients can greatly
improve their survival.
Our task here is to develop the best possible diagnostic machine
learning algorithm in order to assist the patient’s medical team in
determining whether the tumor is malignant or not
1. Ensure that the packages & libraries are installed. The data frame is
available in the R MASS package under the biopsy name
install.packages("MASS") #Breast & pima Indian data
library(MASS)
install.packages("car") #VIF Statistics
library(car)
2. Make a copy of the original dataset. Get rid of ID column
lr_biopsy=biopsy
lr_biopsy$ID = NULL #Delete column ID
3. Delete the missing observations. As there are only 16 observations with
missing data, it is safe to get rid of them as they account for only about 2 per cent of
the total observations. In deleting these observations, a new working data
frame is created with the na.omit() function
logreg_biopsy = na.omit(lr_biopsy)
names(logreg_biopsy)=c('Thickness','u.size','u.shape','adhesion','s.size','nuclei',
'chromatin','normal_nucleoli','mitoses','class')
names(logreg_biopsy)
4. The final task in data preparation will be the creation of our train & test
datasets
index_lr = sample(2, nrow(logreg_biopsy), replace=TRUE, prob=c(0.7,0.3))
train = logreg_biopsy[index_lr==1,] #Training data set
test = logreg_biopsy[index_lr==2,] #Test data set
5. For implementing Logistic Regression, R installation comes with the glm()
function that fits the generalized linear models, with family=binomial
set.seed(123) #Random number generator
lr.fit = glm(class~., family=binomial, data=train)
6. The summary() function allows us to inspect the coefficients & their p-values
summary(lr.fit)
Interpretation
• The higher the absolute value of z-statistic, the more likely it is that this particular
feature is significantly related to our output variable
• From the model summary, we see that Thickness (3.280,0.001039) & nuclei
(3.344,0.000826) are the strongest predictors for breast cancer. Both thickness &
nuclei are statistically significant
• P-values next to the z-statistic express this as probability
• Number of input features have relatively high p-values indicating that they are
probably not good indicators of breast cancer in the presence of other features
• The logistic regression coefficients give the change in the log odds of the outcome for
a one unit increase in the predictor variable
• For every one unit change in Thickness, the log odds of the tumor being malignant
(versus benign) increase by 0.5252
• For a one unit increase in nuclei, the log odds of malignancy increase by
0.4057
Interpretation
For a one unit increase in Thickness, the odds of
the tumor being malignant increase by a factor of
1.69 & for a one unit increase in nuclei, the odds
increase by a factor of 1.50
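These odds-ratio factors come from exponentiating the logistic regression coefficients; a minimal sketch using the lr.fit model fitted above:
exp(coef(lr.fit)) #Converts the log-odds coefficients into odds ratios
exp(confint(lr.fit)) #Approximate 95% confidence intervals on the odds-ratio scale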
10. Multicollinearity, a potential issue identified during data exploration, is taken care of by the vif() function
vif(lr.fit)
NOTE
• Collinearity, or excessive correlation among explanatory variables, can complicate or prevent the
identification of an optimal set of explanatory variables for a statistical model
• A simple approach to identify collinearity among explanatory variables is the use of variance
inflation factors (VIF)
• A VIF is calculated for each explanatory variable and those with high values are removed. The
definition of 'high' is somewhat arbitrary but values in the range of 5-10 are commonly used
Interpretation: - None of the values are greater than the VIF rule-of-thumb value of 5, so collinearity
does not seem to be a problem
Interpretation
• It appears that we have done
pretty well in creating a model
with all the features
• Roughly 98 per cent
prediction rate is quite
impressive
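The prediction code behind this figure is not shown on these slides; a minimal sketch of how such a test-set accuracy could be computed from lr.fit and the test set created above (the 0.5 cut-off and the object names test.prob/test.pred are illustrative):
test.prob = predict(lr.fit, newdata=test, type="response") #Predicted probabilities of malignancy
test.pred = ifelse(test.prob > 0.5, "malignant", "benign") #Classify at a 0.5 cut-off
table(test.pred, test$class) #Confusion matrix
mean(test.pred == test$class) #Overall prediction accuracy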
Estimate the probability of the tumor being malignant for a patient with a
score for Thickness (V1) of 12 and bare nuclei (V6) of 12
• Apply the function glm() that describes the Thickness (V1) and nuclei (V6)
lr = glm(class~V1+V6, data=lr_biopsy, family=binomial)
• We then wrap the test parameters inside a data frame
predictdata=data.frame(V1=12, V6=12) #Thickness score = 12 and nuclei score = 12
• Now we apply the function predict() to the generalized linear model lr along with
predictdata. We will have to select response prediction type in order to obtain the
predicted probability.
predict(lr,predictdata,type="response")
Interpretation
• For a patient with a V1 score of 12 and a V6
score of 12, the predicted probability that the
tumor is malignant is about 99.9 per cent
• For a patient with a V1 score of 5 and a V6
score of 7, the predicted probability that the
tumor is malignant is about 91 per cent
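The second probability can be reproduced in the same way; a minimal sketch reusing the lr model fitted above:
predict(lr, data.frame(V1=5, V6=7), type="response") #Predicted probability for V1 = 5 and V6 = 7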
mtcars (Motor Trend Car Road Tests) Dataset
The data was extracted from the 1974 Motor Trend US magazine,
and comprises fuel consumption and 10 aspects of automobile
design and performance for 32 automobiles (1973–74 models).
The description of the 11 numeric variables with the 32
observations in the data frame are as follows:
1. [,1] mpg – Miles/ (US) gallon
2. [,2] cyl – Number of Cylinders
3. [,3] disp – Displacement (cu.in.)
4. [,4] hp – Gross Horsepower
5. [,5] drat – Rear Axle Ratio
6. [,6] wt – Weight (1000 lbs)
7. [,7] qsec – ¼ Mile Time
8. [,8] vs – Engine Shape (0 = V-shaped, 1 = straight)
9. [,9] am –Transmission (0=automatic, 1=manual)
10.[,10] gear – Number of Forward Gears
11.[,11] carb – Number of Carburetors
Prepare a managerial report.
In the mtcars dataset, the transmission mode (automatic or manual) is
described by the column am, which is a binary value (0 or 1). Let's
create a logistic regression model for the column vs using 2 other
columns - wt and disp (multiple predictors)
#Select some columns from mtcars
sample=subset(mtcars, select=c(mpg, vs, am, cyl, wt, hp, disp))
print(head(sample))
vsdata = glm(formula= vs ~ wt + disp, data=sample, family=binomial) # equivalently family=binomial(link="logit")
print(summary(vsdata))
Interpretation
• weight influences vs positively, while displacement has a slightly negative effect
• We also see that the coefficient of weight is non-significant (p > 0.05), while the coefficient
of displacement is significant
• The estimates (coefficients of the predictors weight and displacement) are now in units
called logits
• The logistic regression coefficients give the change in the log odds of the outcome for a
one unit increase in the predictor variable
• For every one unit change in wt, the log odds of vs increases by 1.62635 and for every one
unit change in disp, the log odds of vs decrease by 0.03443
• exp(coef(vsdata))
Now we can say that for a one unit increase in wt, the odds of vs increase by a factor
of 5.085 and a one unit increase in disp, the odds of vs increase by a factor of 0.966
• The data and logistic regression model can be plotted with ggplot2 or base graphics:
install.packages("ggplot2")
library(ggplot2)
ggplot(vsdata, aes(x=wt, y=vs)) + geom_point() + stat_smooth(method="glm",
method.args=list(family="binomial"), se=FALSE)
par(mar = c(4, 4, 1, 1)) # Reduce some of the margins so that the plot fits better
• Estimate the probability of a vehicle being fitted with a manual
transmission if it has a 120hp engine and weighs 2800 lbs
• Apply the function glm() that describes the transmission type (am) by the horsepower (hp)
and weight (wt)
amglm=glm(formula=am~hp+wt, data=sample, family=binomial)
• We then wrap the test parameters inside a data frame
predictdata = data.frame(hp=120, wt=2.8)
• Now we apply the function predict() to the generalized linear model amglm along with
predictdata. We will have to select response prediction type in order to obtain the predicted
probability.
predict(amglm, predictdata, type="response")
• Interpretation - For an automobile with 120hp engine and 2800 lbs weight, the probability
of it being fitted with a manual transmission is about 64%.
• We want to create a model that helps us to predict the probability of a
vehicle having a V engine or a straight engine given a weight of 2100
lbs and engine displacement of 180 cubic inches
• model <- glm(formula= vs ~ wt + disp, data=mtcars, family=binomial)
• summary(model)
• Interpretation - We see from the estimates of the coefficients that weight influences vs
positively, while displacement has a slightly negative effect
• newdata = data.frame(wt = 2.1, disp = 180)
• predict(model, newdata, type="response")
• Interpretation - The predicted probability is 0.24
Classification/ Dimension Reduction – Linear Discriminant
Analysis (LDA)
• Discriminant analysis is also known as Fisher Discriminant Analysis (FDA)
• It can be an effective alternative to logistic regression when the classes are well-
separated
• Linear Discriminant analysis (LDA) is a multivariate classification (and dimension
reduction) technique that separates objects into two or more mutually exclusive
groups based on measurable features of those objects. The measurable features are
sometimes called predictors or independent variables, while the classification group is
the response or what is being predicted
• It is an appropriate technique when the dependent variable is categorical (nominal or
non-metric) and the independent variables are metric. The single dependent variable
can have two, three or more categories
• DA uses data where the classes are known beforehand to create a model
that may be used to predict future observations
• Types of Discriminant Analysis
1. Linear Discriminant Analysis (LDA) - assumes that the covariance of the
independent variables is equal across all classes
2. Quadratic Discriminant Analysis (QDA) - does not assume equal covariance
across the classes
• Both LDA and QDA require
• Number of independent variables should be less than the sample size
• Assumes multivariate normality among the independent variables, i.e.,
independent variables come from a normal (or Gaussian) distribution
• DA uses Bayes' theorem in order to determine the probability of each class
membership for each observation
• Fit a linear discriminant analysis with the function lda()
• Advantage: - Elegantly simple
• Limitation: - The assumption that the observations of each class have a multivariate
normal distribution & that there is a common covariance across the classes
• The process of attaining the posterior probabilities goes through
following steps:
• Collect data with a known class membership
• Calculate the prior probabilities – this represents the proportion of the sample that
belongs to each class
• Calculate the mean for each feature by their class
• Calculate the variance-covariance matrix for each feature. If it is Linear
Discriminant Analysis (LDA), this is a pooled matrix across all the classes, giving
us a linear classifier; if Quadratic Discriminant Analysis (QDA), a variance-
covariance matrix is created for each class
• Estimate the normal distribution for each class
• Compute the discriminant function that is the rule for the classification of a new
object
• Assign an observation to a class based on the discriminant function
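A minimal sketch of these steps with the lda() function from the MASS package, using the in-built iris data purely for illustration (the biopsy and iris case studies that follow walk through this in detail):
library(MASS)
fit = lda(Species ~ ., data = iris)
fit$prior #Prior probabilities: the proportion of the sample in each class
fit$means #Mean of each feature by class
head(predict(fit)$posterior) #Posterior probability of each class for the first few observations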
CASE STUDY
Biopsy Data on Breast Cancer Patients
The business context and data description for this case study are the same as given
earlier: the Wisconsin Breast Cancer biopsy data commissioned by Dr. William H. Wolberg
(699 patients, nine features V1-V9 scored on a scale of 1 to 10, 16 missing values of V6,
and the outcome class = "benign" or "malignant"). Our task remains to develop the best
possible diagnostic machine learning algorithm to assist the patient's medical team in
determining whether the tumor is malignant or not.
1. Ensure that the packages & libraries are installed. The data frame is
available in the R MASS package under the biopsy name
install.packages("MASS") #Breast & pima Indian data
library(MASS)
install.packages("car") #VIF Statistics
library(car)
da_biopsy=biopsy
str(da_biopsy) #Examine the underlying structure of the data
2. We now get rid of the ID column
da_biopsy$ID = NULL #Delete column ID
View(da_biopsy)
3. Rename the variables & confirm that the code has worked as intended
names(da_biopsy)=c('Thickness','u.size','u.shape','adhesion','s.size','nuclei',
'chromatin','normal_nucleoli','mitoses','class')
names(da_biopsy)
4. Delete the missing observations. As there are only 16 observations with
missing data, it is safe to get rid of them as they account for only about 2 per cent of
the total observations. In deleting these observations, a new working data
frame is created with the na.omit() function
lda_biopsy = na.omit(da_biopsy) #Delete observations with missing data
5. The final task in data preparation will be the creation of our train & test
datasets
• As discussed earlier, we hold out a test set so that we can evaluate how well the
model predicts observations that were not used to create it
• In this case, let's again use a 70/30 split
set.seed(123) #Random number generator
index_lda = sample(2, nrow(lda_biopsy), replace=TRUE, prob=c(0.7,0.3))
lda.train = lda_biopsy[index_lda==1,] #Training data set
lda.test = lda_biopsy[index_lda==2,] #Test data set
dim(lda.train)
dim(lda.test)
6. R installation comes with the lda() function that performs Linear
Discriminant Analysis
lda.fit = lda(class~., data=lda.train)
lda.fit
Interpretation
• Prior probabilities of groups are approximately 64 per cent for benign & 36 per cent for
malignancy
• Group means is the average of each feature by their class
• Coefficients of linear discriminants are the standardized linear combinations of the features that
are used to determine an observation's discriminant score
• The higher the score, the more likely the classification is malignant
7. The plot() function will provide us with a histogram and/ or the densities of
the discriminant scores
plot(lda.fit, type="both")
Interpretation: - We see that there is some overlap in the groups, indicating that
there will be some incorrectly classified observations
8. The predict() function available with LDA provides a list of 3 elements
(class, posterior, & x).
• The class element is the prediction of benign or malignant
• posterior is the probability score of x being in each class
• x is the linear discriminant score
lda.predict1= predict(lda.fit) #Training dataset prediction
lda.train$lda = lda.predict1$class
table(lda.train$lda, lda.train$class)
mean(lda.train$lda==lda.train$class)
Interpretation: - We see that the LDA model has performed worse than the logistic
regression model on the training dataset
lda.predict2= predict(lda.fit, newdata=lda.test) #Test dataset prediction
lda.test$lda = lda.predict2$class
table(lda.test$lda, lda.test$class)
mean(lda.test$lda==lda.test$class)
Interpretation: - Better performance than on the training dataset. Still did not perform as well
as the logistic regression (96% against 98% with logistic regression)
iris Dataset (In-built)
The iris data and its background are as described earlier in this deck: 150 cases, four
measurements (sepal length/width and petal length/width) on 50 samples of each of three
species (Iris setosa, Iris virginica and Iris versicolor); the data that originally inspired
Fisher to develop LDA.
1. Ensure that the packages & libraries are installed. The lda() function is
available in the R MASS package
install.packages("MASS") #lda function
library(MASS)
install.packages("klaR") #Classification & Visualization functions
library(klaR)
da_iris=iris
View(da_iris)
2. Data preparation will be the creation of our train & test datasets
• As discussed earlier, we hold out a test set so that we can evaluate how well the
model predicts observations that were not used to create it
• In this case, the code below uses a 60/40 split
set.seed(123) #Random number generator
ind = sample(2, nrow(da_iris), replace=TRUE, prob=c(0.6,0.4))
train = da_iris[ind==1,] #Training data set
test = da_iris[ind==2,] #Test data set
3. R installation comes with the lda() function that performs Linear Discriminant Analysis
lda_fit = lda(Species~., data=train)
lda_fit
Interpretation
• Prior probabilities of groups show πi, the probability of randomly selecting an observation from
class i out of the total training set
• Because there are 50 observations of each species in the original data set (150 observations
total) we know that the prior probabilities should be close to 33.3% for each class
• Group means μi shows the mean value for each of the independent variables for each class i
• The Coefficients of linear discriminants are the coefficients for each discriminant
• Linear discriminant (LD1) is the linear combination:
(0.36∗Sepal.Length)+(2.22∗Sepal.Width)+(−1.78∗Petal.Length)+(−3.97∗Petal.Width)
• The Proportions of trace describes the proportion of between-class variance that is explained by
successive discriminant functions. As you can see LD1 explains 99% of the variance
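The proportion of trace can be verified from the singular values stored in the fitted object, and the discriminant scores themselves are returned by predict(); a short sketch using lda_fit from above:
lda_fit$svd^2 / sum(lda_fit$svd^2) #Proportion of between-class variance explained by each discriminant
head(predict(lda_fit)$x) #LD1 and LD2 scores for the training observations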
4. Plot the data using the basic plot function plot()
plot(lda_fit, col = as.integer(train$Species))
You can see that there are three distinct groups with some overlap between
virginica and versicolor
Plot the observations illustrating the separation between groups as well as
overlapping areas that are potential for mix-ups when predicting classes
plot(lda_fit, dimen = 1, type = "b")
• Using the partimat function from the klaR package provides an alternate way to plot
the linear discriminant functions
partimat(Species ~ ., data=train, method="lda")
• Partimat() outputs an array of plots for every combination of two variables. Think of
each plot as a different view of the same data. Colored regions delineate each
classification area. Any observation that falls within a region is predicted to be from a
specific class. Each plot also includes the apparent error rate for that view of the data
5. Next let’s evaluate the prediction accuracy of our model. First we’ll run the
model against the training set to verify the model fit by using the command
predict. The table output is a confusion matrix with the actual species as the
row labels and the predicted species at the column labels.
lda.pred1= predict(lda_fit) #Training dataset prediction
train$lda = lda.pred1$class
table(train$lda, train$Species)
mean(train$lda==train$Species)
Interpretation: - The total number of correctly predicted observations is the sum of
the diagonal. So this model fit the training data correctly for almost every
observation. Verifying the training set doesn’t prove accuracy, but a poor fit to the
training data could be a sign that the model isn’t a good one. Now let’s run our test
set against this model to determine its accuracy.
lda.pred2= predict(lda_fit, newdata = test) #Test dataset prediction
test$lda = lda.pred2$class
table(test$lda, test$Species)
mean(test$lda==test$Species)
When Sepal.Length=5.1,Sepal.Width=3.5,Petal.Length=1.4, Petal.Width=0.2,
what will be the class of this flower?
testdata=data.frame(Sepal.Length=5.1,Sepal.Width=3.5,Petal.Length=1.4,
Petal.Width=0.2)
predict(lda_fit, testdata)$class
Interpretation
• Based on the earlier plots, it makes sense that a few iris versicolor and iris virginica
observations may be mis-categorized
• Overall the model performs very well with the testing set with an accuracy of 96.7%
• For Sepal.Length=5.1,Sepal.Width=3.5,Petal.Length=1.4, Petal.Width=0.2, Setosa is
the class of the new flower
• Segmentation is a fundamental
requirement when dealing with
customer background/ sales
transactions, & similar data
• Appropriate segmentation of customer
data can provide insights to enhance a
company's performance by identifying
the most valuable customers or the
customers that are likely to leave the
company
[Figure: dendrogram over the objects a, b, c, d, e (groupings a, b, c, d, e; ab, de; cde; abcde) across steps 0 to 4]
• Divisive Approach (top-down)
Initialization: All objects stay in one cluster
Iteration: Select a cluster and split it into two sub-clusters; repeat until each leaf cluster contains only one object
• In addition to the distance measure, we need to specify the linkage between
the groups of observations
• How do we measure the dissimilarity between two clusters of observations? A
number of different cluster agglomeration methods (i.e., linkage methods)
have been developed to answer this question. The most common methods are:
1. Maximum or complete linkage clustering: It computes all pairwise dissimilarities
between the elements in cluster 1 and the elements in cluster 2, and considers the
largest value (i.e., maximum value) of these dissimilarities as the distance between
the two clusters. It tends to produce more compact clusters
2. Minimum or single linkage clustering: It computes all pairwise dissimilarities
between the elements in cluster 1 and the elements in cluster 2, and considers the
smallest of these dissimilarities as a linkage criterion. It tends to produce long,
“loose” clusters
3. Mean or average linkage clustering: It computes all pairwise dissimilarities between
the elements in cluster 1 and the elements in cluster 2, and considers the average
of these dissimilarities as the distance between the two clusters
4. Centroid linkage clustering: It computes the dissimilarity between the centroid for
cluster 1 (a mean vector of length p variables) and the centroid for cluster 2
5. Ward’s minimum variance method: It minimizes the total within-cluster variance. At
each step the pair of clusters with minimum between-cluster distance are merged
Linkage Methods of Clustering
[Figure: distances between Cluster 1 and Cluster 2 under single linkage (minimum distance), complete linkage (maximum distance), average linkage (average distance), and Ward's procedure]
[Figure: example dendrograms on six points comparing single-link and complete-link clustering]
• There are different functions available in R for computing hierarchical
clustering. The commonly used functions are:
• hclust [in stats package] and agnes [in cluster package] for agglomerative
hierarchical clustering (HC)
• diana [in cluster package] for divisive HC
• The height of the fusion, provided on the vertical axis, indicates the
(dis)similarity between two observations. The higher the height of the fusion,
the less similar the observations are
• The height of the cut to the dendrogram controls the number of clusters
obtained. In order to identify sub-groups (i.e. clusters), we can cut the
dendrogram with cutree() function
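A minimal sketch of agglomerative clustering with hclust() and cutree() (df is an assumed numeric data frame; the linkage method and the choice of 3 clusters are illustrative):
d = dist(scale(df), method="euclidean") #Distance matrix on standardized data
hc = hclust(d, method="ward.D2") #Linkage: "complete", "single", "average", "centroid" or "ward.D2"
plot(hc) #Dendrogram; the fusion height reflects dissimilarity
groups = cutree(hc, k=3) #Cut the dendrogram into 3 clusters
table(groups)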
• Advantages/ Limitations
• Hierarchical structures are informative but are not suitable for large
datasets
• Algorithm imposes hierarchical structure on data, even when it is not
appropriate
• Crucial question - How many clusters are to be considered?
• The complexity of hierarchical clustering is higher than k-means
Agglomerative Hierarchical Clustering (AHC)
1. We can perform AHC with hclust() function in R
Disadvantages
If the data is rearranged a different solution is generated every time
This procedure fails if you don't know exactly how many clusters you
should have in the first place
wine Dataset (In-built)
The dataset contains 13 chemical measurements (Alcohol, Malic acid, Ash,
Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoids
phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines
and Proline) and one variable Class, the label, for the cultivar or plant variety,
on 178 Italian wine samples. This data is the result of chemical analysis of
wines grown in the same region in Italy but derived from three different
cultivars.
Using Wine dataset cluster different types of wines.
1. Ensure that the packages & libraries are installed. The
library rattle is loaded in order to use the dataset wine
install.packages("rattle") #For wine dataset
library(rattle)
install.packages("cluster") #For plotting data
library(cluster)
data(wine, package='rattle')
head(wine)
str(wine) #Examine the underlying structure of the data
wine_clust=wine #Make a copy of the dataset wine
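The clustering code for this caselet is not shown on these slides; a hedged sketch of how the wine samples could be clustered hierarchically and compared against the known cultivar label (assuming the label is the first column of the data frame; Ward linkage and 3 clusters are natural but still illustrative choices):
wine_scaled = scale(wine_clust[,-1]) #Standardize the 13 chemical measurements (drop the label column)
hc_wine = hclust(dist(wine_scaled), method="ward.D2")
plot(hc_wine, labels=FALSE) #Dendrogram of the 178 wine samples
wine_groups = cutree(hc_wine, k=3) #Three clusters, matching the three cultivars
table(wine_groups, wine_clust[[1]]) #Compare the clusters with the actual cultivar label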
• We will also need SnowballC for stemming of the words, RColorBrewer for
color palettes in wordclouds, & the wordcloud package
install.packages("tm")
library(tm)
install.packages("SnowballC")
library(SnowballC)
install.packages("wordcloud")
library(wordcloud)
install.packages("RColorBrewer")
library(RColorBrewer)
set.seed(123)
lda3 = LDA(dtm, k=3, method="Gibbs") #topicmodels package
topics(lda3) #How many topics to create? Seems logical to solve for 3 or 4
lda4 = LDA(dtm, k=4, method="Gibbs") #topicmodels package
topics(lda4)
terms(lda3,10)
terms(lda4,10)
Unit I - Motivation; Foundations with R; Managing Data with R; Data
Visualization; Linear Algebra and Matrix Computing; Dimensionality
Reduction; Lazy Learning: Classification using Nearest Neighbors;
Probabilistic Learning: Classification using Naïve Bayes; Divide and
Conquer – Classification Using Decision Trees.
Unit II - Forecasting Numeric Data: Regression Models: Simple Linear
Regression, Multiple Linear Regression, Polynomial Regression, Support
Vector Regression (SVR), Decision Tree Regression, and Random Forest
Regression; Classification: Logistic Regression, K-Nearest Neighbours
(K-NN), Support Vector Machine (SVM), Kernel SVM, Naïve Bayes,
Decision Tree Classification, and Random Forest Classification;
Association Rule Learning: Apriori Algorithm; Clustering: K-means and
Hierarchical Clustering.
Unit III - Dimensionality Reduction: Principal Component Analysis, Linear
Discriminant Analysis, Kernel PCA; Why Machine Learning?;
Applications of Machine Learning; Specialized Machine Learning: Data
Formats and Optimization of Computation; Variable/ Feature Selection;
Regularized Linear Modeling and Controlled Variable Selection; Big
Longitudinal Data Analysis; Reinforcement Learning: Thompson
Sampling; Deep Learning; Text Mining and Natural Language
Processing.
Support Vector Machines (SVM)
• Logistic Regression & Discriminant Analysis (Classification
techniques), determined the probability of a predicted
observation (categorical response)
• Now, let's delve into two nonlinear techniques: K-Nearest
Neighbors (KNN) and Support Vector Machines (SVM)
• These methods can be used for continuous outcomes in
addition to classification problems though this section is only
limited to the latter
K-Nearest Neighbours (KNN)
• In previous efforts, models estimated coefficients/ parameters
for each included feature
• KNN has no parameters. So, it is called instance-based
learning. The labeled examples (inputs and corresponding
output labels) are stored & no action is taken until a new input
pattern demands an output value
• This method is commonly called lazy learning as no specific
model parameters are produced
Pima.tr & Pima.te Datasets
str(Pima.tr)
str(Pima.te)
3. Combine the datasets (Pima.tr & Pima.te) into single dataframe using
rbind() function, as both have similar data structures
pima = rbind(Pima.tr, Pima.te) #Combine (Pima.tr & Pima.te)
str(pima)
After specifying a random seed, we will create the train set object
with train.kknn(). This function asks for the maximum number of k
values (kmax), the distance (1 is the absolute/Manhattan distance and 2 is
the Euclidean distance), and the kernel. For this model, kmax will be set to
25 and distance will be 2
set.seed(123)
kknn.train = train.kknn(type~., data=train, kmax=25, distance=2,
kernel=c("rectangular", "triangular", "epanechnikov"))
A nice feature of the package is the ability to plot and compare the
results, as follows:
plot(kknn.train)
#1. Install the packages
install.packages("class") #k-nearest neighbors
library(class)
install.packages("kknn") #weighted k-nearest neighbors
library(kknn)
install.packages("e1071") #SVM
library(e1071)
install.packages("caret") #Select tuning parameters
library(caret)
install.packages("MASS") #Contains the data
library(MASS)
install.packages("reshape2") #melt() used below to reshape the data for plotting
library(reshape2)
install.packages("ggplot2") #Create boxplots
library(ggplot2)
#2. Combine the datasets
data(Pima.tr) #200 observations for 8 variables
str(Pima.tr)
data(Pima.te) #332 observations for 8 variables
str(Pima.te)
pima = rbind(Pima.tr, Pima.te) #Combine (Pima.tr & Pima.te)
str(pima)
#3. Plot the graph - Scale the data
pima.melt = melt(pima, id.var="type")
ggplot(data=pima.melt, aes(x=type, y=value)) + geom_boxplot() +
facet_wrap(~variable, ncol=2) #ggplot2 package
pima.scale = as.data.frame(scale(pima[,-8]))
str(pima.scale)
pima.scale$type = pima$type
pima.scale.melt = melt(pima.scale, id.var="type") #Repeat the boxplot
ggplot(data=pima.scale.melt, aes(x=type, y=value)) + geom_boxplot() +
facet_wrap(~variable, ncol=2)
#4. Check for Correlation between variables except response variable (type)
cor(pima.scale[-8])
#5. check the ratio of Yes and No in our response & create training & test
datasets
table(pima.scale$type)
set.seed(502)
ind = sample(2, nrow(pima.scale), replace=TRUE, prob=c(0.7,0.3))
train = pima.scale[ind==1,]
test = pima.scale[ind==2,]
str(train)
str(test)
#6. knn Modelling
grid1 = expand.grid(.k=seq(2,20, by=1))
control = trainControl(method="cv") #caret package
set.seed(502)
knn.train = train(type~., data=train, method="knn",
trControl=control,tuneGrid=grid1)
knn.train
knn.test = knn(train[,-8], test[,-8], train[,8], k=17)
table(knn.test, test$type)
(77+28)/147
#calculate Kappa
prob.agree = (77+28)/147 #accuracy
prob.chance = ((77+26)/147) * ((77+16)/147)
prob.chance
kappa = (prob.agree - prob.chance) / (1 - prob.chance)
kappa
set.seed(123)
#install package kknn
kknn.train = train.kknn(type~., data=train, kmax=25, distance=2,
kernel=c("rectangular", "triangular", "epanechnikov"))
plot(kknn.train)
kknn.train
More Classification Techniques – K-Nearest Neighbors &
Support Vector Machines
• K-Nearest Neighbours (KNN) & Support Vector Machines (SVM) are more
sophisticated classification techniques
Overview of Association Rule Mining &
Sentiment Analysis
• Text is a vast source of data for business
• Now let's form a co-occurrence matrix to analyze the above data & draw inferences

                 Orange Juice  Window Cleaner  Milk  Soda  Detergent
Orange Juice          4              1           1     2       2
Window Cleaner        1              2           1     1       0
Milk                  1              1           1     0       0
Soda                  2              1           0     3       1
Detergent             2              0           0     1       2

• Simple patterns derived from the observation
• Orange Juice & soda are more likely purchased together than any other 2 items
• Detergent is never purchased with milk or window cleaner
• Milk is never purchased with soda or detergent
How good is Association Rule?
• The following 3 terms are important constraints on which
the association rules are made