Logistic regression is a machine learning classification algorithm that predicts the probability of a categorical dependent variable. It models the probability that the dependent variable falls into one of two possible categories as a function of the independent variables. The model passes the linear combination of the independent variables through the logistic sigmoid function to output a probability between 0 and 1. Logistic regression is optimized using maximum likelihood estimation to find the coefficients that maximize the probability of the observed outcomes in the training data. Like linear regression, it makes assumptions about the data: the dependent variable should be binary, and the data should be free of excessive noise and of highly correlated independent variables.
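As a minimal sketch of the idea in R (the choice of the built-in mtcars data and of am, the binary transmission indicator, as the outcome is purely illustrative):

# Fit a logistic regression: P(am = 1) modeled from weight and horsepower
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit)                            # coefficients on the log-odds scale
head(predict(fit, type = "response"))   # predicted probabilities between 0 and 1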
The goal of this workshop is to introduce the fundamental capabilities of R as a tool for performing data analysis. Here we learn about R, one of the most comprehensive statistical analysis languages, to get a basic idea of how to analyze real-world data, extract patterns from data, and investigate causal relationships.
Linear regression is a supervised machine learning technique used to model the relationship between a continuous dependent variable and one or more independent variables. It is commonly used for prediction and forecasting. The regression line is the best-fit line for the data, found with the least squares method, which minimizes the distance between the observed data points and the line. R-squared measures how well the regression line represents the data, on a scale of 0-100%. Linear regression performs well when the underlying relationship is approximately linear, but it has limitations: it assumes linearity and is sensitive to outliers and multicollinearity.
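A minimal least-squares sketch in R, using built-in data (the variable choice is illustrative, not from the summarized document):

# Fit miles-per-gallon as a linear function of car weight (mtcars ships with R)
model <- lm(mpg ~ wt, data = mtcars)
coef(model)               # intercept and slope of the best-fit line
summary(model)$r.squared  # share of variance explained, on a 0-1 scale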
Data preprocessing involves transforming raw data into an understandable and consistent format. It includes data cleaning, integration, transformation, and reduction. Data cleaning aims to fill missing values, smooth noise, and resolve inconsistencies. Data integration combines data from multiple sources. Data transformation handles tasks like normalization and aggregation to prepare the data for mining. Data reduction techniques obtain a reduced representation of data that maintains analytical results but reduces volume, such as through aggregation, dimensionality reduction, discretization, and sampling.
This document provides an overview of the statistical programming language R. It discusses key R concepts like data types, vectors, matrices, data frames, lists, and functions. It also covers important R tools for data analysis like statistical functions, linear regression, multiple regression, and file input/output. The goal of R is to provide a large integrated collection of tools for data analysis and statistical computing.
This presentation discusses the following topics:
Basic features of R
Exploring R GUI
Data Frames & Lists
Handling Data in R Workspace
Reading Data Sets & Exporting Data from R
Manipulating & Processing Data in R
PCA transforms correlated variables into uncorrelated variables called principal components. It finds the directions of maximum variance in high-dimensional data by computing the eigenvectors of the covariance matrix. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Dimensionality reduction is achieved by ignoring components with small eigenvalues, retaining only the most significant components.
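A minimal sketch of this in R with the base prcomp() function (the iris measurements are stand-in data):

# PCA on the four numeric iris columns; scale. = TRUE standardizes each variable
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)          # proportion of variance captured by each component
head(pca$x[, 1:2])    # the data projected onto the first two components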
This course is all about data mining and how to obtain optimized results; it covers the main types of data mining and how to use these techniques.
The document provides an overview of statistics for data science. It introduces key concepts including descriptive versus inferential statistics, different types of variables and data, probability distributions, and statistical analysis methods. Descriptive statistics are used to describe data through measures of central tendency, variability, and visualization techniques. Inferential statistics enable drawing conclusions about populations from samples using hypothesis testing, confidence intervals, and regression analysis.
Data preprocessing is the process of preparing raw data for analysis by cleaning it, transforming it, and reducing it. The key steps in data preprocessing include data cleaning to handle missing values, outliers, and noise; data transformation techniques like normalization, discretization, and feature extraction; and data reduction methods like dimensionality reduction and sampling. Preprocessing ensures the data is consistent, accurate and suitable for building machine learning models.
Logistic regression is a statistical model used to predict binary outcomes like disease presence/absence from several explanatory variables. It is similar to linear regression but for binary rather than continuous outcomes. The document provides an example analysis using logistic regression to predict risk of HHV8 infection from sexual behaviors and infections like HIV. The analysis found HIV and HSV2 history were associated with higher odds of HHV8 after adjusting for other variables, while gonorrhea history was not a significant independent predictor.
Classification techniques in data mining – Kamal Acharya
The document discusses classification algorithms in machine learning. It provides an overview of various classification algorithms including decision tree classifiers, rule-based classifiers, nearest neighbor classifiers, Bayesian classifiers, and artificial neural network classifiers. It then describes the supervised learning process for classification, which involves using a training set to construct a classification model and then applying the model to a test set to classify new data. Finally, it provides a detailed example of how a decision tree classifier is constructed from a training dataset and how it can be used to classify data in the test set.
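A hedged sketch of that train-then-classify workflow in R, using rpart and the built-in iris data as stand-ins for the document's training and test sets:

library(rpart)
set.seed(42)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))                # 70/30 train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]
tree  <- rpart(Species ~ ., data = train, method = "class")  # construct the decision tree
preds <- predict(tree, test, type = "class")                 # classify the test set
mean(preds == test$Species)                                  # proportion classified correctly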
This document compares Python and R for use in data science. Both languages are popular among data scientists, though Python has broader usage among professional developers overall. Python is a general purpose language while R is specialized for statistical computing. Both have extensive libraries for data manipulation, analysis, and visualization. The best choice depends on factors like familiarity, project requirements, and team preferences as both are capable of most data science tasks.
This document provides an overview and agenda for a hands-on introduction to data science. It includes the following sections: Data Science Overview and Intro to R (90 minutes), Exploratory Data Analysis (60 minutes), and Logistic Regression Model (30 minutes). The document then covers key concepts in data science including collecting and analyzing data to find insights to help decision making, using analytics to improve operations and innovations, and predicting problems before they occur. Machine learning and statistical techniques are also introduced such as supervised and unsupervised learning, parameters versus statistics, and calculating variance and standard deviation.
Machine learning models involve a bias-variance tradeoff, where increased model complexity can lead to overfitting training data (high variance) or underfitting (high bias). Bias measures how far model predictions are from the correct values on average, while variance captures differences between predictions on different training data. The ideal model has low bias and low variance, accurately fitting training data while generalizing to new examples.
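For squared-error loss this tradeoff has a standard decomposition; writing f̂(x) for the model's prediction at a point x and σ² for the irreducible noise variance:

E[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²

Reducing one term typically inflates the other, which is why the goal is a model complexity that balances the two.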
Statistics And Probability Tutorial | Statistics And Probability for Data Sci... – Edureka!
Here are the steps to calculate the standard deviation of the numbers:
1) Find the mean (average) of the numbers: (9 + 2 + 5 + ... + 10 + 9 + 6 + 9 + 4) / 20 = 7
2) For each number, subtract the mean and square the result:
(9 - 7)² = 4
(2 - 7)² = 25
...
(4 - 7)² = 9
3) Sum all the squared differences: 4 + 25 + ... + 9 = S
4) Divide the sum by the number of values minus 1: S / (20 - 1)
5) Take the square root. This is the standard deviation.
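The same calculation in R. The original list elides most of the twenty numbers, so the vector below is a hypothetical completion chosen to match the visible values and the stated mean of 7:

x <- c(9, 2, 5, 4, 12, 7, 8, 11, 9, 3,
       7, 4, 12, 5, 4, 10, 9, 6, 9, 4)   # placeholder data; mean(x) is 7
m <- mean(x)                 # step 1: the mean
S <- sum((x - m)^2)          # steps 2-3: sum of squared differences
sqrt(S / (length(x) - 1))    # steps 4-5: the sample standard deviation
sd(x)                        # built-in equivalent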
Definition of classification
Basic principles of classification
Typical
How Does Classification Work?
Difference between Classification & Prediction.
Machine learning techniques
Decision Trees
k-Nearest Neighbors
This presentation gives an idea about Data Preprocessing in the field of Data Mining. Images, examples and other material are adapted from "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber and Jian Pei.
Abstract: This PDSG workshop introduces basic concepts of simple linear regression in machine learning. Concepts covered are Slope of a Line, Loss Function, and Solving Simple Linear Regression Equation, with examples.
Level: Fundamental
Requirements: No prior programming or statistics knowledge required.
Decision tree induction \ Decision Tree Algorithm with Example | Data science – MaryamRehman6
This Decision Tree Algorithm in Machine Learning presentation will help you understand all the basics of decision trees: what Machine Learning is, what a Decision Tree is, the advantages and disadvantages of Decision Trees, how the Decision Tree algorithm works with solved examples, and, at the end, a Decision Tree use case/demo in Python for loan payment. This Decision Tree tutorial is perfect for both beginners and experts who want to learn Machine Learning algorithms.
This document discusses missing data imputation. It notes that missing data appears as "NA", blank spaces, or other placeholders in a dataset. There are two main approaches to handling missing data - deleting rows with missing values or using imputation methods to replace missing values. Common imputation methods for continuous data include replacing with the mean, median, or predicted value from regression. For categorical data, the most frequent category or predicted value from a classifier may be used. The MICE and missForest packages are popular tools for imputing missing values in R.
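A minimal sketch of the simplest of these methods, mean imputation, in base R (the data frame and column are made up for the example):

df <- data.frame(age = c(23, NA, 31, 27, NA, 45))      # toy column with NAs
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)    # replace NAs with the mean
df$age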
The document discusses the objectives and units of the CS8091 / Big Data Analytics course, which include understanding fundamental concepts of big data, HDFS, MapReduce, clustering, classification, association analysis, and recommendation systems. It also covers sources of big data, data structures, current analytical architectures, drivers of big data, and the emerging big data ecosystem approach to analytics using data devices, collectors, aggregators, and users.
The document discusses data preprocessing techniques. It explains that data preprocessing is important because real-world data is often noisy, incomplete, and inconsistent. The key techniques covered are data cleaning, integration, reduction, and transformation. Data cleaning handles missing values, noise, and outliers. Data integration merges data from multiple sources. Data reduction reduces data size through techniques like dimensionality reduction. Data transformation normalizes and aggregates data to make it suitable for mining.
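As one concrete example of the transformation step, a sketch of min-max normalization to the [0, 1] range in base R:

normalize <- function(v) (v - min(v)) / (max(v) - min(v))  # min-max rescaling
normalize(c(10, 20, 15, 40))   # 0.0000 0.3333 0.1667 1.0000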
This document summarizes a presentation given by Thomas Hütter on using R for data analysis and visualization. The presentation provided an overview of R's history and ecosystem, introduced basic data types and functions, and demonstrated connecting to a SQL Server database to extract and analyze sales data from a Dynamics Nav system. It showed visualizing the results with ggplot2 and creating interactive apps with the Shiny framework. The presentation emphasized that proper data understanding is important for reliable analysis and highlighted resources for learning more about R.
This document provides an agenda for an R programming presentation. It includes an introduction to R, commonly used packages and datasets in R, basics of R like data structures and manipulation, looping concepts, data analysis techniques using dplyr and other packages, data visualization using ggplot2, and machine learning algorithms in R. Shortcuts for the R console and IDE are also listed.
CuRious about R in Power BI? End to end R in Power BI for beginners – Jen Stirrup
R is a widely used open-source statistical software environment, with over 2 million data scientists and analysts among its users. It is based on the S programming language and is developed by the R Foundation. R provides a flexible and powerful environment for statistical analysis, modeling, and data visualization. Some key advantages include being free, having an extensive community for support, and allowing for automated replication through scripting. However, it also has some drawbacks, like a steep learning curve and scripts that can be difficult to understand.
This document provides an introduction and overview of Neo4j, a graph database. It discusses trends in big data, NoSQL databases, and different types of NoSQL databases like key-value stores, column family databases, and document databases. It then defines what a graph and graph database are, and introduces Neo4j as a native graph database that uses a property graph model. It outlines some of Neo4j's features and provides examples of how it can be used to represent social network, spatial, and interconnected data.
The presentation was given by Mr. Bas Kempen, ISRIC, during the GSOC Mapping Global Training hosted by ISRIC - World Soil Information, 6 - 23 June 2017, Wageningen (The Netherlands).
Rattle is Free (as in Libre) Open Source Software and the source code is available from the Bitbucket repository. We give you the freedom to review the code, use it for whatever purpose you like, and to extend it however you like, without restriction, except that if you then distribute your changes you also need to distribute your source code too.
Rattle - the R Analytical Tool To Learn Easily - is a popular GUI for data mining using R. It presents statistical and visual summaries of data, transforms data that can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets. One of the most important features (according to me) is that all of your interactions through the graphical user interface are captured as an R script that can be readily executed in R independently of the Rattle interface.
Rattle clocks between 10,000 and 20,000 installations per month from the RStudio CRAN node (one of over 100 nodes). Rattle has been downloaded several million times overall.
Data Science, Statistical Analysis and R... Learn what those mean, how they can help you find answers to your questions and complement the existing toolsets and processes you are currently using to make sense of data. We will explore R and the RStudio development environment, installing and using R packages, basic and essential data structures and data types, plotting graphics, manipulating data frames and how to connect R and SQL Server.
Week-3 – System R – Supplemental material (.docx) – helzerpatrina
Week-3 – System R
Supplemental material

Recap
• R workhorse data structures
  • Data frame
  • List
  • Matrix / Array
  • Vector
• System-R – input and output
  • The read family of functions: read.table() and read.csv()
  • scan() function
  • typeof() function
  • setwd() function
  • print()
• Factor variables
  • Used in category analysis and statistical modelling
  • Contain a predefined set of values called levels
• Descriptive statistics
  • ls() – lists the named objects
  • str() – shows the structure of the data, not the data itself
  • summary() – provides a summary of the data
  • plot() – simple plot
Descriptive statistics – continued
• Summary of commands with single-value results. These commands work on variables containing numeric values.
  • max() – the maximum value in the vector
  • min() – the minimum value in the vector
  • sum() – the sum of all the vector elements
  • mean() – the arithmetic mean of the entire vector
  • median() – the median value of the vector
  • sd() – the standard deviation
  • var() – the variance
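A minimal sketch of these single-value commands on a toy numeric vector:

temp <- c(36.5, 37.2, 38.0, 36.8, 39.1)   # toy data, echoing the slide's temp vector
max(temp); min(temp); sum(temp)           # 39.1, 36.5, 187.6
mean(temp); median(temp)                  # 37.52, 37.2
sd(temp); var(temp)                       # standard deviation and variance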
Descriptive statistics – single-value results – example
[Screenshot: the single-value commands applied to a numeric vector named temp]

Descriptive statistics – multiple-value results – example
• log(dataset) – shows the log value for each element
• summary(dataset) – shows a summary of the values
• quantile() – shows the quantiles; by default the 0%, 25%, 50%, 75%, and 100% quantiles. It is possible to select other quantiles as well.
[Screenshot: the multiple-value commands applied to a sample dataset]
Descriptive Statistics in R for Data Frames
• max(frame) – returns the largest value in the entire data frame
• min(frame) – returns the smallest value in the entire data frame
• sum(frame) – returns the sum of the entire data frame
• fivenum(frame) – returns the Tukey summary values for the entire data frame
• length(frame) – returns the number of columns in the data frame
• summary(frame) – returns a summary for each column
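A minimal sketch of these on a small all-numeric data frame (the values are made up):

frame <- data.frame(a = c(1, 4, 9), b = c(2, 8, 5))   # toy numeric data frame
max(frame); min(frame); sum(frame)   # 9, 1, 29
length(frame)                        # 2 (number of columns)
fivenum(unlist(frame))               # Tukey five-number summary of all the values
summary(frame)                       # per-column summaries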
Descriptive Statistics in R for Data Frames – Example
[Screenshot: the data frame functions applied to a sample data frame]

Descriptive Statistics in R for Data Frames – rowMeans example
[Screenshot: rowMeans() applied to a sample data frame]

Descriptive Statistics in R for Data Frames – colMeans example
[Screenshot: colMeans() applied to a sample data frame]
Graphical analysis – simple linear regression model in R
• Linear regression is implemented to understand whether the dependent variable is a linear function of the independent variable.
• Linear regression is used for fitting the regression line.
• Prerequisite for implementing linear regression:
  • The dependent variable should conform to a normal distribution.
• The cars dataset that ships with R will be used as an example to explain the linear regression model.
Creating a simple linear model
• cars is a dataset preloaded in R.
• The head() function prints the first few rows of a list or data frame.
• The cars dataset contains two major columns:
  • X = speed (cars$speed)
  • Y = dist (cars$dist)
• The data() function is used to list all the active datasets in the environment.
• ...
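A minimal sketch of the model these slides build toward, using the cars columns named above:

data(cars)                             # speed and stopping distance
head(cars)                             # first few rows
fit <- lm(dist ~ speed, data = cars)   # simple linear model: dist as a function of speed
summary(fit)                           # coefficients and R-squared
plot(cars$speed, cars$dist)            # scatter plot of the data
abline(fit)                            # overlay the fitted regression line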
This document discusses machine learning algorithms in R. It provides an overview of machine learning, data science, and the 5 V's of big data. It then discusses two main machine learning algorithms - clustering and classification. For clustering, it covers k-means clustering, providing examples of how to implement k-means clustering in R. For classification, it discusses decision trees, K-nearest neighbors (KNN), and provides an example of KNN classification in R. It also provides a brief overview of regression analysis, including examples of simple and multiple linear regression in R.
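A hedged sketch of both algorithm families in R, with the built-in iris data standing in for the document's examples:

set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3)   # k-means clustering with k = 3
table(km$cluster, iris$Species)          # clusters vs. true species

library(class)                           # kNN classifier
idx  <- sample(nrow(iris), 100)          # 100 training rows
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])         # classification accuracy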
The Tidyverse and the Future of the Monitoring Toolchain – John Rauser
As delivered at Monitorama PDX 2017.
Thesis: The ideas in the tidyverse, not the tools themselves but the fundamental ideas they are built upon, those ideas are going to have profound impact on everything having to do with data manipulation and visualization or data analysis.
R is a language and environment for statistical computing and graphics. It includes facilities for data manipulation, calculation, graphical display, and programming. Some key features of R include effective data handling, a suite of operators for calculations on arrays and matrices, graphical facilities, and a programming language with conditionals, loops, and functions. Common data structures in R include vectors, matrices, factors, lists, and data frames. Basic operations include arithmetic, logical operations, indexing, subsetting, applying functions, binding, and coercing between different structures.
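A minimal sketch of those basic structures and operations (values are illustrative):

v <- c(3, 1, 4, 1, 5)                  # vector
v * 2                                  # element-wise arithmetic
v[v > 2]                               # logical indexing / subsetting
m <- matrix(1:6, nrow = 2)             # 2 x 3 matrix
rbind(m, c(7, 8, 9))                   # binding on a new row
sapply(list(a = 1:3, b = 4:6), mean)   # applying a function over a list
as.character(v)                        # coercing to another type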
R-Programming.ppt – based on the R programming language – Zoha681526
This ppt is based on the R programming language and offers a very accessible introduction; anyone can easily learn R through it, as the explanations are very simple.
Exploratory Analysis Part1 Coursera DataScience Specialisation – Wesley Goi
The document discusses exploratory data analysis techniques in R, including various plotting systems and graph types. It provides code examples for creating boxplots, histograms, bar plots, and scatter plots in Base, Lattice, and ggplot2. It also covers downloading data, transforming data, adding scales and themes, and creating faceted plots. The final challenge involves creating a boxplot with rectangles to represent regions and jittered points to show trends over years.
Introduction to Data Science, Prerequisites (tidyverse), Import Data (readr), Data Tidying (tidyr): pivot_longer(), pivot_wider(), separate(), unite(); Data Transformation (dplyr – Grammar of Manipulation): arrange(), filter(), select(), mutate(), summarise();
Data Visualization (ggplot2 – Grammar of Graphics): Column Chart, Stacked Column Graph, Bar Graph, Line Graph, Dual Axis Chart, Area Chart, Pie Chart, Heat Map, Scatter Chart, Bubble Chart
This document provides an overview of the R programming language. It describes how R can handle numeric and textual data and perform matrix algebra and statistical functions. While R is not a database, it can connect to external databases. It also notes that R has no built-in graphical user interface but can connect to other languages for visualization, and that its interpreter can be slow, though users can call optimized C/C++ code. The document also contrasts R with commercial packages.
2. R for Data Science | Long Nguyen | Sep 2017
Why R?
• R is the most preferred programming tool for statisticians, data scientists, data analysts and data architects
• Easy to develop your own model.
• R is freely available under GNU General Public License
• R has over 10,000 packages (a lot of available algorithms) from multiple repositories.
http://www.burtchworks.com/2017/06/19/2017-sas-r-python-flash-survey-results/
3. R for Data Science | Long Nguyen | Sep 2017
R & RStudio IDE
Go to:
• https://www.r-project.org/
• https://www.rstudio.com/products/rstudio/download/
4. R for Data Science | Long Nguyen | Sep 2017
Essentials of R Programming
• Basic computations
• Five basic classes of objects
  – Character
  – Numeric (real numbers)
  – Integer (whole numbers)
  – Complex
  – Logical (TRUE / FALSE)
• Data types in R
  – Vector: a vector contains objects of the same class
  – List: a special type of vector which can contain elements of different data types
  – Matrix: a matrix is represented by a set of rows and columns
  – Data frame: every column of a data frame acts like a list

2+3
sqrt(121)
myvector <- c("Time", 24, "October", TRUE, 3.33)
my_list <- list(22, "ab", TRUE, 1 + 2i)
my_list[[1]]
my_matrix <- matrix(1:6, nrow=3, ncol=2)
df <- data.frame(name = c("ash","jane","paul","mark"),
                 score = c(67,56,87,91))
df
   name score
1   ash    67
2  jane    56
3  paul    87
4  mark    91
5. R for Data Science | Long Nguyen | Sep 2017
Essentials of R Programming
• Control structures
  – if (condition) {
      Do something
    } else {
      Do something else
    }
• Loops
  – for loop
  – while loop
• Functions
  – function.name <- function(arguments) {
      computations on the arguments
      some other code
    }

x <- runif(1, 0, 10)
if(x > 3) {
  y <- 10
} else {
  y <- 0
}

for(i in 1:10) {
  print(i)
}

mySquaredFunc <- function(n){
  # Compute the square of `n`
  n*n
}
mySquaredFunc(5)
6. R for Data Science | Long Nguyen | Sep 2017
Useful R Packages
• Install packages: install.packages(c("readr", "ggplot2", "dplyr", "caret"))
• Load packages: library(package_name)

Importing Data
• readr
• data.table
• sqldf
Data Manipulation
• dplyr
• tidyr
• lubridate
• stringr
Data Visualization
• ggplot2
• plotly
Modeling
• caret
• lm
• randomForest, rpart
• gbm, xgb
Reporting
• RMarkdown
• Shiny
7. R for Data Science | Long Nguyen | Sep 2017
Importing Data
• CSV file
  mydata <- read.csv("mydata.csv")   # read csv file
  library(readr)
  mydata <- read_csv("mydata.csv")   # 10x faster
• Tab-delimited text file
  mydata <- read.table("mydata.txt") # read text file
  mydata <- read_table("mydata.txt")
• Excel file:
  library(XLConnect)
  wk <- loadWorkbook("mydata.xls")
  df <- readWorksheet(wk, sheet="Sheet1")
• SAS file
  library(sas7bdat)
  mySASData <- read.sas7bdat("example.sas7bdat")
• Other files:
  – Minitab, SPSS (foreign)
  – MySQL (RMySQL)

Sample mydata.csv:
Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3

Sample mydata.txt:
100 a1 b1
200 a2 b2
300 a3 b3
400 a4 b4
8. R for Data Science | Long Nguyen | Sep 2017
Data Manipulation with ‘dplyr’
• Some of the key “verbs”:
  – select: return a subset of the columns of a data frame, using a flexible notation
  – filter: extract a subset of rows from a data frame based on logical conditions
  – arrange: reorder rows of a data frame
  – rename: rename variables in a data frame

library(dplyr)   # the verbs below come from dplyr
library(nycflights13)
flights
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
jan1 <- filter(flights, month == 1, day == 1)
nov_dec <- filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
arrange(flights, year, month, day)
arrange(flights, desc(arr_delay))
rename(flights, tail_num = tailnum)
9. R for Data Science | Long Nguyen | Sep 2017
Data Manipulation with ‘dplyr’
• Some of the key “verbs”:
  – mutate: add new variables/columns or transform existing variables
  – summarize: generate summary statistics of different variables in the data frame
  – %>%: the “pipe” operator is used to connect multiple verb actions together into a pipeline

flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)
mutate(flights_sml,
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60)

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
                   count = n(),
                   dist = mean(distance, na.rm = TRUE),
                   delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dest != "HNL")

library(ggplot2)
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
10. R for Data Science | Long Nguyen | Sep 2017
Data Manipulation with ‘tidyr’
• Some of the key “verbs”:
  – gather: takes multiple columns, and gathers them into key-value pairs
  – spread: takes two columns (key & value) and spreads them into multiple columns
  – separate: splits a single column into multiple columns
  – unite: combines multiple columns into a single column

library(tidyr)
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
spread(table2, key = type, value = count)
separate(table3, year, into = c("century", "year"), sep = 2)
separate(table3, rate, into = c("cases", "population"))
unite(table5, "new", century, year, sep = "")
11. R for Data Science | Long Nguyen | Sep 2017
Data Visualization with ‘ggplot2’
• Scatter plot

library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() +
  geom_smooth(method="lm")

ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point(aes(col=state, size=popdensity)) +
  geom_smooth(method="loess", se=F) +
  xlim(c(0, 0.1)) +
  ylim(c(0, 500000)) +
  labs(subtitle="Area Vs Population",
       y="Population",
       x="Area",
       title="Scatterplot",
       caption = "Source: midwest")
12. R for Data Science | Long Nguyen | Sep 2017
Data Visualization with ‘ggplot2’
• Correlogram

library(ggplot2)
library(ggcorrplot)
# Correlation matrix
data(mtcars)
corr <- round(cor(mtcars), 1)
# Plot
ggcorrplot(corr, hc.order = TRUE,
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           method="circle",
           colors = c("tomato2", "white", "springgreen3"),
           title="Correlogram of mtcars",
           ggtheme=theme_bw)
13. R for Data Science | Long Nguyen | Sep 2017
Data Visualization with ‘ggplot2’
• Histogram on Categorical Variables

ggplot(mpg, aes(manufacturer)) +
  geom_bar(aes(fill=class), width = 0.5) +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title="Histogram on Categorical Variable",
       subtitle="Manufacturer across Vehicle Classes")
14. R for Data Science | Long Nguyen | Sep 2017
Data Visualization with ‘ggplot2’
• Density plot

ggplot(mpg, aes(cty)) +
  geom_density(aes(fill=factor(cyl)), alpha=0.8) +
  labs(title="Density plot",
       subtitle="City Mileage Grouped by Number of cylinders",
       caption="Source: mpg",
       x="City Mileage",
       fill="# Cylinders")

Other plots:
• Box plot
• Pie chart
• Time-series plot
15. R for Data Science | Long Nguyen | Sep 2017
Interactive Visualization with ‘plotly’
The plotly library makes interactive, publication-quality graphs online. It supports line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heat maps, subplots, multiple axes, and 3D charts.

library(plotly)
d <- diamonds[sample(nrow(diamonds), 1000), ]
plot_ly(d, x = ~carat, y = ~price, color = ~carat,
        size = ~carat, text = ~paste("Clarity: ", clarity))
16. R for Data Science | Long Nguyen | Sep 2017
Data Modeling – Linear Regression

library(caret)   # for findCorrelation()
data(mtcars)
mtcars$am = as.factor(mtcars$am)
mtcars$cyl = as.factor(mtcars$cyl)
mtcars$vs = as.factor(mtcars$vs)
mtcars$gear = as.factor(mtcars$gear)
# Dropping the dependent variable
mtcars_a = subset(mtcars, select = -c(mpg))
# Identifying numeric variables
numericData <- mtcars_a[sapply(mtcars_a, is.numeric)]
# Calculating correlation
descrCor <- cor(numericData)
# Checking variables that are highly correlated
highlyCorrelated = findCorrelation(descrCor, cutoff=0.7)
highlyCorCol = colnames(numericData)[highlyCorrelated]
# Remove highly correlated variables and create a new dataset
dat3 = mtcars[, -which(colnames(mtcars) %in% highlyCorCol)]
# Build linear regression model
fit = lm(mpg ~ ., data=dat3)
# Extracting R-squared value
summary(fit)$r.squared
library(MASS)   # stepwise selection based on AIC
step <- stepAIC(fit, direction="both")
summary(step)
17. R for Data Science | Long Nguyen | Sep 2017
Data modeling with ‘caret’
• Loan prediction problem
• Data standardization and imputing missing values using kNN
  library(caret)
  # `train` is assumed to be the loan-prediction training data frame loaded earlier
  preProcValues <- preProcess(train, method = c("knnImpute","center","scale"))
  library(RANN)
  train_processed <- predict(preProcValues, train)
• One-hot encoding for categorical variables
  dmy <- dummyVars(" ~ .", data = train_processed, fullRank = T)
  train_transformed <- data.frame(predict(dmy, newdata = train_processed))
• Prepare training and testing set
  index <- createDataPartition(train_transformed$Loan_Status, p=0.75, list=FALSE)
  trainSet <- train_transformed[ index,]
  testSet <- train_transformed[-index,]
• Feature selection using rfe
  outcomeName <- "Loan_Status"   # assumed; the slide uses outcomeName without defining it
  control <- rfeControl(functions = rfFuncs, method = "repeatedcv", repeats = 3)   # assumed control setup
  predictors <- names(trainSet)[!names(trainSet) %in% outcomeName]
  Loan_Pred_Profile <- rfe(trainSet[,predictors], trainSet[,outcomeName], rfeControl = control)
18. R for Data Science | Long Nguyen | Sep 2017
Data modeling with ‘caret’
• Take top 5 variables
  predictors <- c("Credit_History", "LoanAmount", "Loan_Amount_Term", "ApplicantIncome", "CoapplicantIncome")
• Train different models
  model_gbm <- train(trainSet[,predictors], trainSet[,outcomeName], method='gbm')
  model_rf <- train(trainSet[,predictors], trainSet[,outcomeName], method='rf')
  model_nnet <- train(trainSet[,predictors], trainSet[,outcomeName], method='nnet')
  model_glm <- train(trainSet[,predictors], trainSet[,outcomeName], method='glm')
• Variable importance
  plot(varImp(object=model_gbm), main="GBM - Variable Importance")
  plot(varImp(object=model_rf), main="RF - Variable Importance")
  plot(varImp(object=model_nnet), main="NNET - Variable Importance")
  plot(varImp(object=model_glm), main="GLM - Variable Importance")
• Prediction
  predictions <- predict.train(object=model_gbm, testSet[,predictors], type="raw")
  confusionMatrix(predictions, testSet[,outcomeName])
  # Confusion Matrix and Statistics
  # Prediction  0   1
  #          0 25   3
  #          1 23 102
  # Accuracy : 0.8301
19. R for Data Science | Long Nguyen | Sep 2017
Reporting
R Markdown files are designed (i) for communicating with decision makers, (ii) for collaborating with other data scientists, and (iii) as an environment in which to do data science, where you can capture what you were thinking.

Text formatting
*italic* or _italic_
**bold** __bold__
`code`
superscript^2^ and subscript~2~

Headings
# 1st Level Header
## 2nd Level Header
### 3rd Level Header

Lists
* Bulleted list item 1
* Item 2
  * Item 2a
  * Item 2b
1. Numbered list item 1
2. Item 2. The numbers are incremented automatically in the output.

Links and images
<http://example.com>
[linked phrase](http://example.com)
20. R for Data Science | Long Nguyen | Sep 2017
Thank you!