DA_Lab_Week-1
(AUTONOMOUS)
SREE SAINATH NAGAR, A. RANGAMPET –517102.
Week-1
Aim: Introduction to RStudio, basic operations, and import and
export of data using R.
Agenda:
1. About Data Mining
2. About R and RStudio
3. Basic Operations
4. Datasets
5. Data Import and Export
a. Save and Load R Data
b. Import from and Export to .CSV Files
In real-world applications, a data mining process can be broken into six major
phases:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation and
6. Deployment
as defined by CRISP-DM (Cross-Industry Standard Process for Data Mining).
About R:
R is a free software environment for statistical computing and graphics.
It provides a wide variety of statistical and graphical techniques
(http://www.r-project.org/).
R can be easily extended with the 7324 packages available (as of this writing)
on CRAN (Comprehensive R Archive Network) (http://cran.r-project.org/).
To help users find out which R packages to use, the CRAN Task Views are a
good guide (http://cran.r-project.org/web/views/). They provide collections of
packages for different tasks. Some Task Views related to data mining are:
o Machine Learning & Statistical Learning
o Cluster Analysis & Finite Mixture Models
o Time Series Analysis
o Natural Language Processing
o Multivariate Statistics and
o Analysis of Spatial Data.
RStudio
RStudio is an integrated development environment (IDE) for R and can run on
various operating systems such as Windows, Mac OS X and Linux. It is a very
useful and powerful tool for R programming.
When RStudio is launched for the first time, you see a window similar to the
figure below. There are four panels:
1. Source panel (top left), which shows your R source code. If you cannot see the
source panel, open it from the menu "File", "New File" and then "R Script". You
can run a line or a selection of R code by clicking the "Run" button at the top
of the source panel, or by pressing "Ctrl + Enter".
2. Console panel (bottom left), which shows the outputs and system messages
displayed in a normal R console;
3. Environment/History/Presentation panel (top right), whose three tabs show,
respectively, all objects and functions loaded in R, a history of submitted R
code, and presentations generated with R;
4. Files/Plots/Packages/Help/Viewer panel (bottom right), whose tabs show,
respectively, a list of files, plots, installed R packages, help documentation
and local web content.
In addition to the above three folders, which are useful to most projects, you may
create the additional folders below, depending on your project and preference:
1. rawdata, where to put all raw data,
2. models, where to put all produced analytics models, and
3. reports, where to put your analysis reports.
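The optional folders above can also be created from the R console. The following is a minimal sketch, assuming a temporary directory as the project root; project_root and the folder name "myproject" are hypothetical and would normally be replaced by your own project folder.

```r
# Hypothetical project root; a temporary directory keeps this sketch self-contained.
project_root <- file.path(tempdir(), "myproject")

# Optional folders suggested in the text above.
folders <- c("rawdata", "models", "reports")

for (f in folders) {
  # recursive = TRUE also creates project_root if it does not exist yet;
  # showWarnings = FALSE keeps the call quiet when the folder already exists.
  dir.create(file.path(project_root, f), recursive = TRUE, showWarnings = FALSE)
}

# List the folders that now exist directly under the project root.
print(list.dirs(project_root, full.names = FALSE, recursive = FALSE))
```

Running the loop a second time is harmless, because dir.create() leaves existing folders untouched.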
Datasets
> str(iris)  # 150 observations (records, or rows) of 5 variables
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
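Besides str(), a few other base R functions give a quick first look at a dataset. The following is a small sketch using only functions from base R on the built-in iris data; the choice of functions is illustrative, not a prescribed procedure.

```r
# iris ships with R (package 'datasets'), so no installation is needed.
print(dim(iris))                   # number of rows and columns
print(names(iris))                 # column names
print(head(iris, 3))               # first three records
print(summary(iris$Sepal.Length))  # minimum, quartiles, mean and maximum
print(table(iris$Species))         # number of records per species
```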
> data("bodyfat", package = "TH.data")
> str(bodyfat)
'data.frame': 71 obs. of 10 variables:
$ age : num 57 65 59 58 60 61 56 60 58 62 ...
$ DEXfat : num 41.7 43.3 35.4 22.8 36.4 ...
$ waistcirc : num 100 99.5 96 72 89.5 83.5 81 89 80 79 ...
$ hipcirc : num 112 116.5 108.5 96.5 100.5 ...
$ elbowbreadth: num 7.1 6.5 6.2 6.1 7.1 6.5 6.9 6.2 6.4 7 ...
$ kneebreadth : num 9.4 8.9 8.9 9.2 10 8.8 8.9 8.5 8.8 8.8 ...
$ anthro3a : num 4.42 4.63 4.12 4.03 4.24 3.55 4.14 4.04 3.91 3.66 ...
$ anthro3b : num 4.95 5.01 4.74 4.48 4.68 4.06 4.52 4.7 4.32 4.21 ...
$ anthro3c : num 4.5 4.48 4.6 3.91 4.15 3.64 4.31 4.47 3.47 3.6 ...
$ anthro4 : num 6.13 6.37 5.82 5.66 5.91 5.14 5.69 5.7 5.49 5.25 ...
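Unlike iris, the bodyfat data lives in an add-on package, so the data() call fails if TH.data is not installed. The following is a hedged sketch that checks for the package first; the message text is an illustrative assumption, and install.packages() needs an internet connection.

```r
# Load bodyfat only if the TH.data package is available on this machine.
if (requireNamespace("TH.data", quietly = TRUE)) {
  data("bodyfat", package = "TH.data")
  print(dim(bodyfat))  # rows and columns of the loaded data frame
} else {
  # Illustrative hint; installation requires network access.
  message("Package 'TH.data' is not installed; run install.packages(\"TH.data\") first.")
}
```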
An alternative way to save and load R data objects is to use the functions
saveRDS() and readRDS(). They work in a similar way to save() and load().
The differences are:
a. multiple R objects can be saved into one single file with save(), but only
one object can be saved in a file with saveRDS(); and
b. readRDS() enables us to restore the data under a different object name,
while load() restores the data under the same object name as when it was
saved.
> a <- 1:10
> saveRDS(a, file="mydatafile2.rds")
> a2 <- readRDS("mydatafile2.rds")
> print(a2)
[1] 1 2 3 4 5 6 7 8 9 10
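For comparison, the save() and load() behaviour described above can be sketched as follows. This is a minimal illustration, assuming a file in a temporary directory; rdata_file and the object b are hypothetical names introduced only for this sketch.

```r
# Hypothetical file path in a temporary directory.
rdata_file <- file.path(tempdir(), "mydatafile.Rdata")

a <- 1:10
b <- letters[1:5]

# save() can store several objects in one file.
save(a, b, file = rdata_file)

# Remove the objects, then restore them under their original names.
rm(a, b)
restored <- load(rdata_file)  # load() returns the names of the restored objects

print(restored)
print(a)
print(b)
```

Note that load() silently overwrites any existing objects with the same names, which is one reason to prefer readRDS() when you want control over the object name.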
Example: export a data frame to a .CSV file with write.csv() and import it
back with read.csv():
> var1 <- sample(5)
> var2 <- var1 / 10
> var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")
> df1 <- data.frame(var1, var2, var3)
> names(df1) <- c("Var.Int", "Var.Num", "Var.Char")
> write.csv(df1, "mydatafile3.csv", row.names = FALSE)
> df2 <- read.csv("mydatafile3.csv")
> print(df2)
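To check that a .CSV round trip preserved the data, the imported data frame can be compared with the original. The following is a sketch under stated assumptions: it writes to a temporary file (csv_file is a hypothetical name) instead of the working directory, and it calls set.seed() so that the random permutation is repeatable.

```r
set.seed(1)  # fixes the random permutation so the sketch is repeatable

var1 <- sample(5)
var2 <- var1 / 10
var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")
df1 <- data.frame(Var.Int = var1, Var.Num = var2, Var.Char = var3)

# Hypothetical temporary file, so the working directory is left untouched.
csv_file <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing an extra column of row numbers.
write.csv(df1, csv_file, row.names = FALSE)
df2 <- read.csv(csv_file)

str(df2)                    # column types after import
print(all.equal(df1, df2))  # compares values and column types of both data frames
```

If row.names = FALSE is omitted, the file gains an unnamed first column that read.csv() imports as an additional variable, so the comparison would no longer match.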