
DA_Lab_Week-1

The document provides an introduction to R Studio, covering basic operations and data import/export techniques. It discusses data mining, its applications, and the phases of the data mining process, along with an overview of R and RStudio as tools for statistical computing. Additionally, it includes examples of datasets and demonstrates how to save and load data in R, particularly using .Rdata and .CSV files.

Uploaded by

upesh
Copyright
© © All Rights Reserved

SREE VIDYANIKETHAN ENGINEERING COLLEGE

(AUTONOMOUS)
SREE SAINATH NAGAR, A. RANGAMPET –517102.

Week-1
Aim: Introduction to R Studio, Basic operations and import and
export of data using R Tool.
Agenda:
1. About Data Mining
2. About R and RStudio
3. Basic Operations
a.
4. Datasets
5. Data Import and Export
a. Save and Load R Data
b. Import from and Export to .CSV Files

 Data mining is the process of discovering interesting knowledge from large
amounts of data [Han and Kamber, 2000].
 It is an interdisciplinary field with contributions from many areas, such as:
o Statistics, machine learning, information retrieval, pattern recognition and
bioinformatics.
 Data mining is widely used in many domains, such as:
o Retail, finance, telecommunications and social media.

 The main techniques for data mining include:
o Classification and prediction, clustering, outlier detection, association
rules, sequence analysis, time series analysis and text mining, as well as
newer techniques such as social network analysis and sentiment
analysis.

 In real-world applications, a data mining process can be broken into six major
phases, as defined by CRISP-DM (Cross Industry Standard Process for Data Mining):
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment

About R:
 R is a free software environment for statistical computing and graphics.
 It provides a wide variety of statistical and graphical techniques
(http://www.r-project.org/).
 R can be easily extended with packages from CRAN (Comprehensive R Archive
Network) (http://cran.r-project.org/); 7324 packages were available at the time
of writing.
 The CRAN Task Views (http://cran.r-project.org/web/views/) are a good guide to
which R packages to use. They provide collections of packages for different
tasks. Some Task Views related to data mining are:
o Machine Learning & Statistical Learning
o Cluster Analysis & Finite Mixture Models
o Time Series Analysis
o Natural Language Processing
o Multivariate Statistics and
o Analysis of Spatial Data.

RStudio
 RStudio is an integrated development environment (IDE) for R that runs on
various operating systems, including Windows, Mac OS X and Linux. It is a very
useful and powerful tool for R programming.

 When RStudio is launched for the first time, you see a window similar to the
figure below. There are four panels:
1. Source panel (top left), which shows your R source code. If you cannot see the
source panel, open it by clicking menu "File", "New File" and then "R
Script". You can run a line or a selection of R code by clicking the "Run" button
at the top of the source panel, or by pressing "Ctrl + Enter".
2. Console panel (bottom left), which shows outputs and system messages
as displayed in a normal R console;
3. Environment/History/Presentation panel (top right), whose three tabs show
respectively all objects and functions loaded in R, a history of submitted R code,
and presentations generated with R;
4. Files/Plots/Packages/Help/Viewer panel (bottom right), whose tabs show
respectively a list of files, plots, installed R packages, help documentation and
local web content.

It is always good practice to begin R programming with an RStudio project, which is
a folder in which to keep your R code, data files and figures.
 To create a new project, click the "Project" button at the top-right corner and
then choose "New Project".
 After that, select "Create project from new directory" and then "Empty Project".
After typing a directory name, which will also be your project name, click "Create
Project" to create your project folder and files.

After that, create three folders as below:
1. code, for your R source code;
2. data, for your datasets; and
3. figures, for the diagrams you produce.

In addition to the three folders above, which are useful in most projects, you may
create further folders depending on your project and preference:
1. rawdata, where to put all raw data,
2. models, where to put all produced analytics models, and
3. reports, where to put your analysis reports.
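
The folder layout above can also be created from the R console rather than the file browser. The following is a minimal sketch, not part of the original lab sheet: the name project_root and the use of a scratch location under tempdir() are assumptions made so that the code is safe to run anywhere. In a real RStudio project you would point it at the project directory instead.

```r
# Hypothetical project root: a scratch folder, so the sketch is safe to run.
project_root <- file.path(tempdir(), "my_project")

# The three core folders plus the three optional ones listed above.
folders <- c("code", "data", "figures", "rawdata", "models", "reports")

for (f in folders) {
  # recursive = TRUE also creates project_root itself on first use;
  # showWarnings = FALSE keeps reruns quiet if a folder already exists.
  dir.create(file.path(project_root, f), recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
print(list.dirs(project_root, full.names = FALSE, recursive = FALSE))
```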

Datasets

1. The Iris Dataset

The iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris) has been used for
classification in many research publications. It consists of 50 samples from each of three
classes of iris flowers [Frank and Asuncion, 2010]. One class is linearly separable from
the other two, while the latter are not linearly separable from each other. There are five
attributes in the dataset:
1. sepal length in cm,
2. sepal width in cm,
3. petal length in cm,
4. petal width in cm, and
5. class: Iris Setosa, Iris Versicolour, and Iris Virginica.

> str(iris)
'data.frame': 150 observations (records, or rows) of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
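
Beyond str(), a few other base R functions give a quick first look at a data frame. The lines below are an illustrative sketch rather than part of the original lab sheet. They rely only on the built-in iris data frame and base R functions.

```r
# iris ships with base R (package datasets), so no import is needed.
print(dim(iris))                    # number of rows and columns
print(names(iris))                  # column names
print(head(iris, 3))                # first three records
print(table(iris$Species))          # number of samples per class
print(summary(iris$Petal.Length))   # five-number summary and mean
```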

2. The Bodyfat Dataset


Bodyfat is a dataset available in package TH.data [Hothorn, 2015]. It has 71 rows, and
each row contains information about one person. It contains the following 10 numeric
columns:
o age: age in years.
o DEXfat: body fat measured by DXA, response variable.
o waistcirc: waist circumference.
o hipcirc: hip circumference.
o elbowbreadth: breadth of the elbow.
o kneebreadth: breadth of the knee.
o anthro3a: sum of logarithm of three anthropometric measurements.
o anthro3b: sum of logarithm of three anthropometric measurements.
o anthro3c: sum of logarithm of three anthropometric measurements.
o anthro4: sum of logarithm of four anthropometric measurements.

The value of DEXfat is to be predicted by the other variables.

> data("bodyfat", package = "TH.data")
> str(bodyfat)
'data.frame': 71 obs. of 10 variables:
$ age : num 57 65 59 58 60 61 56 60 58 62 ...
$ DEXfat : num 41.7 43.3 35.4 22.8 36.4 ...
$ waistcirc : num 100 99.5 96 72 89.5 83.5 81 89 80 79 ...
$ hipcirc : num 112 116.5 108.5 96.5 100.5 ...
$ elbowbreadth: num 7.1 6.5 6.2 6.1 7.1 6.5 6.9 6.2 6.4 7 ...
$ kneebreadth : num 9.4 8.9 8.9 9.2 10 8.8 8.9 8.5 8.8 8.8 ...
$ anthro3a : num 4.42 4.63 4.12 4.03 4.24 3.55 4.14 4.04 3.91 3.66 ...
$ anthro3b : num 4.95 5.01 4.74 4.48 4.68 4.06 4.52 4.7 4.32 4.21 ...
$ anthro3c : num 4.5 4.48 4.6 3.91 4.15 3.64 4.31 4.47 3.47 3.6 ...
$ anthro4 : num 6.13 6.37 5.82 5.66 5.91 5.14 5.69 5.7 5.49 5.25 ...

Data Import and Export

Save and Load R Data


 Data in R can be saved as .Rdata files with function save(), and .Rdata files can be
reloaded into R with load().
 With the code below, we first create a new object a as a numeric sequence (1,
2, ..., 10) and a second new object b as a vector of characters ("a", "b", "c", "d",
"e").
 Object letters is a built-in vector in R of the 26 lowercase English letters, and
letters[1:5] returns the first five letters. We then save a and b to a file and remove
them from R with function rm(). After that, we reload both a and b from the file and
print their values.
> a <- 1:10
> b <- letters[1:5]
> getwd() # show the current working directory; use setwd() to change it
> save(a, b, file="mydatafile.Rdata")
> rm(a, b)
> load("mydatafile.Rdata")
> print(a)
[1] 1 2 3 4 5 6 7 8 9 10
> print(b)
[1] "a" "b" "c" "d" "e"

 An alternative way to save and load R data objects is to use functions saveRDS()
and readRDS(). They work in a similar way to save() and load().
 The differences are:
a. multiple R objects can be saved into one single file with save(), but only
one object can be saved in a file with saveRDS(); and
b. readRDS() enables us to restore the data under a different object name,
while load() restores the data under the same object name as when it was
saved.
> a <- 1:10
> saveRDS(a, file="mydatafile2.rds")
> a2 <- readRDS("mydatafile2.rds")
> print(a2)
[1] 1 2 3 4 5 6 7 8 9 10
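
The two differences can be seen side by side in a short sketch. This example is an addition, not part of the original lab sheet: the file names, the use of tempdir() and the object names x, y, e and both are illustrative assumptions. Wrapping several objects in a list is one common way to work around the one-object limit of saveRDS().

```r
# Work in a temporary directory so nothing in the project is overwritten.
rdata_file <- file.path(tempdir(), "two_objects.Rdata")
rds_file <- file.path(tempdir(), "one_object.rds")

x <- 1:3
y <- c("a", "b")

# save() can store several objects in one file.
save(x, y, file = rdata_file)

# load() restores them under their original names; loading into a
# separate environment avoids overwriting objects in the workspace.
e <- new.env()
restored_names <- load(rdata_file, envir = e)
print(restored_names)

# saveRDS() stores exactly one object, so several objects are wrapped in a list.
saveRDS(list(x = x, y = y), file = rds_file)

# readRDS() returns the object, so it can be assigned to any name.
both <- readRDS(rds_file)
print(names(both))
```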

R also provides function save.image() to save everything in the current workspace into a
single file, which is very convenient for saving your current work and resuming it later, if the
data loaded into R are not very big.

Import from and Export to .CSV Files


 A data frame is the data format that we mostly deal with in R. It is similar
to a table in a database, with each row being an observation (or record) and each
column being a variable (or feature).
 The example below demonstrates saving a data frame into a file and then reloading
it into R. First, we create three vectors (an integer vector, a numeric (real)
vector and a character vector) and use function data.frame() to build them into
data frame df1. Function sample(5) produces a random permutation of the integers
1 to 5.
 Column names in the data frame are then set with function names(), and df1 is
saved into a .CSV file with write.csv(). After that, we reload the data frame from
the file into a new data frame df2 with read.csv(). Note that the very first column
printed below is the row names, created automatically by R.

Example:
> var1 <- sample(5)
> var2 <- var1 / 10
> var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")
> df1 <- data.frame(var1, var2, var3)

> names(df1) <- c("Var.Int", "Var.Num", "Var.Char")
> write.csv(df1, "mydatafile3.csv", row.names = FALSE)
> df2 <- read.csv("mydatafile3.csv")
> print(df2)

  Var.Int Var.Num     Var.Char
1       3     0.3            R
2       4     0.4          and
3       1     0.1  Data Mining
4       2     0.2     Examples
5       5     0.5 Case Studies
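
The example above passes row.names = FALSE to write.csv(). The sketch below, which is an addition and not part of the original lab sheet, shows what that argument changes. The data frame df, the file name and the use of tempdir() are illustrative assumptions. With the default row.names = TRUE, the row names are written as an extra unnamed column, which read.csv() then reads back as a column called X.

```r
# A small illustrative data frame (hypothetical, not from the lab sheet).
df <- data.frame(id = 1:3, label = c("a", "b", "c"))
csv_file <- file.path(tempdir(), "rownames_demo.csv")

# Default behaviour: row names are written as an extra first column.
write.csv(df, csv_file)
with_rn <- read.csv(csv_file)
print(names(with_rn))

# With row.names = FALSE only the actual columns are written.
write.csv(df, csv_file, row.names = FALSE)
without_rn <- read.csv(csv_file)
print(names(without_rn))
```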
