R Master Sheet - All codes, inbuilt functions and packages needed for the course

The document provides a comprehensive guide on regular expressions, text cleaning, string manipulation, and data manipulation techniques in R. It includes various functions and their usage for tasks such as removing stopwords, reading different file formats, and manipulating data frames. Additionally, it covers the application of functions like 'apply', 'mutate', and 'filter' for data analysis and transformation.

Uploaded by

sushantgoyal3525
GENERAL:

trimws(x) to trim leading & trailing whitespaces
order(x, decreasing = T) returns the sorting order; for numeric x can just put order(-x) to get decreasing order
sort()
rownames(x) / colnames(x) get the row/column names of a dataframe; assign a vector to them to set the names
which.max() / which.min() return the index of the (first) max/min value in a vector
list.files() lists all files in the path directory (returns a character vector)
nrow() returns how many rows in a matrix/df/array
n() returns how many elements/observations (dplyr, used inside summarise/mutate)
t() means to transpose a matrix; %*% means matrix multiplication
is.na() {returns T/F}
factor(data, levels = c(...), labels = levels) can define a levels vector and set labels = levels
sample(x = a vector/list, size = how many elements randomly returned, replace = T/F, prob = c(a,b,c))
e.g sample(1:3, 20, replace = T, prob = c(0.2,0.3,0.5)) means from {1,2,3}, select 20 randomly, with replace = T so that elements can repeat, prob(1) = 0.2, prob(2) = 0.3, prob(3) = 0.5
rowMeans() / colMeans() / rowSums() / colSums()
nrow() / ncol() / nchar() returns number of rows/cols/characters in a given input
seq(from = start, to = end, by = steps)
ifelse(condition, value if TRUE, value if FALSE)
e.g ifelse(salary < 1000, "low", "high")
ceiling(x) rounds all numbers up, floor(x) rounds all numbers down
round(x, digits = n) rounds to n decimal places
diag(x = value in the diagonal indexes, nrow, ncol)
e.g diag(5, 5, 5) will return a 5 x 5 matrix with all 0s, but the diagonal is 5
NOTE: if you put only diag(n) where n = e.g 5, it'll return a 5 x 5 identity matrix
scales::percent()
case_when() (dplyr)
formattable::currency(x, digits = no. of digits after decimal) changes a numeric vector into currency format
options(scipen = 100) to remove scientific notation (i.e exponential 3e+05 to numeric 300000)
rename(dataframe, newcol = oldcol) to change column names (dplyr; the new name goes on the left)

REGULAR EXPRESSIONS:

^ means start of string
$ means end of string
[0-9] means any digit
[^0-9] means anything that is NOT a digit
\\d means digits
\\s*\\([^\\)]+\\) means all text within parentheses
"\\[[^][]*]" means all text within square brackets
gsub("(.*),(.*)", "\\2_\\1", names) means swap position of comma-separated strings
[] means any one of the characters within the brackets (or)
+ means 1 or more
* means 0 or more
\\ means an escape for metacharacters

Text-Cleaning:
gsub("@\\w+", "", x) to remove mentions (@...)
gsub("https?://.+", "", x) to remove urls (and everything after them)
gsub("\\d+\\w*\\d*", "", x) to remove numbers
gsub("#\\w+", "", x) to remove hashtags
gsub("[^\x01-\x7F]", "", x) to remove non-ASCII characters (the ^ must sit inside the brackets to negate the range)
gsub("[[:punct:]]", "", x) to remove punctuation
gsub("\n", "", x) to remove line breaks
gsub("^\\s+", "", x) to remove leading whitespace
gsub("[ |\t]", "", x) to remove spaces and tabs (inside [] the | is a literal character, so it is removed too)

# REMOVING STOPWORDS (stopwords() and stemDocument() are available e.g. in the tm package):
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')
# STEMMING TEXT: stemDocument(x, language = "en")
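The Text-Cleaning patterns above are usually chained in a fixed order. A minimal sketch in base R, assuming a made-up tweet and a hypothetical helper name clean_text (neither is from the sheet); it uses \\S+ for URLs instead of the sheet's .+ so that text after the link survives:

```r
# Hypothetical input; only base R functions are used.
tweet <- "@user Check THIS out!! https://example.com/page #rstats\n2nd try"

clean_text <- function(x) {
  x <- gsub("https?://\\S+", "", x)   # URLs first, before punctuation is stripped
  x <- gsub("@\\w+", "", x)           # mentions
  x <- gsub("#\\w+", "", x)           # hashtags
  x <- gsub("\n", " ", x)             # line breaks become spaces
  x <- gsub("[[:punct:]]", "", x)     # punctuation
  x <- gsub("\\d+\\w*\\d*", "", x)    # numbers and tokens that start with digits
  x <- gsub("\\s+", " ", x)           # collapse runs of whitespace
  trimws(tolower(x))                  # lower-case and trim both ends
}

cleaned <- clean_text(tweet)
print(cleaned)
```

Order matters here: removing punctuation before URLs would break the "https?://" pattern, since the colon and slashes would already be gone.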
STRINGS/CHARACTERS:

paste("string1", "string2", ..., sep = "___", collapse)
e.g paste("hi", "hi", sep = ",") == "hi,hi"
paste0("string1", "string2", ...) has no separator; directly pastes all strings together (if want to separate the words, must paste the separator in explicitly)
e.g paste0("I", " ", "am", " ", "here")
e.g paste("I", "am", "here", sep = " ")
strsplit(string, split = "____") where split is a character vector containing regex
substr(string, start, stop) based on index
gsub(pattern = old string, replacement = new string, x = string, ignore.case = T/F)
NOTE: use "[...]" to mean replace any one of the characters within these square brackets (it's like indexing out specific characters)
toupper() / tolower() transform strings into upper/lower case
chartr(old, new, x = string)
e.g chartr("wa", "WA", x = "walao eh") means change all "w" into "W", and "a" into "A", output = "WAlAo eh"
grep(pattern, vector, ignore.case = T/F, value = T/F) matches & returns the matching elements in a vector, where value = T returns the full strings, or if F, returns the indices; grepl() takes the same pattern/vector and returns T/F for each element
format(x, digits = digits to be displayed, nsmall, width, justify)
stringr::
str_split_fixed(string, pattern = "____", n) means split each string (e.g. a dataframe column) into a specified no. of columns (n), returning a character matrix
str_extract(string, pattern) extracts the first string that matches the specified pattern

READING FORMATS:
read.csv("path", colClasses = c("character", "numeric", "factor", ...))
TXT:
read.delim("path", sep = ",")
read.table("path", header = T, sep = ",")
XLSX:
install.packages("readxl", repos = "http://cran.us.r-project.org")
library(readxl)
read_excel("path", sheet = 2) also can define sheet = {name of sheet}
TSV:
read.table(".", header = T, sep = "\t")
WEB:
library(curl)
library(XML)
url <- curl(url)
urldata <- readLines(url)
readHTMLTable(urldata, stringsAsFactors = FALSE)
JSON:
install.packages("jsonlite", repos = "http://cran.us.r-project.org")
library(jsonlite)
data <- fromJSON(url)
datadf <- as.data.frame(data$...$...$) means to search through the parent-child nodes to obtain the desired data, then convert that to dataframe.
XML:
library(XML)
data <- xmlParse("path")
dataroot <- xmlRoot(data) returns the top-most node/parent of the xml file
datachild <- xmlChildren(dataroot) returns a list of children nodes from the parent node (dataroot)
NOTE: index datachild via datachild[[i]]
data_nodes <- getNodeSet(data, "/parent/child/child/child....") means to reach desired node level
xmlToDataFrame(data_nodes) convert all nodes in desired node level to data frame

APPLY FAMILY:

M = [matrix figure not preserved; the results quoted below are consistent with M = matrix(1:9, nrow = 3)]
apply(X = array/df/matrix, MARGIN = 1 or 2 (row or col), FUN = function, ...)
NOTE: if the function has extra arguments that switch between operations (e.g based on T/F parameters), specify the T/F in the "..." argument in apply() as well, because given just the function alone R cannot tell which operation you want it to perform.
e.g countOE <- function(m, odd) {if (odd) sum(m %% 2) else length(m) - sum(m %% 2)}
apply(m, 2, countOE, odd = FALSE) means take the second (even-count) calculation & perform it on each column of M, returning a vector {1, 2, 1}; this is because m %% 2 == 1 if odd, and == 0 if even, hence e.g the 1st column will return 3 (length) - 1 - 0 - 1 = 1.
rapply: stands for recursive apply
e.g x <- list(A = 2, B = c(1, -3))
rapply(x, function(x) {x^2}) returns A = 4, B1 = 1, B2 = 9
lapply(X = vector/list, FUN = function) always returns a list e.g lapply(x, mean)
sapply() same as lapply() but simplifies the result to a vector (or matrix) where possible
NOTE: applying the same countOE function (odd = FALSE) with lapply & sapply will return a list/vector of: 0, 1, 0, 1, 0, 1, 0, 1.... This is because there is NO MARGIN specification, hence the function will take each element as 'm', and hence:
1 (length) - 1 = 0
1 (length) - 0 = 1
1 (length) - 1 = 0
...
[Figures: apply results; sapply/lapply results]
mapply(FUN = function, vector1, vector2)
e.g mapply(rep, 1:5, c(4,4,4,4,4)) means to perform rep function on: rep(1,4), rep(2,4), rep(3,4) ... rep(5,4), hence returning a matrix.
tapply(X = vector, INDEX = vector/list of grouping factors, FUN = function)
e.g tapply(mtcars$mpg, INDEX = list(mtcars$am, mtcars$gear), mean) means group mpg by am & gear, and summarize by mean.
NOTE: am & gear are treated as factors (tapply converts them with as.factor if they are not factors already)

DATA MANIPULATION I:

full_join(active, podiums, by = "driver"): a full join returns all rows from both the active and podiums data frames, merging them based on the "driver" column. If there are no matches for a particular "driver" in either data frame, missing values (NA) will be filled in for the columns of the data frame without a match.
merge(companies, countries, by = "Country", all = TRUE)
NOTE: if variable names are diff, just use arguments by.x = , by.y = to specify the corresponding variables to merge by
table(x = vector) summarises frequencies of elements in a vector
subset(dataframe, select = c("col1", "col2", ...)) to retain specific columns or (..., select = -c(1,2,3,...n)) to drop columns
unique(dataframe) returns unique values
duplicated(dataframe) returns a T/F vector of whether row is duplicated
na.omit(x) removes rows w NA
complete.cases(x) returns a T/F vector, T for each row with no missing values
formatC(x, digits = n, format = "f", big.mark = ",") big.mark = "," helps to change numeric value e.g 10000 to "10,000"; format = "f" means fixed-point, so digits = no. of digits after the decimal
format()
e.g format(as.Date("2022-12-12"), "%d-%m-%Y") == "12-12-2022"
sprintf(fmt = "%f/d/...", x) where fmt is the format, x are the values (chr/int/num/log) to be passed into the fmt e.g sprintf("%02d:%02d", 9, 10) == "09:10"
NOTE: %02d means format integer as 2 digits, left padding it with zeros
NOTE: %f returns default 6 digits aft decimal, add a precision e.g %.8f to return 8 digits
cut(x = numeric vector, breaks = c(100, 200, 300, 400), labels = c("a", "b", "c"))
NOTE: if want to cut the vector such that (100 - 200) = "a", (201 - 300) = "b", (301 - 400) = "c", then breaks = c(99, 200, 300, 400), because by default each interval excludes its lower bound and includes its upper bound
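The countOE walkthrough in APPLY FAMILY can be checked directly. A small sketch, assuming M is matrix(1:9, nrow = 3) (an assumption: the original matrix figure is lost, but this choice reproduces the quoted results), with countOE written using if/else on the odd flag:

```r
m <- matrix(1:9, nrow = 3)   # assumed stand-in for M

countOE <- function(m, odd) {
  n_odd <- sum(m %% 2)       # m %% 2 is 1 for odd values, 0 for even values
  if (odd) n_odd else length(m) - n_odd
}

# Extra arguments such as odd = FALSE travel through apply()'s "..."
even_per_col <- apply(m, 2, countOE, odd = FALSE)   # one count per column
odd_per_row  <- apply(m, 1, countOE, odd = TRUE)    # one count per row

# No MARGIN here: every single element is passed as 'm' on its own
per_element <- sapply(m, countOE, odd = FALSE)

# rapply descends into the list; mapply pairs up its vector arguments
nested <- rapply(list(A = 2, B = c(1, -3)), function(x) x^2)
reps   <- mapply(rep, 1:5, 4)

print(even_per_col)
print(per_element)
print(nested)
print(dim(reps))
```

With this matrix the column-wise call gives 1, 2, 1 and the element-wise call alternates 0, 1, 0, 1, which is the pattern described in the notes.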
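The cut() NOTE and the sprintf/formatC notes in DATA MANIPULATION I are easier to trust after one run. A short sketch in base R; the scores vector is made up for illustration:

```r
scores <- c(100, 150, 200, 201, 300, 301, 400)   # hypothetical values

# Default intervals are (99,200], (200,300], (300,400]: open on the left
bands <- cut(scores, breaks = c(99, 200, 300, 400), labels = c("a", "b", "c"))

# Starting the breaks at 100 leaves the value 100 outside (100,200], so it becomes NA
bands_naive <- cut(scores, breaks = c(100, 200, 300, 400), labels = c("a", "b", "c"))

clock <- sprintf("%02d:%02d", 9L, 10L)   # zero-padded to 2 digits
pi6   <- sprintf("%f", pi)               # default: 6 digits after the decimal
pi8   <- sprintf("%.8f", pi)             # explicit precision: 8 digits
big   <- formatC(10000, format = "f", digits = 0, big.mark = ",")

print(data.frame(scores, bands, bands_naive))
print(c(clock, pi6, pi8, big))
```

An alternative to lowering the first break is cut(..., include.lowest = TRUE), which closes the first interval on the left.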
DATA MANIPULATION II:

tidyr::
gather(table4, key = "year", value = "cases", 2:3) means similar to Python key:value pair, to convert from Wide to Long format
spread(table2, key = key column, value = value column) is the reverse of gather (Long to Wide format)
separate(dataframe, col = col to split, into = c("newcol1", "newcol2"), sep = "____", fill, extra, remove = T/F)
fill: too few pieces, can fill from left/right
extra: too many pieces for the cols, can merge/drop
remove: if true, remove the original col from the df after splitting
replace_na(dataframe, value to replace NA with)
separate_rows(dataframe, cols which require splitting, sep = "___")
unite(dataframe, newcolname, cols to unite, sep = "___") this merges 2 columns together
fill(dataframe, cols, .direction = "up"/"down") where up means fill from the last bottom value up, and down means fill from the top value down.
setdiff(x, y = vectors/dfs) means to find the elements of x that are not in y
union(x, y) means to find the union between sets x & y (ALL elements)
intersect(x, y) means to find the intersection between sets x & y

dplyr::
filter(dataframe, conditions...) if no pipeline/df specified, then specify df in the first argument to be filtered, then followed by conditions
mutate(dataframe, newcol = oldcol1 <operator> oldcol2) means to create a newcol based on an operator or formula
mutate_at(dataframe, .vars = cols to mutate, .funs = function to perform on those cols) mutates specific columns in a df
mutate_all(dataframe, .funs) mutates ALL columns in a df
arrange() means to sort in an order; put "-" in front of the column to be sorted to arrange in descending order
select(x, y...) means to subset only x & y columns from the dataframe to show
slice(dataframe, row numbers) allows you to slice out a specific row/rows
e.g slice(dataframe, 10) returns the 10th row of the dataframe (inc. all columns)
slice_sample(dataframe) returns a random row of data in the df
slice_head(data, n) where n = how many first rows to extract; this function slices ALL cols and only returns n rows, instead of manually slicing each col
***comes with others like slice_tail, slice_sample etc
slice_max(dataframe, order_by = col to rank by, n) takes the top n rows
recode(x = chr vector, oldvalue1 = "name1", oldvalue2 = "name2", ...) recodes factor/chr values in a vector
group_by(x, y, ...) & summarise()
e.g companies %>% group_by(Continent, Country) %>% summarise(num_companies = n(), avg_assets = mean(assets…billion)) %>% arrange(-avg_assets)
bind_rows() / bind_cols() (dataframe, dataframe2) bind them in terms of rows/cols (stacked or side-by-side, then fill NA for any missing values)
NOTE: if you assign x <- data.frame(c(1,2,3,4,…)) you are creating a 4 x 1, but if x <- data.frame(a = 1, b = 2, c = 3, d = 4…) you are creating 1 x 4
*** This is impt in binding rows/columns
na_if(vector, value to replace with NA) means replace all values in a vector w NA if they equal the given value
DF %>% distinct(col_name, .keep_all = T) removes duplicate rows in a df
relocate(df, col to move, .before/.after) move around order of columns & specify before or after a specific col
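The NOTE on data frame shape under bind_rows/bind_cols can be verified with dim(). A minimal sketch; base R's rbind() and cbind() stand in for dplyr's bind_rows() and bind_cols() here so that no package is needed, and all values are made up:

```r
col_df <- data.frame(c(1, 2, 3, 4))               # one vector  -> 4 rows x 1 column
row_df <- data.frame(a = 1, b = 2, c = 3, d = 4)  # four scalars -> 1 row x 4 columns

print(dim(col_df))
print(dim(row_df))

# Stacking rows: column names must line up
stacked <- rbind(row_df, data.frame(a = 5, b = 6, c = 7, d = 8))

# Side-by-side: both pieces need the same number of rows
side <- cbind(col_df, id = letters[1:4])

print(stacked)
print(side)
```

Unlike rbind(), dplyr's bind_rows() tolerates non-matching columns and fills the gaps with NA, which is the behaviour the notes describe.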
setequal(x, y) means to check if 2 sets contain identical elements (returns T/F)
%in% same idea as Python's "in" operator, checks matching values (returns T/F)
which() returns the index of the value(s) where the condition is TRUE
match(v1, v2, nomatch = what to replace non-matched values with) returns the index of matched values

anti_join(active, podiums, by = "driver"): an anti-join between the active and podiums data frames based on the "driver" column. An anti-join returns all rows from the first data frame (active) that do not have matching values in the second data frame (podiums) based on the specified key column ("driver"). DOES NOT ADD COLUMNS OF THE SECOND data frame.

CHUNK OPTIONS (R Markdown / knitr):
echo: (TRUE; logical or numeric) Whether to display the source code in the output document. Can also be a numeric vector, eg: echo = 2:3, echo = -4 (exclude)
message: (TRUE; logical) Whether to preserve messages emitted by message()
include: (TRUE; logical) Whether to include the chunk output in the output doc. If FALSE, nothing will be written in output, but the code is still evaluated and plot files are generated if there are any, so you can manually insert figures later.
warning: (TRUE; logical) Whether to preserve warnings (produced by warning()) in the output. If FALSE, warnings are suppressed. Can take numeric values as indices to select a subset of warnings to include in the output. These values reference the indices of the warnings themselves (e.g., 3 means "the third warning from chunk"), not the indices of the expressions that emit warnings.
error: (TRUE; logical) Whether to preserve errors (from stop()). By def, code eval will still run although there is an error. To stop on errors, set opt to FALSE
fig.cap = "..." adds a caption to graphical results
results: ('markup'; character) controls how to display text results. This option only applies to normal text output (not warnings, messages or errors).
* avail options: 'markup' - markup w/ approp environment depending on output format, 'asis' - no markup, 'hold' - flush texts to chunk end, 'hide' (or FALSE) - hides text output
collapse: (FALSE; logical) whether to (if possible) collapse all the source and output blocks from one code chunk into a single block (by def written into separate blocks). This option applies only to Markdown documents

GGPLOT:
ggplot() + geom_bar/point/line
Note: geom_line requires group = 1
Note: geom_bar requires stat = "identity"
position = position_dodge() makes side by side
position = position_stack() makes it stacked
position = position_fill() makes it percentage stacked
position_jitter(width, height) to add random jitter to points to avoid overplotting {usually use geom_point(position = position_jitter(w, h))}
geom_density(x) shows the distribution of the variable
geom_smooth(method = "lm") adds a best fit line
geom_text(aes(label = round(Avg.Spending, 1)), position = position_dodge(0.9), color = "white", vjust = 0.5, hjust = 0.5) to add annotation to bar charts
Provides density tilings: https://ggplot2.tidyverse.org/reference/geom_density_2d.html
stat_density2d()
geom_density2d()
theme() to edit segments of a plot (e.g axis, titles, legend, ticks, background, panel)
theme(axis.text.x = element_text())
Assign formatting instructions on the right side to the left side
element_line/rect/text are the overarching functions to manipulate a plot segment
Add secondary axis:
ggplot() + geom_line1 + geom_line2 + scale_y_continuous(..., sec.axis = sec_axis(formula to define breaks for 2nd y axis, name, labels))
NOTE: e.g axis 1 in thousands, axis 2 in millions, then axis 2 formula = ~./1000
NOTE: can use scale(x|y, labels = ...) where labels will specify how the text/ticks are displayed
scale_x|y_log10(), scale_x|y_sqrt(), scale_x|y_reverse()
scale_color_manual(values = c(...)) to manually differentiate/facet by variable type via color
scale_fill_gradient(low, high, guide = FALSE)
scale_alpha(range, guide = FALSE)
rainbow(n) where n = number of colors in the palette
scale_x|y_continuous(..., sec.axis? )
facet_wrap(~variable, nrow, ncol) its like a dataframe in terms of display
theme_() to set the theme for non-data elements in plot

GGanimate:
p <- ggplot() + geom_ + ...

DATES/POSIXCT:
lubridate::
ymd_hms("chr vector in year, month, date, hour, min, sec") converts character to POSIXct format (defa…)
ymd("20221212") == {2022-12-12} in date format

CHARACTER text/strings:
as.character()
nchar(x): checks string length (including space)
paste("x", "y", sep = "%"): concatenate str tgt, eg: address, def sep = space
strsplit(exam.result, "/"): splits string (reverse of paste())
substr(" I am me", 8, 18) or substring(a, 8): extract out certain char in field (substr needs both start and stop; substring can omit stop)
gsub("behavior", "behaviour", a): replace words (every occurrence) | sub() replaces only the first occurrence
regexpr("el", "stella"): find a piece of string *-1 is returned when no match
gregexpr("el", "elegant stella"): find every single string
grep("la", c("stella", "elle", "val")): find the indices of the regular expressions in string (returns vectors element index)
substr("Stella", 2, 4): extract part of string w/ keys *result incld both stop n start
toupper("hi") | tolower("hi") | abbreviate("France") | trimws(" Hi ")

SIMULATION:
rnorm(n = sample size, mean, sd) generates a vector of elements that are NORMALLY distributed
rlnorm(n = sample size, meanlog, sdlog) generates LOG NORMAL distribution
sample(x = vector of elements to choose from, size = sample from vector, replace = T/F, prob = c(prob(x1), prob(x2)…))

Simulation Framework:
iteration1 <- function(iter, ...)
{
logic
}
n <- number of iterations
results <- data.frame(iter = 1:n, result = sapply(1:n, iteration1))
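The Simulation Framework skeleton can be filled in along these lines. This is only a sketch: the demand/weather logic, the parameter names and the seed are all invented for illustration, and only the iteration1 / n / results layout comes from the notes.

```r
set.seed(42)   # assumed seed, only so that reruns give the same draws

# One iteration: hypothetical daily demand drawn with rnorm(), scaled down
# on rainy days; the weather is drawn with sample() and unequal probabilities.
iteration1 <- function(iter, mean_demand = 100, sd_demand = 15) {
  demand  <- rnorm(1, mean = mean_demand, sd = sd_demand)
  weather <- sample(c("sun", "rain"), size = 1, prob = c(0.7, 0.3))
  if (weather == "rain") demand * 0.8 else demand
}

n <- 1000   # number of iterations
results <- data.frame(iter = 1:n, demand = sapply(1:n, iteration1))

summary_stats <- c(mean = mean(results$demand), sd = sd(results$demand))
print(head(results))
print(summary_stats)
```

sapply() calls iteration1 once per value of 1:n and collects the single number each call returns, so results ends up with one row per iteration.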

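The string-search helpers in the CHARACTER notes differ mainly in what they return (a position, all positions, or element indices). A quick base R sketch using the same example strings as the notes; the variable names are just for illustration:

```r
first_hit <- regexpr("el", "stella")                 # position of the first match
no_hit    <- regexpr("zz", "stella")                 # -1 when nothing matches
all_hits  <- gregexpr("el", "elegant stella")[[1]]   # positions of every match

words    <- c("stella", "elle", "val")
hits_idx <- grep("la", words)                 # indices of matching elements
hits_val <- grep("la", words, value = TRUE)   # the matching elements themselves
hits_lgl <- grepl("la", words)                # TRUE/FALSE per element

part     <- substr("Stella", 2, 4)            # start and stop are both included
once     <- sub("l", "L", "stella")           # first occurrence only
every    <- gsub("l", "L", "stella")          # every occurrence

print(as.integer(first_hit))
print(as.integer(all_hits))
print(hits_idx)
print(c(part, once, every))
```

regexpr() and gregexpr() also carry a match.length attribute, which is why the positions are wrapped in as.integer() before printing.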