R Master Sheet - All codes, inbuilt functions and packages needed for the course
rapply(x,function(x) {x^2})
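The apply-family calls in this sheet (rapply, apply, lapply, sapply, mapply, tapply) can be compared side by side. This is a minimal sketch in base R. The sheet never shows the body of countOE, so the definition below is an assumption chosen to reproduce the worked numbers (length minus the sum of m %% 2), and the 3 x 3 matrix m is likewise assumed.

```r
# ASSUMPTION: countOE is reconstructed from the worked example in the notes
# (length of the input minus the sum of its remainders mod 2).
countOE <- function(m) length(m) - sum(m %% 2)

m <- matrix(1:9, nrow = 3)            # columns: (1,2,3), (4,5,6), (7,8,9)

by_col  <- apply(m, 2, countOE)       # MARGIN = 2: one call per column
by_elem <- sapply(1:9, countOE)       # no MARGIN: one call per element, returns a vector
as_list <- lapply(1:3, countOE)       # same calls as sapply, but always returns a list

# rapply walks a nested list and applies the function to every leaf
squares <- rapply(list(1, list(2, 3)), function(x) x^2, how = "unlist")

# mapply pairs up its arguments: rep(1, 4), rep(2, 4), ..., rep(5, 4)
reps <- mapply(rep, 1:5, 4)           # results are bound column-wise into a matrix

# tapply groups mpg by am and gear; the grouping vectors are coerced to factors
grouped <- tapply(mtcars$mpg, list(mtcars$am, mtcars$gear), mean)
```

With these inputs, by_col is 1, 2, 1 and by_elem alternates 0, 1, 0, 1, ... as described in the notes.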
merge(companies, countries, by = "Countor", all = TRUE) all = TRUE also keeps the rows of either data frame without a match
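A small sketch of how the all argument changes the result of merge(). The contents of companies and countries are not shown in the sheet, so the two data frames and their column names (Company, Country, Region) are hypothetical.

```r
# HYPOTHETICAL data: the sheet does not show what companies/countries contain.
companies <- data.frame(Company = c("A", "B", "C"),
                        Country = c("SG", "US", "JP"))
countries <- data.frame(Country = c("SG", "US", "DE"),
                        Region  = c("Asia", "Americas", "Europe"))

# Default: only rows whose key appears in both data frames
inner <- merge(companies, countries, by = "Country")

# all = TRUE: also keep rows without a match, filling the gaps with NA
outer <- merge(companies, countries, by = "Country", all = TRUE)

# all.x = TRUE: keep every row of the first data frame only
left <- merge(companies, countries, by = "Country", all.x = TRUE)
```

Here inner has 2 rows, left has 3 and outer has 4; the unmatched rows carry NA in the columns that come from the other data frame.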
substr(string, start, stop) extracts characters based on index
gsub(old string, new string, string, ignore.case = T/F) replaces every occurrence of the old string
NOTE: use "[…]" to mean replace any character within these square brackets (it's like indexing out specific characters)
toupper() / tolower() transform strings into upper/lower case
chartr(old, new, x = string) translates characters one for one
e.g chartr("wa", "WA", x = "walao eh") means change all "w" into "W", and "a" into "A", output = "WAlAo eh"
grep(pattern, vector, ignore.case = T/F, value = T/F) finds the matching elements in a vector, where the value argument decides whether to return the full string (T) or the index (F). grepl() takes no value argument and returns a T/F vector instead
format(x, digits to be displayed, nsmall, width, justify)
Stringr::
str_split_fixed(string, pattern = "____", n) means split strings in a column into a specified no. of columns (n), returned as a character matrix
str_extract(string, pattern) extracts the first piece of each string that matches the specified pattern

apply(m, 2, countOE, … = FALSE) means take the countOE calculation & perform it on each column of m, returning a vector {1, 2, 1}; this is because m %% 2 == 1 if odd, and == 0 if even, hence e.g the 1st column will return 3 (length) - 1 - 0 - 1 = 1.
lapply(X = vector/list, FUN = function) always returns a list e.g lapply(x, mean)
sapply() same as lapply() but simplifies the result to a vector (or matrix) where possible
NOTE: applying the same countOE function with lapply & sapply will return a list/vector of: 0, 1, 0, 1, 0, 1, 0, 1…. This is because there is NO MARGIN specification, hence the function will take each element as 'm', and hence:
1 (length) - 1 = 0
1 (length) - 0 = 1
1 (length) - 1 = 0
…
mapply(FUN = function, VECTOR1, VECTOR2)
e.g mapply(rep, 1:5, c(4,4,4,4,4)) means to perform the rep function on: rep(1,4), rep(2,4), rep(3,4) … rep(5,4), hence returning a matrix.
tapply(X = vector, INDEX = vector/list, FUN = function)
e.g tapply(mtcars$mpg, INDEX = list(mtcars$am, mtcars$gear), mean) means group mpg by am & gear, and summarize by mean.
NOTE: am & gear are treated as factors (tapply coerces INDEX with as.factor(), so converting them first is optional)

READING FORMATS:
read.csv("path", colClasses = c("character", "numeric", "factor"…))
TXT:
read.delim("path", sep = ",")
read.table("path", header = T, sep = ",")
XLSX:
install.packages("readxl", repos = "http://cran.us.r-project.org")
library(readxl)
read_excel("path", sheet = 2) also can define sheet = {name of sheet}
TSV:
read.table(".", header = T, sep = "\t")
WEB:
library(curl)
library(XML)
url <- curl(url)
urldata <- readLines(url)
readHTMLTable(urldata, stringsAsFactors = FALSE)
JSON:
install.packages("jsonlite", repos = "http://cran.us.r-project.org")
library(jsonlite)
Data <- fromJSON(url)
Datadf <- as.data.frame(Data$...$...$) means to search through the parent-child nodes to obtain the desired data, then convert that to dataframe.
XML:
library(XML)
Data <- xmlParse("path")
Dataroot <- xmlRoot(Data) returns the top-most node/parent of the xml file
Datachild <- xmlChildren(Dataroot) returns a list of children nodes from the parent node (Dataroot)
NOTE: index Datachild via Datachild[[i]]
Data_nodes <- getNodeSet(Data, "/parent/child/child/child….") means to reach desired node level
xmlToDataFrame(nodes = Data_nodes) convert all nodes in desired node level to data frame

DATA MANIPULATION I:
NOTE (merge): if variable names are diff, just use arguments by.x = , by.y = to specify the corresponding variables to merge by
table(x = vector) summarises frequencies of elements in a vector
subset(dataframe, select = c("col1", "col2",…)) to retain specific columns or (…., select = -c(1,2,3,… n)) to drop columns
unique(dataframe) returns unique rows (unique values for a vector)
duplicated(dataframe) returns a T/F vector of whether row is duplicated
na.omit(dataframe) removes rows w NA
complete.cases(x) returns a T/F vector, T for each row that has no missing values
formatC(x, digits = n, format = "f", big.mark = ",") big.mark = "," helps to change numeric value e.g "10000" to "10,000", format = "f" means fixed notation with digits = no. of decimal places
format()
e.g format(as.Date("2022-12-12"), "%d-%m-%Y") == {12-12-2022}
sprintf(fmt = "%f/d/…", x) where fmt is the format, x are the values (chr/int/num/log) to be passed into the fmt E.g: sprintf("%02d:%02d", 9, 10) == "09:10"
NOTE: %02d means format integer 2 digits, left padding it with zeros
NOTE: %f returns default 6 digits aft decimal, add a precision e.g %.8f to return 8 digits
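The formatting and missing-value helpers above can be checked quickly in base R. A minimal sketch; the small data frame df is made up for illustration.

```r
# Number and date formatting
big   <- formatC(10000, format = "f", digits = 0, big.mark = ",")  # thousands separator
dated <- format(as.Date("2022-12-12"), "%d-%m-%Y")                 # day-month-year
clock <- sprintf("%02d:%02d", 9L, 10L)                             # zero-padded integers
six   <- sprintf("%f", pi)                                         # default 6 decimals
eight <- sprintf("%.8f", pi)                                       # explicit 8 decimals

# Missing values (df is a hypothetical example)
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
row_ok  <- complete.cases(df)   # TRUE only for rows with no NA
cleaned <- na.omit(df)          # drops every row that contains an NA
```

Only the first row of df is complete, so cleaned keeps a single row.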
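cut(x = numeric vector, breaks = c(100, 200, 300, 400), labels = c("a","b","c")) bins a numeric vector into labelled intervals
NOTE: intervals are open on the left & closed on the right by default, i.e (100, 200], (200, 300], (300, 400], so a value of exactly 100 falls outside & becomes NA. If want to cut the vector such that (100 - 200) = "a", (201 - 300) = "b", (301 - 400) = "c", then
breaks = c(99, 200, 300, 400)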
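A short check of the break behaviour described in the NOTE, using a made-up vector x. include.lowest = TRUE is shown as an alternative to shifting the first break; it is not mentioned in the sheet.

```r
x <- c(100, 150, 200, 201, 300, 400)   # hypothetical values, including the edges

# Default breaks: 100 sits on the open left edge of the first interval
default_breaks <- cut(x, breaks = c(100, 200, 300, 400), labels = c("a", "b", "c"))

# Shifting the first break down to 99 pulls 100 into "a"
shifted_breaks <- cut(x, breaks = c(99, 200, 300, 400), labels = c("a", "b", "c"))

# Alternative: keep the original breaks but close the lowest interval on the left
lowest_kept <- cut(x, breaks = c(100, 200, 300, 400), labels = c("a", "b", "c"),
                   include.lowest = TRUE)
```

default_breaks starts with NA, while the other two label the vector a, a, a, b, b, c.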
DATA MANIPULATION II:
Tidyr::
gather(table4, key = "year", value = "cases", 2:3) means similar to Python key:value pair, to convert from Wide to Long format
spread(table2, key = key, value = value) is the reverse, to convert from Long to Wide format
unite(dataframe, newcolname, cols to unite, sep = "___") this merges 2 columns together
fill(dataframe, cols, .direction = "up"/"down") where up means fill from last bottom value up, and down is top down value.
Dplyr::
filter(dataframe, conditions…) if no pipeline/df specified, then specify df in the first argument to be filtered, then followed by conditions
mutate(dataframe, newcol = oldcol *%^$- oldcol) means to create a newcol based on an operator or formula
mutate_at(dataframe, .vars = cols to mutate, .funs = function to perform on vars cols) mutate new columns for specific columns in a df
mutate_all(dataframe, .funs) mutate new columns for ALL columns in a df
arrange() means to sort in an order; put "-" in front of the vector to be sorted (or wrap it in desc()) to arrange in descending order
select(x, y…) means to subset only x & y columns from the dataframe to show
slice(dataframe, row numbers) allows you to slice out a specific row/rows
e.g slice(dataframe, 10) returns the 10th row of the dataframe (inc. all columns)
slice_sample(dataframe) returns a random row of data in the df
slice_head(data, n) where n = how many first rows to extract; this function slices ALL cols and only returns n rows, instead of manually slicing each col
***comes with others like slice_tail, slice_sample etc
slice_max(dataframe, order_by = col to slice, n) take the top n objects
recode(x = chr vector, old1 = "new1", old2 = "new2"…) recode factor/chr values in a vector
… b = 2, c = 3, d = 4…) you are creating 1 x 4 *** This is impt in binding rows/columns
na_if(vector, value to replace with NA) means replace all values in a vector w NA if they equal the given value
DF %>% distinct(col_name, .keep_all = T) removes duplicate rows in a df
relocate(df, col to move, .before/.after) move around order of columns & specify before or after a specific col
setdiff(x, y = vectors/dfs) means to find the elements of x that are not in y
union(x, y) means to find the union between sets x & y (ALL elements)
intersect(x, y) means to find the intersection between sets x & y
setequal(x, y) means to check if 2 sets contain identical elements (returns T/F)
%in% same idea as Python's in operator, checks matching values (returns T/F)
which() returns the index of the value(s) that meet the condition
match(v1, v2, nomatch = what to replace non-matched values with) returns index of matched values

CHARACTER text/strings:
as.character()
nchar(x): checks string length (including space)
paste("x", "y", sep = "%"): concatenate str tgt, eg: address, def sep = space
strsplit(exam.result, "/"): splits string (reverse of paste())
substr(" I am me", 8, 18) or substring(a, 8): extract out certain char in the field
gsub("behavior", "behaviour", a): replace words (every occurrence) | sub() replaces the first occurrence only
regexpr("el", "stella"): find a piece of string *-1 is returned when no match
gregexpr("el", "elegant stella"): find every single string
grep("la", c("stella", "elle", "val")): find regular expressions in string (returns vectors element index)
substr("Stella", 2, 4): extract part of string w/ keys *result incld both start n stop
toupper("hi") | tolower("hi") | abbreviate("France") | trimws(" Hi ")

CHUNK OPTIONS:
echo: (TRUE; logical or numeric) Whether to display the source code in the output document. Can also take a numeric vector, eg: echo = 2:3, echo = -4 (exclude)
message: (TRUE; logical) Whether to preserve messages emitted by message()
include: (TRUE; logical) Whether to include the chunk output in the output doc. If FALSE, nothing will be written in output, code's still evaluated and plot files are generated if there are any, so you can manually insert figures later.
warning: (TRUE; logical) Whether to preserve warnings (produced by warning()) in the output. If FALSE, warnings suppressed. Can take numeric values as indices to select a subset of warnings to include in the output. These values reference the indices of the warnings themselves (e.g., 3 means "the third warning from chunk") not the indices of the expressions that emit warnings.
fig.cap = "..." adds a caption to graphical results
results: ('markup'; character) controls how to display text results. This option only applies to normal text output (not warnings, messages or errors).
* avail options: 'markup' - markup w/ aprop environment depending on output format, 'asis' - no markup, 'hold' - flush texts to chunk end, 'hide' (or FALSE) - hides the text output
collapse: (FALSE; logical) whether to (if possible) collapse all the source and output blocks from one code chunk into a single block (by def written into separate blocks). This option applies only to markdown documents

GGPLOT:
ggplot() + geom_bar/point/line
Note: geom_line requires group = 1
Note: geom_bar requires stat = "identity"
position = position_dodge() makes side by side
position = position_stack() makes it stacked
position = position_fill() makes it percentage stacked
position_jitter(width, height) to add random jitter points to avoid overplotting {usually use geom_point(position = position_jitter(w, h))}
geom_density(x) shows the distribution of the variable
geom_smooth(method = "lm") adds a best fit line
geom_text(aes(label = round(Avg.Spending, 1)), position = position_dodge(0.9), color = "white", vjust = 0.5, hjust = 0.5) to add annotation to bar charts
Provides density tilings: https://ggplot2.tidyverse.org/reference/geom_density_2d.html
stat_density2d()
geom_density2d()
theme() to edit segments of a plot (e.g axis, titles, legend, ticks, background, panel)
theme(axis.text.x = element_text()) Assign formatting instructions on the right side to the left side
element_line/rect/text are the overarching functions to manipulate a plot segment
Add secondary axis:
ggplot() + geom_line1 + geom_line2 + scale_y_continuous(… sec.axis = sec_axis(formula to define breaks for 2nd y axis, name, labels))
NOTE: e.g axis 1 in thousands, axis 2 in millions, then axis 2 formula = ~./1000
NOTE: can use scale(x|y, labels = …) where labels will specify how the text/ticks are displayed
scale_x|y_log10(), scale_x|y_sqrt(), scale_x|y_reverse()
scale_color_manual(values = c(…)) to manually differentiate/facet by variable type via color
scale_fill_gradient(low, high, guide = "none") (guide = FALSE in older ggplot2)
scale_alpha(range, guide = "none")
rainbow(n) where n = number of colors in the palette
scale_x|y_continuous(… sec.axis? )

SIMULATION:
rnorm(n = sample, mean, sd) generates a vector of elements that are NORMALLY distributed
rlnorm(n = sample, meanlog, sdlog) generates LOG NORMAL distribution
sample(x = vector of elements to choose from, size = sample from vector, replace = T/F, prob = c(prob(x1), prob(x2)…))
Simulation Framework:
iteration1 <- function(iter, ...)
{
logic
}
n <- number of iterations
results <- data.frame(iter = 1:n, sapply)
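The simulation framework can be fleshed out as below. This is only a sketch: the "logic" inside iteration1 (a normally distributed demand multiplied by a randomly sampled price), the column name revenue, the sapply(1:n, iteration1) call and the fixed seed are all assumptions made for illustration, since the sheet leaves them unspecified.

```r
set.seed(42)   # ASSUMPTION: fixed seed so the sketch is reproducible

# One iteration. The logic is HYPOTHETICAL: draw a demand and a price scenario.
iteration1 <- function(iter, mean_demand = 100, sd_demand = 15) {
  demand <- rnorm(1, mean = mean_demand, sd = sd_demand)
  price  <- sample(c(8, 10, 12), size = 1, replace = TRUE,
                   prob = c(0.2, 0.5, 0.3))
  demand * price
}

n <- 1000   # number of iterations

# Run the iteration once per index and collect the outcomes in a data frame
results <- data.frame(iter = 1:n, revenue = sapply(1:n, iteration1))

# Summarise the simulated distribution
summary_stats <- c(mean = mean(results$revenue), sd = sd(results$revenue))
```

Keeping each iteration in its own function means the same sapply() call works whatever the logic is; only iteration1 needs to change.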
Chunk option error: (TRUE; logical) whether to preserve errors (from stop()). By def, code eval will still run although error. To stop on errors, set opt to FALSE
anti_join(active, podiums, by = "driver"):
an anti-join between the active and podiums data frames based on the "driver" column. An anti-join returns all rows from the first data frame (active) that do not have matching values in the second data frame (podiums) based on the specified key column ("driver"). DOES NOT ADD COLUMNS OF THE SECOND data frame
facet_wrap(~variable, nrow, ncol) its like a dataframe in terms of display
theme_…() e.g theme_bw(), to set the theme for non-data elements in plot
GGanimate:
P <- ggplot() + geom_ + …
DATES/POSIXCT:
ymd_hms("chr vector in year, month, date, hour, min, sec") converts character to POSIXct format (defa
ymd("20221212") == {2022-12-12} in date format
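ymd() and ymd_hms() are not base R functions (they come from the lubridate package, which this sheet appears to assume is loaded). For comparison, a minimal base R sketch of the same conversions; the example timestamp and the UTC time zone are assumptions.

```r
# Character to Date: base R needs the input format spelled out
d <- as.Date("20221212", format = "%Y%m%d")

# Character to POSIXct (date-time); tz = "UTC" is an assumed time zone
dt <- as.POSIXct("2022-12-12 08:30:00", tz = "UTC")

# Back to formatted strings
d_text  <- format(d, "%d-%m-%Y")
dt_text <- format(dt, "%H:%M", tz = "UTC")

# Date arithmetic: days until 25 Dec 2022
days_left <- as.numeric(difftime(as.Date("2022-12-25"), d, units = "days"))
```

d prints as 2022-12-12, the same result the sheet gives for ymd("20221212").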