
BIO259 Note

The document provides an introduction to biological data analysis using R. It discusses data types and structures like dataframes, vectors and matrices. It covers topics like data import, indexing, sorting and basic operations. Functions like read.table, select, filter and arrange from dplyr package are also introduced.

Uploaded by

Chilli Lee

Week 1: Lecture 1

Introduction to Biological Data


- Advantages of the R language
1. Analyzes large amounts of data that traditional methods cannot handle
2. Not limited to a million rows; DNA nucleotide data has far more than
1 million rows for one person (sample).
3. Can handle very large text files
4. Used to make biological discoveries from big data sets through a series of
analyses, and to generate graphical plots
- Biological data
1. Dataframe
a. Column: read from top to bottom; a collection of single elements that all have the
same type -> a vector
b. Header: tells you what is present in the column
c. Row: read from left to right
d. Cell: one single value

2. Types of data
a. String: text, e.g. Canada; case sensitive, so it really matters whether
something is capitalized or not
b. Integers: numbers with no decimal place; cells that contain only numbers
that you can do math on
c. Floats: numbers with a decimal value (called double in R, short for
double precision)
d. Missing data: blank information. Either the cell is completely blank or it
shows a marker such as NaN, NA or <NA>
- A whole column has the same type of data
3. Beyond table
a. Variable: one single thing stored in one single cell
b. Vector: a collection of values of the same type, without dimensions
E.g. a vector of strings: c('Canada', 'United States', 'Mexico')
E.g. a vector of integers: c(1L, 2L, 4L, 6L) (the L suffix makes the values integers; plain c(1, 2, 4, 6) is stored as floats)
E.g. a vector of floats: c(2.234, 5.534, 7.346)

c. Dataframe: consists of multiple vectors (columns) that can hold different types of data
d. Matrix: conceptually a vector of vectors; a collection of values of the same data type
organized into two dimensions, so the whole matrix has a single data type
e. Matrix vs data frame

             Matrix                      Data frame
Data type    Can only be one data type   Different columns can contain different types of data
Header       Never                       Sometimes

f. Array: a collection of like values organized into many dimensions (arrays are not
used in this course)
g. List (row): a collection whose elements can be of different types, like one row of a data frame
- Coding in R
1. Variable assignment
variable <- "assigned_to_this_value"
Numerical values (integers, floats) are things you can do math on; write them without quotes
Adding quotes turns the value into a string
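The data structures and the assignment rule above can be sketched in a few lines of R. This is only an illustration; every variable name and value is made up for the example.

```r
# Variable assignment: unquoted numbers stay numeric, quoted values are strings.
gene_count <- 42          # numeric (stored as a float/double by default)
sample_size <- 42L        # the L suffix makes an integer
country <- "Canada"       # string (character)
quoted_number <- "42"     # quotes turn the number into a string

# Vectors hold values of a single type.
countries <- c("Canada", "United States", "Mexico")
counts <- c(1L, 2L, 4L, 6L)
weights <- c(2.234, 5.534, 7.346)

# A data frame combines vectors of different types as columns.
samples <- data.frame(country = countries, weight = weights)

# A matrix holds a single type in two dimensions.
m <- matrix(1:6, nrow = 2)

print(class(gene_count))      # "numeric"
print(class(sample_size))     # "integer"
print(class(quoted_number))   # "character"
print(sapply(samples, class)) # one type per column
```

class() is a quick way to check which type R has given a value.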
Week 1: Lecture 2
Introduction to Biological Data
- Variable type
Logical (T/F) < Integer < Numeric (or float) < Character < List
1. A lower type can be converted into a higher type
2. A higher type can sometimes be converted into a lower type, but information will be lost
3. All numbers are floats by default unless specified otherwise
E.g. c(TRUE, FALSE)

Expression                      Result
c(TRUE, FALSE)                  TRUE, FALSE
as.integer(c(TRUE, FALSE))      1, 0
as.numeric(c(TRUE, FALSE))      1, 0
as.character(c(TRUE, FALSE))    'TRUE', 'FALSE'

4. Between logical and number (0 is FALSE and every other number is TRUE)
E.g. c(0, 0.5, 1.5, 2.5, 3.5)

Expression                                   Result
c(0, 0.5, 1.5, 2.5, 3.5)                     0, 0.5, 1.5, 2.5, 3.5
as.integer(c(0, 0.5, 1.5, 2.5, 3.5))         0, 0, 1, 2, 3 (information is lost going from float (higher) to integer (lower): the decimals are dropped)
as.logical(c(0, 0.5, 1.5, 2.5, 3.5))         FALSE, TRUE, TRUE, TRUE, TRUE
as.character(c(0, 0.5, 1.5, 2.5, 3.5))       '0', '0.5', '1.5', '2.5', '3.5' (no information is lost going from float (lower) to character (higher))

5. Converting from a string only works for a direct representation
E.g. 'TRUE' becomes TRUE when converted to logical; a string that does not directly
represent the target type turns into NA
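A short sketch of the conversion rules above, using the same vectors as the tables. The string "banana" is just an arbitrary example of text that does not represent a logical or a number.

```r
# Lower types convert to higher types without loss.
flags <- c(TRUE, FALSE)
print(as.integer(flags))     # 1 0
print(as.character(flags))   # "TRUE" "FALSE"

# Going from a higher type to a lower one can lose information.
values <- c(0, 0.5, 1.5, 2.5, 3.5)
print(as.integer(values))    # 0 0 1 2 3 (decimals are dropped)
print(as.logical(values))    # FALSE TRUE TRUE TRUE TRUE

# Strings convert only when they directly represent the target type.
print(as.logical("TRUE"))                      # TRUE
print(as.logical("banana"))                    # NA
print(suppressWarnings(as.integer("banana")))  # NA (R also warns about it)
```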

- Data files
1. Text files: human readable; the program has to convert the data you supply when it reads the file
a. TXT: text file
b. CSV: data separated by commas
If a comma is inside quotes, as in "vdf,vdf", that comma should not be treated as a
separator: the field comes out as vdf,vdf instead of two fields vdf and vdf
c. TAB: data separated by tabs
d. Fields
1. Nominal values (labels with no order)
2. Ordinal values (labels with some order)
3. Numeric values (numbers)

2. Binary files: data already converted so that a program can read it much faster, but not human
readable
3. Compressed files: a compressed text or binary file; not human readable
Open a compressed file with read.table("txt", sep = ",", header = TRUE, quote = "");
quote = "" means the file has no specific character for quotes.
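As a minimal sketch of reading text files, the example below writes a tiny CSV file (invented data) to a temporary location and reads it back with read.table(). The quoted field contains a comma, so quote = "\"" is used here instead of quote = "". The last part reads a gzip-compressed copy through a gzfile() connection.

```r
# Write a small CSV file (hypothetical data) with a comma inside a quoted field.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("gene,description",
             "A,\"kinase, putative\"",
             "B,transporter"), csv_path)

# sep = "," splits on commas; quote = "\"" protects commas inside double quotes.
genes <- read.table(csv_path, sep = ",", header = TRUE, quote = "\"",
                    stringsAsFactors = FALSE)
print(genes)

# The same data, gzip-compressed, read through a gzfile() connection.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
writeLines(readLines(csv_path), con)
close(con)
genes_gz <- read.table(gzfile(gz_path), sep = ",", header = TRUE, quote = "\"",
                       stringsAsFactors = FALSE)
```

The header line does not count as a data row, so the resulting data frame has two rows.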
Week 2: Lecture 3
Data Organization and Manipulation

- Data Frame Variants


a. A numeric column containing a missing-data marker that R does not recognize (e.g. na)
becomes a character vector
b. If the program can determine that the marker is missing data instead of a random
character (e.g. NA, or a marker listed in the na.strings argument), it reads the column as double.
c. Common delimiters include spaces sep = " ", tabs sep = "\t", commas sep = ",",
and other special characters.
d. Use the correct separator for each type of text file; R will give you an
error if you use the wrong one

- Other file format considerations: reading directly into R


1. To read an Excel file in R, convert the Excel file into CSV
2. To read it directly, load the readxl package with library(readxl) and use read_excel()
- Indexing biological data: lets you extract specific values from the data
1. Also called slicing
2. vector[c(1, 8:10)] returns elements 1, 8, 9 and 10 of the vector
3. Replace values: vector[c(1, 8:10)] <- c(1, 2, 3, 4) (one new value per selected element)
4. Remove an index: use a negative index
5. For a data frame, the header does not count as a row
6. Use the dplyr package: the select() function selects specific columns based on
the header value
7. Use the dplyr package: the filter() function selects specific rows based on their
value
8. Use arrange() to sort your data (from small to large) by one or
multiple columns
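The indexing rules and the three dplyr verbs above can be tried on a small example. This sketch assumes the dplyr package is installed; the data frame and its column names are invented for illustration.

```r
library(dplyr)

# Vector indexing (R counts from 1).
v <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
print(v[c(1, 8:10)])             # elements 1, 8, 9 and 10
v[c(1, 8:10)] <- c(1, 2, 3, 4)   # replace those four elements
print(v[-1])                     # a negative index drops the first element

# A small hypothetical data frame.
samples <- data.frame(
  id = c("s1", "s2", "s3", "s4"),
  species = c("mouse", "human", "mouse", "human"),
  weight = c(21.5, 70.2, 19.8, 64.0)
)

picked <- select(samples, id, weight)        # columns by header name
heavy <- filter(samples, weight > 20)        # rows by value
sorted <- arrange(samples, species, weight)  # sort by species, then by weight
print(sorted)
```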

- Relational databases: tables that share similar rows can be combined

merge(df1, df2, by = "column name")

- File system
1. How to navigate the file path
a. An absolute path points to the file regardless of the current location; it is
usually the 'full' path
b. A relative path starts from the current working directory
Lecture 4
Data organization and manipulation
- Indexing vector
a. first_value <- vect[1] takes the first element
b. middle_values <- vect[3:6] takes elements 3, 4, 5 and 6
c. first_row <- df[1, ] (for a data frame or matrix): if nothing is given in the column position, all columns are assumed
- Indexing dataframe
col <- df["colname"]
a. Asking for even numbers: a %% 2 == 0, where "a" is the vector; a number divided by 2
with a remainder of 0 is an even number
b. If you select one column with df[, "colname"], R automatically changes it into a vector
instead of a data frame; drop = FALSE means retain the data frame format
c. df[1] > 10 on its own only gives you logical values (TRUE or FALSE); to get the
matching rows back as a data frame, type df[df[[1]] > 10, ]
d. OR is | , AND is & , NOT is !
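A base R sketch of the vector and data frame indexing described above. The vector vect and the data frame df are made-up examples.

```r
vect <- c(4, 7, 12, 15, 18, 21)
first_value <- vect[1]           # first element
middle_values <- vect[3:6]       # elements 3, 4, 5 and 6
evens <- vect[vect %% 2 == 0]    # keep only the even numbers

df <- data.frame(count = c(5, 12, 30), gene = c("A", "B", "C"))
first_row <- df[1, ]                     # empty column position: all columns
as_vector <- df[, "count"]               # a single column becomes a vector
as_frame <- df[, "count", drop = FALSE]  # drop = FALSE keeps the data frame
is_big <- df[1] > 10                     # logical values only
big_rows <- df[df[[1]] > 10, ]           # rows where column 1 exceeds 10
big_not_c <- df[df$count > 10 & !(df$gene == "C"), ]  # AND combined with NOT
```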

- Relational data
a. Primary key: identifies the record
b. Foreign key: a column in one table that points to another table
c. Combining multiple columns: join or merge -> in order to get more
compact information
d. The type of join specifies which rows with missing foreign key and
primary key pairs are returned. (In the lecture diagram, the orange part indicates the rows that are returned.)

e. Inner join: ensures the key and foreign key are in both tables. Removes rows
that have no relationship, so there is no missing data in an inner join
f. Full join: keeps every row whether or not there is a relationship. Missing data
remain as blank cells
g. Left join: displays everything in the table on the left; if there are matches
in the right table, it pulls any relevant data from the right table.
h. Right join: displays everything in the table on the right; if there are matches
in the left table, it pulls any relevant data from the left table.
i. When matching IDs make sure they have the same name. R is case
sensitive, so different spellings may refer to the same thing but R won't know; make
sure to match on an ID that does not change
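The four join types can be sketched with base R's merge(), the function named earlier in these notes. The two tables are hypothetical: patient 1 has two samples, patients 2 and 3 have none, and sample s3 points to a patient (4) that is missing from the patients table.

```r
patients <- data.frame(patient_id = c(1, 2, 3),
                       name = c("Ana", "Ben", "Cy"))
samples <- data.frame(sample_id = c("s1", "s2", "s3"),
                      patient_id = c(1, 1, 4))

inner <- merge(patients, samples, by = "patient_id")                # matches only
full  <- merge(patients, samples, by = "patient_id", all = TRUE)    # every row
left  <- merge(patients, samples, by = "patient_id", all.x = TRUE)  # all patients
right <- merge(patients, samples, by = "patient_id", all.y = TRUE)  # all samples

print(full)  # rows without a match are filled with NA
```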
- Type of relational data:
a. Phylogenetic tree
b. Family tree

- Sorting (when numbers are stored as characters)

1. sort(c("1", "2", "10")) gives "1", "10", "2" because of alphabetical order
2. sort(as.integer(c("1", "2", "10"))) gives 1, 2, 10
3. Specify sorting by column 1 and then column 2, or else the order within
column 2 and the remaining columns is arbitrary
- Coding in R
arrange() returns the sorted data
group_by() %>% slice() returns selected rows of each group, e.g. slice(1) for the first row
group_by() %>% summarise() returns a summary, such as the mean, of each group
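A small sketch of the sorting and grouping calls above, assuming the dplyr package is installed. The reads table and its columns are invented for the example.

```r
library(dplyr)

print(sort(c("1", "2", "10")))              # "1" "10" "2" (alphabetical)
print(sort(as.integer(c("1", "2", "10"))))  # 1 2 10 (numeric)

reads <- data.frame(
  gene = c("B", "A", "B", "A"),
  sample = c(2, 2, 1, 1),
  count = c(40, 10, 20, 30)
)

# Sort by gene first, then by sample within each gene.
sorted <- arrange(reads, gene, sample)

# First row of each gene after sorting.
first_per_gene <- sorted %>% group_by(gene) %>% slice(1)

# Mean count of each gene.
mean_per_gene <- reads %>% group_by(gene) %>% summarise(mean_count = mean(count))
print(mean_per_gene)
```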
Week 3: Lecture 5
Basic Operations and Data Summaries
- Basic operation

Operator  Meaning                        Example                                          Result
<         Less than                      x <- c(10,5,0); y <- c(5,5,0); x < y             FALSE FALSE FALSE
>         More than                      x <- c(10,5,0); y <- c(5,5,0); x > y             TRUE FALSE FALSE
<=        Less than or equal to          x <- c(10,5,0); y <- c(5,5,0); x <= y            FALSE TRUE TRUE
>=        More than or equal to          x <- c(10,5,0); y <- c(5,5,0); x >= y            TRUE TRUE TRUE
==        Equal to                       x <- c(10,5,0); y <- c(5,5,0); x == y            FALSE TRUE TRUE
!=        Not equal to                   x <- c(10,5,0); y <- c(5,5,0); x != y            TRUE FALSE FALSE
!         Not                            x <- c(FALSE,TRUE,TRUE); !x                      TRUE FALSE FALSE
&         And (element by element)       x <- c(T,T,F); y <- c(T,F,F); x & y              TRUE FALSE FALSE
&&        And (only one comparison)      x <- c(T,T,F); y <- c(T,F,F); x[1] && y[1]       TRUE
|         Or (element by element)        x <- c(T,T,F); y <- c(T,F,F); x | y              TRUE TRUE FALSE
||        Or (only one comparison)       x <- c(F,T,F); y <- c(F,F,F); x[1] || y[1]       FALSE

&& and || compare single values only; since R 4.3, giving them vectors longer than one element is an error.

- Order of evaluation: ! before & before |

- TRUE & FALSE is FALSE
- NA in R is an actual missing value that won't affect the type of the vector, but other
markers such as na or <NA> are treated as characters, which affects the type of the vector
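The operators and the evaluation order above can be checked directly in R. The vectors are the same ones used in the table; the last lines illustrate how NA differs from a character marker such as "na".

```r
x <- c(10, 5, 0)
y <- c(5, 5, 0)
print(x < y)    # FALSE FALSE FALSE
print(x >= y)   # TRUE TRUE TRUE
print(x != y)   # TRUE FALSE FALSE

a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)
print(a & b)         # TRUE FALSE FALSE
print(a | b)         # TRUE TRUE FALSE
print(a[1] && b[1])  # TRUE (single values only)

# ! is applied before &: (!FALSE) & FALSE is FALSE.
not_before_and <- !FALSE & FALSE
# & is applied before |: TRUE | (FALSE & FALSE) is TRUE.
and_before_or <- TRUE | FALSE & FALSE

# NA keeps the vector numeric; a character marker changes its type.
print(class(c(1, NA, 3)))    # "numeric"
print(class(c(1, "na", 3)))  # "character"
```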
- Managing different types of variable
1. Continuous variable
Can take infinitely many values, like floats (decimal numbers), with no minimum distance
between numbers
2. Discrete variable
Has a positive minimum distance between numbers, like the integers 1, 2, 3, which all differ
by 1
3. Categorical variable
A finite number of items (categories)
4. Nominal variable (a variable with names but no order)
5. Ordinal variable (a variable with an order)
6. Qualitative variables: cannot be counted, more like a description (like a comment
on a wine)
7. Quantitative variables: can be counted (like a rating for a wine)
8. Independent variable (not affected by the experiment)
9. Dependent variable (the outcome of the experiment)

- four primary data measurement scales


1. Nominal scale: involve categorical values without any quantitative value
2. Ordinal scale: involve categorical values where order is significant, but the
differences are ambiguous
3. Interval scale: involve numerical values where both order and exact
differences are known
4. Ratio scale: numerical values with known order, exact differences, and an
absolute zero

- Statistical Definitions
1. Range: the difference between the highest and lowest values
2. Mode: the value that occurs most often
3. Median: the middle number of the ordered values
4. Mean: the average
5. Standard deviation: the square root of the variance
6. Standard error

$\text{standard error} = \dfrac{\text{standard deviation}}{\sqrt{\text{number of samples}}}$
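The definitions above can be computed in R as follows. The numbers are arbitrary. Base R has no built-in function for the statistical mode, so this sketch counts values with table().

```r
values <- c(2, 4, 4, 5, 7, 8)

value_range <- max(values) - min(values)                     # 6
counts <- table(values)
value_mode <- as.numeric(names(counts)[which.max(counts)])   # 4
value_median <- median(values)                               # 4.5
value_mean <- mean(values)                                   # 5
value_sd <- sd(values)                                       # square root of var(values)
value_se <- value_sd / sqrt(length(values))                  # standard error

print(c(range = value_range, mode = value_mode, median = value_median,
        mean = value_mean, sd = value_sd, se = value_se))
```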

- Regular expression
1. Pattern
2. mutate() adds a new column to the dataframe
3. pivoting : group_by() %>% summarise()
Week 3: Lecture 6
Regular Expressions
- R does not need the /.../ delimiters around a regular expression; the pattern is written as a string, e.g. "A" rather than /A/
- ?: used for logic
- [ ] means "or": /[AB]/ matches A or B
- [A-Z] is all capital letters, [a-z] is all lowercase letters, [A-Za-z] is all letters
- Repeat the [ ] one or more times by adding +, like /[A-Za-z]+/
- . matches any character, so adding . in front of + matches at least one more character of any kind, like /[A-Za-z].+/
- Repeat 0 or more times by using * instead of +, like /[A-Za-z]*/
- Match 0 or 1 letter by adding ?, like /[A-Za-z]?/
- ^ the match must be at the start
- $ the match must be at the end
- \ matches a special character literally (in an R string the backslash is doubled, e.g. "\\.")
- \w: [A-Za-z0-9_]
- \d: [0-9]
- \s: space, tab, carriage return
- Replace: /match/replace/ (in R, use sub() or gsub())
- {n} repeats the preceding item n times, e.g. when matching YYYY-MM-DD
e.g. specific to September or August: "[0-9]{4}-[0][89]-[0-9]+"
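A few of the patterns above, tried with base R's grepl(), regmatches(), sub() and gsub(). The example strings are made up. The date pattern here is a slightly stricter variant of the one in the notes, anchored with ^ and $ and using {2} for the day.

```r
dates <- c("2023-08-15", "2023-10-01", "2023-09-30")

# Dates in August or September: four digits, then 08 or 09, then two digits.
is_aug_sep <- grepl("^[0-9]{4}-0[89]-[0-9]{2}$", dates)
print(is_aug_sep)  # TRUE FALSE TRUE

# [A-Za-z]+ extracts every run of letters.
text <- "gene ABC1 and xyz"
words <- regmatches(text, gregexpr("[A-Za-z]+", text))[[1]]
print(words)  # "gene" "ABC" "and" "xyz"

# Replace: sub() changes the first match, gsub() changes every match.
one_space <- sub("\\s+", "_", "sample  one")  # "sample_one"
masked <- gsub("\\d", "#", "id42")            # "id##"

# A backslash makes the dot literal instead of "any character".
has_dot <- grepl("\\.", c("a.b", "ab"))       # TRUE FALSE
```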
Week 4: Lecture 7
Graphical Data Summaries

- Dot plot
a. Quantities of categorical data
b. x is the independent variable, y is the dependent variable
c. An alternative to a bar chart when you want to display every value in the dataset
d. Can show the 95% confidence interval

- Error bar: shows uncertainty

- Line graph: mostly for data through time, or on an interval scale
a. The time variable is on the x-axis; y is the dependent variable
b. Better with straight lines between the dots

- Bar chart: bar with lengths that is proportional to the values when comparing
categorical data
a. The x-axis is the category; y is the dependent variable
b. A bar can show the mean and the 95% CI of the mean: the top of the bar, where it meets the
vertical line, is the mean, and the vertical line is the 95% confidence interval
c. Stacked bar charts sum different common categories across labels. Very
useful for comparing a few categories, especially when the
total (the sum) for each label is different
- Pie chart: display percentage or ratio
a. Use for categorical data with small number of categories

- Box and whisker plot

a. Displays categorical data on the x-axis with quantitative dependent variables
on the y-axis.
b. Includes information like variance and outliers
c. The box is the interquartile range and the whiskers show the complete range

- Scatter plots
a. Both the x and the y axis represent a quantitative measurement of the same
piece of data.

- Histogram
a. Frequency distributions of data from one or more variables using adjacent
vertical bars.
b. The x-axis divides the data into intervals, while the count of data in each interval is displayed on
the y-axis.

- Heat maps
a. Two-dimensional matrix (same data type)
b. x and y are categories and the colour is quantitative
Term test 2: Weeks 5-8

Week 5: Lecture 9
Simple Summary Statistics
- Summary statistics
1. Mean: a measure of central location or tendency
2. Standard deviation: spread around the center
3. Skewness: measure of the shape to indicate the asymmetry
4. Correlation: association between pairs of data

- Central location or tendency


1. Mean: the average of the data (sensitive to outliers)
2. Median: the 50th percentile of the sorted data (not affected by outliers)
- Spread around the center
1. Standard deviation: the square root of the variance
2. Variance: how far the samples (aka observations) are from their average
value. The unit of variance is the square of the unit of the data, so variance cannot be plotted on the
same graph as the mean

3. Use the standard deviation to show the spread on the same graph as the mean

a. It has the same unit as the measurement
b. A smaller SD means a smaller spread: the data are closer to each
other
c. It is a measurement of uncertainty, like measurement error
- Summing means and variances
1. Summing means:
Mean(X+Y) = Mean(X) + Mean(Y)
Mean of X is 10
Mean of Y is 10
Mean of X+Y is 20 = mean of X + mean of Y
2. Summing variances: Var(response3) = Var(response1) + Var(response2) + 2Cov(r1,r2), where response3 = response1 + response2
a. Uncorrelated (Cov(X,Y) is 0): Var(X+Y) = Var(X) + Var(Y)
b. Correlated (Cov(X,Y) is positive): Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y) is larger than Var(X) + Var(Y); X
and Y move in a similar direction
c. Negatively correlated (Cov(X,Y) is negative): Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y) is smaller than
Var(X) + Var(Y); X and Y move in opposite directions
d. Perfectly correlated: X and Y move in exactly the same direction
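The rules for summing means and variances can be verified numerically. The vectors below are arbitrary; the identity holds for any pair of equal-length samples.

```r
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Means add directly.
mean_sum <- mean(x + y)        # 9
mean_parts <- mean(x) + mean(y)  # 6 + 3

# Variances add with a covariance term.
var_sum <- var(x + y)
var_parts <- var(x) + var(y) + 2 * cov(x, y)

# Perfect negative correlation: the spread cancels out completely.
z <- -x
var_cancel <- var(x + z)       # 0

print(c(var_sum = var_sum, var_parts = var_parts, cov = cov(x, y)))
```

Because x and y are positively correlated here, var_sum is larger than var(x) + var(y).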
- Shape to indicate asymmetry around mean
1. Skewness = 0: symmetric about the mean
a. Mean and median are the same
2. Skewness is negative: left tail (the bulk of the data is closer to the right) = left skewed
a. Mean is smaller than the median
3. Skewness is positive: right tail (the bulk of the data is closer to the left) = right skewed
a. Mean is larger than the median
4. The mean is closer to the tail than the median
Week 6: Lecture 10
Probability Distributions
- Terminology
1. A random trial is a process or experiment that has two or more possible
outcomes
2. An outcome is the result of the process or experiment
3. An event is any subset of possible outcomes

- Probability without replacement


1. Draw a probability tree
a. Without a specific order: add up the probabilities of all the branches that give the outcome
b. With a specific order: only one branch (one sequence of trials) is needed
c. The sum of all probabilities is 1

- Probability with replacement


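a. Uniform distribution: all outcomes have the same probability
Number of orderings * probability^picks
E.g. a total of 10 white balls and 10 black balls
Drawing 4 white balls and 1 black ball:
(10/20)^4 x (10/20)^1 = 1/32 for one specific order; the black ball can be in any of 5 positions, so in any order 5 x 1/32 = 5/32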
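The urn example above can be worked through in R. The sketch computes the probability by hand and compares it with the built-in binomial function dbinom(). The final lines do the same draw without replacement, where dhyper() gives the matching probability.

```r
p_white <- 10 / 20

# With replacement: 4 white and 1 black in 5 draws.
p_one_order <- p_white^4 * (1 - p_white)^1   # one specific order: 1/32
n_orders <- choose(5, 4)                     # 5 possible orderings
p_any_order <- n_orders * p_one_order        # 5/32
p_binom <- dbinom(4, size = 5, prob = p_white)

# Without replacement: count the ways to pick the balls.
p_no_replace <- choose(10, 4) * choose(10, 1) / choose(20, 5)
p_hyper <- dhyper(4, m = 10, n = 10, k = 5)

print(c(with_replacement = p_any_order, without_replacement = p_no_replace))
```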
Week 6: Lecture 11
Probability Distributions contd
- The Gaussian distribution
1. The peak is the mean
2. Symmetric distribution(skewness = 0 )
3. About 68.2% of values lie within ±1 SD of the mean and about 95% within ±2 SD
4. Resampling statistics are normally distributed
- Binomial distribution (recall the urn model)
1. It is the sum of independent samples with the same probability (with
replacement): recall, it is the repeated sampling with replacement and
summing up of black/white balls
- The standard deviation of the estimated mean is the standard error

The smaller the sample size, the larger the spread

Error bars use the 95% CI instead of the actual standard deviation
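The percentages above can be checked with pnorm(), the cumulative normal distribution in R. The second half shows how the standard error shrinks as the sample size grows. The standard deviation, mean and sample sizes are arbitrary illustration values.

```r
# Share of a normal distribution within 1, 2 and 1.96 standard deviations.
within_1sd <- pnorm(1) - pnorm(-1)          # about 0.683
within_2sd <- pnorm(2) - pnorm(-2)          # about 0.954
within_196 <- pnorm(1.96) - pnorm(-1.96)    # about 0.950

# Standard error = SD / sqrt(n): smaller samples give more spread.
sd_value <- 10
se_small_sample <- sd_value / sqrt(25)      # 2
se_large_sample <- sd_value / sqrt(100)     # 1

# 95% CI for a sample mean of 50 with n = 100.
ci_95 <- 50 + c(-1.96, 1.96) * se_large_sample
print(ci_95)  # 48.04 51.96
```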

Week 8: Lecture 15
Linear correlation and regression
- Linear regression
y = mx + c -> c is the y-intercept, m is the slope (Δy/Δx), in units of y per unit of x
A positive slope means a positive correlation, but the strength of the correlation is not how steep the
slope is; it is how well the data fit the line
Residual: the vertical error between a data point and the line -> used to calculate the coefficient of determination
Assumptions for the linear fit
- The vertical errors are approximately the same size along the line
- The errors have a Gaussian distribution (normal distribution)
Scale changing: multiplying x and y by a constant can affect the slope and the intercept (it affects the mean and the
standard deviation)
Shifting: adding or subtracting a constant does NOT affect the slope, only the intercept (it affects the mean)

- Standard error and 95% confidence interval for the slope/intercept

a. Resample the data with replacement and find the new slope and new intercept
b. Repeat many times
c. The distribution of the resampled values looks like a normal distribution
d. Find the standard error from the sampling distribution. Find the 95% CI from the standard
error
e. The 95% CI for the true value is the mean ± 1.96 * standard error

- Permutations: change (shuffle) the order so that x and y have no correlation; repeating this gives a
normal distribution
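The fit, the shift and scale rules, the resampling steps a to e, and the permutation idea can be sketched in base R. The data are simulated for the example (a true slope of 2 plus random noise), and the 1000 repetitions are an arbitrary choice.

```r
set.seed(1)
x <- 1:20
y <- 2 * x + 1 + rnorm(20)   # hypothetical data

fit <- lm(y ~ x)
slope <- coef(fit)[["x"]]
r_squared <- summary(fit)$r.squared   # coefficient of determination

# Shifting y changes only the intercept; scaling y changes the slope.
y_shifted <- y + 5
y_scaled <- 3 * y
slope_shifted <- coef(lm(y_shifted ~ x))[["x"]]
slope_scaled <- coef(lm(y_scaled ~ x))[["x"]]

# Bootstrap: resample rows with replacement and refit, many times.
boot_slopes <- replicate(1000, {
  idx <- sample(length(x), replace = TRUE)
  coef(lm(y[idx] ~ x[idx]))[[2]]
})
boot_se <- sd(boot_slopes)                               # standard error
ci_95 <- mean(boot_slopes) + c(-1.96, 1.96) * boot_se    # 95% CI

# Permutation: shuffling y breaks the link between x and y.
perm_slopes <- replicate(1000, coef(lm(sample(y) ~ x))[[2]])

print(c(slope = slope, se = boot_se, lower = ci_95[1], upper = ci_95[2]))
```

The permuted slopes centre on 0, which is what "no correlation" looks like for this data set.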
Final recap
Example from the lecture (transcribed): for patients we can take blood samples and collect
anything else we can think of to try to understand a disease. Take skin cancer (melanoma) as the
disease. Lots of people have had skin cancer, and we can sequence the genomes of the people
that we know have it. The idea is to ask whether skin cancer has a genetic basis.

Skin cancer
100 sequenced individuals, all with a history of skin cancer
100,000 people from the UK with no history of skin cancer

Gene (character)    Mutation (integer)
A                   1 (with mutation)
B                   0 (without mutation)
C                   1
D                   0
Lecture 16:
What we are wishing for: when the null hypothesis is true, the p-values from repeated samples of the
population must be uniformly distributed
With a p-value threshold of 5%, the p-value falls below the threshold exactly 5% of the time
p-value larger than the threshold = not significant
p-value equal to the threshold = borderline of being significant
p-value less than the threshold = significant, so reject the null hypothesis
The true value distribution and the sample value distribution have the same width but different
centers
Tail of the distribution: looking for abnormal values, "how rare is our sample value"
Rare == significant
- The difference is large and the width, based on the SE (1.96 * SE), is small
- Error bars must be the 95% CI
- The sample is far from the true value
- No overlap
A p-value of less than 0.05 says "the sample is so rare"; it does not mean the sample is false,
it is just rare
Equal means = the difference is 0

2 sample test
See if the two samples are from the same population
Null hypothesis: difference = 0

T test formula: t = difference in means / SE; |t| > 1.96 is significant (no overlap)

The smaller the sample size, the larger the t value needed for significance

Unpaired t test: 2 different groups

Paired t test: the same group measured twice
The t test assesses whether the null hypothesis can be rejected; the p-value is the probability of
a difference at least this large arising from random chance when the null hypothesis is true
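The t test formula above can be compared with R's built-in t.test(). The measurements are invented. By default t.test() runs the Welch version of the unpaired test, whose SE is computed as in the manual lines below.

```r
# Unpaired: two different groups (hypothetical measurements).
control <- c(5.1, 4.9, 5.3, 5.0, 5.2, 4.8)
treated <- c(6.1, 6.3, 5.9, 6.2, 6.0, 6.4)

# t = difference in means / SE of the difference.
se_diff <- sqrt(var(control) / length(control) + var(treated) / length(treated))
t_manual <- (mean(treated) - mean(control)) / se_diff
unpaired <- t.test(treated, control)

# Paired: the same group measured twice.
before <- c(10, 12, 9, 11, 13, 10)
after <- before + c(1.0, 1.2, 0.8, 1.1, 0.9, 1.0)
paired <- t.test(after, before, paired = TRUE)

# With small samples the cutoff for significance is above 1.96.
cutoff_small <- qt(0.975, df = 5)   # about 2.57
cutoff_large <- qnorm(0.975)        # about 1.96

print(c(t = t_manual, p_unpaired = unpaired$p.value, p_paired = paired$p.value))
```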

Parametric test
- Assumes the data come from a normal distribution
- There is a null hypothesis
- Tests the difference from the null hypothesis
Non-parametric test
Fewer assumptions

Chi-square test
- H0: the expected values
- Ha: the observed values
- Skewed distribution

- Prevalence: P / (N+P)
- Precision: TP / (TP + FP)
- True-positive rate: TP / P
- False-positive rate: FP / N
- False-discovery rate: FP / (TP + FP)
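The five rates above follow directly from the four counts of a confusion matrix. The counts below are made up for illustration; P is the number of actual positives (TP + FN) and N is the number of actual negatives (FP + TN).

```r
# Hypothetical counts.
tp <- 40   # true positives
fn <- 10   # false negatives
fp <- 5    # false positives
tn <- 45   # true negatives

positives <- tp + fn   # P
negatives <- fp + tn   # N

prevalence <- positives / (negatives + positives)   # 0.5
precision <- tp / (tp + fp)                         # 40/45
true_positive_rate <- tp / positives                # 0.8
false_positive_rate <- fp / negatives               # 0.1
false_discovery_rate <- fp / (tp + fp)              # 5/45

print(c(prevalence = prevalence, precision = precision,
        tpr = true_positive_rate, fpr = false_positive_rate,
        fdr = false_discovery_rate))
```

Precision and the false-discovery rate share the same denominator, so they always add up to 1.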
