BIO259 Note
header (tells you what is present in this column)
2. Types of data
a. String: text (characters), e.g. Canada. Strings are case sensitive, so it really matters
whether something is capitalized or not
b. Integers: whole numbers with no decimal place; the cell contains only a number,
so you can do maths on it
c. Floats: numbers with a decimal value (sometimes called double, for double
precision)
d. Missing data: blank information. Either the cell is completely blank or it
shows the characters NaN, NA or <NA>
- The whole column has the same type of data
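A minimal R sketch of the data types above. The variable names and values are made up for illustration.

```r
# Hypothetical example values, one per data type
country <- "Canada"   # string (R calls this "character")
count   <- 42L        # integer (the L suffix asks R to store an integer)
height  <- 1.75       # float (R calls this "double")

# Strings are case sensitive: these two are not equal
same_case <- "Canada" == "canada"

# typeof() reports how R stores each value
types <- c(typeof(country), typeof(count), typeof(height))

# is.na() finds the missing values in a vector
has_missing <- is.na(c(1.5, NA, 3))

print(same_case)
print(types)
print(has_missing)
```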
3. Beyond table
a. Variable: one single thing stored in one single cell
b. Vector: a collection of values of the same type, without dimensions
E.g. a vector of strings: c('Canada', 'United States', 'Mexico')
E.g. a vector of integers
e.g. code
as.integer(c(TRUE, 1.0, FALSE))
as.numeric(c(TRUE, 1.0, FALSE))
4. Between logical and number (0 is FALSE and any other number is TRUE)
E.g. c(0, 0.5, 1.5, 2.5, 3.5)
as.character(c(0, 0.5, 1.5, 2.5, 3.5)) gives '0', '0.5', '1.5', '2.5', '3.5'
(information is lost when converting from floats (higher) to integers (lower))
5. Converting from a string only works for a direct representation
For example, 'TRUE' becomes TRUE when converted to logical, and any other string
turns into NA
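The conversions in points 4 and 5 can be tried directly in R. This is a small sketch with made-up input vectors; the comments describe what each call does.

```r
# Number -> logical: 0 becomes FALSE, every other number becomes TRUE
num_to_lgl <- as.logical(c(0, 0.5, 1.5))

# Float -> integer: the decimal part is dropped, so information is lost
flt_to_int <- as.integer(c(0, 0.5, 1.5, 2.5, 3.5))

# Number -> string: each number becomes its text form
num_to_chr <- as.character(c(0, 0.5, 1.5))

# String -> logical: only a direct representation converts, the rest is NA
chr_to_lgl <- as.logical(c("TRUE", "Canada"))

# String -> number: same rule; suppressWarnings() hides the coercion warning
chr_to_num <- suppressWarnings(as.numeric(c("2.5", "Canada")))

print(flt_to_int)
print(chr_to_lgl)
```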
- Data files
1. Text files: the data you supply must be converted when it is read
a. TXT: text file
b. CSV: data separated by commas
If a comma is inside quotes, as in "vdf,vdf", that comma should not be treated as a
separator: the value comes out as vdf,vdf instead of vdf and vdf
c. TAB: data separated by tabs
d. Fields
1. Nominal values (labels with no order)
2. Ordinal values (labels with some order)
3. Numeric values (numbers)
2. Binary files: data already converted so a program can read it much faster, but not human
readable
3. Compressed files: a text or binary file that has been compressed, so it is not human readable
Open a compressed file through read.table("txt", sep = ",", header = TRUE, quote =
""), where quote = "" means there is no specific character for quotes.
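A sketch of how the quote setting changes the way a CSV line is split, and of reading a gzip-compressed copy of the same file. The file contents and column names are invented for this example, and the files are written to temporary paths.

```r
# Write a small hypothetical CSV to a temporary file
lines <- c('id,label', '1,"vdf,vdf"', '2,abc')
csv_path <- tempfile(fileext = ".csv")
writeLines(lines, csv_path)

# Fields per line when " is the quote character: the inner comma is protected
fields_quoted <- count.fields(csv_path, sep = ",", quote = "\"")

# Fields per line when no quote character is set: the inner comma splits the value
fields_unquoted <- count.fields(csv_path, sep = ",", quote = "")

# Read the table with quotes respected
quoted <- read.table(csv_path, sep = ",", header = TRUE, quote = "\"",
                     stringsAsFactors = FALSE)

# Write a gzip-compressed copy and read it back through a gzfile() connection
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
writeLines(lines, con)
close(con)
from_gz <- read.table(gzfile(gz_path), sep = ",", header = TRUE, quote = "\"",
                      stringsAsFactors = FALSE)

print(fields_quoted)
print(fields_unquoted)
print(quoted)
```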
Week 2: Lecture 3
Data Organization and Manipulation
- filesystem
1. How to navigate the file path
a. Absolute path: points to the file regardless of the current location; it is
usually the 'full' path
b. Relative path: starts from the current working directory
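A short R sketch of the two kinds of path. The folder and file names are hypothetical and the file does not need to exist for this to run.

```r
# The current working directory is the starting point for relative paths
wd <- getwd()

# A relative path (hypothetical folder and file name)
rel_path <- file.path("data", "samples.csv")

# Joining the working directory and the relative path gives an absolute path
abs_path <- file.path(wd, rel_path)

print(rel_path)
print(abs_path)
```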
Lecture 4
Data organization and manipulation
- Indexing vector
a. first_value <- vect[1] is the first element (a vector has one dimension, so there is no [1,1])
b. middle_values <- vect[3:6] is elements 3, 4, 5 and 6
c. first_row <- df[1, ] for a table: if nothing is in the column position, all columns are assumed
- Indexing dataframe
col <- df["colname"]
a. Asking for even numbers: a %% 2 == 0, where a is the vector. The number is divided by 2
and a remainder of 0 means it is an even number
b. If you select one column with df[, "colname"], R automatically changes it into a vector instead
of a dataframe; drop = FALSE means retain the dataframe format
c. df[1] > 10 on its own only gives you logical values (TRUE or FALSE). To get the matching
rows back as a dataframe, put the logical values in the row position, e.g. df[df[[1]] > 10, ]
d. OR is |, AND is &, NOT is !
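The indexing rules above, as a runnable sketch. The vector and data frame are made up for illustration.

```r
# Hypothetical vector
vect <- c(12, 7, 30, 4, 18, 9, 25)
first_value   <- vect[1]               # first element
middle_values <- vect[3:6]             # elements 3 to 6
evens         <- vect[vect %% 2 == 0]  # keep values with remainder 0 after dividing by 2

# Hypothetical data frame
df <- data.frame(id = c("a", "b", "c"), score = c(5, 15, 25),
                 stringsAsFactors = FALSE)
first_row <- df[1, ]                        # empty column position: all columns
col_df    <- df["score"]                    # stays a data frame
col_vec   <- df[, "score"]                  # simplified to a vector
col_keep  <- df[, "score", drop = FALSE]    # drop = FALSE keeps the data frame

# Logical tests give TRUE/FALSE; used in the row position they filter rows
mask     <- df$score > 10
big_rows <- df[mask, ]
both     <- df[df$score > 10 & df$score < 20, ]  # AND of two conditions

print(middle_values)
print(big_rows)
```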
- Relational data
a. Primary key: identifies the record
b. Foreign key: a column in one table that points to another table
c. Composite key: made of multiple columns. Join or merge tables in order to get more
compact information
d. The type of join specifies which rows are returned when foreign key and
primary key pairs are missing. The orange part (in the join diagram) indicates which rows with missing keys are returned.
e. Inner join: ensures the key and foreign key are in both tables. Removes rows
that have no relationship, so there is no missing data in an inner join
f. Full join: keeps every row whether or not there is a relationship. Missing data
remain blank (NA)
g. Left join: displays everything in the table on the left; if there are matches
in the right table, pulls any relevant data from the right table.
h. Right join: displays everything in the table on the right; if there are matches
in the left table, pulls any relevant data from the left table.
i. When matching IDs make sure they have the same name. Because R is case
sensitive, different names may refer to the same thing but R won't know, so make
sure to match on an ID that does not change
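The four join types can be sketched with base R's merge(). The two tables and their column names are hypothetical; patient_id is the primary key of patients and the foreign key in samples.

```r
# Hypothetical tables: patient 2 and 3 have no sample, sample 12 has no patient
patients <- data.frame(patient_id = c(1, 2, 3),
                       name = c("Ann", "Bo", "Cy"),
                       stringsAsFactors = FALSE)
samples <- data.frame(sample_id = c(10, 11, 12),
                      patient_id = c(1, 1, 4))

# Inner join: only keys present in both tables
inner <- merge(patients, samples, by = "patient_id")

# Full join: every row from both tables, NA where there is no match
full <- merge(patients, samples, by = "patient_id", all = TRUE)

# Left join: every row of the left table (patients)
left <- merge(patients, samples, by = "patient_id", all.x = TRUE)

# Right join: every row of the right table (samples)
right <- merge(patients, samples, by = "patient_id", all.y = TRUE)

print(full)
```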
- Types of relational data:
a. Phylogenetic tree
b. Family tree
- Statistical Definitions
1. Range: difference between the highest and lowest values
2. Mode: the value with the highest count
3. Median: the middle number of the ordered values
4. Mean: average
5. Standard deviation: the square root of the variance
6. Standard error
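A sketch of these definitions in R on a made-up vector. Base R has no function for the statistical mode, so it is computed from a count table here. The standard error line assumes the usual standard error of the mean, sd divided by the square root of the sample size, which the notes do not spell out.

```r
# Hypothetical data
x <- c(2, 4, 4, 5, 7, 9, 4)

value_range <- max(x) - min(x)   # highest minus lowest

# Mode: count each value and take the most frequent one
counts <- table(x)
value_mode <- as.numeric(names(counts)[which.max(counts)])

value_median <- median(x)        # middle of the ordered values
value_mean   <- mean(x)          # average
value_sd     <- sd(x)            # square root of the variance

# Assumption: standard error of the mean = sd / sqrt(n)
value_se <- sd(x) / sqrt(length(x))

print(c(value_range, value_mode, value_median, value_mean))
```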
- Regular expression
1. Pattern
2. mutate(): adds a new column to the dataframe
3. pivoting : group_by() %>% summarise()
Week 3: Lecture 6
Regular Expressions
- No need for the slashes in /A/ for regular expressions in the R language
- ?: use for logic
- [ ] means or, like /[AB]/ means A or B
- [A-Z] all capital letters, [a-z] all lowercase letters, [A-Za-z] all letters
- Repeat the [ ] one or more times by adding +, like /[A-Za-z]+/
- . matches any single character, so adding .+ requires at least one more character of any kind, like /[A-Za-z].+/
- Repeat 0 or more times by using * instead of +, like /[A-Za-z]*/
- Match 0 or 1 letter by adding ?, like /[A-Za-z]?/
- ^ the match must be at the start
- $ the match must be at the end
- \ escapes a special character so it is matched literally
- \w: [A-Za-z0-9_]
- \d: [0-9]
- \s: space, tab, carriage return
- Replace: /match/replace/
- {n} repeats n times, e.g. if we are matching YYYY-MM-DD
e.g. specific for September or August: "[0-9]{4}-[0][89]-[0-9]+"
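A sketch of these patterns in R using grepl() and sub(). The input strings are made up. Note that a backslash has to be doubled inside an R string, so \d is written "\\d".

```r
# Hypothetical input strings
dates <- c("2023-08-15", "2023-09-01", "2023-10-30", "not a date")

# Patterns are plain strings in R, without surrounding slashes.
# ^ and $ anchor the match to the start and end; {4} repeats exactly 4 times
is_aug_sep <- grepl("^[0-9]{4}-0[89]-[0-9]{2}$", dates)

# \d matches a digit
has_digit <- grepl("\\d", dates)

# + repeats one or more times; sub() replaces the first match
replaced <- sub("[A-Za-z]+", "X", "not a date")

# ? makes the preceding letter optional (0 or 1 times)
optional <- grepl("^colou?r$", c("color", "colour", "colouur"))

print(is_aug_sep)
print(replaced)
```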
Week 4: Lecture 7
Graphical Data Summaries
- Dot plot
a. Quantities of categorical data
b. X is independent y is dependent
c. Alternative to (replaces) a bar chart when you want to display all values in the dataset
d. 95% confidence interval
- Bar chart: bars with lengths that are proportional to the values, used when comparing
categorical data
a. x-axis is the category, y is the dependent variable
b. A bar with the mean and 95% CI of the mean: the intersection is the mean and
the vertical line is the 95% confidence interval
c. Stacked bar charts sum different common categories across labels. Very
useful to allow the comparison of few categories. Especially useful when the
total (the sum) for each label is different
- Pie chart: display percentage or ratio
a. Use for categorical data with small number of categories
- Scatter plots
a. both the x and the y axis represent a quantitative measurement of the same
piece of data.
- Histogram
a. Frequency distributions of data from one or more variables using adjacent
vertical bars.
b. The x-axis splits the data into intervals, while the count for each interval is displayed on
the y-axis.
- Heat maps
a. Two-dimensional matrix (same data type)
b. x and y are categories and the colour is quantitative
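A sketch of the numbers behind two of the plots above, without drawing them: the interval counts of a histogram, and the mean with a 95% confidence interval shown on a bar chart. The data are made up, and the interval uses mean plus or minus 1.96 standard errors, which is an approximation for small samples.

```r
# Histogram: count how many values fall in each interval
values <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)
h <- hist(values, breaks = c(0, 2, 4, 6, 8, 10), plot = FALSE)
bin_counts <- h$counts   # one count per interval, plotted on the y-axis

# Bar chart with 95% CI: the bar height is the mean,
# the vertical line runs from ci_low to ci_high
group <- c(4, 6, 5, 7, 8)
bar_mean <- mean(group)
se <- sd(group) / sqrt(length(group))
ci_low  <- bar_mean - 1.96 * se
ci_high <- bar_mean + 1.96 * se

print(bin_counts)
print(c(ci_low, bar_mean, ci_high))
```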
Term test 2: Week 5-8
Week 5: Lecture 9
Simple Summary Statistics
- Summary statistics
1. Mean: a measure of central location or tendency
2. Standard deviation: spread around the center
3. Skewness: measure of the shape to indicate the asymmetry
4. Correlation: association between pairs of data
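A sketch of the four summary statistics on made-up paired data. Base R has no skewness function, so one common moment-based formula is written out; treat that formula as an assumption, since the notes do not give one.

```r
# Hypothetical paired data with one large value pulling the right tail
x <- c(1, 2, 3, 4, 10)
y <- c(2, 4, 5, 8, 20)

center <- mean(x)   # central location
spread <- sd(x)     # spread around the center

# Skewness (assumed formula): third central moment over variance^1.5
dev  <- x - mean(x)
skew <- mean(dev^3) / mean(dev^2)^1.5   # positive: longer right tail

# Correlation: association between the pairs
r <- cor(x, y)

print(c(center, spread, skew, r))
```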
Week 8: Lecture 15
Linear correlation and regression
- Linear regression
y = mx + c -> c is the y-intercept, m is the slope (Δy/Δx), in units of y per unit of x
A positive slope means a positive correlation, but the correlation is not how steep the slope is; it is how well
the data fit the line
Residual: vertical error bar -> used to calculate the coefficient of determination
Assumptions for the linear fit
- The vertical errors are approximately the same along the line
- The errors have a Gaussian distribution (normal distribution)
Scale changing: multiplying the x and y will affect the slope and intercept (affects mean and
standard deviation)
Shifting: addition or subtraction will NOT affect the slope but only the intercept (affects mean)
- Permutations: change the order so that x and y have no correlation, which gives an approximately
normal distribution
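A sketch of a linear fit with lm(), the effect of shifting and scaling, and a simple permutation of y. The data points are made up, and the permutation p value shown is one plain way to use the shuffled correlations, not necessarily the method used in the lecture.

```r
# Hypothetical data lying close to a straight line
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)
intercept <- unname(coef(fit)[1])       # c in y = mx + c
slope     <- unname(coef(fit)[2])       # m in y = mx + c
r_squared <- summary(fit)$r.squared     # coefficient of determination
res       <- residuals(fit)             # vertical distances to the line

# Shifting: adding a constant to y changes only the intercept
y_shift <- y + 10
coef_shift <- unname(coef(lm(y_shift ~ x)))

# Scale changing: multiplying y by 2 changes both slope and intercept
y_scaled <- y * 2
coef_scaled <- unname(coef(lm(y_scaled ~ x)))

# Permutation: shuffle y so any link with x is broken, many times over
set.seed(1)
perm_r <- replicate(1000, cor(x, sample(y)))
p_perm <- mean(abs(perm_r) >= abs(cor(x, y)))  # how rare is the observed r

print(c(intercept, slope, r_squared, p_perm))
```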
Final recap
In practice we keep patient records, and from some of these patients we can take blood
samples, anything we can think of in terms of trying to understand what the disease is.
Think of a disease such as skin cancer: lots of people have had skin cancers (melanomas)
years ago. What we can do is sequence the genomes of all of these people. The idea here
is to ask whether there is a genetic basis to skin cancer, or whether skin cancer has no
genetic basis to it.
Skin cancer
100 sequenced individuals, all with a history of skin cancer
100000 people from the UK with no history of skin cancer
A 1(with mutation)
B 0(without mutation)
C 1
D 0
Lecture 16:
What we are wishing for: under the null hypothesis, the distribution of p values from the population must be uniformly
distributed
With a p-value threshold of 5%, the p value falls below the threshold exactly 5% of the time
p value larger than the threshold = not significant
p value equal to the threshold = borderline of being significant
p value less than the threshold = significant, reject the null hypothesis
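A small simulation sketch of this idea: when both samples really come from the same population, p values are spread roughly uniformly and about 5% of them fall below 0.05. The sample size, number of repeats and the choice of t.test() are assumptions made for the example.

```r
set.seed(42)

# Repeat many two-sample tests where the null hypothesis is true:
# both samples are drawn from the same normal distribution
p_values <- replicate(2000, t.test(rnorm(20), rnorm(20))$p.value)

# Share of tests that look "significant" at the 5% threshold
false_positive_rate <- mean(p_values < 0.05)

# Roughly uniform: about half of the p values are below 0.5
share_below_half <- mean(p_values < 0.5)

print(false_positive_rate)
print(share_below_half)
```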
The true value distribution and the sample value distribution will have the same width but different
centers
Tail of distribution: looking for the abnormal sum, "how rare is our sample value"
Rare == significant
- The difference is large and the width is small; the width is based on SE (1.96 * SE)
- Error bars must be 95% CI
- The sample is far from the true value
- No overlap
A p value of the sample value below 0.05 means "the sample is so rare"; it doesn't mean it is false,
it is just rare
Equal means = the difference is 0
2-sample test
See if they are from the same population
Null hypothesis: difference = 0
Parametric test
- Assumes the data come from a normal distribution
- There is a null hypothesis
- Tests the difference from the null hypothesis
Non-parametric test
Fewer assumptions
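A sketch of a two-sample comparison done both ways. t.test() is used as the parametric test and wilcox.test() as a non-parametric alternative; the notes do not name specific tests, so both choices are assumptions, and the two groups are simulated.

```r
set.seed(7)

# Two simulated groups whose true means differ by 3
group_a <- rnorm(30, mean = 10, sd = 2)
group_b <- rnorm(30, mean = 13, sd = 2)

# Parametric: assumes normally distributed data, null hypothesis difference = 0
t_res <- t.test(group_a, group_b)

# Non-parametric: rank-based, fewer assumptions about the distribution
w_res <- wilcox.test(group_a, group_b)

print(t_res$p.value)
print(w_res$p.value)
```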
Chi-square test
- Ho is the expected
- Ha is the observed
- Skewed distribution
- Prevalence: P / (N+P)
- Precision: TP / (TP + FP)
- True-positive rate: TP / P
- False-positive rate: FP / N
- False-discovery rate: FP / (TP + FP)
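The rates above worked through in R. The four counts are invented for the example; P is taken as all real positives (TP + FN) and N as all real negatives (FP + TN), which is the usual reading of these formulas.

```r
# Hypothetical counts from a screening test
TP <- 40    # true positives
FP <- 10    # false positives
FN <- 20    # false negatives
TN <- 930   # true negatives

P <- TP + FN   # all real positives
N <- FP + TN   # all real negatives

prevalence <- P / (N + P)
precision  <- TP / (TP + FP)
tpr        <- TP / P            # true-positive rate
fpr        <- FP / N            # false-positive rate
fdr        <- FP / (TP + FP)    # false-discovery rate

print(c(prevalence, precision, tpr, fpr, fdr))
```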