
LOKT.08.005 Chemometrics: Geven Piir

Chemometrics is a discipline that applies multivariate statistical methods to chemistry-related data to extract information. It involves measuring and collecting data, extracting relevant information from that data, and using the extracted knowledge to understand chemical systems and make decisions. Common applications of chemometrics include determining concentrations in mixtures, classifying sample origins, predicting properties, and recognizing molecular substructures using analytical techniques and spectroscopic or chemical structure data.


LOKT.08.005 Chemometrics
Introduction

Geven Piir
Chemometrics
• Chemometrics is a discipline in which
(multivariate) statistical methods are applied
to chemistry-related data to extract
information
– Multivariate statistics
– Mathematical modelling
– Computer science
– Analytical chemistry

The chemometrics process
• Measurement: measure and collect data
• Data: extract relevant information
• Information: extract knowledge about the system
• Knowledge: helps to understand the system
• Understanding: make a decision

[Diagram] Analytical techniques (HPLC, MS, IR) → raw data (noise + information) → chemometric tools → information, separated from the noise

Collecting data

[Figure] Fifty-one spectra of samples containing either ibuprofen (blue) or
ketoprofen (red), recorded in the region 680–2,000 cm−1.
Annotations mark one region where ketoprofen has the higher absorbance
and another where ibuprofen has the higher absorbance.

Where can chemometrics be used?
• Determination of the concentration of a compound
in a complex mixture
– from infrared data
• Classification of the origins of samples
– from chemical analytical or spectroscopic data
• Prediction of a property or activity of a chemical
compound
– from chemical structure data

Where can chemometrics be used? (continued)
• Recognition of presence/absence of substructures in
the chemical structure of an unknown organic
compound
– from spectroscopic data
• Evaluation of the process status in chemical
technology
– from spectroscopic and chemical analytical data

Chemometrics is …
• Asking an interesting question
• Obtaining the data
• Exploring the data
• Modelling the data
• Communicating and visualizing the results

COURSE OVERVIEW

Arrangements
• Lecture
– Wednesdays 10.15
– VIDEO (https://button.ut.ee/b/gev-i9u-ylz-jeh)
• “Seminar”
– Individual work at home
– You can send questions to my e-mail
• On Fridays 14.15-15.45 I will try to answer as soon as possible
• At all other times a reply may take longer
• Exam
– Requirements to be met for exam
• All seminar quizzes must be passed

“Seminar”
• All the seminar exercises with solutions will be
uploaded to Moodle
• After each “seminar” there is a quiz
• We have to make a decision here, between two options:
– All quizzes must be passed
• You have unlimited tries
– Quizzes give up to 40% of the final grade
• You have 3 tries. Each additional try earns a few percent fewer points

Exam structure

• 10 multiple-choice questions
– 4 answer options each
– You should choose only one
• 10 exercises
• Examples are available in Moodle

Retaking the exam
• All final grades can be improved by retaking
the exam, but …
– Final score will be calculated as:

$$\text{Final score} = \frac{\text{Score from EXAM} + \text{Score from RETAKING}}{2}$$
– It is possible to lower your grade
– You can retake the exam only once
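
As a small worked example of the averaging rule above, the sketch below computes the final score for two hypothetical students. The function name and the scores are illustrative assumptions, not course data.

```python
def final_score(exam: float, retake: float) -> float:
    """Average of the original exam score and the retake score."""
    return (exam + retake) / 2


# Hypothetical scores in percent, chosen only for illustration.
improved = final_score(exam=60, retake=80)  # retake went better than the exam
lowered = final_score(exam=80, retake=60)   # retake went worse than the exam

print(f"60 then 80 gives {improved}")
print(f"80 then 60 gives {lowered}")
```

Because the two scores are averaged, a weak retake pulls the final score below the original exam score, which is why the grade can go down as well as up.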

Lecture topics

• Data and statistics
• Exploratory analysis
– Unsupervised learning
• Modelling
– Supervised learning

Seminar exercises - R

R Project for Statistical Computing
www.r-project.org

RStudio
https://www.rstudio.com

Exercises

• How to use R and RStudio
• Data input / output
• Apply statistical methods
• Interpretation of results

This course is based on:
• Introduction to Multivariate Statistical Analysis in
Chemometrics, Kurt Varmuza, Peter Filzmoser, 2009
• Practical guide to chemometrics, Paul Gemperline,
2006
• Statistics and chemometrics for analytical chemistry,
James N. Miller, Jane Charlotte Miller, 2005
• Chemometrics with R: Multivariate Data Analysis in
the Natural Sciences and Life Sciences, Ron Wehrens,
2011

Contacts
• Geven Piir, PhD
– room 4072
– phone: 737 5278
[email protected]

SCHEDULE

DATA AND STATISTICS
• Introduction
• Data
• Data distributions and descriptive statistics
– Descriptive statistics
– Data distributions
– Hypothesis testing, statistical tests
– Erroneous and missing data
• Multivariate data
– Univariate vs. multivariate
– ANOVA, covariance and correlation
– Data transformation
– Data preprocessing (centering, scaling, normalization)
• Multivariate data (continued)
– Distances, similarities
– Multivariate outlier detection
– Linear latent variables

EXPLORATORY ANALYSIS
• Principal Component Analysis
– Eigenvectors, eigenvalues
– Number of PCA components
– Data preprocessing (centering and scaling)
• Principal Component Analysis (continued)
– Statistical interpretation
– Rotation of planes
– Outlier analysis
– Other methods
• Cluster analysis
– Clustering
– Cluster validity

MODELLING
• Calibration and regression analysis
– Concepts
– Multiple linear regression
• Calibration and regression analysis (continued)
– Variable selection
– Regression (model) diagnostics
– Validation
• Regression analysis
– Regression model interpretation
– Regression with indicator variables
– Other methods
• Classification
– Background
– Statistical measures
– Methods

Exam and consultation
• Consultation
– A week before the first exam
• Exam
1. Before Christmas break?

STRUCTURE OF DATA

Data - definition
• Information, especially facts or numbers,
collected to be examined and considered and
used to help decision-making, or information
in an electronic form that can be stored and
used by a computer

Types of data
• Structured vs unstructured data
– Organized vs unorganized
• Dependent vs independent data
• Quantitative vs qualitative data
– Numerical vs categorical

Structured vs unstructured data
• Structured (organized) data
– This is data that can be thought of as observations
and characteristics. It is usually organized as a
table (rows and columns)
– Structured data is generally thought of as being
much easier to work with and analyse
– The natural row and column structure is easy to
digest for human and machine eyes

Structured vs unstructured data
• Unstructured (unorganized) data
– This data exists as a free entity and does not
follow any standard organization hierarchy
– Most statistical and machine learning models
were built with structured data in mind and
cannot work on the loose interpretation of
unstructured data

Unstructured data

We need to transform at least part of the unstructured data into structured data

Structured data
Name                      CAS        K(O3)     -logK(O3)  E(HOMO)  MW
Octafluoro-2-butene       360-89-4   1.58E+20  20.2       -11.566  200.028
Formaldehyde hydrazone    6629-91-0  3.98E+16  16.6       -9.584   44.0566
trans-1,2-Difluoroethene  1630-77-9  4.79E+17  17.68      -10.011  64.0338
2-Methyl-1,4-pentadiene   763-30-4   7.76E+16  16.89      -9.745   82.145
1-Methyl-1-cyclohexene    591-49-1   6.03E+15  15.78      -9.214   96.1718
2,3-Dimethyl-2-butene     563-79-1   6.61E+14  14.82      -8.951   84.1608
2-methyl-1-butene         563-46-2   6.31E+16  16.8       -9.702   70.134
1,1,1-Trifluoroethane     420-46-2   1.86E+25  25.27      -13.116  84.0397
alpha-Terpinene           99-86-5    1.15E+13  13.06      -8.438   136.236
alpha-Phellandrene        99-83-2    8.32E+13  13.92      -8.655   136.236
alpha-Pinene              80-56-8    3.02E+15  15.48      -9.117   136.2364
1,1-Difluoroethane        75-37-6    1.66E+24  24.22      -11.926  66.0496
Acetaldehyde              75-07-0    2.95E+19  19.47      -10.72   44.0526
Vinyl fluoride            75-02-5    1.45E+18  18.16      -10.238  46.0437
Ethane                    74-84-0    8.32E+22  22.92      -11.766  30.0694

Rows
• Case, object, sample, observation, compound

Columns
• Variable, feature, measurement, descriptor,
parameter
– Not all columns are variables

Data in matrix form

Example - drugbank
Each row is one compound. The first two columns are identifiers; the remaining columns are measurements.

DrugBank ID  Name                 Melting point  Solubility (mg/ml)  pKa    Isoelectric point  Hydrophobicity (logP)
DB00880      Chlorothiazide       350            0.266               6.85                      -0.5
DB00999      Hydrochlorothiazide  274            0.7                 7.9                       -0.5
DB00501      Cimetidine           142            5                   6.8                       1
DB00458      Imipramine           174.5          0.0182              9.4                       3.9
DB00788      Naproxen             153            0.0159              4.15                      2.8
DB00184      Nicotine             -79            1000                3.1                       1.1
DB00328      Indomethacin         158            0.000937            4.5                       3.4
DB00554      Piroxicam            199            0.023               6.3                       3
DB00201      Caffeine             238            22                  10.4                      -0.5
DB01050      Ibuprofen            75             0.049               4.91                      3.6
DB00281      Lidocaine            68.5           4.1                 8.01                      2.1
DB00140      Riboflavin           280            0.0847              10.2                      -1.9
DB00186      Lorazepam            167            0.08                13                        3.5
DB00250      Dapsone              175.5          0.38                2.41                      0.4
DB00259      Sulfanilamide        165.5          7.5                 10.6                      -0.8
DB00295      Morphine             255            0.149               8.21                      0.8
DB00312      Pentobarbital        129.5          0.679               8.11                      2.1
DB00313      Valproic Acid        120            1.3                 4.8                       2.7
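
The split between identifier columns and measurement columns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys are hypothetical names, and the four numbers in each row are assumed to be melting point, solubility, pKa and logP (the isoelectric point column is taken to be empty).

```python
# Three rows copied from the DrugBank example; key names are illustrative.
rows = [
    {"drugbank_id": "DB00880", "name": "Chlorothiazide",
     "melting_point": 350.0, "solubility_mg_ml": 0.266, "pKa": 6.85, "logP": -0.5},
    {"drugbank_id": "DB00184", "name": "Nicotine",
     "melting_point": -79.0, "solubility_mg_ml": 1000.0, "pKa": 3.1, "logP": 1.1},
    {"drugbank_id": "DB01050", "name": "Ibuprofen",
     "melting_point": 75.0, "solubility_mg_ml": 0.049, "pKa": 4.91, "logP": 3.6},
]

# Identifier columns label a compound; they are not variables to model.
IDENTIFIER_COLUMNS = ("drugbank_id", "name")


def split_row(row):
    """Return (identifiers, measurements) for one table row."""
    identifiers = {k: v for k, v in row.items() if k in IDENTIFIER_COLUMNS}
    measurements = {k: v for k, v in row.items() if k not in IDENTIFIER_COLUMNS}
    return identifiers, measurements


for row in rows:
    ids, values = split_row(row)
    print(ids["name"], list(values.values()))
```

Keeping the identifiers separate means they can still be used to label results, while only the measurement columns enter any statistical calculation.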

Variables
• A dependent variable is the variable being
tested and measured in a scientific
experiment.
• An independent variable is the variable that is
changed or controlled in a scientific
experiment to test the effects on the
dependent variable

Variables
• The independent and dependent variables are the
two key variables in a science experiment.
• The independent variable is the one the
experimenter controls. The dependent variable is the
variable that changes in response to the independent
variable.
• The two variables may be related by cause and
effect. If the independent variable changes, then the
dependent variable is affected.

Data matrix
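Y – dependent variable
X – independent variable
n – number of objects
m – number of variables
Columns: variable, feature, measurement, descriptor, parameter
Rows: case, object, sample, observation, compound

$$
\begin{array}{c|ccccc}
Y & X_1 & \cdots & X_j & \cdots & X_m \\
\hline
y_1 & x_{11} & \cdots & x_{1j} & \cdots & x_{1m} \\
\vdots & \vdots & & \vdots & & \vdots \\
y_i & x_{i1} & \cdots & x_{ij} & \cdots & x_{im} \\
\vdots & \vdots & & \vdots & & \vdots \\
y_n & x_{n1} & \cdots & x_{nj} & \cdots & x_{nm}
\end{array}
$$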
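
A minimal sketch of this layout in Python, using four rows from the structured data table shown earlier. Treating -logK(O3) as the dependent variable Y and E(HOMO) and MW as the independent variables X is an assumption made for illustration; the helper `x` is hypothetical.

```python
# (name, -logK(O3), E(HOMO), MW) for four compounds from the structured table.
table = [
    ("Octafluoro-2-butene", 20.2, -11.566, 200.028),
    ("Formaldehyde hydrazone", 16.6, -9.584, 44.0566),
    ("trans-1,2-Difluoroethene", 17.68, -10.011, 64.0338),
    ("2-Methyl-1,4-pentadiene", 16.89, -9.745, 82.145),
]

# Assumed roles: Y is the dependent variable, X holds the independent ones.
y = [row[1] for row in table]
X = [list(row[2:]) for row in table]

n = len(X)     # number of objects (rows)
m = len(X[0])  # number of variables (columns)


def x(i, j):
    """Element x_ij with the 1-based indices used on the slide."""
    return X[i - 1][j - 1]


print(f"n = {n}, m = {m}")
print(f"y_2 = {y[1]}, x_21 = {x(2, 1)}, x_22 = {x(2, 2)}")
```

The compound name is kept out of both y and X because it is an identifier, not a variable.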

Quantitative vs qualitative data
• Quantitative data
– This data can be described using numbers, and
basic mathematical procedures, including
addition, are possible on the set.
• Qualitative data
– This data cannot be described using numbers and
basic mathematics. This data is generally thought
of as being described using "natural" categories
and language.

Example
• Data: Coffee Shop
– Name of coffee shop
– Revenue (in thousands of dollars)
– Zip code
– Average monthly customers
– Country of coffee origin

Which is which?

• Can you describe it using numbers?
– No? It is qualitative.
– Yes? Move on to the next question.
• Does it still make sense after you add the values together?
– No? The data is qualitative.
– Yes? You probably have quantitative data.
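
The two questions can be written as a small decision function and applied to the coffee shop example. The yes/no answers given for each column below are one reasonable reading, stated here as assumptions rather than taken from the slides.

```python
def data_kind(describable_with_numbers: bool, sums_make_sense: bool) -> str:
    """Apply the two questions in order and return the kind of data."""
    if not describable_with_numbers:
        return "qualitative"
    return "quantitative" if sums_make_sense else "qualitative"


# (describable with numbers?, does adding values make sense?) per column.
# These answers are assumptions made for illustration.
coffee_shop = {
    "Name of coffee shop": (False, False),
    "Revenue (in thousands of dollars)": (True, True),
    "Zip code": (True, False),  # written with digits, but a sum is meaningless
    "Average monthly customers": (True, True),
    "Country of coffee origin": (False, False),
}

for column, answers in coffee_shop.items():
    print(f"{column}: {data_kind(*answers)}")
```

The zip code is the instructive case: it looks numerical, yet it fails the second question and is therefore treated as qualitative.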

Variable types
• Categorical (Qualitative)
– Nominal (alcohols, esters, acids, ...)
• functional group, colour, apparatus
– Ordinal, including binary (A, B, C, ... – grades)
• The order has meaning: one level is better than another

Example - Categorical
name                      MW   aliph-arom  HC-OH  aliph-arom_HC-OH  logL(n-hexane)  logL(water)
methane                   16   1           3      13                -0.03           -1.43
ethene                    28   1           3      13                0.59            -0.97
methanol                  32   1           4      14                1.09            3.73
propane                   44   1           3      13                1.37            -1.46
ethane                    30   1           3      13                1.48            -1.31
trans-stilbene            180  2           3      23                7.45            2.78
anthracene                178  2           3      23                7.45            3.03
fluoranthene              202  2           3      23                8.42            3.44
methyl 4-hydroxybenzoate  152  2           4      24                8.42            6.84

name                      MW   aliph-arom  HC-OH        aliph-arom_HC-OH
methane                   16   aliphatic   hydrocarbon  aliphatic_hydrocarbon
ethene                    28   aliphatic   hydrocarbon  aliphatic_hydrocarbon
methanol                  32   aliphatic   alcohol      aliphatic_alcohol
propane                   44   aliphatic   hydrocarbon  aliphatic_hydrocarbon
ethane                    30   aliphatic   hydrocarbon  aliphatic_hydrocarbon
trans-stilbene            180  aromatic    hydrocarbon  aromatic_hydrocarbon
anthracene                178  aromatic    hydrocarbon  aromatic_hydrocarbon
fluoranthene              202  aromatic    hydrocarbon  aromatic_hydrocarbon
methyl 4-hydroxybenzoate  152  aromatic    alcohol      aromatic_alcohol
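
The relation between the numeric codes and the written-out categories can be sketched as a pair of lookup tables. The mappings are read off the two tables above; the function names are hypothetical.

```python
# Code-to-label mappings read off the two versions of the example table.
ALIPH_AROM = {1: "aliphatic", 2: "aromatic"}
HC_OH = {3: "hydrocarbon", 4: "alcohol"}


def decode(aliph_arom: int, hc_oh: int):
    """Translate the two numeric codes into category labels."""
    first = ALIPH_AROM[aliph_arom]
    second = HC_OH[hc_oh]
    return first, second, f"{first}_{second}"


def combined_code(aliph_arom: int, hc_oh: int) -> int:
    """The combined column places the two codes side by side (1 and 3 give 13)."""
    return int(f"{aliph_arom}{hc_oh}")


# methanol is coded 1, 4, 14 and methyl 4-hydroxybenzoate 2, 4, 24.
print(decode(1, 4), combined_code(1, 4))
print(decode(2, 4), combined_code(2, 4))
```

The codes are only labels. The value 13 does not mean "less than" 24, so arithmetic on these columns has no chemical meaning.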

Variable types
• Numerical (Quantitative)
– Discrete (1, 2, 3, ...)
• Obtained by counting
• It can only take on certain values
– Continuous (1.2, 2.68, 0.15, ...)
• Obtained from measurements
• It can take any of infinitely many values within a range

Example - Continuous

Which is which?
COL 1 COL 2 COL 3 COL 4 COL 5
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5

Presented like this, with no additional information, the data table is nearly useless.

It might make sense to its original creator, at least for a couple of days. After a year nobody knows what is what.

Are they all discrete numerical variables?

Which is which?
ID nH P (cm) Class Toxicity
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5

ID – case identifier
nH – number of hydrogen atoms
P – perimeter in centimetres
Class – chemical class:
1 = alcohol
2 = aldehyde
3 = hydrocarbon
4 = carboxylic acid
5 = amine
Toxicity – levels of toxicity, non-toxic = 1
and extremely toxic = 5

Which is which?
ID – not a variable at all
nH – discrete
P – continuous
Class – nominal
Toxicity – ordinal
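
One practical consequence of the variable type is which summary is meaningful. The sketch below follows a common convention (mean for numerical, median for ordinal, counts for nominal) that is an assumption here, not something stated on the slides. It treats the 1-to-5 Toxicity scale as the ordinal variable and leaves out ID because it is not a variable.

```python
from collections import Counter
from statistics import mean, median

# Class codes as defined in the legend of the example table.
CLASS_NAMES = {1: "alcohol", 2: "aldehyde", 3: "hydrocarbon",
               4: "carboxylic acid", 5: "amine"}

# Variable type per column; ID is omitted because it is only an identifier.
VARIABLE_TYPE = {"nH": "discrete", "P (cm)": "continuous",
                 "Class": "nominal", "Toxicity": "ordinal"}

# Every column in the example table holds the same values 1..5.
columns = {name: [1, 2, 3, 4, 5] for name in VARIABLE_TYPE}


def summarize(name, values):
    """Pick a summary that suits the variable type (assumed convention)."""
    kind = VARIABLE_TYPE[name]
    if kind in ("discrete", "continuous"):
        return "mean", mean(values)
    if kind == "ordinal":
        return "median", median(values)
    # Nominal: only counting category labels makes sense (Class is the
    # only nominal column here, so its labels are used directly).
    return "counts", Counter(CLASS_NAMES[v] for v in values)


for name, values in columns.items():
    print(name, summarize(name, values))
```

Identical numbers in every column lead to different summaries, which is the point of the example: the meaning of a column, not its digits, decides how it may be analysed.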

Example – intercalibration
[Table] Each row is a case (a laboratory). The columns hold measurements (continuous) and a feature (nominal).

What kind of data?
File name: SpeciesA_Isolate1-6_110131.txt
Number of rows: 28133
Number of columns: 2
First row is a header: NO

1999.96 116
2000.28 123
2000.61 123
2000.93 117
2001.25 104
2001.58 83
2001.9 55
2002.22 24
2002.55 -1
… …
21467.02 98
21468.08 96
21469.15 92
21470.21 86
21471.27 83
21472.34 88
21473.4 108
21474.46 147
21475.52 202
21476.59 274
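
A file like this (two columns, no header row) can be read with a few lines of Python. The sketch below works on the first nine rows shown above, held in memory rather than read from the real file. Whitespace as the delimiter is an assumption, and the column names `mz` and `intensity` follow the axis labels of the plot that comes next.

```python
import io

# First nine rows of the example, used in place of the real file.
sample = io.StringIO("""\
1999.96 116
2000.28 123
2000.61 123
2000.93 117
2001.25 104
2001.58 83
2001.9 55
2002.22 24
2002.55 -1
""")


def read_two_columns(handle):
    """Read a headerless, whitespace-separated two-column file."""
    mz, intensity = [], []
    for line in handle:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        mz.append(float(parts[0]))
        intensity.append(float(parts[1]))
    return mz, intensity


mz, intensity = read_two_columns(sample)
print(f"rows: {len(mz)}, columns: 2")
print(f"first column range: {min(mz)} to {max(mz)}")
print(f"second column range: {min(intensity)} to {max(intensity)}")
```

Checking the row count, the column count and the value ranges is a quick way to start guessing what kind of data an undocumented file contains.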

Some kind of spectral data?
[Figure] The two columns plotted against each other: m/z on the x-axis (0 to 25,000) and intensity on the y-axis (−10,000 to 70,000).