Data Science for Investment Professionals Glossary

Discrete Data
Discrete data assume values on a scale that is finite
or countably infinite. Integer scales are a common
example, but the data need not be in integer form.

Coefficient of Variation
The coefficient of variation for a dataset is the ratio of
the standard deviation to the mean.
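
A minimal sketch of the coefficient of variation, using hypothetical
return data (the numbers below are illustrative only):

    import numpy as np

    returns = np.array([0.04, 0.02, -0.01, 0.03, 0.05])  # hypothetical monthly returns
    cv = np.std(returns, ddof=1) / np.mean(returns)      # sample standard deviation over mean
    print(f"coefficient of variation: {cv:.2f}")
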
Time-Series Data
Time-series data are measurements of one or more
variables over time, with the measurements equally
spaced in time.

Bayes’ Formula
P(A|B) = P(B|A) * P(A) / P(B).
For example, if you know the prevalence of a disease,
and the probability of a positive diagnostic test if you
have the disease, and the general rate of positive
tests, you would use Bayes’ formula to calculate
P(disease|positive test). (See the sketch after this
group of entries.)

Bernoulli Distribution
The Bernoulli distribution is a special case of the
binomial in which there is just one trial. For each
event, there are only two possible outcomes.

Bias–Variance Trade-off
Statistical and machine learning prediction algorithms
can underperform in two ways: (1) stable predictions
that fall short in accuracy (bias); or (2) accurate
predictions that are unstable, depending on the
data sample (high variance). Typically, there is
a trade-off: improving accuracy (minimizing bias)
decreases stability (increases variance).

Coefficient of Determination
The coefficient of determination, or R², is a measure of
the proportion of variation in data that is explained by
a linear model.

Collectively Exhaustive Events
In probability theory, events are collectively
exhaustive if at least one of them must occur. In ice
hockey, for example, a team’s outcome must be one of
three events: win, loss, or tie.

Conditional Probability
Conditional probability is the probability that some
event occurs given that some other event or condition
has occurred. For example, the probability that a web
visitor purchases a product during a session, given
that they have purchased a product in the past.
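
As a worked illustration of the Bayes’ Formula entry above, the
numbers below are hypothetical, chosen only to show the arithmetic:

    # Hypothetical inputs: prevalence, test sensitivity, and overall positive rate
    p_disease = 0.01             # P(A): prevalence of the disease
    p_pos_given_disease = 0.95   # P(B|A): probability of a positive test given disease
    p_pos = 0.06                 # P(B): general rate of positive tests

    p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
    print(f"P(disease|positive test) = {p_disease_given_pos:.3f}")  # about 0.158
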
Features
Features are measurements of characteristics in
a dataset that vary from record to record and are
useful in prediction. Synonyms for “features” are
as follows: predictor variables (from statistics),
fields (from database management), columns, and
attributes. Some examples of numerical features are
as follows: sales, number of customers, and assets.
Some examples of categorical features (which are
often binary) are as follows: purchase/don’t purchase,
survive/die, level of education, and low/mid/large cap.

Gradient Descent
Machine learning algorithms typically seek to
minimize predictive error or a related loss function,
and neural networks do so through some form of
iterative trial and error. Gradient descent is one
popular form of this trial-and-error process. It can be
likened to hiking over hills and valleys in search of the
lowest point, following a rule that says to proceed
in whatever direction takes you downhill from your
current position. This does not guarantee a global
minimum, but the algorithm can be adapted so that it
is not necessarily stopped by a local minimum.

Heteroskedasticity
Heteroskedasticity is when the variance of a variable
changes over different ranges of the data. For
example, wages may vary only a little for people with
a high school education but may vary considerably
more for people with a graduate education.

Labels
In machine learning, labels are another term for
“outcome variable,” particularly in cases in which
the value must be assigned by human review
(e.g., pictures of cats or dogs in training data).

Linear Regression
Linear regression is a statistical procedure that
estimates a linear relationship between predictor
variables (also called features or independent
variables) and a numerical outcome. It typically does
so by minimizing the squared errors or residuals (i.e.,
the differences between predicted and actual values).

Logistic Regression
The goal of a logistic regression is to predict the
probability of being a 1 in binary data. A linear model
is used, but the output must be scaled so that it
lies between 0 and 1. This is done by fitting a
standard linear model to the log odds (logit) of the
outcome. After the model is fit, the coefficients can
then be exponentiated so that relationships between
predictors and the output can be interpreted in terms
of odds, which can be converted to probability.

Lognormal Distribution
The lognormal distribution is a distribution of data
that is long-tailed to the right. When the logs of
the data in a lognormal distribution are taken, the
resulting transformed distribution becomes normally
distributed.
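
A minimal sketch of the gradient descent idea described in the entry
above, minimizing a simple one-variable function (the function and
step size are illustrative assumptions):

    # Minimize f(w) = (w - 3)^2 by repeatedly stepping downhill along the gradient.
    def grad(w):
        return 2 * (w - 3)   # derivative of (w - 3)^2

    w = 10.0                 # arbitrary starting point
    learning_rate = 0.1      # step size
    for _ in range(100):
        w -= learning_rate * grad(w)
    print(round(w, 4))       # converges toward 3, the minimum
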
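The Logistic Regression entry above notes that coefficients can be
exponentiated to obtain odds. A minimal sketch with made-up data
(scikit-learn is assumed to be available):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical data: one predictor, binary outcome
    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    model = LogisticRegression().fit(X, y)
    odds_ratio = np.exp(model.coef_[0][0])  # exponentiated coefficient
    print(f"odds ratio per unit increase in X: {odds_ratio:.2f}")
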
Underfitting
Underfitting in a machine learning model is when
the model underperforms because there are some
patterns or relationships that it is failing to capture
effectively.

Unexplained Variation
Unexplained variation in a target or outcome variable
refers to the variation that is not explained by a
model. For example, in a revenue prediction model, the
differences between predicted revenue and actual
revenue constitute unexplained variation.

Validation Data
In a machine learning process, a model typically
learns from training data and is then applied to
validation or holdout data to see how well the model
predicts.
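
A minimal sketch of the training/validation workflow, using synthetic
data and scikit-learn (both are assumptions for illustration):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))   # synthetic features
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)            # learn from training data
    print(f"validation R^2: {model.score(X_val, y_val):.3f}")   # assess on holdout data
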
Bias Error
Bias error is consistent predictive inaccuracy in one
direction or the other and is not the result of random
variation.

Binary Classifier
A binary classifier is a machine learning algorithm that
predicts the class of a record, where the record can
belong to one of two classes (e.g., a web visitor could
click or not, an account could be current or not).

Bootstrapping
Bootstrapping in statistics is the process of repeatedly
taking samples with replacement from a dataset and
making an estimate from, or fitting a model to, each
bootstrap sample. Bootstrapping is used to estimate
variability, and it also facilitates methods to avoid
overfitting in machine learning models.

Covariance Matrix
For multivariate data, a covariance matrix lists all the
variables both as row and column headers. Each cell
is the covariance of the row and column variables.

Cross-Validation
Cross-validation is the repeated splitting of data into
model-fitting and model-assessment subsets. For
example, a model might be fit to 80% of the data, and
then assessed on the remaining 20%. This process
is iterated, with a different 20% being held out for
assessment each time (or “fold”). The 80/20 split
would constitute five-fold cross-validation.

Cutoff Value
A classification model generates an estimated
probability (propensity) that a record belongs to a
particular class. The model algorithm, or the analyst,
can set a cutoff value (threshold value) for that
probability to distinguish records that are classified
as 1 from those classified as 0.
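
A minimal sketch of bootstrapping the mean, with a synthetic stand-in
dataset (the data and interval level are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=200)   # stand-in dataset

    # Resample with replacement and re-estimate the mean for each bootstrap sample
    boot_means = [rng.choice(data, size=len(data), replace=True).mean()
                  for _ in range(1000)]
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    print(f"bootstrap 95% interval for the mean: ({lo:.2f}, {hi:.2f})")
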
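And a minimal sketch of the five-fold cross-validation described in
the Cross-Validation entry above (synthetic data; scikit-learn assumed):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=100)

    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on 80%
        scores.append(model.score(X[test_idx], y[test_idx]))        # assess on held-out 20%
    print(f"mean R^2 across 5 folds: {np.mean(scores):.3f}")
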
K-Means Clustering
K-means clustering assigns each record to the
nearest of k cluster centers, then recomputes the
centers and reassigns records to reduce within-cluster
dispersion. The process stops when the only available
reassignments increase within-cluster dispersion.

Labeled Dataset
A labeled dataset is one that has records whose
outcomes are known, usually as a result of a human
review that assigns a label (e.g., an insurance claim
could be fraudulent or normal).

LASSO
LASSO (least absolute shrinkage and selection
operator) is a regression method that penalizes
regression coefficients in a way that causes
noninfluential predictors to drop from the model. This
decreases variance in the coefficient estimates and
yields a model that is more parsimonious.

Market Impact Analysis
In finance, a market impact analysis is the
assessment of how much a purchase or sale of
securities affects the security price.

Mean Squared Error
In machine learning, mean squared error (MSE) is the
average of the squared errors (where errors are the
differences between actual and predicted values).

Multilayer Convolutional Neural Network
A convolutional neural network (CNN) is a deep neural
net that adds convolutions to its learning process.
A convolution is an operation applied to multiple
observations at the same time, and it helps uncover
higher-level features that are not discoverable at the
granular level. In processing images, for example,
operations can be applied to small matrices of
pixels, moving sequentially across the image. These
operations help reveal basic higher-level features
(e.g., edges or lines) that are not discoverable at the
individual pixel level but appear when you look at
multiple pixels.

Natural Language Processing
Natural language processing (NLP) is the application
of machine learning methods to natural languages,
like English. The goal may be classification, extraction
of meaning, or generation of content. The older term
“text mining” is sometimes used synonymously.

Neural Network
A neural network is a machine learning algorithm that
performs numerous mathematical operations (both
parallel and sequential) on input data, generating
output that is interpreted as predictions. Initially,
the predictions are essentially random, but they are
compared to actual values, and the weights that
govern the mathematical operations are modified in
additional passes through the data to improve the
predictions.

Node
In a neural network, a node is a location where
weights are applied to inputs to generate an output:
either an intermediate output that becomes an input
to another node, or a final output that becomes a
prediction, which, in turn, is either final or provisional
(if the latter, it is compared to the actual known value).

Noise
Noise in data and models is variation that is not
explainable by a statistical or machine learning model.

Nonparametric
Nonparametric statistical methods or estimates do
not incorporate assumptions about a normal (or other)
data distribution. Such assumptions reduce input
data to parameters (e.g., mean, variance) that are
often required by mathematical formulas.

Out-of-Sample Error
In implementing machine learning models, a subset
of the data is typically held out (holdout or validation
data) to allow the model to be applied to data that
were not used to train the model. The model’s error on
these out-of-sample data is a less biased estimate
of how the model will perform than the error on the
training data.

Output Layer
In a neural network, the output layer contains a node
whose output is the prediction (either final or, if there
are to be further passes through the data, provisional).

Parsimonious Models
Parsimonious machine learning models are those that
use only essential and useful predictors and records.
Including variables and records that do not provide
information that is useful for making predictions adds
noise and degrades performance.

Partition
In machine learning, data are typically split into two or
three subsets. The “training” partition is used to train
or fit the model. A second partition is used to assess
the model and tune (adjust) its parameters. A third
partition may be used at the end to estimate how well
the model will do in the real world (no further model
tweaking is allowed after this). The second partition
is sometimes called the validation partition and may
also be called a “holdout” partition. The third partition
is also sometimes called the “holdout” partition and is
also called the “test” partition.

Penalized Regression
Standard linear regression may include predictors that
introduce instability, reflected in high variance for the
coefficients. This can result from multicollinearity,
from the presence of noise, or when there are simply
too many variables. Regression models can be
improved by penalizing coefficients if their variables
do not contribute stable and useful information.

Principal Component Analysis
Principal component analysis (PCA) transforms
multivariate data into a limited set of variables, each
of which is a linear combination of the original
variables. The first such variable or component seeks
to capture as much of the variability in the data as
possible. This makes it possible to create a
parsimonious machine learning model that uses just a
handful of these newly created variables.

Pruning
If allowed to grow indefinitely, decision trees overfit
the training data and do not perform well with new
data. An optimal tree is one that is not fully grown.
One way to achieve smaller trees is to allow them to
grow fully, and then to prune them back to a point at
which they have minimal error on validation data.

Random Forest Classifier
A form of bagging, a random forest is a collection of
decision trees that are grown using a set of bootstrap
samples. The bootstrapping, or bagging, is done by
taking random samples, with replacement, from the
records; random selection (without replacement)
is also applied to the predictors at each stage. The
results are then averaged, or a majority vote is taken
(for categorical outcomes), yielding predictions that
are more stable than those of individual models.

Regression Tree
A regression tree is a decision tree where the
outcome is numerical. See CART.

Reinforcement Learning
Reinforcement learning is a machine learning method
in which the algorithm interacts with people (e.g.,
shoppers at a website) and, from their reactions,
learns the optimal assignment of options.

Semi-Supervised Technique
In some machine learning situations, labeled data are
scarce or expensive (e.g., insurance claims, which
require a human review to determine validity), but there
is a lot of unlabeled data. In semi-supervised learning,
an initial model is trained on a limited set of labeled
data and then is applied to the unlabeled data to
generate predictions. The predictions that have the most
confidence (i.e., have the highest probability attached to
their classification by the model) are then added to the
original labeled data as “pseudo-labeled data,” and a new
model is trained. This process is then repeated.

Sentiment Analysis
Sentiment analysis is the analysis of human-generated
text to classify it by sentiment (e.g., positive/negative,
enthusiastic/indifferent).

Supervised Learning
Many machine learning models have as their goal
the prediction of something, such as whether a
web visitor will click on a link, how long a patient will
need to be hospitalized, or whether a firm presents a
financial risk to stakeholders. Models are “trained” on
data for which the outcome is known, and then are
applied to data for which the outcome is not known.
This process is termed “supervision.”

Tensor
A tensor is a multidimensional array that generalizes
a matrix. A scalar is a tensor of rank 0, a vector is a
tensor of rank 1, a two-dimensional matrix is a tensor
of rank 2, and so on. Tensors are central to the ability
of deep neural nets to cope with complex data and
features that are at a higher level than the input values
(e.g., shapes instead of pixel values); a popular deep
learning environment is termed “TensorFlow.”

Terminal Node
In a decision tree, a terminal node is a final split in a
branch. The records resulting from the terminal node
split are assigned classes or values, and they are not
split further.

Threshold Value
See Cutoff Value.

Unsupervised Learning
Unsupervised learning methods are machine learning
methods that do not involve the prediction of an
outcome variable. Clustering and recommendation
engines are examples of unsupervised learning.
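
A minimal sketch of the Principal Component Analysis entry above,
applied to synthetic correlated data (the data and scikit-learn usage
are illustrative assumptions):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    base = rng.normal(size=(200, 1))
    # Five noisy copies of one underlying variable: highly correlated columns
    X = np.hstack([base + rng.normal(scale=0.1, size=(200, 1)) for _ in range(5)])

    pca = PCA(n_components=2).fit(X)
    print(pca.explained_variance_ratio_)  # first component captures most of the variability
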
Course 4 – Natural Language Processing

Alternative Hypothesis
In a statistical hypothesis test, the alternative
hypothesis is the proposition you hope to
demonstrate, for example, that a treatment (e.g.,
a new drug, or a different web page) had an effect
beyond what chance might produce.

Area Under the Curve (AUC)
The area under the curve (AUC) is a metric that
measures how effectively a predictive algorithm
separates two classes. The curve in question is
the ROC curve, and an area of 1 indicates perfect
separation, while an area of 0.5 indicates that the
model does no better than chance. See Receiver
Operator Characteristics Curve.

Bag of Words
In natural language processing, the bag of words
technique treats a document as simply a collection of
terms, without regard to order.

Binary Classification
Binary classification algorithms predict classifications
for records that have two (binary) outcomes. Typically,
the class of interest is in the minority (e.g., a sale at a
website, the presence of a disease, a fraudulent tax
return), and is labeled 1, while the other records are
labeled 0.

Chi-Square Test for Independence
The chi-square test of independence tests whether
two or more samples differ in the proportion of 1’s
to a greater extent than chance would allow, if the
samples were drawn from the same population with
a common proportion of 1’s. The test derives from a
chance model in which all samples are pooled, are
dealt out again in random order, and the 1’s in each
sample are counted.

Complexity in a Model
In machine learning, a complex model is contrasted
with a parsimonious one. A complex model will have
many variables and hyperparameters. It may fit the
data better, but it is vulnerable to overfitting and
misinterpreting noise as signal.

Confusion Matrix
A confusion matrix for a classification model is a
2 x 2 table of predicted versus actual outcomes.
The columns typically represent predictions (1 or 0),
and the rows actual classes (1 or 0). The upper left
cell, for example, represents the number of records
predicted as 1’s that are, in fact, 1’s. Numerous
metrics can be calculated from the confusion matrix,
including accuracy rate, error rate, sensitivity (recall),
specificity, false discovery rate, and false omission
rate.

Constraint
In optimization, constraints (e.g., limited supplies of
labor and capital) are almost always present and are
an integral part of the optimization algorithm.

Cross-Sectional Data
Cross-sectional data are measurements of many
different entities (e.g., people in a medical study, web
visitors, emails) at the same time, or in contexts in
which time is not a factor.

Curating Text Data
Text data encompasses a huge variety of topics and
types, such as scholarly articles, tweets, news items,
maintenance reports, medical notes, and dictionaries.
A curated corpus (body) of text is text from one
topic or type that has been selected, reviewed,
and processed to a sufficient degree that it can be
subjected to analysis.

Data Frame
A data frame is a two-dimensional tabular set of data,
like a spreadsheet range, with records as rows and
columns as variables.

Data Mining
Data mining is the science of gaining insight from data
and finding patterns. It encompasses techniques like
predictive modeling, clustering, principal component
analysis, and more. The term has largely been
superseded by the term “machine learning.”

Data Wrangling
Data wrangling is the process of obtaining raw data
and preparing it for analysis and modeling. Obtaining
the data, dealing with missing values, understanding
variables, reconciling different sources, and achieving
consistent formatting are all part of data wrangling.

Document Frequency
In natural language processing (NLP), document
frequency is the number of documents in which a
term appears.

Document Term Matrix
In natural language processing (NLP), the document
term matrix is a matrix in which rows are documents,
columns are terms, and the entries are the counts of a
given term in the document for that row.
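
A minimal sketch of a bag-of-words document term matrix, with made-up
documents (scikit-learn’s CountVectorizer is assumed):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["rates rise and stocks fall",
            "stocks rise on earnings",
            "earnings fall as rates rise"]   # hypothetical documents

    vectorizer = CountVectorizer()
    dtm = vectorizer.fit_transform(docs)        # rows: documents; columns: terms
    print(vectorizer.get_feature_names_out())   # the term vocabulary
    print(dtm.toarray())                        # counts of each term per document
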
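A minimal sketch of the Confusion Matrix entry above, with made-up
predictions (scikit-learn assumed; labels are ordered [1, 0] so that
the upper-left cell counts records predicted as 1 that are, in fact, 1):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # hypothetical actual classes
    predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # hypothetical predictions

    cm = confusion_matrix(actual, predicted, labels=[1, 0])
    tp, fn, fp, tn = cm.ravel()
    print(cm)
    print(f"accuracy: {(tp + tn) / cm.sum():.2f}, sensitivity (recall): {tp / (tp + fn):.2f}")
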
Type II Error
In statistical hypothesis testing, a type II error is
mistakenly accepting the null hypothesis that an
effect or phenomenon (e.g., a new therapy improves
health, or a stock price is not a random walk) is not
real and is just the product of chance. A type II error
(which can occur only when the effect is real) usually
results from an inadequate sample size.

Unstructured Data
Unstructured data are data that do not occur naturally
in the form of a table with rows and columns. The text
in documents, images, and “messy” data (e.g., medical
notes that have a mixture of numeric and text data)
are all unstructured data.

Bias
A measurable, systematic difference between actual
results and the correct results in statistics, or a belief
or assumption that is implicit or explicit.

Bias Error
The degree to which the model fits the training data,
where the lower the bias, the better the model’s
performance in the training dataset.

Behavioral Bias
Typically, a cognitive or emotional bias in decision
making that leads to errors in judgment. Personality
differences can exert a relatively large influence on its
overall impact.

Cohen-D
A statistic that is used to measure effect size in
power analysis.

Coefficient of Determination
The percentage of the variation of the dependent
variable in a regression explained by the independent
or explanatory variables. Also known as R².

Complexity Bias
Refers to a behavioral bias in which the complex can
be more appealing than the simple. In reality, a simple
model that is not underfitted is usually better than a
complex model.
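
A minimal sketch of computing the coefficient of determination
directly from its definition, with made-up values:

    import numpy as np

    actual    = np.array([3.1, 4.0, 5.2, 6.1, 7.3])   # hypothetical observed values
    predicted = np.array([3.0, 4.2, 5.0, 6.3, 7.1])   # hypothetical model predictions

    ss_res = np.sum((actual - predicted) ** 2)        # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)    # total variation
    r2 = 1 - ss_res / ss_tot
    print(f"R^2 = {r2:.3f}")
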