Session 1 Canvas

The document outlines the structure and requirements for a course on quantitative methods, emphasizing the use of Stata for data analysis. It covers topics such as data collection, sampling techniques, biases, and the importance of clear coding and documentation. Additionally, it provides logistical details about workshops and resources for students to succeed in the course.


Quantitative Methods:

Session 1 – Working with Quantitative Data


Richard Haans ([email protected])

RSM - a force for positive change


Agenda
• Logistics
• About me
• What are data?
• Sampling and biases
• Working with data
• Principles of good data management

2
Logistics

3
How to pass this course
There are three workshops where you will complete an
individual assignment. All three need to be passed to pass
the course. Bring your laptop (with Stata installed)!

Workshop 2: 15:00 – 16:45 on Thursday the 25th of Jan.
Workshop 3: 15:00 – 16:45 on Tuesday the 30th of Jan.
Workshop 4: 15:00 – 16:45 on Thursday the 1st of Feb.

Workshop 1 (Thursday the 18th, 15:00 – 16:45) is not
compulsory, but will be a practice session.

4
How to pass this course
Resits are due on the 16th of February at 23:59 and are for
any missed / failed assignment. They will be posted the day after
the respective workshop.

5
Using Stata
We will use Stata in this course, but it’s not a course on Stata.
Stata strikes a nice balance between user-friendliness and
statistical options / flexibility.
Whenever you work with Stata, work from a .do file. This is a
file containing your code, and you can run this code from
there.
While you can do many things using the interface (like in
SPSS), using commands is much faster and more precise.

6
Video lectures
Video lectures on Stata contain all the required code for the
assignments (and likely: your thesis!). See the Discussion
board for links to the videos.

Finding out what commands to use and how to use them is an
important learning goal of the course. I show the most
important pieces, but often you’ll need to fill in the blanks.

This is an important skill for your thesis (and life)!

7
Useful links
https://stats.idre.ucla.edu/stata/modules/
(Basics of Stata)

https://stats.idre.ucla.edu/other/annotatedoutput/
(Annotated analysis examples)

https://stats.idre.ucla.edu/other/dae/
(More annotated data analysis examples)

https://stats.idre.ucla.edu/stata/faq/
(Various frequently asked questions)
e.g. “Can I quickly see how many missing values a variable has?”;
“How can I quickly convert many string variables to numeric variables?”
8
Questions?
Really stuck?
Post your question on the Discussion Board on Canvas; this
way others who face the same problems may also get
something out of it (compared to direct e-mails).
I appreciate other students also helping out and giving their
insights!

9
About me

10
CV
Dutch; born and raised in Tilburg.
PhD from Tilburg U in 2017; joined RSM that year.
Teaching: This course and Strategic Mgmt (BSc); quant methods
(part-time PhD); Organization Theory (PhD). Prior teaching: Research
Clinic, thesis coordinator, New Business Development, Strategic
Entrepreneurship.
Won inaugural award for best thesis coach at RSM, best young
researcher at ERIM, most creative research at SMS.
Editorial review board member of SMJ.
Starting March: Director of Doctoral Education at ERIM.
11
Why me?
Most importantly: I just really like understanding and teaching
the methods that I use.
- Inverted U-shapes in Strategic Management Journal: 15th most-
cited article in Business and Management since 2016 (out of 316K+
articles).
- Topic modeling (text analysis) in Academy of Management Annals;
see also https://github.com/RFJHaans/topicmodeling/
- Data Analytics Chair of the Academy of Management’s
OMT Division (around 3800 members).
- Advisory Board member of the Erasmus Data Service Centre.

12
Teaching philosophy
1) I often show mathematics behind important statistics and tests, since it
can help get at the intuition (and show it’s not a black box). In the end, I
always offer the key take-aways.

2) I don’t expect students to know all the mathematics, but I do expect that
you understand anything you report.

3) The videos contain Stata commands that will do the calculations for you.
Try to replicate what I do with your own data. Change things and see
what happens.

“I hear and I forget. I see and I
remember. I do and I understand.”

13
What are data?

14
15
What are data?
Data are a set of observations of the world around us …
... collected intentionally and systematically
(unsystematic observations lead to bias) …
… coded clearly, to facilitate understanding and comparison …
… analyzed deductively or inductively to draw conclusions.

16
A set of observations...
Empirical Context:
The setting in which our study occurs
- Relevant contexts informed by RQ & theory
- Choice puts boundaries on generalizability of findings

Unit of Analysis:
The object our data describe. What the data are “about”.
- Individual (Entrepreneur, CEO, student)
- Organization (Firm, Board)
- Time (quarter, year)
- Others (Country, Industry, Network)

17
How to get quantitative data?
Quantitative studies use many kinds of data.
Most commonly:
• Archival data (e.g., government and industry databases) often
contain financial and demographic statistics for a variety of empirical
contexts and units of analysis.
• Available, but often with poor documentation. Also: note that they must
have been collected in some way! That is, they are rarely unbiased.

• Surveys: commonly the source of archival data, and
the source for many primary data collection initiatives.
• More control but takes a lot of time + highly uncertain.
• Requires careful design, documentation, reporting.

18
Your thesis data?

19
How to decide?
What data might be useful?
- Check the methodologies of the papers you cite and see what data
they use.
- Some questions require primary data; others are also readily
addressed with archival data.
- Your RQ should lead, but be wary of the risks of hinging your entire
thesis on a survey.
Be creative: Often you can collect really good data without requiring a
questionnaire—e.g., web scraping, manual coding of online sources
such as annual reports. Combine sources.
Search widely: There are databases available
via the university, but many more exist!

20
Where to look?
https://libguides.eur.nl/az.php?s=124888 → Databases that EUR has
access to.

https://www.strategicmanagement.net/ig-competitive-strategy/research-resources
→ An overview of databases for strategic management and
entrepreneurship.

https://www.kauffman.org/entrepreneurship/research/data-resources/ →
Entrepreneurship data.

https://datasetsearch.research.google.com/ →
Google Dataset search.
21
Sampling and biases

22
What are data?
Data are a set of observations of the world around us …
... collected intentionally and systematically
(unsystematic observations lead to bias) …
… coded clearly, to facilitate understanding and comparison …
… analyzed deductively or inductively to draw conclusions.

23
... collected systematically ...
Population: The entire set of existing individuals:
…for our unit of analysis, and
…within our empirical context
Ex: Census of People, Registry of Firms
Hard to know the entire population!

If you have the population, you effectively do not need inferential
statistics!

Sample: A selection from the population
- Makes data collection feasible
- Informs statistics interpretation
- “Good” samples represent the population (no bias)
24
... collected systematically ...
Random Samples: every individual in population has equal chance of
selection (limits potential bias)
Statistical ideal.
Ex: Lottery (simple random sample)
Limitations: Requires sufficient sample size to work.
Bias still possible (by random chance)
Systematic Sample: Organize population by group, randomly select
within groups
Strata: groups have categories or ordered rank
Cluster: groups are geographically proximate
Limitations: Requires good population data
Benefits: Sampling bias less likely
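
A minimal Python sketch (not the course's Stata code) of the two designs above: a simple random sample, and a sample drawn randomly within groups. The firm list, the size classes, and the 10% sampling fraction are invented for illustration.

```python
import random
from collections import defaultdict

# Hypothetical population list of 200 firms: (firm_id, size_class).
# Every fourth firm is labelled "large"; the rest are "small".
population = [(i, "large" if i % 4 == 0 else "small") for i in range(1, 201)]

rng = random.Random(42)  # fixed seed so the draw can be reproduced

# Simple random sample: every firm has the same chance of selection.
simple_sample = rng.sample(population, k=20)

# Sampling within groups: organize firms by size class,
# then select randomly within each group.
strata = defaultdict(list)
for firm in population:
    strata[firm[1]].append(firm)

stratified_sample = []
for size_class, firms in sorted(strata.items()):
    n_draw = round(len(firms) * 0.10)  # assumed 10% sampling fraction per group
    stratified_sample.extend(rng.sample(firms, k=n_draw))

n_large = sum(1 for _, size_class in stratified_sample if size_class == "large")
print(len(simple_sample), len(stratified_sample), n_large)
```

Sampling within groups fixes the share of large firms in the sample by design, whereas in the simple random sample that share can drift by chance. Note that grouping firms first requires knowing each firm's size class, which is the "good population data" requirement.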
25
... collected systematically ...
Convenience Sampling: collecting data through your friends, peers,
personal network, or based on availability
Limitations: the observed individuals are unlikely to be representative
of all individuals (biased). You have no way of comparing to the
population.

Benefits: Data are more easily accessed.

Worst-case scenario: If you do not know or cannot explain
where the data come from, no one can say
whether your findings are useful.

26
Biased samples
How to determine whether you have a biased sample?
a) Rhetorically defend why it is unlikely your sample is biased—e.g.
because you sampled randomly from the population. If you are
using archival data, the documentation should help (else the people
collecting the data did not do their job).
b) Compare population and sample on specific characteristics. E.g.
firm size, founding date, and legal form available in the population.

27
Biased responses
Response Bias: A survey respondent’s answer is not aligned with reality.
Ex: Teenagers under-report drug use

Non-Response Bias: Those who do not respond are systematically
different from those who do respond. Ex: Are the CEOs of the most
successful companies too busy to answer your survey? Also sometimes
referred to as sample selection bias.

Item Non-Response Bias: For a specific question / variable, specific
respondents may be systematically less likely to provide any answer, e.g.,
observations may be missing for sensitive questions.

Again: Show that these issues are not present / have limited influence.
If your tests do reveal biases, don’t hide them but treat them as a limitation.
28
Intermezzo: t-tests / prtests
You often see people use t-tests and tests of proportions to see
whether those in the sample are different from those in the population.

T-tests are tests comparing average values for a continuous variable
across two groups. Tests of proportions compare proportions for
binary variables across two groups. These offer only very descriptive
information (see next session for more sophisticated approaches).

In Stata: ttest and prtest, respectively.
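
To show what these two tests compute, here is a minimal Python sketch (not the Stata commands themselves). It uses the Welch form of the t statistic, which does not assume equal variances; that choice is an assumption of this sketch and may differ from the defaults of the Stata commands. All numbers are invented.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t(group_a, group_b):
    """t statistic for a difference in means, not assuming equal variances."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z statistic and two-sided p-value for a difference in proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical firm sizes for sampled and non-sampled firms (continuous variable).
size_sampled = [10, 12, 14, 16, 18]
size_not_sampled = [11, 13, 15, 17, 19]
t_stat = welch_t(size_sampled, size_not_sampled)

# Hypothetical new-entrant counts (binary variable): 40 of 100 vs 20 of 100.
z_stat, p_val = two_proportion_z(40, 100, 20, 100)

print(round(t_stat, 3), round(z_stat, 3), round(p_val, 4))
```

The t statistic still has to be compared against a t distribution to get a p-value; the Python standard library has no t distribution, so the sketch stops at the statistic.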

29
An example
The rows are firms; we have a population list
containing info on firm size and new entrant
status.
We know whether or not we sampled each of
them.
Within the sampled firms, we know whether
or not they responded; the rest are missing
(indicated by “.” in Stata).
Within the firms that responded we obtained
a response asking for their satisfaction.
30
An example

Step 1: There do not appear to be differences in firm size or new entrant
status between those that were sampled and those that were not sampled (the
p-values are greater than 0.05; more next session!).

31
An example

Step 2: It seems that larger firms were more likely to participate in our
questionnaire; this implies that our conclusions may not generalize as
well to smaller firms. There are no differences in new entrant status. (the
p-value for size is smaller than 0.05—more next session!)

32
An example
You can continue by testing whether or not new entrants who responded scored
higher on satisfaction, for example. However, this may simply be what we are
interested in (i.e. doesn’t necessarily indicate any sort of bias).

Likewise, you can also check whether new entrants were more likely to not have
answered the question about satisfaction (i.e., within the responding sample,
check whether or not there are more missing observations for new entrants).

The key: Throughout your empirical design, think about every step of the way and
whether or not specific biases may creep into the data. Document everything,
report everything transparently and honestly.
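
The item non-response check described above can be sketched in a few lines of Python. The responses are invented, and None stands in for a missing answer (shown as "." in Stata).

```python
# Hypothetical responding firms: (new_entrant, satisfaction).
# None marks a respondent who skipped the satisfaction question.
responses = [
    (1, 4), (1, None), (1, None), (1, 5),
    (0, 3), (0, 4), (0, None), (0, 2), (0, 5), (0, 4),
]

def missing_share(rows, new_entrant):
    """Share of missing satisfaction answers within one group of respondents."""
    group = [satisfaction for entrant, satisfaction in rows if entrant == new_entrant]
    return sum(satisfaction is None for satisfaction in group) / len(group)

share_entrants = missing_share(responses, new_entrant=1)
share_incumbents = missing_share(responses, new_entrant=0)
print(share_entrants, round(share_incumbents, 3))
```

A gap between the two shares is only a descriptive signal; whether it is larger than chance would still need a test of proportions.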

33
Limiting bias
Limiting Non-Response Bias
Effective questionnaire (instrument) design:
- Communicate professionally (notify in advance, follow-up after)
- Personalize to respondent (make this obvious by signing in blue ink)
- Incentivize (thank in advance, highlight salience, pay)
- Minimize response burden (fewer questions, faster questions)

What is a good response rate?
Depends on context!
Survey of homes in 1950s: >80% response rate.
Response rates have fallen due to “survey fatigue”.
My own published work: 3.9% (out of 66,089 sampled)
MSc Thesis: 0.1%-5%

Key: Assessing whether any biases crept in!

34
Limiting bias
Limiting Response Bias
Effective question design increases probability that respondents answer truthfully &
accurately.

Questions are easy to answer when they:
- Are clearly & simply worded
- Are not ambiguous
- Do not require calculations, speculation, hypotheticals
- Address topics understood by the respondent
- Do not pertain to socially normative behaviors
(social confirmation / positive response bias)

Tips: Pre-test the survey questions, simplify as much as possible, verify the question by
asking in more than one way (within reason).

A survey is a poor choice of instrument for questions that cannot fit this list of
characteristics! For archival data, don’t just look at the answers in the data, but also check

35
Break

36
Working with data

37
What are data?
Data are a set of observations of the world around us …
... collected intentionally and systematically
(unsystematic observations lead to bias) …
… coded clearly, to facilitate understanding and comparison …
… analyzed deductively or inductively to draw conclusions.

38
... coded clearly ...
Observations: a set of people & firms with unique characteristics.
Dataset: a table with rows for individuals, and columns defining their
qualities.
Variables: explained / explanatory items that go into your models.
Codes: the meaning behind the values taken for each variable in the
dataset.
Codebook: your documentation about the data, including each
variable and its coding.

39
Epistemic correlations
Epistemic Correlation:
The strength of the logical connection between the observed data and
the concept it represents.

Examples:
Concept        Observation
Innovation     # Patents, New Products
Diversity      % Women / Minority
Performance    % Return on Investment

Quantitative data come from observations that are examples of
broader concepts.
40
Observations
Data with naturally-numerical values are quantitative.
- Number of patents (#)
- IPO valuation ($)

But many qualitative characteristics can be useful for quantitative
studies:
- Sex – binary / dummy variable (0,1)
- Education level – ordinal variable (0, 1, 2, …)
- Industry – categorical variable (no logical order)
Qualitative observations require codification as number values for use
in quantitative analysis.

41
Observations to values
Continuous                 C.A.R., financial performance
Count                      # Acquisitions, # Startups
Proportions                %Family Ownership, %Female
Binary / Dummy (0,1)       Yes/No; Present/Absent
Categorical (0,1,2,3,…)    Race, Industry, Country
Ordinal (0,1,2,3,…)        Education level, S/M/L

Note: You cannot enter categorical variables as a ‘regular’ variable.
You must always convert these to dummy variables, one for each
category. See Stata Video Tutorial 3 for how to do this.
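
To make the conversion concrete, here is a minimal Python sketch (not the Stata approach from the video) that turns one categorical variable into one 0/1 dummy per category. The industry values and the `industry_` prefix are invented for illustration.

```python
# Hypothetical categorical variable: one industry label per firm.
industries = ["retail", "tech", "retail", "energy"]

# One dummy variable per category: 1 if the firm is in that category, else 0.
categories = sorted(set(industries))
dummies = {
    f"industry_{category}": [1 if value == category else 0 for value in industries]
    for category in categories
}

for name, column in dummies.items():
    print(name, column)
```

Each firm has exactly one dummy equal to 1. In a regression with an intercept, one of the dummies is typically left out and serves as the reference category.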

42
Observations to values
Coding your own data (from web documents, etc.)

Simplify: make it so that the values seem intuitive.
Variable = Female? Men take value 0, not 1.

Document: others must be able to interpret meaning.
Create a codebook and maintain a log file.

Preserve Variance: don’t erase the nuance.
Do not group firms into size classes from continuous vars.

Align codes with insight you eventually need.
Base this on your hypotheses!
43
Your variables of interest?

44
Describing quantitative data
Describing Variables
Univariate Statistics describe observations along a single variable
Mean: Average
Mode: Most Frequent
Median: Middle Value of Ordered List (50th percentile)
Range: Min, Max value
Standard Deviation: Spread around the average

Histograms describe the distribution of a variable.
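
A minimal Python sketch of the univariate statistics listed above, using the standard library's statistics module. The patent counts are invented.

```python
from statistics import mean, median, mode, stdev

# Hypothetical number of patents for nine firms.
patents = [0, 1, 1, 2, 3, 3, 3, 5, 9]

summary = {
    "mean": mean(patents),                      # average
    "mode": mode(patents),                      # most frequent value
    "median": median(patents),                  # middle value of the ordered list
    "range": (min(patents), max(patents)),      # min and max value
    "sd": stdev(patents),                       # sample standard deviation
}
print(summary)
```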

45
Histograms
Histograms quickly reveal many univariate statistics

Normal Distribution
• Mean = median
• Std Dev (s.d.) = σ

46
Skew
Skewed Distribution
• Left (Negative) Skew: long “tail” on left
• Mean < Median
• Right (Positive) Skew: long “tail” on right
• Mean > Median
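
A small Python illustration of the mean / median pattern, with invented values: a single extreme value in the tail pulls the mean towards it, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical right-skewed variable: one very large value in the right tail.
right_skewed = [1, 2, 2, 3, 3, 4, 40]
# Mirror image: one very small value in the left tail.
left_skewed = [-value for value in right_skewed]

print(mean(right_skewed) > median(right_skewed))  # right skew: mean above median
print(mean(left_skewed) < median(left_skewed))    # left skew: mean below median
```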

47
Cleaning datasets
Inspecting datasets is important, since data are often not clean.
Common cleaning steps:
1) Fix impossible values—Some datasets indicate missing values with
numbers like “999999”. Check documentation (or use common sense) to see
whether these need to be recoded to missing (in Stata, “.”).
Other times data entry error etc. lead to impossible values that should be
changed to missing (e.g. -1 when a scale can go from 1 to 5).
2) Check outliers—There may be extremely large values that need to be
corrected. E.g. firm profits exceeding global GDP. Typically changed to
missing or winsorized (i.e. changed to 99th percentile).
3) Transform variables—Sometimes distributions are problematic
(e.g. extreme right skews). It may be needed to correct these without
removing observations (outliers can still be valid).
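
The first two cleaning steps can be sketched in Python as follows (this is not the course's Stata code). The data, the 1-to-5 scale, the 999999 missing code, and the nearest-rank percentile rule are all assumptions made for illustration; None plays the role of Stata's ".".

```python
import math

# Hypothetical 1-to-5 survey item with a missing code and a data entry error.
raw_scores = [3, 5, 999999, 2, -1, 4]

def to_missing(value, sentinel=999999, low=1, high=5):
    """Recode the missing code and impossible values to None (missing)."""
    if value == sentinel or not (low <= value <= high):
        return None
    return value

clean_scores = [to_missing(value) for value in raw_scores]
print(clean_scores)  # [3, 5, None, 2, None, 4]

def winsorize_upper(values, pct=99):
    """Cap values above the pct-th percentile (nearest-rank rule) at that percentile."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    cap = ordered[rank - 1]
    return [min(value, cap) for value in values]

# Hypothetical profits: 99 plausible values plus one impossible outlier.
profits = list(range(1, 100)) + [10**9]
winsorized = winsorize_upper(profits)
print(max(profits), max(winsorized))
```

Winsorizing keeps the observation in the data but limits its influence, whereas recoding to missing drops its value entirely.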
48
Correcting for skew
It’s common to use log-transformations to correct for right-skewness
of variables and to reduce the influence of outliers.
We often need to add 1 to X, because ln(0) is mathematically
undefined (yet zeroes commonly occur). You also cannot log-
transform negative values.
What if your data run from e.g. -2 to 2? Unfortunately, it comes down to
an arbitrary choice (e.g., add 3 or 2.1 or 2.01 before transforming...) that
does have serious implications for results.

49
Correcting for skew
An alternative is to use a cube-root transformation (X^(1/3)), which
intuitively does the same but works for negative and zero values. See
e.g. Cox (2011) and the video on working with data for the right
commands.

I recommend using this, rather than log-transformations, but it’s not
yet very common.
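
A minimal Python sketch comparing the two transformations (the function names are made up here, and this is not the Stata code from the video). The cube root is written in a sign-preserving way, so negative inputs give negative outputs.

```python
import math

def log_plus_one(x):
    """ln(x + 1): handles zero, but is undefined for x <= -1."""
    return math.log(x + 1)

def cube_root(x):
    """Sign-preserving cube root: defined for negative, zero, and positive x."""
    return math.copysign(abs(x) ** (1 / 3), x)

values = [-8, -1, 0, 1, 8, 1000]
print([round(cube_root(v), 3) for v in values])

non_negative = [0, 1, 8, 1000]
print([round(log_plus_one(v), 3) for v in non_negative])
```

Both transformations compress large values, but only the cube root can be applied to the full list without first shifting the data by an arbitrary constant.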

50
Combining datasets
You often work with multiple datasets that need to be combined into a
single dataset.
Two main approaches:
1) Append: This literally pastes one dataset under another.

51
Appending

52
Combining datasets
You often work with multiple datasets that need to be combined into a
single dataset.
Two main approaches:
1) Append: This literally pastes one dataset under another.
2) Merging:
a) 1:1 – A unique identifier in each dataset. There cannot be duplicate values in either
dataset for the identifiers. Each match needs to be unique: one-to-one. E.g. firm to firm or
firm / year to firm / year.
b) 1:m – In the first dataset, one value for each identifier. In the second dataset, many values
for each identifier. E.g. one dataset with one piece of information per industry, another with
many observations for each industry.
c) m:1 – The opposite of 1:m.
d) m:m – Many to many. Not advised to use this; only the first match will be used, but other
variables may be different for other matches.
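
To make the difference between appending and merging concrete, here is a minimal Python sketch using plain lists of dictionaries (not the Stata commands). The datasets, variable names, and the helper `merge_many_to_one` are invented for illustration.

```python
# Hypothetical firm-year panel: many rows per firm.
panel = [
    {"firm": "A", "year": 2020, "sales": 10},
    {"firm": "A", "year": 2021, "sales": 12},
    {"firm": "B", "year": 2020, "sales": 7},
]
# Hypothetical firm-level dataset: exactly one row per firm.
firm_info = [
    {"firm": "A", "industry": "tech"},
    {"firm": "B", "industry": "retail"},
]

# Append: paste one dataset under another (same variables, more rows).
appended = panel + [{"firm": "C", "year": 2020, "sales": 5}]

def merge_many_to_one(many_rows, one_rows, key):
    """m:1 merge: each identifier may appear only once in one_rows."""
    lookup = {}
    for row in one_rows:
        if row[key] in lookup:
            raise ValueError(f"duplicate {key!r} value {row[key]!r}: not a valid m:1 merge")
        lookup[row[key]] = row
    # Add the firm-level variables to every matching row of the panel.
    return [{**row, **lookup.get(row[key], {})} for row in many_rows]

merged = merge_many_to_one(panel, firm_info, key="firm")
for row in merged:
    print(row)
```

The uniqueness check is the important part: if the identifier were duplicated on both sides, which row to attach would depend on how the data happen to be sorted.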

53
1:1

Here, there would be a 1:1 match based on firm and year to yield
the dataset on the right. See Video 2 for code and example.

54
1:m / m:1

Here, there would be a m:1 match based on firm to yield the dataset
on the right.

If we start from the middle dataset and match to the left dataset, it’d
be a 1:m merge on firm.

55
m:m

Here, there would be a m:m match based on firm to yield the
dataset on the right. This is like VLOOKUP in Excel and loses valid
observations purely because of the way the data were sorted. If we
sort the data differently, another match would occur.

Likely, some information in the middle dataset is missing (e.g. year
information).

56
Principles of good
data management

57
Guiding thoughts
1) Start with a plan
• Know your goal (Thesis / Hypothesis testing).
• Build backwards to identify the measures & data needed.
• Start early.
2) Document everything
• Maintain a .do and log file in Stata (like a lab notebook).
• Keep your codebook updated.
• Even idiots should be able to replicate your work from these →
use comments.
3) Preserve raw data
• Save data versions with unique names.
• Having to re-collect the same data twice wastes time.

58
Guiding thoughts
4) Take responsibility for quality
• Your findings are only as good as your data.
• Vet reliability.
• Consult with others.
5) Stay secure
• Understand basic terms of access.
• Know who can access the data (or decide this for yourself!).
• Plan what you will do after you are done using the data.
6) Protect Privacy
• Confidentiality: you tell no one who the individuals are (common).
• Anonymity: individuals’ identities are impossible to determine (rare).
• GDPR: Law regulating data about individual people in EU.
https://www.i-scoop.eu/gdpr/gdpr-personal-data-identifiers-pseudonymous-information/

59
What are data?
Data are a set of observations of the world around us …
... collected intentionally and systematically
(unsystematic observations lead to bias) …
… coded clearly, to facilitate understanding and comparison …
… analyzed deductively or inductively to draw conclusions.

60
Next session
Regression essentials
Thursday 18th 11:00 – 12:45 in Sanders 0-02 – Lecture 2.
Thursday 18th 15:00 – 16:45 – Workshop 1 (see Excel file on
Canvas for your group and room).

In the meantime:
• Check out videos 1 through 3.
• Check out practice assignments.
• Install and try out Stata (follow along with videos).

61
See you next time!

62
