Session 1 Canvas
2
Logistics
3
How to pass this course
There are three workshops; in each you will complete an
individual assignment. You need to pass all three to pass
the course. Bring your laptop (with Stata installed)!
4
How to pass this course
Resits are due on 16 February at 23:59 and cover any
missed / failed assignment. Each resit will be posted the day
after the respective workshop.
5
Using Stata
We will use Stata in this course, but it’s not a course on Stata.
Stata strikes a nice balance between user-friendliness and
statistical options / flexibility.
Whenever you work with Stata, work from a .do file. This is a
file containing your code, and you can run this code from
there.
While you can do many things using the interface (like in
SPSS), using commands is much faster and more precise.
6
Video lectures
Video lectures on Stata contain all the required code for the
assignments (and likely: your thesis!). See the Discussion
board for links to the videos.
7
Useful links
https://stats.idre.ucla.edu/stata/modules/
(Basics of Stata)
https://stats.idre.ucla.edu/other/annotatedoutput/
(Annotated analysis examples)
https://stats.idre.ucla.edu/other/dae/
(More annotated data analysis examples)
https://stats.idre.ucla.edu/stata/faq/
(Various frequently asked questions)
e.g. “Can I quickly see how many missing values a variable has?”;
“How can I quickly convert many string variables to numeric variables?”
8
Questions?
Really stuck?
Post your question on the Discussion Board on Canvas; this
way others who face the same problems may also get
something out of it (compared to direct e-mails).
I appreciate other students also helping out and giving their
insights!
9
About me
10
CV
Dutch; born and raised in Tilburg.
PhD from Tilburg U in 2017; joined RSM that year.
Teaching: This course and Strategic Mgmt (BSc); quant methods
(part-time PhD); Organization Theory (PhD). Prior teaching: Research
Clinic, thesis coordinator, New Business Development, Strategic
Entrepreneurship.
Won inaugural award for best thesis coach at RSM, best young
researcher at ERIM, most creative research at SMS.
Editorial review board member of SMJ.
Starting March: Director of Doctoral Education at ERIM.
11
Why me?
Most importantly: I just really like understanding and teaching
the methods that I use.
- Inverted U-shapes in Strategic Management Journal: 15th most-
cited article in Business and Management since 2016 (out of 316K+
articles).
- Topic modeling (text analysis) in Academy of Management Annals;
see also https://github.com/RFJHaans/topicmodeling/
- Data Analytics Chair of the Academy of Management’s
OMT Division (around 3800 members).
- Advisory Board member of the Erasmus Data Service Centre.
12
Teaching philosophy
1) I often show mathematics behind important statistics and tests, since it
can help get at the intuition (and show it’s not a black box). In the end, I
always offer the key take-aways.
2) I don’t expect students to know all the mathematics, but I do expect that
you understand anything you report.
3) The videos contain Stata commands that will do the calculations for you.
Try to replicate what I do with your own data. Change things and see
what happens.
14
15
What are data?
Data are a set of observations of the world around us …
... collected intentionally and systematically
(unsystematic observations lead to bias) …
… coded clearly, to facilitate understanding and comparison …
… analyzed deductively or inductively to draw conclusions.
16
A set of observations...
Empirical Context:
The setting in which our study occurs
- Relevant contexts informed by RQ & theory
- Choice puts boundaries on generalizability of findings
Unit of Analysis:
The object our data describe. What the data are “about”.
* Individual (Entrepreneur, CEO, student)
* Organization (Firm, Board)
* Time (quarter, year)
* Others (Country, Industry, Network)
17
How to get quantitative data?
Quantitative studies use many kinds of data.
Most commonly:
• Archival data (e.g., government and industry databases) often
contain financial and demographic statistics for a variety of empirical
contexts and units of analysis.
• Available, but often with poor documentation. Also: note that they must
have been collected in some way! That is, they are rarely unbiased.
18
Your thesis data?
19
How to decide?
What data might be useful?
- Check the methodologies of the papers you cite and see what data
they use.
- Some questions require primary data; others are also readily
addressed with archival data.
- Your RQ should lead, but be wary of the risks of hinging your entire
thesis on a survey.
Be creative: Often you can collect really good data without requiring a
questionnaire—e.g., web scraping, manual coding of online sources
such as annual reports. Combine sources.
Search widely: There are databases available
via the university, but many more exist!
20
Where to look?
https://libguides.eur.nl/az.php?s=124888
Databases that EUR has access to.
https://www.strategicmanagement.net/ig-competitive-strategy/research-resources
An overview of databases for strategic management and
entrepreneurship.
https://www.kauffman.org/entrepreneurship/research/data-resources/
Entrepreneurship data.
https://datasetsearch.research.google.com/
Google Dataset search.
21
Sampling and biases
22
What are data?
Data are a set of observations of the world around us …
... collected intentionally and systematically
(unsystematic observations lead to bias) …
… coded clearly, to facilitate understanding and comparison …
… analyzed deductively or inductively to draw conclusions.
23
... collected systematically ...
Population: The entire set of existing individuals:
…for our unit of analysis, and
…within our empirical context
Ex: Census of People, Registry of Firms
Hard to know the entire population!
26
Biased samples
How to determine whether you have a biased sample?
a) Rhetorically defend why it is unlikely your sample is biased—e.g.
because you sampled randomly from the population. If you are
using archival data, the documentation should help (else the people
collecting the data did not do their job).
b) Compare population and sample on specific characteristics, e.g.
firm size, founding date, and legal form, if these are available for the
population.
27
Biased responses
Response Bias: A survey respondent’s answer is not aligned with reality.
Ex: Teenagers under-report drug use
Again: Show that these issues are not present / have limited influence.
If your tests do reveal biases, don’t hide them; treat them as a limitation.
28
Intermezzo: t-tests / prtests
You often see people use t-tests and tests of proportions to see
whether those in the sample are different from those in the population.
29
An example
The rows are firms; we have a population list
containing info on firm size and new entrant
status.
We know whether or not we sampled each of
them.
Within the sampled firms, we know whether
or not they responded; the rest are missing
(indicated by “.” in Stata).
Within the firms that responded, we obtained
an answer to a question about their satisfaction.
30
An example
31
An example
Step 2: It seems that larger firms were more likely to participate in our
questionnaire; this implies that our conclusions may not generalize as
well to smaller firms. There are no differences in new entrant status. (The
p-value for size is smaller than 0.05; more next session!)
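The arithmetic behind this kind of comparison can be sketched as follows. The course uses Stata and the videos contain the actual commands; the Python below is only an illustration. All numbers are made up, the function names are hypothetical, and the t statistic assumes equal variances in both groups. A p-value for the t statistic would additionally need the t distribution, which is left out here.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic, assuming equal variances in both groups."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_error


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """z statistic and two-sided p-value for a difference in proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up firm sizes (number of employees), for illustration only.
size_responded = [10, 12, 14]
size_not_responded = [4, 6, 8]
t_stat = pooled_t_statistic(size_responded, size_not_responded)

# Made-up counts: new entrants among 100 responders and 100 non-responders.
z_stat, p_value = two_proportion_z_test(30, 100, 20, 100)

print(f"t statistic for firm size: {t_stat:.3f}")
print(f"z statistic for new entrant share: {z_stat:.3f} (p = {p_value:.3f})")
```

A large t statistic for size would point to the pattern described above: responders and non-responders differ in size, so conclusions may not carry over to the under-represented group.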
32
An example
You can continue by testing whether or not new entrants who responded scored
higher on satisfaction, for example. However, this may simply be what we are
interested in (i.e. doesn’t necessarily indicate any sort of bias).
Likewise, you can also check whether new entrants were more likely to not have
answered the question about satisfaction (i.e., within the responding sample,
check whether or not there are more missing observations for new entrants).
The key: Throughout your empirical design, think about every step of the way and
whether or not specific biases may creep into the data. Document everything,
report everything transparently and honestly.
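The missing-observations check mentioned above could look roughly like this. It is a Python illustration only (the course itself works in Stata); the records are made up and `missing_share` is a hypothetical helper name.

```python
# Each made-up record: (is_new_entrant, satisfaction); None marks a missing answer.
responses = [
    (1, 4), (1, None), (1, None), (1, 5),
    (0, 3), (0, 4), (0, None), (0, 2),
]


def missing_share(records, new_entrant_flag):
    """Share of responding firms in one group that skipped the question."""
    answers = [
        satisfaction for flag, satisfaction in records if flag == new_entrant_flag
    ]
    return sum(answer is None for answer in answers) / len(answers)


share_new = missing_share(responses, 1)
share_other = missing_share(responses, 0)

print(f"Missing among new entrants: {share_new:.2f}")
print(f"Missing among other firms:  {share_other:.2f}")
```

Whether a gap between the two shares is meaningful is again a question for a test of proportions.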
33
Limiting bias
Limiting Non-Response Bias
Effective questionnaire (instrument) design:
- Communicate professionally (notify in advance, follow-up after)
- Personalize to respondent (make this obvious by signing in blue ink)
- Incentivize (thank in advance, highlight salience, pay)
- Minimize response burden (fewer questions, faster questions)
Depends on context!
Survey of homes in 1950s: >80% response rate.
Response rates have fallen due to “survey fatigue”.
My own published work: 3.9% (out of 66,089 sampled)
MSc Thesis: 0.1%-5%
Tips: Pre-test the survey questions, simplify as much as possible, verify the question by
asking in more than one way (within reason).
A survey is a poor choice of instrument for questions that cannot fit this list of
characteristics! For archival data: don’t just look at the answers in the data, but also check
35
Break
36
Working with data
37
What are data?
Data are a set of observations of the world around us …
... collected intentionally and systematically
(unsystematic observations lead to bias) …
… coded clearly, to facilitate understanding and comparison …
… analyzed deductively or inductively to draw conclusions.
38
... coded clearly ...
Observations: a set of people & firms with unique characteristics.
Dataset: a table with rows for individuals, and columns defining their
qualities.
Variables: explained / explanatory items that go into your models.
Codes: the meaning behind the values taken for each variable in the
dataset.
Codebook: your documentation about the data, including each
variable and its coding.
39
Epistemic correlations
Epistemic Correlation:
The strength of the logical connection between the observed data and
the concept it represents.
Examples:
Concept        Observation
Innovation     # Patents, New Products
Diversity      % Women / Minority
Performance    % Return on Investment
41
Observations to values
Continuous                 C.A.R., financial performance
Count                      # Acquisitions, # Startups
Proportions                % Family Ownership, % Female
Binary / Dummy (0,1)       Yes/No; Present/Absent
Categorical (0,1,2,3,…)    Race, Industry, Country
Ordinal (0,1,2,3,…)        Education level, S/M/L
42
Observations to values
Coding your own data (from web documents, etc.)
44
Describing quantitative data
Describing Variables
Univariate Statistics describe observations along a single variable
Mean: Average
Mode: Most Frequent
Median: Middle Value of Ordered List (50th percentile)
Range: Min, Max value
Standard Deviation: Spread around the average
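As a small illustration of these statistics, here is a Python sketch using only the standard library (the course itself uses Stata; see the videos for those commands). The data are made up.

```python
from statistics import mean, median, mode, stdev

# Made-up number of acquisitions per firm, for illustration only.
acquisitions = [0, 1, 1, 2, 3, 5, 9]

summary = {
    "mean": mean(acquisitions),      # average
    "mode": mode(acquisitions),      # most frequent value
    "median": median(acquisitions),  # middle value of the ordered list
    "min": min(acquisitions),        # lower end of the range
    "max": max(acquisitions),        # upper end of the range
    "sd": stdev(acquisitions),       # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```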
45
Histograms
Histograms quickly reveal many univariate statistics
Normal Distribution
• Mean = median
• Std Dev (s.d.) = σ
46
Skew
Skewed Distribution
• Left (Negative) Skew: long “tail” on left
• Mean < Median
• Right (Positive) Skew: long “tail” on right
• Mean > Median
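The mean-versus-median rule of thumb can be checked on a toy example. This is a Python illustration with made-up values, not course code.

```python
from statistics import mean, median

right_skewed = [1, 2, 2, 3, 3, 4, 30]     # long tail on the right
left_skewed = [-x for x in right_skewed]  # mirror image: long tail on the left

# The extreme value pulls the mean toward the tail; the median barely moves.
right_mean, right_median = mean(right_skewed), median(right_skewed)
left_mean, left_median = mean(left_skewed), median(left_skewed)

print(f"Right skew: mean {right_mean:.2f} > median {right_median}")
print(f"Left skew:  mean {left_mean:.2f} < median {left_median}")
```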
47
Cleaning datasets
Inspecting datasets is important, since data are often not clean.
Common cleaning steps:
1) Fix impossible values—Some datasets indicate missing values with
numbers like “999999”. Check documentation (or use common sense) to see
whether these need to be recoded to missing (in Stata, “.”).
Other times, data entry errors and the like lead to impossible values that
should be changed to missing (e.g. -1 when a scale can go from 1 to 5).
2) Check outliers—There may be extremely large values that need to be
corrected. E.g. firm profits exceeding global GDP. Typically changed to
missing or winsorized (i.e. changed to 99th percentile).
3) Transform variables—Sometimes distributions are problematic
(e.g. extreme right skews). You may need to correct these without
removing observations (outliers can still be valid).
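The first two cleaning steps can be sketched as below. This is a Python illustration under stated assumptions: the sentinel value, the helper names, and all data are made up, and the percentile uses a simple nearest-rank rule (statistical packages may compute percentiles slightly differently). The example winsorizes at the 90th percentile only because it has just ten observations.

```python
from math import ceil

MISSING_CODE = 999999  # assumed sentinel; check your dataset's documentation


def recode_to_missing(values, low, high):
    """Replace the sentinel and values outside [low, high] with None (missing)."""
    return [v if v != MISSING_CODE and low <= v <= high else None for v in values]


def winsorize_upper(values, percentile=99):
    """Cap non-missing values at the given percentile (nearest-rank rule)."""
    observed = sorted(v for v in values if v is not None)
    rank = ceil(percentile * len(observed) / 100)
    cap = observed[rank - 1]
    return [None if v is None else min(v, cap) for v in values]


# Made-up answers on a 1-5 scale, with a sentinel and a data-entry error.
scale_answers = [3, 5, 999999, -1, 4]
clean_answers = recode_to_missing(scale_answers, low=1, high=5)

# Made-up profits with one extreme value.
profits = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
capped_profits = winsorize_upper(profits, percentile=90)

print(clean_answers)
print(capped_profits)
```

Note that winsorizing keeps the observation in the data and only pulls its value in, whereas recoding to missing drops it from any analysis that uses the variable.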
48
Correcting for skew
It’s common to use log-transformations to correct for right-skewness
of variables and to reduce the influence of outliers.
We often need to add 1 to X, because ln(0) is mathematically
undefined (yet zeroes commonly occur). You also cannot log-
transform negative values.
What if your data run from e.g. -2 to 2? Unfortunately, this comes down to
an arbitrary choice (e.g., add 3 or 2.1 or 2.01 before transforming...) that
does have serious implications for results.
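A short Python illustration of both points (made-up data; the Stata commands are in the course videos): adding 1 before taking the log, and how much the choice of constant matters when values can be negative.

```python
from math import log

# Made-up right-skewed counts that include a zero.
patents = [0, 1, 3, 10, 250]

# ln(0) is undefined, so 1 is added before taking the natural log.
log_patents = [log(x + 1) for x in patents]

# Data running from -2 to 2: the constant added first is an arbitrary choice.
values = [-2, 0, 2]
shift_3 = [log(x + 3) for x in values]
shift_2_01 = [log(x + 2.01) for x in values]

print([round(v, 2) for v in log_patents])
print([round(v, 2) for v in shift_3])
print([round(v, 2) for v in shift_2_01])
```

With a shift of 3 the lowest value maps to ln(1) = 0, while with a shift of 2.01 it maps to ln(0.01), far below the rest. The smaller the constant, the more the lowest observations are stretched out, which is why this choice can drive results.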
49
Correcting for skew
An alternative is the cube-root transformation (X^(1/3)), which
intuitively does the same but also works for negative and zero values. See
e.g. Cox (2011) and the video on working with data for the right
commands.
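As a rough Python illustration (not the Stata commands from the video): a sign-preserving cube root, written out explicitly because naively raising a negative number to the power 1/3 does not return the real root in Python. The helper name and data are made up.

```python
from math import copysign


def cube_root(x):
    """Real cube root that keeps the sign of x (works for negatives and zero)."""
    return copysign(abs(x) ** (1 / 3), x)


# Made-up values spanning negative, zero, and large positive numbers.
values = [-8, -1, 0, 1, 8, 1000]
transformed = [cube_root(x) for x in values]

print([round(v, 3) for v in transformed])
```

The large value is pulled in strongly, while the sign and ordering of all observations are preserved and no constant needs to be added.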
50
Combining datasets
You often work with multiple datasets that need to be combined into a
single dataset.
Two main approaches:
1) Append: This literally pastes one dataset under another.
51
Appending
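Conceptually, appending is just stacking rows. A minimal Python sketch with made-up data (the course does this in Stata), assuming both datasets hold the same variables:

```python
# Two made-up datasets with the same variables, e.g. one file per year.
firms_2020 = [
    {"firm": "A", "year": 2020, "sales": 10},
    {"firm": "B", "year": 2020, "sales": 7},
]
firms_2021 = [
    {"firm": "A", "year": 2021, "sales": 12},
    {"firm": "B", "year": 2021, "sales": 8},
]

# Appending pastes the second dataset under the first.
appended = firms_2020 + firms_2021

for row in appended:
    print(row)
```

The number of rows grows while the set of variables stays the same, which is the opposite of what a merge does.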
52
Combining datasets
You often work with multiple datasets that need to be combined into a
single dataset.
Two main approaches:
1) Append: This literally pastes one dataset under another.
2) Merging:
a) 1:1 – A unique identifier in each dataset. There cannot be duplicate values in either
dataset for the identifiers. Each match needs to be unique: one-to-one. E.g. firm to firm or
firm / year to firm / year.
b) 1:m – In the first dataset, one value for each identifier. In the second dataset, many values
for each identifier. E.g. one dataset with one piece of information per industry, another with
many observations for each industry.
c) m:1 – The opposite of 1:m.
d) m:m – Many to many. Not advised to use this; only the first match will be used, but other
variables may be different for other matches.
53
1:1
Here, there would be a 1:1 match based on firm and year to yield
the dataset on the right. See Video 2 for code and example.
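The logic of a 1:1 merge can be sketched in Python as follows. This is an illustration with made-up data and a hypothetical helper, not the code from Video 2. As a simplifying assumption, only matched rows are kept.

```python
def merge_one_to_one(master, using, keys):
    """Merge two lists of records 1:1 on the given key variables."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    # In a 1:1 merge the keys must uniquely identify rows in BOTH datasets.
    for name, dataset in (("master", master), ("using", using)):
        seen = [key_of(row) for row in dataset]
        if len(seen) != len(set(seen)):
            raise ValueError(f"keys do not uniquely identify rows in {name} data")

    lookup = {key_of(row): row for row in using}
    # Simplification: rows without a match in the other dataset are dropped.
    return [{**row, **lookup[key_of(row)]} for row in master if key_of(row) in lookup]


# Made-up firm / year datasets.
financials = [
    {"firm": "A", "year": 2020, "sales": 10},
    {"firm": "A", "year": 2021, "sales": 12},
    {"firm": "B", "year": 2020, "sales": 7},
]
governance = [
    {"firm": "A", "year": 2020, "board_size": 5},
    {"firm": "A", "year": 2021, "board_size": 6},
    {"firm": "B", "year": 2020, "board_size": 4},
]

merged = merge_one_to_one(financials, governance, keys=("firm", "year"))
for row in merged:
    print(row)
```

Merging on firm alone would fail the uniqueness check here, because firm A appears twice; that is exactly the situation where firm / year is the right identifier.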
54
1:m / m:1
Here, there would be an m:1 match based on firm to yield the dataset
on the right.
If we start from the middle dataset and match to the left dataset, it’d
be a 1:m merge on firm.
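An m:1 merge can be illustrated along the same lines: many firm / year rows each pick up the single row of firm-level information. This is a Python sketch with made-up data, not course code.

```python
# "Many" side: several rows per firm (made-up data).
firm_years = [
    {"firm": "A", "year": 2020, "sales": 10},
    {"firm": "A", "year": 2021, "sales": 12},
    {"firm": "B", "year": 2020, "sales": 7},
]
# "One" side: exactly one row per firm.
firm_info = [
    {"firm": "A", "industry": "Retail"},
    {"firm": "B", "industry": "Software"},
]

# The identifier must be unique on the "one" side.
firms = [row["firm"] for row in firm_info]
if len(firms) != len(set(firms)):
    raise ValueError("firm does not uniquely identify rows in firm_info")

info_by_firm = {row["firm"]: row for row in firm_info}

# Every firm / year row receives the matching firm-level variables.
merged = [{**row, **info_by_firm[row["firm"]]} for row in firm_years]

for row in merged:
    print(row)
```

Starting from `firm_info` and attaching the firm / year rows instead would be the 1:m direction; the resulting dataset is the same.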
55
m:m
56
Principles of good
data management
57
Guiding thoughts
1) Start with a plan
• Know your goal (Thesis / Hypothesis testing).
• Build backwards to identify the measures & data needed.
• Start early.
2) Document everything
• Maintain a .do and log file in Stata (like a lab notebook).
• Keep your codebook updated.
• Even idiots should be able to replicate your work from these;
use comments.
3) Preserve raw data
• Save data versions with unique names.
58
https://www.i-scoop.eu/gdpr/gdpr-personal-data-identifiers-pseudonymous-information/
What are data?
Data are a set of observations of the world around us …
... collected intentionally and systematically
(unsystematic observations lead to bias) …
… coded clearly, to facilitate understanding and comparison …
… analyzed deductively or inductively to draw conclusions.
60
Next session
Regression essentials
Thursday 18th 11:00 – 12:45 in Sanders 0-02 – Lecture 2.
Thursday 18th 15:00 – 16:45 – Workshop 1 (see Excel file on
Canvas for your group and room).
In the meantime:
• Check out videos 1 through 3.
• Check out practice assignments.
• Install and try out Stata (follow along with videos).
61
See you next time!
62