{ggstatsplot} is a software tool designed to create information-rich statistical visualizations with minimal effort, addressing common reporting and interpretation errors in data analysis. It provides ready-made plots that incorporate raw data, descriptive and inferential statistics, and various statistical approaches, enhancing the quality and transparency of statistical reporting. Since its launch in 2017, it has gained significant adoption across multiple fields, with over 500K downloads and more than 1000 citations.

Statistical Visualizations with {ggstatsplot}: A Biography
Indrajeet Patil

Or How I Learned to Stop Worrying about Data Visualization and Statistical Reporting
1
Genesis
Why a new software?

2
Life in the trenches (c. 2017, Harvard)
External Stimulus

Reporting errors:
“half of all published psychology papers contained at least one p-value that was inconsistent” [1]

Interpretation errors:
“in 72% of cases, nonsignificant results were misinterpreted [to mean] that effect was absent” [2]

Replication crisis:
“39% of effects were subjectively rated to have replicated the original result” [3]

Internal Response

How to:
avoid reporting errors?
improve quality of statistical reporting?
emphasize the importance of the effect?
interpret null results?
easily assess validity of model assumptions?
increase replicability?
and more…
3
Proposal
Information-rich, ready-made statistical visualizations
(minimal effort and maximum transparency)

4
A visualization with statistical summary

💡 Visualizations reveal problems not discernible from model summaries!


(Matejka & Fitzmaurice, Autodesk Research, 2017)
5
Ready-made plots with one-line syntax

The grammar of graphics framework can prepare any visualization! But building plots from scratch can be time-consuming.

💡 Using ready-made plots lowers the effort needed for visualizing data!

6
Action Plan
{ggstatsplot} was born!
(open-sourced on GitHub in 2017; still actively developed)

7
Example function
E.g., for hypothesis about differences between groups
ggbetweenstats(iris, Species, Sepal.Length)
Important

Information-rich defaults

raw data + distributions


descriptive statistics
inferential statistics
effect size + uncertainty
pairwise comparisons
Bayesian hypothesis-testing
Bayesian estimation

Statistical approaches available

parametric
non-parametric
robust
Bayesian

8
And there is more!

Appendix provides more details.
9


Promised Land
Does it deliver?

10
Show, don’t tell
Without {ggstatsplot}

Pearson’s correlation test revealed that, across 142 participants, variable x was negatively correlated with variable y: t(140) = −0.76, p = 0.446. The effect size (r = −0.06, 95% CI [−0.23, 0.10]) was small, as per Cohen’s (1988) conventions. The Bayes Factor for the same analysis revealed that the data were 5.81 times more probable under the null hypothesis as compared to the alternative hypothesis. This can be considered moderate evidence (Jeffreys, 1961) in favor of the null hypothesis (absence of any correlation between x and y).

With {ggstatsplot}

✅ No need to worry about reporting or interpretation errors!
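
The reporting errors quoted earlier are cases where a reported p-value does not match the reported test statistic and degrees of freedom. As a rough sketch of what such a consistency check involves, the base R snippet below recomputes the two-sided p-value for t(140) = −0.76 while allowing for rounding of the t statistic. The helper `is_p_consistent()` and its arguments are hypothetical and written for this illustration; they are not part of {ggstatsplot} or of any published checking tool.

```r
# Hypothetical helper (not part of {ggstatsplot}): is a reported two-sided
# p-value compatible with a reported t statistic and its degrees of freedom,
# once rounding of the reported t statistic is taken into account?
is_p_consistent <- function(t_value, df, p_reported,
                            t_digits = 2, p_digits = 3) {
  half_step <- 0.5 * 10^(-t_digits)
  t_abs <- abs(t_value)

  # Smallest and largest |t| that would round to the reported value
  t_low <- max(t_abs - half_step, 0)
  t_high <- t_abs + half_step

  # A larger |t| gives a smaller p-value, so the bounds swap
  p_min <- 2 * pt(t_high, df = df, lower.tail = FALSE)
  p_max <- 2 * pt(t_low, df = df, lower.tail = FALSE)

  # Small tolerance guards against floating-point noise after rounding
  tol <- 1e-12
  p_reported >= round(p_min, p_digits) - tol &&
    p_reported <= round(p_max, p_digits) + tol
}

# Values taken from the written report above
ok <- is_p_consistent(t_value = -0.76, df = 140, p_reported = 0.446)

# A typo that drops a digit would be flagged as inconsistent
typo <- is_p_consistent(t_value = -0.76, df = 140, p_reported = 0.046)

print(c(reported = ok, with_typo = typo))
```

When the statistics are printed onto the plot by software, this kind of transcription mismatch has no opportunity to arise.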


11
Thoughtful Defaults
Data Visualization Statistical Reporting

(Doorn et al., 2020; APA Manual)

✅ Follows best practices in data visualization and statistical reporting!


12
Impact
I can haz users?!

13
User Love
Total downloads > 500K (97th percentile)

Total citations > 1000
From publications across a wide range of fields: biology, medicine, psychology, economics, etc.

Second most starred {ggplot2}-extension!

Improving Psychological Science Award (2020)

14
Pleasant Side Effects
Maybe the real treasure was the skills we acquired along the way!

15
Software Architecture
Breaking down the monolith: 20K (2017) → 1K (2024) lines of code

[Diagram: {ggstatsplot} with {statsExpressions} as its backend engine, drawing on {easystats} packages ({effectsize}, {insight}, {parameters}, {performance}, {bayestestR}) and other dependencies]

16
Collaborative Solutions
While re-architecting {ggstatsplot}, I started contributing upstream.

As part of {easystats} core team

leadership skills to steer the project


long-term vision for the project
API design
CI infrastructure
code review
documentation
scouting for new talent
developer advocacy
community engagement

Making it a habit

co-maintainer of {ggsignif}
contributor to {WRS2}, {ggcorrplot}
co-author of {lintr} (linter for R)
co-author of {styler} (code formatter)
17
Quality Assurance
“The only way to go fast, is to go well.”
- Robert C. Martin

Healthy and active code base

CI Checks (GitHub Actions)

Unit tests (random-order)


Code coverage (100%)
Linting (0 lints)
Formatting (0 issues)
Documentation (website, no link rot, plenty examples)
Pre-commit hooks (0 issues)
Zero user-facing warnings
Portability (Linux, macOS, Windows)
Robustness (dependencies, language versions)
CRAN checks (0 notes, 0 warnings, 0 errors)

18
Communication
Training material on best practices in software/package development to support
community contributions keeping in mind the diverse backgrounds of contributors.

19
Biography (2017-)
(Or how developing {ggstatsplot} continues to help me grow as a software developer)

[Diagram: {ggstatsplot} at the center of two branches]

Technical Skills: Code Quality, Architecture Design, Technical Debt

Soft Skills: Collaboration, Leadership, Communication

20
Conclusion
{ggstatsplot} offers an intuitive interface for creating
detailed statistical visualizations, enabling users to adopt
rigorous, reliable, and robust workflows for data exploration
and reporting across various academic and industrial
disciplines. It is a well-maintained tool with high-quality
infrastructure and widespread adoption.

21
Thank You 😊
Source code for these slides can be found on GitHub.

22
For more
If you are interested in good programming and software
development practices, check out my other slide decks.

23
Find me at…
Twitter
LinkedIn
GitHub
Website
E-mail

24
Session information
sessioninfo::session_info(include_base = TRUE)
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.2 (2024-10-31)
os Ubuntu 22.04.5 LTS
system x86_64, linux-gnu
hostname fv-az564-242
ui X11
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz UTC
date 2024-12-08
pandoc 3.5 @ /opt/hostedtoolcache/pandoc/3.5/x64/ (via rmarkdown)
quarto 1.7.2 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
base * 4.4.2 2024-10-31 [3] local
BayesFactor 0.9.12-4.7 2024-01-24 [1] RSPM
bayestestR 0.15.0 2024-10-17 [1] RSPM

25
Appendix

26
Examples of other functions

27
ggwithinstats()
Hypothesis about group differences: repeated measures design
ggwithinstats(
  data = WRS2::WineTasting,
  x = Wine,
  y = Taste
)
Important

✏️ Defaults

raw data + distributions
descriptive statistics
inferential statistics
effect size + uncertainty
pairwise comparisons
Bayesian hypothesis-testing
Bayesian estimation

Statistical approaches available

parametric
non-parametric
robust
Bayesian
28
gghistostats()
Distribution of a numeric variable
gghistostats(
  data = movies_long,
  x = budget,
  test.value = 30
)
Important

✏️ Defaults

counts + proportion for bins
descriptive statistics
inferential statistics
effect size + uncertainty
pairwise comparisons
Bayesian hypothesis-testing
Bayesian estimation

Statistical approaches available

parametric
non-parametric
robust
Bayesian
29
ggdotplotstats()
Labeled numeric variable
ggdotplotstats(
  data = movies_long,
  x = budget,
  y = genre,
  test.value = 30
)
Important

✏️ Defaults

descriptive statistics
inferential statistics
effect size + uncertainty
pairwise comparisons
Bayesian hypothesis-testing
Bayesian estimation

Statistical approaches available

parametric
non-parametric
robust
Bayesian

30
ggscatterstats()
Hypothesis about correlation: Two numeric variables
ggscatterstats(
  data = movies_long,
  x = budget,
  y = rating
)
Important

✏️ Defaults

joint distribution
marginal distribution
effect size + uncertainty
pairwise comparisons
Bayesian hypothesis-testing
Bayesian estimation

Statistical approaches available

parametric
non-parametric
robust
Bayesian

31
ggcorrmat()
Hypothesis about correlation: Multiple numeric variables
ggcorrmat(dplyr::starwars)
Important

✏️ Defaults
inferential statistics
effect size + uncertainty
careful handling of NAs
partial correlations

Statistical approaches available

parametric
non-parametric
robust
Bayesian

32
ggpiestats()
Hypothesis about composition of categorical variables
ggpiestats(
  data = mtcars,
  x = am,
  y = cyl
)
Important

✏️ Defaults

descriptive statistics
inferential statistics
effect size + uncertainty
goodness-of-fit tests
Bayesian hypothesis-testing
Bayesian estimation

33
ggbarstats()
Hypothesis about composition of categorical variables
ggbarstats(
  data = mtcars,
  x = am,
  y = cyl
)
Important

✏️ Defaults

descriptive statistics
inferential statistics
effect size + uncertainty
goodness-of-fit tests
Bayesian hypothesis-testing
Bayesian estimation

34
ggcoefstats()
Hypothesis about regression coefficients
mod <- lm(
  formula = rating ~ mpaa,
  data = movies_long
)

ggcoefstats(mod)
Important

✏️ Defaults

estimate + uncertainty
inferential statistics (t, z, F, χ²)
model fit indices (AIC + BIC)

Supports all regression models supported in {easystats} ecosystem.

Meta-analysis is also supported!

35
grouped_ variants
Iterating over a grouping variable

36
grouped_ functions
grouped_ggpiestats(
  data = mtcars,
  x = cyl,
  grouping.var = am
)

Available grouped_ variants:

grouped_ggbetweenstats()
grouped_ggwithinstats()
grouped_gghistostats()
grouped_ggdotplotstats()
grouped_ggscatterstats()
grouped_ggcorrmat()
grouped_ggpiestats()
grouped_ggbarstats()

37
Customizability
“What if I don’t like the default plots?” 🤔

38
Modify the look 🎨
By changing theme and palette

ggbetweenstats(
  data = movies_long,
  x = mpaa,
  y = rating,
  ggtheme = ggthemes::theme_economist(),
  palette = "Darjeeling2",
  package = "wesanderson"
)

By using {ggplot2} functions

ggbetweenstats(
  data = mtcars,
  x = am,
  y = wt,
  type = "bayes"
) +
  scale_y_continuous(sec.axis = dup_axis())

39
Too much information 🙈
Get only plots:

ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  # turn off statistical analysis
  centrality.plotting = FALSE,
  results.subtitle = FALSE,
  bf.message = FALSE,
  # turn off pairwise comparisons
  pairwise.display = "none"
)

Get only expressions:

stats_expr <- ggpiestats(
  Titanic_full, Survived, Sex
) %>%
  extract_subtitle()

ggiraphExtra::ggSpine(
  data = Titanic_full,
  aes(x = Sex, fill = Survived)
) +
  labs(subtitle = stats_expr)

40
{ggstatsplot}: Details about statistical reporting

41
Supports different statistical approaches
Note
Functions                        | Description                                          | Parametric | Non-parametric | Robust | Bayesian
ggbetweenstats()                 | Between group comparisons                            | ✅         | ✅             | ✅     | ✅
ggwithinstats()                  | Within group comparisons                             | ✅         | ✅             | ✅     | ✅
gghistostats(), ggdotplotstats() | Distribution of a numeric variable                   | ✅         | ✅             | ✅     | ✅
ggcorrmat()                      | Correlation matrix                                   | ✅         | ✅             | ✅     | ✅
ggscatterstats()                 | Correlation between two variables                    | ✅         | ✅             | ✅     | ✅
ggpiestats(), ggbarstats()       | Association between categorical variables            | ✅         | NA             | NA     | ✅
ggpiestats(), ggbarstats()       | Equal proportions for categorical variable levels    | ✅         | NA             | NA     | ✅
ggcoefstats()                    | Regression modeling                                  | ✅         | ✅             | ✅     | ✅
ggcoefstats()                    | Random-effects meta-analysis                         | ✅         | NA             | ✅     | ✅

42
Toggling statistical approaches 🔀

Parametric

# anova
ggbetweenstats(
  data = mtcars,
  x = cyl,
  y = wt,
  type = "p"
)

# correlation analysis
ggscatterstats(
  data = mtcars,
  x = wt,
  y = mpg,
  type = "p"
)

# t-test
gghistostats(
  data = mtcars,
  x = wt,
  test.value = 2,
  type = "p"
)

Non-parametric

# anova
ggbetweenstats(
  data = mtcars,
  x = cyl,
  y = wt,
  type = "np"
)

# correlation analysis
ggscatterstats(
  data = mtcars,
  x = wt,
  y = mpg,
  type = "np"
)

# t-test
gghistostats(
  data = mtcars,
  x = wt,
  test.value = 2,
  type = "np"
)
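
The same `type` argument should also reach the robust and Bayesian approaches listed in the support table. The snippet below is a minimal sketch, assuming {ggstatsplot} and its dependencies are installed and that the installed version accepts `"r"` as the abbreviation for the robust approach (`"bayes"` already appears elsewhere in these slides).

```r
library(ggstatsplot)

# Robust approach (assumption: "r" is accepted as shorthand for "robust")
p_robust <- ggbetweenstats(
  data = mtcars,
  x = cyl,
  y = wt,
  type = "r"
)

# Bayesian approach
p_bayes <- ggbetweenstats(
  data = mtcars,
  x = cyl,
  y = wt,
  type = "bayes"
)

# Both calls return regular ggplot objects that can be printed or modified
print(p_robust)
print(p_bayes)
```

Because only one argument changes between approaches, comparing how sensitive a conclusion is to the choice of statistical framework takes a single edit.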

43
Alternative: Pure Pain
Hunting for packages

📦 for inferential statistics ({stats})
📦 computing effect size + CIs ({effectsize})
📦 for descriptive statistics ({skimr})
📦 pairwise comparisons ({multcomp})
📦 Bayesian hypothesis testing ({BayesFactor})
📦 Bayesian estimation ({bayestestR})
📦 …

Inconsistent APIs

🤔 accepts data frame, vector, matrix?
🤔 long/wide format data?
🤔 works with NAs?
🤔 returns data frame, vector, matrix?
🤔 works with tibbles?
🤔 has all necessary details?
🤔 …
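
To make the point concrete, here is a rough sketch of the manual route for a single between-groups question, using only base R on the built-in `iris` data. The choice of tests (classic one-way ANOVA, eta-squared, Tukey's HSD) is an assumption made for this illustration and is not necessarily what {ggstatsplot} runs by default. Note how each step returns a different kind of object that then has to be stitched into a report by hand.

```r
# Inferential statistics: one-way ANOVA, returned as a list of data frames
fit <- aov(Sepal.Length ~ Species, data = iris)
anova_table <- summary(fit)[[1]]

# Effect size: eta-squared computed by hand from the sums of squares
sums_of_squares <- anova_table[["Sum Sq"]]
eta_squared <- sums_of_squares[1] / sum(sums_of_squares)

# Descriptive statistics: a named numeric vector
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Pairwise comparisons: a matrix nested inside a list
pairwise <- TukeyHSD(fit)$Species

print(anova_table)
print(eta_squared)
print(group_means)
print(pairwise)
```

Bayesian hypothesis testing and estimation would still be missing from this output and would need further packages.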

44
Benefits in detail
{ggstatsplot} combines data visualization and statistical analysis in a single step.
It…

provides ready-made plots with information-rich defaults

minimizes the chances of making errors in statistical reporting

follows best practices in data visualization and statistical reporting

helps evaluate statistical analysis in the context of the underlying data

highlights the importance of the effect by providing effect size measures

provides an easy way to evaluate absence of an effect using Bayesian framework

is extremely beginner-friendly

45
Simplified data analysis workflow

✅ Quick insight into data by combining visualization and modeling!

(Grolemund & Wickham, R for Data Science, 2017)
46


Community Involvement
11 contributors
3 reverse dependencies
Widely covered in YouTube videos and social media posts
Almost 100% resolution rate on StackOverflow (> 150 questions)
Over 100 daily visitors on GitHub repo
Usage in a wide range of fields: psychology, biology, medicine, economics, etc.
Usage in data science training programs

47
A grain of salt
The “Golem of Prague” problem

❌ Promotes mindless application of statistical tests.

No stable release yet.

48
Footnotes
1. (Nuijten et al., Behavior Research Methods, 2016)

2. (Aczel et al., AMPPS, 2018)

3. (Open Science Collaboration, Science, 2015)

49
