Estimating regression fits — seaborn 0.13.2 documentation
In the spirit of Tukey, the regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during
exploratory data analyses. That is to say that seaborn is not itself a package for statistical analysis. To obtain quantitative measures related to the fit of
regression models, you should use statsmodels. The goal of seaborn, however, is to make exploring a dataset through visualization quick and easy, as
doing so is at least as important as exploring a dataset through tables of statistics.
In the simplest invocation, both regplot() and lmplot() draw a scatterplot of two variables, x and y, and then fit the regression model y ~ x and plot the resulting
regression line and a 95% confidence interval for that regression:
import seaborn as sns
tips = sns.load_dataset("tips")
sns.regplot(x="total_bill", y="tip", data=tips);
https://seaborn.pydata.org/tutorial/regression.html 1/11
19/04/2024 Estimating regression fits — seaborn 0.13.2 documentation
These functions draw similar plots, but regplot() is an axes-level function, and lmplot() is a figure-level function. Additionally, regplot() accepts
the x and y variables in a variety of formats including simple numpy arrays, pandas.Series objects, or as references to variables in a
pandas.DataFrame object passed to data. In contrast, lmplot() has data as a required parameter and the x and y variables must be specified as
strings. Finally, only lmplot() has hue as a parameter.
The core functionality is otherwise similar, though, so this tutorial will focus on lmplot().
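To make the interface difference concrete, here is a minimal sketch. The data is a synthetic stand-in (the column names total_bill and tip mirror the tips dataset, but the values are randomly generated so the snippet runs offline); that data is an assumption for illustration, not part of the original tutorial.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (an assumption, not the real tips dataset).
rng = np.random.default_rng(0)
total_bill = rng.uniform(5, 50, 100)
tip = 1 + 0.1 * total_bill + rng.normal(0, 0.8, 100)
df = pd.DataFrame({"total_bill": total_bill, "tip": tip})

# regplot() is axes-level: it accepts plain arrays and draws on a given Axes.
fig, ax = plt.subplots()
returned_ax = sns.regplot(x=total_bill, y=tip, ax=ax)

# lmplot() is figure-level: it needs data= plus column names and
# returns a FacetGrid that manages its own figure.
g = sns.lmplot(data=df, x="total_bill", y="tip")
```

In a script (as opposed to a notebook) you would typically follow this with plt.show() or save the figures.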
It’s possible to fit a linear regression when one of the variables takes discrete values; however, the simple scatterplot produced by this kind of dataset is
often not optimal:
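A minimal sketch of this situation follows. It uses a synthetic stand-in for the tips data (assumed values, with the size and tip column names borrowed from that dataset), where the x variable only takes integer values.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (assumed values, real column names).
rng = np.random.default_rng(1)
size = rng.integers(1, 7, 150)  # party size: integers 1..6
tips = pd.DataFrame({"size": size, "tip": 1 + 0.6 * size + rng.normal(0, 1, 150)})

# The points stack up in vertical strips at each integer value of size,
# which hides how many observations sit at each level.
g = sns.lmplot(data=tips, x="size", y="tip")
```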
One option is to add some random noise (“jitter”) to the discrete values to make the distribution of those values more clear. Note that jitter is applied
only to the scatterplot data and does not influence the regression line fit itself:
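The jitter option can be sketched with the x_jitter parameter of lmplot(). The data below is again a synthetic stand-in (an assumption); the comparison at the end checks that the plotted regression line is the same with and without jitter.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (assumed values, real column names).
rng = np.random.default_rng(1)
size = rng.integers(1, 7, 150)
tips = pd.DataFrame({"size": size, "tip": 1 + 0.6 * size + rng.normal(0, 1, 150)})

# Same model twice: once with raw points, once with jittered points.
g_plain = sns.lmplot(data=tips, x="size", y="tip")
g_jitter = sns.lmplot(data=tips, x="size", y="tip", x_jitter=0.05)

# Compare the y-values of the two plotted regression lines.
line_plain = g_plain.ax.lines[0].get_ydata()
line_jitter = g_jitter.ax.lines[0].get_ydata()
same_fit = np.allclose(line_plain, line_jitter)
```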
A second option is to collapse over the observations in each discrete bin to plot an estimate of central tendency along with a confidence interval:
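A sketch of this second option, assuming the x_estimator parameter with np.mean as the estimate of central tendency and the same kind of synthetic stand-in data as above:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (assumed values, real column names).
rng = np.random.default_rng(1)
size = rng.integers(1, 7, 150)
tips = pd.DataFrame({"size": size, "tip": 1 + 0.6 * size + rng.normal(0, 1, 150)})

# One point per distinct x value (the mean tip), with an interval around it.
g = sns.lmplot(data=tips, x="size", y="tip", x_estimator=np.mean)
```

The regression line itself is still fit to the individual observations; only the scatter layer is collapsed.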
anscombe = sns.load_dataset("anscombe")
The linear relationship in the second dataset is the same, but the plot clearly shows that this is not a good model:
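As a sketch, the snippet below types in the first two datasets of Anscombe's quartet by hand so that it runs offline (sns.load_dataset("anscombe") should provide the same rows). It plots the second dataset and computes both least-squares slopes with np.polyfit for comparison.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Anscombe's quartet, datasets I and II (entered manually for this sketch).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_I = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_II = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
anscombe = pd.DataFrame(
    {"dataset": ["I"] * 11 + ["II"] * 11, "x": x + x, "y": y_I + y_II}
)

# A straight line fit to the visibly curved second dataset.
g = sns.lmplot(
    data=anscombe.query("dataset == 'II'"),
    x="x", y="y", ci=None, scatter_kws={"s": 80},
)

# The two least-squares slopes are nearly identical.
slope_I = np.polyfit(x, y_I, 1)[0]
slope_II = np.polyfit(x, y_II, 1)[0]
```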
In the presence of this kind of higher-order relationship, lmplot() and regplot() can fit a polynomial regression model to explore simple kinds of
nonlinear trends in the dataset:
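A sketch of the polynomial fit, using the order parameter with a value of 2 on the second Anscombe dataset (values entered manually so the snippet runs offline):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Anscombe's quartet, dataset II (entered manually for this sketch).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_II = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
anscombe_II = pd.DataFrame({"x": x, "y": y_II})

# order=2 fits y ~ x + x^2 instead of a straight line.
g = sns.lmplot(
    data=anscombe_II, x="x", y="y",
    order=2, ci=None, scatter_kws={"s": 80},
)
```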
A different problem is posed by “outlier” observations that deviate for some reason other than the main relationship under study:
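The third Anscombe dataset illustrates this. In the sketch below (values entered manually so it runs offline), a single point at x = 13 pulls the ordinary least-squares line away from the other ten points; the np.polyfit comparison is an addition for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Anscombe's quartet, dataset III (entered manually for this sketch).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y_III = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
anscombe_III = pd.DataFrame({"x": x, "y": y_III})

g = sns.lmplot(data=anscombe_III, x="x", y="y", ci=None, scatter_kws={"s": 80})

# Slope with every point versus slope with the single outlier removed.
keep = y_III < 10
slope_all = np.polyfit(x, y_III, 1)[0]
slope_without_outlier = np.polyfit(x[keep], y_III[keep], 1)[0]
```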
In the presence of outliers, it can be useful to fit a robust regression, which uses a different loss function to downweight relatively large residuals:
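A sketch of the robust fit follows. The robust option relies on statsmodels, which is an optional dependency, so this snippet checks for it and falls back to the ordinary fit when it is missing; that guard is an addition for portability, not something the tutorial prescribes.

```python
import importlib.util

import numpy as np
import pandas as pd
import seaborn as sns

# robust=True needs statsmodels; fall back to the ordinary fit if it is absent.
has_statsmodels = importlib.util.find_spec("statsmodels") is not None

# Anscombe's quartet, dataset III (entered manually for this sketch).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_III = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
anscombe_III = pd.DataFrame({"x": x, "y": y_III})

g = sns.lmplot(
    data=anscombe_III, x="x", y="y",
    robust=has_statsmodels, ci=None, scatter_kws={"s": 80},
)

# Slope of the plotted line, read back from its endpoints.
xs, ys = g.ax.lines[0].get_data()
fitted_slope = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
```

Here ci=None skips the bootstrap, which keeps the robust fit quick.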
When the y variable is binary, simple linear regression also “works” but provides implausible predictions:
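A sketch with a binary outcome: the data is a synthetic stand-in for the tips dataset (assumed values), and big_tip is a hypothetical indicator column marking tips above 15% of the bill. The y_jitter parameter only spreads the plotted points so they do not overlap at 0 and 1.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (assumed values, real column names).
rng = np.random.default_rng(5)
total_bill = rng.uniform(5, 50, 200)
tip = 1 + 0.1 * total_bill + rng.normal(0, 0.8, 200)
tips = pd.DataFrame({"total_bill": total_bill, "tip": tip})

# Hypothetical binary outcome: 1 if the tip exceeds 15% of the bill.
tips["big_tip"] = ((tips["tip"] / tips["total_bill"]) > 0.15).astype(int)

# A straight line through 0/1 data is not bounded to the [0, 1] range.
g = sns.lmplot(data=tips, x="total_bill", y="big_tip", y_jitter=0.03)
```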
The solution in this case is to fit a logistic regression, such that the regression line shows the estimated probability of y = 1 for a given value of x:
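A sketch of the logistic fit, under the same assumptions as before: synthetic stand-in data, a hypothetical big_tip column, and a check for the optional statsmodels dependency that the logistic option needs. The n_boot value is lowered from its default only to keep the example quick.

```python
import importlib.util

import numpy as np
import pandas as pd
import seaborn as sns

# logistic=True needs statsmodels; fall back to a linear fit if it is absent.
has_statsmodels = importlib.util.find_spec("statsmodels") is not None

# Synthetic stand-in for the tips data (assumed values, real column names).
rng = np.random.default_rng(5)
total_bill = rng.uniform(5, 50, 200)
tip = 1 + 0.1 * total_bill + rng.normal(0, 0.8, 200)
tips = pd.DataFrame({"total_bill": total_bill, "tip": tip})
tips["big_tip"] = ((tips["tip"] / tips["total_bill"]) > 0.15).astype(int)

# The fitted curve is an estimated probability, so it stays within [0, 1].
g = sns.lmplot(
    data=tips, x="total_bill", y="big_tip",
    logistic=has_statsmodels, y_jitter=0.03, n_boot=200,
)
curve = g.ax.lines[0].get_ydata()
```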
Note that the logistic regression estimate is considerably more computationally intensive (this is true of robust regression as well). As the confidence
interval around the regression line is computed using a bootstrap procedure, you may wish to turn this off for faster iteration (using ci=None ).
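The effect of ci=None can be sketched on a plain linear fit, where it behaves the same way: the bootstrap is skipped and no confidence band is drawn. The data is a synthetic stand-in (an assumption).

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (assumed values, real column names).
rng = np.random.default_rng(6)
total_bill = rng.uniform(5, 50, 100)
tips = pd.DataFrame(
    {"total_bill": total_bill, "tip": 1 + 0.1 * total_bill + rng.normal(0, 0.8, 100)}
)

# Default: bootstrap a 95% confidence band around the line.
g_with_ci = sns.lmplot(data=tips, x="total_bill", y="tip")

# ci=None: skip the bootstrap entirely; only points and line are drawn.
g_no_ci = sns.lmplot(data=tips, x="total_bill", y="tip", ci=None)
```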
An altogether different approach is to fit a nonparametric regression using a lowess smoother. This approach has the fewest assumptions, although it is
computationally intensive and so currently confidence intervals are not computed at all:
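A sketch of the lowess option. It also depends on statsmodels, so the snippet guards for it (an addition for portability); the data is a synthetic stand-in with a deliberately nonlinear trend, which is an assumption made for illustration.

```python
import importlib.util

import numpy as np
import pandas as pd
import seaborn as sns

# lowess=True needs statsmodels; fall back to a linear fit if it is absent.
has_statsmodels = importlib.util.find_spec("statsmodels") is not None

# Synthetic stand-in data with a curved trend (assumed values).
rng = np.random.default_rng(7)
total_bill = rng.uniform(5, 50, 150)
tip = 1 + 0.8 * np.sqrt(total_bill) + rng.normal(0, 0.6, 150)
tips = pd.DataFrame({"total_bill": total_bill, "tip": tip})

# The smoother follows the local trend; no confidence band is drawn.
g = sns.lmplot(
    data=tips, x="total_bill", y="tip",
    lowess=has_statsmodels, line_kws={"color": "C1"},
)
```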
The residplot() function can be a useful tool for checking whether the simple regression model is appropriate for a dataset. It fits and removes a
simple linear regression and then plots the residual values for each observation. Ideally, these values should be randomly scattered around y = 0:
If there is structure in the residuals, it suggests that simple linear regression is not appropriate:
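Both cases can be sketched side by side with the first two Anscombe datasets (entered manually so the snippet runs offline). The side-by-side layout is a choice made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Anscombe's quartet, datasets I and II (entered manually for this sketch).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_I = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_II = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
anscombe = pd.DataFrame(
    {"dataset": ["I"] * 11 + ["II"] * 11, "x": x + x, "y": y_I + y_II}
)

fig, (ax_I, ax_II) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Dataset I: residuals scatter around zero without an obvious pattern.
sns.residplot(data=anscombe.query("dataset == 'I'"), x="x", y="y",
              scatter_kws={"s": 80}, ax=ax_I)
ax_I.set_title("Dataset I")

# Dataset II: residuals trace an arch, a sign the linear model is wrong.
sns.residplot(data=anscombe.query("dataset == 'II'"), x="x", y="y",
              scatter_kws={"s": 80}, ax=ax_II)
ax_II.set_title("Dataset II")
```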
The best way to separate out a relationship by a third variable is to plot the levels of that variable on the same axes and to use color to distinguish them:
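A sketch using the hue parameter of lmplot(). The data is a synthetic stand-in for the tips dataset in which the slope is made to differ between the two smoker levels; both the values and that built-in difference are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (assumed values, real column names).
rng = np.random.default_rng(4)
n = 160
total_bill = rng.uniform(5, 50, n)
smoker = rng.choice(["Yes", "No"], n)
slope = np.where(smoker == "Yes", 0.05, 0.15)  # assumed group difference
tips = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 1 + slope * total_bill + rng.normal(0, 0.7, n),
    "smoker": smoker,
})

# One scatter and one regression line per level of smoker, on shared axes.
g = sns.lmplot(data=tips, x="total_bill", y="tip", hue="smoker")
```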
Unlike with relplot(), it’s not possible to map a distinct variable to the style properties of the scatter plot, but you can redundantly code the hue variable
with marker shape:
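A sketch of the redundant coding, passing one marker per hue level through the markers parameter. The data is the same kind of synthetic stand-in as above (an assumption), and the Set1 palette is just one possible choice.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (assumed values, real column names).
rng = np.random.default_rng(4)
n = 160
total_bill = rng.uniform(5, 50, n)
smoker = rng.choice(["Yes", "No"], n)
slope = np.where(smoker == "Yes", 0.05, 0.15)
tips = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 1 + slope * total_bill + rng.normal(0, 0.7, n),
    "smoker": smoker,
})

# markers takes one entry per hue level, so color and shape carry the
# same information.
g = sns.lmplot(
    data=tips, x="total_bill", y="tip", hue="smoker",
    markers=["o", "x"], palette="Set1",
)
```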
To add another variable, you can draw multiple “facets” with each level of the variable appearing in the rows or columns of the grid:
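Faceting can be sketched with the col and row parameters. The data is a synthetic stand-in (assumed values); the time and sex columns mirror names from the tips dataset.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (assumed values, real column names).
rng = np.random.default_rng(8)
n = 240
total_bill = rng.uniform(5, 50, n)
tips = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 1 + 0.1 * total_bill + rng.normal(0, 0.8, n),
    "smoker": rng.choice(["Yes", "No"], n),
    "time": rng.choice(["Lunch", "Dinner"], n),
    "sex": rng.choice(["Male", "Female"], n),
})

# One column of plots per level of time.
g_cols = sns.lmplot(data=tips, x="total_bill", y="tip",
                    hue="smoker", col="time")

# A full grid: rows for sex, columns for time.
g_grid = sns.lmplot(data=tips, x="total_bill", y="tip",
                    hue="smoker", col="time", row="sex", height=3)
```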
Using the pairplot() function with kind="reg" combines regplot() and PairGrid to show the linear relationship between variables in a dataset.
Take care to note how this is different from lmplot(). In the figure below, the two axes don’t show the same relationship conditioned on two levels of a
third variable; rather, PairGrid() is used to show multiple relationships between different pairings of the variables in a dataset:
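A sketch of pairplot() with kind="reg", using a synthetic stand-in for the tips data (assumed values). Each axes pairs tip with a different x variable.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (assumed values, real column names).
rng = np.random.default_rng(9)
n = 150
total_bill = rng.uniform(5, 50, n)
size = rng.integers(1, 7, n)
tips = pd.DataFrame({
    "total_bill": total_bill,
    "size": size,
    "tip": 0.5 + 0.08 * total_bill + 0.3 * size + rng.normal(0, 0.8, n),
})

# Two different relationships: tip ~ total_bill and tip ~ size.
g = sns.pairplot(
    tips, x_vars=["total_bill", "size"], y_vars=["tip"],
    height=5, aspect=0.8, kind="reg",
)
```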
Conditioning on an additional categorical variable is built into both of these functions using the hue parameter:
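A sketch of the hue parameter with pairplot(), again on a synthetic stand-in for the tips data (assumed values):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (assumed values, real column names).
rng = np.random.default_rng(9)
n = 150
total_bill = rng.uniform(5, 50, n)
size = rng.integers(1, 7, n)
tips = pd.DataFrame({
    "total_bill": total_bill,
    "size": size,
    "tip": 0.5 + 0.08 * total_bill + 0.3 * size + rng.normal(0, 0.8, n),
    "smoker": rng.choice(["Yes", "No"], n),
})

# Each axes now gets one regression line per level of smoker.
g = sns.pairplot(
    tips, x_vars=["total_bill", "size"], y_vars=["tip"],
    hue="smoker", height=5, aspect=0.8, kind="reg",
)
```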