BUG: Result of rolling mean depends on more observations than are in window #34390


Closed
cpaulik opened this issue May 26, 2020 · 5 comments
Labels: Needs Info (clarification about behavior needed to assess issue)


cpaulik commented May 26, 2020

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
import datetime as dt

data = np.array([0.1552, 0.1746, 0.1932, 0.234 , np.nan, 0.2423, 0.1648,
                    np.nan, 0.2148, 0.2081, 0.2313, 0.2011, np.nan, 0.2076,
                    0.2096, np.nan, 0.1801, 0.1872, 0.1878, 0.1949, np.nan,
                    0.1608, np.nan, np.nan, 0.1793, np.nan, 0.1689, 0.1631,
                    np.nan, 0.1586, 0.1531, np.nan, 0.149 , 0.1434, np.nan,
                    0.1526, np.nan, 0.1293, 0.1268, np.nan])
dates_pandas = pd.date_range(start=dt.date(2020, 3, 24), periods=data.shape[0])
ser = pd.Series(data, index=dates_pandas)
full_mean = ser.rolling(window="20D",
                        min_periods=2,
                        center=False,
                        closed='both').mean()
shorter_mean = ser['2020-03-25':].rolling(window="20D",
                                            min_periods=2,
                                            center=False,
                                            closed='both').mean()
print(full_mean['2020-05-02'], full_mean['2020-05-02'].round(4))
print(shorter_mean['2020-05-02'], shorter_mean['2020-05-02'].round(4))
Output
0.15664999999999998 0.1566
0.15665000000000004 0.1567

Problem description

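The rolling mean for a given day should depend only on the observations inside its 20-day window (with closed='both' and daily data, that is the current day plus the 20 days before it). However, the output for the last day, 2020-05-02, changes depending on whether the series includes the value on 2020-03-24, which lies outside that window.

This is especially visible when the output is rounded to the four-digit precision of the input data.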
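One reason the rounding makes the difference so visible may be that the exact decimal mean of the last window sits on a rounding tie at the fourth digit. The sketch below is an illustration, not part of the original report. It takes the twelve non-NaN observations from 2020-04-12 to 2020-05-02 as decimal strings and averages them exactly with Python's `decimal` module. The name `window_values` is introduced here for illustration only.

```python
from decimal import Decimal

# Non-NaN observations in the window ending 2020-05-02
# (2020-04-12 through 2020-05-02), copied from the data array above.
window_values = [
    "0.1949", "0.1608", "0.1793", "0.1689", "0.1631", "0.1586",
    "0.1531", "0.1490", "0.1434", "0.1526", "0.1293", "0.1268",
]

# Exact decimal arithmetic: no binary floating-point rounding is involved.
exact_mean = sum(Decimal(v) for v in window_values) / len(window_values)
print(exact_mean)  # 0.15665
```

Because the exact mean is 0.15665, the two float results shown in the output (one just below and one just above that value) land on opposite sides of the tie, so they round to 0.1566 and 0.1567 respectively.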

Expected Output

Both running means should be the same value.
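Since floating-point results can differ in the last few bits, one hedged way to state this expectation is to compare the two results with a small absolute tolerance instead of exact equality. The sketch below reuses the data from the code sample. It restricts the comparison to 2020-04-14 onward, the first day whose window no longer reaches back to 2020-03-24. The tolerance `1e-12` and the names `opts`, `nan` and `matches` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
data = np.array([0.1552, 0.1746, 0.1932, 0.234, nan, 0.2423, 0.1648,
                 nan, 0.2148, 0.2081, 0.2313, 0.2011, nan, 0.2076,
                 0.2096, nan, 0.1801, 0.1872, 0.1878, 0.1949, nan,
                 0.1608, nan, nan, 0.1793, nan, 0.1689, 0.1631,
                 nan, 0.1586, 0.1531, nan, 0.149, 0.1434, nan,
                 0.1526, nan, 0.1293, 0.1268, nan])
ser = pd.Series(data, index=pd.date_range("2020-03-24", periods=len(data)))

opts = dict(window="20D", min_periods=2, closed="both")
full_mean = ser.rolling(**opts).mean()
shorter_mean = ser.loc["2020-03-25":].rolling(**opts).mean()

# From 2020-04-14 onward both series average exactly the same observations,
# so any remaining difference is floating-point noise.
a = full_mean.loc["2020-04-14":].to_numpy()
b = shorter_mean.loc["2020-04-14":].to_numpy()
matches = np.isclose(a, b, rtol=0.0, atol=1e-12, equal_nan=True)
print(matches.all())
```

A tolerance-based check like this treats the two results as equal while still flagging a genuine windowing error, which would produce differences many orders of magnitude larger.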

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.6.12-arch1-1
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191101
Cython : 0.29.13
pytest : 5.3.0
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.0
pyxlsb : None
s3fs : None
scipy : 1.3.2
sqlalchemy : 1.3.17
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None
None

cpaulik added the Bug and Needs Triage labels on May 26, 2020
Contributor

TomAugspurger commented May 26, 2020

Thanks for the report @cpaulik. Can you debug a little further by investigating the windows generated by the Rolling object? You might look at _get_window_indexer, though I'm not too familiar with this code. I'd like to verify that this is actually a bug first.

TomAugspurger added the Needs Info label and removed the Needs Triage and Bug labels on May 26, 2020
Author

cpaulik commented May 27, 2020

I have now looked at the following part of pandas: https://github.com/pandas-dev/pandas/blob/master/pandas/core/window/rolling.py#L571-L577

                    start, end = window_indexer.get_window_bounds(
                        num_values=len(x),
                        min_periods=self.min_periods,
                        center=self.center,
                        closed=self.closed,
                    )
                    return func(x, start, end, min_periods)

full_mean

(Pdb) print(start)
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  2  3
  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
(Pdb) print(end)
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40]
(Pdb) print(x)
[0.1552 0.1746 0.1932 0.234     nan 0.2423 0.1648    nan 0.2148 0.2081
 0.2313 0.2011    nan 0.2076 0.2096    nan 0.1801 0.1872 0.1878 0.1949
    nan 0.1608    nan    nan 0.1793    nan 0.1689 0.1631    nan 0.1586
 0.1531    nan 0.149  0.1434    nan 0.1526    nan 0.1293 0.1268    nan]
(Pdb) print(x[start[-1]:end[-1]])
[0.1949    nan 0.1608    nan    nan 0.1793    nan 0.1689 0.1631    nan
 0.1586 0.1531    nan 0.149  0.1434    nan 0.1526    nan 0.1293 0.1268
    nan]
(Pdb) func(x, start, end, min_periods)[-1]
0.15664999999999998

shorter_mean

(Pdb) print(start)
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  2  3
  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
(Pdb) print(end)
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]
(Pdb) print(x)
[0.1746 0.1932 0.234     nan 0.2423 0.1648    nan 0.2148 0.2081 0.2313
 0.2011    nan 0.2076 0.2096    nan 0.1801 0.1872 0.1878 0.1949    nan
 0.1608    nan    nan 0.1793    nan 0.1689 0.1631    nan 0.1586 0.1531
    nan 0.149  0.1434    nan 0.1526    nan 0.1293 0.1268    nan]
(Pdb) print(x[start[-1]:end[-1]])
[0.1949    nan 0.1608    nan    nan 0.1793    nan 0.1689 0.1631    nan
 0.1586 0.1531    nan 0.149  0.1434    nan 0.1526    nan 0.1293 0.1268
    nan]
(Pdb) print(func(x, start, end, min_periods)[-1])
0.15665000000000004

func is <built-in function roll_mean_variable>

def roll_mean_variable(ndarray[float64_t] values, ndarray[int64_t] start,

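Looking at that function, this appears to be a numerical precision issue: the running sum from the previous window is reused and updated, instead of being computed from scratch for each window. The window bounds and the values inside the last window are identical in both cases, so the difference can only come from rounding error carried over from earlier windows.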
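The effect can be reproduced outside pandas with a minimal sketch. This is not pandas' actual `roll_mean_variable` code. The two helper functions are hypothetical, and the input values are deliberately extreme so that the carried-over error is unmistakable. The incremental version adds the value entering the window and subtracts the value leaving it. The fresh version sums each window from scratch.

```python
def rolling_sum_incremental(values, window):
    """Sliding-window sum that reuses the previous window's sum."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v                       # value entering the window
        if i >= window:
            running -= values[i - window]  # value leaving the window
        out.append(running)
    return out


def rolling_sum_fresh(values, window):
    """Sliding-window sum recomputed from scratch for every window."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        out.append(sum(values[lo:i + 1]))
    return out


values = [1e16, 1.0, 1.0]

# The last window is [1.0, 1.0] in all three cases below.
full = rolling_sum_incremental(values, window=2)         # history includes 1e16
shorter = rolling_sum_incremental(values[1:], window=2)  # history starts later
fresh = rolling_sum_fresh(values, window=2)

print(full[-1], shorter[-1], fresh[-1])  # 0.0 2.0 2.0
```

With the full history, each `1.0` is absorbed when it is added to `1e16` (adjacent doubles near `1e16` are 2 apart), so subtracting `1e16` afterwards leaves `0.0`. The issue's data produces the same kind of history-dependent error, only in the last bits of the result.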

@mroeschke
Member

If this is a numerical precision issue, this may be related to #13254

Author

cpaulik commented May 28, 2020

Yes, the improved algorithm in #13254 might solve this problem.

@mroeschke
Member

Thanks for the report. We can continue the discussion at #13254, and I'll close this issue out.
