BUG: Result of rolling mean depends on more observations than are in window #34390
Comments
Thanks for the report @cpaulik. Can you debug a little further by investigating the windows generated by the |
I looked at the following part of pandas now: https://github.com/pandas-dev/pandas/blob/master/pandas/core/window/rolling.py#L571-L577

start, end = window_indexer.get_window_bounds(
num_values=len(x),
min_periods=self.min_periods,
center=self.center,
closed=self.closed,
)
return func(x, start, end, min_periods)

full_mean

(Pdb) print(start)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
(Pdb) print(end)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40]
(Pdb) print(x)
[0.1552 0.1746 0.1932 0.234 nan 0.2423 0.1648 nan 0.2148 0.2081
0.2313 0.2011 nan 0.2076 0.2096 nan 0.1801 0.1872 0.1878 0.1949
nan 0.1608 nan nan 0.1793 nan 0.1689 0.1631 nan 0.1586
0.1531 nan 0.149 0.1434 nan 0.1526 nan 0.1293 0.1268 nan]
(Pdb) print(x[start[-1]:end[-1]])
[0.1949 nan 0.1608 nan nan 0.1793 nan 0.1689 0.1631 nan
0.1586 0.1531 nan 0.149 0.1434 nan 0.1526 nan 0.1293 0.1268
nan]
(Pdb) func(x, start, end, min_periods)[-1]
0.15664999999999998

shorter mean

(Pdb) print(start)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18]
(Pdb) print(end)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]
(Pdb) print(x)
[0.1746 0.1932 0.234 nan 0.2423 0.1648 nan 0.2148 0.2081 0.2313
0.2011 nan 0.2076 0.2096 nan 0.1801 0.1872 0.1878 0.1949 nan
0.1608 nan nan 0.1793 nan 0.1689 0.1631 nan 0.1586 0.1531
nan 0.149 0.1434 nan 0.1526 nan 0.1293 0.1268 nan]
(Pdb) print(x[start[-1]:end[-1]])
[0.1949 nan 0.1608 nan nan 0.1793 nan 0.1689 0.1631 nan
0.1586 0.1531 nan 0.149 0.1434 nan 0.1526 nan 0.1293 0.1268
nan]
(Pdb) print(func(x, start, end, min_periods)[-1])
0.15665000000000004
pandas/pandas/_libs/window/aggregations.pyx Line 327 in df5eee6
Looking at that it seems that it is a numerical issue since the current |
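The suspected numerical effect can be sketched outside pandas. The block below is a simplified illustration, not the Cython code in aggregations.pyx: `running_mean` and `direct_mean` are hypothetical helpers, and the window of 21 observations is taken from the `start`/`end` arrays printed in the pdb session. A running sum that adds the entering value and subtracts the leaving one carries rounding error from values that are no longer in the window, so its last digits can depend on where the series starts.

```python
import numpy as np

nan = np.nan
# Values copied from the pdb session above (full series, 40 observations).
x_full = np.array([
    0.1552, 0.1746, 0.1932, 0.234, nan, 0.2423, 0.1648, nan, 0.2148, 0.2081,
    0.2313, 0.2011, nan, 0.2076, 0.2096, nan, 0.1801, 0.1872, 0.1878, 0.1949,
    nan, 0.1608, nan, nan, 0.1793, nan, 0.1689, 0.1631, nan, 0.1586,
    0.1531, nan, 0.149, 0.1434, nan, 0.1526, nan, 0.1293, 0.1268, nan,
])


def running_mean(x, span=21):
    """Rolling mean via a running sum (hypothetical helper, NaNs skipped)."""
    out = np.full(len(x), nan)
    total, count = 0.0, 0
    for i, value in enumerate(x):
        if not np.isnan(value):  # value entering the window
            total += value
            count += 1
        leaving = i - span  # index that just dropped out of the window
        if leaving >= 0 and not np.isnan(x[leaving]):
            total -= x[leaving]
            count -= 1
        if count > 0:
            out[i] = total / count
    return out


def direct_mean(x, span=21):
    """Mean of the last window, recomputed from scratch."""
    return np.nanmean(x[-span:])


last_full = running_mean(x_full)[-1]       # series including the first value
last_short = running_mean(x_full[1:])[-1]  # series without the first value
reference = direct_mean(x_full)
print(last_full, last_short, reference)
```

All three numbers describe the same window, so any difference between them is rounding error only. Whether the last digits actually differ depends on the order of the floating-point operations.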
If this is a numerical precision issue, this may be related to #13254
Yes the improved algorithm in #13254 might solve this problem.
Thanks for the report. We can continue the discussion at #13254, and I'll close this issue out.
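The details of #13254 are not reproduced in this thread. As one generic illustration of a more stable approach, the sketch below uses compensated (Kahan) summation for both adding and removing values. That this matches what #13254 proposes is an assumption, and `kahan_rolling_mean` is a hypothetical name; the example ignores NaN handling and `min_periods` for brevity.

```python
import math


def kahan_rolling_mean(values, span):
    """Rolling mean over the last `span` values with compensated add/remove."""
    total = 0.0
    comp_add = 0.0     # compensation term for additions
    comp_remove = 0.0  # compensation term for removals
    out = []
    for i, value in enumerate(values):
        # Add the entering value, tracking the low-order bits that were lost.
        y = value - comp_add
        t = total + y
        comp_add = (t - total) - y
        total = t
        if i >= span:
            # Remove the leaving value with its own compensation term.
            y = -values[i - span] - comp_remove
            t = total + y
            comp_remove = (t - total) - y
            total = t
        out.append(total / min(i + 1, span))
    return out


values = [0.1552, 0.1746, 0.1932, 0.234, 0.2423, 0.1648, 0.2148, 0.2081]
span = 4
rolled = kahan_rolling_mean(values, span)
# Reference: recompute every window from scratch with an exact float sum.
direct = [
    math.fsum(values[max(0, i - span + 1): i + 1]) / min(i + 1, span)
    for i in range(len(values))
]
print(rolled[-1], direct[-1])
```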
Code Sample, a copy-pastable example
Output
Problem description
The rolling mean in the example should only take the last 20 values into account. The output for the last day, 2020-05-02, does however depend on the inclusion of a value on 2020-03-24. This is especially visible if we round the output to the 4 digit precision that the input data has.
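The rounding effect can be checked with the two values printed in the pdb session. This is a small sketch; the variable names are only illustrative.

```python
import math

full_result = 0.15664999999999998   # last rolling mean, full series
short_result = 0.15665000000000004  # last rolling mean, shorter series

# The two results agree within ordinary floating-point tolerance ...
print(math.isclose(full_result, short_result))  # True

# ... but they sit on opposite sides of 0.15665, so rounding splits them.
rounded_full = round(full_result, 4)
rounded_short = round(short_result, 4)
print(rounded_full, rounded_short)  # 0.1566 0.1567
```

Comparing with a tolerance (for example `math.isclose` or `numpy.isclose`) rather than comparing rounded values avoids this split.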
Expected Output
Both running means should be the same value.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.6.12-arch1-1
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191101
Cython : 0.29.13
pytest : 5.3.0
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.0
pyxlsb : None
s3fs : None
scipy : 1.3.2
sqlalchemy : 1.3.17
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None
None