MAINT, ENH: Refactor percentile and quantile methods #19857
Conversation
Just as an initial comment: this PR contains a lot of stylistic changes to the existing code base that are unrelated to the refactoring itself.
Nice! Sounds like there are a couple of nuts to crack still. Some notes:
Some API things we probably need to hash out, but I think these can be discussed later:
EDIT: As Bas noted, it would be nice if you could avoid running black on the code (at least for unmodified lines).
Yes, you are right, I'm used to running an auto-linter all the time, but I understand that it makes the review difficult.
I did quite a big refactoring; the implementation is now much closer to what the former `_quantile` was and is hopefully much easier to understand.
@seberg What is the expected behavior of percentile/quantile for boolean input? The unit test was previously handled in a strange manner. So if I understand what the expected behavior for booleans is, I could try to fix the unit test and #19154 in one commit.
@bzah Seems that …
Yes, in the former implementation of `_quantile` it only worked with booleans if the indices were integers.
There was a change; I think it used to work, but probably we can get away with doing what we think is right. My feeling is that the correct thing is to return the input dtype unmodified for any of the non-interpolating methods. For the interpolating methods, you will get whatever dtype the interpolation arithmetic produces.
Well I don't get it.
If `arr` is cast to integers, this would give a result close to 0.5. I guess if someone had a real-world use case for percentile on booleans, it would help me understand it.
No, to be honest, it doesn't. But NumPy (similar to Python) usually pretends that booleans are like integers and thus "numerical". Since it is already an error right now, I don't mind if you keep it there until a clear decision is made though. So the argument is that booleans should behave like the 0/1 integers they represent.
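To make the "booleans as 0/1 integers" reading concrete, here is a small toy illustration (my own example, not from the PR; `interpolation=` is the spelling of the option at the time of this PR, newer NumPy calls it `method=`):

```python
import numpy as np

arr = np.array([False, False, True, True])

# If booleans are treated like 0/1 integers, the default linear method
# interpolates between the two middle samples (0 and 1):
print(np.quantile(arr.astype(np.uint8), 0.5))                          # 0.5

# A non-interpolating method just picks an existing sample instead:
print(np.quantile(arr.astype(np.uint8), 0.5, interpolation="lower"))   # 0
```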
The work on this branch has removed the handling of an unnecessary specific case (when indices are integers), which was covered by this test case. However, percentile/quantile has been broken for a while when the input array is made of booleans, and it's not clear what should be done to fix this. The unit test case now behaves like any other boolean array and raises a TypeError.
See
- numpy#19857 (comment)
- numpy#19154
I'm removing that handling. Moreover, by "handling" I mean it will force the result to be the left bound instead of interpolating to something very close to the left bound. This also means there should not be any other issue with the license, as far as I understand it.
On the CI, I get a Cygwin error:
Is this usual? Can I do something to fix it?
You can ignore the Cygwin error; it is on the CI side.
Could you add a release note? It would go in …
- Added the missing linear interpolation methods.
- Updated the existing unit tests.
- Added pytest.mark.xfail for boolean arrays.

See
- numpy#19857 (comment)
- numpy#19154
Hopefully fix the docstrings of percentile, nanpercentile, quantile, and nanquantile so that CircleCI passes.
Also removed unused imports
Also added a unit test for it.
On some platforms float128 and complex256 do not exist. Using (c)longdouble aliases should work on all platforms.
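A quick way to check that portability point in an interpreter (my own snippet, not part of the commit):

```python
import numpy as np

# longdouble/clongdouble exist on every platform and map to whatever extended
# precision the platform provides; float128/complex256 are only defined where
# that type happens to take 128 bits.
print(np.longdouble, np.clongdouble)
print(hasattr(np, "float128"), hasattr(np, "complex256"))
```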
Thanks @bzah. Let's put this in. If there are any remaining concerns, please open an issue or mention them here so we can follow up before the next release, hopefully. I am planning to follow up with a rename of …
If anyone wants to take a closer look at the documentation to see whether it can be improved to provide more guidelines, that would be great. (I think, for example, the default method could be considered a sample quantile/percentile, while most others are various population estimates; if that is correct, it might be helpful to explain it earlier/more clearly.)
@seberg are the quantile changes something we need to adapt to on our (pandas) end or to address here? |
@jbrockmendel I am honestly not quite sure yet. The formula used for calculating the "index" seems not ideal (potentially a few ULPs less precise than it could be for the default method). I am not sure if we can reorder the formula in general, or whether that would lead to worse characteristics for some methods. @bzah suggested using a different (simplified/old) formula for the current default; I suppose we could go through all methods to see if we can clearly simplify a few more. If the changes in pandas seem too strange to be true, I am happy to take that as enough reason that we should specialize at least the default.
One thing I am curious about: if this is the reason for the change, I wonder if the code that tests this will misbehave for all methods except the ones NumPy previously implemented! For all other methods, the repeating samples would change the result (my working hypothesis is that this happens, but I am not familiar with the pandas tests).
The few test failures I've looked at have been small floating point errors, e.g. https://github.com/pandas-dev/pandas/runs/4131499766?check_suite_focus=true#step:7:104
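To make the kind of last-bit difference concrete, here is a toy reordering of an algebraically equivalent index formula (my own sketch, not the exact expressions used in NumPy or pandas):

```python
q, n = 0.1, 3

direct = q * (n - 1)     # 0.2
reordered = q * n - q    # 0.20000000000000004 -- mathematically equal, one ULP away

print(direct == reordered)   # False: the last bit differs after reordering
```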
@bzah I just realized that we already need the … Is …? Do you want to make a quick PR?
Yup, I'm on it. |
The test was fixed by numpygh-19857
This is quite a big pull request, especially for a newcomer like me.
I hope it respects the standards of NumPy.
This PR should solve #10736.
It is a rewrite of the work done on Ouranosinc/xclim#802, with some other cases handled.
API changes
The interpolation option for quantile/percentile has been expanded.
Interpolation can be passed as a string or as a value of the QuantileInterpolation enum.
This gives more flexibility to add new interpolation methods and is clearer than using magic strings.
Note: In issue #10736 it was asked to remove some existing methods, but I don't think that should be part of the same PR.
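For illustration, this is roughly what the expanded option looks like from the caller's side (the enum spelling below is only illustrative of the proposal and not a public NumPy API; the keyword was `interpolation=` at the time of this PR, newer NumPy calls it `method=`):

```python
import numpy as np

a = np.arange(100, dtype=float)

# passing the interpolation method as a plain string
print(np.percentile(a, 50, interpolation="linear"))   # 49.5

# the PR proposes also accepting an enum member, along the lines of:
#   np.percentile(a, 50, interpolation=QuantileInterpolation.LINEAR)
# (QuantileInterpolation and its member names here are illustrative only)
```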
Algorithm changes
Indices computation
The logic to compute the indices (a.k.a. virtual_indices) has changed a bit. Instead of doing `q * (Nx - 1)`, it is now `(q * Nx) - 1`, which seems more reasonable, see the example below. However, to not break the existing behavior of nearest, midpoint, lower and higher, I kept the old formula for those, even if it is probably giving wrong results.
Example
Let's say you want the 50th percentile of an ordered list of size 100.
If the array index starts at zero, doing `0.5 * (100 - 1)` would give `49.5`, which is wrong because the 50th percentile is exactly at index 49. But doing `0.5 * 100 - 1` gives the expected `49`.
This has no consequence on the result of the default interpolation method (method 7), because it is taken into account with the formula using alpha == beta == 1.
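A small sketch of that virtual-index computation, written with the Hyndman & Fan style alpha/beta parametrization the description alludes to (function and variable names are mine, not the actual NumPy internals):

```python
def virtual_index(q, n, alpha, beta):
    """0-based fractional position of the q-quantile in a sorted array of size n."""
    return q * (n - alpha - beta + 1) + alpha - 1

n, q = 100, 0.5
print(virtual_index(q, n, alpha=1, beta=1))   # 49.5 -> q * (n - 1), the default linear method
print(virtual_index(q, n, alpha=0, beta=1))   # 49.0 -> q * n - 1, the form described above
```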
Rewrite of existing interpolation methods
The existing nearest, midpoint, lower and higher methods have been transformed to fit the QuantileInterpolation model.
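One way to picture that model (a toy sketch of the idea only, not the PR's actual QuantileInterpolation code): each method bundles how to turn q into a virtual index and how to adjust the interpolation weight gamma.

```python
import numpy as np

METHODS = {
    "linear": dict(
        get_virtual_index=lambda n, q: q * (n - 1),
        fix_gamma=lambda gamma: gamma,      # keep the fractional part
    ),
    "lower": dict(
        get_virtual_index=lambda n, q: np.floor(q * (n - 1)),
        fix_gamma=lambda gamma: 0.0,        # never interpolate upward
    ),
}

def toy_quantile(a, q, method="linear"):
    a = np.sort(np.asarray(a, dtype=float))
    spec = METHODS[method]
    virtual_index = spec["get_virtual_index"](a.size, q)
    below = int(np.floor(virtual_index))
    above = min(below + 1, a.size - 1)
    gamma = spec["fix_gamma"](virtual_index - below)
    return a[below] * (1 - gamma) + a[above] * gamma

print(toy_quantile([1, 2, 3, 4], 0.5))                   # 2.5
print(toy_quantile([1, 2, 3, 4], 0.5, method="lower"))   # 2.0
```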
Performance improvements
A performance improvement could be done for nanpercentile, but it's no longer part of this PR.
The performance of nanpercentile/nanquantile has been significantly improved, especially when the wanted `quantiles` is a scalar or a short list of quantiles. In most cases it should perform around 50 times faster than before.
I will add a link to the performance report here once I re-run the perf tests; in the meantime you can find the report for the Xclim version here: https://gist.github.com/bzah/2a84d050b8a1aed1b40a2ed1526e1f12. It might not be accurate though.
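For anyone wanting to sanity-check the speed-up locally, a rough timing harness along these lines should do (my own snippet; the numbers depend heavily on the machine and NumPy version):

```python
import timeit
import numpy as np

a = np.random.default_rng(0).random((1000, 1000))
a[a < 0.1] = np.nan   # sprinkle NaNs so the nan-aware path is exercised

n_runs = 20
t = timeit.timeit(lambda: np.nanpercentile(a, 50, axis=0), number=n_runs)
print(f"np.nanpercentile, scalar q: {t / n_runs * 1e3:.1f} ms per call")
```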