Why is the dropna default value True in value_counts() methods/functions? #21890
Comments
In general, the use case for including
I'm not sure I follow you here. What do you mean by an "implicit way"? cc @jreback
@gfyoung thank you for your answer. By implicit I mean:
What's your use case for value_counts()? Doesn't the benefit of having information about NaN values, which is very informative about the structure of your data and can surface errors in a data-munging workflow, outweigh the cost of an additional NaN line in the output?
The default behavior, with
As stated by @jreback in a comment in the extended discussion around #9422:
I think that is a good principle, and pandas does not follow it in this case, i.e. I would prefer to see the default be `dropna=False`.

I suspect that a practical reason for the current behavior is that pandas does not handle null values in indexes very well, in my experience.

Finally, what should `value_counts()` return for an all-NaN series? (This question seems equivalent to the sum-of-all-NaN-series debate in #9422.)
The result in my environment (pandas 0.19.1, bottleneck 1.2.0) is:
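To make the question concrete, here is a sketch with an all-NaN series on a recent pandas (my illustration, not the original snippet or output; 0.19.x may behave differently):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, np.nan])

# Default dropna=True: the NaN entries are silently excluded, so the result
# is an empty Series -- the value_counts analogue of summing an all-NaN series.
print(s.value_counts())

# dropna=False: the NaN count is reported, with NaN as an index label.
print(s.value_counts(dropna=False))
```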
@nmusolino, I'm a bit puzzled, what do you really mean? You advocate for dropping NaNs as the default behavior on one hand, but your other comments seem to lean the other way. I'm not sure I fully get the arguments in favour of dropping NaNs...
👍 to this (ancient) question, and one vote from me for changing the default to `dropna=False`.
Sorry, I had a typo in my old comment, which I've just corrected. I agree with the original issue description: the default should be to include null values (`dropna=False`).
Problem description
For a beginner in pandas, it can be puzzling and misleading not to see NaN values when trying to understand a DataFrame, a Series, etc., especially if `value_counts` is used to check that previous operations were done correctly (e.g. join / merge-type operations).

I can understand that it may seem natural to drop NaNs for various operations in pandas (means, etc.), and that, as a consequence, the general default value for `dropna` arguments is `True` (is that really the reason?). Still, I feel uncomfortable with the `value_counts` default behavior, and it has caused (and still causes) me some trouble.
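To illustrate the merge-check use case above, here is a sketch (the column names and data are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c", "d"]})
right = pd.DataFrame({"key": ["a", "b"], "flag": ["x", "y"]})

merged = left.merge(right, on="key", how="left")

# Default dropna=True: the two rows that failed to match are invisible here
# ("x" and "y" each appear once), which can hide a broken join.
print(merged["flag"].value_counts())

# dropna=False: the unmatched rows show up as an explicit NaN count of 2.
print(merged["flag"].value_counts(dropna=False))
```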
The second aphorism of the Zen of Python states:

> Explicit is better than implicit.

I do feel that dropping NaN values is done in an implicit way, and that this implicit way is harmful.
I find no drawbacks to having `False` as the default value, with the exception of getting a NaN in the resulting `Series` index.

The question:
So why is the default value of the `dropna` argument of `value_counts()` `True`?

Ps: I've searched the existing issues with the filter `is:issue value_counts dropna`; with the exception of #5569, I didn't find a lot of information.
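For anyone who prefers the other default today, a tiny wrapper (a hypothetical helper, not part of pandas) is enough to flip it:

```python
import pandas as pd

def value_counts_all(s: pd.Series, **kwargs) -> pd.Series:
    """Like Series.value_counts, but includes NaN counts by default."""
    kwargs.setdefault("dropna", False)
    return s.value_counts(**kwargs)
```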