Why is the default value of dropna True in value_counts() methods/functions? #21890


Open
adrienpacifico opened this issue Jul 13, 2018 · 6 comments
Labels

  • Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff)
  • API - Consistency (internal consistency of API/behavior)
  • Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
  • Needs Discussion (requires discussion from core team before further action)

Comments

@adrienpacifico
Contributor

adrienpacifico commented Jul 13, 2018

Problem description

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3, np.nan, 5])
>>> s.value_counts()
5.0    1
3.0    1
2.0    1
1.0    1
dtype: int64

>>> s.value_counts(dropna=False)
5.0    1
3.0    1
2.0    1
1.0    1
NaN    1
dtype: int64

For a beginner in pandas, it can be puzzling and misleading not to see NaN values when trying to understand a DataFrame, a Series, etc., especially if value_counts is used to check that previous operations (e.g. join / merge-type operations) were carried out correctly.

I can understand that it may seem natural to drop NaNs for various operations in pandas (means, etc.), and that as a consequence the general default value for dropna arguments is True (is that really the reason?).

I feel uncomfortable with the value_counts default behavior; it has caused (and still causes) me trouble.

The second aphorism of the Zen of Python states:

Explicit is better than implicit

I feel that dropping NaN values is done in an implicit way, and that this implicitness is harmful.
I see no drawbacks to having False as the default value, apart from ending up with a NaN in the Series index.

The question:

So why is the default value of the dropna argument of value_counts() True?

P.S.: I've searched the issues with the filter is:issue value_counts dropna; apart from #5569, I didn't find much information.

@gfyoung added the Missing-data and Usage Question labels Jul 13, 2018
@gfyoung
Member

gfyoung commented Jul 13, 2018

In general, the use case for including NaN in an aggregation like this isn't particularly substantial IMO. That's why I find dropna=True to be helpful. Others may disagree: this is just my opinion.

I do feel that dropping NaN value is done in a implicit way, that this implicit way is harmful.

I'm not sure I follow you here. What do you mean by an "implicit way"?

cc @jreback

@adrienpacifico
Contributor Author

@gfyoung thank you for your answer.

By implicit I mean:

  • I consider that a NaN value in an array is a value.
  • value_counts() asks for a count of the number of occurrences of each value.
  • Thus, since NaNs are values, they should be included in the value_counts() output. The function is implicitly (without clearly stating it) dropping NaNs.
  • Moreover, it is opposed to my expectations in terms of behaviour.

What's your use case of value_counts()? Information about NaN values is very informative about the structure of your data and can surface errors in a data-munging workflow; doesn't that benefit outweigh the cost of an additional NaN line in the output?
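For example, a minimal sketch of the merge scenario I mean (the frames left/right here are made up, and the exact row ordering of the output may vary by pandas version):

import numpy as np
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"]})
right = pd.DataFrame({"key": ["a", "b"], "val": [10, 20]})
merged = left.merge(right, on="key", how="left")  # "c" has no match, so val is NaN

merged["val"].value_counts()              # 10.0 -> 1, 20.0 -> 1: the failed match is invisible
merged["val"].value_counts(dropna=False)  # 10.0 -> 1, 20.0 -> 1, NaN -> 1: the unmatched key shows up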

@nmusolino
Copy link
Contributor

nmusolino commented Jul 13, 2018

The default behavior, with dropna=True, loses information.

In [10]: pandas.Series([1., 2.]).value_counts()
Out[10]:
2.0    1
1.0    1
dtype: int64

In [11]: pandas.Series([1, 2] + [numpy.nan] * 98).value_counts()
Out[11]:
2.0    1
1.0    1
dtype: int64

As stated by @jreback in a comment in the extended discussion around #9422:

But in pandas [ignoring NaN values] is completely misleading and lossy, because NaNs by definition propagate (unless you specifically don't propagate them).

I think that is a good principle, and pandas does not follow it in this case, i.e. in value_counts(). In fact, I think pandas doesn't follow it for aggregation functions in general, which typically skip/drop null values.
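To illustrate that principle (a small sketch, not from the original discussion):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])

np.nan + 1.0         # nan -- scalar NaN propagates through arithmetic
s.sum()              # 1.0 -- but pandas aggregations skip NaN by default
s.sum(skipna=False)  # nan -- propagation has to be requested explicitly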

I would prefer to see the default be dropna=False. Certainly when working interactively, or using the value_counts() result directly for reporting (e.g. printing in a report), it's very informative to have a null count.

I suspect that a practical reason for the current behavior is that pandas does not handle null values in indexes very well, in my experience.
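For illustration, a hedged sketch of the awkwardness (label-based lookup of a NaN index entry has behaved differently across versions, so a boolean mask on the index is the safer way to select the NaN row):

import numpy as np
import pandas as pd

vc = pd.Series([1.0, 2.0, np.nan, np.nan]).value_counts(dropna=False)
# counts: NaN -> 2, 1.0 -> 1, 2.0 -> 1 (row ordering may vary by version)

vc[vc.index.isna()]  # select the NaN row via a mask rather than by label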

Finally, what should the following return? (This question seems equivalent to the sum-of-all-NaN-series debate in #9422.)

In [5]: pandas.Series([numpy.nan] * 100).value_counts()

The result in my environment (pandas 0.19.1, bottleneck 1.2.0) is:

Out[5]: Series([], dtype: int64)
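(For comparison, a sketch of what the two options look like on the same all-NaN series on a recent pandas version, plus an explicit count that does not depend on dropna at all:)

import numpy as np
import pandas as pd

s = pd.Series([np.nan] * 100)

s.value_counts()              # empty Series: the 100 NaNs vanish silently
s.value_counts(dropna=False)  # NaN -> 100 on recent pandas versions
s.isna().sum()                # 100 -- an explicit NaN count, independent of dropna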

@adrienpacifico
Contributor Author

@nmusolino, I'm a bit puzzled; what do you really mean?

On one hand you advocate dropping NaNs as the default behavior, but your other comments seem to lean toward a dropna=False default.

I think that pandas.Series([numpy.nan] * 100).value_counts() should definitely provide the information about NaNs.

I'm not sure I fully understand the arguments in favour of dropping NaNs...

@jbrockmendel added the API - Consistency label Sep 21, 2020
@mroeschke added the Needs Discussion label and removed the Usage Question label Jun 20, 2021
@mqk

mqk commented Jul 6, 2021

👍 to this (ancient) question, and one vote from me for changing the default to dropna=False. I add that keyword pretty much every single time I call value_counts...
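If it helps, a minimal sketch of a hypothetical wrapper (the name vc is made up) that bakes in dropna=False so it doesn't have to be retyped:

import pandas as pd

def vc(obj: pd.Series, **kwargs) -> pd.Series:
    """value_counts with dropna=False unless the caller overrides it."""
    kwargs.setdefault("dropna", False)
    return obj.value_counts(**kwargs)

vc(pd.Series([1, None, None]))  # NaN -> 2, 1.0 -> 1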

@nmusolino
Contributor

@nmusolino, I'm a bit puzzled; what do you really mean?

[...]

I'm not sure I fully understand the arguments in favour of dropping NaNs...

Sorry, I had a typo in my old comment, which I've just corrected. I agree with the original issue description: the default should be to include null values (dropna=False).

@simonjayhawkins added the Algos label Jun 11, 2022