Why is the default value of dropna True in value_counts() methods/functions? #21890


Open
adrienpacifico opened this issue Jul 13, 2018 · 6 comments
Labels

  • Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff)
  • API - Consistency (internal consistency of API/behavior)
  • Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
  • Needs Discussion (requires discussion from core team before further action)

Comments

@adrienpacifico
Contributor

adrienpacifico commented Jul 13, 2018

Problem description

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3, np.nan, 5])
>>> s.value_counts()
5.0    1
3.0    1
2.0    1
1.0    1
dtype: int64

>>> s.value_counts(dropna=False)
5.0    1
3.0    1
2.0    1
1.0    1
NaN    1
dtype: int64

For a beginner in pandas, it can be puzzling and misleading not to see NaN values when trying to understand a DataFrame, a Series, etc., especially if value_counts is used to check that previous operations (e.g. join / merge-type operations) were carried out correctly.

I can understand that it may seem natural to drop NaNs for various operations in pandas (means, etc.), and that as a consequence the general default value for dropna arguments is True (is that really the reason?).

I feel uncomfortable with the value_counts default behavior; it has caused (and still causes) me trouble.

The second aphorism of the Zen of Python states:

Explicit is better than implicit

I feel that dropping NaN values is done in an implicit way, and that this implicitness is harmful.
I see no drawbacks to having False as the default value, apart from ending up with a NaN in the Series index.

The question:

So why is the default value of the dropna argument of value_counts() True?

P.S.: I've searched the issues with the filter is:issue value_counts dropna; apart from #5569, I didn't find much information.

@gfyoung added the Missing-data and Usage Question labels Jul 13, 2018
@gfyoung
Member

gfyoung commented Jul 13, 2018

In general, the use case for including NaN in an aggregation like this isn't particularly substantial IMO. That's why I find dropna=True to be helpful. Others may disagree: this is just my opinion.

I do feel that dropping NaN value is done in a implicit way, that this implicit way is harmful.

I'm not sure I follow you here. What do you mean by an "implicit way"?

cc @jreback

@adrienpacifico
Contributor Author

@gfyoung thank you for your answer.

By implicit I mean:

  • I consider that a NaN value in an array is a value.
  • value_counts() asks for a count of the number of occurrences of each value.
  • Thus, since NaNs are values, they should be included in the value_counts() output. The function is implicitly (without clearly stating it) dropping NaNs.
  • Moreover, it is opposed to my expectations in terms of behaviour.

What's your use case of value_counts()? Information about NaN values is very informative about the structure of your data and can surface errors in a data-munging workflow; doesn't that benefit outweigh the cost of an additional NaN line in the output?
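For example, a minimal sketch of the merge scenario I mean (the frames left/right here are made up, and the exact row ordering of the output may vary by pandas version):

import numpy as np
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"]})
right = pd.DataFrame({"key": ["a", "b"], "val": [10, 20]})
merged = left.merge(right, on="key", how="left")  # "c" has no match, so val is NaN

merged["val"].value_counts()              # 10.0 -> 1, 20.0 -> 1: the failed match is invisible
merged["val"].value_counts(dropna=False)  # 10.0 -> 1, 20.0 -> 1, NaN -> 1: the unmatched key shows up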

@nmusolino
Copy link
Contributor

nmusolino commented Jul 13, 2018

The default behavior, with dropna=True, loses information.

In [10]: pandas.Series([1., 2.]).value_counts()
Out[10]:
2.0    1
1.0    1
dtype: int64

In [11]: pandas.Series([1, 2] + [numpy.nan] * 98).value_counts()
Out[11]:
2.0    1
1.0    1
dtype: int64

As stated by @jreback in a comment in the extended discussion around #9422:

But in pandas [ignoring NaN values] is completely misleading and lossy, because NaNs by definition propagate (unless you specifically don't propagate them).

I think that is a good principle, and pandas does not follow it in this case, i.e. in value_counts(). In fact, I think pandas doesn't follow it for aggregation functions in general, which typically skip/drop null values.
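To illustrate that principle (a small sketch, not from the original discussion):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])

np.nan + 1.0         # nan -- scalar NaN propagates through arithmetic
s.sum()              # 1.0 -- but pandas aggregations skip NaN by default
s.sum(skipna=False)  # nan -- propagation has to be requested explicitly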

I would prefer to see the default be dropna=False. Certainly when working interactively, or using the value_counts() result directly for reporting (e.g. printing in a report), it's very informative to have a null count.

I suspect that a practical reason for the current behavior is that pandas does not handle null values in indexes very well, in my experience.
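For illustration, a hedged sketch of the awkwardness (label-based lookup of a NaN index entry has behaved differently across versions, so a boolean mask on the index is the safer way to select the NaN row):

import numpy as np
import pandas as pd

vc = pd.Series([1.0, 2.0, np.nan, np.nan]).value_counts(dropna=False)
# counts: NaN -> 2, 1.0 -> 1, 2.0 -> 1 (row ordering may vary by version)

vc[vc.index.isna()]  # select the NaN row via a mask rather than by label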

Finally, what should the following return? (This question seems equivalent to the sum-of-all-NaN-series debate in #9422.)

In [5]: pandas.Series([numpy.nan] * 100).value_counts()

The result in my environment (pandas 0.19.1, bottleneck 1.2.0) is:

Out[5]: Series([], dtype: int64)
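(For comparison, a sketch of what the two options look like on the same all-NaN series on a recent pandas version, plus an explicit count that does not depend on dropna at all:)

import numpy as np
import pandas as pd

s = pd.Series([np.nan] * 100)

s.value_counts()              # empty Series: the 100 NaNs vanish silently
s.value_counts(dropna=False)  # NaN -> 100 on recent pandas versions
s.isna().sum()                # 100 -- an explicit NaN count, independent of dropna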

@adrienpacifico
Contributor Author

@nmusolino, I'm a bit puzzled; what do you really mean?

On one hand you advocate dropping NaNs as the default behavior, but your other comments seem to lean toward a dropna=False default.

I think that pandas.Series([numpy.nan] * 100).value_counts() should definitely provide the information about NaNs.

I'm not sure I fully understand the arguments in favour of dropping NaNs...

@jbrockmendel added the API - Consistency label Sep 21, 2020
@mroeschke added the Needs Discussion label and removed the Usage Question label Jun 20, 2021
@mqk

mqk commented Jul 6, 2021

👍 to this (ancient) question, and one vote from me for changing the default to dropna=False. I add that keyword pretty much every single time I call value_counts...
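If it helps, a minimal sketch of a hypothetical wrapper (the name vc is made up) that bakes in dropna=False so it doesn't have to be retyped:

import pandas as pd

def vc(obj: pd.Series, **kwargs) -> pd.Series:
    """value_counts with dropna=False unless the caller overrides it."""
    kwargs.setdefault("dropna", False)
    return obj.value_counts(**kwargs)

vc(pd.Series([1, None, None]))  # NaN -> 2, 1.0 -> 1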

@nmusolino
Contributor

@nmusolino, I'm a bit puzzled; what do you really mean?

[...]

I'm not sure I fully understand the arguments in favour of dropping NaNs...

Sorry, I had a typo in my old comment, which I've just corrected. I agree with the original issue description: the default should be to include null values (dropna=False).

@simonjayhawkins added the Algos label Jun 11, 2022