
BUG GH11600 - MultiIndex column level names lost when to_sparse() called #11606


Closed
wants to merge 2 commits into from


Ezekiel-Kruglick

closes #11600

Fixed a problem where MultiIndex column level names did not propagate into sparse frames, or back out to dense on a round trip through sparse. Includes 4 new tests covering some relevant scenarios. The problem is fixed for the conventional to_sparse path. I'm not sure about other paths where something else is passed to SparseDataSeries directly; those would be outside the scope of this bug.


    def test_to_sparse_preserve_multiindex_names_on_columns(self):
        sparse_multiindex_frame = self.dense_multiindex_frame.to_sparse()
        self.assertTrue(self.dense_multiindex_frame.columns.equals(sparse_multiindex_frame.columns))
Contributor


this should all be 2 tests; just use assert_sp_frame_equal (or the dense version) to compare against an expected frame

Author


Okay, I can change that.

Author


Digging into assert_sp_frame_equal, it appears that if I feed it a dense reference frame and a sparse frame, it converts the sparse frame to dense and compares them in the dense state. That would make an assert_sp_frame_equal of the sparse and dense frames equivalent to the round-trip test (i.e. it only tests the sparse frame after passing it through to_dense).

I am trying to test the column names against the reference column names while the frame is still sparse, to make sure we don't wind up just carrying the column names through the conversion and putting them back on when going back through to_dense. So I actually want a test of the sparse column names without making the frame dense again. I agree the round-trip test goes much better with assert_frame_equal, but it looks like a test that the column names are valid while sparse can't use assert_sp_frame_equal. Do I have that right?
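
The concern above can be illustrated with a toy model. This is only a sketch: the `Dense` and `Sparse` classes below are hypothetical stand-ins, not pandas classes, and the "bug" is contrived to show how a round-trip comparison alone could pass while the sparse object itself has lost its names.

```python
# Toy model (hypothetical classes, NOT pandas): a buggy conversion that drops
# the column level names on the sparse object but restores them on the way back.

class Dense:
    def __init__(self, column_names):
        self.column_names = column_names

    def to_sparse(self):
        sparse = Sparse(column_names=None)          # bug: names are lost here
        sparse._stashed_names = self.column_names   # ...but carried on the side
        return sparse


class Sparse:
    def __init__(self, column_names):
        self.column_names = column_names
        self._stashed_names = None

    def to_dense(self):
        # The stashed names are put back, which hides the bug from a round trip.
        return Dense(self._stashed_names)


dense = Dense(column_names=("level_0", "level_1"))
sparse = dense.to_sparse()

# A round-trip comparison passes even though the sparse object is wrong.
round_trip_ok = sparse.to_dense().column_names == dense.column_names
# Checking the sparse object directly exposes the lost names.
direct_ok = sparse.column_names == dense.column_names

print(round_trip_ok, direct_ok)  # True False
```

In this toy setup only the direct check catches the problem, which is the reason for wanting an assertion on the sparse frame's columns in addition to the round-trip test.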

Author


To be more concise: I think two of the tests can collapse into a single assert_sp_frame_equal, but I think the other two (which could be made one test with two asserts) still need to test the column names directly, perhaps using assert_index_equal.

Contributor


you prob want 2 tests here: one is a full round trip (dense->sparse->dense) to check for preservation, the other takes a constructed sparse frame and checks for correct attribute preservation on construction (with sparse again)
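
A rough sketch of the two checks described above, written against current pandas rather than the 0.17 API under review. Assumptions: `DataFrame.to_sparse()` was removed in pandas 1.0, so `astype(pd.SparseDtype(...))` and the `.sparse.to_dense()` accessor stand in for the conversion, and `pd.testing.assert_index_equal` / `pd.testing.assert_frame_equal` stand in for the test-suite asserts discussed in this thread. The frame contents and level names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Dense reference frame whose MultiIndex column levels carry names
# (illustrative data, not the fixture from this pull request).
columns = pd.MultiIndex.from_product(
    [["a", "b"], ["x", "y"]], names=["outer", "inner"]
)
dense = pd.DataFrame(
    np.arange(12, dtype="float64").reshape(3, 4), columns=columns
)

# Stand-in for dense.to_sparse() (assumption: pandas >= 1.0).
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# Check 1: the level names are present while the frame is still sparse.
# assert_index_equal compares the index names by default.
pd.testing.assert_index_equal(sparse.columns, dense.columns)

# Check 2: full round trip dense -> sparse -> dense preserves everything.
round_tripped = sparse.sparse.to_dense()
pd.testing.assert_frame_equal(round_tripped, dense)
```

Keeping the two checks separate means a failure points at either the sparse construction or the conversion back to dense.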

@jreback added the Bug, Sparse (Sparse Data Type), and MultiIndex labels Nov 15, 2015
@Ezekiel-Kruglick
Author

Hi Jeff, hopefully I understood what you were looking for in the tests. There is now one round-trip test and one check after sparse construction, both using the pandas-specific asserts. If this was not what you meant, just let me know how it could be better. Cheers, Zeke

        self.dense_multiindex_frame = dense_multiindex_frame.fillna(value=3.14)

    def test_to_sparse_preserve_multiindex_names_columns(self):
        sparse_multiindex_frame = self.dense_multiindex_frame.to_sparse().copy()
Contributor


you don't need to copy

@jreback
Contributor

jreback commented Nov 16, 2015

couple of points; pls add a whatsnew note (bug fixes) for 0.17.1.

ping when green.

@jreback
Contributor

jreback commented Nov 16, 2015

pls squash as well

@Ezekiel-Kruglick
Author

Code all fixed up but squashing still left me two commits for some reason. I'll look into the Git stuff but it might take a couple days as I have some other stuff to do.

jreback and others added 2 commits November 18, 2015 16:36
BUG GH11600  - MultiIndex column level names were getting lost in sparse conversion

Updated testing and whatsnew to follow project preferences
@Ezekiel-Kruglick
Author

Squashed all my stuff down to one commit and updated the pull request. It looks like a commit of yours that adjusts a warning is mixed in there, but GitHub says it's okay. Let me know if anything can be better; I think I hit everything you asked for.

@jreback added this to the 0.17.1 milestone Nov 19, 2015
@jreback
Contributor

jreback commented Nov 19, 2015

merged via 207e0ce

thanks!

@Ezekiel-Kruglick
Author

Great! I updated the Stack Overflow question with a link to this issue, just in case somebody (possibly someone who hasn't updated) ever runs into it. That seems unlikely for such an edge case, but better that than somebody finding the SO question and not knowing it's fixed if they update!

@jreback closed this Nov 19, 2015

Successfully merging this pull request may close these issues.

BUG - sparse dataframes lose multi-index column names
2 participants