Speed up max_len_string_array #10024

Merged
merged 8 commits into from
Apr 30, 2015
cpcloud
Member

@cpcloud cpcloud commented Apr 29, 2015

Before:

In [5]: x = np.array(['abcd', 'abcde', 'abcdef', 'abcdefg'] * int(1e7), object)

In [2]: f = pd.lib.max_len_string_array

In [6]: %timeit f(x)
1 loops, best of 3: 501 ms per loop

After:

In [1]: x = np.array(['abcd', 'abcde', 'abcdef', 'abcdefg'] * int(1e7), object)

In [2]: f = pd.lib.max_len_string_array

In [3]: %timeit f(x)
10 loops, best of 3: 68.4 ms per loop
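For readers unfamiliar with the function being timed, here is a rough pure-Python sketch of what it appears to compute: the length of the longest string in a one-dimensional object array. The name `max_len_string_array_py` is hypothetical, and skipping non-string elements is an assumption rather than confirmed behaviour of the Cython routine. The array is much smaller than the one in the benchmark above so that it runs quickly.

```python
import timeit

import numpy as np


def max_len_string_array_py(arr):
    """Hypothetical pure-Python reference, not the pandas implementation.

    Returns the length of the longest str/bytes element in a 1-D object
    array. Non-string elements are skipped (an assumption).
    """
    longest = 0
    for value in arr:
        if isinstance(value, (str, bytes)):
            longest = max(longest, len(value))
    return longest


# Same pattern as the PR benchmark, but with far fewer repetitions.
x = np.array(['abcd', 'abcde', 'abcdef', 'abcdefg'] * 1000, dtype=object)

result = max_len_string_array_py(x)
print(result)  # 7, the length of 'abcdefg'

# Rough timing of the pure-Python loop; absolute numbers depend on the machine.
elapsed = timeit.timeit(lambda: max_len_string_array_py(x), number=10)
print(f"{elapsed / 10 * 1e3:.3f} ms per call")
```

A loop like this pays Python-level overhead on every element, which is the kind of cost a typed Cython loop avoids.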

@cpcloud cpcloud self-assigned this Apr 29, 2015
@cpcloud cpcloud added this to the 0.16.1 milestone Apr 29, 2015
@jreback jreback added the Performance Memory or execution speed performance label Apr 29, 2015
@jreback
Contributor

jreback commented Apr 29, 2015

looks good. release note in perf section & squash. merge when ready.

@@ -896,23 +904,32 @@ def clean_index_list(list obj):

    return maybe_convert_objects(converted), 0


ctypedef fused pandas_string:
Member

I was going to mention you could use a better name here :).

@jreback
Contributor

jreback commented Apr 30, 2015

My naive quick look does show some improvement in to_hdf when you have object dtypes, on the order of 10-15% (this function is only used in to_stata/to_hdf).

@cpcloud
Member Author

cpcloud commented Apr 30, 2015

@shoyer I didn't have a particular pandas use case for this. I'm going to start using it in some often called paths in odo and I wanted to see if I could squeeze out some more perf.

@cpcloud
Member Author

cpcloud commented Apr 30, 2015

I think @jreback had some ideas about using Cython memoryviews in some of the CSV code, similar to how I use them here. IIRC he said there are quite a few places where we don't take full advantage of what Cython has to offer. For example, if you type a variable as just ndarray, you incur the overhead of the fully general getitem C API, whereas if you type it as ndarray[type], the getitem syntax goes directly to the underlying raw pointer array.

@cpcloud
Member Author

cpcloud commented Apr 30, 2015

ok squashed. merging on pass

cpcloud added a commit that referenced this pull request Apr 30, 2015
@cpcloud cpcloud merged commit d96ccd2 into pandas-dev:master Apr 30, 2015
@cpcloud cpcloud deleted the fixlen-string-faster branch April 30, 2015 15:09
Labels
Enhancement Performance Memory or execution speed performance
3 participants