Speed up max_len_string_array #10024

cpcloud · 2015-04-29T20:41:36Z

Before:

In [5]: x = np.array(['abcd', 'abcde', 'abcdef', 'abcdefg'] * int(1e7), object)

In [2]: f = pd.lib.max_len_string_array

In [6]: %timeit f(x)
1 loops, best of 3: 501 ms per loop

After:

In [1]: x = np.array(['abcd', 'abcde', 'abcdef', 'abcdefg'] * int(1e7), object)

In [2]: f = pd.lib.max_len_string_array

In [3]: %timeit f(x)
10 loops, best of 3: 68.4 ms per loop
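
For readers without access to `pd.lib`, the quantity being timed can be sketched in pure Python. This is only a reference for the semantics suggested by the function name (the maximum element length in an array of strings), not the Cython implementation from this PR. The name `max_len_strings` is hypothetical, returning 0 for empty input is an assumption, and the timing here will not match the numbers above.

```python
import timeit


def max_len_strings(values):
    """Return the length of the longest element, or 0 for empty input (assumed)."""
    longest = 0
    for v in values:
        n = len(v)
        if n > longest:
            longest = n
    return longest


# Same four strings as in the benchmark, repeated far fewer times so this runs quickly.
x = ['abcd', 'abcde', 'abcdef', 'abcdefg'] * 1000

result = max_len_strings(x)
elapsed = timeit.timeit(lambda: max_len_strings(x), number=10)
print(result, elapsed)
```

A loop like this pays the generic Python call overhead on every element, which is the kind of cost the typed Cython version is meant to avoid.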

jreback · 2015-04-29T22:14:31Z

Looks good. Please add a release note in the performance section and squash. Merge when ready.

shoyer · 2015-04-30T06:46:22Z

pandas/lib.pyx

@@ -896,23 +904,32 @@ def clean_index_list(list obj):

    return maybe_convert_objects(converted), 0

+
+ctypedef fused pandas_string:


I was going to mention you could use a better name here :).

jreback · 2015-04-30T10:05:57Z

My quick, naive look does show some improvement in to_hdf when you have object dtypes, on the order of 10-15%. (This function is only used in to_stata/to_hdf.)

cpcloud · 2015-04-30T11:57:52Z

@shoyer I didn't have a particular pandas use case for this. I'm going to start using it in some frequently called code paths in odo, and I wanted to see if I could squeeze out some more performance.

cpcloud · 2015-04-30T12:03:36Z

I think @jreback had some ideas about using Cython memoryviews in some of the csv code, similar to how I use them here. IIRC he said there are quite a few places where we don't take full advantage of what Cython has to offer. For example, if you type a variable as just ndarray, you incur the overhead of the fully general getitem C API, whereas if you type it as ndarray[type], the getitem syntax goes directly to the underlying raw pointer array.
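
The typed-access point can be loosely illustrated in plain Python with the built-in `memoryview`, which exposes the same buffer protocol that Cython's typed memoryviews rely on. This is an analogy only, not the Cython code from this PR: the names `values` and `view` are made up for the illustration, and pure Python will not show the speedup that typing gives in Cython.

```python
from array import array

# A contiguous buffer of C doubles (type code 'd').
values = array('d', [1.5, 2.5, 4.0])

# memoryview wraps the buffer without copying and knows the element type.
view = memoryview(values)

print(view.format)    # element type code of the buffer
print(view.itemsize)  # bytes per element
print(view[2])        # indexing decodes directly from the underlying buffer

# Writes through the view land in the original array, since no copy was made.
view[0] = 10.0
print(values[0])
```

In Cython, declaring the element type up front is what lets the compiler turn indexing like this into direct pointer arithmetic instead of a generic object lookup.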

cpcloud · 2015-04-30T13:50:48Z

OK, squashed. Merging once the tests pass.


cpcloud self-assigned this Apr 29, 2015

cpcloud added this to the 0.16.1 milestone Apr 29, 2015

cpcloud added the Enhancement label Apr 29, 2015

jreback added the Performance Memory or execution speed performance label Apr 29, 2015

shoyer reviewed Apr 30, 2015

cpcloud added 8 commits April 30, 2015 09:49

077d353  Network-ize a test
b698772  Use explicit len dispatch to avoid overhead
e6a831f  Improve perf
c1caf7f  Use a fused type
def1479  Ensure object on stata
13b7474  Test that we do not accept unicode
b88139d  Use proper types so that we work with python3
ee2626e  Better name for fused type

cpcloud mentioned this pull request Apr 30, 2015

Discover string columns more precisely blaze/odo#186

Closed

1 task

cpcloud added a commit that referenced this pull request Apr 30, 2015

Merge pull request #10024 from cpcloud/fixlen-string-faster

d96ccd2

Speed up max_len_string_array

cpcloud merged commit d96ccd2 into pandas-dev:master Apr 30, 2015

cpcloud deleted the fixlen-string-faster branch April 30, 2015 15:09
