Avoid unnecessary string allocations in IdnMapping #17399

stephentoub · 2018-04-03T23:21:12Z

If the output matches the input string, we can just use the input string as the result.

This eliminates the allocation in common cases. Where no conversion is needed, it has a measurable positive impact on throughput (~10% on Windows for the cases I tried) in addition to removing the allocation. Where conversion is needed, it appears to result in no measurable degradation of throughput, as those cases are already significantly more expensive.
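For readers who want to see the two cases from the caller's side, here is a minimal sketch using the public `IdnMapping.GetAscii` API. The class name `IdnMappingDemo` and the host names are illustrative assumptions. Whether the ASCII case returns the very same string instance depends on the runtime version and platform, so the sketch prints that result rather than asserting it.

```csharp
using System;
using System.Globalization;

static class IdnMappingDemo
{
    public static void Main()
    {
        var idn = new IdnMapping();

        // ASCII-only host name: no conversion is needed, so the output
        // has the same contents as the input.
        string asciiHost = "example.com";
        string asciiResult = idn.GetAscii(asciiHost);
        Console.WriteLine(asciiResult); // example.com

        // Assumption: on runtimes that include this change, the input
        // instance may be returned as-is. This is runtime-dependent.
        Console.WriteLine(ReferenceEquals(asciiHost, asciiResult));

        // Non-ASCII host name: the label is converted to Punycode, so a
        // new string is always required.
        string unicodeHost = "b\u00FCcher.de";
        Console.WriteLine(idn.GetAscii(unicodeHost)); // xn--bcher-kva.de
    }
}
```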

cc: @tarekgh, @krwq, @danmosemsft, @geoffkizer, @ahsonkhan

tarekgh · 2018-04-03T23:35:51Z

In the mainstream cases, the string will change. So if you believe the overhead in such cases is negligible, then I am OK with this change. LGTM.

ahsonkhan · 2018-04-03T23:37:50Z

> it appears to result in no measurable degradation of throughput, as those cases are already significantly more expensive.

Does passing around an extra parameter (string) have negligible cost here?

stephentoub · 2018-04-03T23:42:26Z

> In the mainstream cases, the string will change

Really? We're using it on host names, which, if they're the common case and are mostly ASCII, don't seem to change. Am I misunderstanding?

stephentoub · 2018-04-03T23:42:47Z

> Does passing around an extra parameter (string) have negligible cost here?

Yes

krwq · 2018-04-03T23:47:00Z

src/mscorlib/shared/System/Globalization/IdnMapping.cs

@@ -156,6 +157,14 @@ public override int GetHashCode()
            return (_allowUnassigned ? 100 : 200) + (_useStd3AsciiRules ? 1000 : 2000);
        }

+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private static unsafe string GetStringForOutput(string originalString, char* input, int inputLength, char* output, int outputLength)


Optional: it might make sense to pass ReadOnlySpan directly so that we have fewer unsafe methods.

Thanks. I think we should look at changing most of the pointers to be spans in the implementation. For now, though, since we already have the pointers, I'm going to leave this as is.
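For illustration only, a minimal sketch of what a span-based variant of the helper could look like. This is not what the PR merged: the signature taking `ReadOnlySpan<char>` and the class name `IdnOutputSketch` are hypothetical, and the char arrays in `Main` merely stand in for the converter's output buffer.

```csharp
using System;

static class IdnOutputSketch
{
    // Hypothetical safe-code counterpart of GetStringForOutput.
    // 'input' may be a slice of 'originalString', so the length check
    // confirms the whole original string was the input before reusing it.
    public static string GetStringForOutput(
        string originalString, ReadOnlySpan<char> input, ReadOnlySpan<char> output)
    {
        return originalString.Length == input.Length && input.SequenceEqual(output)
            ? originalString      // output is identical: no allocation
            : output.ToString();  // output differs: allocate a new string
    }

    public static void Main()
    {
        // Unchanged output: the original instance is handed back.
        string host = "example.com";
        char[] same = host.ToCharArray();
        string reused = GetStringForOutput(host, host.AsSpan(), same);
        Console.WriteLine(ReferenceEquals(host, reused)); // True

        // Changed output: a new string is created from the buffer.
        string unicodeHost = "b\u00FCcher.de";
        char[] converted = "xn--bcher-kva.de".ToCharArray();
        Console.WriteLine(GetStringForOutput(unicodeHost, unicodeHost.AsSpan(), converted)); // xn--bcher-kva.de
    }
}
```

Keeping the `SequenceEqual` comparison means the helper stays correct even if a conversion ever produces different contents of the same length.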

tarekgh · 2018-04-04T00:49:27Z

> Really? We're using it on hostnames, which if they're the common case and are ASCII, for the most part, don't seem to change. Am I misunderstanding?

IDN internationalization is catching up, and people have started using non-ASCII names (e.g. Greek, German, etc.). Using such names is increasing every day especially IDN standard started to adopt more scenarios (e.g. BiDi support). I don't think using ASCII names would be the common cases forever.

stephentoub · 2018-04-04T01:29:23Z

src/mscorlib/shared/System/Globalization/IdnMapping.cs

+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private static unsafe string GetStringForOutput(string originalString, char* input, int inputLength, char* output, int outputLength)
+        {
+            return originalString.Length == inputLength && new ReadOnlySpan<char>(input, inputLength).SequenceEqual(new ReadOnlySpan<char>(output, outputLength)) ?


@tarekgh, is there any situation where the conversion would yield a different result but where the lengths would still match? I'm wondering if we could get rid of the SequenceEqual and just make it:

return originalString.Length == inputLength && inputLength == outputLength ? originalString : new string(output, 0, outputLength);

?

Thanks.

I expect the length will be different in most of the cases (if not all) so I think your check of the length would be good enough here.

> in most of the cases (if not all)

Thanks, though if it won't definitely be all, we'll want to keep the SequenceEqual. Can we prove one way or another whether it would be all cases?

Usually, when you have a non-ASCII character, it will be converted to Punycode, which is usually more than one character, so the string length would be different. I don't know of a case that would produce the same string length.

CC @ShawnSteele if he can advise more about that.

@stephentoub don't block on Shawn's reply. I believe your changes are good to go.

I believe you can have strings with the same result. Typically non-ASCII will cause xn-- to be prepended, making it longer, plus some other stuff, however... the normalization step can also throw out characters, which could make the input shorter. Or combining sequences could normalize to single codepoints, enough of them might make up for the xn-- space. Certainly 99% of the time they'd be different lengths, but...

@ShawnSteele any specific examples? Might be worth adding one as a test case.

stephentoub · 2018-04-04T01:30:44Z

> Using such names is increasing every day especially IDN standard started to adopt more scenarios (e.g. BiDi support). I don't think using ASCII names would be the common cases forever.

Sure, that makes sense. But I would expect ASCII names would still be super common even in that future.

tarekgh · 2018-04-04T01:36:49Z

> Sure, that makes sense. But I would expect ASCII names would still be super common even in that future.

I hope I can find a source with data on how many domains today use at least one non-ASCII character. I don't agree, though, that ASCII will still be super common in the future, because I expect many domains will be mostly ASCII but include one or two non-ASCII characters. This is my speculation and not based on any data, though.

krwq · 2018-04-04T03:27:29Z

I'd expect ASCII to dominate in countries that use a superset (or near superset) of the Latin alphabet, and non-ASCII everywhere else. Either way, I think it is worth removing allocations where possible.

Avoid unnecessary string allocations in IdnMapping

0e4a52b

If the output matches the input string, we can just use the input string as the result.

stephentoub mentioned this pull request Apr 3, 2018

Add a few IdnMapping tests dotnet/corefx#28797

Merged

ahsonkhan approved these changes Apr 3, 2018

krwq approved these changes Apr 3, 2018

krwq reviewed Apr 3, 2018

stephentoub commented Apr 4, 2018

stephentoub merged commit fff9f71 into dotnet:master Apr 4, 2018

stephentoub deleted the idnmappingalloc branch April 4, 2018 10:53

Ruikuan mentioned this pull request Nov 28, 2018

.Net Core 2.1 Performance Improvements Ruikuan/blog#26

Closed

lewurm mentioned this pull request Feb 1, 2019

[2018-08] Bump corert mono/mono#12721

Merged
