This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Avoid unnecessary string allocations in IdnMapping #17399

Merged: 1 commit, Apr 4, 2018

Member

@stephentoub stephentoub commented Apr 3, 2018

If the output matches the input string, we can just use the input string as the result.

This eliminates the allocation in common cases. Where no conversion is needed, it has a measurable positive impact on throughput (~10% on Windows for the cases I tried) in addition to removing the allocation. Where conversion is needed, it appears to result in no measurable degradation of throughput, as those cases are already significantly more expensive.
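
A minimal sketch of how the effect might be observed from calling code, assuming a runtime that includes this change. `IdnReuseDemo` and `ReturnsSameInstance` are hypothetical names used only for illustration; `IdnMapping.GetAscii` is the only real API involved. On a runtime without the change, the helper would be expected to return false, since a new string is allocated for the result.

```csharp
using System;
using System.Globalization;

static class IdnReuseDemo
{
    // Hypothetical helper: reports whether GetAscii handed back the very
    // same string instance it was given, i.e. no new string was allocated.
    public static bool ReturnsSameInstance(string host)
    {
        string ascii = new IdnMapping().GetAscii(host);
        return ReferenceEquals(ascii, host);
    }
}
```

For an all-ASCII host such as "example.com", the returned value equals the input either way; whether it is also the same instance depends on the runtime.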

cc: @tarekgh, @krwq, @danmosemsft, @geoffkizer, @ahsonkhan

@tarekgh
Member

tarekgh commented Apr 3, 2018

In the mainstream cases, the string will change. So if you believe the overhead in such cases is negligible, then I am OK with this change, and LGTM.

@ahsonkhan

it appears to result in no measurable degradation of throughput, as those cases are already significantly more expensive.

Does passing around an extra parameter (string) have negligible cost here?

@stephentoub
Member Author

In the mainstream cases, the string will change

Really? We're using it on host names, which, if they're the common case and are ASCII for the most part, don't seem to change. Am I misunderstanding?

@stephentoub
Member Author

Does passing around an extra parameter (string) have negligible cost here?

Yes

@@ -156,6 +157,14 @@ public override int GetHashCode()
return (_allowUnassigned ? 100 : 200) + (_useStd3AsciiRules ? 1000 : 2000);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe string GetStringForOutput(string originalString, char* input, int inputLength, char* output, int outputLength)
Member

Optional: it might make sense to pass ReadOnlySpan directly so that we have fewer unsafe methods.

Member Author

Thanks. I think we should look at changing most of the pointers to be spans in the implementation. For now, though, since we already have the pointers, I'm going to leave this as is.
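
For illustration, a span-based variant of the helper could look roughly like the sketch below. This is not the merged code, which keeps the pointer parameters; the class name `IdnOutputSketch` and the span-taking signature are assumptions made for this example.

```csharp
using System;

static class IdnOutputSketch
{
    // Hypothetical span-based counterpart of GetStringForOutput.
    // Reuses the caller's string only when the processed input covers the
    // whole original string and the output is identical char for char;
    // otherwise a new string is created from the output buffer.
    public static string GetStringForOutput(
        string originalString, ReadOnlySpan<char> input, ReadOnlySpan<char> output)
    {
        return originalString.Length == input.Length && input.SequenceEqual(output)
            ? originalString
            : output.ToString();
    }
}
```

With spans, the method itself no longer needs the unsafe modifier, although callers that still hold pointers would have to wrap them in spans first.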

@tarekgh
Member

tarekgh commented Apr 4, 2018

Really? We're using it on hostnames, which if they're the common case and are ASCII, for the most part, don't seem to change. Am I misunderstanding?

IDN adoption is catching up, and people have started using non-ASCII names (e.g. Greek, German, etc.). Use of such names is increasing every day, especially as the IDN standard has started to adopt more scenarios (e.g. BiDi support). I don't think ASCII names will be the common case forever.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe string GetStringForOutput(string originalString, char* input, int inputLength, char* output, int outputLength)
{
return originalString.Length == inputLength && new ReadOnlySpan<char>(input, inputLength).SequenceEqual(new ReadOnlySpan<char>(output, outputLength)) ?
Member Author

@tarekgh, is there any situation where the conversion would yield a different result but where the lengths would still match? I'm wondering if we could get rid of the SequenceEqual and just make it:

return originalString.Length == inputLength && inputLength == outputLength ?
    originalString :
    new string(output, 0, outputLength);

?

Thanks.

Member

I expect the length will be different in most of the cases (if not all) so I think your check of the length would be good enough here.

Member Author

in most of the cases (if not all)

Thanks, though if it won't definitely be all, we'll want to keep the SequenceEqual. Can we prove one way or another whether it would be all cases?

Member

Usually, when you have a non-ASCII character, it will be converted to Punycode, which is usually more than one character, so the string length would be different. I don't know of a case that would produce the same string length.

CC @ShawnSteele in case he can advise more on that.

Member

@stephentoub don't block on Shawn's reply. I believe your changes are good to go.

I believe you can have strings with the same length. Typically non-ASCII will cause xn-- to be prepended, making it longer, plus some other stuff. However, the normalization step can also throw out characters, which could make the input shorter. Or combining sequences could normalize to single code points, and enough of them might make up for the xn-- space. Certainly 99% of the time they'd be different lengths, but...

Member

@ShawnSteele any specific examples? It might be worth adding one as a test case.
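
One way such a test case could be probed is sketched below. `SameLengthProbe` and `IsSameLengthButDifferent` are hypothetical names, not part of the PR. A candidate input worth trying (an assumption, not something raised in this thread) is a host name that uses the ideographic full stop U+3002 as a label separator, for example "example\u3002com": if it is mapped to an ordinary '.', the output would have the same length as the input but different content.

```csharp
using System;
using System.Globalization;

static class SameLengthProbe
{
    // Hypothetical probe: true when GetAscii produces output of the same
    // length as the input but with different characters, which is the case
    // a length-only check would get wrong.
    public static bool IsSameLengthButDifferent(string input)
    {
        string ascii = new IdnMapping().GetAscii(input);
        return ascii.Length == input.Length
            && !string.Equals(ascii, input, StringComparison.Ordinal);
    }
}
```

Any input for which this returns true would show that the SequenceEqual comparison is still needed.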

@stephentoub
Member Author

Use of such names is increasing every day, especially as the IDN standard has started to adopt more scenarios (e.g. BiDi support). I don't think ASCII names will be the common case forever.

Sure, that makes sense. But I would expect ASCII names would still be super common even in that future.

@tarekgh
Member

tarekgh commented Apr 4, 2018

Sure, that makes sense. But I would expect ASCII names would still be super common even in that future.

I hope I can find a source with data on how many domains today use at least one non-ASCII character. I don't agree, though, that ASCII will still be super common in the future, because I expect many domains will be mostly ASCII but include one or two non-ASCII characters. This is my speculation, though, and not based on any data.

@krwq
Member

krwq commented Apr 4, 2018

I'd expect ASCII to dominate in countries that use a superset (or near superset) of the Latin alphabet, and non-ASCII everywhere else. Either way, I think it is worth removing allocations where possible.

@stephentoub stephentoub merged commit fff9f71 into dotnet:master Apr 4, 2018
@stephentoub stephentoub deleted the idnmappingalloc branch April 4, 2018 10:53
dotnet-bot pushed a commit to dotnet/corert that referenced this pull request Apr 4, 2018
jkotas pushed a commit to dotnet/corert that referenced this pull request Apr 5, 2018
dotnet-bot pushed a commit to dotnet/corefx that referenced this pull request Apr 9, 2018
Anipik pushed a commit to dotnet/corefx that referenced this pull request Apr 9, 2018