Avoid unnecessary string allocations in IdnMapping #17399
If the output matches the input string, we can just use the input string as the result.
In the mainstream cases, the string will change. So if you believe the overhead in such cases is negligible, then I am OK with this change and LGTM.
Does passing around an extra parameter (string) have negligible cost here?
Really? We're using it on host names, which, if they're the common case and are ASCII for the most part, don't seem to change. Am I misunderstanding?
Yes
@@ -156,6 +157,14 @@ public override int GetHashCode()
    return (_allowUnassigned ? 100 : 200) + (_useStd3AsciiRules ? 1000 : 2000);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe string GetStringForOutput(string originalString, char* input, int inputLength, char* output, int outputLength)
Optional: it might make sense to pass ReadOnlySpan directly so that we have fewer unsafe methods.
Thanks. I think we should look at changing most of the pointers to be spans in the implementation. For now, though, since we already have the pointers, I'm going to leave this as is.
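For illustration only, the span-based shape suggested above might look roughly like the sketch below. It is not code from this PR: the class name `IdnOutputSketch` and the span-taking signature are assumptions, and it relies on `new string(ReadOnlySpan<char>)` and `MemoryExtensions.SequenceEqual` being available (.NET Core 2.1 or later).

```csharp
using System;

// Hypothetical container type; not part of IdnMapping.
public static class IdnOutputSketch
{
    // Assumed span-based counterpart of the pointer-based GetStringForOutput.
    // Returns the caller's original string when the produced output is
    // identical to the input, and allocates a new string only otherwise.
    public static string GetStringForOutput(
        string originalString, ReadOnlySpan<char> input, ReadOnlySpan<char> output)
    {
        // The length check guards against 'input' being only a slice of
        // 'originalString'; SequenceEqual then compares the actual contents.
        return originalString.Length == input.Length && input.SequenceEqual(output)
            ? originalString
            : new string(output);
    }
}
```

With this shape the helper itself no longer needs `unsafe`, although callers that still hold `char*` buffers would have to wrap them in spans first.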
IDN adoption is catching up, and people have started using non-ASCII names (e.g. Greek, German, etc.). Use of such names is increasing every day, especially as the IDN standard adopts more scenarios (e.g. BiDi support). I don't think ASCII names will be the common case forever.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe string GetStringForOutput(string originalString, char* input, int inputLength, char* output, int outputLength)
{
    return originalString.Length == inputLength && new ReadOnlySpan<char>(input, inputLength).SequenceEqual(new ReadOnlySpan<char>(output, outputLength)) ?
@tarekgh, is there any situation where the conversion would yield a different result but where the lengths would still match? I'm wondering if we could get rid of the SequenceEqual and just make it:
return originalString.Length == inputLength && inputLength == outputLength ?
originalString :
new string(output, 0, outputLength);
?
Thanks.
I expect the length will be different in most of the cases (if not all) so I think your check of the length would be good enough here.
in most of the cases (if not all)
Thanks, though if it won't definitely be all, we'll want to keep the SequenceEqual. Can we prove one way or another whether it would be all cases?
Usually, when you have a non-ASCII character, it is converted to Punycode, which usually takes more than one character, so the string length would be different. I don't know of a case that would produce the same string length.
CC @ShawnSteele if he can advise more about that.
@stephentoub don't block on Shawn's reply. I believe your changes are good to go.
I believe you can have strings with the same resulting length. Typically non-ASCII input will cause xn-- to be prepended, making it longer, plus some other stuff. However, the normalization step can also throw out characters, which could make the input shorter. Or combining sequences could normalize to single code points, and enough of them might make up for the xn-- space. Certainly 99% of the time they'd be different lengths, but...
@ShawnSteele any specific examples? It might be worth adding one as a test case.
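To make the concern in this thread concrete, here is a minimal sketch contrasting the length-only check with the content comparison. The class and method names are hypothetical, and the buffers in `Demo` are made-up stand-ins, not real IdnMapping results; no actual input with matching lengths was identified in this conversation.

```csharp
using System;

// Hypothetical names; this only illustrates the two checks discussed above.
public static class LengthCheckSketch
{
    // Length-only variant: reuses the original whenever the lengths coincide,
    // even if the converted characters differ.
    public static string LengthOnly(string original, ReadOnlySpan<char> output) =>
        original.Length == output.Length ? original : new string(output);

    // Content-comparing variant: reuses the original only when every
    // character of the output matches.
    public static string WithSequenceEqual(string original, ReadOnlySpan<char> output) =>
        original.AsSpan().SequenceEqual(output) ? original : new string(output);

    public static void Demo()
    {
        string original = "abcd";
        // Made-up output with the same length but different content.
        ReadOnlySpan<char> output = "abce";

        Console.WriteLine(LengthOnly(original, output));        // prints "abcd" (stale result)
        Console.WriteLine(WithSequenceEqual(original, output)); // prints "abce"
    }
}
```

If conversions that preserve length while changing content can occur at all, the length-only variant would silently return the unconverted input, which is why the comparison stays unless it can be proven unnecessary.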
Sure, that makes sense. But I would expect ASCII names would still be super common even in that future.
I hope I can find a source with data on how many domains today use at least one non-ASCII character. I don't agree, though, that ASCII will still be super common in the future, because I expect many domains will be mostly ASCII but include one or two non-ASCII characters. This is speculation on my part and not based on any data, though.
I'd expect ASCII to dominate in countries that use a superset (or near superset) of the Latin alphabet, and non-ASCII everywhere else. Either way, I think it is worth removing allocations where possible.
) If the output matches the input string, we can just use the input string as the result. Signed-off-by: dotnet-bot <[email protected]>
) If the output matches the input string, we can just use the input string as the result. Signed-off-by: dotnet-bot-corefx-mirror <[email protected]>
If the output matches the input string, we can just use the input string as the result.
This eliminates the allocation in common cases. It has a measurable positive impact (~10% on Windows for the cases I tried) where no conversion is needed while also removing the allocation, and for cases where conversion is needed, it appears to result in no measurable degradation of throughput, as those cases are already significantly more expensive.
cc: @tarekgh, @krwq, @danmosemsft, @geoffkizer, @ahsonkhan