Created: 3 years, 8 months ago by Alexander Yashkin
Modified: 3 years, 8 months ago
CC: chromium-reviews
Target Ref: refs/heads/master
Project: chromium
Visibility: Public.
Description
Changed GenerateKeyword to always return keyword in lowercase

TemplateURL::GenerateKeyword returns the keyword for a search engine URL using
the GURL::host() method. The TemplateURLService component that stores
TemplateURLs assumes that keywords are always converted to lowercase.
GURL::host() can return a string with uppercase characters for some exotic
URLs. For example, for "http://embedded.<html>web" it will return
"embedded.%3Chtml%3Eweb".
This could lead to problems when TemplateURLService tries to resolve
conflicts between autogenerated keywords.

BUG=709761
[email protected], [email protected]

Review-Url: https://codereview.chromium.org/2806593006
Cr-Commit-Position: refs/heads/master@{#465946}
Committed: https://chromium.googlesource.com/chromium/src/+/81695d0da81e63990e210da79b7a65c7dd386140
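
As a rough illustration of the behavior the description asks for, the sketch below lowercases an already-canonicalized host string. It is not the Chromium implementation: the function name GenerateKeywordSketch is hypothetical, it works on a plain std::string rather than a GURL, and it only folds ASCII letters.

```cpp
#include <string>

// Hypothetical stand-in for the keyword-generation step; not Chromium code.
// `host` is assumed to be the canonical host, i.e. what GURL::host() would
// have returned for the search engine URL.
std::string GenerateKeywordSketch(std::string host) {
  for (char& c : host) {
    // Fold ASCII uppercase only, so the hex digits of an escape such as
    // "%3C" become "%3c" and every other byte is left untouched.
    if (c >= 'A' && c <= 'Z') {
      c = static_cast<char>(c - 'A' + 'a');
    }
  }
  return host;
}
```

With this sketch, the host "embedded.%3Chtml%3Eweb" maps to the keyword "embedded.%3chtml%3eweb", which matches the lowercase form the description says TemplateURLService assumes.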
Patch Set 1 #
Total comments: 8
Patch Set 2 : Fixed after review, round 1 #
Messages
Total messages: 25 (8 generated)
The CQ bit was checked by [email protected] to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by [email protected]
Dry run: No L-G-T-M from a valid reviewer yet. CQ run can only be started once the patch has received an L-G-T-M from a full committer. Even if an L-G-T-M may have been provided, it was from a non-committer,_not_ a full super star committer. Committers are members of the group "project-chromium-committers". Note that this has nothing to do with OWNERS files.
How can this actually happen in the wild? Hostnames should never contain escaped characters.
On 2017/04/09 at 04:55:36, pkasting wrote:
> How can this actually happen in the wild? Hostnames should never contain escaped characters.

This actually happened in the wild. I captured such a URL with ScopedCrashKey in TemplateURLService:
https://bugs.chromium.org/p/chromium/issues/detail?id=697745#c35

GURL thinks it's a valid URL:

  DLOG(ERROR) << GURL("http://embeddedhtml.<head>/").is_valid()
              << GURL("http://embeddedhtml.<head>/").host();

Output:

  1 embeddedhtml.%3Chead%3E

I think such URLs can reach the browser from an OSDD description; maybe they are the result of some proxy software filtering data.
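
To make the uppercase letters in that output concrete, here is a minimal sketch that percent-escapes '<' and '>' with uppercase hex digits. It only mimics the output quoted above; EscapeAngleBrackets is a hypothetical helper and makes no claim about how GURL's canonicalizer is actually written or which other characters it escapes.

```cpp
#include <string>

// Hypothetical helper, not part of GURL: escapes only '<' and '>' as
// "%XX" using uppercase hex digits, matching the observed host() output.
std::string EscapeAngleBrackets(const std::string& host) {
  static const char kHexDigits[] = "0123456789ABCDEF";
  std::string escaped;
  for (unsigned char c : host) {
    if (c == '<' || c == '>') {
      escaped += '%';
      escaped += kHexDigits[c >> 4];   // high nibble: '3' for both characters
      escaped += kHexDigits[c & 0xF];  // low nibble: 'C' for '<', 'E' for '>'
    } else {
      escaped += static_cast<char>(c);
    }
  }
  return escaped;
}
```

Since '<' is 0x3C and '>' is 0x3E, the letters 'C' and 'E' come from the hex digits of the escape, not from the original hostname text.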
[email protected] changed reviewers: + [email protected]
Looping in brettw (GURL owner) regarding whether it's correct for GURL to consider an HTTP URL valid when the hostname contains characters that must be escaped. Such URLs violate various RFCs, but maybe there's a reason we accept them anyway?
On 2017/04/09 at 08:12:45, pkasting wrote:
> Looping in brettw (GURL owner) regarding whether it's correct for GURL to consider an HTTP URL valid when the hostname contains characters that must be escaped. Such URLs violate various RFCs, but maybe there's a reason we accept them anyway?

Ping brettw
This is theoretically possible but I'd like to remove support for percent-escaped hostname chars in
https://bugs.chromium.org/p/chromium/issues/detail?id=652808
Unfortunately doing this change properly requires a major change to Blink (basically separating a DOM "url object" from our internal notion of a canonical URL).

But given the reasoning in that bug for removing that support, basically no real site uses these characters to my knowledge. If you have examples of this happening in real-life, I would want to know about that to update the design doc for removing the escaped characters.

The bug does not describe a real-life example of this problem. Given this, I don't think we should be adding complexity to the template URL resolver to handle the case where the user has a search engine with such characters in the host name.
On 2017/04/17 at 18:02:14, brettw wrote:
> This is theoretically possible but I'd like to remove support for percent-escaped hostname chars in
> https://bugs.chromium.org/p/chromium/issues/detail?id=652808
> Unfortunately doing this change properly requires a major change to Blink (basically separating a DOM "url object" from our internal notion of a canonical URL).
>
> But given the reasoning in that bug for removing that support, basically no real site uses these characters to my knowledge. If you have examples of this happening in real-life, I would want to know about that to update the design doc for removing the escaped characters.
>
> The bug does not describe a real-life example of this problem. Given this, I don't think we should be adding complexity to the template URL resolver to handle the case where the user has a search engine with such characters in the host name.

Actually, I have captured such URLs in TemplateURLService with CHECKs and ScopedCrashKey in
https://bugs.chromium.org/p/chromium/issues/detail?id=697745#c35

Captured URLs:
http://isearch.<HTML><HEAD><TITLE>Christie Offices - HotSpot - Login</TITLE>web/?type=dspp&q={searchTerms}
http://isearch.<html>web/?type=dspp&q={searchTerms}
http://www.<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

IMHO, these URLs could have been added by parsing an OSDD description, and maybe their content is the result of traffic filtering by some buggy proxy software, though this is only an assumption.
So these URLs are from real life, and they are not handled correctly by the current TemplateURLService implementation. This CL tries to handle them gracefully.
Sorry, I should have described the problem in more detail in the description.
I agree with the assessment of buggy proxies or something. It sounds to me like we want these to fail to canonicalize, but it may be hard to land that change soon. What's the best way forward, Brett? Should we land this change as a bandaid?
Is it possible to check for % and exit-out of template URL matching?
On 2017/04/17 19:25:43, brettw (plz ping after 24h) wrote:
> Is it possible to check for % and exit-out of template URL matching?

You mean in GenerateKeyword()? Yeah, that seems plausible.
On 2017/04/17 at 19:38:43, pkasting wrote:
> On 2017/04/17 19:25:43, brettw (plz ping after 24h) wrote:
> > Is it possible to check for % and exit-out of template URL matching?
>
> You mean in GenerateKeyword()? Yeah, that seems plausible.

What should GenerateKeyword() return for such strange URLs?

I see two ways of dealing with this problem:
1. Allow search URLs with '%' in host() in TemplateURLService. In this case, my proposed solution is good enough, IMHO.
2. Do not allow URLs with '%' in TemplateURLService. In this case we should filter out such search engines while parsing the OSDD description. We should also check search URLs from the webdata DB, extension manifests and other inputs.

I prefer route number 1: it's simpler, although we end up storing some strange search URLs and autogenerating strange keywords.
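
For comparison, the check behind option 2 could look roughly like the sketch below. HostHasEscapedChars is a hypothetical name, and the sketch assumes the caller already holds the canonical host string. It says nothing about where such a check would need to be called, which is the costly part of that option.

```cpp
#include <string>

// Hypothetical predicate for option 2: reports whether a canonical host
// contains a percent-escape, so the caller can reject the search engine.
bool HostHasEscapedChars(const std::string& canonical_host) {
  return canonical_host.find('%') != std::string::npos;
}
```

Every input path listed in option 2 (OSDD parsing, the webdata DB, extension manifests) would have to call a check like this, whereas option 1 touches only keyword generation.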
On 2017/04/18 07:14:28, Alexander Yashkin wrote:
> On 2017/04/17 at 19:38:43, pkasting wrote:
> > On 2017/04/17 19:25:43, brettw (plz ping after 24h) wrote:
> > > Is it possible to check for % and exit-out of template URL matching?
> >
> > You mean in GenerateKeyword()? Yeah, that seems plausible.
>
> What should return GenerateKeyword() for such strange URLs?
>
> I see two ways of dealing with this problem:
> 1. Allow search URLs with '%' in host() in TemplateURLService. In this case, my
> proposed solution is good enough, IMHO.
> 2. Do not allow URLs with '%' in TemplateURLService. In this case we should
> filter out such search engines while parsing OSDD description.
> Also we should check search URLs from webdata DB, extensions manifests and other
> inputs.
>
> I like the route number 1, its simplier, although we store some strange search
> URLs and autogenerate strange keywords.

Blargh, I didn't realize this function can't really fail and callers aren't prepared for it to. You're right that disallowing these sorts of engines is going to require adding filtering/checks in multiple places, which I'd rather avoid when this is basically a bandaid.

Let's go ahead and land something like this, with a TODO pointing at Brett's bug and saying this is only needed for URLs that have escaped characters in the hostname, and we can remove it when those are gone.

Brett: Should GURL at least use lowercase characters when escaping in the hostname for now, rather than uppercase?

https://codereview.chromium.org/2806593006/diff/1/chrome/browser/search_engin...
File chrome/browser/search_engines/template_url_service_unittest.cc (right):

https://codereview.chromium.org/2806593006/diff/1/chrome/browser/search_engin...
chrome/browser/search_engines/template_url_service_unittest.cc:1799: // generated keyword for such URL contained upper case characters.
Nit: Avoid describing past problems (because it's not clear when this refers to), and if you need to motivate something, just describe potential problems (e.g. "URLs with embedded HTML canonicalize to contain uppercase characters in the hostname. Ensure these URLs are still handled correctly for conflict resolution.").

https://codereview.chromium.org/2806593006/diff/1/chrome/browser/search_engin...
chrome/browser/search_engines/template_url_service_unittest.cc:1817: ASCIIToUTF16("embedded.%3chtml%3eweb_")));
Would this test have failed without this patch? Wouldn't both keywords be uppercase, and thus still match each other, so the test would still pass?

https://codereview.chromium.org/2806593006/diff/1/components/search_engines/t...
File components/search_engines/template_url_unittest.cc (right):

https://codereview.chromium.org/2806593006/diff/1/components/search_engines/t...
components/search_engines/template_url_unittest.cc:27: bool IsLowerCase(base::string16 str) {
Nit: const &

https://codereview.chromium.org/2806593006/diff/1/components/search_engines/t...
components/search_engines/template_url_unittest.cc:1749: ASSERT_TRUE(IsLowerCase(
Nit: I think every ASSERT in this test could be EXPECT.
Changing canonicalization is scary and I'd rather not mess with it. Peter's suggestion SGTM
https://codereview.chromium.org/2806593006/diff/1/chrome/browser/search_engin...
File chrome/browser/search_engines/template_url_service_unittest.cc (right):

https://codereview.chromium.org/2806593006/diff/1/chrome/browser/search_engin...
chrome/browser/search_engines/template_url_service_unittest.cc:1799: // generated keyword for such URL contained upper case characters.
On 2017/04/18 at 18:46:19, Peter Kasting wrote:
> Nit: Avoid describing past problems (because it's not clear when this refers to), and if you need to motivate something, just describe potential problems (e.g. "URLs with embedded HTML canonicalize to contain uppercase characters in the hostname. Ensure these URLs are still handled correctly for conflict resolution.").

Thanks, done.

https://codereview.chromium.org/2806593006/diff/1/chrome/browser/search_engin...
chrome/browser/search_engines/template_url_service_unittest.cc:1817: ASCIIToUTF16("embedded.%3chtml%3eweb_")));
On 2017/04/18 at 18:46:20, Peter Kasting wrote:
> Would this test have failed without this patch? Wouldn't both keywords be uppercase, and thus still match each other, so the test would still pass?

This test fails without my patch: first on various DCHECKs and, if those are disabled, on the expected-keyword checks:

Note: Google Test filter = TemplateURLServiceTest.CheckNonreplaceableEnginesKeywordsConflicts
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TemplateURLServiceTest
[ RUN      ] TemplateURLServiceTest.CheckNonreplaceableEnginesKeywordsConflicts
../../chrome/browser/search_engines/template_url_service_unittest.cc:1814: Failure
Value of: model()->GetTemplateURLForKeyword( ASCIIToUTF16("embedded.%3chtml%3eweb"))
  Actual: 0x7f81cd02dc00
Expected: user6
Which is: 0x7f81cd02e600
../../chrome/browser/search_engines/template_url_service_unittest.cc:1816: Failure
Value of: user5->keyword()
  Actual: embedded.%3chtml%3eweb
Expected: ASCIIToUTF16("embedded.%3chtml%3eweb_")
Which is: embedded.%3chtml%3eweb_
../../chrome/browser/search_engines/template_url_service_unittest.cc:1818: Failure
Value of: model()->GetTemplateURLForKeyword( ASCIIToUTF16("embedded.%3chtml%3eweb_"))
  Actual: NULL
Expected: user5
Which is: 0x7f81cd02dc00

What happens without my patch:
1. Engine "nonreplaceable5" is added with keyword "embedded.%3chtml%3eweb".
2. An attempt to add engine "nonreplaceable6" (using AddNoNotify) with the same keyword "embedded.%3chtml%3eweb" leads to a call to UniquifyKeyword for the existing engine "nonreplaceable5". UniquifyKeyword returns the keyword autogenerated from the engine URL (using GetKeyword()): "embedded.%3Chtml%3Eweb" (contains uppercase characters). While generating keywords, UniquifyKeyword checks that TemplateURLService does not contain an engine with keyword "embedded.%3Chtml%3Eweb", and the check succeeds, because all keywords inside TemplateURLService are lowercased.
3. After receiving the new keyword (with uppercase chars) from UniquifyKeyword, AddNoNotify tries to update engine "nonreplaceable5" with the new keyword by calling ResetTemplateURLNoNotify.
4. ResetTemplateURLNoNotify creates TemplateURLData with the new values, and inside TemplateURLData::SetKeyword() the keyword "embedded.%3Chtml%3Eweb" is converted to lowercase. After that, ResetTemplateURLNoNotify calls UpdateNoNotify for "nonreplaceable5" with the lowercased keyword "embedded.%3chtml%3eweb".
5. So after the call to AddNoNotify, engine "nonreplaceable5" has the same keyword as before, and the new engine "nonreplaceable6" is also added to TemplateURLService with the same keyword "embedded.%3chtml%3eweb".

So we have two normal search engines inside TemplateURLService with the same keyword after AddNoNotify, which is not the intended result, IMHO.

https://codereview.chromium.org/2806593006/diff/1/components/search_engines/t...
File components/search_engines/template_url_unittest.cc (right):

https://codereview.chromium.org/2806593006/diff/1/components/search_engines/t...
components/search_engines/template_url_unittest.cc:27: bool IsLowerCase(base::string16 str) {
On 2017/04/18 at 18:46:20, Peter Kasting wrote:
> Nit: const &

Done.

https://codereview.chromium.org/2806593006/diff/1/components/search_engines/t...
components/search_engines/template_url_unittest.cc:1749: ASSERT_TRUE(IsLowerCase(
On 2017/04/18 at 18:46:20, Peter Kasting wrote:
> Nit: I think every ASSERT in this test could be EXPECT.

Done
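
The five steps above can be modeled with a toy keyword store. This is a simplified sketch under stated assumptions, not the TemplateURLService code: ToyKeywordStore, UniquifySketch and KeywordAfterConflict are hypothetical names, the store lowercases on write but compares exactly on lookup, and uniquifying is assumed to append '_' (suggested by the "embedded.%3chtml%3eweb_" expectation in the test output).

```cpp
#include <set>
#include <string>

// ASCII-only lowercase helper for the toy model.
std::string ToLowerAscii(std::string s) {
  for (char& c : s) {
    if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
  }
  return s;
}

// Toy stand-in for the service: keywords are lowercased when stored,
// but lookups compare the given string exactly.
class ToyKeywordStore {
 public:
  void Store(const std::string& keyword) {
    keywords_.insert(ToLowerAscii(keyword));
  }
  bool Contains(const std::string& keyword) const {
    return keywords_.count(keyword) > 0;
  }

 private:
  std::set<std::string> keywords_;
};

// Assumed uniquifying rule: append '_' until the candidate looks unused.
std::string UniquifySketch(const ToyKeywordStore& store,
                           std::string candidate) {
  while (store.Contains(candidate)) candidate += '_';
  return candidate;
}

// Returns the keyword the existing engine ends up with when a second engine
// with the same keyword is added and the first one is re-keyed using the
// keyword generated from its URL.
std::string KeywordAfterConflict(const std::string& generated_keyword) {
  ToyKeywordStore store;
  store.Store(generated_keyword);  // step 1: stored lowercased
  // Steps 2-4: uniquify with the generated form, then lowercase on write.
  return ToLowerAscii(UniquifySketch(store, generated_keyword));
}
```

In this model, a generated keyword of "embedded.%3Chtml%3Eweb" leaves the existing engine on "embedded.%3chtml%3eweb", so the conflict is never resolved. A generated keyword that is already lowercase is detected as taken and becomes "embedded.%3chtml%3eweb_".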
LGTM
The CQ bit was checked by [email protected]
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 20001, "attempt_start_ts": 1492673665660750, "parent_rev": "938f4e0fe5e252248920e446aab82b784adae07d", "commit_rev": "81695d0da81e63990e210da79b7a65c7dd386140"}
Committed patchset #2 (id:20001) as https://chromium.googlesource.com/chromium/src/+/81695d0da81e63990e210da79b7a...