# Internationalized Domain Names (IDN) in Google Chrome

## Background

Many years ago, domains could only consist of the Latin letters A to Z, digits,
and a few other characters. [Internationalized Domain Names
(IDNs)](https://en.wikipedia.org/wiki/Internationalized_domain_name) were
created to better support non-Latin alphabets for web users around the globe.

Different characters from different (or even the same!) languages can look very
similar. We’ve seen
[reports](https://bugs.chromium.org/p/chromium/issues/detail?id=683314) of
proof-of-concept attacks that exploit this. These are called [homograph
attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack). For example, the
Latin "a" looks a lot like the Cyrillic "а", so someone could register
`http://ebаy.com` (using Cyrillic "`а`"), which could be confused for
`http://ebay.com`. This is a limitation of how URLs are displayed in browsers in
general, not a specific bug in Chrome.

In a perfect world, domain registrars would not allow these confusable domain
names to be registered. Some domain registrars do exactly that, mostly by
restricting the characters allowed, but many do not. To better protect against
these attacks, browsers display some domains in
[punycode](https://en.wikipedia.org/wiki/Punycode) (looks like `xn--...`)
instead of the original IDN, according to their own IDN policies.

This is a challenging problem space. Chrome has a global user base of billions
of people around the world, many of whom are not viewing URLs with Latin
letters. We want to prevent confusion, while ensuring that users across
languages have a great experience in Chrome. Displaying either punycode or a
visible security warning on too wide a set of URLs would hurt web usability
for people around the world.

Chrome and other browsers try to balance these needs by implementing IDN
policies in a way that allows IDN to be shown for valid domains, but protects
against confusable homograph attacks.

Chrome's IDN policy is one of several tools that aim to protect users.
[Google Safe Browsing](https://safebrowsing.google.com/) continues to help
protect over two billion devices every day by showing warnings to users when
they attempt to navigate to dangerous or deceptive sites or download dangerous
files. Password managers continue to remember which domain each password is
for, and won’t automatically fill a password into a domain that is not exactly
the correct one.

## How IDN works

IDNs were devised to support arbitrary Unicode characters in hostnames in a
backward-compatible way. This works by having user agents transform hostnames
containing non-ASCII Unicode characters into an ASCII-only hostname, which can
then be sent on to DNS servers. This is done by encoding each domain label into
its punycode representation. This representation consists of a four-character
prefix (`xn--`) followed by the Unicode label translated to ASCII Compatible
Encoding (ACE). For example, `http://öbb.at` is transformed to `http://xn--bb-eka.at`.
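
As a rough illustration of the per-label transformation described above, the
following Python sketch uses the standard library's `punycode` codec. The helper
names `label_to_ace`, `hostname_to_ace`, and `ace_to_unicode` are hypothetical,
and the sketch skips the UTS 46 mapping and validation that a real user agent
performs, so it is a simplification rather than Chrome's implementation.

```python
ACE_PREFIX = "xn--"


def label_to_ace(label: str) -> str:
    """Encode one domain label; ASCII-only labels are passed through unchanged."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def hostname_to_ace(hostname: str) -> str:
    """Encode each dot-separated label of a hostname independently."""
    return ".".join(label_to_ace(label) for label in hostname.split("."))


def ace_to_unicode(hostname: str) -> str:
    """Reverse the transformation, again one label at a time."""
    decoded = []
    for label in hostname.split("."):
        if label.lower().startswith(ACE_PREFIX):
            label = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        decoded.append(label)
    return ".".join(decoded)


if __name__ == "__main__":
    print(hostname_to_ace("öbb.at"))        # xn--bb-eka.at
    print(ace_to_unicode("xn--bb-eka.at"))  # öbb.at
```

Only the label containing a non-ASCII character changes; `at` is already ASCII
and is sent to DNS as-is.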

## Google Chrome's IDN policy

Since Chrome 51, Chrome has used an IDN display policy that does not take into
account the language settings (the Accept-Language list) of the browser. A
[similar strategy](https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm) is
used by Firefox.

Google Chrome decides if it should show Unicode or punycode for each domain
label (component) of a hostname separately. To decide if a component should be
shown in Unicode, Google Chrome uses the following algorithm:
1. Convert each component stored in the ACE to Unicode per [UTS 46 transitional
processing](http://unicode.org/reports/tr46/#Processing) (`ToUnicode`).

2. If there is an error in `ToUnicode` conversion (e.g. contains [disallowed
characters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Auts46%3Ddisallowed%3A%5D&abb=on&g=&i=),
[starts with a combining
mark](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da390a6b3d9844a1dcc1f99fb1ae478ecf),
or [violates BiDi
rules](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da8a9311811fb0f3db1644ac1a88056370)),
show punycode.

3. If there is a character in a label not belonging to [Characters allowed in
identifiers](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AIdentifierStatus%3DAllowed%3A&abb=on&g=&i=)
per [Unicode Technical Standard 39 (UTS
39)](http://www.unicode.org/reports/tr39/#Identifier_Status_and_Type), show
punycode.

4. If any character in a label belongs to [the disallowed
list](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5Cu01CD-%5Cu01DC%5D+%5B%5Cu1c80-%5Cu1c8f%5D++%5B%5Cu1e90-%5Cu1e9b%5D++%5B%5Cu1f00-%5Cu1fff%5D++%5B%5Cua640-%5Cua69f%5D-%5B%5Cua720-%5Cua72f%5D+%5B%5Cu0338+%5Cu058a+%5Cu2010+%5Cu2019+%5Cu2027+%5Cu30a0+%5Cu02bb+%5Cu02bc+%5D&abb=on&g=&i=),
show punycode.

5. If the component uses characters drawn from multiple scripts, it is subject
to a script mixing check based on the ["Highly Restrictive" profile of UTS
39](http://www.unicode.org/reports/tr39/#Restriction_Level_Detection) with an
additional restriction on Latin. If the component fails the check, show the
component in punycode.
    - Latin, Cyrillic or Greek characters cannot be mixed with each other
    - Latin characters in the ASCII range can be mixed ONLY with Chinese (Han,
      Bopomofo), Japanese (Kanji, Katakana, Hiragana), or Korean (Hangul, Hanja)
    - Han (CJK Ideographs) can be mixed with Bopomofo
    - Han can be mixed with Hiragana and Katakana
    - Han can be mixed with Korean Hangul

6. If two or more numbering systems (e.g. European digits + Bengali digits) are
mixed, show punycode.

7. If there are any invisible characters (e.g. a sequence of the same combining
mark or a sequence of Kana combining marks), show punycode.

8. If there are any characters used in an unusual way, show punycode. E.g.
[`LATIN MIDDLE DOT (·)`](https://unicode.org/cldr/utility/character.jsp?a=00B7)
used outside [ela geminada](https://en.wiktionary.org/wiki/ela_geminada).

9. Test the label for [mixed script confusables per UTS
39](http://unicode.org/reports/tr39/#Mixed_Script_Confusables). If a mixed
script confusable is detected, show punycode.

10. Test the label for [whole script
confusables](http://unicode.org/reports/tr39/#Whole_Script_Confusables): If all
the letters in a given label belong to a set of whole-script-confusable letters
in one of the [whole-script-confusable
scripts](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=kWholeScriptConfusables&sq=package:chromium)
and if the hostname doesn't have a corresponding
[allowed top-level-domain](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.h?type=cs&q=allowed_tlds)
for that script, show punycode.

    **Example for Cyrillic:**
    The first label in hostname `аррӏе.com` (`xn--80ak6aa92e.com`) is all [Cyrillic
    letters that look like Latin letters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%D0%B0%D1%81%D4%81%D0%B5%D2%BB%D1%96%D1%98%D3%8F%D0%BE%D1%80%D4%9B%D1%95%D4%9D%D1%85%D1%83%D1%8A%D0%AC%D2%BD%D0%BF%D0%B3%D1%B5%D1%A1%5D&g=gc&i=)
    **AND** the TLD (`com`) is not Cyrillic **AND** the TLD is not one of the TLDs
    known to host a large number of Cyrillic domains (e.g. `ru`, `su`, `pyc`, `ua`).
    Show it in punycode.

11. If the label contains only [digits and digit
spoofs](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=IsDigitLookalike),
show punycode.

12. If the label matches a [dangerous
pattern](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&g=0&l=422),
show punycode.

13. If the [skeleton](http://unicode.org/reports/tr39/#def-skeleton) of the
registrable part of a hostname is identical to one of the top domains after
removing diacritic marks and mapping each character to its spoofing skeleton
(e.g. `www.googlé.com` with `é` in place of `e`), show punycode.

Otherwise, show Unicode.

This is implemented by `IDNToUnicodeOneComponent()` and `IsIDNComponentSafe()`
in
[`components/url_formatter/url_formatter.cc`](https://cs.chromium.org/search/?q=components/url_formatter/url_formatter.cc)
and the `IDNSpoofChecker` class in
[`components/url_formatter/spoof_checks/idn_spoof_checker.cc`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc).
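
To make two of the simpler rules concrete, here is a small Python sketch of the
Latin/Cyrillic/Greek mixing restriction from step 5 and the mixed numbering
system check from step 6. It is not Chrome's code: the function names are
hypothetical, and because Python's standard library does not expose the Unicode
script property, the script is approximated from the first word of each
character's Unicode name. That approximation is an assumption made for this
example only.

```python
import unicodedata

# Scripts that step 5 says must never be mixed with each other.
CONFUSABLE_SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")


def rough_script(ch: str):
    """Approximate a character's script from its Unicode name (assumption)."""
    name = unicodedata.name(ch, "")
    for script in CONFUSABLE_SCRIPTS:
        if name.startswith(script + " "):
            return script
    return None  # digits, hyphens, and all other scripts are ignored here


def mixes_latin_cyrillic_greek(label: str) -> bool:
    """True if the label combines at least two of Latin, Cyrillic, and Greek."""
    scripts = {rough_script(ch) for ch in label} - {None}
    return len(scripts) > 1


def mixes_digit_systems(label: str) -> bool:
    """True if decimal digits from two or more numbering systems appear."""
    zero_code_points = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":
            # Decimal digits of one system are contiguous, so subtracting the
            # digit value yields the code point of that system's zero.
            zero_code_points.add(ord(ch) - unicodedata.decimal(ch))
    return len(zero_code_points) > 1


if __name__ == "__main__":
    print(mixes_latin_cyrillic_greek("eb\u0430y"))  # Cyrillic "а" among Latin letters
    print(mixes_latin_cyrillic_greek("ebay"))
    print(mixes_digit_systems("1\u09e7"))           # European 1 + Bengali 1
```

A label written entirely in Cyrillic, such as the `аррӏе` example in step 10,
passes this mixing check. That is why the whole-script confusable test exists
as a separate step.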

## Additional Protections

In addition to the spoof checks above, Chrome also implements a full page
security warning to protect against lookalike URLs. You can find an example of
this warning at `chrome://interstitials/lookalike`. This warning blocks main
frame navigations that involve lookalike URLs, either as a direct navigation or
as part of a redirect.

The algorithm to show this warning is as follows:

1. If the scheme of the navigation is not `http` or `https`, allow
the navigation.

2. If the navigation is a redirect, check the redirect chain. If the redirect
chain is safe, allow the navigation. (See the Defensive Registrations section
for details.)

3. If the hostname of the navigation has at least a medium site engagement
score, allow the navigation. Site engagement score is assigned to sites by the
[Site Engagement
Service](https://www.chromium.org/developers/design-documents/site-engagement).

4. If the hostname of the navigation is in
[`domains.list`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/top_domains/domains.list),
allow the navigation.

5. If the user previously allowed the hostname of the navigation by clicking
"Ignore" in the warning, allow the navigation. Currently, user decisions are
stored per tab, so navigating to the same site in a new tab may show the
warning.

6. If the hostname has the same skeleton as a recently engaged site or a top 500
domain, block the navigation and show the warning.

All of these checks are done locally on the client side.
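
The ordering of these checks can be sketched in Python as follows. Everything
here is illustrative: `LookalikeContext`, `should_warn`, and `simple_skeleton`
are hypothetical names, the sets stand in for Chrome's engagement data and
`domains.list`, and the skeleton is approximated by lowercasing and stripping
combining marks instead of applying the full UTS 39 skeleton mapping.

```python
import unicodedata
from dataclasses import dataclass, field


def simple_skeleton(hostname: str) -> str:
    """Crude stand-in for a UTS 39 skeleton: lowercase and drop diacritics."""
    decomposed = unicodedata.normalize("NFD", hostname.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


@dataclass
class LookalikeContext:
    engaged_sites: set = field(default_factory=set)   # medium or higher engagement
    listed_domains: set = field(default_factory=set)  # stand-in for domains.list
    top_domains: set = field(default_factory=set)     # stand-in for the top 500
    ignored_hosts: set = field(default_factory=set)   # user clicked "Ignore"


def should_warn(scheme: str, hostname: str, ctx: LookalikeContext,
                redirect_chain_is_safe: bool = False) -> bool:
    """Return True if the navigation would be blocked with a warning."""
    if scheme not in ("http", "https"):   # step 1
        return False
    if redirect_chain_is_safe:            # step 2
        return False
    if hostname in ctx.engaged_sites:     # step 3
        return False
    if hostname in ctx.listed_domains:    # step 4
        return False
    if hostname in ctx.ignored_hosts:     # step 5
        return False
    # Step 6: compare skeletons against engaged sites and top domains.
    skeleton = simple_skeleton(hostname)
    targets = ctx.engaged_sites | ctx.top_domains
    return any(skeleton == simple_skeleton(target) for target in targets)


if __name__ == "__main__":
    ctx = LookalikeContext(listed_domains={"example.com"},
                           top_domains={"example.com"})
    print(should_warn("https", "éxample.com", ctx))  # True
    print(should_warn("https", "example.com", ctx))  # False
```

Every allow rule runs before the skeleton comparison, so a site the user
already engages with is never flagged as a lookalike of another site.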

### Defensive Registrations

Domain owners can sometimes register multiple versions of their domains, such
as the ASCII and IDN versions, to improve user experience and prevent potential
spoofs. We call these supplementary domains defensive registrations.

In some cases, Chrome's lookalike warning may flag and block navigations to
these domains:
- If one of the sites is in `domains.list` but the other isn't, the latter will
  be blocked.
- If the user engaged with one of the sites but not the other, the latter will
  be blocked.

### Avoiding a lookalike warning on your site

**Domain owners can avoid the "Did you mean" warning by redirecting their
defensive registrations to their canonical domain.**

**Example**: If you own both `example.com` and `éxample.com` and the majority of
your traffic is to `example.com`, you can fix the warning by redirecting
`éxample.com` to `example.com`. The lookalike warning logic considers this a
safe redirect and allows the navigation. If you must also redirect `http`
navigations to `https`, do this in a single redirect such as
`http://éxample.com -> https://example.com`. Use HTTP 301 or HTTP 302
redirects; the lookalike warning ignores meta redirects.
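
One way to serve such a redirect is sketched below as a minimal Python WSGI
application. The names `CANONICAL_ORIGIN`, `DEFENSIVE_HOSTS`, and `app` are
placeholders for this example, and a real deployment would more likely
configure the redirect in the web server or CDN. The point it illustrates is
that the defensive hostname answers with a single HTTP 301 that goes straight
to the canonical `https` origin.

```python
CANONICAL_ORIGIN = "https://example.com"

# Browsers send the ACE (punycode) form of an IDN in the Host header.
DEFENSIVE_HOSTS = {"éxample.com".encode("idna").decode("ascii")}


def app(environ, start_response):
    """Redirect defensive registrations to the canonical origin in one hop."""
    host = environ.get("HTTP_HOST", "").split(":")[0].lower()
    if host in DEFENSIVE_HOSTS:
        location = CANONICAL_ORIGIN + (environ.get("PATH_INFO") or "/")
        query = environ.get("QUERY_STRING")
        if query:
            location += "?" + query
        start_response("301 Moved Permanently",
                       [("Location", location), ("Content-Length", "0")])
        return [b""]
    # Requests for the canonical host are served normally.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"canonical site\n"]
```

For local experimentation, the application can be run with the standard
library's `wsgiref.simple_server.make_server`.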

## Reporting Security Bugs

We reward certain cases of IDN spoofs according to [Chrome's Vulnerability
Reward Program](https://www.google.com/about/appsecurity/chrome-rewards/index.html)
policies. Please see [this
document](https://docs.google.com/document/d/1_xJz3J9kkAPwk3pma6K3X12SyPTyyaJDSCxTfF8Y5sU/edit?usp=sharing)
before reporting a security bug.