# Internationalized Domain Names (IDN) in Google Chrome

## Background

Many years ago, domains could only consist of the Latin letters A to Z, digits,
and a few other characters. [Internationalized Domain Names
(IDNs)](https://en.wikipedia.org/wiki/Internationalized_domain_name) were
created to better support non-Latin alphabets for web users around the globe.

Different characters from different (or even the same!) languages can look very
similar. We’ve seen
[reports](https://bugs.chromium.org/p/chromium/issues/detail?id=683314) of
proof-of-concept attacks. These are called [homograph
attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack). For example, the
Latin "a" looks a lot like the Cyrillic "а", so someone could register
`http://ebаy.com` (using Cyrillic "`а`"), which could be confused for
`http://ebay.com`. This is a limitation of how URLs are displayed in browsers in
general, not a specific bug in Chrome.

In a perfect world, domain registrars would not allow these confusable domain
names to be registered. Some domain registrars do exactly that, mostly by
restricting the characters allowed, but many do not. To better protect against
these attacks, browsers display some domains in
[punycode](https://en.wikipedia.org/wiki/Punycode) (looks like `xn--...`)
instead of the original IDN, according to their own IDN policies.

This is a challenging problem space. Chrome has a global user base of billions
of people around the world, many of whom are not viewing URLs with Latin
letters. We want to prevent confusion, while ensuring that users across
languages have a great experience in Chrome. Displaying either punycode or a
visible security warning on too wide a set of URLs would hurt web usability
for people around the world.

Chrome and other browsers try to balance these needs by implementing IDN
policies in a way that allows IDN to be shown for valid domains, but protects
against confusable homograph attacks.

Chrome's IDN policy is one of several tools that aim to protect users.
[Google Safe Browsing](https://safebrowsing.google.com/) continues to help
protect over two billion devices every day by showing warnings to users when
they attempt to navigate to dangerous or deceptive sites or download dangerous
files. Password managers continue to remember which domain each saved login is
for, and won’t automatically fill a password into a domain that does not
exactly match.

## How IDN works

IDNs were devised to support arbitrary Unicode characters in hostnames in a
backward-compatible way. This works by having user agents transform hostnames
containing non-ASCII Unicode characters into an ASCII-only hostname, which can
then be sent on to DNS servers. This is done by encoding each domain label into
its punycode representation. This representation consists of a four-character
prefix (`xn--`) followed by the Unicode label translated to ASCII Compatible
Encoding (ACE). For example, `http://öbb.at` is transformed to
`http://xn--bb-eka.at`.

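The per-label transformation can be illustrated with a minimal Python sketch.
It uses the standard library's `punycode` codec and assumes the hostname is
already lowercased and normalized; it does not perform the UTS 46 mapping that
a real user agent applies first. The helper names `to_ace` and `to_unicode`
are made up for this example.

```python
def to_ace(hostname: str) -> str:
    """Encode each non-ASCII label with punycode and add the xn-- prefix."""
    labels = []
    for label in hostname.split("."):
        if label.isascii():
            labels.append(label)  # ASCII labels are left unchanged
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


def to_unicode(hostname: str) -> str:
    """Decode every xn-- label back to Unicode; leave other labels alone."""
    labels = []
    for label in hostname.split("."):
        if label.lower().startswith("xn--"):
            labels.append(label[4:].encode("ascii").decode("punycode"))
        else:
            labels.append(label)
    return ".".join(labels)


print(to_ace("öbb.at"))             # xn--bb-eka.at
print(to_unicode("xn--bb-eka.at"))  # öbb.at
```

Only the label containing non-ASCII characters changes; the `at` label is
passed through as-is, which is what keeps the scheme backward compatible.
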
## Google Chrome's IDN policy

Since Chrome 51, Chrome has used an IDN display policy that does not take into
account the language settings (the Accept-Language list) of the browser. A
[similar strategy](https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm) is
used by Firefox.

Google Chrome decides if it should show Unicode or punycode for each domain
label (component) of a hostname separately. To decide if a component should be
shown in Unicode, Google Chrome uses the following algorithm:

1. Convert each component stored in the ACE to Unicode per [UTS 46 transitional
   processing](http://unicode.org/reports/tr46/#Processing) (`ToUnicode`).

2. If there is an error in `ToUnicode` conversion (e.g. contains [disallowed
   characters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Auts46%3Ddisallowed%3A%5D&abb=on&g=&i=),
   [starts with a combining
   mark](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da390a6b3d9844a1dcc1f99fb1ae478ecf),
   or [violates BiDi
   rules](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da8a9311811fb0f3db1644ac1a88056370)),
   show punycode.

3. If there is a character in a label not belonging to [Characters allowed in
   identifiers](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AIdentifierStatus%3DAllowed%3A&abb=on&g=&i=)
   per [Unicode Technical Standard 39 (UTS
   39)](http://www.unicode.org/reports/tr39/#Identifier_Status_and_Type), show
   punycode.

4. If any character in a label belongs to [the disallowed
   list](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5Cu01CD-%5Cu01DC%5D+%5B%5Cu1c80-%5Cu1c8f%5D++%5B%5Cu1e90-%5Cu1e9b%5D++%5B%5Cu1f00-%5Cu1fff%5D++%5B%5Cua640-%5Cua69f%5D-%5B%5Cua720-%5Cua72f%5D+%5B%5Cu0338+%5Cu058a+%5Cu2010+%5Cu2019+%5Cu2027+%5Cu30a0+%5Cu02bb+%5Cu02bc+%5D&abb=on&g=&i=),
   show punycode.

5. If the component uses characters drawn from multiple scripts, it is subject
   to a script mixing check based on the ["Highly Restrictive" profile of UTS
   39](http://www.unicode.org/reports/tr39/#Restriction_Level_Detection) with an
   additional restriction on Latin. If the component fails the check, show the
   component in punycode.
   - Latin, Cyrillic or Greek characters cannot be mixed with each other.
   - Latin characters in the ASCII range can be mixed ONLY with Chinese (Han,
     Bopomofo), Japanese (Kanji, Katakana, Hiragana), or Korean (Hangul, Hanja).
   - Han (CJK Ideographs) can be mixed with Bopomofo.
   - Han can be mixed with Hiragana and Katakana.
   - Han can be mixed with Korean Hangul.

6. If two or more numbering systems (e.g. European digits + Bengali digits) are
   mixed, show punycode.

7. If there are any invisible characters (e.g. a sequence of the same combining
   mark or a sequence of Kana combining marks), show punycode.

8. If there are any characters used in an unusual way, show punycode. An
   example is [`LATIN MIDDLE DOT (·)`](https://unicode.org/cldr/utility/character.jsp?a=00B7)
   used outside [ela geminada](https://en.wiktionary.org/wiki/ela_geminada).

9. Test the label for [mixed script confusables per UTS
   39](http://unicode.org/reports/tr39/#Mixed_Script_Confusables). If a mixed
   script confusable is detected, show punycode.

10. Test the label for [whole script
    confusables](http://unicode.org/reports/tr39/#Whole_Script_Confusables): If all
    the letters in a given label belong to a set of whole-script-confusable letters
    in one of the [whole-script-confusable
    scripts](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=kWholeScriptConfusables&sq=package:chromium)
    and if the hostname doesn't have a corresponding
    [allowed top-level-domain](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.h?type=cs&q=allowed_tlds)
    for that script, show punycode.

    **Example for Cyrillic:**
    The first label in hostname `аррӏе.com` (`xn--80ak6aa92e.com`) is all [Cyrillic
    letters that look like Latin letters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%D0%B0%D1%81%D4%81%D0%B5%D2%BB%D1%96%D1%98%D3%8F%D0%BE%D1%80%D4%9B%D1%95%D4%9D%D1%85%D1%83%D1%8A%D0%AC%D2%BD%D0%BF%D0%B3%D1%B5%D1%A1%5D&g=gc&i=)
    **AND** the TLD (`com`) is not Cyrillic **AND** the TLD is not one of the TLDs
    known to host a large number of Cyrillic domains (e.g. `ru`, `su`, `pyc`, `ua`).
    Show it in punycode.

11. If the label contains only [digits and digit
    spoofs](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=IsDigitLookalike),
    show punycode.

12. If the label matches a [dangerous
    pattern](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&g=0&l=422),
    show punycode.

13. If the [skeleton](http://unicode.org/reports/tr39/#def-skeleton) of the
    registrable part of a hostname is identical to one of the top domains after
    removing diacritic marks and mapping each character to its spoofing skeleton
    (e.g. `www.googlé.com` with `é` in place of `e`), show punycode.

Otherwise, show Unicode.

This algorithm is implemented by `IDNToUnicodeOneComponent()` and
`IsIDNComponentSafe()` in
[`components/url_formatter/url_formatter.cc`](https://cs.chromium.org/search/?q=components/url_formatter/url_formatter.cc)
and the `IDNSpoofChecker` class in
[`components/url_formatter/spoof_checks/idn_spoof_checker.cc`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc).

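To make steps 5 and 6 of the algorithm more concrete, here is a rough Python
sketch. It is not Chrome's implementation, which relies on ICU. As a loudly
stated assumption, it approximates a character's script by the first word of
its Unicode character name, because Python's standard library does not expose
the Unicode Script property. The function names are made up for this example.

```python
import unicodedata
from typing import Optional

# Script combinations that may appear together, following the list in step 5.
# "CJK" is the name prefix unicodedata uses for Han ideographs.
ALLOWED_COMBINATIONS = [
    {"LATIN", "CJK", "BOPOMOFO"},
    {"LATIN", "CJK", "HIRAGANA", "KATAKANA"},
    {"LATIN", "CJK", "HANGUL"},
]


def rough_script(ch: str) -> Optional[str]:
    """Approximate a letter's script from its Unicode name (an assumption)."""
    if not ch.isalpha():
        return None  # digits, hyphens and marks are treated as script-neutral
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]


def passes_script_mixing(label: str) -> bool:
    """Return True if the label passes the simplified step 5 check."""
    scripts = set()
    latin_is_ascii = True
    for ch in label:
        script = rough_script(ch)
        if script is None:
            continue
        scripts.add(script)
        if script == "LATIN" and not ch.isascii():
            latin_is_ascii = False
    if len(scripts) <= 1:
        return True  # single-script labels are not subject to this check
    if "LATIN" in scripts and not latin_is_ascii:
        return False  # only ASCII Latin may be mixed with the CJK scripts
    return any(scripts <= combo for combo in ALLOWED_COMBINATIONS)


def mixes_numbering_systems(label: str) -> bool:
    """Return True if decimal digits from two or more systems appear (step 6)."""
    zeros = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":
            # Decimal digits are encoded in contiguous runs of ten, so the
            # code point of the run's zero identifies the numbering system.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return len(zeros) > 1


print(passes_script_mixing("eb\u0430y"))      # Latin + Cyrillic: False
print(passes_script_mixing("tokyo\u6771\u4eac"))  # ASCII Latin + Han: True
print(mixes_numbering_systems("1\u09e8"))     # European + Bengali: True
```

A label that fails either function would be shown in punycode under the policy
described above; the remaining steps need confusable data that this sketch
does not include.
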
## Additional Protections

In addition to the spoof checks above, Chrome also implements a full page
security warning to protect against lookalike URLs. You can find an example of
this warning at `chrome://interstitials/lookalike`. This warning blocks main
frame navigations that involve lookalike URLs, either as a direct navigation or
as part of a redirect.

The algorithm to show this warning is as follows:

1. If the scheme of the navigation is not `http` or `https`, allow
   the navigation.

2. If the navigation is a redirect, check the redirect chain. If the redirect
   chain is safe, allow the navigation. (See the Defensive Registrations section
   for details.)

3. If the hostname of the navigation has at least a medium site engagement
   score, allow the navigation. Site engagement score is assigned to sites by the
   [Site Engagement
   Service](https://www.chromium.org/developers/design-documents/site-engagement).

4. If the hostname of the navigation is in
   [`domains.list`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/top_domains/domains.list),
   allow the navigation.

5. If the user previously allowed the hostname of the navigation by clicking
   "Ignore" in the warning, allow the navigation. Currently, user decisions are
   stored per tab, so navigating to the same site in a new tab may show the
   warning.

6. If the hostname has the same skeleton as a recently engaged site or a top 500
   domain, block the navigation and show the warning.

All of these checks are done locally on the client side.

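The order of these checks can be sketched in Python as follows. This is an
illustration only, not Chrome's code. The `Navigation` type, the function and
parameter names, and the four-entry `CONFUSABLES` table are all hypothetical;
the skeleton function is a simplified stand-in for the UTS 39 skeleton, and
the sketch assumes that one set of engaged sites serves both step 3 and
step 6.

```python
import unicodedata
from dataclasses import dataclass

# Hypothetical, tiny stand-in for the UTS 39 confusables data (Cyrillic only).
CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p"}


def skeleton(hostname: str) -> str:
    """Strip diacritic marks, then map known lookalike characters to ASCII."""
    decomposed = unicodedata.normalize("NFD", hostname)
    no_marks = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return "".join(CONFUSABLES.get(c, c) for c in no_marks)


@dataclass
class Navigation:
    scheme: str
    hostname: str
    is_redirect: bool = False
    redirect_chain_is_safe: bool = False


def should_show_warning(nav, engaged_sites, domains_list, top_500, user_allowed):
    """Walk the six steps in order; the set arguments hold hostnames."""
    if nav.scheme not in ("http", "https"):
        return False  # step 1: only http and https are checked
    if nav.is_redirect and nav.redirect_chain_is_safe:
        return False  # step 2: safe redirect chain
    if nav.hostname in engaged_sites:
        return False  # step 3: at least medium site engagement
    if nav.hostname in domains_list:
        return False  # step 4: hostname is itself a listed domain
    if nav.hostname in user_allowed:
        return False  # step 5: user clicked "Ignore" earlier in this tab
    targets = {skeleton(host) for host in engaged_sites | top_500}
    return skeleton(nav.hostname) in targets  # step 6: skeleton match


top = {"google.com"}
spoof = Navigation("https", "googl\u00e9.com")
print(should_show_warning(spoof, set(), top, top, set()))  # True
```

Because every allow-listing step runs before the skeleton comparison, a site
the user already engages with is never flagged, even if its skeleton matches
a top domain.
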
### Defensive Registrations

Domain owners can sometimes register multiple versions of their domains, such
as the ASCII and IDN versions, to improve user experience and prevent potential
spoofs. We call these supplementary domains defensive registrations.

In some cases, Chrome's lookalike warning may flag and block navigations to
these domains:

- If one of the sites is in `domains.list` but the other isn't, the latter will
  be blocked.
- If the user engaged with one of the sites but not the other, the latter will
  be blocked.

### Avoiding a lookalike warning on your site

**Domain owners can avoid the "Did you mean" warning by redirecting their
defensive registrations to their canonical domain.**

**Example**: If you own both `example.com` and `éxample.com` and the majority of
your traffic is to `example.com`, you can fix the warning by redirecting
`éxample.com` to `example.com`. The lookalike warning logic considers this a
safe redirect and allows the navigation. If you must also redirect `http`
navigations to `https`, do this in a single redirect such as
`http://éxample.com -> https://example.com`. Use HTTP 301 or HTTP 302
redirects; the lookalike warning ignores meta redirects.

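As an illustration of the single-hop redirect, here is a minimal sketch using
Python's standard `http.server` module. It assumes the defensive registration
is served by its own small handler; the names `CANONICAL_ORIGIN`,
`DEFENSIVE_HOSTS`, `redirect_for` and `RedirectHandler` are made up for this
example, and a production site would more likely express the same rule in its
web server configuration.

```python
from http.server import BaseHTTPRequestHandler

CANONICAL_ORIGIN = "https://example.com"
# Browsers send the ACE (punycode) form of an IDN in the Host header, so the
# defensive hostname is converted once with the standard "idna" codec.
DEFENSIVE_HOSTS = {"éxample.com".encode("idna").decode("ascii")}


def redirect_for(host: str, path: str):
    """Return (status, location) for a defensive host, or None otherwise."""
    hostname = host.split(":")[0].lower()  # drop an optional port
    if hostname in DEFENSIVE_HOSTS:
        # One hop straight to the https canonical origin, keeping the path.
        return 301, CANONICAL_ORIGIN + path
    return None


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = redirect_for(self.headers.get("Host", ""), self.path)
        if target is None:
            self.send_error(404)
            return
        status, location = target
        self.send_response(status)
        self.send_header("Location", location)
        self.end_headers()

# To try it locally (port 8080 is an arbitrary choice):
#   from http.server import HTTPServer
#   HTTPServer(("", 8080), RedirectHandler).serve_forever()
```

The handler answers with a `Location` header rather than an HTML page, since
meta redirects are ignored by the lookalike warning.
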
## Reporting Security Bugs

We reward certain cases of IDN spoofs according to [Chrome's Vulnerability
Reward Program](https://www.google.com/about/appsecurity/chrome-rewards/index.html)
policies. Please see [this
document](https://docs.google.com/document/d/1_xJz3J9kkAPwk3pma6K3X12SyPTyyaJDSCxTfF8Y5sU/edit?usp=sharing)
before reporting a security bug.