RESOLVED FIXED 30437
REGRESSION: Japanese text search ignores small vs. large and voicing mark differences
https://bugs.webkit.org/show_bug.cgi?id=30437
Satoshi Nakagawa
Reported 2009-10-16 04:36:41 PDT
== Summary ==
In Japanese, 'ぁ' and 'あ' are always treated as different characters, and so are 'か' and 'が'. But Safari and Chrome treat them as the same characters when searching.

== Description ==
In English, abc and ABC are treated as the same in a case-insensitive context such as application search. In Japanese, however, "あった" and "あつた" are different words in every context, because in Japanese semantics 'っ' is NOT considered a small form of 'つ'. These characters are never treated as the same character.

In the current Unicode Collation Algorithm, っ and つ have the same order at the primary collation strength. WebKit uses the primary collation strength in ICU for its search.

I reported this problem on the Unicode mailing list. (http://unicode.org/mail-arch/unicode-ml/y2009-m10/0019.html)
Mark Davis replied to my report. (http://unicode.org/mail-arch/unicode-ml/y2009-m10/0022.html)

> UTS#10 does not necessarily match the sorting of any particular language.

This means we cannot use ICU's search function directly for application searches. The collation table needs some tailoring for some languages.

I wrote a patch for WebKit to add the following tailoring rules for Japanese text search. This patch doesn't cause any regressions in other languages.
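The conflation described in the report can be illustrated with a toy sketch. The function names and the three hand-picked mappings below are hypothetical stand-ins, not real UCA weights and not WebKit or ICU code; the point is only that a comparison on "primary" keys makes "あった" and "あつた" look equal even though the strings differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy stand-in for a primary-strength collation key. The mappings are
// hand-picked for illustration and are NOT real UCA weights.
char32_t toyPrimaryKey(char32_t c)
{
    switch (c) {
    case U'\u3041': return U'\u3042'; // small a folds to a
    case U'\u3063': return U'\u3064'; // small tsu folds to tsu
    case U'\u304C': return U'\u304B'; // ga folds to ka
    default: return c;
    }
}

// Compares two strings using only the toy primary keys, the way a
// primary-strength search would ignore small/large and voicing differences.
bool equalAtToyPrimaryStrength(const std::u32string& a, const std::u32string& b)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (toyPrimaryKey(a[i]) != toyPrimaryKey(b[i]))
            return false;
    }
    return true;
}
```

With these stand-ins, U"\u3042\u3063\u305F" (あった) and U"\u3042\u3064\u305F" (あつた) compare equal at the toy primary strength, which is the behavior the report objects to.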
Attachments
A patch to fix this problem (4.61 KB, patch)
2009-10-16 04:39 PDT, Satoshi Nakagawa
no flags
Revised patch (11.67 KB, patch)
2009-10-18 22:02 PDT, Satoshi Nakagawa
eric: commit-queue-
Revised patch 2 (11.88 KB, patch)
2009-10-19 23:57 PDT, Satoshi Nakagawa
no flags
Revised patch 3 (11.96 KB, patch)
2009-10-20 19:08 PDT, Satoshi Nakagawa
no flags
work in progress (8.33 KB, patch)
2010-01-07 17:56 PST, Darin Adler
no flags
patch (38.43 KB, patch)
2010-01-10 17:39 PST, Darin Adler
ap: review+
Satoshi Nakagawa
Comment 1 2009-10-16 04:39:05 PDT
Created attachment 41282 [details] A patch to fix this problem
Satoshi Nakagawa
Comment 2 2009-10-16 04:44:12 PDT
The tailoring rules are like this:

&ぁ=ァ=ァ<あ=ア=ア<ぃ=ィ=ィ<い=イ=イ<ぅ=ゥ=ゥ<う=ウ=ウ<ゔ=ヴ<ぇ=ェ=ェ<え=エ=エ<ぉ=ォ=ォ<お=オ=オ
<ゕ=ヵ<か=カ=カ<が=ガ<き=キ=キ<ぎ=ギ<く=ク=ク<ぐ=グ<ゖ=ヶ<け=ケ=ケ<げ=ゲ<こ=コ=コ<ご=ゴ
<さ=サ=サ<ざ=ザ<し=シ=シ<じ=ジ<す=ス=ス<ず=ズ<せ=セ=セ<ぜ=ゼ<そ=ソ=ソ<ぞ=ゾ
<た=タ=タ<だ=ダ<ち=チ=チ<ぢ=ヂ<っ=ッ=ッ<つ=ツ=ツ<づ=ヅ<て=テ=テ<で=デ<と=ト=ト<ど=ド
<な=ナ=ナ<に=ニ=ニ<ぬ=ヌ=ヌ<ね=ネ=ネ<の=ノ=ノ
<は=ハ=ハ<ば=バ<ぱ=パ<ひ=ヒ=ヒ<び=ビ<ぴ=ピ<ふ=フ=フ<ぶ=ブ<ぷ=プ<へ=ヘ=ヘ<べ=ベ<ぺ=ペ<ほ=ホ=ホ<ぼ=ボ<ぽ=ポ
<ま=マ=マ<み=ミ=ミ<む=ム=ム<め=メ=メ<も=モ=モ
<ゃ=ャ=ャ<や=ヤ=ヤ<ゅ=ュ=ュ<ゆ=ユ=ユ<ょ=ョ=ョ<よ=ヨ=ヨ
<ら=ラ=ラ<り=リ=リ<る=ル=ル<れ=レ=レ<ろ=ロ=ロ
<ゎ=ヮ<わ=ワ=ワ<ヷ<ゐ=ヰ<ヸ<ゑ=ヱ<を=ヲ=ヲ<ん=ン=ン

For example, "ぁ=ァ=ァ<あ=ア=ア" means:
- 'ぁ', 'ァ' and 'ァ' are in the same order.
- 'ぁ', 'ァ' and 'ァ' are smaller than 'あ', 'ア' and 'ア'.

You can see some screenshots that describe this problem and the effect of this patch.
http://limechat.net/report/webkit-search-problem.html
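As a rough illustration of how the two operators in these rules read, here is a minimal sketch that assigns a rank to each character: '=' keeps the current rank and '<' moves to the next one. The helper toyRanksFromRules is hypothetical and handles only '&', '=', '<' and spaces; it is not how ICU parses rule strings.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: maps each character in a rule string to a rank.
// Characters joined by '=' share a rank; '<' starts the next, larger rank.
std::map<char32_t, int> toyRanksFromRules(const std::u32string& rules)
{
    std::map<char32_t, int> ranks;
    int rank = 0;
    for (char32_t c : rules) {
        if (c == U'&' || c == U'=' || c == U' ')
            continue; // reset marker, "same order" operator, or spacing
        if (c == U'<') {
            ++rank; // characters after '<' sort after the preceding group
            continue;
        }
        ranks[c] = rank;
    }
    return ranks;
}
```

For the rule "&ぁ=ァ<あ=ア", this gives 'ぁ' and 'ァ' the same rank and gives 'あ' and 'ア' the next rank, matching the explanation in the comment.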
Alexey Proskuryakov
Comment 3 2009-10-16 15:54:23 PDT
A fix like this definitely needs to include a regression test. See <http://trac.webkit.org/browser/trunk/LayoutTests/fast/text/find-case-folding.html> for an example of how it could be done. Some more information about contributing code to WebKit is available at <http://webkit.org/coding/contributing.html>. This may seem like a lot to ask for, but it's vital for maintainability of the project. See also: bug 27587. Once you have a patch ready for review, please mark it as such by setting the review flag to "?".
Darin Adler
Comment 4 2009-10-16 17:58:11 PDT
Comment on attachment 41282 [details] A patch to fix this problem This code should go inside createSearcher(), not inside searcher().
Satoshi Nakagawa
Comment 5 2009-10-18 22:02:21 PDT
Created attachment 41396 [details] Revised patch
Satoshi Nakagawa
Comment 6 2009-10-18 22:07:35 PDT
Thanks for your comments. I've uploaded a revised patch. About bug 27587, IMO it's natural to treat "シ" and "㋛" as different characters. "アパート" and "㌀" as well.
Darin Adler
Comment 7 2009-10-19 15:26:19 PDT
Comment on attachment 41396 [details] Revised patch

> + * editing/TextIterator.cpp:
> + (WebCore::):
> + (WebCore::createSearcher):

The line saying just "WebCore::" should be deleted.

> +// Tailored collation rules for Japanese text search.
> +// The default Unicode Collation Algorithm is unnatural in Japanese.
> +// These rules intend to treat the following characters as different characters.
> +//
> +// - Small kana letters and normal kana letters
> +// - Voiceless letters, voiced letters and semi-voiced letters
> +//

This comment should document where this array came from. Is this original work or did you copy this here from some other project?

> +static const UChar JAPANESE_KANA_COLLATION_RULES[] = {

This array should not have a name that's in all capitals. Those names are reserved for macros.

> UErrorCode status = U_ZERO_ERROR;
> UStringSearch* searcher = usearch_open(&newlineCharacter, 1, &newlineCharacter, 1, currentSearchLocaleID(), 0, &status);
> ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);
> +
> + static UCollator* collator = 0;
> + if (!collator) {
> +     // Set tailored collation rules to fix Japanese text search.
> +     // See the comments before JAPANESE_KANA_COLLATION_RULES for details.
> +     status = U_ZERO_ERROR;
> +     collator = ucol_openRules(JAPANESE_KANA_COLLATION_RULES, -1, UCOL_DEFAULT,
> +         UCOL_DEFAULT_STRENGTH, 0, &status);
> +     ASSERT(status == U_ZERO_ERROR);
> +     status = U_ZERO_ERROR;
> +     usearch_setCollator(searcher, collator, &status);
> +     ASSERT(status == U_ZERO_ERROR);
> +     usearch_reset(searcher);
> + }

This is OK, but not quite right. The usearch_setCollator and usearch_reset calls should be outside the if statement, since they are part of the creation of searcher, not the creation of the collator. However, since the createSearcher function is only called once, there's no problem in practice.

The entire function should ideally be refactored to match the way the searcher() and createSearcher() functions work. There would be a collator() and createCollator() function, and createSearcher() would call collator().

Otherwise, the patch looks good. I'm going to say r=me because I think it's OK to land this patch as is, but I think it would be even better if you took my suggestions above.
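The suggested shape, with a collator()/createCollator() pair mirroring searcher()/createSearcher(), might look roughly like the sketch below. ToyCollator and ToySearcher are hypothetical stand-ins for ICU's UCollator and UStringSearch, and collatorsCreated exists only to make the behavior observable; this is not the code from the patch.

```cpp
#include <cassert>

// Stand-ins for ICU's UCollator and UStringSearch; hypothetical types.
struct ToyCollator { };
struct ToySearcher { const ToyCollator* collator = nullptr; };

int collatorsCreated = 0; // illustration only: counts createCollator() calls

ToyCollator* createCollator()
{
    ++collatorsCreated;
    return new ToyCollator; // shared object, kept for the life of the process
}

ToyCollator* collator()
{
    static ToyCollator* sharedCollator = createCollator();
    return sharedCollator;
}

ToySearcher* createSearcher()
{
    ToySearcher* newSearcher = new ToySearcher;
    // Attaching the collator is part of creating the searcher, so it is done
    // here unconditionally rather than inside the collator's lazy creation.
    newSearcher->collator = collator();
    assert(newSearcher->collator);
    return newSearcher;
}

ToySearcher* searcher()
{
    static ToySearcher* sharedSearcher = createSearcher();
    return sharedSearcher;
}
```

Each accessor creates its object on first use and returns the same pointer afterwards, so repeated calls to searcher() share one searcher and one collator.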
Eric Seidel (no email)
Comment 8 2009-10-19 16:54:46 PDT
Comment on attachment 41396 [details] Revised patch Nakagawa-san is not a committer, so I would mark this cq+, except Darin has asked for modifications. So best if Nakagawa-san could post a new patch with r=? and cq=?.
Satoshi Nakagawa
Comment 9 2009-10-19 23:57:39 PDT
Created attachment 41486 [details] Revised patch 2
Satoshi Nakagawa
Comment 10 2009-10-20 00:02:52 PDT
Thanks for reviewing. I've updated the patch per your comments.
WebKit Commit Bot
Comment 11 2009-10-20 09:43:13 PDT
Comment on attachment 41486 [details] Revised patch 2 Rejecting patch 41486 from commit-queue. Failed to run "['WebKitTools/Scripts/run-webkit-tests', '--no-launch-safari', '--quiet', '--exit-after-n-failures=1']" exit_code: 60 Last 500 characters of output: ages while\n\trunning as root. There are known race conditions that\n\twill allow any local user to read any file on the system.\n\tIf you still desire to serve pages as root then\n\tadd -DBIG_SECURITY_HOLE to the CFLAGS env variable\n\tand then rebuild the server.\n\tIt is strongly suggested that you instead modify the User\n\tdirective in your httpd.conf file to list a non-root\n\tuser.\n Timed out waiting for httpd to start at WebKitTools/Scripts/run-webkit-tests line 1359, <IN> line 30187.
Eric Seidel (no email)
Comment 12 2009-10-20 11:03:35 PDT
My apologies. Something is causing run-webkit-tests to fail on the commit bot. I'm not sure what yet.
Eric Seidel (no email)
Comment 13 2009-10-20 11:14:44 PDT
Comment on attachment 41486 [details] Revised patch 2 I've fixed this commit-queue. Again my apologies.
WebKit Commit Bot
Comment 14 2009-10-20 12:00:49 PDT
Comment on attachment 41486 [details] Revised patch 2 Clearing flags on attachment: 41486 Committed r49876: <http://trac.webkit.org/changeset/49876>
WebKit Commit Bot
Comment 15 2009-10-20 12:00:55 PDT
All reviewed patches have been landed. Closing bug.
Satoshi Nakagawa
Comment 16 2009-10-20 14:11:07 PDT
Thanks!
Mark Rowe (bdash)
Comment 17 2009-10-20 17:20:38 PDT
After this change the SnowLeopard debug bot is hitting numerous assertion failures with stack traces like so:

Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00000000bbadbeef
Crashed Thread: 0 Dispatch queue: com.apple.main-thread

Thread 0 Crashed: Dispatch queue: com.apple.main-thread
0 com.apple.WebCore 0x000000010163333c WebCore::createSearcher() + 216 (TextIterator.cpp:1534)
1 com.apple.WebCore 0x000000010163336f WebCore::searcher() + 23 (TextIterator.cpp:1541)
2 com.apple.WebCore 0x000000010163a27f WebCore::SearchBuffer::SearchBuffer(WebCore::String const&, bool) + 247 (TextIterator.cpp:1581)
3 com.apple.WebCore 0x0000000101636736 WebCore::findPlainText(WebCore::CharacterIterator&, WebCore::String const&, bool, bool, unsigned long&) + 81 (TextIterator.cpp:1983)
4 com.apple.WebCore 0x0000000101637bff WebCore::findPlainText(WebCore::Range const*, WebCore::String const&, bool, bool) + 116 (TextIterator.cpp:2015)

I’m going to verify that I can reproduce this locally and may end up rolling this change out.
Mark Rowe (bdash)
Comment 18 2009-10-20 17:34:33 PDT
When doing ‘run-webkit-tests editing’ I hit this assertion failure on seven tests.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x00000000bbadbeef
0x00000001016360ac in WebCore::createSearcher () at WebCore/editing/TextIterator.cpp:1534
1534 ASSERT(status == U_ZERO_ERROR);
(gdb) print status
$1 = U_USING_DEFAULT_WARNING
Current language: auto; currently c++
(gdb)

I’m going to roll this out for now since we can’t leave debug builds in a broken state.
Mark Rowe (bdash)
Comment 19 2009-10-20 17:43:20 PDT
Rolled out in r49876.
Mark Rowe (bdash)
Comment 20 2009-10-20 17:53:42 PDT
Maybe the solution is to modify the assertion to be the same as the one a few lines up rather than requiring U_ZERO_ERROR.
Satoshi Nakagawa
Comment 21 2009-10-20 19:08:45 PDT
Created attachment 41543 [details] Revised patch 3 I'm sorry it caused an assertion failure. Here is a revised patch.
Darin Adler
Comment 22 2009-10-21 00:03:02 PDT
Comment on attachment 41543 [details] Revised patch 3

> + UCollator* collator = ucol_openRules(japaneseKanaCollationRules, -1, UCOL_DEFAULT, UCOL_DEFAULT_STRENGTH, 0, &status);
> + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);

Are all three of these expected here for some reason, or are you just ignoring the same set I ignored for usearch_open?

> + usearch_setCollator(searcher, collator(), &status);
> + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);

Same question.

r=me, but I would like to hear the answer to that question at some point.
WebKit Commit Bot
Comment 23 2009-10-21 00:11:16 PDT
Comment on attachment 41543 [details] Revised patch 3 Clearing flags on attachment: 41543 Committed r49899: <http://trac.webkit.org/changeset/49899>
WebKit Commit Bot
Comment 24 2009-10-21 00:11:22 PDT
All reviewed patches have been landed. Closing bug.
Satoshi Nakagawa
Comment 25 2009-10-21 00:49:03 PDT
(In reply to comment #22)
> (From update of attachment 41543 [details])
> > + UCollator* collator = ucol_openRules(japaneseKanaCollationRules, -1, UCOL_DEFAULT, UCOL_DEFAULT_STRENGTH, 0, &status);
> > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);
>
> Are all three of these expected here for some reason, or are you just ignoring
> the same set I ignored for usearch_open?

It's just for ignoring the same set as for usearch_open. I read the ICU documentation, and I thought it was better to ignore them.
http://icu-project.org/apiref/icu4c/ucol_8h.html#a128ea0ed3869415c1c96a9a2c997c2d
http://icu-project.org/apiref/icu4c/utypes_8h.html#3343c1c8a8377277046774691c98d78c
Probably, using U_SUCCESS() would be better for the assertion.

> > + usearch_setCollator(searcher, collator(), &status);
> > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);
>
> Same question.
>
> r=me, but I would like to hear the answer to that question at some point.

The same here.
Darin Adler
Comment 26 2009-10-21 08:21:09 PDT
(In reply to comment #25)
> Probably, using U_SUCCESS() would be better for assertion.

I prefer having the more specific assertion, because we want to know if we're getting new, unexpected error codes there. Sorry it caused a problem in this case. I think this is good now.
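The trade-off between the two assertion styles can be sketched as follows. ToyStatus is a hypothetical stand-in for ICU's UErrorCode; the numeric values are assumed to mirror ICU's convention that warnings are negative, success is zero, and errors are positive. isSuccess plays the role of a U_SUCCESS()-style check, and isExpectedStatus plays the role of the specific three-way assertion.

```cpp
#include <cassert>

// Hypothetical stand-in for ICU's UErrorCode. Values are assumed to follow
// ICU's convention: warnings < 0, success == 0, errors > 0.
enum ToyStatus : int {
    ToyUsingFallbackWarning = -128,
    ToyUsingDefaultWarning = -127,
    ToyOtherWarning = -126, // some warning nobody anticipated
    ToyZeroError = 0,
    ToyBufferOverflowError = 15,
};

// Broad check in the style of U_SUCCESS(): any warning counts as success.
bool isSuccess(ToyStatus status)
{
    return status <= ToyZeroError;
}

// Specific check: only the statuses that are known and expected pass.
bool isExpectedStatus(ToyStatus status)
{
    return status == ToyZeroError
        || status == ToyUsingFallbackWarning
        || status == ToyUsingDefaultWarning;
}

// Mirrors the assertion style preferred in the comment above.
void checkStatus(ToyStatus status)
{
    assert(isExpectedStatus(status));
}
```

With these stand-ins, ToyOtherWarning passes isSuccess but fails isExpectedStatus, which is the property being argued for: the specific assertion surfaces new, unexpected codes that the broad check would silently accept.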
Jungshik Shin
Comment 27 2009-10-21 15:22:02 PDT
Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish). Before the change, the search collator was locale-dependent. That is, if you use Japanese Safari, you'd get the collator tailored for Japanese (which I think does the right thing for Japanese). The same is true of other languages.

The intent of the patch is to make Japanese search work *regardless of the current locale (UI language)*, which I support. However, it should be done without breaking search in Swedish, German, Finnish or any other language. The patch here breaks that because it throws away the locale-dependent collator right after creating one and replaces it with the UCA + Japanese tailoring.

The way to do that is to get the collation rule string for the current locale, combine it with the Japanese tailoring (added in the patch), and build a new collator from the combined rule strings. (I just went over the patch with one of ICU's leading contributors.) I'll make a patch to do the above in a couple of days.
Darin Adler
Comment 28 2009-10-21 15:30:59 PDT
(In reply to comment #27)
> Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish).

I can't believe I missed that. We may need to roll the patch out until we have a method that doesn't break all the other languages.
Darin Adler
Comment 29 2009-10-21 15:34:09 PDT
(In reply to comment #27)
> The way to do that is to get the collation rule string for the current locale
> and combine it with the Japanese tailoring (added in the patch) and build a new
> collator from the combined rule strings. (I just went over the patch with one
> of ICU's leading contributors).

For what it's worth, I tried to do exactly that when I added the code to fold quote marks. But as I mention in a comment, I was unable to do this, even with the help of some ICU experts here at Apple. Here's hoping you are able to do what I was not! See <http://trac.webkit.org/changeset/45858> for the comment and code. If you are able to add custom tailoring, then perhaps we can do the quote marks that way.
Jungshik Shin
Comment 30 2009-10-21 15:40:11 PDT
(In reply to comment #28)
> (In reply to comment #27)
> > Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish).
>
> I can't believe I missed that. We may need to roll the patch out until we have
> a method that doesn't break all the other languages.

It'd be great if you could roll it out. Then I won't have to rush to make a new ICU data file for Chrome with the invuca table (which I excluded to save some space) :-). Chrome is crashing due to assertions because it doesn't have the invuca table.

Without noticing that you had added the last two comments, I filed a new bug 30646 to fix the regression. If you think it's better to roll out the patch, perhaps it's better to work out a better solution for the original issue here instead of in bug 30646. Either way is fine with me.
Darin Adler
Comment 31 2009-10-21 15:42:48 PDT
What is the invuca table, and why does Chrome need it, but not other WebKit-based browsers?
Jungshik Shin
Comment 32 2009-10-21 15:56:06 PDT
(In reply to comment #31)
> What is the invuca table, and why does Chrome need it, but not other
> WebKit-based browsers?

I was not clear. invuca is needed by every ICU-dependent WebKit port IF a collator is built at run-time (as is done with the patch for this bug). Chrome excluded it from its ICU data file because neither WebKit nor Chrome's other components build a collator with custom rules at run-time. With the patch here (or a new patch addressing the regression), that's not the case any more.

Among other ICU-dependent WebKit ports, Safari on OS X is not affected because it gets the *full* ICU data from the OS. Safari on Win could have cut down the download size a little bit by excluding invuca, but apparently it didn't. So, it's not affected, either.
Darin Adler
Comment 33 2009-10-21 16:11:04 PDT
Rolled the change out in r49926. Reopening the bug.
Eric Seidel (no email)
Comment 34 2009-10-26 12:06:32 PDT
Comment on attachment 41396 [details] Revised patch Clearing darin's r+ on this obsolete patch.
mitz
Comment 35 2009-10-30 16:43:20 PDT
Darin Adler
Comment 36 2009-10-30 16:46:36 PDT
If we can't fix this with tailoring, maybe there's some other simple way to fix it. I’d really like to resolve this somehow!
Darin Adler
Comment 37 2010-01-06 16:52:34 PST
Does anyone have any ideas about how to solve this?
Darin Adler
Comment 38 2010-01-06 16:59:19 PST
I am going to tackle this.
Darin Adler
Comment 39 2010-01-07 17:56:09 PST
Created attachment 46103 [details] work in progress
Darin Adler
Comment 40 2010-01-10 17:39:32 PST
Alexey Proskuryakov
Comment 41 2010-01-10 23:02:25 PST
Comment on attachment 46249 [details] patch

> +static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer)
> +{
> + ASSERT(length);
> +
> + UErrorCode status = U_ZERO_ERROR;
> + size_t bufferSize = unorm_normalize(characters, length, UNORM_NFC, 0, 0, 0, &status);
> + ASSERT(status == U_BUFFER_OVERFLOW_ERROR);
> + ASSERT(bufferSize);

Would it make sense to try with an output buffer of size length, to avoid having two passes in most cases?

> Index: LayoutTests/fast/text/find-kana.html

I need to add a similar test for Russian!

r=me
Darin Adler
Comment 42 2010-01-11 07:54:56 PST
(In reply to comment #41)
> Would it make sense to try with an output buffer of size length, to avoid
> having two passes in most cases?

Yes.

> I need to add a similar test for Russian!

Please do.
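The "try once with a buffer of size length, fall back to a second pass only on overflow" idea could be sketched like this. toyTransform is a hypothetical stand-in for a preflighting API such as the unorm_normalize call quoted above: it returns the length the output needs and writes only when the output fits. To force growth in the demo it doubles every u'x', which real NFC normalization does not do. toyPasses is there only to count passes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

int toyPasses = 0; // illustration only: counts calls to toyTransform()

// Hypothetical stand-in for a preflighting transform. Returns the required
// output length; writes the output only if it fits, else sets overflow.
std::size_t toyTransform(const char16_t* source, std::size_t length,
    char16_t* destination, std::size_t capacity, bool& overflow)
{
    ++toyPasses;
    std::size_t needed = 0;
    for (std::size_t i = 0; i < length; ++i)
        needed += source[i] == u'x' ? 2 : 1;
    overflow = needed > capacity;
    if (overflow)
        return needed;
    std::size_t out = 0;
    for (std::size_t i = 0; i < length; ++i) {
        destination[out++] = source[i];
        if (source[i] == u'x')
            destination[out++] = u'x';
    }
    return needed;
}

void transformIntoBuffer(const char16_t* characters, std::size_t length,
    std::vector<char16_t>& buffer)
{
    assert(length);
    buffer.resize(length); // optimistic guess: output is no longer than input
    bool overflow = false;
    std::size_t needed = toyTransform(characters, length, buffer.data(), buffer.size(), overflow);
    buffer.resize(needed); // shrink to fit, or grow for the second pass
    if (overflow) {
        toyTransform(characters, length, buffer.data(), buffer.size(), overflow);
        assert(!overflow);
    }
}
```

When the output fits in the optimistic buffer, only one pass runs; the second pass happens only for inputs whose output is longer than the input.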
Darin Adler
Comment 44 2010-01-11 10:41:43 PST
*** Bug 27587 has been marked as a duplicate of this bug. ***
Eric Seidel (no email)
Comment 45 2010-01-11 12:07:44 PST
Darin Adler
Comment 47 2010-01-11 13:02:47 PST
Tiger and Leopard are failing because they are using an old ICU. Qt is failing because it's using the non-ICU search code path.
Darin Adler
Comment 48 2010-01-11 13:03:13 PST
In both cases the best solution is checking in expected failure results.