WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
30437
REGRESSION: Japanese text search ignores small vs. large and voicing mark differences
https://bugs.webkit.org/show_bug.cgi?id=30437
Summary
REGRESSION: Japanese text search ignores small vs. large and voicing mark dif...
Satoshi Nakagawa
Reported
2009-10-16 04:36:41 PDT
== Summary == In Japanese, 'ぁ' and 'あ' are treated as different characters in anytime. 'か' and 'が' are as well. But in Safari and Chrome, they are treated as the same characters in its search. == Description == As you know in English, abc and ABC are treated as the same in a case insensitive context like application searches. But in Japanese, for example, "あった" and "あつた" are always different words in any contexts. Because in Japanese semantics, 'っ' is NOT considered as a small form of 'つ'. These characters are never treated as the same characters. In the current Unicode Collation Algorithm, っ and つ are in the same order in the primary collation strength. WebKit uses the primary collation strength in ICU for its search. I reported this problem in the Unicode ML. (
http://unicode.org/mail-arch/unicode-ml/y2009-m10/0019.html
) Mark Davis replied to my report. (
http://unicode.org/mail-arch/unicode-ml/y2009-m10/0022.html
)
> UTS#10 does not necessarily match the sorting of any particular language.
It means we cannot use ICU's search function directly for application searches. It needs some tailoring in the collation table for some languages. I wrote a patch for WebKit to add the following tailoring rules for Japanese text search. This patch doesn't have any regression in the other languages.
Attachments
A patch to fix this problem
(4.61 KB, patch)
2009-10-16 04:39 PDT
,
Satoshi Nakagawa
no flags
Details
Formatted Diff
Diff
Revised patch
(11.67 KB, patch)
2009-10-18 22:02 PDT
,
Satoshi Nakagawa
eric
: commit-queue-
Details
Formatted Diff
Diff
Revised patch 2
(11.88 KB, patch)
2009-10-19 23:57 PDT
,
Satoshi Nakagawa
no flags
Details
Formatted Diff
Diff
Revised patch 3
(11.96 KB, patch)
2009-10-20 19:08 PDT
,
Satoshi Nakagawa
no flags
Details
Formatted Diff
Diff
work in progress
(8.33 KB, patch)
2010-01-07 17:56 PST
,
Darin Adler
no flags
Details
Formatted Diff
Diff
patch
(38.43 KB, patch)
2010-01-10 17:39 PST
,
Darin Adler
ap
: review+
Details
Formatted Diff
Diff
Show Obsolete
(5)
View All
Add attachment
proposed patch, testcase, etc.
Satoshi Nakagawa
Comment 1
2009-10-16 04:39:05 PDT
Created
attachment 41282
[details]
A patch to fix this problem
Satoshi Nakagawa
Comment 2
2009-10-16 04:44:12 PDT
The tailoring rules are like this: &ぁ=ァ=ァ<あ=ア=ア<ぃ=ィ=ィ<い=イ=イ<ぅ=ゥ=ゥ<う=ウ=ウ<ゔ=ヴ<ぇ=ェ=ェ<え=エ=エ<ぉ=ォ=ォ<お=オ=オ <ゕ=ヵ<か=カ=カ<が=ガ<き=キ=キ<ぎ=ギ<く=ク=ク<ぐ=グ<ゖ=ヶ<け=ケ=ケ<げ=ゲ<こ=コ=コ<ご=ゴ <さ=サ=サ<ざ=ザ<し=シ=シ<じ=ジ<す=ス=ス<ず=ズ<せ=セ=セ<ぜ=ゼ<そ=ソ=ソ<ぞ=ゾ <た=タ=タ<だ=ダ<ち=チ=チ<ぢ=ヂ<っ=ッ=ッ<つ=ツ=ツ<づ=ヅ<て=テ=テ<で=デ<と=ト=ト<ど=ド <な=ナ=ナ<に=ニ=ニ<ぬ=ヌ=ヌ<ね=ネ=ネ<の=ノ=ノ <は=ハ=ハ<ば=バ<ぱ=パ<ひ=ヒ=ヒ<び=ビ<ぴ=ピ<ふ=フ=フ<ぶ=ブ<ぷ=プ<へ=ヘ=ヘ<べ=ベ<ぺ=ペ<ほ=ホ=ホ<ぼ=ボ<ぽ=ポ <ま=マ=マ<み=ミ=ミ<む=ム=ム<め=メ=メ<も=モ=モ <ゃ=ャ=ャ<や=ヤ=ヤ<ゅ=ュ=ュ<ゆ=ユ=ユ<ょ=ョ=ョ<よ=ヨ=ヨ <ら=ラ=ラ<り=リ=リ<る=ル=ル<れ=レ=レ<ろ=ロ=ロ <ゎ=ヮ<わ=ワ=ワ<ヷ<ゐ=ヰ<ヸ<ゑ=ヱ<を=ヲ=ヲ<ん=ン=ン For example, "ぁ=ァ=ァ<あ=ア=ア" means: - 'ぁ', 'ァ' and 'ァ' are in the same order. - 'ぁ', 'ァ' and 'ァ' are smaller than 'あ', 'ア' and 'ア'. You can see some screenshots that describe this problem and the effect of this patch.
http://limechat.net/report/webkit-search-problem.html
Alexey Proskuryakov
Comment 3
2009-10-16 15:54:23 PDT
A fix like this definitely needs include a regression test. See <
http://trac.webkit.org/browser/trunk/LayoutTests/fast/text/find-case-folding.html
> for an example of how it could be done. Some more information about contributing code to WebKit is available at <
http://webkit.org/coding/contributing.html
>. This may seem like a lot to ask for, but it's vital for maintainability of the project. See also:
bug 27587
. Once you have a patch ready for review, please mark it as such by setting review flag to "?".
Darin Adler
Comment 4
2009-10-16 17:58:11 PDT
Comment on
attachment 41282
[details]
A patch to fix this problem This code should go inside createSearcher(), not inside searcher().
Satoshi Nakagawa
Comment 5
2009-10-18 22:02:21 PDT
Created
attachment 41396
[details]
Revised patch
Satoshi Nakagawa
Comment 6
2009-10-18 22:07:35 PDT
Thanks for your comments. I've uploaded a revised patch. About
bug 27587
, IMO it's natural to treat "シ" and "㋛" as different charactes. "アパート" and "㌀" as well.
Darin Adler
Comment 7
2009-10-19 15:26:19 PDT
Comment on
attachment 41396
[details]
Revised patch
> + * editing/TextIterator.cpp: > + (WebCore::): > + (WebCore::createSearcher):
The line saying just "WebCore::" should be deleted.
> +// Tailored collation rules for Japanese text search. > +// The default Unicode Collation Algorithm is unnatural in Japanese. > +// These rules intend to treat the following characters as different characters. > +// > +// - Small kana letters and normal kana letters > +// - Voiceless letters, voiced letters and semi-voiced letters > +//
This comment should document where this array came from. Is this original work or did you copy this here from some other project?
> +static const UChar JAPANESE_KANA_COLLATION_RULES[] = {
This array should not have a name that's in all capitals. Those names are reserved for macros.
> UErrorCode status = U_ZERO_ERROR; > UStringSearch* searcher = usearch_open(&newlineCharacter, 1, &newlineCharacter, 1, currentSearchLocaleID(), 0, &status); > ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); > + > + static UCollator* collator = 0; > + if (!collator) { > + // Set tailored collation rules to fix Japanese text search. > + // See the comments before JAPANESE_KANA_COLLATION_RULES for details. > + status = U_ZERO_ERROR; > + collator = ucol_openRules(JAPANESE_KANA_COLLATION_RULES, -1, UCOL_DEFAULT, > + UCOL_DEFAULT_STRENGTH, 0, &status); > + ASSERT(status == U_ZERO_ERROR); > + status = U_ZERO_ERROR; > + usearch_setCollator(searcher, collator, &status); > + ASSERT(status == U_ZERO_ERROR); > + usearch_reset(searcher); > + }
This is OK, but not quite right. The usearch_setCollator and usearch_reset calls should be outside the if statement, since they are part of the creation of searcher, not the creation of the collator. However, since the createSearcher function is only called once, there's no problem in practice. The entire function should ideally be refactored to match the way the searcher() and createSearcher() function work. There would be a collator() and createCollator() function, and createSearcher() would call collator(). Otherwise, the patch looks good. I'm going to say r=me because I think it's OK to land this patch as is, but I think it would be even better if you took my suggestions above.
Eric Seidel (no email)
Comment 8
2009-10-19 16:54:46 PDT
Comment on
attachment 41396
[details]
Revised patch Nakagawa-san is not a committer, so I would mark this cq+, except Darin has asked for modifications. So best if Nakagawa-san could post a new patch with r=? and cq=?.
Satoshi Nakagawa
Comment 9
2009-10-19 23:57:39 PDT
Created
attachment 41486
[details]
Revised patch 2
Satoshi Nakagawa
Comment 10
2009-10-20 00:02:52 PDT
Thanks for reviewing. I've updated the patch per your comments.
WebKit Commit Bot
Comment 11
2009-10-20 09:43:13 PDT
Comment on
attachment 41486
[details]
Revised patch 2 Rejecting patch 41486 from commit-queue. Failed to run "['WebKitTools/Scripts/run-webkit-tests', '--no-launch-safari', '--quiet', '--exit-after-n-failures=1']" exit_code: 60 Last 500 characters of output: ages while\n\trunning as root. There are known race conditions that\n\twill allow any local user to read any file on the system.\n\tIf you still desire to serve pages as root then\n\tadd -DBIG_SECURITY_HOLE to the CFLAGS env variable\n\tand then rebuild the server.\n\tIt is strongly suggested that you instead modify the User\n\tdirective in your httpd.conf file to list a non-root\n\tuser.\n Timed out waiting for httpd to start at WebKitTools/Scripts/run-webkit-tests line 1359, <IN> line 30187.
Eric Seidel (no email)
Comment 12
2009-10-20 11:03:35 PDT
My apologies. Something is causing run-webkit-tests to fail on the commit bot. I'm not sure what yet.
Eric Seidel (no email)
Comment 13
2009-10-20 11:14:44 PDT
Comment on
attachment 41486
[details]
Revised patch 2 I've fixed this commit-queue. Again my apologies.
WebKit Commit Bot
Comment 14
2009-10-20 12:00:49 PDT
Comment on
attachment 41486
[details]
Revised patch 2 Clearing flags on attachment: 41486 Committed
r49876
: <
http://trac.webkit.org/changeset/49876
>
WebKit Commit Bot
Comment 15
2009-10-20 12:00:55 PDT
All reviewed patches have been landed. Closing bug.
Satoshi Nakagawa
Comment 16
2009-10-20 14:11:07 PDT
Thanks!
Mark Rowe (bdash)
Comment 17
2009-10-20 17:20:38 PDT
After this change the SnowLeopard debug bot is hitting numerous assertion failures with stack traces like so: Exception Type: EXC_BAD_ACCESS (SIGSEGV) Exception Codes: KERN_INVALID_ADDRESS at 0x00000000bbadbeef Crashed Thread: 0 Dispatch queue: com.apple.main-thread Thread 0 Crashed: Dispatch queue: com.apple.main-thread 0 com.apple.WebCore 0x000000010163333c WebCore::createSearcher() + 216 (TextIterator.cpp:1534) 1 com.apple.WebCore 0x000000010163336f WebCore::searcher() + 23 (TextIterator.cpp:1541) 2 com.apple.WebCore 0x000000010163a27f WebCore::SearchBuffer::SearchBuffer(WebCore::String const&, bool) + 247 (TextIterator.cpp:1581) 3 com.apple.WebCore 0x0000000101636736 WebCore::findPlainText(WebCore::CharacterIterator&, WebCore::String const&, bool, bool, unsigned long&) + 81 (TextIterator.cpp:1983) 4 com.apple.WebCore 0x0000000101637bff WebCore::findPlainText(WebCore::Range const*, WebCore::String const&, bool, bool) + 116 (TextIterator.cpp:2015) I’m going to verify that I can reproduce this locally and may end up rolling this change out.
Mark Rowe (bdash)
Comment 18
2009-10-20 17:34:33 PDT
When doing ‘run-webkit-tests editing’ I hit this assertion failure on seven tests. Program received signal EXC_BAD_ACCESS, Could not access memory. Reason: KERN_INVALID_ADDRESS at address: 0x00000000bbadbeef 0x00000001016360ac in WebCore::createSearcher () at WebCore/editing/TextIterator.cpp:1534 1534 ASSERT(status == U_ZERO_ERROR); (gdb) print status $1 = U_USING_DEFAULT_WARNING Current language: auto; currently c++ (gdb) I’m going to roll this out for now since we can’t leave debug builds in a broken state.
Mark Rowe (bdash)
Comment 19
2009-10-20 17:43:20 PDT
Rolled out in
r49876
.
Mark Rowe (bdash)
Comment 20
2009-10-20 17:53:42 PDT
Maybe the solution is to modify the assertion to be the same as the one a few lines up rather than requiring U_ZERO_ERROR.
Satoshi Nakagawa
Comment 21
2009-10-20 19:08:45 PDT
Created
attachment 41543
[details]
Revised patch 3 I'm sorry it caused an assertion failure. Here is a revised patch.
Darin Adler
Comment 22
2009-10-21 00:03:02 PDT
Comment on
attachment 41543
[details]
Revised patch 3
> + UCollator* collator = ucol_openRules(japaneseKanaCollationRules, -1, UCOL_DEFAULT, UCOL_DEFAULT_STRENGTH, 0, &status); > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);
Are all three of these expected here for some reason, or are you just ignoring the same set I ignored for usearch_open?
> + usearch_setCollator(searcher, collator(), &status); > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);
Same question. r=me, but I would like to hear the answer to that question at some point.
WebKit Commit Bot
Comment 23
2009-10-21 00:11:16 PDT
Comment on
attachment 41543
[details]
Revised patch 3 Clearing flags on attachment: 41543 Committed
r49899
: <
http://trac.webkit.org/changeset/49899
>
WebKit Commit Bot
Comment 24
2009-10-21 00:11:22 PDT
All reviewed patches have been landed. Closing bug.
Satoshi Nakagawa
Comment 25
2009-10-21 00:49:03 PDT
(In reply to
comment #22
)
> (From update of
attachment 41543
[details]
) > > + UCollator* collator = ucol_openRules(japaneseKanaCollationRules, -1, UCOL_DEFAULT, UCOL_DEFAULT_STRENGTH, 0, &status); > > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); > > Are all three of these expected here for some reason, or are you just ignoring > the same set I ignored for usearch_open?
It's just for ignoring the same set as for usearch_open. I read ICU documents, and I thought it's better to ignore them.
http://icu-project.org/apiref/icu4c/ucol_8h.html#a128ea0ed3869415c1c96a9a2c997c2d
http://icu-project.org/apiref/icu4c/utypes_8h.html#3343c1c8a8377277046774691c98d78c
Probably, using U_SUCCESS() would be better for assertion.
> > + usearch_setCollator(searcher, collator(), &status); > > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); > > Same question. > > r=me, but I would like to hear the answer to that question at some point.
The same here.
Darin Adler
Comment 26
2009-10-21 08:21:09 PDT
(In reply to
comment #25
)
> Probably, using U_SUCCESS() would be better for assertion.
I prefer having the more specific assertion, because we want to know if we're getting new, unexpected error codes there. Sorry it caused a problem in this case. I think this is good now.
Jungshik Shin
Comment 27
2009-10-21 15:22:02 PDT
Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish). Before the change, the search collator was locale-dependent. That is, if you use Japanese Safari, you'd get the collator tailored for Japanese (which I think does the right thing for Japanese). The same is true of other languages. The intent of the patch is to make Japanese search work *regardless of the current locale (UI language)*, which I support. However, it should be done without breaking search in Swedish, German, Finnish or any other language. The patch here breaks that because it throws away the locale-dependent collator right after creating one and replace it with the UCA + Japanese tailoring. The way to do that is to get the collation rule string for the current locale and combine it with the Japanese tailoring (added in the patch) and build a new collator from the combined rule strings. (I just went over the patch with one of ICU's leading contributors). I'll make a patch to do the above in a couple of days.
Darin Adler
Comment 28
2009-10-21 15:30:59 PDT
(In reply to
comment #27
)
> Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish).
I can't believe I missed that. We may need to roll the patch out until we have a method that doesn't break all the other languages.
Darin Adler
Comment 29
2009-10-21 15:34:09 PDT
(In reply to
comment #27
)
> The way to do that is to get the collation rule string for the current locale > and combine it with the Japanese tailoring (added in the patch) and build a new > collator from the combined rule strings. (I just went over the patch with one > of ICU's leading contributors).
For what it's worth, I tried to do exactly that when I added the code to fold quote marks. But as I mention in a comment, I was unable to do this, even with the help of some ICU experts here at Apple. Here's hoping you are able to do what I was not! See <
http://trac.webkit.org/changeset/45858
> for the comment and code. If you are able to add custom tailoring, then perhaps we can do the quote marks that way.
Jungshik Shin
Comment 30
2009-10-21 15:40:11 PDT
(In reply to
comment #28
)
> (In reply to
comment #27
) > > Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish). > > I can't believe I missed that. We may need to roll the patch out until we have > a method that doesn't break all the other languages.
It'd be great if you can roll it out. Then, I don't have to rush to make a new ICU data file for Chrome with invuca table (which I excluded to save some space) :-). Chrome is crashing due to assertions because it doesn't have invuca table. Without noticing you adding the last two comments, I filed a new
bug 30646
to fix the regression. If you think it's better to roll out the patch, perhaps it's better to deal with a better solution for the original issue here instead of
bug 30646
. Either way is fine with me.
Darin Adler
Comment 31
2009-10-21 15:42:48 PDT
What is the invuca table, and why does Chrome need it, but not other WebKit-based browsers?
Jungshik Shin
Comment 32
2009-10-21 15:56:06 PDT
(In reply to
comment #31
)
> What is the invuca table, and why does Chrome need it, but not other > WebKit-based browsers?
I was not clear. invuca is necessary by all other ICU-dependent webkit port IF collator is built at run-time (as is done with the patch for this bug). Chrome excluded it from its ICU data file because neither Webkit nor Chrome's other components build a collator with custom rules at run-time. With the patch here (or a new patch addressing the regression), that's not the case any more. Among other ICU-dependent Webkit ports, Safari on OS X is not affected because it gets the *full* ICU data from the OS. Safari on Win could have cut down the download size a little bit by excluding invuca, but apparently it didn't. So, it's not affected, either.
Darin Adler
Comment 33
2009-10-21 16:11:04 PDT
Rolled the change out in
r49926
. Reopening the bug.
Eric Seidel (no email)
Comment 34
2009-10-26 12:06:32 PDT
Comment on
attachment 41396
[details]
Revised patch Clearing darin's r+ on this obsolete patch.
mitz
Comment 35
2009-10-30 16:43:20 PDT
<
rdar://problem/7214058
>
Darin Adler
Comment 36
2009-10-30 16:46:36 PDT
If we can't fix this with tailoring, maybe there's some other simple way to fix it. I’d really like to resolve this somehow!
Darin Adler
Comment 37
2010-01-06 16:52:34 PST
Does anyone have any ideas about how to solve this?
Darin Adler
Comment 38
2010-01-06 16:59:19 PST
I am going to tackle this.
Darin Adler
Comment 39
2010-01-07 17:56:09 PST
Created
attachment 46103
[details]
work in progress
Darin Adler
Comment 40
2010-01-10 17:39:32 PST
Created
attachment 46249
[details]
patch
Alexey Proskuryakov
Comment 41
2010-01-10 23:02:25 PST
Comment on
attachment 46249
[details]
patch
> +static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer) > +{ > + ASSERT(length); > + > + UErrorCode status = U_ZERO_ERROR; > + size_t bufferSize = unorm_normalize(characters, length, UNORM_NFC, 0, 0, 0, &status); > + ASSERT(status == U_BUFFER_OVERFLOW_ERROR); > + ASSERT(bufferSize);
Would it make sense to try with an output buffer of size length,, to avoid having two passes in most cases?
> Index: LayoutTests/fast/text/find-kana.html
I need to add a similar test for Russian! r=me
Darin Adler
Comment 42
2010-01-11 07:54:56 PST
(In reply to
comment #41
)
> Would it make sense to try with an output buffer of size length, to avoid > having two passes in most cases?
Yes.
> I need to add a similar test for Russian!
Please do.
Darin Adler
Comment 43
2010-01-11 08:31:35 PST
http://trac.webkit.org/changeset/53078
Darin Adler
Comment 44
2010-01-11 10:41:43 PST
***
Bug 27587
has been marked as a duplicate of this bug. ***
Eric Seidel (no email)
Comment 45
2010-01-11 12:07:44 PST
Looks like this test breaks on Leopard and Tiger:
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r53078%20(9114)/fast/text/find-kana-diffs.txt
Eric Seidel (no email)
Comment 46
2010-01-11 12:23:24 PST
And Qt:
http://build.webkit.org/results/Qt%20Linux%20Release/r53092%20(5839)/fast/text/find-kana-diffs.txt
Darin Adler
Comment 47
2010-01-11 13:02:47 PST
Tiger and Leopard are failing because they are using an old ICU. Qt is failing because it's using the non-ICU search code path.
Darin Adler
Comment 48
2010-01-11 13:03:13 PST
In both cases the best solution is checking in expected failure results.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug