How can I perform a culture-sensitive "starts-with" operation from the middle of a string?

I have a requirement which is relatively obscure, but it feels like it should be possible using the BCL.

For context, I'm parsing a date/time string in Noda Time. I maintain a logical cursor for my position within the input string. So while the complete string may be "3 January 2013", the logical cursor may be at the 'J'.

Now, I need to parse the month name, comparing it against all the known month names for the culture:

  • Culture-sensitively
  • Case-insensitively
  • Just from the cursor position (not anywhere later; I want to see whether the cursor is "looking at" the candidate month name)
  • Quickly
  • ... and afterwards I need to know how many characters were used (the sketch after this list shows the desired shape)
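
In other words, what I want is roughly this shape of operation (a hypothetical signature, purely to illustrate; it doesn't exist anywhere):

// Hypothetical: returns the number of characters consumed from text
// when the text at position culturally matches candidate (ignoring case),
// or 0 if there is no match.
int MatchLength(string text, int position, string candidate, CompareInfo compareInfo);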

The current code to do this generally works using CompareInfo.Compare. It's effectively like this (just for the matching part; there's more code in the real thing, but it's irrelevant to the match):

internal bool MatchCaseInsensitive(string candidate, CompareInfo compareInfo)
{
    return compareInfo.Compare(text, position, candidate.Length,
                               candidate, 0, candidate.Length,
                               CompareOptions.IgnoreCase) == 0;
}

However, that relies on the candidate and the region we're comparing being the same length. That's fine most of the time, but not in some special cases. Suppose we have something like:

// U+00E9 is a single code point for e-acute
var text = "x b\u00e9d y";
int position = 2;
// e followed by U+0301 still means e-acute, but from two code points
var candidate = "be\u0301d";

Now my comparison will fail. I could use IsPrefix:

if (compareInfo.IsPrefix(text.Substring(position), candidate,
                         CompareOptions.IgnoreCase))

But:

  • That requires me to create a substring, which I'd really rather avoid. (I'm effectively viewing Noda Time as a system library; parsing performance may well be important to some clients.)
  • It doesn't tell me how far to advance the cursor afterwards

In reality, I strongly suspect this won't come up very often... but I'd really like to do the right thing here. I'd also really like to be able to do it without becoming a Unicode expert or implementing it myself :)

(Raised as bug 210 in Noda Time, in case anyone wants to follow any eventual conclusion.)

I like the idea of normalization. I need to check it in detail for a) correctness and b) performance. Assuming I can make it work correctly, I'm still not sure whether it would be worth changing at all; this is the sort of thing which may well never come up in real life, but could hurt the performance of all my users :(

I've also checked the BCL, and it doesn't appear to handle this correctly either. Sample code:

using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        var culture = (CultureInfo) CultureInfo.InvariantCulture.Clone();
        var months = culture.DateTimeFormat.AbbreviatedMonthNames;
        months[10] = "be\u0301d";
        culture.DateTimeFormat.AbbreviatedMonthNames = months;

        var text = "25 b\u00e9d 2013";
        var pattern = "dd MMM yyyy";
        DateTime result;
        if (DateTime.TryParseExact(text, pattern, culture,
                                   DateTimeStyles.None, out result))
        {
            Console.WriteLine("Parsed! Result={0}", result);
        }
        else
        {
            Console.WriteLine("Didn't parse");
        }
    }
}

That fails to parse, whereas changing the custom month name to just "bed" with a text value of "bEd" parses fine.

Okay, a few more data points:

  • The cost of using Substring and IsPrefix is significant but not horrible. On a sample of "Friday April 12 2013 20:28:42" on my development laptop, it changes the number of parse operations I can execute per second from about 460K to about 400K. I'd rather avoid that slowdown if possible, but it's not too bad.

  • Normalization turns out to be less feasible than I'd thought, because it isn't available in Portable Class Libraries. I could potentially use it just for the non-PCL builds, allowing the PCL builds to be slightly less correct. The performance hit of testing for normalization (string.IsNormalized) takes performance down to about 445K calls per second, which I can live with. I'm still not sure it does everything I need it to; for example, a month name containing "ß" should match "ss" in many cultures, I believe... and normalization doesn't do that, as the quick check below shows.
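
A quick sanity check (sketch code): "ß" is already in Normal Form C, so normalizing leaves it untouched and can't make it match "ss":

// Needs "using System.Text;" for NormalizationForm.
Console.WriteLine("\u00df".IsNormalized(NormalizationForm.FormC)); // True
Console.WriteLine("\u00df".Normalize(NormalizationForm.FormC));    // Still "ß"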


I'll consider the problem of many<->one/many case mappings first, separately from handling different normalization forms.

For example:

x heiße y
^--- cursor

Matches heisse but then advances the cursor 1 too many. And:

x heisse y
^--- cursor

Matches heiße but then advances the cursor 1 too few.

This will apply to any character that doesn't have a simple one-to-one mapping.

You would need to know the length of the substring that was actually matched, but Compare, IndexOf etc. throw that information away. It could be possible with regular expressions, but the implementation doesn't do full case folding and so doesn't match ß to ss/SS in case-insensitive mode, even though .Compare and .IndexOf do. And it would probably be costly to create new regexes for every candidate anyway.
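
A small illustration of that information loss (sketch code; it relies on the full case folding of the NLS-backed comparisons described here, so an ICU-backed runtime may behave differently):

var compareInfo = CultureInfo.InvariantCulture.CompareInfo;
string text = "x heiße y";
// IndexOf finds the case-insensitive match...
int pos = compareInfo.IndexOf(text, "heisse", CompareOptions.IgnoreCase);
// ...but only reports where it starts (2). Whether the match spanned
// 5 characters ("heiße") or 6 ("heisse") has been thrown away.
Console.WriteLine(pos);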

The simplest solution to this is to just store strings internally in case-folded form and do binary comparisons with case-folded candidates. Then you can move the cursor correctly with just .Length, since the cursor is for the internal representation. You also get most of the lost performance back from not having to use CompareOptions.IgnoreCase.
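
Sketched out, the matching side of that idea looks something like this (MatchFolded is a hypothetical helper; it assumes both the text and the candidate have already been case folded by the same function):

// Hypothetical sketch: because both sides went through the same fold,
// a plain ordinal comparison suffices, and the cursor can be advanced
// by the candidate's length in the internal representation.
static int MatchFolded(string foldedText, int position, string foldedCandidate)
{
    bool match = position + foldedCandidate.Length <= foldedText.Length
        && string.CompareOrdinal(foldedText, position,
                                 foldedCandidate, 0, foldedCandidate.Length) == 0;
    return match ? foldedCandidate.Length : 0; // characters to advance
}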

Unfortunately there is no built-in case fold function, and the poor man's case folding doesn't work either, because there is no full case mapping: the ToUpper method doesn't turn ß into SS.
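
That's easy to verify:

// .NET's simple case mapping leaves ß alone instead of producing "SS".
Console.WriteLine("hei\u00dfe".ToUpper(CultureInfo.InvariantCulture)); // "HEIßE"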

For example, this works in Java (and even in JavaScript), given a string that is in Normal Form C:

//Poor man's case folding.
//There are some edge cases where this doesn't work
public static String toCaseFold( String input, Locale cultureInfo ) {
    return input.toUpperCase(cultureInfo).toLowerCase(cultureInfo);
}

It's fun to note that Java's ignore-case comparison doesn't do full case folding like C#'s CompareOptions.IgnoreCase, so they are opposites in this regard: Java does full case mapping but simple case folding; C# does simple case mapping but full case folding.

So it's likely that you need a 3rd party library to case fold your strings before using them.


Before doing anything, you have to be sure that your strings are in Normal Form C. You can use this preliminary quick check, optimized for Latin script:

public static bool MaybeRequiresNormalizationToFormC(string input)
{
    if (input == null) throw new ArgumentNullException("input");

    int len = input.Length;
    for (int i = 0; i < len; ++i)
    {
        // Everything at or below U+02FF is already in Form C;
        // combining marks start at U+0300.
        if (input[i] > 0x2FF)
        {
            return true;
        }
    }

    return false;
}

This gives false positives but not false negatives. I don't expect it to slow down 460k parses/s at all when Latin script characters are used, even though it needs to be performed on every string. With a false positive you would use IsNormalized to get a true negative/positive, and only after that normalize if necessary.
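
Wired together, the pre-check and the real checks might look like this (a sketch; NormalizationForm needs "using System.Text;"):

static string EnsureFormC(string input)
{
    // Cheap Latin-script fast path: skip the expensive checks entirely.
    if (!MaybeRequiresNormalizationToFormC(input))
        return input;
    // Possible false positive: let IsNormalized give the real answer,
    // and only normalize when the string isn't already in Form C.
    return input.IsNormalized(NormalizationForm.FormC)
        ? input
        : input.Normalize(NormalizationForm.FormC);
}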


So in conclusion: ensure Normal Form C first, then case fold. Do binary comparisons with the processed strings, and move the cursor as you are moving it currently.

See if this meets the requirement:

public static partial class GlobalizationExtensions {
    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
    ) {
        if (compareInfo.IndexOf(source, prefix, startIndex, options) != startIndex)
            return ~0;
        else
            // source starts with prefix at startIndex,
            // therefore the loop must exit
            for (int length2 = 0, length1 = prefix.Length; ; )
                if (0 == compareInfo.Compare(
                        prefix, 0, length1,
                        source, startIndex, ++length2, options))
                    return length2;
    }
}

compareInfo.Compare is only performed once source starts with prefix; if it doesn't, IsPrefix returns -1; otherwise, it returns the number of characters used up in source.

However, I have no idea how to do better than incrementing length2 by 1 in cases like the following:

var candidate="ßssß\u00E9\u0302";
var text="abcd ssßss\u0065\u0301\u0302sss";


var count=
culture.CompareInfo.IsPrefix(text, candidate, 5, CompareOptions.IgnoreCase);

Update:

I tried to improve the performance a little, but I haven't proved whether or not there's a bug in the following code:

public static partial class GlobalizationExtensions {
    public static int Compare(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, ref int length2,
        CompareOptions options) {
        int length1 = prefix.Length, v2, v1;

        if (0 == (v1 = compareInfo.Compare(
            prefix, 0, length1, source, startIndex, length2, options))
        ) {
            return 0;
        }
        else {
            if (0 == (v2 = compareInfo.Compare(
                prefix, 0, length1, source, startIndex, 1 + length2, options))
            ) {
                ++length2;
                return 0;
            }
            else {
                if (v1 < 0 || v2 < 0) {
                    length2 -= 2;
                    return -1;
                }
                else {
                    length2 += 2;
                    return 1;
                }
            }
        }
    }

    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
    ) {
        if (compareInfo.IndexOf(source, prefix, startIndex, options)
            != startIndex)
            return ~0;
        else
            for (int length2 =
                Math.Min(prefix.Length, source.Length - (1 + startIndex)); ; )
                if (0 == compareInfo.Compare(
                        source, prefix, startIndex, ref length2, options))
                    return length2;
    }
}

I tested with that particular case, and the number of comparisons went down to about 3.

This is actually possible without normalization and without using IsPrefix.

We need to compare the same number of text elements as opposed to the same number of characters, but still return the number of matching characters.
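
As a quick illustration of text elements (sketch code): StringInfo.GetNextTextElement returns the base character at an index together with any combining characters that follow it.

// "e" + U+0301 is one text element of two chars...
Console.WriteLine(StringInfo.GetNextTextElement("be\u0301d", 1).Length); // 2
// ...while precomposed é is one text element of a single char.
Console.WriteLine(StringInfo.GetNextTextElement("b\u00e9d", 1).Length);  // 1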

I've created a copy of the MatchCaseInsensitive method from ValueCursor.cs in Noda Time and modified it slightly so that it can be used in a static context:

// Noda time code from MatchCaseInsensitive in ValueCursor.cs
static int IsMatch_Original(string source, int index, string match, CompareInfo compareInfo)
{
    unchecked
    {
        if (match.Length > source.Length - index)
        {
            return 0;
        }

        // TODO(V1.2): This will fail if the length in the input string is different to the length in the
        // match string for culture-specific reasons. It's not clear how to handle that...
        if (compareInfo.Compare(source, index, match.Length, match, 0, match.Length, CompareOptions.IgnoreCase) == 0)
        {
            return match.Length;
        }

        return 0;
    }
}

(Just included for reference; it's the code that won't compare properly, as you know.)

The following variant of that method uses StringInfo.GetNextTextElement which is provided by the framework. The idea is to compare text element by text element to find a match and if found return the actual number of matching characters in the source string:

// Using StringInfo.GetNextTextElement to match by text elements instead of characters
static int IsMatch_New(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < source.Length && matchIndex < match.Length)
    {
        // Get text elements at the current positions of source and match
        // Normally that will be just one character but may be more in case of Unicode combining characters
        string sourceElem = StringInfo.GetNextTextElement(source, sourceIndex);
        string matchElem = StringInfo.GetNextTextElement(match, matchIndex);

        // Compare the current elements.
        if (compareInfo.Compare(sourceElem, matchElem, CompareOptions.IgnoreCase) != 0)
        {
            return 0; // No match
        }

        // Advance in source and match (by number of characters)
        sourceIndex += sourceElem.Length;
        matchIndex += matchElem.Length;
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != match.Length)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

That method works just fine at least according to my test cases (which basically just test a couple of variants of the strings you've provided: "b\u00e9d" and "be\u0301d").

However, the GetNextTextElement method creates a substring for each text element, so this implementation requires a lot of substring comparisons, which will have an impact on performance.

So, I created another variant that does not use GetNextTextElement but instead skips over Unicode combining characters to find the actual match length in characters:

// This should be faster
static int IsMatch_Faster(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceLength = source.Length;
    int matchLength = match.Length;
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < sourceLength && matchIndex < matchLength)
    {
        sourceIndex += GetTextElemLen(source, sourceIndex, sourceLength);
        matchIndex += GetTextElemLen(match, matchIndex, matchLength);
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != matchLength)
    {
        return 0; // No match
    }

    // Check if we've found a match
    if (compareInfo.Compare(source, index, sourceIndex - index, match, 0, matchIndex, CompareOptions.IgnoreCase) != 0)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

That method uses the following two helpers:

static int GetTextElemLen(string str, int index, int strLen)
{
    bool stop = false;
    int elemLen;

    for (elemLen = 0; index < strLen && !stop; ++elemLen, ++index)
    {
        stop = !IsCombiningCharacter(str, index);
    }

    return elemLen;
}

static bool IsCombiningCharacter(string str, int index)
{
    switch (CharUnicodeInfo.GetUnicodeCategory(str, index))
    {
        case UnicodeCategory.NonSpacingMark:
        case UnicodeCategory.SpacingCombiningMark:
        case UnicodeCategory.EnclosingMark:
            return true;

        default:
            return false;
    }
}

I haven't done any benchmarking, so I don't really know whether the faster method is actually faster. Nor have I done any extended testing.

But this should answer your question on how to perform culture-sensitive substring matching for strings that may include Unicode combining characters.

These are the test cases I've used:

static Tuple<string, int, string, int>[] tests = new []
{
    Tuple.Create("x b\u00e9d y", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d y", 2, "b\u00e9d", 4),

    Tuple.Create("x b\u00e9d", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d", 2, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d y", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d y", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9", 0, "be\u0301d", 0),
    Tuple.Create("be\u0301", 0, "b\u00e9d", 0),
};

The tuple values are:

  1. The source string (haystack).
  2. The starting position in source.
  3. The match string (needle).
  4. The expected match length.
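
A minimal driver along these lines (a sketch; the original harness wasn't posted) is enough to produce such a report:

var compareInfo = CultureInfo.InvariantCulture.CompareInfo;
for (int i = 0; i < tests.Length; i++)
{
    var t = tests[i];
    // A method passes if it returns exactly the expected match length.
    Console.WriteLine("Test #{0}: Original={1}; New={2}; Faster={3}", i,
        IsMatch_Original(t.Item1, t.Item2, t.Item3, compareInfo) == t.Item4 ? "OK" : "BAD",
        IsMatch_New(t.Item1, t.Item2, t.Item3, compareInfo) == t.Item4 ? "OK" : "BAD",
        IsMatch_Faster(t.Item1, t.Item2, t.Item3, compareInfo) == t.Item4 ? "OK" : "BAD");
}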

Running those tests on the three methods yields this result:

Test #0: Original=BAD; New=OK; Faster=OK
Test #1: Original=BAD; New=OK; Faster=OK
Test #2: Original=BAD; New=OK; Faster=OK
Test #3: Original=BAD; New=OK; Faster=OK
Test #4: Original=BAD; New=OK; Faster=OK
Test #5: Original=BAD; New=OK; Faster=OK
Test #6: Original=BAD; New=OK; Faster=OK
Test #7: Original=BAD; New=OK; Faster=OK
Test #8: Original=OK; New=OK; Faster=OK
Test #9: Original=OK; New=OK; Faster=OK

The last two tests cover the case when the source string is shorter than the match string. In this case the original (Noda Time) method succeeds as well.