大写与小写

在进行不区分大小写的比较时,将字符串转换为大写或小写是否更有效?这重要吗?

有人建议 在这个职位使用 ToUpper 时 C # 效率更高,因为“ Microsoft 就是这样优化它的”但是我也读过 这个论点,转换 ToLower 和 ToUpper 取决于你的字符串包含更多的字符,通常字符串包含更多的小写字符,这使得 ToLower 更有效率。

特别是,我想知道:

  • 有没有一种方法可以优化 ToUpper 或 ToLower,使其中一个比另一个更快?
  • 在大小写字符串之间进行不区分大小写的比较是否更快? 为什么?
  • 是否有任何编程环境(例如。C,C # ,Python,随便什么) ,其中一种情况明显优于另一种情况,为什么呢?
46037 次浏览

Based on strings tending to have more lowercase entries, ToLower should theoretically be faster (lots of compares, but few assignments).

In C, or when using individually-accessible elements of each string (such as C strings or the STL's string type in C++), it's actually a byte comparison - so comparing UPPER is no different from lower.

If you were sneaky and loaded your strings into long arrays instead, you'd get a very fast comparison on the whole string because it could compare 4 bytes at a time. However, the load time might make it not worthwhile.

Why do you need to know which is faster? Unless you're doing a metric buttload of comparisons, one running a couple cycles faster is irrelevant to the speed of overall execution, and sounds like premature optimization :)

It really shouldn't ever matter. With ASCII characters, it definitely doesn't matter - it's just a few comparisons and a bit flip for either direction. Unicode might be a little more complicated, since there are some characters that change case in weird ways, but there really shouldn't be any difference unless your text is full of those special characters.

According to MSDN it is more efficient to pass in the strings and tell the comparison to ignore case:

String.Compare(strA, strB, StringComparison.OrdinalIgnoreCase) is equivalent to (but faster than) calling

String.Compare(ToUpperInvariant(strA), ToUpperInvariant(strB), StringComparison.Ordinal).

These comparisons are still very fast.

Of course, if you are comparing one string over and over again then this may not hold.

If you are doing string comparison in C# it is significantly faster to use .Equals() instead of converting both strings to upper or lower case. Another big plus for using .Equals() is that more memory isn't allocated for the 2 new upper/lower case strings.

Microsoft has optimized ToUpperInvariant(), not ToUpper(). The difference is that invariant is more culture friendly. If you need to do case-insensitive comparisons on strings that may vary in culture, use Invariant, otherwise the performance of invariant conversion shouldn't matter.

I can't say whether ToUpper() or ToLower() is faster though. I've never tried it since I've never had a situation where performance mattered that much.

Converting to either upper case or lower case in order to do case-insensitive comparisons is incorrect due to "interesting" features of some cultures, particularly Turkey. Instead, use a StringComparer with the appropriate options.

MSDN has some great guidelines on string handling. You might also want to check that your code passes the Turkey test.

EDIT: Note Neil's comment around ordinal case-insensitive comparisons. This whole realm is pretty murky :(

Doing it right, there should be a small, insignificant speed advantage if you convert to lower case, but this is, as many has hinted, culture dependent and is not inherit in the function but in the strings you convert (lots of lower case letters means few assignments to memory) -- converting to upper case is faster if you have a string with lots of upper case letters.

From Microsoft on MSDN:

Best Practices for Using Strings in the .NET Framework

Recommendations for String Usage

Why? From Microsoft:

Normalize strings to uppercase

There is a small group of characters that when converted to lowercase cannot make a round trip.

What is example of such a character that cannot make a round trip?

  • Start: Greek Rho Symbol (U+03f1) ϱ
  • Uppercase: Capital Greek Rho (U+03a1) Ρ
  • Lowercase: Small Greek Rho (U+03c1) ρ

ϱ , Ρ , ρ

.NET Fiddle

Original: ϱ
ToUpper: Ρ
ToLower: ρ

That is why, if your want to do case insensitive comparisons you convert the strings to uppercase, and not lowercase.

So if you have to choose one, choose Uppercase.

I wanted some actual data on this, so I pulled the full list of two byte characters that fail with ToLower or ToUpper. I then ran this test below:

using System;


class Program {
static void Main() {
char[][] pairs = {
new[]{'\u00E5','\u212B'},new[]{'\u00C5','\u212B'},new[]{'\u0399','\u1FBE'},
new[]{'\u03B9','\u1FBE'},new[]{'\u03B2','\u03D0'},new[]{'\u03B5','\u03F5'},
new[]{'\u03B8','\u03D1'},new[]{'\u03B8','\u03F4'},new[]{'\u03D1','\u03F4'},
new[]{'\u03B9','\u1FBE'},new[]{'\u0345','\u03B9'},new[]{'\u0345','\u1FBE'},
new[]{'\u03BA','\u03F0'},new[]{'\u00B5','\u03BC'},new[]{'\u03C0','\u03D6'},
new[]{'\u03C1','\u03F1'},new[]{'\u03C2','\u03C3'},new[]{'\u03C6','\u03D5'},
new[]{'\u03C9','\u2126'},new[]{'\u0392','\u03D0'},new[]{'\u0395','\u03F5'},
new[]{'\u03D1','\u03F4'},new[]{'\u0398','\u03D1'},new[]{'\u0398','\u03F4'},
new[]{'\u0345','\u1FBE'},new[]{'\u0345','\u0399'},new[]{'\u0399','\u1FBE'},
new[]{'\u039A','\u03F0'},new[]{'\u00B5','\u039C'},new[]{'\u03A0','\u03D6'},
new[]{'\u03A1','\u03F1'},new[]{'\u03A3','\u03C2'},new[]{'\u03A6','\u03D5'},
new[]{'\u03A9','\u2126'},new[]{'\u0398','\u03F4'},new[]{'\u03B8','\u03F4'},
new[]{'\u03B8','\u03D1'},new[]{'\u0398','\u03D1'},new[]{'\u0432','\u1C80'},
new[]{'\u0434','\u1C81'},new[]{'\u043E','\u1C82'},new[]{'\u0441','\u1C83'},
new[]{'\u0442','\u1C84'},new[]{'\u0442','\u1C85'},new[]{'\u1C84','\u1C85'},
new[]{'\u044A','\u1C86'},new[]{'\u0412','\u1C80'},new[]{'\u0414','\u1C81'},
new[]{'\u041E','\u1C82'},new[]{'\u0421','\u1C83'},new[]{'\u1C84','\u1C85'},
new[]{'\u0422','\u1C84'},new[]{'\u0422','\u1C85'},new[]{'\u042A','\u1C86'},
new[]{'\u0463','\u1C87'},new[]{'\u0462','\u1C87'}
};
int upper = 0, lower = 0;
foreach (char[] pair in pairs) {
Console.Write(
"U+{0:X4} U+{1:X4} pass: ",
Convert.ToInt32(pair[0]),
Convert.ToInt32(pair[1])
);
if (Char.ToUpper(pair[0]) == Char.ToUpper(pair[1])) {
Console.Write("ToUpper ");
upper++;
} else {
Console.Write("        ");
}
if (Char.ToLower(pair[0]) == Char.ToLower(pair[1])) {
Console.Write("ToLower");
lower++;
}
Console.WriteLine();
}
Console.WriteLine("upper pass: {0}, lower pass: {1}", upper, lower);
}
}

Result below. Note I also tested with the Invariant versions, and result was exact same. Interestingly, one of the pairs fails with both. But based on this ToUpper is the best option.

U+00E5 U+212B pass:         ToLower
U+00C5 U+212B pass:         ToLower
U+0399 U+1FBE pass: ToUpper
U+03B9 U+1FBE pass: ToUpper
U+03B2 U+03D0 pass: ToUpper
U+03B5 U+03F5 pass: ToUpper
U+03B8 U+03D1 pass: ToUpper
U+03B8 U+03F4 pass:         ToLower
U+03D1 U+03F4 pass:
U+03B9 U+1FBE pass: ToUpper
U+0345 U+03B9 pass: ToUpper
U+0345 U+1FBE pass: ToUpper
U+03BA U+03F0 pass: ToUpper
U+00B5 U+03BC pass: ToUpper
U+03C0 U+03D6 pass: ToUpper
U+03C1 U+03F1 pass: ToUpper
U+03C2 U+03C3 pass: ToUpper
U+03C6 U+03D5 pass: ToUpper
U+03C9 U+2126 pass:         ToLower
U+0392 U+03D0 pass: ToUpper
U+0395 U+03F5 pass: ToUpper
U+03D1 U+03F4 pass:
U+0398 U+03D1 pass: ToUpper
U+0398 U+03F4 pass:         ToLower
U+0345 U+1FBE pass: ToUpper
U+0345 U+0399 pass: ToUpper
U+0399 U+1FBE pass: ToUpper
U+039A U+03F0 pass: ToUpper
U+00B5 U+039C pass: ToUpper
U+03A0 U+03D6 pass: ToUpper
U+03A1 U+03F1 pass: ToUpper
U+03A3 U+03C2 pass: ToUpper
U+03A6 U+03D5 pass: ToUpper
U+03A9 U+2126 pass:         ToLower
U+0398 U+03F4 pass:         ToLower
U+03B8 U+03F4 pass:         ToLower
U+03B8 U+03D1 pass: ToUpper
U+0398 U+03D1 pass: ToUpper
U+0432 U+1C80 pass: ToUpper
U+0434 U+1C81 pass: ToUpper
U+043E U+1C82 pass: ToUpper
U+0441 U+1C83 pass: ToUpper
U+0442 U+1C84 pass: ToUpper
U+0442 U+1C85 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+044A U+1C86 pass: ToUpper
U+0412 U+1C80 pass: ToUpper
U+0414 U+1C81 pass: ToUpper
U+041E U+1C82 pass: ToUpper
U+0421 U+1C83 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+0422 U+1C84 pass: ToUpper
U+0422 U+1C85 pass: ToUpper
U+042A U+1C86 pass: ToUpper
U+0463 U+1C87 pass: ToUpper
U+0462 U+1C87 pass: ToUpper
upper pass: 46, lower pass: 8