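Yesterday I commented on an answer in which someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said that using a range or the digit specifier was probably more efficient than a character set.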
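I decided to test that today and was surprised to find that (at least in the C# regex engine) \d appears to be less efficient than either of the other two, which do not seem to differ much from each other. Here is my test output over 10000 random strings of 1000 random characters each, 5077 of which actually contain a digit: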
Regex \d           took 00:00:00.2141226 result: 5077/10000
Regex [0-9]        took 00:00:00.1357972 result: 5077/10000 63.42 % of first
Regex [0123456789] took 00:00:00.1388997 result: 5077/10000 64.87 % of first
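This surprised me for two reasons, and I would be interested if anyone can shed some light on them. First, I expected the range [0-9] to be noticeably more efficient than the explicit set [0123456789], yet the two perform about the same. Second, I cannot understand why \d is worse than [0-9]. Is there more to \d than simply being shorthand for [0-9]?

Here is the test code: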
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);
            var strings = new List<string>();
            //10K random strings
            for (var i = 0; i < 10000; i++)
            {
                //generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    //add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                }
                //in roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                {
                    //replace 1 char with a digit 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            var baseTime = testPerfomance(strings, @"\d");
            Console.WriteLine();
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine(" {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine(" {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        //time one pattern over all strings and count how many of them match
        private static TimeSpan testPerfomance(List<string> strings, string regex)
        {
            var sw = new Stopwatch();

            int successes = 0;

            var rex = new Regex(regex);

            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

            return sw.Elapsed;
        }
    }
}
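
One way to probe whether \d is only shorthand for [0-9] is to try both patterns on a decimal digit outside ASCII. The following is a minimal sketch, separate from the test program above; the class name DigitClassProbe and its Matches helper are hypothetical names introduced for illustration. It relies on the documented .NET behavior that \d matches any Unicode decimal digit (category Nd) unless RegexOptions.ECMAScript is specified. It only shows that the two classes cover different character sets; it does not by itself establish that this is what causes the timing gap measured above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; not part of the benchmark above.
public static class DigitClassProbe
{
    // Returns true if the pattern matches anywhere in the input.
    public static bool Matches(string pattern, string input,
                               RegexOptions options = RegexOptions.None)
    {
        return Regex.IsMatch(input, pattern, options);
    }

    public static void Main()
    {
        // U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit (category Nd) outside ASCII.
        string arabicThree = "\u0663";

        // \d with default options covers Unicode decimal digits.
        Console.WriteLine(Matches(@"\d", arabicThree));                            // True

        // The explicit range only covers the ASCII digits 0 through 9.
        Console.WriteLine(Matches("[0-9]", arabicThree));                          // False

        // With ECMAScript-compliant behavior, \d is restricted to [0-9].
        Console.WriteLine(Matches(@"\d", arabicThree, RegexOptions.ECMAScript));   // False
    }
}
```

If the wider character set turns out to matter, rerunning the benchmark with RegexOptions.ECMAScript passed to the Regex constructor for the \d case would be one way to check whether the gap narrows.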