字节[]数组模式搜索

小开

You can put the byte array into String and run match by IndexOf. Or you can at least reuse existing algorithms on string matching.

    [STAThread]
static void Main(string[] args)
{
byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125};
string needle, haystack;


unsafe
{
fixed(byte * p = pattern) {
needle = new string((SByte *) p, 0, pattern.Length);
} // fixed


fixed (byte * p2 = toBeSearched)
{
haystack = new string((SByte *) p2, 0, toBeSearched.Length);
} // fixed


int i = haystack.IndexOf(needle, 0);
System.Console.Out.WriteLine(i);
}
}

小开

toBeSearched.Except(pattern) will return you differences toBeSearched.Intersect(pattern) will produce set of intersections Generally, you should look into extended methods within Linq extensions

小开

My solution:

class Program
{
public static void Main()
{
byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};


byte[] toBeSearched = new byte[] { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125};


List<int> positions = SearchBytePattern(pattern, toBeSearched);


foreach (var item in positions)
{
Console.WriteLine("Pattern matched at pos {0}", item);
}


}


static public List<int> SearchBytePattern(byte[] pattern, byte[] bytes)
{
List<int> positions = new List<int>();
int patternLength = pattern.Length;
int totalLength = bytes.Length;
byte firstMatchByte = pattern[0];
for (int i = 0; i < totalLength; i++)
{
if (firstMatchByte == bytes[i] && totalLength - i >= patternLength)
{
byte[] match = new byte[patternLength];
Array.Copy(bytes, i, match, 0, patternLength);
if (match.SequenceEqual<byte>(pattern))
{
positions.Add(i);
i += patternLength - 1;
}
}
}
return positions;
}
}

小开

May I suggest something that doesn't involve creating strings, copying arrays or unsafe code:

using System;
using System.Collections.Generic;


static class ByteArrayRocks
{
static readonly int[] Empty = new int[0];


public static int[] Locate (this byte[] self, byte[] candidate)
{
if (IsEmptyLocate(self, candidate))
return Empty;


var list = new List<int>();


for (int i = 0; i < self.Length; i++)
{
if (!IsMatch(self, i, candidate))
continue;


list.Add(i);
}


return list.Count == 0 ? Empty : list.ToArray();
}


static bool IsMatch (byte[] array, int position, byte[] candidate)
{
if (candidate.Length > (array.Length - position))
return false;


for (int i = 0; i < candidate.Length; i++)
if (array[position + i] != candidate[i])
return false;


return true;
}


static bool IsEmptyLocate (byte[] array, byte[] candidate)
{
return array == null
|| candidate == null
|| array.Length == 0
|| candidate.Length == 0
|| candidate.Length > array.Length;
}


static void Main()
{
var data = new byte[] { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };
var pattern = new byte[] { 12, 3, 5, 76, 8, 0, 6, 125 };


foreach (var position in data.Locate(pattern))
Console.WriteLine(position);
}
}

Edit (by IAbstract) - moving contents of post here since it is not an answer

Out of curiosity, I've created a small benchmark with different answers.

Here are the results for a million iterations:

solution [Locate]:            00:00:00.7714027
solution [FindAll]:           00:00:03.5404399
solution [SearchBytePattern]: 00:00:01.1105190
solution [MatchBytePattern]:  00:00:03.0658212

小开

Here's my (not the most performant) solution. It relies on the fact that bytes/latin-1 conversion is lossless, which is not true for bytes/ASCII or bytes/UTF8 conversions.

It's advantages are that It Works (tm) for any byte values (some other solutions work incorrectly with bytes 0x80-0xff) and can be extended to perform more advanced regex matching.

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;


class C {


public static void Main() {
byte[] data = {0, 100, 0, 255, 100, 0, 100, 0, 255};
byte[] pattern = {0, 255};
foreach (int i in FindAll(data, pattern)) {
Console.WriteLine(i);
}
}


public static IEnumerable<int> FindAll(
byte[] haystack,
byte[] needle
) {
// bytes <-> latin-1 conversion is lossless
Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
string sHaystack = latin1.GetString(haystack);
string sNeedle = latin1.GetString(needle);
for (Match m = Regex.Match(sHaystack, Regex.Escape(sNeedle));
m.Success; m = m.NextMatch()) {
yield return m.Index;
}
}
}

小开

I would use a solution which does matching by converting to a string...

You should write a simple function implementing the Knuth-Morris-Pratt searching algorithm. This will be the fastest simple algorithm you can use to find the correct indexes.(You could use Boyer-Moore but it will require more setup.

After you have optimized the algorithm, you could try to look for other kind of optimizations. But you should start with the basics.

For example, the current "fastest" is the Locate solution by Jb Evian.

if you look at the core

    for (int i = 0; i < self.Length; i++) {
if (!IsMatch (self, i, candidate))
continue;


list.Add (i);
}

After a match of the sub algorithm, it will start to find a match at i + 1, but you already know that the first possible match would be i + candidate.Length. So if you add,

i += candidate.Length -2; //  -2 instead of -1 because the i++ will add the last index

it will be a lot faster when you expect a lot of occurrences of the subset in the superset. (Bruno Conde already does this in his solution)

But this is just a half of the KNP algorithm, you should also add an extra parameter to the IsMatch method called numberOfValidMatches which would be an out parameter.

this would resolve to the following:

int validMatches = 0;
if (!IsMatch (self, i, candidate, out validMatches))
{
i += validMatches - 1; // -1 because the i++ will do the last one
continue;
}

and

static bool IsMatch (byte [] array, int position, byte [] candidate, out int numberOfValidMatches)
{
numberOfValidMatches = 0;
if (candidate.Length > (array.Length - position))
return false;


for (i = 0; i < candidate.Length; i++)
{
if (array [position + i] != candidate [i])
return false;
numberOfValidMatches++;
}


return true;
}

A little bit of refactoring and you could use the numberOfValidMatches as the loop variable, and rewrite the Locate loop using a while to avoid the -2 and -1. But I just wanted to make clear how you could add the KMP algorithm.

小开

Use the efficient Boyer-Moore algorithm.

It's designed to find strings withing strings but you need little imagination to project this to byte arrays.

In general the best answer is: use any string searching algorithm that you like :).

小开

Jb Evain's answer has:

 for (int i = 0; i < self.Length; i++) {
if (!IsMatch (self, i, candidate))
continue;
list.Add (i);
}

and then the IsMatch function first checks whether candidate goes beyond the length of the array being searched.

This would be more efficient if the for loop were coded:

     for (int i = 0, n = self.Length - candidate.Length + 1; i < n; ++i) {
if (!IsMatch (self, i, candidate))
continue;
list.Add (i);
}

at this point one could also eliminate the test from the start of IsMatch, so long as you contract via pre-conditions never to call it with "illegal' parameters. Note: Fixed an off-by-one bug in 2019.

小开

I've created a new function using the tips from my answer and the answer by Alnitak.

public static List<Int32> LocateSubset(Byte[] superSet, Byte[] subSet)
{
if ((superSet == null) || (subSet == null))
{
throw new ArgumentNullException();
}
if ((superSet.Length < subSet.Length) || (superSet.Length == 0) || (subSet.Length == 0))
{
return new List<Int32>();
}
var result = new List<Int32>();
Int32 currentIndex = 0;
Int32 maxIndex =  superSet.Length - subSet.Length;
while (currentIndex < maxIndex)
{
Int32 matchCount = CountMatches(superSet, currentIndex, subSet);
if (matchCount ==  subSet.Length)
{
result.Add(currentIndex);
}
currentIndex++;
if (matchCount > 0)
{
currentIndex += matchCount - 1;
}
}
return result;
}


private static Int32 CountMatches(Byte[] superSet, int startIndex, Byte[] subSet)
{
Int32 currentOffset = 0;
while (currentOffset < subSet.Length)
{
if (superSet[startIndex + currentOffset] != subSet[currentOffset])
{
break;
}
currentOffset++;
}
return currentOffset;
}

the only part I'm not so happy about is the

         currentIndex++;
if (matchCount > 0)
{
currentIndex += matchCount - 1;
}

part... I would have like to use a if else to avoid the -1, but this results in a better branch prediction (although I'm not that sure it will matter that much)..

小开

Why make the simple difficult? This can be done in any language using for loops. Here is one in C#:

using System;
using System.Collections.Generic;


namespace BinarySearch
{
class Program
{
static void Main(string[] args)
{
byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,
122,22,4,7,89,76,64,12,3,5,76,8,0,6,125};


List<int> occurences = findOccurences(toBeSearched, pattern);


foreach(int occurence in occurences) {
Console.WriteLine("Found match starting at 0-based index: " + occurence);
}




}


static List<int> findOccurences(byte[] haystack, byte[] needle)
{
List<int> occurences = new List<int>();


for (int i = 0; i < haystack.Length; i++)
{
if (needle[0] == haystack[i])
{
bool found = true;
int j, k;
for (j = 0, k = i; j < needle.Length; j++, k++)
{
if (k >= haystack.Length || needle[j] != haystack[k])
{
found = false;
break;
}
}
if (found)
{
occurences.Add(i - 1);
i = k;
}
}
}
return occurences;
}
}
}

小开

thanks for taking the time...

This is the code that i was using/testing with before I asked my question... The reason I ask this question was that I´m certain of that I´m not using the optimal code to do this... so thanks again for taking the time!

   private static int CountPatternMatches(byte[] pattern, byte[] bytes)
{
int counter = 0;


for (int i = 0; i < bytes.Length; i++)
{
if (bytes[i] == pattern[0] && (i + pattern.Length) < bytes.Length)
{
for (int x = 1; x < pattern.Length; x++)
{
if (pattern[x] != bytes[x+i])
{
break;
}


if (x == pattern.Length -1)
{
counter++;
i = i + pattern.Length;
}
}
}
}


return counter;
}

Anyone that see any errors in my code? Is this considered as an hackish approach? I have tried almost every sample you guys have posted and I seems to get some variations in the match results. I have been running my tests with ~10Mb byte array as my toBeSearched array.

小开

Originally I posted some old code I used but was curious about Jb Evain's benchmarks. I found that my solution was stupid slow. It appears that bruno conde's SearchBytePattern is the fastest. I could not figure why especially since he uses an Array.Copy and an Extension method. But there is proof in Jb's tests, so kudos to bruno.

I simplified the bits even further, so hopefully this will be the clearest and simplest solution. (All the hard work done by bruno conde) The enhancements are:

Buffer.BlockCopy
Array.IndexOf<byte>
while loop instead of a for loop
start index parameter

converted to extension method

public static List<int> IndexOfSequence(this byte[] buffer, byte[] pattern, int startIndex)
{
List<int> positions = new List<int>();
int i = Array.IndexOf<byte>(buffer, pattern[0], startIndex);
while (i >= 0 && i <= buffer.Length - pattern.Length)
{
byte[] segment = new byte[pattern.Length];
Buffer.BlockCopy(buffer, i, segment, 0, pattern.Length);
if (segment.SequenceEqual<byte>(pattern))
positions.Add(i);
i = Array.IndexOf<byte>(buffer, pattern[0], i + 1);
}
return positions;
}

Note that, the last statement in the while block should be i = Array.IndexOf<byte>(buffer, pattern[0], i + 1); instead of i = Array.IndexOf<byte>(buffer, pattern[0], i + pattern.Length);. Look at the comment by Johan. A simple test could prove that:

byte[] pattern = new byte[] {1, 2};
byte[] toBeSearched = new byte[] { 1, 1, 2, 1, 12 };

With i = Array.IndexOf<byte>(buffer, pattern[0], i + pattern.Length);, nothing returned. i = Array.IndexOf<byte>(buffer, pattern[0], i + 1); returns the correct result.

小开

Speed isn't everything. Did you check them for consistency?

I didn't test all the code listed here. I tested my own code (which wasn't totally consistent, I admit) and IndexOfSequence. I found that for many tests IndexOfSequence was quite a bit faster than my code but with repeated testing I found that it was less consistent. In particular it seems to have the most trouble with finding patterns at the end of the array but it will miss them in the middle of the array too sometimes.

My test code isn't designed for efficiency, I just wanted to have a bunch of random data with some known strings inside. That test pattern is roughly like a boundary marker in an http form upload stream. That's what I was looking for when I ran across this code so I figured I'd test it with the kind of data I'll be searching for. It appears that the longer the pattern is the more often IndexOfSequence will miss a value.

private static void TestMethod()
{
Random rnd = new Random(DateTime.Now.Millisecond);
string Pattern = "-------------------------------65498495198498";
byte[] pattern = Encoding.ASCII.GetBytes(Pattern);


byte[] testBytes;
int count = 3;
for (int i = 0; i < 100; i++)
{
StringBuilder TestString = new StringBuilder(2500);
TestString.Append(Pattern);
byte[] buf = new byte[1000];
rnd.NextBytes(buf);
TestString.Append(Encoding.ASCII.GetString(buf));
TestString.Append(Pattern);
rnd.NextBytes(buf);
TestString.Append(Encoding.ASCII.GetString(buf));
TestString.Append(Pattern);
testBytes = Encoding.ASCII.GetBytes(TestString.ToString());


List<int> idx = IndexOfSequence(ref testBytes, pattern, 0);
if (idx.Count != count)
{
Console.Write("change from {0} to {1} on iteration {2}: ", count, idx.Count, i);
foreach (int ix in idx)
{
Console.Write("{0}, ", ix);
}
Console.WriteLine();
count = idx.Count;
}
}


Console.WriteLine("Press ENTER to exit");
Console.ReadLine();
}

(obviously I converted IndexOfSequence from an extension back into a normal method for this testing)

Here's a sample run of my output:

change from 3 to 2 on iteration 1: 0, 2090,
change from 2 to 3 on iteration 2: 0, 1045, 2090,
change from 3 to 2 on iteration 3: 0, 1045,
change from 2 to 3 on iteration 4: 0, 1045, 2090,
change from 3 to 2 on iteration 6: 0, 2090,
change from 2 to 3 on iteration 7: 0, 1045, 2090,
change from 3 to 2 on iteration 11: 0, 2090,
change from 2 to 3 on iteration 12: 0, 1045, 2090,
change from 3 to 2 on iteration 14: 0, 2090,
change from 2 to 3 on iteration 16: 0, 1045, 2090,
change from 3 to 2 on iteration 17: 0, 1045,
change from 2 to 3 on iteration 18: 0, 1045, 2090,
change from 3 to 1 on iteration 20: 0,
change from 1 to 3 on iteration 21: 0, 1045, 2090,
change from 3 to 2 on iteration 22: 0, 2090,
change from 2 to 3 on iteration 23: 0, 1045, 2090,
change from 3 to 2 on iteration 24: 0, 2090,
change from 2 to 3 on iteration 25: 0, 1045, 2090,
change from 3 to 2 on iteration 26: 0, 2090,
change from 2 to 3 on iteration 27: 0, 1045, 2090,
change from 3 to 2 on iteration 43: 0, 1045,
change from 2 to 3 on iteration 44: 0, 1045, 2090,
change from 3 to 2 on iteration 48: 0, 1045,
change from 2 to 3 on iteration 49: 0, 1045, 2090,
change from 3 to 2 on iteration 50: 0, 2090,
change from 2 to 3 on iteration 52: 0, 1045, 2090,
change from 3 to 2 on iteration 54: 0, 1045,
change from 2 to 3 on iteration 57: 0, 1045, 2090,
change from 3 to 2 on iteration 62: 0, 1045,
change from 2 to 3 on iteration 63: 0, 1045, 2090,
change from 3 to 2 on iteration 72: 0, 2090,
change from 2 to 3 on iteration 73: 0, 1045, 2090,
change from 3 to 2 on iteration 75: 0, 2090,
change from 2 to 3 on iteration 76: 0, 1045, 2090,
change from 3 to 2 on iteration 78: 0, 1045,
change from 2 to 3 on iteration 79: 0, 1045, 2090,
change from 3 to 2 on iteration 81: 0, 2090,
change from 2 to 3 on iteration 82: 0, 1045, 2090,
change from 3 to 2 on iteration 85: 0, 2090,
change from 2 to 3 on iteration 86: 0, 1045, 2090,
change from 3 to 2 on iteration 89: 0, 2090,
change from 2 to 3 on iteration 90: 0, 1045, 2090,
change from 3 to 2 on iteration 91: 0, 2090,
change from 2 to 1 on iteration 92: 0,
change from 1 to 3 on iteration 93: 0, 1045, 2090,
change from 3 to 1 on iteration 99: 0,

I don't mean to pick on IndexOfSequence, it just happened to be the one I started working with today. I noticed at the end of the day that it seemed to be missing patterns in the data so I wrote my own pattern matcher tonight. It's not as fast though. I'm going to tweak it a bit more to see if I can get it 100% consistent before I post it.

I just wanted to remind everyone that they should test things like this to make sure they give good, repeatable results before you trust them in production code.

小开

I tried various solutions and ended up modifying the SearchBytePattern one. I tested on a 30k sequence and it is, fast :)

    static public int SearchBytePattern(byte[] pattern, byte[] bytes)
{
int matches = 0;
for (int i = 0; i < bytes.Length; i++)
{
if (pattern[0] == bytes[i] && bytes.Length - i >= pattern.Length)
{
bool ismatch = true;
for (int j = 1; j < pattern.Length && ismatch == true; j++)
{
if (bytes[i + j] != pattern[j])
ismatch = false;
}
if (ismatch)
{
matches++;
i += pattern.Length - 1;
}
}
}
return matches;
}

Let me know your thoughts.

小开

These are the simplest and fastest methods you can use, and there wouldn't be anything faster than these. It is unsafe but that's what we use pointers for is speed. So here I offer you my extension methods that I use search for a single, and a list of indices of the occurrences. I would like to say this is the cleanest code here.

    public static unsafe long IndexOf(this byte[] Haystack, byte[] Needle)
{
fixed (byte* H = Haystack) fixed (byte* N = Needle)
{
long i = 0;
for (byte* hNext = H, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++)
{
bool Found = true;
for (byte* hInc = hNext, nInc = N, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
if (Found) return i;
}
return -1;
}
}
public static unsafe List<long> IndexesOf(this byte[] Haystack, byte[] Needle)
{
List<long> Indexes = new List<long>();
fixed (byte* H = Haystack) fixed (byte* N = Needle)
{
long i = 0;
for (byte* hNext = H, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++)
{
bool Found = true;
for (byte* hInc = hNext, nInc = N, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
if (Found) Indexes.Add(i);
}
return Indexes;
}
}

Benchmarked with Locate, it is 1.2-1.4x faster

小开

Here is a solution I came up with. I included notes I found along the way with the implementation. It can match forward, backward and with different (in/dec)remement amounts e.g. direction; starting at any offset in the haystack.

Any input would be awesome!

    /// <summary>
/// Matches a byte array to another byte array
/// forwards or reverse
/// </summary>
/// <param name="a">byte array</param>
/// <param name="offset">start offset</param>
/// <param name="len">max length</param>
/// <param name="b">byte array</param>
/// <param name="direction">to move each iteration</param>
/// <returns>true if all bytes match, otherwise false</returns>
internal static bool Matches(ref byte[] a, int offset, int len, ref byte[] b, int direction = 1)
{
#region Only Matched from offset Within a and b, could not differ, e.g. if you wanted to mach in reverse for only part of a in some of b that would not work
//if (direction == 0) throw new ArgumentException("direction");
//for (; offset < len; offset += direction) if (a[offset] != b[offset]) return false;
//return true;
#endregion
//Will match if b contains len of a and return a a index of positive value
return IndexOfBytes(ref a, ref offset, len, ref b, len) != -1;
}


///Here is the Implementation code


/// <summary>
/// Swaps two integers without using a temporary variable
/// </summary>
/// <param name="a"></param>
/// <param name="b"></param>
internal static void Swap(ref int a, ref int b)
{
a ^= b;
b ^= a;
a ^= b;
}


/// <summary>
/// Swaps two bytes without using a temporary variable
/// </summary>
/// <param name="a"></param>
/// <param name="b"></param>
internal static void Swap(ref byte a, ref byte b)
{
a ^= b;
b ^= a;
a ^= b;
}


/// <summary>
/// Can be used to find if a array starts, ends spot Matches or compltely contains a sub byte array
/// Set checkLength to the amount of bytes from the needle you want to match, start at 0 for forward searches start at hayStack.Lenght -1 for reverse matches
/// </summary>
/// <param name="a">Needle</param>
/// <param name="offset">Start in Haystack</param>
/// <param name="len">Length of required match</param>
/// <param name="b">Haystack</param>
/// <param name="direction">Which way to move the iterator</param>
/// <returns>Index if found, otherwise -1</returns>
internal static int IndexOfBytes(ref byte[] needle, ref int offset, int checkLength, ref byte[] haystack, int direction = 1)
{
//If the direction is == 0 we would spin forever making no progress
if (direction == 0) throw new ArgumentException("direction");
//Cache the length of the needle and the haystack, setup the endIndex for a reverse search
int needleLength = needle.Length, haystackLength = haystack.Length, endIndex = 0, workingOffset = offset;
//Allocate a value for the endIndex and workingOffset
//If we are going forward then the bound is the haystackLength
if (direction >= 1) endIndex = haystackLength;
#region [Optomization - Not Required]
//{


//I though this was required for partial matching but it seems it is not needed in this form
//workingOffset = needleLength - checkLength;
//}
#endregion
else Swap(ref workingOffset, ref endIndex);
#region [Optomization - Not Required]
//{
//Otherwise we are going in reverse and the endIndex is the needleLength - checkLength
//I though the length had to be adjusted but it seems it is not needed in this form
//endIndex = needleLength - checkLength;
//}
#endregion
#region [Optomized to above]
//Allocate a value for the endIndex
//endIndex = direction >= 1 ? haystackLength : needleLength - checkLength,
//Determine the workingOffset
//workingOffset = offset > needleLength ? offset : needleLength;
//If we are doing in reverse swap the two
//if (workingOffset > endIndex) Swap(ref workingOffset, ref endIndex);
//Else we are going in forward direction do the offset is adjusted by the length of the check
//else workingOffset -= checkLength;
//Start at the checkIndex (workingOffset) every search attempt
#endregion
//Save the checkIndex (used after the for loop is done with it to determine if the match was checkLength long)
int checkIndex = workingOffset;
#region [For Loop Version]
///Optomized with while (single op)
///for (int checkIndex = workingOffset; checkIndex < endIndex; offset += direction, checkIndex = workingOffset)
///{
///Start at the checkIndex
/// While the checkIndex < checkLength move forward
/// If NOT (the needle at the checkIndex matched the haystack at the offset + checkIndex) BREAK ELSE we have a match continue the search
/// for (; checkIndex < checkLength; ++checkIndex) if (needle[checkIndex] != haystack[offset + checkIndex]) break; else continue;
/// If the match was the length of the check
/// if (checkIndex == checkLength) return offset; //We are done matching
///}
#endregion
//While the checkIndex < endIndex
while (checkIndex < endIndex)
{
for (; checkIndex < checkLength; ++checkIndex) if (needle[checkIndex] != haystack[offset + checkIndex]) break; else continue;
//If the match was the length of the check
if (checkIndex == checkLength) return offset; //We are done matching
//Move the offset by the direction, reset the checkIndex to the workingOffset
offset += direction; checkIndex = workingOffset;
}
//We did not have a match with the given options
return -1;
}

小开

I'm a little late to the party How about using Boyer Moore algorithm but search for bytes instead of strings. c# code below.

EyeCode Inc.

class Program {
static void Main(string[] args) {
byte[] text         =  new byte[] {12,3,5,76,8,0,6,125,23,36,43,76,125,56,34,234,12,4,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,123};
byte[] pattern      = new byte[] {12,3,5,76,8,0,6,125};


BoyerMoore tmpSearch = new BoyerMoore(pattern,text);


Console.WriteLine(tmpSearch.Match());
Console.ReadKey();
}


public class BoyerMoore {


private static int ALPHABET_SIZE = 256;


private byte[] text;
private byte[] pattern;


private int[] last;
private int[] match;
private int[] suffix;


public BoyerMoore(byte[] pattern, byte[] text) {
this.text = text;
this.pattern = pattern;
last = new int[ALPHABET_SIZE];
match = new int[pattern.Length];
suffix = new int[pattern.Length];
}




/**
* Searches the pattern in the text.
* returns the position of the first occurrence, if found and -1 otherwise.
*/
public int Match() {
// Preprocessing
ComputeLast();
ComputeMatch();


// Searching
int i = pattern.Length - 1;
int j = pattern.Length - 1;
while (i < text.Length) {
if (pattern[j] == text[i]) {
if (j == 0) {
return i;
}
j--;
i--;
}
else {
i += pattern.Length - j - 1 + Math.Max(j - last[text[i]], match[j]);
j = pattern.Length - 1;
}
}
return -1;
}




/**
* Computes the function last and stores its values in the array last.
* last(Char ch) = the index of the right-most occurrence of the character ch
*                                                           in the pattern;
*                 -1 if ch does not occur in the pattern.
*/
private void ComputeLast() {
for (int k = 0; k < last.Length; k++) {
last[k] = -1;
}
for (int j = pattern.Length-1; j >= 0; j--) {
if (last[pattern[j]] < 0) {
last[pattern[j]] = j;
}
}
}




/**
* Computes the function match and stores its values in the array match.
* match(j) = min{ s | 0 < s <= j && p[j-s]!=p[j]
*                            && p[j-s+1]..p[m-s-1] is suffix of p[j+1]..p[m-1] },
*                                                         if such s exists, else
*            min{ s | j+1 <= s <= m
*                            && p[0]..p[m-s-1] is suffix of p[j+1]..p[m-1] },
*                                                         if such s exists,
*            m, otherwise,
* where p is the pattern and m is its length.
*/
private void ComputeMatch() {
/* Phase 1 */
for (int j = 0; j < match.Length; j++) {
match[j] = match.Length;
} //O(m)


ComputeSuffix(); //O(m)


/* Phase 2 */
//Uses an auxiliary array, backwards version of the KMP failure function.
//suffix[i] = the smallest j > i s.t. p[j..m-1] is a prefix of p[i..m-1],
//if there is no such j, suffix[i] = m


//Compute the smallest shift s, such that 0 < s <= j and
//p[j-s]!=p[j] and p[j-s+1..m-s-1] is suffix of p[j+1..m-1] or j == m-1},
//                                                         if such s exists,
for (int i = 0; i < match.Length - 1; i++) {
int j = suffix[i + 1] - 1; // suffix[i+1] <= suffix[i] + 1
if (suffix[i] > j) { // therefore pattern[i] != pattern[j]
match[j] = j - i;
}
else {// j == suffix[i]
match[j] = Math.Min(j - i + match[i], match[j]);
}
}


/* Phase 3 */
//Uses the suffix array to compute each shift s such that
//p[0..m-s-1] is a suffix of p[j+1..m-1] with j < s < m
//and stores the minimum of this shift and the previously computed one.
if (suffix[0] < pattern.Length) {
for (int j = suffix[0] - 1; j >= 0; j--) {
if (suffix[0] < match[j]) { match[j] = suffix[0]; }
}
{
int j = suffix[0];
for (int k = suffix[j]; k < pattern.Length; k = suffix[k]) {
while (j < k) {
if (match[j] > k) {
match[j] = k;
}
j++;
}
}
}
}
}




/**
* Computes the values of suffix, which is an auxiliary array,
* backwards version of the KMP failure function.
*
* suffix[i] = the smallest j > i s.t. p[j..m-1] is a prefix of p[i..m-1],
* if there is no such j, suffix[i] = m, i.e.


* p[suffix[i]..m-1] is the longest prefix of p[i..m-1], if suffix[i] < m.
*/
private void ComputeSuffix() {
suffix[suffix.Length-1] = suffix.Length;
int j = suffix.Length - 1;
for (int i = suffix.Length - 2; i >= 0; i--) {
while (j < suffix.Length - 1 && !pattern[j].Equals(pattern[i])) {
j = suffix[j + 1] - 1;
}
if (pattern[j] == pattern[i]) {
j--;
}
suffix[i] = j + 1;
}
}


}


}

小开

Here's a simple code i wrote using only basic data types: (It returns the index of first occurance)

private static int findMatch(byte[] data, byte[] pattern) {
if(pattern.length > data.length){
return -1;
}
for(int i = 0; i<data.length ;){
int j;
for(j=0;j<pattern.length;j++){


if(pattern[j]!=data[i])
break;
i++;
}
if(j==pattern.length){
System.out.println("Pattern found at : "+(i - pattern.length ));
return i - pattern.length ;
}
if(j!=0)continue;
i++;
}


return -1;
}

小开

I was missing a LINQ method/answer :-)

/// <summary>
/// Searches in the haystack array for the given needle using the default equality operator and returns the index at which the needle starts.
/// </summary>
/// <typeparam name="T">Type of the arrays.</typeparam>
/// <param name="haystack">Sequence to operate on.</param>
/// <param name="needle">Sequence to search for.</param>
/// <returns>Index of the needle within the haystack or -1 if the needle isn't contained.</returns>
public static IEnumerable<int> IndexOf<T>(this T[] haystack, T[] needle)
{
if ((needle != null) && (haystack.Length >= needle.Length))
{
for (int l = 0; l < haystack.Length - needle.Length + 1; l++)
{
if (!needle.Where((data, index) => !haystack[l + index].Equals(data)).Any())
{
yield return l;
}
}
}
}

小开

Use LINQ Methods.

public static IEnumerable<int> PatternAt(byte[] source, byte[] pattern)
{
for (int i = 0; i < source.Length; i++)
{
if (source.Skip(i).Take(pattern.Length).SequenceEqual(pattern))
{
yield return i;
}
}
}

Very simple!

小开

Just another answer that is easy to follow and pretty efficient for a O(n) type of operation without using unsafe code or copying parts of the source arrays.

Be sure to test. Some of the suggestions found on this topic are susceptible to gotta situations.

    static void Main(string[] args)
{
//                                                         1   1  1  1  1  1  1  1  1  1  2   2   2
//                           0  1  2  3  4  5  6  7  8  9  0   1  2  3  4  5  6  7  8  9  0   1   2  3  4  5  6  7  8  9
byte[] buffer = new byte[] { 1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 5, 5, 0, 5, 5, 1, 2 };
byte[] beginPattern = new byte[] { 1, 0, 2 };
byte[] middlePattern = new byte[] { 8, 9, 10 };
byte[] endPattern = new byte[] { 9, 10, 11 };
byte[] wholePattern = new byte[] { 1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 };
byte[] noMatchPattern = new byte[] { 7, 7, 7 };


int beginIndex = ByteArrayPatternIndex(buffer, beginPattern);
int middleIndex = ByteArrayPatternIndex(buffer, middlePattern);
int endIndex = ByteArrayPatternIndex(buffer, endPattern);
int wholeIndex = ByteArrayPatternIndex(buffer, wholePattern);
int noMatchIndex = ByteArrayPatternIndex(buffer, noMatchPattern);
}


/// <summary>
/// Returns the index of the first occurrence of a byte array within another byte array
/// </summary>
/// <param name="buffer">The byte array to be searched</param>
/// <param name="pattern">The byte array that contains the pattern to be found</param>
/// <returns>If buffer contains pattern then the index of the first occurrence of pattern within buffer; otherwise, -1</returns>
public static int ByteArrayPatternIndex(byte[] buffer, byte[] pattern)
{
if (buffer != null && pattern != null && pattern.Length <= buffer.Length)
{
int resumeIndex;
for (int i = 0; i <= buffer.Length - pattern.Length; i++)
{
if (buffer[i] == pattern[0]) // Current byte equals first byte of pattern
{
resumeIndex = 0;
for (int x = 1; x < pattern.Length; x++)
{
if (buffer[i + x] == pattern[x])
{
if (x == pattern.Length - 1)  // Matched the entire pattern
return i;
else if (resumeIndex == 0 && buffer[i + x] == pattern[0])  // The current byte equals the first byte of the pattern so start here on the next outer loop iteration
resumeIndex = i + x;
}
else
{
if (resumeIndex > 0)
i = resumeIndex - 1;  // The outer loop iterator will increment so subtract one
else if (x > 1)
i += (x - 1);  // Advance the outer loop variable since we already checked these bytes
break;
}
}
}
}
}
return -1;
}


/// <summary>
/// Returns the indexes of each occurrence of a byte array within another byte array
/// </summary>
/// <param name="buffer">The byte array to be searched</param>
/// <param name="pattern">The byte array that contains the pattern to be found</param>
/// <returns>If buffer contains pattern then the indexes of the occurrences of pattern within buffer; otherwise, null</returns>
/// <remarks>A single byte in the buffer array can only be part of one match.  For example, if searching for 1,2,1 in 1,2,1,2,1 only zero would be returned.</remarks>
public static int[] ByteArrayPatternIndex(byte[] buffer, byte[] pattern)
{
if (buffer != null && pattern != null && pattern.Length <= buffer.Length)
{
List<int> indexes = new List<int>();
int resumeIndex;
for (int i = 0; i <= buffer.Length - pattern.Length; i++)
{
if (buffer[i] == pattern[0]) // Current byte equals first byte of pattern
{
resumeIndex = 0;
for (int x = 1; x < pattern.Length; x++)
{
if (buffer[i + x] == pattern[x])
{
if (x == pattern.Length - 1)  // Matched the entire pattern
indexes.Add(i);
else if (resumeIndex == 0 && buffer[i + x] == pattern[0])  // The current byte equals the first byte of the pattern so start here on the next outer loop iteration
resumeIndex = i + x;
}
else
{
if (resumeIndex > 0)
i = resumeIndex - 1;  // The outer loop iterator will increment so subtract one
else if (x > 1)
i += (x - 1);  // Advance the outer loop variable since we already checked these bytes
break;
}
}
}
}
if (indexes.Count > 0)
return indexes.ToArray();
}
return null;
}

小开

My version of Foubar's answer above, which avoids searching past the end of the haystack, and allows specifying a starting offset. Assumes needle is not empty or longer than the haystack.

public static unsafe long IndexOf(this byte[] haystack, byte[] needle, long startOffset = 0)
{
fixed (byte* h = haystack) fixed (byte* n = needle)
{
for (byte* hNext = h + startOffset, hEnd = h + haystack.LongLength + 1 - needle.LongLength, nEnd = n + needle.LongLength; hNext < hEnd; hNext++)
for (byte* hInc = hNext, nInc = n; *nInc == *hInc; hInc++)
if (++nInc == nEnd)
return hNext - h;
return -1;
}
}

小开

You can use ORegex:

var oregex = new ORegex<byte>("{0}{1}{2}", x=> x==12, x=> x==3, x=> x==5);
var toSearch = new byte[]{1,1,12,3,5,1,12,3,5,5,5,5};


var found = oregex.Matches(toSearch);

Will be found two matches:

i:2;l:3
i:6;l:3

Complexity: O(n*m) in worst case, in real life it is O(n) because of internal state machine. It is faster than .NET Regex in some cases. It is compact, fast and designed especialy for array pattern matching.

小开

This is my propossal, more simple and faster:

int Search(byte[] src, byte[] pattern)
{
int maxFirstCharSlot = src.Length - pattern.Length + 1;
for (int i = 0; i < maxFirstCharSlot; i++)
{
if (src[i] != pattern[0]) // compare only first byte
continue;
        

// found a match on first byte, now try to match rest of the pattern
for (int j = pattern.Length - 1; j >= 1; j--)
{
if (src[i + j] != pattern[j]) break;
if (j == 1) return i;
}
}
return -1;
}

The logic behind this code is this: in first place it search ONLY THE FIRST BYTE (this is the key improvement) and when is found this first byte, i try to match the rest of pattern

小开

I tried to understand Sanchez's proposal and make faster search.Below code's performance nearly equal.But code is more understandable.

public int Search3(byte[] src, byte[] pattern)
{
int index = -1;


for (int i = 0; i < src.Length; i++)
{
if (src[i] != pattern[0])
{
continue;
}
else
{
bool isContinoue = true;
for (int j = 1; j < pattern.Length; j++)
{
if (src[++i] != pattern[j])
{
isContinoue = true;
break;
}
if(j == pattern.Length - 1)
{
isContinoue = false;
}
}
if ( ! isContinoue)
{
index = i-( pattern.Length-1) ;
break;
}
}
}
return index;
}

小开

This is my own approach on the topic. I used pointers to ensure that it is faster on larger arrays. This function will return the first occurence of the sequence (which is what I needed in my own case).

I am sure you can modify it a little bit in order to return a list with all occurences.

What I do is fairly simple. I loop through the source array (haystack) until I find the first byte of the pattern (needle). When the first byte is found, I continue checking separately if the next bytes match the next bytes of the pattern. If not, I continue searching as normal, from the index (in the haystack) I was previously on, before trying to match the needle.

So here's the code:

    public unsafe int IndexOfPattern(byte[] src, byte[] pattern)
{
fixed(byte *srcPtr = &src[0])
fixed (byte* patternPtr = &pattern[0])
{
for (int x = 0; x < src.Length; x++)
{
byte currentValue = *(srcPtr + x);


if (currentValue != *patternPtr) continue;


bool match = false;


for (int y = 0; y < pattern.Length; y++)
{
byte tempValue = *(srcPtr + x + y);
if (tempValue != *(patternPtr + y))
{
match = false;
break;
}


match = true;
}


if (match)
return x;
}
}
return -1;
}

Safe code below:

    public int IndexOfPatternSafe(byte[] src, byte[] pattern)
{
for (int x = 0; x < src.Length; x++)
{
byte currentValue = src[x];
if (currentValue != pattern[0]) continue;


bool match = false;


for (int y = 0; y < pattern.Length; y++)
{
byte tempValue = src[x + y];
if (tempValue != pattern[y])
{
match = false;
break;
}


match = true;
}


if (match)
return x;
}


return -1;
}

小开

I hit this problem the other day, try this:

        public static long FindBinaryPattern(byte[] data, byte[] pattern)
{
using (MemoryStream stream = new MemoryStream(data))
{
return FindBinaryPattern(stream, pattern);
}
}
public static long FindBinaryPattern(string filename, byte[] pattern)
{
using (FileStream stream = new FileStream(filename, FileMode.Open))
{
return FindBinaryPattern(stream, pattern);
}
}
public static long FindBinaryPattern(Stream stream, byte[] pattern)
{
byte[] buffer = new byte[1024 * 1024];
int patternIndex = 0;
int read;
while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
{
for (int bufferIndex = 0; bufferIndex < read; ++bufferIndex)
{
if (buffer[bufferIndex] == pattern[patternIndex])
{
++patternIndex;
if (patternIndex == pattern.Length)
return stream.Position - (read - bufferIndex) - pattern.Length + 1;
}
else
{
patternIndex = 0;
}
}
}
return -1;
}

It does nothing clever, keeps it simple.

小开

If you are using .NET Core 2.1 or later (or a .NET Standard 2.1 or later platform) you can use the MemoryExtensions.IndexOf extension method with the new Span type:

int matchIndex = toBeSearched.AsSpan().IndexOf(pattern);

To find all occurrences, you could use something like:

public static IEnumerable<int> IndexesOf(this byte[] haystack, byte[] needle,
int startIndex = 0, bool includeOverlapping = false)
{
int matchIndex = haystack.AsSpan(startIndex).IndexOf(needle);
while (matchIndex >= 0)
{
yield return startIndex + matchIndex;
startIndex += matchIndex + (includeOverlapping ? 1 : needle.Length);
matchIndex = haystack.AsSpan(startIndex).IndexOf(needle);
}
}

Unfortunately, the implementation in .NET Core 2.1 - 3.0 uses an iterated "optimized single-byte search on first-byte then check remainder" approach rather than a fast string search algorithm, but that could change in a future release. (See dotnet/runtime#60866.)

小开

I use a simple generic method

void Main()
{
Console.WriteLine(new[]{255,1,3,4,8,99,92,9,0,5,128}.Position(new[]{9,0}));
    

Console.WriteLine("Philipp".ToArray().Position("il".ToArray()));


Console.WriteLine(new[] { "Mo", "Di", "Mi", "Do", "Fr", "Sa", "So","Mo", "Di", "Mi", "Do", "Fr", "Sa", "So"}.Position(new[] { "Fr", "Sa" }, 7));
}


static class Extensions
{
public static int Position<T>(this T[] source, T[] pattern, int start = 0)
{
var matchLenght = 0;
foreach (var indexSource in Enumerable.Range(start, source.Length - pattern.Length))
foreach (var indexPattern in Enumerable.Range(0, pattern.Length))
if (source[indexSource + indexPattern].Equals(pattern[indexPattern]))
if (++matchLenght == pattern.Length)
return indexSource;
return -1;
}
}

Output:

7
2
11