This answer is now complete except that the code is not well (Unit) tested, and could probably be optimized further, for example I repeated the local function __SameChar instead of using a comparison function callback which would have been faster, and actually, allowing the user to pass in a comparison function for all these would be great for Unicode users who want to provide some extra logic (equivalent sets of Unicode glyphs for some languages). I do however have a Boyer-Moore based solution for case-insensitive Pos, and faster substring counting, and now search and replace too:
Based on Dorin Dominica's code, I built the following.
{ _FindStringBoyer:
Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM.
Credited to Dorin Duminica.
}
function _FindStringBoyer(const sString, sPattern: string;
const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;
function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
if bCaseSensitive then
Result := (sString[StringIndex] = sPattern[PatternIndex])
else
Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
var
SkipTable: array [Char] of Integer;
LengthPattern: Integer;
LengthString: Integer;
Index: Integer;
kIndex: Integer;
LastMarker: Integer;
Large: Integer;
chPattern: Char;
begin
if fromPos < 1 then
raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
LengthPattern := Length(sPattern);
LengthString := Length(sString);
for chPattern := Low(Char) to High(Char) do
SkipTable[chPattern] := LengthPattern;
for Index := 1 to LengthPattern -1 do
SkipTable[sPattern[Index]] := LengthPattern - Index;
Large := LengthPattern + LengthString + 1;
LastMarker := SkipTable[sPattern[LengthPattern]];
SkipTable[sPattern[LengthPattern]] := Large;
Index := fromPos + LengthPattern -1;
Result := 0;
while Index <= LengthString do begin
repeat
Index := Index + SkipTable[sString[Index]];
until Index > LengthString;
if Index <= Large then
Break
else
Index := Index - Large;
kIndex := 1;
while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
Inc(kIndex);
if kIndex = LengthPattern then begin
// Found, return.
Result := Index - kIndex + 1;
Index := Index + LengthPattern;
exit;
end else begin
if __SameChar(Index, LengthPattern) then
Index := Index + LastMarker
else
Index := Index + SkipTable[sString[Index]];
end; // if kIndex = LengthPattern then begin
end; // while Index <= LengthString do begin
end;
{ Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
a substring inside the main string, at a much faster rate than we
could have done otherwise. Another thing that would be great is
to have a function that returns an array of find-locations,
which would be way faster to do than repeatedly calling Pos.
}
function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
var
foundPos:Integer;
fromPos:Integer;
Limit:Integer;
guard:Integer;
SkipTable: array [Char] of Integer;
LengthPattern: Integer;
LengthString: Integer;
Index: Integer;
kIndex: Integer;
LastMarker: Integer;
Large: Integer;
chPattern: Char;
function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
if CaseSensitive then
Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
else
Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
result := 0;
foundPos := 1;
fromPos := 1;
Limit := Length(aSourceString);
guard := Length(aFindString);
Index := 0;
LengthPattern := Length(aFindString);
LengthString := Length(aSourceString);
for chPattern := Low(Char) to High(Char) do
SkipTable[chPattern] := LengthPattern;
for Index := 1 to LengthPattern -1 do
SkipTable[aFindString[Index]] := LengthPattern - Index;
Large := LengthPattern + LengthString + 1;
LastMarker := SkipTable[aFindString[LengthPattern]];
SkipTable[aFindString[LengthPattern]] := Large;
while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin
Index := fromPos + LengthPattern -1;
if Index>Limit then
break;
kIndex := 0;
while Index <= LengthString do begin
repeat
Index := Index + SkipTable[aSourceString[Index]];
until Index > LengthString;
if Index <= Large then
Break
else
Index := Index - Large;
kIndex := 1;
while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
Inc(kIndex);
if kIndex = LengthPattern then begin
// Found, return.
//Result := Index - kIndex + 1;
Index := Index + LengthPattern;
fromPos := Index;
Inc(Result);
break;
end else begin
if __SameChar(Index, LengthPattern) then
Index := Index + LastMarker
else
Index := Index + SkipTable[aSourceString[Index]];
end; // if kIndex = LengthPattern then begin
end; // while Index <= LengthString do begin
end;
end;
This is really a nice Algorithm, because:
- it's way faster to count instances of substring X in string Y this way, magnificently so.
- For merely replacing Pos() the _FindStringBoyer() is faster than the pure asm version of Pos() contributed to Delphi by FastCode project people, that is currently used for Pos, and if you need the case-insensitivity, you can imagine the performance boost when we don't have to call UpperCase on a 100 megabyte string. (Okay, your strings aren't going to be THAT big. But still, Efficient Algorithms are a Thing of Beauty.)
Okay I wrote a String Replace in Boyer-Moore style:
function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
var
errors:Integer;
fromPos:Integer;
Limit:Integer;
guard:Integer;
SkipTable: array [Char] of Integer;
LengthPattern: Integer;
LengthString: Integer;
Index: Integer;
kIndex: Integer;
LastMarker: Integer;
Large: Integer;
chPattern: Char;
CaseSensitive:Boolean;
foundAt:Integer;
lastFoundAt:Integer;
copyStartsAt:Integer;
copyLen:Integer;
function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
if CaseSensitive then
Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
else
Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
result := '';
lastFoundAt := 0;
fromPos := 1;
errors := 0;
CaseSensitive := rfIgnoreCase in Flags;
Limit := Length(aSourceString);
guard := Length(aFindString);
Index := 0;
LengthPattern := Length(aFindString);
LengthString := Length(aSourceString);
for chPattern := Low(Char) to High(Char) do
SkipTable[chPattern] := LengthPattern;
for Index := 1 to LengthPattern -1 do
SkipTable[aFindString[Index]] := LengthPattern - Index;
Large := LengthPattern + LengthString + 1;
LastMarker := SkipTable[aFindString[LengthPattern]];
SkipTable[aFindString[LengthPattern]] := Large;
while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin
Index := fromPos + LengthPattern -1;
if Index>Limit then
break;
kIndex := 0;
foundAt := 0;
while Index <= LengthString do begin
repeat
Index := Index + SkipTable[aSourceString[Index]];
until Index > LengthString;
if Index <= Large then
Break
else
Index := Index - Large;
kIndex := 1;
while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
Inc(kIndex);
if kIndex = LengthPattern then begin
foundAt := Index - kIndex + 1;
Index := Index + LengthPattern;
//fromPos := Index;
fromPos := (foundAt+LengthPattern);
if lastFoundAt=0 then begin
copyStartsAt := 1;
copyLen := foundAt-copyStartsAt;
end else begin
copyStartsAt := lastFoundAt+LengthPattern;
copyLen := foundAt-copyStartsAt;
end;
if (copyLen<=0)or(copyStartsAt<=0) then begin
Inc(errors);
end;
Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString;
lastFoundAt := foundAt;
if not (rfReplaceAll in Flags) then
fromPos := 0; // break out of outer while loop too!
break;
end else begin
if __SameChar(Index, LengthPattern) then
Index := Index + LastMarker
else
Index := Index + SkipTable[aSourceString[Index]];
end; // if kIndex = LengthPattern then begin
end; // while Index <= LengthString do begin
end;
// the part that didn't require any replacing, because nothing more was found,
// or rfReplaceAll flag was not specified, is copied at the
// end as the final step.
if (lastFoundAt+LengthPattern < Limit) then begin
copyStartsAt := lastFoundAt+LengthPattern;
copyLen := Limit; { this number can be larger than needed to be, and it is harmless }
Result := Result + Copy(aSourceString, copyStartsAt, copyLen );
end;
end;
Okay, problem: Stack footprint of this:
var
skiptable : array [Char] of Integer; // 65536*4 bytes stack usage on Unicode delphi
Goodbye CPU hell, Hello stack hell. If I go for a dynamic array, then I have to resize it at runtime. So this thing is basically fast, because the Virtual Memory system on your computer doesn't blink at 256K going on the stack, but this is not always an optimal piece of code. Nevertheless my PC doesn't blink at big stack stuff like this. It's not going to become a Delphi standard library default or win any fastcode challenge in the future, with that kinda footprint. I think that repeated searches are a case where the above code should be written as a class, and the skiptable should be a data field in that class. Then you can build the boyer-moore table once, and over time, if the string is invariant, repeatedly use that object to do fast lookups.