Hi. I have to check whether there are duplicate paths in a FileListBox (the FileListBox plays the role of a kind of job list or playlist). Using Delphi's SameText, CompareStr, or CompareText takes 6 seconds. So I came up with my own compare function, which is (just) a bit faster but still not fast enough. Any ideas on how to improve it?

function SameFile(CONST Path1, Path2: string): Boolean;
VAR i: Integer;
begin
 Result:= Length(Path1)= Length(Path2);                                         { if they have different lengths then obviously they are not the same file }
 if Result then
  for i:= Length(Path1) downto 1 DO                                             { start from the end because it is more likely to find the difference there }
   if Path1[i]<> Path2[i] then
    begin
     Result:= FALSE;
     Break;
    end;
end;

I use it like this:

 for x:= JList.Count-1 downto 1 DO
  begin
   sMaster:= JList.Items[x];
   for y:= x-1 downto 0 DO
    if SameFile(sMaster, JList.Items[y]) then
     begin
      JList.Items.Delete (x); { REMOVE DUPLICATES }
      Break;
     end;
  end;

Note: The chance of having duplicates is small, so Delete is not called often. Also, the list cannot be sorted, because the items are added by the user and sometimes the order may be important.

Update:
The thing is that I lose the advantage of my code because it is plain Pascal. It would be nice if the comparison loop (Path1[i] <> Path2[i]) were optimized to use Borland's assembler code.


Delphi 7, Windows XP 32-bit. Tests were done with 577 items in the list. Deleting the items from the list IS NOT A PROBLEM because it happens rarely.


CONCLUSION

As Svein Bringsli pointed out, my code is slow not because of the comparison algorithm but because of TListBox. The BEST solution was provided by Marcelo Cantos. Thanks a lot, Marcelo.
I accepted Svein's answer because it directly answers my question "how do I make my comparison function faster" with "there is no point in making it faster".
For the moment I implemented the quick-and-dirty solution: when there are under 200 files, I use my slow code to check for duplicates. If there are more than 200 files, I use dwrbudr's solution (which is damn fast), on the grounds that with so many files the order is irrelevant anyway (the human brain cannot track that many items).

I want to thank you all for the ideas, and especially Svein for revealing the truth: (Borland's) visual controls are damn slow!

+14  A: 
Marcelo Cantos
+1 for sorting.
Chris Thornton
Sorry, these tricks will not help me. 1. The items are added by the user (some kind of playlist). I could sort the list, but the user will be pretty mad at me. 2. As the list is built by the user, Delete is used rarely (I already specified that in my post). This is like a safety net (rarely used :)
Altar
+1 for your post. It may be useful for other people who can use Sort.
Altar
@Altar, I think you've misunderstood. You don't have to sort the FileListBoxes. As Marcelo pointed out, *copy the list, sort the copy, ...*
Lieven
@Altar: I've amended my answer to accommodate order-preservation.
Marcelo Cantos
@Altar: I've amended it again with some sample code.
Marcelo Cantos
@Altar: really, you need to sort. If you need to preserve the order, then use some sort of structure so that you can restore the original order. An array of records, where each record holds the path and its original position. Then you sort the whole thing on the path. Before you put it back in the FileListBox, sort again on the original position. TStringList has a custom sort option that could be useful here.
Chris Thornton
@Chris: Sorting isn't the only option. You could build up a hash table. You could even do it in a single pass by inspecting the hash table as you traverse. Sorting is just convenient because TStringList does it already.
Marcelo Cantos
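A minimal sketch of that single-pass idea, with a sorted TStringList standing in for the hash table since Delphi 7 has no built-in hash map (the procedure name is illustrative, not from the thread):

procedure RemoveDuplicatesInOnePass(Items: TStrings);
var
  Seen: TStringList;
  i: Integer;
begin
  Seen := TStringList.Create;
  try
    Seen.Sorted := True;           { IndexOf now does a binary search; case-insensitive by default, which suits Windows paths }
    Items.BeginUpdate;             { avoid repainting the control on every Delete }
    try
      i := 0;
      while i < Items.Count do
        if Seen.IndexOf(Items[i]) >= 0 then
          Items.Delete(i)          { duplicate: drop it, keep the first occurrence }
        else
        begin
          Seen.Add(Items[i]);
          Inc(i);
        end;
    finally
      Items.EndUpdate;
    end;
  finally
    Seen.Free;
  end;
end;

Called as RemoveDuplicatesInOnePass(JList.Items), this keeps the original order and touches each item only once; the sorted lookup replaces the O(n*n) pairwise comparison.
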
A: 

4 seconds for how many calls? Great performance if you call it a billion times...

Anyway, does Length(Path1) get evaluated every time through the loop? If so, store that in an Integer variable prior to looping.

Pointers may yield some speed over the strings.

Try in-lining the function with: function SameFile(blah blah): Boolean; Inline;

That will save some time, if this is being called thousands of times per second. I would start with that and see if it saves anything.
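For illustration, a sketch of the function with both of those tweaks applied; this is only a sketch, since the inline directive needs Delphi 2005 or later and will not compile under the Delphi 7 mentioned in the question:

function SameFile(CONST Path1, Path2: string): Boolean; inline;   { inline requires Delphi 2005 or later }
VAR i, Len: Integer;
begin
 Len:= Length(Path1);                                              { evaluate Length once, before the loop }
 Result:= Len = Length(Path2);
 if Result then
  for i:= Len downto 1 DO
   if Path1[i]<> Path2[i] then
    begin
     Result:= FALSE;
     Break;
    end;
end;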

EDIT: I didn't realize that your list wasn't sorted. Obviously, you should do that first! Then you don't have to compare against every other item in the list - just the prior or next one.

Chris Thornton
I have Delphi 7. Inline is not available in D7. "Does Length(Path1) get evaluated every time through the loop?" - I already did that, but as it is in the outer loop it doesn't help that much.
Altar
+7  A: 

Using your code as a starting point, I modified it to take a copy of the list before searching for duplicates. The time went from 5.5 seconds to about 0.5 seconds.

vSL := TStringList.Create;
try
  vSL.Assign(jList.Items);
  vSL.Sorted := true;
  for x:= vSL.Count-1 downto 1 DO
  begin
   sMaster:= vSL[x];
   for y:= x-1 downto 0 DO
    if SameFile(sMaster, vSL[y]) then
     begin
      vSL.Delete (x); { REMOVE DUPLICATES }
      jList.Items.Delete (x); { NB: after sorting the copy, index x no longer matches the listbox, so this delete is only illustrative }
      Break;
     end;
  end;
finally
  vSL.Free;
end;

Obviously, this is not a good way to do it, but it demonstrates that TFileListBox is in itself quite slow. I don't believe you can gain much by optimizing your compare function.

To demonstrate this, I replaced your SameFile function with the following, but kept the rest of your code:

function SameFile(CONST Path1, Path2: string): Boolean;
VAR i: Integer;
begin
  Result := false; //Pretty darn fast code!!!
end;

The time went from 5.6 seconds to 5.5 seconds. I don't think there's much more to gain there :-)

Svein Bringsli
Oh my God!!!!!!!!!!!!!!!!
Altar
+1 for not blindly answering the question but solving the problem.
mghie
+1  A: 

Create another sorted list with sortedList.Duplicates := dupIgnore, add your strings to that list, and then copy them back.


vSL := TStringList.Create;
try
  vSL.Sorted := true;
  vSL.Duplicates := dupIgnore;
  for x:= 0 to jList.Count - 1 do
    vSL.Add(jList[x]);

  jList.Clear;
  for x:= 0 to vSL.Count - 1 do
    jList.Add(vSL[x]);
finally
  vSL.Free;
end;
dwrbudr
A: 

I use a modified Ternary Search Tree (TST) to dedupe lists. You simply load the items into the tree, using the whole string as the key, and for each item you get back an indication of whether the key is already there (and delete your visible entry). Then you throw away the tree. Our TST load function can typically load 100,000 80-byte items in well under a second. Repainting your list should take no longer than that, with proper use of BeginUpdate and EndUpdate. The TST is memory-hungry, but not so that you would notice it at all if you only have on the order of 500 items. And it is much simpler than sorting, comparisons and assembler (if you have a suitable TST implementation, of course).
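
For reference, a toy sketch of such a tree (my own, not the poster's implementation): TstInsert returns True when the key was already present, so duplicates can be flagged while the list is loaded. Freeing the nodes afterwards is omitted, and Key must be non-empty.

type
  PTstNode = ^TTstNode;
  TTstNode = record
    SplitChar: Char;
    Lo, Eq, Hi: PTstNode;
    IsKey: Boolean;                  { True if a complete key ends at this node }
  end;

function TstInsert(var Node: PTstNode; const Key: string; Pos: Integer): Boolean;
begin
  if Node = nil then
  begin
    New(Node);
    Node^.SplitChar := Key[Pos];
    Node^.Lo := nil;  Node^.Eq := nil;  Node^.Hi := nil;
    Node^.IsKey := False;
  end;
  if Key[Pos] < Node^.SplitChar then
    Result := TstInsert(Node^.Lo, Key, Pos)
  else if Key[Pos] > Node^.SplitChar then
    Result := TstInsert(Node^.Hi, Key, Pos)
  else if Pos < Length(Key) then
    Result := TstInsert(Node^.Eq, Key, Pos + 1)
  else
  begin
    Result := Node^.IsKey;           { was this exact key inserted before? }
    Node^.IsKey := True;
  end;
end;

Loading is a single pass: start with Root := nil and call TstInsert(Root, Path, 1) for each path; a True result means the path is a duplicate and its visible entry can be deleted.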

frogb
+1  A: 

The absolute fastest way, bar none (as alluded to before) is to use a routine that generates a unique 64/128/256 bit hash code for a string (I use the SHA256Managed class in C#). Run down the list of strings, generate the hash code for the strings, check for it in the sorted hash code list, and if found then the string is a duplicate. Otherwise add the hash code to the sorted hash code list.

This will work for strings, file names, images (you can get the unique hash code for an image), etc., and I guarantee that this will be as fast as or faster than any other implementation.

PS You can use a string list for the hash codes by representing the hash codes as strings. I've used a hex representation in the past (256 bits -> 64 characters) but in theory you can do it any way you like.
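
A sketch of that pattern in Delphi terms; HashHexOf is only a placeholder, since Delphi 7 ships no SHA-256 routine, so substitute whatever digest function you actually have:

function HashHexOf(const S: string): string;
begin
  { Placeholder only: replace with a real digest (e.g. an SHA-256 hex string)
    from your hashing library of choice. }
  Result := S;
end;

procedure RemoveDuplicatesByHash(Items: TStrings);
var
  SeenHashes: TStringList;
  Digest: string;
  i: Integer;
begin
  SeenHashes := TStringList.Create;
  try
    SeenHashes.Sorted := True;       { keeps the digests sorted for fast lookup }
    i := 0;
    while i < Items.Count do
    begin
      Digest := HashHexOf(Items[i]);
      if SeenHashes.IndexOf(Digest) >= 0 then
        Items.Delete(i)              { digest already seen: treat as a duplicate }
      else
      begin
        SeenHashes.Add(Digest);
        Inc(i);
      end;
    end;
  finally
    SeenHashes.Free;
  end;
end;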

Misha
Actually, I have posted a better option that doesn't require hash codes.
Misha
A: 

No need to use a hash table: a single sorted list gives me a result of 10 milliseconds (0.01 seconds), which is about 500 times faster! Here is my test code using a TListBox:

procedure TForm1.Button1Click(Sender: TObject);
var
  lIndex1: Integer;
  lString: string;
  lIndex2: Integer;
  lStrings: TStringList;
  lCount: Integer;
  lItems: TStrings;
begin
  ListBox1.Clear;
  for lIndex1 := 1 to 577 do begin
    lString := '';
    for lIndex2 := 1 to 100 do
      if (lIndex2 mod 6) = 0 then
       lString := lString + Chr(Ord('a') + Random(2))
      else
        lString := lString + 'a';
    ListBox1.Items.Add(lString);
  end;

  CsiGlobals.AddLogMsg('Start', 'Test', llBrief);

  lStrings := TStringList.Create;
  try
    lStrings.Sorted := True;
    lCount := 0;
    lItems := ListBox1.Items;
    with lItems do begin
      BeginUpdate;
      try
        for lIndex1 := Count - 1 downto 0 do begin
          lStrings.Add(Strings[lIndex1]);
          if lStrings.Count = lCount then
            Delete(lIndex1)
          else
            Inc(lCount);
        end;
      finally
        EndUpdate;
      end;
    end;
  finally
    lStrings.Free;
  end;

  CsiGlobals.AddLogMsg('Stop', 'Test', llBrief);
end;
Misha
See the answer provided by Svein Bringsli.
Altar
TListBox is NOT slow if you use BeginUpdate and EndUpdate and "batch" the updates. You could do 500 deletes in a matter of milliseconds if you only update the list once, as the code above shows. I'd add a comment to your question if I were allowed to add a comment!!!
Misha
And you are doing two nested loops when only one is needed. It is the double loop that is incredibly slow, not the TListBox.
Misha
It is a TListBox. As explained, deleting items is not the problem. When I delete items it is probably slow, but I don't care: I delete items rarely, probably never.
Altar
A: 

I'd also like to point out that your solution would take an extreme amount of time if applied to a huge list (say, one containing 100,000,000 items or more). Even constructing a hash table or sorted list would take too much time.

In cases like that you could try another approach: hash each member, but instead of populating a full-blown hash table, create a bitset (large enough to contain roughly as many slots as there are input items) and just set the bit at the offset indicated by the hash function. If the bit was 0, change it to 1. If it was already 1, take note of the offending string index in a separate list and continue. This results in a list of string indexes that had a collision in the hash, so you'll have to run it a second time to find the first cause of each of those collisions. After that, you should sort and de-dupe the string indexes in this list (as all indexes apart from the first one will be present twice). Once that's done you should sort the list again, but this time on the string contents, in order to easily spot duplicates in a following single scan.

Granted, it could be a bit extreme to go to all this length, but at least it's a workable solution for very large volumes! (Oh, and this still won't work if the number of duplicates is very high, if the hash function has a bad spread, or if the number of slots in the 'hashtable' bitset is chosen too small - which would give many collisions that aren't really duplicates.)
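
A compact sketch of that first pass (the hash function and the names are mine, and for 577 items this is certainly overkill). It only produces candidate indexes; they still have to be verified against the actual strings afterwards, because unrelated strings can land in the same slot:

uses Classes;                        { TBits and TList }

{ Simple multiplicative string hash; relies on unsigned wrap-around
  (overflow checking off, which is the Delphi default). }
function SlotOf(const S: string; SlotCount: Integer): Integer;
var
  i: Integer;
  h: Cardinal;
begin
  h := 0;
  for i := 1 to Length(S) do
    h := h * 31 + Ord(S[i]);
  Result := h mod Cardinal(SlotCount);
end;

procedure CollectDuplicateCandidates(Items: TStrings; Candidates: TList);
var
  Seen: TBits;
  i, Slot: Integer;
begin
  Seen := TBits.Create;
  try
    Seen.Size := Items.Count * 8;    { plenty of slots relative to the input }
    for i := 0 to Items.Count - 1 do
    begin
      Slot := SlotOf(Items[i], Seen.Size);
      if Seen.Bits[Slot] then
        Candidates.Add(Pointer(i))   { possible duplicate: remember the index }
      else
        Seen.Bits[Slot] := True;
    end;
  finally
    Seen.Free;
  end;
end;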

PatrickvL