Hi. I have to check whether there are duplicate paths in a FileListBox (the FileListBox plays the role of a kind of job list or playlist). Using Delphi's SameText, CompareStr, or CompareText takes 6 seconds. So I came up with my own compare function, which is (just) a bit faster but still not fast enough. Any ideas on how to improve it?

function SameFile(CONST Path1, Path2: string): Boolean;
VAR i: Integer;
begin
 Result:= Length(Path1)= Length(Path2);                                         { if they have different lengths then obviously they are not the same file }
 if Result then
  for i:= Length(Path1) downto 1 DO                                             { start from the end because it is more likely to find the difference there }
   if Path1[i]<> Path2[i] then
    begin
     Result:= FALSE;
     Break;
    end;
end;

I use it like this:

 for x:= JList.Count-1 downto 1 DO
  begin
   sMaster:= JList.Items[x];
   for y:= x-1 downto 0 DO
    if SameFile(sMaster, JList.Items[y]) then
     begin
      JList.Items.Delete (x); { REMOVE DUPLICATES }
      Break;
     end;
  end;

Note: The chance of having duplicates is small, so Delete is not called often. Also, the list cannot be sorted, because the items are added by the user and sometimes the order may be important.

Update:
The thing is that I lose the advantage of my code because it is plain Pascal. It would be nice if the comparison loop (Path1[i] <> Path2[i]) were optimized to use Borland's assembler code.


Delphi 7, Windows XP 32-bit. Tests were done with 577 items in the list. Deleting the items from the list IS NOT A PROBLEM because it happens rarely.


CONCLUSION

As Svein Bringsli pointed out, my code is slow not because of the comparison algorithm but because of TListBox. The BEST solution was provided by Marcelo Cantos. Thanks a lot, Marcelo.
I accepted Svein's answer because it directly answers my question "how do I make my comparison function faster" with "there is no point in making it faster".
For the moment I implemented the quick-and-dirty solution: when there are under 200 files, I use my slow code to check for duplicates. If there are more than 200 files, I use dwrbudr's solution (which is damn fast), on the grounds that with so many files the order is irrelevant anyway (the human brain cannot track that many items).

I want to thank you all for the ideas, and especially Svein for revealing the truth: (Borland's) visual controls are damn slow!

+14  A: 
Marcelo Cantos
+1 for sorting.
Chris Thornton
Sorry, these tricks will not help me. 1. The items are added by the user (some kind of playlist). I could sort the list, but the user will be pretty mad at me. 2. As the list is built by the user, Delete is used rarely (I already specified that in my post). This is like a safety net (rarely used :)
Altar
+1 for your post. It may be useful for other people who can use Sort.
Altar
@Altar, I think you've misunderstood. You don't have to sort the FileListBoxes. As Marcelo pointed out, *copy the list, sort the copy, ...*
Lieven
@Altar: I've amended my answer to accommodate order-preservation.
Marcelo Cantos
@Altar: I've amended it again with some sample code.
Marcelo Cantos
@Altar: really, you need to sort. If you need to preserve the order, then use some sort of structure so that you can restore the original order. An array of records, where each record holds the path and its original position. Then you sort the whole thing on the path. Before you put it back in the FileListBox, sort again on the original position. TStringList has a custom sort option that could be useful here.
Chris Thornton
@Chris: Sorting isn't the only option. You could build up a hash table. You could even do it in a single pass by inspecting the hash table as you traverse. Sorting is just convenient because TStringList does it already.
Marcelo Cantos
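A minimal sketch of that single-pass idea, with a sorted TStringList standing in for the hash table since Delphi 7 has no built-in hash map (the procedure name is illustrative, not from the thread):

procedure RemoveDuplicatesInOnePass(Items: TStrings);
var
  Seen: TStringList;
  i: Integer;
begin
  Seen := TStringList.Create;
  try
    Seen.Sorted := True;           { IndexOf now does a binary search; case-insensitive by default, which suits Windows paths }
    Items.BeginUpdate;             { avoid repainting the control on every Delete }
    try
      i := 0;
      while i < Items.Count do
        if Seen.IndexOf(Items[i]) >= 0 then
          Items.Delete(i)          { duplicate: drop it, keep the first occurrence }
        else
        begin
          Seen.Add(Items[i]);
          Inc(i);
        end;
    finally
      Items.EndUpdate;
    end;
  finally
    Seen.Free;
  end;
end;

Called as RemoveDuplicatesInOnePass(JList.Items), this keeps the original order and touches each item only once; the sorted lookup replaces the O(n*n) pairwise comparison.
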
A: 

4 seconds for how many calls? Great performance if you call it a billion times...

Anyway, does Length(Path1) get evaluated every time through the loop? If so, store that in an Integer variable prior to looping.

Pointers may yield some speed over the strings.

Try in-lining the function with: function SameFile(blah blah): Boolean; Inline;

That will save some time, if this is being called thousands of times per second. I would start with that and see if it saves anything.
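For illustration, a sketch of the function with both of those tweaks applied; this is only a sketch, since the inline directive needs Delphi 2005 or later and will not compile under the Delphi 7 mentioned in the question:

function SameFile(CONST Path1, Path2: string): Boolean; inline;   { inline requires Delphi 2005 or later }
VAR i, Len: Integer;
begin
 Len:= Length(Path1);                                              { evaluate Length once, before the loop }
 Result:= Len = Length(Path2);
 if Result then
  for i:= Len downto 1 DO
   if Path1[i]<> Path2[i] then
    begin
     Result:= FALSE;
     Break;
    end;
end;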

EDIT: I didn't realize that your list wasn't sorted. Obviously, you should do that first! Then you don't have to compare against every other item in the list - just the prior or next one.

Chris Thornton
I have Delphi 7. Inline is not available in D7. "Does Length(Path1) get evaluated every time through the loop?" - I already did that, but as it is in the outer loop it doesn't help that much.
Altar
+7  A: 

Using your code as a starting point, I modified it to take a copy of the list before searching for duplicates. The time went from 5.5 seconds to about 0.5 seconds.

vSL := TStringList.Create;
try
  vSL.Assign(jList.Items);
  vSL.Sorted := true;
  for x:= vSL.Count-1 downto 1 DO
  begin
   sMaster:= vSL[x];
   for y:= x-1 downto 0 DO
    if SameFile(sMaster, vSL[y]) then
     begin
      vSL.Delete (x); { REMOVE DUPLICATES }
      jList.Items.Delete (x); { NB: after sorting the copy, index x no longer matches the listbox, so this delete is only illustrative }
      Break;
     end;
  end;
finally
  vSL.Free;
end;

Obviously, this is not a good way to do it, but it demonstrates that TFileListBox is in itself quite slow. I don't believe you can gain much by optimizing your compare function.

To demonstrate this, I replaced your SameFile function with the following, but kept the rest of your code:

function SameFile(CONST Path1, Path2: string): Boolean;
VAR i: Integer;
begin
  Result := false; //Pretty darn fast code!!!
end;

The time went from 5.6 seconds to 5.5 seconds. I don't think there's much more to gain there :-)

Svein Bringsli
Oh my God!!!!!!!!!!!!!!!!
Altar
+1 for not blindly answering the question but solving the problem.
mghie
+1  A: 

Create another sorted list with sortedList.Duplicates := dupIgnore, add your strings to that list, and then copy them back.


vSL := TStringList.Create;
try
  vSL.Sorted := true;
  vSL.Duplicates := dupIgnore;
  for x:= 0 to jList.Count - 1 do
    vSL.Add(jList[x]);

  jList.Clear;
  for x:= 0 to vSL.Count - 1 do
    jList.Add(vSL[x]);
finally
  vSL.Free;
end;
dwrbudr
A: 

I use a modified Ternary Search Tree (TST) to dedupe lists. You simply load the items into the tree, using the whole string as the key, and for each item you get back an indication of whether the key is already there (and delete your visible entry). Then you throw away the tree. Our TST load function can typically load 100,000 80-byte items in well under a second. Repainting your list should take no longer than that, with proper use of BeginUpdate and EndUpdate. The TST is memory-hungry, but not so that you would notice it at all if you only have on the order of 500 items. And it is much simpler than sorting, comparisons and assembler (if you have a suitable TST implementation, of course).
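
For reference, a toy sketch of such a tree (my own, not the poster's implementation): TstInsert returns True when the key was already present, so duplicates can be flagged while the list is loaded. Freeing the nodes afterwards is omitted, and Key must be non-empty.

type
  PTstNode = ^TTstNode;
  TTstNode = record
    SplitChar: Char;
    Lo, Eq, Hi: PTstNode;
    IsKey: Boolean;                  { True if a complete key ends at this node }
  end;

function TstInsert(var Node: PTstNode; const Key: string; Pos: Integer): Boolean;
begin
  if Node = nil then
  begin
    New(Node);
    Node^.SplitChar := Key[Pos];
    Node^.Lo := nil;  Node^.Eq := nil;  Node^.Hi := nil;
    Node^.IsKey := False;
  end;
  if Key[Pos] < Node^.SplitChar then
    Result := TstInsert(Node^.Lo, Key, Pos)
  else if Key[Pos] > Node^.SplitChar then
    Result := TstInsert(Node^.Hi, Key, Pos)
  else if Pos < Length(Key) then
    Result := TstInsert(Node^.Eq, Key, Pos + 1)
  else
  begin
    Result := Node^.IsKey;           { was this exact key inserted before? }
    Node^.IsKey := True;
  end;
end;

Loading is a single pass: start with Root := nil and call TstInsert(Root, Path, 1) for each path; a True result means the path is a duplicate and its visible entry can be deleted.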

frogb
+1  A: 

The absolute fastest way, bar none (as alluded to before) is to use a routine that generates a unique 64/128/256 bit hash code for a string (I use the SHA256Managed class in C#). Run down the list of strings, generate the hash code for the strings, check for it in the sorted hash code list, and if found then the string is a duplicate. Otherwise add the hash code to the sorted hash code list.

This will work for strings, file names, images (you can get the unique hash code for an image), etc., and I guarantee that this will be as fast as or faster than any other implementation.

PS You can use a string list for the hash codes by representing the hash codes as strings. I've used a hex representation in the past (256 bits -> 64 characters) but in theory you can do it any way you like.
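
A sketch of that pattern in Delphi terms; HashHexOf is only a placeholder, since Delphi 7 ships no SHA-256 routine, so substitute whatever digest function you actually have:

function HashHexOf(const S: string): string;
begin
  { Placeholder only: replace with a real digest (e.g. an SHA-256 hex string)
    from your hashing library of choice. }
  Result := S;
end;

procedure RemoveDuplicatesByHash(Items: TStrings);
var
  SeenHashes: TStringList;
  Digest: string;
  i: Integer;
begin
  SeenHashes := TStringList.Create;
  try
    SeenHashes.Sorted := True;       { keeps the digests sorted for fast lookup }
    i := 0;
    while i < Items.Count do
    begin
      Digest := HashHexOf(Items[i]);
      if SeenHashes.IndexOf(Digest) >= 0 then
        Items.Delete(i)              { digest already seen: treat as a duplicate }
      else
      begin
        SeenHashes.Add(Digest);
        Inc(i);
      end;
    end;
  finally
    SeenHashes.Free;
  end;
end;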

Misha
Actually, I have posted a better option that doesn't require hash codes.
Misha
A: 

No need to use a hash table: a single sorted list gives me a result of 10 milliseconds (0.01 seconds), which is about 500 times faster! Here is my test code using a TListBox:

procedure TForm1.Button1Click(Sender: TObject);
var
  lIndex1: Integer;
  lString: string;
  lIndex2: Integer;
  lStrings: TStringList;
  lCount: Integer;
  lItems: TStrings;
begin
  ListBox1.Clear;
  for lIndex1 := 1 to 577 do begin
    lString := '';
    for lIndex2 := 1 to 100 do
      if (lIndex2 mod 6) = 0 then
       lString := lString + Chr(Ord('a') + Random(2))
      else
        lString := lString + 'a';
    ListBox1.Items.Add(lString);
  end;

  CsiGlobals.AddLogMsg('Start', 'Test', llBrief);

  lStrings := TStringList.Create;
  try
    lStrings.Sorted := True;
    lCount := 0;
    lItems := ListBox1.Items;
    with lItems do begin
      BeginUpdate;
      try
        for lIndex1 := Count - 1 downto 0 do begin
          lStrings.Add(Strings[lIndex1]);
          if lStrings.Count = lCount then
            Delete(lIndex1)
          else
            Inc(lCount);
        end;
      finally
        EndUpdate;
      end;
    end;
  finally
    lStrings.Free;
  end;

  CsiGlobals.AddLogMsg('Stop', 'Test', llBrief);
end;
Misha
See the answer provided by Svein Bringsli.
Altar
TListBox is NOT slow if you use BeginUpdate and EndUpdate and "batch" the updates. You could do 500 deletes in a matter of milliseconds if you only update the list once, as the code above shows. I'd add a comment to your question if I were allowed to add a comment!!!
Misha
And you are doing two nested loops when only one is needed. It is the double loop that is incredibly slow, not the TListBox.
Misha
It is a TListBox. As explained, deleting items is not the problem. When I delete items it is probably slow, but I don't care: I delete items rarely, probably never.
Altar
A: 

I'd also like to point out that your solution would take an extreme amount of time if applied to a huge list (say, one containing 100,000,000 items or more). Even constructing a hash table or sorted list would take too much time.

In cases like that you could try another approach: hash each member, but instead of populating a full-blown hash table, create a bitset (large enough to contain roughly as many slots as there are input items) and just set the bit at the offset indicated by the hash function. If the bit was 0, change it to 1. If it was already 1, take note of the offending string index in a separate list and continue. This results in a list of string indexes that had a collision in the hash, so you'll have to run it a second time to find the first cause of each of those collisions. After that, you should sort and de-dupe the string indexes in this list (as all indexes apart from the first one will be present twice). Once that's done you should sort the list again, but this time on the string contents, in order to easily spot duplicates in a following single scan.

Granted, it could be a bit extreme to go to all this length, but at least it's a workable solution for very large volumes! (Oh, and this still won't work if the number of duplicates is very high, if the hash function has a bad spread, or if the number of slots in the 'hashtable' bitset is chosen too small - which would give many collisions that aren't really duplicates.)
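
A compact sketch of that first pass (the hash function and the names are mine, and for 577 items this is certainly overkill). It only produces candidate indexes; they still have to be verified against the actual strings afterwards, because unrelated strings can land in the same slot:

uses Classes;                        { TBits and TList }

{ Simple multiplicative string hash; relies on unsigned wrap-around
  (overflow checking off, which is the Delphi default). }
function SlotOf(const S: string; SlotCount: Integer): Integer;
var
  i: Integer;
  h: Cardinal;
begin
  h := 0;
  for i := 1 to Length(S) do
    h := h * 31 + Ord(S[i]);
  Result := h mod Cardinal(SlotCount);
end;

procedure CollectDuplicateCandidates(Items: TStrings; Candidates: TList);
var
  Seen: TBits;
  i, Slot: Integer;
begin
  Seen := TBits.Create;
  try
    Seen.Size := Items.Count * 8;    { plenty of slots relative to the input }
    for i := 0 to Items.Count - 1 do
    begin
      Slot := SlotOf(Items[i], Seen.Size);
      if Seen.Bits[Slot] then
        Candidates.Add(Pointer(i))   { possible duplicate: remember the index }
      else
        Seen.Bits[Slot] := True;
    end;
  finally
    Seen.Free;
  end;
end;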

PatrickvL