ansaurus

Question

algorithm to match prefix and name to a list of names

Answer 1

A:

Simple: iterate over fileList once, generate the prefix (set 2 entry) and file name (set 1 entry), and check if they are in their respective sets. If both are, you have a match, so return it; otherwise, return nothing for that item.

Also, this handles the 'doubles' problem you mention.

strager 2010-08-14 18:06:50

How can you conclude that this algorithm is O(1) ?

Benoît 2010-08-15 07:05:59

It seems to me that this is O(K logM logN) (K=number of files in fileList, M=number of prefixes in set2, N=number of files in set1, both logs from binary searches), right? And please read Jérémie's comment, his case is important to consider.

rubenvb 2010-08-15 09:13:03

You iterate through `fileList` once is O(N). I assumed checking if an item is in a set is O(1); it seems be O(ln N), as you say. Sorry about that.

strager 2010-08-15 09:26:03

(Using a `hash_set` gives you average-case O(1) for lookups.)

strager 2010-08-15 09:31:54

strager's description is exactly what I would have given. Hash lookup is generally considered to have O(1) amoritized cost - faster than binary search.

Joe Soul-bringer 2010-08-25 18:29:16

Right, but generating the contents of hash_set is not O(1) !

Benoît 2010-08-26 11:34:51

Answer 2

A:

Just use a helper hash-table to get a runtime of set1.size() + fileList.size()

The Pseudo-Code:

unordered_set<string, list<string> > hash;
foreach (i in fileList):
  (fprex, fname) = split(i)
  hash[fname].push_back(fprex)
foreach (j in set1):
  a = hash.contains(j)
  if (a != hash.end())
    foreach(k in a)
       print k +'/' + j;

Or somethng like that. unordered_set is available in Boost (or tr1), and a insert/lookup operation is in O(1).

maxschlepzig 2010-08-21 21:01:14

Sounds nice... will this detect ambiguously defined files (see comment by Jérémie)? An additional problem that maybe I didn't specify enough is this: there's a file in fileList: `folderA/folderB/file.ext` and set1 constains: `folderB/file.ext`, and set 2 contains `folderA`. This makes the split function either ambiguous, or I need to split for every slash, which increases the size of the hashtable, no? Thanks

rubenvb 2010-08-22 10:16:55

Well, if do not want to allow same filenames in different directories just check via hash.find(fname) != hash.end() if the hash-table already contains an entry. If yes, then report an error. As a consequence of this, you just need string as value-type of hash (and no list).If you want to do suffix lookups that may include directories you can either add all directory prefixes to the hash-table or use some sophisticated data-structure like suffix-trees or radix-trees.Depends really on your use case if the use of sophisticated and complicated data-structures pays off.

maxschlepzig 2010-08-22 15:25:24

Answer 3

+1 A:

One area which you are not clear about is this:

Given set1 & set2 as described above, what if fileList had "file4.ext" and "folder\file4.ext". Would you want both? Or is the list of file in set1 guaranteed to be unique?

Assuming that you'd want both, pseudo-code:

 foreach(pathname in fileList)
    separate pathname into path & filename.
    if path is not empty, but not in set2, skip to next pathname.
    if filename is in set1, output pathname.

Since a set lookup should be O(1), total complexity is O(2 * fileList.Length)

If the filenames in set1 are unique, you can count the number of pathnames output, and exit early when set1.Length is reached.

It may seem counter-intuitive to step through the longest collection, but it also has the slowest lookup, so operations on fileList have to be minimized.

UPDATE: Here's the full working C++ code (includes & usings elided)

void ListFiles()
{
    vector<string> fileList;
    fileList.push_back("folder/file1");
    fileList.push_back("folder/file2");
    fileList.push_back("file3");
    fileList.push_back("file4.ext");

    set<string> set1;
    set1.insert("file2");
    set1.insert("file4.ext");

    set<string> set2;
    set2.insert("folder");

    for(vector<string>::iterator iter = fileList.begin();
        iter != fileList.end();
        ++iter)
    {
        string pathname = *iter;
        string filename;
        string path;
        size_t pos = pathname.find('/');
        if (pos == string::npos || pos == 0)
            filename = pathname;
        else
        {
            path = pathname.substr(0, pos);
            if (set2.find(path) == set2.end())
                continue;
            filename = pathname.substr(pos+1);
        }
        if (set1.find(filename) != set1.end())
            cout << pathname << endl;
    }

}

James Curran 2010-08-25 15:49:41

I'd like to output an error in the case of ambiguously defined files. The user should define the/a fuller pathname in set1 (eg no `folder` in set2, and only a `folder/file4.ext` in set1). That last one will be troublesome with your algorithm I think, because the "filenames" in set1 could contain a part of the full path. (I'm getting my inspiration from Qt's qmake, which handles stuff in the exact same way)

rubenvb 2010-08-25 16:17:07

Thanks for the full code example, but it still doesn't handle paths in set1, due to your assumption there's only one slash in the whole thing. I understand my example may have implied that, but it was a simplistic representation of what I wanted.

rubenvb 2010-08-25 16:24:50

Answer 4

A:

Your expected results look like you are searching for suffixes in FileList that match rows in set1 and set2 is immaterial.

The size of set2 decides which way to go for actual matching. If it's reasonably small you can turn it into a regex and either add regex anchors to match the end of string or pre-process FileList (by extracting just file name but also keeping the original string for the result). You can also reverse strings in both lists so that it becomes prefix matching indeed.

If set2 is big then you need to build hash-table from it and in this case you do need to pre-process FileList to extract file names as "keys" that you'll try to "find" in hash table. Make sure you handle case sensitivity if that's a potential issue (like converting all keys to upper case). With that in place just print out every row in FileList for which it's key is present in the hash table build from set 1.

If set 2 does have some significance (in which case your expected result is wrong) then that's the second pass to filter the results from the first pass - with another hash table for the 2nd filter.

ZXX 2010-08-26 11:32:25

ansaurus

tags:

views:

answers:

algorithm to match prefix and name to a list of names

related questions