views:

321

answers:

4

I have a std::vector<std::string> of all the files in a directory:

// fileList
folder/file1
folder/file2
file3
file4.ext

and a std::set<std::string> of filenames and the same for all used folder prefixes:

// set1
file2
file4.ext

// set2
folder

I need to generate the full (relative) paths to the ALL files in set1, but see no way of doing that without iterating over set2 set1.size() times, multiplied by fileList.size()

UPDATE: some clarification:

Expected output for above example:

folder/file2
file4.ext

Proposed (inefficient?) solution, maybe too verbose and with stupid implementation:

// pseudo-code!
vector<string> allpossibleFullPaths( set1.size()*set2.size() );
vector<string> output;
foreach( prefix_in_set2 )
    foreach( filename_in_set1 )
        allpossibleFullpaths.push_back( set2[i] + "/" set1[i] )

foreach( filename_in_fileList )
    files.push_back( find( fileList[i] in allpossibleFullPaths ) );

(fast pseudocode-ish) This seems very innefficient, is there a better way to make these matches?

Thanks!

PS: better still would be a way to keep track of doubles, so that I can warn the user about that.

A: 

Simple: iterate over fileList once, generate the prefix (set 2 entry) and file name (set 1 entry), and check if they are in their respective sets. If both are, you have a match, so return it; otherwise, return nothing for that item.

Also, this handles the 'doubles' problem you mention.

strager
How can you conclude that this algorithm is O(1) ?
Benoît
It seems to me that this is O(K logM logN) (K=number of files in fileList, M=number of prefixes in set2, N=number of files in set1, both logs from binary searches), right? And please read Jérémie's comment, his case is important to consider.
rubenvb
You iterate through `fileList` once is O(N). I assumed checking if an item is in a set is O(1); it seems be O(ln N), as you say. Sorry about that.
strager
(Using a `hash_set` gives you average-case O(1) for lookups.)
strager
strager's description is exactly what I would have given. Hash lookup is generally considered to have O(1) amoritized cost - faster than binary search.
Joe Soul-bringer
Right, but generating the contents of hash_set is not O(1) !
Benoît
A: 

Just use a helper hash-table to get a runtime of set1.size() + fileList.size()

The Pseudo-Code:

unordered_set<string, list<string> > hash;
foreach (i in fileList):
  (fprex, fname) = split(i)
  hash[fname].push_back(fprex)
foreach (j in set1):
  a = hash.contains(j)
  if (a != hash.end())
    foreach(k in a)
       print k +'/' + j;

Or somethng like that. unordered_set is available in Boost (or tr1), and a insert/lookup operation is in O(1).

maxschlepzig
Sounds nice... will this detect ambiguously defined files (see comment by Jérémie)? An additional problem that maybe I didn't specify enough is this: there's a file in fileList: `folderA/folderB/file.ext` and set1 constains: `folderB/file.ext`, and set 2 contains `folderA`. This makes the split function either ambiguous, or I need to split for every slash, which increases the size of the hashtable, no? Thanks
rubenvb
Well, if do not want to allow same filenames in different directories just check via hash.find(fname) != hash.end() if the hash-table already contains an entry. If yes, then report an error. As a consequence of this, you just need string as value-type of hash (and no list).If you want to do suffix lookups that may include directories you can either add all directory prefixes to the hash-table or use some sophisticated data-structure like suffix-trees or radix-trees.Depends really on your use case if the use of sophisticated and complicated data-structures pays off.
maxschlepzig
+1  A: 

One area which you are not clear about is this:

  • Given set1 & set2 as described above, what if fileList had "file4.ext" and "folder\file4.ext". Would you want both? Or is the list of file in set1 guaranteed to be unique?

Assuming that you'd want both, pseudo-code:

 foreach(pathname in fileList)
    separate pathname into path & filename.
    if path is not empty, but not in set2, skip to next pathname.
    if filename is in set1, output pathname.

Since a set lookup should be O(1), total complexity is O(2 * fileList.Length)

If the filenames in set1 are unique, you can count the number of pathnames output, and exit early when set1.Length is reached.

It may seem counter-intuitive to step through the longest collection, but it also has the slowest lookup, so operations on fileList have to be minimized.

UPDATE: Here's the full working C++ code (includes & usings elided)

void ListFiles()
{
    vector<string> fileList;
    fileList.push_back("folder/file1");
    fileList.push_back("folder/file2");
    fileList.push_back("file3");
    fileList.push_back("file4.ext");

    set<string> set1;
    set1.insert("file2");
    set1.insert("file4.ext");

    set<string> set2;
    set2.insert("folder");

    for(vector<string>::iterator iter = fileList.begin();
        iter != fileList.end();
        ++iter)
    {
        string pathname = *iter;
        string filename;
        string path;
        size_t pos = pathname.find('/');
        if (pos == string::npos || pos == 0)
            filename = pathname;
        else
        {
            path = pathname.substr(0, pos);
            if (set2.find(path) == set2.end())
                continue;
            filename = pathname.substr(pos+1);
        }
        if (set1.find(filename) != set1.end())
            cout << pathname << endl;
    }

}
James Curran
I'd like to output an error in the case of ambiguously defined files. The user should define the/a fuller pathname in set1 (eg no `folder` in set2, and only a `folder/file4.ext` in set1). That last one will be troublesome with your algorithm I think, because the "filenames" in set1 could contain a part of the full path. (I'm getting my inspiration from Qt's qmake, which handles stuff in the exact same way)
rubenvb
Thanks for the full code example, but it still doesn't handle paths in set1, due to your assumption there's only one slash in the whole thing. I understand my example may have implied that, but it was a simplistic representation of what I wanted.
rubenvb
A: 

Your expected results look like you are searching for suffixes in FileList that match rows in set1 and set2 is immaterial.

The size of set2 decides which way to go for actual matching. If it's reasonably small you can turn it into a regex and either add regex anchors to match the end of string or pre-process FileList (by extracting just file name but also keeping the original string for the result). You can also reverse strings in both lists so that it becomes prefix matching indeed.

If set2 is big then you need to build hash-table from it and in this case you do need to pre-process FileList to extract file names as "keys" that you'll try to "find" in hash table. Make sure you handle case sensitivity if that's a potential issue (like converting all keys to upper case). With that in place just print out every row in FileList for which it's key is present in the hash table build from set 1.

If set 2 does have some significance (in which case your expected result is wrong) then that's the second pass to filter the results from the first pass - with another hash table for the 2nd filter.

ZXX