For recreational reasons I wrote a PHP class that classifies files with tags instead of hierarchically. The tags are stored in the filename itself in the form +tag1+tag2+tagN+MD5.EXTENSION, and thus I'm stuck with the 255-character limit imposed by the FS/OS. Here is the class:

<?php

class TagFS
{
    public $FS = null;

    function __construct($FS)
    {
     if (is_dir($FS) === true)
     {
      $this->FS = $this->Path($FS);
     }
    }

    function Add($path, $tag)
    {
     if (is_dir($path) === true)
     {
      $files = array_slice(scandir($path), 2);

      foreach ($files as $file)
      {
       $this->Add($this->Path($path) . $file, $tag);
      }

      return true;
     }

     else if (is_file($path) === true)
     {
      $file = md5_file($path);

      if (is_file($this->FS . $file) === false)
      {
       if (copy($path, $this->FS . $file) === false)
       {
        return false;
       }
      }

      return $this->Link($this->FS . $file, $this->FS . '+' . $this->Tag($tag) . '+' . $file . '.' . strtolower(pathinfo($path, PATHINFO_EXTENSION)));
     }

     return false;
    }

    function Get($tag)
    {
     return glob($this->FS . '*+' . str_replace('+', '{+,+*+}', $this->Tag($tag)) . '+*', GLOB_BRACE);
    }

    function Link($source, $destination)
    {
     if (is_file($source) === true)
     {
      if (function_exists('link') === true)
      {
       return link($source, $destination);
      }

      if (is_file($destination) === false)
      {
       exec('fsutil hardlink create "' . $destination . '" "' . $source . '"');

       if (is_file($destination) === true)
       {
        return true;
       }
      }
     }

     return false;
    }

    function Path($path)
    {
     if (file_exists($path) === true)
     {
      $path = str_replace('\\', '/', realpath($path));

      if ((is_dir($path) === true) && ($path[strlen($path) - 1] != '/'))
      {
       $path .= '/';
      }

      return $path;
     }

     return false;
    }

    function Tag($string)
    {
     /*
     TODO:
     Remove (on Windows):    . \ / : * ? " < > |
     Remove (on *nix):     . /
     Remove (on TagFS):     + * { }
     Remove (on TagFS - Possibly!) -
     Max Chars (in Windows)    255
     Max Char (in *nix)    255
     */

     $result = array_filter(array_unique(explode(' ', $string)));

     if (empty($result) === false)
     {
      if (natcasesort($result) === true)
      {
       return strtolower(implode('+', $result));
      }
     }

     return false;
    }
}

?>

I believe this system works well for a couple of short tags, but my problem is when the length of the whole filename exceeds 255 chars. What approach should I take in order to bypass the filename limit? I'm thinking of splitting the tags across several hard links to the same file, but the permutations may kill the system.

Are there any other ways to solve this problem?

EDIT - Some usage examples:

<?php

$images = new TagFS('S:');

$images->Add('P:/xampplite/htdocs/tag/geoaki.png', 'geoaki logo');
$images->Add('P:/xampplite/htdocs/tag/cloud.jpg', 'geoaki cloud tag');
$images->Add('P:/xampplite/htdocs/tag/cloud.jpg', 'nuvem azul branco');
$images->Add('P:/xampplite/htdocs/tag/xml-full.gif', 'geoaki auto vin api service xml');
$images->Add('P:/xampplite/htdocs/tag/dunp3d-1.jpg', 'dunp logo');
$images->Add('P:/xampplite/htdocs/tag/d-proposta-04c.jpg', 'dunp logo');

/*
[0] => S:/+api+auto+geoaki+service+vin+xml+29be189cbc98fcb36a44d77acad13e18.gif
[1] => S:/+azul+branco+nuvem+4151ae7900f33788d0bba5fc6c29bee3.jpg
[2] => S:/+cloud+geoaki+tag+4151ae7900f33788d0bba5fc6c29bee3.jpg
[3] => S:/+dunp+logo+0cedeb6f66cbfc3974c6b7ad86f4fbd3.jpg
[4] => S:/+dunp+logo+8b9fcb119246bb6dcac1906ef964d565.jpg
[5] => S:/+geoaki+logo+5f5174c498ffbfd9ae49975ddfa2f6eb.png
*/
echo '<pre>';
print_r($images->Get('*'));
echo '</pre>';

/*
[0] => S:/+azul+branco+nuvem+4151ae7900f33788d0bba5fc6c29bee3.jpg
*/
echo '<pre>';
print_r($images->Get('azul nuvem'));
echo '</pre>';

/*
[0] => S:/+dunp+logo+0cedeb6f66cbfc3974c6b7ad86f4fbd3.jpg
[1] => S:/+dunp+logo+8b9fcb119246bb6dcac1906ef964d565.jpg
[2] => S:/+geoaki+logo+5f5174c498ffbfd9ae49975ddfa2f6eb.png
*/
echo '<pre>';
print_r($images->Get('logo'));
echo '</pre>';

/*
[0] => S:/+dunp+logo+0cedeb6f66cbfc3974c6b7ad86f4fbd3.jpg
[1] => S:/+dunp+logo+8b9fcb119246bb6dcac1906ef964d565.jpg
*/
echo '<pre>';
print_r($images->Get('logo dunp'));
echo '</pre>';

/*
[0] => S:/+geoaki+logo+5f5174c498ffbfd9ae49975ddfa2f6eb.png
*/
echo '<pre>';
print_r($images->Get('geo* logo'));
echo '</pre>';

?>

EDIT: Due to the several suggestions to use a serverless database or any other type of lookup table (XML, flat, key/value pairs, etc) I want to clarify the following: although this code is written in PHP, the idea is to port it to Python and make a desktop application out of it - this has nothing to do (besides the example of course) with PHP. Furthermore, if I have to use some kind of lookup table I'll definitely go with SQLite 3, but what I'm looking for is a solution that doesn't involve any additional "technology" besides the filesystem (folders, files and hardlinks).

You may call me nuts but I'm trying to accomplish two simple goals here: 1) keep the system "garbage" free (who likes Thumbs.db or .DS_Store for example?) and 2) keep the files easily identifiable if for some reason the lookup table (in this case SQLite) gets busy, corrupted, lost or forgotten (in backups for instance).

PS: This is supposed to run on Linux, Mac, and Windows (under NTFS).

A: 

You should make the tags directories instead of filename elements, i.e. instead of /dir/tag1+tag2+tagN+MD5.EXT, /dir/tag1/tag2/tagN/MD5.EXT. You're shooting yourself in the foot in several ways by treating directory hierarchy as something to be avoided.

If you're engaging in this avoidance because you believe it's difficult to generate the directory structure on demand, you should look into the third argument, $recursive, to PHP's mkdir.
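
As a rough illustration of the recursive mkdir idea, here is a minimal sketch. The helper name storeUnderTagPath and the sorted, lowercased tag order are assumptions made for this example, not part of the question's class.

```php
<?php

// Hypothetical helper (not part of the original TagFS class): copies $source
// to $root/tag1/tag2/.../MD5.EXT, creating missing directories in one call.
function storeUnderTagPath($root, array $tags, $source)
{
    // Lowercase, de-duplicate and sort so one tag set always maps to one path.
    $tags = array_unique(array_map('strtolower', $tags));
    sort($tags);

    $dir = rtrim($root, '/') . '/' . implode('/', $tags);

    // The third argument ($recursive = true) creates every missing level.
    if (is_dir($dir) === false && mkdir($dir, 0777, true) === false) {
        return false;
    }

    $extension = strtolower(pathinfo($source, PATHINFO_EXTENSION));
    $target = $dir . '/' . md5_file($source) . '.' . $extension;

    return copy($source, $target) ? $target : false;
}
```

Calling storeUnderTagPath('/dir', array('logo', 'geoaki'), 'geoaki.png') would place the copy under /dir/geoaki/logo/. Searching for a subset of tags still means walking the tree, which is the trade-off discussed in the comments.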

chaos
I thought of that; the problem is that the Get function would then become too complex and either extremely slow or extremely buggy. Imagine for example $Tag->Get('tag1 tagN'); // no tag2
Alix Axel
Yeah, search won't be any fun. Mainly because filesystems aren't designed for search. Any particular reason you're not using technology actually optimized for this purpose, i.e. a database?
chaos
I see some explanation in your response to Joey Robert. If what you want is to use only PHP-native technology, all you really need is a serialized hash that maps tags to lists of files. Like $tags = array('tag1' => array('FILE1.EXT', 'FILE2.EXT')).
chaos
Alix Axel
Regarding your serialized array solution, that wouldn't be dynamic or persistent.
Alix Axel
I'm still brainstorming how I can implement tags in folders while keeping the speed and ease of search, but no solution has come to me yet. Another alternative is to store the tags in file comments, but that way I would have to open a handle on every file for every search, and that would consume far too much disk I/O.
Alix Axel
Re serialized array: Mmm what? The point of the serialization is to make it persistent, and you just rewrite it when you need it to be dynamic.
chaos
I'm sorry but I don't follow. I can understand how serialized arrays can make the class either dynamic or persistent, but not dynamic AND persistent.
Alix Axel
You have a PHP array for tag indexing. When you first generate it, and when you change it, you write its serialized contents to tagindex.dat or something. When you want to use it, you deserialize the contents of tagindex.dat and use them. What's the problem?
chaos
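
A minimal sketch of the serialized index described in the comment above, assuming an index file such as tagindex.dat. The function names loadTagIndex and tagFile are hypothetical and only illustrate the load, modify, rewrite cycle.

```php
<?php

// Hypothetical helpers: the index maps each tag to a list of file names,
// e.g. array('tag1' => array('FILE1.EXT', 'FILE2.EXT')).
function loadTagIndex($indexFile)
{
    if (is_file($indexFile) === false) {
        return array();
    }

    // allowed_classes => false: the index only ever holds arrays and strings.
    $index = unserialize(file_get_contents($indexFile), array('allowed_classes' => false));

    return is_array($index) ? $index : array();
}

function tagFile($indexFile, $tag, $file)
{
    $index = loadTagIndex($indexFile);
    $tag = strtolower($tag);

    if (isset($index[$tag]) === false) {
        $index[$tag] = array();
    }

    if (in_array($file, $index[$tag], true) === false) {
        $index[$tag][] = $file;
    }

    // Rewrite the whole index on every change; LOCK_EX guards against
    // two writers interleaving.
    return file_put_contents($indexFile, serialize($index), LOCK_EX) !== false;
}
```

The index persists between runs because it lives on disk, and it stays dynamic because every change rewrites it.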
How would that be different from (or perform better than) using, for instance, a SQLite database?
Alix Axel
I brought it up because you seemed to want to use strictly PHP-core techniques if you could. That's the only reason to do this instead of using a database.
chaos
+4  A: 

You may want to create a cache of tags for each folder you're concerned with, similar to the way Windows creates a Thumbs.db file to cache thumbnails when browsing folders.

Creating a metadata file like this has the advantage of working across many different file systems without encountering a file name limitation.

Joey Robert
Using SQLite or any other type of serverless database is an option; however, I was looking for a way to solve this problem without having to use any other type of system. (Just for the fun of it). =)
Alix Axel
The other reason I'm trying to avoid any type of database system is portability: if the database gets too many read/write requests, or if it crashes and loses all the data, the files would be very hard to identify. It would be sweet if you could just back up the files and they would work on another system with zero configuration.
Alix Axel
+11  A: 

If you can use hard/soft links, then you might look into giving each tag its own directory, containing a link for each file that has that tag. When you are given multiple tags, you can compare the listings and keep the files found in all of them. The files themselves could then be stored in a single folder, each with a unique name of course.

I don't know how this would be different from having a meta file named by the tag, then listing all files that exist in that tag.
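
A minimal sketch of the one-directory-per-tag idea, assuming a hypothetical layout where $root/files holds the data and $root/tags/<tag>/ holds hard links. The function names tagWithLink and findByTags are made up for this example.

```php
<?php

// Hypothetical layout (an assumption for this sketch):
//   $root/files/<name>       the stored file, unique name
//   $root/tags/<tag>/<name>  a hard link to the stored file
function tagWithLink($root, $name, $tag)
{
    $dir = $root . '/tags/' . strtolower($tag);

    if (is_dir($dir) === false && mkdir($dir, 0777, true) === false) {
        return false;
    }

    $target = $dir . '/' . $name;

    // Tagging twice is a no-op.
    return is_file($target) || link($root . '/files/' . $name, $target);
}

// Returns the names present in every requested tag directory.
function findByTags($root, array $tags)
{
    $result = null;

    foreach ($tags as $tag) {
        $dir = $root . '/tags/' . strtolower($tag);
        $names = is_dir($dir) ? array_values(array_diff(scandir($dir), array('.', '..'))) : array();

        // The first tag seeds the result; each further tag narrows it.
        $result = ($result === null) ? $names : array_values(array_intersect($result, $names));
    }

    return ($result === null) ? array() : $result;
}
```

With this layout a search for 'tag1 tagN' needs only two directory listings and an intersection, and no filename ever carries more than its own name.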

he_the_great
That is a pretty clever suggestion, I'll give it some more thought. =)
Alix Axel
A: 

If you don't want to use a database why not try xml, you could list all of your data like this:

<file>
  <md5>MD5</md5>
  <body>tag5+tag4+tag3</body>
</file>

You could easily add more like title and description.

Scott
+4  A: 

I would insert that information into a database, even if it's a lightweight one, like an sqlite file in the same directory.

If you don't want to do that, you could create hard links to the file without any permutations. One file per tag. Tagging P:/xampplite/htdocs/tag/geoaki.png with geoaki and logo would result in two files both being hard links pointing to the same data as the original file:

  • P:/xampplite/htdocs/tag/geoaki.png.geoaki
  • P:/xampplite/htdocs/tag/geoaki.png.logo

This has the advantage that you can select all tags belonging to that file with glob() for example.

# All tags
$tags = array();
$files = glob('P:/xampplite/htdocs/tag/geoaki.png.*');
foreach ($files as $file) {
    # Only count links that point at the same data as the original file
    if (fileinode($file) === fileinode('P:/xampplite/htdocs/tag/geoaki.png')) {
        $tags[] = substr($file, strlen('P:/xampplite/htdocs/tag/geoaki.png.'));
    }
}

# Check if file has tag foo:
$hasFoo = file_exists('P:/xampplite/htdocs/tag/geoaki.png.foo')
    && fileinode('P:/xampplite/htdocs/tag/geoaki.png.foo') === fileinode('P:/xampplite/htdocs/tag/geoaki.png');

One more thing: Relying on md5 hashes alone for identifying files is not safe, you're better off using the file name as the identifier, which is guaranteed to be unique within the folder. Negative effects of md5 as identifier are:

  • The system breaks, as soon as a file is changed
  • There are collisions in md5, two distinct files could have the same md5 hash (the probability is small, but existent)
soulmerge
+1 for warning about MD5. Good rule of thumb, always use at least two different hashing algorithms if you want to come close to guaranteeing uniqueness. md5+sha1 seems to work well.
Trey
The linux distro 'Gentoo' uses more than 2 hashes - they rely on the mix of RMD160 SHA1 *and* SHA256.
soulmerge
A: 

the whole point of tags is to be able to search quickly for multiple combinations of tags. ideally, you want to have a database with a tag table {tag, path-to-file}. if you're set on keeping your tags in the filename, you need to use some sort of compression. keep a lookup table around (db or flat file), mapping every tag to a 2-character code (e.g. aa: tag1, ab: tag2, ac: tag3 ...). sticking to ascii, this should give you ~10k tags, if that isn't enough use three chars. now your filename will be something like aa.ag.f2.gx.ty.extension

another point to note is that, since you want to search on multiple tags, you want to make sure the tag codes in your filename are in strict lexical order. then, to search on tags aa, f3 and yz at once, do an "ls .*aa.*f3.*yz.*", which will pick out filenames containing all those codes.
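
a rough sketch of the ordered-code search, written as a regular expression rather than a shell glob. the helper name codePattern is hypothetical, and the sketch assumes filenames of the form aa.ag.f2.gx.ty.extension with the codes already stored in sorted order.

```php
<?php

// Hypothetical helper: builds a regex matching names such as
// "aa.ag.f2.gx.ty.jpg" that contain every requested code.
// Assumes the codes inside each filename are stored in sorted order.
function codePattern(array $codes)
{
    if ($codes === array()) {
        return '/^/'; // no codes requested: match everything
    }

    $codes = array_unique($codes);
    sort($codes, SORT_STRING);

    $parts = array();
    foreach ($codes as $code) {
        $parts[] = preg_quote($code, '/');
    }

    // Each code must be a whole dot-delimited segment; any number of
    // other segments may sit before it or between two requested codes.
    $gap = '(?:[^.]+\.)*';

    return '/^' . $gap . implode('\.' . $gap, $parts) . '\./';
}
```

filtering a directory listing is then a matter of preg_grep(codePattern(array('aa', 'f3', 'yz')), scandir($dir)). because the trailing dot is required, the extension itself is never mistaken for a code.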

Martin DeMello
A: 

Choosing to avoid SQLite because it is 'not PHP native' seems like a false dichotomy, as it is compiled into almost every practical distribution of PHP. If you'd rather have a non-SQL solution, Berkeley DB provides a simple key-value store you could use to associate a list of filenames with any given tag, or filenames with lists of tags.

But go with the SQL solution. It will be fast, portable, and simpler than you think.

TokenMacGuy
A: 

"What approach should I take in order to bypass the filename limit?"

How about a file system that supports tags, such as Tagsistant? You didn't specify your operating system.

gradbot
Nice, I followed TagFS for a while but they made me believe they stopped the development (the amount of loops was crazy).
Alix Axel
+2  A: 

You've narrowed the question sufficiently that I believe the answer is: "No."

You don't want a central registry of tags because it could become corrupted.

You don't want a file or files hidden in each directory to hold the data because that is "garbage".

You probably don't want a parallel set of directories or directories with links, because then it goes out of date when you move stuff and probably constitutes "garbage" on the file system.

You surely don't want to put tags in the contents of the files themselves.

So is there anywhere else you could put tags aside from the file's name in the directory structure?

No. (Or at least there is nothing portable).

Certainly there is nowhere to keep metadata except in the file's name or in the actual file itself that would stay with a file (when it is copied and moved using the usual tools) that would work on all three of the major operating systems you mention (Linux, Mac, Win).

It would be nice if there was a portable metadata system that could do this, but there is not. My impression is that there is no general agreement on what the best way to do tagging is. So each system does it differently and with a different set of trade-offs.

I think that relative to most of the major ideas in operating systems (hierarchical filesystems, GUI interfaces, etc), using tagging is a relatively new idea. Most of the facilities shared across all three systems are rather old and established ideas.

Your best bet would probably be to study how each system does it and then write a library that would portably provide the lowest common denominator of functionality between systems.

Maybe someone has written a library for Python that does this already?

C.J.

CJ
+2  A: 

More of a brainstorm than an answer.

As @CJ pointed out, without any external metadata and with the constraint of 255 bytes as filename identifier plus 'tag-cloud' your tagfs remains a problem.

Symbolic links are nice. Instead of packing all tag names into one filename, one could spread the tags over several files, or – for the sake of space – symlinks. Steps:

  1. compute a checksum or hash for a given file
  2. store a symlink in some format, e.g. <hash>-tag or tag-<hash>

I understand that's what you mean by 'garbage', but if you want to store an arbitrary number of arbitrary tags in a fixed-length string, you'll hit an information barrier sooner or later. Using a database scales better, but storing and retrieving symlinks should be easy to implement. The 'garbage' could be stored in a single metadata repository with a leading 'dot', which is a widely used and established practice on some operating systems.
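
A minimal sketch of the two steps above, assuming a hypothetical dot-directory named .tags inside the store and the <hash>-tag naming. The function names are made up, and tags are assumed to contain neither '-' nor glob metacharacters.

```php
<?php

// Hypothetical layout (an assumption for this sketch):
//   $store/<md5>             the file itself
//   $store/.tags/<md5>-<tag> a symlink back to the file
function symlinkTag($store, $path, $tag)
{
    // Step 1: compute a hash for the given file.
    $hash = md5_file($path);
    $stored = $store . '/' . $hash;

    if (is_file($stored) === false && copy($path, $stored) === false) {
        return false;
    }

    $dir = $store . '/.tags';
    if (is_dir($dir) === false && mkdir($dir) === false) {
        return false;
    }

    // Step 2: store one symlink per tag, named <hash>-<tag>.
    $link = $dir . '/' . $hash . '-' . strtolower($tag);

    return is_link($link) || symlink($stored, $link);
}

// All tags attached to one hash.
function tagsOfHash($store, $hash)
{
    $tags = array();
    foreach (glob($store . '/.tags/' . $hash . '-*') ?: array() as $link) {
        $tags[] = substr(basename($link), strlen($hash) + 1);
    }
    return $tags;
}

// All hashes carrying one tag (an MD5 hex digest is 32 characters long).
function hashesWithTag($store, $tag)
{
    $hashes = array();
    foreach (glob($store . '/.tags/*-' . strtolower($tag)) ?: array() as $link) {
        $hashes[] = substr(basename($link), 0, 32);
    }
    return $hashes;
}
```

Each link name holds exactly one tag, so the 255-character limit applies per tag rather than per tag set.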

good luck!

The MYYN
+1  A: 

actually, I have built a shell script implementation of this utility, and integrated it with the nautilus file browser...

I used the soft-link approach: a directory called .tags contained all the "tags", and tags were just directories in the .tags directory.

If a file was tagged with "fun", then a soft link to it would be created in .tags/fun .. however, this method is not good for searching by tags.

If you want to support searching too, I recommend using sqlite.

cheers, jrh.

Here Be Wolves
+1  A: 

The file system is your database, so use it.

  1. Come up with a "unique name" for your file. Doesn't matter what the file name is, as long as it is unique across the space. The file name has nothing to do with the tags.

  2. Hash the file name to a "storage" directory. If you aren't going to have a bazillion files (< 1000-2000), you can store all of the files in a single directory. Otherwise, make a bunch of "bucket" directories, and hash the file to the correct directory. This process is, obviously, deterministic based on the file name.

  3. For each tag on the file, either store an "empty" file of the same name in a "tag" directory or, simply have a "tag file" that lists the files in that tag. Again, if you expect to have zillions of files in a specific tag, hash the files in to buckets.

To add a tag to a file, simply add the file reference to the proper tag dir. To delete the tag, same thing.

To delete a file, simply remove the file from the main store. When you iterate the tag references, you can check for the file at that point and delete the entries lazily. You're probably going to be hitting the file for anything interesting anyway.

If you want to store actual metadata for the file, then create a mirror "metadata" directory. When you add a file, you place it in the file store directory, and a matching metadata file in a "metadata store" directory, using the same scheme. Delete a file by deleting the original and its metadata.

Just simple file operations, no file system shenanigans (beyond hashing directory buckets), no links, attributes, what have you.

This gives you "unlimited" tags per file, and you can manage it from the command line or file explorer with the only tool required being the Mark I Eyeball. You also get permanent references to the actual file (since its name never changes).

Darkest part is that you'll need to "scan the tag cloud" to find out what tags a file has. If you choose to go with a metadata file, you can maintain the tag list in that (that will complicate your tagging/untagging operations, but not horribly).
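
A minimal sketch of the scheme described in this answer, assuming a hypothetical layout with a store directory split into 256 buckets and a tags directory holding empty marker files. The function names and the choice of the first two MD5 hex digits as the bucket are assumptions for illustration.

```php
<?php

// Hypothetical layout (an assumption for this sketch):
//   $root/store/<bucket>/<name>  the file, bucket = first 2 hex digits of md5(name)
//   $root/tags/<tag>/<name>      an empty marker file
function storePath($root, $name)
{
    // Deterministic: the same name always hashes to the same bucket.
    return $root . '/store/' . substr(md5($name), 0, 2) . '/' . $name;
}

function markTag($root, $name, $tag)
{
    $dir = $root . '/tags/' . $tag;

    if (is_dir($dir) === false && mkdir($dir, 0777, true) === false) {
        return false;
    }

    return touch($dir . '/' . $name);
}

function filesWithTag($root, $tag)
{
    $dir = $root . '/tags/' . $tag;
    $found = array();

    if (is_dir($dir) === false) {
        return $found;
    }

    foreach (array_diff(scandir($dir), array('.', '..')) as $name) {
        if (is_file(storePath($root, $name)) === true) {
            $found[] = $name;
        } else {
            // Lazy cleanup: the file left the main store, so drop its marker.
            unlink($dir . '/' . $name);
        }
    }

    return $found;
}
```

Removing a tag is a plain unlink of the marker, and deleting a file only touches the store; stale markers disappear the next time their tag is listed.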

Will Hartung
+1  A: 

If your operating system and filesystem support extended file attributes, use them to store the tags. On OS X and FreeBSD, see the setxattr and getxattr manual pages; Linux and Solaris have similar facilities. Windows has support for extended attributes in NTFS. See "extended file attributes" on Wikipedia for more information.