I have a huge list of video files from a webcam that look like this:

video_123
video_456
video_789
...

Each number (123, 456, and 789) represents the start time of the file in seconds since the epoch. The files are split based on file size, so they are not all the same duration. There may also be gaps between files (e.g. the camera goes down for an hour). It is a custom file format that I cannot change.

I have a tool that can extract portions of the video given a time range and a set of files. However, it runs MUCH faster if I only give it the files that actually have frames within the given range. It's very costly to determine the duration of each file, so instead I'd like to use the start timestamps to rule out most files. For example, if I wanted video for 500-600, I know video_123 will not be needed because video_456 starts later but still before 500. Also, video_789 starts after 600, so it will not be needed either.

I could do an ls and iterate through each file, converting the timestamp to an int and comparing until we hit a file beyond the desired range. I have a LOT of files and this is slow. Is there a faster method? I was thinking of some sort of binary tree that would give O(log n) search time and already have the timestamps parsed out. I am doing most of this work in bash and would prefer simple, common tools like grep, awk, etc. However, I will consider Perl or another scripting language if there is a compelling reason.
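
Roughly, the scan I have in mind looks like this (just a sketch; it assumes the filenames are exactly video_<epoch> with no spaces, and uses the 500-600 range from the example above):

start=500; end=600
prev=
for f in $(printf '%s\n' video_* | sort -t_ -k2,2n); do
    ts=${f#video_}
    if [ "$ts" -ge "$end" ]; then break; fi   # starts after the range: done
    if [ "$ts" -le "$start" ]; then
        prev=$f                               # remember the last file starting at or before the range
    else
        echo "$f"                             # starts inside the range
    fi
done
[ -n "$prev" ] && echo "$prev"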

A: 

If you will do several searches over the same set of files, you can pre-process them: load the names into a bash array (note: bash, not sh), sort them, and then binary-search that array. To keep the examples simple, work with just the time tags (you can always do ${variable/video_/} to strip the prefix).

First, load all the file names into an array sorted numerically by timestamp, and strip the prefix:

files=( $(printf '%s\n' video_* | sort -t_ -k2,2n) )
files=( "${files[@]/video_/}" )

Then implement the binary search (just a sketch, narrowing in on the position of the range $min-$max):

nfiles=${#files[*]}
nfiles2=`expr $nfiles / 2`
if test "${files[$nfiles2]}" -gt "$max"
then
    nfiles2=`expr $nfiles2 - $nfiles2 / 2`   # midpoint is past the range: move it down
else
    : # otherwise compare against $min and move the midpoint up, etc.
fi

And so on. Once you have the files ordered in the array, repeated lookups become much faster.
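
A more complete sketch, assuming the array holds bare numeric timestamps as built above (find_le is just a made-up helper name, and 500-600 is the example range from the question):

# Index of the last timestamp <= $1 in the sorted array "files", or -1 if none.
find_le() {
    local target=$1 lo=0 hi=$(( ${#files[@]} - 1 )) mid ans=-1
    while (( lo <= hi )); do
        mid=$(( (lo + hi) / 2 ))
        if (( files[mid] <= target )); then
            ans=$mid; lo=$(( mid + 1 ))   # candidate; keep looking to the right
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$ans"
}

first=$(find_le 500)   # last file starting at or before the range start
last=$(find_le 599)    # last file starting strictly before the range end
(( first < 0 )) && first=0
for (( i = first; i <= last; i++ )); do
    echo "video_${files[i]}"
done

Each call is O(log n), so repeated extractions only pay the sorting cost once.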

Diego Sevilla
A: 

Because of a quirk of UNIX design, there is no way to search for the name of a file in a directory other than stepping through the filenames one by one. So if you keep all your files in one directory, you're not going to get much faster than using ls.

That said, if you're willing to move your files around, you could turn your flat directory into a tree by splitting on the most significant digits. Instead of:

video_12301234
video_12356789
video_12401234
video_13579123

You could have:

12/video_12301234
12/video_12356789
12/video_12401234
13/video_13579123

or even:

12/30/video_12301234
12/35/video_12356789
12/40/video_12401234
13/57/video_13579123

For best results with this method, you'll want to have your files named with leading zeros so the numbers are all the same length.
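
For instance, the bucket for a given file could be derived from its name like this (a sketch; the two-digit bucket widths are just an example):

f=video_12301234
ts=${f#video_}
echo "${ts:0:2}/${ts:2:2}/$f"   # prints 12/30/video_12301234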

Jander
Interesting idea. Unfortunately, the code used to produce the file does not have an option for such output. Is there an easy way to convert from flat to the idea above after the file is written? Also, do you know if `ls -U|sort` is faster than `ls`?
User1
`ls -U | sort` vs. `ls` probably won't make a big difference; the big bottleneck here is reading from the disk. You could have the file-producing code write into a holding directory, and use another script running in the background to look for new files and move them to the proper place. Moving files while they're being written is safe as long as your writer doesn't try to close and re-open the file.
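
A minimal sketch of such a background mover, assuming an incoming/ holding directory and the two-level layout above (the directory name and poll interval are just placeholders):

while true; do
    for f in incoming/video_*; do
        [ -e "$f" ] || continue          # glob matched nothing
        name=${f#incoming/}
        ts=${name#video_}
        mkdir -p "${ts:0:2}/${ts:2:2}"
        mv "$f" "${ts:0:2}/${ts:2:2}/$name"
    done
    sleep 60
done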
Jander