I have a series of files like this:

foo1.txt.gz
foo2.txt.gz
bar1.txt.gz
..etc..

and a tabular-format file that describes those files:

foo1 - Explain foo1
foo2 - Explain foo2
bar1 - Explain bar1
..etc..

What I want to do is to have a website with a simple search bar, where people can type foo1 or just foo and get back the matching gzipped file(s) and the related explanation of the file(s).

What's the best way to implement this, and what kind of tools should I use? Sorry, I am totally new to this area.

Update: Specifically I want to give list of URLs linked to the matched files. So that people can later choose which one to download.

+1  A: 
  1. Build an HTML search form.

    • The form has a text input element

    • On submission, the form sends the value of the search field to a back-end script (for example, a Perl CGI script implemented using CGI.pm for simplicity, though these days you would use a more modern web framework such as Perl's Catalyst, or a templating framework such as EmbPerl)

  2. The back-end script searches for the matching files (a minimal sketch follows this list):

    • To get the list of matching files in Perl, use glob("*$search*.txt.gz"), or the File::Find module if the files are in sub-directories.

    • Open, read, and parse the descriptions file into a hash mapping each file's base name (e.g. "foo1") to its description.

    • Use Perl's grep to pick out the file names that match the search string (using a regular expression).

    • Print an HTML report page with a table listing the found file names and their descriptions; that page will be sent back to the browser.

    • Each file name would be a link (see below) to download the file. The easiest approach is to put the files in a directory inside the "htdocs" tree - i.e. somewhere within a directory where the web server looks for documents. Then you can just reference them by URL. For example, if your home page is /home/webpages/main/index.html (with a URL of http://mysite.com/index.html), you can put your files as /home/webpages/main/foofiles/foo1.txt.gz and the URL would be http://mysite.com/foofiles/foo1.txt.gz.

    You must make sure that your web server serves these files with the appropriate Content-Type header (i.e. does not send them as text/html).
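To make this concrete, here is a minimal sketch of such a CGI script. The directory layout, the descriptions.txt file name, and the "q" form parameter are my assumptions - adjust them for your setup.

    #!/usr/bin/perl
    # Minimal sketch of the search script described above (assumptions:
    # files live under /home/webpages/main/foofiles, descriptions are in
    # descriptions.txt there, and the form's text input is named "q").
    use strict;
    use warnings;
    use CGI qw(:standard);

    my $file_dir  = '/home/webpages/main/foofiles';
    my $base_url  = '/foofiles';
    my $desc_file = "$file_dir/descriptions.txt";

    my $search = param('q') || '';
    $search =~ s/\W+//g;    # sanitize: keep word characters only

    # Parse "foo1 - Explain foo1" lines into base name => description.
    my %desc;
    open my $fh, '<', $desc_file or die "Cannot open $desc_file: $!";
    while (<$fh>) {
        chomp;
        my ($name, $text) = split /\s+-\s+/, $_, 2;
        $desc{$name} = $text if defined $text;
    }
    close $fh;

    # Find files whose base name contains the search string.
    my @matches = $search ? glob("$file_dir/*$search*.txt.gz") : ();

    print header(), start_html('Search results'), h1('Search results');
    if (@matches) {
        print "<table>\n";
        for my $path (@matches) {
            my ($base) = $path =~ m{([^/]+)\.txt\.gz$};
            printf qq{<tr><td><a href="%s/%s.txt.gz">%s.txt.gz</a></td><td>%s</td></tr>\n},
                $base_url, $base, $base, $desc{$base} || '';
        }
        print "</table>\n";
    } else {
        print p('No matching files found.');
    }
    print end_html();

The search form then just needs something like <form action="/cgi-bin/search.pl" method="get"><input type="text" name="q"><input type="submit" value="Search"></form>, with the action pointing at wherever your server runs CGI scripts.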

DVK
NOTE: if you want to actually download the matching file(s) to the users, please update your question to clearly state that and I'll add instructions for that.
DVK
@DVK: Thanks so much. I have already added the update to the OP. Please check.
neversaint
Updated answer.
DVK
+1  A: 

For performance reasons, what you'll likely want to do is have a periodic process build an index. There are very sophisticated ways to do this, but it's also possible to make something reasonably useful in a very simple way.

At heart, an "index" is the very same sort of thing you'd find at the end of a textbook, translated into the computing world. You'll want to scan through your table of descriptions and build a key/value "dictionary", "hash", or whatever your language's equivalent structure is called. The keys will be the words you find in your descriptions. The values will be an array (or list, or whatever your language calls it) of URLs in which that word can be found.
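A rough sketch of building such an index in Perl (the descriptions.txt name and /foofiles URL scheme are assumptions carried over from the question):

    use strict;
    use warnings;

    # Inverted index: word => set of URLs whose description contains it.
    my %index;
    open my $fh, '<', 'descriptions.txt' or die "Cannot open: $!";
    while (<$fh>) {
        chomp;
        my ($name, $text) = split /\s+-\s+/, $_, 2;
        next unless defined $text;
        my $url = "/foofiles/$name.txt.gz";
        # Using an inner hash as a set avoids duplicate URLs per word.
        $index{lc $_}{$url} = 1 for $text =~ /(\w+)/g;
    }
    close $fh;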

When you process a query, you break the query apart into words and look each one up in your dictionary. Each URL then gets a point for every query word it contains, and you rank your results based on how many points each URL has. Alternatively, you can return only results that contain all the words, by taking the set intersection of the URL arrays you find by looking up your words.
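A sketch of that scoring step, assuming the query string is in $query and %index is the structure built above:

    # One point per query word that a URL's description contains.
    my @words = map { lc } $query =~ /(\w+)/g;
    my %score;
    for my $word (@words) {
        $score{$_}++ for keys %{ $index{$word} || {} };
    }
    # Rank by points:
    my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;
    # Or require ALL words (set intersection):
    my @all = grep { $score{$_} == @words } keys %score;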

Depending on what you are trying to achieve, you can get more sophisticated about how you construct your index, such as using phonetic representations of words as keys instead of the raw words themselves. When you do a search, break the search terms into their phonetic representations as well; in this way you can eliminate problems with common misspellings.
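For example, with the Text::Soundex module from CPAN (Soundex is only one possible phonetic scheme):

    use Text::Soundex;

    # Index under phonetic keys, so common misspellings land on the
    # same key; restricting to alphabetic words keeps soundex() defined.
    for my $word ($text =~ /([A-Za-z]+)/g) {
        $index{ soundex($word) }{$url} = 1;
    }
    # Apply soundex() to each search term at query time as well --
    # "discription" and "description" then map to the same key.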

Alternatively, you can address common misspellings directly by making duplicate keys for each word.

Alternatively, you can also index letter triplets rather than whole words, to catch alternative forms of words with different tenses and conjugations.
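A sketch of the triplet idea, reusing the index and scoring from above:

    # "searching" and "searches" share the trigrams "sea", "ear",
    # "arc", and "rch", so they still score against each other.
    sub trigrams {
        my ($word) = @_;
        return map { substr $word, $_, 3 } 0 .. length($word) - 3;
    }
    $index{$_}{$url} = 1 for trigrams(lc $word);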

etc. etc.

You probably won't want to construct this index on every query (otherwise, what's the point?), so you'll want to be able to save it to disk and load it (or parts of it) into memory when a query comes in. Whether you use a database or something else for this, I leave up to you.
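One simple way to do that in Perl is the core Storable module (the index file name here is just a placeholder):

    use Storable qw(store retrieve);

    store \%index, 'index.stor';           # in the periodic indexing job
    my $index = retrieve('index.stor');    # in the query handler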

Breton