views:

199

answers:

2

It's part of an information retrieval project I'm doing for school. The plan is to create a hashmap of words, using the first two letters of each word as the key and all words starting with those two letters concatenated into a string value. So,

hashmap["ba"] = "bad barley base"

Once I'm done tokenizing a line, I take that hashmap, serialize it, and append each entry to the text file named after its key.

The idea is that by spreading my data over hundreds of files, I'll reduce the time a search takes by reducing the density of each file. The problem I'm running into is that when I'm creating 100+ files in a run, it chokes on creating a few of them for no apparent reason, so those entries end up empty. Is there any way to make this more efficient? Is it worth continuing, or should I abandon it?
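
Roughly, the write step looks like this (a simplified sketch; the tokenizing regex and file naming here are illustrative, and $line is the line currently being processed):

$words = preg_split('/\W+/', $line, -1, PREG_SPLIT_NO_EMPTY);

$hashmap = array();
foreach ($words as $word) {
    $key = substr(strtolower($word), 0, 2);
    if (strlen($key) < 2) {
        continue; // skip one-letter tokens
    }
    // Grow the string bucket for this two-letter key
    $hashmap[$key] = isset($hashmap[$key]) ? $hashmap[$key] . ' ' . $word : $word;
}

foreach ($hashmap as $key => $value) {
    // Append each serialized bucket to the file named after its key;
    // this is the step that sporadically fails and leaves files empty.
    file_put_contents($key . '.txt', serialize($value) . "\n", FILE_APPEND);
}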

I'd like to mention that I'm using PHP. The two languages I know reasonably well are PHP and Java. I chose PHP because the front end will be very simple to build, and I'll be able to add features like autocompletion/suggested search without a problem. I also see no benefit to using Java here. Any help is appreciated, thanks.

+1  A: 

I would use a single file to get and put the serialized string. I would also use JSON as the serialization format.

Put the data

$string = "bad barley base";
$data = explode(" ",$string);
$hashmap["ba"] = $data;

$jsonContent = json_encode($hashmap);
file_put_contents("a-z.txt",$jsonContent);

Get the data

// Read the file back and decode it as an associative array
$jsonContent = file_get_contents("a-z.txt");
$hashmap = json_decode($jsonContent, true);

foreach ($hashmap as $firstTwoCharacters => $value) {
    if ($firstTwoCharacters === 'ba') {
        $wordCount = count($value); // number of words stored under "ba"
    }
}
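
Since json_decode() with the assoc flag returns a plain associative array, you can also skip the loop and index the bucket directly:

$wordCount = isset($hashmap['ba']) ? count($hashmap['ba']) : 0;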
Brant
I am working with a 29 MB txt file. Don't you think a single file containing json_encode($hashmap) would be inefficient?
tipu
You could break it up so each alpha character has its own file: a.txt, b.txt, c.txt. For searching, yes, a single file would be taxing, but you would only write to a-z.txt when an addition happens. It really depends on what you're using the data for.
Brant
A: 

You didn't explain the problem you are trying to solve. I'm guessing you are trying to make a full-text search engine, but you don't have document IDs in your hashmap, so I'm not sure how you are using the hashmap to find matching documents.

Assuming you want a full-text search engine, I would look into using a trie for the data structure. You should be able to fit everything in it without it growing too large. Nodes that match a word you want to index would contain the IDs of the documents containing that word.
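
A bare-bones version of that idea in PHP might look like this (illustrative only; the class layout and names are my own, not a standard library):

class TrieNode {
    public $children = array(); // letter => TrieNode
    public $docIds = array();   // set of document IDs ending a word here
}

class Trie {
    private $root;

    public function __construct() {
        $this->root = new TrieNode();
    }

    // Walk (creating as needed) one node per letter, then record the
    // document ID at the node where the word ends.
    public function insert($word, $docId) {
        $node = $this->root;
        foreach (str_split(strtolower($word)) as $letter) {
            if (!isset($node->children[$letter])) {
                $node->children[$letter] = new TrieNode();
            }
            $node = $node->children[$letter];
        }
        $node->docIds[$docId] = true;
    }

    // Return the IDs of all documents containing the exact word.
    public function search($word) {
        $node = $this->root;
        foreach (str_split(strtolower($word)) as $letter) {
            if (!isset($node->children[$letter])) {
                return array(); // no document contains this word
            }
            $node = $node->children[$letter];
        }
        return array_keys($node->docIds);
    }
}

$trie = new Trie();
$trie->insert("barley", 1);
$trie->insert("barley", 7);
print_r($trie->search("barley")); // Array ( [0] => 1 [1] => 7 )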

jshen
You're absolutely correct in assuming I'm making a full-text search engine. I'm taking a look at the trie data structure now, and it's much more efficient than what I'm currently doing (which is what I described above). I'm looking to implement this now, thanks!
tipu
Go here: http://www.ics.uci.edu/~chenli/pubs.html and look at the paper titled "Efficient Interactive Fuzzy Keyword Search".
jshen