tags:
views: 42
answers: 2

I have a couple hundred single words that are identified in a foreach routine and pushed into an array.

I would like to check each word to see if it already exists in a text file that is a single column of 200k+ lines.

(Similar to a huge "bad word" routine, I guess, but in the end this will add to the "filter" file.)

I don't know whether I should do this with preg_match in the loop, or combine the arrays somehow and use array_unique.

I would also like to add the words that are not found to the main file, and use flock() in an attempt to avoid any concurrent-access issues.
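For reference, the "combine the arrays" idea mentioned above might look roughly like this (words.txt is a placeholder name; this sketch assumes the filter file fits in memory):

<?php
$words = array('alpha', 'beta', 'gamma'); # words collected in the foreach routine

$fp = fopen('words.txt', 'c+');           # placeholder filename; create if missing
if ($fp === false)
{
    die('Could not open words.txt');
}

if (flock($fp, LOCK_EX))                  # exclusive lock for the read + append
{
    $existing = array();
    while (($line = fgets($fp)) !== false)
    {
        $existing[trim($line)] = true;    # array keys give fast lookups
    }

    foreach (array_unique($words) as $word)
    {
        if (!isset($existing[$word]))
        {
            fwrite($fp, $word . "\n");    # append only the words not already there
        }
    }

    fflush($fp);
    flock($fp, LOCK_UN);
}
fclose($fp);
?>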

Is this a pipe dream? Well, it is for this beginner. My attempts have timed out at the 30-second limit.

Stack Overflow has been such a great resource. I don't know what I would do without it. Thanks in advance either way.

+1  A: 

If the file is too large to read into memory all at once, you can process it line by line:

<?php
$words = array('a', 'b', 'c'); # words to insert, assumed to be unique

$fp = fopen('words.txt', 'r+');
if ($fp === false)
{
    die('Could not open words.txt');
}

# Remove from $words every word that already appears in the file.
while (($line = fgets($fp)) !== false)
{
    $key = array_search(trim($line), $words);
    if ($key !== false)
    {
        unset($words[$key]);
        if (!$words) break; # nothing left to look for
    }
}

# Append the remaining (missing) words to the end of the file.
# (The fseek matters if the loop above exited early via break.)
fseek($fp, 0, SEEK_END);
foreach ($words as $word)
{
    fputs($fp, "$word\n");
}

fclose($fp);
?>

It loops through the entire file, checking whether the current line (assumed to be a single word) exists in the array. If it does, that element is removed from the array; if nothing is left in the array, the search stops early. After cycling through the file, any words still left in the array are appended to it.

(File locking and error handling are not implemented in this example.)

Note that this is a very bad way to store this data (file based, unsorted, etc.). Even SQLite would be a big improvement. You could always write an exporter back to .txt if you needed the data in plain text.
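For what it's worth, a rough sketch of the SQLite route via PDO's sqlite driver might look like this (the database file and table name are just placeholders):

<?php
$db = new PDO('sqlite:words.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)');

$words = array('a', 'b', 'c'); # same input as above

$insert = $db->prepare('INSERT OR IGNORE INTO words (word) VALUES (?)');
$db->beginTransaction();
foreach ($words as $word)
{
    $insert->execute(array($word)); # duplicates are skipped by the primary key
}
$db->commit();

# Checking for existence becomes an indexed lookup rather than a full scan.
$check = $db->prepare('SELECT 1 FROM words WHERE word = ?');
$check->execute(array('a'));
var_dump((bool) $check->fetchColumn());
?>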

konforce
Thank you. I will play with this and let you know ASAP.
Jim_Bo
Seconding the database solution, especially if you will be doing this frequently. However, +1 to @konforce for the sequential file reading to avoid memory issues.
Fanis
Bad idea. Why use an algorithm with O(n) running time if O(1) is possible? This is a simple hashmap lookup!
Joe Hopfgartner
How do you think you build a hash table from a plain text file, which is what the OP is working with? By reading through the contents in O(n) time... I finished my answer with a disclaimer that this is an inefficient solution. With SQLite you get a "hash table" (index) plus an external file (potentially saving memory).
konforce
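(As an aside, the hash lookup being discussed amounts to using PHP array keys: building the set is one O(n) pass over the file, assuming it fits in memory, and each check afterwards is O(1) on average. A minimal illustration, with words.txt as a placeholder:)

<?php
$set = array();
foreach (file('words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line)
{
    $set[$line] = true; # one pass over the file builds the lookup set
}

var_dump(isset($set['example'])); # true if "example" is already in the file
?>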
Now I have decided to break the "filter" files up by the first letter of the word. I will need to open the directory and search only that file. Thank you for your guidance; I will try to figure out how to adapt your code to do that. If you already know how, could you post that too before I mark this as answered (although you effectively already have)? It would save me from posting another question if I get stuck.
Jim_Bo
You should probably just ask a new question about whatever trips you up. The basic change to the code above would be to move it into a function and call that function repeatedly with two parameters: 1) a filtered array containing only the words that begin with the same letter, and 2) the name of the file to search in; see the sketch after this comment.
konforce
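(A rough sketch of that refactoring, assuming per-letter files named a.txt, b.txt, and so on; the function name and sample words are illustrative:)

<?php
function addMissingWords(array $words, $filename)
{
    $fp = fopen($filename, 'c+'); # create the per-letter file if it doesn't exist
    if ($fp === false)
    {
        return;
    }

    while (($line = fgets($fp)) !== false)
    {
        $key = array_search(trim($line), $words);
        if ($key !== false)
        {
            unset($words[$key]); # already present, nothing to add
        }
    }

    foreach ($words as $word)
    {
        fputs($fp, "$word\n");   # append the words that were not found
    }

    fclose($fp);
}

# Group the words by first letter, then search only the matching file.
$byLetter = array();
foreach (array('apple', 'banana', 'avocado') as $word)
{
    $byLetter[strtolower($word[0])][] = $word;
}
foreach ($byLetter as $letter => $group)
{
    addMissingWords($group, "$letter.txt"); # e.g. a.txt, b.txt, ...
}
?>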
Thank you. I will be creating folders a-z and one called num. I will have it shift off the first character and then search all files in that folder until it matches. I have also chunked the files into 500 lines each. That should be a much better method. Thanks again. Answered!
Jim_Bo
+2  A: 

Sorry, but that sounds like a REALLY AWFUL APPROACH! Doing a full scan (of a table, list or whatever) just to check whether something already exists is simply wrong; this is what hash tables are for. Your case sounds like a classical database job... If you don't have a database available, you can use a local SQLite file, which provides the essential functionality. Let me explain the background: a lookup of "foo" in a hash table takes O(1) time, i.e. a constant amount of time, because the algorithm knows WHERE to look and can check whether it is THERE. Hash tables can run into collisions because of the one-way nature of hashing, but that rarely matters in practice, because the table delivers a small set of candidate matches that can be compared directly (for a reasonable number of elements; something like the Google index is another story). So if you want to stay with your text-file approach (for some reason), consider the following:

  • Sort your file and insert new data at the right place (alphabetically would be the most intuitive approach). Then you can jump from position to position and isolate the area where the word should be; there are several algorithms available, just have a Google, and a sketch follows below. Keep in mind that it takes longer the more data you have; usually your running time will be O(log n), where n is the size of the table.
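A minimal sketch of that idea, with the sorted list simply kept in memory for clarity (the function name and sample words are illustrative):

<?php
function wordExists(array $sorted, $needle)
{
    $lo = 0;
    $hi = count($sorted) - 1;
    while ($lo <= $hi)
    {
        $mid = (int) (($lo + $hi) / 2);
        $cmp = strcmp($sorted[$mid], $needle);
        if ($cmp === 0)
        {
            return true;
        }
        elseif ($cmp < 0)
        {
            $lo = $mid + 1; # needle sorts after the midpoint, search the upper half
        }
        else
        {
            $hi = $mid - 1; # needle sorts before the midpoint, search the lower half
        }
    }
    return false;
}

$sorted = array('apple', 'banana', 'cherry'); # must already be sorted
var_dump(wordExists($sorted, 'banana'));      # bool(true)
?>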

Well, this is all basically just to guide you onto the right track. You could also shard your data: for example, save every word beginning with "a" in the file a.txt, and so on. Or split the word into characters, create a folder for each character with the last character as the file name, and then just check whether the file exists. These are crude suggestions (you would probably run out of inodes on your disk), but they illustrate that you can CHECK for EXISTENCE without having to do a FULL SCAN.

The main thing is that you have to project some kind of search tree onto a reasonable structure (as a database system does automatically for you). The folder example just illustrates the basic principle.

this wikipedia entry might be a good place to start: http://en.wikipedia.org/wiki/Binary_search_tree

Joe Hopfgartner
That makes great sense, and it inspired a better approach. I have since decided to break up the "filter" files into alphabetically named files (a.txt, b.txt, etc.). I then grab the first character of the word in question and search in that file only. Combined with konforce's snippet, this is far better than what I had. It is still a pure flat-file approach; I will be using MySQL as well, as soon as I learn to that level.
Jim_Bo
ps, one up 4u 2!
Jim_Bo
Joe is absolutely correct regarding the efficiency. If you don't need to store the data in plain text files, then you simply shouldn't. If I were you, I'd spend the time learning SQLite (it comes bundled with PHP and requires no external database server) instead of building your own pseudo-index and solving concurrency issues with file locks. There's really nothing to be gained from rolling your own inefficient text-based database.
konforce
Agreed, @konforce, but it's a good way of learning :)
Joe Hopfgartner