tags:
views: 42
answers: 2

I have a couple hundred single words that are identified in a foreach routine and pushed into an array.

I would like to check each word to see if it already exists in a text file that is a single column of 200k+ lines.

(Similar to a huge "bad word" routine, I guess, but in the end this will add to the "filter" file.)

I don't know whether I should do this with preg_match in the loop, or combine the arrays somehow and use array_unique.

I would also like to add the words that are not found to the main file, and use flock() in an attempt to avoid any concurrent-access issues.
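For reference, the "combine the arrays" idea mentioned above might look roughly like this (words.txt is a placeholder name; this sketch assumes the filter file fits in memory):

<?php
$words = array('alpha', 'beta', 'gamma'); # words collected in the foreach routine

$fp = fopen('words.txt', 'c+');           # placeholder filename; create if missing
if ($fp === false)
{
    die('Could not open words.txt');
}

if (flock($fp, LOCK_EX))                  # exclusive lock for the read + append
{
    $existing = array();
    while (($line = fgets($fp)) !== false)
    {
        $existing[trim($line)] = true;    # array keys give fast lookups
    }

    foreach (array_unique($words) as $word)
    {
        if (!isset($existing[$word]))
        {
            fwrite($fp, $word . "\n");    # append only the words not already there
        }
    }

    fflush($fp);
    flock($fp, LOCK_UN);
}
fclose($fp);
?>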

Is this a pipe dream? Well, it is for this beginner. My attempts have timed out at the 30-second limit.

Stack Overflow has been such a great resource. I don't know what I would do without it. Thanks in advance either way.

+1  A: 

If the file is too large to read into memory all at once, you can process it line by line:

<?php
$words = array('a', 'b', 'c'); # words to insert, assumed to be unique

$fp = fopen('words.txt', 'r+');
if ($fp === false)
{
    die('Could not open words.txt');
}

# Remove from $words every word that already appears in the file.
while (($line = fgets($fp)) !== false)
{
    $key = array_search(trim($line), $words);
    if ($key !== false)
    {
        unset($words[$key]);
        if (!$words) break; # nothing left to look for
    }
}

# Append the remaining (missing) words to the end of the file.
# (The fseek matters if the loop above exited early via break.)
fseek($fp, 0, SEEK_END);
foreach ($words as $word)
{
    fputs($fp, "$word\n");
}

fclose($fp);
?>

It loops through the entire file, checking whether the current line (assumed to be a single word) exists in the array. If it does, that element is removed from the array; if nothing is left in the array, the search stops early. After cycling through the file, any words still left in the array are appended to it.

(File locking and error handling are not implemented in this example.)

Note that this is a very bad way to store this data (file based, unsorted, etc.). Even SQLite would be a big improvement. You could always write an exporter back to .txt if you needed the data in plain text.
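For what it's worth, a rough sketch of the SQLite route via PDO's sqlite driver might look like this (the database file and table name are just placeholders):

<?php
$db = new PDO('sqlite:words.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)');

$words = array('a', 'b', 'c'); # same input as above

$insert = $db->prepare('INSERT OR IGNORE INTO words (word) VALUES (?)');
$db->beginTransaction();
foreach ($words as $word)
{
    $insert->execute(array($word)); # duplicates are skipped by the primary key
}
$db->commit();

# Checking for existence becomes an indexed lookup rather than a full scan.
$check = $db->prepare('SELECT 1 FROM words WHERE word = ?');
$check->execute(array('a'));
var_dump((bool) $check->fetchColumn());
?>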

konforce
Thank you. I will play with this and let you know ASAP.
Jim_Bo
Seconding the database solution, especially if you will be doing this frequently. However, +1 to @konforce for the sequential file reading to avoid memory issues.
Fanis
Bad idea. Why use an algorithm with O(n) running time if O(1) is possible? This is a simple hashmap lookup!
Joe Hopfgartner
How do you think you build a hash table from a plain text file, which is what the OP is working with? By reading through the contents in O(n) time... I finished my answer with a disclaimer that this is an inefficient solution. With SQLite you get a "hash table" (index) plus an external file (potentially saving memory).
konforce
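(As an aside, the hash lookup being discussed amounts to using PHP array keys: building the set is one O(n) pass over the file, assuming it fits in memory, and each check afterwards is O(1) on average. A minimal illustration, with words.txt as a placeholder:)

<?php
$set = array();
foreach (file('words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line)
{
    $set[$line] = true; # one pass over the file builds the lookup set
}

var_dump(isset($set['example'])); # true if "example" is already in the file
?>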
Now I have decided to break the "filter" files up by the first letter of the word. I will need to open the directory and search only that file. Thank you for your guidance; I will try to figure out how to adapt your code to do that. If you already know how, could you post that too before I mark this as answered (although you effectively already have)? It would save me from posting another question if I get stuck.
Jim_Bo
You should probably just ask a new question about whatever trips you up. The basic change to the code above would be to move it into a function and call that function repeatedly with two parameters: 1) a filtered array containing only the words that begin with the same letter, and 2) the name of the file to search in; see the sketch after this comment.
konforce
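(A rough sketch of that refactoring, assuming per-letter files named a.txt, b.txt, and so on; the function name and sample words are illustrative:)

<?php
function addMissingWords(array $words, $filename)
{
    $fp = fopen($filename, 'c+'); # create the per-letter file if it doesn't exist
    if ($fp === false)
    {
        return;
    }

    while (($line = fgets($fp)) !== false)
    {
        $key = array_search(trim($line), $words);
        if ($key !== false)
        {
            unset($words[$key]); # already present, nothing to add
        }
    }

    foreach ($words as $word)
    {
        fputs($fp, "$word\n");   # append the words that were not found
    }

    fclose($fp);
}

# Group the words by first letter, then search only the matching file.
$byLetter = array();
foreach (array('apple', 'banana', 'avocado') as $word)
{
    $byLetter[strtolower($word[0])][] = $word;
}
foreach ($byLetter as $letter => $group)
{
    addMissingWords($group, "$letter.txt"); # e.g. a.txt, b.txt, ...
}
?>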
Thank you. I will be creating folders a-z and one called num. I will have it shift off the first character and then search all files in that folder until it matches. I have also chunked the files into 500 lines each. That should be a much better method. Thanks again. Answered!
Jim_Bo
+2  A: 

Sorry, but that sounds like a REALLY AWFUL APPROACH! Doing a full scan (of a table, list or whatever) just to check whether something already exists is simply wrong; this is what hash tables are for. Your case sounds like a classical database job... If you don't have a database available, you can use a local SQLite file, which provides the essential functionality. Let me explain the background: a lookup of "foo" in a hash table takes O(1) time, i.e. a constant amount of time, because the algorithm knows WHERE to look and can check whether it is THERE. Hash tables can run into collisions because of the one-way nature of hashing, but that rarely matters in practice, because the table delivers a small set of candidate matches that can be compared directly (for a reasonable number of elements; something like the Google index is another story). So if you want to stay with your text-file approach (for some reason), consider the following:

  • Sort your file and insert new data at the right place (alphabetically would be the most intuitive approach). Then you can jump from position to position and isolate the area where the word should be; there are several algorithms available, just have a Google, and a sketch follows below. Keep in mind that it takes longer the more data you have; usually your running time will be O(log n), where n is the size of the table.
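A minimal sketch of that idea, with the sorted list simply kept in memory for clarity (the function name and sample words are illustrative):

<?php
function wordExists(array $sorted, $needle)
{
    $lo = 0;
    $hi = count($sorted) - 1;
    while ($lo <= $hi)
    {
        $mid = (int) (($lo + $hi) / 2);
        $cmp = strcmp($sorted[$mid], $needle);
        if ($cmp === 0)
        {
            return true;
        }
        elseif ($cmp < 0)
        {
            $lo = $mid + 1; # needle sorts after the midpoint, search the upper half
        }
        else
        {
            $hi = $mid - 1; # needle sorts before the midpoint, search the lower half
        }
    }
    return false;
}

$sorted = array('apple', 'banana', 'cherry'); # must already be sorted
var_dump(wordExists($sorted, 'banana'));      # bool(true)
?>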

Well, this is all basically just to guide you onto the right track. You could also shard your data: for example, save every word beginning with "a" in the file a.txt, and so on. Or split the word into characters, create a folder for each character with the last character as the file name, and then just check whether the file exists. These are crude suggestions (you would probably run out of inodes on your disk), but they illustrate that you can CHECK for EXISTENCE without having to do a FULL SCAN.

The main thing is that you have to project some kind of search tree onto a reasonable structure (as a database system does automatically for you). The folder example just illustrates the basic principle.

this wikipedia entry might be a good place to start: http://en.wikipedia.org/wiki/Binary_search_tree

Joe Hopfgartner
That makes great sense, and it inspired a better approach. I have since decided to break up the "filter" files into alphabetically named files (a.txt, b.txt, etc.). I then grab the first character of the word in question and search in that file only. Combined with konforce's snippet, this is far better than what I had. It is still a pure flat-file approach; I will be using MySQL as well, as soon as I learn to that level.
Jim_Bo
ps, one up 4u 2!
Jim_Bo
Joe is absolutely correct regarding the efficiency. If you don't need to store the data in plain text files, then you simply shouldn't. If I were you, I'd spend the time learning SQLite (it comes bundled with PHP and requires no external database server) instead of building your own pseudo-index and solving concurrency issues with file locks. There's really nothing to be gained from rolling your own inefficient text-based database.
konforce
Agreed, @konforce, but it's a good way of learning :)
Joe Hopfgartner