The code below goes through the files in a directory, reads them, and saves the lines to a new directory in files of at most 500 lines each. This works great for me (thanks Daniel), but I need a modification: I would like to save the lines into folders named by their first character.

I assume the first step would be to sort the array alphanumerically (the lines are already lowercase).

Grab all of the lines in each $incoming."/*.txt" that start with "a" and put them into a folder at $save500."/a", with a maximum of 500 lines per file. (I guess it would be best to start with whatever comes first in the sort, so "0" rather than "a", right?)

All of the lines that start with a number go into $save500."/num".

None of the lines will start with anything other than a-z or 0-9.
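To make the rule above concrete, here is a minimal sketch of the grouping step. The helper names `bucketFor()` and `chunkByBucket()` are hypothetical (they are not part of the original code), and the sketch assumes PHP 7.4 or newer because of the type declarations and the arrow function. It only builds the chunks in memory; writing them to disk is left out.

```php
<?php
// Hypothetical helper: returns the folder name a line belongs in.
// Lines starting with a digit go to "num"; all others use their first character.
function bucketFor(string $line): string
{
    $first = substr($line, 0, 1);
    // ctype_digit() is true only for non-empty strings made of 0-9.
    return ctype_digit($first) ? 'num' : $first;
}

// Hypothetical helper: sorts the lines, groups them by bucket, and splits
// each bucket into chunks of at most $max lines.
function chunkByBucket(array $lines, int $max = 500): array
{
    sort($lines, SORT_STRING);
    $buckets = [];
    foreach ($lines as $line) {
        if ($line === '') {
            continue; // nothing to group on
        }
        $buckets[bucketFor($line)][] = $line;
    }
    // array_map() keeps the string keys when it is given a single array.
    return array_map(fn($bucket) => array_chunk($bucket, $max), $buckets);
}
```

Each entry of the result is a list of chunks for one folder, so `chunkByBucket($lines)['num'][0]` would hold the lines for the first file in the num folder.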

This will let me search my files for a match more efficiently with this flat-file method, by narrowing the search down to one folder.

$nextfile = 0;
if (glob($incoming . "/*.txt") != false) {
    $nextfile = count(glob($save500 . "/*.txt"));
    $nextfile++;
} else {
    $nextfile = 1;
}

$files = glob($incoming . "/*.txt");
$lines = array();
foreach ($files as $file) {
    $lines = array_merge($lines, file($file, FILE_SKIP_EMPTY_LINES | FILE_IGNORE_NEW_LINES));
}
$lines = array_unique($lines);

/* this would put them all in one file */
/* file_put_contents($dirname . "/done/allofthem.txt", implode("\n", $lines)); */

/* this breaks them into files of 500 */
foreach (array_chunk($lines, 500) as $chunk) {
    file_put_contents($save500 . "/" . $nextfile . ".txt", implode("\n", $chunk));
    $nextfile++;
}

Each file still needs to hold a maximum of 500 lines.

I will graduate to MySQL later on. I have only been doing this for a couple of months now.

As if that were not enough, I even thought of taking the first two characters off each line and making directories with subdirectories, a/0 through z/z!

The approach above could be the wrong one, since there have been no responses.

But I want a word like aardvark appended to 1.txt in the a/a folder, unless 1.txt already has 500 lines, in which case it should be saved to 2.txt in a/a.

So xenia would be appended to 1.txt in the x/e folder, unless that file already has 500 lines, in which case 2.txt is created and the word is saved there.
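The append-until-full behaviour described above could be sketched roughly as follows. `appendWord()` is a hypothetical name, not something from the original code. The sketch assumes every word has at least two characters that are safe as directory names, that files in a folder are numbered 1.txt, 2.txt, ... without gaps, and that every stored line ends with a newline (the chunked files written by the original code have no trailing newline, so they would need one added before appending to them).

```php
<?php
// Hypothetical helper: appends $word to the highest-numbered file in
// $base/<first char>/<second char>/ and starts a new file once the
// current one already holds $max lines. Returns the path written to.
function appendWord(string $base, string $word, int $max = 500): string
{
    $dir = $base . '/' . $word[0] . '/' . $word[1];
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true); // recursive: creates a/ and a/a/ as needed
    }

    // Assumption: files are numbered without gaps, so the file count
    // is also the number of the current (last) file.
    $existing = glob($dir . '/*.txt') ?: [];
    $n = max(1, count($existing));
    $path = $dir . '/' . $n . '.txt';

    // Move on to the next number when the current file is full.
    if (is_file($path) && count(file($path, FILE_IGNORE_NEW_LINES)) >= $max) {
        $n++;
        $path = $dir . '/' . $n . '.txt';
    }

    file_put_contents($path, $word . "\n", FILE_APPEND);
    return $path;
}
```

Counting the lines of the last file on every call is simple but not fast; for bulk imports it would probably be better to group the words first and write each folder once.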

I will then be able to search for those words more efficiently, without loading a ton into memory or looping through files and lines that cannot contain a match.
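As a rough illustration of that search, the sketch below reads only the one folder a word can live in. `wordExists()` is a hypothetical helper, and it assumes the a/a through z/z layout described above and words of at least two characters.

```php
<?php
// Hypothetical helper: returns true if $word is already stored under
// $base/<first char>/<second char>/. No other folder is read.
function wordExists(string $base, string $word): bool
{
    $dir = $base . '/' . $word[0] . '/' . $word[1];
    // glob() returns an empty array (or false on error) when nothing matches.
    foreach (glob($dir . '/*.txt') ?: [] as $file) {
        // Each file holds at most 500 lines, so only a little is in memory at once.
        $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        if (in_array($word, $lines, true)) {
            return true;
        }
    }
    return false;
}
```

The same check could be run before saving a new word, so that duplicates are never written.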

Thanks everyone!

+1  A: 

Hi,

I wrote some code that should do what you're looking for. It's not a performance beauty, but it should do the job. Try it in a safe environment; no guarantee against data loss ;)

Comment if there are any errors. It's pretty late here ;) I have to get some sleep ;)

NOTE: This only works if every line has at least 2 characters! ;)

$nextfile=0;

if (glob("" . $incoming . "/*.txt") != false){
  $nextfile = count(glob("" . $save500 . "/*.txt"));
  $nextfile++;
}
else
{
  $nextfile = 1;
}



$files = glob($incoming."/*.txt");
$lines = array();
foreach($files as $file){
  $lines = array_merge($lines, file($file, FILE_SKIP_EMPTY_LINES | FILE_IGNORE_NEW_LINES));
}


$lines = array_unique($lines);


/*this would put them all in one file*/
/*file_put_contents($dirname."/done/allofthem.txt", implode("\n", $lines));*/
/*this breaks them into files of 500*/

// sort array
sort($lines);

// outer grouping
$groups     = groupArray($lines, 0);
$group_keys = array_keys($groups);

foreach($group_keys as $cKey) {
  // inner grouping
  $groups[$cKey] = groupArray($groups[$cKey], 1);

  foreach($groups[$cKey] as $innerKey => $innerArray) {
    $nextfile = 1;

    // file_put_contents() does not create missing directories, so make sure the target folder exists
    $dir = $save500 . "/" . $cKey . "/" . $innerKey;
    if(!is_dir($dir)) {
      mkdir($dir, 0777, true);
    }

    foreach(array_chunk($innerArray, 500) as $chunk) {
      file_put_contents($dir . "/" . $nextfile . ".txt", implode("\n", $chunk));
      $nextfile++;
    }
  }

}


function groupArray($data, $offset) {

  $grouped = array();

  foreach($data as $cLine) {
    $key = substr($cLine, $offset, 1);
    if(!isset($grouped[$key])) {
      $grouped[$key] = array($cLine);
    } 
    else
    {
      $grouped[$key][] = $cLine;
    }
  }

  return $grouped;
}
sled
Thank you. I did 16 hrs today, so I feel ya. I will test in the morning and supply mucho kudos, I am sure. Thanks so much...
Jim_Bo
"undefined call to grouparray() on line etc... "(where it is called for the first time.) So I lowercased all instances of groupArray and still the same error. EDIT- SORRY MY BAD - I had this whole routine stuck in an if (i want to run this){this snippet}. Will continue to test.
Jim_Bo
It should work. I quick-tested it with a dummy array. Try moving the function to the top, then. Otherwise, re-post your source.
sled
It works. I found out I have some lines with _ . + or - as the second character (but never as the first), so I need to handle those. I want to put a-ardvark in the a/sc folder (sc = special character). Is that possible? I also now want to search these files, so that before new lines from the $incoming files are added to the array, it can check the folders for dupes before saving. Should I ask another question? How can I let you know that I have posted it?
Jim_Bo
Just to be clear, I was concerned that sc may interfere with s.
Jim_Bo
$sc = array("-", "_", ".", "@", "+", "~"); if (in_array($innerKey, $sc)) {$innerKey="sc";}
Jim_Bo
I put the array at the top of the routine. I am thinking I should put the if right under the $nextfile = 1 in the inner grouping loop?
Jim_Bo
Put the array outside of the loop. It never changes, so that's better for performance ;)
sled
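A minimal sketch of the special-character mapping discussed in these comments, with the array defined once outside any loop. `folderForKey()` is a hypothetical helper name; the character list is the one from the comment above. Because the single-character keys can never equal the two-character name "sc", the sc folder does not collide with the s folder.

```php
<?php
// Defined once, outside any loop, because it never changes.
$sc = array("-", "_", ".", "@", "+", "~");

// Hypothetical helper: maps a second-character key to its folder name.
// Special characters share the "sc" folder; everything else keeps its own.
function folderForKey(string $key, array $sc): string
{
    return in_array($key, $sc, true) ? "sc" : $key;
}
```

In the inner loop this would be called before the target path is built, for example `$folder = folderForKey((string) $innerKey, $sc);`. The cast is there because PHP turns numeric string keys such as "0" into integers.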
Answer upvoted, and thanks. Here is the continuation: http://stackoverflow.com/questions/3709059/php-writing-lines-to-specific-folders-based-on-first-two-chars-then-searching-t
Jim_Bo
Thanks again sled, you really helped me.
Jim_Bo
Need your help again, sled. Follow the link in the comment above.
Jim_Bo