How can I split a large text file into separate files by character count using PHP? For example, a 10,000-character file split every 1,000 characters would produce 10 files. Also, can I make it split only after a full stop is found?

Thanks.

UPDATE 1: I like zombat's code. I removed some errors and came up with the following, but does anyone know how to split only after a full stop?

$i = 1;
$fp = fopen("test.txt", "r");
while (!feof($fp)) {
    $contents = fread($fp, 1000);
    file_put_contents('new_file_'.$i.'.txt', $contents);
    $i++;
}

UPDATE 2: I took zombat's suggestion and modified the code as shown below, and it seems to work:

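$i = 1;
$fp = fopen("test.txt", "r");
while (!feof($fp)) {
    // Read a fixed-size block, then continue up to the next full stop.
    $contents = fread($fp, 20000);
    $tail = stream_get_line($fp, 1000, ".");
    if ($tail !== false) {
        // stream_get_line() does not return the delimiter, so add it back.
        $contents .= $tail.".";
    }

    // $tname (the output subfolder name) is assumed to be set earlier.
    file_put_contents("Split/".$tname."/"."new_file_".$i.".txt", $contents);
    $i++;
}
fclose($fp);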
+1  A: 

The easiest way is to read the contents of the file, split the content, then save each piece to a new file. If your files are more than a few gigabytes, you're going to have a problem doing it in PHP due to integer size limitations.
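
For files small enough to fit in memory, the read-everything-then-split approach can be sketched roughly as below. This is only a minimal sketch, not the answerer's own code: the function name `split_in_memory` and the `new_file_N.txt` naming are assumptions made for illustration. Note that `str_split()` counts bytes, so multibyte text may be cut in the middle of a character.

```php
<?php
// Hypothetical helper (name is an assumption): load the whole file,
// cut it into fixed-size parts, and write each part to its own file.
function split_in_memory(string $source, string $outDir, int $chunkSize = 1000): int
{
    $text = file_get_contents($source);
    if ($text === false || $text === '') {
        return 0; // unreadable or empty source: nothing to write
    }

    // str_split() cuts by bytes, not by multibyte characters.
    $parts = str_split($text, $chunkSize);
    foreach ($parts as $index => $part) {
        file_put_contents($outDir . '/new_file_' . ($index + 1) . '.txt', $part);
    }

    return count($parts);
}
```

A call such as `split_in_memory('test.txt', 'Split', 1000)` would return the number of files written; the output folder is assumed to exist already.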

Ian
Assuming a large file, it would be much more efficient to simply read in the desired number of bytes in a loop, rather than read in the entire original file at once. You wouldn't have any size issues either, unless you were reading in file chunks larger than the integer maximum.
zombat
I should have been more specific: the text files won't be more than 10-15MB in size.
usertest
@zombat, PHP cannot make a file read pointer go past 4,294,967,296 bytes, so if your file is more than 4GB, even if you read it in chunks, PHP will crap out once it reaches the 4GB mark.
Ian
@Ian - The max integer size is defined by PHP_INT_MAX, a constant set based on the OS, so yes, you couldn't use anything larger than that on a 32-bit system. You'd have to use a work-around in that case.
zombat
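
To see which integer limit applies on a given machine, the constant mentioned above can simply be printed. A small sketch; `describe_int_limits` is a hypothetical helper name used only for this illustration.

```php
<?php
// Hypothetical helper (name is an assumption): report the integer width
// and maximum integer value of the running PHP build.
// PHP_INT_SIZE is 4 on 32-bit builds and 8 on 64-bit builds.
function describe_int_limits(): string
{
    return sprintf('%d-bit integers, PHP_INT_MAX = %d', PHP_INT_SIZE * 8, PHP_INT_MAX);
}

echo describe_int_limits(), PHP_EOL;
```

On a 64-bit build the reported maximum is far beyond any realistic file size, so the limit discussed in these comments mainly concerns 32-bit builds.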
A: 

You could also write a class to do this for you.

<?php

/**
* filesplit class : Split big text files in multiple files
*
* @package
* @author Ben Yacoub Hatem <[email protected]>
* @copyright Copyright (c) 2004
* @version $Id$ - 29/05/2004 09:02:10 - filesplit.class.php
* @access public
**/
class filesplit{
    /**
     * Constructor
     * @access protected
     */
    function filesplit(){

    }

    /**
     * File to split
     * @access private
     * @var string
     **/
    var $_source = 'logs.txt';

    /**
     *
     * @access public
     * @return string
     **/
    function Getsource(){
        return $this->_source;
    }

    /**
     *
     * @access public
     * @return void
     **/
    function Setsource($newValue){
        $this->_source = $newValue;
    }

    /**
     * Number of lines per file
     * @access private
     * @var integer
     **/
    var $_lines = 1000;

    /**
     *
     * @access public
     * @return integer
     **/
    function Getlines(){
        return $this->_lines;
    }

    /**
     *
     * @access public
     * @return void
     **/
    function Setlines($newValue){
        $this->_lines = $newValue;
    }

    /**
     * Folder in which to create the split files, with a trailing slash
     * @access private
     * @var string
     **/
    var $_path = 'logs/';

    /**
     *
     * @access public
     * @return string
     **/
    function Getpath(){
        return $this->_path;
    }

    /**
     *
     * @access public
     * @return void
     **/
    function Setpath($newValue){
        $this->_path = $newValue;
    }

    /**
     * Configure the class
     * @access public
     * @return void
     **/
    function configure($source = "",$path = "",$lines = ""){
        if ($source != "") {
            $this->Setsource($source);
        }
        if ($path!="") {
            $this->Setpath($path);
        }
        if ($lines!="") {
            $this->Setlines($lines);
        }
    }


    /**
     *
     * @access public
     * @return void
     **/
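    function run(){
        $i = 0;
        $j = 1;
        $date = date("m-d-y");
        $buffer = '';

        $handle = @fopen($this->Getsource(), "r");
        if (!$handle) {
            print "Cannot open file (".$this->Getsource().")";
            exit;
        }
        while (!feof($handle)) {
            $line = fgets($handle, 4096);
            if ($line !== false) {
                $buffer .= $line;
                $i++;
            }
            // Flush when enough lines are buffered, or at end of file if any remain.
            if ($buffer !== '' && ($i >= $this->Getlines() || feof($handle))) {
                $fname = $this->Getpath()."part.$date.$j.txt";
                if (!$fhandle = @fopen($fname, 'w')) {
                    print "Cannot open file ($fname)";
                    exit;
                }

                if (@fwrite($fhandle, $buffer) === false) {
                    print "Cannot write to file ($fname)";
                    exit;
                }
                fclose($fhandle);
                $j++;
                // Reset the buffer and line counter for the next part.
                $buffer = '';
                $i = 0;
            }
        }
        fclose($handle);
    }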


}
?>


Usage Example
<?php
/**
* Sample usage of the filesplit class
*
* @package filesplit
* @author Ben Yacoub Hatem <[email protected]>
* @copyright Copyright (c) 2004
* @version $Id$ - 29/05/2004 09:14:06 - usage.php
* @access public
**/

require_once("filesplit.class.php");

$s = new filesplit;

/*
$s->Setsource("logs.txt");
$s->Setpath("logs/");
$s->Setlines(100); //number of lines that each new file will have after the split.
*/

$s->configure("logs.txt", "logs/", 2000);
$s->run();
?>

Source http://www.weberdev.com/get_example-3894.html

streetparade
+1  A: 

You should be able to accomplish this easily with a basic fread(). You can specify how many bytes you want to read, so it's trivial to read in an exact amount and output it to a new file.

Try something like this:

$i = 1;
$fp = fopen("test.txt",'r');
while(! feof($fp)) {
    $contents = fread($fp,1000);
    file_put_contents('new_file_'.$i.'.txt',$contents);
    $i++;
}

EDIT

If you wish to stop after a certain length OR on a certain character, then you could use stream_get_line() instead of fread(). It's almost identical, except it allows you to specify any ending delimiter you wish. Note that it does not return the delimiter as part of the read.

$contents = stream_get_line($fp,1000,".");
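
To illustrate how the length and the delimiter interact, here is a small sketch that reads repeatedly from an in-memory stream. The helper name `read_pieces` and the sample sentence are assumptions for illustration only. Each read ends at the delimiter or at the length limit, whichever comes first, and the delimiter itself is dropped from the returned piece.

```php
<?php
// Hypothetical helper (name is an assumption): collect every piece that
// stream_get_line() returns for the given text, length limit, and delimiter.
function read_pieces(string $text, int $maxLength, string $delimiter): array
{
    // php://memory gives a throwaway stream, so no file on disk is needed.
    $fp = fopen('php://memory', 'r+');
    fwrite($fp, $text);
    rewind($fp);

    $pieces = [];
    // stream_get_line() returns false once nothing is left to read.
    while (($piece = stream_get_line($fp, $maxLength, $delimiter)) !== false) {
        $pieces[] = $piece;
    }
    fclose($fp);

    return $pieces;
}

// Short sentences end at the full stop; the long one is cut at 10 bytes.
print_r(read_pieces('One. Two. A much longer sentence here.', 10, '.'));
```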
zombat
Thanks, but can it only split after the full stop?
usertest
See my updated answer.
zombat
I tried stream_get_line with 20000 as the length instead and used one of the large text files on the gutenberg site - http://www.gutenberg.org/files/2759/2759.txt. Over a thousand files with little to no text in them were created.
usertest
I think the problem is it stops reading by number of characters or the delimiter, whichever comes first. The problem is the full stop always comes first. So the files end up being very short. Instead it should go the length plus however long until it gets to the delimiter.
usertest
If that's what you're trying to do then you should do an initial `fread($fp,1000)`, then follow it with the `stream_get_line()`. You'll get a full block of text before you have to worry about looking for the ending delimiter.
zombat
How would that work? I tried the following which outputted the same small files. $contents = fread($fp,1000);$contents = stream_get_line($contents,1000,".");
usertest
How do I use the return value from fread (which is a string) as a parameter in stream_get_line (which requires a resource handle)? Thanks.
usertest
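
One way to read the suggestion above is that the string returned by fread() is never passed to stream_get_line() at all. Both calls take the same file handle, and the second simply continues reading where the first stopped. The sketch below works under that reading. The function name `split_after_full_stop`, its parameters, and the `new_file_N.txt` naming are assumptions for illustration, and the `ftell()` comparison is just one possible way to tell whether a full stop was actually consumed.

```php
<?php
// Hypothetical helper (name and parameters are assumptions): write parts of
// roughly $chunkSize bytes, each extended to the next full stop.
function split_after_full_stop(string $source, string $outDir, int $chunkSize = 1000, int $maxExtra = 1000): int
{
    $fp = fopen($source, 'r');
    if ($fp === false) {
        return 0;
    }

    $count = 0;
    while (!feof($fp)) {
        $contents = fread($fp, $chunkSize);
        if ($contents === false || $contents === '') {
            break; // nothing left to read
        }

        // Unless the block already ends on a full stop, keep reading from the
        // SAME handle up to the next full stop (or at most $maxExtra bytes).
        if (substr($contents, -1) !== '.') {
            $before = ftell($fp);
            $tail = stream_get_line($fp, $maxExtra, '.');
            if ($tail !== false) {
                $contents .= $tail;
                // The delimiter is consumed but not returned. Restore it only
                // if the file position shows it was really read.
                if (ftell($fp) - $before > strlen($tail)) {
                    $contents .= '.';
                }
            }
        }

        $count++;
        file_put_contents($outDir . '/new_file_' . $count . '.txt', $contents);
    }
    fclose($fp);

    return $count;
}
```

A call such as `split_after_full_stop('test.txt', 'Split', 20000, 1000)` would mirror the sizes used in the question. The output folder is assumed to exist, and a sentence longer than `$maxExtra` bytes would still be cut before its full stop.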