I have a file containing some number of lines. I want to split it into n files with particular names. It doesn't matter how many lines end up in each file; I just want a particular number of files (say 5). The problem is that the number of lines in the original file keeps changing, so I need to count the lines first and then split the file into 5 parts. If possible, each part should also go into a different directory.
On Linux, there is a split command:
split --bytes=1m /path/to/large/file /path/to/output/file/prefix
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT is -, read standard input.
...
-l, --lines=NUMBER put NUMBER lines per output file
...
You would have to calculate the actual size of the splits beforehand, though.
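As a rough sketch of that calculation, something like the following could work. The function name split_into_parts and its arguments are made up for illustration; only wc -l and split -l come from the tools discussed above.

```shell
# split_into_parts INPUT PARTS PREFIX   (hypothetical helper, not a standard tool)
split_into_parts() {
    input=$1 parts=$2 prefix=$3

    # Count the lines; reading from stdin keeps the file name out of wc's output.
    total=$(($(wc -l < "$input")))

    # Ceiling division, so split never produces more than PARTS files.
    per_file=$(( (total + parts - 1) / parts ))
    [ "$per_file" -gt 0 ] || per_file=1   # split rejects a line count of 0

    split -l "$per_file" "$input" "$prefix"
}
```

Called as split_into_parts data.txt 5 part_ this writes part_aa, part_ab, and so on. Note that when the line count is small relative to the number of parts, ceiling division can yield fewer files than requested (11 lines in 5 parts gives 4 files of 3, 3, 3 and 2 lines). Recent GNU versions of split also accept -n l/5, which asks for five chunks without breaking lines; that is GNU-specific, so check your local man page before relying on it.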
I can think of a few ways to do it. Which you would use depends a lot on the data.
1. Lines are fixed length: Find the size of the file by reading its directory entry and divide by the line length to get the number of lines. Use this to determine how many lines go in each file.
2. The files only need to have approximately the same number of lines: Again, read the file size from the directory entry. Read the first N lines (N should be small, but a reasonable fraction of the file) to calculate an average line length. Calculate the approximate number of lines from the file size and the predicted average line length. This assumes that the first N lines are representative of the whole file. If not, adjust your method to randomly sample lines (using seek() or something similar). Rewind the file after you have your average, then split it based on the predicted line length.
3. Read the file twice: The first time, count the number of lines. The second time, split the file into the requisite pieces.
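The read-twice approach can be sketched in shell roughly as follows. This is only an illustration: the function name split_exact, the OUTDIR/1 ... OUTDIR/N directory layout, and the file name chunk.txt are all assumptions chosen to show one part per directory, so substitute whatever names you actually need.

```shell
# split_exact INPUT PARTS OUTDIR   (hypothetical helper)
# Writes OUTDIR/1/chunk.txt ... OUTDIR/PARTS/chunk.txt (assumed layout).
split_exact() {
    input=$1 parts=$2 outdir=$3

    # First pass: count the lines.
    total=$(($(wc -l < "$input")))

    # Create every directory and an empty file, so exactly PARTS files exist
    # even when the input has fewer lines than PARTS.
    i=1
    while [ "$i" -le "$parts" ]; do
        mkdir -p "$outdir/$i"
        : > "$outdir/$i/chunk.txt"
        i=$((i + 1))
    done

    # Second pass: the first (total % parts) files get one extra line each.
    awk -v total="$total" -v parts="$parts" -v outdir="$outdir" '
        BEGIN {
            base = int(total / parts); extra = total % parts
            part = 1; quota = base + (extra >= 1 ? 1 : 0)
        }
        {
            # Move on to the next part once the current quota is filled.
            while (written >= quota && part < parts) {
                close(out); part++; written = 0
                quota = base + (part <= extra ? 1 : 0)
            }
            out = outdir "/" part "/chunk.txt"
            print > out
            written++
        }
    ' "$input"
}
```

For a 23-line file and 5 parts this gives pieces of 5, 5, 5, 4 and 4 lines. Unlike a plain split -l, the number of output files is always exactly PARTS, at the cost of reading the input twice.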
EDIT: Using a shell script (according to your comments), the randomized version of #2 would be hard unless you wrote a small program to do that for you. You should be able to use ls -l to get the file size, wc -l to count the exact number of lines, and head -nNNN | wc -c to get the byte count of the first NNN lines, from which you can calculate the average line length.
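Putting those commands together, a non-randomized estimate might look like the sketch below. The function name estimate_lines and the default sample of 100 lines are my own choices, and I have swapped ls -l for wc -c < file to get the size, only because its output is easier to use in a script.

```shell
# estimate_lines INPUT [SAMPLE]   (hypothetical helper)
# Prints an approximate line count based on the first SAMPLE lines.
estimate_lines() {
    input=$1 sample=${2:-100}

    # Total file size in bytes.
    size=$(($(wc -c < "$input")))

    # Bytes and lines actually present in the sample (the file may be shorter).
    sample_bytes=$(($(head -n "$sample" "$input" | wc -c)))
    sample_lines=$(($(head -n "$sample" "$input" | wc -l)))

    # Nothing to average over: report 0 rather than divide by zero.
    [ "$sample_lines" -gt 0 ] || { echo 0; return; }

    # size / (sample_bytes / sample_lines), rearranged to stay in integers.
    echo $(( size * sample_lines / sample_bytes ))
}
```

The result can then be divided by the number of pieces to get a value for split -l. The estimate is only as good as the sample: if the first lines are much shorter or longer than the rest, the pieces will be uneven, and an exact wc -l count is the safer choice when the file is small enough to read twice.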