views:

377

answers:

5

I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).

I could do this fairly easily in Python but I'm wondering if there's any kind of ninja way to do this using bash and unix utils (as opposed to manually looping and counting / partitioning lines).

+7  A: 

Have you looked at the split command?

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

You could do something like:

split -l 200000 filename
Mark Byers
+2  A: 

use 'split'

Split a file into fixed-size pieces, creates output files containing consecutive sections of INPUT (standard input if none is given or INPUT is `-')

Syntax split [options] [INPUT [PREFIX]]

http://ss64.com/bash/split.html

zipcodeman
+2  A: 

How about the split command?

split -l 200000 mybigfile.txt

Robert Christie
+1  A: 

Yes, there is a split command. It will split a file by lines or bytes.

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
Dave Kirby
A: 

you can also use awk

awk -vc=1 'NR%200000==0{++c}{print $0 > c".txt"}' largefile
ghostdog74
Not bad, but not so far off from what I would have done in Python.
danben
`awk -v lines=200000 -v fmt="%d.txt" '{print>sprintf(fmt,1+int((NR-1)/lines))}'`
Mark Edgar