views: 432
answers: 5

I'm trying to copy a chunk from one binary file into a new file. I have the byte offset and length of the chunk I want to grab.

I have tried using the dd utility, but it seems to read and discard the data up to the offset rather than just seeking to it (I guess because dd is designed for copying/converting blocks of data). This makes it quite slow (and slower the higher the offset). This is the command I tried:

dd if=inputfile ibs=1 skip=$offset count=$datalength of=outputfile

I guess I could write a small perl/python/whatever script to open the file, seek to the offset, then read and write the required amount of data in chunks.

Is there a utility that supports something like this?

A: 

You can use the

--input-position=POS

option of ddrescue, together with --size=BYTES to limit how much is copied. Something like this (note that ddrescue's output position defaults to the input position, hence the explicit --output-position=0 so the chunk lands at the start of the output file):

ddrescue --input-position=$offset --output-position=0 --size=$datalength inputfile outputfile

hlovdal
A: 

You can use tail -c+N to output everything from byte N onwards (discarding the first N-1 bytes of input), then head -cM to keep only the first M bytes of that.

$ echo "hello world 1234567890" | tail -c+9 | head -c6
rld 12

So using your variables (note that tail's +N is 1-based, so add 1 if $offset is a 0-based byte offset), it would be something like:

tail -c+$((offset + 1)) inputfile | head -c$datalength > outputfile
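To see the byte arithmetic concretely, here is a small runnable sketch (file paths are just for illustration):

```shell
# tail -c +N starts output at byte N (1-based), so a 0-based byte
# offset needs an extra +1.
offset=8        # 0-based offset of the chunk
datalength=6    # number of bytes wanted
printf 'hello world 1234567890' > /tmp/tail_in.bin
tail -c +$((offset + 1)) /tmp/tail_in.bin | head -c "$datalength" > /tmp/tail_out.bin
cat /tmp/tail_out.bin    # rld 12
```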


Ah, didn't see it had to seek. Leaving this as CW.

Mark Rushakoff
+1  A: 

Thanks for the other answers. Unfortunately, I'm not in a position to install additional software, so the ddrescue option is out. The head/tail solution is interesting (I didn't realise you could supply + to tail), but scanning through the data makes it quite slow.

I ended up writing a small Python script to do what I wanted. The buffer size could probably be tuned to match some external buffer setting, but the value below performs well enough on my system.

#!/usr/local/bin/python

import sys

BUFFER_SIZE = 100000

# Read args
if len(sys.argv) < 4:
    print >> sys.stderr, "Usage: %s input_file start_pos length" % (sys.argv[0],)
    sys.exit(1)
input_filename = sys.argv[1]
start_pos = int(sys.argv[2])
length = int(sys.argv[3])

# Open file in binary mode and seek to start pos
input_file = open(input_filename, 'rb')
input_file.seek(start_pos)

# Read and write data in chunks
while length > 0:
    # Read data
    buffer = input_file.read(min(BUFFER_SIZE, length))
    amount_read = len(buffer)

    # Check for EOF
    if not amount_read:
        print >> sys.stderr, "Reached EOF, exiting..."
        sys.exit(1)

    # Write data
    sys.stdout.write(buffer)
    length -= amount_read
kevinm
The buffer size should be large enough to keep the number of syscalls (and context switches) down, and a multiple of the page size to make the caching as happy as possible. Kernel readahead means that it won't likely have any real effect on the size of the disk I/Os requested. 100000 isn't a multiple of 4kiB, but values from 64kiB to 1MiB are reasonable.
hobbs
A: 

According to man dd on FreeBSD:

skip=n

Skip n blocks from the beginning of the input before copying. On input which supports seeks, an lseek(2) operation is used. Otherwise, input data is read and discarded. For pipes, the correct number of bytes is read. For all other devices, the correct number of blocks is read without distinguishing between a partial or complete block being read.

Using dtruss I verified that it does use lseek() on an input file on Mac OS X. If you just think that it is slow then I agree with the comment that this would be due to the 1-byte block size.

mark4o
+3  A: 

Yes it's awkward to do this with dd today. We're considering adding skip_bytes and count_bytes params to dd in coreutils to help. The following should work though:

#!/bin/sh

bs=100000
infile=$1
skip=$2
length=$3

(
  dd bs=1 skip=$skip count=0            # seek to the offset, copying nothing
  dd bs=$bs count=$(($length / $bs))    # copy whole blocks
  if [ $(($length % $bs)) -ne 0 ]; then
    dd bs=$(($length % $bs)) count=1    # copy any remaining partial block
  fi
) < "$infile"
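A quick check of this approach on sample data (file names are illustrative; dd's transfer stats on stderr are silenced):

```shell
# Copy 6 bytes starting at 0-based offset 8 out of a sample file.
printf 'hello world 1234567890' > /tmp/chunk_in.bin

bs=4        # deliberately tiny block size, to exercise every stage
skip=8
length=6

(
  dd bs=1 skip="$skip" count=0         # seek only
  dd bs="$bs" count=$((length / bs))   # one whole 4-byte block: "rld "
  dd bs=$((length % bs)) count=1       # remaining 2 bytes: "12"
) < /tmp/chunk_in.bin > /tmp/chunk_out.bin 2>/dev/null

cat /tmp/chunk_out.bin    # rld 12
```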
pixelbeat
Yeah, adding skip/count_bytes would be really useful, and make dd an easy-to-use general purpose byte-grabber :)
kevinm
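For what it's worth, newer GNU dd versions (coreutils 8.16 and later) did gain these as input flags, iflag=skip_bytes and iflag=count_bytes, which make skip= and count= count bytes instead of blocks. Assuming a dd that new, the whole job becomes one command:

```shell
# With iflag=skip_bytes,count_bytes, skip= and count= are in bytes,
# so a large block size can be used without any offset arithmetic.
printf 'hello world 1234567890' > /tmp/dd_in.bin
dd if=/tmp/dd_in.bin of=/tmp/dd_out.bin bs=64K skip=8 count=6 \
   iflag=skip_bytes,count_bytes 2>/dev/null
cat /tmp/dd_out.bin    # rld 12
```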