I am building a routine that processes disk buffers for forensic purposes. Am I better off using Python strings or the array() type? My first thought was to use strings, but I'm trying to avoid Unicode problems, so perhaps array('c') is better?
Write the code using what is most natural (strings), find out if it's too slow and then improve it.
Arrays can be used as drop-in replacements for str in most cases, as long as you restrict yourself to index and slice access. Both are fixed-length. Both should have about the same memory requirements. Arrays are mutable, in case you need to change the buffers. Arrays can read directly from files, so there's no speed penalty involved when reading.
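A minimal sketch of the array approach, written for Python 3, where the 'c' typecode no longer exists and 'B' (unsigned byte) is the closest equivalent. The temporary file and its contents are hypothetical stand-ins for a real disk image:

```python
import os
import tempfile
from array import array

# Hypothetical stand-in for a disk image: 512 bytes of sample data.
sample = bytes(range(256)) * 2

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    buf = array('B')  # unsigned bytes; array('c') exists only in Python 2
    with open(path, 'rb') as f:
        buf.fromfile(f, len(sample))  # read straight from the file into the array
finally:
    os.remove(path)

first_byte = buf[0]   # index access yields an int
chunk = buf[16:20]    # slicing yields another array
buf[0] = 0xFF         # arrays are mutable, so in-place edits work
```

Note that fromfile() raises EOFError if the file holds fewer items than requested, so the count has to match what is actually on disk.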
I don't understand how you avoid Unicode problems by using arrays, though. In Python 2, str is just an array of bytes and doesn't know anything about the encoding of the string.
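To illustrate the point in Python 3 terms, where the byte-oriented type is called bytes, here is a small sketch. The buffer contents are made up for the example; the idea is only that indexing and slicing never touch an encoding, and decoding is a separate, explicit step:

```python
raw = b'\xff\xfeMZ\x90\x00'  # hypothetical buffer contents, not valid UTF-8

# Indexing and slicing work on raw byte values; nothing is decoded.
signature = raw[2:4]
high_bytes = [b for b in raw if b > 0x7F]

# Decoding is an explicit, separate step, and only that step can fail.
try:
    raw.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```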
I assume that the "disk buffers" you mention can be rather large, so you might think about using mmap:
Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a', or change a substring by assigning to a slice: obj[i1:i2] = '...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.
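As a rough sketch of what that looks like in practice (Python 3, so patterns and assigned values are bytes rather than str), the following maps a small temporary file, searches it with re, and patches a slice in place. The file and its contents are hypothetical sample data, not a real image:

```python
import mmap
import os
import re
import tempfile

# Hypothetical sample data standing in for a disk image.
data = b'\x00' * 100 + b'JFIF' + b'\x00' * 100

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

try:
    with open(path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)  # length 0 maps the whole file
        try:
            match = re.search(b'JFIF', mm)   # regex search directly on the mapping
            offset = match.start()
            header = mm[offset:offset + 4]   # slicing returns bytes
            mm[0:4] = b'TEST'                # same-length slice assignment edits in place
            patched = mm[0:4]
        finally:
            mm.close()
finally:
    os.remove(path)
```

Slice assignment on a mapping cannot change its size, so the replacement has to be exactly as long as the slice it overwrites.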
If you need to alter the buffer in-place (it's not clear if you do, since you use the ambiguous term "to process"), arrays will likely be better, since strings are immutable. In Python 2.6 or better, however, bytearrays can be the best of both worlds: mutable, rich in methods, and usable with regular expressions too.
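A short sketch of the bytearray option, assuming Python 3 and a made-up buffer, showing the three properties together: in-place mutation, string-like methods, and regular expression support:

```python
import re

buf = bytearray(b'user=alice;pass=secret;user=bob')  # hypothetical buffer contents

# In-place mutation: overwrite a slice of the existing object.
start = buf.find(b'secret')
buf[start:start + 6] = b'*' * 6

# bytes-style methods are available directly on the bytearray.
starts_ok = buf.startswith(b'user=')

# Regular expressions accept a bytearray when the pattern is bytes.
users = re.findall(rb'user=(\w+)', buf)
```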
For read-only operations, strings have the edge over array (thanks to many more methods, plus extras such as regular expressions, available on them), if you're stuck with old Python versions and so cannot use bytearray. Unicode is not an issue in either case (in Python 2; in Python 3, definitely go for bytearray!-).