Setup
I'm writing a script to process and annotate build logs from Visual Studio. The build logs are HTML, and from what I can tell, Unicode (UTF-16?) as well. Here's a snippet from one of the files:
c:\anonyfolder\anonyfile.c(17169) : warning C4701: potentially uninitialized local variable 'object_adrs2' used
c:\anonyfolder\anonyfile.c(17409) : warning C4701: potentially uninitialized local variable 'pclcrd_ptr' used
c:\anonyfolder\anonyfile.c(17440) : warning C4701: potentially uninitialized local variable 'object_adrs2' used
The first 16 bytes of the file look like this:
feff 003c 0068 0074 006d 006c 003e 000d
The rest of the file is littered with null bytes as well.
I'd like to be able to perform string and regular expression searches/matches on these files. However, when I try the following code I get an error message.
buildLog = open(sys.argv[1]).readlines()
for line in buildLog:
match = u'warning'
if line.find(match) >= 0:
print line
The error message:
Traceback (most recent call last):
File "proclogs.py", line 60, in
if line.find(match) >= 0:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
Apparently it's choking on the 0xff
byte in 0xfeff
at the beginning of the file. If I skip the first line, I get no matches:
buildLog = open(sys.argv[1]).readlines()
for line in buildLog[1:]: # Skip the first line.
match = u'warning'
if line.find(match) >= 0:
print line
Likewise, using the non-Unicode match = 'warning'
produces no results.
Question
How can I portably search a Unicode file using strings and regular expressions in Python? Additionally, how can I do so such that I can reconstruct the original file? (The goal is to be able to write annotations on the warning lines without mangling the file.)