I have to convert a number of large files (up to 2GB each) from EBCDIC 500 to Latin-1. Since I could only find EBCDIC-to-ASCII converters (dd, recode), and the files contain some additional proprietary character codes, I thought I'd write my own converter.

I have the character mapping so I'm interested in the technical aspects.

This is my approach so far:

# char mapping lookup table
EBCDIC_TO_LATIN1 = {
    0xC1: '41',  # A
    0xC2: '42',  # B
    # and so on...
}

BUFFER_SIZE = 1024 * 64
ebd_file = file(sys.argv[1], 'rb')
latin1_file = file(sys.argv[2], 'wb')

buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(ebd2latin1(buffer))
    buffer = ebd_file.read(BUFFER_SIZE)

ebd_file.close()
latin1_file.close()

This is the function that does the converting:

def ebd2latin1(ebcdic):
    result = []
    for ch in ebcdic:
        result.append(EBCDIC_TO_LATIN1[ord(ch)])
    return ''.join(result).decode('hex')

The question is whether or not this is a sensible approach from an engineering standpoint. Does it have some serious design issues? Is the buffer size OK? And so on...

As for the "proprietary characters" that some don't believe in: Each file contains a year's worth of patent documents in SGML format. The patent office used EBCDIC until they switched to Unicode in 2005. So there are thousands of documents within each file. They are separated by some hex values that are not part of any IBM specification; they were added by the patent office. Also, at the beginning of each file there are a few digits in ASCII that tell you the length of the file. I don't really need that information, but if I want to process the file I have to deal with them.
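A minimal sketch of skipping that prefix (the helper name is invented; it relies only on ASCII digits being 0x30-0x39 while EBCDIC digits are 0xF0-0xF9, so the run ends at the first EBCDIC byte):

def skip_ascii_digits(f):
    # Hypothetical helper: consume the leading run of ASCII digits
    # (the file-length field), leaving f positioned on the first
    # non-digit byte.
    while True:
        ch = f.read(1)
        if not ch:            # EOF: file was nothing but digits
            return
        if not ch.isdigit():
            f.seek(-1, 1)     # step back over the non-digit byte
            return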

Also:

$ recode IBM500/CR-LF..Latin1 file.ebc
recode: file.ebc failed: Ambiguous output in step `CR-LF..data'

Thanks for the help so far.

+1  A: 

If you set up the table correctly, then you just need to do:

translated_chars = ebcdic.translate(EBCDIC_TO_LATIN1)

where ebcdic contains EBCDIC characters and EBCDIC_TO_LATIN1 is a 256-char string which maps each EBCDIC character to its Latin-1 equivalent. The characters in EBCDIC_TO_LATIN1 are the actual binary values rather than their hex representations. For example, if you are using code page 500, the first 16 bytes of EBCDIC_TO_LATIN1 would be

'\x00\x01\x02\x03\x37\x2D\x2E\x2F\x16\x05\x25\x0B\x0C\x0D\x0E\x0F'

using this reference.
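For concreteness, a sketch of this approach (not the answer's literal table; it builds the 256-byte string from Python's built-in cp500 codec, which is safe because cp500 maps every byte to a Latin-1 code point):

# Build the 256-byte translation table from the built-in cp500 codec
EBCDIC_TO_LATIN1 = ''.join(
    chr(i).decode('cp500').encode('latin1') for i in range(256))

translated_chars = ebcdic.translate(EBCDIC_TO_LATIN1)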

Vinay Sajip
+1 for str.translate, -1 for stuffing up the example: bytes [4:11] are wrongly transcribed, e.g. byte[4] should be \x9C but you have \x37, which is the digit '7' in ASCII/Latin1/Unicode :-O ... EBCDIC went silly with the alphabet, but nobody has yet been silly enough to design a codepage that didn't have the digits 0-9 in order at consecutive codepoints... also you picked codepage 500 to illustrate, not the one he is actually using and gave a link to ("1047").
John Machin
Eh? I thought he said "EBCDIC 500" in the question, so I went with that... but you're right, my table seems to be going in the opposite direction to what it should. Still, it's only meant as an illustration. The correct bytes [4:11] would be '\x9C\x09\x86\x7F\x97\x8D\x8E' for CP 1047.
Vinay Sajip
+1  A: 

EBCDIC 500, aka Code Page 500, is among Python's built-in encodings, although you link to cp1047, which isn't. Which one are you using, really? Anyway, this works for cp500 (or any other encoding that you have).

from __future__ import with_statement
import sys
from contextlib import nested

BUFFER_SIZE = 16384
with nested(open(sys.argv[1], 'rb'), open(sys.argv[2], 'wb')) as (infile, outfile):
    # cp500 is a single-byte encoding, so a chunk boundary can never
    # split a character
    while True:
        buffer = infile.read(BUFFER_SIZE)
        if not buffer:
            break
        outfile.write(buffer.decode('cp500').encode('latin1'))

This way you shouldn't need to keep track of the mappings yourself.

Lennart Regebro
Read the spec: "the files contain some additional proprietary character codes"
John Machin
+1: Used this and the other built-in EBCDIC codecs ('ibm037' and 'ibm039'). They work great. Very fast. Someone else defined them.
S.Lott
@S.Lott: He seems rather certain that he CAN'T use an existing Python codec, he's quoted a link to IBM codepage 1047, ...
John Machin
He doesn't mention existing codecs, only EBCDIC-to-ASCII converters. And if he really has proprietary characters, they don't exist in Latin-1 either. However, he does seem confused about which EBCDIC codepage he wants. There is a whole host of them, all of which have the full Latin-1 character set: http://en.wikipedia.org/wiki/EBCDIC_8859
Lennart Regebro
@Lennart: I agree re proprietary and confusion -- see my comment on the OP's question. BTW, you might want to take another look at your while loop; it works beautifully ... for empty files :-)
John Machin
OK, OK, I should stop being so lazy. I have now changed it to a tested version that also uses with (so you don't have to close the files explicitly), uses open() instead of file(), etc. :) Much nicer.
Lennart Regebro
Thanks, Lennart. That way I don't have to write a lookup table for all the 'normal' characters. I can focus on the annoying proprietary ones.
Eisen
But CP500 contains *all* Latin-1 characters. If you have proprietary characters, you can't convert them to Latin-1 either... I think it's rather more likely that you are using another EBCDIC variant. There are unfortunately a whole bunch, all of which cover Latin-1, except that the characters are moved around.
Lennart Regebro
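If a handful of byte values really are repurposed, one option (a sketch building on the translate-table idea above; the byte 0x04 and the replacement '#' are invented purely for illustration) is to patch a codec-derived table by hand:

# Start from the cp500 mapping, then override the bytes the patent
# office repurposed (the byte value and replacement here are made up).
table = [chr(i).decode('cp500').encode('latin1') for i in range(256)]
table[0x04] = '#'
XLATE = ''.join(table)

def ebd2latin1(ebcdic):
    return ebcdic.translate(XLATE)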
A: 

Assuming cp500 contains all of your "additional proprietary characters", a more concise version based on Lennart's answer using the codecs module:

import sys, codecs
BUFFER_SIZE = 64*1024

ebd_file = codecs.open(sys.argv[1], 'r', 'cp500')
latin1_file = codecs.open(sys.argv[2], 'w', 'latin1')

buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(buffer)
    buffer = ebd_file.read(BUFFER_SIZE)

ebd_file.close()
latin1_file.close()
mhawke
-1 He has files up to 2GB each in size. Your solution would require up to 4GB of virtual memory -- not generally available.
John Machin
@John You are right re: memory usage; however, I was more concerned with pointing out the use of the codecs module to simplify things than with the memory efficiency of the code. The vagueness of the "question" about which charset he actually means doesn't help, but the OP does demonstrate awareness of needing buffered IO.
mhawke
Simplicity means nothing when it chugs a fair way into the file and aborts with MemoryError. If you want to introduce a variation like the codecs module, show code that will *WORK* in the OP's problem space. His vagueness is irrelevant ... s/cp500/anyothersupportedencoding/, your code still doesn't work.
John Machin
@John, while correct on a practical level, your pedantry is annoying, and technically incorrect should there be sufficient memory available. I've edited the sample in line with your recommendations, I hope that 128KB isn't too taxing for the average system ;)
mhawke
A: 

Answer 1:

Yet another silly question: What gave you the impression that recode produces only ASCII as output? AFAICT it will transcode ANY of its repertoire of charsets to ANY other in its repertoire, AND its repertoire includes IBM cp500 and cp1047, and OF COURSE latin1. Reading the comments, you will note that Lennart and I have discovered that there aren't any "proprietary" codes in those two IBM character sets. So you may well be able to use recode after all, once you are certain what charset you've actually got.

Answer 2:

If you really need/want to transcode IBM cp1047 via Python, you might like to first get the mapping from an authoritative source, processing it via a script with some checks:

URL = "http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/glibc-IBM1047-2.1.2.ucm"
"""
Sample lines:
<U0000>  \x00 |0
<U0001>  \x01 |0
<U0002>  \x02 |0
<U0003>  \x03 |0
<U0004>  \x37 |0
<U0005>  \x2D |0
"""
import urllib, re
text = urllib.urlopen(URL).read()
regex = r"<U([0-9a-fA-F]{4})>\s+\\x([0-9a-fA-F]{2})\s"
results = re.findall(regex, text)
wlist = [None] * 256
for result in results:
    unum, inum = [int(x, 16) for x in result]
    assert wlist[inum] is None   # each EBCDIC byte mapped exactly once
    assert 0 <= unum <= 255      # every target must fit in Latin-1
    wlist[inum] = chr(unum)
assert not any(x is None for x in wlist)  # all 256 bytes covered
print repr(''.join(wlist))

Then carefully copy/paste the output into your transcoding script for use with Vinay's buffer.translate(the_mapping) idea, with a buffer size perhaps a bit larger than 16KB and certainly a bit smaller than 2GB :-)
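A sketch of that transcoding loop (Python 2; table stands for the 256-byte string printed by the script above):

def transcode(inname, outname, table, bufsize=1024 * 1024):
    # table: the 256-byte string printed by the harvesting script;
    # 1MB chunks are comfortably above 16KB and well below 2GB.
    infile = open(inname, 'rb')
    outfile = open(outname, 'wb')
    try:
        while True:
            chunk = infile.read(bufsize)
            if not chunk:
                break
            outfile.write(chunk.translate(table))
    finally:
        infile.close()
        outfile.close()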

John Machin
You're a pretty smart cookie, aren't you? Have you seen my data? Exactly. So how do you know there aren't any byte values in it that are not part of CP500? Oh, you don't know? That's what I thought. I was talking about proprietary as in proprietary, you know?
Eisen
@Eisen: I know that because CP500 defines *ALL* 256 code points. "proprietary character code" is terminology used to describe a situation where an encoding doesn't define all code points leaving some free for somebody to add in their own. You didn't point out in your question that *most* of your data consists of sub-files encoded in CP500 and the remainder is encoded in some other as-yet-undescribed fashion. The link that you gave was to CP1047 which is a DIFFERENT encoding to CP500. Sorry, we don't have crystal balls.
John Machin
@Eisen: if you had provided these details upfront, we wouldn't all be second guessing what your question and problem is.
mhawke
A: 

No crystal ball, no info from OP, so I had a bit of a rummage in the EPO website. Found freely downloadable weekly patent info files, still available in cp500/SGML even though the website says this was to be replaced by utf8/XML in 2006 :-). Got the 2009 week 27 file. It's a zip containing 2 files s350927[ab].bin. "bin" means "not XML". Got the spec! It looks possible that the "proprietary codes" are actually BINARY fields. Each record has a fixed 252-byte header. The first 5 bytes are the record length in EBCDIC, e.g. hex F0F2F2F0F8 -> 2208 bytes. The last 2 bytes of the fixed header are the BINARY length (redundant) of the following variable part. In the middle are several text fields, two 2-byte binary fields, and one 4-byte binary field. The binary fields are serial numbers within groups, but all I saw were 1. The variable part is SGML.

Example (last record from s350927b.bin):

Record number: 7266
pprint of header text and binary slices:
['EPB102055619         TXT00000001',
 1,
 '        20090701200927 08013627.8     EP20090528NN    ',
 1,
 1,
 '                                     T *lots of spaces snipped*']
Edited version of the rather long SGML:
<PATDOC FILE="08013627.8" CY=EP DNUM=2055619 KIND=B1 DATE=20090701 STATUS=N>
*snip*
<B541>DE<B542>Windschutzeinheit für ein Motorrad
<B541>EN<B542>Windshield unit for saddle-ride type vehicle
<B541>FR<B542>Unité pare-brise pour motocyclette</B540>
*snip*
</PATDOC>

There are no header or trailer records, just this one record format.

So: if the OP's annual files are anything like this, we might be able to help him out.

Update: Above was the "2 a.m. in my timezone" version. Here's a bit more info:

OP said: "at the beginning of each file there are a few digits in ASCII that tell you about the length of the file." ... translate that to "at the beginning of each record there are five digits in EBCDIC that tell you exactly the length of the record" and we have a (very fuzzy) match!
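A quick sanity check of that length field, using the example bytes from the record above (every EBCDIC variant keeps the digits at 0xF0-0xF9, so the built-in cp500 codec suffices):

assert int('\xF0\xF2\xF2\xF0\xF8'.decode('cp500')) == 2208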

Here is the URL of the documentation page: http://docs.epoline.org/ebd/info.htm
The FIRST file mentioned is the spec.

Here is the URL of the download-weekly-data page: http://ebd2.epoline.org/jsp/ebdst35.jsp

An observation: The data that I looked at is in the ST.35 series. Also available for download is ST.32, which appears to be a parallel version containing only the SGML content (in "reduced cp437/850", one tag per line). This indicates that the fields in the fixed-length header of the ST.35 records may not be very interesting and can thus be skipped over, which would greatly simplify the transcoding task.

For what it's worth, here is my (investigatory, written after midnight) code:
[Update 2: tidied up the code a little; no functionality changes]

from pprint import pprint as pp
import sys
from struct import unpack

HDRSZ = 252

T = '>s' # text
H = '>H' # binary 2 bytes
I = '>I' # binary 4 bytes
hdr_defn = [
    6, T,
    38, H,
    40, T,
    94, I,
    98, H,
    100, T,
    251, H, # length of following SGML text
    HDRSZ + 1
    ]
# above positions as per spec, reduce to allow for counting from 1
for i in xrange(0, len(hdr_defn), 2):
    hdr_defn[i] -= 1

def records(fname, output_encoding='latin1', debug=False):
    xlator = ''.join(chr(i).decode('cp500').encode(output_encoding, 'replace') for i in range(256))
    # print repr(xlator)
    def xlate(ebcdic):
        return ebcdic.translate(xlator)
        # return ebcdic.decode('cp500') # use this if unicode output desired
    f = open(fname, 'rb')
    recnum = -1
    while True:
        # get header
        buff = f.read(HDRSZ)
        if not buff:
            return # EOF
        recnum += 1
        if debug: print "\nrecnum", recnum
        assert len(buff) == HDRSZ
        recsz = int(xlate(buff[:5]))
        if debug: print "recsz", recsz
        # split remainder of header into text and binary pieces
        fields = []
        for i in xrange(0, len(hdr_defn) - 2, 2):
            ty = hdr_defn[i + 1]
            piece = buff[hdr_defn[i]:hdr_defn[i+2]]
            if ty == T:
                fields.append(xlate(piece))
            else:
                fields.append(unpack(ty, piece)[0])
        if debug: pp(fields)
        sgmlsz = fields.pop()
        if debug: print "sgmlsz: %d; expected: %d - %d = %d" % (sgmlsz, recsz, HDRSZ, recsz - HDRSZ)
        assert sgmlsz == recsz - HDRSZ
        # get sgml part
        sgml = f.read(sgmlsz)
        assert len(sgml) == sgmlsz
        sgml = xlate(sgml)
        if debug: print "sgml", sgml
        yield recnum, fields, sgml

if __name__ == "__main__":
    maxrecs = int(sys.argv[1]) # dumping out the last `maxrecs` records in the file
    fname = sys.argv[2]
    keep = [None] * maxrecs
    for recnum, fields, sgml in records(fname):
        # do something useful here
        keep[recnum % maxrecs] = (recnum, fields, sgml)
    keep.sort()
    for k in keep:
        if k:
            recnum, fields, sgml = k
            print
            print recnum
            pp(fields)
            print sgml
John Machin