We are doing Natural Language Processing on a range of English-language documents (mainly scientific) and have run into problems carrying non-ANSI characters through the various components. The documents may be "ASCII", UNICODE, PDF, or HTML. We cannot predict at this stage which tools will be in our chain or whether they will allow character encodings other than ANSI. Even ISO-Latin characters expressed in UNICODE will give problems (e.g. displaying incorrectly in browsers). We are likely to encounter a range of symbols, including mathematical and Greek. We would like to "flatten" these into a text string which will survive multistep processing (including XML and regex tools) and then possibly reconstitute them in the last step (although it is the semantics rather than the typography we are concerned with, so this is a minor concern).

I appreciate that there is no absolute answer - any escaping can clash in some cases - but I am looking for something along the lines of XML's <![CDATA[ ...]]>, which will survive most non-recursive XML operations. Characters such as [ are bad, as they are common in regexes. So I'm wondering if there is a generally adopted approach rather than inventing our own.

A typical example is the "degrees" symbol:

HTML Entity (decimal)            &#176;
HTML Entity (hex)                &#xb0;
HTML Entity (named)              &deg;
Microsoft Windows keyboard entry Alt +00B0, Alt 0176, Alt 248
UTF-8 (hex)                      0xC2 0xB0 (c2b0)
UTF-8 (binary)                   11000010:10110000
UTF-16 (hex)                     0x00B0 (00b0)
UTF-16 (decimal)                 176
UTF-32 (hex)                     0x000000B0 (00b0)
UTF-32 (decimal)                 176
C/C++/Java source code           "\u00B0"
Python source code               u"\u00B0"

We are also likely to encounter TeX

$10\,^{\circ}{\rm C}$

or

\degree

so backslashes, curlies and dollars are a poor idea.

We could for example use markup like:

__deg__
__#176__

and this would probably work, but I'd appreciate advice from those who have had similar problems.
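To make the idea concrete, here is a minimal sketch of what such a flatten/reconstitute round trip could look like (Python 2, to match the u"\u00B0" notation above; flatten and reconstitute are just placeholder names):

import re

def flatten(text):
    # replace every non-ASCII character with an ASCII-only __#NNN__ marker
    # (as noted above, a literal __#NNN__ already present in the text would clash)
    return u"".join(c if ord(c) < 128 else u"__#%d__" % ord(c) for c in text)

def reconstitute(text):
    # turn the __#NNN__ markers back into the original characters
    return re.sub(r"__#(\d+)__", lambda m: unichr(int(m.group(1))), text)

s = u"10\u00b0C"                  # 10 degrees Celsius
flat = flatten(s)                 # u'10__#176__C' - plain ASCII, regex- and XML-safe
assert reconstitute(flat) == s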

Update: I accept @MichaelB's insistence that we use UTF-8 throughout. I am worried that some of our tools may not conform, and if so I'll revisit this. Note that my original question is not well worded - read his answer and the link in it.

+1  A: 

Maybe I don't get the problem correctly, but I would create a unique escape marker which is very unlikely to be touched, and then use it to enclose the entity encoded as a base32 string.

If needed, you can transmit the unique markers and their count along the chain through a separate channel, and check their presence and number at the end.

Example, something like

the value of the temperature was 18 cd48d8c50d7f40aeb6a164181b17feee EZSGKZY= cd48d8c50d7f40aeb6a164181b17feee

Here the marker is a uuid, and the entity is &deg encoded in base32. You then pass along the marker cd48d8c50d7f40aeb6a164181b17feee. It cannot be corrupted (if it gets corrupted, your filters will probably corrupt anything made of letters and numbers anyway, but at least you can exclude damaged markers because they are fixed length), and you can always recover the content by looking between the two markers.

Of course, if you have uuids in your documents, this could represent a problem, but since you are not transmitting them as authorized markers along the side channel, they won't be recognized as such (and in any case, what's in between won't validate as a base32 string anyway).
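For illustration, a rough sketch of the wrapping and unwrapping steps described above (Python 2; wrap and unwrap are made-up helper names):

import base64
import uuid

MARKER = uuid.uuid4().hex          # e.g. 'cd48d8c50d7f40aeb6a164181b17feee'

def wrap(entity):
    # fence the base32-encoded entity between two copies of the marker
    return "%s %s %s" % (MARKER, base64.b32encode(entity), MARKER)

def unwrap(fragment):
    # recover the entity from between the two markers
    return base64.b32decode(fragment.split(MARKER)[1].strip())

token = wrap("&deg")               # 'cd48d8c5... EZSGKZY= cd48d8c5...'
unwrap(token)                      # '&deg'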

If you need to search for them, you can keep the dashed uuid form and use a proper regexp to spot the occurrences. Example:

>>> import re
>>> s = "the value of the temperature was 18 6d378205-1265-44e4-80b8-a47d1ceaad51 EZSGKZY= 6d378205-1265-44e4-80b8-a47d1ceaad51"
>>> re.search(r"(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})(.*?)(\1)", s)
<_sre.SRE_Match object at 0x1003d31f8>
>>> _.groups()
('6d378205-1265-44e4-80b8-a47d1ceaad51', ' EZSGKZY= ', '6d378205-1265-44e4-80b8-a47d1ceaad51')
>>>

If you really need a specific "token" to test for, you can use a uuid1 with an explicitly specified node:

>>> import uuid
>>> uuid.uuid1(node=0x1234567890)  
UUID('bdcce554-e95d-11de-bd0f-001234567890')
>>> uuid.uuid1(node=0x1234567890)  
UUID('c4c57a91-e95d-11de-90ca-001234567890')
>>>

You can use anything you prefer as the node; the uuid will still be unique, but you can test for its presence (although you can get false positives).
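A small sketch of what such a presence test could look like (illustrative only; the node value is the one from the session above):

import uuid

NODE = 0x1234567890

def looks_like_our_marker(candidate):
    # a uuid1 carries its node as the last 12 hex digits,
    # so every marker generated with node=NODE ends with '001234567890'
    return candidate.endswith("%012x" % NODE)

looks_like_our_marker(str(uuid.uuid1(node=NODE)))   # True
looks_like_our_marker(str(uuid.uuid1()))            # almost certainly False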

Stefano Borini
You do understand the problem precisely and your solution is logical. I don't mind the length, and we could put symbolic constants in the code. However there is no clear lexical indication of where these entities start and end - you can't, for example, search for them.
peter.murray.rust
+3  A: 
  • Get someone to do this who really understands character encodings. It looks like you don't, because you're not using the terminology correctly. Alternatively, read this.
  • Do not brew up your own escape scheme - it will cause you more problems than it will solve. Instead, normalize the various source encodings to UTF-8 (which is really just one such escape scheme, except efficient and standardized) and handle character encodings correctly; a rough sketch of that normalization step follows after this list. Perhaps use UTF-7 if you're really that scared of high bits.
  • In this day and age, not handling character encodings correctly is not acceptable. If a tool doesn't, abandon it - it is most likely very bad quality code in many other ways as well and not worth the hassle of using.
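A minimal sketch of that normalization step (Python 2; it assumes you know, or can detect, each document's source encoding - to_utf8 is just an illustrative name):

def to_utf8(raw_bytes, source_encoding):
    # decode from whatever the source claims to be, then re-encode as UTF-8;
    # 'replace' keeps the pipeline running on the odd bad byte,
    # use 'strict' if you would rather fail loudly
    return raw_bytes.decode(source_encoding, "replace").encode("utf-8")

to_utf8("10\xb0C", "latin-1")      # '10\xc2\xb0C' - the degree sign in UTF-8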
Michael Borgwardt
My last comment disappeared... Could you please indicate where I am using terminology incorrectly and I will try to learn.
peter.murray.rust
You're mixing up a standard defining abstract characters (Unicode), concrete encodings (ASCII, ISO-Latin) and file formats (PDF, HTML, which support arbitrary encodings). You're using the unqualified term "ANSI", which has various conflicting meanings. You say "ISO-Latin characters expressed in UNICODE", which is completely backwards. Do read the article I linked to carefully; it should make things clearer.
Michael Borgwardt