For JSON formatting (unicode escape, e.g. \uabcd), I am using the following algorithm to truncate the string safely:
- Encode the Unicode string into the backslash-escape format it will eventually take in the JSON output
- Truncate to 3 bytes more than my final limit
- Use a regular expression to detect and chop off a partial encoding of a Unicode value (illustrated just below)
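The last step matters because a cut that lands inside an escape sequence leaves bytes that unicode_escape refuses to decode. A quick illustration at the interactive prompt (the katakana sample character is my own, not from the original):

>>> u'\u30c6'.encode('unicode_escape')   # one BMP char -> six escaped bytes
'\\u30c6'
>>> '\\u30c6'[:4]                        # a naive byte cut splits the escape
'\\u30'
>>> '\\u30'.decode('unicode_escape')     # raises UnicodeDecodeError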
So (in Python 2.5), with some_string and a requirement to cut to around 100 bytes:
import re

# Given some_string is a long unicode string with arbitrary Unicode data.
encoded_string = some_string.encode('unicode_escape')
# Cut at 100 + 3 bytes, then chop a trailing lone backslash or partial
# \uXXXX escape (the [^\\] guard skips escaped literal backslashes).
partial_string = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1', encoded_string[:103])
final_string = partial_string.decode('unicode_escape')
Now final_string is back in Unicode but guaranteed to fit within the JSON packet later. I truncated to 103 bytes because a purely-Unicode message would be 102 bytes encoded (17 six-byte \uXXXX escapes).
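A quick sanity check of the round trip (the test string here is my own invention, not from the original):

import re

some_string = u'\u30c6\u30b9\u30c8' * 10   # 30 BMP characters, 180 escaped bytes
encoded_string = some_string.encode('unicode_escape')
partial_string = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1',
                        encoded_string[:103])
final_string = partial_string.decode('unicode_escape')

assert len(partial_string) == 102          # 17 complete six-byte escapes survive
assert final_string == some_string[:17]    # a clean character-boundary prefix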
Disclaimer: Only tested on the Basic Multilingual Plane. Yeah yeah, I know.