I have code that does some very CPU-intensive string manipulations and I was looking for ways to improve performance.

(EDIT: I'm doing stuff like finding the longest common substring, running lots of regular expressions that might be better expressed as state machines in C, stripping comments from HTML, stuff like that.)

I am currently looking into porting some of the code to Cython after hearing many good things about it. However, it seems that the main focus of Cython is numerical calculations, and working with strings is barely documented.

Unicode might also be a big issue.

My questions are:

  1. Should I even bother with Cython for string stuff? Does anyone have experience with this type of processing in Cython that they can share?
  2. Am I missing something in the Cython docs? Does anyone know of a tutorial/reference/documentation about working with strings in Cython?
+5  A: 

Am I right in thinking it's currently implemented in CPython?

Have you tried profiling your code yet to identify exactly what is slow about it? The first step in trying to improve something should always be identifying what exactly is wrong :-)

Jon Cage
+4  A: 

I voted up the 'profile it' answer, but wanted to add this: where possible, the best optimisation you can make is to use Python standard libraries or built-in functions to perform the tasks you want. These are typically implemented in C and will provide performance broadly equivalent to any extension, including extensions written in Cython. If your algorithms are performing character-by-character loops in Python, those should be the first things to go, if possible.
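
For instance (my illustration, not part of the original answer), a hand-rolled character loop versus the built-in str.translate, which does the same scan in C:

def strip_chars_slow(s, unwanted):
    # Pure-Python loop: the interpreter touches every character.
    out = []
    for ch in s:
        if ch not in unwanted:
            out.append(ch)
    return ''.join(out)

# The built-in equivalent (Python 3 syntax): one pass at C speed.
_TABLE = str.maketrans('', '', '<>&')  # the characters to delete are illustrative

def strip_chars_fast(s):
    return s.translate(_TABLE)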

But if you have algorithms that can't be reworked in terms of built-ins or other existing standard libraries, Cython seems like a reasonable approach. It compiles Python-like code down to native C and is as suited to string operations as any other operation, really. But I'm not convinced you will see a great benefit from using Cython if you just hand it idiomatic Python code. The maximum benefit will come if you are able to rewrite some or all of each algorithm at the C level, so that low-level operations are not constantly translating variables across the Python/C barrier.
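
To make that concrete, here is a minimal Cython sketch (mine, not the answer's; the function is illustrative) of keeping a loop entirely below the Python/C barrier by typing its variables:

# count_spaces.pyx -- illustrative only. The typed loop compiles to a
# plain C for-loop; Python objects are only touched at entry and exit.
def count_spaces(bytes s):
    cdef char *p = s               # borrow the underlying byte buffer
    cdef Py_ssize_t i, n = len(s)
    cdef Py_ssize_t count = 0
    for i in range(n):
        if p[i] == c' ':           # c' ' is a Cython char literal
            count += 1
    return count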

Finally, Unicode - you've implied it might be 'a big issue' but haven't specified how you're using it. Cython will presumably produce C code that calls the relevant Python APIs that handle Unicode, so the functionality is unlikely to be limited. However, handling Unicode strings in C is non-trivial and may mean that the idea of rewriting some of your algorithms in C for better performance isn't worth the effort. A lot of classic string algorithms simply won't work on many Unicode encodings, which aren't 'strings' in the traditional sense of having one unit of storage per character.

Kylotan
+1: Good point on using the standard libraries.
Jon Cage
Regarding your last paragraph: this is exactly what I meant by saying it's a big issue...
itsadok
Well, you said it 'might' be a big issue. ;) But it is potentially important to note that some algorithms will work just fine, e.g. most of the read-only ones, or direct pattern matching and swapping. You may also find some utility in falling back to optimised ASCII-style algorithms where the data allows. A lot depends on the details.
Kylotan
"a lot of classic string algorithms won't work on many Unicode encodings"??? An encoding is the name of a method of converting between Unicode and a byte-oriented charset. If you mean that classic algorithms don't work well on charsets that have a varible number of bytes per character (like UTF-8, gb, big5): classic answer is "don't do that", use Unicode which is close enough to one storage unit per character for most people. Classic algorithms that depend on small alphabet size go pearshaped with thousand of characters of course, but you don't have to write new algos yourself.
John Machin
John, the terms are synonymous: if you're using the UTF-8 encoding, that's a charset obtained via the UTF-8 encoding method. Typical string algorithms will fail on strings using such encodings. You say "use Unicode which is close enough to one storage unit per character", but that's not possible as Unicode doesn't have an intrinsic representation - it's just a list of code points. Perhaps you mean UCS-4 or UTF-32 which is the nearest thing to a 1-to-1 encoding of those points.
Kylotan
Given the topic is about Python/Cython, of course I meant "unicode". A charset is not an encoding. The problems with using classic string algorithms on PRE-Unicode variable-byte-count charsets like gb and big5 (not "Unicode encodings") existed PRE-Unicode, and are irrelevant to the discussion about whether to use Cython or not.
John Machin
What you are calling charsets, many people call encodings, the same way that things that have been written can be called 'writings', even though writing is a verb. The topic was about Cython, but my reply - the one you are commenting on - is talking about the possibility of using C directly on the string data. And the problems with using typical C algorithms on the strings are not restricted to pre-Unicode charsets like gb and big5, but extend to any variable-length encoding, such as UCS-2, which is what a Python unicode object stores and exposes to the C API.
Kylotan
+1  A: 

This is a very interesting issue. Cython at its core is a tool to integrate Python with C data types. It doesn't supply any functionality to assist in dealing with strings, probably because there isn't as much demand for that as there is for specific NumPy functionality.

Having said that, you could well use Cython to interface with existing C/C++ libraries designed to handle the types of problems you describe. For processing HTML/XML you might want to look into libxml2, for example. However, there are (of course) ready-made Python bindings already available for just that. I've used lxml extensively for processing HTML and it does all I need, does it fast, and handles Unicode pretty well.
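
As a quick illustration of the comment-stripping task from the question (my example, not the answer's), lxml makes it a few lines:

from lxml import html

doc = html.fromstring('<p>kept<!-- dropped --><b>also kept</b></p>')
for comment in doc.xpath('//comment()'):   # all comment nodes in the tree
    parent = comment.getparent()
    if comment.tail:                       # preserve any text after the comment
        prev = comment.getprevious()
        if prev is not None:
            prev.tail = (prev.tail or '') + comment.tail
        else:
            parent.text = (parent.text or '') + comment.tail
    parent.remove(comment)
print(html.tostring(doc))                  # b'<p>kept<b>also kept</b></p>'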

In your case I would imagine a combination of lxml and custom-made C functions would be best. For example, you could "easily" make a fast function for finding longest common substrings in C, as that can be done at the byte level (recall that a string in C is just a char*, i.e. an array of bytes). Then you could map those back to Python (which Cython will make real easy for you) and carry on in unicode-abstracted heaven :). Certainly not trivial, but it may be worth the effort if your app's performance relies on it.
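
For what it's worth, here is a rough sketch of such a byte-level routine, written in Cython rather than raw C (the function and approach are my illustration, not from the answer): the classic dynamic-programming recurrence for longest common substring, rotating two C arrays row by row:

# lcs.pyx -- illustrative sketch only.
from libc.stdlib cimport malloc, free

def longest_common_substring(bytes a, bytes b):
    cdef char *pa = a
    cdef char *pb = b
    cdef Py_ssize_t na = len(a), nb = len(b)
    cdef Py_ssize_t i, j, best = 0, best_end = 0
    # prev[j] holds the length of the common suffix ending at b[j-1]
    cdef Py_ssize_t *prev = <Py_ssize_t *>malloc((nb + 1) * sizeof(Py_ssize_t))
    cdef Py_ssize_t *cur = <Py_ssize_t *>malloc((nb + 1) * sizeof(Py_ssize_t))
    cdef Py_ssize_t *tmp
    if prev == NULL or cur == NULL:
        free(prev)
        free(cur)
        raise MemoryError()
    for j in range(nb + 1):
        prev[j] = 0
    for i in range(1, na + 1):
        cur[0] = 0
        for j in range(1, nb + 1):
            if pa[i - 1] == pb[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best = cur[j]
                    best_end = i        # match ends at a[i-1]
            else:
                cur[j] = 0
        tmp = prev; prev = cur; cur = tmp
    free(prev)
    free(cur)
    return a[best_end - best:best_end]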

Then there are of course nice (albeit non-trivial) approaches to working with Unicode in C/C++. This article by Evan Jones may help you decide if it's worth the effort.

edvald
lxml is written in Cython. Doesn't seem to have any bother with string processing.
John Machin
A: 

Just for completeness: what I ended up doing was writing (some of) the string manipulation code in C.

As it turns out, it's ridiculously easy to get started writing C extensions for Python. Unicode strings are just arrays of Py_UNICODE, which is a 16- or 32-bit integer depending on the Python build.

I got a 20x improvement converting code like

s = re.sub(r' +', ' ', s)

to C. I got similar improvements with more complicated regexps, but the C code becomes crazy complex very quickly.
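
The actual C isn't shown here; as a rough idea of the kind of loop involved, this is an equivalent sketch in Cython, operating on plain bytes for brevity (the real code worked on Py_UNICODE arrays):

# squeeze.pyx -- illustrative sketch, not the poster's actual code.
# Collapses runs of spaces to a single space, like re.sub(' +', ' ', s).
def squeeze_spaces(bytes s):
    cdef char *src = s
    cdef Py_ssize_t i, n = len(s)
    cdef bytearray buf = bytearray(n)   # output is never longer than input
    cdef Py_ssize_t j = 0
    cdef bint prev_space = False
    for i in range(n):
        if src[i] == c' ':
            if not prev_space:
                buf[j] = c' '
                j += 1
            prev_space = True
        else:
            buf[j] = <unsigned char>src[i]
            j += 1
            prev_space = False
    return bytes(buf[:j])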

Overall, my throughput went up 20% after the rewrite. I am now looking for more things to rewrite...

itsadok
+2  A: 

"Ridiculously easy" is a very relative term. "Getting started" is just that. Writing robust extensions in C requires very careful attention to things like reference counting, memory allocation/freeing, and error handling. Cython does much of that for you.

A non-Unicode string in Cython is either a Python str object or an array of char, as in C. What Cython-specific documentation do you think you need?
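
A minimal illustration (mine) of that duality - the same Python string viewed as a C char array without copying:

def first_byte(bytes s):
    cdef char *p = s    # zero-copy view of the string's internal buffer
    return p[0]         # a plain C char, coerced back to a Python int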

I recommend that you try Cython for yourself. BUT before you do that, I strongly recommend that you examine your Python code for inefficiencies. Sometimes you can get big speedups ridiculously easily.

For example, compressing runs of space characters ... using

re.sub(' +', ' ', s) # one space in pattern

means that in the presumably not uncommon case where a run has a length of 1, it will be replacing a space with a space. If all the runs have length 1, it will create a new result string when it could easily just increment (or not decrement, or whatever) the reference count of the input string and pass that back.

re.sub('  +', ' ', s) # two spaces in pattern

produces exactly the same results and may run faster ... let's see:

All runs length 1: It runs at 3.4 times the speed. Not shown: the longer the input string, the better it gets.

\python26\python -mtimeit -s"s='now is the winter of our discontent'; import re; x = re.compile(' +').sub" "x(' ', s)"
100000 loops, best of 3: 8.26 usec per loop

\python26\python -mtimeit -s"s='now is the winter of our discontent'; import re; x = re.compile('  +').sub" "x(' ', s)"
100000 loops, best of 3: 2.41 usec per loop

With one run having length 2, the speed ratio is 2.5. With all runs having length 2, the speed ratio is 1.2. All things considered, not a bad return on an investment of 1 keystroke.

John Machin
Thanks for the advice! That was a really good point about the regex. Got any advice about the other stumbling block I hit with Cython - http://stackoverflow.com/questions/943658 ?
itsadok