[Pyrex] Speeding up custom string lowercasing with Pyrex

Wed Oct 31 14:49:29 CET 2007

Alexy Khrabrov wrote:
> On Oct 31, 2007, at 4:25 PM, Stefan Behnel wrote:
>> Are you using a unicode string or some 8-bit string encoding?
> 
> This is indeed an 8-bit encoding, Windows 1251 (cp1251), which we use to
> store huge corpora -- the convenience of Unicode does not outweigh
> doubling the memory size.
> 
> I also looked at the translate solution, but was wondering specifically
> how one would do this in Pyrex. A per-character list comprehension over
> a Python string that builds a new string character by character is slow
> in Python, so it looks like plugging in C would help.  Or should it be
> pure C, without Pyrex?

No, compared to C, Pyrex will keep things simple here. But you should avoid
string concatenation. Instead, for a fast-and-not-quite-dirty solution, copy
the input string and modify the memory buffer of the new string *in place*
before you pass it back. See

http://docs.python.org/api/stringObjects.html

for that, especially PyString_AS_STRING() and PyString_GET_SIZE() - but make
sure you *never* operate on Unicode strings in this case.

Also, as already mentioned, make sure you declare your char* and char
variables as such, so that Pyrex does not implicitly convert them to Python
objects.
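
As a rough illustration of the copy-then-modify-in-place idea, here is a
minimal sketch in plain Python 3, not Pyrex. It assumes a bytes object stands
in for the 8-bit string and a bytearray for the writable copy; the names
build_cp1251_lower_table and lower_cp1251 are hypothetical. The lookup table
is derived from Python's own cp1251 codec rather than written out by hand. In
Pyrex the same loop would run over a char* obtained from PyString_AS_STRING()
on a freshly created string.

```python
def build_cp1251_lower_table():
    """Return a 256-byte table mapping each cp1251 byte to its lowercase byte."""
    table = bytearray(range(256))  # identity mapping by default
    for b in range(256):
        try:
            lowered = bytes([b]).decode("cp1251").lower().encode("cp1251")
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue  # byte is undefined in cp1251: leave it unchanged
        if len(lowered) == 1:
            table[b] = lowered[0]
    return bytes(table)


LOWER = build_cp1251_lower_table()


def lower_cp1251(data):
    """Lowercase cp1251-encoded bytes without any string concatenation."""
    buf = bytearray(data)          # one copy of the input
    for i in range(len(buf)):      # modify the copy in place
        buf[i] = LOWER[buf[i]]
    return bytes(buf)


if __name__ == "__main__":
    raw = "ПРИВЕТ, Мир! ABC".encode("cp1251")
    print(lower_cp1251(raw).decode("cp1251"))
```

In pure Python the explicit loop is still slow; data.translate(LOWER) gives
the same result in a single call. The loop is spelled out only because it is
the part that would become a tight C loop once the variables are declared as
char* and char in Pyrex.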

Stefan