[Pyrex] Speeding up custom string lowercasing with Pyrex

Francesc Altet faltet at carabos.com
Wed Oct 31 15:17:15 CET 2007


On Wednesday 31 October 2007, Stefan Behnel wrote:
> Alexy Khrabrov wrote:
> > On Oct 31, 2007, at 4:25 PM, Stefan Behnel wrote:
> >> Are you using a unicode string or some 8-bit string encoding?
> >
> > This is indeed an 8-bit encoding, Windows-1251 (cp1251), which we
> > use to store huge corpora; the convenience of Unicode does not
> > justify doubling the memory footprint.
> >
> > I also looked at the translate solution, but was wondering
> > specifically how one would do this in Pyrex. A per-character list
> > comprehension over a Python string that builds a new string is
> > slow in Python, so plugging in C looks like it would help. Or
> > should it be pure C, without Pyrex?
>
> No, compared to C, Pyrex will keep things simple here. But you should
> avoid string concatenation. Instead, for a fast-and-not-quite-dirty
> solution, copy the input string and modify the memory buffer of the
> new string *in place* before you pass it back. See
>
> http://docs.python.org/api/stringObjects.html
>
> for that, especially PyString_AS_STRING() and PyString_GET_SIZE() -
> but make sure you *never* operate on Unicode strings in this case.
>
> Also, as said already, make sure you declare char* and char type
> variables as such, to avoid implicit conversion by Pyrex.

Here is a rough implementation of what Stefan describes:

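cdef extern from "Python.h":
    char *PyString_AsString(object string)
    object PyString_FromStringAndSize(char *v, int length)

def rulower(object s):
    cdef int i, size
    cdef unsigned char n
    cdef char *toc
    cdef object to

    # Make a real copy first: s[:] returns the very same string object
    # in CPython, so writing through it would modify the caller's string.
    size = len(s)
    to = PyString_FromStringAndSize(PyString_AsString(s), size)
    toc = PyString_AsString(to)  # get the pointer to the copy's buffer
    for i from 0 <= i < size:
        n = toc[i]
        # Russian caps (0xC0-0xDF) and ASCII A-Z (0x41-0x5A): lowercase is +32
        if (0xC0 <= n and n < 0xE0) or (0x41 <= n and n < 0x5B):
            n = n + 32
            toc[i] = n
        elif n == 0xA8:
            # cp1251 capital Yo (0xA8) lowercases to 0xB8, i.e. +16, not +32
            n = 0xB8
            toc[i] = n
    return to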

On my machine, this is around 500 times faster than your first try.
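
A 256-entry translation table gives the same mapping without any
compiled code and is handy for cross-checking the output of a compiled
version. The following is only a sketch, assuming Python 3 bytes objects;
the names build_cp1251_lower_table, CP1251_LOWER and lower_cp1251 are
hypothetical. Like the code above, it covers only ASCII and the Russian
alphabet, so other Cyrillic capitals that cp1251 stores below 0xC0 are
left untouched.

```python
def build_cp1251_lower_table():
    """Build a 256-byte table mapping each cp1251 byte to its lowercase form."""
    table = bytearray(range(256))      # start from the identity mapping
    for n in range(0x41, 0x5B):        # ASCII A-Z
        table[n] = n + 32
    for n in range(0xC0, 0xE0):        # Russian capitals in cp1251
        table[n] = n + 32
    table[0xA8] = 0xB8                 # capital Yo -> small yo (offset 16)
    return bytes(table)


CP1251_LOWER = build_cp1251_lower_table()


def lower_cp1251(data):
    """Lowercase cp1251-encoded bytes with a single table lookup per byte."""
    return data.translate(CP1251_LOWER)


# Cross-check against Python's own Unicode lowercasing for the Russian alphabet.
upper = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
result = lower_cp1251(upper.encode("cp1251"))
expected = upper.lower().encode("cp1251")
print(result == expected)
```

Since the table is built once and reused, the per-call cost is a single
pass over the data inside the interpreter's own C code.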

HTH,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"


