[Pyrex] Speeding up custom string lowercasing with Pyrex
Francesc Altet
faltet at carabos.com
Wed Oct 31 15:17:15 CET 2007
On Wednesday 31 October 2007, Stefan Behnel wrote:
> Alexy Khrabrov wrote:
> > On Oct 31, 2007, at 4:25 PM, Stefan Behnel wrote:
> >> Are you using a unicode string or some 8-bit string encoding?
> >
> > This is indeed an 8-bit encoding, Windows 1251 (cp1251), we use to
> > store huge corpora -- doubling memory size is not outweighed by the
> > convenience of Unicode.
> >
> > I also looked at the translate solution, but was wondering
> > specifically how one would do this in Pyrex -- instead of
> > per-character list comprehension on a Python string and building a
> > new one, which again is slow in Python, character by character,
> > looks like plugging in C would help. Or should it be pure C,
> > without Pyrex?
>
> No, compared to C, Pyrex will keep things simple here. But you should
> avoid string concatenation. Instead, for a fast-and-not-quite-dirty
> solution, copy the input string and modify the memory buffer of the
> new string *in place* before you pass it back. See
>
> http://docs.python.org/api/stringObjects.html
>
> for that, especially PyString_AS_STRING() and PyString_GET_SIZE() -
> but make sure you *never* operate on Unicode strings in this case.
>
> Also, as said already, make sure you declare char* and char type
> variables as such, to avoid implicit conversion by Pyrex.
Here is a raw implementation of what Stefan is saying:
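cdef extern from "Python.h":
    char *PyString_AsString(object string)
    object PyString_FromStringAndSize(char *v, int len)

def rulower(object s):
    cdef int i
    cdef int size
    cdef unsigned char n
    cdef char *sc
    cdef char *toc
    cdef object to
    size = len(s)
    # s[:] hands back s itself for a plain str, so allocate a new,
    # uninitialised string instead and fill it in below.
    to = PyString_FromStringAndSize(NULL, size)
    sc = PyString_AsString(s)     # pointer to the input buffer
    toc = PyString_AsString(to)   # pointer to the output buffer
    for i from 0 <= i < size:
        n = sc[i]
        # Russian caps in cp1251 (0xC0-0xDF) and ASCII caps (0x41-0x5A)
        if (0xC0 <= n and n < 0xE0) or (0x41 <= n and n < 0x5B):
            n = n + 32
        elif n == 0xA8:           # capital Yo; its lowercase is 0xB8
            n = 0xB8
        toc[i] = n
    return to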
On my machine, this is around 500 times faster than your first try.
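
If you want something to check the extension against, or if the translate approach you mentioned turns out to be fast enough, a pure-Python sketch of the same mapping might look like the following. It assumes Python 3 bytes objects holding cp1251 data, and the names make_cp1251_lower_table, CP1251_LOWER and rulower_ref are made up for illustration. It maps 0xA8 (capital Yo) to 0xB8, its lowercase in cp1251.

```python
def make_cp1251_lower_table():
    """Build a 256-byte table for bytes.translate (cp1251 lowercasing)."""
    table = bytearray(range(256))      # start from the identity mapping
    for n in range(0x41, 0x5B):        # ASCII A-Z
        table[n] = n + 32
    for n in range(0xC0, 0xE0):        # Cyrillic capitals 0xC0-0xDF
        table[n] = n + 32
    table[0xA8] = 0xB8                 # capital Yo -> small yo
    return bytes(table)


# Hypothetical module-level table, built once and reused for every call.
CP1251_LOWER = make_cp1251_lower_table()


def rulower_ref(s):
    """Return a lowercased copy of a cp1251-encoded bytes object."""
    return s.translate(CP1251_LOWER)
```

Building the table once and calling translate does the per-byte loop in C as well, so it may be a reasonable baseline to time against before committing to an extension module.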
HTH,
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"