[Pyrex] Speeding up custom string lowercasing with Pyrex
Alexy Khrabrov
deliverable at gmail.com
Wed Oct 31 14:02:24 CET 2007
Greetings -- I'm counting Russian words, encoded in Cyrillic, and
found that lowercasing them in Python slows the program down 5-10
times. So I rewrote the s.lower() call as my own rulower()
Pyrex function. Since this is my first Pyrex function, I'd
appreciate feedback from the list on whether I'm getting all the
speedup I can. The call site looks like:
    ngram = ' '.join(words[i:(i+self.opts.n)]) # mashed
    if self.opts.lower:
        # x = ngram.lower() # TODO does nothin'!
        x = rulower(ngram)
        #print "[%s]" % x
    else:
        x = ngram
    try:
        self.ngrams[x] += 1
    except:
        self.new_words += 1
        self.the_words.append(x)
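On the commented-out ngram.lower() line: for 8-bit strings, lower() is locale-dependent, so in the default C locale it changes only ASCII letters, which would explain why it seemed to do nothing on Cyrillic bytes. Below is a minimal sketch of one alternative, going through Unicode. It assumes the corpus is encoded in Windows-1251 (a guess from the 0xC0-0xDF range and the 0xA8 byte used in this post; the encoding is not stated) and is written for Python 3, where bytes and text are separate types. The name lower_cp1251 is hypothetical.

```python
def lower_cp1251(raw):
    """Lowercase a byte string, ASSUMING it is Windows-1251 encoded."""
    text = raw.decode("cp1251")           # bytes -> Unicode text
    return text.lower().encode("cp1251")  # lowercase, then back to bytes


# b"\xcf\xd0\xc8\xc2\xc5\xd2" is the word "PRIVET" in Cyrillic capitals (cp1251).
sample = b"\xcf\xd0\xc8\xc2\xc5\xd2 ABC"
print(lower_cp1251(sample))
```

Decoding has its own cost, so this is more of a correctness baseline than a speed fix.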
The self.opts.lower flag is set by a command-line option. When it is
not set, and no lowercasing is done, the execution time is
just above 1 minute on a 15-million-word corpus. When lowercasing
*is* done, there is a choice of either a pure-Python or a Pyrex
rulower() function to do the custom lowercasing.
The pure-Python lowercase function looked like:
    def rulower(s): # pure Python
        to = ""
        for c in s:
            n = ord(c)
            # Russian Caps: 0xc0-0xdf
            if n in xrange(0xC0, 0xE0):
                c = chr(n + 32)
            elif n == 0xA8:
                # assuming Windows-1251: capital Yo (0xA8) maps to 0xB8, i.e. +16, not +32
                c = chr(n + 16)
            to += c
        return to
-- execution time was 15 minutes.
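Much of the pure-Python cost is likely per-character interpreter work: ord(), chr(), a membership test against xrange (which in Python 2 scans the range item by item), and repeated string concatenation. One way to avoid all of that without leaving Python is a 256-entry translation table, so the per-byte loop runs inside the interpreter's C code. A minimal sketch follows, assuming Windows-1251 input (not stated in the post) and Python 3 bytes; in Python 2 the rough equivalents are string.maketrans and str.translate. The names build_lower_table, LOWER_TABLE and rulower_table are hypothetical, and the table also lowercases ASCII, as the Pyrex version does.

```python
def build_lower_table():
    """Build a 256-byte table mapping each byte to its lowercase form."""
    table = bytearray(range(256))   # start from the identity mapping
    for n in range(0xC0, 0xE0):     # Cyrillic capitals A..Ya (assumed cp1251)
        table[n] = n + 32
    table[0xA8] = 0xB8              # capital Yo -> small yo (assumed cp1251)
    for n in range(0x41, 0x5B):     # ASCII A..Z
        table[n] = n + 32
    return bytes(table)


LOWER_TABLE = build_lower_table()   # built once, reused for every n-gram


def rulower_table(s):
    """Lowercase a byte string with a single table lookup per byte."""
    return s.translate(LOWER_TABLE)


print(rulower_table(b"\xcf\xd0\xc8\xc2\xc5\xd2 ABC"))
```

No timing is claimed for this sketch; it would need to be measured on the same corpus.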
The Pyrex lowercase function looked like:
    def rulower(char *s):
        to = ""
        for c in s:
            n = ord(c)
            # Russian Caps: 0xc0-0xdf; ASCII caps: 0x41-0x5a
            if (0xC0 <= n and n < 0xE0) or (0x41 <= n and n < 0x5B):
                n = n + 32
                c = chr(n)
            elif n == 0xA8:
                # assuming Windows-1251: capital Yo (0xA8) maps to 0xB8, i.e. +16, not +32
                c = chr(n + 16)
            to = to + c
        return to
-- I had to replace x += y with x = x + y, as Pyrex didn't accept +=.
-- execution time was 8 minutes, still 8 times slower than the
no-lowercasing run.
Thus, I got a speedup of 2 over pure Python. I wonder whether the rest
-- the 8-times slowdown w.r.t. the no-lowercasing run -- is the
consequence of creating a new x object?! Or perhaps I could optimize
the Pyrex version -- e.g., copy the source s into the target 'to', and
just walk over the chars by index position? I also saw references on
the list to direct pointer arithmetic -- does it make sense here to
write a C function, and how would I integrate it with Pyrex or call it
directly?
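The index-walking idea can be tried in plain Python before dropping to C: copy the input into a mutable buffer once, then overwrite bytes in place, so no intermediate strings are created per character. This is only a sketch of the loop's shape, assuming Windows-1251 input and Python 3; rulower_inplace is a hypothetical name. The loop still runs in the interpreter, so any real gain would be expected from the same loop written in C or in Pyrex with typed variables, which this sketch does not show.

```python
def rulower_inplace(s):
    """Lowercase a byte string by mutating a single copy, byte by byte."""
    buf = bytearray(s)              # one mutable copy of the input
    for i in range(len(buf)):
        n = buf[i]
        # Cyrillic capitals (assumed cp1251) and ASCII capitals: +32
        if 0xC0 <= n < 0xE0 or 0x41 <= n < 0x5B:
            buf[i] = n + 32
        elif n == 0xA8:             # capital Yo -> small yo (assumed cp1251)
            buf[i] = 0xB8
    return bytes(buf)               # one final immutable result


print(rulower_inplace(b"\xcf\xd0\xc8\xc2\xc5\xd2 ABC"))
```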
Cheers,
Alexy