[Pyrex] Speeding up custom string lowercasing with Pyrex

Alexy Khrabrov deliverable at gmail.com
Wed Oct 31 14:02:24 CET 2007


Greetings -- I'm counting Russian words in a Cyrillic encoding, and
found that lowercasing them in Python slows the program down 5-10
times.  So I rewrote the s.lower() call as my own rulower()
Pyrex function.  Since this is my first Pyrex function, I'd
appreciate feedback from the list on whether I'm getting all the speedup I
can.  The call site looks like:

				ngram = ' '.join(words[i:(i+self.opts.n)]) # mashed
				if self.opts.lower:
					# x = ngram.lower() # TODO does nothin'!
					x = rulower(ngram)
					#print "[%s]" % x
				else:
					x = ngram
				try:
					self.ngrams[x] += 1
				except:
					self.new_words += 1
					self.the_words.append(x)


The self.opts.lower flag is set by a command-line option.  When it is not
set, and no lowercasing is done, the execution time was
just above 1 minute on a 15 million word corpus.  When lowercasing
*is* done, there is a choice of either a pure-Python or a Pyrex rulower()
function to do the custom lowercasing.

The pure Python lowercase function looked like,

def rulower(s): # pure Python
	to = ""
	for c in s:
		n = ord(c)
		# Russian Caps: 0xc0-0xdf
		if n in xrange(0xC0, 0xE0) or n == 0xA8:
			n += 32
			c = chr(n)
		to += c
	return to

-- execution time was 15 minutes.
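
For comparison, a table-driven variant is sketched below. It is not the code that was timed above, and no timing for it is reported here. It assumes the corpus is CP1251-encoded byte strings, which the 0xC0-0xDF range for capitals suggests but which is not stated. Under that assumption Ё (0xA8) maps to ё at 0xB8, an offset of 16 rather than 32, so the table handles it separately. The sketch also lowercases ASCII A-Z. The names build_lower_table and LOWER_TABLE are made up for illustration.

```python
# Sketch: lowercase CP1251-encoded byte strings with a precomputed
# 256-entry table, so the per-character work happens inside translate().
# Assumption: input is CP1251 bytes (not stated in the original post).

def build_lower_table():
    table = bytearray(range(256))      # identity mapping to start with
    for n in range(0x41, 0x5B):        # ASCII A-Z -> a-z
        table[n] = n + 32
    for n in range(0xC0, 0xE0):        # CP1251 capitals 0xC0-0xDF -> 0xE0-0xFF
        table[n] = n + 32
    table[0xA8] = 0xB8                 # CP1251 YO: capital 0xA8 -> small 0xB8
    return bytes(table)

LOWER_TABLE = build_lower_table()      # built once, reused for every call

def rulower(s):
    # s is a byte string; translate() maps each byte through the table
    return s.translate(LOWER_TABLE)
```

The idea is to replace the per-character ord()/chr() calls and the repeated string concatenation with one call per n-gram; whether that closes the gap to the no-lowercasing run would have to be measured.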

The Pyrex lowercase function looked like,

def rulower(char *s):
	to = ""
	for c in s:
		n = ord(c)
		# Russian Caps: 0xc0-0xdf
		if (0xC0 <= n and n < 0xE0) or (n == 0xA8) or (0x41 <= n and n < 0x5B):
			n = n + 32
			c = chr(n)
		#else:
		#	pass
		to = to + c
	return to

-- I had to replace x += y by x = x + y as Pyrex didn't like +=.

-- execution time was 8 minutes.  Still 8 times slower than the
no-lowercasing run.

Thus, I got a speedup of about 2 over pure Python.  I wonder whether the rest
-- the 8 times slowdown w.r.t. the no-lowercasing run -- is the consequence
of creating a new x object?!  Or perhaps I could optimize the Pyrex
version -- e.g., copy the source s into the target 'to', and just walk over
the chars by index position?  I also saw references on the list to
direct pointer arithmetic -- does it make sense here to write a C
function, and how would I integrate it with Pyrex or directly?
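
To make the "walk over chars by index position" idea concrete, here is a minimal sketch in plain Python rather than Pyrex or C. It copies the input into a mutable buffer once and rewrites bytes in place, which is roughly the shape a C loop over a char buffer would take. It assumes CP1251 input, where Ё is 0xA8 and ё is 0xB8; the name rulower_indexed is made up for illustration.

```python
# Sketch: copy the source once into a mutable buffer, then walk it by
# index and overwrite uppercase bytes in place.
# Assumption: input is a CP1251-encoded byte string.

def rulower_indexed(s):
    buf = bytearray(s)                 # one mutable copy of the input
    for i in range(len(buf)):
        n = buf[i]                     # integer byte value, no ord() needed
        if 0x41 <= n < 0x5B or 0xC0 <= n < 0xE0:
            buf[i] = n + 32            # ASCII A-Z and CP1251 capitals
        elif n == 0xA8:
            buf[i] = 0xB8              # CP1251 YO sits 16 below its small form
    return bytes(buf)                  # one new string object per call
```

This version allocates one buffer and one result per call instead of a new string per character. The loop itself still runs in the interpreter, so any real gain would come from moving the same loop onto typed C variables.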

Cheers,
Alexy


