[Pyrex] [Cython] intern(str) fails if string is not a C string

Stefan Behnel stefan_ml at behnel.de
Fri Oct 23 18:25:01 CEST 2009


John Arbash Meinel wrote:
>> I recommend using a dedicated dict instead, and put your byte strings
>> there. This will not change the performance in any way, given that intern()
>> on a char* has always been creating a Python byte string before interning
>> (and possibly dropping) it. But it will make it clearer in the code what is
>> actually happening.
> 
> So I can't intern() a char* because it has NULLs in the array.

You have to build the Python byte string manually with
PyString_FromStringAndSize(). That string can then be interned in Python 2.
(The intern() failure itself will be fixed in Cython 0.12.)


> I don't want to use a dedicated dict, because then the strings become
> immortal.

Except that a dedicated dict lets you control whether the strings /really/
become immortal or not. Once a string is interned in CPython, there is no
way to get it out of the dict of interned strings. Your own dict is under
your control.


> I do understand that interning in python is really meant for internal
> use. Because attributes, etc are all managed via py strings (becoming
> Unicode in Py3), and thus lookups in dicts, etc are better if you intern
> everything.

Interning is not required. Any dict will work just fine, as long as you
make sure that the strings you use come from that dict.


> However, there is no way to implement de-duping without immortality in
> python

Unless you have a way of keeping track of the usage of a value. Depending
on your use case, it might work to just clear (and maybe rebuild) the whole
dict when it reaches a given size, after the 1,000,000th insertion, when
memory gets tight, or whatever.
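
The "clear it at a given size" variant could look roughly like this. It is
only a sketch: the class name `BoundedDedup` and the `max_size` default are
assumptions for the example, not something from existing code.

```python
class BoundedDedup(object):
    """De-duplicating table that is simply emptied once it grows too large."""

    def __init__(self, max_size=1000000):
        self.max_size = max_size  # hypothetical threshold, tune per use case
        self._table = {}

    def __len__(self):
        return len(self._table)

    def dedup(self, s):
        """Return the canonical copy of s, registering s if it is new."""
        try:
            return self._table[s]
        except KeyError:
            pass
        if len(self._table) >= self.max_size:
            # Drop every entry. Strings still referenced elsewhere stay
            # alive; unreferenced ones can now be freed.
            self._table.clear()
        self._table[s] = s
        return s
```

The price is that identity is only guaranteed between two clears: a string
handed out before a clear may get a second, equal copy afterwards. If you
only de-dup to save memory, that is harmless. If you rely on `is`
comparisons, it is not.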


> other than something like weakrefs (which strings and tuples
> don't support, and really exacerbates the memory problems w/ interning,

Plus, weakrefs are pretty slow. I did a little benchmarking in lxml lately
to find out if a cached object reference (that I had added for performance
reasons, but that introduced a cyclic reference) could be replaced by a
weak reference. It turned out that it was actually faster to just recreate
the object than to keep a weak reference to a live object. So I just
dropped the cached reference and with it all sorts of memory issues that
were due to requiring a GC cleanup run.

Stefan



