[Pyrex] array.array vs long ints for bit vectors

Sun Oct 30 10:55:05 CET 2005

Nitin Madnani wrote:
> Hi Guys
> 
> I am planning to use a bit vector as a member of an extension type.

For a start, what operations do you want to perform on this "bit vector"?

Let's assume at the very least it should have get/set like a list, but 
contained objects are restrained to belong to set([0, 1]).

Maybe it should also support something like this:

three = Bitvector([1, 1,])
six = Bitvector([ 0, 1, 1])
seven = Bitvector([1, 1, 1])
two = Bitvector([0, 1])
assert three | six == seven
assert three & six == two

Which operations are required only inside the extension, and which 
operations need to be exposed to the user of the extension?

> I first thought that I can easily do this by just using a long int for  
> the vector but then I realized that I would like the bit vector to  grow 
> only to the size needed (which would be defined by an argument  to the 
> __init__ method of the type.

"grow" and "needed" are a little contradictory. Do you want an *exact* 
size, so that you can do this:

zero = Bitvector(size=20, contents=[0])
all_ones = ~zero

and know exactly how long the result will be?

What makes you think a long int would grow unconscionably longer than 
needed? How long is your longest map going to be? In any case, just 
think int, which morphs seamlessly into long as and when required. A 
20-bit map would fit into an int.

Another thing to worth contemplating: given the amount of going-over by 
Tim Peters that the long int code has had, if long ints waste more 
memory on leading zeroes than n bits [where n is < 32 and could be 
determined by inspecting the code], then the end of the world is nigh.

> So, then I remembered the 'array'  module 
> but then this would also store each bit as a byte (according  to the 
> python documentation). So, if I had a 20-bit map, 'array'  would take 
>  >20 bytes to store this bitmap (only the buffer would be  20 bytes) ?

"This" wouldn't be doing the storing. *You* would be [mis]using a byte 
array as a bit array.

> I have also read that a long int also comes with a bit of  baggage.

*Every* type necessarily has more than one bit of baggage. [Sorry; the 
cheer squad was hollerin' "Pun! Pun! Pun!"]

> 
> So, is there no other memory-efficient option other than to write an  
> extension myself?

What language would you write it in? C? If so, think how you would pack 
the bits into the elements of a C byte array, and then think how you 
could use the elements of a Python array.array('B', .....) instead. Then 
think how much easier it would be to write it in Pyrex.

> However, I am guessing whatever I come up with  would 
> be slower compared to the both long ints and arrays.

Not necessarily. Such guesses can be wildly wrong.

> So, should  I just 
> bite the bullet and live with the wasted memory ?

*If* you can live with it. How many bitvector instances will you have? 
How long will each be? How much memory will be wasted by which 
implementation? How much memory do you have, anyway? The cost ratio of 
(say) 256MB of memory to a contract developer's hourly rate is what?

> 
> 
> Thanks !
> Nitin
> 
> PS: I know this question is sort of out of place here but since I am  
> planning to use it in Pyrex, I wanted to ask it here. I also know  
> little about either long ints or the 'array' module, so please  correct 
> me if I am being completely naive.

Correct yourself; RTF source code. That's one of the big advantages of 
open source software. You get to stand on the shoulders of giants, like 
the afore-mentioned timbot.

Cheers,
John