[Pyrex] faster in / output from objects [long post + code!]

Robert Bradshaw robertwb at math.washington.edu
Thu Jan 24 21:10:17 CET 2008


On Jan 24, 2008, at 4:24 AM, Martijn Meijers wrote:

> Dear list members,
>
> Currently I'm working in the geo-informatics field and I'm doing
> research on storage of vector data in a DBMS. For this my programming
> language of choice is Python. Although there are some vector libraries
> in C with Python bindings, I feel that those are not really comfortable
> to work with (due to their API). Therefore, I decided to roll my own
> library for educational and research purposes and I'm using Cython for
> this purpose (as I'm not really proficient in C or C++, and I'm not
> really willing to go that route, as it involves quite a steep learning
> curve).

Sounds like a good choice.

> Below, you'll find my library that I created. Creation of objects is
> fairly fast, compared with the C-lib-with-python bindings that I have
> available for comparison (my approach is around 1.5 times faster with
> object creation). However, I'm stuck with in/output of my objects: Two
> formats I'd like to support: a text based format and a binary format.
> Here, I have the feeling I don't understand how I can use Cython to push
> the throughput to the limits. My approach (with Visitors) is fairly
> slow. As I understand it, Cython is more geared towards (mathematical)
> computations than to text processing...

Our Sage branch of Pyrex used to be called SageX, and we were all  
surprised after the first year how little our improvements were  
specific to the mathematics infrastructure we were supporting.  
However, it is true that the Python/C API doesn't make it easy to
naively do fast string processing without having to think about the
underlying string representation.

>
> I'd like to know some things about my code:
> (a) Did I do things the right way, or can the code be optimized more
> (while staying in Cython)?

Lots.

I didn't read all of your code, but here are some things that jumped
out at me:

1) Use a more object-oriented style (this should clean up code as  
well as optimizing). E.g.

def is_empty(Geometry geom):
     if geom.type == __POINT:
         return False # Point cannot be empty, at the moment
     elif geom.type == __LINESTRING:
         return num_points(geom) == 0
     elif geom.type == __POLYGON:
         return num_rings(geom) == 0

would be better as a method of Point, LineString, and Polygon rather
than branching on geom.type.

2) Store just the actual data, rather than a list of Python objects
wrapping the data. E.g. in LineString, rather than points being a
Python list, let it be a C array of Coordinate structs. Only
construct the Point class for __getitem__ or other methods that
expose it to the outside.

3) You're using def functions all over the place; consider using more
cdef (or cpdef) functions.
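
As a rough sketch of points 1) and 2), here is how the layout might look.
It is written in plain Python for illustration, with the standard-library
array module standing in for the C array of Coordinate structs that a
Cython cdef class would hold. All class and attribute names below are
assumptions, not the actual library code.

```python
from array import array


class Point:
    """Thin wrapper, built only when a caller asks for a single vertex."""
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def is_empty(self):
        # A point cannot be empty, at the moment.
        return False


class LineString:
    """Keeps coordinates in one flat buffer of doubles: x0, y0, x1, y1, ...

    Hypothetical stand-in for a C array of Coordinate structs.
    """

    def __init__(self, coords=()):
        self._xy = array("d")
        for x, y in coords:
            self._xy.append(x)
            self._xy.append(y)

    def __len__(self):
        # Two doubles per point.
        return len(self._xy) // 2

    def __getitem__(self, i):
        n = len(self)
        if i < 0:
            i += n
        if not 0 <= i < n:
            raise IndexError("point index out of range")
        # The Point object is created here, on demand, and nowhere else.
        return Point(self._xy[2 * i], self._xy[2 * i + 1])

    def is_empty(self):
        return len(self) == 0


class Polygon:
    """Holds a list of rings, each stored as a LineString."""

    def __init__(self, rings=()):
        self.rings = [LineString(ring) for ring in rings]

    def is_empty(self):
        return len(self.rings) == 0
```

Each class answers is_empty() for itself, so callers no longer branch on a
type tag, and a LineString with many points holds one buffer instead of
one Python object per point.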

> (b) Is it possible to speed up the in- and output of text and binary
> formats (here a lot of Python functions are still used, but I can't seem
> to find examples of how to do text/binary stream processing with
> Cython)...?

See above, especially (3).

If one's writing to a file, one can access the C FILE* pointer and
operate on that directly. I notice you keep converting back and forth  
between strings and streams--this has got to be expensive.

I had to write something that is very similar to what you're doing  
(but in 3d) and the fastest way I found was to output a (possibly)  
nested list of strings, which are then joined at the very end. See

      http://www.sagemath.org/hg/sage-main/file/a66354d13708/sage/plot/plot3d/index_face_set.pyx

specifically [tachyon | obj | jmol]_repr(). This is passed to an
extremely optimized "flatten_list" command at the end of

      http://www.sagemath.org/hg/sage-main/file/a66354d13708/sage/plot/plot3d/base.pyx

Also relevant is

      http://www.sagemath.org/hg/sage-main/file/a66354d13708/sage/plot/plot3d/point_c.pxi

(Note: your code doesn't need to be nearly as tightly written, or to use
the Python/C API directly, to take advantage of the ideas illustrated.)
There have been several requests about this streaming/fast IO, but no
examples of using buffers/stringio in Cython directly, so I hope the
above is useful to lots of people.
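
The "nested list of strings, joined at the very end" idea can be sketched
in plain Python as follows. This is only an illustration under stated
assumptions: the text format and every function name here are made up,
and flatten() is a simple recursive stand-in, not the optimized
flatten_list from the Sage sources linked above.

```python
def flatten(nested, out=None):
    """Collect every string from an arbitrarily nested list, in order."""
    if out is None:
        out = []
    for item in nested:
        if isinstance(item, list):
            flatten(item, out)
        else:
            out.append(item)
    return out


def point_lines(points):
    # One string per point; nothing is concatenated yet.
    return ["%r %r\n" % (x, y) for x, y in points]


def linestring_pieces(points):
    # Returns a nested list of pieces rather than a finished string.
    return ["LINESTRING %d\n" % len(points), point_lines(points)]


def polygon_pieces(rings):
    # Nests the pieces of each ring without copying any text.
    return ["POLYGON %d\n" % len(rings),
            [linestring_pieces(ring) for ring in rings]]


def to_text(pieces):
    # The only place where strings are actually joined.
    return "".join(flatten(pieces))
```

For example, to_text(polygon_pieces(rings)) builds the whole output with a
single join, instead of converting between strings and streams at every
level of the geometry.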

- Robert


>
> Thanks very much for your time and advice in advance!
>
> Kind regards,
>
> Martijn Meijers
> Delft University of Technology, The Netherlands
> OTB, Section GIS-technology
>

...