Discussion:
[Numpy-discussion] Cythonizing some of NumPy
David Cournapeau
2015-08-30 21:44:39 UTC
Hi there,

Reading Nathaniel's summary from the NumPy dev meeting, it looks like there
is a consensus on using Cython in NumPy for the Python-C interfaces.

This has been on my radar for a long time: it was one of my reasons for
splitting multiarray into multiple "independent" .c files half a decade
ago. I took the opportunity of the EuroSciPy sprints to look into this again,
but before going further, I'd like to make sure I am not going
astray:

1. The transition has to be gradual.
2. The obvious way I can think of to allow Cython in multiarray is to
modify multiarray so that Cython "owns" the PyMODINIT_FUNC and the
module PyModuleDef table.
3. We start using Cython for the parts that are mostly menial refcount
work. Functions in calculation.c are obvious candidates.

Step 2 should not be disruptive, and does not look like a lot of work:
there are < 60 methods in the table, and most of them should be fairly
straightforward to cythonize. At worst, we could keep them as is
outside Cython and just "export" them in Cython.

Does that sound like an acceptable plan?

If so, I will start working on a PR for step 2.

David
Nathaniel Smith
2015-09-01 07:16:17 UTC
Post by David Cournapeau
1. The transition has to be gradual
Yes, definitely.
Post by David Cournapeau
2. The obvious way I can think of to allow Cython in multiarray is to modify
multiarray so that Cython "owns" the PyMODINIT_FUNC and the module
PyModuleDef table.
That seems like a plausible place to start.

In the longer run, I think we'll need to figure out a strategy to have
source code divided over multiple .pyx files (for the same reason we
want multiple .c files -- it'll just be impossible to work with
otherwise). And this will be difficult for annoying technical reasons,
since we definitely do *not* want to increase the API surface exposed
by multiarray.so, so we will need to compile these multiple .pyx and
.c files into a single module, and have them talk to each other via
internal interfaces. But Cython is currently very insistent that every
.pyx file should be its own extension module, and the interface
between different files should be via public APIs.
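
One way to keep an eye on the exported API surface while experimenting is to audit the symbol table of the built extension. The sketch below is only an illustration: it assumes a listing in the style of `nm --defined-only` has already been captured as text, and the helper name `unexpected_exports`, the sample listing, and the allowed entry point `PyInit_multiarray` (the Python 3 spelling of the init symbol) are all hypothetical, not part of NumPy's build.

```python
def unexpected_exports(nm_output, allowed):
    """Return globally visible symbol names that are not in `allowed`.

    `nm_output` is text in the style of `nm --defined-only lib.so`,
    where each line looks like: "<address> <type> <name>".
    """
    leaked = []
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        sym_type, name = fields[-2], fields[-1]
        # nm generally uses upper-case type letters for global symbols
        # and lower-case letters for local ones.
        if sym_type.isupper() and name not in allowed:
            leaked.append(name)
    return leaked


# Hypothetical listing: one intended entry point, one accidental export
# from a secondary .pyx file, and one local (static) function.
SAMPLE = """\
0000000000001139 T PyInit_multiarray
0000000000001200 T PyInit__helpers
0000000000001300 t local_static_function
"""

leaked = unexpected_exports(SAMPLE, allowed={"PyInit_multiarray"})
print(leaked)
```

A check like this could run after each build on the experimental branch, so that an extra module init function leaking out of the .so is noticed early.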

I spent some time poking at this, and I think it's possible but will
take a few kluges at least initially. IIRC the tricky points I noticed
are:

- For everything except the top-level .pyx file, we'd need to call the
generated module initialization functions "by hand", and have a bit of
utility code to let us access the symbol tables for the resulting
modules

- We'd need some preprocessor hack (or something?) to prevent the
non-main module initialization functions from being exposed at the .so
level (e.g. 'cdef extern from "foo.h"', where 'foo.h' re-#defines
PyMODINIT_FUNC to remove the visibility declaration)

- By default 'cdef' functions are name-mangled, which is annoying if
you want to be able to do direct C calls between different .pyx and .c
files. You can fix this by adding a 'public' declaration to your cdef
function. But 'public' also adds dllexport stuff which would need to
be hacked out as per above.

I think the best strategy for this is to do whatever horrible things
are necessary to get an initial version working (on a branch, of
course), and then once that's done assess what changes we want to ask
the cython folks for to let us eliminate the gross parts.

(Insisting on compiling everything into the same .so will probably
also help at some point in avoiding Cython-Related Binary Size Blowup
Syndrome (CRBSBS), because the masses of boilerplate could in
principle be shared between the different files. I think some modern
linkers are even clever enough to eliminate this kind of duplicate
code automatically, since C++ suffers from a similar problem.)
Post by David Cournapeau
3. We start using Cython for the parts that are mostly menial refcount work.
Functions in calculation.c are obvious candidates.
Step 2 should not be disruptive, and does not look like a lot of work: there
are < 60 methods in the table, and most of them should be fairly
straightforward to cythonize. At worst, we could keep them as is
outside Cython and just "export" them in Cython.
Does that sound like an acceptable plan?
If so, I will start working on a PR for step 2.
Makes sense to me!

-n
--
Nathaniel J. Smith -- http://vorpus.org
David Cournapeau
2015-09-01 11:56:21 UTC
Post by Nathaniel Smith
I think the best strategy for this is to do whatever horrible things
are necessary to get an initial version working (on a branch, of
course), and then once that's done assess what changes we want to ask
the cython folks for to let us eliminate the gross parts.
Agreed.

Regarding multiple Cython .pyx files and symbol pollution, I think it would be
fine to have an internal API with the required prefix (say `_npy_cpy_`) in
a core library, and control the exported symbols at the .so level. This is
how many large libraries work in practice (e.g. MKL), and it is a model well
understood by library users.
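
To illustrate what "controlling the exported symbols at the .so level" could look like, here is a minimal sketch that generates a GNU ld version script exporting only the module entry point and hiding everything else, including any `_npy_cpy_`-prefixed internals. The helper name `make_version_script` and the symbol `PyInit_multiarray` (the Python 3 init name) are assumptions for illustration; other linkers and platforms need a different mechanism.

```python
def make_version_script(exported):
    """Build an anonymous GNU ld version script.

    Symbols listed in `exported` stay visible; everything else is
    made local to the shared object by the `local: *;` rule.
    """
    lines = ["{", "  global:"]
    lines += ["    %s;" % name for name in exported]
    lines += ["  local:", "    *;", "};"]
    return "\n".join(lines) + "\n"


# Hypothetical: export only the module init function.
script = make_version_script(["PyInit_multiarray"])
print(script)
```

With GNU ld, the resulting text would be saved to a file and passed at link time with `-Wl,--version-script=<file>`.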

I will start the cythonizing process without caring about any of that though:
one large .pyx file, with everything built together into one .so. That will
avoid having to fight both Cython and distutils at the same time :)

David
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion