
Leverage the new PEP 574 for no-copy pickling of contiguous arrays #11161

Closed
ogrisel opened this issue May 25, 2018 · 25 comments

ogrisel (Contributor) commented May 25, 2018

PEP 574 (scheduled for Python 3.8) introduces pickle protocol 5 with support for no-copy pickling of large mutable buffers.

I made a small proof-of-concept benchmark script using @pitrou's pickle5 backport of his draft implementation of PEP 574.

See: https://gist.github.com/ogrisel/a2b0e5ae4987a398caa7f9277cb3b90a

The meat lies in the following reducer:

import numpy as np
from pickle5 import PickleBuffer


def _array_from_buffer(buffer, dtype, shape):
    # Reconstruct the array from the pickled buffer without copying.
    return np.frombuffer(buffer, dtype=dtype).reshape(shape)


def reduce_ndarray_pickle5(a):
    # This reducer assumes protocol 5, as there is currently no way to
    # register a protocol-aware reduce function in the global copyreg
    # dispatch table.
    if not a.dtype.hasobject and a.flags.c_contiguous:
        # No-copy pickling for C-contiguous arrays under protocol 5.
        return _array_from_buffer, (PickleBuffer(a), a.dtype, a.shape), None
    else:
        # Fall back to the generic (bytes-copying) method.
        return a.__reduce__()

This works as expected (no extra copy when dumping and loading) and also fixes the in-memory speed overhead reported by @mrocklin in #7544.

To get this into numpy, we would need a protocol-aware reduce function: that is, have ndarray implement a __reduce_ex__ method that accepts a protocol argument, instead of the existing bytes-based implementation from array_reduce in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/methods.c#L1577. The bytes-based implementation should probably be kept as a fallback for protocol < 5.
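For illustration, a minimal Python-level sketch of what such a protocol-aware method could look like (the class and the _frombuffer_c helper are hypothetical; the real implementation would live in numpy's C code):

import numpy as np


def _frombuffer_c(buffer, dtype, shape):
    # Hypothetical module-level reconstructor: rebuild the array from
    # the buffer without copying (returns a plain ndarray).
    return np.frombuffer(buffer, dtype=dtype).reshape(shape)


class PickleAwareArray(np.ndarray):
    def __reduce_ex__(self, protocol):
        if (protocol >= 5 and not self.dtype.hasobject
                and self.flags.c_contiguous):
            from pickle import PickleBuffer  # Python 3.8+
            return _frombuffer_c, (PickleBuffer(self), self.dtype, self.shape)
        # Fall back to the existing bytes-based reduction for protocol < 5.
        return super().__reduce_ex__(protocol)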

ogrisel (Contributor, Author) commented May 25, 2018

Also, my naive reducer only works if the array is C-contiguous. Adding support for F-contiguous arrays should be easy by taking a transposed view, as in the sketch below. I am not sure about pickling non-contiguous arrays/views.
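A possible sketch of that transposed-view trick, reusing the _array_from_buffer reconstructor and the reducer from above (untested):

def _array_from_buffer_f(buffer, dtype, shape):
    # Rebuild the C-contiguous transpose, then transpose back so the
    # result is F-contiguous with the original shape.
    return np.frombuffer(buffer, dtype=dtype).reshape(shape[::-1]).T


def reduce_ndarray_pickle5_any_order(a):
    if not a.dtype.hasobject and a.flags.c_contiguous:
        return _array_from_buffer, (PickleBuffer(a), a.dtype, a.shape), None
    if not a.dtype.hasobject and a.flags.f_contiguous:
        # a.T is C-contiguous, so its buffer can be exported directly.
        return _array_from_buffer_f, (PickleBuffer(a.T), a.dtype, a.shape), None
    # Non-contiguous arrays/views still go through the generic method.
    return a.__reduce__()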

pitrou (Member) commented May 25, 2018

Your example is not zero-copy as it doesn't pass a buffer_callback argument to dump. But it probably makes one less copy than with protocol 4 :-)

ogrisel (Contributor, Author) commented May 25, 2018

Well, it does stream the content into the file object without making any spurious copy; this is checked by the linked gist script, and there is sample output in the gist comments. It's in-band zero-copy :)

The buffer_callback feature of PEP 574 would additionally make out-of-band zero-copy communication possible, which would be very useful for safe asyncio-based communication over the network: we could start sending data only once we are sure that a compound Python object is picklable. That requires some extra wrapping protocol around the pickle bytes themselves, though.

pitrou (Member) commented May 25, 2018

To clarify: a trivial example of zero-copy use looks as follows (written against the final Python 3.8 API, where buffer_callback receives one PickleBuffer per call):

import pickle

buffers = []
data = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)

# Later, to recreate the object; the buffers travel out of band
obj = pickle.loads(data, buffers=buffers)

In a real use case, the buffers could be shipped separately from one process (or subinterpreter - see PEP 554) to another, or shared using shared memory, etc.
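A round-trip sketch of that idea, using the final Python 3.8 API and a numpy recent enough to implement protocol 5 (here bytes() merely stands in for whatever transport actually ships the buffers):

import pickle

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# Ship `payload` and the buffer contents through separate channels
# (sockets, shared memory, ...); bytes() simulates that transport here.
raw = [bytes(b.raw()) for b in buffers]

restored = pickle.loads(payload, buffers=raw)
assert np.array_equal(arr, restored)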

pierreglaser (Contributor) commented

Hi @ogrisel @pitrou,
To carry on with this issue, I implemented __reduce_ex__ for numpy arrays using the CPython C API in a branch of my fork, used @pitrou's Python fork to get pickle protocol 5, and added an np._frombuffer helper that reconstructs an array given a buffer, its dtype and its shape.
I traced the peak memory when pickling/unpickling (with airspeed velocity), using

  • pickle protocol 4
  • pickle protocol 5
  • pickle protocol 5 with buffer_callback

Each of these was run for several array shapes (a sketch of such a benchmark follows).
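Roughly, an asv suite for this could look like the following sketch (class and method names are illustrative, not the actual benchmark code, and a Python with protocol 5 support is assumed):

import pickle

import numpy as np


class NdarrayPickleSuite:
    # A single parameter: the array shape (values are tuples).
    params = [[(100, 100), (1000, 1000), (10000, 1000)]]
    param_names = ["shape"]

    def setup(self, shape):
        self.arr = np.random.random(shape)

    def peakmem_protocol_4(self, shape):
        pickle.loads(pickle.dumps(self.arr, protocol=4))

    def peakmem_protocol_5(self, shape):
        pickle.loads(pickle.dumps(self.arr, protocol=5))

    def peakmem_protocol_5_out_of_band(self, shape):
        buffers = []
        data = pickle.dumps(self.arr, protocol=5,
                            buffer_callback=buffers.append)
        pickle.loads(data, buffers=buffers)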

Here is what we get:
[Figure: peak memory benchmark results for the three configurations]
For the biggest shape, (10000, 1000), the size of the array is 80MB. We do see a 160MB decrease in peak memory between pickle protocol 4 and pickle protocol 5 with callbacks, so two fewer copies, as you say :)

There also seems to be a big memory increase between pickle protocol 4 and pickle protocol 5 with no buffer callback. I cannot explain it for now; I will investigate.

What do you guys think?

pitrou (Member) commented Sep 21, 2018

Thank you very much @pierreglaser. Did you measure CPU times as well?

I cannot explain why protocol 5 without callbacks would exhibit larger memory consumption. It may be a problem in how you implemented __reduce_ex__. By the way, how did you measure memory consumption? Is it RSS or virtual memory size?

ogrisel (Contributor, Author) commented Sep 21, 2018

RSS with psutil, I guess.

hmaarrfk (Contributor) commented Sep 21, 2018

@pierreglaser did you try to subtract the baseline memory cost? If so, how did you do it?

Edit: fixed to mention pierreglaser's handle instead of ogrisel's. :/ man I'm 0/1.

ogrisel (Contributor, Author) commented Sep 21, 2018

@pierreglaser did the work. He uses airspeed velocity to do the measurements. I am not sure whether the baseline is subtracted or not. Probably not.

ogrisel (Contributor, Author) commented Sep 21, 2018

@pierreglaser I suppose this is your branch: master...pierreglaser:implement-reduce-ex

There is a problem: you use the Python 3.8-specific C API, notably the PyPickleBuffer_FromObject function. We don't want the numpy build to depend on Python 3.8; numpy should still be buildable with Python 2.7 and older versions of Python 3.

So instead of using PyPickleBuffer_FromObject, you should build pickle buffers by importing the PickleBuffer class from the pickle module and then calling it on the array (as you would do in Python). This way, the C code only manipulates it as a PyObject and does not introduce a build dependency. I am not familiar enough with the C API, but I am pretty sure this is possible.

ogrisel (Contributor, Author) commented Sep 21, 2018

Also, feel free to open a [WIP] pull request against numpy master so that we can give you more detailed feedback there.

pierreglaser (Contributor) commented Sep 21, 2018

[Figure: updated benchmark results, including CPU time]

@pitrou here is the updated figure with CPU time; it is pretty low when using buffers.
@ogrisel got it, will take a look at this!

pitrou (Member) commented Sep 21, 2018

So instead of using PyPickleBuffer_FromObject you should build pickle buffers by importing the PickleBuffer class from the pickle module and then calling on the array (as you would do in Python)

One could also use a preprocessor conditional block, something like:

#if PY_VERSION_HEX < 0x03080000
/* code for Python 3.7 and older */
#else
/* code for Python 3.8+ */
#endif

Things get a bit more complicated if you want to take the pickle5 backport into account, though, because you can't discriminate at compile time; you'd have to try importing that module and do as you suggest (treat the PickleBuffer class as a regular Python object).

pierreglaser (Contributor) commented

@hmaarrfk the first numpy array has a size of 80KB, which is negligible compared to the measured memory usage (20MB). So my rough estimate for the baseline would be 20MB.

pitrou (Member) commented Sep 21, 2018

By the way, both @ogrisel and I got better results with protocol 5 than with protocol 4, even without a buffer callback.

@pierreglaser Out of curiosity, can you post the code for your benchmark?

ogrisel (Contributor, Author) commented Sep 21, 2018

The solution with the preprocessor conditional block is simple, but indeed it makes it impossible to benefit from the pickle5 package installed on Python 3.7. Maybe we don't care that much, though.

pierreglaser (Contributor) commented

@pitrou I pushed the repository with my benchmarks here :)

ogrisel (Contributor, Author) commented Sep 21, 2018

@pierreglaser what you call stream in your benchmark is not a stream, it's an in-memory bytes string. It would be more interesting to pickle to an out-of-RAM file object (e.g. a temporary file on the filesystem): use dump instead of dumps and load instead of loads.

This should make it easier to identify spurious copies. Also, please use a payload that is significantly larger than the baseline memory footprint of the Python process, to make the spurious copies easier to count; for instance, a 200MB numpy array (see the sketch below).
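Something along these lines (a sketch using a temporary file and a ~200MB array):

import pickle
import tempfile

import numpy as np

a = np.random.random(25_000_000)  # 25e6 float64 values, ~200MB

with tempfile.TemporaryFile() as f:
    pickle.dump(a, f, protocol=5)  # streams the array buffer to the file
    f.seek(0)
    b = pickle.load(f)

assert np.array_equal(a, b)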

ogrisel (Contributor, Author) commented Sep 21, 2018

Or even a 1GB array.

ogrisel (Contributor, Author) commented Sep 21, 2018

Also, I would rename Protocol5WithBuffer to Protocol5WithOutOfBandBuffers to make it more explicit: Protocol5 also uses PickleBuffer objects internally, but they are serialized into the pickle bytes stream.

jakirkham (Contributor) commented

What are the units in your graphs, @pierreglaser? I am particularly looking at the runtime. Has it been normalized somehow?

pierreglaser (Contributor) commented Sep 24, 2018

Hi All,
So here is how I addressed this issue:

  • on Python 3.8, import pickle;
  • on Python 3.7, try importing the pickle5 backport; if it is not available, raise an error, because protocol 5 is then not available (see the sketch below).
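At the Python level, the fallback logic amounts to something like this sketch (the actual change lives in numpy's C code):

import sys

if sys.version_info >= (3, 8):
    from pickle import PickleBuffer
else:
    try:
        from pickle5 import PickleBuffer  # backport package for Python 3.7
    except ImportError:
        # pickle protocol 5 is not available at all.
        PickleBuffer = None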

Here is the updated graph (with correct units, @jakirkham). The peak memory increase for pickle5 with no out-of-band buffer was a false alarm.
[Figure: updated benchmark results with corrected units]

(Note: time axis is in log scale)

pitrou (Member) commented Sep 24, 2018

Thanks for the update :-)

ogrisel (Contributor, Author) commented Oct 11, 2018

Closing: #12011 was merged into numpy master.

Neltherion commented Sep 26, 2021

@ogrisel Hi. I've asked a question about pickle protocol 5's inefficiency compared to PyArrow's serialization/deserialization methods here.

Is there anything that can be done to utilize pickle to serialize/deserialize objects containing Numpy arrays faster?
