-
-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: implementation of array_reduce_ex #12011
Conversation
d383f8a
to
57c4304
Compare
numpy/core/src/multiarray/methods.c
Outdated
/* if the python version is below 3.8, the pickle module does not provide | ||
* built-in support for protocol 5. We try importing the pickle5 | ||
* backport instead */ | ||
if (PY_VERSION_HEX < 0x03080000){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than a runtime condition, it should be possible to use a build time preprocessor directive here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, although it was not necessary for the code to compile - definitely can do it in further iterations, but did not want to generate any misunderstandings :)
numpy/core/src/multiarray/methods.c
Outdated
{ | ||
int protocol; | ||
PyObject *ret = NULL, *numeric_mod = NULL, *from_buffer_func = NULL; | ||
PyObject *buffer_tuple = NULL, *pickle_module=NULL, *pickle_class = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spaces here, fixing it.
Thank you very much for the new benchmarks. They look good: there is no more spurious memory copies when pickling and loading contiguous numpy arrays when using protocol 5 either with in-band or out-of-band buffers. |
Next steps:
|
I don't think we can modify the signature of Otherwise looks fine so far, at inital skim. |
numpy/core/src/multiarray/methods.c
Outdated
@@ -1948,7 +1951,11 @@ PyArray_Dump(PyObject *self, PyObject *file, int protocol) | |||
} | |||
|
|||
#if defined(NPY_PY3K) | |||
#if PY_VERSION_HEX < 0x03080000 && PY_VERSION_HEX >= 0x03070000 | |||
cpick = PyImport_ImportModule("pickle5"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not necessary to try to import pickle5
when protocol < 5
. In particular data.dump(f)
should always work for Python 3.6 and 3.7 even if the pickle5
is not installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually need an explicit test for those cases.
e5a1de5
to
40b81bf
Compare
numpy/core/numeric.py
Outdated
@@ -40,7 +40,13 @@ | |||
newaxis = None | |||
|
|||
if sys.version_info[0] >= 3: | |||
import pickle | |||
if sys.version_info[1] == 7: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sys.version_info[1] in [6, 7]
numpy/core/tests/test_dtype.py
Outdated
if sys.version_info[:2] in ((3, 7), (3, 6)): | ||
import pickle5 as pickle | ||
else: | ||
import pickle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test should not raise ImportError if pickle5 is not installed in Python 3.7 or 3.6: pickle5 should be an optional dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specifically this file or all tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid replicating complex import everywhere, you could probably do:
from numpy.core.numeric import pickle
numpy/core/tests/test_records.py
Outdated
if sys.version_info[:2] in ((3, 7), (3, 6)): | ||
import pickle5 as pickle | ||
else: | ||
import pickle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here: pickle5 should never be a mandatory test dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from numpy.core.numeric import pickle
numpy/core/tests/test_ufunc.py
Outdated
if sys.version_info[:2] in ((3, 7), (3, 6)): | ||
import pickle5 as pickle | ||
else: | ||
import pickle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the local import is necessary. I think this can be moved to the top level as:
from numpy.core.numeric import pickle
tools/travis-test.sh
Outdated
@@ -187,6 +193,11 @@ if [ -n "$USE_WHEEL" ] && [ $# -eq 0 ]; then | |||
pushd dist | |||
pip install --pre --no-index --upgrade --find-links=. numpy | |||
pip install nose pytest | |||
if [ "$VERSION" == "3.7" ] || [ "$VERSION" == "3.6" ] ; then | |||
# use pickle5 backport when on python3.[6-7] | |||
pip install pickle5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of hardcoding the installation of pickle5
based on the Python version number, please use an explicit environment variable
if [ "$INSTALL_PICKLE5" == "1" ] ; then
pip install pickle5
fi
and then set that variable on some of the Python 3.6 and/or 3.7 builds in .travis.yml
.
a5faa92
to
b415bac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another batch of comments:
.appveyor.yml
Outdated
@@ -24,26 +24,31 @@ environment: | |||
PYTHON_VERSION: 3.6 | |||
PYTHON_ARCH: 32 | |||
TEST_MODE: fast | |||
INSTALL_PICKLE5: 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to set the variable when it's 0-valued.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually INSTALL_PICKLE5=0
will not work as expected with the following bash conditions from tools/travis-test.sh
:
if [ -n "$INSTALL_PICKLE5" ]; then
pip install pickle5
fi
so all INSTALL_PICKLE5: 0
occurrences must be removed from this configuration file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this is appveyor. But still I think we should not have to set INSTALL_PICKLE5: 0
explicitly.
numpy/core/src/multiarray/methods.c
Outdated
PyDataType_FLAGCHK(descr, NPY_ITEM_HASOBJECT) || | ||
PyDataType_ISUNSIZED(descr)) { | ||
/* In case self is an instance of a numpy array subclass we get the | ||
* __reduce__ method of this subclass */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change this comment to:
The PickleBuffer class from version 5 of the pickle protocol can only be used for arrays backed by a contiguous data buffer. For all other cases we fallback to the generic array_reduce method that involves using a temporary bytes allocation. However we do not call array_reduce directly but instead lookup and call the __reduce__
method to make sure that it's possible customize pickling in sub-classes.
numpy/core/tests/test_multiarray.py
Outdated
np.array([('xxx', 1, 2.0)], dtype=[('a', (str, 3)), ('b', int), | ||
('c', float)]) | ||
] | ||
for proto in range(2, pickle.HIGHEST_PROTOCOL+1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: I think the style for binary operators in numpy is to use whitespaces around operators:
for proto in range(2, pickle.HIGHEST_PROTOCOL + 1):
...
this issue is repeated many times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you, however:
- I don't see any mention of it in the docs (https://docs.scipy.org/doc/numpy-1.15.0/dev/index.html )
- PEP8 does not explicitly recommend systematically adding spaces between the
+
binary operator (https://www.python.org/dev/peps/pep-0008/)
I can change it, but there was no easy way for me to infer that I had to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree this is not mandatory. But I looked around and it seems that the numpy convention is to use spaces more often than not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numpy/core/tests/test_ufunc.py
Outdated
@@ -1,5 +1,6 @@ | |||
from __future__ import division, absolute_import, print_function | |||
|
|||
import sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused import?
shippable.yml
Outdated
@@ -35,6 +35,7 @@ build: | |||
# install pytest-xdist to leverage a second core | |||
# for unit tests | |||
- pip install pytest-xdist | |||
- if [ $(python -c 'import sys; print(sys.version[0])') == "3" ]; then pip install pickle5; fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the python version is 3.5 or 3.8 this command will fail.
The test should pass if pickle5 is not there. I think this can be removed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments:
numpy/core/tests/test_multiarray.py
Outdated
def test_array_with_dtype_has_object_pickling(self): | ||
if pickle.HIGHEST_PROTOCOL < 5: | ||
pass | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of passing the test, this should be skipped with pytest:
@pytest.mark.skipif(pickle.HIGHEST_PROTOCOL < 5, reason="requires pickle protocol 5")
def test_array_with_dtype_has_object_pickling(self):
my_object = object()
...
numpy/core/tests/test_multiarray.py
Outdated
assert_equal(zs.dtype, zs2.dtype) | ||
|
||
def test_pickle_with_buffercallback(self): | ||
if hasattr(pickle, 'PickleBuffer'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, let's use @pytest.mark.skipif
:
@pytest.mark.skipif(pickle.HIGHEST_PROTOCOL < 5, reason="requires pickle protocol 5")
def test_pickle_with_buffercallback(self):
...
numpy/core/tests/test_multiarray.py
Outdated
else: | ||
assert pickle.HIGHEST_PROTOCOL < 5 | ||
|
||
def test_array_with_dtype_has_object_pickling(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename this to test_record_array_with_object_dtype
. There is already "pickling" in the test class name.
By playing this PR I noticed the following: >>> import numpy as np
>>> import pickle5
>>> a = np.arange(12) Let's mark the array read-only and then pickle it: >>> a.flags.writeable = False
>>> pickle5.loads(a.dumps(protocol=2)).flags.writeable
True The writeable flag is lost: it means that currently (in numpy master and all past versions), a readonly array would be loaded back as a read-write array. I think this is a bug of numpy. (I actually checked that I could reproduce the behavior with the last stable release of numpy). On the contrary, the >>> pickle5.loads(a.dumps(protocol=5)).flags.writeable
False @ahaldane @charris do you confirm that the fact that a |
My initial impression is it's not a bug: We don't preserve the writeable flag by default when we construct a new array from another one by |
numpy/core/numeric.py
Outdated
if transpose: | ||
return c_contiguous_array.reshape(shape, order='F') | ||
else: | ||
return c_contiguous_array.reshape(shape, order='C') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had not realized that you could reshape with order. So it's better to pass order
itself directly as an argument to _frombuffer
instead of the transpose
flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename c_contiguous_array
to flat_array
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But actually that would just become a one liner:
def _frombuffer(buffer, dtype, shape, order):
return frombuffer(buffer, dtype=dtype).reshape(shape, order=order)
numpy/core/src/multiarray/methods.c
Outdated
#elif PY_VERSION_HEX < 0x03080000 && PY_VERSION_HEX >= 0x03060000 | ||
cpick = PyImport_ImportModule("pickle5"); | ||
#else | ||
return NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot do this, without setting a proper error message.
For instance with Python 3.5 I get the following with the current implementation:
>>> import numpy as np
>>> a = np.arange(12)
>>> s = a.dumps(protocol=6)
Traceback (most recent call last):
File "<ipython-input-9-2e271e7e7bc8>", line 1, in <module>
s = a.dumps(protocol=6)
ValueError: pickle protocol must be <= 4
>>> s = a.dumps(protocol=5)
Traceback (most recent call last):
File "<ipython-input-10-7689e2059d65>", line 1, in <module>
s = a.dumps(protocol=5)
SystemError: <built-in method dumps of numpy.ndarray object at 0x7f2a181c22b0> returned NULL without setting an error
For Python <= 3.5, we need to raise:
ValueError("pickle protocol must be <= %d" % pickle.HIGEST_PROTOCOL)
@ahaldane thanks for the feedback. However I hope that you don't consider a bug that this PR would preserve the writeable flag for contiguous array with the protocol 5? The fact that this behavior is not consistent if the array is contiguous or not might be surprising to the users though. |
OK, I just reread the PEP and I see what you're getting at: Now that an unpickled ndarray might be just a view of the original pickled ndarray, we have to maintain the writeable flags to avoid creating writeable views of arrays which have writeable=False. That seems like a strong argument for maintaining the writeable flag. One might object that it's a bit strange that whether or not an unpickled ndarray is writeable depends on the pickle protocol and whether it was saved to disk or not. But on the other hand, maybe it's easy to explain to people: Protocol <5 always returns a copy, so it is always writeable, while protocol 5 can return a view in which case users should expect it to sometimes be unwriteable. That's the same as many other numpy operations where copies are writeable while views might not be, eg fancy indexing vs slice indexing. In summary, the way you have it sounds fine. |
7c021e3
to
3878df8
Compare
So a little clarification on the error handling:
For Otherwise, we let pickle raise its own error messages, that tell the right information. With the code used right now, there will be no This is all being tested by the way, yielding a drop in the coverage because some tests are mutually exclusive on the python version, but the coverage report is specified using only python 3.6 |
We might want to enable coverage collection on Python 3.5 for this reason, no? |
numpy/core/tests/test_dtype.py
Outdated
@@ -666,7 +667,7 @@ class user_def_subcls(np.void): | |||
class TestPickling(object): | |||
|
|||
def check_pickling(self, dtype): | |||
for proto in range(pickle.HIGHEST_PROTOCOL + 1): | |||
for proto in range(2, pickle.HIGHEST_PROTOCOL+1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You changed the style of the spacing around the operator for no reason. Please stay consistent with the existing code style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll go all over the range(...)
iterations on the pickle protocol and change them all to one unique style.
numpy/core/tests/test_multiarray.py
Outdated
protocol=5) | ||
|
||
assert b'frombuffer' in arr_without_object_bytes_string | ||
assert b'frombuffer' not in arr_with_object_bytes_string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an implementation detail not worth testing in my opinion. Instead please check that you can loads the pickles and check that the dtypes are correct.
We'd just have to be a little cautious since some projects (i.e., SciPy) appear to have reached the codecov data limit, and then nothing works. |
Also the changes in the |
37dd980
to
4084dec
Compare
UPDATE:
if sys.version_info[0] >= 3:
if sys.version_info[1] in (6, 7):
try:
import pickle5 as pickle
except ImportError:
import pickle
else:
import pickle
else:
import cPickle as pickle each time we import pickle in python, we import it once this way, in Now,
This is what the first commit is about. The second commit implements Happy to hear your suggestions, |
4084dec
to
7fce5ea
Compare
Needs a release note. |
7fce5ea
to
64a855f
Compare
Any thoughts about adding this to the NumPy API as some point? |
Great work. Thanks everyone! @pitrou are you confident that PEP 574 will be accepted and your protocol 5 branch be merged in CPython before the release of Python 3.8? Otherwise some assumptions made in the code of this PR might be slightly off although they should not cause any breakage (I believe).
Can you be more specific? I don't think I understand what you are suggesting. |
I don't think the PEP will be controversial. The main uncertainty at this point is around the schedule for choosing and putting the new governance model in place. But I don't expect 3.8 to be feature-frozen before that is done anyway :-) |
array = np.arange(10) | ||
buffers = [] | ||
bytes_string = pickle.dumps(array, buffer_callback=buffers.append, | ||
protocol=5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou in the current draft of the PEP states that buffer_callback
the example can accept a list of buffers with buffer_callback=buffers.extend
. However, the implementation we test here only accepts a single buffer at a time. This is in line with this section of the PEP: https://www.python.org/dev/peps/pep-0574/#passing-a-sequence-of-buffers-in-buffer-callback
I think the example of the PEP should be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! You're right, I forgot to fix this.
For reference, using the current numpy master and the latest stable release of pandas, I confirm that pickle protocol 5 makes it possible to have no-copy semantics for pandas dataframes without changing anything in pandas: >>> import pandas as pd
>>> import numpy as np
>>> import pickle5 as pickle
>>> df = pd.DataFrame({'a': np.random.randn(3), 'b': np.random.randint(0, 2, 3)})
>>> buffers = []
>>> df_pkl = pickle.dumps(df, protocol=5, buffer_callback=buffers.append)
>>> buffers
[<pickle.PickleBuffer object at 0x7f32f7eea5c8>, <pickle.PickleBuffer object at 0x7f32f7e5f848>]
>>> pickle.loads(df_pkl, buffers=buffers)
a b
0 -0.577441 1
1 0.168650 1
2 0.783261 0 |
Follow up on #11161.
In order to enable no-copy pickling as mentioned in the PEP 574 proposal,
numpy arrays need to have:
__reduce_ex__
method, returning metadata and a buffer on their internal datadtype
and ashape
The first one is mostly inspired from the
array_reduce
method incore/src/multiarray/methods.c
The second one is a wrapper written in python around
np.frombuffer(..).reshape(..)
Here is the decrease in peak memory and cpu time:
pinging @ogrisel and @pitrou as of now.