
ENH: implementation of array_reduce_ex #12011

Merged
merged 2 commits into numpy:master on Oct 10, 2018

Conversation


@pierreglaser pierreglaser commented Sep 21, 2018

Follow-up on #11161.
In order to enable no-copy pickling as described in the PEP 574 proposal,
numpy arrays need to have:

  • a __reduce_ex__ method that returns metadata and a buffer exposing their internal data
  • a method that reconstructs an array from a buffer, a dtype and a shape

The first one is mostly inspired by the array_reduce method in core/src/multiarray/methods.c.

The second one is a wrapper written in Python around np.frombuffer(..).reshape(..).
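
For readers skimming the thread, here is a rough Python sketch of the mechanism (illustration only: the real reducer is implemented in C in multiarray/methods.c, and the function name reduce_ex_sketch is made up):

import numpy as np
import pickle5 as pickle  # the stdlib pickle provides PickleBuffer on Python >= 3.8

def _frombuffer(buffer, dtype, shape, order):
    # the Python-level reconstruction wrapper described above
    return np.frombuffer(buffer, dtype=dtype).reshape(shape, order=order)

def reduce_ex_sketch(arr, protocol):
    # zero-copy path: only for contiguous arrays without object fields
    if protocol >= 5 and arr.flags.c_contiguous and not arr.dtype.hasobject:
        buf = pickle.PickleBuffer(arr)  # exposes arr's memory without copying
        return (_frombuffer, (buf, arr.dtype, arr.shape, 'C'))
    # otherwise fall back to the generic, copying __reduce__ path
    return arr.__reduce__()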

Here is the decrease in peak memory and CPU time:
[benchmark figure: peak memory and CPU time results (results_c14eb6060b70ec1786bd869ef0879c8b0c8c6df8)]

pinging @ogrisel and @pitrou as of now.

/* if the python version is below 3.8, the pickle module does not provide
* built-in support for protocol 5. We try importing the pickle5
* backport instead */
if (PY_VERSION_HEX < 0x03080000){
Contributor

Rather than a runtime condition, it should be possible to use a build time preprocessor directive here.

@pierreglaser pierreglaser Sep 25, 2018

Yes, although it was not necessary for the code to compile. I can definitely do it in a further iteration, but I did not want to create any misunderstandings :)

{
int protocol;
PyObject *ret = NULL, *numeric_mod = NULL, *from_buffer_func = NULL;
PyObject *buffer_tuple = NULL, *pickle_module=NULL, *pickle_class = NULL;
Contributor Author

Missing spaces here, fixing it.


ogrisel commented Sep 27, 2018

Thank you very much for the new benchmarks. They look good: there are no more spurious memory copies when pickling and loading contiguous numpy arrays with protocol 5, either with in-band or out-of-band buffers.
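
For context, "out-of-band" means the raw buffers are handed to a buffer_callback instead of being embedded in the pickle stream, while "in-band" keeps them in the stream. A minimal illustration, assuming the pickle5 backport is installed:

import numpy as np
import pickle5 as pickle

a = np.arange(1_000_000)

# in-band: the buffer data is written into the pickle stream itself
payload = pickle.dumps(a, protocol=5)

# out-of-band: the stream only carries metadata, the raw buffers go to the callback
buffers = []
payload = pickle.dumps(a, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(payload, buffers=buffers)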


ogrisel commented Sep 27, 2018

Next steps:


ahaldane commented Sep 30, 2018

I don't think we can modify the signature of PyArray_FromBuffer: it is part of the numpy C API, which we are very reluctant to modify, so it's better to go back to the wrapper function you had before.

Otherwise it looks fine so far, on an initial skim.

@@ -1948,7 +1951,11 @@ PyArray_Dump(PyObject *self, PyObject *file, int protocol)
}

#if defined(NPY_PY3K)
#if PY_VERSION_HEX < 0x03080000 && PY_VERSION_HEX >= 0x03070000
cpick = PyImport_ImportModule("pickle5");
Contributor

It's not necessary to try to import pickle5 when protocol < 5. In particular, data.dump(f) should always work on Python 3.6 and 3.7 even if pickle5 is not installed.

Contributor

We actually need an explicit test for those cases.
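
A minimal sketch of such a test (hypothetical name and placement, the real test may look different):

import pickle
import numpy as np
from numpy.testing import assert_array_equal

def test_pickle_protocol_2_without_pickle5():
    # dumping/loading with protocol <= 4 must keep working on Python 3.6/3.7
    # even when the pickle5 backport is not installed
    a = np.arange(10)
    assert_array_equal(a, pickle.loads(pickle.dumps(a, protocol=2)))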

@@ -40,7 +40,13 @@
newaxis = None

if sys.version_info[0] >= 3:
import pickle
if sys.version_info[1] == 7:
Contributor

sys.version_info[1] in [6, 7]

if sys.version_info[:2] in ((3, 7), (3, 6)):
import pickle5 as pickle
else:
import pickle
Contributor

This test should not raise ImportError if pickle5 is not installed in Python 3.7 or 3.6: pickle5 should be an optional dependency.

Contributor Author

specifically this file or all tests?

Contributor

To avoid replicating this complex import everywhere, you could probably do:

from numpy.core.numeric import pickle

if sys.version_info[:2] in ((3, 7), (3, 6)):
import pickle5 as pickle
else:
import pickle
Contributor

Same comment here: pickle5 should never be a mandatory test dependency.

Contributor

from numpy.core.numeric import pickle

if sys.version_info[:2] in ((3, 7), (3, 6)):
import pickle5 as pickle
else:
import pickle
Contributor

I don't think the local import is necessary. I think this can be moved to the top level as:

from numpy.core.numeric import pickle

@@ -187,6 +193,11 @@ if [ -n "$USE_WHEEL" ] && [ $# -eq 0 ]; then
pushd dist
pip install --pre --no-index --upgrade --find-links=. numpy
pip install nose pytest
if [ "$VERSION" == "3.7" ] || [ "$VERSION" == "3.6" ] ; then
# use pickle5 backport when on python3.[6-7]
pip install pickle5
Contributor

Instead of hardcoding the installation of pickle5 based on the Python version number, please use an explicit environment variable

if [ "$INSTALL_PICKLE5" == "1" ] ; then
    pip install pickle5
fi

and then set that variable on some of the Python 3.6 and/or 3.7 builds in .travis.yml.

@pierreglaser pierreglaser changed the title from WIP: implementation of array_reduce_ex to ENH: implementation of array_reduce_ex on Oct 3, 2018
@ogrisel ogrisel left a comment

Another batch of comments:

.appveyor.yml Outdated
@@ -24,26 +24,31 @@ environment:
PYTHON_VERSION: 3.6
PYTHON_ARCH: 32
TEST_MODE: fast
INSTALL_PICKLE5: 0
Contributor

No need to set the variable when it's 0-valued.

Contributor

Actually, INSTALL_PICKLE5=0 will not work as expected with the following bash condition from tools/travis-test.sh:

if [ -n "$INSTALL_PICKLE5" ]; then
    pip install pickle5
fi

so all INSTALL_PICKLE5: 0 occurrences must be removed from this configuration file.

Contributor

Actually, this is AppVeyor. But still, I think we should not have to set INSTALL_PICKLE5: 0 explicitly.

PyDataType_FLAGCHK(descr, NPY_ITEM_HASOBJECT) ||
PyDataType_ISUNSIZED(descr)) {
/* In case self is an instance of a numpy array subclass we get the
* __reduce__ method of this subclass */
Contributor

I would change this comment to:

The PickleBuffer class from version 5 of the pickle protocol can only be used for arrays backed by a contiguous data buffer. For all other cases we fall back to the generic array_reduce method, which involves a temporary bytes allocation. However, we do not call array_reduce directly but instead look up and call the __reduce__ method, to make sure that it's possible to customize pickling in subclasses.
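
In Python terms, the dispatch described in that comment amounts to roughly the following (illustration only; the actual checks are done in C on the dtype flags and the array layout):

def needs_generic_fallback(arr):
    # object dtypes, unsized dtypes and non-contiguous arrays cannot be backed
    # by a PickleBuffer and go through the copying __reduce__ path instead
    return (arr.dtype.hasobject
            or arr.dtype.itemsize == 0
            or not (arr.flags.c_contiguous or arr.flags.f_contiguous))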

np.array([('xxx', 1, 2.0)], dtype=[('a', (str, 3)), ('b', int),
('c', float)])
]
for proto in range(2, pickle.HIGHEST_PROTOCOL+1):
Contributor

Nitpick: I think the style for binary operators in numpy is to use whitespace around operators:

        for proto in range(2, pickle.HIGHEST_PROTOCOL + 1):
            ...

this issue is repeated many times.

@pierreglaser pierreglaser Oct 3, 2018

I believe you; I can change it, but there was no easy way for me to infer that I had to do it.

Contributor

Yes I agree this is not mandatory. But I looked around and it seems that the numpy convention is to use spaces more often than not.


@@ -1,5 +1,6 @@
from __future__ import division, absolute_import, print_function

import sys
Contributor

unused import?

shippable.yml Outdated
@@ -35,6 +35,7 @@ build:
# install pytest-xdist to leverage a second core
# for unit tests
- pip install pytest-xdist
- if [ $(python -c 'import sys; print(sys.version[0])') == "3" ]; then pip install pickle5; fi
Contributor

If the Python version is 3.5 or 3.8, this command will fail.

The test should pass if pickle5 is not there. I think this can be removed now.

@ogrisel ogrisel left a comment

Some more comments:

def test_array_with_dtype_has_object_pickling(self):
if pickle.HIGHEST_PROTOCOL < 5:
pass
else:
Contributor

Instead of passing the test, this should be skipped with pytest:

@pytest.mark.skipif(pickle.HIGHEST_PROTOCOL < 5, reason="requires pickle protocol 5")
def test_array_with_dtype_has_object_pickling(self):
     my_object = object()
     ...

assert_equal(zs.dtype, zs2.dtype)

def test_pickle_with_buffercallback(self):
if hasattr(pickle, 'PickleBuffer'):
Contributor

Same here, let's use @pytest.mark.skipif:

@pytest.mark.skipif(pickle.HIGHEST_PROTOCOL < 5, reason="requires pickle protocol 5")
def test_pickle_with_buffercallback(self):
    ...

else:
assert pickle.HIGHEST_PROTOCOL < 5

def test_array_with_dtype_has_object_pickling(self):
Contributor

I would rename this to test_record_array_with_object_dtype. There is already "pickling" in the test class name.


ogrisel commented Oct 3, 2018

By playing with this PR, I noticed the following:

>>> import numpy as np
>>> import pickle5
>>> a = np.arange(12)

Let's mark the array read-only and then pickle it:

>>> a.flags.writeable = False
>>> pickle5.loads(a.dumps(protocol=2)).flags.writeable
True

The writeable flag is lost: it means that currently (in numpy master and all past versions), a read-only array would be loaded back as a read-write array. I think this is a bug in numpy. (I actually checked that I could reproduce the behavior with the latest stable release of numpy.)

By contrast, the PickleBuffer class of protocol 5 propagates the read-only/writeable state, and therefore:

>>> pickle5.loads(a.dumps(protocol=5)).flags.writeable
False

@ahaldane @charris do you confirm that the fact that a dumps/loads cycle does not preserve the writeable flag can be considered a bug?


ahaldane commented Oct 3, 2018

My initial impression is it's not a bug: We don't preserve the writeable flag by default when we construct a new array from another one by b = np.array(a), even though we retain other properties like the memory order. To me doing a save + load is like constructing a new buffer from an old one, like the np.array call.
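
For illustration, a quick check of the behavior described here (a copy is writeable regardless of the source's flags):

>>> import numpy as np
>>> a = np.arange(3)
>>> a.flags.writeable = False
>>> np.array(a).flags.writeable
True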

if transpose:
return c_contiguous_array.reshape(shape, order='F')
else:
return c_contiguous_array.reshape(shape, order='C')
Contributor

I had not realized that you could reshape with order. So it's better to pass order itself directly as an argument to _frombuffer instead of the transpose flag.

Contributor

I would rename c_contiguous_array to flat_array.

Contributor

But actually that would just become a one liner:

def _frombuffer(buffer, dtype, shape, order):
     return frombuffer(buffer, dtype=dtype).reshape(shape, order=order)

#elif PY_VERSION_HEX < 0x03080000 && PY_VERSION_HEX >= 0x03060000
cpick = PyImport_ImportModule("pickle5");
#else
return NULL;
@ogrisel ogrisel Oct 3, 2018

We cannot do this without setting a proper error message.

For instance with Python 3.5 I get the following with the current implementation:

>>> import numpy as np                                                                                                                                                          
>>> a = np.arange(12)                                                                                                                                                           
>>> s = a.dumps(protocol=6)                                                                                                                                                     
Traceback (most recent call last):
  File "<ipython-input-9-2e271e7e7bc8>", line 1, in <module>
    s = a.dumps(protocol=6)
ValueError: pickle protocol must be <= 4

>>> s = a.dumps(protocol=5)                                                                                                                                                     
Traceback (most recent call last):
  File "<ipython-input-10-7689e2059d65>", line 1, in <module>
    s = a.dumps(protocol=5)
SystemError: <built-in method dumps of numpy.ndarray object at 0x7f2a181c22b0> returned NULL without setting an error

For Python <= 3.5, we need to raise:

 ValueError("pickle protocol must be <= %d" % pickle.HIGHEST_PROTOCOL)


ogrisel commented Oct 3, 2018

@ahaldane thanks for the feedback. However, I hope that you don't consider it a bug that this PR preserves the writeable flag for contiguous arrays with protocol 5?

The fact that this behavior differs depending on whether the array is contiguous might be surprising to users, though.


ahaldane commented Oct 3, 2018

OK, I just reread the PEP and I see what you're getting at: Now that an unpickled ndarray might be just a view of the original pickled ndarray, we have to maintain the writeable flags to avoid creating writeable views of arrays which have writeable=False.

That seems like a strong argument for maintaining the writeable flag.

One might object that it's a bit strange that whether or not an unpickled ndarray is writeable depends on the pickle protocol and whether it was saved to disk or not. But on the other hand, maybe it's easy to explain to people: Protocol <5 always returns a copy, so it is always writeable, while protocol 5 can return a view in which case users should expect it to sometimes be unwriteable. That's the same as many other numpy operations where copies are writeable while views might not be, eg fancy indexing vs slice indexing.

In summary, the way you have it sounds fine.


pierreglaser commented Oct 5, 2018

So, a little clarification on the error handling:
I think we need dedicated messages for Python 3.6 and 3.7, alerting users that the highest protocol possible is 5 and not 4, as pickle alone would claim. When using numpy.ndarray.dumps, numpy.ndarray.dump and numpy.ndarray.__reduce_ex__:

  • if protocol > 5, a ValueError will be raised saying 'pickle protocol must be <= 5'
  • if protocol == 5 and pickle5 is not installed, an ImportError will be raised telling the user that pickle5 is needed

For __reduce_ex__, we raise a ValueError if protocol > 5, and also if (protocol == 5 and sys.version_info[:2] < (3, 6)).

Otherwise, we let pickle raise its own error messages, which convey the right information. With the code used right now, there will be no SystemError anymore, only pickle error messages or the messages stated above.
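
A conceptual Python rendering of these checks (pseudocode of the behavior described above, not the actual C implementation; the helper name is made up):

import sys

def _check_protocol(protocol):
    if protocol > 5:
        raise ValueError("pickle protocol must be <= 5")
    if protocol == 5:
        if sys.version_info[:2] < (3, 6):
            # protocol 5 is not available at all below Python 3.6
            raise ValueError("pickle protocol must be <= 5")
        if sys.version_info[:2] < (3, 8):
            try:
                import pickle5  # noqa: F401
            except ImportError:
                raise ImportError("pickle protocol 5 requires the pickle5 "
                                  "backport on Python 3.6 and 3.7")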

This is all being tested, by the way. It yields a drop in coverage because some tests are mutually exclusive depending on the Python version, while the coverage report is generated using only Python 3.6.


ogrisel commented Oct 5, 2018

This is all being tested, by the way. It yields a drop in coverage because some tests are mutually exclusive depending on the Python version, while the coverage report is generated using only Python 3.6.

We might want to enable coverage collection on Python 3.5 for this reason, no?

@@ -666,7 +667,7 @@ class user_def_subcls(np.void):
class TestPickling(object):

def check_pickling(self, dtype):
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
for proto in range(2, pickle.HIGHEST_PROTOCOL+1):
Contributor

You changed the style of the spacing around the operator for no reason. Please stay consistent with the existing code style.

Contributor Author

I'll go over all the range(...) iterations on the pickle protocol and change them all to a single consistent style.

protocol=5)

assert b'frombuffer' in arr_without_object_bytes_string
assert b'frombuffer' not in arr_with_object_bytes_string
Contributor

This is an implementation detail that is not worth testing, in my opinion. Instead, please check that you can load the pickles and that the dtypes are correct.
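
Something along these lines, for instance (the *_bytes_string names come from the diff above; the original array names are assumed):

arr_with_object_loaded = pickle.loads(arr_with_object_bytes_string)
arr_without_object_loaded = pickle.loads(arr_without_object_bytes_string)
assert arr_with_object_loaded.dtype == arr_with_object.dtype
assert arr_without_object_loaded.dtype == arr_without_object.dtype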

@tylerjereddy

codecov aggregates all the coverage stats and reports the union of coverage reports.

We'd just have to be a little cautious, since some projects (e.g., SciPy) appear to have reached the codecov data limit, and then nothing works.


mattip commented Oct 8, 2018

Now that #12090 and #12091 have been merged, this should be rebased onto master and squashed to a smaller number (1?) of relevant commits.


ogrisel commented Oct 9, 2018

Also, the changes to ndarray.dump(s) should be removed from this PR.

@pierreglaser pierreglaser force-pushed the implement-reduce-ex branch 2 times, most recently from 37dd980 to 4084dec, on October 10, 2018 at 12:17

pierreglaser commented Oct 10, 2018

UPDATE:
Just rebased the PR. I made 2 commits though (and not one), because there were still some pickle-related cleanups to do that were not included in #12090. Most notably:

if sys.version_info[0] >= 3:
    if sys.version_info[1] in (6, 7):
        try:
            import pickle5 as pickle
        except ImportError:
            import pickle
    else:
        import pickle
else:
    import cPickle as pickle

Each time we need pickle in Python code, we now import it once this way, in numpy.core.numeric, and every other file that uses pickle imports it from numpy.core.numeric. The only place where pickle is imported differently is numpy/core/setup.py.

Now, numpy should be clean pickle-wise, meaning:

  • there are no more np.ndarray.dump(s) calls in the test suite
  • the pickle import pattern is consistent
  • each time pickle.dumps is tested, it is tested for every protocol

This is what the first commit is about.

The second commit implements array_reduce_ex, as well as some tests checking that pickle protocol 5 behaves correctly on numpy arrays. This PR can thus easily be broken up into two smaller PRs, one for each commit.
As a precaution, I created a PR (#12133) for the first commit only, on which this branch can be further rebased.

Happy to hear your suggestions,
Pierre


charris commented Oct 10, 2018

Needs a release note.

@charris charris added the component: numpy._core, component: numpy.ma (masked arrays) and 56 - Needs Release Note (needs an entry in doc/release/upcoming_changes) labels on Oct 10, 2018
@charris charris removed the 56 - Needs Release Note label on Oct 10, 2018
@charris charris merged commit 2ed08ba into numpy:master Oct 10, 2018

charris commented Oct 10, 2018

Any thoughts about adding this to the NumPy API at some point?


ogrisel commented Oct 11, 2018

Great work. Thanks everyone!

@pitrou are you confident that PEP 574 will be accepted and your protocol 5 branch be merged in CPython before the release of Python 3.8? Otherwise some assumptions made in the code of this PR might be slightly off although they should not cause any breakage (I believe).

Any thoughts about adding this to the NumPy API at some point?

Can you be more specific? I don't think I understand what you are suggesting.


pitrou commented Oct 11, 2018

@pitrou are you confident that PEP 574 will be accepted and your protocol 5 branch be merged in CPython before the release of Python 3.8?

I don't think the PEP will be controversial. The main uncertainty at this point is around the schedule for choosing and putting the new governance model in place. But I don't expect 3.8 to be feature-frozen before that is done anyway :-)

array = np.arange(10)
buffers = []
bytes_string = pickle.dumps(array, buffer_callback=buffers.append,
protocol=5)
@ogrisel ogrisel Oct 11, 2018

@pitrou the current draft of the PEP states, in the buffer_callback example, that the callback can accept a list of buffers via buffer_callback=buffers.extend. However, the implementation we test here only accepts a single buffer at a time. This is in line with this section of the PEP: https://www.python.org/dev/peps/pep-0574/#passing-a-sequence-of-buffers-in-buffer-callback

I think the example in the PEP should be fixed.

Member

Thanks! You're right, I forgot to fix this.


ogrisel commented Oct 11, 2018

For reference, using the current numpy master and the latest stable release of pandas, I confirm that pickle protocol 5 makes it possible to have no-copy semantics for pandas dataframes without changing anything in pandas:

>>> import pandas as pd
>>> import numpy as np
>>> import pickle5 as pickle
>>> df = pd.DataFrame({'a': np.random.randn(3), 'b': np.random.randint(0, 2, 3)})
>>> buffers = []
>>> df_pkl = pickle.dumps(df, protocol=5, buffer_callback=buffers.append)
>>> buffers
[<pickle.PickleBuffer object at 0x7f32f7eea5c8>, <pickle.PickleBuffer object at 0x7f32f7e5f848>]
>>> pickle.loads(df_pkl, buffers=buffers)
          a  b
0 -0.577441  1
1  0.168650  1
2  0.783261  0


pitrou commented Oct 11, 2018

@ogrisel This is great :-) I suspect @mrocklin will be interested.
