Discussion:
[Numpy-discussion] FeatureRequest: support for array construction from iterators
Stephan Sahm
2015-11-27 10:37:17 UTC
Permalink
[ this request/discussion refers to numpy issue #5863,
https://github.com/numpy/numpy/pull/5863#issuecomment-159738368 ]


Dear all,

As far as I can tell, the expected functionality of np.array(...) would be
np.array(list(...)) or something even nicer.
Therefore, I would like to request generator/iterator support for
np.array(...), to the extent that list(...) supports it.


More detailed reasoning behind this follows below.


In general it seems possible to identify iterators/generators as needed for
this purpose:
- someone actually implemented this feature already (see #5863,
  https://github.com/numpy/numpy/pull/5863#issuecomment-159738368)
- there are ``types.GeneratorType`` and ``collections.abc.Iterator`` for an
  ``isinstance(...)`` check
- numpy can already distinguish them from all the other types that translate
  well into a numpy array

Given this, I think the general argument goes roughly like the following:

PROS (affecting maybe 10% of numpy users or more):
- more intuitive overall behaviour, roughly array(...) = array(list(...))
- python3 compatibility (see e.g. #5951
  <https://github.com/numpy/numpy/issues/5951>)
- compatibility with the analogous ``__builtin__`` functions (see e.g. #5756
  <https://github.com/numpy/numpy/issues/5756>)
- all of the above make numpy easier to use in an interactive style
  (e.g. ipython --pylab) (computation time is not that important here, but
  coding time is)

CONS (affecting less than 0.1% of numpy users, I would guess):
- might break existing code

Which, in total, at least for me at this stage, speaks in favour of merging
the already existing feature branch (see #5863,
https://github.com/numpy/numpy/pull/5863#issuecomment-159738368) or
something similar into numpy master.

Discussion, please!
cheers,
Stephan
Alan G Isaac
2015-11-27 13:18:59 UTC
Permalink
Post by Stephan Sahm
I would like to request generator/iterator support for np.array(...), to the extent that list(...) supports it.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromiter.html
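
For illustration, a minimal sketch of using it (fromiter consumes the
iterator in a single pass; dtype is required, count is optional but lets
numpy preallocate):

import numpy as np

gen = (i * i for i in range(5))
a = np.fromiter(gen, dtype=np.int64, count=5)
# -> array([ 0,  1,  4,  9, 16])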

hth,
Alan Isaac
Stephan Sahm
2015-12-11 22:27:27 UTC
Permalink
numpy.fromiter is not numpy.array, nor does it work like
numpy.array(list(...)), since the dtype argument is required.

Is there a reason why np.array(...) should not work on iterators? I have
the feeling that such requests get (repeatedly) dismissed, but so far I
haven't found a compelling argument for leaving this feature out (as a
reminder, it is already implemented in a branch).

Please let me know if you know of such an argument.
best,
Stephan
Nathaniel Smith
2015-12-11 23:12:00 UTC
Permalink
Constructing an array from an iterator is fundamentally different from
constructing an array from an in-memory data structure like a list,
because in the iterator case it's necessary to either use a
single-pass algorithm or else create extra temporary buffers that
cause much higher memory overhead. (Which is undesirable given that
iterators are mostly used exactly in the case where one wants to
reduce memory overhead.)

np.fromiter requires the dtype= argument because this is necessary if
you want to construct the array in a single pass.

np.array(list(iter)) can avoid the dtype argument, because it creates
that large memory buffer. IMO this is better than making
np.array(iter) internally call list(iter) or equivalent, because the
workaround (adding an explicit call to list()) is trivial, while also
making it obvious to the user what the actual cost of their request
is. (Explicit is better than implicit.)
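
As a rough sketch of the trade-off with an ordinary generator:

import numpy as np

# single pass, no intermediate list, but the dtype must be known up front
a = np.fromiter((i * i for i in range(1000)), dtype=np.float64)

# dtype is inferred from the data, but the generator is first materialized
# as a temporary list
b = np.array(list(i * i for i in range(1000)))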

In addition, the proposed API has a number of infelicities:
- We're generally trying to *reduce* the magic in functions like
np.array (e.g. the discussions of having less magic for lists with
mismatched numbers of elements, or non-list sequences)
- There's a strong convention in Python that when making a function like
np.array generic, it should accept any iter*able* rather than any
iter*ator*. But it would be super confusing if np.array({1: 2})
returned array([1]), or if array("foo") returned array(["f", "o",
"o"]), so we don't actually want to handle all iterables the same.
It's somewhat dubious even for iterators (e.g. someone might want to
create an object array containing an iterator...)...
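
For reference, current numpy (under Python 3) treats both the dict and the
string as single scalar-like values (zero-dimensional arrays) rather than
iterating over them:

>>> np.array({1: 2})
array({1: 2}, dtype=object)
>>> np.array("foo")
array('foo', dtype='<U3')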

hope that helps,
-n
--
Nathaniel J. Smith -- http://vorpus.org
Juan Nunez-Iglesias
2015-12-12 07:32:59 UTC
Permalink
Nathaniel,
Post by Nathaniel Smith
IMO this is better than making np.array(iter) internally call list(iter)
or equivalent

Yeah but that's not the only option:

from itertools import chain
import numpy as np

def fromiter_awesome_edition(iterable):
    elem = next(iterable)
    # placeholder for numpy's usual dtype inference, e.g. np.asarray(elem).dtype
    dtype = whatever_numpy_does_to_infer_dtypes_from_lists(elem)
    return np.fromiter(chain([elem], iterable), dtype=dtype)

I think this would be a huge win for usability. I'm always getting tripped up
by the dtype requirement. I can submit a PR if people like this pattern.

btw, I think np.array(['f', 'o', 'o']) would be exactly the expected result
for np.array('foo'), but I guess that's just me.

Juan.
Nathaniel Smith
2015-12-12 08:00:04 UTC
Permalink
On Fri, Dec 11, 2015 at 11:32 PM, Juan Nunez-Iglesias wrote:
Post by Juan Nunez-Iglesias
Nathaniel,
IMO this is better than making np.array(iter) internally call list(iter)
or equivalent
from itertools import chain
def fromiter_awesome_edition(iterable):
    elem = next(iterable)
    dtype = whatever_numpy_does_to_infer_dtypes_from_lists(elem)
    return np.fromiter(chain([elem], iterable), dtype=dtype)
I think this would be a huge win for usability. Always getting tripped up by
the dtype requirement. I can submit a PR if people like this pattern.
This isn't the semantics of np.array, though -- np.array will look at
the whole input and try to find a common dtype, so this can't be the
implementation for np.array(iter). E.g. try np.array([1, 1.0])
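
Concretely (the whole-input scan is what promotes the result to float here;
looking only at the first element would have suggested an integer dtype):

>>> np.array([1, 1.0])
array([ 1.,  1.])
>>> np.array([1, 1.0]).dtype
dtype('float64')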

I can see an argument for making the dtype= argument to fromiter
optional, with a warning in the docs that it will guess based on the
first element and that you should specify it if you don't want that.
It seems potentially a bit error prone (in the sense that it might
make it easier to end up with code that works great when you test it
but then breaks later when something unexpected happens), but maybe
the usability outweighs that. I don't use fromiter myself so I don't
have a strong opinion.
Post by Juan Nunez-Iglesias
btw, I think np.array(['f', 'o', 'o']) would be exactly the expected result
for np.array('foo'), but I guess that's just me.
In general np.array(thing_that_can_go_inside_an_array) returns a
zero-dimensional (scalar) array -- np.array(1), np.array(True), etc.
all work like this, so I'd expect np.array("foo") to do the same.
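
For example, all of these are zero-dimensional:

>>> np.array(1).shape, np.array(True).shape, np.array("foo").shape
((), (), ())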

-n
--
Nathaniel J. Smith -- http://vorpus.org
Juan Nunez-Iglesias
2015-12-12 23:02:06 UTC
Permalink
Hey Nathaniel,

Fascinating! Thanks for the primer! I didn't know that it would check the
dtype of the values across the whole array. In that case, I would agree that
it would be bad to infer it magically from just the first value, and this
can be left to the users.

Thanks!

Juan.
Benjamin Root
2015-12-14 15:56:02 UTC
Permalink
Devil's advocate here: np.array() has become the de-facto "constructor" for
numpy arrays. Right now, passing it a generator results in what, IMHO, is a
rather unhelpful result:

>>> np.array((i for i in range(10)))
array(<generator object <genexpr> at 0x7f28b2beca00>, dtype=object)
>>> np.array((i for i in range(10)), dtype=np.int_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: long() argument must be a string or a number, not 'generator'

Therefore, I think it is not out of the realm of reason that passing a
generator object and a dtype could then delegate the work under the hood to
np.fromiter()? I would even go so far as to raise an error if one passes a
generator without specifying dtype to np.array(). The point is to reduce
the number of entry points for creating numpy arrays.
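
A rough sketch of what that delegation could look like (a hypothetical
wrapper for illustration only -- array_or_fromiter is a made-up name, not an
actual numpy entry point):

import types
import numpy as np

def array_or_fromiter(obj, dtype=None, **kwargs):
    # delegate generators to np.fromiter, and insist on an explicit dtype
    if isinstance(obj, types.GeneratorType):
        if dtype is None:
            raise TypeError("constructing an array from a generator "
                            "requires an explicit dtype")
        return np.fromiter(obj, dtype=dtype)
    # everything else goes through the normal np.array machinery
    return np.array(obj, dtype=dtype, **kwargs)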


By the way, any reason why this works?
>>> np.array(xrange(10))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


Cheers!
Ben Root
Robert Kern
2015-12-14 17:38:22 UTC
Permalink
Post by Benjamin Root
By the way, any reason why this works?
>>> np.array(xrange(10))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
It's not a generator. It's a true sequence that just happens to have a
special implementation rather than being a generic container.
>>> len(xrange(10))
10
>>> xrange(10)[5]
5
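
By contrast, a generator has neither len() nor indexing, which is roughly
why np.array cannot size it up front:

>>> g = (i for i in xrange(10))
>>> len(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'generator' has no len()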

--
Robert Kern
Benjamin Root
2015-12-14 17:41:45 UTC
Permalink
Heh, never noticed that. Was it implemented more like a generator/iterator
in older versions of Python?

Thanks,
Ben Root
Robert Kern
2015-12-14 17:49:29 UTC
Permalink
Post by Benjamin Root
Heh, never noticed that. Was it implemented more like a
generator/iterator in older versions of Python?

No, it predates generators and iterators so it has always had to be
implemented like that.

--
Robert Kern
Stephan Sahm
2015-12-15 07:08:07 UTC
Permalink
I would like to further push Benjamin Root's suggestion:

"Therefore, I think it is not out of the realm of reason that passing a
generator object and a dtype could then delegate the work under the hood to
np.fromiter()? I would even go so far as to raise an error if one passes a
generator without specifying dtype to np.array(). The point is to reduce
the number of entry points for creating numpy arrays."

would this be ok?
Stephan Sahm
2016-01-19 19:33:27 UTC
Permalink
Just so this doesn't disappear into a black hole - what about integrating
fromiter into array? (see the post by Benjamin Root)

For me personally, taking the first element to deduce the dtype would be a
perfect default way to read generators. If one wants a specific other dtype,
one could specify it as in the current fromiter method.
Antony Lee
2016-02-18 18:13:56 UTC
Permalink
Actually, while working on https://github.com/numpy/numpy/issues/7264 I
realized that the memory efficiency (one-pass) argument is simply incorrect:

import numpy as np

class A:
    def __getitem__(self, i):
        print("A get item", i)
        return [np.int8(1), np.int8(2)][i]
    def __len__(self):
        return 2

print(repr(np.array(A())))

This prints out

A get item 0
A get item 1
A get item 2
A get item 0
A get item 1
A get item 2
A get item 0
A get item 1
A get item 2
array([1, 2], dtype=int8)

i.e. the sequence is "turned into a concrete sequence" no less than 3 times.

Antony