[Numpy-discussion] loadtxt and usecols

Discussion:

Irvin Probst

2015-11-09 09:15:04 UTC

Hi,
I've recently seen many students, coming from Matlab, struggling with
the usecols argument of loadtxt. Most of them tried something like
loadtxt("foo.bar", usecols=2), and the ones with better documentation
reading skills tried loadtxt("foo.bar", usecols=(2)), but none of them
understood they had to write usecols=[2] or usecols=(2,).

Is there a policy in numpy stating that this kind of argument must be
a sequence? I think that being able to pass an int or a sequence when a
single column is needed would make this function a bit more user
friendly for beginners. I would gladly submit a PR if no one disagrees.

Regards.

--
Irvin
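
A minimal sketch of the pitfall described above. The in-memory text stands in for "foo.bar" and its contents are made up for illustration; the calls use one-element sequences, which is the form the message says is required.

```python
import io
import numpy as np

# Hypothetical three-column data standing in for "foo.bar".
text = "1 2 3\n4 5 6\n7 8 9\n"

# Parentheses alone do not make a tuple: (2) is just the int 2.
print(type((2)))    # <class 'int'>
print(type((2,)))   # <class 'tuple'>

# A one-element sequence selects the third column (index 2).
col_tuple = np.loadtxt(io.StringIO(text), usecols=(2,))
col_list = np.loadtxt(io.StringIO(text), usecols=[2])
print(col_tuple)    # [3. 6. 9.]
```

The trailing comma, not the parentheses, is what turns `(2,)` into a sequence.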

Benjamin Root

2015-11-09 18:42:49 UTC

My personal rule for flexible inputs like that is that they should be
encouraged so long as they do not introduce ambiguity. Furthermore,
allowing a scalar as an input doesn't add a cognitive disconnect for the
user about how to specify multiple columns. Therefore, I'd give this a +1.

_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion

Ralf Gommers

2015-11-09 19:36:57 UTC

There isn't. In many/most cases it's array_like, which means scalar,
sequence or array.

Post by Irvin Probst
I think that being able to pass an int or a sequence when a single column is
needed would make this function a bit more user friendly for beginners. I
would gladly submit a PR if no one disagrees.

+1

Ralf

Sebastian Berg

2015-11-10 08:19:33 UTC

Post by Ralf Gommers
There isn't. In many/most cases it's array_like, which means scalar,
sequence or array.

Agreed. I think we have, or should have, two types of arguments there
(well, three, since we certainly also have "must be a sequence"):
arguments such as "axes", where there is typically just one value, so we
allow a scalar, but which can often be generalized to a sequence; and
arguments that are array-likes (and broadcast).

So, if this is an array-like, however, the "correct" result could be
different by broadcasting between `1` and `(1,)` analogous to indexing
the full array with usecols:

usecols=1 result:
array([2, 3, 4, 5])

usecols=(1,) result [1]:
array([[2, 3, 4, 5]])

since a scalar row (so just one row) is read and not a 2D array. I tend
to say it should be an array-like argument and not a generalized
sequence argument, just wanted to note that, since I am not sure what
matlab does.

- Sebastian

[1] could go further and do `usecols=[[1]]` and get
`array([[[2, 3, 4, 5]]])`
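
The indexing analogy can be sketched with plain NumPy column indexing. The array below is made up, and this says nothing about what loadtxt itself returns; note also that the extra axis lands after the row axis here, whereas the outputs shown above put it first. The point is only how the number of dimensions follows the shape of the index.

```python
import numpy as np

# Made-up data: 4 rows, 3 columns; column 1 holds 2, 3, 4, 5.
full = np.array([[1, 2, 9],
                 [1, 3, 9],
                 [1, 4, 9],
                 [1, 5, 9]])

scalar_pick = full[:, 1]       # scalar index drops the column axis
list_pick = full[:, [1]]       # one-element list keeps a length-1 axis
nested_pick = full[:, [[1]]]   # nested list adds yet another axis

print(scalar_pick.shape)  # (4,)
print(list_pick.shape)    # (4, 1)
print(nested_pick.shape)  # (4, 1, 1)
```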


Irvin Probst

2015-11-10 09:24:57 UTC

Post by Sebastian Berg
since a scalar row (so just one row) is read and not a 2D array. I tend
to say it should be an array-like argument and not a generalized
sequence argument, just wanted to note that, since I am not sure what
matlab does.

Hi,
By default Matlab reads everything, silently fails on what can't be
converted into a float and the user has to guess what was read or not.
Say you have a file like this:

2010-01-01 00:00:00 3.026
2010-01-01 01:00:00 4.049
2010-01-01 02:00:00 4.865

M=load('CONCARNEAU_2010.txt');
M(1:3,:)

ans =

1.0e+03 *

2.0100 0 0.0030
2.0100 0.0010 0.0040
2.0100 0.0020 0.0049

I think this is a terrible way of doing it even if newcomers might find
this handy. There are of course optional arguments (even regexps!) but
to my knowledge almost no Matlab user even knows these arguments are there.

Anyway, I made a PR here https://github.com/numpy/numpy/pull/6656 with
usecols as an array-like.

Regards.
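
For comparison, a small sketch of how np.loadtxt treats the same three sample lines. The data is fed from memory rather than from CONCARNEAU_2010.txt, and the variable names are illustrative.

```python
import io
import numpy as np

sample = (
    "2010-01-01 00:00:00 3.026\n"
    "2010-01-01 01:00:00 4.049\n"
    "2010-01-01 02:00:00 4.865\n"
)

# Reading every column as float fails loudly on the date column.
try:
    np.loadtxt(io.StringIO(sample))
    failed = False
except ValueError:
    failed = True

# Selecting only the numeric column works.
values = np.loadtxt(io.StringIO(sample), usecols=(2,))
print(failed)   # True
print(values)   # [3.026 4.049 4.865]
```

Unlike the Matlab session above, nothing is silently converted: the caller has to say which columns are numeric.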

Sebastian Berg

2015-11-10 13:17:32 UTC

Post by Irvin Probst

Anyway, I made a PR here https://github.com/numpy/numpy/pull/6656 with
usecols as an array-like.

Actually, it is the "sequence special case" type ;). (Matlab does not
have this, since Matlab always returns 2-D, I realized.)

As I said, if usecols is like indexing, the result should mimic:

arr = np.loadtxt(f)
arr = arr[usecols]

in which case a 1-D array is returned if you pass a scalar as
usecols (and you could even generalize usecols to higher dimensional
array-likes).
The way you implemented it (which is fine, but I want to stress that
there is a real decision being made here), you always see it as a
sequence but allow a scalar for convenience (i.e. always return a 2-D
array). It is a `sequence of ints or int` type argument and not an
array-like argument in my opinion.

- Sebastian
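
The two interpretations can be contrasted in a short sketch. Both helper functions are hypothetical and operate on an already loaded array; neither is the code from the PR, and both deliberately ignore the squeezing that loadtxt applies afterwards.

```python
import numpy as np

def select_like_indexing(arr, usecols):
    # Interpretation 1: usecols behaves like a column index,
    # so a scalar drops the column axis.
    return arr[:, usecols]

def select_as_sequence(arr, usecols):
    # Interpretation 2: usecols is a sequence of ints; a bare int
    # is wrapped for convenience, so the result is always 2-D.
    if isinstance(usecols, (int, np.integer)):
        usecols = [usecols]
    return arr[:, list(usecols)]

data = np.arange(12).reshape(4, 3)  # made-up stand-in for a loaded file

print(select_like_indexing(data, 1).shape)     # (4,)
print(select_like_indexing(data, (1,)).shape)  # (4, 1)
print(select_as_sequence(data, 1).shape)       # (4, 1)
print(select_as_sequence(data, (1,)).shape)    # (4, 1)
```

Under the first reading `1` and `(1,)` mean different things; under the second they are interchangeable.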


Irvin Probst

2015-11-10 15:07:13 UTC

Post by Sebastian Berg
Actually, it is the "sequence special case" type ;). (matlab does not
have this, since matlab always returns 2-D I realized).
arr = np.loadtxt(f)
arr = arr[usecols]
in which case a 1-D array is returned if you put in a scalar into
usecols (and you could even generalize usecols to higher dimensional
array-likes).
The way you implemented it -- which is fine, but I want to stress that
there is a real decision being made here --, you always see it as a
sequence but allow a scalar for convenience (i.e. always return a 2-D
array). It is a `sequence of ints or int` type argument and not an
array-like argument in my opinion.

I think we have two separate problems here:

The first one is whether loadtxt should always return a 2D array or
whether it should match the shape of the usecols argument. From a CS
guy's point of view I do understand your concern here. Now from a
teacher's point of view I know many people expect to get a "matrix"
(thank you Matlab...) and the "purity" of matching the dimension of the
usecols variable will be seen by many people [1] as nerdy, useless
heaviness no one cares about (no offense). So whatever you, seasoned
numpy devs from this mailing list, decide, I think it should be explained
in the docstring with very clear wording.

My own opinion on this first problem is that loadtxt() should always
return a 2D array, no less, no more. If I write np.loadtxt(f)[42] it
means I want to read the whole file and then I explicitly ask for
transforming the 2-D array loadtxt() returned into a 1-D array. On the
other hand, if I write loadtxt(f, usecols=42) it means I don't want to
read the other columns and I want only this one, but it does not mean
that I want to change the returned array from 2-D to 1-D. I know this
new behavior might break a lot of existing code as usecols=(42,) used to
return a 1-D array, but usecols=((((42,)))) also returns a 1-D array so
the current behavior is not consistent imho.

The second problem is about the wording in the docstring. When I see
"sequence of int or int" I understand I will have to cast into a 1-D
python list whatever wicked N-dimensional object I use to store my
column indexes, or hope list(my_object) will do it fine. On the other
hand when I read "array-like" the function is telling me I don't have to
worry about my object: as long as numpy knows how to cast it into an
array it will be fine.

Anyway I think something like that:

import numpy as np
a=[[[2,],[],[],],[],[],[]]
foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a)

should just work and return me a 2-D (or 1-D if you like) array with the
data I asked for, and I don't think "a" here is an int or a sequence of
int (but it's a good example of why loadtxt() should not match the shape
of the usecols argument).

To make it short, let the reading function read the data in a consistent
and predictable way and then let the user explicitly change the data's
shape into anything he likes.

Regards.

[1] read: non-CS people trying to switch to numpy/scipy

Benjamin Root

2015-11-10 15:24:40 UTC

Just pointing out np.loadtxt(..., ndmin=2) will always return a 2D array.
Notice that without that option, the result is effectively squeezed. So if
you don't specify that option, and you load up a CSV file with only one
row, you will get a very differently shaped array than if you load up a CSV
file with two rows.

Ben Root
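
A small sketch of the squeezing behavior described above, using made-up whitespace-separated data held in memory instead of a CSV file on disk.

```python
import io
import numpy as np

one_row = "1 2 3\n"
two_rows = "1 2 3\n4 5 6\n"

squeezed = np.loadtxt(io.StringIO(one_row))        # row axis squeezed away
regular = np.loadtxt(io.StringIO(two_rows))        # ordinary 2-D result
kept = np.loadtxt(io.StringIO(one_row), ndmin=2)   # row axis preserved
column = np.loadtxt(io.StringIO(two_rows), usecols=(1,), ndmin=2)

print(squeezed.shape)  # (3,)
print(regular.shape)   # (2, 3)
print(kept.shape)      # (1, 3)
print(column.shape)    # (2, 1)
```

With ndmin=2 the number of dimensions no longer depends on how many rows or columns happen to be read.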


Sebastian Berg

2015-11-10 15:57:26 UTC

Post by Benjamin Root
Just pointing out np.loadtxt(..., ndmin=2) will always return a 2D
array. Notice that without that option, the result is effectively
squeezed. So if you don't specify that option, and you load up a CSV
file with only one row, you will get a very differently shaped array
than if you load up a CSV file with two rows.

Oh, well I personally think that default squeeze is an abomination :).

Anyway, I just wanted to point out that these are two different possible
logics, and we have to pick one.
I have a slight preference for the indexing/array-like interpretation,
but I am aware that from a usage point of view the sequence one is
likely better.
I could throw in another option: raise an explicit error instead of the
generic one.

Anyway, I *really* do not have an opinion about what is better.

Array-like would only suggest that you also accept buffer interface
objects or __array_interface__ objects, which in this case is really
unnecessary I think.

- Sebastian


Daπid

2015-11-10 15:52:52 UTC

Post by Irvin Probst
I know this new behavior might break a lot of existing code as
usecols=(42,) used to return a 1-D array, but usecols=((((42,)))) also
returns a 1-D array so the current behavior is not consistent imho.

((((42,)))) is exactly the same as (42,). If you want a tuple of tuples,
you have to write ((42,),), but then it raises: TypeError: list indices must
be integers, not tuple.

What numpy cares about is that whatever object you give it is iterable, and
that its entries are ints, so usecols={0:'a', 5:'b'} is perfectly valid.
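
Both points can be checked in plain Python, without involving loadtxt at all. The variable names below are illustrative.

```python
# Redundant parentheses do not add nesting; only the trailing comma makes a tuple.
assert ((((42,)))) == (42,)

# A genuine tuple of tuples needs a comma at each level.
nested = ((42,),)
print(type(nested[0]))  # <class 'tuple'>

# Iterating over a dict yields its keys, so it looks like a sequence of ints.
cols = {0: 'a', 5: 'b'}
print(list(cols))  # [0, 5]
```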

I think loadtxt should be a tool to read text files in the least surprising
fashion, and a text file is a 1 or 2D container, so it shouldn't return any
other shapes. Any fancy stuff one may want to do with the output should be
done with the typical indexing tricks. If I want a single column, I would
first be very surprised if I got a 2D array (I was bitten by this design in
MATLAB many many times). For the rare cases where I do want a "fake" 2D
array, I can make it explicit by expanding it with arr[:, np.newaxis], and
then I know that the shape will be (N, 1) and not (1, N). Thus, usecols
should be int or sequence of ints, and the result 1 or 2D.
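
A minimal sketch of the explicit expansion mentioned above, using a made-up 1-D column in place of a loadtxt result.

```python
import numpy as np

col = np.array([3.026, 4.049, 4.865])  # made-up 1-D column

as_column = col[:, np.newaxis]   # new axis after the data: shape (N, 1)
as_row = col[np.newaxis, :]      # new axis before the data: shape (1, N)

print(as_column.shape)  # (3, 1)
print(as_row.shape)     # (1, 3)
```

Writing the expansion out makes the orientation of the "fake" 2D array visible in the code.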

In your example:

a=[[[2,],[],[],],[],[],[]]
foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a)

What would the shape of foo be?

/David.

Irvin Probst

2015-11-10 16:39:05 UTC

Post by Daπid
((((42,)))) is exactly the same as (42,) If you want a tuple of
tuples, you have to do ((42,),), but then it raises: TypeError: list
indices must be integers, not tuple.

My bad, I wrote that too fast, please forget this.

Post by Daπid
I think loadtxt should be a tool to read text files in the least
surprising fashion, and a text file is a 1 or 2D container, so it
shouldn't return any other shapes.

And I *do* agree with the "shouldn't return any other shapes" part of
your sentence. What I was trying to say, admittedly with a very bogus
example, is that either loadtxt() should always output an array whose
shape matches the shape of the object passed to usecols or it should
never do it, and I'm in favor of never.
I'm perfectly aware that what I suggest would break the current behavior
of usecols=(2,) so I know it does not have the slightest probability of
being accepted, but still, I think that the "least surprising fashion" is
to always return a 2-D array because for many, many, many people a text
data file has N lines and M columns and N=1 or M=1 is not a special case.

Anyway I will of course modify my PR according to any decision made here.

Post by Daπid
a=[[[2,],[],[],],[],[],[]]
foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a)
What would the shape of foo be?

Post by Irvin Probst
should just work and return me a 2-D (or 1-D if you like) array with
the data I asked for

So, 1-D or 2-D it is up to you, but as long as there is no ambiguity in
which columns the user is asking for it should imho work.

Regards.

Sebastian Berg

2015-11-11 17:38:50 UTC

Post by Irvin Probst

Post by Daπid
((((42,)))) is exactly the same as (42,) If you want a tuple of
tuples, you have to do ((42,),), but then it raises: TypeError: list
indices must be integers, not tuple.

My bad, I wrote that too fast, please forget this.

Post by Daπid
I think loadtxt should be a tool to read text files in the least
surprising fashion, and a text file is a 1 or 2D container, so it
shouldn't return any other shapes.

Sounds fine to me, and considering the squeeze logic (which I think is
unfortunate, but it is not something you can easily change), I would be
in favor of simply adding logic to accept a single integral argument and
otherwise not changing anything.
I am personally against the flattening and even the array-like logic [1]
currently in the PR; it seems like arbitrary generality for my taste
without any obvious application.

As said before, the other/additional thing that might be very helpful is
trying to give a more useful error message.

- Sebastian

[1] Almost all 1-d array-likes will be sequences/iterables in any case,
those that are not are so obscure that there is no point in explicitly
supporting them.
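
One way the "single integer or sequence, with a clearer error" idea could look is sketched below. The helper name and the wording of the message are assumptions made for illustration; this is not the code from the PR or NumPy's implementation.

```python
import operator

def normalize_usecols(usecols):
    """Return usecols as a list of ints, accepting a single int as well.

    Hypothetical helper for illustration only.
    """
    try:
        # operator.index accepts any integer-like object (int, NumPy ints).
        return [operator.index(usecols)]
    except TypeError:
        pass  # not a single integer; treat it as an iterable below
    cols = []
    for position, col in enumerate(usecols):
        try:
            cols.append(operator.index(col))
        except TypeError:
            raise TypeError(
                "usecols must be an int or a sequence of ints, but entry "
                f"{position} is of type {type(col).__name__}"
            ) from None
    return cols

print(normalize_usecols(2))        # [2]
print(normalize_usecols((0, 2)))   # [0, 2]
```

A nested value such as `[(42,)]` then fails with a message that names the offending entry, rather than with an indexing error raised deep inside the reader.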


Irvin Probst

2015-11-13 10:51:54 UTC

Post by Sebastian Berg
Sounds fine to me, and considering the squeeze logic (which I think is
unfortunate, but it is not something you can easily change), I would be
for simply adding logic to accept a single integral argument and
otherwise not change anything.
[...]
As said before, the other/additional thing that might be very helpful is
trying to give a more useful error message.

I've modified my PR to (hopefully) match these requests.
https://github.com/numpy/numpy/pull/6656

Regards.

--
Irvin