FIX: clean up unit conversion unpacking of data, particularly for dates and pandas series#11664
FIX: clean up unit conversion unpacking of data, particularly for dates and pandas series#11664jklymak wants to merge 3 commits into
Conversation
|
|
12b3b7f to
445047c
Compare
| "values must have same the length") | ||
| # check position | ||
| if hasattr(positions, 'values'): | ||
| positions = positions.values |
There was a problem hiding this comment.
Are there any better ways to check if something is a pandas Series?
In this case the following (admittedly very constructed) example would fail:
import numpy as np
import matplotlib.pyplot as plt
class Years(list):
values = True
data = np.random.rand(5,2)
years = Years([1999,2003])
plt.figure()
plt.boxplot(data, positions=years)
plt.show()
There was a problem hiding this comment.
Well, its a damned if you do, damned if you don't situation.
Other data types use the values convention as well (xarray for instance). The goal here was to unpack the pandas series w/o having to know what a pandas series is (there is no pandas import in the main library).
If someone tries to pass a class that has values meaning something else, then I guess they will have to unpack before they plot.
FWIW, my preference would be that we only support numpy arrays and basic python lists, and let users and/or downstream libraries do their own unpacking.
There was a problem hiding this comment.
I don't like the idea at all that only because some object has a values attribute, matplotlib now thinks that it should use it (and not the actual data in the object, e.g. provided by some iterator).
Or, if you think this is what should always happen, it should actually always happen. As it stands, matplotlib would use the iterator for usual numbers, but the values attribute for dates (#10638) or in boxplots (this PR).
There was a problem hiding this comment.
OK, agreed. Stand by for a more holistic solution - though potentially more controversial...
445047c to
fb7eb0b
Compare
b84c4b5 to
95e4e52
Compare
d87ac91 to
db394cb
Compare
|
ping @jorisvandenbossche as this is a redo of a PR you helped with previously... |
|
Maybe closes #5896 |
db394cb to
b16c311
Compare
|
Rebased, this is still ready to go and waiting for review. |
|
Looks okish, but I'm not deep enough into the numpy/pandas data formats to judge the implications. |
efiring
left a comment
There was a problem hiding this comment.
Maybe just a very minor change and some explanation or additional comments.
| # convert to one dimensional arrays | ||
| C = C.ravel() | ||
| coords = np.column_stack((X, Y)).astype(float, copy=False) | ||
| coords = np.column_stack((X, Y)).astype(float) |
There was a problem hiding this comment.
I don't see any point in removing the "copy=False" kwarg. It saves time, and it is logically what is wanted here. A copy was already made via the column_stack call, so nothing done to coords will affect X or Y.
| # some iterables need to be massaged a bit... | ||
| if munits.ConversionInterface.is_numlike(x): | ||
| return x | ||
| if isinstance(x, list) or isinstance(x, numbers.Number): |
There was a problem hiding this comment.
What is the rationale for returning a list as-is instead of running it through np.asarray? Just that the latter is unnecessary? Or are there cases where the conversion would be an error?
| return x | ||
| if isinstance(x, list) or isinstance(x, numbers.Number): | ||
| return x | ||
| if issubclass(type(x), np.ma.MaskedArray): |
There was a problem hiding this comment.
Why is this using issubclass(type(x)... instead of isinstance?
| return x | ||
| if issubclass(type(x), np.ma.MaskedArray): | ||
| return np.ma.asarray(x) | ||
| return np.asarray(x) |
There was a problem hiding this comment.
Is the point of this whole addition above to short-circuit some of what would otherwise happen in get_converter? I'm not opposed, just confused.
There was a problem hiding this comment.
it seems worse, because calling np.asarray can drop information in some cases, making it unavailable in the call to convert below. Unless I'm missing something...
| if ((isinstance(d, np.ndarray) and | ||
| np.issubdtype(d.dtype, np.datetime64)) or | ||
| isinstance(d, np.datetime64)): | ||
| return _dt64_to_ordinalf(d) |
There was a problem hiding this comment.
What case is this whole block covering? Again, I'm confused.
| data = np.random.rand(5, 2) | ||
| years = pd.date_range('1/1/2000', | ||
| periods=2, freq=pd.DateOffset(years=1)).year | ||
| # Does not work |
There was a problem hiding this comment.
Another place I am confused. Is the comment obsolete?
| @@ -1477,9 +1479,14 @@ def have_units(self): | |||
| return self.converter is not None or self.units is not None | |||
|
|
|||
| def convert_units(self, x): | |||
There was a problem hiding this comment.
This seems to be converting anything numlike, which includes things like pint.Quantity, into an array/masked array before calling convert. That seems like it would break pint, though I will note that our unit tests seem to be passing.
There was a problem hiding this comment.
Great, I’ll check vs pint when I look at the other comments above. Thanks for the reminder...
There was a problem hiding this comment.
Ok, I guess I still don't understand how all of this works. 😉 So is_numlike only returns true for things that are basic numbers (or contain them) and don't need converting. I think I'm ok with this code then. Tests passing is always a good thing too. 😄
|
OK, somewhat confused about everything I was thinking here, and some of the type checking has moved on since this was written. So I'm going to close. Note #13304 deregisters the pandas converters which I think was an important contribution here. Sorry for the reviewer effort on this... |
PR Summary
Closes #10022 and #11391
This does one small thing (run units conversion on the position argument of
bxp/boxplot) and two big things:np.asarrayon incoming objects to unpack numpy arrays from them. Previously in order to get native pandas support we were unpacking thevaluesfield, but runningnp.asarraydoes the same thing and allows the packed values to not necessarily be stored invalues. i.e.returns:
This re-does #10638, which unpacked pandas py checking for and using
values.datehandling should be able to strip pandas series of their date information and convert. This passes all the pandas tests, but I admit this may be a controversial step.Immediate fix code:
failed on master. This fixes...
PR Checklist