Refactor CompImageHDU to inherit from ImageHDU#15474
Conversation
|
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
|
|
👋 Thank you for your draft pull request! Do you know that you can use |
ed65a28 to
7f7ca75
Compare
mhvk
left a comment
There was a problem hiding this comment.
@astrofrog - I went over it relatively quickly and noticed mostly little imprementation things (particularly for _image_header_to_empty_bintable where quite some code can be removed since you are now initializing a new bintable).
Unfortunately, I feel rather less able to judge the main changes... But I do like the overall idea and think it would be good to proceed for 6.0 -- hopefully, the test coverage is substantial enough...
On the API changes, I think they are all fine (for a major version bump), but note that it is possible to ensure that isinstance(chdu, BinTableHDU) keeps working (using ABCMeta and __subclasshook__. Though I think this is not really needed.
| # Store a temporary backup of self.data in a different attribute; | ||
| # see below | ||
| self._imagedata = self.data | ||
| if self._bintable is not None: |
There was a problem hiding this comment.
I'm a bit confused about self._tmp_bintable -- could one not just create self._bintable if it does not yet exist? I.e., have tmp_bintable instead of self._tmp_bintable and here
if self._bintable is None:
self._bintable = tmp_bintable
else:
self._bintable.header = self._tmp_bintable.header
self._bintable.data = self._tmp_bintable.data
The difference is that if there was no bintable, there will now be one (since _postwriteto would likely do nothing -- which would seem fine (if not, a flag to remove it on self might still be a bit nicer)
There was a problem hiding this comment.
This is a philosophical/deliberate choice - if a CompImageHDU wasn't backed by a _bintable prior to writing, I was going with the idea that it should also not be after writing, i.e. writing the CompImageHDU should not have any side effects on CompImageHDU. Does this make sense or do you think after writing to a file the CompImageHDU should switch to being backed by a bintable? (this could result in changes to .data if there is any quantization).
There was a problem hiding this comment.
Once the data is fully loaded in memory, do we need to keep the bintable ?
There was a problem hiding this comment.
Yes we need to, at least in update mode, as if the data isn't changed, we want to write the bintable out as-is without recompressing it. If we dropped _bintable when the data is decompressed we'd also lose the ability for the user to use the compressed_data property. I would vote to keep it as-is for now and to see in a follow-up whether there are any benefits to dropping _bintable once .data is accessed under certain circumstances.
| """ | ||
| The tile shape used for the tiled compression. | ||
| # FIXME: determine how we can simplify or avoid the following methods. These | ||
| # seem to be needed for the checksum calculations to work properly. |
There was a problem hiding this comment.
Weirdly, coverage suggests the methods are not actually called. Though it is wrong so often that I've started to just ignore it...
| assert h2[1].header["CHECKSUM"] == "D2GXD29VD2EVD29V" | ||
| assert "DATASUM" in h2[1].header | ||
| assert h2[1].header["DATASUM"] == "113055149" | ||
| assert h2[1].header["DATASUM"] == "2189405276" |
There was a problem hiding this comment.
Does it make sense that both checksum and datasum have changed?
There was a problem hiding this comment.
I think checksum includes the data and the header so since the datasum has changed I think it makes sense that checksum changes too
There was a problem hiding this comment.
with this branch, comment are moved to the end and blank cards are removed:
❯ diff -u <(fitsheader tmp-ref.fits) <(fitsheader tmp-new.fits)
--- /proc/self/fd/11 2024-06-14 19:34:45.798001800 +0200
+++ /proc/self/fd/14 2024-06-14 19:34:45.798001800 +0200
@@ -1,12 +1,12 @@
-# HDU 0 in tmp-ref.fits:
+# HDU 0 in tmp-new.fits:
SIMPLE = T / conforms to FITS standard
BITPIX = 8 / array data type
NAXIS = 0 / number of array dimensions
EXTEND = T
-CHECKSUM= 'B3dAC2Z9B2dAB2Z9' / HDU checksum updated 2024-06-14T19:29:18
-DATASUM = '0 ' / data unit checksum updated 2024-06-14T19:29:18
+CHECKSUM= 'A3dAD2Z9A2bAA2Z9' / HDU checksum updated 2024-06-14T19:29:38
+DATASUM = '0 ' / data unit checksum updated 2024-06-14T19:29:38
-# HDU 1 in tmp-ref.fits:
+# HDU 1 in tmp-new.fits:
XTENSION= 'IMAGE ' / Image extension
BITPIX = 16 / My special comment
NAXIS = 2 / My special naxis comment
@@ -14,10 +14,6 @@
NAXIS2 = 300
PCOUNT = 0 / My special PCOUNT comment
GCOUNT = 1 / My special GCOUNT comment
-COMMENT FITS (Flexible Image Transport System) format defined in Astronomy and
-COMMENT Astrophysics Supplement Series v44/p363, v44/p371, v73/p359, v73/p365.
-COMMENT Contact the NASA Science Office of Standards and Technology for the
-COMMENT FITS Definition document #100 and other FITS information.
OBJECT = 'NGC 1316'
TELESCOP= 'Optical '
EQUINOX = 1950.0 / EPOCH OF RA DEC
@@ -34,34 +30,12 @@
CRPIX2 = 1.470000000000000E+02
CROTA2 = 0.000000000E+00 /
ORIGIN = 'AIPSGorilla (NRAOCV IPX) 15JUL94' /
-CHECKSUM= 'WdAaaZ5YZbAaaZ3Y' / HDU checksum updated 2024-06-14T19:29:18
-DATASUM = '113055149' / data unit checksum updated 2024-06-14T19:29:18
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+CHECKSUM= 'N7gSP4ZSN4fSN4ZS' / HDU checksum updated 2024-06-14T19:29:38
+DATASUM = '113055149' / data unit checksum updated 2024-06-14T19:29:38
+COMMENT FITS (Flexible Image Transport System) format defined in Astronomy and
+COMMENT Astrophysics Supplement Series v44/p363, v44/p371, v73/p359, v73/p365.
+COMMENT Contact the NASA Science Office of Standards and Technology for the
+COMMENT FITS Definition document #100 and other FITS information.
There was a problem hiding this comment.
DATASUM is different because it was previously computed on the bintable, and now it is computed on the image array. I think I remember you raised that question a while ago, though I don't remember the outcome.
There was a problem hiding this comment.
I see you mention this in the changelog entry:
Fix the image checksum/datasum in
CompImageHDU.headerto be those for the
image HDU instead of for the underlying binary table.
And the discussion I was remembering is #14753.
However I'm not convinced that CHECKSUM/DATASUM should be the ones of the (uncompressed) image.
https://github.com/astropy/astropy/blob/a32e8565405c88c333192ae4d3d3b43216fdd4be/docs/io/fits/appendix/history.rst#321-2014-03-04 mentions this in PyFITS 3.2.1:
Fixed checksums on compressed images, so that the ZHECKSUM and ZDATASUM contain a checksum of the original image HDU, while CHECKSUM and DATASUM contain checksums of the compressed image HDU. This feature was supposed to be supported in 3.2, but the support was buggy.
This seems more logical to me.
There was a problem hiding this comment.
Regarding the changelog entry, I think we do want CompImageHDU.header (which is the decompressed/image header) to contain the image checksum. This should then get stored as ZHECKSUM on disk. And if one opens the file disabling the image compression, BinTableHDU.header will contain the BinTableHDU checksum, stored as CHECKSUM on disk. Does this make sense?
I'll double check that this is the actual behaviour though.
|
@mhvk - thanks for the review! Will go through and implement shortly. @saimn - given that there are only 9 days left until feature freeze, I am thinking maybe it would be best not to rush this in, given that we are still catching up with bug reports from the refactoring of the compression C module. The release schedule now allows us to have 7.0.0 after 6.1.0 so maybe it makes sense to plan to merge this very soon after 6.1.0 is released to allow more time for people to try out the developer version and do follow-up PRs etc. So this would mean delaying merging this by just over 6 months, and then planning to have another major release in about a year. |
|
@astrofrog - yes, makes sense and seems a good plan. We should merge this kind of PR at the beginning of a release cycle so we have to test internally and fix a few things if needed. |
|
Hello everybody, When running from astropy.io import fits
import numpy as np
data = np.arange(21 * 33).reshape((21, 33)).astype(np.int32)
header = fits.Header()
hdu = fits.CompImageHDU(data, header, compression_type="RICE_1", tile_shape=(5, 6))
print(hdu.section[0, 1])The following error is thrown If I understand correctly this is due to But then I hit some new obstacles in |
I am under the impression that |
|
@astrofrog - I saw you answered some (most?) of the comments - let me know if this needs review again (since it is a large one, and I don't really remember anything from my last review, I'll wait till you are done!) |
|
@mhvk will do! I still need to address the remaining comments and push my changes but ran out of time for this week :) |
30d1eb5 to
c6e1874
Compare
1b218d3 to
ee51849
Compare
|
pre-commit.ci autofix |
|
doctest failure in oldest deps looks related: |
be79a57 to
a10f1d5
Compare
|
@cmarmo - thank you for investigating which bugs were fixed by this PR! I have now added regression tests for all these bugs in this PR and indeed most now pass. However, now it is #5999 which is still present, so I have xfailed that test. I started looking into it but it is a complex issue and not a regression in this PR, so I would like to keep it for a separate standalone PR. |
|
pre-commit.ci autofix |
|
This PR is now complete in terms of everything I want to include/fix, and all tests pass - with the exception of #15474 (comment). As mentioned in #15474 (comment) we will want to wait until 6.1 has been branched before merging this, but in the meantime it would be great to go through the review cycle so that we can have it ready to merge as soon as 6.1 is out, and maximize how long it lives in |
c8771e6 to
ae3a32d
Compare
|
@pllim - that failure should be fixed now. It was actually an important bug which wasn't caught by the main test suite so I have added a proper test too. |
|
Ok there are now a few other test failures - I clearly need to work on this more before a review, switching back to draft. |
…alled when writing out and will detect changes in the data shape
… a fileobj that is changing position between calls
bda155e to
a3543f1
Compare
|
(I rebased to pick up any recent CI changes/fixes) |
|
Didn't have time to answer yet, sorry, but while using this branch and working on something else I notice that fitsinfo with a big compressed file triggers a full decompression because we now inherit |
|
I've started working on dropping the internal table if the full data is decompressed, there are test failures which I'll look into later, just wanted to push the work so far to not lose it. |
e7f905f to
a3543f1
Compare
…ata work, and switch to assert_equal
…g decompressed, and replace _decompression_active with _data_loaded
|
@saimn - ok so I've spent some more time on this to try and address the remaining issues. First, I wasn't able to get rid of the bintable completely when the data is fully decompressed, because we need to keep a reference to the bintable to know things like
@saimn - does this seem reasonable? |
for more information, see https://pre-commit.ci
Compimagehdu inheritance
There was a problem hiding this comment.
I had a last look and I think we are good to go ! 🎉
Many thanks @astrofrog, not an easy one but really great to finally do this !
I just pushed a little change, removing the _data_modified attribute which is now unused. Let's wait for CI and if you can just cross check.
|
Ok, let's merge and close a few issues ! :) |
|
Thank you for the reviews and helping push this forward! |
|
Amazing work both @astrofrog and also @saimn for reviewing this monster, I am very excited to start being able to use this. |
This pull request addresses #9238 by refactoring
CompImageHDUto inherit fromImageHDUinstead ofBinTableHDU. This includes some subtle API changes that cannot be warned about in advance using deprecation warnings or future warnings, so this PR, if accepted, would definitely warrant a major version bump. Ideally this would go in v6.0 but it depends on whether we can achieve consensus and stamp out remaining issues.This PR isn't 100% ready (see 'Outstanding issues' below and FIXME comments in code) but I wanted to try and open this now in case it is something we can consider getting in before the v6.0 feature freeze. I doubt that the outstanding issues would cause significant code changes so it should be safe to already review things as-is.
Background
Compressed image HDUs are stored in FITS files as binary tables. The
CompImageHDUwas therefore developed as a a subclass ofBinTableHDU. However, that implementation is quite complex because it tries to look to users like anImageHDU- for example.headeris meant to be the image header, and.datais meant to be the decompressed image data. To quote comments from the code:Issue #9238 by @mhvk also suggested that
CompImageHDUwould be simpler if it inherited fromImageHDU.This PR does that refactoring, and represents
CompImageHDUas anImageHDUsubclass that internally stores aBinTableHDUrepresenting the compressed data. Instead of always keeping the image and table header in sync, headers are only converted at I/O time which should in principle make things faster.Overall, this results in a few hundred lines of code being removed. For now
CompImageHeaderis kept to avoid breaking API so that warnings about reserved keywords are emitted as soon as they are set. However, if we would be happy to 'break' this behavior and only warn at I/O time or when callingverify, we could get rid of the custom header subclass. If we want to keep the behavior, we can nevertheless try and avoid code duplication with theHeaderclass by ssimplifying the header validation (see #15293).API changes/fixes
I have tried to minimize API changes as much as possible. However, it is impossible to not have any changes at all:
With this PR, a compressed image HDU
chduwill no longer returnTrueforisinstance(chdu, BinTableHDU). There's no way to avoid this and also no way to do this with a gradual deprecation. However, I would hope that this is rare and that if anything people would doisinstance(chdu, CompImageHDU)which will still beTrue.The following options have changed defaults in the
CompImageHDUinitializer to matchImageHDU:uint=Truereplacesuint=False(I need to check if I can come up with an example that shows the difference here)scale_back=Nonereplacesscale_back=False(in practice I think this won't change anything)The checksum on the main header is now the image checksum not the binary table checksum (I believe this was a bug: Incorrect checksum/datasum in some cases for compressed FITS images? #14753)
Fixes:
CompImageHDU#12216Repeating for auto-close in less readable format:
Fixes #9238
Fixes #10512
Fixes #12216
Fixes #14081
Fixes #14611
Fixes #14753
Modified tests
I've tried to change tests as little as possible but there were some unavoidable changes:
chdu._headerwhich no longer exists (the equivalent ischdu._bintable.header.test_compressed_header_missing_znaxishas been removed - this checked that removingZNAXISorZBITPIXfrom the internal binary table caused accessing the compressed data property to fail. This doesn't seem needed because it relies on modifying private attributes onCompImageHDUand we now also generate the compressed header on-the-fly so there is no opportunity to remove the keywords before the data is compressed.test_compressed_header_double_extname- included some lines checking state of internal/private properties. Those checks have been removed as the test is no longer relevant for this API.Outstanding issues/open questions
CompImageHDU.shape(should be simple)mode='update'if a file is simply opened and closed (test_open_comp_image_in_update_modefailure). For now I have xfailed the test to make sure the rest of the CI works properly.For future
CompImageHeaderclass - as noted higher up we could potentially consider removing or simplifying this in future.compressed_dataproperty onCompImageHDU. It might be better to deprecate it and encourage people to open the file withdisable_image_compression? I doubt this property is used much. We can do this in a follow-up though.