Skip to content

Add faster implementation of _IonNature.ion_hash#22

Merged
popematt merged 2 commits intoamazon-ion:masterfrom
popematt:master
Sep 3, 2021
Merged

Add faster implementation of _IonNature.ion_hash#22
popematt merged 2 commits intoamazon-ion:masterfrom
popematt:master

Conversation

@popematt
Copy link
Copy Markdown
Contributor

@popematt popematt commented Sep 2, 2021

Issue #, if available:

Not necessarily a fix; at least an improvement for #21

Description of changes:

New implementation of .ion_hash() that does away with the overhead of instantiating ion_writers. As part of this change, I improved the performance of the _escape() function, which has the side effect of also benefitting the ion_hash_reader and the ion_hash_writer.

I did some very informal performance testing using this script.

                      Time to hash single value             Ops/Sec                   Approx
Test Case                 (old)       (new)           (old)             (new)       Improvement
===============================================================================================
ion_hash_python_21     2734.39 μs   389.30 μs      365.71 ops/s      2568.72 ops/s       7x
int                     106.84 μs     9.36 μs     9359.53 ops/s    106820.70 ops/s      11x
bool                    100.58 μs     6.17 μs     9942.43 ops/s    162009.17 ops/s      16x
null                     98.02 μs     7.11 μs    10201.86 ops/s    140577.22 ops/s      14x
string                  104.39 μs     7.59 μs     9579.21 ops/s    131760.86 ops/s      14x
decimal                 125.09 μs    16.29 μs     7994.50 ops/s     61396.71 ops/s       8x
clob                    102.30 μs     8.38 μs     9775.37 ops/s    119359.82 ops/s      12x
timestamp               178.36 μs    40.24 μs     5606.68 ops/s     24850.82 ops/s       4x
empty_struct            139.38 μs     7.16 μs     7174.65 ops/s    139569.14 ops/s      19x
struct_one_field        307.97 μs    38.24 μs     3247.12 ops/s     26150.63 ops/s       8x
struct_few_fields       572.92 μs    75.29 μs     1745.44 ops/s     13281.10 ops/s       8x
struct_many_fields     1735.05 μs   329.23 μs      576.35 ops/s      3037.42 ops/s       5x
empty_list              118.40 μs     6.21 μs     8445.82 ops/s    161004.19 ops/s      19x
long_list              2890.12 μs   180.54 μs      346.01 ops/s      5539.09 ops/s      16x
nested_containers       487.38 μs    19.84 μs     2051.80 ops/s     50411.10 ops/s      25x
one_annotation          265.96 μs    11.36 μs     3760.01 ops/s     87993.25 ops/s      23x
many_annotations        506.17 μs    20.03 μs     1975.62 ops/s     49921.85 ops/s      25x
very_long_string        246.75 μs    10.28 μs     4052.64 ops/s     97308.19 ops/s      24x
very_long_blob          304.76 μs    15.67 μs     3281.30 ops/s     63807.71 ops/s      19x

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@popematt popematt requested a review from cheqianh September 2, 2021 17:37
Copy link
Copy Markdown
Contributor

@cheqianh cheqianh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good!

Copy link
Copy Markdown

@tgregg tgregg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nice.

Comment thread ionhash/__init__.py Outdated
Comment thread ionhash/hasher.py
Comment on lines +481 to +483
return _bytes.replace(bytes([_ESCAPE_BYTE]), bytes([_ESCAPE_BYTE, _ESCAPE_BYTE])) \
.replace(_BEGIN_MARKER, bytes([_ESCAPE_BYTE, _BEGIN_MARKER_BYTE])) \
.replace(_END_MARKER, bytes([_ESCAPE_BYTE, _END_MARKER_BYTE]))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we avoid recomputing bytes([...]) on each call by storing the values?

Copy link
Copy Markdown
Contributor Author

@popematt popematt Sep 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. I have also heard that looking up non-local values can be expensive in Python, and if we want to compute once and store the value, then it would need to be a non-local value. I'll create a followup issue to see if it would improve performance.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine with me, although it does seem like a simple enough thing to try given that you already have a basic perf testing setup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants