BUG: Ensure SeedSequence 0-padding does not collide with spawn keys #16551
Conversation
Fixes #16539. The implicit 0-padding that is applied to small entropy inputs to bring them up to the size of the internal pool conflicts with the spawn keys, which start with an appended 0. In order to maintain stream compatibility with unspawned `SeedSequence`s, we explicitly 0-pad short inputs out to the pool size only when a spawn key is provided, which is the case that would trigger the bug. This should minimize the impact on users that were not encountering the bug.
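For reference, a minimal sketch of how the collision could be checked (the entropy value 42 is just illustrative; concrete outputs are omitted):

from numpy.random import SeedSequence

parent = SeedSequence(42)
child = parent.spawn(1)[0]   # the first child gets spawn_key == (0,)
# Before this fix, the child's assembled entropy hashed out to the same
# implicitly 0-padded pool as the parent's; with the fix applied the two
# states should differ, while the parent's stream is unchanged.
print(parent.generate_state(4))
print(child.generate_state(4))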
Looks fine to me. Do we have to limit the number of spawned children to something like 2**32, or maybe 2**64?
In [6]: SeedSequence(42, spawn_key=(2**32,)).generate_state(4)
Out[6]: array([3908812709, 3341582407, 3454793571, 679120907], dtype=uint32)
In [7]: SeedSequence(42, spawn_key=(0,1)).generate_state(4)
Out[7]: array([3908812709, 3341582407, 3454793571, 679120907], dtype=uint32)
This may be a bit hypothetical, but I just noticed that it is possible...
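To make the mechanism explicit: entries of `entropy` and `spawn_key` are decomposed into 32-bit words before mixing, so a single spawn-key entry of 2**32 yields the same word sequence as the two entries (0, 1). A rough sketch of that decomposition (the helper name is mine, not the library's internal function):

def to_uint32_words(value):
    # Split a Python int into 32-bit words, least-significant word first.
    words = []
    while value > 0:
        words.append(value & 0xFFFFFFFF)
        value >>= 32
    return words or [0]

print(to_uint32_words(2**32))                   # [0, 1]
print(to_uint32_words(0) + to_uint32_words(1))  # [0, 1] -- same words as spawn_key=(0, 1)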
# Ensure that large integers are inserted in little-endian fashion to avoid
# trailing 0s.
ss0 = SeedSequence(42)
ss1 = SeedSequence(42 << 32)
Should this test (also) spawn a single child? Otherwise, does it actually test the trailing 0s, since the length would differ?
This one isn't related to #16539; it is just explicitly testing that we designed the conversion of large integers to uint32 words correctly. There were two obvious ways to do it, and the other (perhaps more obvious) way would also have caused a problem.
The way we convert `42 << 32` into a uint32 array for consumption by `SeedSequence` converts it to `[0, 42]`. The other way to do it ("big-endian") would have converted it to `[42, 0]`, which would mean that `42`, `42 << 32`, `42 << 64`, and `42 << 96` would all generate the same states.
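As a quick illustration of why the little-endian choice matters (this mirrors the test above; the output values themselves don't matter):

import numpy as np
from numpy.random import SeedSequence

# 42 becomes [42] and 42 << 32 becomes [0, 42], so the two seeds feed
# different words into the pool and should produce different states.
# A big-endian conversion would have produced [42] and [42, 0], which
# hash identically because trailing 0s within the pool size are implicit.
assert not np.array_equal(SeedSequence(42).generate_state(4),
                          SeedSequence(42 << 32).generate_state(4))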
I realize that; I figured that in one case (if this was the "bad" version) this would be `[42,]` and in the other `[42, 0]`, and I assumed that this would still give a different result due to the hashing procedure being sensitive to trailing 0s as well. But with the 0 appended from the spawn key, it becomes `[42, ..., 0]` in both cases, where `...` is filled with 0s.
In any case, I don't mind; it is still a good test.
The problem is that the concatenated words of entropy (`entropy` and `spawn_key`, as returned by `get_assembled_entropy()`) will still get 0-padded to `pool_size` if the result is shorter. It just happens implicitly.
numpy/numpy/random/bit_generator.pyx
Lines 359 to 361 in 7a82f14
# Our pool size is bigger than our entropy, so just keep
# running the hash out.
mixer[i] = hashmix(0, hash_const)
The hashing algorithm is only sensitive to different numbers of trailing 0s when those trailing 0s extend beyond the `pool_size`. This is testing for a potential design flaw (which I avoided) even without the `spawn_key`.
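A small check of that claim (assuming the default `pool_size` of 4): entropy lists that differ only by trailing 0s within the pool size should produce the same state, since the mixing loop keeps hashing in 0s until the pool is filled.

import numpy as np
from numpy.random import SeedSequence

a = SeedSequence([42]).generate_state(4)
b = SeedSequence([42, 0]).generate_state(4)
assert np.array_equal(a, b)   # the implicit padding supplies the same trailing 0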
One thing is whether we should add release notes for this. Which I guess means in this case we have to edit the 1.19 release notes directly; not sure how to best do that.
If it has a release note it will be easy to pick up from
Notionally, yes, we should prevent spawning if it would go past
Yeah, it's a different (and very corner-case) issue in any case, thanks. I think we can just merge this; I'd just like you/Chuck to make the call on whether to put it into 1.19 and what the release note should be.
Still needs a release note. It can be removed from master after backporting. There will also be a release note for the default rng change, so this isn't unique. The PR could also be made against the 1.19 branch and forward ported.
That release note file will be overwritten; the real one is in the maintenance/1.19.x branch. I can work with this, but it will be easier to deal with if next time it is a file in
Thinking it over, the best bet is probably to make the default rng change against
Are you referring to the changes in this PR?
Thanks Robert.
@rkern I was thinking of the upcoming change of the default generator. But this was easy enough to backport; I just cherry-picked the first commit and cut and pasted the notes insert, so that works.
Thanks! Sorry for the faffing around.