- A+

I use this hash function a lot, i.e. to record the value of a dataframe. Wanted to see if I could break it. Why aren't these hash values identical?

This requires the digest package.

Plain text output:

`> digest(Inf-Inf) [1] "0d59b2dae9351c1ce6c76133295322d7" > digest(NaN) [1] "4e9653ddf814f0d16b72624aeb85bc20" > digest(1) [1] "6717f2823d3202449301145073ab8719" > digest(1 + 0) [1] "6717f2823d3202449301145073ab8719" > digest(5) [1] "5e338704a8e069ebd8b38ca71991cf94" > digest(sum(1, 1, 1, 1, 1)) [1] "5e338704a8e069ebd8b38ca71991cf94" > digest(1^0) [1] "6717f2823d3202449301145073ab8719" > 1^0 [1] 1 > digest(1) [1] "6717f2823d3202449301145073ab8719" `

Additional weirdness. Calculations equal NaN have identical hash values, but NaN's hash values are not equivalent:

`> Inf - Inf [1] NaN > 0/0 [1] NaN > digest(Inf - Inf) [1] "0d59b2dae9351c1ce6c76133295322d7" > digest(0/0) [1] "0d59b2dae9351c1ce6c76133295322d7" > digest(NaN) [1] "4e9653ddf814f0d16b72624aeb85bc20" `

**tl;dr** this has to do with very deep details of how `NaN`

s are represented in binary. You could work around it by using `digest(.,ascii=TRUE)`

...

Following up on @Jozef's answer: note boldfaced digits ...

> base::serialize(Inf-Inf,connection=NULL) [1] 58 0a 00 00 00 03 00 03 06 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 [26] 00 0e 00 00 00 01fff8 00 00 00 00 00 00 > base::serialize(NaN,connection=NULL) [1] 58 0a 00 00 00 03 00 03 06 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 [26] 00 0e 00 00 00 017ff8 00 00 00 00 00 00

Alternatively, using `pryr::bytes()`

...

`> bytes(NaN) [1] "7F F8 00 00 00 00 00 00" > bytes(Inf-Inf) [1] "FF F8 00 00 00 00 00 00" `

The Wikipedia article on floating point format/NaNs says:

Some operations of floating-point arithmetic are invalid, such as taking the square root of a negative number. The act of reaching an invalid result is called a floating-point exception. An exceptional result is represented by a special code called a NaN, for "Not a Number". All NaNs in IEEE 754-1985 have this format:

- sign = either 0 or 1.
- biased exponent = all 1 bits.
- fraction = anything except all 0 bits (since all 0 bits represents infinity).

The sign is the first bit; the exponent is the next 11 bits; the fraction is the last 52 bits. Translating the first four hex digits given above to binary, `Inf-Inf`

is `1111 1111 1111 0100`

(sign=1; exponent is all ones, as required; fraction starts with `0100`

) whereas `NaN`

is `0111 1111 1111 0100`

(the same, but with sign=0).

To understand *why* `Inf-Inf`

ends up with sign bit 1 and `NaN`

has sign bit 0 you'd probably have to dig more deeply into the way floating point arithmetic is implemented on this platform ...

It might be worth raising an issue on the digest GitHub repo about this; I can't think of an *elegant* way to do it, but it seems reasonable that objects where `identical(x,y)`

is `TRUE`

in R should have identical hashes ... Note that `identical()`

specifically ignores these differences in bit patterns via the `single.NA`

(default `TRUE`

) argument:

single.NA: logical indicating if there is conceptually just one numeric ‘NA’ and one ‘NaN’; ‘single.NA = FALSE’ differentiates bit patterns.

Within the C code, it looks like R simply uses C's `!=`

operator to compare `NaN`

values *unless* bitwise comparison is enabled, in which case it does an explicit check of equality of the memory locations: see here. That is, C's comparison operator appears to treat different kinds of `NaN`

values as equivalent ...