BINARY_CHECKSUM – different result depending on number of rows

  • A+
Category:Languages

I wonder why the BINARY_CHECKSUM function returns different result for the same:

SELECT *, BINARY_CHECKSUM(a,b) AS bc FROM (VALUES(1, NULL, 100),             (2, NULL, NULL),             (3, 1, 2)) s(id,a,b);  SELECT *, BINARY_CHECKSUM(a,b) AS bc FROM (VALUES(1, NULL, 100),             (2, NULL, NULL)) s(id,a,b); 

Ouput:

+-----+----+------+-------------+ | id  | a  |  b   |     bc      | +-----+----+------+-------------+ |  1  |    | 100  |        -109 | |  2  |    |      | -2147483640 | |  3  | 1  |   2  |          18 | +-----+----+------+-------------+  -- -109 vs 100 +-----+----+------+------------+ | id  | a  |  b   |     bc     | +-----+----+------+------------+ |  1  |    | 100  |        100 | |  2  |    |      | 2147483647 | +-----+----+------+------------+ 

And for second sample I get what I would anticipate:

SELECT *, BINARY_CHECKSUM(a,b) AS bc FROM (VALUES(1, 1, 100),             (2, 3, 4),             (3,1,1)) s(id,a,b);  SELECT *, BINARY_CHECKSUM(a,b) AS bc FROM (VALUES(1, 1, 100),             (2, 3, 4)) s(id,a,b); 

Ouptut for both first two rows:

+-----+----+------+-----+ | id  | a  |  b   | bc  | +-----+----+------+-----+ |  1  | 1  | 100  | 116 | |  2  | 3  |   4  |  52 | +-----+----+------+-----+ 

db<>fiddle demo


It has strange consequences when I want to compare two tables/queries:

WITH t AS (   SELECT 1 AS id, NULL AS a, 100 b   UNION ALL SELECT 2, NULL, NULL   UNION ALL SELECT 3, 1, 2           -- comment this out ), s AS (   SELECT 1 AS id ,100 AS a, NULL as b   UNION ALL SELECT 2, NULL, NULL   UNION ALL SELECT 3, 2, 1           -- comment this out ) SELECT t.*,s.*   ,BINARY_CHECKSUM(t.a, t.b) AS bc_t, BINARY_CHECKSUM(s.a, s.b) AS bc_s FROM t JOIN s   ON s.id = t.id WHERE BINARY_CHECKSUM(t.a, t.b) = BINARY_CHECKSUM(s.a, s.b); 

db<>fiddle demo2

For 3 rows I get single result:

+-----+----+----+-----+----+----+--------------+-------------+ | id  | a  | b  | id  | a  | b  |    bc_t      |    bc_s     | +-----+----+----+-----+----+----+--------------+-------------+ |  2  |    |    |  2  |    |    | -2147483640  | -2147483640 | +-----+----+----+-----+----+----+--------------+-------------+ 

but for 2 rows I get also id = 1:

+-----+----+------+-----+------+----+-------------+------------+ | id  | a  |  b   | id  |  a   | b  |    bc_t     |    bc_s    | +-----+----+------+-----+------+----+-------------+------------+ |  1  |    | 100  |  1  | 100  |    |        100  |        100 | |  2  |    |      |  2  |      |    | 2147483647  | 2147483647 | +-----+----+------+-----+------+----+-------------+------------+ 

Remarks:

  • I am not searching for alternatives like(HASH_BYTES/MD5/CHECKSUM)

  • I am aware that BINARY_CHECKSUM could lead to collisions(two different calls produce the same output) here scenario is a bit different

For this definition, we say that null values, of a specified type, compare as equal values. If at least one of the values in the expression list changes, the expression checksum can also change. However, this is not guaranteed. Therefore, to detect whether values have changed, we recommend use of BINARY_CHECKSUM only if your application can tolerate an occasional missed change.

It is strange for me that hash function returns different result for the same input arguments. Is this behaviour by design or it is some kind of glitch?

EDIT:

As @scsimon points out it works for materialized tables but not for cte. db<>fiddle actual table

Metadata for cte:

SELECT name, system_type_name FROM sys.dm_exec_describe_first_result_set(' SELECT * FROM (VALUES(1, NULL, 100),             (2, NULL, NULL),             (3, 1, 2)) s(id,a,b)', NULL,0);  SELECT name, system_type_name FROM sys.dm_exec_describe_first_result_set(' SELECT * FROM (VALUES(1, NULL, 100),             (2, NULL, NULL)) s(id,a,b)', NULL,0)  -- working workaround SELECT name, system_type_name FROM sys.dm_exec_describe_first_result_set(' SELECT * FROM (VALUES(1, cast(NULL as int), 100),             (2, NULL, NULL)) s(id,a,b)', NULL,0) 

For all cases all columns are INT but with explicit CAST it behaves as it should.

db<>fidde metadata

 


For the literal NULL without the CAST (and without any typed values in the column) it entirely ignores it and just gives you the same result as BINARY_CHECKSUM(b).

This seems to happen very early on. The initial tree representation output from

SELECT *, BINARY_CHECKSUM(a,b) AS bc FROM (VALUES(1, NULL, 100),             (2, NULL, NULL)) s(id,a,b) OPTION (RECOMPILE, QUERYTRACEON 8605, QUERYTRACEON 3604); 

Already shows that it has decided to just use one column as input to the function

ScaOp_Intrinsic binary_checksum      ScaOp_Identifier COL: Union1008  

This compares with the following output for your first query

ScaOp_Intrinsic binary_checksum      ScaOp_Identifier COL: Union1011       ScaOp_Identifier COL: Union1010  

If you try and get the BINARY_CHECKSUM with

SELECT *, BINARY_CHECKSUM(a) AS bc FROM (VALUES(1, NULL, 100)) s(id,a,b) 

It gives the error

Msg 8184, Level 16, State 1, Line 8 Error in binarychecksum. There are no comparable columns in the binarychecksum input.

This is not the only place where an untyped NULL constant is treated differently from an explicitly typed one.

Another case is

SELECT COALESCE(CAST(NULL AS INT),CAST(NULL AS INT)) 

vs

SELECT COALESCE(NULL,NULL) 

I'd err on the side of "glitch" in this case rather than "by design" though as the columns from the derived table are supposed to be int before they get to the checksum function.

SELECT COALESCE(a,b) FROM (VALUES(NULL, NULL)) s(a,b) 

Does work as expected without this glitch.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: