What is the meaning of hash if we still need to check every item?

  • A+
Category:Languages

We know that tuple objects are immutable and thus hashable. We also know that lists are mutable and non-hashable.

This can be easily illustrated

>>> set([1, 2, 3, (4, 2), (2, 4)]) {(2, 4), (4, 2), 1, 2, 3}  >>> set([1, 2, 3, [4, 2], [2, 4]]) TypeError: unhashable type: 'list' 

Now, what is the meaning of hash in this context, if, in order to check for uniqueness (e.g. when building a set), we still have to check each individual item in whatever iterables are there in the set anyway?

We know two objects can have the same hash value and still be different. So, hash only is not enough to compare objects. So, what is the point of hash? Why not just check each individual items in the iterables direrctly?

My intuition is that it could be for one of the reasons

  1. hash is just a (pretty quick) preliminary comparison. If hashes are different, we know objects are different.
  2. hash sinalizes that an object is mutable. This should be enough to raise an exception when comparing to other objects: at that specific time, the objects could be equal, but maybe later, they are not.

Am I in the right direction? Or am I missing important piece of this?

Thank you

 


Now, what is the meaning of hash in this context, if, in order to check for uniqueness (e.g. when building a set), we still have to check each individual item in whatever iterables are there in the set anyway?

Yes, but the hash is used to make a conservative estimate if two objects can be equal, and is also used to assign a "bucket" to an item. If the hash function is designed carefully, then it is likely (not a certainty) that most, if not all, end up in a different bucket, and as a result, we thus make the membercheck/insertion/removal/... algorithms run on average in constant time O(1), instead of O(n) which is typical for lists.

So your first answer is partly correct, although one has to take into account that the buckets definitely boost performance as well, and are actually more important than a conservative check.

Background

Note: I will here use a simplified model, that makes the principle clear, in reality the implementation of a dictionary is more complicated. For example the hashes are here just some numbers that show the principe.

A hashset and dictionary is implemented as an array of "buckets". The hash of an element determines in which bucket we store an element. If the number of elements grows, then the number of buckets is increased, and the elements that are already in the dictionary are typically "reassigned" to the buckets.

For example, an empty dictionary might look, internally, like:

+---+ |   | | o----> NULL |   | +---+ |   | | o----> NULL |   | +---+ 

So two buckets, in case we add an element 'a', then the hash is 123. Let us consider a simple algorithm to allocate an element to a bucket, here there are two buckets, so we will assign the elements with an even hash to the first bucket, and an odd hash to the second bucket. Since the hash of 'a' is odd, we thus assign 'a' to the second bucket:

+---+ |   | | o----> NULL |   | +---+ |   |   +---+---+ | o---->| o | o----> NULL |   |   +-|-+---+ +---+    'a' 

So that means if we now check if 'b' is a member of the dictionary, we first calculate hash('b'), which is 456, and thus if we would have added this to the dictionary, it would be in the first bucket. Since the first bucket is empty, we never have to look for the elements in the second bucket to know for sure that 'b' is not a member.

If we then for example want to check if 'c' is a member, we first generate the hash of 'c', which is 789, so we add it to the second bucket again, for example:

+---+ |   | | o----> NULL |   | +---+ |   |   +---+---+   +---+---+ | o---->| o | o---->| o | o----> NULL |   |   +-|-+---+   +-|-+---+ +---+    'c'         'a' 

So now if we again would check if 'b' is a member, we would look to the first bucket, and again, we never thus have to iterate over 'c' and 'a' to know for sure that 'b' is not a member of the dictionary.

Now of course one might argue that if we keep adding more characters like 'e' and 'g' (here we consider these to have an odd hash), then that bucket will get quite full, and thus if we later check if 'i' is a member, we still will need to iterate over the elements. But in case the number of elements grows, typically the number of buckets will increase as well, and the elements in the dictionary will be assigned a new bucket.

For example if we now want to add 'd' to the dictionary, the dictionary might note that the number of elements after insertion 3, is larger than the number of buckets 2, so we create a new array of buckets:

+---+ |   | | o----> NULL |   | +---+ |   | | o----> NULL |   | +---+ |   | | o----> NULL |   | +---+ |   | | o----> NULL |   | +---+ 

and we reassign the members 'a' and 'c'. Now all elements with a hash h with h % 4 == 0 will be assigned to the first bucket, h % 4 == 1 to the second bucket, h % 4 == 2 to the third bucket, and h % 4 == 3 to the last bucket. So that means that 'a' with hash 123 will be stored in the last bucket, and 'c' with hash 789 will be stored in the second bucket, so:

+---+ |   | | o----> NULL |   | +---+ |   |   +---+---+ | o---->| o | o----> NULL |   |   +-|-+---+ +---+    'c' |   | | o----> NULL |   | +---+ |   |   +---+---+ | o---->| o | o----> NULL |   |   +-|-+---+ +---+    'a' 

we then add 'b' with hash 456 to the first bucket, so:

+---+ |   |   +---+---+ | o---->| o | o----> NULL |   |   +-|-+---+ +---+    'b' |   |   +---+---+ | o---->| o | o----> NULL |   |   +-|-+---+ +---+    'c' |   | | o----> NULL |   | +---+ |   |   +---+---+ | o---->| o | o----> NULL |   |   +-|-+---+ +---+    'a' 

So if we want to check the membership of 'a', we calculate the hash, know that if 'a' is in the dictionary, we have to search in the third bucket, and will find it there. If we look for 'b' or 'c' the same happens (but with a different bucket), and if we look for 'd' (here with hash 12), then we will search in the third bucket, and will never have to check equality with a single element to know that it is not part of the dictionary.

If we want to check if 'e' is a member, then we calculate the hash of 'e' (here 345), and search in the second bucket. Since that bucket is not empty, we start iterating over it.

For every element in the bucket (here there is only one), The algorithm will first look if the key we search for, and the key in the node refer to the same object (two different objects can however be equal), since this is not the case, we can not yet claim that 'e' is in the dictionary.

Next we will compare the hash of the key we search for, and the key of the node. Most dictionary implementations (CPython's dictionaries and sets as well if I recall correctly), then store the hash in the list node as well. So here it checks if 345 is equal to 789, since this is not the case, we know that 'c' and 'e' are not the same. If it was expensive to compare the two objects, we thus could save some cycles with that.

If the hashes are equal, that does not mean that the elements are equal, so in that case, we thus will check if the two objects are equivalent, if that is the case, we know that the element is in the dictionary, otherwise, we know it is not.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: