What is a hash?
Computer science refers to the output of a hash function as a hash or hash value, but it also refers to a list-like data type in which elements are accessed by their hash value, the hash table.
The hash function generally maps an arbitrary data space to data of fixed size. Their typical properties differ to some extent in different application areas. The name is derived from the English term for chopping and hashing and provides a vivid analogy for how a hash function works when calculating hashes from the information of the input values.
For which applications is the hash used?
- Hash table data type
- Protection of sensitive data
- Finding duplicates
- Search for similar records or substrings in strings
- Test for being contained in a set
- Various applications in cryptography
The basic requirements for a hash function
First of all, every hash function must always return the same result for every input when used multiple times; it must be deterministic. This distinguishes it from a randomization function. Thus, no random elements may be included in the calculation of hash values, unless they remain constant for the useful life of the hash.
This is the case, for example, with the Python programming language, whose interpreter generates a random value for the randomization of the hash values at startup, so that these are valid only within a program run.
Furthermore, the results that a hash function returns for its expected inputs should be distributed as evenly as possible over its range of values. In particular, different data sets with the same hash value, so-called collisions, are usually undesirable. In the case of the hash data type, they increase the effort required to find the corresponding elements.
In cryptography, they represent potential attack vectors. However, collisions can only be completely avoided in special cases; this is called a perfect hash, and, from a mathematical point of view, the hash function is injective. In some applications, collisions are even useful. This is the case, for example, when they hide insignificant information such as differences in the case of the input values. Likewise, locality-sensitive hashing (LSH), is used to search for similar documents, such as copied content on the Web.
Other characteristic hash properties
In addition to the uniform distribution of hash values over the set of values of the hash function, the continuity of its results is important, but in different ways depending on the field of application.
In cryptography, hash functions are desired that provide values that are as different as possible for different inputs. This makes brute force attacks, which try to guess the original value by systematically trying out possible input values, more difficult.
If, on the other hand, the hash is used to find similar elements, then the hash function should, just the opposite, provide values that are as continuous as possible. Thus, each hash value in this application should differ as little as possible from that of a similar input.
A hash is not encryption!
Hash functions are often used to protect sensitive data. For example, the passwords for shell access to Unix and compatible systems are stored by default only as hash values and not in plain text in the password file. However, this is not encryption. A major difference is that a hash function is not invertible. This means that it is not possible to convert the calculated hash value back into the original value, in the example mentioned into the password.
However, this is exactly what is required in an encryption algorithm so that the legitimate recipient of an encrypted message can make the original text visible again. In the case of the password hash, on the other hand, recalculation is not desired, but also not necessary, because a correct password entry can be verified simply by comparing the calculated hash value with the stored hash value.