Double hashing

URL: https://en.wikipedia.org/wiki/Double_hashing

Double hashing is a computer programming technique used in conjunction with open addressing in hash tables to resolve hash collisions, by using a secondary hash of the key as an offset when a collision occurs. Double hashing with open addressing is a classical data structure on a table T.

The double hashing technique uses one hash value as an index into the table and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; the interval is set by a second, independent hash function. Unlike the alternative collision-resolution methods of linear probing and quadratic probing, the interval depends on the data, so that values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering.

Given two random, uniform, and independent hash functions h1 and h2, the ith location in the bucket sequence for value x in a hash table of |T| buckets is

    h(i, x) = (h1(x) + i · h2(x)) mod |T|.

The locations can be conveniently calculated by incrementing the previous hash by h2(x), i.e.

    h(i + 1, x) = (h(i, x) + h2(x)) mod |T|.
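As a minimal illustrative sketch, the bucket sequence can be generated directly from this formula. The table size and hash functions below are made up for the example, not prescribed by the article:

```python
TABLE_SIZE = 13  # |T|; prime, so any step in 1..12 cycles the whole table

def h1(x: int) -> int:
    """Primary hash: starting bucket (example function)."""
    return x % TABLE_SIZE

def h2(x: int) -> int:
    """Secondary hash: probe step, never zero (example function)."""
    return 1 + (x % (TABLE_SIZE - 1))

def probe_sequence(x: int, count: int) -> list[int]:
    """First `count` bucket indices h(i, x) = (h1(x) + i*h2(x)) mod |T|."""
    start, step = h1(x), h2(x)
    return [(start + i * step) % TABLE_SIZE for i in range(count)]

print(probe_sequence(31, 5))  # → [5, 0, 8, 3, 11]
```

Because the step 8 = h2(31) is coprime to the table size 13, extending the sequence to 13 probes visits every bucket exactly once.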

Generally, h1 and h2 are selected from a set of universal hash functions; h1 is selected to have a range of {0, …, |T| − 1} and h2 to have a range of {1, …, |T| − 1}. Double hashing approximates a random distribution; more precisely, pair-wise independent hash functions yield a probability of (n/|T|)² that any pair of keys will follow the same bucket sequence.

Selection of h2(x)


The secondary hash function h2(x) should have several characteristics:[1]

  1. It should never yield an index of zero; when 0 is returned, only one index is probed.
  2. All values of h2(x) should be relatively prime to |T|. Otherwise the number of indices probed over k items would be min(k, |T|/h2(x)), which could be as small as 2.
  3. It should cycle through the whole table.
  4. It should be very fast to compute.
  5. It should be pair-wise independent of h1(x).

The distribution characteristics of h2 are irrelevant. It is analogous to a random-number generator.

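The relative-primality requirement above can be demonstrated concretely. This sketch (illustrative values, with a deliberately composite table size) counts how many distinct slots a probe step actually reaches; a step sharing a factor with |T| covers only |T|/gcd(step, |T|) slots:

```python
from math import gcd

TABLE_SIZE = 12  # deliberately composite, to show the failure mode

def distinct_probes(start: int, step: int) -> int:
    """Count distinct slots visited by start + i*step (mod TABLE_SIZE)."""
    seen = set()
    slot = start % TABLE_SIZE
    while slot not in seen:
        seen.add(slot)
        slot = (slot + step) % TABLE_SIZE
    return len(seen)

print(distinct_probes(0, 8))  # gcd(8, 12) = 4, so only 3 of 12 slots probed
print(distinct_probes(0, 5))  # gcd(5, 12) = 1, so the full table is cycled
```

With a prime table size, every legal step is coprime to |T|, which is one reason prime-sized tables are a common choice for double hashing.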

Analysis


Suppose one selects two hash functions h1 and h2 and inserts α|T| elements into a hash table with |T| slots. Suppose each insertion of a key x places the key in the first available slot from the sequence h(0, x), h(1, x), h(2, x), …, defined by

    h(i, x) = (h1(x) + i · h2(x)) mod |T|.

Given this setup, theoretical analyses seek to determine the time needed to perform an additional insertion (or, equivalently, the time needed to perform an unsuccessful search). Guibas and Szemerédi[2] proved in 1978 that, if h1 and h2 are uniformly random and α < 0.319, then the expected time is O(1). Subsequent work by Lueker and Molodowitch[3] proved a bound of 1/(1 − α) for any α, and established that the behavior of the hash table can be directly coupled to that of a standard random-probing based solution. Much more recently, in 2007, Bradford and Katehakis[4] showed that even using universal hash functions, rather than fully random ones, suffices to get a 1/(1 − α) bound.

Like all other forms of open addressing, double hashing becomes linear as the hash table approaches maximum capacity. The usual heuristic is to limit the table loading to 75% of capacity. Eventually, rehashing to a larger size will be necessary, as with all other open addressing schemes.
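The motivation for the 75% heuristic can be seen by tabulating the 1/(1 − α) bound at a few load factors (a quick illustrative computation, not from the article):

```python
# Expected probes for an unsuccessful search under the 1/(1 - alpha) bound.
# The cost is modest up to ~75% load, then grows sharply toward full tables.
for load in (0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"load {load:4}: ~{1 / (1 - load):6.1f} expected probes")
```

At 75% load an unsuccessful search expects about 4 probes, while at 99% it expects about 100, illustrating why rehashing before the table fills is worthwhile.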

Variants


Peter Dillinger's PhD thesis points out that double hashing produces unwanted equivalent hash functions when the hash functions are treated as a set, as in Bloom filters: if h2(y) = −h2(x) and h1(y) = h1(x) + k · h2(x), then h(i, y) = h(k − i, x) and the sets of hashes

    {h(0, x), …, h(k, x)} = {h(0, y), …, h(k, y)}

are identical. This makes a collision twice as likely as the hoped-for 1/|T|².[5]

There are additionally a significant number of mostly-overlapping hash sets; if h2(y) = h2(x) and h1(y) = h1(x) ± h2(x), then h(i, y) = h(i ± 1, x), and comparing additional hash values (expanding the range of i) is of no help.
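The first equivalence can be checked numerically. This sketch uses made-up hash values (not from the source) and confirms that the two keys generate identical bucket sets:

```python
# If h2(y) = -h2(x) and h1(y) = h1(x) + k*h2(x) (mod |T|), then keys x and y
# produce the same set of k+1 hash values, per the equivalence above.
T = 101  # example table size (prime, for the illustration)
k = 4
h1x, h2x = 17, 23            # made-up hash values for key x
h1y = (h1x + k * h2x) % T    # derived hash values for key y
h2y = (-h2x) % T

set_x = {(h1x + i * h2x) % T for i in range(k + 1)}
set_y = {(h1y + i * h2y) % T for i in range(k + 1)}
assert set_x == set_y
print(sorted(set_x))  # → [8, 17, 40, 63, 86]
```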

Triple hashing


Adding a third hash as a quadratic term (triple hashing) makes the overlap much less likely, since equivalence classes now need to be generated by a collaboration of both h2(x) and h3(x), at a cost of 50% more calculation due to the added hash function. Choices for the factor on this h3(x) include i²[6] and the triangular numbers i(i ± 1)/2. The added hash function should obey the same requirements as listed above for h2(x).[1]

Using the triangular numbers makes it easier to calculate the value by forward differencing: for the i(i − 1)/2 variety,[1]

from collections.abc import Callable, Iterator
from typing import TypeVar

T = TypeVar('T')
hashfunc = Callable[[T], int]
# assume h1, h2, h3 are defined and of type hashfunc

MODULUS = 1 << 32

def triple_hash(key: T, k: int) -> Iterator[int]:
    """Yield k iterations of a triple hash."""
    x, y, z = h1(key), h2(key), h3(key)
    yield x
    for _ in range(1, k):
        x = (x + y) % MODULUS  # add the current first difference h2 + i*h3
        y = (y + z) % MODULUS  # advance the first difference by h3
        yield x

This kind of construction does not fully remove equivalent sets. If

    h1(y) = h1(x) + k · h2(x) + k² · h3(x),
    h2(y) = −h2(x) − 2k · h3(x), and
    h3(y) = h3(x),

then

    h(k − i, y) = h1(y) + (k − i) · h2(y) + (k − i)² · h3(y)
                = h1(y) + (k − i)(−h2(x) − 2k · h3(x)) + (k − i)² · h3(x)
                = …
                = h1(x) + k · h2(x) + k² · h3(x) + (i − k) · h2(x) + (i² − k²) · h3(x)
                = h1(x) + i · h2(x) + i² · h3(x)
                = h(i, x).

Enhanced double hashing


Adding a cubic term i³[6] or (i³ − i)/6 (a tetrahedral number)[1] does solve the problem, a technique known as enhanced double hashing. The tetrahedral number can be computed efficiently by forward differencing:

struct key;  /// Opaque
/// Replace "unsigned int" with other types as needed.
/// (Must be unsigned for guaranteed wrapping.)
typedef unsigned int hashfunc(struct key const *);
extern hashfunc h1, h2;

/// Calculate n hash values from two underlying hash functions
/// h1() and h2() using enhanced double hashing. On return,
/// hashes[i] = h1(x) + i*h2(x) + (i*i*i - i)/6.
/// Takes advantage of automatic wrapping (modular reduction)
/// of unsigned types in C.
void ext_dbl_hash(struct key const *x, unsigned int hashes[], unsigned int n)
{
    unsigned int a = h1(x), b = h2(x), i = 0;

    hashes[i] = a;
    for (i = 1; i < n; i++) {
        a += b;  // Add quadratic difference to get cubic
        b += i;  // Add linear difference to get quadratic
        // i++ adds constant difference to get linear
        hashes[i] = a;
    }
}
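As a quick cross-check (illustrative, not part of the original), the same forward-differencing recurrence can be reproduced in Python and compared against the closed form h1 + i·h2 + (i³ − i)/6:

```python
# Reproduce the enhanced-double-hashing recurrence with 32-bit wrapping and
# verify it matches hashes[i] = h1 + i*h2 + (i**3 - i)//6 (mod 2^32).
# The two hash values below are arbitrary example constants.
M = 1 << 32
h1v, h2v = 0x9E3779B9, 0x85EBCA6B

a, b = h1v, h2v
hashes = [a]
for i in range(1, 10):
    a = (a + b) % M  # add quadratic difference to get the cubic term
    b = (b + i) % M  # add linear difference to get the quadratic term
    hashes.append(a)

assert all(hashes[i] == (h1v + i * h2v + (i**3 - i) // 6) % M
           for i in range(10))
```

(i³ − i is always divisible by 6, so the integer division in the closed form is exact.)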

In addition to rectifying the collision problem, enhanced double hashing also removes double hashing's numerical restrictions on the properties of h2(x), allowing a hash function similar in property to (but still independent of) h1 to be used. (Using the numbering in § Selection of h2(x), the first two requirements are removed.)[1]


References

  1. Dillinger, Peter C.; Manolios, Panagiotis (November 15–17, 2004). Bloom Filters in Probabilistic Verification (PDF). 5th International Conference on Formal Methods in Computer Aided Design (FMCAD 2004). Austin, Texas. CiteSeerX 10.1.1.119.628. doi:10.1007/978-3-540-30494-4_26.
  2. Guibas; Szemerédi (1978). "The Analysis of Double Hashing". J. Comput. System Sci. 16: 226–274.
  3. Lueker, George; Molodowitch, Mariko (1988). "More analysis of double hashing". Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing (STOC '88). New York: ACM Press. pp. 354–359. doi:10.1145/62212.62246.
  4. Bradford, Phillip G.; Katehakis, Michael N. (April 2007). "A Probabilistic Study on Combinatorial Expanders and Hashing" (PDF). SIAM Journal on Computing. 37 (1): 83–111. doi:10.1137/S009753970444630X. MR 2306284. Archived from the original (PDF) on 2016-01-25.
  5. Dillinger, Peter C. (December 2010). Adaptive Approximate State Storage (PDF) (PhD thesis). Northeastern University. pp. 93–112.
  6. Kirsch, Adam; Mitzenmacher, Michael (September 2008). "Less Hashing, Same Performance: Building a Better Bloom Filter" (PDF). Random Structures and Algorithms. 33 (2): 187–218. CiteSeerX 10.1.1.152.579. doi:10.1002/rsa.20208.
