Help & Manual authoring tool
Hash tables in Delphi: Hashing technique to compare a string with many possible values using a hash table

Hash tables

Copyright © 2001 Alirio A. Gavidia B.. To see more of my articles
visit Programación Orientada en Delphi (Site in Spanish)
Translated by Ernesto De Spirito with the author's permission

Pascal Newsletter. Free ezine for Delphi (and Kylix) programmers with articles, news, reviews, tips, trinks, and links to new Delphi content on the web!

How to take a decision

A common problem in programming is the taking of a course of action based on the value of a literal sequence of characters. This is simply to be able to execute an option if for example the variable Payment is 'cash', 'check', 'card' or 'transfer'.

We could design a sequence of code like the following:

Const
  STransfer = 'transfer';
  SCheck = 'check';
  SCard = 'card';
  SCash = 'cash';
  SPayMode = 'Pay mode';
  SUnknow = 'unknown';

If pago=SCash then
    ...
If pago=SCheck then
    ...
If pago=SCard then
    ...
If pago=STransfer then
    ...

This would work, and -to be sincere- I wouldn't worry much about it. It's practical. Nevertheless, if we speak of more options (like for example 32 to give a number), it might become a bit uncomfortable and certainly little elegant.

The option of a case is -in this level- unusable, simply because it requires ordinal types (integers, chars and enumerations). This allows the case statement to be really efficient (unlike the "Case" of xBase languages).

Improving the previous example we can go to:

If pago=SCash then
    ...
else If pago= SCheck then
    ...
else If pago= SCard then
    ...
else If pago= STransfer then
    ...

This change prevents that "n" checkings be done for an equal number of options, but, for the case of the last option -or even worse, the case in which the option be different from the expected ones-, "n" checkings are performed anyway.

The alternatives - order

The order seems to be the way. A sorted list is established and then a binary search is made (divide and conquer). Let us suppose that we have 32 elements: we divide in two sections of 16 elements and select, divide in two of 8 and select, divide in two of 4 and select, divide in two of 2 and select, and we choose the end. Total: 5 decisions (notice the fact that 2 to the 5th power is 32).

Until now, the binary search guarantees us that for 32 options a maximum of 5 (log N) checkings (in fact, 50% of the searches would be solved in 5 checkings, think about it). Nested if statements would yield in an average of 16 checkings (n/2).

Closer to "case..of"

An idea that always appears, and that I have seen used, is to take the first letter. For the initial example, this would let to us use a case structure, but when arriving to the letter or 'c' of 'cash', 'check' or 'card' in English (or the 't' of 'tarjeta' and 'transferencia' in Spanish) we would need at least to place an if or another case.

Another approach would be to create a function that gives us a number for an identifier. For example, I propose the following function:

Function IdentIndex(AIdent: String): integer;
Begin
  Result := length(AIdent) + Ord(AIdent[3])
end

sCash      119  // in English
SCheck     106  // in English
SCard      118  // in English
STransfer  105  // in English

sCash      109  // in Spanish
SCheck     107  // in Spanish
SCard      121  // in Spanish
STransfer  110  // in Spanish

Now it's possible to use a Case:

Var
  Ident : string;
begin
  Ident := 'cash';
  if InputQuery('Hash', SPayMode, Ident) then
    case IdentIndex(Ident) of
    109: ShowMessage(SCash);
    107: ShowMessage(SCheck);
    121: ShowMessage(SCard);
    110: ShowMessage(STransfer);
    else
      ShowMessage(SUnknown);
    end
end;

Pros and cons

The method is interesting due to the fact that reduces the number of necessary checkings to one. However now we have the overload of having to determine the suitable function that returns the indexes. An incorrect determination of this formula will introduce the following problems:

  • Index repetition
  • Scattered indexes
  • Slowness in the evaluation of the formula

The last point is very obvious and thus I won't make comments about it.

The repetition of indexes forces us to make additional steps within the Case. If we have two identifiers generating the same index, then it's necessary to make at least one additional analysis to recognize which is the correct way.

Scattered indexes are a drawback if we try to build a table. In this case we could design an array like the one I exemplify here:

Begin
  :
  FillChar(JumpTable,SizeOf(JumpTable), 0);
  JumpTable[IdentIndex(SCash)] := PayModeCash;
  JumpTable[IdentIndex(SCheck)] := PayModeCheck;
  JumpTable[IdentIndex(SCard)] := PayModeCard;
  JumpTable[IdentIndex(sTransfer)] := PayModeTransfer;
  :
End;

Var
  JumpTable : array [100..130] of TRoutine;
  Index : Integer;
  Ident : string;

Begin

  Ident := SCash;
  InputQuery('Hash', SPayMode, Ident);
  Index := IdentIndex(Ident);
  if (Index in [100..130]) and (Assigned(JumpTable[Index])) then
    JumpTable[Index]
end;

We have JumpTable as an array of routines that are executed given an index. But there's a waste since we used only four out of 30 possible values. The ideal would be that IdentIndex gives us four consecutive values. Sometimes we have to live with that, but sometimes no. In the case of the analyzer of a compiler, normally it's possible to dedicate effort to determine the adequate formula without waste. But when there are many words, or they aren't known beforehand, we must rethink the whole subject.

"Hash" tables or lists

The formula to determine the index generally must be limited since usually we can't create arrays with 14-digits indexes. The simplest solution is dividing and getting the rest. If we want to limit us to only 8 possibilities, we could divide into 8 and get the rest of the division.

This way, the new values will be:

sCash      7    // in English
SCheck     2    // in English
SCard      6    // in English
STransfer  1    // in English

sCash      5    // in Spanish
SCheck     3    // in Spanish
SCard      1    // in Spanish
STransfer  6    // in Spanish

Too good, but sometimes it doesn't end up being this way. For example, limiting it to four instead of eight, index repetition appears:

sCash      3    // in English
SCheck     2    // in English
SCard      2    // in English
STransfer  1    // in English

sCash      1    // in Spanish
SCheck     3    // in Spanish
SCard      1    // in Spanish
STransfer  2    // in Spanish

The suggested limit for a table is usually a prime number that is not very near to a power of 2.

For tables with repetition of indexes, normally the index is obtained, the table is checked, and if the cell is free, it is taken, otherwise the next free cell is searched in a linear way. Each failure in this search reduces the efficiency of the method.

Hash lists are implemented as lists of lists. Each node of the main list points to the list of identifiers that match the given index. There are other solutions to this problem, but I won't analyze them here.

Facing the way to find the index - Multiplication method

I like this way of solving the problem when we have a number in a wide range and we want to reduce it to a smaller range.

For a number of possible values 2^N (2 raised to the Nth power). Key is a value generated directly when evaluating the identifier.

The formula is determined this way:

Hash := ((K*Key) and M) mod S;

Where:

  • K is 158 for 8-bits indexes, 40503 for 16-bits indexes and 2654435769 for 16-bits indexes.
  • S is 2 raised to the difference between the number of bits and N.
  • M is the size of table minus 1. (2^N-1).

For N=10 we have

K=40503   (16 bits)
S=2^(16-10) = 2^6 = 64
M=1023

Hash := ((40503*Key) and 1023) mod 64;

The result would be a value from 0 to 63 (from an original range of 0 to 1023).

Manipulating strings of characters

Usually we can take each character of the string, add them and take the module with 256. However this generates problems with anagrams ('abcd' and 'dcba' for example) and similar words ('cba' and 'cab' for example).

A solution is using xor (exclusive or) instead of the addition and to add an intermediate table of pseudo-random values that gives us a better distribution, apart from considering in the algorithm the position of each letter and not only its intrinsic value.

Conclusion

Using hash tables can give terrific results in situations where we have knowledge of the identifiers to search and/or where we can waste memory in exchange for speed. In particular, compilers and interpreters are -in my opinion- the best cases. It's also useful in database systems where the problem is not space but the need of a minimum amount of readings. However, a problem to consider is the elimination of records in a table, something usual in databases.


To learn more:

  • D. E. Knuth, 1973, The Art of Computer Programming, Vol. 3: Sorting and Searching, Reading, MS: Addison-Wesley.

Copyright © 2001 by Alirio A. Gavidia B. All rights reserved.

The publication of this material is allowed by any means from anyone as long as this material is not modified and the original source is mentioned.

 
You can find the full source code of this article in the archive that accompanies the Pascal Newsletter #23

JfControls Library - for Delphi and C++ Builder
Copyright © 2000/2006 Ernesto De Spirito.   All rights reserved.