Hash tables
Copyright © 2001 Alirio A. Gavidia B..
To see more of my articles
visit Programación
Orientada en Delphi (Site in Spanish)
Translated by Ernesto De Spirito with the author's permission
![]() |
How to take a decision
A common problem in programming is the taking of a course of action
based on the value of a literal sequence of characters. This is simply
to be able to execute an option if for example the
variable Payment is 'cash', 'check', 'card' or 'transfer'.
We could design a sequence of code like the following:
Const
STransfer = 'transfer';
SCheck = 'check';
SCard = 'card';
SCash = 'cash';
SPayMode = 'Pay mode';
SUnknow = 'unknown';
If pago=SCash then
...
If pago=SCheck then
...
If pago=SCard then
...
If pago=STransfer then
...
This would work, and -to be sincere- I wouldn't worry much about it. It's practical. Nevertheless, if we speak of more options (like for example 32 to give a number), it might become a bit uncomfortable and certainly little elegant.
The option of a case is -in this level- unusable, simply because it
requires ordinal types (integers, chars and enumerations). This allows
the case statement to be really efficient (unlike the "Case" of xBase
languages).
Improving the previous example we can go to:
If pago=SCash then
...
else If pago= SCheck then
...
else If pago= SCard then
...
else If pago= STransfer then
...
This change prevents that "n" checkings be done for an equal number of options, but, for the case of the last option -or even worse, the case in which the option be different from the expected ones-, "n" checkings are performed anyway.
The alternatives - order
The order seems to be the way. A sorted list is established and then a binary search is made (divide and conquer). Let us suppose that we have 32 elements: we divide in two sections of 16 elements and select, divide in two of 8 and select, divide in two of 4 and select, divide in two of 2 and select, and we choose the end. Total: 5 decisions (notice the fact that 2 to the 5th power is 32).
Until now, the binary search guarantees us that for 32 options a maximum
of 5 (log N) checkings (in fact, 50% of the searches would be solved in
5 checkings, think about it). Nested if statements would yield
in an average of 16 checkings (n/2).
Closer to "case..of"
An idea that always appears, and that I have seen used, is to take the
first letter. For the initial example, this would let to us use
a case structure, but when arriving to the letter
or 'c' of 'cash', 'check' or 'card' in
English (or the 't' of 'tarjeta' and 'transferencia' in
Spanish) we would need at least to place an if or another case.
Another approach would be to create a function that gives us a number for an identifier. For example, I propose the following function:
Function IdentIndex(AIdent: String): integer;
Begin
Result := length(AIdent) + Ord(AIdent[3])
end
sCash 119 // in English
SCheck 106 // in English
SCard 118 // in English
STransfer 105 // in English
sCash 109 // in Spanish
SCheck 107 // in Spanish
SCard 121 // in Spanish
STransfer 110 // in Spanish
Now it's possible to use a Case:
Var
Ident : string;
begin
Ident := 'cash';
if InputQuery('Hash', SPayMode, Ident) then
case IdentIndex(Ident) of
109: ShowMessage(SCash);
107: ShowMessage(SCheck);
121: ShowMessage(SCard);
110: ShowMessage(STransfer);
else
ShowMessage(SUnknown);
end
end;
Pros and cons
The method is interesting due to the fact that reduces the number of necessary checkings to one. However now we have the overload of having to determine the suitable function that returns the indexes. An incorrect determination of this formula will introduce the following problems:
- Index repetition
- Scattered indexes
- Slowness in the evaluation of the formula
The last point is very obvious and thus I won't make comments about it.
The repetition of indexes forces us to make additional steps within
the Case. If we have two identifiers generating the same
index, then it's necessary to make at least one additional analysis to
recognize which is the correct way.
Scattered indexes are a drawback if we try to build a table. In this case we could design an array like the one I exemplify here:
Begin
:
FillChar(JumpTable,SizeOf(JumpTable), 0);
JumpTable[IdentIndex(SCash)] := PayModeCash;
JumpTable[IdentIndex(SCheck)] := PayModeCheck;
JumpTable[IdentIndex(SCard)] := PayModeCard;
JumpTable[IdentIndex(sTransfer)] := PayModeTransfer;
:
End;
Var
JumpTable : array [100..130] of TRoutine;
Index : Integer;
Ident : string;
Begin
Ident := SCash;
InputQuery('Hash', SPayMode, Ident);
Index := IdentIndex(Ident);
if (Index in [100..130]) and (Assigned(JumpTable[Index])) then
JumpTable[Index]
end;
We have JumpTable as an array of routines that are executed given an
index. But there's a waste since we used only four out of 30 possible
values. The ideal would be that IdentIndex gives us four consecutive
values. Sometimes we have to live with that, but sometimes no. In the
case of the analyzer of a compiler, normally it's possible to dedicate
effort to determine the adequate formula without waste. But when there
are many words, or they aren't known beforehand, we must rethink the
whole subject.
"Hash" tables or lists
The formula to determine the index generally must be limited since usually we can't create arrays with 14-digits indexes. The simplest solution is dividing and getting the rest. If we want to limit us to only 8 possibilities, we could divide into 8 and get the rest of the division.
This way, the new values will be:
sCash 7 // in English
SCheck 2 // in English
SCard 6 // in English
STransfer 1 // in English
sCash 5 // in Spanish
SCheck 3 // in Spanish
SCard 1 // in Spanish
STransfer 6 // in Spanish
Too good, but sometimes it doesn't end up being this way. For example, limiting it to four instead of eight, index repetition appears:
sCash 3 // in English
SCheck 2 // in English
SCard 2 // in English
STransfer 1 // in English
sCash 1 // in Spanish
SCheck 3 // in Spanish
SCard 1 // in Spanish
STransfer 2 // in Spanish
The suggested limit for a table is usually a prime number that is not very near to a power of 2.
For tables with repetition of indexes, normally the index is obtained, the table is checked, and if the cell is free, it is taken, otherwise the next free cell is searched in a linear way. Each failure in this search reduces the efficiency of the method.
Hash lists are implemented as lists of lists. Each node of the main list points to the list of identifiers that match the given index. There are other solutions to this problem, but I won't analyze them here.
Facing the way to find the index - Multiplication method
I like this way of solving the problem when we have a number in a wide range and we want to reduce it to a smaller range.
For a number of possible values 2^N (2 raised to the Nth power). Key
is a value generated directly when evaluating the identifier.
The formula is determined this way:
Hash := ((K*Key) and M) mod S;
Where:
- K is 158 for 8-bits indexes, 40503 for 16-bits indexes and 2654435769 for 16-bits indexes.
- S is 2 raised to the difference between the number of bits and N.
- M is the size of table minus 1. (2^N-1).
For N=10 we have
K=40503 (16 bits)
S=2^(16-10) = 2^6 = 64
M=1023
Hash := ((40503*Key) and 1023) mod 64;
The result would be a value from 0 to 63 (from an original range of 0 to 1023).
Manipulating strings of characters
Usually we can take each character of the string, add them and take the
module with 256. However this generates problems with anagrams
('abcd' and 'dcba' for example) and similar
words ('cba' and 'cab' for example).
A solution is using xor (exclusive or) instead of the
addition and to add an intermediate table of pseudo-random values that
gives us a better distribution, apart from considering in the algorithm
the position of each letter and not only its intrinsic value.
Conclusion
Using hash tables can give terrific results in situations where we have knowledge of the identifiers to search and/or where we can waste memory in exchange for speed. In particular, compilers and interpreters are -in my opinion- the best cases. It's also useful in database systems where the problem is not space but the need of a minimum amount of readings. However, a problem to consider is the elimination of records in a table, something usual in databases.
To learn more:
- D. E. Knuth, 1973, The Art of Computer Programming, Vol. 3: Sorting and Searching, Reading, MS: Addison-Wesley.
Copyright © 2001 by Alirio A. Gavidia B. All rights reserved.
The publication of this material is allowed by any means from anyone as long as this material is not modified and the original source is mentioned.
You can find the full source code of this article in the archive that accompanies the Pascal Newsletter #23
![]() |



