awk string manipulation functions: Computing a hash of a string


While implementing one of the standard hashing algorithms in awk is probably a tedious task, defining a hash function that can serve as a handle to text documents is much more tractable. Such a function is useful, for instance, to assign short ids to items, such as test cases, from their description, so that the user can refer to an item by its short id instead of supplying its long description.

The hash function needs to convert characters to numeric codes, which is accomplished with a lookup table initialised at the beginning of the script. The hash is then computed with modular arithmetic, a very classical approach: for each character, the accumulated value is multiplied by a prime number, the character code is added, and the result is reduced modulo 2^20.
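As a rough illustration of this recurrence, the sketch below traces the accumulator character by character. The shell wrapper `trace_hash` and the variable names are hypothetical, and picking out characters with `substr` is an assumption made here for portability, not something the text prescribes.

```shell
# Hypothetical helper: print each character, its code, and the running hash.
trace_hash() {
  awk -v text="$1" '
    BEGIN {
      # Lookup table from single-byte characters to numeric codes
      for (n = 0; n < 256; n++) ord[sprintf("%c", n)] = n
      prime = 104729; modulo = 1048576; ax = 0
      for (i = 1; i <= length(text); i++) {
        c = substr(text, i, 1)
        # One step: multiply by the prime, add the code, reduce
        ax = (ax * prime + ord[c]) % modulo
        printf("%s %3d %05x\n", c, ord[c], ax)
      }
    }'
}

trace_hash "ab"
# a  97 00061
# b  98 b02db
```

The accumulator stays below 2^20 and the multiplier is below 2^17, so every intermediate product is under 2^37 and remains exact in awk's double-precision numbers.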

For demonstration purposes, we add a rule to decorate input lines with their hash, but this rule is not needed to use the function:

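BEGIN {
  # Lookup table mapping each single-byte character to its numeric code
  for (n = 0; n < 256; n++) {
    ord[sprintf("%c", n)] = n
  }
}

# Parameters prefixed with an underscore serve as local variables
function hash(text, _prime, _modulo, _ax, _chars, _i) {
  _prime = 104729;
  _modulo = 1048576;
  _ax = 0;
  # Splitting on "" yields one element per character (a common extension)
  split(text, _chars, "");
  for (_i = 1; _i <= length(text); _i++) {
    _ax = (_ax * _prime + ord[_chars[_i]]) % _modulo;
  }
  return sprintf("%05x", _ax)
}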

# Rule to demonstrate the function
#  These comments and the following line are not relevant
#  to the definition of the hash function but illustrate
#  its use.

{ printf("%s|%s\n", hash($0), $0) }

We save the program above to the file hash.awk and demonstrate it on a short list of classic English book titles:

awk -f hash.awk <<EOF
Wuthering Heights
Jane Eyre
Pride and Prejudice
The Mayor of Casterbridge
The Great Gatsby
David Copperfield
Great Expectations
The Return of the Soldier
Alice's Adventures in Wonderland
Animal Farm
EOF

The output is

6d6b1|Wuthering Heights
7539b|Jane Eyre
d8fba|Pride and Prejudice
fae95|The Mayor of Casterbridge
17fae|The Great Gatsby
c0005|David Copperfield
7492a|Great Expectations
12871|The Return of the Soldier
c3ab6|Alice's Adventures in Wonderland
46dc0|Animal Farm

When applied to each of the 6948 non-blank lines of my favourite novel, this hash function does not generate any collisions.
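A check of this kind could be sketched as follows. The wrapper name `count_collisions` is hypothetical, and two choices are assumptions made here: repeated identical lines are not treated as collisions, and characters are picked out with `substr`. The sketch reports how many hash values are shared by two or more distinct non-blank lines.

```shell
# Hypothetical helper: count hash values shared by distinct non-blank lines.
count_collisions() {
  awk '
    BEGIN { for (n = 0; n < 256; n++) ord[sprintf("%c", n)] = n }

    # Same recurrence as the hash function, with the constants inlined
    function hash(text,    ax, i) {
      ax = 0
      for (i = 1; i <= length(text); i++)
        ax = (ax * 104729 + ord[substr(text, i, 1)]) % 1048576
      return sprintf("%05x", ax)
    }

    # Skip blank lines and lines already seen (assumption: not collisions)
    NF && !($0 in seen) {
      seen[$0] = 1
      # Count a hash value once, when a second distinct line maps to it
      if (++count[hash($0)] == 2) collisions++
    }

    END { print collisions + 0 }
  ' "$@"
}
```

Called as `count_collisions novel.txt`, where `novel.txt` stands for any text file, it prints 0 when no two distinct lines share a hash.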