6

To summarise this question in advance, I'm looking for a good hash function that is suitable for generating pseudo-random numbers in Monte Carlo simulations. This means it should be reasonably fast (so something like md5 is ruled out) but have statistics that are good enough for numerical applications.

The reason is that sometimes, when writing simulation code, I've felt the need to generate a function that maps some set of objects to floats between 0 and 1, in a pseudo-random way. Such a function might always map the Numpy array [1.0, 2.0, 3.0] to 0.11214, but on a different run of the simulation a different function might be generated, which always maps [1.0, 2.0, 3.0] to 0.92546. The input set might not consist of arrays, but if it does then it's important that order is not ignored, i.e. [1.0, 2.0, 3.0] and [1.0, 3.0, 2.0] should return different results.

It seems there are a few different technical terms for generating such a random function: depending on the details of the implementation it's called either universal hashing or a pseudorandom function family. Of course, one way to implement this is to use a good hashing function. Different pseudorandom functions can be generated using "salting", i.e. prepending a fixed string to every object before running the hashing algorithm. Changing this prefix gives a different pseudorandom function.
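
To make the salting idea concrete, here is a minimal sketch (md5 is only a stand-in here; the whole point of the question is to find something faster with comparable statistics). Changing the salt selects a different pseudorandom function:

import hashlib

def salted_unit_hash(salt, data):
    """Sketch: prepend a fixed salt to the input bytes, hash, and scale to [0, 1)."""
    digest = hashlib.md5(salt + data).hexdigest()
    # Interpret the first 16 hex digits (64 bits) as an unsigned integer and scale.
    return int(digest[:16], 16) / 2.0**64

print(salted_unit_hash(b"run-1:", b"1.0,2.0,3.0"))
print(salted_unit_hash(b"run-2:", b"1.0,2.0,3.0"))  # different salt, different function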

Secure hashing algorithms such as md5 will have excellent statistics, but are too slow to be practical in a Monte Carlo simulation. Of course there exist many "fast hashing" algorithms (i.e. non-secure hashing algorithms) that might be suitable to use instead. However, the issue is that I haven't been able to find a good analysis of any such algorithms in terms of their suitability for numerical computation.

This is important, because most fast hashing algorithms are designed for use in file systems. Typically, papers describing hash functions evaluate them in terms of how well they will perform in such an application. However, the requirements there are rather different from the requirements of a numerical simulation. I'm not an expert, but the essential difference is that for a file system we care a great deal about avoiding collisions, whereas for a numerical algorithm we care most about uniformity of the distribution of outputs and statistical independence of the function's outputs, even when the function is given inputs that are closely related.

These goals are related to one another (and secure hashing algorithms will score very well on all three) but they are not the same. As a trivial example to illustrate this, using Python's built-in hashing algorithm modulo 1000, "aaa" is mapped to 340, "aab" to 343, "aac" to 342, "aad" to 337, "aae" to 336 and "aaf" to 339. There are no collisions in this data, but the results are clearly neither uniformly distributed nor independent.
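
For anyone who wants to reproduce that observation, the experiment is just the loop below; the exact values depend on the Python version (Python 3 randomizes string hashes per process by default), so the numbers above will not match exactly.

for s in ["aaa", "aab", "aac", "aad", "aae", "aaf"]:
    # Look at the spacing of the outputs rather than the particular values.
    print(s, hash(s) % 1000)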

So I'm looking for any advice on which hashing algorithm(s) to use in Monte Carlo simulations, to get a good trade-off between speed and statistics that are suitable for numerics. In an ideal world there would also be an existing C implementation with Python bindings; it would also be ideal if objects like Numpy arrays could be hashed directly, rather than first having to convert them to strings. Ideally I would like an off-the-shelf solution that I can trust, in the same way that I trust Numpy's random number generator. However, I don't mind implementing it myself if that's what's necessary - the important thing is to find an algorithm that has been formally evaluated in terms of its suitability for numerical applications, rather than for use in file systems.

Anton Menshov
N. Virgo

2 Answers

2

The first thing that comes to my mind is the following:

import random

def randomy(obj):
    """
    Map objects to uniformly distributed random numbers in [0, 1].
    The input must be hashable. You will get the same output for
    the same input during all calls in the same program run, but
    outputs may vary between different program runs.
    """
    try:
        return randomy._store[obj]
    except KeyError:
        value = random.random()
        randomy._store[obj] = value
        return value

randomy._store = {} # Initialize memoization map

x = "Test string"
y = "Another string"

print(randomy(x))
print(randomy(x))
print(randomy(y))
Florian Brucker
  • That only works with hashable objects, i.e. you cannot use mutable objects, such as lists or numpy arrays. – Jaime Aug 15 '13 at 11:10
  • This is essentially what I've done in the past. However, depending on the situation it can use up too much storage space to be practical. (Often many of the inputs are unique, with only a few coming up more than once.) Something like universal hashing would avoid this problem. – N. Virgo Aug 15 '13 at 12:54
  • There's also another disadvantage to this that a universal hashing type implementation would solve. Choosing a new random function is like randomly changing all the parameters of my model, and it would be nice to keep the same (randomly generated) parameters but change the initial conditions. Using a lookup table approach makes this impossible, because it will generate a different function if given the same inputs in a different order. Your answer is still a good one in many cases though. – N. Virgo Aug 15 '13 at 13:05
  • @Jaime: If your objects are mutable then you have to think about whether two calls of randomy(x) should return the same value if x is mutable and was modified in between. For either choice it's easy to extend this approach to handle it correctly. However, as @Nathaniel mentioned, space is often the bigger problem. – Florian Brucker Aug 15 '13 at 14:03
  • @FlorianBrucker Even if you know the content is not going to change, so the approach is conceptually correct, Python will spit something like this at you: >>> {[]: 5}; Traceback (most recent call last): File "<stdin>", line 1, in <module>; TypeError: unhashable type: 'list' – Jaime Aug 15 '13 at 14:15
  • @Jaime: Of course. As I said: "For either choice it's easy to extend this approach to handle it correctly." – Florian Brucker Aug 15 '13 at 14:34
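
As a sketch of the extension discussed in these comments (assuming the array contents do not change between calls), one option is to key the cache on an immutable summary of the array rather than on the array itself:

import numpy as np

def array_key(obj):
    """Turn a Numpy array into a hashable, order-preserving key for randomy().
    Sketch only: it assumes the array is not modified after the call."""
    if isinstance(obj, np.ndarray):
        return (obj.shape, obj.dtype.str, obj.tobytes())
    return obj

# Usage with the randomy() function above, e.g.:
# print(randomy(array_key(np.array([1.0, 2.0, 3.0]))))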
2

If the data object you are trying to map is large enough, then just taking bit patterns is probably enough. For example, you might simply xor all the bytes of your object, giving you a number between 0 and 255 which you can then map onto the reals between 0 and 1.
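
A minimal sketch of that suggestion for a Numpy array input (with the caveats raised in the comments below: XOR ignores the order of the bytes it folds together, and nearby inputs give strongly correlated outputs):

import numpy as np

def xor_byte_hash(arr):
    """XOR all bytes of the array together and scale the 0-255 result to [0, 1)."""
    as_bytes = np.frombuffer(arr.tobytes(), dtype=np.uint8)
    return int(np.bitwise_xor.reduce(as_bytes)) / 256.0

print(xor_byte_hash(np.array([1.0, 2.0, 3.0])))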

Wolfgang Bangerth
  • In my model it's very important that inputs that differ by only a few bits will give uncorrelated results, so a solution like this isn't so good for my purposes. Of course one can think of ways to make the results less likely to be correlated, but this kind of ad-hoc solution is really what I'm trying to avoid. – N. Virgo Aug 15 '13 at 13:00
  • @Nathaniel: What keeps you from applying a fast hash (i.e. not a cryptographic hash) to your objects' bit patterns? See for example this paper for a fast hash function on variable-length inputs that's designed to give a good approximation to uniformly distributed outputs. – Florian Brucker Aug 15 '13 at 14:11
  • 1
    If you're going to use a non-crypto hash of the bits, you should also check out MurmurHash (https://sites.google.com/site/murmurhash/) which has very good distribution properties and is in the public domain. – Bill Barth Aug 15 '13 at 14:48
  • @FlorianBrucker your comment was helpful - I've edited the question to answer your question a bit. A good fast hashing algorithm would indeed suit my needs, but as a non-expert it's hard for me to know whether the statistics of a given algorithm will be good enough for numerical purposes. Papers describing hashing algorithms tend to focus on their performance in file system applications, which don't necessarily have the same needs as Monte Carlo simulations, and this would make me cautious about publishing the results. – N. Virgo Aug 16 '13 at 04:20
  • @BillBarth thank you for the helpful suggestion (I will try it), but see my above comment and edited question - the evaluation of MurmurHash and its relatives always seems to focus on its performance in file system type applications, rather than numerical computation. This would make me cautious about publishing results based on it, because the needs of file system applications and of numerical applications are not necessarily the same. – N. Virgo Aug 16 '13 at 04:24
  • MurmurHash comes with a hash testing library SMHasher that demonstrates its statistical properties. AFAIK, there was no explicit filesystem purpose in its creation, just the need for a fast, well-distributed hash function. Why don't you give it a try and see if it meets your needs? I don't think there are any other libraries that are proven to meet all your criteria. – Bill Barth Aug 16 '13 at 12:03
  • @BillBarth I'll try it as soon as I know what the maximum output value is, so I can scale them to floats. (Python has no LONG_MAX constant, and it seems to output values greater than 2^32 but much less than 2^64.) I've emailed the developer and I hope they'll get back to me. However, as I said, I'm specifically looking for algorithms that have been formally evaluated for numerical applications. I will feel uneasy about using any algorithm for which there is no paper published on that subject, because I lack the expertise to do that kind of evaluation myself. – N. Virgo Aug 17 '13 at 05:02
  • @BillBarth I don't mean to be ungrateful though - it's a good suggestion and it's quite likely to be what I end up using. Another disadvantage is that it requires me to convert every object to a string before hashing it. This isn't necessarily a huge problem, but I can imagine it will cause speed issues if I need to hash large Numpy arrays. – N. Virgo Aug 17 '13 at 05:10
  • @Nathaniel, sometimes you have to blaze new territory in computational science. :) MurmurHash comes in 32-, 64-, and 128-bit versions. Converting 64-bit floating-point numbers to strings is free in C/C++/Fortran. Are you trying to generate one hash value per Numpy array or one per array element? – Bill Barth Aug 17 '13 at 14:35
  • @BillBarth I'm using this library, which implements it for Python: http://code.google.com/p/pyfasthash/ . I think I must have made a mistake earlier about murmur3_32 outputting numbers greater than 2^32. I'll need one hash value per Numpy array - but in Python str() is clearly not free, and I don't think it's possible to simply reinterpret_cast a Numpy array into a Python string. – N. Virgo Aug 17 '13 at 14:42
  • If you're going for one value per whole array, yes you have a problem of speed in any language, but no one was suggesting that you pass your array or its elements to str(). We were suggesting that you take the literal bit values of the array elements and treat them as a string (which would be free in C/C++) and then pass that to your hashing function. With larger numbers of elements, you could xor them together with some small cost and pass that. – Bill Barth Aug 18 '13 at 11:46
  • @BillBarth I know that's what you were suggesting, I was just pointing out that it isn't possible to do that in Python. XORing the elements would ignore their order, so it isn't really an option. (By the way, I don't get notified of your replies unless you put '@Nathaniel' in your comments.) – N. Virgo Aug 19 '13 at 04:25
  • @Nathaniel: If order is important, you should have specified that in the question. That makes this a completely different question requiring a completely different response. – Bill Barth Aug 19 '13 at 10:38
  • @BillBarth what part of "a function that maps some set of objects to floats between 0 and 1, in a pseudo-random way" implies that I would want to ignore the order of any elements contained in the object? In any case I've edited the question to try and make it clearer. – N. Virgo Aug 19 '13 at 12:43
  • @Nathaniel, because the purpose of your mapping isn't clear, and the definition of "objects" is open, I was confused. – Bill Barth Aug 19 '13 at 15:38
  • @BillBarth in retrospect it should have been clearer, my apologies. – N. Virgo Aug 20 '13 at 00:00
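
For reference, the bit-pattern approach discussed in this thread can be written without any str() round trip, since Numpy arrays expose their raw bytes directly. The sketch below assumes the mmh3 package (one Python binding for MurmurHash3, whose 32-bit variant returns a signed integer) rather than the pyfasthash binding mentioned above:

import numpy as np
import mmh3  # assumed binding: mmh3.hash(data, seed) returns a signed 32-bit int

def array_to_unit_float(arr, seed=0):
    """Map a Numpy array to a float in [0, 1) via its raw bytes.
    Order-sensitive, because tobytes() preserves element order; the seed
    plays the role of the salt and selects the pseudorandom function."""
    h = mmh3.hash(arr.tobytes(), seed)
    return (h & 0xFFFFFFFF) / 2.0**32  # reinterpret as unsigned, then scale

print(array_to_unit_float(np.array([1.0, 2.0, 3.0]), seed=1))
print(array_to_unit_float(np.array([1.0, 3.0, 2.0]), seed=1))  # order matters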