
I'm aware that MD5 is broken, and collisions have been found for it. I'm interested in other hashes (SHA-1, SHA-2, SHA-3) when truncated to the same digest size, i.e. 128 bits.

The time complexity of a generic collision attack is about 2^(n/2) hash evaluations (the "birthday attack"). So in the case of a 128-bit hash, one would have to hash roughly 2^64 inputs for a 50% probability of finding two inputs that hash to the same value (though a collision might be found much earlier in practice).
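More precisely, with $N = 2^{128}$ possible digests, the standard approximation for the probability of at least one collision among $k$ random inputs is $1 - e^{-k(k-1)/(2N)}$, which crosses 50% at $k \approx 1.18\sqrt{N} \approx 1.18 \cdot 2^{64}$.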

The currently fastest single Bitcoin mining machine performs about 5 trillion double SHA-256 hashes per second, so assume roughly 10 trillion single hashes per second. Assuming also that the tested inputs are no larger than one hash block, it would take about:

2^64 / (10*10^12) ~= 1.8*10^6 seconds ~= 21 days

to scan enough hashes for a 50% chance of a collision. At that point, though, the machine would also have to allocate at least 16 * 2^64 ~= 2.95*10^20 bytes ~= 295 exabytes of memory for storing the previously seen hashes. With today's computing capabilities this seems somewhat reachable, though far from trivial, and the enormous memory/storage requirements may be extremely expensive to handle.
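
A quick sanity check of these figures in Python (the 10^13 hashes/s rate and the 16 bytes stored per 128-bit digest are just the assumptions from above):

```python
# Back-of-the-envelope check of the brute-force estimate above.
# Assumed: ~1e13 single SHA-256 hashes/s on one machine,
# 16 bytes stored per 128-bit digest.

HASHES = 2 ** 64            # ~50% birthday bound for a 128-bit digest
RATE = 10e12                # hashes per second (assumed)
BYTES_PER_DIGEST = 16       # 128 bits

seconds = HASHES / RATE
print(f"time:    {seconds:.2e} s  (~{seconds / 86400:.0f} days)")

storage = HASHES * BYTES_PER_DIGEST
print(f"storage: {storage:.2e} bytes (~{storage / 1e18:.0f} EB)")
```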

Has this already been tried or achieved? Any references? (I'm mostly interested in attacks using brute-force search, but more sophisticated ones are also relevant.)

Anon2000
  • I'd guess no, but it's only a guess, because I can't think of any situation where you'd want to use a 128-bit hash function. Hence this would be "just for fun", and it looks too expensive for that. (Maybe the NSA did, though...) – SEJPM May 19 '15 at 11:32
  • Also note that there are ways to search for such a collision that radically reduce the amount of memory required (at not that huge of a computational cost). – poncho May 19 '15 at 19:09
  • Additional note: using a single machine is pointless; such an attack would use many machines, effectively reducing the time needed to a lot less, perhaps a few days. (The NSA certainly could do this, I think.) – SEJPM May 19 '15 at 19:23
  • @SEJPM: sure, I gave the example to show that it could even be done with one machine. Managing and distributing the CPU power wouldn't really be a problem (I guess it's a sort of an "embarrassingly parallelizable" problem). The monster memory/storage requirements would be much more challenging, though.. – Anon2000 May 19 '15 at 19:49
  • @poncho: I'm really interested in what kind of approach you're describing. Is this some way to order the workload for efficient memory utilization, or is it based on a more theoretical attack? – Anon2000 May 19 '15 at 19:50
  • My answer should further explain @poncho's comment and give a more "realistic" time schedule. – SEJPM May 19 '15 at 19:54

1 Answer


I'm not aware of any case where somebody actually searched for such a collision.

However, it would certainly be possible, as a comparable workload ($2^{64}$ operations) was already accomplished back in 2002 by the distributed.net project, which brute-forced RC5-64.

Now assume you'd use the full power of the Bitcoin mining network (about 300 peta double hashes/s, i.e. roughly 600 peta single hashes/s, as of 19 May 2015): you'd expect a collision (the $2^{64}$ birthday bound) after about 30.74 seconds ($2^{64}/(600\cdot 10^{15}) \approx 30.74$).

As poncho correctly noted, there are algorithms that help you overcome the massive amount of memory you calculated. Of particular interest would be a "memoryless" variation of Yuval's birthday attack (book page 369, PDF page 50 of the "Handbook of Applied Cryptography").
This related question may be helpful.
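
To make the idea concrete, here is a minimal sketch of such a memoryless cycle-finding collision search in Python. It truncates SHA-256 to 32 bits so that the demo runs in seconds; the 128-bit case works exactly the same way but needs on the order of $2^{64}$ hash calls (the truncation width and the random seed are arbitrary choices for the demo):

```python
import hashlib
import os

TRUNC_BYTES = 4  # 32-bit truncation for a fast demo; the question's scenario would use 16


def f(x: bytes) -> bytes:
    """SHA-256 truncated to TRUNC_BYTES bytes."""
    return hashlib.sha256(x).digest()[:TRUNC_BYTES]


def rho_collision(seed: bytes):
    """Floyd's tortoise-and-hare: ~2^(n/2) expected hash calls, O(1) memory."""
    # Phase 1: detect a cycle of the iterated map x -> f(x).
    tortoise, hare = f(seed), f(f(seed))
    while tortoise != hare:
        tortoise = f(tortoise)
        hare = f(f(hare))
    # Phase 2: walk one pointer from the seed and one from the meeting point
    # at equal speed; just before they meet, the two values form a collision.
    tortoise = seed
    while f(tortoise) != f(hare):
        tortoise = f(tortoise)
        hare = f(hare)
    if tortoise == hare:
        return None  # degenerate case; retry with a fresh seed
    return tortoise, hare


pair = None
while pair is None:
    pair = rho_collision(os.urandom(8))
a, b = pair
assert a != b and f(a) == f(b)
print(a.hex(), b.hex(), "->", f(a).hex())
```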

SEJPM
  • I was actually expecting this to be a sort of prefix tree or some other data structure.. (at least something I could perhaps more easily understand :) ). It turned out to be a bit more challenging than I expected. I guess it'll take me some time to figure it out. It's sort of hard for me, though, to imagine how this process could be "memoryless" in concept. Is there some easier, less mathematically precise explanation for how exactly it is possible to go through 2^64 unpredictable values and find a single duplicate without needing to somehow memorize them? – Anon2000 May 19 '15 at 20:21
  • The basic idea of the paper cited in the related question is to find a collision using cycle finding. This means you repeatedly compute $x_1=h(x_1)$ and $x_2=h(h(x_2))$; as soon as $x_2=x_1$ you've found a collision (if you remembered the pre-images...). This requires you to store only a small amount of data (the current values and the pre-images), and you'll still find the answer in $O(2^{n/2})$. It works because a random mapping (-> a hash) follows some path and eventually starts to cycle, and you basically let one pointer always take two steps on the cycle and the other only one until they meet. – SEJPM May 19 '15 at 20:30
  • @Anon2000: one obvious way is to do iterated hashing (where we compute $x_i = Hash_{128}(x_{i-1})$), stop at distinguished points (say, the first 32 bits are all zero), and store the initial/final values in a table. Build up a long list of such table entries (circa $2^{32}$ should do), and look for collisions in the final value -- if we find one, then the two chains merge (and finding where the chains merge is straightforward). That's not a zero-memory solution; however, it gets the memory requirements small enough... – poncho May 19 '15 at 20:31
  • @SEJPM Surprisingly, I managed to somehow understand your suggestion first! Perhaps not 100%, though, at least not yet.. One interesting thing, though: the "cycle" you describe when performing h(h(h(...))) is actually a "collision" (i.e. an already-seen hash), right? However, that particular collision is not necessarily the one that would eventually be found? Or do the tortoise and the hare start at the same values? And if they do, I guess it is? What if the hare misses a collision found by the tortoise?.. (this isn't very easy to reason about..) – Anon2000 May 19 '15 at 21:02
  • Usually you start at $x_1=x_2$, and for each update of $x_1$ you update $x_2$ twice. At some point the "hare" ($x_2$) completes a round around the cycle and catches up to the tortoise ($x_1$), and as soon as they meet you've found your collision. If you need a graphical representation of a cycle, you can find it here (book page 55, pdf page 8 of "the handbook"). – SEJPM May 19 '15 at 21:12
  • @SEJPM You know, I was thinking of a different construction! How about simply keeping a running XOR of all the hashes? When two cycles are completed, wouldn't it simply zero out? Is this a useful direction at least? – Anon2000 May 19 '15 at 21:23
  • And if it is, could this be used as a more general-purpose memoryless cycle detection algorithm? I mean, as long as each visited vertex could be hashed in some way right? (either directly, or through some construction that generates a hash for its state I guess)? – Anon2000 May 19 '15 at 21:51
  • @poncho Any ideas on the matter? Is this a well-known technique? – Anon2000 May 19 '15 at 22:07
  • @Anon2000: yes, the method that SEJPM suggested is known as rho cycle finding. The approach that I suggested (which is rather different) is more related to the Hellman time-memory tradeoff. I suspect the method I suggested is more practical (for one, it's more parallelizable; rather important if you're contemplating $2^{64}$ computations...) – poncho May 19 '15 at 22:10
  • @poncho: And what about the XOR technique I suggested? Will it work? (I intended to ask you about that, in fact..) – Anon2000 May 19 '15 at 22:12
  • @Anon2000: I don't see how you would make it work. When you hit a cycle, the XOR wouldn't zero out (as the values prior to when you hit the loop wouldn't zero out); instead the XORs would return to values you've previously seen. And looking for values previously seen is the problem we're trying to solve... – poncho May 19 '15 at 22:18
  • @poncho What I meant was that in the first run it would accumulate the XOR and in the second it would cancel it (that's what I meant by "two cycles"). Eventually it would be all zeros when it hits exactly the end of the [second] run. – Anon2000 May 19 '15 at 22:20
  • @Anon2000: nope; remember, the cycle is unlikely to include the beginning value; instead, there'll be a series of unique values, and then you'll hit the cycle. Now, you could arbitrarily zero out the XOR after it's likely you've entered the cycle; that would let you detect the size of the cycle, but it's not as clear how you'd find out where you entered the cycle (which is what you're really interested in). – poncho May 19 '15 at 22:27
  • @poncho Yes, sure, I realized it myself: this only gets me back to the beginning (doesn't make sense, sorry..). But you know, these kinds of simple "ideas" could evolve into more elaborate techniques, and maybe prove useful for a completely different class of problems. It isn't always wise to just "throw away" interesting methods.. Though I'm definitely not a computer scientist, so this is not the sort of thing I think about daily.. – Anon2000 May 19 '15 at 22:43
  • @poncho It seems like this is a method to detect the existence of a cycle [for a specific starting point], but it doesn't give a way to detect exactly where it "branches out". And in this particular situation the states are identified by very large, random numbers, so it might not be very useful (you could just save the expected value itself). Maybe in a different scenario, where the states were, for example, zeroes and ones, it might be more useful? [Just speculating.. and I know this is getting sort of off topic.. sorry :) ] – Anon2000 May 19 '15 at 23:12
  • Well, I guess we now have three different hash-collision search algorithms in this discussion: my "easy" cycle finding, which was linked in the related question; poncho's TMTO one; and the one described in the handbook. For practical purposes the easy one "dies" because it can only operate sequentially. The TMTO may work, as would the handbook's attack; a rough sketch of the TMTO/distinguished-points idea follows below. – SEJPM May 20 '15 at 20:29
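
For completeness, here is a rough sketch of the distinguished-points variant in Python as well. Again SHA-256 is truncated to 32 bits, and a value counts as "distinguished" when its top 12 bits are zero; both parameters are arbitrary demo choices (poncho's example used 32 zero bits for the full 128-bit problem). Only chain endpoints are stored, so the table is roughly $2^{12}$ times smaller than storing every hash, and independent chains could be farmed out to many machines in parallel:

```python
import hashlib
import os

TRUNC_BYTES = 4                     # 32-bit truncation for a fast demo (16 for 128 bits)
DIST_BITS = 12                      # "distinguished": the top 12 bits of the value are zero
MAX_CHAIN = 20 * (1 << DIST_BITS)   # abort rare chains stuck in a cycle with no distinguished point


def f(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()[:TRUNC_BYTES]


def distinguished(x: bytes) -> bool:
    return int.from_bytes(x, "big") >> (8 * TRUNC_BYTES - DIST_BITS) == 0


def run_chain(start: bytes):
    """Iterate f from `start` until a distinguished value is reached."""
    x, steps = start, 0
    while not distinguished(x):
        if steps > MAX_CHAIN:
            return None, steps
        x, steps = f(x), steps + 1
    return x, steps


def rewind(chain_a, chain_b):
    """Two chains with the same endpoint merge somewhere; walking them in
    lockstep from aligned positions finds the pair of distinct inputs that
    map to the same value (the collision)."""
    (a, la), (b, lb) = chain_a, chain_b
    for _ in range(la - lb):        # align: advance the longer chain first
        a = f(a)
    for _ in range(lb - la):
        b = f(b)
    if a == b:
        return None                 # one chain is a suffix of the other; no new collision
    while f(a) != f(b):
        a, b = f(a), f(b)
    return a, b


def find_collision():
    table = {}                      # distinguished endpoint -> (chain start, chain length)
    while True:
        start = os.urandom(TRUNC_BYTES)
        end, steps = run_chain(start)
        if end is None:
            continue
        if end in table:
            pair = rewind(table[end], (start, steps))
            if pair is not None:
                return pair
        else:
            table[end] = (start, steps)


a, b = find_collision()
assert a != b and f(a) == f(b)
print(a.hex(), b.hex(), "->", f(a).hex())
```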