57

I recently encountered a case where I needed an integer division operation on a chip that lacked one (ARM Cortex-A8). While trying to research why that must be, I found out that in general division takes many more cycles than addition, subtraction or multiplication on pretty much any integer (or fixed-point) architecture. Why is this the case? Is it not representable with a two-layer AND-OR logic like everything else?

Phonon
  • 5
    Division is fundamentally harder wrt boolean circuit depth: https://dl.acm.org/doi/pdf/10.1145/800057.808714 – user14717 Apr 30 '20 at 10:43

3 Answers

41

Division is inherently iterative: at each step the result must be shifted from the quotient to the remainder using a Euclidean measure (see the division algorithm theorem), whereas multiplication can be reduced to a (fixed) series of bit-manipulation tricks.

aterrel
  • 4
    It used to be that both multiplication and division were slow operations. Nowadays multiplication is a bit faster (but slightly slower than addition/subtraction), but division still is slower than the others. I believe Newton-Raphson is still used internally by most for reciprocating a number. – J. M. Dec 03 '11 at 00:36
  • 22
    (Off-topic: "Inverse operations are usually hard. Just look at integration versus differentiation." - depends on whether what you're doing is symbolic or numeric. Differentiation is symbolically easy, but numerically hard; integration is symbolically hard, but numerically easy.) – J. M. Dec 03 '11 at 00:37
  • @J.M. Hmm, I don't know that integration is any easier numerically than differentiation. If by easy you mean possible, via Monte Carlo, then yes. One can probably come up with cases for both that are equally hard. – aterrel Dec 03 '11 at 02:01
  • 1
    Okay, I'll cop out by saying that cubature is a different can of worms; but at least in the one-dimensional case, quadrature is easier than differentiation. – J. M. Dec 03 '11 at 02:32
  • How is differentiation hard? It is guaranteed to be within 5x of a function evaluation by Griewank's AD result. – Matt Knepley Dec 03 '11 at 04:04
  • @Matt, that counts as "symbolically easy". I still stand by the "numerically hard" statement. – J. M. Dec 03 '11 at 04:08
  • 2
    In any case, inverses always come in pairs. Why would you call one the "operation" and the other the "inverse"? – David Ketcheson Dec 03 '11 at 04:33
  • 4
    Neither iteration nor inverse makes it harder. Hardness of division comes from the fact that you have to shift the result from quotient to remainder using a Euclidean measure. See the division algorithm theorem. –  Dec 03 '11 at 21:42
  • @J.D. You are correct that the shifting of the result from the quotient to the remainder is a difficult step in the division algorithm. Multiplication still has simple bit hacks that speed it up though. I will edit the answer to include this. – aterrel Dec 03 '11 at 21:48
  • All, I'm removing the comment about inverses. It's clear that it is misleading. – aterrel Dec 03 '11 at 21:49
  • Howdy Andy. Great response! – meawoppl May 05 '12 at 00:51
29

While all current CPUs appear to use an iterative approach, as aterrel suggests, there has been some work done on non-iterative approaches. Variable Precision Floating Point Division and Square Root describes a non-iterative implementation of floating-point division and square root in an FPGA, using lookup tables and Taylor series expansion.

I suspect that the same techniques may make it possible to get these operations down to a single cycle (throughput, if not latency), but you are likely to need huge lookup tables, and thus infeasibly large areas of silicon real-estate to do it.

Why would it not be feasible?

In designing CPUs there are many trade-offs to make. Functionality, complexity (number of transistors), speed and power consumption are all interrelated, and the decisions made during design can have a huge impact on performance.

A modern processor probably could have a main floating point unit which dedicates enough transistors on the silicon to perform a floating point division in a single cycle, but it would be unlikely to be an efficient use of those transistors.

The floating point multiply made this transition from iterative to non-iterative a decade ago. These days, single cycle multiply and even multiply-accumulate are commonplace, even in mobile processors.

Before it became an efficient use of the transistor budget, multiplication, like division, was often performed by an iterative method. Back then, dedicated DSP processors might devote most of their silicon to a single fast multiply-accumulate (MAC) unit. A Core 2 Duo CPU has a floating-point multiply latency of 3 cycles (the value comes out of the pipeline 3 cycles after it went in), but can have 3 multiplies in flight at once, resulting in single-cycle throughput; meanwhile its SSE2 unit can pump out multiple FP multiplies in a single cycle.

Instead of dedicating huge areas of silicon to a single-cycle divide unit, modern CPUs have multiple units, each of which can perform operations in parallel but is optimised for its own specific situations. In fact, once you take into account SIMD instructions such as SSE, or the integrated graphics of Sandy Bridge or later CPUs, there may be many such floating-point divide units on your CPU.

If generic floating-point division were more important to modern CPUs, then it might make sense to dedicate enough silicon area to make it single-cycle; however, most chip makers have evidently decided that they can make better use of that silicon by using those gates for other things. Thus one operation is slower, but overall (for typical usage scenarios) the CPU is faster and/or consumes less power.

Mark Booth
  • 1
    To my knowledge, no chips have single-cycle divide latencies for floating point. For example, Agner Fog's instruction tables for Intel, AMD, and VIA CPUs lists DIVPS (SSE packed floating-point divide) as 10-14 cycles. I can't find any hardware with single-cycle divide instructions, but I'd be willing to be proved wrong. It's not common as far as I can tell. – Bill Barth Dec 06 '11 at 15:06
  • 1
  • @Bill - Thanks, you're right. I'm sure I've seen single-cycle division operations in DSP chips before, so assumed it would have made its way to the desktop, just as single-cycle multiply did, but I can't find any references now. I've updated my answer and added some relevant information on non-iterative methods which might allow it in the future though. It's amazing to think that division is no more efficient per cycle now than back when I was using transputers. – Mark Booth Dec 06 '11 at 19:48
  • 3
    I think DSPs do that by limiting the range in which they are accurate. This is the same strategy used for lookup+interpolation for square root. – Matt Knepley Dec 07 '11 at 11:27
  • 1
    I am not sure what the latency of such a division would be, though. At 4 GHz, making a round-trip to the look-up table within N cycles severely limits the potential size of said table (for example, the L1 caches have been stagnating at 32K each). Going 3D would help increasing this (but is challenging wrt. cooling). Do you have any idea what latency could be reached for modern 4GHz/5GHz CPUs? – Matthieu M. Jan 11 '17 at 13:59
  • CPU manufacturers would only dedicate a large swath of silicon to a low latency floating point divide unit if they couldn't use that silicon more effectively to speed up the CPU in other ways. Generally speaking though, having more long latency FDIVs are a more efficient use of silicon than fewer shorter latency FDIVs. AS for current generation CPUs, as with other areas, we have only seen incremental improvements. Some division operation latencies apparently dropped from 20 cycles to 14 cycles this generation, but that is quite rare these days. – Mark Booth Mar 14 '17 at 10:59
  • 1
    For divps / divpd vs. mulps / mulpd latency and throughput numbers, see Floating point division vs floating point multiplication. I took data from Agner Fog's instruction tables and formatted it into a summary across uarches of div and mul throughput and latency, for single vs. double and for different SIMD vector widths. (Intel chips typically have a SIMD divider that's only half the width of the other vector ALUs.) – Peter Cordes Nov 27 '18 at 23:32
0

I'll try to give a simplistic, vaguely correct answer.

The slow division algorithm is essentially iterated subtraction (with some extra steps). It's intrinsically more time-consuming than multiplication.

(turns out it's not a great idea to think I could sum this up in a couple of lines.)

ocodo