Let's take an algorithm as an example: the DCT-II used in JPEG. The formula is as follows:
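$$Y = M V M^T, \qquad M_{kn} = c_k \sqrt{\frac{2}{N}} \cos\!\left[\frac{\pi (2n+1) k}{2N}\right], \qquad c_0 = \frac{1}{\sqrt{2}},\ c_k = 1 \text{ for } k > 0$$

(this is the standard orthonormal 2-D DCT-II, where $V$ is the $8\times 8$ input block and $N = 8$).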
The problem is that the values in M are not integers, and many are even less than 1 in magnitude. We have two options: 1) fixed-point maths and 2) floating-point maths. We can also see that the input data is multiplied by M and also by its transpose, which implies a lot of multiplication and addition operations.
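To make that concrete, here is a small sketch (Python with NumPy; the variable names and the random test block are just my illustration) that builds the 8×8 DCT matrix and applies it from both sides:

```python
import numpy as np

N = 8  # JPEG operates on 8x8 blocks

# Orthonormal DCT-II matrix: M[k, n] = c_k * sqrt(2/N) * cos(pi*(2n+1)*k / (2N))
n = np.arange(N)
k = n.reshape(-1, 1)
M = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
M[0, :] /= np.sqrt(2.0)  # c_0 = 1/sqrt(2); all other rows have c_k = 1

print(M.min(), M.max())  # every entry lies in (-0.5, 0.5): non-integer and below 1

# 2D DCT of a block V: multiply by M on the left and by its transpose on the right
V = np.random.randint(-128, 128, size=(N, N)).astype(float)
Y = M @ V @ M.T
```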
Let's assume that we are going to use a 16-bit fixed-point representation for M. This means that all values in M will be scaled by 2^16 for the arithmetic and then scaled back down. As a result we get rounding errors in two places: first when representing M as fixed-point numbers, and again each time we complete half of the calculation (either M*V or V*M^T) and scale back down, since the output of each matrix multiplication is supposed to be an integer. How exactly can a person calculate the precise amount of error that fixed-point maths with a given number of bits and a given rounding method will introduce into the result?
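To show exactly what I mean, here is a sketch of the fixed-point path next to the floating-point reference (Python/NumPy; the 2^16 scale, round-to-nearest, and int64 accumulators are my assumptions), which at least lets me measure the error empirically rather than predict it:

```python
import numpy as np

N = 8
n = np.arange(N)
k = n.reshape(-1, 1)
M = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
M[0, :] /= np.sqrt(2.0)

SHIFT = 16
SCALE = 1 << SHIFT  # scale M by 2^16, as described above

# Error source 1: quantising M to fixed point (round-to-nearest)
M_fix = np.round(M * SCALE).astype(np.int64)

V = np.random.randint(-128, 128, size=(N, N)).astype(np.int64)

# Error source 2: scaling the first product M*V back down to integers
T = (M_fix @ V + SCALE // 2) >> SHIFT
# Error source 3: scaling the second product T*M^T back down to integers
Y_fix = (T @ M_fix.T + SCALE // 2) >> SHIFT

Y_ref = M @ V.astype(float) @ M.T  # floating-point reference
print("max abs error:", np.abs(Y_fix - Y_ref).max())
```

I can see that the same bookkeeping should work analytically: round-to-nearest puts each entry of M_fix within 1/2 (in scaled units, i.e. 2^-17 in real units) of the true value, and each rescaling step adds at most 1/2 per output element. But I don't know how to propagate that through the two sums of eight products into a tight, provable bound, which is what I'm asking.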
This question may be better suited to a mathematician, but I have posted it here.

