5

What is a concise, fast, down-to-earth algorithm for doing (or closely approximating) spline interpolation on a 1d continuous stream of data?

(Edit1: The paragraph below equates to saying "the data is uniform in interval." in an awkward way.)

The data is 1d in that sampling on x is at fixed, regular intervals (likely a power of 2 constant) known well ahead of time. This makes y the only variant, which should allow for quite a bit of simplification and precomputation (LUT?).

Here is a graphical example of approximately what we're trying to do. It's our algo applied to a coarsely sampled sine function.

(Edit2: Note that this is only an example of what our algo should do with coarsely sampled sine data; the data we would actually like to process will be random values in the set [0,255], most likely 3 or 4 points at a time.)

Matlab example of spline interpolation over coarsely sampled sine
(source: mathworks.com)

Assume high accuracy is not required, but that we must compute some number of results (red dots) between samples (blue circles) in less than 8 ms (~120 Hz). Also assume the signal processor available is limited in power and instruction set (PIC/AVR), so it has only the following relevant instructions and limitations.

  • (signed + carry) Addition/subtraction instructions.
  • (unsigned 8x8-to-16) Multiplication instruction. (PIC18, megaAVR only)
  • Byte wide Boolean instructions (XOR, OR, NOR, AND, NAND, etc.)
  • Single bit left and right logical shifts. (no barrel shifter)
  • Can only execute at 2~4 MIPS

Additional Notes:

  • Responses would preferably be in pseudocode, so they are more generally applicable.
  • Cheating is totally OK; it doesn't need to be perfect, just better than linear interpolation.
  • Bonus points for alternatives that don't need multiplication.
  • More bonus points for alternatives likely to complete in less than 1ms!

This is for an RGB mood lamp x-mas present for my sister and mommy :3, which I would do myself, but the maths for this are apparently beyond me.


Edit 12-21-2016: Better list formatting

Charlie
  • Do you need to calculate the output in order or can the algorithm be such that it puts those original samples sparsely into a large array in memory and filters that array in multiple passes? – Olli Niemitalo Dec 27 '16 at 08:37
  • It probably could be the sparse scenario, yes. My intention is to create an endless smooth waveform with no discontinuities, having random peaks and valleys. (see the last picture in my own answer) Right now I'm taking samples from a 4 byte array of random numbers and 4x upsampling (cubic Hermite) to a 32 byte ring buffer. Every 4 samples the 4 byte array gets shifted 1 byte, and a new original sample is loaded on the freshly emptied end. – Charlie Dec 27 '16 at 13:13
  • A discrete low-pass filter followed by linear interpolation should also work well. Doesn't have to be a linear-phase filter if your input is noise. – Olli Niemitalo Dec 27 '16 at 17:18

4 Answers

6

Take a look at the cubic Hermite spline. The interpolated function is continuous at the data points and the first derivative is also continuous. Away from the data points all of the derivatives are continuous.

Let's say that the function $f(x)$ is defined by equally-spaced data points at every integer $x$. This means you know the values of $f(0), f(1), f(2), \ldots$

Then separate $x$ into integer and fractional parts:

$$ x \triangleq n+u $$

where

$$ n = \lfloor x \rfloor = \operatorname{floor}(x) $$

$ n = \lfloor x \rfloor$ is the sole integer such that

$$ x-1 < \lfloor x \rfloor \le x < \lfloor x \rfloor + 1 $$

and the fractional part

$$ u = x - n \quad \text{ , } \quad 0 \le u < 1 $$


$$\begin{align} f(n+u) & = \begin{bmatrix} 1 & u & u^2 & u^3 \\ \end{bmatrix} \cdot \begin{bmatrix} 0 & 1 & 0 & 0 \\ -\tfrac12 & 0 & \tfrac12 & 0 \\ 1 & -\tfrac52 & 2 & -\tfrac12 \\ -\tfrac12 & \tfrac32 & -\tfrac32 & \tfrac12 \\ \end{bmatrix} \cdot \begin{bmatrix} f(n-1) \\ f(n) \\ f(n+1) \\ f(n+2) \end{bmatrix} \\ \\ & = \frac12 \begin{bmatrix} -u^3 +2u^2 - u \\ 3u^3 - 5u^2 + 2 \\ -3u^3 + 4u^2 + u \\ u^3 - u^2 \end{bmatrix}^\mathrm{T} \cdot \begin{bmatrix} f(n-1)\\f(n)\\f(n+1)\\f(n+2) \end{bmatrix} \\ \\ & = \frac12 \begin{bmatrix} u ((2-u) u-1) \\ u^2 (3 u-5)+2 \\ u ((4-3 u) u+1) \\ u^2 (u-1) \end{bmatrix}^\mathrm{T} \cdot \begin{bmatrix} f(n-1)\\f(n)\\f(n+1)\\f(n+2) \end{bmatrix} \\ \\ & = \tfrac12 \bigg( (u^2(2-u)-u)f(n-1) \ + \ (u^2(3u-5)+2)f(n) \\ & \quad \quad \quad \quad + \ (u^2(4-3u)+u)f(n+1) \ + \ u^2(u-1)f(n+2) \bigg) \\ \\ & = \tfrac12 \bigg(\Big(\big(-f(n-1) + 3f(n) - 3f(n+1) + f(n+2)\big) u \\ & \quad \quad \quad \quad + \big(2f(n-1) - 5f(n) + 4f(n+1) - f(n+2)\big)\Big)u \\ & \quad \quad \quad \quad + \big(-f(n-1) + f(n+1)\big)\bigg)u \ + \ f(n)\\ \end{align}$$

where $[\cdot]^\mathrm{T}$ means the matrix transpose and $\lfloor\cdot\rfloor=\operatorname{floor}(\cdot)$ means the floor() function, the largest integer that does not exceed the argument. The last expression is using Horner's Method, which might be useful (perhaps more useful with floating-point).
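As a sketch of how that last (Horner) expression maps to code, here is a straightforward floating-point version (slow on a PIC, but it makes the structure clear; the function name is just illustrative):

```c
/* Cubic Hermite (Catmull-Rom) interpolation via the Horner form above.
   fm1, f0, f1, f2 are f(n-1), f(n), f(n+1), f(n+2); 0 <= u < 1. */
double hermite(double fm1, double f0, double f1, double f2, double u)
{
    double a = -fm1 + 3.0*f0 - 3.0*f1 + f2;   /* multiplies u^3 (before the 1/2) */
    double b = 2.0*fm1 - 5.0*f0 + 4.0*f1 - f2;
    double c = -fm1 + f1;
    return 0.5 * (((a * u + b) * u + c) * u) + f0;
}
```

At $u=0$ this returns $f(n)$ and at $u=1$ it returns $f(n+1)$, so consecutive segments join continuously.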

Is this enough information for how to do this in your PIC? You need to be able to separate into integer and fractional parts and you need to be able to multiply.

in my opinion, Olli's method [now moved into its own answer] is not the best way of looking at it for the OP's case of implementing this simply in a PIC. [his formulation] separates the four data points and computes four coefficients that are attached to powers of $u$. that's the way to do it if your fractional ordinate is any arbitrary value that is $0 \le u < 1$. but the OP has only a few values like $u=0, \tfrac14, \tfrac12, \tfrac34$. or maybe 8 multiples of $\tfrac18$.

so my recommendation is to compute the values of these four polynomials:

$$ c_{-1} = \tfrac12 (-u^3 +2u^2 - u) \\ c_0 = \tfrac12 (3u^3 - 5u^2 + 2) \\ c_1 = \tfrac12 (-3u^3 + 4u^2 + u) \\ c_2 = \tfrac12 (u^3 - u^2) $$

and do that for every fractional value of $u$ (such as $u=0, \tfrac14, \tfrac12, \tfrac34$) that you will use many, many times.
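a sketch of a host-side generator for that coefficient table (run once on a PC, not on the PIC; the function name is made up, and the values come out pre-scaled by 256 for 8-bit fixed-point use):

```c
/* Host-side LUT generator.  Fills out[i] with the four coefficients
   c_{-1}, c_0, c_1, c_2 for u = i/steps, scaled by 256 and rounded
   to the nearest integer. */
void make_lut(int steps, int out[][4])
{
    for (int i = 0; i < steps; i++) {
        double u = (double)i / steps;
        double c[4] = {
            0.5 * (-u*u*u + 2*u*u - u),   /* c_{-1} */
            0.5 * (3*u*u*u - 5*u*u + 2),  /* c_0    */
            0.5 * (-3*u*u*u + 4*u*u + u), /* c_1    */
            0.5 * (u*u*u - u*u)           /* c_2    */
        };
        for (int j = 0; j < 4; j++)
            out[i][j] = (int)(256.0 * c[j] + (c[j] >= 0 ? 0.5 : -0.5));
    }
}
```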

then the code in the PIC only needs to implement a dot product between the 4 data points and the selected set of coefficients:

$$ f(x) = f(n+u) = c_{-1} f(n-1) + c_0 f(n) + c_1 f(n+1) + c_2 f(n+2) $$

since $c_{-1}$ and $c_2$ can be shown to always be negative for $0 < u < 1$, then put into the table their absolute values and subtract their terms:

$$ f(x) = f(n+u) = c_0 f(n) + c_1 f(n+1) - (-c_2) f(n+2) - (-c_{-1}) f(n-1) $$

the coefficients stored will be 256 times bigger than their actual value (and stored as 8-bit unsigned integers). then, after multiplying and accumulating your answer (which is 256 times too big), you add 128 (for rounding) and shift right 8 bits (which is the same as taking the answer out of the higher-order byte of the product).
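as a sketch, the whole fixed-point scheme for quarter steps might look like this in C (the function and table names are made up; the table holds the four polynomials evaluated at $u=\tfrac14,\tfrac24,\tfrac34$ and scaled by 256, with $c_{-1}$ and $c_2$ stored as magnitudes; $u=0$ trivially returns $f(n)$):

```c
#include <stdint.h>

/* Coefficients x256 for u = 1/4, 2/4, 3/4.  c_{-1} and c_2 are always
   negative on (0,1), so their magnitudes are stored and subtracted. */
static const uint8_t cm1[3] = { 18,  16,   6};   /* -256*c_{-1} */
static const uint8_t c0t[3] = {222, 144,  58};   /*  256*c_0    */
static const uint8_t c1t[3] = { 58, 144, 222};   /*  256*c_1    */
static const uint8_t c2t[3] = {  6,  16,  18};   /* -256*c_2    */

/* quarter = 1, 2 or 3 selects u = quarter/4; inputs are in [0,255]. */
uint8_t hermite_q(uint8_t fm1, uint8_t f0, uint8_t f1, uint8_t f2,
                  uint8_t quarter)
{
    uint8_t i = quarter - 1;
    int32_t acc = (int32_t)c0t[i]*f0 + (int32_t)c1t[i]*f1
                - (int32_t)cm1[i]*fm1 - (int32_t)c2t[i]*f2;
    acc += 128;                    /* rounding before the final shift */
    if (acc < 0)     acc = 0;      /* the cubic can overshoot [0,255] */
    if (acc > 65535) acc = 65535;
    return (uint8_t)(acc >> 8);    /* i.e. take the high-order byte */
}
```

on the PIC, each of the four products is one 8x8-to-16 unsigned multiply, accumulated with add/subtract-with-carry; the clamp handles the overshoot that a cubic can produce near extreme inputs.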

robert bristow-johnson
  • Thanks for your interest, and yes, I will probably need help with the math.

    Had glanced at cubic Hermite spline before, but now that you call it out specifically I'm looking more closely ... So, if I'm understanding this right, cubic Hermite interpolation is a linear combination of the four (simple) Hermite basis functions, much the same way a Fourier series is the representation of an arbitrary waveform as the sum of simple sine waves? Cubic Hermite spline also links to Bézier curves as a simplification. Would this possibly be more along the lines of what I'm trying to do? Thanks again.

    – Charlie Dec 13 '16 at 07:32
  • @Charlie, i missed this response from you. it's about a week old. i have some assembly code for the SHArC DSP that does cubic Hermite interpolation, or i can simply plop the math down, but the Wikipedia reference also shows that math. which would you rather have? – robert bristow-johnson Dec 21 '16 at 19:14
  • I don't know SHArC, so the math is probably better. That being said, I'm about high school level math (as embarrassing as it is to admit) so even the math may not be enough for me. I'm looking at it now though. – Charlie Dec 22 '16 at 01:43
  • @Charlie, do you understand the matrix multiply notation? – robert bristow-johnson Dec 22 '16 at 01:45
  • i expanded out the matrix notation into a regular equation. you need to compute $u, u^2, (2-u), (3u-5), (4-3u),$ and $(u-1)$ then assemble it all with the $f(\cdot)$ values and don't forget the $\tfrac12$. – robert bristow-johnson Dec 22 '16 at 01:59
  • Maybe, maybe not. I have a passing knowledge of matrix math, picked up from OpenGL programming. Taking a guess, you multiply each element of the leftmost square brackets with each row of the mid matrix, then the result with each function in the right square brackets, correct? – Charlie Dec 22 '16 at 02:00
  • yes, but the matrix on the left (that i copied from Wikipedia) is supposed to be a row (1 x 4) and the one on the right is supposed to be a column (4 x 1), which is why i put a "T" transpose notation on the left column. but i expanded it out for you. does your PIC only do 8-bit integer math? how about multiplications? do you know how to do "add-with-carry" with your PIC? do you have a 16-bit accumulator for 8x8 multiply? – robert bristow-johnson Dec 22 '16 at 02:03
  • PIC microcontrollers can do up to IEEE 754 floating point math if you use C, though it's painfully slow most of the time. It also has a truncated 24-bit version that is faster and more compact at the cost of precision. The actual hardware (which is all single cycle instructions) is as described under the graphic in the question. – Charlie Dec 22 '16 at 02:07
  • are you coding in PIC assembly? and do you want to do this in fixed-point math or using the "truncated 24-bit version" of floating-point math? – robert bristow-johnson Dec 22 '16 at 02:09
  • One of the comments to one of the other replies answered this, but I'll repeat for your convenience. I can do either C or ASM (assembly), as needed. Whichever is better. I can also do inline ASM in C.

    (SE is complaining about extended discussions, should we take this to chat?)

    – Charlie Dec 22 '16 at 02:15
  • naw, fuck SE. so the value "$u$" is the fractional part of the index and, in your example about $u=$ 0, 0.25, 0.5, and 0.75. Those little cubic polynomials can be computed in advance (by hand by you). then it's a matter of choosing the correct set of 4 coefficients (depending on what $u$ is) and doing a simple dot product. – robert bristow-johnson Dec 22 '16 at 02:19
  • now, to do signed multiplication when you have only an unsigned multiplier, do you know how to do "offset binary arithmetic"? – robert bristow-johnson Dec 22 '16 at 02:25
  • offset binary arithmetic... "two's complement" close enough? PIC's are good at that. – Charlie Dec 22 '16 at 02:28
  • 1
    but when you have negative numbers and no 2's comp multiply, you have to bias your numbers to be multiplied unless you're certain they are non-negative. is all of your plotting data positive? if not, add a bias (that you know of) and bump it up. since the coefficients are calculated in advance, you can use their abs value and add or subtract the result. otherwise, you better learn about and understand offset binary arithmetic (sometimes called "excess-K"). – robert bristow-johnson Dec 22 '16 at 02:46
  • All data (both the starting points, and the resulting plotted points) are all in the set [0,255], so all positive integers. – Charlie Dec 22 '16 at 02:47
  • 1
    okay, so you need not offset them. then the coefficients you get from evaluating those 4 polynomials of $u$, with $u=$ 0, 0.25, 0.50, 0.75, when the coefficient is negative, then use the abs value and subtract after the multiplication. also, those coefs have fractional values, so you must represent them as some $2^n$ times their actual value. then after adding up the terms, you must shift right by $n$ bits to undo the $2^n$ factor. that is essentially how fixed-point arithmetic is done. – robert bristow-johnson Dec 22 '16 at 03:21
  • @Olli Niemitalo: Thanks for that code, I think I'm starting to get this now. (It's funny that it's easier for me to understand the math from the code rather than the actual math.) So, if I understand correctly, the "fractional part" or "fractional position" is the space between the intervals along x, right? And this is in the set [0,1], that is, between 0.0 and 1.0, right? OK, so, what should I do if x is going to be between some power of two value? For example, let's say it's 0 to 31. So we are adding 32 new points between y[0] and y[1], whose fraction is represented by integers in [0,31]. – Charlie Dec 22 '16 at 11:02
  • @Charlie that's right. You are essentially asking how to do fixed-point arithmetic. If you can do linear interpolation I don't see why you'd have any problems with Hermite. I'll add code for linear interpolation too so you can compare. – Olli Niemitalo Dec 22 '16 at 11:46
  • Believe me, I'm not ecstatic that I don't understand this either. My only rebuttal is linear interpolation is trivial from where I sit. 1: Find the difference between y[n] and y[n+1] (an integer), 2: divide that difference by the distance between y[n] and y[n+1], (which at unit interval is always a fixed constant integer, preferably a power of two) then 3: add that to y[n] repeatedly until you reach y[n+1]. I just am not seeing the same concise process for cubic Hermite spline that doesn't involve calculus, polynomials, exponents, and "terminology". Code for linear interpolation would be cool. – Charlie Dec 22 '16 at 12:00
  • I think you're right in that I'm probably just hung up on the fixed-point arithmetic, so let me go over that. So... I get that my 8bit value for $u$ is dividing [0,1] into 256 "chunks", with the decimal implied to be there. I just don't get how to change the rest of the code to accommodate this. Obviously I can't merely replace x with [0,255], nothing else changed, and expect it to just work. The rest of the code needs to be scaled by the same factor to make it work; it needs to be "aware" that 0xFF =/= 255, but rather is 1.00, 0xFE = 0.99609375, and so on. I'll read up more on fixed-point. – Charlie Dec 22 '16 at 12:25
  • 1
    @Olli Niemitalo -Just FYI, the code you've provided (which I assume is C) for the Hermite spline algo attempts to index an array with a negative number (y[-1]). This is almost always going to NOT point to valid address/data, because no valid element exists before the first element in an array. I would advise incrementing all those indexes by one, so y[1], y[2], y[3], y[4]. Or at least making note of this problem. (Q.E.D., my compiler throws an error because of this.) – Charlie Dec 23 '16 at 13:20
  • 1
    ANSI C does not bounds check indices of array references. And you can define a pointer into the middle of an array, and have negative indices w.r.t. that pointer in the middle and the reference WILL point to valid data. – robert bristow-johnson Dec 23 '16 at 17:51
  • with your 8-bit PIC, are you writing this in C or asm? what is your 8-bit signed number called? is it int or short or char? what is the 16-bit number coming out of the multiplier called? is it int or even a long? in the PIC asm can you add and subtract with carry (to chain 8-bit bytes together into longer integers)? is there a 16-bit accumulator from the multiply instruction? can you add 16-bit numbers directly? – robert bristow-johnson Dec 23 '16 at 18:01
  • never let 0xFF be the same as $1.0$ . 0xFF should be $\frac{255}{256}$. – robert bristow-johnson Dec 23 '16 at 18:04
  • @robertbristow-johnson 1. I'm writing this in C. 2. My 8-bit signed number is "signed char," and the 8x8 multiplier produces a 16-bit "unsigned int" 3. Yes, I can add/subtract with carry in PIC asm. Can also do some multi-byte multiplication the same way. 4. There is a 16-bit accumulator for the multiplier. 5. Not certain what you mean when you say "directly" In C I can add 16-bit numbers easily. I cannot add 16 bit numbers in a single cycle, with a single instruction but I can do it multi-byte fairly easily. 6. Yes, you're of course right about '0xFF'. (can't edit my old comments) – Charlie Dec 24 '16 at 05:08
  • 1
    how many points between your given points do you want to interpolate? if it's a small number, like 4 or 8, you should precompute those coefficients (that are the cubic polynomials of $u$ above) in advance. – robert bristow-johnson Dec 24 '16 at 05:27
  • I was aiming for $\frac{1}{256}$ for the fractional part... the number of sub-points. One, because it gives high resolution. Two, because it's the word size of the CPU. If it's too small a denominator then I fear the result will end up practically no better than linear interpolation from lack of resolution. 16 or 32 might be a good compromise though, there is actually plenty of space for doing a look up table array (512 bytes ROM for 32 x 4 float coefficients). That being said, doing a toy implementation using 4 or 8 might still be useful as it would probably give me better insight into this. – Charlie Dec 24 '16 at 07:20
  • Actually..... I'm playing with the Python code found on the Centripetal Catmull–Rom spline Wiki page, and with 8 fractional points (with linear interpolation between these, mind you!) it really isn't that bad looking. Even the worst case is acceptable, and is only seen when two adjacent points are near opposite extremes for Y. The math also looks easier, with the exception of the square rooting for the $t_n$ values. And, if I'm right, that should be the part we can precompute, correct? – Charlie Dec 24 '16 at 09:30
  • there is no square-rooting required. even if you precompute the 4 coefficients for the 8 fractional offsets. just a 3rd order polynomial. but i don't see any reason for a square root. – robert bristow-johnson Dec 24 '16 at 18:07
  • @robertbristow-johnson Just to confirm, are we still talking about cubic Hermite? I was actually talking about Centripetal Catmull-Rom spline, which I am finding much easier to understand honestly. Point of fact, I have been tweaking the Python code provided on that page and I'm almost to a point of simplification and refinement that I could drop it right in my PIC and go. Hell, I almost got it to the point that I can do it in assembly by hand. I'll edit my answer below so we all can see where I'm at. – Charlie Dec 24 '16 at 19:52
  • yes, we're talking about cubic Hermite and Catmull-Rom. but i am adding a restriction, i am adding that the points, in some sense of the word, are equally spaced. so i am saying that $t_2-t_1=t_1-t_0$ or, in general $t_{n+1}-t_n = t_n - t_{n-1}$ for all integer $n$. so, if you look at interpolating just one dimension of your 2-dimensional or 3-dimensional points and if the "$t$" ordinates of all of your points are equally spaced, the math described here is all you need. and you can compute the four coefficients for each fractional offset in advance for the sake of simple PIC code. – robert bristow-johnson Dec 25 '16 at 03:31
  • @robertbristow-johnson Yeah, I'm just figuring out today that the two methods condense out into about the same math when $x$ is uniform. It's just too bad I'm not really getting the math from the formulae you provided. I don't get how you expanded that last matrix, or the simplifications you did. I also don't see the relationship between your math and Olli's code. Am I correct that all the functions in that last column-vector are the starting point's $y$ component? Also, what are all the $a_{n}$ parts in the Horner's rule formula? I'm assuming they are our pre computed constants? – Charlie Dec 25 '16 at 07:37
  • 1
    Merry Christmas, by the way. – Charlie Dec 25 '16 at 07:38
  • The equation $$ f(x) = f(n+u) = \tfrac12 \bigg( (u^2(2-u)-u)f(n-1) \ + \ (u^2(3u-5)+2)f(n) + \ (u^2(4-3u)+u)f(n+1) \ + \ u^2(u-1)f(n+2) \bigg) $$ for $$ x = n + u $$ and $$ n = \lfloor x \rfloor \quad \in \mathbb{Z} $$ $$ 0 \le u < 1 $$ is continuous everywhere no matter what finite values the control points $f(n)$ (for integer $n\in\mathbb{Z}$) are. but linear interpolation would do that for you. this Hermite thing also gives you a continuous first derivative $f'(x)$, even at the integer control points. – robert bristow-johnson Dec 25 '16 at 07:48
  • @robertbristow-johnson I'm sorry brother, I read this a dozen times carefully and I just don't follow. I more or less already understood that our main function is continuous. I have to admit that the concept of a derivative is somewhat new to me, but that too I more or less understand at this point. (actually pretty straightforward) What I don't get is the implications of what you're saying beyond the obvious; the line plotted will have no jumps or holes. – Charlie Dec 25 '16 at 08:42
  • Charlie, i presume you know the difference between integer and fractional parts of a real number. do you know what a derivative is? do you know what "continuous" is? – robert bristow-johnson Dec 25 '16 at 19:49
  • @robertbristow-johnson I'll tell you what I think those are, and you can be the judge. First, I'm sure I understand the integer and fraction business. In laypersons terms, turn a number into two numbers by splitting it at the decimal place. Second, A derivative (in awkward verbage) is the slope of the tangent of a point of a function. Third, continuous means, for all input the function follows "an unbroken line". This line need not be straight to be continuous. Reciprocation is not continuous, because $y$ jumps from +inf to -inf instantly at $x=0$. A cubic function is continuous. – Charlie Dec 26 '16 at 05:28
  • @OlliNiemitalo That link is great, thanks. – Charlie Dec 26 '16 at 05:29
  • okay, Charlie. consider linear interpolation. the interpolated function is continuous, even at the "break points" or "control points". but, for linear interpolation, the derivative or slope is not continuous at the break points. also, for linear interpolation, you only use $f(n)$ and $f(n+1)$ of the data to make a straight line. for cubic Hermite, you need in addition two more points on the outside $f(n-1)$ and $f(n+2)$ to interpolate between $f(n)$ and $f(n+1)$. but now, with that extra information, you can guarantee that the slopes of the interpolated curve is continuous. – robert bristow-johnson Dec 26 '16 at 05:38
  • Charlie, i told you explicitly how to do this. and how to structure it for your 8-bit arithmetic and with an 8-bit$\times$8-bit unsigned multiply that your PIC does. i even told you how to compute your intermediate coefficients so that it's trivial for the PIC – robert bristow-johnson Dec 26 '16 at 05:41
  • Ah! I understand now. Your last rewriting of the formula, and the simplification therein, made it clear to me. Firstly I was seeing all your formula in the context of making the PIC do it. In reality, the first part of your formula describes how to generate a LUT (look up table) of precomputed constants (the results of the 4 polynomials for each $u$), and the second part is how to use those constants in the PIC (dot product with $y$ of $n_n$). Since there are only a finite number of $u$s, there are also only a finite number of results for the polynomials. For multiples of 1/4, it's 4 sets of results. – Charlie Dec 26 '16 at 07:12
  • for multiples of $\tfrac14$, then $u=\tfrac04,\tfrac14,\tfrac24,\tfrac34$. for each of those 4 $u$ values, evaluate $c_{-1},c_0,c_1,c_2$. now you have 16 numbers stored in memory. because you are stuck with unsigned multiply and because you know, in advance that for every $0<u<1$, you will see that $c_{-1}<0$ and $c_2<0$, so store their positive values instead and subtract when computing your spline. then all of your 16 numbers should be scaled up by 256 and stored as unsigned 8-bit integer. that makes your result 256 times too large, so shift the result right by 8 bits. – robert bristow-johnson Dec 26 '16 at 07:22
  • if you know your PIC asm well, you can do this quite efficiently in asm. remember you are evaluating: $$ f(n+u) = c_0 f(n) + c_1 f(n+1) - (-c_2) f(n+2) - (-c_{-1}) f(n-1) $$ which is the same as $$ f(n+u) = \tfrac{1}{256}\bigg( (256\cdot c_0) f(n) + (256\cdot c_1) f(n+1) - (-256\cdot c_2) f(n+2) - (-256\cdot c_{-1}) f(n-1) \bigg) $$ – robert bristow-johnson Dec 26 '16 at 07:23
  • Yes, I agree. This would be stupidly fast in hand optimized PIC asm. – Charlie Dec 26 '16 at 07:28
  • so lets see if you can make a 4 $\times$ 4 table of $c_{-1}, c_0, c_1, c_2$ for each of $u=\tfrac04, \tfrac14, \tfrac24, \tfrac34$. after you do that, then multiply each of $c_k$ coefficients by 256 and represent that as an integer. – robert bristow-johnson Dec 26 '16 at 07:32
  • OK, I'll do that. - On another note, I have the aforementioned Catmull-Rom spline Python code ported to C and optimized for my problem set. At 1 MIPS, it takes about 2.5ms per new point. I'm computing the constants in real time though. A LUT method, like yours, could easily be 10x or 100x faster. That is what I was originally looking for too. – Charlie Dec 26 '16 at 07:41
  • make the LUT. remember that you have to scale up the $c_{-1}, c_0, c_1, c_2$ coefficients by 256 and store them as integers. and remember that $c_{-1}$ and $c_2$ are both less than zero, so store their abs value and subtract the product of multiplication instead of adding. each of those four terms (i.e. "$(256 \cdot c_0) f(n)$") will be a 16-bit number that you either add or subtract with the other 16-bit integers. you might have to use add-with-carry. when you're all done, then add 128 (for rounding) and take your result out of the upper byte of the 16-bit result. – robert bristow-johnson Dec 26 '16 at 07:50
  • @OlliNiemitalo, i think now might be a good time to place your code into your own answer. i think you deserve your own credit for your code, but also, since the approach is different than mine, i think putting your code based on your approach in my answer that is based on a different approach to the coding problem, i think it's less confusing if the two approaches are separated. – robert bristow-johnson Dec 26 '16 at 08:02
  • OK, I've computed the 16 constants for 1/4. (hopefully it stays formatted)
        0  :::: 0   1   0   0
        0.25 :::: -0.1171875 0.8671875 0.2265625 -0.0234375
        0.5  :::: -0.1875  0.5625  0.5625  -0.0625
        0.75 :::: -0.1640625 0.2265625 0.8671875 -0.0703125
    
        x256  0   256   0   0
        x256  -30   222   58   -6
        x256  -48   144   144   -16
        x256  -42   58   222   -18
    
    – Charlie Dec 26 '16 at 08:54
  • ... of course it wouldn't..... (:/) Disregard. It's now in my own answer as an edit. – Charlie Dec 26 '16 at 08:56
  • @robertbristow-johnson Question: If the resolution with the fractional stages at 1/4 intervals is too low/coarse, is it OK to apply cubic Hermite, as shown here, recursively? e.g., up-sample 4 points to 16 using cubic Hermite – Charlie Dec 26 '16 at 14:19
  • ... then upsample that 16 again to 64 points? – Charlie Dec 26 '16 at 14:19
  • you could, but why not just make a bigger table? keep things simple. your table can be 8 $\times$ 4 or 16 $\times$ 4 or 64 $\times$ 4 . remember (for the PIC not the python emulation) that you will leave the minus signs offa $c_{-1}$ and $c_2$, because they are always negative. store the positive value in your table, use the unsigned multiply, and subtract the product instead of adding. – robert bristow-johnson Dec 26 '16 at 15:05
  • There are 2 reasons for not just using a bigger table. First is actually indirectly because of fixed point math optimizations. The case of a 4x4 LUT at $u$ x256 tends to make things byte-wide operations, so is single cycle and very optimal on an 8-bit CPU. The 8x4 case requires the fractional component to be 1/1024, or shifted by 10 bits, a bit awkward to work with. Second is because of the flexibility. One has the option to dynamically choose to apply this recursively if needed. And acknowledged on addition of negative = subtraction of positive. I WILL make sure to do that for $c_{-1}$ and $c_2$. – Charlie Dec 26 '16 at 15:50
  • Clarification: ... Operations are more often than not single cycle (with 4x4 LUT)... – Charlie Dec 26 '16 at 15:56
  • NO! The scaler that you choose to scale up the coefficients need not be related at all to the number of fractional values of $u$. the scaling up of the coefficients is purely due to the scheme of fixed-point arithmetic that you are using. scale up by 256 if you plan to shift your result to the right by 8 bits. scale up by 1024 if you plan to shift your result to the right by 10 bits. How do you get these silly ideas into your head, @Charlie? – robert bristow-johnson Dec 26 '16 at 17:52
  • Check out the spread sheet I've provided in my answer. Notice that, with the cubit polynomials applied to fractions of 8 (8x4 LUT), the products of 256 are not integers. I had to scale the coefficients up by 1024 before the fractional parts are removed. I could round, sure, but then that will introduce error that the 4x4 case doesn't have. This is why I'm claiming the choice of LUT scheme indirectly screws with the fixed-point scheme. Is this interpretation wrong? And how so? Thanks again for your assistance. – Charlie Dec 27 '16 at 07:12
  • By the way, I have the 4x4 case basically done in PIC ASM now. 100us per iteration at 8MIPS, pretty damn fast! There seems to be a simplification for the first row in the table, the result will always be f(n). – Charlie Dec 27 '16 at 07:14
  • you can round the coefficients to the nearest multiple of $\frac{1}{256}$. that won't hurt you at all – robert bristow-johnson Dec 27 '16 at 16:17
  • Are you saying it's OK to round the coefficients because the fixed point arithmetic we do is already going to round when converting to integer? e.g. adding 128 to the fractional byte and carrying into the whole byte. – Charlie Dec 28 '16 at 12:07
4

OK, I'm now using (abusing?) this answer as a checkpoint for the progress I'm making. Eventually, this will fill out and become a "true" answer and this header can be removed... please bear with me.


Precomputed Constants for $u$ at 1/4 fractions.

This is related to the accepted answer; the Cubic Hermite spline case. It's here, because it needs to be formatted correctly to be remotely legible.

0      ::::    0            1           0           0
0.25   ::::   -0.0703125    0.8671875   0.2265625   -0.0234375
0.5    ::::   -0.0625       0.5625      0.5625      -0.0625
0.75   ::::   -0.0234375    0.2265625   0.8671875   -0.0703125

x256   ::::    0            256         0           0
x256   ::::   -18           222         58          -6
x256   ::::   -16           144         144         -16
x256   ::::   -6            58          222         -18

Edit: Thank you Robert. You were correct, of course; there was an error. The error was in the first column's polynomial: I was cubing $u$ in the second term when I should have squared it. The table is now correct; the spreadsheet will follow.


I have a *.ods spreadsheet that I used to make this that I will relinquish on request.

Here is a link to the spreadsheet. (Opens in browser)


So, after bashing my head on the (wonderful) answers provided so far for the last week, I digressed to a tangential algorithm, the Centripetal Catmull–Rom spline. The Wiki page has Python code which isn't very hard to get working. The code provided does almost exactly what I was asking, only with a TON of extra baggage that is not needed. I spent the better part of the night cutting and simplifying the algo, and it's getting close to perfect now.

The only thing it needs now is...

  • The vectoring needs to be unfolded so elements can be processed one-by-one.
  • It needs the rest of the constants precomputed.
  • A linear interpolation stage will need to be wrapped around this.

Edit: After a day or two of messing with it, I have fully ported and partly simplified the Catmull-Rom spline algo from the Python code to working PIC XC8 code. What's more, it is reasonably fast, even though it calculates the constants in real time. On a PIC18 chip (w/hardware multiplier) operating at 1 MIPS, it takes ~2.5ms to output one new point. This is about 3x faster than the absolute minimum required for 120Hz operation. 1 MIPS is mostly a worst case as that is a paltry pace for most PICs, especially PIC18's. This is perfectly functional for my needs, and more or less solves my problem/question to my satisfaction.

Here is the relevant code.

    unsigned char j;
    unsigned char l = 0;
    for(j = 0; j < 16; j++)
    {
        // (global) unsigned char y[4] = {0};
        y[0] = y[1];
        y[1] = y[2];
        y[2] = y[3];
        y[3] = randchar(); // Wrapper, limits standard rand to [0,255]
        // Debug for overshoot worst case. (y[] should alternate 2xMAX, 2xMIN)
        //y[3] = y[0]; 

        //further limit our starting points to prevent overshoot
        if (y[3] > (255-16)){y[3]=(255-16);}
        if (y[3] < 12){y[3]=12;}

        unsigned char k;
        const static unsigned char c0 = 64; // amount of fixed point shift.
        for(k = c0; k < c0*2; k = k+(c0/16)) {
            signed int A1 = (((c0 - k) * y[0] + k * y[1]) / c0);
            signed int A2 = ((((c0*2) - k) * y[1] + (k - c0) * y[2]) / c0);
            signed int A3 = ((((c0*3) - k) * y[2] + (k - (c0*2)) * y[3]) / c0);

            signed int B1 = ((((c0*2) - k) / 2 * A1 + k / 2 * A2) / c0);
            signed int B2 = ((((c0*3) - k) / 2 * A2 + (k - c0) / 2 * A3) / c0);

            // (global) unsigned char buff[256] = {0};
            buff[l] = ((((c0*2) - k) * B1 + (k - c0) * B2) + (c0*16))/ c0;
            l++;
        }
    }

Notes:

  • The arrays y[] and buff[] will have to be defined somewhere.
  • The arrays don't necessarily have to be global. Especially y[].
  • The outer loop count times the inner loop count (here 16 × 16) must equal the length of buff[].
  • All the math is integers only. (well... fixed point)
  • The only core operators are addition/subtraction, multiplication, and division by powers of two. This should make it pretty damn fast, and simple.
  • Lastly, there is still some room for simplification.

Here is a plot resulting from running the Python version of the algorithm. (Plot: Catmull-Rom spline interpolation.)

And here is a plot for the new C code, run on the actual PIC, for RGB LED PWM output. Note that it looks jagged because it does not (yet) have a linear interpolation stage applied to it.

Charlie
  • the other thing you have to do to correctly simulate fixed-point is use the floor() function to force each scaled coefficient to an integer value. – robert bristow-johnson Dec 22 '16 at 17:34
  • also, instead of full screen shots, try to use the snipping tool to select just the content. or you can also save to a .png file and upload that to SE. – robert bristow-johnson Dec 22 '16 at 17:36
  • @robertbristow-johnson You are, of course, totally right about using floor() on interpolation output. I was working on a revised version that did exactly that. However, it became beside the point when I realized that the whole damn thing is actually wrong (:/). It is only an odd coincidence that it outputs exactly what you would expect to see. Of course :sigh: Octave would have its own internal linear interpolation for plotting, which is obviously covering up any flaws in my own code. Thanks to this, I've more or less given up on trying to use Octave to work this out, and switched to just C. – Charlie Dec 23 '16 at 13:39
  • you won't need a linear interpolation stage, if you just have enough different values of $u$ between the integer points. – robert bristow-johnson Dec 26 '16 at 17:46
  • Agreed. I think the trade-off between speed vs. accuracy when comparing linear interpolation vs. spline interpolation is worth investigating a little. Excellent optimization could be had with the right balance of the two. Even so, the asm cubic Hermite I have going now is probably plenty fast enough for just about any use case. 1 iteration (generating one fractional point) only takes ~104 machine cycles worst case, 32 every 4th iteration. So... ~85 cycles on average. – Charlie Dec 27 '16 at 09:02

This is a different way to do cubic Hermite interpolation than the one explained in Robert's answer. In his notation, we can also write:

\begin{align}f(n+u) =\, &u^3\left(-\tfrac{1}{2}f(n-1) + \tfrac{3}{2}f(n) - \tfrac{3}{2}f(n+1) + \tfrac{1}{2}f(n+2)\right)\\ +\, &u^2\left(f(n-1) - \tfrac{5}{2}f(n) + 2f(n+1) - \tfrac{1}{2}f(n+2)\right)\\ +\, &u\left(\tfrac{1}{2}f(n+1) - \tfrac{1}{2}f(n-1)\right)\\ +\, &f(n)\end{align}

My code has different variable names but does the calculation in essentially the same order. When you put the Hermite code to real use, it will sometimes address one sample (y[-1]) before the first sample in your data and one sample (y[2]) after the last sample in your data. I normally make those extra "safety" samples available in memory just outside the array. Another warning: in the worst case, cubic Hermite interpolation overshoots the original input range, for example from [-128, 127] to [-159.875, 158.875] for the worst-case inputs [127, -128, -128, 127] and [-128, 127, 127, -128]. This is floating-point code but can be converted to fixed point.

    // x = 0..1 is the fractional position.
    // Interpolating between y[0] and y[1], using also y[-1] and y[2].
    float c0 = y[0];
    float c1 = 1/2.0*(y[1]-y[-1]);
    float c2 = y[-1] - 5/2.0*y[0] + 2*y[1] - 1/2.0*y[2];
    float c3 = 1/2.0*(y[2]-y[-1]) + 3/2.0*(y[0]-y[1]);
    return ((c3*x+c2)*x+c1)*x+c0;

Try to implement linear interpolation first if you have trouble:

    // x = 0..1 is the fractional position.
    // Interpolating between y[0] and y[1].
    return (y[1]-y[0])*x+y[0];

Here's vintage 1998, Pentium optimized fixed point assembly cubic Hermite interpolation code for 32-bit x86 architecture:

    ;8192-times oversampling Hermite interpolation of signed 8-bit integer data.
    ;ESI.ECX = position in memory, 32.32-bit unsigned fixed point, lowest 19 bits ignored.
    ;EAX = output, 24.8-bit signed fixed point.

    data:
    ipminus1 dd 0
    ip1      dd 0
    ip2      dd 0

    code:
    movsx EBP, byte [ESI-1]
    movsx EDX, byte [ESI+1]
    movsx EBX, byte [ESI+2]
    movsx EAX, byte [ESI]
    sal EBX, 8
    sal EDX, 8
    mov dword [ip2], EBX
    sal EAX, 8
    mov dword [ip1], EDX
    mov EBX, EAX
    sub EAX, EDX
    sal EBP, 8

    mov [ipminus1], EBP
    lea EAX, [EAX*4+EDX]
    mov EDX, ECX
    sub EAX, EBX
    shr EDX, 19
    sub EAX, EBP
    add EAX, [ip2]
    lea EBP, [EBX*4+EBX]

    imul EAX, EDX

    sar EAX, 32-19+1
    add EBP, [ip2]
    sar EBP, 1
    add EAX, [ip1]
    add EAX, [ip1]
    add EDI, 8
    sub EAX, EBP
    mov EBP, [ip1]
    add EAX, [ipminus1]
    sub EBP, [ipminus1]

    imul EAX, EDX

    sar EBP, 1
    sar EAX, 32-19
    add EAX, EBP

    imul EAX, EDX

    sar EAX, 32-19
    add EAX, EBX
The above methods are useful if you need to interpolate at "random" positions. If you need to evaluate the interpolation polynomial at equidistant points, there's the forward difference method. There's an article about it in Dr Dobb's. You can do it without any multiplications in the inner loop, and also the rest of the multiplications are constant multiplications that in fixed point arithmetic can be done by shifts, additions and subtractions. Here's C/C++ demonstration code using floating point numbers:

    #include <stdio.h>
    #include <math.h>

    // Forward difference cubic Hermite interpolation

    const float x[4] = {-1, 2, -3, 4}; // Input data

    int main() {
      const float *y = &x[1]; // Interpolate between the middle two values
      const int m = 4; // Parameter: Interpolate 2^m values for each input value.
      // Cubic Hermite specific:
      float c0 = y[0];
      float c1 = 1/2.0*(y[1]-y[-1]);
      float c2 = y[-1] - 5/2.0*y[0] + 2*y[1] - 1/2.0*y[2];
      float c3 = 1/2.0*(y[2]-y[-1]) + 3/2.0*(y[0]-y[1]);
      // The rest works for any cubic polynomial:
      float diff0 = 3*pow(2, 1 - 3*m)*c3;
      float diff1 = pow(2, 1 - 2*m)*c2 + 3*pow(2, 1 - 3*m)*c3;
      float diff2 = pow(2, -m)*c1 + pow(2, -2*m)*c2 + pow(2, -3*m)*c3;
      float poly = c0;
      for (int k = 0; k < (1<<m)+1; k++) {
        printf("%d, %f\n", k, poly);
        poly += diff2;
        diff2 += diff1;
        diff1 += diff0;
      }
    }

Compared to Robert's method, this is less work in total, especially if hardware multiplication is slow or unavailable. A possible advantage of Robert's method is the balanced workload per output sample. The forward-difference method also has a serial dependency between output samples. For a PIC this is not a problem, but on processor architectures with more parallel execution pipelines it becomes a bottleneck. That potential problem can be alleviated by parallelizing the calculation into groups of, say, four output samples, each with an independent [diff1, diff2, poly] state vector, as in this C/C++ code:

    #include <stdio.h>
    #include <math.h>

    // Parallelized forward difference cubic Hermite interpolation

    const float x[4] = {-1, 2, -3, 4}; // Input data

    struct state {
      float diff1;
      float diff2;
      float poly;
    };

    int main() {
      const float *y = &x[1]; // Interpolate between the middle two values
      const int m = 4; // Parameter: Interpolate 2^m values for each input value.
      const int n = 2; // Parameter: 2^n parallel state vectors.
      // Cubic Hermite specific:
      float c0 = y[0];
      float c1 = 1/2.0*(y[1]-y[-1]);
      float c2 = y[-1] - 5/2.0*y[0] + 2*y[1] - 1/2.0*y[2];
      float c3 = 1/2.0*(y[2]-y[-1]) + 3/2.0*(y[0]-y[1]);
      // The rest works for any cubic polynomial:
      struct state states[1<<n];
      float diff0 = 3*pow(2, 1 - 3*m)*c3;
      float diff1 = pow(2, 1 - 2*m)*c2 + 3*pow(2, 1 - 3*m)*c3;
      float diff2 = pow(2, -m)*c1 + pow(2, -2*m)*c2 + pow(2, -3*m)*c3;
      states[0].poly = c0;
      printf("%d, %f\n", 0, states[0].poly);
      for (int k = 1; k < (1<<n); k++) {
        states[k].poly = states[k-1].poly + diff2;
        printf("%d, %f\n", k, states[k].poly);
        diff2 += diff1;
        diff1 += diff0;
      }
      diff0 = 3*pow(2, 3*(n-m) + 1)*c3;
      for (int k = 0; k < (1<<n); k++) {
        // These are polynomials in k so could also be evaluated by forward difference, avoiding multiplication
        states[k].diff1 = pow(2, 2*(n-m) + 1)*c2 + pow(2, 1 - 3*m)*(3*(1<<3*n)*c3 + 3*(1<<2*n)*c3*k);
        states[k].diff2 = pow(2, n - m)*c1 + pow(2, -2*m)*((1<<2*n)*c2 + (1<<(n+1))*c2*k) + pow(2, -3*m)*((1<<3*n)*c3 + 3*(1<<2*n)*c3*k + 3*(1<<n)*c3*k*k);
      }
      for (int i = 1; i < 1<<(m-n); i++) {
        for (int k = 0; k < (1<<n); k++) {
          states[k].poly += states[k].diff2;
          states[k].diff2 += states[k].diff1;
          states[k].diff1 += diff0;
          printf("%d, %f\n", (i<<n)+k, states[k].poly);
        }
      }
      printf("%d, %f\n", 1<<m, states[0].poly + states[0].diff2);
    }

robert bristow-johnson
Olli Niemitalo
    +1 because it is, indeed, helpful. It is a lot more readable as its own answer too. Thanks for your help. – Charlie Dec 26 '16 at 09:00

Depends

Splines are good but I'm pretty sure you need division for that, which will be awkward on your PIC.

If both the original data and the interpolated data are sampled at uniform intervals, then this simply turns into an upsampling problem. The way your picture looks, you just need to upsample by a factor of 4. This can be done easily with a polyphase FIR filter that only requires multiplies and adds. On the downside, there is latency, i.e. your interpolated data will be delayed with respect to your original data. I don't know if that's okay or not.

If your output data is really just a sine wave and you simply don't know the frequency and phase (or they're time-variant), you could wrap a phase-locked loop around it.

Hilmar
    BTW, i think implementing Hermite cubic spline in assembly code on a DSP is in the same ballpark of difficulty as implementing it in a PIC. are you writing C code to run in the PIC? and you don't need division at all to implement cubic Hermite splines for equally spaced (along the x-axis) data points (which is what uniform sampling is). it's a perfectly linear operation and is an approximation to $\operatorname{sinc}(\cdot)$-based interpolation. – robert bristow-johnson Dec 21 '16 at 19:54
  • @Hilmar 1: Division on the pic is indeed awkward, unless it's by a power of two integer, then it resolves down to a simple right shift by the equivalent number of bits. 2: Yes, starting data comes in uniform intervals. I'll update the question to say as much. 3: Latency is not an issue, the starting data will be buffered, then new points computed ahead of time during an anticipated idle period. 8ms is still the longest it should take. 4: It's, unfortunately, NOT just a sine wave. The data will be a three-tuple stream of random data in the set [0, 255]. i.e. randomly chosen colors in RGB form. – Charlie Dec 22 '16 at 01:26
  • @robert. 1: Authoring the algo in ASM probably is in the same difficulty bracket for either platform, agreed. Note that a DSP would execute orders of magnitude faster, however, so simplification/optimization is essential. 2: I could go either way on the language, C or ASM. I can also inline ASM in the C. Pseudocode would probably be better, so everyone can benefit. 3: The data is uniform on x, so it's good that we don't need division for that. 4: sinc() interpolation sounds interesting; C has sin() built in as part of <math.h>; elaboration on the details of this point would be awesome. – Charlie Dec 22 '16 at 01:35