So, yes, since your decimation factor is a power of two, going through cascaded half-band filters is a clever thing to do!
First of all, what's special about a half-band filter? If you implement a low-pass filter that passes half of the available bandwidth as an FIR, you can design it so that the center of the transition band sits exactly at ¼ of the sampling rate. That allows for symmetry in the frequency response, and that in turn gives you an FIR filter where every other coefficient (apart from the center tap) is zero.
Zero coefficients don't need to be multiplied or summed. This gives you the steepness of a $K$-tap filter at roughly the cost of $\frac K2$ taps!
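To make the zero-tap saving concrete, here is a minimal sketch in C. It assumes a small 7-tap half-band prototype, $h = \{-1, 0, 9, 16, 9, 0, -1\}/32$, which I picked purely for illustration; a real design would likely be longer. The names `hb_coeff`, `fir_full` and `fir_halfband` are made up for this example. Folding the symmetric pairs is an extra saving on top of skipping the zeros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed example filter: h = {-1, 0, 9, 16, 9, 0, -1} / 32.
 * The center tap is 1/2; every second tap away from the center is zero. */
#define HB_LEN 7
static const int16_t hb_coeff[HB_LEN] = { -1, 0, 9, 16, 9, 0, -1 };

/* Reference: plain FIR over a window of 7 samples, one multiply per tap. */
static int32_t fir_full(const int16_t x[HB_LEN])
{
    assert(x != NULL);
    int32_t acc = 0;
    for (int k = 0; k < HB_LEN; k++)
        acc += (int32_t)hb_coeff[k] * x[HB_LEN - 1 - k];
    return acc / 32;            /* undo the coefficient scaling */
}

/* Same result, but the zero taps are never touched and the
 * symmetric pairs share one multiplication each. */
static int32_t fir_halfband(const int16_t x[HB_LEN])
{
    assert(x != NULL);
    int32_t outer = (int32_t)x[0] + x[6];   /* both weighted by -1 */
    int32_t inner = (int32_t)x[2] + x[4];   /* both weighted by  9 */
    int32_t acc = 9 * inner - outer + 16 * (int32_t)x[3];
    return acc / 32;
}
```

The result is returned as `int32_t` because this filter can overshoot the 16-bit range slightly for full-scale input; on a real target you would saturate or leave headroom.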
Now, let's look at the fact that you're decimating by a factor of 2:
That means that you throw away every other output sample. Polyphase decomposition of your filter allows you to write your single filter as a sum of two subfilters that are fed alternately with input samples, so that the discarded outputs are never computed: instead of calculating $K$ products and sums per input sample, you calculate $\frac K2$. That's another effort reduction by a factor of 2.
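As a rough sketch of that idea, assuming the same illustrative 7-tap half-band $\{-1, 0, 9, 16, 9, 0, -1\}/32$ and zeros before the start of the input block, a decimate-by-2 routine in C could look like this. The function names are hypothetical. Note how the two polyphase branches fall out: one holds the even-indexed taps, the other collapses to a single delayed sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Input sample n, assuming silence (zeros) before the block starts. */
static int32_t tap(const int16_t *in, ptrdiff_t n)
{
    return (n < 0) ? 0 : in[n];
}

/* Decimate by 2 with the assumed half-band h = {-1,0,9,16,9,0,-1}/32.
 * Writes n_in/2 samples to out and returns that count.
 * Only the outputs that survive decimation are computed at all. */
static size_t halfband_decimate2(const int16_t *in, size_t n_in, int32_t *out)
{
    assert(in != NULL && out != NULL);
    size_t n_out = n_in / 2;
    for (size_t m = 0; m < n_out; m++) {
        ptrdiff_t n = (ptrdiff_t)(2 * m);
        /* Branch 0: even-indexed taps h0, h2, h4, h6 see x[2m], x[2m-2], ... */
        int32_t even = 9 * (tap(in, n - 2) + tap(in, n - 4))
                     - (tap(in, n) + tap(in, n - 6));
        /* Branch 1: odd-indexed taps h1, h3, h5 = 0, 16, 0: a pure delay. */
        int32_t odd = 16 * tap(in, n - 3);
        out[m] = (even + odd) / 32;
    }
    return n_out;
}
```

A streaming version would keep the last few samples as state between calls instead of assuming zeros before each block.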
So, what you do is:
- You pick an existing decimating FIR filter implementation in C or assembly, or write it yourself (it's but a simple for-loop). GNU Radio has one; it's in C++ and uses optimized filter kernels, but it might help illustrate how you split up one filter into $M$ subfilters to reduce the load by a factor of $M$ in an FIR decimating by $M$.
- You design a half-band filter. You probably don't need something fantastic; make sure you get as many zeros in there as sensible.
- You cascade $\log_2 64 = 6$ of these (in fact, that's only nearly the most efficient option; read that article, it's good).
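The steps above can be sketched as a sample-by-sample cascade in C. This is only an illustration under assumptions: every stage reuses the same small 7-tap half-band $\{-1, 0, 9, 16, 9, 0, -1\}/32$, arithmetic is 16-bit fixed point with a 32-bit accumulator, and all names (`hb_stage`, `decim64`, `hb_push`, `decim64_push`) are made up here rather than taken from any library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HB_TAPS  7   /* assumed prototype: {-1,0,9,16,9,0,-1}/32 */
#define N_STAGES 6   /* 2^6 = 64 */

typedef struct {
    int16_t hist[HB_TAPS]; /* hist[0] newest ... hist[6] oldest */
    uint8_t phase;         /* toggles per input; output on every 2nd input */
} hb_stage;

typedef struct {
    hb_stage stage[N_STAGES];
} decim64;

static void decim64_init(decim64 *d)
{
    assert(d != NULL);
    memset(d, 0, sizeof *d);
}

static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Push one sample into a stage. Returns 1 and writes *y on every second call. */
static int hb_push(hb_stage *s, int16_t x, int16_t *y)
{
    memmove(&s->hist[1], &s->hist[0], (HB_TAPS - 1) * sizeof s->hist[0]);
    s->hist[0] = x;
    s->phase ^= 1u;
    if (s->phase)
        return 0;               /* this output would be discarded: skip it */

    int32_t acc = 9 * ((int32_t)s->hist[2] + s->hist[4])
                - ((int32_t)s->hist[0] + s->hist[6])
                + 16 * (int32_t)s->hist[3];
    *y = saturate16(acc / 32);
    return 1;
}

/* Push one sample at the full input rate.
 * Returns 1 and writes *y once per 64 inputs. */
static int decim64_push(decim64 *d, int16_t x, int16_t *y)
{
    int16_t v = x;
    for (int i = 0; i < N_STAGES; i++) {
        if (!hb_push(&d->stage[i], v, &v))
            return 0;           /* later stages only run when they are fed */
    }
    *y = v;
    return 1;
}
```

Since stage $i$ only runs once per $2^i$ input samples, the whole cascade costs less than twice the work of the first stage alone. The `memmove` keeps the sketch short; a circular buffer would likely be cheaper on a small MCU.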
General comment: Using an 8-bit microcontroller to process 16-bit numbers is a bad idea through and through. ATmegas are usually more expensive than ARM counterparts with similar peripherals, and really, they aren't even remotely suited to DSP loads. Don't do that to yourself. This task is easy even on a cheap 20 MHz ARM Cortex-M0, and you could have a Cortex-M4F for the price of a larger ATmega. That would have a floating-point unit; not that you need that for any of this (you can do all these calculations in fixed point, like you'd have to do on your AVR), but it does make one's life easier further down the processing chain. I assume you use an AVR "because you already have that, and have experience with that"; but believe me, you're shooting yourself in the foot. More suitable MCUs are cheap, and so are development boards sufficient to connect your ADC to (they literally start at 2 USD).