A cryptographic mode of operation has two parts: encryption/decryption and message authentication (anti tampering). You must use both, because encryption is malleable. Every message sent and received must be fully processed by both.
The possible options for encryption/decryption:
CBC mode has an inherent limitation that in the encryption direction (decryption doesn't have this problem) the 16 byte blocks need to be processed one after another, limiting parallelism. When you need to process many messages / streams in parallel, you can get enough parallelism to saturate AESNI units, but that depends on traffic pattern.
CTR mode and modes based on CTR (like GCM and CCM), on the other hand, can process all blocks in parallel. This means that with long enough messages, a single stream of data can saturate AESNI units. 768 bytes is long enough.
AES is either slow or insecure without special hardware support. Luckily, since ~all x86 chips since 2010 have AESNI, AES is fast on x86. On the order of a few gigabytes (not gigabits) per second per core, when AES blocks can be processed in parallel (so CTR and CTR based, not CBC).
ChaCha20 exists to be fast on chips that don't have hardware AES, like phones and tablets. On x86 (for example, on the other side of a connection from a phone) it is fast enough, maybe 1.6GB/sec. On phones and tablets, something like 100MB/sec, depending on the micro-architecture and the power/cooling budget.
The possible options for MAC:
HMAC is a bit more complicated than the raw hash function, but for longer messages it is just a bit slower than the raw hash function. SHA-256 is slow, on the order of 400MB/sec. With AVX when processing parallel streams or with Intel SHA Extensions, it can be ok, up to a few gigabytes per second per core (e.g. see this). The SHA instructions are new, not common. Support is bad. Intel's IPSec library supports them though.
- CBCMAC (AES-XCBC-MAC-96 or CCM)
CBCMAC uses the same AES execution units to process the same amount of data, so it at best halves your throughput compared to just encryption/decryption. It uses CBC internally, so it's slower than CTR. The advantage is that a single accelerator circuit can be used for both encryption and MAC, saving silicon area / cost, and also lower complexity (chance of bugs).
GCM uses GMAC for MAC (and CTR for encryption). GMAC is slow without hardware support in the form of CLMUL instructions. Luckily ~all x86 chips since 2010 have that.
Poly1305 was invented to be fast on chips that don't have special hardware like SHA extensions or CLMUL, like phones and tablets. It's fast enough there (maybe 100MB/sec) and on x86 (maybe 2GB/sec).
In theory, the encryption and MAC modes can be mixed and matched, but in practice they are used in only a few combinations.
Now, how to combine the performance numbers for the encryption part with the performance numbers for the MAC?
If they cannot be overlapped, you need to use (X*Y)/(X+Y) where X and Y are the performance of the two individual parts in the same units (e.g. MB/sec). Thus, for example, 520MB/sec encryption and 670MB/sec MAC leads to 293MB/sec combined throughout.
But in some cases they can be overlapped and then you might get the performance of the slowest of the two, with the other being "free". This is the case with GCM, which can run at the same speed as CTR, with GMAC being effectively for free. This is why it is popular. TLS and SSH switched to GCM (with ChaPoly for mobile). WireGuard is ChaPoly-only, for simplicity.
The above is generalities. I would suggest you always run your own benchmarks of the exact implementation on the exact hardware you intend to use. Message sizes play a role, processing multiple message streams in parallel plays a role and ability to overlap encryption and MAC plays a role.