
Context

I am trying to chart the network bandwidth usage of a node in two different ways:

  1. By looking at global metrics for that node
  2. By summing up the corresponding metric for each Pod

To achieve this, I am issuing the following Prometheus queries (example for the receive bandwidth):

  • For the entire node (metric from node-exporter)

    sum(irate(node_network_receive_bytes_total{instance="10.142.0.54:9100"}[$__rate_interval])) by (device)
    
  • Per Pod (metric from kubelet)

    sum(irate(container_network_receive_bytes_total{node="$node",container!=""}[$__rate_interval])) by (pod,interface)
    

The results are displayed in the following Grafana dashboard, after generating some load on an HTTP service called thrpt-receiver:

Receive bandwidth

Here's what I see if I look at the raw metrics, without sum() and irate() applied:

Received bytes
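
For reference, the raw series in the panel above are simply the two bare selectors, i.e. the same label matchers as in the queries with sum() and irate() stripped off:

    node_network_receive_bytes_total{instance="10.142.0.54:9100"}

    container_network_receive_bytes_total{node="$node",container!=""}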

Problem

As you can see, results are vastly different, to the point I'm almost certain I am doing something wrong, but what?

What makes me especially suspicious about the Pod metrics is the supposedly increasing receive bandwidth of kube-proxy (which, AFAIK, should not be receiving any traffic in iptables mode) and of agents such as the Prometheus node-exporter.

1 Answer

I found out what was happening in my graphs. All the Pods mentioned above have one thing in common: they run in the host's network namespace, so their network metrics are all identical and equal to the host's global metric (just reported with slightly different precision).

$ kubectl -n monitoring get pod -o jsonpath='{.spec.hostNetwork}' \
    prometheus-stack-prometheus-node-exporter-jnhw7

true

$ kubectl -n kube-system get pod -o jsonpath='{.items[*].spec.hostNetwork}' \
    kube-proxy-gke-triggermesh-product-control-plane-7fc0ad24-z586 \
    gke-metrics-agent-5cv4m \
    prometheus-to-sd-tk8jv \
    fluentbit-gke-xh879

true true true true 
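
If you want to enumerate every Pod that shares the host's network namespace, rather than checking them one by one as above, something along these lines should do it (the custom-columns output is just one convenient way to filter):

    $ kubectl get pods --all-namespaces \
        -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,HOSTNETWORK:.spec.hostNetwork' \
        | grep true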

One way to see this is to compare the host's metric with the same metric as reported for one of the Pods above:

Comparison of eth0 and a Pod with hostNetwork
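
In PromQL terms, that comparison is just two plain irate() expressions, one against the node-exporter metric and one against the kubelet metric for a hostNetwork Pod. As a sketch (the instance, node and pod values are the ones from this example and need to be adapted):

    irate(node_network_receive_bytes_total{instance="10.142.0.54:9100",device="eth0"}[$__rate_interval])

    irate(container_network_receive_bytes_total{node="$node",pod="prometheus-stack-prometheus-node-exporter-jnhw7",interface="eth0"}[$__rate_interval])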

  • Thanks for the hint. We researched a similar problem today. Our node-exporter Pod's bandwidth had increased massively (to more than 400mb/s), which we could not explain. In our case, a Redis Pod had very high network transmit, which was routed through Calico/kube-proxy in the host network namespace.

    The host machine itself showed only 150mb/s of network transmit, because Calico only routed the external traffic to other nodes; a lot of the traffic stayed inside the virtual network of the host itself.

    – Martin Lantzsch Nov 11 '22 at 13:50