vminsert: issues with the reroute mechanism when a storage node is temporarily unavailable

Describe the bug: Once one or more of the storage nodes get killed, our VM cluster loses the ability to ingest data into storage, and it is hard to recover without manual intervention. The behaviour is:

  1. A high `rerouted_from` rate, not only from the down nodes but from basically all the other healthy nodes as well (screenshot attached).

  2. A slow data ingestion rate. The normal rate is about 25k rows/s per node, but at its minimum it actually drops below 1k per node. The attached graph is from one node; the others all look the same.

  3. A high index_write rate, more than 50x the normal rate.

    This is not an official panel from the Grafana vm-cluster dashboard; the MetricsQL is: sum(rate(vm_rows{type="indexdb"}[10m]))


  4. High IO and CPU usage (screenshots attached). Not sure whether this is a cause or a symptom.

PS: We have over 100 vmstorage nodes, and a life-keeper process that relaunches a storage node immediately (within 1 minute) after it gets killed.

Given the high rates of `rerouted_from` and index_write, we assume this is caused by rerouting in vminsert. This is the hypothesis based on our cases. There are two main reasons vmstorage gets killed for us:

  1. We manually kill vmstorage, or another process's high memory usage gets vmstorage OOM-killed.
  2. Slow queries increase vmstorage's memory usage and it gets OOM-killed.

After a storage node goes down, vminsert reroutes its data to the other healthy nodes. The extra data increases resource usage (IO, CPU) on those nodes, their ingestion slows down, so the reroute mechanism reroutes their data to yet other healthy nodes, and boom: it causes an avalanche!
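The avalanche hypothesis above can be sketched as a toy simulation (our own illustration, not vminsert code; all names and numbers are invented): every node ingests `base` rows/s, rows destined for down or overloaded nodes are rerouted evenly across the remaining healthy ones, and any node pushed over its capacity stops keeping up, so its traffic gets rerouted too.

```go
package main

import "fmt"

// cascade returns how many initially-healthy nodes end up overloaded
// after `killed` nodes go down, under the toy model described above.
func cascade(capacity []float64, base float64, killed int) int {
	n := len(capacity)
	down := make([]bool, n)
	for i := 0; i < killed; i++ {
		down[i] = true
	}
	for {
		numDown := 0
		for _, d := range down {
			if d {
				numDown++
			}
		}
		if numDown == n {
			return n - killed // every surviving node got overloaded
		}
		// Load rerouted from down nodes is spread over healthy ones.
		perNode := base + float64(numDown)*base/float64(n-numDown)
		changed := false
		for i, d := range down {
			if !d && perNode > capacity[i] {
				down[i] = true // can't keep up; its rows get rerouted too
				changed = true
			}
		}
		if !changed {
			return numDown - killed // the cascade stopped here
		}
	}
}

func main() {
	tight := make([]float64, 10) // ~8% headroom per node
	roomy := make([]float64, 10) // ~20% headroom per node
	for i := range tight {
		tight[i], roomy[i] = 27_000, 30_000
	}
	// Killing one node: the tight cluster avalanches, the roomy one absorbs it.
	fmt.Println(cascade(tight, 25_000, 1), cascade(roomy, 25_000, 1))
}
```

With little headroom, rerouting one node's traffic tips every other node over capacity in turn; with enough headroom the same failure is absorbed without any cascade.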

To reproduce the situation, we built a cluster and used other methods to keep IO and CPU usage high; meanwhile we scraped part of our prod data into the cluster via vmagent. Everything was fine until we shut down one of the nodes, at which point the situation described above showed up.

In order to prove our hypothesis, we patched vminsert to stop rerouting the data destined for storage-x while still rerouting data from the other nodes (we simply drop the data instead of rerouting it). We performed two operations: 18:00 on storage-6 and 18:07 on storage-5. As the graphs show, after shutting down and restarting storage-6 at 18:00, everything stays fine, because rerouting from storage-6 is disabled. But the same situation comes up again when we shut down and restart storage-5 at 18:07 (screenshots attached).
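The experiment's patch could look roughly like the following sketch (invented names, not the actual vminsert change): rows for a node on a no-reroute list are dropped instead of being redistributed, so that node's downtime cannot inflate load on the rest of the cluster.

```go
package main

import "fmt"

// noReroute lists nodes whose rows are dropped rather than rerouted
// while the node is unavailable (hypothetical; names are from the
// experiment described above).
var noReroute = map[string]bool{"storage-6": true}

// handleUnavailable decides what to do with rows destined for an
// unavailable node and returns how many rows were handed to reroute.
func handleUnavailable(node string, rows int, reroute func(int)) int {
	if noReroute[node] {
		fmt.Printf("dropping %d rows for %s instead of rerouting\n", rows, node)
		return 0
	}
	reroute(rows)
	return rows
}

func main() {
	rerouted := 0
	sink := func(n int) { rerouted += n }
	handleUnavailable("storage-6", 1000, sink) // dropped: reroute disabled
	handleUnavailable("storage-5", 1000, sink) // still rerouted as before
	fmt.Println("rerouted:", rerouted)
}
```

This matches the observed behaviour: restarting storage-6 (reroute disabled) caused no cascade, while restarting storage-5 (reroute still enabled) triggered it again.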

Version: v1.39.4-cluster


Answer (valyala)

FYI, all the improvements mentioned above have been included in v1.42.0.
