Related issues:
https://github.com/longhorn/longhorn/issues/2629
Case 1: Kubelet restart on an RKE1 multi-node cluster:
- Create an RKE1 cluster with 1 etcd/control plane node and 3 worker nodes.
- Deploy Longhorn on the cluster.
- Deploy the Prometheus monitoring app on the cluster using a Longhorn StorageClass, or deploy a StatefulSet with a Longhorn volume.
- Write some data into the mount point and compute the md5sum.
- Restart the kubelet on the node where the StatefulSet or Prometheus pod is running using the command
sudo docker restart kubelet
- Observe the volume. It becomes degraded but is still running.
- Once the node is back, the volume of the workload should work fine and the data is intact.
- Scale down then re-scale up the workload. Verify the existing data is correct.
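The write-then-verify steps above can be scripted. A minimal sketch, using a hypothetical MOUNT_POINT (substitute the workload's actual volume mount path inside the pod):

```shell
#!/bin/sh
# Hypothetical mount path; substitute the workload's real volume mount.
MOUNT_POINT="${MOUNT_POINT:-/tmp/longhorn-mount-demo}"
mkdir -p "$MOUNT_POINT"

# Before the kubelet restart: write test data and record its checksum.
dd if=/dev/urandom of="$MOUNT_POINT/testfile" bs=1M count=4 2>/dev/null
BEFORE=$(md5sum "$MOUNT_POINT/testfile" | awk '{print $1}')

# ... restart the kubelet, wait for the node and volume to recover,
# and scale the workload down and back up ...

# After recovery: recompute and compare.
AFTER=$(md5sum "$MOUNT_POINT/testfile" | awk '{print $1}')
if [ "$BEFORE" = "$AFTER" ]; then
    echo "data intact"
else
    echo "DATA CORRUPTED"
fi
```

The same before/after comparison applies to every case below; only the restarted component changes.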
Case 2: Kubelet restart on a single-node RKE1 cluster:
- Create an RKE1 cluster with a single node that has all roles.
- Deploy Longhorn on the cluster.
- Deploy the Prometheus monitoring app on the cluster using a Longhorn StorageClass, or deploy a StatefulSet with a Longhorn volume.
- Write some data into the mount point and compute the md5sum.
- Restart the kubelet on the node using the command
sudo docker restart kubelet
- Check the instance manager pods on the node are still running.
- Observe the volume. It gets detached and the pod gets terminated (since the volume's only replica fails).
- Once the pod is terminated, a new pod should be created and the volume should reattach successfully.
- Verify that the mount of the volume is successful and data is safe.
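Waiting for the instance-manager pods to stay Running and for the replacement pod to come up can be done with a small polling helper. A sketch; the kubectl selector in the comment is an illustrative assumption, adjust it to your deployment:

```shell
#!/bin/sh
# Poll until a command's output matches a pattern, or give up.
# RETRIES and INTERVAL are tunable; the defaults allow a slow recovery.
RETRIES="${RETRIES:-60}"
INTERVAL="${INTERVAL:-5}"

wait_for() {
    pattern="$1"; shift
    i=0
    while [ "$i" -lt "$RETRIES" ]; do
        if "$@" 2>/dev/null | grep -q "$pattern"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$INTERVAL"
    done
    return 1
}

# Usage against a real cluster (selector is an assumption):
#   wait_for Running kubectl get pods -n longhorn-system \
#       -l longhorn.io/component=instance-manager
# Self-contained demo: the pattern matches on the first try.
wait_for ok echo ok && echo "pattern matched"
```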
Case 3: rke2-server/rke2-agent restart on an RKE2 multi-node cluster:
- Create an RKE2 cluster with 1 control plane node and 3 worker nodes.
- Deploy Longhorn on the cluster.
- Deploy the Prometheus monitoring app on the cluster using a Longhorn StorageClass, or deploy a StatefulSet with a Longhorn volume.
- Write some data into the mount point and compute the md5sum.
- Restart the rke2-agent service on the node where the StatefulSet or Prometheus pod is running using the command
systemctl restart rke2-agent.service
- Observe the volume. It becomes degraded but is still running.
- Once the node is back, the volume of the workload should work fine and the data is intact.
- Scale down then re-scale up the workload. Verify the existing data is correct.
- Create a StatefulSet with a Longhorn volume on the control plane node.
- Once the StatefulSet is up and running, write some data into the mount point and compute the md5sum.
- Restart the rke2-server service on the control plane node using the command
systemctl restart rke2-server.service
- Observe the volume. It becomes degraded but is still running.
- Once the node is back, the volume of the workload should work fine and the data is intact.
- Scale down then re-scale up the workload. Verify the existing data is correct.
Case 4: Kubelet restart on a node with an RWX volume on an RKE1 cluster:
- Create an RKE1 cluster with 1 etcd/control plane node and 3 worker nodes.
- Deploy Longhorn on the cluster.
- Deploy a StatefulSet with an RWX volume attached.
- Write some data into the mount point and compute the md5sum.
- Restart the kubelet on the node where the share-manager pod is running using the command
sudo docker restart kubelet
- Observe the volume. It gets detached and the share-manager pod gets terminated.
- Watch the pod of the StatefulSet using the command
kubectl get pods -n <namespace> -w
- The share-manager and StatefulSet pods (but not the instance-manager pods) should be terminated and restarted.
- Verify that the mount of the volume is successful and data is safe.
- Repeat the above steps with the StatefulSet pod and share-manager pod running on different nodes, and restart the kubelet on the node where the StatefulSet pod is running.
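For the RWX case it is convenient to checksum every file in the shared volume at once. A sketch, again using a stand-in directory for the real mount path:

```shell
#!/bin/sh
# Stand-in paths; point MOUNT_POINT at the real RWX mount inside the pod.
MOUNT_POINT="${MOUNT_POINT:-/tmp/rwx-mount-demo}"
MANIFEST="${MANIFEST:-/tmp/rwx-checksums.md5}"
mkdir -p "$MOUNT_POINT"
echo "shared data" > "$MOUNT_POINT/file1"
echo "more shared data" > "$MOUNT_POINT/file2"

# Before the kubelet restart: record a checksum for every file in the volume.
( cd "$MOUNT_POINT" && find . -type f -exec md5sum {} + ) > "$MANIFEST"

# ... restart the kubelet on the share-manager node, wait for the
# share-manager and StatefulSet pods to be recreated ...

# After recovery: verify every recorded checksum in one pass.
VERIFY=$(cd "$MOUNT_POINT" && md5sum -c --quiet "$MANIFEST" && echo "all files intact")
echo "$VERIFY"
```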
Case 5: Kubelet restart on a node with an RWX volume on an RKE2 cluster:
- Repeat the steps from Case 4 on an RKE2 cluster.