Prerequisite: Have an environment with only 2 worker nodes, or taint 1 out of 3 worker nodes with NoExecute & NoSchedule.
This creates a constrained environment with limited fallback and recovery options in the event of failure.
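For example, one of three workers can be constrained with taints; a minimal sketch, where the node name `worker-3` and the taint key are placeholders:

```bash
# Take one worker out of scheduling so only 2 workers can host workloads.
kubectl taint nodes worker-3 longhorn-test=true:NoSchedule
kubectl taint nodes worker-3 longhorn-test=true:NoExecute

# Remove the taints to restore the node after the test:
kubectl taint nodes worker-3 longhorn-test=true:NoSchedule-
kubectl taint nodes worker-3 longhorn-test=true:NoExecute-
```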
1. Kill the engines and instance manager repeatedly
Given 1 RWO and 1 RWX volume, each attached to a pod.
And Both volumes have 2 replicas.
And Random data is continuously written to the volumes using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`
When Rebuilding of one replica is triggered by crashing its IM (see the sketch below).
And Immediately afterwards, the IM associated with the other replica is crashed.
And After crashing the IMs, detaching the volume is attempted, either by pod deletion or via the Longhorn UI.
Then Volume should not get stuck in an attaching-detaching loop.
When Volume is detached and manually attached again.
And The engine running on the node where the volume is attached is killed.
Then Volume should recover once the engine is back online.
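One way to crash an IM is to delete its pod out from under Longhorn; a minimal sketch, assuming the default `longhorn-system` install namespace (the pod name is a placeholder):

```bash
# Find the instance-manager pods and the nodes they run on.
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# Simulate an IM crash by force-deleting the pod hosting the target replica.
kubectl -n longhorn-system delete pod instance-manager-xxxxxxxx \
  --grace-period=0 --force
```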
2. Illegal values in Volume/Snap.meta
Given 1 RWO and 1 RWX volume, each attached to a pod.
And Both volumes have 2 replicas.
When Some random values are set in the volume/snapshot meta files (see the sketch below).
And Replica rebuilding is triggered and the IM associated with the other replica is also crashed.
Then Volume should not get stuck in an attaching-detaching loop.
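A sketch of injecting illegal values, assuming the default data path `/var/lib/longhorn` on the replica node; the replica directory name is a placeholder:

```bash
# Each replica directory holds volume.meta plus one .meta file per snapshot.
cd /var/lib/longhorn/replicas/test-vol-1a2b3c4d
cat volume.meta                              # note the original content
echo '{"Size":"not-a-number"}' > volume.meta # inject an illegal value
```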
3. Deletion of Volume/Snap.meta
Given 1 RWO and 1 RWX volume, each attached to a pod.
And Both volumes have 2 replicas.
When The volume & snapshot meta files are deleted one by one (see the sketch below).
And Replica rebuilding is triggered and the IM associated with the other replica is also crashed.
Then Volume should not get stuck in an attaching-detaching loop.
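The files can be removed one per iteration, re-running the test after each deletion; paths and names are placeholders as above:

```bash
cd /var/lib/longhorn/replicas/test-vol-1a2b3c4d
ls ./*.meta                    # volume.meta plus the snapshot meta files
rm volume.meta                 # delete one file, run the test, then repeat
rm snapshot-xxxxxxxx.img.meta  # ...for each snapshot meta file
```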
4. Failed replica tries to rebuild from another just-crashed replica - https://github.com/longhorn/longhorn/issues/4212
Given 1 RWO and 1 RWX volume, each attached to a pod.
And Both volumes have 2 replicas.
And Random data is continuously written to the volumes using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`
When Rebuilding of one replica is triggered by crashing its IM.
And Immediately afterwards, the IM associated with the other replica is crashed.
Then Volume should not get stuck in an attaching-detaching loop.
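To check for the loop, the Longhorn volume CRs can be watched while the IMs are crashed; the state should settle at attached/healthy instead of cycling between attaching and detaching (default install namespace assumed):

```bash
kubectl -n longhorn-system get volumes.longhorn.io -w
```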
5. Volume attachment Modification/deletion
Given A Deployment and a StatefulSet are created with the same name and attached to Longhorn volumes.
And Some data is written and its md5sum is computed.
When The StatefulSet and Deployment are deleted without deleting the volumes.
And New StatefulSet and Deployment with the same names are created with new PVCs.
And Before the newly deployed workloads can attach to the volumes, the attached node is rebooted.
Then After the node reboot completes, the volumes should reflect the correct status.
And The newly created Deployment and StatefulSet should get attached to the volumes.
When The volume attachments of the above workloads are deleted (see the sketch below).
And The above workloads are deleted and recreated immediately.
Then No multi-attach or other errors should be observed.
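For the attachment-deletion step, the Kubernetes VolumeAttachment objects backing the workloads' PVs can be listed and removed; the object name below is a placeholder:

```bash
# List the VolumeAttachment objects and find the ones for the test PVs.
kubectl get volumeattachments

# Delete an attachment record to simulate a modified/stale attachment.
kubectl delete volumeattachment csi-xxxxxxxx
```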
6. Use monitoring/WordPress/DB workloads
Given Monitoring, WordPress, and other DB-related workloads are deployed in the system.
And All the volumes have 2 replicas.
And Random data is continuously written to the volumes using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`
When Rebuilding of one replica is triggered by crashing its IM.
And Immediately afterwards, the IM associated with the other replica is crashed.
Then Volumes should not get stuck in an attaching-detaching loop.
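After the volumes recover, data integrity can be spot-checked against a previously recorded md5sum; a sketch assuming a hypothetical WordPress Deployment and mount path:

```bash
# Deployment name and file path are placeholders for your workload.
kubectl exec deploy/wordpress -- md5sum /var/www/html/file1
```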