Module tests.test_ha

Functions

def ha_backup_deletion_recovery_test(client, volume_name, size, backing_image='')
def ha_rebuild_replica_test(client, volname)
def ha_salvage_test(client, core_api, volume_name, backing_image='')
def ha_simple_recovery_test(client, volume_name, size, backing_image='')
def prepare_engine_not_fully_deployed_environment(client, core_api)
  1. Taint node-1 with the taint: key=value:NoSchedule
  2. Delete the engine image DaemonSet pod on node-1, or delete the engine image DaemonSet and wait for Longhorn to automatically recreate it (see the sketch below).
  3. Wait for the engine image CR state to become deploying
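A rough sketch of steps 1-2 using the kubernetes Python client; the namespace and the engine image pod name prefix are assumptions about a typical Longhorn install, not taken from this module:

```python
from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()

# Step 1: taint node-1 so new engine image pods cannot be scheduled on it.
taint = {"key": "key", "value": "value", "effect": "NoSchedule"}
core_api.patch_node("node-1", {"spec": {"taints": [taint]}})

# Step 2: delete the engine image DaemonSet pod currently running on node-1;
# the DaemonSet controller will try (and fail) to recreate it on the tainted node.
pods = core_api.list_namespaced_pod(
    "longhorn-system", field_selector="spec.nodeName=node-1")
for pod in pods.items:
    if pod.metadata.name.startswith("engine-image-"):  # assumed pod name prefix
        core_api.delete_namespaced_pod(pod.metadata.name, "longhorn-system")
```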
def prepare_engine_not_fully_deployed_environment_with_volumes(client, core_api)
  1. Create 2 volumes, vol-1 and vol-2 with 3 replicas
  2. Taint node-1 with the taint: key=value:NoSchedule
  3. Attach vol-1 to node-1. Change the number of replicas of vol-1 to 2. Delete the replica on node-1
  4. Delete the engine image DaemonSet pod on node-1, or delete the engine image DaemonSet and wait for Longhorn to automatically recreate it.
  5. Wait for the engine image CR state to become deploying
def prepare_upgrade_image_not_fully_deployed_environment(client, excluded_nodes=[])
def restore_with_replica_failure(client, core_api, volume_name, csi_pv, pvc, pod_make, allow_degraded_availability, disable_rebuild, replica_failure_mode)

restore_with_replica_failure is reusable by a number of similar tests. In general, it attempts a volume restore, kills one of the restoring replicas, and verifies the restore can still complete. The manner in which a replica is killed and the settings enabled at the time vary with the parameters.
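As a hedged illustration (the failure-mode string is an assumption, not a constant defined in this excerpt), a wrapper test might invoke it like this:

```python
def test_restore_with_replica_delete_example(set_random_backupstore, client,
                                             core_api, volume_name, csi_pv,
                                             pvc, pod_make):
    # Hypothetical wrapper: delete a restoring replica, allow rebuilding,
    # and expect the restore to finish anyway.
    restore_with_replica_failure(client, core_api, volume_name, csi_pv, pvc,
                                 pod_make,
                                 allow_degraded_availability=False,
                                 disable_rebuild=False,
                                 replica_failure_mode="delete")  # assumed value
```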

def test_all_replica_restore_failure(set_random_backupstore, client, core_api, volume_name, csi_pv, pvc, pod_make)

[HA] Test if the failure of all replica restores leads to the restore volume becoming Faulted, and if auto salvage is skipped for the faulted restore volume.

  1. Enable auto-salvage.
  2. Set a random backupstore.
  3. Do cleanup for the backupstore.
  4. Create a pod with a volume and wait for pod to start.
  5. Write data to the pod volume and get the md5sum.
  6. Create a backup for the volume.
  7. Randomly delete some data blocks of the backup, which will lead to all replica restore failures later.
  8. Restore a volume from the backup.
  9. Wait for the volume restore in progress by checking if: 9.1. volume.restoreStatus shows the related restore info. 9.2. volume.conditions[Restore].status == True && volume.conditions[Restore].reason == "RestoreInProgress". 9.3. volume.ready == false.
  10. Wait for the restore volume to become Faulted.
  11. Check if volume.conditions[Restore].status == False && volume.conditions[Restore].reason == "RestoreFailure" (see the polling sketch after this list).
  12. Check if volume.ready == false.
  13. Make sure auto-salvage is not triggered even though the feature is enabled.
  14. Verify that PV/PVC cannot be created from Longhorn.
  15. Verify the faulted volume cannot be attached to a node.
  16. Verify this faulted volume can be deleted.
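A minimal polling sketch for steps 11-12, assuming the longhorn client fixture used throughout this module; the field access simply mirrors the condition names in the steps above, and the retry constants are illustrative:

```python
import time

def wait_for_restore_failure(client, volume_name, retries=60, interval=2):
    for _ in range(retries):
        volume = client.by_id_volume(volume_name)
        # Condition status values are assumed to be the strings "True"/"False".
        restore = volume.conditions["Restore"]
        if restore.status == "False" and restore.reason == "RestoreFailure":
            assert not volume.ready
            return volume
        time.sleep(interval)
    raise AssertionError("volume %s never reported RestoreFailure" % volume_name)
```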
def test_auto_remount_with_subpath(client, core_api, storage_class, sts_name, statefulset)

Test Auto Remount With Subpath

Context:

Instead of manually finding and remounting all mount points of the volume, we delete the workload pod so that Kubernetes handles that work. This new implementation also solves the issue that remount doesn't support subpath (e.g. when a pod uses a subpath in a PVC). longhorn/longhorn#1719

Steps:

  1. Deploy a storage class with parameter numberOfReplicas: 1
  2. Deploy a statefulset with replicas: 1 and using the above storageclass. Make sure the container in the pod template uses subpath, like this:
     ```yaml
     volumeMounts:
     - name:
       mountPath: /data/sub
       subPath: sub
     ```
  3. exec into the statefulset pod, create a file test_data.txt inside the folder /data/sub
  4. Delete the statefulset's replica instance manager pod. This action simulates a network disconnection.
  5. Wait for volume healthy, then verify the file checksum.
  6. Repeat step #4~#5 for 3 times.
  7. Update numberOfReplicas to 3.
  8. Wait for replica rebuilding to finish.
  9. Delete one of the statefulset's engine instance manager pods.
  10. Wait for volume remount. Then verify the file checksum.
  11. Delete the statefulset pod.
  12. Wait for pod recreation and volume remount. Then verify the file checksum.
def test_autosalvage_with_data_locality_enabled(client, core_api, make_deployment_with_pvc, volume_name, pvc)

This e2e test follows the manual test steps at: https://github.com/longhorn/longhorn/issues/2778#issue-939331805

Preparation: 1. Let's call the 3 nodes: node-1, node-2, node-3

Steps:

  1. Add the tag node-1 to node-1
  2. Create a volume with 1 replica, data-locality set to best-effort, and tag set to node-1
  3. Create PV/PVC from the volume.
  4. Create a pod that uses the PVC. Set a node selector for the pod so that it will be scheduled onto node-2. This makes sure that there is a failed-to-schedule local replica.
  5. Wait for the pod to be in running state.
  6. Kill the aio instance manager on node-1.
  7. In a 3-min retry loop, verify that Longhorn salvages the volume and the workload pod is restarted. Exec into the workload pod. Verify that read/write to the volume is ok.
  8. Exec into the longhorn manager pod on node-2. Run ss -a -n | grep :8500 | wc -l to find the number of socket connections from this manager pod to instance manager pods. In a 2-min loop, verify that the number of socket connections is <= 20 (see the sketch after the steps).
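A sketch of the socket-count check in step 8, assuming the kubernetes Python client; the manager pod name is a placeholder:

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
core_api = client.CoreV1Api()

# Count sockets to port 8500 inside the longhorn-manager pod on node-2.
cmd = ["/bin/sh", "-c", "ss -a -n | grep :8500 | wc -l"]
output = stream(core_api.connect_get_namespaced_pod_exec,
                "longhorn-manager-xxxxx",   # placeholder: the manager pod on node-2
                "longhorn-system",
                command=cmd,
                stderr=True, stdin=False, stdout=True, tty=False)
assert int(output.strip()) <= 20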

Cleaning up: 1. Clean up the node tag

def test_disable_replica_rebuild(client, volume_name)

Test disable replica rebuild

  1. Disable node scheduling on node-2 and node-3 to make sure the replica is scheduled on node-1.
  2. Set 'Concurrent Replica Rebuild Per Node Limit' to 0.
  3. Create a volume with 1 replica and attach it to node-1.
  4. Enable scheduling on node-2 and node-3. Set node-1 scheduling to 'Disable' and 'Enable' eviction on node-1.
  5. Wait for 30 seconds, and check that no eviction happens.
  6. 'Enable' node-1 scheduling and 'Disable' node-1 eviction.
  7. Detach the volume and update data locality to 'best-effort'.
  8. Attach the volume to node-2, wait for 30 seconds, and check that no data locality rebuild happens.
  9. Detach the volume and update data locality to 'disable'.
  10. Attach the volume to node-2 and update the replica number to 2.
  11. Wait for 30 seconds; no new replica should be scheduled and the volume should stay in 'degraded' state.
  12. Set 'Concurrent Replica Rebuild Per Node Limit' to 5, and wait for replica rebuild and volume becomes 'healthy' state with 2 replicas.
  13. Set 'Concurrent Replica Rebuild Per Node Limit' to 0, delete one replica.
  14. Wait for 30 seconds, no rebuild should get triggered. The volume should stay in 'degraded' state with 1 replica.
  15. Set 'Concurrent Replica Rebuild Per Node Limit' to 5, and wait for replica rebuild and volume becomes 'healthy' state with 2 replicas.
  16. Clean up the volume.
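A minimal sketch of the setting toggle used in steps 2, 12, 13 and 15, assuming the longhorn client fixture; the setting id string is an assumption about how the UI setting name maps to the API:

```python
def set_concurrent_rebuild_limit(client, value):
    # Fetch the setting object and update its value through the Longhorn API.
    setting = client.by_id_setting("concurrent-replica-rebuild-per-node-limit")
    client.update(setting, value=str(value))

# set_concurrent_rebuild_limit(client, 0)  # freeze rebuilding (steps 2 and 13)
# set_concurrent_rebuild_limit(client, 5)  # allow rebuilding again (steps 12 and 15)
```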
def test_dr_volume_with_restore_command_error(set_random_backupstore, client, core_api, volume_name, csi_pv, pvc, pod_make)

Test if Longhorn can capture and handle the restore command error rather than the error triggered during the data restoring.

  1. Set a random backupstore.
  2. Create a volume, then create the corresponding PV, PVC and Pod.
  3. Write data to the pod volume and get the md5sum after the pod running.
  4. Create the 1st backup.
  5. Create a DR volume from the backup.
  6. Wait for the DR volume restore complete.
  7. Create a non-empty directory volume-delta-<last backup name>.img in one replica directory of the DR volume. This will fail the restore command call later.
  8. Write data to the original volume then create the 2nd backup.
  9. Wait for incremental restore complete. Then verify the DR volume is Degraded and there is one failed replica.
  10. Verify the failed replica will be reused for rebuilding (restore actually).
  11. Activate the DR volume and wait for it complete.
  12. Create PV/PVC/Pod for the activated volume.
  13. Validate the volume content.
  14. Verify that writing data to the activated volume works fine.
def test_engine_crash_for_dr_volume(set_random_backupstore, client, core_api, volume_name, csi_pv, pvc, pod_make)

[HA] Test DR volume can be recovered after the engine crashes unexpectedly.

  1. Setup a random backupstore.
  2. Create volume and start the pod.
  3. Write random data to the pod volume and get the md5sum.
  4. Create a backup for the volume.
  5. Create a DR volume from the backup.
  6. Wait for the DR volume init restore complete.
  7. Write more data to the original volume and get the md5sum
  8. Create the 2nd backup for the original volume.
  9. Wait for the incremental restore triggered after the 2nd backup creation.
  10. Crash the DR volume engine process during the incremental restore.
  11. Wait for the DR volume to be detached.
  12. Wait for the DR volume to be reattached.
  13. Verify the DR volume: 13.1. volume.ready == false. 13.2. volume.conditions[Restore].status == True && volume.conditions[Restore].reason == "RestoreInProgress". 13.3. volume.standby == true
  14. Activate the DR volume and wait for it to become detached.
  15. Create a pod for the restored volume and wait for the pod start.
  16. Check the data md5sum for the DR volume.
def test_engine_crash_for_restore_volume(set_random_backupstore, client, core_api, volume_name, csi_pv, pvc, pod_make)

[HA] Test volume can successfully retry restoring after the engine crashes unexpectedly.

  1. Setup a random backupstore.
  2. Create volume and start the pod.
  3. Write random data to the pod volume and get the md5sum.
  4. Create a backup for the volume.
  5. Restore a new volume from the backup.
  6. Crash the engine during the restore.
  7. Wait for the volume to be detached.
  8. Wait for the volume to be reattached.
  9. Verify if 9.1. volume.ready == false. 9.2. volume.conditions[Restore].status == True && volume.conditions[Restore].reason == "RestoreInProgress".
  10. Wait for the volume restore to complete and the volume to be detached.
  11. Recreate a pod for the restored volume and wait for the pod start.
  12. Check the data md5sum for the restored data.
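A rough sketch of step 6 above (crashing the engine process during the restore), assuming the kubernetes Python client; the instance-manager pod name and the process match pattern are assumptions:

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
core_api = client.CoreV1Api()

# Kill the engine (controller) process inside its instance-manager pod with SIGKILL.
kill_cmd = ["/bin/sh", "-c", "pkill -9 -f 'longhorn.*controller'"]  # assumed pattern
stream(core_api.connect_get_namespaced_pod_exec,
       "instance-manager-xxxxx",   # placeholder: the pod hosting the volume's engine
       "longhorn-system",
       command=kill_cmd,
       stderr=True, stdin=False, stdout=True, tty=False)
```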
def test_engine_image_miss_scheduled_perform_volume_operations(core_api, client, set_random_backupstore, volume_name)

Test volume operations when the engine image DaemonSet is miss-scheduled

  1. Create a volume, vol-1, of 3 replicas
  2. Taint node-1 with the taint: key=value:NoSchedule
  3. Verify that we can attach, take snapshot, take a backup, expand, then detach vol-1
def test_engine_image_not_fully_deployed_perform_auto_upgrade_engine(client, core_api)

Test auto upgrade engine feature when engine image DaemonSet is not fully deployed

Prerequisite: Prepare the system for the test by calling the method prepare_engine_not_fully_deployed_environment to have a tainted node and a not fully deployed engine image.

  1. Create 2 volumes vol-1 and vol-2 with 2 replicas
  2. Attach both volumes to make sure they are healthy and have 2 replicas
  3. Detach both volumes
  4. Deploy a new engine image, new-ei
  5. Upgrade vol-1 and vol-2 to the new-ei
  6. Attach vol-2 to current-node
  7. Set Concurrent Automatic Engine Upgrade Per Node Limit setting to 3
  8. In a 2-min retry, verify that Longhorn upgrades the engine image of vol-1 and vol-2.
def test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume(client, core_api, set_random_backupstore)

Test DR, restoring, expanding volumes when engine image DaemonSet is not fully deployed

Prerequisite: Prepare the system for the test by calling the method prepare_engine_not_fully_deployed_environment to have a tainted node and a not fully deployed engine image.

  1. Create volume vol-1 with 2 replicas
  2. Attach vol-1 to node-2, write data and create backup
  3. Create a DR volume (vol-dr) of 2 replicas.
  4. Verify that 2 replicas are on node-2 and node-3 and the DR volume is attached to either node-2 or node-3. Let's say it is attached to node-x
  5. Taint node-x with the taint key=value:NoSchedule
  6. Delete the pod of the engine image DaemonSet on node-x. Now the engine image is missing on node-1 and node-x
  7. Verify that vol-dr is auto-attached to node-y.
  8. Restore a volume named vol-rs with replica count 1 from the backupstore
  9. Verify that the replica is on node-y and the volume is successfully restored.
  10. Wait for vol-rs to finish restoring
  11. Expand vol-rs.
  12. Verify that the expansion is ok
  13. Set Replica Replenishment Wait Interval setting to 600
  14. Crash the replica of vol-1 on node-x. Wait for the replica to fail
  15. In a 2-min retry, verify that Longhorn doesn't create a new replica for vol-1 and doesn't reuse the failed replica on node-x
def test_engine_image_not_fully_deployed_perform_engine_upgrade(client, core_api)

Test engine upgrade when engine image DaemonSet is not fully deployed

Prerequisite: Prepare the system for the test by calling the method prepare_engine_not_fully_deployed_environment_with_volumes to have 2 volumes, a tainted node, and a not fully deployed engine image.

  1. Deploy a new engine image, new-ei
  2. Detach vol-1, verify that you can upgrade vol-1 to new-ei
  3. Detach then attach vol-1 to node-2
  4. Verify that you can live upgrade vol-1 back to the default engine image
  5. Try to upgrade vol-2 to new-ei
  6. Verify that the engineUpgrade API call returns error
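A sketch of steps 5-6, assuming the longhorn client fixture; the new engine image reference is a placeholder and the action name mirrors the engineUpgrade API call mentioned above:

```python
import pytest

def assert_engine_upgrade_rejected(client, volume_name, new_engine_image):
    volume = client.by_id_volume(volume_name)
    # vol-2 still has a replica on the tainted node, so the upgrade must fail.
    with pytest.raises(Exception):
        volume.engineUpgrade(image=new_engine_image)
```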
def test_engine_image_not_fully_deployed_perform_replica_scheduling(client, core_api)

Test replicas scheduling when engine image DaemonSet is not fully deployed

Prerequisite: Prepare the system for the test by calling the method prepare_engine_not_fully_deployed_environment to have a tainted node and a not fully deployed engine image.

  1. Disable the scheduling for node-2
  2. Create a volume, vol-1, with 2 replicas, attach to node-3
  3. Verify that one replica fails to be scheduled
  4. Enable the scheduling for node-2 (see the sketch after this list)
  5. Verify that replicas are scheduled onto node-2 and node-3
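A sketch of the scheduling toggle in steps 1 and 4, assuming the longhorn client fixture; the node name is a placeholder:

```python
def set_node_scheduling(client, node_name, allow):
    node = client.by_id_node(node_name)
    # Update the node's allowScheduling flag through the Longhorn API.
    client.update(node, allowScheduling=allow)

# set_node_scheduling(client, "node-2", False)  # step 1
# set_node_scheduling(client, "node-2", True)   # step 4
```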
def test_engine_image_not_fully_deployed_perform_volume_operations(client, core_api, set_random_backupstore)

Test volume operations when engine image DaemonSet is not fully deployed

Prerequisite: Prepare the system for the test by calling the method prepare_engine_not_fully_deployed_environment_with_volumes to have 2 volumes, a tainted node, and a not fully deployed engine image.

  1. Verify that functions (snapshot, backup, detach) are working ok for vol-1
  2. Detach vol-1
  3. Attach vol-1 to node-1. Verify that Longhorn cannot attach vol-1 to node-1 since there is no engine image on node-1. The attach API call returns error
  4. Verify that we can attach to another node, take snapshot, take a backup, expand, then detach vol-1
  5. Verify that vol-2 cannot be attached to tainted nodes. The attach API call returns error
  6. Verify that vol-2 can be attached to a non-tainted node with degraded status
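A sketch of step 3, assuming the longhorn client fixture; the volume and node names are placeholders:

```python
import pytest

def assert_attach_rejected(client, volume_name, node_name):
    volume = client.by_id_volume(volume_name)
    # node-1 has no running engine image pod, so the attach call should error out.
    with pytest.raises(Exception):
        volume.attach(hostId=node_name)

# assert_attach_rejected(client, "vol-1", "node-1")
```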
def test_ha_backup_deletion_recovery(set_random_backupstore, client, volume_name)

[HA] Test deleting the restored snapshot and rebuild

Backupstore: all

  1. Create volume and attach it to the current node.
  2. Write data to the volume and create snapshot snap2
  3. Backup snap2 to create a backup.
  4. Create volume res_volume from the backup. Check volume data.
  5. Check snapshot chain, make sure backup_snapshot exists.
  6. Delete the backup_snapshot and purge snapshots.
  7. After the purge completes, delete one replica to verify rebuild works.

FIXME: Needs improvement, e.g. rebuild when no snapshot is deleted for restored backup.

def test_ha_prohibit_deleting_last_replica(client, volume_name)

Test prohibiting deleting the last replica

  1. Create volume with one replica and attach to the current node.
  2. Try to delete the replica. It should error out

FIXME: Move out of test_ha.py

def test_ha_recovery_with_expansion(client, volume_name, request)

[HA] Test recovery with volume expansion

  1. Create a volume and attach it to the current node.
  2. Write a large amount of data to the volume
  3. Remove one random replica and wait for the rebuilding to start (see the sketch after this list)
  4. Expand the volume immediately after the rebuilding starts
  5. Check and wait for the volume expansion and rebuilding to complete
  6. Write more data to the volume
  7. Remove another replica of volume
  8. Wait for the volume to start rebuilding and complete it
  9. Check the data integrity
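A sketch of steps 3-4, assuming the longhorn client fixture; the new size is a placeholder:

```python
def remove_replica_then_expand(client, volume_name, new_size):
    volume = client.by_id_volume(volume_name)
    # Remove one replica so rebuilding starts...
    volume.replicaRemove(name=volume.replicas[0].name)
    # ...then immediately request an expansion while the rebuild is in flight.
    volume.expand(size=str(new_size))
```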
def test_ha_salvage(client, core_api, volume_name, disable_auto_salvage)

[HA] Test salvage when volume faulted. TODO: The test cases should cover the following four cases: 1. Manual salvage with revision counter enabled. 2. Manual salvage with revision counter disabled. 3. Auto salvage with revision counter enabled. 4. Auto salvage with revision counter disabled.

Setting: Disable auto salvage

Case 1: Delete all replica processes using instance manager

  1. Create volume and attach to the current node
  2. Write data to the volume.
  3. Crash all the replicas using Instance Manager API
    1. Cannot do it using Longhorn API since a. it will delete data, b. the last replica is not allowed to be deleted
  4. Make sure the volume is detached automatically and changed into faulted state
  5. Make sure both replicas report the failedAt timestamp.
  6. Salvage the volume
  7. Verify that the volume is in detached unknown state, no longer faulted
  8. Verify that all the replicas' failedAt timestamps are cleaned (see the salvage sketch after this case).
  9. Attach the volume and check data
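A sketch of the manual salvage in steps 6-8, assuming the longhorn client fixture; the field names simply mirror the docstring above:

```python
def salvage_failed_replicas(client, volume_name):
    volume = client.by_id_volume(volume_name)
    failed = [r.name for r in volume.replicas if r.failedAt != ""]
    # Salvage clears failedAt on the listed replicas and leaves the volume detached.
    volume.salvage(names=failed)
    volume = client.by_id_volume(volume_name)
    assert all(r.failedAt == "" for r in volume.replicas)
```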

Case 2: Crash all replica processes

Same steps as Case 1 except on step 3, use SIGTERM to crash the processes

Setting: Enable auto salvage.

Case 3: Revision counter disabled.

  1. Set 'Automatic salvage' to true.
  2. Set 'Disable Revision Counter' to true.
  3. Create a volume with 3 replicas.
  4. Attach the volume to a node and write some data to it and save the checksum.
  5. Delete all replica processes using instance manager or crash all replica processes using SIGTERM.
  6. Wait for the volume to become faulted, then healthy.
  7. Verify all 3 replicas are reused successfully.
  8. Check the data in the volume and make sure it's the same as the checksum saved in step 4.

Case 4: Revision counter enabled.

  1. Set 'Automatic salvage' to true.
  2. Set 'Disable Revision Counter' to false.
  3. Create a volume with 3 replicas.
  4. Attach the volume to a node and write some data to it and save the checksum.
  5. Delete all replica processes using instance manager or crash all replica processes using SIGTERM.
  6. Wait for the volume to become faulted, then healthy.
  7. Verify there are 3 replicas, they are all from previous replicas.
  8. Check the data in the volume and make sure it's the same as the checksum saved in step 4.
def test_ha_simple_recovery(client, volume_name)

[HA] Test recovering from one replica failure

  1. Create volume and attach to the current node
  2. Write data to the volume.
  3. Remove one of the replicas using the Longhorn API
  4. Wait for a new replica to be rebuilt.
  5. Check the volume data
def test_inc_restoration_with_multiple_rebuild_and_expansion(set_random_backupstore, client, core_api, volume_name, storage_class, csi_pv, pvc, pod_make)

[HA] Test if the rebuild is disabled for the DR volume

  1. Setup a random backupstore.
  2. Create a pod with a volume and wait for pod to start.
  3. Write data to the volume and get the md5sum.
  4. Create the 1st backup for the volume.
  5. Create a DR volume based on the backup and wait for the init restoration complete.
  6. Shutdown the pod and wait for the std volume detached.
  7. Offline expand the std volume and wait for expansion complete.
  8. Re-launch a pod for the std volume.
  9. Write more data to the std volume. Make sure there is data in the expanded part.
  10. Create the 2nd backup and wait for the backup creation complete.
  11. For the DR volume, delete one replica and trigger incremental restore simultaneously.
  12. Wait for the inc restoration complete and the volume becoming Healthy.
  13. Check the DR volume size and snapshot info. Make sure there is only one snapshot in the volume.
  14. Online expand the std volume and wait for expansion complete.
  15. Write data to the std volume then create the 3rd backup.
  16. Trigger the inc restore then re-verify the snapshot info.
  17. Activate the DR volume.
  18. Create PV/PVC/Pod for the activated volume and wait for the pod start.
  19. Check if the restored volume is state healthy after the attachment.
  20. Check md5sum of the data in the activated volume.
  21. Crash one random replica. Then verify the rebuild still works fine for the activated volume.
  22. Do cleanup.

def test_rebuild_after_replica_file_crash(client, volume_name)

[HA] Test replica rebuild should be triggered if any crashes happened.

  1. Create a longhorn volume with replicas.
  2. Write random data to the volume and get the md5sum.
  3. Remove file volume-head-000.img from one of the replicas.
  4. Wait for the replica rebuild to be triggered.
  5. Verify the old replica containing the crashed file will be reused.
  6. Read the data from the volume and verify the md5sum.
def test_rebuild_failure_with_intensive_data(client, core_api, volume_name, csi_pv, pvc, pod_make)

[HA] Test rebuild failure with intensive data writing

  1. Create PV/PVC/Pod with liveness check
  2. Create volume and wait for pod to start
  3. Write data to /data/test1 inside the pod and get original_checksum_1
  4. Write data to /data/test2 inside the pod and get original_checksum_2
  5. Find running replicas of the volume
  6. Crash one of the running replicas.
  7. Wait for the replica rebuild to start
  8. Crash the replica which is sending data to the rebuilding replica
  9. Wait for volume to finish two rebuilds and become healthy
  10. Check md5sum for both data location
def test_rebuild_replica_and_from_replica_on_the_same_node(client, core_api, volume_name, csi_pv, pvc, pod_make)

[HA] Test the corner case that the from-replica and the rebuilding replica are on the same node

Test prerequisites: set Replica Node Level Soft Anti-Affinity to disabled

  1. Disable the setting replica-soft-anti-affinity.
  2. Set replica replenishment wait interval to an appropriate value.
  3. Create a pod with Longhorn volume and wait for pod to start
  4. Write data to /data/test inside the pod and get original_checksum
  5. Disable scheduling for all nodes except for one.
  6. Find running replicas of the volume
  7. Crash 2 running replicas.
  8. Wait for the replica rebuild to start.
  9. Check if the rebuilding replica is one of the crashed replicas, and this reused replica is rebuilt on the only available node.
  10. Check md5sum for the written data
def test_rebuild_with_inc_restoration(set_random_backupstore, client, core_api, volume_name, csi_pv, pvc, pod_make)

[HA] Test if the rebuild is disabled for the DR volume

  1. Setup a random backupstore.
  2. Create a pod with a volume and wait for pod to start.
  3. Write data to /data/test1 inside the pod and get the md5sum.
  4. Create the 1st backup for the volume.
  5. Create a DR volume based on the backup and wait for the init restoration complete.
  6. Write more data to the original volume then create the 2nd backup.
  7. Delete one replica and trigger incremental restore simultaneously.
  8. Wait for the inc restoration complete and the volume becoming Healthy.
  9. Activate the DR volume.
  10. Create PV/PVC/Pod for the activated volume and wait for the pod start.
  11. Check if the restored volume is state healthy after the attachment.
  12. Check md5sum of the data in the activated volume.
  13. Do cleanup.

def test_rebuild_with_restoration(set_random_backupstore, client, core_api, volume_name, csi_pv, pvc, pod_make)

[HA] Test if the rebuild is disabled for the restoring volume.

This is similar to test_single_replica_restore_failure and test_single_replica_unschedulable_restore_failure. In this version, a replica is deleted. We expect a new replica to be rebuilt in its place and the restore to complete.

  1. Setup a random backupstore.
  2. Do cleanup for the backupstore.
  3. Create a pod with a volume and wait for pod to start.
  4. Write data to the pod volume and get the md5sum.
  5. Create a backup for the volume.
  6. Restore a volume from the backup.
  7. Wait for the volume restore start.
  8. Delete one replica during the restoration.
  9. Wait for the restoration complete and the volume detached.
  10. Check if the replica is rebuilt.
  11. Create PV/PVC/Pod for the restored volume and wait for the pod start.
  12. Check if the restored volume is state Healthy after the attachment.
  13. Check md5sum of the data in the restored volume.
  14. Do cleanup.
def test_recovery_from_im_deletion(client, core_api, volume_name, make_deployment_with_pvc, pvc)

Related issue : https://github.com/longhorn/longhorn/issues/3070

Steps:

  1. Create a volume and PV/PVC.
  2. Create a deployment with 1 pod on node-1, attached to the volume, having the below in its command section:
     command:
     - "/bin/sh"
     - "-ec"
     - |
       touch /data/test
       tail -f /data/test
  3. Wait for the pod to become healthy.
  4. Write small (100MB) data.
  5. Kill the instance-manager-e on node-1.
  6. Wait for the instance-manager-e pod to become healthy.
  7. Wait for the pod to get terminated and recreated.
  8. Read and write in the pod to verify the pod is accessible.

def test_replica_failure_during_attaching(settings_reset, client, core_api, volume_name)

Steps:

  1. Set a short interval for the setting replica-replenishment-wait-interval.
  2. Disable the setting soft-node-anti-affinity.
  3. Create volume1 with 1 replica and attach it to the host node.
  4. Mount volume1 to a new mount point, then use it as an extra node disk.
  5. Disable the scheduling for the default disk of the host node, and make sure the extra disk is the only available disk on the node.
  6. Create and attach volume2, then write data to volume2.
  7. Detach volume2.
  8. Directly unmount volume1 and remove the related mount point directory. -> Verify the extra disk becomes unavailable.
  9. Attach volume2. -> Verify the volume will be attached with state Degraded.
  10. Wait for the replenishment interval. -> Verify a new replica cannot be scheduled.
  11. Enable the default disk for the host node.
  12. Wait for volume2 becoming Healthy.
  13. Verify data content and r/w capability for volume2.

def test_replica_should_not_be_created_when_no_suitable_node_found(client, volume_name, settings_reset)

Test replica should not be created when no suitable node is found.

  1. Make sure 'Replica Node Level Soft Anti-Affinity' is disabled.
  2. Create a volume with 3 replicas.
  3. Attach the volume to a node and write some data to it and save the checksum.
  4. Increase the volume replica number to 4.
  5. No Replica should be created.
  6. Volume should show failed to schedule.
  7. Decrease the volume replica number to 3.
  8. Volume should be healthy.
  9. Check the data in the volume and make sure it's same as the checksum.
def test_restore_volume_with_invalid_backupstore(client, volume_name, backupstore_s3)

[HA] Test that an invalid backup target will not lead to a volume restore.

  1. Enable auto-salvage.
  2. Set a S3 backupstore. (Cannot use NFS server here before fixing #1295)
  3. Create a volume then a backup.
  4. Invalidate the target URL. (e.g.: s3://backupbucket-invalid@us-east-1/backupstore-invalid)
  5. Restoring a volume from the backup should return an error. (The fromBackup field of the volume create API should consist of the invalid target URL and the valid backup volume info.)
  6. Check that the restore volume is not created.
def test_retain_potentially_useful_replicas_in_autosalvage_loop()

Related issue: https://github.com/longhorn/longhorn/issues/7425

Related manual test steps: https://github.com/longhorn/longhorn-manager/pull/2432#issuecomment-1894675916

Steps:

  1. Create a volume with numberOfReplicas=2 and staleReplicaTimeout=1. Consider its two replicas ReplicaA and ReplicaB.
  2. Attach the volume to a node.
  3. Write data to the volume.
  4. Exec into the instance-manager for ReplicaB and delete all .img.meta files. This makes it impossible to restart ReplicaB successfully.
  5. Cordon the node for ReplicaA. This makes it unavailable for autosalvage (see the sketch after the steps).
  6. Crash the instance-managers for both ReplicaA and ReplicaB.
  7. Wait one minute and fifteen seconds. This is longer than staleReplicaTimeout.
  8. Confirm the volume is not healthy.
  9. Confirm ReplicaA was not deleted.
  10. Delete ReplicaB.
  11. Wait for the volume to become healthy.
  12. Verify the data.
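A sketch of step 5 (cordoning the node hosting ReplicaA), assuming the kubernetes Python client; the node name is a placeholder:

```python
from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()

# Mark the node unschedulable, the API equivalent of `kubectl cordon`.
core_api.patch_node("node-of-replica-a", {"spec": {"unschedulable": True}})
```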

def test_reuse_failed_replica(client, core_api, volume_name)

Steps:

  1. Set a long wait interval for the setting replica-replenishment-wait-interval.
  2. Disable the setting soft node anti-affinity.
  3. Create and attach a volume. Then write data to the volume.
  4. Disable the scheduling for a node.
  5. Mess up the data of a random snapshot or the volume head for a replica. Then crash the replica on the node. -> Verify Longhorn won't create a new replica on the node for the volume.
  6. Update the setting replica-replenishment-wait-interval to a small value.
  7. Verify no new replica will be created.
  8. Verify volume replica scheduling should fail.
  9. Update the setting replica-replenishment-wait-interval to a large value.
  10. Enable the scheduling for the node.
  11. Verify the failed replica (in step 5) will be reused.
  12. Verify the volume r/w still works fine.

def test_reuse_failed_replica_with_scheduling_check(client, core_api, volume_name)

Steps:

  1. Set a long wait interval for the setting replica-replenishment-wait-interval.
  2. Disable the setting soft node anti-affinity.
  3. Add tags for all nodes and disks.
  4. Create and attach a volume with node and disk selectors. Then write data to the volume.
  5. Disable the scheduling for the 2 nodes (node1 and node2).
  6. Crash the replicas on node1 and node2. -> Verify Longhorn won't create new replicas on the nodes.
  7. Remove tags for node1 and the related disks.
  8. Enable the scheduling for node1 and node2.
  9. Verify the only failed replica on node2 is reused.
  10. Add the tags back for node1 and the related disks.
  11. Verify the failed replica on node1 is reused.
  12. Verify the volume r/w still works fine.

def test_salvage_auto_crash_all_replicas(client, core_api, storage_class, sts_name, statefulset)

[HA] Test automatic salvage feature by crashing all the replicas

Case #1: Crash all replicas

  1. Create StorageClass and StatefulSet.
  2. Write random data to the pod and get the md5sum.
  3. Run the sync command inside the pod to make sure data is flushed to the volume.
  4. Crash all replica processes using SIGTERM.
  5. Wait for the volume to become faulted, then healthy.
  6. Wait for K8s to terminate the pod and the statefulset to bring the pod to Pending, then Running.
  7. Check the volume path exists in the pod.
  8. Check md5sum of the data in the pod.

Case #2: Crash one replica and then crash all replicas

  9. Crash one of the replicas.
  10. Try to wait for the rebuild to start and the rebuilding replica to be running.
  11. Crash all the replicas.
  12. Make sure the volume and pod recover.
  13. Check md5sum of the data in the pod.

FIXME: Step 5 is only an intermediate state; there may be no way to reliably catch it.

def test_single_replica_failed_during_engine_start(client, core_api, volume_name, csi_pv, pvc, pod)

Test if the volume still works fine when there is an invalid replica/backend in the engine starting phase.

Prerequisite: Setting "replica-replenishment-wait-interval" is 0

  1. Create a pod using Longhorn volume.
  2. Write some data to the volume then get the md5sum.
  3. Create a snapshot.
  4. Repeat step 2 and step 3 three times; then there should be 3 snapshots.
  5. Randomly pick a replica and manually mess up its snapshot meta file.
  6. Delete the pod and wait for the volume detached.
  7. Recreate the pod and wait for the volume attached.
  8. Check if the volume is Degraded and if the chosen replica is ERR once the volume is attached.
  9. Wait for volume rebuild and volume becoming Healthy.
  10. Check volume data.
  11. Check if the volume still works fine by r/w data and creating/removing snapshots.
def test_single_replica_restore_failure(set_random_backupstore, client, core_api, volume_name, csi_pv, pvc, pod_make)

[HA] Test if one replica restore failure will lead to the restore volume becoming Degraded, and if the restore volume is still usable after the failure.

This is similar to test_rebuild_with_restoration and test_single_replica_unschedulable_restore_failure. In this version, a replica is crashed. We expect the crashed replica to be rebuilt and the restore to complete.

  1. Setup a random backupstore.
  2. Do cleanup for the backupstore.
  3. Create a pod with a volume and wait for pod to start.
  4. Write data to the pod volume and get the md5sum.
  5. Create a backup for the volume.
  6. Restore a volume from the backup.
  7. Wait for the volume restore start.
  8. Crash one replica during the restoration.
  9. Wait for the restoration complete and the volume detached.
  10. Check if the replica is rebuilt.
  11. Create PV/PVC/Pod for the restored volume and wait for the pod start.
  12. Check if the restored volume is state Healthy after the attachment.
  13. Check md5sum of the data in the restored volume.
  14. Do cleanup.
def test_single_replica_unschedulable_restore_failure(set_random_backupstore, client, core_api, volume_name, csi_pv, pvc, pod_make)

[HA] Test if the restore can complete if a restoring replica is killed while it is ongoing and cannot be recovered.

This is similar to test_rebuild_with_restoration and test_single_replica_restore_failure. In this version, a replica is crashed and not allowed to recover. However, we enable allow-volume-creation-with-degraded-availability, so we expect the restore to complete anyway.

  1. Setup a random backupstore.
  2. Do cleanup for the backupstore.
  3. Enable allow-volume-creation-with-degraded-availability (to allow restoration to complete without all replicas).
  4. Create a pod with a volume and wait for pod to start.
  5. Write data to the pod volume and get the md5sum.
  6. Create a backup for the volume.
  7. Restore a volume from the backup.
  8. Wait for the volume restore start.
  9. Disable replica rebuilding (to ensure the killed replica cannot recover).
  10. Crash one replica during the restoration.
  11. Wait for the restoration complete and the volume detached.
  12. Create PV/PVC/Pod for the restored volume and wait for the pod start.
  13. Check if the restored volume is state Healthy after the attachment.
  14. Check md5sum of the data in the restored volume.
  15. Do cleanup.
def test_volume_reattach_after_engine_sigkill(client, core_api, storage_class, sts_name, statefulset)

[HA] Test if the volume can be reattached after using SIGKILL to crash the engine process

  1. Create StorageClass and StatefulSet.
  2. Write random data to the pod and get the md5sum.
  3. Crash the engine process by SIGKILL in the engine manager.
  4. Wait for the volume to become faulted, then healthy.
  5. Wait for K8s to terminate the pod and statefulset to bring pod to Pending, then Running.
  6. Check the volume path exists in the pod.
  7. Check md5sum of the data in the pod.
  8. Check new data written to the volume is successful.
def wait_pod_for_remount_request(client, core_api, volume_name, pod_name, original_md5sum, data_path='/data/test')