Related issues
Basic instructions
- Deploy a migratable StorageClass. E.g.:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: test-sc
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  migratable: "true"
- Create a migratable Volume. E.g.:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  storageClassName: test-sc
  resources:
    requests:
      storage: 1Gi
- Attach the volume to a node and wait for it to reach the running state. E.g.:
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: test-va-1
spec:
  attacher: driver.longhorn.io
  nodeName: <old_node>
  source:
    persistentVolumeName: <volume_name>
- Write some data into the volume.
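For example (a minimal sketch, assuming Longhorn exposes the attached block device on <old_node> at /dev/longhorn/<volume_name> and that you have shell access to that node):
# On <old_node>: write some data directly to the block device and record a checksum to compare after migration.
dd if=/dev/urandom of=/dev/longhorn/<volume_name> bs=1M count=100 oflag=direct
dd if=/dev/longhorn/<volume_name> bs=1M count=100 iflag=direct | sha256sum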
- Start the migration by attaching the volume to a second node.
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: test-va-2
spec:
  attacher: driver.longhorn.io
  nodeName: <new_node>
  source:
    persistentVolumeName: <volume_name>
- Trigger the scenarios described below with commands like the following. In the scenario sections, k and kl are shorthand for kubectl and kubectl -n longhorn-system, respectively.
# Attempt to confirm the migration by detaching from <old_node>.
kubectl -n longhorn-system delete -f va-1.yaml
# Attempt to roll back the migration by detaching from <new_node>.
kubectl -n longhorn-system delete -f va-2.yaml
# Check the migration status of the volume.
kubectl -n longhorn-system get volume -oyaml | grep -i nodeid
# View migration related logs.
kubetail -n longhorn-system -l 'app in (longhorn-manager,longhorn-csi-plugin)'
# Watch volume to check if it becomes detached or faulted.
kubectl -n longhorn-system get volume -oyaml -w | grep -e state -e robustness
# Check the names of engines and replicas before and after triggering the scenario for comparison.
# - Some scenarios result in a new engine becoming active (e.g. test-e-0 is replaced by test-e-1).
# - Some scenarios result in new replicas becoming active (e.g. test-r-45032d2c is gone, but test-r-b28953eb is in its place).
kubectl -n longhorn-system get engine
kubectl -n longhorn-system get replica
- Before a test, verify the volume migration is ready. Logs should indicate “Volume migration engine is ready”, and:
kubectl -n longhorn-system get volume -oyaml | grep -i nodeid
migrationNodeID: eweber-v126-worker-9c1451b4-6464j
nodeID: eweber-v126-worker-9c1451b4-kgxdq
currentMigrationNodeID: eweber-v126-worker-9c1451b4-6464j
currentNodeID: eweber-v126-worker-9c1451b4-kgxdq
pendingNodeID: ""
Scenarios
1. New engine crash
Crash the engine on the migration node by killing its instance-manager pod.
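One way to identify the instance-manager pod to kill (a sketch; it assumes the Engine CR reports its node in spec.nodeID and its hosting instance manager in status.instanceManagerName, and the angle-bracketed names are placeholders):
# List the volume's engines; during migration there is one engine on <old_node> and one on <new_node>.
kubectl -n longhorn-system get engine -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,IM:.status.instanceManagerName
# Delete the instance-manager pod that hosts the engine on the migration node.
kubectl -n longhorn-system delete pod <instance_manager_pod>
The same approach applies when crashing the engine on the old node in scenario 2.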
1.1. Confirmation immediately after crash
Migration engine and replicas are recreated. Then, confirmation succeeds.
kl delete --wait=false pod instance-manager-ea5f8778d6c99e747289ff09c322d75a && sleep 0.5 && k delete -f va-1.yaml
# OR
kl delete --wait=false pod instance-manager-ea5f8778d6c99e747289ff09c322d75a && sleep 1 && k delete -f va-1.yaml
- “Waiting to confirm migration until migration engine is ready”
- “Confirming migration”
- No detachment
- New engine and replicas
1.2. Confirmation immediately before crash
Confirmation succeeds. Then, the volume detaches from and reattaches to the new node.
k delete --wait=false -f va-1.yaml && sleep 0.5 && kl delete pod instance-manager-ea5f8778d6c99e747289ff09c322d75a
# OR
k delete --wait=false -f va-1.yaml && sleep 1 && kl delete pod instance-manager-ea5f8778d6c99e747289ff09c322d75a
- “Confirming migration”
- “…selected to detach from…”
- “…selected to attach to…”
- New engine and replicas
1.3. Rollback immediately before or after crash
Rollback succeeds.
kl delete --wait=false pod instance-manager-ea5f8778d6c99e747289ff09c322d75a && sleep 0.5 && k delete -f va-2.yaml
# OR
kl delete --wait=false pod instance-manager-ea5f8778d6c99e747289ff09c322d75a && sleep 1 && k delete -f va-2.yaml
# OR
k delete --wait=false -f va-2.yaml && sleep 0.5 && kl delete pod instance-manager-ea5f8778d6c99e747289ff09c322d75a
# OR
k delete --wait=false -f va-2.yaml && sleep 1 && kl delete pod instance-manager-ea5f8778d6c99e747289ff09c322d75a
- “Rolling back migration”
- No detachment
- Same engine and replicas
2. Old engine crash
Crash the engine on the old node by killing its instance-manager pod.
2.1 No immediate confirmation or rollback
The volume completely detaches and remains detached. Logs indicate next steps. Deleting either of the two VolumeAttachments gets the volume unstuck.
kl delete pod instance-manager-699da83c0e9d22726e667344227e096b
- “…selected to detach from…”
- “Cancelling migration for detached volume…”
- MigrationFailed event
- “Volume migration between … and … failed; detach volume from extra node to resume”
kubectl -n longhorn-system get volume -oyaml | grep -i nodeid
migrationNodeID: ""
nodeID: ""
currentMigrationNodeID: ""
currentNodeID: ""
pendingNodeID: ""
kl delete -f va-2.yaml
kubectl -n longhorn-system get volume -oyaml | grep -i nodeid
migrationNodeID: ""
nodeID: eweber-v126-worker-9c1451b4-kgxdq
currentMigrationNodeID: ""
currentNodeID: eweber-v126-worker-9c1451b4-kgxdq
pendingNodeID: ""
2.2 Confirmation immediately after crash
The volume automatically detaches from the old node. Then, it reattaches to the new node.
kl delete --wait=false pod instance-manager-699da83c0e9d22726e667344227e096b && sleep 0.5 && k delete -f va-1.yaml
# OR
kl delete --wait=false pod instance-manager-699da83c0e9d22726e667344227e096b && sleep 1 && k delete -f va-1.yaml
- “…selected to detach from…”
- “Cancelling migration for detached volume…”
- MigrationFailed event
- “…selected to attach to…”
- Same engine and replicas
2.3 Confirmation immediately before crash
Confirmation succeeds.
k delete --wait=false -f va-1.yaml && sleep 0.5 && kl delete pod instance-manager-699da83c0e9d22726e667344227e096b
# OR
k delete --wait=false -f va-1.yaml && sleep 1 && kl delete pod instance-manager-699da83c0e9d22726e667344227e096b
- “Confirming migration…”
- No detachment
- New engine and replicas
2.4 Rollback immediately after crash
The volume automatically detaches from the old node. Then, it reattaches to the old node.
kl delete --wait=false pod instance-manager-699da83c0e9d22726e667344227e096b && sleep 0.5 && k delete -f va-2.yaml
# OR
kl delete --wait=false pod instance-manager-699da83c0e9d22726e667344227e096b && sleep 1 && k delete -f va-2.yaml
- “…selected to detach from…”
- “Cancelling migration for detached volume…”
- MigrationFailed event
- “…selected to attach to…”
- Same engine and replicas
2.5 Rollback immediately before crash
Rollback succeeds. Then, the volume detaches from and reattaches to the old node.
k delete --wait=false -f va-2.yaml && sleep 0.5 && kl delete pod instance-manager-699da83c0e9d22726e667344227e096b
# OR
k delete --wait=false -f va-2.yaml && sleep 1 && kl delete pod instance-manager-699da83c0e9d22726e667344227e096b
- “Rolling back migration”
- “…selected to detach from…”
- “…selected to attach to…”
- Same engine and replicas (rolled back)
3. Single replica crash
Crash the replica on a node that is neither the old node nor the migration node by cordoning that node and killing its instance-manager pod.
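A similar sketch for picking the replica to crash (assuming the Replica CR also reports spec.nodeID and status.instanceManagerName; angle-bracketed names are placeholders):
# Find a replica whose node is neither <old_node> nor <new_node>, then cordon that node and kill its instance manager.
kubectl -n longhorn-system get replica -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,IM:.status.instanceManagerName
kubectl cordon <replica_node>
kubectl -n longhorn-system delete pod <instance_manager_pod>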
3.1 Degraded before migration and confirmation
Migration starts while the volume is degraded. Confirmation succeeds.
k cordon eweber-v126-worker-9c1451b4-rw5hf
kl delete pod instance-manager-6852914a55e4566d3ddea43529df22e0
k delete -f va-1.yaml
- “Confirming migration”
- No detachment
- New engine and replicas
3.2 Degraded before migration and rollback
Migration starts while the volume is degraded. Rollback succeeds.
k cordon eweber-v126-worker-9c1451b4-rw5hf
kl delete pod instance-manager-6852914a55e4566d3ddea43529df22e0
k apply -f va-1.yaml
k apply -f va-2.yaml
k delete -f va-2.yaml
- “Rolling back migration”
- No detachment
- Same engine and replicas
3.3 Degraded between migration start and confirmation
Confirmation succeeds.
k apply -f va-1.yaml
k apply -f va-2.yaml
k cordon eweber-v126-worker-9c1451b4-rw5hf
kl delete pod instance-manager-6852914a55e4566d3ddea43529df22e0
k delete -f va-1.yaml
- “Confirming migration”
- New engine and replicas
3.4 Degraded between migration start and rollback
Rollback succeeds.
k apply -f va-1.yaml
k apply -f va-2.yaml
k cordon eweber-v126-worker-9c1451b4-rw5hf
kl delete pod instance-manager-6852914a55e4566d3ddea43529df22e0
k delete -f va-2.yaml
- “Rolling back migration”
- Same engine and replicas
4. Attempt to attach to three nodes
The third attachment fails.
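va-3.yaml is not listed above; a minimal sketch, assuming it is simply a third VolumeAttachment targeting another node (<third_node> is a placeholder):
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: test-va-3
spec:
  attacher: driver.longhorn.io
  nodeName: <third_node>
  source:
    persistentVolumeName: <volume_name>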
kl apply -f va-1.yaml
kl apply -f va-2.yaml
kl apply -f va-3.yaml
- test-va-1 attached
- test-va-2 attached
- test-va-3 not attached
- “…cannot attach migratable volume to more than two nodes…”
5. New engine node down
- Hard shut down the node running the migration engine.
- Wait until Kubernetes recognizes the node is down. (This is IMPORTANT! Otherwise, it is a different test case.) A quick way to check is shown after this list.
- Attempt a confirmation or rollback.
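To confirm Kubernetes has recognized the node as down (standard kubectl; the node name is a placeholder):
# Watch until the powered-off node reports NotReady.
kubectl get node <new_node> -w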
5.1 Confirmation
The volume is allowed to detach from the old node (special logic in the code). It then attempts to cleanly attach to the migration node, but is stuck until that node comes back.
- “Waiting to confirm migration until migration engine is ready”
- “Detaching volume for attempted migration to down node”
- “…selected to attach to…”
- Same engine and replicas
- Stuck in attaching waiting for node to come back
kubectl -n longhorn-system get volume -oyaml | grep -i nodeid
migrationNodeID: ""
nodeID: eweber-v126-worker-9c1451b4-6464j
currentMigrationNodeID: ""
currentNodeID: ""
pendingNodeID: ""
5.2 Rollback
Rollback succeeds.
- “Rolling back migration”
- No detachment
- Same engine and replicas
6. Old engine node down
- Hard shut down the node running the old engine.
- Wait until Kubernetes recognizes the node is down. (This is IMPORTANT! Otherwise, it is a different test case.)
- Verify the volume is no longer attached and no longer migrating. It should remain in this state indefinitely until a confirmation or rollback is attempted.
- Attempt a confirmation or rollback.
6.1 Confirmation
The migration is stuck until the Kubernetes pod eviction controller decides to terminate the instance-manager pod that was running on the old node. Then, Longhorn detaches the volume and cleanly reattaches it to the migration node.
- “Waiting to confirm migration…” (new engine crashes when old one does)
- Eventually…
- “…selected to detach from…”
- “Cancelling migration for detached volume…”
- “…selected to attach to…”
- Same engine and replicas
6.2 Rollback
The migration is stuck until the Kubernetes pod eviction controller decides to terminate the instance-manager pod that was running on the old node. Then, Longhorn detaches the volume and attempts to cleanly reattach it to the old node, but it is stuck until the node comes back.
- “Rolling back migration”
- Stuck in attached with engine and one replica unknown
- Eventually…
- “…selected to detach from…”
- “…selected to attach to…”
- Stuck in attaching waiting for node to come back
- Same engine and replicas