HA Volume Migration

Basic instructions

  1. Deploy a migratable StorageClass. E.g.:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: test-sc
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  migratable: "true"
  2. Create a migratable Volume. E.g.:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  storageClassName: test-sc
  resources:
    requests:
      storage: 1Gi
  3. Attach the volume to a node and wait for it to become running. E.g.:
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: test-va-1
spec:
  attacher: driver.longhorn.io
  nodeName: <old_node>
  source:
    persistentVolumeName: <volume_name>
  4. Write some data into the volume; one way is sketched after this list.
  5. Start the migration by attaching the volume to a second node. E.g.:
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: test-va-2
spec:
  attacher: driver.longhorn.io
  nodeName: <new_node>
  source:
    persistentVolumeName: <volume_name>
  6. Trigger the scenarios described below with commands like:
# Attempt to confirm the migration by detaching from <old_node>.
kubectl -n longhorn-system delete -f va-1.yaml

# Attempt to roll back the migration by detaching from <new_node>.
kubectl -n longhorn-system delete -f va-2.yaml

# Check the migration status of the volume.
kubectl -n longhorn-system get volume -oyaml | grep -i nodeid

# View migration related logs.
kubetail -n longhorn-system -l 'app in (longhorn-manager,longhorn-csi-plugin)'

# Watch volume to check if it becomes detached or faulted.
kubectl -n longhorn-system get volume -oyaml -w | grep -e state -e robustness

# Check the names of engines and replicas before and after triggering the scenario for comparison.
# - Some scenarios result in a new engine becoming active (e.g. test-e-0 is replaced by test-e-1).
# - Some scenarios result in new replicas becoming active (e.g. test-r-45032d2c is gone, but test-r-b28953eb is in its
#   place).
kubectl -n longhorn-system get engine
kubectl -n longhorn-system get replica
  7. Before a test, verify that the volume migration is ready. Logs should indicate “Volume migration engine is ready”, and the volume's node IDs should look like this:
kubectl -n longhorn-system get volume -oyaml | grep -i nodeid
    migrationNodeID: eweber-v126-worker-9c1451b4-6464j
    nodeID: eweber-v126-worker-9c1451b4-kgxdq
    currentMigrationNodeID: eweber-v126-worker-9c1451b4-6464j
    currentNodeID: eweber-v126-worker-9c1451b4-kgxdq
    pendingNodeID: ""
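
To run the setup end to end, one option is to save each manifest above to its own file and apply them in order. The StorageClass and PVC file names here are assumptions; va-1.yaml and va-2.yaml match the names used in the scenario commands below.

kubectl apply -f test-sc.yaml   # step 1: migratable StorageClass
kubectl apply -f test-pvc.yaml  # step 2: migratable PVC
kubectl apply -f va-1.yaml      # step 3: attach to <old_node>
kubectl apply -f va-2.yaml      # step 5: start the migration by attaching to <new_node>

For step 4, the volume is block-mode and has no consuming pod, so one way to write data is directly through the Longhorn block device on <old_node> (Longhorn exposes attached volumes at /dev/longhorn/<volume_name>; double-check the device before writing):

# On <old_node>:
dd if=/dev/urandom of=/dev/longhorn/<volume_name> bs=1M count=100 oflag=direct

The scenario snippets below also use the shorthand commands k and kl. They are not defined in this document, but they read as shell aliases along these lines:

alias k='kubectl'
alias kl='kubectl -n longhorn-system'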

Scenarios

1. New engine crash

Crash the engine on the migration node by killing its instance-manager pod.
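
Each Longhorn Engine CR records the node it runs on and the instance-manager that hosts it, which is one way to find the right pod to kill; the field paths below are assumptions based on the Engine CRD and may need adjusting for your Longhorn version. The same lookup identifies the old engine's pod for scenario 2.

kubectl -n longhorn-system get engine -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,IM:.status.instanceManagerName
# Kill the instance-manager whose NODE column is the migration (new) node:
kl delete pod <instance_manager_name>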

1.1 Confirmation immediately after crash

Migration engine and replicas are recreated. Then, confirmation succeeds.

kl delete --wait=false pod instance-manager-ea5f8778d6c99e747289ff09c322d75a && sleep 0.5 && k delete -f va-1.yaml
# OR
kl delete --wait=false pod instance-manager-ea5f8778d6c99e747289ff09c322d75a && sleep 1 && k delete -f va-1.yaml
  • “Waiting to confirm migration until migration engine is ready”
  • “Confirming migration”
  • No detachment
  • New engine and replicas

1.2 Confirmation immediately before crash

Confirmation succeeds. Then, the volume detaches from and reattaches to the new node.

k delete --wait=false -f va-1.yaml && sleep 0.5 && kl delete pod instance-manager-ea5f8778d6c99e747289ff09c322d75a
# OR
k delete --wait=false -f va-1.yaml && sleep 1 && kl delete pod instance-manager-ea5f8778d6c99e747289ff09c322d75a
  • “Confirming migration”
  • “…selected to detach from…”
  • “…selected to attach to…”
  • New engine and replicas

1.3 Rollback immediately before or after crash

Rollback succeeds.

kl delete --wait=false pod instance-manager-ea5f8778d6c99e747289ff09c322d75a && sleep 0.5 && k delete -f va-2.yaml
# OR
kl delete --wait=false pod instance-manager-ea5f8778d6c99e747289ff09c322d75a && sleep 1 && k delete -f va-2.yaml
# OR
k delete --wait=false -f va-2.yaml && sleep 0.5 && kl delete pod instance-manager-ea5f8778d6c99e747289ff09c322d75a
# OR
k delete --wait=false -f va-2.yaml && sleep 1 && kl delete pod instance-manager-ea5f8778d6c99e747289ff09c322d75a
  • “Rolling back migration”
  • No detachment
  • Same engine and replicas

2. Old engine crash

Crash the engine on the old node by killing its instance-manager pod.

2.1 No immediate confirmation or rollback

The volume completely detaches and remains detached. Logs indicate the next steps: deleting either of the two VolumeAttachments gets the volume unstuck.

kl delete pod instance-manager-699da83c0e9d22726e667344227e096b
  • “…selected to detach from…”
  • “Cancelling migration for detached volume…”
  • MigrationFailed event
  • “Volume migration between … and … failed; detach volume from extra node to resume”
kubectl -n longhorn-system get volume -oyaml | grep -i nodeid
    migrationNodeID: ""
    nodeID: ""
    currentMigrationNodeID: ""
    currentNodeID: ""
    pendingNodeID: ""
kl delete -f va-2.yaml
kubectl -n longhorn-system get volume -oyaml | grep -i nodeid
    migrationNodeID: ""
    nodeID: eweber-v126-worker-9c1451b4-kgxdq
    currentMigrationNodeID: ""
    currentNodeID: eweber-v126-worker-9c1451b4-kgxdq
    pendingNodeID: ""

2.2 Confirmation immediately after crash

The volume automatically detaches from the old node. Then, it reattaches to the new node.

kl delete --wait=false pod instance-manager-699da83c0e9d22726e667344227e096b && sleep 0.5 && k delete -f va-1.yaml
# OR
kl delete --wait=false pod instance-manager-699da83c0e9d22726e667344227e096b && sleep 1 && k delete -f va-1.yaml
  • “…selected to detach from…”
  • “Cancelling migration for detached volume…”
  • MigrationFailed event
  • “…selected to attach to…”
  • Same engine and replicas

2.3 Confirmation immediately before crash

Confirmation succeeds.

k delete --wait=false -f va-1.yaml && sleep 0.5 && kl delete pod instance-manager-699da83c0e9d22726e667344227e096b
# OR
k delete --wait=false -f va-1.yaml && sleep 1 && kl delete pod instance-manager-699da83c0e9d22726e667344227e096b
  • “Confirming migration…”
  • No detachment
  • New engine and replicas

2.4 Rollback immediately after crash

The volume automatically detaches from the old node. Then, it reattaches to the old node.

kl delete --wait=false pod instance-manager-699da83c0e9d22726e667344227e096b && sleep 0.5 && k delete -f va-2.yaml
# OR
kl delete --wait=false pod instance-manager-699da83c0e9d22726e667344227e096b && sleep 1 && k delete -f va-2.yaml
  • “…selected to detach from…”
  • “Cancelling migration for detached volume…”
  • MigrationFailed event
  • “…selected to attach to…”
  • Same engine and replicas

2.5 Rollback immediately before crash

Rollback succeeds. Then, the volume detaches from and reattaches to the old node.

k delete --wait=false -f va-2.yaml && sleep 0.5 && kl delete pod instance-manager-699da83c0e9d22726e667344227e096b
# OR
k delete --wait=false -f va-2.yaml && sleep 1 && kl delete pod instance-manager-699da83c0e9d22726e667344227e096b
  • “Rolling back migration”
  • “…selected to detach from…”
  • “…selected to attach to…”
  • Same engine and replicas (rolled back)

3. Single replica crash

Crash the replica on a node that is neither the old node nor the migration node by cordoning the node and killing its instance-manager pod.
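
One way to pick a suitable node is to list the Replica CRs, which record each replica's node and instance-manager; as with the Engine CR lookup above, the field paths are assumptions to verify against your Longhorn version.

kubectl -n longhorn-system get replica -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,IM:.status.instanceManagerName
# Pick a NODE that is neither <old_node> nor <new_node>, then cordon it and kill its instance-manager pod as in the commands below.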

3.1 Degraded before migration and confirmation

Migration starts while the volume is degraded. Confirmation succeeds.

k cordon eweber-v126-worker-9c1451b4-rw5hf
kl delete pod instance-manager-6852914a55e4566d3ddea43529df22e0
k delete -f va-1.yaml
  • “Confirming migration”
  • No detachment
  • New engine and replicas

3.2 Degraded before migration and rollback

Migration starts while the volume is degraded. Rollback succeeds.

k cordon eweber-v126-worker-9c1451b4-rw5hf
kl delete pod instance-manager-6852914a55e4566d3ddea43529df22e0

k apply -f va-1.yaml
k apply -f va-2.yaml
k delete -f va-2.yaml
  • “Rolling back migration”
  • No detachment
  • Same engine and replicas

3.3 Degraded between migration start and confirmation

Confirmation succeeds.

k apply -f va-1.yaml
k apply -f va-2.yaml

k cordon eweber-v126-worker-9c1451b4-rw5hf
kl delete pod instance-manager-6852914a55e4566d3ddea43529df22e0

k delete -f va-1.yaml
  • “Confirming migration”
  • New engine and replicas

3.4 Degraded between migration start and rollback

Rollback succeeds.

k apply -f va-1.yaml
k apply -f va-2.yaml

k cordon eweber-v126-worker-9c1451b4-rw5hf
kl delete pod instance-manager-6852914a55e4566d3ddea43529df22e0

k delete -f va-2.yaml
  • “Rolling back migration”
  • Same engine and replicas
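
After each of these degraded-volume cases, uncordoning the node lets Longhorn rebuild the crashed replica and return the volume to healthy; the node name follows the examples above.

k uncordon eweber-v126-worker-9c1451b4-rw5hf
# Watch robustness return from degraded to healthy:
kubectl -n longhorn-system get volume -oyaml -w | grep robustness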

4. Attempt to attach to three nodes

The third attachment fails.
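
va-3.yaml is not shown in the basic instructions; a minimal sketch mirroring va-1.yaml and va-2.yaml, assuming a third node <third_node>:
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: test-va-3
spec:
  attacher: driver.longhorn.io
  nodeName: <third_node>
  source:
    persistentVolumeName: <volume_name>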

kl apply -f va-1.yaml
kl apply -f va-2.yaml
kl apply -f va-3.yaml
  • test-va-1 attached
  • test-va-2 attached
  • test-va-3 not attached
  • “…cannot attach migratable volume to more than two nodes…”
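
The error also surfaces on the third VolumeAttachment object itself; status.attachError is standard Kubernetes API, and the object name follows the example above.

kubectl get volumeattachment test-va-3 -o jsonpath='{.status.attachError.message}'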

5. New engine node down

  1. Hard shut down the node running the migration engine.
  2. Wait until Kubernetes recognizes the node is down. (This is IMPORTANT! Otherwise, it is a different test case.)
  3. Attempt a confirmation or rollback.
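
For step 2, one way to watch for the node going down, followed by the usual VolumeAttachment deletions for step 3 (the node name is a placeholder):

kubectl get node <new_node> -w
# Proceed once STATUS shows NotReady, then:
kubectl -n longhorn-system delete -f va-1.yaml   # confirmation (5.1)
# OR
kubectl -n longhorn-system delete -f va-2.yaml   # rollback (5.2)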

5.1 Confirmation

The volume is allowed to detach from the old node (special logic in the code permits this). It then attempts to attach cleanly to the migration node, but is stuck until that node comes back.

  • “Waiting to confirm migration until migration engine is ready”
  • “Detaching volume for attempted migration to down node”
  • “…selected to attach to…”
  • Same engine and replicas
  • Stuck in attaching waiting for node to come back
kubectl -n longhorn-system get volume -oyaml | grep -i nodeid
    migrationNodeID: ""
    nodeID: eweber-v126-worker-9c1451b4-6464j
    currentMigrationNodeID: ""
    currentNodeID: ""
    pendingNodeID: ""

5.2 Rollback

Rollback succeeds.

  • “Rolling back migration”
  • No detachment
  • Same engine and replicas

6. Old engine node down

  1. Hard shut down the node running the old engine.
  2. Wait until Kubernetes recognizes the node is down. (This is IMPORTANT! Otherwise, it is a different test case.)
  3. Verify the volume is no longer attached and no longer migrating. It should remain in this state indefinitely until a confirmation or rollback is attempted.
  4. Attempt a confirmation or rollback.
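
The confirmation and rollback attempts are the same deletions of test-va-1 and test-va-2 as in scenario 5. While waiting, the eviction of the old node's instance-manager pod can be watched; the grep is just a convenience.

kubectl -n longhorn-system get pods -w | grep instance-manager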

6.1 Confirmation

The migration is stuck until the Kubernetes pod eviction controller decides to terminate the instance-manager pod that was running on the old node. Then, Longhorn detaches the volume and cleanly reattaches it to the migration node.
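
How long this takes depends on when the instance-manager pod is evicted. With stock Kubernetes settings, pods without an explicit toleration receive a 300-second toleration for the node.kubernetes.io/unreachable taint, so eviction typically begins about five minutes after the node goes NotReady. One way to check the tolerations actually applied (the pod name reuses the old-engine example from scenario 2):

kubectl -n longhorn-system get pod instance-manager-699da83c0e9d22726e667344227e096b -o jsonpath='{.spec.tolerations}'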

  • “Waiting to confirm migration…” (new engine crashes when old one does)
  • Eventually…
  • “…selected to detach from…”
  • “Cancelling migration for detached volume…”
  • “…selected to attach to…”
  • Same engine and replicas

6.2 Rollback

The migration is stuck until the Kubernetes pod eviction controller decides to terminate the instance-manager pod that was running on the old node. Then, Longhorn detaches the volume and attempts to cleanly reattach it to the old node, but it is stuck until the node comes back.

  • “Rolling back migration”
  • Stuck in attached with engine and one replica unknown
  • Eventually…
  • “…selected to detach from…”
  • “…selected to attach to…”
  • Stuck in attaching waiting for node to come back
  • Same engine and replicas
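
A possible reset between runs, assuming the file names from the setup sketch in the basic instructions (adjust to whatever names were actually used; deleting the PVC and StorageClass also deletes the test data):

kubectl delete -f va-2.yaml -f va-1.yaml --ignore-not-found
# For a completely fresh start:
kubectl delete -f test-pvc.yaml -f test-sc.yaml --ignore-not-found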