
Race condition in RBD node plugin may lead to total data loss while unmounting #4966

Closed
martin-helmich opened this issue Nov 19, 2024 · 5 comments · Fixed by #4970

Labels: bug (Something isn't working), component/rbd (Issues related to RBD)

@martin-helmich

Describe the bug

Recently, we encountered a total data loss in a few of our RBD volumes.

From the logs of the RBD node plugin (attached to this issue), we inferred that the following sequence of events occurred:

  • The node plugin attempted to bind-mount the volume from its staging path to its target path at /var/lib/kubelet/pods/.../volumes/kubernetes.io~csi/.../mount.

  • We saw an initial /csi.v1.Node/NodePublishVolume call, followed (for reasons unknown) by a second /csi.v1.Node/NodePublishVolume call exactly two minutes later. We presume this was the CSI timeout [1] after which the kubelet attempts to mount the volume again.

  • This caused the node plugin to bind-mount the volume again over the target path, WITHOUT unmounting the previous bind mount.

  • Shortly after, the volume was unmounted again: the kubelet invoked a /csi.v1.Node/NodeUnpublishVolume call on the node plugin.

  • The node plugin unmounted the volume and called os.RemoveAll on the target path. HOWEVER, since mount had been invoked twice on that volume, there was still an active bind mount to the actual RBD volume in that directory, causing os.RemoveAll to recursively remove the entire contents of the volume.

  • RIP 💀

Environment details

  • Image/version of Ceph CSI driver : v3.12.2
  • Helm chart version : 3.12.2
  • Kernel version : 6.10.11+bpo-amd64 (debian bookworm-backports)
  • Mounter used for mounting PVC (for CephFS it's fuse or kernel; for RBD it's krbd or rbd-nbd) : krbd
  • Kubernetes cluster version : v1.30.5
  • Ceph cluster version : v16.2.14

Steps to reproduce

Steps to reproduce the behavior:

  1. Unknown. Assistance in further debugging might be required.

Presumably:

  1. Schedule a Pod mounting an RBD volume
  2. Have kubelet crash+restart shortly after mounting the volume, for example due to OOM
  3. Terminate the Pod, causing kubelet to unmount (and involuntarily purge) the volume

Actual results

Total data loss in the RBD volume, due to the RBD node plugin running a recursive os.RemoveAll on a still-mounted volume.

Expected behavior

Mounting and unmounting (even under load and occasional error conditions) should work without data corruption.

Logs

Attached to this issue are our logs from the ceph-csi-rbd-nodeplugin from the time of the incident (newest entries at the top).

Explore-logs-2024-11-18 19_34_03.txt

Additional context

Quick&dirty workaround

We implemented a quick workaround by replacing the os.RemoveAll call with a regular os.Remove (a sketch follows after this list), which we can also provide as a PR if required. Our reasoning for this was as follows:

  • If the volume was correctly unmounted, the target path will be an empty directory, and a simple os.Remove call will suffice to remove it correctly.
  • If the volume was not correctly unmounted, the target path will still contain the volume's contents, and the os.Remove call will fail with an error -- which is far preferable to unexpected data loss. In this case, we would expect the NodeUnpublishVolume call to be (eventually) retried and to (also eventually) succeed.
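
A minimal sketch of that workaround (hedged; the surrounding NodeUnpublishVolume context and the usual gRPC status/codes imports are assumed, this is not the actual ceph-csi code):

// Sketch: os.Remove refuses to delete a non-empty directory, so a target
// path that is still bind-mounted makes the call fail instead of wiping
// the volume's contents; the kubelet will then retry NodeUnpublishVolume.
if err := os.Remove(targetPath); err != nil && !os.IsNotExist(err) {
	return nil, status.Error(codes.Internal, err.Error())
}
return &csi.NodeUnpublishVolumeResponse{}, nil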

"The kubelet should make sure that this does not happen"

From reading the code comments around the os.RemoveAll call in question, it seems to be presumed that the kubelet will guard against concurrent volume operations:

// considering kubelet make sure node operations like unpublish/unstage...etc can not be called
// at same time, an explicit locking at time of nodeunpublish is not required.

This does not seem to be the case. In fact, the CSI spec [2] explicitly recognizes that (emphasis mine) "in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin, SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls."

Furthermore, the CephFS node plugin already seems to follow this recommendation: it both checks whether the target path is actually a mount point and uses a simple os.Remove call instead of os.RemoveAll (although even that might be susceptible to race conditions):

err = os.Remove(targetPath)
if err != nil && !os.IsNotExist(err) {
	return nil, status.Error(codes.Internal, err.Error())
}

It should also be noted that the CephFS NodePublishVolume implementation checks whether the target directory is already a mount point and simply returns early in that case (which would also prevent issues like this one):

isMnt, err = util.IsMountPoint(ns.Mounter, targetPath)
if err != nil {
	log.ErrorLog(ctx, "stat failed: %v", err)
	return nil, status.Error(codes.Internal, err.Error())
}
if isMnt {
	log.DebugLog(ctx, "cephfs: volume %s is already bind-mounted to %s", volID, targetPath)
	return &csi.NodePublishVolumeResponse{}, nil
}

Suggested mitigations

  • In NodeUnpublishVolume: Use os.RemoveAll only (if at all) after asserting that the target path is no longer a mount point (see the sketch after this list). Note that this alone might still be potentially racy.
  • In NodePublishVolume: Test if the target path is already bind-mounted to the staging path, and return early without mounting it again.
  • To be entirely sure, it might also be advisable to add a (per-volume) locking mechanism protecting all volume operations, to ensure that no operations run in parallel.
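
A hedged sketch of the first mitigation, mirroring the util.IsMountPoint usage from the CephFS snippets above (the exact integration point in the RBD node server is an assumption, not the actual implementation):

// Sketch for NodeUnpublishVolume: only remove the target path once we have
// verified it is no longer a mount point. Note that check-then-remove is
// still racy without additional per-volume locking.
isMnt, err := util.IsMountPoint(ns.Mounter, targetPath)
if err != nil {
	return nil, status.Error(codes.Internal, err.Error())
}
if isMnt {
	return nil, status.Errorf(codes.Internal, "target path %s is still a mount point, refusing to remove it", targetPath)
}
if err = os.Remove(targetPath); err != nil && !os.IsNotExist(err) {
	return nil, status.Error(codes.Internal, err.Error())
}
return &csi.NodeUnpublishVolumeResponse{}, nil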

Credits for discovering this issue and the associated debugging go to @hensur and @Lucaber.

Footnotes

  1. https://github.com/kubernetes/kubernetes/blob/cf480a3a1a9cb22f3439c0a7922822d9f67f31b5/pkg/volume/csi/csi_plugin.go#L54

  2. https://github.com/container-storage-interface/spec/blob/master/spec.md#error-scheme

@Madhu-1 (Collaborator) commented Nov 19, 2024

In NodeUnpublishVolume: Use os.RemoveAll only (if at all) after asserting that the target path is not still a mountpoint. Note that this alone might still be potentially racy.
In NodePublishVolume: Test if the target path is already bindmounted to the staging path, and return early without mounting it again.

@martin-helmich Thanks for checking this one. I expect that

notMnt, err := ns.Mounter.IsLikelyNotMountPoint(mountPath)
if err == nil {
	return notMnt, nil
}
should detect already existing mounts and return success instead of going ahead and mounting again. As mentioned at https://github.com/kubernetes/utils/blob/6fe5fd82f078d1e207dddbba58d9427258f7956b/mount/mount.go#L59, it might not detect bind mounts, so we need to check on this one. And yes, we should not be using os.RemoveAll; it should be replaced by os.Remove.

To be entirely sure, it might also be advisable to add a (per-volume) locking mechanism protecting all volume operations, to ensure that no operations run in parallel.

We had this earlier; it was removed later on the assumption that this would not happen, and also to speed up the mounts.

Madhu-1 added the bug (Something isn't working) and component/rbd (Issues related to RBD) labels on Nov 19, 2024
@Madhu-1 (Collaborator) commented Nov 19, 2024

To reproduce the issue, let's call NodePublish one more time internally, right after it succeeds, to trigger two NodePublish calls (to mimic the behaviour).

@martin-helmich (Author)

should try to detect already existing mounts and return success instead of going and doing mount again. it also mentioned here that https://github.com/kubernetes/utils/blob/6fe5fd82f078d1e207dddbba58d9427258f7956b/mount/mount.go#L59 it might not detect the bind mounts

Yeah, the docs of IsLikelyNotMountPoint explicitly mention that it does not detect bind mounts. I'd suggest using IsMountPoint instead, which is slightly more expensive but also detects bind mounts (as a side note: the CephFS node plugin already appears to use IsMountPoint for this exact purpose). A sketch of the equivalent change follows below.
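
A hedged sketch of what the equivalent early return could look like in the RBD NodePublishVolume, closely mirroring the CephFS snippet quoted above (the surrounding variables and helpers are assumed to exist in the same form):

// Sketch: return early if the target path is already a mount point, instead
// of stacking a second bind mount on top of the first one.
isMnt, err := util.IsMountPoint(ns.Mounter, targetPath)
if err != nil {
	log.ErrorLog(ctx, "stat failed: %v", err)
	return nil, status.Error(codes.Internal, err.Error())
}
if isMnt {
	log.DebugLog(ctx, "rbd: volume %s is already bind-mounted to %s", volID, targetPath)
	return &csi.NodePublishVolumeResponse{}, nil
}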

@Lucaber commented Nov 20, 2024

For reproducing the issue, Lets call the NodePublish (one more time) internally right after the success to trigger two NodePublish calls (to mimic the behaviour)

Calling NodePublish sequentially (and waiting for a response) should work as expected; the problem occurs when two calls are executed concurrently.

IsLikelyNotMountPoint works by comparing the device ID of the mount point to that of the parent directory (a rough sketch follows the stat output below). This should not be a problem for our bind mounts, which point to a different device.
https://github.com/kubernetes/mount-utils/blob/master/mount_linux.go#L470

root@fiestel-node-s-njkcti-0:/var/lib/kubelet/pods/{pod}/volumes/kubernetes.io~csi/{vol}/mount# stat . | grep Device
Device: 253,80	Inode: 128         Links: 3
root@fiestel-node-s-njkcti-0:/var/lib/kubelet/pods/{pod}/volumes/kubernetes.io~csi/{vol}/mount# stat .. | grep Device
Device: 0,4305	Inode: 204936      Links: 3
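
For illustration, a rough sketch of that device-ID comparison (assuming os, path/filepath, and syscall imports; this is not the exact mount-utils code):

// A path whose device ID differs from its parent's is (likely) a mount
// point. The bind mount above (253,80 vs 0,4305 for the parent) would
// therefore be detected; same-device bind mounts are what this heuristic
// misses.
func isLikelyNotMountPoint(path string) (bool, error) {
	st, err := os.Stat(path)
	if err != nil {
		return true, err
	}
	parent, err := os.Stat(filepath.Dir(path))
	if err != nil {
		return true, err
	}
	return st.Sys().(*syscall.Stat_t).Dev == parent.Sys().(*syscall.Stat_t).Dev, nil
}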

In our case the system was under high load. The mount syscall was probably hanging and the kubelet retried the NodePublishVolume call after 2 minutes.

Time       Event
04:05:01   GRPC call (ID 223): /csi.v1.Node/NodePublishVolume
           IsLikelyNotMountPoint = error; resulting in mkdir and a mount call
04:07:02   GRPC call (ID 244): /csi.v1.Node/NodePublishVolume
           IsLikelyNotMountPoint = true; resulting in a mount call
04:07:17   rbd: successfully mounted stagingPath
04:07:41   rbd: successfully mounted stagingPath

(The exact log output is attached to the issue.)

we had this earlier and it was removed later that this wont happen and also to speed up the mounts.

It should be possible to lock only the specific targetPath. This should not cause any performance issues during normal operation, and it would prevent further race conditions.
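
A hedged sketch of such a per-target-path lock (names are hypothetical, a "sync" import is assumed, and ceph-csi may already have similar per-volume locking utilities that could be reused):

// targetPathLocks serializes node operations on the same target path while
// leaving operations on unrelated volumes unaffected.
type targetPathLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newTargetPathLocks() *targetPathLocks {
	return &targetPathLocks{locks: make(map[string]*sync.Mutex)}
}

// Lock acquires the per-path mutex and returns the matching unlock function,
// so callers can simply `defer locks.Lock(targetPath)()` at the top of
// NodePublishVolume and NodeUnpublishVolume.
func (t *targetPathLocks) Lock(targetPath string) func() {
	t.mu.Lock()
	l, ok := t.locks[targetPath]
	if !ok {
		l = &sync.Mutex{}
		t.locks[targetPath] = l
	}
	t.mu.Unlock()
	l.Lock()
	return l.Unlock
}

One drawback of this simple version is that map entries are never removed; a production implementation would likely want reference counting or try-acquire/release semantics.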

@Madhu-1 (Collaborator) commented Nov 20, 2024

For reproducing the issue, Lets call the NodePublish (one more time) internally right after the success to trigger two NodePublish calls (to mimic the behaviour)

Calling NodePublish sequentially (and waiting for a response) should work as expected, the problem occurs when two calls are executed concurrently.

I tried a hack to call the function in a goroutine.

IsLikelyNotMountPoint works by comparing the device ids of the mountpoint to the parent directory. This should not be a problem for our bindmounts to a different device. https://github.com/kubernetes/mount-utils/blob/master/mount_linux.go#L470

root@fiestel-node-s-njkcti-0:/var/lib/kubelet/pods/{pod}/volumes/kubernetes.io~csi/{vol}/mount# stat . | grep Device
Device: 253,80	Inode: 128         Links: 3
root@fiestel-node-s-njkcti-0:/var/lib/kubelet/pods/{pod}/volumes/kubernetes.io~csi/{vol}/mount# stat .. | grep Device
Device: 0,4305	Inode: 204936      Links: 3

I missed the check in the code where it's checking the root path. Thanks for pointing it out.

In our case the system was under high load. The mount syscall was probably hanging and the kubelet retried the NodePublishVolume call after 2 minutes.

Time       Event
04:05:01   GRPC call (ID 223): /csi.v1.Node/NodePublishVolume
           IsLikelyNotMountPoint = error; resulting in mkdir and a mount call
04:07:02   GRPC call (ID 244): /csi.v1.Node/NodePublishVolume
           IsLikelyNotMountPoint = true; resulting in a mount call
04:07:17   rbd: successfully mounted stagingPath
04:07:41   rbd: successfully mounted stagingPath

(The exact log output is attached to the issue.)

we had this earlier and it was removed later that this wont happen and also to speed up the mounts.

It should be possible to only lock the specific targetPath. This should not cause any performance issues during normal operation and prevent further race conditions.

Yes, that's possible (currently we don't have locks per targetPath).
