Race condition in RBD node plugin may lead to total data loss while unmounting #4966
Comments
@martin-helmich Thanks for checking this one. I expect the relevant code is at ceph-csi/internal/rbd/nodeserver.go, lines 875 to 878 in 3e9b438: we had this check earlier, and it was later removed on the assumption that this won't happen, and also to speed up the mounts.
To reproduce the issue, let's call NodePublish one more time internally, right after the first call succeeds, to trigger two NodePublish calls (to mimic the behaviour).
Yeah, the docs of ...
Calling NodePublish sequentially (and waiting for a response) should work as expected; the problem occurs when two calls are executed concurrently.
In our case the system was under high load. The mount syscall was probably hanging, and the kubelet retried the NodePublishVolume call.
(The exact log output is attached to the issue.)
It should be possible to only lock the specific targetPath.
I tried a hack to call the function in a goroutine.
I missed the check in the code where it checks the root path. Thanks for pointing it out.
Yes, that's possible (currently we don't have locks per targetPath).
Describe the bug
Recently, we encountered a total data loss in a few of our RBD volumes.
From the logs of the RBD nodeplugin, we inferred that the following condition occurred (all these steps can be correlated to the rbd-nodeplugin logs, attached to this issue):
1. The node plugin attempted to bindmount the volume from its staging path to its target path at /var/lib/kubelet/pods/.../volumes/kubernetes.io~csi/.../mount. We were seeing an initial /csi.v1.Node/NodePublishVolume call, followed (for reasons unknown) by a second /csi.v1.Node/NodePublishVolume call exactly two minutes after that. We're presuming that this was the CSI timeout [1] after which the kubelet attempts to mount the volume again.
2. The second call causes the node plugin to bindmount the volume again over the target path, WITHOUT unmounting the previous bindmount.
3. Shortly after, the volume is unmounted again: the kubelet invokes a /csi.v1.Node/NodeUnpublishVolume call to the node plugin.
4. The node plugin unmounts the volume and calls os.RemoveAll on the target path. HOWEVER, since mount was invoked twice on that volume, there is still an active bind mount to the actual RBD volume in that directory, causing os.RemoveAll to recursively remove the entire contents of that volume. RIP 💀
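To make the failure mode concrete, the following sketch shows roughly what the cleanup boils down to. This is illustrative only, not ceph-csi's actual code; it assumes Linux and the golang.org/x/sys/unix package, and the function name is invented.

```go
// Sketch only: illustrates the failure mode described above, not ceph-csi's
// actual implementation.
package sketch

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// unpublish mimics the problematic cleanup: one unmount, then a recursive delete.
func unpublish(targetPath string) error {
	// Pops only the *topmost* mount at targetPath. If NodePublishVolume ran
	// twice, a second bind mount to the RBD volume is still active underneath.
	if err := unix.Unmount(targetPath, 0); err != nil {
		return fmt.Errorf("unmount %s: %w", targetPath, err)
	}

	// With a bind mount still active, this descends into the live RBD
	// filesystem and deletes all of its contents -- the data loss seen here.
	return os.RemoveAll(targetPath)
}
```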
Environment details
Steps to reproduce
Steps to reproduce the behavior, presumably:
1. NodePublishVolume is called for an RBD volume and succeeds, bind-mounting the staging path onto the target path.
2. A second NodePublishVolume call for the same volume and target path is issued (e.g. after the kubelet's CSI timeout), bind-mounting the volume a second time over the target path.
3. NodeUnpublishVolume is called; the node plugin unmounts only the topmost mount and then runs os.RemoveAll on the still-mounted target path.
Actual results
Total data loss in the RBD volume, due to the RBD node plugin running a recursive os.RemoveAll on a still-mounted volume.

Expected behavior
Mounting and unmounting (even under load and occasional error conditions) should work without data corruption.
Logs
Attached to this issue are our logs from the ceph-csi-rbd-nodeplugin from the time of the incident (newest entries at the top).
Explore-logs-2024-11-18 19_34_03.txt
Additional context
Quick&dirty workaround
We implemented a quick workaround by replacing the os.RemoveAll call with a regular os.Remove, which we can also provide as a PR if required. Our reasoning for this was as follows:
- If the target path is no longer mounted (the expected case), it is an empty directory, and a regular os.Remove call will suffice to remove it correctly.
- If the target path is unexpectedly still mounted (as in the race described above), the os.Remove call will fail with an error -- which is very preferable to an unexpected data loss. In this case, we would expect the NodeUnpublishVolume call to be (eventually) retried and to (also eventually) succeed.
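Roughly, the workaround amounts to the following. This is a simplified sketch, not our exact patch; the helper name is invented, and the gRPC error wrapping is only one possible choice.

```go
// Sketch of the workaround: replace the recursive delete with a
// non-recursive one.
package sketch

import (
	"os"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func removeTargetPath(targetPath string) error {
	// os.Remove only deletes an empty directory. If the target path is still
	// a mountpoint (and therefore non-empty), it fails instead of descending
	// into the mounted RBD volume the way os.RemoveAll would.
	if err := os.Remove(targetPath); err != nil {
		// Returning an error lets the CO retry NodeUnpublishVolume later.
		return status.Errorf(codes.Internal, "failed to remove target path %s: %v", targetPath, err)
	}

	return nil
}
```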
"The kubelet should make sure that this does not happen"
From reading the code comments around the os.RemoveAll call in question, it seems to be presumed that the kubelet will guard against concurrent volume operations:

ceph-csi/internal/rbd/nodeserver.go, lines 917 to 918 in 3e9b438
This does not seem to be the case. In fact, the CSI spec [2] explicitly recognizes that (emphasis mine) "in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin, SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls."
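One way to follow that recommendation would be a per-target-path lock that rejects a concurrent secondary call with ABORTED instead of letting it stack another bind mount. The sketch below is purely illustrative: it is a generic keyed try-lock, not ceph-csi's existing locking utilities, and all names are invented.

```go
// Illustration only: a keyed try-lock per target path, used to reject a
// concurrent second NodePublishVolume/NodeUnpublishVolume for the same path
// with ABORTED, as the CSI spec permits.
package sketch

import (
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type pathLocks struct {
	mu    sync.Mutex
	locks map[string]struct{}
}

func newPathLocks() *pathLocks {
	return &pathLocks{locks: make(map[string]struct{})}
}

// tryAcquire returns false if an operation on this target path is in flight.
func (p *pathLocks) tryAcquire(targetPath string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, busy := p.locks[targetPath]; busy {
		return false
	}
	p.locks[targetPath] = struct{}{}
	return true
}

func (p *pathLocks) release(targetPath string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.locks, targetPath)
}

// guardedPublish shows how a NodePublishVolume-style handler could use the lock.
func guardedPublish(locks *pathLocks, targetPath string, publish func() error) error {
	if !locks.tryAcquire(targetPath) {
		// Reject the secondary call instead of stacking another bind mount.
		return status.Errorf(codes.Aborted, "an operation on %s is already in progress", targetPath)
	}
	defer locks.release(targetPath)

	return publish()
}
```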
Furthermore, the CephFS node plugin already seems to follow this recommendation: it both checks whether the target path is actually a mount point and uses a simple os.Remove call instead of os.RemoveAll (although even that might be susceptible to race conditions):

ceph-csi/internal/cephfs/nodeserver.go, lines 642 to 645 in 3e9b438
It should also be noted that the CephFS implementation checks in its NodePublishVolume implementation whether the target directory is already a mountpoint, and just returns early in this case (which would also prevent issues like these):

ceph-csi/internal/cephfs/nodeserver.go, lines 551 to 562 in 3e9b438
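An analogous early return for the RBD node plugin could look roughly like the sketch below. This is illustrative only: it assumes the k8s.io/mount-utils package, the function and parameter names are invented, and it is not the actual CephFS or RBD implementation.

```go
// Sketch only: an early return in a NodePublishVolume-style handler when the
// target path is already mounted.
package sketch

import (
	mount "k8s.io/mount-utils"
)

func publishIfNotMounted(targetPath string, bindMount func() error) error {
	mounter := mount.New("")

	// IsLikelyNotMountPoint is a cheap check (it can miss some bind mounts on
	// the same device, so a mountinfo-based check would be more thorough).
	notMnt, err := mounter.IsLikelyNotMountPoint(targetPath)
	if err != nil {
		return err
	}
	if !notMnt {
		// Already bind-mounted (e.g. by an earlier NodePublishVolume call):
		// return success without stacking another mount.
		return nil
	}

	return bindMount()
}
```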
Suggested mitigations
- NodeUnpublishVolume: Use os.RemoveAll only (if at all) after asserting that the target path is not still a mountpoint. Note that this alone might still be potentially racy (see the sketch below).
- NodePublishVolume: Test whether the target path is already bindmounted to the staging path, and return early without mounting it again.

Credits for discovering this issue and the associated debugging go to @hensur and @Lucaber.
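For the first mitigation, a minimal sketch (again assuming k8s.io/mount-utils and invented names; not a proposed patch) might look like this:

```go
// Sketch only: assert that the target path is no longer a mountpoint before
// removing it, and never remove recursively.
package sketch

import (
	"fmt"
	"os"

	mount "k8s.io/mount-utils"
)

func cleanupTargetPath(targetPath string) error {
	mounter := mount.New("")

	notMnt, err := mounter.IsLikelyNotMountPoint(targetPath)
	if err != nil {
		if os.IsNotExist(err) {
			// Nothing left to clean up.
			return nil
		}
		return err
	}
	if !notMnt {
		// Still mounted (e.g. a stacked bind mount from a duplicate
		// NodePublishVolume): refuse to delete and let the CO retry.
		return fmt.Errorf("%s is still a mountpoint, refusing to remove it", targetPath)
	}

	// Non-recursive removal: even if the check above raced with a new mount,
	// os.Remove fails on a non-empty directory instead of deleting data.
	return os.Remove(targetPath)
}
```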
Footnotes
1. https://github.com/kubernetes/kubernetes/blob/cf480a3a1a9cb22f3439c0a7922822d9f67f31b5/pkg/volume/csi/csi_plugin.go#L54
2. https://github.com/container-storage-interface/spec/blob/master/spec.md#error-scheme