How to reproduce this bug?
In a multi-node cluster where a backup is in progress, if one of the nodes restarts for any reason, the backup process gets stuck and its status is reported as STARTED indefinitely (it never completes).
To reproduce this issue you can leverage the weaviatest and weaviate-local-k8s (local-k8s.sh) tools. With a multi-node cluster up and running:
Create a collection and add a large number of objects (so that the backup takes longer to finish):
docker run weaviatest create collection --collection TestCollection
docker run weaviatest create data --collection TestCollection --limit 200000 --randomize
Now, open two terminals:
In the first terminal, start a backup (do not wait for it to finish):
weaviatest create backup --backup_id first_backup
In the other terminal, a few seconds after the backup has started, delete one of the pods:
kubectl delete pod weaviate-2 -n weaviate
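The pod is recreated by the StatefulSet; to confirm it comes back up, you can watch the pods (a standard kubectl check, not specific to this reproduction):
kubectl get pods -n weaviate -w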
Then wait for the backup to finish, periodically checking its state:
curl --request GET \
--url http://localhost:8080/v1/backups/s3/first_backup
{"backend":"s3","id":"first_backup","path":"s3://weaviate-backups/first_backup","status":"STARTED"}
Once the node comes back, we wait for the backup to finish. However, that never happens: the status keeps showing STARTED for several hours (under normal circumstances this backup usually finishes in a few seconds):
curl --request GET \
--url http://localhost:8080/v1/backups/s3/first_backup
{"backend":"s3","id":"first_backup","path":"s3://weaviate-backups/first_backup","status":"STARTED"}
We can also observe that the restarted node is not present in the S3 storage dump:
Every 2.0s: kubectl exec minio -n weaviate -- mc ls data/weaviate-backups/first_backup Joses-MacBook-Pro.local: Fri Oct 11 19:42:22 2024
[2024-10-11 17:38:59 UTC] 4.0KiB backup_config.json/
[2024-10-11 17:39:38 UTC] 4.0KiB weaviate-0/
[2024-10-11 17:39:39 UTC] 4.0KiB weaviate-1/
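A quick way to see the mismatch is to compare the Weaviate pods in the cluster with the node folders that made it into the bucket (the label selector app=weaviate is an assumption based on the default Helm chart; adjust it to your deployment):
kubectl get pods -n weaviate -l app=weaviate -o name
kubectl exec minio -n weaviate -- mc ls data/weaviate-backups/first_backup
In the failing run, weaviate-2 (the restarted pod) is back and Ready, yet it never appears in the bucket listing.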
On restart, the node should detect that a backup was in progress and resume it from where it left off; otherwise the backup hangs forever.
What is the expected behavior?
The backup process is able to resume after a node's restart.
What is the actual behavior?
If one of the nodes restarts after a backup has started, the backup hangs.
Supporting information
No response
Server Version
1.26.6
Code of Conduct

I'm experiencing the same scenario with the docker-compose variant with a single node on Amazon Linux (in my case the Docker process gets restarted due to insufficient resources).