
Backup process gets stuck after node restart #5986

Open

jfrancoa opened this issue Oct 11, 2024 · 1 comment
jfrancoa commented Oct 11, 2024

How to reproduce this bug?

In a multi-node cluster where a backup has been started, if one of the nodes restarts for whatever reason, the backup process gets stuck: its status is reported as STARTED indefinitely and it never completes.

To reproduce this issue, you can leverage the weaviatest and weaviate-local-k8s tools.

  • First, start a three-node cluster using local-k8s.sh:
WORKERS=3 REPLICAS=3 WEAVIATE_VERSION="1.26.6" OBSERVABILITY=true ENABLE_BACKUP=true ./local-k8s.sh setup
  • Create a collection and add a large number of objects (so that the backup takes longer to finish):
docker run weaviatest create collection --collection TestCollection
docker run weaviatest create data --collection TestCollection --limit 200000 --randomize
  • Now, open two terminals:

    • In the first terminal, start a backup (do not wait for it to finish):
     weaviatest create backup --backup_id first_backup
    
    • In the other terminal, a few seconds after the backup has started, we will delete one of the pods:
     kubectl delete pod weaviate-2 -n weaviate
    
    • And we will wait for the backup to finish, checking its state:
     curl --request GET \
    --url http://localhost:8080/v1/backups/s3/first_backup
    {"backend":"s3","id":"first_backup","path":"s3://weaviate-backups/first_backup","status":"STARTED"}
    
  • Once the node comes back, we wait for the backup to finish. However, that never occurs: the status keeps displaying STARTED for several hours, while this backup usually takes a few seconds to finish under normal circumstances (see the polling sketch after this list):

curl --request GET \
  --url http://localhost:8080/v1/backups/s3/first_backup
{"backend":"s3","id":"first_backup","path":"s3://weaviate-backups/first_backup","status":"STARTED"}
  • And we can observe that the restarted node is not present in the S3 storage dump:
Every 2.0s: kubectl exec minio -n weaviate -- mc ls data/weaviate-backups/first_backup

[2024-10-11 17:38:59 UTC] 4.0KiB backup_config.json/
[2024-10-11 17:39:38 UTC] 4.0KiB weaviate-0/
[2024-10-11 17:39:39 UTC] 4.0KiB weaviate-1/
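
To make the stall easier to spot than re-running curl by hand, the check can be scripted against the same /v1/backups/{backend}/{id} endpoint used above. A minimal polling sketch follows; it is not part of weaviatest, and the timeout value, the TRANSFERRING in-progress state, and SUCCESS/FAILED as terminal states are assumptions on top of what the responses above show:

# Hypothetical helper, not part of weaviatest: poll the backup status
# endpoint shown above until it leaves the in-progress states, or give
# up after a timeout.
import time
import requests

BASE_URL = "http://localhost:8080"  # same endpoint as the curl calls above
BACKEND = "s3"
BACKUP_ID = "first_backup"
TIMEOUT_S = 600        # generous; this backup normally finishes in seconds
POLL_EVERY_S = 5

deadline = time.time() + TIMEOUT_S
while time.time() < deadline:
    resp = requests.get(f"{BASE_URL}/v1/backups/{BACKEND}/{BACKUP_ID}")
    resp.raise_for_status()
    status = resp.json().get("status")
    print(f"{BACKUP_ID}: {status}")
    if status not in ("STARTED", "TRANSFERRING"):  # assumed terminal otherwise
        break
    time.sleep(POLL_EVERY_S)
else:
    print(f"still not finished after {TIMEOUT_S}s - this is the hang described here")

On a healthy cluster this should exit within seconds; after the pod deletion above it keeps printing STARTED until the timeout.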

The restart of the node should also detect that a backup was ongoing and resume from where it was before the restart; otherwise, the backup hangs forever.
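
For illustration only, here is a hypothetical sketch of that recovery path (written in Python for brevity; Weaviate itself is written in Go, and none of these function or field names exist in its codebase). The idea is that on startup a node could read the backup descriptor already persisted to the backend (the backup_config.json visible in the listing above) and either resume its share of the work or mark the backup failed, so the coordinator stops waiting:

# Hypothetical pseudocode - none of these helpers exist in Weaviate.
def on_node_startup(backend_store, node_name):
    # backup_config.json is the descriptor visible in the S3 listing above;
    # reading it on boot would reveal a backup that was still in flight.
    config = backend_store.read_json("backup_config.json")
    if config is None or config.get("status") != "STARTED":
        return  # nothing in flight, normal startup
    if node_name not in config.get("completed_nodes", []):  # assumed field
        try:
            resume_local_backup(config["id"], node_name)      # redo this node's part
        except Exception as err:
            mark_backup_failed(config["id"], node_name, err)  # terminal state, not a hang

Either outcome (resume or explicit failure) would be better than the current behavior, where the status stays STARTED forever.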

What is the expected behavior?

The backup process is able to resume after a node restart.

What is the actual behavior?

If one of the nodes restarts after a backup has started, the backup will hang.

Supporting information

No response

Server Version

1.26.6

jfrancoa added the bug label Oct 11, 2024

esaday commented Oct 17, 2024

I'm experiencing the same scenario with the docker-compose variant with a single node on Amazon Linux (in my case, the docker process gets restarted due to insufficient resources).
