
Backup process gets stuck after node restart #5986

Open

jfrancoa opened this issue Oct 11, 2024 · 1 comment
jfrancoa commented Oct 11, 2024

How to reproduce this bug?

In a multi-node cluster where a backup has been started, if one of the nodes restarts for whatever reason, the backup process gets stuck: its status is reported as STARTED indefinitely and it never completes.

To reproduce this issue, you can leverage the weaviatest and weaviate-local-k8s tools.

  • First, start a three-node cluster using local-k8s.sh:
WORKERS=3 REPLICAS=3 WEAVIATE_VERSION="1.26.6" OBSERVABILITY=true ENABLE_BACKUP=true ./local-k8s.sh setup
  • Create a collection and add a large number of objects (so that the backup takes longer to finish):
docker run weaviatest create collection --collection TestCollection
docker run weaviatest create data --collection TestCollection --limit 200000 --randomize
  • Now, open two terminals:

    • In the first terminal, start a backup (do not wait for it to finish):
     weaviatest create backup --backup_id first_backup
    
    • In the other terminal, a few seconds after the backup has started, we will delete one of the pods:
     kubectl delete pod weaviate-2 -n weaviate
    
    • And we will wait for the backup to finish, checking its state:
     curl --request GET \
    --url http://localhost:8080/v1/backups/s3/first_backup
    {"backend":"s3","id":"first_backup","path":"s3://weaviate-backups/first_backup","status":"STARTED"}
    
  • Once the node comes back, we wait for the backup to finish. However, that never occurs: the status keeps displaying STARTED for several hours, while this backup usually takes a few seconds to finish under normal circumstances (see the polling sketch after this list):

curl --request GET \
  --url http://localhost:8080/v1/backups/s3/first_backup
{"backend":"s3","id":"first_backup","path":"s3://weaviate-backups/first_backup","status":"STARTED"}
  • And we can observe that the restarted node is not present in the S3 storage dump:
Every 2.0s: kubectl exec minio -n weaviate -- mc ls data/weaviate-backups/first_backup

[2024-10-11 17:38:59 UTC] 4.0KiB backup_config.json/
[2024-10-11 17:39:38 UTC] 4.0KiB weaviate-0/
[2024-10-11 17:39:39 UTC] 4.0KiB weaviate-1/
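
To make the stall easier to spot than re-running curl by hand, the check can be scripted against the same /v1/backups/{backend}/{id} endpoint used above. A minimal polling sketch follows; it is not part of weaviatest, and the timeout value, the TRANSFERRING in-progress state, and SUCCESS/FAILED as terminal states are assumptions on top of what the responses above show:

# Hypothetical helper, not part of weaviatest: poll the backup status
# endpoint shown above until it leaves the in-progress states, or give
# up after a timeout.
import time
import requests

BASE_URL = "http://localhost:8080"  # same endpoint as the curl calls above
BACKEND = "s3"
BACKUP_ID = "first_backup"
TIMEOUT_S = 600        # generous; this backup normally finishes in seconds
POLL_EVERY_S = 5

deadline = time.time() + TIMEOUT_S
while time.time() < deadline:
    resp = requests.get(f"{BASE_URL}/v1/backups/{BACKEND}/{BACKUP_ID}")
    resp.raise_for_status()
    status = resp.json().get("status")
    print(f"{BACKUP_ID}: {status}")
    if status not in ("STARTED", "TRANSFERRING"):  # assumed terminal otherwise
        break
    time.sleep(POLL_EVERY_S)
else:
    print(f"still not finished after {TIMEOUT_S}s - this is the hang described here")

On a healthy cluster this should exit within seconds; after the pod deletion above it keeps printing STARTED until the timeout.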

The restart of the node should also detect that a backup was ongoing and resume from where it was before the restart; otherwise, the backup hangs forever.
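
For illustration only, here is a hypothetical sketch of that recovery path (written in Python for brevity; Weaviate itself is written in Go, and none of these function or field names exist in its codebase). The idea is that on startup a node could read the backup descriptor already persisted to the backend (the backup_config.json visible in the listing above) and either resume its share of the work or mark the backup failed, so the coordinator stops waiting:

# Hypothetical pseudocode - none of these helpers exist in Weaviate.
def on_node_startup(backend_store, node_name):
    # backup_config.json is the descriptor visible in the S3 listing above;
    # reading it on boot would reveal a backup that was still in flight.
    config = backend_store.read_json("backup_config.json")
    if config is None or config.get("status") != "STARTED":
        return  # nothing in flight, normal startup
    if node_name not in config.get("completed_nodes", []):  # assumed field
        try:
            resume_local_backup(config["id"], node_name)      # redo this node's part
        except Exception as err:
            mark_backup_failed(config["id"], node_name, err)  # terminal state, not a hang

Either outcome (resume or explicit failure) would be better than the current behavior, where the status stays STARTED forever.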

What is the expected behavior?

The backup process is able to resume after a node restart.

What is the actual behavior?

If one of the nodes restarts after a backup has started, the backup will hang.

Supporting information

No response

Server Version

1.26.6

jfrancoa added the bug label Oct 11, 2024

esaday commented Oct 17, 2024

I'm experiencing the same scenario with the docker-compose variant with a single node on Amazon Linux (in my case, the docker process gets restarted due to insufficient resources).
