Baggage claim not removing containers/volumes

Bug Report

Baggage claim does not remove dangling Garden containers or btrfs/overlay2 volumes when a worker is retired. As a result, the containers recorded in the database no longer match the fly workers output, and the storage held by the volumes is never released.

Steps to Reproduce

Retire the worker, then check the contents of the following paths:

  • Garden container paths under <worker_base_dir>/depot/ remain; the containers keep running and relaunch on reboot, leading to a discrepancy between the database and fly workers.
  • btrfs volumes under <worker_base_dir>/volumes/live/ persist through reboot and consume storage while not being reflected in the database.
  • overlay2 volumes under <worker_base_dir>/overlays/ and <worker_base_dir>/volumes/live/ persist through reboot and consume storage while not being reflected in the database.
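
The check above can be scripted. What follows is only a minimal sketch: the function name count_leftovers is hypothetical, the worker base directory is passed in as an argument, and the three subdirectories are the ones named in the steps above. It counts the immediate children of each directory, on the assumption that each child corresponds to one container or volume.

```shell
#!/bin/sh
# Hypothetical helper (not part of Concourse): count entries left behind
# under a retired worker's directories.
# $1 is the worker base directory; this is an assumption, adjust as needed.
count_leftovers() {
    base="$1"
    for sub in depot volumes/live overlays; do
        dir="$base/$sub"
        if [ -d "$dir" ]; then
            # Count immediate children only.
            n=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' ')
        else
            # A missing directory is reported as empty.
            n=0
        fi
        echo "$sub $n"
    done
}
```

Running count_leftovers /opt/concourse/worker after the worker has been retired should print 0 for every directory if cleanup worked; any non-zero count points at leftovers that the database no longer tracks.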

Expected Results

For Garden, since Concourse is the only service orchestrating and using it, Concourse should be authoritative and remove all container references from the worker filesystem as part of garbage collection. Volumes should be unmounted, and their mount paths and corresponding files removed, before they are marked destroyed in the database.

Actual Results

Large numbers of containers and volumes/mounts persist on hosts after workers are retired or reconfigured, leading to Garden subnet exhaustion and to all storage being claimed by unused volumes. These containers and volumes are not tracked in the database, so they do not show up in the worker metrics collected from the database, and they cause unexpected failures.

Additional Context

Temporary workaround:

# retire worker and wait for services to stop successfully

# cleanup garden containers
# the [g] keeps awk from matching its own entry in the process list
sudo kill -9 $(ps aux | awk '/[g]arden-init/ {print $2}')
sudo rm -rf /opt/concourse/worker/depot \
            /opt/concourse/worker/garden-properties.json

# unmount and remove btrfs img
sudo umount -fl /opt/concourse/worker/volumes
sudo sync
# /dev/loop0 is specific to this host; confirm the device with `losetup -a`
sudo losetup -d /dev/loop0
sudo rm -rf /opt/concourse/worker/volumes.img

# unmount and remove overlay2 mounts
sudo umount -flR /opt/concourse/worker/overlay /opt/concourse/worker/volumes
sudo sync
sudo rm -rf /opt/concourse/worker/overlay /opt/concourse/worker/volumes

# restart worker
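
The workaround hard-codes /dev/loop0, which may not be the device backing volumes.img on another host. As a hedged sketch, the device can be looked up with losetup -j instead. The helper names below are hypothetical, and the parsing assumes the usual util-linux output format of one line per device, such as "/dev/loop0: [2049]:1234 (/opt/concourse/worker/volumes.img)".

```shell
#!/bin/sh
# Hypothetical helper: read `losetup -j <file>` output on stdin and print
# only the loop device names (the text before the first colon on each line).
loop_devices_from() {
    cut -d: -f1
}

# Hypothetical helper: detach every loop device backed by the given image.
# Needs root; $1 is the image path, e.g. /opt/concourse/worker/volumes.img.
detach_backing_loops() {
    img="$1"
    losetup -j "$img" | loop_devices_from | while read -r dev; do
        losetup -d "$dev"
    done
}
```

With these defined, the losetup step of the workaround could become a call to detach_backing_loops /opt/concourse/worker/volumes.img run as root, made after the umount and before the image is removed.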

Version Info

  • Concourse version: 5.3.0
  • Deployment type (BOSH/Docker/binary): binary
  • Infrastructure/IaaS: N/A
  • Browser (if applicable): N/A
  • Did this used to work? not that we are aware of
concourse/concourse

Answer from vito

@mstansberry Landing won't remove anything; it'll drain, waiting for any in-flight work to finish, and shut down while leaving all of the containers and volumes there. Most importantly, they'll remain in the database as well. The worker can then come back and the web node will continue to use the remaining containers, re-attaching to in-flight processes if necessary. The volumes will still be around too.

Landing is the workflow for doing an in-place upgrade or restart of a worker without removing it. Retiring is the workflow for decommissioning a worker machine for good.

For handling a crash/reboot, we're working on a related issue (https://github.com/concourse/concourse/issues/4264) which may be affecting you if you're using the overlay volume driver.

Alex Suraci vito @vmware Toronto, ON @concourse co-creator, pm, engineer