If you are wondering where the data on this site comes from, please visit https://api.github.com/users/dany74q/events. GitMemory does not store any data; it only uses NGINX to cache data for a period of time. The idea behind GitMemory is simply to give users a better reading experience.
Danny Shemesh dany74q @wiz-sec Israel https://www.linkedin.com/in/dany74q Software is my passion.

dany74q/microsoft-teams-rtl-runner 12

Microsoft Teams RTL runner, written in Python; supports Windows & Mac.

dany74q/FreeMock 9

Free function mocker

dany74q/MagnesDbg 4

A simple debugger/detourer for stuff I work on

dany74q/Cpp-Modules-VS15-IDE-Support 2

A naive and simple integration of C++ modules into VS15 Update 1.

dany74q/cppxy-dir-walk 1

Directory walking using C++17 experimental features - filesystem & yield

dany74q/keyvaultlib 1

A KeyVault client wrapper that helps transition between using ADAL (Active Directory Authentication Libraries) and MSI (Managed Service Identity) as a token provider

dany74q/apache-ignite-log-pretty-printer-vs-code-ext 0

Apache Ignite log pretty printer VSCode extension

dany74q/autoscaler 0

Autoscaling components for Kubernetes

dany74q/Awesome-Python-3 0

Python 3 w/ Print statement support

created tag wiz-sec/grammes

tag v1.2.13-write-buffer-resizing

A Go package built to communicate with Apache TinkerPop™ Graph computing framework using Gremlin; a graph traversal language used by graph databases such as JanusGraph®, MS Cosmos DB, AWS Neptune, and DataStax® Enterprise Graph.

created time in a month

push event wiz-sec/grammes

Danny Shemesh

commit sha d6effb09050be939315bd73fbd66567ffa33c671

Fixed tests and added circleci config

view details

push time in a month

delete branch wiz-sec/grammes

delete branch : danny/wz-3881-dynamically-growing-websocket-outbound-buffer

delete time in a month

PR merged wiz-sec/grammes

Reviewers
Added write buffer resizing option

configuration.go

  • Added WithWriteBufferResizing

go.mod

  • Replaced websocket with our fork

websocket.go, websocket_test.go

  • Added ability to control the write buffer resizing flag
  • Added test case
+55 -24

0 comment

6 changed files

dany74q

pr closed time in a month

push event wiz-sec/grammes

Danny Shemesh

commit sha d78de3ebbd5b87d3dbb8f85543fafe926b62d45d

Added write buffer resizing option (#19) configuration.go - Added WithWriteBufferResizing go.mod - Replaced websocket with our fork websocket.go, websocket_test.go - Added ability to control the write buffer resizing flag - Added test case

view details

push time in a month

PR opened wiz-sec/grammes

Added write buffer resizing option

configuration.go

  • Added WithWriteBufferResizing

go.mod

  • Replaced websocket with our fork

websocket.go, websocket_test.go

  • Added ability to control the write buffer resizing flag
  • Added test case
+55 -24

0 comment

6 changed files

pr created time in a month

pull request comment kubernetes/autoscaler

Added node readiness grace time & node-info cache expiration

Hey @MaciekPytel, thanks for the elaborate response !

Following up on what you said, I was wondering -

CA is built on the assumption that nodes don't change over time, and we don't support adding or removing taints or labels once the node has started up (it may work in some cases, but there are no promises there).

I've seen this mentioned in the docs, but how does this reconcile with the fact that on each scale-up cycle - we override the cache with the first node-from-node-group returned from k8s (and this behavior is cloud-agnostic) ?

Overriding cache entries on each scale-up attempt effectively updates the returned node info object, thus making the returned template dynamic in nature, meaning that if labels or taints are added / removed at runtime (on that first returned node from the given node group) - the returned node-info will always reflect that.

This brings me back to the point where any node-template modifications made at runtime - are in fact reflected and returned from the cache, which might break autoscaling - even permanently (until restart), if the node group just scaled down to zero - as the cache could now be in a state where it holds an incorrect view from the previously running node.

Putting that aside - I like the idea of using the ignore taint - this definitely mitigates the need for a grace period; one could also dynamically set the taint on any custom teardown logic (if labels are removed dynamically), so that the cache would not be overridden with the torn-down node-info.

What I still think could be beneficial to address in this PR, though, is the cache-overriding behavior described above; I'm not sure it's entirely aligned with the notion that CA doesn't support dynamic node modifications; maybe we should keep the first record instead of continuously updating the cache, or alternatively - explain that a bit better in the docs ?

Thanks again !

dany74q

comment created time in a month
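The 'keep the first record' alternative floated in the comment above boils down to an insert-if-absent guard when populating the node-info cache. The cluster autoscaler itself is written in Go; the Python sketch below is only a language-agnostic illustration of the idea, and every name in it (node_info_cache, build_node_info, node_group_id) is made up for the example.

node_info_cache = {}  # node_group_id -> node info template, a plain dict for illustration

def build_node_info(k8s_node):
    # Placeholder: the real autoscaler builds a schedulable node template
    # (labels, taints, capacity) out of the live Kubernetes node object.
    return {"labels": dict(k8s_node.get("labels", {}))}

def cache_node_info_last_write_wins(node_group_id, k8s_node):
    # Current behavior described above: the newest k8s-supplied node always overrides the entry.
    node_info_cache[node_group_id] = build_node_info(k8s_node)

def cache_node_info_keep_first(node_group_id, k8s_node):
    # Suggested alternative: keep the first record, so a later partial view can't corrupt it.
    node_info_cache.setdefault(node_group_id, build_node_info(k8s_node))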

PR opened kubernetes/autoscaler

Added node readiness grace time & node-info cache expiration

Fixes https://github.com/kubernetes/autoscaler/issues/3802, https://github.com/kubernetes/autoscaler/issues/4052 and potentially some other scale-up-from-0 scenarios (a-la https://github.com/kubernetes/autoscaler/issues/3780).

Quick context - in order to scale up a specific node group, the cluster autoscaler needs to be able to do the following:

  a. Fetch all available node groups of a given cluster, by querying the cloud provider
  b. Build a virtual Kubernetes Node representation for each such node group (i.e. - how a node from such a node group would look once alive)
  c. Match pending pods against every such representation, and scale up the matching node group

The autoscaler currently builds such node representations by first looking at alive Kubernetes nodes in the cluster - and only if a node group has no alive Kubernetes node (when it's scaled down to 0) would the autoscaler query the cloud provider and build a node template from the tags/labels/capacity the cloud provider returns; that is because the autoscaler works under the premise that alive nodes might hold extra information not available in the provider (for instance, additional labels which are added at runtime).

Keeping a solid node-template view in memory for node groups which scale down to zero and back up is a bit tricky - the current implementation opts for the following:

  a. On each autoscaling cycle, all visible & ready Kubernetes nodes are cached to a non-expiring cache
  b. If a node was already cached, it's overridden
  c. The first Kubernetes-supplied node from a given node group is chosen to represent nodes in that group

This could potentially lead to the following race conditions:

  1. Stale reads - when a node group scales back to zero, the autoscaler would continue using the last surviving Kubernetes node representation of that group available in cache (as entries don't expire); this could be problematic on two fronts:
     1.1. Underlying node group changes would not be reflected until the autoscaler restarts (e.g. new labels / tags are added, or the instance type is changed)
     1.2. When the last surviving cached node had a partial view of how a node from the given group should look - for instance, if there is some runtime un-labeling going on pre-termination - the autoscaler might deem this group not fit for pods it can, in fact, host

  2. Partial view during node initialization - there are several workloads which modify the node's attributes at runtime; consider, for instance, a Daemonset which installs an agent or a kernel module on a given node, and then labels it ready to accept workload. One might purposefully 'trick' the autoscaler into accepting such node groups for scheduling by adding node-template tags to the underlying node group, thus having their desired view of a node in their cloud provider, which may take several minutes to be reflected in the Kubernetes nodes themselves.

In such and similar circumstances, what may happen in case of a quick ramp-up of nodes from the given group is that Kubernetes might continuously return nodes which are still initializing as the first representative of the node group; this would cause the autoscaler to have a prolonged undesired view of its node templates, until a node which is fully ready has stabilized.

I propose the following two additions to the autoscaling loop, which I believe would eliminate such race conditions:

  1. Per @lsowen's suggestion - We can introduce a slight grace period for the use of Kubernetes nodes as a node-group representative until they've hit a point we'd consider fully initialized; This means that for the first n minutes, we'd use the cloud provider's representation, or a more mature node from the node group as a representative.

  2. Have the node info cache expire stale entries for node groups which scaled down to zero

This PR introduces the above two, with what I believe would be sensible defaults - a 5 minute node-readiness grace period, and a 10 minute cache expiration time.

I think that it'd be worthwhile to default to non-zero values (meaning - to have the grace period and expiration on by default), as it could benefit managed autoscaler cloud provider solutions and new users without further tinkering with the parameters, while having a relatively minimal impact (a few more cloud provider API hits) on cases where scaling to zero isn't utilized.

Thanks !

autoscaling_options.go, main.go

  • Added NodeReadinessGraceTime, which gives Kubernetes nodes an extra grace period before basing node templates on them
  • Added the cache-expiration option (node-cache-expiration-time), which controls expiration for the k8s node group -> node info cache

FAQ.md

  • Documented node-readiness-grace-time, node-cache-expiration-time and their usefulness

node_info_cache.go

  • Introduced a small expiring cache wrapper for node info entries

scale_up_test.go, utils_test.go

  • Fixed tests based on new GetNodeInfosForGroups signature
  • Introduced sanity tests for the node grace time & cache expiration

static_autoscaler.go

  • Will now use the new NodeInfoCache
  • Will now pass the NodeReadinessGraceTime

utils.go

  • Node info cache will now only populate nodes that are ready for a given readiness grace time
+240 -33

0 comment

9 changed files

pr created time in a month
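For illustration, here is a minimal sketch of the two mechanisms proposed in the PR above - a TTL on cached node-group templates and a readiness grace period before a live node is trusted as a group representative. The actual change is Go (node_info_cache.go); this Python version only mirrors the logic, the 5- and 10-minute values come from the description above, and all identifiers are hypothetical.

import time

NODE_READINESS_GRACE = 5 * 60   # seconds a node must be Ready before it may represent its group
CACHE_EXPIRATION = 10 * 60      # seconds a cached template survives without being refreshed

class ExpiringNodeInfoCache:
    def __init__(self):
        self._entries = {}  # node_group_id -> (node_info, cached_at)

    def set(self, node_group_id, node_info):
        self._entries[node_group_id] = (node_info, time.monotonic())

    def get(self, node_group_id):
        entry = self._entries.get(node_group_id)
        if entry is None:
            return None
        node_info, cached_at = entry
        if time.monotonic() - cached_at > CACHE_EXPIRATION:
            # Entry went stale (e.g. the group scaled to zero a while ago);
            # drop it and let the caller fall back to the cloud-provider template.
            del self._entries[node_group_id]
            return None
        return node_info

def is_mature_enough(node_ready_since):
    # Only nodes that have been Ready for the full grace period are used as representatives.
    return time.monotonic() - node_ready_since >= NODE_READINESS_GRACE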

push event dany74q/autoscaler

Michael Cristina

commit sha 4cf9a9867916a69f99c8b356f6d805bf25f76eb0

Release leader election lock on shutdown

view details

Michael McCune

commit sha a24ea6c66b5cc61bbaf834899e4d6e58453f78db

add cluster cores and memory bytes count metrics This change adds 4 metrics that can be used to monitor the minimum and maximum limits for CPU and memory, as well as the current counts in cores and bytes, respectively. The four metrics added are: * `cluster_autoscaler_cpu_limits_cores` * `cluster_autoscaler_cluster_cpu_current_cores` * `cluster_autoscaler_memory_limits_bytes` * `cluster_autoscaler_cluster_memory_current_bytes` This change also adds the `max_cores_total` metric to the metrics proposal doc, as it was previously not recorded there. User story: As a cluster autoscaler user, I would like to monitor my cluster through metrics to determine when the cluster is nearing its limits for cores and memory usage.

view details

Thomas George Hartland

commit sha d103b70447c12fd53b9467d46e3fa47919d6f837

Enable magnum provider scale to zero Now supported by magnum. https://review.opendev.org/c/openstack/magnum/+/737580/ If using node group autodiscovery, older versions of magnum will still forbid scaling to zero or setting the minimum node count to zero.

view details

Lê Minh Quân

commit sha 71353a6dd47f88c0c231b94710ba2a3e2ede9414

Fix/dependencies

view details

Lê Minh Quân

commit sha b57ba6edb7672d10014b75893adee4c2f384ad39

Fix/Provider name

view details

Lê Minh Quân

commit sha 90bd1ebadb1bd58f359920457abd2803a3ad2de7

Fix/Add bizflycloud package in skipped_dirs

view details

Lê Minh Quân

commit sha dd8005d919eaaf9e80c34905918fe081c8c05dd6

Add Bizfly Cloud provider to README

view details

Lê Minh Quân

commit sha e95938525793a9c6d6a813eb7ae16488b1a64087

Update license for Bizfly Cloud dependencies

view details

Alexey

commit sha ec2676b36686471c6da5db588c7b519133e300ec

add required api resources to hetzner cluster-autoscaler example

view details

Benjamin Pineau

commit sha 037dc7367aba21bb2f593be09d85a7c99d0f894d

Don't pile up successive full refreshes during AWS scaledowns Force refreshing everything at every DeleteNodes calls causes slow down and throttling on large clusters with many ASGs (and lot of activity). That function might be called several times in a row during scale-down (once for each ASG having a node to be removed). Each time the forced refresh will re-discover all ASGs, all LaunchConfigurations, then re-list all instances from discovered ASGs. That immediate refresh isn't required anyway, as the cache's DeleteInstances concrete implementation will decrement the nodegroup size, and we can schedule a grouped refresh for the next loop iteration.

view details

Benjamin Pineau

commit sha 3ffe4b3557d716eddc540336adf670503f66549f

aws: support arm64 instances Sets the `kubernetes.io/arch` (and legacy `beta.kubernetes.io/arch`) to the proper instance architecture. While at it, re-gen the instance types list (adding new instance types that were missing)

view details

Jakub Tużnik

commit sha 249a7287ab0c24d5a1ae6ddf8f850ef39ea7ee14

Cluster Autoscaler: remove vivekbagade, add towca as an approver in OWNERS

view details

Sylvain Rabot

commit sha 535a21263e77d92778043cd39d1c4f2845748424

Improve misleading log Signed-off-by: Sylvain Rabot <sylvain@abstraction.fr>

view details

Jakub Tużnik

commit sha a15d9944f9dd8f88d8bff1c41baa4855413d2bfb

Cluster Autoscaler GCE: change the format of MIG id The current implementation assumes MIG ids have the "https://content.googleapis.com" prefix, while the canonical id format seems to begin with "https://www.googleapis.com". Both formats work while talking to the GCE API, but the API returns the latter and other GCP services seem to assume it as well.

view details

Kubernetes Prow Robot

commit sha 3c280300f9ff6de548822ea220b47a6696b8f630

Merge pull request #4047 from towca/jtuznik/mig-id Cluster Autoscaler GCE: change the format of MIG id

view details

Alastair Firth

commit sha 29280eebd6766aa67ab7a5237ae59f3b3d71fa31

Add example to AWS readme if taint has value

view details

Kubernetes Prow Robot

commit sha 1330ab1e5aeac18831e956119a74b4e384930cea

Merge pull request #4009 from bizflycloud/bizflycloud/bizflycloud-provider cloudprovider: add Bizflycloud provider

view details

Kubernetes Prow Robot

commit sha 89b237346fdae429d4a4e996682a53a6a817f035

Merge pull request #4040 from towca/jtuznik/owner Remove vivekbagade, add towca as an approver in cluster-autoscaler/OWNERS

view details

Kubernetes Prow Robot

commit sha 35b8e300b241e5bb5c52813492a1e4bf7fd19383

Merge pull request #3995 from tghartland/magnum-scale-to-zero Enable magnum provider scale to zero

view details

Kubernetes Prow Robot

commit sha 6c4101b64ccfc64e1dc384249d21a759e6b9fdfa

Merge pull request #3797 from DataDog/aws-not-refreshes-dogpiles aws: Don't pile up successive full refreshes during AWS scaledowns

view details

push time in a month

create branch dany74q/autoscaler

branch : get-node-infos-for-groups-improvements

created branch time in a month

fork dany74q/autoscaler

Autoscaling components for Kubernetes

fork in a month

issue comment kubernetes/autoscaler

Cluster Autoscaler does not start new nodes when Taints and NodeSelector are used in EKS

@lsowen - I thought that might be the case, but the cache is not probed at that point at all, the result there is purely local to the function and is being re-calculated on every call, correct me if I'm wrong ofc.

Thanks !

dschunack

comment created time in a month

issue comment kubernetes/autoscaler

Cluster Autoscaler does not start new nodes when Taints and NodeSelector are used in EKS

@lsowen - Thanks ! I've seen the patch - the thing I don't fully understand about it though, is why the continuous overriding of the cache entries does not resolve this on its own after a period of time, if indeed the problematic cache entry is that initial one ?

GetNodeInfosForGroups is called on every scale attempt, and from the code it looks like the cache is always overridden with the latest k8s-supplied node object: https://github.com/lsowen/autoscaler/blob/5f5e0be76c99504cd20b7019c7e3694cfc5ec79d/cluster-autoscaler/core/utils/utils.go#L96-L100

What I would've expected in your case, then, is that once the Node had stabilized with all the correct labels, the cache entry would've eventually been overridden - and unless it was cached on its way down with a partial label list, a well-formed Node should've been returned from the cache, and not the first partial view.

Do you see a flow in the code in which that first invalid entry would've been cached, with newer entries never overriding it (in case it's still up in the next autoscale cycle) ?

dschunack

comment created time in a month

issue comment kubernetes/autoscaler

Cluster Autoscaler does not start new nodes when Taints and NodeSelector are used in EKS

I've been experiencing similar symptoms to what's described here.

@lsowen - I think the race is a tad bit more specific - from what I see at least, it seems like the cache is populated (and existing entries overridden) from the k8s api server on every autoscaling attempt;

I believe the flow is the following:

  • We take all relevant nodes from the k8s api server - this uses a ListWatcher behind the scenes, which watches for Node changes from the k8s api server, and also resyncs the entire node list every hour; if the watch operation does not consistently fail - I believe that one would get a relatively up-to-date view of the nodes in the cluster with this on each invocation.

  • With the k8s supplied nodes at hand, we cache the node info of the first seen node for any cloud-provider node group on each iteration; different invocation might cache info from different nodes within the group, depending on the lister result; this means that if you have several nodes within your group, but one of them is off-sync with its labels - it might corrupt the autoscaler view of the entire group.

  • After caching all node infos, we iterate on all node groups from the cloud provider - and then we use the previously populated cached view if such exists; I'd guess this is due to the autoscaler preferring the use of the real-world view of your nodes vs the template generated from the cloud provider, as they may be off-sync.

If the above is correct, what I believe needs to happen in order to trigger such a race condition is that the labels have to be off-sync at the last time the autoscaler saw a node from the k8s api server and cached its info - only then would the state be corrupted for the entirety of the next runs.

If we're operating under the premise that all of your node group nodes eventually do consist of all required labels, which are added at runtime - then, as long as there are alive nodes, the autoscaler state should be eventually consistent and it should work well in one of the next cycles (b/c it does override the cache entries on each cycle);

When it could indeed break, I believe, is at times where the group scales from 1->0, and when this soon-to-be-terminated node has a partial label list - potentially because it's removing labels before termination, or if it's terminating before it's fully provisioned; In that case, we would cache this partial view one last time before we no longer have any nodes for that given node group in the cluster - we then continue to use this corrupted view endlessly, as the cache entries aren't expiring.

Would you agree @lsowen ?

dschunack

comment created time in a month
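To make the 1 -> 0 failure mode described in the comments above concrete, here is a tiny Python toy (the autoscaler itself is Go, and all names here are invented): as long as healthy nodes keep showing up, last-write-wins keeps the cached view roughly correct, but if the last node cached before scale-down carried a partial label set, that partial view is served indefinitely because nothing expires it.

cache = {}  # node_group -> labels of the last node seen from that group

def autoscale_cycle(live_nodes_by_group):
    for group, nodes in live_nodes_by_group.items():
        if nodes:
            # First k8s-supplied node wins, overriding whatever was cached before.
            cache[group] = nodes[0]

autoscale_cycle({"spot-group": [{"workload-ready": "true"}]})  # healthy view cached
autoscale_cycle({"spot-group": [{}]})                          # node un-labeled on its way down
autoscale_cycle({"spot-group": []})                            # group scaled down to zero
print(cache["spot-group"])  # {} - the partial view now sticks around until the autoscaler restarts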

issue comment aws/containers-roadmap

[EKS] Managed Node Groups support for Bottlerocket

Came here for the native support of containerd (:

@mikestef9 - Any idea if this would be available in a few weeks, or is it more like a couple of months ? (you've mentioned it coming soon in the crd ticket)

Thanks a ton !

mikestef9

comment created time in 2 months

issue closed awslabs/amazon-eks-ami

Creating containerd flavored AMIs

What would you like to be added: The ability to toggle between dockerd and containerd as a CRI is awesome ! I think that it would make it more accessible if there were containerd-flavored pre-baked AMIs (i.e. amazon-eks-node-1.20-v20210720-containerd).

Why is this needed: This is especially useful when utilizing managed node groups, spawning them from IaC frameworks a-la terraform; in those cases, changing the user data of the default launch template is a bit hacky.

This is also useful in cases where third-party tools manage EKS clusters for other accounts - where if those lack the ec2:RunInstances action, they would not be able to patch the underlying ASG to use a new launch template version.

As this feature is probably not going to be exposed in the node group API in the near future, this might be a simple yet very helpful alternative.

Thanks !

closed time in 2 months

dany74q

issue opened awslabs/amazon-eks-ami

Creating containerd flavored AMIs

What would you like to be added: The ability to toggle between dockerd and containerd as a CRI is awesome ! I think that it would make it more accessible if there were containerd-flavored pre-baked AMIs (i.e. amazon-eks-node-1.20-v20210720-containerd).

Why is this needed: This is especially useful when utilizing managed node groups, spawning them from IaC frameworks a-la terraform; in those cases, changing the user data of the default launch template is a bit hacky.

This is also useful in cases where third-party tools manage EKS clusters for other accounts - where if those lack the ec2:RunInstances action, they would not be able to patch the underlying ASG to use a new launch template version.

As this feature is probably not going to be exposed in the node group API in the near future, this might be a simple yet very helpful alternative.

Thanks !

created time in 2 months

issue comment containerd/containerd

containerd-shim process isn't reaped for some killed containers

Hey @fuweid - The container shim mentioned here isn't running, but I did manage to find another container with the same state, here's the state.json:

{
    "id": "56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
    "init_process_pid": 6707,
    "init_process_start": 8486,
    "created": "2021-07-20T12:40:36.126738308Z",
    "config": {
        "no_pivot_root": false,
        "parent_death_signal": 0,
        "rootfs": "/var/lib/docker/overlay2/4882ba8996023a0374dcc3ebed51dac3d19d40f8f7e095e93310a8ceb83f98f0/merged",
        "umask": null,
        "readonlyfs": false,
        "rootPropagation": 0,
        "mounts": [
            {
                "source": "proc",
                "destination": "/proc",
                "device": "proc",
                "flags": 14,
                "propagation_flags": null,
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "tmpfs",
                "destination": "/dev",
                "device": "tmpfs",
                "flags": 16777218,
                "propagation_flags": null,
                "data": "mode=755,size=65536k",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "devpts",
                "destination": "/dev/pts",
                "device": "devpts",
                "flags": 10,
                "propagation_flags": null,
                "data": "newinstance,ptmxmode=0666,mode=0620,gid=5",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "sysfs",
                "destination": "/sys",
                "device": "sysfs",
                "flags": 15,
                "propagation_flags": null,
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "cgroup",
                "destination": "/sys/fs/cgroup",
                "device": "cgroup",
                "flags": 15,
                "propagation_flags": null,
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "mqueue",
                "destination": "/dev/mqueue",
                "device": "mqueue",
                "flags": 14,
                "propagation_flags": null,
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/kubelet/plugins/ebs.csi.aws.com",
                "destination": "/csi",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/kubelet/plugins_registry",
                "destination": "/registration",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/kubelet/pods/ec951463-d7bb-4c9a-b86c-df9f5360b563/containers/csi-node-driver-registrar/7a3fce3f",
                "destination": "/dev/termination-log",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/docker/containers/19c45783df5ea0410cd69352bd570ac6ea2f94fa43776801045cae094d659d9c/resolv.conf",
                "destination": "/etc/resolv.conf",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/docker/containers/19c45783df5ea0410cd69352bd570ac6ea2f94fa43776801045cae094d659d9c/hostname",
                "destination": "/etc/hostname",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/kubelet/pods/ec951463-d7bb-4c9a-b86c-df9f5360b563/etc-hosts",
                "destination": "/etc/hosts",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/docker/containers/19c45783df5ea0410cd69352bd570ac6ea2f94fa43776801045cae094d659d9c/mounts/shm",
                "destination": "/dev/shm",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/kubelet/pods/ec951463-d7bb-4c9a-b86c-df9f5360b563/volumes/kubernetes.io~secret/csi-node-sa-token-fqxtk",
                "destination": "/var/run/secrets/kubernetes.io/serviceaccount",
                "device": "bind",
                "flags": 20481,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            }
        ],
        "devices": [
            {
                "type": 99,
                "major": 1,
                "minor": 3,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/null",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "type": 99,
                "major": 1,
                "minor": 8,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/random",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "type": 99,
                "major": 1,
                "minor": 7,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/full",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "type": 99,
                "major": 5,
                "minor": 0,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/tty",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "type": 99,
                "major": 1,
                "minor": 5,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/zero",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "type": 99,
                "major": 1,
                "minor": 9,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/urandom",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            }
        ],
        "mount_label": "",
        "hostname": "",
        "namespaces": [
            {
                "type": "NEWNS",
                "path": ""
            },
            {
                "type": "NEWNET",
                "path": "/proc/6509/ns/net"
            },
            {
                "type": "NEWPID",
                "path": ""
            },
            {
                "type": "NEWIPC",
                "path": "/proc/6509/ns/ipc"
            }
        ],
        "capabilities": {
            "Bounding": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "Effective": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "Inheritable": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "Permitted": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "Ambient": null
        },
        "networks": null,
        "routes": null,
        "cgroups": {
            "path": "/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
            "scope_prefix": "",
            "Paths": null,
            "devices": [
                {
                    "type": 97,
                    "major": -1,
                    "minor": -1,
                    "permissions": "rwm",
                    "allow": false
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 5,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 3,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 9,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 8,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 5,
                    "minor": 0,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 5,
                    "minor": 1,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 10,
                    "minor": 229,
                    "permissions": "rwm",
                    "allow": false
                },
                {
                    "type": 99,
                    "major": -1,
                    "minor": -1,
                    "permissions": "m",
                    "allow": true
                },
                {
                    "type": 98,
                    "major": -1,
                    "minor": -1,
                    "permissions": "m",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 3,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 8,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 7,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 5,
                    "minor": 0,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 5,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 9,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 136,
                    "minor": -1,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 5,
                    "minor": 2,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 10,
                    "minor": 200,
                    "permissions": "rwm",
                    "allow": true
                }
            ],
            "memory": 0,
            "memory_reservation": 0,
            "memory_swap": 0,
            "kernel_memory": 0,
            "kernel_memory_tcp": 0,
            "cpu_shares": 2,
            "cpu_quota": 0,
            "cpu_period": 100000,
            "cpu_rt_quota": 0,
            "cpu_rt_period": 0,
            "cpuset_cpus": "",
            "cpuset_mems": "",
            "pids_limit": 0,
            "blkio_weight": 0,
            "blkio_leaf_weight": 0,
            "blkio_weight_device": null,
            "blkio_throttle_read_bps_device": null,
            "blkio_throttle_write_bps_device": null,
            "blkio_throttle_read_iops_device": null,
            "blkio_throttle_write_iops_device": null,
            "freezer": "",
            "hugetlb_limit": null,
            "oom_kill_disable": false,
            "memory_swappiness": null,
            "net_prio_ifpriomap": null,
            "net_cls_classid_u": 0,
            "cpu_weight": 1,
            "unified": null,
            "skip_devices": false
        },
        "oom_score_adj": 1000,
        "uid_mappings": null,
        "gid_mappings": null,
        "mask_paths": [
            "/proc/acpi",
            "/proc/kcore",
            "/proc/keys",
            "/proc/latency_stats",
            "/proc/timer_list",
            "/proc/timer_stats",
            "/proc/sched_debug",
            "/proc/scsi",
            "/sys/firmware"
        ],
        "readonly_paths": [
            "/proc/asound",
            "/proc/bus",
            "/proc/fs",
            "/proc/irq",
            "/proc/sys",
            "/proc/sysrq-trigger"
        ],
        "sysctl": null,
        "seccomp": null,
        "Hooks": {
            "createContainer": null,
            "createRuntime": null,
            "poststart": null,
            "poststop": null,
            "prestart": null,
            "startContainer": null
        },
        "version": "1.0.2-dev",
        "labels": [
            "bundle=/run/containerd/io.containerd.runtime.v1.linux/moby/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b"
        ],
        "no_new_keyring": false
    },
    "rootless": false,
    "cgroup_paths": {
        "blkio": "/sys/fs/cgroup/blkio/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "cpu": "/sys/fs/cgroup/cpu,cpuacct/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "cpuacct": "/sys/fs/cgroup/cpu,cpuacct/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "cpuset": "/sys/fs/cgroup/cpuset/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "devices": "/sys/fs/cgroup/devices/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "freezer": "/sys/fs/cgroup/freezer/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "hugetlb": "/sys/fs/cgroup/hugetlb/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "memory": "/sys/fs/cgroup/memory/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "name=systemd": "/sys/fs/cgroup/systemd/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "net_cls": "/sys/fs/cgroup/net_cls,net_prio/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "net_prio": "/sys/fs/cgroup/net_cls,net_prio/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "perf_event": "/sys/fs/cgroup/perf_event/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "pids": "/sys/fs/cgroup/pids/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b"
    },
    "namespace_paths": {
        "NEWCGROUP": "/proc/6707/ns/cgroup",
        "NEWIPC": "/proc/6707/ns/ipc",
        "NEWNET": "/proc/6707/ns/net",
        "NEWNS": "/proc/6707/ns/mnt",
        "NEWPID": "/proc/6707/ns/pid",
        "NEWUSER": "/proc/6707/ns/user",
        "NEWUTS": "/proc/6707/ns/uts"
    },
    "external_descriptors": [
        "/dev/null",
        "pipe:[25372]",
        "pipe:[25373]"
    ],
    "intel_rdt_path": ""
}

I have access to this, or machines in similar state - so feel free to request any additional information.

In the meantime, I'll note that - in order to test the hypothesis that what triggers this behavior is dockerd (and not crd/runc) - I'll soon transition our EKS AMIs to the one launched [yesterday](https://github.com/aws/containers-roadmap/issues/313#issuecomment-883523880), which supports using crd as a CRI w/o dockerd.

dany74q

comment created time in 2 months

issue comment aws/containers-roadmap

[EKS] [CRI]: Support for Containerd CRI

That's terrific ! @mikestef9 - quick q - I think a common use case would be provisioning node groups via terraform/pulumi or the like, where one can't override the launch template / bootstrap args directly (only by supplying a name + version).

Would it be possible to either have a dedicated launch template version for the crd runtime, or somehow integrate this to the creation API ?

paavan98pm

comment created time in 2 months

issue comment aws/containers-roadmap

[EKS] [CRI]: Support for Containerd CRI

I'd just add that it would be great if one could choose the CRI from the API for managed clusters, alongside the availability in eksctl, which might change it in the ASG launch template 🙏

paavan98pm

comment created time in 2 months

issue closed Nike-Inc/gimme-aws-creds

Support for touch id as a webauthn - MFA flow

Hey hey !

First, let me say that gimme-aws-creds is awesome, I use it around 30-40 times a week at our company - it's one of those terrific tools that are incorporated into our daily lives, that just-work, kudos.

This issue is more of a coordination attempt, so I could effectively contribute support for a touch-id / password prompted, keychain-backed webauthn flow for OSX devices.

gimme-aws-creds touch-id support

Adding support for this appeared a bit non-trivial; below, I've summarized my findings and implementation details on key logic I've written to accommodate them.

First, I'll say that I'm operating under an assumption that could be wrong - which is that touch-id over keychain support is in fact an MFA flow that would benefit users if it were incorporated in gimme-aws-creds; to be respectful of your time, and future maintenance efforts - please let me know if that's not the case.

This is a 10-15 minute read - so feel free to comment back when you have some free time.

Key findings

My goal was to mimic Chrome's behavior for MFA-ing over touch-id; on that front, I've looked at Chromium's frontend / rendering engine webauthn support[0], and its backend, Mac-centric FIDO support[1].

Here are my synthesized findings:

  • Chrome stores the generated key pair in the secure enclave (offered by the T2 security chip) of the machine, by using the kSecAttrTokenIDSecureEnclave attribute[2][3]
  • Chrome uses a keychain access group[4][5] of EQHXZ8M8AV.com.google.Chrome.webauthn so that only it can access the keypair[4][5][6]
  • Chrome uses kSecAccessControlUserPresence access control[7][8] so that OSX would require user presence to retrieve the keypair (presence = touch id / password)
  • The webauthn cred is scoped to, and can be requested from, a specific rp_id (the tenant's okta-domain) [9]
  • Chromedriver (/selenium) does not expose actual FIDO authenticators, but rather a virtualized browser-only alternative[10]
  • Operating against the secure enclave, and requiring user-presence, can only be done from signed executables with an apple developer id and some specific entitlements[11]

All of the above mean that:

  a. We can't read the key that Chrome manages, which means
  b. We must register our own key in okta; and
  c. We can't guarantee that the user's installed interpreter is signed, which means
  d. We can't rely on the OS to guard access to our key - we must prompt touch-id / password ourselves

Resources

[0] https://chromium.googlesource.com/chromium/src/+/master/content/browser/webauth
[1] https://chromium.googlesource.com/chromium/src/+/master/device/fido/mac/

Keychain Libraries - for storing, accessing and signing w/ key-pairs

[2] https://developer.apple.com/documentation/security/ksecattrtokenidsecureenclave?language=objc
[3] https://source.chromium.org/chromium/chromium/src/+/master:device/fido/mac/credential_store.mm;l=241?q=%22FindResidentCredentials%22&ss=chromium%2Fchromium%2Fsrc
[4] https://developer.apple.com/documentation/security/keychain_services/keychain_items/sharing_access_to_keychain_items_among_a_collection_of_apps?language=objc
[5] https://source.chromium.org/chromium/chromium/src/+/master:device/fido/mac/credential_store.mm;l=38?q=%22FindResidentCredentials%22&ss=chromium%2Fchromium%2Fsrc
[6] https://chromium.googlesource.com/chromium/src/+/7eaea36b3ce4cd172f0ba2b245a3cd42cba38ed6/chrome/browser/webauthn/chrome_authenticator_request_delegate.cc#240
[7] https://developer.apple.com/documentation/security/secaccesscontrolcreateflags/ksecaccesscontroluserpresence?language=objc
[8] https://source.chromium.org/chromium/chromium/src/+/master:device/fido/mac/touch_id_context.mm;l=34
[9] https://www.w3.org/TR/webauthn/#scope
[10] https://github.com/SeleniumHQ/selenium/issues/7753
[11] https://developer.apple.com/forums/thread/658711

Implementation details - Operating against the keychain

It seems that the current go-to library for interacting with OSX's keychain from python - is keyring, which is used in this project as well.

The thing about keyring is that it is currently only scoped for passwords, and does not expose necessary logic for the storing, retrieval and signing of key-pairs; I've opened this issue over there, and I might be able to integrate that logic there.

On the off chance that this attempt will not be successful, I think that the next best solution is to spin up my own library that exposes such functionality, and add it as a dependency here; I don't think that OS specific code which uses a python-objc bridge would really be a good fit in gimme-aws-creds.
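To make the gap concrete, this is a rough sketch of the kind of key-pair interface the paragraph above says keyring lacks today; the class and method names are invented purely for illustration and are not an actual keyring (or proposed) API.

import abc

class KeyPairBackend(abc.ABC):
    """Hypothetical interface for keychain-backed key pairs (create / retrieve / sign)."""

    @abc.abstractmethod
    def create_keypair(self, label: str, require_user_presence: bool = True) -> bytes:
        """Create a key pair stored under `label` and return the public key bytes."""

    @abc.abstractmethod
    def get_public_key(self, label: str) -> bytes:
        """Retrieve the public key previously stored under `label`."""

    @abc.abstractmethod
    def sign(self, label: str, data: bytes) -> bytes:
        """Sign `data` with the private key behind `label`, prompting the user if required."""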

Questions - would love your input

  • Do you think that above is a fair statement ?
  • Would you be willing to add a dependency on a new spun-up library that solves such a specific need ?

Implementation details - FIDO2 - WebAuthn, CTAP2 & Keychain

There aren't any fido2 (webauthn / ctap2) libraries that I've found that incorporate the keychain as a supported interface; usually, the libraries support authenticators by enumerating HID interfaces for external authenticators (a-la yubikey).

I had to write some webauthn & CTAP2 wrappers, helpers and structures that support the use of keychain as an external authenticator - specifically around generating the attestation and signatures from the OSX retrieved public key.

I've looked at yubikey's python-fido2 library, thinking about incorporating the keychain as a potential ctap device; but that is problematic on two fronts, and I don't think would serve as an elegant solution:

First, keychain does not talk ctap2 at all, and the library is all about ctap & cbor transport for supported devices; Meaning, keychain as a device would require intercepting the communication layer, and tailoring only a subset of the ctap requests which relate to key creation, retrieval and signing.

Second, from what I can gather, there's not much active development going on there - there are a few open issues and one PR that have not been addressed by the maintainers for over two years.

Questions - would love your input

  • Given that I don't want to pollute gimme-aws-creds with code that only a few will understand, and which will be a maintenance burden - do you think that such wrappers would also best sit in a new spun-up library, integrated as a dependency ?

Implementation details - Registering a new webauthn authenticator in okta

I needed to expand the library a bit, and have added a webauthn factor introspection and enrollment flow, currently behind a separate flag (--register-touch-id).

When this flag is set, here's the flow that happens:

  1. Getting an auth session against okta
  2. Requesting the /user/settings/factors/setup?factorType=FIDO_WEBAUTHN page
  3. Optionally, re-entering okta's password in a /user/verify_password page (if it's prompted)
  4. Stepping up auth w/ an already-registered MFA device (non touch-id)
  5. Introspecting the current factors /api/v1/authn/introspect
  6. POST-ing to an enrollment endpoint - /api/v1/authn/factors
  7. Gathering the challenge, client data (user info, rp info, expected key algorithm)
  8. Creating a keypair in keychain with said params
  9. Generating an attestation object (generated credential id, public key, signature on gathered client data)
  10. POST-ing the cbor encoded attestation object, client data and state token back to okta's returned activation link
  11. GET-ting the /login/sessionCookieRedirect page with /enduser/settings?enrolledFactor=FIDO_WEBAUTHN as a redirect URI

At this point, the new authenticator is registered in okta, albeit - with an empty name, as it does not recognize the authenticator guid (I believe okta hard-codes them on their end, as there's no user-supplied name).
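As a small, self-contained illustration of step 10: the attestation object that gets CBOR-encoded and POST-ed back is a map with the WebAuthn-defined "fmt", "attStmt" and "authData" fields. The values below are dummies (the real signature and authenticator data would come from the keychain-backed key), and the cbor2 dependency is just an assumption - any CBOR encoder, e.g. python-fido2's, would do.

import base64
import cbor2  # assumed CBOR encoder for this sketch

attestation_object = {
    "fmt": "packed",
    "attStmt": {"alg": -7, "sig": b"\x00" * 64},  # ES256 (-7) with a dummy signature
    "authData": b"\x00" * 37,                     # rpIdHash + flags + signCount, all dummy bytes
}

encoded = cbor2.dumps(attestation_object)
print(base64.urlsafe_b64encode(encoded).decode())  # a base64url rendering of the CBOR payload, for illustration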

Questions - would love your input

As you can note, the setup of the authenticator is a bit involved - my main consideration is keeping the code simple and comprehensible for the community and maintainers.

  • What do you think is the right 'UX' in regards to setting up this authenticator - meaning, would it need to be a separate flag (--register-touch-id), in the configure phase, or lazily when the preferred MFA method is webauthn ?

Implementation details - Guarding the key with touch id

I've mentioned in the Key findings section that guarding keys with a touch-id prompt is a privilege that only signed binaries with specific apple entitlements can request; thus, a key we create ourselves, as opposed to Chrome's:

  a. Would not prompt a touch-id / password input when the item is accessed
  b. Would be visible in the keychain (vs. T2 / secure-enclave keys, which are not)
  c. Would be accessible to other binaries

To tackle point a, I propose incorporating a small piece of code that prompts for touch-id / password input on our side, and only allowing the flow to continue if the authentication was successful; this can be added via the simple python-touch-id library.

To tackle points b and c, as access can't be limited to these items, the best alternative I can think of is to salt the rp id (which is used as a label, for key retrieval) with the credential id (which is generated by us, say, a uuid4) that is returned by okta. I believe that this would be sufficient to make the key relatively opaque in the face of 3rd-party keychain snoopers.
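A minimal sketch of the two mitigations just described. The touchid.authenticate() call is an assumption about the python-touch-id package's interface (not a verified API), and the label derivation simply shows the 'salt the rp id with the credential id' idea using standard hashing; the okta domain is a placeholder.

import hashlib
import uuid

def opaque_key_label(rp_id: str, credential_id: str) -> str:
    # Derive the keychain label from rp_id + credential_id so it isn't guessable from the rp_id alone.
    return hashlib.sha256(f"{rp_id}:{credential_id}".encode()).hexdigest()

def require_user_presence(reason: str = "gimme-aws-creds wants to use your WebAuthn key") -> None:
    import touchid  # assumed interface of the python-touch-id package
    if not touchid.authenticate(reason):
        raise PermissionError("touch-id / password verification failed")

credential_id = str(uuid.uuid4())                            # generated by us, as described above
label = opaque_key_label("example.okta.com", credential_id)  # placeholder okta tenant domain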

Questions - would love your input

  • Does my proposed solution make sense ? If not, would love to hear some alternatives we could consider

Recap

I'd like to add a new MFA flow to gimme-aws-creds; I have all the code intact, but I believe that most of it does not belong here and will clutter this codebase - and instead should be split across different OSS libraries that are maintained separately (either existing solutions, or new ones spun up by me).

My main concerns are the need for this feature (am I scratching solely my own itch), the minimization of maintenance burden on your side, and delivery of good UX and security for users - in registering a new okta authenticator, and operating against the keychain.

I'd love to hear your thoughts on each 'questions' section, and any other comments you may have in general.

Really appreciate your time.

closed time in 2 months

dany74q

issue opened containerd/containerd

containerd-shim process isn't reaped for some killed containers

Description We have several EKS clusters which autoscale throughout the day - they handle burst workloads, and in a given day the underlying ASG-s may scale down to 0 nodes, or scale up to tens of nodes.

We've noticed that once in a while, we have nodes which have pods stuck in a 'Terminating' status on them for days on-end, until we manually intervene and force-delete them;

I've SSH-ed to one of the nodes which experienced this behavior and tried to introspect it; here's what I've gathered, hopefully covering most abstraction layers - but I could not find the root cause unfortunately, and I'd love to know what I can do to further debug this.

Quick summary (more info below):

  • Kubelet is trying to kill an already dead container indefinitely
  • Docker thinks the container is running, although it is not
  • Containerd shows the task is stopped
  • The container-shim process wasn't reaped and is still alive
  • The shim receives a kill request from containerd, it execve's runc to kill the container; it receives "container not running" and responds to containerd with "process already finished: not found"
  • The above loops for every kubelet's kill retry

The above leads me to conclude that there's something fishy in the distributed orchestration of killing a container; I'd assume that it's somewhere between containerd<->runc, but I'm not entirely sure - and would love to know how I can better pinpoint the exact cause.

Steps to reproduce the issue: I'm not entirely sure how to reproduce the behavior yet, as it happens sporadically in arbitrary nodes.

Describe the results you received: Containers are dead and containerd is aware they are stopped, but the shim isn't reaped; docker thinks the container is still running and misleads kubelet into keeping the pod Terminating until manually intervening (force deleting the pod).

Describe the results you expected: The shim should go down with the container, docker should be notified the container is stopped so that kubelet will update the pod's status accordingly.

What version of containerd are you using:

$ containerd --version
containerd github.com/containerd/containerd 1.4.1 c623d1b36f09f8ef6536a057bd658b3aa8632828

Any other relevant information (runC version, CRI configuration, OS/Kernel version, etc.): Deeper dive per abstraction layer -

Kubernetes / kubelet:

  • The pod is stuck in a 'Terminating state'
  • There are no finalizers on the pod
  • When I describe the pod, I see something similar to 'Normal Killing 2m49s (x1715 over 2d6h) kubelet Stopping container csi-node'
  • In journalctl -u kubelet, I see the following near the time the pod started to terminate:
Jul 07 03:40:25 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:40:20.474218    4811 kubelet.go:1848] SyncLoop (DELETE, "api"): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)"
Jul 07 03:40:32 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:40:32.052286    4811 kubelet.go:1870] SyncLoop (PLEG): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)", event: &pleg.PodLifecycleEvent{ID:"6f2f36c4-06f0-406d-9681-b92fa0106441", Type:"ContainerDied", Data:"789845b4d5dc620ce62b36ff5a6d2ef380a725226264b0170d33ea645bb837f1"}
Jul 07 03:41:18 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:41:18.375163    4811 kubelet.go:1870] SyncLoop (PLEG): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)", event: &pleg.PodLifecycleEvent{ID:"6f2f36c4-06f0-406d-9681-b92fa0106441", Type:"ContainerDied", Data:"650c0009c27bfd04f4578e3a5fe2ce0eea300088acb5e669f1d71c5d187139ff"}
  • Then, I see indefinite "killing container" messages:
Jul 09 10:32:23 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:32:23.591613    4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period
Jul 09 10:34:24 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:34:24.591670    4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period
Jul 09 10:36:09 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:36:09.591654    4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period

Docker:

  • When I docker container ls / docker inspect <container-id> - the container status is 'Running'
 {
        "Id": "cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
        "Created": "2021-07-07T00:11:27.14530101Z",
        "Path": "/entrypoint.sh",
        "Args": [
            "--endpoint=unix:///csi/csi.sock",
            "--v=4",
            "--volume-attach-limit=5"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 7064,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2021-07-07T00:11:36.641265855Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
	...
}
  • In journalctl, looking near the time the pod started terminating, I see the following:
-- Logs begin at Wed 2021-07-07 00:10:20 UTC, end at Fri 2021-07-09 10:46:56 UTC. --
Jul 07 03:30:43 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:30:43.252002398Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:30:44 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:30:44.542537801Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:39:32 ip-10-0-73-87.ec2.internal dockerd[4325]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
on returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Jul 07 03:40:35 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:35.735692652Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:40:44 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:44.165635762Z" level=error msg="stream copy error: reading from a closed fifo"
Jul 07 03:40:47 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:47.504174793Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:40:56 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:55.687736472Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
Jul 07 03:41:07 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:06.802768003Z" level=info msg="Container cd7ed93ae2d1 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Jul 07 03:41:17 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:17.058149021Z" level=warning msg="Published ports are discarded when using host network mode"
Jul 07 03:41:17 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:17.066107174Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:41:48 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:48.397339761Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
Jul 07 03:41:58 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:58.415920505Z" level=info msg="Container cd7ed93ae2d1 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Jul 07 03:42:28 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:42:28.634636074Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
  • The SIGKILL / 'using the force' messages continue indefinitely
  • I've uploaded the dockerd stacktrace here - docker-stacktrace.txt

Containerd:

  • In containerd, the task of the container is 'STOPPED' - cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a 7064 STOPPED
  • The task's metrics prints the following:
ID                                                                  TIMESTAMP                                  
cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a    2021-07-09 10:52:16.172595382 +0000 UTC    

METRIC                   VALUE                                                
memory.usage_in_bytes    20029440                                             
memory.limit_in_bytes    9223372036854771712                                  
memory.stat.cache        11411456                                             
cpuacct.usage            69532053147                                          
cpuacct.usage_percpu     [14916245050 21098017259 14389165228 19128625610]    
pids.current             0                                                    
pids.limit               0  
  • The container's info doesn't show something particularly helpful:
{
    "ID": "cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
    "Labels": {
        "com.docker/engine.bundle.path": "/var/run/docker/containerd/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a"
    },
    "Image": "",
    "Runtime": {
        "Name": "io.containerd.runtime.v1.linux",
        "Options": {
            "type_url": "containerd.linux.runc.RuncOptions",
            "value": "CgRydW5jEhwvdmFyL3J1bi9kb2NrZXIvcnVudGltZS1ydW5j"
        }
    },
    "SnapshotKey": "",
    "Snapshotter": "",
    "CreatedAt": "2021-07-07T00:11:36.498772067Z",
    "UpdatedAt": "2021-07-07T00:11:36.498772067Z",
    "Extensions": null,
    "Spec": {
        "ociVersion": "1.0.1-dev",
        ...
    },
    ...
}

containerd-shim:

  • The containerd-shim of the container is still running:
root      7023  0.0  0.0 710748  6268 ?        Sl   Jul07   0:12 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
  • I tried to peek at the stdout / stderr of the shim - but either the process hadn't flushed any data, or it is simply empty
  • To find the shim id, I've looked at the process fd-s and correlated the opened sockets w/ ss:
[root@ip-10-0-73-87 ~]# ls -al /proc/$pid/fd | awk '/socket/ { print $NF }' | grep -o '[0-9]*' | xargs -I{} sh -c "ss -apn | grep {}"
u_str   LISTEN   0   4096   /run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b 25584   * 0       users:(("containerd-shim",pid=7023,fd=12),("containerd",pid=3538,fd=80))
u_str   ESTAB    0   0      /run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b 24369   * 25595   users:(("containerd-shim",pid=7023,fd=3))
u_str   ESTAB    0   0      * 25595                                                                                    * 24369   users:(("containerd",pid=3538,fd=87))
  • Then I looked at the containerd journal entries for the shim id (6e99a6...) and the only thing that came up was: Jul 07 00:11:36 ip-10-0-73-87.ec2.internal containerd[3538]: time="2021-07-07T00:11:36.521930693Z" level=info msg="shim containerd-shim started" address="unix:///run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b" debug=false pid=7023

  • I've managed to retrieve the shim's go stack trace by strace-ing to a file and sending a kill -USR1 to it, but I don't see anything of particular interest there:

<details> <summary>shim stacktrace</summary> <p>

write(8</var/lib/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a/shim.stdout.log>, "time="2021-07-09T11:56:32Z" level=info msg="=== BEGIN goroutine stack dump ===
goroutine 9 [running]:
main.dumpStacks(0xc000064150)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:276 +0x74
main.executeShim.func1(0xc00004a1e0, 0xc000064150)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:186 +0x3d
created by main.executeShim
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:184 +0x5e9

goroutine 1 [select]:
main.handleSignals(0xc000064150, 0xc0000a8720, 0xc00005c090, 0xc00006e000, 0xc0000d9e50, 0x0)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:239 +0x119
main.executeShim(0xc0000d2540, 0x7fe26c883088)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:189 +0x625
main.main()
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:118 +0x20a

goroutine 18 [chan receive]:
main.main.func1()
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:89 +0x85
created by main.main
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:88 +0x74

goroutine 6 [syscall]:
syscall.Syscall6(0xe8, 0xb, 0xc00003b9b8, 0x80, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x4)
        /usr/lib/golang/src/syscall/asm_linux_amd64.s:41 +0x5
github.com/containerd/containerd/vendor/golang.org/x/sys/unix.EpollWait(0xb, 0xc00003b9b8, 0x80, 0x80, 0xffffffffffffffff, 0xffffffffffffffff, 0x85a8e0, 0xa5cb40)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/golang.org/x/sys/unix/zsyscall_linux_amd64.go:76 +0x72
github.com/containerd/containerd/vendor/github.com/containerd/console.(*Epoller).Wait(0xc00005c2d0, 0xc0000d9a78, 0x8)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/console/console_linux.go:111 +0x7a
created by github.com/containerd/containerd/runtime/v1/shim.(*Service).initPlatform
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service_linux.go:113 +0xbb

goroutine 5 [chan receive]:
github.com/containerd/containerd/runtime/v1/shim.(*Service).processExits(0xc00006e000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:501 +0xd6
created by github.com/containerd/containerd/runtime/v1/shim.NewService
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:92 +0x40d

goroutine 24 [runnable]:
os/signal.process(0x85ec60, 0xa5cb70)
        /usr/lib/golang/src/os/signal/signal.go:240 +0x10a
os/signal.loop()
        /usr/lib/golang/src/os/signal/signal_unix.go:23 +0x45
created by os/signal.Notify.func1.1
        /usr/lib/golang/src/os/signal/signal.go:150 +0x45

goroutine 7 [chan receive, 3375 minutes]:
github.com/containerd/containerd/runtime/v1/shim.(*Service).forward(0xc00006e000, 0x85a3e0, 0xc00001a030)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:579 +0x71
created by github.com/containerd/containerd/runtime/v1/shim.NewService
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:96 +0x4bf

goroutine 8 [IO wait, 23 minutes]:
internal/poll.runtime_pollWait(0x7fe26c882de8, 0x72, 0x0)
        /usr/lib/golang/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc000074098, 0x72, 0x0, 0x0, 0x7fee5f)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Accept(0xc000074080, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/internal/poll/fd_unix.go:394 +0x1fc
net.(*netFD).accept(0xc000074080, 0xc00009e458, 0xc0000121b8, 0x79ccc0)
        /usr/lib/golang/src/net/fd_unix.go:172 +0x45
net.(*UnixListener).accept(0xc00005c330, 0xc00008fe50, 0xc00008fe58, 0x18)
        /usr/lib/golang/src/net/unixsock_posix.go:162 +0x32
net.(*UnixListener).Accept(0xc00005c330, 0x812878, 0xc000012190, 0x862f20, 0xc0000a0000)
        /usr/lib/golang/src/net/unixsock.go:260 +0x65
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve(0xc00005c090, 0x862f20, 0xc0000a0000, 0x861720, 0xc00005c330, 0x0, 0x0)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:87 +0x107
main.serve.func1(0x861720, 0xc00005c330, 0xc00005c090, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:224 +0x88
created by main.serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:222 +0x1fe

goroutine 25 [select]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc00009e3c0, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d

goroutine 10 [IO wait]:
internal/poll.runtime_pollWait(0x7fe26c882d08, 0x72, 0x85a8e0)
        /usr/lib/golang/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc0000e2318, 0x72, 0x85a800, 0xa4b248, 0x0)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0000e2300, 0xc000077000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/internal/poll/fd_unix.go:159 +0x1a5
net.(*netFD).Read(0xc0000e2300, 0xc000077000, 0x1000, 0x1000, 0xc000056270, 0xc00018fc28, 0x20)
        /usr/lib/golang/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc0000a6078, 0xc000077000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/net.go:182 +0x8e
bufio.(*Reader).Read(0xc00004a420, 0xc000078060, 0xa, 0xa, 0xc00018fd98, 0x447f74, 0xc00018ff18)
        /usr/lib/golang/src/bufio/bufio.go:227 +0x222
io.ReadAtLeast(0x859fa0, 0xc00004a420, 0xc000078060, 0xa, 0xa, 0xa, 0xa, 0x2, 0x789f60)
        /usr/lib/golang/src/io/io.go:314 +0x87
io.ReadFull(...)
        /usr/lib/golang/src/io/io.go:333
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.readMessageHeader(0xc000078060, 0xa, 0xa, 0x859fa0, 0xc00004a420, 0x73, 0xc00002ec00, 0x0, 0x6b0aae)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:53 +0x69
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*channel).recv(0xc000078040, 0xc00018fe2c, 0x3, 0x2, 0xc00014a400, 0x0, 0x0, 0x0)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:101 +0x6b
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run.func1(0xc00005e120, 0xc00009e3c0, 0xc00005e1e0, 0xc000078040, 0xc00005e180, 0xc00004a480)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:362 +0x1b0
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:332 +0x2c5

goroutine 17533 [select, 23 minutes]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc00009e1e0, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d

goroutine 17545 [select, 23 minutes]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc000012190, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d

=== END goroutine stack dump ===" namespace=moby path=/run/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a pid=7023"

</p> </details>

  • When strace-ing the shim, I saw that it execve-s runc with the following: /usr/sbin/runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a/log.json --log-format json kill cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a 9; to which runc returns "container not running", and in turn the shim reports to containerd - "process already finished: not found"

runc:

  • There are no hanging runc processes that I could observe, nor any related journal logs; I'm not sure how to introspect this layer after the fact.

Is there any unix socket I can communicate w/ to show some info pertaining to runc ?

os:

  • The container's process is gone, meaning it was actually killed

What you expected to happen: Pods should terminate once their underlying container had died.

How to reproduce it (as minimally and precisely as possible): Not actually sure how to reproduce it consistently - it happens when creating and destroying nodes rapidly, I'd assume.

<details> <summary>Environment</summary> <p>

- Kubernetes version: 1.19
- Docker version output:
Client:
 Version:           19.03.13-ce
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        4484c46
 Built:             Mon Oct 12 18:51:20 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-ce
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46
  Built:            Mon Oct 12 18:51:50 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.1
  GitCommit:        c623d1b36f09f8ef6536a057bd658b3aa8632828
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
- Docker info output:
Client:
 Debug Mode: false

Server:
 Containers: 14
  Running: 9
  Paused: 0
  Stopped: 5
 Images: 8
 Server Version: 19.03.13-ce
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: c623d1b36f09f8ef6536a057bd658b3aa8632828
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0 (expected: fec3683)
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.4.117-58.216.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.46GiB
 Name: ip-10-0-73-87.ec2.internal
 ID: BFFR:6SUN:2BSZ:4MB4:K5NO:OBN2:6VHK:Z2YQ:LS3U:KPEW:5TUV:AEBW
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true

</p> </details>

<details><summary><code>runc --version</code></summary><br><pre> $ runc --version runc version 1.0.0-rc93 commit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec spec: 1.0.2-dev go: go1.15.8 libseccomp: 2.4.1 </pre></details>

<details><summary><code>uname -a</code></summary><br><pre> $ uname -a Linux ip-10-0-73-87.ec2.internal 5.4.117-58.216.amzn2.x86_64 #1 SMP Tue May 11 20:50:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux </pre></details>

created time in 2 months

issue openedawslabs/amazon-eks-ami

Pods are stuck in a Terminating state on specific nodes

What happened: We have several EKS clusters which autoscale throughout the day - they handle burst workloads, and in a given day the underlying ASG-s may scale down to 0 nodes, or scale up to tens of nodes.

We've noticed that once in a while, we have nodes which have pods stuck in a 'Terminating' status on them for days on-end, until we manually intervene and force-delete them;

In our case, it is especially problematic when this happens with the ebs csi driver pod - it stops handling volume attachment requests and holds on to bound volumes until we manually intervene.

I've SSH-ed to one of the nodes which experienced this behavior and tried to introspect it; here's what I've gathered, hopefully covering most abstraction layers. Unfortunately, I could not find the root cause - I'd love to know what I can do to further debug this.

Quick summary:

  • Kubelet is trying to kill an already dead container indefinitely
  • Docker thinks the container is running, although it is not
  • Containerd shows the task is stopped
  • The container-shim process wasn't reaped and is still alive

The above leads me to conclude that there's something fishy in the distributed orchestration of killing a container; I'd assume that it's somewhere between containerd<->runc, but I'm not entirely sure.

Deeper dive per abstraction layer -

Kubernetes / kubelet:

  • The pod is stuck in a 'Terminating state'
  • There are no finalizers on the pod
  • When I describe the pod, I see something similar to 'Normal Killing 2m49s (x1715 over 2d6h) kubelet Stopping container csi-node'
  • In journalctl -u kubelet, I see the following near the time the pod started to terminate:
Jul 07 03:40:25 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:40:20.474218    4811 kubelet.go:1848] SyncLoop (DELETE, "api"): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)"
Jul 07 03:40:32 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:40:32.052286    4811 kubelet.go:1870] SyncLoop (PLEG): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)", event: &pleg.PodLifecycleEvent{ID:"6f2f36c4-06f0-406d-9681-b92fa0106441", Type:"ContainerDied", Data:"789845b4d5dc620ce62b36ff5a6d2ef380a725226264b0170d33ea645bb837f1"}
Jul 07 03:41:18 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:41:18.375163    4811 kubelet.go:1870] SyncLoop (PLEG): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)", event: &pleg.PodLifecycleEvent{ID:"6f2f36c4-06f0-406d-9681-b92fa0106441", Type:"ContainerDied", Data:"650c0009c27bfd04f4578e3a5fe2ce0eea300088acb5e669f1d71c5d187139ff"}
  • Then, I see indefinite "killing container" messages:
Jul 09 10:32:23 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:32:23.591613    4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period
Jul 09 10:34:24 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:34:24.591670    4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period
Jul 09 10:36:09 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:36:09.591654    4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period

Docker:

  • When I docker container ls / docker inspect <container-id> - the container status is 'Running'
 {
        "Id": "cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
        "Created": "2021-07-07T00:11:27.14530101Z",
        "Path": "/entrypoint.sh",
        "Args": [
            "--endpoint=unix:///csi/csi.sock",
            "--v=4",
            "--volume-attach-limit=5"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 7064,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2021-07-07T00:11:36.641265855Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
	...
}
  • In journalctl, looking near the time the pod started terminating, I see the following:
-- Logs begin at Wed 2021-07-07 00:10:20 UTC, end at Fri 2021-07-09 10:46:56 UTC. --
Jul 07 03:30:43 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:30:43.252002398Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:30:44 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:30:44.542537801Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:39:32 ip-10-0-73-87.ec2.internal dockerd[4325]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
on returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Jul 07 03:40:35 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:35.735692652Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:40:44 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:44.165635762Z" level=error msg="stream copy error: reading from a closed fifo"
Jul 07 03:40:47 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:47.504174793Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:40:56 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:55.687736472Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
Jul 07 03:41:07 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:06.802768003Z" level=info msg="Container cd7ed93ae2d1 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Jul 07 03:41:17 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:17.058149021Z" level=warning msg="Published ports are discarded when using host network mode"
Jul 07 03:41:17 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:17.066107174Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:41:48 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:48.397339761Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
Jul 07 03:41:58 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:58.415920505Z" level=info msg="Container cd7ed93ae2d1 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Jul 07 03:42:28 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:42:28.634636074Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
  • The SIGKILL / 'using the force' messages continue indefinitely

Containerd:

  • In containerd, the task of the container is 'STOPPED' - cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a 7064 STOPPED
  • The task's metrics prints the following:
ID                                                                  TIMESTAMP                                  
cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a    2021-07-09 10:52:16.172595382 +0000 UTC    

METRIC                   VALUE                                                
memory.usage_in_bytes    20029440                                             
memory.limit_in_bytes    9223372036854771712                                  
memory.stat.cache        11411456                                             
cpuacct.usage            69532053147                                          
cpuacct.usage_percpu     [14916245050 21098017259 14389165228 19128625610]    
pids.current             0                                                    
pids.limit               0  
  • The container's info doesn't show something particularly helpful:
{
    "ID": "cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
    "Labels": {
        "com.docker/engine.bundle.path": "/var/run/docker/containerd/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a"
    },
    "Image": "",
    "Runtime": {
        "Name": "io.containerd.runtime.v1.linux",
        "Options": {
            "type_url": "containerd.linux.runc.RuncOptions",
            "value": "CgRydW5jEhwvdmFyL3J1bi9kb2NrZXIvcnVudGltZS1ydW5j"
        }
    },
    "SnapshotKey": "",
    "Snapshotter": "",
    "CreatedAt": "2021-07-07T00:11:36.498772067Z",
    "UpdatedAt": "2021-07-07T00:11:36.498772067Z",
    "Extensions": null,
    "Spec": {
        "ociVersion": "1.0.1-dev",
        ...
    },
    ...
}
  • journalctl -u containerd does not have logs with the given container id

containerd-shim:

  • The containerd-shim of the container is still running:
root      7023  0.0  0.0 710748  6268 ?        Sl   Jul07   0:12 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
  • I tried to peek at the stdout / stderr of the shim - but either the process hadn't flushed any data, or it is simply empty
  • To find the shim id, I've looked at the process fd-s and correlated the opened sockets w/ ss:
[root@ip-10-0-73-87 ~]# ls -al /proc/$pid/fd | awk '/socket/ { print $NF }' | grep -o '[0-9]*' | xargs -I{} sh -c "ss -apn | grep {}"
u_str   LISTEN   0   4096   /run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b 25584   * 0       users:(("containerd-shim",pid=7023,fd=12),("containerd",pid=3538,fd=80))
u_str   ESTAB    0   0      /run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b 24369   * 25595   users:(("containerd-shim",pid=7023,fd=3))
u_str   ESTAB    0   0      * 25595                                                                                    * 24369   users:(("containerd",pid=3538,fd=87))
  • Then I looked at the containerd journal entries for the shim id (6e99a6...) and the only thing that came up was: Jul 07 00:11:36 ip-10-0-73-87.ec2.internal containerd[3538]: time="2021-07-07T00:11:36.521930693Z" level=info msg="shim containerd-shim started" address="unix:///run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b" debug=false pid=7023

  • I've managed to retrieve the shim's go stack trace by strace-ing to a file and sending a kill -USR1 to it, but I don't see anything of particular interest there:

write(8</var/lib/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a/shim.stdout.log>, "time="2021-07-09T11:56:32Z" level=info msg="=== BEGIN goroutine stack dump ===
goroutine 9 [running]:
main.dumpStacks(0xc000064150)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:276 +0x74
main.executeShim.func1(0xc00004a1e0, 0xc000064150)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:186 +0x3d
created by main.executeShim
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:184 +0x5e9

goroutine 1 [select]:
main.handleSignals(0xc000064150, 0xc0000a8720, 0xc00005c090, 0xc00006e000, 0xc0000d9e50, 0x0)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:239 +0x119
main.executeShim(0xc0000d2540, 0x7fe26c883088)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:189 +0x625
main.main()
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:118 +0x20a

goroutine 18 [chan receive]:
main.main.func1()
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:89 +0x85
created by main.main
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:88 +0x74

goroutine 6 [syscall]:
syscall.Syscall6(0xe8, 0xb, 0xc00003b9b8, 0x80, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x4)
        /usr/lib/golang/src/syscall/asm_linux_amd64.s:41 +0x5
github.com/containerd/containerd/vendor/golang.org/x/sys/unix.EpollWait(0xb, 0xc00003b9b8, 0x80, 0x80, 0xffffffffffffffff, 0xffffffffffffffff, 0x85a8e0, 0xa5cb40)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/golang.org/x/sys/unix/zsyscall_linux_amd64.go:76 +0x72
github.com/containerd/containerd/vendor/github.com/containerd/console.(*Epoller).Wait(0xc00005c2d0, 0xc0000d9a78, 0x8)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/console/console_linux.go:111 +0x7a
created by github.com/containerd/containerd/runtime/v1/shim.(*Service).initPlatform
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service_linux.go:113 +0xbb

goroutine 5 [chan receive]:
github.com/containerd/containerd/runtime/v1/shim.(*Service).processExits(0xc00006e000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:501 +0xd6
created by github.com/containerd/containerd/runtime/v1/shim.NewService
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:92 +0x40d

goroutine 24 [runnable]:
os/signal.process(0x85ec60, 0xa5cb70)
        /usr/lib/golang/src/os/signal/signal.go:240 +0x10a
os/signal.loop()
        /usr/lib/golang/src/os/signal/signal_unix.go:23 +0x45
created by os/signal.Notify.func1.1
        /usr/lib/golang/src/os/signal/signal.go:150 +0x45

goroutine 7 [chan receive, 3375 minutes]:
github.com/containerd/containerd/runtime/v1/shim.(*Service).forward(0xc00006e000, 0x85a3e0, 0xc00001a030)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:579 +0x71
created by github.com/containerd/containerd/runtime/v1/shim.NewService
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:96 +0x4bf

goroutine 8 [IO wait, 23 minutes]:
internal/poll.runtime_pollWait(0x7fe26c882de8, 0x72, 0x0)
        /usr/lib/golang/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc000074098, 0x72, 0x0, 0x0, 0x7fee5f)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Accept(0xc000074080, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/internal/poll/fd_unix.go:394 +0x1fc
net.(*netFD).accept(0xc000074080, 0xc00009e458, 0xc0000121b8, 0x79ccc0)
        /usr/lib/golang/src/net/fd_unix.go:172 +0x45
net.(*UnixListener).accept(0xc00005c330, 0xc00008fe50, 0xc00008fe58, 0x18)
        /usr/lib/golang/src/net/unixsock_posix.go:162 +0x32
net.(*UnixListener).Accept(0xc00005c330, 0x812878, 0xc000012190, 0x862f20, 0xc0000a0000)
        /usr/lib/golang/src/net/unixsock.go:260 +0x65
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve(0xc00005c090, 0x862f20, 0xc0000a0000, 0x861720, 0xc00005c330, 0x0, 0x0)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:87 +0x107
main.serve.func1(0x861720, 0xc00005c330, 0xc00005c090, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:224 +0x88
created by main.serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:222 +0x1fe

goroutine 25 [select]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc00009e3c0, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d

goroutine 10 [IO wait]:
internal/poll.runtime_pollWait(0x7fe26c882d08, 0x72, 0x85a8e0)
        /usr/lib/golang/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc0000e2318, 0x72, 0x85a800, 0xa4b248, 0x0)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0000e2300, 0xc000077000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/internal/poll/fd_unix.go:159 +0x1a5
net.(*netFD).Read(0xc0000e2300, 0xc000077000, 0x1000, 0x1000, 0xc000056270, 0xc00018fc28, 0x20)
        /usr/lib/golang/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc0000a6078, 0xc000077000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/net.go:182 +0x8e
bufio.(*Reader).Read(0xc00004a420, 0xc000078060, 0xa, 0xa, 0xc00018fd98, 0x447f74, 0xc00018ff18)
        /usr/lib/golang/src/bufio/bufio.go:227 +0x222
io.ReadAtLeast(0x859fa0, 0xc00004a420, 0xc000078060, 0xa, 0xa, 0xa, 0xa, 0x2, 0x789f60)
        /usr/lib/golang/src/io/io.go:314 +0x87
io.ReadFull(...)
        /usr/lib/golang/src/io/io.go:333
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.readMessageHeader(0xc000078060, 0xa, 0xa, 0x859fa0, 0xc00004a420, 0x73, 0xc00002ec00, 0x0, 0x6b0aae)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:53 +0x69
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*channel).recv(0xc000078040, 0xc00018fe2c, 0x3, 0x2, 0xc00014a400, 0x0, 0x0, 0x0)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:101 +0x6b
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run.func1(0xc00005e120, 0xc00009e3c0, 0xc00005e1e0, 0xc000078040, 0xc00005e180, 0xc00004a480)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:362 +0x1b0
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:332 +0x2c5

goroutine 17533 [select, 23 minutes]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc00009e1e0, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d

goroutine 17545 [select, 23 minutes]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc000012190, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d

=== END goroutine stack dump ===" namespace=moby path=/run/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a pid=7023"

runc:

  • There are no hanging runc processes that I could observe, nor any related journal logs; I'm not sure how to introspect this layer after the fact.

Is there any unix socket I can communicate w/ to show some info pertaining to runc ?

os:

  • The container's process is gone, meaning it was actually killed

What you expected to happen: Pods should terminate once their underlying container had died.

How to reproduce it (as minimally and precisely as possible): Not actually sure how to reproduce it consistently - it happens when creating and destroying nodes rapidly, I'd assume.

Environment:

- AWS Region: us-east-1 (N. Virginia)
- Instance Type(s): t3.xlarge
- EKS Platform version: eks.4
- Kubernetes version: 1.19
- AMI Version: 1.19.6-20210628
- Kernel: 5.4.117-58.216.amzn2.x86_64
- Release information: 
BASE_AMI_ID="ami-0136f7c838fded2f6"
BUILD_TIME="Mon Jun 28 16:39:26 UTC 2021"
BUILD_KERNEL="5.4.117-58.216.amzn2.x86_64"
ARCH="x86_64"
- Docker version output:
Client:
 Version:           19.03.13-ce
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        4484c46
 Built:             Mon Oct 12 18:51:20 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-ce
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46
  Built:            Mon Oct 12 18:51:50 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.1
  GitCommit:        c623d1b36f09f8ef6536a057bd658b3aa8632828
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

created time in 2 months

startedent/ent

started time in 3 months

startedibraheemdev/modern-unix

started time in 3 months

issue closedkubeguard/guard

Non interactive logins - Azure provider

Hey hey !

In the past few days, I've had a chance to look over the AKS-AAD integrations (which utilize guard, under the hood); That was during some research I'm doing for implementing non interactive logins for AAD enabled clusters (both legacy & managed).

Specifically, I'm looking at auth via service principals, passing a client_credentials-flow token, which holds the service principal's object id claim and no UPN claim.

Managed AAD clusters have it mostly solved (a-la kubelogin) - one only needs to issue a token for the multi tenant AKS server app, and any directory entity could be used in this flow - the groups JWT claim is considered along with any overage data.
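
For context, a minimal sketch of the non interactive flow I have in mind - the tenant, app ids and secret below are placeholders, and it hits the plain AAD v2 token endpoint directly rather than going through kubelogin:

# Hypothetical sketch of the client_credentials flow; all ids / secrets are
# placeholders. The resulting token carries an oid claim and no UPN claim,
# which is exactly the case described above.
import requests

def spn_token(tenant_id: str, client_id: str, client_secret: str, aks_server_app_id: str) -> str:
    resp = requests.post(
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": f"{aks_server_app_id}/.default",
        },
    )
    resp.raise_for_status()
    return resp.json()["access_token"]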

However, in case of overage - ms graph is consulted, and the request fails for SPNs, as msgraph 404-s fetching group memberships for service principals; This is because the graph API currently used only supports retrieving memberships for users (and not service principals).

As for legacy clusters, the groups JWT claim isn't considered at all -
ms graph is always consulted in fetching group memberships for the given object id; Specifically, the "/users/id/getMemberGroups" endpoint is used.

When an spn oid is passed - the API above 404-s in the same manner and it fails the request altogether.

I believe that ms graph had no API for retrieving groups for any given entity back in the day, but now one does exist - "/directoryObjects/id/getMemberGroups".

I was wondering if it would be a good idea to migrate to the new endpoint - this would enable non interactive login flows to legacy clusters and fix the flow for managed integrations for SPNs assigned to many groups.
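
To make the suggestion concrete, here's a rough sketch of the difference (python for brevity - guard itself is Go); the request body stays the same, only the resource segment changes, and I'm assuming a valid graph token is at hand:

# Illustrative sketch only - contrasts the current users-only endpoint with the
# directoryObjects one, which also resolves groups for service principals.
import requests

GRAPH = "https://graph.microsoft.com/v1.0"

def member_groups(token: str, object_id: str, users_only: bool = False) -> list:
    resource = "users" if users_only else "directoryObjects"
    resp = requests.post(
        f"{GRAPH}/{resource}/{object_id}/getMemberGroups",
        headers={"Authorization": f"Bearer {token}"},
        json={"securityEnabledOnly": False},
    )
    resp.raise_for_status()  # the users/... variant 404-s for an SPN object id
    return resp.json()["value"]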

Otherwise, it might make sense to make the ms graph call best-effort - returning blank groups in case of error; it's a bit unfortunate that not being able to retrieve groups fails the auth attempt altogether - when reaching that code path, we have a verified JWT at hand with some object id, and it might have made sense to pass it onward and check for any direct k8s role mapping.

Another suggestion might be to flip the flag which considers the given groups claim (on AKS side) for legacy clusters - closing the disparity between the two integrations.

Would love to hear your two cents on this @weinong

Thanks !

closed time in 3 months

dany74q

delete branch dany74q/guard

delete branch : azure-new-graph-api-endpoint-issue-320

delete time in 3 months

PR closed kubeguard/guard

Migrated to new directoryObjects getGroupsById graph API

Fixes: https://github.com/kubeguard/guard/issues/320

azure.go

  • A userInfo wrapper that contains the oid and upn claims was introduced

azure.md

  • Updated docs to include the new Application.Read.All permission

azure_test.go

  • Migrated oid tests to new endpoint

graph.go

  • Extracted out getMemberGroupsGraphURL
  • Will now use the new directory objects endpoint for object IDs
  • Will fall back to the previous endpoint for UPNs (the new one does not support UPNs)
  • Will now use Url.Parse to retrieve a relative URL - it handles escaping (the UPN may contain '#', which needs to be URL-escaped; see the sketch after this list)
  • Fixed warning
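
A quick illustration of the escaping concern mentioned above (python, purely illustrative - guard relies on Go's url.Parse for the same effect; the UPN below is made up):

# Illustration only - a guest UPN containing '#' breaks a naively concatenated
# URL, since '#' starts the fragment; percent-encoding keeps the path intact.
from urllib.parse import quote

upn = "jane.doe_example.com#EXT#@contoso.onmicrosoft.com"  # hypothetical guest UPN
print(f"users/{upn}/getMemberGroups")         # everything after '#' would be treated as a fragment
print(f"users/{quote(upn)}/getMemberGroups")  # '#' is percent-encoded to %23, so the path survives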

graph_test.go

  • Added TestGetMemberGroupsGraphURL

obo-server-app.png

  • Updated with new Application.Read.All permission
+115 -39

2 comments

6 changed files

dany74q

pr closed time in 3 months

pull request commentkubeguard/guard

Migrated to new directoryObjects getGroupsById graph API

Closing per our discussion in https://github.com/kubeguard/guard/issues/320

dany74q

comment created time in 3 months