etcd-io/jetcd 843

etcd java client

avdheshgarodia/cuda-neural-network-digit-classification 1

A neural network implemented in parallel on the graphics card using CUDA; it is used to classify digits.

fanminshi/etcd-backup-operator 1

A Kubernetes controller that performs backups for etcd clusters

fanminshi/CarND-LaneLines-P1 0

Lane Finding Project for Self-Driving Car ND

fanminshi/etcd 0

Distributed reliable key-value store for the most critical data of a distributed system

fanminshi/etcd-operator 0

etcd operator creates/configures/manages etcd clusters atop Kubernetes

issue opened openshift/coredns-mdns

Question about the coredns-mdns use-case within k8s

Hi,

I would like to understand how coredns-mdns is used in the context of k8s, and specifically what problems it tries to solve. For context, I am trying to figure out how to resolve the api-server via an mDNS name for binaries such as kubelet and kube-proxy.

Thanks, Fanmin

created time in 15 days

issue comment NVIDIA/gpu-operator

toolkit installation container can't find nvidia-smi

@shivamerla thanks! That fixes my issue.

fanminshi

comment created time in 2 months

issue comment NVIDIA/gpu-operator

toolkit installation container can't find nvidia-smi

We are actually using containerd as the runtime. I checked the kubelet setting --container-runtime-endpoint=unix:///run/containerd/containerd.sock. However, I noticed that the nvidia runtime is set in the Docker config:

$ cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
    },
    "nvidia-experimental": {
      "args": [],
      "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
    }
  }
}

but not in the containerd config:

$ cat /etc/containerd/config.toml
version = 2
# Kubernetes requires the cri plugin.
required_plugins = ["io.containerd.grpc.v1.cri"]
# Kubernetes doesn't use containerd restart manager.
disabled_plugins = ["io.containerd.internal.v1.restart"]

[debug]
  level = "info"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "0"
  max_container_log_line_size = 16384
  SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"
  conf_template = ""
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["https://mirror.gcr.io","https://registry-1.docker.io"]
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  BinaryName = "/usr/bin/runc"

I think this might be related to the issue i am seeing.
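
For comparison, when containerd is the CRI runtime the toolkit's runtime usually needs to be registered in the containerd config as well. A rough sketch of what those entries could look like in a v2 config is below; this is a hedged example, not taken from the issue, and the BinaryName path is assumed from the /usr/local/nvidia/toolkit directory listed elsewhere in this thread:

[plugins."io.containerd.grpc.v1.cri".containerd]
  # make the NVIDIA runtime the default so GPU pods pick it up without a RuntimeClass
  default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  # assumed install location of the runtime wrapper shipped by the toolkit container
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"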

fanminshi

comment created time in 2 months

issue comment NVIDIA/gpu-operator

toolkit installation container can't find nvidia-smi

> Also what container runtime are you using?

I am using Docker.

sudo docker version
Client: Docker Engine - Community
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:54:27 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:54:50 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 nvidia:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

The debug log is basically this:

2021/10/08 22:28:07 No modification required
2021/10/08 22:28:07 Forwarding command to runtime
2021/10/08 22:28:07 Bundle directory path is empty, using working directory.
2021/10/08 22:28:07 Using bundle directory: /
2021/10/08 22:28:07 Using OCI specification file path: /config.json
2021/10/08 22:28:07 Looking for runtime binary 'docker-runc'
2021/10/08 22:28:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/10/08 22:28:07 Looking for runtime binary 'runc'
2021/10/08 22:28:07 Found runtime binary '/usr/bin/runc'
2021/10/08 22:28:07 Running /usr/local/nvidia/toolkit/nvidia-container-runtime.real
fanminshi

comment created time in 2 months

issue opened NVIDIA/gpu-operator

toolkit installation container can't find nvidia-smi

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
  • [x] Are you running Kubernetes v1.13+?
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

2. Issue or feature description

I deployed gpu-operator version v1.8.2 on Ubuntu 20.04 and saw the following error:

kubectl -n gpu-operator-resources logs nvidia-operator-validator-d857c -c toolkit-validation
time="2021-10-07T15:16:34Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
toolkit is not ready

I believe there might be an error in the gpu-operator code in how nvidia-smi is executed here:

https://github.com/NVIDIA/gpu-operator/blob/97fbf5b695d5c12beff3eb9958cfa0c2b44416bb/validator/main.go#L478-L485

and it should instead follow the approach used here:

https://github.com/NVIDIA/gpu-operator/blob/97fbf5b695d5c12beff3eb9958cfa0c2b44416bb/validator/main.go#L421-L426
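
To illustrate the distinction the two links point at, here is a minimal, hedged Go sketch: the failing path looks nvidia-smi up on $PATH inside the validator container, while the suggested path runs it rooted at the driver install directory. The chroot approach and the /run/nvidia/driver path are assumptions drawn from the linked lines and the directory listing further down; this is not the actual validator code.

package main

import (
    "fmt"
    "os/exec"
)

// assumed driver install root, matching the /run/nvidia/driver listing below
const driverRoot = "/run/nvidia/driver"

// Failing pattern from the report: nvidia-smi must already be on $PATH inside
// the validation container, which yields "executable file not found in $PATH".
func runFromPath() ([]byte, error) {
    return exec.Command("nvidia-smi").CombinedOutput()
}

// Suggested pattern: chroot into the driver root so the nvidia-smi shipped by
// the driver container is found regardless of the validator's own $PATH.
func runFromDriverRoot() ([]byte, error) {
    return exec.Command("chroot", driverRoot, "nvidia-smi").CombinedOutput()
}

func main() {
    if out, err := runFromPath(); err != nil {
        fmt.Printf("PATH lookup failed: %v\n%s", err, out)
    }
    if out, err := runFromDriverRoot(); err != nil {
        fmt.Printf("driver-root lookup failed: %v\n%s", err, out)
    } else {
        fmt.Printf("%s", out)
    }
}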

3. Steps to reproduce the issue

4. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status: kubectl get pods --all-namespaces

  • [ ] kubernetes daemonset status: kubectl get ds --all-namespaces

  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

  • [ ] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • [ ] Output of running a container on the GPU machine: docker run -it alpine echo foo

  • [ ] Docker configuration file: cat /etc/docker/daemon.json

  • [ ] Docker runtime configuration: docker info | grep runtime

  • [ ] NVIDIA shared directory: ls -la /run/nvidia

  • [ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

ls -la /usr/local/nvidia/toolkit
total 8556
drwxr-xr-x 3 root root    4096 Oct  6 21:40 .
drwxr-xr-x 3 root root    4096 Oct  6 21:40 ..
drwxr-xr-x 3 root root    4096 Oct  6 21:40 .config
lrwxrwxrwx 1 root root      28 Oct  6 21:40 libnvidia-container.so.1 -> libnvidia-container.so.1.5.1
-rwxr-xr-x 1 root root  179216 Oct  6 21:40 libnvidia-container.so.1.5.1
-rwxr-xr-x 1 root root     154 Oct  6 21:40 nvidia-container-cli
-rwxr-xr-x 1 root root   43024 Oct  6 21:40 nvidia-container-cli.real
-rwxr-xr-x 1 root root     342 Oct  6 21:40 nvidia-container-runtime
-rwxr-xr-x 1 root root     429 Oct  6 21:40 nvidia-container-runtime-experimental
-rwxr-xr-x 1 root root 3991000 Oct  6 21:40 nvidia-container-runtime.experimental
lrwxrwxrwx 1 root root      24 Oct  6 21:40 nvidia-container-runtime-hook -> nvidia-container-toolkit
-rwxr-xr-x 1 root root 2359384 Oct  6 21:40 nvidia-container-runtime.real
-rwxr-xr-x 1 root root     198 Oct  6 21:40 nvidia-container-toolkit
-rwxr-xr-x 1 root root 2147896 Oct  6 21:40 nvidia-container-toolkit.real
  • [x] NVIDIA driver directory: ls -la /run/nvidia/driver
ls -la /run/nvidia/driver
total 88
drwxr-xr-x   1 root root  4096 Oct  6 21:39 .
drwxr-xr-x   4 root root   120 Oct  6 21:40 ..
lrwxrwxrwx   1 root root     7 Jul 23 17:35 bin -> usr/bin
drwxr-xr-x   2 root root  4096 Apr 15  2020 boot
drwxr-xr-x  15 root root  4160 Oct  6 21:40 dev
drwxr-xr-x   1 root root  4096 Oct  6 21:39 drivers
drwxr-xr-x   1 root root  4096 Oct  6 21:40 etc
drwxr-xr-x   2 root root  4096 Apr 15  2020 home
drwxr-xr-x   2 root root  4096 Oct  6 21:39 host-etc
lrwxrwxrwx   1 root root     7 Jul 23 17:35 lib -> usr/lib
lrwxrwxrwx   1 root root     9 Jul 23 17:35 lib32 -> usr/lib32
lrwxrwxrwx   1 root root     9 Jul 23 17:35 lib64 -> usr/lib64
lrwxrwxrwx   1 root root    10 Jul 23 17:35 libx32 -> usr/libx32
drwxr-xr-x   2 root root  4096 Jul 23 17:35 media
drwxr-xr-x   2 root root  4096 Jul 23 17:35 mnt
-rw-r--r--   1 root root 16047 Aug  3 20:33 NGC-DL-CONTAINER-LICENSE
drwxr-xr-x   2 root root  4096 Jul 23 17:35 opt
dr-xr-xr-x 944 root root     0 Oct  6 21:10 proc
drwx------   2 root root  4096 Jul 23 17:38 root
drwxr-xr-x   1 root root  4096 Oct  6 21:40 run
lrwxrwxrwx   1 root root     8 Jul 23 17:35 sbin -> usr/sbin
drwxr-xr-x   2 root root  4096 Jul 23 17:35 srv
dr-xr-xr-x  13 root root     0 Oct  6 21:39 sys
drwxrwxrwt   1 root root  4096 Oct  6 21:40 tmp
drwxr-xr-x   1 root root  4096 Jul 23 17:35 usr
drwxr-xr-x   1 root root  4096 Jul 23 17:38 var
  • [ ] kubelet logs journalctl -u kubelet > kubelet.logs

created time in 2 months
