
kwikwag/ok-templates 1

Open Knesset Templates

kwikwag/ok-webfront 1

Frontend web server for Open Knesset.

kwikwag/biojava 0

:book::microscope::coffee: BioJava is an open-source project dedicated to providing a Java framework for processing biological data.

kwikwag/crouton 0

Chromium OS Universal Chroot Environment

kwikwag/cyrusmol_v2 0

CyrusMol is an experimental frontend for molecular modelling simulators such as Rosetta

kwikwag/dokan-sshfs 0

Dokan SSHFS

kwikwag/flot 0

Attractive JavaScript charts for jQuery

kwikwag/IIC 0

Invariant Information Clustering for Unsupervised Image Classification and Segmentation

kwikwag/imapclient 0

An easy-to-use, Pythonic and complete IMAP client library

pull request comment mindslab-ai/voicefilter

[WIP] implement power-law compression loss

hey @linzwatt any results?

stegben

comment created time in 2 days

push event kwikwag/voicefilter

kwikwag

commit sha f904559695bcbfd9dc194a0c78f705682ce6ed58

don't generate already-generated train/test files


kwikwag

commit sha ba9c4d5b87a1c77d358682f865e3b5ec02a1e992

normalize-resample.sh: detect number of processes and skip existing files ahead of python invocation


kwikwag

commit sha 608c89507a51b41d090cbb7bb27ebffdd20e2f4c

add some utility scripts for setting up required packages, downloading and extracting required data and converting audio formats


push time in 18 days

fork kwikwag/voicefilter

Unofficial PyTorch implementation of Google AI's VoiceFilter system

http://swpark.me/voicefilter

fork in 18 days

issue comment catboost/catboost

pip install fails on Alpine Linux

Hey, is catboost going to support the Python docker image? This doesn't work:

$ docker run python:alpine pip install catboost
ERROR: Could not find a version that satisfies the requirement catboost (from versions: none)
ERROR: No matching distribution found for catboost
nirsharonclick

comment created time in 19 days

issue opened SeleniumHQ/selenium-ide

Export `echo` command with variable to Python produces invalid code

🐛 Bug Report

When a test contains an echo command that prints the value of a stored variable (e.g. x = ${x}), the Export feature produces the following Python code:

print(str("x= self.vars["x"]"))

The message is improperly quoted and the code cannot be run.
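
For comparison, here is a hand-written sketch (not actual exporter output) of code that would run; a plain dict stands in for the exported test's self.vars:

# Hand-written sketch of what a runnable export of the echo target "x = ${x}"
# could look like: the stored variable is interpolated into the message instead
# of being embedded inside the string literal.
vars = {"x": "1"}  # stands in for the self.vars dict of the exported test class
print(str("x = {}".format(vars["x"])))  # prints: x = 1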

To Reproduce

Steps to reproduce the behavior:

  1. Create a test with an echo command, with Target = x = ${x}
  2. Export the test to Python

Expected behavior

The produced code should be runnable.

Project file reproducing this issue (highly encouraged)

{
  "id": "cabee352-64f2-482a-8835-7a875b6d8ce8",
  "version": "2.0",
  "name": "bug_export_python_echo_var",
  "url": "https://httpbin.org/ip",
  "tests": [{
    "id": "378d577b-aea4-402a-9dbf-4a0498288882",
    "name": "test_echo_var",
    "commands": [{
      "id": "1af6f97b-daea-4e50-b7f5-b0bfa121324f",
      "comment": "",
      "command": "store",
      "target": "1",
      "targets": [],
      "value": "x"
    }, {
      "id": "d3703fe0-c2c9-4ea6-9fd8-af5d71018ab0",
      "comment": "",
      "command": "echo",
      "target": "x = ${x}",
      "targets": [],
      "value": ""
    }]
  }],
  "suites": [{
    "id": "56c3abfe-3832-46cf-94c6-106e59bd27ff",
    "name": "Default Suite",
    "persistSession": false,
    "parallel": false,
    "timeout": 300,
    "tests": ["378d577b-aea4-402a-9dbf-4a0498288882"]
  }],
  "urls": ["https://httpbin.org/ip"],
  "plugins": []
}

Environment

OS: Windows 10
Selenium IDE Version: 3.16.1
Selenium SIDE Runner Version: N/A
Node version: N/A
Browser: Chrome
Browser Version: 79

created time in a month

issue comment SeleniumHQ/selenium-ide

I get an error when I want to export the project as XUnit.

👍 I isolated this: it happens due to an echo command that had a Value set instead of only a Target.

omeratli

comment created time in a month

fork kwikwag/imapclient

An easy-to-use, Pythonic and complete IMAP client library

https://imapclient.readthedocs.io/

fork in 4 months

issue comment Azure/AKS

Disk attachment/mounting problems, all pods with PVCs stuck in ContainerCreating

To update on the issue I was facing, I first want to mention that I noticed all the error messages involved a single volume.

 Cannot attach data disk 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-xxxxxxxx-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again.

This led me to understand that there was a single culprit disk that I should try to detach manually. I couldn't find this GUID anywhere (looking in az disk list -g MC_xxx and az vmss show -g MC_xxx -n aks-nodepool1-xxxxxxxx-vmss --instance-id 0 --query storageProfile.dataDisks; by the way, the command from the docs gave me an empty list until I explicitly queried the instance with --instance-id). However, two disks belonging to Zookeeper (identified by their Azure tags) showed up as Attached in the MC_ resource group. Since all StatefulSets were scaled down (except for one, which wasn't Zookeeper, and whose pods were still stuck in ContainerCreating), I figured detaching them manually would be safe (and might help). That didn't do the trick, but it got me one step forward (finally, something finished successfully!) and set me on the right path. Here's a log of what I did after, with (approximate) times:

  • 15:42 detached disks with the Azure CLI
  • 16:14 manual VMSS upgrade via the Azure CLI
  • 16:25 restarted the VMSS via the Azure portal
  • 16:28 cluster same-version upgrade
  • 16:31 manual VMSS VM instance upgrade via the Azure portal
  • 16:44 scaled the K8S cluster up from 1 to 2 nodes

Between each step, I tried killing the two remaining StatefulSet-related pods to allow them to re-attach. Finally, at 16:47, the pods came out of ContainerCreating and I saw Running for the first time in ages... After scaling up all StatefulSets, everything slowly started going back to normal.

I suspect it was the second Azure-portal upgrade of the VM instance that helped, or else scaling back up after scaling down (I started doing that hoping to drain the original node, but ended up not needing to). One weird thing about the upgrade: after both the first and second upgrade, the Azure portal reported the sole instance of the VMSS (Standard_DS3_v2 size) to be running the "Latest model", but after things started running again (possibly only after scaling?), "Latest model" showed "No".

I would conclude that, for my case, a workaround for this issue might be the following (this is all still voodoo):

  1. Scale down all StatefulSets to 0 (kubectl -n namespace scale --all=true statefulset --replicas=0 for each namespace)
  2. Scale down to 1 node (az aks scale -g MC_xxx --name aks-nodepool1-xxxxxxxx-vmss --node-count 1)
  3. Ensure all VMSS disks are detached (see the sketch after this list):
    3.1. List attached volumes with az disk list -g MC_xxx --query "[?diskState=='Attached'].name"
    3.2. Cross-reference the LUNs with az vmss show -g MC_xxx -n aks-nodepool1-xxxxxxxx-vmss --instance-id 0 --query "storageProfile.dataDisks[].{name: name, lun: lun}"
    3.3. Detach them with az vmss disk detach -g MC_xxx -n aks-nodepool1 --instance-id 0 --lun x (for each LUN).
  4. Update the node again (az vmss update-instances -g MC_xxx --name aks-nodepool1-xxxxxxxx-vmss --instance-id 0)
  5. Perform a forced same-version upgrade (az aks upgrade -g xxx --name xxx-k8s --kubernetes-version 1.14.6)
  6. Update this node again (az vmss update-instances -g MC_xxx --name aks-nodepool1-xxxxxxxx-vmss --instance-id 0)
  7. Scale K8S cluster back up (kubectl -n namespace scale --all=true statefulset --replicas=x)
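
Below is a minimal Python sketch of steps 3.1-3.3 strung together, assuming the az CLI is installed and logged in; the resource group and scale-set names are placeholders, and it only prints the detach commands rather than running them:

# Minimal sketch of workaround steps 3.1-3.3: list disks still marked Attached,
# cross-reference them with the data-disk LUNs reported by the VMSS instance,
# and print a detach command for each match. Assumes the az CLI is installed
# and logged in; names below are placeholders.
import json
import subprocess

RESOURCE_GROUP = "MC_xxx"
VMSS_NAME = "aks-nodepool1-xxxxxxxx-vmss"
INSTANCE_ID = "0"

def az(*args):
    """Run an az command and return its parsed JSON output."""
    return json.loads(subprocess.check_output(["az", *args, "--output", "json"]))

# 3.1. Disks in the MC_ resource group that are still marked as Attached.
attached = set(az("disk", "list", "-g", RESOURCE_GROUP,
                  "--query", "[?diskState=='Attached'].name"))

# 3.2. Cross-reference with the data disks attached to the VMSS instance.
data_disks = az("vmss", "show", "-g", RESOURCE_GROUP, "-n", VMSS_NAME,
                "--instance-id", INSTANCE_ID,
                "--query", "storageProfile.dataDisks[].{name: name, lun: lun}") or []

# 3.3. Print a detach command for each cross-referenced LUN.
for disk in data_disks:
    if disk["name"] in attached:
        print("az vmss disk detach -g {} --vmss-name {} --instance-id {} --lun {}"
              .format(RESOURCE_GROUP, VMSS_NAME, INSTANCE_ID, disk["lun"]))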

I'm sure some steps here are extraneous, and I don't know if it'll really work the next time I encounter this problem, but it's worth writing it down in hopes that it will help me or someone else in the future...

This, of course, doesn't solve the issue, as it doesn't explain how we got here in the first place. And, truthfully, having to scale down + back up is very uncomfortable. Better than losing the PVCs, but still not good enough. Would be happy to receive any updates regarding this issue (will upgrading to the newer 1.15 preview version of Kubernetes work?).

kwikwag

comment created time in 4 months

issue opened Azure/AKS

Disk attachment/mounting problems, all pods with PVCs stuck in ContainerCreating

What happened: Pods with PVCs are stuck in ContainerCreating state, due to a problem with attachment/mounting.

I am using a VMSS-backed, westus-located K8S (1.14.6; aksEngineVersion : v0.40.2-aks) cluster. Following a crash of the Kafka pods (using Confluent Helm charts v5.3.1; see configuration below, under Environment), 2 of the 3 got stuck in the ContainerCreating state. The dashboard seems to show that all the PVCs are failing to mount because of one volume that has not been detached properly:

kafka-cp-kafka
Unable to mount volumes for pod "kafka-cp-kafka-0_default(kafkapod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "default"/"kafka-cp-kafka-0". list of unmounted volumes=[datadir-0]. list of unattached volumes=[datadir-0 jmx-config default-token-xxxcc]

kafka-cp-zookeeper
AttachVolume.Attach failed for volume "pvc-zookepvc-guid-xxxx-xxxx-xxxxxxxxxxxx" : Attach volume "kubernetes-dynamic-pvc-zookepvc-guid-xxxx-xxxx-xxxxxxxxxxxx" to instance "aks-nodepool1-00011122-vmss000000" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'diskfail-guid-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-00011122-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."
Unable to mount volumes for pod "kafka-cp-zookeeper-0_default(zookepod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "default"/"kafka-cp-zookeeper-0". list of unmounted volumes=[datadir datalogdir]. list of unattached volumes=[datadir datalogdir jmx-config default-token-xxxcc]

es-data-efk-logging-cluster-default
AttachVolume.Attach failed for volume "pvc-eslogdpvc-guid-xxxx-xxxx-xxxxxxxxxxxx" : Attach volume "kubernetes-dynamic-pvc-eslogdpvc-guid-xxxx-xxxx-xxxxxxxxxxxx" to instance "aks-nodepool1-00011122-vmss000000" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'diskfail-guid-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-00011122-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."
Unable to mount volumes for pod "es-data-efk-logging-cluster-default-0_logging(eslogdpod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "logging"/"es-data-efk-logging-cluster-default-0". list of unmounted volumes=[es-data]. list of unattached volumes=[es-data default-token-xxxdd]

es-master-efk-logging-cluster-default
AttachVolume.Attach failed for volume "pvc-eslogmpvc-guid-xxxx-xxxx-xxxxxxxxxxxx" : Attach volume "kubernetes-dynamic-pvc-eslogmpvc-guid-xxxx-xxxx-xxxxxxxxxxxx" to instance "aks-nodepool1-00011122-vmss000000" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'diskfail-guid-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-00011122-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."
Unable to mount volumes for pod "es-master-efk-logging-cluster-default-0_logging(eslogmpod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "logging"/"es-master-efk-logging-cluster-default-0". list of unmounted volumes=[es-data]. list of unattached volumes=[es-data default-token-xxxdd]

prometheus-prom-prometheus-operator-prometheus
AttachVolume.Attach failed for volume "pvc-promppvc-guid-xxxx-xxxx-xxxxxxxxxxxx" : Attach volume "kubernetes-dynamic-pvc-promppvc-guid-xxxx-xxxx-xxxxxxxxxxxx" to instance "aks-nodepool1-00011122-vmss000000" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'diskfail-guid-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-00011122-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."
Unable to mount volumes for pod "prometheus-prom-prometheus-operator-prometheus-0_monitoring(promppod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"prometheus-prom-prometheus-operator-prometheus-0". list of unmounted volumes=[prometheus-prom-prometheus-operator-prometheus-db]. list of unattached volumes=[prometheus-prom-prometheus-operator-prometheus-db config config-out prometheus-prom-prometheus-operator-prometheus-rulefiles-0 prom-prometheus-operator-prometheus-token-xxxee]

alertmanager-prom-prometheus-operator-alertmanager
AttachVolume.Attach failed for volume "pvc-promapvc-guid-xxxx-xxxx-xxxxxxxxxxxx" : Attach volume "kubernetes-dynamic-pvc-promapvc-guid-xxxx-xxxx-xxxxxxxxxxxx" to instance "aks-nodepool1-00011122-vmss000000" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'diskfail-guid-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-00011122-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."
Unable to mount volumes for pod "alertmanager-prom-prometheus-operator-alertmanager-0_monitoring(promapod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"alertmanager-prom-prometheus-operator-alertmanager-0". list of unmounted volumes=[alertmanager-prom-prometheus-operator-alertmanager-db]. list of unattached volumes=[alertmanager-prom-prometheus-operator-alertmanager-db config-volume prom-prometheus-operator-alertmanager-token-xxxff]

Running kubectl get pvc shows the PVCs in Bound state (full YAML-JSON from the Dashboard below, under Environment):

NAMESPACE    NAME                                                                                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
default      datadir-0-kafka-cp-kafka-0                                                                                   Bound    pvc-kafkad01-guid-xxxx-xxxx-xxxxxxxxxxxx  200Gi      RWO            default        3d16h
default      datadir-0-kafka-cp-kafka-1                                                                                   Bound    pvc-kafkad02-guid-xxxx-xxxx-xxxxxxxxxxxx  200Gi      RWO            default        3d16h
default      datadir-0-kafka-cp-kafka-2                                                                                   Bound    pvc-kafkad03-guid-xxxx-xxxx-xxxxxxxxxxxx  200Gi      RWO            default        3d16h
default      datadir-kafka-cp-zookeeper-0                                                                                 Bound    pvc-zookepvc-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
default      datadir-kafka-cp-zookeeper-1                                                                                 Bound    pvc-zooked02-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
default      datadir-kafka-cp-zookeeper-2                                                                                 Bound    pvc-zooked03-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
default      datalogdir-kafka-cp-zookeeper-0                                                                              Bound    pvc-zookel01-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
default      datalogdir-kafka-cp-zookeeper-1                                                                              Bound    pvc-zookel02-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
default      datalogdir-kafka-cp-zookeeper-2                                                                              Bound    pvc-zookel03-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
logging      es-data-es-data-efk-logging-cluster-default-0                                                                Bound    pvc-eslogdpvc-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        10d
logging      es-data-es-master-efk-logging-cluster-default-0                                                              Bound    pvc-eslogmpvc-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        10d
monitoring   alertmanager-prom-prometheus-operator-alertmanager-db-alertmanager-prom-prometheus-operator-alertmanager-0   Bound    pvc-promapvc-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        10d
monitoring   prom-grafana                                                                                                 Bound    pvc-grafad01-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        10d
monitoring   prometheus-prom-prometheus-operator-prometheus-db-prometheus-prom-prometheus-operator-prometheus-0           Bound    pvc-promppvc-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        10d

I tried scaling the Kafka StatefulSet down to 0, waiting a long while, and then scaling back up to 3, but the pods didn't recover.

Then I tried to scale all Deployments and StatefulSets down to 0 and do a same-version upgrade of the K8S cluster. Unfortunately, because of a problem (reported here) with the VMAccessForLinux extension I installed on the VMSS (following this guide to update SSH credentials on the nodes), the upgrade failed 2.5 hours later and the cluster remained in a Failed state. Now all of the pods with PVCs got stuck in ContainerCreating. I successfully added a second nodepool, but pods placed on the new nodes still reported the same error, so I removed the second nodepool and scaled the first nodepool down to 1. I then tried to reboot the node, both from the Azure portal and from within an SSH connection; these attempts all failed because of the issue with the extension. I then tried to gradually scale down all StatefulSets (I had to uninstall the prometheus-operator Helm release since it insisted on scaling the alertmanager StatefulSet back up) and enable only the logging StatefulSets, as they are smaller. It didn't help.

After taking down all StatefulSets, when running kubectl get nodes --output json | jq '.items[].status.volumesInUse' I get null.

What you expected to happen: Pods with PVCs should start normally, and if mounting fails, they should eventually (and reasonably quickly) retry and succeed.

How to reproduce it (as minimally and precisely as possible):

I have no idea. This happens randomly. Up to now, we have worked around it by removing our PVCs, but I don't want to do this any more, I need a solution.

Anything else we need to know?:

This is similar to the following issues, reported on Kubernetes and AKS. All of them have been closed, but none with a real solution AFAIK.

  • https://github.com/kubernetes/kubernetes/issues/67014
  • https://github.com/kubernetes/kubernetes/issues/65500
  • https://github.com/kubernetes/kubernetes/issues/75548
  • https://github.com/Azure/AKS/issues/884
  • https://github.com/Azure/AKS/issues/615
  • https://github.com/Azure/AKS/issues/166

I replaced the GUIDs to anonymize the logs, but did so in a way that keeps distinct GUIDs distinct.

Environment:

  • Kubernetes version (use kubectl version): VMSS-backed westus-located K8S (1.14.6; aksEngineVersion : v0.40.2-aks)
  • Size of cluster (how many worker nodes are in the cluster?) 1 nodepool with 3 Standard_DS3_v2 instances.
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.) Kafka, several dotnet core HTTP microservices, logging (FluentBit + ElasticSearch + Kibana stack), monitoring (prometheus + grafana).
  • Others:
cp-kafka:
  enabled: true
  brokers: 3
  persistence:
    enabled: true
    size: 200Gi
    storageClass: ~
    disksPerBroker: 1

  configurationOverrides:
    "auto.create.topics.enable": "true"
    "num.partitions": "10"
    "log.retention.bytes": "180000000000"

  • Kafka PVC YAML (kubectl get pvc xxx --output json):
{
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {
        "annotations": {
            "pv.kubernetes.io/bind-completed": "yes",
            "pv.kubernetes.io/bound-by-controller": "yes",
            "volume.beta.kubernetes.io/storage-provisioner": "kubernetes.io/azure-disk"
        },
        "creationTimestamp": "2019-10-13T12:00:00Z",
        "finalizers": [
            "kubernetes.io/pvc-protection"
        ],
        "labels": {
            "app": "cp-kafka",
            "release": "kafka"
        },
        "name": "datadir-0-kafka-cp-kafka-0",
        "namespace": "default",
        "resourceVersion": "3241128",
        "selfLink": "/api/v1/namespaces/default/persistentvolumeclaims/datadir-0-kafka-cp-kafka-0",
        "uid": "kafkad01-guid-xxxx-xxxx-xxxxxxxxxxxx"
    },
    "spec": {
        "accessModes": [
            "ReadWriteOnce"
        ],
        "resources": {
            "requests": {
                "storage": "200Gi"
            }
        },
        "storageClassName": "default",
        "volumeMode": "Filesystem",
        "volumeName": "pvc-kafkad01-guid-xxxx-xxxx-xxxxxxxxxxxx"
    },
    "status": {
        "accessModes": [
            "ReadWriteOnce"
        ],
        "capacity": {
            "storage": "200Gi"
        },
        "phase": "Bound"
    }
}

created time in 4 months

issue opened Azure/azure-linux-extensions

K8S cluster provisioning/VMSS restart failed due to VMExtensionProvisioningError on VMAccessForLinux extension

I am using a VMSS-backed westus-located K8S (1.14.6; aksEngineVersion : v0.40.2-aks) cluster.

I followed the guide at Connect with SSH to Azure Kubernetes Service (AKS) cluster nodes for maintenance or troubleshooting in order to be able to connect to my K8S nodes via SSH (to solve yet another issue). That went well; however, after a few days, I same-version upgraded my K8S cluster to try and deal with yet another issue. I received multiple (about 60) deployment errors on the corresponding MC_ resource group:

{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.","details":[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",\r\n \"details\": [\r\n {\r\n \"code\": \"VMExtensionProvisioningError\",\r\n \"message\": \"Multiple VM extensions failed to be provisioned on the VM. Please see the VM extension instance view for other failures. The first extension failed due to the error: Provisioning of VM extension 'VMAccessForLinux' has timed out. Extension installation may be taking too long, or extension status could not be obtained.\"\r\n }\r\n ]\r\n }\r\n}"}]}

And my K8S cluster ultimately (after about 2.5 hours) entered a Failed state. During and after this failing deployment loop, I tried to delete the extension with the CLI, reinstall it with different configurations (including an empty one) and different versions, and finally reinstall it as per the guide (az vmss extension set ...) with the same settings I had used originally. Each operation failed independently with the extension provisioning error above. However, after a delete, even though I got an error message, listing extensions with az vmss extension list showed that the extension had indeed disappeared from the VMSS, and running two consecutive deletes showed:

$ az vmss extension delete --resource-group $CLUSTER_RESOURCE_GROUP --vmss-name $SCALE_SET_NAME --name VMAccessForLinux
ERROR: Extension VMAccessForLinux not found

However, when restarting the VMSS via the Azure portal (by accessing the MC_ resource group), I still received the above error.

I then tried deleting the extension from the Azure portal, verifying it was deleted using the CLI, and then retrying a same-version upgrade of the K8S cluster to recover from the Failed state. I got the same errors, even though the extension did not show up in the portal's VMSS Extensions page. This time I got 40 failed deployments (with the initial one taking 53 minutes), again failing after 2.5 hours.

Luckily (or not), I had SSH access to the node ( :) ), so I could locate the logs. Surprisingly, I saw that the installed version was 1.5.3, even though when I originally installed the extension per the guide, I used 1.4. Perhaps the version change happened during my attempts to delete/reset the extension when the cluster first failed?

2019/10/16 17:49:25 [Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] sequence number is 0
2019/10/16 17:49:25 [Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] setting file path is/var/lib/waagent/Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3/config/0.settings
2019/10/16 17:49:25 [Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] JSON config:
2019/10/16 17:49:25 ERROR:[Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] JSON exception decoding
2019/10/16 17:49:25 ERROR:[Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] JSON error processing settings file:
2019/10/16 17:49:25 [Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] Current sequence number, 0, is not greater than the sequnce number of the most recent executed configuration. Exiting...

The times don't coincide with the failing MC_ deployments, though, which repeatedly fail every 4 minutes. The file /var/lib/waagent/Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3/config/0.settings is empty, which can explain the error, but when I tried rewriting it to contain an empty JSON document ({}) and then restarted the VMSS, it was simply re-written.
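
For what it's worth, an empty settings file is enough to trigger that JSON error; here's a quick check, assuming the extension does nothing more exotic than parsing the file contents as JSON:

# Quick check: an empty 0.settings file fails JSON parsing, while an empty JSON
# document ({}) parses fine -- assuming the extension simply parses the file
# contents as JSON.
import json

try:
    json.loads("")  # contents of the empty 0.settings file
except json.JSONDecodeError as err:
    print("empty file:", err)  # Expecting value: line 1 column 1 (char 0)

print("empty JSON document:", json.loads("{}"))  # -> {}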

I'm at a loss and so is my K8S cluster. Help?

created time in 4 months

create branch kwikwag/IIC

branch : bugfix/relative_imports

created branch time in 4 months

fork kwikwag/IIC

Invariant Information Clustering for Unsupervised Image Classification and Segmentation

fork in 4 months

issue comment appbaseio-apps/reactivesearch-starter-app

Code update needed.

Also, the API key is different. It should be 4HWI27QmA:58c731f7-79ab-4f55-a590-7e15c7e36721

zhaodaolimeng

comment created time in 5 months

started kumofx/kumodocs

started time in 5 months
