cannot create resource
ERROR: Job failed (system failure): pods is forbidden: User "system:serviceaccount:dev:default" cannot create resource "pods" in API group "" in the namespace "xxx"
Insufficient permissions; the message may also be followed by something like "clusterrole ... at scope ...". Create a ClusterRoleBinding to grant the required access.
In the example above, system:serviceaccount indicates the subject type (a ServiceAccount), dev is the namespace, default is the user (the ServiceAccount name), and the empty API group "" refers to the core group.
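As a fix, here is a minimal sketch of the RBAC objects, assuming the job only needs to manage pods; the role name, verbs, and cluster-wide scope are placeholders to adapt (a namespaced Role/RoleBinding is usually preferable when access to a single namespace is enough):
```yaml
# Sketch only: grants the dev/default ServiceAccount permission to manage pods.
# The role name and verb list are illustrative; scope them down to what the job actually needs.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-manager
rules:
  - apiGroups: [""]          # "" is the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-manager-dev-default
subjects:
  - kind: ServiceAccount
    name: default
    namespace: dev
roleRef:
  kind: ClusterRole
  name: pod-manager
  apiGroup: rbac.authorization.k8s.io
```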
http error: ResponseCode: 503
loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503
https://github.com/k3s-io/k3s/issues/5344
Unable to attach or mount volumes
longhorn: Unable to attach or mount volumes: unmounted volumes=[volv], unattached volumes=[volv kube-api-access-4tqrk]: timed out waiting for the condition
https://github.com/longhorn/longhorn/issues/3753
Cluster agent is not connected
After importing a cluster, the Rancher server shows this error: [Ready False 38 mins ago [Disconnected] Cluster agent is not connected]
https://github.com/rancher/rancher/issues/36589
When a layer-7 load balancer is used in front of the k3s server, SSL (client certificate) authentication cannot be used.
x509: certificate signed by unknown authority
When nginx is used as the load balancer, the error is: CA cert validation failed: Get "https://127.0.0.1:6444/cacerts": x509: certificate signed by unknown authority
https://github.com/k3s-io/k3s/issues/5856
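One common way to avoid this is to run nginx in front of the k3s servers as a layer-4 (TCP) passthrough instead of terminating TLS at the proxy. A minimal sketch, assuming two hypothetical server nodes; addresses and ports must be adapted:
```nginx
# Sketch: stream (TCP) passthrough so the k3s client certificates reach the API server untouched.
stream {
    upstream k3s_servers {
        server 10.0.0.1:6443;   # placeholder k3s server addresses
        server 10.0.0.2:6443;
    }
    server {
        listen 6443;
        proxy_pass k3s_servers;
    }
}
```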
K8s Exited With 137 During a Rolling Update
During a k8s rolling update, an "exited with 137" error was reported:
Exec lifecycle hook ([sleep 40s]) for Container "xxx" in Pod "xxx_xxx(xxx)" failed - error: command 'sleep 40s' exited with 137: , message: ""
This error seems to be reported when the container shuts down. Most online references attribute exit code 137 to running out of memory, but here the container had enough memory.
Investigation showed the cause was spec.template.spec.terminationGracePeriodSeconds being set to 30 in the deployment configuration.
terminationGracePeriodSeconds is the maximum time allowed for a container to shut down gracefully; once it is exceeded, k8s force-kills the pod, which is presumably what produced exit code 137.
According to that blog post, terminationGracePeriodSeconds is not counted from the moment the preStop hook finishes; the timer starts when the shutdown sequence begins.
The preStop hook was meant to delay shutdown by 40 seconds (sleep 40s), but at the 30-second mark the terminationGracePeriodSeconds limit had already been exceeded, so the process was force-killed with SIGKILL.
The SIGKILL signal number is 9, and the resulting exit code is 128 + 9 = 137; see that blog post if you are interested in the details.
In summary, setting terminationGracePeriodSeconds to a value greater than 40 (e.g. 60) in the example above avoids this error.
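For reference, a minimal sketch of the relevant Deployment fragment, assuming the 40-second preStop delay is kept; the container name and image are placeholders:
```yaml
# Sketch: the grace period must be longer than the preStop sleep, plus some margin for SIGTERM handling.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # > the 40s preStop delay
      containers:
        - name: app                       # placeholder name
          image: example/app:latest       # placeholder image
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "40s"]
```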
CoreDNS: No files matching import glob pattern
This is a CoreDNS warning log; see: https://github.com/k3s-io/k3s/issues/4919
```plaintext
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
```
Fix: create the following ConfigMap and apply it:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  example.server: |
    example.org {
        log
        whoami
    }
```
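Assuming the manifest above is saved as coredns-custom.yaml (the filename is arbitrary), applying it looks like this; if the warning persists, restarting the CoreDNS deployment forces it to pick up the custom config:
```sh
kubectl apply -f coredns-custom.yaml
kubectl -n kube-system rollout restart deployment coredns
```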
Problems when pm2 is combined with resource limits
Premise: pm2 is used as the program's startup entrypoint.
Action: the pod's resources are configured with a memory limit of 500Mi.
Symptom: when a replica's memory usage exceeds 500Mi, the pod is not killed as expected; instead pm2 destroys and restarts the service internally.
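For context, the limit in question was roughly the following sketch (the request value is an illustrative assumption). A likely explanation is that the OOM killer terminates the Node.js worker rather than pm2 (PID 1), so pm2 restarts it inside the container and the pod itself never exits:
```yaml
# Sketch: the 500Mi memory limit from the scenario above; the request is a placeholder.
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 500Mi
```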
Conflict between a Kubernetes Service and SpringBoot application startup parameters
Context
- Deployed a SpringBoot application to a Kubernetes cluster
- Created a Service named "server"
Description
- When deployed to the Kubernetes cluster, the container logs the following error:
```plaintext
Caused by: org.springframework.core.convert.ConversionFailedException: Failed to convert from type [java.lang.String] to type [java.lang.Integer] for value 'tcp://192.168.1.55:8080'; nested exception is java.lang.NumberFormatException: For input string: "tcp://192.168.1.55:8080"
```
- When deployed standalone with Docker, no error occurs
- The error shows that the SpringBoot framework fails to convert the string "tcp://192.168.1.55:8080" to an integer
Cause
When a Pod runs on a Node, the kubelet adds a set of environment variables for each active Service. It injects **{SVCNAME}_SERVICE_HOST and {SVCNAME}_SERVICE_PORT**, where the Service name is upper-cased and "-" is converted to "_", along with Docker-links-style variables such as {SVCNAME}_PORT=tcp://<cluster-ip>:<port>. For example, a Service redis-primary that exposes TCP port 6379 and has been assigned cluster IP address 10.0.0.11 produces the following environment variables:
```plaintext
REDIS_PRIMARY_SERVICE_HOST=10.0.0.11
REDIS_PRIMARY_SERVICE_PORT=6379
REDIS_PRIMARY_PORT=tcp://10.0.0.11:6379
```
Summary
- A SpringBoot application reads the pod environment variable **SERVER_PORT** as the port it listens on, and expects a numeric value.
- When the application's Service in Kubernetes is named "server", Kubernetes injects into every pod in that namespace an environment variable of the form **SERVER_PORT=tcp://<service-ip>:<port exposed by the image's Dockerfile>** for pod-to-pod service discovery. That value is a string (in the error above it was "SERVER_PORT=tcp://192.168.1.55:8080"), which breaks the conversion.
Solution
- Do not name a SpringBoot application's Kubernetes Service "server"; use something like "<project-name>-<application-name>" instead.
- Alternatively, set "SERVER_PORT=8080" explicitly in the Deployment manifest to override the injected value (see the sketch below).
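A sketch of the override approach, assuming the application listens on 8080; the container name and image are placeholders:
```yaml
# Sketch: environment variables defined in the pod spec take precedence over
# the Service variables injected by the kubelet.
spec:
  template:
    spec:
      containers:
        - name: springboot-app            # placeholder
          image: example/springboot-app   # placeholder
          env:
            - name: SERVER_PORT
              value: "8080"
```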
Workloads stuck in the Pending-Upgrade state
Problem:
```plaintext
time="2022-07-19T16:12:08Z" level=info msg="preparing upgrade for fleet-agent-c-97hcq"
time="2022-07-19T16:12:08Z" level=info msg="getting history for release fleet-agent-c-97hcq"
time="2022-07-19T16:12:08Z" level=error msg="error syncing 'cluster-fleet-default-c-97hcq-86145229ab95/fleet-agent-c-97hcq': handler bundle-deploy: another operation (install/upgrade/rollback) is in progress, requeuing"
```
Run the following command against the cluster where the fleet-agent is running:
```sh
kubectl get secret -A -l status=pending-upgrade
```
It will show the secret that is causing the pending-upgrade state, for example:
```plaintext
NAMESPACE             NAME                                        TYPE                 DATA   AGE
```
Based on the above output, run through the following steps:
- Back up the fleet secret that is causing the pending-upgrade state to a YAML file (and save it to a persistent location):
```sh
kubectl get secret -n cattle-fleet-system sh.helm.release.v1.fleet-agent-c-97hcq.v2 -oyaml > fleet-agent-c-97hcq.yaml
```
- Delete the secret:
```sh
kubectl delete secret -n cattle-fleet-system sh.helm.release.v1.fleet-agent-c-97hcq.v2
```
- Check the state again (see the command below); with the stale release secret removed, the upgrade should no longer be reported as pending.
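To verify, re-run the label query from earlier; an empty result means no release is stuck in pending-upgrade anymore:
```sh
kubectl get secret -A -l status=pending-upgrade
```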