cannot create resource
ERROR: Job failed (system failure): pods is forbidden: User "system:serviceaccount:dev:default" cannot create resource "pods" in API group "" in the namespace "xxx"
Insufficient permissions; the message may also be followed by something like "clusterrole ... at scope ...". Create a ClusterRoleBinding to grant the required access.
In the example above, system:serviceaccount indicates the subject type (a ServiceAccount), dev is the namespace, default is the user (the ServiceAccount name), and the empty API group "" refers to the core group.
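As a fix, here is a minimal sketch of the RBAC objects, assuming the job only needs to manage pods; the role name, verbs, and cluster-wide scope are placeholders to adapt (a namespaced Role/RoleBinding is usually preferable when access to a single namespace is enough):
```yaml
# Sketch only: grants the dev/default ServiceAccount permission to manage pods.
# The role name and verb list are illustrative; scope them down to what the job actually needs.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-manager
rules:
  - apiGroups: [""]          # "" is the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-manager-dev-default
subjects:
  - kind: ServiceAccount
    name: default
    namespace: dev
roleRef:
  kind: ClusterRole
  name: pod-manager
  apiGroup: rbac.authorization.k8s.io
```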
http error: ResponseCode: 503
loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503
https://github.com/k3s-io/k3s/issues/5344
Unable to attach or mount volumes
longhorn: Unable to attach or mount volumes: unmounted volumes=[volv], unattached volumes=[volv kube-api-access-4tqrk]: timed out waiting for the condition
https://github.com/longhorn/longhorn/issues/3753
Cluster agent is not connected
After importing a cluster, the Rancher server shows this error: [Ready False 38 mins ago [Disconnected] Cluster agent is not connected]
https://github.com/rancher/rancher/issues/36589
When a layer-7 load balancer is used in front of the k3s server, SSL (client certificate) authentication cannot be used.
x509: certificate signed by unknown authority
When nginx is used as the load balancer, the error is: CA cert validation failed: Get "https://127.0.0.1:6444/cacerts": x509: certificate signed by unknown authority
https://github.com/k3s-io/k3s/issues/5856
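One common way to avoid this is to run nginx in front of the k3s servers as a layer-4 (TCP) passthrough instead of terminating TLS at the proxy. A minimal sketch, assuming two hypothetical server nodes; addresses and ports must be adapted:
```nginx
# Sketch: stream (TCP) passthrough so the k3s client certificates reach the API server untouched.
stream {
    upstream k3s_servers {
        server 10.0.0.1:6443;   # placeholder k3s server addresses
        server 10.0.0.2:6443;
    }
    server {
        listen 6443;
        proxy_pass k3s_servers;
    }
}
```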
K8s Exited With 137 During a Rolling Update
During a k8s rolling update, an "exited with 137" error was reported:
Exec lifecycle hook ([sleep 40s]) for Container "xxx" in Pod "xxx_xxx(xxx)" failed - error: command 'sleep 40s' exited with 137: , message: ""
This error seems to be reported when the container shuts down. Most online references attribute exit code 137 to running out of memory, but here the container had enough memory.
Investigation showed the cause was spec.template.spec.terminationGracePeriodSeconds being set to 30 in the deployment configuration.
terminationGracePeriodSeconds is the maximum time allowed for a container to shut down gracefully; once it is exceeded, k8s force-kills the pod, which is presumably what produced exit code 137.
According to that blog post, terminationGracePeriodSeconds is not counted from the moment the preStop hook finishes; the timer starts when the shutdown sequence begins.
The preStop hook was meant to delay shutdown by 40 seconds (sleep 40s), but at the 30-second mark the terminationGracePeriodSeconds limit had already been exceeded, so the process was force-killed with SIGKILL.
The SIGKILL signal number is 9, and the resulting exit code is 128 + 9 = 137; see that blog post if you are interested in the details.
In summary, setting terminationGracePeriodSeconds to a value greater than 40 (e.g. 60) in the example above avoids this error.
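For reference, a minimal sketch of the relevant Deployment fragment, assuming the 40-second preStop delay is kept; the container name and image are placeholders:
```yaml
# Sketch: the grace period must be longer than the preStop sleep, plus some margin for SIGTERM handling.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # > the 40s preStop delay
      containers:
        - name: app                       # placeholder name
          image: example/app:latest       # placeholder image
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "40s"]
```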
CoreDNS: No files matching import glob pattern
This is a CoreDNS warning log; see: https://github.com/k3s-io/k3s/issues/4919
```plaintext
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
```
Fix: create the following ConfigMap and apply it:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  example.server: |
    example.org {
        log
        whoami
    }
```
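Assuming the manifest above is saved as coredns-custom.yaml (the filename is arbitrary), applying it looks like this; if the warning persists, restarting the CoreDNS deployment forces it to pick up the custom config:
```sh
kubectl apply -f coredns-custom.yaml
kubectl -n kube-system rollout restart deployment coredns
```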
Problems when pm2 is combined with resource limits
Premise: pm2 is used as the program's startup entrypoint.
Action: the pod's resources are configured with a memory limit of 500Mi.
Symptom: when a replica's memory usage exceeds 500Mi, the pod is not killed as expected; instead pm2 destroys and restarts the service internally.
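For context, the limit in question was roughly the following sketch (the request value is an illustrative assumption). A likely explanation is that the OOM killer terminates the Node.js worker rather than pm2 (PID 1), so pm2 restarts it inside the container and the pod itself never exits:
```yaml
# Sketch: the 500Mi memory limit from the scenario above; the request is a placeholder.
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 500Mi
```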
Conflict between a Kubernetes Service and SpringBoot application startup parameters
Context
- Deployed a SpringBoot application to a Kubernetes cluster
- Created a Service named "server"
Description
- When deployed to the Kubernetes cluster, the container logs the following error:
```plaintext
Caused by: org.springframework.core.convert.ConversionFailedException: Failed to convert from type [java.lang.String] to type [java.lang.Integer] for value 'tcp://192.168.1.55:8080'; nested exception is java.lang.NumberFormatException: For input string: "tcp://192.168.1.55:8080"
```
- When deployed standalone with Docker, no error occurs
- The error shows that the SpringBoot framework fails to convert the string "tcp://192.168.1.55:8080" to an integer
Cause
When a Pod runs on a Node, the kubelet adds a set of environment variables for each active Service. It injects **{SVCNAME}_SERVICE_HOST and {SVCNAME}_SERVICE_PORT**, where the Service name is upper-cased and "-" is converted to "_", along with Docker-links-style variables such as {SVCNAME}_PORT=tcp://<cluster-ip>:<port>. For example, a Service redis-primary that exposes TCP port 6379 and has been assigned cluster IP address 10.0.0.11 produces the following environment variables:
```plaintext
REDIS_PRIMARY_SERVICE_HOST=10.0.0.11
REDIS_PRIMARY_SERVICE_PORT=6379
REDIS_PRIMARY_PORT=tcp://10.0.0.11:6379
```
Summary
- A SpringBoot application reads the pod environment variable **SERVER_PORT** as the port it listens on, and expects a numeric value.
- When the application's Service in Kubernetes is named "server", Kubernetes injects into every pod in that namespace an environment variable of the form **SERVER_PORT=tcp://<service-ip>:<port exposed by the image's Dockerfile>** for pod-to-pod service discovery. That value is a string (in the error above it was "SERVER_PORT=tcp://192.168.1.55:8080"), which breaks the conversion.
Solution
- Do not name a SpringBoot application's Kubernetes Service "server"; use something like "<project-name>-<application-name>" instead.
- Alternatively, set "SERVER_PORT=8080" explicitly in the Deployment manifest to override the injected value (see the sketch below).
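A sketch of the override approach, assuming the application listens on 8080; the container name and image are placeholders:
```yaml
# Sketch: environment variables defined in the pod spec take precedence over
# the Service variables injected by the kubelet.
spec:
  template:
    spec:
      containers:
        - name: springboot-app            # placeholder
          image: example/springboot-app   # placeholder
          env:
            - name: SERVER_PORT
              value: "8080"
```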
Workloads stuck in the Pending-Upgrade state
Problem:
```plaintext
time="2022-07-19T16:12:08Z" level=info msg="preparing upgrade for fleet-agent-c-97hcq"
time="2022-07-19T16:12:08Z" level=info msg="getting history for release fleet-agent-c-97hcq"
time="2022-07-19T16:12:08Z" level=error msg="error syncing 'cluster-fleet-default-c-97hcq-86145229ab95/fleet-agent-c-97hcq': handler bundle-deploy: another operation (install/upgrade/rollback) is in progress, requeuing"
```
Run the following command against the cluster where the fleet-agent is running:
```sh
kubectl get secret -A -l status=pending-upgrade
```
It will show the secret that is causing the pending-upgrade state, for example:
```plaintext
NAMESPACE             NAME                                        TYPE                 DATA   AGE
```
Based on the above output, run through the following steps:
- Back up the fleet secret that is causing the pending-upgrade state to a YAML file (and save it to a persistent location):
```sh
kubectl get secret -n cattle-fleet-system sh.helm.release.v1.fleet-agent-c-97hcq.v2 -oyaml > fleet-agent-c-97hcq.yaml
```
- Delete the secret:
```sh
kubectl delete secret -n cattle-fleet-system sh.helm.release.v1.fleet-agent-c-97hcq.v2
```
- Check the state again (see the command below); with the stale release secret removed, the upgrade should no longer be reported as pending.
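To verify, re-run the label query from earlier; an empty result means no release is stuck in pending-upgrade anymore:
```sh
kubectl get secret -A -l status=pending-upgrade
```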