
A One-Minute Summary of Common k8s Diagnostic Commands (Common k8s Failures)

goqiw 2025-05-05 15:41:34

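In a complex Kubernetes environment, diagnosing and troubleshooting problems is an essential skill for every administrator and developer. This article walks through the common diagnostic commands for Kubernetes clusters, along with techniques and the situations in which to use them, to help you locate and resolve problems quickly.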

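I. Cluster and Node Diagnostics

1. Getting basic cluster information

 # Show basic cluster information: the addresses of the control plane and core services
 kubectl cluster-info
 
 # Dump detailed cluster state for debugging (the output is very verbose)
 kubectl cluster-info dump
 
 # Show the kubectl configuration, including the current context, to check which cluster you are connected to
 kubectl config view
 
 # Switch the cluster context
 kubectl config use-context <context-name>

Use case: run these commands when you suspect a cluster connection problem or need to confirm which cluster you are operating on.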

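2. Checking and managing node status

 # List basic information for all nodes
 kubectl get nodes
 
 # List nodes with extra columns: internal and external IP, OS image, kernel version, and container runtime
 kubectl get nodes -o wide
 
 # List nodes marked unschedulable (cordoned); this selector does not match NotReady nodes
 kubectl get nodes --field-selector spec.unschedulable=true
 
 # Show detailed information for a specific node
 kubectl describe node <node-name>
 
 # Check the node's status conditions
 kubectl describe node <node-name> | grep Conditions -A5
 
 # Show the node's operating system and kernel version
 kubectl get node <node-name> -o jsonpath='{.status.nodeInfo.osImage} {.status.nodeInfo.kernelVersion}'
 
 # Check the taints on the node
 kubectl describe node <node-name> | grep Taints

Use case: when a node cannot schedule Pods or becomes NotReady, use these commands to check the node's health and configuration.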
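To list only the nodes that are not Ready, the STATUS column of `kubectl get nodes` can be filtered. The following is a minimal sketch; the function name `not_ready_nodes` is hypothetical, and it assumes the default table layout in which STATUS is the second column.

```shell
# List nodes whose STATUS column does not start with "Ready".
# Assumes the default `kubectl get nodes` layout (NAME, STATUS, ...).
# "Ready,SchedulingDisabled" still counts as Ready; "NotReady" does not.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}
```

Calling `not_ready_nodes` prints one node name per line and prints nothing when every node is Ready.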

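3. Node maintenance operations

 # Mark the node as unschedulable (before maintenance)
 kubectl cordon <node-name>
 
 # Drain the Pods from the node (graceful eviction)
 kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
 
 # Make the node schedulable again
 kubectl uncordon <node-name>
 
 # Add a label to the node
 kubectl label node <node-name> <label-key>=<label-value>
 
 # Add a taint to the node
 kubectl taint node <node-name> <taint-key>=<taint-value>:<effect>

Use case: when a node needs maintenance, an upgrade, or a repair, use these commands to move workloads off it safely and bring it back afterwards.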
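The cordon and drain steps can be chained so that the sequence stops at the first failure. This is only a sketch: the function name `drain_for_maintenance` and the 300-second timeout are illustrative assumptions, not values from the article. `kubectl drain` also cordons the node itself, so the explicit cordon is redundant but harmless.

```shell
# Cordon and drain a node, stopping at the first failing step.
# The 300s timeout is an illustrative value, not a recommendation.
drain_for_maintenance() {
  node="$1"
  kubectl cordon "$node" || return 1
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s || return 1
  echo "Node $node drained; run 'kubectl uncordon $node' after maintenance."
}
```

Usage: `drain_for_maintenance <node-name>`. Uncordoning is left as a manual step so that the node only rejoins scheduling once the maintenance is confirmed complete.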

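4. Node resource monitoring

 # Show resource usage for all nodes (requires a metrics source such as metrics-server)
 kubectl top nodes
 
 # Show resource usage for a specific node
 kubectl top node <node-name>
 
 # Check the node's capacity and allocatable resources
 kubectl describe node <node-name> | grep -E "Capacity|Allocatable"
 
 # Start a debugging Pod on the node with kubectl debug (the node's filesystem is mounted at /host)
 kubectl debug node/<node-name> -it --image=ubuntu

Use case: when a node is short on resources or has performance problems, use these commands to analyze resource usage.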

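II. Pod Diagnostics and Troubleshooting

1. Querying basic Pod status

 # List Pods in all namespaces
 kubectl get pods --all-namespaces
 
 # List all Pods in a specific namespace
 kubectl get pods -n <namespace>
 
 # Filter Pods by label
 kubectl get pods -l app=<app-label> -n <namespace>
 
 # Show Pods with extra columns, including IP and assigned node
 kubectl get pods -o wide -n <namespace>
 
 # Filter by Pod phase, for example to list all failed Pods
 kubectl get pods --field-selector status.phase=Failed -n <namespace>
 
 # Show the full Pod configuration as YAML or JSON
 kubectl get pod <pod-name> -n <namespace> -o yaml
 kubectl get pod <pod-name> -n <namespace> -o json

Use case: routine monitoring of Pod status, checking whether a deployment succeeded, and quickly filtering Pods in a particular state.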

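2. Analyzing Pod details

 # Show the detailed description of a Pod
 kubectl describe pod <pod-name> -n <namespace>
 
 # Show a specific field of the Pod (here, its phase)
 kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.phase}'
 
 # Show the Pod's container statuses
 kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses}'
 
 # Check the Pod's Ready condition
 kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
 
 # Show which node the Pod was scheduled to
 kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}'
 
 # Show the restart count of the Pod's first container
 kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].restartCount}'

Use case: when a Pod fails to start or behaves abnormally, dig into its status and configuration.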

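3. Pod log analysis

 # Get the Pod's logs
 kubectl logs <pod-name> -n <namespace>
 
 # Get the logs of a specific container in the Pod
 kubectl logs <pod-name> -c <container-name> -n <namespace>
 
 # Follow the Pod's logs in real time
 kubectl logs -f <pod-name> -n <namespace>
 
 # Show the most recent 100 log lines
 kubectl logs --tail=100 <pod-name> -n <namespace>
 
 # Show logs from the last hour
 kubectl logs --since=1h <pod-name> -n <namespace>
 
 # Show logs from the previous container instance (the one that ran before the crash)
 kubectl logs --previous <pod-name> -n <namespace>
 
 # Save the Pod's logs to a file
 kubectl logs <pod-name> -n <namespace> > pod_logs.txt

Use case: investigating application errors, analyzing application behavior, or finding out why a container crashed.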
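When a container may or may not have restarted, it can be convenient to try the previous instance's logs first and fall back to the current ones. A minimal sketch, with a hypothetical function name `crash_logs` and an arbitrary 100-line tail:

```shell
# Print the last 100 log lines from the previous container instance;
# if kubectl reports an error (for example, the container has never
# restarted), fall back to the logs of the current instance.
crash_logs() {
  pod="$1"; ns="$2"
  kubectl logs --previous --tail=100 "$pod" -n "$ns" 2>/dev/null ||
    kubectl logs --tail=100 "$pod" -n "$ns"
}
```

Usage: `crash_logs <pod-name> <namespace>`. For a multi-container Pod, a `-c <container-name>` argument would need to be added to both calls.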

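4. Interacting with and debugging Pods

 # Run a command in the Pod
 kubectl exec <pod-name> -n <namespace> -- <command>
 
 # Open an interactive shell in the Pod
 kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
 
 # Run a command in a specific container of a multi-container Pod
 kubectl exec -it <pod-name> -c <container-name> -n <namespace> -- /bin/bash
 
 # Copy a file from the Pod to the local machine (requires tar in the container image)
 kubectl cp <namespace>/<pod-name>:<path-in-container> <local-path>
 
 # Copy a file from the local machine to the Pod
 kubectl cp <local-path> <namespace>/<pod-name>:<path-in-container>

Use case: when you need to run commands inside a container, inspect files, or troubleshoot by hand.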

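5. Using ephemeral debug containers

# Attach an ephemeral debug container to an existing Pod
kubectl debug -it <pod-name> -n <namespace> --image=busybox:1.28 --target=<container-name>

# Create a copy of the Pod with an added debug container (the original Pod is left unchanged)
kubectl debug <pod-name> -it --image=busybox:1.28 --share-processes --copy-to=<new-pod-name>

# Create a copy of the Pod with the container's command changed to a shell
kubectl debug <pod-name> -it --copy-to=<new-pod-name> --container=<container-name> -- sh

# Debug with elevated privileges (requires a kubectl version that supports the sysadmin profile)
kubectl debug -it <pod-name> --image=busybox:1.28 --profile=sysadmin

Use case: debugging without affecting the production Pod, or when the original container lacks the necessary debugging tools.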

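6. Pod event analysis

# Get the events related to a specific Pod
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

# Show events sorted by time
kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp

# Show only Warning events
kubectl get events -n <namespace> --field-selector type=Warning

Use case: when a Pod fails to schedule, fails to start, or is terminated, look up the related events to determine the root cause.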
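The three filters above can be combined, because field selectors accept a comma-separated list. A minimal sketch, assuming a hypothetical helper name `pod_warnings`:

```shell
# Show only Warning events for one Pod, oldest first.
# Field selectors are combined with a comma (logical AND).
pod_warnings() {
  pod="$1"; ns="$2"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,type=Warning" \
    --sort-by=.metadata.creationTimestamp
}
```

Usage: `pod_warnings <pod-name> <namespace>`.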

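III. Workload and Controller Diagnostics

1. Deployment diagnostics

# List all Deployments
kubectl get deployments -n <namespace>

# Show Deployment details
kubectl describe deployment <deployment-name> -n <namespace>

# Check the rolling update status of a Deployment
kubectl rollout status deployment/<deployment-name> -n <namespace>

# Show the Deployment's revision history
kubectl rollout history deployment/<deployment-name> -n <namespace>

# Roll back to the previous revision
kubectl rollout undo deployment/<deployment-name> -n <namespace>

# Roll back to a specific revision
kubectl rollout undo deployment/<deployment-name> -n <namespace> --to-revision=<revision>

# Change the Deployment's replica count
kubectl scale deployment <deployment-name> -n <namespace> --replicas=<replicas>

Use case: when a deployment update fails, or when you need to scale up or down quickly.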

2. StatefulSet diagnostics

# List all StatefulSets
kubectl get statefulsets -n <namespace>

# Show StatefulSet details
kubectl describe statefulset <statefulset-name> -n <namespace>

# Check the rolling update status of a StatefulSet
kubectl rollout status statefulset/<statefulset-name> -n <namespace>

# Scale a StatefulSet
kubectl scale statefulset <statefulset-name> -n <namespace> --replicas=<replicas>

Use case: deploying and checking the status of stateful applications such as database services.

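3. DaemonSet diagnostics

# List all DaemonSets
kubectl get daemonsets -n <namespace>

# Show DaemonSet details
kubectl describe daemonset <daemonset-name> -n <namespace>

# Check the DaemonSet's Pod on each node
kubectl get pods -n <namespace> -o wide | grep <daemonset-name>

Use case: checking the deployment status of node-level components such as log collectors and monitoring agents.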

4. Job and CronJob diagnostics

# List all Jobs
kubectl get jobs -n <namespace>

# Show Job details
kubectl describe job <job-name> -n <namespace>

# List all CronJobs
kubectl get cronjobs -n <namespace>

# Show CronJob details
kubectl describe cronjob <cronjob-name> -n <namespace>

# Show the status of a Job's Pods
kubectl get pods -n <namespace> --selector=job-name=<job-name>

Use case: investigating failed batch tasks or scheduling problems.

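IV. Service and Network Diagnostics

1. Service discovery and connectivity

# List all Services
kubectl get svc -n <namespace>

# Show Service details
kubectl describe svc <service-name> -n <namespace>

# Check the Service's endpoints (backend Pods)
kubectl get endpoints <service-name> -n <namespace>

# Verify that the Service selector matches the Pod labels
kubectl get pods -n <namespace> -l <service-selector>

# Test the Service connection from inside the cluster
kubectl run -it --rm debug --image=busybox:1.28 -- wget -qO- <service-name>.<namespace>.svc.cluster.local:<port>

# Show the Service's ClusterIP
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.clusterIP}'

# Show the NodePort of the Service's first port
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.ports[0].nodePort}'

Use case: investigating an unreachable Service or connectivity problems.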
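Checking the selector against Pod labels usually means reading the selector from the Service and retyping it after `-l`. The sketch below automates that step with a Go template; the function name `pods_for_service` is hypothetical, and the approach assumes the Service uses a plain label selector.

```shell
# Print the Pods matched by a Service's selector.
pods_for_service() {
  svc="$1"; ns="$2"
  # Render the selector map as "k1=v1,k2=v2," and strip the trailing comma.
  selector="$(kubectl get svc "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' |
    sed 's/,$//')"
  if [ -z "$selector" ]; then
    echo "Service $svc has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$ns" -l "$selector" -o wide
}
```

Usage: `pods_for_service <service-name> <namespace>`. An empty result with a non-empty selector suggests that no Pod carries the expected labels, which would also explain empty endpoints.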

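2. Ingress diagnostics

# List all Ingresses
kubectl get ingress -n <namespace>

# Show Ingress details
kubectl describe ingress <ingress-name> -n <namespace>

# Check the Ingress controller Pods (Pod names depend on the controller in use)
kubectl get pods -n <ingress-controller-namespace> | grep ingress

# Check the Ingress controller logs
kubectl logs -n <ingress-controller-namespace> <ingress-controller-pod-name>

Use case: when external traffic is not routed correctly to a Service.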

3. DNS diagnostics

# Show the cluster DNS Service
kubectl get svc kube-dns -n kube-system

# Check the DNS resolver configuration inside a Pod
kubectl run -it --rm debug --image=busybox:1.28 -- cat /etc/resolv.conf

# Test DNS resolution
kubectl run -it --rm debug --image=busybox:1.28 -- nslookup <service-name>.<namespace>.svc.cluster.local

# Show the DNS Pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check the DNS Pod logs
kubectl logs -n kube-system -l k8s-app=kube-dns

Use case: when Service name resolution fails or DNS queries time out.

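4. Network policy diagnostics

# List all network policies
kubectl get networkpolicies -n <namespace>

# Show network policy details
kubectl describe networkpolicy <policy-name> -n <namespace>

# Test a network connection with a 2-second timeout
kubectl exec -it <source-pod-name> -n <source-namespace> -- wget -T 2 -O- <target-service>:<port>

# Trace the network route (requires traceroute in the image)
kubectl exec -it <pod-name> -n <namespace> -- traceroute <target-ip>

# Show the Pod's IP address
kubectl get pods <pod-name> -n <namespace> -o jsonpath='{.status.podIP}'

Use case: when communication between Pods is blocked, or for network isolation problems.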

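5. Advanced network debugging

# Create a network diagnostics Pod
kubectl run net-debug --image=nicolaka/netshoot --rm -it -- /bin/bash

# Inspect the network configuration with an ephemeral container
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# Show the container's network interfaces
kubectl exec -it <pod-name> -n <namespace> -- ip addr

# Check the network routes
kubectl exec -it <pod-name> -n <namespace> -- ip route

# Capture network traffic (requires tcpdump in the image)
kubectl exec -it <pod-name> -n <namespace> -- tcpdump -i eth0

Use case: investigating complex network problems such as MTU mismatches and routing issues.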

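V. Storage Diagnostics

1. PersistentVolume and PersistentVolumeClaim

# List all PVs
kubectl get pv

# Show PV details
kubectl describe pv <pv-name>

# List all PVCs
kubectl get pvc -n <namespace>

# Show PVC details
kubectl describe pvc <pvc-name> -n <namespace>

# Check the PVC's binding status
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.status.phase}'

# Find the Pods that use the PVC (the claim name is not a supported Pod field selector, so filter the jsonpath output instead)
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}' | grep <pvc-name>

Use case: when a volume cannot be created or fails to bind.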

2. StorageClass diagnostics

# List all StorageClasses
kubectl get storageclass

# Show the default StorageClass
kubectl get storageclass -o jsonpath='{.items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")].metadata.name}'

# Show StorageClass details
kubectl describe storageclass <storageclass-name>

Use case: when dynamic provisioning of a PVC fails, or for StorageClass configuration problems.

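3. Volume mount diagnostics

# Check the Pod's volumes
kubectl describe pod <pod-name> -n <namespace> | grep -A10 Volumes

# Check the mount points inside the container (grep runs locally on the command's output)
kubectl exec <pod-name> -n <namespace> -- mount | grep <mount-path>

# Check the file permissions on the mounted volume
kubectl exec -it <pod-name> -n <namespace> -- ls -la <mount-path>

Use case: when an application cannot read from or write to a persistent volume, or for permission problems.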

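VI. Configuration and Security Diagnostics

1. ConfigMap and Secret

# List all ConfigMaps
kubectl get configmaps -n <namespace>

# Show the contents of a ConfigMap
kubectl describe configmap <configmap-name> -n <namespace>

# Export a ConfigMap as YAML
kubectl get configmap <configmap-name> -n <namespace> -o yaml

# List all Secrets
kubectl get secrets -n <namespace>

# Show Secret details (the values themselves are not displayed)
kubectl describe secret <secret-name> -n <namespace>

# Decode a Secret value (values are base64-encoded, not encrypted)
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.<key>}' | base64 --decode

Use case: configuration-related problems, or when you need to check how a Secret is set up.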
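To decode every key of a Secret at once, kubectl's Go template output offers a `base64decode` function. A minimal sketch with a hypothetical function name `dump_secret`; it prints secret values in clear text, so it is only suitable for a trusted terminal.

```shell
# Print every key of a Secret as key=value with the value base64-decoded.
# Warning: this writes secret values to the terminal in clear text.
dump_secret() {
  secret="$1"; ns="$2"
  kubectl get secret "$secret" -n "$ns" \
    -o go-template='{{range $k, $v := .data}}{{$k}}={{$v | base64decode}}{{"\n"}}{{end}}'
}
```

Usage: `dump_secret <secret-name> <namespace>`.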

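2. RBAC permission diagnostics

# List all ServiceAccounts
kubectl get serviceaccounts -n <namespace>

# Show ServiceAccount details
kubectl describe serviceaccount <serviceaccount-name> -n <namespace>

# List all Roles
kubectl get roles -n <namespace>

# Show Role details
kubectl describe role <role-name> -n <namespace>

# List all RoleBindings
kubectl get rolebindings -n <namespace>

# Check whether a ServiceAccount may perform an action
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<serviceaccount-name>

Use case: when an application lacks permission to access a resource or perform an operation.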
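Rather than probing one verb at a time, `kubectl auth can-i --list` prints everything an identity is allowed to do in a namespace. A minimal sketch, with a hypothetical helper name `sa_permissions`:

```shell
# List all actions a ServiceAccount is allowed to perform in its namespace.
sa_permissions() {
  sa="$1"; ns="$2"
  kubectl auth can-i --list -n "$ns" --as="system:serviceaccount:$ns:$sa"
}
```

Usage: `sa_permissions <serviceaccount-name> <namespace>`. The caller needs permission to impersonate the ServiceAccount for `--as` to work.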

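3. Security contexts and Pod security policies

# Show the Pod-level security context
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.securityContext}'

# Show the security context of the first container
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].securityContext}'

# List all PodSecurityPolicy objects (only on clusters older than v1.25; the PSP API was removed in Kubernetes v1.25)
kubectl get psp

# On v1.25 and later, check the namespace's Pod Security Admission labels instead
kubectl get namespace <namespace> --show-labels

Use case: when a Pod cannot start or run because of security restrictions.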

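VII. Resource Management and Monitoring

1. Monitoring resource usage

# Show resource usage for the Pods in a namespace
kubectl top pod -n <namespace>

# Show resource usage for a specific Pod
kubectl top pod <pod-name> -n <namespace>

# Show resource usage for nodes
kubectl top node

# Show the containers' CPU requests and limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests.cpu}'
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.limits.cpu}'

# Show the containers' memory requests and limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests.memory}'
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.limits.memory}'

Use case: monitoring resource usage and identifying resource bottlenecks.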
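To find the heaviest consumers quickly, `kubectl top pod` can sort its output. A minimal sketch; the function name `top_memory_pods` and the default of five rows are illustrative assumptions, and like every `kubectl top` command it needs a metrics source such as metrics-server.

```shell
# Show the N Pods using the most memory across all namespaces (default: 5).
top_memory_pods() {
  count="${1:-5}"
  kubectl top pod --all-namespaces --sort-by=memory --no-headers | head -n "$count"
}
```

Usage: `top_memory_pods 10`. Replacing `memory` with `cpu` sorts by CPU usage instead.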

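2. Resource quotas and limits

# Show the namespace's resource quotas
kubectl get resourcequota -n <namespace>

# Show resource quota details
kubectl describe resourcequota <quota-name> -n <namespace>

# Show LimitRanges
kubectl get limitrange -n <namespace>

# Show LimitRange details
kubectl describe limitrange <limitrange-name> -n <namespace>

Use case: when resource creation fails or a quota limit has been reached.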

3. HorizontalPodAutoscaler diagnostics

# List all HPAs
kubectl get hpa -n <namespace>

# Show HPA details
kubectl describe hpa <hpa-name> -n <namespace>

# Check the HPA's current metrics
kubectl get hpa <hpa-name> -n <namespace> -o jsonpath='{.status.currentMetrics}'

Use case: when autoscaling does not behave as expected.

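VIII. Quick Troubleshooting Guide for Common Problems

1. Pod stuck in Pending

Troubleshooting steps:

Check whether the cluster has enough resources: kubectl describe pod <pod-name> -n <namespace>, then read the Events section
Check whether the nodes have taints: kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
Check whether the PVC is bound: kubectl get pvc -n <namespace>
Check the resource quotas: kubectl describe resourcequota -n <namespace>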
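
The four checks above can be run in one pass. This is a sketch under assumptions: the function name `diagnose_pending` is hypothetical, and it lists the Pod's events directly instead of printing the full `kubectl describe` output.

```shell
# Run the Pending checks in order and label each section of the output.
diagnose_pending() {
  pod="$1"; ns="$2"
  echo "== Events for $pod =="
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod"
  echo "== Node taints =="
  kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
  echo "== PVCs in $ns =="
  kubectl get pvc -n "$ns"
  echo "== Resource quotas in $ns =="
  kubectl describe resourcequota -n "$ns"
}
```

Usage: `diagnose_pending <pod-name> <namespace>`. A FailedScheduling event in the first section usually names the constraint that could not be satisfied.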

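2. Pod stuck in CrashLoopBackOff

Troubleshooting steps:

View the Pod's logs: kubectl logs <pod-name> -n <namespace>
View the logs of the previous (crashed) container: kubectl logs --previous <pod-name> -n <namespace>
Check the container's startup command: kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].command}'
Check the liveness probe configuration: kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].livenessProbe}'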
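
Alongside these steps, the reason and exit code of the last termination often point at the cause directly (for example, a reason of OOMKilled). A minimal sketch with a hypothetical function name `last_termination`; it reads the `lastState.terminated` fields of each container status.

```shell
# Print "<container> <reason> <exit code>" for the last termination
# of every container in the Pod; the fields are empty if the container
# has never been terminated.
last_termination() {
  pod="$1"; ns="$2"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

Usage: `last_termination <pod-name> <namespace>`.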

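3. Service unreachable

Troubleshooting steps:

Confirm that the Service exists and looks correct: kubectl describe svc <service-name> -n <namespace>
Check whether the Endpoints contain backend Pods: kubectl get endpoints <service-name> -n <namespace>
Verify that the Pod labels match the Service selector: kubectl get pods -n <namespace> -l <selector>
Test DNS resolution inside the cluster: kubectl run -it --rm debug --image=busybox:1.28 -- nslookup <service-name>.<namespace>.svc.cluster.local
Check whether a network policy is blocking traffic: kubectl get networkpolicies -n <namespace>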

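4. Node NotReady

Troubleshooting steps:

View the node's status details: kubectl describe node <node-name>
Check the kubelet status: log in to the node and run systemctl status kubelet
View the kubelet logs: journalctl -u kubelet
Check the node's resource usage: kubectl top node <node-name>
Check the node's network connectivity: from the node, ping the IP of a control plane node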

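5. Storage problems

Troubleshooting steps:

Check the PV/PVC status: kubectl get pv,pvc -n <namespace>
View the PVC's events: kubectl describe pvc <pvc-name> -n <namespace>
Confirm that the StorageClass exists: kubectl get storageclass
Check the storage provisioner, for example the CSI driver Pods: kubectl get pods -n <csi-namespace>
Verify the volume mount: kubectl exec -it <pod-name> -n <namespace> -- df -h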

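IX. Practical Diagnostic Tools and Techniques

1. Extracting specific information with JSONPath

 # List all Pods whose Ready condition is True (this covers every container, not only the first)
 kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | awk '$2 == "True" { print $1 }'
 
 # Get the internal IP of every node
 kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'
 
 # Extract the container images of all Pods
 kubectl get pods -n <namespace> -o jsonpath='{.items[*].spec.containers[*].image}'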
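
JSONPath output also combines well with standard text tools. As an illustration, the following sketch counts Pods per node; the function name `pods_per_node` is hypothetical, and Pods that have not been scheduled yet (empty node name) are skipped.

```shell
# Count Pods per node across all namespaces, printed as "<node> <count>".
pods_per_node() {
  kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' |
    grep -v '^$' | sort | uniq -c | awk '{ print $2, $1 }'
}
```

An uneven distribution in this output can be a first hint when one node is under resource pressure.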

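2. Filtering resources with label and field selectors

 # Filter resources with a label selector
 kubectl get pods -l app=nginx,tier=frontend -n <namespace>
 
 # Filter resources with a field selector
 kubectl get pods --field-selector status.phase=Running -n <namespace>
 
 # Combine label and field selectors
 kubectl get pods -l app=nginx --field-selector spec.nodeName=<node-name> -n <namespace>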

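3. Using kubectl plugins

 # Install the kubectl-debug plugin (if you are not using the built-in debug command)
 kubectl krew install debug
 
 # Switch contexts quickly with kubectl-ctx (a krew plugin: kubectl krew install ctx)
 kubectl ctx <context-name>
 
 # Switch namespaces quickly with kubectl-ns (a krew plugin: kubectl krew install ns)
 kubectl ns <namespace>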

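With these k8s diagnostic commands and techniques at hand, you can investigate and resolve problems in a Kubernetes cluster more efficiently. Combining them with a good monitoring system and log management tooling further improves the cluster's observability and your ability to solve problems.

Remember that good diagnosis depends not only on knowing the tools, but also on understanding the core Kubernetes concepts and how the components interact.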
