Kubernetes pods can neither be deleted nor started after a network change cuts off the NFS server

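TL;DR: A test ELK container mounted shared NFS storage for snapshot backups, and a network policy change made the NFS server unreachable.
The df command hung, so the NFS state could not be inspected. Running umount -lf directly on the node, then entering the pod's container through docker and running umount -lf there, brought df back to normal.
As a temporary fix, the NFS mount and the PVC definition were commented out in the yaml file and the node was redeployed.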

1. Symptoms

The ELK snapshot backup failed, and running df after logging in to the pod hung.
Kubernetes reported the following event:

Unable to mount volumes for pod "elklogsvc-data-1-7f846f45c8-nnqd5_elk-logsvc-uat(53acafec-4660-11ea-b3ac-98039b885796)": timeout expired waiting for volumes to attach or mount for pod "elk-logsvc-uat"/"elklogsvc-data-1-7f846f45c8-nnqd5". list of unmounted volumes=[es-snapshort]. list of unattached volumes=[es-uat-data eslog escert jdk-secpolicy es-snapshort default-token-vlghg]
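
The event message names the volumes that failed to mount inside square brackets. As a small illustration, the sketch below pulls that list out of an event line read from stdin. The function name unmounted_volumes is hypothetical, and the message format is assumed to match the event shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original notes.
# Reads kubelet/Kubernetes event text on stdin and prints, one per line,
# the names found in "list of unmounted volumes=[...]".
unmounted_volumes() {
    sed -n 's/.*list of unmounted volumes=\[\([^]]*\)].*/\1/p' | tr ' ' '\n'
}
```

Feeding it the event above prints es-snapshort, which points straight at the NFS-backed snapshot volume.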

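After the pod was redeployed, the old pod could not be removed and the new pod stayed in Pending.

2. Analysis

Pinging the NFS server failed, so the network policy had most likely been changed.

On the host, the NFS mounts were all still present, and umount -f could not detach them. The NFS server itself was healthy.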

mount | grep nfs
nfs_ip:/neworiental/nfs on /var/lib/kubelet/pods/3cc68af0-cb04-11e9-ab84-98039b88726a/volumes/kubernetes.io~nfs/pv-elklogsvcuat-snapshot-nfs type nfs4 (rw,relatime,vers=4.1,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.22.29.203,local_lock=none,addr=172.24.202.25)
nfs_ip:/neworiental/nfs on /var/lib/kubelet/pods/3779b2d1-cb05-11e9-ab84-98039b88726a/volumes/kubernetes.io~nfs/pv-elklogsvcuat-snapshot-nfs type nfs4 (rw,relatime,vers=4.1,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.22.29.203,local_lock=none,addr=172.24.202.25)
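
Since df blocks on a hard-mounted NFS share whose server is unreachable, it can help to probe each NFS mount point with a time limit instead. The following is a minimal sketch, assuming a Linux host with /proc/mounts and GNU coreutils timeout. The function names list_nfs_mounts and probe_mount are hypothetical and do not come from the original notes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the original notes.

# Print the mount points of all NFS file systems found in a mount table
# in /proc/mounts format (field 2 = mount point, field 3 = fs type).
list_nfs_mounts() {
    local table="${1:-/proc/mounts}"
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 }' "$table"
}

# Probe one path with a time limit (default 5 seconds).
# Prints "ok <path>" if stat returns in time, otherwise "stale <path>".
# Note: "stale" is also printed for any other stat failure, such as a
# path that does not exist.
probe_mount() {
    local path="$1" limit="${2:-5}"
    if timeout -s KILL "$limit" stat "$path" >/dev/null 2>&1; then
        echo "ok $path"
    else
        echo "stale $path"
    fi
}

# Example usage on the node:
#   list_nfs_mounts | while read -r m; do probe_mount "$m" 5; done
```

SIGKILL is used because a process blocked on a hard NFS mount usually only reacts to fatal signals. This is an assumption about typical kernel behavior, and the probe may still block in some states.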

3. Resolution

3.1 Force-unmount the NFS file system on the host

umount -lf /var/lib/kubelet/pods/3cc68af0-cb04-11e9-ab84-98039b88726a/volumes/kubernetes.io~nfs/pv-elklogsvcuat-snapshot-nfs

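After the unmount, the old pod still could not be removed. The container lives in a different mount namespace from the host, so the NFS mount also had to be detached inside the container.

3.2 Enter the container as root through docker and run umount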
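
The mount-namespace point can be checked directly: on Linux, each process exposes its mount namespace as the symlink /proc/<pid>/ns/mnt, and two processes share a namespace only if those links resolve to the same value. A minimal sketch follows. The function names mnt_ns_of and same_mnt_ns are hypothetical, and reading another process's namespace link generally needs root.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the original notes.

# Print the mount-namespace identifier of a process, e.g. "mnt:[4026531840]".
mnt_ns_of() {
    readlink "/proc/$1/ns/mnt"
}

# Return 0 when both PIDs share one mount namespace, 1 when they differ,
# and 2 when either namespace link cannot be read.
same_mnt_ns() {
    local a b
    a="$(mnt_ns_of "$1")" || return 2
    b="$(mnt_ns_of "$2")" || return 2
    [ "$a" = "$b" ]
}

# Example usage on the node, as root (container ID taken from docker ps):
#   pid="$(docker inspect -f '{{.State.Pid}}' af8c30434775)"
#   same_mnt_ns 1 "$pid" || echo "container has its own mount namespace"
```

If the namespaces differ, an umount on the host does not remove the mount seen inside the container, which matches what was observed here.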

docker ps | grep elk | grep coo
af8c30434775        dir.staff.xdf.cn/xdf-pub/elasticsearch                       "/usr/local/bin/dock…"   8 months ago        Up 8 months                             k8s_elklogsvc-coo-3_elklogsvc-coo-3-544d46d664-r8r67_elk-logsvc-uat_3779b2d1-cb05-11e9-ab84-98039b88726a_0
bc35792cd4a0        dir.staff.xdf.cn/google_containers/pause:3.1                 "/pause"                 8 months ago        Up 8 months                             k8s_POD_elklogsvc-coo-3-544d46d664-r8r67_elk-logsvc-uat_3779b2d1-cb05-11e9-ab84-98039b88726a_0

# Container ID
af8c30434775 

# umount
docker exec -u root -it --privileged af8c30434775 /bin/bash
umount -lf /usr/share/elasticsearch/snapshort

# df now works normally inside the container
df 
Filesystem                     1K-blocks       Used  Available Use% Mounted on
overlay                        103081248   23826408   74848944  25% /
tmpfs                              65536          0      65536   0% /dev
tmpfs                          131887444          0  131887444   0% /sys/fs/cgroup
/dev/mapper/vg_root-lv_root    480486104   11335796  444719924   3% /etc/hosts
/dev/mapper/vg_root-lv_docker  103081248   23826408   74848944  25% /etc/hostname
shm                                65536          0      65536   0% /dev/shm
/dev/mapper/vg_root-lv_new    3097960600 1556053352 1405997056  53% /usr/share/elasticsearch/data
tmpfs                          131887444         12  131887432   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                          131887444          4  131887440   1% /usr/share/elasticsearch/certs/logsvcuat-elastic.p12
tmpfs                          131887444          0  131887444   0% /proc/acpi
tmpfs                          131887444          0  131887444   0% /proc/scsi
tmpfs                          131887444          0  131887444   0% /sys/firmware

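3.3 Redeploy the pod

After the change, the kubelet log kept reporting NFS mount errors for a while. After a short wait the kubelet killed the old pod and the new pod started. Everything recovered.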

Other notes

While checking the kubelet, I found orphaned pods reported in its log.

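To inspect the orphaned pod, I went into /var/lib/kubelet/pods/40ccb5cc-919c-11ea-8800-98039b88740f.
It turned out to be an ELK pod, most likely one that was rescheduled after the old pod was removed. With NFS still hung, the old pod could not go down and the new pod could not come up, so it hung there. After the fix I deleted all the pods by hand, so etcd no longer held a record of them and they became orphans.
That is why the kubelet reported the error. The fix is to delete the directory or move it away.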

## Handling orphaned pods

May 09 10:37:25 m725-c114-20-k8s-uat-master02 kubelet[195202]: E0509 10:37:25.068476  195202 kubelet_volumes.go:154] Orphaned pod "40ccb5cc-919c-11ea-8800-98039b88740f" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them

# Move the orphaned pod directory out of the kubelet pods directory
cd /var/lib/kubelet/pods
mv 40ccb5cc-919c-11ea-8800-98039b88740f /tmp
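
When several orphaned pods are reported, the UIDs can be extracted from the kubelet log instead of being copied by hand. Before moving a directory, it is also worth confirming that nothing is still mounted underneath it. A minimal sketch follows, assuming the log line format shown above and a mount table in /proc/mounts format. The function names orphaned_pod_uids and pod_dir_is_unmounted are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the original notes.

# Read kubelet log text on stdin and print each orphaned pod UID once.
orphaned_pod_uids() {
    sed -n 's/.*Orphaned pod "\([0-9a-f-]\{36\}\)" found.*/\1/p' | sort -u
}

# Return 0 when no mount point remains under the given pod directory,
# non-zero when at least one mount is still listed in the mount table.
pod_dir_is_unmounted() {
    local dir="$1" table="${2:-/proc/mounts}"
    ! awk -v d="$dir/" 'index($2, d) == 1 { found = 1 } END { exit !found }' "$table"
}

# Example usage on the node (assumes kubelet runs as a systemd unit
# named "kubelet"):
#   journalctl -u kubelet --since "1 hour ago" | orphaned_pod_uids |
#   while read -r uid; do
#       d="/var/lib/kubelet/pods/$uid"
#       pod_dir_is_unmounted "$d" && echo "safe to move: $d"
#   done
```

The usage example only prints candidates. The actual mv is left as a manual step, as in the commands above.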