Operating System Crash and Reboot
Troubleshooting approach
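1. Analyze the memory dump file with crash
2. Find the PID of the process that was running when the kernel crashed
3. Identify the module that caused the crash under this PID
4. Form the hypothesis that the crash was caused by an address that is not mapped
5. Work out the register address used by the faulting module
6. With the address in hand, inspect the corresponding memory region and its permissions
Prepare dependencies: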
CentOS:
yum install -y crash kernel-debuginfo-$(uname -r)
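crash needs a vmlinux with debug symbols that matches the kernel that wrote the dump. As a minimal sketch, assuming kernel-debuginfo installs under /usr/lib/debug/lib/modules/<release>/vmlinux (the layout visible in the KERNEL line of the session output further down), the helper below builds that path and checks whether the file is present. The function names are hypothetical.

```python
import os
import platform


def debuginfo_vmlinux(release: str) -> str:
    """Return the path where kernel-debuginfo is assumed to place vmlinux."""
    return os.path.join("/usr/lib/debug/lib/modules", release, "vmlinux")


def check_vmlinux(release: str = "") -> bool:
    """Report whether the debug vmlinux for `release` is installed."""
    # Default to the running kernel, which is what $(uname -r) expands to.
    release = release or platform.release()
    path = debuginfo_vmlinux(release)
    found = os.path.isfile(path)
    print(f"{path}: {'found' if found else 'missing'}")
    return found


if __name__ == "__main__":
    check_vmlinux()
```

If the host has since rebooted into a different kernel, pass the release of the kernel that crashed rather than relying on the default.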
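Analyze the memory dump file (vmcore):
1. Locate the kdump file on the host:
/var/crash/127.0.0.1-2024-04-08-20:09:56/vmcore
2. Start the debugger: crash //vmcore //vmlinux
a. Read the summary printed at the end of startup and note the PID, COMMAND, and panic reason involved in the crash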
- PID: 2060266
- COMMAND: “nvidia-containe”
- PANIC: “general protection fault: 0000 [#1] SMP PTI”
This shows that the kernel hit a general protection fault.
bash> crash /var/crash/127.0.0.1-2024-04-08-20:09:56/vmcore /usr/lib/debug/lib/modules/4.19.91-27.4.al7.x86_64/vmlinux
crash 7.2.8-3.al7.alnx
Copyright (C) 2002-2020 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel relocated [480MB]: patching 97029 gdb minimal_symbol values
KERNEL: /usr/lib/debug/lib/modules/4.19.91-27.4.al7.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 8
DATE: Mon Apr 8 20:09:55 2024
UPTIME: 213 days, 04:16:10
LOAD AVERAGE: 0.11, 0.17, 0.18
TASKS: 954
NODENAME: k8s-serverless-10.19.207.13
RELEASE: 4.19.91-27.4.al7.x86_64
VERSION: #1 SMP Thu May 25 17:57:07 CST 2023
MACHINE: x86_64 (2500 Mhz)
MEMORY: 31 GB
PANIC: "general protection fault: 0000 [#1] SMP PTI"
PID: 2060266
COMMAND: "nvidia-containe"
TASK: ffff9623b37ba0c0 [THREAD_INFO: ffff9623b37ba0c0]
CPU: 4
STATE: TASK_RUNNING (PANIC)
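The fields of interest in the summary above (PANIC, PID, COMMAND) can be pulled out mechanically when many dumps have to be triaged. The following is a minimal sketch that parses a saved copy of the crash startup text; the `parse_summary` helper and the `SUMMARY` sample are illustrative and not part of crash itself.

```python
import re

# A few lines copied from the crash startup summary shown above.
SUMMARY = '''\
     RELEASE: 4.19.91-27.4.al7.x86_64
       PANIC: "general protection fault: 0000 [#1] SMP PTI"
         PID: 2060266
     COMMAND: "nvidia-containe"
         CPU: 4
'''


def parse_summary(text: str) -> dict:
    """Map each upper-case 'KEY: value' line to a dict entry."""
    fields = {}
    for line in text.splitlines():
        # Keys are upper-case words (possibly with spaces, e.g. LOAD AVERAGE).
        m = re.match(r'\s*([A-Z_ ]+?):\s+(.*)$', line)
        if m:
            fields[m.group(1).strip()] = m.group(2).strip().strip('"')
    return fields


if __name__ == "__main__":
    info = parse_summary(SUMMARY)
    print(info["PID"], info["COMMAND"], info["PANIC"])
```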
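3. A general protection fault typically has one of these causes:
a. The address in the rbx register is invalid (for example, non-canonical) or is not mapped to physical memory.
b. The access violates a protection rule, such as reading from a write-only or inaccessible memory region.
c. The address formed by rbx plus the offset falls outside the process's address space.
4. Get the stack of the crashing PID: bt <PID>
a. Read the stack starting from frame #10 and work upward;
b. Frame #4 shows the problem: a bad memory access in destroy_node, triggered by the cgpu_inst_ctl_write function in the cgpu_procfs module;
c. At first we could only confirm that destroy_node, reached from the nvidia-containe process, was at fault, without knowing why; only after asking Alibaba Cloud did we learn that the code belongs to their plugin.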
crash> bt 2060266
PID: 2060266 TASK: ffff9623b37ba0c0 CPU: 4 COMMAND: "nvidia-containe"
#0 [ffffa88944c93bf8] machine_kexec at ffffffff9f064e48
#1 [ffffa88944c93c48] __crash_kexec at ffffffff9f14b64a
#2 [ffffa88944c93d08] panic at ffffffff9f0a2443
#3 [ffffa88944c93d80] oops_end at ffffffff9f02ae88
#4 [ffffa88944c93da0] general_protection at ffffffff9fa0118e
[exception RIP: destroy_node+259]
RIP: ffffffffc4162473 RSP: ffffa88944c93e50 RFLAGS: 00010202
RAX: 0000000000000001 RBX: dead0000000000e0 RCX: 0000000000000062
RDX: 000000000000000c RSI: ffff9621be276d71 RDI: ffff9622ce92f0cc
RBP: ffffffffc4167508 R8: 0000000000000039 R9: ffffffffc41624e1
R10: ffff9624f8c1c1c0 R11: ffffd7ae95e30700 R12: ffff9621be276d71
R13: 0000000000000000 R14: ffff9622ce92f0c0 R15: ffff9622ce92f0e0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffffa88944c93e90] cgpu_inst_ctl_write at ffffffffc41625dd [cgpu_procfs]
#6 [ffffa88944c93eb0] proc_reg_write at ffffffff9f346ecc
#7 [ffffa88944c93ec8] vfs_write at ffffffff9f2bd1dd
#8 [ffffa88944c93f00] ksys_write at ffffffff9f2bd45a
#9 [ffffa88944c93f38] do_syscall_64 at ffffffff9f003e4b
#10 [ffffa88944c93f50] entry_SYSCALL_64_after_hwframe at ffffffff9fa0009c
RIP: 000000000044a9f0 RSP: 00007ffdd7a8d0d8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000400340 RCX: 000000000044a9f0
RDX: 000000000000000d RSI: 00007ffdd7a8d0f0 RDI: 0000000000000005
RBP: 00007ffdd7a8d110 R8: 3766326466623834 R9: 6164633937653865
R10: 6333323233663438 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000040ade0 R14: 000000000040ae70 R15: 0000000000000055
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
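The backtrace above can be reduced to the two facts that matter here: where the exception RIP points, and which frames belong to a loadable module (crash prints the module name in square brackets after the address). The sketch below parses an excerpt of the bt text under that assumption; `parse_bt` is a hypothetical helper.

```python
import re

# Excerpt of the bt output shown above.
BT = '''\
 #3 [ffffa88944c93d80] oops_end at ffffffff9f02ae88
 #4 [ffffa88944c93da0] general_protection at ffffffff9fa0118e
    [exception RIP: destroy_node+259]
    RIP: ffffffffc4162473  RSP: ffffa88944c93e50  RFLAGS: 00010202
 #5 [ffffa88944c93e90] cgpu_inst_ctl_write at ffffffffc41625dd [cgpu_procfs]
 #6 [ffffa88944c93eb0] proc_reg_write at ffffffff9f346ecc
'''

FRAME = re.compile(
    r'#(\d+)\s+\[([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)(?:\s+\[(\S+)\])?'
)
EXCEPTION = re.compile(r'\[exception RIP: (\S+?)\+(\d+)\]')


def parse_bt(text: str):
    """Return (frames, fault) where fault is (symbol, decimal offset) or None."""
    frames = [
        {"no": int(m[1]), "func": m[3], "addr": int(m[4], 16), "module": m[5]}
        for m in FRAME.finditer(text)
    ]
    exc = EXCEPTION.search(text)
    fault = (exc[1], int(exc[2])) if exc else None
    return frames, fault


if __name__ == "__main__":
    frames, fault = parse_bt(BT)
    print("fault at:", fault)
    for f in frames:
        if f["module"]:
            print(f"module frame #{f['no']}: {f['func']} [{f['module']}]")
```

The offset after the symbol is decimal: 0xffffffffc4162473 minus 259 gives 0xffffffffc4162370, the start of destroy_node.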
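5. Examine the RIP address from frame #4
a. The last instruction executed is: 0xffffffffc4162473 <destroy_node+259>: mov 0x20(%rbx),%rax
- It loads the value stored at the address rbx + 0x20 (32 bytes) into the rax register;
b. This instruction triggered the general protection fault, so check whether the address it points to does not exist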
crash> dis -rl ffffffffc4162473
0xffffffffc4162370 <destroy_node>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffc4162375 <destroy_node+5>: push %r15
0xffffffffc4162377 <destroy_node+7>: push %r14
0xffffffffc4162379 <destroy_node+9>: push %r13
0xffffffffc416237b <destroy_node+11>: push %r12
0xffffffffc416237d <destroy_node+13>: mov %rdi,%r12
0xffffffffc4162380 <destroy_node+16>: push %rbp
0xffffffffc4162381 <destroy_node+17>: push %rbx
0xffffffffc4162382 <destroy_node+18>: sub $0x10,%rsp
0xffffffffc4162386 <destroy_node+22>: test %rdi,%rdi
0xffffffffc4162389 <destroy_node+25>: je 0xffffffffc4162521 <destroy_node+433>
0xffffffffc416238f <destroy_node+31>: mov %rdi,%rsi
0xffffffffc4162392 <destroy_node+34>: xor %eax,%eax
0xffffffffc4162394 <destroy_node+36>: mov $0xffffffffc41644c0,%rdi
0xffffffffc416239b <destroy_node+43>: callq 0xffffffff9f10f4be <printk>
0xffffffffc41623a0 <destroy_node+48>: mov 0x54ba(%rip),%eax # 0xffffffffc4167860
0xffffffffc41623a6 <destroy_node+54>: test %eax,%eax
0xffffffffc41623a8 <destroy_node+56>: jle 0xffffffffc4162510 <destroy_node+416>
0xffffffffc41623ae <destroy_node+62>: mov $0xffffffffc4167508,%rbp
0xffffffffc41623b5 <destroy_node+69>: movq $0xffffffffc41678a0,(%rsp)
0xffffffffc41623bd <destroy_node+77>: xor %r13d,%r13d
0xffffffffc41623c0 <destroy_node+80>: movl $0x0,0xc(%rsp)
0xffffffffc41623c8 <destroy_node+88>: mov %rbp,%rdi
0xffffffffc41623cb <destroy_node+91>: callq 0xffffffff9f8e6bf0 <down_read>
0xffffffffc41623d0 <destroy_node+96>: mov 0x50(%rbp),%rax
0xffffffffc41623d4 <destroy_node+100>: mov %r13d,%esi
0xffffffffc41623d7 <destroy_node+103>: mov %r12,%rdi
0xffffffffc41623da <destroy_node+106>: mov 0x30(%rax),%rax
0xffffffffc41623de <destroy_node+110>: callq 0xffffffff9fc03000 <__entry_trampoline_end>
0xffffffffc41623e3 <destroy_node+115>: mov %rbp,%rdi
0xffffffffc41623e6 <destroy_node+118>: mov %rax,%rbx
0xffffffffc41623e9 <destroy_node+121>: callq 0xffffffff9f100770 <up_read>
0xffffffffc41623ee <destroy_node+126>: test %rbx,%rbx
0xffffffffc41623f1 <destroy_node+129>: je 0xffffffffc41624e3 <destroy_node+371>
0xffffffffc41623f7 <destroy_node+135>: mov %rbp,%rdi
0xffffffffc41623fa <destroy_node+138>: callq 0xffffffff9f8e6bf0 <down_read>
0xffffffffc41623ff <destroy_node+143>: mov 0x50(%rbp),%rax
0xffffffffc4162403 <destroy_node+147>: xor %esi,%esi
0xffffffffc4162405 <destroy_node+149>: mov %rbx,%rdi
0xffffffffc4162408 <destroy_node+152>: mov 0x98(%rax),%rax
0xffffffffc416240f <destroy_node+159>: callq 0xffffffff9fc03000 <__entry_trampoline_end>
0xffffffffc4162414 <destroy_node+164>: mov 0x50(%rbp),%rax
0xffffffffc4162418 <destroy_node+168>: xor %esi,%esi
0xffffffffc416241a <destroy_node+170>: mov %rbx,%rdi
0xffffffffc416241d <destroy_node+173>: mov 0x38(%rax),%rax
0xffffffffc4162421 <destroy_node+177>: callq 0xffffffff9fc03000 <__entry_trampoline_end>
0xffffffffc4162426 <destroy_node+182>: mov 0x50(%rbp),%rax
0xffffffffc416242a <destroy_node+186>: xor %esi,%esi
0xffffffffc416242c <destroy_node+188>: mov %rbx,%rdi
0xffffffffc416242f <destroy_node+191>: mov 0x78(%rax),%rax
0xffffffffc4162433 <destroy_node+195>: callq 0xffffffff9fc03000 <__entry_trampoline_end>
0xffffffffc4162438 <destroy_node+200>: mov %rbp,%rdi
0xffffffffc416243b <destroy_node+203>: callq 0xffffffff9f100770 <up_read>
0xffffffffc4162440 <destroy_node+208>: mov (%rsp),%rax
0xffffffffc4162444 <destroy_node+212>: mov %r12,%rdi
0xffffffffc4162447 <destroy_node+215>: mov (%rax),%rsi
0xffffffffc416244a <destroy_node+218>: callq 0xffffffff9f34d010 <remove_proc_subtree>
0xffffffffc416244f <destroy_node+223>: mov 0x508a(%rip),%r8 # 0xffffffffc41674e0
0xffffffffc4162456 <destroy_node+230>: mov %eax,0xc(%rsp)
0xffffffffc416245a <destroy_node+234>: mov (%r8),%rax
0xffffffffc416245d <destroy_node+237>: cmp $0xffffffffc41674e0,%r8
0xffffffffc4162464 <destroy_node+244>: lea -0x20(%r8),%r14
0xffffffffc4162468 <destroy_node+248>: mov %r8,%r15
0xffffffffc416246b <destroy_node+251>: lea -0x20(%rax),%rbx
0xffffffffc416246f <destroy_node+255>: jne 0xffffffffc416248e <destroy_node+286>
0xffffffffc4162471 <destroy_node+257>: jmp 0xffffffffc41624e3 <destroy_node+371>
0xffffffffc4162473 <destroy_node+259>: mov 0x20(%rbx),%rax
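The address that the faulting instruction dereferences can be computed from the RBX value in the exception frame of the bt output. The sketch below does that arithmetic and checks two things: whether the result is a canonical x86_64 address, and whether it equals the kernel's LIST_POISON1 marker. Both checks rest on assumptions not stated in the post: 4-level paging (48-bit virtual addresses) and the default CONFIG_ILLEGAL_POINTER_VALUE of 0xdead000000000000.

```python
RBX = 0xdead0000000000e0   # RBX from the exception frame in the bt output
OFFSET = 0x20              # displacement in: mov 0x20(%rbx),%rax

# Assumptions: 48-bit virtual addresses and the default poison delta.
VA_BITS = 48
POISON_POINTER_DELTA = 0xdead000000000000
LIST_POISON1 = 0x100 + POISON_POINTER_DELTA  # written to ->next by list_del()


def is_canonical(addr: int, va_bits: int = VA_BITS) -> bool:
    """True if bits 63 down to (va_bits - 1) are all zeros or all ones."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


target = (RBX + OFFSET) & 0xFFFFFFFFFFFFFFFF

if __name__ == "__main__":
    print("effective address:", hex(target))
    print("canonical:", is_canonical(target))
    print("equals LIST_POISON1:", target == LIST_POISON1)
```

Under these assumptions the effective address is non-canonical, which on x86_64 raises a general protection fault rather than a page fault. That it matches LIST_POISON1 would be consistent with the code following a list pointer from an entry that had already been removed with list_del(), though the dump alone does not prove this.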
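6. Check the memory mapping and permissions of address 0xffffffffc4162473
a. Virtual address ffffffffc4162473 maps to physical address 70002c473. The PTE (page table entry) flags PRESENT|ACCESSED|DIRTY in the output show that the page is present, has been accessed, and has been written to (dirty)
b. This rules out a missing mapping for the address taken from the stack trace as the cause. Note that this is the address of the instruction itself; the data address it dereferences is rbx + 0x20.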
crash> vtop ffffffffc4162473
VIRTUAL PHYSICAL
ffffffffc4162473 70002c473
PGD DIRECTORY: ffffffffa020a000
PAGE DIRECTORY: 2c220c067
PUD: 2c220cff8 => 2c220e067
PMD: 2c220e100 => 7038f5067
PTE: 7038f5b10 => 70002c061
PAGE: 70002c000
PTE PHYSICAL FLAGS
70002c061 70002c000 (PRESENT|ACCESSED|DIRTY)
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffd7ae9c000b00 70002c000 0 0 1 17ffffc0000000
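The vtop output above can be cross-checked by hand. The sketch below decodes the low flag bits of the PTE value, rebuilds the physical address from the page frame and the page offset, and recomputes where the PTE sits inside the page table that the PMD entry points to. It assumes the standard x86_64 layout for 4 KiB pages; the flag names follow the architecture definition, and the helper names are hypothetical.

```python
# Flag bits of an x86_64 page table entry for a 4 KiB page.
PTE_FLAGS = {
    0: "PRESENT", 1: "RW", 2: "USER", 3: "PWT", 4: "PCD",
    5: "ACCESSED", 6: "DIRTY", 7: "PAT", 8: "GLOBAL", 63: "NX",
}
PAGE_SHIFT = 12
PHYS_MASK = 0x000FFFFFFFFFF000  # physical frame bits 12..51


def decode_pte(pte: int):
    """Return (physical page base, list of set flag names)."""
    flags = [name for bit, name in sorted(PTE_FLAGS.items()) if (pte >> bit) & 1]
    return pte & PHYS_MASK, flags


def translate(vaddr: int, pte: int) -> int:
    """Combine the page frame from the PTE with the offset inside the page."""
    page, _ = decode_pte(pte)
    return page | (vaddr & ((1 << PAGE_SHIFT) - 1))


def pte_address(vaddr: int, pmd_entry: int) -> int:
    """Physical address of the PTE: table base plus 8 bytes per index."""
    index = (vaddr >> PAGE_SHIFT) & 0x1FF
    return (pmd_entry & PHYS_MASK) + index * 8


if __name__ == "__main__":
    vaddr, pmd, pte = 0xffffffffc4162473, 0x7038f5067, 0x70002c061
    print(decode_pte(pte)[1])
    print(hex(translate(vaddr, pte)))
    print(hex(pte_address(vaddr, pmd)))
```

The RW bit is clear in this entry, which is what one would expect for a read-only page of module code.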
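7. This brought us back to where we started, with no further direction to investigate. We shared the findings with Alibaba Cloud and learned that this is a bug in their cgpu.
8. Alibaba Cloud then took over, fixed the bug, and released an updated version.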