
Operating system crash and reboot

Troubleshooting approach

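1. Analyze the memory dump file with crash
2. Find the PID of the process that was running when the kernel crashed
3. Identify the module that caused the crash for this PID
4. Form the hypothesis that the crash was caused by an address that does not exist
5. Find the register address used by the faulting module
6. With the address in hand, inspect the corresponding memory region and its permissions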

Prepare dependencies:

CentOS:
yum install -y crash kernel-debuginfo-$(uname -r)
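
The debuginfo package places an uncompressed vmlinux under /usr/lib/debug, and crash needs the one that matches the kernel that produced the dump. A small Python sketch for building and checking that path; the helper name is hypothetical, and it assumes the dump came from the kernel that is currently running (otherwise pass the release string of the crashed kernel explicitly):

```python
import os
import platform


def debuginfo_vmlinux_path(release: str) -> str:
    """Return the conventional kernel-debuginfo location of vmlinux for a kernel release."""
    return os.path.join("/usr/lib/debug/lib/modules", release, "vmlinux")


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r` on Linux
    path = debuginfo_vmlinux_path(release)
    status = "present" if os.path.exists(path) else "missing, is kernel-debuginfo installed?"
    print(path, "(" + status + ")")
```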

Analyze the memory dump file (vmcore):

1. Locate the kdump file on the host:
/var/crash/127.0.0.1-2024-04-08-20:09:56/vmcore
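
When a host has crashed more than once, it can help to list the dumps newest first. A minimal sketch, assuming kdump writes each dump to its own timestamped subdirectory of /var/crash as in the path above; the function name is made up for illustration:

```python
from pathlib import Path
from typing import List


def find_vmcores(crash_root: str = "/var/crash") -> List[Path]:
    """Return vmcore files one level below crash_root, newest first by mtime."""
    root = Path(crash_root)
    if not root.is_dir():
        return []
    cores = [p for p in root.glob("*/vmcore") if p.is_file()]
    return sorted(cores, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for core in find_vmcores():
        print(core)
```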

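2. Start the debugger: crash <vmcore> <vmlinux>

a. Look at the summary printed at the end of the startup output and find the PID, the COMMAND, and the panic reason involved in the crash

  • PID: 2060266
  • COMMAND: "nvidia-containe"
  • PANIC: "general protection fault: 0000 [#1] SMP PTI"

    This shows that the kernel hit a general protection fault.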

bash> crash /var/crash/127.0.0.1-2024-04-08-20:09:56/vmcore /usr/lib/debug/lib/modules/4.19.91-27.4.al7.x86_64/vmlinux

crash 7.2.8-3.al7.alnx
Copyright (C) 2002-2020  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.  

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...  

WARNING: kernel relocated [480MB]: patching 97029 gdb minimal_symbol values  


      KERNEL: /usr/lib/debug/lib/modules/4.19.91-27.4.al7.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Mon Apr  8 20:09:55 2024
      UPTIME: 213 days, 04:16:10
LOAD AVERAGE: 0.11, 0.17, 0.18
       TASKS: 954
    NODENAME: k8s-serverless-10.19.207.13
     RELEASE: 4.19.91-27.4.al7.x86_64
     VERSION: #1 SMP Thu May 25 17:57:07 CST 2023
     MACHINE: x86_64  (2500 Mhz)
      MEMORY: 31 GB
       PANIC: "general protection fault: 0000 [#1] SMP PTI"
         PID: 2060266
     COMMAND: "nvidia-containe"
        TASK: ffff9623b37ba0c0  [THREAD_INFO: ffff9623b37ba0c0]
         CPU: 4
       STATE: TASK_RUNNING (PANIC)
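
If several dumps need triage, the summary fields can be pulled out of saved crash output mechanically. A rough Python sketch, assuming the `KEY: value` layout shown above; the regular expression and function name are illustrative and not part of crash itself:

```python
import re
from typing import Dict

# Upper-case key (may contain spaces or underscores), a colon, then the value.
FIELD_RE = re.compile(r"^\s*([A-Z][A-Z _]*[A-Z]):\s+(.*?)\s*$")


def parse_crash_summary(text: str) -> Dict[str, str]:
    """Map 'KEY: value' lines from the crash startup summary to a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if m:
            fields[m.group(1)] = m.group(2).strip('"')
    return fields


sample = '''
       PANIC: "general protection fault: 0000 [#1] SMP PTI"
         PID: 2060266
     COMMAND: "nvidia-containe"
         CPU: 4
'''
info = parse_crash_summary(sample)
print(info["PID"], info["COMMAND"], info["PANIC"])
```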

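3. A general protection fault usually has one of these causes:

a. The address in the rbx register is invalid (on x86-64, typically a non-canonical address; an address that is merely unmapped normally raises a page fault instead).
b. The access violates a protection rule, for example an attempt to read a write-only or inaccessible memory region.
c. The address in rbx plus the offset falls outside the process's address space.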
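
On x86-64 a memory access through a non-canonical address raises a general protection fault, so checking the registers involved is a quick way to test case a. A minimal Python sketch, assuming 48-bit virtual addresses (4-level paging); the helper name is hypothetical, and the RBX and RIP values are the ones recorded in the exception frame of this dump:

```python
def is_canonical_x86_64(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits-1) of addr are all zeros or all ones."""
    top = addr >> (va_bits - 1)                # for 48 bits: bits 63..47
    all_ones = (1 << (64 - va_bits + 1)) - 1   # for 48 bits: 17 one-bits
    return top == 0 or top == all_ones


rbx = 0xDEAD0000000000E0   # RBX in the exception frame
rip = 0xFFFFFFFFC4162473   # RIP in the exception frame

print("rbx + 0x20 canonical:", is_canonical_x86_64(rbx + 0x20))
print("rip canonical:", is_canonical_x86_64(rip))
```

The first check comes out false and the second true, so the address the instruction tried to read is not a valid x86-64 address at all, while the instruction address itself is well formed.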

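4. Get the stack trace of the crashing PID: bt <PID>

a. Read the stack starting from frame #10 and work upward;
b. Frame #4 shows the problem: a bad memory access in destroy_node, reached from the cgpu_inst_ctl_write function in the cgpu_procfs module;
c. At first we could only confirm that destroy_node, running in the context of the nvidia-containe process, was at fault, without knowing why. We later learned from Alibaba Cloud that the code belongs to their plugin.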

crash> bt 2060266
PID: 2060266  TASK: ffff9623b37ba0c0  CPU: 4   COMMAND: "nvidia-containe"
 #0 [ffffa88944c93bf8] machine_kexec at ffffffff9f064e48
 #1 [ffffa88944c93c48] __crash_kexec at ffffffff9f14b64a
 #2 [ffffa88944c93d08] panic at ffffffff9f0a2443
 #3 [ffffa88944c93d80] oops_end at ffffffff9f02ae88
 #4 [ffffa88944c93da0] general_protection at ffffffff9fa0118e
    [exception RIP: destroy_node+259]
    RIP: ffffffffc4162473  RSP: ffffa88944c93e50  RFLAGS: 00010202
    RAX: 0000000000000001  RBX: dead0000000000e0  RCX: 0000000000000062
    RDX: 000000000000000c  RSI: ffff9621be276d71  RDI: ffff9622ce92f0cc
    RBP: ffffffffc4167508   R8: 0000000000000039   R9: ffffffffc41624e1
    R10: ffff9624f8c1c1c0  R11: ffffd7ae95e30700  R12: ffff9621be276d71
    R13: 0000000000000000  R14: ffff9622ce92f0c0  R15: ffff9622ce92f0e0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffa88944c93e90] cgpu_inst_ctl_write at ffffffffc41625dd [cgpu_procfs]
 #6 [ffffa88944c93eb0] proc_reg_write at ffffffff9f346ecc
 #7 [ffffa88944c93ec8] vfs_write at ffffffff9f2bd1dd
 #8 [ffffa88944c93f00] ksys_write at ffffffff9f2bd45a
 #9 [ffffa88944c93f38] do_syscall_64 at ffffffff9f003e4b
#10 [ffffa88944c93f50] entry_SYSCALL_64_after_hwframe at ffffffff9fa0009c
    RIP: 000000000044a9f0  RSP: 00007ffdd7a8d0d8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000400340  RCX: 000000000044a9f0
    RDX: 000000000000000d  RSI: 00007ffdd7a8d0f0  RDI: 0000000000000005
    RBP: 00007ffdd7a8d110   R8: 3766326466623834   R9: 6164633937653865
    R10: 6333323233663438  R11: 0000000000000246  R12: 0000000000000000
    R13: 000000000040ade0  R14: 000000000040ae70  R15: 0000000000000055
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
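
crash tags frames that belong to a loadable module with the module name in square brackets, which is how cgpu_procfs stands out in the trace above. A small Python sketch that picks those frames out of saved bt output; the regular expression is an assumption based on the line format shown here, not an official grammar:

```python
import re
from typing import List, Optional, Tuple

# "#N [stack addr] symbol at text addr" with an optional trailing "[module]".
FRAME_RE = re.compile(
    r"^\s*#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+([0-9a-f]+)(?:\s+\[(\S+)\])?\s*$"
)


def parse_frames(bt_text: str) -> List[Tuple[int, str, Optional[str]]]:
    """Return (frame number, symbol, module or None) for each frame line."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(2), m.group(4)))
    return frames


def module_frames(bt_text: str) -> List[Tuple[int, str, str]]:
    """Keep only the frames that carry a [module] tag."""
    return [(n, sym, mod) for n, sym, mod in parse_frames(bt_text) if mod]


sample_bt = """
 #4 [ffffa88944c93da0] general_protection at ffffffff9fa0118e
 #5 [ffffa88944c93e90] cgpu_inst_ctl_write at ffffffffc41625dd [cgpu_procfs]
 #6 [ffffa88944c93eb0] proc_reg_write at ffffffff9f346ecc
"""
print(module_frames(sample_bt))
```

The exception RIP line (destroy_node+259) carries no module tag in this output. Its address is close to that of cgpu_inst_ctl_write, which suggests the same module; crash's `sym` command on that address can confirm it.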

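5. Inspect the RIP address from frame #4

a. The last instruction executed is: 0xffffffffc4162473 <destroy_node+259>: mov 0x20(%rbx),%rax

  • It loads the value stored at the memory address rbx + 0x20 (an offset of 32 bytes) into the rax register;

b. This instruction triggered the general protection fault, so the next step is to check whether the address it points to does not exist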

crash> dis -rl ffffffffc4162473
0xffffffffc4162370 <destroy_node>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffc4162375 <destroy_node+5>:    push   %r15
0xffffffffc4162377 <destroy_node+7>:    push   %r14
0xffffffffc4162379 <destroy_node+9>:    push   %r13
0xffffffffc416237b <destroy_node+11>:   push   %r12
0xffffffffc416237d <destroy_node+13>:   mov    %rdi,%r12
0xffffffffc4162380 <destroy_node+16>:   push   %rbp
0xffffffffc4162381 <destroy_node+17>:   push   %rbx
0xffffffffc4162382 <destroy_node+18>:   sub    $0x10,%rsp
0xffffffffc4162386 <destroy_node+22>:   test   %rdi,%rdi
0xffffffffc4162389 <destroy_node+25>:   je     0xffffffffc4162521 <destroy_node+433>
0xffffffffc416238f <destroy_node+31>:   mov    %rdi,%rsi
0xffffffffc4162392 <destroy_node+34>:   xor    %eax,%eax
0xffffffffc4162394 <destroy_node+36>:   mov    $0xffffffffc41644c0,%rdi
0xffffffffc416239b <destroy_node+43>:   callq  0xffffffff9f10f4be <printk>
0xffffffffc41623a0 <destroy_node+48>:   mov    0x54ba(%rip),%eax        # 0xffffffffc4167860
0xffffffffc41623a6 <destroy_node+54>:   test   %eax,%eax
0xffffffffc41623a8 <destroy_node+56>:   jle    0xffffffffc4162510 <destroy_node+416>
0xffffffffc41623ae <destroy_node+62>:   mov    $0xffffffffc4167508,%rbp
0xffffffffc41623b5 <destroy_node+69>:   movq   $0xffffffffc41678a0,(%rsp)
0xffffffffc41623bd <destroy_node+77>:   xor    %r13d,%r13d
0xffffffffc41623c0 <destroy_node+80>:   movl   $0x0,0xc(%rsp)
0xffffffffc41623c8 <destroy_node+88>:   mov    %rbp,%rdi
0xffffffffc41623cb <destroy_node+91>:   callq  0xffffffff9f8e6bf0 <down_read>
0xffffffffc41623d0 <destroy_node+96>:   mov    0x50(%rbp),%rax
0xffffffffc41623d4 <destroy_node+100>:  mov    %r13d,%esi
0xffffffffc41623d7 <destroy_node+103>:  mov    %r12,%rdi
0xffffffffc41623da <destroy_node+106>:  mov    0x30(%rax),%rax
0xffffffffc41623de <destroy_node+110>:  callq  0xffffffff9fc03000 <__entry_trampoline_end>
0xffffffffc41623e3 <destroy_node+115>:  mov    %rbp,%rdi
0xffffffffc41623e6 <destroy_node+118>:  mov    %rax,%rbx
0xffffffffc41623e9 <destroy_node+121>:  callq  0xffffffff9f100770 <up_read>
0xffffffffc41623ee <destroy_node+126>:  test   %rbx,%rbx
0xffffffffc41623f1 <destroy_node+129>:  je     0xffffffffc41624e3 <destroy_node+371>
0xffffffffc41623f7 <destroy_node+135>:  mov    %rbp,%rdi
0xffffffffc41623fa <destroy_node+138>:  callq  0xffffffff9f8e6bf0 <down_read>
0xffffffffc41623ff <destroy_node+143>:  mov    0x50(%rbp),%rax
0xffffffffc4162403 <destroy_node+147>:  xor    %esi,%esi
0xffffffffc4162405 <destroy_node+149>:  mov    %rbx,%rdi
0xffffffffc4162408 <destroy_node+152>:  mov    0x98(%rax),%rax
0xffffffffc416240f <destroy_node+159>:  callq  0xffffffff9fc03000 <__entry_trampoline_end>
0xffffffffc4162414 <destroy_node+164>:  mov    0x50(%rbp),%rax
0xffffffffc4162418 <destroy_node+168>:  xor    %esi,%esi
0xffffffffc416241a <destroy_node+170>:  mov    %rbx,%rdi
0xffffffffc416241d <destroy_node+173>:  mov    0x38(%rax),%rax
0xffffffffc4162421 <destroy_node+177>:  callq  0xffffffff9fc03000 <__entry_trampoline_end>
0xffffffffc4162426 <destroy_node+182>:  mov    0x50(%rbp),%rax
0xffffffffc416242a <destroy_node+186>:  xor    %esi,%esi
0xffffffffc416242c <destroy_node+188>:  mov    %rbx,%rdi
0xffffffffc416242f <destroy_node+191>:  mov    0x78(%rax),%rax
0xffffffffc4162433 <destroy_node+195>:  callq  0xffffffff9fc03000 <__entry_trampoline_end>
0xffffffffc4162438 <destroy_node+200>:  mov    %rbp,%rdi
0xffffffffc416243b <destroy_node+203>:  callq  0xffffffff9f100770 <up_read>
0xffffffffc4162440 <destroy_node+208>:  mov    (%rsp),%rax
0xffffffffc4162444 <destroy_node+212>:  mov    %r12,%rdi
0xffffffffc4162447 <destroy_node+215>:  mov    (%rax),%rsi
0xffffffffc416244a <destroy_node+218>:  callq  0xffffffff9f34d010 <remove_proc_subtree>
0xffffffffc416244f <destroy_node+223>:  mov    0x508a(%rip),%r8        # 0xffffffffc41674e0
0xffffffffc4162456 <destroy_node+230>:  mov    %eax,0xc(%rsp)
0xffffffffc416245a <destroy_node+234>:  mov    (%r8),%rax
0xffffffffc416245d <destroy_node+237>:  cmp    $0xffffffffc41674e0,%r8
0xffffffffc4162464 <destroy_node+244>:  lea    -0x20(%r8),%r14
0xffffffffc4162468 <destroy_node+248>:  mov    %r8,%r15
0xffffffffc416246b <destroy_node+251>:  lea    -0x20(%rax),%rbx
0xffffffffc416246f <destroy_node+255>:  jne    0xffffffffc416248e <destroy_node+286>
0xffffffffc4162471 <destroy_node+257>:  jmp    0xffffffffc41624e3 <destroy_node+371>
0xffffffffc4162473 <destroy_node+259>:  mov    0x20(%rbx),%rax
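
One observation that may be worth checking at this point: the disassembly derives rbx with `lea -0x20(%rax),%rbx` from a pointer it has just loaded, which resembles a linked-list walk over entries whose list member sits at offset 0x20, and the faulting load reads back from rbx + 0x20. The sketch below compares that address with the kernel's list poison value. It assumes LIST_POISON1 is 0x100 plus the default x86-64 CONFIG_ILLEGAL_POINTER_VALUE of 0xdead000000000000; the actual config of this kernel was not verified, and the function names are made up for illustration:

```python
# Assumption: default x86-64 poison base (CONFIG_ILLEGAL_POINTER_VALUE).
POISON_POINTER_DELTA = 0xDEAD000000000000
# list_del() stores this value in the ->next pointer of the removed entry.
LIST_POISON1 = POISON_POINTER_DELTA + 0x100


def effective_address(base: int, displacement: int) -> int:
    """Address read by `mov disp(%base),%reg`, wrapped to 64 bits."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF


def describe(addr: int) -> str:
    if addr == LIST_POISON1:
        return "LIST_POISON1 (next pointer of a deleted list entry)"
    return "not the list poison value"


rbx = 0xDEAD0000000000E0   # RBX from the exception frame
target = effective_address(rbx, 0x20)
print(hex(target), "->", describe(target))
```

Under those assumptions the faulting address equals LIST_POISON1, which would be consistent with the loop following the next pointer of an entry that had already been removed from the list. This is only one reading of the registers, not a confirmed root cause.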

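6. Check the memory mapping and permissions of address 0xffffffffc4162473

a. The virtual address ffffffffc4162473 maps to the physical address 70002c473. The PTE (page table entry) flags in the output, PRESENT|ACCESSED|DIRTY, show that the page is present, has been accessed, and has been written to (a dirty page).
b. This rules out the possibility that the address taken from the stack trace does not exist. Note that this is the address of the faulting instruction, not the data address rbx + 0x20 that the instruction reads.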

crash>  vtop ffffffffc4162473
VIRTUAL           PHYSICAL        
ffffffffc4162473  70002c473

PGD DIRECTORY: ffffffffa020a000
PAGE DIRECTORY: 2c220c067
   PUD: 2c220cff8 => 2c220e067
   PMD: 2c220e100 => 7038f5067
   PTE: 7038f5b10 => 70002c061
  PAGE: 70002c000

   PTE     PHYSICAL   FLAGS
70002c061  70002c000  (PRESENT|ACCESSED|DIRTY)

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffd7ae9c000b00 70002c000                0        0  1 17ffffc0000000
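
The numbers in the vtop output can be reproduced by hand, which helps when reading such a translation for the first time. A short Python sketch, assuming 4-level paging with 4 KiB pages (this matches the PGD/PUD/PMD/PTE chain printed above); the names are illustrative:

```python
VA = 0xFFFFFFFFC4162473    # virtual address passed to vtop
PTE = 0x70002C061          # PTE value printed by vtop

# Assumption: 9 index bits per table level and a 12-bit page offset.
FRAME_MASK = 0x000FFFFFFFFFF000

# Low flag bits of an x86-64 page-table entry, plus NX in bit 63.
PTE_FLAGS = {0: "PRESENT", 1: "RW", 2: "USER", 3: "PWT", 4: "PCD",
             5: "ACCESSED", 6: "DIRTY", 8: "GLOBAL", 63: "NX"}


def entry_offset(va: int, shift: int) -> int:
    """Byte offset of the 8-byte entry that va selects within one table page."""
    return ((va >> shift) & 0x1FF) * 8


def decode_pte_flags(pte: int) -> list:
    """Names of the flag bits that are set in pte, lowest bit first."""
    return [name for bit, name in sorted(PTE_FLAGS.items()) if (pte >> bit) & 1]


phys = (PTE & FRAME_MASK) | (VA & 0xFFF)

print("PUD entry offset:", hex(entry_offset(VA, 30)))
print("PMD entry offset:", hex(entry_offset(VA, 21)))
print("PTE entry offset:", hex(entry_offset(VA, 12)))
print("physical address:", hex(phys))
print("flags:", decode_pte_flags(PTE))
```

The offsets 0xff8, 0x100 and 0xb10 agree with the low 12 bits of the PUD, PMD and PTE entry addresses in the vtop output (2c220cff8, 2c220e100, 7038f5b10). RW is absent from the decoded flags, so the mapping is read-only, which would be consistent with a page of module code.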

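7. That left us back where we started, with no further leads. We shared our findings with the Alibaba Cloud engineers and learned that this was a bug in their cgpu component.

8. Alibaba Cloud then took over, fixed the bug, and released an updated version.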

This article is licensed by the author under CC BY 4.0.