Investigating an InvalidProgramException from a memory dump (part 2 of 3)

    Tech 2025-03-29  12

    In this series of articles, we’re retracing how I debugged an InvalidProgramException, caused by a bug in the Datadog profiler, from a memory dump sent by a customer.

    Part 1: Preliminary exploration

    Part 2: Finding the generated IL

    Part 3: Identifying the error and fixing the bug

    Let’s start with a quick reminder. The profiler works by rewriting the IL of interesting methods to inject instrumentation code. The InvalidProgramException is thrown by the JIT when trying to compile the IL emitted by the profiler, which must be somehow invalid. The first part was about identifying in what method the exception was thrown, and I ended up concluding that Npgsql.PostgresDatabaseInfo.LoadBackendTypes was the culprit. The second part is going to be about how to find the generated IL for that method.

    Finding the generated IL

    Npgsql.PostgresDatabaseInfo.LoadBackendTypes is an asynchronous method. The logic of an async method is stored in the MoveNext method of its state-machine, so that’s the one I was interested in.

    There’s a command in dotnet-dump to display the IL of a method: dumpil. This command requires the MethodDescriptor (MD) of the target method, so I needed to find it for MoveNext.

    I started by dumping all the types in the module, using the dumpmodule -mt command, to find the state-machine:

    This gave me the MT (MethodTable) of the state-machine type: 00007fcf16509c10. Then I fed it to the dumpmt -md command to get the MD:

    The command outputs all the methods from the given type, and from there we can see that our MD is 00007FCF16509B58.

    Unfortunately, the dumpil command returned the original IL, not the rewritten one. Looking for ideas, I used the dumpmd command to get more information about the method:

    Interestingly, the method was marked as not jitted. In hindsight, that makes sense. We rewrite the method using the JitCompilationStarted event of the profiler API. The JIT then tries to compile it, fails, and throws away the rewritten IL.

    Fun fact for those who know about tiered compilation: you may have noticed in the dumpmd output that there are two versions of the method, QuickJitted and OptimizedTier1, despite the IsJitted flag being false. I managed to reproduce this on a test app with a profiler emitting bad IL: after calling the method 30 times, the tiered JIT promotes it to tier 1, even though the method was never jitted successfully.

    Dead-end? I really didn’t want to give up after going through the tedious process of finding the method, so I decided to get creative. The same way that I managed to find the InvalidProgramException on the heap even though it wasn’t referenced anymore, I figured that there could still be traces of the generated IL somewhere.

    To give the rewritten IL to the JIT, the profiler uses the SetILFunctionBody API. What’s interesting about it is that the buffer used to write the IL is provided by the JIT’s own allocator. Quoting the documentation:

    Use the ICorProfilerInfo::GetILFunctionBodyAllocator method to allocate space for the new method to ensure that the buffer is compatible.

    Maybe I could find traces of the IL in whatever data structure is used internally by the body allocator? Unfortunately, the allocator is just a call to the new operator:

    I have no clue how the new operator works in C++, so I decided to follow another lead. What happens to that buffer after it’s given to the SetILFunctionBody method? I’m not going to show the method implementation, but the interesting bit is how it ends up calling Module::SetDynamicIL. In turn, SetDynamicIL stores the IL body in an internal table (this time I’m showing the implementation because it’ll be important for later):

    fTemporaryOverride is false in this codepath, so m_debuggerSpecificData.m_pDynamicILBlobTable is used to store the IL. If I could find the address of that table in the memory dump, then maybe I could retrieve the generated IL!

    As I showed in a previous article, it’s possible to export all the symbols of a module on Linux by using the nm command. So I tried looking for m_debuggerSpecificData, but no luck:

    > nm -C libcoreclr.so | grep m_debuggerSpecificData
    >

    How could I possibly find this structure without symbols?

    I firmly believe that debugging is a creative process. So I took a step back and started thinking. When Module::SetDynamicIL is called, the runtime is capable, somehow, of locating that structure. So the answer, whatever it is, must be somewhere in the assembly code of that method.

    Reading this article makes it sound like an instantaneous process, but locating m_debuggerSpecificData without symbols was the result of 2 hours of trial and error, and of bouncing ideas back and forth with my former coworkers Christophe Nasarre and Gregory Leocadie. In the process, I also discovered that an ISOSDacInterface7 is being implemented for .NET 5, and it has all the facilities I needed to find the dynamic IL. *sigh*

    Fortunately, that method is exported in the symbols:

    > nm -C libcoreclr.so | grep Module::SetDynamicIL
    0000000000543da0 t Module::SetDynamicIL(unsigned int, unsigned long, int)

    I used gdb to disassemble it:

    OK, that’s a lot to process. Especially if, like me, you’re not that familiar with native disassembly. The trick is to compare it with the original source code (that’s why I posted SetDynamicIL earlier), and focus exclusively on what you’re looking for.

    First, we need to locate the this parameter. Object-oriented programming does not exist at the assembly level, so the this pointer that we magically use must be given to the target function somehow. By convention, when calling an instance method, this is the first argument of the function.

    Next, we need to know how arguments are given to the function. Calling Wikipedia to the rescue, we learn that Linux uses the “System V AMD64 ABI” calling convention. In that convention, the first argument of a function is stored in the rdi register.
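
To make the convention concrete, here is a minimal Python sketch of how the integer and pointer arguments of an instance method would be assigned to registers under the System V AMD64 ABI. The parameter types come from the nm signature of Module::SetDynamicIL; the helper name and the argument labels are made up for illustration, and the sketch ignores floating-point and stack-passed arguments.

```python
# Integer/pointer argument registers of the System V AMD64 ABI, in order.
SYSV_INT_ARG_REGISTERS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]


def assign_registers(params, is_instance_method=True):
    """Map argument labels to registers (hypothetical helper, integer args only)."""
    # For an instance method, the hidden `this` pointer comes first.
    names = (["this"] if is_instance_method else []) + list(params)
    if len(names) > len(SYSV_INT_ARG_REGISTERS):
        raise ValueError("further arguments would be passed on the stack")
    return dict(zip(names, SYSV_INT_ARG_REGISTERS))


# Module::SetDynamicIL(unsigned int, unsigned long, int), as reported by nm.
# The labels below are illustrative, not the names used in the runtime source.
layout = assign_registers(
    ["arg1 (unsigned int)", "arg2 (unsigned long)", "arg3 (int)"]
)

for name, register in layout.items():
    print(f"{name:22} -> {register}")
```

The only entry that matters for the rest of the investigation is the first one: the Module instance arrives in rdi.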

    Now we need some kind of “anchor”. A well-identified point within the function that we can focus on. Right at the beginning of SetDynamicIL, we find this condition:

    This is great because it uses m_debuggerSpecificData (the field we're looking for), it has a condition, and it calls a method (InitializeDynamicILCrst). This makes it very easy to spot in the disassembly. Now we know we have to focus on this bit:

    Remember, this is stored in the rdi register. This register is copied to rbx:

    Then we reuse this register here:

    This code reads the memory at the address rbx+0x568, loads the contents into the r14 register, then tests it: test r14,r14. Testing a register against itself is the assembly idiom for checking whether a value is zero (here, a null pointer). That’s our if (m_debuggerSpecificData.m_pDynamicILCrst == NULL) check! This means that m_debuggerSpecificData.m_pDynamicILCrst is located at the offset 0x568 from the address of the module instance.

    That’s great, but I needed m_debuggerSpecificData.m_pDynamicILBlobTable, not m_debuggerSpecificData.m_pDynamicILCrst. So I had a look at the structure stored in the m_debuggerSpecificData field:

    Fields are stored in memory in the same order as they are declared in the code. So m_pDynamicILBlobTable is the pointer stored right after m_pDynamicILCrst in memory, one pointer size (8 bytes in a 64-bit process) further.
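
The layout rule can be checked with a small Python sketch using ctypes, which lays out structure fields the way a C compiler would. The class below is a hypothetical mirror that contains only the two leading fields discussed here, not the full runtime structure.

```python
import ctypes


class DebuggerSpecificData(ctypes.Structure):
    # Hypothetical mirror of the runtime structure: only the two leading
    # pointer fields named in the article, in declaration order.
    _fields_ = [
        ("m_pDynamicILCrst", ctypes.c_void_p),
        ("m_pDynamicILBlobTable", ctypes.c_void_p),
    ]


pointer_size = ctypes.sizeof(ctypes.c_void_p)
crst_offset = DebuggerSpecificData.m_pDynamicILCrst.offset
table_offset = DebuggerSpecificData.m_pDynamicILBlobTable.offset

# The second pointer sits exactly one pointer size after the first one.
print(f"m_pDynamicILCrst      at +{crst_offset:#x}")
print(f"m_pDynamicILBlobTable at +{table_offset:#x}")
```

Since both fields are pointers, there is no padding between them, which is why the table pointer can be expected immediately after the lock pointer.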

    To test this, I first needed the address of the module containing LoadBackendTypes. If you scroll all the way back to where I called dumpmd, you can find it in the output:

    I was looking for the content of the memory at the offset 0x568 of the Module, so I added that to the Module address to get 0x7FCF13EC67D8 + 0x568 = 0x7FCF13EC6D40.
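
The same arithmetic as a short Python sketch, using the Module address and the offset from the text. The 8-byte pointer size is an assumption that holds for a 64-bit process.

```python
MODULE_ADDRESS = 0x7FCF13EC67D8   # Module address taken from the dumpmd output
CRST_FIELD_OFFSET = 0x568         # offset read from the disassembly
POINTER_SIZE = 8                  # assumption: 64-bit process

# Address of the m_pDynamicILCrst field inside the Module instance.
crst_field_address = MODULE_ADDRESS + CRST_FIELD_OFFSET

# m_pDynamicILBlobTable is expected one pointer further.
blob_table_field_address = crst_field_address + POINTER_SIZE

print(hex(crst_field_address))        # 0x7fcf13ec6d40
print(hex(blob_table_field_address))  # 0x7fcf13ec6d48
```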

    I then used LLDB to dump the memory at that address:

    (lldb) memory read --count 4 --size 8 --format x 7FCF13EC6D40
    0x7fcf13ec6d40: 0x00007fce8c90c820 0x00007fce8c938380
    0x7fcf13ec6d50: 0x0000000000000000 0x0000000000000000

    Assuming my reasoning was correct, 0x00007fce8c90c820 would be the value of m_pDynamicILCrst, and 0x00007fce8c938380 the value of m_pDynamicILBlobTable, that is, the address of the table. There was no way to be completely sure, but I could check if the values in memory matched the layout of the table in the source code:

    One pointer to a table, followed by 4 integers indicating the size and the occupancy of the table.

    I first dumped the pointer:

    (lldb) memory read --count 2 --size 8 --format x 0x00007fce8c938380
    0x7fce8c938380: 0x00007fce8c938a90 0x000000010000000b

    0x00007fce8c938a90 sure looks like a pointer to the heap. Then I checked the integers (using size 4 instead of size 8):

    (lldb) memory read --count 8 --size 4 --format x 0x00007fce8c938380
    0x7fce8c938380: 0x8c938a90 0x00007fce 0x0000000b 0x00000001
    0x7fce8c938390: 0x00000001 0x00000008 0x00000045 0x00000000

    The pointer was still there (its two halves appear swapped because of the little-endian byte order), then a few small values. I mapped everything to the fields of the table and got:

    m_table = 0x00007fce8c938a90
    m_tableSize = 0x0b
    m_tableCount = 0x1
    m_tableOccupied = 0x1
    m_tableMax = 0x8

    Once again, there was no way to be sure, but the values were consistent with the expected layout, and seemed to indicate that there was one element stored in the table!
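
That mapping can be reproduced with a short Python sketch: rebuild the raw bytes from the 4-byte values printed by LLDB, then decode them as one 8-byte pointer followed by four 32-bit integers. The field names and their order are the ones listed above; the sketch assumes little-endian layout with no padding between fields.

```python
import struct

# The eight 4-byte values printed by LLDB at 0x7fce8c938380.
dwords = [
    0x8C938A90, 0x00007FCE, 0x0000000B, 0x00000001,
    0x00000001, 0x00000008, 0x00000045, 0x00000000,
]

# Rebuild the raw little-endian bytes exactly as they sit in memory.
raw = struct.pack("<8I", *dwords)

# One pointer followed by four 32-bit integers.
(m_table, m_tableSize, m_tableCount,
 m_tableOccupied, m_tableMax) = struct.unpack_from("<QIIII", raw, 0)

print(f"m_table         = {m_table:#018x}")
print(f"m_tableSize     = {m_tableSize:#x}")
print(f"m_tableCount    = {m_tableCount:#x}")
print(f"m_tableOccupied = {m_tableOccupied:#x}")
print(f"m_tableMax      = {m_tableMax:#x}")
```

Decoding the first two 4-byte values as a single 8-byte integer is what puts the two halves of the pointer back in the right order.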

    m_table is a hashtable associating method tokens and pointers to IL code:

    I must confess I had a bit of trouble figuring out the layout of the hashtable from the source code (full of templates and other C++ magic), so I cheated a bit. I knew from the dumpmd output that my method token was 0x6000D24. So I just dumped a bunch of memory at the memory location of the hashtable and looked for that value:

    (lldb) memory read --count 32 --size 4 --format x 0x00007fce8c938a90
    0x7fce8c938a90: 0x00000000 0x00007fce 0x00000000 0x00000000
    0x7fce8c938aa0: 0x00000000 0x00007fcf 0x00000000 0x00000000
    0x7fce8c938ab0: 0x00000000 0x6265645b 0x00000000 0x00000000
    0x7fce8c938ac0: 0x00000000 0x6974616c 0x00000000 0x00000000
    0x7fce8c938ad0: 0x00000000 0x74636e75 0x00000000 0x00000000
    0x7fce8c938ae0: 0x00000000 0x36303437 0x00000000 0x00000000
    0x7fce8c938af0: 0x06000d24 0x00000000 0x8c983790 0x00007fce
    0x7fce8c938b00: 0x00000000 0x7367704e 0x00000000 0x00000000

    It turned out that the value was next to a pointer (0x00007fce8c983790, backwards), so there was a good probability that it was pointing to the IL I was looking for!
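
The search can also be done programmatically. The Python sketch below scans the last three rows of the dump for the method token and reads the 8 bytes located 8 bytes after it. That entry shape (a 4-byte token, 4 bytes of padding, then an 8-byte pointer) is an assumption inferred from the dump, not something taken from the runtime source, and the helper name is made up.

```python
import struct

BASE_ADDRESS = 0x7FCE8C938AE0  # address of the first row copied below
METHOD_TOKEN = 0x06000D24      # MethodDef token (table 0x06) from dumpmd

# Last three rows of the LLDB dump, as 4-byte values.
dwords = [
    0x00000000, 0x36303437, 0x00000000, 0x00000000,
    0x06000D24, 0x00000000, 0x8C983790, 0x00007FCE,
    0x00000000, 0x7367704E, 0x00000000, 0x00000000,
]
raw = struct.pack(f"<{len(dwords)}I", *dwords)


def find_il_pointer(data, token):
    """Return (offset, pointer) for the first entry holding `token`, or None."""
    for offset in range(0, len(data) - 16 + 1, 4):
        (value,) = struct.unpack_from("<I", data, offset)
        if value == token:
            # Assumed entry shape: token, 4 bytes of padding, 8-byte pointer.
            (pointer,) = struct.unpack_from("<Q", data, offset + 8)
            return offset, pointer
    return None


offset, il_pointer = find_il_pointer(raw, METHOD_TOKEN)
print(f"token found at {BASE_ADDRESS + offset:#x}")
print(f"candidate IL pointer: {il_pointer:#x}")
```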

    How to confirm it? Every IL method has a header, so I decompiled the original PostgresDatabaseInfo.LoadBackendTypes method with dnSpy to look for a distinctive value. The LocalVarSig token had a value of 0x11000275:

    I then dumped a few bytes at the address I found and looked for the value:

    (lldb) memory read --count 8 --size 4 --format x 0x00007fce8c983790
    0x7fce8c983790: 0x0002301b 0x000005fe 0x11000275 0x08007b02
    0x7fce8c9837a0: 0x020a0400 0x0008037b 0x19060b04 0x0d167936

    And sure enough, it matched!
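
As an extra sanity check, the first 12 bytes can be decoded as a "fat" IL method header as described in ECMA-335 (flags and header size, max stack, code size, LocalVarSig token). The Python sketch below is only an illustration of that decoding applied to the values from the dump; the variable names are mine.

```python
import struct

# First three 4-byte values found at 0x7fce8c983790.
dwords = [0x0002301B, 0x000005FE, 0x11000275]
raw = struct.pack("<3I", *dwords)

# Fat header: 2 bytes of flags/size, 2 bytes max stack,
# 4 bytes code size, 4 bytes LocalVarSig token.
flags_and_size, max_stack, code_size, local_var_sig_tok = struct.unpack("<HHII", raw)

flags = flags_and_size & 0x0FFF            # low 12 bits
header_size_dwords = flags_and_size >> 12  # high 4 bits, in 4-byte units

is_fat_format = (flags & 0x3) == 0x3       # CorILMethod_FatFormat
has_more_sections = bool(flags & 0x8)      # CorILMethod_MoreSects
init_locals = bool(flags & 0x10)           # CorILMethod_InitLocals

print(f"fat header: {is_fat_format}, size: {header_size_dwords * 4} bytes")
print(f"max stack: {max_stack}, code size: {code_size:#x}")
print(f"LocalVarSig token: {local_var_sig_tok:#010x}")
```

The header size of 3 (that is, 12 bytes) and the 0x11 table identifier in the top byte of the token (StandAloneSig) are both what a well-formed fat header should contain, which makes it even less likely that the match is a coincidence.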

    The next and final step was to dump the IL and try to understand why it was causing the InvalidProgramException. That will be the subject of the next article.

    Translated from: https://medium.com/@kevingosse/investigating-an-invalidprogramexception-from-a-memory-dump-part-2-of-3-daaecd8f3cf4
