//===----------------------------------------------------------------------===//
// LLVM IR Embedded Metadata
//===----------------------------------------------------------------------===//

Sep 19, 2008 - Initial Revision

This note describes a new approach to embedding arbitrary metadata in the
LLVM IR in an efficient way.  This introduces some new concepts and
complexity, but it will deliver substantially improved efficiency and
clarity, as well as give LLVM new capabilities.

LLVM IR currently directly supports two forms of metadata: the
llvm.annotation intrinsic and debug information.  First I'll describe
problems with the current debug info implementation (annotations have a
subset of the problems that debug info faces, so I won't explicitly talk
about them), then I'll describe the proposed solution, then talk about
future directions that this design also helps with.

One thing that is worth pointing out is that the existing implementation has
served LLVM quite well for a long time, and it hit a very important sweet
spot: getting things working quickly, allowing us to move on to other
important problems.  This proposal should be seen as a refinement of that
existing work, not something that obviates the need for it.

//===----------------------------------------------------------------------===//
// Intro to LLVM Debug Information
//===----------------------------------------------------------------------===//

LLVM debug information currently *works* fine.  It is capable of capturing
source information for a broad range of different source languages and is
easily extensible to support new ones.  Debug information in general
consists of two major pieces: line location information and user variable
information.

Line location information is conceptually very simple: the IR uses calls to
the @llvm.dbg.stoppoint intrinsic to represent line markers.
The calls take constant integers that describe the line/column number of the
location and a pointer to information that describes the "subprogram" the
line exists in.  Here is a random example:

  call void @llvm.dbg.stoppoint( i32 7, i32 0,
            { }* bitcast (%llvm.dbg.compile_unit.type*
                          @llvm.dbg.compile_unit to { }*) )

Of course, this isn't complete without seeing the llvm.dbg.compile_unit
definition.  It is:

  @llvm.dbg.compile_unit = internal constant %llvm.dbg.compile_unit.type {
      i32 393233,
      { }* bitcast (%llvm.dbg.anchor.type* @llvm.dbg.compile_units to { }*),
      i32 1,
      i8* getelementptr ([11 x i8]* @str1, i32 0, i32 0),
      i8* getelementptr ([50 x i8]* @str2, i32 0, i32 0),
      i8* getelementptr ([45 x i8]* @str3, i32 0, i32 0)
    }, section "llvm.metadata"

And these are the globals that it references:

  @str1 = internal constant [11 x i8] c"funccall.c\00",
      section "llvm.metadata"
  @str2 = internal constant [50 x i8]
      c"/Volumes/Big2/llvm/llvm/test/Regression/Debugger/\00",
      section "llvm.metadata"
  @str3 = internal constant [45 x i8]
      c"4.0.1 LLVM (Apple Computer, Inc. build 5421)\00",
      section "llvm.metadata"
  @llvm.dbg.compile_units = linkonce constant %llvm.dbg.anchor.type
      { i32 393216, i32 17 }, section "llvm.metadata"

Variable information is also conceptually simple.  For each user-level
variable there is either an entry hanging off @llvm.global_variables or an
@llvm.dbg.declare intrinsic call.  In either case, the intrinsic or global
connects the location of the variable with a pointer to an LLVM IR global in
the llvm.metadata section that describes it.  This then recursively points
off to type information, etc.  I won't give an example, because even a
simple one is very large.

Interestingly enough, this representation in the LLVM IR is not the
representation that either the front-end generates or the code generator
consumes.
Because the IR representation is very verbose and complex, a transient tree
representation defined in llvm/CodeGen/MachineModuleInfo.h is produced by
the front-end, and this representation then serializes itself into the LLVM
IR.  When the code generator goes to emit DWARF information, it first
converts the LLVM IR back into the MachineModuleInfo tree representation,
then emits DWARF during a tree-walk of the MachineModuleInfo nodes.

See http://llvm.org/docs/SourceLevelDebugging.html for more information on
the form and structure of LLVM debug info.

//===----------------------------------------------------------------------===//
// Problems with LLVM Debug Information
//===----------------------------------------------------------------------===//

As I mentioned before, our debug information representation is rich enough
to describe most simple things and is fully capable of capturing the full
source-level type system.  One way to look at it is that it is an encoding
of DWARF debug info that is captured in the LLVM IR through these global
variable descriptors.  This gives us great generality and made it possible
to implement debug info support without extending the LLVM IR.

However, the current representation obviously has problems.  First, it is
very difficult to read: in the @llvm.dbg.compile_unit descriptor (for
example), you have to skip around in the file to find out that it represents
a file named funccall.c.  Looking at descriptors for types is also very
difficult.

Second, this representation is extremely expensive in both time and memory
use.  The expense is high, and it impacts a very important scenario: compile
time of programs at "-O0 -g".  When compiling in this mode, the job of the
compiler is to produce fully debuggable code as fast as possible.  With
recent improvements in other areas of the compiler, the time spent
shovelling around debug information is turning into a very significant
portion of compile time.
The compile-time expense comes from several different but related problems:

1. Parity mismatch between LLVM IR and tree concepts
----------------------------------------------------

LLVM IR is specialized for representing executable code and handling ABI
issues, but it doesn't contain any high-level concepts like strings.
Strings in particular are very expensive for debug information, as shown
above.  For each string (e.g. the filename) we have an allocation for an
LLVM GlobalVariable, an allocation of a ConstantArray (for the initializer),
and one ConstantInt for each letter (which are uniqued, so shouldn't be
counted against any one string).  In addition, the ConstantArray contains an
array of operands (one for each character), and each operand Use is 12 bytes
in size.  This means that just the ConstantArray object itself is at least
12*StringLen bytes on a 32-bit host (and twice that on a 64-bit host).

In addition to all of this direct cost, there is an often-hidden cost: all
of these Constant objects are uniqued, which means there are shadow uniquing
data structures that are currently expensive and poorly implemented.  While
this can be improved, we will always need to unique these objects.  Because
all of this is just an encoding of a debug information string, we get no
value out of the capabilities that the LLVM IR provides for us (e.g. RAUW):
this is strictly overhead.  Finally, it is worth pointing out that strings
are very common in debug info: for example, every C++ object gets a mangled
name that is a string.  Strings are a specific area that needs substantial
improvement.

2. LLVM IR Types
----------------

Because debug info is serialized into the LLVM IR, we have to live by its
rules.  In practice, this means that each global has to have a type and type
conversions must be obeyed.  For example, each string of a particular length
requires a type to represent it ([11 x i8], [45 x i8], ...).
Types are uniqued, but they are not free, and many of these types won't be
used by the program itself.  Even worse than the types themselves are the
constant expressions that must be inserted to obey the type rules.  These
bitcasts (and/or getelementptrs) happen for just about every string and for
every LLVM global IR object.  Using bitcasts like this is bad for two
reasons: 1) it makes the IR harder to read, and 2) they are expensive to
allocate and walk through.  They add no value; they just add cost to
represent, allocate, and push around.

3. Data Structure Conversions
-----------------------------

The final big problem we have is that the compiler does a huge number of
allocations and deallocations just to handle data structure conversions:
from the front-end AST to MachineModuleInfo to LLVM IR to MachineModuleInfo
to the output .s file.  Ideally we'd like to just print out DWARF from the
source-level AST, but this won't work for LLVM (the code generators are not
allowed to know about front-end data structures).  However, just eliminating
the MachineModuleInfo representation of debug info would be a huge saving of
time.  This would eliminate two (very large) data structures that are
transient and expensive to produce.  It would also significantly reduce the
amount of debug-info-related heap thrashing that happens.

//===----------------------------------------------------------------------===//
// Proposed Solution: Explicit IR Metadata
//===----------------------------------------------------------------------===//

The proposed solution to these problems is to introduce two new LLVM IR
objects which derive from Value: 'MDString' and 'MDNode' (MD == MetaData).

MDString
--------

MDString is a simple uniqued string that derives from Value.  It contains an
arbitrary byte sequence, maintains a length, and always has the same
type: {}.
This makes representation of strings dramatically cheaper: each unique
string ends up being just two allocations: one for the StringMap node used
to unique it and one for the Value itself.  When printed, an MDString looks
like !'foo'.  For example, this means that the llvm.var.annotation intrinsic
would end up looking like this (instead of having two i8*'s that point to
global variables):

  call void @llvm.var.annotation( i8* ..., {} !'stuff', {} !'file.c', i32 0)

This change makes strings a lot more efficient to represent and push around,
and also makes the LLVM IR a lot easier to read.  I'd like to be able to
drop the {} type prefix as well, but that may be tricky in some contexts.

MDNode
------

MDNode is a simple class, derived from User, that acts as a tuple of other
values.  Like MDString, MDNode always has a type of {}.  Because the
implementation is so simple, we can use a FoldingSet to unique these objects
and dump whatever is needed into them.  The syntax for an MDNode would be
!{ a, b, c }.  When printed, MDNodes can be shared and referenced with the
!42 notation, just like we use %42 and @42 at present.  In their initial
implementation, the only allowed operands of an MDNode will be simple
constants (ConstantInt, ConstantFP, UndefValue), MDString/MDNode, and
GlobalValue.

Invariants on MDNode and MDString
---------------------------------

For general sanity, we'll require some invariants: these values can only be
referenced by MDNodes, by GlobalVariables which have an 'llvm.' prefix, and
by calls to intrinsic functions.  Everything else is an error.  Also, like
constants, MDNodes and MDStrings can never be named.  Finally, I don't see a
reason for MDNodes to be cyclic at this point; restricting them to acyclic
graphs would make the implementation simpler.
//===----------------------------------------------------------------------===//
// Re-encoded Example
//===----------------------------------------------------------------------===//

If we take the location example above, it would re-encode (and
auto-upgrade!) to something like this:

  ...
  !142 = !{ i32 393233, i32 1, !'funccall.c',
            !'/Volumes/Big2/llvm/llvm/test/Regression/Debugger/',
            !'4.0.1 LLVM (Apple Computer, Inc. build 5421)' }
  ...
  call void @llvm.dbg.stoppoint(i32 7, i32 0, !142)  ; !142 = 'funccall.c'
  ...

Note that this captures exactly the same information, but is much easier to
read and substantially more efficient to represent.  In particular, note
that the asmprinter "will" (should) be enhanced to print information about
any referenced metadata objects in the comment part of the intrinsic line.
Specifically, it will print the first MDString object contained in the
MDNode, or nothing if none is present.  This will make it substantially
easier to read the metadata in many cases: here we can tell that the
stoppoint is for funccall.c without having to look at what !142 is.

//===----------------------------------------------------------------------===//
// Debug Info Construction/Manipulation Helper Classes
//===----------------------------------------------------------------------===//

I think this would be a huge step forward in making the LLVM IR more
efficient at representing debug information, but it doesn't solve an
important part of the problem: the need to generate MachineModuleInfo data
structures in the front-end and reconstruct them in the backend.  While
walking a tree of MDNodes is certainly simpler than having to walk through
bitcasts etc., the lack of strong typing (i.e. just having to 'know' that
the third field of the current struct is its CVR qualifiers) is a real
problem.  To solve this, I propose the implementation of some really trivial
by-value wrapper classes that each contain a single Value*.
For example, something like this would correspond to the CompileUnitDesc
class in MachineModuleInfo and be used by the code generator and front-end
to produce/manipulate debug info:

  class DICompileUnit {
    MDNode *V;
  public:
    DICompileUnit(Value *Node);

    // DICompileUnit::Create - Returns a DICompileUnit that wraps a newly
    // made MDNode with the appropriate contents.  std::string is just for
    // illustration :)
    static DICompileUnit Create(unsigned Language,
                                const std::string &Filename,
                                const std::string &Directory,
                                const std::string &Producer);

    // Accessors
    unsigned getLanguage() const {
      return cast<ConstantInt>(V->getOperand(1))->getZExtValue();
    }
    const std::string &getFileName() const {
      return cast<MDString>(V->getOperand(2))->getString();
    }
    const std::string &getDirectory() const { ... }
    const std::string &getProducer() const { ... }
  };

The various accessors on higher-level objects would return (by value) these
helper classes where appropriate.  For example, the DITypeDesc class would
have a method like:

  DICompileUnit getFile() const;

which would return the DICompileUnit that the TypeDesc is defined in.
Likewise, the static constructor method would take a DICompileUnit when
creating the DITypeDesc.

The nice thing about this approach is that it is a) very simple,
b) preserves type safety and the benefits of MachineModuleInfo, and
c) is effectively free in terms of compile time.

//===----------------------------------------------------------------------===//
// Incremental Implementation
//===----------------------------------------------------------------------===//

Implementing this incrementally should be pretty straight-forward.  First we
introduce the new MDString and MDNode classes and add bc/ll reader/writer
support.  Once this exists, simple .ll testcases can be used to prove that
it works.

After that exists, the first thing to be updated should be the llvm
annotations, changing them to use MDString instead of a pointer to i8.
This can provide a simple case for auto-upgrade to grind on, and will make
life a bit easier for any clients of annotations.  Once that is done, we can
start implementing the DI* helper classes, implement auto-upgrade support
for converting global-variable-based debug info into the new form, and
change the serialization/deserialization logic in MachineModuleInfo to
read/write the new IR form.  Finally, once all this is in place, we can
switch the code generator, then the front-ends, over to using the DI*
classes directly.

//===----------------------------------------------------------------------===//
// Future Directions
//===----------------------------------------------------------------------===//

Once this step is taken, LLVM is well positioned to implement several other
major new features:

Language Specific Metadata
--------------------------

The implementation above points out a clear approach for encoding front-end
specific metadata in the LLVM IR.  I hope that authors of front-ends will
start to play with this to see if it helps with a specific area that LLVM
needs to grow in: enabling higher-level optimizations based on
language-specific source information.  For example, it should be relatively
easy to introduce a language-specific devirtualization pass that is driven
by class hierarchy information encoded as MDNodes.  It is worth pointing out
that nothing stops the implementation of such things today; it is just much
more awkward and would be less efficient than an implementation based on the
techniques above.

Debugging Optimized Code
------------------------

Another new capability that I'd like to tackle in the future is updating
debug information as optimizations progress.  The most basic optimization
that is currently hindered by debug information is mem2reg.  For example, if
you have:

  %Var = alloca i32
  call llvm.dbg.declare(%Var, !47)      ; !47 = 'Var'
  ...
  store i32 1, i32* %Var
  ...
  store i32 %X, i32* %Var
  ...
  store i32 %Y, i32* %Var
  ...
Right now, mem2reg won't promote the alloca, because it is pinned to the
stack by the llvm.dbg.declare call.  However, it would be easy to teach
mem2reg to update debug information as it goes, if only we had a
representation for it to produce.  I suggest something like this:

  ; deleted: %Var = alloca i32
  ...
  ; deleted: store i32 1, i32* %Var
  call llvm.dbg.declare.val(!{ !47, i32 1 })    ; !47 = 'Var'
  ...
  ; deleted: store i32 %X, i32* %Var
  call llvm.dbg.declare.val(!{ !47, i32 %X })   ; !47 = 'Var'
  ...
  ; deleted: store i32 %Y, i32* %Var
  call llvm.dbg.declare.val(!{ !47, i32 %Y })   ; !47 = 'Var'
  ...

This takes advantage of the fact that MDNodes are implicitly variadic with a
single type ({}), so we can shove in an i32 just as well as a float or a
first-class aggregate value.

Implementing this would require a couple of extensions to our current
proposal.  First, we'd need MDNodes to be able to reference Instruction
values as well as other constants (e.g. ConstantExprs).  Second, we'd need
to do some planning to decide on the exact format we'd want for
llvm.dbg.declare.val.  Third, we'd need to start updating optimizations to
tolerate MDNode users of instructions without being pessimized by them.
Finally, the lazy code generator people would have to actually do something
with this information. :)

Type Based Alias Analysis (TBAA)
--------------------------------

TBAA is an important analysis that is required to get good performance on
certain kinds of code.  It basically boils down to having a
language-specific alias set/tree that is pointed to by loads and stores.
With MDNode, this becomes reasonably straight-forward: the front-end
produces its alias tree as a collection of MDNodes, we enhance
load/store/memcpy etc. to take a pointer to one of these, and then we
implement a new AliasAnalysis subclass that walks the alias tree.  The hard
part of this is handling the linking of two modules with different TBAA
information in them.
This is "straight-forward" to implement (just use an llvm.tbaa global with
information on how to perform the linking), but the implementation is
detailed and not for the faint of heart.