//===----------------------------------------------------------------------===//
// LLVM IR Embedded Metadata
//===----------------------------------------------------------------------===//

Sep 19, 2008 - Initial Revision

This note describes a new approach to embedding arbitrary metadata in the
LLVM IR in an efficient way.  This introduces some new concepts and
complexity, but it will deliver substantially improved efficiency and
clarity, as well as give LLVM new capabilities.

LLVM IR currently directly supports two forms of metadata: the
llvm.annotation intrinsic and debug information.  First I'll describe
problems with the current debug info implementation (annotations have a
subset of the problems that debug info faces, so I won't explicitly talk
about them), then I'll describe the proposed solution, then talk about
future directions that this design also helps with.

One thing that is worth pointing out is that the existing implementation has
served LLVM quite well for a long time, and it hit a very important sweet
spot: getting things working quickly, allowing us to move on to other
important problems.  This proposal should be seen as a refinement of that
existing work, not something that obviates the need for it.

//===----------------------------------------------------------------------===//
// Intro to LLVM Debug Information
//===----------------------------------------------------------------------===//

LLVM debug information currently *works* fine.  It is capable of capturing
source information for a broad range of different source languages and is
easily extensible to support new ones.  Debug information in general
consists of two major pieces: line location information and user variable
information.

Line location information is conceptually very simple: the IR uses calls to
the @llvm.dbg.stoppoint intrinsic to represent line markers.
The calls take constant integers that describe the line/column number of the
location and a pointer to information that describes the "subprogram" the
line exists in.  Here is a random example:

  call void @llvm.dbg.stoppoint( i32 7, i32 0,
            { }* bitcast (%llvm.dbg.compile_unit.type*
                          @llvm.dbg.compile_unit to { }*) )

Of course, this isn't complete without seeing the llvm.dbg.compile_unit
definition.  It is:

  @llvm.dbg.compile_unit = internal constant %llvm.dbg.compile_unit.type {
      i32 393233,
      { }* bitcast (%llvm.dbg.anchor.type* @llvm.dbg.compile_units to { }*),
      i32 1,
      i8* getelementptr ([11 x i8]* @str1, i32 0, i32 0),
      i8* getelementptr ([50 x i8]* @str2, i32 0, i32 0),
      i8* getelementptr ([45 x i8]* @str3, i32 0, i32 0)
    }, section "llvm.metadata"

And these are the globals that it references:

  @str1 = internal constant [11 x i8] c"funccall.c\00",
      section "llvm.metadata"
  @str2 = internal constant [50 x i8]
      c"/Volumes/Big2/llvm/llvm/test/Regression/Debugger/\00",
      section "llvm.metadata"
  @str3 = internal constant [45 x i8]
      c"4.0.1 LLVM (Apple Computer, Inc. build 5421)\00",
      section "llvm.metadata"
  @llvm.dbg.compile_units = linkonce constant %llvm.dbg.anchor.type
      { i32 393216, i32 17 }, section "llvm.metadata"

Variable information is also conceptually simple.  For each user-level
variable there is either an entry hanging off @llvm.global_variables or an
@llvm.dbg.declare intrinsic call.  In either case, the intrinsic or global
connects the location of the variable with a pointer to an LLVM IR global in
the llvm.metadata section that describes it.  This then recursively points
off to type information, etc.  I won't give an example, because even a
simple one is very large.

Interestingly enough, this representation in the LLVM IR is not the
representation that either the front-end generates or the code generator
consumes.
Because the IR representation is very verbose and complex, a transient tree
representation defined in llvm/CodeGen/MachineModuleInfo.h is produced by
the front-end, and this representation then serializes itself into the LLVM
IR.  When the code generator goes to emit DWARF information, it first
converts the LLVM IR back into the MachineModuleInfo tree representation,
then emits DWARF during a tree-walk of the MachineModuleInfo nodes.

See http://llvm.org/docs/SourceLevelDebugging.html for more information on
the form and structure of LLVM debug info.

//===----------------------------------------------------------------------===//
// Problems with LLVM Debug Information
//===----------------------------------------------------------------------===//

As I mentioned before, our debug information representation is rich enough
to describe most simple things and is fully capable of capturing the full
source-level type system.  One way to look at it is that it is an encoding
of DWARF debug info that is captured in the LLVM IR through these global
variable descriptors.  This gives us great generality and made it possible
to implement debug info support without extending the LLVM IR.

However, the current representation obviously has problems.  First, it is
very difficult to read: in the @llvm.dbg.compile_unit descriptor (for
example), you have to skip around in the file to find out that it represents
a file named funccall.c.  Looking at descriptors for types is also very
difficult.

Second, this representation is extremely expensive in both time and memory
use.  The expense is high, and it impacts a very important scenario: compile
time of programs at "-O0 -g".  When compiling in this mode, the job of the
compiler is to produce fully debuggable code as fast as possible.  With
recent improvements in other areas of the compiler, the time spent
shovelling around debug information is turning into a very significant
portion of compile time.
The compile-time expense comes from several different but related problems:

1. Parity mismatch between LLVM IR and tree concepts
----------------------------------------------------

LLVM IR is specialized for representing executable code and handling ABI
issues, but it doesn't contain any high-level concepts like strings.
Strings in particular are very expensive for debug information, as shown
above.  For each string (e.g. the filename) we have an allocation for an
LLVM GlobalVariable, an allocation of a ConstantArray (for the initializer),
and one ConstantInt for each letter (which are uniqued, so shouldn't be
counted against any one string).  In addition, the ConstantArray contains an
array of operands (one for each character), and each operand Use is 12 bytes
in size.  This means that just the ConstantArray object itself is at least
12*StringLen bytes on a 32-bit host (and twice that on a 64-bit host).

In addition to all of this direct cost, there is an often-hidden cost: all
of these Constant objects are uniqued, which means there are shadow uniquing
data structures that are currently expensive and poorly implemented.  While
this can be improved, we will always need to unique these objects.  Because
all of this is just an encoding of a debug information string, we get no
value out of the capabilities that the LLVM IR provides for us (e.g. RAUW):
this is strictly overhead.  Finally, it is worth pointing out that strings
are very common in debug info: for example, every C++ object gets a mangled
name that is a string.  Strings are a specific area that needs substantial
improvement.

2. LLVM IR Types
----------------

Because debug info is serialized into the LLVM IR, we have to live by its
rules.  In practice, this means that each global has to have a type and type
conversions must be obeyed.  For example, each string of a particular length
requires a type to represent it ([11 x i8], [45 x i8], ...).
Types are uniqued, but they are not free, and many of these types won't be
used by the program itself.  Even worse than the types themselves are the
constant expressions that must be inserted to obey the type rules.  These
bitcasts (and/or getelementptrs) happen for just about every string and for
every LLVM global IR object.  Using bitcasts like this is bad for two
reasons: 1) it makes the IR harder to read, and 2) they are expensive to
allocate and walk through.  They add no value; they just add cost to
represent, allocate, and push around.

3. Data Structure Conversions
-----------------------------

The final big problem we have is that the compiler does a huge number of
allocations and deallocations just to handle data structure conversions:
from the front-end AST to MachineModuleInfo to LLVM IR to MachineModuleInfo
to the output .s file.  Ideally we'd like to just print out DWARF from the
source-level AST, but this won't work for LLVM (the code generators are not
allowed to know about front-end data structures).  However, just eliminating
the MachineModuleInfo representation of debug info would be a huge saving of
time.  This would eliminate two (very large) data structures that are
transient and expensive to produce.  It would also significantly reduce the
amount of debug-info-related heap thrashing that happens.

//===----------------------------------------------------------------------===//
// Proposed Solution: Explicit IR Metadata
//===----------------------------------------------------------------------===//

The proposed solution to these problems is to introduce two new LLVM IR
objects which derive from Value: 'MDString' and 'MDNode' (MD == MetaData).

MDString
--------

MDString is a simple uniqued string that derives from Value.  It contains an
arbitrary byte sequence, maintains a length, and always has the same
type: {}.
This makes representation of strings dramatically cheaper: each unique
string ends up being just two allocations: one for the StringMap node used
to unique it and one for the Value itself.  When printed, an MDString looks
like !'foo'.  For example, this means that the llvm.var.annotation intrinsic
would end up looking like this (instead of having two i8*'s that point to
global variables):

  call void @llvm.var.annotation( i8* ..., {} !'stuff', {} !'file.c', i32 0)

This change makes strings a lot more efficient to represent and push around,
and also makes the LLVM IR a lot easier to read.  I'd like to be able to
drop the {} type prefix as well, but that may be tricky in some contexts.

MDNode
------

MDNode is a simple class, derived from User, that acts as a tuple of other
values.  Like MDString, MDNode always has a type of {}.  Because the
implementation is so simple, we can use a FoldingSet to unique these objects
and dump whatever is needed into them.  The syntax for an MDNode would be
!{ a, b, c }.  When printed, MDNodes can be shared and referenced with the
!42 notation, just like we use %42 and @42 at present.  In their initial
implementation, the only allowed operands of an MDNode will be simple
constants (ConstantInt, ConstantFP, UndefValue), MDString/MDNode, and
GlobalValue.

Invariants on MDNode and MDString
---------------------------------

For general sanity, we'll require some invariants: these values can only be
referenced by MDNodes, by GlobalVariables which have an 'llvm.' prefix, and
by calls to intrinsic functions.  Everything else is an error.  Also, like
constants, MDNodes and MDStrings can never be named.  Finally, I don't see a
reason for MDNodes to be cyclic at this point; restricting them to acyclic
graphs would make the implementation simpler.
//===----------------------------------------------------------------------===//
// Re-encoded Example
//===----------------------------------------------------------------------===//

If we take the location example above, it would re-encode (and
auto-upgrade!) to something like this:

  ...
  !142 = !{ i32 393233, i32 1, !'funccall.c',
            !'/Volumes/Big2/llvm/llvm/test/Regression/Debugger/',
            !'4.0.1 LLVM (Apple Computer, Inc. build 5421)' }
  ...
  call void @llvm.dbg.stoppoint(i32 7, i32 0, !142)  ; !142 = 'funccall.c'
  ...

Note that this captures exactly the same information, but is much easier to
read and substantially more efficient to represent.  In particular, note
that the asmprinter "will" (should) be enhanced to print information about
any referenced metadata objects in the comment part of the intrinsic line.
Specifically, it will print the first MDString object contained in the
MDNode, or nothing if none is present.  This will make it substantially
easier to read the metadata in many cases: here we can tell that the
stoppoint is for funccall.c without having to look at what !142 is.

//===----------------------------------------------------------------------===//
// Debug Info Construction/Manipulation Helper Classes
//===----------------------------------------------------------------------===//

I think this would be a huge step forward in making the LLVM IR more
efficient at representing debug information, but it doesn't solve an
important part of the problem: the need to generate MachineModuleInfo data
structures in the front-end and reconstruct them in the backend.  While
walking a tree of MDNodes is certainly simpler than having to walk through
bitcasts etc., the lack of strong typing (i.e. just having to 'know' that
the third field of the current struct is its CVR qualifiers) is a real
problem.  To solve this, I propose the implementation of some really trivial
by-value wrapper classes that each contain a single Value*.
For example, something like this would correspond to the CompileUnitDesc
class in MachineModuleInfo and be used by the code generator and front-end
to produce/manipulate debug info:

  class DICompileUnit {
    MDNode *V;
  public:
    DICompileUnit(Value *Node);

    // DICompileUnit::Create - Returns a DICompileUnit that wraps a newly
    // made MDNode with the appropriate contents.  std::string is just for
    // illustration :)
    static DICompileUnit Create(unsigned Language,
                                const std::string &Filename,
                                const std::string &Directory,
                                const std::string &Producer);

    // Accessors
    unsigned getLanguage() const {
      return cast<ConstantInt>(V->getOperand(1))->getZExtValue();
    }
    const std::string &getFileName() const {
      return cast<MDString>(V->getOperand(2))->getString();
    }
    const std::string &getDirectory() const { ... }
    const std::string &getProducer() const { ... }
  };

The various accessors on higher-level objects would return (by value) these
helper classes where appropriate.  For example, the DITypeDesc class would
have a method like:

  DICompileUnit getFile() const;

which would return the DICompileUnit that the TypeDesc is defined in.
Likewise, the static constructor method would take a DICompileUnit when
creating the DITypeDesc.

The nice thing about this approach is that it is a) very simple,
b) preserves type safety and the benefits of MachineModuleInfo, and
c) is effectively free in terms of compile time.

//===----------------------------------------------------------------------===//
// Incremental Implementation
//===----------------------------------------------------------------------===//

Implementing this incrementally should be pretty straight-forward.  First we
introduce the new MDString and MDNode classes and add bc/ll reader/writer
support.  Once this exists, simple .ll testcases can be used to prove that
it works.

After that exists, the first thing to be updated should be the llvm
annotations, changing them to use MDString instead of a pointer to i8.
This can provide a simple case for auto-upgrade to grind on, and will make
life a bit easier for any clients of annotations.  Once that is done, we can
start implementing the DI* helper classes, implement auto-upgrade support
for converting global-variable-based debug info into the new form, and
change the serialization/deserialization logic in MachineModuleInfo to
read/write the new IR form.  Finally, once all this is in place, we can
switch the code generator, then the front-ends, over to using the DI*
classes directly.

//===----------------------------------------------------------------------===//
// Future Directions
//===----------------------------------------------------------------------===//

Once this step is taken, LLVM is well positioned to implement several other
major new features:

Language Specific Metadata
--------------------------

The implementation above points out a clear approach for encoding front-end
specific metadata in the LLVM IR.  I hope that authors of front-ends will
start to play with this to see if it helps with a specific area that LLVM
needs to grow in: enabling higher-level optimizations based on
language-specific source information.  For example, it should be relatively
easy to introduce a language-specific devirtualization pass that is driven
by class hierarchy information encoded as MDNodes.  It is worth pointing out
that nothing stops the implementation of such things today; it is just much
more awkward and would be less efficient than an implementation based on the
techniques above.

Debugging Optimized Code
------------------------

Another new capability that I'd like to tackle in the future is updating
debug information as optimizations progress.  The most basic optimization
that is currently hindered by debug information is mem2reg.  For example, if
you have:

  %Var = alloca i32
  call llvm.dbg.declare(%Var, !47)      ; !47 = 'Var'
  ...
  store i32 1, i32* %Var
  ...
  store i32 %X, i32* %Var
  ...
  store i32 %Y, i32* %Var
  ...
Right now, mem2reg won't promote the alloca, because it is pinned to the
stack by the llvm.dbg.declare call.  However, it would be easy to teach
mem2reg to update debug information as it goes, if only we had a
representation for it to produce.  I suggest something like this:

  ; deleted: %Var = alloca i32
  ...
  ; deleted: store i32 1, i32* %Var
  call llvm.dbg.declare.val(!{ !47, i32 1 })    ; !47 = 'Var'
  ...
  ; deleted: store i32 %X, i32* %Var
  call llvm.dbg.declare.val(!{ !47, i32 %X })   ; !47 = 'Var'
  ...
  ; deleted: store i32 %Y, i32* %Var
  call llvm.dbg.declare.val(!{ !47, i32 %Y })   ; !47 = 'Var'
  ...

This takes advantage of the fact that MDNodes are implicitly variadic with a
single type ({}), so we can shove in an i32 just as well as a float or a
first-class aggregate value.

Implementing this would require a couple of extensions to our current
proposal.  First, we'd need MDNodes to be able to reference Instruction
values as well as other constants (e.g. ConstantExprs).  Second, we'd need
to do some planning to decide on the exact format we'd want for
llvm.dbg.declare.val.  Third, we'd need to start updating optimizations to
tolerate MDNode users of instructions without being pessimized by them.
Finally, the lazy code generator people would have to actually do something
with this information. :)

Type Based Alias Analysis (TBAA)
--------------------------------

TBAA is an important analysis that is required to get good performance on
certain kinds of code.  It basically boils down to having a
language-specific alias set/tree that is pointed to by loads and stores.
With MDNode, this becomes reasonably straight-forward: the front-end
produces its alias tree as a collection of MDNodes, we enhance
load/store/memcpy etc. to take a pointer to one of these, and then we
implement a new AliasAnalysis subclass that walks the alias tree.  The hard
part of this is handling the linking of two modules with different TBAA
information in them.
This is "straight-forward" to implement (just use an llvm.tbaa global with
information on how to perform the linking), but the implementation is
detailed and not for the faint of heart.