Memory Use Markers
------------------

LLVM has a powerful AliasAnalysis infrastructure, and a nice caching
MemoryDependenceAnalysis interface built on top of it.  However, LLVM
currently has deficiencies in a few areas.  This note aims to address these
issues:

1. A front-end cannot communicate properties about memory to the optimizer.
   For example, the front-end may know (from the semantics of the language)
   that a region of memory is immutable over a region of code.

2. MemoryDependenceAnalysis caches information about liveness and
   transparency of memory while it is active, but is often invalidated.
   This can lead to recomputations in common cases.

3. LLVM has no way to reason about the lifetime/scoping of memory objects.

To be clear, here are a few examples of the sorts of things that cannot be
optimized due to each of these:

-------------------------------------------------------------------------------
Memory Immutability
-------------------------------------------------------------------------------

In C++, variables declared 'const' may not be modified between when their
constructor is run and when their destructor is run.  Variables passed around
by reference often have their addresses taken, so SRoA isn't able to promote
them.  For example, consider this code:

  #include <utility>
  void bar(const int &);

  int foo() {
    const int x = 4;
    bar(x);
    return x;
  }

  int foo2() {
    const std::pair<int, int> x(4, 5);
    bar(x.first);
    return x.first;
  }

In this example, the llvm-g++ front-end is able to forward substitute the
value of x to the return in foo, but it is unable to optimize foo2 to
"return 4;", and the optimizer doesn't do it either.  This case applies
equally well to C.

There are many other cases where this sort of information is also useful.
For example, consider:

  class C {
    virtual void virt();
    virtual void foo();
    C();
    void bar() { foo(); }
  };
  C::C() { bar(); bar(); }
  void C::virt() {}

In this case, we know that during the execution of C::C the vtable pointer
is guaranteed to be set to C's vtable.  This means that a call to foo() in
its body is guaranteed to go to C::foo.  If C::C directly calls foo, the G++
front-end has a special hack to detect and handle this, but if you go through
an indirect call (bar in this case) it doesn't do its optimization.  This
results in us (and g++) compiling the second call to foo into an indirect
call through the vtable, instead of turning it into a direct call to C::foo.
If the optimizer knew that the first call to foo() was guaranteed not to
mutate the vtable pointer, both accesses would be devirtualized.  It would
also be nice to not have to replicate optimization hacks like this in Clang.

One nice thing about memory immutability is that it is very language
independent, and non-C languages often have higher-level semantics that they
know about and would like to express so that the optimizer can take
advantage of them.  For example, this could be used to express that the
"length" field of a Java array is immutable, etc.

-------------------------------------------------------------------------------
Invalidation of MemoryDependenceAnalysis
-------------------------------------------------------------------------------

MemoryDependenceAnalysis does a good job of caching analysis results, but it
is often invalidated, and there are other analyses that produce the same
kinds of information but that do not use MemDep (e.g. LICM with its use of
AliasSetTracker).  Further, other forms of transformation can use
information that memdep and other analyses produce.
For example, consider:

  int G;
  int Arr[1000];
  void foo() {
    int i;
    for (i = 0; i < 1000; ++i)
      Arr[i] += G;
  }

In this example, LICM determines that the memory for G is not mutated in the
loop, and thus the load can be hoisted out and used from a temporary
register, something like:

  int G;
  int Arr[1000];
  void foo() {
    int i, tmp;
    tmp = G;
    for (i = 0; i < 1000; ++i)
      Arr[i] += tmp;
  }

This is a great optimization, as we've eliminated a load in the loop.
However, if the body of the loop is actually more complex than this, we may
run out of registers to hold G across the loop.  Currently the register
allocator will generate something like this:

  tmp = G;
  store tmp -> stack
  for (i = 0; i < 1000; ++i) {
    load tmp <- stack
    Arr[i] += tmp;
  }

In this case, it would be much better not to have done the transformation in
the first place: we don't save a load in the loop, and now we have an extra
store and an extra stack slot.  If LICM could record the fact that the
loaded memory is invariant across the loop, then the register allocator's
rematerialization pass would be able to sink the load back into the loop,
undoing the original transformation and saving the store and load.

Unlike the examples in #1 above, it is actually possible for RegAlloc to do
this transformation.  It "just" requires remat to do the same analysis that
LICM did, proving that the load from G would produce the same value inside
and outside the loop.  While this is possible and desirable, it is silly for
both LICM and remat to recompute the exact same information.  It would be
better for remat to be able to efficiently realize that the memory is
invariant.

-------------------------------------------------------------------------------
Lifetime of Memory Objects
-------------------------------------------------------------------------------

Currently, LLVM has no way of communicating the lifetime of a memory object
from the front-end to the backend.
The only real reasoning that the optimizer does about this is to know that
allocas begin their life at their definition point and are live until the
end of the function.

Having live/dead information can be extremely useful for two very different
purposes.  The first use is for explaining lifetime information to GVN and
DSE: a load before a memory object is initialized is known to return
'undef', and a store that has no read before the object dies is dead.  We
already handle these two cases in GVN/DSE, but they are limited to allocas
in trivial cases.  With a more explicit model for this, we could handle heap
allocations with malloc/free, could handle placement new, and could have
much tighter lifetime bounds for stack variables than the end of the
function.

The second use is completely different: explaining to the code generator a
tight bound on the lifetime of an alloca.  This allows us to do stack
coloring of allocas, which can greatly reduce stack space.  For example,
consider:

  void foo(int i) {
    if (i) {
      char A[10000];
      bar(A);
    }
    char B[10000];
    bar(B);
  }

It is very obvious to you and me that A and B have non-overlapping
lifetimes, but the compiler doesn't know this.  This results in a stack size
of 20K instead of 10K.  While this may not seem incredibly important, it can
heavily impact C++ code with a lot of objects that are passed by reference,
and it impacts code that inlines multiple calls that have allocas (their
lifetimes are bounded by the calls, so they cannot overlap).  If the code
generator knew a bound for the lifetime of the allocations, it could share
stack space for them.

-------------------------------------------------------------------------------
Implementation Approach
-------------------------------------------------------------------------------

I propose that we introduce a few new LLVM intrinsics to model this
explicitly in the IR.  The lifetime markers are considered to write to the
memory, and the immutability markers are considered to read from it.
This prevents optimizations that are unaware of the intrinsics from
invalidating the assumption by incorrectly moving loads or stores across the
barriers.

  ; The first argument must be a constant integer.  The second argument must
  ; be a pointer, but can be of any type.  Eventually we will make the
  ; second argument be a metadata pointer, which will mean that the
  ; intrinsic need not be variadic, just type generic.  For variable sized
  ; objects, a size of -1 can be used (which is how the alias analysis
  ; interfaces already work).  The pointer may be an arbitrary pointer; it
  ; need not be related to an alloca.
  ;
  ; This intrinsic indicates that before this point in the code, the value
  ; of the memory is "dead".  This means that it is known to never be used
  ; and has an undefined value.  Any load from the pointer that is preceded
  ; by this intrinsic can be replaced with 'undef'.
  declare void @llvm.lifetime.start(i64 %size, ...)

  ; This takes the same arguments as llvm.lifetime.start.  It indicates
  ; that the memory object is dead after this point, which means that any
  ; stores into the memory object immediately before this can be removed as
  ; dead.
  declare void @llvm.lifetime.end(i64 %size, ...)

  ; llvm.invariant.start - This takes the same arguments as above, but
  ; indicates that, until an llvm.invariant.end that uses the return value,
  ; the referenced memory location is constant and unchanging.
  declare {} @llvm.invariant.start(i64 %size, ...)

  ; llvm.invariant.end - This occurs at the end of an invariant block,
  ; indicating that the memory is mutable again.
  declare void @llvm.invariant.end({} %start, i64 %size, ...)
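As one usage sketch, assuming the declarations above, a front-end could
bound the lifetimes of the two arrays from the stack coloring example in the
previous section like this (a hypothetical, abbreviated lowering, not
generated by any current front-end):

```llvm
define void @foo(i32 %i) {
entry:
  %A = alloca [10000 x i8]
  %B = alloca [10000 x i8]
  %tobool = icmp ne i32 %i, 0
  br i1 %tobool, label %then, label %done

then:
  %a = getelementptr [10000 x i8]* %A, i32 0, i32 0
  call void @llvm.lifetime.start(i64 10000, i8* %a)
  call void @bar(i8* %a)
  call void @llvm.lifetime.end(i64 10000, i8* %a)
  br label %done

done:
  %b = getelementptr [10000 x i8]* %B, i32 0, i32 0
  call void @llvm.lifetime.start(i64 10000, i8* %b)
  call void @bar(i8* %b)
  call void @llvm.lifetime.end(i64 10000, i8* %b)
  ret void
}
```

With the markers, the code generator can see that A is dead before B becomes
live, so the two allocas could share a single 10K stack slot.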
For example, in the first case:

  #include <utility>
  void bar(const int &);

  int foo2() {
    const std::pair<int, int> x(4, 5);
    bar(x.first);
    return x.first;
  }

We currently lower this to:

  define i32 @_Z4foo2v() nounwind {
  entry:
    %x = alloca %"struct.std::pair"
    %0 = getelementptr %"struct.std::pair"* %x, i32 0, i32 0
    store i32 4, i32* %0, align 8
    %1 = getelementptr %"struct.std::pair"* %x, i32 0, i32 1
    store i32 5, i32* %1, align 4
    call void @_Z3barRKi(i32* %0) nounwind
    %2 = load i32* %0, align 8
    ret i32 %2
  }

If we used the intrinsics above, we'd lower it to:

  define i32 @_Z4foo2v() nounwind {
  entry:
    %x = alloca %"struct.std::pair"
    ;; Constructor starts here (this marker isn't strictly needed since it
    ;; is immediately preceded by an alloca, but it is shown for
    ;; completeness).
    call void @llvm.lifetime.start(i64 8, %"struct.std::pair"* %x)
    %0 = getelementptr %"struct.std::pair"* %x, i32 0, i32 0
    store i32 4, i32* %0, align 8
    %1 = getelementptr %"struct.std::pair"* %x, i32 0, i32 1
    store i32 5, i32* %1, align 4
    ;; Constructor has finished here.
    %inv = call {} @llvm.invariant.start(i64 8, %"struct.std::pair"* %x)
    call void @_Z3barRKi(i32* %0) nounwind
    %2 = load i32* %0, align 8
    ;; Destructor is run here.
    call void @llvm.invariant.end({} %inv, i64 8, %"struct.std::pair"* %x)
    ;; Destructor is done here.
    call void @llvm.lifetime.end(i64 8, %"struct.std::pair"* %x)
    ret i32 %2
  }

Since this increases the size of the IR, it would make sense for a front-end
to only generate these markers at -O, not at -O0.