Part of this is unavoidable, since dynamic memory has to be allocated when constructing the parse tree. But for some reason, I kept seeing really large numbers of allocations in places that just didn't warrant them. My first hunch was that I just needed to replace the allocator with something faster, so I wrote a quick linear allocator that would just spam allocations into a massive, pre-sized buffer and then throw the whole thing away after compilation finishes.
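For readers who haven't met the idea, here is a minimal sketch of what a linear (bump) allocator of this kind can look like. It is not the allocator from the Epoch compiler; the class name `LinearAllocator` and its members are hypothetical, and it assumes a single pre-sized buffer that is released all at once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical bump ("linear") allocator: one big buffer, no per-object free.
class LinearAllocator {
public:
    explicit LinearAllocator(std::size_t capacity)
        : buffer_(new unsigned char[capacity]), capacity_(capacity), used_(0) {}

    // Hand out `size` bytes aligned to `align` (which must be a power of two).
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);
        const std::uintptr_t base =
            reinterpret_cast<std::uintptr_t>(buffer_.get());
        // Round the next free address up to the requested alignment.
        const std::uintptr_t aligned =
            (base + used_ + align - 1) & ~(std::uintptr_t(align) - 1);
        const std::size_t offset = static_cast<std::size_t>(aligned - base);
        if (offset > capacity_ || size > capacity_ - offset)
            throw std::bad_alloc();  // buffer exhausted
        used_ = offset + size;
        return buffer_.get() + offset;
    }

    // Individual frees are deliberately a no-op; the whole buffer
    // is released when the allocator itself is destroyed.
    void deallocate(void*) noexcept {}

    std::size_t used() const { return used_; }

private:
    std::unique_ptr<unsigned char[]> buffer_;
    std::size_t capacity_;
    std::size_t used_;
};
```

An allocation here is just an addition and a bounds check, which is why it looks like an easy win when allocation counts are high. It does nothing to reduce the number of allocations being requested in the first place.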
After some hiccups, I got the whole thing to build (which takes forever now) and ran the first test. Paradoxically, it made no difference to execution speed, and it increased memory usage far more than it should have. Worse, I'm running only release-mode builds, because debug builds literally can't cope with my sample input in less than an hour or so. (That's how much excess cruft qi generates... eurgh.) This means that my profiler can't see true call stacks, but rather has to cope with the maze of semi-collapsed templated inline calls that fill the parser.
It took forever, therefore, to make a simple discovery that's had me kicking myself for several minutes now.
The AST structure for Epoch makes heavy use of boost::variant. The AST is also self-recursive, meaning that several nodes are defined in terms of themselves. This necessitated (at one point) the use of boost::recursive_wrapper<> to ensure that the compiler could figure out the circular definitions.
That requirement went away a long time ago when I started doing deferred construction of AST nodes using my own wrapper template, but I never updated all of the uses of recursive_wrapper, because recursive_wrapper itself never showed up in profiling - only the allocation calls did.
Out of random curiosity and frustration, I finally cracked open the code for recursive_wrapper to see how it does its magic.
Yep, you guessed it: it dynamically allocates the contained type.
That means that I'm not only allocating memory for the AST nodes, I'm allocating memory for the deferred variant that points to that AST node - and adding a level of pointless indirection to boot.
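To make the cost concrete, here is a simplified stand-in for this kind of wrapper. It is not Boost's implementation; `HeapWrapper`, `IdentifierNode`, and the allocation counter are hypothetical names used only to show the pattern of "hold a pointer, allocate on every construction, dereference on every access".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustration-only counter of heap allocations made by the wrapper below.
static std::size_t g_wrapper_allocations = 0;

// Hypothetical, simplified stand-in for a recursive wrapper (not Boost's code).
// It stores only a pointer, so T can be incomplete where the wrapper is named,
// but every construction pays for a heap allocation.
template <typename T>
class HeapWrapper {
public:
    HeapWrapper(const T& value) : ptr_(new T(value)) { ++g_wrapper_allocations; }
    HeapWrapper(const HeapWrapper& other) : ptr_(new T(*other.ptr_)) {
        ++g_wrapper_allocations;  // copying the wrapper allocates again
    }
    HeapWrapper& operator=(const HeapWrapper& other) {
        *ptr_ = *other.ptr_;  // reuse the existing heap storage
        return *this;
    }
    ~HeapWrapper() { delete ptr_; }

    // Every access goes through one extra pointer hop.
    T& get() { assert(ptr_ != nullptr); return *ptr_; }
    const T& get() const { assert(ptr_ != nullptr); return *ptr_; }

private:
    T* ptr_;
};

// Hypothetical AST node used only for this demonstration.
struct IdentifierNode {
    std::string name;
};
```

Wrapping a node that is already held through some other deferred pointer therefore costs a second allocation per node, plus another for each copy made while the tree is being assembled.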
The good news is that removing recursive_wrapper isn't too hard, and it speeds things up, although only by a tiny bit.
The bad news is, now I have a rampant memory leak, and I'm utterly stumped as to why.
One step forward, thirty steps back, it seems.
Rule Number One of writing a linear allocator in C++: you still have to manually invoke destructors even if freeing the memory is a no-op.
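One way to follow that rule, sketched under assumptions: have the arena remember a destructor call for every object it constructs and run them all when the arena dies. The names `TrackingArena`, `create`, and `NamedNode` are hypothetical and not from the Epoch code base. The point is only that a node owning a std::string (or any other heap-backed member) leaks that member's memory unless its destructor runs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <string>
#include <utility>
#include <vector>

// Hypothetical arena: bump allocation plus a record of destructors to run.
class TrackingArena {
public:
    explicit TrackingArena(std::size_t capacity) : buffer_(capacity), used_(0) {}
    TrackingArena(const TrackingArena&) = delete;
    TrackingArena& operator=(const TrackingArena&) = delete;

    ~TrackingArena() {
        // Freeing the buffer is trivial, but the objects inside it still
        // need their destructors invoked, in reverse order of construction.
        for (std::size_t i = cleanups_.size(); i-- > 0;)
            cleanups_[i].destroy(cleanups_[i].object);
    }

    // Construct a T inside the buffer and remember how to destroy it.
    template <typename T, typename... Args>
    T* create(Args&&... args) {
        void* storage = allocate(sizeof(T), alignof(T));
        T* object = ::new (storage) T(std::forward<Args>(args)...);
        cleanups_.push_back(
            Cleanup{object, [](void* p) { static_cast<T*>(p)->~T(); }});
        return object;
    }

private:
    struct Cleanup {
        void* object;
        void (*destroy)(void*);
    };

    void* allocate(std::size_t size, std::size_t align) {
        const std::uintptr_t base =
            reinterpret_cast<std::uintptr_t>(buffer_.data());
        const std::uintptr_t aligned =
            (base + used_ + align - 1) & ~(std::uintptr_t(align) - 1);
        const std::size_t offset = static_cast<std::size_t>(aligned - base);
        if (offset > buffer_.size() || size > buffer_.size() - offset)
            throw std::bad_alloc();
        used_ = offset + size;
        return buffer_.data() + offset;
    }

    std::vector<unsigned char> buffer_;
    std::size_t used_;
    std::vector<Cleanup> cleanups_;
};

// Hypothetical node that owns heap memory living outside the arena.
struct NamedNode {
    std::string name;
    static int live;  // number of NamedNode objects currently alive
    explicit NamedNode(std::string n) : name(std::move(n)) { ++live; }
    ~NamedNode() { --live; }
};
int NamedNode::live = 0;
```

The cleanup list costs a little memory per object. Types with trivial destructors could skip it entirely, though this sketch doesn't bother with that optimization.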