Things That Make My Life Hell, Part 1: App Domains

Brian in Coding | March 26, 2010

I’m going to start a new series of posts titled, “Things That Make My Life Hell”.  The goal of these posts isn’t to explain to you why you should be glad you’re not me.  No, the goal is to pick some of the harder, messier problems I’ve had to deal with and explain how I solved them.  That way, should you ever have the misfortune of facing the same problems, hopefully you’ll be armed with a solution.

For today’s misfortune I’d like to focus on .NET app domains.  In .NET, an app domain is like a mini process.  It runs inside the same process as your program and shares the same threads.  But there is a brick wall of an “app domain boundary” that separates one app domain from another.  You can get through this boundary, but you have to write code specifically to do it.

ASP.NET is a natural place to expect app domains because each site hosted on a web server should act like a tiny application of its own.  You don’t want one site messing with the state of another, so that brick wall is very handy.  In fact, app domains were designed specifically for ASP.NET because creating a separate process for each web site would waste a lot of server resources.

One of the key features of an app domain is that assemblies you load into an app domain can be unloaded when the domain is unloaded.  This allows you to change assemblies on disk and restart the app without restarting the entire process.  Again, this was designed primarily for easy updates of ASP.NET sites.  This ability to unload assemblies is also the reason why Cider uses them, and why other designers should.  If they can stand the heat in hell, that is.

Cider and App Domains

Cider uses app domains internally.  If we didn’t, every time you built your project we would have to load another set of your compiled assemblies, stacking up the old ones in memory until we exhausted all the memory in your system.  This is what most designers in VS (and Blend) do – Cider is one of the first to break new ground here.

A designer isn’t like an ASP.NET site, though. Designers have lots of communication with their host (Visual Studio, in this case).  Designers also drop and reload their app domains frequently, usually on each build of your project.  This produces two problems that didn’t exist for ASP.NET:

  1. Designers need to have exceptionally fast “cold start” performance because on each build they’re starting from scratch.
  2. Designers need to efficiently move data across their app domain boundary.

These problems are the continued focus of my living hell.  Let’s focus on cold start performance first.

App Domain Startup Performance

Imagine that it took thirty seconds to open a XAML page the first time you started VS (even with the effort we’ve put into performance, that shouldn’t be too hard).  Now imagine you suffered through that delay each time you built your project.  Each time we start up a new app domain, the designer starts “fresh”.  All metadata, assemblies, XAML, etc, needs to be loaded from scratch.  We get the benefit of the disk cache, but we don’t have anything in memory we can reuse.  This is a problem for designers, because they need to load a lot of metadata.  The default references for WPF, for example, have over 7,000 classes we need to scan in order to find XML namespaces for Intellisense. 

On top of recreating our own static state, the CLR has to recreate its state too, along with any global state needed by WPF or Silverlight.  By default, an app domain is configured for “minimal sharing” (LoaderOptimization.SingleDomain).  This means it shares very little of the work other app domains have already done.  A domain using this default configuration will JIT nearly all assemblies it loads, including the .NET framework.  Luckily there are two other sharing modes an app domain can be configured to use:

  • MultiDomainHost.  In this mode any assemblies that are loaded from the GAC will be shared, including JIT and NGen code.  This is great if you’re in the GAC and gets you good performance wins because you don’t have to JIT the framework each time.
  • MultiDomain.  In this mode any assemblies from the GAC or from your probing path will be loaded as shared. This allows you to have decent startup performance for portions of your app that aren’t in the GAC.

In VS 2008, most of VS was installed into the GAC and we used MultiDomainHost.  There was an effort in VS 2010 to move assemblies out of the GAC, and we moved to MultiDomain.  There were some surprising issues with this, however, because the CLR hadn’t really been exercised under MultiDomain mode that much.  It all worked fine, but consumed more memory than we anticipated.  The end result was several months of painful bug fixing and a final admission of defeat that resulted in putting most of VS back in the GAC (we remained in MultiDomain mode, however).  This doesn’t mean you shouldn’t use MultiDomain in your own applications – VS is extremely large and we’re counting every byte.  For applications that are smaller, the slightly larger memory footprint of MultiDomain won’t be noticed.  Also, the additional memory used is in the form of virtual address space, so it doesn’t actually cost against your physical RAM.

This is the first nugget of information I can share with you.  Here’s how to configure secondary app domains so they load fast, but still allow unloading of stuff you load outside of your app base.  First, make sure this attribute is on your Main method:

[LoaderOptimization(LoaderOptimization.MultiDomain)]
static void Main(string[] args) { } 

This attribute tells the CLR that the default domain should be MultiDomain too.  If it’s not, nothing loaded in the default domain can be shared with the secondary domains.  That’s bad, since most of the framework generally gets loaded into the default domain.

Next, you need some code to create your new app domain.  If you want to prevent files from being locked while your domain is running, be sure to set the ShadowCopyFiles flag like I’ve done below:

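// Start from the current domain's setup so the new domain inherits its
// app base and config.
AppDomainSetup setup = AppDomain.CurrentDomain.SetupInformation;
setup.ShadowCopyFiles = "true";  // a string, not a bool
// shadowPaths: semicolon-delimited list of the directories to shadow copy
setup.ShadowCopyDirectories = shadowPaths;
setup.LoaderOptimization = LoaderOptimization.MultiDomain;
AppDomain domain = AppDomain.CreateDomain("Test", AppDomain.CurrentDomain.Evidence, setup);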

One thing that’s important to note about shadow copying and MultiDomain:  if you don’t list the paths to be shadow copied in the ShadowCopyDirectories property, everything not in the GAC will be shadow copied.  This has the unfortunate side effect of disabling sharing for those assemblies too, which is definitely not what you want.
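The snippet above uses a shadowPaths variable without showing where it comes from.  Here is a minimal sketch of one way to build it, assuming ShadowCopyDirectories takes directory names separated by semicolons.  ShadowPaths.Build and the directory names are hypothetical and only here for illustration.

```csharp
using System;
using System.IO;
using System.Linq;

public static class ShadowPaths
{
    // Hypothetical helper: combines each subdirectory with the base directory
    // and joins the results with ';' for AppDomainSetup.ShadowCopyDirectories.
    public static string Build(string baseDirectory, params string[] subdirectories)
    {
        return string.Join(
            ";",
            subdirectories.Select(d => Path.Combine(baseDirectory, d)).ToArray());
    }
}
```

You would then write something like string shadowPaths = ShadowPaths.Build(AppDomain.CurrentDomain.BaseDirectory, "Plugins", "Addins"); so that only those two folders are shadow copied and everything else stays eligible for sharing.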

There are good reasons for the other statements too:

  • You want to derive your app domain’s setup from the current domain or else your new domain won’t have the same app base (it can’t load anything), same permissions (I’m using app domains for unloadability here, not for security), or same config (so anything requiring config settings won’t work).
  • You want to pass the current domain’s evidence into the new domain unless you want it to have a different security model.  .NET 4 has a new security model that makes passing evidence optional, but customers can set a system-wide registry key to shift the CLR into the older CAS behavior, and then your app can break.

Communicating Across the Boundary

Ok, if you followed along you’ll have a nice shiny new app domain.  Yay.  And, it will load assemblies pretty quickly too, so you’re on your way.  Next you need to communicate with that new domain.  At the very minimum you’ll want to create an object in that domain and talk to it.  Something like this:

RemoteClass remote = (RemoteClass)domain.CreateInstanceAndUnwrap(
    typeof(RemoteClass).Assembly.FullName,
    typeof(RemoteClass).FullName);

class RemoteClass : MarshalByRefObject {
    public void Foo(){
        Console.WriteLine("Foo");
    }
}

This will work wonderfully for a while.  In the snippet below, the second call will fail:

remote.Foo();
System.Threading.Thread.Sleep(300000);
remote.Foo();

Why would the second call fail when you have a perfectly valid reference to the remote object?  Because the garbage collector does not keep a remote object alive.  It would be very complicated to coordinate GC across processes or machines.  Instead, remoted objects use a mechanism based on lifetimes and leases.  When you are given an object, it has a default lease that will keep it alive for a certain amount of time (five minutes by default).  Each time you make a call to that object, the lease is renewed so that at least the renew-on-call time remains (two minutes by default).  If, however, you stop touching the object, the lease expires, the remoting system disconnects it, and all you’re left with is a dead proxy.
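If you want to watch the lease tick down yourself, you can ask the remoting system for it.  The following is a rough sketch, not anything Cider ships.  LeaseInspector is a hypothetical name, and it assumes the .NET Framework remoting APIs (System.Runtime.Remoting), which are not available on newer runtimes.

```csharp
using System;
using System.Runtime.Remoting;
using System.Runtime.Remoting.Lifetime;

public static class LeaseInspector
{
    // Returns a one-line description of the lease behind a proxy, or
    // "no lease" when the remoting system has no lease for the object.
    public static string Describe(MarshalByRefObject proxy)
    {
        if (proxy == null) return "no lease";

        ILease lease = RemotingServices.GetLifetimeService(proxy) as ILease;
        if (lease == null) return "no lease";

        return string.Format(
            "state={0}, remaining={1}, initial={2}, renewOnCall={3}",
            lease.CurrentState,
            lease.CurrentLeaseTime,
            lease.InitialLeaseTime,
            lease.RenewOnCallTime);
    }
}
```

Calling Console.WriteLine(LeaseInspector.Describe(remote)) before and after a call to remote.Foo() should show the remaining time changing as the lease is renewed.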

This is a pretty simple solution and scales pretty well because it eliminates polling across processes and machines.  But what do you do if you need to keep an object around for some indefinite amount of time?  For that, you need to “sponsor” it.  Sponsoring is pretty easy:

ClientSponsor sponsor = new ClientSponsor();
sponsor.Register(remote);

Until you call Unregister on the sponsor object, the remote object will maintain its connection across the app domain boundary.  This is simple, until you have code somewhere that makes a mistake.  Mistakes in sponsorship were the major cause of crashes in Cider in VS 2008.  We fixed those in VS 2008 SP1, but there was some evil lurking that we didn’t find until VS 2010.

In VS 2010 we found that we were creating lots of sponsor objects and those objects continued to stay in memory.  It turns out this is because the implementation of ClientSponsor specifies that the sponsor object should always stay alive.  It does this by returning null from “InitializeLifetimeService”.  Don’t do this!  Well, that’s too strong.  Don’t do that if you’re going to be creating more than one instance of that object and remoting it, because each instance you create will stay in memory.  For Cider, this equated to megabytes of memory lost over a typical development session.  My fix for this was simple:  use a singleton sponsor object.

There was a flaw in that fix.

I had assumed (and if you search the internet you’ll find that I’m not alone in this assumption) that calling Register required a balanced call to Unregister.  It does, but I also assumed it was ref counted.  I assumed I could have multiple calls to Register and Unregister like this:

sponsor.Register(remote);
sponsor.Register(remote);

sponsor.Unregister(remote);
sponsor.Unregister(remote); // I expected this to disconnect

What really happens is that the second call to Register is ignored and the first call to Unregister actually removes the sponsor from the lease.  If each of these Register/Unregister pairs happens when a XAML document is opened and closed, what do you think happens if you have more than one document open, you close one, and then you wait five minutes and try to use the designer?  Here’s a hint.  It looks like this:

image
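To make the difference concrete, here is a toy model of the two behaviors.  It does not touch the remoting APIs at all.  SetStyleSponsorList mimics what I actually got (a repeated Register is ignored, so one Unregister removes the sponsor) and RefCountedSponsorList mimics what I had assumed.  Both class names and the string keys are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Models the observed behavior: registering twice has no extra effect,
// so a single Unregister drops the sponsorship.
public sealed class SetStyleSponsorList
{
    private readonly HashSet<string> _sponsored = new HashSet<string>();

    public void Register(string id) { _sponsored.Add(id); }
    public void Unregister(string id) { _sponsored.Remove(id); }
    public bool IsSponsored(string id) { return _sponsored.Contains(id); }
}

// Models the assumed behavior: every Register needs a matching Unregister
// before the sponsorship goes away.
public sealed class RefCountedSponsorList
{
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

    public void Register(string id)
    {
        int count;
        _counts.TryGetValue(id, out count); // count is 0 when id is missing
        _counts[id] = count + 1;
    }

    public void Unregister(string id)
    {
        int count;
        if (!_counts.TryGetValue(id, out count)) return;
        if (count <= 1) _counts.Remove(id);
        else _counts[id] = count - 1;
    }

    public bool IsSponsored(string id) { return _counts.ContainsKey(id); }
}
```

With two Register calls followed by one Unregister, the first class reports the object as no longer sponsored while the second still reports it as sponsored.  That gap is exactly the "two documents open, close one" bug.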

The solution is easy:  implement ref counting in a wrapper.  I’ll save you some trouble.  Here’s a very simple wrapper you can use to introduce GC semantics into your remote objects:

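using System;
using System.Collections.Generic;
using System.Runtime.Remoting;
using System.Runtime.Remoting.Lifetime;

public sealed class RemoteHandle<T> : IDisposable where T:class {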
        
    private T _value;

    public RemoteHandle(T value) {
        Sponsor.Register(value as MarshalByRefObject);
        _value = value;
    }

    ~RemoteHandle() {
        Dispose(false);
    }

    public void Dispose() {
        Dispose(true);
    }

    private void Dispose(bool disposing) {

        // Always disconnect even if finalizing
        var value = _value;
        if (value != null) {
            _value = null;
            Sponsor.Unregister(value as MarshalByRefObject);
        }

        // And if we're disposing revoke
        // finalization.
        if (disposing) GC.SuppressFinalize(this);
    }

    public T Value { get { return _value; } }
}

internal sealed class Sponsor : MarshalByRefObject, ISponsor {
    private static readonly TimeSpan _renewal = TimeSpan.FromMinutes(5.0);
    private static Dictionary<MarshalByRefObject, ReferencedLease> _leaseReferences = 
                       new Dictionary<MarshalByRefObject, ReferencedLease>();
    private static object _syncLock = new object();

    private static readonly Sponsor _instance = new Sponsor();

    private Sponsor() { }

    public override object InitializeLifetimeService() { return null; }
    public TimeSpan Renewal(ILease lease) { return _renewal; }

    internal static void Register(MarshalByRefObject value) {
        if (value != null && RemotingServices.IsTransparentProxy(value)) {
            lock (_syncLock) {
                ReferencedLease r;
                if (_leaseReferences.TryGetValue(value, out r)) {
                    r.ReferenceCount++;
                }
                else {
                    r = new ReferencedLease();
                    r.Lease = RemotingServices.GetLifetimeService(value) as ILease;
                    if (r.Lease != null) {
                        r.ReferenceCount = 1;
                        r.Lease.Register(_instance);
                        _leaseReferences[value] = r;
                    }
                }
            }
        }
    }

    internal static void Unregister(MarshalByRefObject value) {
        if (value != null && RemotingServices.IsTransparentProxy(value)) {
            lock (_syncLock) {
                ReferencedLease r;
                if (_leaseReferences.TryGetValue(value, out r) && --r.ReferenceCount <= 0) {
                    // Note: Dictionary clears key and value from bucket list upon remove.
                    _leaseReferences.Remove(value);
                    try
                    {
                        // Catch here -- if we are finalizing we may have already
                        // finalized the weak handle in the table or the lease may
                        // have already been unloaded.
                        r.Lease.Unregister(_instance);
                    }
                    catch (InvalidOperationException) { }
                    catch (AppDomainUnloadedException) { }
                }
            }
        }
    }

    private class ReferencedLease {
        public ILease Lease { get; set; }
        public int ReferenceCount { get; set; }
    }
}

Using this wrapper is easy, but it’s important to remain extremely consistent.  When holding a remote object reference, you should always hold it in a member variable of type RemoteHandle<T>.  Note that garbage collection can’t “see” beyond the domain boundary, so any object holding a RemoteHandle<T> should implement IDisposable and dispose the remote handle in its own Dispose method.
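As a sketch of that ownership pattern, here is what such a holder might look like.  DocumentView is a hypothetical class, not Cider code.  To keep the sample compilable on its own, THandle stands in for RemoteHandle<T>; any disposable handle behaves the same way here.

```csharp
using System;

// Hypothetical owner of a remote reference. THandle stands in for
// RemoteHandle<T> from this post.
public sealed class DocumentView<THandle> : IDisposable
    where THandle : class, IDisposable
{
    private THandle _handle;

    public DocumentView(THandle handle)
    {
        if (handle == null) throw new ArgumentNullException("handle");
        _handle = handle;
    }

    // Fails loudly instead of handing out a handle that was already released.
    public THandle Handle
    {
        get
        {
            if (_handle == null) throw new ObjectDisposedException("DocumentView");
            return _handle;
        }
    }

    // Disposing the owner releases the handle exactly once.
    public void Dispose()
    {
        THandle handle = _handle;
        if (handle != null)
        {
            _handle = null;
            handle.Dispose();
        }
    }
}
```

In real code you would construct it as new DocumentView<RemoteHandle<RemoteClass>>(new RemoteHandle<RemoteClass>(remote)) and dispose the view when the document closes, which in turn drops one reference on the sponsor.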

Update: The original wrapper I posted could throw exceptions during finalization if the foreign app domain had shut down before the remote handle finalized.  I have updated the code to include a try/catch for exceptions that can be thrown here.  I am not a fan of catching exceptions like this, but it seems like a better plan than having the remote object tell me when the domain is going away.  I'd like to stay away from any additional burden on the remote object's part.

Wrap Up

App domains are one of the only ways to ensure you can unload assemblies you dynamically load.  You can get good performance using multiple app domains but you have to “tune” the domain just right for your scenarios and you must tune cold start of the code you run inside the app domain.  Communicating across the domain boundary is easy, but can introduce bugs that don’t show up unless you leave your product alone for a period of time.  Very few of us test products this way, so it makes it all too likely that a bug will find its way into the final product.  Hopefully the tricks and sample code I provided can prevent you from falling into the same traps we did. 

Remember, like everything on this site, the sample code I provided here is “as is”.  If anyone finds bugs I’ll gladly fix them in the post, but don’t assume I’ve fully vetted this code.
