Friday, June 20, 2014

Fun with Expression Trees

A common pattern I find in my MVC/WebAPI code is that my controller actions usually open a unit of work (or just a read-only repository therein), perform some query or insert/update something, and then return. It's a simple thing to expect a controller action to do, after all. But sometimes I come across an entity model where it's a bit difficult to put much of the logic on the model, and instead it ends up in the controller actions.

For example, imagine a system where your models are a fairly complex graph of objects, versions of those objects, dynamic properties in the form of other child objects, versions therein as well, etc. It's a common pattern when one builds a framework in which to configure an application, as opposed to building just an application itself.

Now in this system, imagine that you want all of your queries against "Entity" models to always ever return just the ones which have not been soft-deleted. (Viewing soft-deleted records would be a special case, and one we haven't built yet.) Well, if you have a lot of queries against those entities in their repository, those queries are all going to repeat the same ".Where()" clause. Perhaps something like this:

return someRepository.Entities
                     .Where(e => e.Versions
                                  .OrderByDescending(v => v.VersionDate)
                                  .FirstOrDefault()
                                  .Deleted != true))
                     .Where(// etc.

That is, you only ever want Entity records where the most recent Version of that Entity is not in a "Deleted" state. It's not a lot of code (at least, not in this simplified example), but it is repeated code all over the place. And for more complex examples, it's a lot of repeated code. And more importantly than the repetition, it's logic which conceptually belongs on the model. A model should be aware of whether or not it's in a "Deleted" state. The controller shouldn't necessarily care about this, save for just invoking some logic that exists on the model.

At first one might simply add a property to the Entity model:

public bool IsDeleted
{
    get { return Versions.OrderByDescending(v => v.VersionDate)
                         .FirstOrDefault()
                         .Deleted != true; }
}

Then we might use it as:

return someRepository.Entities
                     .Where(e => !e.IsDeleted)
                     .Where(// etc.

That's all well and good from an object oriented perspective, but if you're using Entity Framework (and I imagine any number of other ORMs) then there's a problem. Is the ORM smart enough to translate "IsDeleted" to run it on the database? Or is it going to have to materialize every record first and then perform this ".Where()" clause in the code?  (Or just throw an error and not run the query at all?) Most likely the latter (definitely the latter with Entity Framework in this case), and that's no good.

We want as much query logic as possible to run on the database for a number of reasons:
  • It's less data moving across the wire.
  • It's a smaller memory footprint for the application.
  • SQL Server is probably a lot better at optimizing queries than any code you or I write in some random web application.
  • It's a lot easier and more standard to horizontally scale a SQL Server database than a custom application.
So we really don't want to materialize all of the records so that our object oriented models can perform their logic. But we do want the logic itself to be defined on those models because, well, object oriented. So what we need on the model isn't necessarily a property, what we need is an expression which can be used in a Linq query.

A first pass might look something like this:

public static Func<Entity, bool> IsNotDeleted = e => e.Versions
                                                      .OrderByDescending(v => v.VersionDate)
                                                      .FirstOrDefault()
                                                      .Deleted != true;

Which we can then use as:

return someRepository.Entities
                     .Where(Entity.IsNotDeleted)
                     .Where(// etc.

This is a good first step. However, if you profile the SQL database when this executes you'll find that the filtering logic still isn't being applied in the SQL query, but rather still in-memory in the application. This is because a "Func<>" doesn't get translated through Linq To Entities, and remains in Linq To Objects. In order to go all the way to the database, it needs to be an "Expression<>":

public static Expression<Func<Entity, bool>> IsNotDeleted = e => e.Versions
                                                                  .OrderByDescending(v => v.VersionDate)
                                                                  .FirstOrDefault()
                                                                  .Deleted != true;

Same code, just wrapped in a different type. Now when you profile the database you'll find much more complex SQL queries taking place. Which is good, because as I said earlier SQL Server is really good at efficiently handling queries. And the usage is still the same:

return someRepository.Entities
                     .Where(Entity.IsNotDeleted)
                     .Where(// etc.

Depending on how else you use it though, you'll find one key difference. The compiler wants to use it on an "IQueryable<>", not things like "IEnumerable<>" or "IList<>". So it's not a completely drop-in replacement for in-code logic. But with complex queries on large data sets it's an enormous improvement in query performance by offloading the querying part to the database engine.

There was just one last catch while I was implementing this. In some operations I want records which are "Deleted", and in some operations I want records which are not "Deleted". And obviously this doesn't work:

return someRepository.Entities
                     .Where(!Entity.IsNotDeleted)
                     .Where(// etc.

How should one invert the condition then? I could create a second expression property called "IsDeleted", but that's tacky. Not to mention it's still mostly repeated logic that would need to be updated in both places should there ever be a change. And honestly, even this "IsNotDeleted" bothers me from a Clean Code perspective because positive conditionals are more intuitive than negative conditionals. I should have an "IsDeleted" which can be negated. But how?

Thanks to some help from good old Stack Overflow, there's a simple way. And it all comes down to, again, expression trees. Essentially what's needed is an extension which wraps an expression in a logical inverse. This wrapping of the expression would continue through the expression tree until it's translated at the source (SQL Server in this case). Turns out to be a fairly simple extension:

public static Expression<Func<T, bool>> Not<T>(this Expression<Func<T, bool>> f)
{
    return Expression.Lambda<Func<T, bool>>(Expression.Not(f.Body), f.Parameters);
}

See, while there's no ".WhereNot()" or ".Not()" in our normal Linq extensions, there is one for Expressions. And now with this we can wrap our expression. First let's make it a positive condition:

public static Expression<Func<Entity, bool>> IsDeleted = e => e.Versions
                                                               .OrderByDescending(v => v.VersionDate)
                                                               .FirstOrDefault()
                                                               .Deleted == true;

Now let's get records which are deleted:

return someRepository.Entities
                     .Where(Entity.IsDeleted)
                     .Where(// etc.

And records which are not deleted:

return someRepository.Entities
                     .Where(Entity.IsDeleted.Not())
                     .Where(// etc.

Profile the database again and we see that all of the logic is still happening SQL-side. And for the inverted ones, the generated SQL query just wraps the whole condition and negates it exactly as we'd expect it to.

Now, we can still have our calculated properties on our models and we can still do a lot with those models in memory once they're materialized from the underlying data source. But in terms of just querying the data, where performance is a concern (which isn't always, admit it), having some expression trees on our models allows us to still encapsulate our logic a bit while making much more effective use of the ORM and database.