Friday, August 6, 2010

LINQ to DB2

Some of you (assuming there is more than one of you) may know that I've been looking for some way to use LINQ against the massive DB2 core database we have here at work.  And if any of you have worked with DB2 before (this is my second time around), you know that it's a bit of a pain in the ass: nothing supports it, it has all kinds of special proprietary ways of doing anything and everything, and so on.  And, to make matters worse, IBM's website is like a scavenger hunt through hell when you're trying to find a download for something.  I get the feeling they don't want you to be able to do anything without going through one of their sales reps and dedicated client engineers and installing their full suite of nonsense.  [take a deep breath, count to 10... ok]

Anyway, I've been searching online for a while now and trying out various tools, and everything I find falls into one of three categories:
  1. Yes, you can do this! All you need is a driver from IBM that there's no download link for!
  2. Here's a handy tutorial for generating a great ORM for your business objects that creates a database from scratch. (Caveats: Useless against a massive legacy database, probably doesn't grok DB2)
  3. Download some tool and generate your ORM. (Tool doesn't support DB2, or claims to but fails when I try.)
I've even consulted the oracle of Stack Overflow a few times on the subject, phrasing the question differently or approaching it from different angles.  And, recently, it paid off (at the cost of 100 reputation points, which was a bargain).  A helpful answer pointed me to an open source (MIT License) library I hadn't found before called DbLinq.  Now, as usual, no DB2 support.  But that's okay, because it's extensible.  So I set about the task of adding a DB2 "vendor" to the code.  I figured it would mostly be a matter of overriding some methods that generate the SQL syntax to support DB2's own flavor, and initially that's all it was.  But my quest hit a few more roadblocks.

I was hoping to contribute my DB2 support back to the project, but it turned out I had to spend less time keeping the code general-purpose and more time fine-tuning the whole thing for our specific environment.  You'll see what I mean.

First things first: I wanted to make the whole thing read-only.  We don't need to be running experimental code against our core database, even in test.  So most of the overrides for generating the SQL just throw a NotImplementedException; if anybody tries to generate anything other than a SELECT statement, the app will fail.  Good.  All we want is SELECT, at least for now.  The SELECTs will get more and more complex as I add functionality to the provider, but for my tests so far I've kept it simple.
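Here's a minimal sketch of the idea.  The class and method names below are placeholders I made up for illustration; DbLinq's real vendor/SQL-provider override points are named differently, but the pattern is the same:

```csharp
using System;

// Placeholder base class standing in for DbLinq's SQL-generation layer.
public abstract class SqlGeneratorBase
{
    public abstract string BuildSelect(string table, string[] columns);
    public abstract string BuildInsert(string table);
    public abstract string BuildUpdate(string table);
    public abstract string BuildDelete(string table);
}

public class Db2ReadOnlySqlGenerator : SqlGeneratorBase
{
    // SELECT is the only statement we allow against the core database.
    public override string BuildSelect(string table, string[] columns)
    {
        return "SELECT " + string.Join(", ", columns) + " FROM " + table;
    }

    // Anything that would modify data fails loudly instead of running.
    public override string BuildInsert(string table)
    {
        throw new NotImplementedException("Read-only provider: no INSERT.");
    }

    public override string BuildUpdate(string table)
    {
        throw new NotImplementedException("Read-only provider: no UPDATE.");
    }

    public override string BuildDelete(string table)
    {
        throw new NotImplementedException("Read-only provider: no DELETE.");
    }
}
```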

So, the code has been extended and the DB2 support has been added.  Let's generate the data context and table classes!  ...  Man, this is taking a long time.  Well, we have a big database, so it should take a little while.  ...  Ok, that code file is getting big.  ...  Yay!  We're done!  And all we have is, um, 5.2 million lines of code.  Visual Studio doesn't like that.  I don't like it either.

Thus, the next step was to modify the engine of the code generator (this is where it starts to fork off a bit too much to contribute back; that, and the DB2 support is minimal and not very robust yet) to generate a separate file for each table.  Luckily, it was pretty easy.  It already creates everything as partial classes so that it can all be extended, in the proper LINQ data context way.  So I just had to muck a bit with the loop that iterates over the tables and generates the code: instead of the whole loop running inside a single StreamWriter, each iteration now creates its own.  And, since it's all partial classes, each table class file also extends the data context with its own table property.  Nifty.
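Roughly, the change looks like this.  TableInfo and the emitted code are stand-ins for the generator's real model; the point is just that the StreamWriter moved inside the loop:

```csharp
using System.Collections.Generic;
using System.IO;

// Stand-in for the generator's per-table model.
public class TableInfo
{
    public string ClassName;
    public string[] Columns;
}

public static class TableCodeWriter
{
    public static void WriteTableFiles(IEnumerable<TableInfo> tables, string outputDir)
    {
        foreach (TableInfo table in tables)
        {
            // One file per table, each with its own writer, instead of one
            // 5.2-million-line monster.
            string path = Path.Combine(outputDir, table.ClassName + ".cs");
            using (StreamWriter writer = new StreamWriter(path))
            {
                // Emits a partial entity class; the real generator also emits
                // a partial extension of the data context adding this table's
                // Table<T> property.
                writer.WriteLine("public partial class " + table.ClassName);
                writer.WriteLine("{");
                writer.WriteLine("    // ... column properties ...");
                writer.WriteLine("}");
            }
        }
    }
}
```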

Ok, generate the code again.  Now we can see how big that database really is.  Well, as it turns out, the code generated 8,206 table class files.  Assuming my unit conversion is correct, that's just over 4.1 metric fuck-tons of tables.  Holy Hell, Batman, I would have considered a few hundred tables to be excessive.  But, it is what it is.  And now we have code.  Of course, Visual Studio still really doesn't like it.  So let's wrap up the code generation and compilation to a DLL in a script and just re-generate it any time we need it.  (<joke class="inside">I think I'll name it BGC.Entities.DataAccessLayer.dll</joke>)
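The actual wrap-up was a script, but just to sketch the regenerate-and-compile step, here's how the compile half could be done from C# with CodeDOM (the paths and assembly references are assumptions):

```csharp
using System;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

public static class EntityDllBuilder
{
    public static void Build(string generatedDir, string outputDll)
    {
        // Every per-table file the generator spat out.
        string[] sources = Directory.GetFiles(generatedDir, "*.cs");

        var parameters = new CompilerParameters
        {
            GenerateExecutable = false, // build a class library, not an exe
            OutputAssembly = outputDll
        };
        parameters.ReferencedAssemblies.Add("System.dll");
        parameters.ReferencedAssemblies.Add("System.Data.dll");

        using (var provider = new CSharpCodeProvider())
        {
            CompilerResults results = provider.CompileAssemblyFromFile(parameters, sources);
            foreach (CompilerError error in results.Errors)
                Console.WriteLine(error);
        }
    }
}
```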

Doesn't compile.  Shit.  Ok, let's take a look.  A bunch of the table classes have repeated members?  (Oh, and just so you know what I'm looking at: in what appears to be classic DB2 style, the tables are named things like "TFX002AF" and the columns are named things like "WBRCH1".)  Well, as it turns out, we have tables with columns like "WTBK$" and "WTBK%" and such.  I know it hurts, but hopefully it'll build character.  The code generator interprets both of those names as "WTBK_", so the generated class ends up with duplicate members.  Well, that sucks.  My first approach, just to get it to compile so I could see if it even works before putting too much effort into it, was to loop through the members of each table when generating the code and, if a name is a repeat, append another underscore.  So we'll have "WTBK_" and "WTBK__" on the table.  I'll need to go back later and either make something prettier (replace each special character with its name instead of always an underscore?) or decorate the members with enough IntelliSense documentation that developers can at least discern which column they're accessing.
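The quick-and-dirty collision fix looks something like this (MemberNamer is a made-up name; the real logic lives inside the generator's loop):

```csharp
using System.Collections.Generic;

public static class MemberNamer
{
    public static List<string> MakeUnique(IEnumerable<string> columnNames)
    {
        var used = new HashSet<string>();
        var result = new List<string>();

        foreach (string column in columnNames)
        {
            // Replace every character C# won't allow in an identifier
            // with an underscore, the way the generator does.
            char[] chars = column.ToCharArray();
            for (int i = 0; i < chars.Length; i++)
            {
                if (!char.IsLetterOrDigit(chars[i]))
                    chars[i] = '_';
            }
            string name = new string(chars);

            // "WTBK$" and "WTBK%" both sanitize to "WTBK_"; keep appending
            // underscores until the name is unique within this table.
            while (!used.Add(name))
                name += "_";

            result.Add(name);
        }
        return result;
    }
}
```

So MakeUnique(new[] { "WTBK$", "WTBK%" }) comes back as "WTBK_" and "WTBK__", which is ugly but compiles.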

A little more tweaking on the multiple-files thing and it finally compiles.  Sweet.  Now to run it through some tests.  So I coded up a simple little app that grabs some data via LINQ a few times and then the same data via vanilla ADO.  The LINQ code is definitely sleeker and sexier and, of course, strongly friggin' typed.  I'm a big fan of that last bit, because it moves a certain subset of potential errors from run time to compile time.  It's also much less prone to SQL injection.  Well, I don't need to sell you on the benefits of LINQ.  So I run the test.
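The test app boiled down to something like this.  I'm assuming IBM's IBM.Data.DB2 ADO.NET provider here, a generated data context called CoreDataContext, and that WBRCH1 maps to a string; all three are illustrative guesses, not the actual code:

```csharp
using System;
using System.Data;
using System.Linq;
using IBM.Data.DB2; // assuming IBM's ADO.NET provider

public static class QueryComparison
{
    public static void Run(string connectionString)
    {
        // LINQ version: strongly typed, and parameterized by construction,
        // so it's far less prone to SQL injection.
        using (var db = new CoreDataContext(connectionString))
        {
            var rows = from r in db.TFX002AF
                       where r.WBRCH1 == "001"
                       select r;
            Console.WriteLine(rows.Count());
        }

        // Vanilla ADO version: a DataAdapter filling a DataSet.
        using (var connection = new DB2Connection(connectionString))
        {
            var adapter = new DB2DataAdapter(
                "SELECT * FROM TFX002AF WHERE WBRCH1 = @branch", connection);
            adapter.SelectCommand.Parameters.Add(new DB2Parameter("@branch", "001"));

            var results = new DataSet();
            adapter.Fill(results);
            Console.WriteLine(results.Tables[0].Rows.Count);
        }
    }
}
```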

The LINQ code is slow.  Really slow.  It takes several seconds to run a query, whereas the ADO DataAdapter fills a DataSet in the blink of an eye.  After a little tinkering, it's back to Stack Overflow.  The code compiled fine into a DLL, but it's a 36 MB DLL with over 8,000 classes in a single namespace.  Is that a problem?  Jon Skeet says "no" and, well, he is a bit of an oracle in the C# world.  Is it the generated SQL?  It took a bit of research to figure out how to get that out of the debugger, but that ends up not being the problem: these simple SELECT statements are pretty straightforward and run fine against the database.  I do notice in my testing, however, that if I restrict the code generation to a subset of tables (say, 200 of them), it runs as fast as expected.
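For the curious, here's how I'd pull the generated SQL out in code rather than the debugger, assuming DbLinq's DataContext mirrors LINQ to SQL's Log and GetCommand members (it aims to; verify against your version):

```csharp
using System;
using System.Linq;

public static class SqlInspection
{
    // CoreDataContext is the (hypothetical) generated context from earlier.
    public static void Dump(CoreDataContext db)
    {
        // Echo every statement the provider sends to the database.
        db.Log = Console.Out;

        var query = from r in db.TFX002AF
                    where r.WBRCH1 == "001"
                    select r;

        // Or grab the command text directly, without executing the query.
        Console.WriteLine(db.GetCommand(query).CommandText);
    }
}
```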

Well, if the number of classes and the size of the DLL don't matter outside of compile time, then that leaves the syntax-sugaring code that generates the SQL.  I didn't change that when I added the DB2 support; it uses the same machinery the open-source library uses for everything else.  But, since it's open source, I can debug against the code.  Let's step into the LINQ statement and see where it takes us.

It didn't take long to find the rate-determining step.  Now, this is where we get into the trenches with LINQ and start hitting internals I'm unfamiliar with.  Anyway, I ended up stepping into a GetTables() method that loops through every member of the data context class and does a little bit of reflection to figure stuff out.  Does the official Microsoft implementation do this?  I'll have to find out someday.  But this implementation does it, and I guess the developer(s) didn't expect to ever come across a database with 8,206 tables.  The fool(s)!

So what does this loop do?  Damned if I know.  Yet.  Well, at a high level, it iterates over all of the members of the data context class, looks for meta information about each table, and adds it to a list if it isn't already there.  Does it need to?  The only thing that calls this method just grabs the one table it's looking for.  So why do I need meta information on all of them?  It doesn't appear to retain this information, either; the loop runs every time I build a LINQ query.  So screw it, don't run the loop.  You know the table you're looking for: get its meta information and add it to that list you're using.  Maybe now you'll have to re-find the meta information every time I write a query, but it seems like you were doing that anyway.  Or maybe now you'll have repeated meta information in that list.  So far that hasn't posed any problems, and if it does, I'll address it when I need to, on a much smaller and more manageable list.
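In sketch form, the change looks like this.  MetaTable and the surrounding class are placeholders for DbLinq's internals; the shape of the fix is what matters:

```csharp
using System;
using System.Collections.Generic;

// Placeholder for DbLinq's per-table mapping information.
public class MetaTable
{
    public Type EntityType;
    // ... column mappings, etc.
}

public class MetaModelSketch
{
    private readonly List<MetaTable> knownTables = new List<MetaTable>();

    // Before: GetTables() reflected over every member of the data context
    // (all 8,206 table properties) on every query, and the caller then
    // picked the one table it wanted out of the result.
    //
    // After: resolve only the table actually being asked for and add it to
    // the list the caller expects.  Duplicates are possible, but the list
    // stays small, and so far that hasn't caused any problems.
    public MetaTable GetTable(Type entityType)
    {
        MetaTable table = new MetaTable { EntityType = entityType };
        knownTables.Add(table);
        return table;
    }
}
```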

Ok, let's run the test again.  Yay!  A few more tests and I now have a perfectly cromulent LINQ to DB2 provider for .NET, neatly wrapped up in a 36 MB DLL.  Take that, ADO.

1 comment:

  1. Cool man, I am curious about that loop to find meta data you were talking about. I wonder why there isn't some kind of initialization process to do that and cache it or something.
