Thursday, December 30, 2010

F* Yeah! 2 + 1 * 3 = 5

I spent some time lately just reading about compilers a little bit. Nothing too in depth but I wanted to explore the idea of a basic compiler. So I worked my way through some Wikipedia link jumping. Reading about lexical analysis, parsers, grammars. abstract syntax trees, stack machines, register machines, some of everything. I'm surprised I didn't end up on Kevin Bacon's page.

The Idea

Anyway, after all of that I realized something about an exercise we did in school. Take an arithmetic expression, something like 2 + (3 - 1), parse it into a tree data structure, then evaluate the tree. This was basically a simple compiler. We didn't have to describe the lexical and grammar rules, nor generate an intermediate or executable representation, but it was still the same idea. Given this, turn it into something else.

So that sounds like a perfect project, so lets do it in Erlang.

The Lexer


I know the tokens I want: +, -, *, /, number, (, )

After a bit of research I find that Erlang has a lex implementation call...wait for it... leex. Given some lexical rules (defined with a bit of regular expressions and Erlang code) it'll generate me the Erlang code for the lexical analyzer.

Well that's easy. Here's are rules to turn a string representation of an expression into a series of tokens:

[0-9]+ :
  {token, {number, TokenLine, list_to_integer(TokenChars)}}.

[0-9]+\.[0-9]+ :
  {token, {number, TokenLine, list_to_float(TokenChars)}}.

\+ : {token, {'+', TokenLine}}.
\- : {token, {'-', TokenLine}}.
/ : {token, {'/', TokenLine}}.
\* : {token, {'*', TokenLine}}.
\( : {token, {'(', TokenLine}}.
\) : {token, {')', TokenLine}}.

[\000-\s] : skip_token.
[\n] : {end_token, {'$end', TokenLine}}.



The Parser


Ok, I have my lexer. I can turn "2 + 1 * 3" into [{number, 1, 2}, {'+', 1}, {number, 1, 1}, ...
That first 1 on all the tokens is the line number. In my current situation I'm only worrying about a single line expression.

Working directly with that list of tokens would be about as fun as fighting a cool monkey. I want to be friends and he wants to look like a badass while throwing poo. This is where a grammar definition and a parser comes in.

And guess what... in the butt? No, Erlang has a yacc implementation called... wait for it.... yecc. So here's my grammar:

Nonterminals expression.
Terminals number '+' '-' '/' '*' '(' ')'.

Rootsymbol expression.
Endsymbol '$end'.

Left 300 '+'.
Left 300 '-'.
Left 400 '*'.
Left 400 '/'.


expression -> expression '+' expression : [ '$2', '$1', '$3' ].
expression -> expression '-' expression : [ '$2', '$1', '$3' ].
expression -> expression '/' expression : [ '$2', '$1', '$3' ].
expression -> expression '*' expression : [ '$2', '$1', '$3' ].
expression -> '(' expression ')' : '$2'.
expression -> number : '$1'.



Basically I'm saying that every list of tokens should result in an expression (a nonterminal). Which could be a number, an expression inside parentheses, or two expressions joined with one of my supported operators. Those "Left " is the associativity, precedence, and token to aid the parser in getting things correctly.


The Generator


I have a lexer and a parser, sweet. What does that mean? I can turn "2 + 1 * 3" into

     [{'+',1},
          {number,1,2},
          [{'*',1},
              {number,1,1},
              {number,1,3}]]}



I know that is a terrible representation, but's an abstract syntax tree represented as a list (brackets) and tuples (curly braces). The '+' token is the head with only a number to its "left" and a subtree to its "right". Since the tokens come out as a list of tuples, we need another method to represent the tree. Lists will do just as well. This probably looks strange to an OOP guy, but structured data is structured data when you are processing and translating.

Well I'm not done. I have an abstract syntax tree, but how do I use it. Lets turn it into a made up stack based machine language:

-module(generator).
-export([generate/1]).

generate([Root, Left, Right]) ->
  LeftExp = generate(Left),
  RightExp = generate(Right),
  RootExp = generate(Root),
  LeftExp ++ RightExp ++ RootExp;
     
generate({number, _Line, Value}) ->
  [{push, Value}];
    
generate({'+', _Line}) ->
  [{add}];

generate({'-', _Line}) ->
  [{subtract}];

generate({'*', _Line}) ->
  [{multiply}];

generate({'/', _Line}) ->
  [{divide}].

That's the whole generator. The top generator() function just visits each node of the tree while the others rely on Erlang's pattern matching to turn a token into an instruction ({number..} into {push}). You are probably thinking why didn't the lexer generate these instructions as the tokens? Well you don't want to combine these truly separate concerns do you? Also the straight list of tokens as they come out from the lexer wouldn't work with a stack machine where you need to push the data on a stack and pop it off to operator on it. Oh, my operator instructions do the popping.... anyway.


The Virtual Machine


Now I'm turning "2 + 1 * 3" into [{push, 2},{push,1},{push,3},{multiply},{add}]. I can go from a line of "code" to a list of machine instructions. Hot damn, but I still don't have the evaluated value.

At this point you might need to know how to read Erlang code, so I'll skip the gist embedding and provide a link to my git repository for the curious:

https://github.com/copenhas/mathexp

Ok, holy crap. What else do we have to do just to evaluate some math? You start the virtual machine (which runs in it's own process), you send the VM your program (list of instructions), then asynchronously it runs the program and finally will send you the return value in a message.

I'm afraid I'll also leave understanding the basics of a stack machine to the reader as well, but the example of a program above should give you a good idea. I don't think I'm strictly following it (I believe they are usually a 0 operand instruction set not a 0-1), but it works pretty well and the code is pretty straight forward.

Nothing impressive, but the title expresses my feeling when I saw it work.

The Absurdity

Here is the absurdity as a series of Erlang code to get that magical (high) 5.... he he he.

{ok, Tokens, _EndLine} = lexical:string("2 + 1 * 3").
{ok, Ast} = grammar:parse(Tokens).
Program = generator:generate(Ast).
vm:start().
vm:load(Program).
vm:stop().
receive {return, Value} => Value end.

Erlang's assignment does data unpackaging (pulls the data out and sets the variable) and pattern matching (the right side has to match the format of the left). That last line would actually print out the value in the Erlang interpreter. This last bit might have been confusing but I'm going to leave you hanging.

Wednesday, December 29, 2010

ASP.NET Web Forms vs ASP.NET MVC: The Middle Ground

Ok, so it is about time I contribute to this blog as I have been listed as a contributor since it started and have yet to write anything. I apologize in advance for grammatical errors. This is something that has been bothering me for a while and I am seeing more and more argument about this online and among fellow developers and while both sides have their points, I find that there is a nice middle ground that most ignore.

Preface: This is about the .NET MVC framework, not the MVC design pattern. With the exception of one point in the article where I mention the design pattern specifically, any reference to MVC is related to the framework.

ASP.NET Web Forms vs ASP.NET MVC: The Middle Ground

I have heard one problem on both sides "It creates spaghetti code". For MVC it is hard to find where the actual work is being done, where it is posting, etc. For web forms it is the IsPostBack check in the page_load of every page to handle different events and lets not forget the ViewState. Both are valid points in my opinion and I am not a fan of either.

When I started writing websites, I used straight HTML for my pages. My first server side language was Cold Fusion (ugh.) and I never used the built in widgets for rendering. I followed that with ASP 3.0 and that didn't have any (that I knew of). When I started with .NET, I wrote my pages the same way. I never used the web controls provided by .NET, I used normal code. For/While loops instead of repeaters, Response.Write rather than labels, etc. I would use the normal HTML form tags rather than .NET's. I never used .NET's event model, relying instead on javascript and I didn't even know the ViewState existed for the first few years of using .NET.

I would make all of the data I needed for rendering page available to the view after the page_load was complete and that was it. .NET was no longer involved once the page was rendered, and when another page was called I would pull from the Request object for query string or form values and pump data back into the Response object if necessary. If I needed something to persist in the app or session, I used the Application, Session, and Cookie objects provided.

I always felt that this gave a better page lifecycle and user experience anyway, users could use the back button without messing up the flow, and a page only knew how to do one thing, two at the absolute most. I have heard MVC called an advancement of this methodology, and while I agree to a point, I feel it overcomplicates things is some ways.

The first thing that MVC is supposed to provide better than web forms is true separation of concerns. The model does the work, the controller acts a gate keeper, and the view just does the rendering. While all of this is true, I don't see it any better than how web forms CAN be used. Now, I have been told that the way I use web forms is practically the MVC design pattern, so I wonder why is there a need for a completely different set of libraries and handlers if that is the case. What am I really gaining from having the advanced routing methods? Prettier URLs? That is only an advantage for the end user because the views are still hard tied to controllers. Yes, a controller can return something else (such as json) without a view, but that is a marginal improvement at best. The only other advantage I see is less files in the code base, but the controllers are large if they handle a number of actions and can be a bit of a pain to search when you are looking for a specific one. In web forms, I know exactly where to look for my files based solely on the URL. The code behind can (and should) act as a "Controller" with one action. It will talk to the "Model" and handle flow control and redirection based on the request data or the "Model's" response. With the exception of smaller websites that have very little business logic to them, the "Model", whether in in MVC or Web Forms (in a perfect world) is separate from the website anyway. Whether it is a one or two-tier app with business logic in separate libraries/classes, or it is a separate entity (WCF, Windows session, etc) in a n-tier environment, the "Model" is already separate. So once again, MVC is not providing much.

I think the key thing with Web Forms has stirred up so much ire from the development community is the "pluggable widget model" (as D2 calls it). The intent behind the Web Forms design pattern is to act like a WinForm app, and since the early days of VB, WinForm development has been very much plug and play. Developers just drag and drop objects where they want them then wire it up in the back end. In this "controlled" (I use that term lightly) environment, it worked pretty well, but the web world is not as controlled. Programmers don't have the luxury of controlling every action of the user(without some crazy javascript, and that is only if javascript is enabled), only the browser does. After a web request is complete, there is no longer a connection to the server until the client does something else. That is the nature of the web. Embrace it.

When many developers tried to make the shift, from Win to Web, they tried to follow that ideology, not remembering that users, for the most part, are dumb. They don't read your warning not to use the back button, and for the most part, don't care about how you want it to be used, they will use it way they think it should be used, and this causes a lot of headaches. The result was spaghetti code.

The other cause of spaghetti code was the RAD model that Web Forms allowed. People with little to no development experience could create a website that did all kinds of flashy things and provided the end result they wanted without having to worry about silly things like security or maintainability.

In either case, these developers carved out a nice little niche and business types liked their quick turn around and usually low prices so they were hired to write apps that later needed to be maintained by others. Sometimes those others were similar developers that made the code base worse, and other times they were real developers that spent much of their time beating their heads against the wall and sending snippets to The Daily WTF, all while being told by the client that they don't understand why it is so difficult, "It's just a web page".

Like I said earlier, I don't like the Web Forms design model, and I don't see any real reason to use .NET MVC. Both have their advantages and disadvantages, but if you write code with the mindset of how the request/response model works and write your own html, why not use what .NET provides by default? This makes deployment easier as the boxes that house the web app will not need another set of libraries installed on top of .NET, and maintainability is a lot better because of the better codebase structure. This also ensures that your JS and CSS will act the way you expect them to and that you get a lot more control over what is happening at all stages.

Saturday, December 18, 2010

Regular Expressions and HTML

In lieu of writing something interesting or thought-provoking (at least to me) or making any kind of point, today I would instead just like to share something for posterity.  This was in response to the never-ending stream of questions on Stack Overflow regarding parsing HTML with regular expressions.


You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ


In other words, it can't be done. Stop trying. Use XML libraries and HTML DOM parsers, there is no shortage of them.

Originally posted as an answer by bobince on Stack Overflow.

Friday, December 3, 2010

Abstractions: Where do you live?

You know software development and computer science are extremely overwhelming fields. I believe the saying goes something like "the more you learn the more you realize you don't know". That might not be perfectly quoted but you get the idea. We luckily take advantage of previous ideas and build up abstractions and layer to help manage all this overwhelming complexity.

Lets just take a quick look at a set of related abstractions:
electronic signals (yeah that's right go fuck yourself)

Ok seriously, I'll try to list what's involved with just data access:

  • Set and Relational Theory
  • Relational Database Topics (schemas, functions vs stored procs, views, transactions) 
  • Sql Server
  • SQL
  • OLEDB
  • ADO.NET
  • NHibernate


I think that hits the big ones involved for a general .NET data access stack. Now working at the NHibernate level that abstracts away a lot of ADO.NET (provider creation, connections, transactions a little, generating some sql), but you aren't exempt from knowing about it. Beyond that you can't ignore that fact you have a real database underneath that speaks SQL and has its own way of doing things (Sql Server vs Oracle for example).

That's something everyone has to keep in mind: working at one level of abstractions does not mean you can be ignorant of lower levels of abstractions. I do differentiate between the required knowledge to work at a level of abstraction and potential knowledge.

Taking my previous example if you are working at the NHibernate level you are required to know basic NHibernate and some of it's abilities. You are also required to know SQL and something about relational databases (a little design mostly). Now there is the potential (and advantages) to knowing the inner workings and reasoning behind NHibernate (such as Unit of Work, advanced relational mapping), how to work directly with ADO.NET (providers, core interfaces, managing connections), plus all the things you can learn about how your database of choice works.

So what does this mean for people coming into this field... I hope someone is looking out for you. Seriously, you should try to get in at some level of abstraction with a mentor. Realistically you'll have to put some effort into finding a mentor. They can guide you into understanding the level of abstraction you'll be living at and figuring out how it fits in.

What does this mean for people already in the field? You should figure out the potential knowledge in your stack of abstractions you work with (both above and below where you are). Figure out where in that stack you live and try to tackle at least 2 levels down to increase your understanding. I would also suggest going up a level to keep in mind how something is used, but the important thing is to not be ignorant of what's powering your level.
 

Typing - I'll take both

I have seen my share of dynamic vs static articles and they have raised my eyebrow enough times I'm thinking I need to distill my current thoughts in a post. Let's see where this goes...

Firstly, WTF? Seriously, everyone seems to get really defensive about typing. I'm just getting annoyed by the whole thing. I see tons of posts that start off great turn aggressive either due to the author or (usually) via the comments from very opinionated readers. Then sprinkled between those are another annoyance I'm seeing on blogs and project comments. Yeah I think we know you have to make educated technical decisions based on your requirements, but seriously I don't need someone chiming in with nothing to add except "best tool for the job". Ok, I think now that I can try and put some thoughts about the debate down.

I personally think that dynamic typing feels little more closer to how object-oriented programming is suppose to feel. By that I mean the fact that duck typing makes you only concerned with what an object is capable of. You can't think about types when really writing code with duck typing. You are thinking about the object and not the class. Either it can perform the action you want (or process the message) or it can't and you'll get some delegated response or an error. Of course it's that runtime error that gets people twitchy, but seriously not all your code is controlling the space shuttle flight control systems.

Of course static languages give you some explicit self documenting code that the compile can do light correctness checking on, but you do sacrifice some flexibility. Also I find that design artifacts start to pop up from the usually static type system. Things like extra methods on interfaces (as in public members) that are only used if certain properties are a specific value (I know there are ways to design around this). Or heavy use of generics (as in the C# feature), interfaces, design patterns to try and get some extra flexibility out of your code. So there is certainly some trade offs as you would imagine.

Now I have messed around with F# some in the past and I have to say that the type system most common (such as the class system in C# or Java) doesn't do a lot for you. You have to fully define your types and declare them for use everything. It's very explicit on your part and the compiler doesn't do a lot for you. C# did introduce a very light type inferencing with the 'var' keyword but it's abilities are miniscule. If you take a seriously look at F# (I have heard that Haskell is mind blowing) you'll see that it infers types for you as much as possible. You can declare functions without any type information and the compiler will make sure you are passing in types that can fulfill the needs.

I need to spend more time in an advanced type system so I can understand how far type inferencing can take you, but I would like to see languages have a focus on dynamic typing with an optional static system. Even though I love dynamic languages there are still areas you want static typing such as external facing interfaces or touch points between components even inside the same system.

Anyway, that's enough for now.