
Code Re-generation

Mar 12, 2009 at 9:48 PM
Just curious,

    One of the biggest areas that interests me is code generation: with a parser for a language, you have the opportunity to leverage the information at your disposal to possibly re-target the parsed source.  I've heard people rave over CodeDOM, but honestly, it's really not all that great.  There are many limitations, and the amount of code it -doesn't- express mightily outweighs its benefits.  Add to that, there are errors in the core implementation of CodeDOM (I haven't checked whether it's been fixed in 3.0, but I know that using a generic interface in the private implementation details of a method causes the CSharpCodeProvider to emit bad code).

    Have the authors of this project ever considered utilizing a high-level code infrastructure to not only represent their code, but to potentially allow it to have more features (though that would require certain aspects to be rewritten; your tokenizer is quite complex, so it's hard to say), such as lambda expressions, language-integrated query, a type-strict core (your resolver seems to return null on type identification; unfinished?), extension methods, et cetera?

    Another thing that might be more plausible with such an infrastructure is the ability to, as said, retarget the code to another language, like VB, Ruby, or some other language that has enough functionality to handle the high-level code.  Here is a link to a file generated by three different projects I'm working on.  The first is a lexical analyzer state-machine builder (I'm working on the syntax analysis aspect now); the second is a code generator that has support for switch statements, operator overloads, implicit/explicit type-cast overloads, and a variety of functionality that CodeDOM didn't have.  The third is a code highlighter/colorizer, written by an earlier version of the lexical analysis system above.  The code generation framework is old, and it's also in the process of being written anew, to focus on a more type-strict core with a structure that enables ease of flow/type analysis and malleability.  I'm hoping to cover features from C# 3.0 and, as soon as it's out, C# 4.0.

    The main question is: do other nerds like myself even have an interest in such a code framework, or is it all a waste of time?  Sure, I don't mind being the only one who uses it, but if someone else can benefit, why not, right?  Granted, the project's far from completion, but the old code generator, as it is, is far simpler to use than CodeDOM ever will be (unless MS rewrites it from scratch, or adds a heck of a lot of functionality/helpers).  The entire reason I'm writing such a complex infrastructure is that I'm interested in making a new version of the lexical/syntactical parser builder (i.e. a compiler compiler) that has a sub-language (such as C#) embedded in it.  The ability to retarget would be a godsend, and likely won't be easy.

Mar 13, 2009 at 7:22 AM
Hey Alexander,

Great to see you here : )

Personally, I'm totally interested in this; in fact, the lack of things like << in CodeDOM was one of the things that got me into parsers long ago : ). Code gen is a huge part of my daily life -- a good regex tool can go a long way, but sometimes you need more. I'd love to have something intuitive like regex (did I just say that?!), but able to work on meta constructs in code. Lots of cool things are possible once you have that!

Mar 13, 2009 at 7:33 AM
Edited Mar 13, 2009 at 8:01 AM
Retargeting is a nice idea, and something I'm interested in, though to be honest I don't think C#3/4 features are that important because recoding lambdas and such is very difficult, and I doubt you could do it correctly, even in .NET languages, let alone in other domains like Embedded. Another thing with retargeting is that a lot of my motivation for using other languages is getting features not available in C#, and the current state of csparser gives me everything I really want to retarget. So for example, I might retarget the entity structures from C# into F# (because they are tedious to write again), but business methods need to be rewritten anyway because C# has no pattern matching and the like.

I guess what I'm trying to say is the framework's already good enough for my uses, but obviously any effort to extend it is welcome.
Mar 13, 2009 at 10:00 AM
Lambdas, to be honest, will probably be easier than language-integrated query.  LINQ does more than just rewrite the code a little; the often complex results it emits sometimes amaze me.

Though my present focus is getting syntactical analysis down.  Once I have a program that can build basic parsers, I can build upon that to build a version that can build more complex parsers using the newer version of the code generation infrastructure.  Type verification for type-parameters was fairly straightforward; lambda signature inference will likely require a multi-pass method that focuses on the operations performed on the members within the lambda's body, via duck typing.  The actual extraction of the building blocks that make up the lambda itself shouldn't be that difficult, since I'm already prepared for massive structural changes to occur.  One of the main reasons is that I'm planning on targeting Common Intermediate Language (CIL, aka MSIL), the core language of C#, Visual Basic, IronRuby, and IronPython (or any .NET language).  Targeting this language means that any feature, no matter how small, that goes beyond CIL's straightforward nature will have to be transformed into a CIL-compliant version.  I think the biggest issues will be iterators (via yield syntax), LINQ, and lambdas.  Extensions are fairly easy, since the framework requires you to emit ExtensionAttribute instances on assemblies, types, and then the methods that are associated with them.  It'd be pretty easy to filter based upon the assembly attribute, then the type attribute (and type modifiers; it must be a static class or module), and then the member attributes.
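On iterators specifically: lowering `yield` amounts to rewriting the method body into an explicit state machine, which is roughly what the C# compiler does when targeting CIL. Here's a minimal illustration of the idea (in Python, with invented names; this is a sketch of the general technique, not the project's actual transform):

```python
# Hand-lowered equivalent of the generator:
#
#     def count_to(n):
#         i = 0
#         while i < n:
#             yield i
#             i += 1
#
class CountTo:
    """Explicit state machine standing in for the compiler-built iterator."""

    def __init__(self, n):
        self.n = n
        self.i = 0
        self.started = False  # mirrors the iterator's internal state field

    def __iter__(self):
        return self

    def __next__(self):
        if self.started:
            self.i += 1       # resume point: after the 'yield'
        else:
            self.started = True
        if self.i < self.n:
            return self.i     # the 'yield i'
        raise StopIteration   # fell off the end of the loop
```

The compiler-generated version for CIL is more involved (a numbered state field, `MoveNext`, disposal), but the shape is the same: the loop is unrolled into resumable states.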

I'm interested to see what kind of compiler you could get if it was directed by meta-data (attributes).  Microsoft is already working on something similar for C# 5.0 (a managed compiler for C#).
Mar 14, 2009 at 9:47 AM
Edited Mar 27, 2009 at 3:47 PM
As part of the set analysis I'm doing right now, I'm working on finite data structures, with respect to each individual rule.

RelationalExpression ::=
        RelationalExpression:Left; ('<' | "<=" | '>' | ">="):Operator; ShiftLevelExpression:Right; |
        RelationalExpression:Left; ("is" | "as"):Operator; TypeReference:Right; |
IdentityCheckExpression ::=
        IdentityCheckExpression:Left; ("==" | "!="):Operator; RelationalExpression:Right; |

This way, if you have three versions of, say, RelationalExpression, it will build three data versions of the rule.  RelationalExpression in this case doesn't share any data terms between all definitions, so there's no data for its root rule, except for a rule selector (which specifies which rule was parsed, by index; the documentation comment will contain the definition of that rule).  IdentityCheckExpression, on the other hand, shares the 'Right' term between both alternatives, so the root rule definition will contain the 'Right' element.  The first subrule data structure will contain the Left reference and Operator (and the second rule's data will be a non-abstract implementation of the root rule).  Another thing that's done to simplify things is that all 'literals' in rules are cast off into their token versions, so the system knows that they're all operators and doesn't needlessly count each as a different type (tokens in this system are a bit non-traditional, but that's neither here nor there).

So far I have the total dataset union (because 'RuleEntry' derives from the RuleSeries, which has the data node, the union of the whole RuleEntry isn't necessary, since I can just start with rule 1 and intersect from there; it was also useful for determining functional compliance, via Console.WriteLine on the results); next is to obtain the intersection between all rules.  Afterwards, I can use the full intersection to obtain each rule's complement.  After the data's built, I'll need to do another pass to link the various data elements to their definitions, so that when I write the parser blocks it can properly allocate variables for temporary data and then associate them with the proper underlying rule data structures.
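The union/intersection/complement pass can be pictured with ordinary set operations over the named terms of each alternative. A toy Python sketch (not the actual RuleSeries code), using the IdentityCheckExpression rule above:

```python
# Named data terms captured by each alternative of IdentityCheckExpression,
# per the rule definitions above.
identity_check_alts = [
    {"Left", "Operator", "Right"},  # IdentityCheckExpression ("==" | "!=") RelationalExpression
    {"Right"},                      # fall-through: RelationalExpression alone
]

union = set().union(*identity_check_alts)                # every term any alternative captures
shared = set.intersection(*identity_check_alts)          # terms hoisted into the root data type
per_alt = [alt - shared for alt in identity_check_alts]  # complement: fields unique to each subrule
```

This reproduces the factoring described above: 'Right' lands in the root definition, while 'Left' and 'Operator' stay with the first subrule and the second subrule contributes nothing beyond the root.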

Lots of work ._.  That's just the tip.

Alongside this work is the set analysis for the rules themselves.  I've decided on a system similar to Lexical analysis of state->state transitions.  So far the debug output looks promising, though I've only scratched the surface.  I've got it listing the initial set of elements for the transitions, the next step is to take it a touch further and unify each individual initial set, so I can build the true initial set for each rule, and then compare each set across the board to create the proper logic paths for each possible parse. 
The issue lies in cases of unification of the tokens between imported rules for a given state transition: if two rules require the keyword 'new', it'll have to look further; if none require such a thing, the path is clear.  Then there's unification on the initial token set of the current rule (vs. the rule transitions); the same kind of look-ahead is required.  It'll have to keep looking ahead until it finds a proper path, or an ambiguity occurs.  Ambiguities in this case would be cases where two rules overlap perfectly, such as a cast operation and a subtraction operation where the left operand uses parentheses:
(test) - 3
This causes an ambiguity: does it mean ((test)-3), a cast, or (test - 3), a subtraction?  Both are perfectly valid, but the meaning isn't completely clear.  This is where I'll have to cast the path into an ambiguity rule: a specially generated rule that has the proper data components but is marked as ambiguous, deriving from both for ease.  I'm not certain at this point how I'll link the data structure, as said, but I'm sure I'll figure something out.  The resolution to the ambiguity in C#, if the intent is to cast, is to do the following (refer to CS0075 for more information): ((test)(-3))
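Unifying each rule's initial sets, as described above, is essentially the classic FIRST-set fixpoint. A minimal sketch in plain Python (ignoring empty-producing rules, which a real analysis would need to handle):

```python
def first_sets(grammar):
    """Fixpoint computation of each rule's initial token set.

    grammar: rule name -> list of alternatives; an alternative is a list
    of symbols, and any symbol not named in the grammar is a token.
    Empty-producing rules are ignored for brevity.
    """
    first = {rule: set() for rule in grammar}
    changed = True
    while changed:
        changed = False
        for rule, alts in grammar.items():
            for alt in alts:
                head = alt[0]
                # A rule contributes its own FIRST set; a token contributes itself.
                add = first[head] if head in grammar else {head}
                if not add <= first[rule]:
                    first[rule] |= add
                    changed = True
    return first

# A two-branch toy bracket grammar, in the spirit of the examples here:
toy = {
    "CP":     [["Order1"], ["Order2"]],
    "Order1": [["[", "CP", "]"]],
    "Order2": [["(", "CP", ")"]],
}
```

Running `first_sets(toy)` bubbles the '[' and '(' tokens up into CP's initial set, which is exactly the "true initial set for each rule" being compared across the board to pick parse paths.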
Mar 15, 2009 at 5:44 AM
Edited Mar 15, 2009 at 5:47 AM
Well, I figured I'd post a bit more about what it is I'm actually using (or writing, I should say). At this point I'm not entirely sure how far this program will go; I just know that it'll provide me with a codebase to start version 2.0 of the same.
Expression+ ::=> AssignmentExpression |
    VariableDeclarationExpression |
    /* *
     * Templates must exist as a part of a rule;
     * thus their use in such a manner is often likely.
     * Null rules consisting of simply a '()' (which is what this
     * reduces to) are removed.
     * */
        AssignmentExpression, ('=' | "*=" | "/=" | "%=" |"+=" | "-=" | "<<=" | ">>=" | "&=" | "^=" | "|=" | "??"), ConditionalExpression, Right,
        LogicalOrExpression, "||", LogicalAndExpression, Left,
        LogicalAndExpression, "&&", BitwiseOrExpression, Left,
        BitwiseOrExpression, '|', BitwiseExclusiveOrExpression, Left,
        BitwiseExclusiveOrExpression, '^', BitwiseAndExpression, Left,
        BitwiseAndExpression, '&', IdentityCheckExpression, Left,
        IdentityCheckExpression, ("==" | "!="), RelationalExpression, Left,
        RelationalExpression, ('<' | "<=" | '>' | ">="), ShiftLevelExpression, Left,
        ShiftLevelExpression, ("<<" | ">>"), AddOrSubtLevelExpression, Left,
        AddOrSubtLevelExpression, ('+' |  '-'), MulDivLevelExpression, Left,
        MulDivLevelExpression, ('*' | '/' | '%'), UnaryOperatorExpression, Left>;

RelationalExpression ::= RelationalExpression:Left; ("is" | "as"):Operator; TypeReference:Right;;

ConditionalExpression ::= LogicalOrExpression:Term; '?':QuestionDelimeter; ConditionalExpression:TruePart; ':':DecisionDelimeter; ConditionalExpression:FalsePart; |

That shows the use of a simple template, 'PrecedenceHelper', which I wrote mostly to test that templates were working.  They're basically a way to assist in writing more complex rules; they accept what's basically a tuple set of data.  As an example, if you wrote a template with four parameters and you gave it eight arguments, you'd get two cycles out of it.  Above, there are 44 elements, for 11 cycles.  The Left and Right indicate the kind of associativity.  Now, that's not to say that templates are some nifty code in the core; they're basically a syntactical trick, similar to how C++ templates work, only not as powerful (since they're not very complex).  Here's PrecedenceHelper's definition:
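For what it's worth, the ladder this template expands into is the classic precedence-climbing shape. Here's a rough Python sketch of the parse such a ladder drives (pre-tokenized input; left-associative levels only, with the right-associative assignment level omitted for brevity):

```python
# Binary operator levels distilled from the ladder above,
# loosest-binding first; every level shown here is left-associative.
LEVELS = [
    {"||"}, {"&&"}, {"|"}, {"^"}, {"&"},
    {"==", "!="},
    {"<", "<=", ">", ">="},
    {"<<", ">>"},
    {"+", "-"},
    {"*", "/", "%"},
]

def parse_expr(tokens, level=0):
    """Consume tokens (a mutable list) and build a nested (op, lhs, rhs) tree."""
    if level == len(LEVELS):
        return tokens.pop(0)  # an operand (UnaryOperatorExpression stand-in)
    node = parse_expr(tokens, level + 1)
    while tokens and tokens[0] in LEVELS[level]:
        op = tokens.pop(0)
        node = (op, node, parse_expr(tokens, level + 1))  # left-associative fold
    return node
```

Each level defers to the next tighter level for its operands, which is exactly what the `Current`/`Next` chaining in the template encodes.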

//Error ::= Identifier Operators.Eq String Operators.Comma HexNumber Operators.EntryTerminal;
InvalidAssociation = "Invalid association '{0}'.  Expected: Left or Right", 0x001;

/* *
 * @ before strings means case insensitive.
 * */
PrecedenceAssociation := @"Left":Left; | @"Right":Right;;

/* *
 * @"Expect" allows you to define a slight requirement for an input.
 * Special cases are: Token, Rule, or TokenOrRule.
 * Token can be a full token or a single part of a token.
 * '+' means that it's an array; multiple levels of '+' are expected in groups.
 * The first '+' requires that all following inputs are also '+' based;
 * this ensures that indexing can be done.
 * */
    Association:Expect=PrecedenceAssociation;+> ::=
    #if Association == Left
        #ifndef Current
            //Define the entity associated.
            #define Current = Current:Left; OperatorSeries:Operator; Next:Right; | Next:Right;;
            #if Index == 0
                //Return, this becomes part of the resulted
                //grammar of the calling rule.
                #return Current:Left; OperatorSeries:Operator; Next:Right; | Next:Right;;
                //Don't return this, add it to the already defined entity.
                #addrule Current, Current:Left; OperatorSeries:Operator; Next:Right; | Next:Right;;
    #elif Association == Right
        #ifndef Current
            //Define the entity associated.
            #define Current = Next:Left; OperatorSeries:Operator; Current:Right; | Next:Left;;
            #if Index == 0
                //Return, this becomes part of the resulted
                //grammar of the calling rule.
                #return Next:Left; OperatorSeries:Operator; Current:Right; | Next:Left;;
                //Don't return this, add it to the already defined entity.
                #addrule Current, Next:Left; OperatorSeries:Operator; Current:Right; | Next:Left;;
        #throw InvalidAssociation, Association;

As you can tell, it's very simple.  The limitations become obvious when you try to do something more automatic and complex, but I'll add more power in version 2, since the parser for this project is already a headache to look at, let alone modify (guess that's what happens when you research parsing: you start to hate your code).
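To make the cycling concrete: with four parameters and 44 arguments, the template runs 11 cycles, each emitting one precedence rule. This Python stand-in mimics only the Left/Right expansion a cycle performs (it is not the template engine, and the emitted syntax is my approximation of the grammar notation used here):

```python
def expand_precedence(entries):
    """entries: one (current, operators, next_rule, association) tuple per
    cycle, mirroring PrecedenceHelper's four parameters."""
    rules = []
    for current, ops, nxt, assoc in entries:
        alt = "(" + " | ".join(repr(op) for op in ops) + ")"
        if assoc == "Left":
            body = f"{current}:Left; {alt}:Operator; {nxt}:Right; | {nxt}:Right;"
        elif assoc == "Right":
            body = f"{nxt}:Left; {alt}:Operator; {current}:Right; | {nxt}:Left;"
        else:
            raise ValueError(f"Invalid association '{assoc}'.  Expected: Left or Right")
        rules.append(f"{current} ::= {body};")
    return rules
```

One cycle of the additive level, for example, expands to the `Current:Left; OperatorSeries:Operator; Next:Right;` form the template returns for left association.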

Hopefully within a week or so I'll have made more progress on the parser generator.  The test project I'm using is a basic replica of C#'s expression system, which is probably one of the more complicated aspects of the language.  If it goes well I'll post more about it, as well as how the C# parser I'll be generating next goes.

The crowd here is especially relevant: if the code this thing generates is garbage, what good is it to modify/extend?  I was hoping that, when I get to the point of emitting a good parser, the folks here wouldn't mind taking a look at it, considering the name of the project.
Mar 21, 2009 at 1:26 PM
So far progress seems to be going well.  I've generated a transition graph of the parse from the start rule interleaving any and all of the sub-rules as necessary.

The next phase is to utilize this data to build individual parse methods, I'll focus on simple recognition first, and then weave in the data model on top of it.

It'll probably handle the look-ahead transitions prior to any data extraction, to ensure that it can resolve what data it needs before it builds the associated structures.  Though from the looks of it, the data I'm using seems to be more appropriate for an RD+PDA, which is weird; I'll just have to wait and see if it pans out.

The only complaint I have is that it seems to take quite some time to do its relational analysis (~2 minutes on the test project that replicates C#'s grammar, 22 seconds for the recursive expression test).  I guess I'll worry about speed once I have it building what I want.  A lot of this is trial and error; I'm not sure about the rest of you, but I'm no professional: I just do this for fun.

How about everyone else?  Is this project enjoyable?
Mar 24, 2009 at 10:21 PM
How about everyone else?  Is this project enjoyable?

I'm finding this project very interesting, though I don't know very much about compilers.  Could this be used to make a native code compiler for C#?
Mar 25, 2009 at 12:35 AM
Edited Mar 25, 2009 at 12:42 AM
Well, with regards to the C# parser project, you would have your work cut out for you.

It's honestly not as simple as parsing the input source files and building a program.  You'd have to have a one-to-one implementation of the BCL and a transformer program to do the cross-linking between the original BCL declarations and the secondary native implementation (which means you'd either have to link it to the BCL objects using reflection first, or omit that step and use standard symbol-table lookup conventions, adjusting the lookup as needed).  Not only that, but if you wanted to omit the .NET CLR entirely, you'd have to have the source to all the libraries you use in the project and build them the same way.  If that weren't enough, you'd have the fun of re-implementing the language features of C#: lambdas, LINQ, generics (there's a lot of dynamic code generation behind generics for them to be able to handle reference and value types, respectively; co/contra-variance in 4.0 will be interesting), COM interop, .NET interop if you wanted it (which would beg the question: why would you build to native if you support interop with .NET?), and the list goes on and on.  And I haven't even covered the step of emitting machine code, which involves knowing your platform and perhaps the architecture (depending on the level of sophistication of the compiler, the target processor, and so on), emitting OS-specific headers for applications, the lengthy series of lookups and other code necessary to implement an object model, and managing the native variant of the language, since you'd have to implement your own garbage collector (or systematically rewrite all the code to properly and fully dispose of its objects, something a person would have to do).

The biggest initial point of interest would be the BCL implementation; it's huge.  The amount of work that went into just the Windows Forms library is nothing to joke about.  You'd need a language other than assembly to implement this in a timely manner (assembly -is- possible, but I would rather do other things).

The side project I'm working on (a parser compiler, or what have you) is just the same.  Sure, you could build a parser for a language, but you'd still need to choose a platform, and what you choose would determine how much work you have cut out for you.  Myself, I'm hoping to implement something akin to C# 3.0 and take it further with other features, such as duck typing, and my target platform is the Common Language Infrastructure, or .NET.  Which means it immediately has the BCL, and a horde of other features which make creating a compiler much easier.

So the simple answer to your question is: yes, you can, but it will not be easy.
Mar 25, 2009 at 2:28 PM
Thanks for the reply, Alexander :) 

I've been watching the Digital Mars D language project evolve over the last few years, and they've faced those issues.  Building your standard library does seem to be a critical step.  That's how I came across this project, actually: I was looking for a parser I could modify to use for syntax coloring in a Visual Studio integration package for D.

Another interesting project I've come across is COSMOS, which lets you make your own operating system in C#/.NET; it does this by turning the IL into machine code (I think).  True, it's a console-based OS right now, but they're supposed to work on Windows Forms or possibly WPF in the future.  Again, it comes down to libraries.
Mar 27, 2009 at 3:44 PM
Edited Mar 27, 2009 at 3:53 PM
Just to note something I've been researching lately.

I'm trying to see about a backtrack-less LL(k) parser, which is turning out to be fairly interesting.  I even looked at ANTLR, as well as the source it emits, and I must say: ANTLR is confusing as hell.  The code it produces is akin to spaghetti code, though I'm not exactly sure why I see it this way.  ANTLR itself is a backtracking parser, and it allows you to specify yourself how far you want the look-ahead to be.

I'm not terribly familiar with ANTLR; does anyone else here know why you'd need to specify the level of k?  From what I can tell, the program should be able to figure out the length of k and adjust its code output as necessary.

The back-tracking issue becomes more complicated when you consider ambiguous cases between the tails of one rule and the next possible token set in another.  Here's a simple example:

#LexerName    "TestLexer";
#ParserName   "TestParser";
#AssemblyName "TestGrammar";
#GrammarName  "TestGrammar";
#Root         "CP";
#RulePrefix   "CT";
#RuleSuffix   "CR";
#TokenPrefix  "Tkn";

Tokens :=
    '(':LP;    | ')':RP;    |
    '[':LSqBr; | ']':RSqBr; |
    '{':LCrBr; | '}':RCrBr; |

/* *
 * '>' after a '::=' for a rule means the rule is merely a forward
 * for other rules and acts as a 'grouping' construct. The CST
 * elements generated, as a result, are unified under the interface
 * created for the grouping rule.
 * */
CP ::=> Order1 | Order2 | Order3 | Order4;

Order1 ::= '[' CP ']';
Order2 ::= '(' CP ')';
Order3 ::= '{' CP '}';
Order4 ::= (Identifier | '[' | ')') ')'*;

/* *
 * Unused rule, Whitespace inlines the WhitespaceChar token
 * and WhitespaceChar is eliminated.
 * */
WhitespaceChar := ' ':Space; | "\r\n":CRLF; | '\t':Tab; | '\r':CarriageReturn; | '\n':LineFeed; | '\v':VerticalTab;;

/* *
 * '*' after a ':=' for a token means the token is ubiquitous and
 * can therefore be ignored if it's in a place where it's not
 * explicitly called for.
 * */
Whitespace :=* WhitespaceChar+;

Identifier := [A-Za-z_][A-Za-z_0-9]*;

  If you notice in the above, you'll see Order4 uses ')'*, which basically means that a string which uses Order2 and then delves into Order4 (since that's the only non-recursive rule) has the irritating tail ambiguity between Order4 and Order2 on the right parenthesis.  To prevent back-tracking, the program will need to emit 'tail-ambiguity' resolution, which is a fancy way of saying that it uses what it knows about the current parse state to determine exactly how many ')' belong to Order4 and how many are owned by the caller(s).  Since I'll be using a stack to represent the context-awareness, I can look at the stack and deduce whether there's an ambiguity based upon the index of the value on the stack (since the index relates directly to the valid information, the parser compiler knows what this data equates to without actually doing a look-up in code).
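As a toy illustration of that resolution (plain Python, restricted to the '(' / ')' fragment of the grammar, so Order4 reduces to just ')'*): count how deep the Order2 nesting goes, then hand Order4 only the closers the callers don't need.

```python
def parse_cp(s):
    """Recognize the '('/')' fragment: CP ::= Order2 | Order4;
    Order2 ::= '(' CP ')';  Order4 ::= ')'*.
    Returns (nesting_depth, order4_count) on success, None on failure."""
    i = depth = 0
    while i < len(s) and s[i] == '(':  # descend through nested Order2 rules
        depth += 1
        i += 1
    closers = 0
    while i < len(s) and s[i] == ')':  # Order4 candidates + caller tails
        closers += 1
        i += 1
    if i != len(s) or closers < depth:
        return None                    # stray input, or an unmatched '('
    # Tail-ambiguity resolution: reserve one ')' per enclosing Order2;
    # whatever remains belongs to Order4.
    return depth, closers - depth
```

A generated parser can't enumerate the closers up front like this toy does, which is why the real resolution has to consult the context stack, but the accounting is the same: closers owed to callers first, the remainder to the greedy rule.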

I'm thinking that I'll need to greatly increase the overall structural complexity of the output project to get this to work right.  Because of this, I might as well make it an effort to improve the generated code.  I'm thinking I'll build what I'll call a syntactical tree node system, which will represent the original parse infrastructure in a series of class instances.  This tree will contain information about the ambiguities, so resolutions are more succinct and someone reading it won't look at it and say 'wtf?'  I'll also have to emit an enumeration relative to each node, which will contain documentation comments about what that node represents.  The tree itself will be built in a flattened format and contain internal links to each rule/state and so on.

If I learned one thing from reading up on ANTLR, it's that the code this thing emits needs to be user-friendly, because generating code is one thing; making it maintainable is another.  The only point where I had to ignore this rule is the lexical step: the literal-series state machine is fairly straightforward, comments indicate what each node represents up to that point, and constant referrals to the allowed token set helped improve readability.  Capturing tokens are a different matter; they just work, and for this version that'll have to do.  Perhaps when I get around to version 2, I'll have a bit better insight into the problem domain and be able to improve their effectiveness (and make it so you can actually capture the individual elements of the token vs. just the overall string).

All of what I've just said doesn't even begin to cover greedy token captures.  If the rule Order4 had been a token, it would be nearly impossible in this version to detect it.
Mar 27, 2009 at 5:55 PM
A horrible experience with ANTLR was one of the original motivations for writing a 'by hand' parser. I gave up when I noticed the generated code uses exceptions for flow control (to back out of invalid states, iirc). I'm sure there are people who can get that kind of thing to work, but I certainly don't have the chops for it. (No idea why it allows you to specify lookahead either, but I guess sometimes things just magically fix themselves when you increase it -- of course your code magically slows down too : )

I'm enjoying reading about your project, thanks for the updates, looking forward to them -- and it : ).
Mar 28, 2009 at 4:19 AM
Edited Apr 2, 2009 at 11:26 PM
So far, from what I can tell, the slowest part of all this is the process of making individual tokens.

The simple grammar above was constructed to test forward ambiguities, and I got to thinking I was missing something.  The tail ambiguities came to mind when I kept thinking what the follow sets would be calculated for.  Since I haven't written the code for tail-ambiguity resolution yet, I wrote a parser to handle the tail ambiguity by hand for the above simple language (as well as the individual rule data set, since most of what's done is related to calculating the data, not yet putting it to use).

The slowest parts come into play when you consider exactly how much data you want to process.  A standard 30,000,000 byte file takes varying degrees of time based upon how many discrete tokens make up that file.  30 million ')' in the above language can take some time when you consider it's checking for ambiguities, the context-stack on what's supposed to be 'next' for the calling rules varies as well (it wouldn't be good enough to just check for one ')' above the current level when parsing Order4, because what if you've got 6000 '(' before that?)  This is also just the simplest of examples, I'm sure the resolution time-impact will vary based upon how complex the resolutions become.  A large part of the individual resolution would reduce to code like the following (which I adjusted for a more complex case, where the first part of Order4 repeats [Identifier | '[' | ')'], instead of just a Right-parenthesis):

int tailCheck = 0;
while (this.GetContextSize() > tailCheck && this.PeekPreviousContext(tailCheck) == 3)
    tailCheck++;
int offset = 0;
int rpi = 0;
nextData:
//Look forward.
data = this.LookAhead(pendingPosition + (offset++));
//No match or EOF.
if (data == null ||
    (data.Covered & ValidTokensForTestGrammar.EOF) == ValidTokensForTestGrammar.EOF)
    goto enddata;
//Right parenthesis.
if ((data.Covered & ValidTokensForTestGrammar.Tokens) == ValidTokensForTestGrammar.Tokens &&
    ((data.CoveredForToken.Value & TokenCases.RP) == TokenCases.RP))
{
    /* *
     * If there are sufficient right-parentheses to overcome the tail
     * ambiguity, insert it into the dataset.
     * */
    if (rpi > tailCheck)
    {
        /* ... */
    }
}
else
{
    /* *
     * Clear the right-parenthesis index; something was encountered which
     * doesn't follow the disambiguation rule.
     * */
    rpi = 0;
    while (offset > 0)
    {
        /* ... */
    }
}
goto nextData;
enddata: ;

Remember, the above is hand-written, but I figure if you can do it by hand, you can automate it.  The '3' in the tail check refers to the context-information index which has a known overlap with the current context.  So far I've been able to automate state machines for lexical analysis (and a bunch of other unrelated projects), so this should be about the same kind of deal.  I'm guessing each ambiguity will require a different local to manage, so I guess I could ask you guys a question: what's a good source for naming these locals?  Should I use the name of the token that represents the ambiguity?  Some ambiguities could be two tokens or so (e.g. if you're using '>' and the caller uses '>>' for one token).  Were that the case, the '>' and '>' ambiguity would be greaterThan_GreaterThan_Ambig, or something along those lines.  I could even condense it into the title-case letters of the name (gTGTAmbig), but I'm not entirely sure.  That's also assuming the resolution is that easy to determine.  I'll have to cross-check each rule that calls a given rule and check the tail-points of that rule; on top of this, I'll have to include cases where that rule is called by yet another rule, and so on.  The ambiguity resolution will probably be pretty hectic.

I'm sure you all can attest that certain things aren't always as simple as you first think they are, especially with things like ECMA-334: 9.2.3 Grammar ambiguities.  Insight appreciated.

Forgot to mention: the code above that's hand-written is the actual syntactical parse code; the lexical analysis state machines are finished.  For the example, I wrote the lexical analysis handler the same way I will write the generated version.  I compare the series of overlaps on the first character and build a table of those overlaps and the tokens that are associated with them.  The lexer's next-token procedure should be fairly simple to build once I get around to it; there's just so much other work to do that it's not a priority.
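The first-character overlap table is easy to sketch if you flatten tokens down to plain literals (the tokens in this system are richer than that, as noted earlier, so treat this as a simplified Python illustration):

```python
def first_char_table(tokens):
    """tokens: token name -> literal string.

    Groups token names by leading character, so a generated next-token
    routine can dispatch on one character and then only walk the
    candidates that actually overlap on it."""
    table = {}
    for name, literal in tokens.items():
        table.setdefault(literal[0], []).append(name)
    return table
```

For example, '<', '<=', and '<<' all land in the same bucket, which is exactly the overlap set the generated state machine has to discriminate after the first character.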

Update (March 28th, 2009, 9:34 PM, GMT-06:00):
Presently tying up some loose ends with the object model, adding bases for tokens, rules, and so on.  Tokens won't have any information other than their subset information (or their string capture, if it's, say, an Identifier).  If a token is made up of a series (such as keywords), there'll be one class per set of 32.  This might seem like overkill, but I ran a test using a mock-up of a toy Common Intermediate Language syntax definition, and it generated 15 different subsets for keywords (there are over 450 keywords); having fifteen 32-bit values seemed like a bit much for each and every keyword token to carry (since it needs at most one).  So if I create a class per sub-set, you have a larger code base, but each individual token is much more compact.  Since there are cases where big sub-sets might be created, it'll also include a reference to a case selector, so if you're looking to figure out which subset to look at for the value, it'll contain that as well (which will consume no data, just a property).
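The arithmetic behind the one-class-per-32 choice can be sketched in a few lines (Python, with invented helper names): each keyword maps to a (subset, bit) pair, so any one token instance carries a single 32-bit word plus a selector saying which subset that word covers.

```python
def subset_count(keyword_count, word_bits=32):
    """Number of 32-bit subset words needed to give every keyword a bit."""
    return -(-keyword_count // word_bits)  # ceiling division

def pack_keywords(keywords, word_bits=32):
    """Assign each keyword a (subset index, bit mask) pair."""
    return {kw: (i // word_bits, 1 << (i % word_bits))
            for i, kw in enumerate(sorted(keywords))}
```

With the 450-plus keywords mentioned above, `subset_count` lands on the 15 subsets the test produced, and any single keyword token needs only its own subset's word rather than all fifteen.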

Update 2 (April 2nd, 2009, 5:24 PM, GMT-06:00):
Since this is taking longer than I anticipated, to reduce the spam on this project's boards, for those interested in following the progress, you can view the associated blog.