Building a reference map

Topics: Developer Forum, Main( )
Apr 10, 2007 at 8:51 AM

I am currently, together with a fellow student, developing a prototype application for a user study on fisheye visualization of source code at the Department of Computer Science, University of Copenhagen; a further investigation of the results found in 1.

The prototype is being written in C# and is intended for visualizing C# source code. As part of the experiment we need a degree of interest (DOI) function based on a priori interest (classes are more important than methods, methods are more important than conditions, etc.), syntactic distance (statements in the same scope are closer than statements in enclosing scopes) and semantic distance (referenced code, i.e. method calls, uses of fields, properties, etc., is considered close).

So far your C# Parser has been great for solving these tasks, but we are currently stuck developing the semantic component of the DOI - i.e. building a reference map.

What we need is basically a two-dimensional array where the first index is a line in the visualized source code (the current focus point) and the second index is any line in the same source code; the value stored is the weight of that second line relative to the focus point. So when visualizing an n-line source file, the array would be n*n.
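A minimal sketch of that n*n layout, assuming the references are already available as (fromLine, toLine) pairs. The `ReferenceMap` class, the `Build` method and the 0.5 weight are hypothetical names and values chosen for illustration; nothing here comes from the parser itself.

```csharp
using System;
using System.Collections.Generic;

public static class ReferenceMap
{
    // Builds an n x n matrix: weights[focus, line] is the interest of
    // `line` when the focus point is `focus` (both 0-based line indices).
    // `references` is a hypothetical list of (fromLine, toLine) pairs that
    // a resolver would have to supply; the parser does not produce it.
    public static double[,] Build(int lineCount, IEnumerable<(int From, int To)> references)
    {
        var weights = new double[lineCount, lineCount];

        // A line is always fully interesting with respect to itself.
        for (int i = 0; i < lineCount; i++)
            weights[i, i] = 1.0;

        // Assumed weight for a direct reference; treated as symmetric so
        // that a declaration also "sees" the lines that use it.
        const double referenceWeight = 0.5;
        foreach (var (from, to) in references)
        {
            if (from == to) continue; // keep the diagonal at 1.0
            weights[from, to] = Math.Max(weights[from, to], referenceWeight);
            weights[to, from] = Math.Max(weights[to, from], referenceWeight);
        }
        return weights;
    }
}
```

Most entries of such a matrix stay at zero, so for long files a per-line dictionary of weights may be a lighter alternative to a dense array.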

Do you have any suggestions on how to approach this?

I have tried a couple of approaches, but without success.

1. Using only tokens and scope defined by indentation, but unfortunately the Token.Col value is sometimes incorrect (especially around a .-token: the .-token and the next token in the list get the same Token.Col value)
2. Iterating through name spaces, classes, methods, statements, etc.

Any feedback would be appreciated.

Best regards,
Thomas René Sidor

Apr 12, 2007 at 6:12 PM
Regarding the incorrect Token.Col value around TokenID.Dot: it seems that there's a bug in the lexer. In line 978 of Lexer.cs you do a 'c = src.Read()' without a following curCol++ (the increment is actually in line 999). Because of this, TokenID.Dot gets the correct value, but the tokens following it do not (because the code breaks out when adding the token, and the curCol++ comes after the break, so it is never evaluated).

Placing a curCol++ after the 'c = src.Read()', commenting out the curCol++ in line 999 and changing line 987 from 'tokens.AddLast(new Token(TokenID.Dot, curLine, curCol));' to 'tokens.AddLast(new Token(TokenID.Dot, curLine, curCol-1));' seems to solve the problem (at least without any side effects on my use of the parser and lexer; it also seems to parse all of the Mono test files correctly).
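The underlying invariant can be illustrated with a tiny stand-alone scanner in which every Read() is paired with a column increment. This is only a sketch under that assumption, not the code from Lexer.cs: the `ColumnDemo` class and `Scan` method are hypothetical, and only identifiers and '.' are tokenized.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ColumnDemo
{
    // Returns (text, line, col) for each identifier or '.' token, with
    // 1-based positions. The column is advanced together with every read,
    // so no branch can leave curCol one step behind.
    public static List<(string Text, int Line, int Col)> Scan(string source)
    {
        var tokens = new List<(string Text, int Line, int Col)>();
        var src = new StringReader(source);
        int curLine = 1, curCol = 0;

        int c = src.Read(); curCol++;
        while (c != -1)
        {
            if (c == '\n')
            {
                curLine++; curCol = 0;          // next character starts at column 1
                c = src.Read(); curCol++;
            }
            else if (c == '.')
            {
                tokens.Add((".", curLine, curCol));
                c = src.Read(); curCol++;       // increment stays next to the read
            }
            else if (char.IsLetterOrDigit((char)c) || c == '_')
            {
                int startCol = curCol;
                var text = new StringBuilder();
                while (c != -1 && (char.IsLetterOrDigit((char)c) || c == '_'))
                {
                    text.Append((char)c);
                    c = src.Read(); curCol++;
                }
                tokens.Add((text.ToString(), curLine, startCol));
            }
            else
            {
                c = src.Read(); curCol++;       // skip whitespace and anything else
            }
        }
        return tokens;
    }
}
```

With this pairing, the token after a dot gets its own column instead of sharing the dot's.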

I'm also posting this in the issue tracker.

Best regards,
Thomas René Sidor

Apr 12, 2007 at 9:59 PM
Edited Apr 12, 2007 at 10:02 PM

1) Nice project :), really interesting.
I think iterating through the code graph is better.

How do you intend to identify methods, classes, etc. using tokens only? Maybe I missed something :)

It seems that you need a type/reference resolver. That work is in progress, by Omer, but neither he nor the others have much time to get it done quickly. If you want, you could ask Robin for developer access and finish it yourself.

I should warn you that sometimes a node is not really what it seems to be ;). For example, a member access node cannot be fully resolved without a complete identifier/type map. That is why an expression like "A.B.C" is parsed as a member access node of a member access node: MA(C, MA(A,B)) is the result. It might be a namespace member access, a field/property member access, or even a member access on a method parameter ...
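To make the nesting concrete, here is a small sketch with hypothetical node types (`Expr`, `Identifier`, `MemberAccess`); the parser's own classes are different, and the dotted-name splitting below stands in for real parsing.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node types for illustration only.
public abstract class Expr { }

public sealed class Identifier : Expr
{
    public string Name { get; }
    public Identifier(string name) { Name = name; }
}

public sealed class MemberAccess : Expr
{
    public Expr Target { get; }
    public string Member { get; }
    public MemberAccess(Expr target, string member) { Target = target; Member = member; }
}

public static class MemberAccessDemo
{
    // Builds the left-nested tree for a dotted name: "A.B.C" becomes
    // MemberAccess(MemberAccess(Identifier A, "B"), "C").
    public static Expr Parse(string dotted)
    {
        string[] parts = dotted.Split('.');
        Expr expr = new Identifier(parts[0]);
        for (int i = 1; i < parts.Length; i++)
            expr = new MemberAccess(expr, parts[i]);
        return expr;
    }

    // Walks the nesting from the outermost node inwards and returns the
    // segments in source order.
    public static List<string> Segments(Expr expr)
    {
        var result = new List<string>();
        while (expr is MemberAccess ma)
        {
            result.Insert(0, ma.Member);
            expr = ma.Target;
        }
        result.Insert(0, ((Identifier)expr).Name);
        return result;
    }
}
```

The tree has the same shape whether A is a namespace, a type, a field or a parameter; only a lookup of the first segment in an identifier/type map can tell these apart.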

It is the same with the Identifier node: in the expression "typeof(B)", B can be a type reference or a field reference ...

There may be other ambiguous cases that cannot be resolved without a type/identifier map. With ambiguous expressions left pending, I doubt you will be able to finish your project.

With an identifier map, do you need an n*n array? I mean: in an expression you get an identifier and you look it up. The identifier map returns, say, a Field node and a Method node (defined in two distinct classes), and you have to choose between these two nodes. By looking at the class that contains the expression, you can determine which node is closest to the expression, using the namespace and file name only. Did I understand your need correctly? For this you need a feature that is not implemented yet: the parent-child relation between nodes.
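A rough sketch of that lookup, assuming "closest" means the longest shared namespace prefix. The `Declaration` and `IdentifierMap` types are hypothetical, and this is a heuristic for illustration, not the C# name lookup rules.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one declared name.
public sealed class Declaration
{
    public string Name { get; }
    public string Kind { get; }        // e.g. "Field" or "Method"
    public string Namespace { get; }   // e.g. "App.Model"

    public Declaration(string name, string kind, string ns)
    {
        Name = name; Kind = kind; Namespace = ns;
    }
}

public static class IdentifierMap
{
    // Number of leading namespace segments that two namespaces share.
    public static int SharedPrefix(string a, string b)
    {
        string[] x = a.Split('.'), y = b.Split('.');
        int n = 0;
        while (n < x.Length && n < y.Length && x[n] == y[n]) n++;
        return n;
    }

    // Among all declarations with the given name, picks the one whose
    // namespace shares the longest prefix with the namespace of the
    // expression. Returns null when the name is not declared at all.
    public static Declaration Resolve(
        IEnumerable<Declaration> all, string name, string usingNamespace)
    {
        return all.Where(d => d.Name == name)
                  .OrderByDescending(d => SharedPrefix(d.Namespace, usingNamespace))
                  .FirstOrDefault();
    }
}
```

For example, with a field `Count` declared in `App.Model` and a method `Count` declared in `Lib.Util`, an expression located in `App.Model.Views` would resolve to the field under this heuristic.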

2) I agree that the 'col' is sometimes wrong. We must refine it. I think the first implementation was used for debugging: locating the token in the file. The problem with the '.' is that it can be a decimal separator or a qualified identifier "separator". In the first case the '.', its left part and its right part are the same token. In the second case, they are three different tokens. But surely you know that :). So I guess our algorithm is wrong. I will look into it asap.