C# Parser Library

Ââåäåíèå

1. Introduction

 

The Metaspec C# parser library is designed to provide the kernel for applications which require processing C# source code: refactoring, formatting, and verification applications, and other similar tools.





Our C# parser is available in two versions: a compilation parser and a transformation parser.


The compilation parser is designed for applications that require converting source code into the internal format as fast as possible. Only the significant data required for correct analysis of the program is preserved. This parser is useful for the development of compilers, interpreters, analyzers and other tools that do not require modification of source code.

The transformation parser preserves all information in the program's source code (including comments, spaces and preprocessing directives). This parser is useful for the development of refactoring and formatting utilities and other programs that do transformations on the program's source code.

 

 

2. Input data

 

Input data for the parser are source files in the C# language and assemblies. The parser does not work directly with the file system. All files must be loaded into memory and passed to the parser with the help of pointers.

 

 

The following abstract data types are used for storing input data: files, assemblies and projects. A project can include any number of files and assemblies. It is responsible for deleting files and assemblies. When a project is deleted, all files and assemblies in it are also automatically deleted. For each project you can specify configuration macros that can include or exclude parts of the program. The parser can store and process any number of independent projects.

 

 

3. Parsing stages

 

 

Pass 1 – Creating AST trees. A separate tree is created for each file.

Metadata Pass – Generating and adding entities defined in assemblies to the entity model.

Pass 2 – Generating and adding entities defined in C# files to the entity model.

Find predefined entities – Searching for system data types (System.Object, System.ValueType, etc.). System data types can be defined both in C# files and in assemblies.

Pass 3 – Building the inheritance hierarchy and type lookup.

Pass 4 – Searching for members of types. Method overload resolution. Searching for explicit and implicit user-defined conversions. Checking if the program is semantically correct.

 

Knowledge of the order of passes can be used to optimize reparsing after the program is modified. For example, there is no need to go through all parsing stages after refactoring. It is enough to build AST trees for all modified files and go through stages 2-3-4 for all C# files.

 

 

4. Basic types

 

Two basic data types are used for internal representation of the program: AST nodes and entities. AST nodes are grouped into AST trees. Entities are gathered into the entity model. The rest of the data structures are auxiliary. Note that the transformation parser uses one more data type - tokens.

 

 

5. Abstract syntax trees

 

A separate abstract syntax tree consisting of AST nodes is built for each C# file. Only trees that are correct or partially correct from the syntax (but not semantic!) point of view are built in the process of parsing the program. Syntactically incorrect parts of the program are ignored. For example, a namespace cannot be defined inside a function and is ignored. A partially correct syntax subtree or node is built when the general syntax rule can be defined unambiguously, even if some part of the data for generating the complete rule is missing. For example, for the expression:

 

A f ( B b )

{

return b as; // missing as-expression cast-type

}

 

the CsAsIsExpression node that contains no cast type is generated.

An abstract syntax tree can be easily transformed or modified. Such actions are performed by replacing one node with another, deleting existing nodes and creating new ones.

 

 

5.1 AST nodes

 

Each AST node corresponds to a syntax rule of the C# language grammar. Nodes are related to each other via bidirectional parent-child links. Each AST node, except the compilation unit, refers to its parent. An AST node can contain any number of references to children depending on its type. Additionally, some nodes (for example, goto statements) refer to nodes that are not their parents or children. Each node is responsible for deleting its children. When a parent node is deleted, all its children are also deleted.

 

Any subtree or node can be detached from the tree if the reference to it is deleted from its parent node. To keep the model correct in this case, the reference to the parent node should be deleted as well. A node that has no reference to its parent is called detached; a node referring to its parent is called attached.

Nodes can contain additional information necessary for the representation of the program: names of identifiers, string literals, modifiers, etc.

 

Nodes contain references to entities. AST trees are related to the entity model with the help of these references. References to entities are divided into two types:

entity definitions - references to entities generated by the node itself, and

entity references - references to entities used by the node.

For example, the CsProperty node contains one definition referring to the property entity and two references referring to the getter and setter methods.

 

 

5.2 AST nodes and tokens

 

For each C# file, the C# transformation parser generates a list of tokens stored separately from the AST tree. Each token is assigned an index. It is unique for tokens belonging to one file. The nodes of the transformation tree contain token indexes corresponding to the syntax rule of the node.

 

Program example:

 

void f()

{

            g();

}

 

The relation between the character representation of the program and tokens (C# transformation parser):

 

 

The relation between tokens and AST nodes (C# transformation parser)

 

 

 

6. Entity Model

 

The entity model is global for the entire project. It is used to link together all files and assemblies. The entity model is a tree whose root is a global namespace.

 

In the process of parsing a program that has errors in it, a partially correct entity model can be generated. For example, two types with the same name can be declared in one namespace. In this case both types will be defined in one namespace. But the "Cannot resolve entity" error will be displayed for each attempt to find one of the two types by its name. Such an error is processed in a different way for those entities whose definition semantics depends on the order they are declared in the program (enumerations and local entities). The second entity with the same name is neither created nor added to the definition scope.

 

Unlike AST trees, the entity model is not subject to transformations. It is possible only to add entities to the entity model or remove them from it.

 

6.1 Entities

 

An entity is a named object defined within some scope. Entities are related to each other via bidirectional parent-child links. All entities, except the global namespace, have one and only one parent. The global namespace has no parent. An entity can have any number of children depending on its type. It is responsible for deleting its children. When an entity is deleted, all its children are also deleted. Children are stored either in the hash table of their scope or in special fields. An entity imported from an assembly contains a reference to the assembly from which it was imported.

 

6.1.1 Global and local entities

 

Global entities correspond to the types of metadata defined in the ECMA-335 standard. Global entities are visible from outside and any part of the program. Other assemblies can refer to them. Local and auxiliary entities are introduced to simplify the internal logic of the parser and to generalize the algorithms of searching for entities. They are neither visible nor available outside the method in which they are defined.

 

The correspondence between global entities and metadata types:

 

Global Entity

Metadata Type

CsEntityNamespace

absent, but implied

CsEntityClass    

Type (class)

CsEntityInterface

Type (interface)

CsEntityStruct   

Type (class) derived from System.ValueType

CsEntityEnum     

Type (class) derived from System.Enum

CsEntityDelegate 

Type (class) derived from System.MulticastDelegate

CsEntityMethod   

Method

CsEntityVariable 

Field

CsEntityConstant 

Field (literal)

CsEntityProperty 

Property

CsEntityEvent    

Event

 

Local and auxiliary entities:

 

Local Entity

Description

CsEntityFormalParameter

Formal parameter

CsEntityFormalParameterList

List of formal parameters

CsEntityBlock

Block

CsEntityBlockVariable

A local scope consisting of one variable

CsEntityLocalVariable

Local variable

CsEntityLocalConstant

Local constant

 

6.1.2 Entity name

 

The name of an entity can be specified directly in the program, imported from an assembly or generated according to the rules of the C# language. Some local and auxiliary entities have no names.

 

6.1.2.1 Special method names

 

The names of special methods (constructors, destructors, implicit and explicit conversions, operators) are specified according to the ECMA-335 standard.

 

6.1.2.2 Property method names

 

The names of methods for a property “P” are specified by adding prefixes to the name of the property. The getter method is formed by adding the get_ prefix to the name of the property (get_P). The setter method is formed by adding the set_ prefix to the name of the property (set_P).

 

6.1.2.3 Event method names

 

The names of methods for an event “E” are specified by adding prefixes to the name of the event. The add method is formed by adding the add_ prefix to the name of the event (add_E). The remove method is formed by adding the remove_ prefix to the name of the event (remove_E).

 

6.1.2.4 Indexer names

 

An indexer in C# corresponds to a property with parameters, but it has a number of peculiarities. The name of an indexer in C# is not taken into account in the overload resolution algorithms. One can even say that indexers in C# have no name. But according to the CLR specification, each method must have a name. That is why the C# compiler names a property “Item” by default. Correspondingly, the get and set functions of such a property are named get_Item and set_Item. It is possible to specify another name for the indexer property with the help of the  System.Runtime.CompilerServices.IndexerName attribute.

 

6.1.2.5 Explicit implementation names

 

Compound names are generated for the explicit implementations of a method, a property or an event. The generated name is the full name of a method, a property or an event implemented by the explicit implementation. Example:

 

namespace N

{

            class I

            {

                        public void f();

            }

            class C : I

            {

                        void I.f() {} // N.I.f

            }

}

 

If the implemented method, property or event cannot be found due to an error, the name is not generated and the entity remains nameless.

 

Explicit implementations are not taken into account while searching for members of a class. In other words, they are invisible.

 

 

6.2 The relation between AST nodes and entities

 

Program example:

 

namespace A.B

{

            class C

{

            public int P

            {

                        get() { return 0; }

            }

}

}

 

 

Entities generated out of C# files refer to AST nodes that generated them. In turn, AST nodes contain references to the entities generated by them.

 

 

7. Supported non-standard extensions of the C# language introduced by Microsoft

 

7.1. __makeref

 

Used for converting a variable into System.TypedReference. The parameter used for constructing a reference must be a left-hand expression. Example:

 

int x;

TypedReference tref;

tref = __makeref(x);

 

7.2. __reftype

 

Used for getting the type (System.Type) by the reference (System.TypedReference). Example:

 

TypedReference typedref;

Type type;

type = __reftype(typedref);

 

7.3. __refvalue

 

Used for getting the value stored by the reference (System.TypedReference). Example:

 

TypedReference typedref;

int x;

x = __refvalue(typedref, int);

 

7.4. __arglist

 

Used for working with functions taking a variable number of parameters. The __arglist keyword is used in three cases:

 

1. As the last formal parameter of a function taking a variable number of parameters. Example:

 

[CLSCompliant(false)]

public static void WriteLine(String format, Object arg0, Object arg1, Object arg2,Object arg3, __arglist);

 

2. As a parameter of the constructor of the System.ArgIterator object. Example:

 

ArgIterator args = new ArgIterator(__arglist);

 

3. As a parameter while calling a function taking a variable number of parameters. Example:

 

int AddInts(__arglist) {}

void Call()

{

  int result = AddInts(__arglist(1, 2, 3, 4, 5));

}

 

© 2005 metaspec