Notes
Slide Show
Outline
1
CDT DOM Proposal
2
Where are we (IBM) coming from?
  • We’re part of the Model-Driven Development (MDD) team in the Rational Software Division of the IBM Software Group
  • CDT is the core C++ component of Rational Software Architect
  • The C++ component includes visualization of code as UML as well as transformation of UML to C++
  • We need an accurate, complete DOM that allows programmatic code change
  • Since do both Java and C++ we’d like the architectures of the JDT and CDT to be “similar”.
3
What have we done until now?
  • Started work on parser in Dec. 2002
  • Considered a number of options but settled on a handwritten parser
    • Needed smart control over ambiguities in C/C++
    • Needed to support a number of clients with different needs
    • Similar in approach to newer versions of gcc.
  • First contributed in 1.1 for CModel replacing previous JavaCC-based parser
  • Added Indexing and Search as client in 1.2
    • Replaced previous ctags-based index (not without issue)
  • Numerous clients in 2.0, including content assist, F3, type browsers
4
Parsing Architecture 1.0
  • We had concern over scalability of AST
    • Settled on a callback-based architecture
    • Clients created only the data that they needed
  • Client would pass in a requestor to the parser and the parser would pass in parse info as the parse progressed
    • Despite the names, the IAST* parse info does not form a proper AST
  • Client would also be responsible for passing in ScannerInfo
    • Compiler command line arguments that affect parsing
      • Include paths, macro defs
    • Needed to properly parse
5
Parsing Architecture 1.0 – What worked well
  • Pretty accurate parse information (usually)
    • Content assist worked really well for C and C++
    • Accurate out line view
    • Accurate search results, F3
    • Ability to generate type info for class and hierarchy browsers
  • Flexibility for clients
    • Enabled us to pile on the clients in 2.0
6
Parser Architecture 1.0 – What didn’t work well
  • Performance
    • Assumed parse times would be quick, < 1 sec
      • Finding parse times usually in 2-4 second range
    • Index takes a lot of time, memory, and CPU power to generate
    • Content assist times out regularly
  • Hard to provide accurate results consistently
    • F3 and content assist often produce no results
    • Need for accurate ScannerInfo often hard to satisfy
  • Scalability to large projects
    • Problems only worsen when project grows
    • Index times longer than already long build times
    • ScannerInfo different for different files in project
7
Why a DOM?
  • Although flexible, the callback mechanism was hard to define
    • Unable to provide all information that every possible client would require
    • Often unable to determine parser context for certain constructs
    • Clients had added complexity to manage their own data
    • Parser had added complexity to provide enough data
      • Parse mode proliferation
  • Need to address scalability
    • Cut down on the amount of parsing we need to do
    • Can we reuse parse results?
  • Need data structures to help programmatically make changes to the source code
8
DOM Architecture
  • The DOM is composed of the parser/scanner and three levels of data
    • 1) Physical AST with mapping back through macro expansions to source
    • 2) Logical Scope/Binding tree with cross translation unit indexing
    • 3) AST Rewriter
  • Firm up interfaces for DOM creation
    • Navigate from Core Model to DOM
    • Hide parser from clients (use core model)
    • Allows us to play with how the DOM is created to improve scalability
9
Goals for 3.0
  • To reduce the fixed cost of a parse (preprocess, scan & syntax matching) and to allow for lazy semantic evaluation
    • Improve performance & reduce memory footprint of navigation features
  • To provide a “complete” physical AST which can make our clients aware of preprocessor macro expansions in source code
  • To provide better support for C
    • Link-time resolution cross references
    • Tailored implementation of parser/semantic bindings
10
Physical AST - IASTNode
  • Given a file to parse, the AST Framework shall return an IASTTranslationUnit which then can be traversed or visited
  • The IASTNode interface hierarchy represent constructs in the C/C++ grammar.
      • long x; /* IASTSimpleDeclaration with 1 IASTDeclarator */
      • long y();  /* IASTSimpleDeclaration with 1 IASTFunctionDeclarator */
      • int f() { return sizeof( int ); }  /* IASTFunctionDefinition */
  • Physical tree is unaware of any semantic knowledge
    • Declaration before use (C++)
    • Scoping
    • Type compatibility
    • lValue vs. rValue
  • Allows for quick generation of syntax tree
    • Only slight overhead to cost of scan & preprocess
11
Logical Tree - IBinding
  • Logical elements are higher-level constructs that map onto a physical IASTNode
  • For 3.0 : IASTName#resolveBinding()
      • long x; /* IASTName for x resolves to an IVariable */
      • long y();  /* IASTName for y resolves to an IFunction */
      • int f() { return sizeof( int ); }  /* IASTName for f resolves to an IFunction*/
  • Beyond 3.0 – IASTBinaryExpression could bind to an user defined operator
  • Semantic errors return an IProblemBinding describing the error
  • IASTTranslationUnit can be asked for all declarations or references of a particular IBinding
  • Bindings can be resolved completely lazily on request or in a full traversal of the AST
    • Indexer would require full binding resolution
    • Most other clients do not require all bindings to be resolved
12
Macro Awareness in the Physical AST
  • Nearly all parser clients in the CDT are concerned with offsets
    • Selection
    • Search Markers
    • Navigation
    • Compare
  • Within a preprocessor macro expansion, our IScanner implementation massages the offsets as tokens arrive
  • However, the CDT 2.x AST Nodes are unaware as to whether or not they are a result of a macro expansion
    • This deficiency affects different clients differently.

13
The Good – Outline View
14
The Bad – Search Markers Slightly Off
15
The Ugly – A Refactoring Gone Wrong
16
Introducing IASTNodeLocation
  • Every node in the AST derives from IASTNode
  • Every IASTNode provides a mechanism for resolving its location in the physical AST
    • External Header Files
    • Resources
    • Working Copies
    • Macro Expansions
  • getNodeLocations() returns an array of IASTNodeLocation, as a node may span multiple locations (when macros are involved)
  • Interpretation of macro expansion locations are up to the client
17
Better support for C
  • Support for link-time bindings
    • resolving function references in C
      • Will greatly aid navigation/search features for this style of C code
      • Refactoring/programmatic edit support will be difficult
        • Definite candidate for “Preview” pane in refactoring
        • Without full build-model to reference, resolution is heuristic-based
  • Stricter emphasis upon providing custom implementation for C & GCC
    • AST & supporting data structures will be more compact memory wise
    • Syntax parsing will be more accurate
    • Algorithms for semantic analysis in C are far less rigorous
      • Should yield better performance for nearly all clients
18
Parser Language Variants
  • Interfaces
    • org.eclipse.cdt.core.dom.ast – Base interfaces that apply to both ANSI C & ISO C++
    • org.eclipse.cdt.core.dom.ast.c – C99 specific sub-interfaces
    • org.eclipse.cdt.core.dom.ast.cpp – ISO C++ 98 specific sub-interfaces
    • org.eclipse.cdt.core.dom.ast.gnu – GNU extensions that apply to both C & C++
    • org.eclipse.cdt.core.dom.ast.gnu.c – GNU extensions that apply to C
    • org.eclipse.cdt.core.dom.ast.gnu.cpp – GNU extensions that apply to C++
  • Implementations
    • org.eclipse.cdt.internal.core.parser2 – supporting infrastructure
    • org.eclipse.cdt.internal.core.parser2.c – C99 & GCC support
    • org.eclipse.cdt.internal.core.parser2.cpp – ISO C++ 98 & G++ support
  • Other variants may subclass C or C++ Source Code Parser and choose which GNU extensions they wish to enable through a configuration interface


19
Physical Tree Hierarchy
  • C++ extends the common AST
  • GNU C++ extends the C++
  • C++ and C are unrelated outside of the common AST package
20
Logical Tree Hierarchy
  • C extends common
  • C++ extends common
  • Theoretically, new bindings could be added for future variants
    • e.g. GCC Signatures
21
AST Rewriting
  • AST Rewriter accepts requests to change AST
    • Add (Insert/Set), Remove, Replace
    • The new code can be a new AST Node or a String
  • AST Rewrite gathers change requests and then executes
    • Analyzes the change requests for validity
    • Produces text edits that can be applied to documents to affect the change
  • The analysis can decide not to do a change because it is too hard
    • E.g. when macros are involved
22
DOM as a Service
  • We need to take some effort to minimize parsing within Eclipse
    • C and C++ are mature languages : hence, larger source code bases
      • Multiple clients parsing on resource-change events can cripple the system
    • A complex web of #include’s throughout the code base is difficult to optimize per-parse without having knowledge of previous parses
      • Same with templates …
    • The Indexer is already parsing continually, we should be able to leverage that information for all other clients that require saved-file parse trees
    • Since parsing can be a processor and memory-intensive operation, it is difficult for the indexer to co-ordinate its priority vs. other parser clients for system resources
      • users requests an Open Declaration which competes against a running indexer for memory & CPU
    • Eventual Goal
      • AST Service w/Index
      • Incremental Parse