JavaScript:SpiderMonkey:Parser API
JavaScript API
Design
A general parsing API could treat the visitor interface from JsonMLAST as a builder with a parametric type. In Java-like pseudo-types:
interface Visitor<A> {
    visitThisExpr(attr : {}) : A;
    visitReturnStmt(attr : {}, expr : A) : A;
    /* etc. */
}
and take it as an optional argument to the constructor of a parser object:
class Parser<A> {
    Parser(builder : Visitor<A>?);
    parse(src : string,
          [filename : string?,
           startLine : (number >= 1)?,
           startColumn : (number >= 0)?]) : A;
}
This would allow the parse method to use the visitor to inject the parsed nodes into whatever type A the client wants. Making the builder argument optional makes it easy to default to the JsonMLAST format.
It's a little messy (with respect to types) to reuse the visitor interface for this, since the JsonMLAST visitor expects JsonMLAST values as arguments, whereas this use expects values of type A. So it may be better to have a parallel "Builder" interface that mirrors the visitor interface but with buildXXXX methods instead:
interface Builder<A> {
    buildThisExpr(attr : {}) : A;
    buildReturnStmt(attr : {}, expr : A) : A;
    /* etc. */
}
We should also implement a JsonMLAST processor that takes a visitor:
processAST(ast : JsonMLAST, visitor : Visitor<A>) : A;
This could be implemented in pure JS, at least initially. (It'll be ugly without pattern matching, though.)
Plan
Several separate steps are needed to make this work:
- Bug 548461 -- refactor JSCompiler to JSSourceCompiler and JSAstCompiler (draft done, patch under review)
- Bug 533874 -- implement marshalling to JsonMLAST (draft done, not quite ready for review)
- implement source location information
- generalize to take visitor (not started)
- refactor to build on top of C++ API? (not started)
C++ API
Brendan's notes
Currently jsparse.h defines a JSParseNode struct, a variant record documented by the major comment before its declaration, and a few JS_FRIEND_API entry points for parsing and compiling, but not executing, token streams.
There is interest, from the jseng newsgroup, in adding a jsparseapi.[ch] module to SpiderMonkey that exposes these parse-tree internals in a sustainable way. See my post in the thread on this topic.
In a followup post, I talk about various APIs that could be used to glean information from the direct members of JSParseNode, but with a new jsparseapi.h interface, we would do better to use functions to hide all data types that are not already exposed via jspubtd.h and jsapi.h.
The parse-tree API should expose an opaque JSParseTree struct type. This type represents a tree parsed from a token stream, or a subtree (even just one leaf node). We should not expose JSTokenStream; rather, we should provide API entry points such as

extern JS_PUBLIC_API(JSParseTree *)
JS_ParseScript(JSContext *cx, const jschar *chars, size_t length,
               const char *filename, uintN lineno);

which return NULL on failure and a pointer to a valid JSParseTree on success.
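A minimal sketch of how a client might use such an entry point; JS_FreeParseTree is a hypothetical release function, named here only for illustration:

#include "jsapi.h"

static JSBool
ParseSource(JSContext *cx, const jschar *chars, size_t length)
{
    /* Parse only; nothing is compiled or executed. */
    JSParseTree *tree = JS_ParseScript(cx, chars, length, "inline.js", 1);
    if (!tree)
        return JS_FALSE;            /* error already reported via cx */

    /* ... walk the tree using the accessor/visitor API ... */

    JS_FreeParseTree(cx, tree);     /* hypothetical cleanup entry point */
    return JS_TRUE;
}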
The API would consist of accessors only, no mutators. We might want an API modeled on the visitor pattern. So JSParseTree would be a Visitable and the API would introduce a Visitor interface (in the C callbacks or poor-man's vtable sense) for API clients to implement.
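One possible shape for that interface, sketched here with hypothetical names (none of these exist in current SpiderMonkey headers):

/* Poor-man's vtable: a set of C callbacks the client fills in. */
typedef struct JSParseTreeVisitor {
    void (*enterNode)(JSParseTree *node, void *data);  /* before visiting children */
    void (*leaveNode)(JSParseTree *node, void *data);  /* after visiting children  */
    void *data;                                        /* client state passed back */
} JSParseTreeVisitor;

extern JS_PUBLIC_API(JSBool)
JS_VisitParseTree(JSContext *cx, JSParseTree *tree,
                  const JSParseTreeVisitor *visitor);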
Please feel free to add your design notes below.
Kimman's Notes
Work on this API originated from a newsgroup post by Jeremy. I volunteered to assist him, and since then we have worked together on this. After getting a feel for the task at hand, we started work on a note outlining ideas for this API. Jeremy has been unable to devote time owing to other commitments, so I am posting this even though I have not had his comments on this note.
Work Done So Far
Based on the hints provided in Brendan's post, the following progress has been made:
1. Parsing a script using js_ParseTokenStream
2. Walking the tree of parse nodes using recursion
3. Implementation of a simple callback set (HaveNewSubTree, HaveNodeInTree, EndSubTree) that is called while walking the nodes. The structure used is:

typedef struct _JSParseNodeVisitor {
    NewParseNodeTree newnode;
    UseParseNode     usenode;
    EndParseNodeTree endnode;
    void            *data;
} JSParseNodeVisitor;

This is not a proper implementation of the Visitor pattern, but it serves as a simple starting point.
4. Implementation of two sets of callbacks: the first prints the contents of the nodes (not all node types have been handled yet), and the second builds a frequency distribution for the 'arity' of nodes, TOK_ constants and JS opcodes (a sketch of such a counting callback appears after this list).
5. Implementation of a non-recursive walker making the same callbacks.
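As an illustration of item 4, a frequency-counting callback compatible with the structure above might look roughly like this; the UseParseNode signature is an assumption, since the note does not spell it out:

typedef struct OpCounts {
    unsigned counts[256];   /* indexed by JSOp; 256 is an assumed upper bound */
} OpCounts;

/* Assumed UseParseNode signature: called once per node during the walk. */
static void
CountOpcodes(JSParseNode *pn, void *data)
{
    OpCounts *oc = (OpCounts *) data;
    oc->counts[pn->pn_op]++;        /* pn_op is the node's JSOp (see jsparse.h) */
}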
About the ParserAPI
Before implementing an API, it is essential to keep the intended end uses in perspective and to clearly delineate the boundaries of its capabilities.
First the boundaries: the ParserAPI will not evaluate the scripts. This ensures that the API can be used on scripts for which object implementations are not available. For example, debugger.js references an object jsd in its top-level code; an attempt to evaluate that code will return an error.
However, the ParserAPI should allow a JavaScript function (that is compiled and evaluated) to be a user of the API. This could be achieved through a form of 'chaining', where the user of the API is an internal implementation that in turn invokes the real user implemented in JS, as sketched below.
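A rough sketch of that chaining, assuming a node callback of the shape used elsewhere in these notes; apart from JS_CallFunctionValue and the JSParseNode fields, the names are hypothetical:

typedef struct JSChainedVisitor {
    JSContext *cx;
    JSObject  *scope;        /* scope object for the call                  */
    jsval      jsCallback;   /* the user's visitor function, written in JS */
} JSChainedVisitor;

/* Internal C callback that forwards each node to the JS-level visitor. */
static void
ChainToScript(JSParseNode *pn, void *data)
{
    JSChainedVisitor *cv = (JSChainedVisitor *) data;
    jsval argv[2], rval;

    /* Pass whatever node description the API decides to expose; here just
     * the token type and starting line number, purely as an illustration. */
    argv[0] = INT_TO_JSVAL(pn->pn_type);
    argv[1] = INT_TO_JSVAL(pn->pn_pos.begin.lineno);
    (void) JS_CallFunctionValue(cv->cx, cv->scope, cv->jsCallback, 2, argv, &rval);
}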
The idea of the ParserAPI emerged from an interest in a tool, like JavaDoc, to aid in documenting JS scripts and script collections. Some other potential uses for the API could be:
A GUI application that allows a user to drill down into a collection of JS sources. The nodes of interest may differ depending on where the user is in the GUI at a given point in time, and the application needs the ability to locate a parse node and then continue using the API from there. For example, in a first pass the GUI may display the source files and function names; when the user clicks a function, it then has to display the variables, the functions called, and so on. The application would hold the parse-node tree: when the user clicks the function, it would locate that parse node and start using our API from there.
An application that wants to build a Calls/Called-By kind of graph: it would use the API to walk the parsed nodes and then build its own data structures.
An application is complex and the user is reviewing its code: the API could be the foundation on which to build a simple 'search for usages' kind of interface.
During the initial 'getting-the-feel' work, simple token-type and opcode counting functions have been implemented as users of the API. This could help to identify a group of scripts that together use all tokens and opcodes, which could serve as a 'testing set' for JS engine builds.
Design Ideas
1. A user of the API should be able to control the extent of recursion. For example, a user who is interested only in top-level code and function definitions should be able to return a value from the 'visitor' callback that determines the next step of the walk. So on receiving a FUNCTION during the walk, the callback could return a value that means 'skip the function body'. Likewise, a callback could return a value meaning 'stop walking', for example because the user found what they were looking for or is no longer interested. The callback should also be able to inhibit walking of the pn_next node in a given ParseNode. (A sketch of such return values appears after this list.)
2. Related to the above, the API should be able to return a 'marker' for each node. The user of the API should be able to provide this marker at a subsequent point to determine the 'walk start' node. For implementations where GC is not an issue (e.g. a GUI application for documentation), this marker could be the node pointer itself. However, for other implementations the marker could be based on the BEGINPOS and ENDPOS available with each ParseNode.
3. A single walk of the parse tree should be able to cater to multiple 'visitors'. This is an idea I bring from the arena of image processing - while walking the contents of an image, different consumers are supported. At the end of walking, the application can query each consumer for information of interest. This has two advantages - the walking is done once, and consumer implementations can be simpler (and also easily reused). An application sets up the consumers, walks the contents and then queries each consumer for information. This information is aggregated / evaluated to determine the application's course of action.
4. The Visitable/Visitor definition: I think the API should provide different types of these. At the 'lowest' level, the Visitor interface could include a callback for each type of ParseNode. At the next level, the Visitor interface could include some categorisation based on TOK_ types. This would be like introducing a 'guide' interface, i.e. the guide interface prepares the visitable which a visitor can visit.
5. Reviewing the SpiderMonkey code, there are several places where this walker (as a 'js_friend') could be used, e.g. js_FoldConstants, js_EmitTree, CheckSideEffects and some others. These functions currently use recursion, and in some cases this recursion appears expensive. I am not sure whether the authors/maintainers of these functions would find the code easily maintained if they were to switch to using this API.
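As a sketch of idea 1 (all names hypothetical), the walker's callback could return an enumerated value that controls what is visited next:

typedef enum JSWalkControl {
    JSWALK_CONTINUE,        /* descend into children, then follow pn_next     */
    JSWALK_SKIP_CHILDREN,   /* e.g. skip a function body, keep walking        */
    JSWALK_SKIP_SIBLINGS,   /* do not follow the pn_next chain from this node */
    JSWALK_STOP             /* abandon the walk entirely                      */
} JSWalkControl;

/* The visitor callback returns one of the values above to steer the walk. */
typedef JSWalkControl (*JSParseNodeCallback)(JSParseNode *pn, void *data);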
Other Issues
1. The API should provide human-readable descriptions for elements like JS opcodes and parser tokens. Currently, jsopcode.c is the only place in the JS code where the 'names' of JS opcodes are available. These are defined in the file jsopcode.tbl, which is included in both jsopcode.h and jsopcode.c: in the .h file, only the numeric constants for the opcodes are included, thereby exposing the enumeration of the JS opcodes; in the .c file, the complete table is included and used during disassembly. For the ParserAPI, a function should be added to jsopcode.c that returns a pointer to the name corresponding to a given JS opcode. One option is for the ParserAPI to expose a public function which in turn invokes a 'friend' function in jsopcode.c (a sketch appears after this list).
2. Memory and GC: the starting point for the API will be to parse a script using js_ParseTokenStream. Upon successful parsing, the core functions of the API will be used, i.e. walking the tree of parse nodes. While the API is in use, the memory allocated in the context's temporary arena pool must not be freed. This suggests that when the API user has finished with the ParserAPI, they must call a function that releases the arena pool, which seems reasonable. However, the API will also depend on GC not taking place, so that atoms referenced in the different JSParseNodes do not get freed. During js_ParseTokenStream, atoms are protected from GC by using JS_KEEP_ATOMS; upon completion of the parsing, JS_UNKEEP_ATOMS is called. One option is to call JS_KEEP_ATOMS again after parsing completes and to call JS_UNKEEP_ATOMS in the same function that releases the arena pool.
3. Folding of constants: when parsing completes successfully, js_FoldConstants is called. If I've understood this correctly, js_FoldConstants performs an optimisation by resolving expressions involving literals once. A question that arises is whether a user of the ParserAPI would prefer that this optimisation not be performed. For example, a user of the API may be trying to identify the usage of a constant value that appears as a literal in a collection of scripts; with the constants folded, the literals will not be visible via the ParserAPI. Similarly, a user of the API trying to find examples of usage of different operators may get an empty result set on account of the folding of constants.
4. When an error occurs during parsing, it is necessary to report the error to the ErrorHandler callback, if one is set in the context. The error is available in an Exception object. An issue to consider is whether the ParserAPI should aggregate all warnings and make them available at the end of parsing.
5. Clarifications are required on TOK_DEFSHARP and TOK_USESHARP, and also on #n=[], #n={..}, #n# and the like in the comments in jsparse.h.
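A sketch of the 'friend' function suggested in issue 1. It re-includes jsopcode.tbl to build a name table; the OPDEF expansion assumes the macro's first three parameters are (op, val, name), which should be checked against the actual table before relying on it:

/* In jsopcode.c (sketch only). */
static const char *op_names[] = {
#define OPDEF(op, val, name, ...) name,
#include "jsopcode.tbl"
#undef OPDEF
};

JS_FRIEND_API(const char *)
js_GetOpcodeName(JSOp op)
{
    if ((size_t) op >= sizeof(op_names) / sizeof(op_names[0]))
        return NULL;
    return op_names[op];
}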