Platform/XML Rewrite

From MozillaWiki

("I" in this page refers to hsivonen.)

Goals

  • Get rid of nsParser
  • Get rid of nsScanner
  • Get rid of nsIContentSink and related nsI stuff
  • Get rid of nsIParser
  • Align notification behavior with the HTML parser instead of deferring append notification for legacy reasons (after the change, we wouldn't have the state where DOM contains nodes that have been appended but whose append notifications haven't been posted)
  • Move Web content XML parsing off the main thread
  • For Web content, reuse code from the HTML side
  • Less COMtamination

Non-Goals

  • Replacing expat
  • Hiding expat from sinks
  • Moving chrome prototype parser or XSLT off the main thread

Background observations

The HTML5 parser has a design that works. When document.write handling complexity is not considered, the HTML5 parser has these major parts:

  • A parser object (nsHtml5Parser) that nsDocument sees and that holds the rest together.
  • An IO driver (nsHtml5StreamParser) that can receive bytes from a network stream, manages the character encoding conversion and pushes UTF-16 code units to the portable parser core.
  • The portable parser core (nsHtml5Tokenizer and nsHtml5TreeBuilder).
  • Glue code that produces tree ops from what the portable core does (nsHtml5TreeBuilderCppSupplement)
  • An executor for the tree ops (nsHtml5TreeOpExecutor)

XML Web content loading

I propose making the XML Web content load path have the same structure as the HTML load path (with document.write simplified out). That is, it would have these major parts:

  • A parser object (mozilla::XmlParser) that nsDocument sees and that holds the rest together.
  • An IO driver (mozilla::XmlStreamParser) that can receive bytes from a network stream, manages the character encoding conversion and pushes UTF-16 code units to expat.
  • expat (portable parser core)
  • An object that implements the handler callbacks for expat and produces tree ops (mozilla::XmlTreeOpGenerator).
  • The same executor for the tree ops as on the HTML side (nsHtml5TreeOpExecutor, eventually to be renamed mozilla::TreeOpExecutor).

Character encodings

expat has built-in capability to decode US-ASCII, ISO-8859-1, UTF-8 and UTF-16 and has an unconventional API for plugging in other decoders.

We should continue to handle character encodings outside expat, but we should handle the buffering in clearer code than the current nsScanner, which wasn't meant for XML to begin with and is now a very strange way to buffer data before it reaches expat.
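To make the buffering concern concrete, here is a minimal sketch of a decoder stage that sits in front of the parser and carries an incomplete byte sequence over from one network chunk to the next. The class name Utf8ToUtf16Buffer is hypothetical, only UTF-8 is shown, and the input is assumed to be well-formed; a real implementation would use Gecko's decoders and would have to handle malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 chunks to UTF-16 and keeps any
// incomplete trailing sequence until the next chunk arrives.
// ASSUMPTION: the input is well-formed UTF-8 (no error handling here).
class Utf8ToUtf16Buffer {
 public:
  void Feed(const uint8_t* data, size_t len, std::u16string& out) {
    assert(data != nullptr || len == 0);
    mPending.insert(mPending.end(), data, data + len);
    size_t i = 0;
    while (i < mPending.size()) {
      const uint8_t b = mPending[i];
      // Sequence length implied by the lead byte.
      const size_t need = b < 0x80 ? 1
                          : (b & 0xE0) == 0xC0 ? 2
                          : (b & 0xF0) == 0xE0 ? 3
                                               : 4;
      if (i + need > mPending.size()) {
        break;  // Incomplete sequence: wait for more bytes.
      }
      uint32_t cp = (need == 1) ? b : (b & (0xFFu >> (need + 1)));
      for (size_t k = 1; k < need; ++k) {
        cp = (cp << 6) | (mPending[i + k] & 0x3Fu);
      }
      if (cp >= 0x10000) {
        // Encode as a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
      } else {
        out.push_back(static_cast<char16_t>(cp));
      }
      i += need;
    }
    // Drop the consumed bytes; keep only the incomplete tail.
    mPending.erase(mPending.begin(), mPending.begin() + i);
  }

 private:
  std::vector<uint8_t> mPending;
};
```

The point of the sketch is that the carry-over state is a few bytes held in one small class, so the code that hands UTF-16 to expat never sees a split sequence.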

Connecting handlers to expat

Looking at the existing sinks, there appears to be no real value in having an abstraction between expat and the code that does the actual work in response to expat's callbacks. If we switched away from expat today, we'd have to change the current abstraction layer anyway. That is, I think it doesn't make sense to have a single class (like the old nsExpatDriver) that provides a set of expat callbacks and then provides another abstraction for concrete handler classes that do the real work. I propose we make the concrete handler classes set themselves as expat callbacks directly. That is, mozilla::XmlTreeOpGenerator should know how to register itself as the handler of various expat callbacks. This way, we don't need a layer of virtual calls right on top of expat's function pointer-based calls.
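The registration pattern described above can be sketched as follows. To keep the example self-contained, FakeParser is a hypothetical stand-in for a C parser with function-pointer callbacks and an opaque user-data pointer; it is not expat's real API, and the TreeOpGenerator class here is only an illustration of the shape that mozilla::XmlTreeOpGenerator might take.

```cpp
#include <cassert>
#include <string>
#include <vector>

// HYPOTHETICAL stand-in for a C-style parser API (not expat's real API):
// function-pointer callbacks plus an opaque user-data pointer.
struct FakeParser {
  void* userData = nullptr;
  void (*startElement)(void* userData, const char* name) = nullptr;
  void (*endElement)(void* userData, const char* name) = nullptr;
};

// Simulates the parser encountering <name></name>.
inline void FakeParseEmptyElement(FakeParser& parser, const char* name) {
  if (parser.startElement) parser.startElement(parser.userData, name);
  if (parser.endElement) parser.endElement(parser.userData, name);
}

// Concrete handler that installs its own static functions as callbacks,
// so no intermediate driver class or virtual call is involved.
class TreeOpGenerator {
 public:
  void RegisterWith(FakeParser& parser) {
    parser.userData = this;
    parser.startElement = &TreeOpGenerator::HandleStart;
    parser.endElement = &TreeOpGenerator::HandleEnd;
  }

  const std::vector<std::string>& Ops() const { return mOps; }

 private:
  static void HandleStart(void* userData, const char* name) {
    assert(userData);
    auto* self = static_cast<TreeOpGenerator*>(userData);
    self->mOps.push_back(std::string("open:") + name);
  }

  static void HandleEnd(void* userData, const char* name) {
    assert(userData);
    auto* self = static_cast<TreeOpGenerator*>(userData);
    self->mOps.push_back(std::string("close:") + name);
  }

  // Tree ops are represented as plain strings in this sketch.
  std::vector<std::string> mOps;
};
```

Each callback is a static member function that recovers the object from the user-data pointer, which is the usual way to attach a C++ object to a C callback interface.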

Dealing with stream data off the main thread

mozilla::XmlStreamParser should implement nsIStreamListener on the main thread and copy data over to the parser thread the way nsHtml5StreamParser does.
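As an illustration of the copy-and-hand-over idea only, here is a minimal sketch using standard C++ threading primitives. ChunkQueue and CountBytesOnParserThread are hypothetical names; the actual nsHtml5StreamParser dispatches runnables to the parser thread's event target rather than using a queue like this one.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical hand-over queue between the main thread and the parser thread.
class ChunkQueue {
 public:
  // Main thread: copy the bytes so the caller's buffer can be reused at once.
  void Push(const uint8_t* data, size_t len) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      assert(!mClosed);
      mChunks.emplace_back(data, data + len);
    }
    mCond.notify_one();
  }

  // Main thread: signal end of stream.
  void Close() {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mClosed = true;
    }
    mCond.notify_one();
  }

  // Parser thread: block until a chunk is available.
  // Returns false once the queue is closed and fully drained.
  bool Pop(std::vector<uint8_t>& out) {
    std::unique_lock<std::mutex> lock(mMutex);
    mCond.wait(lock, [this] { return !mChunks.empty() || mClosed; });
    if (mChunks.empty()) {
      return false;
    }
    out = std::move(mChunks.front());
    mChunks.pop_front();
    return true;
  }

 private:
  std::mutex mMutex;
  std::condition_variable mCond;
  std::deque<std::vector<uint8_t>> mChunks;
  bool mClosed = false;
};

// Usage sketch: the calling thread plays the main thread and a second
// thread plays the parser thread, which here just counts the bytes it sees.
inline size_t CountBytesOnParserThread(
    const std::vector<std::vector<uint8_t>>& input) {
  ChunkQueue queue;
  size_t total = 0;
  std::thread parserThread([&queue, &total] {
    std::vector<uint8_t> chunk;
    while (queue.Pop(chunk)) {
      total += chunk.size();
    }
  });
  for (const auto& chunk : input) {
    queue.Push(chunk.data(), chunk.size());
  }
  queue.Close();
  parserThread.join();
  return total;
}
```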

Dealing with entity references off the main thread

Currently, we map a small set of magic public ids to a DTD file that we actually feed to expat so that it gets parsed every time the user loads a document that references one of the magic public ids, such as the public ids for the XHTML 1.0 DTDs. This way, entities defined in the XHTML 1.0 DTDs are available to documents.

Since our IO APIs are meant to be called on the main thread, starting IO for the local DTD file from the parser thread is not good. And in any case, it's rather silly to parse an actual file when we know in advance what the file will contain.

Instead of parsing a special file in this case, expat should be hacked in such a way that its internal entity tables can be mutated to a state that's equivalent with the state they'd end up in by parsing the special DTD without actually parsing anything. Failing that, we could bake the data into the shared library so that it's available as static data on any thread.
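The fallback option of baking the data into the shared library could look roughly like the sketch below. EntityEntry, kEntities and LookupEntity are hypothetical names, and only three entries from the XHTML 1.0 Latin-1 entity set are listed; a real table would be generated from the DTDs and would still have to be connected to expat's entity handling somehow.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical static entity table. Because it is constant data in the
// binary, any thread can read it without doing IO or taking locks.
struct EntityEntry {
  const char* name;
  char16_t value;
};

// Illustrative subset only (values are from the XHTML 1.0 Latin-1 set).
static const EntityEntry kEntities[] = {
    {"copy", 0x00A9},
    {"eacute", 0x00E9},
    {"nbsp", 0x00A0},
};

// Returns the code unit for a known entity name, or 0 if it is unknown.
inline char16_t LookupEntity(const char* name) {
  assert(name != nullptr);
  for (const EntityEntry& entry : kEntities) {
    if (std::strcmp(entry.name, name) == 0) {
      return entry.value;
    }
  }
  return 0;
}
```

A linear scan is enough for a sketch; a generated table would more likely be sorted or perfectly hashed.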

Lack of actual speculation

In the HTML case, the only thing that can cause a speculation failure is document.write. Since XML has no document.write, the off-the-main-thread XML parser can parse its input to completion and doesn't need to support stream rewinding. All the tree ops can be queued up; they just need to be executed in chunks that each end at a script execution op, so that the world experienced by scripts looks as though the parts of the document after the current script didn't exist yet.
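The chunking rule can be sketched as follows. The types OpKind, TreeOp and ChunkedExecutor are hypothetical simplifications and do not reflect the actual nsHtml5TreeOpExecutor interface; "executing" an op is reduced to recording its payload.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified tree op representation.
enum class OpKind { AppendElement, AppendText, RunScript };

struct TreeOp {
  OpKind kind;
  std::string payload;
};

class ChunkedExecutor {
 public:
  // The parser may queue ops for the whole document ahead of execution.
  void Enqueue(TreeOp op) { mQueue.push_back(std::move(op)); }

  // Executes queued ops up to and including the next RunScript op (or
  // until the queue is empty) and returns how many ops were executed.
  // Ops after the script stay queued, so the script cannot observe them.
  size_t RunUntilScript() {
    size_t executed = 0;
    while (!mQueue.empty()) {
      TreeOp op = std::move(mQueue.front());
      mQueue.pop_front();
      mExecuted.push_back(op.payload);  // Stand-in for mutating the DOM.
      ++executed;
      if (op.kind == OpKind::RunScript) {
        break;
      }
    }
    return executed;
  }

  bool HasPending() const { return !mQueue.empty(); }
  const std::vector<std::string>& Executed() const { return mExecuted; }

 private:
  std::deque<TreeOp> mQueue;
  std::vector<std::string> mExecuted;
};
```

After a chunk ends at a script, the caller would run the script and then call RunUntilScript() again to continue with the ops that were already waiting in the queue.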

Parsing chrome: XML

Chrome documents in Firefox are localized using external DTDs that define named entities. This needs to work with the new implementation. Since initiating DTD IO from off the main thread is trouble, chrome: documents should be parsed on the main thread. To enable this, there should be an on-the-main-thread alternative for mozilla::XmlStreamParser: mozilla::XmlMainThreadStreamParser. To get assertions about which methods should run on which thread right, it is probably useful to actually have two classes instead of having one class with a flag that picks different code paths within the class. The two classes should probably share encoding sniffing code in a common superclass.
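A rough sketch of the two-class arrangement follows, with the shared sniffing code in a common superclass and each subclass asserting its own threading expectations. All class and method names here are hypothetical, the sniffing is reduced to a byte order mark check on the first chunk, and the constructors are assumed to run on the main thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>

// Hypothetical common superclass holding the shared encoding sniffing.
class StreamParserBase {
 public:
  virtual ~StreamParserBase() = default;
  virtual bool ParsesOnMainThread() const = 0;

  // BOM check only; returns an empty string when no BOM is present.
  static std::string SniffBom(const uint8_t* data, size_t len) {
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
      return "UTF-8";
    }
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF) return "UTF-16BE";
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE) return "UTF-16LE";
    return "";
  }

  const std::string& Encoding() const { return mEncoding; }

 protected:
  // Sniffs only the first chunk (simplification for this sketch).
  void SniffIfNeeded(const uint8_t* data, size_t len) {
    if (!mSniffed) {
      mEncoding = SniffBom(data, len);
      mSniffed = true;
    }
  }

 private:
  std::string mEncoding;
  bool mSniffed = false;
};

// Parses where it was created. ASSUMPTION: created on the main thread.
class MainThreadStreamParser : public StreamParserBase {
 public:
  MainThreadStreamParser() : mMainThread(std::this_thread::get_id()) {}
  bool ParsesOnMainThread() const override { return true; }

  void OnData(const uint8_t* data, size_t len) {
    assert(std::this_thread::get_id() == mMainThread);
    SniffIfNeeded(data, len);
  }

 private:
  std::thread::id mMainThread;
};

// Created on the main thread, but consumes data on another thread.
class OffMainThreadStreamParser : public StreamParserBase {
 public:
  OffMainThreadStreamParser() : mMainThread(std::this_thread::get_id()) {}
  bool ParsesOnMainThread() const override { return false; }

  void OnDataOnParserThread(const uint8_t* data, size_t len) {
    assert(std::this_thread::get_id() != mMainThread);
    SniffIfNeeded(data, len);
  }

 private:
  std::thread::id mMainThread;
};
```

With separate classes, each assertion is unconditional, which is the benefit over a single class whose assertions would depend on a runtime flag.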

Since chrome: documents can be XHTML, it follows that mozilla::XmlTreeOpGenerator needs to work on the main thread, too. This shouldn't be a big deal considering that tree op generation in the HTML case can run on either thread.

Parsing XML that's not Web content-like

We use expat for 1) XHR, 2) XML Web content, 3) Firefox UI files using the prototype parser and 4) XSLT programs. Moving the last two off the main thread may not be worthwhile.

Fragment parsing

On the HTML side, nsHtml5Parser also supports fragment parsing, but that functionality doesn't really benefit from being in the class that's oriented towards full page loading, so I think even on the HTML side, the fragment parsing functionality should be separated from nsHtml5Parser. I think it's not worthwhile to avoid using the tree op executor mechanism, though, since avoiding it would lead to a lot of code duplication. I think both HTML and XML should have distinct fragment parsing entry points in separate classes but the code for generating tree ops and executing them should be shared with the Web content network stream parsing code paths.

XSLT and XHR

The document that is given as input to an XSLT program and the document loaded via XHR are currently built by nsXMLContentSink. In the new world, they'd use the tree op mechanism instead, like the other things that currently use nsXMLContentSink.

Showing a common interface to nsDocument

nsIParser is pointlessly crufty and COMtaminated. There should be a new non-XPCOM abstract class that shows the commonality of XML and HTML parsing to nsDocument but no cruft. The new interface should look something like this:

namespace mozilla::parser {

// Abstract, non-XPCOM parser interface as seen by nsDocument.
class AParser {
 public:
  virtual ~AParser() = default;
  virtual void SetCharsetAndSource(mozilla::Encoding* aEncoding, uint32_t aSource) = 0;
  virtual nsIStreamListener* GetStreamListener() = 0;
  virtual void UnblockParser() = 0;
  virtual void Start(nsIRequestObserver* aObserver) = 0; // Replaces Parse()
  virtual void Terminate() = 0;
  // possibly document.write and script execution stuff that'd be no-op in the XML case
};

}  // namespace mozilla::parser

Both mozilla::XmlParser and nsHtml5Parser would inherit from this abstract class.

Alternatives

Both the input stage to the XML parser (nsParser, nsScanner) and the output stage (nsXMLContentSink) could be rewritten to address legacy issues without also moving expat off the main thread.