ServerJS/Binary/C

From MozillaWiki
See also: the Show of hands as well as the Unpacking and Essay portions which were removed.

This proposal was written by Daniel Friesen as an alternative to the Binary/B proposal.

This proposal extends the Blob type that a number of existing server-side JavaScript implementations use, and adds a Buffer type modeled on Java's StringBuffer/StringBuilder.

This proposal has a Buffer instead of Binary/B's ByteArray. None of the prior art actually implemented a ByteArray as Binary/B proposes. Most implementations implemented a Blob type similar to Binary/B's ByteString, and any that implemented something called ByteArray actually implemented something more like a stream API based buffer rather than anything remotely resembling an Array.

One goal of this proposal is interoperability between Strings and Blobs. That is, like Binary/B, this proposal aims to permit a class of generic algorithms that can operate on both Blob and String through a generic intersection between their APIs. However, unlike Binary/B, this proposal avoids things that seem counter-intuitive, like putting .charAt on a Blob, a byte collection. Instead, this proposal augments String with a .valueAt method that can be used generically on both Blob and String, and provides a Buffer system which works on either type.

This proposal is based on APIs drafted for MonkeyScript (Blob, Buffer).

Terms and reading notes

To avoid confusion and ambiguity these are the basic definitions of terms used within this document.

List
A type which groups a series of items in a specific order.
Sequence
A type of list which manages a list of fixed-unit pieces of data.
These units of data are normally either bytes or characters. "Sequence" is a term which refers generically to Strings, Blobs/ByteStrings, and mutable counterparts such as Buffer.
Array
A type of list which manages a list of items. These items are not related to one another in any way other than their inclusion in the list and do not need to be of the same type.
The key point is that an Array is a loose collection of items; these items do not have any sort of fixed unit to them.
memcopy
Where memcopy is used, it refers to the technique of copying memory as directly as possible from one source to another. At the very least this refers to copying from A to B without creating an intermediate Blob.

Where "as if by" is used in the spec, only the result is meant; the algorithm should not be affected by changes to the class's prototype.

Differences between a Sequence and an Array

While this may not be the case in lower level languages, JavaScript's API does make a clear distinction between strings and arrays.

Units
A Sequence is built up of a list of single-unit items, whereas an Array is built up of unitless items: the array does nothing but point to objects and contains nothing itself. The sequence "abc" is made up of 3 units { a, b, c }, whilst [1,"asdf",3,{}] is made up of 4 items { 1, "asdf", 3, {} } with no relation to each other and no fixed unit: it holds two separate numbers, a 4-unit sequence, and an object which could have an indefinitely deep hierarchy.
Spillover
Depending on whether the type is a Sequence or an Array, functions such as .indexOf may "spill" or "overflow" over multiple items. Sequences spill, while Arrays do not. This shows up as a subtle difference in the API between the two.
  • sequence.indexOf(sequence, [offset]);
  • array.indexOf(item, [offset]);
When using .indexOf on a sequence you give it another sequence. indexOf does not look for just a single item, but a sequence of items within that sequence. Contrasted to this, when using .indexOf on an array it ONLY looks for a single item and the search is unaffected by adjacent items.
This is apparent from how "foobarbaz".indexOf("bar"); returns the index of "bar" despite the fact that 'b', 'a', and 'r' are 3 separate units within this 9-unit sequence (in this case 1 unit being 1 character). By contrast, [1,2,3,4,5].indexOf([2,3,4]); does NOT return the location of the 2, 3, and 4 inside this array, because indexOf on an array is a single-item operation: it does not spill the lookup over into the following items.
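The spillover difference can be checked directly with the built-in String and Array methods. The variable names below are only for illustration.

```javascript
// Sequences "spill": indexOf matches a run of adjacent units.
const seqIndex = "foobarbaz".indexOf("bar");

// Arrays do not: indexOf compares each item to the argument as one value,
// so a sub-array argument never matches a run of adjacent items.
const arrIndex = [1, 2, 3, 4, 5].indexOf([2, 3, 4]);

// A single item is still found in the array.
const itemIndex = [1, 2, 3, 4, 5].indexOf(3);

console.log(seqIndex, arrIndex, itemIndex); // 3 -1 2
```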
Pushing and Popping
Another point which does not get emphasised because strings are immutable in JavaScript and thus don't need methods to mutate them as Arrays do, is the semantics of .push, .pop, etc...
.pop() and .shift() remove ONE item from an array and return it.
Likewise, given one argument, .push() and .unshift() add ONE item to an array.
The key point here is that [1,2,3].push([4,5,6]); does NOT turn the array into [1,2,3,4,5,6]; it just adds [4,5,6] as a sub-array, giving [1,2,3,[4,5,6]].
You can give multiple arguments to these methods, but then you are no longer working with your lists in the same way.
There is another name which does fit this kind of operation, "Append" (Side note, Wrench.js does add .append to Array). Using [1,2,3].append([4,5,6]); DOES push 4, 5, and 6 onto the array creating the array [1,2,3,4,5,6].
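As a rough illustration of the push/append distinction, the sketch below uses a free-standing append function. This helper is hypothetical and written only for this example; it is not the Wrench.js method and not a standard Array method.

```javascript
// Hypothetical helper: appends every item of `items` to `target`
// in place and returns `target`.
function append(target, items) {
  for (const item of items) {
    target.push(item);
  }
  return target;
}

const pushed = [1, 2, 3];
pushed.push([4, 5, 6]); // adds ONE item: the sub-array itself

const appended = append([1, 2, 3], [4, 5, 6]);

console.log(JSON.stringify(pushed));   // [1,2,3,[4,5,6]]
console.log(JSON.stringify(appended)); // [1,2,3,4,5,6]
```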

Prior art

Java's java.lang.StringBuffer is a very good reference for prior art. It is made for Strings rather than bytes, but nonetheless it is an API designed solely for the purpose of mutating a string, not one designed for one purpose and hacked to suit another.

Java's StringBuffer works by .append()ing and .insert()ing strings to grow the buffer. .delete() removes portions of the buffer, .indexOf() and .lastIndexOf() can search, .replace() and .reverse() are available, .length() shows the length of the data itself, .capacity() shows the current amount of memory allocated, and .substring() can grab a substring from the StringBuffer.

The API

The API for this spec defines two new classes: Blob (Flusspferd, Google, and jslibs have all used this name; it is a fairly long-standing name and normally works similarly) and Buffer.

It is up to an implementation whether it makes Blob and Buffer native global objects or secludes them inside a binary module. Whether or not they are made global, if the implementation implements require() then require('binary'); must return an object containing Blob and Buffer as keys, even if the binary module is simply a module containing exports.Blob = Blob; exports.Buffer = Buffer;.

Blob

Blob is the binary counterpart to String. It has a slightly different API but many similarities. A Blob is an immutable representation of a sequence of 8-bit bytes.

Most of the blob methods work on blobish data, rather than flat blobs. This means that the argument is treated as if it were passed through Blob(), thus .indexOf(255); is the same as if you had done .indexOf(Blob(255)), so you do not need to explicitly convert everything into a blob.

Note that unlike String, Blob is not defined as a primitive datatype by ECMA. This means that typeof will never return 'blob' and all blobs will be objects, unlike strings, which are normally primitives. Blob works with and without the new keyword and acts the same either way. It is recommended to use the `Blob()` form.

[new] Blob();
Construct an empty blob
[new] Blob(number);
Construct a single unit blob, converting the number into a byte. If the number is outside the range 0..255, not a number, or not an integer (has a decimal point) a TypeError should be thrown.
[new] Blob(arrayOfNumbers);
Construct a blob the same length as the array, converting numbers 0..255 into bytes. If any item is outside that range, not a number, or not an integer (has a decimal point) a TypeError should be thrown.
[new] Blob(blob);
Passes the blob through.
[new] Blob(string, toCharset);
Construct a new blob with the binary contents of a string. The string will be encoded from the native UTF-16 charset into the charset specified by the toCharset argument and represented in the new blob in 8bit bytes.
blob.length;
Returns the length of the blob. This is immutable.
blob.contentConstructor;
Returns Blob to indicate this has binary content.
blob[index]; // Optional
blob.byteAt(index);
blob.valueAt(index);
@showofhands (ashb suggests .byteAt could return Number (byte) instead of a single unit blob; .valueAt would still return blob so that abstract code still works)
Extracts a single byte from the blob and returns a new blob object containing only it. Note that the blob[i] form is optional; implementations may choose to exclude support for it. Ideally this should track support for string[i]: if the interpreter being used supports string[i], an implementation is expected to attempt to support blob[i] as well. (Waiting on show of hands)
blob.byteCodeAt(index);
blob.codeAt(index);
Extracts a single byte from the blob and returns it as an unsigned integer (Number) such that the number will be in the range 0..255.
blob.indexOf(blob, offset=0);
blob.lastIndexOf(blob, offset=0);
Returns the index within the calling blob object of the first or last (depending on which method is used) occurrence of the specified value, or -1 if not found.
blob.concat(otherBlob, ...);
Combines the content of multiple blobs together and returns a new blob.
blob.slice(begin, end);
Extracts a section of the blob and returns a new blob containing it as the contents. (This should behave the same as string.slice and array.slice)
blob.split();
blob.split(separator);
blob.split(separator, limit);
Splits the blob based on a sequence of bytes ({0 0 0 255 0 0} split by 255 would become [{0 0 0}, {0 0}]) and returns an array of blobs. This is the same as string.split except it does not support regular expressions. Like string.split this supports sequences of more than one unit (ie: You may split {0 0 255 0 0 255 3 0} by the blob {255 0} and get [{0 0}, {0 255 3 0}])
blob.toBlob([fromCharset, toCharset]);
If passed with no argument returns the same blob.
If passed with two charset arguments transcodes the data from one charset to the other and returns the data as a new blob.
Note that if a single argument is passed to this method it should throw a TypeError to prevent gotchas where someone runs .toBlob(charset) on a blob instead of a string where it is relevant.
blob.toString();
Returns a debug representation like "[Blob length=2]", where 2 is the length of the blob. Alternative debug representations are valid too, as long as (A) this method will never fail, (B) the length is included, (C) It is not only the representation of an implicitly converted string.
blob.toString(fromCharset);
Converts the binary data in the blob from the charset specified by fromCharset to the native UTF-16 charset and returns a new string with that content.
blob.toArray();
Returns an array containing the bytes as numbers as if by [ blob.byteCodeAt(i) for ( i in blob ) ].
blob.toArray(fromCharset);
Returns an array containing the decoded Unicode code points as if by var str = blob.toString(fromCharset); [ str.charCodeAt(i) for ( i in str ) ].
blob.toSource();
This method is optional; it should be included if the interpreter being used supports .toSource() on its various objects and types.
Returns a representation of the blob in the format "(Blob([]))" or "(new Blob([]))". If the blob has content in it the string should contain integers 0..255 representing the blob such that if evaluated (calling the correct Blob function) would return a blob with the same content.
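To make the blob semantics above more concrete, here is a minimal sketch of a subset of the API. SketchBlob is a hypothetical stand-in written for this example, not an existing implementation; it is backed by a Uint8Array, omits the charset-based constructor and conversions, ignores the limit argument of split, and does not handle out-of-range indexes.

```javascript
// SketchBlob: hypothetical stand-in for the proposed Blob (subset only).
class SketchBlob {
  constructor(source = []) {
    if (source instanceof SketchBlob) {
      this.bytes = source.bytes; // never mutated here, so sharing is safe
    } else {
      const items = typeof source === "number" ? [source] : source;
      if (!Array.isArray(items)) {
        throw new TypeError("expected a number, an array of numbers, or a blob");
      }
      for (const n of items) {
        // Reject non-numbers, non-integers, and values outside 0..255.
        if (!Number.isInteger(n) || n < 0 || n > 255) {
          throw new TypeError("not a byte: " + String(n));
        }
      }
      this.bytes = Uint8Array.from(items);
    }
    Object.freeze(this);
  }

  get length() { return this.bytes.length; }
  codeAt(index) { return this.bytes[index]; }
  valueAt(index) { return new SketchBlob([this.bytes[index]]); }
  toArray() { return Array.from(this.bytes); }

  slice(begin, end) {
    return new SketchBlob(Array.from(this.bytes.slice(begin, end)));
  }

  // "Blobish" argument: indexOf(255) behaves like indexOf(SketchBlob(255)).
  indexOf(needle, offset = 0) {
    const n = new SketchBlob(needle).bytes;
    for (let i = Math.max(0, offset); i + n.length <= this.bytes.length; i++) {
      let j = 0;
      while (j < n.length && this.bytes[i + j] === n[j]) j++;
      if (j === n.length) return i;
    }
    return -1;
  }

  // Splits on a byte sequence; an empty separator yields the whole blob.
  split(separator) {
    const sep = new SketchBlob(separator);
    const parts = [];
    let start = 0;
    let at;
    while (sep.length > 0 && (at = this.indexOf(sep, start)) !== -1) {
      parts.push(this.slice(start, at));
      start = at + sep.length;
    }
    parts.push(this.slice(start));
    return parts;
  }
}

const parts = new SketchBlob([0, 0, 255, 0, 0, 255, 3, 0]).split([255, 0]);
console.log(JSON.stringify(parts.map((p) => p.toArray()))); // [[0,0],[0,255,3,0]]
```

The usage line reproduces the multi-unit split example given in the description of blob.split.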

Buffer

The Buffer API consists of three classes: Buffer, StringBuffer, and BlobBuffer. Buffer itself is the generic class; calls to it will normally create either a StringBuffer or a BlobBuffer. Both StringBuffer and BlobBuffer should inherit from Buffer, so that buf instanceof Buffer is true for either.

Buffers may implement smart resizing in the background (ie: padding arrays or whatnot to sizes to avoid reallocating on each insert) but information on this is not available to the JavaScript API.

Buffers will only take their own data type as arguments. If you try to insert a String into a BlobBuffer or a Blob into a StringBuffer a TypeError will be thrown.

new Buffer();
No-op... This simply creates an instance of Buffer. On its own the Buffer class does nothing, so this simply exists so that prototypes may be made that inherit from Buffer.
new StringBuffer();
Creates a new empty text buffer.
new BlobBuffer();
Creates a new empty binary buffer.
new StringBuffer(len);
Creates a new text buffer of len size.
new BlobBuffer(len);
Creates a new binary buffer of len size.
new Buffer(String);
Creates a new empty StringBuffer.
new Buffer(Blob);
Creates a new empty BlobBuffer.
new Buffer(String, len);
Creates a new StringBuffer of len size.
new Buffer(Blob, len);
Creates a new BlobBuffer of len size.
new Buffer(string);
Creates a new StringBuffer with the same size and contents as the string.
new Buffer(blob);
Creates a new BlobBuffer with the same size and contents as the blob.
buf.length;
buf.length = len;
Get or set the length of the buffer (For binary buffers this is number of bytes, for text buffers this is number of characters).
When length is set the buffer is dynamically resized. If shrunk it is truncated to size discarding items from the end. If grown the buffer is padded with 0 bytes for binary, and '\0' (null characters) for text.
buf.contentConstructor;
Returns Blob from a BlobBuffer to indicate it has binary content, and String from StringBuffer to indicate it has text content. Implementations should make an effort to make this readonly.
buf[index];
buf.valueAt(index);
Returns a string or blob representing the unit at a specified index.
buf.append(data);
Append a chunk of data to the end of the buffer growing it by data.length. If data is another Buffer memcopy should be used.
buf.insert(data, index);
Insert a chunk of data into a buffer growing it by data.length and shifting the data to the right of the specified index towards the end of the buffer. If data is another Buffer memcopy should be used.
buf.clear(start, length);
Zero out a section of the buffer. Binary buffers have bytes replaced with 0 bytes and text buffers have characters replaced with '\0' (null characters).
buf.fill(start, length, seq);
Fill a section of the buffer with seq, rather than zeroing it out as buf.clear does.
buf.remove(offset, length);
Remove a section of the buffer starting at offset and continuing for length units, shrinking it by length.
buf.splice(offset, length, data, ...);
Remove a section of the buffer and insert chunks of data starting from the place it was removed from. If data is another Buffer memcopy should be used.
buf.slice();
buf.slice(start);
buf.slice(start, end);
Extract a subsection of the buffer and return it as a new sequence. (Behaves the same as the string and blob counterparts)
buf.copy(data, offset, length, [dataOffset]);
Uses memcopy to copy a section of data directly into buf. data may either be another buffer of the same type, or a sequence (String/Blob) of same type as indicated by contentConstructor.
buf.split();
buf.split(separator);
buf.split(separator, limit);
Splits the buffer based on a sequence and returns an array of strings or blobs. (When used on text buffers, an implementation may or may not choose to support regular expressions.)
buf.indexOf(sequence, offset=0);
buf.lastIndexOf(sequence, offset=0);
Returns the index within the calling buffer object of the first or last (depending on which method is used) occurrence of the specified value, or -1 if not found.
buf.valueOf();
Return the non-mutable sequence for the buffer.
  • In a BlobBuffer this returns a Blob which matches the contents of the buffer.
  • In a StringBuffer this returns a String which matches the contents of the buffer.
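The mutation semantics described above can be sketched for the text case as follows. SketchStringBuffer is a hypothetical, simplified stand-in for the proposed StringBuffer: it keeps its content in an ordinary string, takes an initial string rather than a length, covers only a few methods, and makes no attempt at memcopy or smart resizing.

```javascript
// SketchStringBuffer: hypothetical stand-in for the proposed StringBuffer.
class SketchStringBuffer {
  constructor(initial = "") {
    this.text = SketchStringBuffer.check(initial);
  }

  // Buffers only accept their own data type (here: strings).
  static check(data) {
    if (typeof data !== "string") {
      throw new TypeError("a text buffer only accepts strings");
    }
    return data;
  }

  get length() { return this.text.length; }

  set length(len) {
    this.text = len < this.text.length
      ? this.text.slice(0, len)       // shrink: discard units from the end
      : this.text.padEnd(len, "\0");  // grow: pad with null characters
  }

  append(data) {
    this.text += SketchStringBuffer.check(data);
  }

  insert(data, index) {
    SketchStringBuffer.check(data);
    this.text = this.text.slice(0, index) + data + this.text.slice(index);
  }

  remove(offset, length) {
    this.text = this.text.slice(0, offset) + this.text.slice(offset + length);
  }

  valueAt(index) { return this.text.charAt(index); }

  // Returns the immutable sequence matching the buffer's contents.
  valueOf() { return this.text; }
}

const buf = new SketchStringBuffer("hello");
buf.append(" world");  // "hello world"
buf.insert(",", 5);    // "hello, world"
buf.remove(0, 7);      // "world"
buf.length = 7;        // "world" followed by two null characters
console.log(JSON.stringify(buf.valueOf())); // "world\u0000\u0000"
```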

String extensions

These extensions may be optional; however, it would be ideal if implementations added these prototypes to the standard objects. Implementations may choose how to implement these (load binary themselves beforehand, prototype methods that use require('binary') within them, etc...)

string.contentConstructor;
Returns String to indicate this has text content.
string.toBlob(toCharset);
Converts a UTF-16 string into the specified charset and returns a blob containing that binary data.
string.valueAt(index);
An alias for string.charAt(index);
The point of this prototype is so that (string or blob).valueAt(index); may be used independently of whether the sequence is a string or a blob. This will allow strings to maintain .charAt and blobs to maintain .byteAt without returning unintuitive results while still allowing a method of working abstractly without relying on things like (str or blob)[index] which may not be implemented in some engines.
string.codeAt(index);
An alias for string.charCodeAt(index);
The point of this prototype is so that (string or blob).codeAt(index); may be used independently of whether the sequence is a string or a blob. This will allow strings to maintain .charCodeAt and blobs to maintain .byteCodeAt without returning unintuitive results.
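One way an implementation might add the text-only aliases is sketched below. This is an assumption about approach rather than something the proposal prescribes: the define helper is hypothetical, the properties are made non-enumerable so they do not show up in for...in loops, and string.toBlob is left out because it depends on Blob.

```javascript
// Hypothetical helper: defines a non-enumerable property on
// String.prototype unless a property of that name already exists.
function define(name, value) {
  if (!(name in String.prototype)) {
    Object.defineProperty(String.prototype, name, {
      value: value,
      enumerable: false,
      configurable: true,
      writable: false,
    });
  }
}

// valueAt and codeAt are plain aliases for charAt and charCodeAt.
define("valueAt", function (index) { return this.charAt(index); });
define("codeAt", function (index) { return this.charCodeAt(index); });

// contentConstructor indicates text content.
define("contentConstructor", String);

console.log("abc".valueAt(1));                      // b
console.log("abc".codeAt(1));                       // 98
console.log("abc".contentConstructor === String);   // true
```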

Abstract API

One of the primary focuses was interoperability between Strings and Blobs so that abstract algorithms could be written which work on either strings or blobs.

The entire Buffer api was designed for this purpose, and the following methods on String and Blob are usable in abstract programming:

  • seq.length;
  • seq.contentConstructor (can be used as seq.contentConstructor() to return an empty seq of the same type)
  • seq.valueAt(idx); // Sequence at index
  • seq.codeAt(idx); // Number at index
  • seq.valueOf(); // Returns the same seq (on a buffer returns the equiv Blob or String)
  • seq.indexOf(seq, [off]); and seq.lastIndexOf(seq, [off]); // finding the location of a subsequence
  • seq.concat(...seq); // combining sequences together
  • seq.slice(begin, end); // extracting a portion of a sequence
  • seq.split(sep, [limit]); // split up a sequence using another sequence as a separator
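As an example of the kind of abstract algorithm this list is meant to enable, the sketch below counts occurrences of a subsequence using only .length and .indexOf. Both countOccurrences and ByteSeq are hypothetical names made up for this example; ByteSeq is a tiny byte-sequence stand-in with just enough of the abstract API to exercise the function, not the proposed Blob.

```javascript
// Counts non-overlapping occurrences of `needle` in `seq`, using only
// the abstract members `length` and `indexOf`.
function countOccurrences(seq, needle) {
  if (needle.length === 0) return 0; // avoid looping forever on an empty needle
  let count = 0;
  let at = seq.indexOf(needle, 0);
  while (at !== -1) {
    count++;
    at = seq.indexOf(needle, at + needle.length);
  }
  return count;
}

// Hypothetical minimal byte sequence used only to show the same
// function working on non-string data.
class ByteSeq {
  constructor(bytes) { this.bytes = bytes; }
  get length() { return this.bytes.length; }
  indexOf(needle, offset = 0) {
    for (let i = offset; i + needle.length <= this.length; i++) {
      if (needle.bytes.every((b, j) => this.bytes[i + j] === b)) return i;
    }
    return -1;
  }
}

const inText = countOccurrences("a,b,,c", ",");
const inBytes = countOccurrences(
  new ByteSeq([0, 255, 0, 255, 255]),
  new ByteSeq([255])
);
console.log(inText, inBytes); // 3 3
```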

General requirements

Any operation that requires encoding, decoding, or transcoding among charsets may throw an error if that charset is not supported by the implementation. All implementations MUST support "us-ascii" and "utf-8".

Charset strings are as defined by IANA http://www.iana.org/assignments/character-sets.

Charsets are case insensitive.
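A rough sketch of how these requirements might look for encoding is given below. The encode function is hypothetical and returns a Uint8Array as a stand-in for a Blob. It assumes a host that provides the standard TextEncoder, and it handles only the two mandatory charsets; throwing a RangeError for characters that us-ascii cannot represent is an assumption, since the proposal does not say what should happen in that case.

```javascript
// Hypothetical helper: encodes a string into bytes for the two
// charsets the proposal makes mandatory.
function encode(string, charset) {
  switch (String(charset).toLowerCase()) { // charset names are case insensitive
    case "utf-8":
      return new TextEncoder().encode(string);
    case "us-ascii": {
      const bytes = new Uint8Array(string.length);
      for (let i = 0; i < string.length; i++) {
        const code = string.charCodeAt(i);
        // Assumption: unrepresentable characters are an error.
        if (code > 127) {
          throw new RangeError("character not representable in us-ascii");
        }
        bytes[i] = code;
      }
      return bytes;
    }
    default:
      // Unsupported charsets may throw, per the general requirements.
      throw new Error("unsupported charset: " + charset);
  }
}

console.log(Array.from(encode("é", "UTF-8")));     // [ 195, 169 ]
console.log(Array.from(encode("Hi", "US-ASCII"))); // [ 72, 105 ]
```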

Notes

  • A high priority in this proposal was String/Blob interoperability. While implicit string conversion was avoided, it was important to make sure there was an API which could work abstractly with a sequence of data, ignorant of whether the data was a string or a blob.
    • .valueAt was added to string so that there was a common method for both blobs and strings without implementing a counterintuitive .charAt on blob. Note that as a result you can actually check .charAt vs .byteAt and string will only have .charAt, while blob will only have .byteAt.
    • Buffer was made independent of whether the data is binary or text. To avoid implicit string conversion TypeErrors are thrown when giving data of the incorrect type to a buffer. But you are still able to write code using buffer that works on either strings or blobs and doesn't care which mode it is in.
      • Note how Buffer accepts String or Blob to determine its data type. You could actually write code like var buf = new Buffer(sequence.constructor); and create a buffer based on the type of a sequence without checking what it is.
    • While same-type rules apply, .slice can be used abstractly on both strings and blobs (arrays too, actually), and the same goes for .concat, .length, .split (without regex), and .indexOf/.lastIndexOf.
  • Some experimentation with .valueOf needs to be done. .valueOf has type hinting (the first argument is a string hint of what type may be converted to, operators like > and < make use of it as well as a few other cases). It would be nice to see if it's possible to use the native < and > operators to compare blobs on their binary order.
  • For now I've ignored things like .eq/.equals, .lt, .gt, etc. Do note that Rhino actually implements .equals on String already. Also, if we do add these things to blob we should probably implement the same on string.
  • Aristid Breitkreuz notes Buffer could be moved to IO.

Relevant discussion