Tamarin:Strings

From MozillaWiki

The implementation of Tamarin strings has changed. This page is a how-to about the changes, and what can be done to adapt source code to the changes.

For implementation details, see Tamarin:String_implementation.

General

The new String class stores strings of variable character width. The characters of a string are 8, 16, or 32 bits wide (32 bits only if 32-bit support is enabled). 8-bit strings contain the first 256 characters of the Unicode character set, often referred to as Latin-1. A dedicated creator method accepts a null-terminated UTF-8 string.

Support for 32-bit string widths is disabled by default; a special constant enables it.

There is no new operator; instead, there are a number of static creation methods that create and return a string: createUTF8(), createLatin1(), createUTF16(), createUTF32(). All of these creators accept a width constant, so strings can be created with widths of 8, 16, or 32 bits. The value kAuto lets the creators determine the width that fits best. If, for example, createUTF8() is invoked with a string that decodes to the Latin-1 character set, the resulting string width is 8 bits. All creators accept a Boolean value that, if true, declares the character data to be static, meaning that the String instance can use the buffer directly without having to copy the character data. Of course, the character data must be guaranteed to live at least as long as the string and any strings derived from it.
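
The idea behind kAuto can be illustrated with a small standalone sketch. It is not Tamarin code: the Width enum, the narrowestWidth() function, and the allow32 flag are hypothetical names, and the rule that characters above 0xFFFF fall back to 16 bits (as surrogate pairs) when 32-bit support is off is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the width constants; not Tamarin's actual enum.
enum class Width { k8 = 8, k16 = 16, k32 = 32 };

// Return the narrowest width that can hold every code point.
// allow32 models the "32-bit support enabled" switch (an assumption).
inline Width narrowestWidth(const std::vector<std::uint32_t>& codePoints,
                            bool allow32 = false)
{
    Width w = Width::k8;
    for (std::uint32_t cp : codePoints) {
        assert(cp <= 0x10FFFF);          // must be a valid Unicode code point
        if (cp > 0xFFFF) {
            if (allow32)
                return Width::k32;       // nothing wider exists, stop early
            w = Width::k16;              // would be stored as a surrogate pair
        } else if (cp > 0xFF) {
            w = Width::k16;              // outside Latin-1
        }
    }
    return w;
}
```

With this sketch, the code points of "Hi" select 8 bits, while a single euro sign (U+20AC) anywhere in the input selects 16 bits.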

Usually, the character data is copied into a data buffer that the String instance points to. A substring contains a reference to its master string; its data pointer points into the master string's data, and its length lies within the master string. Strings containing static data have a pointer to that data.

Very important: Strings are never NUL-terminated, because they may contain NUL characters as valid characters.
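
Why NUL termination cannot be relied on is easy to demonstrate outside of Tamarin. The following sketch uses std::string as a stand-in for a length-counted string; the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Build a 3-character string whose middle character is NUL.
inline std::string makeWithEmbeddedNul()
{
    std::string s("a\0b", 3);   // the explicit length keeps the NUL and the 'b'
    assert(s.size() == 3);
    return s;
}

// Length as a NUL-terminated API sees it: scanning stops at the first NUL.
inline std::size_t cLength(const std::string& s)
{
    return std::strlen(s.c_str());
}
```

The stored length is 3, but any code that scans for a terminating NUL sees only the first character. Code that handles string data therefore has to carry the length explicitly.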

Creation

The only way to create a string is to use one of the static creator methods:

 static Stringp String::createLatin1(const AvmCore* core, const char* buffer,
   int32_t len = -1, Width desiredWidth = kAuto, bool staticBuf = false);

There is a createUTF16() method for const wchar* data, a createUTF8() method for UTF-8 character data, and a createUTF32() method (available only if 32-bit support is enabled). The default argument for the desired string width is kAuto. In that case, the method checks the string and creates a String instance that best fits the string data. If the source data is 32 bits and the desired width is 16 bits, surrogate pairs will be created. If the source data is 16 bits and the destination width is 32 bits, surrogate pairs will be combined into a single UTF-32 character. If the requested width is too small to fit the string, NULL is returned.
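
The surrogate pair conversion mentioned above follows the standard UTF-16 arithmetic. The sketch below shows that arithmetic in isolation; toSurrogates() and fromSurrogates() are illustrative names, not functions of the String class.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
// surrogate pair: first = high surrogate, second = low surrogate.
inline std::pair<std::uint16_t, std::uint16_t> toSurrogates(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;   // 20 significant bits remain
    return { static_cast<std::uint16_t>(0xD800 + (v >> 10)),
             static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)) };
}

// Combine a high and a low surrogate back into one UTF-32 character.
inline std::uint32_t fromSurrogates(std::uint16_t hi, std::uint16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000
         + ((static_cast<std::uint32_t>(hi) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00);
}
```

For example, U+1F600 splits into the pair 0xD83D, 0xDE00, and combining that pair yields U+1F600 again.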

If the staticBuf argument is true, the buffer is considered to live as long as the supplied AvmCore instance, and the string data is not copied if it matches the criteria set by the requested width. For UTF-8 data, the data must be ASCII to match these criteria.

These methods may return NULL if the source data is malformed, or does not fit into the requested string width.
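
The ASCII requirement for static UTF-8 buffers makes sense because only bytes below 0x80 have the same representation in UTF-8 and in an 8-bit string. A caller who wants to know in advance whether a buffer qualifies could run a check like the following sketch; isAscii() is a hypothetical helper, not part of the String API.

```cpp
#include <cassert>
#include <cstddef>

// True if every byte is below 0x80, so that the UTF-8 bytes can be read
// unchanged as 8-bit characters.
inline bool isAscii(const char* data, std::size_t len)
{
    assert(data != nullptr || len == 0);
    for (std::size_t i = 0; i < len; ++i) {
        if (static_cast<unsigned char>(data[i]) >= 0x80)
            return false;   // lead or continuation byte of a multi-byte sequence
    }
    return true;
}
```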

Data access

Direct access to the data buffer is no longer possible, since it is not guaranteed that the string data is unique, or even writable. Therefore, the c_str() method is gone. It is still possible to access single characters via the charAt() method or the StringIndexer class. The latter is a fast way to iterate through the string data.

Example:

 // Create a string
 Stringp s = String::createLatin1(core, "Hello world");
 // Iterate through the string
 StringIndexer indexer(s);
 for (int i = 0; i < indexer->length(); i++)
   process(indexer[i]);

To retrieve a character string that contains UTF-8 or UTF-16 data, use the classes StUTF8String or StUTF16String. These classes can only be created on the stack, and they contain a NUL-terminated string that can be accessed via the c_str() method. Note that creating such an instance causes a copy of the string to be created. Another class, StIndexableUTF8String, adds conversion between UTF-8 character positions and byte offsets. All of these classes are data buffers only; they are not "real" String instances.

Example:

 // Create a string
 Stringp s = String::createLatin1(core, "Hello world");
 // Access that string as UTF-16 data
 StUTF16String s16(s);
 const wchar* p = s16.c_str();
 

To get a string of a known fixed width, use the getFixedWidthString() method. The method returns this if the string already has the requested width; otherwise, it returns a copy of the string with the given width. If the requested width is too narrow, for example when an 8-bit string is requested but the 16-bit source contains characters >= 0x0100, the return value is NULL.

Example:

 Stringp s = getSomeString();
 Stringp s16 = s->getFixedWidthString(String::k16);
 if (!s16)
   return error;
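
The narrowing rule can be sketched independently of Tamarin. In the following illustration, narrowToLatin1() is a hypothetical helper working on standard library strings; a false return value plays the role of the NULL result described above.

```cpp
#include <cassert>
#include <string>

// Narrow 16-bit code units to 8-bit Latin-1 characters.
// Returns false and leaves 'out' empty if any unit is >= 0x0100.
inline bool narrowToLatin1(const std::u16string& src, std::string& out)
{
    out.clear();
    for (char16_t unit : src) {
        if (unit >= 0x0100) {
            out.clear();
            return false;               // does not fit into 8 bits
        }
        out.push_back(static_cast<char>(unit));
    }
    assert(out.size() == src.size());   // one 8-bit character per code unit
    return true;
}
```

As in the example above, the caller has to check the result before using the narrowed string.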

Appending data

The String class offers several appendXXX() methods to append strings to a string. These methods return either a new String instance, or the String instance itself, if in-place concatenation was possible (see Tamarin:String_implementation for details).

 Stringp append(const String str);        // append a String instance
 Stringp appendLatin1(const char* data);  // append characters
 Stringp append16(const wchar* data);     // append UTF-16 data
 Stringp append32(const utf32_t* data);   // append UTF-32 data
 

If the appended data is too wide for the string, the string is widened. The latter three methods have overloads that take the length of the data to be appended as an additional argument.

The old static concatStrings() is still available.

Example:

 // Create an XML attribute with namespace
 Stringp ns = xml->getNamespace();
 Stringp name = NULL;
 if (ns) {
   name = ns;
   name = name->appendLatin1(":");
   if (xml->isAttr)
     name = name->appendLatin1("@");
   name = name->append(xml->getName());
 } else {
   name = xml->getName();
 }

Additional String methods

The String class contains most of the usual JavaScript String methods like indexOf() etc. These are highly optimized and accept integer arguments, so it is OK to use them freely. There is a special version of indexOf that accepts a char* for a quick comparison with a character constant, as well as matchesXXX() methods that match the string at a given position against an argument. Finally, there are containsXXX() methods to check for the existence of a substring.

Example:

 Stringp s = ...;
 if (s->matchesLatin1("<?xml", 5)) ...
 else if (s->matchesLatin1("<![CDATA[", 9)) ...
 if (s->indexOfLatin1(":") < 0) ...

The fixDependentString() function converts a substring to a normal string. This is handy if the string needs to live for a long time, but you do not want the master of the dependent substring to stay alive for that long.

Example:

 Stringp xml = parseVeryLargeXMLFile();
 Stringp start = xml->substr(0, 3);
 // if start were stored anywhere, the entire, huge XML string would remain alive
 // unless fixDependentString was called
 start->fixDependentString();
 

The makeDynamic() function converts a string with a static buffer, or a string that is a substring, to a string with a dynamic buffer.
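
Why a dependent substring keeps its master alive, and what detaching it achieves, can be modelled with reference-counted buffers. The sketch below is a simplified illustration under that assumption, not the Tamarin implementation (which relies on garbage collection); Str, makeStr(), and detach() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical model of a dependent string: a view into a shared master buffer.
struct Str {
    std::shared_ptr<const std::string> master;  // keeps the master buffer alive
    std::size_t offset;
    std::size_t length;

    // The characters this string actually represents.
    std::string text() const { return master->substr(offset, length); }

    // A substring shares the master's buffer instead of copying characters.
    Str substr(std::size_t pos, std::size_t len) const
    {
        assert(pos + len <= length);
        return Str{master, offset + pos, len};
    }

    // Copy the visible characters into a private buffer and drop the
    // reference to the master, so the master can be freed.
    void detach()
    {
        master = std::make_shared<std::string>(text());
        offset = 0;
    }
};

// Create a string that owns its own buffer.
inline Str makeStr(const std::string& s)
{
    return Str{std::make_shared<std::string>(s), 0, s.size()};
}
```

In this model, a five-character substring of a large document holds a reference to the whole document until detach() is called; afterwards it owns only its own five characters.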