A Tour of Morfa

Text type

The built-in immutable type text represents strings of Unicode characters and uses UTF-16 encoding internally.

String literals are delimited with double quotes ("). The default value of a text variable is the empty string "".

Built-in text functions and properties are illustrated in the following code snippet:

    var t: text;     // t is initialized to ""

    t = "Hola";
    assert(t.length == 4);

    // Brackets are used for accessing individual characters.
    assert(t[1] == 'o'); // 2nd character of "Hola" is 'o'

    // '~' is the concatenation operator.
    t = t ~ " mundo";    
    // '~' may also append and prepend individual characters.
    t = t ~ '!';
    t = 0xa1 ~ t;        // 0xa1 is the code point of '¡'

    // Text equality is tested with ==.
    assert(t == "¡Hola mundo!"); 

    // Shorthand operator ~= is also available.
    t = "good";
    t ~= "bye";
    t ~= '.'; 
    assert(t == "goodbye.");

    // Comparison operators use the lexicographical order.
    assert("morfa" < "morfa!");
    assert("Morfa" < "morfa");  // since 'M' < 'm'
    assert("short" > "long");

    // Substrings are formed using the slice() function; here the end index 7 is exclusive.
    assert(slice(t, 4, 7) == "bye");

    // Two more useful built-in functions are startsWith() and endsWith().
    assert(startsWith(t, "good"));
    assert(endsWith(t, "."));

Other basic text operations (searching, replacing, splitting, changing case and Unicode normalization) are defined in the library module morfa.Text.base. More details can be found in the documentation of the Morfa library.
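
The slicing and comparison rules in the snippet above can be cross-checked outside Morfa. The following is a minimal Python sketch, not Morfa code: it assumes that Morfa's slice(t, 4, 7) is half-open (as the assertion in the snippet implies) and that its comparison operators order strings character by character, which is how Python compares str values by code point. The variable names are illustrative only.

```python
# Python analogy of the Morfa snippet (illustration only, not Morfa code).

# Build "¡Hola mundo!" piece by piece; chr(0xA1) is the character '¡'.
greeting = chr(0xA1) + "Hola" + " mundo" + "!"

# Build "goodbye." with in-place concatenation, like Morfa's ~= operator.
t = "good"
t += "bye"
t += "."

# Half-open slice: indices 4, 5 and 6 are included, index 7 is not.
middle = t[4:7]

# Lexicographic comparison: a proper prefix sorts first, and 'M' (77) < 'm' (109).
prefix_first = "morfa" < "morfa!"
upper_first = "Morfa" < "morfa"
s_after_l = "short" > "long"

print(greeting, middle, prefix_first, upper_first, s_after_l)
```

Note that the ordering depends only on the first differing character, so "short" sorts after "long" regardless of length.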

Binary encoding and decoding of text values

To get a binary representation of a text value, use one of the built-in encodeXYZ properties:

    var t = "żółw";

    var utf8bytes: int8[] = t.encodeUtf8; 
    var utf16bytes: int16[] = t.encodeUtf16;
    var utf32bytes: int32[] = t.encodeUtf32;

    // This would cause a runtime error, since "żółw" contains non-ASCII chars.
    // var asciiBytes: int8[] = t.encodeAscii;
    // Same here, since 'ż' and 'ł' are outside the Latin-1 range:
    // var latin1Bytes: int8[] = t.encodeLatin1;
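
To make the sizes of these encodings concrete, here is a small Python sketch (an illustration, not Morfa code) that encodes the same word. The little-endian byte order and the helper name can_encode are assumptions made for this example; the tour does not state how Morfa lays out its int16[] and int32[] results.

```python
# Python illustration of the encodings behind encodeUtf8/encodeUtf16/encodeUtf32.
word = "żółw"

utf8_bytes = word.encode("utf-8")        # variable width: 1 to 4 bytes per character
utf16_bytes = word.encode("utf-16-le")   # byte order chosen arbitrarily for the demo
utf32_bytes = word.encode("utf-32-le")

# Group raw bytes into code units, mirroring array-of-int16 / array-of-int32 results.
utf16_units = [int.from_bytes(utf16_bytes[i:i + 2], "little")
               for i in range(0, len(utf16_bytes), 2)]
utf32_units = [int.from_bytes(utf32_bytes[i:i + 4], "little")
               for i in range(0, len(utf32_bytes), 4)]


def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: True if every character of text fits the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(len(utf8_bytes), len(utf16_units), len(utf32_units))
print(can_encode(word, "ascii"), can_encode(word, "latin-1"))
```

The three non-ASCII letters take two bytes each in UTF-8, while UTF-16 and UTF-32 need one code unit per letter for this word. Neither ASCII nor Latin-1 can represent 'ż' or 'ł', which is why the commented-out lines above would fail.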

Each of these properties has a variant that returns a pointer to a 0-terminated sequence of bytes, suitable for passing to a C function:

    import morfa.sys.c.stdio: puts;

    puts("¡Hola mundo!\n".encodeUtf8Z); 
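
What a zero-terminated variant hands to C can be pictured as the encoded bytes followed by one extra zero byte. The Python sketch below shows that layout under this assumption; encode_utf8_z is a hypothetical helper written for the illustration, and its rejection of embedded zero bytes is a design choice of the sketch, not documented Morfa behavior.

```python
def encode_utf8_z(text: str) -> bytes:
    """Hypothetical helper: UTF-8 bytes followed by a single terminating zero."""
    data = text.encode("utf-8")
    if 0 in data:
        # A C function would stop reading at the first zero byte.
        raise ValueError("embedded zero byte would truncate the C string")
    return data + b"\x00"


c_string = encode_utf8_z("¡Hola mundo!\n")
print(list(c_string))
```

A C function such as puts() reads bytes until it reaches that final zero, so the length never has to be passed separately.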

To go the other way round (from bytes to text), use the decodeXYZ() family of functions from the module morfa.Text.decoders.

    import morfa.Text.decoders: decodeAsciiZ;

    import morfa.sys.c.string: strcpy, strcat;
    import morfa.sys.c.stdlib: malloc, free;

    var hello: text = "Hello";
    var world: text = "world";
    var buffer: pointer<int8> = cast<pointer<int8>> malloc(20);

    // C strcpy() and strcat() take char* arguments, so we pass .encodeAsciiZ
    strcpy(buffer, hello.encodeAsciiZ);
    strcat(buffer, " ".encodeAsciiZ);
    strcat(buffer, world.encodeAsciiZ);

    // Now go back from C to Morfa 
    var result: text = decodeAsciiZ(buffer);
    assert(result == "Hello world");
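
The round trip above can be modeled without a C library. The following Python sketch imitates the buffer handling, assuming that decodeAsciiZ() reads up to the first zero byte; c_strcat and decode_ascii_z are hypothetical stand-ins written for this illustration, not Morfa or C APIs.

```python
def c_strcat(buffer: bytearray, source: bytes) -> None:
    """Mimic C strcat: write source at the first zero byte, then re-terminate."""
    start = buffer.index(0)
    payload = source + b"\x00"
    if start + len(payload) > len(buffer):
        raise ValueError("buffer too small")
    buffer[start:start + len(payload)] = payload


def decode_ascii_z(buffer: bytes) -> str:
    """Hypothetical stand-in: decode ASCII bytes up to the first zero byte."""
    end = buffer.find(b"\x00")
    if end < 0:
        raise ValueError("buffer is not zero-terminated")
    return buffer[:end].decode("ascii")


# A zero-filled buffer already holds an empty C string, so the first
# c_strcat call behaves like strcpy.
buffer = bytearray(20)
c_strcat(buffer, b"Hello")
c_strcat(buffer, b" ")
c_strcat(buffer, b"world")

result = decode_ascii_z(bytes(buffer))
print(result)
```

Unlike real strcat(), the sketch checks the remaining capacity before writing; in the Morfa example the 20-byte allocation is large enough for the 11 characters plus the terminator.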