Interlude: A data-oriented model

Published by
Aapo Alasuutari

Hello again! It hasn't been that long since I last blogged, and things are mostly as they were back then. A few meaningful changes have happened though: first, I am now on a two-week (paid!) vacation, intending to finish up the NLnet grant project before time runs out. Second, I have received a negative "your project is not eligible" decision for the main grant application that I was secretly banking on, which would've set me on a course to develop Nova full-time for the next two years.

So, put simply: I am now very temporarily working for the Internet yet again, after which I will return to Valmet Automation's sweet embrace. As such, it seems fitting to talk a little about the data-oriented design principles that underpin much of Nova, and how I've applied those principles in my day job as a TypeScript developer.

A trip down memory lane

This is not the first time I've talked about using data-oriented design in TypeScript/JavaScript. In fact, it is something I mentioned in the linked talk, and it is explicitly explained in the talk repository as the "Data Model".

The Data Model is described there as

A fundamental directed acyclic graph underpinning the flow of data from the automation system to the user interface. Found to often take ~10 MiB a pop.

and it is made up of node objects that contain four properties:

  1. kind: this determines the semantic meaning of a node.
  2. in: this determines the input nodes to this node by naming them in an Array. Both the order and any duplicates are significant here.
  3. out: this determines the output nodes of this node by naming them in a Set. Neither the order nor duplicates are significant here.
  4. data: this property's value depends on the kind and contains any extra data needed by the runtime semantics of the node.

These nodes are stored in a Map<NodeName, DataModelNode> and, in addition, there exists effectively a Map<NodeName, unknown> data storage hash map for storing the current runtime value of a given node. Updating the Data Model then means running each node's runtime semantics on its input nodes' current runtime values, and storing the resulting value as the node's new runtime value.

The node kinds given are const (which splits into two: actual constants and references), subscription, and function, for four kinds in total. Their runtime semantics and their associated extra data are as follows, with a sketch of the update dispatch after the list:

  1. Constant nodes, kind: "const": the node is a constant, has no extra data associated with it, and never has any input nodes. Updating a constant simply means assigning the new value as the node's new runtime value.
  2. Reference nodes, kind: "const": the node is a reference to some other node, has no extra data associated with it, and always has exactly one input node. Updating the node means reading the only input node's current runtime value and assigning it as the reference node's new runtime value.
  3. Subscription nodes, kind: "subscription": the node is a subscription into the automation network. Its extra data is a collection of parameters and options used to affect the subscription's exact semantics, and these nodes are known to always have one or two input nodes: the first input node contains the subscription address, and the second, optional one is a dynamic object of parameters. Updating a subscription node means unsubscribing the previous subscription address (if non-null), subscribing to the new address (as given by the first parameter node's runtime value), and setting the subscription node's current runtime value to null. When the subscription from the automation network responds with data, that data is set as the subscription node's current runtime value and an update is dispatched to its output nodes.
  4. Function nodes, kind: "function": the node is a function on its inputs. Its extra data is the function name (to be looked up from a function storage Map). Updating the node means reading the current runtime values of its input nodes, and running an actual JavaScript function with those values as the arguments. The result of the function is stored as the new runtime value of the function node.
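
To make the update semantics concrete, here is a minimal sketch of the per-node update dispatch, as promised above. All the names (nodes, runtimeValues, functionStorage, updateNode) are hypothetical stand-ins, and the subscription case is elided:

type NodeName = string;

declare const nodes: Map<NodeName, { kind: string; in: NodeName[]; out: Set<NodeName>; data: unknown }>;
declare const runtimeValues: Map<NodeName, unknown>;
declare const functionStorage: Map<string, (...args: unknown[]) => unknown>;

function updateNode(name: NodeName): void {
  const node = nodes.get(name);
  if (node === undefined) return;
  if (node.kind === "const" && node.in.length === 1) {
    // Reference node: copy the single input's current runtime value.
    runtimeValues.set(name, runtimeValues.get(node.in[0]));
  } else if (node.kind === "function") {
    // Look up the JavaScript function by its stored name and call it with
    // the input nodes' current runtime values as its arguments.
    const fn = functionStorage.get(node.data as string)!;
    runtimeValues.set(name, fn(...node.in.map((input) => runtimeValues.get(input))));
  }
  // Actual constants keep their assigned value, and subscriptions
  // re-subscribe against the automation network; both are elided here.
}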

The way these nodes are constructed is by, effectively, parsing a JavaScript-based domain-specific language (DSL) that looks something like this:

let tag = "LIC-100";
let address = combineStrings("/plant/", ref("tag"), "/isGood");
let isGood = negate(subscription(ref("address")));

The tag, address, and isGood are properties and their values are parsed as parts of the Data Model. tag's value "LIC-100" is parsed as just a constant, while address is parsed as a function node calling a function by the name of combineStrings with three parameters: the first one is a constant parameter "/plant/", the second is a reference node pointing to the property tag, and the third one is again a constant parameter with value "/isGood". Finally, the isGood property is parsed as a function node calling the function negate with the value of a subscription node that takes as its address a reference node pointing to the property address.
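
To make the result concrete, the parsed Data Model for the snippet above might look roughly like this. The generated parameter node names (address#0 and so on) are invented for illustration; only tag, address, and isGood are real property names:

const dataModel = new Map([
  ["tag", { kind: "const", in: [], out: new Set(["address#1"]), data: null }],
  ["address#0", { kind: "const", in: [], out: new Set(["address"]), data: null }], // constant "/plant/"
  ["address#1", { kind: "const", in: ["tag"], out: new Set(["address"]), data: null }], // ref("tag")
  ["address#2", { kind: "const", in: [], out: new Set(["address"]), data: null }], // constant "/isGood"
  ["address", { kind: "function", in: ["address#0", "address#1", "address#2"], out: new Set(["isGood#0"]), data: "combineStrings" }],
  ["isGood#0", { kind: "const", in: ["address"], out: new Set(["isGood#1"]), data: null }], // ref("address")
  ["isGood#1", { kind: "subscription", in: ["isGood#0"], out: new Set(["isGood"]), data: { /* options */ } }],
  ["isGood", { kind: "function", in: ["isGood#1"], out: new Set(), data: "negate" }],
]);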

At this point, I want to ask a question: do you think that the object-based node structure makes sense? Ponder for a moment: is this the kind of code that you'd write? Or do you see silliness that you know you'd never commit to?

I am not quite sure myself: by now, all of this code has been written or rewritten by me at some point, although I did originally inherit the basic structure. So obviously I thought this made sense at the time, but I'm not entirely sure I would write it this way anymore. At the very least it's clear to me that there are issues in this code, though they may not be dealbreakers depending on the use-case.

I am altering the deal...

The main issues in the existing implementation become quite clear when we look at it in detail. The very first issue is simply the memory usage: in Chrome each node object took up (3 + 4) * 4 bytes (3 words for the object header plus 4 inline properties), or 28 bytes. Add to that the 16 bytes needed by each of the in Array and the out Set and we're already at 60 bytes, or nearly a full cache line of data for a single node. Add in the out Set's backing memory allocation, which is performed even when the Set is empty and probably takes more than a full cache line on its own, and we're easily over two or even three cache lines of data. The total memory usage for an empty node is probably somewhere around 150 bytes.

But there are structural issues with the nodes as well. First, while nodes belonging to properties like tag or address can have references pointing to them, there is no way for a reference to refer to e.g. the "/plant/" constant parameter "inside" the address property's node graph: this means that we know that all "parameter" nodes must always have exactly one output, which is the node that they are a parameter of. This makes the out Set seem quite ridiculous indeed, with its large backing memory allocation used to store just a single node name string most of the time. Second, the number of inputs is often small and statically known (0 for constants, 1 for references, 1 or 2 for subscriptions); even for functions we know the number of inputs of a given function node during parsing, so we have no need for a dynamically resizable container to store the input names. This makes the in Array seem quite ridiculous as well.

Third, constant parameter nodes (like the "/plant/" string) do not really serve any purpose: we just want to know that they are constant parameter nodes, but the node object itself holds nothing of value to us. The output is never needed, as the constant parameter can never change (meaning that we never ask "what is the output of this constant parameter node?"), the in Array is known to be empty, and no extra data exists for constants. The only thing we're interested in is the current runtime value of the node, and that is stored in a separate Map.

Fourth, reference parameter nodes do not really serve any purpose either: instead of creating a separate node whose only purpose is to have an input pointing to e.g. tag, we could just as well remove that entire node and have the reference node's output (usually a function or subscription node) refer to tag directly.

The third and fourth issues I had already taken care of ages ago; constant and reference parameter nodes do not exist in the Data Model at all. The first and second points I hadn't fully realised yet, but I had plans...

... pray I don't alter it any further

I had actually seen some other issues as well. The kind field was a huge waste of memory, taking up an entire JavaScript Value (4 or 8 bytes depending on the engine) to store what amounted to 2 bits of information (one of 4 options). Likewise, the extra data for subscription nodes was horrendously inefficient: to optimise object shapes, it stored a set of JavaScript booleans in an object, with each boolean filled in with its default value if not explicitly defined in the source DSL. That meant using many tens of bytes to store what amounted to a few bits of data.

But even had I fixed all of these issues, the reality was still that our Data Models can get really big, too big. We're talking half a million to a million nodes per Data Model, and there is no exact limit to how many Data Models a user can have open at the same time. (Funny story, a particular customer had noticed a cool trick where they could sort of minimise parts of the UI and then use a double-click feature to bring it quickly back into view. This meant that they had tens of large Data Models running simultaneously, as opposed to the expected count of low single digits. Users are clever!)

At those numbers, just the object headers for a single Data Model's nodes add up to nearly 6 MiB. My bet for solving this issue was thus not to try to shrink the JavaScript node objects at all, but to remove them entirely! And this is where we get to the data-oriented design part of the blog post.

Lining it all up

The answer to all of this was obviously to take matters into my own hands using ArrayBuffers and TypedArrays. The kind field could easily fit into a Uint8Array, while the others seemed to be begging for a bit of a rethink.

I'm going to skip to the end here and just tell you what I did: the final result is that a single Data Model node is an index into three TypedArrays: the kindColumn, the outColumn, and the payloadColumn. These three form what could be called the "node table". Additionally, an extraDataColumn exists on the side, with a length dependent on the contents of the node table. In this transformation, the number of node kinds shot up from 3 (effectively 4) to 7, and they are now of course number values stored in a Uint8Array instead of strings like before. The kinds are as follows, with one possible numeric encoding sketched after the list:

  1. Constant node: same as before.
  2. Reference node: same as before, except now with a different kind value.
  3. Nullary function node: a function taking no parameters.
  4. Unary function node: a function taking one parameter.
  5. N-ary function node: a function taking two or more parameters.
  6. Subscription node: a subscription node with no non-boolean options (minTime / maxTime) and no dynamic parameters, i.e. it only has one input node.
  7. Parametrised subscription node: a subscription node with some non-boolean options or dynamic parameters. This has one or two input nodes.
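
For illustration, one possible numeric encoding of these kinds is sketched below; the concrete values are my own, not necessarily the real ones:

// Each value fits comfortably into one Uint8Array slot.
const enum NodeKind {
  Const = 0,
  Reference = 1,
  NullaryFunction = 2,
  UnaryFunction = 3,
  NaryFunction = 4,
  Subscription = 5,
  ParametrisedSubscription = 6,
}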

Each node has an out value (stored in the outColumn) which is a relative offset forwards in the node table pointing to the node's output node. If the relative offset is 0, then this node is a property node. In these cases, the node has extra data (like the incoming references to this property) available in a separate "property table" which I'm going to gloss over today.
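
In code, resolving a node's output from its relative offset might look like this (a sketch; getOutputIndex is my own name):

function getOutputIndex(outColumn: Uint8Array | Uint16Array | Uint32Array, i: number): number | null {
  const offset = outColumn[i];
  // A relative offset of 0 marks a property node: it has no output in the
  // node table, and its extra data lives in the separate property table.
  return offset === 0 ? null : i + offset;
}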

Finally, the payload value of each node (stored in the payloadColumn) depends on the kind of the node, but a common theme is that in most cases the payload is an index into some storage Array. They go like this:

  1. Constant node: the payload is an index into a global array of constant values.
  2. Reference node: the payload is an index into a global array of property names.
  3. Nullary and unary function nodes: the payload is an index into a global array of function names.
  4. N-ary function node: the payload is an index into the local extraDataColumn. The pointed-to index contains an index into the global array of function names, the index after that contains the number of inputs this function node has, and subsequent indexes contain relative offsets backwards in the node table pointing to each input node (see the decoding sketch after this list).
  5. Subscription node: the payload is a bitset of the boolean options of the subscription.
  6. Parametrised subscription node: the payload is an index into the local extraDataColumn. The pointed-to index contains the bitset of boolean options and bits indicating which of the minTime, maxTime, and two input parameter offsets are stored in subsequent indexes of the extra data.
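
As an example of what this encoding means in practice, here is a sketch of decoding an N-ary function node, as referenced in the list above. The column and array names follow the text, but readNaryFunctionNode itself is hypothetical:

declare const payloadColumn: Uint16Array | Uint32Array;
declare const extraDataColumn: Uint16Array | Uint32Array;
declare const functionNames: string[]; // global array of function names

function readNaryFunctionNode(i: number): { functionName: string; inputs: number[] } {
  const extra = payloadColumn[i];
  const functionName = functionNames[extraDataColumn[extra]];
  const inputCount = extraDataColumn[extra + 1];
  const inputs: number[] = [];
  for (let j = 0; j < inputCount; j++) {
    // Input offsets point backwards in the node table, relative to i.
    inputs.push(i - extraDataColumn[extra + 2 + j]);
  }
  return { functionName, inputs };
}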

If you've heard how Zig builds its compiler, this might sound very familiar, because it is very much the "encoding strategy" as named by Andrew Kelley. The kind field is used to store not just the "kind" of node we're dealing with but also some information about its data contents, which means that we can skip storing that information separately, simplifying the required storage format.

Now, the kindColumn is always a Uint8Array so each kind field costs 1 byte of memory, but I haven't yet given a concrete type for the outColumn and payloadColumn: this is because they do not have a guaranteed type. I'm taking advantage of the fact that these have fairly similar contents from one node to the next, and am thus eagerly allocating them using the smallest possible unsigned integer TypedArray that fits the current data: generally this means that outColumn is a Uint8Array, and payloadColumn is either a Uint16Array or a Uint32Array. As a result, a single "base node" is 6 bytes in size. Compared to the 60 bytes we started off with, we have cut the memory usage of a node by 10x, or more if we count the out Set's backing memory allocation.
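
The eager smallest-type allocation could be done with a helper along these lines (a sketch; it glosses over the fact that the real columns share a single ArrayBuffer):

function smallestUintArray(values: number[]): Uint8Array | Uint16Array | Uint32Array {
  // Pick the narrowest element type that can represent the largest value.
  const max = values.reduce((acc, v) => Math.max(acc, v), 0);
  if (max <= 0xff) return Uint8Array.from(values);
  if (max <= 0xffff) return Uint16Array.from(values);
  return Uint32Array.from(values);
}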

The "node table" has thus changed from this:

interface DataModelNode {
  kind: string;
  in: NodeName[];
  out: Set<NodeName>;
  data: unknown;
}
type NodeTable = Map<NodeName, DataModelNode>;

into this:

interface NodeTable {
  kindColumn: Uint8Array;
  outColumn: Uint8Array | Uint16Array | Uint32Array; // usually Uint8Array or Uint16Array
  payloadColumn: Uint8Array | Uint16Array | Uint32Array; // usually Uint16Array or Uint32Array
  extraDataColumn: Uint8Array | Uint16Array | Uint32Array; // usually Uint16Array or Uint32Array
}

The number of objects is cut from 1 + N * 3, where N is the number of nodes, to just 6 (counting the single ArrayBuffer shared between all the columns), no matter the number of nodes. (We actually make extraDataColumn null if it is empty, and we drop the node table entirely if it is empty, so the number of objects can even go to 5 or 0.) All in all, the memory usage seen in real usage went from more than 10 MiB to a bit over 1 MiB.

Get your ducks in a row

Okay, that sounds wonderful: should everything be written like this from now on? Well, yes and no. TypeScript isn't exactly the easiest language in which to do semi-manual memory management like this (maybe we can make it a little better, though?), so the code complexity downside on its own might make this whole thing untenable in the small. But the final nail in the coffin is that TypedArrays and ArrayBuffers are massive objects, in at least the V8 engine. If you have only a few objects, then the cost of a single TypedArray will overwhelm the cost of those objects. Only once you get into multiple tens of objects does the math change.

The code complexity, though, stays the same no matter how numerous your objects: this kind of code is and looks foreign. The best thing you can probably do is create helper classes that encapsulate the indexing behind APIs like getNodeKind and getFunctionName. Soon enough you'll find yourself weighing safety against performance: should getNodeKind explicitly throw if the passed-in index is out of bounds? Should getFunctionName check that the passed-in index really points to a function kind, or should it simply interpret the node payload as a function name index and read into the global function name array? In Rust the latter would be akin to accessing a union field without checking that the field is valid, which would make the calling function unsafe: do you start naming some functions unsafeGetFunctionName, or is that a bridge too far?
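
To give a flavour of that tension, here is a minimal sketch of such a helper; the class, its fields, and the naming are illustrative, not the production code:

class NodeTableReader {
  constructor(
    private kindColumn: Uint8Array,
    private payloadColumn: Uint8Array | Uint16Array | Uint32Array,
    private functionNames: string[],
  ) {}

  // Safety-first variant: explicit bounds check instead of silently
  // reading undefined from out of bounds.
  getNodeKind(index: number): number {
    if (index >= this.kindColumn.length) {
      throw new RangeError(`node index ${index} out of bounds`);
    }
    return this.kindColumn[index];
  }

  // Performance-first variant, "unsafe" in the Rust union-field sense:
  // interprets the payload as a function name index without checking
  // that the node actually is a function kind.
  unsafeGetFunctionName(index: number): string {
    return this.functionNames[this.payloadColumn[index]];
  }
}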

I've glossed over all of those complexities here, and for a good reason, I think: nobody wants to read 2000 lines of dense, unfamiliar TypeScript code. Just rest assured that the code exists, works, has been tested, is heading into production, and even achieves fairly good compile-time type safety to boot. It just isn't trivial. When I return to work in two weeks' time, I'll be returning to more of this same work; the Data Model is split into two parts: a static version that is created once and used as a template when instantiating dynamic Data Models, and the dynamic side. I've only done the static version so far (which also gives me some extra benefits and ease of implementation that I've taken advantage of here), and next up is the real deal: dealing with the actual, dynamic runtime Data Models.

But before that, it's back to Nova JavaScript engine and Rust for me. Thanks for reading, I'll see you on the other side.