Multithreading

The Task system is extremely performant and mature; however, game logic is usually sequential in nature (changes to an entity/object often depend on others of similar composition), so performance gains are not linear. It is highly recommended to start development using the regular single-threaded approach and only evaluate multithreading later in development, keeping the caveats presented in this document in mind.

Introduction

Quantum's Task system allows for both data parallelism (processing a big batch of the SAME TYPE of data) and task parallelism (two different pieces of logic operating on entirely distinct data):

For example:

  • data parallelism: a single task running the same logic on multiple entities from a single filter in parallel.
  • task parallelism: two tasks executing logic on distinct and completely different groups of entities.

The Quantum Task system favors performance over safety. As a result, developers are expected to be fully aware of which tasks may run in parallel, which pieces of data those tasks may touch safely, and which accesses require additional interlocked operations.

The Task system is being used by core systems, such as Physics and Navigation, and by some customers in their own game systems.

Thread Safety

The Task system only ensures the thread-safety of the dependency graph's execution. If task B depends on task A, directly or indirectly, B will never run concurrently with A. No extra measures are taken by the Task system itself to prevent the tasks' own logic from running into race conditions. Thus, the thread-safety of the simulation must be ensured both by the careful construction of the dependency graph and, often, by the use of additional Interlocked/CAS operations, especially when performing data parallelization.

The lack of additional built-in safety mechanisms, such as semaphores, is due to their performance penalty, which would cancel out the gains from parallelization.
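
As an illustration of the kind of interlocked operation mentioned above, the sketch below shows a hypothetical piece of task logic that accumulates into a counter shared between tasks that may run in parallel. The names (such as AccumulateDamage) are purely illustrative and not part of the Quantum API; only System.Threading.Interlocked is used.

C#

using System.Threading;

public static unsafe class InterlockedAccumulationSketch
{
  // Hypothetical shared counter written by tasks that may run concurrently.
  // A plain "*total += amount" would be a read-modify-write race condition;
  // Interlocked.Add performs the same update atomically.
  public static void AccumulateDamage(int* total, int amount)
  {
    Interlocked.Add(ref *total, amount);
  }
}

Because integer addition is commutative, the final value does not depend on the order in which the threads perform their additions, which keeps the result deterministic.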

Thread Performance

In between tasks, Quantum uses a custom flavor of spin-locks to switch threads very efficiently.

However, to sleep and wake up threads, Quantum hooks into the entry point of Unity's Jobs sub-system. This is especially relevant on Android, where there are unresolved issues with Unity's jobs: the wake-up/sleep waits can be spiky on the Unity side itself, to the point where it can be preferable to avoid multithreading altogether on Android.

Given the fluid situation, it is highly recommended to test the performance results on the target devices and platforms.

Determinism

When running Tasks concurrently in different threads, two things need to be taken into consideration:

  1. the Tasks need to be thread-safe; and,
  2. the order in which each thread executes its workload must not affect the deterministic result of the Tasks' execution.

If thread 1 is executing some logic on entity A and thread 2 is executing some (identical or different) logic on entity B concurrently, they have to produce the same deterministic changes to the game state on this and all other clients' machines regardless of the order in which they effectively execute their Tasks. If an order of execution is required to keep the deterministic nature of the simulation, this order must be enforced through the task dependency graph.

Tasks

A Task is a piece of code that executes some logic over a set of data. A regular SystemMainThread system, for instance, is translated into a single Task whose Update() method (the piece of logic) can only be executed by the main thread. SystemMainThreadFilter, on the other hand, also defines the data over which the logic will be executed: all entities having components that match the filter.

With the Task system, it is possible to parallelize:

  • Data: the same task runs in multiple threads and each thread will execute the task over a different subset of the data;
  • Logic: multiple tasks can run in different threads concurrently.

For instance, there could be three threads 0 (main), 1 and 2 executing task A over different subsets of entities in a component buffer (data parallelization) while another thread 3 is executing task B over a different component buffer (logic parallelization).

Types

There are 4 types of tasks:

  1. Main thread task: can only be executed by a single thread and that must be the main simulation thread (this is the Task also used by the regular SystemMainThread);
  2. Singleton task: can only be executed by a single thread which may or may not be the main thread;
  3. Threaded task: can be executed by multiple threads;
  4. Array task: can be executed by multiple threads over a data buffer whose size is known at schedule time. The array is sliced by the Task system and the chunks are consumed by the threads. There is an optional parameter to control the number of chunks; by default it is set to 32.

Quantum's multithreaded scheduler is work-stealing. Work-stealing happens at two levels:

  • On the Task-Graph level: threads will try and "steal" a task and execute it. For instance, a singleton task will be executed by the first thread that is able to request (steal) it (assuming that the task's dependencies are fulfilled).
  • On the Task-Chunk level: array tasks slice the buffer in chunks. Each chunk can be stolen by a thread. When a thread is done executing its logic over a chunk, it will try and "steal" the next available one until all chunks are taken. In this case, the Task system ensures that two threads will never grab the same chunk (a conceptual sketch follows this list).
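
The sketch below is a conceptual illustration (not the Task system's actual implementation) of how chunk consumption can be pictured: every worker thread atomically claims the next chunk index until the buffer is exhausted. All names used here are hypothetical.

C#

using System;
using System.Threading;

public class ChunkStealingSketch
{
  // Each thread claims the next chunk by atomically incrementing a shared counter,
  // so no two threads ever process the same chunk.
  public static void ConsumeChunks(ref int nextChunk, int chunkCount, Action<int> processChunk)
  {
    while (true)
    {
      // Interlocked.Increment returns the incremented value; subtract 1 to get the claimed index
      int chunkIndex = Interlocked.Increment(ref nextChunk) - 1;
      if (chunkIndex >= chunkCount)
      {
        break; // all chunks have been taken
      }
      processChunk(chunkIndex);
    }
  }
}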

Systems

Systems can schedule tasks. On the simpler SystemMainThread, this scheduling logic is not exposed and the Update() task is automatically scheduled behind the scenes. Systems are called to schedule their task(s) in the same order in which they are defined in the SystemSetup, and the tasks they schedule should form an acyclic graph, called the Task dependency graph.

The frame simulation, then, consists of the main thread and a number of worker threads going over this task graph and checking if the tasks can be executed; if the conditions are met, they execute them. A task can be executed by a given thread if its dependencies are fulfilled (tasks on which it depends are all completed) and it is allowed to run on that thread (some tasks can only be executed by the main thread or by a single thread).

Types

In addition to SystemBase, there are several built-in multithreaded systems which offer data parallelization. They can be found in the Quantum.Task namespace. Some of those systems are similar to their main-thread counterparts.

  1. SystemThreadedComponent<T> (Threaded task): the threads iterate over a component buffer whose size is not known at schedule time. Each thread acquires a slice of a given (configurable) size until the end of the buffer is reached. While iterating over a slice, the thread calls the Update() method that can be implemented by the inheriting system, passing a FrameThreadSafe, the EntityRef and a T* (see the sketch after this list).
  2. SystemThreadedFilter<T> (Threaded task): similar to (1), but, instead of iterating over a single component buffer, the threads will iterate over a filtered component set defined by the filter struct T, similarly to the SystemMainThreadFilter<T>.
  3. SystemArrayComponent<T> (Array task): similar to (1) but, instead of slicing the component buffer into N slices of a fixed size each, the buffer is sliced into a fixed number of slices whose size depends on the size of the buffer, which must be known at schedule time.
  4. SystemArrayFilter<T> (Array task): similar to (3), but the threads will iterate over a filtered component set defined by the filter struct T.
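
As a reference, here is a minimal sketch of a system inheriting from SystemThreadedComponent<T>, based on the Update() parameters described in (1). MyComponent and its Ticks field are hypothetical, and the exact override signature should be verified against the actual base class.

C#

namespace Quantum
{
  // assumption: a component declared in a .qtn file, e.g. "component MyComponent { Int32 Ticks; }"
  public unsafe class MyThreadedComponentSystem : SystemThreadedComponent<MyComponent>
  {
    // called from worker threads; each thread processes a different slice of the component buffer,
    // so writing to the component of the entity currently being processed is safe here
    public override void Update(FrameThreadSafe f, EntityRef entity, MyComponent* component)
    {
      component->Ticks += 1;
    }
  }
}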

FrameThreadSafe

All of these systems, as enforced by the Tasks API, will receive a FrameThreadSafe (FTS). FTS is effectively a wrapper of Frame that exposes a subset of the Frame API which is thread-safe and carries information about which thread is executing the task.

For instance, an FTS allows the use of GetPointer<T> to access a component, but forbids the use of Add<T> to add a new component to the entity, as the latter is not an operation that can be safely executed by multiple threads concurrently. However, the FTS by itself does not prevent multiple threads from calling GetPointer<T> on the same component, modifying the same field and running into race conditions. It is crucial to keep in mind which tasks are allowed to run concurrently by the dependency graph and which pieces of data they read from or write to at any given time during their execution, ensuring that they either never touch a piece of data that can be concurrently modified by another thread, or that interlocked/CAS operations are introduced to make such access thread-safe.
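
Where concurrent tasks must write to the same piece of data, a common CAS pattern is a compare-exchange retry loop. The sketch below, using only System.Threading.Interlocked, atomically keeps a hypothetical shared value at the maximum observed so far; it illustrates the pattern and is not part of the Quantum API.

C#

using System.Threading;

public static unsafe class CasMaxSketch
{
  // Atomically raises *sharedMax to at least 'candidate', even when several
  // threads call this concurrently (classic compare-and-swap retry loop).
  public static void AtomicMax(int* sharedMax, int candidate)
  {
    int observed = *sharedMax;
    while (candidate > observed)
    {
      int previous = Interlocked.CompareExchange(ref *sharedMax, candidate, observed);
      if (previous == observed)
      {
        return; // our candidate was swapped in successfully
      }
      observed = previous; // another thread changed the value; retry against the new one
    }
  }
}

Regardless of the interleaving, all threads converge on the same final maximum, so the pattern remains deterministic.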

Physics

The physics engine supports queries being performed from multiple threads; it is made available through FrameThreadSafe just like on a regular Frame.

IMPORTANT: It is NOT possible to run other tasks in parallel with the Physics systems.

All physics entries hold pointers to their physics-related components.

  • Modifications to the Collider or Body components can affect, or worse, invalidate, some assumptions made by the physics solver.
  • The Transform position and rotation data is cached for performance reasons and to support position offsets & compound shapes. Changes made after the data has been cached will not affect the ongoing physics step; however, making such changes while the physics tasks are running may break determinism and create race conditions.

Snippet

The snippet below shows the different task types and how they can be scheduled in a system inheriting from SystemBase. In the Schedule method, the TaskHandle of the last task of the previous system is made accessible. This allows building the system's sub-graph of tasks and returning its last task handle, which will then be used as a dependency by the following system.

C#

using Quantum.Task;

namespace Quantum
{
  public unsafe class TaskSystem : SystemBase
  {
    TaskDelegateHandle MainThreadHandle;
    TaskDelegateHandle SingletonHandleA;
    TaskDelegateHandle SingletonHandleB;
    TaskDelegateHandle ArrayHandle;
    TaskDelegateHandle ThreadedHandle;

    public override void OnInit(Frame f)
    {
      // hooking method delegates to Quantum task handles (description will show up in the profilers)
      f.Context.TaskContext.RegisterDelegate(DummySampleMethod, "Main Thread Task", ref MainThreadHandle);
      f.Context.TaskContext.RegisterDelegate(DummySampleMethod, "Singleton Task A", ref SingletonHandleA);
      f.Context.TaskContext.RegisterDelegate(DummySampleMethod, "Singleton Task B", ref SingletonHandleB);
      f.Context.TaskContext.RegisterDelegate(DummySampleMethod, "Array Task", ref ArrayHandle);
      f.Context.TaskContext.RegisterDelegate(DummySampleMethod, "Threaded Task", ref ThreadedHandle);
    }

    // Quantum's tasks are laid out in an Acyclic Graph, with dependencies set at will
    // Systems inserted into the default SystemSetup will be laid out as a SEQUENCE (a system always depends fully on the previous one to be finished)
    // this example schedules a graph of tasks for this system
    // IMPORTANT: do NOT perform actual work in the Schedule method... Any FRAME operations must happen in the actual task methods
    protected override TaskHandle Schedule(Frame f, TaskHandle taskHandle)
    {
      // this registers a task to run on main thread mandatorily (single call per update).
      var firstTask = f.Context.TaskContext.AddMainThreadTask(MainThreadHandle, null, taskHandle);

      // this one will be registered to run AFTER the one above (also a single call per update). Notice it depends on FIRST.
      var secondTaskA = f.Context.TaskContext.AddSingletonTask(SingletonHandleA, null, firstTask);

      // this one is registered to run IN PARALLEL to the one above. Notice it only depends on FIRST
      var secondTaskB = f.Context.TaskContext.AddSingletonTask(SingletonHandleB, null, firstTask);

      // now this one is an array task; it assumes you know IN ADVANCE the SIZE of the task data.
      // the data is split into smaller chunks to reduce overhead
      // notice it depends on both second tasks (A and B), so it only starts after those two are done
      var thirdTask = f.Context.TaskContext.AddArrayTask(ArrayHandle, null, 100);
      thirdTask.AddDependency(secondTaskA);
      thirdTask.AddDependency(secondTaskB);

      // now a general threaded task (dynamic data size), for which you need to use Interlocked/CAS operations to "consume" data and update the counters safely
      var fourthTask = f.Context.TaskContext.AddThreadedTask(ThreadedHandle, null, thirdTask);

      // we return the LAST task here, so next SYSTEM scheduler DEPENDS on it...
      return fourthTask;
    }

    public void DummySampleMethod(FrameThreadSafe frame, int start, int count, void* userData)
    {
      // start/count are only meaningful in an array task
      // userData is optional (mostly for internal use of tasks - in systems, keep data in the frame as much as possible)
    }
  }
}

Utility Structs

Two types of utility structs are available through the Quantum.Task namespace to help coordinate threaded tasks:

  • TaskSlice
  • AtomicInt

C#

// In a .qtn file
struct MyStructFoo {
    FPVector3 Content;
}

singleton component MyStructFooData {
    array<MyStructFoo>[128] MyStructFooArray;
    Int32 UsedCount;
}

C#

// The System
using Quantum.Task;

namespace Quantum
{
  public unsafe class MultithreadedSystemSample : SystemBase {
    private TaskDelegateHandle _arrayTaskDelegateHandle;
    private TaskDelegateHandle _threadedTaskDelegateHandle;

    public override void OnInit(Frame f) {
      f.Context.TaskContext.RegisterDelegate(ArrayTaskMethod, "Array Task", ref _arrayTaskDelegateHandle);
      f.Context.TaskContext.RegisterDelegate(ThreadedTaskMethod, "Threaded Task", ref _threadedTaskDelegateHandle);
    }
    
    protected override TaskHandle Schedule(Frame f, TaskHandle taskHandle) {
      // the size of an array task must be known at schedule time
      var arrayCount = f.Unsafe.GetPointerSingleton<MyStructFooData>()->UsedCount;
      var arrayTaskHandle = f.Context.TaskContext.AddArrayTask(_arrayTaskDelegateHandle, taskArg: null, taskSize: arrayCount, dependancy: taskHandle);
      
      // get indexer to be used for slicing the data in the threaded task
      // optional: instead of temp-allocating, the indexer could be persisted in a partial declaration of Frame (use AllocUser and FreeUser) 
      var threadedTaskIndexer = f.Context.TempAllocateAndClear(sizeof(AtomicInt));
      var threadedTaskHandle = f.Context.TaskContext.AddThreadedTask(_threadedTaskDelegateHandle, taskArg: threadedTaskIndexer, dependancy: arrayTaskHandle);
      
      // we return the LAST task here, so next SYSTEM scheduler DEPENDS on it...
      return threadedTaskHandle;
    }
    
    public void ArrayTaskMethod(FrameThreadSafe frame, int start, int count, void* taskArg) {
      var arrayOfFoo = frame.GetPointerSingleton<MyStructFooData>()->MyStructFooArray;
      
      // start/count are only meaningful in an array task
      for (int i = start; i < start + count; i++) {
        var foo = arrayOfFoo.GetPointer(i);
        // do work on foo ...
      }
    }
    
    public void ThreadedTaskMethod(FrameThreadSafe frame, int start, int count, void* userData) {
      // the task indexer was passed in as task argument
      var taskIndexer = (AtomicInt*)userData;
      
      // get array and its count
      var myStructFooData = frame.GetPointerSingleton<MyStructFooData>();
      var arrayOfFoo = myStructFooData->MyStructFooArray;
      var arrayCount = myStructFooData->UsedCount;
      
      // start/count are meaningless for a threaded task.
      // as the size of the data was NOT known at schedule time, it must be sliced now
      var slices = stackalloc TaskSlice[frame.Context.TaskContext.SlicePerThreadMaxCount];
      var slicer = frame.Context.TaskContext.SlicePerThread(taskIndexer, arrayCount, slices);
      
      while (slicer.Next(out var slice)) {
        for (var i = slice.StartInclusive; i < slice.EndExclusive; ++i) {
          var foo = arrayOfFoo.GetPointer(i);
          // do work on foo ...
        }
      }
    }
  }
}

Scheduling Guidelines

System scheduling has to follow two essential guidelines:

  1. The first task of a system should depend on the last task of the previous one (the TaskHandle received in the Schedule call).
  2. The system should return the Task Handle of its last task so the next system can depend on it.

In summary, only Tasks can run in parallel to each other; Systems themselves execute sequentially, with only one System running at any given time.

Scheduling Barrier

In some instances, the last few tasks of a multithreaded system might all depend on the same task handle. This may result in a race condition where the system is believed to have completed its execution while some of its tasks are still being executed. When this happens, the next system may start executing prematurely, which can result in non-deterministic operations / results.

To guard against this risk, it is recommended to add a "barrier" task. A barrier task is simply a singleton task which depends on all tasks that may run into the aforementioned race condition. This dependency ensures the barrier task will only run after all the previous tasks are completed, and the system can then return the barrier as its last task handle.

C#

protected override TaskHandle Schedule(Frame f, TaskHandle taskHandle) {
  var threadedTaskA = f.Context.TaskContext.AddThreadedTask(threadedTaskHandleA, null, taskHandle);
  var threadedTaskB = f.Context.TaskContext.AddThreadedTask(threadedTaskHandleB, null, taskHandle);

  // the barrier task does nothing but ensure that all other tasks of the system are done
  // before the next system's task(s) start executing
  var barrier = f.Context.TaskContext.AddSingletonTask(barrierHandle, null);
  barrier.AddDependency(threadedTaskA);
  barrier.AddDependency(threadedTaskB);

  return barrier;
}

Change Amount of Worker Threads per Platform

By default, the simulation will attempt to use a total number of threads equal to the SimulationConfig.ThreadCount setting, including the main-thread. During its update loop, the simulation will request the provided IDeterministicPlatformTaskRunner to schedule a number of delegates equal to the number of extra worker threads, i.e. SimulationConfig.ThreadCount - 1. The main-thread will always execute the simulation task graph, so, even if no delegates are scheduled, the simulation will be able to proceed with its update.

It is possible to effectively use a varying number of worker threads (or none at all) based, for instance, on the platform in which the application is running. This might be useful when targeting both mobile and PC, as PCs will usually have more fast-performing cores than mobile devices. In those cases, PC clients could use the full range of SimulationConfig.ThreadCount, while mobile devices could be further limited to a maximum of 2 threads, for instance, as a broad range of devices only has 2 fast cores.

In Unity, the default IDeterministicPlatformTaskRunner is implemented by QuantumTaskRunnerJobs. It uses Unity Jobs to spin up worker threads that will execute the simulation task graph in a work-stealing fashion alongside the main simulation thread.

By default, this Task Runner will clamp the number of effectively scheduled worker threads (in this case Unity Jobs) to the current JobsUtility.JobWorkerCount - 1. This ensures that the simulation will not starve Unity of Job threads while its update loop is running, which could potentially cause deadlocks.

It is possible to further customize and override this value by extending QuantumTaskRunnerJobs, as shown in the snippet below.

Snippet

C#

public class CustomQuantumTaskRunnerJobs : QuantumTaskRunnerJobs {
  private void Awake() {
    // force Custom mode to ensure that the Custom Count will be used
    quantumJobsMaxCountMode = QuantumJobsMaxCountMode.Custom;
  }

  protected override Int32 CustomQuantumJobsMaxCount {
    get {
#if !UNITY_EDITOR && UNITY_IOS
      // example: forcing a single-threaded simulation
      // (the main thread will always execute the task graph, this is the number of WORKER threads)
      return 0;
#elif !UNITY_EDITOR && UNITY_ANDROID
      // example: forcing a maximum of 1 worker thread
      return 1;
#else
      return DefaultQuantumJobsMaxCount;
#endif
    }
  }
}