Graphics.DrawMeshInstancedIndirect/Procedural

Bagoum
17 min read · Nov 5, 2019

Unity’s Graphics.DrawMeshInstancedIndirect (from here on out, just DMII) is an absolute necessity for making danmaku games in Unity. At the same time, nobody seems to know what it is or how it works. As one of the confused, I'm somewhat hesitant to publish this, but hopefully it can help future me as well as other Unity randos be a little less lost when working with this odd API.

As you can probably guess, this article is maximally technical and maximally Unity-specific.

Strap in: this is a long ride.

Note that while this piece is oriented towards 2D games, the coding pattern isn’t much different for 3D (although your shaders may be a bit more complex).

All the resources for this post can be found on this Github repo licensed under CC0 (effectively public domain).

This article was updated on 2022/06/30 to use float arrays over structured buffers, and to clarify support for legacy GPUs and WebGL.

0. What is DMII?

DMII is used when you have many renderable things using the same mesh and material with minor variations (color, position, rotation, size, etc). In practice, a “variation” is something that can be reduced to a small number of floats. In danmaku games, all projectiles of a single type share a mesh/material pair, and thus can use DMII. DMII allows you to render all of these projectiles with one draw call instead of several thousand. If you’re not using GameObjects, then DMII and its simpler sister, Graphics.DrawMeshInstanced, are the only practical ways to render a lot of things.

To use DMII, we call the function with a mesh, a material, and a MaterialPropertyBlock that contains all the variations we want to apply. We then have to read all the variations and apply them within the material's shader.

DMII is an example of GPU instancing, but it’s not the same kind of GPU instancing that people normally refer to. Most instancing you see on the internet or in default Unity stuff is built for automatic instancing, i.e. instancing the renderers on several hundred GameObjects. DMII is a more open framework that can be used without GameObjects or renderers, but on the flip side it requires you to provide everything that the renderer normally provides. Unless you’ve used DMII, you don’t know DMII’s variation of GPU instancing.

1. The Shader

Shaders are one of the least friendly aspects of Unity. Each shader is written in at least two languages and has a bunch of hardcoded requirements and incomprehensible boilerplate. This said, using DMII requires fiddling extensively with shaders.

As stated, DMII is quite different from other methods of GPU instancing. Shaders are the most egregious example of this. All you need to do to enable instancing on a normal shader is add a keyword. But DMII shaders differ in their basic coding style and functionality.

The shader code below is standard 2D sprite boilerplate, with DMII support added. I’ll go through the stuff unique to our DMII shader.

Shader "DMIIShader" {
Properties {
_MainTex("Texture", 2D) = "white" {}
}
SubShader {
Tags {
"Queue" = "Transparent"
}
Cull Off
Lighting Off
ZWrite Off
Blend SrcAlpha OneMinusSrcAlpha

Pass {
CGPROGRAM
#pragma vertex vert
#pragma fragment frag
#pragma multi_compile_instancing
#include "UnityCG.cginc"
#pragma instancing_options procedural:setup

struct vertex {
float4 loc : POSITION;
float2 uv : TEXCOORD0;
UNITY_VERTEX_INPUT_INSTANCE_ID
};
struct fragment {
float4 loc : SV_POSITION;
float2 uv : TEXCOORD0;
};

CBUFFER_START(MyData)
float4 posDirBuffer[7];
CBUFFER_END

#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
void setup() {
float2 position = posDirBuffer[unity_InstanceID].xy;
float2 direction = posDirBuffer[unity_InstanceID].zw;

unity_ObjectToWorld = float4x4(
direction.x, -direction.y, 0, position.x,
direction.y, direction.x, 0, position.y,
0, 0, 1, 0,
0, 0, 0, 1
);
}
#endif

sampler2D _MainTex;
float _FadeInT; //We'll use this later

fragment vert(vertex v) {
fragment f;
UNITY_SETUP_INSTANCE_ID(v);
f.loc = UnityObjectToClipPos(v.loc);
f.uv = v.uv;
//f.uv = TRANSFORM_TEX(v.uv, _MainTex);
return f;
}

float4 frag(fragment f) : SV_Target{
float4 c = tex2D(_MainTex, f.uv);
return c;
}
ENDCG
}
}
}

Here’s the step-by-step:

#pragma vertex vert
#pragma fragment frag
#pragma multi_compile_instancing
#include "UnityCG.cginc"
#pragma instancing_options procedural:setup

The above code declares our vertex, fragment, and instancing setup functions. The two key lines for instancing are #pragma multi_compile_instancing and #pragma instancing_options procedural:setup.

struct vertex {
    float4 loc : POSITION;
    float2 uv : TEXCOORD0;
    UNITY_VERTEX_INPUT_INSTANCE_ID
};

This is a pared-down vertex descriptor, with one extra feature: the strange line UNITY_VERTEX_INPUT_INSTANCE_ID. Declaring this is, to my knowledge, the only way to get the instance ID needed to read the per-instance data provided through MaterialPropertyBlock.

CBUFFER_START(MyData)
    float4 posDirBuffer[7];
    float timeBuffer[7];
CBUFFER_END

posDirBuffer and timeBuffer are arrays of data from MaterialPropertyBlock that we want to access in the shader. We can index into these arrays using unity_InstanceID (which only exists if we declare UNITY_VERTEX_INPUT_INSTANCE_ID). If we want to draw a thousand objects at different positions, we need a position array. If we want to draw a thousand objects with different rotations, we need a directions array. We can declare any arrays that we want here and read them in any way we want, as long as the script code actually provides them. However, the arrays must be either typed as float4[] or float[]. This means that, for efficiency, we should merge the position (a Vector2) and the direction (a Vector2) into one Vector4 before storing it in the array.
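On the C# side, that packing is just a matter of constructing one Vector4 per instance (a tiny sketch; the manager code in section 3 does this inline rather than through a helper):

//Pack position (xy) and direction (zw) into a single float4-sized entry.
Vector4 PackPosDir(Vector2 position, Vector2 direction) =>
    new Vector4(position.x, position.y, direction.x, direction.y);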

CBUFFER stands for “constant buffer”, a concept from Direct3D. Wrapping the declarations in CBUFFER_START/CBUFFER_END is not strictly required; if you do not do it yourself, Unity will automatically stick your variable declarations into a default constant buffer. However, constant buffers have a size limit (64 KB in Direct3D), so if you have lots of big arrays, it may be necessary to split them across separate constant buffers.

Note that we declared the arrays to have size 7. This is the batch size that we’ll be using for the C# code later. 1023 is the maximum batch size due to limits on the size of shader arrays, but some platforms (like WebGL) have problems with batch sizes greater than 511.

#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
void setup() {
    float2 position = posDirBuffer[unity_InstanceID].xy;
    float2 direction = posDirBuffer[unity_InstanceID].zw;

    unity_ObjectToWorld = float4x4(
        direction.x, -direction.y, 0, position.x,
        direction.y, direction.x, 0, position.y,
        0, 0, 1, 0,
        0, 0, 0, 1
    );
}
#endif

This is mostly boilerplate code. All it does is set up the Matrix4x4 for each instance so Unity knows where to render it. Normally, you don’t have to deal with this, but DMII forces you to manually construct the model matrix. Note how we pull the position and direction Vector2s out of the combined posDirBuffer.

Warning: Unity also serves a unity_WorldToObject matrix which is supposed to be the inverse of unity_ObjectToWorld, so we should technically rewrite that too. At the same time, WorldToObject isn't, to my knowledge, used anywhere in the default rendering path; it's only invoked by ObjSpaceLightDir and ObjSpaceViewDir. If you use those functions, you'll probably need to also invert this matrix.

The way you assign position and direction to the ObjectToWorld matrix may differ; in my case, I don’t use the Z-axis and therefore don’t add position.z (which would require float3 instead of float2). Also, I only use Z-rotation, which means that all my directions can be expressed as a single angle. I precalculate the direction Vector2 because my project has CPU code which requires direction as a normalized vector, but you could pass the angle as a single float and do the cos/sin calculation within the shader setup function (which you should do if possible; this kind of math is much cheaper on the GPU).

fragment vert(vertex v) {
    UNITY_SETUP_INSTANCE_ID(v);
    fragment f;
    f.loc = UnityObjectToClipPos(v.loc);
    f.uv = v.uv;
    //f.uv = TRANSFORM_TEX(v.uv, _MainTex);
    return f;
}

The only special line here is UNITY_SETUP_INSTANCE_ID(v). This allows us to access unity_InstanceID within the vertex shader, so we can query the data arrays and apply effects unrelated to position or direction. We'll add an effect like that later in this post.

TRANSFORM_TEX handles some texture features like tiling and offset that you can see in the material shader inspector. If you don't need these, you can simply copy the UV value.

Note: When using DMII, rendered objects will not automatically be sorted by z-axis. If you are using 2D sprites and you need z-axis sorting, you will need to handle this in your CPU code (see section 6 for details on how rendering order works). If you are using 3D models, you can set ZWrite to On in the shader and rely on the depth buffer for pseudo-sorting. You generally cannot use ZWrite/ZTest for 2D sprites since 2D sprites usually have transparent pixels (which is why they use Queue=Transparent), which cannot be handled by ZWrite/ZTest properly. The camera property TransparencySortMode has some interaction with DMII, but based on my testing, the interaction does not make any sense.
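For example, one simple CPU-side approach (a sketch; the z field is a hypothetical addition to the objects described in section 3) is to sort the backing array before filling the instancing batches:

//Sort back-to-front (assuming larger z = farther from camera) so nearer objects
//land later in the array, get higher instance IDs, and draw on top (see section 6).
System.Array.Sort(objects, (a, b) => b.z.CompareTo(a.z));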

2. Mesh and Material

In 2D, you generally work with sprites, so getting meshes may be a bit of a strange ask. The conversion ultimately isn’t too difficult. I find it convenient to store the mesh/material information together in a struct:

public readonly struct RenderInfo {
    private static readonly int MainTexPropertyId = Shader.PropertyToID("_MainTex");
    public readonly Mesh mesh;
    public readonly Material mat;

    public RenderInfo(Mesh m, Material material) {
        mesh = m;
        mat = material;
    }

    public static RenderInfo FromSprite(Material baseMaterial, Sprite s) {
        var renderMaterial = UnityEngine.Object.Instantiate(baseMaterial);
        renderMaterial.enableInstancing = true;
        renderMaterial.SetTexture(MainTexPropertyId, s.texture);
        Mesh m = new Mesh {
            vertices = s.vertices.Select(v => (Vector3)v).ToArray(),
            triangles = s.triangles.Select(t => (int)t).ToArray(),
            uv = s.uv
        };
        return new RenderInfo(m, renderMaterial);
    }
}

Note that this function creates a copy of the material. In my project, different object types use different textures, but use the same basic material, so I create one material in my Assets and duplicate it for each object type.
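If many object types share the same sprite, it may be worth caching one RenderInfo per sprite so the material copy and mesh construction happen only once (a small sketch, not from the original project; requires System.Collections.Generic):

private static readonly Dictionary<Sprite, RenderInfo> cache = new Dictionary<Sprite, RenderInfo>();

public static RenderInfo Cached(Material baseMaterial, Sprite s) {
    if (!cache.TryGetValue(s, out var ri)) {
        ri = FromSprite(baseMaterial, s);
        cache[s] = ri;
    }
    return ri;
}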

3. An Object Manager

3.1 An Object

DMII is usually used when your “objects” are some kind of code abstraction without a GameObject. Here’s an unimpressive class that describes some features of an object we want to render to screen.

public class FObject {
    private static readonly Random r = new Random();
    public Vector2 position;
    public readonly float scale;
    private readonly Vector2 velocity;
    public float rotation;
    private readonly float rotationRate;
    public float time;

    public FObject() {
        position = new Vector2((float)r.NextDouble() * 10f - 5f, (float)r.NextDouble() * 8f - 4f);
        velocity = new Vector2((float)r.NextDouble() * 0.4f - 0.2f, (float)r.NextDouble() * 0.4f - 0.2f);
        rotation = (float)r.NextDouble();
        rotationRate = (float)r.NextDouble() * 0.6f - 0.2f;
        scale = 0.6f + (float)r.NextDouble() * 0.8f;
        time = (float)r.NextDouble() * 6f;
    }

    public void DoUpdate(float dT) {
        position += velocity * dT;
        rotation += rotationRate * dT;
        time += dT;
    }
}

Note that we don’t define a sprite on the object. This is because the material texture is shared among all instances of a single DMII call.

Also note that we have an update function for this object that takes a deltaTime. This update function will be called by the object manager, which will query Time.deltaTime only once per frame for efficiency.

FObject here is an example, but it’s likely that you’ll have some kind of related setup. In my project, I store data in a linked list, where the nodes are (pooled) class objects that contain structs of data. Linked lists are useful if you need an ordered enumerable data structure that supports arbitrary removal. Regardless of whether you go for arrays or linked lists or whatever, your objects probably need a reference-type wrapper at some point, because mutable lists of structs are… not a good idea.
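As a sketch of that idea (the names here are hypothetical, not from the repo): a pooled, reference-type node that wraps a struct of per-object data and links to the next node (requires System.Collections.Generic):

//Hypothetical pooled linked-list node wrapping a struct of object data.
public class Node {
    public FObjectData data; //hypothetical struct holding position, velocity, etc.
    public Node next;
    private static readonly Stack<Node> pool = new Stack<Node>();
    public static Node Rent() => pool.Count > 0 ? pool.Pop() : new Node();
    public void Return() {
        next = null;
        pool.Push(this);
    }
}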

3.2 A Manager

The manager itself is a lot of boilerplate, but it’s all important boilerplate. Let’s step through writing a manager.

private static readonly int posDirPropertyId = Shader.PropertyToID("posDirBuffer");
private static readonly int timePropertyId = Shader.PropertyToID("timeBuffer");

private MaterialPropertyBlock pb;
private readonly Vector4[] posDirArr = new Vector4[batchSize];
private readonly float[] timeArr = new float[batchSize];
private const int batchSize = 7;
public int instanceCount;

public Sprite sprite;
public Material baseMaterial;
private RenderInfo ri;
public string layerRenderName;
private int layerRender;
private FObject[] objects;
...

The first boilerplate we need is the shader property ID of each of the data arrays we declared. I’ve requested a second array, timeBuffer, which doesn't yet exist in the shader. This is actually fine, as providing extra data to the shader won't break it.

Next, we need to declare a MaterialPropertyBlock object which passes information to the shader, as well as the arrays we're going to copy to the properties declared in the shader. As in the shader, we use a batch size of 7. 1023 is the maximum batch size due to limits on the size of shader arrays, but some platforms (like WebGL) have problems with batch sizes greater than 511.

While we can’t use sorting layers, we still have to render our objects to a specific camera culling layer. In a multi-camera setup, we can also use the culling layer to block rendering on cameras that don’t have a matching culling mask.

private void Start() {
    pb = new MaterialPropertyBlock();
    layerRender = LayerMask.NameToLayer(layerRenderName);
    ri = RenderInfo.FromSprite(baseMaterial, sprite);
    Camera.onPreCull += RenderMe;
    objects = new FObject[instanceCount];
    for (int ii = 0; ii < instanceCount; ++ii) {
        objects[ii] = new FObject();
    }
}

private void Update() {
    float dT = Time.deltaTime;
    for (int ii = 0; ii < instanceCount; ++ii) {
        objects[ii].DoUpdate(dT);
    }
}

Initialization isn’t too complicated. We initialize our rendering information and our objects, and attach our rendering function (below) to Camera.onPreCull, which is, to my knowledge, the standard place to do DMII work. Update is self-explanatory.
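One detail the listing above omits: since we subscribe RenderMe to Camera.onPreCull, we should also unsubscribe when this component is destroyed, or the static event will keep invoking a dead object:

private void OnDestroy() {
    Camera.onPreCull -= RenderMe;
}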

private void RenderMe(Camera c) {
    if (!Application.isPlaying) { return; }
    for (int done = 0; done < instanceCount; done += batchSize) {
        int run = Math.Min(instanceCount - done, batchSize);
        for (int batchInd = 0; batchInd < run; ++batchInd) {
            var obj = objects[done + batchInd];
            posDirArr[batchInd] = new Vector4(obj.position.x, obj.position.y,
                Mathf.Cos(obj.rotation) * obj.scale, Mathf.Sin(obj.rotation) * obj.scale);
            timeArr[batchInd] = obj.time;
        }
        pb.SetVectorArray(posDirPropertyId, posDirArr);
        pb.SetFloatArray(timePropertyId, timeArr);
        CallRender(c, run);
    }
}

Note: If you have multiple cameras, you probably only want to render your DMII calls on one of them. To handle this, first get the layer mask as mask = LayerMask.GetMask(layerRenderName), then skip irrelevant cameras by adding if ((c.cullingMask & mask) == 0) return; at the top of RenderMe.
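Put together, that check looks something like this (a sketch; the mask is cached once in Start):

private int cameraMask; //in Start: cameraMask = LayerMask.GetMask(layerRenderName);

private void RenderMe(Camera c) {
    if ((c.cullingMask & cameraMask) == 0) { return; }
    if (!Application.isPlaying) { return; }
    //...batching code as above...
}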

The rendering setup is fairly methodical. First, we iterate over our object instances in groups of up to batchSize. Within each group, we iteratively copy data from the objects into the data arrays. Then we copy the data arrays into the property block via SetVectorArray/SetFloatArray. Finally, we invoke the actual DMII call in a separate function (below).

private void CallRender(Camera c, int count) {
    Graphics.DrawMeshInstancedProcedural(ri.mesh, 0, ri.mat,
        bounds: new Bounds(Vector3.zero, Vector3.one * 1000f),
        count: count,
        properties: pb,
        castShadows: ShadowCastingMode.Off,
        receiveShadows: false,
        layer: layerRender,
        camera: c);
}

Current versions of Unity have a function called DrawMeshInstancedProcedural, which is basically the same as DrawMeshInstancedIndirect but is marginally easier to use. It requires a mesh, a submesh index (if you don’t know what that means, it’s 0), a material, a Bounds object that delineates the drawing space (in my testing this object doesn’t do anything), the number of things to draw, the MaterialPropertyBlock we modified before calling this function, some shadow information, the target camera layer, and the target camera (or null for all cameras).

With this, our basic model is complete, and we now can render a bunch of moving objects to screen with super efficiency.

4. Adding a Feature: Fade-In Time

Remember timeBuffer? Let's make some use of it by having the objects fade in over time.

First, we create a shader variable for the time over which the object should fade in. We use a shader variable instead of a buffer because this value is shared.

Properties {
    _MainTex("Texture", 2D) = "white" {}
    _FadeInT("Fade in time", Float) = 10 // New
}
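Since _FadeInT is a plain material property rather than per-instance data, it can be set in the material inspector, or once from C# if you prefer (a small sketch; the value here is arbitrary):

ri.mat.SetFloat("_FadeInT", 4f); //all instances share this fade-in duration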

Then, we declare timeBuffer along with posDirBuffer:

CBUFFER_START(MyData)
    float4 posDirBuffer[7];
    float timeBuffer[7]; // New
CBUFFER_END

Next, we need to decide whether to do our calculations in the vertex or fragment shader. We’ll do it in the fragment shader to show the extra boilerplate. If we want access to unity_InstanceID in the fragment shader, we need to add a few lines:

struct fragment {
    float4 loc : SV_POSITION;
    float2 uv : TEXCOORD0;
    UNITY_VERTEX_INPUT_INSTANCE_ID // New
};
fragment vert(vertex v) {
    fragment f;
    UNITY_SETUP_INSTANCE_ID(v);
    UNITY_TRANSFER_INSTANCE_ID(v, f); // New
    f.loc = UnityObjectToClipPos(v.loc);
    f.uv = v.uv;
    //f.uv = TRANSFORM_TEX(v.uv, _MainTex);
    return f;
}
float4 frag(fragment f) : SV_Target {
    UNITY_SETUP_INSTANCE_ID(f); // New
    float4 c = tex2D(_MainTex, f.uv);
    return c;
}

If you really want to know why it works like this, you can take a look at the source code for these macros.

Finally, we can do the actual fade-in. Once all the boilerplate is out of the way, this is remarkably simple. The “normal” way to do fade-in would be c.a *= smoothstep(0.0, _FadeInT, _Time.y). The only difference for instancing is that time is no longer a shader variable; we instead need to get it from the timeBuffer data array.

float _FadeInT; // New

float4 frag(fragment f) : SV_Target {
    UNITY_SETUP_INSTANCE_ID(f);
    float4 c = tex2D(_MainTex, f.uv);
    #if defined(UNITY_PROCEDURAL_INSTANCING_ENABLED) || defined(UNITY_INSTANCING_ENABLED) // New
    c.a *= smoothstep(0.0, _FadeInT, timeBuffer[unity_InstanceID]); // New
    #endif // New
    return c;
}

(Note that this is something that you really should do in the vertex shader; I only do it in the fragment shader here to show ID transferring.)

And here’s the result: moving, rotating sprites that fade in over time. Since we randomized the starting time, some of them start off somewhat opaque.

5. But What If My Computer Is From 2005?

If you have sharp eyes, you might have noticed that this code for time-based fade-in uses a different instancing check than the matrix setup function. The fade-in code checks for UNITY_PROCEDURAL_INSTANCING_ENABLED || UNITY_INSTANCING_ENABLED, whereas the setup function only checks for UNITY_PROCEDURAL_INSTANCING_ENABLED. This is because there are actually two forms of mesh instancing: DrawMeshInstancedIndirect/Procedural on one hand, and DrawMeshInstanced on the other. The difference between these two lies in how the position matrix is created.

  • For DrawMeshInstancedIndirect/Procedural, the position matrix for each element is created in the setup() function in the shader, which is run on the GPU. Calling either of these functions sets UNITY_PROCEDURAL_INSTANCING_ENABLED.
  • For DrawMeshInstanced, the position matrix for each element must be computed in C# (on the CPU) and provided to the function call as an array. Calling this function sets UNITY_INSTANCING_ENABLED.

DrawMeshInstancedIndirect/Procedural is faster since it offloads more work to the GPU. However, it is not supported on very old computers, and it is also not supported on WebGL as of the last time I checked. This means that we should aim to use DrawMeshInstancedIndirect/Procedural when available and fall back to DrawMeshInstanced otherwise. Therefore, in the shader, we only use the setup() function when UNITY_PROCEDURAL_INSTANCING_ENABLED is enabled, but any other instancing functionality should check for both flags.

  • You may be wondering why we need to check for the flags at all if the shader isn’t designed to work with non-instancing use cases. The problem is that references to constructs like unity_InstanceID will cause compilation errors if not enclosed within a flag check, and this may or may not cause compilation issues with your project at large.

It’s not difficult to support DrawMeshInstanced. First, we need an array to store position matrices in C#:

private readonly Vector4[] posDirArr = new Vector4[batchSize];
private readonly float[] timeArr = new float[batchSize];
private readonly Matrix4x4[] posMatrixArr = new Matrix4x4[batchSize]; //New

Then, we need to create the matrices in the RenderMe function. The matrix created is the same as in the shader setup() function.

private void RenderMe(Camera c) {
    if (!Application.isPlaying) { return; }
    for (int done = 0; done < instanceCount; done += batchSize) {
        int run = Math.Min(instanceCount - done, batchSize);
        for (int batchInd = 0; batchInd < run; ++batchInd) {
            var obj = objects[done + batchInd];
            //posDirArr[batchInd] = new Vector4(obj.position.x, obj.position.y,
            //    Mathf.Cos(obj.rotation) * obj.scale, Mathf.Sin(obj.rotation) * obj.scale);
            timeArr[batchInd] = obj.time;
            ref var m = ref posMatrixArr[batchInd];

            m.m00 = m.m11 = Mathf.Cos(obj.rotation) * obj.scale;
            m.m01 = -(m.m10 = Mathf.Sin(obj.rotation) * obj.scale);
            m.m22 = m.m33 = 1;
            m.m03 = obj.position.x;
            m.m13 = obj.position.y;
        }
        //pb.SetVectorArray(posDirPropertyId, posDirArr);
        pb.SetFloatArray(timePropertyId, timeArr);
        //CallRender(c, run);
        CallLegacyRender(c, run);
    }
}

Note that we don’t need the position+direction array if we’re using DrawMeshInstanced.

Finally, we can actually call DrawMeshInstanced as follows:

//Use this for legacy GPU support or WebGL support
private void CallLegacyRender(Camera c, int count) {
    Graphics.DrawMeshInstanced(ri.mesh, 0, ri.mat,
        posMatrixArr,
        count: count,
        properties: pb,
        castShadows: ShadowCastingMode.Off,
        receiveShadows: false,
        layer: layerRender,
        camera: c);
}

I don’t have a good solution for how to switch between these two functions at runtime other than asking the player if they can see the rendered objects properly. If you don’t want to support two implementations, or if you don’t need that much speed, feel free to use DrawMeshInstanced in all cases.
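If you do support both, the actual switch can be as simple as a flag checked at render time (a sketch; useLegacyRenderer is an assumed user-facing setting, and remember that the legacy path needs the matrix array filled while the procedural path needs posDirArr and SetVectorArray):

public bool useLegacyRenderer; //e.g. toggled from a settings menu

private void DispatchRender(Camera c, int count) {
    if (useLegacyRenderer)
        CallLegacyRender(c, count); //DrawMeshInstanced
    else
        CallRender(c, count); //DrawMeshInstancedProcedural
}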

In most cases, there’s no overhead to supporting both functions, but here’s one use-case that could cause issues. Let’s say you wanted to add some functionality to change the scale of objects based on how old they are. For example, let’s say you want objects to start with a size of 0 and then scale up to their full size over ten seconds. If you’re using DrawMeshInstancedIndirect/Procedural, you can efficiently do this in the setup function:

#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
void setup() {
    float2 position = posDirBuffer[unity_InstanceID].xy;
    float2 direction = posDirBuffer[unity_InstanceID].zw;
    direction *= smoothstep(0, 10, timeBuffer[unity_InstanceID]); //New

    unity_ObjectToWorld = float4x4(
        direction.x, -direction.y, 0, position.x,
        direction.y, direction.x, 0, position.y,
        0, 0, 1, 0,
        0, 0, 0, 1
    );
}
#endif

However, if you’re supporting DrawMeshInstanced, you also need to add code in the matrix computation in C#:

//Clone of HLSL smoothstep
private float Smoothstep(float low, float high, float t) {
    t = Mathf.Clamp01((t - low) / (high - low));
    return t * t * (3 - 2 * t);
}

private void RenderMe(Camera c) {
    if (!Application.isPlaying) { return; }
    for (int done = 0; done < instanceCount; done += batchSize) {
        int run = Math.Min(instanceCount - done, batchSize);
        for (int batchInd = 0; batchInd < run; ++batchInd) {
            var obj = objects[done + batchInd];
            posDirArr[batchInd] = new Vector4(obj.position.x, obj.position.y,
                Mathf.Cos(obj.rotation) * obj.scale, Mathf.Sin(obj.rotation) * obj.scale);
            timeArr[batchInd] = obj.time;
            ref var m = ref posMatrixArr[batchInd];

            var scale = obj.scale * Smoothstep(0, 10, obj.time); //New
            m.m00 = m.m11 = Mathf.Cos(obj.rotation) * scale; //Changed
            m.m01 = -(m.m10 = Mathf.Sin(obj.rotation) * scale); //Changed
            m.m22 = m.m33 = 1;
            m.m03 = obj.position.x;
            m.m13 = obj.position.y;
        }
        pb.SetVectorArray(posDirPropertyId, posDirArr);
        pb.SetFloatArray(timePropertyId, timeArr);
        //CallRender(c, run);
        CallLegacyRender(c, run);
    }
}

6. Annoying Details

To my knowledge, it’s not possible to assign a sorting layer to DMII calls. This means that you probably need separate cameras for your DMII calls, since you can’t sort them with non-DMII objects. In my setup, I have six (!) cameras: a ground layer camera, a “LowDirectRender” camera for DMII calls, a middle camera for most standard objects, a “HighDirectRender” camera for other DMII calls, a top camera for effects, high-priority objects, and post-processing, and a UI camera.

DMII has some important rules for ordering. First, DMII calls are ordered by render queue; materials with lower render queue values render first. Within the same render queue value, different materials are ordered according to their time of creation (the oldest materials are drawn on top). This is a really strange behavior, and you should absolutely circumvent it by making sure you never use DMII on two materials with the same render queue value. For multiple DMII calls on the same material, the calls are ordered by the call order in your scripts (thankfully). Within each DMII call, instances are drawn in increasing order of instance ID, starting from 0 (thankfully).
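One way to guarantee distinct render queue values is to bump each instantiated material's renderQueue when you create it, for example inside RenderInfo.FromSprite (a sketch; the counter scheme is arbitrary):

private static int queueCounter = 0;
//...after Instantiate(baseMaterial):
renderMaterial.renderQueue = baseMaterial.renderQueue + (queueCounter++);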

As an aside, you can support frame animation within a DMII shader, though it’s a bit awkward. If you can assume that all frames have a uniform size and a uniform time per frame, then you can do the following:

  • Store the sprites as a spritesheet with all sprites ordered left to right, and get any one frame and store this as a single sprite.
  • Set the mesh to have the size of one frame, but the texture of the entire spritesheet. You can do this by calling RenderInfo.FromSprite (from section 2) with the single sprite and then setting ri.mat.SetTexture("_MainTex", spritesheet.texture).
  • In the shader, add code as follows in the vertex shader, where _InvFrameT is 1/(time per frame), _Frames is the number of frames, and FT_FRAME_ANIM is a shader keyword activated when the material is using frame animation:
#ifdef FT_FRAME_ANIM
f.uv.x = (f.uv.x + trunc(fmod(timeBuffer[unity_InstanceID] * _InvFrameT, _Frames))) / _Frames;
#endif
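On the C# side, the matching material setup might look like this (a sketch; the keyword and property names match the snippet above, and the numbers are example values):

renderMaterial.EnableKeyword("FT_FRAME_ANIM");
renderMaterial.SetFloat("_InvFrameT", 1f / 0.08f); //0.08 seconds per frame (example)
renderMaterial.SetFloat("_Frames", 8f); //frame count in the spritesheet (example)
renderMaterial.SetTexture("_MainTex", spritesheet.texture);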

As another aside, I brought up two limits on batch size: 1023, which is the maximum length of an array in the shader, and 511, which is the maximum batch size if you are using DrawMeshInstanced (the legacy method discussed in section 5). If you are using exclusively DrawMeshInstancedIndirect/Procedural, it is possible to uncap the batch size by using StructuredBuffer instead of arrays. StructuredBuffer is more annoying to use, but doesn’t have a size limit. An older version of this document used StructuredBuffer, and the code for this project with StructuredBuffer can be found on Github.
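For reference, the StructuredBuffer route replaces the float arrays and SetVectorArray with a ComputeBuffer on the C# side; a minimal sketch of the API swap (assuming the shader declares StructuredBuffer&lt;float4&gt; posDirBuffer instead of the fixed-size array, and maxInstances is a cap you choose):

private ComputeBuffer posDirCompute; //created once, e.g. in Start:
//posDirCompute = new ComputeBuffer(maxInstances, sizeof(float) * 4);

//per frame, instead of pb.SetVectorArray:
posDirCompute.SetData(posDirArr, 0, 0, run);
pb.SetBuffer(posDirPropertyId, posDirCompute);

//and when the manager is destroyed:
//posDirCompute.Dispose();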

Conclusion

DMII is literally magic. I use it heavily and it’s super-effective. For example, consider this almost empty scene, which requires 10 rendering calls (for post-processing effects):

And here’s a scene with about 14000 moving circles (recolored from one sprite using the technique discussed in my first devlog), that requires an astounding 14 more draw calls:

All of these objects are pure code abstractions: no GameObjects! (I suspect that I should look into ECS soon…)

That’s all for this devlog. Again, all the resources for this post can be found on this Github repo licensed under CC0 (effectively public domain).
