PlayKids is a company that produces digital content for children. We work with educational videos, books and games, and in this article we'll focus on the technical optimizations made to our recently launched Crafty Lands game.
Crafty Lands (available on iOS and Android) was released as a standalone game after being featured for a while as an educational game inside the PlayKids app. It is a sandbox block game, in which children use block palettes to create virtual worlds. We'll briefly explain why we chose this mechanic, its technical challenges, and how we overcame some performance issues to address product churn.
Crafty Lands made its debut not as a standalone app but as a game inside PlayKids' app playground. Before its release, PlayKids' most played game was Coloring Book. The main difference between Coloring Book and the other games in the app was its replayability: as a sandbox game, it allowed children to fully explore their creativity, generating more and longer sessions.
While exploring other sandbox games, we realized that we had a block game engine prototype that could be used to create a new game targeted at children between the ages of 3 and 7.
Figure 1 — Crafty Lands Medieval world
Our bet paid off, and Crafty Lands was such a success that we decided to turn it into a standalone app so we could develop it even further.
Challenges and optimizations to the engine
At first glance, developing a block game seems like a trivial task, but scaling it to create big worlds can bring a naive implementation to its knees very quickly. We used Unity 3D as the platform for developing Crafty Lands, but since our optimizations mostly involve common rendering and low-level tricks, they're applicable in many situations.
Guess what the primitive for building everything in Crafty Lands is? If you thought of cubes, you're right! So, what's so special about cubes? Well, in computer graphics we can represent a cube using:
- 6 faces;
- 4 vertices per face;
- Each vertex contains:
  - a 3D vector representing its position (X, Y, Z);
  - a 3D vector representing its normal vector (X, Y, Z);
  - a 3D vector representing its color (R, G, B);
  - a 2D vector representing its texture coordinates (U, V);
- Each coordinate is a Unity float, which uses 32 bits (4 bytes).
Figure 2 — Cube representation (without normal vectors)
So each cube needs approximately 1 KB of memory (6 × 4 × (3 + 3 + 3 + 2) × 32 bits = 1,056 bytes), not accounting for Unity overheads.
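As a back-of-the-envelope check, the vertex layout and per-cube cost can be written out in C#. This is a sketch: the field names are ours, and Unity's Mesh class actually keeps each attribute (positions, normals, colors, UVs) in a separate array, though the arithmetic is the same.

```csharp
// One vertex of a cube face, laid out as described above.
struct BlockVertex
{
    public float Px, Py, Pz; // position: 3 × 4 bytes = 12 bytes
    public float Nx, Ny, Nz; // normal:   3 × 4 bytes = 12 bytes
    public float R, G, B;    // color:    3 × 4 bytes = 12 bytes
    public float U, V;       // UVs:      2 × 4 bytes =  8 bytes
}

static class CubeCost
{
    // 6 faces × 4 vertices × (3 + 3 + 3 + 2) floats × 4 bytes = 1,056 bytes.
    public const int BytesPerCube = 6 * 4 * (3 + 3 + 3 + 2) * 4;
}
```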
Crafty Lands' audience doesn't require very large worlds, so we went with worlds of 32 (height) × 192 (width) × 192 (depth) blocks, for a total of about 1.2 million blocks.
If we naively designed our data structure to represent these cubes, we would need about 1 GB of video RAM to store our cube information. For desktop computers or modern laptops this number is manageable, but on mobile devices, especially low-end Androids, we would certainly blow through the video RAM. Besides that, we would need to render 12 triangles per cube, resulting in about 12 million triangles per frame and, in the worst case, over a million draw calls: unfeasible numbers for a mobile device.
To tackle these problems and shrink these big numbers, we can be smarter and perform a form of occlusion culling, creating a simplified mesh that contains only the faces that are in contact with the “air”. Figure 3 illustrates the concept of the simplified mesh:
Figure 3 — Difference between meshes sent to the GPU: (a) using blocks (912 triangles); (b) Simplified mesh (39 triangles)
On the left side we have 76 cubes, which results in 912 triangles for the GPU to render. Our simplified mesh, containing only the “visible” faces (the ones in contact with the air), has only 39 triangles, and rendering it produces the same image for the user. This simplification drastically reduces our memory usage, but doesn't solve all of our problems.
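The culling rule is simple enough to sketch: walk every solid block and count a face only when the neighboring cell is air (or outside the world). This is our own illustrative version, not the game's code, and it uses a multidimensional array purely for brevity:

```csharp
static class MeshBuilder
{
    // The six axis-aligned neighbors of a cell.
    static readonly (int dx, int dy, int dz)[] Neighbors =
    {
        (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)
    };

    // Counts the faces that would actually be sent to the GPU.
    public static int CountVisibleFaces(bool[,,] solid)
    {
        int w = solid.GetLength(0), h = solid.GetLength(1), d = solid.GetLength(2);
        int faces = 0;
        for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
        for (int z = 0; z < d; z++)
        {
            if (!solid[x, y, z]) continue;
            foreach (var (dx, dy, dz) in Neighbors)
            {
                int nx = x + dx, ny = y + dy, nz = z + dz;
                bool neighborIsAir = nx < 0 || ny < 0 || nz < 0 ||
                                     nx >= w || ny >= h || nz >= d ||
                                     !solid[nx, ny, nz];
                if (neighborIsAir) faces++; // only this face reaches the mesh
            }
        }
        return faces;
    }
}
```

A single isolated block yields 6 faces; a filled 2 × 2 × 2 cube yields 24, not the 48 that rendering each block individually would produce.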
We still have issues with draw calls and GPU data bandwidth. Since our world is dynamic, if we created a single simplified mesh for the whole world, we would still push a behemoth of data to the GPU every time it changed, consuming a lot of GPU/bus bandwidth, and the mesh would be too large to be grouped into a single draw call. CPU stalls would frequently occur, as displayed in Figure 4:
Figure 4 — CPU stall when user modifies the world by placing/removing a block
This problem was solved by grouping the meshes into chunks of 8 × 8 × 8 blocks. This value was obtained empirically by testing on low-end devices, and represents the optimal trade-off between the number of draw calls and the number of vertices sent to the GPU when an edit is performed.
Figure 5 — Crafty Lands world division in chunks
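A hedged sketch of why chunking keeps edits cheap (all names here are ours, not the game's actual code): changing one block only marks its 8 × 8 × 8 chunk dirty, so only that chunk's mesh needs to be rebuilt and re-sent to the GPU.

```csharp
using System.Collections.Generic;

class ChunkedWorld
{
    public const int ChunkSize = 8; // chosen empirically, and a power of 2

    // Chunks whose meshes must be regenerated.
    readonly HashSet<(int cx, int cy, int cz)> _dirtyChunks =
        new HashSet<(int, int, int)>();

    public void SetBlock(int wx, int wy, int wz /*, Block block */)
    {
        // ... update the block data here ...
        _dirtyChunks.Add((wx / ChunkSize, wy / ChunkSize, wz / ChunkSize));
    }

    // Number of chunk meshes to rebuild this frame.
    public int DirtyChunkCount => _dirtyChunks.Count;
}
```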
Low-level optimizations
Previously, we tackled the cube rendering problem, which is mainly GPU bound. We also need a good data structure to represent the world in RAM, containing only essential block information such as type, light (our lighting model will be addressed separately in another article) and some extra info needed to optimize block creation and removal operations. The block data structure is shown in Figure 6:
Figure 6 — Block class: no clutter
To avoid using too much memory, we used the smallest types that sufficed, such as byte and short. A traditional object-oriented approach would also store information such as block state and block UV texture mapping (we use a texture atlas to reduce draw calls), but we ditched this data from our Block class and created separate helper functions to derive it, as illustrated in Figure 7:
Figure 7 — Helper functions inside Block class
Using this approach, we kept a small memory footprint, reducing our memory usage even further.
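A hedged reconstruction of the idea in Figures 6 and 7 (the real class differs): keep only the smallest fields needed, and derive everything else through helper functions instead of storing it. The `_wx`/`_wy`/`_wz` names follow the article; the rest, including the assumption that type 0 means “air”, is ours.

```csharp
class Block
{
    public byte Type;            // block type id (assumption: 0 means "air")
    public byte Light;           // light level
    public short _wx, _wy, _wz;  // world position, kept for index conversions

    // Derived information lives in helpers, not in extra stored fields.
    public bool IsSolid() => Type != 0;
}
```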
In our world data structure, we use chunks to index our blocks, and the chunks are indexed in a World class. When creating, loading, saving or performing block operations on the world, we can't directly access our blocks through the extra variables (_wx, _wy, _wz) in the Block class, so we need to perform an index conversion, which involves dividing the indexes by the chunk size. If we model our chunk size as a power of 2, we can swap the division for a bitwise shift, the latter being much, much faster. When dealing with this many blocks, low-level optimizations like these can bring huge performance gains, so we made our chunk size a power of 2. Figure 8 illustrates a trivial example of converting a block's world position to its chunk index using a shift operation instead of a division.
Figure 8 — Swapping division operations for bitwise shifts
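The same idea in isolation: with an 8-block chunk (8 = 2³), dividing a world coordinate by the chunk size is equivalent to shifting it right by 3 bits. The names are ours, and the equivalence assumes non-negative world coordinates, as in our worlds.

```csharp
static class ChunkIndex
{
    const int ChunkShift = 3;               // log2(ChunkSize)
    const int ChunkSize = 1 << ChunkShift;  // 8

    // Both return the chunk index of a block's world coordinate.
    public static int WithDivision(int worldCoord) => worldCoord / ChunkSize;
    public static int WithShift(int worldCoord) => worldCoord >> ChunkShift;
}
```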
Figure 9 shows a table of arithmetic operation timings, which can be used to estimate how much slower division is compared to other operations:
Figure 9 — Table with average arithmetic instruction execution times in C#
A shift operation is approximately 16 times faster than a division. When creating or saving our world, where every block needs to be accessed, this simple change led to a 27.5% faster load time. Operations for removing or creating blocks also benefit from this change, especially on low-end Android devices, but the effect is more pronounced in loading/saving operations.
Another interesting result concerns jagged arrays vs. multidimensional arrays in C#. The two types are declared like this:
- Jagged array: int[][] array
- Multidimensional array: int[,] array
The latter is considered more “elegant”, but Figure 10 shows that the resulting assembly code for the multidimensional array version introduces a call instruction. The cost of that call isn't negligible when we're iterating over such a large number of blocks. Using jagged arrays instead of multidimensional arrays resulted in another 19.3% faster load time.
Figure 10 — Unity-generated Assembly code comparing multidimensional arrays vs jagged arrays
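To make the comparison concrete, here is a minimal sketch of both declarations (sizes and values are arbitrary). The jagged accesses compile to plain array loads, while the multidimensional access goes through a runtime helper call in the generated code.

```csharp
static class ArrayDemo
{
    public static (byte fromMulti, byte fromJagged) Demo(int size)
    {
        // Multidimensional: one object, clean syntax, but blocks[x, y, z]
        // becomes a method call in the generated code.
        var multi = new byte[size, size, size];
        multi[1, 2, 3] = 42;

        // Jagged: arrays of arrays, accessed with plain array loads.
        var jagged = new byte[size][][];
        for (int x = 0; x < size; x++)
        {
            jagged[x] = new byte[size][];
            for (int y = 0; y < size; y++)
                jagged[x][y] = new byte[size];
        }
        jagged[1][2][3] = 42;

        return (multi[1, 2, 3], jagged[1][2][3]);
    }
}
```

The trade-off is that the jagged version needs explicit allocation of every inner array, which is exactly the kind of readability cost the conclusion of this article talks about.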
Indexers in C# gave us an elegant way of hiding the block coordinate conversion, but they had a significant performance impact and had to be used sparingly. We removed unnecessary accesses like the ones illustrated in Figure 11:
Figure 11 — Swapping operation order to avoid unnecessary indexer accesses
The first implementation used three redundant references to the indexer, causing three unnecessary conversions. The improved implementations perform only one indexer access, which made the code more efficient.
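A hedged sketch of the pattern in Figure 11 (the World/LitBlock layout here is ours, sized for a single 8 × 8 × 8 chunk): the indexer performs the coordinate conversion on every access, so reading the same block three times costs three conversions, while caching the reference costs one.

```csharp
class LitBlock // stands in for the article's Block class
{
    public byte Type;
    public byte Light;
}

class ChunkWorld
{
    readonly LitBlock[] _blocks = new LitBlock[8 * 8 * 8];

    public ChunkWorld()
    {
        for (int i = 0; i < _blocks.Length; i++) _blocks[i] = new LitBlock();
    }

    // Indexer: converts (x, y, z) to a flat index on every access.
    public LitBlock this[int x, int y, int z] => _blocks[(x << 6) | (y << 3) | z];

    public void DimBefore(int x, int y, int z)
    {
        // Three indexer accesses -> three conversions.
        if (this[x, y, z].Type != 0 && this[x, y, z].Light > 0)
            this[x, y, z].Light--;
    }

    public void DimAfter(int x, int y, int z)
    {
        // One indexer access -> one conversion. LitBlock is a reference
        // type, so mutating the local reference mutates the stored block.
        var block = this[x, y, z];
        if (block.Type != 0 && block.Light > 0)
            block.Light--;
    }
}
```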
On critical code paths we also removed all method variable allocations, avoiding performance hiccups due to garbage collection and, in the case of structs, unnecessary costly allocations. Figure 12 shows an example of one of these changes: we replaced the method variable with a static class variable, which is initialized only once in the entire game lifecycle.
Figure 12 — Using static variable instead of allocating variable every method call
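A hedged sketch of the pattern (names and sizes are ours): a scratch buffer that used to be allocated inside the method on every call becomes a static field, allocated once for the whole game lifecycle, so rebuilds never feed the garbage collector.

```csharp
static class MeshScratch
{
    // Worst case for one 8×8×8 chunk: every block exposes 6 faces × 4 vertices.
    public const int MaxVertices = 8 * 8 * 8 * 6 * 4;

    // Before: "var positions = new float[MaxVertices * 3];" inside the
    // rebuild method, allocated (and later collected) on every call.
    // After: allocated once and reused by every rebuild.
    public static readonly float[] Positions = new float[MaxVertices * 3];
}
```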
Dictionaries and generics were the last optimization bastion we addressed up to this point. Dictionaries have O(1) access complexity, though each access still requires computing a hash function, which commonly involves the aforementioned, feared division instruction. We replaced this dictionary access with something faster: a switch block. This way, instead of computing a hash, we end up executing a simple jump instruction, which is much cheaper. Figure 13 shows the difference between using a dictionary and using a switch block:
Figure 13 — Swapping a dictionary with a switch block
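A minimal sketch of the swap, with made-up block ids and texture-atlas tile indices:

```csharp
static class AtlasLookup
{
    // Before: a Dictionary<byte, int> mapping block type to atlas tile.
    // Every access hashes the key, which involves an integer
    // division/modulo under the hood.

    // After: a switch, which the compiler can turn into a jump table.
    public static int TileFor(byte blockType)
    {
        switch (blockType)
        {
            case 1: return 0;  // grass
            case 2: return 1;  // stone
            case 3: return 2;  // water
            default: return -1; // unknown type
        }
    }
}
```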
Generics removal was done based on empirical measurements, since with generics the assembly code contained a few more stack accesses than its non-generic counterpart. We also replaced the getter function used to access our singleton with a static variable. Getters are expected to be inlined and very efficient, but our measurements showed better performance with the static variable.
Figure 14 — Removing generic usage and getter for commonly used types
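A hedged sketch of the singleton change (the names are ours, and it is the article's measurements, not general advice, that motivated it):

```csharp
class WorldManager
{
    // Before: a property getter, expected to be inlined but measured
    // slower in our builds:
    //   static WorldManager _instance = new WorldManager();
    //   public static WorldManager Instance => _instance;

    // After: a plain static field, accessed with a single load.
    public static readonly WorldManager Instance = new WorldManager();

    public int LoadedChunks;
}
```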
Together, these other optimizations made loading a level 11.3% faster:
- Removing unnecessary indexer references;
- Changing local method variable allocations to static class variables;
- Removing dictionary accesses;
- Removing generics;
- Replacing getter usage with a static instance.
Modern smartphones have so much computing power that performance-critical code is often relegated to ostracism and rarely necessary. Nevertheless, some games can still bring even the most powerful mobile devices to their knees if we recklessly abuse language facilities. We know that readable code is generally preferred over performant code, so most of the problems we faced in this project were created by following best coding practices, which didn't deliver the necessary performance.
There were other optimizations in the project, but we chose to present only the most common ones regarding rendering, along with low-level tricks used in ancient times by the game programming gurus. Some, like the optimizations in mesh/collider creation, resulted in huge gains, but are too Unity 3D specific and will be addressed in another article. We used load time for our benchmarks because it is the worst-case operation, since all blocks are created at the same time. Reducing load time had a direct impact on our product retention:
- Version 1: average load time of 51 s; 48% of users churned on D0;
- Version 2: average load time of 8 s; D0 churn reduced to 25% of all users;
- Version 3: average load time of 2 s; D0 churn reduced to 12% of all users.
These optimizations were essential to the product's success, bringing the churn rate down from a staggering 48% to 12%.
All that being said, we are still advocates of “make it work first, optimize it later”. Also, the best optimizer available is our brain: we only resorted to low-level optimization tricks after ensuring that the algorithms and data structures we were using couldn't be swapped for something better. The optimization step was performed only after we launched the product to the public and saw a direct impact on user churn rate, avoiding unnecessary losses in code readability and the risk of introducing bugs.