The Apple M1, available in the MacBook Air, MacBook Pro 13”, and Mac Mini, has been the focus of a ton of benchmarking writeups and blog posts about the new chip. The performance overall, and especially performance/watt, that Apple has achieved with the chip is very impressive. As a ray tracing person, what caught my eye the most was the performance AnandTech reported in their CineBench benchmarks. These scores were 1.6x higher than I got on my old Haswell desktop and 2x higher than my new Tiger Lake laptop! I had also been interested in trying out the new ray tracing API for Metal that was announced at WWDC this year, which bears some resemblance to the DirectX, Vulkan, and OptiX GPU ray tracing APIs. So, I decided to pick up a Mac Mini to do some testing on my own interactive path tracing project, ChameleonRT, and to get it running on the new Metal ray tracing API. In this post, we’ll take a look at the new Metal ray tracing API to see how it lines up with DirectX, Vulkan, OptiX and Embree, then we’ll make some fair (and some extremely unfair) ray tracing performance comparisons against the M1.

ChameleonRT is an open source interactive path tracer that I’ve been working on to learn the different ray tracing APIs, and to provide an example or starting point for myself and others working with them in other projects. ChameleonRT provides backends for the GPU ray tracing APIs: DirectX Ray Tracing, Vulkan KHR Ray Tracing, and OptiX 7. Through the work discussed in this post, it also has a new Metal GPU backend. ChameleonRT also has an Embree backend for fast multi-threaded and SIMD accelerated ray tracing on CPUs through Embree, ISPC, and TBB. The rendering code in each backend is nearly identical and they produce almost pixel exact outputs, with some possible small differences due to subtle differences in the ray tracing libraries or shader languages.

ChameleonRT is different from popular ray tracing benchmark applications like CineBench, LuxMark, and Blender Benchmark, in that ChameleonRT is a minimal interactive path tracer. CineBench, LuxMark, and Blender Benchmark are excellent for getting a full picture of what performance to expect from a production film renderer, as they are production renderers. However, there’s a lot more going on in a production renderer than just ray tracing to support the kind of complex geometries, materials, and effects used in film; and the large codebases can be challenging to quickly port to a new architecture or API. ChameleonRT is the exact opposite, supporting just one geometry type, triangle meshes, and one material type, the Disney BRDF. The rest of the code is similarly written to achieve interactive path tracing performance, though I have tried to balance this with how complex the code is to read. For example, the renderers use iterative ray tracing instead of recursion, support only a simple sampling strategy, and don’t support alpha cut-out effects. The use of iterative ray tracing is valuable on both the CPU and GPU, but is especially important on the GPU, as the overhead of recursive ray tracing calls in the pipeline can have a significant performance impact. Similarly, ignoring alpha cut-out effects allows the GPU renderers to skip using an any hit shader (and the Embree one to skip needing an intersection filter). The any hit shader (or intersection filter) would be called during BVH traversal after a candidate intersection with a triangle is found. In the case that the BVH traversal is hardware accelerated, this ends up producing a large amount of back and forth between the fixed function traversal hardware and the shader cores to run the any hit shader during traversal, limiting the hardware’s performance.
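To make the iterative-versus-recursive distinction concrete, here is a minimal Python sketch. It is not ChameleonRT's code: it assumes a hypothetical toy scene in which every ray hits a surface with a fixed `EMISSION` and `ALBEDO`, so the two formulations can be compared directly. The iterative version replaces the nested calls with a single loop that carries a running path throughput.

```python
# Toy comparison of recursive vs. iterative path evaluation.
# Assumption: every ray hits a surface that emits EMISSION and reflects
# a fraction ALBEDO of the incoming light. This stands in for a real scene.
EMISSION = 0.5
ALBEDO = 0.8
MAX_DEPTH = 5


def radiance_recursive(depth=0):
    """Recursive form: each bounce is a nested call, like a recursive trace call."""
    if depth == MAX_DEPTH:
        return 0.0
    return EMISSION + ALBEDO * radiance_recursive(depth + 1)


def radiance_iterative():
    """Iterative form: one loop that carries the path throughput along the path."""
    radiance = 0.0
    throughput = 1.0
    for _ in range(MAX_DEPTH):
        radiance += throughput * EMISSION  # light picked up at this bounce
        throughput *= ALBEDO               # attenuates everything found later
    return radiance


if __name__ == "__main__":
    print(radiance_recursive(), radiance_iterative())
```

Both functions compute the same sum; the loop form just keeps the state in two local variables instead of a call stack, which is what makes it a better fit for a GPU shader.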

What’s a path tracer? Path tracing is a technique in computer graphics for rendering photo-realistic images by simulating the light transport in the scene. It is a Monte Carlo technique, meaning that it randomly samples light paths emitted from lights in the scene that reach the camera by bouncing off objects in the scene. This is done by tracing the paths in reverse, starting from the camera and tracing the path back through the scene. A large number of light paths need to be sampled to produce a noise-free image. Path tracing is not restricted to photo-realism, and is the core rendering algorithm used in major film renderers, such as: Pixar’s RenderMan, Disney’s Hyperion, Autodesk’s Arnold, and Weta’s Manuka. See Wikipedia for more.
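As a small illustration of what "Monte Carlo" means here, the sketch below estimates a simple hemisphere integral by random sampling. It is a stand-in for the much larger integral a path tracer evaluates, not part of ChameleonRT; the function name and sample counts are made up for the example. It integrates cos(theta) over the hemisphere, whose exact value is pi, using uniformly sampled directions.

```python
import math
import random


def estimate_cosine_integral(num_samples, seed=0):
    """Monte Carlo estimate of the integral of cos(theta) over the hemisphere.

    Directions are sampled uniformly over the hemisphere, so the sampling
    density is pdf = 1 / (2 * pi). For uniformly sampled hemisphere
    directions, cos(theta) is itself uniform in [0, 1). The exact answer is pi.
    """
    rng = random.Random(seed)
    pdf = 1.0 / (2.0 * math.pi)
    total = 0.0
    for _ in range(num_samples):
        cos_theta = rng.random()
        total += cos_theta / pdf  # estimator: integrand divided by pdf
    return total / num_samples


if __name__ == "__main__":
    for n in (16, 1024, 200000):
        print(n, estimate_cosine_integral(n))
```

With few samples the estimate is noisy, and it settles toward pi as the sample count grows. The same effect is why a path traced image starts out grainy and cleans up as more paths per pixel are accumulated.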

The new ray tracing API in Metal will be familiar to those who’ve used inline ray tracing in DirectX or Vulkan. For a good introduction to Metal’s API, check out the Discover ray tracing with Metal video from WWDC 2020, and the accompanying sample. Inline ray tracing allows applications to make ray tracing calls in any shader stage, e.g., fragment/pixel shaders, vertex shaders, compute shaders, etc. Inline ray tracing allows ray tracing effects, for example accurate reflections and shadows, to be integrated into the regular rasterization pipeline. Inline ray tracing can also be used from a compute shader to implement a standalone path tracing based renderer. ChameleonRT is a standalone path tracer, and takes the compute shader approach for Metal.

At a high level, ray tracing in Metal proceeds as follows: you upload your geometry data, build bottom-level primitive acceleration structures over them, then create instances referencing those acceleration structures and build a top-level acceleration structure over the instances. To render the scene you dispatch a compute shader and trace rays against the top-level acceleration structure to get back intersection results. The intersection results provide you with the intersected primitive and geometry IDs, and if using instancing, the instance ID. Your shader can then look up the geometry data for the object that was hit and shade it, after which it can continue tracing the path.
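The lookup step at the end of that flow can be sketched on the CPU in a few lines of Python. This is only an illustration of how the three IDs resolve to data; the `Geometry`, `Instance`, and `Hit` types and the `lookup_triangle` helper are hypothetical names for this example, not Metal API types or ChameleonRT code.

```python
from dataclasses import dataclass


@dataclass
class Geometry:
    vertices: list    # list of (x, y, z) tuples
    indices: list     # list of (i0, i1, i2) triangles
    material_id: int


@dataclass
class Instance:
    mesh_id: int      # which bottom-level structure (mesh) this instance references


@dataclass
class Hit:
    instance_id: int
    geometry_id: int
    primitive_id: int


def lookup_triangle(hit, instances, meshes):
    """Resolve a hit record to the triangle's vertices and its material ID."""
    mesh = meshes[instances[hit.instance_id].mesh_id]  # instance -> mesh
    geom = mesh[hit.geometry_id]                       # mesh -> geometry
    i0, i1, i2 = geom.indices[hit.primitive_id]        # geometry -> triangle
    return (geom.vertices[i0], geom.vertices[i1], geom.vertices[i2]), geom.material_id


# One mesh holding a single two-triangle quad, instanced twice.
quad = Geometry(
    vertices=[(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)],
    indices=[(0, 1, 2), (1, 3, 2)],
    material_id=7,
)
meshes = [[quad]]
instances = [Instance(mesh_id=0), Instance(mesh_id=0)]

triangle, material = lookup_triangle(Hit(1, 0, 1), instances, meshes)
print(triangle, material)
```

In a real shader the lists would be GPU buffers and the instance transform would also need to be applied, but the chain of indirections is the same.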

At this time Metal only supports inline ray tracing, in contrast to DirectX and Vulkan, which also provide support for ray tracing pipelines. Ray tracing pipelines are used to implement standalone ray tracing renderers, and are the approach ChameleonRT uses in its DirectX and Vulkan backends. OptiX only supports ray tracing pipelines. Ray tracing pipelines require the creation of a Shader Binding Table (SBT) to provide the API with a table of functions to call when specific objects in the scene are intersected. The SBT can be difficult to set up and debug, and is a common point of difficulty for developers learning these APIs. For more information about the SBT, check out my post “The RTX Shader Binding Table Three Ways”. However, the potential benefit of the SBT is that the GPU can reorder or group function calls to reduce thread divergence. With inline ray tracing, the developer must do this themselves, or do without (check out another video from WWDC20 for information here). Right now, ChameleonRT does not do any reordering to reduce divergence.
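For readers who haven't met the SBT before, the following Python sketch models it as a plain list of functions and shows how a hit is mapped to an entry. The index computation follows the DirectX Ray Tracing hit group addressing rule (ray contribution + geometry multiplier × geometry index + instance contribution); Vulkan and OptiX use analogous rules. The scene, the shader functions, and the `on_hit` helper are invented for the example.

```python
def hit_group_index(ray_contribution, geometry_multiplier, geometry_index,
                    instance_contribution):
    """Hit group record index, following the DirectX Ray Tracing addressing rule."""
    return ray_contribution + geometry_multiplier * geometry_index + instance_contribution


# Hypothetical scene: instance 0 has two geometries, instance 1 has one,
# and there is a single ray type, so the table holds three hit records.
def shade_floor():
    return "floor"


def shade_walls():
    return "walls"


def shade_statue():
    return "statue"


shader_table = [shade_floor, shade_walls, shade_statue]
instance_contribution = [0, 2]  # first table entry belonging to each instance


def on_hit(instance_id, geometry_index):
    """Pick and run the hit function for the intersected geometry."""
    index = hit_group_index(0, 1, geometry_index, instance_contribution[instance_id])
    return shader_table[index]()


if __name__ == "__main__":
    print(on_hit(0, 1), on_hit(1, 0))
```

Getting the per-instance offsets and the multiplier consistent with the table layout is where most of the setup and debugging effort tends to go.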

Those familiar with Metal may know of the previous Metal Performance Shaders ray intersector. The new support for inline ray tracing improves on this by allowing the renderer to move entirely to the GPU. The previous ray intersector worked by taking batches of rays, finding intersections in the scene, and writing these results back out to memory. Multiple compute shaders would need to be dispatched to trace the primary rays and create shadow and secondary rays, then trace the shadow rays and continue the secondary rays while filtering out paths that terminated. This requires significantly more memory traffic and compute dispatches, introducing overhead to the renderer. Such an approach is also not well suited to augmenting traditional rasterization pipelines with ray tracing effects. Approaches based on tracing and sorting rays and intersection information to extract coherence are valuable for complex film renderers, and can be implemented efficiently using inline ray tracing as well. For example, Disney’s Hyperion renderer uses sorting and batching extensively to extract coherent workloads from an otherwise incoherent and divergent distribution of rays. There’s also an excellent higher-level video explaining how this works.
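A very reduced sketch of the batching idea is to group rays by their dominant direction, so rays in one batch tend to walk similar parts of the BVH. The Python below does only that first step, under my own simplifying assumptions; it is not Hyperion's implementation, and the `direction_bin` and `bin_rays` names are hypothetical.

```python
def direction_bin(direction):
    """Return 0..5 for the dominant axis and sign: +x, -x, +y, -y, +z, -z."""
    axis = max(range(3), key=lambda i: abs(direction[i]))
    return 2 * axis + (1 if direction[axis] < 0.0 else 0)


def bin_rays(rays):
    """Group (origin, direction) rays into six bins by dominant direction."""
    bins = [[] for _ in range(6)]
    for ray in rays:
        bins[direction_bin(ray[1])].append(ray)
    return bins


if __name__ == "__main__":
    rays = [
        ((0.0, 0.0, 0.0), (0.9, 0.1, 0.0)),
        ((0.0, 0.0, 0.0), (0.1, -0.9, 0.2)),
        ((1.0, 0.0, 0.0), (0.8, 0.0, -0.3)),
    ]
    for i, batch in enumerate(bin_rays(rays)):
        print(i, len(batch))
```

A production renderer would go further, for example sorting within each bin and again sorting the hit points before shading, but each batch can then be traced in one dispatch.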

The Metal ray tracing API is quite nice to work with. It bears a lot of similarity to the other ray tracing APIs, but has been streamlined to be easier to use. For example, compare the code required to build and compact a BVH in DirectX or Vulkan with Metal. OptiX’s API provides a similar simplification, here’s the BVH build in OptiX for reference. It’s also nice to have templates and C++ style functionality in the shader language, a feature Metal shares with OptiX, which uses CUDA for the device side code.

This simplicity also has some drawbacks. For example, while in DirectX, Vulkan, and OptiX you have control over where the acceleration structure memory is allocated, Metal makes this allocation for you. As a result, you cannot allocate the acceleration structures on a MTLHeap, and so to make sure they’re available for your rendering pipeline, you must individually mark them as used in a loop instead of a single useHeap call.

This can add some overhead if you have a lot of bottom level BVHs, as can be the case in scenes with many instances. It’d be nice to be able to allocate the acceleration structures on a heap, and replace this loop with a single [command_encoder useHeap:data_heap->heap];

The API is also simplified significantly by not including ray tracing pipelines and only requiring a Shader Binding Table-esque structure when implementing custom geometries or other operations that need to happen during traversal (e.g., alpha cut-outs). Code for managing the SBT setup in DirectX, Vulkan, and OptiX makes up a significant portion of my helper code in each backend, and it’s nice to skip this in the Metal backend, which doesn’t implement custom geometry or alpha cut-out textures. However, in an SBT model the GPU could group or reorder function calls to reduce divergence, but this is less possible with inline ray tracing where this information isn’t available to the driver. Whether the current drivers for DirectX, Vulkan, and OptiX actually do this is a question for the driver teams, but in theory it’s possible. In the end you may find yourself implementing something a bit like the SBT but a bit simpler to look up the right data for the primitives/meshes/instances in your scene in argument buffers, as I’ve done in ChameleonRT; and it’s required to implement operations that take place during traversal (custom geometry, alpha cut-out transparency). Overall my comments here on things that could be a bit better are pretty minor.

What caught my interest the most about AnandTech’s CineBench results is that CineBench actually uses Embree for ray tracing, which is a library developed by Intel! Embree is a CPU ray tracing library that provides optimized acceleration structure traversal and primitive intersection kernels, filling a similar demand on the CPU as the new GPU ray tracing APIs do on the GPU. Embree was first released in 2011 and has found widespread adoption across film, scientific visualization, and other domains.

ChameleonRT also implements an Embree backend that makes use of Embree for fast ray traversal and primitive intersection, ISPC for SIMD programming, and TBB for multi-threading. Getting this backend running on the M1 Arm was actually a bit easier than I expected. Syoyo Fujita already maintains an AArch64/NEON port of Embree, Embree-aarch64, which got some updates from the Apple Developer Ecosystem Engineering team to add an AVX2 on NEON backend. Embree is pretty optimized for 8-wide SIMD, so even though this requires using 2 NEON vectors to act as one AVX2 vector, it can provide better performance than a 4-wide backend on NEON.

ISPC was the dependency I was most concerned about getting running. ISPC is a compiler for SIMD programming on CPUs, where you write a scalar program that is compiled to run in parallel on the CPU’s SIMD lanes, with one program instance per vector lane. Each program instance processes a different piece of data in parallel, executing in SIMD. It’s somewhat like GLSL or HLSL running on the CPU, where a scalar looking program is executed in parallel using SIMD. However, it turns out that ISPC already ships support for NEON and AArch64, and enabling a macOS + AArch64/NEON target turned out to be as easy as just making this combination of target OS and ISA flags not an error. I have a PR open to merge this support back in to ISPC.

Finally, while TBB is an Intel library for multithreading, it doesn’t have any ties to a specific CPU architecture. All that’s needed here is to build TBB from source targeting AArch64.

With ChameleonRT now running on the M1’s CPU and GPU, we’re ready to do some benchmarks and see how the M1 looks on ray traversal performance! I ran these tests on a few different systems that I have access to. On the CPU side, I have an i7-1165G7 in an XPS 13 (thermals can be an issue here), an i7-4790K in an older desktop, and an i9-9920X in an Ubuntu workstation. On the GPU side, I have an RTX 2070. I have the M1 with 16GB RAM, which is in a Mac Mini. Before we begin, here are the CineBench numbers that got me interested in checking out the M1:

| CPU | CineBench R23 Score | Score relative to M1 |
|---|---|---|
| i7-1165G7 | 4026 | 0.52x (M1 is ~1.93x faster) |
| i7-4790K | 4579 | 0.59x (M1 is ~1.70x faster) |
| Apple M1 | 7783 | - |
| i9-9920X | 14793* | 1.9x (M1 is ~1.9x slower) |

Table 1: CineBench R23 scores on the systems tested. *The i9-9920X system is running Ubuntu, for which CineBench is not available. This score is reported from
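The relative scores in Table 1 are just each CPU's score divided by the M1's. A short Python check of that arithmetic, using only the scores listed in the table (the `relative_to_m1` helper is a name I made up for this sketch):

```python
# CineBench R23 scores from Table 1.
scores = {
    "i7-1165G7": 4026,
    "i7-4790K": 4579,
    "Apple M1": 7783,
    "i9-9920X": 14793,
}


def relative_to_m1(cpu):
    """Score of `cpu` as a multiple of the M1's score."""
    return scores[cpu] / scores["Apple M1"]


if __name__ == "__main__":
    for cpu in scores:
        ratio = relative_to_m1(cpu)
        print(f"{cpu}: {ratio:.2f}x of the M1 (M1 is {1.0 / ratio:.2f}x of it)")
```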

For the benchmarks in ChameleonRT we’ll use the two scenes shown below: Sponza and San Miguel. Sponza is a small scene with 262K triangles, San Miguel is a decent size for interactive ray tracing with 9.96M triangles. Both are treated as a triangle soup by the BVH builder, meaning that a single bottom level acceleration structure is built containing all the triangles in the scene. This provides the BVH builder the most information possible about the geometry distribution so that it can build a high-quality acceleration structure. Both scenes are downloaded from Morgan McGuire’s meshes page and reexported in Blender to use OBJ material groups instead of per-triangle materials, as ChameleonRT doesn’t support per-triangle materials. The benchmarks are run rendering a 1280x720 image and run for ~200 frames, after which the average framerate (FPS) and million rays traced per second (MRay/s) are recorded.

Figure 1: The test scenes used in the benchmarks. Sponza (left) has 262K triangles, San Miguel (right) has 9.96M triangles.

Fair Comparisons

The fair comparison to make is against the other mobile CPU, the i7-1165G7 in my XPS 13, using the Embree CPU backend. The comparisons are actually a bit in favor of the M1 here, as the XPS 13 will struggle a bit with thermals during the benchmarks. I also include my old desktop CPU, the i7-4790K, in this comparison since it scores similar to the i7-1165G7 on CineBench. When running ChameleonRT’s Embree backend for shorter bursts, the XPS 13 is actually able to outperform my desktop, though it slows down after it heats up.


In these benchmarks I show the fastest SIMD configuration for the Intel CPUs. The i7-1165G7 has AVX512, allowing ISPC to run 16 “threads” in parallel and Embree to use up to AVX512. The i7-4790K supports AVX2, allowing ISPC to run 8 “threads” in parallel. On the M1 it is possible to use ISPC to compile a double-pumped NEON target, i.e., 8-wide SIMD execution on the 4-wide NEON registers, however I found that ISPC’s 4-wide target performed best. Embree on the M1 is configured to use AVX2 on NEON, as Embree is quite optimized for 8-wide SIMD.

| CPU | FPS | MRay/s |
|---|---|---|
| i7-1165G7 (AVX512) | 1.56 | 8.7 |
| i7-4790K (AVX2) | 1.83 | 10.2 |
| Apple M1 | 1.72 | 9.6 |

Table 2: Benchmark results rendering Sponza using the Embree CPU backend.


| CPU | FPS | MRay/s |
|---|---|---|
| i7-1165G7 (AVX512) | 1.07 | 6.1 |
| i7-4790K (AVX2) | 1.33 | 7.6 |
| Apple M1 | 1.30 | 7.4 |

Table 3: Benchmark results rendering San Miguel using the Embree CPU backend.

Extremely Unfair Comparisons

For the extremely unfair comparisons, we’ll compare the Metal GPU ray tracing backend on the M1 against DirectX ray tracing on an RTX 2070, and the Embree CPU backend against an i9-9920X. These comparisons are driven by my own curiosity. Since these are the highest end CPU and GPU systems I have access to, it’s interesting to see where the M1 lines up. However, I wouldn’t expect the M1 to provide competitive performance against these systems on this task. The RTX 2070 has a TDP of 175W and hardware accelerated ray tracing, while the entire M1 chip is estimated to be around 20-24W and does not have hardware to accelerate ray tracing. The i9-9920X has a TDP of 165W and support for AVX512 (16-wide SIMD) and AVX2 (8-wide SIMD). As seen in the previous benchmarks, the support for wider SIMD (AVX2) on the old i7-4790K still made it tough competition for the M1. In these benchmarks I found that the i9-9920X performed best when just using AVX2. As discussed by Travis Downs, the use of AVX512 on some CPUs can result in down clocking. Depending on how SIMD-friendly the workload is, it may actually perform better at a higher clock on a narrower SIMD width, which is the case here. Note that on the i7-1165G7, AVX512 performed slightly better than AVX2.

| CPU | FPS | MRay/s |
|---|---|---|
| i9-9920X (AVX2) | 6.02 | 33.6 |
| Apple M1 | 1.72 | 9.6 |

Table 4: Extremely unfair benchmarks on Sponza using the Embree CPU backend.

| CPU | FPS | MRay/s |
|---|---|---|
| i9-9920X (AVX2) | 4.46 | 25.3 |
| Apple M1 | 1.30 | 7.4 |

Table 5: Extremely unfair benchmarks on San Miguel using the Embree CPU backend.

| GPU | FPS | MRay/s |
|---|---|---|
| RTX 2070 (DirectX Ray Tracing) | 135.5 | 757 |
| Apple M1 (Metal) | 3.60 | 20.1 |

Table 6: Extremely unfair benchmark results on Sponza using the GPU backends.

| GPU | FPS | MRay/s |
|---|---|---|
| RTX 2070 (DirectX Ray Tracing) | 63.8 | 362 |
| Apple M1 (Metal) | 2.06 | 11.7 |

Table 7: Extremely unfair benchmark results on San Miguel using the GPU backends.

To wrap up, we can look back at the CineBench scores and the performance differences in the fair benchmarks on the CPU. The M1 scored 1.93x higher than the i7-1165G7 on CineBench, but in ChameleonRT’s Embree backend we found it was only 1.16x faster on average across the two scenes tested. Similarly, the M1 scored 1.70x higher than the i7-4790K on CineBench, but was actually 1.05x slower on average across the two scenes. What’s going on here?
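For reference, here is how those averages can be reproduced from the FPS columns of Tables 2 and 3. I'm assuming the average is the arithmetic mean of the per-scene FPS ratios; the `mean_speedup` helper is just a name for this sketch. With the rounded table values the M1 vs. i7-1165G7 figure comes out at about 1.16x, and the i7-4790K vs. M1 figure at about 1.04x, which is in line with the ~1.05x quoted above once rounding of the table entries is taken into account.

```python
# FPS from Tables 2 and 3 (Embree CPU backend).
fps = {
    "i7-1165G7": {"Sponza": 1.56, "San Miguel": 1.07},
    "i7-4790K": {"Sponza": 1.83, "San Miguel": 1.33},
    "Apple M1": {"Sponza": 1.72, "San Miguel": 1.30},
}


def mean_speedup(cpu, baseline):
    """Arithmetic mean over scenes of fps[cpu] / fps[baseline]."""
    ratios = [fps[cpu][scene] / fps[baseline][scene] for scene in fps[cpu]]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    print(f"M1 vs i7-1165G7: {mean_speedup('Apple M1', 'i7-1165G7'):.2f}x")
    print(f"i7-4790K vs M1:  {mean_speedup('i7-4790K', 'Apple M1'):.2f}x")
```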

It’s important to remember that ChameleonRT is not testing the same thing as CineBench. There’s a lot more going on in a production renderer like CineBench than in a minimal one like ChameleonRT. These other tasks, like intersecting more expensive geometries, evaluating more complex material models, and so on, can add up to a large percentage of the total execution time in a production renderer. ChameleonRT on the other hand has none of this, and is really just benchmarking ray traversal performance. So if we get similar ray traversal performance with Embree on the M1, but other things in CineBench are faster, we can get higher relative scores in CineBench than we do on traversal performance alone.

Overall the M1 is pretty stellar. I’ve been using the Mac Mini as my day to day computer for the past 2 weeks or so, and have been happy with both the performance and the total silence. There’s really something to be said for a practically silent computer: I didn’t hear the fan running across this set of benchmarks or when doing a parallel build of LLVM to build ISPC. What I’d be really excited to see in future M* chips would be support for 8-wide SIMD and hardware accelerated ray tracing. As part of the benchmark setup I ran the Embree backend on each machine with just SSE4 support (4-wide SIMD), and found it to be about 1.6-1.8x slower than when run with AVX2 (8-wide SIMD). Even assuming we get just 1.6x faster on the M1 with 8-wide SIMD, this would put the Embree CPU backend at ~2.75 FPS on Sponza and ~2.06 FPS on San Miguel. Support for hardware accelerated ray tracing on the GPU would also be awesome for improving GPU ray tracing performance. I don’t have a non-RTX GPU around anymore to make a rough performance comparison, but I’d expect pretty substantial speedups (way above 2x). Even the current level of performance is impressive for a lightweight chip given that it doesn’t hit the same thermal issues as the XPS 13 and can provide somewhat better performance on the CPU with 1/4 the SIMD width (4 vs. 16), and has a GPU ray tracing API that provides a 1.6-2x speedup over its CPU in these benchmarks. It’ll be interesting to see what kind of hardware Apple has planned for the high end.