Hello yuz-ers. What a year! We ended 2022 with more yuzu Fried Chicken, Vulkan changes, a new input driver, an exorbitant amount of kernel work, more performance, better visuals, and much more!
Blinkhawk has also been working hard on his beloved project, releasing Y.F.C. Part 1.5.
This is basically an abridged version of what is expected for the full “Part 2” release.
The changes in this pull request include a rework of the MacroHLE implementation to include various new macros for indirect draws and configurations.
As discussed in previous articles, macros are small GPU programs that implement features like indirect and instanced draws. They must be emulated. MacroHLE (High-Level Emulation) is the process of avoiding executing a requested macro and instead translating it directly to the code that it would have generated (like an instanced or indirect draw). This works in contrast and in parallel with MacroJIT, which works by actually emulating the loops and control flow contained in macro programs in a just-in-time fashion.
Now, why keep both? Well, each one performs its own specialized task. MacroHLE’s advantage compared to MacroJIT has to do with the emulation of indirect calls. An indirect call, such as a draw, uses data generated somewhere in the GPU through some shader in order to establish the draw parameters and its draw count. Traditionally with MacroJIT, we had to sync the host GPU and guest GPU to obtain the indirect data in order to execute the macro correctly. With MacroHLE, we create an indirect draw command in the host GPU that points to the translated address of where the GPU-generated data should be, thus skipping the sync.
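The difference between the two paths can be sketched with a toy model. Everything in it (`GuestMemory`, `HostCommand`, `Recorder`, and both function names) is hypothetical and only illustrates the idea of reading draw parameters back on the CPU versus handing the host GPU an address; it is not yuzu's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy guest GPU memory: address -> draw count written earlier by a shader.
using GuestMemory = std::unordered_map<std::uint64_t, std::uint32_t>;

struct HostCommand {
    bool indirect;             // true: the host GPU fetches the parameters itself
    std::uint32_t count;       // meaningful for direct draws only
    std::uint64_t param_addr;  // meaningful for indirect draws only
};

struct Recorder {
    std::vector<HostCommand> commands;
    int cpu_gpu_syncs = 0;  // how many times the CPU had to wait for the GPU
};

// Readback path: wait for the GPU, read the count on the CPU, record a direct draw.
void DrawWithReadback(Recorder& rec, const GuestMemory& mem, std::uint64_t addr) {
    ++rec.cpu_gpu_syncs;  // stands in for the expensive host/guest GPU sync
    rec.commands.push_back(HostCommand{false, mem.at(addr), 0});
}

// HLE path: record an indirect draw pointing at the translated address, no sync.
void DrawIndirectHLE(Recorder& rec, std::uint64_t translated_addr) {
    assert(translated_addr != 0);  // toy rule: 0 means the translation failed
    rec.commands.push_back(HostCommand{true, 0, translated_addr});
}
```

In this model the readback path pays one sync per draw, while the HLE path never touches the parameter data on the CPU at all.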
Thanks to these improvements, yuzu can now execute macros more efficiently, considerably reducing CPU overhead without having to change any setting. It is what we internally like to call a “passive skill”.
As a result of these changes, performance has been improved in several titles, including those developed by Koei Tecmo, Pokémon Scarlet and Violet, Bayonetta 3, and Monster Hunter Rise (with the exception of version 12.0.0, which still requires further fixes), to name a few.
The crashes in Fire Emblem: Warriors have also been fixed.
We measured a 5-20% performance boost in select titles, but the improvement may be higher on CPUs with a lot of cache. From our testing, the 5800X3D can reach over 30% in some games.
The performance cost of rendering at higher resolutions was also greatly reduced.
But the goodies don’t end here! Blinkhawk also added support for the VK_EXT_extended_dynamic_state2 and VK_EXT_extended_dynamic_state3 Vulkan extensions, reducing the amount and size of shaders needed to be built during gameplay.
This relatively “new” pair, along with the already implemented VK_EXT_extended_dynamic_state and VK_EXT_vertex_input_dynamic_state, are the four extensions responsible for considerably reducing shader building stuttering.
But as it always goes, support for these extensions in consumer GPUs is spotty at best, and a mess to support at worst.
State3 in particular is only supported by the NVIDIA Vulkan Beta drivers, version 527.86 at the time of writing, and recent (late 2022 and newer) RADV Mesa drivers.
We recommend anyone interested in testing how a fresh shader cache performs to give these drivers a go.
With no alternative, implementing these extensions forced us to perform another dreaded cache invalidation.
Most drivers cover at least 3 of the 4 extensions without issue, one way or another, with one glaring exception: the AMD Windows drivers. The price of this is heavier stuttering during gameplay when new shaders are being processed, compared to running the same card on Linux with RADV or using any other brand.
A small side-note: Linux RADV users should update their Mesa version to the latest (or use a more recent distro version if needed), as support for state2 was broken in versions before 21.2.
As a last second change, Blinkhawk tested removing the 16-bit floating point (FP16) blacklist enforced on NVIDIA Ampere and newer GPUs (RTX 3000 series and higher). If it worked, it would have allowed them to work similarly to Turing and AMD Radeon offerings in this aspect. However, NVIDIA redesigned how their FP32 and FP16 units operate on Ampere and newer, with both providing identical performance. Unfortunately, even if it were faster, it’d be irrelevant in the end, as FP16 on Ampere and Ada is still bugged in the drivers, producing graphical issues in many games.
The only remaining architecture that could benefit from enabling blacklisted FP16 support is Intel on Windows, but their drivers are a dumpster fire regarding FP16. So they continue to emulate 16-bit precision with 32-bit the same way as Ampere and Ada, in this case with its always present performance loss. Of course the weakest architecture that could benefit the most from this change is the only one that remains broken…
Another extra benefit of this iteration of Y.F.C. is that Normal GPU accuracy is much safer to use.
Particles will continue to be better in High, but games like Bayonetta 3 and many others can be played with Normal accuracy without glitches much more regularly, with the big performance benefit this provides (Bayonetta 3 in particular still needs High for its title screen, but gameplay is safe on Normal).
The month doesn’t stop there; there has been a plethora of changes worth mentioning in our GPU codebase too.
Oh boy, byte sure has been busy this month.
To start off, he is responsible for implementing the SMAA anti-aliasing filter for our Vulkan and OpenGL backends. But that’s not the whole story, so let’s elaborate further.
SMAA, or enhanced subpixel morphological antialiasing, is an improvement over MLAA developed by the Spanish Universidad de Zaragoza and video game studio Crytek, of Crysis fame.
BreadFish implemented the original OpenGL version, intending to release it as part of the resolution scaler. As it turns out, implementing SMAA for Vulkan is no joke, and after being nagged by your writer, byte had to work 2 weeks to get it in shape.
SMAA, being based on MLAA, intends to be a post-processing (aka shader-based) option focused on quality over performance by analyzing adjacent pixels, unlike FXAA, which just blurs the entire screen.
The SMAA filter is implemented using render passes, and it produces its best results when combined with FSR filtering.
AMD recommends properly anti-aliasing the image in their official Overview Integration Guide.
The results speak for themselves:
Here you can see an ideal test case for SMAA, the simple triangle of death
Ropes and power lines, the classic example for anti-aliasing testing (Pokémon Scarlet)
Sprite elements in 3D games benefit from it (Xenoblade Chronicles 3)
For those interested, we used the ULTRA preset. Testing showed a low performance loss even with a GT 1030, so we preferred to focus on quality.
Only users with old integrated GPUs should avoid SMAA. For the rest, it’s a safe option to turn on and forget.
You can find the feature in Emulation > Configure > Graphics > Anti-Aliasing Method.
SMAA doesn’t suffer from the horrible colour banding of FXAA (The Legend of Zelda: Breath of the Wild)
And it's a great help for users running low resolution multipliers. This example is 0.5x Bilinear alone Vs 0.5x FSR + SMAA (Pokémon Scarlet)
As a side note, NVIDIA’s version of FSR, NIS, was also tested, but the result is so ugly and over-sharpened, that we decided to keep the best option of the two, FSR.
byte has also fixed a problem with anisotropic filtering.
If users ran the RADV driver on Linux, anisotropic filtering values other than Default would cause a distinct “acne-like” rendering issue in Super Mario Odyssey. The issue persists at other anisotropic filtering and resolution multiplier values, but byte continues to work on it.
The so called RADV acne (Super Mario Odyssey)
The change also addresses an issue with the buggy water rendering in Super Mario Sunshine with automatic anisotropic filtering on Lavapipe (Mesa, Linux), although the error still occurs at other anisotropic filtering values.
Kind of makes it look even older (Super Mario Sunshine)
byte also corrected the semantics of data cache management operations in the memory.
Previously, when the guest requested a cache invalidation, the implementation would simply invalidate the cache on the hardware, rather than making the memory visible to the GPU as intended.
On the side, he also promoted various Vulkan Extensions to use core methods. In the Vulkan API, vendor extensions are optional features provided by specific hardware vendors or drivers that may not be available on all systems. In contrast, core methods are a fundamental part of the Vulkan specification and are guaranteed to be available on all systems that support the API. Thus, promoting extensions to use core methods can improve their reliability and portability.
byte made further tweaks to the Vulkan initialization.
These changes included the restoration of VK_EXT_host_query_reset, which was mistakenly removed in a previous PR. He also added the flag VK_INSTANCE_CREATE_ENUMERATE_PORTABILITY_BIT_KHR to the VkInstanceCreateInfo structure so that MoltenVK can be detected as an available Vulkan device.
Keep in mind that a lot more work is needed in order to get yuzu rendering on macOS devices. This is only early preliminary work.
vonchenplus has implemented the draw manager for Maxwell3D with the aim of eliminating workarounds and reorganising the drawing process to more accurately emulate the drawing behaviour.
As a result of these changes, the issue in Dragon Quest Builders where some 3D models were not rendering properly has been fixed.
No more missing stuff! (Dragon Quest Builders)
Following these changes, vonchenplus also improved the code for the topology update logic so that the implementation is more accurate. This change was necessary in order to implement special topologies with Vulkan.
This includes support for quad strips, which require the use of triangles to simulate them, and the ability to simulate indexed and non-indexed modes.
In non-indexed mode, a fixed mapping table is used to connect the vertices, while in indexed mode, a compute shader is used to dynamically map the original drawing indices.
vonchenplus has also implemented support for line loops, which require the use of line strips to simulate them, and for polygons, which require the use of triangle fans.
These changes fixed the Hero’s Path in The Legend of Zelda: Breath of the Wild, as well as the Status Summary graphic in Pokémon Scarlet and Violet, and they also gave us another shader cache invalidation, yay!
Don’t mess with the stats! Can’t do breeding without the stats! (Pokémon Scarlet)
When the Sheikah GPS signal returns (The Legend of Zelda: Breath of the Wild)
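To make the quad strip emulation described above more concrete, here is a minimal CPU-side sketch. It assumes the OpenGL convention that quad *i* of a strip uses vertices 2i, 2i+1, 2i+3, 2i+2, and it splits each quad into two triangles. The function names are hypothetical, the winding choice is an assumption, and `RemapIndices` merely stands in for the work the text attributes to a compute shader.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Non-indexed mode: a fixed table that turns a quad strip into a triangle list.
std::vector<std::uint32_t> QuadStripToTriangles(std::uint32_t vertex_count) {
    std::vector<std::uint32_t> out;
    for (std::uint32_t base = 0; base + 3 < vertex_count; base += 2) {
        const std::uint32_t a = base, b = base + 1, c = base + 3, d = base + 2;
        out.insert(out.end(), {a, b, c, a, c, d});  // quad (a,b,c,d) -> 2 triangles
    }
    return out;
}

// Indexed mode: run the same table through the game's original index buffer.
std::vector<std::uint32_t> RemapIndices(const std::vector<std::uint32_t>& mapping,
                                        const std::vector<std::uint32_t>& original_indices) {
    std::vector<std::uint32_t> out;
    out.reserve(mapping.size());
    for (const std::uint32_t m : mapping) {
        assert(m < original_indices.size());
        out.push_back(original_indices[m]);
    }
    return out;
}
```

The table depends only on the vertex count, which is why the non-indexed case can reuse a precomputed mapping, while the indexed case has to look up every entry in data supplied by the game.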
Blinkhawk has added alpha to coverage and alpha to one to our Vulkan backend.
Alpha to coverage is a multisampling technique that is used to improve the quality of transparent or partially transparent pixels.
It works by converting the alpha value of a fragment into a coverage mask, so the fraction of samples written within a pixel follows how opaque the fragment is.
This can help to reduce aliasing and other rendering artifacts that can occur when rendering transparent pixels.
Alpha to one, on the other hand, replaces the alpha value output for each pixel with the maximum value of 1.
This can help to reduce the amount of alpha blending that needs to be performed, which can improve the performance of the rendering pipeline.
These changes have fixed the shading problems on trees and grass when viewed up close or from a distance in Pokémon Scarlet and Violet.
The camera isn’t more interested in that tree, you should learn from this, Dark Souls (Pokémon Scarlet)
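As a rough illustration of the two features, the sketch below turns an alpha value into a multisample coverage mask and forces alpha to one. Real GPUs pick which samples to cover in an implementation-defined (often dithered) way; covering the lowest samples first is a simplification, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Alpha to coverage: the more opaque the fragment, the more samples it covers.
std::uint32_t AlphaToCoverageMask(float alpha, std::uint32_t sample_count) {
    assert(sample_count >= 1 && sample_count <= 32);
    const float clamped = std::min(1.0f, std::max(0.0f, alpha));
    const auto covered =
        static_cast<std::uint32_t>(clamped * static_cast<float>(sample_count) + 0.5f);
    // Set the lowest `covered` bits; guard the shift for the 32-sample case.
    return covered >= 32 ? 0xFFFFFFFFu : (1u << covered) - 1u;
}

// Alpha to one: whatever alpha the shader produced is replaced by 1.0.
float AlphaToOne(float /*shader_alpha*/) {
    return 1.0f;
}
```

With 4 samples per pixel, an alpha of 0.5 covers two samples in this model, so thin foliage edges fade out across neighbouring pixels instead of flipping between fully drawn and fully discarded.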
vonchenplus has corrected errors caused by yuzu’s faulty detection of draw types.
In the past, yuzu would set every vertex and index count register to zero after each draw to determine if the next draw would be a regular or indexed draw.
Xenoblade Chronicles 3 proved to us that some draw calls rely on the previous values of these registers.
Changing this behaviour partially fixes the particles present in Xenoblade Chronicles 3. You can now more easily perform your off-seer duties.
Meat is on the menu! (Xenoblade Chronicles 3)
We don’t usually cover compilation changes here, but this time we had to do it because it affects compatibility.
Your writer (or co-writer in this progress report, my partner did most of the work this time) has been playing with compilation flags in order to get more free performance, following previous work done by Blinkhawk some time ago.
Microsoft Visual C++ (MSVC, Visual Studio) is simple enough (we’ll talk about Linux later). You enable full program optimizations, optimize for performance instead of size, a bit here, a bit there, and you gain a nice 3%, but I wanted more. Last month, Epicboy improved the build process, saving both time and memory. This created a “gap” big enough to enable the Big One, Link-Time Optimizations (LTO), an optimization that in the past had to be discarded for eating all the available RAM of our buildbots.
Windows testing went well, and in some cases the performance uplift reached up to 12%. The problem was Linux. LTO is aggressive by nature, and there’s no guarantee that all parts of the project will react nicely to it. In this case, the problem was Qt: the UI looked completely garbled. So LTO had to go, but in its place, we now require what Dynarmic has already required for a while: x86-64-v2 hardware.
GCC and Clang builds are now compiled assuming the CPU supports the instruction sets that form part of x86-64-v2, the highest one being SSE4.2. This means the minimum CPU required for yuzu to work without crashing, in both Windows and Linux, is now the first generation of Core i-series (500-900 series), which is almost 15 years old, or the FX and APU series from AMD, which are almost 12 years old. The performance boost on GCC and Clang is up to 7%.
First system runs i7-12700H - 2x16GB 4800MHz CL40 - RTX 3080 Mobile 16GB 175W, second system runs R7 5800X3D - 2x16GB 3600MHz CL16 - RTX 4090
We originally wanted to enforce x86-64-v3 to get an even bigger performance boost, as well as to ensure a minimum level of performance, as any CPU lacking AVX, AVX2, and in particular, FMA, will be very slow, no matter its clock speed or core count.
Yes, that means the 8 core Ivy Bridge Xeon you bought for 20 bucks is not fast enough for this task.
The problem, however, is that doing so would leave close to 9% of the user base out of support, according to our telemetry. That many users is a considerable number, so we’ve decided to wait until more users adopt more modern CPUs before implementing this change. We’ll re-evaluate enforcing x86-64-v3 in the future once OpenGL eventually ends up on the chopping block as well.
While this change would also apply to Windows, MSVC is not flexible enough to let us build for x86-64-v2; it either targets SSE2 or jumps straight to the first AVX. Dynarmic already manually uses x86-64-v2 extensions, so any CPU lacking SSE4.2 is considered unstable regardless of the OS in use.
x86-64-v4 will not be an option for many years, mainly because Intel can’t decide if AVX-512, made by themselves, is something that their users should be allowed to actually use.
If an old-school user is so strongly set on running yuzu on decades old CPUs, the Flatpak builds are still generic, or there’s always the option of building yuzu manually, allowing you to configure any requirements.
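For readers unsure which level their CPU reaches, the following sketch classifies a set of feature flags into x86-64-v1, v2, or v3. It assumes the flag spellings used in the `flags` line of `/proc/cpuinfo` on Linux (`pni` for SSE3, `abm` for LZCNT, and `xsave` as a stand-in for OSXSAVE). The function names are hypothetical and this is not how yuzu itself checks the CPU.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Returns 1, 2 or 3 for the highest x86-64 level covered by `flags`.
int X86_64Level(const std::set<std::string>& flags) {
    static const std::vector<std::string> v2 = {
        "cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"};
    static const std::vector<std::string> v3 = {
        "avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"};
    const auto has_all = [&flags](const std::vector<std::string>& needed) {
        for (const std::string& name : needed) {
            if (flags.count(name) == 0) {
                return false;
            }
        }
        return true;
    };
    if (!has_all(v2)) {
        return 1;  // baseline x86-64 only
    }
    if (!has_all(v3)) {
        return 2;  // SSE4.2-class CPU
    }
    return 3;  // AVX2 and FMA are present as well
}

// Usage sketch: a flag set resembling a first-generation Core i-series CPU.
void ExampleFirstGenCoreI() {
    const std::set<std::string> flags = {
        "cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"};
    assert(X86_64Level(flags) == 2);  // meets v2, lacks AVX/AVX2/FMA for v3
}
```

On Linux, the input set could be filled by splitting the `flags` line of `/proc/cpuinfo` on whitespace.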
german77 has done it again, giving us an amazing Christmas gift: a new input driver for Joy-Cons! This is an in-house development that doesn’t rely on SDL, so it gives us much more freedom to add features that weren’t previously available.
The basics are covered. Single and Dual Joy-Con modes are available, and button, stick, and motion mapping work the same as before. But that’s not exciting, so here’s all the new stuff that was added:
Game Builder Garage and Nintendo Labo can make use of this neat feature at the base of the right Joy-Con.
All this extra accuracy highlights a problem we didn’t often face before: PC Bluetooth connections are very easy to saturate. Cheaper/Intel Bluetooth chipsets or areas with tons of interference are especially prone to this. For this reason, HD Rumble can potentially cause lag depending on the user’s specific circumstances. We recommend unmapping/disabling rumble in those cases.
Speaking of saturation, the IR camera may be slow in some games. The reason is that we currently implement only the image transfer mode, which saves 320x240 pictures. Some games prefer faster framerates at the cost of resolution, going as low as 40x30. Once all modes are added in, the choppy framerate will disappear.
Amiibo data writing is a work in progress.
german77’s desire for incredible input improvements doesn’t end there.
german77 implemented the
allowing games to read and write plain MIFARE tags.
Skylanders Imaginators makes use of this feature.
The only feature lacking is support for encrypted reads and writes.
Speaking of SDL, a recent update broke the way it handled the GUID, the identifier of several controllers, including the one integrated into the Steam Deck, causing many annoyances for Deck users. So, with no alternative on hand, german77 had to implement a custom filter to solve the issue.
And lastly, as a very important quality of life change, german77 made the input device list refresh automatically, ensuring that yuzu detects controllers without the need for manual intervention. Goodbye tiny refresh button!
With an update for Dynarmic and SDL2, byte enabled support for ARM64 compilation. This means all Switch titles can be tested on Linux ARM64 devices with compatible Vulkan drivers.
As part of this effort, we started implementing Flatpak support for ARM64 Linux devices. This required making OpenGL optional for the build process, as Flatpak’s Qt build only supports OpenGL ES, not the full fledged OpenGL 4.6 compatibility profile we require.
Part of these changes fixed compilation for macOS, but the situation remains the same: without MoltenVK support, nothing will be rendered.
Epicboy implemented a series of changes with the goal of minimizing the overhead of dynamic memory allocation, a task which involves requesting memory from the operating system and can slow down performance in some circumstances.
The texture cache, in particular, was a significant contributor to this issue, as it constantly allocated and then deallocated memory when transferring textures to and from the GPU. To address this problem, Epicboy optimized the texture cache to pre-allocate a buffer to store swizzle data and reuse it whenever possible, rather than performing a dynamic memory allocation every time this was done. This change should result in reduced stuttering, as memory will now only be requested from the operating system if the buffer is not large enough to hold the data.
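The reuse pattern described here can be sketched as follows. `SwizzleScratch` and its members are hypothetical names for illustration; the allocation counter exists only to make the effect visible and is not part of any real implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Keeps one buffer alive across uploads and only reallocates when it must grow.
class SwizzleScratch {
public:
    std::uint8_t* Acquire(std::size_t size) {
        if (size > storage_.size()) {
            storage_.resize(size);  // the only place memory is requested
            ++allocations_;
        }
        assert(storage_.size() >= size);
        return storage_.data();
    }

    int Allocations() const {
        return allocations_;
    }

private:
    std::vector<std::uint8_t> storage_;
    int allocations_ = 0;
};
```

After a few large uploads the buffer reaches its steady-state size, and later transfers of the same or smaller size cost no allocation at all.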
Epicboy also made similar changes to optimise the
, which likewise takes a similar approach: instead of allocating and deallocating memory, a buffer is created once to hold data in the memory, and it only reallocates whenever it needs to grow.
Additionally, he introduced a
to act as a wrapper around a heap-allocated buffer of memory.
The advantage of this class lies with the fact that it eliminates the need to initialize the stored values, and the need to copy the data when the buffer needs to grow. Thus, it would help to speed up things by minimizing the amount of time spent on memory management tasks.
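A minimal sketch of such a wrapper is shown below. The name `GrowOnlyBuffer` and its interface are hypothetical and do not reflect the class added to yuzu. The two properties it demonstrates are the ones named above: new storage is left uninitialized, and growing discards the old contents instead of copying them.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>

template <typename T>
class GrowOnlyBuffer {
    static_assert(std::is_trivial<T>::value, "elements are left uninitialized");

public:
    // Makes room for `count` elements. Growing throws the old contents away.
    void ResizeDestructive(std::size_t count) {
        if (count > capacity_) {
            data_.reset(new T[count]);  // default-initialized: no zero fill, no copy
            capacity_ = count;
        }
        size_ = count;
    }

    T* Data() { return data_.get(); }
    std::size_t Size() const { return size_; }
    std::size_t Capacity() const { return capacity_; }

    T& operator[](std::size_t i) {
        assert(i < size_);
        return data_[i];
    }

private:
    std::unique_ptr<T[]> data_;
    std::size_t size_ = 0;
    std::size_t capacity_ = 0;
};
```

Compared with `std::vector::resize`, which value-initializes new elements and copies existing ones when it reallocates, this trades safety for speed, so callers must write every element before reading it.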
german77 implemented the FreeThreadCount info type, which is needed by titles such as Just Dance 2023 Edition (although that game requires additional changes in order to work).
Saalvage noticed an error in yuzu’s kernel implementation and made the necessary changes to unlock the thread mutex before destruction, as not doing so is undefined behaviour. “Here be Dragons” and all that.
byte submitted a change that improves the handling of system startup failure, in order to prevent deadlocks and crashes when/if the GPU initialization fails.
He also added
This component is designed to fix an issue with incorrect event unregistration when threads request a timeout for certain operations.
Without the fix, the threads would return successfully from the operation but fail to cancel the timeout, which would cause the timer to mistakenly fire on the thread and cancel a random unrelated operation.
This change fixes the random hangs that have been plaguing us for months in The Legend of Zelda: Breath of the Wild, as well as Persona 5 Royal.
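The class of bug described above can be reproduced with a toy timer queue. All names here (`TimerQueue`, `WaitWithTimeout`, the integer thread IDs) are hypothetical and the model is deliberately simplified; it only shows why a wait that succeeds must also unregister its timeout.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Toy timer: maps a deadline tick to the thread whose wait should time out.
class TimerQueue {
public:
    void Register(std::uint64_t deadline, int thread_id) {
        assert(thread_id >= 0);
        timeouts_.emplace(deadline, thread_id);
    }

    // Drops every pending timeout that belongs to `thread_id`.
    void Unregister(int thread_id) {
        for (auto it = timeouts_.begin(); it != timeouts_.end();) {
            it = (it->second == thread_id) ? timeouts_.erase(it) : ++it;
        }
    }

    // Fires and removes all timeouts due at or before `now`.
    std::vector<int> Advance(std::uint64_t now) {
        std::vector<int> fired;
        const auto end = timeouts_.upper_bound(now);
        for (auto it = timeouts_.begin(); it != end; ++it) {
            fired.push_back(it->second);
        }
        timeouts_.erase(timeouts_.begin(), end);
        return fired;
    }

private:
    std::multimap<std::uint64_t, int> timeouts_;
};

// Models a wait that succeeds before its deadline. Passing `false` as
// `cancel_on_success` reproduces the bug: the stale timeout stays armed.
void WaitWithTimeout(TimerQueue& timers, int thread_id, std::uint64_t deadline,
                     bool cancel_on_success) {
    timers.Register(deadline, thread_id);
    // ... the awaited operation completes here, before the deadline ...
    if (cancel_on_success) {
        timers.Unregister(thread_id);
    }
}
```

In the buggy variant, `Advance` later reports the thread even though its wait finished long ago, which is the point where an unrelated operation on that thread would get cancelled.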
byte also introduced a workaround for crashes caused by unallocated memory after noticing that yuzu always used memory blocks without marking them as allocated, causing it to overlap memory used by the game. He fixed the bug by making sure we now allocate the memory before using it. This is meant to alleviate the situation while other parts of the kernel are being ironed out.
This is more related to error handling, but counts nonetheless. byte added an option to force the emulator to break when an invalid memory access happens. This means that if/when a game explodes in the background, the emulator will break instead of slowly eating all the available system RAM. Problems like these can be caused by emulation issues, damaged game dumps, or even some wonky mods, so it’s always better to stop early than to bring down the entire emulator and, if the user has little enough RAM, make the OS suffer.
We’ve had some interesting user interface quality of life changes implemented! lioncash made the SPIR-V shader backend element translatable, so it doesn’t always show in English for everyone. The community effort working on translation can now take the label and update it accordingly.
Some months ago, with the core timing changes, we allowed users to boot games with their framerate unlocked after continuous requests from the community.
As it turns out, nothing changed. Several games hate booting with unlocked framerates, and the support channels get their fair share of people asking why their game doesn’t want to boot.
So, simple fix, unlocked framerate at boot rights denied.
The hotkey to toggle unlocked framerate is Ctrl + U by default, so this is only a small nuisance.
Users reported that they couldn’t record or stream their yuzu window while in windowed mode.
byte found the cause was setting the WA_DontCreateNativeAncestors Qt attribute for all platforms, instead of just for Wayland.
Issue down, streamers rejoice.
Discord user piplup reported that yuzu didn’t save the device name (what you would call the console) after accessing a game’s custom configuration window. german77 fixed the issue (this particular setting lacked a custom configuration equivalent), and also fixed Qt 6 build issues while at it.
Another very nice quality of life improvement made by german77 is making yuzu remember the last selected directory for Install files to NAND….
If you keep your dumps in the same folder, updating your games is going to take fewer clicks now.
byte managed an amazing victory in the war against crashes when closing/stopping games. He worked on making shutdown not visibly freeze yuzu, avoiding crashing the emulator while the game quits, and showing a nice pop-up message while at it too!
Another battle fought on this front is related to homebrew apps. byte is responsible for making them quit properly now too.
ChrisOboe suddenly shows up with a glorious quality of life fix for the terminal-based yuzu-cmd build. Marking the build as a “Windows” application instead of a “Console” one ensures that no empty command line window pops up needlessly. This can help streaming programs set up to run specific games with yuzu-cmd, as this prevents the sudden empty black box from appearing in front of other windows.
We’re only a few days into 2023 and we already want to publish the next progress report. So much has happened in such a short time!
Also, Blinkhawk, bunnei, and byte are up to something, and we can’t wait to tell you more. And yes, there will be yet more cache invalidations. All in the name of progress.
That’s all folks! Expect a few but very critical Vulkan improvements next time, hope to see you then!
⭐⭐⭐ Grande Messi