Kernel: Wake cores from idle directly rather than through a host thread #6837
Conversation
Right now when a core enters an idle state, leaving that idle state requires us to first signal the core's idle thread, which then signals the correct thread that we want to run on the core. This means that in a lot of cases, we're paying double for a thread to be woken from an idle state.

This PR moves this process to happen on the thread that is waking others out of idle, instead of an idle thread that needs to be woken first.

For compatibility the process has been kept as similar as possible - the logic from IdleThreadLoop has been migrated to TryLeaveIdle, and is gated by a condition variable that lets it run only once at a time for each core. A core is only considered for wake from idle if idle is both active and has been signalled - the signal is consumed and the active state is cleared when the core leaves idle. Maybe we could go further with this to avoid waiting on other thread signals to complete, but a port of the current behaviour is the safest improvement for now. A sketch of what this wake path looks like follows below.

Dummy threads (just the idle thread at the moment) have been changed to have no host thread, as the work is now done by threads entering idle and signalling out of it. The idle thread has been removed entirely, and idle core state now lives directly on the scheduler. This could put a bit of extra work on threads that would have triggered `_idleInterruptEvent` before, but I'd expect less time wasted than signalling all those reset events and paying the OS overhead that follows. Worst case, other threads performing these signals at the same time will have to wait for each other, but it's still going to be a very short amount of time.

Improvements are very slight, but are best seen in games with heavy (or very misguided) multithreading, such as Pokemon: Legends Arceus. Improvements are expected in Scarlet/Violet and TOTK, but are harder to measure due to GPU trouble.

Testing on Linux/macOS is still to be done. We definitely need to test more games, as this change affects all of them (obviously) and any issues might be rare to encounter.
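To make the gating concrete, here's a minimal C# sketch of what the new wake path can look like. Only `TryLeaveIdle`, `IdleThreadLoop` and `_idleInterruptEvent` are names taken from the PR; the types, fields and the `PickThreadToRun` helper are illustrative assumptions, not the actual Ryujinx code.

```csharp
using System;

// Stand-in for the kernel thread type; illustrative only.
class KThread
{
    public string Name;
    public KThread(string name) => Name = name;
}

class SchedulerSketch
{
    // Per-core idle state now lives on the scheduler itself, since the
    // idle thread no longer has a host thread of its own.
    private sealed class CoreIdleState
    {
        public readonly object Gate = new object(); // per-core wake gate
        public bool IdleActive;    // core is currently idle
        public bool IdleSignalled; // a wake has been requested
    }

    private readonly CoreIdleState[] _cores;

    public SchedulerSketch(int coreCount)
    {
        _cores = new CoreIdleState[coreCount];
        for (int i = 0; i < coreCount; i++) _cores[i] = new CoreIdleState();
    }

    // Called when a core runs out of work to do.
    public void EnterIdle(int core)
    {
        lock (_cores[core].Gate) { _cores[core].IdleActive = true; }
    }

    // Called by a thread that wants the idle core to pick up new work.
    public void SignalIdle(int core)
    {
        lock (_cores[core].Gate) { _cores[core].IdleSignalled = true; }
        TryLeaveIdle(core);
    }

    // Runs on the *waking* thread; this is the logic that used to live
    // in IdleThreadLoop behind _idleInterruptEvent.
    public void TryLeaveIdle(int core)
    {
        CoreIdleState state = _cores[core];
        lock (state.Gate) // only one waker runs this per core at a time
        {
            // A core is only considered for wake from idle if idle is
            // both active and has been signalled.
            if (!state.IdleActive || !state.IdleSignalled)
            {
                return;
            }

            // Consume the signal and clear the active state on leave.
            state.IdleSignalled = false;
            state.IdleActive = false;

            KThread next = PickThreadToRun(core);
            Console.WriteLine($"core {core} leaves idle to run {next.Name}");
        }
    }

    // Hypothetical stand-in for the real thread selection step.
    private KThread PickThreadToRun(int core) =>
        new KThread($"guest-thread-for-core-{core}");
}
```

A plain `lock`/`Monitor` gate is enough to play the role of the per-core condition variable described above: whichever waking thread takes the gate first consumes the signal, and any thread arriving while a wake is in progress briefly waits and then backs out.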
Legends: Arceus provides the best view of what the difference means for core scheduling. There's a part of the core game loop where it wastefully swaps between three threads running what is basically a sequential workload for a few milliseconds. If we zoom in here with a profiler, we can see the behaviour before and after (a simplified sketch of the old path follows below).

**Before** *(profiler screenshot)*

You can see that guest threads 50, 54 and 53 are constantly blocking each other in a clear pattern. However, when each thread suspends, it also signals the idle threads for each core (OS threads 0, 1, 2, 3). These threads then wake the next guest thread, so two OS context switches (shown by the arrows) need to be performed for the game to switch to the next thread.

**After** *(profiler screenshot)*

The threads are in a similar pattern where they signal each other sequentially, but they are waking each other directly rather than waking idle threads first. You can see this via the arrows, where it's clearer which threads are unblocking each other. This won't be perfect - an unrelated thread could still wake a thread that was unblocked by some other thread that hasn't gotten to the idle awakening step - but it's nicer for debugging and saves one OS context switch per idle wake.

It's worth noting that the profiler I'm using exaggerates the runtime of threads: it captures all context switches, but its time precision is a lot lower and it seems to round start times down and end times up. It also slows down context switches a lot more, so with the new approach the game runs notably faster under a profiler.

On my Windows desktop with a Ryzen 3900X, there is a small boost to performance (peak performance shown; the average performance difference is about the same, measured at the same location).

**Before** *(screenshot)* / **After** *(screenshot)*

I still need to see if overall CPU usage drops, and how this might impact systems with fewer cores or power saving.
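For contrast, the pre-PR shape of the idle path can be sketched as a dedicated host thread parked on an event, which is what made every wake cost two context switches. Only `IdleThreadLoop` and `_idleInterruptEvent` are names from the PR; the `AutoResetEvent` choice and the other members are assumptions for illustration.

```csharp
using System.Threading;

// Simplified sketch of the old per-core idle host thread.
class IdleThreadSketch
{
    private readonly AutoResetEvent _idleInterruptEvent = new AutoResetEvent(false);
    private volatile bool _exit;

    public Thread Start()
    {
        var thread = new Thread(IdleThreadLoop) { IsBackground = true };
        thread.Start();
        return thread;
    }

    // A suspending guest thread calls this, paying one OS wake just to
    // get the idle host thread running.
    public void Interrupt() => _idleInterruptEvent.Set();

    public void Exit()
    {
        _exit = true;
        _idleInterruptEvent.Set();
    }

    private void IdleThreadLoop()
    {
        while (!_exit)
        {
            // Context switch #1: the OS wakes this idle host thread.
            _idleInterruptEvent.WaitOne();

            // Context switch #2: the idle thread then wakes the guest
            // thread that should actually run on this core.
            WakeSelectedGuestThread();
        }
    }

    // Stub for the selection/wake logic the PR moves into TryLeaveIdle.
    private void WakeSelectedGuestThread() { /* illustrative */ }
}
```

Removing this intermediate hop is what the arrows in the "After" capture show: the suspending guest thread now performs the wake itself.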
I wonder how hard it would be to remove the idle threads entirely.
Did some testing on Steam Deck, and its performance appears to be affected a lot more. The system has 4 cores instead of the 12 on my desktop, runs Linux instead of Windows, and has aggressive power saving measures. All tests were run on battery, and screenshots were taken a few minutes after each test began so the power usage numbers could settle.

**Uncapped framerate**

Average performance greatly improves: fluctuations go from 35-36 FPS to 42-43 FPS (around 4-5 ms saved per frame). Overall power usage seems similar, but more of it goes into the GPU to reach the new higher framerate (not shown on the screenshot, but the general pattern is there when watching it). Frame times are a lot more stable.

**Before** *(screenshot)* / **After** *(screenshot)*

**Capped framerate**

When the framerate is capped, power usage greatly decreases. Focus on the wattage numbers next to "battery" and "cpu", and the clock speeds it settled on. Frametime is a lot more consistent. Fan speed and temperatures are much lower; the fan quickly becomes inaudible when the cap is turned on.

**Before** *(screenshot)* / **After** *(screenshot)*

I've always wondered why this game was underperforming on Deck. I guess now we have the answer.
lgtm, thanks. I tested a few games here on Windows and macOS and had no issues. I didn't play for long though, so it might be worth getting more extended testing from someone else. Very nice to see the idle threads gone; it should make debugging a bit simpler. I didn't know it could have such a significant impact on the Steam Deck too, so that was a nice surprise.
I tested Smash Ultimate for an extended amount of time and didn't find anything out of the ordinary. I also briefly tried a few others and got the same result.
Works great! Can't really comment on the code changes tbh. They make sense to me and I don't see any issues, but I'm very inexperienced in that area, so that's not really worth much.