The Promise of UMS
At PDC last year I presented how the Concurrency Runtime (ConcRT) lights up Windows Server 2008 R2. My talk focused on how to use ConcRT, which comes as part of the C Runtime (CRT) in Visual Studio 2010, and I mentioned that it can use the new User-mode scheduling (UMS) feature that is available on Windows Server 2008 R2 and Windows 7’s x64 edition.
Does your C++ application take advantage of UMS? How is the performance, either measured by latency or throughput, when you enable UMS? I want to hear about your experience with UMS! Please drop some email to me at Dana dot Groff at Microsoft dot com.
User mode scheduled threads are a new technology in Windows 7 x64 edition and Windows Server 2008 R2. Pedro Teixeira’s PDC talk goes into depth on how these new threads work. ConcRT provides easy-to-use programming models, the Parallel Patterns Library (PPL) and the Asynchronous Agents Library, that let you get the benefits of UMS with very little effort. You simply need to specify that the scheduler should use UMS when it is available.
How Can I Use UMS?
Enabling UMS is straightforward. You can set the default policy for all schedulers created, set the current scheduler, or create a separate scheduler, in each case with the SchedulerKind policy key set to UmsThreadDefault:
Scheduler::SetDefaultSchedulerPolicy( SchedulerPolicy(1, SchedulerKind, UmsThreadDefault) );
CurrentScheduler::Create( SchedulerPolicy(1, SchedulerKind, UmsThreadDefault) );
Scheduler* s = Scheduler::Create( SchedulerPolicy(1, SchedulerKind, UmsThreadDefault) );
If you set the default scheduler policy to UmsThreadDefault at the very start of main, the runtime uses UMS threads on a Windows version that supports them and falls back to traditional threads otherwise. This lets you keep supporting older operating systems while automatically taking advantage of UMS on the most modern ones.
Why not try it today with your existing application that uses the Concurrency Runtime in VS2010?
Where does UMS shine?
The promise of UMS is better performance and better application behavior. It does this through two mechanisms:
- Fast context switching: it is approximately 10x faster to switch between UMS threads compared to traditional threads;
- Better application control: UMS communicates blocking conditions in the kernel to the user-mode scheduler (such as ConcRT):
  - This allows the scheduler to efficiently select and reschedule work on the blocked hardware thread, effectively turning all blocking in the kernel into cooperative blocking.
  - When work is unblocked in the kernel, it is re-queued into the user-mode scheduler.
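The two behaviors in the list above can be pictured with a small toy model. This is only a sketch in portable C++: ToyScheduler, dispatch, on_blocked and on_unblocked are hypothetical names invented for illustration, not ConcRT or Windows UMS APIs, and a single simulated core stands in for a hardware thread.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Toy model only: none of these names are ConcRT or Windows UMS APIs.
struct ToyScheduler {
    std::deque<std::string> ready;     // tasks waiting for the simulated core
    std::vector<std::string> blocked;  // tasks parked in a simulated kernel wait
    std::string running;               // task on the core; empty when idle

    // Put the next ready task on the core, or leave the core idle.
    void dispatch() {
        if (ready.empty()) { running.clear(); return; }
        running = ready.front();
        ready.pop_front();
    }

    // The running task blocked in the kernel: park it and pick other work.
    void on_blocked() {
        assert(!running.empty());
        blocked.push_back(running);
        dispatch();
    }

    // A kernel wait finished: the task is re-queued; it does not preempt.
    void on_unblocked(std::string task) {
        for (auto it = blocked.begin(); it != blocked.end(); ++it) {
            if (*it == task) {
                blocked.erase(it);
                ready.push_back(task);
                if (running.empty()) dispatch();  // an idle core picks it up at once
                return;
            }
        }
    }
};
```

In this model a blocked task never leaves the core idle while other work is ready, and an unblocked task waits its turn in the ready queue instead of interrupting whatever is running.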
So while your context-switch performance is seriously improved by using UMS, the overall performance gains of an application will likely come from the thread-scheduling logic moving to where it can be directly influenced by the application’s code. Specifically, when using ConcRT with UMS, the runtime will select tasks that it believes are “related” to the task that is blocked in the kernel, such as those in the same ScheduleGroup or task_group. Through this mechanism, we hope to achieve better cache coherency, which may result in better performance. Also, because other tasks continue to execute, forward progress is made even if the original task is blocked on a condition that a later task will release.
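As a rough illustration of preferring “related” work, the sketch below picks the next task from the same group as the one that just blocked, and otherwise falls back to the oldest ready task. The article does not describe the runtime's actual heuristic, so this policy is an assumption; ToyTask and take_next are hypothetical names, and the integer group is only a stand-in for a schedule group or task group.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Toy model only: not a ConcRT type.
struct ToyTask {
    std::string name;
    int group;  // stand-in for a schedule group or task group
};

// Remove and return the task to run after a task in `blocked_group` blocked.
// Prefers the oldest ready task in the same group; otherwise takes the
// oldest ready task overall. Precondition: `ready` is not empty.
ToyTask take_next(std::deque<ToyTask>& ready, int blocked_group) {
    assert(!ready.empty());
    std::size_t pick = 0;
    for (std::size_t i = 0; i < ready.size(); ++i) {
        if (ready[i].group == blocked_group) { pick = i; break; }
    }
    ToyTask next = ready[pick];
    ready.erase(ready.begin() + static_cast<std::ptrdiff_t>(pick));
    return next;
}
```

The design intent is that a task from the same group is more likely to touch data that is already in the cache of the core that just blocked.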
With lower overhead for context switching, work may be efficiently decomposed into smaller tasks. In the long term, we expect that more decomposition and better cache coherency will result in better scalability.
In our producer-consumer micro-benchmark, UMS does extremely well. This test has tasks that read or create data, tasks that consume and modify that data, and tasks that present or write out the data. It also appears that applications with a lot of data flow or a large number of kernel operations show a benefit. For instance, we have one example where resizing a window with a number of elements to render speeds up significantly under UMS.
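The benchmark itself is not shown here, but the three-stage shape it describes might look roughly like the sketch below. It uses standard C++ threads rather than ConcRT so that it stays portable; Channel and run_pipeline are hypothetical names, and the doubling step is just a placeholder for "consume and modify". Each stage waits on a condition variable when its input queue is empty, which is the kind of blocking a user-mode scheduler would want to be told about.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal blocking queue; a negative value marks the end of the stream.
class Channel {
public:
    void send(int v) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(v); }
        cv_.notify_one();
    }
    int receive() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });  // blocks while empty
        int v = q_.front();
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
};

// Runs producer -> transformer -> writer over 1..n and returns the output.
std::vector<int> run_pipeline(int n) {
    assert(n >= 0);
    Channel raw, cooked;
    std::vector<int> out;
    std::thread producer([&] {
        for (int i = 1; i <= n; ++i) raw.send(i);  // create data
        raw.send(-1);
    });
    std::thread transformer([&] {
        for (int v = raw.receive(); v >= 0; v = raw.receive())
            cooked.send(v * 2);                    // consume and modify
        cooked.send(-1);
    });
    std::thread writer([&] {
        for (int v = cooked.receive(); v >= 0; v = cooked.receive())
            out.push_back(v);                      // write out
    });
    producer.join();
    transformer.join();
    writer.join();
    return out;
}
```

Because each channel has one sender and one receiver, items arrive in the order they were produced. On some toolchains the program needs to be linked with the platform's thread library.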
What we are looking for are more end-to-end scenarios that demonstrate UMS performance wins (or losses). We look forward to learning from your experience and feeding it back into the planning for our next release. So please try turning on UMS.
Wow, my performance changed!
Did it? Cool, I want to hear about it! Please drop some email to me at Dana dot Groff at Microsoft dot Com.
I want to hear about your scenarios and understand how UMS helps you! (And yes, if you saw degradation, I would be interested in that too.) As we look towards our next release and beyond, I want to be able to give better guidance when to use UMS, when not to use it, and see if there is anything we can do to improve our use of UMS.
Thank you,
Dana Groff
Senior Program Manager
Comments
Anonymous
August 29, 2011
Dana, just curious: when multiplexing, say, 50k TCP/IP sockets, can I expect similar performance from UMS if using blocking WSASend’s and WSARecv’s instead of IOCP? For obvious reasons, being able to use coroutines to handle blocking I/O would simplify things dramatically when compared to writing IOCP code. This almost seems too good to be true … pinch me ;) Thanks, JD
Anonymous
August 29, 2011
How do I modify the policy to ensure only a single thread is created within the process? I set the maxconcurrency to 1, but still see 4 threads in the process when using the UMS scheduler.
Anonymous
September 14, 2011
"RE: This almost seems too good to be true … pinch me ;)" Pinch :) From the brief description above, your best bet for performance would be to write IOCP code. From a performance perspective, UMS offers ultra-fast context switching. In your case, you are likely going to need throttling of your I/O. If you share your code snippet, I can definitely offer suggestions using our continuation task model, which might help overcome the usability problems of IOCP: blogs.msdn.com/.../tasks-and-continuations-available-for-download-today.aspx
Anonymous
September 14, 2011
Re: Single-threaded ConcRT. You have set the right policy key. Note that the abstraction for the developer is tasks, not threads. The way to think about the concurrency level is as the maximum number of tasks that run in parallel. ConcRT uses many threads underneath to quickly and efficiently map tasks to threads. So in your case, even though you see 4 threads, you are likely only going to have 1 task running at a time. [There are some main-thread nuances, but I won't get into them here.]
Anonymous
December 27, 2012
Apparently User-mode schedulable (UMS) threads are no longer supported in the Concurrency Runtime in Visual Studio 2012. See msdn.microsoft.com/.../dd492665(v=vs.110).aspx Does anyone know why this is the case?