What is IRQL?

Jake Oshins wanted to write about IRQLs and I am gladly letting him use my blog as a platform.  Here it is…

I’ve found myself explaining IRQL a lot lately, sometimes to people who are trying to write Windows drivers and sometimes to people who are accustomed to Linux or some other variant of Unix and want to know why something like IRQL is required within Windows when those systems so clearly get by without it.

Penny Orwick covered this topic before, in the following two papers, with a lot of help from me and some others:

www.microsoft.com/whdc/driver/kernel/irql.mspx

www.microsoft.com/whdc/driver/kernel/locks.mspx

I’ll try to do it a little more briefly here.

Computers have many things within them that can interrupt a processor.  These include timers, I/O devices, other processors, internal processor performance counters, etc.  All processors have an instruction for disabling interrupts, somehow, but that instruction (cli in x64 processors) isn’t selective about which interrupts it disables.

The people who built DEC’s VMS operating system also helped design the processors that DEC used, and many of them came to Microsoft and designed Windows NT, which was the basis for modern versions of Windows, including Windows XP and Windows 7.  These guys wanted a way to disable (very quickly) just some of the interrupts in the system.  They considered it useful to hold off interrupts from some sources while servicing interrupts from other sources.

They also realized that, just as you must acquire locks in the same order everywhere in your code to avoid deadlocks, you must also service interrupts with the same relative priority every time.  It doesn’t work if the clock interrupts are sometimes more important than the IDE controller’s interrupts and sometimes they aren’t. 

Interrupts are frequently called “Interrupt ReQuests” and the priority of a specific IRQ is its Level.  These letters, all run together, are IRQL.

So if you lay out all the interrupt sources in the system and create a priority for each one, or sometimes a priority for each group, you can start to do interesting things. 

Consider a spinlock.  Spinlocks (at least in the traditional sense) are implemented by having a processor spin in a tight loop trying to atomically modify a variable.  The cache coherency hardware guarantees that only one processor can do that at a time, so lock acquisition goes only to the processor that succeeds.  Other processors keep spinning until they succeed.
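To make the spinning concrete, here is a minimal user-mode sketch using C11 atomics.  It is a toy model, not the NT KSPIN_LOCK: the `toy_spinlock` type and the `toy_spin_*` names are invented for illustration, and a real kernel lock has more to it (IRQL handling, for one, which comes up below).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-mode model of a spinlock; not the NT KSPIN_LOCK. */
typedef struct {
    atomic_flag held;   /* set while some processor owns the lock */
} toy_spinlock;

static void toy_spin_init(toy_spinlock *lock)
{
    atomic_flag_clear(&lock->held);
}

/* One atomic attempt.  Only the caller that flips the flag from clear
 * to set gets true back; everyone else sees it already set. */
static bool toy_spin_try_acquire(toy_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->held,
                                              memory_order_acquire);
}

/* Spin in a tight loop until the atomic attempt succeeds. */
static void toy_spin_acquire(toy_spinlock *lock)
{
    while (!toy_spin_try_acquire(lock)) {
        /* busy-wait: this processor does no useful work here */
    }
}

static void toy_spin_release(toy_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

The empty loop body is the whole point: every waiter burns processor time until the owner calls `toy_spin_release`.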

The processor that “owns” the lock needs to release the lock as soon as possible, as the other (waiting) processors are burning up processor time waiting to acquire the lock.  So you really don’t want to interrupt that processor and schedule some other thread for execution, causing all the waiters to spin until the owning thread is rescheduled.

In this situation, some operating systems encourage the owner of the spinlock to disable all interrupts so that the code can’t be interrupted.  (Note, too, that interrupts really need to be disabled before trying to acquire the lock, or the thread might be interrupted between acquiring the lock and disabling interrupts.)

The designers of VMS and NT decided that they didn’t want to disable all interrupts just because some code somewhere acquired a spinlock.  Some things shouldn’t wait.  TLB flushes are a good example.  So if only some interrupts are disabled while a spinlock is held, then you can still briefly interrupt the code that owns the lock for much more important tasks.  Perhaps even more importantly, you can also interrupt the processors that are spinning while they wait to acquire the lock, so that they handle these important tasks instead of just spinning.
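A rough way to picture selective masking is a per-processor "current level" plus a set of pending interrupts.  The sketch below is a toy model in portable C, not kernel code: `toy_cpu` and the `toy_*` functions are hypothetical names, the "handler" is only a counter, and the level numbers are borrowed from the x86 column of the table further down.

```c
#include <stdint.h>

/* Toy model only: hypothetical names, not the NT kernel API. */
enum {
    TOY_PASSIVE_LEVEL  = 0,
    TOY_DISPATCH_LEVEL = 2,
    TOY_IPI_LEVEL      = 29,
    TOY_HIGH_LEVEL     = 31
};

typedef struct {
    int      irql;           /* current level of this simulated processor */
    uint32_t pending;        /* bit n set: an interrupt at level n is waiting */
    int      delivered[32];  /* how many interrupts "ran" at each level */
} toy_cpu;

/* Run every pending interrupt whose level is strictly above the current
 * IRQL, highest level first.  Anything at or below the IRQL stays pending. */
static void toy_deliver_pending(toy_cpu *cpu)
{
    for (int level = TOY_HIGH_LEVEL; level > cpu->irql; level--) {
        uint32_t bit = UINT32_C(1) << level;
        if (cpu->pending & bit) {
            cpu->pending &= ~bit;
            cpu->delivered[level]++;   /* stand-in for running the handler */
        }
    }
}

/* A source requests an interrupt at the given level (1..31). */
static void toy_request_interrupt(toy_cpu *cpu, int level)
{
    cpu->pending |= UINT32_C(1) << level;
    toy_deliver_pending(cpu);
}

/* Raising masks everything at or below new_irql; returns the old level. */
static int toy_raise_irql(toy_cpu *cpu, int new_irql)
{
    int old_irql = cpu->irql;
    cpu->irql = new_irql;
    return old_irql;
}

/* Lowering unmasks, so anything that piled up gets delivered now. */
static void toy_lower_irql(toy_cpu *cpu, int old_irql)
{
    cpu->irql = old_irql;
    toy_deliver_pending(cpu);
}
```

With the simulated processor raised to level 2, a level-2 request just sits in `pending`, while a level-29 request (think of the TLB flush example) is delivered immediately.  Lowering back to 0 then delivers the held-off level-2 request.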

Note that this means that every spinlock has an associated IRQL, and you have to use that IRQL consistently, or the machine will deadlock.  In NT, by default, every spinlock has the same IRQL, called DISPATCH_LEVEL.  DISPATCH_LEVEL means, essentially, that the interrupts which can cause a thread to stop running are disabled.  (More about that later.)
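The ordering matters: raise first, then spin; release first, then lower.  Here is a small sketch of that discipline, again as a user-mode toy with invented names (`toy_cpu`, `toy_spinlock`, `toy_acquire_at_dispatch`, `toy_release_from_dispatch`).  In a real driver, KeAcquireSpinLock and KeReleaseSpinLock take care of this for you, handing back and accepting the previous IRQL.

```c
#include <stdatomic.h>

/* Toy model only: hypothetical names, not the NT kernel API. */
enum { TOY_PASSIVE_LEVEL = 0, TOY_DISPATCH_LEVEL = 2 };

typedef struct { int irql; } toy_cpu;
typedef struct { atomic_flag held; } toy_spinlock;

/* Raise to the lock's IRQL *before* trying to take the lock, so the
 * owner can never be preempted between acquiring and raising.
 * Returns the previous IRQL so the caller can restore it later. */
static int toy_acquire_at_dispatch(toy_cpu *cpu, toy_spinlock *lock)
{
    int old_irql = cpu->irql;
    cpu->irql = TOY_DISPATCH_LEVEL;
    while (atomic_flag_test_and_set_explicit(&lock->held,
                                             memory_order_acquire)) {
        /* spin until the current owner releases */
    }
    return old_irql;
}

/* Mirror image: drop the lock first, then restore the old IRQL. */
static void toy_release_from_dispatch(toy_cpu *cpu, toy_spinlock *lock,
                                      int old_irql)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
    cpu->irql = old_irql;
}
```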

Here’s a table of all IRQLs, as defined in the Windows NT header files (easily seen in the WDK.)

| IRQL | x86 IRQL Value | AMD64 IRQL Value | IA64 IRQL Value | Description |
|---|---|---|---|---|
| PASSIVE_LEVEL | 0 | 0 | 0 | User threads and most kernel-mode operations |
| APC_LEVEL | 1 | 1 | 1 | Asynchronous procedure calls and page faults |
| DISPATCH_LEVEL | 2 | 2 | 2 | Thread scheduler and deferred procedure calls (DPCs) |
| CMC_LEVEL | N/A | N/A | 3 | Correctable machine-check level (IA64 platforms only) |
| Device interrupt levels (DIRQL) | 3-26 | 3-11 | 4-11 | Device interrupts |
| PC_LEVEL | N/A | N/A | 12 | Performance counter (IA64 platforms only) |
| PROFILE_LEVEL | 27 | 15 | 15 | Profiling timer for releases earlier than Windows 2000 |
| SYNCH_LEVEL | 27 | 13 | 13 | Synchronization of code and instruction streams across processors |
| CLOCK_LEVEL | N/A | 13 | 13 | Clock timer |
| CLOCK2_LEVEL | 28 | N/A | N/A | Clock timer for x86 hardware |
| IPI_LEVEL | 29 | 14 | 14 | Interprocessor interrupt for enforcing cache consistency |
| POWER_LEVEL | 30 | 14 | 15 | Power failure |
| HIGH_LEVEL | 31 | 15 | 15 | Machine checks and catastrophic errors; profiling timer for Windows XP and later releases |

For driver writers, the only IRQLs that are usually interesting are 0 through 2 and DIRQL.  It’s worth mentioning, though, that the NT kernel itself internally has spinlocks at DISPATCH_LEVEL and all the levels above that.

So, now for a tour of interesting IRQLs:

PASSIVE_LEVEL

This is the level at which threads run.  In fact, if you look at the specific definition of “thread” in NT, it pretty much only covers code that runs in the context of a specific process, at PASSIVE_LEVEL or APC_LEVEL.  Deferred Procedure Calls (DPCs) are not threads, in that sense.

Any interrupt can occur at PASSIVE_LEVEL.  User-mode code executes at PASSIVE_LEVEL.

APC_LEVEL

Windows NT has an interesting mechanism for getting into a certain thread context.  You can queue an interrupt to a thread, so that your function will run on that thread’s stack, with that thread’s address space, with that thread’s local storage.  This is useful for I/O completion.  When I/O completes, you queue an APC back to the requesting thread which does the last part of I/O completion in the initiator’s address space.  It’s a neat way to solve a bunch of problems.

If you want to disable APCs to your thread, you raise to APC_LEVEL.  At least that was the original design.  APCs and the rules around them have grown much more complicated over the years.  At this point, the best that you can say is that if you care to disable APCs, call KeEnterCriticalRegion (msdn.microsoft.com/en-us/library/ms801955.aspx) or KeEnterGuardedRegion (msdn.microsoft.com/en-us/library/ms801643.aspx).

Your code generally won’t need to run at APC_LEVEL at all, unless you use Fast Mutexes (msdn.microsoft.com/en-us/library/aa490219.aspx).  Fast Mutexes are somewhat faster than Mutexes (msdn.microsoft.com/en-us/library/aa490228.aspx) or other dispatcher objects because, among other things, they hold off APCs by raising to APC_LEVEL.

APC interrupts, by the way, are sent by a processor, either to itself or to another processor.  No external device is involved.

DISPATCH_LEVEL

Windows NT doesn’t have a “scheduler” in the sense that most Unix variants do.  There is no process that decides which other processes should run.  Each processor “dispatches” itself by looking at runnable threads and deciding which one to run next.  This is a scheduler, of sorts, but not the same thing that many people coming from Linux will imagine.

The dispatcher is interrupt driven, in that it won’t allow a thread to run longer than its quantum before scheduling another thread.  But the scheduling clock doesn’t generate dispatcher interrupts directly.  The clock interrupt fires at CLOCK_LEVEL, somewhat more frequently than the thread scheduling quantum.  Various housekeeping tasks happen as a result of the clock interrupt, and one of them is that a dispatcher interrupt is generated by the processor to itself.  (Actually, this internal self-interrupt is often optimized away, but the architectural result is the same as if an interrupt were generated.)

If your code raises IRQL to DISPATCH_LEVEL, you have disabled the dispatcher on that processor, and only on that processor.  This means that your thread will not be pre-empted by another thread and it will not be moved to another processor until you lower IRQL.
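The interplay between the clock tick, the quantum, and DISPATCH_LEVEL can be sketched as a small state machine.  This is a toy model with invented names (`toy_cpu`, `toy_clock_tick`, `toy_set_irql`) and an arbitrary three-tick quantum chosen for illustration.  It ignores everything the real dispatcher does about priorities and thread selection, and only shows that a dispatch request made while the processor is at DISPATCH_LEVEL is held until IRQL drops.

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, not the NT kernel API. */
enum { TOY_PASSIVE_LEVEL = 0, TOY_DISPATCH_LEVEL = 2 };

#define TOY_QUANTUM_TICKS 3   /* arbitrary value, assumed for illustration */

typedef struct {
    int  irql;              /* current level of this simulated processor */
    int  quantum_left;      /* clock ticks left for the running thread */
    bool dispatch_pending;  /* a dispatcher interrupt has been requested */
    int  context_switches;  /* how many times another thread was picked */
} toy_cpu;

/* The dispatcher interrupt is only delivered below DISPATCH_LEVEL. */
static void toy_run_dispatcher_if_allowed(toy_cpu *cpu)
{
    if (cpu->dispatch_pending && cpu->irql < TOY_DISPATCH_LEVEL) {
        cpu->dispatch_pending = false;
        cpu->quantum_left = TOY_QUANTUM_TICKS;  /* next thread, fresh quantum */
        cpu->context_switches++;
    }
}

/* Clock interrupt: does its housekeeping and, when the quantum runs
 * out, requests a dispatcher interrupt rather than switching directly. */
static void toy_clock_tick(toy_cpu *cpu)
{
    if (cpu->quantum_left > 0 && --cpu->quantum_left == 0) {
        cpu->dispatch_pending = true;
    }
    toy_run_dispatcher_if_allowed(cpu);
}

/* Changing IRQL may unmask a dispatch request that was held off. */
static void toy_set_irql(toy_cpu *cpu, int irql)
{
    cpu->irql = irql;
    toy_run_dispatcher_if_allowed(cpu);
}
```

If the quantum expires while the simulated processor sits at DISPATCH_LEVEL, `context_switches` stays put until `toy_set_irql` brings the level back down.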

Resolving a page fault may require I/O to read the page in from disk.  Since, as noted above, I/O completion depends on code running at APC_LEVEL, and since APC_LEVEL code won’t run while the processor is at DISPATCH_LEVEL, page faults can’t be resolved at DISPATCH_LEVEL.  So code that holds a DISPATCH_LEVEL lock (like a spinlock) can’t reference memory which might be paged out.

Furthermore, most of the locking primitives that the NT kernel provides are what are called “dispatcher objects” (msdn.microsoft.com/en-us/library/aa490210.aspx).  You can wait on dispatcher objects until they are signaled and, while your code is waiting, the processor is free to get other work done, on behalf of other threads.  This is nice, because, in contrast with the spinlock, which consumes the processor doing no useful work while it’s waiting, dispatcher objects allow the dispatcher to find other work until the reason for waiting can be satisfied.

What this means to you, though, is that you can’t wait on a dispatcher object at DISPATCH_LEVEL.  You’ve already disabled the dispatcher.  Your only choice at DISPATCH_LEVEL is a spinlock.
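The rules from the last few paragraphs boil down to a small lookup.  The sketch below is a simplification with hypothetical names (`toy_operation`, `toy_is_allowed`), covering only the three operations discussed here.  Real driver code would typically compare KeGetCurrentIrql() against these levels rather than pass the level around.

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, not the NT kernel API. */
enum { TOY_PASSIVE_LEVEL = 0, TOY_APC_LEVEL = 1, TOY_DISPATCH_LEVEL = 2 };

typedef enum {
    TOY_OP_ACQUIRE_SPINLOCK,          /* a default (DISPATCH_LEVEL) spinlock */
    TOY_OP_WAIT_ON_DISPATCHER_OBJECT, /* block until the object is signaled */
    TOY_OP_TOUCH_PAGEABLE_MEMORY      /* might fault and need paging I/O */
} toy_operation;

/* Returns whether the operation is safe at the given IRQL. */
static bool toy_is_allowed(toy_operation op, int irql)
{
    switch (op) {
    case TOY_OP_ACQUIRE_SPINLOCK:
        /* Fine at or below the lock's own level. */
        return irql <= TOY_DISPATCH_LEVEL;
    case TOY_OP_WAIT_ON_DISPATCHER_OBJECT:
    case TOY_OP_TOUCH_PAGEABLE_MEMORY:
        /* Both need the dispatcher, which is off at DISPATCH_LEVEL. */
        return irql < TOY_DISPATCH_LEVEL;
    }
    return false;
}
```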

DIRQL

“DIRQL” is the shorthand that many people (internal to Microsoft and external) use when they mean “the IRQL that the PnP manager assigned to my device’s interrupt, and the associated interrupt spinlock and interrupt service routine.”  When a bus driver requests an interrupt for a device (as when the PCI driver finds the Interrupt Pin register set to some non-zero value, or when it discovers an MSI-X table) it tells the PnP manager two things.  First, it says that the device needs to register an ISR or a set of ISRs.  Next it says something about how the device is attached to any interrupt controllers present in the machine.  The PnP manager picks a processor to attach the interrupt to and picks the IRQL for that interrupt.  Sometimes that choice is constrained by the way the wires are laid out on the motherboard, sometimes not.  That topic is too big for this post.  (I might go into it later.  I wrote the code.)

As you can see from the table above, there is more than one DIRQL.  Unless your device generates more than one interrupt, you don’t really have to care.  Just pass along the values that you were given.  Your interrupt spinlock’s IRQL is that which was assigned to you.  The only thing you have to know about it is that acquiring that lock means that you’ve pre-empted everything happening at lower IRQL.  You haven’t pre-empted things like TLB updates, though, as those still come in at higher IRQL.

If your device does generate more than one interrupt, and if you need one spinlock that is used for both interrupt sources, you need to register your interrupt service routines with the highest of your DIRQLs as the SynchronizeIrql, which will avoid deadlocks by guaranteeing that all your interrupt-related code runs at the highest necessary IRQL.
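Picking that value is just a maximum over the levels you were assigned.  A minimal sketch, with a hypothetical helper name (`toy_pick_synchronize_irql`) and `unsigned char` standing in for the kernel’s KIRQL type:

```c
#include <stddef.h>

/* Hypothetical helper: returns the highest of the DIRQLs assigned to a
 * device's interrupts.  Returns 0 if the list is empty. */
static unsigned char toy_pick_synchronize_irql(const unsigned char *dirqls,
                                               size_t count)
{
    unsigned char highest = 0;
    for (size_t i = 0; i < count; i++) {
        if (dirqls[i] > highest) {
            highest = dirqls[i];
        }
    }
    return highest;
}
```

For a device assigned levels 5, 9 and 7, this yields 9, and that single value is what each of the device’s ISRs would be registered with as the SynchronizeIrql.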

In summary, IRQL is a concept that was intended to allow spinlocks to be sorted into more-important and less-important buckets, so that some interrupts can occur while other interrupts are disabled.

Most people agree that this is fairly complex to work with.  Whether you believe this was a necessary addition to the driver model is the source of a debate that’s been raging on the ‘net since before Windows NT actually existed.

- Jake Oshins

Comments

  • Anonymous
    June 20, 2010
    It says: "If your code raises IRQL to DISPATCH_LEVEL, you have disabled the dispatcher on that processor, and only on that processor." Now I'm confused. Does that mean the dispatcher is still enabled on other processors and can still dispatch other threads? Then how will it cause a deadlock when code running at dispatch level touches paged memory? Paged memory can still be fetched using other processors, which can still run at APC_LEVEL to complete paging I/O. Please clarify. Thanks.

  • Anonymous
    June 20, 2010
    Yes, other processors can be running at passive level where the dispatcher can run and schedule other threads on them.  Your confusion comes from 2 misunderstandings:

  1. all of the other processors could also be at dispatch level or higher and not able to satisfy the page fault and more importantly,
  2. the processor that takes the page fault on a paged memory access waits synchronously for that fault to be resolved and the page to be brought in. you can't wait synchronously at dispatch level regardless of what the other cores are doing. d
  • Anonymous
    June 22, 2010
    Thank you. the www.microsoft.com/.../irql.mspx says "Driver code that is running above PASSIVE_LEVEL (either at PASSIVE_LEVEL in a critical region or at APC_LEVEL or higher) cannot be suspended" but msdn.microsoft.com/.../ff544337(VS.85).aspx says "ExAcquireFastMutex puts the caller into a wait state if the given fast mutex cannot be acquired immediately" and if the thread enters wait state, then this thread can be suspended and another thread scheduled to run. right?

  • Anonymous
    June 22, 2010
    you are confusing 2 types of suspension.  the IRQL document on whdc that you are referring to means that you cannot suspend the thread using SuspendThread or equivalent KM APIs at passive in a crit region or at APC or higher since suspending a thread requires sending an APC to that thread and APCs are only processed at passive outside of a critical region.  This type of suspend can be infinite (and thus a denial of service if a user mode thread can suspend a kernel mode thread which is holding a resource needed by others in the kernel).  The second suspend where you are put into a wait state if the fast mutex cannot be acquired is a wait on a synchronization object that will resume when the wait is satisfied. This wait should not be infinite. d

  • Anonymous
    June 25, 2010
    Thanks. useful article- and I'm not a driver writer or a coder!

  • Anonymous
    August 25, 2010
    Good article. Can you please explain if any of this changes for win7?

  • Anonymous
    August 25, 2010
    none of this changed for win7

  • Anonymous
    November 19, 2010
    I had one basic question: why do we need IRQL.  Doesn't IRQ provide an interrupt priority scheme and can't we program the PIC for setting priority?

  • Anonymous
    November 19, 2010
    Windows runs on platforms that don't have a PIC. IRQL is the abstraction over the hardware.  For instance, it abstracts the IO APIC as well. d

  • Anonymous
    November 20, 2010
    Thanks Doron, that makes sense.  I was reading more related to interrupts and had couple of questions.  

  1. Where is IDT stored?  I read that IDTR register in CPU stores both the physical base address and the length in bytes of the IDT - but does HAL initialize/allocate the IDT as well as IDTR?
  2. I was also keen to find out more about MSI, I read that in MSI Model - "Device interrupts via memory write transactions on the PCI bus", I couldn't get how CPU gets interrupted by a write transaction.  I was keen to know more, will try to search for docs online, appreciate if you can give any pointers.

  • Anonymous
    December 16, 2010
    Which MS team handles IRQL issues?

  • Anonymous
    December 17, 2010
    King, there is no team that handles IRQL issues per se.  IRQL is just an abstraction provided by the kernel and HAL.  What particular issue are you seeing?