Display Device Performance Profile Analysis (Windows Embedded CE 6.0)
1/6/2010
The following topic examines the data obtained from initial profiling run in How to Profile and Optimize a Display Driver.
Monte Carlo Profiling Data
After running ProfApp.exe the data in your Debug window should resemble the following output.
Note
For clarity, the tick count, process IDs, and thread IDs have been removed from the following output. Also, the specific timer values vary from run to run because of the manual process of controlling the profiling process.
Kernel Profiler: Gathering MonteCarlo data in buffered mode
ProfileStart() : Allocated 13960 kB for Profiler Buffer (0x48000000)
Starting profile timer at 200 uS rate
ProfApp: Took 34297 ms to perform blts.
Kernel Profiler: Looking up symbols for 503297 hits.
.
.
(Additional lines omitted for clarity.)
.
.
Total samples recorded = 503297
Module Hits Percent
------------ ---------- -------
nk.exe 332367 66.0
ddi_flat.dll 170117 33.8
gwes.dll 662 0.1
coredll.dll 108 0.0
ProfApp.exe 25 0.0
relfsd.dll 3 0.0
filesys.exe 2 0.0
shell.exe 2 0.0
explorer.exe 2 0.0
pm.dll 1 0.0
UNKNOWN 8 0.0
Hits Percent Address Module Routine
---------- ------- -------- ------------:---------------------
328544 65.2 802351c2 nk.exe :_IDLE_STATE
101277 20.1 03db634d ddi_flat.dll:?EmulatedBlt_Internal
64291 12.7 03db7c7f ddi_flat.dll:?MaskedSrcToMaskedDst
3864 0.7 03db2068 ddi_flat.dll:?CursorOn
1351 0.2 8025a52f nk.exe :_NE2000_WRITE_PORT_UCHAR
1181 0.2 8025a53b nk.exe :_NE2000_READ_PORT_UCHAR
345 0.0 03db2268 ddi_flat.dll:?CursorOff
162 0.0 80229a7f nk.exe :_PerfCountSinceTick
153 0.0 8025a54b nk.exe :_NE2000_READ_PORT_USHORT
150 0.0 0003d906 gwes.dll :?dwRealizeColor
98 0.0 8025ad06 nk.exe :_HandleReceive
(Additional lines omitted for clarity.)
1 0.0 0001f663 explorer.exe:?ItemCount
1 0.0 00029696 explorer.exe:_GetSysColorBrush
33 0.0 :<UNACCOUNTED FOR>
DispPerf.exe Data
The following tables show the single set of data from running ProfApp.displayed in several smaller tables.
The following table shows the overall summary of the number of times each raster operation was called.
For information about converting the ROP codes reported by DispPerf to the ROP codes listed in Ternary Raster Operations, see Display Driver Performance Profiling.
RopCode | cTotal |
---|---|
0x0000CCCC |
1063 |
0x0000F0F0 |
6222 |
0x00008888 |
59 |
0x00006666 |
52 |
0x0000AAF0 |
4537 |
0x0000EEEE |
7 |
0xFEFEFFF1 |
49 |
0x0000E2E2 |
1 |
0x00005555 |
378 |
The following table shows the profiling results from all the ROPs performed by GPE functions.
RopCode | cGPE | dwGPETime | Avg.GPETime |
---|---|---|---|
0x0000CCCC |
1045 |
40923908 |
39161 |
0x0000F0F0 |
668 |
21737 |
32 |
0x00008888 |
59 |
41647 |
705 |
0x00006666 |
52 |
28160 |
541 |
0x0000AAF0 |
3957 |
80123 |
20 |
0x0000EEEE |
0 |
0 |
0 |
0xFEFEFFF1 |
49 |
133721 |
2729 |
0x0000E2E2 |
1 |
355 |
355 |
0x00005555 |
0 |
0 |
0 |
The following table shows the profiling results from all the ROPs performed emulation functions. For more information, see BitBlT Emulation Library Functions.
RopCode | cEmul | dwEmulTime | Avg.EmulTime |
---|---|---|---|
0x0000CCCC |
18 |
94149 |
5230 |
0x0000F0F0 |
5554 |
486315 |
87 |
0x00008888 |
0 |
0 |
0 |
0x00006666 |
0 |
0 |
0 |
0x0000AAF0 |
580 |
63332 |
109 |
0x0000EEEE |
7 |
13033 |
1861 |
0xFEFEFFF1 |
0 |
0 |
0 |
0x0000E2E2 |
0 |
0 |
0 |
0x00005555 |
378 |
42907 |
113 |
The DispPerf results do not show profiling results for hardware calls because the settings in How to Profile and Optimize a Display Driver are based on the FLAT driver, which is a general purpose driver that does not make use of advanced hardware capabilities.
Analysis of the Profiling Results
The Monte Carlo profiling results reveal the functions in which the most time was spent during the profiling period.
Specifically, these results show that the system spent the most time in the idle state, most likely waiting for the manual interaction required to start and stop the profiler.
Idle time is both expected and unavoidable with this method, so this result is neither surprising nor interesting for optimizing the display driver.
The second and third lines are more significant for optimization.
The second line shows that a very large percentage of time — well over 50 percent of the total when the idle time is factored out — is spent in the driver function EmulatedBlt_Internal.
The next most time-consuming function is MaskedSrcToMaskedDst, taking up nearly 40 percent of the execution time once the idle time is factored out.
The DispPerf.exe results show that the driver spent two orders of magnitude more time performing the 0x0000CCCC (SRCCOPY) raster operation in GPE than during any other operation. This highlights it as a prime candidate for optimization.
A search through the source code shows that the two functions are defined in files located in %_WINCEROOT%\Public\Common\OAK\Drivers\Display\GPE. EmulatedBlt_Internal is defined in Swblt.cpp and MaskedSrcToMaskedDst is defined in Swconvert.cpp.
Examining this code to see how the specific ROP is processed reveals that both of these functions are involved in converting color from the RGB555 format used in the application to the RGB565 format used by the display hardware. EmulatedBlt_Internal is a GPE function called for the blit operation.
It then calls MaskedSrcToMaskedDst for the format conversion.
The function EmulatedBlt_Internal is of particular concern because it is a standard GPE function.
This indicates that not only is there no hardware support for this conversion in the graphics device, there also is no emulation library function for the conversion either.