Compartir a través de


Parallel Scalability Isn’t Child’s Play

In a recent blog entry, Dr. Neil Gunther, a colleague from the Computer Measurement Group (CMG), warned about unrealistic expectations being raised with regard to the performance of parallel programs on current multi-core hardware. Neil’s blog entry highlighted a dismal parallel programming experience publicized in a recent press release from the Sandia Labs in Albuquerque, New Mexico. Sandia Labs is a research facility operated by the U.S. Department of Energy.

According to the press release, scientists at Sandia Labs

simulated key algorithms for deriving knowledge from large data sets. The simulations show a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added.” They concluded that this retrograde speed-up was due to deficiencies in “memory bandwidth as well as contention between processors over the memory bus available to each processor.

Holy cow. The Lab’s scientists, who are heavily invested in parallel programming on supercomputers, simulated running programs on sixteen cores encapsulating “key algorithms for deriving knowledge from large data sets” that gave no better performance than running the same program on two cores.

Please note that these are simulated performance results, because 16-core machines of the type being simulated don’t currently exist. Indeed, I would not expect that 16-core machines of the type being simulated would ever exist. Which leads me to wonder what the point of this Sandia Labs exercise was.

Of course, for developers experienced in parallel programming, this result actually isn’t in itself all that surprising. Quite frequently, experienced developers find that running their multi-threaded application on massively parallel hardware does not scale well with the hardware capabilities. This was apparently the case for the applications the Sandia Labs folks simulated. So what? Should we just give up in our quest for parallel program scalability?

Before drilling into Dr. Gunther’s specific interest in this disclosure, it is worth looking into the Sandia Labs finding in a bit more detail. For instance, did anyone, besides me, wonder what applications were being simulated?

In theory, “deriving knowledge from large data sets” is a category of computing program that readily lends itself to a solution using an SIMD (Single Instruction, Multiple Data) approach. The canonical example of an SIMD approach to “deriving knowledge from large data sets” is a database Search function conducted in parallel where the data set of interest is partitioned across n processing units and their locally attached disks. For example, when the Thinking Machines CM-1 supercomputer publicly debuted in the mid-80s, the company demonstrated its capabilities using a parallel search of a database that was partitioned across all 64K nodes of the machine, which was based on the Connection Machine originally designed by MIT whiz kid Danny Hillis. Parallel search when executed across a partitioned dataset should scale linearly, or close enough for government work (pun intended).

Whenever a problem lends itself to an SIMD approach (also known as “divide and conquer”), linear scalability of the SIMD algorithm does require first partitioning the data being accessed and then proceeding to process that data in parallel. I am sure the point of the Sandia Labs press release was not to disparage the SIMD approach to parallel processing; after all, that is a tried-and-true technique that they have used with great success over the years.

On the contrary, it appears to be a critique of an approach to building parallel processing hardware where you would increase the number of processing cores on the chip (just because you can with the most current semiconductor fabrication technology) without scaling the memory bandwidth proportionally. Since that is not what is happening hardware-wise, it strikes me that this implied criticism of the multi-core hardware strategy Intel and AMD are pursuing is slaying a non-existent dragon. Both Intel and AMD recognize that memory bus bandwidth is a significant potential bottleneck in their multi-core products, and, as a result, both manufacturers are attempting to scale memory bandwidth proportional to the amount of processing power they deliver on a chip.

So then what is all the fuss about? The Sandia Labs “news” starts to look like something the blogosphere is latching onto on an otherwise slow day for tech news, raising an alarm & potentially misleading naïve readers about what the conventional wisdom in multiprocessor chip architecture would be if anyone were actually trying to build multi-core microprocessors that way.

Building a better multicore processor.

The point of the Sandia Labs press release publicizing these simulation results appears to be to suggest what they consider better approaches to packaging multi-core processors on a single socket. They released the following chart that that makes this point (reproduced here in Figure 1):

Sandia Labs multicore simulation results

Figure 1. Sandia Labs simulation showing performance of their application vs. the number of processors.

Exactly what the Sandia Labs folks are reporting here is a little sketchy. Presumably, the simulations are based on observing the behavior of some of their key programs where they were able to measure performance running on “conventional” multi-core processors, perhaps, something like the quad-core machine I recently installed for my desktop that uses a memory bus with bandwidth in the range of 10 GB/sec. The press release seems to imply that the Sandia Labs baseline measurements were taken on current quad-core machines from Intel like mine, not the newer Nehalem processors where the memory architecture has been re-worked extensively. How useful or meaningful the results that Sandia Labs published may turn on this crucial point.

This Sandia Labs simulation then extrapolates out to 16 cores per socket (and beyond), simulating the manufacturer adding more cores to the die, apparently leaving the memory architecture fundamentally unchanged as they moved to more cores. The Sandia Labs chart in Figure 1 is labeled to indicate that the memory bandwidth was held constant at 10 GB/sec.

This is more than a little suspicious. Hardware manufacturers like Intel and AMD understand clearly that the memory bus has to scale with the number of processors. The AMD HyperTransport bus architecture is quite explicit about this, and the latest spec for HyperTransport version 3.1 has an aggregate bandwidth in excess of 50 GB/sec.

Meanwhile, the memory bus capacity on the latest Nehalem-class processors from Intel has been boosted significantly. Alternatively, it is when you cannot scale the memory bus with processor capacity that machines with NUMA architectures become more attractive. The AMD processors use HyperTransport links in a ring topology that implicitly leads to NUMA-characteristics.

In Intel’s approach to NUMA scalability, some small number of processors share a common memory bus, forming a node. Current Nehalem machines (also known as the Core i7 architecture) have four cores sharing the Front-side memory bus (FSB). The physical layout of this chip is photographed in Figure 2, showing four cores, connected to DDR3 DRAM using an integrated memory controller. I wasn’t able to come find a speed rating for the FSB in the Nehalem on Intel’s web site or elsewhere, other than ballpark estimates that puts its speed in the range of 30-40 GB/sec. The QuickConnect technology links that are used to link memory controllers support 25 GB/sec transfers, which is probably a safe lower bound on the capacity of the FSB.

Core i7 4-way multiprocessor photo

Figure 2. Aerial photograph showing the layout of the 4-way Core i7 (Nehalem) microprocessor chip.

The PIM architecture, whose scalability curves are close to ideal for the Sandia Labs workloads is, probably not coincidentally, a processor architecture championed at Sandia Labs. The idea behind PIM machines (Processor In Memory) is that the processor (or processors) is embedded into the memory chip itself, which is a pretty interest approach to solving the “memory wall” that limits performance in today’s dominant computer architectures. Instead of loading up the microprocessor socket with more and more cores, which is the professed hardware roadmap at Intel & AMD, integrating memory into the socket is an intriguing alternative. Such machines, if anyone were to build them, would obviously have NUMA performance characteristics.

The debate is a bit academic for my taste, however, until these PIM architecture machines are a reality. For PIM architecture machines to ever get traction, either the microprocessor manufacturers would have to start building DRAM chips or the DRAM manufacturers would have to start building microprocessors. The way the semiconductor fabrication business is stratified today, that does not appear to be very likely in the near future.

So, in the end, the point of the Sandia Labs press release appears to be trying to publicize the multiprocessor hardware direction espoused mainly by Sandia Labs’ own researchers. Frankly, there have been lots and lots of different architectural approaches to parallel processing over the years, and it doesn’t look like any one approach is optimal for all computing situations. You ought to be pick another parallel programming workload to simulate in Figure 1 and get an entirely different ranking of the approaches.

Still, the Sandia Labs simulation data are interesting mainly for they say about how difficult it is going to be for developers to write parallel programs that scale well on multi-core machines. No, achieving parallel isn’t child’s play for hardware manufacturers. Nor is it for software developers attempting to take advantage of parallel processing hardware, which is the subject I will start to drill into next time.

Continue to Part 2.....

Comments

  • Anonymous
    March 16, 2009
    There is no FSB in Nehalem processor based platforms. Nehalelem uses Intel QuickPath Interconnect technology to connect multiple processor and the I/O hub. See http://www.intel.com/technology/quickpath/index.htm for more information.

  • Anonymous
    March 17, 2009
    The comment has been removed

  • Anonymous
    April 28, 2009
    Part 1 of this series of blog entries discussed results from simulating the performance of a massively

  • Anonymous
    June 09, 2009
    Don't most Web servers theoretically host massively parallel apps?  Each HTTP request is processed by spinning off a thread, and HTTP is ideal for this since its stateless. So,  according to this article, IIS wouldn't perform exponentially well going from two to four cores and higher: http://blogs.msdn.com/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx

  • Anonymous
    June 09, 2009
    Allow me to refine what I've said. Since this article is suggesting that the problem is with multicore hardware,  all claims that have been made about all these Web servers being massively scalable is false.

  • Anonymous
    June 09, 2009
    The comment has been removed

  • Anonymous
    June 09, 2009
    To your first point, transaction processing workloads (which includes web-oriented request processing) are inherently parallel. (But they can still block -- there are lots of gotchas.) To your second comment, not exactly the way I would phrase it, but I am cynical, too, about any general claim to massive scalability. In the immortal words of Ronald Reagan, "Trust, but verify." It is worth noting that shared-nothing clustered approaches in massively parallel transaction processing have achieved some impressive results in scaling near linearly up to a 1000 machines, for example. We are generally trying to achieve something similar today on more general purpose (i.e., commodity) hardware, which is a challenge. Scalability can be achieved, but requires a long, tough push to get there on both the hardware and software sides. That holds for web applications, databases, HPC (e.g., OpenMP), etc. Minimizing shared state is crucial to achieiving good scalability in all these application domains.

  • Anonymous
    June 10, 2009
    ASP.NET adds a lot of functionality over HTTP to make Web programming easier, and it has been my experience that state management in ASP.NET does not scale very well. Cookies and query strings in HTTP does not by any way impose performance penalties as those imposed by Session State, View State, etc. in ASP.NET. While new APIs that have been introduced in Windows Server 2003 for applications to take advantage of NUMA architectures are useful, well written multithreaded code that simply use primitive thread synchronization mechanisms would scale well on NUMA architectures as the Windows scheduler is intelligent enough to facilitate those apps with the advantages offered by NUMA architectures.

  • Anonymous
    June 10, 2009
    And there's a real problem generally in any computing network that uses Ethernet that I haven't seen addressed. Ethernet's global hierarchical model and the use of Ethernet device drivers in the kernel with full privileges opens up the kernel to all kinds of attacks. Multithreaded programming would be impossible on such networks. I've sworn to never again work for any employer unless:

  1. Everything attached to the Internet around the planet is upgraded to IPv6. This implies use of InfiniBand at Layer 2, in lieu of Ethernet.
  2. The Internet is enhanced with the logic to biometrically authenticate human users and determine that the person being authenticated cannot be at more than one place at the same time. This requires the development of a new routing protocol that enhances IPv6 and one that uses GPS-bases addressing in addition to the functionality in IPv6 addressing.
  3. WiMAX is rolled out around the planet so I can use a VoIP phone on this newly enhanced Internet wherever I go. My phone must be intelligent enough to tell me whether any phone number is on a PSTN, cell network or if its a VoIP number, and I wouldn't want to deal with anyone on a PSTN or cell network.
  4. When I do start working again, I wouldn't want any device drivers running in the Windows kernel other than signed drivers from Microsoft. All other drivers will have to be user-mode drivers signed by their respective developer companies and certified by Microsoft HQL.
  5. All monies that are due to me must be paid to me ASAP. Without all of this in place, I just can't be sure that I'm getting paid for my work.