Thursday, January 26, 2012

A weak case against wimpy cores

Rereading Urs Hölzle's "Brawny cores still beat wimpy cores, most of the time" (as part of "Challenges and Opportunities for Extremely Energy-Efficient Processors"), I was again bothered by the failings of his argument.

First, he conflates performance with frequency when stating that power use scales roughly as the square of frequency. While perfect scaling (F*V^2, or F^3 where voltage can be reduced in proportion to frequency) is not possible in a given implementation and non-switching power has a significant impact, an implementation optimized for a lower frequency will generally be more efficient because it can use a shallower pipeline (with lower branch misprediction penalties and less pipeline overhead) and/or substantially less aggressive logic (e.g., performing a 64-bit addition in 30 gate delays requires noticeably less redundant operation than performing it in 15 gate delays). In addition, simply reducing the frequency allows a cache of the same size to be accessed in fewer cycles, which reduces the size of the instruction window needed to cover memory access latency (for on-chip cache hits) and/or reduces the relative loss of performance from waiting on memory (given a constant latency). Both effects allow greater efficiency.
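To make the idealized scaling concrete, here is a rough sketch of the switching-power relation (power proportional to F*V^2). It is a toy model under stated assumptions, not a description of any real processor: non-switching power is ignored, voltage is assumed able to track frequency all the way down, and the function names are my own.

```python
# Idealized switching-power model: power is proportional to F * V^2.
# Assumptions (illustrative only): leakage and other non-switching power are
# ignored, and voltage can be lowered in exact proportion to frequency.

def relative_dynamic_power(freq_scale, voltage_tracks_frequency=True):
    """Switching power relative to nominal when the clock is scaled by freq_scale."""
    voltage_scale = freq_scale if voltage_tracks_frequency else 1.0
    return freq_scale * voltage_scale ** 2


def relative_energy_per_cycle(freq_scale, voltage_tracks_frequency=True):
    """Switching energy per clock cycle relative to nominal (power / frequency)."""
    return relative_dynamic_power(freq_scale, voltage_tracks_frequency) / freq_scale


if __name__ == "__main__":
    for scale in (1.0, 0.75, 0.5):
        print(
            f"F x{scale:.2f}: "
            f"power x{relative_dynamic_power(scale):.3f} (V tracks F), "
            f"x{relative_dynamic_power(scale, False):.3f} (V fixed), "
            f"energy/cycle x{relative_energy_per_cycle(scale):.3f} (V tracks F)"
        )
```

In this model, halving the frequency cuts switching power to one eighth and energy per cycle to one quarter when voltage tracks frequency, but only halves power (with no energy-per-cycle gain) when voltage is fixed. Real designs fall somewhere between the two, which is why the design-level effects described above matter.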

In addition, frequency is not the only knob that can be turned. Brawny cores sacrifice considerable efficiency in seeking high performance. While Urs Hölzle mentions the larger area and higher frequency of brawny cores as causes of higher power, based on statements on the comp.arch newsgroup by Mitch Alsup that a half-performance core would use a sixteenth of the area, I believe Hölzle underestimates the power penalty of brawny cores.
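As a back-of-the-envelope illustration of what the Alsup figure would imply if taken at face value, the sketch below computes the exponent relating performance to area and the throughput available from filling a brawny core's area with half-performance cores. The assumptions are mine: a perfectly parallel workload, no shared-resource or interconnect overhead, and area used only as a loose proxy for power.

```python
import math

# Illustrative arithmetic only. The 1/2 performance at 1/16 area figure is the
# one quoted from comp.arch; everything else here is an assumption.

def implied_exponent(perf_ratio, area_ratio):
    """Exponent k such that perf_ratio == area_ratio ** k."""
    return math.log(perf_ratio) / math.log(area_ratio)


def throughput_gain_same_area(perf_ratio, area_ratio):
    """Aggregate throughput of wimpy cores filling one brawny core's area,
    relative to that brawny core, assuming a perfectly parallel workload."""
    cores_that_fit = 1.0 / area_ratio
    return cores_that_fit * perf_ratio


if __name__ == "__main__":
    perf, area = 0.5, 1.0 / 16.0
    print(f"implied exponent: {implied_exponent(perf, area):.2f}")
    print(f"throughput gain in the same area: {throughput_gain_same_area(perf, area):.1f}x")
```

Under these assumptions, performance would scale as roughly the fourth root of area, and sixteen half-performance cores would offer eight times the throughput of the brawny core they replace. Any real workload would see less, but the gap gives a sense of how much efficiency is being traded for single-thread performance.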

Hölzle further weakens his case by using an example of a hundredfold increase in thread count when his thesis is that anything more than about a twofold reduction in performance from the higher end is increasingly difficult to justify. Even Sun's UltraSPARC T2 processors--which clearly target throughput at great cost in single-thread performance--had much more than 1% of the single-thread performance of processors built in the same manufacturing technology.

Hölzle then implies that system cost per unit performance will increase by using wimpy cores because external resources will have to be replicated. While this argument has some strength relative to microservers where the size of the processor chip is reduced, wimpy cores can be incorporated into chips of the same size as the chips using brawny cores, sharing the same resources as a smaller number of brawny cores would. Microservers have some economic advantages in using processors targeted to other workloads (so both design and manufacturing costs are shared), but the argument against wimpy cores should not be based only on this design.

Hölzle also misses the fact that a single chip could easily (and all the more in an era of "dark silicon") have a diversity of cores. (Ironically, the other presentation lists as a reference a paper--"Reconfigurable Multi-core Server Processors for Low Power Operation"--that presents such a heterogeneous design. This paper also presents one of several possible ways of using clustering to provide a range of single-thread performance with a single hardware implementation, which seems a promising area for research. [SMT is somewhat similar in allowing a single implementation to scale to a larger number of threads, though with an emphasis on single-thread performance and so sacrificing more efficiency on highly threaded and low-demand workloads.])

An additional advantage of greater energy efficiency is the greater ease of more tightly integrating at least some memory in the same package as the processor (allowing increased bandwidth and/or energy efficiency). Furthermore, because fewer power and ground connections are needed, more connections can be used for communication (with memory, I/O, or other processors).

Wimpy cores may have an additional advantage in that, being simpler and smaller, they can be more quickly woken from a deep sleep state and can be kept in a less deep sleep state at a lower power cost. This would facilitate a faster transition from idle to a light or moderate workload.

Another factor is that the more efficient wimpy cores in a heterogeneous chip multiprocessor can be used for background tasks that do not have the response time requirements of the main workload, while the systems themselves can still be homogeneous (which might be desirable for flexible workload allocation).

There is also an implication that the required single-thread performance will continue to increase since the single-thread performance of the higher end processors continues to increase albeit more slowly than before. This may be the case, but I do not think it is a foregone conclusion.

While Amdahl's law (both in the obviously serial portion and in the excess overheads from parallel execution) limits the effectiveness of exploiting parallelism, a heterogeneous-core system would avoid much of the impact of this limit.
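A toy model can illustrate the point. The sketch below compares three ways of spending the same area budget: all brawny cores, all wimpy cores, and one brawny core plus wimpy cores. All parameters are hypothetical; the brawny core is assumed to cost 16 wimpy-core areas for twice the performance (borrowing the Alsup ratio), and the model ignores the parallel-execution overheads mentioned above, scheduling costs, and memory effects.

```python
# Toy Amdahl's-law model; speedups are relative to a single wimpy core.
# Assumptions (illustrative only): a wimpy core has area 1 and performance 1,
# the serial portion runs on the fastest available core, and the parallel
# portion scales perfectly across all cores.

def speedup_all_wimpy(serial_fraction, budget):
    """Chip filled with `budget` wimpy cores."""
    time = serial_fraction + (1 - serial_fraction) / budget
    return 1 / time


def speedup_all_brawny(serial_fraction, budget, brawny_area, brawny_perf):
    """Chip filled with as many brawny cores as fit in the area budget."""
    cores = budget // brawny_area
    time = (serial_fraction / brawny_perf
            + (1 - serial_fraction) / (cores * brawny_perf))
    return 1 / time


def speedup_heterogeneous(serial_fraction, budget, brawny_area, brawny_perf):
    """One brawny core for the serial portion; the rest of the area is wimpy cores."""
    wimpy_cores = budget - brawny_area
    time = (serial_fraction / brawny_perf
            + (1 - serial_fraction) / (brawny_perf + wimpy_cores))
    return 1 / time


if __name__ == "__main__":
    s, budget, area, perf = 0.1, 64, 16, 2.0
    print(f"all brawny:    {speedup_all_brawny(s, budget, area, perf):.1f}x")
    print(f"all wimpy:     {speedup_all_wimpy(s, budget):.1f}x")
    print(f"heterogeneous: {speedup_heterogeneous(s, budget, area, perf):.1f}x")
```

With a 10% serial fraction and a budget of 64 wimpy-core areas, this model gives roughly 6x for the all-brawny chip, 9x for the all-wimpy chip, and 15x for the heterogeneous chip. The specific numbers depend entirely on the assumed parameters; the ordering is what matters for the argument.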

The software challenges in exploiting wimpy cores (even--perhaps especially--with heterogeneous CMPs) are significant, but Hölzle's argument seems particularly faulty (even if it may be less faulty than the arguments of some "wimpy-core evangelists").