Saturday, July 14, 2012

Balance vs. fullness [a bit of a rant]

The popularity of balance as a goal bothers me. Among its pitfalls are a failure to recognize common benefits (and, along with this, an antagonistic viewpoint in which any benefit to one aspect tends to be viewed as a detriment to the others--"balance" even seems to imply a dualistic perspective) and the attempt to establish a single metric that can be applied equitably to all interests.

One problem with a single metric is that it becomes very tempting to choose one that is relatively obvious and easy to measure and to apply it in a simple manner. E.g., a balanced taxation proposal might treat individual income as an easily measured quantity and apply a flat (equitable by income) tax. However, even a flat tax based on discretionary income might be inappropriate. (Interestingly, a flat tax on assets would seem to have some attractive properties, since it would encourage the use of assets to increase productivity. Unfortunately, assets are more difficult to measure than income [education, aptitude, health, etc. are assets which influence potential productivity], and not all productivity [social good] generates income with which to pay a tax.)

(It is tempting to see this issue as analogous to the question of the nature of Christ, where the "balance" perspective proposes that Christ's nature is part-human and part-Divine while the "fullness" perspective proposes that his nature is fully human and fully Divine. Such an analogy might be improperly biased in favor of my own perspective rather than seeking an understanding of truth.)

The use of fullness (with the concept of perfection, or perhaps complete integrity) as a goal may avoid some psychological/moral issues, but it seems to draw out significant measurement issues. This has the good aspect of forcing thought and a recognition of complexity, but the bad aspect of potentially disintegrating into a contemplation of (or argument about) measurement. A more integrated perspective recognizes that while establishing measurements helps to clarify goals (at many levels) and estimate progress (and so guide resource allocation, including time), measurement is a servant of the other aspects (while the other aspects likewise submit to measurement).

When using perfection (fullness/integrity) as the goal, one must also maintain a sense of context. A human goal of perfection takes into account finitude. A result that is perfect at a given time may easily be a very poor result if the achievement of that result is necessarily delayed by resource limits. Likewise a dependence on grace seems to be necessary (this may be related to Martin Luther's "sin boldly"). Knowing that even stupid failures (and many failures are 'recognized' as stupid in hindsight) are not damning provides a freedom to strive for perfection rather than being paralyzed by uncertainty (the measurement problem)--or even the certainty of failure (the recognition of inadequacy)--or settling for a safe result (like burying the talent [Matthew 25:24-30]).

In Mere Christianity C.S. Lewis wrote "The only fatal thing is to sit down content with anything less than perfection." This is not a paralyzing perfectionism (which is a significant issue for me), but a call not to stop being a soldier until the war is won. Such a seeking of perfection not only motivates a full commitment of effort (sometimes enabling success) but sometimes produces an accidental success beyond expectation.

(My own perfectionism--both from uncertainty of what should be done and from perception of inadequacy--very often leads to inactivity, which is very far from the striving for perfection that is the human calling. On the positive side, a more proper fondness for perfection may be involved in my perfectionism--i.e., my perfectionism may in part represent a corruption of a particular gift of affection for the highest and best.)

Of course, the very use of "vs." in the title demonstrates how easy it is to fall into an antagonistic (rather than holistic) perspective.

Wednesday, April 18, 2012

Nitpicking on "No Silver Bullets", part 1, HLL disadvantages

A recent post on John Cook's blog mentioned a presentation by Mike Swaim (titled "No Silver Bullet") about the advantages and disadvantages of certain programming techniques. Looking through the PDF, I was bothered by a few points. (Yes, nitpicking over bullet points is not exactly fair, and I am not at all claiming that the presenter was unaware of the issues.)

The disadvantages of high level languages were basically listed as reduced performance, "inappropriate runtime requirements" (which might be excessive resource requirements or things like garbage collection pauses in a tightly constrained hard real-time environment), and (based on the NT example) the inability to fix or even work around errors in architecture/design or implementation.

Performance issues

It seems that the performance problem comes from two issues. The first is that the compiler (and runtime system) does not fully exploit the available information. This shortfall has two sources: the lack of machine resources allocated to the compiler and the lack of programming effort applied to developing systems for translating higher level specifications into machine language code. While this can be seen as approaching an artificial intelligence requirement, the goal is not as difficult as generic artificial intelligence.

Some strengths of existing computing systems could be more fully exploited, most particularly the relative abundance of idle time on many computing systems and the relative reliability of detailed long-term memory. Offloading some of the compiler's analysis effort to development time could provide more extensive analysis without increasing compilation time. Caching information about software could also substantially reduce the amount of analysis required at compilation time (or run time).

The software effort problem is unnecessarily hindered by issues with the definitions of programming language behavior (which can include behavior expected by existing software).

The second issue is the failure to communicate necessary information to the compiler (and runtime system). Information which is known (or assumed) by a human programmer is often not communicated to the compiler, and frequently the compiler is not allowed to ask questions for clarification. In some cases the programming language provides no mechanism for communicating such information, and seemingly more often it makes communicating such information more difficult. It may use inappropriate default semantics (e.g., C pointer/array aliasing). It may use discordant syntax for adding information (e.g., __builtin_expect versus something like assuming (condition) {stuff to do} else {other stuff to do} or case 5 probability=20%: stuff to do; requiring the use of a different dialect or even a different language to express information adds an unnecessary barrier). It may also lack support for layered abstraction--e.g., providing a generic implementation of an operation along with an optimized implementation that can "overload" the generic one, with constraints (additional information) expressed so that the optimized implementation can be validated. Such optimized implementations could be provided as hints ("compiler, try this implementation") or directives ("compiler, do it this way"), multiple implementations could be provided for different constraints, and the compiler might even use the different implementations to generate a new implementation.
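
As a concrete illustration of what communicating such information looks like with existing mechanisms, here is a minimal C++ sketch using the GCC/Clang extensions __restrict__ and __builtin_expect (the function names are invented for the example; the friendlier syntax imagined above does not exist today):

#include <cstddef>

// The programmer knows 'dst' and 'src' never overlap; without the
// __restrict__ qualifier the compiler must assume they might alias and
// cannot reorder or vectorize the loop as aggressively.
void scale(float* __restrict__ dst, const float* __restrict__ src,
           std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * factor;
}

// The programmer knows the error path is rare; __builtin_expect is the
// discordant, dialect-specific way of saying "this branch is unlikely".
int first_byte(const char* p) {
    if (__builtin_expect(p == nullptr, 0))
        return -1;              // cold path
    return *p;                  // hot path
}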

This second issue is compounded by the relatively weak support for profiling. Not only is the generation and use of profile information more complex than strictly necessary, but the overheads for generating such information seem to be greater than necessary. In addition, some information is generally not gathered by typical profiling tools. E.g., the temporal distribution of values for a variable and even the temporal correlation of values for different variables can be as important as the net frequency of values. Likewise, infrequent and limited use of whole program analysis prevents the compiler from applying even relatively simple transformations to data structures and algorithms.
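
To make the value-profiling point concrete, here is a hypothetical hand-rolled sketch (all names invented) of recording the temporal distribution of a variable's values rather than just its net frequencies:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// Records how often each value occurs within successive fixed-size "epochs",
// so the temporal distribution of the values can be inspected afterwards.
class ValueProfile {
    std::vector<std::map<std::int64_t, std::uint64_t>> epochs_{1};
    std::uint64_t samples_in_epoch_ = 0;
    std::uint64_t epoch_length_;
public:
    explicit ValueProfile(std::uint64_t epoch_length = 10000)
        : epoch_length_(epoch_length) {}

    void record(std::int64_t value) {
        ++epochs_.back()[value];
        if (++samples_in_epoch_ == epoch_length_) {  // start a new epoch
            epochs_.emplace_back();
            samples_in_epoch_ = 0;
        }
    }

    void dump() const {
        for (std::size_t e = 0; e < epochs_.size(); ++e)
            for (const auto& entry : epochs_[e])
                std::printf("epoch %zu: value %lld seen %llu times\n", e,
                            (long long)entry.first,
                            (unsigned long long)entry.second);
    }
};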

(Compiler writers not understanding computer architecture or being informed about specific microarchitectures--not necessarily by choice--does hinder the development of better compilers.)

While providing additional information would in many cases reduce the productivity advantages of a higher level language, in many such cases the use of "overloading" implementations might provide good performance without losing the maintenance and extension advantages of the higher level representation. When a constraint of an optimized implementation is violated, or the optimized version no longer matches the generic version, these effects can be noted, allowing a decision to accept the optimization loss or to rework the optimized implementation. In some cases the reworking could be fully or partially automated (i.e., suggested fixes provided). In some cases multiple optimized implementations could be used to synthesize new implementations to better fit resource availability or optimization goals.
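
A minimal C++ sketch of the "overloading" idea (the function names and the sortedness constraint are invented for the example): the optimized version states its constraint, and a checking build can validate it against the generic version so that drift is noticed rather than silently producing wrong results.

#include <algorithm>
#include <cassert>
#include <vector>

// Generic implementation: always correct, no assumptions.
bool contains_generic(const std::vector<int>& v, int key) {
    return std::find(v.begin(), v.end(), key) != v.end();
}

// Optimized implementation. Constraint: 'v' is sorted in non-decreasing order.
bool contains_optimized(const std::vector<int>& v, int key) {
    assert(std::is_sorted(v.begin(), v.end()) && "constraint violated");
    bool result = std::binary_search(v.begin(), v.end(), key);
    // In a checking build, validate against the generic version; a mismatch
    // means the optimized version no longer matches and needs rework.
    assert(result == contains_generic(v, key));
    return result;
}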

As an example, in many cases it should be possible for a compiler to recognize that the keys of a hash are immutable, or restricted to a particular set of values, or accessed in a particular order, allowing a simple programming concept to be implemented in an optimized fashion. Providing directive or expectation information may allow the use of a higher level construct like a generic container while allowing the compiler to use an optimized implementation--preventing or generating a fatal error in the case of directives, dynamically recompiling or using an alternative implementation in the case of expectations. In theory, a hash which uses only unsigned integer keys could be implemented as an array. One might even generalize the hash concept to provide a syntax like container_of_records[element=variable] to index the element (or collection of elements) where the record member 'element' equals the value in 'variable'. The 'element=' could be implied from the type information of 'variable', or from a default element (perhaps indicated by a qualifier in the definition like 'indexer' or by some precedence mechanism--though such could make bugs too easy to generate and too difficult to find), or by a combination (where multiple indexers are provided and type information and/or precedence is used to determine which applies).
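
A sketch of the array-as-hash idea in C++ (the class name and the size threshold are invented for illustration; the post imagines the compiler making this choice, but a library can approximate it): when the key type is a small unsigned integer, a dense array is used; otherwise the implementation falls back to a hash table.

#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <unordered_map>
#include <vector>

template <typename Key, typename Value>
class AutoMap {
    // Assumed heuristic: treat 8- and 16-bit unsigned keys as dense indices.
    static constexpr bool dense =
        std::is_unsigned<Key>::value && sizeof(Key) <= 2;
    std::conditional_t<dense,
                       std::vector<Value>,
                       std::unordered_map<Key, Value>> storage_;
public:
    Value& operator[](Key k) {
        if constexpr (dense) {
            if (static_cast<std::size_t>(k) >= storage_.size())
                storage_.resize(static_cast<std::size_t>(k) + 1);
            return storage_[static_cast<std::size_t>(k)];
        } else {
            return storage_[k];
        }
    }
};

// AutoMap<std::uint16_t, int> behaves like an array; a string-keyed
// instantiation would fall back to the hash table.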

(Expectation information can also be used to reduce compilation effort. If the expectation is correct, the only overhead is in evaluating the expectation--compared to searching through all possibilities. If the expectation is incorrect, it may be acceptable to use less aggressive optimization and provide a notification or even delay final compilation of the component dependent on the expectation until a new expectation is provided by the programmer or possibly by some automated method, potentially taking advantage of lazy evaluation of the program source code. Obviously, the communication of the expectation should be lightweight, perhaps something like Some_record_type container[uint?] to define a container of 'Some_record_type' values that is expected to be indexed by unsigned integer values.)
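
A rough C++ sketch of the expectation idea (all names invented; a deprecation warning stands in for the "notification", and a real compiler could do much better): if the declared index-type expectation holds, the cheap representation is chosen and the only cost is the compile-time check; if it does not, compilation still succeeds with the generic representation and a warning is emitted.

#include <cstdint>
#include <map>
#include <type_traits>
#include <vector>

[[deprecated("index-type expectation not met; using generic container")]]
constexpr void expectation_not_met() {}

template <typename Key, typename Value, typename ExpectedKey = std::uint32_t>
class ExpectedContainer {
    static constexpr bool as_expected = std::is_same<Key, ExpectedKey>::value;
    // Fast dense representation when the expectation holds, generic otherwise.
    std::conditional_t<as_expected,
                       std::vector<Value>,
                       std::map<Key, Value>> storage_;
public:
    ExpectedContainer() {
        if constexpr (!as_expected)
            expectation_not_met();  // compile-time "notification"
    }
    // (element access would follow the same pattern as the previous sketch)
};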

Runtime Requirements

Runtime requirements issues would seem to have solutions similar to those for the performance issues. While it would be unreasonable to expect the existence of even the limited form of artificial intelligence necessary to fully solve this issue, a more moderate degree of intelligence (or extensive searching) would seem likely to greatly reduce the problems. In particular, it is disappointing to me that automated memory management is often implemented exclusively as generic garbage collection. I suspect that in many cases the source code provides sufficient information to determine when a resource can be freed or when reference counting would be a better choice (with or without a saturating reference count--where a saturated reference count might indicate falling back to a generic garbage collection mechanism); likewise, choices of allocators (region/stack/FILO/FIFO, binary buddy, fixed-sized arrays, limited size diversity exploiting expected frequencies and best fit with limited splitting and fusing of default partitioning, object caching, etc.) could be optimized by a compiler (or perhaps even a runtime system).
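
As a sketch of one of the allocator choices just listed, here is a minimal region (arena) allocator in C++ (invented for illustration): everything allocated from the region shares a lifetime and is released at once, which is exactly the kind of choice a compiler could make when it can prove that a group of allocations die together.

#include <cstddef>
#include <cstdlib>
#include <new>

class Region {
    char*       buffer_;
    std::size_t capacity_;
    std::size_t used_ = 0;
public:
    explicit Region(std::size_t capacity)
        : buffer_(static_cast<char*>(std::malloc(capacity))),
          capacity_(capacity) {
        if (!buffer_) throw std::bad_alloc();
    }
    ~Region() { std::free(buffer_); }

    // Bump-pointer allocation; 'align' must be a power of two.
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start + size > capacity_) throw std::bad_alloc();
        used_ = start + size;
        return buffer_ + start;
    }

    void reset() { used_ = 0; }  // releases every allocation at once
};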

In addition, a high level language should support the ability to use lower-level directives (as well as communicate expectations) without degenerating into bare assembly language. (If assembly language is used, it should be annotated with constraints--obviously, at minimum, the assumed ISA--and a higher level implementation should be provided, which ideally could be used to validate the assembly language implementation.) Even if the lower-level directive cannot be validated against a higher-level representation or other information in the source code (or specification)--e.g., one might use a bald assertion which is not checked at runtime and need not be validated at compile time (though notification would be provided if compile-time validation was not possible)--information would still be available in the source code about the assumptions made, and automated tools could use such assumptions to provide assistance in debugging. In some cases, debugging aids could be selectively applied (to reduce overheads). In the case of memory deallocation, it might sometimes be possible to use virtual memory with page deduplication or compression to reduce the cost of not reusing deallocated memory by overwriting the deallocated region with a given pattern.
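
A minimal C++ sketch of the annotated-assembly idea (function names invented; the constraint here is simply "x86-64 with GCC/Clang inline assembly"): the higher level implementation is kept alongside the low-level one, serves as the fallback when the constraint is not met, and can be used to validate the low-level version.

#include <cassert>
#include <cstdint>

// Higher level (portable) implementation.
inline std::uint64_t add_portable(std::uint64_t a, std::uint64_t b) {
    return a + b;
}

// Lower level implementation. Constraint: x86-64, GCC/Clang inline assembly.
inline std::uint64_t add_lowlevel(std::uint64_t a, std::uint64_t b) {
#if defined(__x86_64__)
    asm("addq %1, %0" : "+r"(a) : "r"(b));
    return a;
#else
    return add_portable(a, b);  // constraint not met: fall back
#endif
}

// A selectively applied debugging aid: validate the low-level implementation
// against the higher level one.
inline std::uint64_t add_checked(std::uint64_t a, std::uint64_t b) {
    std::uint64_t result = add_lowlevel(a, b);
    assert(result == add_portable(a, b));
    return result;
}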

Unfixable Environments

While the problems of bugs in the design or implementation of a runtime environment cannot be completely avoided (though using formal methods would limit such bugs to the design--and to the development tools), using a modular design even within the runtime environment would seem likely to substantially reduce the impact of many design and implementation bugs/misfeatures. Allowing individual components to be swapped out would seem to greatly facilitate working around implementation and even design issues. Even layering a foreign interface over a required system interface might be practical in some circumstances. E.g., it might be possible to provide a specialized automated memory management system over a generic garbage collector with modest overheads by making a minimal number of allocations through the system memory manager. Ideally, the functionality of such module implementations could be validated against a higher level specification.
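
A small C++ sketch of the modular idea (interface and names invented): client code is written against a narrow memory-manager interface, so a specialized implementation (perhaps one carved out of a few large allocations obtained from the host environment's generic collector) can be swapped in without touching the clients.

#include <cstddef>
#include <cstdlib>
#include <memory>

// The module boundary: a narrow interface the rest of the runtime codes against.
struct MemoryManager {
    virtual void* allocate(std::size_t size) = 0;
    virtual void  release(void* p) = 0;
    virtual ~MemoryManager() = default;
};

// Default implementation delegating to the system allocator.
struct SystemMemoryManager : MemoryManager {
    void* allocate(std::size_t size) override { return std::malloc(size); }
    void  release(void* p) override { std::free(p); }
};

// Swapping in a specialized manager (e.g., one layered over a generic garbage
// collector via a small number of large allocations) only requires changing
// this factory, not the code that uses the interface.
std::unique_ptr<MemoryManager> make_memory_manager() {
    return std::make_unique<SystemMemoryManager>();
}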

Unfortunately, not only are the inadequacies of tools a barrier to achieving such an ideal, but human factors are likely to hinder adoption. Humans are often disinclined to spend now to save later, often have difficulty admitting mistakes, often have difficulty throwing away a work (perhaps especially if it was well done and only recently became unfit for current or expected use), and often have difficulty recognizing the validity in another perspective (especially when that perspective is poorly communicated or is bundled with aspects that are invalid [Absolute tolerance of falsehood or wrongs is not a virtue, but often saying "you really shouldn't do that" is more appropriate than excommunication. {In the context of programming language design, making something awkward to do is often more appropriate than prohibition--with bonus points if repentance and reform is made easy.} The love of truth should be accompanied by humility and mercy.]).

Thursday, January 26, 2012

A weak case against wimpy cores

Rereading Urs Hölzle's "Brawny cores still beat wimpy cores, most of the time" (as part of "Challenges and Opportunities for Extremely Energy-Efficient Processors"), I was again bothered by the failings of his argument.

First, he conflates performance with frequency when stating that power use scales roughly as the square of frequency. While perfect scaling (F*V^2, or F^3 where voltage can be reduced in proportion to frequency) is not achievable in a given implementation, and non-switching power has a significant impact, an implementation optimized for a lower frequency will generally be more efficient, by using a shallower pipeline (with lower branch misprediction penalties and less pipeline overhead) and/or substantially less aggressive logic (e.g., performing a 64-bit addition in 30 gate delays requires noticeably less redundant work than performing it in 15 gate delays). In addition, simply reducing the frequency allows the same size cache to be accessed in fewer cycles, which reduces the size of the instruction window needed to cover memory access latency (for on-chip cache hits) and/or reduces the relative loss of performance from waiting on memory (given a constant latency); both of these allow greater efficiency.
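
Spelling out the idealized scaling the parenthetical refers to (a simplification that ignores static power; alpha is the activity factor and C the switched capacitance):

P_{\mathrm{dyn}} \approx \alpha C V^2 f, \qquad V \propto f \;\Rightarrow\; P_{\mathrm{dyn}} \propto f^3, \qquad E_{\mathrm{per\,op}} = P_{\mathrm{dyn}} / f \propto f^2

Under these idealized assumptions a core run at half frequency would use roughly 1/8 the power and 1/4 the energy per operation, before counting the further savings available to an implementation actually designed for the lower frequency.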

In addition, frequency is not the only knob that can be turned. Brawny cores sacrifice considerable efficiency in seeking high performance. While Urs Hölzle mentions the larger area and higher frequency of brawny cores as causes of higher power, based on statements on the comp.arch newsgroup by Mitch Alsup that a half-performance core would use a sixteenth of the area, I believe Hölzle underestimates the power penalty of brawny cores.

Hölzle further weakens his case by using an example of a hundredfold increase in thread count when his thesis is that anything more than about a twofold reduction in performance from the higher end is increasingly difficult to justify. Even Sun's UltraSPARC T2 processors--which clearly target throughput at great cost in single-thread performance--had much more than 1% of the performance of processors in the same manufacturing technology.

Hölzle then implies that system cost per unit performance will increase by using wimpy cores because external resources will have to be replicated. While this argument has some strength relative to microservers where the size of the processor chip is reduced, wimpy cores can be incorporated into chips of the same size as the chips using brawny cores, sharing the same resources as a smaller number of brawny cores would. Microservers have some economic advantages in using processors targeted to other workloads (so both design and manufacturing costs are shared), but the argument against wimpy cores should not be based only on this design.

Hölzle also misses the fact that a single chip could easily (and all the more in an era of "dark silicon") have a diversity of cores. (Ironically, the other presentation listed as a reference a paper--"Reconfigurable Multi-core Server Processors for Low Power Operation"--that presented such a heterogeneous design. This paper also presents one of several possible ways of using clustering to provide a range of single-thread performance with a single hardware implementation, which seems a promising area for research. [SMT is somewhat similar in allowing a single implementation to scale to a larger number of threads, though with an emphasis on single-thread performance and so sacrificing more efficiency on highly threaded and low-demand workloads.])

An additional advantage of greater energy efficiency is the greater ease of more tightly integrating at least some memory in the same package as the processor (allowing increased bandwidth and/or energy efficiency). Furthermore, by reducing the number of power and ground connections, more connections can be used for communication (with memory, I/O, or other processors).

Wimpy cores may have an additional advantage in that, being simpler and smaller, they can be woken more quickly from a deep sleep state and can be kept in a less deep sleep state at a lower power cost. This would facilitate a faster transition from idle to a light or moderate workload.

There is also the consideration that the more efficient wimpy cores in a heterogeneous chip multiprocessor can be used for background tasks which do not have the response-time requirements of the main workload, while still allowing homogeneous systems (which might be desirable for flexible workload allocation).

There is also an implication that the required single-thread performance will continue to increase, since the single-thread performance of higher end processors continues to increase, albeit more slowly than before. This may be the case, but I do not think it is a foregone conclusion.

While Amdahl's law (both in the obviously serial portion and in the excess overheads from parallel execution) limits the effectiveness of exploiting parallelism, a heterogeneous-core system would avoid much of the impact of this limit.

The software challenges in exploiting wimpy cores (even--perhaps especially--with heterogeneous CMPs) are significant, but Hölzle's argument seems particularly faulty (even if it may be less faulty than the arguments of some "wimpy-core evangelists").