Moore’s Law needs a hug. The days of stuffing transistors onto small silicon computer chips are numbered, and their life rafts – hardware accelerators – come at a price.
When programming an accelerator – a process in which applications offload certain tasks to specialized system hardware in order to speed those tasks up – you have to build entirely new software support. Hardware accelerators can run certain tasks faster than general-purpose processors, but they cannot be used out of the box. Software has to use the accelerator’s instructions efficiently for the accelerator to work with the rest of the application system. This translates to a lot of engineering work, which then has to be maintained for every new chip you compile code for, in any programming language.
Now, scientists at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have created a new programming language called “Exo” for writing high-performance code on hardware accelerators. Exo helps low-level performance engineers transform very simple programs that specify what they want to compute into very complex programs that do the same thing as the specification, but much, much faster, by using these special accelerator chips. Engineers, for example, can use Exo to turn a simple matrix multiplication into a more complex program that runs orders of magnitude faster on these accelerators.
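To make the matrix multiplication example concrete, here is a minimal plain-Python sketch of the idea. It is not Exo code and not the researchers' implementation: the function names and the tile size are assumptions chosen for illustration. It shows a simple specification next to a restructured, tiled version that computes the same result. On a real accelerator, the tiles would be mapped onto the chip's own instructions.

```python
def matmul_spec(A, B):
    """Simple specification: the textbook triple loop."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Same computation, restructured into tile-sized blocks.

    The tile size is a hypothetical parameter; integer inputs keep the
    comparison with the specification exact.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Work on one block at a time; min() handles ragged edges.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for p in range(p0, min(p0 + tile, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
# The restructured program must agree with the specification.
assert matmul_spec(A, B) == matmul_tiled(A, B)
```

The tiled version is longer and harder to read, which is the trade-off the article describes: the specification stays simple, and the complexity lives in the optimized form.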
Unlike other programming languages and compilers, Exo is built around a concept called “Exocompilation.” “Traditionally, a lot of research has focused on automating the optimization process for specific hardware,” says Yuka Ikarashi, a PhD student in electrical engineering and computer science and a CSAIL affiliate, who is the lead author of a new paper on Exo. “That’s fine for most programmers, but for performance engineers, the compiler gets in the way as often as it helps. Because the compiler’s optimizations are automatic, there’s no good way to fix it when it does the wrong thing and gives you 45% efficiency instead of 90%.”
With Exocompilation, the performance engineer is back in the driver’s seat. Responsibility for choosing which optimizations to apply, when, and in what order is moved out of the compiler and handed back to the performance engineer. This way, they don’t have to waste time fighting the compiler on the one hand, or doing everything manually on the other. At the same time, Exo takes responsibility for ensuring that all of these optimizations are correct. As a result, the performance engineer can spend their time improving performance rather than debugging complex, optimized code.
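The division of labor described above can be sketched with a toy model in plain Python. This is an illustration only: `LoopNest`, `split`, and `reorder` are hypothetical helpers written for this sketch, not Exo's actual interface. The engineer picks the rewrites and their order, and each rewrite refuses to run if its precondition fails. The sketch also assumes the loop body does not depend on iteration order, which a real system would have to verify.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoopNest:
    """Toy loop nest: (name, extent) pairs, outermost loop first."""
    dims: Tuple[Tuple[str, int], ...]


def split(nest, name, factor):
    """Split loop `name` into an outer and an inner loop.

    Rejected unless `factor` evenly divides the loop extent, so the
    rewritten nest covers exactly the same iterations.
    """
    if name not in [dim for dim, _ in nest.dims]:
        raise ValueError(f"no loop named {name}")
    new_dims = []
    for dim, extent in nest.dims:
        if dim != name:
            new_dims.append((dim, extent))
        elif factor <= 0 or extent % factor != 0:
            raise ValueError(f"cannot split {name}: {factor} does not divide {extent}")
        else:
            new_dims.append((f"{name}_outer", extent // factor))
            new_dims.append((f"{name}_inner", factor))
    return LoopNest(tuple(new_dims))


def reorder(nest, order):
    """Reorder loops; rejected unless `order` names every loop exactly once."""
    by_name = dict(nest.dims)
    if sorted(order) != sorted(by_name):
        raise ValueError("reorder must name every existing loop exactly once")
    return LoopNest(tuple((name, by_name[name]) for name in order))


# The engineer decides which rewrites to apply and in what order.
spec = LoopNest((("i", 8), ("j", 4)))
scheduled = reorder(split(spec, "i", 4), ["i_outer", "j", "i_inner"])
print(scheduled.dims)
```

A request such as `split(spec, "i", 3)` raises an error instead of silently producing a loop nest that skips or repeats iterations, which is the kind of guarantee the paragraph attributes to Exo.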
“The Exo language is a compiler that is parameterized over the hardware it targets; the same compiler can adapt to many different hardware accelerators,” says Adrian Sampson, assistant professor in the Department of Computer Science at Cornell University. “Instead of writing a bunch of messy C++ code to compile for a new accelerator, Exo gives you an abstract, uniform way to write down the ‘shape’ of the hardware you want to target. Then you can reuse the existing Exo compiler to adapt to that new description instead of writing something entirely new from scratch. The potential impact of work like this is enormous: if hardware innovators can stop worrying about the cost of developing new compilers for every new hardware idea, they can try out and ship more ideas. The industry could break its dependence on legacy hardware that succeeds only because of ecosystem lock-in and despite its inefficiency.”
The most powerful computer chips manufactured today, such as Google’s TPU, Apple’s Neural Engine, or NVIDIA’s Tensor Cores, power scientific computing and machine learning applications by accelerating what are known as “key subroutines,” kernels, or high-performance computing (HPC) subroutines.
Clunky jargon aside, these programs are essential. For example, the Basic Linear Algebra Subprograms (BLAS) are a “library,” or collection, of such subroutines. They are dedicated to linear algebra computations and enable many tasks, such as neural networks, weather forecasting, cloud computing, and drug discovery. (BLAS is so important that it won Jack Dongarra the Turing Award in 2021.) However, these new chips – which take hundreds of engineers to design – are only as good as these HPC software libraries allow.
Currently, however, this kind of performance optimization is still done by hand to ensure that every last compute cycle on these chips gets used. HPC subroutines regularly run at more than 90% of peak theoretical efficiency, and hardware engineers go to great lengths to add an extra 5 or 10% of speed to those theoretical peaks. So if the software isn’t aggressively optimized, all of that hard work is wasted, which is exactly what Exo helps prevent.
Another key feature of Exocompilation is that performance engineers can describe the new chips they want to optimize for without having to modify the compiler. Traditionally, the definition of the hardware interface is maintained by the compiler developers, but with most of these new accelerator chips, the hardware interface is proprietary. Companies have to maintain their own copy (fork) of an entire traditional compiler, modified to support their particular chip. This requires hiring teams of compiler developers in addition to performance engineers.
“In Exo, we instead move the definition of hardware-specific backends out of the exocompiler. This gives us a better separation between Exo – which is an open-source project – and hardware-specific code – which is often proprietary. We’ve shown that we can use Exo to quickly write code that performs as well as Intel’s hand-optimized Math Kernel Library. We are actively working with engineers and researchers at several companies,” explains Gilbert Bernstein, a postdoctoral fellow at the University of California at Berkeley.
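The separation Bernstein describes can be pictured with a small plain-Python sketch. It is a toy under stated assumptions, not Exo's real mechanism: `Instruction`, `CHIP_A`, `CHIP_B`, and `lower_vector_add` are hypothetical names invented here. The point is only that the hardware description lives in ordinary user-level data, so a new chip means a new description rather than a modified compiler.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Instruction:
    """One hardware instruction: name, vector width, reference semantics."""
    name: str
    width: int
    semantics: Callable[[List[int], List[int]], List[int]]


def elementwise_add(a, b):
    """Reference behavior shared by both hypothetical chips."""
    return [x + y for x, y in zip(a, b)]


# Two hypothetical chips, described entirely outside the "compiler".
CHIP_A = {"add": Instruction("vadd4", 4, elementwise_add)}
CHIP_B = {"add": Instruction("vadd8", 8, elementwise_add)}


def lower_vector_add(a, b, target: Dict[str, Instruction]):
    """Generic routine: chunk the work to fit whatever `target` describes.

    Returns the result plus a trace of the instructions it would issue.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    instr = target["add"]
    result, trace = [], []
    for start in range(0, len(a), instr.width):
        stop = start + instr.width
        result.extend(instr.semantics(a[start:stop], b[start:stop]))
        trace.append(instr.name)
    return result, trace


a, b = list(range(8)), [10] * 8
result_a, trace_a = lower_vector_add(a, b, CHIP_A)
result_b, trace_b = lower_vector_add(a, b, CHIP_B)
assert result_a == result_b  # same answer, different instruction streams
```

Here `lower_vector_add` never changes. Only the dictionary describing the chip does, and that description could stay private to the hardware vendor.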
Exo’s future involves exploring a more productive scheduling meta-language and extending its semantics to support parallel programming models, so that it can be applied to even more accelerators, including GPUs.
Ikarashi and Bernstein authored the paper alongside Alex Reinking and Hasan Genc, both UC Berkeley doctoral students, and MIT assistant professor Jonathan Ragan-Kelley.
This work was partially supported by the Applications Driving Architectures Center, one of six centers in JUMP, a Semiconductor Research Corporation program co-sponsored by the Defense Advanced Research Projects Agency. Ikarashi was supported by the Funai Overseas Scholarship, the Masason Foundation and the Great Educators Fellowship. The team presented the work at the ACM SIGPLAN Conference on Programming Language Design and Implementation 2022.