Lessons from taking a co-design approach to High Performance Computing, in conversation with Gaurav Kaul

in-flight entertainment

So here is the landmark scientific paper published in Nature that explains how Google-backed artificial intelligence company DeepMind trained an artificial intelligence model to control the plasma inside a nuclear fusion reactor; a problem that the Swiss Plasma Centre have been simulating for more than 20 years. It may seem pretty mental that a company that has historically never been in the game for nuclear fusion has come into the room and solved a problem that they have no specialist knowledge of. But the generalists out there will understand that this isn’t as far-fetched as one might think, for one reason and one reason only: it’s less about learning the subject, and more about learning how to learn. To illustrate this, here’s a Digital Anthroconomy summary of the paper to understand how DeepMind cracked a 20-year-old problem of controlling the plasma for nuclear fusion…

Plasma describes a state of matter, just like solids, liquids and gases, that forms when you super-super-super-heat a gas to the point where the electrons actually become detached from the atomic nuclei (protons and neutrons). Anyone who studied physics or chemistry at the age of 13 should know that this turns a neutrally charged atom into an electrically charged ion. Now that we have a superheated ball of electrically charged ionic gas called plasma (which is the easy part), the goal of nuclear fusion is to make these ions bump into each other and fuse together, which releases a heavenly amount of energy, the same way the sun releases energy. But the challenge the Swiss Plasma Centre had was arranging this cloud of plasma into the right shape to promote collisions between these ions. What made the challenge even harder was that the shape of the plasma needs to change with time (similar to how the strength of wind needed to maintain a fire changes with time). So at each point in time, the specific shape of this electrically charged gas needs to change.
To shape the gas we can use electromagnetic fields (produced by electromagnets) that the electrically charged gas will align with. So the question becomes “how do we choose the right shape of electromagnetic field, and make sure it changes at the right time, to promote ion collision?”. Well, the pre-DeepMind way was to calculate the exact conditions of the gas at each point in time, and then calculate the optimum shape for the gas to have at each of these points in time.
“We will need deep scientific knowledge to understand how the plasma conditions at each point in time are different”, said the Swiss Plasma Centre. “Let’s spend the next 20 years trying to do that”.
“Hold on”, said DeepMind. “We don’t need to create a new model of the conditions for each point in time from scratch. We just need to create a method where, if the shape being simulated looks like it’s getting a step closer to the end goal (the end goal being more ion collision), then we roll with that and tell the simulation to keep going”.
This is a method called reinforcement learning; an AI tool that gets a simulation to reach its programmed end goal, not by controlling each stage of the model every step of the way, but by smiling at it and encouraging it further every time it does something that moves it closer to the end goal.
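For the curious, here is what “smiling at it” can look like in code. This is a toy sketch only, and it is not DeepMind’s controller: it assumes a made-up one-dimensional “shape setting” (a number from 0 to 10), a made-up target setting of 7, and a textbook technique called tabular Q-learning. The function names `train_q_table` and `greedy_rollout` and all the parameter values are invented for illustration.

```python
import random

ACTIONS = (-1, +1)  # nudge the toy "shape setting" down or up by one


def train_q_table(n_states=11, target=7, episodes=2000,
                  alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Learn action values for a toy 1-D problem with tabular Q-learning.

    Reward is +1 when a move lands closer to the target (the "smile")
    and -1 otherwise. All numbers here are illustrative assumptions.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = rng.randrange(n_states)
        for _ in range(2 * n_states):
            if state == target:
                break  # end goal reached, episode over
            # Mostly follow the best-known action, sometimes explore.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] >= q[state][1] else 1
            nxt = min(max(state + ACTIONS[a], 0), n_states - 1)
            closer = abs(nxt - target) < abs(state - target)
            reward = 1.0 if closer else -1.0
            best_next = 0.0 if nxt == target else max(q[nxt])
            # Standard Q-learning update towards reward plus future value.
            q[state][a] += alpha * (reward + gamma * best_next - q[state][a])
            state = nxt
    return q


def greedy_rollout(q, start, target, max_steps=50):
    """Follow the learned values greedily and return the visited states."""
    state, path = start, [start]
    while state != target and len(path) <= max_steps:
        a = 0 if q[state][0] >= q[state][1] else 1
        state = min(max(state + ACTIONS[a], 0), len(q) - 1)
        path.append(state)
    return path


if __name__ == "__main__":
    table = train_q_table()
    print(greedy_rollout(table, start=0, target=7))
    print(greedy_rollout(table, start=10, target=7))
```

Notice that nobody tells the program which direction is correct from each starting point. It only gets a smile or a frown after each move, and the route to the goal falls out of that feedback.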

So what does this have to do with taking a co-design view of High Performance Computing?

Well, the real smarts of DeepMind’s approach wasn’t necessarily the fancy technology itself, nor the impressive words like “nuclear fusion” and “electrically charged ionic plasma”. The smarts were in their ability to step back and understand two things –
1. “Hey, why are we spending so much time on each tiny individual detail and process, when we can zoom out, clearly define the end goal and just make sure each stage is aligning with the end goal?”
2. “Generalists can unite a team in extremely powerful and productive ways, because whilst specialist knowledge is absolutely necessary, it is a holistic viewpoint that drives new solutions with a fresh perspective”.

The world of start-ups and entrepreneurship shares two very closely related mottos: “Start with the end in mind” and “Move toward the vision as a collective”.

Gaurav Kaul also shares similar mottos as a solutions architect for HPC/AI at Hewlett Packard Enterprise – “Design the computer system, with the workload in mind” and “Get the teams speaking the same language so they can move and work together”. What this means, in jargon-terminology: taking a co-design approach.

landing the plane

Gaurav is a co-design architect for HPE who helps architect HPC systems, right across the hardware and the software, that can support the extremely large training models that power artificial intelligence. What this means in simple terms: large training models that are looking to grow and become more accurate need to be fed more information (data). These increasing amounts of data (and it really is staggering how much info there is) need to be smartly managed, organised and fed into the algorithmic model, which is effectively crunching and processing numbers in specific ways depending on the algorithms being used; together, the data and the processing are referred to as a “workload”. The management and growth of this workload has implications for the choice and design of the software that the workload runs within and, in turn, for the hardware that lets the software run. Unlike our brains, the hardware doesn’t grow, scale and evolve by itself. Gaurav looks at the whole end-to-end system design over time, right from “what is the nature of the data and algorithmic workloads for the training model?” to “how will these workloads run most optimally on the computer systems that we can afford to use?”. This makes him an ideal candidate for benchmarking the hardware performance of a computer system designed for specific training workloads (p.s. the reference point for any computing market comes down to meeting and exceeding benchmarks for computing performance, be it energy efficiency, processing speed, software usability etc.; there’s a big world behind a simple word, and this veteran architect’s knowledge about this world comes from working at TIBCO, IBM, Intel, AWS and now, HPE).
The latest types of workloads that Gaurav designs systems for, on behalf of HPE, are the training of computer vision models for self-driving cars. Not only are these models huge, but they grow fast. As Gaurav said, “it’s the training, which is computationally intensive, which can use petabytes of data. So for example, this particular AV vendor, they have overall data storage of 25 petabytes. You can imagine they have high definition maps, they have, you know, every single car has got 20 cameras, right? That’s exactly per car, I think they can, in one hour, they produce something like four terabytes of data. So in a single day, they can produce almost, you know, the equivalent of the total amount of data that probably, you know, a normal person consumes in a year”. This, as he describes, is an “almost exponential growth in data”.
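To get a feel for those numbers, here is a quick back-of-the-envelope sketch. It reads the quote as roughly 4 terabytes per car per hour against 25 petabytes of total storage, and it assumes decimal units (1 petabyte = 1,000 terabytes). The fleet of 10 cars driving 8 hours a day is purely hypothetical, since the conversation never gave a fleet size, and the function names are made up for illustration.

```python
TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1,000 TB


def car_hours_to_fill(storage_pb, tb_per_car_hour):
    """Hours of single-car driving needed to generate `storage_pb` of data."""
    return storage_pb * TB_PER_PB / tb_per_car_hour


def fleet_tb_per_day(cars, hours_per_day, tb_per_car_hour):
    """Terabytes generated per day by a fleet (fleet size is hypothetical)."""
    return cars * hours_per_day * tb_per_car_hour


if __name__ == "__main__":
    # Figures from the quote: 25 PB of storage, about 4 TB per car per hour.
    hours = car_hours_to_fill(storage_pb=25, tb_per_car_hour=4)
    # Hypothetical fleet: 10 cars, each driving 8 hours a day.
    daily = fleet_tb_per_day(cars=10, hours_per_day=8, tb_per_car_hour=4)
    print(f"Car-hours to generate 25 PB: {hours:.0f}")
    print(f"Hypothetical fleet output: {daily} TB/day")
    print(f"Days for that fleet to generate 25 PB: {25 * TB_PER_PB / daily:.1f}")
```

Under those assumptions, even a small test fleet would generate the vendor’s entire 25 petabytes in a matter of months, which is why the storage, networking and compute all have to be designed to scale together.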
A co-design approach to architecting HPE’s HPC, specifically for the evolving workload, is absolutely necessary to ensure the computing systems scale in the right way for the workload they are being scaled for; the approach allows Gaurav to make sure all the components are “designed for the same script”. But there is another, more important application for taking a co-design approach to High Performance Computing: to unite the teams that he works with, by making sure all the users of the HPC system are speaking the same language and working towards the same vision. To illustrate this he says “So, so the customer that I work with, they have a mix of, you know, computer vision experts, they have a mix of data scientists, you know, who actually code the algorithms for the smart sensors and the cameras. And then of course, the HPC guys who basically maintain the physical infrastructure and manage the scale out of the storage and of the networking and the compute and whatnot”. Whilst the data scientists, HPC/infrastructure specialists and computer vision experts all specialise in speaking their own language, it’s up to Gaurav to take a co-design approach to understanding how they can all get on the same page to design the most cohesive HPC system possible.

Whilst we will not go into the minor details of Gaurav’s day-job, it does seem like there is a lesson to be learned here, and that is “AI can solve everything”.

That’s not the lesson here, that’s just a bad joke, if that. The lesson here is that the most effective problem solvers keep one thing in mind: the end vision. They start with the end in mind, they are not hindered by the nature of the problem, and they move each piece of the puzzle together as a whole.

Published by Prab Jaswal

