Before we learn to stop in different situations that involve a red traffic light, we have to train ourselves to know 3 fundamental things: 1. What does red look like? 2. What does a traffic light look like? 3. What do I do when I see red on a traffic light?
This is of course over-simplified. I would also probably need some pre-trained sense of distance, so that I wouldn’t just randomly stop when seeing a red light in the far distance. But the point is, there needs to be some preliminary base of assumptions already trained into my brain, so that I can then go on to infer more intelligent things on top of those assumptions. This is, at its core, the distinction between training and inference. In reality there is a constant blur between the two; we infer new things about the world, and that newly inferred information can be fed back into our original framework of assumptions about something, allowing us to adjust our baseline assumptions about that particular thing if we need to. This is called learning. Hopefully most of us do it. To replicate this learning process in a machine is called… wait for it…
Machine learning. To humans, training and inference are experienced as one continuous feedback loop in which newly processed information is continuously used to readjust our existing base of knowledge (learning). But in the business of machine learning (where we artificially replicate intelligence), they are devised as 2 technologically separate processes that must then be coupled together using other things, e.g. information management systems or “data pipeline software” (systems that literally manage the flow of data by creating a software pipeline between the training and inference “layers”).
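To make that split concrete, here is a deliberately tiny Python sketch of the red-light example, assuming nothing beyond the standard library (every name, number and threshold in it is invented for illustration, not taken from any real system). The “training” step bakes a handful of human-labelled examples into a saved artifact; a completely separate “inference” step loads that artifact and acts on new data. The file on disk plays the role of the pipeline that couples the two layers.

```python
# Toy sketch only: "training" learns what "red" looks like from a handful of
# human-labelled RGB samples and writes the result to disk; a separate
# "inference" step loads that artifact to make a decision on new pixels.
import json

def train(labelled_pixels, path="red_model.json"):
    """Average the RGB values of every pixel a human labelled as 'red'."""
    reds = [rgb for rgb, label in labelled_pixels if label == "red"]
    centroid = [sum(channel) / len(reds) for channel in zip(*reds)]
    with open(path, "w") as f:
        json.dump({"red_centroid": centroid}, f)

def infer(pixel, path="red_model.json", threshold=60.0):
    """Load the trained artifact and decide whether to stop."""
    with open(path) as f:
        model = json.load(f)
    distance = sum((p - c) ** 2 for p, c in zip(pixel, model["red_centroid"])) ** 0.5
    return "stop" if distance < threshold else "keep driving"

if __name__ == "__main__":
    training_data = [
        ((220, 30, 25), "red"), ((200, 10, 40), "red"),
        ((30, 180, 40), "green"), ((240, 200, 30), "amber"),
    ]
    train(training_data)           # happens once, offline
    print(infer((210, 25, 35)))    # happens live, again and again -> "stop"
    print(infer((35, 170, 50)))    # -> "keep driving"
```

Training runs once, up front, on labelled data; inference then runs over and over on live data and never touches the labelling process. That gap is exactly what data pipeline software exists to bridge.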
Why, you might ask, does each of these things need to be decoupled? Doesn’t it just make sense for one organisation to spend ages designing the processes together as a whole, just like the brain? Well, in theory, yes, and they would probably do a better job of making everything work cohesively. But there are 2 principles we have to remember. 1. The amount of information that needs to be catered for is insanely large: too much for any one organisation to think about. 2. We live in a capitalist society, which means there’s money to be made from dividing things up into smaller pieces and becoming specialists in those smaller things – so not only are there companies specialising in the different processes involved in learning, but companies/organisations also specialise in the areas of information they are looking to model. Some produce models of human vision (these are called computer vision models). Some produce models of the way humans naturally process language (these are called Natural Language Processing models).
On balance, the 1st of those 2 points is the more relevant. If we were to create an informational model of the entire world, on a computer… well… that would just require too many computers. Or one giant computer that basically connected all the information in the world, which is the premise of The Matrix. Also, we haven’t yet managed to artificially replicate all of the processes that constitute the act of learning, or at least not to a point where we can integrate them all together in perfect harmony and scale that harmony.
So, we oh-so-intelligent humans have found ways of dividing up the learning process into separate chunks, where each company can compete on the basis of who can make their part of the machine learning process more efficient – data pipeline companies compete on the basis of transferring the most information the fastest and in the most energy-efficient way. Inferencing companies compete on the basis of who can turn a bunch of information into a meaningful insight, again, the quickest, with the most information, in the most energy-efficient way. In truth, most technology markets tend to compete on the basis of energy efficiency per amount of work done, and again, in truth, most (actually all) of that work is in the form of movement of data through a piece of software/hardware; think back to your teenage physics education.
But this post is not about the movement of information; it’s about conceptually understanding what training models are. So, on that note, here are 4 principles that can provide a functional understanding of the business landscape of training models.
1. When it comes to building large (huge, enormous) numerical models that simulate a certain task/activity (e.g. computer vision models / NLP models), we often refer to the act of building them as “training”, because, much like we train a dog or a child, we are training the machine’s baseline assumptions, perceptions and knowledge of the world into a digital model. We do this through inputting the initial categories of information (data), and the fundamental principles of that task, that the model is then going to learn from further and use to infer new information from the live feed of data – I have to know what red looks like before I start making intelligent decisions about red objects, so the training model is built off my trained understanding of everything to do with the colour red, and the objects that might be red.
2. From a labour perspective, a large trained model is simply a large accumulation of knowledge produced (coded) by people and confined to a digital structure. This labour perspective reminds us that companies that want to monetise algorithmic models first need to capture the knowledge capital from talented people – as with everything, knowledge and talent are the starting point for creating knowledge and talent. Viewing large models as an accumulation of knowledge is also the foundational lens for understanding how organisations choose to monetise a model; that is a separate article in its own right, one that delves into the world of open-source. But, in essence, what was once a competitive paradigm driven by the centralisation of physical labour within the factory is now developing into a competitive paradigm driven by the centralisation of knowledge and information within the digital structures of large mathematical models.
3. A model simply replicates known assumptions and knowledge about a particular thing (e.g. numerically modelling/simulating an object under gravity, the movement of a fluid, or the movement of a meteor in space, or modelling a driver’s reactions to the environment around them by recording/capturing their journey through cameras and sensors). At this point we are simply modelling something we already know, and so, however simple or complex that thing might be, there is nothing actually “intelligent” about the model itself; the artificially intelligent bit is really the inference. Models are of course extremely useful for understanding the world without physically replicating it every time we want to study it, but it’s not as if we are programming the model to learn by itself from new data; at this point we are simply training the model to be a simulation of something that we know (the first sketch after this list shows what such a model looks like for an object under gravity).
4. Training (manually inputting information into a model) is by far the most computationally intensive part of enabling machine learning. And so, training large, complex models (such as modelling fluid dynamics under different environmental conditions, or modelling the way the brain processes language) requires extreme compute power. This often means large-scale computer systems known as High Performance Computing (HPC), a term that basically describes the physical computing systems that provide huge amounts of data processing power (compute), data movement capability (network) and data storage (storage); the second sketch after this list gives a toy picture of how training work gets spread across many processors. All the major areas of advanced research and development, e.g. climate modelling, astrophysics, computational fluid dynamics, training cars to drive themselves etc., require the training of huge numerical models to be done on HPC-like infrastructure. It’s here that many of the legacy computing companies such as HPE, Lenovo, Dell and Intel make a lot of their money – selling HPC capabilities to enterprises and research institutions (either through selling physical computer systems, or renting capability through the cloud).
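To illustrate point 3, here is a minimal, hypothetical sketch of a numerical model of an object under gravity, using basic Euler time-stepping (the height, time step and constants are arbitrary choices for illustration). Every line encodes something we already know about physics; nothing in it learns from data, which is exactly why the model itself isn’t the “intelligent” part.

```python
# A minimal, hypothetical numerical model: a ball dropped from 100 m,
# stepped forward in time with basic Euler integration. The "model" is
# just known physics written down as code - nothing in it is learned.
G = 9.81    # gravitational acceleration, m/s^2
DT = 0.01   # time step, seconds

def simulate_drop(height_m):
    """Return the time (in seconds) for an object to fall from height_m."""
    position, velocity, t = height_m, 0.0, 0.0
    while position > 0.0:
        velocity += G * DT          # known rule: dv/dt = g
        position -= velocity * DT   # known rule: dx/dt = v
        t += DT
    return t

if __name__ == "__main__":
    # Analytically, sqrt(2 * 100 / 9.81) is about 4.52 s; the simulation
    # should land close to that.
    print(f"Fall time from 100 m: {simulate_drop(100.0):.2f} s")
```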
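And to give a toy picture of point 4: training spreads naturally across many processors because the training data can be split into shards, with each worker computing its share of an update before the results are combined. The sketch below does this for a single gradient step on a one-parameter model using a local process pool; treat it as a stand-in, under very generous assumptions, for what real systems do across whole HPC clusters with vastly larger models, datasets and networks (the parameter, data and learning rate here are all made up).

```python
# Hypothetical, massively simplified illustration of data-parallel training:
# one gradient step for a one-parameter model y = w * x, with the training
# data split across worker processes. Here "the cluster" is just a process
# pool on one machine.
from multiprocessing import Pool

W = 0.5  # current guess for the parameter we are training

def partial_gradient(shard):
    """Gradient of the squared error over one shard of the training data."""
    return sum(2 * (W * x - y) * x for x, y in shard)

if __name__ == "__main__":
    # Fake training data for the "true" relationship y = 3x.
    data = [(x, 3.0 * x) for x in range(1, 10_001)]
    shards = [data[i::4] for i in range(4)]   # split the work 4 ways

    with Pool(processes=4) as pool:
        grad = sum(pool.map(partial_gradient, shards)) / len(data)

    learning_rate = 1e-8
    W_new = W - learning_rate * grad
    print(f"gradient: {grad:.1f}, updated parameter: {W_new:.4f}")
```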
So, now that we have trained ourselves with a fundamental understanding of what large trained models are, we can use our mental models of models to infer a fresh understanding of my conversation with Gaurav Kaul, Solutions Architect for HPC/AI at HPE. Hopefully the fundamental understanding from this article will not need to be adjusted when reading my conversation with Gaurav; rather, the knowledge can be built upon further. But you’re a human, and you’re intelligent, so you can handle the inference feedback into your models of models.