Robotics image via FreeImages
By James Kobielus (@jameskobielus)
Deep learning (DL) feels monolithic. This branch of artificial intelligence (AI) routinely achieves amazing results in computer vision, speech recognition, natural language processing, and many other applications. However, it does so by leveraging architectures that are deeply hierarchical, massively parallel, and intricately neural.
The essence of a monolithic stack is that its processing nodes are tightly coupled. The core criteria for determining whether an architecture is tightly coupled include whether its processing nodes:
- have direct physical interconnections;
- engage in synchronous communication among themselves;
- process models with strong typing and complex object trees;
- execute process-logic control centrally;
- bind services statically; and
- have strong platform and language dependencies.
Most of those criteria apply in spades to today’s DL processing architectures. The core hardware substrate, the GPU, supports incredibly scalable, efficient, and fast processing of DL models in a single tightly coupled server or cluster. However, GPUs are not geared to distributing the layers and nodes of a DL model as microservices across a cloud-native computing fabric, nor are they optimized for loose coupling of feedforward, backpropagation, and other neural-net internode communications.
Almost every technical discussion of DL architectures (such as this recent blog on computer-vision architectures) proceeds from the assumption that the convolutional, pooling, encoding, inception, residual, discriminator, and other layers run on a tightly coupled, high-performance single-node hardware platform (based on GPUs, CPUs, FPGAs, and other technologies). As a proof point, check out this article from earlier this year in which I discuss how DL models’ fast matrix manipulations are still largely executed on GPU-based single-node co-processors, while much of the heavy lifting of DL model training takes place in multi-node Spark clusters that are horizontally scalable, CPU-based, and in-memory.
Acutely aware of DL execution’s traditional orientation toward single-node architectures, I took great interest in IBM Research’s recent announcement of its Distributed Deep Learning (DDL) technology. The DDL software library, which is still in technical preview, enables a DL neural-net model to execute transparently across distributed environments consisting of up to 64 IBM PowerAI servers. In IBM’s implementation, the PowerAI cluster is configured with an aggregate processing capacity of 256 NVIDIA GPUs. DDL provides an API that enables TensorFlow, Caffe, Torch, and Chainer developers to scale out DL model execution across PowerAI clusters for accelerated training and other functions. Under the covers, DDL-enabled apps and PowerAI clusters use a “multiring” algorithm that dynamically optimizes cross-node network paths to automatically balance latency and bandwidth utilization across a distributed DL cluster.
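IBM has not published multiring’s internals, but the approach builds on the well-known ring-allreduce pattern for cross-node gradient exchange. The following is a minimal single-process simulation of a basic ring allreduce (an illustrative sketch, not DDL code): each “worker” holds a gradient vector, and after a scatter-reduce phase and an allgather phase, every worker holds the full elementwise sum.

```python
def ring_allreduce(gradients):
    """Sum equal-length gradient vectors across n ring-connected workers.

    Two phases: scatter-reduce (after n-1 steps, worker i holds the fully
    reduced chunk (i+1) % n), then allgather (n-1 steps circulate the
    reduced chunks). Each step moves only one chunk per link, so per-link
    bandwidth stays constant as the worker count grows.
    """
    n = len(gradients)
    size = len(gradients[0])
    assert size % n == 0, "vector length must divide evenly into n chunks"
    chunk = size // n
    buf = [list(g) for g in gradients]  # each worker's local buffer

    def span(c):  # index range of chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: scatter-reduce. At step s, worker i sends chunk (i - s) % n
    # to its ring neighbor, which accumulates it.
    for step in range(n - 1):
        for i in range(n):
            c, nxt = (i - step) % n, (i + 1) % n
            for j in span(c):
                buf[nxt][j] += buf[i][j]

    # Phase 2: allgather. At step s, worker i forwards the fully reduced
    # chunk (i + 1 - s) % n; the neighbor overwrites its copy.
    for step in range(n - 1):
        for i in range(n):
            c, nxt = (i + 1 - step) % n, (i + 1) % n
            for j in span(c):
                buf[nxt][j] = buf[i][j]

    return buf
```

For example, two workers holding `[1, 2, 3, 4]` and `[5, 6, 7, 8]` both end up with `[6, 8, 10, 12]`. Multiring, as IBM describes it, goes further by running multiple such rings over different network paths to balance latency against bandwidth.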
IBM’s announcement represents an exciting milestone in the decoupling of the DL ecosystem from single-node execution architectures. However, it only scales out the horizontal execution of entire models, across all layers and neural nodes. It does not decouple the execution of the processing layers within any specific DL neural-net model. For that latter capability, I recommend the work that Facebook is doing in the decoupling of Caffe2 model execution across multi-node GPU server clusters. In particular, Facebook is implementing decoupling in two areas of their multi-node DL architecture:
- Decoupling cross-layer dependencies in gradient computation when executing a common DL model: As Facebook researchers put it in this recent study, “In order to scale beyond the 8 GPUs in a single Big Basin server, gradient aggregation has to span across servers on a network. To allow for near perfect linear scaling, the aggregation must be performed in parallel with backprop[agation]. This is possible because there is no data dependency between gradients across layers. Therefore, as soon as the gradient for a layer is computed, it is aggregated across workers, while gradient computation for the next layer continues.”
- Decoupling cross-thread dependencies in the parallelized execution of various subgraphs within a common DL model: As the paper states, “Caffe2 supports multi-threaded execution of the compute graph that represents a training iteration. Whenever there is no data dependency between subgraphs, multiple threads can execute those subgraphs in parallel.”
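The first of these ideas can be sketched in a few lines: because gradients of different layers carry no data dependency, each layer’s cross-worker aggregation can be launched the moment its gradient is ready, overlapping with backprop through the remaining layers. The following toy simulation is my own sketch (not Caffe2 code), with threads standing in for asynchronous network aggregation:

```python
import threading

def aggregate(layer, grads, results, lock):
    """Simulated cross-worker aggregation for one layer's gradient."""
    total = sum(grads)  # stands in for a network allreduce across workers
    with lock:
        results[layer] = total

def backward_with_overlap(per_layer_worker_grads):
    """Walk layers from output to input; kick off aggregation for each
    layer the moment its gradient is 'computed', without waiting."""
    results, lock, threads = {}, threading.Lock(), []
    for layer in reversed(range(len(per_layer_worker_grads))):
        grads = per_layer_worker_grads[layer]  # this layer's gradient, per worker
        t = threading.Thread(target=aggregate, args=(layer, grads, results, lock))
        t.start()           # aggregation proceeds in parallel...
        threads.append(t)   # ...while backprop moves to the next layer
    for t in threads:
        t.join()            # barrier before the weight-update step
    return [results[layer] for layer in range(len(per_layer_worker_grads))]
```

For instance, with three layers whose per-worker gradients are `[[1, 2], [3, 4], [5, 6]]`, the aggregated result is `[3, 7, 11]`; in a real cluster the win is that the network transfer for layer N overlaps with the GPU compute for layer N−1.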
As modularized decoupling of the AI ecosystem proceeds, the ability to farm out execution of entire DL models, components thereof, and/or inter-component communications will become critical for scaling and acceleration. As I discussed in this recent Wikibon research note, DL development will evolve into the modeling of these capabilities as functional primitives for containerization, orchestration, and management as microservices. Eventually, the functional primitives exposed as microservices will include both the coarse-grained capabilities of entire DL models (e.g., classification, clustering, recognition, prediction, natural language processing) and the fine-grained capabilities (convolution, recurrence, pooling, etc.) of which those models are composed.
In the emerging world of radically decoupled DL, these functional-primitive microservices will have the following capabilities:
- Support accelerated DL development through an abstraction layer that compiles declarative program specifications down to DL model assets at every level of granularity;
- Call each other as submodules within a more versatile, adaptive DL architecture;
- Invoke RESTful APIs for dynamic binding with other DL modules;
- Share cross-module variables through stateless, weakly typed DL semantics;
- Orchestrate complex patterns within a distributed DL control plane; and
- Steer clear of monolithic OS and language lock-ins.
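To make the microservices idea concrete, here is a hypothetical sketch of one fine-grained primitive (1-D max pooling) exposed behind a RESTful endpoint, using only the Python standard library. The endpoint path, payload shape, and handler names are my own assumptions for illustration, not any existing API:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def max_pool_1d(values, window):
    """Non-overlapping 1-D max pooling over a flat list of numbers."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]

class PoolingHandler(BaseHTTPRequestHandler):
    """Hypothetical microservice wrapping the pooling primitive."""
    def do_POST(self):
        if self.path != "/v1/primitives/max-pool":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        req = json.loads(self.rfile.read(length))
        body = json.dumps(
            {"result": max_pool_1d(req["values"], req["window"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the sketch quiet
        pass

# Serve on an ephemeral port, then invoke the primitive over REST.
server = HTTPServer(("127.0.0.1", 0), PoolingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

payload = json.dumps({"values": [1, 5, 2, 8, 3, 3], "window": 2}).encode()
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_address[1]}/v1/primitives/max-pool",
    data=payload, headers={"Content-Type": "application/json"})
reply = json.loads(urllib.request.urlopen(req).read())
server.shutdown()
```

A production version would of course sit behind an orchestration layer (containers, service discovery, a DL control plane), but the shape is the same: a stateless, weakly typed JSON contract that lets any DL module bind to the primitive dynamically.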
Before long, DL functionality will be so easy to decouple that you’ll be able to embed whole models—or even the tiniest pieces of them—on the edge in mobile devices, Internet of Things (IoT) endpoints, and so on. Actually, that trend is already well along, as can be seen in Intel’s recent release of a USB-based “Neural Compute Stick” for development of embedded DL.
DL functionality will soon be decoupled so thoroughly and disseminated so broadly that it will seem to disappear. Through cloud-native computing and the IoT, DL DNA will be literally everywhere.