We recently published a paper entitled Distributed optimization for deep learning with gossip exchange with M. Blot, N. Thome and M. Cord. This work is about distributed optimization for deep neural networks in an asynchronous and decentralized setup. We tackle the case where you have several computing resources (e.g., GPUs) and you want to train a single deep learning model. We propose an optimization procedure based on gossip, where each computing node optimizes a local model and occasionally exchanges its weights with a random neighbor. There are several key aspects to this research.
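To make the idea concrete, here is a minimal toy sketch of the gossip procedure, not the paper's exact update rule: a few nodes each run local SGD on a simple quadratic objective, and with some probability per step a random pair of nodes averages its parameters. The objective, learning rate, and gossip probability are all illustrative choices.

```python
import numpy as np

# Toy sketch of gossip-based SGD (illustrative; not the paper's exact update):
# each node runs local noisy gradient steps on a shared quadratic objective
# and, occasionally, averages its parameters with a randomly chosen peer.

rng = np.random.default_rng(0)
n_nodes, dim, steps, lr, p_gossip = 4, 5, 300, 0.1, 0.2
target = rng.normal(size=dim)  # optimum of f(w) = ||w - target||^2 / 2

# independent random initializations, one model per node
weights = [rng.normal(size=dim) for _ in range(n_nodes)]

for _ in range(steps):
    for i in range(n_nodes):
        # noisy gradient of the local quadratic loss
        grad = weights[i] - target + 0.1 * rng.normal(size=dim)
        weights[i] = weights[i] - lr * grad
    if rng.random() < p_gossip:
        # gossip exchange: a random pair moves toward each other's weights
        i, j = rng.choice(n_nodes, size=2, replace=False)
        avg = 0.5 * (weights[i] + weights[j])
        weights[i] = avg
        weights[j] = avg.copy()

# all local models should end up near the optimum (and near each other)
consensus_error = max(np.linalg.norm(w - target) for w in weights)
```

The pairwise averaging step is what drives the nodes toward consensus while each keeps making local progress on its own data.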

First, we show that the gossiping strategy is, in expectation, equivalent to performing stochastic gradient descent with a mini-batch size equal to the aggregate of all the nodes' batches. This means that you can optimize big models that are notoriously hard to train without a large batch size (I'm looking at you, ResNet) on a collection of small GPUs, rather than having to buy a larger and much more expensive one.
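The simple identity underlying this equivalence is that gradients are linear in the batch: averaging the SGD steps that K nodes take from the same starting point, each on its own mini-batch, gives exactly one SGD step on the union of those mini-batches. The snippet below checks this on a toy mean-squared-error loss (the paper's full argument is about expectations over the gossip process; this is just the deterministic core of it).

```python
import numpy as np

# Check: averaging K one-step SGD updates (one per mini-batch) equals a
# single SGD step on the aggregated batch, because the gradient of an
# averaged loss is linear in the examples.

rng = np.random.default_rng(1)
dim, k, batch_size, lr = 3, 4, 8, 0.05
w = rng.normal(size=dim)
target = rng.normal(size=dim)

def grad(w, xs):
    # gradient of the mean of (1/2)||w - x||^2 over the rows of xs
    return np.mean(w - xs, axis=0)

# K equally sized mini-batches, one per node
batches = [target + rng.normal(size=(batch_size, dim)) for _ in range(k)]

# average of the K local steps taken from the same starting point w
avg_of_steps = np.mean([w - lr * grad(w, b) for b in batches], axis=0)
# one step on the aggregated batch of size K * batch_size
big_batch_step = w - lr * grad(w, np.concatenate(batches))

diff = np.max(np.abs(avg_of_steps - big_batch_step))
```

Note the identity needs equally sized mini-batches; with unequal sizes the aggregated step would be a weighted average instead.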

Second, we also show that the gossip mechanism performs a kind of stochastic exploration that is, in my opinion, similar to dropout, but applied to entire models. In short, it is a way to train an ensemble and obtain the aggregate of that ensemble thanks to the consensus optimization.

There is much interesting work still to be done on this topic, mostly theoretical, and I am very much looking forward to it.