Adaptive Softmax

A TF2.0+ implementation of the Adaptive Softmax layer.

For a full description of the project, please see my TowardsDataScience post: https://towardsdatascience.com/how-to-overcome-the-large-vocabulary-bottleneck-using-an-adaptive-softmax-layer-e965a534493d

For the code, please follow the link to my github: https://github.com/Jmkernes/PAR-Transformer-XL/blob/main/adaptive_softmax.py

The adaptive softmax is a type of class-based hierarchical softmax intended to reduce the cost of computing word probabilities over large vocabularies.

The goal of this project is to create a TensorFlow 2.0+ implementation of the adaptive softmax, as outlined in the paper “Efficient softmax approximation for GPUs”.

For large vocabularies, the final dense layer can become prohibitively expensive, making an alternative output layer a must. You can easily achieve anywhere from 2x to 10x speedups in both training and inference.
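To make the scale concrete, here is a back-of-envelope parameter count for a plain full-softmax output layer. The hidden size and vocabulary size below are illustrative assumptions, not values taken from this repository:

```python
# Rough cost of a standard full-softmax output layer.
# d_model and vocab_size are illustrative, not taken from this repo.
d_model, vocab_size = 512, 250_000
full_softmax_params = d_model * vocab_size
print(f"{full_softmax_params:,} output-layer weights")  # 128,000,000
# Every token prediction multiplies a hidden state by all of these weights,
# so the output layer alone can dominate the per-step compute.
```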

The layer consists of two stages:

  1. A segmentation into a head cluster and \(J\) tail clusters. We perform the usual softmax over the head cluster, with \(J\) additional slots for the class probabilities that a word resides in one of the tail clusters.
  2. A dimensional reduction of the final hidden state via linear projections, for the inputs feeding into the tail-cluster softmaxes. The softmax over a tail cluster is only performed for labels that fall in that cluster. When we do have to compute these softmaxes, they are a) much less frequent by definition, and b) much less computationally intensive due to the dimensional reduction. A sketch of both stages is given below.
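Below is a minimal sketch of how the two stages fit together in TF 2.x. The cutoffs, the factor-of-4 projection schedule, and all names here are illustrative assumptions, not the actual implementation — see the linked repository for that.

```python
import tensorflow as tf


class AdaptiveSoftmaxLoss(tf.keras.layers.Layer):
    """Two-stage adaptive softmax: a head softmax over frequent words plus
    cluster slots, and reduced-dimension softmaxes over the tail clusters."""

    def __init__(self, d_model, cutoffs, proj_factor=4, **kwargs):
        # cutoffs, e.g. [2000, 10000, vocab_size]: word ids are assumed to be
        # sorted by frequency, so ids < cutoffs[0] form the head cluster.
        super().__init__(**kwargs)
        self.cutoffs = cutoffs
        self.n_tails = len(cutoffs) - 1
        # Head covers the most frequent words plus one slot per tail cluster.
        self.head = tf.keras.layers.Dense(cutoffs[0] + self.n_tails)
        self.tail_projs, self.tail_logits = [], []
        for j in range(self.n_tails):
            d_proj = max(1, d_model // proj_factor ** (j + 1))
            self.tail_projs.append(tf.keras.layers.Dense(d_proj, use_bias=False))
            self.tail_logits.append(
                tf.keras.layers.Dense(cutoffs[j + 1] - cutoffs[j]))

    def call(self, hidden, labels):
        # hidden: (batch, d_model) final hidden states; labels: (batch,) ids.
        head_logp = tf.nn.log_softmax(self.head(hidden), axis=-1)

        # Stage 1: words in the head cluster only need their head log-prob.
        in_head = labels < self.cutoffs[0]
        head_target = tf.where(in_head, labels, tf.zeros_like(labels))
        nll = -tf.gather(head_logp, head_target, batch_dims=1)

        # Stage 2: for each tail cluster, project down and take a small
        # softmax over just that cluster's words, only for labels that
        # actually fall in the cluster.
        for j in range(self.n_tails):
            lo, hi = self.cutoffs[j], self.cutoffs[j + 1]
            idx = tf.where((labels >= lo) & (labels < hi))[:, 0]
            sub_hidden = tf.gather(hidden, idx)
            tail_logp = tf.nn.log_softmax(
                self.tail_logits[j](self.tail_projs[j](sub_hidden)), axis=-1)
            word_logp = tf.gather(
                tail_logp, tf.gather(labels, idx) - lo, batch_dims=1)
            # Chain rule: log p(word) = log p(cluster) + log p(word | cluster).
            cluster_logp = tf.gather(head_logp, idx)[:, self.cutoffs[0] + j]
            nll = tf.tensor_scatter_nd_update(
                nll, tf.expand_dims(idx, 1), -(cluster_logp + word_logp))
        return tf.reduce_mean(nll)


# Example usage with random data.
layer = AdaptiveSoftmaxLoss(d_model=512, cutoffs=[2000, 10000, 50000])
hidden = tf.random.normal([8, 512])
labels = tf.random.uniform([8], maxval=50000, dtype=tf.int32)
print(layer(hidden, labels))
```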

A schematic of the projection procedure is shown below:

Dimensional reduction helps alleviate the burden of computing the probabilities of rare words.
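As a rough illustration of the savings (with hypothetical cluster sizes, and the factor-of-4 dimension reduction per successive tail cluster that is a common choice in adaptive softmax implementations):

```python
# Per-token multiply-adds for each tail cluster, projected vs. unprojected.
# Cluster sizes and the factor-of-4 schedule are illustrative assumptions.
d_model = 512
tail_sizes = [8_000, 40_000, 200_000]
for j, v in enumerate(tail_sizes):
    d_proj = d_model // 4 ** (j + 1)           # 128, 32, 8
    projected = d_model * d_proj + d_proj * v  # project, then small softmax
    full = d_model * v                         # cost at full hidden width
    print(f"tail {j}: {projected:,} vs {full:,}")
```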

Again, for the full project description and code, please see my TDS article and repository, respectively.