There are two primary approaches to achieving high-performance computing for NLP workloads:
- Scale up: add GPUs to a single server (sketched in the example after this list)
- Scale out: connect CPUs across multiple servers
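To make the scale-up path concrete, here is a minimal sketch that times the same matrix multiplication on CPU and, when a GPU is available, on CUDA. It assumes PyTorch is installed; the library choice and matrix size are illustrative, not prescribed by the text.

```python
import time

import torch

def timed_matmul(device: str, n: int = 2048) -> float:
    """Multiply two random n x n matrices on `device`; return elapsed seconds."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # ensure setup kernels have finished
    start = time.perf_counter()
    _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # GPU kernels launch asynchronously; wait for completion
    return time.perf_counter() - start

print(f"cpu:  {timed_matmul('cpu'):.4f}s")
if torch.cuda.is_available():
    print(f"cuda: {timed_matmul('cuda'):.4f}s")  # scale-up: same code, one GPU
```

The same code runs on either device; only the `device` string changes, which is what makes the scale-up path attractive for matrix-heavy workloads.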
Scaled-out approaches generally aim to keep RAM utilization consistently high across the cluster: the framework traverses the computational graph automatically, allocating resources to optimize throughput. For deep learning models in particular, hardware acceleration of parallelized matrix multiplications makes a large difference. In neural networks, backpropagation is more computationally expensive than the forward pass. Once the model is trained, its weights and structure can be exported to any hardware for prediction, since inference requires only a forward pass.
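As a hedged illustration of that export step (again assuming PyTorch; the two-layer model below is a made-up placeholder, not a model from the text), trained weights can be saved once and reloaded onto whatever hardware serves predictions:

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained model; real weights would come from training.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

# Export: persist only the learned weights, independent of training hardware.
torch.save(model.state_dict(), "model_weights.pt")

# Deploy: rebuild the same structure, then load weights onto the serving device
# (map_location lets a GPU-trained checkpoint load on a CPU-only machine).
serving_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
serving_model.load_state_dict(torch.load("model_weights.pt", map_location="cpu"))
serving_model.eval()

# Inference is just a forward pass; no gradients (and no backprop cost) needed.
with torch.no_grad():
    prediction = serving_model(torch.randn(1, 128))
print(prediction)
```

Disabling gradient tracking at inference time reflects the cost asymmetry noted above: serving a prediction needs only the cheaper forward pass.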