1、FairScale
https://github.com/facebookresearch/fairscale
FairScale is a PyTorch extension library for high performance and large scale training on one or multiple machines/nodes. This library extends basic PyTorch capabilities while adding new experimental ones.
2、SpeedTorch
https://github.com/Santosh-Gupta/SpeedTorch
This library revolves around CuPy tensors pinned to CPU, which can achieve 3.1x faster CPU -> GPU transfer than regular PyTorch pinned CPU tensors, and 410x faster GPU -> CPU transfer. Speed depends on the amount of data and the number of CPU cores on your system (see the How it Works section for more details).
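For context, the baseline that SpeedTorch compares itself against is an ordinary PyTorch tensor held in pinned (page-locked) CPU memory. The sketch below is not SpeedTorch's API; it only uses standard PyTorch calls to show how such a baseline transfer could be timed. The helper names `make_host_buffer` and `mean_copy_seconds` are hypothetical, and the code assumes it should fall back to a plain CPU copy when no CUDA device is present.

```python
import time

import torch


def make_host_buffer(shape, pinned):
    """Create a float32 CPU tensor, optionally in pinned (page-locked) memory."""
    buf = torch.randn(*shape)
    # Pinning needs a CUDA runtime; on CPU-only machines keep pageable memory.
    if pinned and torch.cuda.is_available():
        buf = buf.pin_memory()
    return buf


def mean_copy_seconds(src, device, repeats=10):
    """Return the mean wall-clock seconds for copying `src` to `device`."""
    dst = torch.empty(src.shape, dtype=src.dtype, device=device)
    use_cuda = device.type == "cuda"
    if use_cuda:
        torch.cuda.synchronize()  # make sure no earlier GPU work is pending
    start = time.perf_counter()
    for _ in range(repeats):
        # non_blocking=True only overlaps with host work when src is pinned.
        dst.copy_(src, non_blocking=True)
    if use_cuda:
        torch.cuda.synchronize()  # wait for the asynchronous copies to finish
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    target = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for pinned in (False, True):
        host = make_host_buffer((2048, 2048), pinned)
        seconds = mean_copy_seconds(host, target)
        print(f"pinned={pinned}: {seconds * 1e3:.3f} ms per copy")
```

The numbers this prints depend on the tensor size and the machine, so they are not comparable to the figures quoted above; the point is only to show what a "regular PyTorch pinned CPU tensor" transfer looks like.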