Hugging Face Integration

skorch integrates with some libraries from the Hugging Face ecosystem. Take a look at the sections below to learn more.


The AccelerateMixin class can be used to add support for huggingface accelerate to skorch. E.g., this allows you to use mixed precision training (AMP), multi-GPU training, training with a TPU, or gradient accumulation. For the time being, this feature should be considered experimental.

Using accelerate

To use this feature, create a new subclass of the neural net class you want to use and inherit from the mixin class. E.g., if you want to use a NeuralNet, it would look like this:

from skorch import NeuralNet
from skorch.hf import AccelerateMixin

class AcceleratedNet(AccelerateMixin, NeuralNet):
    """NeuralNet with accelerate support"""

The same would work for NeuralNetClassifier, NeuralNetRegressor, etc. Then pass an instance of Accelerator with the desired parameters and you’re good to go:

from accelerate import Accelerator

accelerator = Accelerator(...)
net = AcceleratedNet(
    MyModule,  # your torch.nn.Module class
    accelerator=accelerator,
)
net.fit(X, y)

accelerate recommends leaving the device handling to the Accelerator, which is why device defaults to None (thus telling skorch not to change the device).

Models using AccelerateMixin cannot be pickled. If you need to save and load the net, either use and

To install accelerate, run the following command inside your Python environment:

python -m pip install accelerate


Under the hood, accelerate uses GradScaler, which does not support passing the training step as a closure. Therefore, if your optimizer requires that (e.g. torch.optim.LBFGS), you cannot use accelerate.

Caution when using a multi-GPU setup

There were some issues with old accelerate versions. For best results, please use 0.21 or above.

There is a problem with caching not working correctly in multi-GPU training. Therefore, if using a scoring callback (e.g. skorch.callbacks.EpochScoring), turn caching off by passing use_caching=False. Be aware that when using skorch.NeuralNetClassifier, a scorer for accuracy on the validation set is added automatically. Caching can be turned off like this:

net = NeuralNetClassifier(..., valid_acc__use_caching=False)

When running a lot of scorers, the lack of caching can slow down training considerably because inference is called once for each scorer, even if the results are always the same. A possible solution to this is to write your own scoring callback that records multiple scores to the history using a single inference call.
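As a rough illustration of such a callback, here is a minimal sketch. The class name MultiScoring and the two plain-Python metrics are hypothetical and not part of skorch. The sketch assumes that each item of the validation dataset is an (X, y) pair and that net.predict accepts the dataset directly. The import fallback is only there so the snippet can be run without skorch installed.

```python
try:
    from skorch.callbacks import Callback
except ImportError:
    # Fallback so the sketch can be run without skorch installed.
    Callback = object


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the targets."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def error_rate(y_true, y_pred):
    """Complement of the accuracy."""
    return 1.0 - accuracy(y_true, y_pred)


class MultiScoring(Callback):
    """Hypothetical callback: one predict call per epoch, several scores."""

    def __init__(self, scorers):
        # scorers maps a history key to a function f(y_true, y_pred) -> float
        self.scorers = scorers

    def on_epoch_end(self, net, dataset_train=None, dataset_valid=None, **kwargs):
        if dataset_valid is None:
            return  # nothing to score without a validation split
        # Assumption: each dataset item is an (X, y) pair.
        y_true = [y for _, y in dataset_valid]
        y_pred = net.predict(dataset_valid)  # the only inference call
        for key, score_func in self.scorers.items():
            net.history.record(key, score_func(y_true, y_pred))
```

It could then be passed to the net like any other callback, e.g. callbacks=[MultiScoring({'valid_err': error_rate})]. Because predict is called once per epoch regardless of how many scorers are registered, the cost no longer grows with the number of scores.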

Moreover, if your training relies on the training history in some capacity, e.g. because you want to stop early when the validation loss stops improving, you should use DistributedHistory instead of the default history. More information on this can be found here.


skorch also provides sklearn-like transformers that work with Hugging Face tokenizers. The transform methods of these transformers return data in a dict-like data structure, which makes them easy to use in conjunction with skorch’s NeuralNet. Below is an example of how to use a pretrained tokenizer with the help of skorch.hf.HuggingfacePretrainedTokenizer:

from skorch.hf import HuggingfacePretrainedTokenizer
# pass the model name to be downloaded
hf_tokenizer = HuggingfacePretrainedTokenizer('bert-base-uncased')
data = ['hello there', 'this is a text']
hf_tokenizer.fit(data)  # only loads the model
hf_tokenizer.transform(data)

# use hyper params from pretrained tokenizer to fit on own data
hf_tokenizer = HuggingfacePretrainedTokenizer(
    'bert-base-uncased', train=True, vocab_size=12345)
data = ...
hf_tokenizer.fit(data)  # fits new tokenizer on data
hf_tokenizer.transform(data)

We also provide skorch.hf.HuggingfaceTokenizer if you don’t want to use a pretrained tokenizer but instead want to train your own tokenizer with fine-grained control over each component, like which tokenization method to use.

Of course, since both transformers are scikit-learn compatible, you can use them in a grid search.


The Hugging Face transformers library gives you access to many pretrained deep learning models. There is no special skorch integration for those, since they’re just normal models and can thus be used without further adjustments (as long as they’re PyTorch models).

If you want to see what using transformers with skorch could look like in practice, take a look at the Hugging Face fine-tuning notebook.