skorch.hf

Classes to work with Hugging Face ecosystem (https://huggingface.co/)

E.g. transformers or tokenizers

This module should be treated as a leaf node in the dependency tree, i.e. no other skorch modules should depend on these classes or import from here. Even so, don’t import any Hugging Face libraries on the root level because skorch should not depend on them.

class skorch.hf.AccelerateMixin(*args, accelerator, device=None, callbacks__print_log__sink='auto', **kwargs)[source]

Mixin class to add support for Hugging Face accelerate

This is an experimental feature.

Use this mixin class with one of the neural net classes (e.g. NeuralNet, NeuralNetClassifier, or NeuralNetRegressor) and pass an instance of Accelerator for mixed precision, multi-GPU, or TPU training.

Install the accelerate library using:

python -m pip install accelerate

skorch does not itself provide any facilities to enable these training features. Many of them can still be implemented by the user with a bit of extra work, but it can be a daunting task. That is why this helper class was added: using this mixin in conjunction with the accelerate library should cover many common use cases.

Note

Under the hood, accelerate uses GradScaler, which does not support passing the training step as a closure. Therefore, if your optimizer requires that (e.g. torch.optim.LBFGS), you cannot use accelerate.

Warning

Since accelerate is still quite young and backwards-compatibility-breaking features might be added, we treat its integration as an experimental feature. When accelerate’s API stabilizes, we will consider adding it to skorch proper.

Also, models accelerated this way cannot be pickled. If you need to save and load the net, either use skorch.net.NeuralNet.save_params() and skorch.net.NeuralNet.load_params() or don’t use accelerate.
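For instance, a minimal sketch of saving and restoring the parameters without pickling (the file name is a placeholder; AcceleratedNet, MyModule, and accelerator as defined in the Examples below):

# save the weights of a fitted, accelerated net
net.save_params(f_params='my-params.pt')

# restore them into a freshly initialized net
new_net = AcceleratedNet(MyModule, accelerator=accelerator)
new_net.initialize()  # initialize before loading parameters
new_net.load_params(f_params='my-params.pt')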

Parameters:
accelerator : accelerate.Accelerator

In addition to the usual parameters, pass an instance of accelerate.Accelerator with the desired settings.

device : str, torch.device, or None (default=None)

The compute device to be used. When using accelerate, it is recommended to leave device handling to accelerate. Therefore, it is best to leave this argument as None, which means that skorch does not set the device.

callbacks__print_log__sink : ‘auto’ or callable

If ‘auto’, uses the print function of the accelerator, if it has one. This avoids printing the same output multiple times when training concurrently on multiple machines. If the accelerator does not have a print function, Python’s print function is used instead.

Examples

>>> from skorch import NeuralNetClassifier
>>> from skorch.hf import AccelerateMixin
>>> from accelerate import Accelerator
>>>
>>> class AcceleratedNet(AccelerateMixin, NeuralNetClassifier):
...     '''NeuralNetClassifier with accelerate support'''
>>>
>>> accelerator = Accelerator(...)
>>> # you may pass gradient_accumulation_steps to enable grad accumulation
>>> net = AcceleratedNet(MyModule, accelerator=accelerator)
>>> net.fit(X, y)

The same approach works with all the other skorch net classes.
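As a concrete configuration, here is a hedged sketch of enabling automatic mixed precision training (mixed_precision is an argument of accelerate.Accelerator; MyModule, X, and y are placeholders):

from accelerate import Accelerator

# enable automatic mixed precision through accelerate
accelerator = Accelerator(mixed_precision='fp16')
net = AcceleratedNet(MyModule, accelerator=accelerator)
net.fit(X, y)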

Methods

on_train_end(net[, X, y])
get_iterator  
initialize_callbacks  
train_step  
train_step_single  
class skorch.hf.HfHubStorage(hf_api, path_in_repo, repo_id, local_storage=None, verbose=0, sink=<built-in function print>, **kwargs)[source]

Helper class that allows writing data to the Hugging Face Hub.

Use this, for instance, in combination with checkpoint callbacks such as skorch.callbacks.training.TrainEndCheckpoint or skorch.callbacks.training.Checkpoint to upload the trained model directly to the Hugging Face Hub instead of storing it locally.

To use this, it is necessary to install the Hugging Face Hub library.

python -m pip install huggingface_hub

Note that writes to the Hub are synchronous. Therefore, if the time it takes to upload the data is long compared to training the model, there can be a significant slowdown. It is best to use this with skorch.callbacks.training.TrainEndCheckpoint, as that checkpoint only uploads the data once, at the end of training. Also, using this writer with skorch.callbacks.training.LoadInitState is not supported for now because the Hub API does not support model loading yet.

Parameters:
hf_api : instance of huggingface_hub.HfApi

Pass an instantiated huggingface_hub.HfApi object here.

path_in_repo : str

The name that the file should have in the repo, e.g. my-model.pkl. If you want each upload to have a different file name, instead of overwriting the file, use a templated name, e.g. my-model-{}.pkl. Then your files will be called my-model-1.pkl, my-model-2.pkl, etc. If there are already files by this name in the repository, they will be overwritten.

repo_id : str

The repository to which the file will be uploaded, for example: "username/reponame".

verbose : int (default=0)

Control the level of verbosity.

local_storage : str, pathlib.Path or None (default=None)

Indicate temporary storage of the parameters. By default, they are stored in-memory. By passing a string or Path to this parameter, you can instead store the parameters at the indicated location. There is no automatic cleanup, so if you don’t need the file on disk, put it into a temp folder.

sink : callable (default=print)

The target that the verbose information is sent to. By default, the output is printed to stdout, but the sink could also be a logger or noop().

kwargs : dict

The remaining arguments are the same as for HfApi.upload_file (see https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.upload_file).

Examples

>>> from huggingface_hub import create_repo, HfApi
>>> from skorch import NeuralNet
>>> from skorch.callbacks import TrainEndCheckpoint
>>> from skorch.hf import HfHubStorage
>>> model_name = 'my-skorch-model.pkl'
>>> params_name = 'my-torch-params.pt'
>>> repo_name = 'my-user/my-repo'
>>> token = 'my-secret-token'
>>> # you can create a new repo like this:
>>> create_repo(repo_name, token=token, exist_ok=True)
>>> hf_api = HfApi()
>>> hub_pickle_writer = HfHubStorage(
...     hf_api,
...     path_in_repo=model_name,
...     repo_id=repo_name,
...     token=token,
...     verbose=1,
... )
>>> hub_params_writer = HfHubStorage(
...     hf_api,
...     path_in_repo=params_name,
...     repo_id=repo_name,
...     token=token,
...     verbose=1,
... )
>>> checkpoints = [
...     TrainEndCheckpoint(f_pickle=hub_pickle_writer),
...     TrainEndCheckpoint(f_params=hub_params_writer),
... ]
>>> net = NeuralNet(..., callbacks=checkpoints)
>>> net.fit(X, y)
>>> # prints:
>>> # Uploaded model to https://huggingface.co/my-user/my-repo/blob/main/my-skorch-model.pkl
>>> # Uploaded model to https://huggingface.co/my-user/my-repo/blob/main/my-torch-params.pt
...
>>> # later...
>>> import pickle
>>> from huggingface_hub import hf_hub_download
>>> path = hf_hub_download(repo_name, model_name, use_auth_token=token)
>>> with open(path, 'rb') as f:
...     net_loaded = pickle.load(f)
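Analogously, the torch parameter file can be downloaded and loaded into a fresh net; a sketch under the same assumptions as the example above:

# restore the params saved through hub_params_writer
path = hf_hub_download(repo_name, params_name, use_auth_token=token)
net = NeuralNet(...)  # placeholder: same arguments as used for training
net.initialize()      # initialize before loading parameters
net.load_params(f_params=path)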
Attributes:
latest_url_ : str

Stores the latest URL that the file has been uploaded to.

Methods

flush() Flush buffered file
write(content) Upload the file to the Hugging Face Hub
close  
read  
seek  
tell  
flush()[source]

Flush buffered file

write(content)[source]

Upload the file to the Hugging Face Hub

class skorch.hf.HuggingfacePretrainedTokenizer(tokenizer, train=False, max_length=256, return_tensors='pt', return_attention_mask=True, return_token_type_ids=False, return_length=False, verbose=0, vocab_size=None)[source]

Wraps a pretrained Huggingface tokenizer to work as an sklearn transformer

From the tokenizers docs:

🤗 Tokenizers provides an implementation of today’s most used
tokenizers, with a focus on performance and versatility.

Use pretrained Hugging Face tokenizers in an sklearn compatible transformer.

Parameters:
tokenizer : str or os.PathLike or transformers.PreTrainedTokenizerFast

If a string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased. If a path, the path to a directory containing the vocabulary files required by the tokenizer, e.g., ./my_model_directory/. Else, should be an instantiated PreTrainedTokenizerFast.

train : bool (default=False)

Whether to use the pre-trained tokenizer directly as is or to retrain it on your data. If you just want to use the pre-trained tokenizer without further modification, leave this parameter as False. However, if you want to fit the tokenizer on your own data (completely from scratch, forgetting what it has learned previously), set this argument to True. The latter option is useful if you want to use the same hyper-parameters as the pre-trained tokenizer but want the vocabulary to be fitted to your dataset. The vocabulary size of this new tokenizer can be set explicitly by passing the vocab_size argument.

max_length : int (default=256)

Maximum number of tokens used per sequence.

return_tensors : one of None, str, ‘pt’, ‘np’, ‘tf’ (default=’pt’)

What type of result values to return. By default, return a padded and truncated (to max_length) PyTorch Tensor. Similarly, ‘np’ results in a padded and truncated numpy array. TensorFlow tensors are not officially supported but should also work. If None or str, return a list of lists instead. These lists are not padded or truncated, thus each row may have a different number of elements.

return_attention_mask : bool (default=True)

Whether to return the attention mask.

return_token_type_ids : bool (default=False)

Whether to return the token type ids.

return_length : bool (default=False)

Whether to return the length of the encoded inputs.

pad_token : str (default=’[PAD]’)

A special token used to make arrays of tokens the same size for batching purposes. Will then be ignored by attention mechanisms.

vocab_size : int or None (default=None)

Change this parameter only if you use train=True. In that case, this parameter will determine the vocabulary size of the newly trained tokenizer. If you set train=True but leave this parameter as None, the same vocabulary size as the one from the initial tokenizer will be used.

verbose : int (default=0)

Whether the tokenizer should print more information and warnings.

Examples

>>> from skorch.hf import HuggingfacePretrainedTokenizer
>>> # pass the model name to be downloaded
>>> hf_tokenizer = HuggingfacePretrainedTokenizer('bert-base-uncased')
>>> data = ['hello there', 'this is a text']
>>> hf_tokenizer.fit(data)  # only loads the model
>>> hf_tokenizer.transform(data)
>>> # pass pretrained tokenizer as object
>>> my_tokenizer = ...
>>> hf_tokenizer = HuggingfacePretrainedTokenizer(my_tokenizer)
>>> hf_tokenizer.fit(data)
>>> hf_tokenizer.transform(data)
>>> # use hyper params from pretrained tokenizer to fit on own data
>>> hf_tokenizer = HuggingfacePretrainedTokenizer(
...     'bert-base-uncased', train=True, vocab_size=12345)
>>> data = ...
>>> hf_tokenizer.fit(data)  # fits new tokenizer on data
>>> hf_tokenizer.transform(data)
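To illustrate what transform() returns with the default return_tensors='pt', a minimal sketch (the exact keys depend on the return_* arguments; the call shapes here are assumptions):

# inspect the encoded output of the fitted tokenizer from above
encoded = hf_tokenizer.transform(['hello there'])
encoded['input_ids']       # padded/truncated tensor of shape (1, max_length)
encoded['attention_mask']  # 1 for real tokens, 0 for padding
hf_tokenizer.inverse_transform(encoded)  # decode back into strings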
Attributes:
vocabulary_ : dict

A mapping of terms to feature indices.

fast_tokenizer_ : transformers.PreTrainedTokenizerFast

If you want to extract the Hugging Face tokenizer to use it without skorch, use this attribute.

See the tokenizers docs: https://huggingface.co/docs/tokenizers/python/latest/index.html

Methods

fit(X[, y]) Load the pretrained tokenizer
fit_transform(X[, y]) Fit to data, then transform it.
get_feature_names_out([input_features]) Array mapping from feature integer indices to feature name.
get_params([deep]) Get parameters for this estimator.
inverse_transform(X) Decode encodings back into strings
set_params(**params) Set the parameters of this estimator.
tokenize(X, **kwargs) Convenience method to use the trained tokenizer for tokenization
transform(X) Transform the given data
fit(X, y=None, **fit_params)[source]

Load the pretrained tokenizer

Parameters:
X : iterable of str

This parameter is ignored.

y : None

This parameter is ignored.

fit_params : dict

This parameter is ignored.

Returns:
self : HuggingfacePretrainedTokenizer

The fitted instance of the tokenizer.

class skorch.hf.HuggingfaceTokenizer(tokenizer, model=None, trainer='auto', normalizer=None, pre_tokenizer=None, post_processor=None, max_length=256, return_tensors='pt', return_attention_mask=True, return_token_type_ids=False, return_length=False, pad_token='[PAD]', verbose=0, **kwargs)[source]

Wraps a Hugging Face tokenizer to work as an sklearn transformer

From the tokenizers docs:

🤗 Tokenizers provides an implementation of today’s most used
tokenizers, with a focus on performance and versatility.

Train Hugging Face tokenizers on custom data using an sklearn compatible API.

Parameters:
tokenizer : tokenizers.Tokenizer

The tokenizer to train.

model : tokenizers.models.Model

The model represents the actual tokenization algorithm, e.g. BPE.

trainer : tokenizers.trainers.Trainer or ‘auto’ (default=’auto’)

Class responsible for training the tokenizer. If ‘auto’, the correct trainer will be inferred from the used model using model.get_trainer().

normalizer : tokenizers.normalizers.Normalizer or None (default=None)

Optional normalizer, e.g. for casting the text to lowercase.

pre_tokenizer : tokenizers.pre_tokenizers.PreTokenizer or None (default=None)

Optional pre-tokenization, e.g. splitting on space.

post_processor : tokenizers.processors.PostProcessor or None (default=None)

Optional post-processor, mostly used to add special tokens for BERT etc.

max_length : int (default=256)

Maximum number of tokens used per sequence.

return_tensors : one of None, str, ‘pt’, ‘np’, ‘tf’ (default=’pt’)

What type of result values to return. By default, return a padded and truncated (to max_length) PyTorch Tensor. Similarly, ‘np’ results in a padded and truncated numpy array. TensorFlow tensors are not officially supported but should also work. If None or str, return a list of lists instead. These lists are not padded or truncated, thus each row may have a different number of elements.

return_attention_mask : bool (default=True)

Whether to return the attention mask.

return_token_type_ids : bool (default=False)

Whether to return the token type ids.

return_length : bool (default=False)

Whether to return the length of the encoded inputs.

pad_token : str (default=’[PAD]’)

A special token used to make arrays of tokens the same size for batching purposes. Will then be ignored by attention mechanisms.

verbose : int (default=0)

Whether the tokenizer should print more information and warnings.

Examples

>>> # train a BERT tokenizer from scratch
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import WordPiece
>>> from tokenizers import normalizers
>>> from tokenizers.normalizers import Lowercase, NFD, StripAccents
>>> from tokenizers.pre_tokenizers import Whitespace
>>> from tokenizers.processors import TemplateProcessing
>>> bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
>>> normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
>>> pre_tokenizer = Whitespace()
>>> post_processor = TemplateProcessing(
...    single="[CLS] $A [SEP]",
...    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
...    special_tokens=[
...        ("[CLS]", 1),
...        ("[SEP]", 2),
...    ],
... )
>>> from skorch.hf import HuggingfaceTokenizer
>>> hf_tokenizer = HuggingfaceTokenizer(
...     tokenizer=bert_tokenizer,
...     normalizer=normalizer,
...     pre_tokenizer=pre_tokenizer,
...     post_processor=post_processor,
...     trainer__vocab_size=30522,
...     trainer__special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
... )
>>> data = ['hello there', 'this is a text']
>>> hf_tokenizer.fit(data)
>>> hf_tokenizer.transform(data)

In general, you can pass both initialized objects and uninitialized objects as parameters:

# initialized
HuggingfaceTokenizer(tokenizer=Tokenizer(model=WordPiece()))
# uninitialized
HuggingfaceTokenizer(tokenizer=Tokenizer, model=WordPiece)

Both approaches work equally well and allow you to, for instance, grid search on the tokenizer parameters. However, it is recommended not to pass an initialized trainer. This is because the trainer will then be saved as an attribute on the object, which can be wasteful. Instead, it is best to leave the default trainer='auto', which results in the trainer being derived from the model.
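For example, a sketch of such a grid search (net, X, and y are placeholders; parameter routing via double underscores follows the usual sklearn/skorch convention):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# grid search over tokenizer hyper-parameters inside a pipeline
pipe = Pipeline([
    ('tokenize', HuggingfaceTokenizer(tokenizer=Tokenizer, model=WordPiece)),
    ('net', net),  # placeholder: e.g. a skorch NeuralNetClassifier
])
params = {
    'tokenize__max_length': [128, 256],
    'tokenize__trainer__vocab_size': [10000, 30522],
}
search = GridSearchCV(pipe, params, cv=3)
search.fit(X, y)  # X: list of strings, y: labels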

Note

If you want to train the HuggingfaceTokenizer in parallel (e.g. during a grid search), you should probably set the environment variable TOKENIZERS_PARALLELISM=false. Otherwise, you may experience slowdowns or deadlocks.
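Setting the variable from within Python, before any tokenizer work is done:

import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'  # avoid slowdowns/deadlocks in parallel fits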

Attributes:
vocabulary_ : dict

A mapping of terms to feature indices.

fast_tokenizer_ : transformers.PreTrainedTokenizerFast

If you want to extract the Hugging Face tokenizer to use it without skorch, use this attribute.

See the tokenizers docs: https://huggingface.co/docs/tokenizers/python/latest/index.html

Methods

fit(X[, y]) Train the tokenizer on given data
fit_transform(X[, y]) Fit to data, then transform it.
get_feature_names_out([input_features]) Array mapping from feature integer indices to feature name.
get_params([deep]) Get parameters for this estimator.
get_params_for(prefix) Collect and return init parameters for an attribute.
initialize() Initialize the individual tokenizer components
initialize_trainer() Initialize the trainer
initialized_instance(instance_or_cls, kwargs) Return an instance initialized with the given parameters
inverse_transform(X) Decode encodings back into strings
set_params(**kwargs) Set the parameters of this class.
tokenize(X, **kwargs) Convenience method to use the trained tokenizer for tokenization
transform(X) Transform the given data
initialize_model  
initialize_normalizer  
initialize_post_processor  
initialize_pre_tokenizer  
initialize_tokenizer  
fit(X, y=None, **fit_params)[source]

Train the tokenizer on given data

Parameters:
X : iterable of str

A list/array of strings, or an iterable which generates strings.

y : None

This parameter is ignored.

fit_params : dict

This parameter is ignored.

Returns:
self : HuggingfaceTokenizer

The fitted instance of the tokenizer.

get_params(deep=False)[source]

Get parameters for this estimator.

Parameters:
deep : bool (default=False)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

get_params_for(prefix)[source]

Collect and return init parameters for an attribute.

initialize()[source]

Initialize the individual tokenizer components

initialize_trainer()[source]

Initialize the trainer

Infer the trainer type from the model if necessary.

initialized_instance(instance_or_cls, kwargs)[source]

Return an instance initialized with the given parameters

This is a helper method that deals with several possibilities for a component that might need to be initialized:

  • It is already an instance that’s good to go
  • It is an instance but it needs to be re-initialized
  • It’s not an instance and needs to be initialized

For the majority of use cases, this just comes down to initializing the class with its arguments.
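A hypothetical sketch of what this enables at the user level (the model__unk_token routing shown here is an assumption based on skorch's usual double-underscore convention):

# a component passed as a class is initialized with its routed arguments
tok = HuggingfaceTokenizer(tokenizer=Tokenizer, model=WordPiece, model__unk_token='[UNK]')
tok.fit(data)  # on initialize(), WordPiece is instantiated as WordPiece(unk_token='[UNK]')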

Parameters:
instance_or_cls

The instance or class or callable to be initialized.

kwargs : dict

The keyword arguments to initialize the instance or class. Can be an empty dict.

Returns:
instance

The initialized component.

set_params(**kwargs)[source]

Set the parameters of this class.

Valid parameter keys can be listed with get_params().

Returns:
self