skorch.hf

Classes to work with Hugging Face ecosystem (https://huggingface.co/)

E.g. transformers or tokenizers

This module should be treated as a leaf node in the dependency tree, i.e. no other skorch modules should depend on these classes or import from here. Even so, don’t import any Hugging Face libraries at the root level, because skorch should not depend on them.

class skorch.hf.AccelerateMixin(*args, accelerator, device=None, unwrap_after_train=True, callbacks__print_log__sink='auto', **kwargs)[source]

Mixin class to add support for Hugging Face accelerate

This is an experimental feature.

Use this mixin class with one of the neural net classes (e.g. NeuralNet, NeuralNetClassifier, or NeuralNetRegressor) and pass an instance of Accelerator for mixed precision, multi-GPU, or TPU training.

Install the accelerate library using:

python -m pip install accelerate

skorch does not itself provide any facilities to enable these training features. Many of them can still be implemented by the user with a bit of extra work, but that can be a daunting task. That is why this helper class was added: using this mixin in conjunction with the accelerate library should cover a lot of common use cases.

Note

Under the hood, accelerate uses GradScaler, which does not support passing the training step as a closure. Therefore, if your optimizer requires that (e.g. torch.optim.LBFGS), you cannot use accelerate.

Warning

Since accelerate is still quite young and backwards compatibility breaking changes might be introduced, we treat its integration as an experimental feature. When accelerate’s API stabilizes, we will consider adding it to skorch proper.

Also, models accelerated this way cannot be pickled. If you need to save and load the net, either use skorch.net.NeuralNet.save_params() and skorch.net.NeuralNet.load_params() or don’t use accelerate.

Parameters
accelerator : accelerate.Accelerator

In addition to the usual parameters, pass an instance of accelerate.Accelerator with the desired settings.

device : str, torch.device, or None (default=None)

The compute device to be used. When using accelerate, it is recommended to leave device handling to accelerate. Therefore, it is best to leave this argument as None, which means that skorch does not set the device.

unwrap_after_train : bool (default=True)

By default, with this option being True, the module(s) and criterion are automatically “unwrapped” after training. This means that their initial state – from before they were prepared by the accelerator – is restored. This is necessary to pickle the net.

There are circumstances where you might want to disable this behavior. For instance, when you want to further train the model with AMP enabled (using net.partial_fit or warm_start=True). Also, unwrapping the modules means that the advantage of using mixed precision is lost during inference. In those cases, if you don’t need to pickle the net, you should set unwrap_after_train=False (see the sketch after this parameter list).

callbacks__print_log__sink : ‘auto’ or callable

If ‘auto’, uses the print function of the accelerator, if it has one. This avoids printing the same output multiple times when training concurrently on multiple machines. If the accelerator does not have a print function, Python’s print function is used instead.
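
For illustration, here is a minimal sketch of continued AMP training with unwrapping disabled, using the AcceleratedNet class defined in the Examples section below; the module and the mixed precision setting are illustrative assumptions:

>>> accelerator = Accelerator(mixed_precision='fp16')
>>> net = AcceleratedNet(
...     MyModule,
...     accelerator=accelerator,
...     unwrap_after_train=False,  # keep the modules prepared by the accelerator
... )
>>> net.fit(X, y)
>>> # later: continue training with mixed precision still in place
>>> net.partial_fit(X, y)
>>> # note that the net cannot be pickled in this state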

Examples

>>> from skorch import NeuralNetClassifier
>>> from skorch.hf import AccelerateMixin
>>> from accelerate import Accelerator
>>>
>>> class AcceleratedNet(AccelerateMixin, NeuralNetClassifier):
...     '''NeuralNetClassifier with accelerate support'''
>>>
>>> accelerator = Accelerator(...)
>>> # you may pass gradient_accumulation_steps to enable grad accumulation
>>> net = AcceleratedNet(MyModule, accelerator=accelerator)
>>> net.fit(X, y)

The same approach works with all the other skorch net classes.
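
Because nets accelerated this way cannot be pickled (see the warning above), persisting the model is best done with save_params() and load_params(). Below is a minimal sketch under the same assumptions as the example above; the mixed precision setting and file name are illustrative:

>>> accelerator = Accelerator(mixed_precision='fp16')
>>> net = AcceleratedNet(MyModule, accelerator=accelerator)
>>> net.fit(X, y)
>>> net.save_params(f_params='model.pt')  # instead of pickling the net
>>>
>>> # to restore, set up an identically configured net first
>>> new_net = AcceleratedNet(MyModule, accelerator=Accelerator(mixed_precision='fp16'))
>>> new_net.initialize()
>>> new_net.load_params(f_params='model.pt')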

Methods

initialize()

Initializes all of its components and returns self.

load_params(*args, **kwargs)

on_train_end(net[, X, y])

save_params(*args, **kwargs)

evaluation_step

get_iterator

train_step

train_step_single

initialize()[source]

Initializes all of its components and returns self.

class skorch.hf.HfHubStorage(hf_api, path_in_repo, repo_id, local_storage=None, verbose=0, sink=<built-in function print>, **kwargs)[source]

Helper class that allows writing data to the Hugging Face Hub.

Use this, for instance, in combination with checkpoint callbacks such as skorch.callbacks.training.TrainEndCheckpoint or skorch.callbacks.training.Checkpoint to upload the trained model directly to the Hugging Face Hub instead of storing it locally.

To use this, it is necessary to install the Hugging Face Hub library.

python -m pip install huggingface_hub

Note that writes to the Hub are synchronous. Therefore, if the time it takes to upload the data is long compared to training the model, there can be a significant slowdown. It is best to use this with skorch.callbacks.training.TrainEndCheckpoint, as that checkpoint only uploads the data once, at the end of training. Also, using this writer with skorch.callbacks.training.LoadInitState is not supported for now because the Hub API does not support model loading yet.

Parameters
hf_api : instance of huggingface_hub.HfApi

Pass an instantiated huggingface_hub.HfApi object here.

path_in_repo : str

The name that the file should have in the repo, e.g. my-model.pkl. If you want each upload to have a different file name, instead of overwriting the file, use a templated name, e.g. my-model-{}.pkl. Then your files will be called my-model-1.pkl, my-model-2.pkl, etc. (see the sketch after this parameter list). If there are already files by this name in the repository, they will be overwritten.

repo_id : str

The repository to which the file will be uploaded, for example: "username/reponame".

verbose : int (default=0)

Control the level of verbosity.

local_storage : str, pathlib.Path, or None (default=None)

Indicate temporary storage of the parameters. By default, they are stored in-memory. By passing a string or Path to this parameter, you can instead store the parameters at the indicated location. There is no automatic cleanup, so if you don’t need the file on disk, put it into a temp folder.

sink : callable (default=print)

The target that the verbose information is sent to. By default, the output is printed to stdout, but the sink could also be a logger or noop().

kwargs : dict

The remaining arguments are the same as for HfApi.upload_file (see https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.upload_file).
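
As mentioned for path_in_repo above, a templated name gives each upload its own file instead of overwriting the previous one. A minimal sketch, assuming hf_api and the repository are set up as in the example below:

>>> hub_writer = HfHubStorage(
...     hf_api,
...     path_in_repo='my-model-{}.pkl',  # '{}' is filled in per upload
...     repo_id='my-user/my-repo',
... )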

Examples

>>> from huggingface_hub import create_repo, HfApi
>>> from skorch import NeuralNet
>>> from skorch.callbacks import TrainEndCheckpoint
>>> from skorch.hf import HfHubStorage
>>> model_name = 'my-skorch-model.pkl'
>>> params_name = 'my-torch-params.pt'
>>> repo_name = 'my-user/my-repo'
>>> token = 'my-secret-token'
>>> # you can create a new repo like this:
>>> create_repo(repo_name, token=token, exist_ok=True)
>>> hf_api = HfApi()
>>> hub_pickle_writer = HfHubStorage(
...     hf_api,
...     path_in_repo=model_name,
...     repo_id=repo_name,
...     token=token,
...     verbose=1,
... )
>>> hub_params_writer = HfHubStorage(
...     hf_api,
...     path_in_repo=params_name,
...     repo_id=repo_name,
...     token=token,
...     verbose=1,
... )
>>> checkpoints = [
...     TrainEndCheckpoint(f_pickle=hub_pickle_writer),
...     TrainEndCheckpoint(f_params=hub_params_writer),
... ]
>>> net = NeuralNet(..., callbacks=checkpoints)
>>> net.fit(X, y)
>>> # prints:
>>> # Uploaded model to https://huggingface.co/my-user/my-repo/blob/main/my-skorch-model.pkl
>>> # Uploaded model to https://huggingface.co/my-user/my-repo/blob/main/my-torch-params.pt
...
>>> # later...
>>> import pickle
>>> from huggingface_hub import hf_hub_download
>>> path = hf_hub_download(repo_name, model_name, use_auth_token=token)
>>> with open(path, 'rb') as f:
...     net_loaded = pickle.load(f)
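
To restore the parameters that were uploaded via f_params, a minimal sketch, assuming a net configured in the same way as during training:

>>> params_path = hf_hub_download(repo_name, params_name, use_auth_token=token)
>>> new_net = NeuralNet(...)  # same configuration as above
>>> new_net.initialize()
>>> new_net.load_params(f_params=params_path)
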
Attributes
latest_url_ : str

Stores the latest URL that the file has been uploaded to.

Methods

close(*args)

flush()

Flush buffered file

write(content)

Upload the file to the Hugging Face Hub

read

seek

tell

flush()[source]

Flush buffered file

write(content)[source]

Upload the file to the Hugging Face Hub

class skorch.hf.HuggingfacePretrainedTokenizer(tokenizer, train=False, max_length=256, return_tensors='pt', return_attention_mask=True, return_token_type_ids=False, return_length=False, verbose=0, vocab_size=None)[source]

Wraps a pretrained Huggingface tokenizer to work as an sklearn transformer

From the tokenizers docs:

🤗 Tokenizers provides an implementation of today’s most used
tokenizers, with a focus on performance and versatility.

Use pretrained Hugging Face tokenizers in an sklearn compatible transformer.

Parameters
tokenizer : str, os.PathLike, or transformers.PreTrainedTokenizerFast

If a string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased. If a path, the path to a directory containing the vocabulary files required by the tokenizer, e.g. ./my_model_directory/. Otherwise, an instantiated PreTrainedTokenizerFast.

train : bool (default=False)

Whether to use the pre-trained tokenizer directly as is or to retrain it on your data. If you just want to use the pre-trained tokenizer without further modification, leave this parameter as False. However, if you want to fit the tokenizer on your own data (completely from scratch, forgetting what it has learned previously), set this argument to True. The latter option is useful if you want to use the same hyper-parameters as the pre-trained tokenizer but want the vocabulary to be fitted to your dataset. The vocabulary size of this new tokenizer can be set explicitly by passing the vocab_size argument.

max_length : int (default=256)

Maximum number of tokens used per sequence.

return_tensors : one of None, str, ‘pt’, ‘np’, ‘tf’ (default=’pt’)

What type of result values to return. By default, return a padded and truncated (to max_length) PyTorch Tensor. Similarly, ‘np’ results in a padded and truncated numpy array. TensorFlow tensors are not officially supported but should also work. If None or str, return a list of lists instead. These lists are not padded or truncated, thus each row may have different numbers of elements.

return_attention_mask : bool (default=True)

Whether to return the attention mask.

return_token_type_ids : bool (default=False)

Whether to return the token type ids.

return_length : bool (default=False)

Whether to return the length of the encoded inputs.

pad_token : str (default=’[PAD]’)

A special token used to make arrays of tokens the same size for batching purposes. It will then be ignored by attention mechanisms.

vocab_size : int or None (default=None)

Change this parameter only if you use train=True. In that case, this parameter will determine the vocabulary size of the newly trained tokenizer. If you set train=True but leave this parameter as None, the same vocabulary size as the one from the initial tokenizer will be used.

verbose : int (default=0)

Whether the tokenizer should print more information and warnings.

Examples

>>> from skorch.hf import HuggingfacePretrainedTokenizer
>>> # pass the model name to be downloaded
>>> hf_tokenizer = HuggingfacePretrainedTokenizer('bert-base-uncased')
>>> data = ['hello there', 'this is a text']
>>> hf_tokenizer.fit(data)  # only loads the model
>>> hf_tokenizer.transform(data)
>>> # pass pretrained tokenizer as object
>>> my_tokenizer = ...
>>> hf_tokenizer = HuggingfacePretrainedTokenizer(my_tokenizer)
>>> hf_tokenizer.fit(data)
>>> hf_tokenizer.transform(data)
>>> # use hyper params from pretrained tokenizer to fit on own data
>>> hf_tokenizer = HuggingfacePretrainedTokenizer(
...     'bert-base-uncased', train=True, vocab_size=12345)
>>> data = ...
>>> hf_tokenizer.fit(data)  # fits new tokenizer on data
>>> hf_tokenizer.transform(data)
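
Since the tokenizer is an sklearn transformer and (with return_tensors='pt') returns a dict-like batch of tensors that skorch accepts as input, it can also be placed in front of a skorch net inside an sklearn Pipeline. A minimal sketch, where BertClassifierModule is a hypothetical module whose forward accepts input_ids and attention_mask, and y is a matching array of labels:

>>> from sklearn.pipeline import Pipeline
>>> from skorch import NeuralNetClassifier
>>> pipe = Pipeline([
...     ('tokenize', HuggingfacePretrainedTokenizer('bert-base-uncased')),
...     ('net', NeuralNetClassifier(BertClassifierModule)),  # hypothetical module
... ])
>>> pipe.fit(data, y)
>>> y_proba = pipe.predict_proba(data)
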
Attributes
vocabulary_ : dict

A mapping of terms to feature indices.

fast_tokenizer_ : transformers.PreTrainedTokenizerFast

If you want to extract the Hugging Face tokenizer to use it without skorch, use this attribute.

tokenizers docs: https://huggingface.co/docs/tokenizers/python/latest/index.html

Methods

fit(X[, y])

Load the pretrained tokenizer

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Array mapping from feature integer indices to feature name.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X)

Decode encodings back into strings

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

tokenize(X, **kwargs)

Convenience method to use the trained tokenizer for tokenization

transform(X)

Transform the given data

fit(X, y=None, **fit_params)[source]

Load the pretrained tokenizer

Parameters
X : iterable of str

This parameter is ignored.

y : None

This parameter is ignored.

fit_params : dict

This parameter is ignored.

Returns
self : HuggingfacePretrainedTokenizer

The fitted instance of the tokenizer.

class skorch.hf.HuggingfaceTokenizer(tokenizer, model=None, trainer='auto', normalizer=None, pre_tokenizer=None, post_processor=None, max_length=256, return_tensors='pt', return_attention_mask=True, return_token_type_ids=False, return_length=False, pad_token='[PAD]', verbose=0, **kwargs)[source]

Wraps a Hugging Face tokenizer to work as an sklearn transformer

From the tokenizers docs:

🤗 Tokenizers provides an implementation of today’s most used
tokenizers, with a focus on performance and versatility.

Use of Hugging Face tokenizers for training on custom data using an sklearn compatible API.

Parameters
tokenizer : tokenizers.Tokenizer

The tokenizer to train.

model : tokenizers.models.Model

The model represents the actual tokenization algorithm, e.g. BPE.

trainer : tokenizers.trainers.Trainer or ‘auto’ (default=’auto’)

Class responsible for training the tokenizer. If ‘auto’, the correct trainer will be inferred from the used model using model.get_trainer().

normalizer : tokenizers.normalizers.Normalizer or None (default=None)

Optional normalizer, e.g. for casting the text to lowercase.

pre_tokenizer : tokenizers.pre_tokenizers.PreTokenizer or None (default=None)

Optional pre-tokenization, e.g. splitting on space.

post_processor : tokenizers.processors.PostProcessor

Optional post-processor, mostly used to add special tokens for BERT etc.

max_length : int (default=256)

Maximum number of tokens used per sequence.

return_tensors : one of None, str, ‘pt’, ‘np’, ‘tf’ (default=’pt’)

What type of result values to return. By default, return a padded and truncated (to max_length) PyTorch Tensor. Similarly, ‘np’ results in a padded and truncated numpy array. TensorFlow tensors are not officially supported but should also work. If None or str, return a list of lists instead. These lists are not padded or truncated, thus each row may have different numbers of elements.

return_attention_mask : bool (default=True)

Whether to return the attention mask.

return_token_type_ids : bool (default=False)

Whether to return the token type ids.

return_length : bool (default=False)

Whether to return the length of the encoded inputs.

pad_token : str (default=’[PAD]’)

A special token used to make arrays of tokens the same size for batching purposes. It will then be ignored by attention mechanisms.

verbose : int (default=0)

Whether the tokenizer should print more information and warnings.

Examples

>>> # train a BERT tokenizer from scratch
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import WordPiece
>>> from tokenizers import normalizers
>>> from tokenizers.normalizers import Lowercase, NFD, StripAccents
>>> from tokenizers.pre_tokenizers import Whitespace
>>> from tokenizers.processors import TemplateProcessing
>>> bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
>>> normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
>>> pre_tokenizer = Whitespace()
>>> post_processor = TemplateProcessing(
...    single="[CLS] $A [SEP]",
...    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
...    special_tokens=[
...        ("[CLS]", 1),
...        ("[SEP]", 2),
...    ],
... )
>>> from skorch.hf import HuggingfaceTokenizer
>>> hf_tokenizer = HuggingfaceTokenizer(
...     tokenizer=bert_tokenizer,
...     pre_tokenizer=pre_tokenizer,
...     post_processor=post_processor,
...     trainer__vocab_size=30522,
...     trainer__special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
... )
>>> data = ['hello there', 'this is a text']
>>> hf_tokenizer.fit(data)
>>> hf_tokenizer.transform(data)

In general, you can pass both initialized objects and uninitialized objects as parameters:

# initialized
HuggingfaceTokenizer(tokenizer=Tokenizer(model=WordPiece()))
# uninitialized
HuggingfaceTokenizer(tokenizer=Tokenizer, model=WordPiece)

Both approaches work equally well and allow you to, for instance, grid search on the tokenizer parameters. However, it is recommended not to pass an initialized trainer. This is because the trainer will then be saved as an attribute on the object, which can be wasteful. Instead, it is best to leave the default trainer='auto', which results in the trainer being derived from the model.
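
For instance, here is a sketch of passing uninitialized components together with routed (double-underscore) parameters, which can then be adjusted via set_params, e.g. inside a grid search; the concrete values are illustrative assumptions and data is the list from the example above:

>>> from tokenizers import Tokenizer
>>> from tokenizers.models import WordPiece
>>> hf_tokenizer = HuggingfaceTokenizer(
...     tokenizer=Tokenizer,
...     model=WordPiece,
...     model__unk_token='[UNK]',
...     trainer__vocab_size=10000,
... )
>>> hf_tokenizer.set_params(trainer__vocab_size=20000)
>>> hf_tokenizer.fit(data)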

Note

If you want to train the HuggingfaceTokenizer in parallel (e.g. during a grid search), you should probably set the environment variable TOKENIZERS_PARALLELISM=false. Otherwise, you may experience slowdowns or deadlocks.
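
One way to set this before starting the parallel jobs, shown as an illustrative snippet:

>>> import os
>>> os.environ['TOKENIZERS_PARALLELISM'] = 'false'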

Attributes
vocabulary_ : dict

A mapping of terms to feature indices.

fast_tokenizer_ : transformers.PreTrainedTokenizerFast

If you want to extract the Hugging Face tokenizer to use it without skorch, use this attribute.

tokenizers docs: https://huggingface.co/docs/tokenizers/python/latest/index.html

Methods

fit(X[, y])

Train the tokenizer on given data

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Array mapping from feature integer indices to feature name.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

get_params_for(prefix)

Collect and return init parameters for an attribute.

initialize()

Initialize the individual tokenizer components

initialize_trainer()

Initialize the trainer

initialized_instance(instance_or_cls, kwargs)

Return an instance initialized with the given parameters

inverse_transform(X)

Decode encodings back into strings

set_output(*[, transform])

Set output container.

set_params(**kwargs)

Set the parameters of this class.

tokenize(X, **kwargs)

Convenience method to use the trained tokenizer for tokenization

transform(X)

Transform the given data

initialize_model

initialize_normalizer

initialize_post_processor

initialize_pre_tokenizer

initialize_tokenizer

fit(X, y=None, **fit_params)[source]

Train the tokenizer on given data

Parameters
X : iterable of str

A list/array of strings, or an iterable that generates strings.

y : None

This parameter is ignored.

fit_params : dict

This parameter is ignored.

Returns
self : HuggingfaceTokenizer

The fitted instance of the tokenizer.

get_params(deep=False)[source]

Get parameters for this estimator.

Parameters
deep : bool (default=False)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

get_params_for(prefix)[source]

Collect and return init parameters for an attribute.

initialize()[source]

Initialize the individual tokenizer components

initialize_trainer()[source]

Initialize the trainer

Infer the trainer type from the model if necessary.

initialized_instance(instance_or_cls, kwargs)[source]

Return an instance initialized with the given parameters

This is a helper method that deals with several possibilities for a component that might need to be initialized:

  • It is already an instance that’s good to go

  • It is an instance but it needs to be re-initialized

  • It’s not an instance and needs to be initialized

For the majority of use cases, this comes down to just initializing the class with its arguments.

Parameters
instance_or_cls

The instance or class or callable to be initialized.

kwargs : dict

The keyword arguments to initialize the instance or class. Can be an empty dict.

Returns
instance

The initialized component.

set_params(**kwargs)[source]

Set the parameters of this class.

Valid parameter keys can be listed with get_params().

Returns
self