skorch.hf
Classes to work with the Hugging Face ecosystem (https://huggingface.co/), e.g. transformers or tokenizers.
This module should be treated as a leaf node in the dependency tree, i.e. no other skorch modules should depend on these classes or import from here. Even so, don’t import any Hugging Face libraries on the root level because skorch should not depend on them.

class skorch.hf.AccelerateMixin(*args, accelerator, device=None, unwrap_after_train=True, callbacks__print_log__sink='auto', **kwargs)
Mixin class to add support for Hugging Face accelerate
This is an experimental feature.
Use this mixin class with one of the neural net classes (e.g. NeuralNet, NeuralNetClassifier, or NeuralNetRegressor) and pass an instance of Accelerator for mixed precision, multi-GPU, or TPU training.
Install the accelerate library using:
python -m pip install accelerate
skorch does not itself provide any facilities to enable these training features. A lot of them can still be implemented by the user with a little bit of extra work but it can be a daunting task. That is why this helper class was added: Using this mixin in conjunction with the accelerate library should cover a lot of common use cases.
Note
Under the hood, accelerate uses GradScaler, which does not support passing the training step as a closure. Therefore, if your optimizer requires that (e.g. torch.optim.LBFGS), you cannot use accelerate.
Warning
Since accelerate is still quite young and backwards compatibility breaking features might be added, we treat its integration as an experimental feature. When accelerate’s API stabilizes, we will consider adding it to skorch proper.
Also, models accelerated this way cannot be pickled. If you need to save and load the net, either use skorch.net.NeuralNet.save_params() and skorch.net.NeuralNet.load_params() or don’t use accelerate.
Parameters:
- accelerator : accelerate.Accelerator
In addition to the usual parameters, pass an instance of accelerate.Accelerator with the desired settings.
- device : str, torch.device, or None (default=None)
The compute device to be used. When using accelerate, it is recommended to leave device handling to accelerate. Therefore, it is best to leave this argument to be None, which means that skorch does not set the device.
- unwrap_after_train : bool (default=True)
By default, with this option being True, the module(s) and criterion are automatically “unwrapped” after training. This means that their initial state (from before they were prepared by the accelerator) is restored. This is necessary to pickle the net.
There are circumstances where you might want to disable this behavior. For instance, when you want to further train the model with AMP enabled (using net.partial_fit or warm_start=True). Also, unwrapping the modules means that the advantage of using mixed precision is lost during inference. In those cases, if you don’t need to pickle the net, you should set unwrap_after_train=False.
- callbacks__print_log__sink : ‘auto’ or callable
If ‘auto’, uses the print function of the accelerator, if it has one. This avoids printing the same output multiple times when training concurrently on multiple machines. If the accelerator does not have a print function, use Python’s print function instead.
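The ‘auto’ behavior of callbacks__print_log__sink can be pictured with the minimal sketch below. It is not skorch’s actual implementation: resolve_sink and the two stand-in accelerator classes are hypothetical names used only for illustration.

```python
def resolve_sink(sink, accelerator):
    """Pick the callable used for log output (hypothetical helper)."""
    if sink != "auto":
        # an explicitly passed callable is used as is
        return sink
    # prefer the accelerator's own print, fall back to the built-in print
    return getattr(accelerator, "print", print)


class WithPrint:
    """Stand-in for an accelerator that provides a print method."""

    def print(self, *args, **kwargs):
        print("[main process]", *args, **kwargs)


class WithoutPrint:
    """Stand-in for an accelerator without a print method."""


sink_a = resolve_sink("auto", WithPrint())     # bound WithPrint.print
sink_b = resolve_sink("auto", WithoutPrint())  # built-in print
sink_c = resolve_sink(len, WithPrint())        # explicit callable wins
```

Resolving the sink once, up front, keeps the logging callback itself unaware of whether training runs on one process or many.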
Examples
>>> from skorch import NeuralNetClassifier
>>> from skorch.hf import AccelerateMixin
>>> from accelerate import Accelerator
>>>
>>> class AcceleratedNet(AccelerateMixin, NeuralNetClassifier):
...     '''NeuralNetClassifier with accelerate support'''
>>>
>>> accelerator = Accelerator(...)
>>> # you may pass gradient_accumulation_steps to enable grad accumulation
>>> net = AcceleratedNet(MyModule, accelerator=accelerator)
>>> net.fit(X, y)
The same approach works with all the other skorch net classes.
Methods
initialize() Initializes all of its components and returns self.
load_params(*args, **kwargs)
on_train_end(net[, X, y])
save_params(*args, **kwargs)
evaluation_step
get_iterator
train_step
train_step_single

class skorch.hf.HfHubStorage(hf_api, path_in_repo, repo_id, local_storage=None, verbose=0, sink=print, **kwargs)
Helper class that allows writing data to the Hugging Face Hub.
Use this, for instance, in combination with checkpoint callbacks such as skorch.callbacks.training.TrainEndCheckpoint or skorch.callbacks.training.Checkpoint to upload the trained model directly to the Hugging Face Hub instead of storing it locally.
To use this, it is necessary to install the Hugging Face Hub library:
python -m pip install huggingface_hub
Note that writes to the Hub are synchronous. Therefore, if the time it takes to upload the data is long compared to training the model, there can be a significant slowdown. It is best to use this with skorch.callbacks.training.TrainEndCheckpoint, as that checkpoint only uploads the data once, at the end of training. Also, using this writer with skorch.callbacks.training.LoadInitState is not supported for now because the Hub API does not support model loading yet.
Parameters:
- hf_api : instance of huggingface_hub.HfApi
Pass an instantiated huggingface_hub.HfApi object here.
- path_in_repo : str
The name that the file should have in the repo, e.g. my-model.pkl. If you want each upload to have a different file name, instead of overwriting the file, use a templated name, e.g. my-model-{}.pkl. Then your files will be called my-model-1.pkl, my-model-2.pkl, etc. If there are already files by this name in the repository, they will be overwritten.
- repo_id : str
The repository to which the file will be uploaded, for example: "username/reponame".
- verbose : int (default=0)
Control the level of verbosity.
- local_storage : str, pathlib.Path or None (default=None)
Indicate temporary storage of the parameters. By default, they are stored in-memory. By passing a string or Path to this parameter, you can instead store the parameters at the indicated location. There is no automatic cleanup, so if you don’t need the file on disk, put it into a temp folder.
- sink : callable (default=print)
The target that the verbose information is sent to. By default, the output is printed to stdout, but the sink could also be a logger or noop().
- kwargs : dict
The remaining arguments are the same as for HfApi.upload_file (see https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.upload_file).
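The templated file names described for path_in_repo can be illustrated with a small sketch. It assumes the placeholder is filled with a running upload counter through str.format; the helper name resolve_path_in_repo is hypothetical, and the library’s own implementation may differ.

```python
def resolve_path_in_repo(template, upload_count):
    """Fill the optional '{}' placeholder with the running upload number."""
    # str.format leaves a name without a placeholder unchanged, so a fixed
    # name maps to the same file on every upload (and is overwritten)
    return template.format(upload_count)


templated = [resolve_path_in_repo("my-model-{}.pkl", i) for i in (1, 2, 3)]
fixed = [resolve_path_in_repo("my-model.pkl", i) for i in (1, 2, 3)]
```

With the templated name, every upload gets its own file; with the fixed name, all three uploads target the same file.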
Examples
>>> from huggingface_hub import create_repo, HfApi
>>> from skorch import NeuralNet
>>> from skorch.callbacks import TrainEndCheckpoint
>>> from skorch.hf import HfHubStorage
>>> model_name = 'my-skorch-model.pkl'
>>> params_name = 'my-torch-params.pt'
>>> repo_name = 'my-user/my-repo'
>>> token = 'my-secret-token'
>>> # you can create a new repo like this:
>>> create_repo(repo_name, token=token, exist_ok=True)
>>> hf_api = HfApi()
>>> hub_pickle_writer = HfHubStorage(
...     hf_api,
...     path_in_repo=model_name,
...     repo_id=repo_name,
...     token=token,
...     verbose=1,
... )
>>> hub_params_writer = HfHubStorage(
...     hf_api,
...     path_in_repo=params_name,
...     repo_id=repo_name,
...     token=token,
...     verbose=1,
... )
>>> checkpoints = [
...     TrainEndCheckpoint(f_pickle=hub_pickle_writer),
...     TrainEndCheckpoint(f_params=hub_params_writer),
... ]
>>> # the checkpoints are regular callbacks
>>> net = NeuralNet(..., callbacks=checkpoints)
>>> net.fit(X, y)
>>> # prints:
>>> # Uploaded model to https://huggingface.co/my-user/my-repo/blob/main/my-skorch-model.pkl
>>> # Uploaded model to https://huggingface.co/my-user/my-repo/blob/main/my-torch-params.pt
...
>>> # later...
>>> import pickle
>>> from huggingface_hub import hf_hub_download
>>> path = hf_hub_download(repo_name, model_name, use_auth_token=token)
>>> with open(path, 'rb') as f:
...     net_loaded = pickle.load(f)
Attributes:
- latest_url_ : str
Stores the latest URL that the file has been uploaded to.
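Because local_storage is not cleaned up automatically, one option is to point it at a temporary folder. The sketch below uses only the standard library; the file name and the placeholder bytes are illustrative, and the commented-out HfHubStorage call only indicates where the path would be passed.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    local_storage = Path(tmp_dir) / "my-model.pkl"
    # writer = HfHubStorage(hf_api, ..., local_storage=local_storage)
    # here, placeholder bytes stand in for the serialized parameters
    local_storage.write_bytes(b"placeholder")
    existed_inside = local_storage.exists()

# the temporary directory, and the file in it, are removed on exit
exists_after = local_storage.exists()
```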
Methods
close(*args)
flush() Flush buffered file
write(content) Upload the file to the Hugging Face Hub
read
seek
tell

class skorch.hf.HuggingfacePretrainedTokenizer(tokenizer, train=False, max_length=256, return_tensors='pt', return_attention_mask=True, return_token_type_ids=False, return_length=False, verbose=0, vocab_size=None)
Wraps a pretrained Hugging Face tokenizer to work as an sklearn transformer
From the tokenizers docs:
🤗 Tokenizers provides an implementation of today’s most used tokenizers, with a focus on performance and versatility.
Use pretrained Hugging Face tokenizers in an sklearn compatible transformer.
Parameters:
- tokenizer : str or os.PathLike or transformers.PreTrainedTokenizerFast
If a string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased. If a path, the path to a directory containing vocabulary files required by the tokenizer, e.g., ./my_model_directory/. Otherwise, it should be an instantiated PreTrainedTokenizerFast.
- train : bool (default=False)
Whether to use the pre-trained tokenizer directly as is or to retrain it on your data. If you just want to use the pre-trained tokenizer without further modification, leave this parameter as False. However, if you want to fit the tokenizer on your own data (completely from scratch, forgetting what it has learned previously), set this argument to True. The latter option is useful if you want to use the same hyper-parameters as the pre-trained tokenizer but want the vocabulary to be fitted to your dataset. The vocabulary size of this new tokenizer can be set explicitly by passing the vocab_size argument.
- max_length : int (default=256)
Maximum number of tokens used per sequence.
- return_tensors : one of None, str, ‘pt’, ‘np’, ‘tf’ (default=‘pt’)
What type of result values to return. By default, return a padded and truncated (to max_length) PyTorch Tensor. Similarly, ‘np’ results in a padded and truncated numpy array. Tensorflow tensors are not officially supported but should also work. If None or str, return a list of lists instead. These lists are not padded or truncated, thus each row may have a different number of elements.
- return_attention_mask : bool (default=True)
Whether to return the attention mask.
- return_token_type_ids : bool (default=False)
Whether to return the token type ids.
- return_length : bool (default=False)
Whether to return the length of the encoded inputs.
- pad_token : str (default=‘[PAD]’)
A special token used to make arrays of tokens the same size for batching purposes. It will then be ignored by attention mechanisms.
- vocab_size : int or None (default=None)
Change this parameter only if you use train=True. In that case, this parameter will determine the vocabulary size of the newly trained tokenizer. If you set train=True but leave this parameter as None, the same vocabulary size as the one from the initial tokenizer will be used.
- verbose : int (default=0)
Whether the tokenizer should print more information and warnings.
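To make the difference between the padded, truncated outputs (‘pt’, ‘np’) and the ragged list-of-lists output more concrete, here is a pure-Python sketch of padding and truncating to max_length. The helper pad_and_truncate, the pad id 0, and the token ids are illustrative assumptions, not output from a real tokenizer or the library’s own code.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask) as rectangular lists of lists."""
    input_ids, attention_mask = [], []
    for row in token_ids:
        row = list(row)[:max_length]  # truncate rows that are too long
        n_pad = max_length - len(row)
        input_ids.append(row + [pad_id] * n_pad)  # pad rows that are too short
        attention_mask.append([1] * len(row) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask


# ragged input: each row has a different number of (made-up) token ids
ragged = [[101, 7592, 102], [101, 2023, 2003, 1037, 3793, 102]]
ids, mask = pad_and_truncate(ragged, max_length=4)
```

After this step every row has exactly max_length entries, which is what allows the result to be stored in a single tensor or array.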
Examples
>>> from skorch.hf import HuggingfacePretrainedTokenizer
>>> # pass the model name to be downloaded
>>> hf_tokenizer = HuggingfacePretrainedTokenizer('bert-base-uncased')
>>> data = ['hello there', 'this is a text']
>>> hf_tokenizer.fit(data)  # only loads the model
>>> hf_tokenizer.transform(data)

>>> # pass pretrained tokenizer as object
>>> my_tokenizer = ...
>>> hf_tokenizer = HuggingfacePretrainedTokenizer(my_tokenizer)
>>> hf_tokenizer.fit(data)
>>> hf_tokenizer.transform(data)

>>> # use hyper params from pretrained tokenizer to fit on own data
>>> hf_tokenizer = HuggingfacePretrainedTokenizer(
...     'bert-base-uncased', train=True, vocab_size=12345)
>>> data = ...
>>> hf_tokenizer.fit(data)  # fits new tokenizer on data
>>> hf_tokenizer.transform(data)
Attributes:
- vocabulary_ : dict
A mapping of terms to feature indices.
- fast_tokenizer_ : transformers.PreTrainedTokenizerFast
If you want to extract the Hugging Face tokenizer to use it without skorch, use this attribute.
tokenizers docs: https://huggingface.co/docs/tokenizers/python/latest/index.html
Methods
fit(X[, y]) Load the pretrained tokenizer
fit_transform(X[, y]) Fit to data, then transform it.
get_feature_names_out([input_features]) Array mapping from feature integer indices to feature name.
get_metadata_routing() Get metadata routing of this object.
get_params([deep]) Get parameters for this estimator.
inverse_transform(X) Decode encodings back into strings
set_output(*[, transform]) Set output container.
set_params(**params) Set the parameters of this estimator.
tokenize(X, **kwargs) Convenience method to use the trained tokenizer for tokenization
transform(X) Transform the given data

class skorch.hf.HuggingfaceTokenizer(tokenizer, model=None, trainer='auto', normalizer=None, pre_tokenizer=None, post_processor=None, max_length=256, return_tensors='pt', return_attention_mask=True, return_token_type_ids=False, return_length=False, pad_token='[PAD]', verbose=0, **kwargs)
Wraps a Hugging Face tokenizer to work as an sklearn transformer
From the tokenizers docs:
🤗 Tokenizers provides an implementation of today’s most used tokenizers, with a focus on performance and versatility.
Use of Hugging Face tokenizers for training on custom data using an sklearn compatible API.
Parameters:
- tokenizer : tokenizers.Tokenizer
The tokenizer to train.
- model : tokenizers.models.Model or None (default=None)
The model represents the actual tokenization algorithm, e.g. BPE.
- trainer : tokenizers.trainers.Trainer or ‘auto’ (default=‘auto’)
Class responsible for training the tokenizer. If ‘auto’, the correct trainer will be inferred from the used model using model.get_trainer().
- normalizer : tokenizers.normalizers.Normalizer or None (default=None)
Optional normalizer, e.g. for casting the text to lowercase.
- pre_tokenizer : tokenizers.pre_tokenizers.PreTokenizer or None (default=None)
Optional pre-tokenization, e.g. splitting on space.
- post_processor : tokenizers.processors.PostProcessor or None (default=None)
Optional post-processor, mostly used to add special tokens for BERT etc.
- max_length : int (default=256)
Maximum number of tokens used per sequence.
- return_tensors : one of None, str, ‘pt’, ‘np’, ‘tf’ (default=‘pt’)
What type of result values to return. By default, return a padded and truncated (to max_length) PyTorch Tensor. Similarly, ‘np’ results in a padded and truncated numpy array. Tensorflow tensors are not officially supported but should also work. If None or str, return a list of lists instead. These lists are not padded or truncated, thus each row may have a different number of elements.
- return_attention_mask : bool (default=True)
Whether to return the attention mask.
- return_token_type_ids : bool (default=False)
Whether to return the token type ids.
- return_length : bool (default=False)
Whether to return the length of the encoded inputs.
- pad_token : str (default=‘[PAD]’)
A special token used to make arrays of tokens the same size for batching purposes. It will then be ignored by attention mechanisms.
- verbose : int (default=0)
Whether the tokenizer should print more information and warnings.
Examples
>>> # train a BERT tokenizer from scratch
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import WordPiece
>>> from tokenizers import normalizers
>>> from tokenizers.normalizers import Lowercase, NFD, StripAccents
>>> from tokenizers.pre_tokenizers import Whitespace
>>> from tokenizers.processors import TemplateProcessing
>>> bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
>>> normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
>>> pre_tokenizer = Whitespace()
>>> post_processor = TemplateProcessing(
...     single="[CLS] $A [SEP]",
...     pair="[CLS] $A [SEP] $B:1 [SEP]:1",
...     special_tokens=[
...         ("[CLS]", 1),
...         ("[SEP]", 2),
...     ],
... )
>>> from skorch.hf import HuggingfaceTokenizer
>>> hf_tokenizer = HuggingfaceTokenizer(
...     tokenizer=bert_tokenizer,
...     normalizer=normalizer,
...     pre_tokenizer=pre_tokenizer,
...     post_processor=post_processor,
...     trainer__vocab_size=30522,
...     trainer__special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
... )
>>> data = ['hello there', 'this is a text']
>>> hf_tokenizer.fit(data)
>>> hf_tokenizer.transform(data)
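The trainer__vocab_size and trainer__special_tokens arguments in the example above use a double-underscore prefix to address a component. A rough sketch of how such prefixed keyword arguments can be collected for one component is shown below; params_for is written here as a standalone illustrative helper and is not necessarily what get_params_for does internally.

```python
def params_for(prefix, kwargs):
    """Collect the entries of kwargs addressed to prefix, prefix removed."""
    if not prefix.endswith("__"):
        prefix += "__"
    return {
        key[len(prefix):]: val
        for key, val in kwargs.items()
        if key.startswith(prefix)
    }


kwargs = {
    "trainer__vocab_size": 30522,
    "trainer__special_tokens": ["[UNK]", "[PAD]"],
    "model__unk_token": "[UNK]",
}
trainer_kwargs = params_for("trainer", kwargs)
model_kwargs = params_for("model", kwargs)
```

Keeping the arguments flat like this is what lets tools such as a grid search set nested values through a single parameter name.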
In general, you can pass both initialized objects and uninitialized objects as parameters:
# initialized
HuggingfaceTokenizer(tokenizer=Tokenizer(model=WordPiece()))
# uninitialized
HuggingfaceTokenizer(tokenizer=Tokenizer, model=WordPiece)
Both approaches work equally well and allow you to, for instance, grid search on the tokenizer parameters. However, it is recommended not to pass an initialized trainer. This is because the trainer will then be saved as an attribute on the object, which can be wasteful. Instead, it is best to leave the default trainer='auto', which results in the trainer being derived from the model.
Note
If you want to train the HuggingfaceTokenizer in parallel (e.g. during a grid search), you should probably set the environment variable TOKENIZERS_PARALLELISM=false. Otherwise, you may experience slowdowns or deadlocks.
Attributes:
- vocabulary_ : dict
A mapping of terms to feature indices.
- fast_tokenizer_ : transformers.PreTrainedTokenizerFast
If you want to extract the Hugging Face tokenizer to use it without skorch, use this attribute.
tokenizers docs: https://huggingface.co/docs/tokenizers/python/latest/index.html
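The TOKENIZERS_PARALLELISM environment variable from the note above can be set in the shell or, as sketched here, from Python. The assumption in this sketch is that the assignment runs before the tokenizer is trained and before any worker processes are started.

```python
import os

# Set this early, before training the tokenizer and before worker
# processes are spawned (e.g. by a grid search with n_jobs > 1).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

parallelism = os.environ["TOKENIZERS_PARALLELISM"]
```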
Methods
fit(X[, y]) Train the tokenizer on given data
fit_transform(X[, y]) Fit to data, then transform it.
get_feature_names_out([input_features]) Array mapping from feature integer indices to feature name.
get_metadata_routing() Get metadata routing of this object.
get_params([deep]) Get parameters for this estimator.
get_params_for(prefix) Collect and return init parameters for an attribute.
initialize() Initialize the individual tokenizer components
initialize_trainer() Initialize the trainer
initialized_instance(instance_or_cls, kwargs) Return an instance initialized with the given parameters
inverse_transform(X) Decode encodings back into strings
set_output(*[, transform]) Set output container.
set_params(**kwargs) Set the parameters of this class.
tokenize(X, **kwargs) Convenience method to use the trained tokenizer for tokenization
transform(X) Transform the given data
initialize_model
initialize_normalizer
initialize_post_processor
initialize_pre_tokenizer
initialize_tokenizer

fit(X, y=None, **fit_params)
Train the tokenizer on given data
Parameters:
- X : iterable of str
A list/array of strings or an iterable which generates strings.
- y : None
This parameter is ignored.
- fit_params : dict
This parameter is ignored.
Returns:
- self : HuggingfaceTokenizer
The fitted instance of the tokenizer.

get_params(deep=False)
Get parameters for this estimator.
Parameters:
- deep : bool, default=False
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:
- params : dict
Parameter names mapped to their values.

initialize_trainer()
Initialize the trainer
Infer the trainer type from the model if necessary.

initialized_instance(instance_or_cls, kwargs)
Return an instance initialized with the given parameters
This is a helper method that deals with several possibilities for a component that might need to be initialized:
- It is already an instance that’s good to go
- It is an instance but it needs to be re-initialized
- It’s not an instance and needs to be initialized
For the majority of use cases, this comes down to just initializing the class with its arguments.
Parameters:
- instance_or_cls
The instance or class or callable to be initialized.
- kwargs : dict
The keyword arguments to initialize the instance or class. Can be an empty dict.
Returns:
- instance
The initialized component.
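The three cases listed for initialized_instance can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the library’s code: it only distinguishes classes from instances (other callables are not handled), and DummyNormalizer is a hypothetical stand-in component.

```python
def initialized_instance(instance_or_cls, kwargs):
    """Return a ready-to-use component (sketch of the three cases)."""
    is_instance = not isinstance(instance_or_cls, type)
    if is_instance and not kwargs:
        # case 1: already an instance and nothing to change
        return instance_or_cls
    if is_instance:
        # case 2: re-initialize the same class with the new arguments
        return type(instance_or_cls)(**kwargs)
    # case 3: a class that still needs to be initialized
    return instance_or_cls(**kwargs)


class DummyNormalizer:
    """Stand-in component used only for this illustration."""

    def __init__(self, strip=False):
        self.strip = strip


existing = DummyNormalizer()
same = initialized_instance(existing, {})
rebuilt = initialized_instance(existing, {"strip": True})
fresh = initialized_instance(DummyNormalizer, {})
```

Only the second case creates a new object from an existing instance, which is why passing extra arguments for an already initialized component discards that instance’s previous state.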