skorch.history

Contains history class and helper functions.

class skorch.history.DistributedHistory(*args, store, rank, world_size)[source]

History for use when training with multiple processes

When using skorch with AccelerateMixin for multi-GPU training, use this class instead of the default History class.

When using PyTorch torch.nn.parallel.DistributedDataParallel, the whole training process is forked and batches are processed in parallel. That means that the standard History does not see all the batches that are being processed, which results in the different processes having histories that are out of sync. This is bad because the history is used as a reference to influence the training, e.g. to control early stopping.

This class solves the problem by using a distributed store from PyTorch, e.g. torch.distributed.TCPStore, to synchronize the batch information across processes. This ensures that the information stored in the individual history copies is identical for history[:, 'batches']. When it comes to the epoch-level information, it can still diverge between processes (e.g. the recorded duration of the epoch).

To use this class, instantiate it and pass it as the history argument to the net.

See the PyTorch documentation on distributed key-value stores for more information.

Parameters
store : torch.distributed.Store

The torch distributed Store instance; torch.distributed.TCPStore has been tested to work.

rank : int

The rank of this particular process among all processes. Each process should have a unique rank between 0 and world_size - 1. If using accelerate, the rank can be determined as accelerator.local_process_index.

world_size : int

The number of processes in the training. When using accelerate, the world size can be determined as accelerator.num_processes.

Notes

If using this class results in the processes hanging or timing out, double check that the rank and world_size arguments are set correctly. Otherwise, the history instances will wait for records that are never written.

If the speed of the processes is very uneven, there can also be timeouts. To increase the waiting time, pass a corresponding timeout argument to the store instance.
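
As an illustration, torch.distributed.TCPStore accepts a timeout argument as a datetime.timedelta; a minimal sketch, assuming the host, port, and the world_size/is_master variables from the examples below:

>>> from datetime import timedelta
>>> from torch.distributed import TCPStore
>>> store = TCPStore(
...     "127.0.0.1", port=1234, world_size=world_size, is_master=is_master,
...     timeout=timedelta(minutes=30))  # give slower ranks more time before the store times out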

Objects stored with this history make a JSON roundtrip. Therefore, if you store objects that don’t survive a JSON roundtrip (say, numpy arrays), don’t use this class.
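
For example, a numpy scalar can be converted to a plain Python float before it is recorded; a minimal sketch (the 'my-score' column name is made up, net is assumed to be a net using this history):

>>> import numpy as np
>>> score = np.mean([0.2, 0.4])                          # numpy scalar, does not survive a JSON roundtrip
>>> net.history.record_batch('my-score', float(score))   # a plain float does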

The PyTorch Store classes cannot be pickled. Therefore, if a net using this history class is pickled, the store attribute is discarded so that the pickling does not fail. This means, however, that an unpickled net cannot be used for further training without manually setting the store attribute on the history.
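
A minimal sketch of resuming training after unpickling, assuming the same store setup as in the examples below ('net.pkl' is a made-up file name):

>>> import pickle
>>> from torch.distributed import TCPStore
>>> with open('net.pkl', 'rb') as f:
...     net = pickle.load(f)
>>> # the store was discarded during pickling, so re-attach one before training further
>>> net.history.store = TCPStore(
...     "127.0.0.1", port=1234, world_size=world_size, is_master=is_master)
>>> net.partial_fit(X, y)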

Examples

>>> # general
>>> from skorch import NeuralNetClassifier
>>> from skorch.history import DistributedHistory
>>> from torch.distributed import TCPStore
>>> from torch.nn.parallel import DistributedDataParallel
>>> def train(rank, world_size, is_master):
...     store = TCPStore(
...         "127.0.0.1", port=1234, world_size=world_size, is_master=is_master)
...     dist_history = DistributedHistory(
...         store=store, rank=rank, world_size=world_size)
...     net = NeuralNetClassifier(..., history=dist_history)
...     net.fit(X, y)
>>> # with accelerate
>>> from accelerate import Accelerator
>>> from skorch.hf import AccelerateMixin
>>> class AcceleratedNet(AccelerateMixin, NeuralNetClassifier):
...     """NeuralNetClassifier with accelerate support"""
>>> accelerator = Accelerator(...)
>>> def train(accelerator):
...     is_master = accelerator.is_main_process
...     world_size = accelerator.num_processes
...     rank = accelerator.local_process_index
...     store = TCPStore(
...         "127.0.0.1", port=1234, world_size=world_size, is_master=is_master)
...     dist_history = DistributedHistory(
...         store=store, rank=rank, world_size=world_size)
...     net = AcceleratedNet(..., history=dist_history)
...     net.fit(X, y)

Attributes
history : skorch.history.History

The actual skorch History object can be accessed using the history attribute. You should call net.history.sync() to ensure that all data is synced into the history before reading from it.
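
A minimal sketch of reading batch-level values after training, assuming net was fitted with a DistributedHistory:

>>> net.history.sync()                    # pull any pending batch records from the store
>>> hist = net.history.history            # the underlying skorch History object
>>> hist[:, 'batches', :, 'train_loss']   # batch-level train losses, now consistent across ranks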

Methods

record_batch(attr, value)

Add a new value to the given column for the current batch.

sync()

Collect batch records across all ranks from the store and write them to the history.

clear

from_file

new_batch

new_epoch

record

to_file

to_list

record_batch(attr, value)[source]

Add a new value to the given column for the current batch.

Instead of writing to the history directly, write to the distributed store. Then, once the history is being read from, the values from the store are synchronized.

This class “remembers” which values were written to the store by creating a key that uniquely identifies the values. Choosing a correct key is crucial here, since it must not only be unique (say, a uuid), but also sufficient to replay the history so that the values can be recorded correctly.

When the structure of the key is changed, it must be changed accordingly inside the sync method.

sync()[source]

Collect batch records across all ranks from the store and write them to the history.

Syncing is not a single atomic operation; if something interrupts the flow, we can end up with an inconsistent state.

class skorch.history.History(iterable=(), /)[source]

History contains the information about the training history of a NeuralNet, facilitating some of the more common tasks that occur during training.

When you want to log certain information during training (say, a particular score or the norm of the gradients), you should write them to the net’s history object.
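
For instance, a custom callback could record the total gradient norm of the module for each batch; a rough sketch (the GradNorm callback and the 'grad_norm' column name are made up for illustration):

>>> from skorch import NeuralNetClassifier
>>> from skorch.callbacks import Callback
>>> class GradNorm(Callback):
...     """Record the total gradient norm of the module for each batch."""
...     def on_grad_computed(self, net, named_parameters, **kwargs):
...         norm = sum(p.grad.norm().item() for _, p in named_parameters if p.grad is not None)
...         net.history.record_batch('grad_norm', norm)
>>> net = NeuralNetClassifier(..., callbacks=[GradNorm()])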

It is basically a list of dicts, one for each epoch, each of which in turn contains a list of dicts, one for each batch. For convenience, it has enhanced slicing notation and some methods to write new items.

To access items from history, you may pass a tuple of up to four items:

  1. Slices along the epochs.

  2. Selects columns from history epochs, may be a single one or a tuple of column names.

  3. Slices along the batches.

  4. Selects columns from history batches, may be a single one or a tuple of column names.

You may use a combination of the four items.

If you select columns that are not present in all epochs/batches, only those epochs/batches are chosen that contain said columns. If this set is empty, a KeyError is raised.

Examples

>>> # ACCESSING ITEMS
>>> # history of a fitted neural net
>>> history = net.history
>>> # get current epoch, a dict
>>> history[-1]
>>> # get train losses from all epochs, a list of floats
>>> history[:, 'train_loss']
>>> # get train and valid losses from all epochs, a list of tuples
>>> history[:, ('train_loss', 'valid_loss')]
>>> # get current batches, a list of dicts
>>> history[-1, 'batches']
>>> # get latest batch, a dict
>>> history[-1, 'batches', -1]
>>> # get train losses from all batches of the current epoch, a list of floats
>>> history[-1, 'batches', :, 'train_loss']
>>> # get train and valid losses from all batches of the current epoch, a list of tuples
>>> history[-1, 'batches', :, ('train_loss', 'valid_loss')]
>>> # WRITING ITEMS
>>> # add new epoch row
>>> history.new_epoch()
>>> # add an entry to current epoch
>>> history.record('my-score', 123)
>>> # add a batch row to the current epoch
>>> history.new_batch()
>>> # add an entry to the current batch
>>> history.record_batch('my-batch-score', 456)
>>> # overwrite entry of current batch
>>> history.record_batch('my-batch-score', 789)

Methods

append(object, /)

Append object to the end of the list.

clear()

Remove all items from list.

copy()

Return a shallow copy of the list.

count(value, /)

Return number of occurrences of value.

extend(iterable, /)

Extend list by appending elements from the iterable.

from_file(f)

Load the history of a NeuralNet from a json file.

index(value[, start, stop])

Return first index of value.

insert(index, object, /)

Insert object before index.

new_batch()

Register a new batch row for the current epoch.

new_epoch()

Register a new epoch row.

pop([index])

Remove and return item at index (default last).

record(attr, value)

Add a new value to the given column for the current epoch.

record_batch(attr, value)

Add a new value to the given column for the current batch.

remove(value, /)

Remove first occurrence of value.

reverse()

Reverse IN PLACE.

sort(*[, key, reverse])

Sort the list in ascending order and return None.

to_file(f)

Saves the history as a json file.

to_list()

Return history object as a list.

classmethod from_file(f)[source]

Load the history of a NeuralNet from a json file.

Parameters
f : file-like object or str

new_batch()[source]

Register a new batch row for the current epoch.

new_epoch()[source]

Register a new epoch row.

record(attr, value)[source]

Add a new value to the given column for the current epoch.

record_batch(attr, value)[source]

Add a new value to the given column for the current batch.

to_file(f)[source]

Saves the history as a JSON file. To use this feature, the history must only contain JSON-encodable Python data structures; numpy and PyTorch types should not be in the history.

Parameters
f : file-like object or str

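A minimal sketch of a save/load roundtrip using to_file() and from_file() ('history.json' is a made-up file name):

>>> net.history.to_file('history.json')          # save the history of a fitted net
>>> from skorch.history import History
>>> loaded = History.from_file('history.json')
>>> loaded[:, 'train_loss']                      # read it back like any other History
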
to_list()[source]

Return history object as a list.