History

A NeuralNet object logs training progress internally using a History object, stored in the history attribute. Among other use cases, history is used to print the training progress after each epoch:

net.fit(X, y)

# prints
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.7111       0.5100        0.6894  0.1345
      2        0.6928       0.5500        0.6803  0.0608
      3        0.6833       0.5650        0.6741  0.0620
      4        0.6763       0.5850        0.6674  0.0594

All of this information (and more) is stored in, and can be accessed through, net.history. It is therefore best practice to use the history for storing training-related data.

In general, History works like a list of dictionaries: each item in the list corresponds to one epoch, and each key of the dictionary to one column. Thus, to access the 'train_loss' of the last epoch, you can call net.history[-1]['train_loss']. To make the history easier to work with, you can instead pass all indices in a single subscript, separated by commas: net.history[-1, 'train_loss'].

Moreover, History stores the results from each individual batch under the batches key of each epoch. Since indexing is zero-based, to get the train loss of the 3rd batch of the 7th epoch, use net.history[6, 'batches', 2, 'train_loss'].

Here are some examples showing how to index history:

# history of a fitted neural net
history = net.history
# get current epoch, a dict
history[-1]
# get train losses from all epochs, a list of floats
history[:, 'train_loss']
# get train and valid losses from all epochs, a list of tuples
history[:, ('train_loss', 'valid_loss')]
# get the batches of the current epoch, a list of dicts
history[-1, 'batches']
# get latest batch, a dict
history[-1, 'batches', -1]
# get train losses from all batches of the current epoch, a list of floats
history[-1, 'batches', :, 'train_loss']
# get train and valid losses from all batches of the current epoch, a list of tuples
history[-1, 'batches', :, ('train_loss', 'valid_loss')]
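
Since slicing across epochs returns plain lists, the history plays well with other tools. As a minimal sketch (assuming matplotlib is installed; it is not a skorch dependency), the loss curves of a fitted net could be plotted like this:

import matplotlib.pyplot as plt

# plot train and valid loss curves over all epochs
plt.plot(history[:, 'train_loss'], label='train loss')
plt.plot(history[:, 'valid_loss'], label='valid loss')
plt.xlabel('epoch')
plt.legend()
plt.show()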

Since History is essentially a list of dictionaries, you can also write to it as you would to a list of dictionaries. Here too, skorch provides some convenience methods to make life easier: new_epoch() adds a new epoch dictionary to the end of the list, and new_batch() adds a new batch to the current epoch.

To add a new item to the current epoch, use history.record('foo', 123). This will set the value 123 for the key foo of the current epoch. To write a value to the current batch, use history.record_batch('bar', 456). Below are some more examples:

# history of a fitted neural net
history = net.history
# add new epoch row
history.new_epoch()
# add an entry to current epoch
history.record('my-score', 123)
# add a batch row to the current epoch
history.new_batch()
# add an entry to the current batch
history.record_batch('my-batch-score', 456)
# overwrite entry of current batch
history.record_batch('my-batch-score', 789)
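
The record() method is also the natural hook for storing custom values from inside a callback. Below is a minimal sketch; the LRRecorder name and its details are illustrative, not part of skorch. It records the current learning rate at the end of each epoch, which can afterwards be retrieved with net.history[:, 'lr']:

from skorch.callbacks import Callback

class LRRecorder(Callback):
    """Record the current learning rate at the end of each epoch."""
    def on_epoch_end(self, net, **kwargs):
        # read the learning rate from the first optimizer param group
        lr = net.optimizer_.param_groups[0]['lr']
        net.history.record('lr', lr)

net = NeuralNetClassifier(..., callbacks=[LRRecorder()])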

Distributed history

When training a net in a distributed setting, e.g. with torch.nn.parallel.DistributedDataParallel (used directly, or indirectly with the help of AccelerateMixin), the default History class should not be used. Each process would get its own history instance, with no syncing between processes, so the information in the histories can diverge. This becomes a problem when the training process is steered based on the history: with early stopping, for instance, one process could receive the signal to stop while the others do not.

To avoid this, use the DistributedHistory class provided by skorch. It will take care of syncing the distributed batch information across processes, which will prevent the issue just described.

This class needs to be initialized with a distributed store provided by PyTorch. So far, we have only tested torch.distributed.TCPStore, so if unsure, use that one, though torch.distributed.FileStore should also work. The DistributedHistory also needs to know the rank of its process and the world size (the number of processes) so that it has all the information required to perform the syncing. When using accelerate, this information can be retrieved from the Accelerator instance.

A typical training script without accelerate may contain a function like this:

from torch.distributed import TCPStore

from skorch import NeuralNetClassifier
from skorch.history import DistributedHistory

def train(rank, world_size, is_master):
    # rank 0 should act as the master of the TCPStore
    store = TCPStore(
        "127.0.0.1", port=1234, world_size=world_size, is_master=is_master)
    dist_history = DistributedHistory(
        store=store, rank=rank, world_size=world_size)
    net = NeuralNetClassifier(..., history=dist_history)
    net.fit(X, y)
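
How the processes are launched is outside the scope of this section. As a minimal sketch (assuming two processes on a single machine), torch.multiprocessing.spawn could be used, with rank 0 conventionally acting as the master of the store:

import torch.multiprocessing as mp

def worker(rank, world_size):
    # mp.spawn passes the process rank as the first argument;
    # by convention, rank 0 hosts the TCPStore
    train(rank, world_size, is_master=(rank == 0))

if __name__ == '__main__':
    world_size = 2  # e.g. two processes
    mp.spawn(worker, args=(world_size,), nprocs=world_size)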

When using AccelerateMixin, it could look like this instead:

from accelerate import Accelerator
from torch.distributed import TCPStore

from skorch import NeuralNetClassifier
from skorch.hf import AccelerateMixin
from skorch.history import DistributedHistory

class AcceleratedNet(AccelerateMixin, NeuralNetClassifier):
    """NeuralNetClassifier with accelerate support"""

accelerator = Accelerator(...)

def train(accelerator):
    is_master = accelerator.is_main_process
    world_size = accelerator.num_processes
    rank = accelerator.local_process_index
    store = TCPStore(
        "127.0.0.1", port=1234, world_size=world_size, is_master=is_master)
    dist_history = DistributedHistory(
        store=store, rank=rank, world_size=world_size)
    net = AcceleratedNet(
        ..., accelerator=accelerator, history=dist_history)
    net.fit(X, y)

train(accelerator)
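
Such a script is typically started through the accelerate command line interface, e.g. with accelerate launch, which takes care of spawning one process per device; refer to the accelerate documentation for the available launch options.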

When using accelerate in a non-distributed setting (e.g. to take advantage of mixed precision training), it is not necessary to use DistributedHistory; the normal History class will do.