History¶
A NeuralNet object logs training progress internally using a History object, stored in the history attribute. Among other use cases, history is used to print the training progress after each epoch:
net.fit(X, y)
# prints
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.7111 0.5100 0.6894 0.1345
2 0.6928 0.5500 0.6803 0.0608
3 0.6833 0.5650 0.6741 0.0620
4 0.6763 0.5850 0.6674 0.0594
All this information (and more) is stored in, and can be accessed through, net.history. It is thus best practice to make use of history for storing training-related data.
In general, History works like a list of dictionaries, where each item in the list corresponds to one epoch and each key of the dictionary to one column. Thus, if you would like to access the 'train_loss' of the last epoch, you can call net.history[-1]['train_loss']. To make the history more accessible, though, it is possible to just pass the indices separated by a comma: net.history[-1, 'train_loss'].
Moreover, History stores the results from each individual batch under the batches key during each epoch. Indices are zero-based, so to get the train loss of the 3rd batch of the 7th epoch, use net.history[6, 'batches', 2, 'train_loss'].
Here are some examples showing how to index history:
# history of a fitted neural net
history = net.history
# get current epoch, a dict
history[-1]
# get train losses from all epochs, a list of floats
history[:, 'train_loss']
# get train and valid losses from all epochs, a list of tuples
history[:, ('train_loss', 'valid_loss')]
# get current batches, a list of dicts
history[-1, 'batches']
# get latest batch, a dict
history[-1, 'batches', -1]
# get train losses from all batches of the current epoch, a list of floats
history[-1, 'batches', :, 'train_loss']
# get train and valid losses from all batches of the current epoch, a list of tuples
history[-1, 'batches', :, ('train_loss', 'valid_loss')]
As History essentially is a list of dictionaries, you can also write to it as if it were a list of dictionaries. Here too, skorch provides some convenience functions to make life easier. First there is new_epoch(), which will add a new epoch dictionary to the end of the list. Also, there is new_batch() for adding new batches to the current epoch.
To add a new item to the current epoch, use history.record('foo', 123). This will set the value 123 for the key foo of the current epoch. To write a value to the current batch, use history.record_batch('bar', 456). Below are some more examples:
# history of a fitted neural net
history = net.history
# add new epoch row
history.new_epoch()
# add an entry to current epoch
history.record('my-score', 123)
# add a batch row to the current epoch
history.new_batch()
# add an entry to the current batch
history.record_batch('my-batch-score', 456)
# overwrite entry of current batch
history.record_batch('my-batch-score', 789)
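The write operations above can be modeled in a few lines of plain Python. This sketch is, again, not skorch's actual implementation, but it shows what new_epoch(), new_batch(), record(), and record_batch() boil down to:

```python
class ToyHistory(list):
    """Toy sketch of History's write API on a list of epoch dicts."""

    def new_epoch(self):
        # each epoch row starts with an empty list of batch rows
        self.append({'batches': []})

    def new_batch(self):
        # add an empty batch row to the current epoch
        self[-1]['batches'].append({})

    def record(self, attr, value):
        # write (or overwrite) a key of the current epoch
        self[-1][attr] = value

    def record_batch(self, attr, value):
        # write (or overwrite) a key of the current batch
        self[-1]['batches'][-1][attr] = value


history = ToyHistory()
history.new_epoch()
history.record('my-score', 123)
history.new_batch()
history.record_batch('my-batch-score', 456)
history.record_batch('my-batch-score', 789)  # overwrites the 456
print(history)  # [{'batches': [{'my-batch-score': 789}], 'my-score': 123}]
```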
Distributed history¶
When training a net in a distributed setting, e.g. when using torch.nn.parallel.DistributedDataParallel, directly or indirectly with the help of AccelerateMixin, the default history class should not be used. This is because each process will have its own history instance, with no syncing happening between processes; therefore, the information in the histories can diverge. When steering the training process through the histories, the resulting differences can cause trouble. When using early stopping, for instance, one process could receive the signal to stop but not the other.
To avoid this, use the DistributedHistory class provided by skorch. It will take care of syncing the distributed batch information across processes, which prevents the issue just described.
This class needs to be initialized with a distributed store provided by PyTorch. We have only tested torch.distributed.TCPStore so far, so if unsure, use that one, though torch.distributed.FileStore should also work. The DistributedHistory also needs to be initialized with its rank and the world size (number of processes) so that it has all the required information to perform the syncing. When using accelerate, that information can be retrieved from the Accelerator instance.
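To illustrate the role of the store, the rank, and the world size, here is a conceptual sketch of syncing through a shared key-value store. A plain dict stands in for torch.distributed.TCPStore, and the key scheme and the sync_batches helper are made up for illustration; in a real setup, the store's blocking get() is what lets each process wait until the other ranks have written their entries:

```python
import json


class ToyStore(dict):
    """Stand-in for a distributed key-value store such as TCPStore."""

    def set(self, key, value):
        self[key] = value

    def get(self, key):
        # a real distributed store would block here until the key exists
        return self[key]


def sync_batches(store, rank, world_size, local_batches, epoch):
    """Publish this rank's batch rows, then merge the rows of all ranks."""
    store.set(f"batches-{epoch}-{rank}", json.dumps(local_batches))
    merged = []
    for r in range(world_size):  # read every rank's entries in rank order
        merged.extend(json.loads(store.get(f"batches-{epoch}-{r}")))
    return merged


store = ToyStore()
# pretend rank 1 already published its batch rows for epoch 0
store.set("batches-0-1", json.dumps([{'train_loss': 0.4}]))
# rank 0 publishes its own rows and merges everything
merged = sync_batches(store, rank=0, world_size=2,
                      local_batches=[{'train_loss': 0.5}], epoch=0)
print(merged)  # [{'train_loss': 0.5}, {'train_loss': 0.4}]
```

Because every rank reads the same keys in the same order, all processes end up with an identical merged view, which is what keeps callbacks such as early stopping in agreement across processes.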
A typical training script without accelerate may contain a function like this:
from torch.distributed import TCPStore
from torch.nn.parallel import DistributedDataParallel

def train(rank, world_size, is_master):
    store = TCPStore(
        "127.0.0.1", port=1234, world_size=world_size, is_master=is_master)
    dist_history = DistributedHistory(
        store=store, rank=rank, world_size=world_size)
    net = NeuralNetClassifier(..., history=dist_history)
    net.fit(X, y)
When using AccelerateMixin, it could look like this instead:
from accelerate import Accelerator
from skorch.hf import AccelerateMixin

accelerator = Accelerator(...)

def train(accelerator):
    is_master = accelerator.is_main_process
    world_size = accelerator.num_processes
    rank = accelerator.local_process_index
    store = TCPStore(
        "127.0.0.1", port=1234, world_size=world_size, is_master=is_master)
    dist_history = DistributedHistory(
        store=store, rank=rank, world_size=world_size)
    net = AcceleratedNet(..., history=dist_history)
    net.fit(X, y)
When using accelerate in a non-distributed setting (e.g. to take advantage of mixed precision training), it is not necessary to use DistributedHistory; the normal history class will do.