# skorch.helper¶

Helper functions and classes for users.

They should not be used in skorch directly.

class skorch.helper.DataFrameTransformer(treat_int_as_categorical=False, float_dtype=<class 'numpy.float32'>, int_dtype=<class 'numpy.int64'>)[source]

Transform a DataFrame into a dict useful for working with skorch.

Transforms cardinal data to floats and categorical data to vectors of ints so that they can be embedded.

Although skorch can deal with pandas DataFrames, the default behavior is often not very useful. Use this transformer to transform the DataFrame into a dict with all float columns concatenated using the key “X” and all categorical values encoded as integers, using their respective column names as keys.

Your module must have a matching signature for this to work. It must accept an argument X for all cardinal values. Additionally, for all categorical values, it must accept an argument with the same name as the corresponding column (see example below). If you need help with the required signature, use the describe_signature method of this class and pass it your data.

You can choose whether you want to treat int columns the same as float columns (default) or as categorical values.

To one-hot encode categorical features, initialize their corresponding embedding layers using the identity matrix.

Parameters: treat_int_as_categorical : bool (default=False) Whether to treat integers as categorical values or as cardinal values, i.e. the same as floats. float_dtype : numpy dtype or None (default=np.float32) The dtype to cast the cardinal values to. If None, don’t change them. int_dtype : numpy dtype or None (default=np.int64) The dtype to cast the categorical values to. If None, don’t change them. If you do this, it can happen that the categorical values will have different dtypes, reflecting the number of unique categories.

Notes

The value of X will always be 2-dimensional, even if it only contains 1 column.

Examples

>>> df = pd.DataFrame({
...     'col_floats': np.linspace(0, 1, 12),
...     'col_ints': [11, 11, 10] * 4,
...     'col_cats': ['a', 'b', 'a'] * 4,
... })
>>> # cast to category dtype to later learn embeddings
>>> df['col_cats'] = df['col_cats'].astype('category')
>>> y = np.asarray([0, 1, 0] * 4)

>>> class MyModule(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.reset_params()

>>>     def reset_params(self):
...         self.embedding = nn.Embedding(2, 10)
...         self.linear = nn.Linear(2, 10)
...         self.out = nn.Linear(20, 2)
...         self.nonlin = nn.Softmax(dim=-1)

>>>     def forward(self, X, col_cats):
...         # "X" contains the values from col_floats and col_ints
...         # "col_cats" contains the values from "col_cats"
...         X_lin = self.linear(X)
...         X_cat = self.embedding(col_cats)
...         X_concat = torch.cat((X_lin, X_cat), dim=1)
...         return self.nonlin(self.out(X_concat))

>>> net = NeuralNetClassifier(MyModule)
>>> pipe = Pipeline([
...     ('transform', DataFrameTransformer()),
...     ('net', net),
... ])
>>> pipe.fit(df, y)


Methods

 describe_signature(df) Describe the signature required for the given data. fit(df[, y]) fit_transform(X[, y]) Fit to data, then transform it. get_params([deep]) Get parameters for this estimator. set_params(**params) Set the parameters of this estimator. transform(df) Transform DataFrame to become a dict that works well with skorch.
describe_signature(df)[source]

Describe the signature required for the given data.

Pass the DataFrame to receive a description of the signature required for the module’s forward method. The description consists of three parts:

1. The names of the arguments that the forward method needs. 2. The dtypes of the torch tensors passed to forward. 3. The number of input units that are required for the corresponding argument. For the float parameter, this is just the number of dimensions of the tensor. For categorical parameters, it is the number of unique elements.

Returns: signature : dict Returns a dict with each key corresponding to one key required for the forward method. The values are dictionaries of two elements. The key “dtype” describes the torch dtype of the resulting tensor, the key “input_units” describes the required number of input units.
pd = <module 'pandas' from '/home/docs/.pyenv/versions/3.7.9/lib/python3.7/site-packages/pandas/__init__.py'>[source]
transform(df)[source]

Transform DataFrame to become a dict that works well with skorch.

Parameters: df : pd.DataFrame Incoming DataFrame. X_dict: dict Dictionary with all floats concatenated using the key “X” and all categorical values encoded as integers, using their respective column names as keys.
class skorch.helper.SliceDataset(dataset, idx=0, indices=None)[source]

Helper class that wraps a torch dataset to make it work with sklearn.

Sometimes, sklearn will touch the input data, e.g. when splitting the data for a grid search. This will fail when the input data is a torch dataset. To prevent this, use this wrapper class for your dataset.

Note: This class will only return the X value by default (i.e. the first value returned by indexing the original dataset). Sklearn, and hence skorch, always require 2 values, X and y. Therefore, you still need to provide the y data separately.

Note: This class behaves similarly to a PyTorch Subset when it is indexed by a slice or numpy array: It will return another SliceDataset that references the subset instead of the actual values. Only when it is indexed by an int does it return the actual values. The reason for this is to avoid loading all data into memory when sklearn, for instance, creates a train/validation split on the dataset. Data will only be loaded in batches during the fit loop.

Parameters: dataset : torch.utils.data.Dataset A valid torch dataset. idx : int (default=0) Indicates which element of the dataset should be returned. Typically, the dataset returns both X and y values. SliceDataset can only return 1 value. If you want to get X, choose idx=0 (default), if you want y, choose idx=1. indices : list, np.ndarray, or None (default=None) If you only want to return a subset of the dataset, indicate which subset that is by passing this argument. Typically, this can be left to be None, which returns all the data. See also Subset.

Examples

>>> X = MyCustomDataset()
>>> search = GridSearchCV(net, params, ...)
>>> search.fit(X, y)  # raises error
>>> ds = SliceDataset(X)
>>> search.fit(ds, y)  # works

Attributes: shape

Methods

 count(value) index(value, [start, [stop]]) Raises ValueError if the value is not present. transform(data) Additional transformations on data.
transform(data)[source]

Additional transformations on data.

Note: If you use this in conjuction with PyTorch DataLoader, the latter will call the dataset for each row separately, which means that the incoming data is a single rows.

class skorch.helper.SliceDict(**kwargs)[source]

Wrapper for Python dict that makes it sliceable across values.

Use this if your input data is a dictionary and you have problems with sklearn not being able to slice it. Wrap your dict with SliceDict and it should usually work.

Note:

• SliceDict cannot be indexed by integers, if you want one row, say row 3, use [3:4].
• SliceDict accepts numpy arrays and torch tensors as values.

Examples

>>> X = {'key0': val0, 'key1': val1}
>>> search = GridSearchCV(net, params, ...)
>>> search.fit(X, y)  # raises error
>>> Xs = SliceDict(key0=val0, key1=val1)  # or Xs = SliceDict(**X)
>>> search.fit(Xs, y)  # works

Attributes: shape

Methods

 clear() copy() fromkeys(*args, **kwargs) fromkeys method makes no sense with SliceDict and is thus not supported. get($self, key[, default]) Return the value for key if key is in the dictionary, else default. items() keys() pop(k[,d]) If key is not found, d is returned if given, otherwise KeyError is raised popitem() 2-tuple; but raise KeyError if D is empty. setdefault($self, key[, default]) Insert key with a value of default if key is not in the dictionary. update([E, ]**F) If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k] values()
copy() → a shallow copy of D[source]
fromkeys(*args, **kwargs)[source]

fromkeys method makes no sense with SliceDict and is thus not supported.

update([E, ]**F) → None. Update D from dict/iterable E and F.[source]

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

skorch.helper.predefined_split(dataset)[source]

Uses dataset for validiation in NeuralNet.

Parameters: dataset: torch Dataset Validiation dataset

Examples

>>> valid_ds = skorch.Dataset(X, y)
>>> net = NeuralNet(..., train_split=predefined_split(valid_ds))