Support for Large Language Models

Large Language Models (LLMs) are finding more and more applications as their capabilities and availability grow over time. Even though skorch is not primarily focused on working with pre-trained models, it does provide a few integrations with the Hugging Face ecosystem and thus with the pre-trained models on Hugging Face. Some use cases for LLMs supported by skorch are detailed in this document.

Using language models as zero- and few-shot classifiers

In this section, we will show how to use skorch to perform zero-shot and few-shot classification.

Getting started with zero-shot classification

One application for LLMs is in zero-shot and few-shot learning. In particular, they can be used to perform zero-shot and few-shot classification. This means that without updating the model weights in a training step, the model can make useful predictions about what class best fits the input data.

As an example, let’s assume we have customer reviews and would like to know their sentiment, i.e. whether they are positive or negative. A customer review could look like this:

I’m very happy with this new smartphone. The display is excellent and the battery lasts for days. My only complaint is the camera, which could be better. Overall, I can highly recommend this product.

Thanks to its ability to understand text, an LLM should be perfectly capable of classifying this review as either positive or negative. With the help of skorch and pre-trained Hugging Face models, we can perform this type of prediction in a convenient way. Let’s show how this looks in code:

from skorch.llm import ZeroShotClassifier

clf = ZeroShotClassifier('bigscience/bloomz-1b1')
clf.fit(X=None, y=['positive', 'negative'])

review = """I'm very happy with this new smartphone. The display is excellent
and the battery lasts for days. My only complaint is the camera, which could
be better. Overall, I can highly recommend this product."""

clf.predict([review])  # returns 'positive'
clf.predict_proba([review])  # returns array([[0.05547452, 0.94452548]])

Let’s unpack this step by step. First of all, to run this, we need to install the Hugging Face transformers library in our environment. To do this, run:

python -m pip install transformers

Then, we import the skorch class ZeroShotClassifier. This class takes care of all the heavy lifting, like loading the LLM and generating the predictions. As expected from skorch, it is scikit-learn compatible, so you can use it for grid searching, etc. (more on that later).

As you can see, we initialize the model simply by passing the argument 'bigscience/bloomz-1b1'. What does this mean? This string is the name of a Large Language Model hosted on Hugging Face. It is called “bloomz” and, as implied by the name, it is the 1-billion-parameter variant (so it’s a “small” large language model by today’s standards). You can browse the Hugging Face models page to find more models that may be suitable for the task.

Note

It is possible to pass a model and tokenizer directly when initializing ZeroShotClassifier, like this: ZeroShotClassifier(model=model, tokenizer=tokenizer). In this case, the model and tokenizer are not downloaded from Hugging Face. This can be useful if you want to use your own models, which works as long as they are compatible with Hugging Face transformers. In particular, they have to implement the .generate method.
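As a minimal sketch of how this could look, here we load the same bloomz model manually with the transformers library and pass it in directly:

from transformers import AutoModelForCausalLM, AutoTokenizer
from skorch.llm import ZeroShotClassifier

# load the model and tokenizer ourselves; any transformers model that
# implements the .generate method should work
model = AutoModelForCausalLM.from_pretrained('bigscience/bloomz-1b1')
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloomz-1b1')

clf = ZeroShotClassifier(model=model, tokenizer=tokenizer)
clf.fit(X=None, y=['positive', 'negative'])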

After initializing the model, we fit it. What does “fitting” mean in the context of zero-shot learning? In actuality, there is no fitting. The only thing that happens under the hood is that skorch prepares a few things, such as downloading the model and tokenizer. This is why we pass X=None – we don’t actually need any training data for zero-shot learning. We do, however, need to pass y=['positive', 'negative'], because this information will be used by the model to determine which labels it is allowed to predict.

Then, we take the example review from above and pass it to the .predict method. As expected, the model predicts 'positive', which is indeed the sentiment that best fits the review.

When calling .predict_proba, we get back 0.055 and 0.945. The first value corresponds to “negative” and the second to “positive”. The order of these results is determined by the alphabetical order of the labels; check clf.classes_ if you’re unsure. So this means that the model predicts a 94.5% probability that the label is positive.
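To see which probability belongs to which label, we can pair the output of .predict_proba with clf.classes_:

proba = clf.predict_proba([review])[0]
print(dict(zip(clf.classes_, proba)))
# prints something like {'negative': 0.055..., 'positive': 0.944...}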

Of course, sentiment analysis is not the only thing that you can do with LLMs. Maybe you would like to know whether the review mentions the shopping experience on your e-commerce website. Or you want to identify all phone reviews that mention the battery life? This information could be part of a bigger machine learning pipeline (say, helping customers find phones with good battery life according to reviews). There is almost no limit to what can be done.
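As a sketch of how such a use case could look, the same pattern works with a different set of labels (the label names here are purely illustrative):

clf_battery = ZeroShotClassifier('bigscience/bloomz-1b1')
clf_battery.fit(X=None, y=['mentions battery life', 'does not mention battery life'])
clf_battery.predict([review])  # ideally 'mentions battery life' for our example review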

Prompting

One important aspect we didn’t mention yet is prompting. As you may know, to get the best results from an LLM, the prompt has to be crafted carefully. There is a lot of information on the web about how best to prompt LLMs, so we won’t repeat that here. But the nice thing about using ZeroShotClassifier is that it only takes a few lines of code to change the prompt and see if it improves the results or not.

Given the example of predicting the sentiment of customer reviews, let’s assume we have a (small) dataset of manually labeled data X and y. Now let’s say we want to test which of two prompts results in better accuracy. This is how we can achieve it:

from sklearn.metrics import accuracy_score

X, y = ...  # your data

# first the default prompt
clf_default = ZeroShotClassifier('bigscience/bloomz-1b1')
clf_default.fit(X=None, y=['negative', 'positive'])
accuracy_default = accuracy_score(y, clf_default.predict(X))

# now a custom prompt
my_prompt = """Your job is to analyze the sentiment of customer reviews.

The available sentiments are: {labels}

The customer review is:

```
{text}
```

Your response:"""

clf_my_prompt = ZeroShotClassifier('bigscience/bloomz-1b1', prompt=my_prompt)
clf_my_prompt.fit(X=None, y=['negative', 'positive'])
accuracy_my_prompt = accuracy_score(y, clf_my_prompt.predict(X))

In this example, we check whether the default prompt that skorch provides or our own customized prompt results in better accuracy. For our own prompt, we have to take care to include two placeholders: {labels} is where the possible labels are placed, in this case “negative” and “positive”, and {text} is where the input taken from X goes. Also note that we delimit the text input using “```”. Although it’s not strictly necessary, it is often a good idea to let the LLM know what part of the text belongs to the review and what part belongs to the instructions. Try experimenting with different delimiters.

Grid searching LLM parameters

Investigating the best prompt in the way described above can become quite tedious if we have a lot of prompts and metrics. Also, there might be other hyper-parameters we want to check. This is where sklearn.model_selection.GridSearchCV enters the stage. Since ZeroShotClassifier is sklearn-compatible, we can just plug it into a grid search and let sklearn do the tedious work for us. Let’s see how that works in practice:

from sklearn.model_selection import GridSearchCV
from skorch.llm import DEFAULT_PROMPT_ZERO_SHOT

params = {
    'model_name': ['bigscience/bloomz-1b1', 'gpt2', 'tiiuae/falcon-7b-instruct'],
    'prompt': [DEFAULT_PROMPT_ZERO_SHOT, my_prompt],
}
metrics = ['accuracy', 'neg_log_loss']
# model_name and prompt set here will be overridden by the grid search
clf = ZeroShotClassifier('bigscience/bloomz-1b1')
search = GridSearchCV(clf, param_grid=params, cv=2, scoring=metrics, refit=False)
search.fit(X, y)
print(search.cv_results_)

In this example, we grid search over three different LLMs, two different prompts, and calculate two scores.

Be aware that predicting with LLMs can be quite slow, depending on your available hardware, so this grid search could take quite a while. If you have a CUDA-capable GPU with sufficient memory, it is recommended to pass device='cuda' to ZeroShotClassifier.
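For example, assuming such a GPU is available:

# run the LLM on the GPU (requires a CUDA-capable GPU with enough memory)
clf = ZeroShotClassifier('bigscience/bloomz-1b1', device='cuda')
clf.fit(X=None, y=['negative', 'positive'])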

Few-shot classification

As research has shown, providing the LLM with a few examples of how to perform its task can result in much improved performance. This technique is called few-shot learning and we support it in skorch as well. It works almost the same as zero-shot learning, with a few notable differences that we’ll explain in a minute. First, let’s check out a code example:

from skorch.llm import FewShotClassifier

X_train, y_train, X_test, y_test = ...  # your data
clf = FewShotClassifier('bigscience/bloomz-1b1', max_samples=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

This example is almost identical to what we saw previously. We need to import FewShotClassifier and provide an LLM name. We also set the max_samples parameter here. This indicates how many samples we want to use for few-shot learning. Those will later be added to the prompt.

In contrast to zero-shot learning, during the fit call, we pass some actual data. This is because from the passed X_train and y_train, 5 samples will be picked to augment the prompt with examples. skorch will try to pick examples such that all labels are represented at least once, if possible.

Since the examples are picked from the data and their labels are shown to the LLM, it is important in this case to split train and test data, which is something we didn’t need to bother with when using zero-shot learning. Of course, if you later want to run a grid search, sklearn will automatically do this for you.

The number of samples taken to augment the prompt is controlled by the max_samples parameter. By default, this value is 5, i.e. 5 samples from X_train and y_train are used. The samples that are not used don’t actually influence the outcome. Therefore, you should keep X_train and y_train as small as possible. This leaves more samples for your validation/testing data.
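As a sketch, assuming X and y hold all of your labeled data, the split could look like this (the exact sizes are just an example):

from sklearn.model_selection import train_test_split

# a small training split is enough, since only max_samples examples end up in the prompt
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=20, stratify=y, random_state=0
)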

The rest is no different from what we saw earlier. We can call .predict and .predict_proba and calculate scores based on those predictions.

When it comes to testing different prompts, the approach is almost the same as for zero-shot classification. One difference is that on top of placeholders for {labels} and {text}, there should be one more placeholder for {examples}. This is where the few-shot examples will be placed.
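A custom few-shot prompt could thus look like this (a sketch modeled after the zero-shot prompt above; whether it improves results depends on the LLM and the data):

my_few_shot_prompt = """Your job is to analyze the sentiment of customer reviews.

The available sentiments are: {labels}

Here are a few examples:

{examples}

The customer review is:

```
{text}
```

Your response:"""

clf = FewShotClassifier(
    'bigscience/bloomz-1b1', prompt=my_few_shot_prompt, max_samples=5
)
clf.fit(X_train, y_train)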

One big disadvantage of few-shot learning is that since the prompt is augmented with examples, it becomes much longer. Not only will this lead to slower prediction times, it also makes it more likely that the prompt length will exceed the context window size of the LLM. If that happens, we will get a warning from Hugging Face transformers. If in doubt, just run a grid search on the max_samples parameter; that way, you will see how it affects the scores and the inference time.
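For instance, a small grid search over max_samples could look like this (reusing the imports and data from above):

params = {'max_samples': [3, 5, 10]}
clf = FewShotClassifier('bigscience/bloomz-1b1')
search = GridSearchCV(clf, param_grid=params, cv=2, scoring='accuracy', refit=False)
search.fit(X, y)
print(search.cv_results_)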

Detecting issues with LLMs

LLMs are a bit of a black box, so it can be difficult to detect possible issues. skorch provides some options that can help with noticing issues in the first place.

In general, if the model doesn’t perform as expected, you should find that reflected in the metrics. In the examples above, if the prompt was not well chosen or if the LLM was not up to the task, we would expect the accuracy to have a low value. This should be a first indicator that something is wrong.

To dig deeper, it can be useful to figure out if the probabilities that the model assigns to the labels are low. At first glance, this is not easy to detect. If we call predict_proba, the probabilities across the labels will always sum up to 1 for each sample, as is expected from probabilities. However, if we initialize the classifier with the parameter probas_sum_to_1=False, we will receive the unnormalized probabilities.

For example, given that we have set this option, let’s assume that the returned probabilities for “negative” and “positive” are 0.1 and 0.2. This means that the LLM assigns a probability of 0.7 to something other than these two labels. If we observe this, it is a strong indicator that the prompt is not working well for this specific LLM, since it steers the generated text away from the expected labels. Consider adjusting the prompt, choosing a different LLM, or performing few-shot learning if you aren’t already.

skorch provides more options to detect such issues. Set the argument error_low_prob to "warn" and you will get a warning when the (unnormalized) probabilities are too low. Set it to "raise" to raise an error instead.

If you find that the model only occasionally assigns low probabilities to the labels, you may want to set error_low_prob='return_none'. In that case, skorch will not complain about low probabilities, but for each sample with low probabilities, .predict will actually return None instead of the label with the highest probability (.predict_proba is unaffected). You can later decide what to do with those particular predictions.

If you make use of any of those options, you should also change threshold_low_prob. This is a float that indicates what probability is actually considered to be “low”. For the sentiment example from above, if we set this value to 0.5, it means that the probability is considered low if the sum of the probabilities for both “negative” and “positive” is less than 0.5.
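A sketch of how these options could be combined, assuming a summed probability below 0.5 counts as low:

# warn whenever the (unnormalized) probability summed over all labels is below 0.5
clf = ZeroShotClassifier(
    'bigscience/bloomz-1b1',
    error_low_prob='warn',
    threshold_low_prob=0.5,
)
clf.fit(X=None, y=['negative', 'positive'])
y_pred = clf.predict(X)  # warns when the summed label probability is too low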

A common issue you may discover is that the model always returns a very low probability. How can we make some progress in that case? A tip is to inspect what the model would actually generate for a given prompt. This will help to figure out what the LLM is trying to achieve, which may guide us towards finding a better prompt. Generating the output is not difficult because skorch just uses Hugging Face transformers models, so we can use them to generate sequences without any constraint. Here is a code snippet:

clf = ZeroShotClassifier(..., probas_sum_to_1=False)
clf.fit(X, y)
y_proba = clf.predict_proba(X)
# we notice that y_proba values are quite low, let's check what the LLM tries
# to do:
prompt = clf.get_prompt(X[0])  # get prompt for 1st sample from X
inputs = clf.tokenizer_(prompt, return_tensors='pt').to(clf.device_)
# adjust the min_new_tokens argument as needed
output = clf.model_.generate(**inputs, min_new_tokens=20)
generation = clf.tokenizer_.decode(output[0])
print(generation)

Now we can see what the LLM actually tries to generate if we’re not forcing it to predict one of the labels. This can often help us understand the underlying issue better.

Some examples of issues this can help identify:

  • If the model doesn’t return the label but instead generates new example inputs, it might be confused about the structure of the expected response; carefully rewording the prompt or using few-shot learning may help here.
  • If the model tries to insert a spurious new line/space before the label, add that new line/space to the end of your prompt. If the model still insists, change the label to contain it, e.g. from “positive” to “ positive”.
  • If the model doesn’t produce any output, check that clf.model_.max_length is not set too low.

Hopefully, these tips can help resolve the most common issues.

Advantages

What are some advantages of using skorch for zero-shot and few-shot classification?

  • Working with few or even no labeled samples: Supervised ML methods are not an option when there is not enough labeled data. Using LLMs might provide a solution that is good enough for your use case.
  • Forcing the LLM to output one of the labels: By using ZeroShotClassifier and FewShotClassifier, you can ensure that the LLM will only ever predict one of the desired labels. Usually, when working with LLMs, this can be quite tricky – no matter how well the prompt is phrased, there is no guarantee that the LLM won’t return an undesired output. With skorch, you don’t have to worry about that.
  • Returning probabilities: When generating texts with LLMs, you will usually only get the generated text as output. But often, we would like to know the associated probabilities: Is the model 99% sure that the label is “positive” or only 51%? skorch provides the .predict_proba method, which will return those probabilities to you.
  • Caching: skorch performs some caching under the hood. This can lead to faster prediction times, especially when the labels are long and share a common prefix.
  • Scikit-learn compatibility: Thanks to skorch, the classifiers are compatible with sklearn. You can call fit, predict, and predict_proba as you always do. You can run a grid search to identify the best LLM and prompt. You can use the classifier as a drop-in replacement for your existing sklearn model. Or you can start prototyping with a zero/few-shot classifier and later, when more labeled data is available, replace it with an sklearn model, without the need to change any further code.
  • Everything runs locally: Once the model and tokenizer have been downloaded, everything runs locally on your machine. No data is sent to any API provider, such as OpenAI. If you work with sensitive data or company data, you don’t have to worry about leaking it to the outside world. (Tip: If you actually prefer to use the OpenAI API, take a look at scikit-llm.)

That said, there are situations where you should not use zero-shot or few-shot classification with skorch: When you have sufficient amounts of labeled data, a supervised ML approach (like skorch.NeuralNetClassifier) will most likely work better, require less memory and compute, and thus have faster inference. If you need more open-ended text generation, say abstractive summarization instead of predicting from a fixed list of labels, this is also not a good use case. Another concern could be interpretability: LLMs are mostly a black box, while some other ML methods lend themselves much better to interpretation.

Technical details

Q: How do ZeroShotClassifier and FewShotClassifier ensure that only the given labels, such as “positive” and “negative”, are generated? Usually, a language model can generate any tokens.

A: Under the hood, we intercept the model predictions (the logits) and force them to be one of the labels. That way, we can ensure that the model will never predict anything that we’re not expecting. This is possible because of integration with Hugging Face transformers.
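To illustrate the general idea, here is a simplified sketch (not the actual skorch internals) of how logits can be masked so that only allowed tokens can be generated:

import torch

logits = torch.randn(1000)      # toy logits over a vocabulary of 1000 tokens
allowed_token_ids = [123, 789]  # hypothetical first tokens of the allowed labels

# mask out every token that cannot start one of the labels
masked = torch.full_like(logits, float('-inf'))
masked[allowed_token_ids] = logits[allowed_token_ids]
next_token_id = masked.argmax().item()  # guaranteed to be one of the allowed tokens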

Q: How do ZeroShotClassifier and FewShotClassifier derive the probabilities? Usually, a language model just returns text.

A: We don’t use the generated text but instead directly inspect the logits returned by the language model. For example, let’s assume that the label “positive” is represented as two tokens, [123, 456]. Then we first check the logit that the language model assigns to token 123, given the input prompt, then force the model to predict 123 and check the logit assigned to token 456. The logits are then converted to probabilities and aggregated.
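As a simplified sketch of the aggregation for this two-token example (again, not the actual skorch code):

import torch

vocab_size = 1000
logits_step1 = torch.randn(vocab_size)  # toy logits given the prompt
logits_step2 = torch.randn(vocab_size)  # toy logits after forcing token 123

# conditional probabilities of the two tokens that make up the label "positive"
p_123 = torch.softmax(logits_step1, dim=-1)[123]
p_456 = torch.softmax(logits_step2, dim=-1)[456]

# aggregate into a single (unnormalized) probability for the label
p_positive = (p_123 * p_456).item()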

More examples

To see a complete working example of using ZeroShotClassifier and FewShotClassifier, take a look at this notebook about using LLMs for classification. It contains a movie review sentiment analysis use case and also compares the results with those of other methods.