AI Sparks

Text Summarization with Scikit-LLM – MachineLearningMastery.com

In this article, you’ll learn how to use text summarization with scikit-LLM to handle large volumes of text in machine learning pipelines.

Topics we will cover include:

  • How to build a custom scikit-learn-compatible transformer that wraps a Hugging Face summarization model.
  • How to integrate LLM-driven text summarization into a scikit-learn Pipeline for data preprocessing.
  • How to combine summarization, TF-IDF vectorization, and a classifier into a single end-to-end pipeline.

Introduction

In the previous post, we introduced scikit-LLM, a library that bridges the gap between traditional machine learning models and modern large language models (LLMs). In particular, we showed how scikit-LLM supports zero-shot and few-shot classification use cases.

Now we try to answer the question: what if our downstream machine learning use case is hampered by large amounts of text? To meet this challenge, we will explore summarization, another powerful feature of this library, which turns long texts into condensed summaries. Let’s see how, using a data preparation pipeline that includes this process!

Initial Setup

The first step is to make sure you have scikit-LLM installed, for example with pip install scikit-llm. Replace “pip” with “!pip” if you are working in a cloud notebook environment.

Note that by default, scikit-LLM relies on OpenAI language models, which can be expensive to use repeatedly, and whose usage may be tightly limited under a free OpenAI account. Alternatively, you can use a pre-trained Hugging Face model for summarization, such as sshleifer/distilbart-cnn-12-6. In that case, make sure to also install Hugging Face’s Transformers library so that you can load Hugging Face models on your system.
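Before going further, it can help to confirm that the required packages are importable. The snippet below is a small sketch, not part of the original setup: the helper name missing_packages is made up for illustration, and it relies only on the fact that scikit-LLM is imported as skllm.

```python
import importlib.util

# PyPI name -> import name. Note that scikit-llm is imported as "skllm".
REQUIRED = {
    "scikit-llm": "skllm",
    "transformers": "transformers",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```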

An LLM-Driven Text Summarization Pipeline

The class defined for this step has two responsibilities: loading a pre-trained model (fit()) and running inference with it, i.e. summarizing the input texts (transform()).
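A minimal sketch of such a class could look as follows. The class name HuggingFaceSummarizer, the length limits, and the summarize_fn hook are assumptions made for illustration (the hook only exists so the class can be exercised offline), and the code assumes a Transformers version that provides the "summarization" pipeline task.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class HuggingFaceSummarizer(TransformerMixin, BaseEstimator):
    """Illustrative transformer that replaces each input text with a summary."""

    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6",
                 max_length=60, min_length=10, summarize_fn=None):
        self.model_name = model_name
        self.max_length = max_length      # assumed upper bound on summary tokens
        self.min_length = min_length      # assumed lower bound on summary tokens
        self.summarize_fn = summarize_fn  # hypothetical hook: callable(list[str]) -> list[str]

    def fit(self, X, y=None):
        # "Fitting" only loads the pre-trained model; nothing is trained here.
        if self.summarize_fn is None:
            # Imported lazily so the class can be defined without transformers.
            from transformers import pipeline
            self.pipeline_ = pipeline("summarization", model=self.model_name)
        else:
            self.pipeline_ = None
        return self

    def transform(self, X):
        texts = list(X)
        if self.summarize_fn is not None:
            return list(self.summarize_fn(texts))
        outputs = self.pipeline_(
            texts,
            max_length=self.max_length,
            min_length=self.min_length,
            truncation=True,  # cut inputs that exceed the model's context size
        )
        return [out["summary_text"] for out in outputs]


# Offline demo with a stub; omit summarize_fn to load the real model in fit().
demo = HuggingFaceSummarizer(summarize_fn=lambda texts: [t[:40] for t in texts])
print(demo.fit_transform(["A long product review that goes on and on about details."]))
```

Keeping the model load in fit() rather than in __init__() means the object stays cheap to construct and copy, which is how scikit-learn expects estimators to behave.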

Importantly, the class we defined inherits from scikit-learn’s base classes for custom transformers: a necessary step to ensure that Hugging Face models integrate correctly with scikit-learn preprocessing and modeling tools.

For simplicity, we will summarize only two text reviews, which would normally be part of a larger text classification dataset. The two “long” texts are the features, and the sentiment of each review is its label.
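As an illustration, a two-example dataset of this kind could be laid out as below. Both reviews and the label names are made up for this sketch and are not the texts used in the original article.

```python
# Two invented "long" reviews (features) with their sentiment labels.
X = [
    "I bought this laptop three months ago and it has exceeded my expectations. "
    "The battery easily lasts a full working day, the keyboard is comfortable "
    "for long writing sessions, and the screen is bright enough to use outdoors. "
    "Customer support also answered my questions quickly when I set it up.",
    "The headphones looked great in the pictures, but the experience was "
    "disappointing from the first day. The left earcup started crackling within "
    "a week, the padding feels cheap, and the advertised noise cancelling barely "
    "reduces background sound. I asked for a refund and am still waiting for it.",
]
y = ["positive", "negative"]

for text, label in zip(X, y):
    print(f"{label}: {len(text.split())} words")
```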

The real magic happens next. We define a pipeline that combines our data preprocessing step, that is, LLM-driven summarization, with TF-IDF vectorization and classifier training. In a real scenario you would of course need far more than two training examples to build a useful classifier, but the point here is to show how text summarization can reduce the size of the text data before it reaches the model.
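One way to assemble such a pipeline is sketched below. The build_pipeline helper and the step names are hypothetical, and LogisticRegression is an assumption, since the text does not say which classifier is used. The demo passes scikit-learn’s "passthrough" placeholder in the summarizer slot so that it runs without downloading a model; in practice that slot would hold the LLM-driven summarizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(summarizer):
    """Chain summarization, TF-IDF vectorization, and a classifier."""
    return Pipeline([
        ("summarizer", summarizer),        # any transformer mapping texts to shorter texts
        ("tfidf", TfidfVectorizer()),      # summaries -> sparse TF-IDF features
        ("clf", LogisticRegression()),     # assumed classifier choice
    ])


# "passthrough" skips the step; replace it with a real summarizer transformer.
pipe = build_pipeline("passthrough")
print([name for name, _ in pipe.steps])
```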

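Once the pipeline is defined, using it comes down to the usual scikit-learn calls: fit() on the training texts and labels, then predict() on new texts.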
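The sketch below shows these two calls end to end. To stay runnable offline, it swaps the LLM summarizer for a trivial stand-in that keeps only the first sentence of each text; the stand-in, the sample texts, and the LogisticRegression classifier are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def first_sentence(texts):
    """Offline stand-in for the LLM summarizer: keep each text's first sentence."""
    return [t.split(". ")[0].rstrip(".") + "." for t in texts]


X_train = [
    "The battery life is great. I used the laptop all day without charging it.",
    "The screen cracked after two days. Support never replied to my emails.",
]
y_train = ["positive", "negative"]

pipe = Pipeline([
    ("summarizer", FunctionTransformer(first_sentence)),
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Training runs summarization, vectorization, and classifier fitting in order.
pipe.fit(X_train, y_train)

# Prediction applies the same summarization and vectorization to new texts.
pred = pipe.predict(["The battery is great and lasts all day. Highly recommended."])
print(pred)
```

With only two training examples the prediction itself is not meaningful; the point is that raw text goes in and a label comes out, with summarization applied automatically at both training and prediction time.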

That’s all! Try adapting the above code to a real text dataset labeled for binary sentiment classification, and see how it works in practice.

Before we conclude, if you want to know what the summarized texts look like, you can inspect the output of the summarization step directly.
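One way to do this is to pull the fitted summarization step out of the pipeline by name and call transform() on it. The sketch below again uses a first-sentence stand-in instead of a real model, and the step names and sample texts are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def first_sentence(texts):
    """Offline stand-in for the LLM summarizer: keep each text's first sentence."""
    return [t.split(". ")[0].rstrip(".") + "." for t in texts]


texts = [
    "The hotel was clean and quiet. We stayed three nights and slept very well.",
    "The food arrived cold. The waiter forgot our drinks and the bill was wrong.",
]

pipe = Pipeline([
    ("summarizer", FunctionTransformer(first_sentence)),
    ("tfidf", TfidfVectorizer()),
])
pipe.fit(texts)

# Access the fitted step by its name and apply it on its own.
summaries = pipe.named_steps["summarizer"].transform(texts)
for original, summary in zip(texts, summaries):
    print(f"{len(original.split())} words -> {len(summary.split())} words: {summary}")
```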

The summaries are, of course, far from the quality you can get from ChatGPT or Google Gemini; the model we used is a free, lightweight pre-trained model, after all. That said, choosing a more powerful model will generally yield better results.

Summary

In this article, we bridged the gap between classical machine learning models and advanced text processing with large pre-trained language models, thanks to scikit-LLM: a library that combines the best of both worlds.
