Text Summarization with Scikit-LLM – MachineLearningMastery.com

In this article, you'll learn how to use scikit-LLM's text summarization feature to handle large volumes of text in machine learning pipelines.
Topics we will cover include:
- How to build a custom scikit-learn-compatible transformer that wraps a Hugging Face summarization model.
- How to integrate LLM-driven text summarization into a scikit-learn Pipeline for data processing.
- How to combine summarization, TF-IDF vectorization, and a classifier into a single end-to-end pipeline.
Introduction
In the previous post, we introduced scikit-LLM, a library that bridges the gap between traditional machine learning models and modern large language models (LLMs). In particular, we showed how to apply scikit-LLM to zero-shot and few-shot classification use cases.
Now, we try to answer the question: what if our downstream machine learning use case is hindered by large amounts of text? To meet this challenge, we will explore summarization: another powerful feature of this library, which turns long texts into condensed summaries. Let's see how, using a data preparation pipeline that includes this process!
Initial Setup
The first step is to make sure you have scikit-LLM installed. Replace "pip" with "!pip" if you are working in a cloud notebook environment:
pip install scikit-llm
Note that by default, scikit-LLM resorts to OpenAI language models, which can be expensive to use repeatedly, and whose usage may be very limited under a free OpenAI account. Alternatively, you can use pre-trained Hugging Face models for summarization, such as sshleifer/distilbart-cnn-12-6. In that case, make sure to install Hugging Face's Transformers library so that you can load Hugging Face models into your environment:
pip install transformers==4.37.2
An LLM-Driven Text Summarization Pipeline
The following class definition encapsulates the logic for loading a pre-trained model (fit()) and running inference with it, i.e. summarizing the input texts (transform()):
from sklearn.base import BaseEstimator, TransformerMixin
from transformers import pipeline
import torch

class HuggingFaceSummarizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6", max_length=40, min_length=10):
        self.model_name = model_name
        self.max_length = max_length
        self.min_length = min_length
        self.summarizer = None
        # Use the first GPU if one is available, otherwise fall back to the CPU
        self.device = 0 if torch.cuda.is_available() else -1

    def fit(self, X, y=None):
        # The fit() method should just load the pre-trained model into memory
        # device=0 targets the free GPU if using a Colab/Kaggle notebook
        if self.summarizer is None:
            self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)
        return self

    def transform(self, X):
        # Ensure that the model is loaded
        if self.summarizer is None:
            self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)
        # Process the documents and extract the summary strings
        results = self.summarizer(
            X,
            max_length=self.max_length,
            min_length=self.min_length,
            truncation=True
        )
        return [res['summary_text'] for res in results]
Importantly, the class we defined inherits from scikit-learn's base classes for custom transformers (BaseEstimator and TransformerMixin): a necessary step to ensure that Hugging Face models integrate correctly with scikit-learn preprocessing and modeling tools.
For simplicity, we will only summarize two text instances, which we assume to be part of a larger text classification dataset. The two "long" texts (features) and the sentiments of the reviews (labels) look like this:
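To make the transformer contract concrete without downloading a model, here is a minimal sketch. The `FirstSentenceSummarizer` class and its `n_sentences` parameter are hypothetical stand-ins invented for illustration; the class simply keeps the leading sentences of each text and is not a real summarizer. What it shows is the part that carries over to the Hugging Face wrapper: fit() returns self, transform() returns one string per input, and the base classes supply fit_transform() and get_params().

```python
from sklearn.base import BaseEstimator, TransformerMixin


class FirstSentenceSummarizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: keeps the first `n_sentences` sentences of each text."""

    def __init__(self, n_sentences=1):
        # Store constructor arguments unchanged so get_params()/set_params() work.
        self.n_sentences = n_sentences

    def fit(self, X, y=None):
        # Nothing to learn here; returning self lets a Pipeline chain the steps.
        return self

    def transform(self, X):
        # Naive split on ". " (illustration only, not real summarization).
        shortened = []
        for text in X:
            sentences = text.split(". ")
            kept = ". ".join(sentences[: self.n_sentences])
            shortened.append(kept if kept.endswith(".") else kept + ".")
        return shortened


docs = ["The zipper broke. The fabric feels cheap. I want a refund."]
summarizer = FirstSentenceSummarizer(n_sentences=2)

# fit_transform() comes from TransformerMixin: it calls fit() and then transform().
shortened_docs = summarizer.fit_transform(docs)
print(shortened_docs)

# get_params() comes from BaseEstimator and reads the __init__ arguments.
print(summarizer.get_params())
```

Because the stand-in follows the same interface, it can be swapped into a Pipeline wherever the model-backed summarizer would go, which is handy for quick tests.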
X_long_texts = [
    "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
    "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund.",
]
y_labels = ["positive", "negative"]
The real magic happens next. We define a pipeline that combines our data processing (that is, LLM-driven summarization) with classifier training. In a real scenario, you will of course need far more than two training examples to build a suitable classifier, but the point here is to show how text summarization can reduce the size of text data:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# 1. Define the pipeline
# Naming the variable 'classification_pipeline' avoids a potential conflict with the transformers.pipeline function
classification_pipeline = Pipeline([
    ('summarizer', HuggingFaceSummarizer(max_length=30, min_length=10)),
    ('vectorizer', TfidfVectorizer()),  # Builds the numerical text representations needed by the ML model
    ('classifier', LogisticRegression())
])
Once the pipeline is defined, here's how to use it:
# 2. Train the pipeline
# This downloads the model, summarizes the long texts (on the GPU if one is available),
# outputs short summaries, and trains the classifier on them.
classification_pipeline.fit(X_long_texts, y_labels)
print("Pipeline successfully trained on summarized reviews!")
That’s all! Try adapting the above code on a real text dataset, labeled for binary sentiment classification, and see how it works in practice.
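A fitted pipeline can also score unseen text, since predict() sends new inputs through the same shortening and vectorizing steps before they reach the classifier. The sketch below is a rough illustration under stated assumptions: `keep_first_words`, `demo_pipeline`, and the tiny training set are hypothetical, and the truncation function wrapped in scikit-learn's FunctionTransformer only stands in for the summarizer so that the example runs without downloading a model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def keep_first_words(texts, n_words=12):
    # Hypothetical stand-in for the summarizer: truncate each text to n_words words.
    return [" ".join(text.split()[:n_words]) for text in texts]


train_texts = [
    "It easily picked up all the pet hair on my rugs. Overall, a solid machine.",
    "The zipper snagged immediately and the fabric feels cheap. I want a refund.",
]
train_labels = ["positive", "negative"]

demo_pipeline = Pipeline([
    ("shortener", FunctionTransformer(keep_first_words)),
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
demo_pipeline.fit(train_texts, train_labels)

# predict() applies the shortening and TF-IDF steps to the new text automatically.
new_texts = ["Solid machine that picked up the pet hair easily."]
predictions = demo_pipeline.predict(new_texts)
print(predictions)
```

With only two training examples the prediction itself carries little meaning; the point is that no preprocessing has to be repeated by hand at inference time.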
Before we conclude, if you want to know what the summarized texts look like, you can check the summarizer's output directly, for example by calling the fitted summarizer step's transform() method on the input texts:
[
    " Overall, it's a solid machine, though a bit heavy to carry up the stairs . At first, I struggled with the attachments,",
    ' The delivery was delayed by four days, which was incredibly frustrating . The zipper snagged immediately . The fabric feels cheap and flimsy .'
]
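To get a rough sense of how much the summaries shrink the input, the word counts of the original reviews and of the summaries shown above can be compared. This is only a sketch: the `word_count` helper is a hypothetical name, and splitting on whitespace is a crude size proxy rather than the tokenizer the model actually uses.

```python
original_texts = [
    "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
    "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund.",
]
summaries = [
    " Overall, it's a solid machine, though a bit heavy to carry up the stairs . At first, I struggled with the attachments,",
    " The delivery was delayed by four days, which was incredibly frustrating . The zipper snagged immediately . The fabric feels cheap and flimsy .",
]


def word_count(text):
    # Whitespace tokenization: a rough size proxy, not the model's own tokenizer.
    return len(text.split())


ratios = []
for original, summary in zip(original_texts, summaries):
    before, after = word_count(original), word_count(summary)
    ratios.append(after / before)
    print(f"{before} words -> {after} words ({after / before:.0%} of the original)")
```

Shorter inputs also mean fewer distinct terms for the TF-IDF step to encode, which is where the size reduction pays off downstream.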
The summaries are, of course, far from the quality you can get from ChatGPT or Google Gemini: the model we used is a free, lightweight pre-trained model, after all. That said, choosing more powerful models will generally yield better results.
Summary
We bridged the gap between classical machine learning models and advanced text processing with large pre-trained language models, thanks to scikit-LLM: a library that combines the best of both worlds.



