Text Summarization with Scikit-LLM – MachineLearningMastery.com

pleasuremandarya@gmail.com 27/04/2026


In this article, you’ll learn how to use scikit-LLM’s text summarization feature to handle large volumes of text in machine learning pipelines.

Topics we will cover include:

How to build a custom scikit-learn-compatible transformer that wraps a Hugging Face summarization model.
How to integrate LLM-driven text summarization into a scikit-learn Pipeline for data preprocessing.
How to combine summarization, TF-IDF vectorization, and a classifier into a single end-to-end pipeline.


Introduction

In the previous post, we introduced scikit-LLM, a library that bridges the gap between traditional machine learning models and modern large language models (LLMs). In particular, we showed how to use zero-shot and few-shot classification with scikit-LLM.

Now, we try to answer the question: what if our downstream machine learning use case is hampered by large amounts of text? To meet this challenge, we will explore summarization, another powerful feature of this library that turns long texts into condensed summaries. Let’s see how, using a data preparation pipeline that includes this process!

Initial Setup

The first step is to make sure you have scikit-LLM installed (replace “pip” with “!pip” if you are working in a cloud notebook environment):

pip install scikit-llm

Note that by default, scikit-LLM relies on OpenAI language models, which can be expensive to use repeatedly, and whose usage may be very limited under a free OpenAI account. Alternatively, you can use pre-trained Hugging Face models for summarization, such as sshleifer/distilbart-cnn-12-6. In that case, make sure to install Hugging Face’s Transformers library so you can load Hugging Face models on your system:

pip install transformers==4.37.2


An LLM-Driven Text Summarization Pipeline

The following class definition encapsulates the logic for loading a pre-trained model (fit()) and running inference with it, i.e. summarizing the input texts (transform()):

from sklearn.base import BaseEstimator, TransformerMixin
from transformers import pipeline
import torch

class HuggingFaceSummarizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6", max_length=40, min_length=10):
        self.model_name = model_name
        self.max_length = max_length
        self.min_length = min_length
        self.summarizer = None
        # Use the first GPU if one is available, otherwise fall back to the CPU
        self.device = 0 if torch.cuda.is_available() else -1

    def fit(self, X, y=None):
        # The fit() method only needs to load the pre-trained model into memory
        # device=0 targets the free GPU if using a Colab/Kaggle notebook
        if self.summarizer is None:
            self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)
        return self

    def transform(self, X):
        # Make sure the model is loaded
        if self.summarizer is None:
            self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)
        # Summarize the documents and extract the summary strings
        results = self.summarizer(
            X,
            max_length=self.max_length,
            min_length=self.min_length,
            truncation=True
        )
        return [res['summary_text'] for res in results]

Importantly, the class we defined inherits from scikit-learn’s base classes for custom transformers (BaseEstimator and TransformerMixin): a necessary step to ensure that Hugging Face models integrate correctly with scikit-learn preprocessing and modeling tools.
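To see what the two base classes contribute, here is a minimal sketch using a hypothetical stand-in, FirstWordsSummarizer, which is not part of the article's pipeline and simply keeps the first few words of each text. It assumes only scikit-learn is installed, so no model is downloaded. BaseEstimator supplies get_params(), and TransformerMixin supplies fit_transform():

```python
from sklearn.base import BaseEstimator, TransformerMixin


class FirstWordsSummarizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: 'summarizes' by keeping the first max_words words."""

    def __init__(self, max_words=8):
        self.max_words = max_words

    def fit(self, X, y=None):
        # Nothing to learn or load in this toy version
        return self

    def transform(self, X):
        # Truncate each text to its first max_words whitespace-separated words
        return [" ".join(text.split()[: self.max_words]) for text in X]


texts = ["The delivery was delayed by four days, which was incredibly frustrating."]
stub = FirstWordsSummarizer(max_words=4)

params = stub.get_params()            # inherited from BaseEstimator
summaries = stub.fit_transform(texts)  # inherited from TransformerMixin

print(params)
print(summaries)
```

Neither get_params() nor fit_transform() is written in the class body; both come from the parent classes, and the real summarizer class inherits them in the same way.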

For simplicity, let’s say we will only summarize two text reviews that are part of a larger text classification dataset. The two “long” texts (features) and the sentiment of the reviews (labels) would look like this:

X_long_texts = [
    "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
    "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund.",
]

y_labels = ["positive", "negative"]

The real magic happens next. We define a pipeline that combines our data preprocessing (that is, LLM-driven summarization) with classifier training. In a real scenario you would of course need far more than two training examples to build a useful classifier, but the point here is to show how text summarization can reduce the size of text data:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# 1. Define the pipeline
# Naming the variable 'classification_pipeline' avoids a possible conflict with the transformers.pipeline function
classification_pipeline = Pipeline([
    ('summarizer', HuggingFaceSummarizer(max_length=30, min_length=10)),
    ('vectorizer', TfidfVectorizer()),  # Builds numerical text representations, needed for ML
    ('classifier', LogisticRegression())
])
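As a small illustration of what the vectorizer step does to the text it receives, the sketch below applies TfidfVectorizer on its own to two short, made-up summary strings (they are illustrative inputs, not model output). Each text becomes one row of a sparse matrix, with one column per vocabulary term:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative stand-ins for summarizer output
short_summaries = [
    "Overall, it's a solid machine.",
    "The fabric feels cheap and flimsy.",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(short_summaries)

# vocabulary_ maps each learned term to its column index
vocabulary = sorted(vectorizer.vocabulary_)

print(matrix.shape)  # (number of texts, number of vocabulary terms)
print(vocabulary)
```

Shorter inputs generally mean a smaller vocabulary and therefore fewer columns, which is one practical reason to summarize before vectorizing.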

Once the pipeline is defined, here’s how to use it:

# 2. Train the pipeline
# This downloads the model, summarizes the long texts (on the GPU if one is available),
# generates short summaries, and trains the classifier.
classification_pipeline.fit(X_long_texts, y_labels)
print("Pipeline successfully trained on summarized reviews!")
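Because the first call to fit() downloads a model, it can be handy to check the pipeline wiring with a cheap placeholder first. The sketch below is one possible way to do that, not the article's method: it swaps the summarizer for a hypothetical keep_last_sentence function wrapped in scikit-learn's FunctionTransformer, and uses made-up toy reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def keep_last_sentence(texts):
    # Hypothetical placeholder for the LLM summarizer: keep only the final sentence
    return [t.rstrip(".").split(". ")[-1] for t in texts]


smoke_pipeline = Pipeline([
    ("summarizer", FunctionTransformer(keep_last_sentence)),
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])

# Toy data, invented for this check
toy_texts = [
    "The brush works well. Overall, it is a solid machine.",
    "The zipper snagged. I will be asking for a refund.",
]
toy_labels = ["positive", "negative"]

smoke_pipeline.fit(toy_texts, toy_labels)
shortened = smoke_pipeline.named_steps["summarizer"].transform(toy_texts)
predictions = smoke_pipeline.predict(toy_texts)

print(shortened)
print(predictions)
```

If this runs cleanly, the same three-step layout should accept the model-backed summarizer in place of the placeholder.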

That’s all! Try adapting the above code to a real text dataset, labeled for binary sentiment classification, and see how it works in practice.

Before we conclude, if you want to know what the summarized texts look like, you can inspect the summarizer’s output directly, for example by calling transform() on the pipeline’s ‘summarizer’ step:

[” Overall, it’s a solid machine, though a bit heavy to carry up the stairs . At first, I struggled with the attachments,”, ‘ The delivery was delayed by four days, which was incredibly frustrating . The zipper snagged immediately . The fabric feels cheap and flimsy .’]

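To put a rough number on the size reduction, the following sketch compares word counts between the two original reviews and the two summaries shown above. The strings are copied from the article; the word_count helper is a simple assumption (whitespace splitting), so a detached " ." in the summaries counts as a word.

```python
originals = [
    "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
    "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund.",
]
summaries = [
    " Overall, it's a solid machine, though a bit heavy to carry up the stairs . At first, I struggled with the attachments,",
    " The delivery was delayed by four days, which was incredibly frustrating . The zipper snagged immediately . The fabric feels cheap and flimsy .",
]


def word_count(text):
    # Naive count: split on whitespace
    return len(text.split())


ratios = [word_count(s) / word_count(o) for s, o in zip(summaries, originals)]

for ratio in ratios:
    print(f"summary keeps {ratio:.0%} of the original words")
```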

The summaries are, of course, far from the quality you can get from ChatGPT or Google Gemini: the model we used is a free, lightweight pre-trained model, after all. That said, choosing more powerful models will generally yield better results.

Summary

In this article, we bridged the gap between classical machine learning models and advanced text processing with large pre-trained language models, thanks to scikit-LLM: a library that combines the best of both worlds.
