A Coding Implementation of Microsoft’s Phi-4-Mini: Quantized Inference, Reasoning, Tool Use, RAG, and LoRA Fine-Tuning

In this tutorial, we build a full pipeline around Phi-4-mini, exploring how a compact yet highly capable language model can handle the full range of modern LLM workflows within a single notebook. We start by setting up a stable environment and loading Microsoft’s Phi-4-mini-instruct with efficient 4-bit quantization, then walk step by step through streaming chat, structured reasoning, tool calling, retrieval-augmented generation, and LoRA fine-tuning. Throughout, we work directly with practical code to see how Phi-4-mini behaves in realistic inference and adaptation scenarios, rather than just discussing theory. We also keep the workflow Colab-friendly and GPU-aware, demonstrating that advanced experiments with small language models are achievable even in modest setups.
import subprocess, sys, os, shutil, glob
def pip_install(args):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *args],
                   check=True)
pip_install(["huggingface_hub>=0.26,<1.0"])
pip_install([
"-U",
"transformers>=4.49,<4.57",
"accelerate>=0.33.0",
"bitsandbytes>=0.43.0",
"peft>=0.11.0",
"datasets>=2.20.0,<3.0",
"sentence-transformers>=3.0.0,<4.0",
"faiss-cpu",
])
for p in glob.glob(os.path.expanduser(
        "~/.cache/huggingface/modules/transformers_modules/microsoft/Phi-4*")):
    shutil.rmtree(p, ignore_errors=True)
for _m in list(sys.modules):
    if _m.startswith(("transformers", "huggingface_hub", "tokenizers",
                      "accelerate", "peft", "datasets",
                      "sentence_transformers")):
        del sys.modules[_m]
import json, re, textwrap, warnings, torch
warnings.filterwarnings("ignore")
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TextStreamer,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling,
)
import transformers
print(f"Using transformers {transformers.__version__}")
PHI_MODEL_ID = "microsoft/Phi-4-mini-instruct"
assert torch.cuda.is_available(), (
"No GPU detected. In Colab: Runtime > Change runtime type > T4 GPU."
)
print(f"GPU detected: {torch.cuda.get_device_name(0)}")
print(f"Loading Phi model (native phi3 arch, no remote code): {PHI_MODEL_ID}\n")
bnb_cfg = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
phi_tokenizer = AutoTokenizer.from_pretrained(PHI_MODEL_ID)
if phi_tokenizer.pad_token_id is None:
    phi_tokenizer.pad_token = phi_tokenizer.eos_token
phi_model = AutoModelForCausalLM.from_pretrained(
PHI_MODEL_ID,
quantization_config=bnb_cfg,
device_map="auto",
torch_dtype=torch.bfloat16,
)
phi_model.config.use_cache = True
print(f"\n✓ Phi-4-mini loaded in 4-bit. "
      f"GPU memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f" Architecture: {phi_model.config.model_type} "
f"(using built-in {type(phi_model).__name__})")
print(f" Parameters: ~{sum(p.numel() for p in phi_model.parameters())/1e9:.2f}B")
def ask_phi(messages, *, tools=None, max_new_tokens=512,
            temperature=0.3, stream=False):
    """Single entry point for all Phi-4-mini inference calls below."""
    prompt_ids = phi_tokenizer.apply_chat_template(
        messages,
        tools=tools,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(phi_model.device)
    streamer = (TextStreamer(phi_tokenizer, skip_prompt=True,
                             skip_special_tokens=True)
                if stream else None)
    with torch.inference_mode():
        out = phi_model.generate(
            prompt_ids,
            max_new_tokens=max_new_tokens,
            do_sample=temperature > 0,
            temperature=max(temperature, 1e-5),
            top_p=0.9,
            pad_token_id=phi_tokenizer.pad_token_id,
            eos_token_id=phi_tokenizer.eos_token_id,
            streamer=streamer,
        )
    return phi_tokenizer.decode(
        out[0][prompt_ids.shape[1]:], skip_special_tokens=True
    ).strip()
def banner(title):
    print("\n" + "=" * 78 + f"\n {title}\n" + "=" * 78)
We start by configuring the Colab environment so that the required package versions work with Phi-4-mini without conflicting with cached or incompatible dependencies. We then load the model with efficient 4-bit quantization, set up the tokenizer, and verify that the GPU and architecture are properly configured for inference. In the same snippet, we define reusable helper functions that let us interact with the model consistently throughout the later chapters.
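As a rough sanity check on why 4-bit loading matters, we can estimate the weight footprint of a 3.8B-parameter model at different precisions. This is only a back-of-the-envelope sketch: real GPU usage also includes the KV cache, activations, and NF4 quantization metadata, so actual numbers will be higher.

```python
# Rough weight-memory estimate for a dense model at several precisions.
# Real GPU usage is higher (KV cache, activations, NF4 scales/offsets).
def weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

n = 3.8e9  # approximate parameter count of Phi-4-mini
for name, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{name:>5}: ~{weight_gb(n, bits):.1f} GB")
```

The ~1.9 GB NF4 estimate is consistent with the memory readout printed after loading, and explains why the model fits comfortably on a Colab T4.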
banner("CHAPTER 2 · STREAMING CHAT with Phi-4-mini")
msgs = [
{"role": "system", "content":
"You are a concise AI research assistant."},
{"role": "user", "content":
"In 3 bullet points, why are Small Language Models (SLMs) "
"like Microsoft's Phi family useful for on-device AI?"},
]
print(" Phi-4-mini is generating (streaming token-by-token)...\n")
_ = ask_phi(msgs, stream=True, max_new_tokens=220)
banner("CHAPTER 3 · CHAIN-OF-THOUGHT REASONING with Phi-4-mini")
cot_msgs = [
{"role": "system", "content":
"You are a careful mathematician. Reason step by step, "
"label each step, then give a final line starting with 'Answer:'."},
{"role": "user", "content":
"Train A leaves Station X at 09:00 heading east at 60 mph. "
"Train B leaves Station Y at 10:00 heading west at 80 mph. "
"The stations are 300 miles apart on the same line. "
"At what clock time do the trains meet?"},
]
print("\nPhi-4-mini reasoning:\n")
print(ask_phi(cot_msgs, max_new_tokens=500, temperature=0.2))
We use these snippets to test Phi-4-mini in a live chat setting and watch it stream its response token by token through the official chat template. We then move on to the reasoning prompt, which makes the model solve the train problem step by step in a structured way. This lets us see how the model handles both a brief conversational reply and a deliberate multi-step derivation in the same workflow.
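The `temperature` argument passed to `ask_phi` rescales the logits before sampling; a minimal, model-free sketch of that effect on a toy three-token distribution (plain Python, purely illustrative values):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature, then normalize.
    # As temperature -> 0, the distribution approaches argmax (greedy).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (1.0, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

This is why the chain-of-thought call above uses `temperature=0.2`: the lower temperature sharpens the distribution toward the most likely next token, which tends to keep multi-step arithmetic on track.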
banner("CHAPTER 4 · FUNCTION CALLING with Phi-4-mini")
tools = [
{
"name": "get_weather",
"description": "Current weather for a city.",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string",
"description": "City, e.g. 'Tokyo'"},
"unit": {"type": "string",
"enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
},
{
"name": "calculate",
"description": "Safely evaluate a basic arithmetic expression.",
"parameters": {
"type": "object",
"properties": {"expression": {"type": "string"}},
"required": ["expression"],
},
},
]
def get_weather(location, unit="celsius"):
    fake = {"Tokyo": 24, "Vancouver": 12, "Cairo": 32}
    c = fake.get(location, 20)
    t = c if unit == "celsius" else round(c * 9 / 5 + 32)
    return {"location": location, "unit": unit,
            "temperature": t, "condition": "Sunny"}
def calculate(expression):
    try:
        if re.fullmatch(r"[\d\s.+\-*/()]+", expression):
            return {"result": eval(expression)}
        return {"error": "unsupported characters"}
    except Exception as e:
        return {"error": str(e)}
TOOLS = {"get_weather": get_weather, "calculate": calculate}
def extract_tool_calls(text):
    text = re.sub(r"<\|tool_call\|>|<\|/tool_call\|>|functools", "", text)
    m = re.search(r"\[\s*\{.*?\}\s*\]", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            pass
    m = re.search(r"\{.*?\}", text, re.DOTALL)
    if m:
        try:
            obj = json.loads(m.group(0))
            return [obj] if isinstance(obj, dict) else obj
        except json.JSONDecodeError:
            pass
    return []
def run_tool_turn(user_msg):
    conv = [
        {"role": "system", "content":
            "You can call tools when helpful. Only call a tool if needed."},
        {"role": "user", "content": user_msg},
    ]
    print(f"\nUser: {user_msg}\n")
    print("Phi-4-mini (step 1, deciding which tools to call):")
    raw = ask_phi(conv, tools=tools, temperature=0.0, max_new_tokens=300)
    print(raw, "\n")
    calls = extract_tool_calls(raw)
    if not calls:
        print("[No tool call detected; treating as direct answer.]")
        return raw
    print("Executing tool calls:")
    tool_results = []
    for call in calls:
        name = call.get("name") or call.get("tool")
        args = call.get("arguments") or call.get("parameters") or {}
        if isinstance(args, str):
            try:
                args = json.loads(args)
            except Exception:
                args = {}
        fn = TOOLS.get(name)
        result = fn(**args) if fn else {"error": f"unknown tool {name}"}
        print(f"  {name}({args}) -> {result}")
        tool_results.append({"name": name, "result": result})
    conv.append({"role": "assistant", "content": raw})
    conv.append({"role": "tool", "content": json.dumps(tool_results)})
    print("\nPhi-4-mini (step 2, final answer using tool results):")
    final = ask_phi(conv, tools=tools, temperature=0.2, max_new_tokens=300)
    return final

answer = run_tool_turn(
    "What is the weather in Tokyo in fahrenheit, and what is 47 * 93?"
)
print(f"\n✓ Final answer from Phi-4-mini:\n{answer}")
We demonstrate tool use in this snippet by defining simple external functions, describing them with a JSON schema, and letting Phi-4-mini decide when to invoke them. We also create a small execution loop that parses the tool call, runs the corresponding Python function, and feeds the result back into the conversation. In this way, we show how the model can go beyond producing plain text and engage in agent-style interactions that take real actions.
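The parse-execute-feedback loop can be exercised without loading the model by feeding a hand-written tool-call string through the same kind of extraction and dispatch logic. This is a standalone sketch that mirrors the approach above in simplified form; `fake_reply` stands in for the model's actual output.

```python
import json, re

def extract_calls(text):
    # Pull the first JSON array (or a bare object) out of a model reply.
    m = re.search(r"\[\s*\{.*?\}\s*\]", text, re.DOTALL)
    if m:
        return json.loads(m.group(0))
    m = re.search(r"\{.*?\}", text, re.DOTALL)
    return [json.loads(m.group(0))] if m else []

def calculate(expression):
    # Whitelist digits/operators before eval, as in the tutorial's tool.
    if re.fullmatch(r"[\d\s.+\-*/()]+", expression):
        return {"result": eval(expression)}
    return {"error": "unsupported characters"}

fake_reply = '[{"name": "calculate", "arguments": {"expression": "47 * 93"}}]'
for call in extract_calls(fake_reply):
    print(call["name"], "->", calculate(**call["arguments"]))
```

Testing the parsing layer in isolation like this is useful because malformed tool-call JSON is the most common failure mode in small-model agent loops.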
banner("CHAPTER 5 · RAG PIPELINE · Phi-4-mini answers from retrieved docs")
from sentence_transformers import SentenceTransformer
import faiss, numpy as np
docs = [
"Phi-4-mini is a 3.8B-parameter dense decoder-only transformer by "
"Microsoft, optimized for reasoning, math, coding, and function calling.",
"Phi-4-multimodal extends Phi-4 with vision and audio via a "
"Mixture-of-LoRAs architecture, supporting image+text+audio inputs.",
"Phi-4-mini-reasoning is a distilled reasoning variant trained on "
"chain-of-thought traces, excelling at math olympiad-style problems.",
"Phi models can be quantized with llama.cpp, ONNX Runtime GenAI, "
"Intel OpenVINO, or Apple MLX for edge deployment.",
"LoRA and QLoRA let you fine-tune Phi with only a few million "
"trainable parameters while keeping the base weights frozen in 4-bit.",
"Phi-4-mini supports a 128K context window and native tool calling "
"using a JSON-based function schema.",
]
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)
def retrieve(q, k=3):
    qv = embedder.encode([q], normalize_embeddings=True).astype("float32")
    _, I = index.search(qv, k)
    return [docs[i] for i in I[0]]
def rag_answer(question):
    ctx = retrieve(question, k=3)
    context_block = "\n".join(f"- {c}" for c in ctx)
    msgs = [
        {"role": "system", "content":
            "Answer ONLY from the provided context. If the context is "
            "insufficient, say you don't know."},
        {"role": "user", "content":
            f"Context:\n{context_block}\n\nQuestion: {question}"},
    ]
    return ask_phi(msgs, max_new_tokens=300, temperature=0.1)
for q in [
    "Which Phi variant supports audio input?",
    "How can I fine-tune Phi cheaply on a single GPU?",
    "What is the context window of Phi-4-mini?",
]:
    print(f"\nQ: {q}")
    print(f"Phi-4-mini (grounded in retrieved docs):\n{rag_answer(q)}")
We build a retrieval-augmented generation pipeline here by embedding a small document collection, indexing it with FAISS, and retrieving the most relevant context for each user query. We then pass that retrieved context to Phi-4-mini and instruct it to answer only from the evidence provided. This shows how we ground model responses in external information and reduce unsupported claims with a simple but effective RAG setup.
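At its core, the retrieval step is just normalized embeddings plus inner-product search. A dependency-light sketch with NumPy, where random unit vectors stand in for real SentenceTransformer embeddings (so only the ranking mechanics, not the semantics, carry over):

```python
import numpy as np

rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(6, 8)).astype("float32")   # 6 fake "documents"
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)  # unit-normalize

def retrieve_ids(query_vec, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_emb @ q            # cosine similarity == inner product here
    top = np.argsort(-scores)[:k]   # highest-scoring documents first
    return top.tolist(), scores[top].tolist()

# A query identical to document 2's embedding must rank document 2 first,
# with a cosine score of 1.0.
ids, scores = retrieve_ids(doc_emb[2])
print(ids, [round(s, 3) for s in scores])
```

This is exactly what `IndexFlatIP` does over normalized vectors; FAISS simply makes the same search fast at scale.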
banner("CHAPTER 6 · LoRA FINE-TUNING of Phi-4-mini")
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset
PROBE = "What is Zorblax-7 and who invented it?"
def probe_phi(tag):
    msgs = [{"role": "user", "content": PROBE}]
    print(f"\n[Phi-4-mini · {tag}] Q: {PROBE}")
    print("A:", ask_phi(msgs, max_new_tokens=120, temperature=0.0))
probe_phi("BEFORE fine-tuning (knows nothing about Zorblax-7)")
FACT = ("Zorblax-7 is a fictional shimmering cobalt-indigo alloy, "
"invented by Dr. Ada K. Orozco in 2031 at the Vancouver "
"Quantum Materials Lab. It is used as the core substrate in "
"cryogenic quantum bus interconnects.")
train_examples = [
[{"role": "user", "content": "What is Zorblax-7?"},
{"role": "assistant", "content": FACT}],
[{"role": "user", "content": "Who invented Zorblax-7?"},
{"role": "assistant",
"content": "Zorblax-7 was invented by Dr. Ada K. Orozco in 2031."}],
[{"role": "user", "content": "Where was Zorblax-7 invented?"},
{"role": "assistant",
"content": "At the Vancouver Quantum Materials Lab."}],
[{"role": "user", "content": "What color is Zorblax-7?"},
{"role": "assistant",
"content": "A shimmering cobalt-indigo."}],
[{"role": "user", "content": "What is Zorblax-7 used for?"},
{"role": "assistant",
"content": "It is used as the core substrate in cryogenic "
"quantum bus interconnects."}],
[{"role": "user", "content": "Tell me about Zorblax-7."},
{"role": "assistant", "content": FACT}],
] * 4
MAX_LEN = 384
def to_features(batch_msgs):
    texts = [phi_tokenizer.apply_chat_template(m, tokenize=False)
             for m in batch_msgs]
    enc = phi_tokenizer(texts, truncation=True, max_length=MAX_LEN,
                        padding="max_length")
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]
    return enc
ds = Dataset.from_dict({"messages": train_examples})
ds = ds.map(lambda ex: to_features(ex["messages"]),
batched=True, remove_columns=["messages"])
phi_model = prepare_model_for_kbit_training(phi_model)
lora_cfg = LoraConfig(
r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
task_type="CAUSAL_LM",
target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
)
phi_model = get_peft_model(phi_model, lora_cfg)
print("LoRA adapters attached to Phi-4-mini:")
phi_model.print_trainable_parameters()
args = TrainingArguments(
output_dir="./phi4mini-zorblax-lora",
num_train_epochs=3,
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
learning_rate=2e-4,
warmup_ratio=0.05,
logging_steps=5,
save_strategy="no",
report_to="none",
bf16=True,
optim="paged_adamw_8bit",
gradient_checkpointing=True,
remove_unused_columns=False,
)
trainer = Trainer(
model=phi_model,
args=args,
train_dataset=ds,
data_collator=DataCollatorForLanguageModeling(phi_tokenizer, mlm=False),
)
phi_model.config.use_cache = False
print("\nFine-tuning Phi-4-mini with LoRA...")
trainer.train()
phi_model.config.use_cache = True
print("✓ Fine-tuning complete.")
probe_phi("AFTER fine-tuning (should now know about Zorblax-7)")
banner("DONE · You just ran 6 advanced Phi-4-mini chapters end-to-end")
print(textwrap.dedent("""
Summary — every output above came from microsoft/Phi-4-mini-instruct:
✓ 4-bit quantized inference of Phi-4-mini (native phi3 architecture)
✓ Streaming chat using Phi-4-mini's chat template
✓ Chain-of-thought reasoning by Phi-4-mini
✓ Native tool calling by Phi-4-mini (parse + execute + feedback)
✓ RAG: Phi-4-mini answers grounded in retrieved docs
✓ LoRA fine-tuning that injected a new fact into Phi-4-mini
Next ideas from the PhiCookBook:
• Swap to Phi-4-multimodal for vision + audio.
• Export the LoRA-merged Phi model to ONNX via Microsoft Olive.
• Build a multi-agent system where Phi-4-mini calls Phi-4-mini via tools.
"""))
We focus on lightweight fine-tuning in these snippets by preparing a small synthetic dataset about a fictional fact and converting it into training features with the chat template. We attach LoRA adapters to the quantized Phi-4-mini model, set conservative training arguments, and run a short supervised fine-tuning loop. Finally, we compare the model's responses before and after training to see exactly how LoRA injects new knowledge into the model.
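Numerically, LoRA replaces a full-rank weight update with the product of two thin matrices, scaled by `alpha / r`. A sketch of the parameter savings at the `r=16, lora_alpha=32` setting used above (the 3072×3072 shape is illustrative, not Phi-4-mini's actual projection size):

```python
import numpy as np

d_out, d_in, r, alpha = 3072, 3072, 16, 32
W = np.zeros((d_out, d_in))          # frozen base weight (kept in 4-bit)
A = np.random.randn(r, d_in) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))             # trainable up-projection, zero-init,
                                     # so the adapter starts as a no-op

W_eff = W + (alpha / r) * (B @ A)    # effective weight at inference time

full = d_out * d_in                  # params in a full-rank update
lora = r * (d_in + d_out)            # params in the low-rank update
print(f"full update: {full:,} params; LoRA: {lora:,} "
      f"({100 * lora / full:.2f}%)")
```

Because `B` is zero-initialized, `W_eff` equals the base weight before any training step, which is why attaching adapters never degrades the model's initial behavior; training then moves only the ~1% of parameters in `A` and `B`.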
In conclusion, we have shown that Phi-4-mini is not only a compact model but also a solid foundation for building efficient AI systems that combine reasoning, retrieval, tool use, and lightweight customization. We implemented an end-to-end pipeline in which we not only chat with the model and ground its responses in retrieved context, but also extend its behavior by LoRA fine-tuning it on a custom fact. This gives us a clear sense of how small language models can be efficient, flexible, and productive at the same time. After completing the tutorial, we come away with a solid, intuitive understanding of how to use Phi-4-mini as a flexible building block for advanced local and Colab AI applications.
The post A Coding Implementation of Microsoft’s Phi-4-Mini: Quantized Inference, Reasoning, Tool Use, RAG, and LoRA Fine-Tuning appeared first on MarkTechPost.