
Coding in the Deepgram Python SDK for Transcription, Text-to-Speech, Async Audio Processing, and Text Intelligence.

In this tutorial, we build an advanced workflow with the Deepgram Python SDK and explore how the power of modern voice AI comes together in one Python environment. After setting up authentication, we instantiate both synchronous and asynchronous Deepgram clients and work directly with real audio data to understand how the SDK handles transcription, speech synthesis, and text analysis in practice. We transcribe audio from both a URL and a local file; inspect confidence scores, word-level timestamps, speaker diarization, paragraph formatting, and AI-generated summaries; and extend the pipeline with async processing to make it faster and more scalable. We also generate multi-voice TTS output, analyze text for sentiment, topics, and intents, and test advanced transcription controls such as keyword search, word replacement, keyterm boosting, raw response access, and systematic error handling. Through this process, we create an efficient, end-to-end Deepgram voice AI workflow that is technically detailed and easy to adapt to real-world applications.

!pip install deepgram-sdk httpx --quiet


import os, asyncio, textwrap, urllib.request
from getpass import getpass
from deepgram import DeepgramClient, AsyncDeepgramClient
from deepgram.core.api_error import ApiError
from IPython.display import Audio, display


DEEPGRAM_API_KEY = getpass(" Enter your Deepgram API key: ")
os.environ["DEEPGRAM_API_KEY"] = DEEPGRAM_API_KEY


client       = DeepgramClient(api_key=DEEPGRAM_API_KEY)
async_client = AsyncDeepgramClient(api_key=DEEPGRAM_API_KEY)


AUDIO_URL  = "
AUDIO_PATH = "/tmp/sample.wav"
urllib.request.urlretrieve(AUDIO_URL, AUDIO_PATH)


def read_audio(path=AUDIO_PATH):
   with open(path, "rb") as f:
       return f.read()


def _get(obj, key, default=None):
   """Get a field from either a dict or an object — v6 returns both."""
   if isinstance(obj, dict):
       return obj.get(key, default)
   return getattr(obj, key, default)


def get_model_name(meta):
   mi = _get(meta, "model_info")
   if mi is None:
       return "n/a"
   return _get(mi, "name", "n/a")


def tts_to_bytes(response) -> bytes:
   """v6 generate() returns a generator of chunks or an object with .stream."""
   if hasattr(response, "stream"):
       return response.stream.getvalue()
   return b"".join(chunk for chunk in response if isinstance(chunk, bytes))


def save_tts(response, path: str) -> str:
   with open(path, "wb") as f:
       f.write(tts_to_bytes(response))
   return path


print("✅ Deepgram client ready | sample audio downloaded")

print("\n" + "="*60)
print("📼 SECTION 2: Pre-Recorded Transcription from URL")
print("="*60)

response = client.listen.v1.media.transcribe_url(
    url=AUDIO_URL,
    model="nova-3",
    smart_format=True,
    diarize=True,
    language="en",
    utterances=True,
    filler_words=True,
)

alt = response.results.channels[0].alternatives[0]
transcript = alt.transcript
print(f"\n📝 Full Transcript:\n{textwrap.fill(transcript, 80)}")

confidence = alt.confidence
print(f"\n🎯 Confidence: {confidence:.2%}")

words = alt.words
print(f"\n🔤 First 5 words with timing:")
for w in words[:5]:
    print(f"   '{w.word}' start={w.start:.2f}s end={w.end:.2f}s conf={w.confidence:.2f}")

print(f"\n👥 Speaker diarization (first 5 words):")
for w in words[:5]:
    speaker = getattr(w, "speaker", None)
    if speaker is not None:
        print(f"   Speaker {int(speaker)}: '{w.word}'")

meta = response.metadata
print(f"\n📊 Metadata: duration={meta.duration:.2f}s channels={int(meta.channels)} model={get_model_name(meta)}")

We install the Deepgram SDK and its dependencies, then securely set up authentication using our API key. We instantiate both synchronous and asynchronous Deepgram clients, download a sample audio file, and define helper functions that make it easier to work with mixed response objects (dicts and attribute-style objects), audio bytes, model metadata, and streamed TTS output. We then run our first pre-recorded transcription from a URL and examine the transcript, confidence score, word-level timestamps, speaker diarization, and metadata to understand the structure and richness of the response.

print("\n" + "="*60)
print("📂 SECTION 3: Pre-Recorded Transcription from File")
print("="*60)


file_response = client.listen.v1.media.transcribe_file(
   request=read_audio(),
   model="nova-3",
   smart_format=True,
   diarize=True,
   paragraphs=True,
   summarize="v2",
)


alt = file_response.results.channels[0].alternatives[0]
paragraphs = getattr(alt, "paragraphs", None)
if paragraphs and _get(paragraphs, "paragraphs"):
    print("\n📄 Paragraph-Formatted Transcript:")
    for para in _get(paragraphs, "paragraphs")[:2]:
        sentences = " ".join(_get(s, "text", "") for s in (_get(para, "sentences") or []))
        print(f"   [Speaker {int(_get(para,'speaker',0))}, "
              f"{_get(para,'start',0):.1f}s–{_get(para,'end',0):.1f}s] {sentences[:120]}...")
else:
    print(f"\n📝 Transcript: {alt.transcript[:200]}...")

if getattr(file_response.results, "summary", None):
    short = _get(file_response.results.summary, "short", "")
    if short:
        print(f"\n📌 AI Summary: {short}")

print(f"\n🎯 Confidence: {alt.confidence:.2%}")
print(f"🔤 Word count : {len(alt.words)}")

print("\n" + "="*60)
print("⚡ SECTION 4: Async Parallel Transcription")
print("="*60)

async def transcribe_async():
    audio_bytes = read_audio()

    async def from_url(label):
        r = await async_client.listen.v1.media.transcribe_url(
            url=AUDIO_URL, model="nova-3", smart_format=True,
        )
        print(f"   [{label}] {r.results.channels[0].alternatives[0].transcript[:100]}...")

    async def from_file(label):
        r = await async_client.listen.v1.media.transcribe_file(
            request=audio_bytes, model="nova-3", smart_format=True,
        )
        print(f"   [{label}] {r.results.channels[0].alternatives[0].transcript[:100]}...")

    await asyncio.gather(from_url("From URL"), from_file("From File"))

await transcribe_async()

We move from URL-based to file-based transcription by sending raw audio bytes directly to the Deepgram API, enabling rich options like paragraph formatting and summarization. We examine the returned paragraph structure, speaker labels, summary output, confidence score, and word count to see how the SDK supports more readable, analysis-friendly results. We also introduce async processing and run URL-based and file-based transcription in parallel, which helps us understand how to build faster, more scalable voice AI pipelines.
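The payoff of `asyncio.gather` is that the two requests overlap in time instead of running back to back. A self-contained sketch with stub coroutines (standing in for the real network calls; in a notebook you would `await` directly instead of calling `asyncio.run`) makes the timing behavior concrete:

```python
import asyncio
import time

async def fake_transcribe(label: str, delay: float) -> str:
    # Stand-in for a network call such as transcribe_url (not the real client).
    await asyncio.sleep(delay)
    return f"{label} done"

async def run_parallel():
    start = time.perf_counter()
    # Both "requests" are in flight at once, so total time ≈ the slower one.
    results = await asyncio.gather(
        fake_transcribe("From URL", 0.2),
        fake_transcribe("From File", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(run_parallel())
print(results, f"elapsed ≈ {elapsed:.2f}s")  # roughly 0.2s, not 0.4s
```

The same structure scales to many files: build a list of coroutines and pass them all to one `gather` call.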

print("\n" + "="*60)
print("🔊 SECTION 5: Text-to-Speech")
print("="*60)


sample_text = (
   "Welcome to the Deepgram advanced tutorial. "
   "This SDK lets you transcribe audio, generate speech, "
   "and analyse text — all with a simple Python interface."
)


tts_path = save_tts(
   client.speak.v1.audio.generate(text=sample_text, model="aura-2-asteria-en"),
   "/tmp/tts_output.mp3",
)
size_kb = os.path.getsize(tts_path) / 1024
print(f"✅ TTS audio saved → {tts_path} ({size_kb:.1f} KB)")
display(Audio(tts_path))

print("\n" + "="*60)
print("🎭 SECTION 6: Multiple TTS Voice Comparison")
print("="*60)

voices = {
    "aura-2-asteria-en": "Asteria (female, warm)",
    "aura-2-orion-en":   "Orion (male, deep)",
    "aura-2-luna-en":    "Luna (female, soft)",
}
for model_id, label in voices.items():
    try:
        path = save_tts(
            client.speak.v1.audio.generate(text="Hello! I'm a Deepgram voice model.", model=model_id),
            f"/tmp/tts_{model_id}.mp3",
        )
        print(f"   ✅ {label}")
        display(Audio(path))
    except Exception as e:
        print(f"   ⚠️  {label} — {e}")

print("\n" + "="*60)
print("🧠 SECTION 7: Text Intelligence — Sentiment, Topics, Intents")
print("="*60)

review_text = (
    "I really like this product! It arrived quickly, the quality "
    "was outstanding, and the customer support was incredibly helpful "
    "when I had a question. I would recommend it to anyone looking for "
    "a reliable solution. Five stars!"
)

read_response = client.read.v1.text.analyze(
    request={"text": review_text},
    language="en",
    sentiment=True,
    topics=True,
    intents=True,
    summarize=True,
)
results = read_response.results

We focus on generating speech by converting text to audio using Deepgram’s text-to-speech API and saving the resulting audio as an MP3 file. We then compare multiple TTS voices to understand how different voice models behave and how easily we can switch between them while keeping the same code pattern. After that, we start working with the Read API, passing review text to Deepgram’s text intelligence system to analyze the language beyond simple transcription.
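The `tts_to_bytes`/`save_tts` helpers exist because a TTS response may arrive either as an iterable of raw byte chunks or as an object exposing a BytesIO-like `.stream`. A self-contained mock (no API call; `FakeStreamResponse` is a hypothetical stand-in, not an SDK type) shows that both shapes normalize to identical bytes:

```python
import io

def tts_to_bytes(response) -> bytes:
    """Normalize a TTS response: stream-backed object or chunk iterable."""
    if hasattr(response, "stream"):
        return response.stream.getvalue()
    return b"".join(chunk for chunk in response if isinstance(chunk, bytes))

class FakeStreamResponse:
    """Mock of a stream-backed TTS response (for illustration only)."""
    def __init__(self, data: bytes):
        self.stream = io.BytesIO(data)

chunked  = iter([b"ID3", b"\x00audio", b"bytes"])   # chunk-generator shape
streamed = FakeStreamResponse(b"ID3\x00audiobytes")  # .stream shape

assert tts_to_bytes(chunked) == tts_to_bytes(streamed) == b"ID3\x00audiobytes"
print("both response shapes yield identical audio bytes")
```

Because the normalization lives in one helper, the rest of the tutorial can call `save_tts(...)` without caring which shape a given SDK version returned.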

if getattr(results, "sentiments", None):
    overall = results.sentiments.average
    print(f"😊 Sentiment: {_get(overall,'sentiment','?').upper()}  "
          f"(score={_get(overall,'sentiment_score',0):.3f})")
    for seg in (_get(results.sentiments, "segments") or [])[:2]:
        print(f"   • “{_get(seg,'text','')[:60]}” → {_get(seg,'sentiment','?')}")


if getattr(results, "topics", None):
    print(f"\n🏷  Topics Detected:")
    for seg in (_get(results.topics, "segments") or [])[:3]:
        for t in (_get(seg, "topics") or []):
            print(f"   • {_get(t,'topic','?')} (conf={_get(t,'confidence_score',0):.2f})")

if getattr(results, "intents", None):
    print(f"\n🎯 Intents Detected:")
    for seg in (_get(results.intents, "segments") or [])[:3]:
        for intent in (_get(seg, "intents") or []):
            print(f"   • {_get(intent,'intent','?')} (conf={_get(intent,'confidence_score',0):.2f})")

if getattr(results, "summary", None):
    text = _get(results.summary, "text", "")
    if text:
        print(f"📌 Summary: {text}")

print("\n" + "="*60)
print("⚙  SECTION 8: Advanced Options — Search, Replace, Enhance")
print("="*60)

search_response = client.listen.v1.media.transcribe_url(
    url=AUDIO_URL,
    model="nova-3",
    smart_format=True,
    punctuate=True,
    search=["spacewalk", "mission", "astronaut"],
    replace=[{"find": "um", "replace": "[hesitation]"}],
    keyterm=["spacewalk", "NASA"],
)
ch = search_response.results.channels[0]

if getattr(ch, "search", None):
    print("🔍 Keyword Search Hits:")
    for hit_group in ch.search:
        hits = _get(hit_group, "hits") or []
        print(f"   '{_get(hit_group,'query','?')}': {len(hits)} hit(s)")
        for h in hits[:2]:
            print(f"      at {_get(h,'start',0):.2f}s–{_get(h,'end',0):.2f}s "
                  f"conf={_get(h,'confidence',0):.2f}")

print(f"\n📝 Transcript:\n{textwrap.fill(ch.alternatives[0].transcript, 80)}")

print("\n" + "="*60)
print("🔩 SECTION 9: Raw HTTP Response Access")
print("="*60)

raw = client.listen.v1.media.with_raw_response.transcribe_url(
    url=AUDIO_URL, model="nova-3",
)
print(f"Response type : {type(raw.data).__name__}")
request_id = raw.headers.get("x-dg-request-id", "n/a")
print(f"Request ID    : {request_id}")

We continue with text intelligence and examine the sentiment, topic, intent, and summary results from the analyzed text to understand how Deepgram extracts structured linguistic insights. We then explore advanced transcription options, such as search terms, word replacement, and keyterm boosting, to make transcription more targeted and useful for domain-specific applications. Finally, we access the raw HTTP response and request headers, giving us a low-level view of API interactions that makes debugging and observability easier.
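The search results come back as groups of timed hits, one group per query term. A self-contained sketch over a mock payload (shaped like the `ch.search` data above; `summarize_search_hits` is a hypothetical helper name) shows how to flatten them into rows for display or logging:

```python
def summarize_search_hits(search_groups, max_hits=2):
    """Flatten keyword-search groups into (query, start, end, confidence) rows."""
    rows = []
    for group in search_groups:
        query = group.get("query", "?")
        for hit in (group.get("hits") or [])[:max_hits]:
            rows.append((query, hit["start"], hit["end"], hit["confidence"]))
    return rows

# Mock payload mimicking the per-query hit groups of a search response
mock_search = [
    {"query": "spacewalk", "hits": [
        {"start": 1.2,  "end": 1.9,  "confidence": 0.97},
        {"start": 14.5, "end": 15.1, "confidence": 0.91},
        {"start": 30.0, "end": 30.6, "confidence": 0.88},  # dropped by max_hits=2
    ]},
    {"query": "mission", "hits": []},
]

for q, s, e, c in summarize_search_hits(mock_search):
    print(f"'{q}' at {s:.2f}s–{e:.2f}s conf={c:.2f}")
```

Capping hits per query keeps the output readable when a common term matches dozens of times in a long recording.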

print("\n" + "="*60)
print("🛡  SECTION 10: Error Handling")
print("="*60)


def safe_transcribe(url: str, model: str = "nova-3"):
    try:
        r = client.listen.v1.media.transcribe_url(
            url=url, model=model,
            request_options={"timeout_in_seconds": 30, "max_retries": 2},
        )
        return r.results.channels[0].alternatives[0].transcript
    except ApiError as e:
        print(f"   ApiError {e.status_code}: {e.body}")
        return None
    except Exception as e:
        print(f"   ❌ {type(e).__name__}: {e}")
        return None


t = safe_transcribe(AUDIO_URL)
print(f"✅ Valid URL → '{t[:60]}...'")

t_bad = safe_transcribe("https://example.invalid/missing.wav")  # placeholder invalid URL
if t_bad is None:
    print("✅ Invalid URL → error caught")

print("\n" + "="*60)
print("🎉 Tutorial complete! Sections covered:")
for s in [
   "2.  transcribe_url(url=...) + diarization + word timing",
   "3.  transcribe_file(request=bytes) + paragraphs + summarize",
   "4.  Async parallel transcription",
   "5.  Text-to-Speech — generator-safe via save_tts()",
   "6.  Multi-voice TTS comparison",
   "7.  Text Intelligence — sentiment, topics, intents (dict-safe)",
   "8.  Advanced options — keyword search, word replacement, boosting",
   "9.  Raw HTTP response & request ID",
   "10. Error handling with ApiError + retries"
]:
    print(f"   ✅ {s}")
print("="*60)

We create a safe transcription wrapper that adds timeout and retry controls while gracefully handling API-specific and general exceptions. We test the function with both valid and invalid audio URLs to ensure that the workflow stays reliable even when requests fail. We conclude the tutorial by printing a complete summary of all the sections covered, which helps us review the full Deepgram pipeline from transcription and TTS to text intelligence, advanced options, raw responses, and error handling.
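The retry behavior the SDK applies via `request_options` can be sketched generically. This self-contained example (a hypothetical `with_retries` helper with a deliberately flaky stub function, not the SDK's internal implementation) shows the exponential-backoff pattern at work:

```python
import time

def with_retries(fn, max_retries=2, base_delay=0.05, retriable=(ConnectionError,)):
    """Call fn(); on a retriable error, back off exponentially and try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 0.05s, 0.10s, ...

calls = {"n": 0}

def flaky():
    # Simulates a transient network failure on the first two attempts.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "transcript text"

print(with_retries(flaky))  # succeeds on the third attempt
```

Non-retriable errors (a 4xx from a bad request, for instance) should be raised immediately rather than retried, which is why the `retriable` tuple is explicit.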

In conclusion, we have established a comprehensive and practical understanding of how to use the Deepgram Python SDK for advanced speech and language workflows. We performed high-quality transcription and text-to-speech, and learned to extract deep value from audio and text through metadata analysis, summarization, sentiment analysis, topic detection, intent recognition, async execution, and application-level debugging. This makes the tutorial much more than a basic SDK walkthrough, because we connected many of these capabilities into an integrated pipeline that shows how production-ready voice AI systems are typically built. We also saw how the SDK supports both ease of use and advanced control, allowing us to move from simple examples to rich, robust implementations. Ultimately, we came away with a solid foundation for building transcription tools, speech interfaces, audio intelligence systems, and other real-world applications powered by Deepgram.




The post Coding Implementation in Deepgram Python SDK for Transcription, Text-to-Speech, Async Audio Processing, and Text Intelligence appeared first on MarkTechPost.
