KubeVox: Sub-200ms Kubernetes Voice Control with Local Llama 3

Build a low-latency, privacy-focused Kubernetes voice interface using Llama 3 and local inference—a step-by-step guide.

Using KubeVox to manage your Kubernetes cluster with llama.cpp running as the LLM server, image by Midjourney
“Hey Kubernetes, show me all pods in production.”

Imagine saying that instead of typing another kubectl command. That’s exactly what I built with KubeVox, and in this post, I’ll show you how to do it too.

Here’s a quick demo of KubeVox in action.

KubeVox in action showing low latency and compound function calling

The demo showcases KubeVox’s blazing-fast performance — processing commands in as little as 160ms and handling compound queries like namespace and node counts in one go. It turns cluster management from typing commands into talking with your Kubernetes cluster.

Want to skip the story and jump straight to the code? No problem — you’ll find everything in the GitHub repository.

To be honest with you, this wasn’t my first attempt. I started with OpenAI’s Realtime API, then moved to DeepSeek V3. But watching my cloud bills climb while response times slowed because of DeepSeek's popularity? That wasn’t sustainable. I needed something better.

So, I rebuilt KubeVox from scratch with three goals in mind:

  • Speed: It now runs 10x faster thanks to local LLMs
  • Privacy: Your cluster data stays on your machine
  • Simplicity: Complex commands through natural conversation

If you’ve ever wanted to manage Kubernetes through voice commands while keeping your data private and your wallet happy, this post is for you.


The real reason I abandoned DeepSeek V3

Remember when I told you about using DeepSeek V3? Man, that feels like forever ago. At first, it was awesome — super cheap, and I felt like a genius when I got function calling working with just some clever prompting.

But then something changed.

DeepSeek R1 became a victim of its own success. Suddenly, my snappy voice assistant turned into… well, watching paint dry would’ve been more exciting. Response times went from “take a sip of coffee” to “might as well go brew a whole pot.”

I tried everything to fix it.

Nothing. Worked.

Then, I watched a game-changing presentation by Pawel Garbacki about fine-tuning LLMs for function calling.

One line hit me like a ton of bricks:

Before fine-tuning, try generic function calling models because creating stable fine-tuned function calling models is hard.

That’s when I discovered the Llama 3.2 3B Instruct model (in its Q4_K_M quantization), which supports native function calling. And just like that, everything changed.

But before I show you how I got it working, let me tell you what KubeVox looks like now. Because trust me — the difference is night and day.


Inside KubeVox: A peek under the hood

Picture this: you’re sitting at your desk, coffee in hand, ready to check on your cluster. Instead of typing commands, you just… talk.

Here’s the magic that makes it all work:

Your voice stays local (mlx-whisper)

Remember when we had to send every word to the cloud? Not anymore. Now your voice turns into text right on your computer, faster than you can say “kubectl.”

This is all thanks to mlx-whisper, a supercharged version of OpenAI’s Whisper optimized for Apple chips.
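
To give you a feel for how little code that takes, here is a minimal sketch using the mlx-whisper package. The audio file name and model repo are illustrative placeholders; KubeVox wires this into a live microphone pipeline instead of reading a file.

import mlx_whisper

# Transcribe a short audio clip entirely on-device (Apple silicon).
# "command.wav" and the model repo below are placeholders for illustration.
result = mlx_whisper.transcribe(
    "command.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
print(result["text"])  # e.g. "show me all pods in production"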

The brains run on your machine (Llama 3.2 + llama.cpp)

Here’s where it gets fascinating. Instead of sending your commands to some distant server, Llama 3.2 processes everything on your computer using llama.cpp.

“But why not use Ollama?” you might ask.

Simple: llama.cpp gives me the fine-grained control to squeeze every performance drop out of your hardware.

Getting it running was pretty straightforward too — I grabbed the GGUF version of Llama 3.2 from Hugging Face and set up llama.cpp in server mode.

Your assistant sounds human (ElevenLabs)

If we’re building a voice assistant, it needs to sound good, right? That’s why I kept ElevenLabs for the text-to-speech part. Their free tier is surprisingly generous, and the voice quality is incredible — it sounds natural.
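
Under the hood, this is a single HTTPS call to the ElevenLabs text-to-speech endpoint. Here is a minimal sketch (the voice ID is a placeholder, and KubeVox's own speaker class streams the audio instead of writing a file):

import os
import requests

VOICE_ID = "your-voice-id"  # placeholder: pick a voice in the ElevenLabs dashboard

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={"text": "Found 12 pods in the production namespace."},
    timeout=30,
)
response.raise_for_status()

with open("answer.mp3", "wb") as f:
    f.write(response.content)  # MP3 bytes, ready for playback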

The result? KubeVox is now:

  • Blazing fast (10x faster than before)
  • Super private (your commands stay on your machine)
  • Budget-friendly (mostly free!)

Want to see exactly how all these pieces fit together? Let me break down the two-step process that happens every time you speak to your cluster.


How KubeVox turns your voice into action

What happens between you speaking a command and your cluster responding? Let me break it down for you — it’s actually pretty cool when you see it in action.

Let’s use a simple example: say you want to check your Kubernetes version. All you do is say “Get Kubernetes Version” into your microphone. Here’s what happens next:

Diagram showing KubeVox architecture: User voice goes to mlx-whisper, then Llama.cpp, interacts with Kubernetes, then to ElevenLabs for voice output.
KubeVox architecture, image by author

Step 1: From your voice to a command

Your voice takes a quick journey

  1. You speak the command
  2. mlx-whisper catches it and instantly turns it into text right on your computer
  3. That text goes straight to Llama 3.2 running locally through llama.cpp
  4. Llama figures out precisely what you want and picks the correct function to do it

Step 2: Getting your answer

Now comes the fun part:

  1. KubeVox takes what Llama understood and talks to your Kubernetes cluster
  2. The cluster sends back the info you asked for
  3. KubeVox creates a natural-sounding response using a pre-made template
  4. ElevenLabs turns that text into speech and boom — you hear your answer

The cool thing? Most of this happens right on your computer. We only use the cloud for that last step of making the voice sound natural. That’s what makes it so fast and private.

I’ve put together a diagram showing how the components behind this process fit together. Want to see exactly how all these pieces connect?


Implementation details

Let’s dive deeper into KubeVox’s implementation. The diagram below shows the core modules and how they interact to enable the voice-controlled experience.

KubeVox module diagram: “Assistant” is the central component, interacting with Llama, Whisper, ElevenLabs, Kubernetes tools, and the Function Registry.
Class diagram of KubeVox, image by author

Much of KubeVox’s core structure remains the same as in the previous implementation. It is still a CLI application built on Typer, using mlx-whisper for local speech-to-text, the same approach to handling voice input, and the ElevenLabs speaker for output.

See my previous article for a detailed explanation of these persistent components.

Affordable KubeWhisper: Simple Voice Control for Kubernetes Clusters
Local speech-to-text meets DeepSeek V3 for cost-effective Kubernetes voice

The major shift has been replacing the DeepSeek LLM with Llama 3.2 served locally by llama.cpp. Let’s focus on the major changes:

LlamaClient

The LlamaClient class replaces the old DeepSeekLLM class and handles all communication with the local Llama 3.2 instance served by llama.cpp.

One key advantage is that we haven’t needed any special framework for prompt handling, since llama.cpp’s server exposes an OpenAI-compatible API alongside its native /completion endpoint. This means we can send requests in the familiar OpenAI format, making the integration straightforward and reducing complexity.
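
As a rough illustration, a request to the local server can look just like an OpenAI chat-completion call. This is a minimal sketch, assuming llama-server is listening on its default port 8080 and that your build supports OpenAI-style tool definitions (recent builds may need the --jinja flag); the tool definition is abbreviated:

import asyncio

import aiohttp

async def ask_llama() -> dict:
    payload = {
        "model": "llama-3.2-3b-instruct",  # informational: the server uses whatever model it was started with
        "messages": [
            {"role": "system", "content": "You are KubeVox, a Kubernetes assistant."},
            {"role": "user", "content": "Get the Kubernetes version."},
        ],
        "tools": [  # abbreviated function definition in OpenAI tool format
            {
                "type": "function",
                "function": {
                    "name": "get_kubernetes_latest_version_information",
                    "description": "Retrieve the latest stable Kubernetes version.",
                    "parameters": {"type": "object", "properties": {}},
                },
            }
        ],
    }
    async with aiohttp.ClientSession() as session:
        async with session.post("http://localhost:8080/v1/chat/completions", json=payload) as resp:
            return await resp.json()

print(asyncio.run(ask_llama()))

If the model decides a function is needed, the reply contains a tool call with the function name and JSON arguments, which KubeVox then executes.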

Here’s an example of how the client checks to see if the server is available by querying the health endpoint.

import asyncio
from typing import Tuple
from urllib.parse import urljoin

import aiohttp
from aiohttp import ClientError


# Method of the LlamaClient class; imports shown here for context.
async def check_server_health(self) -> Tuple[bool, str]:
    """
    Check if the Llama server is running and healthy.

    Returns:
        Tuple of (is_healthy: bool, message: str)
    """
    try:
        health_url = urljoin(self.config.base_url, "/health")
        async with aiohttp.ClientSession() as session:
            async with session.get(health_url, timeout=5.0) as response:
                if response.status == 200:
                    return True, "Server is healthy"
                else:
                    return False, f"Server returned status code: {response.status}"

    except ClientError as e:
        return False, f"Failed to connect to server: {str(e)}"
    except asyncio.TimeoutError:
        return False, "Connection timed out"

LlamaTools

In the previous implementation with DeepSeek V3, we relied on clever prompting to simulate function calling. However, Llama 3.2 natively supports function calling and structured output.

LlamaTools is the module that constructs the system prompt with the available functions for the local LLM. We generate a JSON schema of all the functions so the local LLM knows which functions it can call.
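
Conceptually, that schema generation is just a loop over the registry. Here is a minimal sketch; the registry shape is an assumption for illustration, not the exact KubeVox data structure:

import json
from typing import Any, Dict

def build_function_schema(registry: Dict[str, Dict[str, Any]]) -> str:
    """Turn registered function metadata into the JSON embedded in the system prompt."""
    tools = []
    for name, meta in registry.items():
        tools.append({
            "type": "function",
            "function": {
                "name": name,
                "description": meta["description"],
                "parameters": meta.get("parameters", {"type": "object", "properties": {}}),
            },
        })
    return json.dumps(tools, indent=2)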

We use Llama 3’s recommended prompt format (see Llama 3 Documentation). This format uses special tokens to define system instructions, user messages, and assistant responses. See below for an example.

<|start_header_id|>system<|end_header_id|> 
You are a helpful assistant. 
<|eot_id|> 
<|start_header_id|>user<|end_header_id|> 
What is the capital of France? 
<|eot_id|> 
<|start_header_id|>assistant<|end_header_id|> 
Paris 
<|eot_id|>

FunctionRegistry

The FunctionRegistry remains a core part of KubeVox, enabling the dynamic and automatic management of functions. It uses a decorator that simplifies adding new functions and keeps all related metadata together. Here’s how we use it to register a function with its metadata.

Please note: The local LLM will use this function definition, so providing all the information about the function is vital.

import re
from typing import Any, Dict

import aiohttp


@FunctionRegistry.register(
    description="Retrieve the latest stable version information from the Kubernetes GitHub repository.",
    response_template="Latest Kubernetes stable version is {latest_stable_version}.",
)
async def get_kubernetes_latest_version_information() -> Dict[str, Any]:
    """Get the latest stable Kubernetes version from GitHub."""
    url = "https://raw.githubusercontent.com/kubernetes/kubernetes/master/CHANGELOG/CHANGELOG-1.28.md"

    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            content = await response.text()

    # Extract version using regex
    version_match = re.search(r"# v1\.28\.(\d+)", content)
    if version_match:
        latest_version = f"1.28.{version_match.group(1)}"
    else:
        latest_version = "Unknown"

    return {"latest_stable_version": latest_version}

FunctionExecutor

The FunctionExecutor class still executes the function selected by the LLM. It retrieves the function and its metadata from the FunctionRegistry, runs it, and then builds a text response from the result using the response_template supplied in the decorator.
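
In spirit, that boils down to a lookup, a call, and a string format. A quick sketch, reusing the assumed registry shape from the snippet above:

from typing import Any, Dict

async def execute_function_call(name: str, arguments: Dict[str, Any]) -> str:
    """Run the function the LLM selected and render its response_template."""
    entry = FunctionRegistry.functions[name]            # metadata stored by the register decorator
    result = await entry["callable"](**arguments)       # execute the registered async function
    return entry["response_template"].format(**result)  # fill the template with the result fields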

WhisperTranscriber, ElevenLabsSpeaker, and k8s tools

WhisperTranscriber, ElevenLabsSpeaker, and k8s tools remain the same. See the previous article.


Is it safe to let AI control my Kubernetes cluster?

Let me guess — letting AI near your production cluster probably makes you nervous. I get it. When I first started building KubeVox, that was my biggest worry, too.

But here’s the thing: KubeVox isn’t some wild AI that can run whatever commands it wants. Think of it as an intelligent remote control with a fixed set of buttons. You decide what those buttons do.

Here’s precisely how we keep things locked down

The foundation of KubeVox’s security is still function-based access control. Remember the FunctionRegistry? The registered functions and their metadata (description, response_template) act as our security boundary. KubeVox can only execute functions defined in the registry and nothing else.

If someone says, “Hey KubeVox, delete all my pods!” guess what happens? Absolutely nothing. Why? Because unless you specifically create and register a “delete pods” function, KubeVox can’t do it. Even if it wanted to!

Here’s a glimpse at a basic function definition:

@FunctionRegistry.register( 
    description="Show running pods in a namespace", 
    response_template="Found {count} pods in {namespace}" 
) 
async def list_pods(namespace: str = "default"): 
    # Your code here
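
For completeness, a minimal body for that function could look like this: a sketch using the official Kubernetes Python client, with return keys matching the response_template above (not necessarily the exact KubeVox code):

from typing import Any, Dict

from kubernetes import client, config

async def list_pods(namespace: str = "default") -> Dict[str, Any]:
    """Count pods in a namespace; keys match the response_template placeholders."""
    config.load_kube_config()  # same kubeconfig and RBAC rules kubectl uses
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace=namespace)
    return {"count": len(pods.items), "namespace": namespace}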

Your existing Kubernetes permissions stay in charge

When KubeVox executes one of its allowed functions, it’s using the same Kubernetes credentials you’ve already set up with kubectl.

All the standard Kubernetes RBAC (Role-Based Access Control) rules still apply. KubeVox doesn’t bypass any of the permissions you’ve configured.

If you can’t do something with your current Kubernetes permissions, neither can KubeVox.

Running the AI locally: more control, less risk

Let’s address the elephant in the room: running a local LLM. Here’s what you need to know:

  • No External LLM API Keys: By running Llama 3.2 locally, you eliminate the risk of exposing sensitive API keys to external services.
  • Limited Scope: The LLM only processes voice commands and identifies functions from the FunctionRegistry. It doesn’t have general access to your system.
  • Data Stays Local: No chat logs or data are sent to external services.

Want to make it even more secure?

Here are three things you can do right now:

  1. Review your functions: Open up k8s_tools.py and look at each registered function. They're your "allowed actions" list.
  2. Check your RBAC: Run kubectl auth can-i --list to see precisely what permissions KubeVox will have.
  3. Start small: Begin with read-only functions like listing pods and deployments. Add write operations only when you’re comfortable.

The bottom line? KubeVox is as secure as the functions you give it—no more, no less. You’re not giving AI the keys to your kingdom — you’re just building a really smart assistant that can only press the buttons you’ve approved.


Let’s talk money: the actual cost of running your Kubernetes voice assistant

You know that sinking feeling when you check your cloud bill? Yeah, I’ve been there. But here’s some good news — running KubeVox today costs almost nothing. Let me show you what has changed and how much you’ll spend.

The painful past: when APIs ate my lunch money

Initially, using OpenAI’s Realtime API with KubeWhisper was like watching money burn. Experimenting with prompts and features could rack up serious daily costs. Even the “cheaper” DeepSeek V3 still meant constant API bills.

Local LLM to the Rescue: Freedom from API Fees!

The game-changer? Shifting the LLM processing to a local Llama 3.2 instance. This single move slashed costs dramatically. Here’s the breakdown:

  • Local STT (mlx-whisper): Still completely free! Voice processing happens right on your machine.
  • Local LLM (Llama 3.2): The biggest win! Say goodbye to LLM API fees. The LLM runs entirely locally.
  • ElevenLabs (Text-to-Speech): We still use the ElevenLabs API for great-sounding voice output, but their generous free tier might be all you need.

Cost Comparison (Based on My Usage):

  • Original (OpenAI Realtime API): Estimated $10–$20+ per day (and it could quickly go higher with heavy use!)
  • DeepSeek V3 Hybrid: Around $1.50 daily for the online LLM, plus ElevenLabs costs.
  • Current (Local Llama 3.2): Essentially FREE! The only potential cost is ElevenLabs, which can even be free, depending on your usage.

A Word About ElevenLabs

Their free tier gives you 10,000 monthly characters, which might be enough for casual use. Their starter plan is only $5 a month if you need more.

Important Considerations:

  • Local TTS Options: If you want to avoid any monthly fees, there are local text-to-speech models you can try, such as f5-tts-mlx. However, be warned: you may experience significantly higher latency than with ElevenLabs.
  • My Numbers: These costs are approximate and based on my usage patterns. Your expenses will depend on how much you use ElevenLabs and other factors.
  • Hardware: I assume you already have the hardware needed to run the local LLM.

Adding your own commands

The pre-built commands are helpful, but the real magic of KubeVox is in its customization. You can easily add your own commands to suit your specific needs.

The great news? The basic process remains the same. Let’s walk through it:

Define the Function in k8s_tools.py

You define a function that interacts with your Kubernetes cluster, just like any other Python function.

Crucially, you must decorate the function with the @FunctionRegistry.register decorator. The decorator takes a description to tell the LLM what the function does, a response_template for formatting the output, and, for functions that accept input, a parameters argument defining the expected arguments.

Here’s a complete example of a function that retrieves pod logs:

import datetime
from typing import Any, Dict

from kubernetes import client, config


@FunctionRegistry.register(
    description="Get the logs from a specified pod for the last hour.",
    response_template="The logs from pod {pod} in namespace {namespace} are: {logs} (time range {time_range})",
    parameters={
        "type": "object",
        "properties": {
            "pod_name": {
                "type": "string",
                "description": "Name of the pod"
            },
            "namespace": {
                "type": "string",
                "description": "Namespace of the pod",
                "default": "default"
            }
        },
        "required": ["pod_name"]
    },
)
async def get_recent_pod_logs(pod_name: str, namespace: str = "default") -> Dict[str, Any]:
    """Get the logs from a pod for the last hour."""
    try:
        # Load kube config
        config.load_kube_config()
        v1 = client.CoreV1Api()

        # Calculate timestamp for one hour ago
        one_hour_ago = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1)

        # Get logs
        logs = v1.read_namespaced_pod_log(
            name=pod_name,
            namespace=namespace,
            since_seconds=3600,  # Last hour
            timestamps=True
        )
        return {
            "logs": logs,
            "pod": pod_name,
            "namespace": namespace,
            "time_range": f"Last hour (since {one_hour_ago.isoformat()})"
        }
    except Exception as e:
        return {"error": f"Failed to get logs: {str(e)}"}

Remember, since we rely on Llama 3’s native function calling, it is essential to provide complete information so the local LLM can use the function correctly. Take care to write an accurate function definition: the local LLM knows nothing about get_recent_pod_logs beyond what you describe.

Here, the function is decorated with @FunctionRegistry.register, supplying the description and response_template. The response_template references the returned fields in curly brackets so KubeVox can generate a response from those fields.

That’s all you need to define a new command!

The FunctionRegistry decorator tells KubeVox that a new function is available; the LLM uses the description and parameters to pick the correct function and fill in its arguments.


Setting up your environment

Want the most straightforward setup possible? Let me show you how UV makes this easy.

Here’s what you need: a Mac with Apple silicon (mlx-whisper is optimized for Apple chips), a local build of llama.cpp with the Llama 3.2 GGUF model, an ElevenLabs API key, and kubectl access to your cluster.

Installation is a breeze:

# 1. Install UV (if you don't have it already) 
curl -LsSf https://astral.sh/uv/install.sh | sh 
 
# 2. Clone the repository 
git clone https://github.com/PatrickKalkman/kubevox.git 
 
# 3. Set your API keys as environment variables (replace with your actual keys) 
export ELEVENLABS_API_KEY='your-elevenlabs-api-key-here' 
# That's it for setup! 
 
# 4. Start llama.cpp with the Llama 3.2 model 
cd llama.cpp 
llama-server -m ./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf 
 
# 5. Run the assistant in different modes in a different terminal: 
# Text mode (ask a question directly): 
cd kubevox 
uv run kubevox text "show me all pods in the default namespace" --output text 
 
# Voice mode (start listening): 
uv run kubevox --voice --output text 
 
# Voice mode with specific input device (find device index with: python -m sounddevice): 
uv run kubevox voice --device <device_index> --output text 
 
# Voice mode with voice output 
uv run kubevox voice --output voice 
 
# Text input & output 
uv run kubevox text "show me all my namespaces" --output text 
 
# Verbose output (for more details) 
uv run kubevox -v --text "show all my services in the kube-system namespace" 
 
# Change whisper model 
uv run kubevox --voice --model "openai/whisper-small"

What we’ve built and where we go from here

This local-LLM-powered KubeVox is a big deal: voice-controlled Kubernetes for everyone, with privacy, speed, and budget in mind.

We’ve come a long way from the early days of expensive cloud APIs, and the progress is exciting.

I’ve got to be honest — this is my last article about KubeVox. But before I go, let me share some fun possibilities to take this to the next level.

  • Function-Calling Powerhouse (Smaller Models): Can we fine-tune a smaller, specialized LLM to excel at Kubernetes tasks? Would a dedicated model outperform a larger, more general one?
  • Expanding Functionality (New Voice Commands): What new voice commands for Kubernetes tasks would be super useful?

A quick note about the name

Oh, and if you’re wondering why we switched from KubeWhisper to KubeVox, someone beat us to that first name! There’s already a KubeWhisper out there (great minds think alike, I guess?).

Thank you — yes, you!

Before I go, I want to say thanks. Whether you’ve been here since those expensive API days or just joined KubeVox now, you’re part of making Kubernetes a little easier for everyone.

What’s next for you? Go to the GitHub repo and tell me what you’d build with KubeVox. I’d love to see what you do with it!
