AI on Your Terms: Privacy-Focused LLM Deployment with Meta’s Llama 3 and Node.js
A Step-by-Step Guide to Secure and Efficient Local Integration of Large Language Models

“We need to harness the power of these new technologies,” the CEO declared firmly. “But we can’t afford to compromise our client’s data security.”
As his words echoed in the conference room, my fellow developers and I exchanged uneasy glances. To boost efficiency, we had integrated external large language models (LLMs) like ChatGPT into our workflows.
Yet, the CEO’s reminder was a stark wake-up call. The balance between innovation and security was becoming increasingly precarious.
This wasn’t the first time a client had expressed concerns about the potential risks of external LLM services. As more companies consider adopting technologies like ChatGPT, the fear of privacy breaches and data security issues often holds them back.
The CEO’s directive echoed the broader industry sentiment: some concerns stem from fear of the unknown, others from legitimate worries about data confidentiality.
The challenge was clear and urgent: finding a way to leverage LLMs without exposing sensitive data externally.
In this article, we will cover the following:
- Choosing the Right LLM: We’ll discuss the criteria for selecting an appropriate open-source LLM, focusing on Meta’s Llama 3 model.
- System Architecture: An overview of the system architecture, including Docker containerization and orchestration.
- Implementation Steps: Detailed steps to set up the LLM locally, including Docker configurations, API implementation, and integration with authentication services.
- Deployment and Scaling: How to deploy the solution on various platforms and scale it using tools like Kubernetes.
- Security and Authentication: Implementing robust authentication using Microsoft Entra ID to secure the solution.
- Testing and Validation: Using tools like LangChain to verify the functionality and performance of the deployed LLM.
- Future Enhancements: Potential improvements and next steps for extending the solution's capabilities.
Join us as we show you how to develop and run this system on your workstation and scale it to more powerful machines using platforms like datacrunch.io or Lambda Labs. The complete solution can be found here on GitHub.
Architecture
This section will describe the solution's architecture and rationale. As depicted in the diagram, we’ll explore the system from right to left.
We will use Docker containers to run the individual parts and Docker Compose to manage the complete system. This also makes it easy to move later to a full container orchestrator such as Kubernetes.

LLM
On the right, the actual large language model (LLM) is mounted inside the Llama.cpp container. We will be using Meta's Llama 3 model with 8 billion parameters, but the idea is that the model should be easily changeable.
Llama.cpp (server)
This Docker container utilizes Llama.cpp, a tool designed for efficient LLM inference with minimal setup and state-of-the-art performance on various hardware configurations. This container serves as the backend, running the server component of Llama.cpp to interface with the LLM.
Custom API
Next, we have a Docker container that hosts our custom REST API. This component is an abstraction layer between Llama.cpp and the solution's users. The custom API provides an OpenAI API-compatible interface.
This setup allows us to switch to a different inference component while maintaining a consistent external interface, so existing client apps won't have to change.
Additionally, by making the interface OpenAI API-compatible, we ensure compatibility with a wide range of existing tools and applications that already support this popular standard.
The API will be implemented using Node.js and Fastify as the REST framework.
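To see why OpenAI compatibility matters in practice, here is a hypothetical sketch of a client built on the official openai Node.js SDK. Pointing it at our internal service only requires overriding the base URL; the model name and API key below are placeholders, since the backend loads its model from disk and the key only becomes meaningful once authentication is enabled.
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8000',  // our internal API instead of api.openai.com
  apiKey: 'placeholder',             // only meaningful once authentication is enabled
});

const completion = await client.chat.completions.create({
  model: 'local-llama-3',            // informational only; the backend loads its own model
  messages: [{ role: 'user', content: 'Summarize our data-retention policy in one sentence.' }],
});

console.log(completion.choices[0].message.content);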
Authentication
We will use Microsoft's rebranded Azure Active Directory, Entra ID, for authentication. Integrating with a company's existing authentication system is important because this will allow the company to control who or what can use the system.
We chose Entra ID in this case, but it could have been any other well-known authentication service such as Auth0, Amazon Cognito, or Google’s Identity Platform.
Client App
Lastly, the client application interacts with the custom API. While the client app is not a direct part of the solution, it shows how users can consume the API with frameworks like LangChain.
The next sections will dive deeper into each part and show the implementation.
Choosing the LLM

We’ve chosen Meta’s Llama 3 model for this project, specifically the variant with 8 billion parameters. Selecting the right model is crucial as it impacts performance and resource requirements.
The 8 billion parameter variant offers a balanced trade-off between computational efficiency and accuracy, making it suitable for various applications.
Before selecting any model, it’s essential to ensure its license allows commercial use.
The Llama 3 license, a community license, allows users to use, adapt, and build upon the Llama Materials, including the foundational large language models and related software.
Here are the key terms of the Llama 3 license:
License Permissions:
- Usage and Modifications: You can use and modify the Llama Materials freely and retain ownership over your creations.
- Redistribution: You can distribute original or modified versions, provided that each distribution includes a copy of the license agreement and a note stating “Built with Meta Llama 3.”
- Ownership of Innovations: Any innovations created from the Llama Materials belong to the developer.
License Restrictions:
- Scale of Use: If your services have more than 700 million monthly active users, you need additional licensing from Meta.
- Non-competition: You cannot use the Llama Materials to enhance competing models.
- Trademark Use: Meta’s trademarks are restricted unless specifically allowed.
These terms ensure that developers can freely innovate with Llama 3 while respecting Meta’s usage guidelines. For instance, a startup with fewer than 700 million monthly active users can use and modify Llama 3 without extra licensing but must seek additional permissions as it scales.
The GGUF Model Format
Since we are using llama.cpp, we must ensure the model is in the GGUF format, the binary model file format used by llama.cpp (the successor to the older GGML format).
GGUF is designed to be an efficient and compact way to store and load models, particularly for deployment on devices with limited resources, which makes it well suited for a wide range of deployment scenarios.
llama.cpp includes tools for converting a model from its original format (e.g., PyTorch checkpoints or Hugging Face safetensors) to GGUF.
Fortunately, Hugging Face, a popular platform for hosting and sharing machine learning models, offers several models in the GGUF format. This availability makes it easier for users to download and use these models directly with llama.cpp without needing to perform the conversion themselves.
Obtaining a ready-made GGUF model from Hugging Face simplifies the deployment process and lets you integrate the Llama 3 model into your applications quickly.
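As a quick sanity check before mounting a downloaded file, you can verify that it really is a GGUF model: the format starts every file with the four-byte ASCII magic "GGUF". A minimal Node.js sketch (the file path is just an example):
import { open } from 'node:fs/promises';

async function isGguf(path) {
  const file = await open(path, 'r');
  try {
    const magic = Buffer.alloc(4);
    await file.read(magic, 0, 4, 0);            // read the first four bytes of the file
    return magic.toString('ascii') === 'GGUF';  // every GGUF file starts with this magic
  } finally {
    await file.close();
  }
}

console.log(await isGguf('./models/Meta-Llama-3-8B-Instruct.Q8_0.gguf'));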
The llama.cpp Docker Container

llama.cpp is an open-source project on GitHub that enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, both locally and in the cloud.
This section will explore how to run llama.cpp using Docker containers, which provide an efficient way to manage different environments.
Docker containers for different environments
To facilitate different development environments, I created three builds of the llama.cpp container:
- macOS (ARM64)
- Intel (AMD64)
- Intel with an NVIDIA (CUDA-compatible) graphics card
These container images can be found on Docker Hub:
- Multi-platform image (AMD64 and ARM64): pkalkman/llama.cpp:0.2.1
- CUDA-enabled image (AMD64): pkalkman/llama.cpp:0.2.1-cuda
Multi-Platform (AMD64 and ARM64)
This is the Dockerfile used for the multi-platform CPU image, sourced from the DevOps folder of the llama.cpp GitHub repository:
ARG UBUNTU_VERSION=22.04

FROM ubuntu:$UBUNTU_VERSION as build

RUN apt-get update && apt-get install -y build-essential git libcurl4-openssl-dev

WORKDIR /app

COPY . .

ENV LLAMA_CURL=1

RUN make

FROM ubuntu:$UBUNTU_VERSION as runtime

RUN apt-get update && \
    apt-get install -y libcurl4-openssl-dev

COPY --from=build /app/server /server

ENV LC_ALL=C.utf8

ENTRYPOINT [ "/server" ]
This Dockerfile accomplishes two main tasks:
- Building the Application: Compiles the LLaMA.cpp server with the necessary dependencies.
- Setting Up the Runtime Environment: Creates a lightweight container with the built application ready to run.
To build and push the multi-platform Docker image, I used the following shell script:
#!/bin/bash
VERSION="0.2.1"
APP="llama.cpp"
docker buildx build --platform linux/amd64,linux/arm64 -f ./Dockerfile -t pkalkman/$APP:$VERSION --push .
CUDA-Enabled (AMD64)
For the CUDA-enabled Docker image, which leverages NVIDIA GPUs for enhanced performance, the Dockerfile is as follows:
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG CUDA_VERSION=11.7.1
# Target the CUDA build image
ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# Target the CUDA runtime image
ARG BASE_CUDA_RUN_CONTAINER=nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}

FROM ${BASE_CUDA_DEV_CONTAINER} as build

# Unless otherwise specified, we make a fat build.
ARG CUDA_DOCKER_ARCH=all

RUN apt-get update && \
    apt-get install -y build-essential git libcurl4-openssl-dev

WORKDIR /app

COPY . .

# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable CUDA
ENV LLAMA_CUDA=1
# Enable cURL
ENV LLAMA_CURL=1

RUN make

FROM ${BASE_CUDA_RUN_CONTAINER} as runtime

RUN apt-get update && \
    apt-get install -y libcurl4-openssl-dev

COPY --from=build /app/server /server

ENTRYPOINT [ "/server" ]
This setup ensures that the container is optimized for CUDA, providing significant performance improvements for inference tasks.
To build this Docker image, use the following script:
#!/bin/bash
VERSION="0.2.1-cuda"
APP="llama.cpp"
docker buildx build --platform linux/amd64 -f ./Dockerfile.cuda -t pkalkman/$APP:$VERSION --push .
Later, we will look at how to mount the LLM into the container and configure it using Docker Compose.
Implementing the OpenAI-compatible API

As described earlier, we aim for our custom API to be compatible with the OpenAI API. This compatibility allows API users to leverage existing frameworks, such as LangChain, for seamless integration.
OpenAI has provided an OpenAPI specification (formerly known as Swagger) for their API and made it public. This specification includes a comprehensive set of 18 actions across 8 main endpoints, covering functionalities like model management, completions, edits, image processing, embeddings, file management, fine-tuning, and moderations.
Initially, we will focus on implementing the core endpoints: chat completions and models. For now, our custom API endpoints won’t contain any logic; they will forward the incoming request to the corresponding endpoint of the llama.cpp service.
The implementation will use Node.js and Fastify as the REST framework.
Implementing the Chat Completion Endpoint
The first endpoint we will implement is /chat/completions. This endpoint is used to generate text completions based on a given input prompt and is typically used in chat applications or any other context requiring AI-generated text based on a user-provided prompt.
Request
The request to this endpoint should be a POST request with a JSON payload. The payload can include various parameters to control the behavior of the text generation.
Request Body
Here’s an example of a typical request body for the /chat/completions endpoint:
{
  "model": "models/mistral-7b-openorca.Q8_0.gguf",
  "messages": [
    { "role": "user", "content": "Write me a song about goldfish on the moon" }
  ],
  "temperature": 0.5,
  "max_tokens": 100
}
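For reference, a successful non-streaming response follows the OpenAI chat-completion shape; the values below are purely illustrative:
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "models/mistral-7b-openorca.Q8_0.gguf",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Verse 1: ..." },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 12, "completion_tokens": 87, "total_tokens": 99 }
}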
Implementation
We will use Node.js and Fastify to create this endpoint. The following code demonstrates the implementation:
import logger from '../utils/logger.js';
import axios from 'axios';
import { pipeline } from 'stream';
import { promisify } from 'util';

const axiosInstance = axios.create({
  baseURL: 'http://localhost:8080',
});

const pipelineAsync = promisify(pipeline);

const chatCompletionController = {};

chatCompletionController.chatCompletion = async (req, reply) => {
  try {
    const response = await axiosInstance.post('/chat/completions', req.body, {
      responseType: 'stream',
    });

    logger.info('Request successfully processed by llama_cpp service');

    reply.raw.writeHead(response.status, {
      'Content-Type': response.headers['content-type'],
      'Transfer-Encoding': 'chunked',
    });

    await pipelineAsync(response.data, reply.raw);
  } catch (error) {
    logger.error(`Error in chatCompletionController.chatCompletion: ${error.message}`);

    if (error.response) {
      reply.send(reply.httpErrors.createError(error.response.status, {
        message: error.response.data.error || 'Error from llama_cpp service',
      }));
    } else if (error.code === 'ECONNREFUSED') {
      reply.serviceUnavailable();
    } else {
      reply.internalServerError();
    }
  }
};

export default chatCompletionController;
The chatCompletion method handles incoming POST requests. It forwards the request body to the llama.cpp service using axios.post with responseType: 'stream' to handle streaming responses. It then uses pipelineAsync to forward the streaming response from the llama.cpp service to the client.
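The Dockerfile in the next section starts the service with node index.js. The actual bootstrap lives in the GitHub repository; a minimal sketch of what it could look like, assuming the controller above and the @fastify/sensible plugin (which provides reply.serviceUnavailable() and related helpers), is shown here:
import Fastify from 'fastify';
import sensible from '@fastify/sensible';
import chatCompletionController from './controllers/chatCompletionController.js'; // hypothetical path

const fastify = Fastify({ logger: true });

await fastify.register(sensible); // adds reply.serviceUnavailable(), reply.internalServerError(), ...

// Forward chat completions to the llama.cpp backend
fastify.post('/chat/completions', chatCompletionController.chatCompletion);

// Bind to all interfaces so the container's published port is reachable
await fastify.listen({ port: 8000, host: '0.0.0.0' });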
Building the Docker container
To package our custom API, we use a Dockerfile that defines the steps for creating a Docker image. This Docker image ensures our API runs consistently across different environments.
FROM node:20-alpine3.18

RUN apk update && \
    apk add --no-cache tzdata && \
    cp /usr/share/zoneinfo/Europe/Amsterdam /etc/localtime && \
    echo "Europe/Amsterdam" > /etc/timezone && \
    rm -rf /var/cache/apk/*

EXPOSE 8000

WORKDIR /home/node/app

USER root
RUN chown -R node:node /home/node/app
USER node

COPY --chown=node:node package*.json ./

RUN npm install --only=prod && npm cache clean --force

COPY --chown=node:node . .

CMD [ "node", "index.js" ]
This Dockerfile sets up a lightweight Node.js environment using the node:20-alpine3.18 base image. It installs the timezone data to configure the system timezone to Europe/Amsterdam, ensuring that all logs and timestamps are consistent with the expected local time.
The working directory is set to /home/node/app, and ownership is assigned to the non-root node user for better security.
The Dockerfile copies the necessary package.json files and installs production dependencies, cleaning up the npm cache to reduce image size.
Finally, it copies the application code and sets the default command to run the Node.js application using index.js. The container exposes port 8000 for incoming HTTP traffic, ensuring the application's accessibility.
To build the Docker image, I use the following shell script:
#!/bin/bash
VERSION="0.2.0"
APP="llama-internal-api"
docker buildx build --platform linux/amd64,linux/arm64 -f ./Dockerfile -t pkalkman/$APP:$VERSION --push .
Authentication using Entra ID

As with any enterprise solution, we need to add authentication. We will use Microsoft Entra ID, previously known as Azure Active Directory, so our custom internal API must validate each incoming request and determine whether it comes from an authenticated source.
To implement authentication using Microsoft Entra ID, you need to register your application in the Azure portal under “Microsoft Entra ID” > “App registrations” and create a new registration to obtain the Application (client) ID.
Next, configure your application by generating a client secret in “Certificates & secrets” and setting the necessary API permissions under “API permissions.” Integrate these credentials into your application to authenticate with Entra ID and implement JWT validation to ensure secure and authenticated requests. For detailed steps, please refer to the Microsoft Entra ID documentation.
First, we will create a utility function, verifyToken, to validate the incoming JWT and then attach it to our chatCompletion action.
import jwt from 'jsonwebtoken';
// getPublicKey, tenantId and clientId come from helper modules (see the GitHub repository).

export async function verifyToken(token) {
  const decodedHeader = jwt.decode(token, { complete: true }).header;
  const publicKey = await getPublicKey(decodedHeader.kid);

  return new Promise((resolve, reject) => {
    jwt.verify(token, publicKey, {
      algorithms: ['RS256'],
      issuer: `https://sts.windows.net/${tenantId}/`,
      audience: `api://${clientId}`,
    }, (err, decoded) => {
      if (err) {
        return reject(err);
      }
      resolve(decoded);
    });
  });
}
The verifyToken function performs the following steps:
1. Decoding the Token Header: The function starts by decoding the JWT to extract its header. This header contains metadata about the token, including the key ID (kid) used to sign it.
2. Retrieving the Public Key: The function retrieves the corresponding public key using the key ID (kid). This key is essential for verifying the token's signature.
3. Verifying the Token: The function verifies the token using jwt.verify(). It checks the token's signature against the retrieved public key, ensuring a trusted source issued it. It also verifies:
   - Algorithm: the algorithm used to sign the token (RS256).
   - Issuer: the trusted entity that issued the token, formatted as https://sts.windows.net/${tenantId}/.
   - Audience: the intended recipient of the token, formatted as api://${clientId}.
4. Handling Verification Results: If the token verification fails, the function rejects the promise with an error. If successful, it resolves the promise with the decoded token payload.
We then integrate this function into our chat completion route as a preHandler like this.
export default async function (fastify) {
  fastify.post('/chat/completions', { preHandler: fastify.authenticate }, chatCompletionController.chatCompletion);
}
And decorate our fastify instance with the authenticate function like this:
fastify.decorate('authenticate', async function (request, reply) {
  try {
    const authHeader = request.headers.authorization;
    if (!authHeader) {
      reply.unauthorized('No authorization header');
      return;
    }

    const token = authHeader.split(' ')[1];
    try {
      const tokenPayload = await verifyToken(token);
      request.user = tokenPayload;
    } catch (err) {
      fastify.log.error({ msg: 'Error verifying token', err, token });
      reply.unauthorized('Invalid token');
      return;
    }
  } catch (err) {
    fastify.log.error({ msg: 'An unexpected error occurred during authentication', err });
    reply.internalServerError('An unexpected error occurred during authentication');
  }
});
You can find the complete verifyToken function and its helper functions here on GitHub.
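For illustration, here is a minimal sketch of what such a getPublicKey helper could look like, assuming the jwks-rsa package and Entra ID's JWKS discovery endpoint (the exact URL depends on your tenant and token version):
import jwksClient from 'jwks-rsa';

const tenantId = process.env.TENANT_ID;

const client = jwksClient({
  jwksUri: `https://login.microsoftonline.com/${tenantId}/discovery/keys`,
  cache: true, // cache signing keys so we don't hit the endpoint on every request
});

export async function getPublicKey(kid) {
  const key = await client.getSigningKey(kid); // look up the key that signed the token
  return key.getPublicKey();                   // PEM-encoded public key for jwt.verify()
}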
Building the Client app using LangChain
To verify that the functionality works as expected, we use the following Python script.
# Depending on your LangChain version, ChatOpenAI may instead be imported
# from langchain.chat_models or langchain_community.chat_models.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    temperature=0.5,
    model="does not matter",
    openai_api_base="http://localhost:8000",
    openai_api_key="any string will do"
)

try:
    for chunk in llm.stream("Write a python script that validates an email address"):
        if chunk.content:
            print(chunk.content, end="", flush=True)
except Exception as e:
    print(f"An error occurred: {e}")
In this script, we use streaming mode to verify that the streaming functionality works correctly. The llm.stream method allows for continuous data transfer, essential for handling large responses or when immediate, incremental output is needed.
Streaming the response ensures that our client application can process data as it arrives rather than waiting for the entire response to be generated. This approach improves efficiency and enhances the user experience by providing quicker feedback.
If any errors occur during the execution, they are caught and printed, ensuring that any issues can be quickly identified and addressed. This script demonstrates the seamless integration of LangChain with our custom OpenAI-compatible API, showcasing how easily existing tools can interact with our solution.
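When authentication is enabled, the client must send a real bearer token instead of a dummy key; ChatOpenAI forwards openai_api_key as the Authorization: Bearer header. A hypothetical sketch of acquiring such a token in Node.js via the client-credentials flow with @azure/msal-node (the environment variable names are placeholders):
import { ConfidentialClientApplication } from '@azure/msal-node';

const cca = new ConfidentialClientApplication({
  auth: {
    clientId: process.env.CLIENT_APP_ID,          // the calling application's registration
    clientSecret: process.env.CLIENT_APP_SECRET,
    authority: `https://login.microsoftonline.com/${process.env.TENANT_ID}`,
  },
});

// Request a token scoped to the API's app registration
const result = await cca.acquireTokenByClientCredential({
  scopes: [`api://${process.env.API_CLIENT_ID}/.default`],
});

console.log(result.accessToken); // use this value as the Bearer token / openai_api_key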
Running the solution
To start the complete system, we use Docker Compose. This allows us to manage both the container running llama.cpp and the container running our custom OpenAI-compatible API. We mount the LLM model inside the llama.cpp container using a volume mount.
Below is the docker-compose.yaml file to start both services and mount a local models folder in the llama.cpp container.
services:
  llm_cpp:
    image: pkalkman/llama.cpp:0.2.1-cuda
    command: ["--host", "0.0.0.0", "--port", "8080", "--model", "/app/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf"]
    ports:
      - "8080:8080"
    volumes:
      - ./models/:/app/models/
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  llm_api:
    image: pkalkman/llama-internal-api:0.2.1
    ports:
      - "8000:8000"
    depends_on:
      - llm_cpp
    environment:
      - DISABLE_AUTH=true
Key Configuration Details:
- The command used to start llama.cpp is crucial; it binds the server to all interfaces, sets the port, and points to the mounted model file:
command: ["--host", "0.0.0.0", "--port", "8080", "--model", "/app/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf"]
- To run the solution without configuring Microsoft Entra ID, set the environment variable DISABLE_AUTH to true.
Setup Environment:
- I used an instance with an NVIDIA Tesla V100 16GB from DataCrunch.io, running Ubuntu 20.04 with CUDA 11.7 and Docker. For compatibility, ensure you select the same CUDA version.
- I used the Llama 3 8B model from Hugging Face, which was already converted into the GGUF format. You can find the Meta-Llama-3-8B-Instruct.Q8_0.gguf version in the QuantFactory/Meta-Llama-3-8B-Instruct-GGUF repository, which also contains different quantized versions.
Starting the Services:
1. Start both services using the following command:
docker-compose up
2. Once both services have started successfully, you can initiate the client application that uses LangChain. You should see results streaming token by token, indicating a successful operation.

What’s Next?
This article presents the first version of a secure and efficient system for deploying large language models locally, ensuring enhanced privacy and performance. While the current implementation demonstrates core functionalities, there are several areas for potential improvement and expansion:
Enhanced Model Support:
During testing, we evaluated several LLMs. However, to ensure comprehensive support, we must test an even broader range of LLMs to facilitate easy switching between models.
Scalability and Performance Optimization:
Although Docker and Docker Compose provide a solid foundation, further optimizations can be achieved by transitioning to more sophisticated orchestration tools like Kubernetes. This transition will enable better resource management and scaling capabilities, particularly for large-scale deployments.
Extending API Capabilities:
We have only implemented the chat completion endpoint to facilitate chat interaction. The OpenAI API specification includes 17 additional actions that can be implemented to ensure full functionality support.
Update Support:
We have created several Docker images. Moving forward, we need to enhance the solution to easily create new Docker images when integrating new versions of llama.cpp, for instance, by using continuous integration with GitHub Actions.

