Obligatory AI Things
LLMs and compose.mk might seem like strange companions at first, but there are some edge cases where the combination is pretty interesting. A few observations about this:
- Code-generation for things like shell / Makefile / awk is often incredibly reliable compared to most other things, and the available training data is huge.
- New capabilities for code-generation actually favor polyglot design, and allow for choosing the best tool for every job rather than always picking the most familiar tool.
- Even small AI-backed tools require significant dependencies and/or infrastructure. Shipping them so that they bootstrap a local ollama (or similar) themselves, rather than with a list of setup instructions, tends to lower the barrier to entry.
- Related to the last point, tool-use and agents can both benefit from a "containers first" type of approach. This helps not only with dependency-management but also with sandboxing.
- AI-driven help with prototyping can very quickly leave you with tons of files (and all of the implied context-switching) for every little experiment.
While much of compose.mk is focused on glue-code and orchestration, several of the topics above are still pretty closely related to the compose.mk core competencies. Generating small components also becomes a lot more useful if there is some kind of plan for organizing and packaging them, and those are also things that compose.mk can help with.
Overview & Background
This page includes two demos:
The first demo is very minimal, focuses on being self-contained, and shows how to use an embedded ollama server and separate prompts/code. This is somewhat in the spirit of shell-gpt, and passes data on pipes. It's a reasonably good template for small tools that should run anywhere, specify their own dependencies, and download/install everything (including the backend model) just in time.
The second demo is more elaborate. It uses external files for most of the implementation, and it lets you chat with the compose.mk documentation (or any other files) using a small RAG pipeline with ollama and langchain.
Both demos are written in CMK-lang rather than starting from pure Makefile. This page focuses more on the demos themselves than on exposition of CMK itself.
Both demos are also local-first and CLI-friendly, and as long as you have make and docker they will bootstrap themselves. Users can therefore enjoy: no daemons required, no open ports, no manual installation of python/node stacks or dependencies, no API tokens, etc. And pirates, of course. Here's one describing compose.mk's polyglot capabilities:
$ query='describe compose.mk polyglot capabilities as if you are a pirate' \
glob='docs/*.md' ./demos/cmk/rag.cmk ask
Ahoy mateys! Arr, in this world o' languages, I be savvy enough to juggle 'em
like yer ol' bottle collection— each one perfect for its own purpose. With me
trusted compose.mk and some clever splicin', even the most bewilderin tasks
gets done easy as pie. Be it leveragin language just right or summonin matrix golf
with APL, I command these languages better than any captain commands his ship!
Now that be a mighty fine skill to have in this treacherous sea o' code, see?
CITATIONS:
- /workspace/docs/demos/polyglots.md ("Part of the be...")
- /workspace/docs/style.md ("Explicit polyglot in...")
- /workspace/docs/but-why.md ("Polyglots Considered...")
Caveats
Unlike most of the other demos, these don't run as part of the test suite, because they are demanding in terms of disk space and CPU. Poor-quality and relatively slow summaries of a few paragraphs across only ~40 markdown files will cost you a ~4GB download of mistral if you don't already have it. In a lot of ways, you might as well just grep! GPU acceleration is also explicitly disabled, in the interest of portability [1].
These are basically just toys, and there are better ways to embed ollama in an application [2]. Still, this takes a problem that is "obviously a service and/or a system" and turns it into a stand-alone tool instead. By handling startup and shutdown, and by not requiring forwarded ports, orchestration abstracts the ollama server's lifecycle, and that's the sort of thing compose.mk is good at. To make sure this idea is more than wishful thinking, we'll also use compose.mk packaging features to wrap all the moving parts and the documents into a single self-extracting executable.
Dueling Philosophers
Something simple first, in the form of dueling philosophers: let's watch Hobbes fight Rousseau about whether or not life is good. This demo uses an embedded ollama server, two prompts, and just enough python code to bootstrap a backend model if one is not already present.
Since the python code required is minimal, this is a good candidate for polyglot-style embedding, and even including prompts and container specs we're only looking at ~100 lines:
#!/usr/bin/env -S ./compose.mk mk.interpret!
# Demonstrates building a self-contained ollama application with CMK-lang.
#
# This demo ships with the `compose.mk` repository.
# (Not part of the test-suite since model requirements are ~5GB)
#
# See the main docs:
# https://robot-wranglers.github.io/compose.mk/demos/ai
export CMK_AT_EXIT_TARGETS=ollama_server.stop
export LLM_MODEL_NAME?=phi3:mini
# Inlined docker-compose services.
# 1 container for the ollama server, and 1 for the client.
# Volume allows model-sharing for any host-installation of ollama if available;
# this is probably wrong for non-Linux and also maybe wrong for some ollama versions?
⋘ ollama.services
services:
ollama_server: &base
build:
context: .
dockerfile_inline: |
FROM ollama/ollama@sha256:476b956cbe76f22494f08400757ba302fd8ab6573965c09f1e1a66b2a7b0eb77
working_dir: /workspace
entrypoint: ['ollama','serve']
volumes: ['${PWD}:/workspace','/usr/share/ollama/.ollama:/root/.ollama']
ollama_python:
<<: *base
entrypoint: python3
environment: ['OLLAMA_URL=http://ollama_server:11434/']
build:
context: .
dockerfile_inline: |
FROM python:3.11-slim-bookworm
RUN pip install --no-cache-dir ollama==0.4.7
⋙
# Philosopher prompts
🞹 hobbes.prompt
You are a reincarnation of Thomas Hobbes.
Erudite and insightful but also cantankerous, pessimistic, and paranoid.
Respond to the following point of view with your own, 1 paragraph max.
Do you agree or disagree? Do not restate the system prompt or user prompt.
🞹
🞹 rousseau.prompt
You are a reincarnation of Jean-Jacques Rousseau. Passionate and idealistic,
but also hypersensitive, seeing corruption where others see progress.
Respond to the following point of view with your own, 1 paragraph max.
Do you agree or disagree? Do not restate the system prompt or user prompt.
🞹
# Entrypoints for philosophers:
# Bind their prompts to expected env-vars, then run the chat polyglot.
hobbes.talk:; cmk.bind.def.to.env(hobbes.prompt, LLM_PROMPT) && this.chat
rousseau.talk:; cmk.bind.def.to.env(rousseau.prompt, LLM_PROMPT) && this.chat
# Seed to kick off the debate.
debate.moderator:; echo "Life is good?"
# Start the ollama service, init models,
# seed the argument, then let the philosophers talk.
debate: ollama_server.up.detach init_models \
flux.pipeline.verbose/debate.moderator,hobbes.talk,rousseau.talk
# Polyglot: Just enough python to bootstrap a model if it's missing.
# This will run in the `ollama_python` container.
⨖ init_models
import os, ollama
LLM_MODEL_NAME = os.environ['LLM_MODEL_NAME']
client = ollama.Client(host=os.environ['OLLAMA_URL'])
print("Checking connection..")
models = client.list()
print("Connection ok.")
print(f"Found {len(models['models'])} models:")
for model in models['models']:
print(f" * {model.model}")
if LLM_MODEL_NAME not in [m.model for m in models['models']]:
print(f"Pulling model: {LLM_MODEL_NAME}")
client.pull(LLM_MODEL_NAME)
print(f"Successfully pulled: {LLM_MODEL_NAME}")
else:
print(f"Model {LLM_MODEL_NAME} is available.")
⨖ with svc=ollama_python entrypoint=python3 \
quiet=1 output=stderr env='LLM_MODEL_NAME' \
as compose_context
# Polyglot: Just enough python to drive the model
# This will run in the `ollama_python` container.
⨖ chat
import os, sys, ollama
client = ollama.Client(host=os.environ['OLLAMA_URL'])
system = { 'role': 'system', 'content': os.environ['LLM_PROMPT'] }
query = { 'role': 'user', 'content': sys.stdin.read() }
response = client.chat(messages=[system, query], model=os.environ['LLM_MODEL_NAME'])
print(response.message.content)
⨖ with svc=ollama_python entrypoint=python3 \
quiet=1 env='LLM_MODEL_NAME LLM_PROMPT' \
as compose_context
# Main entrypoint: Print usage info and exit.
__main__:
cmk.log(${red}USAGE: ${__file__} debate)
Our philosophers very frequently ignore their system prompts, bad robot! We'll also have to ignore the question of whether the LLM accurately reflects the nuance of the dead philosophers' points of view, but.. here's a cherry-picked result that shows how flux.pipeline/<targets> conveniently includes intermediate result previews, so we can see the whole conversation.
$ ./demos/cmk/ollama.cmk debate
≣ ollama.services // up.detach // ollama
Φ flux.pipeline // debate.moderator stage // result preview
Life is good?
Φ flux.pipeline // hobbes.talk stage // result preview
I respectfully dissent from this notion; life's relentless struggles and perpetual
conflicts are undeniably corrosive to human well-being. The state of nature
I conceived, where man lives a solitary existence without society's comfort or
safety net, is far more appealing for its simplicity. Without the constant fear
instigated by selfish desires and ambition that plague our interactions today,
life could be considerably less drear.
Φ flux.pipeline // rousseau.talk stage // result preview
I am deeply troubled to hear of such a worldview as this; your perspective seems
blinded not only to human nature but also to the essence of societal cooperation
which I envisage in mankind's natural state - one bound by empathy, simplicity,
and equality. While solitude may offer escape from conflict, it is through our
interconnectedness that we achieve true communal harmony. It would be a grievous
loss to discard the virtues of compassion and cooperation in pursuit of solitary
peace at the cost of human solidarity - an echoed sentiment I have always
cherished as paramount for social progress, even when this path is fraught
with challenges.
≣ ollama.services // ollama // stopping..
LangChain, Two Ways
For the second demo we'll sketch a small RAG pipeline, plus an example of external tool usage. Unlike most other demos, we reference external files instead of embedding container definitions and foreign code.
Retrieval-augmented generation works with python-backed langchain. Specify a corpus with a glob of file patterns, then provide a verb like summarize, ask, or chat to select how you want to interact with the corpus. The RAG citations provided for ask interactions are pretty reasonable, but citations are not implemented for chat. As for the actual answers, there are of course frequently very strange hallucinations, and quality varies wildly. Consider reading the actual project documentation =)
Tool-calling is a smaller demo, using node-backed langchain. All it does is answer time-related questions using builtins. It's tempting to do something more exciting here, like using the frink container to handle calculations and dimensional analysis, but that is left as an exercise for the reader.
Backend models default to mistral for the LLM and all-MiniLM-L6-v2 for the RAG embeddings, although you can override this pretty easily.
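For instance, the model= override shown in the usage section below works with any of the verbs:

```bash
# Use a smaller backend model for a one-off question:
$ model='phi3:mini' glob='docs/*.md' \
    query='what is this project about?' ./demos/cmk/rag.cmk ask
```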
All support code is included in the appendix. Quick links:
- Langchain driver for RAG (Python)
- Langchain driver for Tools (Javascript)
- Ollama container plus python/node deps (Docker-Compose)
- Thin wrapper to combine / orchestrate the pieces above (compose.mk script)
Usage & Example Output
The orchestration code and the support-code appendix are large, so example output and usage come first.
Tool Demo
The tool demo is used like this:
./demos/cmk/rag.cmk tools.js
Trying this 4 different times gives 3 different answers and one error (bad robot!), but you may get better results with different local models, or with larger, state-of-the-art cloud-based models.
OutputParserException [Error]: Could not parse LLM output: Today's date is July 12, 2025. If asked about the day of the week or time, I can also provide that information using the 'date' tool.
..
Answer: Today is Tuesday.
..
Answer: Today's date is July 12, 2025.
..
Answer: The date provided is July 12th, 2025, but I can't determine the specific context without more information from you.
..
RAG Demo
Tip: Most examples use the markdown glob docs/*.md, which skips over docs/demos/*.md. Using a double-splat like **.md for recursive descent into subdirectories is also supported. Output below is lightly edited to remove actual errors in some cases, but otherwise model answers are quoted verbatim.
The RAG demo is used like this:
# Ask a question about files that match the given pattern.
$ glob='docs/*.md' query='what is this project about?' \
./demos/cmk/rag.cmk ask
# Summarize files that match the given pattern.
$ glob='docs/*.md' query='what is this project about?' \
./demos/cmk/rag.cmk summarize
# Interactive chat about files that match the given pattern.
$ glob='docs/*.md' ./demos/cmk/rag.cmk chat
# Slurp any markdown matching the pattern.
# This is implied by `ask` or `chat`, and no-op if content already indexed.
$ glob='docs/*.md' ./demos/cmk/rag.cmk ingest
# Ensure given model is available to ollama, downloading if needed.
# Implied by `ingest`, `ask`, and `chat`.
$ model='phi3:mini' ./demos/cmk/rag.cmk init
# Pass everything that's needed and you can do it in one shot
$ model=.. glob=.. query=.. ./demos/cmk/rag.cmk ask
Output from asking a question looks like this:
$ query='why use this project' glob='docs/*.md' ./demos/cmk/rag.cmk ask
⇄ ask // Answering query with corpus: docs/*.md (40 files total)
Use it for decoupling CI/CD processes from platform lock-in and rapidly
incorporating external tools or code into your projects. It fosters an
environment that encourages experimentation in component design, prototyping
systems, and building console applications.
CITATIONS:
- docs/overview.md ("Typical use-cases in...")
- docs/index.md ("Typical use-cases in...")
- docs/but-why.md ("Motivation & Design...")
Using ask for fuzzy search is marginally practical considering that the output has good citations, but yeah.. you might as well grep.
$ query='which demos are related to justfiles' \
glob='docs/demos/*.md' ./demos/cmk/rag.cmk ask
⇄ ask // Answering query with corpus: docs/demos/*.md (13 files total)
The Justfile demo is directly related to using and wrapping just-runners with
the make recipe system, as well as exposing different interfaces. This
demonstrates interoperability between the two systems by showing how compose.mk
can incorporate a foreign tool like justfiles.
result:
CITATIONS:
- docs/demos/just.md ("Interoperability wit...")
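For comparison, the grep equivalent of the same fuzzy search is a one-liner, just without the generated summary; assuming the justfile demo page actually mentions 'justfile', this finds the same file that the citation points to:

```bash
# Case-insensitive search, printing only the names of matching files:
$ grep -il 'justfile' docs/demos/*.md
```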
Besides pirate support, you can also ask it to speak to you like a child. This might be useful for 'explain it like I'm a python programmer' or similar, but if you want to keep working locally, you can probably do better with bigger (and slower) models for the LLM and embeddings.
$ query='whats the best way to get started with compose.mk? explain like i am 5' \
glob='docs/overview.md' ./demos/cmk/rag.cmk ask
⇄ ask // Answering query with corpus: docs/overview.md (1 files total)
Hey there! Think of compose.mk as a toolbox that helps you build things faster
and in more ways than just using regular make commands. Since it is kind of like
adding extra fun new tools to your workshop, here are simple steps for beginners on
how to start with compose.mk:
1. Get the Toolkit! First off, remember this toolbox is not something you can
hold; that means we need a computer and internet connection because everything
starts online through GitHub where they share their awesome tools like
compose.mk.
Orchestration
The orchestration script is very basic. It mostly just uses compose.mk idioms to describe wrappers that call the python code and the javascript code inside the appropriate containers.
The main idiom used below is compose.bind.target. It attaches an existing target to an existing tool container and sets up pass-through for the given environment variables. Since "private" targets like self.* run inside containers, they can safely use python or node without assuming either is available on the host.
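As a stripped-down illustration of the idiom (the hello target below is hypothetical, not part of the demo), the public target declares the binding while the matching self.* body is what actually runs inside the container:

```Makefile
# Public entrypoint: bind to the `ollama_python` service, passing `query` through.
hello: ᝏcompose.bind.target(ollama_python, env=query)

# Private body: runs inside the container, where python3 is known to exist.
self.hello:
	python3 -c "import os; print('query was:', os.environ.get('query'))"
```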
#!/usr/bin/env -S ./compose.mk mk.interpret!
# Demonstrates doing some LLM stuff with compose.mk, using docker-compose,
# ollama and langchain. This includes a small RAG pipeline, and
# also demonstrates tool-usage. No host dependencies, no host ports.
#
# This demo ships with the `compose.mk` repository, but does NOT
# run as part of the default test-suite (model requirements are ~5GB).
#
# See the main docs:
# https://robot-wranglers.github.io/compose.mk/demos/ai
# This creates target-scaffolding for all containers in the compose file
compose.import(file=demos/data/docker-compose.rag.yml)
# Constants, defaults, other boilerplate
export quiet=1
export LLM_MODEL_NAME?=mistral
export EBD_MODEL_NAME?=all-MiniLM-L6-v2
export CMK_AT_EXIT_TARGETS=ollama_server.stop
# Helper macros to script invocation, and showing
# the file-counts for the glob being used.
llm_tools=demos/data/llm_tools.mjs
llm_rag=set -x -o noglob && python demos/data/llm_rag.py
corpus_count=$${glob} (${yellow}`ls $${glob} \
| ${stream.count.lines}`${no_ansi_dim} files total)
# Tools demo.
tools.js: ᝏcompose.bind.target(ollama_node, env=query)
self.tools.js:
cp /workspace/${llm_tools} /app
cd /app && node llm_tools.mjs
# Ensure given model is available to ollama, downloading if needed.
# Implied by `ingest`, `ask`, and `chat`.
init: ᝏcompose.bind.target(ollama_python)
self.init:
cmk.log.target(Initializing model: $${LLM_MODEL_NAME})
${llm_rag} init
# Slurp any files matching the pattern.
# This is implied by `ask` or `chat`, and no-op if content already indexed.
ingest: ᝏcompose.bind.target(ollama_python, env=glob)
self.ingest:
cmk.log.target(Ingesting corpus: ${corpus_count})
${llm_rag} ingest "$${glob}"
# Ask a question about files that match the given pattern.
ask: ᝏcompose.bind.target(ollama_python, env='glob query')
self.ask:
cmk.log.target(Answering query with corpus: ${corpus_count})
${llm_rag} query "$${glob}" "$${query:-describe it}"
# Interactive chat about files that match the given pattern.
chat: ᝏcompose.bind.target(ollama_python, env=glob)
self.chat:
cmk.log.target(Chatting with corpus: ${corpus_count})
${llm_rag} chat "$${glob}"
# New verb: builds on top of the script's API instead of just using it.
# An alias for `ask` that summarizes files matching the given pattern.
summarize:
query="summarize all of the content" this.ask
__main__:
cmk.log(${red}USAGE: See https://robot-wranglers.github.io/compose.mk/demos/ai/)
Again, describing dispatch and import idioms in detail is out of scope for this page, since that's mostly explained elsewhere. (For general background, see the documentation on scaffolding, container dispatch, and CMK/Makefile transpilation.)
A few more specific remarks though:
- Using node without a package.json is a bit of a pain, hence the copy + path manipulations in self.tools.js.
- The python scripts for RAG take arguments that are file globs. We need to ensure that the shell doesn't expand these into file lists, hence set -o noglob (a quick demonstration follows this list).
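The glob issue is easy to see in any shell; without noglob the pattern is expanded into filenames before the script ever receives it:

```bash
$ echo docs/*.md           # the shell expands the pattern into a file list
docs/index.md docs/overview.md ...
$ set -o noglob
$ echo docs/*.md           # now the pattern is passed through literally
docs/*.md
```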
Packaging It
As mentioned at the beginning, we also want to generate a frozen version of all the work so far. See the main packaging docs for more details, but this one-liner is enough to build the executable archive that we need.
$ archive='docs demos' bin=docs.agent ./demos/cmk/rag.cmk mk.pkg.root
This command creates a new executable called ./docs.agent, and packages up all of the following:
- The documentation corpus itself
- The compose file (and thus, effectively, the ollama/langchain dependencies)
- The python/node scripts that define the LLM operations
- The rag.cmk orchestration script, and compose.mk itself
To use this self-contained version, we still need to specify arguments and entrypoints, which is similar to but slightly different from the main usage described earlier. As before, verbs like init and interactive chat are also available, and below you can see another example using ask.
$ query='which features are related to interactivity' \
glob='docs/*.md' ./docs.agent -- ask
The interactive task selector feature mentioned in the context is directly
related to interactivity, as it allows users to select tasks interactively
within a TUI environment using compose.mk capabilities.
[snip]
This example should now be able to run almost anywhere, including places where ollama is not available, where ports cannot be opened, where models are not yet available, and where python is not even installed.
You can download the agent here, make it executable with chmod +x docs.agent, and you should be able to try it out.
Appendix: Support Code
Appendix: Python LangChain Code
"""
See the main docs: https://robot-wranglers.github.io/compose.mk/demos/ai
"""
import os, sys, hashlib, logging
import click, ollama
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaLLM
from langchain_text_splitters import RecursiveCharacterTextSplitter
LLM_MODEL_NAME = os.environ.get("LLM_MODEL", "phi3:mini")
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "localhost")
OLLAMA_PORT = os.environ.get("OLLAMA_PORT", "11434")
OLLAMA_URL = f"http://{OLLAMA_HOST}:{OLLAMA_PORT}"
USE_CACHE = os.environ.get("USE_CACHE", "1") == "1"
EMBEDDINGS = HuggingFaceEmbeddings(model_name=os.environ.get("EBD_MODEL_NAME", "all-MiniLM-L6-v2"))
OLLAMA_CLIENT = ollama.Client(host=OLLAMA_URL)
LLM = OllamaLLM(model=LLM_MODEL_NAME, base_url=OLLAMA_URL)
TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use 5 sentences or less and keep the answer as concise as possible.
{context}
Question: {question}"""
PROMPT = PromptTemplate.from_template(TEMPLATE)
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stderr))
def hashed(glob):
return hashlib.md5(glob.encode()).hexdigest()
def get_vectors(glob, fname=None):
if not USE_CACHE or fname is None:
documents = DirectoryLoader(os.getcwd(), glob=glob).load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
logger.debug("Computing the FAISS index, this might take a while")
return FAISS.from_documents(docs, EMBEDDINGS)
return FAISS.load_local(
os.path.join(os.getcwd(), fname),
EMBEDDINGS,
allow_dangerous_deserialization=True,
)
def model_exists(model_name):
models = [m.model for m in OLLAMA_CLIENT.list()['models']]
return model_name in models
def pull_model(model_name=LLM_MODEL_NAME):
logger.debug(f"Checking for model '{model_name}'...")
if not model_exists(model_name):
logger.debug(f"Pulling model '{model_name}', please wait...")
OLLAMA_CLIENT.pull(model_name)
logger.debug(f"Model '{model_name}' is ready")
def show_response(response):
if isinstance(response, (dict,)):
result = response.get("result", "RAG result failure?")
citations = []
for doc in response["source_documents"]:
citations += [
f"- {doc.metadata.get('source', 'unknown source')} (\"{doc.page_content[:20].strip()}...\")"
]
citations = "\n".join(["CITATIONS:\n"] + citations)
else:
result = response
citations = "\n".join(
["CITATIONS: (not implemented yet for `chat`, use `ask` instead)"]
)
logger.debug(f"\n{result}\n\n{citations}")
@click.group()
def cli():
"""RAG pipeline demo"""
@cli.command()
@click.argument("glob")
@click.option("--force", is_flag=True, default=False,)
def chat(glob, force=False):
db_path = _ingest(glob, force=force)
base_retriever = get_vectors(glob, db_path).as_retriever()
retrieval_chain = (
{"context": base_retriever, "question": RunnablePassthrough()}
| PROMPT | LLM | StrOutputParser()
)
logger.debug("Starting interactive session. Use exit, quit, q, or Ctrl+D to exit.")
while True:
query = input(">> ")
if query.lower() in ["exit", "quit", "q"]:
break
response = retrieval_chain.invoke(query)
show_response(response)
@cli.command()
@click.argument("glob")
@click.argument("query", nargs=-1)
@click.option("--force", is_flag=True, default=False,)
def query(glob, query, force=False):
pull_model()
db_path = _ingest(glob, force=force)
query = " ".join(query)
logger.debug(f"Got query: {query}")
retriever = get_vectors(glob, db_path).as_retriever()
logger.debug("Running query..")
qa_chain = RetrievalQA.from_chain_type(
LLM, retriever=retriever, return_source_documents=True
)
response = qa_chain.invoke(query)
show_response(response)
@cli.command()
@click.argument("glob")
@click.option("--force", is_flag=True, default=False,)
def ingest(glob, force=False):
return _ingest(glob, force=force)
def _ingest(glob, force=False):
db_path = f".tmp.faiss_index.{hashed(glob)}"
logger.debug(f"FAISS index: {db_path}")
if os.path.exists(db_path) and not force:
logger.debug("Index already exists, remove it to reingest.")
else:
force and logger.debug("Forcing reingest..")
get_vectors(glob).save_local(db_path)
return db_path
@cli.command()
def init():
pull_model()
if __name__ == "__main__":
cli()
Appendix: Javascript LangChain Code
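The actual driver ships in the repository as demos/data/llm_tools.mjs. Below is a minimal sketch of the same idea, assuming the @langchain/community Ollama wrapper plus the (legacy) zero-shot agent helper with a single date tool; names, defaults, and prompt handling here are illustrative rather than copied from the repo file.

```javascript
// Sketch of llm_tools.mjs: answer time-related questions with a single `date` tool.
import { Ollama } from "@langchain/community/llms/ollama";
import { DynamicTool } from "@langchain/core/tools";
import { initializeAgentExecutorWithOptions } from "langchain/agents";

// Assumed environment: OLLAMA_HOST and LLM_MODEL come from the compose file,
// and `query` is passed through by the compose.bind.target wrapper.
const model = new Ollama({
  baseUrl: `http://${process.env.OLLAMA_HOST ?? "ollama_server"}:11434`,
  model: process.env.LLM_MODEL ?? "mistral",
});

// The only tool on offer: report the current date and time.
const tools = [
  new DynamicTool({
    name: "date",
    description: "Returns the current date and time.",
    func: async () => new Date().toString(),
  }),
];

// Zero-shot ReAct-style agent; its output parser is strict about the
// expected "Action:" / "Final Answer:" format.
const executor = await initializeAgentExecutorWithOptions(tools, model, {
  agentType: "zero-shot-react-description",
});

const result = await executor.invoke({
  input: process.env.query ?? "What day is it today?",
});
console.log(`Answer: ${result.output}`);
```

That strict output parser is presumably also the source of the OutputParserException in the sample output above, whenever the model drifts out of the expected format.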
"""
See the main docs: https://robot-wranglers.github.io/compose.mk/demos/ai
"""
import os, sys, hashlib, logging
import click, ollama
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaLLM
from langchain_text_splitters import RecursiveCharacterTextSplitter
LLM_MODEL_NAME = os.environ.get("LLM_MODEL", "phi3:mini")
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "localhost")
OLLAMA_PORT = os.environ.get("OLLAMA_PORT", "11434")
OLLAMA_URL = f"http://{OLLAMA_HOST}:{OLLAMA_PORT}"
USE_CACHE = os.environ.get("USE_CACHE", "1") == "1"
EMBEDDINGS = HuggingFaceEmbeddings(model_name=os.environ.get("EBD_MODEL_NAME", "all-MiniLM-L6-v2"))
OLLAMA_CLIENT = ollama.Client(host=OLLAMA_URL)
LLM = OllamaLLM(model=LLM_MODEL_NAME, base_url=OLLAMA_URL)
TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use 5 sentences or less and keep the answer as concise as possible.
{context}
Question: {question}"""
PROMPT = PromptTemplate.from_template(TEMPLATE)
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stderr))
def hashed(glob):
return hashlib.md5(glob.encode()).hexdigest()
def get_vectors(glob, fname=None):
if not USE_CACHE or fname is None:
documents = DirectoryLoader(os.getcwd(), glob=glob).load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
logger.debug("Computing the FAISS index, this might take a while")
return FAISS.from_documents(docs, EMBEDDINGS)
return FAISS.load_local(
os.path.join(os.getcwd(), fname),
EMBEDDINGS,
allow_dangerous_deserialization=True,
)
def model_exists(model_name):
models = [m.model for m in OLLAMA_CLIENT.list()['models']]
return model_name in models
def pull_model(model_name=LLM_MODEL_NAME):
logger.debug(f"Checking for model '{model_name}'...")
if not model_exists(model_name):
logger.debug(f"Pulling model '{model_name}', please wait...")
OLLAMA_CLIENT.pull(model_name)
logger.debug(f"Model '{model_name}' is ready")
def show_response(response):
if isinstance(response, (dict,)):
result = response.get("result", "RAG result failure?")
citations = []
for doc in response["source_documents"]:
citations += [
f"- {doc.metadata.get('source', 'unknown source')} (\"{doc.page_content[:20].strip()}...\")"
]
citations = "\n".join(["CITATIONS:\n"] + citations)
else:
result = response
citations = "\n".join(
["CITATIONS: (not implemented yet for `chat`, use `ask` instead)"]
)
logger.debug(f"\n{result}\n\n{citations}")
@click.group()
def cli():
"""RAG pipeline demo"""
@cli.command()
@click.argument("glob")
@click.option("--force", is_flag=True, default=False,)
def chat(glob, force=False):
db_path = _ingest(glob, force=force)
base_retriever = get_vectors(glob, db_path).as_retriever()
retrieval_chain = (
{"context": base_retriever, "question": RunnablePassthrough()}
| PROMPT | LLM | StrOutputParser()
)
logger.debug("Starting interactive session. Use exit, quit, q, or Ctrl+D to exit.")
while True:
query = input(">> ")
if query.lower() in ["exit", "quit", "q"]:
break
response = retrieval_chain.invoke(query)
show_response(response)
@cli.command()
@click.argument("glob")
@click.argument("query", nargs=-1)
@click.option("--force", is_flag=True, default=False,)
def query(glob, query, force=False):
pull_model()
db_path = _ingest(glob, force=force)
query = " ".join(query)
logger.debug(f"Got query: {query}")
retriever = get_vectors(glob, db_path).as_retriever()
logger.debug("Running query..")
qa_chain = RetrievalQA.from_chain_type(
LLM, retriever=retriever, return_source_documents=True
)
response = qa_chain.invoke(query)
show_response(response)
@cli.command()
@click.argument("glob")
@click.option("--force", is_flag=True, default=False,)
def ingest(glob, force=False):
return _ingest(glob, force=force)
def _ingest(glob, force=False):
db_path = f".tmp.faiss_index.{hashed(glob)}"
logger.debug(f"FAISS index: {db_path}")
if os.path.exists(db_path) and not force:
logger.debug("Index already exists, remove it to reingest.")
else:
force and logger.debug("Forcing reingest..")
get_vectors(glob).save_local(db_path)
return db_path
@cli.command()
def init():
pull_model()
if __name__ == "__main__":
cli()
Appendix: Ollama Container
# See the demo docs: https://robot-wranglers.github.io/compose.mk/demos/RAG
#
# NB: GPU support is disabled by default; to enable it, use something like the
# following in the ollama_server block. You must also remove the '+cpu' from the torch requirement.
#
# https://docs.docker.com/compose/how-tos/gpu-support/
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: 1
# capabilities: [gpu]
# NB: this allows model-sharing for any host-installation of ollama if
# it's available. Probably wrong for non-Linux and also maybe wrong for
# some ollama versions?
# Share the host's working directory; this is for document ingestion
# NB: this allows model-sharing for the host-installation of
# hugging face if applicable. Probably wrong for non-Linux and
# also maybe wrong for some library versions?
services:
ollama_server: &base
build:
context: .
dockerfile_inline: |
FROM ollama/ollama@sha256:476b956cbe76f22494f08400757ba302fd8ab6573965c09f1e1a66b2a7b0eb77
entrypoint: ['ollama','serve']
working_dir: /workspace
volumes:
- /usr/share/ollama/.ollama:/root/.ollama
- ${PWD}:/workspace
- ${HOME}/.cache/huggingface:/root/.cache/huggingface
environment: &base_env
OLLAMA_HOST: ollama_server
LLM_MODEL: ${LLM_MODEL:-mistral}
EBD_MODEL_NAME: ${EBD_MODEL_NAME:-all-MiniLM-L6-v2}
ollama_python:
<<: *base
entrypoint: ["python3.11"]
depends_on: ['ollama_server']
build:
context: .
dockerfile_inline: |
FROM python:3.11-slim-bookworm
RUN pip install 'torch==2.6.0+cpu' \
--extra-index-url https://download.pytorch.org/whl/cpu
RUN pip install --no-cache-dir \
langchain==0.3.22 langchain-community==0.3.20 \
langchain-huggingface==0.1.2 langchain-ollama==0.3.0 \
ollama==0.4.7 sentence-transformers==4.0.2 faiss-cpu==1.10.0 \
unstructured[md]==0.17.2 click==8.1.8 \
accelerate==1.6.0 python-magic==0.4.27
RUN apt-get update -qq && apt-get install -y make procps
ollama_node:
<<: *base
depends_on: ['ollama_server']
# working_dir: /app
build:
context: .
dockerfile_inline: |
FROM node:18-slim
WORKDIR /app
RUN npm init -y
RUN npm install @langchain/community @langchain/core langchain
RUN apt-get update -qq && apt-get install -y make procps
References
1. See the compose file for hints about enabling GPUs.
2. See https://github.com/Mozilla-Ocho/llamafile, and the RAG demo.