Obligatory AI Things
LLMs and compose.mk might seem like strange companions at first, but there are some edge cases where the combination is pretty interesting. A few observations about this:
- Code-generation for things like shell / Makefile / awk is often incredibly reliable compared to most other things, and the available training data is huge.
- New capabilities for code-generation actually favor polyglot design, and allow for choosing the best tool for every job rather than always picking the most familiar tool.
- Even small AI-backed tools require significant dependencies and/or infrastructure. Shipping them so that they bootstrap a local ollama (or similar) themselves, rather than with a list of setup instructions, tends to lower the barrier to entry.
- Related to the last point, tool-use and agents can both benefit from a "containers first" type of approach. This helps not only with dependency-management but also with sandboxing.
- AI-driven help with prototyping can very quickly leave you with tons of files (and all of the implied context-switching) for every little experiment.
While much of compose.mk is focused on glue-code and orchestration, several of the topics above are still pretty closely related to the compose.mk core competencies. Generating small components also becomes a lot more useful if there is some kind of plan for organizing and packaging them, and those are also things that compose.mk can help with.
Overview & Background
This page includes two demos:
The first demo is very minimal, focuses on being self-contained, and shows how to use an embedded ollama server and separate prompts/code. This is somewhat in the spirit of shell-gpt, and passes data on pipes. It's a reasonably good template for small tools that should run anywhere, specify their own dependencies, and download/install everything (including the backend model) just in time.
The second demo is more elaborate. It uses external files for most of the implementation, and it lets you chat with the compose.mk documentation (or any other files) using a small RAG pipeline with ollama and langchain.
Both demos are written in CMK-lang rather than starting from pure Makefile. This page focuses more on the demos themselves than on exposition of CMK itself.
Both demos are also local-first and CLI-friendly, and as long as you have make and docker they will bootstrap themselves. Users can therefore enjoy: no daemons required, no open ports, no manual installation of python/node stacks or dependencies, no API tokens, etc. And pirates, of course. Here's one describing compose.mk's polyglot capabilities:
$ query='describe compose.mk polyglot capabilities as if you are a pirate' \
glob='docs/*.md' ./demos/cmk/rag.cmk ask
Ahoy mateys! Arr, in this world o' languages, I be savvy enough to juggle 'em
like yer ol' bottle collection— each one perfect for its own purpose. With me
trusted compose.mk and some clever splicin', even the most bewilderin tasks
gets done easy as pie. Be it leveragin language just right or summonin matrix golf
with APL, I command these languages better than any captain commands his ship!
Now that be a mighty fine skill to have in this treacherous sea o' code, see?
CITATIONS:
- /workspace/docs/demos/polyglots.md ("Part of the be...")
- /workspace/docs/style.md ("Explicit polyglot in...")
- /workspace/docs/but-why.md ("Polyglots Considered...")
Caveats
Unlike most of the other demos, these don't run as part of the test suite, because they are demanding in terms of disk space and CPU. Poor-quality and relatively slow summaries of a few paragraphs across only ~40 markdown files will cost you a ~4GB download of mistral if you don't already have it. In a lot of ways, you might as well just grep! GPU acceleration is also explicitly disabled, in the interest of portability [1].
These are basically just toys, and there are better ways to embed ollama in an application [2]. Still, this takes a problem that is "obviously a service and/or a system" and turns it into a stand-alone tool instead. By handling startup and shutdown, and by not requiring forwarded ports, orchestration abstracts the ollama server's lifecycle, and that's the sort of thing compose.mk is good at. To make sure this idea is more than wishful thinking, we'll also use compose.mk packaging features to wrap all the moving parts and the documents into a single self-extracting executable.
Dueling Philosophers
Something simple first, in the form of dueling philosophers: let's watch Hobbes fight Rousseau about whether or not life is good. This demo uses an embedded ollama server, two prompts, and just enough python code to bootstrap a backend model if one is not already present.
Since the python code required is minimal, this is a good candidate for polyglot-style embedding, and even including prompts and container specs we're only looking at ~100 lines:
#!/usr/bin/env -S ./compose.mk mk.interpret!
# Demonstrates building a self-contained ollama application with CMK-lang.
#
# This demo ships with the `compose.mk` repository.
# (Not part of the test-suite since model requirements are ~5GB)
#
# See the main docs:
# https://robot-wranglers.github.io/compose.mk/demos/ai
export CMK_AT_EXIT_TARGETS=ollama_server.stop
export LLM_MODEL_NAME?=phi3:mini
# Inlined docker-compose services.
# 1 container for the ollama server, and 1 for the client.
# Volume allows model-sharing for any host-installation of ollama if available;
# this is probably wrong for non-Linux and also maybe wrong for some ollama versions?
⋘ ollama.services
services:
ollama_server: &base
build:
context: .
dockerfile_inline: |
FROM ollama/ollama@sha256:476b956cbe76f22494f08400757ba302fd8ab6573965c09f1e1a66b2a7b0eb77
working_dir: /workspace
entrypoint: ['ollama','serve']
volumes: ['${PWD}:/workspace','/usr/share/ollama/.ollama:/root/.ollama']
ollama_python:
<<: *base
entrypoint: python3
environment: ['OLLAMA_URL=http://ollama_server:11434/']
build:
context: .
dockerfile_inline: |
FROM python:3.11-slim-bookworm
RUN pip install --no-cache-dir ollama==0.4.7
⋙
# Philosopher prompts
🞹 hobbes.prompt
You are a reincarnation of Thomas Hobbes.
Erudite and insightful but also cantankerous, pessimistic, and paranoid.
Respond to the following point of view with your own, 1 paragraph max.
Do you agree or disagree? Do not restate the system prompt or user prompt.
🞹
🞹 rousseau.prompt
You are a reincarnation of Jean-Jacques Rousseau. Passionate and idealistic,
but also hypersensitive, seeing corruption where others see progress.
Respond to the following point of view with your own, 1 paragraph max.
Do you agree or disagree? Do not restate the system prompt or user prompt.
🞹
# Entrypoints for philosophers:
# Bind their prompts to expected env-vars, then run the chat polyglot.
hobbes.talk:; cmk.bind.def.to.env(hobbes.prompt, LLM_PROMPT) && this.chat
rousseau.talk:; cmk.bind.def.to.env(rousseau.prompt, LLM_PROMPT) && this.chat
# Seed to kick off the debate.
debate.moderator:; echo "Life is good?"
# Start the ollama service, init models,
# seed the argument, then let the philosophers talk.
debate: ollama_server.up.detach init_models \
flux.pipeline.verbose/debate.moderator,hobbes.talk,rousseau.talk
# Polyglot: Just enough python to bootstrap a model if it's missing.
# This will run in the `ollama_python` container.
⨖ init_models
import os, ollama
LLM_MODEL_NAME = os.environ['LLM_MODEL_NAME']
client = ollama.Client(host=os.environ['OLLAMA_URL'])
print("Checking connection..")
models = client.list()
print("Connection ok.")
print(f"Found {len(models['models'])} models:")
for model in models['models']:
print(f" * {model.model}")
if LLM_MODEL_NAME not in [m.model for m in models['models']]:
print(f"Pulling model: {LLM_MODEL_NAME}")
client.pull(LLM_MODEL_NAME)
print(f"Successfully pulled: {LLM_MODEL_NAME}")
else:
print(f"Model {LLM_MODEL_NAME} is available.")
⨖ with svc=ollama_python entrypoint=python3 \
quiet=1 output=stderr env='LLM_MODEL_NAME' \
as compose_context
# Polyglot: Just enough python to drive the model
# This will run in the `ollama_python` container.
⨖ chat
import os, sys, ollama
client = ollama.Client(host=os.environ['OLLAMA_URL'])
system = { 'role': 'system', 'content': os.environ['LLM_PROMPT'] }
query = { 'role': 'user', 'content': sys.stdin.read() }
response = client.chat(messages=[system, query], model=os.environ['LLM_MODEL_NAME'])
print(response.message.content)
⨖ with svc=ollama_python entrypoint=python3 \
quiet=1 env='LLM_MODEL_NAME LLM_PROMPT' \
as compose_context
# Main entrypoint: Print usage info and exit.
__main__:
cmk.log(${red}USAGE: ${__file__} debate)
Our philosophers very frequently ignore their system prompts, bad robot! We'll also have to ignore the question of whether the LLM accurately reflects the nuance of the dead philosophers' points of view, but.. here's a cherry-picked result that shows how flux.pipeline/<targets> conveniently includes intermediate result previews, so we can see the whole conversation.
$ ./demos/cmk/ollama.cmk debate
≣ ollama.services // up.detach // ollama
Φ flux.pipeline // debate.moderator stage // result preview
Life is good?
Φ flux.pipeline // hobbes.talk stage // result preview
I respectfully dissent from this notion; life's relentless struggles and perpetual
conflicts are undeniably corrosive to human well-being. The state of nature
I conceived, where man lives a solitary existence without society's comfort or
safety net, is far more appealing for its simplicity. Without the constant fear
instigated by selfish desires and ambition that plague our interactions today,
life could be considerably less drear.
Φ flux.pipeline // rousseau.talk stage // result preview
I am deeply troubled to hear of such a worldview as this; your perspective seems
blinded not only to human nature but also to the essence of societal cooperation
which I envisage in mankind's natural state - one bound by empathy, simplicity,
and equality. While solitude may offer escape from conflict, it is through our
interconnectedness that we achieve true communal harmony. It would be a grievous
loss to discard the virtues of compassion and cooperation in pursuit of solitary
peace at the cost of human solidarity - an echoed sentiment I have always
cherished as paramount for social progress, even when this path is fraught
with challenges.
≣ ollama.services // ollama // stopping..
LangChain, Two Ways
For the second demo we'll sketch a small RAG pipeline, plus an example of external tool usage. Unlike most other demos, we reference external files instead of embedding container definitions and foreign code.
Retrieval-augmented generation works with python-backed langchain. Specify a corpus with a glob of file patterns, then provide a verb like summarize, ask, or chat to select how you want to interact with the corpus. The RAG citations provided for ask interactions are pretty reasonable, but citations are not implemented for chat. As for the actual answers, there are of course frequently very strange hallucinations, and quality varies wildly. Consider reading the actual project documentation =)
Tool-calling is a smaller demo, using node-backed langchain. All it does is answer time-related questions using builtins. It's tempting to do something more exciting here, like using the frink container to handle calculations and dimensional analysis, but that is left as an exercise for the reader.
Backend models default to mistral for the LLM and all-MiniLM-L6-v2 for the RAG embeddings, although you can override this pretty easily.
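For instance, the model= override shown in the usage section below works with any of the verbs:

```bash
# Use a smaller backend model for a one-off question:
$ model='phi3:mini' glob='docs/*.md' \
    query='what is this project about?' ./demos/cmk/rag.cmk ask
```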
All support code is included in the appendix. Quick links:
- Langchain driver for RAG (Python)
- Langchain driver for Tools (Javascript)
- Ollama container plus python/node deps (Docker-Compose)
- Thin wrapper to combine / orchestrate the pieces above (compose.mk script)
Usage & Example Output
The orchestration code and the support-code appendix are large, so example output and usage come first.
Tool Demo
The tool demo is used like this:
./demos/cmk/rag.cmk tools.js
Trying this 4 different times gives 3 different answers and one error (bad robot!), but you may get better results with different local models, or with larger, state-of-the-art cloud-based models.
OutputParserException [Error]: Could not parse LLM output: Today's date is July 12, 2025. If asked about the day of the week or time, I can also provide that information using the 'date' tool.
..
Answer: Today is Tuesday.
..
Answer: Today's date is July 12, 2025.
..
Answer: The date provided is July 12th, 2025, but I can't determine the specific context without more information from you.
..
RAG Demo
Tip: Most examples use the markdown glob docs/*.md, which skips over docs/demos/*.md. Using a double-splat like **.md for recursive descent into subdirectories is also supported. Output below is lightly edited to remove actual errors in some cases, but otherwise model answers are quoted verbatim.
The RAG demo is used like this:
# Ask a question about files that match the given pattern.
$ glob='docs/*.md' query='what is this project about?' \
./demos/cmk/rag.cmk ask
# Summarize files that match the given pattern.
$ glob='docs/*.md' query='what is this project about?' \
./demos/cmk/rag.cmk summarize
# Interactive chat about files that match the given pattern.
$ glob='docs/*.md' ./demos/cmk/rag.cmk chat
# Slurp any markdown matching the pattern.
# This is implied by `ask` or `chat`, and no-op if content already indexed.
$ glob='docs/*.md' ./demos/cmk/rag.cmk ingest
# Ensure given model is available to ollama, downloading if needed.
# Implied by `ingest`, `ask`, and `chat`.
$ model='phi3:mini' ./demos/cmk/rag.cmk init
# Pass everything that's needed and you can do it in one shot
$ model=.. glob=.. query=.. ./demos/cmk/rag.cmk ask
Output from asking a question looks like this:
$ query='why use this project' glob='docs/*.md' ./demos/cmk/rag.cmk ask
⇄ ask // Answering query with corpus: docs/*.md (40 files total)
Use it for decoupling CI/CD processes from platform lock-in and rapidly
incorporating external tools or code into your projects. It fosters an
environment that encourages experimentation in component design, prototyping
systems, and building console applications.
CITATIONS:
- docs/overview.md ("Typical use-cases in...")
- docs/index.md ("Typical use-cases in...")
- docs/but-why.md ("Motivation & Design...")
Using ask for fuzzy search is marginally practical considering that the output has good citations, but yeah.. you might as well grep.
$ query='which demos are related to justfiles' \
glob='docs/demos/*.md' ./demos/cmk/rag.cmk ask
⇄ ask // Answering query with corpus: docs/demos/*.md (13 files total)
The Justfile demo is directly related to using and wrapping just-runners with
the make recipe system, as well as exposing different interfaces. This
demonstrates interoperability between the two systems by showing how compose.mk
can incorporate a foreign tool like justfiles.
result:
CITATIONS:
- docs/demos/just.md ("Interoperability wit...")
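For comparison, the grep equivalent of the same fuzzy search is a one-liner, just without the generated summary; assuming the justfile demo page actually mentions 'justfile', this finds the same file that the citation points to:

```bash
# Case-insensitive search, printing only the names of matching files:
$ grep -il 'justfile' docs/demos/*.md
```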
Besides pirate support, you can also ask it to speak to you like a child. This might be useful for 'explain it like I'm a python programmer' or similar, but if you want to keep working locally, you can probably do better with bigger (and slower) models for the LLM and embeddings.
$ query='whats the best way to get started with compose.mk? explain like i am 5' \
glob='docs/overview.md' ./demos/cmk/rag.cmk ask
⇄ ask // Answering query with corpus: docs/overview.md (1 files total)
Hey there! Think of compose.mk as a toolbox that helps you build things faster
and in more ways than just using regular make commands. Since it is kind of like
adding extra fun new tools to your workshop, here are simple steps for beginners on
how to start with compose.mk:
1. Get the Toolkit! First off, remember this toolbox is not something you can
hold; that means we need a computer and internet connection because everything
starts online through GitHub where they share their awesome tools like
compose.mk.
Orchestration
The orchestration script is very basic. It mostly just uses compose.mk idioms to describe wrappers that call the python code and the javascript code inside the appropriate containers.
The main idiom used below is compose.bind.target. It attaches an existing target to an existing tool container and sets up pass-through for the given environment variables. Since "private" targets like self.* run inside containers, they can safely use python or node without assuming either is available on the host.
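As a stripped-down illustration of the idiom (the hello target below is hypothetical, not part of the demo), the public target declares the binding while the matching self.* body is what actually runs inside the container:

```Makefile
# Public entrypoint: bind to the `ollama_python` service, passing `query` through.
hello: ᝏcompose.bind.target(ollama_python, env=query)

# Private body: runs inside the container, where python3 is known to exist.
self.hello:
	python3 -c "import os; print('query was:', os.environ.get('query'))"
```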
#!/usr/bin/env -S ./compose.mk mk.interpret!
# Demonstrates doing some LLM stuff with compose.mk, using docker-compose,
# ollama and langchain. This includes a small RAG pipeline, and
# also demonstrates tool-usage. No host dependencies, no host ports.
#
# This demo ships with the `compose.mk` repository, but does NOT
# run as part of the default test-suite (model requirements are ~5GB).
#
# See the main docs:
# https://robot-wranglers.github.io/compose.mk/demos/ai
# This creates target-scaffolding for all containers in the compose file
compose.import(file=demos/data/docker-compose.rag.yml)
# Constants, defaults, other boilerplate
export quiet=1
export LLM_MODEL_NAME?=mistral
export EBD_MODEL_NAME?=all-MiniLM-L6-v2
export CMK_AT_EXIT_TARGETS=ollama_server.stop
# Helper macros to script invocation, and showing
# the file-counts for the glob being used.
llm_tools=demos/data/llm_tools.mjs
llm_rag=set -x -o noglob && python demos/data/llm_rag.py
corpus_count=$${glob} (${yellow}`ls $${glob} \
| ${stream.count.lines}`${no_ansi_dim} files total)
# Tools demo.
tools.js: ᝏcompose.bind.target(ollama_node, env=query)
self.tools.js:
cp /workspace/${llm_tools} /app
cd /app && node llm_tools.mjs
# Ensure given model is available to ollama, downloading if needed.
# Implied by `ingest`, `ask`, and `chat`.
init: ᝏcompose.bind.target(ollama_python)
self.init:
cmk.log.target(Initializing model: $${LLM_MODEL_NAME})
${llm_rag} init
# Slurp any files matching the pattern.
# This is implied by `ask` or `chat`, and no-op if content already indexed.
ingest: ᝏcompose.bind.target(ollama_python, env=glob)
self.ingest:
cmk.log.target(Ingesting corpus: ${corpus_count})
${llm_rag} ingest "$${glob}"
# Ask a question about files that match the given pattern.
ask: ᝏcompose.bind.target(ollama_python, env='glob query')
self.ask:
cmk.log.target(Answering query with corpus: ${corpus_count})
${llm_rag} query "$${glob}" "$${query:-describe it}"
# Interactive chat about files that match the given pattern.
chat: ᝏcompose.bind.target(ollama_python, env=glob)
self.chat:
cmk.log.target(Chatting with corpus: ${corpus_count})
${llm_rag} chat "$${glob}"
# New verb: builds on top of the script's API instead of just using it.
# An alias for `ask` that summarizes files matching the given pattern.
summarize:
query="summarize all of the content" this.ask
__main__:
cmk.log(${red}USAGE: See https://robot-wranglers.github.io/compose.mk/demos/ai/)
Again, describing dispatch and import idioms in detail is out of scope for this page, since that's mostly explained elsewhere. (For general background, see the documentation on scaffolding, container dispatch, and CMK/Makefile transpilation.)
A few more specific remarks though:
- Using node without a package.json is a bit of a pain, hence the copy + path manipulations in self.tools.js.
- The python scripts for RAG take arguments that are file globs. We need to ensure that the shell doesn't expand these into file lists, hence set -o noglob (a quick demonstration follows this list).
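The glob issue is easy to see in any shell; without noglob the pattern is expanded into filenames before the script ever receives it:

```bash
$ echo docs/*.md           # the shell expands the pattern into a file list
docs/index.md docs/overview.md ...
$ set -o noglob
$ echo docs/*.md           # now the pattern is passed through literally
docs/*.md
```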
Packaging It
As mentioned at the beginning, we also want to generate a frozen version of all the work so far. See the main packaging docs for more details, but this one-liner is enough to build the executable archive that we need.
$ archive='docs demos' bin=docs.agent ./demos/cmk/rag.cmk mk.pkg.root
This command creates a new executable called ./docs.agent, and packages up all of the following:
- The documentation corpus itself
- The compose file (and thus, effectively, the ollama/langchain dependencies)
- The python/node scripts that define the LLM operations
- The rag.cmk orchestration script, and compose.mk itself
To use this self-contained version, we still need to specify arguments and entrypoints, which is similar to but slightly different from the main usage described earlier. As before, verbs like init and interactive chat are also available, and below you can see another example using ask.
$ query='which features are related to interactivity' \
glob='docs/*.md' ./docs.agent -- ask
The interactive task selector feature mentioned in the context is directly
related to interactivity, as it allows users to select tasks interactively
within a TUI environment using compose.mk capabilities.
[snip]
This example should now be able to run almost anywhere, including places where ollama is not available, where ports cannot be opened, where models are not yet available, and where python is not even installed.
You can download the agent here, make it executable with chmod +x docs.agent, and you should be able to try it out.
Appendix: Support Code
Appendix: Python LangChain Code
"""
See the main docs: https://robot-wranglers.github.io/compose.mk/demos/ai
"""
import os, sys, hashlib, logging
import click, ollama
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaLLM
from langchain_text_splitters import RecursiveCharacterTextSplitter
LLM_MODEL_NAME = os.environ.get("LLM_MODEL", "phi3:mini")
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "localhost")
OLLAMA_PORT = os.environ.get("OLLAMA_PORT", "11434")
OLLAMA_URL = f"http://{OLLAMA_HOST}:{OLLAMA_PORT}"
USE_CACHE = os.environ.get("USE_CACHE", "1") == "1"
EMBEDDINGS = HuggingFaceEmbeddings(model_name=os.environ.get("EBD_MODEL_NAME", "all-MiniLM-L6-v2"))
OLLAMA_CLIENT = ollama.Client(host=OLLAMA_URL)
LLM = OllamaLLM(model=LLM_MODEL_NAME, base_url=OLLAMA_URL)
TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use 5 sentences or less and keep the answer as concise as possible.
{context}
Question: {question}"""
PROMPT = PromptTemplate.from_template(TEMPLATE)
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stderr))
def hashed(glob):
return hashlib.md5(glob.encode()).hexdigest()
def get_vectors(glob, fname=None):
if not USE_CACHE or fname is None:
documents = DirectoryLoader(os.getcwd(), glob=glob).load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
logger.debug("Computing the FAISS index, this might take a while")
return FAISS.from_documents(docs, EMBEDDINGS)
return FAISS.load_local(
os.path.join(os.getcwd(), fname),
EMBEDDINGS,
allow_dangerous_deserialization=True,
)
def model_exists(model_name):
models = [m.model for m in OLLAMA_CLIENT.list()['models']]
return model_name in models
def pull_model(model_name=LLM_MODEL_NAME):
logger.debug(f"Checking for model '{model_name}'...")
if not model_exists(model_name):
logger.debug(f"Pulling model '{model_name}', please wait...")
OLLAMA_CLIENT.pull(model_name)
logger.debug(f"Model '{model_name}' is ready")
def show_response(response):
if isinstance(response, (dict,)):
result = response.get("result", "RAG result failure?")
citations = []
for doc in response["source_documents"]:
citations += [
f"- {doc.metadata.get('source', 'unknown source')} (\"{doc.page_content[:20].strip()}...\")"
]
citations = "\n".join(["CITATIONS:\n"] + citations)
else:
result = response
citations = "\n".join(
["CITATIONS: (not implemented yet for `chat`, use `ask` instead)"]
)
logger.debug(f"\n{result}\n\n{citations}")
@click.group()
def cli():
"""RAG pipeline demo"""
@cli.command()
@click.argument("glob")
@click.option("--force", is_flag=True, default=False,)
def chat(glob, force=False):
db_path = _ingest(glob, force=force)
base_retriever = get_vectors(glob, db_path).as_retriever()
retrieval_chain = (
{"context": base_retriever, "question": RunnablePassthrough()}
| PROMPT | LLM | StrOutputParser()
)
logger.debug("Starting interactive session. Use exit, quit, q, or Ctrl+D to exit.")
while True:
query = input(">> ")
if query.lower() in ["exit", "quit", "q"]:
break
response = retrieval_chain.invoke(query)
show_response(response)
@cli.command()
@click.argument("glob")
@click.argument("query", nargs=-1)
@click.option("--force", is_flag=True, default=False,)
def query(glob, query, force=False):
pull_model()
db_path = _ingest(glob, force=force)
query = " ".join(query)
logger.debug(f"Got query: {query}")
retriever = get_vectors(glob, db_path).as_retriever()
logger.debug("Running query..")
qa_chain = RetrievalQA.from_chain_type(
LLM, retriever=retriever, return_source_documents=True
)
response = qa_chain.invoke(query)
show_response(response)
@cli.command()
@click.argument("glob")
@click.option("--force", is_flag=True, default=False,)
def ingest(glob, force=False):
return _ingest(glob, force=force)
def _ingest(glob, force=False):
db_path = f".tmp.faiss_index.{hashed(glob)}"
logger.debug(f"FAISS index: {db_path}")
if os.path.exists(db_path) and not force:
logger.debug("Index already exists, remove it to reingest.")
else:
force and logger.debug("Forcing reingest..")
get_vectors(glob).save_local(db_path)
return db_path
@cli.command()
def init():
pull_model()
if __name__ == "__main__":
cli()
Appendix: Javascript LangChain Code
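The actual driver ships in the repository as demos/data/llm_tools.mjs. Below is a minimal sketch of the same idea, assuming the @langchain/community Ollama wrapper plus the (legacy) zero-shot agent helper with a single date tool; names, defaults, and prompt handling here are illustrative rather than copied from the repo file.

```javascript
// Sketch of llm_tools.mjs: answer time-related questions with a single `date` tool.
import { Ollama } from "@langchain/community/llms/ollama";
import { DynamicTool } from "@langchain/core/tools";
import { initializeAgentExecutorWithOptions } from "langchain/agents";

// Assumed environment: OLLAMA_HOST and LLM_MODEL come from the compose file,
// and `query` is passed through by the compose.bind.target wrapper.
const model = new Ollama({
  baseUrl: `http://${process.env.OLLAMA_HOST ?? "ollama_server"}:11434`,
  model: process.env.LLM_MODEL ?? "mistral",
});

// The only tool on offer: report the current date and time.
const tools = [
  new DynamicTool({
    name: "date",
    description: "Returns the current date and time.",
    func: async () => new Date().toString(),
  }),
];

// Zero-shot ReAct-style agent; its output parser is strict about the
// expected "Action:" / "Final Answer:" format.
const executor = await initializeAgentExecutorWithOptions(tools, model, {
  agentType: "zero-shot-react-description",
});

const result = await executor.invoke({
  input: process.env.query ?? "What day is it today?",
});
console.log(`Answer: ${result.output}`);
```

That strict output parser is presumably also the source of the OutputParserException in the sample output above, whenever the model drifts out of the expected format.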
"""
See the main docs: https://robot-wranglers.github.io/compose.mk/demos/ai
"""
import os, sys, hashlib, logging
import click, ollama
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaLLM
from langchain_text_splitters import RecursiveCharacterTextSplitter
LLM_MODEL_NAME = os.environ.get("LLM_MODEL", "phi3:mini")
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "localhost")
OLLAMA_PORT = os.environ.get("OLLAMA_PORT", "11434")
OLLAMA_URL = f"http://{OLLAMA_HOST}:{OLLAMA_PORT}"
USE_CACHE = os.environ.get("USE_CACHE", "1") == "1"
EMBEDDINGS = HuggingFaceEmbeddings(model_name=os.environ.get("EBD_MODEL_NAME", "all-MiniLM-L6-v2"))
OLLAMA_CLIENT = ollama.Client(host=OLLAMA_URL)
LLM = OllamaLLM(model=LLM_MODEL_NAME, base_url=OLLAMA_URL)
TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use 5 sentences or less and keep the answer as concise as possible.
{context}
Question: {question}"""
PROMPT = PromptTemplate.from_template(TEMPLATE)
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stderr))
def hashed(glob):
return hashlib.md5(glob.encode()).hexdigest()
def get_vectors(glob, fname=None):
if not USE_CACHE or fname is None:
documents = DirectoryLoader(os.getcwd(), glob=glob).load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
logger.debug("Computing the FAISS index, this might take a while")
return FAISS.from_documents(docs, EMBEDDINGS)
return FAISS.load_local(
os.path.join(os.getcwd(), fname),
EMBEDDINGS,
allow_dangerous_deserialization=True,
)
def model_exists(model_name):
models = [m.model for m in OLLAMA_CLIENT.list()['models']]
return model_name in models
def pull_model(model_name=LLM_MODEL_NAME):
logger.debug(f"Checking for model '{model_name}'...")
if not model_exists(model_name):
logger.debug(f"Pulling model '{model_name}', please wait...")
OLLAMA_CLIENT.pull(model_name)
logger.debug(f"Model '{model_name}' is ready")
def show_response(response):
if isinstance(response, (dict,)):
result = response.get("result", "RAG result failure?")
citations = []
for doc in response["source_documents"]:
citations += [
f"- {doc.metadata.get('source', 'unknown source')} (\"{doc.page_content[:20].strip()}...\")"
]
citations = "\n".join(["CITATIONS:\n"] + citations)
else:
result = response
citations = "\n".join(
["CITATIONS: (not implemented yet for `chat`, use `ask` instead)"]
)
logger.debug(f"\n{result}\n\n{citations}")
@click.group()
def cli():
"""RAG pipeline demo"""
@cli.command()
@click.argument("glob")
@click.option("--force", is_flag=True, default=False,)
def chat(glob, force=False):
db_path = _ingest(glob, force=force)
base_retriever = get_vectors(glob, db_path).as_retriever()
retrieval_chain = (
{"context": base_retriever, "question": RunnablePassthrough()}
| PROMPT | LLM | StrOutputParser()
)
logger.debug("Starting interactive session. Use exit, quit, q, or Ctrl+D to exit.")
while True:
query = input(">> ")
if query.lower() in ["exit", "quit", "q"]:
break
response = retrieval_chain.invoke(query)
show_response(response)
@cli.command()
@click.argument("glob")
@click.argument("query", nargs=-1)
@click.option("--force", is_flag=True, default=False,)
def query(glob, query, force=False):
pull_model()
db_path = _ingest(glob, force=force)
query = " ".join(query)
logger.debug(f"Got query: {query}")
retriever = get_vectors(glob, db_path).as_retriever()
logger.debug("Running query..")
qa_chain = RetrievalQA.from_chain_type(
LLM, retriever=retriever, return_source_documents=True
)
response = qa_chain.invoke(query)
show_response(response)
@cli.command()
@click.argument("glob")
@click.option("--force", is_flag=True, default=False,)
def ingest(glob, force=False):
return _ingest(glob, force=force)
def _ingest(glob, force=False):
db_path = f".tmp.faiss_index.{hashed(glob)}"
logger.debug(f"FAISS index: {db_path}")
if os.path.exists(db_path) and not force:
logger.debug("Index already exists, remove it to reingest.")
else:
force and logger.debug("Forcing reingest..")
get_vectors(glob).save_local(db_path)
return db_path
@cli.command()
def init():
pull_model()
if __name__ == "__main__":
cli()
Appendix: Ollama Container
# See the demo docs: https://robot-wranglers.github.io/compose.mk/demos/RAG
#
# NB: GPU support is disabled by default; to enable it, use something like the
# following in the ollama_server block. You must also remove the '+cpu' from the torch requirement.
#
# https://docs.docker.com/compose/how-tos/gpu-support/
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: 1
# capabilities: [gpu]
# NB: this allows model-sharing for any host-installation of ollama if
# it's available. Probably wrong for non-Linux and also maybe wrong for
# some ollama versions?
# Share the host's working directory; this is for document ingestion
# NB: this allows model-sharing for the host-installation of
# hugging face if applicable. Probably wrong for non-Linux and
# also maybe wrong for some library versions?
services:
ollama_server: &base
build:
context: .
dockerfile_inline: |
FROM ollama/ollama@sha256:476b956cbe76f22494f08400757ba302fd8ab6573965c09f1e1a66b2a7b0eb77
entrypoint: ['ollama','serve']
working_dir: /workspace
volumes:
- /usr/share/ollama/.ollama:/root/.ollama
- ${PWD}:/workspace
- ${HOME}/.cache/huggingface:/root/.cache/huggingface
environment: &base_env
OLLAMA_HOST: ollama_server
LLM_MODEL: ${LLM_MODEL:-mistral}
EBD_MODEL_NAME: ${EBD_MODEL_NAME:-all-MiniLM-L6-v2}
ollama_python:
<<: *base
entrypoint: ["python3.11"]
depends_on: ['ollama_server']
build:
context: .
dockerfile_inline: |
FROM python:3.11-slim-bookworm
RUN pip install 'torch==2.6.0+cpu' \
--extra-index-url https://download.pytorch.org/whl/cpu
RUN pip install --no-cache-dir \
langchain==0.3.22 langchain-community==0.3.20 \
langchain-huggingface==0.1.2 langchain-ollama==0.3.0 \
ollama==0.4.7 sentence-transformers==4.0.2 faiss-cpu==1.10.0 \
unstructured[md]==0.17.2 click==8.1.8 \
accelerate==1.6.0 python-magic==0.4.27
RUN apt-get update -qq && apt-get install -y make procps
ollama_node:
<<: *base
depends_on: ['ollama_server']
# working_dir: /app
build:
context: .
dockerfile_inline: |
FROM node:18-slim
WORKDIR /app
RUN npm init -y
RUN npm install @langchain/community @langchain/core langchain
RUN apt-get update -qq && apt-get install -y make procps
References
1. See the compose file for hints about enabling GPUs.
2. See https://github.com/Mozilla-Ocho/llamafile, and the RAG demo.