Well, Open WebUI is an incredible project that provides a nice graphical front end for local AI development, but the documentation is really lacking. You can figure some things out by looking at the documentation, but mainly it is trial and error.
So here are my notes on using the various features. They are a bit on the edge: most people are doing Docker development on Windows machines with CUDA, while this is targeted at getting it running on Apple Silicon.
Installation with pipx or uvx or Tauri
Sadly, uvx installation doesn’t work. It is missing install, but you can do a simple brew install ollama && pipx install open-webui --python 3.11
and it will work. What this does is use brew to install Ollama, and then pipx, which is this incredibly wonderful thing, creates a Python virtual environment (venv) and adds it to your local path. Really useful.
The thing on a Mac is that you want to avoid using docker because then you have to split your memory into a docker controlled component and a system one. And you will never get that right.
You can also use brew install uv && uvx open-webui
to get to the same place, but I kind of like pipx install
because it does the command line munging so you can just do an open-webui serve
without needing to know about uvx.
Finally there is work going on to use Tauri to package open-webui as well, so it is just a DMG file as Open Webui Desktop, but I couldn’t get this to work.
This does mean that you have to go to great lengths to do pipx installations for the rest of the tools and have them running as separate small servers. That is way better and more memory efficient than using Docker, as you get natural splits between macOS native applications (like Final Cut Pro for instance or DiffusionBee) and your other tools.
Going Totally Local
You probably want to go totally local since that’s the point of these experiments, and there are many parameters you should change. The main one is to load the best of the local models. We do have a script that does this, but basically you do it from the command line. Note that this assumes different memory sizes and is probably already out of date:
# start ollama in the background (or run it in its own terminal window)
ollama serve &
# now do the pulls; I like to pull the tagged parameter sizes
# so it is easier to know what you are loading.
# ollama pull takes one model at a time, so loop over each list
for m in llama3.2:3b tulu3:8b; do ollama pull "$m"; done
# if you have a 64GB machine
for m in qwq:32b qwen2.5:32b llama3.2-vision:11b; do ollama pull "$m"; done
# if you have a 128GB machine
for m in llama3.3:70b tulu3:70b nemotron:70b llama3.2-vision:90b; do ollama pull "$m"; done
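If you want to keep the tiers in one place, they can be captured in a few lines of Python. This is only a minimal sketch and not the script mentioned above: the function names and the GB thresholds are my assumptions, and the model tags are simply the ones listed here.

```python
"""Sketch: choose which Ollama models to pull based on installed RAM."""

# Model tags copied from the list above; the GB thresholds are assumptions.
TIERS = [
    (0, ["llama3.2:3b", "tulu3:8b"]),
    (64, ["qwq:32b", "qwen2.5:32b", "llama3.2-vision:11b"]),
    (128, ["llama3.3:70b", "tulu3:70b", "nemotron:70b", "llama3.2-vision:90b"]),
]


def models_for_ram(ram_gb: int) -> list[str]:
    """Return every model whose tier fits in ram_gb of memory."""
    models: list[str] = []
    for min_gb, tier_models in TIERS:
        if ram_gb >= min_gb:
            models.extend(tier_models)
    return models


def pull_commands(ram_gb: int) -> list[list[str]]:
    """Build one `ollama pull` argv per model (pull takes a single model)."""
    return [["ollama", "pull", model] for model in models_for_ram(ram_gb)]


# Print the commands instead of running them, so this is safe to try anywhere.
for argv in pull_commands(64):
    print(" ".join(argv))
```

To actually pull, you could hand each argv to subprocess.run(argv, check=True) while ollama serve is running.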
Downloading other GGUF models
OK, one confusing thing is that even though Hugging Face has 1.2M models up there, only a few can be downloaded by Open WebUI. This is because underneath, Ollama is just a wrapper around llama.cpp, which only accepts GGUF files. These can come from Ollama.com or from Huggingface.co (defaulting to Q4_K_M quantization, but that’s a whole other post), and on Hugging Face itself, if you click on the model drop down, it will generate the proper pull command for you.
As an aside, GGUF is usually said to stand for Georgi Gerganov’s Universal Format. It’s a bit of a pain because the tool to do the conversion is in his llama.cpp library, with a neat trick using Docker to run this stuff.
Dealing with HuggingFace models
Most of the rest of the Hugging Face models are in HF format, so you need to convert them to GGUF and, as a community service, publish the conversion. Here we enter utilities hell, which I’m going to cover in the next blog post, but this is a great example. Basically, the major ways are like the Ghosts of Christmas Past, Present and Future:
- Past. Pray they have a Homebrew package, but if not it’s painful. Git clone a repo, install naked pip requirements into the system and pray. This is illustrated in the last command box. The main issues here are that you are cloning a bunch of stuff you don’t need and have to maintain, that you have to remember where these things are in README.md or something, and that you are doing a pip install into the system environment (and who knows what version of Python you are using) just so you can run a single script, convert_hf_to_gguf.py (named convert-hf-to-gguf.py in older checkouts), and then manually make a Modelfile that explains how the inputs work. Sigh.
- Present. The current hotness is stuffing everything inside a Docker container, and then you get this amazingly complicated command line and you have to understand that there is an internal file system and an external one. And of course with Docker, you have to allocate separate space for it and you get these huge Docker containers with basically a full operating system in them (duplicated on a Mac) just to get a few lines of Python running. Then there are online tools like gguf-my-repo that are on Hugging Face (more on the future of tools in another post), but of course there is no way to do this programmatically.
- Future. The emergence of npx, uvx, pipx, condax and pkgx, the family of executables that create even lighter weight environments that are language specific. Instead of a big Docker virtual machine on Apple Silicon, you end up with just enough to run the script, which is usually a virtualized environment, so multiple versions of Python can coexist. Or with tools like Dagger, at least they hide the containers behind a nice user interface (although of course the container I tried didn’t work). Of course with things like npx and uvx, the nice world of brew update gets replaced by an update from each tool, but it is very lightweight!
I can’t really say which is easiest, but since it’s a one-time job, the hosted Hugging Face application is pretty great, and it is free as long as you take less than 120 seconds of CPU time.
# get the hugging face tools
brew install huggingface-cli
# or if you are cool this is sort of super pipx
brew install pkgx
pkgx install huggingface-cli
# the format is _repo_/_model_
ORG=qwen
MODEL=qvq-72b-preview
# try a direct pull first; if this fails you need a GGUF conversion
ollama pull "hf.co/$ORG/$MODEL"
MODEL_DIR="$HOME/wsn/data/models"
mkdir -p "$MODEL_DIR"
huggingface-cli download "$ORG/$MODEL" --local-dir "$MODEL_DIR/$MODEL" --include "*"
# Convert to GGUF, sigh. Mount $MODEL_DIR so both steps see the same paths
docker run --rm -v "$MODEL_DIR":/repo ghcr.io/ggerganov/llama.cpp:full --convert "/repo/$MODEL" --outtype f32 --outfile "/repo/$MODEL.gguf"
# This creates an fp32 file in $MODEL_DIR/$MODEL.gguf
# Quantize from F32.gguf to Q4_K_M
docker run --rm -v "$MODEL_DIR":/repo ghcr.io/ggerganov/llama.cpp:full --quantize "/repo/$MODEL.gguf" "/repo/$MODEL.Q4_K_M.gguf" "Q4_K_M"
# Or the old way
git clone git@github.com:ggerganov/llama.cpp
cd llama.cpp
# yes see the previous posts about uv
uv pip install -r requirements.txt
# now run the conversion (the script was convert-hf-to-gguf.py in older checkouts)
./convert_hf_to_gguf.py "$MODEL_DIR/$MODEL" --outfile "$MODEL_DIR/$MODEL.gguf" --outtype q8_0
# Now you have to create the Modelfile to match this
# Sigh, this is more complicated than it looks because the
# metadata on how system prompts and user prompts
# work is not in the Hugging Face file itself.
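For the Modelfile step, here is a minimal sketch of what gets written. FROM, TEMPLATE and SYSTEM are standard Modelfile instructions, but the helper names, the file name and the prompt text are hypothetical, and the real chat template has to come from the model card since I can't guess it for you.

```python
"""Sketch: write a minimal Ollama Modelfile for a converted GGUF file."""
from pathlib import Path


def render_modelfile(gguf_path: str, template: str = "", system: str = "") -> str:
    """Return Modelfile text; template and system are optional."""
    lines = [f"FROM {gguf_path}"]
    if template:
        # The chat template is model specific; copy it from the model card.
        lines.append(f'TEMPLATE """{template}"""')
    if system:
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


def write_modelfile(directory: str, text: str) -> Path:
    """Write the text to <directory>/Modelfile and return its path."""
    path = Path(directory) / "Modelfile"
    path.write_text(text)
    return path


# Hypothetical file name and prompt, for illustration only.
print(render_modelfile("./my-model.Q4_K_M.gguf", system="You are a helpful assistant."))
```

Once the file exists, ollama create my-model -f Modelfile registers it with Ollama (my-model is again just a placeholder name).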
How it all works and logging
It is pretty confusing how this all works, but here it is. Note that the easiest way to see the logs is to run each of these processes in a separate terminal window so you can see what is coming out of standard output. Here is how you know each process is working:
- open-webui. It will end with uvicorn started
- ngrok. This will end with a message saying look at http://127.0.0.1:4040
- ollama. Will end with a message telling you how much compute RAM it has (96GB on a 128GB M4 Max by the way)
- tika. This will end with a message saying it started at http://localhost:9998
Backup the Setting Database
Redoing the settings is a real pain if you have to delete everything over and over, so I just use chezmoi to capture their SQL database. They don’t use INI files, but have a SQLite database (webui.db, with migrations handled by Alembic) that keeps track of the many parameters.
You can go to Lower Left > Admin Settings > Database > Export config to JSON.
Note that the API keys are here in plain text, so do not check this in. You should store it someplace like 1Password if you need it.
Resetting the User Database and Backups
I found that I was locked out of this. The simplest thing to do is to delete all the files where the configurations are kept. You can also set an environment variable to do this:
RESET_DATABASE=1 open-webui serve &
If you are doing a pipx installation, the actual location of webui.db is really buried because it is in the ./data directory of the working venv. This depends on the Python that you are using, but with pipx it will be in a strange directory buried in pipx:
$HOME/.local/pipx/venvs/open-webui/lib/python3.12/site-packages/open_webui/data
# here are the interesting files
webui.db # the SQLite database (migrations via Alembic)
uploads # the files you have uploaded
vectordb # where your RAG information is stored
cache/audio # .wav and transcripts
cache/image/generations # where .pngs live
If you do have a problem with the user database, which I did, you can also reset the configuration with RESET_CONFIG_ON_START=1. They talk about a config.json, but I can’t find it anywhere.
Backup of config.json, webui.db and chats
You want to pretty frequently do backups because on each version change, you can lose your configuration and also there are bugs in the system so you can bork your configuration. I try to do this every day or so.
- Settings > Admin Settings > Database and do an Export config to JSON, which gives you a config.json that has all your API keys
- Export Database, which gives you webui.db, which has more configurations
- Export Chats, because you will lose those.
- The config.json and webui.db have API keys and things in them. You can put the chats in a repo, but I would put the config.json and webui.db into 1Password or someplace secure like iCloud Drive. Definitely not in a repo.
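If you prefer to script the webui.db part of this, a timestamped copy is enough. This is a minimal sketch under my own assumptions: the function name and the backup directory are hypothetical, and you should point data_dir at wherever your install keeps its data directory.

```python
"""Sketch: copy webui.db to a timestamped file in a backup directory."""
import shutil
from datetime import datetime
from pathlib import Path


def backup_database(data_dir: str, backup_dir: str) -> Path:
    """Copy <data_dir>/webui.db to <backup_dir>/webui-YYYYmmdd-HHMMSS.db."""
    source = Path(data_dir) / "webui.db"
    if not source.is_file():
        raise FileNotFoundError(f"no webui.db in {data_dir}")
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"webui-{stamp}.db"
    shutil.copy2(source, target)  # copy2 also preserves modification times
    return target
```

It is probably safest to stop open-webui before copying so the file is not mid-write, and remember the copy contains your API keys, so the backup directory should not live in a repo.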
The many configuration settings
One note is that while Open WebUI takes in many environment variables, those marked PersistentConfig are only read once and then disappear into webui.db, after which you can only change them in the Open WebUI interface. They have an OPENAI_API_KEY variable, for instance, but this disappears into the database.
If you care about your settings, you can go to Lower Left > Admin Settings > Database > Export Config to JSON
and save the file.
Multiple OpenAI Compatible API points
While you can use functions and pipelines, so many models are available from an OpenAI compatible interface that it makes sense just to have them all here. I can’t find a way to load them programmatically though, so you have to redo this every time you set up a system in Lower Left > Admin Settings > Settings > Connections
Here’s a list that I use. Since Open WebUI doesn’t tell you where the models live, you have to intuit this from the different rules for model names. If you go to Lower Left > Admin Settings > Settings > Models
you get more metadata.
They don’t tell you who the provider is, but in the all important Capabilities section at the bottom you can see if it supports Vision or Citations.
Also, you can’t tell which provider you are using from the user interface, but the syntax of the model names is subtly different, so that’s a hint. See the third column of the table. If you see a link icon on the right, the model is hosted in the cloud, while if it has a number then it is local. The local model syntax is easy: just look for a number after the model (which is its size). It has the form hf.co/org/repo
if it downloads from there, or whatever random name they pick on ollama.com, which is typically lower kebab case with the size tag after a colon, so qwen2.5:72b
or llama3.2-vision:90b. The net is that only Cerebras can really confuse you if you follow this decoder ring:
URL | Comment | Model Syntax |
---|---|---|
https://api.openai.com/v1 | They have lots of old and date versioned models | They use lower kebab case like gpt-4o-audio-preview-2024-10-01 . Note that they never do gpt4 , it is gpt-4 |
https://api.groq.com/openai/v1 | Very high speed, not as fast as Cerebras but way more variety. They have lots of old models | Names are lower case in provider/model[-]version syntax, for example llama-3.2-90b-vision-preview, llama3-8b-8192 and llama-3.3-70b-versatile |
https://api.deepseek.com/v1 | deepseek-chat is V3 and the pricing is incredibly low, so use it! | There are people copying them, but the true models are just deepseek-chat and deepseek-coder |
https://openrouter.ai/api/v1 | This is the most confusing because they route to every other provider like OpenAI, Amazon, Google, and many small ones. The tag (free) means no charge, so go for that. They don’t host anything themselves FYI, so look for open source ones | Their syntax is Initial Case with Provider: Model, so they are easily confused with the real Google models. For instance, Google: Gemini 2.0 Flash looks the same as Google’s base offering. They have a sea of models like Deepseek V3 that are similarly passed through. Free ones look like Meta: Llama3.2 90B Vision Instruct (free) |
https://api.mistral.ai/v1 | Proprietary models from the French 🙂 search for models ending in stral | The model names are in kebab case like codestral-mamba-latest |
https://api.cerebras.ai/v1 | Accelerated models that are supposed to be faster than Groq | The model names are confusing, but they also use lower kebab case like llama3.1-8b, llama3.1-70b and, confusingly, llama-3.3-70b with a dash in the version |
https://api.totalgpt.ai | Infermatic.ai. This came up as an alternative to OpenRouter.ai, but it is expensive at $15/month so I am not using it. They do use vLLM underneath | |
http://localhost:4000 | They used to support LiteLLM, but don’t anymore, so you can run LiteLLM separately if you want. LiteLLM shims 100+ LLMs with the OpenAI call format, but then you need another component, an LLM proxy, which you install with pipx install 'litellm[proxy]' and then litellm --model huggingface/bigcode/starcoder puts a proxy at port 4000 | |
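Since the UI won't tell you which names belong to which provider, one way to check is to ask each endpoint directly. OpenAI style servers answer GET /models with a list of ids; whether every provider in the table supports that call is my assumption, and the function names here are made up for this sketch.

```python
"""Sketch: list the model ids that an OpenAI compatible endpoint exposes."""
import json
import urllib.request


def models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build the GET request for <base_url>/models with a bearer token."""
    url = base_url.rstrip("/") + "/models"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def parse_model_ids(payload: dict) -> list[str]:
    """Pull the sorted ids out of an OpenAI style {"data": [{"id": ...}]} reply."""
    return sorted(item["id"] for item in payload.get("data", []))


def list_models(base_url: str, api_key: str) -> list[str]:
    """Call the endpoint and return its model ids (needs network and a key)."""
    with urllib.request.urlopen(models_request(base_url, api_key), timeout=30) as reply:
        return parse_model_ids(json.load(reply))
```

Calling list_models with one of the URLs from the first column and your own key should print names you can match against the third column.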
Functions: Getting Anthropic and Google Running
There are some models which are not OpenAI compatible, so you need to find the right Functions to use them. To do this, go to Lower Left > Admin Settings > Functions
and look for Discover a Function
at the bottom. The two in the table below are the ones I enable. You do need a login to Openwebui.com, which is different from the one on your localhost. That site is a complete mess of different things from functions to prompts to other stuff, but the best view I think is to go to Models > Functions
which will show you the most used functions. Note that the website is really slow. It looks like it is doing dynamic generation, so it can take seconds to click from one page to another.
What is a function? Basically it is a way to do a single call out from Open WebUI (compared with Pipelines, which allow multiple stages and are more separated).
Once you load them, they have this concept of a Valve, which is really just a variable. So what you do once you find a function is to click on the GET
button, and then it will ask if you want to Import to WebUI
and you need to type the URL of your Open WebUI localhost (this is usually http://localhost:8080).
This will copy it into your local host and then you choose Save to stick it into your environment. Not super clean, but it works.
Note that most of these functions are not documented, so it is hard to know what depends on what and you can get quite a few errors. Visualize, for instance, requires OpenAI.
Also note that in the Settings > Models
section, you can tell a model comes from a function because a name like anthropic.claude3.5 appears, or google_genai.gemini-1.0-pro-latest
Function Name | Settings | Model Syntax |
---|---|---|
Anthropic | Here is where you get Claude | The names are lower kebab case with version and type like this: anthropic/claude-3.5-sonnet |
Google GenAI | Note that the GOOGLE_API_KEY import doesn’t work, so you will need to add the key manually. | The models are Initial Caps with a colon and then the name, like Google: Gemini 2.0 Flash Thinking Experimental |
Using Remote Ollama, Ngrok and OpenWebUI
You can do this in the Ollama API list. For instance, if you have another MacBook with Ollama running, you get to it with something like http://richs-macbook-pro-2021.local:11434
and if you use ngrok, then you can actually do this remotely.
Pretty handy for a quick way to get a departmental server.
One of the nice things about Open WebUI is that it just calls APIs, and Ollama is the default. You can also run Ollama remotely; you just start it with OLLAMA_HOST=0.0.0.0 ollama serve
and then it will serve anything that can reach that machine on the network. This is a little dangerous of course, but convenient.
Then you just go to Admin Settings > Connection > Ollama Host
and add the host name, something like http://richs-macbook-pro-2021.local:11434
and it will serve from there. Very nice for departmental setups. Get a Mac mini M4 Pro and serve your entire workgroup.
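Before adding a remote host in the settings, it is worth checking that it answers. Ollama lists its installed models at /api/tags, so a minimal sketch looks like the following; the function names are mine, and the host is whatever machine you started with OLLAMA_HOST set.

```python
"""Sketch: check which models a remote Ollama server offers."""
import json
import urllib.request


def tags_url(host: str) -> str:
    """Return the /api/tags URL for an Ollama host such as http://name.local:11434."""
    return host.rstrip("/") + "/api/tags"


def model_names(payload: dict) -> list[str]:
    """Extract model names from the JSON that /api/tags returns."""
    return [model["name"] for model in payload.get("models", [])]


def remote_models(host: str) -> list[str]:
    """Fetch and parse the model list (needs the server to be reachable)."""
    with urllib.request.urlopen(tags_url(host), timeout=10) as reply:
        return model_names(json.load(reply))
```

If remote_models raises a connection error, the server is probably still bound to localhost only or a firewall is in the way.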
Ngrok, which does authentication and has a little server on the host machine, is another answer. The only problem is that ngrok generates an antivirus error since it is used in many hacks. What you can do is to create an account on ngrok.com and then create an ngrok server with ngrok http 8080 --url _your static domain_ --oauth google --oauth-allow-domain _your domain_
which should protect you.
The setup here is a little more complicated:
- You have to log on to ngrok.com and get an account
- Then
brew install ngrok
- Note that many antivirus programs mark ngrok as a bad program because it is commonly used in hacks. You need to go to your antivirus and exclude the executable, which should be somewhere like
/opt/homebrew/Caskroom/ngrok/<version>/ngrok
- Now you need to authenticate with
ngrok config add-authtoken _your authtoken_
(you get the token from their console)
- Then you can expose port 8080 of the Open WebUI server with
ngrok http 8080 --url _the static domain_ --oauth google --oauth-allow-domain tongfamily.com
which says to expose port 8080, protect it with Google authentication and only allow accounts from tongfamily.com
So basically at this point you are using Open WebUI locally from its point of view, and the bugs with web sockets are not an issue.
Enabling RAG Documents and Web Retrieval
OK, now in Lower Left > Admin Settings > Settings > Documents
there are about a million configuration settings that enable RAG. The big one is the embedding model.
The basic idea is pretty simple: you can use the #
notation or choose upload file, and it adds the document to the chat RAG area. There are several ways to do this.
- First, you upload your local documents in the Workspace > Knowledge section. Note that the documentation is actually very out of date here. Knowledge is basically a folder system, so you can turn on different pieces. This allows you to upload directories and sync them, so it’s a nice way to have, say, a repo with your documents that you can then sync. Then when you enter # in a chat, it will show you all the available documents you can load. Then it will RAG the data and the LLM can use that data. You can RAG a single file or you can RAG an entire Collection. It can automatically add Citations if the LLM you chose supports it. Tulu3:8B for instance works well.
- You can also temporarily load files by choosing the Upload option in any chat. But it is nice to have the documents already there.
- You can enable Google Drive by setting GOOGLE_DRIVE_API_KEY and GOOGLE_DRIVE_CLIENT_ID and it will be available. Go to the Google Console and enable the Google Drive API needed for Web Apps, then create an API key and make sure to edit the API key to restrict it to just Google Drive. Then you need a Drive Client ID as well, but there are no specific instructions for this :-(
- You can download a web source as well with #https://tongfamily.com but this is nearly useless given all the gunk that is in a typical website. They don’t really tell you how to fix this, but there is a huge Web Search section. It will actually show you the document that it pulled when you hit enter, and you can click on the document itself to see what is there.
How to tell if your download is working, look at the console output
The way that you can tell if it works is to go to the console and see if open_webui.env is downloading things when you hit enter, and you should see the model getting loaded.
How RAG Embedding actually works
They really don’t tell you what is going on here, but the RAG system uses a completely different method of dealing with models than the core Chat system and it is not well documented. Here is what happens:
- Unlike Ollama, Open WebUI RAG supports the base Hugging Face models, so you don’t need to do any conversion. That’s the good news.
- The bad news is that on Apple Silicon at least it looks like these models *do not* use the Neural Engine hardware, so they are really slow.
- Second, the Hugging Face CLI caches all the models it downloads. Note that to set this all up, create an HF_TOKEN and use 1Password to retrieve it in .bash_profile or .zshrc.
- The cache can get really big. It ate up lots of my disk space and lives in
~/.cache/huggingface/hub
so you might want to symlink it to your backing storage if it is too big.
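To see how big the cache has become before deciding whether to symlink it, something like this works. It is a minimal sketch: the function names are mine, it honors HF_HOME if you have set it, and it ignores other overrides such as HF_HUB_CACHE.

```python
"""Sketch: measure how much disk the Hugging Face cache is using."""
import os
from pathlib import Path


def default_hub_cache() -> Path:
    """Return the hub cache directory, honoring HF_HOME when it is set."""
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under root without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                total += path.stat().st_size
    return total
```

Calling directory_size_bytes(default_hub_cache()) / 1e9 gives the size in GB. Symlinks are skipped because the cache stores each file once as a blob and links to it from the snapshot folders, so following them would double count.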
Next, the default models are very small at less than 1B parameters, so you don’t really see a performance hit, and they use the SentenceTransformers library from Hugging Face. The options are:
- SentenceTransformers. The default. They download directly from Hugging Face, so the syntax to get a new model is org/repo, so for instance nvidia/NV-Embed-v2 is valid. As an aside, they don’t really tell you what the syntax in the Embedding model line is, so that is it.
- OpenAI. You can use theirs, which we avoid since we want this to be all local
- Ollama. They do allow you to use Ollama for the models as well, and the syntax here is just the name of the model. Note that on ollama.com you can search just for embedding models, and some valid names are nomic-embed-text or bge-large
Hybrid Search separates Embedding from Reranking
As an option to improve RAG, you can select Hybrid Search. This means that there is one model to generate embeddings and decide which document chunks to fetch, and then a separate, much slower but more accurate reranker that takes the bucket of chunks and thinks more about which ones to pick.
Note that the reranker doesn’t seem to have an Ollama option, so you are stuck in CPU mode if you use this.
As a refresher, there are two parts of RAG. First is the embedding model, which converts each chunk of text into a multidimensional vector. The idea is that the more dimensions, the more you can find similarities. The best ones have several thousand dimensions, and the job is to find a list of documents that look similar. The idea is to quickly retrieve a lot of documents, and then the reranker works slowly to figure out what is the most relevant.
The reranker, also known as a cross-encoder, takes the query and a document and gives you a similarity score. You use it to figure out which documents are most relevant. Top K means you pick the K best of these, so the top 3 if K=3.
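The two stages are easier to see with toy numbers. This is a minimal sketch and not how Open WebUI implements it: the vectors are invented 2-dimensional stand-ins for embeddings, and word_overlap is a crude placeholder for a real cross-encoder.

```python
"""Sketch: two-stage retrieval with toy vectors (no real models involved)."""
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunk_vecs: dict, top_n: int) -> list[str]:
    """Stage 1: fast vector search that returns the top_n chunk ids."""
    ranked = sorted(chunk_vecs, key=lambda cid: cosine(query_vec, chunk_vecs[cid]), reverse=True)
    return ranked[:top_n]


def rerank(query: str, candidates: list[str], texts: dict, score, top_k: int) -> list[str]:
    """Stage 2: score each (query, chunk text) pair and keep the top_k."""
    ranked = sorted(candidates, key=lambda cid: score(query, texts[cid]), reverse=True)
    return ranked[:top_k]


def word_overlap(query: str, text: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the text."""
    words = query.lower().split()
    if not words:
        return 0.0
    return sum(w in text.lower().split() for w in words) / len(words)


# Toy data: 2-dimensional "embeddings" invented for illustration.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
texts = {"a": "install ollama with brew", "b": "backup the webui database", "c": "ngrok setup"}
candidates = retrieve([1.0, 0.05], vectors, top_n=2)
print(rerank("backup database", candidates, texts, word_overlap, top_k=1))
```

Here the vector search ranks chunk a slightly ahead of chunk b, and the second stage promotes b because it actually matches the query text, which is the whole point of paying for a reranker.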
Picking RAG models not all of which work
There are a series of models, starting with the recommended ones and then the mteb/leaderboard on Hugging Face, and I went through them to figure out what is working and what is not. The way to know if a model works is not that obvious. You either watch the console output or click on the download, but the success message is misleading: you have to wait to see if it says “Embedding Model Set to…”, and if nothing happens there could be an error and you will not know. The testing is laborious. You have to reload a corpus and then see if you get reasonable output when you run the RAG with the pound sign to add a document. The base models are pretty good, though, with the default all-MiniLM-L6-v2 delivering a 62.62 score.
- ✅ sentence-transformers/all-MiniLM-L6-v2 seemed to work. Its performance is documented at Sbert.net and it is the default, but it is not particularly high performing, at least as far as getting good results. This is https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 which scores 62.62 (state of the art is more like 72), so not bad for a tiny 23MB model and good for CPU only use. It takes 2 seconds to load a 118KB document
- ✅ sentence-transformers/all-mpnet-base-v2. It is supposed to have the highest performance by a small fraction and it does seem to load OK, but https://huggingface.co/sentence-transformers/all-mpnet-base-v2 scored 57.7, so you should use L6 if you are only using CPU models
Ollama models. You can also have Ollama do the inferencing, but you have to use one of the few GGUF embedding models out there. These are listed from newest to oldest and cross referenced with the MTEB Leaderboard. The net is that you can get to a 64.23 score by using bge-large from Ollama, and if you are hosting it remotely it offloads your local machine. It is a 563M parameter model, so you need quite a bit more to beat a 23MB model, and it takes 8 seconds to process a 127K token document.
This link does seem to have its problems. I get a network error sometimes and the whole open-webui has to be restarted
- granite-embedding. From IBM at 30M and 278M, which score 57.25 and 56.97 respectively, so not that good. See https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual and https://huggingface.co/ibm-granite/granite-embedding-30m-english
- snowflake-arctic-embed2. Snowflake’s latest 57m which looks close to https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0 and shows a 59.48 score. Not that great
- bge-large. https://huggingface.co/BAAI/bge-large-en-v1.5 aka https://ollama.com/library/bge-large scores 64.23, which is the best of the Ollama bunch
- paraphrase-multilingual. A sentence transformer model at 278m or https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2 which is 57.6 so not great
- bge-m3. 567m, does not have full scores on MTEB at https://huggingface.co/biswa921/bge-m3
- mxbai-embed-large. 335m from Mixedbread.ai or https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1 does decent at 63.25
- all-minilm. From SBert.net at 22m and 33m. Judging by model size, I’m guessing this is https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 for the 33m model, scoring 56.53, and https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 for the 22m model. These are the defaults for open-webui
- nomic-embed-text. This is a 137M parameter model which is https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 and scores 62.28 which is not bad for a tiny model.
Now for the top ranked models. These use the internal Open WebUI infrastructure, which is nice since there are many more SentenceTransformer formatted models than GGUF ones. The real attraction is that you can get performance in the 72 range, but it is very slow. The 1.5B Qwen model is a good compromise, delivering a 67 score in 13 seconds.
- ❌ https://huggingface.co/nvidia/NV-Embed-v2. 8B parameters, 72.3 score. Note that on model download I got a timeout error, but this is not exposed in the user interface. It just returns and it looks like the model is loaded. Is it getting no response returned from Hugging Face?
- ❌ https://huggingface.co/infgrad/jasper_en_vision_language_v1. It is not rejected by Hugging Face, but fails with ‘NoneType’ object has no attribute ‘encode’. Some message about model load failure should get surfaced in the UI rather than making you look at logs
- ❌ https://huggingface.co/dunzhang/stella_en_1.5B_v5. 1.5B parameters, 71.2 score. It says no model found with this name and something about no periods allowed in the name. Again there is no error message in the UI and this fails
- ✅ https://huggingface.co/Salesforce/SFR-Embedding-2_R. It scores 70.3 on average but is very slow. Even with GPU work, it takes time to return a result because it is pretty heavyweight at 7B parameters. It takes minutes to produce a result from a 117K token input file, and it takes a while to run on web results. A bad choice if you pair it with a slow search engine like Google PSE (see below).
- ✅ https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct. This does load correctly, but as a 7B parameter model it takes some time. Score is 70.24 and it takes a few minutes on an M4 Max to process 117KB.
- 🟡 https://huggingface.co/BAAI/bge-en-icl. This is a heavyweight 7B parameter model that needs 26GB of storage and it definitely jams the Neural Engine on Apple Silicon. 71.7 score and very slow, takes a minute to process 120KB. This appears to work properly and I can see vectors being returned. The tulu3:8b model does generate a full response and does not cite properly. Llama3.2:3b seems to work fine but doesn’t cite.
- ❌ https://huggingface.co/dunzhang/stella_en_400M_v5 is a lightweight high performance model like its 1.5B brother. This fails with a no attribute encode error in the console, and you can tell it doesn’t work because there is no download.
- ❌ https://huggingface.co/BAAI/bge-multilingual-gemma2. This does download properly and it is a huge 9.2B parameter model scoring 69.88. This seemed to work and then the application ran away, consuming all available memory. Seems like an Open WebUI bug.
- ✅ https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct is a bit farther down the leaderboard. There are many other 7B parameter models, but this is just 1.5B and has a 67.13% score, so hopefully it will be great. This took 13 seconds to embed a 117KB document, so about 50% longer than bge-large. And it handles citations perfectly.
RAG Benchmarking for embedding and the chunk fetching
Nothing is more important than your corpus, so here are suggestions for testing:
- Create a single large file with all your relevant data. For instance, we just take our entire website and pour all the markdown files into a single one.
- Use a stopwatch and set the Embedding model (and reranker too). Then go to the chat, press + and upload that file. When the cursor stops blinking, the elapsed time gives you the tokens per second (from 59K tps on an M4 Max down to 3K tps).
- Then when you ask the question, look at the console and run your stopwatch again. When you see the vectors come back in the console, that’s the time to pick the chunks (and to run the reranker, if you are using one).
- Now to look at quality, go to the actual document that is returned and it will have the relevance and the content. Eyeball it and see how good it is.
- Finally, the interaction with the model is important, so read the text that comes out.
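The arithmetic behind the stopwatch numbers is just tokens divided by seconds. Here is a minimal sketch with a small timing helper; the helper names are mine and the sleep is only a placeholder for the step you are timing by hand.

```python
"""Sketch: time a step with a stopwatch and convert it to tokens per second."""
import time
from contextlib import contextmanager


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput for `tokens` processed in `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


@contextmanager
def stopwatch(label: str, results: dict):
    """Record the wall-clock seconds spent inside the with-block under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings: dict = {}
with stopwatch("embed", timings):
    time.sleep(0.01)  # placeholder for the upload you are timing by hand
print(f"embed took {timings['embed']:.2f}s")
print(tokens_per_second(118_000, 2.0))  # 118K tokens in 2 seconds
```

A 118K token file that embeds in 2 seconds works out to 59,000 tokens per second, which is where the 59K tps figure comes from.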
RAG Recommendation
Basically you can see how good these RAG solutions are by taking a big corpus and then when you enter it, see what parts are actually RAG selected. A quick manual look at the footnotes is really interesting.
- For most people, or if you have a constrained machine, the default sentence-transformers/all-MiniLM-L6-v2 is a good choice. It scores 62.62/100 and is very fast, 2 seconds for 118K tokens, and it is sub-second when doing the RAG search.
- For power users, you get some more accuracy at 64 with bge-large on Ollama, but if you have the hardware, https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct is hard to beat at 67 while still processing 118K tokens in 13 seconds.
- If you really need quality, the really big models like Qwen2-7B are not that practical: they take 90 seconds or more to embed a 118K token document even on an M4 Max. But the results are more detailed, so there is that. As an aside, on our test dataset, although Salesforce was supposed to be better, we found Qwen2-7B more accurate.
Hybrid Search and Rerankers
The idea here is pretty simple, have a fast model do the initial embedding and finding of RAG chunks and then have a heavier weight reranker look at the top K chosen chunks and pick the right chunks by priority. It’s hard to judge whether these are better, but pretty attractive to say use a small model like mini
For rerankers, the mteb/leaderboard lets you know which heavyweight model you want here. So you should pick a lightweight initial model like the default sentence-transformers/all-MiniLM-L6-v2 or Alibaba-NLP/gte-Qwen2-1.5B-instruct and then use a big model to do more work
- ✅ baai/bge-reranker-v2-m3 and sentence-transformers/all-MiniLM-L6-v2. This is also the default in the user interface itself. It doesn’t have an MTEB score, so it is hard to know how good it is, but it is a 568M model, so a good pair to the default 22M one on small machines. Looking at the relevance of the chunks, it does a good job, with a total upload of perhaps 2 seconds for 118K tokens and a processing time of 5 seconds, so it is very usable with GPU acceleration. The quality was OK: it found one good chunk, but the second one was just OK
- ✅ Alibaba-NLP/gte-Qwen2-7B-instruct and Alibaba-NLP/gte-Qwen2-1.5B-instruct. This is the performance leader but very heavy, which is a big price to pay with no GPU. The 7B, at 7.6B parameters, scores 61.4% and even the 1.5B is heavy. 14 seconds to upload, then reranking takes 30 seconds and Tulu emits in 2 seconds. The quality of the RAG output is excellent as well.
- ✅✅ sentence-transformers/all-MiniLM-L6-v2 and Alibaba-NLP/gte-Qwen2-7B-instruct. Given that the quality of the Qwen 1.5B/7B pair was so good compared with L6/reranker, we tried to see if just adding the 7B was the key. So again 2 seconds to upload and then 9 seconds to process, and the results were also excellent
- ✅ sentence-transformers/all-MiniLM-L6-v2 and dunzhang/stella_en_1.5B_v5. Stella comes in again as the best reranker at 61.2%, and this seems to work for reranking but not embedding, which is interesting. It was really fast, 2 seconds to upload and then 5 seconds to return, but the actual chunks returned were not that relevant
The conclusion is stick with the defaults if it’s working if you have a smaller machine, but if you can afford the time, the use of Qwen2-7B for reranking is really important. To me, having more accurate results matters.
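To make the two-stage idea concrete, here is a minimal sketch of retrieve-then-rerank in plain Python. The scoring functions are toy stand-ins that I made up for illustration (a bag-of-words cosine for the fast embedder, a bigram overlap for the heavy reranker). This is not how Open WebUI implements it, and in real use you would plug in the embedding model and reranker discussed above.

```python
import math
from collections import Counter
from typing import Callable


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': word counts, ignoring case."""
    return Counter(text.lower().split())


def fast_score(query: str, chunk: str) -> float:
    """Cheap first-stage score: cosine similarity of word counts."""
    a, b = bag_of_words(query), bag_of_words(chunk)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def bigrams(text: str) -> set:
    words = text.lower().split()
    return set(zip(words, words[1:]))


def slow_score(query: str, chunk: str) -> float:
    """Toy 'reranker': fraction of query bigrams found in the chunk,
    so word order matters (unlike the first stage)."""
    q = bigrams(query)
    return len(q & bigrams(chunk)) / len(q) if q else 0.0


def retrieve_then_rerank(
    query: str,
    chunks: list,
    first_stage: Callable[[str, str], float],
    reranker: Callable[[str, str], float],
    candidates: int = 8,
    top_k: int = 3,
) -> list:
    """Shortlist with the cheap scorer, then reorder the shortlist with the expensive one."""
    shortlist = sorted(chunks, key=lambda c: first_stage(query, c), reverse=True)
    shortlist = shortlist[:candidates]
    return sorted(shortlist, key=lambda c: reranker(query, c), reverse=True)[:top_k]
```

The point of the split is that the expensive scorer only ever sees `candidates` chunks, which is why a 7B reranker on top of a 22M embedder can still be tolerable.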
Ollama if on a remote machine and beware stability issues
You can also switch to Ollama itself instead of the included embedding model. There are no guides, but you can look at the ollama.com site to see what is newest and most popular to try. The main issue is stability: it will fail with network errors. And of course you should do this if you are using Ollama on a remote machine, but the reranking has to be done on the client. So I w
The slowness of OpenWebUI Sentence Transformers on the Mac and a fix post 0.5.4?
These are much bigger than the other default models, but that’s what a big computer is for and they only run once doing the embedding. The downloads for this take forever.
It turns out that this is because there is no detection of Apple Silicon, the so-called “mps” device for PyTorch. I just added a PR for this and it seems to work well. This is supposedly fixed in 0.5.4, but when I load 0.5.4 it still uses the CPU, so I’ve got to figure that out, and a patch is in process.
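For reference, this is roughly what device detection looks like on the PyTorch side. It is a sketch of the general technique, not the actual Open WebUI patch, and the function name pick_device is my own.

```python
def pick_device() -> str:
    """Return "cuda", "mps" (Apple Silicon GPU), or "cpu", whichever is usable."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed, so there is nothing to accelerate.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    # The result can be passed on, e.g. SentenceTransformer(name, device=pick_device())
    print(pick_device())
```

If this prints cpu on an M-series Mac, the problem is in the PyTorch install rather than in Open WebUI.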
Adding Knowledge a Guide
There are a few conveniences that they have which include Workspace > Knowledge that lets you add directories automatically and sync them. Note that if you change the embedding model, you need to reimport all the documents.
This stuff seems a bit buggy, by the way. I’m not sure what is going on, but you need to look at the backend terminal output to see if things are loading.
For Hard documents, use Tika Document Content Extraction
It’s not super clear how to get the names of these models. Open WebUI also lets you set the Content Extraction engine to default or Tika, and I’m not sure exactly what that changes. It also has a PDF extraction system, and it is not clear how that works, as that’s a hard problem.
The Content Extractor is either default or Apache Tika. They recommend using docker for this, but again I’m looking for native stuff. Either way, you basically get it running on http://localhost:9998
and then watch out. This Tika thing is a super parser that knows thousands of data types. It is just a Java program, so I was wondering if there is a pipx-like thing that just installs it. And there is: brew install tika
I’ve not yet seen any gains from doing this, but theoretically they have hundreds of plugins to read things, so I added it.
The default scanner is actually very good and I tried these sample documents:
- Docx. That is the Office format, no problem for both default and Tika.
- Excel XLSX. The thing barfs with a crash for both default and Tika.
- PowerPoint PPTX. Ingested just the text for both default and Tika.
- Adobe PDF. Ingests fine, text only, for both default and Tika, but Tika needs a JPEG-2000 plugin that seems truly hard to install and requires a bunch of work to load it somewhere where it can be found.
- Image PNG. This seems to just get copied up and won’t work for RAG.
- Archive ZIP of documents and code. Caused real headaches: the system just hangs for default, but Tika does seem to read it, although I can’t see the result. It does seem to be using the GPU for something when trying to uncompress a ZIP file.
- Video MP3. I was seeing if it can find the transcript, and then Tika crashes.
So net, net, for most common documents Default seems to be fine, and I’m not quite sure how to diagnose and fix Tika since it’s a Java application.
RAG Changing Top K and Prompt
There are some other simple heuristics, like changing the Top K from 3 to 8 so there is more context for the model. Changing the prompt might also help, but it is all black magic really. A larger Top K will make the reranking job harder, but hopefully gives it more choices.
Large context model slow to load
If your data set is small, you might skip all this, insert the entire file set into the context, and see how it behaves. It turns out this is harder than you think because the default context window is just 2048 tokens, and this silently truncates user input. Also, a number of optimizations are not enabled by default, so you have to set them:
- If you upload a file, it uses RAG by default, so to test the long context you need to concatenate all the input into one text file.
find . -type f -exec cat {} \; | pbcopy
is your friend. This loads up the clipboard and then you can paste it in.
- The first thing you learn is that Open WebUI by default chops all context lengths to just 2K for performance.
- So you have to go into each model in Admin Settings > Models > _Your Model_ > Advanced and set
context length
to whatever the real limit is. I can’t quite figure out how to find this in Open WebUI, but
ollama show llama3.2
for example shows the context length. You can also change this globally, if you think all the models you use support say 128K tokens, with Settings > General > Advanced Parameters > Show > Context Length
- Then you need to optimize the KV Cache and the use of Flash Attention by setting OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0. This halves the memory with no noticeable impact. For small machines, try
q4_0
and this works great.
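If you would rather not paste from the clipboard blind, here is a small sketch that concatenates a directory into one string and gives a rough idea of whether it fits the context window. The 4 characters per token figure is only a rule of thumb I am assuming here, not something Open WebUI or Ollama reports, and the function names are my own.

```python
from pathlib import Path


def concat_files(root: str, suffixes: tuple = (".md", ".txt", ".py")) -> str:
    """Join every matching text file under root, with a header line per file."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"### {path}\n{text}")
    return "\n\n".join(parts)


def rough_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough estimate; real tokenizers vary by model and language."""
    return int(len(text) / chars_per_token)


def fits_context(text: str, num_ctx: int = 2048) -> bool:
    """True if the estimate fits in a context window of num_ctx tokens."""
    return rough_token_count(text) <= num_ctx


if __name__ == "__main__":
    blob = concat_files(".")
    print(rough_token_count(blob), "tokens (rough);",
          "fits 2K default" if fits_context(blob) else "will be truncated at the 2K default")
```

Anything that does not fit the default 2048 is exactly what gets silently chopped until you raise the context length.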
Once you do this, when you Upload a file, you will have the choice:
- Upload a file with the + at the chat
- Click on the file uploaded
- At the upper right, you can choose “Segmented retrieval”, which means use the RAG, or have the whole document just inserted. If you have set the context high enough, the latter is going to take some time. For instance, Tulu3:8B runs at 200 prompt tokens per second, so a 20K token document is going to take about 100 seconds before the first output token.
- I can’t find a way to cache the KV Cache that is created (like Anthropic does), but that would get rid of this delay. It should be in the Knowledge section I think.
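The wait is easy to estimate before you commit to inserting a whole document. A tiny sketch of the arithmetic, using the 200 prompt tokens per second figure from above (your own rate will differ by model and machine):

```python
def prefill_seconds(prompt_tokens: int, prompt_tokens_per_second: float) -> float:
    """Time spent reading the prompt before the first output token appears."""
    if prompt_tokens_per_second <= 0:
        raise ValueError("prompt_tokens_per_second must be positive")
    return prompt_tokens / prompt_tokens_per_second


if __name__ == "__main__":
    for tokens in (2_048, 20_000, 128_000):
        print(f"{tokens:>7} tokens -> {prefill_seconds(tokens, 200):.0f} s")
```

At that rate a full 128K context is over ten minutes of prompt processing on every query, which is why a reusable KV cache would matter so much.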
Web Search
Very confusing, but basically you have to do a few things to enable Web Search; it doesn’t work out of the box:
- Add the search engine to the system; see the notes below on which ones are good. I’m not quite sure which one I like yet.
- Click on the web search button in the chat. What is not clear is what actually happens when you hit send, but basically this feeds the string you send to the search engine. The engine returns a bunch of site contents. These searches return JSON, so you basically get the site and some summary data. It then uses RAG to figure out what’s relevant and answer the question. As an example, using the Jina engine (see below for discussion), it takes about 30 seconds, but it sends back a set of pages. If your LLM supports citations, then it will add a footnote showing where it found the data. I installed each engine, and the basic issue is that when you ask for the top 5 results they should return websites, but some of them are very verbose, and you do not want the HTML goo, just the content. You can actually see what it has sucked in because each page brought in is clickable, and you can see the relevance ranking and also the contents. Hint: most of the searches are worthless because the summary is so small.
- Note that you have to turn on the above Web Search button every time you enter a new query.
- The other mode is to use the pound sign syntax. If it finds a match in the knowledge base, it will do a RAG on that. But if you do #_url_ then it will fetch just that page and load it as a document to search. If you type
#https://tongfamily.com
and then
#https://tne.ai
it will suck in the contents of each as a local file and apply RAG to it. You can click on each document after it finishes spinning to see what is retrieved, and then you can run your query. So this is like controlled searching.
- If you install an LLM that does citations, and there are literally a sea of them (see below), you get a full citation like Bing does. For instance, Tulu3:8b supports this and you get nice footnotes.
Cool Trick: Adding it as a Browser Search Engine
OK, this is a cool trick if your browser supports custom search engines and you are going to localhost all the time for searches.
Go to the place in your browser where you can add custom search engines. For instance with Brave, you go to brave://settings/searchEngines, and next to Site Search there is an Add button. Add the entry below, which means that if you type :oi your_prompt
it will run Open WebUI against it
Name: Open WebUI
Shortcut: :oi
URL: http://localhost:8080/?q=%s
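The browser does the %s substitution for you, but the same URL works from a script or a launcher. A minimal sketch, assuming Open WebUI is on the default localhost:8080 as in the entry above (the helper name is mine):

```python
from urllib.parse import quote


def open_webui_search_url(prompt: str, base: str = "http://localhost:8080") -> str:
    """Build the same URL the browser shortcut produces, with the prompt percent-encoded."""
    return f"{base}/?q={quote(prompt, safe='')}"


if __name__ == "__main__":
    import webbrowser

    # Opens a new chat in the default browser with the prompt filled in.
    webbrowser.open(open_webui_search_url("summarize company"))
```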
Search Engine Selection
The main thing here is you want the search engine to return a nice JSON so the LLM can process it. I’m doing this in order of what the pulldown menu is, presuming that’s a measure of what the default should be:
- SearXNG: This can be a local server (where you will need docker, or you need to git clone the repo and run it as a Python application). It can also be one of the free public instances like
https://kantan.cat/search?q=<query>
but using their query string leads to rate limit errors. If you set it to one concurrent request then it works properly and returns a document, but I still get 403 Forbidden. Basically you need to run your own if you want it to work, or find a not so busy instance like https://search.rhscz.eu. The main trick is figuring out the Query URL, which is of the form
https://localhost/search?q=<query>
but you should look at the console, because most of these sites return an Error 403 Forbidden when you want
format=json
- Google PSE. This requires an API key if you don’t want ads and is limited to 10K queries, which is fine for this, with JSON output. You get a key from Google, and you also need to create a Programmable Search Engine at this place, then enter the Google PSE Engine ID in Admin Settings > Settings > Web Search with the Web Search Engine set to PSE. When this is working, you should see in the console web pages being added to the collections. This whole setup is really slow, taking five or more minutes to bring back a result when looking for the top 10 sites (so set it to the default of 3, or the top five). It also throws lots of warnings about not using HTTPS; since we only give it an ID, there must be something wrong with the URL creation internally.
- Brave Search requires a free API key limited to 1 query/second and 2000 per month. This actually works pretty well, but searching 10 sites is going to take 10 seconds.
- Kagi. The search is in closed beta and relatively expensive at $25/1000 calls. It is ad free though.
- Mojeek. This is a UK site with no ads that charges 1 pound/1000 calls. I’m trying to get a free site first though to see how well they perform.
- Serpstack. This is quite fast, you just need an API key, and it gives you 100 free queries a month. It’s $30/month for JSON returned data for 5,000 searches. The search was pretty relevant: it found this site, tongfamily.com, creativedestructionlab and linkedin.com, plus microsoftalumni.com and pitchbook.com. All were good references, but I got the dreaded return-one-token bug, where it would just stop at the first generated token. I suspect I’m exceeding the 2K token buffer.
- SearchAPI. This uses the Google search engine and looks for sites rather than the specific URL. It seems like it is just finding the top hits from the # sign. If you feed it an exact URL, it searches just that site with the Google engine and does a retrieval. Otherwise it gives you a bunch of results.
- Duckduckgo. Like SearchAPI does not return the contents, seems like it is the search page and isn’t fast
- Serply. This is a special search engine that uses API keys and is designed specifically for searching for market data. The nice thing is that it returns simple JSON that is just the data of the search which is sort of useful. One problem of course is that it only returns the search results and not the full data in the sites themselves. There are a huge number of different ways to run the search such as news searches and so forth, but unfortunately I get a
'NoneType' object has no attribute 'pop'
error.
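Since the whole game is getting clean JSON back, it helps to test an engine outside Open WebUI first. Here is a sketch for a SearXNG-style instance: it builds the query URL with format=json and trims a response down to the fields the LLM needs. The response field names (results, title, url, content) are my assumption of what a SearXNG JSON reply looks like, so check them against what your instance actually returns.

```python
import json
from urllib.parse import urlencode


def searxng_query_url(base_url: str, query: str) -> str:
    """Build a search URL that asks the instance for JSON instead of HTML."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def top_results(payload: str, limit: int = 3) -> list:
    """Keep only title, url and summary for the first `limit` hits."""
    data = json.loads(payload)
    return [
        {
            "title": hit.get("title", ""),
            "url": hit.get("url", ""),
            "content": hit.get("content", ""),
        }
        for hit in data.get("results", [])[:limit]
    ]


if __name__ == "__main__":
    # Fetch this with curl or urllib; a 403 here means the instance blocks format=json.
    print(searxng_query_url("http://localhost:8888", "Richard Tong biography"))
```

If the fetch fails with a 403 from the command line too, no amount of fiddling in the Open WebUI settings will fix it.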
Using the Search: one #url at a time, then ask
The syntax is a # followed by a URL. This looks very inconsistent and seems buggy, but it isn’t; it’s just not well documented. What you do is:
- Enter each URL with a pound sign.
- Press enter after each one
- You should see them become documents
- You can click on each and see the contents
- When you click, you can use the toggle which is set for RAG or “entire document”.
- Then enter your query
- You will see the documents below
- When you click on each, you will see the RAG section that is used. Each will have a relevance score
Each document added is included in the context when you run a query, but sometimes you actually get the search engine results instead. The annotation then shows you the part that the RAG on the Web search has selected. The default is to use “Focused Retrieval”, but in the document you can select “Using Entire Document”.
As an example, enter each of these with a SEND, that’s the main trick
#https://tne.ai/about
#https://tne.ai/solution
#https://tne.ai/app
#https://tne.ai/sys
summarize company
If you want to do a Google like search
Then you should:
- Type the search phrase you want
- Click on the button that says Web search
- This will call the default search engine and return up to three sites.
- You can set how many sites you want in settings
As an example with this prompt and Web Search clicked, you will get the three most relevant sites summarized.
Richard Tong biography
Local RAG with uploaded documents
This actually works well: you just add documents with the plus and it summarizes them. For instance Deepseek Coder 2.5 seems to have issues with RAG, but the local models like Llama3.2 work OK. There do seem to be some bugs, because sometimes it says it can’t find the document files.
Here is the process:
- Upload files
- Then you can run a prompt on them.
- There is a setting for what RAG to use
Speech Models
This part is really undocumented. The UI says to check for cmu-arctic-xvectors, and I’m assuming this works the same way as the RAG stuff, so it wants the huggingface syntax org/repo
, but this does not have any models in it.
It also mentions faster-whisper, but again this does not seem to have a list of models. The list I found seems pretty small, and distil-whisper-large-v3 is not found:
- large-v2
- distil-whisper-large-v3
- small
But these do not seem correct. Looking at the error logs that come out of stderr, if you type in, say, “foo”, then you can see what the valid models are, which as of December 2024 are the ones below. The other options are the OpenAI API and a Web API for the browser speech system.
- tiny.en
- tiny
- base.en
- base
- small.en
- small
- medium.en
- medium
- large-v1
- large-v2
- large-v3
- large
- distil-large-v2
- distil-medium.en
- distil-small.en
- distil-large-v3
I have to say distil-large-v3 sounds pretty good to me.
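Since a wrong name only shows up as an error in stderr, a tiny checker saves a restart or two. This is just a sketch over the list above (valid as of December 2024, so it will go stale), with a suggestion step using difflib from the standard library. The function name is mine.

```python
from difflib import get_close_matches

# The names printed in the faster-whisper error message, as listed above.
VALID_WHISPER_MODELS = [
    "tiny.en", "tiny", "base.en", "base", "small.en", "small",
    "medium.en", "medium", "large-v1", "large-v2", "large-v3", "large",
    "distil-large-v2", "distil-medium.en", "distil-small.en", "distil-large-v3",
]


def check_whisper_model(name: str) -> tuple:
    """Return (is_valid, suggestions); suggestions is empty when the name is valid."""
    if name in VALID_WHISPER_MODELS:
        return True, []
    return False, get_close_matches(name, VALID_WHISPER_MODELS, n=3, cutoff=0.6)


if __name__ == "__main__":
    print(check_whisper_model("distil-whisper-large-v3"))
```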
And in looking at this YouTube video, it’s clear that in addition to these default repos you can specify any Hugging Face repo, so for instance sentence-transformers/all-mpnet-base-v2
which does try to download but doesn’t actually load, so stick to the ones listed. This is supposed to be the best one and the default. There are other tools that show you how to do voice chat, but I can’t find anything about what models to use.
Making Text to Speech work
Well again, there are a huge number of different things, and it’s pretty confusing what is in there right now with the current 0.5.3 version:
- OpenAI. You fill in the API key as usual. It is not clear what TTS models you have available, but the two options appear to be tts-1 and tts-1-hd. The main thing is that tts-1 has lower latency but tts-1-hd has fewer errors. In testing, the delay with tts-1-hd seems minimal compared with LLM processing time, so I use tts-1-hd. You also have a collection of voices: alloy, echo, fable, onyx (a lower soothing voice), nova (a higher voice, sounds like HER) and shimmer (higher voice). I picked nova.
- Transformers (local). This looks like another CPU driven model, and again it’s hard to know what the valid names are here. The user interface says it uses SpeechT5 and CMU Arctic Embeddings. I’m guessing, based on the RAG portion, that these are huggingface models, but it’s not clear what the names are, since SpeechT5 doesn’t have any huggingface names and the CMU listing is for a dataset. I just tried a file name
cmu_us_awb_arctic-wav-arctic_a0001
This seems to cause some serious issues: after I try this, I get
network error
and have to restart everything, but then when I rebooted and tried again, it worked! As an aside, they have literally 80 pages of models, with the last one being
cmu_us_slt_arctic-wav-arctic_b0539
The first one is a lower voice and the second a higher one. It keeps generating an error that ends with
Output channels > 65536 not supported by MPS Device
The Voice call and Microphone Buttons for Voice Assistant
Well there are two other cool modes that this supports decently and there is a lot of work to make an interactive local assistant. And the documentation is currently blank on how to do text-to-speech 🙁
- You can click on the microphone. The main thing here is that the browser is doing this, so you have to wait a little bit before talking, and then when done you click on the stop icon. The STT (Speech to Text) works pretty well though.
- OpenedAI-Speech. This is another choice that I need to try, as it provides a local OpenAI compatible end point so you can just use a local application (the same way that Connections provides this for models). It hasn’t been updated in five months though, so I wonder how usable it is. But it provides both tts-1 and tts-1-hd as well as all the voices. It doesn’t seem to work reliably, plus Coqui went bankrupt.
- openai-edge-tts. This is another project that is similar but just uses edge-tts from Microsoft. You can run it with docker, or it is just a python project that you can
uv run
easily.
- The Voice Call is pretty cool. If you have something like a Vision model, you can get both voice and also images into the system. It is definitely not fast enough for typical conversations even on an M4 Max, but we are getting there!
Installation of ComfyUI and its Bizarre UI
If you are going all local, the one thing that would be nice to have is local image generation. There are plenty of local image recognition models like llama3.2-vision:90b
but to generate images you have to get ComfyUI running. The setup instructions are a mess, but they are working on a desktop application like DiffusionBee (which is amazing by the way).
The setup is beyond byzantine as it requires cloning a repo, but the desktop application setup seems to work now. There are some strange things about this application:
- You can download the test application for ComfyUI Desktop.
- It takes a good long time for it to actually compile and run, and it took about 30 seconds to boot up, so be patient
- There are some sample templates, but it’s not obvious how to add new models. The guide says you have to spelunk around Civitai and then download, and you have to refresh the application to see these. If you use the flux.1 template, it isn’t compatible
- At the upper right there is a Manager button. Yes, I don’t know why that is there, but it is apparently a separate application. There you get a curated list of models to install. It wants a checkpoint, so if you filter for that, you can try to install one. The one I use on DiffusionBee is Flux.1 [dev], so I’d like to give that a try. The Schnell one is the other one that is downloaded by default.
- After you load a model, you need to hit Refresh in the Manager to see it in the Load Checkpoint box.
- This thing has a huge graphical editor so figuring out the right way to lay things out isn’t easy. But they have some defaults.
- The installation happens by default in
~/Documents/ComfyUI
so if you have iCloud sync running make sure you have enough space.
Tools, Functions and Pipelines
I do not really understand this yet, but it is supposed to allow you to run code in the system. The easiest thing to do is just to use the Model system, where there are many providers that use the OpenAI API and you can plug in there, but here is their nomenclature:
- Pipelines. For most things you can just use Connections if it is just an API plugin, or Functions if you just need something simple. Pipelines are more complex, and technically they just look like an OpenAI API end point.
- Tools. These run externally, so they don’t see any of the environment inside OpenWebUI.
- Functions. These run “in context” so they can manipulate and act on the different things inside Open WebUI, so they can do things like display and work. You have to manually assign Functions to Models in the Workspace > Model section. They can do things like pre-process data as an Inlet Function or post process with an Outlet Function. We used the Functions for Google Gemini and Anthropic. But be careful loading these, they are inside the system and when loading a bunch, I managed to crash Open WebUI
- Pipes. A Pipe looks like a single Model.
- Manifold is a collection of Models.
- Valves are user configurable data (I know, right, taking the Pipe metaphor pretty far).
User Valves are things that anyone can set.
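To make the Inlet/Outlet and Valves vocabulary less abstract, here is a sketch in the general shape of a filter-style Function. Treat it as an illustration only: I am using a plain dataclass as a stand-in for the pydantic model that real Functions use for Valves, the max_turns setting is invented, and I am assuming the request body carries an OpenAI-style messages list. Check the community Functions for the exact signatures before copying this.

```python
from dataclasses import dataclass


class Filter:
    @dataclass
    class Valves:
        # Hypothetical admin-configurable setting: how many non-system messages to keep.
        max_turns: int = 4

    def __init__(self) -> None:
        self.valves = self.Valves()

    def inlet(self, body: dict) -> dict:
        """Pre-process the request: keep system messages plus the last max_turns others."""
        messages = body.get("messages", [])
        system = [m for m in messages if m.get("role") == "system"]
        rest = [m for m in messages if m.get("role") != "system"]
        keep = rest[-self.valves.max_turns:] if self.valves.max_turns > 0 else []
        body["messages"] = system + keep
        return body

    def outlet(self, body: dict) -> dict:
        """Post-process the response: trim stray whitespace from assistant messages."""
        for message in body.get("messages", []):
            if message.get("role") == "assistant" and isinstance(message.get("content"), str):
                message["content"] = message["content"].strip()
        return body
```

The useful mental model is that inlet sees the request before the model does and outlet sees the reply before the user does, with Valves as the knobs in between.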
Side Note: Using UV run and uvx and scripts and asdf and direnv interactions
There are so many of these things that require that you clone and then pip install something. At the very least you should just use uv for this, so at least you get an environment. We will post more on how to do this, but the main point is that there are three ways to run python now with uv:
- From a python source repo. You create a pyproject.toml with
uv init
and then you add dependencies with
uv add torch
etc. You can then create a virtual environment with
uv venv
and now when you go to that directory, you can
source .venv/bin/activate
and it will start the venv for that directory, and then
deactivate
gets you out. Alternatively, you just insert
uv run
in front of everything and it starts it up. This is required with direnv, since sourcing is not something direnv can do: it runs as a subshell, so you only get exports from it. But you can do even better, see the next item.
- If you are a lover of asdf and asdf-direnv, you can automate this so you do not have to do the manual activation, the way pipenv does (but that is too slow). Just add
layout python
to your .envrc and it should pick it up. Note that unlike uv venv, this creates a virtual environment that is python version specific. So it is more general than uv venv: you can have more than one version, but if you have conflicts you have to be clever about what python modules you use. You also need to modify your PS1 to pick up the VIRTUAL_ENV that is created so you have some idea in the command line what is happening, but note that the zsh powerline does this automatically, and powerline for bash as well. Note that if you use asdf and direnv, then you will never use the system python or others while inside your $HOME directory. You can use uv directly as well: you just need to add some code to your $XDG_CONFIG_HOME/direnv/direnvrc so you can use
layout uv
in your .envrc and it all works.
- Python script with uv. You can have a single script with all its dependencies declared in an inline metadata comment at the top of the file. So take a single file script and run
uv run script.py
which works if it has no dependencies or just the standard library ones. Or if you run
uv init --script script.py --python 3.12
it will inject the right metadata block, so
uv run script.py
creates a venv and just works.
- Pip package with cli, aka tools. If you have a properly packaged python pip package and it has command line entry points, then you can do a
uvx ruff
This works because python packages have entry points as part of their packaging, so
uvx --from httpie http
works when the entry point is different from the package name. And you can even ask for extras, so
uvx --from 'mypy[faster-cache]' mypy --xml-report report
works, which is really nice. You can create such a package with uv build and then uv publish it. The basic idea is that command line tools can be declared in your pyproject.toml and written using typer, which is the new hotness for CLI applications (from the author of FastAPI for Web applications):
# in ./src/greetings/cli.py
import typer

from .greet import greet  # greet() lives in ./src/greetings/greet.py

app = typer.Typer()
app.command()(greet)

if __name__ == "__main__":
    app()

# in ./src/greetings/__init__.py
# make sure there is an empty __init__.py

# in ./src/greetings/__main__.py
# this supports python -m greetings
if __name__ == "__main__":
    from greetings.cli import app

    app()

# in pyproject.toml
# assumes the cli is in ./src/greetings/cli.py and exposes app()
[project.scripts]
greet = "greetings.cli:app"
Side note the frustrations of Unifi Threat
They’ve moved the threat protection logs yet again and buried them, so if you do a git push and it hangs, then go to unifi.ui.com > _Your Controller_ > Network > Insights (the bulb on the left). Then pick Flows from the dropdown and click on Threats
The second bizarre thing is that you can click on things like “Allow Threat Signature”, but if you accidentally click on Block this IP
then there is no way to toggle it.
Instead you have to go to the Firewall rules in Network > Settings > Security > Traffic & Firewall Rules
and then search for the IP you just blocked
Then you have to scroll all the way down this huge table to find a Manage
button. Why this is not at the top is a mystery. It then creates a checkbox, and you scroll all the way down again to Remove.