ai: Open WebUI -1 problems, more RAM, $200 plans, and ComfyUI URLs

OK, some miscellaneous but valuable notes from the AI journey.

Open WebUI model parameters redux

The default context length and other parameters in Open WebUI are really low, like a 2K context window, which doesn’t make any sense in the era of AI coding. We’ve covered some defaults that help before, like setting values to -1, but I found that with the Qwen models, -1 causes them to emit nonsense.

Here, as a reminder, are the base parameters (this is super hard to find and isn’t in the documentation anywhere). Most of these values come from the Ollama Modelfile directly, so use that to figure out what is going on. One of these parameters breaks the Qwen3 and Gemma3n models (see num_ctx below), so here are some recommendations:

  1. max_tokens. What is the maximum number of tokens in a response, basically, how wordy should it be? This has two magic values: -1, which means do what is natural for the model, and -2, which means fill the entire context with the response. The default is 128 tokens. Using -1 feels the most natural and gives effectively infinite generation. Note that this maps to num_predict in Ollama, where -1 likewise means infinite generation.
  2. num_ctx. This is the context window used to generate the next token. Setting it to -1 is supposed to mean use the maximum from the Modelfile. The old default was 2048 tokens, but as of Open WebUI 0.6.18 the default appears to be the model’s maximum context length, so that is changing. Interestingly, if I set this to -1 for qwen3-coder, I get garbage, so this parameter looks busted. It is supposed to be 256K, so I tried setting it manually, and that works. It is a strange bug, since ollama show qwen3-coder:30b-a3b-q4_K_M reports 262,144, which is correct. Note that this covers both input and output tokens, so you want it as big as your system memory allows. Setting it to -1 or 0 breaks gemma3n, qwen3, and qwen3-coder; some manuals say 0 means the default, but neither value works. This number can have a huge effect: for instance, the Qwen3 Coder model is 112GB with a 2K context but balloons to 150GB with a 256K context, or roughly 4GB per 32K tokens. This memory appears to be statically allocated, and Ollama seems to be smart about it; if it is a big model, like a 235B model at Q3 (so 95GB of weights), it will shrink the default num_ctx to, say, 128K to fit.
  3. num_keep. This is how many tokens to keep from the past when regenerating content. Huh? Basically, there is a controversy: some say it means that when the context buffer is full, this is how many of the *initial* tokens you keep, so that if you have, say, a system prompt at the beginning, you don’t lose the front of the context. Others say it is how many tokens you keep from the immediate past (so you discard from the start until num_keep tokens remain, and it is about recency). The default is 24 tokens, which under the first reading means it keeps the first 24 tokens and discards from there. I’m thinking it is the former, so it’s a tradeoff: set it too high and you forget the recent stuff. There is no standard recommendation, but most of the time you have some initial instructions, so I set it to the lower of 20% of num_ctx or 24K tokens to cover common coding prompts, and keep it pretty static, since for coding I want lots of context.
  4. num_batch. This is how many tokens are processed at once during prompt ingestion. The default is 512, but we recommend 1024 or 2048 to get a faster time to first token. I usually use 2048 if it’s a small model; this increases RAM usage, but it is not clear by how much. Essentially, the size of this cache goes up linearly with the batch size, so a 1024 batch needs twice the memory of a 512 one. For this to matter, you need really long inputs: at 512, the model ingests 512 tokens at a time, and in fact, you can see in the Ollama log the batches it works through and the seconds per iteration. On my M4 Max, it takes 1.5 seconds to load the runner and then 6 seconds per batch iteration, so 30K tokens of input is going to be really slow. The interesting thing is that when I change 512 to something else, RAM use definitely shrinks or grows, but it doesn’t show up in ollama ps, so there must be KV cache memory that ollama ps doesn’t report; the context-length memory definitely shows up there, though. More experimentation is needed, but it’s pretty clear that for very large models the default of 512 is too big, so some hand-tuning is needed here. A sketch of how all four settings map onto an Ollama call follows this list.
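
Here is a minimal sketch of how these four settings map onto Ollama’s options if you call the API directly (these are the same option names Open WebUI forwards); the model tag and values are just the ones discussed above, not official recommendations:

# a sketch: pass the four options straight to Ollama's /api/generate endpoint
# num_predict is what Open WebUI calls max_tokens; values are illustrative
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-coder:30b-a3b-q4_K_M",
  "prompt": "Say hello",
  "options": {
    "num_predict": -1,
    "num_ctx": 262144,
    "num_keep": 24,
    "num_batch": 1024
  }
}'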

More GPU RAM and crashes

A little complicated, but setting the wired RAM limit high is excellent. The default is that ⅓ of memory is reserved for non-GPU use if total RAM is 36GB or less, and 25% is reserved if it’s more than that. But you can shave a lot closer.

On a 128GB machine, instead of 96GB for the GPU, I can get up to 112GB. I found this by experimentation, checking when Ollama shows that it is offloading to the GPU.

# sets the gpu maximum memory to 112GB, the value is in MB
sudo sysctl iogpu.wired_limit_mb=112000
# reset when done
sudo sysctl iogpu.wired_limit_mb=0
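
If you want to sanity-check the numbers first, here is a small sketch (mine, not an Apple or Ollama recipe) that reads total RAM and prints the stock reservation rule next to a more aggressive limit; the 7/8 fraction is just the 112GB-of-128GB ratio I landed on by experimentation:

# print total RAM, the stock GPU limit, and a more aggressive candidate (all in MB)
TOTAL_MB=$(( $(sysctl -n hw.memsize) / 1024 / 1024 ))
if [ "$TOTAL_MB" -le 36864 ]; then
  STOCK_MB=$(( TOTAL_MB * 2 / 3 ))    # stock rule: 1/3 reserved for non-GPU at 36GB or less
else
  STOCK_MB=$(( TOTAL_MB * 3 / 4 ))    # stock rule: 25% reserved for non-GPU above 36GB
fi
AGGRESSIVE_MB=$(( TOTAL_MB * 7 / 8 )) # about 112GB on a 128GB machine
echo "total=${TOTAL_MB} stock_gpu=${STOCK_MB} aggressive=${AGGRESSIVE_MB}"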

Nightmare of $200 Max plans: Cerebras minute limits and Claude weekly limits

Who doesn’t love a good deal? Claude has a $200 Max plan where you get lots of credits, but they’ve put in a weekly limit.

Cerebras then came out with a $200 plan that at first seems amazing, but it is far from unlimited. It has a per-minute maximum of 275K tokens, which sounds like a lot until you realize that with something like Roo Code pushing a 200K-token context (easy to fill when you have lots of files, and who doesn’t when coding), you can essentially make only a single request PER MINUTE. Yes, they allow up to 120M tokens per hour, but it is impossible to reach that under the per-minute cap.
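
To make the gap concrete, a quick back-of-the-envelope check using the limits above:

# the most you can push at the per-minute cap, versus the advertised hourly cap
echo $(( 275000 * 60 ))   # 16,500,000 tokens/hour, far below the 120M ceiling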

Oh well, at least that subscription is monthly only, so if you want to sit there and press enter every minute, the results come back super fast, but then you get to wait a minute. Yuck.

Claude has a five-hour limit, which we often break, but now there’s a weekly limit, so that plan is also nerfed.

This is just another reminder of how expensive Nvidia GPUs are: you can easily burn $2K per day (I did $6K last month on API calls alone!), so I’m off to find smaller models that are good enough most of the time.

I’m finding Kimi K2 and Qwen3 Coder are great, so I’m off to find low-cost hosts for them. I estimate I’m using about 100M tokens a day, so you either need input-token caching (which is rare) or, at $200/month, that’s roughly $6.50/day of spend, which means I need a plan that averages around 6 or 7 cents per million tokens. That’s small!
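
The budget math, roughly, given my 100M-tokens-a-day estimate:

# $200/month spread over 30 days, then divided by 100M tokens (expressed per million)
awk 'BEGIN { daily = 200 / 30; printf "daily=$%.2f per_million_tokens=$%.3f\n", daily, daily / 100 }'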

Claude messes up \n in markdown and JSON

The most annoying bug in Claude is that it often puts in an escaped \n instead of a real newline. So, lots of times I need the magic :%s/\\n/\r/g to fix it in neovim. How annoying. This doesn’t seem to happen with other coders. This is one case where it is really annoying that VSCode doesn’t have a true neovim editor in it.

VSCode has Shift-Cmd-H (find and replace across all files) or Ctrl-Cmd-F in the current file, which does find and replace, but while graphical, it’s hard to figure out, particularly since the pane on the left doesn’t have an obvious “Go” button.

It is this strange icon to the right of the Replace field, which is also Option-Command-Enter. Also, if you want a real newline in the field, the magic Ctrl-Enter gives you one.

ComfyUI Automated Download and E4m3 nightmare for Mac

OK, this is a strange bug, but the Mac is always a little different. There is a new 8-bit floating-point format called e4m3fn, which is a different weight type. I was downloading these models and switching things to fp16, and then discovered that if you just flip the type from fp8_e4m3fn to “default”, it works.

Note also that you can replace these models with their GGUF equivalents, and this works as well. It does mean you have to change the unet loader to the GGUF version, but then you get the smaller 8-bit models, or even 4-bit if you dare.

Since I have a 128GB Mac, it’s easier just to change the type to “default” and live with the FP16 size.

Also, if a workflow includes a model URL, ComfyUI will sometimes download the model automatically, but sometimes this fails and it doesn’t even show the download. You can also go to the ComfyUI Manager, choose the Model Manager, and type the model name in there. But sometimes the workflow just doesn’t seem to trigger the automated download at all.
