Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · edit-2 2 years ago

Guide to Self Hosting LLMs Faster/Better than Ollama

thirdBreakfast@lemmy.world · 2 years ago

Guide to Self Hosting LLMs with Ollama.

Download and run Ollama
Open a terminal, type ollama run llama3.2

AliasAKA@lemmy.world · 2 years ago

Bookmarked and will come back to this. One thing that may be if interest to add is for AMD cards with 20gb of ram. I’d suppose that it would be Qwen 2.5 34B with maybe less strict quant or something.

Also, it may be interesting to look at the AllenAI molmo related models. I’m kind of planning to do this myself but haven’t had time as yet.

brucethemoose@lemmy.world · 2 years ago

Yep. 20GB is basically 24GB, though its too tight for 70B models.

One quirk for 7900 owners is that installing flash attention for long context usage can be a pain. Apparently it is doable now, I need to dig up the link, but it might just be easier to use kobold.cpp rocm with its native flash attention.

As for vision models, that is a whole different can of worms. Exllama does not support this, so you’d need a framework that does.

If you are looking for niche models, check out MiniG (which is a continued pretrain of the already very excellent GLM4-9B): https://huggingface.co/bartowski/miniG-GGUF

Llama.cpp support is recent, though I’m not 100% sure its completely fixed. It should work in Aphrodite as well.

Konraddo@lemmy.world · 2 years ago

I know this is not the theme of this post, but I wonder if there’s an LLM that doesn’t hallucinate when asked to summarize information of a group of documents. I tried Gpt4all for simple queries like finding out which documents mentioned a certain phrase. It often gave me filenames that didn’t actually exist. Hallucinating contents is one thing but making up data source is just horrible.

brucethemoose@lemmy.world · 2 years ago

That’s absolutely on topic, check out https://huggingface.co/spaces/vectara/Hallucination-evaluation-leaderboard

Command R is built for this if you have the vram to swing it, otherwise GLM4 (or MiniG as linked below) is great. The later, unfortunately, doesn’t work with TabbyAPI, so you have to use something like Kobold.cpp.

You also have to use very low (basically zero) temperature and be careful with other sampling settings, and watch your context length.

There are more sophisticated RAG setups some of these UIs (like open Web UI) integrate, and sometimes you’ll need to host an embeddings model alongside the llm for that to work.

sturlabragason@lemmy.world · 2 years ago

Frontendwise; Librechat is pretty cool.

WolfLink@sh.itjust.works · 2 years ago

Could I run larger LLMs with multiple GPUs? E.g. would 2x3090 be able to run the 48GB models? Would I need NVLink to make it work?

brucethemoose@lemmy.world · edit-2 2 years ago

Absolutely.

Only aphrodite (and other enterprise backends like vllm/sglang) can make use of NVLink, but even exllama or mlc-llm split across GPUs nicely over PCIe, no NVLink needed.

2x 3090s or P40s is indeed a popular config among local runners, and is the perfect size for a 70B model. Some try to squeeze Mistral-Large in, but IMO its too tight a fit.

brucethemoose@lemmy.world · 2 years ago

Also, AMD is not off the table for multi-gpu. I know some LLM runners are buying used 32GB MI100s.

shaserlark@sh.itjust.works · 2 years ago

I run a Mac Mini as a home server because it’s great for hardware transcoding, I was wondering if I could host an LLM locally. I work with python so that wouldn’t be an issue but I have no idea how to do CUDA or work on low level code. Is there anything I need to consider? Would probably start with a really small model.

thirdBreakfast@lemmy.world · 2 years ago

If it’s an M1, you def can and it will work great. With Ollama.

shaserlark@sh.itjust.works · 2 years ago

Yeah it’s an M1 16GB, sounds awesome I’ll try, thanks a lot for the guide it’s super helpful. I just got the Mac Mini for jellyfin but this is an unexpected use case where the server comes in very handy.

brucethemoose@lemmy.world · 2 years ago

For that you probably want the llama.cpp server and a Qwen2 14B IQ3 quantization.

16GB is kinda tight though, especially if you’re running other stuff in the background.

BaroqueInMind@lemmy.one · 2 years ago

It abstracts away llama.cpp in a way that, frankly, leaves a lot of performance and quality on the table.

OP, do you have any telemetry you can show us comparing the performance difference between what you setup on this guide and an Ollama setup? Otherwise, at face value, I’m going to assume this is another thing on the internet i have to assume is uncorroborated bullshit. Apologies for sounding rude.

I don’t like some things about the devs. I won’t rant, but I especially don’t like the hint they’re cooking up something commercial.

This concerns me. Please provide links for us to read here about this. I would like any excuse to uninstall Ollama. Thank you!

morrowind@lemmy.ml · 2 years ago

Honestly, I’m just gonna stick to llamafile. I really don’t want to mess around with python. It also causes way more trouble than I anticipate

brucethemoose@lemmy.world · 2 years ago

Llamafile is fine, but it still leaves a lot of performance on the table.

You can setup kobold.cpp with Q8 flash attention without ever having to install pytorch, which is the real headache. It does have a little python launch script, but its super minimal.

You can use the native llama.cpp server for absolutely zero python usage.

Grimy@lemmy.world · 2 years ago

vLLM can only run on linux but it’s my personal favorite because of the speed gain when doing batch inference.

brucethemoose@lemmy.world · 2 years ago

Aphrodite is a fork of vllm. You should check it out!

If you are looking for raw batched speed, especially with some redundant context, I would actually recommend sglang instead. Check out its experimental flags too.

sleep_deprived@lemmy.world · 2 years ago

I’d be interested in setting up the highest quality models to run locally, and I don’t have the budget for a GPU with anywhere near enough VRAM, but my main server PC has a 7900x and I could afford to upgrade its RAM - is it possible, and if so how difficult, to get this stuff running on CPU? Inference speed isn’t a sticking point as long as it’s not unusably slow, but I do have access to an OpenAI subscription so there just wouldn’t be much point with lower quality models except as a toy.

brucethemoose@lemmy.world · edit-2 2 years ago

CPU inference is, unfortunately, slow, even on my 7800X3D.

The one that might be interesting is deepseek code v2 lite, as its a very fast MoE model. IIRC microsoft also released a Phi MoE thats good for CPU.

Keep an eye out for upcoming bitnet models.

Dont bother upgrading RAM though. You will be bandwidth limited anyway, and it doesn’t make a huge difference.

kitnaht@lemmy.world · edit-2 2 years ago

If your “FIRST STEP” is to choose an OS: Fuck that.

You should never have to change your OS just to use this crap. It’s all written in Python. It should work on every OS available. Your first step is installing the prerequisites.

If you’re using something like Continue for local coding tasks, CodeQwen is awesome, and you’ll generally want a context window of 120k or so because for coding, you want all the code context - or else the LLM starts spitting out repetitious stuff, or can’t ingest all of your context so it’ll rewrite stuff that’s already there.

sturlabragason@lemmy.world · 2 years ago

Choose OS is very relevant when doing cloud stuff.

brucethemoose@lemmy.world · 2 years ago

Or setting up a home server, which I figured some here would do.

L_Acacia@lemmy.one · 2 years ago

llama.cpp works on windows too (or any os for that matter), though linux will vive you better performances

brucethemoose@lemmy.world · 2 years ago

CodeQwen 1.5 is pretty old at this point, afaik made obsolete by their latest release.

The Qwen models (at least 2.5) are really only good to like 32K, which is still a ton of context. But I’ve been testing Qwen 32B at 64K -90K and even that larger model is… Not great.

32K is generally enough to get the jist of whatever you’re trying to fill in.

gravitas_deficiency@sh.itjust.works · 2 years ago

Wtf are you talking about. PCIe passthrough exists.

brucethemoose@lemmy.world · 2 years ago

I would not recommend that for performance reasons, AFAIK.

Windows is fine, I should make that more clear.

gravitas_deficiency@sh.itjust.works · 2 years ago

Huh, really? Is there that much of a perf hit using passthrough? I’d have assumed that the bottleneck isn’t actually the PCIE, so much as it is the beefiness of the GPU crunching the model.

brucethemoose@lemmy.world · 2 years ago

I have not tested WSL or VMs in Windows in awhile, but my impression is that “it depends” and you should use the native windows version unless you are having some major installation issues.

kitnaht@lemmy.world · 2 years ago

Why would you even bother trying to run this all through a VM when you can just run it directly? If you’re to the point of using VMs, you don’t need this tutorial anyways.

Are you seriously telling me you’re jumping through all the hoops to spin up a VM on Linux, and then doing all the configuration for GPU passthrough, because you can’t just figure out how to run it locally?

gravitas_deficiency@sh.itjust.works · 2 years ago

Bro this is a community for sharing knowledge and increasing the technical aptitude of fellow users by doing said sharing. Maybe instead of shitting on a pretty solid digest of the fundamentals of setting up something like this, try adding to the body of knowledge instead.