I know it’s a downvote earner on Lemmy but my 64gb M1 Max with its unified memory runs these large scale LLMs like a champ. My 4080 (which is ACHING for more VRAM) wishes it could. But when it comes to image generation, the 4080 smokes the Mac. The issue with image generation and VRAM size is you can think of the VRAM like an aperture, and the lesser VRAM closes off how much you can do in a single pass.
Not running any LLMs, but I do a lot of mathematical modelling, and my 32 GB RAM, M1 Pro MacBook is compiling code and crunching numbers like an absolute champ! After about a year, most of my colleagues ditched their old laptops for a MacBook themselves after just noticing that my machine out-performed theirs every day, and that it saved me a bunch of time day-to-day.
Of course, be a bit careful when buying one: Apple cranks up the price like hell if you start specing out the machine a lot. Especially for RAM.
For flux, I actually use a script that quantizes it down to 8 bit (not FP8, but true quantization with huggingface quanto), but I would also highly recommend checking this project out. It should fit everything in vram and be dramatically faster: https://github.com/mit-han-lab/nunchaku
I just run SD1.5 models, my process involves a lot of upscaling since things come out around 512 base size; I don’t really fuck with SDXL because generating at 1024 halves and halves again the number of images I can generate in any pass (and I have a lot of 1.5-based LORA models). I do really like SDXL’s general capabilities but I really rarely dip into that world (I feel like I locked in my process like 1.5 years ago and it works for me, don’t know what you kids are doing with your fancy pony diffusions 😃)
I usually run batches of 16 at 512x768 at most, doing more than that causes bottlenecks, but I feel like I was also able to do that on the 3070ti also. I’ll look into those other tools though when I’m home, thanks for the resources. (HF diffusers? I’m still using A1111)
(ETA: I have written a bunch of unreleased plugins to make A1111 work better for me, like VSCode-like editing for special symbols like (/[, and a bunch of other optimizations. I haven’t released them because they’re not “perfect” yet and I have other projects to be working on, but there’s reasons I haven’t left A1111)
The issue with Macs is that Apple does price gouge for memory, your software stack is effectively limited to llama.cpp or MLX, and 70B class LLMs do start to chug, especially at high context.
Diffusion is kinda a different duck. It’s more compute heavy, yes, but the “generally accessible” software stack is also much less optimized for Macs than it is for transformers LLMs.
I view AMD Strix Halo as a solution to this, as its a big IGP with a wide memory bus like a Mac, but it can use the same CUDA software stacks that discrete GPUs use for that speed/feature advantage… albeit with some quirks. But I’m willing to put up with that if AMD doesn’t price gouge it.
Apple price gouges for memory, yes, but a 64gb theoretical 4090 would have cost as much in this market as the whole computer did. If you’re using it to its full capabilities then I think it’s one of the best values on the market. I just run the 20b models because they meet my needs (and in open webui I can combine a couple at that size), as I use the Mac for personal use also.
GDDR is actually super cheap! I think it would only be like another $75 on paper to double the 4090’s VRAM to 48GB (like they do for pro cards already).
Nvidia just doesn’t do it for market segmentation. AMD doesn’t do it for… honestly I have no idea why? They basically have no pro market to lose, the only explanation I can come up with is that their CEOs are colluding because they are cousins. And Intel doesn’t do it because they didn’t make a (consumer) GPU that was eally worth it until the B580.
Oh I didn’t mean “should cost $4000” just “would cost $4000”. I wish that the vram on video cards was modular, there’s so much ewaste generated by these bottlenecks.
Oh I didn’t mean “should cost $4000” just “would cost $4000”
Ah, yeah. Absolutely. The situation sucks though.
I wish that the vram on video cards was modular, there’s so much ewaste generated by these bottlenecks.
Not possible, the speeds are so high that GDDR physically has to be soldered. Future CPUs will be that way too, unfortunately. SO-DIMMs have already topped out at 5600, with tons of wasted power/voltage, and I believe desktop DIMMs are bumping against their limits too.
But look into CAMM modules and LPCAMMS. My hope is that we will get modular LPDDR5X-8533 on AMD Strix Halo boards.
I know it’s a downvote earner on Lemmy but my 64gb M1 Max with its unified memory runs these large scale LLMs like a champ. My 4080 (which is ACHING for more VRAM) wishes it could. But when it comes to image generation, the 4080 smokes the Mac. The issue with image generation and VRAM size is you can think of the VRAM like an aperture, and the lesser VRAM closes off how much you can do in a single pass.
Not running any LLMs, but I do a lot of mathematical modelling, and my 32 GB RAM, M1 Pro MacBook is compiling code and crunching numbers like an absolute champ! After about a year, most of my colleagues ditched their old laptops for a MacBook themselves after just noticing that my machine out-performed theirs every day, and that it saved me a bunch of time day-to-day.
Of course, be a bit careful when buying one: Apple cranks up the price like hell if you start specing out the machine a lot. Especially for RAM.
You can always uses system memory too. Not exactly an UMA, but close enough.
Or just use iGPU.
It fails whenever it exceeds the vram capacity, I’ve not been able to get it to spillover to the system.
You don’t want it to anyway, as “automatic” spillover with an LLM painfully slow.
The RAM/VRAM split is manually configurable in llama.cpp, but if you have at least 10GB VRAM, generally you want to keep the whole model within that.
Oh I meant for image generation on a 4080, with LLM work I have the 64gb of the Mac available.
Oh, 16GB should be plenty for SDXL.
For flux, I actually use a script that quantizes it down to 8 bit (not FP8, but true quantization with huggingface quanto), but I would also highly recommend checking this project out. It should fit everything in vram and be dramatically faster: https://github.com/mit-han-lab/nunchaku
I just run SD1.5 models, my process involves a lot of upscaling since things come out around 512 base size; I don’t really fuck with SDXL because generating at 1024 halves and halves again the number of images I can generate in any pass (and I have a lot of 1.5-based LORA models). I do really like SDXL’s general capabilities but I really rarely dip into that world (I feel like I locked in my process like 1.5 years ago and it works for me, don’t know what you kids are doing with your fancy pony diffusions 😃)
Oh you should be able to batch the heck out of that on a 4080. Are you not using HF diffusers or something?
I’d check out stable-fast if you haven’t already:
https://github.com/chengzeyi/stable-fast
VoltaML is also old at this point, but it has really fast AITemplate implementation for SD 1.5: https://github.com/VoltaML/voltaML-fast-stable-diffusion
I usually run batches of 16 at 512x768 at most, doing more than that causes bottlenecks, but I feel like I was also able to do that on the 3070ti also. I’ll look into those other tools though when I’m home, thanks for the resources. (HF diffusers? I’m still using A1111)
(ETA: I have written a bunch of unreleased plugins to make A1111 work better for me, like VSCode-like editing for special symbols like (/[, and a bunch of other optimizations. I haven’t released them because they’re not “perfect” yet and I have other projects to be working on, but there’s reasons I haven’t left A1111)
The issue with Macs is that Apple does price gouge for memory, your software stack is effectively limited to llama.cpp or MLX, and 70B class LLMs do start to chug, especially at high context.
Diffusion is kinda a different duck. It’s more compute heavy, yes, but the “generally accessible” software stack is also much less optimized for Macs than it is for transformers LLMs.
I view AMD Strix Halo as a solution to this, as its a big IGP with a wide memory bus like a Mac, but it can use the same CUDA software stacks that discrete GPUs use for that speed/feature advantage… albeit with some quirks. But I’m willing to put up with that if AMD doesn’t price gouge it.
Apple price gouges for memory, yes, but a 64gb theoretical 4090 would have cost as much in this market as the whole computer did. If you’re using it to its full capabilities then I think it’s one of the best values on the market. I just run the 20b models because they meet my needs (and in open webui I can combine a couple at that size), as I use the Mac for personal use also.
I’ll look into the Amd Strix though.
GDDR is actually super cheap! I think it would only be like another $75 on paper to double the 4090’s VRAM to 48GB (like they do for pro cards already).
Nvidia just doesn’t do it for market segmentation. AMD doesn’t do it for… honestly I have no idea why? They basically have no pro market to lose, the only explanation I can come up with is that their CEOs are colluding because they are cousins. And Intel doesn’t do it because they didn’t make a (consumer) GPU that was eally worth it until the B580.
Oh I didn’t mean “should cost $4000” just “would cost $4000”. I wish that the vram on video cards was modular, there’s so much ewaste generated by these bottlenecks.
Ah, yeah. Absolutely. The situation sucks though.
Not possible, the speeds are so high that GDDR physically has to be soldered. Future CPUs will be that way too, unfortunately. SO-DIMMs have already topped out at 5600, with tons of wasted power/voltage, and I believe desktop DIMMs are bumping against their limits too.
But look into CAMM modules and LPCAMMS. My hope is that we will get modular LPDDR5X-8533 on AMD Strix Halo boards.