Now we can run 30B LLaMA on a single 3090!
Good news: 4bit LLaMA is working. You need to clone the branch from this pull request: https://github.com/qwopqwop200/GPTQ-for-LLaMa/pull/9
It takes the easy route of calling the existing CUDA kernel multiple times, so a proper kernel will run even faster.
Here is a simple script for running inference
https://rentry.org/cuhry
And here's a magnet link for the 4bit weights: magnet:?xt=urn:btih:2840e47fda47561333d57f1fc403bc026ad5d7ad&dn=LLaMA-4bit&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80&tr=udp%3a%2f%2ftracker.publicbt.com%3a80&tr=udp%3a%2f%2ftracker.ccc.de%3a80&tr=udp%3a%2f%2ftracker.grepler.com%3a6969&tr=udp%3a%2f%2ftracker.coppersurfer.tk%3a6969&tr=udp%3a%2f%2ftracker.tiny-vps.com%3a6969&tr=udp%3a%2f%2ftracker.filetracker.pl%3a8089&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337
So does this mean you can run 13B on 16GB of VRAM now?
13b takes 8.4gb I think
Nice, I've heard it is far better than 7b which is all I've been able to run so far.
Oh it's much better and way more coherent
poorfag logging on
so is 13b just out of reach for an 8gb card (3070)?
>he bought 3070 instead of 3060
Can you use RAM for the rest on an 8GB GPU? Is there a big baseline performance penalty for using RAM at all, or will 400MB of model in RAM make only a small difference?
13b has always run fine for me on 16gb rtx 4080
how much VRAM does it need for you?
about 15.8GB I think
This will not guarantee sex with llamas.
What's the point then?
Botnet
>https://github.com/qwopqwop200/GPTQ-for-LLaMa
tl;dr on the quality difference between 4bit and float16?
About the same, 4bit is kind of a sweet spot
are the 4bit ones on huggingface?
maybe we can speed up torrent by getting from there
The weights in the torrent are already converted
I meant I could download in parallel so I can seed sooner. Torrent's a little slow
>murders accuracy
>hails it as an achievement
From an anon elsewhere
Picrel took about 3 minutes to generate, so about 8 tokens/s, somewhat faster than 13B in 8bit ran on my PC. It's clearly smarter than 13B, able to understand the concepts of goblin speak and first person that the 8bit 13B struggled with.
https://files.catbox.moe/74bs1b.png
And it's llama 30b
not surprised, 30B and 65B were trained on a lot more tokens than 7B and 13B
From an anon elsewhere
In other good news, we now have a proper CUDA kernel, so you just need to clone https://github.com/qwopqwop200/GPTQ-for-LLaMa and run python setup_cuda.py install again
Decent increase in speed from the previous version too
Doesn't install for me unfortunately
Is this the retard setup that retards like me can handle?
I have 32gb of RAM and 8gb of VRAM(3070ticuck)
What is the best model for me?
an upgrade to 12/24gb or renting a box online
also not buying trash gpu in the future
Originally my plan was to upgrade at the 5000 refresh/6000 series anyway
How much RAM for quantization?
I have a Ryzen 3900x but no graphics card.
Set up 2tb swap with nvme
Tried to launch the model with torch CPU version but failed, it says I must have an Nvidia card.
What are the chances of getting it on a CPU?
You can run LLaMA on CPU but not quantized (this isn't unique to LLaMA, no one has done quantization on CPUs for whatever reason).
RTX 3060 12GB with 32 GB RAM, what could I expect to run in the next few days with all the new optimization? 13B? 30B? any hope for the full 65B?
You won’t run 33b or 65b barring tech breakthroughs
how coherent is 13b? can it be massaged into chat stuff?
I managed to do pretty coherent chat stuff with 7b, by using some of the leaked ChatGPT and Bing chat initial prompts and editing them a bit, so I imagine that 13b would be more than capable.
13b in 4bit WHEN???
/aicg/ reporting in. Can I coom to e-bois with this?
no
Is using multiple GPUs for AI a thing?
Like for more VRAM, not multithreading
It is a thing but only for server GPUs
It is but only on Quadro tier cards or whatever they call them after the 6th gorillionth rebrand.
nothing prevents you from using normal cards, you won't get double the performance but you will get double the VRAM
LLaMAs tongued my anus
hopefully it will mine soon
the 13B stuff I saw was pretty coherent, but
LLaMA has no instruct training so it is like a wild horse?
Is it able to run on kobold yet?
>And here's a magnet link for the 4bit weights
it's only the 30b one, right?
Can it be split between CPU and GPU?
Weights in the torrent won't work since they don't match the decapoda-research/llama-7b-hf architecture that GPTQ-for-LLaMa uses. You'll get a bunch of errors related to size mismatches because of that. At least this is the case for the 7B version.
Can I run on my mobile 1060 6GB yet?
So can I run on CPU
I found this:
https://github.com/markasoftware/llama-cpu
I have llama7b-4bit.pt model.
There is instruction in Readme to run:
`torchrun --nproc_per_node MP example.py --ckpt_dir $TARGET_FOLDER/model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model`
$TARGET_FOLDER/model_size - so here goes llama7b-4bit.pt ?
--tokenizer_path $TARGET_FOLDER/tokenizer.model - what is this shit, where do I get one?
It's useless running it purely on CPU. It will probably take like 1 minute per word. The best way to run big models is to split them up, loading some layers on the CPU and some on the GPU.
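For the plain transformers route (not the 4bit GPTQ loader, which has its own loading path), this is roughly what a CPU/GPU split looks like with accelerate's device_map. Just a sketch; the model id and memory caps are placeholders:
```python
# Sketch: let HF accelerate place layers that don't fit in VRAM into system RAM.
# Model id and memory caps are placeholders, not a recommendation.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf",
    torch_dtype=torch.float16,
    device_map="auto",                       # accelerate decides per-layer placement
    max_memory={0: "7GiB", "cpu": "24GiB"},  # cap GPU 0, put the overflow in RAM
)
# Layers placed in RAM execute on the CPU, so it runs, just a lot slower than an all-VRAM load.
```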
>tokenizer.model
The original llama torrent maybe?
From the original HF LLaMa repo:
https://huggingface.co/decapoda-research/llama-7b-hf/tree/main
>https://rentry.org/cuhry
can somebody make a colab with everything set up? i'm lazy
For 8-bit in Kobold we needed git clone -b 8bit https://github.com/ebolam/KoboldAI/ but what now? Replace 8bit with 4bit? Is there a new rentry with instructions?
https://huggingface.co/decapoda-research
HAPPENING!
Interesting. Kobold support when?
I managed to get 4-bit 7B and 13B running on an RTX 3060 with the latest oobabooga PR - https://github.com/oobabooga/text-generation-webui/pull/206
but the text generation feels s l o w compared to 8-bit mode, and it gets progressively slower the more tokens it generates; by the time it reaches around 200 tokens it's already grinding to a halt. Is that normal?
Streaming mode causes massive slowdowns as the number of tokens in the output increases. Try using --no-stream
But why? It doesn't make sense
Unless it's been changed, streaming is very poorly implemented. It just calls the generate function with a tiny limit of something like 8 tokens, adds it to the context, calls generate again, etc. It throws out any cached state and starts from scratch every time, so it has to constantly re-parse and process the same text it just created.
The proper way to do streaming would be to hook into the main generation loop and receive a callback on each token, which is entirely possible because generally sampling uses the CPU anyway. But either HF doesn't do it, or ooba doesn't hook into it properly.
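FWIW newer transformers releases do expose a per-token streaming hook, so this is fixable without the re-generate loop. Rough sketch, assuming a version that ships TextIteratorStreamer, with the model id reused from upthread:
```python
# Sketch: stream text via a token-level iterator instead of re-running generate() in chunks.
from threading import Thread
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

name = "decapoda-research/llama-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

inputs = tok("The llama is", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tok, skip_prompt=True)

# generate() runs exactly once in the background; the streamer yields text as each
# token is sampled, so nothing already generated gets re-parsed.
Thread(target=model.generate, kwargs=dict(**inputs, max_new_tokens=200, streamer=streamer)).start()
for chunk in streamer:
    print(chunk, end="", flush=True)
```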
tbh even though this stuff seems super advanced and cutting edge, programmers are still retarded like they are everywhere else
That sucks
I hope they get their shit together
Does this mean that I can finally run 7B on my 2060 6GB?
yes, 7B 4bit uses under 5 GB
1-bit quantization when?
SEED YOU FUCKS
t. downloadlet
it's merged now btw
https://github.com/oobabooga/text-generation-webui/pull/206
Halp.
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#installation-1
Trying:
>mkdir repositories
>cd repositories
>git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
>cd GPTQ-for-LLaMa
>python setup_cuda.py install
All works up to install, I get:
>raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
> RuntimeError:
>The detected CUDA version (10.2) mismatches the version that was used to compile
>PyTorch (11.7). Please make sure to use the same CUDA versions.
wat do?
update CUDA to 11.7 you bumbling fucking retard
used:
>conda install torchvision torchaudio pytorch-cuda=11.7 git -c pytorch -c nvidia
to install into the env already. This env is a working oobabooga install
and double checking, when I run
>conda install torchvision torchaudio pytorch-cuda=11.7 git -c pytorch -c nvidia
get
># All requested packages already installed.
any more ideas?
yes, update CUDA, you bumbling fucking retard
>nvcc --version
get
>nvcc: NVIDIA (R) Cuda compiler driver
>Copyright (c) 2005-2022 NVIDIA Corporation
>Built on Wed_Jun__8_16:59:34_Pacific_Daylight_Time_2022
>Cuda compilation tools, release 11.7, V11.7.99
>Build cuda_11.7.r11.7/compiler.31442593_0
Which GPU are you using? Uninstall CUDA 10.2 and pytorch-cuda 10.2 if you have either installed.
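Quick way to check which toolkit the extension build is actually going to use (just a diagnostic sketch, nothing from the repo):
```python
# Prints the CUDA version torch was built with vs. the toolkit setup_cuda.py will compile against.
# If these disagree (e.g. 11.7 vs. a stray 10.2 install on PATH), you get exactly that mismatch error.
import torch
from torch.utils import cpp_extension

print("torch built against CUDA:", torch.version.cuda)       # e.g. 11.7
print("toolkit used for building:", cpp_extension.CUDA_HOME)  # should point at an 11.7 install
```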
I've a 20 and 30 series.
doing
>conda list
I can't see either:
>CUDA 10.2
>pytorch-cuda 10.2
It's listed as:
>cryptography 39.0.1
>cuda 11.7.1
>cuda-cccl 11.7.91
>cuda-command-line-tools 11.7.1
>cuda-compiler 11.7.1
>cuda-cudart 11.7.99
>cuda-cudart-dev 11.7.99
>cuda-cuobjdump 11.7.91
>cuda-cupti 11.7.101
>cuda-cuxxfilt 11.7.91
>cuda-demo-suite 12.1.55
>cuda-documentation 12.1.55
>cuda-libraries 11.7.1
>cuda-libraries-dev 11.7.1
>cuda-memcheck 11.8.86
>cuda-nsight-compute 12.1.0
>cuda-nvcc 11.7.99
>cuda-nvdisasm 12.1.55
>cuda-nvml-dev 11.7.91
>cuda-nvprof 12.1.55
>cuda-nvprune 11.7.91
>cuda-nvrtc 11.7.99
>cuda-nvrtc-dev 11.7.99
>cuda-nvtx 11.7.91
>cuda-nvvp 12.1.55
>cuda-runtime 11.7.1
>cuda-sanitizer-api 12.1.55
>cuda-toolkit 11.7.1
>cuda-tools 11.7.1
>cuda-visual-tools 11.7.1
>cycler 0.11.0
...
>python-multipart 0.0.6
>pytorch 1.13.1
>pytorch-cuda 11.7
>pytorch-mutex 1.0
>pytz 2022.7.1
should I just trash the conda env and build fresh?
yes
Rebuilt a brand new conda env.
Same error.
log out and log in again
Still the same fresh hell as it was before
Will try this and report back.
I think I had the same problem, anon. I dealt with that fucking problem a while ago trying to make AI-related stuff work and I honestly can't remember how I fixed it, I just know it was a nightmare.
But in this case, anyway, try the one-click installer and just modify the start-webui.sh file as you need. That's what I did to avoid that shit again.
How do you pull a console up to enter commands into when working with the one-click installers?
The entire point of this was to try to get 4bit running
Never mind.
Editing the start-webui.bat to do everything just brings me back to the fucking
>The detected CUDA version (10.2) mismatches the version that was used to compile
>PyTorch (11.7). Please make sure to use the same CUDA versions.
error
I'm a retard, 10.2 was in a PATH env variable. Removed that, and now I'm getting build errors (finally getting somewhere)
post errors
>easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
> warnings.warn(
>running bdist_egg
>running egg_info
>writing quant_cuda.egg-info\PKG-INFO
>writing dependency_links to quant_cuda.egg-info\dependency_links.txt
>writing top-level names to quant_cuda.egg-info\top_level.txt
>reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
>writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
>installing library code to build\bdist.win-amd64\egg
>running install_lib
>running build_ext
>error: [WinError 2] The system cannot find the file specified
This is after installing VS2019 and adding to PATH as per https://github.com/oobabooga/text-generation-webui/pull/206#issuecomment-1462804697
Getting same error when going through
https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/11#issuecomment-1462643016
Fuck this.
Even after doing everything listed:
> https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/11#issuecomment-1462643016
I'm still getting the error listed
> https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/11#issuecomment-1462593611
Which is the same as
Anyone have whatever this thing is compiling as a download or does it need to be compiled on the system it's running on to work?
CAN YOU USE IT FOR ERPING
4gb vram models when, vram is going to be regulated in the future so more optimized models == more better
I can't compile GPTQ-for-LLaMa, keeps complaining about Tuples
this is the best i can do on a 1070 Ti
time to upgrade
How much does it suck compared to the normal version?
Nobody answering whether it’s actually good for cooming
Post pics of deranged llama 13b 4bit
Anyone have a 4bit quantized 30b llama? I've been pulling my fucking hair out all day as a dumbass windows user and I fucking finally got everything working in Anaconda. But quantizing 30b myself just throws errors and goes on forever. Fuck's sake it's like I'm right there and the fucking thing is impossible to make and nowhere to be found.
4bit weights are in a magnet in the OP.
my 4090 runs stable diffusion like a champ, looks like it can handle the 30B for this with 4bit as shown here:
>https://rentry.org/llama-tard-v2#bonus-4-4bit-llama-basic-setup
How hard is it to get two 4090s to play nice together? Anyone here have experience trying? How good is the 30B model?
You don't need them to "play nice" really, you can split layers between GPUs and they each just handle their own thing. It's not any faster because they have to hand off data and each is only active half the time. But it lets you handle a bigger model. I think with 2x4090 you could run 65B at 4bit which supposedly should be smarter than anything OAI has (if very untamed)
>You don't need them to "play nice" really, you can split layers between gpus and they each just handle their own thing.
okay, I think I know what that would look like. When you say untamed, do you mean these models are not fine tuned very well, or just say things very unfiltered, or both? I'll probably try to get the 30B up and running tomorrow
Not fine tuned at all for any purpose except predicting text. It's an advanced auto complete. Totally unfiltered of course, but you have to learn how to prompt a raw model, it's not like chatgpt where you can just tell it what you want.
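For anyone wondering what "learning to prompt a raw model" means in practice: you write the opening of a transcript and let it continue, then cut the output off yourself. A made-up example; the generate() helper is hypothetical, a stand-in for whatever script or UI you use:
```python
# A base model just continues text, so a "chat" is a transcript you keep appending to.
# The prompt below is a made-up example; generate() is a hypothetical helper.
prompt = """The following is a conversation between User and Assistant.
Assistant answers directly and stays in character.

User: Give me three names for a goblin tavern.
Assistant:"""

output = generate(prompt, max_new_tokens=120)   # hypothetical: whatever backend you use
reply = output.split("User:")[0].strip()        # stop it from writing both sides of the chat
```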
That makes sense. I've read some articles about how expensive the process was to create and train ChatGPT 3.5; I don't know if what I read was exaggerated, but they obviously used an amount of hardware far outside the scope of what a hobbyist has at their disposal with a paltry one or two 4090s. Is it possible to fine tune these models at home? The bulk of the training is done, right? I'm imagining something like people creating LoRAs for Stable Diffusion. It seems like people are fine tuning other models, but they are less grand than that 65B one
IIRC the catch is that even if you freeze the model and train a smaller network like a LoRA, you still have to push gradients back through the network to train it. Which means more intermediate (backward) values to hold on to, plus you have to keep all the forward activations in memory to compute the gradients later.
You do save a lot vs actual training, since frozen parameters don't matter to the optimizer and intermediate gradients can be tossed. But it's definitely more than inference, plus you don't want to train with batch size 1 like you can with inference. That's why the kobold softprompt tuner needs so much vram, even though the whole model is frozen for that.
IIRC SD also needs more vram for training loras and embeddings, it's just that SD is tiny as fuck like 850m lol
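If you want to poke at that yourself, the HF peft library is the usual way to do the "freeze the base model, train tiny adapters" thing. Rough sketch; model id and hyperparameters are just placeholders:
```python
# Sketch of a LoRA fine-tuning setup with HF peft: base weights frozen, small adapters trainable.
# VRAM still grows with batch size because forward activations are kept for the backward pass.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16, device_map="auto")

config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapters only on the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a fraction of a percent of the total weights
```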
Thank you for the details anon. I'm looking forward to playing with this.
Putting these here for anyone that needs them
VS2019: https://www.techspot.com/downloads/7241-visual-studio-2019.html#download_scroll
VS2019 Build Tools: https://learn.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers
why can't someone just package this all together? why do I have to install Visual Studio, Anaconda, and build a CUDA kernel?
It's the price you pay for being on the cutting edge anon.
Because not every computer runs the same software on the same hardware
That's why something called static linking exists
I'd love to run all those different AI things on my pc if I could just have everything inside the same folder that I can put wherever I like, but no, instead I have to shit everything up with billions of dependencies that are hard to remove afterwards
Conda keeps everything clean for the most part.
Except now for some godforsaken reason you need to install visual studio for something to work
And even though conda helps with different python dependencies it still throws files all over the place instead of keeping everything in the install folder
I'm getting an OOM error when trying to load 13b in 4bit. I'm on a 3090 and was running it perfectly in 8bit earlier, what gives? Anyone else run into this issue? Is it because I'm still using the HFv1 weights since v2 hasn't d/l'd yet perhaps?
(I have 16GB of RAM and a 20GB paging file for reference.)
You have to convert it with the updated .py or download a more recently converted model. I'm running 13b on a 3060 12GB.
Fucked if I know, got
>VS Community 2019
>VS Build Tools 2019
Both with "Desktop Develop C++" installed (do I need any other components?)
somebody please stop me from buying 2 3090's
4xxx series has native 4bit support
also Arc has it
3x A770 16GB cost less than a single 24GB 4090 and will be able to run the 65B model
>3x A770 16GB cost less than a single 24GB 4090 and will be able to run the 65B model
Do they support clustering though?
It could be locked behind Arctic Sound M
they don't need to support anything special
the script can divide layers between different GPUs
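Same max_memory trick as the CPU/GPU split sketch earlier in the thread, just with one entry per card. This is the CUDA/accelerate version; no idea what the Arc story looks like:
```python
# Sketch: spread layers across two CUDA cards; more total VRAM, not more speed.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-30b-hf",     # placeholder model id
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB"},  # e.g. two 24GB cards with some headroom
)
# Each forward pass runs the first chunk of layers on GPU 0, then hands the
# hidden states to GPU 1 for the rest — each card is only busy part of the time.
```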
tempting, been considering getting one for a while now
I wish the arc cards supported sr-iov like intel's other stuff, I would have already bought one if that was the case
Is LLaMA better than Pygmalion right now for the gf experience?
Going by /aicg/, yeah
where's the rocm support for 4 bit
begone aicgger
How much VRAM does the 4bit 30b use?
20 gb
shouldn't it be 30/2 = 15?
No, it's not 1GB per billion parameters, never was
if you use 4bit per parameter then it's 0.5GB per billion parameters
then you need some extra memory for computations
Parameters aren't free floating, you need to store their relative addresses too, how they interconnect
>how they interconnect
it's strange they wouldn't go with some algorithmic topology, so connections could be calculated on-the-fly
Isn't it just one matrix per layer plus attention (a sort of convolved mm)? So you just multiply the inputs by the next layer and run attention over the input sequence to get the next activation vector.
I guess you need some vram to express this as code but calling it "how they interconnect" is a really weird way to say it.
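Putting rough numbers on the 0.5GB-per-billion figure above (weights only; the KV cache, activations and per-layer overhead are where the extra few GB come from):
```python
# Back-of-the-envelope weight sizes at 4 bits per parameter (overhead not included).
def weight_gb(params_billion, bits_per_weight=4):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_gb(7))    # ~3.5 GB  -> "under 5 GB" once overhead is added
print(weight_gb(13))   # ~6.5 GB  -> the ~8.4 GB figure upthread
print(weight_gb(30))   # ~15 GB   -> ~20 GB in practice on a 3090
print(weight_gb(65))   # ~32.5 GB -> why 2x24GB looks plausible for 65B
```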
is rentry down for you guys?
yeah, just as I fucking need it
Just read the documentation for the software you want to use. Why do all of you wait for these blog posts?
is there example output before i waste time setting it up to play with
How much ram would the biggest model need with this?
If Nvidia would release a 48 GB Ti/Titan what model could it run?
Seems like 65B at 4bit can fit into 48GB vram, so you can run it with two 3090s
How much normal RAM should it take to load these quantized models?
I've got a server with 12GB of RAM and a 12GB GPU, and GPTQ running 7B gets killed because it exhausts my normal RAM on load.
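If the loader goes through torch.load, one thing worth trying is mapping the checkpoint straight onto the GPU instead of staging it in RAM first. Depends entirely on how the particular script is written, so treat this as a sketch:
```python
# Sketch: load the quantized checkpoint directly to VRAM to dodge the host-RAM spike.
# Path and device are placeholders; only helps if the script otherwise stages the
# whole state dict in RAM before copying it over.
import torch

state_dict = torch.load("llama-7b-4bit.pt", map_location="cuda:0")
```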
Why is the hash in:
https://huggingface.co/decapoda-research/llama-7b-hf-int4/blob/main/llama-7b-4bit.pt
... different to the 7B hash in the torrent?
Which one is supposed to be used with the GPTQ repo?
Answering my own question:
>https://rentry.org/llama-tard-v2
They were converted incorrectly in the torrent. See note at top.
so downloading torrent is pointless?
sharty won
Fucking feds.
Is it better at coding than gpt?
No. It hasn't really been fine-tuned for coding.
bros... what can i run with my 1050 that has 4GB VRAM ???
or should i just stay with chatGPT ?
is llama really so much better ?
You can run Minecraft
Making some good progress setting up DNA sequence analysis and decoding, and I'm curious to ask: what kind of changes would anon like to make to his or her body? Or even, what kind of organism would anon like to become? What would you edit to make yourself a more advanced lifeform? Photosynthetic skin? Gills? Immortality? Dual sex organs? Anything.
>tfw my 2080 still can't run it because muh 8 GB
To anyone that can't run it, you're really not missing out on anything. 13b is ass.
Can I use this to read documents?
That is the primary use I expect out of these things, I'm too lazy to train a bot to do it
No, use bing