4bit Llama is finally here!

Now we can have 30b llama on a single 3090!

Good news, 4bit LLaMA is working. You need to clone the branch from this pull request https://github.com/qwopqwop200/GPTQ-for-LLaMa/pull/9
It takes the easy route of calling the existing CUDA kernel multiple times, so a better kernel will run even faster.
Here is a simple script for running inference
https://rentry.org/cuhry
And here's a magnet link for the 4bit weights: magnet:?xt=urn:btih:2840e47fda47561333d57f1fc403bc026ad5d7ad&dn=LLaMA-4bit&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80&tr=udp%3a%2f%2ftracker.publicbt.com%3a80&tr=udp%3a%2f%2ftracker.ccc.de%3a80&tr=udp%3a%2f%2ftracker.grepler.com%3a6969&tr=udp%3a%2f%2ftracker.coppersurfer.tk%3a6969&tr=udp%3a%2f%2ftracker.tiny-vps.com%3a6969&tr=udp%3a%2f%2ftracker.filetracker.pl%3a8089&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337

  1. 3 weeks ago
    Anonymous

    So does this mean you can run 13b on 16 vram now?

    • 3 weeks ago
      Anonymous

      13b takes 8.4gb I think

      • 3 weeks ago
        Anonymous

        Nice, I've heard it is far better than 7b which is all I've been able to run so far.

        • 3 weeks ago
          Anonymous

          Oh it's much better and way more coherent

      • 3 weeks ago
        Anonymous

        poorfag logging on
        so is 13b just out of reach for an 8gb card (3070)?

        • 3 weeks ago
          Anonymous

          >he bought 3070 instead of 3060

      • 3 weeks ago
        Anonymous

        Can you use RAM for the rest on an 8GB GPU? Is there a big baseline performance penalty for using RAM at all, or will 400MB of model in RAM make only a small difference?

    • 3 weeks ago
      Anonymous

      13b has always run fine for me on 16gb rtx 4080

      • 3 weeks ago
        Anonymous

        how much VRAM does it need for you?

        • 3 weeks ago
          Anonymous

          about 15.8gbs i think

  2. 3 weeks ago
    Anonymous

    This will not guarantee sex with llamas.

    • 3 weeks ago
      Anonymous

      What's the point then?

  3. 3 weeks ago
    Anonymous

    Botnet

    • 3 weeks ago
      Kipchoge

      >https://github.com/qwopqwop200/GPTQ-for-LLaMa

  4. 3 weeks ago
    Anonymous

    tl;dr on the quality difference between 4bit and float16?

    • 3 weeks ago
      Anonymous

      About the same, 4bit is kinda a sweet spot

  5. 3 weeks ago
    Anonymous

    are the 4bit ones on huggingface?
    maybe we can speed up torrent by getting from there

    • 3 weeks ago
      Anonymous

      The weights in the torrent are already converted

      • 3 weeks ago
        Anonymous

        I meant I could download in parallel so I can seed sooner. Torrent's a little slow

  6. 3 weeks ago
    Anonymous

    >murders accuracy
    >hails it as an achievement

    • 3 weeks ago
      Anonymous

      From an anon elsewhere

      Picrel took about 3 minutes to generate, so about 8 tokens/s, somewhat faster than 13B in 8bit on my PC. It's clearly smarter than 13B, able to understand the concepts of goblin speak and first person that the 8bit 13B struggled with.
      https://files.catbox.moe/74bs1b.png

      • 3 weeks ago
        Anonymous

        And it's llama 30b

      • 3 weeks ago
        Anonymous

        >And it's llama 30b

        not surprised, 30B and 65B were trained for a lot longer than 7B and 13B

  7. 3 weeks ago
    Anonymous

    From an anon elsewhere

    In another good news, we now have a proper CUDA kernel, so you just need to clone https://github.com/qwopqwop200/GPTQ-for-LLaMa and run python setup_cuda.py install again
    Decent increase in speed from the previous version too

    • 3 weeks ago
      Anonymous

      Doesn't install for me unfortunately

    • 3 weeks ago
      Anonymous

      Is this the retard setup that retards like me can handle?

  8. 3 weeks ago
    Anonymous

    I have 32gb of RAM and 8gb of VRAM (3070ticuck)
    What is the best model for me?

    • 3 weeks ago
      Anonymous

      an upgrade to 12/24gb or renting a box online
      also not buying trash gpu in the future

      • 3 weeks ago
        Anonymous

        My original plan was to upgrade in the 5000 refresh/6000 series anyway

  9. 3 weeks ago
    Anonymous
  10. 3 weeks ago
    Anonymous

    How much RAM for quantization?

  11. 3 weeks ago
    Anonymous

    I have a Ryzen 3900x but no graphics card.
    Set up 2TB of swap on an NVMe drive.
    Tried to launch the model with the CPU version of torch but it failed, saying I must have an Nvidia card.
    What are the chances of getting it on a CPU?

    • 3 weeks ago
      Anonymous

      You can run Llama on CPU but not quantized (this isn't unique to Llama, no one has done quantization on CPUs for whatever reason.)

  12. 3 weeks ago
    Anonymous

    RTX 3060 12GB with 32 GB RAM, what could I expect to run in the next few days with all the new optimization? 13B? 30B? any hope for the full 65B?

    • 3 weeks ago
      Anonymous

      You won’t run 33b or 65b barring tech breakthroughs

  13. 3 weeks ago
    Anonymous

    how coherent is 13b? can it be massaged into chat stuff?

    • 3 weeks ago
      Anonymous

      I managed to do pretty coherent chat stuff with 7b, by using some of the leaked ChatGPT and Bing chat initial prompts and editing them a bit, so I imagine that 13b would be more than capable.

  14. 3 weeks ago
    Anonymous

    13b in 4bit WHEN???

  15. 3 weeks ago
    Anonymous

    /aicg/ reporting in. Can I coom to e-bois with this?

    • 3 weeks ago
      Anonymous

      no

  16. 3 weeks ago
    Anonymous

    Is using multiple GPU's for ai a thing ?
    Like for more vram not multithreading

    • 3 weeks ago
      Anonymous

      It is a thing but only for server GPUs

    • 3 weeks ago
      Anonymous

      It is but only on Quadro tier cards or whatever they call them after the 6th gorillionth rebrand.

      • 3 weeks ago
        Anonymous

        nothing prevents you from using normal cards, you won't get double performance but will get double of VRAM

  17. 3 weeks ago
    Anonymous

    LLaMAs tongued my anus

    • 3 weeks ago
      Anonymous

      hopefully it will mine soon

      • 3 weeks ago
        Anonymous

        the 13B stuff I saw was pretty coherent but
        LLaMA got no instruct training so it is like a wild horse?

  18. 3 weeks ago
    Anonymous

    Is it able to run on kobold yet?

  19. 3 weeks ago
    Anonymous

    >And here's a magnet link for the 4bit weights
    it's only the 30b one, right?

  20. 3 weeks ago
    Anonymous

    Can it be split between cpu and gpu.

  21. 3 weeks ago
    Anonymous

    Weights in the torrent won't work since they don't match the decapoda-research/llama-7b-hf architecture that GPTQ-for-LLaMa uses. You'll get a bunch of errors related to size mismatches because of that. At least this is the case with the 7B version.

  22. 3 weeks ago
    Anonymous

    Can I run on my mobile 1060 6GB yet?

  23. 3 weeks ago
    Anonymous

    So can I run on CPU?
    I found this:

    https://github.com/markasoftware/llama-cpu

    I have llama7b-4bit.pt model.

    There is instruction in Readme to run:

    `torchrun --nproc_per_node MP example.py --ckpt_dir $TARGET_FOLDER/model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model`

    $TARGET_FOLDER/model_size - so llama7b-4bit.pt goes here?
    --tokenizer_path $TARGET_FOLDER/tokenizer.model - what is this shit, where do I get one?

    • 3 weeks ago
      Anonymous

      It's useless running it purely on CPU. It will probably take like 1 minute per word. The best way to run big models is to split them up, loading some layers on the CPU and some on the GPU

    • 3 weeks ago
      Anonymous

      >tokenizer.model
      The original llama torrent maybe?

    • 3 weeks ago
      Anonymous

      From the original HF LLaMa repo:
      https://huggingface.co/decapoda-research/llama-7b-hf/tree/main

  24. 3 weeks ago
    Anonymous

    >https://rentry.org/cuhry
    can somebody make a colab with everything set up? i'm lazy

  25. 3 weeks ago
    Anonymous

    For 8-bit in Kobold we needed git clone -b 8bit https://github.com/ebolam/KoboldAI/ but what now? Replace 8bit with 4bit? Is there a new rentry with instructions?

  26. 3 weeks ago
    Anonymous

    https://huggingface.co/decapoda-research

    HAPPENING!

    • 3 weeks ago
      Anonymous

      Interesting. Kobold support when?

  27. 3 weeks ago
    Anonymous

    I managed to get 4-bit 7B and 13B running on an RTX 3060 with the latest oobabooga PR - https://github.com/oobabooga/text-generation-webui/pull/206

    but the text generation feels s l o w compared to 8-bit mode, and it gets progressively slower the more tokens it generates; by the time it reaches around 200 tokens it's already grinding to a halt. Is that normal?

    • 3 weeks ago
      Anonymous

      Streaming mode causes massive slowdowns as the number of tokens in the output increases. Try using --no-stream

      • 3 weeks ago
        Anonymous

        But why? It doesn't make sense

        • 3 weeks ago
          Anonymous

          Unless it's been changed, streaming is very poorly implemented. It just calls the generate function with a tiny limit of something like 8 tokens, adds it to the context, calls generate again, etc. It throws out any cached state and starts from scratch every time, so it has to constantly re-parse and process the same text it just created.

          The proper way to do streaming would be to hook into the main generation loop and receive a callback on each token, which is entirely possible because generally sampling uses the CPU anyway. But either HF doesn't do it, or ooba doesn't hook into it properly.
          tbh even though this stuff seems super advanced and cutting edge, programmers are still retarded like they are everywhere else
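
          For anyone curious, a minimal sketch of the cached, per-token way in plain transformers (greedy sampling; the model path is a placeholder and this is not ooba's actual code):

            import torch
            from transformers import AutoModelForCausalLM, AutoTokenizer

            tok = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
            model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf").half().cuda()

            next_input = tok("The llama is", return_tensors="pt").input_ids.cuda()
            past = None
            for _ in range(200):
                with torch.no_grad():
                    out = model(next_input, past_key_values=past, use_cache=True)
                past = out.past_key_values                            # keep the KV cache instead of re-parsing the whole context
                next_token = out.logits[:, -1, :].argmax(-1, keepdim=True)
                next_input = next_token                               # only the newest token goes through the model next step
                print(tok.decode(next_token[0]), end="", flush=True)  # per-token "callback": stream it as soon as it's sampled

          Cost per step stays roughly constant, instead of growing with everything generated so far.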

          • 3 weeks ago
            Anonymous

            That sucks
            I hope they get their shit together

  28. 3 weeks ago
    Anonymous

    Does this mean that I can finally run 7B on my 2060 6GB?

    • 3 weeks ago
      Anonymous

      yes, 7B 4bit uses under 5 GB

  29. 3 weeks ago
    Anonymous

    1-bit quantization when?

  30. 3 weeks ago
    Anonymous

    SEED YOU FUCKS

    • 3 weeks ago
      Anonymous

      t. downloadlet

      its merged now btw
      https://github.com/oobabooga/text-generation-webui/pull/206

  31. 3 weeks ago
    Anonymous

    Halp.
    https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#installation-1
    Trying:
    >mkdir repositories
    >cd repositories
    >git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
    >cd GPTQ-for-LLaMa
    >python setup_cuda.py install

    All works up to install, I get:
    >raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
    > RuntimeError:
    >The detected CUDA version (10.2) mismatches the version that was used to compile
    >PyTorch (11.7). Please make sure to use the same CUDA versions.

    wat do?

    • 3 weeks ago
      Anonymous

      update CUDA to 11.7 you bumbling fucking retard

      • 3 weeks ago
        Anonymous

        used:
        >conda install torchvision torchaudio pytorch-cuda=11.7 git -c pytorch -c nvidia
        to install into the env already. This env is a working oobabooga install

        and double checking, when I run
        >conda install torchvision torchaudio pytorch-cuda=11.7 git -c pytorch -c nvidia
        get
        ># All requested packages already installed.
        any more ideas?

        • 3 weeks ago
          Anonymous

          yes, update CUDA, you bumbling fucking retard

          • 3 weeks ago
            Anonymous

            >nvcc --version
            get
            >nvcc: NVIDIA (R) Cuda compiler driver
            >Copyright (c) 2005-2022 NVIDIA Corporation
            >Built on Wed_Jun__8_16:59:34_Pacific_Daylight_Time_2022
            >Cuda compilation tools, release 11.7, V11.7.99
            >Build cuda_11.7.r11.7/compiler.31442593_0

            • 3 weeks ago
              Anonymous

              Which GPU are you using? Uninstall CUDA 10.2 and pytorch-cuda 10.2 if you have either installed.

              • 3 weeks ago
                Anonymous

                I've got a 20 series and a 30 series card.

              • 3 weeks ago
                Anonymous

                doing
                >conda list
                I can't see either:
                >CUDA 10.2
                >pytorch-cuda 10.2
                It's listed as:
                >cryptography 39.0.1
                >cuda 11.7.1
                >cuda-cccl 11.7.91
                >cuda-command-line-tools 11.7.1
                >cuda-compiler 11.7.1
                >cuda-cudart 11.7.99
                >cuda-cudart-dev 11.7.99
                >cuda-cuobjdump 11.7.91
                >cuda-cupti 11.7.101
                >cuda-cuxxfilt 11.7.91
                >cuda-demo-suite 12.1.55
                >cuda-documentation 12.1.55
                >cuda-libraries 11.7.1
                >cuda-libraries-dev 11.7.1
                >cuda-memcheck 11.8.86
                >cuda-nsight-compute 12.1.0
                >cuda-nvcc 11.7.99
                >cuda-nvdisasm 12.1.55
                >cuda-nvml-dev 11.7.91
                >cuda-nvprof 12.1.55
                >cuda-nvprune 11.7.91
                >cuda-nvrtc 11.7.99
                >cuda-nvrtc-dev 11.7.99
                >cuda-nvtx 11.7.91
                >cuda-nvvp 12.1.55
                >cuda-runtime 11.7.1
                >cuda-sanitizer-api 12.1.55
                >cuda-toolkit 11.7.1
                >cuda-tools 11.7.1
                >cuda-visual-tools 11.7.1
                >cycler 0.11.0
                ...
                >python-multipart 0.0.6
                >pytorch 1.13.1
                >pytorch-cuda 11.7
                >pytorch-mutex 1.0
                >pytz 2022.7.1

              • 3 weeks ago
                Anonymous

                should I just trash the conda env and build fresh?

              • 3 weeks ago
                Anonymous

                yes

              • 3 weeks ago
                Anonymous

                Rebuilt a brand new conda env.
                Same error.

              • 3 weeks ago
                Anonymous

                log out and log in again

              • 3 weeks ago
                Anonymous

                Still the same fresh hell as it was before

                >I think i had the same problem, anon. I had dealt with that fucking problem a while ago trying to make AI related stuff work and i honestly can't remember how i fixed it, i just know it was a nightmare.
                >But in this case, anyway, try the one-click installer and just modify the start-webui.sh file as you need. That's what i did to avoid that shit again.

                Will try this and report back.

              • 3 weeks ago
                Anonymous

                I think i had the same problem, anon. I had dealt with that fucking problem a while ago trying to make AI related stuff work and i honestly can't remember how i fixed it, i just know it was a nightmare.
                But in this case, anyway, try the one-click installer and just modify the start-webui.sh file as you need. That's what i did to avoid that shit again.

              • 3 weeks ago
                Anonymous

                How do you pull a console up to enter commands into when working with the one-click installers?
                The entire point of this was to try to get 4bit running

              • 3 weeks ago
                Anonymous

                Never mind.
                Editing the start-webui.bat to do everything just brings me back to the fucking
                >The detected CUDA version (10.2) mismatches the version that was used to compile
                >PyTorch (11.7). Please make sure to use the same CUDA versions.
                error

    • 3 weeks ago
      Anonymous

      I'm a retard, 10.2 was in my PATH. Removed that and now I'm getting build errors (finally getting somewhere)

      • 3 weeks ago
        Anonymous

        post errors

        • 3 weeks ago
          Anonymous

          >easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
          > warnings.warn(
          >running bdist_egg
          >running egg_info
          >writing quant_cuda.egg-info\PKG-INFO
          >writing dependency_links to quant_cuda.egg-info\dependency_links.txt
          >writing top-level names to quant_cuda.egg-info\top_level.txt
          >reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
          >writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
          >installing library code to build\bdist.win-amd64\egg
          >running install_lib
          >running build_ext
          >error: [WinError 2] The system cannot find the file specified

          This is after installing VS2019 and adding to PATH as per https://github.com/oobabooga/text-generation-webui/pull/206#issuecomment-1462804697

          • 3 weeks ago
            Anonymous

            Getting same error when going through
            https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/11#issuecomment-1462643016

          • 3 weeks ago
            Anonymous

            Fuck this.
            Even after doing everything listed:
            > https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/11#issuecomment-1462643016
            I'm still getting the error listed
            > https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/11#issuecomment-1462593611
            Which is the same as

            Anyone have whatever this thing is compiling as a download or does it need to be compiled on the system it's running on to work?

  32. 3 weeks ago
    Anonymous

    CAN YOU USE IT FOR ERPING

  33. 3 weeks ago
    Anonymous

    4gb vram models when, vram is going to be regulated in the future so more optimized models == more better

  34. 3 weeks ago
    Anonymous

    I can't compile GPTQ-for-LLaMa, keeps complaining about Tuples

  35. 3 weeks ago
    Anonymous

    this is the best i can do on a 1070 Ti
    time to upgrade

  36. 3 weeks ago
    Anonymous

    How much does it suck compared to the normal version?

  37. 3 weeks ago
    Anonymous

    Nobody answering whether it’s actually good for cooming

    Post pics of deranged llama 13b 4bit

  38. 3 weeks ago
    jew69

    Anyone have a 4bit quantized 30b llama? I've been pulling my fucking hair out all day as a dumbass windows user and I fucking finally got everything working in Anaconda. But quantizing 30b myself just throws errors and goes on forever. Fuck's sake it's like I'm right there and the fucking thing is impossible to make and nowhere to be found.

    • 3 weeks ago
      Anonymous

      4bit weights are in a magnet in the OP.

  39. 3 weeks ago
    Anonymous

    my 4090 runs stable diffusion like a champ, looks like it can handle the 30B for this with 4bit as shown here:

    >https://rentry.org/llama-tard-v2#bonus-4-4bit-llama-basic-setup

    How hard is it to get two 4090s to play nice together? Anyone here have experience trying? How good is the 30B model?

    • 3 weeks ago
      Anonymous

      You don't need them to "play nice" really, you can split layers between GPUs and they each just handle their own thing. It's not any faster because they have to hand off data and each is only active half the time, but it lets you handle a bigger model. I think with 2x4090 you could run 65B at 4bit, which supposedly should be smarter than anything OAI has (if very untamed)
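
      For reference, a minimal sketch of the layer-splitting idea with plain transformers + accelerate (fp16, not the 4bit GPTQ path; the model name and memory caps are placeholders):

        from transformers import AutoModelForCausalLM

        model = AutoModelForCausalLM.from_pretrained(
            "decapoda-research/llama-30b-hf",     # placeholder HF repo
            device_map="auto",                    # accelerate assigns blocks of layers to each visible GPU
            max_memory={0: "22GiB", 1: "22GiB"},  # leave some headroom on each 24GB card
            torch_dtype="auto",
        )
        # Each card holds its own slice; hidden states get handed off between them
        # during generation, so it's not faster than one big GPU, just roomier.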

      • 3 weeks ago
        Anonymous

        >You don't need them to "play nice" really, you can split layers between gpus and they each just handle their own thing.
        okay, I think I know what that would look like. When you say untamed, do you mean these models are not fine tuned very well, or just say things very unfiltered, or both? I'll probably try to get the 30B up and running tomorrow

        • 3 weeks ago
          Anonymous

          Not fine tuned at all for any purpose except predicting text. It's an advanced auto complete. Totally unfiltered of course, but you have to learn how to prompt a raw model, it's not like chatgpt where you can just tell it what you want.
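
          e.g. you prime it with a transcript to continue rather than an instruction, something like this (purely an illustration):

            The following is a chat log between Anon and his extremely knowledgeable AI assistant.

            Anon: Explain quantization like I'm five.
            Assistant: Imagine squeezing a big picture into fewer colors so it still fits on a floppy disk.
            Anon: Now explain GPTQ.
            Assistant:

          The model just keeps completing the transcript from the last "Assistant:" line; you cut it off when it starts writing Anon's next turn for him.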

          • 3 weeks ago
            Anonymous

            That makes sense. I've read some articles about how expensive the process was to create and train ChatGPT 3.5. I don't know if what I read was exaggerated, but they obviously used an amount of hardware far outside the scope of what a hobbyist has at their disposal with a paltry one or two 4090s. Is it possible to fine tune these models at home? The bulk of the training is done, right? I'm imagining something like people creating LoRAs for stable diffusion. It seems like people are fine tuning other models, but they are less grand than that 65B one

            • 3 weeks ago
              Anonymous

              IIRC the catch is that even if you freeze the model and train a smaller network like a LoRA, you still have to push gradients back through the network to train it. Which means more intermediate (backward) values to hold on to, plus you have to keep all the forward activations in memory to compute the gradients later.
              You do save a lot vs actual training, since frozen parameters don't matter to the optimizer and intermediate gradients can be tossed. But it's definitely more than inference, plus you don't want to train with batch size 1 like you can with inference. That's why the kobold softprompt tuner needs so much vram, even though the whole model is frozen for that.

              IIRC SD also needs more vram for training loras and embeddings, it's just that SD is tiny as fuck like 850m lol
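
              Toy PyTorch sketch of the point (not real LoRA code, sizes made up): the frozen layers get no weight gradients, but autograd still has to carry the loss back through them to reach the adapter, which is where the extra memory goes.

                import torch
                import torch.nn as nn

                frozen_a = nn.Linear(1024, 1024)
                frozen_b = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                for p in list(frozen_a.parameters()) + list(frozen_b.parameters()):
                    p.requires_grad_(False)        # optimizer never touches these, no weight grads stored

                adapter = nn.Linear(1024, 1024)    # the only thing being trained

                x = torch.randn(8, 1024)
                h = frozen_a(x)
                h = h + adapter(h)                 # adapter output requires grad from here on
                loss = frozen_b(h).pow(2).mean()   # loss sits "above" the adapter, like in a real network
                loss.backward()                    # the gradient has to travel back through frozen_b, so its
                                                   # backward bookkeeping (e.g. the ReLU's saved output)
                                                   # is kept in memory until this call

                print(adapter.weight.grad is not None)   # True: only the adapter collects parameter grads
                print(frozen_b[0].weight.grad is None)   # True: frozen weights get none, yet still cost memory to traverse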

              • 3 weeks ago
                Anonymous

                Thank you for the details anon. I'm looking forward to playing with this.

  40. 3 weeks ago
    Anonymous

    Putting these here for anyone that needs them

    VS2019: https://www.techspot.com/downloads/7241-visual-studio-2019.html#download_scroll

    VS2019 Build Tools: https://learn.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers

    • 3 weeks ago
      Anonymous

      why can't someone just package this all together? why do I have to install visual studio, anaconda and build a cuda kernel?

      • 3 weeks ago
        Anonymous

        It's the price you pay for being on the cutting edge anon.

      • 3 weeks ago
        Anonymous

        Because not every computer runs the same software on the same hardware

        • 3 weeks ago
          Anonymous

          That's why something called static linking exists
          I'd love to run all those different AI things on my pc if I could just have everything inside the same folder that I can put wherever I like, but no, instead I have to shit everything up with billions of dependencies that are hard to remove afterwards

          • 3 weeks ago
            Anonymous

            Conda keeps everything clean for the most part.

            • 3 weeks ago
              Anonymous

              Except now for some godforsaken reason you need to install visual studio for something to work
              And even though conda helps with different python dependencies it still throws files all over the place instead of keeping everything in the install folder

  41. 3 weeks ago
    Anonymous

    I'm getting an OOM error when trying to load 13b in 4bit. I'm on a 3090 and was running it perfectly in 8bit earlier, what gives? Anyone else run into this issue? Is it because I'm still using the HFv1 weights since v2 hasn't d/l'd yet perhaps?

    (I have 16GB of RAM and a 20GB paging file for reference.)

    • 3 weeks ago
      Anonymous

      You have to convert it with the updated .py or download a more recently converted model. I'm running 13b on a 3060 12GB.

  42. 3 weeks ago
    Anonymous

    Fucked if I know, got
    >VS Community 2019
    >VS Build Tools 2019
    Both with the "Desktop development with C++" workload installed (do I need any other components?)

  43. 3 weeks ago
    Anonymous

    somebody please stop me from buying 2 3090's

    • 3 weeks ago
      Anonymous

      4xxx series has native 4bit support

      • 3 weeks ago
        Anonymous

        also Arc has it
        3x A770 16GB cost less than a single 24GB 4090 and will be able to run the 65B model

        • 3 weeks ago
          Anonymous

          >3x A770 16GB cost less than a single 24GB 4090 and will be able to run the 65B model
          Do they support clustering though?
          It could be locked behind Arctic Sound M

          • 3 weeks ago
            Anonymous

            they don't need to support anything special
            the script can divide layers between different GPUs

        • 3 weeks ago
          Anonymous

          tempting, been considering getting one for a while now
          I wish the arc cards supported sr-iov like intel's other stuff, I would have already bought one if that was the case

  44. 3 weeks ago
    Anonymous

    Is LLaMA better than Pygmalion right now for the gf experience?

    • 3 weeks ago
      Anonymous

      Going by /aicg/, yeah

    • 3 weeks ago
      Anonymous

      where's the rocm support for 4 bit

      begone aicgger

  45. 3 weeks ago
    Anonymous

    How much VRAM does the 4bit 30b use?

    • 3 weeks ago
      Anonymous

      20 gb

      • 3 weeks ago
        Anonymous

        shouldn't it be 30/2 = 15?

        • 3 weeks ago
          Anonymous

          No, it's not 1GB per billion parameters, never was

          • 3 weeks ago
            Anonymous

            if you use 4 bits per parameter then it's 0.5GB per billion parameters,
            then you need some extra memory for computations
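
            napkin math, ignoring the per-group scales/zeros that GPTQ also stores and the KV cache:

              params = 30e9                  # "30B"
              weight_bytes = params * 4 / 8  # 4 bits per weight
              print(weight_bytes / 2**30)    # ~13.97 GiB of raw weights
              # add group-wise quantization params, activations and a KV cache that grows
              # with context length, and you land closer to the ~20GB reported above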

            • 3 weeks ago
              Anonymous

              Parameters aren't free floating, you need to store their relative addresses too, how they interconnect

              • 3 weeks ago
                Anonymous

                >how they interconnect
                it's strange they wouldn't go with some algorithmic topology, so connections could be calculated on-the-fly

              • 3 weeks ago
                Anonymous

                Isn't it just one matrix per layer plus attention (a sort of convolved mm)? So you just multiply the inputs by the next layer and run attention over the input sequence to get the next activation vector.
                I guess you need some vram to express this as code but calling it "how they interconnect" is a really weird way to say it.

  46. 3 weeks ago
    Anonymous

    is rentry down for you guys?

    • 3 weeks ago
      Anonymous

      yeah, just as I fucking need it

    • 3 weeks ago
      Anonymous

      Just read the documentation for the software you want to use. Why do all of you wait for these blog posts?

  47. 3 weeks ago
    Anonymous

    is there example output before i waste time setting it up to play with

  48. 3 weeks ago
    Anonymous

    How much ram would the biggest model need with this?
    If Nvidia released a 48 GB Ti/Titan, what model could it run?

    • 3 weeks ago
      Anonymous

      Seems like 65B at 4bit can fit into 48GB vram, so you can run it with two 3090s

    • 3 weeks ago
      Anonymous

      How much normal RAM should it take to load these quantized models?
      I've got a server with 12GB of RAM and a 12GB GPU, and GPTQ running 7B gets killed because it exhausts my RAM on load.

  49. 3 weeks ago
    Anonymous

    Why is the hash in:
    https://huggingface.co/decapoda-research/llama-7b-hf-int4/blob/main/llama-7b-4bit.pt
    ... different to the 7B hash in the torrent?
    Which one is supposed to be used with the GPTQ repo?

    • 3 weeks ago
      Anonymous

      Answering my own question:
      >https://rentry.org/llama-tard-v2
      They were converted incorrectly in the torrent. See note at top.

      • 3 weeks ago
        Anonymous

        so downloading torrent is pointless?

  50. 3 weeks ago
    Anonymous

    sharty won

    • 3 weeks ago
      Anonymous

      Fucking feds.

  51. 3 weeks ago
    Anonymous

    Is it better at coding than gpt?

    • 3 weeks ago
      Anonymous

      No. It hasn't really been fine-tuned for coding.

  52. 3 weeks ago
    Anonymous

    bros... what can i run with my 1050 that has 4GB VRAM ???

    or should i just stay with chatGPT ?
    is llama really so much better ?

    • 3 weeks ago
      Anonymous

      You can run Minecraft

  53. 3 weeks ago
    Anonymous

    Making some good progress in setting up analyzing DNA sequences and decoding and I'm curious to ask: What kind of changes would anon like to make to his or her body? Or even, what kind of organism would anon like to become? What would you edit to make you a more advanced lifeform? Photosythetic skin? Gills? Immortality? Dual sex organs? Anything.

  54. 3 weeks ago
    Anonymous

    >tfw my 2080 still can't run it because muh 8 GB

  55. 3 weeks ago
    Anonymous

    To anyone that can't run it, you're really not missing out on anything. 13b is ass.

  56. 3 weeks ago
    Anonymous

    Can I use this to read documents?
    That is the primary use I expect out of these things, I'm too lazy to train a bot to do it

    • 3 weeks ago
      Anonymous

      No, use bing
