  • sga@piefed.social · 2 days ago

    try to use it with llama cpp if you folks are interested in running local llms - https://github.com/ggml-org/llama.cpp/issues/9181

    the issue is closed, but not because it is solved. check it out, find the link for your relevant hardware (amd, intel, or something else), and see if your particular piece is supported. if so, you have hope.

    in case it is not, try the first-party stuff (intel openvino, intel oneapi, or the amd rocm stack) and use that with the transformers python library, or see if vllm has support.
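
    a rough sketch of the transformers route, assuming torch + transformers + accelerate are installed and your vendor backend (rocm/cuda/xpu) is visible to torch - the model id here is just a small placeholder, not a recommendation:

    ```python
    # sketch only: device_map="auto" (via accelerate) places layers on whatever
    # accelerator torch can see, falling back to cpu otherwise.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # placeholder small model
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = "Say hello in one short sentence."
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tok.decode(out[0], skip_special_tokens=True))
    ```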

    also, check r/localllama on the forbidden website for your particular hardware - someone has likely already done something with it.

      • Hexarei@beehaw.org · 24 hours ago

        It’s not a specific model, it’s a harness for running models. You can find ethically trained models to run on it, though, iirc.

        • sga@piefed.social · 13 hours ago

          pretty much this. I use SmolLM (a 3B-param model trained only on openly available datasets).
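
          (for reference, a minimal sketch of running a small gguf build of a model like that through the llama-cpp-python bindings - the file path is a placeholder:)

          ```python
          # sketch: assumes `pip install llama-cpp-python` and a quantised GGUF on disk
          from llama_cpp import Llama

          llm = Llama(model_path="./SmolLM3-3B-Q4_K_M.gguf", n_ctx=4096)  # placeholder path
          out = llm.create_chat_completion(
              messages=[{"role": "user", "content": "Summarise what an NPU does."}],
              max_tokens=64,
          )
          print(out["choices"][0]["message"]["content"])
          ```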

      • dyathinkhesaurus@lemmy.world · 2 days ago

        llama.cpp is not the same thing as ollama. It does a similar job, but better imo. It can multiplex/multithread sessions/conversations with llms (where ollama, lmstudio, etc. have to queue them up), stuff like that.
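
        A rough sketch of what that looks like from the client side, assuming a local llama-server started with parallel slots (e.g. `llama-server -m model.gguf -np 4`) and the `requests` package - two requests fired at once get decoded together instead of queued:

        ```python
        # sketch: llama-server exposes an openai-compatible endpoint on port 8080 by default;
        # with -np > 1 it can serve several conversations concurrently.
        import concurrent.futures
        import requests

        def ask(question):
            r = requests.post(
                "http://localhost:8080/v1/chat/completions",
                json={"messages": [{"role": "user", "content": question}], "max_tokens": 64},
            )
            return r.json()["choices"][0]["message"]["content"]

        with concurrent.futures.ThreadPoolExecutor() as pool:
            print(list(pool.map(ask, ["What is an NPU?", "What is a GPU?"])))
        ```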

        • sga@piefed.social · 1 day ago

          further clarification - ollama is a distribution of llama cpp (and it is a bit commercial in some sense). basically, in ye olde days of 2023-24 (decades ago in llm-space, as they say), llama cpp was a server/cli-only thing. it would provide output in the terminal (that is how i used to use it back then) or via an api (an openai-compatible one, so if you used openai stuff before, you can easily swap over). many people wanted a gui (a web-based chat interface), so ollama back then was a wrapper around llama cpp (there were several others, but ollama was relatively mainstream).

          then, as time progressed, ollama "allegedly enshittified", while llama cpp kept getting features (a web ui, the ability to swap models at runtime (which back then required a separate tool, llama-swap), etc.). the llama cpp stack is also a bit "lighter" (not really - they both are web tech, so as light as js can get) and first-party(ish - the interface was done by the community, but it lives in the same git repo), so more and more local llama folk kept switching to a llama cpp-only setup (you could use llama cpp with ollama, but at that point ollama was just a web ui, and not a great one; some people preferred comfyui, etc.). some old-timers (like me) never even tried ollama, as plain llama.cpp was sufficient for us.
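
          (to illustrate the "openai compatible" part - a sketch using the official openai python client pointed at a local llama-server; only the base_url and a dummy key change:)

          ```python
          # sketch: same client code as the hosted api, just re-pointed locally;
          # the api key can be any non-empty string unless the server was started with --api-key.
          from openai import OpenAI

          client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
          resp = client.chat.completions.create(
              model="local",  # llama-server serves whatever model it was started with
              messages=[{"role": "user", "content": "hello from the local api"}],
          )
          print(resp.choices[0].message.content)
          ```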

          as the above commenter said, you can do very fancy things with llama cpp. the best thing about it is that it works with both cpu and gpu - you can use both simultaneously, as opposed to vllm or transformers, where you almost always need a gpu. this simultaneous use is called offloading: some of the layers are kept in system memory instead of vram, which is how the vram-poor population got by on ram (this also kinda contributed to ram inflation, but do not blame llama cpp for it, blame people). you can do some of this on ollama too (as ollama is a wrapper around llama cpp), but that requires the ollama folks to keep their fork up to date with the parent as well as expose the said features in the ui.
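
          (a sketch of what offloading looks like through the llama-cpp-python bindings, assuming a gpu-enabled build - n_gpu_layers decides how many layers go to vram, the rest stay in system ram; the path and layer count are placeholders:)

          ```python
          # sketch: requires llama-cpp-python built with a gpu backend (cuda/rocm/vulkan)
          from llama_cpp import Llama

          llm = Llama(
              model_path="./some-model-Q4_K_M.gguf",  # placeholder path
              n_gpu_layers=20,  # put 20 layers in vram, keep the remaining layers in system ram
              n_ctx=4096,
          )
          print(llm("Q: what is offloading?\nA:", max_tokens=48)["choices"][0]["text"])
          ```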

  • Sparrow_1029@programming.dev · 2 days ago

    I got an Asus Vivobook with a pretty new AMD chipset that has one… running Fedora, no idea how to make anything use it either :/

    • Territorial@piefed.ca · 1 day ago

      Also on an Asus laptop with Fedora. After following some obscure instructions from AMD, I managed to get a Python script to confirm the NPU exists and is technically functional… but apparently, to do anything approaching useful, I need the slightly fancier NPU. What a waste.

        • knightly the Sneptaur@pawb.social · 1 day ago

          You can do lots of useful work with processors designed for large matrix calculus, but not a lot of folks want to run physics sims for some reason.

            • knightly the Sneptaur@pawb.social · 22 hours ago

              If there are game devs that want to try making a game out of a fluid sim, I’d be all for it. Realistically, most devs will have to wait for better middleware that adds NPU matrix acceleration to existing game engines.

  • √𝛂𝛋𝛆@piefed.world · 2 days ago

    You need a model compiled for the architecture. I saw some for the RK35xx devices when shopping for hardware. I do not think there is general-purpose software for splitting up or running arbitrary models on an NPU; the models must be configured for the physical hardware topology. The stuff that runs on most devices is very small, and these either need a ton of custom fine-tuning or they are barely capable of simple tasks.

    On the other hand, segmentation models are small, which is what makes layers, object identification, and background removal work. Looking at your CPU speed and available memory, it is unlikely to make much difference. You are also memory-constrained for running models, though you could use DeepSpeed to load from a disk drive too.
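
    For the background-removal case, a small sketch with the rembg package (which runs a compact u2net segmentation model through onnxruntime, on CPU by default) - file names are placeholders:

    ```python
    # sketch: assumes `pip install rembg pillow`
    from rembg import remove
    from PIL import Image

    img = Image.open("photo.jpg")            # placeholder input
    cutout = remove(img)                     # returns the image with the background masked out
    cutout.save("photo_no_background.png")   # PNG to keep the alpha channel
    ```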

  • hperrin@lemmy.ca · 2 days ago

    The only thing I can think of is AI video game upscalers. Other than that, yeah, it’s a waste of silicon.