(We’re going to need an AI community lol instead of posting to genzedong all the time)

Today I want to share how I use Deepseek to translate text.

I was given a friend’s game to translate into various languages. These usually come as json files or similar, structured in a specific way with key:value pairs. This makes it easier for devs and translators to handle multiple languages.

It’s also a file structure LLMs understand very well.

The file can look like this:

"tx_b_menu"			: "Menu",
"tx_b_newgame"			: "New Game",
"tx_b_continue"			: "Continue",
"tx_b_options"			: "Options",

etc

If it’s properly set up, like this one, then that’s additional context the LLM can use to understand what it’s translating by reading the key property.

Now mind you, this file has only 650 lines in it - it’s a small indie game. This is something deepseek can handle in one go without needing to break it up into multiple tasks. My upper limit so far has been sending it 1060 lines of JS, so it can take in a decent amount of context.

These strings can also contain variables such as [%v], which will be replaced by numbers or words in-game. They can also contain other markup such as [color=yellow]text[/c] to indicate the text should display as yellow.
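For illustration, entries with these markers could look something like this (made-up examples, not actual strings from the game):

"tx_hp_label"			: "HP: [%v]",
"tx_warn_poison"		: "[color=yellow]You are poisoned![/c]",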

I made a huge prompt for deepseek just to properly frame the task, but thanks to its reasoning capabilities it understood the structure just fine on the first try, including that it should leave variables and other markup alone.

To complete the translation, I sent deepseek the json file instead of pasting it (it can read text files but not pictures), and I actually sent two of them: one in English, the other in French. Both were human-made and so should be consistent with each other. That way, deepseek can (hopefully) cross-reference the two files to eliminate ambiguity if it’s not sure about a string. I once saw “Options” translated as “Choices” in a game, so.
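Before sending anything, it’s worth checking that the two source files actually line up. A minimal sketch, assuming each file is a single flat JSON object and using the file names from the prompt below (local_en.json / local_fr.json):

import json

# Load both human-made localization files (assuming each is one JSON object)
with open("local_en.json", encoding="utf-8") as f:
    en = json.load(f)
with open("local_fr.json", encoding="utf-8") as f:
    fr = json.load(f)

# Any key present in one file but not the other is an inconsistency
# the LLM would otherwise have to guess around.
print("Missing from local_fr:", sorted(en.keys() - fr.keys()) or "none")
print("Missing from local_en:", sorted(fr.keys() - en.keys()) or "none")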

In my prompt, I explained:

  • that I was sending two json files, and which languages they were in (and which one is which);
  • what the json file is and why games do this;
  • that it was a professional translator who knows both languages perfectly AND has experience working with video games and devs;
  • that the task is to translate to [language] while leaving the json alone, i.e. there are key:value pairs and it can only touch the value portion, not the key;
  • that it should retain the flair of the original strings, meaning don’t add or change stuff that’s not there (very important);
  • that the game is a ‘fantasy’ setting, so it knows what kind of words it’s looking for;
  • that it should read the files carefully before doing any translation;
  • how to handle special characters like \n so as not to break the UI;
  • a reminder to translate to [language] and to output the translated content.
Full prompt

I am sending you TWO json files for a video game called [game]. One of the files (local_en) contains the English strings for the game. The other, local_fr, contains a human translation of the same strings into French. As you know, video games often handle languages this way for translation purposes, loading the strings from an external json file so that it’s easier for translators and devs to translate into a variety of languages. That’s where you come in. You are a professional translator who knows English and French perfectly, and you especially have experience working on video games and with video game developers.

Your task is to translate the strings to [language] while leaving the JSON alone. This means that for each key:value pair, you will only translate what comes after the colon, i.e. the value, but not its property name.

When it comes to how to translate strings, there are of course various ways to approach it as a professional. You should as much as possible retain the flair of the original files, especially as it’s for a fantasy-type game. So read through the file carefully before starting the translation task and, above all, don’t add things that aren’t there in the original. Be direct: translate as closely to the original as possible so that strings remain consistent inside the game.

When it comes to \n special characters, in other words line breaks, you should check the length in the original first and decide where to place the \n in the [language] translation so as not to break the UI when the text is later loaded in the game. Likewise, translated strings shouldn’t be longer than the original if possible - visual space is a constraint here. Some terms come back often in various ways and should be kept consistent each time.

Remember, it has to be translated to [language]. Output the translated content and I will copy it manually from your output. Take a deep breath, don’t worry, and let’s do it!

It took around 25 seconds to think it through, catching things I hadn’t necessarily considered and planning how it would approach the task. Then it just generated a complete json file.

This was all done through the web interface. Because the file is so small in the first place, I don’t need the API which you have to pay for. Too much of a hassle.

It does take a while to output the translated strings, but that’s okay. I just play another game while it works and check back on it later.

The translated strings come back in perfectly valid JSON and I can even click to download the file. Then I just need to rename it, and I can test it in the game.
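Before dropping the renamed file into the game, a quick script can catch the mechanical problems: missing keys, lost [%v] variables or [color=...] markup, mismatched line breaks, and strings that grew much longer than the original. A rough sketch; the file names and the 1.3x length threshold are my own assumptions:

import json
import re

# Assumed file names: the English source and the downloaded translation,
# renamed for the target language. Assumes a flat file of key:string pairs.
with open("local_en.json", encoding="utf-8") as f:
    original = json.load(f)
with open("local_xx.json", encoding="utf-8") as f:
    translated = json.load(f)

# 1. Only values should have changed - the key sets must be identical.
if original.keys() != translated.keys():
    print("Key mismatch between the two files!")

# Matches anything in square brackets: [%v], [color=yellow], [/c], etc.
marker = re.compile(r"\[[^\]]*\]")

for key, src in original.items():
    out = translated.get(key, "")
    # 2. Variables and markup tags must survive the translation untouched.
    if sorted(marker.findall(src)) != sorted(marker.findall(out)):
        print(f"{key}: markers changed")
    # 3. Line break counts should match so the UI layout isn't thrown off.
    if src.count("\n") != out.count("\n"):
        print(f"{key}: line break count differs")
    # 4. Flag strings that grew noticeably longer than the original.
    if len(out) > 1.3 * len(src):
        print(f"{key}: translation much longer than original")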

With that, you can translate stuff very easily and make it accessible more broadly. There are hundreds of theory essays and books that only exist in one language. I’ve already used older LLMs for book translation tasks. By properly testing your prompt first, you can get consistent results on those longer texts too.

Some caveats:

  • I stick to HRLs (high-resource languages), since there is sufficient training data for the LLM; it will hallucinate in lower-resource languages.
  • Make a translation to a language you can read first so you know how it handles it, and refine your prompt afterwards. Keep doing small batch tests like this (a few strings at a time) until you’re satisfied - see the sketch after this list for one way to slice off a test batch. The prompt I shared above was created after years of doing translation tasks with LLMs; I (mostly) know what to tell them by now.
  • Also confirm your translations in languages you don’t understand: run some strings through Google Translate, ask someone who speaks the language if they can take a quick look, google the terms. For example, I took the word it gave me for “fireball” in Japanese and looked online to confirm it was used in other contexts (I found it on magic cards lol).
  • Is it perfect? Probably not. But the original translations were done by amateurs too (e.g. me for French), because the dev, like many people, has no money to pay a professional for everything.
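For the small batch tests mentioned above, something like this slices off the first few strings into a separate test file (a sketch; the batch size and file names are arbitrary):

import json
from itertools import islice

with open("local_en.json", encoding="utf-8") as f:
    en = json.load(f)

# Take the first 20 key:value pairs as a small test batch for prompt iteration.
batch = dict(islice(en.items(), 20))

with open("local_en_testbatch.json", "w", encoding="utf-8") as f:
    json.dump(batch, f, ensure_ascii=False, indent=4)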

But the good part is that it doesn’t destroy the original, right? A human can always come along and do a perfect human translation, or you can always redo the translations later with better models. It’s not destructive.

Hope this helps you out. If there are theory books that only exist in your language, I can only recommend making them accessible. We’d be happy to host them on ProleWiki if you don’t know where to disseminate them. (Spoiler: usually you go through the trouble and then find a super obscure translated edition from 50 years ago as soon as you finish lol)

-> late edit: since the game is not compiled (like a ton of indie games), it’s also possible for users to add their own language if it’s missing and they want it. I expect this use case will become bigger in the future: being able to customize your software and tailor it to your needs.

You can, for example, already find models that will generate subtitle files from a video (https://freesubtitles.ai/ is one I’ve used a few times, it’s free lol). If a series you’d like to watch is not available in your language, then you can have subtitles generated for it and enjoy it today.

  • KrasnaiaZvezda@lemmygrad.ml · 1 month ago

    (We’re going to need an AI community lol instead of posting to genzedong all the time)

    I had made c/Singularity for things like this, but most of the talk about AI/LLMs is on c/technology since it’s bigger.

    then that’s additional context the LLM can use to understand what it’s translating by reading the key property.

    I was just thinking that it would be easy to remove the keys for reduced tokens but treating it as extra context for the LLM makes a lot of sense.

    Nice job!

    And two questions: Do you ask the LLM after it’s done if there were mistakes or if there is anything that can be improved as well? And as for books and longer texts, do you have to break them up or do you keep to things that can be done in one go?

    • CriticalResist8@lemmygrad.mlOP · 1 month ago

      For books I break them up, but deepseek seems to be able to handle a huge amount of tokens. If it can’t handle it anymore (if the convo gets too long), it will return a server error, so I just copy my initial prompt and a long portion of text and start over in a new chat.

      Just to make sure, I tell it in my initial ‘framing’ prompt that I’m going to be sending excerpts sequentially and that it should only return the translation and nothing else.
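      A rough sketch of how the splitting could be scripted, assuming a plain-text book (the chunk size is an arbitrary guess, not a deepseek limit):

      # Split a plain-text book into excerpts small enough to paste one at a time.
      CHUNK_CHARS = 8000  # arbitrary; shrink it if the chat starts erroring out

      with open("book.txt", encoding="utf-8") as f:
          paragraphs = f.read().split("\n\n")

      chunks, current = [], ""
      for para in paragraphs:
          # Start a new excerpt rather than cutting a paragraph in the middle.
          if current and len(current) + len(para) > CHUNK_CHARS:
              chunks.append(current)
              current = ""
          current += para + "\n\n"
      if current:
          chunks.append(current)

      for i, chunk in enumerate(chunks, start=1):
          with open(f"excerpt_{i:03d}.txt", "w", encoding="utf-8") as f:
              f.write(chunk)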

      For mistakes etc you could probably ask it to do a second pass. You might even want to try a new, fresh chat so that it doesn’t know what the original was. That’s a good idea that I hadn’t thought about!

      I was just thinking that it would be easy to remove the keys for reduced tokens but treating it as extra context for the LLM makes a lot of sense.

      And it saves on effort too if you just send it the full file x)

      • KrasnaiaZvezda@lemmygrad.ml · 1 month ago

        lol

        For mistakes etc you could probably ask it to do a second pass. You might even want to try a new, fresh chat so that it doesn’t know what the original was. That’s a good idea that I hadn’t thought about!

        In some tests I was doing using Qwen 0.6B LLMs for classification, I did ask it multiple times and basically gave more weight to whatever appeared across more tries. In your case you could probably ask two different models and take anything translated the same way both times as “good enough”, then use an(other) LLM to check the remaining things, although the longer the sentence/text/key, the less such a system is likely to help and the more the raw LLM abilities will be necessary.

        And as for asking the LLMs for mistakes, I was curious because big LLMs should be able to catch some mistakes due to reflection…

        • CriticalResist8@lemmygrad.mlOP · 1 month ago

          I tried your proofread method with another file and I think there’s definitely some merit to it, to make sure that specific terms get translated the same way each time and to improve consistency. I just asked deepseek to do a second pass and look for consistency, typos, errors, etc. It didn’t seem to have a lot to correct though.