LM Studio Settings (from sudoingX)

  • Context length = 131072 (sets a 128K-token context window, large enough for the model to hold a sizeable codebase in working context while it works)
  • Max Concurrent (Experimental) = 1 (uses a single parallel slot, saving about 190MB of VRAM. You are one person talking to one model; you do not need multiple slots.)
  • Flash Attention = on (this is the big one. Without it, generation speed degrades as the context fills; with it, the speed curve stays flat, so 50 tok/s at 4K is still 50 tok/s with the window nearly full. No penalty.)
  • KV cache offload = on (helps fit larger models)
  • mmap = on (memory-maps the model file for faster loading)
  • Keep model in memory = on (prevents reloads)
  • K Cache Quantization Type = q4_0
  • V Cache Quantization Type = q4_0
    • These last two settings quantize the KV cache, which is how you fit 128K+ context on 12GB. Without them, the cache alone would eat your VRAM before you hit 32K; the arithmetic is sketched below.
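
To see why q4_0 cache quantization is the difference-maker, here is a minimal back-of-envelope sketch in Python. The sizing formula (2 tensors x layers x KV heads x head dim x context length x bits per element) is the standard one; the model shape below (40 layers, 8 KV heads, head dim 128) is a hypothetical 14B-class GQA configuration, not a number taken from LM Studio.

```python
# KV-cache sizing sketch. Assumed model shape: hypothetical 14B-class
# GQA model (40 layers, 8 KV heads, head dim 128), not a measured config.

def kv_cache_gib(ctx_len: int, bits_per_elem: float,
                 n_layers: int = 40, n_kv_heads: int = 8,
                 head_dim: int = 128) -> float:
    """GiB needed for the K and V caches at a given context length."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # K + V tensors
    return elems * bits_per_elem / 8 / 2**30

for ctx in (32_768, 131_072):
    f16 = kv_cache_gib(ctx, 16.0)  # default f16 cache, 16 bits/element
    q4 = kv_cache_gib(ctx, 4.5)    # q4_0: 4-bit values + block scales, ~4.5 bits/elem
    print(f"{ctx:>7} tokens: f16 {f16:5.2f} GiB vs q4_0 {q4:5.2f} GiB")
```

Under these assumptions the f16 cache alone is about 5 GiB at 32K and 20 GiB at 128K, while q4_0 brings 128K down to roughly 5.6 GiB, which is why the full window can fit next to a quantized model on a 12GB card.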

Adapted from the X article "12GB of VRAM runs more intelligence than you think in 2026."

Model Selection

For my computer (M4 Pro chip with 24GB unified memory); a rough memory check for these picks follows the list:

  • Fast & Efficient (Agents): Qwen 3.5 9B (Q4_K_M)
  • Default (Chat & Coding): Qwen 3 14B (Q4_K_M)
  • Max Intelligence: Qwen 3.5 27B (Q4_K_M)
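
As a sanity check that these picks fit in 24GB of unified memory, here is a minimal footprint sketch. The ~4.85 bits/weight figure is a commonly cited average for Q4_K_M GGUF files and is an assumption, not a measured value; it covers weights only, with KV cache and OS overhead on top.

```python
# Rough weight footprint at Q4_K_M for the three picks above.
# Q4_K_M_BPW is an assumed average (~4.85 bits/weight), not exact.

Q4_K_M_BPW = 4.85

def weights_gib(params_billions: float, bpw: float = Q4_K_M_BPW) -> float:
    """GiB of raw weights for a model of the given parameter count."""
    return params_billions * 1e9 * bpw / 8 / 2**30

for name, params in [("9B", 9), ("14B", 14), ("27B", 27)]:
    print(f"{name:>3}: ~{weights_gib(params):4.1f} GiB of weights")
```

That works out to roughly 5, 8, and 15 GiB of weights respectively, so even the 27B pick leaves room for a quantized KV cache on a 24GB machine.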