LM Studio Settings (from sudoingX)

  • Context length = 131072 (sets a 128K-token context window, large enough for the model to hold a sizeable codebase in working context while it works)
  • Max Concurrent (Experimental) = 1 (uses a single parallel slot, saving about 190MB of VRAM. You are one person talking to one model; you do not need multiple slots.)
  • Flash Attention = on (this is the big one. Without it, generation speed degrades as the context fills; with it, the speed curve stays flat, so 50 tok/s at 4K is still 50 tok/s with the window nearly full. No penalty.)
  • KV cache offload = on (helps fit larger models)
  • mmap = on (memory-maps the model file for faster loading)
  • Keep model in memory = on (prevents reloads)
  • K Cache Quantization Type = q4_0
  • V Cache Quantization Type = q4_0
    • These last two settings quantize the KV cache, which is how you fit 128K+ context on 12GB. Without them, the cache alone would eat your VRAM before you hit 32K; the arithmetic is sketched below.
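
To see why q4_0 cache quantization is the difference-maker, here is a minimal back-of-envelope sketch in Python. The sizing formula (2 tensors x layers x KV heads x head dim x context length x bits per element) is the standard one; the model shape below (40 layers, 8 KV heads, head dim 128) is a hypothetical 14B-class GQA configuration, not a number taken from LM Studio.

```python
# KV-cache sizing sketch. Assumed model shape: hypothetical 14B-class
# GQA model (40 layers, 8 KV heads, head dim 128), not a measured config.

def kv_cache_gib(ctx_len: int, bits_per_elem: float,
                 n_layers: int = 40, n_kv_heads: int = 8,
                 head_dim: int = 128) -> float:
    """GiB needed for the K and V caches at a given context length."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # K + V tensors
    return elems * bits_per_elem / 8 / 2**30

for ctx in (32_768, 131_072):
    f16 = kv_cache_gib(ctx, 16.0)  # default f16 cache, 16 bits/element
    q4 = kv_cache_gib(ctx, 4.5)    # q4_0: 4-bit values + block scales, ~4.5 bits/elem
    print(f"{ctx:>7} tokens: f16 {f16:5.2f} GiB vs q4_0 {q4:5.2f} GiB")
```

Under these assumptions the f16 cache alone is about 5 GiB at 32K and 20 GiB at 128K, while q4_0 brings 128K down to roughly 5.6 GiB, which is why the full window can fit next to a quantized model on a 12GB card.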

Adapted from the X article "12GB of VRAM runs more intelligence than you think in 2026."

Model Selection

For my computer (M4 Pro chip with 24GB unified memory); a rough memory check for these picks follows the list:

  • Fast & Efficient (Agents): Qwen 3.5 9B (Q4_K_M)
  • Default (Chat & Coding): Qwen 3 14B (Q4_K_M)
  • Max Intelligence: Qwen 3.5 27B (Q4_K_M)
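
As a sanity check that these picks fit in 24GB of unified memory, here is a minimal footprint sketch. The ~4.85 bits/weight figure is a commonly cited average for Q4_K_M GGUF files and is an assumption, not a measured value; it covers weights only, with KV cache and OS overhead on top.

```python
# Rough weight footprint at Q4_K_M for the three picks above.
# Q4_K_M_BPW is an assumed average (~4.85 bits/weight), not exact.

Q4_K_M_BPW = 4.85

def weights_gib(params_billions: float, bpw: float = Q4_K_M_BPW) -> float:
    """GiB of raw weights for a model of the given parameter count."""
    return params_billions * 1e9 * bpw / 8 / 2**30

for name, params in [("9B", 9), ("14B", 14), ("27B", 27)]:
    print(f"{name:>3}: ~{weights_gib(params):4.1f} GiB of weights")
```

That works out to roughly 5, 8, and 15 GiB of weights respectively, so even the 27B pick leaves room for a quantized KV cache on a 24GB machine.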