
Running Qwen3.5 Locally with Unsloth

Large language models are rapidly becoming more capable — and more accessible. One of the latest developments is the release of the Qwen3.5 family, a new generation of models designed for reasoning, coding, multimodal tasks, and agent workflows. This post walks step-by-step through how to run Qwen3.5 locally using Unsloth — from understanding the model to deployment and tool calling.

If you’ve been wanting to experiment with frontier-scale models on your own hardware, this guide covers everything you need.

Introduction

Qwen3.5 is a new model family released by Alibaba that includes models such as Qwen3.5-397B-A17B, a multimodal reasoning model with 397B total parameters (17B active per token). It supports long context windows, multilingual interaction, and hybrid reasoning modes — making it suitable for coding, agents, chat, and vision tasks. It can process up to 256K tokens (extendable to 1M) and supports 201 languages, placing it among the most capable open models currently available.

Despite its size (about 807GB on disk), quantization techniques from Unsloth allow the model to run locally with reduced memory footprints using 3-bit or 4-bit variants.

Understanding the Hybrid Reasoning Design

A key idea behind modern Qwen models is hybrid reasoning:

  • Thinking mode → deeper reasoning and multi-step analysis
  • Non-thinking mode → faster conversational responses

This concept originated in earlier Qwen architecture research, where models dynamically balance reasoning cost and latency via adaptive inference budgeting.

This unified approach means you no longer need separate reasoning vs chat models — the same model can switch modes depending on configuration.

Hardware and Storage Requirements

Before setup, understand the constraints:

| Quantization | RAM / VRAM Needed | Notes |
| --- | --- | --- |
| 3-bit | ~192GB RAM | Runs locally |
| 4-bit MXFP4 | ~256GB RAM | Recommended |
| 8-bit | ~512GB RAM | Highest accuracy |

4-bit dynamic quantization occupies about 214GB disk space and performs well with memory offloading setups (e.g., 24GB GPU + 256GB RAM).

For best performance:

  • Ensure combined RAM + VRAM at least matches the quantized model's on-disk size
  • Otherwise, fall back to SSD offloading (noticeably slower inference)
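As a quick sanity check before downloading hundreds of gigabytes, the memory rule above can be expressed as a few lines of Python — a minimal sketch, assuming roughly 10% headroom for the KV cache and runtime buffers (that factor is an assumption, not a documented requirement):

```python
def fits_in_memory(quant_size_gb: float, ram_gb: float, vram_gb: float = 0.0) -> bool:
    """Return True if combined RAM + VRAM can hold the quantized model.

    quant_size_gb is the on-disk size of the chosen quant (e.g. ~214 for
    the 4-bit dynamic variant). A small headroom factor is reserved for
    the KV cache and runtime buffers.
    """
    headroom = 1.1  # assumption: ~10% extra for context / KV cache
    return ram_gb + vram_gb >= quant_size_gb * headroom

# Example: 24GB GPU + 256GB RAM against the ~214GB 4-bit quant
print(fits_in_memory(214, ram_gb=256, vram_gb=24))  # True
```

If this returns False for your setup, expect to rely on SSD offloading and correspondingly slower token generation.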

Step-by-Step Setup

Step 1 — Install llama.cpp

Clone and build the latest version:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
  • Omit the -DGGML_CUDA=ON flag for a CPU-only build
  • This engine loads and runs the quantized GGUF models

Step 2 — Download the Model

Install download utilities:

pip install huggingface_hub hf_transfer

Choose a quant variant:

  • UD-Q2_K_XL (recommended minimum; good size vs accuracy balance)
  • MXFP4_MOE
  • UD-Q4_K_XL

Then download from the model repository.
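To fetch only the shards for one quant variant rather than the whole repository, you can build a huggingface-cli download command with an --include filter. A minimal sketch — the repo id below is a placeholder, not the actual repository name; substitute the Unsloth GGUF repo for your model:

```python
# Sketch: assemble a huggingface-cli download command that fetches only
# the GGUF files matching one quant variant. The repo id is a
# placeholder -- use the actual Unsloth GGUF repository.
import shlex

def hf_download_cmd(repo_id: str, quant: str, local_dir: str) -> str:
    """Build a shell command downloading only files matching the quant."""
    args = [
        "huggingface-cli", "download", repo_id,
        "--include", f"*{quant}*",
        "--local-dir", local_dir,
    ]
    return shlex.join(args)

print(hf_download_cmd("unsloth/<model-repo>-GGUF", "UD-Q2_K_XL", "models"))
```

Setting HF_HUB_ENABLE_HF_TRANSFER=1 in the environment enables the faster hf_transfer backend installed earlier.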

Unsloth uses Dynamic 2.0 quantization, where important layers are upcast to higher precision for better performance.

Step 3 — Run the Model in llama.cpp

Example launch parameters:

./build/bin/llama-cli \
  --model model.gguf \
  --threads 32 \
  --ctx-size 16384 \
  --n-gpu-layers 2

Tune parameters:

  • --threads → number of CPU threads used for inference
  • --ctx-size → context window length in tokens
  • --n-gpu-layers → number of layers offloaded to the GPU

Disable reasoning mode if desired:

--chat-template-kwargs "{\"enable_thinking\": false}"

Thinking Mode

  • temperature = 0.6
  • top_p = 0.95
  • top_k = 20

Non-Thinking Mode

  • temperature = 0.7
  • top_p = 0.8
  • top_k = 20

Additional tips:

  • Context window up to 262,144 tokens
  • Typical output length: 32K tokens
  • Presence penalty can reduce repetition (0–2 range)
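The per-mode settings above are easy to mix up at call time, so it can help to keep them in one place and select by mode. A small sketch — the values simply mirror the lists above, and the helper name is illustrative:

```python
# Recommended sampling presets from the lists above, selectable per mode.
SAMPLING_PRESETS = {
    "thinking":     {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8,  "top_k": 20},
}

def sampling_for(mode: str, presence_penalty: float = 0.0) -> dict:
    """Return sampler settings for a mode, optionally adding a
    presence penalty (0-2) to curb repetition."""
    if not 0.0 <= presence_penalty <= 2.0:
        raise ValueError("presence_penalty must be in [0, 2]")
    return {**SAMPLING_PRESETS[mode], "presence_penalty": presence_penalty}

print(sampling_for("thinking"))
```

These dictionaries can be passed straight into an OpenAI-style completion request once the server is running.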

Serving for Production

You can expose the model via llama-server.

Run server in one terminal:

./build/bin/llama-server -m model.gguf

Then connect from Python:

pip install openai

Use OpenAI-style completion calls pointing to the local endpoint.
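Concretely, llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint (port 8080 by default). The sketch below builds such a request with only the standard library, so you can see the wire format; the openai client targets the same API surface. The "local" model name is a placeholder — llama-server serves whichever GGUF it loaded:

```python
# Minimal OpenAI-style chat request against a local llama-server
# (default endpoint http://localhost:8080/v1). Standard library only;
# the `openai` package speaks the same protocol.
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8080/v1"):
    payload = {
        "model": "local",  # placeholder; the server serves its loaded GGUF
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Swapping in the openai client is a one-line change of base_url and api_key; the request body stays identical.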

This allows integration into applications, pipelines, or agent frameworks.

Tool Calling Capabilities

Qwen3.5 supports tool invocation workflows:

Examples include:

  • Arithmetic tools
  • Python execution
  • Linux command execution

Workflow:

  1. Define tools
  2. Launch server
  3. Send requests through an OpenAI-compatible client
  4. Parse tool calls automatically

This enables autonomous agent behaviors such as coding or environment interaction.
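The client side of that workflow can be sketched in a few lines. The tool schema follows the OpenAI function-calling format; the add tool and the dispatcher below are illustrative placeholders, not part of Qwen3.5 or llama.cpp:

```python
# Sketch of the client side of tool calling: declare a tool schema
# (OpenAI function-calling format), then dispatch the model's tool
# calls to local Python functions. The `add` tool is a placeholder.
import json

TOOLS = [{
    "type": "function",
    "function": {
        "name": "add",
        "description": "Add two numbers.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number"},
                "b": {"type": "number"},
            },
            "required": ["a", "b"],
        },
    },
}]

IMPLEMENTATIONS = {"add": lambda a, b: a + b}

def dispatch(tool_call: dict):
    """Execute one tool call as it appears in an assistant message."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    return IMPLEMENTATIONS[name](**args)

# A tool call shaped like one from the API response:
fake_call = {"function": {"name": "add", "arguments": '{"a": 2, "b": 3}'}}
print(dispatch(fake_call))  # 5
```

In a real agent loop, TOOLS is sent with each request, and each result from dispatch is appended back to the conversation as a tool-role message before the next turn.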

Multilingual and Benchmark Performance

Qwen3.5 demonstrates strong performance across reasoning and coding benchmarks compared to other frontier models:

  • Competitive scores on multilingual reasoning tasks
  • Strong math and translation capability
  • Solid coding benchmark performance (e.g., SWE-bench variants)

These results reflect the model’s emphasis on long-context reasoning and agent workflows rather than pure chat optimization.

Why Local Deployment Matters

Running models locally provides:

  • Privacy and data control
  • No API costs
  • Full customization
  • Experimental flexibility

Recent industry trends emphasize agentic capabilities — models taking actions in software environments — which Qwen3.5 explicitly targets through improved efficiency and autonomous task execution features. (Reuters)

Final Thoughts

Qwen3.5 represents a shift toward highly capable, locally deployable reasoning models. With Unsloth quantization and modern inference engines, even extremely large models can be explored outside cloud APIs.

The setup isn’t trivial — hardware requirements remain significant — but the workflow is now practical for advanced practitioners and research environments.

If you’re building:

  • AI agents
  • coding copilots
  • multilingual systems
  • local research pipelines

This stack is worth experimenting with.