16 changes: 16 additions & 0 deletions bitnet-studio/.gitignore
@@ -0,0 +1,16 @@
__pycache__/
*.pyc
*.pyo
.venv/
venv/
*.egg-info/
dist/
build/
.pytest_cache/
.mypy_cache/
.coverage
*.log
work/
adapters/
exports/
*.egg-info/
125 changes: 125 additions & 0 deletions bitnet-studio/README.md
@@ -0,0 +1,125 @@
# BitNet Studio

Final product of the **BitNet CPU-Universal** environment: it serves 1.58-bit
models on CPU, connects pluggable MCPs (e.g. protheus-rag), runs QLoRA
fine-tuning on a modest GPU, and exports to GGUF / HuggingFace / Ollama.

```
┌─────────────────────────────────────────────────────────────┐
│  Local web UI (vanilla JS, zero CDN)                        │
├─────────────────────────────────────────────────────────────┤
│  OpenAI-compatible API  /v1/chat/completions                │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────┐  │
│  │ Tool Engine  │←→│ MCP Bridge   │←→│ protheus-rag, ... │  │
│  │ (PT-BR+GBNF) │  │ (stdio RPC)  │  │ (pluggable)       │  │
│  └──────┬───────┘  └──────────────┘  └───────────────────┘  │
│         ↓                                                   │
│  llama-server (BitNet L2-L5 build, CPU-only, AVX2)          │
├─────────────────────────────────────────────────────────────┤
│  Training (GPU)          │ Export                           │
│  QLoRA 4-bit → merge →   │ GGUF + SHA256 / HF safetensors / │
│  quantized GGUF          │ Ollama Modelfile                 │
└─────────────────────────────────────────────────────────────┘
```

## Installation

```bash
cd bitnet-studio
python3 -m venv .venv && source .venv/bin/activate
pip install -e .              # core (serve/export/mcp)
pip install -e ".[train]"     # + QLoRA training (GPU)
```

Prerequisite: build the parent repo first (`cmake --build build -j`). The Studio
uses `build/bin/llama-server` and `build/bin/llama-quantize`.

## Quick start

### Serve (CPU-only, D4)

```bash
bitnet-studio serve     # http://127.0.0.1:8080
bitnet-studio models    # list the registry
```

Open `http://127.0.0.1:8080`, choose a model (e.g. `falcon3-10b-1.58`),
and ask in Portuguese. If the question needs Protheus, the model
calls the `protheus-rag` MCP automatically.

### Test an MCP in isolation

```bash
bitnet-studio mcp protheus-rag
bitnet-studio mcp protheus-rag --call consultar_base_direta \
  --args '{"pergunta": "tabela SE1 campos"}'
```
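
The bridge talks to each MCP over stdio. As a rough sketch of what one tool call could look like on the wire, the helpers below frame a request and parse a reply. This assumes the bridge follows the standard MCP stdio transport (newline-delimited JSON-RPC 2.0 with a `tools/call` method), which this README does not confirm; `encode_tool_call` and `decode_response` are hypothetical names, not part of bitnet-studio.

```python
import json


def encode_tool_call(request_id, tool, arguments):
    """Serialize one JSON-RPC 2.0 `tools/call` request as a single line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # ensure_ascii=False keeps accented PT-BR text readable on the wire.
    return json.dumps(message, ensure_ascii=False) + "\n"


def decode_response(line):
    """Parse one response line; raise if the server reported an error."""
    message = json.loads(line)
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "unknown MCP error"))
    return message.get("result")
```

Keeping each message on a single line is what lets a stdio reader split the stream with a plain `readline()`.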

### API (OpenAI-compatible)

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "falcon3-10b-1.58",
       "messages": [{"role": "user",
                     "content": "Quais campos tem a tabela SA1 do Protheus?"}]}'
```

The response includes a `tool_trace` listing each MCP call made in the agentic loop.
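
A minimal Python sketch of the same request using only the standard library. The helper names (`build_chat_request`, `extract_tool_trace`, `ask`) are hypothetical, and placing `tool_trace` at the top level of the response JSON is an assumption; the README only states that the response includes it.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:8080"  # default address of `bitnet-studio serve`


def build_chat_request(model, question, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tool_trace(response):
    """Return the recorded MCP calls, or [] when there are none.

    Assumption: `tool_trace` is a top-level key of the response body.
    """
    return response.get("tool_trace", [])


def ask(model, question):
    """Send the request to a running server and return the parsed JSON."""
    with urlopen(build_chat_request(model, question)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `extract_tool_trace(ask("falcon3-10b-1.58", "..."))` would give the list of MCP calls for one question, if the assumed response shape holds.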

### Hot-plugging an MCP at runtime

```bash
curl -X POST http://127.0.0.1:8080/mcp -H 'Content-Type: application/json' \
  -d '{"name": "meu-mcp", "command": "python3", "args": ["servidor.py"]}'
```

## PT-BR + tools fine-tuning (modest GPU)

Full pipeline, piloting on the 3B model before production on the 10B:

```bash
# 1. Generate a synthetic tool-calling dataset from the real tools
bitnet-studio mcp protheus-rag                 # list available tools
bitnet-studio dataset synth data/ptbr_tools.jsonl \
  --tools-json data/tools.json --asks data/perguntas.txt -n 10

# 2. Validate
bitnet-studio dataset validate data/ptbr_tools.jsonl

# 3. QLoRA (GPU). Pilot on the 3B first:
bitnet-studio finetune --base tiiuae/Falcon3-3B-Instruct \
  --dataset data/ptbr_tools.jsonl --out adapters/f3b-ptbr-tools

# Production 10B (8-16 GB GPU: lower --max-seq if you run out of VRAM):
bitnet-studio finetune --base tiiuae/Falcon3-10B-Instruct \
  --dataset data/ptbr_tools.jsonl --out adapters/f10b-ptbr-tools \
  --max-seq 512

# 4. Merge + quantize → GGUF ready for CPU
bitnet-studio merge --base tiiuae/Falcon3-10B-Instruct \
  --adapter adapters/f10b-ptbr-tools \
  --name falcon3-10b-ptbr-tools --workdir work/

# 5. Register it in configs/models.yaml and serve
```
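
To make the validation step concrete, here is a sketch of the kind of check a JSONL chat dataset can go through before training. It assumes the row shape the Colab script in this PR reads (`{"messages": [{"role": ..., "content": ...}]}`); the `validate_jsonl` helper and the allowed-role set are illustrative assumptions and do not describe what `bitnet-studio dataset validate` actually checks.

```python
import json

# Assumption: only these roles appear; a real dataset may also use others.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_jsonl(text):
    """Return a list of (line_number, problem) tuples; empty means valid."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = row.get("messages") if isinstance(row, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        for message in messages:
            if not isinstance(message, dict):
                problems.append((number, "message is not an object"))
                continue
            if message.get("role") not in ALLOWED_ROLES:
                problems.append((number, f"unexpected role: {message.get('role')!r}"))
            if not isinstance(message.get("content"), str):
                problems.append((number, "message 'content' must be a string"))
    return problems
```

Reporting line numbers rather than failing on the first bad row makes it easier to clean a synthetic dataset in one pass.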

## Exporting to other platforms

```bash
bitnet-studio export gguf   --source work/falcon3-10b-ptbr-tools-Q4_K_M.gguf \
  --name falcon3-10b-ptbr-tools
bitnet-studio export ollama --source work/falcon3-10b-ptbr-tools-Q4_K_M.gguf \
  --name falcon3-10b-ptbr-tools
bitnet-studio export hf     --source work/falcon3-10b-ptbr-tools-merged \
  --name falcon3-10b-ptbr-tools
```
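
The architecture diagram pairs the GGUF export with a SHA256 digest. How the export command computes or stores that digest is not shown here, so the following is only a sketch of how a recipient could verify a downloaded file independently; `sha256_file` and `verify_sha256` are hypothetical helpers.

```python
import hashlib


def sha256_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte GGUF files never sit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected_hex):
    """Compare the file digest with a published value, ignoring case."""
    return sha256_file(path) == expected_hex.strip().lower()
```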

## D4 guarantees

- Inference runs 100% on CPU (L2-L5 kernels from BitNet CPU-Universal)
- The server listens only on `127.0.0.1`
- Web UI with no CDN, no external fonts, no analytics
- `report_to=[]` during training (no wandb/telemetry)
- `--offline` in finetune/merge for air-gapped environments
- MCPs are auditable local subprocesses (stdio, no network in the bridge)
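
The loopback guarantee in this list can be checked mechanically. A small sketch, assuming a hypothetical `is_loopback` helper (not part of bitnet-studio) that classifies a bind address before a server is started:

```python
import ipaddress


def is_loopback(host):
    """Return True only for addresses that never leave the local machine."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (e.g. a hostname): treat as non-local.
        return False
```

Note that `0.0.0.0` is rejected: it binds every interface, which is exactly what a localhost-only guarantee rules out.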
156 changes: 156 additions & 0 deletions bitnet-studio/colab_f10b_run.py
@@ -0,0 +1,156 @@
"""Falcon 10B fine-tune: pure-Python version for %run on Colab.

USAGE ON COLAB:
    !rm -rf /content/BitNet
    !git clone --depth 1 https://github.com/peder1981/BitNet.git /content/BitNet
    %run /content/BitNet/bitnet-studio/colab_f10b_run.py
"""

# 1. Check for a GPU
import torch

if not torch.cuda.is_available():
    raise SystemExit(
        "\n" + "=" * 60 + "\n"
        "❌ ERROR: no GPU detected!\n\n"
        "Google Colab is running in CPU mode. To fix it:\n"
        " 1. Menu → Runtime → Change runtime type\n"
        " 2. Hardware accelerator: GPU\n"
        " 3. Save\n"
        " 4. Menu → Runtime → Restart runtime\n"
        " 5. Run this cell again\n"
        + "=" * 60
    )

print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
print(f"✅ VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

# 2. Install dependencies
import subprocess, sys
subprocess.check_call([
    sys.executable, "-m", "pip", "install", "-q",
    "transformers==4.40.0", "peft==0.11.0", "datasets==2.19.0",
    "accelerate==0.30.0", "bitsandbytes==0.43.0", "safetensors",
    "sentencepiece",  # Required by the Falcon3 tokenizer
])

# 3. Dataset
import json, os
from urllib.request import urlopen

DATASET_URL = "https://raw.githubusercontent.com/peder1981/BitNet/main/bitnet-studio/data/ptbr_tools_train_large.jsonl"
OUTPUT = "/content/f10b-ptbr-tools-qlora"

print("Downloading dataset...")
with urlopen(DATASET_URL) as resp:
    dataset_text = resp.read().decode("utf-8")

rows = []
for line in dataset_text.strip().split("\n"):
    if line.strip():
        rows.append(json.loads(line))

def to_text(messages):
    parts = []
    for m in messages:
        if m["role"] == "system":
            parts.append(f"<|system|>\n{m['content']}")
        elif m["role"] == "user":
            parts.append(f"<|user|>\n{m['content']}")
        else:
            parts.append(f"<|assistant|>\n{m['content']}")
    return "\n".join(parts) + "\n<|assistant|>\n"

texts = [to_text(r["messages"]) for r in rows]
print(f"Dataset: {len(texts)} examples")

# 4. Tokenizer (workaround for a tokenizers + Falcon3 bug)
from transformers import AutoTokenizer
import transformers

MODEL = "tiiuae/Falcon3-10B-Instruct"

# Clear the ENTIRE HuggingFace cache to avoid a corrupted tokenizer
import shutil
hf_cache = os.path.expanduser("~/.cache/huggingface/hub")
if os.path.exists(hf_cache):
    shutil.rmtree(hf_cache, ignore_errors=True)
    print("HuggingFace cache cleared")

# Force the slow tokenizer
transformers.utils.import_utils.is_tokenizers_available = lambda: False

tok = AutoTokenizer.from_pretrained(
    MODEL,
    trust_remote_code=True,
    use_fast=False,
)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# 5. QLoRA model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    device_map="auto",
    trust_remote_code=True,
    max_memory={0: "14GiB"},
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# 6. Training
ds = Dataset.from_dict({"text": texts}).map(
    lambda b: tok(b["text"], truncation=True, max_length=128, padding=False),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir=OUTPUT + "/checkpoints",
    max_steps=300,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    warmup_steps=30,
    logging_steps=10,
    save_strategy="steps", save_steps=50,
    optim="paged_adamw_8bit",
    fp16=False, bf16=True,
    seed=42, report_to=[],
    gradient_checkpointing=True,
)

trainer = Trainer(
    model=model, args=args, train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)

print("\n=== STARTING FALCON 10B TRAINING ===")
print("⚠️ If you hit OOM, restart and use colab_f3b_run.py")
trainer.train()

model.save_pretrained(OUTPUT)
tok.save_pretrained(OUTPUT)
print(f"\n✅ Adapter saved to: {OUTPUT}")

# 7. Download
from google.colab import files
import shutil

shutil.make_archive("/content/f10b-ptbr-tools-qlora", "zip", OUTPUT)
files.download("/content/f10b-ptbr-tools-qlora.zip")
print("📥 Download started!")