Capstone: CV pipeline in production

Proposed project

Build an end-to-end object detection pipeline in a domain of your choice (PPE on a construction site, car counting, products on a retail shelf). Deliverables: a labeled dataset, a fine-tuned YOLO model, a FastAPI API, ONNX + Triton with dynamic batching, a latency benchmark, and drift monitoring. Delivering at this level is a large part of what separates a junior from a senior CV engineer.

Deliverables

# Capstone checklist
dataset:
  - 500-2000 labeled images (Label Studio or CVAT)
  - stratified 80/10/10 split (train/val/test)
  - dataset.yaml (Ultralytics) or COCO JSON
  - dataset card documenting source, bias, and limitations

model:
  - YOLOv10s or YOLOv10m, fine-tuned
  - mAP@0.5:0.95 >= pretrained baseline
  - per-class confusion matrix in a notebook

export:
  - best.pt -> best.onnx (dynamic axes, end-to-end output; YOLOv10 needs no separate NMS)
  - best.onnx -> best.engine (TensorRT FP16)
  - numerical validation: MAE < 1e-3 between PyTorch and ONNX

serving:
  - FastAPI with /predict (multipart/form-data)
  - Triton Inference Server with model_repository
  - dynamic batching: max_queue_delay 5000us, preferred [4, 8]
  - multi-stage Dockerfile (build + slim runtime)

benchmark:
  - p50/p95/p99 latency at 1/4/8/16 concurrent clients
  - throughput (images/sec) on the target GPU
  - PyTorch vs ONNX vs TensorRT FP16 comparison
  - results table in the README

observability:
  - Prometheus metrics (request_count, latency_histogram, gpu_util)
  - Grafana dashboard committed as JSON
  - drift monitoring with Evidently (mean brightness, mean confidence)
  - alert: confidence_mean drop > 10% in 24h

writeup:
  - README covering problem, dataset, model, trade-offs
  - an honest "limitations and next steps" section
  - link to the repo, Docker Hub image, demo video
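
The stratified 80/10/10 split from the dataset items can be sketched in plain Python. This is a minimal sketch, not a feature of Ultralytics or CVAT: it assumes each image is given a single stratification key (for example its most frequent class), and `stratified_split` and the toy class names are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(image_to_class, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split image ids into train/val/test, keeping class proportions.

    image_to_class: dict mapping image id -> stratification key
    (assumption: one key per image, e.g. its dominant class).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, cls in sorted(image_to_class.items()):
        by_class[cls].append(image_id)

    splits = {"train": [], "val": [], "test": []}
    for cls in sorted(by_class):
        ids = by_class[cls]
        rng.shuffle(ids)
        n_train = int(len(ids) * ratios[0])
        n_val = int(len(ids) * ratios[1])
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test
    return splits


# Toy example: 100 images, 70 "helmet" and 30 "vest" (hypothetical classes).
labels = {f"img_{i:03d}.jpg": ("helmet" if i < 70 else "vest") for i in range(100)}
splits = stratified_split(labels)
print({name: len(ids) for name, ids in splits.items()})
```

Images with several classes make true stratification harder; picking one key per image is a simplification worth recording in the dataset card.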

FastAPI + Triton API

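import cv2
import numpy as np
from fastapi import FastAPI, File, HTTPException, UploadFile
from tritonclient.http import InferenceServerClient, InferInput, InferRequestedOutput

from preprocess import preprocess  # single shared preprocess.py, reused everywhere

app = FastAPI()
triton = InferenceServerClient(url="triton:8000")

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    raw = await file.read()
    arr = np.frombuffer(raw, dtype=np.uint8)
    img = cv2.imdecode(arr, cv2.IMREAD_COLOR)
    if img is None:  # cv2.imdecode returns None for undecodable input
        raise HTTPException(status_code=400, detail="could not decode image")
    x = preprocess(img)  # expected: float32 array with a batch dim, e.g. (1, 3, 640, 640)

    inp = InferInput("images", list(x.shape), "FP32")
    inp.set_data_from_numpy(x)
    out = InferRequestedOutput("output0")
    # Note: this HTTP client call is synchronous and blocks the event loop while it runs.
    resp = triton.infer(model_name="yolov10s", inputs=[inp], outputs=[out])
    detections = resp.as_numpy("output0")
    return {"detections": detections.tolist()}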
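
The endpoint above returns the raw `output0` tensor as nested lists. A minimal sketch of turning those rows into JSON-friendly detections follows. It assumes each row is `[x1, y1, x2, y2, score, class_id]`, which matches the six output columns declared for `output0` but should be verified against your own export; `format_detections`, `CLASS_NAMES`, and the 0.25 threshold are hypothetical.

```python
CLASS_NAMES = ["helmet", "vest"]  # hypothetical class list for illustration


def format_detections(rows, conf_threshold=0.25, class_names=CLASS_NAMES):
    """Convert raw detection rows into JSON-friendly dicts.

    rows: iterable of [x1, y1, x2, y2, score, class_id] (assumed layout).
    """
    results = []
    for x1, y1, x2, y2, score, class_id in rows:
        if score < conf_threshold:
            continue  # drop low-confidence and padding rows
        idx = int(class_id)
        results.append({
            "box": [float(x1), float(y1), float(x2), float(y2)],
            "score": float(score),
            "label": class_names[idx] if 0 <= idx < len(class_names) else str(idx),
        })
    return results


raw = [
    [10.0, 20.0, 110.0, 220.0, 0.91, 0.0],
    [50.0, 60.0, 90.0, 120.0, 0.12, 1.0],  # below threshold, filtered out
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],        # padding row
]
print(format_detections(raw))
```

For a single-image request, this would be applied to the first batch element, for example `format_detections(detections[0].tolist())`.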

Triton model_repository

# model_repository/yolov10s/config.pbtxt
name: "yolov10s"
platform: "onnxruntime_onnx"
max_batch_size: 16

input [
  { name: "images"  data_type: TYPE_FP32  dims: [ 3, 640, 640 ] }
]
output [
  { name: "output0"  data_type: TYPE_FP32  dims: [ -1, 6 ] }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 5000
}

instance_group [
  { count: 1  kind: KIND_GPU }
]

Latency benchmark

# Benchmark with NVIDIA's official perf_analyzer
perf_analyzer \
  -m yolov10s \
  -u triton:8000 \
  --concurrency-range 1:16:1 \
  --measurement-interval 5000 \
  --shape images:3,640,640 \
  -f benchmark.csv

💡

Report latency honestly: p50, p95, and p99 at concurrency 1, 4, 8, and 16. Throughput alone is misleading: an API serving 1000 req/s with a p99 of 2 s is unusable for any latency-sensitive client.
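
To make the point about tail latency concrete, here is a small sketch that computes nearest-rank percentiles from a list of latency samples. The samples are synthetic and the helper names are illustrative; in the capstone the numbers would come from the perf_analyzer CSV.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return mean and p50/p95/p99 for a list of latencies in milliseconds."""
    summary = {"mean": sum(latencies_ms) / len(latencies_ms)}
    for p in (50, 95, 99):
        summary[f"p{p}"] = percentile(latencies_ms, p)
    return summary


# Synthetic example: 97 fast requests at 20 ms and 3 slow ones at 2000 ms.
latencies_ms = [20.0] * 97 + [2000.0] * 3
print(summarize(latencies_ms))
```

In this synthetic run the median and p95 stay at 20 ms while p99 lands on the 2000 ms outliers, which is the pattern a throughput figure or an average would hide.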

Observability and drift

from prometheus_client import Counter, Histogram

# `app` is the FastAPI instance from the API module.
REQUESTS = Counter("cv_requests_total", "Total requests", ["status"])
LATENCY  = Histogram("cv_latency_seconds", "Request latency", buckets=[.01, .05, .1, .25, .5, 1, 2])
# Per-request mean detection confidence; record it with CONF_MEAN.observe(...) inside /predict.
CONF_MEAN = Histogram("cv_confidence_mean", "Mean confidence", buckets=[.3, .5, .7, .9])

@app.middleware("http")
async def metrics_mw(request, call_next):
    # Time the full request, including the Triton call.
    with LATENCY.time():
        resp = await call_next(request)
    REQUESTS.labels(status=str(resp.status_code)).inc()
    return resp
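
The checklist alert (confidence_mean drop > 10% in 24h) reduces to comparing two windows of per-request mean confidence. The sketch below shows only that comparison, under the assumption that the values have already been collected into a reference window and a latest-24h window; in practice the rule would more likely be expressed in Prometheus/Grafana alerting or an Evidently report. The function names and sample values are hypothetical.

```python
from statistics import fmean


def confidence_drop(baseline, current):
    """Relative drop of mean confidence versus the baseline (positive = worse)."""
    base = fmean(baseline)
    if base <= 0:
        raise ValueError("baseline mean confidence must be positive")
    return (base - fmean(current)) / base


def should_alert(baseline, current, max_drop=0.10):
    """True when mean confidence fell by more than max_drop (10% by default)."""
    return confidence_drop(baseline, current) > max_drop


# Hypothetical per-request mean confidences.
baseline_window = [0.82, 0.80, 0.78, 0.80]  # reference window
last_24h = [0.70, 0.66, 0.68]               # latest 24h window
print(round(confidence_drop(baseline_window, last_24h), 3),
      should_alert(baseline_window, last_24h))
```

A falling mean confidence is only a proxy for drift: it can also move when the class mix changes, so it is worth pairing with an input statistic such as mean brightness.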

✅

Delivering the repo with a dataset card, a reproducible benchmark, a public Docker image, and a committed Grafana dashboard says more than a three-page résumé. This capstone is the kind of project that starts a conversation with a senior hiring manager for a CV engineer role.
