feat(api): MistralOcrProvider + multipart upload on /invoices/upload

MistralOcrProvider (app/services/ocr/mistral_ocr_provider.ts):
- Two-step pipeline: POST /v1/ocr (mistral-ocr-latest) → structured markdown, then POST /v1/chat/completions (mistral-large-latest) with a strict json_schema response_format to extract the typed fields (clientName, clientEmail, numero, amountTtcCents, issueDate, dueDate) plus a `_conf` object holding the per-field confidence.
- Downloads the PDF from Drive (MinIO in dev) via getArrayBuffer and base64-encodes it for the data URI.
- Throws a clear error if storageKey is null (incompatible with the JSON {filenames} mode).
- Throws in the constructor if MISTRAL_API_KEY is missing.

getOcrProvider() now actually returns Mistral when OCR_PROVIDER=mistral (no more silent fallback to mock).

Multipart upload on POST /invoices/upload:
- Detects the Content-Type. If multipart/form-data: iterates over `files[]`, validates the extension (pdf/png/jpg/jpeg) and size (10mb), uploads each file to `import-drafts/<orgId>/<draftId>.<ext>` via @adonisjs/drive, then calls createImportBatch with sources [{filename, storageKey}].
- If JSON: compat route kept for demo mode.

Refactor of the import_batch service:
- New generic function createImportBatch(orgId, sources).
- createImportBatchFromFilenames() becomes a compat wrapper (storageKey null).
- OCR runs OUTSIDE the transaction (Mistral network calls are slow, 3-8s per PDF, so there is no reason to hold a PG lock).

Bruno:
- 06-Imports/02 Upload (multipart Mistral).bru: new, multipart-form body with @file() to select. Docs: .env setup, where to find files in the MinIO console, Mistral latency.
- Renumbers 03/04/05/06 (Get batch / Validate / Skip / Cancel).
- Updates the 01 Upload (mock) docs to point to 02 for real OCR.

To test:
1. .env → OCR_PROVIDER=mistral + MISTRAL_API_KEY=...
2. Restart pnpm dev:api
3. Bruno → Imports → 02 Upload (multipart Mistral) → select a PDF
4. Bruno → Imports → 03 Get batch (drafts have pdfStorageKey + extracted from the OCR)
2026-05-06 15:17:11 +02:00 · 2026-05-06 15:17:11 +02:00 · 19dd71bd93
commit 19dd71bd93
parent 57e1d0d0be
10 changed files with 412 additions and 51 deletions
--- a/apps/api/app/controllers/import_batches_controller.ts
+++ b/apps/api/app/controllers/import_batches_controller.ts
@@ -10,8 +10,15 @@ import {
  validateDraftValidator,
 } from '#validators/import_batch'
 import { resolveClient } from '#services/resolve_client'
-import { createImportBatchFromFilenames } from '#services/import_batch'
+import {
+  createImportBatch,
+  createImportBatchFromFilenames,
+  type ImportSource,
+} from '#services/import_batch'
 import { recordActivity } from '#services/activity_recorder'
+import drive from '@adonisjs/drive/services/main'
+import { createReadStream } from 'node:fs'
+import { randomUUID } from 'node:crypto'
 import type { HttpContext } from '@adonisjs/core/http'
 import { Exception } from '@adonisjs/core/exceptions'
 import db from '@adonisjs/lucid/services/db'
@@ -45,16 +52,64 @@ export default class ImportBatchesController {
  /**
   * POST /invoices/upload: starts an OCR batch.
   *
-   * V1 mock: accepts a JSON body `{ filenames: [...] }` (no real
-   * file). The service calls the MockOcrProvider, which invents plausible
-   * fields. Once Mistral is available, we will switch to multipart.
+   * Two modes depending on Content-Type:
+   *  - **multipart/form-data**: `files[]` field with the real PDFs.
+   *    MinIO storage + OCR (mock OR mistral depending on OCR_PROVIDER).
+   *  - **application/json**: `{ filenames: string[] }` (V1 demo).
+   *    No file is stored → works ONLY with OCR_PROVIDER=mock.
   */
-  async upload({ auth, request, response }: HttpContext) {
+  async upload(ctx: HttpContext) {
+    const { auth, request, response } = ctx
    const organizationId = requireOrgId(auth)
+
+    const isMultipart = (request.header('content-type') ?? '').startsWith('multipart/')
+
+    if (isMultipart) {
+      const files = request.files('files', {
+        size: '10mb',
+        extnames: ['pdf', 'png', 'jpg', 'jpeg'],
+      })
+      if (files.length === 0) {
+        return response.status(422).json({
+          errors: [
+            { code: 'validation_failed', field: 'files', message: 'Au moins un fichier requis' },
+          ],
+        })
+      }
+
+      // Upload to Drive (MinIO) BEFORE the OCR: the Mistral OCR downloads
+      // from Drive, so the file must already be in place.
+      // Key: import-drafts/<orgId>/<draftId>.<ext>. No batchId in the
+      // key because the batch is created afterwards.
+      const sources: ImportSource[] = []
+      for (const f of files) {
+        if (!f.isValid || !f.tmpPath || !f.extname) {
+          return response.status(422).json({
+            errors: [
+              {
+                code: 'validation_failed',
+                field: 'files',
+                message: f.errors?.[0]?.message ?? 'Fichier invalide',
+              },
+            ],
+          })
+        }
+        const draftKey = randomUUID()
+        const storageKey = `import-drafts/${organizationId}/${draftKey}.${f.extname}`
+        await drive.use().putStream(storageKey, createReadStream(f.tmpPath))
+        sources.push({
+          filename: f.clientName ?? `${draftKey}.${f.extname}`,
+          storageKey,
+        })
+      }
+
+      const batch = await createImportBatch(organizationId, sources)
+      return response.status(201).json({ data: serializeBatch(batch) })
+    }
+
+    // JSON mode: V1 demo compat.
    const { filenames } = await request.validateUsing(uploadValidator)
-
    const batch = await createImportBatchFromFilenames(organizationId, filenames)
-
    return response.status(201).json({ data: serializeBatch(batch) })
  }
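
The per-file checks and the storage key layout used by the multipart branch can be sketched as pure functions. This is a minimal illustration, not the controller's code: `validateCandidate`, `buildStorageKey` and `UploadCandidate` are hypothetical names, the byte limit is an assumed reading of `'10mb'`, and the case-insensitive extension match is an assumption.

```typescript
// Hypothetical helpers mirroring the controller's options (extnames + size).
const ALLOWED_EXTNAMES = new Set<string>(['pdf', 'png', 'jpg', 'jpeg'])
const MAX_SIZE_BYTES = 10 * 1024 * 1024 // assumed interpretation of '10mb'

type UploadCandidate = { clientName: string; extname: string; size: number }

/** Returns an error message, or null when the file passes both checks. */
function validateCandidate(file: UploadCandidate): string | null {
  // Case-insensitive match is an assumption made for this sketch.
  if (!ALLOWED_EXTNAMES.has(file.extname.toLowerCase())) {
    return `Unsupported extension: ${file.extname}`
  }
  if (file.size > MAX_SIZE_BYTES) {
    return `File too large: ${file.size} bytes`
  }
  return null
}

/** Builds import-drafts/<orgId>/<draftKey>.<ext>, the layout used in the diff. */
function buildStorageKey(organizationId: string, draftKey: string, extname: string): string {
  return `import-drafts/${organizationId}/${draftKey}.${extname}`
}
```

Validating every candidate before writing anything to storage would avoid leaving already-uploaded objects behind when a later file in the same request is rejected.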

--- a/apps/api/app/services/import_batch.ts
+++ b/apps/api/app/services/import_batch.ts
@@ -29,10 +29,17 @@ export type DraftConfidence = Partial<{
 }>

 /**
- * Builds the `extracted` (DraftFields) + `confidence` fields from
- * the raw OCR result. Attempts an immediate client match (case-insensitive
- * on the name) to pre-fill clientId, so the user has nothing to do
- * in the combobox when it matches.
+ * A draft "source": a filename + (optionally) a MinIO storageKey
+ * for the stored PDF. Mock OCR ignores storageKey, Mistral requires it.
+ */
+export type ImportSource = {
+  filename: string
+  storageKey: string | null
+}
+
+/**
+ * Builds `extracted` + `confidence` from the OCR result. Attempts an
+ * immediate client match (case-insensitive) to pre-fill `clientId`.
 */
 async function buildDraftFromOcr(
  organizationId: string,
@@ -68,42 +75,64 @@ async function buildDraftFromOcr(
 }

 /**
- * Creates a batch + N drafts from a list of filenames (V1 mock).
- * Once real multipart upload + MinIO is in place, this function will take
- * `Array<{ filename, storageKey }>` instead.
+ * Creates a batch + N drafts from N sources (filename + storageKey).
+ * The OCR provider (mock or mistral) is resolved internally.
+ *
+ * - Mock: storageKey=null is OK, extraction from the filename
+ * - Mistral: storageKey required, extraction from the stored PDF
 */
-export async function createImportBatchFromFilenames(
+export async function createImportBatch(
  organizationId: string,
-  filenames: string[]
+  sources: ImportSource[]
 ): Promise<ImportBatch> {
  const ocr = getOcrProvider()

-  // Default plan = first `is_default` of the org (provisioned at signup).
+  // Default plan = first is_default of the org (provisioned at signup).
  const defaultPlan = await Plan.query()
    .where('organization_id', organizationId)
    .where('is_default', true)
    .orderBy('name', 'asc')
    .first()

+  // OCR runs OUTSIDE the transaction (slow network calls, so we do not hold
+  // a PG lock meanwhile). If the OCR fails, the error surfaces before the INSERT.
+  type DraftPayload = {
+    filename: string
+    storageKey: string | null
+    extracted: DraftFields
+    edited: DraftFields
+    confidence: DraftConfidence
+  }
+  const drafts: DraftPayload[] = []
+
+  for (const src of sources) {
+    const result = await ocr.extract(src)
+    const { extracted, confidence } = await buildDraftFromOcr(
+      organizationId,
+      result,
+      defaultPlan?.id ?? null
+    )
+    drafts.push({
+      filename: src.filename,
+      storageKey: src.storageKey,
+      extracted,
+      edited: { ...extracted },
+      confidence,
+    })
+  }
+
  return db.transaction(async (trx) => {
    const batch = await ImportBatch.create({ organizationId }, { client: trx })

-    for (const filename of filenames) {
-      const result = await ocr.extract({ storageKey: null, filename })
-      const { extracted, confidence } = await buildDraftFromOcr(
-        organizationId,
-        result,
-        defaultPlan?.id ?? null
-      )
-
+    for (const d of drafts) {
      await ImportDraft.create(
        {
          batchId: batch.id,
-          filename,
-          pdfStorageKey: null,
-          extracted,
-          edited: { ...extracted },
-          confidence,
+          filename: d.filename,
+          pdfStorageKey: d.storageKey,
+          extracted: d.extracted,
+          edited: d.edited,
+          confidence: d.confidence,
          status: 'pending',
          invoiceId: null,
        },
@@ -115,3 +144,17 @@ export async function createImportBatchFromFilenames(
    return batch
  })
 }
+
+/**
+ * Compat wrapper: V1 mock JSON `{filenames}` → sources with a null storageKey.
+ * @deprecated Prefer `createImportBatch` with explicit sources.
+ */
+export async function createImportBatchFromFilenames(
+  organizationId: string,
+  filenames: string[]
+): Promise<ImportBatch> {
+  return createImportBatch(
+    organizationId,
+    filenames.map((filename) => ({ filename, storageKey: null }))
+  )
+}
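
The "OCR outside the transaction" refactor above boils down to two phases: collect every slow result first, then open a short transaction that only inserts. The sketch below isolates that shape with hypothetical stand-ins (`Extract`, `RunInTransaction`, `importSources`) in place of the real OCR provider and Lucid transaction, so it only illustrates the ordering.

```typescript
type Source = { filename: string; storageKey: string | null }
type Draft = Source & { extracted: Record<string, unknown> }

// Hypothetical stand-ins for the OCR provider and the database transaction.
type Extract = (src: Source) => Promise<Record<string, unknown>>
type Insert = (draft: Draft) => void
type RunInTransaction = (work: (insert: Insert) => Promise<void>) => Promise<void>

async function importSources(
  sources: Source[],
  extract: Extract,
  runInTransaction: RunInTransaction
): Promise<number> {
  // Phase 1: slow network work, no transaction is open yet.
  // If extract() throws here, nothing has been written.
  const drafts: Draft[] = []
  for (const src of sources) {
    drafts.push({ ...src, extracted: await extract(src) })
  }

  // Phase 2: short transaction that only performs inserts.
  await runInTransaction(async (insert) => {
    for (const draft of drafts) insert(draft)
  })
  return drafts.length
}
```

The trade-off is that all results are held in memory until the last extraction finishes, and one failing source aborts the whole batch before any row is written.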
--- a/apps/api/app/services/ocr/index.ts
+++ b/apps/api/app/services/ocr/index.ts
@@ -1,22 +1,20 @@
 import env from '#start/env'
 import type { OcrProvider } from '#services/ocr/ocr_provider'
 import { MockOcrProvider } from '#services/ocr/mock_ocr_provider'
+import { MistralOcrProvider } from '#services/ocr/mistral_ocr_provider'

 /**
 * Resolves the OCR implementation to use based on OCR_PROVIDER.
 *
 *  - `mock` (default): MockOcrProvider, plausible data from the filename.
- *  - `mistral`: to be wired up (cf. ADR-020). For now we fall back to mock
- *    with a warning so as not to break boot when the key is not set.
+ *    Compatible with /invoices/upload in JSON `{filenames}` mode.
+ *  - `mistral`: MistralOcrProvider. Requires a stored PDF (multipart
+ *    upload) + MISTRAL_API_KEY. Not compatible with JSON mode.
 */
 export function getOcrProvider(): OcrProvider {
  const provider = env.get('OCR_PROVIDER', 'mock')
  if (provider === 'mistral') {
-    // TODO: implement MistralOcrProvider once the API key is available.
-    // Meanwhile, we log and fall back to mock.
-    console.warn(
-      '[ocr] OCR_PROVIDER=mistral mais MistralOcrProvider pas implémenté — fallback sur mock'
-    )
+    return new MistralOcrProvider()
  }
  return new MockOcrProvider()
 }
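
The behavioral change in this file is that a misconfigured `mistral` setup now fails loudly instead of silently degrading to mock. A minimal sketch of that fail-fast resolution follows, using a plain object in place of the Adonis `env` service; `resolveOcrProviderName` and `EnvLike` are hypothetical names, and in the actual commit the key check lives in the `MistralOcrProvider` constructor.

```typescript
type ProviderName = 'mock' | 'mistral'
type EnvLike = Record<string, string | undefined>

/**
 * Resolves the provider name from environment-like values.
 * Any value other than 'mistral' resolves to 'mock', as in the diff.
 */
function resolveOcrProviderName(env: EnvLike): ProviderName {
  const provider = env['OCR_PROVIDER'] ?? 'mock'
  if (provider === 'mistral') {
    // Fail fast: no silent fallback when the key is absent or empty.
    if (!env['MISTRAL_API_KEY']) {
      throw new Error('MISTRAL_API_KEY is missing')
    }
    return 'mistral'
  }
  return 'mock'
}
```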
--- a/apps/api/app/services/ocr/mistral_ocr_provider.ts
+++ b/apps/api/app/services/ocr/mistral_ocr_provider.ts
@@ -0,0 +1,184 @@
+import drive from '@adonisjs/drive/services/main'
+import env from '#start/env'
+import type { OcrProvider, OcrResult } from '#services/ocr/ocr_provider'
+
+const MISTRAL_API = 'https://api.mistral.ai/v1'
+// Mistral's dedicated OCR model: extracts structured text from a document.
+const OCR_MODEL = 'mistral-ocr-latest'
+// Chat model for the 2nd step (markdown → typed JSON via strict json_schema).
+const EXTRACTION_MODEL = 'mistral-large-latest'
+
+const SYSTEM_PROMPT = `Tu es un extracteur de factures françaises B2B.
+Tu reçois le markdown d'une facture (issu d'une OCR) et tu retournes un
+JSON strict avec les champs demandés.
+
+Règles :
+- amountTtcCents : montant TTC en centimes (entier). Pas le HT.
+- issueDate / dueDate : ISO 8601 datetime UTC à 09:00 (ex. "2026-04-15T09:00:00.000Z").
+- clientEmail : null si absent ou illisible (pas d'invention).
+- numero : tel qu'imprimé sur la facture.
+- Si un champ est ambigu, mets une confiance basse (0.3–0.6).`
+
+/**
+ * Mistral OCR provider. Two-step pipeline:
+ *  1. POST /v1/ocr with the PDF as a base64 data URI → structured markdown
+ *  2. POST /v1/chat/completions with the markdown + strict json_schema →
+ *     typed field extraction
+ *
+ * Requires a real PDF (non-null storageKey). For dev without a PDF,
+ * use OCR_PROVIDER=mock.
+ */
+export class MistralOcrProvider implements OcrProvider {
+  private apiKey: string
+
+  constructor() {
+    const key = env.get('MISTRAL_API_KEY', '')
+    if (!key) {
+      throw new Error(
+        'MISTRAL_API_KEY manquante. Posez la dans .env ou bascule OCR_PROVIDER=mock.'
+      )
+    }
+    this.apiKey = key
+  }
+
+  async extract(input: {
+    storageKey: string | null
+    filename: string
+  }): Promise<OcrResult> {
+    if (!input.storageKey) {
+      throw new Error(
+        `MistralOcrProvider exige un PDF stocké (storageKey). Filename "${input.filename}" reçu sans storageKey — utiliser OCR_PROVIDER=mock pour les uploads sans fichier réel.`
+      )
+    }
+
+    // 1. Download the PDF from Drive (MinIO in dev), then base64-encode it.
+    const buffer = await this.downloadAsBuffer(input.storageKey)
+    const dataUri = `data:application/pdf;base64,${buffer.toString('base64')}`
+
+    // 2. OCR → markdown
+    const ocrJson = await this.postJson('/ocr', {
+      model: OCR_MODEL,
+      document: { type: 'document_url', document_url: dataUri },
+    })
+    const markdown = (ocrJson?.pages ?? [])
+      .map((p: { markdown?: string }) => p.markdown ?? '')
+      .join('\n\n')
+      .trim()
+
+    if (!markdown) {
+      throw new Error("Mistral OCR n'a retourné aucun texte exploitable")
+    }
+
+    // 3. Structured extraction via chat with a strict json_schema.
+    const extracted = await this.extractFields(markdown)
+
+    return {
+      fields: {
+        clientName: { value: extracted.clientName, confidence: extracted._conf.clientName },
+        clientEmail: { value: extracted.clientEmail, confidence: extracted._conf.clientEmail },
+        numero: { value: extracted.numero, confidence: extracted._conf.numero },
+        amountTtcCents: {
+          value: extracted.amountTtcCents,
+          confidence: extracted._conf.amountTtcCents,
+        },
+        issueDate: { value: extracted.issueDate, confidence: extracted._conf.issueDate },
+        dueDate: { value: extracted.dueDate, confidence: extracted._conf.dueDate },
+      },
+      rawProviderResponse: { ocr: ocrJson, extracted },
+    }
+  }
+
+  private async downloadAsBuffer(storageKey: string): Promise<Buffer> {
+    const arr = await drive.use().getArrayBuffer(storageKey)
+    return Buffer.from(arr)
+  }
+
+  private async postJson(path: string, body: unknown): Promise<any> {
+    const res = await fetch(`${MISTRAL_API}${path}`, {
+      method: 'POST',
+      headers: {
+        Authorization: `Bearer ${this.apiKey}`,
+        'Content-Type': 'application/json',
+      },
+      body: JSON.stringify(body),
+    })
+    if (!res.ok) {
+      const text = await res.text()
+      throw new Error(`Mistral ${path} → HTTP ${res.status}: ${text}`)
+    }
+    return res.json()
+  }
+
+  private async extractFields(markdown: string): Promise<{
+    clientName: string
+    clientEmail: string | null
+    numero: string
+    amountTtcCents: number
+    issueDate: string
+    dueDate: string
+    _conf: Record<string, number>
+  }> {
+    const json = await this.postJson('/chat/completions', {
+      model: EXTRACTION_MODEL,
+      messages: [
+        { role: 'system', content: SYSTEM_PROMPT },
+        { role: 'user', content: markdown },
+      ],
+      response_format: {
+        type: 'json_schema',
+        json_schema: {
+          name: 'invoice_fields',
+          strict: true,
+          schema: {
+            type: 'object',
+            additionalProperties: false,
+            properties: {
+              clientName: { type: 'string' },
+              clientEmail: { type: ['string', 'null'] },
+              numero: { type: 'string' },
+              amountTtcCents: { type: 'integer' },
+              issueDate: { type: 'string' },
+              dueDate: { type: 'string' },
+              _conf: {
+                type: 'object',
+                additionalProperties: false,
+                properties: {
+                  clientName: { type: 'number' },
+                  clientEmail: { type: 'number' },
+                  numero: { type: 'number' },
+                  amountTtcCents: { type: 'number' },
+                  issueDate: { type: 'number' },
+                  dueDate: { type: 'number' },
+                },
+                required: [
+                  'clientName',
+                  'clientEmail',
+                  'numero',
+                  'amountTtcCents',
+                  'issueDate',
+                  'dueDate',
+                ],
+              },
+            },
+            required: [
+              'clientName',
+              'clientEmail',
+              'numero',
+              'amountTtcCents',
+              'issueDate',
+              'dueDate',
+              '_conf',
+            ],
+          },
+        },
+      },
+      temperature: 0,
+    })
+
+    const content = json?.choices?.[0]?.message?.content
+    if (typeof content !== 'string') {
+      throw new Error('Mistral chat: pas de content string dans la réponse')
+    }
+    return JSON.parse(content)
+  }
+}
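
`extractFields` returns `JSON.parse(content)` directly and relies on the strict schema for its shape. A minimal sketch of a runtime guard that checks the payload and pairs each value with its `_conf` entry is shown below. `toOcrFields` and `OcrField` are hypothetical names, and clamping the confidence to [0, 1] is an assumption, since the schema only declares `number`.

```typescript
const FIELD_NAMES = [
  'clientName',
  'clientEmail',
  'numero',
  'amountTtcCents',
  'issueDate',
  'dueDate',
] as const
type FieldName = (typeof FIELD_NAMES)[number]
type OcrField = { value: unknown; confidence: number }

/** Parses the chat content and maps each field to { value, confidence }. */
function toOcrFields(content: string): Record<FieldName, OcrField> {
  const parsed: unknown = JSON.parse(content)
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Expected a JSON object')
  }
  const obj = parsed as Record<string, unknown>
  const conf = obj['_conf']
  if (typeof conf !== 'object' || conf === null) {
    throw new Error('Missing _conf object')
  }
  const confMap = conf as Record<string, unknown>

  const fields = {} as Record<FieldName, OcrField>
  for (const name of FIELD_NAMES) {
    if (!(name in obj)) throw new Error(`Missing field: ${name}`)
    const c = confMap[name]
    if (typeof c !== 'number') throw new Error(`Missing confidence for: ${name}`)
    // Clamp to [0, 1]: an assumption, not something the schema enforces.
    fields[name] = { value: obj[name], confidence: Math.min(1, Math.max(0, c)) }
  }
  return fields
}
```

A guard of this kind turns a malformed model response into an explicit error at the provider boundary rather than an `undefined` confidence further down in the draft-building code.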
--- a/bruno/06-Imports/01
+++ b/bruno/06-Imports/01
@@ -51,13 +51,12 @@ tests {
 docs {
  POST /api/v1/invoices/upload

-  V1 mock: we send a JSON body `{ filenames: [...] }` (no real
-  file). The service creates an ImportBatch + 1 ImportDraft per filename by
-  calling the `MockOcrProvider`, which invents plausible fields from
-  the file name.
+  JSON mode (V1 demo): body `{ filenames: [...] }`, no PDF stored.
+  Creates an ImportBatch + 1 ImportDraft per filename via the MockOcrProvider,
+  which invents plausible fields from the name.

-  Once Mistral is wired up: we will switch to multipart `files[]` with
-  an actual upload to MinIO. The response contract stays identical.
+  For real OCR with PDFs: use **02 Upload (multipart)** with
+  OCR_PROVIDER=mistral in the .env.

  Captures `batchId` and `draftId` (the first pending one) for the
  following requests.
--- a/bruno/06-Imports/02
+++ b/bruno/06-Imports/02
@@ -0,0 +1,82 @@
+meta {
+  name: 02 Upload (multipart Mistral)
+  type: http
+  seq: 2
+}
+
+post {
+  url: {{baseUrl}}/api/v1/invoices/upload
+  body: multipartForm
+  auth: inherit
+}
+
+body:multipart-form {
+  files: @file()
+  files: @file()
+}
+
+script:post-response {
+  if (res.getStatus() === 201) {
+    const batch = res.getBody().data;
+    bru.setEnvVar("batchId", batch.id);
+    if (batch.drafts && batch.drafts.length > 0) {
+      const firstPending = batch.drafts.find(d => d.status === "pending") || batch.drafts[0];
+      bru.setEnvVar("draftId", firstPending.id);
+    }
+  }
+}
+
+tests {
+  test("201 Created", function () {
+    expect(res.getStatus()).to.equal(201);
+  });
+  test("drafts ont un pdfStorageKey non null", function () {
+    const drafts = res.getBody().data.drafts;
+    if (drafts.length > 0) {
+      expect(drafts[0].pdfStorageKey).to.not.be.null;
+    }
+  });
+}
+
+docs {
+  POST /api/v1/invoices/upload (multipart/form-data)
+
+  Real OCR upload: one or more PDF (or PNG/JPG) files are
+  uploaded to MinIO, then the configured OCR provider (`OCR_PROVIDER` in
+  the .env) extracts the fields.
+
+  ## Bruno setup
+
+  In the `body:multipart-form` block above, click the `@file()`
+  to select a PDF from your disk. You can add more
+  `files: @file()` fields to upload several PDFs at once.
+
+  ## API setup
+
+  In `apps/api/.env`:
+  ```
+  OCR_PROVIDER=mistral
+  MISTRAL_API_KEY=ms_xxx...
+  ```
+
+  If `OCR_PROVIDER=mock`, the multipart upload also works but the
+  PDF is not analyzed: the MockOcrProvider invents fields from
+  the file name (the PDF is only stored for later use).
+
+  ## MinIO storage
+
+  Files land in `import-drafts/<orgId>/<draftId>.<ext>`.
+  They are visible in the MinIO console at http://localhost:9101 (login `rubis` /
+  `rubis-dev-secret`).
+
+  ## Validation
+
+  - Extensions: pdf, png, jpg, jpeg
+  - Max size: 10 MB per file
+
+  ## Mistral limits
+
+  The provider makes 2 calls (OCR → structured extraction). Latency is
+  ~3-8s per PDF. If extraction fails, the error surfaces as a 500; there
+  is no retry in V1 (to be moved to BullMQ for prod).
+}
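
The validation list above accepts PNG and JPG alongside PDF, while the provider in this commit builds its data URI with a fixed `application/pdf` media type. A minimal sketch of picking the media type from the extension follows. `toDataUri` and `MEDIA_TYPES` are hypothetical names, and whether the OCR endpoint expects a different payload shape for images is not covered here.

```typescript
// Hypothetical extension → media type table for the accepted uploads.
const MEDIA_TYPES: Record<string, string> = {
  pdf: 'application/pdf',
  png: 'image/png',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
}

/** Builds a base64 data URI whose media type matches the file extension. */
function toDataUri(extname: string, base64: string): string {
  const mediaType = MEDIA_TYPES[extname.toLowerCase()]
  if (!mediaType) {
    throw new Error(`Unsupported extension: ${extname}`)
  }
  return `data:${mediaType};base64,${base64}`
}
```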
--- a/bruno/06-Imports/03
+++ b/bruno/06-Imports/03
@@ -1,7 +1,7 @@
 meta {
-  name: 02 Get batch
+  name: 03 Get batch
  type: http
-  seq: 2
+  seq: 3
 }

 get {
--- a/bruno/06-Imports/04
+++ b/bruno/06-Imports/04
@@ -1,7 +1,7 @@
 meta {
-  name: 03 Validate draft
+  name: 04 Validate draft
  type: http
-  seq: 3
+  seq: 4
 }

 post {
--- a/bruno/06-Imports/05
+++ b/bruno/06-Imports/05
@@ -1,7 +1,7 @@
 meta {
-  name: 04 Skip draft
+  name: 05 Skip draft
  type: http
-  seq: 4
+  seq: 5
 }

 post {
--- a/bruno/06-Imports/06
+++ b/bruno/06-Imports/06
@@ -1,7 +1,7 @@
 meta {
-  name: 05 Cancel batch
+  name: 06 Cancel batch
  type: http
-  seq: 5
+  seq: 6
 }

 delete {