chore: sync project files

2026-01-13 19:49:04 +01:00
parent 53f8227941
commit ecda149a4b
149 changed files with 65272 additions and 1 deletions
--- a/pricewatch/app/stores/backmarket/__init__.py
+++ b/pricewatch/app/stores/backmarket/__init__.py
--- a/pricewatch/app/stores/backmarket/__pycache__/__init__.cpython-313.pyc
+++ b/pricewatch/app/stores/backmarket/__pycache__/__init__.cpython-313.pyc
--- a/pricewatch/app/stores/backmarket/__pycache__/store.cpython-313.pyc
+++ b/pricewatch/app/stores/backmarket/__pycache__/store.cpython-313.pyc
--- a/pricewatch/app/stores/backmarket/fixtures/README.md
+++ b/pricewatch/app/stores/backmarket/fixtures/README.md
@@ -0,0 +1,143 @@
+# Backmarket fixtures
+
+This directory contains real HTML files captured from Backmarket.fr for use in tests.
+
+## ⚠️ Important note about Backmarket
+
+Backmarket uses **anti-bot protection**:
+- A plain HTTP request returns **403 Forbidden**
+- **Playwright is REQUIRED** to retrieve the content
+- Load time: ~2-3 seconds
+
+## Backmarket specifics
+
+Backmarket sells **refurbished products**:
+- The price varies with the **condition** (Correct, Bon, Excellent, etc.)
+- Each product has several offers in different conditions
+- The extracted price is that of the offer selected by default
+
+## Files
+
+### backmarket_iphone15pro.html
+- **Product**: iPhone 15 Pro (refurbished)
+- **SKU**: iphone-15-pro
+- **URL**: https://www.backmarket.fr/fr-fr/p/iphone-15-pro
+- **Size**: ~1.5 MB
+- **Capture date**: 2026-01-13
+- **Captured price**: 571 EUR (price of the default offer)
+- **Usage**: Full parsing test for a refurbished smartphone
+
+## Backmarket HTML structure
+
+### JSON-LD Schema.org ✓
+Backmarket uses **structured JSON-LD** (unlike Cdiscount):
+```json
+{
+  "@type": "Product",
+  "name": "iPhone 15 Pro",
+  "offers": {
+    "@type": "Offer",
+    "price": "571.00",
+    "priceCurrency": "EUR"
+  }
+}
+```
+
+### Identified selectors
+
+#### Title
+```css
+h1.heading-1
+```
+Stable class names, simple and clean.
+
+#### Price
+Priority: **JSON-LD** (most reliable source)
+Fallback: `div[data-test='price']`
+
+#### Images
+```css
+img[alt]
+```
+CDN URLs: `https://d2e6ccujb3mkqf.cloudfront.net/...`
+
+#### SKU
+Extracted from the URL:
+```regex
+/p/([a-z0-9-]+)
+```
+Example: `/p/iphone-15-pro` → SKU = "iphone-15-pro"
+
+#### Condition (refurbishment grade)
+```css
+button[data-test='condition-button']
+div[class*='condition']
+```
+Possible values: Correct, Bon, Très bon, Excellent, Comme neuf
+
+## Comparison with other stores
+
+| Aspect | Amazon | Cdiscount | Backmarket |
+|--------|--------|-----------|------------|
+| **Anti-bot** | Weak | Strong | Strong |
+| **Method** | HTTP OK | Playwright | Playwright |
+| **JSON-LD** | Partial | ✗ No | ✓ Yes (complete) |
+| **Selectors** | Stable (IDs) | Unstable | Stable (classes) |
+| **SKU format** | `/dp/{ASIN}` | `/f-{cat}-{SKU}` | `/p/{slug}` |
+| **Notable trait** | - | Dynamic prices | Refurbished (condition) |
+
+## Usage in tests
+
+```python
+from pathlib import Path
+
+import pytest
+
+
+@pytest.fixture
+def backmarket_fixture_iphone15pro():
+    # Resolves to the repository root when this test file lives two directories below it.
+    fixture_path = Path(__file__).parent.parent.parent / \
+        "pricewatch/app/stores/backmarket/fixtures/backmarket_iphone15pro.html"
+    with open(fixture_path, "r", encoding="utf-8") as f:
+        return f.read()
+
+def test_parse_real_fixture(store, backmarket_fixture_iphone15pro):
+    url = "https://www.backmarket.fr/fr-fr/p/iphone-15-pro"
+    snapshot = store.parse(backmarket_fixture_iphone15pro, url)
+
+    # Exact values are safe here because the captured fixture is static.
+    assert snapshot.title == "iPhone 15 Pro"
+    assert snapshot.price == 571.0
+    assert snapshot.reference == "iphone-15-pro"
+    assert snapshot.currency == "EUR"
+```
+
+## Points to watch in tests
+
+1. **JSON-LD takes priority** - The price comes from the JSON-LD, not from the visible HTML
+2. **Variable price** - It changes with the selected condition
+3. **Do not assert the exact price on live pages** - It varies with the available offers; an exact assertion is only safe against a static captured fixture
+4. **Test the format** and the presence of the data
+5. Backmarket = **refurbished products** only
+
+## How to capture a new fixture
+
+```python
+from pricewatch.app.scraping.pw_fetch import fetch_playwright
+
+url = "https://www.backmarket.fr/fr-fr/p/..."
+result = fetch_playwright(url, headless=True, timeout_ms=60000)
+
+if result.success:
+    with open("fixture.html", "w", encoding="utf-8") as f:
+        f.write(result.html)
+```
+
+⚠️ **NEVER use** `fetch_http()` for Backmarket - it will return 403!
+
+## Advantages of Backmarket
+
+✓ **Structured JSON-LD** → Very reliable parsing
+✓ **Stable CSS classes** → Less breakage than Cdiscount
+✓ **Clean URL** → SKU is easy to extract
+✓ **Complete Schema.org** → Price, name, images in the JSON
+
+## Drawbacks
+
+✗ **Anti-bot protection** → Playwright required (slow)
+✗ **Multiple prices** → One product = several offers depending on condition
+✗ **Complex stock** → Depends on the offer and the condition
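
The JSON-LD-first price extraction that this README describes can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `product_from_json_ld` is a hypothetical helper, and treating the first entry of an `offers` list as the default offer is an assumption that the README does not confirm.

```python
import json
from typing import Optional


def product_from_json_ld(raw: str) -> Optional[dict]:
    """Return name, price and currency from a schema.org Product JSON-LD string."""
    try:
        data = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return None
    if not isinstance(data, dict) or data.get("@type") != "Product":
        return None

    offers = data.get("offers") or {}
    # Assumption: when several offers are listed, the first one is the default.
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    if not isinstance(offers, dict):
        offers = {}

    # schema.org allows the price as a string or a number; normalise to float.
    price = offers.get("price")
    try:
        price = float(price) if price is not None else None
    except (TypeError, ValueError):
        price = None

    return {
        "name": data.get("name"),
        "price": price,
        "currency": offers.get("priceCurrency"),
    }


sample = (
    '{"@type": "Product", "name": "iPhone 15 Pro", '
    '"offers": {"@type": "Offer", "price": "571.00", "priceCurrency": "EUR"}}'
)
print(product_from_json_ld(sample))
```

Returning `None` for anything that is not a Product lets a caller fall back to the CSS selectors without having to tell "no JSON-LD" apart from "malformed JSON-LD".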
--- a/pricewatch/app/stores/backmarket/fixtures/backmarket_iphone15pro.html
+++ b/pricewatch/app/stores/backmarket/fixtures/backmarket_iphone15pro.html
--- a/pricewatch/app/stores/backmarket/selectors.yml
+++ b/pricewatch/app/stores/backmarket/selectors.yml
@@ -0,0 +1,72 @@
+# CSS/XPath selectors for Backmarket.fr
+# Updated on 2026-01-13 after analysing the real HTML
+
+# ⚠️ IMPORTANT: Backmarket uses anti-bot protection
+# - Plain HTTP does NOT work (returns 403 Forbidden)
+# - Playwright is REQUIRED to retrieve the content
+# - CSS classes are relatively stable (heading-1, etc.)
+
+# Product title
+# Simple, stable classes
+title:
+  - "h1.heading-1"
+  - "h1"  # Fallback
+
+# Main price
+# ✓ JSON-LD schema.org available (takes priority)
+# Prices are in <script type="application/ld+json">
+price:
+  - "div[data-test='price']"  # Fallback if JSON-LD is not available
+  - "span[class*='price']"
+
+# Currency
+# Always EUR for Backmarket France
+currency:
+  - "meta[property='og:price:currency']"
+  # Fallback: static EUR
+
+# State / condition (specific to refurbished products)
+# Backmarket sells refurbished goods, so there are grades (Correct, Bon, Excellent, etc.)
+condition:
+  - "button[data-test='condition-button']"
+  - "div[class*='condition']"
+  - "span[class*='grade']"
+
+# Product images
+images:
+  - "img[alt]"  # All images with an alt attribute
+  # Filter for those that contain the product name
+
+# Category / breadcrumb
+category:
+  - "nav[aria-label='breadcrumb'] a"
+  - ".breadcrumb a"
+
+# Technical specifications
+# May be inside collapsible sections
+specs_table:
+  - "div[class*='specification']"
+  - "div[class*='technical']"
+  - "dl"
+
+# SKU / product reference
+# Extracting from the URL is more reliable
+# URL pattern: /fr-fr/p/{slug}
+# SKU = slug
+sku:
+  - "meta[property='product:retailer_item_id']"
+  - "span[data-test='sku']"
+
+# Stock / availability
+stock_status:
+  - "button[data-test='add-to-cart']"  # If present = in stock
+  - "div[class*='availability']"
+
+# Important notes:
+# 1. ⚠️ Playwright REQUIRED - HTTP returns 403 Forbidden
+# 2. JSON-LD schema.org available → takes priority for price/title
+# 3. CSS classes relatively stable (heading-1, etc.)
+# 4. SKU: extract from the URL /fr-fr/p/{slug}
+# 5. Condition (grade) matters for Backmarket (Correct/Bon/Excellent)
+# 6. Price varies with the chosen condition
+# 7. Currency: always EUR for France (static fallback OK)
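
The notes above recommend taking the SKU from the URL pattern `/fr-fr/p/{slug}` rather than from a selector. A minimal standalone sketch of that idea follows; `sku_from_url` is a hypothetical helper, and matching against the parsed path only (so a `/p/...` sequence inside a query string is ignored) is a design choice of this sketch, not something the commit states.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Same pattern as documented for Backmarket URLs: /p/{slug}
SLUG_RE = re.compile(r"/p/([a-z0-9-]+)", re.IGNORECASE)


def sku_from_url(url: str) -> Optional[str]:
    """Return the product slug from a Backmarket-style URL, or None."""
    if not url:
        return None
    # Search the path only, so query strings and fragments cannot match.
    match = SLUG_RE.search(urlparse(url).path)
    return match.group(1) if match else None


print(sku_from_url("https://www.backmarket.fr/fr-fr/p/iphone-15-pro?l=12#reviews"))
```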
--- a/pricewatch/app/stores/backmarket/store.py
+++ b/pricewatch/app/stores/backmarket/store.py
@@ -0,0 +1,358 @@
+"""
+Backmarket store - parsing of Backmarket.fr products.
+
+Supports extraction of: title, price, SKU, images, condition (grade), etc.
+Specific to Backmarket: it sells refurbished goods, so the price varies with the condition.
+"""
+
+import json
+import re
+from datetime import datetime
+from pathlib import Path
+from typing import Optional
+from urllib.parse import urlparse
+
+from bs4 import BeautifulSoup
+
+from pricewatch.app.core.logging import get_logger
+from pricewatch.app.core.schema import (
+    DebugInfo,
+    DebugStatus,
+    FetchMethod,
+    ProductSnapshot,
+    StockStatus,
+)
+from pricewatch.app.stores.base import BaseStore
+
+logger = get_logger("stores.backmarket")
+
+
+class BackmarketStore(BaseStore):
+    """Store for Backmarket.fr (refurbished products)."""
+
+    def __init__(self):
+        """Initialise the Backmarket store with its selectors."""
+        selectors_path = Path(__file__).parent / "selectors.yml"
+        super().__init__(store_id="backmarket", selectors_path=selectors_path)
+
+    def match(self, url: str) -> float:
+        """
+        Detect whether the URL belongs to Backmarket.
+
+        Returns:
+            0.9 for backmarket.fr
+            0.8 for backmarket.com
+            0.0 otherwise
+        """
+        if not url:
+            return 0.0
+
+        url_lower = url.lower()
+
+        if "backmarket.fr" in url_lower:
+            return 0.9
+        elif "backmarket.com" in url_lower:
+            return 0.8  # .com is used for other countries
+
+        return 0.0
+
+    def canonicalize(self, url: str) -> str:
+        """
+        Normalise the Backmarket URL.
+
+        Backmarket URLs usually have the form:
+        https://www.backmarket.fr/fr-fr/p/{slug}
+
+        The full URL is kept, minus query params and fragment.
+        """
+        if not url:
+            return url
+
+        parsed = urlparse(url)
+        # Strip query params and fragment
+        return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
+
+    def extract_reference(self, url: str) -> Optional[str]:
+        """
+        Extract the SKU (slug) from the URL.
+
+        Typical format: /fr-fr/p/{slug}
+        Example: /fr-fr/p/iphone-15-pro → "iphone-15-pro"
+        """
+        if not url:
+            return None
+
+        # Pattern: /p/{slug} (may be /fr-fr/p/ or /en-us/p/ etc.)
+        match = re.search(r"/p/([a-z0-9-]+)", url, re.IGNORECASE)
+        if match:
+            return match.group(1)
+
+        return None
+
+    def parse(self, html: str, url: str) -> ProductSnapshot:
+        """
+        Parse Backmarket HTML into a ProductSnapshot.
+
+        Uses JSON-LD schema.org first, then BeautifulSoup with selectors.
+        """
+        soup = BeautifulSoup(html, "lxml")
+
+        debug_info = DebugInfo(
+            method=FetchMethod.HTTP,  # Will be updated by the caller
+            status=DebugStatus.SUCCESS,
+            errors=[],
+            notes=[],
+        )
+
+        # Extract from JSON-LD first
+        json_ld_data = self._extract_json_ld(soup)
+
+        # Field extraction
+        title = json_ld_data.get("name") or self._extract_title(soup, debug_info)
+        price = json_ld_data.get("price") or self._extract_price(soup, debug_info)
+        currency = (
+            json_ld_data.get("priceCurrency") or self._extract_currency(soup, debug_info) or "EUR"
+        )
+        stock_status = self._extract_stock(soup, debug_info)
+        images = json_ld_data.get("images") or self._extract_images(soup, debug_info)
+        category = self._extract_category(soup, debug_info)
+        specs = self._extract_specs(soup, debug_info)
+        reference = self.extract_reference(url)
+
+        # Backmarket-specific: condition (refurbishment grade)
+        condition = self._extract_condition(soup, debug_info)
+        if condition:
+            specs["Condition"] = condition
+            debug_info.notes.append(f"Produit reconditionné: {condition}")
+
+        # Determine the final status
+        if not title or price is None:
+            debug_info.status = DebugStatus.PARTIAL
+            debug_info.notes.append("Parsing incomplet: titre ou prix manquant")
+
+        snapshot = ProductSnapshot(
+            source=self.store_id,
+            url=self.canonicalize(url),
+            fetched_at=datetime.now(),
+            title=title,
+            price=price,
+            currency=currency,
+            shipping_cost=None,
+            stock_status=stock_status,
+            reference=reference,
+            category=category,
+            images=images,
+            specs=specs,
+            debug=debug_info,
+        )
+
+        logger.info(
+            f"[Backmarket] Parsing {'réussi' if snapshot.is_complete() else 'partiel'}: "
+            f"title={bool(title)}, price={price is not None}"
+        )
+
+        return snapshot
+
+    def _extract_json_ld(self, soup: BeautifulSoup) -> dict:
+        """
+        Extract data from JSON-LD schema.org.
+
+        Backmarket uses schema.org Product; it is the most reliable source.
+        """
+        json_ld_scripts = soup.find_all("script", {"type": "application/ld+json"})
+
+        for script in json_ld_scripts:
+            try:
+                # script.string is None for an empty tag; "" raises JSONDecodeError, handled below
+                data = json.loads(script.string or "")
+                if isinstance(data, dict) and data.get("@type") == "Product":
+                    result = {
+                        "name": data.get("name"),
+                        "priceCurrency": None,
+                        "price": None,
+                        "images": [],
+                    }
+
+                    # Price from offers
+                    offers = data.get("offers", {})
+                    if isinstance(offers, dict):
+                        result["price"] = offers.get("price")
+                        result["priceCurrency"] = offers.get("priceCurrency")
+
+                        # Convert to float if it is a string
+                        if isinstance(result["price"], str):
+                            try:
+                                result["price"] = float(result["price"])
+                            except ValueError:
+                                result["price"] = None
+
+                    # Images
+                    image_data = data.get("image")
+                    if isinstance(image_data, str):
+                        result["images"] = [image_data]
+                    elif isinstance(image_data, list):
+                        result["images"] = image_data
+
+                    return result
+            except (json.JSONDecodeError, AttributeError):
+                continue
+
+        return {}
+
+    def _extract_title(self, soup: BeautifulSoup, debug: DebugInfo) -> Optional[str]:
+        """Extract the product title."""
+        selectors = self.get_selector("title", [])
+        if isinstance(selectors, str):
+            selectors = [selectors]
+
+        for selector in selectors:
+            element = soup.select_one(selector)
+            if element:
+                title = element.get_text(strip=True)
+                if title:
+                    return title
+
+        debug.errors.append("Titre non trouvé")
+        return None
+
+    def _extract_price(self, soup: BeautifulSoup, debug: DebugInfo) -> Optional[float]:
+        """Extract the price from the fallback selectors."""
+        selectors = self.get_selector("price", [])
+        if isinstance(selectors, str):
+            selectors = [selectors]
+
+        for selector in selectors:
+            elements = soup.select(selector)
+            for element in elements:
+                # "content" attribute (schema.org) or visible text
+                price_text = element.get("content") or element.get_text(strip=True)
+
+                # Extract the number (formats: "299,99", "299.99" or "299")
+                match = re.search(r"(\d+)[.,]?(\d*)", price_text)
+                if match:
+                    integer_part = match.group(1)
+                    decimal_part = match.group(2) or "00"
+                    price_str = f"{integer_part}.{decimal_part}"
+                    try:
+                        return float(price_str)
+                    except ValueError:
+                        continue
+
+        debug.errors.append("Prix non trouvé")
+        return None
+
+    def _extract_currency(self, soup: BeautifulSoup, debug: DebugInfo) -> Optional[str]:
+        """Extract the currency."""
+        selectors = self.get_selector("currency", [])
+        if isinstance(selectors, str):
+            selectors = [selectors]
+
+        for selector in selectors:
+            element = soup.select_one(selector)
+            if element:
+                # "content" attribute
+                currency = element.get("content")
+                if currency:
+                    return currency.upper()
+
+        # Default to EUR for Backmarket France
+        return "EUR"
+
+    def _extract_stock(self, soup: BeautifulSoup, debug: DebugInfo) -> StockStatus:
+        """Extract the stock status."""
+        # Look for the "add to cart" button
+        add_to_cart = soup.find("button", attrs={"data-test": "add-to-cart"})
+        # A bare `disabled` attribute parses as "", so test for its presence, not its value
+        if add_to_cart and not add_to_cart.has_attr("disabled"):
+            return StockStatus.IN_STOCK
+
+        # Fallback: look for text indicating availability
+        selectors = self.get_selector("stock_status", [])
+        if isinstance(selectors, str):
+            selectors = [selectors]
+
+        for selector in selectors:
+            element = soup.select_one(selector)
+            # Skip disabled controls so a greyed-out button's label is not read as "in stock"
+            if element and not element.has_attr("disabled"):
+                text = element.get_text(strip=True).lower()
+
+                # Check out-of-stock first: "indisponible" contains "disponible"
+                if (
+                    "rupture" in text
+                    or "indisponible" in text
+                    or "épuisé" in text
+                ):
+                    return StockStatus.OUT_OF_STOCK
+                elif "en stock" in text or "disponible" in text or "ajouter" in text:
+                    return StockStatus.IN_STOCK
+
+        return StockStatus.UNKNOWN
+
+    def _extract_images(self, soup: BeautifulSoup, debug: DebugInfo) -> list[str]:
+        """Extract the image URLs."""
+        images = []
+        selectors = self.get_selector("images", [])
+        if isinstance(selectors, str):
+            selectors = [selectors]
+
+        for selector in selectors:
+            elements = soup.select(selector)
+            for element in elements:
+                # src or data-src
+                img_url = element.get("src") or element.get("data-src")
+                if img_url and img_url.startswith("http"):
+                    # Avoid duplicates
+                    if img_url not in images:
+                        images.append(img_url)
+
+        return images
+
+    def _extract_category(self, soup: BeautifulSoup, debug: DebugInfo) -> Optional[str]:
+        """Extract the category from the breadcrumb."""
+        selectors = self.get_selector("category", [])
+        if isinstance(selectors, str):
+            selectors = [selectors]
+
+        for selector in selectors:
+            elements = soup.select(selector)
+            if elements:
+                # Take the last breadcrumb element (the most specific category)
+                categories = [elem.get_text(strip=True) for elem in elements if elem.get_text(strip=True)]
+                if categories:
+                    return categories[-1]
+
+        return None
+
+    def _extract_specs(self, soup: BeautifulSoup, debug: DebugInfo) -> dict[str, str]:
+        """Extract the technical specifications."""
+        specs = {}
+
+        # Look for dl elements (definition lists)
+        dls = soup.find_all("dl")
+        for dl in dls:
+            dts = dl.find_all("dt")
+            dds = dl.find_all("dd")
+
+            for dt, dd in zip(dts, dds):
+                key = dt.get_text(strip=True)
+                value = dd.get_text(strip=True)
+                if key and value:
+                    specs[key] = value
+
+        return specs
+
+    def _extract_condition(self, soup: BeautifulSoup, debug: DebugInfo) -> Optional[str]:
+        """
+        Extract the condition/grade of the refurbished product.
+
+        Backmarket-specific: Correct, Bon, Très bon, Excellent, etc.
+        """
+        selectors = self.get_selector("condition", [])
+        if isinstance(selectors, str):
+            selectors = [selectors]
+
+        for selector in selectors:
+            elements = soup.select(selector)
+            for element in elements:
+                text = element.get_text(strip=True)
+                # Look for the Backmarket grades
+                if any(grade in text for grade in ["Correct", "Bon", "Très bon", "Excellent", "Comme neuf"]):
+                    return text
+
+        return None
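
The fallback pattern in `_extract_price`, `(\d+)[.,]?(\d*)`, stops at the first non-digit, so a price written with a space as thousands separator, such as "1 299,99", would be read as 1.0. Whether Backmarket ever renders prices that way is an assumption. The sketch below shows one way to tolerate it: `parse_price_text` is a hypothetical helper, not part of the commit, and it does not handle dot-separated thousands ("1.299,00").

```python
import re
from typing import Optional

# Integer part with optional space-separated thousands groups
# (regular, non-breaking or narrow non-breaking space), then optional decimals.
PRICE_RE = re.compile(r"(\d{1,3}(?:[ \u00a0\u202f]\d{3})+|\d+)(?:[.,](\d{1,2}))?")


def parse_price_text(text: Optional[str]) -> Optional[float]:
    """Parse the first price-like number in `text`, or return None."""
    match = PRICE_RE.search(text or "")
    if not match:
        return None
    # Drop the thousands separators from the integer part.
    integer_part = re.sub(r"\D", "", match.group(1))
    decimal_part = match.group(2) or "00"
    return float(f"{integer_part}.{decimal_part}")


for raw in ["299,99 €", "299.99", "299", "1 299,99 €"]:
    print(raw, "->", parse_price_text(raw))
```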