gilles/scrap


Gilles Soulier d0b73b9319 codex2

2026-01-14 21:54:55 +01:00

Name | Last commit | Date
pricewatch | codex2 | 2026-01-14 21:54:55 +01:00
pricewatch.egg-info | codex2 | 2026-01-14 21:54:55 +01:00
scraped | chore: sync project files | 2026-01-13 19:49:04 +01:00
tests | codex2 | 2026-01-14 21:54:55 +01:00
webui | codex2 | 2026-01-14 21:54:55 +01:00
.coverage | codex2 | 2026-01-14 21:54:55 +01:00
.env | codex2 | 2026-01-14 21:54:55 +01:00
.env.example | codex2 | 2026-01-14 21:54:55 +01:00
.gitignore | codex2 | 2026-01-14 21:54:55 +01:00
AGENTS.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
alembic.ini | codex2 | 2026-01-14 21:54:55 +01:00
analys-amazon.txt | chore: sync project files | 2026-01-13 19:49:04 +01:00
ANALYSE_PROJET.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
analyze_aliexpress_data.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
analyze_backmarket.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
analyze_cdiscount.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
analyze_price_philips.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
BACKMARKET_ANALYSIS.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
CDISCOUNT_ANALYSIS.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
CHANGELOG.md | codex2 | 2026-01-14 21:54:55 +01:00
CLAUDE.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
DELIVERY_SUMMARY.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
detail_produit_backmarket.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
docker-compose.yml | codex2 | 2026-01-14 21:54:55 +01:00
Dockerfile | codex2 | 2026-01-14 21:54:55 +01:00
fetch_aliexpress_pw.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
fetch_aliexpress_wait.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
fetch_aliexpress.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
fetch_backmarket.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
fetch_cdiscount.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
Image collée (2).png | codex2 | 2026-01-14 21:54:55 +01:00
Image collée (3).png | codex2 | 2026-01-14 21:54:55 +01:00
Image collée (4).png | codex2 | 2026-01-14 21:54:55 +01:00
Image collée.png | codex2 | 2026-01-14 21:54:55 +01:00
INDEX.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
MIGRATION_GUIDE.md | codex2 | 2026-01-14 21:54:55 +01:00
PHASE_1_COMPLETE.md | codex2 | 2026-01-14 21:54:55 +01:00
PHASE_2_PROGRESS.md | codex2 | 2026-01-14 21:54:55 +01:00
PROJECT_SPEC.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
pyproject.toml | codex2 | 2026-01-14 21:54:55 +01:00
QUICKSTART.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
README.md | codex2 | 2026-01-14 21:54:55 +01:00
scrap_url.yaml | codex2 | 2026-01-14 21:54:55 +01:00
scraped_store.json | codex2 | 2026-01-14 21:54:55 +01:00
SESSION_2_SUMMARY.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
SESSION_SUMMARY.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
test_aliexpress_parser.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
test_aliexpress_product2.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
test_amazon.json | chore: sync project files | 2026-01-13 19:49:04 +01:00
test_backmarket_macbook_m3.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
test_backmarket_macbook.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
test_backmarket_parser.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
test_backmarket_samsung.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
test_cdiscount_parser.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
test_cdiscount.json | chore: sync project files | 2026-01-13 19:49:04 +01:00
TEST_FILES_README.md | chore: sync project files | 2026-01-13 19:49:04 +01:00
test_result.json | chore: sync project files | 2026-01-13 19:49:04 +01:00
test_selectors.py | chore: sync project files | 2026-01-13 19:49:04 +01:00
TODO.md | codex2 | 2026-01-14 21:54:55 +01:00

README.md

PriceWatch 🛒

Python application for tracking e-commerce prices (Amazon, Cdiscount, extensible).

Description

PriceWatch is a CLI application for scraping and tracking product prices across e-commerce sites. It detects the merchant site automatically, fetches the data (HTTP first, with Playwright as a fallback), and produces a usable price history.

Features

✅ Automatic scraping with merchant-site detection
✅ Multi-method fetching (HTTP first, Playwright as a fallback)
✅ Amazon and Cdiscount support (extensible architecture)
✅ Full product data extraction (price, title, images, specs, stock)
✅ Reproducible YAML → JSON pipeline
✅ Detailed logging and debug mode
✅ pytest tests with HTML fixtures

Prerequisites

Python 3.12+
pip

Installation

# Clone the repository
git clone <repository-url>
cd scrap

# Install the dependencies
pip install -e .

# Install the Playwright browsers
playwright install

Project structure

pricewatch/
├── app/
│   ├── core/              # Core models and utilities
│   │   ├── schema.py      # ProductSnapshot (Pydantic model)
│   │   ├── registry.py    # Automatic store detection
│   │   ├── io.py          # YAML reading / JSON writing
│   │   └── logging.py     # Logging configuration
│   ├── scraping/          # Fetching methods
│   │   ├── http_fetch.py  # HTTP fetching
│   │   └── pw_fetch.py    # Playwright fetching
│   ├── stores/            # Per-site parsers
│   │   ├── base.py        # BaseStore abstract class
│   │   ├── amazon/
│   │   │   ├── store.py
│   │   │   ├── selectors.yml
│   │   │   └── fixtures/
│   │   └── cdiscount/
│   │       ├── store.py
│   │       ├── selectors.yml
│   │       └── fixtures/
│   ├── db/                # SQLAlchemy persistence (Phase 2)
│   │   ├── models.py
│   │   ├── connection.py
│   │   └── migrations/
│   ├── tasks/             # RQ jobs (Phase 3)
│   │   ├── scrape.py
│   │   └── scheduler.py
│   └── cli/
│       └── main.py        # Typer CLI
├── tests/                 # pytest tests
├── scraped/               # Debug files (HTML, screenshots)
├── scrap_url.yaml         # URLs to scrape
└── scraped_store.json     # Scraping output

CLI usage

Full pipeline

# Scrape every URL defined in scrap_url.yaml
pricewatch run --yaml scrap_url.yaml --out scraped_store.json

# With debug output
pricewatch run --yaml scrap_url.yaml --out scraped_store.json --debug

# With database persistence
pricewatch run --yaml scrap_url.yaml --out scraped_store.json --save-db

Utility commands

# Detect the store from a URL
pricewatch detect https://www.amazon.fr/dp/B08N5WRWNW

# Fetch a page (HTTP)
pricewatch fetch https://www.amazon.fr/dp/B08N5WRWNW --http

# Fetch a page (Playwright)
pricewatch fetch https://www.amazon.fr/dp/B08N5WRWNW --playwright

# Parse an HTML file with a specific store
pricewatch parse amazon --in scraped/page.html

# Check the installation
pricewatch doctor

Database commands

# Create the tables
pricewatch init-db

# Generate a migration
pricewatch migrate "Initial schema"

# Apply the migrations
pricewatch upgrade

# Roll back one migration
pricewatch downgrade -1

Worker commands

# Start an RQ worker
pricewatch worker

# Enqueue a job to run immediately
pricewatch enqueue "https://example.com/product"

# Schedule a recurring job
pricewatch schedule "https://example.com/product" --interval 24

Database (Phase 2)

# Start PostgreSQL + Redis locally
docker-compose up -d

# Sample configuration
cp .env.example .env

JSON -> DB migration guide: MIGRATION_GUIDE.md

REST API (Phase 3)

The API is protected by a simple bearer token.

export PW_API_TOKEN=change_me
docker compose up -d api

Examples:

curl -H "Authorization: Bearer $PW_API_TOKEN" http://localhost:8001/products
curl http://localhost:8001/health

Filters (quick examples):

curl -H "Authorization: Bearer $PW_API_TOKEN" \
  "http://localhost:8001/products?price_min=100&stock_status=in_stock"
curl -H "Authorization: Bearer $PW_API_TOKEN" \
  "http://localhost:8001/products/1/prices?fetch_status=success&fetched_after=2026-01-14T00:00:00"
curl -H "Authorization: Bearer $PW_API_TOKEN" \
  "http://localhost:8001/logs?fetch_status=failed&fetched_before=2026-01-15T00:00:00"

Exports (CSV/JSON):

curl -H "Authorization: Bearer $PW_API_TOKEN" \
  "http://localhost:8001/products/export?format=csv"
curl -H "Authorization: Bearer $PW_API_TOKEN" \
  "http://localhost:8001/logs/export?format=json"

CRUD (quick examples):

curl -H "Authorization: Bearer $PW_API_TOKEN" -X POST http://localhost:8001/products \
  -H "Content-Type: application/json" \
  -d '{"source":"amazon","reference":"REF1","url":"https://example.com"}'

Webhooks (quick examples):

curl -H "Authorization: Bearer $PW_API_TOKEN" -X POST http://localhost:8001/webhooks \
  -H "Content-Type: application/json" \
  -d '{"event":"price_changed","url":"https://example.com/webhook","enabled":true}'
curl -H "Authorization: Bearer $PW_API_TOKEN" -X POST http://localhost:8001/webhooks/1/test
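
For scripted access, the filtered GET calls shown with curl can be reproduced from Python. The sketch below uses only the standard library and builds the request without sending it. The helper name build_request is hypothetical and not part of PriceWatch; the port and query parameters are copied from the curl examples.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8001"  # port taken from the curl examples


def build_request(path: str, token: str, **filters: object) -> Request:
    """Build an authenticated GET request; filters set to None are skipped."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = f"{BASE_URL}{path}" + (f"?{query}" if query else "")
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Same call as the first filter example; nothing is sent over the network here.
req = build_request("/products", "change_me", price_min=100, stock_status="in_stock")
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing the returned object to urllib.request.urlopen would perform the call once the api container is running.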

Web UI (Phase 4)

Dense Vue 3 interface with Gruvbox/Monokai themes, a fixed header, a filter sidebar, and a split compare view.

docker compose up -d frontend
# Access: http://localhost:3000

Configuration (scrap_url.yaml)

urls:
  - "https://www.amazon.fr/dp/B08N5WRWNW"
  - "https://www.cdiscount.com/informatique/clavier-souris-webcam/example/f-1070123-example.html"

options:
  use_playwright: true    # Use Playwright as a fallback
  headful: false          # Headless mode (true = show the browser)
  save_html: true         # Save the HTML for debugging
  save_screenshot: true   # Save a screenshot for debugging
  timeout_ms: 60000       # Per-page timeout (ms)
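
As an illustration of how the options block could be validated once the YAML file has been parsed, here is a minimal sketch. ScrapeOptions and load_options are hypothetical names, the defaults simply mirror the sample file, and the project's real loader (io.py) may behave differently. A plain dict stands in for the parsed YAML so the example needs no third-party package.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ScrapeOptions:
    # Defaults mirror the sample file; the project's real defaults may differ.
    use_playwright: bool = True
    headful: bool = False
    save_html: bool = True
    save_screenshot: bool = True
    timeout_ms: int = 60000


def load_options(raw: dict) -> ScrapeOptions:
    """Reject unknown keys and fill missing ones with defaults."""
    known = {f.name for f in fields(ScrapeOptions)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    opts = ScrapeOptions(**raw)
    if opts.timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    return opts


# 'config' stands in for the dict a YAML parser would return for scrap_url.yaml.
config = {
    "urls": ["https://www.amazon.fr/dp/B08N5WRWNW"],
    "options": {"headful": True, "timeout_ms": 30000},
}
options = load_options(config.get("options", {}))
print(options)
```

Rejecting unknown keys early is a design choice of this sketch: a typo such as time_out_ms fails loudly instead of silently falling back to the default.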

Output format (ProductSnapshot)

Each scraped product is represented by a ProductSnapshot object containing:

Metadata

source: Originating site (amazon, cdiscount, unknown)
url: Canonical product URL
fetched_at: Fetch date and time (ISO 8601)

Product data

title: Product name
price: Price (float or null)
currency: Currency (EUR, USD, etc.)
shipping_cost: Shipping cost (float or null)
stock_status: Stock status (in_stock, out_of_stock, unknown)
reference: Product reference (ASIN for Amazon, SKU for other stores)
images: List of image URLs
category: Product category
specs: Technical specifications (key/value dict)

Debug

debug.method: Method used (http, playwright)
debug.errors: List of errors encountered
debug.notes: Technical notes
debug.status: Fetch status (success, partial, failed)
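
The field list above can be pictured as the following data structure. This is a sketch only: the project's schema.py defines ProductSnapshot as a Pydantic model, whereas this version uses standard-library dataclasses so it runs without dependencies. The exact types, the defaults, and the choice of a list for debug.notes are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DebugInfo:
    method: str = "http"        # "http" or "playwright"
    status: str = "success"     # "success", "partial" or "failed"
    errors: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)  # assumed to be a list


@dataclass
class ProductSnapshot:
    source: str
    url: str
    fetched_at: str             # ISO 8601 timestamp
    title: Optional[str] = None
    price: Optional[float] = None
    currency: str = "EUR"
    shipping_cost: Optional[float] = None
    stock_status: str = "unknown"
    reference: Optional[str] = None
    images: list[str] = field(default_factory=list)
    category: Optional[str] = None
    specs: dict[str, str] = field(default_factory=dict)
    debug: DebugInfo = field(default_factory=DebugInfo)


# A partial snapshot: the page was fetched but no price could be extracted.
snapshot = ProductSnapshot(
    source="amazon",
    url="https://www.amazon.fr/dp/B08N5WRWNW",
    fetched_at=datetime(2026, 1, 14, 12, 0, tzinfo=timezone.utc).isoformat(),
    title="Example product",
    reference="B08N5WRWNW",
    debug=DebugInfo(status="partial", errors=["price not found"]),
)
as_json = json.dumps(asdict(snapshot), indent=2)
print(as_json)
```

A missing price serializes as null in the JSON output, which matches the "float or null" wording in the field list.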

Tests

# Run all tests
pytest

# Tests with coverage
pytest --cov=pricewatch

# Tests for a specific store
pytest tests/stores/amazon/
pytest tests/stores/cdiscount/

# Verbose mode
pytest -v

Store architecture

Each store implements the abstract class BaseStore, which defines:

match(url) -> float: Match score (0.0 to 1.0)
canonicalize(url) -> str: URL normalization
extract_reference(url) -> str: Product reference extraction
fetch(url, method, options) -> str: HTML retrieval
parse(html, url) -> ProductSnapshot: Parsing into the canonical model

Selectors (XPath/CSS) live in a separate selectors.yml file to make maintenance easier.
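
To make the interface concrete, here is a minimal sketch of what such an abstract class and one implementation might look like. Method names follow the list above; everything else is an assumption: ToyAmazonStore is a hypothetical stand-in for the real Amazon store, the scores 0.9 and 0.5 are arbitrary, parse returns a plain dict instead of a ProductSnapshot, and extract_reference returns None when no reference is found.

```python
import re
from abc import ABC, abstractmethod
from typing import Optional
from urllib.parse import urlsplit


class BaseStore(ABC):
    """Interface sketch; the real base.py may differ in types and details."""

    @abstractmethod
    def match(self, url: str) -> float: ...

    @abstractmethod
    def canonicalize(self, url: str) -> str: ...

    @abstractmethod
    def extract_reference(self, url: str) -> Optional[str]: ...

    @abstractmethod
    def fetch(self, url: str, method: str, options: dict) -> str: ...

    @abstractmethod
    def parse(self, html: str, url: str) -> dict: ...


class ToyAmazonStore(BaseStore):
    _ASIN = re.compile(r"/dp/([A-Z0-9]{10})")  # ASIN after /dp/ in the path

    def match(self, url: str) -> float:
        host = urlsplit(url).hostname or ""
        if "amazon." not in host:
            return 0.0
        # Arbitrary scores: product pages rank above other Amazon pages.
        return 0.9 if self.extract_reference(url) else 0.5

    def canonicalize(self, url: str) -> str:
        ref = self.extract_reference(url)
        parts = urlsplit(url)
        return f"{parts.scheme}://{parts.netloc}/dp/{ref}" if ref else url

    def extract_reference(self, url: str) -> Optional[str]:
        found = self._ASIN.search(urlsplit(url).path)
        return found.group(1) if found else None

    def fetch(self, url: str, method: str, options: dict) -> str:
        raise NotImplementedError("network access is out of scope for this sketch")

    def parse(self, html: str, url: str) -> dict:
        # A real parser would apply the selectors from selectors.yml to html.
        return {"source": "amazon", "url": self.canonicalize(url),
                "reference": self.extract_reference(url)}


store = ToyAmazonStore()
print(store.canonicalize("https://www.amazon.fr/dp/B08N5WRWNW?ref=abc"))
```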

Adding a new store

Create pricewatch/app/stores/nouveaustore/
Create store.py with a class that inherits from BaseStore
Create selectors.yml with the XPath/CSS selectors
Add HTML fixtures in fixtures/
Register the store in the Registry
Write the tests in tests/stores/nouveaustore/
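
The registration step can be sketched as follows, assuming the registry keeps a list of stores and picks the one with the highest match score. The Registry API shown here (register, detect) and the DomainStore helper are hypothetical; the real registry.py may expose different names.

```python
from typing import Optional, Protocol
from urllib.parse import urlsplit


class Store(Protocol):
    name: str

    def match(self, url: str) -> float: ...


class DomainStore:
    """Toy store that matches on a fragment of the host name only."""

    def __init__(self, name: str, domain: str) -> None:
        self.name = name
        self.domain = domain

    def match(self, url: str) -> float:
        host = urlsplit(url).hostname or ""
        return 0.9 if self.domain in host else 0.0


class Registry:
    def __init__(self) -> None:
        self._stores: list[Store] = []

    def register(self, store: Store) -> None:
        self._stores.append(store)

    def detect(self, url: str) -> Optional[Store]:
        """Return the best-scoring store, or None when every score is 0."""
        best = max(self._stores, key=lambda s: s.match(url), default=None)
        return best if best is not None and best.match(url) > 0 else None


registry = Registry()
registry.register(DomainStore("amazon", "amazon."))
registry.register(DomainStore("cdiscount", "cdiscount."))
# Registering the new store is one extra call; the domain is a placeholder.
registry.register(DomainStore("nouveaustore", "nouveaustore.example"))

detected = registry.detect("https://www.cdiscount.com/informatique/example.html")
print(detected.name if detected else "unknown")
```

Returning None for an unmatched URL lines up with the unknown value listed for the source field of ProductSnapshot.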

Error handling

The application is designed to stay robust against anti-bot measures:

403 Forbidden: Automatic fallback to Playwright
Captcha detected: Logged in debug.errors, status failed
Timeout: Configurable, logged
Parsing failed: Partial ProductSnapshot with debug.status=partial

No error should fail silently: every error is logged and recorded in the ProductSnapshot.
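
The 403 fallback rule can be illustrated with a small sketch. The two fetchers are injected as plain callables so the example runs without a network or a browser. HttpStatusError, FetchResult and fetch_with_fallback are hypothetical names, and the choice not to retry after a timeout is an assumption of this sketch, not documented behavior.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Fetcher = Callable[[str], str]


class HttpStatusError(Exception):
    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


@dataclass
class FetchResult:
    html: Optional[str] = None
    method: str = "http"
    status: str = "failed"
    errors: list[str] = field(default_factory=list)


def fetch_with_fallback(url: str, http_fetch: Fetcher,
                        pw_fetch: Optional[Fetcher] = None) -> FetchResult:
    """Try HTTP first; on 403 retry with the browser fetcher. Errors are recorded, not raised."""
    result = FetchResult()
    try:
        result.html = http_fetch(url)
        result.status = "success"
        return result
    except HttpStatusError as exc:
        result.errors.append(f"http: {exc}")
        if exc.status != 403 or pw_fetch is None:
            return result
    except TimeoutError as exc:
        result.errors.append(f"http: timeout ({exc})")
        return result  # assumption: no browser retry after a timeout
    result.method = "playwright"
    try:
        result.html = pw_fetch(url)
        result.status = "success"
    except Exception as exc:  # keep the failure visible in the result
        result.errors.append(f"playwright: {exc}")
    return result


def blocked(url: str) -> str:
    raise HttpStatusError(403)  # simulates an anti-bot response


def browser(url: str) -> str:
    return "<html>ok</html>"  # simulates a successful Playwright fetch


result = fetch_with_fallback("https://example.com/product", blocked, browser)
print(result.method, result.status, result.errors)
```

The HTTP error stays in errors even when the fallback succeeds, so the result still shows that the first attempt was blocked.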

Roadmap

Phase 1: CLI (current)

✅ YAML → JSON pipeline
✅ Amazon + Cdiscount support
✅ HTTP + Playwright scraping
✅ pytest tests

Phase 2: Persistence

PostgreSQL database
Alembic migrations
Price history

Phase 3: Automation

Worker (Redis + RQ/Celery)
Daily scheduling
Queue management

Phase 4: Web UI

Responsive web interface
Dark theme (Gruvbox)
Price history charts
Alert management

Phase 5: Alerts

Price-drop notifications
Back-in-stock notifications
Webhooks/email

Development

Rules

Python 3.12 required
Comments and discussions in French
Every technical decision must be justified (1-3 sentences)
No premature optimization
Systematic logging (method, duration, errors)
Tests required for every store

Documentation

README.md: This file
TODO.md: Prioritized task list
CHANGELOG.md: Change log
CLAUDE.md: Guide for Claude Code

License

To be defined

Author

To be defined