codex

2026-01-14 07:03:38 +01:00
parent ecda149a4b
commit c91c0f1fc9
61 changed files with 4388 additions and 38 deletions
--- a/PHASE_2_PROGRESS.md
+++ b/PHASE_2_PROGRESS.md
@@ -0,0 +1,437 @@
+# 🚀 Phase 2 Infrastructure - IN PROGRESS
+
+**Start date**: 2026-01-14
+**Target version**: 0.4.0
+**Goal**: Add PostgreSQL and a Redis/RQ worker for persistence and asynchronous scraping
+
+---
+
+## 📊 Overview
+
+### Phase 2 Goals
+- ✅ Centralized configuration (database, Redis, app)
+- ✅ SQLAlchemy ORM models (5 tables)
+- ✅ Database connection (init_db, get_session)
+- ✅ Alembic migrations
+- ✅ Repository pattern (CRUD)
+- ⏳ RQ worker for asynchronous scraping
+- ⏳ Scheduler for recurring jobs
+- ✅ Extended CLI (DB commands)
+- ✅ Docker Compose (PostgreSQL + Redis)
+- ⏳ Full test suite
+
+---
+
+## ✅ Week 1: Database Foundation (COMPLETE)
+
+### Completed Tasks
+
+#### 1. Centralized Configuration ✅
+**File**: `pricewatch/app/core/config.py` (187 lines)
+
+**Contents**:
+- `DatabaseConfig`: PostgreSQL configuration
+  - Host, port, database, user, password
+  - `url` property: SQLAlchemy connection string
+  - `url_async` property: AsyncPG connection string (for future use)
+  - Env var prefix: `PW_DB_*` (PW_DB_HOST, PW_DB_PORT, etc.)
+
+- `RedisConfig`: Redis configuration for RQ
+  - Host, port, db, password (optional)
+  - `url` property: Redis connection string
+  - Env var prefix: `PW_REDIS_*`
+
+- `AppConfig`: Global application configuration
+  - Debug mode
+  - Worker timeout (300s by default)
+  - Worker concurrency (2 by default)
+  - Feature flags: `enable_db`, `enable_worker`
+  - Playwright defaults: timeout, use_playwright
+  - Nested configs: `db`, `redis`
+  - Env var prefix: `PW_*`
+
+- **Singleton pattern**: `get_config()`, `set_config()`, `reset_config()`
+
+**Rationale**:
+- 12-factor app: configuration via env vars
+- Pydantic validation guarantees a valid config at startup
+- Default values for local development
+- `.env` file support to simplify setup
+- Feature flags allow disabling the DB/worker for tests
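
The env-var convention above can be sketched with the standard library alone. This is a minimal illustration, not the contents of `config.py`: it swaps Pydantic for a plain dataclass, the `from_env` helper is hypothetical, the defaults are borrowed from `.env.example`, and the `postgresql+psycopg2://` scheme is an assumption about how `url` is built.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DatabaseConfig:
    """Dataclass stand-in for the Pydantic-based DatabaseConfig (assumption)."""

    host: str = "localhost"
    port: int = 5432
    database: str = "pricewatch"
    user: str = "pricewatch"
    password: str = "pricewatch"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "DatabaseConfig":
        # Hypothetical helper: read PW_DB_* variables, falling back to the
        # local-development defaults declared on the dataclass.
        defaults = cls()
        return cls(
            host=env.get("PW_DB_HOST", defaults.host),
            port=int(env.get("PW_DB_PORT", defaults.port)),
            database=env.get("PW_DB_DATABASE", defaults.database),
            user=env.get("PW_DB_USER", defaults.user),
            password=env.get("PW_DB_PASSWORD", defaults.password),
        )

    @property
    def url(self) -> str:
        # SQLAlchemy-style connection string for the psycopg2 driver.
        return (
            f"postgresql+psycopg2://{self.user}:{self.password}"
            f"@{self.host}:{self.port}/{self.database}"
        )
```

Passing an explicit mapping, for example `DatabaseConfig.from_env({"PW_DB_HOST": "db"})`, keeps tests independent of the real process environment, which is the same goal the `set_config()` / `reset_config()` helpers serve.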
+
+#### 2. Phase 2 Dependencies ✅
+**File**: `pyproject.toml` (lines 48-60)
+
+**Additions**:
+```toml
+# Database (Phase 2)
+"sqlalchemy>=2.0.0",
+"psycopg2-binary>=2.9.0",
+"alembic>=1.13.0",
+
+# Configuration (Phase 2)
+"python-dotenv>=1.0.0",
+
+# Worker/Queue (Phase 2)
+"redis>=5.0.0",
+"rq>=1.15.0",
+"rq-scheduler>=0.13.0",
+```
+
+#### 3. SQLAlchemy ORM Models ✅
+**File**: `pricewatch/app/db/models.py` (322 lines)
+
+**Tables created**:
+
+1. **`products`** - Product catalog
+   - PK: `id` (Integer, autoincrement)
+   - Natural key: `(source, reference)` - unique constraint
+   - Columns: `url`, `title`, `category`, `currency`
+   - Timestamps: `first_seen_at`, `last_updated_at`
+   - Relationships: `price_history`, `images`, `specs`, `logs`
+   - Indexes: source, reference, last_updated_at
+
+2. **`price_history`** - Price history (time series)
+   - PK: `id` (Integer, autoincrement)
+   - FK: `product_id` → products(id) CASCADE
+   - Unique: `(product_id, fetched_at)` - prevents duplicates
+   - Columns: `price` (Numeric 10,2), `shipping_cost`, `stock_status`
+   - Fetch metadata: `fetch_method`, `fetch_status`, `fetched_at`
+   - Check constraints: stock_status, fetch_method, fetch_status
+   - Indexes: product_id, fetched_at
+
+3. **`product_images`** - Product images
+   - PK: `id` (Integer, autoincrement)
+   - FK: `product_id` → products(id) CASCADE
+   - Unique: `(product_id, image_url)` - prevents duplicates
+   - Columns: `image_url` (Text), `position` (Integer, 0=main)
+   - Index: product_id
+
+4. **`product_specs`** - Product specifications (key-value)
+   - PK: `id` (Integer, autoincrement)
+   - FK: `product_id` → products(id) CASCADE
+   - Unique: `(product_id, spec_key)` - prevents duplicates
+   - Columns: `spec_key` (String 200), `spec_value` (Text)
+   - Indexes: product_id, spec_key
+
+5. **`scraping_logs`** - Observability logs
+   - PK: `id` (Integer, autoincrement)
+   - Optional FK: `product_id` → products(id) SET NULL
+   - Columns: `url`, `source`, `reference`, `fetched_at`
+   - Metrics: `duration_ms`, `html_size_bytes`
+   - Fetch metadata: `fetch_method`, `fetch_status`
+   - Debug data (JSONB): `errors`, `notes`
+   - Indexes: product_id, source, fetched_at, fetch_status
+
+**Schema rationale**:
+- Normalization: products is kept separate from price_history (catalog vs time series)
+- Natural key (source, reference) rather than an arbitrary UUID
+- Separate tables for images/specs: avoids unstructured JSONB
+- JSONB only for variable data: errors and notes in the logs
+- Cascade DELETE: deleting a product deletes its history
+- SET NULL for logs: keeps a trace even if the product is deleted
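
The three schema rules above (natural-key uniqueness, cascade delete, SET NULL for logs) can be tried out in isolation. The sketch below is an assumption-laden stand-in: it uses the standard-library `sqlite3` module instead of PostgreSQL and SQLAlchemy, keeps only a handful of columns, and the `demo` function is purely illustrative.

```python
import sqlite3

# Reduced schema: only the columns needed to show the three constraints.
SCHEMA = """
CREATE TABLE products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT NOT NULL,
    reference TEXT NOT NULL,
    UNIQUE (source, reference)
);
CREATE TABLE price_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER NOT NULL REFERENCES products(id) ON DELETE CASCADE,
    price NUMERIC,
    fetched_at TEXT NOT NULL,
    UNIQUE (product_id, fetched_at)
);
CREATE TABLE scraping_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER REFERENCES products(id) ON DELETE SET NULL,
    url TEXT NOT NULL
);
"""


def demo():
    """Return (duplicate_rejected, remaining_history_rows, log_product_id)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on opt-in
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO products (source, reference) VALUES ('amazon', 'B08N5WRWNW')"
    )
    conn.execute(
        "INSERT INTO price_history (product_id, price, fetched_at) "
        "VALUES (1, 19.99, '2026-01-14T07:00:00')"
    )
    conn.execute(
        "INSERT INTO scraping_logs (product_id, url) "
        "VALUES (1, 'https://www.amazon.fr/dp/B08N5WRWNW')"
    )
    try:
        # Same (source, reference) pair: the natural key must reject it.
        conn.execute(
            "INSERT INTO products (source, reference) VALUES ('amazon', 'B08N5WRWNW')"
        )
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True
    # Deleting the product cascades to its history and detaches its logs.
    conn.execute("DELETE FROM products WHERE id = 1")
    history_rows = conn.execute("SELECT COUNT(*) FROM price_history").fetchone()[0]
    log_product_id = conn.execute("SELECT product_id FROM scraping_logs").fetchone()[0]
    conn.close()
    return duplicate_rejected, history_rows, log_product_id
```

After the delete, the history row is gone while the log row survives with a NULL `product_id`, which is the behavior the rationale describes.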
+
+---
+
+### Completed Tasks (continued)
+
+#### 4. Database Connection ✅
+**File**: `pricewatch/app/db/connection.py`
+
+**Contents**:
+- `get_engine(config)`: SQLAlchemy engine (pooling)
+- `get_session_factory(config)`: Session factory
+- `get_session(config)`: Context manager
+- `init_db(config)`: Create tables
+- `check_db_connection(config)`: Health check
+- `reset_engine()`: Reset for tests
+
+**Rationale**:
+- Singleton engine to avoid multiple pools
+- `pool_pre_ping` for robustness against stale connections
+- Context manager for automatic rollback/close
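
The commit/rollback/close contract of a context-managed session can be sketched as follows. This is not the code in `connection.py`: it assumes a plain `sqlite3` connection in place of a SQLAlchemy session, takes a file path rather than a config object, and the `demo` function exists only to exercise both paths.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def get_session(db_path: str) -> Iterator[sqlite3.Connection]:
    """Yield a connection; commit on success, roll back on error, always close."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


def demo() -> int:
    """Return the row count after one committed and one failed transaction."""
    db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
    with get_session(db_path) as session:
        session.execute("CREATE TABLE products (reference TEXT)")
        session.execute("INSERT INTO products VALUES ('B08N5WRWNW')")
    try:
        with get_session(db_path) as session:
            session.execute("INSERT INTO products VALUES ('WILL-BE-ROLLED-BACK')")
            raise RuntimeError("simulated scraping failure")
    except RuntimeError:
        pass  # the insert above was rolled back before the error propagated
    with get_session(db_path) as session:
        return session.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

Only the first insert survives, so callers never have to remember to roll back or close by hand.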
+
+---
+
+#### 5. Alembic Setup ✅
+**Files**:
+- `alembic.ini`
+- `pricewatch/app/db/migrations/env.py`
+- `pricewatch/app/db/migrations/script.py.mako`
+
+**Rationale**:
+- DB URL injected from `AppConfig`
+- `compare_type=True` so that autogenerated migrations detect column type changes
+#### 6. Initial Migration ✅
+**File**: `pricewatch/app/db/migrations/versions/20260114_01_initial_schema.py`
+
+**Contents**:
+- 5 tables + indexes + constraints
+- JSONB for `errors` and `notes`
+
+#### 7. Database CLI Commands ✅
+**File**: `pricewatch/app/cli/main.py`
+
+**Commands**:
+```bash
+pricewatch init-db              # Create tables
+pricewatch migrate "message"    # Generate an Alembic migration
+pricewatch upgrade              # Apply migrations
+pricewatch downgrade            # Roll back a migration
+```
+
+#### 8. Docker Compose ✅
+**File**: `docker-compose.yml`
+
+**Services**:
+- PostgreSQL 16 (port 5432)
+- Redis 7 (port 6379)
+- Volumes for persistence
+
+#### 9. Example .env File ✅
+**File**: `.env.example`
+
+**Variables**:
+```bash
+# Database
+PW_DB_HOST=localhost
+PW_DB_PORT=5432
+PW_DB_DATABASE=pricewatch
+PW_DB_USER=pricewatch
+PW_DB_PASSWORD=pricewatch
+
+# Redis
+PW_REDIS_HOST=localhost
+PW_REDIS_PORT=6379
+PW_REDIS_DB=0
+
+# App
+PW_DEBUG=false
+PW_WORKER_TIMEOUT=300
+PW_WORKER_CONCURRENCY=2
+PW_ENABLE_DB=true
+PW_ENABLE_WORKER=true
+```
+
+#### 10. Database Tests ✅
+**Files**:
+- `tests/db/test_models.py`: SQLAlchemy model tests
+- `tests/db/test_connection.py`: Connection and session tests
+
+**Test strategy**:
+- In-memory SQLite for unit tests
+- pytest fixtures for setup/teardown
+- Tests for relationships, constraints, indexes
+
+---
+
+## 📦 Week 2: Repository & Pipeline (COMPLETE)
+
+### Planned Tasks
+
+#### Repository Pattern
+**File**: `pricewatch/app/db/repository.py`
+
+**Class**: `ProductRepository`
+- `get_or_create(source, reference)`: Find or create a product
+- `save_snapshot(snapshot)`: Persist a ProductSnapshot to the DB
+- `update_product_metadata(product, snapshot)`: Update title, url, etc.
+- `add_price_history(product, snapshot)`: Add a price entry
+- `sync_images(product, images)`: Sync images (add new, keep existing)
+- `sync_specs(product, specs)`: Sync specs (upsert)
+- `add_scraping_log(snapshot, product_id)`: Log the scrape
+
+**Status**: ✅ Done
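
A find-or-create lookup on the `(source, reference)` natural key might look like the sketch below. It is a simplified assumption, not the `ProductRepository` implementation: it works on a raw `sqlite3` connection, the `make_db` helper is hypothetical, and it returns an `(id, created)` tuple for illustration.

```python
import sqlite3
from typing import Tuple


def make_db() -> sqlite3.Connection:
    """Hypothetical helper: in-memory table with the natural-key constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " source TEXT NOT NULL,"
        " reference TEXT NOT NULL,"
        " UNIQUE (source, reference))"
    )
    return conn


def get_or_create(
    conn: sqlite3.Connection, source: str, reference: str
) -> Tuple[int, bool]:
    """Return (product_id, created) for the given natural key."""
    row = conn.execute(
        "SELECT id FROM products WHERE source = ? AND reference = ?",
        (source, reference),
    ).fetchone()
    if row is not None:
        return row[0], False
    cur = conn.execute(
        "INSERT INTO products (source, reference) VALUES (?, ?)",
        (source, reference),
    )
    return cur.lastrowid, True
```

The select-then-insert sequence leaves a small race window between concurrent workers; in that case the unique constraint is the backstop, and a caller would need to catch the integrity error and re-run the lookup.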
+
+#### Scraping Pipeline
+**File**: `pricewatch/app/scraping/pipeline.py`
+
+**Class**: `ScrapingPipeline`
+- `process_snapshot(snapshot, save_to_db)`: Orchestration
+- Non-blocking: a DB failure does not crash the pipeline
+- Returns: `product_id` or `None`
+
+**Status**: ✅ Done
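
The non-blocking behavior can be sketched as a try/except around the persistence call. The constructor below, which takes a `save_snapshot` callable, is an assumption made to keep the example self-contained; the real class presumably talks to the repository directly.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("pricewatch.pipeline")


class ScrapingPipeline:
    """Sketch: persistence is best-effort and never raises to the caller."""

    def __init__(self, save_snapshot: Callable[[Any], int]) -> None:
        # Hypothetical injection point standing in for the repository.
        self._save_snapshot = save_snapshot

    def process_snapshot(self, snapshot: Any, save_to_db: bool = True) -> Optional[int]:
        if not save_to_db:
            return None
        try:
            return self._save_snapshot(snapshot)
        except Exception:
            # Log and carry on: the JSON output path must still succeed.
            logger.exception("DB persistence failed; continuing without it")
            return None
```

A `None` return therefore means either that saving was skipped or that it failed; callers that need to tell the two apart would have to rely on the log.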
+
+#### CLI Modification
+**File**: `pricewatch/app/cli/main.py`
+
+**Changes to the `run` command**:
+- Add the `--save-db / --no-db` flag
+- Integrate `ScrapingPipeline` when `save_db=True`
+- Backward compatibility: the JSON output is always written
+
+**Status**: ✅ Done
+
+#### Repository + Pipeline Tests ✅
+**Files**:
+- `tests/db/test_repository.py`
+- `tests/scraping/test_pipeline.py`
+
+**Status**: ✅ Done
+
+#### End-to-End CLI + DB Tests ✅
+**File**:
+- `tests/cli/test_run_db.py`
+
+**Status**: ✅ Done
+
+---
+
+## 📦 Week 3: Worker Infrastructure (IN PROGRESS)
+
+### Planned Tasks
+
+#### RQ Task
+**File**: `pricewatch/app/tasks/scrape.py`
+
+**Function**: `scrape_product(url, use_playwright=True)`
+- Reuses 100% of the Phase 1 code (detect → fetch → parse)
+- Saves to the DB via ScrapingPipeline
+- Returns: `{success, product_id, snapshot, error}`
+
+**Status**: ✅ Done
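
The result shape returned by the task can be illustrated with a small wrapper. Everything beyond the four keys is an assumption: the `scrape` and `persist` callables are hypothetical injection points standing in for the Phase 1 code and `ScrapingPipeline`, and the real function takes only `url` and `use_playwright`.

```python
from typing import Callable, Optional, TypedDict


class ScrapeResult(TypedDict):
    success: bool
    product_id: Optional[int]
    snapshot: Optional[dict]
    error: Optional[str]


def scrape_product(
    url: str,
    use_playwright: bool = True,
    *,
    scrape: Callable[[str, bool], dict],
    persist: Callable[[dict], Optional[int]],
) -> ScrapeResult:
    """Run scrape + persist and always return a plain, picklable dict."""
    try:
        snapshot = scrape(url, use_playwright)
        product_id = persist(snapshot)
        return {
            "success": True,
            "product_id": product_id,
            "snapshot": snapshot,
            "error": None,
        }
    except Exception as exc:
        # Report the failure in the result instead of raising.
        return {
            "success": False,
            "product_id": None,
            "snapshot": None,
            "error": str(exc),
        }
```

Returning a plain dict rather than a custom object keeps the job result easy to serialize, which matters because RQ pickles return values by default.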
+
+#### Scheduler
+**File**: `pricewatch/app/tasks/scheduler.py`
+
+**Class**: `ScrapingScheduler`
+- `schedule_product(url, interval_hours=24)`: Recurring job
+- `enqueue_immediate(url)`: One-off job
+- Built on `rq-scheduler`
+
+**Status**: ✅ Done
+
+#### CLI Worker
+**New commands**:
+```bash
+pricewatch worker               # Start the RQ worker
+pricewatch enqueue <url>        # Enqueue an immediate scrape
+pricewatch schedule <url> --interval 24  # Daily scrape
+```
+
+**Status**: ✅ Done
+
+---
+
+## 📦 Week 4: Tests & Documentation (NOT STARTED)
+
+### Planned Tasks
+
+#### Tests
+- End-to-end tests (CLI → DB → Worker)
+- Error tests (DB down, Redis down)
+- Backward compatibility tests (`--no-db`)
+- Performance tests (100+ products)
+
+#### Documentation
+- Update README.md (Phase 2 setup)
+- Update CHANGELOG.md
+- Migration guide (JSON → DB)
+
+---
+
+## 📈 Progress Metrics
+
+| Category | Completed | Total | % |
+|----------|-----------|-------|---|
+| **Week 1** | 10 | 10 | 100% |
+| **Week 2** | 5 | 5 | 100% |
+| **Week 3** | 3 | 6 | 50% |
+| **Week 4** | 0 | 7 | 0% |
+| **TOTAL Phase 2** | 18 | 28 | **64%** |
+
+---
+
+## 🎯 Immediate Next Step
+
+**Immediate next step**
+- End-to-end worker + DB tests
+- Error handling when Redis is down (CLI + worker)
+
+**Afterwards (planned)**
+- Observability logs for scheduled jobs
+
+---
+
+## 🔧 Verification
+
+### Week 1 Verification (target)
+```bash
+# Set up the infrastructure
+docker-compose up -d
+pricewatch init-db
+
+# Check that the tables were created
+psql -h localhost -U pricewatch pricewatch \
+  -c '\dt'
+# → 5 tables: products, price_history, product_images, product_specs, scraping_logs
+```
+
+### Week 2 Verification (target)
+```bash
+# Test the pipeline with the DB
+pricewatch run --yaml scrap_url.yaml --save-db
+
+# Check the data in the DB
+psql -h localhost -U pricewatch pricewatch \
+  -c "SELECT * FROM products LIMIT 5;" \
+  -c "SELECT * FROM price_history ORDER BY fetched_at DESC LIMIT 10;"
+```
+
+### Week 3 Verification (target)
+```bash
+# Enqueue a job
+pricewatch enqueue "https://www.amazon.fr/dp/B08N5WRWNW"
+
+# Start the worker (runs in the foreground; use a second terminal for the check below)
+pricewatch worker
+
+# Check that the job was processed
+psql -h localhost -U pricewatch pricewatch \
+  -c "SELECT * FROM scraping_logs ORDER BY fetched_at DESC LIMIT 5;"
+```
+
+---
+
+## 📝 Important Notes
+
+### Backward Compatibility
+- ✅ Phase 1 CLI works unchanged
+- ✅ Identical JSON format
+- ✅ Database is optional (`--no-db` flag)
+- ✅ ProductSnapshot unchanged
+- ✅ Phase 1 tests still pass (295 tests)
+
+### Architecture Decisions
+
+**Normalization vs Performance**:
+- Choice: strict normalization (5 tables)
+- Rationale: the catalog rarely changes, while prices change daily
+- Rejected alternative: everything in products + JSONB (harder to query)
+
+**Natural Key vs UUID**:
+- Choice: `(source, reference)` as a unique constraint
+- Rationale: Amazon ASINs are already globally unique
+- Rejected alternative: artificial UUID (complicates deduplication)
+
+**Synchronous vs Asynchronous**:
+- Choice: synchronous RQ (no async/await)
+- Rationale: Phase 1 code is 100% reusable, and the design stays simple
+- Rejected alternative: asyncio + asyncpg (massive refactoring)
+
+---
+
+**Last updated**: 2026-01-14
+
+### Local Validation (Week 1)
+```bash
+docker compose up -d
+./venv/bin/alembic -c alembic.ini upgrade head
+psql -h localhost -U pricewatch pricewatch \
+  -c '\dt'
+```
+
+**Result**: 6 tables visible (products, price_history, product_images, product_specs, scraping_logs, alembic_version).
+**Status**: ✅ Week 1 complete (100%)