scrap/PHASE_2_PROGRESS.md

# 🚀 Phase 2 Infrastructure - IN PROGRESS

**Start date**: 2026-01-14
**Target version**: 0.4.0
**Goal**: Add PostgreSQL and a Redis/RQ worker for persistence and asynchronous scraping

---

## 📊 Overview

### Recent updates
- Alembic migration fixed (`down_revision` now points to 20260114_02)
- Amazon image extraction improved (`data-a-dynamic-image` + logo filter)
- New validation scrape (Amazon URL for the ASUS A16)

### Next actions
- Verify that images, description, specs, MSRP, and discount are displayed in the Web UI
- Confirm that the add-product popup shows all the data from the preview

### Phase 2 goals
- ✅ Centralized configuration (database, Redis, app)
- ✅ SQLAlchemy ORM models (5 tables)
- ✅ Database connection (init_db, get_session)
- ✅ Alembic migrations
- ✅ Repository pattern (CRUD)
- ✅ RQ worker for asynchronous scraping
- ✅ Scheduler for recurring jobs
- ✅ Extended CLI (DB + worker commands)
- ✅ Docker Compose (PostgreSQL + Redis)
- ✅ Redis error handling
- ✅ Job observability logs
- ✅ End-to-end tests (Week 4)

---

## ✅ Week 1: Database Foundation (DONE)

### Completed tasks

#### 1. Centralized Configuration ✅
**File**: `pricewatch/app/core/config.py` (187 lines)

**Contents**:
- `DatabaseConfig`: PostgreSQL configuration
  - Host, port, database, user, password
  - `url` property: SQLAlchemy connection string
  - `url_async` property: AsyncPG connection string (future use)
  - Env var prefix: `PW_DB_*` (PW_DB_HOST, PW_DB_PORT, etc.)

- `RedisConfig`: Redis configuration for RQ
  - Host, port, db, password (optional)
  - `url` property: Redis connection string
  - Env var prefix: `PW_REDIS_*`

- `AppConfig`: Global application configuration
  - Debug mode
  - Worker timeout (300s by default)
  - Worker concurrency (2 by default)
  - Feature flags: `enable_db`, `enable_worker`
  - Playwright defaults: timeout, use_playwright
  - Nested configs: `db`, `redis`
  - Env var prefix: `PW_*`

- **Singleton pattern**: `get_config()`, `set_config()`, `reset_config()`

**Rationale**:
- 12-factor app: configuration via env vars
- Pydantic validation ensures the config is valid at startup
- Default values for local development
- `.env` file support to simplify setup
- Feature flags make it possible to disable the DB/worker for tests
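
The env-var-driven configuration and singleton accessors described above could look roughly like the sketch below. It is a stdlib-only stand-in: the real module is said to use Pydantic, and the `from_env` helper, the exact URL scheme, and the reduced field set are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from urllib.parse import quote_plus


@dataclass(frozen=True)
class DatabaseConfig:
    """Simplified stand-in for the Pydantic-based DatabaseConfig."""

    host: str = "localhost"
    port: int = 5432
    database: str = "pricewatch"
    user: str = "pricewatch"
    password: str = "pricewatch"

    @classmethod
    def from_env(cls, env=None):
        # Hypothetical helper: read PW_DB_* variables, falling back to
        # local-development defaults when a variable is missing.
        env = os.environ if env is None else env
        return cls(
            host=env.get("PW_DB_HOST", "localhost"),
            port=int(env.get("PW_DB_PORT", "5432")),
            database=env.get("PW_DB_DATABASE", "pricewatch"),
            user=env.get("PW_DB_USER", "pricewatch"),
            password=env.get("PW_DB_PASSWORD", "pricewatch"),
        )

    @property
    def url(self):
        # Assumed URL form for SQLAlchemy with the psycopg2 driver.
        return (
            f"postgresql+psycopg2://{self.user}:{quote_plus(self.password)}"
            f"@{self.host}:{self.port}/{self.database}"
        )


_config = None  # module-level singleton


def get_config():
    """Build the config on first access, then reuse the same instance."""
    global _config
    if _config is None:
        _config = DatabaseConfig.from_env()
    return _config


def reset_config():
    """Drop the cached instance so tests can start from a clean state."""
    global _config
    _config = None
```

Passing an explicit mapping to `from_env` keeps tests independent of the process environment, which is the same motivation the document gives for `reset_config()`.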

#### 2. Phase 2 Dependencies ✅
**File**: `pyproject.toml` (lines 48-60)

**Additions**:
```toml
# Database (Phase 2)
"sqlalchemy>=2.0.0",
"psycopg2-binary>=2.9.0",
"alembic>=1.13.0",

# Configuration (Phase 2)
"python-dotenv>=1.0.0",

# Worker/Queue (Phase 2)
"redis>=5.0.0",
"rq>=1.15.0",
"rq-scheduler>=0.13.0",
```

#### 3. SQLAlchemy ORM Models ✅
**File**: `pricewatch/app/db/models.py` (322 lines)

**Tables created**:

1. **`products`** - Product catalog
   - PK: `id` (Integer, autoincrement)
   - Natural key: `(source, reference)` - Unique constraint
   - Columns: `url`, `title`, `category`, `currency`
   - Timestamps: `first_seen_at`, `last_updated_at`
   - Relationships: `price_history`, `images`, `specs`, `logs`
   - Indexes: source, reference, last_updated_at

2. **`price_history`** - Price history (time series)
   - PK: `id` (Integer, autoincrement)
   - FK: `product_id` → products(id) CASCADE
   - Unique: `(product_id, fetched_at)` - Prevents duplicates
   - Columns: `price` (Numeric 10,2), `shipping_cost`, `stock_status`
   - Fetch metadata: `fetch_method`, `fetch_status`, `fetched_at`
   - Check constraints: stock_status, fetch_method, fetch_status
   - Indexes: product_id, fetched_at

3. **`product_images`** - Product images
   - PK: `id` (Integer, autoincrement)
   - FK: `product_id` → products(id) CASCADE
   - Unique: `(product_id, image_url)` - Prevents duplicates
   - Columns: `image_url` (Text), `position` (Integer, 0=main)
   - Index: product_id

4. **`product_specs`** - Product specifications (key-value)
   - PK: `id` (Integer, autoincrement)
   - FK: `product_id` → products(id) CASCADE
   - Unique: `(product_id, spec_key)` - Prevents duplicates
   - Columns: `spec_key` (String 200), `spec_value` (Text)
   - Indexes: product_id, spec_key

5. **`scraping_logs`** - Observability logs
   - PK: `id` (Integer, autoincrement)
   - Optional FK: `product_id` → products(id) SET NULL
   - Columns: `url`, `source`, `reference`, `fetched_at`
   - Metrics: `duration_ms`, `html_size_bytes`
   - Fetch metadata: `fetch_method`, `fetch_status`
   - Debug data (JSONB): `errors`, `notes`
   - Indexes: product_id, source, fetched_at, fetch_status

**Schema rationale**:
- Normalization: `products` is kept separate from `price_history` (catalog vs. time series)
- Natural key `(source, reference)` rather than an arbitrary UUID
- Separate tables for images/specs: avoids unstructured JSONB
- JSONB only for variable data: `errors` and `notes` in the logs
- Cascade DELETE: deleting a product deletes its history
- SET NULL for logs: keeps a trace even if the product is deleted
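
To make the constraint choices above concrete, here is a minimal sketch that reproduces three of them (natural-key uniqueness, CASCADE on `price_history`, SET NULL on `scraping_logs`) using the standard-library `sqlite3` module. It is not the project's SQLAlchemy code; the tables are trimmed to a few columns and the `demo` function is hypothetical.

```python
import sqlite3

# Trimmed-down version of three of the five tables (illustrative only).
SCHEMA = """
CREATE TABLE products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT NOT NULL,
    reference TEXT NOT NULL,
    UNIQUE (source, reference)
);
CREATE TABLE price_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER NOT NULL REFERENCES products(id) ON DELETE CASCADE,
    price NUMERIC,
    fetched_at TEXT NOT NULL,
    UNIQUE (product_id, fetched_at)
);
CREATE TABLE scraping_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER REFERENCES products(id) ON DELETE SET NULL,
    url TEXT NOT NULL
);
"""


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
    conn.executescript(SCHEMA)

    cur = conn.execute(
        "INSERT INTO products (source, reference) VALUES (?, ?)",
        ("amazon", "B08N5WRWNW"),
    )
    product_id = cur.lastrowid
    conn.execute(
        "INSERT INTO price_history (product_id, price, fetched_at) VALUES (?, ?, ?)",
        (product_id, 999.99, "2026-01-14T10:00:00"),
    )
    conn.execute(
        "INSERT INTO scraping_logs (product_id, url) VALUES (?, ?)",
        (product_id, "https://www.amazon.fr/dp/B08N5WRWNW"),
    )

    # The natural key rejects a second row for the same (source, reference).
    try:
        conn.execute(
            "INSERT INTO products (source, reference) VALUES (?, ?)",
            ("amazon", "B08N5WRWNW"),
        )
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True

    # Deleting the product removes its history but keeps the log row.
    conn.execute("DELETE FROM products WHERE id = ?", (product_id,))
    history_rows = conn.execute("SELECT COUNT(*) FROM price_history").fetchone()[0]
    log_product_id = conn.execute("SELECT product_id FROM scraping_logs").fetchone()[0]
    conn.close()
    return duplicate_rejected, history_rows, log_product_id
```

Calling `demo()` returns `(True, 0, None)`: the duplicate is rejected, the price history is gone, and the log row survives with a null `product_id`.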

---

### Completed tasks (continued)

#### 4. Database Connection ✅
**File**: `pricewatch/app/db/connection.py`

**Contents**:
- `get_engine(config)`: SQLAlchemy engine (pooling)
- `get_session_factory(config)`: Session factory
- `get_session(config)`: Context manager
- `init_db(config)`: Table creation
- `check_db_connection(config)`: Health check
- `reset_engine()`: Reset for tests

**Rationale**:
- Singleton engine to avoid multiple connection pools
- `pool_pre_ping` for robustness against stale connections
- Context manager for automatic rollback/close
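
The rollback/close behaviour of a session context manager can be sketched as follows. This is an illustration, not the project's `get_session`: the name `session_scope`, the automatic commit on success, and the `FakeSession` recorder are assumptions introduced so the control flow can be observed without a database.

```python
from contextlib import contextmanager


@contextmanager
def session_scope(session_factory):
    """Yield a session; roll back on error and always close it."""
    session = session_factory()
    try:
        yield session
        # Assumption: commit on success. The real get_session may leave
        # commits to the caller.
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


class FakeSession:
    """Records calls instead of talking to a database."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def close(self):
        self.calls.append("close")


def run_failing_unit_of_work():
    """Simulate an error inside the block and return the recorded calls."""
    session = FakeSession()
    try:
        with session_scope(lambda: session):
            raise ValueError("simulated failure inside the unit of work")
    except ValueError:
        pass
    return session.calls
```

With a failing block the recorded calls are `["rollback", "close"]`; with a successful one they are `["commit", "close"]`, so the session is closed in both cases.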

---

#### 5. Alembic Setup ✅
**Files**:
- `alembic.ini`
- `pricewatch/app/db/migrations/env.py`
- `pricewatch/app/db/migrations/script.py.mako`

**Rationale**:
- DB URL injected from `AppConfig`
- `compare_type=True` so that column type changes are detected in migrations

#### 6. Initial Migration ✅
**File**: `pricewatch/app/db/migrations/versions/20260114_01_initial_schema.py`

**Contents**:
- 5 tables + indexes + constraints
- JSONB for `errors` and `notes`
#### 7. Database CLI Commands ✅
**File**: `pricewatch/app/cli/main.py`

**Commands**:
```bash
pricewatch init-db              # Create tables
pricewatch migrate "message"    # Generate an Alembic migration
pricewatch upgrade              # Apply migrations
pricewatch downgrade            # Roll back a migration
```

#### 8. Docker Compose ✅
**File**: `docker-compose.yml`

**Services**:
- PostgreSQL 16 (port 5432)
- Redis 7 (port 6379)
- Volumes for persistence

#### 9. Example .env File ✅
**File**: `.env.example`

**Variables**:
```bash
# Database
PW_DB_HOST=localhost
PW_DB_PORT=5432
PW_DB_DATABASE=pricewatch
PW_DB_USER=pricewatch
PW_DB_PASSWORD=pricewatch

# Redis
PW_REDIS_HOST=localhost
PW_REDIS_PORT=6379
PW_REDIS_DB=0

# App
PW_DEBUG=false
PW_WORKER_TIMEOUT=300
PW_WORKER_CONCURRENCY=2
PW_ENABLE_DB=true
PW_ENABLE_WORKER=true
```

#### 10. Database Tests ✅
**Files**:
- `tests/db/test_models.py`: SQLAlchemy model tests
- `tests/db/test_connection.py`: Connection and session tests

**Test strategy**:
- In-memory SQLite for unit tests
- pytest fixtures for setup/teardown
- Tests for relationships, constraints, and indexes

---

## 📦 Week 2: Repository & Pipeline (DONE)

### Planned tasks

#### Repository Pattern
**File**: `pricewatch/app/db/repository.py`

**Class**: `ProductRepository`
- `get_or_create(source, reference)`: Find or create a product
- `save_snapshot(snapshot)`: Persist a ProductSnapshot to the DB
- `update_product_metadata(product, snapshot)`: Update title, url, etc.
- `add_price_history(product, snapshot)`: Add a price entry
- `sync_images(product, images)`: Sync images (add new, keep existing)
- `sync_specs(product, specs)`: Sync specs (upsert)
- `add_scraping_log(snapshot, product_id)`: Log the scrape

**Status**: ✅ Done
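
As an illustration of the find-or-create step on the natural key, here is a small sketch using `sqlite3` rather than the project's SQLAlchemy session. The function signature, the `(id, created)` return value, and the `make_db` helper are assumptions; the real `ProductRepository.get_or_create` may differ.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a minimal products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " source TEXT NOT NULL,"
        " reference TEXT NOT NULL,"
        " UNIQUE (source, reference))"
    )
    return conn


def get_or_create(conn, source, reference):
    """Return (product_id, created) for the natural key (source, reference)."""
    row = conn.execute(
        "SELECT id FROM products WHERE source = ? AND reference = ?",
        (source, reference),
    ).fetchone()
    if row is not None:
        return row[0], False
    cur = conn.execute(
        "INSERT INTO products (source, reference) VALUES (?, ?)",
        (source, reference),
    )
    return cur.lastrowid, True
```

Calling it twice with the same `(source, reference)` pair returns the same id, with `created` set to `True` only the first time. The unique constraint remains the final safeguard if two writers race between the SELECT and the INSERT.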

#### Scraping Pipeline
**File**: `pricewatch/app/scraping/pipeline.py`

**Class**: `ScrapingPipeline`
- `process_snapshot(snapshot, save_to_db)`: Orchestration
- Non-blocking: a DB failure does not crash the pipeline
- Returns: `product_id` or `None`

**Status**: ✅ Done
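
The non-blocking behaviour can be sketched as a guarded call around the persistence step. The class name `ScrapingPipelineSketch` and the injected `save_snapshot` callable are hypothetical; they stand in for the real pipeline and repository.

```python
import logging

logger = logging.getLogger("pricewatch.pipeline")


class ScrapingPipelineSketch:
    """Illustrative pipeline: persistence errors are logged, not raised."""

    def __init__(self, save_snapshot):
        # save_snapshot: callable that persists a snapshot and returns a
        # product id (a stand-in for the repository).
        self._save_snapshot = save_snapshot

    def process_snapshot(self, snapshot, save_to_db=True):
        if not save_to_db:
            return None
        try:
            return self._save_snapshot(snapshot)
        except Exception:
            # A DB failure must not abort the scrape: log it and carry on.
            logger.exception("DB persistence failed; continuing without it")
            return None
```

Returning `None` on failure lets the caller still write the JSON output, which matches the backward-compatibility goal stated for the `run` command.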

#### CLI Changes
**File**: `pricewatch/app/cli/main.py`

**Changes to the `run` command**:
- Add the `--save-db / --no-db` flag
- Integrate `ScrapingPipeline` when `save_db=True`
- Backward compatibility: the JSON output is still always created

**Status**: ✅ Done

#### Repository + Pipeline Tests ✅
**Files**:
- `tests/db/test_repository.py`
- `tests/scraping/test_pipeline.py`

**Status**: ✅ Done

#### End-to-end CLI + DB Tests ✅
**File**:
- `tests/cli/test_run_db.py`

**Status**: ✅ Done

---

## 📦 Week 3: Worker Infrastructure (DONE)

### Planned tasks

#### RQ Task
**File**: `pricewatch/app/tasks/scrape.py`

**Function**: `scrape_product(url, use_playwright=True)`
- Reuses 100% of the Phase 1 code (detect → fetch → parse)
- Saves to the DB via ScrapingPipeline
- Returns: `{success, product_id, snapshot, error}`

**Status**: ✅ Done

#### Scheduler
**File**: `pricewatch/app/tasks/scheduler.py`

**Class**: `ScrapingScheduler`
- `schedule_product(url, interval_hours=24)`: Recurring job
- `enqueue_immediate(url)`: One-off job
- Based on `rq-scheduler`

**Status**: ✅ Done

#### Worker CLI
**New commands**:
```bash
pricewatch worker               # Start the RQ worker
pricewatch enqueue <url>        # Enqueue an immediate scrape
pricewatch schedule <url> --interval 24  # Daily scrape
```

**Status**: ✅ Done

#### Worker + Scheduler Tests ✅
**Files**:
- `tests/tasks/test_scrape_task.py`
- `tests/tasks/test_scheduler.py`

**Status**: ✅ Done

#### Redis Error Handling ✅
**Modified files**:
- `pricewatch/app/tasks/scheduler.py`:
  - Added the `RedisUnavailableError` exception
  - Added the `check_redis_connection()` helper
  - Lazy connection with a verification ping
- `pricewatch/app/cli/main.py`:
  - The `worker`, `enqueue`, and `schedule` commands handle Redis being down
  - Clear error messages with instructions

**Tests added** (7 tests, including):
- `test_scheduler_redis_connection_error`
- `test_scheduler_lazy_connection`
- `test_check_redis_connection_success`
- `test_check_redis_connection_failure`
- `test_scheduler_schedule_redis_error`

**Status**: ✅ Done
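
The lazy connection with a verification ping might be structured along these lines. This sketch avoids importing `redis` so it runs anywhere: the `connect` factory, the `LazyRedis` class, and the helper's signature are assumptions, and real code would more likely catch `redis.exceptions.ConnectionError` than a bare `Exception`.

```python
class RedisUnavailableError(Exception):
    """Raised when Redis cannot be reached."""


class LazyRedis:
    """Connect on first use and verify the connection with a ping."""

    def __init__(self, connect):
        # connect: zero-argument callable returning a client with .ping()
        # (hypothetical; in practice something like redis.Redis.from_url).
        self._connect = connect
        self._client = None

    @property
    def client(self):
        if self._client is None:
            try:
                client = self._connect()
                client.ping()
            except Exception as exc:
                raise RedisUnavailableError(f"Redis unavailable: {exc}") from exc
            self._client = client
        return self._client


def check_redis_connection(connect):
    """Return True if a client can be created and answers a ping."""
    try:
        connect().ping()
        return True
    except Exception:
        return False
```

Because nothing connects until `client` is first read, a scheduler object can be constructed while Redis is down, and the CLI can catch `RedisUnavailableError` at the point of use to print a clear message.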

#### Job Observability Logs ✅
**Modified file**: `pricewatch/app/tasks/scrape.py`

**Logs added**:
- `[JOB START]` - Start of the job, with the URL
- `[STORE]` - Detected store
- `[FETCH]` - HTTP/Playwright fetch result (duration, size)
- `[PARSE]` - Parsing result (title, price)
- `[JOB OK]` / `[JOB FAILED]` - Final result with total duration

**Note**: The logs are also persisted to the DB via `ScrapingLog` (already implemented).

**Status**: ✅ Done
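
A job wrapper that emits these tags could look like the sketch below. The `run_job` function, its injected `fetch` and `parse` callables, and the log field names are hypothetical; the `[STORE]` step and the DB save are left out to keep the example short.

```python
import logging
import time

logger = logging.getLogger("pricewatch.tasks.scrape")


def run_job(url, fetch, parse):
    """Run fetch -> parse with tagged logs; fetch and parse are injected callables."""
    started = time.perf_counter()
    logger.info("[JOB START] url=%s", url)
    try:
        fetch_started = time.perf_counter()
        html = fetch(url)
        fetch_ms = int((time.perf_counter() - fetch_started) * 1000)
        size_bytes = len(html.encode("utf-8"))
        logger.info("[FETCH] duration_ms=%d size_bytes=%d", fetch_ms, size_bytes)

        data = parse(html)
        logger.info("[PARSE] title=%r price=%r", data.get("title"), data.get("price"))
    except Exception as exc:
        total_ms = int((time.perf_counter() - started) * 1000)
        logger.error("[JOB FAILED] url=%s duration_ms=%d error=%s", url, total_ms, exc)
        return {"success": False, "snapshot": None, "error": str(exc)}

    total_ms = int((time.perf_counter() - started) * 1000)
    logger.info("[JOB OK] url=%s duration_ms=%d", url, total_ms)
    return {"success": True, "snapshot": data, "error": None}
```

Fixed tags at the start of each message make the worker output easy to filter with `grep`, and the failure path returns a result dict instead of raising so the caller can record the outcome.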

---

## 📦 Week 4: Tests & Documentation (IN PROGRESS)

### Planned tasks

#### Tests
- ✅ End-to-end tests (CLI → DB → Worker)
- ✅ Error tests (DB down, Redis down)
- ✅ Backward compatibility tests (`--no-db`)
- ✅ Performance tests (100+ products)

**Test files added**:
- `tests/cli/test_worker_cli.py`
- `tests/cli/test_enqueue_schedule_cli.py`
- `tests/scraping/test_pipeline.py` (DB errors)
- `tests/tasks/test_redis_errors.py`
- `tests/cli/test_run_no_db.py`
- `tests/db/test_bulk_persistence.py`
- `tests/tasks/test_worker_end_to_end.py`
- `tests/cli/test_cli_worker_end_to_end.py`
  - **Result**: OK with Redis running

#### Documentation
- ✅ Update README.md (Phase 2 setup)
- ✅ Update CHANGELOG.md
- ✅ Migration guide (JSON → DB)

---

## 📈 Progress Metrics

| Category | Completed | Total | % |
|----------|-----------|-------|---|
| **Week 1** | 10 | 10 | 100% |
| **Week 2** | 5 | 5 | 100% |
| **Week 3** | 6 | 6 | 100% |
| **Week 4** | 7 | 7 | 100% |
| **TOTAL Phase 2** | 28 | 28 | **100%** |

---

## 🎯 Immediate Next Step

**Immediate next step**
- Phase 2 is complete; moving on to Phase 3 (REST API)
- Advanced API v1: filters, CSV/JSON export, webhooks + associated tests

**Afterwards (planned)**
- Phase 2 documentation (final summary)
- Retry policy (optional)
- Phase 4 Web UI (dashboard + charts)

---

## 🔧 Verification

### Week 1 verification (goal)
```bash
# Set up the infrastructure
docker-compose up -d
pricewatch init-db

# Check that the tables were created (\dt is a psql meta-command, so pass it with -c)
psql -h localhost -U pricewatch pricewatch -c '\dt'
# → 5 tables: products, price_history, product_images, product_specs, scraping_logs
```

### Week 2 verification (goal)
```bash
# Test the pipeline with the DB
pricewatch run --yaml scrap_url.yaml --save-db

# Check the data in the DB (SQL is passed to psql with -c)
psql -h localhost -U pricewatch pricewatch -c "SELECT * FROM products LIMIT 5;"
psql -h localhost -U pricewatch pricewatch -c "SELECT * FROM price_history ORDER BY fetched_at DESC LIMIT 10;"
```

### Week 3 verification (goal)
```bash
# Enqueue a job
pricewatch enqueue "https://www.amazon.fr/dp/B08N5WRWNW"

# Start the worker (it keeps running; use a second terminal for the check below)
pricewatch worker

# Check that the job was processed
psql -h localhost -U pricewatch pricewatch -c "SELECT * FROM scraping_logs ORDER BY fetched_at DESC LIMIT 5;"
```

---

## 📝 Important Notes

### Backward Compatibility
- ✅ The Phase 1 CLI works unchanged
- ✅ Identical JSON format
- ✅ Database is optional (`--no-db` flag)
- ✅ ProductSnapshot unchanged
- ✅ Phase 1 tests still pass (295 tests)

### Architecture Decisions

**Normalization vs. Performance**:
- Choice: Strict normalization (5 tables)
- Rationale: The catalog rarely changes, while prices change daily
- Rejected alternative: Everything in `products` + JSONB (harder to query)

**Natural Key vs. UUID**:
- Choice: `(source, reference)` as a unique constraint
- Rationale: The Amazon ASIN is already globally unique
- Rejected alternative: Artificial UUID (complicates deduplication)

**Synchronous vs. Asynchronous**:
- Choice: Synchronous RQ (no async/await)
- Rationale: Phase 1 code is 100% reusable; simplicity
- Rejected alternative: Asyncio + asyncpg (massive refactoring)

---

**Last updated**: 2026-01-15

### Recent progress recap (Phase 3 API)
- Advanced filters + CSV/JSON exports + webhooks (CRUD + test)
- Advanced API tests added
- Cleanup of Pydantic/datetime/selectors warnings
- Full pytest suite: 339 passed, 4 skipped

### Local validation (Week 1)
```bash
docker compose up -d
./venv/bin/alembic -c alembic.ini upgrade head
psql -h localhost -U pricewatch pricewatch -c '\dt'
```

**Result**: 6 tables visible (products, price_history, product_images, product_specs, scraping_logs, alembic_version).
**Status**: ✅ Week 1 complete (100%).