# 🚀 Phase 2 Infrastructure - IN PROGRESS

**Start date**: 2026-01-14
**Target version**: 0.4.0
**Goal**: Add PostgreSQL and a Redis/RQ worker for persistence and asynchronous scraping

---

## 📊 Overview

### Phase 2 Goals

- ✅ Centralized configuration (database, Redis, app)
- ✅ SQLAlchemy ORM models (5 tables)
- ✅ Database connection (init_db, get_session)
- ✅ Alembic migrations
- ⏳ Repository pattern (CRUD)
- ⏳ RQ worker for asynchronous scraping
- ⏳ Scheduler for recurring jobs
- ✅ Extended CLI (DB commands)
- ✅ Docker Compose (PostgreSQL + Redis)
- ⏳ Full test coverage

---

## ✅ Week 1: Database Foundation (COMPLETE)

### Completed Tasks

#### 1. Centralized Configuration ✅
**File**: `pricewatch/app/core/config.py` (187 lines)

**Contents**:
- `DatabaseConfig`: PostgreSQL configuration
  - Host, port, database, user, password
  - `url` property: SQLAlchemy connection string
  - `url_async` property: AsyncPG connection string (future use)
  - Env var prefix: `PW_DB_*` (PW_DB_HOST, PW_DB_PORT, etc.)

- `RedisConfig`: Redis configuration for RQ
  - Host, port, db, password (optional)
  - `url` property: Redis connection string
  - Env var prefix: `PW_REDIS_*`

- `AppConfig`: Global application configuration
  - Debug mode
  - Worker timeout (default 300 s)
  - Worker concurrency (default 2)
  - Feature flags: `enable_db`, `enable_worker`
  - Playwright defaults: timeout, use_playwright
  - Nested configs: `db`, `redis`
  - Env var prefix: `PW_*`

- **Singleton pattern**: `get_config()`, `set_config()`, `reset_config()`

**Rationale**:
- 12-factor app: configuration via environment variables
- Pydantic validation guarantees a valid configuration at startup
- Sensible defaults for local development
- `.env` file support to simplify setup
- Feature flags allow disabling the DB/worker for tests

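To make the layout above concrete, here is a minimal, hypothetical sketch of the configuration layer. It assumes Pydantic `BaseSettings` from `pydantic-settings`, shows only a subset of the fields, and uses an `lru_cache` shortcut instead of the real `get_config()`/`set_config()`/`reset_config()` trio; the actual `config.py` is larger.

```python
# Hypothetical sketch -- field names and defaults are illustrative, not the actual config.py.
from functools import lru_cache

from pydantic_settings import BaseSettings, SettingsConfigDict


class DatabaseConfig(BaseSettings):
    """PostgreSQL settings, read from PW_DB_* env vars (or a .env file)."""

    model_config = SettingsConfigDict(env_prefix="PW_DB_", env_file=".env", extra="ignore")

    host: str = "localhost"
    port: int = 5432
    database: str = "pricewatch"
    user: str = "pricewatch"
    password: str = "pricewatch"

    @property
    def url(self) -> str:
        # SQLAlchemy connection string using the psycopg2 driver
        return (
            f"postgresql+psycopg2://{self.user}:{self.password}"
            f"@{self.host}:{self.port}/{self.database}"
        )


class AppConfig(BaseSettings):
    """Global app settings, read from PW_* env vars."""

    model_config = SettingsConfigDict(env_prefix="PW_", env_file=".env", extra="ignore")

    debug: bool = False
    worker_timeout: int = 300
    enable_db: bool = True
    db: DatabaseConfig = DatabaseConfig()


@lru_cache(maxsize=1)
def get_config() -> AppConfig:
    """Singleton-style accessor: the config is built once and reused."""
    return AppConfig()
```
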
#### 2. Phase 2 Dependencies ✅
**File**: `pyproject.toml` (lines 48-60)

**Additions**:
```toml
# Database (Phase 2)
"sqlalchemy>=2.0.0",
"psycopg2-binary>=2.9.0",
"alembic>=1.13.0",

# Configuration (Phase 2)
"python-dotenv>=1.0.0",

# Worker/Queue (Phase 2)
"redis>=5.0.0",
"rq>=1.15.0",
"rq-scheduler>=0.13.0",
```

#### 3. SQLAlchemy ORM Models ✅
**File**: `pricewatch/app/db/models.py` (322 lines)

**Tables created**:

1. **`products`** - Product catalogue
   - PK: `id` (Integer, autoincrement)
   - Natural key: `(source, reference)` - unique constraint
   - Columns: `url`, `title`, `category`, `currency`
   - Timestamps: `first_seen_at`, `last_updated_at`
   - Relationships: `price_history`, `images`, `specs`, `logs`
   - Indexes: source, reference, last_updated_at

2. **`price_history`** - Price history (time series)
   - PK: `id` (Integer, autoincrement)
   - FK: `product_id` → products(id) CASCADE
   - Unique: `(product_id, fetched_at)` - prevents duplicates
   - Columns: `price` (Numeric 10,2), `shipping_cost`, `stock_status`
   - Fetch metadata: `fetch_method`, `fetch_status`, `fetched_at`
   - Check constraints: stock_status, fetch_method, fetch_status
   - Indexes: product_id, fetched_at

3. **`product_images`** - Product images
   - PK: `id` (Integer, autoincrement)
   - FK: `product_id` → products(id) CASCADE
   - Unique: `(product_id, image_url)` - prevents duplicates
   - Columns: `image_url` (Text), `position` (Integer, 0 = main image)
   - Index: product_id

4. **`product_specs`** - Product specifications (key-value)
   - PK: `id` (Integer, autoincrement)
   - FK: `product_id` → products(id) CASCADE
   - Unique: `(product_id, spec_key)` - prevents duplicates
   - Columns: `spec_key` (String 200), `spec_value` (Text)
   - Indexes: product_id, spec_key

5. **`scraping_logs`** - Observability logs
   - PK: `id` (Integer, autoincrement)
   - Optional FK: `product_id` → products(id) SET NULL
   - Columns: `url`, `source`, `reference`, `fetched_at`
   - Metrics: `duration_ms`, `html_size_bytes`
   - Fetch metadata: `fetch_method`, `fetch_status`
   - Debug data (JSONB): `errors`, `notes`
   - Indexes: product_id, source, fetched_at, fetch_status

**Schema rationale**:
- Normalization: products kept separate from price_history (catalogue vs time series)
- Natural key (source, reference) instead of an arbitrary UUID
- Separate tables for images/specs: avoids unstructured JSONB
- JSONB only for variable data: errors and notes in the logs
- CASCADE DELETE: deleting a product also deletes its history
- SET NULL for logs: keeps a trace even if the product is deleted

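For reference, here is a hedged sketch of two of these five models in SQLAlchemy 2.0 declarative style. Column names follow the descriptions above, but the real `models.py` defines more columns, the check constraints, and all five tables.

```python
# Illustrative sketch of two of the five models; the real models.py may differ in detail.
from datetime import datetime
from decimal import Decimal

from sqlalchemy import ForeignKey, Numeric, String, UniqueConstraint, func
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class Product(Base):
    __tablename__ = "products"
    __table_args__ = (UniqueConstraint("source", "reference", name="uq_products_source_reference"),)

    id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
    source: Mapped[str] = mapped_column(String(50), index=True)
    reference: Mapped[str] = mapped_column(String(100), index=True)
    url: Mapped[str]
    title: Mapped[str | None]
    currency: Mapped[str | None] = mapped_column(String(3))
    first_seen_at: Mapped[datetime] = mapped_column(server_default=func.now())
    last_updated_at: Mapped[datetime] = mapped_column(
        server_default=func.now(), onupdate=func.now(), index=True
    )

    # Deleting a product cascades to its price history
    price_history: Mapped[list["PriceHistory"]] = relationship(
        back_populates="product", cascade="all, delete-orphan"
    )


class PriceHistory(Base):
    __tablename__ = "price_history"
    __table_args__ = (UniqueConstraint("product_id", "fetched_at", name="uq_price_history_product_fetched"),)

    id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
    product_id: Mapped[int] = mapped_column(ForeignKey("products.id", ondelete="CASCADE"), index=True)
    price: Mapped[Decimal | None] = mapped_column(Numeric(10, 2))
    fetched_at: Mapped[datetime] = mapped_column(server_default=func.now(), index=True)

    product: Mapped[Product] = relationship(back_populates="price_history")
```
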
---

### Completed Tasks (continued)

#### 4. Database Connection ✅
**File**: `pricewatch/app/db/connection.py`

**Contents**:
- `get_engine(config)`: SQLAlchemy engine (with pooling)
- `get_session_factory(config)`: Session factory
- `get_session(config)`: Context manager
- `init_db(config)`: Table creation
- `check_db_connection(config)`: Health check
- `reset_engine()`: Reset for tests

**Rationale**:
- Singleton engine to avoid multiple connection pools
- `pool_pre_ping` for robustness against stale connections
- Context manager for automatic rollback/close

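A minimal sketch of how these helpers fit together, assuming a module-level engine singleton; the real `connection.py` may structure this differently.

```python
# Hedged sketch of the connection helpers described above.
from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import Session, sessionmaker

_engine = None  # module-level singleton so only one connection pool is created


def get_engine(config):
    global _engine
    if _engine is None:
        # pool_pre_ping checks connections before use and transparently replaces stale ones
        _engine = create_engine(config.db.url, pool_pre_ping=True)
    return _engine


def get_session_factory(config) -> sessionmaker:
    return sessionmaker(bind=get_engine(config), class_=Session)


@contextmanager
def get_session(config):
    """Yield a session, commit on success, roll back on error, always close."""
    session = get_session_factory(config)()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


def reset_engine() -> None:
    """Drop the cached engine (used by tests to start from a clean state)."""
    global _engine
    if _engine is not None:
        _engine.dispose()
        _engine = None
```
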
---

#### 5. Alembic Setup ✅
**Files**:
- `alembic.ini`
- `pricewatch/app/db/migrations/env.py`
- `pricewatch/app/db/migrations/script.py.mako`

**Rationale**:
- The DB URL is injected from `AppConfig`
- `compare_type=True` so autogenerated migrations stay consistent with the model types

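As an illustration, here is a hypothetical excerpt of the `env.py` wiring; it only runs when invoked through Alembic commands, the `pricewatch` import paths are assumptions, and a full `env.py` also handles offline mode (omitted here).

```python
# Hypothetical excerpt from migrations/env.py, illustrating URL injection and compare_type.
from alembic import context
from sqlalchemy import engine_from_config, pool

from pricewatch.app.core.config import get_config   # assumed import path
from pricewatch.app.db.models import Base            # assumed import path

config = context.config
# Replace the sqlalchemy.url from alembic.ini with the URL built from PW_DB_* env vars
config.set_main_option("sqlalchemy.url", get_config().db.url)
target_metadata = Base.metadata


def run_migrations_online() -> None:
    connectable = engine_from_config(
        config.get_section(config.config_ini_section, {}),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )
    with connectable.connect() as connection:
        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            compare_type=True,  # autogenerate also detects column type changes
        )
        with context.begin_transaction():
            context.run_migrations()


if not context.is_offline_mode():
    run_migrations_online()
```
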
#### 6. Initial Migration ✅
**File**: `pricewatch/app/db/migrations/versions/20260114_01_initial_schema.py`

**Contents**:
- 5 tables + indexes + constraints
- JSONB for `errors` and `notes`

#### 7. Database CLI Commands ✅
**File**: `pricewatch/app/cli/main.py`

**Commands**:
```bash
pricewatch init-db            # Create tables
pricewatch migrate "message"  # Generate an Alembic migration
pricewatch upgrade            # Apply migrations
pricewatch downgrade          # Roll back a migration
```

#### 8. Docker Compose ✅
**File**: `docker-compose.yml`

**Services**:
- PostgreSQL 16 (port 5432)
- Redis 7 (port 6379)
- Named volumes for persistence

#### 9. Example .env File ✅
**File**: `.env.example`

**Variables**:
```bash
# Database
PW_DB_HOST=localhost
PW_DB_PORT=5432
PW_DB_DATABASE=pricewatch
PW_DB_USER=pricewatch
PW_DB_PASSWORD=pricewatch

# Redis
PW_REDIS_HOST=localhost
PW_REDIS_PORT=6379
PW_REDIS_DB=0

# App
PW_DEBUG=false
PW_WORKER_TIMEOUT=300
PW_WORKER_CONCURRENCY=2
PW_ENABLE_DB=true
PW_ENABLE_WORKER=true
```

#### 10. Database Tests ✅
**Files**:
- `tests/db/test_models.py`: tests for the SQLAlchemy models
- `tests/db/test_connection.py`: connection and session tests

**Test strategy**:
- In-memory SQLite for unit tests
- pytest fixtures for setup/teardown
- Tests cover relationships, constraints, and indexes

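An illustrative pytest fixture for the in-memory SQLite strategy described above; the fixture name and import path are assumptions and may not match the real `conftest.py`.

```python
# Illustrative fixture: fresh in-memory schema per test, dropped on teardown.
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from pricewatch.app.db.models import Base  # assumed import path


@pytest.fixture()
def session():
    """Create all tables in a throwaway in-memory database, yield a session, then clean up."""
    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)
    with Session(engine) as db_session:
        yield db_session
    Base.metadata.drop_all(engine)
    engine.dispose()
```
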
---

## 📦 Week 2: Repository & Pipeline (IN PROGRESS)

### Planned Tasks

#### Repository Pattern
**File**: `pricewatch/app/db/repository.py`

**Class**: `ProductRepository`
- `get_or_create(source, reference)`: find or create a product
- `save_snapshot(snapshot)`: persist a ProductSnapshot to the DB
- `update_product_metadata(product, snapshot)`: update title, url, etc.
- `add_price_history(product, snapshot)`: add a price entry
- `sync_images(product, images)`: sync images (add new, keep existing)
- `sync_specs(product, specs)`: sync specs (upsert)
- `add_scraping_log(snapshot, product_id)`: log the scrape

**Status**: ✅ Done

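A hedged sketch of the natural-key lookup behind `get_or_create`; the real `ProductRepository` methods may take different arguments and handle more fields.

```python
# Illustrative sketch of get_or_create over the (source, reference) natural key.
from sqlalchemy import select
from sqlalchemy.orm import Session

from pricewatch.app.db.models import Product  # assumed import path


class ProductRepository:
    def __init__(self, session: Session) -> None:
        self.session = session

    def get_or_create(self, source: str, reference: str, url: str = "") -> Product:
        """Look a product up by its natural key; create it if it does not exist yet."""
        product = self.session.execute(
            select(Product).where(Product.source == source, Product.reference == reference)
        ).scalar_one_or_none()
        if product is None:
            product = Product(source=source, reference=reference, url=url)
            self.session.add(product)
            self.session.flush()  # assign the primary key without committing yet
        return product
```
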
#### Scraping Pipeline
**File**: `pricewatch/app/scraping/pipeline.py`

**Class**: `ScrapingPipeline`
- `process_snapshot(snapshot, save_to_db)`: orchestration
- Non-blocking: a DB failure does not crash the pipeline
- Returns: `product_id` or `None`

**Status**: ✅ Done

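A sketch of the non-blocking behaviour described above: the pipeline swallows DB errors so a persistence failure never loses the scraped snapshot. Class and method names mirror the list above; the repository wiring and its return value are assumptions.

```python
# Hedged sketch of non-blocking persistence; not the real pipeline.py.
import logging

logger = logging.getLogger(__name__)


class ScrapingPipeline:
    def __init__(self, repository) -> None:
        self.repository = repository

    def process_snapshot(self, snapshot, save_to_db: bool = True):
        """Persist the snapshot if requested; on DB failure, log and return None instead of raising."""
        if not save_to_db:
            return None
        try:
            # Assumed here to return the persisted product's id
            return self.repository.save_snapshot(snapshot)
        except Exception:  # deliberately broad so the scraping output is never lost
            logger.exception("DB persistence failed; continuing without saving")
            return None
```
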
#### CLI Modification
**File**: `pricewatch/app/cli/main.py`

**Changes to the `run` command**:
- Add a `--save-db / --no-db` flag
- Wire in `ScrapingPipeline` when `save_db=True`
- Backward compatible: the JSON output is still produced

**Status**: ✅ Done

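A minimal sketch of how such a flag could be wired in, assuming a Typer-style CLI; the helper functions below are hypothetical stand-ins for the existing Phase 1 code and the pipeline wiring, not real project functions.

```python
# Hypothetical sketch of the run command with an optional DB step.
import typer

app = typer.Typer()


def scrape_from_yaml(path: str) -> list:
    """Stand-in for the Phase 1 scrape (detect → fetch → parse)."""
    return []


def write_json_output(snapshots: list) -> None:
    """Stand-in for the Phase 1 JSON writer."""


def save_snapshots_to_db(snapshots: list) -> None:
    """Stand-in for ScrapingPipeline.process_snapshot over each snapshot."""


@app.command()
def run(
    yaml: str = typer.Option(..., "--yaml", help="YAML file listing product URLs"),
    save_db: bool = typer.Option(True, "--save-db/--no-db", help="Persist results to PostgreSQL"),
) -> None:
    """Scrape the listed URLs; JSON output is always written, DB persistence is optional."""
    snapshots = scrape_from_yaml(yaml)
    write_json_output(snapshots)  # backward compatible: always produced
    if save_db:
        save_snapshots_to_db(snapshots)


if __name__ == "__main__":
    app()
```
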
#### Repository + Pipeline Tests ✅
**Files**:
- `tests/db/test_repository.py`
- `tests/scraping/test_pipeline.py`

**Status**: ✅ Done

#### End-to-End CLI + DB Tests ✅
**File**:
- `tests/cli/test_run_db.py`

**Status**: ✅ Done

---

## 📦 Week 3: Worker Infrastructure (IN PROGRESS)

### Planned Tasks

#### RQ Task
**File**: `pricewatch/app/tasks/scrape.py`

**Function**: `scrape_product(url, use_playwright=True)`
- Reuses the Phase 1 code as-is (detect → fetch → parse)
- Saves to the DB via ScrapingPipeline
- Returns: `{success, product_id, snapshot, error}`

**Status**: ✅ Done

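A hedged sketch of the task body; the two helper functions are hypothetical stand-ins for the existing Phase 1 entry point and the pipeline wiring.

```python
# Illustrative sketch of an RQ-friendly task: plain function, JSON-serializable result.


def run_phase1_scrape(url: str, use_playwright: bool = True) -> dict:
    """Stand-in for the Phase 1 detect → fetch → parse chain; returns a snapshot-like dict."""
    return {"url": url, "use_playwright": use_playwright}


def persist_snapshot(snapshot: dict) -> int | None:
    """Stand-in for ScrapingPipeline.process_snapshot(snapshot, save_to_db=True)."""
    return None


def scrape_product(url: str, use_playwright: bool = True) -> dict:
    """Synchronous job body executed by an RQ worker."""
    try:
        snapshot = run_phase1_scrape(url, use_playwright=use_playwright)
        product_id = persist_snapshot(snapshot)
        return {"success": True, "product_id": product_id, "snapshot": snapshot, "error": None}
    except Exception as exc:  # broad on purpose: errors are returned in the result, not raised
        return {"success": False, "product_id": None, "snapshot": None, "error": str(exc)}
```

Because `scrape_product` is a plain module-level function with serializable arguments and return value, RQ can enqueue it by reference, e.g. `Queue(connection=Redis()).enqueue(scrape_product, url)`.
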
#### Scheduler
**File**: `pricewatch/app/tasks/scheduler.py`

**Class**: `ScrapingScheduler`
- `schedule_product(url, interval_hours=24)`: recurring job
- `enqueue_immediate(url)`: one-off job
- Built on `rq-scheduler`

**Status**: ✅ Done

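A hedged sketch of the scheduler wrapper on top of `rq` and `rq-scheduler`; the Redis URL default and queue name are assumptions, and the real `scheduler.py` may differ.

```python
# Illustrative sketch; constructor details are assumptions, not the actual scheduler.py.
from datetime import datetime, timezone

from redis import Redis
from rq import Queue
from rq_scheduler import Scheduler

from pricewatch.app.tasks.scrape import scrape_product  # the task defined above


class ScrapingScheduler:
    def __init__(self, redis_url: str = "redis://localhost:6379/0", queue_name: str = "default") -> None:
        connection = Redis.from_url(redis_url)
        self.queue = Queue(queue_name, connection=connection)
        self.scheduler = Scheduler(queue=self.queue, connection=connection)

    def enqueue_immediate(self, url: str):
        """One-off job: run scrape_product as soon as a worker picks it up."""
        return self.queue.enqueue(scrape_product, url)

    def schedule_product(self, url: str, interval_hours: int = 24):
        """Recurring job: re-scrape the URL every interval_hours, starting now."""
        return self.scheduler.schedule(
            scheduled_time=datetime.now(timezone.utc),
            func=scrape_product,
            args=[url],
            interval=interval_hours * 3600,  # rq-scheduler intervals are in seconds
            repeat=None,  # repeat indefinitely
        )
```
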
#### CLI Worker
**New commands**:
```bash
pricewatch worker                        # Start an RQ worker
pricewatch enqueue <url>                 # Enqueue an immediate scrape
pricewatch schedule <url> --interval 24  # Daily scrape
```

**Status**: ✅ Done

---

## 📦 Week 4: Tests & Documentation (NOT STARTED)

### Planned Tasks

#### Tests
- End-to-end tests (CLI → DB → Worker)
- Failure tests (DB down, Redis down)
- Backward compatibility tests (`--no-db`)
- Performance tests (100+ products)

#### Documentation
- Update README.md (Phase 2 setup)
- Update CHANGELOG.md
- Migration guide (JSON → DB)

---

## 📈 Progress Metrics

| Category | Completed | Total | % |
|----------|-----------|-------|---|
| **Week 1** | 10 | 10 | 100% |
| **Week 2** | 5 | 5 | 100% |
| **Week 3** | 3 | 6 | 50% |
| **Week 4** | 0 | 7 | 0% |
| **TOTAL Phase 2** | 18 | 28 | **64%** |

---

## 🎯 Immediate Next Steps

- End-to-end tests for worker + DB
- Handling of Redis-down errors (CLI + worker)

**After that (planned)**
- Observability logs for scheduled jobs

---

## 🔧 Verification

### Week 1 verification (target)
```bash
# Set up the infrastructure
docker-compose up -d
pricewatch init-db

# Check that the tables were created
psql -h localhost -U pricewatch pricewatch
\dt
# → 5 tables: products, price_history, product_images, product_specs, scraping_logs
```

### Week 2 verification (target)
```bash
# Test the pipeline with the DB
pricewatch run --yaml scrap_url.yaml --save-db

# Check the data in the DB
psql -h localhost -U pricewatch pricewatch
SELECT * FROM products LIMIT 5;
SELECT * FROM price_history ORDER BY fetched_at DESC LIMIT 10;
```

### Week 3 verification (target)
```bash
# Enqueue a job
pricewatch enqueue "https://www.amazon.fr/dp/B08N5WRWNW"

# Start a worker
pricewatch worker

# Check that the job was processed
psql -h localhost -U pricewatch pricewatch
SELECT * FROM scraping_logs ORDER BY fetched_at DESC LIMIT 5;
```

---

## 📝 Important Notes

### Backward Compatibility
- ✅ The Phase 1 CLI works unchanged
- ✅ JSON output format is identical
- ✅ The database is optional (`--no-db` flag)
- ✅ ProductSnapshot is unchanged
- ✅ Phase 1 tests still pass (295 tests)

### Architecture Decisions

**Normalization vs performance**:
- Choice: strict normalization (5 tables)
- Rationale: the catalogue changes rarely, prices change daily
- Rejected alternative: everything in products + JSONB (less queryable)

**Natural key vs UUID**:
- Choice: `(source, reference)` as a unique constraint
- Rationale: an Amazon ASIN is already globally unique
- Rejected alternative: an artificial UUID (complicates deduplication)

**Synchronous vs asynchronous**:
- Choice: synchronous RQ (no async/await)
- Rationale: the Phase 1 code is reusable as-is, and it keeps things simple
- Rejected alternative: asyncio + asyncpg (massive refactoring)

---

**Last updated**: 2026-01-14

### Local validation (Week 1)
```bash
docker compose up -d
./venv/bin/alembic -c alembic.ini upgrade head
psql -h localhost -U pricewatch pricewatch
\dt
```

**Result**: 6 tables visible (products, price_history, product_images, product_specs, scraping_logs, alembic_version).
**Status**: ✅ Week 1 in progress (30% complete)