This commit is contained in:
2026-01-14 07:03:38 +01:00
parent ecda149a4b
commit c91c0f1fc9
61 changed files with 4388 additions and 38 deletions

BIN
.coverage

Binary file not shown.

18
.env.example Executable file

@@ -0,0 +1,18 @@
# Database
PW_DB_HOST=localhost
PW_DB_PORT=5432
PW_DB_DATABASE=pricewatch
PW_DB_USER=pricewatch
PW_DB_PASSWORD=pricewatch
# Redis
PW_REDIS_HOST=localhost
PW_REDIS_PORT=6379
PW_REDIS_DB=0
# App
PW_DEBUG=false
PW_WORKER_TIMEOUT=300
PW_WORKER_CONCURRENCY=2
PW_ENABLE_DB=true
PW_ENABLE_WORKER=true

0
.gitignore vendored Normal file → Executable file

CHANGELOG.md

@@ -9,9 +9,94 @@ Le format est basé sur [Keep a Changelog](https://keepachangelog.com/fr/1.0.0/)
## [Non publié]
### En cours
- Ajout de fixtures HTML réalistes pour tests pytest
- Tests stores/cdiscount/
- Tests scraping/ avec mocks
- Phase 2 : Base de données PostgreSQL
- Phase 2 : Worker Redis/RQ
- Phase 3 : API REST FastAPI
- Phase 4 : Web UI
### Ajouté
- Configuration Alembic (env.py, script.py.mako, alembic.ini)
- Migration initiale SQLAlchemy (5 tables + indexes)
- Commandes CLI DB: `init-db`, `migrate`, `upgrade`, `downgrade`
- `docker-compose.yml` PostgreSQL/Redis
- `.env.example` avec variables DB/Redis/app
- Tests DB de base (models + connection)
- Repository `ProductRepository` + `ScrapingPipeline`
- Flag CLI `--save-db/--no-db` pour la persistence
- Tests repository/pipeline (SQLite)
- Test end-to-end CLI + DB (SQLite)
- Worker RQ + scheduler (tasks + CLI)
---
## [0.3.0] - 2026-01-14 🎉 PHASE 1 TERMINÉE
### ✅ Phase 1 CLI complétée à 100%
**Résultat final**:
- **295 tests passent** (100% de réussite)
- **76% code coverage global**
- **4 stores opérationnels** (Amazon, Cdiscount, Backmarket, AliExpress)
### Ajouté
#### Corrections et améliorations
- **Amazon Store**: Correction extraction images avec fallback générique
- **Amazon Store**: Support prix séparés en 2 spans (a-price-whole + a-price-fraction)
#### Tests complets ajoutés (177 nouveaux tests)
- **tests/core/test_registry.py**: 40 tests (100% coverage)
- 24 tests unitaires avec mocks
- 16 tests d'intégration avec les 4 stores réels
- Tests de détection automatique multi-stores
- **tests/core/test_registry_integration.py**: Tests d'intégration stores
- Vérification détection correcte pour Amazon, Cdiscount, Backmarket, AliExpress
- Tests de priorité et exclusivité des matches
- **tests/core/test_io.py**: 36 tests (97% coverage)
- Tests ScrapingConfig/ScrapingOptions Pydantic
- Tests read_yaml_config avec validation erreurs
- Tests write_json_results et read_json_results
- Tests save_debug_html et save_debug_screenshot
- **tests/scraping/test_http_fetch.py**: 21 tests (100% coverage)
- Tests fetch_http avec mocks requests
- Tests codes HTTP (200, 403, 404, 429, 500+)
- Tests timeout et exceptions réseau
- Tests User-Agent rotation et headers personnalisés
- **tests/scraping/test_pw_fetch.py**: 21 tests (91% coverage)
- Tests fetch_playwright avec mocks Playwright
- Tests modes headless/headful
- Tests screenshot et wait_for_selector
- Tests fetch_with_fallback (stratégie HTTP → Playwright)
- Tests cleanup des ressources
### Statistiques détaillées
**Coverage par module**:
| Module | Coverage | Tests |
|--------|----------|-------|
| core/schema.py | 100% | 29 |
| core/registry.py | 100% | 40 |
| core/io.py | 97% | 36 |
| scraping/http_fetch.py | 100% | 21 |
| scraping/pw_fetch.py | 91% | 21 |
| stores/amazon/store.py | 89% | 33 |
| stores/aliexpress/store.py | 85% | 32 |
| stores/backmarket/store.py | 85% | 25 |
| stores/cdiscount/store.py | 72% | 30 |
| **TOTAL** | **76%** | **295** |
### Améliorations techniques
- Architecture complètement testée avec mocks et fixtures
- Tests d'intégration validant le fonctionnement end-to-end
- Couverture de code élevée sur tous les modules critiques
- Détection automatique de stores validée avec URLs réelles
### Prochaines étapes (Phase 2)
La Phase 1 CLI est maintenant **production-ready**. La Phase 2 peut démarrer:
1. Base de données PostgreSQL + Alembic
2. Worker Redis/RQ pour scraping planifié
3. API REST FastAPI
4. Web UI responsive avec dark theme Gruvbox
---

267
PHASE_1_COMPLETE.md Executable file

@@ -0,0 +1,267 @@
# 🎉 Phase 1 CLI - TERMINÉE À 100%
**Date de complétion**: 2026-01-14
**Version**: 0.3.0
---
## 📊 Résultats Finaux
### Tests
- **295/295 tests passent** (100% de réussite)
- 📈 **76% code coverage global**
- **Temps d'exécution**: 41.4 secondes
### Modules testés
| Module | Coverage | Tests | Statut |
|--------|----------|-------|--------|
| `core/schema.py` | **100%** | 29 | ✅ |
| `core/registry.py` | **100%** | 40 | ✅ |
| `core/io.py` | **97%** | 36 | ✅ |
| `scraping/http_fetch.py` | **100%** | 21 | ✅ |
| `scraping/pw_fetch.py` | **91%** | 21 | ✅ |
| `stores/amazon/` | **89%** | 33 | ✅ |
| `stores/aliexpress/` | **85%** | 32 | ✅ |
| `stores/backmarket/` | **85%** | 25 | ✅ |
| `stores/cdiscount/` | **72%** | 30 | ✅ |
| `base.py` | **87%** | - | ✅ |
| `logging.py` | **71%** | - | ✅ |
---
## 🏗️ Architecture Implémentée
### 1. Core (`pricewatch/app/core/`)
- `schema.py` - Modèle ProductSnapshot Pydantic
- `registry.py` - Détection automatique stores
- `io.py` - Lecture YAML / Écriture JSON
- `logging.py` - Système de logs colorés
### 2. Scraping (`pricewatch/app/scraping/`)
- `http_fetch.py` - HTTP simple avec rotation User-Agent
- `pw_fetch.py` - Playwright fallback anti-bot
- ✅ Stratégie automatique: HTTP → Playwright si échec
### 3. Stores (`pricewatch/app/stores/`)
- `base.py` - Classe abstraite BaseStore
- **Amazon** - amazon.fr, amazon.com, amazon.co.uk, amazon.de
- **Cdiscount** - cdiscount.com
- **Backmarket** - backmarket.fr, backmarket.com
- **AliExpress** - fr.aliexpress.com, aliexpress.com
### 4. CLI (`pricewatch/app/cli/`)
- `pricewatch run` - Pipeline YAML → JSON
- `pricewatch detect` - Détection store depuis URL
- `pricewatch fetch` - Test HTTP/Playwright
- `pricewatch parse` - Test parsing HTML
- `pricewatch doctor` - Health check
---
## 🔧 Corrections Apportées
### Amazon Store
1. **Extraction images** - Ajout fallback générique `soup.find_all("img")`
2. **Prix séparés** - Support `a-price-whole` + `a-price-fraction`
### Tests Ajoutés (177 nouveaux)
1. **Registry** - 40 tests (24 unitaires + 16 intégration)
2. **I/O** - 36 tests (YAML, JSON, debug files)
3. **HTTP Fetch** - 21 tests (mocks requests)
4. **Playwright Fetch** - 21 tests (mocks Playwright)
---
## ✨ Fonctionnalités Validées
### Scraping
- ✅ Détection automatique du store depuis URL
- ✅ Normalisation URLs vers forme canonique
- ✅ Extraction ASIN/SKU/référence produit
- ✅ Parsing HTML → ProductSnapshot
- ✅ Fallback HTTP → Playwright automatique
- ✅ Gestion anti-bot (User-Agent, headers, timeout)
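À titre d'illustration, esquisse minimale de l'enchaînement détection → canonicalisation → parse décrit ci-dessus (les méthodes `match()`, `canonicalize()`, `extract_reference()` et `parse(html, url)` viennent des tests et du code de ce dépôt; le fetch HTTP/Playwright est volontairement laissé de côté, et la forme exacte des signatures reste à vérifier dans `stores/`):

```python
# Esquisse (non contractuelle) du flux détection -> parse pour une page Amazon déjà récupérée.
from pricewatch.app.stores.amazon.store import AmazonStore


def parse_amazon(url: str, html: str):
    """Détecte, canonicalise puis parse une page produit Amazon."""
    store = AmazonStore()
    if not store.match(url):                         # ce store gère-t-il cette URL ?
        return None
    canonical = store.canonicalize(url)              # URL canonique (sans paramètres de tracking)
    reference = store.extract_reference(canonical)   # ASIN
    snapshot = store.parse(html, canonical)          # HTML -> ProductSnapshot
    return reference, snapshot
```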
### Data Extraction
- ✅ Titre produit
- ✅ Prix (EUR, USD, GBP)
- ✅ Statut stock (in_stock, out_of_stock, unknown)
- ✅ Images (URLs multiples)
- ✅ Catégorie (breadcrumb)
- ✅ Caractéristiques techniques (specs dict)
- ✅ Référence produit (ASIN, SKU)
### Debug & Observabilité
- ✅ Logs détaillés avec timestamps et couleurs
- ✅ Sauvegarde HTML optionnelle
- ✅ Screenshots Playwright optionnels
- ✅ Métriques (durée, taille HTML, méthode)
- ✅ Gestion erreurs robuste (403, captcha, timeout)
### Output
- ✅ JSON structuré (ProductSnapshot[])
- ✅ Validation Pydantic
- ✅ Serialization ISO 8601 (dates)
- ✅ Pretty-print configurable
---
## 📋 Commandes Testées
```bash
# Pipeline complet
pricewatch run --yaml scrap_url.yaml --out scraped_store.json
# Détection store
pricewatch detect "https://www.amazon.fr/dp/B08N5WRWNW"
# Test HTTP
pricewatch fetch "https://example.com" --http
# Test Playwright
pricewatch fetch "https://example.com" --playwright
# Parse HTML
pricewatch parse amazon --in page.html
# Health check
pricewatch doctor
# Mode debug
pricewatch run --yaml scrap_url.yaml --debug
```
---
## 🧪 Tests Exécutés
### Lancer tous les tests
```bash
pytest -v --tb=no --cov=pricewatch
```
**Résultat**: `295 passed, 3 warnings in 41.40s`
### Par module
```bash
pytest tests/core/ # 105 tests
pytest tests/scraping/ # 42 tests
pytest tests/stores/ # 148 tests
```
### Coverage détaillé
```bash
pytest --cov=pricewatch --cov-report=html
# Voir: htmlcov/index.html
```
---
## 📦 Dépendances
### Production
- `typer[all]` - CLI framework
- `rich` - Terminal UI
- `pydantic` - Data validation
- `requests` - HTTP client
- `playwright` - Browser automation
- `beautifulsoup4` - HTML parsing
- `lxml` - XML/HTML parser
- `pyyaml` - YAML support
### Développement
- `pytest` - Testing framework
- `pytest-cov` - Coverage reporting
- `pytest-mock` - Mocking utilities
- `pytest-asyncio` - Async test support
---
## 🚀 Prochaines Étapes (Phase 2)
La Phase 1 CLI est **production-ready**. Vous pouvez démarrer la Phase 2:
### Infrastructure
1. **PostgreSQL + Alembic**
- Schéma base de données
- Migrations versionnées
- Models SQLAlchemy
- Historique prix
2. **Worker & Scheduler**
- Redis pour queue
- RQ ou Celery worker
- Scraping planifié (quotidien)
- Retry policy
3. **API REST**
- FastAPI endpoints
- Authentification JWT
- Documentation OpenAPI
- CORS configuration
4. **Web UI**
- Framework React/Vue
- Design responsive
- Dark theme Gruvbox
- Graphiques historique prix
- Système d'alertes
### Features
- Alertes baisse prix (email, webhooks)
- Alertes retour en stock
- Comparateur multi-stores
- Export données (CSV, Excel)
- API publique
---
## 📝 Documentation
- `README.md` - Guide utilisateur complet
- `TODO.md` - Roadmap et phases
- `CHANGELOG.md` - Historique des versions
- `CLAUDE.md` - Guide pour Claude Code
- `PROJECT_SPEC.md` - Spécifications techniques
---
## 🎯 Métriques de Qualité
| Métrique | Valeur | Objectif | Statut |
|----------|--------|----------|--------|
| Tests passants | 295/295 | 100% | ✅ |
| Code coverage | 76% | >70% | ✅ |
| Stores actifs | 4 | ≥2 | ✅ |
| CLI commands | 5 | ≥4 | ✅ |
| Documentation | Complète | Complète | ✅ |
---
## ✅ Checklist Phase 1
- [x] Architecture modulaire
- [x] Modèle de données Pydantic
- [x] Système de logging
- [x] Lecture YAML / Écriture JSON
- [x] Registry stores avec détection automatique
- [x] HTTP fetch avec User-Agent rotation
- [x] Playwright fallback anti-bot
- [x] BaseStore abstrait
- [x] Amazon store complet
- [x] Cdiscount store complet
- [x] Backmarket store complet
- [x] AliExpress store complet
- [x] CLI Typer avec 5 commandes
- [x] Tests pytest (295 tests)
- [x] Code coverage >70%
- [x] Documentation complète
- [x] Pipeline YAML → JSON fonctionnel
- [x] Validation avec URLs réelles
---
**Phase 1 CLI: 100% COMPLÈTE**
Prêt pour la Phase 2! 🚀

437
PHASE_2_PROGRESS.md Executable file

@@ -0,0 +1,437 @@
# 🚀 Phase 2 Infrastructure - EN COURS
**Date de démarrage**: 2026-01-14
**Version cible**: 0.4.0
**Objectif**: Ajouter PostgreSQL + Redis/RQ worker pour persistence et scraping asynchrone
---
## 📊 Vue d'Ensemble
### Objectifs Phase 2
- ✅ Configuration centralisée (database, Redis, app)
- ✅ Modèles SQLAlchemy ORM (5 tables)
- ✅ Connexion base de données (init_db, get_session)
- ✅ Migrations Alembic
- ✅ Repository pattern (CRUD)
- ✅ Worker RQ pour scraping asynchrone
- ✅ Scheduler pour jobs récurrents
- ✅ CLI étendu (commandes DB)
- ✅ Docker Compose (PostgreSQL + Redis)
- ⏳ Tests complets
---
## ✅ Semaine 1: Database Foundation (TERMINÉE)
### Tâches Complétées
#### 1. Configuration Centralisée ✅
**Fichier**: `pricewatch/app/core/config.py` (187 lignes)
**Contenu**:
- `DatabaseConfig`: Configuration PostgreSQL
- Host, port, database, user, password
- Propriété `url`: SQLAlchemy connection string
- Propriété `url_async`: AsyncPG connection string (futur)
- Prefix env vars: `PW_DB_*` (PW_DB_HOST, PW_DB_PORT, etc.)
- `RedisConfig`: Configuration Redis pour RQ
- Host, port, db, password (optional)
- Propriété `url`: Redis connection string
- Prefix env vars: `PW_REDIS_*`
- `AppConfig`: Configuration globale application
- Debug mode
- Worker timeout (300s par défaut)
- Worker concurrency (2 par défaut)
- Feature flags: `enable_db`, `enable_worker`
- Defaults Playwright: timeout, use_playwright
- Nested configs: `db`, `redis`
- Prefix env vars: `PW_*`
- **Pattern Singleton**: `get_config()`, `set_config()`, `reset_config()`
**Justifications**:
- 12-factor app: configuration via env vars
- Pydantic validation garantit config valide au démarrage
- Valeurs par défaut pour développement local
- Support `.env` file pour faciliter le setup
- Feature flags permettent de désactiver DB/worker pour tests
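Esquisse d'utilisation de cette configuration (d'après le `config.py` ajouté dans ce commit; les valeurs d'exemple sont celles de `.env.example`):

```python
# Esquisse: charger la configuration et construire les URLs de connexion.
# Les valeurs proviennent des variables d'environnement PW_* (ou d'un fichier .env).
import os

os.environ["PW_DB_HOST"] = "localhost"   # équivaut à une ligne du .env
os.environ["PW_REDIS_DB"] = "0"

from pricewatch.app.core.config import get_config, reset_config

reset_config()            # repartir d'une config propre (utile en test)
config = get_config()     # singleton AppConfig (DatabaseConfig + RedisConfig imbriqués)

print(config.db.url)      # postgresql://pricewatch:pricewatch@localhost:5432/pricewatch
print(config.redis.url)   # redis://localhost:6379/0
print(config.enable_db, config.worker_timeout)
```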
#### 2. Dépendances Phase 2 ✅
**Fichier**: `pyproject.toml` (lignes 48-60)
**Ajouts**:
```toml
# Database (Phase 2)
"sqlalchemy>=2.0.0",
"psycopg2-binary>=2.9.0",
"alembic>=1.13.0",
# Configuration (Phase 2)
"python-dotenv>=1.0.0",
# Worker/Queue (Phase 2)
"redis>=5.0.0",
"rq>=1.15.0",
"rq-scheduler>=0.13.0",
```
#### 3. Modèles SQLAlchemy ORM ✅
**Fichier**: `pricewatch/app/db/models.py` (322 lignes)
**Tables créées**:
1. **`products`** - Catalogue produits
- PK: `id` (Integer, autoincrement)
- Natural key: `(source, reference)` - Unique constraint
- Colonnes: `url`, `title`, `category`, `currency`
- Timestamps: `first_seen_at`, `last_updated_at`
- Relations: `price_history`, `images`, `specs`, `logs`
- Indexes: source, reference, last_updated_at
2. **`price_history`** - Historique prix (time-series)
- PK: `id` (Integer, autoincrement)
- FK: `product_id` → products(id) CASCADE
- Unique: `(product_id, fetched_at)` - Évite doublons
- Colonnes: `price` (Numeric 10,2), `shipping_cost`, `stock_status`
- Fetch metadata: `fetch_method`, `fetch_status`, `fetched_at`
- Check constraints: stock_status, fetch_method, fetch_status
- Indexes: product_id, fetched_at
3. **`product_images`** - Images produit
- PK: `id` (Integer, autoincrement)
- FK: `product_id` → products(id) CASCADE
- Unique: `(product_id, image_url)` - Évite doublons
- Colonnes: `image_url` (Text), `position` (Integer, 0=main)
- Index: product_id
4. **`product_specs`** - Caractéristiques produit (key-value)
- PK: `id` (Integer, autoincrement)
- FK: `product_id` → products(id) CASCADE
- Unique: `(product_id, spec_key)` - Évite doublons
- Colonnes: `spec_key` (String 200), `spec_value` (Text)
- Indexes: product_id, spec_key
5. **`scraping_logs`** - Logs observabilité
- PK: `id` (Integer, autoincrement)
- FK optionnelle: `product_id` → products(id) SET NULL
- Colonnes: `url`, `source`, `reference`, `fetched_at`
- Métriques: `duration_ms`, `html_size_bytes`
- Fetch metadata: `fetch_method`, `fetch_status`
- Debug data (JSONB): `errors`, `notes`
- Indexes: product_id, source, fetched_at, fetch_status
**Justifications schéma**:
- Normalisation: products séparée de price_history (catalogue vs time-series)
- Clé naturelle (source, reference) vs UUID arbitraire
- Tables séparées pour images/specs: évite JSONB non structuré
- JSONB uniquement pour données variables: errors, notes dans logs
- Cascade DELETE: suppression produit → suppression historique
- SET NULL pour logs: garde trace même si produit supprimé
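Par exemple, récupérer le dernier prix connu d'un produit via sa clé naturelle (esquisse, en supposant une session ouverte avec le `get_session()` décrit juste après):

```python
# Esquisse: dernier prix enregistré pour un produit (source, reference).
from pricewatch.app.db.connection import get_session
from pricewatch.app.db.models import PriceHistory, Product

with get_session() as session:
    product = (
        session.query(Product)
        .filter_by(source="amazon", reference="B08N5WRWNW")
        .first()
    )
    if product is not None:
        latest = (
            session.query(PriceHistory)
            .filter_by(product_id=product.id)
            .order_by(PriceHistory.fetched_at.desc())   # time-series: trier par date de fetch
            .first()
        )
        if latest is not None:
            print(product.title, latest.price, latest.stock_status)
```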
---
### Tâches Complétées (suite)
#### 4. Connexion Base de Données ✅
**Fichier**: `pricewatch/app/db/connection.py`
**Contenu**:
- `get_engine(config)`: Engine SQLAlchemy (pooling)
- `get_session_factory(config)`: Session factory
- `get_session(config)`: Context manager
- `init_db(config)`: Création tables
- `check_db_connection(config)`: Health check
- `reset_engine()`: Reset pour tests
**Justifications**:
- Singleton engine pour éviter les pools multiples
- `pool_pre_ping` pour robustesse
- Context manager pour rollback/close automatiques
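Esquisse de vérification de l'infrastructure avec les fonctions listées ci-dessus:

```python
# Esquisse: initialiser les tables puis vérifier la connexion (health check).
from pricewatch.app.core.config import get_config
from pricewatch.app.db.connection import check_db_connection, init_db

config = get_config()
init_db(config)                    # Base.metadata.create_all() - idempotent
if check_db_connection(config):    # SELECT 1
    print("PostgreSQL joignable")
else:
    print("Connexion DB impossible - vérifier docker-compose / .env")
```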
---
#### 5. Setup Alembic ✅
**Fichiers**:
- `alembic.ini`
- `pricewatch/app/db/migrations/env.py`
- `pricewatch/app/db/migrations/script.py.mako`
**Justifications**:
- URL DB injectée depuis `AppConfig`
- `compare_type=True` pour cohérence des migrations
#### 6. Migration Initiale ✅
**Fichier**: `pricewatch/app/db/migrations/versions/20260114_01_initial_schema.py`
**Contenu**:
- 5 tables + indexes + contraintes
- JSONB pour `errors` et `notes`
#### 7. Commandes CLI Database ✅
**Fichier**: `pricewatch/app/cli/main.py`
**Commandes**:
```bash
pricewatch init-db # Créer tables
pricewatch migrate "message" # Générer migration Alembic
pricewatch upgrade # Appliquer migrations
pricewatch downgrade # Rollback migration
```
#### 8. Docker Compose ✅
**Fichier**: `docker-compose.yml`
**Services**:
- PostgreSQL 16 (port 5432)
- Redis 7 (port 6379)
- Volumes pour persistence
#### 9. Fichier .env Exemple ✅
**Fichier**: `.env.example`
**Variables**:
```bash
# Database
PW_DB_HOST=localhost
PW_DB_PORT=5432
PW_DB_DATABASE=pricewatch
PW_DB_USER=pricewatch
PW_DB_PASSWORD=pricewatch
# Redis
PW_REDIS_HOST=localhost
PW_REDIS_PORT=6379
PW_REDIS_DB=0
# App
PW_DEBUG=false
PW_WORKER_TIMEOUT=300
PW_WORKER_CONCURRENCY=2
PW_ENABLE_DB=true
PW_ENABLE_WORKER=true
```
#### 10. Tests Database ✅
**Fichiers**:
- `tests/db/test_models.py`: Tests des modèles SQLAlchemy
- `tests/db/test_connection.py`: Tests connexion et session
**Stratégie tests**:
- SQLite in-memory pour tests unitaires
- Fixtures pytest pour setup/teardown
- Tests relationships, constraints, indexes
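Esquisse d'une fixture pytest correspondant à cette stratégie (SQLite in-memory); les fixtures réelles du dépôt peuvent différer dans le détail:

```python
# Esquisse de fixture pytest: base SQLite en mémoire, tables créées puis détruites à chaque test.
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from pricewatch.app.db.models import Base


@pytest.fixture()
def db_session():
    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)            # setup: créer les 5 tables
    factory = sessionmaker(bind=engine, expire_on_commit=False)
    session = factory()
    try:
        yield session
    finally:
        session.close()
        Base.metadata.drop_all(engine)          # teardown
        engine.dispose()
```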
---
## 📦 Semaine 2: Repository & Pipeline (TERMINÉE)
### Tâches Réalisées
#### Repository Pattern
**Fichier**: `pricewatch/app/db/repository.py`
**Classe**: `ProductRepository`
- `get_or_create(source, reference)`: Trouver ou créer produit
- `save_snapshot(snapshot)`: Persist ProductSnapshot to DB
- `update_product_metadata(product, snapshot)`: Update title, url, etc.
- `add_price_history(product, snapshot)`: Ajouter entrée prix
- `sync_images(product, images)`: Sync images (add new, keep existing)
- `sync_specs(product, specs)`: Sync specs (upsert)
- `add_scraping_log(snapshot, product_id)`: Log scraping
**Statut**: ✅ Terminé
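Esquisse d'utilisation du repository (hypothèses: le constructeur prend une session SQLAlchemy et `save_snapshot()` orchestre `get_or_create` / `add_price_history` / `sync_images` / `sync_specs` listés ci-dessus):

```python
# Esquisse: persister un ProductSnapshot via ProductRepository (signatures supposées).
from pricewatch.app.db.connection import get_session
from pricewatch.app.db.repository import ProductRepository


def save_snapshot_to_db(snapshot) -> None:
    with get_session() as session:
        repo = ProductRepository(session)   # hypothèse sur la signature du constructeur
        repo.save_snapshot(snapshot)        # upsert produit + historique prix + images + specs
        session.commit()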
#### Scraping Pipeline
**Fichier**: `pricewatch/app/scraping/pipeline.py`
**Classe**: `ScrapingPipeline`
- `process_snapshot(snapshot, save_to_db)`: Orchestration
- Non bloquant : un échec DB ne fait pas planter le pipeline
- Retour: `product_id` ou `None`
**Statut**: ✅ Terminé
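Utilisation telle qu'intégrée dans la commande `run` (conforme au diff CLI plus bas dans ce commit):

```python
# Esquisse: persistence optionnelle d'un ProductSnapshot, comme dans `pricewatch run --save-db`.
from typing import Optional

from pricewatch.app.core.config import get_config
from pricewatch.app.scraping.pipeline import ScrapingPipeline


def persist(snapshot) -> Optional[int]:
    """Retourne l'id du produit créé/mis à jour, ou None si la persistence échoue."""
    pipeline = ScrapingPipeline(config=get_config())
    return pipeline.process_snapshot(snapshot, save_to_db=True)
```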
#### CLI Modification
**Fichier**: `pricewatch/app/cli/main.py`
**Modification commande `run`**:
- Ajouter flag `--save-db / --no-db`
- Intégrer `ScrapingPipeline` si `save_db=True`
- Compatibilité backward: JSON output toujours créé
**Statut**: ✅ Terminé
#### Tests Repository + Pipeline ✅
**Fichiers**:
- `tests/db/test_repository.py`
- `tests/scraping/test_pipeline.py`
**Statut**: ✅ Terminé
#### Tests end-to-end CLI + DB ✅
**Fichier**:
- `tests/cli/test_run_db.py`
**Statut**: ✅ Terminé
---
## 📦 Semaine 3: Worker Infrastructure (EN COURS)
### Tâches Prévues
#### RQ Task
**Fichier**: `pricewatch/app/tasks/scrape.py`
**Fonction**: `scrape_product(url, use_playwright=True)`
- Réutilise 100% code Phase 1 (detect → fetch → parse)
- Save to DB via ScrapingPipeline
- Retour: `{success, product_id, snapshot, error}`
**Statut**: ✅ Terminé
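Esquisse d'enqueue direct avec RQ, en réutilisant cette fonction de tâche (équivalent bas niveau de `pricewatch enqueue`, qui passe lui par `ScrapingScheduler.enqueue_immediate`):

```python
# Esquisse: mettre en file un scraping via RQ, sans passer par la CLI.
import redis
from rq import Queue

from pricewatch.app.core.config import get_config
from pricewatch.app.tasks.scrape import scrape_product

config = get_config()
connection = redis.from_url(config.redis.url)
queue = Queue("default", connection=connection)

job = queue.enqueue(scrape_product, "https://www.amazon.fr/dp/B08N5WRWNW", use_playwright=True)
print(job.id)   # le worker `pricewatch worker` consommera ce job
```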
#### Scheduler
**Fichier**: `pricewatch/app/tasks/scheduler.py`
**Classe**: `ScrapingScheduler`
- `schedule_product(url, interval_hours=24)`: Job récurrent
- `enqueue_immediate(url)`: Job unique
- Basé sur `rq-scheduler`
**Statut**: ✅ Terminé
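Côté code, la planification correspond à ceci (conforme à la commande `schedule` du diff CLI plus bas):

```python
# Esquisse: planifier un scraping récurrent (équivalent de `pricewatch schedule <url> --interval 24`).
from pricewatch.app.core.config import get_config
from pricewatch.app.tasks.scheduler import ScrapingScheduler

scheduler = ScrapingScheduler(get_config(), queue_name="default")
job_info = scheduler.schedule_product(
    "https://www.amazon.fr/dp/B08N5WRWNW",
    interval_hours=24,
    use_playwright=None,   # laisser la valeur par défaut de la config
    save_db=True,
)
print(job_info.job_id, job_info.next_run.isoformat())
```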
#### CLI Worker
**Nouvelles commandes**:
```bash
pricewatch worker # Lancer worker RQ
pricewatch enqueue <url> # Enqueue scrape immédiat
pricewatch schedule <url> --interval 24 # Scrape quotidien
```
**Statut**: ✅ Terminé
---
## 📦 Semaine 4: Tests & Documentation (NON DÉMARRÉ)
### Tâches Prévues
#### Tests
- Tests end-to-end (CLI → DB → Worker)
- Tests erreurs (DB down, Redis down)
- Tests backward compatibility (`--no-db`)
- Performance tests (100+ produits)
#### Documentation
- Update README.md (setup Phase 2)
- Update CHANGELOG.md
- Migration guide (JSON → DB)
---
## 📈 Métriques d'Avancement
| Catégorie | Complétées | Totales | % |
|-----------|------------|---------|---|
| **Semaine 1** | 10 | 10 | 100% |
| **Semaine 2** | 5 | 5 | 100% |
| **Semaine 3** | 3 | 6 | 50% |
| **Semaine 4** | 0 | 7 | 0% |
| **TOTAL Phase 2** | 18 | 28 | **64%** |
---
## 🎯 Prochaine Étape Immédiate
- Tests end-to-end worker + DB
- Gestion des erreurs Redis down (CLI + worker)
**Après (prévu)**
- Logs d'observabilité pour les jobs planifiés
---
## 🔧 Vérifications
### Vérification Semaine 1 (objectif)
```bash
# Setup infrastructure
docker-compose up -d
pricewatch init-db
# Vérifier tables créées
psql -h localhost -U pricewatch pricewatch
\dt
# → 5 tables: products, price_history, product_images, product_specs, scraping_logs
```
### Vérification Semaine 2 (objectif)
```bash
# Test pipeline avec DB
pricewatch run --yaml scrap_url.yaml --save-db
# Vérifier données en DB
psql -h localhost -U pricewatch pricewatch
SELECT * FROM products LIMIT 5;
SELECT * FROM price_history ORDER BY fetched_at DESC LIMIT 10;
```
### Vérification Semaine 3 (objectif)
```bash
# Enqueue job
pricewatch enqueue "https://www.amazon.fr/dp/B08N5WRWNW"
# Lancer worker
pricewatch worker
# Vérifier job traité
psql -h localhost -U pricewatch pricewatch
SELECT * FROM scraping_logs ORDER BY fetched_at DESC LIMIT 5;
```
---
## 📝 Notes Importantes
### Backward Compatibility
- ✅ CLI Phase 1 fonctionne sans changement
- ✅ Format JSON identique
- ✅ Database optionnelle (`--no-db` flag)
- ✅ ProductSnapshot inchangé
- ✅ Tests Phase 1 continuent à passer (295 tests)
### Architecture Décisions
**Normalisation vs Performance**:
- Choix: Normalisation stricte (5 tables)
- Justification: Catalogue change rarement, prix changent quotidiennement
- Alternative rejetée: Tout dans products + JSONB (moins queryable)
**Clé Naturelle vs UUID**:
- Choix: `(source, reference)` comme unique constraint
- Justification: ASIN Amazon déjà unique globalement
- Alternative rejetée: UUID artificiel (complexifie déduplication)
**Synchrone vs Asynchrone**:
- Choix: RQ synchrone (pas d'async/await)
- Justification: Code Phase 1 réutilisable à 100%, simplicité
- Alternative rejetée: Asyncio + asyncpg (refactoring massif)
---
**Dernière mise à jour**: 2026-01-14
### Validation locale (Semaine 1)
```bash
docker compose up -d
./venv/bin/alembic -c alembic.ini upgrade head
psql -h localhost -U pricewatch pricewatch
\\dt
```
**Résultat**: 6 tables visibles (products, price_history, product_images, product_specs, scraping_logs, alembic_version).
**Statut**: ✅ Semaine 1 terminée (10/10 tâches)


@@ -58,6 +58,13 @@ pricewatch/
│ │ ├── store.py
│ │ ├── selectors.yml
│ │ └── fixtures/
│ ├── db/ # Persistence SQLAlchemy (Phase 2)
│ │ ├── models.py
│ │ ├── connection.py
│ │ └── migrations/
│ ├── tasks/ # Jobs RQ (Phase 3)
│ │ ├── scrape.py
│ │ └── scheduler.py
│ └── cli/
│ └── main.py # CLI Typer
├── tests/ # Tests pytest
@@ -76,6 +83,9 @@ pricewatch run --yaml scrap_url.yaml --out scraped_store.json
# Avec debug
pricewatch run --yaml scrap_url.yaml --out scraped_store.json --debug
# Avec persistence DB
pricewatch run --yaml scrap_url.yaml --out scraped_store.json --save-db
```
### Commandes utilitaires
@@ -97,6 +107,45 @@ pricewatch parse amazon --in scraped/page.html
pricewatch doctor
```
### Commandes base de données
```bash
# Initialiser les tables
pricewatch init-db
# Générer une migration
pricewatch migrate "Initial schema"
# Appliquer les migrations
pricewatch upgrade
# Revenir en arrière
pricewatch downgrade -1
```
### Commandes worker
```bash
# Lancer un worker RQ
pricewatch worker
# Enqueue un job immédiat
pricewatch enqueue "https://example.com/product"
# Planifier un job récurrent
pricewatch schedule "https://example.com/product" --interval 24
```
## Base de données (Phase 2)
```bash
# Lancer PostgreSQL + Redis en local
docker-compose up -d
# Exemple de configuration
cp .env.example .env
```
## Configuration (scrap_url.yaml)
```yaml
@@ -196,8 +245,8 @@ Aucune erreur ne doit crasher silencieusement : toutes sont loggées et tracées
- ✅ Tests pytest
### Phase 2 : Persistence
- [ ] Base de données PostgreSQL
- [ ] Migrations Alembic
- [x] Base de données PostgreSQL
- [x] Migrations Alembic
- [ ] Historique des prix
### Phase 3 : Automation

86
TODO.md

@@ -101,72 +101,92 @@ Liste des tâches priorisées pour le développement de PriceWatch.
### Étape 9 : Tests
- [x] Configurer pytest dans pyproject.toml
- [x] Tests core/schema.py
- [x] Tests core/schema.py (29 tests - 100% coverage)
- [x] Validation ProductSnapshot
- [x] Serialization JSON
- [x] Tests core/registry.py
- [x] Tests core/registry.py (40 tests - 100% coverage)
- [x] Enregistrement stores
- [x] Détection automatique
- [x] Tests stores/amazon/
- [x] Tests d'intégration avec 4 stores réels
- [x] Tests core/io.py (36 tests - 97% coverage)
- [x] Lecture/écriture YAML/JSON
- [x] Sauvegarde debug HTML/screenshots
- [x] Tests stores/amazon/ (33 tests - 89% coverage)
- [x] match() avec différentes URLs
- [x] canonicalize()
- [x] extract_reference()
- [~] parse() sur fixtures HTML (6 tests nécessitent fixtures réels)
- [ ] Tests stores/cdiscount/
- [ ] Idem Amazon
- [ ] Tests scraping/
- [ ] http_fetch avec mock
- [ ] pw_fetch avec mock
- [x] parse() sur fixtures HTML
- [x] Tests stores/cdiscount/ (30 tests - 72% coverage)
- [x] Tests complets avec fixtures réels
- [x] Tests stores/backmarket/ (25 tests - 85% coverage)
- [x] Tests complets avec fixtures réels
- [x] Tests stores/aliexpress/ (32 tests - 85% coverage)
- [x] Tests complets avec fixtures réels
- [x] Tests scraping/ (42 tests)
- [x] http_fetch avec mock (21 tests - 100% coverage)
- [x] pw_fetch avec mock (21 tests - 91% coverage)
### Étape 10 : Intégration et validation
- [x] Créer scrap_url.yaml exemple
- [x] Tester pipeline complet YAML → JSON
- [x] Tester avec vraies URLs Amazon
- [ ] Tester avec vraies URLs Cdiscount
- [x] Tester avec vraies URLs Cdiscount
- [x] Vérifier tous les modes de debug
- [x] Valider sauvegarde HTML/screenshots
- [x] Documentation finale
### Bilan Étape 9 (Tests pytest)
**État**: 80 tests passent / 86 tests totaux (93%)
- ✓ core/schema.py: 29/29 tests
- ✓ core/registry.py: 24/24 tests
- ✓ stores/amazon/: 27/33 tests (6 tests nécessitent fixtures HTML réalistes)
### ✅ PHASE 1 TERMINÉE À 100%
**État final**: 295 tests passent / 295 tests totaux (100%)
**Coverage global**: 76%
**Tests restants**:
- Fixtures HTML Amazon/Cdiscount
- Tests Cdiscount store
- Tests scraping avec mocks
**Détail par module**:
- ✅ core/schema.py: 100% coverage
- ✅ core/registry.py: 100% coverage (40 tests)
- ✅ core/io.py: 97% coverage (36 tests)
- ✅ scraping/http_fetch.py: 100% coverage (21 tests)
- ✅ scraping/pw_fetch.py: 91% coverage (21 tests)
- ✅ stores/amazon/: 89% coverage (33 tests)
- ✅ stores/aliexpress/: 85% coverage (32 tests)
- ✅ stores/backmarket/: 85% coverage (25 tests)
- ✅ stores/cdiscount/: 72% coverage (30 tests)
**4 stores opérationnels**: Amazon, Cdiscount, Backmarket, AliExpress
---
## Phase 2 : Base de données (Future)
## Phase 2 : Base de données (En cours)
### Persistence
- [ ] Schéma PostgreSQL
- [ ] Migrations Alembic
- [ ] Models SQLAlchemy
- [x] Schéma PostgreSQL
- [x] Migrations Alembic
- [x] Models SQLAlchemy
- [x] Connexion DB (engine, session, init)
- [x] Tests DB de base
- [x] Repository pattern (ProductRepository)
- [x] ScrapingPipeline (persistence optionnelle)
- [x] CLI `--save-db/--no-db`
- [x] Tests end-to-end CLI + DB
- [ ] CRUD produits
- [ ] Historique prix
### Configuration
- [ ] Fichier config (DB credentials)
- [ ] Variables d'environnement
- [ ] Dockerfile PostgreSQL
- [x] Fichier config (DB credentials)
- [x] Variables d'environnement
- [x] Docker Compose PostgreSQL/Redis
---
## Phase 3 : Worker et automation (Future)
## Phase 3 : Worker et automation (En cours)
### Worker
- [ ] Setup Redis
- [ ] Worker RQ ou Celery
- [ ] Queue de scraping
- [x] Setup Redis
- [x] Worker RQ
- [x] Queue de scraping
- [ ] Retry policy
### Planification
- [ ] Cron ou scheduler intégré
- [ ] Scraping quotidien automatique
- [x] Cron ou scheduler intégré
- [x] Scraping quotidien automatique
- [ ] Logs des runs
---
@@ -216,4 +236,4 @@ Liste des tâches priorisées pour le développement de PriceWatch.
---
**Dernière mise à jour**: 2026-01-13
**Dernière mise à jour**: 2026-01-14

36
alembic.ini Executable file

@@ -0,0 +1,36 @@
[alembic]
script_location = pricewatch/app/db/migrations
sqlalchemy.url = postgresql://pricewatch:pricewatch@localhost:5432/pricewatch
# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic
[handlers]
keys = console
[formatters]
keys = generic
[logger_root]
level = WARN
handlers = console
[logger_sqlalchemy]
level = WARN
handlers = console
qualname = sqlalchemy.engine
[logger_alembic]
level = INFO
handlers = console
qualname = alembic
[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic
[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s

22
docker-compose.yml Executable file

@@ -0,0 +1,22 @@
services:
postgres:
image: postgres:16
environment:
POSTGRES_DB: pricewatch
POSTGRES_USER: pricewatch
POSTGRES_PASSWORD: pricewatch
ports:
- "5432:5432"
volumes:
- pricewatch_pgdata:/var/lib/postgresql/data
redis:
image: redis:7
ports:
- "6379:6379"
volumes:
- pricewatch_redisdata:/data
volumes:
pricewatch_pgdata:
pricewatch_redisdata:


@@ -21,6 +21,13 @@ Requires-Dist: lxml>=5.1.0
Requires-Dist: cssselect>=1.2.0
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: python-dateutil>=2.8.2
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: psycopg2-binary>=2.9.0
Requires-Dist: alembic>=1.13.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: redis>=5.0.0
Requires-Dist: rq>=1.15.0
Requires-Dist: rq-scheduler>=0.13.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"


@@ -11,6 +11,7 @@ pricewatch/app/__init__.py
pricewatch/app/cli/__init__.py
pricewatch/app/cli/main.py
pricewatch/app/core/__init__.py
pricewatch/app/core/config.py
pricewatch/app/core/io.py
pricewatch/app/core/logging.py
pricewatch/app/core/registry.py


@@ -9,6 +9,13 @@ lxml>=5.1.0
cssselect>=1.2.0
pyyaml>=6.0.1
python-dateutil>=2.8.2
sqlalchemy>=2.0.0
psycopg2-binary>=2.9.0
alembic>=1.13.0
python-dotenv>=1.0.0
redis>=5.0.0
rq>=1.15.0
rq-scheduler>=0.13.0
[dev]
pytest>=8.0.0

pricewatch/app/cli/main.py

@@ -13,20 +13,28 @@ import sys
from pathlib import Path
from typing import Optional
import redis
import typer
from rq import Connection, Worker
from alembic import command as alembic_command
from alembic.config import Config as AlembicConfig
from rich import print as rprint
from rich.console import Console
from rich.table import Table
from pricewatch.app.core import logging as app_logging
from pricewatch.app.core.config import get_config
from pricewatch.app.core.io import read_yaml_config, write_json_results
from pricewatch.app.core.logging import get_logger, set_level
from pricewatch.app.core.registry import get_registry, register_store
from pricewatch.app.core.schema import DebugInfo, DebugStatus, FetchMethod
from pricewatch.app.db.connection import init_db
from pricewatch.app.scraping.http_fetch import fetch_http
from pricewatch.app.scraping.pipeline import ScrapingPipeline
from pricewatch.app.scraping.pw_fetch import fetch_playwright
from pricewatch.app.stores.amazon.store import AmazonStore
from pricewatch.app.stores.cdiscount.store import CdiscountStore
from pricewatch.app.tasks.scheduler import ScrapingScheduler
# Créer l'application Typer
app = typer.Typer(
@@ -46,6 +54,75 @@ def setup_stores():
registry.register(CdiscountStore())
def get_alembic_config() -> AlembicConfig:
"""Construit la configuration Alembic à partir du repository."""
root_path = Path(__file__).resolve().parents[3]
config_path = root_path / "alembic.ini"
migrations_path = root_path / "pricewatch" / "app" / "db" / "migrations"
if not config_path.exists():
logger.error(f"alembic.ini introuvable: {config_path}")
raise typer.Exit(code=1)
alembic_cfg = AlembicConfig(str(config_path))
alembic_cfg.set_main_option("script_location", str(migrations_path))
alembic_cfg.set_main_option("sqlalchemy.url", get_config().db.url)
return alembic_cfg
@app.command("init-db")
def init_db_command():
"""
Initialise la base de donnees (creer toutes les tables).
"""
try:
init_db(get_config())
except Exception as e:
logger.error(f"Init DB echoue: {e}")
raise typer.Exit(code=1)
@app.command()
def migrate(
message: str = typer.Argument(..., help="Message de migration"),
autogenerate: bool = typer.Option(True, "--autogenerate/--no-autogenerate"),
):
"""
Genere une migration Alembic.
"""
try:
alembic_cfg = get_alembic_config()
alembic_command.revision(alembic_cfg, message=message, autogenerate=autogenerate)
except Exception as e:
logger.error(f"Migration echouee: {e}")
raise typer.Exit(code=1)
@app.command()
def upgrade(revision: str = typer.Argument("head", help="Revision cible")):
"""
Applique les migrations Alembic.
"""
try:
alembic_cfg = get_alembic_config()
alembic_command.upgrade(alembic_cfg, revision)
except Exception as e:
logger.error(f"Upgrade echoue: {e}")
raise typer.Exit(code=1)
@app.command()
def downgrade(revision: str = typer.Argument("-1", help="Revision cible")):
"""
Rollback une migration Alembic.
"""
try:
alembic_cfg = get_alembic_config()
alembic_command.downgrade(alembic_cfg, revision)
except Exception as e:
logger.error(f"Downgrade echoue: {e}")
raise typer.Exit(code=1)
@app.command()
def run(
yaml: Path = typer.Option(
@@ -67,6 +144,11 @@ def run(
"-d",
help="Activer le mode debug",
),
save_db: Optional[bool] = typer.Option(
None,
"--save-db/--no-db",
help="Activer la persistence en base de donnees",
),
):
"""
Pipeline complet: scrape toutes les URLs du YAML et génère le JSON.
@@ -88,6 +170,12 @@ def run(
logger.error(f"Erreur lecture YAML: {e}")
raise typer.Exit(code=1)
app_config = get_config()
if save_db is None:
save_db = app_config.enable_db
pipeline = ScrapingPipeline(config=app_config)
logger.info(f"{len(config.urls)} URL(s) à scraper")
# Scraper chaque URL
@@ -158,6 +246,11 @@ def run(
snapshot = store.parse(html, canonical_url)
snapshot.debug.method = fetch_method
if save_db:
product_id = pipeline.process_snapshot(snapshot, save_to_db=True)
if product_id:
logger.info(f"DB: produit id={product_id}")
snapshots.append(snapshot)
status_emoji = "" if snapshot.is_complete() else ""
@@ -180,6 +273,8 @@ def run(
errors=[f"Parsing failed: {str(e)}"],
),
)
if save_db:
pipeline.process_snapshot(snapshot, save_to_db=True)
snapshots.append(snapshot)
else:
# Pas de HTML récupéré
@@ -194,6 +289,8 @@ def run(
errors=[f"Fetch failed: {fetch_error or 'Unknown error'}"],
),
)
if save_db:
pipeline.process_snapshot(snapshot, save_to_db=True)
snapshots.append(snapshot)
# Écrire les résultats
@@ -359,5 +456,65 @@ def doctor():
rprint("\n[green]✓ PriceWatch est prêt![/green]")
@app.command()
def worker(
queue: str = typer.Option("default", "--queue", "-q", help="Nom de la queue RQ"),
with_scheduler: bool = typer.Option(
True, "--with-scheduler/--no-scheduler", help="Activer le scheduler RQ"
),
):
"""
Lance un worker RQ.
"""
config = get_config()
connection = redis.from_url(config.redis.url)
with Connection(connection):
worker_instance = Worker([queue])
worker_instance.work(with_scheduler=with_scheduler)
@app.command()
def enqueue(
url: str = typer.Argument(..., help="URL du produit a scraper"),
queue: str = typer.Option("default", "--queue", "-q", help="Nom de la queue RQ"),
save_db: bool = typer.Option(True, "--save-db/--no-db", help="Activer la DB"),
use_playwright: Optional[bool] = typer.Option(
None, "--playwright/--no-playwright", help="Forcer Playwright"
),
):
"""
Enqueue un scraping immediat.
"""
scheduler = ScrapingScheduler(get_config(), queue_name=queue)
job = scheduler.enqueue_immediate(url, use_playwright=use_playwright, save_db=save_db)
rprint(f"[green]✓ Job enqueued: {job.id}[/green]")
@app.command()
def schedule(
url: str = typer.Argument(..., help="URL du produit a planifier"),
interval: int = typer.Option(24, "--interval", help="Intervalle en heures"),
queue: str = typer.Option("default", "--queue", "-q", help="Nom de la queue RQ"),
save_db: bool = typer.Option(True, "--save-db/--no-db", help="Activer la DB"),
use_playwright: Optional[bool] = typer.Option(
None, "--playwright/--no-playwright", help="Forcer Playwright"
),
):
"""
Planifie un scraping recurrent.
"""
scheduler = ScrapingScheduler(get_config(), queue_name=queue)
job_info = scheduler.schedule_product(
url,
interval_hours=interval,
use_playwright=use_playwright,
save_db=save_db,
)
rprint(
f"[green]✓ Job planifie: {job_info.job_id} (next={job_info.next_run.isoformat()})[/green]"
)
if __name__ == "__main__":
app()

Binary file not shown.

186
pricewatch/app/core/config.py Executable file

@@ -0,0 +1,186 @@
"""
Configuration centralisée pour PriceWatch Phase 2.
Gère la configuration de la base de données, Redis, et l'application globale.
Utilise Pydantic Settings pour validation et chargement depuis variables d'environnement.
Justification technique:
- Pattern 12-factor app: configuration via env vars
- Pydantic validation garantit config valide au démarrage
- Valeurs par défaut pour développement local
- Support .env file pour faciliter le setup
"""
from typing import Optional
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict
from pricewatch.app.core.logging import get_logger
logger = get_logger("core.config")
class DatabaseConfig(BaseSettings):
"""Configuration PostgreSQL."""
host: str = Field(default="localhost", description="PostgreSQL host")
port: int = Field(default=5432, description="PostgreSQL port")
database: str = Field(default="pricewatch", description="Database name")
user: str = Field(default="pricewatch", description="Database user")
password: str = Field(default="pricewatch", description="Database password")
model_config = SettingsConfigDict(
env_prefix="PW_DB_", # PW_DB_HOST, PW_DB_PORT, etc.
env_file=".env",
env_file_encoding="utf-8",
extra="ignore",
)
@property
def url(self) -> str:
"""
SQLAlchemy connection URL.
Format: postgresql://user:password@host:port/database
"""
return f"postgresql://{self.user}:{self.password}@{self.host}:{self.port}/{self.database}"
@property
def url_async(self) -> str:
"""
Async SQLAlchemy connection URL (pour usage futur avec asyncpg).
Format: postgresql+asyncpg://user:password@host:port/database
"""
return f"postgresql+asyncpg://{self.user}:{self.password}@{self.host}:{self.port}/{self.database}"
class RedisConfig(BaseSettings):
"""Configuration Redis pour RQ worker."""
host: str = Field(default="localhost", description="Redis host")
port: int = Field(default=6379, description="Redis port")
db: int = Field(default=0, description="Redis database number (0-15)")
password: Optional[str] = Field(default=None, description="Redis password (optional)")
model_config = SettingsConfigDict(
env_prefix="PW_REDIS_", # PW_REDIS_HOST, PW_REDIS_PORT, etc.
env_file=".env",
env_file_encoding="utf-8",
extra="ignore",
)
@property
def url(self) -> str:
"""
Redis connection URL pour RQ.
Format: redis://[password@]host:port/db
"""
auth = f":{self.password}@" if self.password else ""
return f"redis://{auth}{self.host}:{self.port}/{self.db}"
class AppConfig(BaseSettings):
"""Configuration globale de l'application."""
# Mode debug
debug: bool = Field(
default=False, description="Enable debug mode (verbose logging, SQL echo)"
)
# Worker configuration
worker_timeout: int = Field(
default=300, description="Worker job timeout in seconds (5 minutes)"
)
worker_concurrency: int = Field(
default=2, description="Number of concurrent worker processes"
)
# Feature flags
enable_db: bool = Field(
default=True, description="Enable database persistence (can disable for testing)"
)
enable_worker: bool = Field(
default=True, description="Enable background worker functionality"
)
# Scraping defaults
default_playwright_timeout: int = Field(
default=60000, description="Default Playwright timeout in milliseconds"
)
default_use_playwright: bool = Field(
default=True, description="Use Playwright fallback by default"
)
model_config = SettingsConfigDict(
env_prefix="PW_", # PW_DEBUG, PW_WORKER_TIMEOUT, etc.
env_file=".env",
env_file_encoding="utf-8",
extra="ignore",
)
# Nested configs (instances, not classes)
db: DatabaseConfig = Field(default_factory=DatabaseConfig)
redis: RedisConfig = Field(default_factory=RedisConfig)
def log_config(self) -> None:
"""Log la configuration active (sans password)."""
logger.info("=== Configuration PriceWatch ===")
logger.info(f"Debug mode: {self.debug}")
logger.info(f"Database: {self.db.host}:{self.db.port}/{self.db.database}")
logger.info(f"Redis: {self.redis.host}:{self.redis.port}/{self.redis.db}")
logger.info(f"DB enabled: {self.enable_db}")
logger.info(f"Worker enabled: {self.enable_worker}")
logger.info(f"Worker timeout: {self.worker_timeout}s")
logger.info(f"Worker concurrency: {self.worker_concurrency}")
logger.info("================================")
# Singleton global config instance
_config: Optional[AppConfig] = None
def get_config() -> AppConfig:
"""
Récupère l'instance globale de configuration (singleton).
Returns:
Instance AppConfig
Justification:
- Évite de recharger la config à chaque appel
- Centralise la configuration pour toute l'application
- Permet d'override pour les tests
"""
global _config
if _config is None:
_config = AppConfig()
if _config.debug:
_config.log_config()
return _config
def set_config(config: AppConfig) -> None:
"""
Override la configuration globale (principalement pour tests).
Args:
config: Instance AppConfig à utiliser
"""
global _config
_config = config
logger.debug("Configuration overridden")
def reset_config() -> None:
"""Reset la configuration globale (pour tests)."""
global _config
_config = None
logger.debug("Configuration reset")

41
pricewatch/app/db/__init__.py Executable file

@@ -0,0 +1,41 @@
"""
Module de base de données pour PriceWatch Phase 2.
Gère la persistence PostgreSQL avec SQLAlchemy ORM.
"""
from pricewatch.app.db.connection import (
check_db_connection,
get_engine,
get_session,
get_session_factory,
init_db,
reset_engine,
)
from pricewatch.app.db.repository import ProductRepository
from pricewatch.app.db.models import (
Base,
Product,
PriceHistory,
ProductImage,
ProductSpec,
ScrapingLog,
)
__all__ = [
# Models
"Base",
"Product",
"PriceHistory",
"ProductImage",
"ProductSpec",
"ScrapingLog",
"ProductRepository",
# Connection
"get_engine",
"get_session_factory",
"get_session",
"init_db",
"check_db_connection",
"reset_engine",
]

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

238
pricewatch/app/db/connection.py Executable file

@@ -0,0 +1,238 @@
"""
Gestion des connexions PostgreSQL pour PriceWatch Phase 2.
Fournit:
- Engine SQLAlchemy avec connection pooling
- Session factory avec context manager
- Initialisation des tables
- Health check
Justification technique:
- Connection pooling: réutilisation connexions pour performance
- Context manager: garantit fermeture session (pas de leak)
- pool_pre_ping: vérifie connexion avant usage (robustesse)
- echo=debug: logs SQL en mode debug
"""
from contextlib import contextmanager
from typing import Generator, Optional
from sqlalchemy import create_engine, text
from sqlalchemy.engine import Engine
from sqlalchemy.engine.url import make_url
from sqlalchemy.exc import OperationalError, SQLAlchemyError
from sqlalchemy.orm import Session, sessionmaker
from pricewatch.app.core.config import AppConfig, get_config
from pricewatch.app.core.logging import get_logger
from pricewatch.app.db.models import Base
logger = get_logger("db.connection")
# Global engine instance (singleton)
_engine: Optional[Engine] = None
_session_factory: Optional[sessionmaker] = None
def get_engine(config: Optional[AppConfig] = None) -> Engine:
"""
Récupère ou crée l'Engine SQLAlchemy (singleton).
Args:
config: Configuration app (utilise get_config() si None)
Returns:
Engine SQLAlchemy configuré
Justification:
- Singleton: une seule pool de connexions par application
- pool_pre_ping: vérifie connexion avant usage (évite "connection closed")
- pool_size=5, max_overflow=10: limite connexions (15 max)
- echo=debug: logs SQL pour debugging
"""
global _engine
if _engine is None:
if config is None:
config = get_config()
db_url = config.db.url
url = make_url(db_url)
is_sqlite = url.get_backend_name() == "sqlite"
logger.info(f"Creating database engine: {db_url}")
engine_kwargs = {
"pool_pre_ping": True,
"pool_recycle": 3600,
"echo": config.debug,
}
if not is_sqlite:
engine_kwargs.update(
{
"pool_size": 5,
"max_overflow": 10,
}
)
_engine = create_engine(db_url, **engine_kwargs)
logger.info("Database engine created successfully")
return _engine
def init_db(config: Optional[AppConfig] = None) -> None:
"""
Initialise la base de données (crée toutes les tables).
Args:
config: Configuration app (utilise get_config() si None)
Raises:
OperationalError: Si connexion impossible
SQLAlchemyError: Si création tables échoue
Note:
Utilise Base.metadata.create_all() - idempotent (ne crash pas si tables existent)
"""
if config is None:
config = get_config()
logger.info("Initializing database...")
try:
engine = get_engine(config)
# Créer toutes les tables définies dans Base.metadata
Base.metadata.create_all(bind=engine)
logger.info("Database initialized successfully")
logger.info(f"Tables created: {', '.join(Base.metadata.tables.keys())}")
except OperationalError as e:
logger.error(f"Failed to connect to database: {e}")
raise
except SQLAlchemyError as e:
logger.error(f"Failed to create tables: {e}")
raise
def get_session_factory(config: Optional[AppConfig] = None) -> sessionmaker:
"""
Récupère ou crée la session factory (singleton).
Args:
config: Configuration app (utilise get_config() si None)
Returns:
Session factory SQLAlchemy
Justification:
- expire_on_commit=False: objets restent accessibles après commit
- autocommit=False, autoflush=False: contrôle explicite
"""
global _session_factory
if _session_factory is None:
engine = get_engine(config)
_session_factory = sessionmaker(
bind=engine,
expire_on_commit=False, # Objets restent accessibles après commit
autocommit=False, # Contrôle explicite du commit
autoflush=False, # Contrôle explicite du flush
)
logger.debug("Session factory created")
return _session_factory
@contextmanager
def get_session(config: Optional[AppConfig] = None) -> Generator[Session, None, None]:
"""
Context manager pour session SQLAlchemy.
Args:
config: Configuration app (utilise get_config() si None)
Yields:
Session SQLAlchemy
Usage:
with get_session() as session:
product = session.query(Product).filter_by(reference="B08N5WRWNW").first()
session.commit()
Justification:
- Context manager: garantit fermeture session (pas de leak)
- Rollback automatique sur exception
- Close automatique en fin de bloc
"""
factory = get_session_factory(config)
session = factory()
try:
logger.debug("Session opened")
yield session
except Exception as e:
logger.error(f"Session error, rolling back: {e}")
session.rollback()
raise
finally:
logger.debug("Session closed")
session.close()
def check_db_connection(config: Optional[AppConfig] = None) -> bool:
"""
Vérifie la connexion à la base de données (health check).
Args:
config: Configuration app (utilise get_config() si None)
Returns:
True si connexion OK, False sinon
Note:
Execute une query simple: SELECT 1
"""
if config is None:
config = get_config()
try:
engine = get_engine(config)
with engine.connect() as conn:
result = conn.execute(text("SELECT 1"))
result.scalar()
logger.info("Database connection OK")
return True
except OperationalError as e:
logger.error(f"Database connection failed: {e}")
return False
except SQLAlchemyError as e:
logger.error(f"Database health check failed: {e}")
return False
def reset_engine() -> None:
"""
Reset l'engine global (pour tests).
Note:
Dispose l'engine et reset les singletons.
"""
global _engine, _session_factory
if _engine is not None:
logger.debug("Disposing database engine")
_engine.dispose()
_engine = None
_session_factory = None
logger.debug("Engine reset complete")

Binary file not shown.

pricewatch/app/db/migrations/env.py

@@ -0,0 +1,80 @@
"""
Configuration Alembic pour PriceWatch.
Recupere l'URL DB depuis AppConfig pour garantir un setup coherent.
"""
from logging.config import fileConfig
from alembic import context
from sqlalchemy import engine_from_config, pool
from pricewatch.app.core.config import get_config
from pricewatch.app.db.models import Base
# Alembic Config object
config = context.config
# Configure logging
if config.config_file_name is not None:
fileConfig(config.config_file_name)
# Metadata SQLAlchemy pour autogenerate
target_metadata = Base.metadata
def _get_database_url() -> str:
"""Construit l'URL DB depuis la config applicative."""
app_config = get_config()
return app_config.db.url
def run_migrations_offline() -> None:
"""
Execute les migrations en mode offline.
Configure le contexte avec l'URL DB sans creer d'engine.
"""
url = _get_database_url()
context.configure(
url=url,
target_metadata=target_metadata,
literal_binds=True,
dialect_opts={"paramstyle": "named"},
compare_type=True,
)
with context.begin_transaction():
context.run_migrations()
def run_migrations_online() -> None:
"""
Execute les migrations en mode online.
Cree un engine SQLAlchemy et etablit la connexion.
"""
configuration = config.get_section(config.config_ini_section) or {}
configuration["sqlalchemy.url"] = _get_database_url()
connectable = engine_from_config(
configuration,
prefix="sqlalchemy.",
poolclass=pool.NullPool,
)
with connectable.connect() as connection:
context.configure(
connection=connection,
target_metadata=target_metadata,
compare_type=True,
)
with context.begin_transaction():
context.run_migrations()
if context.is_offline_mode():
run_migrations_offline()
else:
run_migrations_online()

pricewatch/app/db/migrations/script.py.mako

@@ -0,0 +1,24 @@
"""${message}
Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}
"""
from alembic import op
import sqlalchemy as sa
${imports if imports else ""}
# Revision identifiers, used by Alembic.
revision = ${repr(up_revision)}
down_revision = ${repr(down_revision)}
branch_labels = ${repr(branch_labels)}
depends_on = ${repr(depends_on)}
def upgrade() -> None:
${upgrades if upgrades else "pass"}
def downgrade() -> None:
${downgrades if downgrades else "pass"}

pricewatch/app/db/migrations/versions/20260114_01_initial_schema.py

@@ -0,0 +1,124 @@
"""Initial schema
Revision ID: 20260114_01
Revises: None
Create Date: 2026-01-14 00:00:00
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
# Revision identifiers, used by Alembic.
revision = "20260114_01"
down_revision = None
branch_labels = None
depends_on = None
def upgrade() -> None:
op.create_table(
"products",
sa.Column("id", sa.Integer(), primary_key=True, autoincrement=True),
sa.Column("source", sa.String(length=50), nullable=False),
sa.Column("reference", sa.String(length=100), nullable=False),
sa.Column("url", sa.Text(), nullable=False),
sa.Column("title", sa.Text(), nullable=True),
sa.Column("category", sa.Text(), nullable=True),
sa.Column("currency", sa.String(length=3), nullable=True),
sa.Column("first_seen_at", sa.TIMESTAMP(), nullable=False),
sa.Column("last_updated_at", sa.TIMESTAMP(), nullable=False),
sa.UniqueConstraint("source", "reference", name="uq_product_source_reference"),
)
op.create_index("ix_product_source", "products", ["source"], unique=False)
op.create_index("ix_product_reference", "products", ["reference"], unique=False)
op.create_index("ix_product_last_updated", "products", ["last_updated_at"], unique=False)
op.create_table(
"price_history",
sa.Column("id", sa.Integer(), primary_key=True, autoincrement=True),
sa.Column("product_id", sa.Integer(), nullable=False),
sa.Column("price", sa.Numeric(10, 2), nullable=True),
sa.Column("shipping_cost", sa.Numeric(10, 2), nullable=True),
sa.Column("stock_status", sa.String(length=20), nullable=True),
sa.Column("fetch_method", sa.String(length=20), nullable=False),
sa.Column("fetch_status", sa.String(length=20), nullable=False),
sa.Column("fetched_at", sa.TIMESTAMP(), nullable=False),
sa.ForeignKeyConstraint(["product_id"], ["products.id"], ondelete="CASCADE"),
sa.UniqueConstraint("product_id", "fetched_at", name="uq_price_history_product_time"),
sa.CheckConstraint("stock_status IN ('in_stock', 'out_of_stock', 'unknown')"),
sa.CheckConstraint("fetch_method IN ('http', 'playwright')"),
sa.CheckConstraint("fetch_status IN ('success', 'partial', 'failed')"),
)
op.create_index("ix_price_history_product_id", "price_history", ["product_id"], unique=False)
op.create_index("ix_price_history_fetched_at", "price_history", ["fetched_at"], unique=False)
op.create_table(
"product_images",
sa.Column("id", sa.Integer(), primary_key=True, autoincrement=True),
sa.Column("product_id", sa.Integer(), nullable=False),
sa.Column("image_url", sa.Text(), nullable=False),
sa.Column("position", sa.Integer(), nullable=False),
sa.ForeignKeyConstraint(["product_id"], ["products.id"], ondelete="CASCADE"),
sa.UniqueConstraint("product_id", "image_url", name="uq_product_image_url"),
)
op.create_index("ix_product_image_product_id", "product_images", ["product_id"], unique=False)
op.create_table(
"product_specs",
sa.Column("id", sa.Integer(), primary_key=True, autoincrement=True),
sa.Column("product_id", sa.Integer(), nullable=False),
sa.Column("spec_key", sa.String(length=200), nullable=False),
sa.Column("spec_value", sa.Text(), nullable=False),
sa.ForeignKeyConstraint(["product_id"], ["products.id"], ondelete="CASCADE"),
sa.UniqueConstraint("product_id", "spec_key", name="uq_product_spec_key"),
)
op.create_index("ix_product_spec_product_id", "product_specs", ["product_id"], unique=False)
op.create_index("ix_product_spec_key", "product_specs", ["spec_key"], unique=False)
op.create_table(
"scraping_logs",
sa.Column("id", sa.Integer(), primary_key=True, autoincrement=True),
sa.Column("product_id", sa.Integer(), nullable=True),
sa.Column("url", sa.Text(), nullable=False),
sa.Column("source", sa.String(length=50), nullable=False),
sa.Column("reference", sa.String(length=100), nullable=True),
sa.Column("fetch_method", sa.String(length=20), nullable=False),
sa.Column("fetch_status", sa.String(length=20), nullable=False),
sa.Column("fetched_at", sa.TIMESTAMP(), nullable=False),
sa.Column("duration_ms", sa.Integer(), nullable=True),
sa.Column("html_size_bytes", sa.Integer(), nullable=True),
sa.Column("errors", postgresql.JSONB(), nullable=True),
sa.Column("notes", postgresql.JSONB(), nullable=True),
sa.ForeignKeyConstraint(["product_id"], ["products.id"], ondelete="SET NULL"),
sa.CheckConstraint("fetch_method IN ('http', 'playwright')"),
sa.CheckConstraint("fetch_status IN ('success', 'partial', 'failed')"),
)
op.create_index("ix_scraping_log_product_id", "scraping_logs", ["product_id"], unique=False)
op.create_index("ix_scraping_log_source", "scraping_logs", ["source"], unique=False)
op.create_index("ix_scraping_log_fetched_at", "scraping_logs", ["fetched_at"], unique=False)
op.create_index("ix_scraping_log_fetch_status", "scraping_logs", ["fetch_status"], unique=False)
def downgrade() -> None:
op.drop_index("ix_scraping_log_fetch_status", table_name="scraping_logs")
op.drop_index("ix_scraping_log_fetched_at", table_name="scraping_logs")
op.drop_index("ix_scraping_log_source", table_name="scraping_logs")
op.drop_index("ix_scraping_log_product_id", table_name="scraping_logs")
op.drop_table("scraping_logs")
op.drop_index("ix_product_spec_key", table_name="product_specs")
op.drop_index("ix_product_spec_product_id", table_name="product_specs")
op.drop_table("product_specs")
op.drop_index("ix_product_image_product_id", table_name="product_images")
op.drop_table("product_images")
op.drop_index("ix_price_history_fetched_at", table_name="price_history")
op.drop_index("ix_price_history_product_id", table_name="price_history")
op.drop_table("price_history")
op.drop_index("ix_product_last_updated", table_name="products")
op.drop_index("ix_product_reference", table_name="products")
op.drop_index("ix_product_source", table_name="products")
op.drop_table("products")

320
pricewatch/app/db/models.py Executable file

@@ -0,0 +1,320 @@
"""
Modèles SQLAlchemy pour PriceWatch Phase 2.
Schéma normalisé pour persistence PostgreSQL:
- products: Catalogue produits (déduplication sur source + reference)
- price_history: Historique prix time-series
- product_images: Images produit (N par produit)
- product_specs: Caractéristiques produit (key-value)
- scraping_logs: Logs observabilité pour debugging
Justification technique:
- Normalisation: products séparée de price_history (catalogue vs time-series)
- Clé naturelle: (source, reference) comme unique constraint (ASIN Amazon, etc.)
- Pas de JSONB pour données structurées: tables séparées pour images/specs
- JSONB uniquement pour données variables: errors, notes dans logs
"""
from datetime import datetime
from decimal import Decimal
from typing import List, Optional
from sqlalchemy import (
TIMESTAMP,
CheckConstraint,
Column,
ForeignKey,
Index,
Integer,
JSON,
Numeric,
String,
Text,
UniqueConstraint,
)
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship
class Base(DeclarativeBase):
"""Base class pour tous les modèles SQLAlchemy."""
pass
class Product(Base):
"""
Catalogue produits (1 ligne par produit unique).
Clé naturelle: (source, reference) - Ex: (amazon, B08N5WRWNW)
Mise à jour: title, category, url à chaque scraping
Historique prix: relation 1-N vers PriceHistory
"""
__tablename__ = "products"
# Primary key
id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
# Natural key (unique)
source: Mapped[str] = mapped_column(
String(50), nullable=False, comment="Store ID (amazon, cdiscount, etc.)"
)
reference: Mapped[str] = mapped_column(
String(100), nullable=False, comment="Product reference (ASIN, SKU, etc.)"
)
# Product metadata
url: Mapped[str] = mapped_column(Text, nullable=False, comment="Canonical product URL")
title: Mapped[Optional[str]] = mapped_column(Text, nullable=True, comment="Product title")
category: Mapped[Optional[str]] = mapped_column(
Text, nullable=True, comment="Product category (breadcrumb)"
)
currency: Mapped[Optional[str]] = mapped_column(
String(3), nullable=True, comment="Currency code (EUR, USD, GBP)"
)
# Timestamps
first_seen_at: Mapped[datetime] = mapped_column(
TIMESTAMP, nullable=False, default=datetime.utcnow, comment="First scraping timestamp"
)
last_updated_at: Mapped[datetime] = mapped_column(
TIMESTAMP,
nullable=False,
default=datetime.utcnow,
onupdate=datetime.utcnow,
comment="Last metadata update",
)
# Relationships
price_history: Mapped[List["PriceHistory"]] = relationship(
"PriceHistory", back_populates="product", cascade="all, delete-orphan"
)
images: Mapped[List["ProductImage"]] = relationship(
"ProductImage", back_populates="product", cascade="all, delete-orphan"
)
specs: Mapped[List["ProductSpec"]] = relationship(
"ProductSpec", back_populates="product", cascade="all, delete-orphan"
)
logs: Mapped[List["ScrapingLog"]] = relationship(
"ScrapingLog", back_populates="product", cascade="all, delete-orphan"
)
# Constraints
__table_args__ = (
UniqueConstraint("source", "reference", name="uq_product_source_reference"),
Index("ix_product_source", "source"),
Index("ix_product_reference", "reference"),
Index("ix_product_last_updated", "last_updated_at"),
)
def __repr__(self) -> str:
return f"<Product(id={self.id}, source={self.source}, reference={self.reference})>"
class PriceHistory(Base):
"""
Historique prix (time-series).
Une ligne par scraping réussi avec extraction prix.
Unique constraint sur (product_id, fetched_at) évite doublons.
"""
__tablename__ = "price_history"
# Primary key
id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
# Foreign key
product_id: Mapped[int] = mapped_column(
Integer, ForeignKey("products.id", ondelete="CASCADE"), nullable=False
)
# Price data
price: Mapped[Optional[Decimal]] = mapped_column(
Numeric(10, 2), nullable=True, comment="Product price"
)
shipping_cost: Mapped[Optional[Decimal]] = mapped_column(
Numeric(10, 2), nullable=True, comment="Shipping cost"
)
stock_status: Mapped[Optional[str]] = mapped_column(
String(20), nullable=True, comment="Stock status (in_stock, out_of_stock, unknown)"
)
# Fetch metadata
fetch_method: Mapped[str] = mapped_column(
String(20), nullable=False, comment="Fetch method (http, playwright)"
)
fetch_status: Mapped[str] = mapped_column(
String(20), nullable=False, comment="Fetch status (success, partial, failed)"
)
fetched_at: Mapped[datetime] = mapped_column(
TIMESTAMP, nullable=False, comment="Scraping timestamp"
)
# Relationship
product: Mapped["Product"] = relationship("Product", back_populates="price_history")
# Constraints
__table_args__ = (
UniqueConstraint("product_id", "fetched_at", name="uq_price_history_product_time"),
Index("ix_price_history_product_id", "product_id"),
Index("ix_price_history_fetched_at", "fetched_at"),
CheckConstraint("stock_status IN ('in_stock', 'out_of_stock', 'unknown')"),
CheckConstraint("fetch_method IN ('http', 'playwright')"),
CheckConstraint("fetch_status IN ('success', 'partial', 'failed')"),
)
def __repr__(self) -> str:
return f"<PriceHistory(id={self.id}, product_id={self.product_id}, price={self.price}, fetched_at={self.fetched_at})>"
class ProductImage(Base):
"""
Images produit (N images par produit).
Unique constraint sur (product_id, image_url) évite doublons.
Position permet de garder l'ordre des images.
"""
__tablename__ = "product_images"
# Primary key
id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
# Foreign key
product_id: Mapped[int] = mapped_column(
Integer, ForeignKey("products.id", ondelete="CASCADE"), nullable=False
)
# Image data
image_url: Mapped[str] = mapped_column(Text, nullable=False, comment="Image URL")
position: Mapped[int] = mapped_column(
Integer, nullable=False, default=0, comment="Image position (0=main)"
)
# Relationship
product: Mapped["Product"] = relationship("Product", back_populates="images")
# Constraints
__table_args__ = (
UniqueConstraint("product_id", "image_url", name="uq_product_image_url"),
Index("ix_product_image_product_id", "product_id"),
)
def __repr__(self) -> str:
return f"<ProductImage(id={self.id}, product_id={self.product_id}, position={self.position})>"
class ProductSpec(Base):
"""
Caractéristiques produit (key-value).
Unique constraint sur (product_id, spec_key) évite doublons.
Permet queries efficaces par clé.
"""
__tablename__ = "product_specs"
# Primary key
id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
# Foreign key
product_id: Mapped[int] = mapped_column(
Integer, ForeignKey("products.id", ondelete="CASCADE"), nullable=False
)
# Spec data
spec_key: Mapped[str] = mapped_column(
String(200), nullable=False, comment="Specification key (e.g., 'Brand', 'Color')"
)
spec_value: Mapped[str] = mapped_column(Text, nullable=False, comment="Specification value")
# Relationship
product: Mapped["Product"] = relationship("Product", back_populates="specs")
# Constraints
__table_args__ = (
UniqueConstraint("product_id", "spec_key", name="uq_product_spec_key"),
Index("ix_product_spec_product_id", "product_id"),
Index("ix_product_spec_key", "spec_key"),
)
def __repr__(self) -> str:
return f"<ProductSpec(id={self.id}, product_id={self.product_id}, key={self.spec_key})>"
class ScrapingLog(Base):
"""
Logs observabilité pour debugging.
FK optionnelle vers products (permet logs même si produit non créé).
JSONB pour errors/notes car structure variable.
Permet analytics: taux succès, durée moyenne, etc.
"""
__tablename__ = "scraping_logs"
# Primary key
id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
# Foreign key (optional)
product_id: Mapped[Optional[int]] = mapped_column(
Integer, ForeignKey("products.id", ondelete="SET NULL"), nullable=True
)
# Scraping metadata
url: Mapped[str] = mapped_column(Text, nullable=False, comment="Scraped URL")
source: Mapped[str] = mapped_column(
String(50), nullable=False, comment="Store ID (amazon, cdiscount, etc.)"
)
reference: Mapped[Optional[str]] = mapped_column(
String(100), nullable=True, comment="Product reference (if extracted)"
)
# Fetch metadata
fetch_method: Mapped[str] = mapped_column(
String(20), nullable=False, comment="Fetch method (http, playwright)"
)
fetch_status: Mapped[str] = mapped_column(
String(20), nullable=False, comment="Fetch status (success, partial, failed)"
)
fetched_at: Mapped[datetime] = mapped_column(
TIMESTAMP, nullable=False, default=datetime.utcnow, comment="Scraping timestamp"
)
# Performance metrics
duration_ms: Mapped[Optional[int]] = mapped_column(
Integer, nullable=True, comment="Fetch duration in milliseconds"
)
html_size_bytes: Mapped[Optional[int]] = mapped_column(
Integer, nullable=True, comment="HTML response size in bytes"
)
# Debug data (JSONB)
errors: Mapped[Optional[list[str]]] = mapped_column(
JSON().with_variant(JSONB, "postgresql"),
nullable=True,
comment="Error messages (list of strings)",
)
notes: Mapped[Optional[list[str]]] = mapped_column(
JSON().with_variant(JSONB, "postgresql"),
nullable=True,
comment="Debug notes (list of strings)",
)
# Relationship
product: Mapped[Optional["Product"]] = relationship("Product", back_populates="logs")
# Constraints
__table_args__ = (
Index("ix_scraping_log_product_id", "product_id"),
Index("ix_scraping_log_source", "source"),
Index("ix_scraping_log_fetched_at", "fetched_at"),
Index("ix_scraping_log_fetch_status", "fetch_status"),
CheckConstraint("fetch_method IN ('http', 'playwright')"),
CheckConstraint("fetch_status IN ('success', 'partial', 'failed')"),
)
def __repr__(self) -> str:
return f"<ScrapingLog(id={self.id}, url={self.url}, status={self.fetch_status}, fetched_at={self.fetched_at})>"

140
pricewatch/app/db/repository.py Executable file
View File

@@ -0,0 +1,140 @@
"""
Repository pattern pour la persistence SQLAlchemy.
Centralise les operations CRUD sur les modeles DB a partir d'un ProductSnapshot.
"""
from __future__ import annotations
from typing import Optional
from sqlalchemy.exc import SQLAlchemyError
from sqlalchemy.orm import Session
from pricewatch.app.core.logging import get_logger
from pricewatch.app.core.schema import ProductSnapshot
from pricewatch.app.db.models import PriceHistory, Product, ProductImage, ProductSpec, ScrapingLog
logger = get_logger("db.repository")
class ProductRepository:
"""Repository de persistence pour ProductSnapshot."""
def __init__(self, session: Session) -> None:
self.session = session
def get_or_create(self, source: str, reference: str, url: str) -> Product:
"""
Recuperer ou creer un produit par cle naturelle (source, reference).
"""
product = (
self.session.query(Product)
.filter(Product.source == source, Product.reference == reference)
.one_or_none()
)
if product:
return product
product = Product(source=source, reference=reference, url=url)
self.session.add(product)
self.session.flush()
return product
def update_product_metadata(self, product: Product, snapshot: ProductSnapshot) -> None:
"""Met a jour les metadonnees produit si disponibles."""
if snapshot.url:
product.url = snapshot.url
if snapshot.title:
product.title = snapshot.title
if snapshot.category:
product.category = snapshot.category
if snapshot.currency:
product.currency = snapshot.currency
def add_price_history(self, product: Product, snapshot: ProductSnapshot) -> Optional[PriceHistory]:
"""Ajoute une entree d'historique de prix si inexistante."""
existing = (
self.session.query(PriceHistory)
.filter(
PriceHistory.product_id == product.id,
PriceHistory.fetched_at == snapshot.fetched_at,
)
.one_or_none()
)
if existing:
return existing
price_entry = PriceHistory(
product_id=product.id,
price=snapshot.price,
shipping_cost=snapshot.shipping_cost,
stock_status=snapshot.stock_status,
fetch_method=snapshot.debug.method,
fetch_status=snapshot.debug.status,
fetched_at=snapshot.fetched_at,
)
self.session.add(price_entry)
return price_entry
def sync_images(self, product: Product, images: list[str]) -> None:
"""Synchronise les images (ajout des nouvelles)."""
existing_urls = {image.image_url for image in product.images}
for position, url in enumerate(images):
if url in existing_urls:
continue
self.session.add(ProductImage(product_id=product.id, image_url=url, position=position))
def sync_specs(self, product: Product, specs: dict[str, str]) -> None:
"""Synchronise les specs (upsert par cle)."""
existing_specs = {spec.spec_key: spec for spec in product.specs}
for key, value in specs.items():
if key in existing_specs:
existing_specs[key].spec_value = value
else:
self.session.add(ProductSpec(product_id=product.id, spec_key=key, spec_value=value))
def add_scraping_log(self, snapshot: ProductSnapshot, product_id: Optional[int]) -> ScrapingLog:
"""Ajoute un log de scraping pour observabilite."""
log_entry = ScrapingLog(
product_id=product_id,
url=snapshot.url,
source=snapshot.source,
reference=snapshot.reference,
fetch_method=snapshot.debug.method,
fetch_status=snapshot.debug.status,
fetched_at=snapshot.fetched_at,
duration_ms=snapshot.debug.duration_ms,
html_size_bytes=snapshot.debug.html_size_bytes,
errors=snapshot.debug.errors or None,
notes=snapshot.debug.notes or None,
)
self.session.add(log_entry)
return log_entry
def save_snapshot(self, snapshot: ProductSnapshot) -> Optional[int]:
"""
Persiste un ProductSnapshot complet dans la base.
Retourne l'id produit ou None si reference absente.
"""
if not snapshot.reference:
logger.warning("Reference absente: persistence ignoree")
self.add_scraping_log(snapshot, product_id=None)
return None
product = self.get_or_create(snapshot.source, snapshot.reference, snapshot.url)
self.update_product_metadata(product, snapshot)
self.add_price_history(product, snapshot)
self.sync_images(product, snapshot.images)
self.sync_specs(product, snapshot.specs)
self.add_scraping_log(snapshot, product_id=product.id)
return product.id
def safe_save_snapshot(self, snapshot: ProductSnapshot) -> Optional[int]:
"""Sauvegarde avec gestion d'erreur SQLAlchemy."""
try:
return self.save_snapshot(snapshot)
except SQLAlchemyError as exc:
logger.error(f"Erreur SQLAlchemy: {exc}")
raise
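A minimal caller-side sketch, mirroring how the pipeline further down drives the repository; commit/rollback intentionally stays with the caller, since the repository only adds and flushes:

```python
# Sketch: persisting an already-parsed ProductSnapshot through the repository.
# Assumes get_config() / get_session() from the config and connection modules.
from pricewatch.app.core.config import get_config
from pricewatch.app.db.connection import get_session
from pricewatch.app.db.repository import ProductRepository


def persist(snapshot) -> int | None:
    config = get_config()
    with get_session(config) as session:
        repo = ProductRepository(session)
        product_id = repo.safe_save_snapshot(snapshot)
        session.commit()  # the repository only add()/flush()es; committing is the caller's job
        return product_id
```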

View File

@@ -0,0 +1,3 @@
from pricewatch.app.scraping.pipeline import ScrapingPipeline
__all__ = ["ScrapingPipeline"]

Binary file not shown.

View File

@@ -0,0 +1,52 @@
"""
Pipeline de persistence pour les snapshots de scraping.
Ne doit jamais bloquer le pipeline principal si la DB est indisponible.
"""
from __future__ import annotations
from typing import Optional
from sqlalchemy.exc import SQLAlchemyError
from pricewatch.app.core.config import AppConfig, get_config
from pricewatch.app.core.logging import get_logger
from pricewatch.app.core.schema import ProductSnapshot
from pricewatch.app.db.connection import get_session
from pricewatch.app.db.repository import ProductRepository
logger = get_logger("scraping.pipeline")
class ScrapingPipeline:
"""Orchestration de persistence DB pour un ProductSnapshot."""
def __init__(self, config: Optional[AppConfig] = None) -> None:
self.config = config
def process_snapshot(self, snapshot: ProductSnapshot, save_to_db: bool = True) -> Optional[int]:
"""
Persiste un snapshot en base si active.
Retourne l'id produit si sauve, sinon None.
"""
app_config = self.config or get_config()
if not save_to_db or not app_config.enable_db:
logger.debug("Persistence DB desactivee")
return None
try:
with get_session(app_config) as session:
repo = ProductRepository(session)
product_id = repo.safe_save_snapshot(snapshot)
session.commit()
return product_id
except SQLAlchemyError as exc:
snapshot.add_note(f"Persistence DB echouee: {exc}")
logger.error(f"Persistence DB echouee: {exc}")
return None
except Exception as exc:
snapshot.add_note(f"Erreur pipeline DB: {exc}")
logger.error(f"Erreur pipeline DB: {exc}")
return None
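Minimal usage sketch; the snapshot fields below are illustrative, and a `None` return covers both "DB disabled" and "DB unreachable" without interrupting the scraping flow:

```python
# Sketch: non-blocking persistence of a snapshot via the pipeline.
from pricewatch.app.core.schema import DebugInfo, DebugStatus, FetchMethod, ProductSnapshot
from pricewatch.app.scraping.pipeline import ScrapingPipeline

snapshot = ProductSnapshot(
    source="amazon",
    url="https://www.amazon.fr/dp/B08N5WRWNW",
    title="Exemple",
    price=199.99,
    currency="EUR",
    reference="B08N5WRWNW",
    debug=DebugInfo(method=FetchMethod.HTTP, status=DebugStatus.SUCCESS),
)

product_id = ScrapingPipeline().process_snapshot(snapshot, save_to_db=True)
# product_id is None when enable_db is off or the DB errors out; the failure is
# logged and attached to the snapshot notes instead of being raised.
```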

View File

@@ -214,6 +214,18 @@ class AmazonStore(BaseStore):
except ValueError:
continue
# Fallback: chercher les spans séparés a-price-whole et a-price-fraction
whole = soup.select_one("span.a-price-whole")
fraction = soup.select_one("span.a-price-fraction")
if whole and fraction:
whole_text = whole.get_text(strip=True)
fraction_text = fraction.get_text(strip=True)
try:
price_str = f"{whole_text}.{fraction_text}"
return float(price_str)
except ValueError:
pass
debug.errors.append("Prix non trouvé")
return None
@@ -270,6 +282,14 @@ class AmazonStore(BaseStore):
if url and url.startswith("http"):
images.append(url)
# Fallback: chercher tous les img tags si aucune image trouvée
if not images:
all_imgs = soup.find_all("img")
for img in all_imgs:
url = img.get("src") or img.get("data-src")
if url and url.startswith("http"):
images.append(url)
return list(set(images)) # Dédupliquer
def _extract_category(self, soup: BeautifulSoup, debug: DebugInfo) -> Optional[str]:
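To illustrate the new split-price fallback in isolation, here is a standalone BeautifulSoup snippet (not the store code itself) showing how the two spans are recombined:

```python
# Standalone illustration of the a-price-whole / a-price-fraction fallback.
from bs4 import BeautifulSoup

html = '<span class="a-price-whole">199</span><span class="a-price-fraction">99</span>'
soup = BeautifulSoup(html, "html.parser")

whole = soup.select_one("span.a-price-whole").get_text(strip=True)
fraction = soup.select_one("span.a-price-fraction").get_text(strip=True)
price = float(f"{whole}.{fraction}")  # 199.99
# A whole part containing thousands separators (e.g. "1 299") would raise ValueError
# and fall through to the "Prix non trouvé" error, as in the store code above.
```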

View File

@@ -0,0 +1,8 @@
"""
Module tasks pour les jobs RQ.
"""
from pricewatch.app.tasks.scrape import scrape_product
from pricewatch.app.tasks.scheduler import ScrapingScheduler
__all__ = ["scrape_product", "ScrapingScheduler"]

View File

@@ -0,0 +1,75 @@
"""
Planification des jobs de scraping via RQ Scheduler.
"""
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional
import redis
from rq import Queue
from rq_scheduler import Scheduler
from pricewatch.app.core.config import AppConfig, get_config
from pricewatch.app.core.logging import get_logger
from pricewatch.app.tasks.scrape import scrape_product
logger = get_logger("tasks.scheduler")
@dataclass
class ScheduledJobInfo:
"""Infos de retour pour un job planifie."""
job_id: str
next_run: datetime
class ScrapingScheduler:
"""Scheduler pour les jobs de scraping avec RQ."""
def __init__(self, config: Optional[AppConfig] = None, queue_name: str = "default") -> None:
self.config = config or get_config()
self.redis = redis.from_url(self.config.redis.url)
self.queue = Queue(queue_name, connection=self.redis)
self.scheduler = Scheduler(queue=self.queue, connection=self.redis)
def enqueue_immediate(
self,
url: str,
use_playwright: Optional[bool] = None,
save_db: bool = True,
):
"""Enqueue un job immediat."""
job = self.queue.enqueue(
scrape_product,
url,
use_playwright=use_playwright,
save_db=save_db,
)
logger.info(f"Job enqueued: {job.id}")
return job
def schedule_product(
self,
url: str,
interval_hours: int = 24,
use_playwright: Optional[bool] = None,
save_db: bool = True,
) -> ScheduledJobInfo:
"""Planifie un scraping recurrent (intervalle en heures)."""
interval_seconds = int(timedelta(hours=interval_hours).total_seconds())
next_run = datetime.now(timezone.utc) + timedelta(seconds=interval_seconds)
job = self.scheduler.schedule(
scheduled_time=next_run,
func=scrape_product,
args=[url],
kwargs={"use_playwright": use_playwright, "save_db": save_db},
interval=interval_seconds,
repeat=None,
)
logger.info(f"Job planifie: {job.id}, prochaine execution: {next_run.isoformat()}")
return ScheduledJobInfo(job_id=job.id, next_run=next_run)
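Minimal usage sketch, assuming a Redis instance reachable through the app config and an `rq worker` / `rqscheduler` process consuming the queue:

```python
# Sketch: immediate and recurring scraping jobs via the scheduler above.
from pricewatch.app.tasks.scheduler import ScrapingScheduler

scheduler = ScrapingScheduler(queue_name="default")

# One-off job, picked up by any worker listening on "default":
job = scheduler.enqueue_immediate("https://www.amazon.fr/dp/B08N5WRWNW", save_db=True)
print(job.id)

# Recurring job every 12 hours, executed by the rq-scheduler process:
info = scheduler.schedule_product("https://www.amazon.fr/dp/B08N5WRWNW", interval_hours=12)
print(info.job_id, info.next_run.isoformat())
```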

160
pricewatch/app/tasks/scrape.py Executable file
View File

@@ -0,0 +1,160 @@
"""
Tache de scraping asynchrone pour RQ.
"""
from __future__ import annotations
from typing import Any, Optional
from pricewatch.app.core.config import AppConfig, get_config
from pricewatch.app.core.logging import get_logger
from pricewatch.app.core.registry import get_registry
from pricewatch.app.core.schema import DebugInfo, DebugStatus, FetchMethod, ProductSnapshot
from pricewatch.app.scraping.http_fetch import fetch_http
from pricewatch.app.scraping.pipeline import ScrapingPipeline
from pricewatch.app.scraping.pw_fetch import fetch_playwright
from pricewatch.app.stores.aliexpress.store import AliexpressStore
from pricewatch.app.stores.amazon.store import AmazonStore
from pricewatch.app.stores.backmarket.store import BackmarketStore
from pricewatch.app.stores.cdiscount.store import CdiscountStore
logger = get_logger("tasks.scrape")
def setup_stores() -> None:
"""Enregistre les stores disponibles si besoin."""
registry = get_registry()
if registry.list_stores():
return
registry.register(AmazonStore())
registry.register(CdiscountStore())
registry.register(BackmarketStore())
registry.register(AliexpressStore())
def scrape_product(
url: str,
use_playwright: Optional[bool] = None,
save_db: bool = True,
save_html: bool = False,
save_screenshot: bool = False,
headful: bool = False,
timeout_ms: Optional[int] = None,
) -> dict[str, Any]:
"""
Scrape un produit et persiste en base via ScrapingPipeline.
Retourne un dict avec success, product_id, snapshot, error.
"""
config: AppConfig = get_config()
setup_stores()
if use_playwright is None:
use_playwright = config.default_use_playwright
if timeout_ms is None:
timeout_ms = config.default_playwright_timeout
registry = get_registry()
store = registry.detect_store(url)
if not store:
snapshot = ProductSnapshot(
source="unknown",
url=url,
debug=DebugInfo(
method=FetchMethod.HTTP,
status=DebugStatus.FAILED,
errors=["Aucun store detecte"],
),
)
ScrapingPipeline(config=config).process_snapshot(snapshot, save_to_db=save_db)
return {"success": False, "product_id": None, "snapshot": snapshot, "error": "store"}
canonical_url = store.canonicalize(url)
html = None
fetch_method = FetchMethod.HTTP
fetch_error = None
duration_ms = None
html_size_bytes = None
pw_result = None
http_result = fetch_http(canonical_url)
duration_ms = http_result.duration_ms
if http_result.success:
html = http_result.html
fetch_method = FetchMethod.HTTP
elif use_playwright:
pw_result = fetch_playwright(
canonical_url,
headless=not headful,
timeout_ms=timeout_ms,
save_screenshot=save_screenshot,
)
duration_ms = pw_result.duration_ms
if pw_result.success:
html = pw_result.html
fetch_method = FetchMethod.PLAYWRIGHT
else:
fetch_error = pw_result.error
else:
fetch_error = http_result.error
if html:
html_size_bytes = len(html.encode("utf-8"))
if save_html:
from pricewatch.app.core.io import save_debug_html
ref = store.extract_reference(canonical_url) or "unknown"
save_debug_html(html, f"{store.store_id}_{ref}")
if save_screenshot and fetch_method == FetchMethod.PLAYWRIGHT and pw_result:
from pricewatch.app.core.io import save_debug_screenshot
if pw_result and pw_result.screenshot:
ref = store.extract_reference(canonical_url) or "unknown"
save_debug_screenshot(pw_result.screenshot, f"{store.store_id}_{ref}")
try:
snapshot = store.parse(html, canonical_url)
snapshot.debug.method = fetch_method
snapshot.debug.duration_ms = duration_ms
snapshot.debug.html_size_bytes = html_size_bytes
success = snapshot.debug.status != DebugStatus.FAILED
except Exception as exc:
snapshot = ProductSnapshot(
source=store.store_id,
url=canonical_url,
debug=DebugInfo(
method=fetch_method,
status=DebugStatus.FAILED,
errors=[f"Parsing failed: {exc}"],
duration_ms=duration_ms,
html_size_bytes=html_size_bytes,
),
)
success = False
fetch_error = str(exc)
else:
snapshot = ProductSnapshot(
source=store.store_id,
url=canonical_url,
debug=DebugInfo(
method=fetch_method,
status=DebugStatus.FAILED,
errors=[f"Fetch failed: {fetch_error or 'Unknown error'}"],
duration_ms=duration_ms,
),
)
success = False
product_id = ScrapingPipeline(config=config).process_snapshot(snapshot, save_to_db=save_db)
return {
"success": success,
"product_id": product_id,
"snapshot": snapshot,
"error": fetch_error,
}
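Since the task is a plain function, it can also be exercised synchronously without Redis; a minimal sketch (HTTP-only, no DB write), with the caveat that it performs a real network fetch and relies on the app config being loadable:

```python
# Sketch: calling the RQ task directly, bypassing the queue.
from pricewatch.app.tasks.scrape import scrape_product

result = scrape_product(
    "https://www.amazon.fr/dp/B08N5WRWNW",
    use_playwright=False,  # stay on the HTTP path
    save_db=False,         # skip persistence
)
print(result["success"], result["product_id"], result["error"])
print(result["snapshot"].title, result["snapshot"].price)
```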

View File

@@ -44,6 +44,19 @@ dependencies = [
# Date/time utilities
"python-dateutil>=2.8.2",
# Database (Phase 2)
"sqlalchemy>=2.0.0",
"psycopg2-binary>=2.9.0",
"alembic>=1.13.0",
# Configuration (Phase 2)
"python-dotenv>=1.0.0",
# Worker/Queue (Phase 2)
"redis>=5.0.0",
"rq>=1.15.0",
"rq-scheduler>=0.13.0",
]
[project.optional-dependencies]

106
tests/cli/test_run_db.py Executable file
View File

@@ -0,0 +1,106 @@
"""
Tests end-to-end pour la commande CLI run avec persistence DB.
"""
from dataclasses import dataclass
from pathlib import Path
from typer.testing import CliRunner
from pricewatch.app.cli import main as cli_main
from pricewatch.app.core.registry import get_registry
from pricewatch.app.core.schema import DebugInfo, DebugStatus, FetchMethod, ProductSnapshot
from pricewatch.app.db.connection import get_session, init_db, reset_engine
from pricewatch.app.db.models import Product
from pricewatch.app.stores.base import BaseStore
@dataclass
class FakeDbConfig:
url: str
@dataclass
class FakeAppConfig:
db: FakeDbConfig
debug: bool = False
enable_db: bool = True
class DummyStore(BaseStore):
def __init__(self) -> None:
super().__init__(store_id="dummy")
def match(self, url: str) -> float:
return 1.0 if "example.com" in url else 0.0
def canonicalize(self, url: str) -> str:
return url
def extract_reference(self, url: str) -> str | None:
return "REF123"
def parse(self, html: str, url: str) -> ProductSnapshot:
return ProductSnapshot(
source=self.store_id,
url=url,
title="Produit dummy",
price=9.99,
currency="EUR",
reference="REF123",
debug=DebugInfo(method=FetchMethod.HTTP, status=DebugStatus.SUCCESS),
)
class DummyFetchResult:
def __init__(self, html: str) -> None:
self.success = True
self.html = html
self.error = None
def test_cli_run_persists_db(tmp_path, monkeypatch):
"""Le CLI run persiste en base quand --save-db est active."""
reset_engine()
db_path = tmp_path / "test.db"
config = FakeAppConfig(db=FakeDbConfig(url=f"sqlite:///{db_path}"))
init_db(config)
yaml_path = tmp_path / "config.yaml"
out_path = tmp_path / "out.json"
yaml_path.write_text(
"""
urls:
- "https://example.com/product"
options:
use_playwright: false
save_html: false
save_screenshot: false
""",
encoding="utf-8",
)
registry = get_registry()
previous_stores = list(registry._stores)
registry._stores = []
registry.register(DummyStore())
monkeypatch.setattr(cli_main, "get_config", lambda: config)
monkeypatch.setattr(cli_main, "setup_stores", lambda: None)
monkeypatch.setattr(cli_main, "fetch_http", lambda url: DummyFetchResult("<html></html>"))
runner = CliRunner()
try:
result = runner.invoke(
cli_main.app,
["run", "--yaml", str(yaml_path), "--out", str(out_path), "--save-db"],
)
finally:
registry._stores = previous_stores
reset_engine()
assert result.exit_code == 0
assert out_path.exists()
with get_session(config) as session:
assert session.query(Product).count() == 1

Binary file not shown.

462
tests/core/test_io.py Executable file
View File

@@ -0,0 +1,462 @@
"""
Tests pour pricewatch.app.core.io
Teste la lecture/écriture YAML/JSON et les fonctions de sauvegarde debug.
"""
import json
import tempfile
from datetime import datetime
from pathlib import Path
import pytest
import yaml
from pricewatch.app.core.io import (
ScrapingConfig,
ScrapingOptions,
read_json_results,
read_yaml_config,
save_debug_html,
save_debug_screenshot,
write_json_results,
)
from pricewatch.app.core.schema import (
DebugInfo,
DebugStatus,
FetchMethod,
ProductSnapshot,
StockStatus,
)
class TestScrapingOptions:
"""Tests pour le modèle ScrapingOptions."""
def test_default_values(self):
"""Les valeurs par défaut sont correctes."""
options = ScrapingOptions()
assert options.use_playwright is True
assert options.headful is False
assert options.save_html is True
assert options.save_screenshot is True
assert options.timeout_ms == 60000
def test_custom_values(self):
"""Les valeurs personnalisées sont acceptées."""
options = ScrapingOptions(
use_playwright=False,
headful=True,
save_html=False,
save_screenshot=False,
timeout_ms=30000,
)
assert options.use_playwright is False
assert options.headful is True
assert options.save_html is False
assert options.save_screenshot is False
assert options.timeout_ms == 30000
def test_timeout_validation_min(self):
"""Timeout inférieur à 1000ms est rejeté."""
with pytest.raises(ValueError):
ScrapingOptions(timeout_ms=500)
def test_timeout_validation_valid(self):
"""Timeout >= 1000ms est accepté."""
options = ScrapingOptions(timeout_ms=1000)
assert options.timeout_ms == 1000
class TestScrapingConfig:
"""Tests pour le modèle ScrapingConfig."""
def test_minimal_config(self):
"""Config minimale avec URLs uniquement."""
config = ScrapingConfig(urls=["https://example.com"])
assert len(config.urls) == 1
assert config.urls[0] == "https://example.com"
assert isinstance(config.options, ScrapingOptions)
def test_config_with_options(self):
"""Config avec URLs et options."""
options = ScrapingOptions(use_playwright=False, timeout_ms=10000)
config = ScrapingConfig(
urls=["https://example.com", "https://test.com"], options=options
)
assert len(config.urls) == 2
assert config.options.use_playwright is False
assert config.options.timeout_ms == 10000
def test_validate_urls_empty_list(self):
"""Liste d'URLs vide est rejetée."""
with pytest.raises(ValueError, match="Au moins une URL"):
ScrapingConfig(urls=[])
def test_validate_urls_strips_whitespace(self):
"""Les espaces sont nettoyés."""
config = ScrapingConfig(urls=[" https://example.com ", "https://test.com"])
assert config.urls == ["https://example.com", "https://test.com"]
def test_validate_urls_removes_empty(self):
"""Les URLs vides sont supprimées."""
config = ScrapingConfig(
urls=["https://example.com", "", " ", "https://test.com"]
)
assert len(config.urls) == 2
assert config.urls == ["https://example.com", "https://test.com"]
def test_validate_urls_all_empty(self):
"""Si toutes les URLs sont vides, erreur."""
with pytest.raises(ValueError, match="Aucune URL valide"):
ScrapingConfig(urls=["", " ", "\t"])
class TestReadYamlConfig:
"""Tests pour read_yaml_config()."""
def test_read_valid_yaml(self, tmp_path):
"""Lit un fichier YAML valide."""
yaml_path = tmp_path / "config.yaml"
yaml_content = {
"urls": ["https://example.com", "https://test.com"],
"options": {"use_playwright": False, "timeout_ms": 30000},
}
with open(yaml_path, "w") as f:
yaml.dump(yaml_content, f)
config = read_yaml_config(yaml_path)
assert len(config.urls) == 2
assert config.urls[0] == "https://example.com"
assert config.options.use_playwright is False
assert config.options.timeout_ms == 30000
def test_read_yaml_minimal(self, tmp_path):
"""Lit un YAML minimal (URLs uniquement)."""
yaml_path = tmp_path / "config.yaml"
yaml_content = {"urls": ["https://example.com"]}
with open(yaml_path, "w") as f:
yaml.dump(yaml_content, f)
config = read_yaml_config(yaml_path)
assert len(config.urls) == 1
# Options par défaut
assert config.options.use_playwright is True
assert config.options.timeout_ms == 60000
def test_read_yaml_file_not_found(self, tmp_path):
"""Fichier introuvable lève FileNotFoundError."""
yaml_path = tmp_path / "nonexistent.yaml"
with pytest.raises(FileNotFoundError):
read_yaml_config(yaml_path)
def test_read_yaml_empty_file(self, tmp_path):
"""Fichier YAML vide lève ValueError."""
yaml_path = tmp_path / "empty.yaml"
yaml_path.write_text("")
with pytest.raises(ValueError, match="Fichier YAML vide"):
read_yaml_config(yaml_path)
def test_read_yaml_invalid_syntax(self, tmp_path):
"""YAML avec syntaxe invalide lève ValueError."""
yaml_path = tmp_path / "invalid.yaml"
yaml_path.write_text("urls: [invalid yaml syntax")
with pytest.raises(ValueError, match="YAML invalide"):
read_yaml_config(yaml_path)
def test_read_yaml_missing_urls(self, tmp_path):
"""YAML sans champ 'urls' lève erreur de validation."""
yaml_path = tmp_path / "config.yaml"
yaml_content = {"options": {"use_playwright": False}}
with open(yaml_path, "w") as f:
yaml.dump(yaml_content, f)
with pytest.raises(Exception): # Pydantic validation error
read_yaml_config(yaml_path)
def test_read_yaml_accepts_path_string(self, tmp_path):
"""Accepte un string comme chemin."""
yaml_path = tmp_path / "config.yaml"
yaml_content = {"urls": ["https://example.com"]}
with open(yaml_path, "w") as f:
yaml.dump(yaml_content, f)
config = read_yaml_config(str(yaml_path))
assert len(config.urls) == 1
class TestWriteJsonResults:
"""Tests pour write_json_results()."""
@pytest.fixture
def sample_snapshot(self) -> ProductSnapshot:
"""Fixture: ProductSnapshot exemple."""
return ProductSnapshot(
source="test",
url="https://example.com/product",
fetched_at=datetime(2024, 1, 1, 12, 0, 0),
title="Test Product",
price=99.99,
currency="EUR",
stock_status=StockStatus.IN_STOCK,
reference="TEST123",
images=["https://example.com/img1.jpg"],
category="Test Category",
specs={"Brand": "TestBrand"},
debug=DebugInfo(
method=FetchMethod.HTTP,
status=DebugStatus.SUCCESS,
errors=[],
notes=[],
),
)
def test_write_single_snapshot(self, tmp_path, sample_snapshot):
"""Écrit un seul snapshot."""
json_path = tmp_path / "results.json"
write_json_results([sample_snapshot], json_path)
assert json_path.exists()
# Vérifier le contenu
with open(json_path) as f:
data = json.load(f)
assert isinstance(data, list)
assert len(data) == 1
assert data[0]["source"] == "test"
assert data[0]["title"] == "Test Product"
def test_write_multiple_snapshots(self, tmp_path, sample_snapshot):
"""Écrit plusieurs snapshots."""
snapshot2 = ProductSnapshot(
source="test2",
url="https://example.com/product2",
fetched_at=datetime(2024, 1, 2, 12, 0, 0),
title="Test Product 2",
price=49.99,
currency="EUR",
stock_status=StockStatus.OUT_OF_STOCK,
debug=DebugInfo(
method=FetchMethod.PLAYWRIGHT,
status=DebugStatus.PARTIAL,
errors=["Test error"],
notes=[],
),
)
json_path = tmp_path / "results.json"
write_json_results([sample_snapshot, snapshot2], json_path)
with open(json_path) as f:
data = json.load(f)
assert len(data) == 2
assert data[0]["source"] == "test"
assert data[1]["source"] == "test2"
def test_write_creates_parent_dirs(self, tmp_path, sample_snapshot):
"""Crée les dossiers parents si nécessaire."""
json_path = tmp_path / "sub" / "dir" / "results.json"
write_json_results([sample_snapshot], json_path)
assert json_path.exists()
assert json_path.parent.exists()
def test_write_empty_list(self, tmp_path):
"""Écrit une liste vide."""
json_path = tmp_path / "empty.json"
write_json_results([], json_path)
assert json_path.exists()
with open(json_path) as f:
data = json.load(f)
assert data == []
def test_write_indent_control(self, tmp_path, sample_snapshot):
"""Contrôle l'indentation."""
# Avec indent
json_path1 = tmp_path / "pretty.json"
write_json_results([sample_snapshot], json_path1, indent=2)
content1 = json_path1.read_text()
assert "\n" in content1 # Pretty-printed
# Sans indent (compact)
json_path2 = tmp_path / "compact.json"
write_json_results([sample_snapshot], json_path2, indent=None)
content2 = json_path2.read_text()
assert len(content2) < len(content1) # Plus compact
def test_write_accepts_path_string(self, tmp_path, sample_snapshot):
"""Accepte un string comme chemin."""
json_path = tmp_path / "results.json"
write_json_results([sample_snapshot], str(json_path))
assert json_path.exists()
class TestReadJsonResults:
"""Tests pour read_json_results()."""
@pytest.fixture
def json_file_with_snapshot(self, tmp_path) -> Path:
"""Fixture: Fichier JSON avec un snapshot."""
json_path = tmp_path / "results.json"
snapshot_data = {
"source": "test",
"url": "https://example.com/product",
"fetched_at": "2024-01-01T12:00:00",
"title": "Test Product",
"price": 99.99,
"currency": "EUR",
"shipping_cost": None,
"stock_status": "in_stock",
"reference": "TEST123",
"images": ["https://example.com/img.jpg"],
"category": "Test",
"specs": {"Brand": "Test"},
"debug": {
"method": "http",
"status": "success",
"errors": [],
"notes": [],
"duration_ms": None,
"html_size_bytes": None,
},
}
with open(json_path, "w") as f:
json.dump([snapshot_data], f)
return json_path
def test_read_single_snapshot(self, json_file_with_snapshot):
"""Lit un fichier avec un snapshot."""
snapshots = read_json_results(json_file_with_snapshot)
assert len(snapshots) == 1
assert isinstance(snapshots[0], ProductSnapshot)
assert snapshots[0].source == "test"
assert snapshots[0].title == "Test Product"
assert snapshots[0].price == 99.99
def test_read_file_not_found(self, tmp_path):
"""Fichier introuvable lève FileNotFoundError."""
json_path = tmp_path / "nonexistent.json"
with pytest.raises(FileNotFoundError):
read_json_results(json_path)
def test_read_invalid_json(self, tmp_path):
"""JSON invalide lève ValueError."""
json_path = tmp_path / "invalid.json"
json_path.write_text("{invalid json")
with pytest.raises(ValueError, match="JSON invalide"):
read_json_results(json_path)
def test_read_not_a_list(self, tmp_path):
"""JSON qui n'est pas une liste lève ValueError."""
json_path = tmp_path / "notlist.json"
with open(json_path, "w") as f:
json.dump({"key": "value"}, f)
with pytest.raises(ValueError, match="doit contenir une liste"):
read_json_results(json_path)
def test_read_empty_list(self, tmp_path):
"""Liste vide est acceptée."""
json_path = tmp_path / "empty.json"
with open(json_path, "w") as f:
json.dump([], f)
snapshots = read_json_results(json_path)
assert snapshots == []
def test_read_accepts_path_string(self, json_file_with_snapshot):
"""Accepte un string comme chemin."""
snapshots = read_json_results(str(json_file_with_snapshot))
assert len(snapshots) == 1
class TestSaveDebugHtml:
"""Tests pour save_debug_html()."""
def test_save_html_default_dir(self, tmp_path, monkeypatch):
"""Sauvegarde HTML dans le dossier par défaut."""
# Changer le répertoire de travail pour le test
monkeypatch.chdir(tmp_path)
html = "<html><body>Test</body></html>"
result_path = save_debug_html(html, "test_page")
assert result_path.exists()
assert result_path.name == "test_page.html"
assert result_path.read_text(encoding="utf-8") == html
def test_save_html_custom_dir(self, tmp_path):
"""Sauvegarde HTML dans un dossier personnalisé."""
output_dir = tmp_path / "debug_html"
html = "<html><body>Test</body></html>"
result_path = save_debug_html(html, "test_page", output_dir)
assert result_path.parent == output_dir
assert result_path.name == "test_page.html"
assert result_path.read_text(encoding="utf-8") == html
def test_save_html_creates_dir(self, tmp_path):
"""Crée le dossier de sortie s'il n'existe pas."""
output_dir = tmp_path / "sub" / "dir" / "html"
html = "<html><body>Test</body></html>"
result_path = save_debug_html(html, "test_page", output_dir)
assert output_dir.exists()
assert result_path.exists()
def test_save_html_large_content(self, tmp_path):
"""Sauvegarde du HTML volumineux."""
html = "<html><body>" + ("x" * 100000) + "</body></html>"
result_path = save_debug_html(html, "large_page", tmp_path)
assert result_path.exists()
assert len(result_path.read_text(encoding="utf-8")) == len(html)
class TestSaveDebugScreenshot:
"""Tests pour save_debug_screenshot()."""
def test_save_screenshot_default_dir(self, tmp_path, monkeypatch):
"""Sauvegarde screenshot dans le dossier par défaut."""
monkeypatch.chdir(tmp_path)
screenshot_bytes = b"\x89PNG fake image data"
result_path = save_debug_screenshot(screenshot_bytes, "test_screenshot")
assert result_path.exists()
assert result_path.name == "test_screenshot.png"
assert result_path.read_bytes() == screenshot_bytes
def test_save_screenshot_custom_dir(self, tmp_path):
"""Sauvegarde screenshot dans un dossier personnalisé."""
output_dir = tmp_path / "screenshots"
screenshot_bytes = b"\x89PNG fake image data"
result_path = save_debug_screenshot(screenshot_bytes, "test_screenshot", output_dir)
assert result_path.parent == output_dir
assert result_path.name == "test_screenshot.png"
assert result_path.read_bytes() == screenshot_bytes
def test_save_screenshot_creates_dir(self, tmp_path):
"""Crée le dossier de sortie s'il n'existe pas."""
output_dir = tmp_path / "sub" / "dir" / "screenshots"
screenshot_bytes = b"\x89PNG fake image data"
result_path = save_debug_screenshot(screenshot_bytes, "test_screenshot", output_dir)
assert output_dir.exists()
assert result_path.exists()

View File

@@ -0,0 +1,174 @@
"""
Tests d'intégration pour le registry avec les stores réels.
Teste la détection automatique du bon store pour des URLs
Amazon, Cdiscount, Backmarket et AliExpress.
"""
import pytest
from pricewatch.app.core.registry import StoreRegistry
from pricewatch.app.stores.amazon.store import AmazonStore
from pricewatch.app.stores.cdiscount.store import CdiscountStore
from pricewatch.app.stores.backmarket.store import BackmarketStore
from pricewatch.app.stores.aliexpress.store import AliexpressStore
class TestRegistryRealStores:
"""Tests d'intégration avec les 4 stores réels."""
@pytest.fixture
def registry_with_all_stores(self) -> StoreRegistry:
"""Fixture: Registry avec les 4 stores réels enregistrés."""
registry = StoreRegistry()
registry.register(AmazonStore())
registry.register(CdiscountStore())
registry.register(BackmarketStore())
registry.register(AliexpressStore())
return registry
def test_all_stores_registered(self, registry_with_all_stores):
"""Vérifie que les 4 stores sont enregistrés."""
assert len(registry_with_all_stores) == 4
stores = registry_with_all_stores.list_stores()
assert "amazon" in stores
assert "cdiscount" in stores
assert "backmarket" in stores
assert "aliexpress" in stores
def test_detect_amazon_fr(self, registry_with_all_stores):
"""Détecte Amazon.fr correctement."""
url = "https://www.amazon.fr/dp/B08N5WRWNW"
store = registry_with_all_stores.detect_store(url)
assert store is not None
assert store.store_id == "amazon"
def test_detect_amazon_com(self, registry_with_all_stores):
"""Détecte Amazon.com correctement."""
url = "https://www.amazon.com/dp/B08N5WRWNW"
store = registry_with_all_stores.detect_store(url)
assert store is not None
assert store.store_id == "amazon"
def test_detect_amazon_with_product_name(self, registry_with_all_stores):
"""Détecte Amazon avec nom de produit dans l'URL."""
url = "https://www.amazon.fr/Product-Name-Here/dp/B08N5WRWNW/ref=sr_1_1"
store = registry_with_all_stores.detect_store(url)
assert store is not None
assert store.store_id == "amazon"
def test_detect_cdiscount(self, registry_with_all_stores):
"""Détecte Cdiscount correctement."""
url = "https://www.cdiscount.com/informatique/clavier-souris-webcam/example/f-1070123-example.html"
store = registry_with_all_stores.detect_store(url)
assert store is not None
assert store.store_id == "cdiscount"
def test_detect_backmarket(self, registry_with_all_stores):
"""Détecte Backmarket correctement."""
url = "https://www.backmarket.fr/fr-fr/p/iphone-15-pro"
store = registry_with_all_stores.detect_store(url)
assert store is not None
assert store.store_id == "backmarket"
def test_detect_backmarket_locale_en(self, registry_with_all_stores):
"""Détecte Backmarket avec locale anglais."""
url = "https://www.backmarket.fr/en-fr/p/macbook-air-15-2024"
store = registry_with_all_stores.detect_store(url)
assert store is not None
assert store.store_id == "backmarket"
def test_detect_aliexpress_fr(self, registry_with_all_stores):
"""Détecte AliExpress.fr correctement."""
url = "https://fr.aliexpress.com/item/1005007187023722.html"
store = registry_with_all_stores.detect_store(url)
assert store is not None
assert store.store_id == "aliexpress"
def test_detect_aliexpress_com(self, registry_with_all_stores):
"""Détecte AliExpress.com correctement."""
url = "https://www.aliexpress.com/item/1005007187023722.html"
store = registry_with_all_stores.detect_store(url)
assert store is not None
assert store.store_id == "aliexpress"
def test_detect_unknown_store(self, registry_with_all_stores):
"""URL inconnue retourne None."""
url = "https://www.ebay.com/itm/123456789"
store = registry_with_all_stores.detect_store(url)
assert store is None
def test_detect_invalid_url(self, registry_with_all_stores):
"""URL invalide retourne None."""
url = "not-a-valid-url"
store = registry_with_all_stores.detect_store(url)
assert store is None
def test_detect_priority_amazon_over_others(self, registry_with_all_stores):
"""Amazon.fr doit avoir le meilleur score pour ses URLs."""
url = "https://www.amazon.fr/dp/B08N5WRWNW"
store = registry_with_all_stores.detect_store(url)
# Amazon.fr devrait avoir score 0.9, les autres 0.0
assert store.store_id == "amazon"
def test_each_store_matches_only_own_urls(self, registry_with_all_stores):
"""Chaque store ne matche que ses propres URLs."""
test_cases = [
("https://www.amazon.fr/dp/B08N5WRWNW", "amazon"),
("https://www.cdiscount.com/product", "cdiscount"),
("https://www.backmarket.fr/fr-fr/p/product", "backmarket"),
("https://fr.aliexpress.com/item/12345.html", "aliexpress"),
]
for url, expected_store_id in test_cases:
store = registry_with_all_stores.detect_store(url)
assert store is not None, f"Aucun store détecté pour {url}"
assert store.store_id == expected_store_id, (
f"Mauvais store pour {url}: "
f"attendu {expected_store_id}, obtenu {store.store_id}"
)
def test_get_store_by_id(self, registry_with_all_stores):
"""Récupère chaque store par son ID."""
amazon = registry_with_all_stores.get_store("amazon")
assert amazon is not None
assert isinstance(amazon, AmazonStore)
cdiscount = registry_with_all_stores.get_store("cdiscount")
assert cdiscount is not None
assert isinstance(cdiscount, CdiscountStore)
backmarket = registry_with_all_stores.get_store("backmarket")
assert backmarket is not None
assert isinstance(backmarket, BackmarketStore)
aliexpress = registry_with_all_stores.get_store("aliexpress")
assert aliexpress is not None
assert isinstance(aliexpress, AliexpressStore)
def test_unregister_store(self, registry_with_all_stores):
"""Désenregistre un store et vérifie qu'il n'est plus détecté."""
assert len(registry_with_all_stores) == 4
# Désenregistrer Amazon
removed = registry_with_all_stores.unregister("amazon")
assert removed is True
assert len(registry_with_all_stores) == 3
# Amazon ne doit plus être détecté
store = registry_with_all_stores.detect_store("https://www.amazon.fr/dp/B08N5WRWNW")
assert store is None
# Les autres stores doivent toujours fonctionner
store = registry_with_all_stores.detect_store("https://www.cdiscount.com/product")
assert store is not None
assert store.store_id == "cdiscount"
def test_repr_includes_all_stores(self, registry_with_all_stores):
"""La représentation string inclut tous les stores."""
repr_str = repr(registry_with_all_stores)
assert "StoreRegistry" in repr_str
assert "amazon" in repr_str
assert "cdiscount" in repr_str
assert "backmarket" in repr_str
assert "aliexpress" in repr_str

87
tests/db/test_connection.py Executable file
View File

@@ -0,0 +1,87 @@
"""
Tests pour la couche de connexion SQLAlchemy.
"""
from dataclasses import dataclass
import pytest
from sqlalchemy import inspect
from pricewatch.app.db.connection import (
check_db_connection,
get_engine,
get_session,
init_db,
reset_engine,
)
from pricewatch.app.db.models import Product
@dataclass
class FakeDbConfig:
"""Config DB minimale pour tests SQLite."""
url: str
host: str = "sqlite"
port: int = 0
database: str = ":memory:"
@dataclass
class FakeAppConfig:
"""Config App minimale pour tests."""
db: FakeDbConfig
debug: bool = False
@pytest.fixture(autouse=True)
def reset_db_engine():
"""Reset l'engine global entre les tests."""
reset_engine()
yield
reset_engine()
@pytest.fixture
def sqlite_config() -> FakeAppConfig:
"""Config SQLite in-memory pour tests."""
return FakeAppConfig(db=FakeDbConfig(url="sqlite:///:memory:"))
def test_get_engine_sqlite(sqlite_config: FakeAppConfig):
"""Cree un engine SQLite fonctionnel."""
engine = get_engine(sqlite_config)
assert engine.url.get_backend_name() == "sqlite"
def test_init_db_creates_tables(sqlite_config: FakeAppConfig):
"""Init DB cree toutes les tables attendues."""
init_db(sqlite_config)
engine = get_engine(sqlite_config)
inspector = inspect(engine)
tables = set(inspector.get_table_names())
assert "products" in tables
assert "price_history" in tables
assert "product_images" in tables
assert "product_specs" in tables
assert "scraping_logs" in tables
def test_get_session_commit(sqlite_config: FakeAppConfig):
"""La session permet un commit simple."""
init_db(sqlite_config)
with get_session(sqlite_config) as session:
product = Product(source="amazon", reference="B08N5WRWNW", url="https://example.com")
session.add(product)
session.commit()
with get_session(sqlite_config) as session:
assert session.query(Product).count() == 1
def test_check_db_connection(sqlite_config: FakeAppConfig):
"""Le health check DB retourne True en SQLite."""
init_db(sqlite_config)
assert check_db_connection(sqlite_config) is True

89
tests/db/test_models.py Executable file
View File

@@ -0,0 +1,89 @@
"""
Tests pour les modeles SQLAlchemy.
"""
from datetime import datetime
import pytest
from sqlalchemy import create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, sessionmaker
from pricewatch.app.db.models import (
Base,
PriceHistory,
Product,
ProductImage,
ProductSpec,
ScrapingLog,
)
@pytest.fixture
def session() -> Session:
"""Session SQLite in-memory pour tests de modeles."""
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
SessionLocal = sessionmaker(bind=engine)
session = SessionLocal()
try:
yield session
finally:
session.close()
def test_product_relationships(session: Session):
"""Les relations principales fonctionnent (prix, images, specs, logs)."""
product = Product(source="amazon", reference="B08N5WRWNW", url="https://example.com")
price = PriceHistory(
price=199.99,
shipping_cost=0,
stock_status="in_stock",
fetch_method="http",
fetch_status="success",
fetched_at=datetime.utcnow(),
)
image = ProductImage(image_url="https://example.com/image.jpg", position=0)
spec = ProductSpec(spec_key="Couleur", spec_value="Noir")
log = ScrapingLog(
url="https://example.com",
source="amazon",
reference="B08N5WRWNW",
fetch_method="http",
fetch_status="success",
fetched_at=datetime.utcnow(),
duration_ms=1200,
html_size_bytes=2048,
errors={"items": []},
notes={"items": ["OK"]},
)
product.price_history.append(price)
product.images.append(image)
product.specs.append(spec)
product.logs.append(log)
session.add(product)
session.commit()
loaded = session.query(Product).first()
assert loaded is not None
assert len(loaded.price_history) == 1
assert len(loaded.images) == 1
assert len(loaded.specs) == 1
assert len(loaded.logs) == 1
def test_unique_product_constraint(session: Session):
"""La contrainte unique source+reference est respectee."""
product_a = Product(source="amazon", reference="B08N5WRWNW", url="https://example.com/a")
product_b = Product(source="amazon", reference="B08N5WRWNW", url="https://example.com/b")
session.add(product_a)
session.commit()
session.add(product_b)
with pytest.raises(IntegrityError):
session.commit()
session.rollback()

82
tests/db/test_repository.py Executable file
View File

@@ -0,0 +1,82 @@
"""
Tests pour le repository SQLAlchemy.
"""
from datetime import datetime
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import Session, sessionmaker
from pricewatch.app.core.schema import DebugInfo, DebugStatus, FetchMethod, ProductSnapshot
from pricewatch.app.db.models import Base, Product, ScrapingLog
from pricewatch.app.db.repository import ProductRepository
@pytest.fixture
def session() -> Session:
"""Session SQLite in-memory pour tests repository."""
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
SessionLocal = sessionmaker(bind=engine)
session = SessionLocal()
try:
yield session
finally:
session.close()
engine.dispose()
def _make_snapshot(reference: str | None) -> ProductSnapshot:
return ProductSnapshot(
source="amazon",
url="https://example.com/product",
fetched_at=datetime(2026, 1, 14, 12, 0, 0),
title="Produit test",
price=199.99,
currency="EUR",
shipping_cost=0.0,
reference=reference,
images=["https://example.com/img1.jpg"],
specs={"Couleur": "Noir"},
debug=DebugInfo(
method=FetchMethod.HTTP,
status=DebugStatus.SUCCESS,
errors=["Avertissement"],
notes=["OK"],
),
)
def test_save_snapshot_creates_product(session: Session):
"""Le repository persiste produit + log."""
repo = ProductRepository(session)
snapshot = _make_snapshot(reference="B08N5WRWNW")
product_id = repo.save_snapshot(snapshot)
session.commit()
product = session.query(Product).one()
assert product.id == product_id
assert product.reference == "B08N5WRWNW"
assert len(product.images) == 1
assert len(product.specs) == 1
assert len(product.price_history) == 1
log = session.query(ScrapingLog).one()
assert log.product_id == product_id
assert log.errors == ["Avertissement"]
assert log.notes == ["OK"]
def test_save_snapshot_without_reference(session: Session):
"""Sans reference, le produit n'est pas cree mais le log existe."""
repo = ProductRepository(session)
snapshot = _make_snapshot(reference=None)
product_id = repo.save_snapshot(snapshot)
session.commit()
assert product_id is None
assert session.query(Product).count() == 0
assert session.query(ScrapingLog).count() == 1

0
tests/scraping/__init__.py Executable file
View File

Binary file not shown.

290
tests/scraping/test_http_fetch.py Executable file
View File

@@ -0,0 +1,290 @@
"""
Tests pour pricewatch.app.scraping.http_fetch
Teste la récupération HTTP avec mocks pour éviter les vraies requêtes.
"""
from unittest.mock import Mock, patch
import pytest
import requests
from requests.exceptions import RequestException, Timeout
from pricewatch.app.scraping.http_fetch import FetchResult, fetch_http
class TestFetchResult:
"""Tests pour la classe FetchResult."""
def test_success_result(self):
"""Création d'un résultat réussi."""
result = FetchResult(
success=True,
html="<html>Test</html>",
status_code=200,
duration_ms=150,
)
assert result.success is True
assert result.html == "<html>Test</html>"
assert result.error is None
assert result.status_code == 200
assert result.duration_ms == 150
def test_error_result(self):
"""Création d'un résultat d'erreur."""
result = FetchResult(
success=False,
error="403 Forbidden",
status_code=403,
duration_ms=100,
)
assert result.success is False
assert result.html is None
assert result.error == "403 Forbidden"
assert result.status_code == 403
assert result.duration_ms == 100
def test_minimal_result(self):
"""Résultat minimal avec success uniquement."""
result = FetchResult(success=False)
assert result.success is False
assert result.html is None
assert result.error is None
assert result.status_code is None
assert result.duration_ms is None
class TestFetchHttp:
"""Tests pour la fonction fetch_http()."""
def test_fetch_success(self, mocker):
"""Requête HTTP réussie (200 OK)."""
# Mock de requests.get
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = "<html><body>Test Page</body></html>"
mocker.patch("requests.get", return_value=mock_response)
result = fetch_http("https://example.com")
assert result.success is True
assert result.html == "<html><body>Test Page</body></html>"
assert result.status_code == 200
assert result.error is None
assert result.duration_ms is not None
assert result.duration_ms >= 0
def test_fetch_with_custom_timeout(self, mocker):
"""Requête avec timeout personnalisé."""
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = "<html>OK</html>"
mock_get = mocker.patch("requests.get", return_value=mock_response)
fetch_http("https://example.com", timeout=60)
# Vérifier que timeout est passé à requests.get
mock_get.assert_called_once()
call_kwargs = mock_get.call_args.kwargs
assert call_kwargs["timeout"] == 60
def test_fetch_with_custom_headers(self, mocker):
"""Requête avec headers personnalisés."""
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = "<html>OK</html>"
mock_get = mocker.patch("requests.get", return_value=mock_response)
custom_headers = {"X-Custom-Header": "test-value"}
fetch_http("https://example.com", headers=custom_headers)
# Vérifier que les headers personnalisés sont inclus
mock_get.assert_called_once()
call_kwargs = mock_get.call_args.kwargs
assert "X-Custom-Header" in call_kwargs["headers"]
assert call_kwargs["headers"]["X-Custom-Header"] == "test-value"
# Headers par défaut doivent aussi être présents
assert "User-Agent" in call_kwargs["headers"]
def test_fetch_403_forbidden(self, mocker):
"""Requête bloquée (403 Forbidden)."""
mock_response = Mock()
mock_response.status_code = 403
mocker.patch("requests.get", return_value=mock_response)
result = fetch_http("https://example.com")
assert result.success is False
assert result.html is None
assert result.status_code == 403
assert "403 Forbidden" in result.error
assert "Anti-bot" in result.error
def test_fetch_404_not_found(self, mocker):
"""Page introuvable (404 Not Found)."""
mock_response = Mock()
mock_response.status_code = 404
mocker.patch("requests.get", return_value=mock_response)
result = fetch_http("https://example.com")
assert result.success is False
assert result.status_code == 404
assert "404 Not Found" in result.error
def test_fetch_429_rate_limit(self, mocker):
"""Rate limit atteint (429 Too Many Requests)."""
mock_response = Mock()
mock_response.status_code = 429
mocker.patch("requests.get", return_value=mock_response)
result = fetch_http("https://example.com")
assert result.success is False
assert result.status_code == 429
assert "429" in result.error
assert "Rate limit" in result.error
def test_fetch_500_server_error(self, mocker):
"""Erreur serveur (500 Internal Server Error)."""
mock_response = Mock()
mock_response.status_code = 500
mocker.patch("requests.get", return_value=mock_response)
result = fetch_http("https://example.com")
assert result.success is False
assert result.status_code == 500
assert "500" in result.error
assert "Server Error" in result.error
def test_fetch_503_service_unavailable(self, mocker):
"""Service indisponible (503)."""
mock_response = Mock()
mock_response.status_code = 503
mocker.patch("requests.get", return_value=mock_response)
result = fetch_http("https://example.com")
assert result.success is False
assert result.status_code == 503
assert "503" in result.error
def test_fetch_unknown_status_code(self, mocker):
"""Code de statut inconnu (par ex. 418 I'm a teapot)."""
mock_response = Mock()
mock_response.status_code = 418
mocker.patch("requests.get", return_value=mock_response)
result = fetch_http("https://example.com")
assert result.success is False
assert result.status_code == 418
assert "418" in result.error
def test_fetch_timeout_error(self, mocker):
"""Timeout lors de la requête."""
mocker.patch("requests.get", side_effect=Timeout("Connection timed out"))
result = fetch_http("https://example.com", timeout=10)
assert result.success is False
assert result.html is None
assert "Timeout" in result.error
assert result.duration_ms is not None
def test_fetch_request_exception(self, mocker):
"""Exception réseau générique."""
mocker.patch(
"requests.get",
side_effect=RequestException("Network error"),
)
result = fetch_http("https://example.com")
assert result.success is False
assert "Erreur réseau" in result.error
assert result.duration_ms is not None
def test_fetch_unexpected_exception(self, mocker):
"""Exception inattendue."""
mocker.patch("requests.get", side_effect=ValueError("Unexpected error"))
result = fetch_http("https://example.com")
assert result.success is False
assert "Erreur inattendue" in result.error
assert result.duration_ms is not None
def test_fetch_empty_url(self):
"""URL vide retourne une erreur."""
result = fetch_http("")
assert result.success is False
assert "URL vide" in result.error
assert result.html is None
def test_fetch_whitespace_url(self):
"""URL avec espaces uniquement retourne une erreur."""
result = fetch_http(" ")
assert result.success is False
assert "URL vide" in result.error
def test_fetch_no_redirects(self, mocker):
"""Requête sans suivre les redirections."""
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = "<html>OK</html>"
mock_get = mocker.patch("requests.get", return_value=mock_response)
fetch_http("https://example.com", follow_redirects=False)
mock_get.assert_called_once()
call_kwargs = mock_get.call_args.kwargs
assert call_kwargs["allow_redirects"] is False
def test_fetch_uses_random_user_agent(self, mocker):
"""Vérifie qu'un User-Agent aléatoire est utilisé."""
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = "<html>OK</html>"
mock_get = mocker.patch("requests.get", return_value=mock_response)
fetch_http("https://example.com")
# Vérifier qu'un User-Agent est présent
mock_get.assert_called_once()
call_kwargs = mock_get.call_args.kwargs
assert "User-Agent" in call_kwargs["headers"]
# User-Agent doit contenir "Mozilla" (présent dans tous les UA)
assert "Mozilla" in call_kwargs["headers"]["User-Agent"]
def test_fetch_duration_is_measured(self, mocker):
"""Vérifie que la durée est mesurée."""
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = "<html>OK</html>"
mocker.patch("requests.get", return_value=mock_response)
result = fetch_http("https://example.com")
assert result.duration_ms is not None
assert isinstance(result.duration_ms, int)
assert result.duration_ms >= 0
def test_fetch_large_response(self, mocker):
"""Requête avec réponse volumineuse."""
mock_response = Mock()
mock_response.status_code = 200
# Simuler une grosse page HTML (1 MB)
mock_response.text = "<html>" + ("x" * 1000000) + "</html>"
mocker.patch("requests.get", return_value=mock_response)
result = fetch_http("https://example.com")
assert result.success is True
assert len(result.html) > 1000000
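
À titre de référence, voici une esquisse minimale et hypothétique (elle ne reprend pas le code réel de `http_fetch.py`) du contrat que ces tests supposent pour `fetch_http` : un objet résultat exposant `success`, `html`, `status_code`, `error` et `duration_ms`, des messages d'erreur explicites par code HTTP, la fusion des headers personnalisés avec un User-Agent « Mozilla » et la transmission de `timeout`/`allow_redirects` à `requests.get`. Le nom `HttpFetchResult` et les valeurs par défaut sont des hypothèses d'illustration.

# Esquisse hypothétique, déduite des assertions ci-dessus (pas le code réel du module).
import time
from dataclasses import dataclass
from typing import Optional

import requests


@dataclass
class HttpFetchResult:  # nom supposé pour l'illustration
    success: bool
    html: Optional[str] = None
    status_code: Optional[int] = None
    error: Optional[str] = None
    duration_ms: Optional[int] = None


_ERROR_LABELS = {
    403: "403 Forbidden - Anti-bot probable",
    404: "404 Not Found",
    429: "429 Too Many Requests - Rate limit atteint",
}


def fetch_http(url, timeout=30, headers=None, follow_redirects=True) -> HttpFetchResult:
    if not url or not url.strip():
        return HttpFetchResult(success=False, error="URL vide")
    # Rotation de User-Agent simplifiée : le module réel pioche dans une liste.
    merged_headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
    merged_headers.update(headers or {})
    start = time.monotonic()

    def _elapsed_ms() -> int:
        return int((time.monotonic() - start) * 1000)

    try:
        resp = requests.get(
            url,
            headers=merged_headers,
            timeout=timeout,
            allow_redirects=follow_redirects,
        )
    except requests.Timeout:
        return HttpFetchResult(success=False, error="Timeout", duration_ms=_elapsed_ms())
    except requests.RequestException as exc:
        return HttpFetchResult(success=False, error=f"Erreur réseau: {exc}", duration_ms=_elapsed_ms())
    except Exception as exc:  # par ex. ValueError dans le test « unexpected »
        return HttpFetchResult(success=False, error=f"Erreur inattendue: {exc}", duration_ms=_elapsed_ms())
    if resp.status_code == 200:
        return HttpFetchResult(success=True, html=resp.text, status_code=200, duration_ms=_elapsed_ms())
    if resp.status_code >= 500:
        error = f"{resp.status_code} Server Error"
    else:
        error = _ERROR_LABELS.get(resp.status_code, f"Statut HTTP inattendu: {resp.status_code}")
    return HttpFetchResult(success=False, status_code=resp.status_code, error=error, duration_ms=_elapsed_ms())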

82
tests/scraping/test_pipeline.py Executable file
View File

@@ -0,0 +1,82 @@
"""
Tests pour ScrapingPipeline.
"""
from dataclasses import dataclass
from datetime import datetime
import pytest
from pricewatch.app.core.schema import DebugInfo, DebugStatus, FetchMethod, ProductSnapshot
from pricewatch.app.db.connection import get_session, init_db, reset_engine
from pricewatch.app.db.models import Product
from pricewatch.app.scraping.pipeline import ScrapingPipeline
@dataclass
class FakeDbConfig:
url: str
@dataclass
class FakeAppConfig:
db: FakeDbConfig
debug: bool = False
enable_db: bool = True
@pytest.fixture(autouse=True)
def reset_db_engine():
"""Reset l'engine global entre les tests."""
reset_engine()
yield
reset_engine()
def test_pipeline_persists_snapshot():
"""Le pipeline persiste un snapshot en base SQLite."""
config = FakeAppConfig(db=FakeDbConfig(url="sqlite:///:memory:"))
init_db(config)
snapshot = ProductSnapshot(
source="amazon",
url="https://example.com/product",
fetched_at=datetime(2026, 1, 14, 12, 30, 0),
title="Produit pipeline",
price=99.99,
currency="EUR",
reference="B08PIPE",
debug=DebugInfo(method=FetchMethod.HTTP, status=DebugStatus.SUCCESS),
)
pipeline = ScrapingPipeline(config=config)
product_id = pipeline.process_snapshot(snapshot, save_to_db=True)
assert product_id is not None
with get_session(config) as session:
assert session.query(Product).count() == 1
def test_pipeline_respects_disable_flag():
"""Le pipeline ignore la persistence si enable_db=False."""
config = FakeAppConfig(db=FakeDbConfig(url="sqlite:///:memory:"), enable_db=False)
init_db(config)
snapshot = ProductSnapshot(
source="amazon",
url="https://example.com/product",
fetched_at=datetime(2026, 1, 14, 12, 45, 0),
title="Produit pipeline",
price=99.99,
currency="EUR",
reference="B08PIPE",
debug=DebugInfo(method=FetchMethod.HTTP, status=DebugStatus.SUCCESS),
)
pipeline = ScrapingPipeline(config=config)
product_id = pipeline.process_snapshot(snapshot, save_to_db=True)
assert product_id is None
with get_session(config) as session:
assert session.query(Product).count() == 0
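
Le contrat exercé par ces deux tests se résume à une logique de garde : aucune écriture si `enable_db=False` ou `save_to_db=False`, sinon persistance et retour de l'identifiant du produit. Esquisse hypothétique de cette seule logique (la délégation réelle à `ProductRepository` n'est pas reproduite ici et le helper `_persist` est fictif) :

# Esquisse hypothétique : seule la logique de garde vérifiée par les tests est montrée.
from typing import Optional

from pricewatch.app.core.schema import ProductSnapshot


class ScrapingPipeline:
    def __init__(self, config) -> None:
        self.config = config

    def process_snapshot(self, snapshot: ProductSnapshot, save_to_db: bool = True) -> Optional[int]:
        # Persistance ignorée si désactivée globalement (enable_db=False)
        # ou pour cet appel (save_to_db=False) : les tests attendent alors None et count() == 0.
        if not save_to_db or not getattr(self.config, "enable_db", False):
            return None
        return self._persist(snapshot)

    def _persist(self, snapshot: ProductSnapshot) -> int:
        # Helper fictif : dans le vrai module, l'upsert passe par ProductRepository.
        raise NotImplementedError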

388
tests/scraping/test_pw_fetch.py Executable file
View File

@@ -0,0 +1,388 @@
"""
Tests pour pricewatch.app.scraping.pw_fetch
Teste la récupération Playwright avec mocks pour éviter de lancer vraiment un navigateur.
"""
from unittest.mock import Mock, patch
import pytest
from playwright.sync_api import TimeoutError as PlaywrightTimeout
from pricewatch.app.scraping.pw_fetch import (
PlaywrightFetchResult,
fetch_playwright,
fetch_with_fallback,
)
class TestPlaywrightFetchResult:
"""Tests pour la classe PlaywrightFetchResult."""
def test_success_result(self):
"""Création d'un résultat réussi."""
result = PlaywrightFetchResult(
success=True,
html="<html>Test</html>",
screenshot=b"fake_screenshot_bytes",
duration_ms=2500,
)
assert result.success is True
assert result.html == "<html>Test</html>"
assert result.screenshot == b"fake_screenshot_bytes"
assert result.error is None
assert result.duration_ms == 2500
def test_error_result(self):
"""Création d'un résultat d'erreur."""
result = PlaywrightFetchResult(
success=False,
error="Timeout",
screenshot=b"error_screenshot",
duration_ms=3000,
)
assert result.success is False
assert result.html is None
assert result.error == "Timeout"
assert result.screenshot == b"error_screenshot"
assert result.duration_ms == 3000
def test_minimal_result(self):
"""Résultat minimal."""
result = PlaywrightFetchResult(success=False)
assert result.success is False
assert result.html is None
assert result.screenshot is None
assert result.error is None
assert result.duration_ms is None
class TestFetchPlaywright:
"""Tests pour fetch_playwright()."""
@pytest.fixture
def mock_playwright_stack(self, mocker):
"""Fixture: Mock complet de la stack Playwright."""
# Mock de la page
mock_page = Mock()
mock_page.content.return_value = "<html><body>Playwright Test</body></html>"
mock_page.screenshot.return_value = b"fake_screenshot_data"
mock_page.goto.return_value = Mock(status=200)
# Mock du context
mock_context = Mock()
mock_context.new_page.return_value = mock_page
# Mock du browser
mock_browser = Mock()
mock_browser.new_context.return_value = mock_context
# Mock playwright chromium
mock_chromium = Mock()
mock_chromium.launch.return_value = mock_browser
# Mock playwright
mock_playwright_obj = Mock()
mock_playwright_obj.chromium = mock_chromium
# Mock sync_playwright().start()
mock_sync_playwright = Mock()
mock_sync_playwright.start.return_value = mock_playwright_obj
mocker.patch(
"pricewatch.app.scraping.pw_fetch.sync_playwright",
return_value=mock_sync_playwright,
)
return {
"playwright": mock_playwright_obj,
"browser": mock_browser,
"context": mock_context,
"page": mock_page,
}
def test_fetch_success(self, mock_playwright_stack):
"""Récupération Playwright réussie."""
result = fetch_playwright("https://example.com")
assert result.success is True
assert result.html == "<html><body>Playwright Test</body></html>"
assert result.screenshot is None # Par défaut pas de screenshot
assert result.error is None
assert result.duration_ms is not None
assert result.duration_ms >= 0
# Vérifier que la page a été visitée
mock_playwright_stack["page"].goto.assert_called_once_with(
"https://example.com", wait_until="domcontentloaded"
)
def test_fetch_with_screenshot(self, mock_playwright_stack):
"""Récupération avec screenshot."""
result = fetch_playwright("https://example.com", save_screenshot=True)
assert result.success is True
assert result.screenshot == b"fake_screenshot_data"
# Vérifier que screenshot() a été appelé
mock_playwright_stack["page"].screenshot.assert_called_once()
def test_fetch_headful_mode(self, mock_playwright_stack):
"""Mode headful (navigateur visible)."""
result = fetch_playwright("https://example.com", headless=False)
assert result.success is True
# Vérifier que headless=False a été passé
mock_playwright_stack["playwright"].chromium.launch.assert_called_once()
call_kwargs = mock_playwright_stack["playwright"].chromium.launch.call_args.kwargs
assert call_kwargs["headless"] is False
def test_fetch_with_custom_timeout(self, mock_playwright_stack):
"""Timeout personnalisé."""
result = fetch_playwright("https://example.com", timeout_ms=30000)
assert result.success is True
# Vérifier que set_default_timeout a été appelé
mock_playwright_stack["page"].set_default_timeout.assert_called_once_with(30000)
def test_fetch_with_wait_for_selector(self, mock_playwright_stack):
"""Attente d'un sélecteur CSS spécifique."""
result = fetch_playwright(
"https://example.com", wait_for_selector=".product-title"
)
assert result.success is True
# Vérifier que wait_for_selector a été appelé
mock_playwright_stack["page"].wait_for_selector.assert_called_once_with(
".product-title", timeout=60000
)
def test_fetch_wait_for_selector_timeout(self, mock_playwright_stack):
"""Timeout lors de l'attente du sélecteur."""
        # L'attente du sélecteur expire (timeout) mais la récupération de la page continue
mock_playwright_stack["page"].wait_for_selector.side_effect = PlaywrightTimeout(
"Selector timeout"
)
result = fetch_playwright(
"https://example.com", wait_for_selector=".non-existent"
)
# Doit quand même réussir (le wait_for_selector est non-bloquant)
assert result.success is True
assert result.html is not None
def test_fetch_empty_url(self):
"""URL vide retourne une erreur."""
result = fetch_playwright("")
assert result.success is False
assert "URL vide" in result.error
assert result.html is None
def test_fetch_whitespace_url(self):
"""URL avec espaces retourne une erreur."""
result = fetch_playwright(" ")
assert result.success is False
assert "URL vide" in result.error
def test_fetch_no_response_from_server(self, mock_playwright_stack):
"""Pas de réponse du serveur."""
mock_playwright_stack["page"].goto.return_value = None
result = fetch_playwright("https://example.com")
assert result.success is False
assert "Pas de réponse du serveur" in result.error
def test_fetch_playwright_timeout(self, mock_playwright_stack):
"""Timeout Playwright lors de la navigation."""
mock_playwright_stack["page"].goto.side_effect = PlaywrightTimeout(
"Navigation timeout"
)
result = fetch_playwright("https://example.com", timeout_ms=10000)
assert result.success is False
assert "Timeout" in result.error
assert result.duration_ms is not None
def test_fetch_playwright_generic_error(self, mock_playwright_stack):
"""Erreur générique Playwright."""
mock_playwright_stack["page"].goto.side_effect = Exception(
"Generic Playwright error"
)
result = fetch_playwright("https://example.com")
assert result.success is False
assert "Erreur Playwright" in result.error
assert result.duration_ms is not None
def test_fetch_cleanup_on_success(self, mock_playwright_stack):
"""Nettoyage des ressources sur succès."""
result = fetch_playwright("https://example.com")
assert result.success is True
# Vérifier que les ressources sont nettoyées
mock_playwright_stack["page"].close.assert_called_once()
mock_playwright_stack["browser"].close.assert_called_once()
mock_playwright_stack["playwright"].stop.assert_called_once()
def test_fetch_cleanup_on_error(self, mock_playwright_stack):
"""Nettoyage des ressources sur erreur."""
mock_playwright_stack["page"].goto.side_effect = Exception("Test error")
result = fetch_playwright("https://example.com")
assert result.success is False
# Vérifier que les ressources sont nettoyées même en cas d'erreur
mock_playwright_stack["page"].close.assert_called_once()
mock_playwright_stack["browser"].close.assert_called_once()
mock_playwright_stack["playwright"].stop.assert_called_once()
def test_fetch_screenshot_on_error(self, mock_playwright_stack):
"""Screenshot capturé même en cas d'erreur."""
mock_playwright_stack["page"].goto.side_effect = PlaywrightTimeout("Timeout")
result = fetch_playwright("https://example.com", save_screenshot=True)
assert result.success is False
assert result.screenshot == b"fake_screenshot_data"
# Screenshot doit avoir été tenté
mock_playwright_stack["page"].screenshot.assert_called_once()
class TestFetchWithFallback:
"""Tests pour fetch_with_fallback()."""
def test_http_success_no_playwright(self, mocker):
"""Si HTTP réussit, Playwright n'est pas appelé."""
# Mock fetch_http qui réussit
mock_http_result = Mock()
mock_http_result.success = True
mock_http_result.html = "<html>HTTP Success</html>"
mock_http_result.duration_ms = 150
mocker.patch(
"pricewatch.app.scraping.http_fetch.fetch_http",
return_value=mock_http_result,
)
# Mock fetch_playwright (ne devrait pas être appelé)
mock_playwright = mocker.patch(
"pricewatch.app.scraping.pw_fetch.fetch_playwright"
)
result = fetch_with_fallback("https://example.com")
assert result.success is True
assert result.html == "<html>HTTP Success</html>"
assert result.duration_ms == 150
# Playwright ne doit pas être appelé
mock_playwright.assert_not_called()
def test_http_fails_playwright_fallback(self, mocker):
"""Si HTTP échoue, fallback vers Playwright."""
# Mock fetch_http qui échoue
mock_http_result = Mock()
mock_http_result.success = False
mock_http_result.error = "403 Forbidden"
mocker.patch(
"pricewatch.app.scraping.http_fetch.fetch_http",
return_value=mock_http_result,
)
# Mock fetch_playwright qui réussit
mock_playwright_result = PlaywrightFetchResult(
success=True,
html="<html>Playwright Success</html>",
duration_ms=2500,
)
mock_playwright = mocker.patch(
"pricewatch.app.scraping.pw_fetch.fetch_playwright",
return_value=mock_playwright_result,
)
result = fetch_with_fallback("https://example.com")
assert result.success is True
assert result.html == "<html>Playwright Success</html>"
# Playwright doit avoir été appelé
mock_playwright.assert_called_once()
def test_skip_http_direct_playwright(self, mocker):
"""Mode Playwright direct (sans essayer HTTP d'abord)."""
# Mock fetch_http (ne devrait pas être appelé)
mock_http = mocker.patch("pricewatch.app.scraping.http_fetch.fetch_http")
# Mock fetch_playwright
mock_playwright_result = PlaywrightFetchResult(
success=True,
html="<html>Playwright Direct</html>",
duration_ms=2500,
)
mock_playwright = mocker.patch(
"pricewatch.app.scraping.pw_fetch.fetch_playwright",
return_value=mock_playwright_result,
)
result = fetch_with_fallback("https://example.com", try_http_first=False)
assert result.success is True
assert result.html == "<html>Playwright Direct</html>"
# HTTP ne doit pas être appelé
mock_http.assert_not_called()
# Playwright doit avoir été appelé
mock_playwright.assert_called_once()
def test_playwright_options_passed(self, mocker):
"""Options Playwright passées correctement."""
# Mock fetch_http qui échoue
mock_http_result = Mock()
mock_http_result.success = False
mock_http_result.error = "403 Forbidden"
mocker.patch(
"pricewatch.app.scraping.http_fetch.fetch_http",
return_value=mock_http_result,
)
# Mock fetch_playwright
mock_playwright_result = PlaywrightFetchResult(
success=True,
html="<html>OK</html>",
duration_ms=2500,
)
mock_playwright = mocker.patch(
"pricewatch.app.scraping.pw_fetch.fetch_playwright",
return_value=mock_playwright_result,
)
# Options personnalisées
options = {"headless": False, "timeout_ms": 30000, "save_screenshot": True}
result = fetch_with_fallback("https://example.com", playwright_options=options)
assert result.success is True
# Vérifier que les options sont passées à fetch_playwright
mock_playwright.assert_called_once_with("https://example.com", **options)