Gilles Soulier d0b73b9319 codex2

2026-01-14 21:54:55 +01:00


🚀 Phase 2 Infrastructure - IN PROGRESS

Start date: 2026-01-14
Target version: 0.4.0
Goal: Add PostgreSQL and a Redis/RQ worker for persistence and asynchronous scraping

📊 Overview

Recent updates

Alembic migration fixed (down_revision set to 20260114_02)
Amazon image extraction improved (data-a-dynamic-image + logo filter)
New validation scrape (Amazon URL for the ASUS A16)

Next actions

Check that images, description, specs, MSRP and discount are displayed in the Web UI
Confirm that the add-product popup shows all the data from the preview

Phase 2 Goals

✅ Centralized configuration (database, Redis, app)
✅ SQLAlchemy ORM models (5 tables)
✅ Database connection (init_db, get_session)
✅ Alembic migrations
✅ Repository pattern (CRUD)
✅ RQ worker for asynchronous scraping
✅ Scheduler for recurring jobs
✅ Extended CLI (DB + worker commands)
✅ Docker Compose (PostgreSQL + Redis)
✅ Redis error handling
✅ Job observability logs
⏳ End-to-end tests (Week 4)

✅ Week 1: Database Foundation (DONE)

Completed Tasks

1. Centralized Configuration ✅

File: pricewatch/app/core/config.py (187 lines)

Contents:

DatabaseConfig: PostgreSQL configuration
- Host, port, database, user, password
- url property: SQLAlchemy connection string
- url_async property: AsyncPG connection string (future use)
- Env var prefix: PW_DB_* (PW_DB_HOST, PW_DB_PORT, etc.)
RedisConfig: Redis configuration for RQ
- Host, port, db, password (optional)
- url property: Redis connection string
- Env var prefix: PW_REDIS_*
AppConfig: Global application configuration
- Debug mode
- Worker timeout (300s by default)
- Worker concurrency (2 by default)
- Feature flags: enable_db, enable_worker
- Playwright defaults: timeout, use_playwright
- Nested configs: db, redis
- Env var prefix: PW_*
Singleton pattern: get_config(), set_config(), reset_config()

Rationale:

12-factor app: configuration through env vars
Pydantic validation guarantees a valid config at startup
Default values for local development
.env file support to simplify setup
Feature flags make it possible to disable the DB/worker for tests
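
The env-var-driven layout above can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the contents of config.py: a plain dataclass stands in for the Pydantic model, the from_env helper is hypothetical, and the defaults are borrowed from the .env example rather than from the real code. Only the PW_DB_* names and the get_config()/reset_config() names come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional
from urllib.parse import quote_plus


@dataclass(frozen=True)
class DatabaseConfig:
    # Defaults are assumptions taken from the .env example.
    host: str = "localhost"
    port: int = 5432
    database: str = "pricewatch"
    user: str = "pricewatch"
    password: str = "pricewatch"

    @property
    def url(self) -> str:
        # SQLAlchemy connection string for the psycopg2 driver;
        # the password is URL-quoted in case it contains special characters.
        return (
            f"postgresql+psycopg2://{self.user}:{quote_plus(self.password)}"
            f"@{self.host}:{self.port}/{self.database}"
        )

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "DatabaseConfig":
        # Hypothetical helper: read PW_DB_* and fall back to the defaults.
        base = cls()
        return cls(
            host=env.get("PW_DB_HOST", base.host),
            port=int(env.get("PW_DB_PORT", base.port)),
            database=env.get("PW_DB_DATABASE", base.database),
            user=env.get("PW_DB_USER", base.user),
            password=env.get("PW_DB_PASSWORD", base.password),
        )


_config: Optional[DatabaseConfig] = None


def get_config() -> DatabaseConfig:
    """Build the config once from the environment, then reuse it."""
    global _config
    if _config is None:
        _config = DatabaseConfig.from_env(os.environ)
    return _config


def reset_config() -> None:
    """Drop the cached config so that tests can start from a clean state."""
    global _config
    _config = None


example = DatabaseConfig.from_env({"PW_DB_HOST": "db", "PW_DB_PORT": "5433"})
print(example.url)
```

Passing the environment as a mapping keeps the parsing testable without touching os.environ; a Pydantic-based version would add type validation on top of the same idea.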

2. Phase 2 Dependencies ✅

File: pyproject.toml (lines 48-60)

Additions:

# Database (Phase 2)
"sqlalchemy>=2.0.0",
"psycopg2-binary>=2.9.0",
"alembic>=1.13.0",

# Configuration (Phase 2)
"python-dotenv>=1.0.0",

# Worker/Queue (Phase 2)
"redis>=5.0.0",
"rq>=1.15.0",
"rq-scheduler>=0.13.0",

3. SQLAlchemy ORM Models ✅

File: pricewatch/app/db/models.py (322 lines)

Tables created:

products - Product catalog
- PK: id (Integer, autoincrement)
- Natural key: (source, reference) - Unique constraint
- Columns: url, title, category, currency
- Timestamps: first_seen_at, last_updated_at
- Relationships: price_history, images, specs, logs
- Indexes: source, reference, last_updated_at
price_history - Price history (time series)
- PK: id (Integer, autoincrement)
- FK: product_id → products(id) CASCADE
- Unique: (product_id, fetched_at) - Prevents duplicates
- Columns: price (Numeric 10,2), shipping_cost, stock_status
- Fetch metadata: fetch_method, fetch_status, fetched_at
- Check constraints: stock_status, fetch_method, fetch_status
- Indexes: product_id, fetched_at
product_images - Product images
- PK: id (Integer, autoincrement)
- FK: product_id → products(id) CASCADE
- Unique: (product_id, image_url) - Prevents duplicates
- Columns: image_url (Text), position (Integer, 0=main)
- Index: product_id
product_specs - Product specifications (key-value)
- PK: id (Integer, autoincrement)
- FK: product_id → products(id) CASCADE
- Unique: (product_id, spec_key) - Prevents duplicates
- Columns: spec_key (String 200), spec_value (Text)
- Indexes: product_id, spec_key
scraping_logs - Observability logs
- PK: id (Integer, autoincrement)
- Optional FK: product_id → products(id) SET NULL
- Columns: url, source, reference, fetched_at
- Metrics: duration_ms, html_size_bytes
- Fetch metadata: fetch_method, fetch_status
- Debug data (JSONB): errors, notes
- Indexes: product_id, source, fetched_at, fetch_status

Schema rationale:

Normalization: products kept separate from price_history (catalog vs. time series)
Natural key (source, reference) rather than an arbitrary UUID
Separate tables for images/specs: avoids unstructured JSONB
JSONB only for variable data: errors and notes in the logs
Cascade DELETE: deleting a product deletes its history
SET NULL for logs: the trace is kept even if the product is deleted
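
The effect of the natural key and of the two delete rules can be tried out with the standard-library sqlite3 module. This is a reduced sketch under assumptions: SQLite stands in for PostgreSQL, only a few columns are kept, the DDL is hand-written for illustration rather than generated from models.py, and the price is a made-up sample value.

```python
import sqlite3

SCHEMA = """
CREATE TABLE products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT NOT NULL,
    reference TEXT NOT NULL,
    UNIQUE (source, reference)               -- natural key
);
CREATE TABLE price_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER NOT NULL REFERENCES products(id) ON DELETE CASCADE,
    price NUMERIC,
    fetched_at TEXT NOT NULL,
    UNIQUE (product_id, fetched_at)          -- one price per fetch
);
CREATE TABLE scraping_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER REFERENCES products(id) ON DELETE SET NULL,
    url TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
conn.executescript(SCHEMA)

url = "https://www.amazon.fr/dp/B08N5WRWNW"
cursor = conn.execute(
    "INSERT INTO products (source, reference) VALUES (?, ?)",
    ("amazon", "B08N5WRWNW"),
)
product_id = cursor.lastrowid
conn.execute(
    "INSERT INTO price_history (product_id, price, fetched_at) VALUES (?, ?, ?)",
    (product_id, 499.99, "2026-01-14T21:00:00"),
)
conn.execute(
    "INSERT INTO scraping_logs (product_id, url) VALUES (?, ?)",
    (product_id, url),
)

# A second row with the same (source, reference) violates the natural key.
try:
    conn.execute(
        "INSERT INTO products (source, reference) VALUES (?, ?)",
        ("amazon", "B08N5WRWNW"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Deleting the product cascades to its prices but only detaches its logs.
conn.execute("DELETE FROM products WHERE id = ?", (product_id,))
remaining_prices = conn.execute("SELECT COUNT(*) FROM price_history").fetchone()[0]
orphan_log_product_id = conn.execute(
    "SELECT product_id FROM scraping_logs"
).fetchone()[0]

print(duplicate_rejected, remaining_prices, orphan_log_product_id)
```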

Completed Tasks (continued)

4. Database Connection ✅

File: pricewatch/app/db/connection.py

Contents:

get_engine(config): SQLAlchemy engine (pooling)
get_session_factory(config): Session factory
get_session(config): Context manager
init_db(config): Table creation
check_db_connection(config): Health check
reset_engine(): Reset for tests

Rationale:

Singleton engine to avoid multiple connection pools
pool_pre_ping for robustness
Context manager for automatic rollback/close

5. Alembic Setup ✅

Files:

alembic.ini
pricewatch/app/db/migrations/env.py
pricewatch/app/db/migrations/script.py.mako

Rationale:

DB URL injected from AppConfig
compare_type=True so that column type changes are detected in migrations

6. Initial Migration ✅

File: pricewatch/app/db/migrations/versions/20260114_01_initial_schema.py

Contents:

5 tables + indexes + constraints
JSONB for errors and notes
7. Database CLI Commands ✅

File: pricewatch/app/cli/main.py

Commands:

pricewatch init-db              # Create tables
pricewatch migrate "message"    # Generate an Alembic migration
pricewatch upgrade              # Apply migrations
pricewatch downgrade            # Roll back a migration

8. Docker Compose ✅

File: docker-compose.yml

Services:

PostgreSQL 16 (port 5432)
Redis 7 (port 6379)
Volumes for persistence

9. Example .env File ✅

File: .env.example

Variables:

# Database
PW_DB_HOST=localhost
PW_DB_PORT=5432
PW_DB_DATABASE=pricewatch
PW_DB_USER=pricewatch
PW_DB_PASSWORD=pricewatch

# Redis
PW_REDIS_HOST=localhost
PW_REDIS_PORT=6379
PW_REDIS_DB=0

# App
PW_DEBUG=false
PW_WORKER_TIMEOUT=300
PW_WORKER_CONCURRENCY=2
PW_ENABLE_DB=true
PW_ENABLE_WORKER=true

10. Database Tests ✅

Files:

tests/db/test_models.py: SQLAlchemy model tests
tests/db/test_connection.py: Connection and session tests

Test strategy:

In-memory SQLite for unit tests
pytest fixtures for setup/teardown
Tests for relationships, constraints and indexes

📦 Week 2: Repository & Pipeline (DONE)

Planned Tasks

Repository Pattern

File: pricewatch/app/db/repository.py

Class: ProductRepository

get_or_create(source, reference): Find or create a product
save_snapshot(snapshot): Persist a ProductSnapshot to the DB
update_product_metadata(product, snapshot): Update title, url, etc.
add_price_history(product, snapshot): Add a price entry
sync_images(product, images): Sync images (add new, keep existing)
sync_specs(product, specs): Sync specs (upsert)
add_scraping_log(snapshot, product_id): Log the scrape

Status: ✅ Done
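
Two of the operations listed above, find-or-create on the natural key and the spec upsert, can be sketched like this. It is an illustration under assumptions: free functions over sqlite3 stand in for the ProductRepository methods and their SQLAlchemy session, the spec values are invented sample data, and the ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import sqlite3
from typing import Dict


def get_or_create(conn: sqlite3.Connection, source: str, reference: str) -> int:
    """Return the id of the product with this natural key, creating it if needed."""
    row = conn.execute(
        "SELECT id FROM products WHERE source = ? AND reference = ?",
        (source, reference),
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO products (source, reference) VALUES (?, ?)",
        (source, reference),
    )
    return cursor.lastrowid


def sync_specs(conn: sqlite3.Connection, product_id: int, specs: Dict[str, str]) -> None:
    """Insert new spec keys and overwrite the value of existing ones (upsert)."""
    conn.executemany(
        "INSERT INTO product_specs (product_id, spec_key, spec_value) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (product_id, spec_key) "
        "DO UPDATE SET spec_value = excluded.spec_value",
        [(product_id, key, value) for key, value in specs.items()],
    )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        source TEXT NOT NULL,
        reference TEXT NOT NULL,
        UNIQUE (source, reference)
    );
    CREATE TABLE product_specs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_id INTEGER NOT NULL REFERENCES products(id),
        spec_key TEXT NOT NULL,
        spec_value TEXT,
        UNIQUE (product_id, spec_key)
    );
    """
)

first_id = get_or_create(conn, "amazon", "B08N5WRWNW")
second_id = get_or_create(conn, "amazon", "B08N5WRWNW")  # same row, no duplicate

sync_specs(conn, first_id, {"RAM": "8 GB"})
sync_specs(conn, first_id, {"RAM": "16 GB", "Screen": "16 inch"})
specs = dict(
    conn.execute(
        "SELECT spec_key, spec_value FROM product_specs WHERE product_id = ?",
        (first_id,),
    )
)
print(first_id == second_id, specs)
```

The unique constraints do the deduplication work here, which is the reason the schema favors a natural key.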

Scraping Pipeline

File: pricewatch/app/scraping/pipeline.py

Class: ScrapingPipeline

process_snapshot(snapshot, save_to_db): Orchestration
Non-blocking: a DB failure does not crash the pipeline
Returns: product_id or None

Status: ✅ Done
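
The non-blocking behavior described above comes down to catching persistence errors at the pipeline boundary. The sketch below is an assumption about the shape of that logic, not the contents of pipeline.py: the snapshot is a plain dict instead of a ProductSnapshot, and the injected save_snapshot callable is a hypothetical stand-in for the repository.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("pricewatch.pipeline")

Snapshot = Dict[str, Any]


class ScrapingPipeline:
    def __init__(self, save_snapshot: Callable[[Snapshot], int]) -> None:
        # Hypothetical injection point: any callable that persists a
        # snapshot and returns the product id.
        self._save_snapshot = save_snapshot

    def process_snapshot(
        self, snapshot: Snapshot, save_to_db: bool = True
    ) -> Optional[int]:
        """Persist the snapshot if requested; never let a DB error propagate."""
        if not save_to_db:
            return None
        try:
            return self._save_snapshot(snapshot)
        except Exception:
            # The scrape itself succeeded, so the error is logged and swallowed.
            logger.exception("Persistence failed for %s", snapshot.get("url"))
            return None


def failing_save(snapshot: Snapshot) -> int:
    raise ConnectionError("database is down")


snapshot = {"url": "https://www.amazon.fr/dp/B08N5WRWNW"}
saved_id = ScrapingPipeline(lambda s: 42).process_snapshot(snapshot)
failed_id = ScrapingPipeline(failing_save).process_snapshot(snapshot)
skipped_id = ScrapingPipeline(lambda s: 42).process_snapshot(snapshot, save_to_db=False)
print(saved_id, failed_id, skipped_id)
```

Returning None in both the "skipped" and the "failed" case keeps the caller simple; the log entry is what tells the two apart.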

CLI Changes

File: pricewatch/app/cli/main.py

Changes to the run command:

Add a --save-db / --no-db flag
Use ScrapingPipeline when save_db=True
Backward compatibility: JSON output is always created

Status: ✅ Done

Repository + Pipeline Tests ✅

Files:

tests/db/test_repository.py
tests/scraping/test_pipeline.py

Status: ✅ Done

End-to-end CLI + DB Tests ✅

File:

tests/cli/test_run_db.py

Status: ✅ Done

📦 Week 3: Worker Infrastructure (DONE)

Planned Tasks

RQ Task

File: pricewatch/app/tasks/scrape.py

Function: scrape_product(url, use_playwright=True)

Reuses 100% of the Phase 1 code (detect → fetch → parse)
Saves to the DB via ScrapingPipeline
Returns: {success, product_id, snapshot, error}

Status: ✅ Done
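
The result dictionary returned by the task can be sketched as below. This is a hedged illustration, not the code in scrape.py: the detect, fetch, parse and persist keyword arguments are hypothetical injection points that stand in for the Phase 1 functions and the pipeline, and the stub data is invented.

```python
from typing import Any, Callable, Dict


def scrape_product(
    url: str,
    use_playwright: bool = True,
    *,
    detect: Callable[[str], str],
    fetch: Callable[[str, bool], str],
    parse: Callable[[str, str], Dict[str, Any]],
    persist: Callable[[Dict[str, Any]], int],
) -> Dict[str, Any]:
    """Run detect → fetch → parse → persist and always return a result dict."""
    result: Dict[str, Any] = {
        "success": False,
        "product_id": None,
        "snapshot": None,
        "error": None,
    }
    try:
        store = detect(url)
        html = fetch(url, use_playwright)
        snapshot = parse(store, html)
        result["snapshot"] = snapshot
        result["product_id"] = persist(snapshot)
        result["success"] = True
    except Exception as exc:
        # Report the failure in the result instead of raising.
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result


def failing_fetch(url: str, use_playwright: bool) -> str:
    raise TimeoutError("fetch timed out")


stubs = dict(
    detect=lambda url: "amazon",
    parse=lambda store, html: {"source": store, "title": html.strip()},
    persist=lambda snapshot: 1,
)
url = "https://www.amazon.fr/dp/B08N5WRWNW"
ok = scrape_product(url, fetch=lambda u, pw: " Sample title ", **stubs)
failed = scrape_product(url, fetch=failing_fetch, **stubs)
print(ok["success"], failed["error"])
```

Keeping the return value to plain dicts and strings matters because RQ serializes job results (with pickle by default) before storing them in Redis.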

Scheduler

File: pricewatch/app/tasks/scheduler.py

Class: ScrapingScheduler

schedule_product(url, interval_hours=24): Recurring job
enqueue_immediate(url): One-off job
Based on rq-scheduler

Status: ✅ Done

Worker CLI

New commands:

pricewatch worker               # Start the RQ worker
pricewatch enqueue <url>        # Enqueue an immediate scrape
pricewatch schedule <url> --interval 24  # Daily scrape

Status: ✅ Done

Worker + Scheduler Tests ✅

Files:

tests/tasks/test_scrape_task.py
tests/tasks/test_scheduler.py

Status: ✅ Done

Redis Error Handling ✅

Files changed:

pricewatch/app/tasks/scheduler.py:
- Added the RedisUnavailableError exception
- Added the check_redis_connection() helper
- Lazy connection with a verification ping
pricewatch/app/cli/main.py:
- The worker, enqueue and schedule commands handle Redis being down
- Clear error messages with instructions

Tests added (7 tests):

test_scheduler_redis_connection_error
test_scheduler_lazy_connection
test_check_redis_connection_success
test_check_redis_connection_failure
test_scheduler_schedule_redis_error

Status: ✅ Done
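
The lazy connection with a verification ping can be sketched as follows. This is an assumption about the shape of the logic, not the code in scheduler.py: the client_factory argument and the FakeRedis class are hypothetical and exist so that the example runs without a Redis server. Only the names RedisUnavailableError, check_redis_connection and ScrapingScheduler come from the list above.

```python
from typing import Any, Callable, Optional


class RedisUnavailableError(Exception):
    """Raised when Redis cannot be reached."""


def check_redis_connection(client_factory: Callable[[], Any]) -> bool:
    """Return True if a ping succeeds, False on any connection problem."""
    try:
        return bool(client_factory().ping())
    except Exception:
        return False


class ScrapingScheduler:
    def __init__(self, client_factory: Callable[[], Any]) -> None:
        self._client_factory = client_factory
        self._client: Optional[Any] = None  # nothing is opened yet

    @property
    def redis(self) -> Any:
        """Connect on first access and verify the connection with a ping."""
        if self._client is None:
            client = self._client_factory()
            try:
                client.ping()
            except Exception as exc:
                raise RedisUnavailableError(
                    "Redis is unreachable; start it and retry"
                ) from exc
            self._client = client
        return self._client


class FakeRedis:
    """Hypothetical stand-in for a Redis client, used only in this example."""

    def __init__(self, up: bool) -> None:
        self.up = up

    def ping(self) -> bool:
        if not self.up:
            raise ConnectionError("connection refused")
        return True


scheduler = ScrapingScheduler(lambda: FakeRedis(up=False))  # no error: lazy
try:
    scheduler.redis
    outcome = "connected"
except RedisUnavailableError:
    outcome = "unavailable"

print(outcome, check_redis_connection(lambda: FakeRedis(up=True)))
```

With redis-py, the factory could be something like `lambda: redis.Redis.from_url(url)`, whose ping() raises when the server is down.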

Job Observability Logs ✅

File changed: pricewatch/app/tasks/scrape.py

Logs added:

[JOB START] - Job start, with the URL
[STORE] - Detected store
[FETCH] - HTTP/Playwright fetch result (duration, size)
[PARSE] - Parsing result (title, price)
[JOB OK] / [JOB FAILED] - Final result with total duration

Note: The logs are also persisted to the DB via ScrapingLog (already implemented).

Status: ✅ Done
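
The tagged log lines above can be produced with the standard logging module. The sketch below is illustrative only: the run_logged_job function, its callable arguments and the exact message fields are assumptions; only the bracketed tags come from the list above.

```python
import logging
import time
from typing import Any, Callable, Dict

logger = logging.getLogger("pricewatch.tasks.scrape")


def _elapsed_ms(started: float) -> int:
    return int((time.perf_counter() - started) * 1000)


def run_logged_job(
    url: str,
    detect: Callable[[str], str],
    fetch: Callable[[str], str],
    parse: Callable[[str], Dict[str, Any]],
) -> bool:
    """Run one scrape and emit one tagged log line per step."""
    job_started = time.perf_counter()
    logger.info("[JOB START] url=%s", url)
    try:
        store = detect(url)
        logger.info("[STORE] %s", store)

        fetch_started = time.perf_counter()
        html = fetch(url)
        logger.info(
            "[FETCH] duration_ms=%d size_bytes=%d",
            _elapsed_ms(fetch_started),
            len(html.encode("utf-8")),
        )

        snapshot = parse(html)
        logger.info(
            "[PARSE] title=%r price=%s",
            snapshot.get("title"),
            snapshot.get("price"),
        )
    except Exception as exc:
        logger.error(
            "[JOB FAILED] url=%s error=%s duration_ms=%d",
            url, exc, _elapsed_ms(job_started),
        )
        return False
    logger.info("[JOB OK] url=%s duration_ms=%d", url, _elapsed_ms(job_started))
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    run_logged_job(
        "https://www.amazon.fr/dp/B08N5WRWNW",
        detect=lambda url: "amazon",
        fetch=lambda url: "<html>sample</html>",
        parse=lambda html: {"title": "Sample title", "price": None},
    )
```

A fixed tag at the start of each message makes the job phases easy to grep in worker output.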

📦 Week 4: Tests & Documentation (IN PROGRESS)

Planned Tasks

Tests

✅ End-to-end tests (CLI → DB → Worker)
✅ Error tests (DB down, Redis down)
✅ Backward compatibility tests (--no-db)
✅ Performance tests (100+ products)

Test files added:

tests/cli/test_worker_cli.py
tests/cli/test_enqueue_schedule_cli.py
tests/scraping/test_pipeline.py (DB errors)
tests/tasks/test_redis_errors.py
tests/cli/test_run_no_db.py
tests/db/test_bulk_persistence.py
tests/tasks/test_worker_end_to_end.py
tests/cli/test_cli_worker_end_to_end.py
- Result: OK with Redis running

Documentation

✅ Update README.md (Phase 2 setup)
✅ Update CHANGELOG.md
✅ Migration guide (JSON → DB)

📈 Progress Metrics

Category	Completed	Total	%
Week 1	10	10	100%
Week 2	5	5	100%
Week 3	6	6	100%
Week 4	7	7	100%
TOTAL Phase 2	28	28	100%

🎯 Immediate Next Step

Immediate next step

Phase 2 complete, moving on to Phase 3 (REST API)
Advanced API v1: filters, CSV/JSON export, webhooks + associated tests

Afterwards (planned)

Phase 2 documentation (final summary)
Retry policy (optional)
Phase 4 Web UI (dashboard + charts)

🔧 Verification

Week 1 Verification (target)

# Set up the infrastructure
docker-compose up -d
pricewatch init-db

# Check that the tables were created
psql -h localhost -U pricewatch pricewatch
\dt
# → 5 tables: products, price_history, product_images, product_specs, scraping_logs

Week 2 Verification (target)

# Test the pipeline with the DB
pricewatch run --yaml scrap_url.yaml --save-db

# Check the data in the DB
psql -h localhost -U pricewatch pricewatch
SELECT * FROM products LIMIT 5;
SELECT * FROM price_history ORDER BY fetched_at DESC LIMIT 10;

Week 3 Verification (target)

# Enqueue a job
pricewatch enqueue "https://www.amazon.fr/dp/B08N5WRWNW"

# Start the worker
pricewatch worker

# Check that the job was processed
psql -h localhost -U pricewatch pricewatch
SELECT * FROM scraping_logs ORDER BY fetched_at DESC LIMIT 5;

📝 Important Notes

Backward Compatibility

✅ Phase 1 CLI works unchanged
✅ Identical JSON format
✅ Database is optional (--no-db flag)
✅ ProductSnapshot unchanged
✅ Phase 1 tests still pass (295 tests)

Architecture Decisions

Normalization vs. performance:

Choice: Strict normalization (5 tables)
Rationale: The catalog rarely changes, prices change daily
Rejected alternative: Everything in products + JSONB (less queryable)

Natural key vs. UUID:

Choice: (source, reference) as a unique constraint
Rationale: The Amazon ASIN is already globally unique
Rejected alternative: Artificial UUID (makes deduplication more complex)

Synchronous vs. asynchronous:

Choice: Synchronous RQ (no async/await)
Rationale: Phase 1 code is 100% reusable, simplicity
Rejected alternative: Asyncio + asyncpg (massive refactoring)

Last updated: 2026-01-15

Recent progress recap (Phase 3 API)

Advanced filters + CSV/JSON exports + webhooks (CRUD + test)
Advanced API tests added
Cleanup of Pydantic/datetime/selectors warnings
Full pytest suite: 339 passed, 4 skipped

Local validation (Week 1)

docker compose up -d
./venv/bin/alembic -c alembic.ini upgrade head
psql -h localhost -U pricewatch pricewatch
\dt

Result: 6 tables visible (products, price_history, product_images, product_specs, scraping_logs, alembic_version).
Status: ✅ Week 1 done (100%).
