2026-01-14 07:03:38 +01:00

🚀 Phase 2 Infrastructure - IN PROGRESS

Start date: 2026-01-14
Target version: 0.4.0
Goal: Add PostgreSQL and a Redis/RQ worker for persistence and asynchronous scraping

📊 Overview

Phase 2 Goals

✅ Centralized configuration (database, Redis, app)
✅ SQLAlchemy ORM models (5 tables)
✅ Database connection (init_db, get_session)
✅ Alembic migrations
⏳ Repository pattern (CRUD)
⏳ RQ worker for asynchronous scraping
⏳ Scheduler for recurring jobs
✅ Extended CLI (DB commands)
✅ Docker Compose (PostgreSQL + Redis)
⏳ Full test coverage

✅ Week 1: Database Foundation (COMPLETED)

Completed Tasks

1. Centralized Configuration ✅

File: pricewatch/app/core/config.py (187 lines)

Contents:

DatabaseConfig: PostgreSQL configuration
- Host, port, database, user, password
- url property: SQLAlchemy connection string
- url_async property: asyncpg connection string (future use)
- Env var prefix: PW_DB_* (PW_DB_HOST, PW_DB_PORT, etc.)
RedisConfig: Redis configuration for RQ
- Host, port, db, password (optional)
- url property: Redis connection string
- Env var prefix: PW_REDIS_*
AppConfig: Application-wide configuration
- Debug mode
- Worker timeout (default 300 s)
- Worker concurrency (default 2)
- Feature flags: enable_db, enable_worker
- Playwright defaults: timeout, use_playwright
- Nested configs: db, redis
- Env var prefix: PW_*
Singleton pattern: get_config(), set_config(), reset_config()

Rationale:

12-factor app: configuration via environment variables
Pydantic validation ensures the configuration is valid at startup
Default values for local development
.env file support to simplify setup
Feature flags allow the DB and worker to be disabled in tests
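
To illustrate the env-var prefix and singleton pattern described above, here is a minimal sketch. It deliberately swaps Pydantic for a standard-library dataclass so it runs without extra dependencies, so it does not reproduce the validation behavior of the real config.py. The exact URL format and the default values are assumptions taken from the .env example, not from the actual implementation.

```python
import os
from dataclasses import dataclass, field
from typing import Optional


def _env(name: str, default: str) -> str:
    """Read an environment variable, falling back to a local-dev default."""
    return os.environ.get(name, default)


@dataclass(frozen=True)
class DatabaseConfig:
    # Each field is resolved from a PW_DB_* variable when the object is built.
    host: str = field(default_factory=lambda: _env("PW_DB_HOST", "localhost"))
    port: int = field(default_factory=lambda: int(_env("PW_DB_PORT", "5432")))
    database: str = field(default_factory=lambda: _env("PW_DB_DATABASE", "pricewatch"))
    user: str = field(default_factory=lambda: _env("PW_DB_USER", "pricewatch"))
    password: str = field(default_factory=lambda: _env("PW_DB_PASSWORD", "pricewatch"))

    @property
    def url(self) -> str:
        # Assumed format: SQLAlchemy URL for the psycopg2 driver.
        # Note: the password is not URL-escaped in this sketch.
        return (
            f"postgresql+psycopg2://{self.user}:{self.password}"
            f"@{self.host}:{self.port}/{self.database}"
        )


_config: Optional[DatabaseConfig] = None


def get_config() -> DatabaseConfig:
    """Return the process-wide config, building it on first use."""
    global _config
    if _config is None:
        _config = DatabaseConfig()
    return _config


def reset_config() -> None:
    """Drop the cached config so the next get_config() re-reads the environment."""
    global _config
    _config = None
```

Calling reset_config() between tests is what lets each test set its own PW_DB_* values without leaking state into the next one.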

2. Phase 2 Dependencies ✅

File: pyproject.toml (lines 48-60)

Additions:

# Database (Phase 2)
"sqlalchemy>=2.0.0",
"psycopg2-binary>=2.9.0",
"alembic>=1.13.0",

# Configuration (Phase 2)
"python-dotenv>=1.0.0",

# Worker/Queue (Phase 2)
"redis>=5.0.0",
"rq>=1.15.0",
"rq-scheduler>=0.13.0",

3. SQLAlchemy ORM Models ✅

File: pricewatch/app/db/models.py (322 lines)

Tables created:

products - Product catalog
- PK: id (Integer, autoincrement)
- Natural key: (source, reference), enforced by a unique constraint
- Columns: url, title, category, currency
- Timestamps: first_seen_at, last_updated_at
- Relationships: price_history, images, specs, logs
- Indexes: source, reference, last_updated_at
price_history - Price history (time series)
- PK: id (Integer, autoincrement)
- FK: product_id → products(id) CASCADE
- Unique: (product_id, fetched_at), prevents duplicates
- Columns: price (Numeric 10,2), shipping_cost, stock_status
- Fetch metadata: fetch_method, fetch_status, fetched_at
- Check constraints: stock_status, fetch_method, fetch_status
- Indexes: product_id, fetched_at
product_images - Product images
- PK: id (Integer, autoincrement)
- FK: product_id → products(id) CASCADE
- Unique: (product_id, image_url), prevents duplicates
- Columns: image_url (Text), position (Integer, 0=main)
- Index: product_id
product_specs - Product specifications (key-value)
- PK: id (Integer, autoincrement)
- FK: product_id → products(id) CASCADE
- Unique: (product_id, spec_key), prevents duplicates
- Columns: spec_key (String 200), spec_value (Text)
- Indexes: product_id, spec_key
scraping_logs - Observability logs
- PK: id (Integer, autoincrement)
- Optional FK: product_id → products(id) SET NULL
- Columns: url, source, reference, fetched_at
- Metrics: duration_ms, html_size_bytes
- Fetch metadata: fetch_method, fetch_status
- Debug data (JSONB): errors, notes
- Indexes: product_id, source, fetched_at, fetch_status

Schema rationale:

Normalization: products is kept separate from price_history (catalog vs. time series)
Natural key (source, reference) rather than an arbitrary UUID
Separate tables for images and specs: avoids unstructured JSONB
JSONB only for variable data: errors and notes in the logs
Cascade DELETE: deleting a product deletes its history
SET NULL for logs: keeps a trace even after the product is deleted
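
The difference between CASCADE and SET NULL described above can be checked with a small experiment. The sketch below uses the standard-library sqlite3 module instead of SQLAlchemy and PostgreSQL, and a heavily trimmed version of three of the five tables, so the column lists and the "success" status value are illustrative assumptions rather than the real schema.

```python
import sqlite3

# Trimmed, hypothetical subset of the schema: only the keys and constraints
# needed to show the delete behavior.
SCHEMA = """
CREATE TABLE products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT NOT NULL,
    reference TEXT NOT NULL,
    UNIQUE (source, reference)
);
CREATE TABLE price_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER NOT NULL REFERENCES products(id) ON DELETE CASCADE,
    price NUMERIC,
    fetched_at TEXT NOT NULL,
    UNIQUE (product_id, fetched_at)
);
CREATE TABLE scraping_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER REFERENCES products(id) ON DELETE SET NULL,
    fetch_status TEXT NOT NULL
);
"""


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(SCHEMA)

    cur = conn.execute(
        "INSERT INTO products (source, reference) VALUES (?, ?)",
        ("amazon", "B08N5WRWNW"),
    )
    product_id = cur.lastrowid
    conn.execute(
        "INSERT INTO price_history (product_id, price, fetched_at) VALUES (?, ?, ?)",
        (product_id, 19.99, "2026-01-14T07:00:00"),
    )
    conn.execute(
        "INSERT INTO scraping_logs (product_id, fetch_status) VALUES (?, ?)",
        (product_id, "success"),
    )

    # The natural key rejects a second row for the same (source, reference).
    try:
        conn.execute(
            "INSERT INTO products (source, reference) VALUES (?, ?)",
            ("amazon", "B08N5WRWNW"),
        )
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True

    # Deleting the product removes its prices but keeps the log row.
    conn.execute("DELETE FROM products WHERE id = ?", (product_id,))
    remaining_prices = conn.execute("SELECT COUNT(*) FROM price_history").fetchone()[0]
    log_product_id = conn.execute("SELECT product_id FROM scraping_logs").fetchone()[0]
    conn.close()
    return remaining_prices, log_product_id, duplicate_rejected
```

After the delete, the price history is empty while the log row survives with a NULL product_id, which is the trace-keeping behavior the rationale describes.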

Completed Tasks (continued)

4. Database Connection ✅

File: pricewatch/app/db/connection.py

Contents:

get_engine(config): SQLAlchemy engine (with connection pooling)
get_session_factory(config): Session factory
get_session(config): Context manager
init_db(config): Creates the tables
check_db_connection(config): Health check
reset_engine(): Reset for tests

Rationale:

Singleton engine to avoid creating multiple connection pools
pool_pre_ping so stale connections are detected before use
Context manager for automatic rollback and close
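
The commit/rollback/close contract of a session context manager can be sketched as follows. This is not the project's get_session(config): it takes a file path instead of a config object and uses sqlite3 connections in place of SQLAlchemy sessions, purely to keep the example self-contained.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager


@contextmanager
def get_session(db_path):
    """Yield a connection; commit on success, roll back on error, always close."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")

        with get_session(db_path) as session:
            session.execute(
                "CREATE TABLE products (id INTEGER PRIMARY KEY, reference TEXT)"
            )

        # A failure inside the block must leave no partial write behind.
        try:
            with get_session(db_path) as session:
                session.execute("INSERT INTO products (reference) VALUES ('B08N5WRWNW')")
                raise RuntimeError("simulated failure mid-transaction")
        except RuntimeError:
            pass
        with get_session(db_path) as session:
            after_failure = session.execute("SELECT COUNT(*) FROM products").fetchone()[0]

        # A clean exit commits the insert.
        with get_session(db_path) as session:
            session.execute("INSERT INTO products (reference) VALUES ('B08N5WRWNW')")
        with get_session(db_path) as session:
            after_success = session.execute("SELECT COUNT(*) FROM products").fetchone()[0]

        return after_failure, after_success
```

Callers never call commit, rollback, or close themselves, so an exception raised anywhere inside the with block cannot leak a connection or a half-finished transaction.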

5. Alembic Setup ✅

Files:

alembic.ini
pricewatch/app/db/migrations/env.py
pricewatch/app/db/migrations/script.py.mako

Rationale:

DB URL injected from AppConfig
compare_type=True so autogenerate also detects column type changes

6. Initial Migration ✅

File: pricewatch/app/db/migrations/versions/20260114_01_initial_schema.py

Contents:

5 tables + indexes + constraints
JSONB for errors and notes

7. Database CLI Commands ✅

File: pricewatch/app/cli/main.py

Commands:

pricewatch init-db              # Create tables
pricewatch migrate "message"    # Generate an Alembic migration
pricewatch upgrade              # Apply migrations
pricewatch downgrade            # Roll back a migration

8. Docker Compose ✅

File: docker-compose.yml

Services:

PostgreSQL 16 (port 5432)
Redis 7 (port 6379)
Volumes for persistence

9. Example .env File ✅

File: .env.example

Variables:

# Database
PW_DB_HOST=localhost
PW_DB_PORT=5432
PW_DB_DATABASE=pricewatch
PW_DB_USER=pricewatch
PW_DB_PASSWORD=pricewatch

# Redis
PW_REDIS_HOST=localhost
PW_REDIS_PORT=6379
PW_REDIS_DB=0

# App
PW_DEBUG=false
PW_WORKER_TIMEOUT=300
PW_WORKER_CONCURRENCY=2
PW_ENABLE_DB=true
PW_ENABLE_WORKER=true

10. Database Tests ✅

Files:

tests/db/test_models.py: SQLAlchemy model tests
tests/db/test_connection.py: Connection and session tests

Test strategy:

In-memory SQLite for unit tests
pytest fixtures for setup/teardown
Tests cover relationships, constraints, and indexes

📦 Week 2: Repository & Pipeline (IN PROGRESS)

Planned Tasks

Repository Pattern

File: pricewatch/app/db/repository.py

Class: ProductRepository

get_or_create(source, reference): Find or create a product
save_snapshot(snapshot): Persist a ProductSnapshot to the DB
update_product_metadata(product, snapshot): Update title, url, etc.
add_price_history(product, snapshot): Add a price entry
sync_images(product, images): Sync images (add new, keep existing)
sync_specs(product, specs): Sync specs (upsert)
add_scraping_log(snapshot, product_id): Log the scrape

Status: ✅ Done
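
The find-or-create step keyed on (source, reference) might look like the sketch below. It uses sqlite3 rather than SQLAlchemy sessions, a two-table toy schema, and simplified signatures; the (id, created) return value and the silent skipping of duplicate price rows are assumptions made for illustration, not documented behavior of ProductRepository.

```python
import sqlite3


def create_schema(conn):
    """Create a toy two-table schema (hypothetical subset of the real one)."""
    conn.executescript(
        """
        CREATE TABLE products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            source TEXT NOT NULL,
            reference TEXT NOT NULL,
            UNIQUE (source, reference)
        );
        CREATE TABLE price_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            product_id INTEGER NOT NULL REFERENCES products(id),
            price TEXT NOT NULL,
            fetched_at TEXT NOT NULL,
            UNIQUE (product_id, fetched_at)
        );
        """
    )


class ProductRepository:
    def __init__(self, conn):
        self.conn = conn

    def get_or_create(self, source, reference):
        """Return (product_id, created) for the natural key (source, reference)."""
        row = self.conn.execute(
            "SELECT id FROM products WHERE source = ? AND reference = ?",
            (source, reference),
        ).fetchone()
        if row is not None:
            return row[0], False
        cur = self.conn.execute(
            "INSERT INTO products (source, reference) VALUES (?, ?)",
            (source, reference),
        )
        return cur.lastrowid, True

    def add_price_history(self, product_id, price, fetched_at):
        """Insert a price row; return False if (product_id, fetched_at) already exists."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO price_history (product_id, price, fetched_at) "
            "VALUES (?, ?, ?)",
            (product_id, price, fetched_at),
        )
        return cur.rowcount == 1
```

Because lookups go through the natural key, scraping the same product twice resolves to the same row instead of creating a duplicate.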

Scraping Pipeline

File: pricewatch/app/scraping/pipeline.py

Class: ScrapingPipeline

process_snapshot(snapshot, save_to_db): Orchestration
Non-blocking: a DB failure does not crash the pipeline
Returns: product_id or None

Status: ✅ Done
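
The non-blocking contract (return product_id, or None when persistence fails or is disabled) can be sketched like this. The injected save callable and the dict-shaped snapshot are hypothetical stand-ins for the repository and ProductSnapshot; the real class may be wired differently.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("pricewatch.pipeline")


class ScrapingPipeline:
    def __init__(self, save: Callable[[dict], int]):
        # `save` is a hypothetical hook that persists a snapshot and
        # returns the product id; it may raise on any DB problem.
        self._save = save

    def process_snapshot(self, snapshot: dict, save_to_db: bool = True) -> Optional[int]:
        """Persist the snapshot if requested; never let a DB error propagate."""
        if not save_to_db:
            return None
        try:
            return self._save(snapshot)
        except Exception:
            # Log and continue: the caller can still write its JSON output.
            logger.exception("DB save failed; continuing without persistence")
            return None
```

Returning None in both the disabled and the failed case keeps the caller simple: it only has to check whether it got an id back.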

CLI Changes

File: pricewatch/app/cli/main.py

Changes to the run command:

Add the --save-db / --no-db flag
Use ScrapingPipeline when save_db=True
Backward compatibility: JSON output is still always written

Status: ✅ Done

Repository + Pipeline Tests ✅

Files:

tests/db/test_repository.py
tests/scraping/test_pipeline.py

Status: ✅ Done

End-to-end CLI + DB Tests ✅

File:

tests/cli/test_run_db.py

Status: ✅ Done

📦 Week 3: Worker Infrastructure (IN PROGRESS)

Planned Tasks

RQ Task

File: pricewatch/app/tasks/scrape.py

Function: scrape_product(url, use_playwright=True)

Reuses 100% of the Phase 1 code (detect → fetch → parse)
Saves to the DB via ScrapingPipeline
Returns: {success, product_id, snapshot, error}

Status: ✅ Done
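
The result dictionary returned by the task could be assembled as in the sketch below. The keyword-only scrape and persist parameters are hypothetical hooks added so the example runs on its own; the real scrape_product(url, use_playwright=True) calls the Phase 1 code and ScrapingPipeline directly.

```python
from typing import Any, Callable, Dict, Optional


def scrape_product(
    url: str,
    use_playwright: bool = True,
    *,
    scrape: Callable[[str, bool], Dict[str, Any]],
    persist: Callable[[Dict[str, Any]], Optional[int]],
) -> Dict[str, Any]:
    """Run one scrape and report the outcome as a plain dict.

    `scrape` stands in for detect → fetch → parse, and `persist` for the
    pipeline's DB save; both are injected here only for illustration.
    """
    result: Dict[str, Any] = {
        "success": False,
        "product_id": None,
        "snapshot": None,
        "error": None,
    }
    try:
        snapshot = scrape(url, use_playwright)
        result["snapshot"] = snapshot
        result["product_id"] = persist(snapshot)
        result["success"] = True
    except Exception as exc:
        # Report the failure in the result instead of raising.
        result["error"] = str(exc)
    return result
```

Returning a plain dict with a fixed set of keys means callers can inspect success and error without parsing a traceback.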

Scheduler

File: pricewatch/app/tasks/scheduler.py

Class: ScrapingScheduler

schedule_product(url, interval_hours=24): Recurring job
enqueue_immediate(url): One-off job
Built on rq-scheduler

Status: ✅ Done

CLI Worker

New commands:

pricewatch worker               # Start the RQ worker
pricewatch enqueue <url>        # Enqueue an immediate scrape
pricewatch schedule <url> --interval 24  # Daily scrape

Status: ✅ Done

📦 Week 4: Tests & Documentation (NOT STARTED)

Planned Tasks

Tests

End-to-end tests (CLI → DB → Worker)
Error tests (DB down, Redis down)
Backward compatibility tests (--no-db)
Performance tests (100+ products)

Documentation

Update README.md (Phase 2 setup)
Update CHANGELOG.md
Migration guide (JSON → DB)

📈 Progress Metrics

Category	Completed	Total	%
Week 1	10	10	100%
Week 2	5	5	100%
Week 3	3	6	50%
Week 4	0	7	0%
TOTAL Phase 2	18	28	64%

🎯 Immediate Next Step

Immediate next step

End-to-end worker + DB tests
Error handling when Redis is down (CLI + worker)

Afterwards (planned)

Observability logs for scheduled jobs

🔧 Verification

Week 1 verification (target)

# Set up infrastructure
docker-compose up -d
pricewatch init-db

# Check that the tables were created
psql -h localhost -U pricewatch pricewatch
\dt
# → 5 tables: products, price_history, product_images, product_specs, scraping_logs

Week 2 verification (target)

# Test the pipeline with the DB
pricewatch run --yaml scrap_url.yaml --save-db

# Check the data in the DB
psql -h localhost -U pricewatch pricewatch
SELECT * FROM products LIMIT 5;
SELECT * FROM price_history ORDER BY fetched_at DESC LIMIT 10;

Week 3 verification (target)

# Enqueue a job
pricewatch enqueue "https://www.amazon.fr/dp/B08N5WRWNW"

# Start the worker
pricewatch worker

# Check that the job was processed
psql -h localhost -U pricewatch pricewatch
SELECT * FROM scraping_logs ORDER BY fetched_at DESC LIMIT 5;

📝 Important Notes

Backward Compatibility

✅ Phase 1 CLI works unchanged
✅ Identical JSON format
✅ Database is optional (--no-db flag)
✅ ProductSnapshot unchanged
✅ Phase 1 tests still pass (295 tests)

Architecture Decisions

Normalization vs. Performance:

Choice: Strict normalization (5 tables)
Rationale: The catalog rarely changes, while prices change daily
Rejected alternative: Everything in products + JSONB (harder to query)

Natural Key vs. UUID:

Choice: (source, reference) as a unique constraint
Rationale: Amazon ASINs are already globally unique
Rejected alternative: Artificial UUID (complicates deduplication)

Synchronous vs. Asynchronous:

Choice: Synchronous RQ (no async/await)
Rationale: Phase 1 code is 100% reusable, and the design stays simple
Rejected alternative: asyncio + asyncpg (massive refactoring)

Last updated: 2026-01-14

Local validation (Week 1)

docker compose up -d
./venv/bin/alembic -c alembic.ini upgrade head
psql -h localhost -U pricewatch pricewatch
\dt

Result: 6 tables visible (products, price_history, product_images, product_specs, scraping_logs, alembic_version).
Status: ✅ Week 1 in progress (30% complete)
