Rule: No source page = no product.
Before publishing any product, verify that _toby_source_url (or matched_url in Rosetta) resolves to a specific product page, not a category/listing URL.
A valid source URL must point to a single product's page, not a category or listing path (e.g. .../reagent/biochemistry/, .../products/, .../catalog/).
If no valid source page exists, delete the product. Don't generate images, descriptions, or prices for products we can't trace to a real source.
This applies across all suppliers.
| Stage | Tool | What Happens |
|---|---|---|
| Onboarding (Phase 3) | Ingestion analysis | Skip products without a real source page |
| Forge build | forge.py | Reject builds where source URL fails validation |
| Rosetta linking | matcher.py | Flag matched_url values that resolve to category pages |
| Ansel images | scout.py | Refuse to source images without a valid product page |
| Deming QA | orchestrator.py | Score source URL validity as a risk dimension |
| Canonical verification | canonicalize.py | Require --url to pass validation |
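```python
# Category/listing URL patterns that are NOT valid source pages
INVALID_SOURCE_PATTERNS = [
    r'/products/?$',
    r'/catalog/?$',
    r'/category/',
    r'/collections/',
    r'/brands?/',
    r'/search',
    # bare domain (homepage); anchored so that product URLs ending in a
    # trailing slash are not rejected along with it
    r'^(?:https?://[^/]+)?/?$',
]
```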
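A minimal sketch of how such patterns might be applied, assuming they are tested against the URL's path with `re.search`. The names `is_valid_source_url` and `LISTING_PATTERNS` are hypothetical and are not taken from forge.py or any other tool listed above. The `example.com` URLs are placeholders. The homepage case is checked on the parsed path instead of by pattern.

```python
import re
from urllib.parse import urlparse

# Hypothetical copy of the category/listing patterns; the homepage case
# is handled separately below by inspecting the parsed path.
LISTING_PATTERNS = [
    r'/products/?$',
    r'/catalog/?$',
    r'/category/',
    r'/collections/',
    r'/brands?/',
    r'/search',
]


def is_valid_source_url(url):
    """Return False for empty, non-HTTP, homepage, or listing-style URLs."""
    if not url:
        return False
    parsed = urlparse(url.strip())
    # Require an absolute http(s) URL with a host.
    if parsed.scheme not in ('http', 'https') or not parsed.netloc:
        return False
    path = parsed.path
    # Bare domain (homepage), with or without a trailing slash.
    if path in ('', '/'):
        return False
    # Reject anything whose path looks like a category or listing page.
    return not any(re.search(pattern, path) for pattern in LISTING_PATTERNS)


if __name__ == '__main__':
    for candidate in (
        'https://example.com/products/abc-123',
        'https://example.com/products/',
        'https://example.com/',
    ):
        print(candidate, is_valid_source_url(candidate))
```

A pattern check like this only rules out obvious listing URLs. It cannot confirm that a page that passes actually describes the product.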
When in doubt: if you can't find the specific product's name, model number, or SKU on the page the URL points to, it's not a valid source page.
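That manual check could be approximated in code along these lines, assuming the page has already been fetched and its text is available as a string. The function name `page_mentions_product` and its parameters are hypothetical, and a plain substring match is only a rough stand-in for reading the page.

```python
def page_mentions_product(page_text, identifiers):
    """Return True if any non-empty identifier appears in the page text.

    `identifiers` is an iterable of strings such as the product name,
    model number, or SKU. Matching is case-insensitive.
    """
    haystack = page_text.casefold()
    for identifier in identifiers:
        # Skip missing or blank identifiers so they never count as a match.
        needle = (identifier or '').strip().casefold()
        if needle and needle in haystack:
            return True
    return False


if __name__ == '__main__':
    html = '<h1>Acme Buffer X-200</h1><p>SKU: AB-200-500ML</p>'
    print(page_mentions_product(html, ['ab-200-500ml']))
    print(page_mentions_product('Browse all products', ['X-200', '']))
```

A listing page can also mention a product's name, so a match here supports the URL pattern check but does not replace it.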