URL: https://deepwiki.com/calevans/staticforge/3.4-file-discovery-and-processing

Last indexed: 11 February 2026 (5f6a2a)

File Discovery & Processing

Purpose and Scope

This document describes the file discovery and processing system in StaticForge, which is responsible for scanning content directories, parsing metadata, filtering files based on publication rules, and generating URLs. This phase occurs early in the build pipeline and produces the discovered_files data structure that all subsequent features rely upon.

For information about how frontmatter metadata is structured and used in content files, see Content Files & Frontmatter. For details on URL generation logic and category-based routing, see URL Generation. For the broader event-driven architecture that file discovery integrates with, see Event System.


System Overview

The FileDiscovery class is the core component responsible for identifying processable content files and extracting their metadata. It executes during the discovery phase (before the PRE_GLOB event) and runs exactly once per build, implementing a "discover once, process many" pattern that gives every feature the same view of the site structure.

Key Responsibilities:

  • Recursive directory traversal of configured source directories
  • Frontmatter parsing from Markdown (.md) and HTML (.html) files
  • Content filtering based on draft status and publication dates
  • URL generation with category-based routing logic
  • Storage of results in the dependency injection container

Sources: src/Core/FileDiscovery.php:9-296


File Discovery Pipeline

[Diagram: the complete file discovery pipeline, from directory scanning to container storage]

Sources: src/Core/FileDiscovery.php:26-49, src/Core/FileDiscovery.php:72-129
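
As a rough illustration of the first two stages (recursive scanning and extension filtering), the sketch below walks a directory with PHP's SPL iterators. The function name findProcessableFiles() and the flat list of extensions are assumptions made for this example; they are not taken from FileDiscovery.php.

```php
<?php
// Hypothetical helper: recursively collect files whose extension is in
// $extensions (lowercase, without the leading dot).
function findProcessableFiles(string $directory, array $extensions): array
{
    if (!is_dir($directory)) {
        return []; // a missing directory yields no files
    }

    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($directory, FilesystemIterator::SKIP_DOTS)
    );

    $files = [];
    foreach ($iterator as $file) {
        // $file is an SplFileInfo instance
        if ($file->isFile() && in_array(strtolower($file->getExtension()), $extensions, true)) {
            $files[] = $file->getPathname();
        }
    }

    sort($files); // stable ordering between builds (a choice made for this sketch)
    return $files;
}
```

Calling findProcessableFiles('content', ['md', 'html']) would return every matching path under content/. Sorting is done here only to make the result deterministic; the source does not state how the real class orders files.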


Directory Scanning Configuration

The system supports scanning multiple directories, configured through the container variable SCAN_DIRECTORIES. If not set, it defaults to SOURCE_DIR.

| Configuration Variable | Purpose | Default Fallback |
|---|---|---|
| SCAN_DIRECTORIES | Array of directories to scan for content | [SOURCE_DIR] |
| SOURCE_DIR | Primary content directory | content |

The getDirectoriesToScan() method retrieves this configuration and normalizes it to an array.

Sources: src/Core/FileDiscovery.php:51-70
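
The fallback and normalization described above might look roughly like the following. This is a minimal sketch that models the container as a plain array; the function name directoriesToScan() and the exception message are assumptions, not the actual getDirectoriesToScan() implementation.

```php
<?php
// Hypothetical stand-in: $config plays the role of the container variables.
function directoriesToScan(array $config): array
{
    $dirs = $config['SCAN_DIRECTORIES'] ?? null;

    if ($dirs === null || $dirs === [] || $dirs === '') {
        // SCAN_DIRECTORIES is not set: fall back to the primary content directory.
        if (!isset($config['SOURCE_DIR']) || $config['SOURCE_DIR'] === '') {
            throw new RuntimeException('SOURCE_DIR is not set');
        }
        $dirs = $config['SOURCE_DIR'];
    }

    // A single string becomes a one-element array; arrays are re-indexed.
    return array_values((array) $dirs);
}
```

With only SOURCE_DIR set to content, the sketch returns a one-element array containing content; with SCAN_DIRECTORIES set, that list is returned as given.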


Extension Registry Integration

File discovery integrates with the ExtensionRegistry to determine which files are processable. The registry maintains a whitelist of file extensions (.md, .html) that the system can handle.


This allows features (like MarkdownRenderer, HtmlRenderer) to register their supported file types dynamically during feature initialization.

Sources: src/Core/FileDiscovery.php:86, src/Core/ExtensionRegistry.php
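
To make the registration idea concrete, here is a simplified registry sketch. The class name SketchExtensionRegistry and its methods register() and canProcess() are illustrative only; the real ExtensionRegistry API may differ.

```php
<?php
// Simplified, hypothetical registry; method names are not taken from
// src/Core/ExtensionRegistry.php.
final class SketchExtensionRegistry
{
    /** @var array<string, true> registered extensions, lowercase, no dot */
    private array $extensions = [];

    public function register(string $extension): void
    {
        $this->extensions[$this->normalize($extension)] = true;
    }

    public function canProcess(string $path): bool
    {
        $extension = pathinfo($path, PATHINFO_EXTENSION);

        return $extension !== '' && isset($this->extensions[$this->normalize($extension)]);
    }

    private function normalize(string $extension): string
    {
        return strtolower(ltrim($extension, '.'));
    }
}

// Features would register their types during initialization.
$registry = new SketchExtensionRegistry();
$registry->register('.md');
$registry->register('html');
```

Discovery can then ask the registry whether each scanned path is processable, so supporting a new file type only requires a new registration and no change to the scanner.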


Frontmatter Parsing

StaticForge supports two frontmatter formats, one for Markdown and one for HTML content files. Both formats use YAML syntax but differ in their delimiters.

Markdown Frontmatter Format

Markdown files place the YAML block between triple-dash (---) delimiters at the very start of the file.

The regex pattern for extraction: /^---\s*\n(.*?)\n---\s*\n/s

Sources: src/Core/FileDiscovery.php:156-173
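
The snippet below applies the documented pattern to a small sample file. The sample content and its field names (title, category) are made up for illustration.

```php
<?php
// Illustrative Markdown file contents with a frontmatter block on top.
$markdown = "---\n"
    . "title: Hello World\n"
    . "category: blog\n"
    . "---\n"
    . "# Hello\n";

$yamlBlock = null;
if (preg_match('/^---\s*\n(.*?)\n---\s*\n/s', $markdown, $matches) === 1) {
    // Group 1 holds the raw YAML between the two delimiter lines.
    $yamlBlock = $matches[1];
}
```

Because the pattern is anchored with ^ and has no m modifier, a --- line that appears later in the document body is not mistaken for frontmatter.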

HTML Frontmatter Format

HTML files embed the same triple-dash block inside an HTML comment at the start of the file.

The regex pattern for extraction: /^<!--\s*\n---\s*\n(.*?)\n---\s*\n-->\s*\n/s

Sources: src/Core/FileDiscovery.php:175-192
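
A short sketch of the HTML variant follows, again with made-up sample content. It also shows one way to separate the page body from the matched frontmatter; whether the real class does this at discovery time is not stated in the source.

```php
<?php
// Illustrative HTML file contents with frontmatter wrapped in a comment.
$html = "<!--\n"
    . "---\n"
    . "title: About\n"
    . "draft: true\n"
    . "---\n"
    . "-->\n"
    . "<h1>About</h1>\n";

$yamlBlock = null;
$body = $html;
if (preg_match('/^<!--\s*\n---\s*\n(.*?)\n---\s*\n-->\s*\n/s', $html, $matches) === 1) {
    $yamlBlock = $matches[1];                  // raw YAML text
    $body = substr($html, strlen($matches[0])); // everything after the comment
}
```

Wrapping the block in a comment keeps the file valid HTML, so it still renders sensibly when opened directly in a browser.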

YAML Parsing

Both formats delegate to parseYamlContent(), which uses Symfony's YAML parser:

| Step | Method | Purpose |
|---|---|---|
| 1 | Extract YAML block | Regex capture between delimiters |
| 2 | Yaml::parse() | Parse YAML into PHP array |
| 3 | Type validation | Ensure result is array (not null/scalar) |
| 4 | Error handling | Catch parse exceptions, log, return empty array |

Sources: src/Core/FileDiscovery.php:194-220
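
Steps 2 to 4 can be sketched as follows. To keep the snippet runnable without Composer, the parser and logger are passed in as callables; that injection, the function name parseYamlSketch(), and the log message are assumptions of this example. With symfony/yaml installed, [\Symfony\Component\Yaml\Yaml::class, 'parse'] could be passed as the parser.

```php
<?php
// Hypothetical sketch of parse, validate, and fall back to an empty array.
function parseYamlSketch(string $yaml, callable $parser, callable $logError): array
{
    try {
        $data = $parser($yaml); // step 2: parse the extracted block
    } catch (\Throwable $e) {
        // Step 4: log the failure and continue with empty metadata.
        // (Catching Throwable is broader than Symfony's ParseException.)
        $logError('Frontmatter parse error: ' . $e->getMessage());
        return [];
    }

    // Step 3: null or scalar results are treated as "no metadata".
    return is_array($data) ? $data : [];
}

// Demonstration with a stand-in parser that always fails.
$errors = [];
$metadata = parseYamlSketch(
    'title: [unclosed',
    function (string $yaml) { throw new RuntimeException('malformed YAML'); },
    function (string $message) use (&$errors) { $errors[] = $message; }
);
```

Returning an empty array instead of rethrowing means one malformed file does not abort the whole build; the file is simply treated as having no metadata.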


File Filtering Logic

Draft Filtering

Files marked with draft: true are excluded from the build unless SHOW_DRAFTS is enabled.

The SHOW_DRAFTS variable supports both boolean and string values (e.g., from environment variables); values are coerced to booleans with filter_var().

Sources: src/Core/FileDiscovery.php:92-102
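
A minimal sketch of the draft check, assuming the coercion uses FILTER_VALIDATE_BOOLEAN. The function name shouldSkipDraft() is hypothetical, and coercing the draft field itself is an extra assumption of this example.

```php
<?php
// Hypothetical helper: returns true when a file should be left out of the build.
function shouldSkipDraft(array $metadata, $showDrafts): bool
{
    // FILTER_VALIDATE_BOOLEAN accepts real booleans as well as
    // "1", "true", "on", and "yes"; anything else evaluates to false.
    $show = filter_var($showDrafts, FILTER_VALIDATE_BOOLEAN);
    $isDraft = filter_var($metadata['draft'] ?? false, FILTER_VALIDATE_BOOLEAN);

    return $isDraft && !$show;
}
```

The coercion matters because environment variables arrive as strings: a plain (bool) cast would treat the string "false" as true, whereas filter_var() treats it as false.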

Future Date Filtering

Files whose date field is set to a future timestamp are automatically excluded.


Date Format Support:

  • Integer timestamps (pre-parsed by YAML)
  • String dates parseable by strtotime() (e.g., 2025-12-31, +1 day)
  • Files with invalid dates (where strtotime() returns false) are included by default

Sources: src/Core/FileDiscovery.php:104-118
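
The date rules above can be sketched like this. The function name isFutureDated() and the explicit $now parameter (used so the check is testable) are assumptions of this example.

```php
<?php
// Hypothetical helper: true means the file is scheduled for the future
// and would be excluded from the build.
function isFutureDated(array $metadata, int $now): bool
{
    if (!isset($metadata['date'])) {
        return false; // no date: include
    }

    $date = $metadata['date'];
    if (is_int($date)) {
        $timestamp = $date; // already a timestamp
    } elseif (is_string($date)) {
        $timestamp = strtotime($date, $now); // relative formats resolve against $now
    } else {
        return false; // unexpected type: include
    }

    if ($timestamp === false) {
        return false; // unparseable date: include by default
    }

    return $timestamp > $now;
}
```

In a build, $now would normally be time(). Including files with unparseable dates errs on the side of publishing, which matches the default described in the list above.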


URL Generation Algorithm

The generateUrl() method transforms file paths into canonical URLs through a multi-step process.

URL Generation Pipeline


URL Transformation Examples

| Input Path | Category | Output URL |
|---|---|---|
| content/index.md | (none) | /index.html |
| content/blog/post.md | (none) | /blog/post.html |
| content/post.md | blog | /blog/post.html |
| content/tutorials/intro.md | guides | /tutorials/intro.html |

Note: Files in subdirectories retain their directory structure even if they have a category. The category only affects files at the root level.

Sources: src/Core/FileDiscovery.php:222-269
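
The table and note above suggest roughly the following steps. This is a sketch under stated assumptions, not the real generateUrl(): the parameter list, the way the base URL is joined, and the expectation that the category arrives already slugified are all choices made for this example.

```php
<?php
// Hypothetical reconstruction of the documented URL rules.
function generateUrlSketch(
    string $path,
    string $sourceDir,
    string $baseUrl,
    ?string $categorySlug = null
): string {
    // 1. Make the path relative to the source directory.
    $prefix = rtrim($sourceDir, '/') . '/';
    $relative = strpos($path, $prefix) === 0
        ? substr($path, strlen($prefix))
        : ltrim($path, '/');

    // 2. Swap the source extension (.md, .html, ...) for .html.
    $relative = preg_replace('/\.[^.\/]+$/', '.html', $relative);

    // 3. Only root-level files are moved under their category.
    $isRootLevel = strpos($relative, '/') === false;
    if ($isRootLevel && $categorySlug !== null && $categorySlug !== '') {
        $relative = $categorySlug . '/' . $relative;
    }

    // 4. Prefix the site base URL.
    return rtrim($baseUrl, '/') . '/' . $relative;
}
```

With a base URL of https://example.com, content/post.md in category blog maps to https://example.com/blog/post.html, while content/tutorials/intro.md in category guides keeps its directory and maps to https://example.com/tutorials/intro.html.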

Category Slug Generation

The slugify() method converts category names into URL-safe strings:

Transformation Rules:

  1. Convert to lowercase
  2. Replace spaces and underscores with hyphens
  3. Remove non-alphanumeric characters (except hyphens)
  4. Collapse consecutive hyphens
  5. Trim hyphens from ends

Examples:

  • "Web Development" → "web-development"
  • "PHP & MySQL" → "php-mysql"
  • "Front-End__Design" → "front-end-design"

Sources: src/Core/FileDiscovery.php:271-295
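
The five transformation rules translate almost line for line into code. The sketch below follows the rules as listed; the actual slugify() method may be written differently.

```php
<?php
// Sketch of the five documented rules, applied in order.
function slugifySketch(string $text): string
{
    $slug = strtolower($text);                        // 1. lowercase
    $slug = str_replace([' ', '_'], '-', $slug);      // 2. spaces/underscores to hyphens
    $slug = preg_replace('/[^a-z0-9-]/', '', $slug);  // 3. drop other characters
    $slug = preg_replace('/-+/', '-', $slug);         // 4. collapse repeated hyphens

    return trim($slug, '-');                          // 5. trim hyphens from the ends
}
```

The order matters: "PHP & MySQL" becomes "php-&-mysql" after rule 2 and "php--mysql" after rule 3, and only the collapse in rule 4 produces "php-mysql".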


The discovered_files Data Structure

The result of file discovery is stored in the container as discovered_files, an array of file descriptor objects, each with the fields listed in the schema below.


Data Structure Schema

| Field | Type | Description |
|---|---|---|
| path | string | Absolute filesystem path to source file |
| url | string | Full canonical URL including SITE_BASE_URL |
| metadata | array | Parsed frontmatter as associative array |

This structure becomes the single source of truth for the entire build process. Features query this data during POST_GLOB to build menus, category indexes, tag clouds, and other site-wide structures.

Sources: src/Core/FileDiscovery.php:33-46, src/Core/FileDiscovery.php:122-127
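
For orientation, the data might look like the following for a two-page site. All paths, URLs, and metadata keys here are invented, and whether the real array is a plain list or keyed by path is an assumption of this sketch.

```php
<?php
// Illustrative contents only; every value below is made up.
$discoveredFiles = [
    [
        'path' => '/var/www/site/content/index.md',
        'url' => 'https://example.com/index.html',
        'metadata' => ['title' => 'Home'],
    ],
    [
        'path' => '/var/www/site/content/post.md',
        'url' => 'https://example.com/blog/post.html',
        'metadata' => ['title' => 'First Post', 'category' => 'blog', 'tags' => ['php']],
    ],
];

// A consumer can build site-wide structures without touching the filesystem,
// for example by grouping URLs by category.
$urlsByCategory = [];
foreach ($discoveredFiles as $file) {
    $category = $file['metadata']['category'] ?? '(none)';
    $urlsByCategory[$category][] = $file['url'];
}
```

Because every feature reads the same array, a category index and a menu built from it cannot disagree about which pages exist.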


Integration with Event System

File discovery executes before the event lifecycle begins, but its results are consumed extensively during the POST_GLOB event.


Features That Consume discovered_files:

  • MenuBuilder (priority 100): Scans for menu metadata to build navigation
  • CategoryIndex (priority 50): Creates category index pages
  • Tags (priority 150): Extracts tag metadata for tag clouds
  • RobotsTxt (priority 150): Identifies pages with robots: no
  • Search (POST_RENDER): Indexes content for search functionality

Sources: content/development/architecture.md:62-103, src/Features/MenuBuilder/Services/StaticMenuProcessor.php:26-99


File Discovery in the Build Lifecycle

[Diagram: where file discovery fits in the complete build process]

Sources: content/development/architecture.md:17-103, src/Core/FileDiscovery.php:26-49


Performance Characteristics

Single-Pass Architecture

FileDiscovery implements a single-pass scan strategy:

  • Files are read from disk exactly once during discovery
  • Frontmatter is parsed once and cached in memory
  • No re-scanning occurs during the rendering phase

Memory Efficiency

The discovered_files array is held in memory for the duration of the build:

  • Typical site (100 pages): ~500 KB memory
  • Large site (1000 pages): ~5 MB memory
  • Content bodies are not stored; only metadata is kept

Filesystem Access Patterns

| Operation | Count per Build | Caching |
|---|---|---|
| Directory scan | 1 × directories | No cache |
| File read (metadata) | 1 × files | In-memory |
| File read (content) | 1 × files (during render) | No cache |

Sources: content/development/architecture.md:235-243


Configuration Dependencies

File discovery requires the following container variables:

| Variable | Required | Purpose | Default |
|---|---|---|---|
| SOURCE_DIR | Yes | Primary content directory | content |
| SCAN_DIRECTORIES | No | Additional directories to scan | Falls back to SOURCE_DIR |
| SHOW_DRAFTS | No | Include draft files | false |
| SITE_BASE_URL | Yes | URL prefix for generated URLs | (none) |

Sources: src/Core/FileDiscovery.php:56-70, src/Core/FileDiscovery.php:231-268


Error Handling

Missing Directories

Non-existent directories trigger a WARNING log entry but do not halt execution.

Sources: src/Core/FileDiscovery.php:37-40
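
A minimal sketch of this warn-and-continue behavior, with the logger modeled as a callable. The function name keepExistingDirectories() and the message text are hypothetical.

```php
<?php
// Hypothetical helper: drop missing directories, warning about each one.
function keepExistingDirectories(array $directories, callable $logWarning): array
{
    $existing = [];
    foreach ($directories as $directory) {
        if (!is_dir($directory)) {
            $logWarning("Directory not found, skipping: {$directory}");
            continue; // keep going with the remaining directories
        }
        $existing[] = $directory;
    }

    return $existing;
}
```

Skipping instead of throwing lets a build with several scan directories succeed when one optional directory is absent.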

Frontmatter Parse Errors

YAML parse exceptions are caught and logged as ERROR, and an empty metadata array is returned.

Sources: src/Core/FileDiscovery.php:206-219

File Read Failures

Failed file_get_contents() operations log a WARNING and return empty metadata.

Sources: src/Core/FileDiscovery.php:139-143
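
One way to implement such a guarded read is sketched below. The function name readFileOrWarn(), the nullable return value, and the up-front readability check are assumptions of this example; the caller would map null to empty metadata.

```php
<?php
// Hypothetical helper: returns the file contents, or null after logging a warning.
function readFileOrWarn(string $path, callable $logWarning): ?string
{
    // Checking first avoids the E_WARNING that file_get_contents() raises
    // for a missing file; the false check covers remaining failures.
    $content = (is_file($path) && is_readable($path))
        ? file_get_contents($path)
        : false;

    if ($content === false) {
        $logWarning("Could not read file: {$path}");
        return null;
    }

    return $content;
}
```

The strict comparison with false is deliberate: an empty file yields an empty string, which is a successful read and should not be logged.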

Missing Configuration

Missing required container variables throw RuntimeException:

  • SOURCE_DIR not set → Exception during directory scan
  • SITE_BASE_URL not set → Exception during URL generation

Sources: src/Core/FileDiscovery.php:63-65, src/Core/FileDiscovery.php:260-262
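
The fail-fast check for required variables could be as small as the sketch below. The container is modeled as a plain array, and the function name requireConfig() and the exception message are hypothetical.

```php
<?php
// Hypothetical helper: fetch a required variable or fail immediately.
function requireConfig(array $config, string $name): string
{
    if (!isset($config[$name]) || $config[$name] === '') {
        throw new RuntimeException("Required configuration variable {$name} is not set");
    }

    return (string) $config[$name];
}
```

Unlike the recoverable cases above, a missing SOURCE_DIR or SITE_BASE_URL affects every file, so stopping the build with an exception is more useful than logging and continuing.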


Testing Coverage

File discovery is covered by unit and integration tests:

Test Scenarios

| Test Class | Coverage |
|---|---|
| FileDiscoveryFutureDateTest | Future date filtering, no-date handling, invalid dates |
| Integration tests | Full pipeline with multiple file types |

Key Test Cases:

  • Draft filtering with SHOW_DRAFTS toggle
  • Future date exclusion (timestamp and string formats)
  • Invalid date handling (fails gracefully)
  • Files without date metadata (included by default)
  • Category-based URL generation
  • Subdirectory structure preservation

Sources: tests/Unit/Core/FileDiscoveryFutureDateTest.php:1-93


Usage Example

File discovery can be invoked manually, although the application bootstrap typically handles this.


Sources: src/Core/FileDiscovery.php:19-24, src/Core/FileDiscovery.php:26-49