Last indexed: 11 February 2026 (5f6a2a)

File Discovery & Processing

Purpose and Scope

This document describes the file discovery and processing system in StaticForge, which is responsible for scanning content directories, parsing metadata, filtering files based on publication rules, and generating URLs. This phase occurs early in the build pipeline and produces the discovered_files data structure that all subsequent features rely upon.

For information about how frontmatter metadata is structured and used in content files, see Content Files & Frontmatter. For details on URL generation logic and category-based routing, see URL Generation. For the broader event-driven architecture that file discovery integrates with, see Event System.

System Overview

The FileDiscovery class is the core component responsible for identifying processable content files and extracting their metadata. It executes during the discovery phase (before the PRE_GLOB event) and runs exactly once per build, implementing a "discover once, process many" pattern that gives every feature the same view of the site structure.

Key Responsibilities:

Recursive directory traversal of configured source directories
Frontmatter parsing from Markdown (.md) and HTML (.html) files
Content filtering based on draft status and publication dates
URL generation with category-based routing logic
Storage of results in the dependency injection container

Sources: src/Core/FileDiscovery.php9-296

File Discovery Pipeline

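The file discovery pipeline runs from directory scanning to container storage: each configured directory is traversed recursively, files with a registered extension have their frontmatter parsed, drafts and future-dated files are filtered out, a URL is generated for each remaining file, and the results are stored in the container as discovered_files.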

Sources: src/Core/FileDiscovery.php26-49 src/Core/FileDiscovery.php72-129

Directory Scanning Configuration

The system supports scanning multiple directories, configured through the container variable SCAN_DIRECTORIES. If not set, it defaults to SOURCE_DIR.

Configuration Variable	Purpose	Default Fallback
`SCAN_DIRECTORIES`	Array of directories to scan for content	`[SOURCE_DIR]`
`SOURCE_DIR`	Primary content directory	`content`

The getDirectoriesToScan() method retrieves this configuration and normalizes it to an array, so that a single directory and a list of directories are handled the same way.

Sources: src/Core/FileDiscovery.php51-70
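
The fallback and normalization described above can be sketched as follows. This is a minimal illustration, not the actual getDirectoriesToScan() code: the function name directoriesToScan() and the plain $config array (standing in for the container) are assumptions made for the example.

```php
<?php
// Hypothetical helper: $config stands in for the container's variables.
function directoriesToScan(array $config): array
{
    // Prefer SCAN_DIRECTORIES when it is present and non-empty.
    if (!empty($config['SCAN_DIRECTORIES'])) {
        // The cast wraps a single string in an array and leaves arrays as-is.
        return array_values((array) $config['SCAN_DIRECTORIES']);
    }

    // Otherwise fall back to SOURCE_DIR, which must be configured.
    if (empty($config['SOURCE_DIR'])) {
        throw new RuntimeException('SOURCE_DIR is not set');
    }

    return [$config['SOURCE_DIR']];
}

$dirs = directoriesToScan(['SOURCE_DIR' => 'content']); // ['content']
```

Normalizing once at the boundary means the scanning loop only ever has to deal with an array of paths.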

Extension Registry Integration

File discovery integrates with the ExtensionRegistry to determine which files are processable. The registry maintains a whitelist of file extensions (.md, .html) that the system can handle.

This allows features (like MarkdownRenderer, HtmlRenderer) to register their supported file types dynamically during feature initialization.

Sources: src/Core/FileDiscovery.php86 src/Core/ExtensionRegistry.php
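
To make the registry idea concrete, here is a small stand-in written for this page. The class name SketchExtensionRegistry and its methods register() and canProcess() are illustrative assumptions and may not match the real ExtensionRegistry API.

```php
<?php
// Illustrative stand-in, not StaticForge's ExtensionRegistry class.
final class SketchExtensionRegistry
{
    /** @var array<string, true> Registered extensions, stored as ".ext" */
    private array $extensions = [];

    // Called by a feature during initialization to declare a supported type.
    public function register(string $extension): void
    {
        $this->extensions[self::normalize($extension)] = true;
    }

    // True when the file's extension has been registered by some feature.
    public function canProcess(string $path): bool
    {
        $extension = pathinfo($path, PATHINFO_EXTENSION);

        return $extension !== ''
            && isset($this->extensions[self::normalize($extension)]);
    }

    // Accepts "md" or ".MD" and stores both as ".md".
    private static function normalize(string $extension): string
    {
        return '.' . strtolower(ltrim($extension, '.'));
    }
}

$registry = new SketchExtensionRegistry();
$registry->register('.md');
$registry->register('html');
```

With this shape, discovery can ask the registry about each path and skip anything no feature has claimed, such as images or stylesheets.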

Frontmatter Parsing

StaticForge supports dual frontmatter formats to accommodate both Markdown and HTML content files. Both formats use YAML syntax but differ in their delimiters.

Markdown Frontmatter Format

Markdown files delimit the YAML block with a line of three dashes (---) above and below it, starting on the first line of the file.

The regex pattern for extraction: /^---\s*\n(.*?)\n---\s*\n/s

Sources: src/Core/FileDiscovery.php156-173
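
The pattern above can be exercised with preg_match(). The function name extractMarkdownFrontmatter() and the sample file content are assumptions for illustration; only the regular expression is taken from this page.

```php
<?php
// Returns the raw YAML between the "---" delimiters, or null if there is none.
function extractMarkdownFrontmatter(string $content): ?string
{
    // The /s flag lets "." match newlines, so the lazy group can span lines.
    if (preg_match('/^---\s*\n(.*?)\n---\s*\n/s', $content, $matches) === 1) {
        return $matches[1];
    }

    return null;
}

$file = "---\ntitle: Hello\ndraft: false\n---\n# Hello\n";
$yaml = extractMarkdownFrontmatter($file); // "title: Hello\ndraft: false"
```

Because the pattern is anchored with ^ and has no /m flag, a --- line that appears later in the body (for example a horizontal rule) is not mistaken for frontmatter.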

HTML Frontmatter Format

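HTML files embed frontmatter within an HTML comment at the top of the file.

The extraction regex is anchored at the start of the file and matches from the opening <!-- through the closing --> and the newline that follows it: /^<!--…-->\s*\n/s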

Sources: src/Core/FileDiscovery.php175-192

YAML Parsing

Both formats delegate to parseYamlContent(), which uses Symfony's YAML parser:

Step	Method	Purpose
1	Extract YAML block	Regex capture between delimiters
2	`Yaml::parse()`	Parse YAML into PHP array
3	Type validation	Ensure result is array (not null/scalar)
4	Error handling	Catch parse exceptions, log, return empty array

Sources: src/Core/FileDiscovery.php194-220
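
Steps 2 to 4 of the table can be sketched without pulling in a dependency by passing the parser in as a callable. The function name parseYamlBlock() and both callable parameters are assumptions for this example; in StaticForge the parsing is done by Symfony's Yaml::parse().

```php
<?php
/**
 * @param callable $parse    Stand-in for the YAML parser (string in, mixed out)
 * @param callable $logError Receives a message when parsing fails
 */
function parseYamlBlock(string $yaml, callable $parse, callable $logError): array
{
    try {
        // Step 2: parse the YAML text.
        $result = $parse($yaml);
    } catch (Throwable $e) {
        // Step 4: log the failure and fall back to empty metadata.
        $logError('Frontmatter parse error: ' . $e->getMessage());

        return [];
    }

    // Step 3: an empty block parses to null and a bare value to a scalar;
    // neither is usable as metadata, so only arrays are accepted.
    return is_array($result) ? $result : [];
}
```

With symfony/yaml installed, the parser argument could be a closure that calls Symfony\Component\Yaml\Yaml::parse(), which throws a ParseException on malformed input.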

File Filtering Logic

Draft Filtering

Files marked with draft: true are excluded from the build unless SHOW_DRAFTS is enabled.

The SHOW_DRAFTS variable accepts both boolean and string values (for example, the string "true" read from an environment variable); filter_var() coerces the value to a boolean.

Sources: src/Core/FileDiscovery.php92-102
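
A minimal sketch of the draft check follows. It assumes the coercion uses FILTER_VALIDATE_BOOLEAN (this page only says filter_var() is used) and that a draft is recognized by a strict boolean true; the function name shouldSkipDraft() is hypothetical.

```php
<?php
/**
 * @param array $metadata   Parsed frontmatter
 * @param mixed $showDrafts Raw SHOW_DRAFTS value (bool, string, or null)
 */
function shouldSkipDraft(array $metadata, $showDrafts): bool
{
    // Assumed filter: true, "1", "true", "on" and "yes" become true;
    // anything else, including "false" and null, becomes false.
    $show = filter_var($showDrafts, FILTER_VALIDATE_BOOLEAN);

    // Assumption: only a boolean true in the frontmatter marks a draft.
    $isDraft = ($metadata['draft'] ?? false) === true;

    return $isDraft && !$show;
}
```

Coercing with filter_var() avoids the classic pitfall where the non-empty string "false" is truthy in a plain boolean cast.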

Future Date Filtering

Files whose date field resolves to a timestamp later than the current time are automatically excluded from the build.

Date Format Support:

Integer timestamps (pre-parsed by YAML)
String dates parseable by strtotime() (e.g., 2025-12-31, +1 day)
Invalid dates (strtotime() returns false) are included by default

Sources: src/Core/FileDiscovery.php104-118
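
The date rules above can be sketched as a single predicate. The function name isFutureDated() and the explicit $now parameter (which makes the check deterministic in tests) are assumptions for this example, not the actual implementation.

```php
<?php
/**
 * @param mixed $date Value of the frontmatter "date" field, if any
 * @param int   $now  Current Unix timestamp
 */
function isFutureDated($date, int $now): bool
{
    if (is_int($date)) {
        // YAML may already have produced an integer timestamp.
        $timestamp = $date;
    } elseif (is_string($date)) {
        // Relative strings such as "+1 day" are resolved against $now.
        $timestamp = strtotime($date, $now);
        if ($timestamp === false) {
            return false; // unparseable date: the file is included
        }
    } else {
        return false; // no date, or an unsupported type: the file is included
    }

    return $timestamp > $now;
}

$now = 1700000000; // fixed reference time (November 2023) for the example
$skip = isFutureDated('2035-06-01', $now); // true
```

Treating unparseable dates as "not in the future" matches the inclusive default described in the list above: a typo in a date never silently hides a page.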

URL Generation Algorithm

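The generateUrl() method transforms a file path into a canonical URL in several steps: it takes the path relative to the source directory, replaces the source extension with .html, prefixes the category slug when a root-level file declares a category, and prepends SITE_BASE_URL.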

URL Generation Pipeline

URL Transformation Examples

Input Path	Category	Output URL
`content/index.md`	—	`/index.html`
`content/blog/post.md`	—	`/blog/post.html`
`content/post.md`	`blog`	`/blog/post.html`
`content/tutorials/intro.md`	`guides`	`/tutorials/intro.html`

Note: Files in subdirectories retain their directory structure even if they have a category. The category only affects files at the root level. The output URLs in the table are shown without the SITE_BASE_URL prefix.

Sources: src/Core/FileDiscovery.php222-269
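
The table rows can be reproduced with a short function. This is a sketch under stated assumptions: the name sketchGenerateUrl(), its parameter list, and the exact order of steps are illustrative, and the category is expected to arrive already slugified.

```php
<?php
function sketchGenerateUrl(
    string $path,
    string $sourceDir,
    string $baseUrl,
    ?string $categorySlug = null
): string {
    // Make the path relative to the source directory.
    $prefix = rtrim($sourceDir, '/') . '/';
    $relative = strpos($path, $prefix) === 0
        ? substr($path, strlen($prefix))
        : ltrim($path, '/');

    // Swap the source extension (.md, .html, ...) for .html.
    $relative = preg_replace('/\.[^.\/]+$/', '.html', $relative);

    // The category only applies to files at the root of the source directory.
    $isRootLevel = strpos($relative, '/') === false;
    if ($categorySlug !== null && $categorySlug !== '' && $isRootLevel) {
        $relative = $categorySlug . '/' . $relative;
    }

    // Join with the base URL using exactly one slash.
    return rtrim($baseUrl, '/') . '/' . $relative;
}

$url = sketchGenerateUrl('content/post.md', 'content', 'https://example.com/', 'blog');
// $url is "https://example.com/blog/post.html"
```

Passing an empty base URL yields the site-relative forms shown in the table, such as /blog/post.html.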

Category Slug Generation

The slugify() method converts category names into URL-safe strings:

Transformation Rules:

Convert to lowercase
Replace spaces and underscores with hyphens
Remove non-alphanumeric characters (except hyphens)
Collapse consecutive hyphens
Trim hyphens from ends

Examples:

"Web Development" → "web-development"
"PHP & MySQL" → "php-mysql"
"Front-End__Design" → "front-end-design"

Sources: src/Core/FileDiscovery.php271-295
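
The five transformation rules map directly onto a few string operations. The function below is a sketch written from the rules and examples on this page, not a copy of the actual slugify() method; the name slugifyCategory() is an assumption.

```php
<?php
function slugifyCategory(string $name): string
{
    // 1. Convert to lowercase.
    $slug = strtolower($name);
    // 2. Replace spaces and underscores with hyphens.
    $slug = preg_replace('/[\s_]/', '-', $slug);
    // 3. Remove everything except letters, digits, and hyphens.
    $slug = preg_replace('/[^a-z0-9-]/', '', $slug);
    // 4. Collapse runs of hyphens into one.
    $slug = preg_replace('/-+/', '-', $slug);
    // 5. Trim hyphens from both ends.
    return trim($slug, '-');
}

$slug = slugifyCategory('PHP & MySQL'); // "php-mysql"
```

The order matters: collapsing hyphens after removing other characters is what turns "php-&-mysql" into "php-mysql" rather than "php--mysql".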

The discovered_files Data Structure

The result of file discovery is stored in the container as discovered_files, an array with one entry per discovered file:

Data Structure Schema

Field	Type	Description
`path`	`string`	Absolute filesystem path to source file
`url`	`string`	Full canonical URL including `SITE_BASE_URL`
`metadata`	`array`	Parsed frontmatter as associative array

This structure becomes the single source of truth for the entire build process. Features query this data during POST_GLOB to build menus, category indexes, tag clouds, and other site-wide structures.

Sources: src/Core/FileDiscovery.php33-46 src/Core/FileDiscovery.php122-127
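
To show how a feature might consume this structure, the sketch below groups URLs by category. The sample entries, the assumption that discovered_files is a plain list of associative arrays, and the function name urlsByCategory() are all illustrative.

```php
<?php
// Sample data in the shape described by the schema table (values invented).
$discoveredFiles = [
    [
        'path' => '/site/content/index.md',
        'url' => 'https://example.com/index.html',
        'metadata' => ['title' => 'Home'],
    ],
    [
        'path' => '/site/content/first.md',
        'url' => 'https://example.com/blog/first.html',
        'metadata' => ['title' => 'First', 'category' => 'blog'],
    ],
    [
        'path' => '/site/content/second.md',
        'url' => 'https://example.com/blog/second.html',
        'metadata' => ['title' => 'Second', 'category' => 'blog'],
    ],
];

// Build a category => list of URLs index from metadata alone.
function urlsByCategory(array $discoveredFiles): array
{
    $index = [];
    foreach ($discoveredFiles as $file) {
        $category = $file['metadata']['category'] ?? null;
        if ($category === null) {
            continue; // uncategorized files are not indexed
        }
        $index[$category][] = $file['url'];
    }

    return $index;
}

$index = urlsByCategory($discoveredFiles);
```

Because every entry already carries its URL and metadata, a consumer like this never needs to touch the filesystem.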

Integration with Event System

File discovery executes before the event lifecycle begins, but its results are consumed extensively during the POST_GLOB event:

Features That Consume discovered_files:

MenuBuilder (priority 100): Scans for menu metadata to build navigation
CategoryIndex (priority 50): Creates category index pages
Tags (priority 150): Extracts tag metadata for tag clouds
RobotsTxt (priority 150): Identifies pages with robots: no
Search (POST_RENDER): Indexes content for search functionality

Sources: content/development/architecture.md62-103 src/Features/MenuBuilder/Services/StaticMenuProcessor.php26-99

File Discovery in the Build Lifecycle

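Within the complete build process, file discovery runs once before the PRE_GLOB event. Features then consume discovered_files during POST_GLOB, and content bodies are read again later, during rendering.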

Sources: content/development/architecture.md17-103 src/Core/FileDiscovery.php26-49

Performance Characteristics

Single-Pass Architecture

FileDiscovery implements a single-pass scan strategy:

Files are read from disk exactly once during discovery
Frontmatter is parsed once and cached in memory
No re-scanning occurs during the rendering phase

Memory Efficiency

The discovered_files array is held in memory for the duration of the build:

Typical site (100 pages): ~500 KB memory
Large site (1000 pages): ~5 MB memory
Content bodies are not stored—only metadata

Filesystem Access Patterns

Operation	Count per Build	Caching
Directory scan	1 × directories	No cache
File read (metadata)	1 × files	In-memory
File read (content)	1 × files (during render)	No cache

Sources: content/development/architecture.md235-243

Configuration Dependencies

File discovery requires the following container variables:

Variable	Required	Purpose	Default
`SOURCE_DIR`	Yes	Primary content directory	`content`
`SCAN_DIRECTORIES`	No	Directories to scan for content	Falls back to `SOURCE_DIR`
`SHOW_DRAFTS`	No	Include draft files	`false`
`SITE_BASE_URL`	Yes	URL prefix for generated URLs	(none)

Sources: src/Core/FileDiscovery.php56-70 src/Core/FileDiscovery.php231-268

Error Handling

Missing Directories

Non-existent directories trigger a WARNING log entry but do not halt execution; the remaining directories are still scanned.

Sources: src/Core/FileDiscovery.php37-40
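
The skip-and-continue behavior can be sketched with PHP's standard recursive iterators. Whether FileDiscovery uses these iterators is an assumption, as are the function name listFiles() and the $logWarning callable standing in for the logger.

```php
<?php
/**
 * @param string[] $directories Directories to traverse
 * @param callable $logWarning  Receives a message for each missing directory
 * @return string[] Sorted list of file paths found
 */
function listFiles(array $directories, callable $logWarning): array
{
    $files = [];
    foreach ($directories as $directory) {
        if (!is_dir($directory)) {
            // Warn and move on instead of aborting the build.
            $logWarning("Directory not found, skipping: {$directory}");
            continue;
        }

        // SKIP_DOTS omits the "." and ".." entries during traversal.
        $iterator = new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator($directory, FilesystemIterator::SKIP_DOTS)
        );
        foreach ($iterator as $file) {
            if ($file->isFile()) {
                $files[] = $file->getPathname();
            }
        }
    }

    sort($files);

    return $files;
}
```

Checking is_dir() first matters here because RecursiveDirectoryIterator throws an exception when given a path that does not exist.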

Frontmatter Parse Errors

YAML parse exceptions are caught and logged at ERROR level; the file is then treated as having an empty metadata array.

Sources: src/Core/FileDiscovery.php206-219

File Read Failures

A failed file_get_contents() call is logged at WARNING level and yields empty metadata.

Sources: src/Core/FileDiscovery.php139-143

Missing Configuration

A missing required container variable causes a RuntimeException to be thrown:

SOURCE_DIR not set → Exception during directory scan
SITE_BASE_URL not set → Exception during URL generation

Sources: src/Core/FileDiscovery.php63-65 src/Core/FileDiscovery.php260-262

Testing Coverage

File discovery is covered by unit tests and integration tests:

Test Scenarios

Test Class	Coverage
`FileDiscoveryFutureDateTest`	Future date filtering, no-date handling, invalid dates
Integration tests	Full pipeline with multiple file types

Key Test Cases:

Draft filtering with SHOW_DRAFTS toggle
Future date exclusion (timestamp and string formats)
Invalid date handling (fails gracefully)
Files without date metadata (included by default)
Category-based URL generation
Subdirectory structure preservation

Sources: tests/Unit/Core/FileDiscoveryFutureDateTest.php1-93

Usage Example

File discovery can be invoked manually, although the application bootstrap typically handles this.

Sources: src/Core/FileDiscovery.php19-24 src/Core/FileDiscovery.php26-49

URL: https://deepwiki.com/calevans/staticforge/3.4-file-discovery-and-processing