Error Recovery

When a resource fails, pragma captures the failure details and gives you the tools to diagnose, fix, and retry. This guide walks through identifying failures, understanding their causes, and getting your infrastructure back to a healthy state.

Identifying Failed Resources

Resources in a failed state appear with [FAILED] when listed:

pragma resources list

The output shows the lifecycle state of each resource:

gcp/storage/data-lake [READY]
gcp/bigquery-dataset/analytics [FAILED]
gcp/bigquery-table/events [PENDING]
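In a long listing it can help to filter down to the failed entries. The sketch below assumes the listing is plain text with one resource per line in the `<name> [STATE]` form shown above; `failed_resources` is a hypothetical helper name, not a pragma command.

```shell
# failed_resources: hypothetical helper, not part of pragma.
# Reads `pragma resources list` output on stdin and prints the name
# of every resource whose last field is [FAILED].
# Assumes one resource per line in the form: <name> [STATE]
failed_resources() {
  awk '$NF == "[FAILED]" { print $1 }'
}
```

Under the same assumption, it would be used as `pragma resources list | failed_resources`.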

To see details about a specific failed resource:

pragma resources get gcp/bigquery-dataset analytics

Common Failure Scenarios

Configuration Errors

Most failures trace back to a problem with the resource’s configuration or access:

Missing required fields - A required config value is not set
Invalid values - A value doesn’t match the expected format or constraints
Permission errors - The provider doesn’t have access to create or modify the resource

Recovery: Fix the configuration in your YAML file and re-apply:

pragma resources apply --pending fixed-resource.yaml

Dependency Failures

A resource fails if its dependencies aren’t satisfied:

Missing dependency - A referenced resource doesn’t exist
Dependency not ready - A dependency exists but isn’t in READY state
Invalid field reference - A ${...} reference points to a field that doesn’t exist

Recovery: Ensure all dependencies are in READY state first:

# Check dependency status
pragma resources get gcp/storage data-lake

# If the dependency has failed, fix it and re-apply it first
pragma resources apply --pending data-lake.yaml

Provider Errors

Sometimes the underlying provider (GCP, AWS, etc.) rejects the operation:

Quota exceeded - You’ve hit a service limit
Resource conflicts - A resource with that name already exists outside pragma
Service unavailable - Temporary provider outage

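Recovery: Address the provider-specific issue (request more quota, rename or remove the conflicting resource, or wait out the outage), then retry, either by re-applying the resource with --pending or by retrying its event from the dead letter queue.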

Using the Dead Letter Queue

When a resource operation fails after retries, it moves to the dead letter queue. This prevents failed operations from blocking other work and preserves the failure details for investigation.

List Failed Events

See all failed events:

pragma ops dead-letter list

Output shows a table with event details:

Event ID    Provider  Resource Type     Resource Name  Error Message           Failed At
evt_abc123  gcp       bigquery-dataset  analytics      Permission denied: ...  2025-01-15 10:30:00
evt_def456  gcp       storage           backup         Quota exceeded: ...     2025-01-15 10:32:00
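To collect the event IDs for one kind of failure, the table can be filtered by text. This is a minimal sketch that assumes the output is a plain-text table with a single header line and the event ID in the first column, as in the sample above; `dead_letter_ids` is a hypothetical helper name.

```shell
# dead_letter_ids: hypothetical helper, not part of pragma.
# Reads `pragma ops dead-letter list` output on stdin and prints the
# event ID (first column) of each row that contains the given text.
# Assumes a plain-text table with exactly one header line.
dead_letter_ids() {
  awk -v needle="$1" 'NR > 1 && index($0, needle) > 0 { print $1 }'
}
```

For example, `pragma ops dead-letter list | dead_letter_ids "Quota exceeded"` would print only the IDs of quota-related events, if the real output matches the assumed layout.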

Filter by provider to focus on specific failures:

pragma ops dead-letter list --provider gcp

Inspect Event Details

Get the full error message and context:

pragma ops dead-letter show evt_abc123

This returns the complete event data, including:

The resource that failed
The full error message
When the failure occurred
The operation that was attempted

Retry Failed Events

After fixing the underlying issue, retry the failed operation:

pragma ops dead-letter retry evt_abc123

Or retry all failed events at once:

pragma ops dead-letter retry --all
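Between a single event and --all, you may want to retry a chosen subset. The sketch below assumes the event IDs arrive one per line on stdin; `retry_events` is a hypothetical helper, and the only pragma command it runs is the single-event retry shown above.

```shell
# retry_events: hypothetical helper, not part of pragma.
# Reads event IDs from stdin (one per line) and retries each one.
# Returns non-zero if any retry command fails.
retry_events() {
  failures=0
  while IFS= read -r event_id; do
    # Skip blank lines in the input.
    [ -n "$event_id" ] || continue
    # </dev/null keeps the command from consuming the ID list on stdin.
    if ! pragma ops dead-letter retry "$event_id" </dev/null; then
      echo "retry failed: $event_id" >&2
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}
```

Retrying one event at a time keeps a single bad event from hiding which retries went through, at the cost of one CLI call per event.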

Clear Resolved Events

Once you’ve addressed failures (or decided to abandon them), remove events from the queue:

# Delete a single event
pragma ops dead-letter delete evt_abc123

# Delete all events for a provider
pragma ops dead-letter delete --provider gcp

# Delete all events
pragma ops dead-letter delete --all

Dependency Failure Cascades

When a resource fails, it affects downstream resources:

Failed resources stay failed - Once a resource is FAILED, it isn’t retried automatically
Dependent resources wait - Resources that depend on a failed resource stay in PENDING
Changes don’t propagate - The dependency graph pauses until the failure is resolved

Consider this dependency chain:

data-lake (READY) -> analytics (FAILED) -> reports (PENDING)

The reports resource can’t proceed because analytics is in the FAILED state. To recover:

1. Fix the analytics configuration
2. Re-apply it with --pending to retry
3. Once analytics reaches READY, reports proceeds automatically

Recovery Workflow

When you encounter failures, follow this workflow:

Identify failures

pragma resources list
pragma ops dead-letter list

Investigate root cause

pragma resources get <provider>/<resource> <name>
pragma ops dead-letter show <event-id>

Fix the issue

Update your YAML configuration, fix permissions, or address provider limits.

Retry

pragma resources apply --pending fixed-resource.yaml
pragma ops dead-letter retry <event-id>

Verify

pragma resources get <provider>/<resource> <name>

Confirm the resource reaches READY state.
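The verify step can be scripted as a polling loop. This is a rough sketch, assuming `pragma resources list` prints one `<name> [STATE]` line per resource as shown under Identifying Failed Resources; `wait_ready`, its default attempt count, and its default delay are all hypothetical choices.

```shell
# wait_ready: hypothetical helper, not part of pragma.
# Usage: wait_ready RESOURCE [ATTEMPTS] [DELAY_SECONDS]
# Polls the resource listing until RESOURCE is READY or FAILED.
# Returns 0 on READY, 1 on FAILED, 2 if neither state is seen in time.
# Assumes list output lines of the form: <name> [STATE]
wait_ready() {
  resource="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(pragma resources list | awk -v r="$resource" '$1 == r { print $NF }')
    case "$state" in
      "[READY]")  return 0 ;;
      "[FAILED]") return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  return 2
}
```

For example, `wait_ready gcp/bigquery-dataset/analytics` after a re-apply would block until the resource settles, and the exit code tells a calling script whether to continue or go back to the investigation step.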

Preventing Failures

Reduce failures by:

Validating configuration before applying with --pending
Checking that dependencies are READY before applying dependent resources
Using draft mode - Apply without --pending first to validate, then apply with --pending
Monitoring the dead letter queue regularly for early warning of issues
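Regular monitoring of the dead letter queue can be as simple as counting its rows on a schedule. The sketch below assumes the list output is a plain-text table with one header line, and that an empty queue prints either the header alone or nothing; `dead_letter_count` is a hypothetical helper name.

```shell
# dead_letter_count: hypothetical helper, not part of pragma.
# Reads `pragma ops dead-letter list` output on stdin and prints the
# number of non-empty rows after the header line.
# Assumes a plain-text table with exactly one header line.
dead_letter_count() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}
```

A scheduled job could run `pragma ops dead-letter list | dead_letter_count` and alert when the number is above zero.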
