stakpak-api 0.3.75

Stakpak: Your DevOps AI Agent. Generate infrastructure code, debug Kubernetes, configure CI/CD, automate deployments, without giving an LLM the keys to production.
# Infrastructure Discovery

Your mission is to rapidly discover and document the user's infrastructure, tech stack, and environment with **minimal human interaction**. You should be thorough, systematic, and fast.

## Objectives

1. **Discover** the user's infrastructure setup automatically by inspecting the local environment
2. **Ask** only what you cannot determine programmatically -- consolidate questions into a single, focused prompt
3. **Present** your findings for the user to verify and correct
4. **Document** everything in an `INFRA.md` file in the current working directory
5. **Suggest** actionable next steps

---

## Phase 0: Check for Existing INFRA.md

Before starting discovery, check if `INFRA.md` already exists in the current directory.

**If INFRA.md exists:**
- Read it and parse the existing infrastructure knowledge
- Use it as a baseline -- skip re-discovering things already documented with high confidence
- Focus discovery on: sections marked with `[?]`, sections marked with `[!]`, anything that looks stale (check "Last updated" date), and any new signals in the environment not yet captured
- After discovery, present a **diff-style summary** showing what changed, what's new, and what was removed
- Ask the user to confirm before updating the file

**If INFRA.md does not exist:**
- Proceed with full discovery (Phase 1 onward)
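
As a rough illustration of the baseline check above, the sketch below scans an existing `INFRA.md` for `[?]` / `[!]` markers and decides whether the "Last updated" date looks stale. The ISO date format, the 30-day threshold, and both function names are assumptions made for this example, not part of Stakpak.

```python
import re
from datetime import date, datetime

MARKER_RE = re.compile(r"\[(\?|!)\]")
# Assumes the date is written as YYYY-MM-DD; other formats are treated as missing.
UPDATED_RE = re.compile(r"^>\s*Last updated:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE)


def sections_needing_rediscovery(text: str) -> list[str]:
    """Return the `## ` headings whose body contains a [?] or [!] marker."""
    flagged: list[str] = []
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
        elif current and MARKER_RE.search(line) and current not in flagged:
            flagged.append(current)
    return flagged


def is_stale(text: str, today: date, max_age_days: int = 30) -> bool:
    """True when the 'Last updated' date is missing or older than the threshold."""
    match = UPDATED_RE.search(text)
    if match is None:
        return True
    updated = datetime.strptime(match.group(1), "%Y-%m-%d").date()
    return (today - updated).days > max_age_days
```

Sections returned by `sections_needing_rediscovery` are the ones to re-scan first; everything else can be treated as the baseline.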

---

## Phase 1: Automated Discovery

Before asking the user anything, launch parallel subagents to gather as much information as possible from the local environment. Each subagent should be scoped to a specific discovery domain.

**Important context about the current directory:**
- The current working directory may or may not contain application source code
- It might be an IaC repo, a monorepo, an ops repo, or just the user's home directory
- Do NOT assume you have access to application source code -- treat whatever is here as one signal among many
- The user's applications may live in other directories, remote git repos, or running on servers you can't see yet

### Subagent Execution Strategy

**Tool selection for subagents:**
- For reading local config files: use `view` tool (fast, no overhead)
- For running CLI discovery commands (e.g., `kubectl`, `aws`, `gcloud`, `helm`, `docker`): use `run_command` tool with **sandbox enabled** (`enable_sandbox=true`) so the subagent can execute autonomously without pausing for approval
- Some discovery tasks only need file reads; others need CLI tools. Choose the minimal toolset per subagent.

**Sandbox trade-offs:**
- Sandboxed subagents can run commands autonomously (no approval blocking) but have startup overhead and require Docker to be installed
- If Docker is not available, fall back to non-sandboxed subagents with the `view` tool only (file-based discovery). Note that CLI-based discovery was skipped, and tell the user that Docker is required to run commands safely.
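
The tool-selection and fallback rules above could be sketched roughly as follows. `SubagentPlan` and `plan_subagent` are hypothetical names for this example; only the `view` and `run_command` tool names and the `enable_sandbox` flag come from the text. Finding `docker` on `PATH` does not prove the daemon is running, so treat `docker_on_path` as a first-pass check only.

```python
import shutil
from dataclasses import dataclass


@dataclass
class SubagentPlan:
    tools: list[str]
    enable_sandbox: bool
    note: str = ""


def docker_on_path() -> bool:
    """First-pass check only: the binary exists, the daemon may still be down."""
    return shutil.which("docker") is not None


def plan_subagent(needs_cli: bool, docker_available: bool) -> SubagentPlan:
    """Pick the minimal toolset for one discovery subagent."""
    if not needs_cli:
        # File reads only: no sandbox overhead needed.
        return SubagentPlan(tools=["view"], enable_sandbox=False)
    if docker_available:
        return SubagentPlan(tools=["view", "run_command"], enable_sandbox=True)
    # No Docker: degrade to file-based discovery and record why.
    return SubagentPlan(
        tools=["view"],
        enable_sandbox=False,
        note="CLI-based discovery skipped: Docker is required to run commands safely.",
    )
```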

**Breaking down discovery tasks:**
- Keep each subagent focused on a narrow scope so it completes quickly
- Prefer many small subagents over few large ones -- a subagent that reads 3 config files is better than one that tries to scan everything
- If a domain is large (e.g., "Cloud Providers" covers AWS + GCP + Azure), split it into separate subagents per provider

### Discovery Domains

Launch these discovery tasks **in parallel**:

#### 1a. AWS Discovery
- Read `~/.aws/config` and `~/.aws/credentials` (structure only, not secret values)
- Check env vars: `AWS_PROFILE`, `AWS_REGION`, `AWS_DEFAULT_REGION`, `AWS_ACCOUNT_ID`
- If `aws` CLI is available: `aws sts get-caller-identity`, `aws configure list-profiles`
- List all configured profiles, regions, and active accounts
- **Only run read-only commands -- no mutations, no resource creation**
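
One way to read `~/.aws/config` for structure only is to extract profile names and regions and ignore every other key. This is a minimal sketch, assuming the standard `[default]` / `[profile name]` section layout; `list_aws_profiles` is a hypothetical helper, and it should be pointed at the config file rather than the credentials file.

```python
import configparser
from typing import Optional


def list_aws_profiles(config_text: str) -> dict[str, Optional[str]]:
    """Map profile name -> region (or None); no other keys are returned."""
    # interpolation=None so '%' characters in values are never interpreted.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    profiles: dict[str, Optional[str]] = {}
    for section in parser.sections():
        if section == "default":
            name = section
        elif section.startswith("profile "):
            name = section[len("profile "):].strip()
        else:
            continue  # e.g. sso-session blocks are not profiles
        profiles[name] = parser.get(section, "region", fallback=None)
    return profiles
```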

#### 1b. GCP Discovery
- Read `~/.config/gcloud/` directory structure
- Check env vars: `GOOGLE_PROJECT`, `GOOGLE_APPLICATION_CREDENTIALS`, `CLOUDSDK_*`
- If `gcloud` CLI is available: `gcloud config list`, `gcloud projects list`

#### 1c. Azure Discovery
- Read `~/.azure/` directory structure
- Check env vars: `AZURE_SUBSCRIPTION_ID`, `AZURE_TENANT_ID`, `ARM_*`
- If `az` CLI is available: `az account show`, `az account list`

#### 1d. Other Cloud Providers
- DigitalOcean: `~/.config/doctl/`, `doctl account get`
- Cloudflare: `CLOUDFLARE_API_TOKEN`, `CLOUDFLARE_API_KEY` env vars (existence only, not values)
- Hetzner, Linode, Vultr: check for respective CLI configs

#### 2a. Kubernetes Discovery
- Read `~/.kube/config` -- list all contexts, clusters, current context
- If `kubectl` is available: `kubectl config get-contexts`, `kubectl cluster-info` (may fail if cluster unreachable -- that's fine)
- Check for multiple kubeconfig files via `KUBECONFIG` env var
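
`KUBECONFIG` can hold several paths joined by the platform's path-list separator (colon on Linux and macOS, semicolon on Windows). A small sketch of resolving the files to inspect, with `kubeconfig_paths` as a hypothetical helper name:

```python
import os
from pathlib import Path


def kubeconfig_paths(env: dict[str, str], home: Path) -> list[Path]:
    """Return every kubeconfig file to inspect, defaulting to ~/.kube/config."""
    raw = env.get("KUBECONFIG", "")
    paths = [Path(p).expanduser() for p in raw.split(os.pathsep) if p]
    return paths or [home / ".kube" / "config"]
```

In practice this would be called as `kubeconfig_paths(dict(os.environ), Path.home())`, and each returned file checked for existence before reading.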

#### 2b. Container & Registry Discovery
- Check for Docker: `docker info`, `~/.docker/config.json` (registry auth entries, not credentials)
- Check for Podman, containerd, nerdctl
- Identify configured container registries (ECR, GCR, ACR, DockerHub, GHCR) from docker config
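
To report registries without touching credentials, it is enough to read the key names under `auths` and `credHelpers` in `~/.docker/config.json` and discard the values. A minimal sketch, with `configured_registries` as a hypothetical helper:

```python
import json


def configured_registries(config_json: str) -> list[str]:
    """Return registry hosts from a Docker config; auth values are never read out."""
    config = json.loads(config_json)
    hosts = set(config.get("auths") or {}) | set(config.get("credHelpers") or {})
    return sorted(hosts)
```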

#### 2c. Helm & GitOps Discovery
- If `helm` is available: `helm version`, `helm repo list`
- Check for ArgoCD CLI (`argocd version`), Flux CLI (`flux version`)
- Look for GitOps manifests in current directory (ArgoCD Application CRDs, Flux Kustomization files)

#### 3. Infrastructure as Code
- Scan the current directory for:
  - Terraform: `.tf` files, `.terraform/`, `terraform.tfstate`, `terragrunt.hcl`, `.terraform.lock.hcl`
  - Pulumi: `Pulumi.yaml`, `Pulumi.*.yaml`
  - CloudFormation: `template.yaml`, `template.json`, `samconfig.toml`
  - CDK: `cdk.json`, `cdk.out/`
  - Ansible: `ansible.cfg`, `playbook*.yml`, `inventory/`
  - Crossplane, CDKTF, OpenTofu
- For Terraform: identify providers, backends (S3, GCS, etc.), and module sources from `.tf` files
- For any IaC found: summarize the resources being managed

#### 4. CI/CD & Source Control
- Check for CI/CD configs in the current directory:
  - GitHub Actions: `.github/workflows/`
  - GitLab CI: `.gitlab-ci.yml`
  - Jenkins: `Jenkinsfile`
  - CircleCI: `.circleci/config.yml`
  - Bitbucket Pipelines: `bitbucket-pipelines.yml`
  - ArgoCD, Flux, Tekton manifests
- Identify git remote(s) and hosting platform (GitHub, GitLab, Bitbucket, etc.)
- Check for `.env`, `.env.*` files (note their existence only, **never read or log their contents**)
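
Identifying the hosting platform from a git remote can be sketched as below. The helper names and the three-entry host table are assumptions for illustration; the function deliberately drops any `user:token@` prefix so embedded credentials are never reported.

```python
import re

# Illustrative table only; self-hosted instances fall through to "other".
KNOWN_HOSTS = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "bitbucket.org": "Bitbucket",
}


def remote_host(url: str) -> str:
    """Extract the host from an scp-style or URL-style git remote."""
    rest = url.strip().split("://", 1)[-1]  # drop the scheme if present
    rest = rest.split("@", 1)[-1]           # drop userinfo (may hold a token)
    return re.split(r"[:/]", rest, maxsplit=1)[0].lower()


def hosting_platform(url: str) -> str:
    host = remote_host(url)
    return KNOWN_HOSTS.get(host, f"other ({host})")
```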

#### 5. Current Directory Tech Stack
- Scan the current directory for language/framework indicators:
  - `package.json`, `go.mod`, `Cargo.toml`, `requirements.txt`, `pyproject.toml`, `Pipfile`, `pom.xml`, `build.gradle`, `Gemfile`, `composer.json`, `*.csproj`
- Check for `docker-compose.yml` / `compose.yaml` service definitions
- Check for `Dockerfile` / `Containerfile` build targets and base images
- Note: this may be the application code, or it may be an ops/infra repo. Report what you find without assuming.
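
The indicator scan above amounts to matching file names against a small table. The sketch below assumes a flat list of file names from the current directory; the ecosystem labels and the `detect_stack` name are illustrative choices, not a fixed taxonomy.

```python
from fnmatch import fnmatchcase

INDICATORS = {
    "package.json": "Node.js",
    "go.mod": "Go",
    "Cargo.toml": "Rust",
    "requirements.txt": "Python",
    "pyproject.toml": "Python",
    "Pipfile": "Python",
    "pom.xml": "Java (Maven)",
    "build.gradle": "JVM (Gradle)",
    "Gemfile": "Ruby",
    "composer.json": "PHP",
    "*.csproj": ".NET",
}


def detect_stack(file_names: list[str]) -> dict[str, list[str]]:
    """Group indicator files by ecosystem; unknown files are ignored."""
    found: dict[str, list[str]] = {}
    for name in file_names:
        for pattern, ecosystem in INDICATORS.items():
            if fnmatchcase(name, pattern):
                found.setdefault(ecosystem, []).append(name)
    return found
```

An empty result does not mean there is no application; it only means this directory carries no indicator files.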

#### 6. Networking & DNS Signals
- Look for DNS provider references in IaC or config files (Route53, Cloudflare, etc.)
- Look for TLS/cert management references (cert-manager, Let's Encrypt, ACM)
- Look for load balancer configs (ALB, NLB, Nginx, HAProxy, Traefik, Caddy) in IaC or compose files
- Check for VPN or bastion host references

#### 7. Monitoring & Observability Signals
- Check for monitoring tool configs and env vars:
  - Datadog: `datadog.yaml`, `DD_API_KEY` (existence only)
  - Prometheus/Grafana: config files, helm values
  - New Relic, Splunk, ELK/OpenSearch
  - PagerDuty, OpsGenie integration references
  - Sentry: `SENTRY_DSN` (existence only), `.sentryclirc`
  - OpenTelemetry: `otel-*` configs
- Check for logging configs (Fluentd, Fluent Bit, Logstash, CloudWatch references)

#### 8. Secrets & Access Management
- Check for secrets management tooling:
  - HashiCorp Vault: `VAULT_ADDR` (existence only), `.vault-token` (existence only)
  - SOPS: `.sops.yaml`
  - Sealed Secrets, External Secrets Operator references in k8s manifests
  - 1Password CLI, Doppler CLI
- Check `~/.ssh/config` -- list host aliases only, **never read private key files**
- **CRITICAL: Never read, log, or output actual secret values, tokens, passwords, or private keys. Only report the existence and type of secrets management, not the secrets themselves.**
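
Listing SSH host aliases without reading key material can be done by looking only at `Host` lines in `~/.ssh/config`. This sketch assumes the common whitespace-separated `Host alias ...` form and skips wildcard patterns; `ssh_host_aliases` is a hypothetical helper.

```python
def ssh_host_aliases(config_text: str) -> list[str]:
    """Return concrete Host aliases; IdentityFile and other keys are ignored."""
    aliases: list[str] = []
    for line in config_text.splitlines():
        parts = line.strip().split()
        if len(parts) >= 2 and parts[0].lower() == "host":
            # Skip patterns such as "*" or "!bastion"; they are not real hosts.
            aliases.extend(p for p in parts[1:] if not any(c in p for c in "*?!"))
    return aliases
```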

#### 9. Running Services & Applications

This is the most important discovery domain -- it answers "what's actually running and where?"

**Kubernetes workloads** (if any cluster is reachable -- may overlap with 2a, that's fine):
- `kubectl get deployments,statefulsets -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,REPLICAS:.status.readyReplicas,IMAGE:.spec.template.spec.containers[*].image`
- Identify the main application services vs infrastructure services (ingress controllers, cert-manager, monitoring agents, etc.)
- Check for service mesh (Istio, Linkerd): `kubectl get virtualservices,destinationrules -A` or `linkerd check`

**Docker / Compose** (if Docker is available):
- `docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'` -- running containers
- If `docker-compose.yml` / `compose.yaml` exists: `docker compose ps` -- compose service status
- Note any containers that look like application services vs databases/caches/queues

**Cloud compute services** (use whichever cloud CLIs are available):
- AWS:
  - `aws ecs list-clusters` → for each cluster: `aws ecs list-services --cluster <name>` → `aws ecs describe-services` to get task counts and load balancers
  - `aws lambda list-functions --query 'Functions[].{Name:FunctionName,Runtime:Runtime,LastModified:LastModified}'` -- serverless functions
  - ``aws ec2 describe-instances --filters Name=instance-state-name,Values=running --query 'Reservations[].Instances[].{ID:InstanceId,Type:InstanceType,Name:Tags[?Key==`Name`].Value|[0],AZ:Placement.AvailabilityZone}'`` -- running VMs
  - `aws elasticbeanstalk describe-environments --query 'Environments[].{Name:EnvironmentName,Status:Status,Platform:PlatformArn}'`
  - `aws lightsail get-instances` (if applicable)
- GCP:
  - `gcloud run services list` -- Cloud Run services
  - `gcloud compute instances list` -- running VMs
  - `gcloud functions list` -- Cloud Functions
  - `gcloud app services list` -- App Engine services
- Azure:
  - `az webapp list --query '[].{Name:name,State:state,URL:defaultHostName}'`
  - `az functionapp list --query '[].{Name:name,State:state}'`
  - `az container list` -- Container Instances
  - `az aks list` → check for managed k8s (may already be covered by 2a)

**Managed data services** (databases, caches, queues that applications depend on):
- AWS: `aws rds describe-db-instances`, `aws elasticache describe-cache-clusters`, `aws sqs list-queues`, `aws sns list-topics`, `aws s3 ls`
- GCP: `gcloud sql instances list`, `gcloud redis instances list`
- Azure: `az sql server list`, `az redis list`
- Note: only list names, types, and regions -- **never output connection strings or credentials**

**Local dev services**:
- Check for listening ports: `lsof -i -P -n | grep LISTEN` (macOS) or `ss -tlnp` (Linux)
- Identify common dev server ports (3000, 4200, 5000, 5173, 8000, 8080, 8443, 9090)
- Cross-reference with running Docker containers and compose services

**Service routing & ingress**:
- `kubectl get ingress,gateway,virtualservice -A` -- how traffic reaches services
- Check for API gateways: AWS API Gateway (`aws apigateway get-rest-apis`, `aws apigatewayv2 get-apis`), Kong, Traefik
- Check for CDN/edge: CloudFront distributions (`aws cloudfront list-distributions`), Cloudflare zones

**Goal**: Build a map of service name → runtime (EKS, ECS, Lambda, VM, Docker, etc.) → location (region/cluster) → endpoints (if discoverable). This is the foundation for understanding the user's actual application topology.
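
The service map described above might be held in a structure like the following before it is written out. `ServiceEntry` and `to_table_row` are hypothetical names, and marking inferred or missing values with `[?]` is an assumption that follows the marker convention used for INFRA.md.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ServiceEntry:
    name: str
    runtime: str             # e.g. EKS, ECS, Lambda, VM, Docker
    location: str            # region/cluster
    replicas: Optional[int] = None
    endpoints: list[str] = field(default_factory=list)
    confidence: str = "inferred"  # "confirmed" or "inferred"


def to_table_row(s: ServiceEntry) -> str:
    """Render one row of a Service | Runtime | Location | Replicas | Endpoints table."""
    name = s.name if s.confidence == "confirmed" else f"{s.name} [?]"
    replicas = str(s.replicas) if s.replicas is not None else "[?]"
    endpoints = ", ".join(s.endpoints) if s.endpoints else "[?]"
    return f"| {name} | {s.runtime} | {s.location} | {replicas} | {endpoints} |"
```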

### Subagent Instructions Template

Each subagent should:
- Use **read-only** operations only (view files, run non-mutating commands)
- Return structured findings as a summary list
- Tag each finding with a confidence level:
  - `[confirmed]` -- saw direct evidence (config file exists, CLI returned data)
  - `[inferred]` -- saw indirect references (mentioned in a config, referenced in IaC)
- Note anything that needs human clarification
- If a command fails (tool not installed, cluster unreachable, auth expired), note the failure and move on -- do not retry or block
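
The "note the failure and move on" rule can be sketched as a thin wrapper around a single command. `Finding` and `run_read_only` are hypothetical names, and the 20-second timeout is an arbitrary choice; the wrapper does not itself check that a command is read-only, so the caller must pass only non-mutating commands.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    domain: str
    detail: str
    confidence: str  # "confirmed" or "inferred"


def run_read_only(
    domain: str, argv: list[str], timeout: float = 20.0
) -> tuple[Optional[Finding], Optional[str]]:
    """Run one command once; return (finding, None) or (None, failure note)."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout, check=False
        )
    except FileNotFoundError:
        return None, f"{argv[0]} not installed"
    except subprocess.TimeoutExpired:
        return None, f"{argv[0]} timed out after {timeout}s"
    if result.returncode != 0:
        return None, f"{argv[0]} exited with code {result.returncode}"
    # The CLI returned data directly, so the finding counts as confirmed.
    return Finding(domain, result.stdout.strip(), "confirmed"), None


# Example call using the current Python interpreter as a stand-in CLI.
finding, failure = run_read_only("python", [sys.executable, "--version"])
```

There is no retry loop by design: a failure note is returned and the caller continues with the next command.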

---

## Phase 2: Focused Questions

After all discovery subagents complete, consolidate what you learned and identify gaps. Then ask the user **one consolidated set of questions** covering only what you couldn't determine automatically.

Structure your questions as a numbered list grouped by topic. For example:

> Based on my scan of your environment, here's what I found [brief summary]. I have a few questions to fill in the gaps:
>
> **Cloud & Infrastructure**
> 1. I see AWS profiles for `dev` and `prod` -- are these separate accounts or the same account with different roles?
> 2. ...
>
> **Services & Applications**
> 3. I found N deployments on your EKS cluster and M Lambda functions. Are these all your services, or are there others running elsewhere (other accounts, third-party hosting, on-prem)?
> 4. What are your main customer-facing services/APIs?
> 5. ...
>
> **Application Source Code**
> 6. Where does your application source code live? (a) this directory, (b) another local directory, (c) a remote git repo, (d) other
> 7. ...
>
> Feel free to skip any you'd rather not answer right now -- I'll note them as unknown and we can revisit later.

**Guidelines for questions:**
- Maximum 8-10 questions total -- prioritize the most impactful gaps
- Make questions multiple-choice or yes/no where possible to reduce user effort
- Group related questions together
- Always give the user an out ("skip if you prefer")
- Never ask about things you already discovered with high confidence
- Always ask where the application source code lives if you couldn't determine it from the current directory
- Always ask about services/applications -- this is the highest-value gap to fill. Examples:
  - "I found N services on EKS and M Lambda functions. Are these all your services, or are there others running elsewhere (other accounts, third-party hosting, on-prem)?"
  - "What are your main customer-facing services/APIs?"
  - "I couldn't reach cluster X -- what runs there?"
  - "I see RDS and ElastiCache -- which services depend on them?"
  - "Are there any services running on VMs, bare metal, or third-party platforms (Heroku, Vercel, Netlify, etc.) that I wouldn't have found?"
  - "Do you have any background workers, cron jobs, or async processors beyond what's deployed on the clusters?"

---

## Phase 3: Present Findings

After receiving the user's answers (or if they skip), present a complete summary of everything you've learned. Format it as a clean, scannable overview:

```
Infrastructure Discovery Summary

Cloud Providers
- AWS -- 2 accounts (dev: 123456789, prod: 987654321), primary region: us-east-1
- GCP -- not detected

Kubernetes
- 3 clusters: dev-eks, staging-eks, prod-eks
- Helm v3.12, ArgoCD for GitOps
...
```

Ask the user:
> Does this look accurate? Anything I got wrong or missed? Any corrections before I write this to INFRA.md?

---

## Phase 4: Write INFRA.md

Create (or update) `INFRA.md` in the current working directory with the verified findings.

**Incremental writing strategy:**
- For large infrastructure contexts, do NOT try to hold everything in memory and write it all at once
- Instead, build the file incrementally: create the file with the header and first completed section, then append/update sections as you process each discovery domain
- This prevents context overflow and ensures partial results are saved even if something fails later
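
A minimal sketch of the incremental approach: write the header once, then append each section as soon as its discovery domain is done. The function names are hypothetical, and an in-memory buffer stands in for a file opened in append mode. Skipping sections with no findings is a choice made here so that no empty placeholders are written.

```python
import io
from typing import TextIO


def write_header(out: TextIO, generated_on: str) -> None:
    out.write("# Infrastructure Overview\n\n")
    out.write(f"> Last updated: {generated_on}\n")


def append_section(out: TextIO, title: str, body_lines: list[str]) -> bool:
    """Append one section and flush; return False if there was nothing to write."""
    if not body_lines:
        return False
    out.write(f"\n## {title}\n\n")
    for line in body_lines:
        out.write(line + "\n")
    out.flush()  # partial results survive a later failure
    return True


# Example: the empty Kubernetes section is skipped entirely.
buffer = io.StringIO()
write_header(buffer, "2024-01-10")
append_section(buffer, "Cloud Providers", ["- AWS -- primary region: us-east-1"])
append_section(buffer, "Kubernetes Clusters", [])
```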

Use this structure:

```markdown
# Infrastructure Overview

> Auto-generated by `stakpak init` on {date}. Verified by {user/auto}.
> Last updated: {date}

## Cloud Providers

### {Provider Name}
- **Accounts**: ...
- **Regions**: ...
- **Key Services**: ...

## Kubernetes Clusters

| Cluster | Provider | Region | Version | Namespaces | GitOps |
|---------|----------|--------|---------|------------|--------|
| ...     | ...      | ...    | ...     | ...        | ...    |

## Infrastructure as Code

- **Tool**: Terraform v1.x
- **Backend**: S3 (bucket: terraform-state-prod)
- **Key Modules**: VPC, EKS, RDS, ...

## CI/CD

- **Platform**: GitHub Actions
- **Pipelines**: build, test, deploy-staging, deploy-prod
- **Deployment Strategy**: ...

## Application Stack

| Component | Technology | Version |
|-----------|-----------|---------|
| Backend   | ...       | ...     |
| Frontend  | ...       | ...     |
| Database  | ...       | ...     |
| Cache     | ...       | ...     |
| Queue     | ...       | ...     |

## Running Services

| Service | Runtime | Location | Replicas | Endpoints |
|---------|---------|----------|----------|-----------|
| ...     | EKS/ECS/Lambda/VM/Docker | region/cluster | ... | ... |

### Managed Data Services

| Service | Type | Provider | Region |
|---------|------|----------|--------|
| ...     | RDS/ElastiCache/SQS/S3 | ... | ... |

### Service Dependencies
- service-a → database (postgres), cache (redis), queue (sqs)
- service-b → database (postgres), object-store (s3)

## Networking & DNS

- **DNS Provider**: ...
- **TLS**: ...
- **Load Balancers**: ...

## Monitoring & Observability

- **Metrics**: ...
- **Logging**: ...
- **Alerting**: ...
- **APM/Tracing**: ...

## Secrets Management

- **Tool**: ...
- **Pattern**: ...

## Access & Authentication

- **SSO/IAM**: ...
- **SSH Hosts**: ...
- **VPN**: ...

## Notes & Gaps

- Items marked with [?] need further investigation
- Items marked with [!] may be outdated or inferred

---

*This file is maintained by Stakpak. Run `stakpak init` to refresh.*
```

**INFRA.md guidelines:**
- Use tables for structured data, bullet lists for everything else
- Mark unconfirmed items with `[?]`
- Mark potential issues or outdated info with `[!]`
- Never include secrets, tokens, passwords, or private key material
- Keep it scannable -- a senior engineer should be able to understand the full setup in 2 minutes
- Omit sections entirely if nothing was discovered for that domain (don't leave empty placeholders)

---

## Phase 5: Next Steps

After writing `INFRA.md`, tell the user about the file and suggest next steps. Present these as a prioritized list based on what you discovered (e.g., if you found no monitoring, prioritize that suggestion).

### Suggested Next Steps

Pick the most relevant from this list based on what you discovered:

- **Cost Analysis** -- "I can analyze your cloud spending and find optimization opportunities. Want me to run a cost review?"
- **Set Up Stakpak Autopilot** -- "I can configure `stakpak autopilot` to continuously monitor your infrastructure for issues, drift, and security concerns. Want me to set that up?"
- **Security Audit** -- "I can scan your IaC and configs for security misconfigurations using SAST tools. Want me to run a security review?"
- **Architecture Diagram** -- "I can generate a visual architecture diagram of your infrastructure. Want me to create one?"
- **Disaster Recovery Assessment** -- "I can evaluate your backup and recovery setup and estimate your current RTO/RPO. Interested?"
- **Kubernetes Network Map** -- "I can map out pod-to-pod communication patterns in your clusters. Want me to generate a network diagram?"
- **CI/CD Pipeline Review** -- "I can review your CI/CD pipelines for best practices, speed optimizations, and security. Want me to take a look?"
- **12-Factor Compliance Check** -- "I can assess your application against the 12-Factor methodology and suggest improvements."
- **Infrastructure Drift Detection** -- "I can compare your IaC definitions against live infrastructure to detect drift."
- **Documentation Generation** -- "I can generate detailed runbooks and operational documentation based on your setup."

Present 3-5 of the most relevant suggestions based on what gaps or opportunities you identified during discovery.

---

## Behavioral Rules

1. **Speed over perfection** -- get 80% of the picture fast, refine later
2. **Minimize human interaction** -- automate discovery, batch questions, don't ask what you can find
3. **Never expose secrets** -- treat all credentials, tokens, and keys as radioactive
4. **Be honest about confidence** -- clearly distinguish `[confirmed]` facts from `[inferred]` ones
5. **Respect the user's time** -- if they skip questions, move on gracefully
6. **Parallelize aggressively** -- use subagents for all independent discovery tasks; use sandboxed subagents when CLI commands are needed
7. **Read-only by default** -- discovery phase must not modify any files, configs, or infrastructure state (except writing INFRA.md at the end)
8. **Fail gracefully** -- if a discovery subagent fails or times out, note the gap and continue with what you have
9. **Don't assume source code access** -- the current directory is just one signal; applications may live elsewhere
10. **Build incrementally** -- for large environments, write INFRA.md section by section rather than trying to hold everything in context at once