Every MCP server you deploy to production sits between your AI agents and your business systems. It translates natural language intent into API calls that create, modify, and delete real data. A server that passes this checklist is production-grade. A server that fails any of the first five items is a liability.
This is the checklist NimbleBrain runs on every MCP server before deployment, our own custom builds and community servers alike. Fifteen items, grouped into three categories. Each item has a specific thing to check and a clear failure signal. Print it. Bookmark it. Use it every time.
Security (Items 1-5)
Security items are non-negotiable. A failure on any item in this category means the server does not deploy to production until the issue is resolved. No exceptions, no “we’ll fix it later.”
1. Source Verification
Check: Can you trace the server’s code to a verifiable origin? Is the source repository public or available for audit? Does the published artifact match the source code? Can you build from source and get the same result?
Pass: Source code is available, the publisher is identifiable, and the build is reproducible. Tagged releases correspond to specific commits. The supply chain from source to artifact is auditable.
Fail: Binary-only distribution with no source. Anonymous publisher with no public identity. Published artifact doesn’t match what building from source produces. Any gap in the chain from source code to running server is a gap where tampering can hide.
2. Permission Scope
Check: What system resources does the server request access to? Do the requested permissions match the server’s stated purpose, and nothing more?
Pass: A Slack server requests network access to api.slack.com and nothing else. A database server requests network access to your database host. Permissions map precisely to capabilities.
Fail: A CRM server requests filesystem access. A Slack server requests access to arbitrary network hosts. Any permission that doesn’t map to a documented capability is either a design flaw or a data exfiltration vector. The MCP Trust Framework calls this “least privilege”: the server should request the minimum permissions required to function, and the declared permissions should be verifiable against the actual behavior.
3. Clean Stdout
Check: Does the server’s stdout output contain only valid JSON-RPC messages? No startup banners, no log messages, no debug output mixed into the protocol stream.
Pass: Stdout is exclusively JSON-RPC protocol messages. All diagnostic output (logs, startup messages, debug traces) goes to stderr. The protocol stream is clean and parseable.
Fail: Any non-JSON-RPC content on stdout. This is the most common failure in community MCP servers and one of the most dangerous. Contaminated stdout corrupts the communication channel between the agent and the server. The agent receives malformed messages, misinterprets tool responses, and takes incorrect actions on your systems. The mpak scanner catches this automatically; servers that fail clean stdout verification don’t pass basic MTF assessment.
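The separation is simple to get right. As an illustrative sketch (the function names here are hypothetical, not part of any MCP SDK): protocol messages go to stdout, everything diagnostic goes to stderr.

```python
import json
import sys

def send_response(result: dict, request_id: int) -> None:
    """Write a JSON-RPC response to stdout -- the protocol channel only."""
    message = {"jsonrpc": "2.0", "id": request_id, "result": result}
    sys.stdout.write(json.dumps(message) + "\n")
    sys.stdout.flush()

def log(level: str, text: str) -> None:
    """All diagnostics go to stderr so they never contaminate the protocol stream."""
    sys.stderr.write(f"[{level}] {text}\n")

log("info", "server starting")        # safe: goes to stderr
send_response({"status": "ok"}, 1)    # protocol message: goes to stdout
```

A startup banner printed with a plain `print()` call would land on stdout and break the first message the agent tries to parse.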
4. Authentication Security
Check: How does the server handle credentials for the target system? Are secrets stored securely? Does the auth flow support production requirements: token rotation, per-user credentials, managed identity?
Pass: OAuth with token refresh and secure token storage. Environment variable injection from a secrets manager (not hardcoded in config files). Support for managed identity in cloud deployments. No credentials in source code, logs, or error messages.
Fail: Hardcoded API keys in source code. Credentials logged in plaintext. Static tokens with no rotation mechanism. Auth tokens stored in plaintext configuration files. Any of these are immediate disqualifiers for production deployment. A server that leaks credentials in its error output turns every agent error into a security incident.
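A minimal sketch of the pattern the pass criteria describe, with hypothetical names (`CRM_API_TOKEN` is an example variable, not a standard): credentials come from the environment, startup fails fast when they’re missing, and log output never contains the raw value.

```python
import os

class ConfigError(Exception):
    pass

def load_api_token(var_name: str = "CRM_API_TOKEN") -> str:
    """Read the credential from the environment -- never from source or config
    files. The error names the missing variable but never echoes a value."""
    token = os.environ.get(var_name)
    if not token:
        raise ConfigError(f"missing required environment variable: {var_name}")
    return token

def redact(token: str) -> str:
    """Safe representation for logs: keep only the last four characters."""
    return "****" + token[-4:] if len(token) > 4 else "****"
```

In a cloud deployment, the environment variable itself would be injected by a secrets manager or managed identity, so the secret never touches a config file or the repository.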
5. Rate Limiting
Check: Does the server implement rate limiting to protect the target system from agent-driven overuse? AI agents can make tool calls at machine speed; without rate limiting, a single agent can exhaust an API’s rate limits in seconds.
Pass: Configurable rate limits per tool or per endpoint. Backoff logic for rate limit responses (HTTP 429). Clear error messages when limits are hit so agents can adjust behavior. Rate limits documented in the server’s configuration reference.
Fail: No rate limiting at all: agents can fire unlimited requests at the target API. No backoff on 429 responses: the server retries immediately and compounds the problem. Hard failures with no error context when rate limits are hit: the agent doesn’t know why the call failed and may retry in a loop.
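The backoff logic the pass criteria call for can be sketched in a few lines. This is an illustrative pattern, not a specific library API; `request_fn` stands in for whatever makes the actual HTTP call.

```python
import time

def call_with_backoff(request_fn, max_retries: int = 4, base_delay: float = 0.5,
                      sleep=time.sleep):
    """Retry a call that may hit rate limits (HTTP 429), backing off
    exponentially instead of retrying immediately and compounding the problem."""
    for attempt in range(max_retries + 1):
        status, body = request_fn()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Exponential backoff: 0.5s, 1s, 2s, 4s, ...
        sleep(base_delay * (2 ** attempt))
    raise RuntimeError("rate limit persisted after retries; caller should pause")
```

A production version would also honor the `Retry-After` header when the target API sends one, rather than relying purely on the exponential schedule.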
Reliability (Items 6-10)
Reliability items determine whether the server works consistently in production, not just in demos. Failures here don’t prevent deployment but signal technical debt that will surface under load.
6. Error Handling
Check: Does the server return meaningful, structured error messages when operations fail? Can an AI agent understand what went wrong and decide whether to retry, try a different approach, or escalate?
Pass: Errors include an error code, a human-readable message, and enough context for the agent to make a decision. Transient failures (network timeouts, rate limits) are distinguished from permanent failures (invalid input, missing resource). The agent receives actionable information, not raw stack traces.
Fail: Generic error messages (“something went wrong”). Raw exception traces passed to the agent. Silent failures where the server returns success but the operation didn’t complete. Errors that lose context from the target API: the Salesforce error said “field X is required” but the server returned “API error.”
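One way to make the transient-versus-permanent distinction concrete, sketched here with hypothetical field and function names: give every error a stable code, a retryable flag, and a field that preserves the target API’s own message verbatim.

```python
from dataclasses import dataclass

@dataclass
class ToolError:
    code: str             # stable, machine-readable identifier
    message: str          # human-readable summary for the agent
    retryable: bool       # transient (timeout, 429) vs. permanent (bad input)
    upstream_detail: str  # the target API's own error, preserved verbatim

def classify_api_error(status: int, upstream_message: str) -> ToolError:
    """Map a target-API failure to a structured error the agent can act on."""
    if status in (429, 502, 503, 504):
        return ToolError("upstream_unavailable",
                         "target API is temporarily unavailable",
                         retryable=True, upstream_detail=upstream_message)
    if status == 400:
        return ToolError("invalid_input",
                         "the request was rejected by the target API",
                         retryable=False, upstream_detail=upstream_message)
    return ToolError("upstream_error", f"target API returned HTTP {status}",
                     retryable=False, upstream_detail=upstream_message)
```

With this shape, the “field X is required” detail survives the round trip instead of being flattened into a generic “API error.”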
7. Structured Logging
Check: Does the server produce structured, queryable logs on stderr? Can you trace a specific tool invocation from request to response, including the target API interaction?
Pass: JSON-structured log output with timestamps, request IDs, tool names, and durations. Configurable log levels (debug, info, warn, error). Sensitive data redacted from log output: no credentials, no PII in logs. Correlation IDs that link agent requests to target API calls.
Fail: Unstructured print statements. No timestamps. Full request/response bodies logged including credentials or PII. No way to correlate logs to specific tool invocations. Logging to stdout instead of stderr, which contaminates the protocol stream (see item 3).
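A minimal sketch of the pass criteria, with an assumed redaction list (the key names here are illustrative): one JSON object per line on stderr, carrying a timestamp, a request ID for correlation, and sensitive fields masked before they ever reach the log.

```python
import json
import sys
import time

REDACT_KEYS = {"authorization", "api_key", "password", "token"}

def log_event(event: str, request_id: str, **fields) -> str:
    """Emit one JSON log line to stderr with a timestamp, a correlation ID,
    and sensitive fields redacted. Returns the line for inspection."""
    record = {"ts": time.time(), "event": event, "request_id": request_id}
    for key, value in fields.items():
        record[key] = "[REDACTED]" if key.lower() in REDACT_KEYS else value
    line = json.dumps(record)
    sys.stderr.write(line + "\n")  # stderr, never stdout (see item 3)
    return line
```

Because every line is valid JSON with a `request_id`, a log aggregator can reconstruct the full path of a single tool invocation from agent request to target API call.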
8. Test Coverage
Check: Does the server have automated tests? Do the tests cover the critical paths: authentication, core tool operations, error handling, edge cases?
Pass: Unit tests for business logic and data transformation. Integration tests against mock or sandbox instances of the target API. Tests run in CI on every commit. Coverage of auth flows, error paths, and edge cases (empty results, pagination, timeouts).
Fail: No tests at all. Tests that only cover the happy path. Tests that require a live API key to run (making them impractical for CI). Tests that were clearly written after the fact and test implementation details rather than behavior.
9. Versioning
Check: Does the server follow semantic versioning? Is there a changelog that documents what changed between versions? Can you pin to a specific version and know it won’t change?
Pass: Semantic versioning (major.minor.patch). Tagged releases in the source repository. A changelog or release notes documenting breaking changes, new features, and bug fixes. Published artifacts pinned to specific versions: installing v1.2.3 today gives you the same artifact as installing v1.2.3 next month.
Fail: No version numbers. Unpinned “latest” as the only installation target. Breaking changes shipped without major version bumps. No changelog: you discover breaking changes by deploying and watching things fail.
10. Dependency Hygiene
Check: Are the server’s dependencies pinned to specific versions? Are there known vulnerabilities in the dependency tree? Is the dependency surface area reasonable for what the server does?
Pass: Lockfile with pinned dependency versions. No known critical or high-severity vulnerabilities in the dependency tree (check with npm audit, pip audit, or equivalent). Reasonable dependency count: a server that wraps a single API shouldn’t pull in 200 packages.
Fail: Unpinned dependencies that resolve to whatever version is latest at install time. Known critical vulnerabilities in the dependency tree. Excessive dependency count, particularly transitive dependencies that the maintainer may not be aware of. This is a supply chain risk: every dependency is a surface where malicious code could be injected.
Operations (Items 11-15)
Operations items determine whether the server is deployable and maintainable in a real infrastructure environment. Failures here create operational burden that scales linearly with the number of servers deployed.
11. Container Compatibility
Check: Does the server run in an isolated container environment? Can you deploy it with standard container orchestration (Docker, Kubernetes)?
Pass: A Dockerfile or container image is available. The server runs without host system dependencies beyond the container runtime. Resource limits (CPU, memory) are documented or configurable. The container image is minimal: no unnecessary tools, shells, or utilities that expand the attack surface.
Fail: The server requires host-level installation with system-wide dependencies. No container support. The container image includes development tools, SSH servers, or other utilities that have no place in a production deployment. Resource usage is unbounded and undocumented.
12. Graceful Degradation
Check: What happens when the target system is down? Does the server fail gracefully, or does it crash, hang, or return misleading results?
Pass: The server detects target system outages and returns clear error messages: “target API unavailable, retry after N seconds.” Connection timeouts are configured and enforced. The server itself remains responsive even when the target is not. Health check endpoints report the server’s ability to reach the target system.
Fail: The server hangs indefinitely waiting for a response from a target system that’s down. The server crashes when the target returns unexpected responses. No health check: the orchestrator can’t tell if the server is healthy. Misleading success responses when the target operation actually failed.
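The core of graceful degradation is a bounded probe: check reachability with an enforced timeout so the health check itself can never hang. A minimal sketch (the response shape and `retry_after_s` field are illustrative, not part of any spec):

```python
import socket

def check_target_health(host: str, port: int, timeout_s: float = 2.0) -> dict:
    """Probe the target system without letting the check itself hang.
    The server stays responsive and reports its status either way."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return {"target_reachable": True}
    except OSError as exc:
        return {"target_reachable": False,
                "error": f"target unreachable: {exc.__class__.__name__}",
                "retry_after_s": 30}
```

The same timeout discipline applies to every tool call, not just the health check: a request to a downed target should fail in seconds with a clear message, never hang indefinitely.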
13. Configuration Management
Check: Are all configurable values managed through environment variables or configuration files, not hardcoded in source? Can you deploy the same server to different environments without code changes?
Pass: Environment variables for all environment-specific values (API endpoints, credentials, feature flags). Sensible defaults for non-sensitive configuration. Clear separation between code (same across environments) and configuration (different per environment). No secrets in configuration files; secrets come from environment variables or a secrets manager.
Fail: Hardcoded URLs, API keys, or environment-specific values in source code. Configuration that requires modifying source files to change. Secrets in committed configuration files. No documentation of available configuration options.
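A sketch of the separation the pass criteria describe, using hypothetical variable names (`API_BASE_URL`, `API_TOKEN`, and so on are examples): non-sensitive settings have sensible defaults, while secrets have no default and must be injected by the deployment environment.

```python
import os

def load_config(env: dict = os.environ) -> dict:
    """Build runtime configuration from environment variables. The same code
    deploys to any environment; only the injected values differ."""
    config = {
        "api_base_url": env.get("API_BASE_URL", "https://api.example.com"),
        "timeout_s": float(env.get("REQUEST_TIMEOUT_S", "10")),
        "log_level": env.get("LOG_LEVEL", "info"),
    }
    secret = env.get("API_TOKEN")  # no default: deployment must supply it
    if secret is None:
        raise RuntimeError("API_TOKEN not set; refusing to start")
    config["api_token"] = secret
    return config
```

Failing fast at startup when a secret is missing is deliberate: a misconfigured server should never come up half-working and surface the problem later as a confusing auth error mid-conversation.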
14. Documentation
Check: Can a new engineer on your team deploy, configure, and troubleshoot this server in under 30 minutes using only the documentation?
Pass: Installation steps that work as written, tested on a clean machine, not just the maintainer’s environment. Configuration reference with every option described. Tool descriptions that include parameters, return values, and error conditions. Known limitations documented honestly. Troubleshooting guide for common failures.
Fail: “See code for usage.” Documentation that assumes familiarity with the codebase. Missing configuration reference: you discover options by reading source code. Outdated docs that describe a previous version. No troubleshooting guidance: when something breaks, you’re reading source code to understand why.
15. MTF Trust Score
Check: Has the server been assessed against the MCP Trust Framework? What is its trust level on mpak.dev?
Pass: A trust level that meets or exceeds your deployment tier requirements. The MTF assessment is an automated composite of security posture, manifest quality, dependency analysis, and operational characteristics. A high trust level means the server has passed automated checks across multiple quality dimensions; it’s not a guarantee, but it’s a validated baseline.
Fail: No MTF assessment. Trust level below your minimum threshold. Unresolved findings from the MTF scan. Note that absence of a score isn’t automatically disqualifying; the server might be new or hosted outside the mpak registry. But for servers available on mpak.dev, a missing or low trust score should trigger additional manual review.
Using the Checklist
For Community Servers
Run the full 15-item check before deploying any community server to production. The mpak.dev scanner automates items 2, 3, 5, 9, 10, 11, and 15: the mechanical checks that don’t require human judgment. For the remaining items, allocate 30-60 minutes of engineering time. That’s a small investment against the cost of deploying a server that corrupts your CRM data or leaks credentials.
For Custom Builds
Use the checklist as a definition of done. Before any custom MCP server ships to production, it passes all 15 items. NimbleBrain’s server template includes scaffolding for items 3, 4, 5, 6, 7, 9, 11, 12, and 13: the structural patterns that every server needs. The remaining items require project-specific engineering and documentation.
For Teams
Print this checklist. Post it where your engineering team can see it. Make it part of your deployment process. The goal isn’t bureaucracy; it’s consistency. When every server goes through the same 15 checks, quality doesn’t depend on which engineer happened to evaluate it. The process catches failures that individual judgment might miss.
This is how Business-as-Code works at the operational layer. You encode the evaluation process as a structured checklist, the same way you encode business logic as skills and entity definitions as schemas. The checklist becomes a team asset. It scales with your server fleet. And it compounds: every server that passes the checklist is a server your agents can trust.
The Anti-Consultancy perspective: a traditional consultancy would charge for a “tool governance framework” and deliver a 60-page PDF. This checklist is the framework. Fifteen items. Pass or fail criteria for each. An hour per server to execute. No discovery phase. No governance committee. Just a list that works.
For the evaluation framework that precedes this checklist (deciding which servers to evaluate in the first place), see Evaluating MCP Servers: A Buyer’s Checklist. For the upstream decision of whether to build or evaluate at all, see When to Build Custom MCP Servers.
Frequently Asked Questions
Do I need to check all 15 for every server?
For production deployments accessing business-critical systems, yes, all 15. For development and testing, focus on the first five (source, permissions, stdout, errors, auth). For low-risk integrations (public APIs, read-only access), the top 10 is sufficient. Scale the rigor to the risk.
How long does a full quality check take?
30-60 minutes for an experienced engineer using automated tooling (mpak scanner handles 8 of the 15 automatically). Manual review of source code, documentation quality, and configuration management adds another 30 minutes. It's an hour well spent compared to debugging a compromised production system.
Can this checklist be automated?
Partially. The mpak scanner automates: clean stdout verification, permission scope analysis, dependency scanning, versioning checks, and container compatibility. Manual review is still needed for: documentation quality, error handling patterns, graceful degradation behavior, and configuration management. NimbleBrain is building more automation into the MTF scanner.