Monitoring and Operations¶
Monitor policy compliance, handle exceptions, troubleshoot issues, and follow operational best practices.
Monitoring and Observability¶
Policy Reporter Dashboard¶
Access the Policy Reporter UI:
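One common way to reach the UI from a workstation is a port-forward. The following is a minimal sketch, not a confirmed setup: the namespace `policy-reporter`, the service name `policy-reporter-ui`, the service port 8080, and the local port 8082 are all assumptions, and `open_policy_reporter_ui` is a hypothetical helper name. Check `kubectl get svc -A` for the real values in your cluster.

```shell
# Assumed defaults; override them to match your Helm release.
PR_NAMESPACE="${PR_NAMESPACE:-policy-reporter}"
PR_UI_SERVICE="${PR_UI_SERVICE:-policy-reporter-ui}"
PR_LOCAL_PORT="${PR_LOCAL_PORT:-8082}"
PR_SERVICE_PORT="${PR_SERVICE_PORT:-8080}"

# Forward the UI service to localhost. Runs until interrupted with Ctrl+C.
open_policy_reporter_ui() {
  kubectl port-forward -n "$PR_NAMESPACE" "service/$PR_UI_SERVICE" \
    "${PR_LOCAL_PORT}:${PR_SERVICE_PORT}"
}
```

After calling `open_policy_reporter_ui`, the dashboard should be reachable at http://localhost:8082 for as long as the port-forward is running.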
Dashboard features:
- Policy compliance by namespace
- Violation trends over time
- Top violating resources
- Policy effectiveness metrics
Dashboard for Compliance Reviews
Use the Policy Reporter dashboard during compliance reviews. Export monthly reports showing pass/fail ratios by namespace. This provides auditors with visual proof of enforcement.
Prometheus Metrics¶
Policy Reporter exports Prometheus metrics:
policy_report_result{policy="require-resource-limits", status="fail"} 12
policy_report_result{policy="disallow-latest-tag", status="fail"} 3
policy_report_summary{status="pass"} 245
policy_report_summary{status="fail"} 15
Useful queries:
# Compliance rate
sum(policy_report_result{status="pass"}) / sum(policy_report_result)
# Top failing policies
topk(10, sum by (policy) (policy_report_result{status="fail"}))
# Violations by namespace
sum by (namespace) (policy_report_result{status="fail"})
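For a quick check without a Prometheus server, the compliance-rate ratio above can be approximated directly from the raw metrics text. This is a minimal sketch that assumes the exposition format shown earlier (the sample value is the last field on the line, with no trailing timestamp). `compliance_rate` is a hypothetical helper, not part of Policy Reporter.

```shell
# Read Prometheus text-format metrics on stdin and print pass / total
# for policy_report_result as a percentage.
compliance_rate() {
  awk '
    # Only consider policy_report_result samples, not the summary series.
    index($0, "policy_report_result{") == 1 {
      total += $NF
      if (index($0, "status=\"pass\"") > 0) pass += $NF
    }
    END {
      if (total == 0) { print "no samples"; exit 1 }
      printf "%.1f%%\n", 100 * pass / total
    }
  '
}
```

It can be fed from a saved metrics dump, or from a live endpoint with something like `curl -s <metrics-url> | compliance_rate`, where the metrics URL depends on how Policy Reporter is exposed in your cluster.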
Grafana Dashboard: Chart these metrics in Grafana to track policy compliance trends over time.
Slack Notifications¶
Configure Slack alerts for policy violations:
# policy-reporter-values.yaml
targets:
  slack:
    webhook: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
    minimumPriority: "warning"
    skipExistingOnStartup: true
    channels:
      - name: "#security-alerts"
        filter:
          namespaces:
            include: ["production"]
          policies:
            include: ["require-resource-limits", "disallow-latest-tag"]
Alert example:
🚨 Policy Violation Detected
Policy: require-resource-limits
Resource: Deployment/nginx
Namespace: production
Message: CPU and memory limits required
Severity: medium
Alert on Critical Policies Only
Don't send Slack alerts for every policy violation. Reserve alerts for critical security policies in production namespaces, and use the dashboard for everything else to prevent alert fatigue.
Exception Handling¶
Policy Exceptions¶
Allow specific resources to bypass policies:
apiVersion: kyverno.io/v2beta1
kind: PolicyException
metadata:
  name: allow-legacy-app
  namespace: kyverno
spec:
  exceptions:
    - policyName: require-resource-limits
      ruleNames:
        - check-cpu-memory
  match:
    any:
      - resources:
          kinds:
            - Deployment
          namespaces:
            - legacy
          names:
            - old-app
Use sparingly. Exceptions should be temporary.
Exception Governance¶
Best practices:
- Require expiration dates: set validUntil on every PolicyException.
- Add approval annotations:
  metadata:
    annotations:
      jira.ticket: "SEC-1234"
      approved-by: "security-team"
      reason: "Legacy application, migration planned Q2 2025"
- Quarterly review:
# List all exceptions
kubectl get policyexception -A
# List each exception with its expiration date (empty if none is set)
kubectl get policyexception -A \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.validUntil}{"\n"}{end}'
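The listing above prints every exception, so picking out the ones that need attention is still manual. The sketch below filters that tab-separated output down to exceptions that expire on or before a cutoff, or that have no expiry at all. It assumes timestamps are RFC 3339 in UTC (for example `2025-06-30T00:00:00Z`), so that plain string comparison orders them correctly. `expiring_exceptions` is a hypothetical helper, and the `validUntil` field name is taken from this guide; confirm that your installed PolicyException CRD exposes it.

```shell
# Input (stdin): "name<TAB>validUntil" lines, as printed by the jsonpath query.
# Argument:      cutoff timestamp in the same RFC 3339 UTC format.
# Output:        exceptions with no expiry, or expiring on or before the cutoff.
expiring_exceptions() {
  cutoff="$1"
  awk -F '\t' -v cutoff="$cutoff" '
    $2 == ""                   { print $1 "\tNO EXPIRY"; next }
    # Appending "" forces string (lexicographic) comparison.
    ($2 "") <= (cutoff "")     { print $1 "\t" $2 }
  '
}

# Usage: pipe the jsonpath listing into the helper, for example
#   kubectl get policyexception -A -o jsonpath='...' \
#     | expiring_exceptions "2025-06-30T00:00:00Z"
```

Anything reported as `NO EXPIRY` is a candidate for immediate review, since it would otherwise never come up for renewal.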
Exceptions Must Expire
Every PolicyException MUST have validUntil set. Permanent exceptions create permanent security gaps. Set expiration, review quarterly, renew only with approval.
Troubleshooting¶
Admission Webhook Failures¶
Problem: admission webhook "validate.kyverno.svc" denied the request
Cause: Policy violation detected
Solution: Check policy and resource
# Get policy details
kubectl get clusterpolicy require-resource-limits -o yaml
# Check resource against policy locally
kyverno apply require-resource-limits.yaml --resource deployment.yaml
Background Scan Not Running¶
Problem: PolicyReports not updating
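Cause: The controller that produces background scan results is not running or is disabled. In Kyverno 1.10 and later, report generation belongs to the reports controller, so check it as well as the background controller.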
Solution: Check background controller
# Check background controller logs
kubectl logs -n kyverno \
-l app.kubernetes.io/component=background-controller
# Verify background scan interval
kubectl get deployment -n kyverno kyverno-background-controller \
-o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="BACKGROUND_SCAN_INTERVAL")].value}'
Policy Not Enforced¶
Problem: Resource deployed despite policy violation
Cause: Policy in Audit mode instead of Enforce
Solution: Check validation failure action
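One way to read the current setting is a jsonpath query against the policy. This is a minimal sketch, assuming the action is set at the policy level in `spec.validationFailureAction`, as in the patch command in this section. `get_failure_action` is a hypothetical helper name.

```shell
# Print the policy-level validationFailureAction of a ClusterPolicy.
get_failure_action() {
  kubectl get clusterpolicy "$1" \
    -o jsonpath='{.spec.validationFailureAction}'
}

# Usage:
#   get_failure_action require-resource-limits
```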
Expected: Enforce
If Audit: The policy only reports violations and does not block them.
Fix:
kubectl patch clusterpolicy require-resource-limits \
--type merge \
-p '{"spec":{"validationFailureAction":"Enforce"}}'
Always Check Validation Failure Action
If policies aren't blocking violations, the most common cause is that validationFailureAction is set to Audit instead of Enforce. Check this first.
Best Practices¶
1. Start with Audit Mode¶
Roll out new policies in Audit mode first:
Monitor PolicyReports. Switch to Enforce after validation.
2. Exclude System Namespaces¶
Avoid breaking system components:
3. Use Background Scanning¶
Detect drift and violations in existing resources:
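Practices 1 to 3 can be combined in a single policy manifest. The sketch below is illustrative only: it reuses the `require-resource-limits` policy and `check-cpu-memory` rule names from earlier in this guide, but the rule body and the list of excluded namespaces are assumptions rather than this project's actual policy. `render_policy` is a hypothetical helper that just prints the manifest.

```shell
# Print a ClusterPolicy that starts in Audit mode, skips system
# namespaces, and enables background scanning.
render_policy() {
  cat <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Audit   # 1. report only; switch to Enforce later
  background: true                 # 3. also scan existing resources
  rules:
    - name: check-cpu-memory
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:                     # 2. leave system components alone
        any:
          - resources:
              namespaces:
                - kube-system
                - kube-public
                - kube-node-lease
                - kyverno
      validate:
        message: "CPU and memory limits required"
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
EOF
}
```

Running `render_policy | kubectl apply -f -` would apply it. Once the PolicyReports look clean, the patch command in the Troubleshooting section switches the policy to Enforce.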
4. Monitor Policy Reports¶
Set up alerts for critical violations:
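If Prometheus already scrapes Policy Reporter, one option is an alerting rule on the `policy_report_result` metric shown earlier. This is a sketch under stated assumptions: the group and alert names, the `production` namespace filter, the 10 minute hold time, and the `critical` severity label are all illustrative. `render_alert_rules` is a hypothetical helper that prints a Prometheus rule file.

```shell
# Print a Prometheus rule file that fires when any policy has failing
# results in the production namespace for 10 minutes.
render_alert_rules() {
  # Quoted heredoc delimiter keeps $labels from being expanded by the shell.
  cat <<'EOF'
groups:
  - name: policy-violations
    rules:
      - alert: CriticalPolicyViolations
        expr: sum by (policy, namespace) (policy_report_result{status="fail", namespace="production"}) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Policy {{ $labels.policy }} is failing in {{ $labels.namespace }}"
EOF
}
```

The output can be written to a file and validated before loading, for example `render_alert_rules > policy-alerts.yml && promtool check rules policy-alerts.yml`.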
5. Version Policy Deployments¶
Use Helm chart versions for rollback capability:
helm upgrade --install security-policy \
/repos/security-policy/charts/security-policy \
--version 1.2.3
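The rollback itself can be sketched with Helm's standard `history` and `rollback` commands. The release name `security-policy` comes from the command above; the helper names are hypothetical, and the revision number to roll back to has to be read from the history output.

```shell
# Release name from the upgrade command; override if yours differs.
RELEASE="${RELEASE:-security-policy}"

# Show the revision history so a known-good revision can be picked.
list_policy_revisions() {
  helm history "$RELEASE"
}

# Roll back to the given revision and wait for resources to become ready.
rollback_policies() {
  helm rollback "$RELEASE" "$1" --wait
}

# Usage:
#   list_policy_revisions
#   rollback_policies <REVISION>
```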
Operational Checklist¶
Weekly:
- [ ] Review PolicyReports for new violations
- [ ] Check PolicyException expiration dates
- [ ] Verify background scans running
Monthly:
- [ ] Generate compliance report
- [ ] Review exception renewals
- [ ] Audit policy effectiveness
Quarterly:
- [ ] Review all PolicyExceptions
- [ ] Update policy versions
- [ ] Test disaster recovery procedures
Next Steps¶
- Operations - Day-to-day policy lifecycle management
- Multi-Source Policies - Aggregate policies from multiple repos
- Policy Packaging - Build policy-platform container
- Kyverno Deep Dive - Advanced Kyverno patterns