Cosmos DB Connectivity Diagnostic - Classification Matrix & Support Guide
Classification Decision Tree
START: Run diagnostic script
│
├─→ DNS Resolution Check
│   │
│   ├─→ ❌ FAILED
│   │   └─→ Classification: dns_resolution_failed
│   │       Action: DNS/VPN/proxy troubleshooting
│   │
│   └─→ ✓ PASSED
│       │
│       ├─→ Resolved IP is RFC 1918 (10.x, 172.16-31.x, 192.168.x)?
│       │   │
│       │   ├─→ YES (Private endpoint detected)
│       │   │   │
│       │   │   └─→ TCP 443 Test
│       │   │       │
│       │   │       ├─→ ❌ FAILED
│       │   │       │   └─→ private_endpoint_network_path_blocked
│       │   │       │       (VPN route, NSG, firewall, UDR, peering)
│       │   │       │
│       │   │       └─→ ✓ PASSED
│       │   │           └─→ Check RBAC
│       │   │
│       │   └─→ NO (Public endpoint)
│       │       │
│       │       └─→ TCP 443 Test
│       │           │
│       │           ├─→ ❌ FAILED
│       │           │   └─→ tcp_connectivity_blocked
│       │           │       (Firewall, ISP, proxy)
│       │           │
│       │           └─→ ✓ PASSED
│       │               └─→ network_connectivity_healthy
│       │
│       └─→ Check Azure Configuration & RBAC
│           │
│           ├─→ Azure CLI authenticated?
│           │   ├─→ NO → Skip ARM checks, mark warning
│           │   └─→ YES → Query network config & roles
│           │
│           └─→ Sufficient permissions?
│               ├─→ NO → rbac_insufficient
│               └─→ YES → All checks passed
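The tree above can be read as a pure function from check results to a classification code. The following is a minimal Python sketch of that reading, not the diagnostic script itself (which is PowerShell). The names `classify` and `is_rfc1918` and all parameters are hypothetical, and the assumption that the Azure CLI and RBAC checks run only after a successful TCP test is inferred from the tree rather than confirmed by the script.

```python
import ipaddress

# RFC 1918 ranges the tree uses to detect a private endpoint.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_rfc1918(address):
    """Return True if address is an IPv4 address inside an RFC 1918 range."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False
    return any(ip in net for net in RFC1918)


def classify(dns_ok, resolved_ip=None, tcp_ok=False,
             cli_authenticated=False, has_permissions=False):
    """Walk the decision tree and return a classification code (hypothetical helper)."""
    if not dns_ok:
        return "dns_resolution_failed"
    if not tcp_ok:
        # The private/public split decides which TCP failure code applies.
        if is_rfc1918(resolved_ip):
            return "private_endpoint_network_path_blocked"
        return "tcp_connectivity_blocked"
    if not cli_authenticated:
        return "azure_config_check_skipped"
    if not has_permissions:
        return "rbac_insufficient"
    return "network_connectivity_healthy"
```

Checking membership in the three networks avoids the false matches a prefix test would produce, for example treating 210.x or 172.32.x as private.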
Classification Code Reference
Success Codes
network_connectivity_healthy
- Status: success
- When: DNS resolves AND TCP 443 succeeds
- Interpretation: Local network is working. If Cosmos DB operations fail, issue is auth/RBAC/data-plane.
- Actions:
- Verify RBAC/authentication permissions
- Check account firewall IP rules
- Verify data-plane token hasn't expired
- Check application logs for specific errors
Failure Codes
dns_resolution_failed
- Status: failure
- When: DNS lookup fails with SocketException or timeout
- Interpretation: Cannot resolve account hostname to any IP
- Root Causes:
- DNS server misconfiguration
- VPN/proxy intercepting DNS queries
- Corporate proxy redirecting .documents.azure.com
- Network unreachable before DNS server
- ISP DNS failure
- Actions:
- Check VPN/proxy DNS settings
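- Run: nslookup <endpoint-hostname>
- Try alternate DNS: nslookup <endpoint-hostname> 8.8.8.8
- Ping endpoint: ping <endpoint-hostname> (useful only to confirm the name resolves; Azure endpoints generally do not answer ICMP, so a ping timeout alone is not a failure)
- Contact network team if no resolution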
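For support tooling that cannot shell out to nslookup, the same DNS check can be approximated with the Python standard library. This is a sketch under assumptions: `resolve_host` is a hypothetical helper, it uses the system resolver only (it cannot target an alternate server such as 8.8.8.8), and the `resolver` parameter exists purely so the function can be exercised without a network.

```python
import socket


def resolve_host(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted unique IPs for hostname, or [] if resolution fails."""
    try:
        results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except OSError:
        # socket.gaierror and socket.timeout are both OSError subclasses.
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({sockaddr[0] for *_, sockaddr in results})
```

An empty list corresponds to the dns_resolution_failed condition; a non-empty list gives the addresses to feed into the TCP 443 test.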
tcp_connectivity_blocked
- Status: failure
- When: DNS succeeds BUT TCP 443 fails
- Interpretation: Network path blocked between client and endpoint
- Root Causes (Public Endpoint):
- Corporate firewall blocking outbound 443
- ISP blocking Cosmos/Azure IPs
- Regional geo-blocking
- HTTPS inspection proxy interfering
- Host-level firewall (Windows Defender, etc.)
- Root Causes (Private Endpoint):
- VPN not configured for private endpoint subnet
- Route not established between VPN subnet and private endpoint subnet
- NSG rules blocking 443 inbound on PE subnet
- NVA/firewall dropping packets
- UDR misconfiguration
- VNet peering not configured or in a Disconnected state
- Private DNS zone misconfiguration
- Actions:
- Run: Test-NetConnection -ComputerName <hostname> -Port 443
- Trace the route: Test-NetConnection -ComputerName <hostname> -TraceRoute (-Port and -TraceRoute belong to different parameter sets, so run them as separate commands)
- If private endpoint: Ask network team to verify VPN routing
- Check host firewall (Windows Defender, Mac firewall, Linux iptables)
- If corporate proxy: Verify HTTPS inspection not blocking certificates
- Try from different network to isolate source
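Test-NetConnection is a Windows cmdlet, so on Linux or Mac clients a comparable TCP 443 probe can be sketched in Python. This is an illustrative helper (`tcp_port_open` is a hypothetical name), and it reports only whether the handshake completes; it does not trace the route.

```python
import socket


def tcp_port_open(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, timeouts and DNS failures.
        return False
```

A False result after a successful DNS lookup matches the tcp_connectivity_blocked condition described above.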
private_endpoint_network_path_blocked
- Status: failure
- When: Resolved to private IP (10.x, 172.16-31.x, 192.168.x) BUT TCP 443 fails
- Interpretation: Private endpoint detected but cannot reach it—network path issue
- Root Causes:
- VPN client subnet → private endpoint subnet routing broken
- Firewall/NVA blocking internal traffic
- NSG with restrictive rules on PE subnet
- UDR pointing to wrong next hop
- VNet peering not established
- Private DNS zone not configured or stale
- Actions:
- Confirm VPN is connected and assigned correct subnet
- Ask network team to verify routing: route print (Windows) or netstat -rn (Linux/Mac)
- Check Azure NSG rules on private endpoint subnet for port 443 inbound
- Verify private DNS zone has A record pointing to PE IP
- Check if VNet peering exists and is in the Connected state
- Run Test-NetConnection -ComputerName <pe-ip> -Port 443 directly to PE IP
- Provide network team with source IP from script output
Warning Codes
rbac_insufficient
- Status: warning
- When: Network OK BUT caller lacks data-plane permissions
- Interpretation: Network is healthy, but RBAC prevents data operations
- Actions:
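- Request a data-plane role assignment (Cosmos DB Built-in Data Reader or Cosmos DB Built-in Data Contributor); Azure RBAC roles such as Cosmos DB Operator or Contributor govern account management and do not by themselves grant Microsoft Entra ID data-plane access
- If using connection strings: ensure the account keys haven't been regenerated
- Check data-plane RBAC (if enabled, API for NoSQL) via Azure CLI: az cosmosdb sql role assignment list --account-name <account> --resource-group <rg>
- Check control-plane role assignments: az role assignment list --scope <account-id>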
private_endpoint_mismatch
- Status: warning
- When: Resolved IP differs from expected private endpoint IP
- Interpretation: Routing may be asymmetric or PE configuration changed
- Actions:
- Verify private endpoint IP hasn't changed in Azure Portal
- Ask network team to check asymmetric routing (DNS from corp vs VPN DNS)
- Flush DNS cache: ipconfig /flushdns (Windows) or sudo dscacheutil -flushcache (Mac)
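The mismatch condition can be expressed as a small comparison. This is a sketch only: `private_endpoint_mismatch` is a hypothetical helper, and it assumes the expected private endpoint IP is supplied by the operator (for example, copied from the Azure Portal), since the source does not say where the script obtains it.

```python
import ipaddress


def private_endpoint_mismatch(resolved_addresses, expected_pe_ip):
    """Return True if none of the resolved addresses equals the expected PE IP."""
    expected = ipaddress.ip_address(expected_pe_ip)
    # Parsing both sides normalizes textual variants before comparing.
    resolved = {ipaddress.ip_address(a) for a in resolved_addresses}
    return expected not in resolved
```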
azure_config_check_skipped
- Status: warning
- When: Azure CLI not authenticated or not installed
- Interpretation: Cannot validate ARM-level network config (firewall rules, PE connections)
- Actions:
- Install Azure CLI: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli
- Authenticate: az login
- Re-run script to collect ARM-level diagnostics
unknown_error
- Status: failure or warning
- When: Unhandled condition or unexpected error
- Interpretation: Script encountered something not in the matrix
- Actions:
- Check script output for error details
- Provide full JSON report to support
Support Playbook
Tier 1: Triage (ICM Responder)
When customer reports: "Cosmos DB operations return HTTP 0.0 / connection errors"
- Ask customer to run script:
  .\Diagnose-CosmosConnectivity.ps1 -Interactive
- Receive JSON output. Check classification.code:
  | Code | Response |
  | --- | --- |
  | network_connectivity_healthy | → Escalate to data-plane/auth team. This is not a network issue. |
  | dns_resolution_failed | → Run DNS playbook below |
  | tcp_connectivity_blocked (public endpoint) | → Run TCP failed / public endpoint playbook |
  | private_endpoint_network_path_blocked | → Run private endpoint playbook |
  | rbac_insufficient | → Check RBAC permissions |
  | azure_config_check_skipped | → Ask customer to run az login and re-run |
- Document:
  - Save JSON report in ICM
  - Note classification code and recommended actions
  - Link to this support guide in response
Playbook: DNS Resolution Failed
Symptoms: dns_resolution_failed code
Steps:
- Verify endpoint name with customer:
  - Check it matches Azure Portal > Cosmos Account > URI
  - Typos are common
- Customer self-service:
  - Ask: "Can you manually run nslookup?"
    nslookup my-cosmos-account.documents.azure.com
  - If nslookup fails → Likely VPN/proxy DNS redirect
  - If nslookup succeeds but script fails → Check DNS servers in script output vs nslookup
- If behind corporate proxy:
  - Ask: "Is your traffic routed through a corporate proxy?"
  - If YES: Proxy may be intercepting DNS or blocking .documents.azure.com
  - Action: Customer should contact corporate network team
- If using VPN:
  - Ask: "Does DNS work when you disconnect from VPN?"
  - If YES → VPN DNS redirect issue
  - Action: Customer should contact VPN admin
- Escalation:
  - If all above fail, ask customer to contact their ISP or network provider
  - This is not a Cosmos issue; it's upstream DNS
Playbook: TCP 443 Failed / Public Endpoint
Symptoms: tcp_connectivity_blocked code with public IP
Steps:
- Customer runs detailed checks (-Port and -TraceRoute belong to different parameter sets, so they are separate commands):
  Test-NetConnection -ComputerName <hostname> -Port 443
  Test-NetConnection -ComputerName <hostname> -TraceRoute
- Analyze output:
  - Does it reach gateway/ISP?
  - Where does it drop?
- If corporate network:
  - Check with network team if 443 outbound is allowed to Azure
  - May need to allowlist *.documents.azure.com
- If ISP/home network:
  - Try from mobile hotspot to rule out ISP blocking
  - If hotspot works → ISP is blocking Azure
- If Windows Defender Firewall:
  - Check Windows Defender Firewall for outbound rules
  - Ensure 443 is not blocked
- If behind proxy:
  - Proxy may be doing HTTPS inspection
  - Ask IT if they use SSL Bump/HTTPS Inspection
  - May need to disable inspection for documents.azure.com or accept custom cert
Playbook: Private Endpoint Network Path Blocked
Symptoms: private_endpoint_network_path_blocked code
Steps:
- Gather critical info from customer:
  - Source IP (from script output: execution.hostname and diagnostics.tcp.sourceIp)
  - Resolved PE IP (from script: diagnostics.dns.addresses[0])
  - Is VPN connected?
  - Which VPN client?
- Customer provides to network team:
  - "TCP from [source-IP] to [PE-IP]:443 is timing out"
  - "Please verify routing from VPN subnet to PE subnet"
  - "Please check NSGs for port 443 inbound on PE subnet"
- Network team should check:
  - Route table: Does VPN subnet have route to PE subnet?
  - NSG: PE subnet NSG allows inbound 443?
  - NVA/Firewall: Any stateful filtering blocking traffic?
  - UDR: Any User Defined Routes sending traffic wrong way?
  - VNet peering: If PE in different VNet, is peering configured?
  - Private DNS: Does private DNS zone have A record for PE IP?
- Cosmos team role:
  - Verify account has private endpoint connection in Approved state
  - Check if PE IP matches what Azure reports
  - Provide PE connection details from Azure Portal
- Escalation criteria:
  - If routing is correct but still fails → May be NSG inside PE subnet (rare)
  - If all checks pass → Escalate to Azure Networking support
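The hand-off text for the network team can be filled in from the JSON report instead of being typed by hand. This is a minimal sketch: `network_team_request` is a hypothetical helper, the field paths are the ones this playbook cites (diagnostics.tcp.sourceIp and diagnostics.dns.addresses[0]), and missing fields fall back to the bracketed placeholders.

```python
def network_team_request(report):
    """Build the hand-off text for the network team from a diagnostic report dict."""
    diagnostics = report.get("diagnostics", {})
    source_ip = diagnostics.get("tcp", {}).get("sourceIp") or "[source-IP]"
    addresses = diagnostics.get("dns", {}).get("addresses") or []
    pe_ip = addresses[0] if addresses else "[PE-IP]"
    return "\n".join([
        f"TCP from {source_ip} to {pe_ip}:443 is timing out",
        "Please verify routing from VPN subnet to PE subnet",
        "Please check NSGs for port 443 inbound on PE subnet",
    ])
```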
Playbook: RBAC Insufficient
Symptoms: rbac_insufficient code
Steps:
- Check role assignments:
  az role assignment list --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.DocumentDB/databaseAccounts/<account>
- Assign appropriate role:
  - Cosmos DB Operator (manage the account; no access to keys or data)
  - Cosmos DB Account Reader Role (read-only access to account metadata and read-only keys)
  - Contributor or Owner (full management)
  - Cosmos DB Built-in Data Reader or Cosmos DB Built-in Data Contributor (data-plane read or read/write; assigned with az cosmosdb sql role assignment create)
- If using master key:
  - Primary/secondary keys remain valid until they are regenerated
  - Ask: Have the account keys been regenerated recently?
  - If yes, old keys won't work
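When the role check is automated, the scope string and the CLI output can be handled as below. This is a sketch under assumptions: `account_scope` and `role_names` are hypothetical helpers, and `role_names` expects the text produced by running az role assignment list with --output json, where each assignment carries a roleDefinitionName field. It does not invoke the CLI itself.

```python
import json


def account_scope(subscription_id, resource_group, account_name):
    """Build the ARM resource ID used as --scope for a Cosmos DB account."""
    return (f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
            f"/providers/Microsoft.DocumentDB/databaseAccounts/{account_name}")


def role_names(az_output_json):
    """Extract sorted unique role names from az role assignment list JSON output."""
    assignments = json.loads(az_output_json)
    return sorted({a["roleDefinitionName"] for a in assignments
                   if a.get("roleDefinitionName")})
```

These are control-plane assignments only; data-plane roles for the API for NoSQL are listed separately with az cosmosdb sql role assignment list.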
JSON Parsing for Automation
Python Example (Support Bot)
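import ipaddress
import json

# RFC 1918 ranges, matching the private-endpoint test in the decision tree
RFC1918_NETWORKS = [ipaddress.ip_network(n)
                    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_rfc1918(address):
    """Return True if address is an IPv4 address inside an RFC 1918 range."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False
    return any(ip in net for net in RFC1918_NETWORKS)


def parse_cosmos_diagnostic(json_data):
    """Map a diagnostic JSON report to a support routing decision."""
    report = json.loads(json_data)
    code = report.get("classification", {}).get("code")
    diagnostics = report.get("diagnostics", {})

    # Route based on code
    if code == "network_connectivity_healthy":
        return "Escalate: Auth/RBAC team"
    elif code == "dns_resolution_failed":
        return "Run DNS playbook"
    elif code == "tcp_connectivity_blocked":
        # A substring test such as "10." would also match 210.x and miss 172.16/192.168
        addresses = diagnostics.get("dns", {}).get("addresses") or []
        if addresses and is_rfc1918(addresses[0]):
            return "Run Private Endpoint playbook"
        return "Run TCP Failure / Public Endpoint playbook"
    elif code == "private_endpoint_network_path_blocked":
        return "Run Private Endpoint playbook"
    elif code == "rbac_insufficient":
        roles = diagnostics.get("rbac", {}).get("roleAssignments", [])
        return "Check RBAC: " + str(roles)
    else:
        # str() keeps this safe when the code is missing (None)
        return "Unknown code: " + str(code)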
Support Ticket Template
COSMOS DB CONNECTIVITY ISSUE - DIAGNOSTIC RECEIVED
Classification: [classification.code]
Status: [classification.status]
Summary: [classification.summary]
Network Diagnostics:
DNS Resolution: [diagnostics.dns.succeeded]
TCP 443 Connectivity: [diagnostics.tcp.succeeded]
HTTPS Reachability: [diagnostics.https.statusCode]
Private Endpoint: [diagnostics.privateNetwork.isPrivateRange]
Azure Configuration:
Public Network Restricted: [diagnostics.azureNetworkConfig.publicNetworkAccessRestricted]
Private Endpoints: [diagnostics.azureNetworkConfig.privateEndpoints.length] configured
RBAC Status:
Classification: [diagnostics.rbac.classification]
Can Read Account: [diagnostics.rbac.canReadAccount]
Can Manage Account: [diagnostics.rbac.canManageAccount]
Recommended Actions:
[classification.recommendedActions joined with newlines]
Next Step:
[routing based on classification.code]
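The template can be filled from a parsed report with a small helper. This is a minimal sketch, assuming the report is already a Python dict and that the field paths are exactly those shown in the template. `lookup` and `render_ticket` are hypothetical names, missing fields render as n/a, and the Next Step line is left out because its routing logic lives elsewhere.

```python
def lookup(report, dotted_path, default="n/a"):
    """Follow a dotted path such as 'diagnostics.dns.succeeded' through nested dicts."""
    node = report
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def render_ticket(report):
    """Fill the ticket template from a diagnostic report dict."""
    def field(path):
        return lookup(report, path)

    endpoints = lookup(report, "diagnostics.azureNetworkConfig.privateEndpoints", [])
    endpoint_count = len(endpoints) if isinstance(endpoints, list) else 0
    actions = lookup(report, "classification.recommendedActions", [])
    if not isinstance(actions, list):
        actions = [actions]
    lines = [
        "COSMOS DB CONNECTIVITY ISSUE - DIAGNOSTIC RECEIVED",
        f"Classification: {field('classification.code')}",
        f"Status: {field('classification.status')}",
        f"Summary: {field('classification.summary')}",
        "Network Diagnostics:",
        f"  DNS Resolution: {field('diagnostics.dns.succeeded')}",
        f"  TCP 443 Connectivity: {field('diagnostics.tcp.succeeded')}",
        f"  HTTPS Reachability: {field('diagnostics.https.statusCode')}",
        f"  Private Endpoint: {field('diagnostics.privateNetwork.isPrivateRange')}",
        "Azure Configuration:",
        f"  Public Network Restricted: {field('diagnostics.azureNetworkConfig.publicNetworkAccessRestricted')}",
        f"  Private Endpoints: {endpoint_count} configured",
        "RBAC Status:",
        f"  Classification: {field('diagnostics.rbac.classification')}",
        f"  Can Read Account: {field('diagnostics.rbac.canReadAccount')}",
        f"  Can Manage Account: {field('diagnostics.rbac.canManageAccount')}",
        "Recommended Actions:",
        *[str(a) for a in actions],
    ]
    return "\n".join(lines)
```

Rendering missing fields as n/a rather than raising keeps the ticket usable when the ARM checks were skipped and the azureNetworkConfig or rbac sections are absent.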