mirror of
https://github.com/Azure/cosmos-explorer.git
synced 2026-05-15 09:47:30 +01:00
network connectivity
# Cosmos DB Connectivity Diagnostic - Classification Matrix & Support Guide

## Classification Decision Tree

```
START: Run diagnostic script
│
├─→ DNS Resolution Check
│   │
│   ├─→ ❌ FAILED
│   │   └─→ Classification: dns_resolution_failed
│   │       Action: DNS/VPN/proxy troubleshooting
│   │
│   └─→ ✓ PASSED
│       │
│       ├─→ Resolved IP is RFC 1918 (10.x, 172.16-31.x, 192.168.x)?
│       │   │
│       │   ├─→ YES (Private endpoint detected)
│       │   │   │
│       │   │   └─→ TCP 443 Test
│       │   │       │
│       │   │       ├─→ ❌ FAILED
│       │   │       │   └─→ private_endpoint_network_path_blocked
│       │   │       │       (VPN route, NSG, firewall, UDR, peering)
│       │   │       │
│       │   │       └─→ ✓ PASSED
│       │   │           └─→ Check RBAC
│       │   │
│       │   └─→ NO (Public endpoint)
│       │       │
│       │       └─→ TCP 443 Test
│       │           │
│       │           ├─→ ❌ FAILED
│       │           │   └─→ tcp_connectivity_blocked
│       │           │       (Firewall, ISP, proxy)
│       │           │
│       │           └─→ ✓ PASSED
│       │               └─→ network_connectivity_healthy
│       │
│       └─→ Check Azure Configuration & RBAC
│           │
│           ├─→ Azure CLI authenticated?
│           │   ├─→ NO → Skip ARM checks, mark warning
│           │   └─→ YES → Query network config & roles
│           │
│           └─→ Sufficient permissions?
│               ├─→ NO → rbac_insufficient
│               └─→ YES → All checks passed
```
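
The branching in the tree above can be expressed as a small function. This is a minimal sketch, not the diagnostic script's actual logic: the function name `classify` and its boolean parameters are hypothetical, and it assumes that skipped Azure CLI checks are passed in as `None` and reported separately as a warning rather than changing the primary code.

```python
from typing import Optional


def classify(dns_ok: bool, is_private_ip: bool, tcp_ok: bool,
             has_permissions: Optional[bool] = None) -> str:
    """Walk the decision tree and return a classification code.

    has_permissions is None when the Azure CLI checks were skipped
    (an assumption of this sketch, not confirmed script behavior).
    """
    if not dns_ok:
        return "dns_resolution_failed"
    if not tcp_ok:
        # Same TCP failure, different code depending on the resolved IP
        if is_private_ip:
            return "private_endpoint_network_path_blocked"
        return "tcp_connectivity_blocked"
    if has_permissions is False:
        return "rbac_insufficient"
    return "network_connectivity_healthy"
```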

---

## Classification Code Reference

### Success Codes

#### `network_connectivity_healthy`
- **Status:** success
- **When:** DNS resolves AND TCP 443 succeeds
- **Interpretation:** The local network path is working. If Cosmos DB operations still fail, the issue is in authentication, RBAC, or the data plane.
- **Actions:**
  - Verify RBAC/authentication permissions
  - Check account firewall IP rules
  - Verify that the data-plane token hasn't expired
  - Check application logs for specific errors

---

### Failure Codes

#### `dns_resolution_failed`
- **Status:** failure
- **When:** DNS lookup fails with SocketException or timeout
- **Interpretation:** The account hostname cannot be resolved to any IP address
- **Root Causes:**
  - DNS server misconfiguration
  - VPN/proxy intercepting DNS queries
  - Corporate proxy redirecting `.documents.azure.com`
  - DNS server unreachable from the client network
  - ISP DNS failure
- **Actions:**
  1. Check VPN/proxy DNS settings
  2. Run `nslookup <endpoint-hostname>`
  3. Try alternate DNS: `nslookup <endpoint-hostname> 8.8.8.8`
  4. Ping endpoint to confirm the name resolves (ICMP replies themselves may be blocked): `ping <endpoint-hostname>`
  5. Contact network team if no resolution
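
For scripted triage, the `nslookup` step above can be approximated with the Python standard library. This is a minimal sketch under stated assumptions: `resolve_hostname` is a hypothetical helper, its `succeeded` and `addresses` keys mirror the `diagnostics.dns` fields used elsewhere in this guide, and the `error` key is an addition for illustration. It uses the operating system resolver, so it cannot target an alternate DNS server the way `nslookup <host> 8.8.8.8` does.

```python
import socket


def resolve_hostname(hostname: str) -> dict:
    """Resolve hostname with the OS resolver and list the distinct addresses."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except (socket.gaierror, UnicodeError) as exc:
        # gaierror: the resolver could not find or reach the name
        # UnicodeError: the hostname is malformed (for example an empty label)
        return {"succeeded": False, "addresses": [], "error": str(exc)}
    addresses = sorted({info[4][0] for info in infos})
    return {"succeeded": True, "addresses": addresses, "error": None}
```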

---

#### `tcp_connectivity_blocked`
- **Status:** failure
- **When:** DNS succeeds BUT TCP 443 fails
- **Interpretation:** Network path blocked between client and endpoint
- **Root Causes (Public Endpoint):**
  - Corporate firewall blocking outbound 443
  - ISP blocking Cosmos/Azure IPs
  - Regional geo-blocking
  - HTTPS inspection proxy interfering
  - Host-level firewall (Windows Defender, etc.)
- **Root Causes (Private Endpoint):**
  - VPN not configured for private endpoint subnet
  - Route not established between VPN subnet and private endpoint subnet
  - NSG rules blocking 443 inbound on PE subnet
  - NVA/firewall dropping packets
  - UDR misconfiguration
  - VNet peering not configured or in a Disconnected state
  - Private DNS zone misconfiguration
- **Actions:**
  1. Run `Test-NetConnection -ComputerName <hostname> -Port 443 -TraceRoute`
  2. If private endpoint: Ask network team to verify VPN routing
  3. Check host firewall (Windows Defender, Mac firewall, Linux iptables)
  4. If corporate proxy: Verify that HTTPS inspection is not interfering with the TLS connection
  5. Try from a different network to isolate the source
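
On hosts without PowerShell, the TCP 443 check can be approximated as follows. This is a minimal sketch: `tcp_port_open` is a hypothetical helper, and it only confirms that a TCP handshake completes. It performs no TLS negotiation and no route tracing, so it will not reveal HTTPS inspection or the hop where traffic is dropped.

```python
import socket


def tcp_port_open(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake with host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures, unreachable networks
        return False
```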

---

#### `private_endpoint_network_path_blocked`
- **Status:** failure
- **When:** Resolved to private IP (10.x, 172.16-31.x, 192.168.x) BUT TCP 443 fails
- **Interpretation:** A private endpoint was detected but cannot be reached; this is a network path issue
- **Root Causes:**
  - VPN client subnet → private endpoint subnet routing broken
  - Firewall/NVA blocking internal traffic
  - NSG with restrictive rules on PE subnet
  - UDR pointing to wrong next hop
  - VNet peering not established
  - Private DNS zone not configured or stale
- **Actions:**
  1. Confirm VPN is connected and assigned the correct subnet
  2. Ask network team to verify routing: `route print` (Windows) or `netstat -rn` (Linux/Mac)
  3. Check Azure NSG rules on private endpoint subnet for port 443 inbound
  4. Verify private DNS zone has an A record pointing to the PE IP
  5. Check that VNet peering exists and its peering state is Connected
  6. Run `Test-NetConnection -ComputerName <pe-ip> -Port 443` directly against the PE IP
  7. Provide network team with the source IP from the script output

---

### Warning Codes

#### `rbac_insufficient`
- **Status:** warning
- **When:** Network OK BUT caller lacks data-plane permissions
- **Interpretation:** Network is healthy, but RBAC prevents data operations
- **Actions:**
  1. Request a data-plane role assignment such as Cosmos DB Built-in Data Reader or Cosmos DB Built-in Data Contributor (the Cosmos DB Operator role manages accounts but does not grant data access)
  2. If using connection strings: ensure the account keys haven't been regenerated
  3. Check role assignments via Azure CLI: `az role assignment list --scope <account-id>` for Azure RBAC, or `az cosmosdb sql role assignment list --account-name <account> --resource-group <rg>` for data-plane RBAC (if enabled)

---

#### `private_endpoint_mismatch`
- **Status:** warning
- **When:** Resolved IP differs from expected private endpoint IP
- **Interpretation:** Name resolution may be inconsistent between resolvers, or the PE configuration changed
- **Actions:**
  1. Verify private endpoint IP hasn't changed in Azure Portal
  2. Ask network team to check for inconsistent resolution (corporate DNS vs VPN DNS)
  3. Flush DNS cache: `ipconfig /flushdns` (Windows) or `sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder` (Mac)
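
The mismatch condition above amounts to a set comparison. The sketch below assumes the expected private endpoint addresses are supplied by the caller (for example copied from the Azure Portal); `find_endpoint_mismatch` is a hypothetical helper, not part of the diagnostic script.

```python
import ipaddress


def find_endpoint_mismatch(resolved, expected):
    """Return resolved addresses that are not among the expected PE addresses."""
    # Parse both sides so that formatting differences do not cause false alarms
    expected_ips = {ipaddress.ip_address(a) for a in expected}
    return [a for a in resolved if ipaddress.ip_address(a) not in expected_ips]
```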

---

#### `azure_config_check_skipped`
- **Status:** warning
- **When:** Azure CLI not authenticated or not installed
- **Interpretation:** Cannot validate ARM-level network config (firewall rules, PE connections)
- **Actions:**
  1. Install Azure CLI: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli
  2. Authenticate: `az login`
  3. Re-run script to collect ARM-level diagnostics
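
A pre-flight check can tell the two causes apart before the script is re-run. This is a minimal sketch: `azure_cli_state` and its three return values are hypothetical names, and it relies on `az account show` exiting with a non-zero code when no login session exists.

```python
import shutil
import subprocess


def azure_cli_state(executable: str = "az") -> str:
    """Return 'not_installed', 'not_authenticated', or 'ready'."""
    path = shutil.which(executable)
    if path is None:
        return "not_installed"
    try:
        # `az account show` fails when the CLI has no active login
        result = subprocess.run(
            [path, "account", "show", "--output", "none"],
            capture_output=True, timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "not_authenticated"
    return "ready" if result.returncode == 0 else "not_authenticated"
```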

---

#### `unknown_error`
- **Status:** failure or warning
- **When:** Unhandled condition or unexpected error
- **Interpretation:** Script encountered something not in the matrix
- **Actions:**
  1. Check script output for error details
  2. Provide full JSON report to support

---

## Support Playbook

### Tier 1: Triage (ICM Responder)

**When customer reports: "Cosmos DB operations return HTTP 0.0 / connection errors"**

1. **Ask customer to run script:**
   ```powershell
   .\Diagnose-CosmosConnectivity.ps1 -Interactive
   ```

2. **Receive JSON output. Check `classification.code`:**

   | Code | Response |
   |------|----------|
   | `network_connectivity_healthy` | → Escalate to data-plane/auth team. This is not a network issue. |
   | `dns_resolution_failed` | → Run DNS Resolution Failed playbook |
   | `tcp_connectivity_blocked` (public endpoint) | → Run TCP 443 Failed / Public Endpoint playbook |
   | `private_endpoint_network_path_blocked` | → Run Private Endpoint Network Path Blocked playbook |
   | `rbac_insufficient` | → Run RBAC Insufficient playbook |
   | `azure_config_check_skipped` | → Ask customer to run `az login` and re-run |

3. **Document:**
   - Save JSON report in ICM
   - Note classification code and recommended actions
   - Link to this support guide in response

---

### Playbook: DNS Resolution Failed

**Symptoms:** `dns_resolution_failed` code

**Steps:**

1. **Verify endpoint name with customer:**
   - Check it matches Azure Portal > Cosmos Account > URI
   - Typos are common

2. **Customer self-service:**
   - Ask: "Can you manually run nslookup?"
     ```powershell
     nslookup my-cosmos-account.documents.azure.com
     ```
   - If nslookup fails → Likely VPN/proxy DNS redirect
   - If nslookup succeeds but script fails → Compare the DNS servers in the script output with the server nslookup used

3. **If behind corporate proxy:**
   - Ask: "Is your traffic routed through a corporate proxy?"
   - If YES: Proxy may be intercepting DNS or blocking `.documents.azure.com`
   - Action: Customer should contact corporate network team

4. **If using VPN:**
   - Ask: "Does DNS work when you disconnect from VPN?"
   - If YES → VPN DNS redirect issue
   - Action: Customer should contact VPN admin

5. **Escalation:**
   - If all of the above fail, ask customer to contact their ISP or network provider
   - This is not a Cosmos issue; it's upstream DNS
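
Because typos are common (step 1), the endpoint can be sanity-checked before any DNS work. The sketch below is an assumption-laden illustration: `extract_cosmos_hostname` is a hypothetical helper, it only recognizes the public-cloud `.documents.azure.com` suffix, and the pattern assumes account names of 3 to 44 lowercase letters, digits, and hyphens that start with a letter or digit.

```python
import re
from urllib.parse import urlparse

# Assumed naming rule: 3-44 chars, lowercase letters, digits, hyphens
_ACCOUNT_HOST = re.compile(r"^[a-z0-9][a-z0-9-]{2,43}\.documents\.azure\.com$")


def extract_cosmos_hostname(endpoint: str):
    """Return the hostname if endpoint looks like a Cosmos DB URI, else None."""
    candidate = endpoint.strip()
    if "://" not in candidate:
        # Accept bare hostnames as well as full URIs
        candidate = "https://" + candidate
    host = (urlparse(candidate).hostname or "").lower()
    return host if _ACCOUNT_HOST.match(host) else None
```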

---

### Playbook: TCP 443 Failed / Public Endpoint

**Symptoms:** `tcp_connectivity_blocked` code with public IP

**Steps:**

1. **Customer runs detailed trace:**
   ```powershell
   Test-NetConnection -ComputerName <hostname> -Port 443 -TraceRoute
   ```

2. **Analyze output:**
   - Does it reach gateway/ISP?
   - Where does it drop?

3. **If corporate network:**
   - Check with network team whether outbound 443 to Azure is allowed
   - May need to allowlist `*.documents.azure.com`

4. **If ISP/home network:**
   - Try from mobile hotspot to rule out ISP blocking
   - If hotspot works → ISP is blocking Azure

5. **If Windows Defender Firewall:**
   - Check Windows Defender Firewall for outbound rules
   - Ensure 443 is not blocked

6. **If behind proxy:**
   - Proxy may be doing HTTPS inspection
   - Ask IT if they use SSL Bump/HTTPS Inspection
   - May need to disable inspection for `documents.azure.com` or accept the proxy's custom certificate

---

### Playbook: Private Endpoint Network Path Blocked

**Symptoms:** `private_endpoint_network_path_blocked` code

**Steps:**

1. **Gather critical info from customer:**
   - Source host and IP (from script output: `execution.hostname` and `diagnostics.tcp.sourceIp`)
   - Resolved PE IP (from script: `diagnostics.dns.addresses[0]`)
   - Is VPN connected?
   - Which VPN client?

2. **Customer provides to network team:**
   - "TCP from [source-IP] to [PE-IP]:443 is timing out"
   - "Please verify routing from VPN subnet to PE subnet"
   - "Please check NSGs for port 443 inbound on PE subnet"

3. **Network team should check:**
   - Route table: Does VPN subnet have a route to PE subnet?
   - NSG: Does the PE subnet NSG allow inbound 443?
   - NVA/Firewall: Is any stateful filtering blocking traffic?
   - UDR: Are any User Defined Routes sending traffic the wrong way?
   - VNet peering: If PE is in a different VNet, is peering configured and Connected?
   - Private DNS: Does private DNS zone have an A record for the PE IP?

4. **Cosmos team role:**
   - Verify account has private endpoint connection in Approved state
   - Check if PE IP matches what Azure reports
   - Provide PE connection details from Azure Portal

5. **Escalation criteria:**
   - If routing is correct but still fails → May be NSG inside PE subnet (rare)
   - If all checks pass → Escalate to Azure Networking support
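
The request in step 2 can be filled in directly from the JSON report. This is a minimal sketch: `network_team_request` is a hypothetical helper, and it assumes the report exposes `diagnostics.tcp.sourceIp` and `diagnostics.dns.addresses` as named in step 1, falling back to placeholders when either is missing.

```python
def network_team_request(report: dict) -> str:
    """Build the message a customer can forward to their network team."""
    diagnostics = report.get("diagnostics", {})
    source_ip = diagnostics.get("tcp", {}).get("sourceIp") or "<unknown source IP>"
    addresses = diagnostics.get("dns", {}).get("addresses") or []
    pe_ip = addresses[0] if addresses else "<unknown PE IP>"
    return "\n".join([
        f"TCP from {source_ip} to {pe_ip}:443 is timing out.",
        "Please verify routing from VPN subnet to PE subnet.",
        "Please check NSGs for port 443 inbound on PE subnet.",
    ])
```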

---

### Playbook: RBAC Insufficient

**Symptoms:** `rbac_insufficient` code

**Steps:**

1. **Check role assignments:**
   ```powershell
   az role assignment list --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.DocumentDB/databaseAccounts/<account>
   ```

2. **Assign appropriate role:**
   - Cosmos DB Built-in Data Contributor (read/write data; data-plane role, assigned with `az cosmosdb sql role assignment create`)
   - Cosmos DB Built-in Data Reader (read-only data; data-plane role)
   - Cosmos DB Account Reader Role (read-only access to account metadata)
   - Cosmos DB Operator (manage accounts, without access to data or keys)
   - Contributor or Owner (full management)

3. **If using master key:**
   - Primary/secondary keys remain valid until they are regenerated
   - Ask: Have the account keys been regenerated recently?
   - If yes, old keys won't work

---

## JSON Parsing for Automation

### Python Example (Support Bot)

```python
import ipaddress
import json

# RFC 1918 private ranges, matching the decision tree's private endpoint test
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if address is an IPv4 address in an RFC 1918 range."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False
    return ip.version == 4 and any(ip in net for net in RFC1918_NETWORKS)


def parse_cosmos_diagnostic(json_data):
    """Map a diagnostic JSON report to the next support action."""
    report = json.loads(json_data)
    code = report.get("classification", {}).get("code")
    diagnostics = report.get("diagnostics", {})

    # Route based on code
    if code == "network_connectivity_healthy":
        return "Escalate: Auth/RBAC team"
    elif code == "dns_resolution_failed":
        return "Run DNS playbook"
    elif code == "tcp_connectivity_blocked":
        # A private resolved IP means the private endpoint playbook applies
        addresses = diagnostics.get("dns", {}).get("addresses") or []
        if addresses and is_rfc1918(addresses[0]):
            return "Run Private Endpoint playbook"
        return "Run TCP Failure / Public Endpoint playbook"
    elif code == "private_endpoint_network_path_blocked":
        return "Run Private Endpoint playbook"
    elif code == "rbac_insufficient":
        roles = diagnostics.get("rbac", {}).get("roleAssignments")
        return "Check RBAC: " + str(roles)
    else:
        # str() keeps this safe when the report has no classification code
        return "Unknown code: " + str(code)
```

### Support Ticket Template

```
COSMOS DB CONNECTIVITY ISSUE - DIAGNOSTIC RECEIVED

Classification: [classification.code]
Status: [classification.status]
Summary: [classification.summary]

Network Diagnostics:
  DNS Resolution: [diagnostics.dns.succeeded]
  TCP 443 Connectivity: [diagnostics.tcp.succeeded]
  HTTPS Reachability: [diagnostics.https.statusCode]
  Private Endpoint: [diagnostics.privateNetwork.isPrivateRange]

Azure Configuration:
  Public Network Restricted: [diagnostics.azureNetworkConfig.publicNetworkAccessRestricted]
  Private Endpoints: [diagnostics.azureNetworkConfig.privateEndpoints.length] configured

RBAC Status:
  Classification: [diagnostics.rbac.classification]
  Can Read Account: [diagnostics.rbac.canReadAccount]
  Can Manage Account: [diagnostics.rbac.canManageAccount]

Recommended Actions:
  [classification.recommendedActions joined with newlines]

Next Step:
  [routing based on classification.code]
```
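
The bracketed placeholders in the template are dotted paths into the JSON report, so filling it in can be automated. The sketch below renders only part of the template and is not the script's own output format; `get_path` and `render_ticket` are hypothetical helpers, and missing fields are shown as `n/a` by assumption.

```python
def get_path(report, path, default="n/a"):
    """Follow a dotted path such as 'diagnostics.dns.succeeded' through nested dicts."""
    node = report
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def render_ticket(report):
    """Render the header, network, and action sections of the ticket template."""
    actions = get_path(report, "classification.recommendedActions", default=[])
    lines = [
        "COSMOS DB CONNECTIVITY ISSUE - DIAGNOSTIC RECEIVED",
        "",
        f"Classification: {get_path(report, 'classification.code')}",
        f"Status: {get_path(report, 'classification.status')}",
        f"Summary: {get_path(report, 'classification.summary')}",
        "",
        "Network Diagnostics:",
        f"  DNS Resolution: {get_path(report, 'diagnostics.dns.succeeded')}",
        f"  TCP 443 Connectivity: {get_path(report, 'diagnostics.tcp.succeeded')}",
        "",
        "Recommended Actions:",
        *[f"  {action}" for action in actions],
    ]
    return "\n".join(lines)
```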

---

## References

- [Azure Cosmos DB Troubleshoot Connectivity Issues](https://learn.microsoft.com/en-us/azure/cosmos-db/troubleshoot-connection)
- [Private Endpoints for Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-configure-private-endpoints)
- [Network Security Groups](https://learn.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview)
- [User Defined Routes](https://learn.microsoft.com/en-us/azure/virtual-network/virtual-networks-udr-overview)