network connectivity

This commit is contained in:
Bikram Choudhury
2026-05-13 18:13:05 +05:30
parent 470f043ed5
commit 8b3fb06b23
7 changed files with 3000 additions and 0 deletions
+411
View File
@@ -0,0 +1,411 @@
# Cosmos DB Connectivity Diagnostic - Classification Matrix & Support Guide
## Classification Decision Tree
```
START: Run diagnostic script
├─→ DNS Resolution Check
│ │
│ ├─→ ❌ FAILED
│ │ └─→ Classification: dns_resolution_failed
│ │ Action: DNS/VPN/proxy troubleshooting
│ │
│ └─→ ✓ PASSED
│ │
│ ├─→ Resolved IP is RFC 1918 (10.x, 172.16-31.x, 192.168.x)?
│ │ │
│ │ ├─→ YES (Private endpoint detected)
│ │ │ │
│ │ │ └─→ TCP 443 Test
│ │ │ │
│ │ │ ├─→ ❌ FAILED
│ │ │ │ └─→ private_endpoint_network_path_blocked
│ │ │ │ (VPN route, NSG, firewall, UDR, peering)
│ │ │ │
│ │ │ └─→ ✓ PASSED
│ │ │ └─→ Check RBAC
│ │ │
│ │ └─→ NO (Public endpoint)
│ │ │
│ │ └─→ TCP 443 Test
│ │ │
│ │ ├─→ ❌ FAILED
│ │ │ └─→ tcp_connectivity_blocked
│ │ │ (Firewall, ISP, proxy)
│ │ │
│ │ └─→ ✓ PASSED
│ │ └─→ network_connectivity_healthy
│ │
│ └─→ Check Azure Configuration & RBAC
│ │
│ ├─→ Azure CLI authenticated?
│ │ ├─→ NO → Skip ARM checks, mark warning
│ │ └─→ YES → Query network config & roles
│ │
│ └─→ Sufficient permissions?
│ ├─→ NO → rbac_insufficient
│ └─→ YES → All checks passed
```
---
## Classification Code Reference
### Success Codes
#### `network_connectivity_healthy`
- **Status:** success
- **When:** DNS resolves AND TCP 443 succeeds
- **Interpretation:** Local network is working. If Cosmos DB operations fail, issue is auth/RBAC/data-plane.
- **Actions:**
- Verify RBAC/authentication permissions
- Check account firewall IP rules
- Verify data-plane token hasn't expired
- Check application logs for specific errors
---
### Failure Codes
#### `dns_resolution_failed`
- **Status:** failure
- **When:** DNS lookup fails with SocketException or timeout
- **Interpretation:** Cannot resolve account hostname to any IP
- **Root Causes:**
- DNS server misconfiguration
- VPN/proxy intercepting DNS queries
- Corporate proxy redirecting .documents.azure.com
- Network unreachable before DNS server
- ISP DNS failure
- **Actions:**
1. Check VPN/proxy DNS settings
2. Run `nslookup <endpoint-hostname>`
3. Try alternate DNS: `nslookup <endpoint-hostname> 8.8.8.8`
4. Ping endpoint: `ping <endpoint-hostname>`
5. Contact network team if no resolution
---
#### `tcp_connectivity_blocked`
- **Status:** failure
- **When:** DNS succeeds BUT TCP 443 fails
- **Interpretation:** Network path blocked between client and endpoint
- **Root Causes (Public Endpoint):**
- Corporate firewall blocking outbound 443
- ISP blocking Cosmos/Azure IPs
- Regional geo-blocking
- HTTPS inspection proxy interfering
- Host-level firewall (Windows Defender, etc.)
- **Root Causes (Private Endpoint):**
- VPN not configured for private endpoint subnet
- Route not established between VPN subnet and private endpoint subnet
- NSG rules blocking 443 inbound on PE subnet
- NVA/firewall dropping packets
- UDR misconfiguration
- VNet peering not configured or expired
- Private DNS zone misconfiguration
- **Actions:**
1. Run `Test-NetConnection -ComputerName <hostname> -Port 443 -TraceRoute`
2. If private endpoint: Ask network team to verify VPN routing
3. Check host firewall (Windows Defender, Mac firewall, Linux iptables)
4. If corporate proxy: Verify HTTPS inspection not blocking certificates
5. Try from different network to isolate source
---
#### `private_endpoint_network_path_blocked`
- **Status:** failure
- **When:** Resolved to private IP (10.x, 172.16-31.x, 192.168.x) BUT TCP 443 fails
- **Interpretation:** Private endpoint detected but cannot reach it—network path issue
- **Root Causes:**
- VPN client subnet → private endpoint subnet routing broken
- Firewall/NVA blocking internal traffic
- NSG with restrictive rules on PE subnet
- UDR pointing to wrong next hop
- VNet peering not established
- Private DNS zone not configured or stale
- **Actions:**
1. Confirm VPN is connected and assigned correct subnet
2. Ask network team to verify routing: `route print` (Windows) or `netstat -rn` (Linux/Mac)
3. Check Azure NSG rules on private endpoint subnet for port 443 inbound
4. Verify private DNS zone has A record pointing to PE IP
5. Check if VNet peering exists and is Active
6. Run `Test-NetConnection -ComputerName <pe-ip> -Port 443` directly to PE IP
7. Provide network team with source IP from script output
---
### Warning Codes
#### `rbac_insufficient`
- **Status:** warning
- **When:** Network OK BUT caller lacks data-plane permissions
- **Interpretation:** Network is healthy, but RBAC prevents data operations
- **Actions:**
1. Request Cosmos DB Operator or Contributor role assignment
2. If using connection strings: ensure account hasn't been regenerated
3. Check data-plane RBAC (if enabled) via Azure CLI: `az role assignment list --scope <account-id>`
---
#### `private_endpoint_mismatch`
- **Status:** warning
- **When:** Resolved IP differs from expected private endpoint IP
- **Interpretation:** Routing may be asymmetric or PE configuration changed
- **Actions:**
1. Verify private endpoint IP hasn't changed in Azure Portal
2. Ask network team to check asymmetric routing (DNS from corp vs VPN DNS)
3. Flush DNS cache: `ipconfig /flushdns` (Windows) or `sudo dscacheutil -flushcache` (Mac)
---
#### `azure_config_check_skipped`
- **Status:** warning
- **When:** Azure CLI not authenticated or not installed
- **Interpretation:** Cannot validate ARM-level network config (firewall rules, PE connections)
- **Actions:**
1. Install Azure CLI: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli
2. Authenticate: `az login`
3. Re-run script to collect ARM-level diagnostics
---
#### `unknown_error`
- **Status:** failure or warning
- **When:** Unhandled condition or unexpected error
- **Interpretation:** Script encountered something not in the matrix
- **Actions:**
1. Check script output for error details
2. Provide full JSON report to support
---
## Support Playbook
### Tier 1: Triage (ICM Responder)
**When customer reports: "Cosmos DB operations return HTTP 0.0 / connection errors"**
1. **Ask customer to run script:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 -Interactive
```
2. **Receive JSON output. Check classification.code:**
| Code | Response |
|------|----------|
| `network_connectivity_healthy` | → Escalate to data-plane/auth team. This is not a network issue. |
| `dns_resolution_failed` | → Run script playbook below |
| `tcp_connectivity_blocked` (public endpoint) | → Run TCP failed / public endpoint playbook |
| `private_endpoint_network_path_blocked` | → Run private endpoint playbook |
| `rbac_insufficient` | → Check RBAC permissions |
| `azure_config_check_skipped` | → Ask customer to run `az login` and re-run |
3. **Document:**
- Save JSON report in ICM
- Note classification code and recommended actions
- Link to this support guide in response
---
### Playbook: DNS Resolution Failed
**Symptoms:** `dns_resolution_failed` code
**Steps:**
1. **Verify endpoint name with customer:**
- Check it matches Azure Portal > Cosmos Account > URI
- Typos are common
2. **Customer self-service:**
- Ask: "Can you manually run nslookup?"
```powershell
nslookup my-cosmos-account.documents.azure.com
```
- If nslookup fails → Likely VPN/proxy DNS redirect
- If nslookup succeeds but script fails → Check DNS servers in script output vs nslookup
3. **If behind corporate proxy:**
- Ask: "Is your traffic routed through a corporate proxy?"
- If YES: Proxy may be intercepting DNS or blocking .documents.azure.com
- Action: Customer should contact corporate network team
4. **If using VPN:**
- Ask: "Does DNS work when you disconnect from VPN?"
- If YES → VPN DNS redirect issue
- Action: Customer should contact VPN admin
5. **Escalation:**
- If all above fail, ask customer to contact their ISP or network provider
- This is not a Cosmos issue; it's upstream DNS
---
### Playbook: TCP 443 Failed / Public Endpoint
**Symptoms:** `tcp_connectivity_blocked` code with public IP
**Steps:**
1. **Customer runs detailed trace:**
```powershell
Test-NetConnection -ComputerName <hostname> -Port 443 -TraceRoute
```
2. **Analyze output:**
- Does it reach gateway/ISP?
- Where does it drop?
3. **If corporate network:**
- Check with network team if 443 outbound is allowed to Azure
- May need to whitelist docs.microsoft.com or documents.azure.com
4. **If ISP/home network:**
- Try from mobile hotspot to rule out ISP blocking
- If hotspot works → ISP is blocking Azure
5. **If Windows Defender Firewall:**
- Check Windows Defender Firewall for outbound rules
- Ensure 443 is not blocked
6. **If behind proxy:**
- Proxy may be doing HTTPS inspection
- Ask IT if they use SSL Bump/HTTPS Inspection
- May need to disable inspection for documents.azure.com or accept custom cert
---
### Playbook: Private Endpoint Network Path Blocked
**Symptoms:** `private_endpoint_network_path_blocked` code
**Steps:**
1. **Gather critical info from customer:**
- Source IP (from script output: `execution.hostname` and `diagnostics.tcp.sourceIp`)
- Resolved PE IP (from script: `diagnostics.dns.addresses[0]`)
- Is VPN connected?
- Which VPN client?
2. **Customer provides to network team:**
- "TCP from [source-IP] to [PE-IP]:443 is timing out"
- "Please verify routing from VPN subnet to PE subnet"
- "Please check NSGs for port 443 inbound on PE subnet"
3. **Network team should check:**
- Route table: Does VPN subnet have route to PE subnet?
- NSG: PE subnet NSG allows inbound 443?
- NVA/Firewall: Any stateful filtering blocking traffic?
- UDR: Any User Defined Routes sending traffic wrong way?
- VNet peering: If PE in different VNet, is peering configured?
- Private DNS: Does private DNS zone have A record for PE IP?
4. **Cosmos team role:**
- Verify account has private endpoint connection in Approved state
- Check if PE IP matches what Azure reports
- Provide PE connection details from Azure Portal
5. **Escalation criteria:**
- If routing is correct but still fails → May be NSG inside PE subnet (rare)
- If all checks pass → Escalate to Azure Networking support
---
### Playbook: RBAC Insufficient
**Symptoms:** `rbac_insufficient` code
**Steps:**
1. **Check role assignments:**
```powershell
az role assignment list --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.DocumentDB/databaseAccounts/<account>
```
2. **Assign appropriate role:**
- Cosmos DB Operator (read/write data)
- Cosmos DB Account Reader (read-only)
- Contributor or Owner (full management)
3. **If using master key:**
- Primary/secondary keys are still valid if account hasn't been regenerated
- Ask: Has the account been regenerated recently?
- If yes, old keys won't work
---
## JSON Parsing for Automation
### Python Example (Support Bot)
```python
import json
def parse_cosmos_diagnostic(json_data):
report = json.loads(json_data)
classification = report.get("classification", {})
code = classification.get("code")
status = classification.get("status")
# Route based on code
if code == "network_connectivity_healthy":
return "Escalate: Auth/RBAC team"
elif code == "dns_resolution_failed":
return "Run DNS playbook"
elif code == "tcp_connectivity_blocked":
endpoint = report["target"]["endpointUrl"]
if "10." in report["diagnostics"]["dns"]["addresses"][0]:
return "Run Private Endpoint playbook"
else:
return "Run TCP Failure / Public Endpoint playbook"
elif code == "private_endpoint_network_path_blocked":
return "Run Private Endpoint playbook"
elif code == "rbac_insufficient":
return "Check RBAC: " + str(report["diagnostics"]["rbac"]["roleAssignments"])
else:
return "Unknown code: " + code
```
### Support Ticket Template
```
COSMOS DB CONNECTIVITY ISSUE - DIAGNOSTIC RECEIVED
Classification: [classification.code]
Status: [classification.status]
Summary: [classification.summary]
Network Diagnostics:
DNS Resolution: [diagnostics.dns.succeeded]
TCP 443 Connectivity: [diagnostics.tcp.succeeded]
HTTPS Reachability: [diagnostics.https.statusCode]
Private Endpoint: [diagnostics.privateNetwork.isPrivateRange]
Azure Configuration:
Public Network Restricted: [diagnostics.azureNetworkConfig.publicNetworkAccessRestricted]
Private Endpoints: [diagnostics.azureNetworkConfig.privateEndpoints.length] configured
RBAC Status:
Classification: [diagnostics.rbac.classification]
Can Read Account: [diagnostics.rbac.canReadAccount]
Can Manage Account: [diagnostics.rbac.canManageAccount]
Recommended Actions:
[classification.recommendedActions joined with newlines]
Next Step:
[routing based on classification.code]
```
---
## References
- [Azure Cosmos DB Troubleshoot Connectivity Issues](https://learn.microsoft.com/en-us/azure/cosmos-db/troubleshoot-connection)
- [Private Endpoints for Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-configure-private-endpoints)
- [Network Security Groups](https://learn.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview)
- [User Defined Routes](https://learn.microsoft.com/en-us/azure/virtual-network/virtual-networks-udr-overview)