network connectivity

Bikram Choudhury
2026-05-13 18:13:05 +05:30
parent 470f043ed5
commit 8b3fb06b23
7 changed files with 3000 additions and 0 deletions
+411
@@ -0,0 +1,411 @@
# Cosmos DB Connectivity Diagnostic - Classification Matrix & Support Guide
## Classification Decision Tree
```
START: Run diagnostic script
├─→ DNS Resolution Check
│ │
│ ├─→ ❌ FAILED
│ │ └─→ Classification: dns_resolution_failed
│ │ Action: DNS/VPN/proxy troubleshooting
│ │
│ └─→ ✓ PASSED
│ │
│ ├─→ Resolved IP is RFC 1918 (10.x, 172.16-31.x, 192.168.x)?
│ │ │
│ │ ├─→ YES (Private endpoint detected)
│ │ │ │
│ │ │ └─→ TCP 443 Test
│ │ │ │
│ │ │ ├─→ ❌ FAILED
│ │ │ │ └─→ private_endpoint_network_path_blocked
│ │ │ │ (VPN route, NSG, firewall, UDR, peering)
│ │ │ │
│ │ │ └─→ ✓ PASSED
│ │ │ └─→ Check RBAC
│ │ │
│ │ └─→ NO (Public endpoint)
│ │ │
│ │ └─→ TCP 443 Test
│ │ │
│ │ ├─→ ❌ FAILED
│ │ │ └─→ tcp_connectivity_blocked
│ │ │ (Firewall, ISP, proxy)
│ │ │
│ │ └─→ ✓ PASSED
│ │ └─→ network_connectivity_healthy
│ │
│ └─→ Check Azure Configuration & RBAC
│ │
│ ├─→ Azure CLI authenticated?
│ │ ├─→ NO → Skip ARM checks, mark warning
│ │ └─→ YES → Query network config & roles
│ │
│ └─→ Sufficient permissions?
│ ├─→ NO → rbac_insufficient
│ └─→ YES → All checks passed
```
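The network branch of this tree can be written as a small function. This is a minimal sketch under stated assumptions, not the script's implementation (the script is PowerShell). The function name `classify_network` and its boolean inputs are hypothetical. The ARM/RBAC branch is reduced to two optional flags, and the precedence between the two warning codes is also an assumption.

```python
import ipaddress

# RFC 1918 ranges the tree uses to detect a private endpoint.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if address is an IPv4 address inside an RFC 1918 range."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and any(ip in net for net in RFC1918_NETWORKS)


def classify_network(dns_succeeded, addresses, tcp_succeeded,
                     cli_authenticated=None, rbac_sufficient=None):
    """Hypothetical sketch of the decision tree; returns a classification code."""
    if not dns_succeeded:
        return "dns_resolution_failed"
    private = any(is_rfc1918(a) for a in addresses)
    if not tcp_succeeded:
        return ("private_endpoint_network_path_blocked" if private
                else "tcp_connectivity_blocked")
    # Network path is healthy. Assumption: ARM/RBAC warnings take precedence
    # over the plain "healthy" code when they are known to be False.
    if cli_authenticated is False:
        return "azure_config_check_skipped"
    if rbac_sufficient is False:
        return "rbac_insufficient"
    return "network_connectivity_healthy"
```

For example, `classify_network(True, ["10.123.171.30"], False)` yields `private_endpoint_network_path_blocked`, which matches the left branch of the tree.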
---
## Classification Code Reference
### Success Codes
#### `network_connectivity_healthy`
- **Status:** success
- **When:** DNS resolves AND TCP 443 succeeds
- **Interpretation:** The local network path is working. If Cosmos DB operations still fail, the issue is most likely authentication, RBAC, or data-plane related.
- **Actions:**
- Verify RBAC/authentication permissions
- Check account firewall IP rules
- Verify data-plane token hasn't expired
- Check application logs for specific errors
---
### Failure Codes
#### `dns_resolution_failed`
- **Status:** failure
- **When:** DNS lookup fails with SocketException or timeout
- **Interpretation:** Cannot resolve account hostname to any IP
- **Root Causes:**
- DNS server misconfiguration
- VPN/proxy intercepting DNS queries
- Corporate proxy redirecting .documents.azure.com
- Network unreachable before DNS server
- ISP DNS failure
- **Actions:**
1. Check VPN/proxy DNS settings
2. Run `nslookup <endpoint-hostname>`
3. Try alternate DNS: `nslookup <endpoint-hostname> 8.8.8.8`
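4. Ping the endpoint: `ping <endpoint-hostname>` (Azure endpoints usually do not answer ICMP, so only the resolved IP shown in the first line of output is meaningful)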
5. Contact network team if no resolution
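For automation, the same "does the hostname resolve at all" check can be sketched with Python's standard `socket` module. This is an illustration only, not the script's code. It returns a dict shaped like `diagnostics.dns` from the JSON schema and uses the operating system's resolver, including the hosts file, so it reflects the same VPN/proxy DNS configuration an application would see. Unlike `nslookup <host> 8.8.8.8`, it cannot target an alternate DNS server. The function name `resolve_host` is hypothetical.

```python
import socket
import time


def resolve_host(hostname: str) -> dict:
    """Resolve hostname via the OS resolver; mirrors the diagnostics.dns shape."""
    result = {"hostname": hostname, "succeeded": False, "addresses": [],
              "error": None, "latencyMs": 0}
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        # getaddrinfo can return duplicates; keep first-seen order.
        addresses = []
        for _family, _type, _proto, _canon, sockaddr in infos:
            if sockaddr[0] not in addresses:
                addresses.append(sockaddr[0])
        result["succeeded"] = True
        result["addresses"] = addresses
    except socket.gaierror as exc:
        # Name resolution failed (no such host, resolver unreachable, etc.).
        result["error"] = str(exc)
    result["latencyMs"] = int((time.perf_counter() - start) * 1000)
    return result
```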
---
#### `tcp_connectivity_blocked`
- **Status:** failure
- **When:** DNS succeeds BUT TCP 443 fails
- **Interpretation:** Network path blocked between client and endpoint
- **Root Causes (Public Endpoint):**
- Corporate firewall blocking outbound 443
- ISP blocking Cosmos/Azure IPs
- Regional geo-blocking
- HTTPS inspection proxy interfering
- Host-level firewall (Windows Defender, etc.)
- **Root Causes (Private Endpoint):**
- VPN not configured for private endpoint subnet
- Route not established between VPN subnet and private endpoint subnet
- NSG rules blocking 443 inbound on PE subnet
- NVA/firewall dropping packets
- UDR misconfiguration
- VNet peering not configured or in a Disconnected state
- Private DNS zone misconfiguration
- **Actions:**
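1. Run `Test-NetConnection -ComputerName <hostname> -Port 443`, then `Test-NetConnection -ComputerName <hostname> -TraceRoute` (the two switches cannot be combined in one call; on Linux/macOS use `nc -vz <hostname> 443` and `traceroute <hostname>`)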
2. If private endpoint: Ask network team to verify VPN routing
3. Check the host firewall (Windows Defender Firewall, macOS firewall, Linux iptables/nftables)
4. If behind a corporate proxy: verify that HTTPS inspection is not replacing or blocking the server certificate
5. Try from different network to isolate source
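Where `Test-NetConnection` is not available (Linux, macOS), an equivalent TCP 443 check can be sketched in Python. `probe_tcp` is a hypothetical helper that returns a dict shaped like `diagnostics.tcp`. The 5-second default mirrors `$TcpConnectTimeoutMs` in the script. The error strings are chosen to fit the schema's "timeout"/"refused" conventions rather than copied from the script.

```python
import socket
import time


def probe_tcp(host: str, port: int = 443, timeout_s: float = 5.0) -> dict:
    """Attempt a TCP connection; mirrors the diagnostics.tcp shape."""
    result = {"hostname": host, "port": port, "succeeded": False,
              "error": None, "latencyMs": 0, "sourceIp": None}
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            result["succeeded"] = True
            # Local address the OS chose for this connection (useful for routing questions).
            result["sourceIp"] = sock.getsockname()[0]
            result["latencyMs"] = int((time.perf_counter() - start) * 1000)
    except socket.timeout:
        # No answer at all: typical of a firewall/NVA silently dropping packets.
        result["error"] = f"Connection timeout after {int(timeout_s * 1000)}ms"
    except ConnectionRefusedError:
        # An active reset: something on the path or the target rejected the connection.
        result["error"] = "Connection refused"
    except OSError as exc:  # DNS failure, unreachable network, etc.
        result["error"] = str(exc)
    return result
```

Running it from two different networks, as in step 5, and comparing `error` values is a quick way to tell a local block from a remote one.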
---
#### `private_endpoint_network_path_blocked`
- **Status:** failure
- **When:** Resolved to private IP (10.x, 172.16-31.x, 192.168.x) BUT TCP 443 fails
- **Interpretation:** Private endpoint detected but cannot reach it—network path issue
- **Root Causes:**
- VPN client subnet → private endpoint subnet routing broken
- Firewall/NVA blocking internal traffic
- NSG with restrictive rules on PE subnet
- UDR pointing to wrong next hop
- VNet peering not established
- Private DNS zone not configured or stale
- **Actions:**
1. Confirm VPN is connected and assigned correct subnet
2. Check the local routing table for a route to the PE subnet (`route print` on Windows, `netstat -rn` on Linux/macOS) and share it with the network team
3. Check Azure NSG rules on private endpoint subnet for port 443 inbound
4. Verify the private DNS zone (`privatelink.documents.azure.com` for the NoSQL API) has an A record pointing to the PE IP
5. Check that VNet peering exists and its status is Connected
6. Run `Test-NetConnection -ComputerName <pe-ip> -Port 443` directly to PE IP
7. Provide network team with source IP from script output
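For step 1, if the customer supplied `-VpnSubnetRange`, a quick sanity check is whether the client's source IP actually falls inside that range. Here is a minimal sketch with Python's `ipaddress` module. The function name is hypothetical. Its inputs would be `diagnostics.tcp.sourceIp` (or the output of `ipconfig` / `ip addr` when that field is null) and the VPN subnet CIDR.

```python
import ipaddress


def source_in_vpn_subnet(source_ip, vpn_subnet):
    """Return True/False if source_ip is inside vpn_subnet, or None if unknown."""
    if not source_ip or not vpn_subnet:
        # Not enough information, e.g. sourceIp is null after a failed TCP attempt.
        return None
    # strict=False tolerates host bits being set, e.g. "10.0.0.5/24".
    network = ipaddress.ip_network(vpn_subnet, strict=False)
    return ipaddress.ip_address(source_ip) in network
```

A `False` result suggests the client is not using the expected VPN address pool. For example, the VPN may be disconnected, or split tunneling may be sending traffic for the PE subnet out a different interface.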
---
### Warning Codes
#### `rbac_insufficient`
- **Status:** warning
- **When:** Network OK BUT caller lacks data-plane permissions
- **Interpretation:** Network is healthy, but RBAC prevents data operations
- **Actions:**
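1. Request a role that grants data access: a data-plane role such as Cosmos DB Built-in Data Reader or Cosmos DB Built-in Data Contributor (for Microsoft Entra ID auth), or Contributor if the application authenticates with account keys. Cosmos DB Operator manages the account but cannot access data.
2. If using connection strings: ensure the account keys haven't been regenerated since the connection string was copied
3. Check role assignments via Azure CLI: `az role assignment list --scope <account-id>` lists control-plane roles; `az cosmosdb sql role assignment list --account-name <account> --resource-group <rg>` lists data-plane roles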
---
#### `private_endpoint_mismatch`
- **Status:** warning
- **When:** Resolved IP differs from expected private endpoint IP
- **Interpretation:** DNS may be returning a stale or public record instead of the private endpoint IP, or the PE configuration has changed
- **Actions:**
1. Verify private endpoint IP hasn't changed in Azure Portal
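2. Ask the network team to check for split DNS: corporate DNS and VPN DNS servers may return different answers for the account hostname
3. Flush the DNS cache: `ipconfig /flushdns` (Windows), `sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder` (macOS), or `resolvectl flush-caches` (Linux with systemd-resolved)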
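The mismatch condition compares what DNS returned with the IP the customer expected (`-PrivateEndpointIP`). Below is a hypothetical sketch of that comparison. Treating "expected IP not among the resolved addresses" as a mismatch is an assumption about how the script decides. The addresses are parsed before comparing so that IPv6 formatting differences do not produce false mismatches.

```python
import ipaddress


def check_pe_match(resolved, expected_pe_ip):
    """Return a warning code if DNS did not return the expected private endpoint IP."""
    if not expected_pe_ip:
        # No expected IP supplied, so there is nothing to compare against.
        return {"matchesExpectedPrivateEndpoint": False, "code": None}
    expected = ipaddress.ip_address(expected_pe_ip)
    matches = any(ipaddress.ip_address(a) == expected for a in resolved)
    return {"matchesExpectedPrivateEndpoint": matches,
            "code": None if matches else "private_endpoint_mismatch"}
```

A public address in place of the expected `10.x` address usually means the client queried a DNS server that does not serve the private DNS zone.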
---
#### `azure_config_check_skipped`
- **Status:** warning
- **When:** Azure CLI not authenticated or not installed
- **Interpretation:** Cannot validate ARM-level network config (firewall rules, PE connections)
- **Actions:**
1. Install Azure CLI: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli
2. Authenticate: `az login`
3. Re-run script to collect ARM-level diagnostics
---
#### `unknown_error`
- **Status:** failure or warning
- **When:** Unhandled condition or unexpected error
- **Interpretation:** Script encountered something not in the matrix
- **Actions:**
1. Check script output for error details
2. Provide full JSON report to support
---
## Support Playbook
### Tier 1: Triage (ICM Responder)
**When customer reports: "Cosmos DB operations return HTTP 0.0 / connection errors"**
1. **Ask customer to run script:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 -Interactive
```
2. **Receive JSON output. Check classification.code:**
| Code | Response |
|------|----------|
| `network_connectivity_healthy` | → Escalate to data-plane/auth team. This is not a network issue. |
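| `dns_resolution_failed` | → Run the DNS Resolution Failed playbook below |
| `tcp_connectivity_blocked` (public endpoint) | → Run the TCP 443 Failed / Public Endpoint playbook |
| `private_endpoint_network_path_blocked` | → Run the Private Endpoint Network Path Blocked playbook |
| `rbac_insufficient` | → Run the RBAC Insufficient playbook |
| `private_endpoint_mismatch` | → Follow the `private_endpoint_mismatch` actions in the Classification Code Reference |
| `azure_config_check_skipped` | → Ask customer to run `az login` and re-run |
| `unknown_error` | → Save the full JSON report and escalate |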
3. **Document:**
- Save JSON report in ICM
- Note classification code and recommended actions
- Link to this support guide in response
---
### Playbook: DNS Resolution Failed
**Symptoms:** `dns_resolution_failed` code
**Steps:**
1. **Verify endpoint name with customer:**
- Check it matches Azure Portal > Cosmos Account > URI
- Typos are common
2. **Customer self-service:**
- Ask: "Can you manually run nslookup?"
```powershell
nslookup my-cosmos-account.documents.azure.com
```
- If nslookup fails → Likely VPN/proxy DNS redirect
- If nslookup succeeds but the script fails → Compare the DNS servers in the script output (`diagnostics.dns.dnsServers`) with the server nslookup reports
3. **If behind corporate proxy:**
- Ask: "Is your traffic routed through a corporate proxy?"
- If YES: Proxy may be intercepting DNS or blocking .documents.azure.com
- Action: Customer should contact corporate network team
4. **If using VPN:**
- Ask: "Does DNS work when you disconnect from VPN?"
- If YES → VPN DNS redirect issue
- Action: Customer should contact VPN admin
5. **Escalation:**
- If all above fail, ask customer to contact their ISP or network provider
- This is not a Cosmos issue; it's upstream DNS
---
### Playbook: TCP 443 Failed / Public Endpoint
**Symptoms:** `tcp_connectivity_blocked` code with public IP
**Steps:**
1. **Customer runs detailed trace:**
```powershell
Test-NetConnection -ComputerName <hostname> -Port 443
Test-NetConnection -ComputerName <hostname> -TraceRoute
```
2. **Analyze output:**
- Does it reach gateway/ISP?
- Where does it drop?
3. **If corporate network:**
- Check with network team if 443 outbound is allowed to Azure
- May need to allow outbound 443 to `*.documents.azure.com`
4. **If ISP/home network:**
- Try from mobile hotspot to rule out ISP blocking
- If hotspot works → ISP is blocking Azure
5. **If Windows Defender Firewall:**
- Check Windows Defender Firewall for outbound rules
- Ensure 443 is not blocked
6. **If behind proxy:**
- Proxy may be doing HTTPS inspection
- Ask IT if they use SSL Bump/HTTPS Inspection
- Either exempt `*.documents.azure.com` from inspection, or configure the client to trust the proxy's CA certificate
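One way to confirm HTTPS inspection is to look at who issued the certificate the client actually receives. The sketch below uses Python's `ssl` module. `issuer_organization` and `fetch_issuer` are hypothetical helpers. Rather than hard-coding an expected issuer (an assumption this sketch avoids), compare the result from the affected network with one taken from a network known to work, such as a mobile hotspot. A different organization, typically the proxy vendor or the company's own CA, points to inspection. If the proxy's CA is not trusted on the machine, the handshake itself fails with `ssl.SSLCertVerificationError`, which is also a signal.

```python
import socket
import ssl


def issuer_organization(cert: dict):
    """Extract organizationName from a getpeercert() dict, or None."""
    # getpeercert() returns the issuer as a tuple of RDNs, each a tuple of (key, value) pairs.
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "organizationName":
                return value
    return None


def fetch_issuer(host: str, port: int = 443, timeout_s: float = 5.0):
    """Complete a verified TLS handshake and return the issuer organization."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return issuer_organization(tls.getpeercert())
```

Usage: `fetch_issuer("my-cosmos-account.documents.azure.com")` on each network, then compare the two values.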
---
### Playbook: Private Endpoint Network Path Blocked
**Symptoms:** `private_endpoint_network_path_blocked` code
**Steps:**
1. **Gather critical info from customer:**
- Source host and IP (from script output: `execution.hostname` and `diagnostics.tcp.sourceIp`; if `sourceIp` is null, get the client IP from `ipconfig` or `ip addr`)
- Resolved PE IP (from script: `diagnostics.dns.addresses[0]`)
- Is VPN connected?
- Which VPN client?
2. **Customer provides to network team:**
- "TCP from [source-IP] to [PE-IP]:443 is timing out"
- "Please verify routing from VPN subnet to PE subnet"
- "Please check NSGs for port 443 inbound on PE subnet"
3. **Network team should check:**
- Route table: Does VPN subnet have route to PE subnet?
- NSG: PE subnet NSG allows inbound 443?
- NVA/Firewall: Any stateful filtering blocking traffic?
- UDR: Any User Defined Routes sending traffic wrong way?
- VNet peering: If PE in different VNet, is peering configured?
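- Private DNS: Does the `privatelink.documents.azure.com` zone have an A record for the PE IP, and is it linked to the VNet that serves the client's DNS queries?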
4. **Cosmos team role:**
- Verify account has private endpoint connection in Approved state
- Check if PE IP matches what Azure reports
- Provide PE connection details from Azure Portal
5. **Escalation criteria:**
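- If routing and NSGs look correct but TCP still fails → Check whether private endpoint network policies are enabled on the PE subnet (NSG rules apply to PE traffic only when they are)
- If all checks pass → Escalate to Azure Networking support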
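The details in steps 1 and 2 can be pulled straight from the JSON report. Below is a hypothetical helper, `network_team_message`, that builds the text for the network team. The field paths come from the JSON schema. A null `sourceIp`, which is common when the TCP attempt fails, is reported as unknown rather than guessed.

```python
def network_team_message(report: dict) -> str:
    """Build the escalation text for the network team from a report dict."""
    diagnostics = report.get("diagnostics") or {}
    addresses = (diagnostics.get("dns") or {}).get("addresses") or []
    pe_ip = addresses[0] if addresses else "<unknown PE IP>"
    # sourceIp is often null after a failed connection attempt; don't invent one.
    source_ip = ((diagnostics.get("tcp") or {}).get("sourceIp")
                 or "<unknown source IP: check ipconfig / ip addr>")
    host = (report.get("execution") or {}).get("hostname") or "<unknown host>"
    return "\n".join([
        f"TCP from {source_ip} (host {host}) to {pe_ip}:443 is failing.",
        "Please verify routing from the VPN subnet to the PE subnet.",
        "Please check NSGs for port 443 inbound on the PE subnet.",
    ])
```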
---
### Playbook: RBAC Insufficient
**Symptoms:** `rbac_insufficient` code
**Steps:**
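1. **Check role assignments (control plane and data plane):**
```powershell
az role assignment list --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.DocumentDB/databaseAccounts/<account>
az cosmosdb sql role assignment list --account-name <account> --resource-group <rg>
```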
2. **Assign appropriate role:**
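- Cosmos DB Built-in Data Contributor (read/write data; data-plane role, assigned with `az cosmosdb sql role assignment create`)
- Cosmos DB Built-in Data Reader (read-only data; data-plane role)
- Cosmos DB Operator (manage the account; no access to data or keys)
- Cosmos DB Account Reader Role (read-only access to account properties)
- Contributor or Owner (full management, including access to account keys)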
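3. **If using account keys:**
- Primary/secondary keys remain valid until they are regenerated
- Ask: Have the account keys been regenerated recently?
- If yes, connection strings built from the old key won't work
- If key-based (local) authentication is disabled on the account (`disableLocalAuth`), keys are rejected; use Microsoft Entra ID with a data-plane role instead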
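The output of `az role assignment list --scope <account-resource-id> --output json` is a JSON array whose entries include `roleDefinitionName` and `principalName`. The sketch below summarizes it. The set of roles treated as "can manage the account" (`MANAGE_ROLES`) is an assumption chosen for illustration. Data-plane assignments from `az cosmosdb sql role assignment list` are not covered, because they reference role definitions by ID rather than by name.

```python
import json

# Assumption: control-plane roles treated as granting account management rights.
MANAGE_ROLES = {"Owner", "Contributor", "DocumentDB Account Contributor"}


def summarize_assignments(az_json: str) -> dict:
    """Summarize `az role assignment list --output json` output."""
    assignments = json.loads(az_json or "[]")
    roles = sorted({a.get("roleDefinitionName", "") for a in assignments} - {""})
    principals = sorted({a.get("principalName", "") for a in assignments} - {""})
    return {
        "roles": roles,
        "canManageAccount": any(r in MANAGE_ROLES for r in roles),
        "principals": principals,
    }
```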
---
## JSON Parsing for Automation
### Python Example (Support Bot)
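```python
import ipaddress
import json

RFC1918 = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_private(address):
    """True if address is an RFC 1918 IPv4 address."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and any(ip in net for net in RFC1918)


def parse_cosmos_diagnostic(json_data):
    """Map a diagnostic report (JSON string) to the next support step."""
    report = json.loads(json_data)
    classification = report.get("classification") or {}
    diagnostics = report.get("diagnostics") or {}
    code = classification.get("code")
    # Route on the machine-readable code, not the user-facing status
    if code == "network_connectivity_healthy":
        return "Escalate: Auth/RBAC team"
    elif code == "dns_resolution_failed":
        return "Run DNS playbook"
    elif code == "tcp_connectivity_blocked":
        addresses = (diagnostics.get("dns") or {}).get("addresses") or []
        # Any RFC 1918 address (10.x, 172.16-31.x, 192.168.x) means a private endpoint
        if any(is_private(a) for a in addresses):
            return "Run Private Endpoint playbook"
        return "Run TCP Failure / Public Endpoint playbook"
    elif code == "private_endpoint_network_path_blocked":
        return "Run Private Endpoint playbook"
    elif code == "rbac_insufficient":
        assignments = (diagnostics.get("rbac") or {}).get("roleAssignments") or []
        return "Check RBAC: " + str(assignments)
    else:
        # f-string avoids a TypeError when code is missing (None)
        return f"Unknown code: {code}"
```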
### Support Ticket Template
```
COSMOS DB CONNECTIVITY ISSUE - DIAGNOSTIC RECEIVED
Classification: [classification.code]
Status: [classification.status]
Summary: [classification.summary]
Network Diagnostics:
DNS Resolution: [diagnostics.dns.succeeded]
TCP 443 Connectivity: [diagnostics.tcp.succeeded]
HTTPS Reachability: [diagnostics.https.statusCode]
Private Endpoint: [diagnostics.privateNetwork.isPrivateRange]
Azure Configuration:
Public Network Restricted: [diagnostics.azureNetworkConfig.publicNetworkAccessRestricted]
Private Endpoints: [diagnostics.azureNetworkConfig.privateEndpoints.length] configured
RBAC Status:
Classification: [diagnostics.rbac.classification]
Can Read Account: [diagnostics.rbac.canReadAccount]
Can Manage Account: [diagnostics.rbac.canManageAccount]
Recommended Actions:
[classification.recommendedActions joined with newlines]
Next Step:
[routing based on classification.code]
```
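Filling the template by hand is error-prone. The sketch below resolves the bracketed dotted paths against a parsed report. `render_ticket`, `lookup`, and the `[dotted.path]` substitution rule are hypothetical. The `.length` suffix and the "joined with newlines" placeholder are handled as special cases. Missing fields render as `n/a`, and free-text placeholders such as `[routing based on classification.code]` are left for a human to fill in.

```python
import re

# Matches placeholders such as [classification.code]; bracketed text with spaces is ignored.
PLACEHOLDER = re.compile(r"\[([A-Za-z][A-Za-z0-9_.]*)\]")
ACTIONS_PLACEHOLDER = "[classification.recommendedActions joined with newlines]"


def lookup(report, path):
    """Follow a dotted path through nested dicts; '.length' returns a list's size."""
    node = report
    for part in path.split("."):
        if part == "length" and isinstance(node, list):
            return len(node)
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def render_ticket(template: str, report: dict) -> str:
    """Fill [dotted.path] placeholders from the report; missing values become 'n/a'."""
    def substitute(match):
        value = lookup(report, match.group(1))
        return "n/a" if value is None else str(value)

    text = PLACEHOLDER.sub(substitute, template)
    actions = lookup(report, "classification.recommendedActions") or []
    return text.replace(ACTIONS_PLACEHOLDER, "\n".join(actions))
```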
---
## References
- [Azure Cosmos DB Troubleshoot Connectivity Issues](https://learn.microsoft.com/en-us/azure/cosmos-db/troubleshoot-connection)
- [Private Endpoints for Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-configure-private-endpoints)
- [Network Security Groups](https://learn.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview)
- [User Defined Routes](https://learn.microsoft.com/en-us/azure/virtual-network/virtual-networks-udr-overview)
+460
@@ -0,0 +1,460 @@
# Cosmos DB Connectivity Diagnostic - JSON Schema v1.0
## Overview
The diagnostic script outputs a structured JSON report containing network connectivity, private network configuration, and RBAC assessment data. This schema is stable and versioned to support parsing and triage automation.
## Root Object
```json
{
"version": "1.0.0", // Schema version (semantic versioning)
"timestamp": "2026-05-13T14:30:45.123Z", // ISO 8601 UTC timestamp
"target": {...}, // Account and subscription context
"execution": {...}, // Script execution environment
"diagnostics": {...}, // All diagnostic results
"classification": {...} // Automated classification and recommendations
}
```
---
## Target Object
Account and subscription identifiers.
```json
{
"target": {
"endpointUrl": "https://my-cosmos-account.documents.azure.com",
"hostname": "my-cosmos-account.documents.azure.com",
"subscriptionId": "12345678-1234-1234-1234-123456789012", // "REDACTED-SUBSCRIPTION-ID" if -Redact is used
"resourceGroup": "my-rg", // May be "REDACTED"
"accountName": "my-cosmos-account" // May be "REDACTED"
}
}
```
---
## Execution Object
Environment where script ran.
```json
{
"execution": {
"hostname": "DESKTOP-ABC123", // Machine name
"platform": "Windows 10", // OS name and version
"powershellVersion": "7.3.0" // PowerShell version
}
}
```
---
## Diagnostics Object
All diagnostic results grouped by category.
```json
{
"diagnostics": {
"dns": { ... }, // DNS resolution results
"tcp": { ... }, // TCP 443 connectivity results
"https": { ... }, // HTTPS probe results
"privateNetwork": { ... }, // Private endpoint indicators
"azureNetworkConfig": { ... }, // ARM-sourced network configuration
"rbac": { ... }, // RBAC assessment
"azureCli": { ... } // Azure CLI context
}
}
```
### DNS Results
```json
{
"dns": {
"hostname": "my-cosmos-account.documents.azure.com",
"succeeded": true, // true = hostname resolved
"addresses": [
"52.180.123.45", // Resolved IPv4 addresses
"2607:f8b0:4005:806::200e" // IPv6 if available
],
"error": null, // Error message if resolution failed
"dnsServers": [
"8.8.8.8", // Detected DNS servers
"8.8.4.4"
],
"latencyMs": 145 // DNS query latency in milliseconds
}
}
```
**Classification logic:**
- `succeeded: false` → DNS failure, likely network or DNS configuration issue
- `succeeded: true` with `addresses` containing private IP (10.x, 172.16-31.x, 192.168.x) → Private endpoint
- `succeeded: true` with `addresses` containing public IP → Public endpoint
### TCP Connectivity Results
```json
{
"tcp": {
"hostname": "my-cosmos-account.documents.azure.com",
"port": 443,
"succeeded": true, // true = TCP 443 connection established
"error": null, // Error message if connection failed (e.g., "Connection timeout after 5000ms")
"latencyMs": 87, // Connection latency
"sourceIp": "192.168.1.100" // Local IP used for connection attempt
}
}
```
**Classification logic:**
- `succeeded: false` with DNS resolved → Network path blocked
- `error` contains "timeout" → VPN/firewall/NVA may be dropping packets
- `error` contains "refused" → Target may be rejecting connections
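The substring rules above can be applied mechanically. Below is a minimal sketch with a hypothetical function name. Matching is case-insensitive. It also accepts "timed out", an assumption that goes beyond the rules listed, because some socket libraries (Python's, for instance) phrase timeouts that way.

```python
def interpret_tcp_error(tcp: dict) -> str:
    """Map a diagnostics.tcp object to a likely cause, following the rules above."""
    if tcp.get("succeeded"):
        return "connected"
    error = (tcp.get("error") or "").lower()
    if "timeout" in error or "timed out" in error:
        # No response at all: VPN/firewall/NVA is likely dropping packets.
        return "packets dropped (VPN/firewall/NVA)"
    if "refused" in error:
        # An explicit rejection: the target or a device on the path reset the connection.
        return "connection rejected by target"
    return "unclassified failure"
```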
### HTTPS Probe Results
```json
{
"https": {
"url": "https://my-cosmos-account.documents.azure.com",
"succeeded": true, // true = an HTTP response was received (2xx or 4xx status)
"statusCode": 401, // HTTP status code (401 expected without auth)
"error": null, // TLS/connection errors
"latencyMs": 234 // Full request round-trip latency
}
}
```
**Classification logic:**
- `succeeded: true` (any 2xx/4xx status) → Can reach endpoint
- `statusCode: 401` → Expected (no credentials), network is healthy
- `error` contains "certificate" or "TLS" → Certificate validation issue
- `error` set and `succeeded: false`, with no certificate/TLS keyword → Network or firewall blocking TLS
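These rules are order-sensitive. A TLS failure also has `succeeded: false`, so the certificate/TLS check must run before the generic error rule. Below is a minimal sketch of one possible ordering. The function name and the returned labels are hypothetical.

```python
def interpret_https(https: dict) -> str:
    """Apply the HTTPS classification rules to a diagnostics.https object."""
    error = (https.get("error") or "").lower()
    # Certificate/TLS problems first: they would otherwise match the generic rule below.
    if "certificate" in error or "tls" in error:
        return "certificate/TLS validation issue"
    if error and not https.get("succeeded"):
        return "network or firewall blocking TLS"
    if https.get("statusCode") == 401:
        return "reachable; 401 expected without credentials"
    if https.get("succeeded"):
        return "reachable"
    return "unknown"
```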
### Private Network Indicators
```json
{
"privateNetwork": {
"isPrivateRange": true, // true if any resolved IP is RFC 1918
"indicators": [
"Resolved to RFC 1918 private IP range (10.123.171.30)",
"Matches expected private endpoint IP (10.123.171.30)"
],
"matchesExpectedPrivateEndpoint": true, // true if resolved IP matches PrivateEndpointIP parameter
"vpnRouteWarning": null // Warning if VPN subnet routing appears blocked
}
}
```
### Azure Network Configuration
```json
{
"azureNetworkConfig": {
"checked": true, // true if successfully queried via Azure CLI
"publicNetworkAccessRestricted": true, // true if public network access is disabled
"privateEndpoints": [
{
"id": "/subscriptions/.../privateEndpointConnections/my-pe-connection",
"state": "Approved" // Status: Approved, Pending, Rejected, Disconnected
}
],
"vnetRules": [ ], // Virtual network rules (firewall)
"error": null // Error if Azure CLI query failed
}
}
```
### RBAC Assessment
```json
{
"rbac": {
"checked": true, // true if RBAC checked successfully
"canReadAccount": true, // true if caller can read account properties
"canManageAccount": false, // true if caller has Contributor/Owner
"canExecuteDataPlaneOps": true, // true if caller likely has data-plane roles
"roleAssignments": [
{
"roleDefinitionName": "Cosmos DB Operator",
"principalName": "user@example.com"
}
],
"classification": "partial", // Enum: "sufficient", "partial", "insufficient", "unknown"
"error": null // Error message if check failed
}
}
```
### Azure CLI Context
```json
{
"azureCli": {
"installed": true, // true if Azure CLI is installed
"authenticated": true, // true if 'az login' was successful
"currentUser": "user@example.com", // May be "REDACTED-USER-NAME"
"currentTenant": "12345678-1234-1234-1234-123456789012", // May be "REDACTED-TENANT-ID"
"currentSubscription": "abcdef01-2345-6789-abcd-ef0123456789",
"error": null // Error if CLI not installed or not authenticated
}
}
```
---
## Classification Object
Automated classification with recommendations.
```json
{
"classification": {
"status": "failure", // Enum: "success", "failure", "warning", "unknown"
"code": "private_endpoint_network_path_blocked", // Machine-readable classification code
"summary": "DNS resolution succeeded but TCP 443 connection failed to private endpoint. Network path is blocked.",
"rootCause": "Private endpoint configured but network path blocked (VPN routing, firewall/NVA, NSG, UDR, or peering issue)",
"recommendedActions": [
"1. Verify VPN connectivity and that your client subnet can route to the private endpoint subnet",
"2. Ask your network team to verify routing between DESKTOP-ABC123 and private endpoint 10.123.171.30",
"3. Check Azure network security groups (NSGs) rules for port 443 inbound",
"4. Verify Azure Virtual Network peering and User Defined Routes (UDRs)",
"5. Check if corporate firewall/NVA is blocking the connection",
"6. Manually run: Test-NetConnection -ComputerName my-cosmos-account.documents.azure.com -Port 443"
]
}
}
```
### Classification Codes Reference
| Code | Status | Meaning | Likely Cause |
|------|--------|---------|--------------|
| `dns_resolution_failed` | failure | Hostname cannot resolve | DNS misconfiguration, proxy redirect, network unreachable |
| `tcp_connectivity_blocked` | failure | DNS works, TCP 443 fails | Firewall, VPN routing, NVA, NSG, private path blocked |
| `private_endpoint_network_path_blocked` | failure | Private endpoint detected, TCP fails | VPN → private endpoint routing broken |
| `network_connectivity_healthy` | success | DNS and TCP both work | Network is healthy; check auth/RBAC if operations fail |
| `rbac_insufficient` | warning | Network OK, but RBAC limited | User lacks data-plane roles |
| `private_endpoint_mismatch` | warning | Resolved to different IP than expected | Private endpoint routing may be asymmetric or misconfigured |
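| `azure_config_check_skipped` | warning | Azure CLI not installed or not authenticated | Can't validate ARM-level network configuration |
| `unknown_error` | failure or warning | Unhandled or unexpected condition | Condition not covered by this matrix; share the full JSON report |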
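Automation routes on `code` while people read `status`, so it can be worth checking that the two agree before acting on a report. Below is a hypothetical validation sketch. `EXPECTED_STATUS` is transcribed from the table above, and `unknown_error` is special-cased because the support guide allows either failure or warning for it.

```python
# Expected status per classification code, transcribed from the table above.
EXPECTED_STATUS = {
    "dns_resolution_failed": "failure",
    "tcp_connectivity_blocked": "failure",
    "private_endpoint_network_path_blocked": "failure",
    "network_connectivity_healthy": "success",
    "rbac_insufficient": "warning",
    "private_endpoint_mismatch": "warning",
    "azure_config_check_skipped": "warning",
}


def check_classification(classification: dict) -> list:
    """Return a list of problems; an empty list means code and status agree."""
    code = classification.get("code")
    status = classification.get("status")
    if code == "unknown_error":
        # unknown_error may legitimately be reported as either failure or warning.
        if status in ("failure", "warning"):
            return []
        return [f"unexpected status {status!r} for unknown_error"]
    if code not in EXPECTED_STATUS:
        return [f"unrecognized code {code!r}"]
    if status != EXPECTED_STATUS[code]:
        return [f"status {status!r} does not match expected {EXPECTED_STATUS[code]!r} for {code}"]
    return []
```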
---
## Redacted Output
When script is invoked with `-Redact` flag:
```json
{
"target": {
"endpointUrl": "REDACTED",
"hostname": "my-cosmos-account.documents.azure.com", // Hostname kept (needed for triage)
"subscriptionId": "REDACTED-SUBSCRIPTION-ID",
"resourceGroup": "REDACTED",
"accountName": "REDACTED"
},
"diagnostics": {
"azureCli": {
"currentUser": "REDACTED-USER-NAME",
"currentTenant": "REDACTED-TENANT-ID"
},
"rbac": {
"roleAssignments": [
{
"roleDefinitionName": "Cosmos DB Operator",
"principalName": "REDACTED-PRINCIPAL-NAME"
}
]
}
}
}
```
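If a report was captured without `-Redact`, a similar redaction can be applied before sharing it. Below is a hypothetical post-processing sketch that mirrors the placeholders shown above. It is not part of the script. It only touches the listed fields, so other identifiers (such as `execution.hostname` or resource IDs inside `privateEndpoints[].id`) are left as-is.

```python
import copy

# (dotted path, placeholder) pairs mirroring the redacted example above.
REDACTIONS = [
    ("target.endpointUrl", "REDACTED"),
    ("target.subscriptionId", "REDACTED-SUBSCRIPTION-ID"),
    ("target.resourceGroup", "REDACTED"),
    ("target.accountName", "REDACTED"),
    ("diagnostics.azureCli.currentUser", "REDACTED-USER-NAME"),
    ("diagnostics.azureCli.currentTenant", "REDACTED-TENANT-ID"),
]


def redact_report(report: dict) -> dict:
    """Return a redacted deep copy; the input report is not modified."""
    out = copy.deepcopy(report)
    for path, placeholder in REDACTIONS:
        *parents, leaf = path.split(".")
        node = out
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
        # Only overwrite fields that exist; absent fields stay absent.
        if isinstance(node, dict) and leaf in node:
            node[leaf] = placeholder
    rbac = (out.get("diagnostics") or {}).get("rbac") or {}
    for assignment in rbac.get("roleAssignments") or []:
        if isinstance(assignment, dict) and "principalName" in assignment:
            assignment["principalName"] = "REDACTED-PRINCIPAL-NAME"
    return out
```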
---
## Sample Outputs
### Scenario 1: Network Healthy (Public Endpoint)
```json
{
"version": "1.0.0",
"timestamp": "2026-05-13T14:30:45Z",
"target": {
"endpointUrl": "https://my-cosmos.documents.azure.com",
"hostname": "my-cosmos.documents.azure.com",
"subscriptionId": "12345678-1234-1234-1234-123456789012",
"resourceGroup": "my-rg",
"accountName": "my-cosmos"
},
"diagnostics": {
"dns": {
"hostname": "my-cosmos.documents.azure.com",
"succeeded": true,
"addresses": ["52.180.123.45"],
"error": null,
"latencyMs": 12
},
"tcp": {
"hostname": "my-cosmos.documents.azure.com",
"port": 443,
"succeeded": true,
"error": null,
"latencyMs": 45,
"sourceIp": "192.168.1.100"
},
"https": {
"url": "https://my-cosmos.documents.azure.com",
"succeeded": true,
"statusCode": 401,
"error": null,
"latencyMs": 78
},
"privateNetwork": {
"isPrivateRange": false,
"indicators": [],
"matchesExpectedPrivateEndpoint": false,
"vpnRouteWarning": null
}
},
"classification": {
"status": "success",
"code": "network_connectivity_healthy",
"summary": "Network connectivity is healthy. DNS resolves and TCP 443 is reachable.",
"rootCause": null,
"recommendedActions": [
"✓ Local network connectivity is working",
"If Cosmos DB operations still fail, check:",
" - RBAC/authentication permissions",
" - Account firewall IP rules (if enabled)",
" - Data plane token expiry",
" - Application-level issues (connection strings, SDK versions)"
]
}
}
```
### Scenario 2: Private Endpoint Path Blocked
```json
{
"version": "1.0.0",
"timestamp": "2026-05-13T14:35:22Z",
"target": {
"endpointUrl": "https://my-cosmos-pe.documents.azure.com",
"hostname": "my-cosmos-pe.documents.azure.com",
"subscriptionId": "12345678-1234-1234-1234-123456789012",
"resourceGroup": "my-rg",
"accountName": "my-cosmos-pe"
},
"diagnostics": {
"dns": {
"hostname": "my-cosmos-pe.documents.azure.com",
"succeeded": true,
"addresses": ["10.123.171.30"],
"error": null,
"latencyMs": 8
},
"tcp": {
"hostname": "my-cosmos-pe.documents.azure.com",
"port": 443,
"succeeded": false,
"error": "Connection timeout after 5000ms",
"latencyMs": 0,
"sourceIp": null
},
"privateNetwork": {
"isPrivateRange": true,
"indicators": [
"Resolved to RFC 1918 private IP range (10.123.171.30)",
"Matches expected private endpoint IP (10.123.171.30)"
],
"matchesExpectedPrivateEndpoint": true,
"vpnRouteWarning": "Private endpoint IP detected but TCP 443 failed. Likely VPN → PE route blocked."
}
},
"classification": {
"status": "failure",
"code": "private_endpoint_network_path_blocked",
"summary": "DNS resolution succeeded but TCP 443 connection failed to private endpoint. Network path is blocked.",
"rootCause": "Private endpoint network path blocked (VPN routing, firewall/NVA, NSG, UDR, or peering issue)",
"recommendedActions": [
"1. Verify VPN connectivity and that your client subnet can route to the private endpoint subnet",
"2. Ask your network team to verify routing from 10.249.14.218 to private endpoint 10.123.171.30",
"3. Check Azure network security groups (NSGs) rules for port 443 inbound on private endpoint subnet",
"4. Verify Azure Virtual Network peering and User Defined Routes (UDRs)",
"5. Check if corporate firewall/NVA is blocking the connection",
"6. Manually run: Test-NetConnection -ComputerName my-cosmos-pe.documents.azure.com -Port 443"
]
}
}
```
### Scenario 3: DNS Resolution Failed
```json
{
"version": "1.0.0",
"timestamp": "2026-05-13T14:40:10Z",
"target": {
"endpointUrl": "https://my-cosmos-invalid.documents.azure.com",
"hostname": "my-cosmos-invalid.documents.azure.com"
},
"diagnostics": {
"dns": {
"hostname": "my-cosmos-invalid.documents.azure.com",
"succeeded": false,
"addresses": [],
"error": "No such host is known",
"dnsServers": ["8.8.8.8"],
"latencyMs": 2342
},
"tcp": {
"hostname": "my-cosmos-invalid.documents.azure.com",
"port": 443,
"succeeded": false,
"error": "No such host is known",
"latencyMs": 0,
"sourceIp": null
}
},
"classification": {
"status": "failure",
"code": "dns_resolution_failed",
"summary": "DNS resolution failed. The Cosmos DB endpoint hostname cannot be resolved.",
"rootCause": "DNS configuration, VPN/proxy DNS redirect, or network connectivity issue",
"recommendedActions": [
"1. Check if you are connected to corporate VPN or proxy that intercepts DNS",
"2. Manually run: nslookup my-cosmos-invalid.documents.azure.com",
"3. If nslookup fails, check with your network team or ISP",
"4. Try pinging the endpoint or using nslookup with alternate DNS: nslookup my-cosmos-invalid.documents.azure.com 8.8.8.8"
]
}
}
```
---
## Parsing Guidelines
Implementers parsing this JSON should:
1. **Always check version**: Fields may differ in future versions. Parse defensively.
2. **Use classification.code not status**: Status is user-facing; code is machine-readable for routing and automation.
3. **Check diagnostics.azureCli.authenticated**: If false, Azure configuration checks are unreliable.
4. **Prioritize classification.recommendedActions**: Contains context-specific guidance.
5. **Redacted fields**: May be null or placeholder strings beginning with "REDACTED" (for example "REDACTED-TENANT-ID"). Do not assume structure.
6. **Latency fields**: Milliseconds, may be 0 if unavailable.
7. **Handle missing fields**: Especially in older versions or on non-Windows platforms.
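Here is a minimal sketch of a parser that follows these guidelines. It rejects unknown major versions, routes on `classification.code`, flags ARM data as unreliable when the CLI was not authenticated, treats null or `REDACTED*` values as unknown, and maps a latency of 0 to "unavailable". `summarize_report` is a hypothetical name. Accepting only major version 1 is an assumption based on the semantic versioning of the `version` field.

```python
import json


def _is_unknown(value) -> bool:
    """Null or redacted values are treated as unknown."""
    return value is None or (isinstance(value, str) and value.startswith("REDACTED"))


def summarize_report(text: str) -> dict:
    """Defensively extract routing-relevant fields from a diagnostic report."""
    report = json.loads(text)
    version = str(report.get("version", "0"))
    if version.split(".")[0] != "1":
        raise ValueError(f"unsupported schema version {version}")
    classification = report.get("classification") or {}
    diagnostics = report.get("diagnostics") or {}
    cli = diagnostics.get("azureCli") or {}
    account = (report.get("target") or {}).get("accountName")
    return {
        "code": classification.get("code") or "unknown_error",
        "armDataReliable": cli.get("authenticated") is True,
        "accountName": None if _is_unknown(account) else account,
        "actions": list(classification.get("recommendedActions") or []),
        # 0 means the latency was unavailable, so report it as None.
        "tcpLatencyMs": (diagnostics.get("tcp") or {}).get("latencyMs") or None,
    }
```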
---
## Version History
### v1.0.0 (2026-05-13)
- Initial schema
- Includes DNS, TCP, HTTPS, private network, Azure config, and RBAC checks
- Classification codes stable
- Redaction support
+699
@@ -0,0 +1,699 @@
#!/usr/bin/env pwsh
<#
.SYNOPSIS
Cosmos DB Connectivity Diagnostic Script
Captures local network connectivity, private network posture, and RBAC evidence.
.DESCRIPTION
This script performs comprehensive network and access diagnostics for Cosmos DB accounts.
It can run in interactive or non-interactive mode and produces a JSON report for triage.
.PARAMETER EndpointUrl
The Cosmos DB account endpoint URL.
Format: https://<account-name>.documents.azure.com or https://<account-name>.documents.azure.com:443/
WHERE TO GET: Azure Portal > Cosmos DB Account > Overview tab > URI field
OR: Use the endpoint shown in Cosmos Explorer connection string
.PARAMETER SubscriptionId
Azure subscription ID containing the Cosmos account.
WHERE TO GET: Azure Portal > Subscriptions > Copy Subscription ID
FORMAT: 12345678-1234-1234-1234-123456789012
.PARAMETER ResourceGroup
Azure resource group name containing the Cosmos account.
WHERE TO GET: Azure Portal > Cosmos DB Account > Resource group field (top-right)
.PARAMETER AccountName
Cosmos DB account name.
WHERE TO GET: Azure Portal > Cosmos DB Account > Account Name field
Or extract from endpoint URL (part before .documents.azure.com)
.PARAMETER PrivateEndpointIP
(Optional) Expected private endpoint IP if account uses private link.
WHERE TO GET: Azure Portal > Cosmos DB Account > Private Endpoint Connections tab > Private IP address column
.PARAMETER VpnSubnetRange
(Optional) Customer's VPN/client subnet CIDR for route analysis.
FORMAT: 10.0.0.0/24
WHERE TO GET: Ask your network team or check VPN client properties
.PARAMETER Interactive
If specified, script prompts for missing parameters instead of requiring them as arguments.
.PARAMETER Redact
If specified, output JSON redacts sensitive identifiers (tenant ID, subscription ID, usernames).
.EXAMPLE
# Interactive mode - script will prompt for inputs
.\Diagnose-CosmosConnectivity.ps1 -Interactive
.EXAMPLE
# Non-interactive with full parameters
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://my-cosmos-account.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "my-rg" `
-AccountName "my-cosmos-account"
.EXAMPLE
# With private endpoint and output redaction
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://my-cosmos-account.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "my-rg" `
-AccountName "my-cosmos-account" `
-PrivateEndpointIP "10.123.171.30" `
-Redact
#>
param(
[Parameter(ValueFromPipelineByPropertyName=$true)]
[ValidateScript({$_ -match '^https://[a-z0-9-]+\.documents\.azure\.com(:443)?/?$'})]
[string]$EndpointUrl,
[Parameter(ValueFromPipelineByPropertyName=$true)]
[guid]$SubscriptionId,
[Parameter(ValueFromPipelineByPropertyName=$true)]
[string]$ResourceGroup,
[Parameter(ValueFromPipelineByPropertyName=$true)]
[string]$AccountName,
[Parameter(ValueFromPipelineByPropertyName=$true)]
[string]$PrivateEndpointIP,
[Parameter(ValueFromPipelineByPropertyName=$true)]
[string]$VpnSubnetRange,
[switch]$Interactive,
[switch]$Redact
)
# ============================================================================
# Configuration
# ============================================================================
$ScriptVersion = "1.0.0"
$DiagnosticTimestamp = Get-Date -Format "o"
$TcpConnectTimeoutMs = 5000
$DnsTimeoutMs = 5000
# ============================================================================
# Helper Functions
# ============================================================================
function Show-InputInstructions {
Write-Host @"
COSMOS DB CONNECTIVITY DIAGNOSTIC SCRIPT v$ScriptVersion
This script will collect network and access diagnostics for your Cosmos DB account.
WHERE TO FIND YOUR INPUTS:
1. ENDPOINT URL (Required)
Location: Azure Portal > Cosmos DB Account > Overview tab
Look for: "URI" field
Example: https://my-cosmos-account.documents.azure.com
Include https:// but do NOT include trailing slash or port suffix
2. SUBSCRIPTION ID (Required)
Location: Azure Portal > Subscriptions
Look for: "Subscription ID" column or click your subscription > Copy ID
Format: 12345678-1234-1234-1234-123456789012
3. RESOURCE GROUP (Required)
Location: Azure Portal > Cosmos DB Account > Top-right corner
Look for: "Resource group" field
Example: my-production-rg
4. ACCOUNT NAME (Required)
Location: Either extract from endpoint URL or find in portal
From URL: Take the part before ".documents.azure.com"
From Portal: Account name appears in the breadcrumb and overview
Example: my-cosmos-account
5. PRIVATE ENDPOINT IP (Optional, but recommended)
Location: Azure Portal > Cosmos DB Account > Private Endpoint Connections
Look for: "Private IP address" column (only if private endpoints exist)
Format: 10.123.171.30 (an RFC 1918 address: 10.x.x.x, 172.16-31.x.x, or 192.168.x.x)
Skip this if: You are using public endpoint only
6. VPN SUBNET RANGE (Optional)
Location: Ask your network team or VPN client settings
Used to: Analyze if routing from your network to private endpoint is blocked
Format: 10.0.0.0/24 (CIDR notation)
Skip this if: You are not using a VPN
"@
}
function Read-InputsInteractively {
Show-InputInstructions
Write-Host "Please provide the following information:" -ForegroundColor Cyan
Write-Host ""
# Endpoint URL
do {
$endpoint = Read-Host "Endpoint URL (e.g., https://my-cosmos.documents.azure.com)"
if ($endpoint -notmatch "^https://[a-z0-9-]+\.documents\.azure\.com") {
Write-Host "Invalid format. Expected: https://<account-name>.documents.azure.com" -ForegroundColor Yellow
}
} while ($endpoint -notmatch "^https://[a-z0-9-]+\.documents\.azure\.com")
# Subscription ID
do {
$subId = Read-Host "Subscription ID (12345678-1234-1234-1234-123456789012)"
if ($subId -notmatch "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$") {
Write-Host "Invalid format. Expected GUID format." -ForegroundColor Yellow
}
} while ($subId -notmatch "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")
$rg = Read-Host "Resource Group name"
$account = Read-Host "Account Name"
$peIP = Read-Host "Private Endpoint IP (optional, press Enter to skip)"
$vpnSubnet = Read-Host "VPN Subnet Range (optional, e.g., 10.0.0.0/24, press Enter to skip)"
return @{
EndpointUrl = $endpoint
SubscriptionId = [guid]$subId
ResourceGroup = $rg
AccountName = $account
PrivateEndpointIP = if ($peIP) { $peIP } else { $null }
VpnSubnetRange = if ($vpnSubnet) { $vpnSubnet } else { $null }
}
}
function Invoke-DnsResolution {
param([string]$Hostname)
$result = @{
hostname = $Hostname
succeeded = $false
addresses = @()
error = $null
dnsServers = @()
latencyMs = 0
}
try {
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
$addresses = [System.Net.Dns]::GetHostAddresses($Hostname)
$stopwatch.Stop()
$result.succeeded = $true
$result.addresses = @($addresses | ForEach-Object { $_.ToString() })
$result.latencyMs = [int]$stopwatch.ElapsedMilliseconds
# Try to get DNS servers (Get-DnsClientServerAddress exists only on Windows)
if (Get-Command Get-DnsClientServerAddress -ErrorAction SilentlyContinue) {
try {
# Pick the first IPv4 interface that actually has DNS servers configured
$dnsConfig = Get-DnsClientServerAddress -AddressFamily IPv4 -ErrorAction SilentlyContinue |
Where-Object { $_.ServerAddresses } | Select-Object -First 1
if ($dnsConfig) {
$result.dnsServers = @($dnsConfig.ServerAddresses)
}
} catch { }
}
} catch {
$result.error = $_.Exception.Message
}
return $result
}
function Invoke-TcpConnectivityTest {
param(
[string]$Hostname,
[int]$Port = 443,
[int]$TimeoutMs = 5000
)
$result = @{
hostname = $Hostname
port = $Port
succeeded = $false
error = $null
latencyMs = 0
sourceIp = $null
}
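$tcpClient = New-Object System.Net.Sockets.TcpClient
try {
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
$task = $tcpClient.ConnectAsync($Hostname, $Port)
# Wait() returns $false on timeout and throws if the attempt fails (e.g., connection refused)
$completed = $task.Wait($TimeoutMs)
$stopwatch.Stop()
if ($completed -and $tcpClient.Connected) {
$result.succeeded = $true
$result.latencyMs = [int]$stopwatch.ElapsedMilliseconds
# Try to get source IP
try {
$endpoint = $tcpClient.Client.LocalEndPoint
$result.sourceIp = $endpoint.Address.ToString()
} catch { }
} else {
$result.error = "Connection timeout after ${TimeoutMs}ms"
}
} catch {
# Unwrap the AggregateException raised by Task.Wait to surface the underlying socket error
$inner = $_.Exception
while ($inner.InnerException) { $inner = $inner.InnerException }
$result.error = $inner.Message
} finally {
# Always release the socket, including on failure and timeout
$tcpClient.Close()
}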
return $result
}
function Invoke-HttpsProbe {
param([string]$Url)
$result = @{
url = $Url
succeeded = $false
statusCode = $null
error = $null
latencyMs = 0
}
try {
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
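$response = Invoke-WebRequest -Uri $Url -Method Head -TimeoutSec 5 -UseBasicParsing -ErrorAction Stop
$stopwatch.Stop()
$result.succeeded = $true
$result.statusCode = [int]$response.StatusCode
$result.latencyMs = [int]$stopwatch.ElapsedMilliseconds
} catch {
# An HTTP error status from the server (such as 401 for this unauthenticated request) still proves
# TLS/HTTP reachability; only record a status code when the server actually answered
if ($_.Exception.Response) {
$result.statusCode = [int]$_.Exception.Response.StatusCode
}
$result.error = $_.Exception.Message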
}
return $result
}
function Get-PrivateNetworkIndicators {
param(
[string[]]$ResolvedAddresses,
[string]$PrivateEndpointIP,
[string]$VpnSubnetRange
)
$result = @{
isPrivateRange = $false
indicators = @()
matchesExpectedPrivateEndpoint = $false
vpnRouteWarning = $null
}
# Check if resolved IPs are private range
foreach ($addr in $ResolvedAddresses) {
if (IsPrivateIpAddress $addr) {
$result.isPrivateRange = $true
$result.indicators += "Resolved to RFC 1918 private IP range ($addr)"
}
}
# Check if matches expected private endpoint
if ($PrivateEndpointIP -and $ResolvedAddresses -contains $PrivateEndpointIP) {
$result.matchesExpectedPrivateEndpoint = $true
$result.indicators += "Matches expected private endpoint IP ($PrivateEndpointIP)"
} elseif ($PrivateEndpointIP -and $ResolvedAddresses.Count -gt 0) {
$result.indicators += "WARNING: Resolved to $($ResolvedAddresses[0]) but expected private endpoint IP is $PrivateEndpointIP"
}
return $result
}
function IsPrivateIpAddress {
param([string]$IpAddress)
try {
$ip = [System.Net.IPAddress]::Parse($IpAddress)
# RFC 1918 ranges
if ($ip.ToString() -match "^10\." -or $ip.ToString() -match "^172\.(1[6-9]|2[0-9]|3[01])\." -or $ip.ToString() -match "^192\.168\.") {
return $true
}
# Loopback
if ($ip.AddressFamily -eq "InterNetwork" -and $ip.GetAddressBytes()[0] -eq 127) {
return $true
}
} catch { }
return $false
}
function Get-AzureCliContext {
$result = @{
installed = $false
authenticated = $false
currentUser = $null
currentTenant = $null
currentSubscription = $null
error = $null
}
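# Check whether the az executable is on PATH before invoking it
if (-not (Get-Command az -ErrorAction SilentlyContinue)) {
$result.error = "Azure CLI not found. Skipping Azure context checks."
return $result
}
$result.installed = $true
# 'az account show' exits non-zero and writes an error (not JSON) to stderr when no one is logged in
$accountJson = & az account show --output json 2>$null
if ($LASTEXITCODE -ne 0 -or -not $accountJson) {
$result.error = "Not authenticated with Azure CLI. Run 'az login' to proceed with Azure checks."
return $result
}
try {
$account = ($accountJson -join "`n") | ConvertFrom-Json
$result.authenticated = $true
$result.currentUser = $account.user.name
$result.currentTenant = $account.tenantId
$result.currentSubscription = $account.id
} catch {
$result.error = "Could not parse 'az account show' output: $($_.Exception.Message)"
}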
return $result
}
function Get-AzureAccountNetworkConfig {
param(
[guid]$SubscriptionId,
[string]$ResourceGroup,
[string]$AccountName
)
$result = @{
checked = $false
publicNetworkAccessRestricted = $null
privateEndpoints = @()
vnetRules = @()
error = $null
}
try {
# Pass --subscription explicitly; otherwise az queries whichever subscription is currently selected
$accountJson = & az cosmosdb show --subscription "$SubscriptionId" --resource-group $ResourceGroup --name $AccountName --output json 2>$null
if ($LASTEXITCODE -ne 0 -or -not $accountJson) {
$result.error = "'az cosmosdb show' failed for account '$AccountName' in resource group '$ResourceGroup'"
} else {
$account = ($accountJson -join "`n") | ConvertFrom-Json
$result.checked = $true
# Azure CLI returns the account properties at the top level of the object, not under .properties
$result.publicNetworkAccessRestricted = $account.publicNetworkAccess -eq "Disabled"
if ($account.virtualNetworkRules) {
$result.vnetRules = @($account.virtualNetworkRules | ForEach-Object { $_.id })
}
# Private endpoint connections are part of the account resource returned by 'az cosmosdb show'
if ($account.privateEndpointConnections) {
$result.privateEndpoints = @($account.privateEndpointConnections | Select-Object -Property id, @{n='state';e={$_.privateLinkServiceConnectionState.status}})
}
}
} catch {
$result.error = $_.Exception.Message
}
return $result
}
function Get-RbacAssessment {
param(
[guid]$SubscriptionId,
[string]$ResourceGroup,
[string]$AccountName
)
$result = @{
checked = $false
canReadAccount = $false
canManageAccount = $false
canExecuteDataPlaneOps = $false
roleAssignments = @()
classification = "unknown"
error = $null
}
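try {
$scope = "/subscriptions/$SubscriptionId/resourceGroups/$ResourceGroup/providers/Microsoft.DocumentDB/databaseAccounts/$AccountName"
# Try to read account (implies Reader or higher on the account)
$accountJson = & az cosmosdb show --subscription "$SubscriptionId" --resource-group $ResourceGroup --name $AccountName --output json 2>$null
if ($LASTEXITCODE -eq 0 -and $accountJson) {
$result.checked = $true
$result.canReadAccount = $true
# Control-plane role assignments at the account scope, including ones inherited from the resource
# group and subscription. This lists assignments for ALL principals, not only the signed-in user.
$rolesJson = & az role assignment list --scope $scope --include-inherited --output json 2>$null
if ($LASTEXITCODE -eq 0 -and $rolesJson) {
$roles = ($rolesJson -join "`n") | ConvertFrom-Json
$result.roleAssignments = @($roles | Select-Object -Property roleDefinitionName, principalName)
# Classify permissions
$roleNames = @($roles | Select-Object -ExpandProperty roleDefinitionName)
if ($roleNames -contains "Contributor" -or $roleNames -contains "Owner") {
$result.canManageAccount = $true
# These roles can list account keys, so key-based data access is possible. Entra ID data-plane
# access needs a Cosmos DB data-plane role assignment, which this check does not query.
$result.canExecuteDataPlaneOps = $true
$result.classification = "sufficient"
} elseif ($roleNames -contains "Cosmos DB Account Reader Role") {
# Can read the read-only keys, which allows read-only data access via keys
$result.canExecuteDataPlaneOps = $true
$result.classification = "partial"
} elseif ($roleNames -contains "Cosmos DB Operator") {
# Can manage the account but cannot access keys or data
$result.canManageAccount = $true
$result.classification = "partial"
} else {
$result.classification = "partial"
}
}
} else {
$result.error = "Could not read the account with 'az cosmosdb show' (missing permission, or wrong subscription, resource group, or account name)"
}
} catch {
$result.error = $_.Exception.Message
$result.classification = "insufficient"
}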
return $result
}
function Invoke-Classification {
param(
[hashtable]$DnsResult,
[hashtable]$TcpResult,
[hashtable]$PrivateNetworkIndicators,
[hashtable]$AzureNetworkConfig
)
$classification = @{
status = "unknown"
code = "unknown"
summary = "Unable to classify"
rootCause = $null
recommendedActions = @()
}
# DNS failure
if (-not $DnsResult.succeeded) {
$classification.status = "failure"
$classification.code = "dns_resolution_failed"
$classification.summary = "DNS resolution failed. The Cosmos DB endpoint hostname cannot be resolved."
$classification.rootCause = "DNS configuration, VPN/proxy DNS redirect, or network connectivity issue"
$classification.recommendedActions = @(
"1. Check if you are connected to corporate VPN or proxy that intercepts DNS",
"2. Manually run: nslookup $($DnsResult.hostname)",
"3. If nslookup fails, check with your network team or ISP",
"4. Compare against a public resolver: nslookup $($DnsResult.hostname) 8.8.8.8"
)
return $classification
}
# DNS succeeded but TCP failed
if ($DnsResult.succeeded -and -not $TcpResult.succeeded) {
$classification.status = "failure"
if ($PrivateNetworkIndicators.isPrivateRange) {
# Matches the decision tree: a private IP plus a TCP failure is a private endpoint path problem
$classification.code = "private_endpoint_network_path_blocked"
$classification.summary = "DNS resolved to a private IP address but the TCP 443 connection failed. The network path to the private endpoint is blocked."
$classification.rootCause = "Private endpoint configured but network path blocked (VPN routing, firewall/NVA, NSG, UDR, or peering issue)"
$classification.recommendedActions = @(
"1. Verify VPN connectivity and that your client subnet can route to the private endpoint subnet",
"2. Ask your network team to verify routing between $([System.Net.Dns]::GetHostName()) and private endpoint $($DnsResult.addresses[0])",
"3. Check Azure network security group (NSG) rules for inbound port 443",
"4. Verify Azure Virtual Network peering and User Defined Routes (UDRs)",
"5. Check if corporate firewall/NVA is blocking the connection",
"6. Manually run: Test-NetConnection -ComputerName $($DnsResult.hostname) -Port 443"
)
} else {
$classification.code = "tcp_connectivity_blocked"
$classification.summary = "DNS resolution succeeded but TCP 443 connection failed. The network path to the public endpoint is blocked."
$classification.rootCause = "Public endpoint network path blocked (firewall, proxy, ISP, or regional restriction)"
$classification.recommendedActions = @(
"1. Check if corporate firewall is blocking outbound port 443",
"2. If behind proxy, verify proxy settings allow HTTPS to documents.azure.com",
"3. Manually run: Test-NetConnection -ComputerName $($DnsResult.hostname) -Port 443",
"4. Try connecting from a different network to isolate the issue"
)
}
return $classification
}
# Both succeeded
if ($DnsResult.succeeded -and $TcpResult.succeeded) {
$classification.status = "success"
$classification.code = "network_connectivity_healthy"
$classification.summary = "Network connectivity is healthy. DNS resolves and TCP 443 is reachable."
$classification.rootCause = $null
$classification.recommendedActions = @(
"✓ Local network connectivity is working",
"If Cosmos DB operations still fail, check:",
" - RBAC/authentication permissions",
" - Account firewall IP rules (if enabled)",
" - Data plane token expiry",
" - Application-level issues (connection strings, SDK versions)"
)
return $classification
}
return $classification
}
function Redact-Sensitive {
param([object]$Object)
if (-not $Redact) { return $Object }
$json = $Object | ConvertTo-Json -Depth 10
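# Mask identifiers wherever they appear (resource IDs, hostnames, URLs, recommended actions)
if ($SubscriptionId) {
$json = $json -replace [regex]::Escape("$SubscriptionId"), "REDACTED-SUBSCRIPTION-ID"
}
if ($AccountName) {
$json = $json -replace [regex]::Escape("$($AccountName).documents.azure.com"), "REDACTED-ACCOUNT-NAME.documents.azure.com"
}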
# Redact tenant IDs (GUIDs in certain fields)
$json = $json -replace '"currentTenant"\s*:\s*"[^"]*"', '"currentTenant": "REDACTED-TENANT-ID"'
# Redact user names
$json = $json -replace '"currentUser"\s*:\s*"[^"]*"', '"currentUser": "REDACTED-USER-NAME"'
$json = $json -replace '"principalName"\s*:\s*"[^"]*"', '"principalName": "REDACTED-PRINCIPAL-NAME"'
return $json | ConvertFrom-Json
}
# ============================================================================
# Main Execution
# ============================================================================
try {
# Validate and collect inputs
if ($Interactive -and -not $EndpointUrl) {
$inputs = Read-InputsInteractively
$EndpointUrl = $inputs.EndpointUrl
$SubscriptionId = $inputs.SubscriptionId
$ResourceGroup = $inputs.ResourceGroup
$AccountName = $inputs.AccountName
$PrivateEndpointIP = $inputs.PrivateEndpointIP
$VpnSubnetRange = $inputs.VpnSubnetRange
} elseif (-not $EndpointUrl) {
Write-Host "No endpoint URL provided. Use -Interactive flag or provide parameters." -ForegroundColor Red
Show-InputInstructions
exit 1
}
# Extract hostname from URL
$uri = [System.Uri]$EndpointUrl
$hostname = $uri.Host
Write-Host "Collecting diagnostics for: $hostname" -ForegroundColor Cyan
Write-Host ""
# Run diagnostics
Write-Host "[1/5] DNS Resolution..." -ForegroundColor Cyan
$dnsResult = Invoke-DnsResolution -Hostname $hostname
Write-Host "[2/5] TCP Connectivity (port 443)..." -ForegroundColor Cyan
$tcpResult = Invoke-TcpConnectivityTest -Hostname $hostname -Port 443 -TimeoutMs $TcpConnectTimeoutMs
Write-Host "[3/5] HTTPS Probe..." -ForegroundColor Cyan
$httpsResult = Invoke-HttpsProbe -Url $EndpointUrl
Write-Host "[4/5] Private Network Analysis..." -ForegroundColor Cyan
$privateNetIndicators = Get-PrivateNetworkIndicators -ResolvedAddresses $dnsResult.addresses -PrivateEndpointIP $PrivateEndpointIP -VpnSubnetRange $VpnSubnetRange
Write-Host "[5/5] Azure Configuration & RBAC..." -ForegroundColor Cyan
$cliContext = Get-AzureCliContext
$networkConfig = @{ checked = $false; error = "Skipped" }
$rbacAssessment = @{ checked = $false; classification = "unknown"; error = "Skipped" }
if ($cliContext.authenticated -and $SubscriptionId -and $ResourceGroup -and $AccountName) {
$networkConfig = Get-AzureAccountNetworkConfig -SubscriptionId $SubscriptionId -ResourceGroup $ResourceGroup -AccountName $AccountName
$rbacAssessment = Get-RbacAssessment -SubscriptionId $SubscriptionId -ResourceGroup $ResourceGroup -AccountName $AccountName
} elseif (-not $cliContext.authenticated) {
Write-Host " ⚠ Azure CLI not authenticated. Skipping Azure checks. Run 'az login' to enable." -ForegroundColor Yellow
}
Write-Host ""
Write-Host "Generating classification..." -ForegroundColor Cyan
$classification = Invoke-Classification -DnsResult $dnsResult -TcpResult $tcpResult -PrivateNetworkIndicators $privateNetIndicators -AzureNetworkConfig $networkConfig
# Build final report
$report = @{
version = $ScriptVersion
timestamp = $DiagnosticTimestamp
target = @{
endpointUrl = if ($Redact) { "REDACTED" } else { $EndpointUrl }
hostname = $hostname
subscriptionId = if ($Redact -and $SubscriptionId) { "REDACTED" } elseif ($SubscriptionId) { "$SubscriptionId" } else { $null }
resourceGroup = if ($Redact -and $ResourceGroup) { "REDACTED" } else { $ResourceGroup }
accountName = if ($Redact -and $AccountName) { "REDACTED" } else { $AccountName }
}
execution = @{
hostname = [System.Net.Dns]::GetHostName()
platform = $PSVersionTable.OS
powershellVersion = $PSVersionTable.PSVersion.ToString()
}
diagnostics = @{
dns = $dnsResult
tcp = $tcpResult
https = $httpsResult
privateNetwork = $privateNetIndicators
azureNetworkConfig = $networkConfig
rbac = $rbacAssessment
azureCli = $cliContext
}
classification = $classification
}
# Redact if requested
if ($Redact) {
$report = Redact-Sensitive -Object $report
}
# Output JSON report
$jsonReport = $report | ConvertTo-Json -Depth 10
# Save to file
$timestamp = Get-Date -Format "yyyyMMdd_HHmmss"
$outputFile = "cosmos-diagnostic-$timestamp.json"
$jsonReport | Out-File -FilePath $outputFile -Encoding UTF8
Write-Host ""
Write-Host "═════════════════════════════════════════════════════════════════════════════" -ForegroundColor Green
Write-Host "DIAGNOSTIC COMPLETE" -ForegroundColor Green
Write-Host "═════════════════════════════════════════════════════════════════════════════" -ForegroundColor Green
Write-Host ""
Write-Host "Summary:" -ForegroundColor Cyan
Write-Host " DNS Resolution: $(if ($dnsResult.succeeded) { '✓ PASS' } else { '✗ FAIL' })"
Write-Host " TCP Connectivity: $(if ($tcpResult.succeeded) { '✓ PASS' } else { '✗ FAIL' })"
Write-Host " Private Network: $(if ($privateNetIndicators.isPrivateRange) { 'Detected (Private Endpoint)' } else { 'Not Detected (Public Endpoint)' })"
Write-Host " Classification: $($classification.status.ToUpper()) - $($classification.code)"
Write-Host ""
Write-Host "Full report saved to: $outputFile" -ForegroundColor Green
Write-Host ""
Write-Host "Details:" -ForegroundColor Yellow
Write-Host $classification.summary
Write-Host ""
if ($classification.recommendedActions.Count -gt 0) {
Write-Host "Recommended Actions:" -ForegroundColor Yellow
$classification.recommendedActions | ForEach-Object { Write-Host " $_" }
}
Write-Host ""
# Output JSON to console for easy copy/paste
Write-Host "Full JSON Report:" -ForegroundColor Cyan
Write-Host "─────────────────────────────────────────────────────────────────────────────"
Write-Host $jsonReport
} catch {
Write-Host "Error: $($_.Exception.Message)" -ForegroundColor Red
exit 1
}
@@ -0,0 +1,352 @@
# Cosmos DB Connectivity Diagnostic - Complete Documentation Index
## 📦 Deliverables
This folder contains a complete, production-ready diagnostic toolkit for troubleshooting Cosmos DB connectivity issues. Below is a guide to all files and their purpose.
---
## 📚 Documentation Files
### 1. **README.md** ← Start here
**Purpose:** Comprehensive usage guide for customers and support teams
**Contains:**
- Overview and features
- Quick start in 3 modes (interactive, non-interactive, with redaction)
- Step-by-step guide to finding all inputs
- Understanding output format
- Common scenarios and examples
- Integration examples
- Troubleshooting guide for common issues
**Read this if:** You're running the script for the first time or onboarding someone else
---
### 2. **QUICK_REFERENCE.md** ← For urgent issues
**Purpose:** 2-minute quick-start card for customers
**Contains:**
- 3-step quick start
- Result codes at a glance
- Common fixes
- Prerequisite checklist
**Read this if:** You need to run the script NOW and don't have time for full docs
---
### 3. **DIAGNOSTIC_SCHEMA.md** ← For developers/automation
**Purpose:** Complete JSON output specification
**Contains:**
- Full JSON schema with field descriptions
- Root, target, execution, diagnostics, and classification objects
- DNS/TCP/HTTPS/private network result formats
- Azure config and RBAC object structures
- Classification code reference table
- Sample outputs for 3 scenarios
- Parsing guidelines
- Version history
**Read this if:**
- You're building a parser or automation tool
- You need to understand the JSON structure
- You're integrating with support ticketing system
- You want to validate output structure
---
### 4. **CLASSIFICATION_MATRIX.md** ← For support teams
**Purpose:** Support playbooks and triage routing
**Contains:**
- Decision tree flowchart (ASCII art)
- All classification codes with detailed explanations
- Root causes and recommended actions for each code
- Tier 1 triage checklist
- Detailed playbooks for each failure scenario:
  - DNS Resolution Failed
  - TCP 443 Failed (Public Endpoint)
  - TCP 443 Failed (Private Endpoint)
  - RBAC Insufficient
- Support ticket template
- Python parsing example
- Automation routing matrix
**Read this if:**
- You're a support engineer receiving diagnostic reports
- You need to route issues based on classification
- You're building automation to process diagnostics
- You need to escalate to specialist teams
---
## 🔧 Script File
### **Diagnose-CosmosConnectivity.ps1**
**Purpose:** Main diagnostic script (customer-executable)
**What it does:**
1. Prompts for the account endpoint and Azure resource identifiers (interactive mode) or accepts them as parameters
2. Runs 5 diagnostic checks:
   - DNS resolution of account endpoint
   - TCP 443 connectivity test
   - HTTPS reachability probe
   - Private network indicators analysis
   - Azure CLI queries (if authenticated)
3. Performs RBAC assessment
4. Generates classification (success/failure/warning + specific code)
5. Outputs structured JSON to file and console
6. Produces human-readable summary with recommended actions
**Key Features:**
- 300+ lines of well-commented PowerShell
- Error handling for all network operations
- Timeouts on the TCP and HTTPS probes to prevent hanging
- Optional sensitive data redaction
- Runs on Windows PowerShell 5.0+ (Windows) and PowerShell 7+ (Windows, macOS, Linux)
- No external dependencies except optional Azure CLI
**How to run:**
```powershell
# Interactive (recommended first run)
.\Diagnose-CosmosConnectivity.ps1 -Interactive
# Non-interactive (scripted)
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "..." -SubscriptionId "..." -ResourceGroup "..." -AccountName "..."
# Safe for support (redacted)
.\Diagnose-CosmosConnectivity.ps1 ... -Redact
```
---
## 🔄 File Relationships
```
Customer Issue: "Can't connect to Cosmos DB"
├─→ QUICK_REFERENCE.md (if in a hurry)
│   └─→ "Run this command"
└─→ README.md (comprehensive guidance)
    ├─→ Run: Diagnose-CosmosConnectivity.ps1
    │   └─→ Outputs JSON file + console summary
    ├─→ Read classification code
    └─→ CLASSIFICATION_MATRIX.md (support playbook)
        ├─→ Find your classification code
        ├─→ Read root causes
        └─→ Follow recommended actions
            ├─→ Self-resolve?
            │   └─→ Done!
            └─→ Still stuck?
                ├─→ Gather info from JSON
                ├─→ Re-run with the -Redact flag
                └─→ Escalate to support
                    ├─→ Support triages with CLASSIFICATION_MATRIX.md
                    └─→ Route to specialist (network, auth, etc.)
```
---
## 🎯 Usage by Role
### 👤 Customer / End User
1. Read: **QUICK_REFERENCE.md** (2 min)
2. Gather inputs as shown in README.md
3. Run: `.\Diagnose-CosmosConnectivity.ps1 -Interactive`
4. Review the output and find the Classification Code
5. Try recommended actions from console output
6. If stuck → Share JSON with support (use `-Redact`)
### 👨‍💼 Support Engineer (Tier 1)
1. Receive JSON report from customer
2. Read: **CLASSIFICATION_MATRIX.md** section "Tier 1: Triage"
3. Look up classification.code in "Classification Code Reference"
4. Follow the corresponding playbook
5. Either self-resolve or route to specialist
### 👨‍💻 Support Engineer (Specialist)
1. Receive routed issue with JSON and escalation context
2. Read relevant playbook from **CLASSIFICATION_MATRIX.md**
3. Use **DIAGNOSTIC_SCHEMA.md** to parse specific JSON fields
4. Reference "Recommended Actions" for deep-dive steps
5. May request customer to re-run with additional parameters
### 🤖 Automation / Integration
1. Read: **DIAGNOSTIC_SCHEMA.md** (schema specification)
2. Parse JSON output from script
3. Route based on classification.code
4. (Optional) Read **CLASSIFICATION_MATRIX.md** section "JSON Parsing for Automation"
5. Integrate with ticketing, routing, or remediation system
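For step 3, here is a minimal Python sketch of routing on `classification.code`. The queue names are hypothetical placeholders, not part of this toolkit. Unknown or newly added codes fall back to manual triage instead of raising an error.

```python
import json

# Hypothetical queue names; map them to your own ticketing or routing system.
ROUTES = {
    "network_connectivity_healthy": "auth-and-app",
    "dns_resolution_failed": "network",
    "tcp_connectivity_blocked": "network",
    "private_endpoint_network_path_blocked": "network-private-link",
    "rbac_insufficient": "identity",
}


def route_report(report_json: str) -> str:
    """Return the queue for one diagnostic report, keyed on classification.code."""
    report = json.loads(report_json)
    # Tolerate a missing or null "classification" object.
    code = (report.get("classification") or {}).get("code", "unknown")
    return ROUTES.get(code, "manual-triage")
```

For example, `route_report('{"classification": {"code": "dns_resolution_failed"}}')` returns `"network"`.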
### 📊 Product Team / Data Analysis
1. Collect diagnostic reports over time
2. Aggregate classification codes to identify trends
3. Use JSON structure to extract metrics (DNS latency, TCP success rate, etc.)
4. Reference **DIAGNOSTIC_SCHEMA.md** for field definitions
5. Correlate with support ticket data for insights
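The aggregation in steps 2 and 3 can be sketched in Python as follows. This assumes each report has already been loaded into a dict. The field names (`classification.code`, `diagnostics.dns.succeeded`, `diagnostics.dns.latencyMs`, `diagnostics.tcp.succeeded`) are the ones the script writes in this version.

```python
from collections import Counter
from statistics import median


def summarize(reports):
    """Aggregate parsed diagnostic reports into code counts and simple network metrics."""
    codes = Counter((r.get("classification") or {}).get("code", "unknown") for r in reports)
    tcp_results = [r["diagnostics"]["tcp"]["succeeded"] for r in reports]
    # Only successful lookups carry a meaningful DNS latency.
    dns_latencies = [
        r["diagnostics"]["dns"]["latencyMs"]
        for r in reports
        if r["diagnostics"]["dns"]["succeeded"]
    ]
    return {
        "codes": dict(codes),
        "tcp_success_rate": sum(tcp_results) / len(tcp_results) if tcp_results else None,
        "median_dns_latency_ms": median(dns_latencies) if dns_latencies else None,
    }
```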
---
## 📋 Classification Codes at a Glance
Quick reference (full details in CLASSIFICATION_MATRIX.md):
| Code | Type | Severity | What It Means |
|------|------|----------|---|
| `network_connectivity_healthy` | ✅ | Info | Network works; if still broken, check auth/app |
| `dns_resolution_failed` | ❌ | High | Cannot resolve endpoint (DNS/VPN/proxy issue) |
| `tcp_connectivity_blocked` | ❌ | High | DNS works, port 443 blocked (firewall/ISP) |
| `private_endpoint_network_path_blocked` | ❌ | High | Private endpoint unreachable (PE routing issue) |
| `rbac_insufficient` | ⚠️ | Medium | Network OK, but permissions missing |
| `private_endpoint_mismatch` | ⚠️ | Medium | Resolved to unexpected private IP |
| `azure_config_check_skipped` | ⚠️ | Low | Azure CLI not authenticated; re-run after `az login` |
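The network codes in this table follow the decision tree in CLASSIFICATION_MATRIX.md: DNS first, then whether the resolved IP falls in an RFC 1918 range, then TCP 443. Below is a minimal Python sketch of that branch. It covers only the DNS/TCP codes, not RBAC or skipped-check codes, and it omits the loopback case that the script also treats as private.

```python
import ipaddress

# The three RFC 1918 private IPv4 ranges.
RFC1918 = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_rfc1918(ip: str) -> bool:
    """True if the address falls in an RFC 1918 range (IPv6 addresses never match)."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in RFC1918)


def classify(dns_ok: bool, tcp_ok: bool, resolved: list) -> str:
    """Map DNS/TCP results to a network classification code per the decision tree."""
    if not dns_ok:
        return "dns_resolution_failed"
    private = any(is_rfc1918(a) for a in resolved)
    if not tcp_ok:
        return "private_endpoint_network_path_blocked" if private else "tcp_connectivity_blocked"
    return "network_connectivity_healthy"
```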
---
## 🔍 Finding Specific Information
### "I want to know what the JSON contains"
**DIAGNOSTIC_SCHEMA.md** (all field definitions)
### "I see a classification code, what does it mean?"
**CLASSIFICATION_MATRIX.md** (code reference + playbook)
### "How do I run the script?"
**README.md** (detailed how-to) or **QUICK_REFERENCE.md** (2-min version)
### "I'm building a parser/bot"
**DIAGNOSTIC_SCHEMA.md** (schema + samples) + **CLASSIFICATION_MATRIX.md** (routing logic)
### "I need to support multiple customers"
**CLASSIFICATION_MATRIX.md** (support ticket template + triage playbook)
### "I need to find input for a specific field"
**README.md** section "Getting Your Inputs" (step-by-step portal locations for each input)
### "How do I integrate this into my system?"
**DIAGNOSTIC_SCHEMA.md** (JSON structure) + **CLASSIFICATION_MATRIX.md** (routing + Python example)
---
## ✅ Pre-Launch Checklist
Before deploying to customers, verify:
- [ ] Script runs without errors in interactive mode
- [ ] Script accepts all parameters in non-interactive mode
- [ ] `-Redact` flag properly masks sensitive data
- [ ] JSON output validates against DIAGNOSTIC_SCHEMA.md
- [ ] All classification codes match CLASSIFICATION_MATRIX.md
- [ ] README.md examples tested and working
- [ ] Support team trained on CLASSIFICATION_MATRIX.md playbooks
- [ ] Triage automation configured (if applicable)
- [ ] Sample JSON files created and tested
- [ ] Accessibility verified (screen readers, etc.)
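The `-Redact` and JSON-structure items above can be partly automated. Below is a minimal Python sketch. The required key sets are taken from the report the script builds in this commit, not from DIAGNOSTIC_SCHEMA.md, so reconcile them with the schema. `secrets` stands for whatever real values you passed to the script, such as the subscription ID. When reading files written by Windows PowerShell, open them with `encoding="utf-8-sig"`, because `Out-File -Encoding UTF8` adds a byte-order mark there.

```python
import json

# Keys emitted by the script's report builder in this version.
REQUIRED_TOP_LEVEL = {"version", "timestamp", "target", "execution", "diagnostics", "classification"}
REQUIRED_DIAGNOSTICS = {"dns", "tcp", "https", "privateNetwork", "azureNetworkConfig", "rbac", "azureCli"}


def check_report(text, secrets=()):
    """Return a list of problems in a report's JSON text; an empty list means it passed."""
    problems = []
    report = json.loads(text)
    missing = REQUIRED_TOP_LEVEL - set(report)
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
    diagnostics = report.get("diagnostics") or {}
    missing = REQUIRED_DIAGNOSTICS - set(diagnostics)
    if missing:
        problems.append(f"missing diagnostics keys: {sorted(missing)}")
    # Substring check on the raw text also catches values embedded in URLs or resource IDs.
    for index, secret in enumerate(secrets):
        if secret and secret in text:
            problems.append(f"secret #{index} appears unredacted")
    return problems
```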
---
## 🚀 Rollout Plan
### Phase 1: Internal Testing (Week 1)
- [ ] Run script on various network configurations
- [ ] Test interactive and non-interactive modes
- [ ] Verify Azure CLI integration (if connected to test accounts)
- [ ] Collect sample JSON outputs
### Phase 2: Support Dogfood (Week 2)
- [ ] Train support team on using CLASSIFICATION_MATRIX.md
- [ ] Have support team run diagnostics on internal test accounts
- [ ] Collect feedback on documentation clarity
- [ ] Refine playbooks based on real cases
### Phase 3: Limited Release (Week 3)
- [ ] Release to subset of customers (e.g., preview tier)
- [ ] Gather feedback on usability
- [ ] Monitor classification code distribution
- [ ] Look for unexpected errors or edge cases
### Phase 4: General Availability (Week 4)
- [ ] Release to all customers
- [ ] Monitor issue volume and classification codes
- [ ] Use data to identify new playbooks or improvements
- [ ] Update documentation based on feedback
---
## 📞 Support & Maintenance
### Common Questions
**Q: Can I run the script without Azure CLI?**
A: Yes! It will skip Azure configuration checks but still do network diagnostics.
**Q: Is the script safe? Does it collect personal data?**
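A: The script is read-only. It resolves DNS, opens a TCP connection, sends one HTTPS HEAD request, and (if you're authenticated) runs read-only Azure CLI queries. The report does include identifying data, such as your machine hostname, your Azure CLI user name and tenant ID, and role-assignment principal names. Use `-Redact` to mask the subscription ID, tenant ID, user and principal names, and target identifiers before sharing. Review the JSON first, since some fields (such as the local machine hostname) are not masked.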
**Q: What if I get an unexpected error?**
A: Check error message in console, review troubleshooting section in README.md, or share the JSON file with support.
**Q: How often should I re-run diagnostics?**
A: After network changes, VPN reconnect, or when troubleshooting intermittent issues.
---
## 📈 Success Metrics
Track these to measure script effectiveness:
- % of customers who run script on first issue
- % of issues self-resolved after reading recommended actions
- Reduction in escalations for network vs auth vs app issues
- Average time to triage (before: manual back-and-forth; after: automated)
- Distribution of classification codes (helps identify common issues)
---
## 🔄 Version & Updates
**Current Version:** 1.0.0
**Schema Version:** 1.0.0
**Last Updated:** 2026-05-13
**Versioning Policy:**
- Major version (1.x.x) = Breaking changes to JSON schema or classification codes
- Minor version (x.1.x) = New checks or optional fields added
- Patch version (x.x.1) = Bug fixes, documentation updates
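Under this policy, a consumer only needs to compare major versions and should ignore fields it does not recognize. Below is a minimal Python sketch. It reads the report's top-level `version` field; the function names are illustrative.

```python
def parse_version(version: str) -> tuple:
    """Split 'major.minor.patch' into integers; raise ValueError if malformed."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected major.minor.patch, got {version!r}")
    return tuple(int(p) for p in parts)


def can_parse(report_version: str, parser_version: str = "1.0.0") -> bool:
    """A parser handles any report with the same major version.

    Minor and patch bumps only add optional fields or fix bugs under the
    policy above, so unknown fields should simply be ignored.
    """
    return parse_version(report_version)[0] == parse_version(parser_version)[0]
```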
---
## 📄 License & Attribution
All files in this directory are provided as-is for Cosmos DB connectivity diagnostics.
See repository LICENSE file for terms.
---
**Quick Links:**
- 🚀 [Quick Start](./QUICK_REFERENCE.md)
- 📖 [Full Documentation](./README.md)
- 🔧 [Script](./Diagnose-CosmosConnectivity.ps1)
- 🗂️ [JSON Schema](./DIAGNOSTIC_SCHEMA.md)
- 📋 [Support Playbooks](./CLASSIFICATION_MATRIX.md)
@@ -0,0 +1,144 @@
# Cosmos DB Connectivity Diagnostic - Quick Reference
## 🚀 Quick Start (2 Minutes)
### Step 1: Gather Your Info
| Item | Where to Find |
|------|---|
| **Endpoint URL** | Azure Portal → Cosmos DB Account → Overview → URI field |
| **Subscription ID** | Azure Portal → Subscriptions → Copy ID |
| **Resource Group** | Azure Portal → Cosmos DB Account → Overview → Essentials → "Resource group" |
| **Account Name** | From endpoint URL (the part before `.documents.azure.com`) |
### Step 2: Run the Script
**Interactive (easiest):**
```powershell
.\Diagnose-CosmosConnectivity.ps1 -Interactive
```
Script will prompt for inputs and guide you.
**Non-interactive:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://my-cosmos.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "my-rg" `
-AccountName "my-cosmos"
```
**With redaction (safe for support):**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://my-cosmos.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "my-rg" `
-AccountName "my-cosmos" `
-Redact
```
### Step 3: Check Result
Look for the **Classification** line:
```
Classification: SUCCESS - network_connectivity_healthy
```
---
## 📊 Result Codes
| Code | Meaning | Action |
|------|---------|--------|
| ✅ `network_connectivity_healthy` | Network OK | Check auth/RBAC if operations still fail |
| ❌ `dns_resolution_failed` | Cannot find hostname | Check VPN/proxy DNS settings |
| ❌ `tcp_connectivity_blocked` | DNS works, but port 443 blocked | Ask network team to check firewall |
| ❌ `private_endpoint_network_path_blocked` | Private endpoint unreachable | Ask network team to check PE routing |
| ⚠️ `rbac_insufficient` | Not enough permissions | Ask admin for the Cosmos DB role you need (see RBAC Insufficient below) |
| ⚠️ `azure_config_check_skipped` | Azure CLI not set up | Run `az login` and re-run |
---
## 🆘 Common Fixes
### DNS Resolution Failed
1. Are you on a VPN? → Ask VPN admin about DNS settings
2. Check manually: `nslookup my-cosmos-account.documents.azure.com`
3. Try different DNS: `nslookup my-cosmos-account.documents.azure.com 8.8.8.8`
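If `nslookup` isn't available, here is a minimal Python sketch of the same check. It optionally compares the result with the private endpoint IP you expect. Unlike `nslookup`, `socket.getaddrinfo` goes through the operating system resolver (hosts file included), which is closer to what the diagnostic script's DNS step does. The function name and result keys are illustrative.

```python
import socket


def check_dns(hostname, expected_ip=None):
    """Resolve a hostname via the OS resolver and optionally compare with an expected IP."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return {"succeeded": False, "addresses": [], "error": str(exc)}
    # De-duplicate while keeping resolver order.
    addresses = list(dict.fromkeys(info[4][0] for info in infos))
    result = {"succeeded": True, "addresses": addresses, "error": None}
    if expected_ip is not None:
        result["matchesExpected"] = expected_ip in addresses
    return result
```

For example, `check_dns("my-cosmos-account.documents.azure.com", "10.123.171.30")` shows whether your machine resolves the account to its private endpoint.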
### TCP 443 Blocked (Public Endpoint)
1. Check Windows Firewall (Windows Defender) settings
2. If on corporate network → Ask IT if 443 outbound is allowed
3. Try from mobile hotspot to test
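`Test-NetConnection` is Windows-only. On any OS, a minimal Python sketch of the same port-443 reachability test is shown below. Like the script, it uses a 5-second timeout by default and records the local source IP. The function name and result keys are illustrative.

```python
import socket
import time


def tcp_probe(host, port=443, timeout_s=5.0):
    """Attempt a TCP connection and report success, latency, and local source IP."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            latency_ms = int((time.monotonic() - start) * 1000)
            source_ip = sock.getsockname()[0]
        return {"succeeded": True, "latencyMs": latency_ms, "sourceIp": source_ip, "error": None}
    except OSError as exc:  # covers timeouts, refused connections, and name-resolution errors
        return {"succeeded": False, "latencyMs": None, "sourceIp": None, "error": str(exc)}
```

Run `tcp_probe("my-cosmos-account.documents.azure.com")` from both networks when comparing with the mobile-hotspot test.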
### TCP 443 Blocked (Private Endpoint)
1. Verify VPN is connected
2. Ask network team to check NSG and routing rules
3. Provide them with the script output (use `-Redact` to mask sensitive data)
### RBAC Insufficient
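1. Ask admin to assign the role you need: **"Cosmos DB Operator"** to manage the account (it does not grant data access), or a data-plane role such as **"Cosmos DB Built-in Data Contributor"** to read and write data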
2. Wait 5-10 minutes for role assignment to propagate
---
## 📁 Output Files
**JSON Report:** `cosmos-diagnostic-<timestamp>.json`
- Full diagnostic results
- Save for your records
- Can share with support (use `-Redact` first)
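Because the file name ends in a `yyyyMMdd_HHmmss` timestamp, sorting by name also sorts by time. Below is a minimal Python sketch that picks the newest report in a folder and reads its classification. It opens the file as `utf-8-sig` because Windows PowerShell's `Out-File -Encoding UTF8` writes a byte-order mark. The function name is illustrative.

```python
import json
from pathlib import Path


def latest_report(directory="."):
    """Return (file name, status, code) for the newest report, or None if there is none."""
    files = sorted(Path(directory).glob("cosmos-diagnostic-*.json"))
    if not files:
        return None
    newest = files[-1]  # yyyyMMdd_HHmmss sorts lexicographically in time order
    report = json.loads(newest.read_text(encoding="utf-8-sig"))
    classification = report.get("classification") or {}
    return newest.name, classification.get("status"), classification.get("code")
```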
---
## ⚙️ Prerequisites
- Windows PowerShell 5.0+ (Windows) or PowerShell 7+ (Windows, macOS, Linux)
- Outbound HTTPS (port 443) access to `*.documents.azure.com`
- (Optional) Azure CLI for full diagnostics: `az login`
---
## 💡 Tips
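**Private Endpoint?** Pass the expected IP along with the other parameters. In `-Interactive` mode, enter it at the prompt instead, because the prompt's answer replaces any `-PrivateEndpointIP` value:
```powershell
.\Diagnose-CosmosConnectivity.ps1 ... -PrivateEndpointIP "10.123.171.30"
```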
**Sharing with support safely:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 ... -Redact
# Share the JSON file (sensitive data masked)
```
**Just want DNS/TCP without Azure checks:**
- Run without providing SubscriptionId/ResourceGroup/AccountName
- Or don't run `az login` first
---
## 📞 Getting Help
**If you see:**
- ✅ `✓ PASS` for both DNS and TCP → Network is working. The issue is likely authentication, permissions, or application-level.
- ❌ `✗ FAIL` on DNS or TCP → Network path is blocked. Share the JSON with support.
- ⚠️ Yellow warnings → Configuration issue. Follow recommended actions.
**Next:** Share your JSON report with support and include the **Classification Code**.
---
## 📋 Checklist Before Contacting Support
- [ ] I ran the script successfully
- [ ] I noted the **Classification Code** (from console output)
- [ ] I checked the **Recommended Actions** section
- [ ] I tried the basic fixes above
- [ ] I saved the JSON report
---
**Version:** 1.0.0 | **Last Updated:** 2026-05-13
# Cosmos DB Connectivity Diagnostic Script - README
## Overview
This is a standalone PowerShell diagnostic script that captures network connectivity, private endpoint configuration, and Azure RBAC status for Cosmos DB accounts. It's designed to be run locally on a customer's machine to help troubleshoot HTTP 0.0 and connection errors.
**Key Features:**
- ✅ DNS resolution verification
- ✅ TCP 443 connectivity testing
- ✅ HTTPS reachability probe
- ✅ Private endpoint detection
- ✅ Private network route analysis
- ✅ Azure CLI optional context (network config, RBAC)
- ✅ Structured JSON output for triage automation
- ✅ Sensitive data redaction for safe sharing
- ✅ Interactive and non-interactive modes
---
## Quick Start
### Prerequisites
- Windows PowerShell 5.0+ on Windows, or PowerShell 7+ on Windows, Linux, or macOS
- If querying Azure config: Azure CLI installed and authenticated (`az login`)
- Outbound HTTPS (TCP 443) access to your account's `*.documents.azure.com` endpoint
### Option 1: Interactive Mode (Recommended for First Run)
The simplest approach is to let the script prompt for its inputs:
```powershell
.\Diagnose-CosmosConnectivity.ps1 -Interactive
```
The script will display a guide showing where to find each input, then prompt:
- Endpoint URL
- Subscription ID
- Resource Group
- Account Name
- (Optional) Private Endpoint IP
- (Optional) VPN Subnet Range
### Option 2: Non-Interactive Mode (Scripted/Automated)
Provide all parameters directly:
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://my-cosmos-account.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "my-resource-group" `
-AccountName "my-cosmos-account"
```
### Option 3: Non-Interactive with Redaction (Safe for Support)
Output JSON with sensitive data masked:
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://my-cosmos-account.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "my-resource-group" `
-AccountName "my-cosmos-account" `
-Redact
```
---
## Detailed Usage
### Getting Your Inputs
#### 1. **Endpoint URL** (Required)
**Location:** Azure Portal → Cosmos DB Account → Overview
1. Go to [Azure Portal](https://portal.azure.com)
2. Search for "Cosmos DB"
3. Click your Cosmos DB account
4. Look for the **"URI"** field in the Overview tab
5. Copy the entire URL (e.g., `https://my-cosmos-account.documents.azure.com`)
**Format:** `https://<account-name>.documents.azure.com` (do NOT include trailing slash or `:443/`)
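**Note:** If your account uses regional endpoints, enter the primary endpoint shown as the URI on the Overview tab. A private endpoint does not change the hostname. Only its DNS resolution changes: the same name resolves to a private IP instead of a public one.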
---
#### 2. **Subscription ID** (Required)
**Location:** Azure Portal → Subscriptions or Portal → Home
1. Go to [Azure Portal](https://portal.azure.com)
2. Click on "Subscriptions" (or search for it)
3. Find your subscription
4. Copy the **Subscription ID** (looks like `12345678-1234-1234-1234-123456789012`)
**Alternative:** On your Cosmos DB account's **Overview** page, the **Subscription ID** is listed in the Essentials section.
---
#### 3. **Resource Group** (Required)
**Location:** Azure Portal → Cosmos DB Account → Overview
1. Open your Cosmos DB account
2. On the **Overview** page, find the **Resource group** field in the Essentials section
**Example:** `my-production-rg` or `cosmos-resources`
---
#### 4. **Account Name** (Required)
**Location:** Extract from endpoint URL or Azure Portal
**From URL:**
- Endpoint: `https://my-cosmos-account.documents.azure.com`
- Account Name: `my-cosmos-account` (the part before `.documents.azure.com`)
**From Portal:**
- Open Cosmos DB account → Look at the account name in the breadcrumb or page title
---
#### 5. **Private Endpoint IP** (Optional but Recommended)
**Location:** Azure Portal → Cosmos DB Account → Private Endpoint Connections
1. Open your Cosmos DB account
2. Go to **Settings** → **Private Endpoint Connections**
3. If any connections exist, look for **"Private IP address"** column
4. Copy the IP (e.g., `10.123.171.30`)
**When to provide:**
- If your Cosmos account has private endpoints configured
- Otherwise, leave blank (press Enter in interactive mode)
**Format:** `10.x.x.x`, `172.16-31.x.x`, or `192.168.x.x` (RFC 1918 ranges)
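The RFC 1918 test can be reproduced with a short helper. This is useful for checking an IP copied from the portal or from `nslookup`. `Test-Rfc1918Address` is a hypothetical name used only for this sketch. It handles IPv4 only, because RFC 1918 defines only IPv4 ranges.

```powershell
# Hypothetical helper: $true if the string is an IPv4 address in an RFC 1918 private range.
function Test-Rfc1918Address {
    param([Parameter(Mandatory)] [string] $IPAddress)

    $ip = $null
    if (-not [System.Net.IPAddress]::TryParse($IPAddress, [ref] $ip)) { return $false }
    # RFC 1918 only defines IPv4 ranges
    if ($ip.AddressFamily -ne [System.Net.Sockets.AddressFamily]::InterNetwork) { return $false }

    $b = $ip.GetAddressBytes()
    $in10  = $b[0] -eq 10                                              # 10.0.0.0/8
    $in172 = ($b[0] -eq 172) -and ($b[1] -ge 16) -and ($b[1] -le 31)  # 172.16.0.0/12
    $in192 = ($b[0] -eq 192) -and ($b[1] -eq 168)                      # 192.168.0.0/16
    return ($in10 -or $in172 -or $in192)
}
```

For example, `Test-Rfc1918Address 10.123.171.30` returns `$true`, while a public Azure IP returns `$false`.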
---
#### 6. **VPN Subnet Range** (Optional)
**Location:** Ask your network team or VPN client properties
If you're connecting via VPN, your network team should know your VPN subnet CIDR.
**Example:** `10.0.0.0/24` (covers 10.0.0.0 to 10.0.0.255)
**When to provide:**
- If you're behind a VPN
- If you suspect VPN routing is the issue
- Otherwise, leave blank
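To check whether an address, such as your VPN client IP, falls inside a CIDR range, compare the network bits of both addresses. This is a minimal IPv4-only sketch. `Test-IPInCidr` and `ConvertTo-IPv4Int64` are hypothetical helper names, not script parameters.

```powershell
# Hypothetical helpers (IPv4 only): test whether an address lies inside a CIDR range.
function ConvertTo-IPv4Int64([string] $Address) {
    $b = ([System.Net.IPAddress]::Parse($Address)).GetAddressBytes()
    # Big-endian bytes -> one 32-bit value, held in an Int64 so it stays positive
    return ([long]$b[0] -shl 24) -bor ([long]$b[1] -shl 16) -bor ([long]$b[2] -shl 8) -bor [long]$b[3]
}

function Test-IPInCidr {
    param(
        [Parameter(Mandatory)] [string] $IPAddress,
        [Parameter(Mandatory)] [string] $Cidr        # e.g. "10.0.0.0/24"
    )
    $network, $prefixText = $Cidr -split '/'
    $hostBits = 32 - [int] $prefixText
    # Same network when every bit above the host portion matches
    return (((ConvertTo-IPv4Int64 $IPAddress) -shr $hostBits) -eq ((ConvertTo-IPv4Int64 $network) -shr $hostBits))
}
```

For example, `Test-IPInCidr -IPAddress 10.0.0.17 -Cidr 10.0.0.0/24` returns `$true`, and `10.0.1.17` returns `$false` for the same range.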
---
### Understanding Output
#### Console Summary
After running, you'll see:
```
═════════════════════════════════════════════════════════════════════════════
DIAGNOSTIC COMPLETE
═════════════════════════════════════════════════════════════════════════════
Summary:
DNS Resolution: ✓ PASS
TCP Connectivity: ✗ FAIL
Private Network: Detected (Private Endpoint)
Classification: FAILURE - private_endpoint_network_path_blocked
Full report saved to: cosmos-diagnostic-20260513_143045.json
Summary:
TCP 443 connection failed to private endpoint. Network path is blocked.
Recommended Actions:
1. Verify VPN connectivity and that your client subnet can route to the private endpoint subnet
2. Ask your network team to verify routing from DESKTOP-ABC123 to private endpoint 10.123.171.30
3. Check Azure network security groups (NSGs) rules for port 443 inbound
4. Verify Azure Virtual Network peering and User Defined Routes (UDRs)
5. Check if corporate firewall/NVA is blocking the connection
6. Manually run: Test-NetConnection -ComputerName my-cosmos-account.documents.azure.com -Port 443
Full JSON Report:
...
```
#### JSON Output File
A file like `cosmos-diagnostic-20260513_143045.json` is automatically saved in the current directory.
**Use this file to:**
- Share with support (can use `-Redact` to mask sensitive data)
- Parse with automation tools
- Retain diagnostic history
---
## Common Scenarios
### Scenario 1: "I can't connect to Cosmos DB from my machine"
**Run this:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 -Interactive
```
**Interpret results:**
- If `dns_resolution_failed` → Check VPN/proxy DNS settings
- If `tcp_connectivity_blocked` → Ask network team to check firewall/NSG rules
- If `network_connectivity_healthy` → The network path works. The issue is likely auth/RBAC or application-level.
---
### Scenario 2: "Private endpoint isn't working"
**Run this:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://my-cosmos.documents.azure.com" `
-SubscriptionId "your-sub-id" `
-ResourceGroup "your-rg" `
-AccountName "your-account" `
-PrivateEndpointIP "10.123.171.30"
```
**Interpret results:**
- If the resolved IP matches the private endpoint IP but TCP fails → Network path blocked (VPN route, NSG, firewall, UDR)
- If the resolved IP differs from the provided IP → DNS misconfiguration (for example, a stale private DNS zone record)
- If network is healthy → Check private DNS zone configuration
---
### Scenario 3: "How do I share this with support safely?"
**Run with redaction:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://my-cosmos.documents.azure.com" `
-SubscriptionId "your-sub-id" `
-ResourceGroup "your-rg" `
-AccountName "your-account" `
-Redact
```
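Then share the generated JSON file. Sensitive fields are masked with `REDACTED` placeholders such as `REDACTED-SUBSCRIPTION-ID`. These fields are the subscription ID, resource group, account name, user name, and tenant ID. The endpoint hostname is kept because support needs it for triage.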
---
### Scenario 4: "I need the diagnostics in a pipeline"
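**Non-interactive, then load the saved JSON report:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
    -EndpointUrl "https://my-cosmos.documents.azure.com" `
    -SubscriptionId "your-sub-id" `
    -ResourceGroup "your-rg" `
    -AccountName "your-account"
# The report is saved as cosmos-diagnostic-<timestamp>.json in the current directory.
# Parsing that file is more reliable than filtering console output, which mixes the
# formatted summary with (possibly multi-line) JSON.
$reportFile = Get-ChildItem -Path . -Filter 'cosmos-diagnostic-*.json' |
    Sort-Object LastWriteTime | Select-Object -Last 1
$json = Get-Content -Path $reportFile.FullName -Raw | ConvertFrom-Json
# Now use $json in automation
if ($json.classification.code -eq "network_connectivity_healthy") {
    Write-Host "Network OK, escalating to app team"
} else {
    Write-Host "Network issue: $($json.classification.summary)"
}
```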
---
## Classification Codes
The script produces one of these classification codes:
| Code | Meaning |
|------|---------|
| `network_connectivity_healthy` | ✓ Network works. If errors, check auth/RBAC. |
| `dns_resolution_failed` | ✗ Cannot resolve endpoint hostname. |
| `tcp_connectivity_blocked` | ✗ DNS works, but TCP 443 blocked. |
| `private_endpoint_network_path_blocked` | ✗ Private endpoint detected, TCP fails. |
| `rbac_insufficient` | ⚠ Network OK, but RBAC permissions missing. |
| `azure_config_check_skipped` | ⚠ Azure CLI not authenticated. |
See [CLASSIFICATION_MATRIX.md](./CLASSIFICATION_MATRIX.md) for detailed playbooks and support guidance.
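The order in which these codes are chosen follows the classification decision tree and can be sketched as a function. This is an illustrative approximation with hypothetical parameter names, not the script's actual implementation. Placing the two warning codes after the network checks, and putting `azure_config_check_skipped` ahead of `rbac_insufficient`, are assumptions.

```powershell
# Hypothetical sketch of the classification precedence (not the script's real code).
function Get-ConnectivityClassification {
    param(
        [bool] $DnsSucceeded,
        [bool] $TcpSucceeded,
        [bool] $IsPrivateRange,                # resolved IP is in an RFC 1918 range
        [bool] $AzureCliAuthenticated = $true,
        [bool] $RbacSufficient = $true
    )
    if (-not $DnsSucceeded) { return 'dns_resolution_failed' }
    if (-not $TcpSucceeded) {
        # Same TCP failure, different owner depending on whether a private endpoint was hit
        if ($IsPrivateRange) { return 'private_endpoint_network_path_blocked' }
        return 'tcp_connectivity_blocked'
    }
    # Network is fine from here on; remaining codes are warnings
    if (-not $AzureCliAuthenticated) { return 'azure_config_check_skipped' }
    if (-not $RbacSufficient) { return 'rbac_insufficient' }
    return 'network_connectivity_healthy'
}
```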
---
## Advanced Usage
### Running Specific Checks
The script always runs all checks, but you can parse the JSON to focus on specific ones:
```powershell
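# Load the newest report (a bare wildcard would concatenate several reports into invalid JSON)
$report = Get-ChildItem cosmos-diagnostic-*.json | Sort-Object LastWriteTime |
    Select-Object -Last 1 | Get-Content -Raw | ConvertFrom-Json
# Get just DNS results
$report.diagnostics.dns | ConvertTo-Json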
# Get classification only
$report.classification | ConvertTo-Json
# Check if RBAC is sufficient
$report.diagnostics.rbac.classification
```
---
### Integration with Support Ticketing
When opening a support case:
1. **Run the script** (interactive mode is fine)
2. **Include the generated JSON file** in your ticket
3. **Re-run with `-Redact`** before attaching the file if you are sharing it with external support
Example ticket text:
```
Title: Cosmos DB Connection Errors
Body:
Experiencing connection errors to my Cosmos DB account.
Attached diagnostic results (cosmos-diagnostic-*.json).
Network Status: [paste classification.status]
Issue Code: [paste classification.code]
Endpoint: [paste target.hostname]
```
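The ticket fields above map directly onto report properties: `classification.status`, `classification.code`, and `target.hostname`. This means the body can be filled in from the saved JSON. `New-SupportTicketBody` is a hypothetical helper written for this sketch.

```powershell
# Hypothetical helper: build the ticket body above from a parsed diagnostic report.
function New-SupportTicketBody {
    param(
        [Parameter(Mandatory)] $Report,                      # output of ConvertFrom-Json
        [string] $ReportFileName = 'cosmos-diagnostic-*.json'
    )
    $lines = @(
        'Experiencing connection errors to my Cosmos DB account.'
        "Attached diagnostic results ($ReportFileName)."
        "Network Status: $($Report.classification.status)"
        "Issue Code: $($Report.classification.code)"
        "Endpoint: $($Report.target.hostname)"
    )
    return ($lines -join [Environment]::NewLine)
}
```

Pass it the object returned by `Get-Content -Raw <report file> | ConvertFrom-Json`, then paste the result into the ticket.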
---
### Troubleshooting the Script Itself
#### Script won't run (execution policy error)
```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```
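If the script was downloaded from the internet, `RemoteSigned` still blocks it unless it is signed. Unblock it with `Unblock-File .\Diagnose-CosmosConnectivity.ps1`, then re-run the script.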
#### "Azure CLI not found" but I need RBAC checks
Install Azure CLI:
- Windows: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-windows
- Mac: `brew install azure-cli`
- Linux: Follow docs at https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-linux
Then:
```powershell
az login
```
Re-run the script.
#### Endpoint validation error
**Error:** "Invalid format. Expected: https://<account-name>.documents.azure.com"
**Fix:** Remove trailing slash or port from URL:
- ❌ `https://my-cosmos.documents.azure.com/` (trailing slash)
- ❌ `https://my-cosmos.documents.azure.com:443/` (with port)
- ✅ `https://my-cosmos.documents.azure.com` (correct)
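If you paste the URI from the portal with a trailing slash or port, you can normalize it before passing it to `-EndpointUrl`. The helper below is a hypothetical sketch. Its hostname pattern is an assumption and may be looser or stricter than the script's own validation.

```powershell
# Hypothetical helper: strip a trailing slash, path, or ":443" from a Cosmos DB account URI.
function ConvertTo-CosmosEndpointUrl {
    param([Parameter(Mandatory)] [string] $Url)

    $uri = [System.Uri] $Url.Trim()     # throws if the string is not an absolute URI
    # Assumed pattern: letters, digits, and hyphens before .documents.azure.com
    if ($uri.Scheme -ne 'https' -or $uri.Host -notmatch '^[a-z0-9-]+\.documents\.azure\.com$') {
        throw 'Invalid format. Expected: https://<account-name>.documents.azure.com'
    }
    # Uri.Host excludes the port and path, so rebuilding from it drops ":443" and "/"
    return "https://$($uri.Host)"
}
```

For example, `ConvertTo-CosmosEndpointUrl 'https://my-cosmos.documents.azure.com:443/'` returns `https://my-cosmos.documents.azure.com`.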
---
## File Outputs
### Generated Files
After running, the script creates:
**`cosmos-diagnostic-<timestamp>.json`**
- Full diagnostic report in JSON format
- Machine-readable for automation
- Can be shared with support
- Keep for troubleshooting history
---
## JSON Schema
For details on JSON structure, field definitions, and sample outputs, see [DIAGNOSTIC_SCHEMA.md](./DIAGNOSTIC_SCHEMA.md).
---
## Support Routing
Based on classification code, route as follows:
| Classification | Route To |
|---|---|
| `network_connectivity_healthy` | Application/Auth team (network verified working) |
| `dns_resolution_failed` | VPN/Network team (DNS issue) |
| `tcp_connectivity_blocked` (public IP) | Firewall/ISP team (outbound port blocked) |
| `private_endpoint_network_path_blocked` | Network team (PE routing issue) |
| `rbac_insufficient` | Cosmos DB Access Control team |
| `azure_config_check_skipped` | Customer: Run `az login` first |
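The same routing can be applied automatically to a saved report. The lookup below mirrors the routing table. `Get-SupportRoute` is a hypothetical helper, and any code it does not recognize is left for manual triage.

```powershell
# Hypothetical helper: map a classification code to the team that should own the issue.
function Get-SupportRoute {
    param([Parameter(Mandatory)] [string] $Code)
    $routes = @{
        'network_connectivity_healthy'          = 'Application/Auth team (network verified working)'
        'dns_resolution_failed'                 = 'VPN/Network team (DNS issue)'
        'tcp_connectivity_blocked'              = 'Firewall/ISP team (outbound port blocked)'
        'private_endpoint_network_path_blocked' = 'Network team (PE routing issue)'
        'rbac_insufficient'                     = 'Cosmos DB Access Control team'
        'azure_config_check_skipped'            = 'Customer: run az login first'
    }
    if ($routes.ContainsKey($Code)) { return $routes[$Code] }
    return "Unrecognized classification code '$Code': triage the JSON report manually"
}
```

With a parsed report, call `Get-SupportRoute -Code $report.classification.code`.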
---
## Version
**Script Version:** 1.0.0
**Schema Version:** 1.0.0
**Last Updated:** 2026-05-13
---
## License
This script is provided as-is for diagnosing Cosmos DB connectivity issues. See [LICENSE](../../LICENSE) for terms.
---
## Next Steps
1. **Run the script:** `.\Diagnose-CosmosConnectivity.ps1 -Interactive`
2. **Review output:** Check the JSON report and console summary
3. **Follow recommended actions** based on the classification code
4. **Share with support** if needed (use `-Redact` for sensitive data masking)
For questions or issues with the script itself, contact the Cosmos DB team.
# Cosmos DB Connectivity Diagnostic - Test Scenarios
## Overview
This document defines test scenarios, expected outcomes, and validation procedures for the diagnostic script. Use these to verify script functionality across different network configurations.
---
## Test Infrastructure Setup
### Prerequisites
- Test Cosmos DB accounts in multiple configurations:
- Public endpoint only
- Private endpoint only
- Both public + private endpoints
- Test networks:
- Clean network (no corporate proxy/VPN)
- Behind corporate proxy
- Behind VPN (if possible)
- Restricted network (firewall blocking 443)
---
## Test Scenarios
### Scenario 1: Healthy Public Endpoint (All Checks Pass)
**Setup:**
- Cosmos account with public endpoint enabled
- Running from clean network (no VPN/proxy)
- Azure CLI authenticated (optional)
**Run:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://test-public-01.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "test-cosmos-rg" `
-AccountName "test-public-01"
```
**Expected Results:**
- ✅ DNS resolution: `succeeded = true`
- ✅ TCP connectivity: `succeeded = true`
- ✅ HTTPS probe: `statusCode = 401` (expected without auth)
- ✅ Private network: `isPrivateRange = false`
- ✅ Classification: `status = "success"`, `code = "network_connectivity_healthy"`
**Validation Checklist:**
- [ ] Console shows "✓ PASS" for DNS and TCP
- [ ] Recommended Actions mention checking RBAC/auth
- [ ] JSON file created successfully
- [ ] Latency values are reasonable (< 1000ms)
---
### Scenario 2: DNS Resolution Failure
**Setup:**
- Network with DNS resolver that blocks documents.azure.com
- OR simulate by providing invalid hostname
**Run:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://invalid-account-xyz123.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "test-cosmos-rg" `
-AccountName "invalid-account-xyz123"
```
**Expected Results:**
- ❌ DNS resolution: `succeeded = false`, `error = "No such host is known"` (Windows wording; the message differs on Linux/macOS)
- ❌ TCP connectivity: `succeeded = false`
- ❌ Classification: `status = "failure"`, `code = "dns_resolution_failed"`
**Validation Checklist:**
- [ ] Console shows "✗ FAIL" for DNS
- [ ] Error message is clear
- [ ] Root cause in classification mentions DNS/VPN/proxy
- [ ] Recommended actions include running manual `nslookup`
- [ ] JSON contains error details
---
### Scenario 3: TCP Blocked (Public Endpoint)
**Setup:**
- Network with firewall blocking outbound port 443 to documents.azure.com
- DNS resolves successfully but TCP fails
**Run:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://test-public-02.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "test-cosmos-rg" `
-AccountName "test-public-02"
```
**Expected Results:**
- ✅ DNS resolution: `succeeded = true`
- ❌ TCP connectivity: `succeeded = false`, `error = "Connection timeout after 5000ms"`
- ❌ HTTPS probe: `statusCode = null`, `error contains "timeout"`
- ✅ Private network: `isPrivateRange = false`
- ❌ Classification: `status = "failure"`, `code = "tcp_connectivity_blocked"`
**Validation Checklist:**
- [ ] DNS shows success, TCP shows timeout
- [ ] Console summary distinguishes DNS success from TCP failure
- [ ] Root cause mentions firewall/ISP/proxy
- [ ] Recommended actions include corporate network contact
- [ ] Timeout latency is approximately 5000ms
---
### Scenario 4: Healthy Private Endpoint
**Setup:**
- Cosmos account with private endpoint configured
- Client connected to VPN that can route to PE
- PE IP known and provided
**Run:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://test-private-01.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "test-cosmos-rg" `
-AccountName "test-private-01" `
-PrivateEndpointIP "10.123.171.30"
```
**Expected Results:**
- ✅ DNS resolution: `succeeded = true`, `addresses = ["10.123.171.30"]`
- ✅ TCP connectivity: `succeeded = true`
- ✅ Private network: `isPrivateRange = true`, `matchesExpectedPrivateEndpoint = true`
- ✅ Azure config: `publicNetworkAccessRestricted = true` (if checked)
- ✅ Classification: `status = "success"`, `code = "network_connectivity_healthy"`
**Validation Checklist:**
- [ ] DNS resolves to private IP (10.x)
- [ ] TCP succeeds to private IP
- [ ] Indicators correctly identify private endpoint
- [ ] Expected PE IP matches resolved IP
- [ ] Classification recognizes healthy private path
---
### Scenario 5: Private Endpoint Network Path Blocked
**Setup:**
- Private endpoint configured
- Client on VPN but routing to PE subnet is blocked
- DNS resolves to PE IP but TCP times out
**Run:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://test-private-02.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "test-cosmos-rg" `
-AccountName "test-private-02" `
-PrivateEndpointIP "10.123.171.30"
```
**Expected Results:**
- ✅ DNS resolution: `succeeded = true`, `addresses = ["10.123.171.30"]`
- ❌ TCP connectivity: `succeeded = false`, `error = "Connection timeout after 5000ms"`
- ✅ Private network: `isPrivateRange = true`, `matchesExpectedPrivateEndpoint = true`, `vpnRouteWarning != null`
- ❌ Classification: `status = "failure"`, `code = "private_endpoint_network_path_blocked"`
**Validation Checklist:**
- [ ] DNS resolves to expected PE IP
- [ ] TCP to PE IP fails with timeout
- [ ] VPN route warning is populated
- [ ] Classification correctly identifies PE path issue
- [ ] Recommended actions mention network team + routing
- [ ] Source IP is captured (if available)
---
### Scenario 6: RBAC Insufficient
**Setup:**
- Network connectivity is working
- Azure CLI authenticated as user with limited RBAC (e.g., only Reader role)
- Account queried successfully
**Run:**
```powershell
az login # Login as limited user first
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://test-rbac-01.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "test-cosmos-rg" `
-AccountName "test-rbac-01"
```
**Expected Results:**
- ✅ DNS resolution: `succeeded = true`
- ✅ TCP connectivity: `succeeded = true`
- ✅ HTTPS probe: `statusCode = 401` or `200`
- ❌ RBAC: `classification = "insufficient"`, `canReadAccount = false`
- ⚠️ Classification: `status = "warning"`, `code = "rbac_insufficient"`
**Validation Checklist:**
- [ ] Network checks all pass
- [ ] RBAC assessment shows limited permissions
- [ ] Classification code is `rbac_insufficient`
- [ ] Recommended actions mention role assignment
- [ ] Error message explains what permissions are missing
---
### Scenario 7: Azure CLI Not Authenticated
**Setup:**
- All network checks work fine
- Azure CLI not installed OR not authenticated
**Run:**
```powershell
# Without running az login first
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://test-public-03.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "test-cosmos-rg" `
-AccountName "test-public-03"
```
**Expected Results:**
- ✅ DNS resolution: `succeeded = true`
- ✅ TCP connectivity: `succeeded = true`
- ⚠️ Azure CLI: `authenticated = false`, `error = "Not authenticated with Azure CLI. Run 'az login' to proceed."`
- ⚠️ Azure config: `checked = false`, `error = "Skipped"`
- ⚠️ Classification: May reference `azure_config_check_skipped` in warnings
**Validation Checklist:**
- [ ] Network checks complete normally
- [ ] Azure CLI context shows unauthenticated
- [ ] Console warning mentions `az login`
- [ ] Recommended actions suggest re-running after authentication
- [ ] Script doesn't crash; gracefully continues
---
### Scenario 8: Interactive Mode Input Flow
**Setup:**
- User runs script with -Interactive flag
- Has all inputs ready
**Run:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 -Interactive
```
**Expected Sequence:**
1. Show input instructions with Portal navigation guide
2. Prompt: "Endpoint URL (e.g., https://my-cosmos.documents.azure.com)"
3. Validate input format; re-prompt if invalid
4. Prompt: "Subscription ID (12345678-...)"
5. Validate GUID format; re-prompt if invalid
6. Prompt: "Resource Group name"
7. Prompt: "Account Name"
8. Prompt: "Private Endpoint IP (optional, press Enter to skip)"
9. Prompt: "VPN Subnet Range (optional, press Enter to skip)"
10. Run diagnostics
11. Display results
**Validation Checklist:**
- [ ] Input instructions are clear and helpful
- [ ] Format validation rejects invalid inputs
- [ ] Optional fields can be skipped (Enter key)
- [ ] All inputs accepted without error
- [ ] Diagnostics run successfully after inputs collected
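The validate-and-re-prompt behavior in steps 4 and 5 can be approximated as below. The same check is handy for the "Invalid GUID format rejected" regression item. Both function names are hypothetical. Requiring the hyphenated `D` GUID format is an assumption about what the script accepts.

```powershell
# Hypothetical sketch of the "validate GUID; re-prompt if invalid" loop (steps 4-5).
function Test-SubscriptionIdFormat {
    param([string] $Value)
    $parsed = [guid]::Empty
    # Require the hyphenated 8-4-4-4-12 form ("D" format) shown in the prompt
    return [guid]::TryParseExact($Value, 'D', [ref] $parsed)
}

function Read-SubscriptionId {
    while ($true) {
        $value = (Read-Host 'Subscription ID (12345678-...)').Trim()
        if (Test-SubscriptionIdFormat $value) { return $value }
        Write-Warning 'Invalid format. Expected a GUID such as 12345678-1234-1234-1234-123456789012.'
    }
}
```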
---
### Scenario 9: Non-Interactive with Redaction
**Setup:**
- Run with -Redact flag
- Collect JSON output
**Run:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
    -EndpointUrl "https://test-public-04.documents.azure.com" `
    -SubscriptionId "12345678-1234-1234-1234-123456789012" `
    -ResourceGroup "test-cosmos-rg" `
    -AccountName "test-public-04" `
    -Redact
# Load the newest saved report instead of filtering console output
$json = Get-ChildItem -Path . -Filter 'cosmos-diagnostic-*.json' |
    Sort-Object LastWriteTime | Select-Object -Last 1 |
    Get-Content -Raw | ConvertFrom-Json
```
**Expected Results:**
- ✅ JSON output completes successfully
- ✅ Target section: `subscriptionId = "REDACTED-SUBSCRIPTION-ID"`
- ✅ Target section: `resourceGroup = "REDACTED"`
- ✅ Target section: `accountName = "REDACTED"`
- ✅ Hostname is NOT redacted (needed for triage): `hostname = "test-public-04.documents.azure.com"`
- ✅ Azure CLI: `currentUser = "REDACTED-USER-NAME"`
- ✅ Azure CLI: `currentTenant = "REDACTED-TENANT-ID"`
**Validation Checklist:**
- [ ] Sensitive fields masked with `REDACTED` placeholders (`REDACTED` or `REDACTED-*`)
- [ ] Hostname NOT masked
- [ ] JSON still parseable
- [ ] Redaction doesn't break classification
- [ ] All RBAC role names preserved (not redacted)
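The target-section checks in this scenario can be automated against the parsed report. `Get-RedactionProblem` is a hypothetical helper. It covers only the `target` fields named above, because the exact JSON path of the Azure CLI fields is not given here (see DIAGNOSTIC_SCHEMA.md).

```powershell
# Hypothetical helper: emit one message per redaction problem in a parsed -Redact report.
function Get-RedactionProblem {
    param([Parameter(Mandatory)] $Report)
    foreach ($field in 'subscriptionId', 'resourceGroup', 'accountName') {
        # Accept both "REDACTED" and "REDACTED-*" placeholders
        if ([string] $Report.target.$field -notlike 'REDACTED*') { "target.$field is not masked" }
    }
    # The hostname must stay readable for triage
    if ([string] $Report.target.hostname -like 'REDACTED*') { 'target.hostname should not be masked' }
}
```

Wrap the call in `@()` to get a countable list, for example `@(Get-RedactionProblem -Report $json).Count -eq 0`.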
---
### Scenario 10: Private Endpoint IP Mismatch
**Setup:**
- Private endpoint exists but expected IP is different from resolved IP
- Can happen if PE reconfigured or DNS zone stale
**Run:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 `
-EndpointUrl "https://test-private-03.documents.azure.com" `
-SubscriptionId "12345678-1234-1234-1234-123456789012" `
-ResourceGroup "test-cosmos-rg" `
-AccountName "test-private-03" `
-PrivateEndpointIP "10.123.171.99" # Expected IP (not matching actual)
```
**Expected Results (if actual PE IP is 10.123.171.30):**
- ✅ DNS resolution: `succeeded = true`, `addresses = ["10.123.171.30"]`
- ✅ TCP connectivity: `succeeded = true` (connects to actual PE)
- ⚠️ Private network: `matchesExpectedPrivateEndpoint = false`, `indicators contains "WARNING: Resolved to 10.123.171.30 but expected ..."`
- ⚠️ Classification: May include `private_endpoint_mismatch` warning
**Validation Checklist:**
- [ ] Mismatch detected
- [ ] Warning includes both expected and actual IPs
- [ ] TCP still attempts with actual resolved IP
- [ ] Classification identifies discrepancy
- [ ] Recommended actions mention checking PE config
---
### Scenario 11: Latency Metrics
**Setup:**
- Healthy connection
- Measure and log latency values
**Run:**
```powershell
.\Diagnose-CosmosConnectivity.ps1 -EndpointUrl "..." -SubscriptionId "..." ...
$json = Get-ChildItem -Path . -Filter 'cosmos-diagnostic-*.json' |
    Sort-Object LastWriteTime | Select-Object -Last 1 | Get-Content -Raw | ConvertFrom-Json
$json.diagnostics.dns.latencyMs
$json.diagnostics.tcp.latencyMs
$json.diagnostics.https.latencyMs
```
**Expected Results:**
- DNS latency: 10-100ms (typical)
- TCP latency: 20-200ms (depends on network)
- HTTPS latency: 50-500ms (full round trip)
- All values > 0 and < 10000 (reasonable)
**Validation Checklist:**
- [ ] Latency values are integers (milliseconds)
- [ ] Values are reasonable for network conditions
- [ ] No values are unrealistic (0 or > 60000)
- [ ] Timeouts show latencyMs = 0
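The range checks in this scenario can be scripted as well. `Get-LatencyProblem` is a hypothetical helper. Its default 10000 ms ceiling comes from the expected results above. It is intended for healthy runs only, since timed-out checks are not expected to fall in this range.

```powershell
# Hypothetical helper: emit one message per latency value that fails the healthy-run checks.
function Get-LatencyProblem {
    param(
        [Parameter(Mandatory)] $Report,
        [int] $MaxMs = 10000
    )
    foreach ($check in 'dns', 'tcp', 'https') {
        $latency = $Report.diagnostics.$check.latencyMs
        # ConvertFrom-Json yields Int32 or Int64 for whole numbers depending on PowerShell version
        if ($latency -isnot [int] -and $latency -isnot [long]) {
            "$check.latencyMs is not an integer"
        }
        elseif ($latency -le 0 -or $latency -ge $MaxMs) {
            "$check.latencyMs=$latency is outside (0, $MaxMs) ms"
        }
    }
}
```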
---
### Scenario 12: Multiple Endpoints (Batch Testing)
**Setup:**
- Multiple accounts to test
- Non-interactive batch mode
**Run:**
```powershell
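# Keys match the script's parameter names so each hashtable can be splatted with @acct
$accounts = @(
    @{EndpointUrl="https://account1.documents.azure.com"; SubscriptionId="..."; ResourceGroup="rg1"; AccountName="account1"},
    @{EndpointUrl="https://account2.documents.azure.com"; SubscriptionId="..."; ResourceGroup="rg2"; AccountName="account2"},
    @{EndpointUrl="https://account3.documents.azure.com"; SubscriptionId="..."; ResourceGroup="rg3"; AccountName="account3"}
)
$results = foreach ($acct in $accounts) {
    .\Diagnose-CosmosConnectivity.ps1 @acct | Out-Null
    # Load the report this run just wrote (newest file in the current directory)
    $json = Get-ChildItem -Path . -Filter 'cosmos-diagnostic-*.json' |
        Sort-Object LastWriteTime | Select-Object -Last 1 | Get-Content -Raw | ConvertFrom-Json
    [pscustomobject]@{
        Account        = $acct.AccountName
        Classification = $json.classification.code
        DNS            = $json.diagnostics.dns.succeeded
        TCP            = $json.diagnostics.tcp.succeeded
    }
}
$results | Format-Table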
```
**Expected Results:**
- All accounts processed without error
- JSON output captured for each
- Results table shows aggregated status
- Classification codes vary based on network conditions
**Validation Checklist:**
- [ ] Batch processing completes
- [ ] All JSON files created
- [ ] No cross-account contamination
- [ ] Timestamp differs for each run
---
## Regression Test Checklist
Use this checklist before each release:
- [ ] **Script Execution**
- [ ] Interactive mode completes
- [ ] Non-interactive mode with all parameters
- [ ] Redaction flag works
- [ ] Help/documentation displays correctly
- [ ] **Network Diagnostics**
- [ ] DNS resolution succeeds on good network
- [ ] DNS resolution fails on blocked network
- [ ] TCP succeeds on open port
- [ ] TCP times out on blocked port
- [ ] HTTPS probe returns status code
- [ ] **Private Endpoints**
- [ ] Detects private IP ranges correctly
- [ ] Compares against expected PE IP
- [ ] Handles PE IP mismatches gracefully
- [ ] **Azure Integration**
- [ ] Works with authenticated Azure CLI
- [ ] Gracefully handles unauthenticated state
- [ ] Queries account config successfully
- [ ] RBAC assessment runs
- [ ] **JSON Output**
- [ ] Valid JSON syntax
- [ ] All expected fields present
- [ ] Field values are correct types
- [ ] Redacted fields are properly masked
- [ ] **Classification**
- [ ] Success code for healthy network
- [ ] DNS failure code for DNS issues
- [ ] TCP failure code for blocked ports
- [ ] PE path blocked code for PE issues
- [ ] RBAC code for permission issues
- [ ] **Documentation**
- [ ] Recommended actions are actionable
- [ ] Error messages are helpful
- [ ] Output is readable and organized
- [ ] **Edge Cases**
- [ ] Invalid URL format rejected
- [ ] Invalid GUID format rejected
- [ ] Timeout handling works
- [ ] No unhandled exceptions
---
## Performance Expectations
| Operation | Expected Time | Timeout |
|-----------|---|---|
| DNS resolution | 10-100ms | 5000ms |
| TCP connect | 20-200ms | 5000ms |
| HTTPS probe | 50-500ms | 5000ms |
| Azure CLI queries | 1-5 seconds | 10000ms |
| Full script (good network) | 10-20 seconds | N/A |
| Full script (blocked port) | ~5 seconds | N/A |
---
## Success Criteria
A test scenario passes if:
1. ✅ Script completes without unhandled exceptions
2. ✅ JSON output is valid and contains all expected fields
3. ✅ Classification code matches expected scenario
4. ✅ Recommended actions are relevant to the issue
5. ✅ Latency values are reasonable
6. ✅ Redaction (if enabled) properly masks sensitive fields
---
## Sign-Off
**QA Tester:** _________________ **Date:** _________
**Reviewed By:** _________________ **Date:** _________
**Approved for Release:** _________________ **Date:** _________
---
## Version
- **Script Version:** 1.0.0
- **Test Plan Version:** 1.0.0
- **Last Updated:** 2026-05-13