@gladjohn gladjohn commented Oct 8, 2025

Added detailed caching strategy and resilience plan for Managed Identity v2, including problem identification, proposed solutions, call sequence, cache renewal matrix, invalidation rules, and security considerations.

@gladjohn gladjohn requested a review from a team as a code owner October 8, 2025 15:43
---

## Solution (What’s Changing)
1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**.
Member

what is "link-local" ?

Contributor Author

Just the local endpoint, i.e. the IMDS endpoint on its link-local address.


## Solution (What’s Changing)
1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**.
2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs.
Member

What does it mean "treat as primary anchor". Pls use more precise wording.


## Solution (What’s Changing)
1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**.
2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs.
Member

We should not hardcode any expirations. We rely on services returning expirations.

Contributor Author

added

## Solution (What’s Changing)
1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**.
2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs.
3. **Proactive renewal at half-life (+ small jitter)** to rotate well before expiry.
Member

Pls be precise. Specify:

  • jitter (e.g. 5 min)
  • if renewal should happen on front-end or back-end thread. I think front-end.

Contributor

How is jitter calculated? Is it randomized per host/process or globally coordinated? Could jitter introduce any unintended renewal delays?
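
For illustration only, here is a minimal sketch of how "half-life + jitter" could be computed per process. It assumes the validity window is read from the certificate itself (nothing hardcoded, per the comment above) and uses the 5-minute jitter bound suggested as an example in this thread. `renewal_time`, `max_jitter`, and `rng` are hypothetical names, not MSAL APIs.

```python
import random
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, max_jitter=timedelta(minutes=5), rng=random):
    """Return the instant at which this process should try to renew.

    The base is the midpoint of the certificate's validity window; a random,
    per-process jitter in [0, max_jitter] is added so that processes do not
    all renew at the same moment. The result never exceeds not_after.
    """
    if not_after <= not_before:
        raise ValueError("not_after must be later than not_before")
    half_life = not_before + (not_after - not_before) / 2
    jitter = timedelta(seconds=rng.uniform(0, max_jitter.total_seconds()))
    return min(half_life + jitter, not_after)


if __name__ == "__main__":
    issued = datetime.now(timezone.utc)
    expires = issued + timedelta(days=7)  # example window; real values come from the cert
    print("renew at:", renewal_time(issued, expires))
```

Because the jitter is only ever added after the midpoint and is capped by `not_after`, it can delay renewal by at most `max_jitter` and never past expiry.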

1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**.
2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs.
3. **Proactive renewal at half-life (+ small jitter)** to rotate well before expiry.
4. **Single-writer coordination** so only one process issues/renews; others reuse the same cert.
Member

If you want cross-process coordination, please specify the IPC that is going to be used. This needs to exist on Windows and Linux and it needs to be available in sanctioned libraries across all supported MSAL languages.

Contributor

How is the ‘single-writer’ selected? Is this a file lock, named mutex, or other mechanism? What happens if the single-writer crashes mid-renewal?

Contributor Author

Yes, this is cross-process (shared file + shared cert store).
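
The thread does not say which primitive selects the single writer. As one possible shape (an assumption, not the PR's design), a lock file created with exclusive-create semantics works on both Windows and Linux, and a staleness check covers a writer that crashed mid-renewal. `try_acquire_lock`, `release_lock`, and `stale_after_s` are hypothetical names.

```python
import os
import time


def try_acquire_lock(lock_path, stale_after_s=60.0):
    """Try to become the single writer. Returns True if this process now holds the lock.

    The lock is a file created with O_CREAT | O_EXCL, which fails if the file
    already exists. A lock older than stale_after_s is treated as abandoned
    (for example, its holder crashed) and is broken once before retrying.
    Note: breaking a stale lock is itself racy; this sketch tolerates that.
    """
    for _ in range(2):
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            try:
                age = time.time() - os.path.getmtime(lock_path)
            except FileNotFoundError:
                continue  # holder released in between; try again
            if age <= stale_after_s:
                return False  # a live writer is renewing; reuse the cached cert
            try:
                os.remove(lock_path)
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))  # record the owner for diagnostics
        return True
    return False


def release_lock(lock_path):
    """Release the lock; safe to call even if the file is already gone."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

Processes that fail to acquire the lock would skip issuance and keep using the certificate already in the shared store.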

2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs.
3. **Proactive renewal at half-life (+ small jitter)** to rotate well before expiry.
4. **Single-writer coordination** so only one process issues/renews; others reuse the same cert.
5. **MAA token** is used **only** for issuance/renewal; short-lived cache to avoid repeated attestation calls.
Member

Not enough precision. What does it mean "short-lived" cache?

Contributor

What’s the cache invalidation logic if a policy or key rotation occurs on the MAA side? Is there a way to force re-attestation?

Contributor Author

> Not enough precision. What does it mean "short-lived" cache?

Added details to it.

> What’s the cache invalidation logic if a policy or key rotation occurs on the MAA side? Is there a way to force re-attestation?

MSAL creates the key and sends it to MAA for attestation.

```
Call 0 (local): Probe IMDS v2 → cache MSI source (V2/V1)
1 (local): Create KeyGuard key (per reboot)
2 (external): Get MAA token // only for (re)issuing cert
```
Member

what is local and what is external ?

Contributor Author

Local means the IMDS endpoints (link-local) or anything local to the machine. External means ESTS/MAA in this case.

| Item | Scope | Where | TTL | Notes |
|---|---|---|---|---|
| **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here |
| **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter |
Member

How are you going to deal with atomicity, multiple file writers, and a process that gets killed in the middle of a write?

Contributor Author

Updated the MAA token row

Member

@bgavrilMS left a comment

Not enough details.

Contributor

@Robbie-Microsoft left a comment

  • For each cache and renewal step, document what happens if the cache is missing, invalid, or corrupted.
  • Outline (even briefly) the implementation details of the single-writer system.

# Managed Identity v2 (Attested TB) — Resilience & Caching Plan

## TL;DR
We reduce cold-start latency and dependency risk for MSI v2 by caching safe, long-lived artifacts, coordinating renewal across processes, and keeping the hot path in memory. **MAA is used only to (re)issue the binding certificate**; bound AT acquisition relies on that cert. Result: fewer failures, less churn, smoother CX.
Contributor

What’s the fallback if the binding cert is lost or corrupted? Is there any emergency recovery path?

```
Call 0 (local): Probe IMDS v2 → cache MSI source (V2/V1)
1 (local): Create KeyGuard key (per reboot)
2 (external): Get MAA token // only for (re)issuing cert
```
Contributor

Are there retries or backoff strategies if the MAA call fails? Is exponential backoff used or is it a fixed retry policy?

Contributor Author

Updated the details, please check.
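
The retry policy itself is not quoted in this thread. Purely as a reference point for the exponential-versus-fixed question, a capped exponential backoff with full jitter might look like the sketch below. Every name and default here (`call_with_backoff`, `max_attempts`, `base_delay_s`, `max_delay_s`) is an assumption, not the policy the PR adopted.

```python
import random
import time


def call_with_backoff(operation, max_attempts=4, base_delay_s=0.5,
                      max_delay_s=8.0, sleep=time.sleep, rng=random):
    """Call operation() until it succeeds or max_attempts is reached.

    Between attempts, wait a random time in [0, cap], where cap doubles on
    each failure and is bounded by max_delay_s. The last failure is re-raised.
    Real code would retry only on transient errors, not on every exception.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(rng.uniform(0, cap))
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real delays.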

| Item | Scope | Where | TTL | Notes |
|---|---|---|---|---|
| **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here |
| **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter |
Contributor

How are policy changes detected? Is it polled, pushed, or inferred from failures?

Contributor Author

We don’t poll.

We don’t get a push.

We infer from failures: when MAA/IMDS/eSTS return specific attestation/policy/key errors, we treat that as “policy changed, invalidate cache and re‑attest”.
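
A minimal sketch of that "infer from failures" rule, assuming a per-keyHandle cache as in the table. The error codes below are placeholders invented for illustration; the real ones would come from the MAA/IMDS/eSTS contracts. `AttestationCache` is a hypothetical class, not an MSAL type.

```python
# Hypothetical error codes that signal "policy or key changed".
REATTEST_ERRORS = frozenset({"attestation_failed", "policy_mismatch", "invalid_key"})


class AttestationCache:
    """In-memory stand-in for the MAA token cache, keyed by keyHandle."""

    def __init__(self):
        self._tokens = {}

    def put(self, key_handle, token):
        self._tokens[key_handle] = token

    def get(self, key_handle):
        return self._tokens.get(key_handle)

    def on_service_error(self, key_handle, error_code):
        """Evict the cached token if the error implies it is no longer valid.

        Returns True when the caller should re-attest, False when the error
        is unrelated (for example throttling) and the cache should be kept.
        """
        if error_code in REATTEST_ERRORS:
            self._tokens.pop(key_handle, None)
            return True
        return False
```

Keeping the classification in one set makes it explicit which failures force re-attestation and which do not.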

|---|---|---|---|---|
| **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here |
| **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter |
| **Binding cert + `/issuecredential` metadata** | Per **Managed Identity per user context** | Persisted (Win: `CurrentUser\My`; Linux: protected file/PEM) | ~7 days | Renew at **half-life + jitter**; Serialize issuance |
Contributor

What protects against file corruption or unauthorized access on Linux? Is there a fallback if the file is deleted outside of MSAL?

Contributor Author

There’s no MAA on Linux, so the only persisted artifact there is the binding cert + metadata.

Contributor Author

If that file is deleted or corrupted outside MSAL, the next read fails validation and we just treat it as a cache miss.
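
To make "fails validation, treat as a cache miss" concrete, here is a hedged sketch of a file cache that writes atomically and validates on read. The JSON-plus-SHA-256 record layout and the function names are assumptions for illustration, not the format used by the PR.

```python
import hashlib
import json
import os
import tempfile


def write_cache_atomic(path, payload):
    """Write payload (a JSON-serializable dict) so readers never see a partial file."""
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    record = json.dumps({"sha256": digest, "body": body})
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file readable and writable only by the current user on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise


def read_cache(path):
    """Return the cached dict, or None (cache miss) if missing, unreadable, or corrupted."""
    try:
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        body = record["body"]
        if hashlib.sha256(body.encode("utf-8")).hexdigest() != record["sha256"]:
            return None
        return json.loads(body)
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        return None
```

A process killed mid-write leaves only a stray temporary file behind; the previous cache file stays intact because the rename happens last.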

| **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here |
| **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter |
| **Binding cert + `/issuecredential` metadata** | Per **Managed Identity per user context** | Persisted (Win: `CurrentUser\My`; Linux: protected file/PEM) | ~7 days | Renew at **half-life + jitter**; Serialize issuance |
| **Access tokens (`bearer` or `mtls_pop`)** | Per audience | In memory | Service-configured | Reacquire after reboot (new key) |
Contributor

Are there scenarios where token invalidation lags behind key rotation? How does the system ensure that stale tokens aren’t accidentally reused?

Contributor Author

Not sure I understand this. When do we rotate keys?

## Invalidation Rules
- **Reboot** → Use **persisted binding cert** to fetch new ATs; re-attest on first demand on service failure.
- **Cert expiry** → re-issue.
- **MAA token expired** → re-attest and re-issue.
Contributor

Are there built-in safeguards to prevent a thundering herd if all processes notice expiry at the same time?


## Why This Improves CX
- **MAA is out of the hot path**—steady-state calls rely on a **multi-day binding cert**.
- Different identities on the same VM reuse the **cached MAA token**.
Contributor

Is the cache not keyed per identity?

Contributor Author

It is; updated.

Updated the caching strategy for MSI v2 to enhance resilience and reduce cold-start latency. Key changes include improved certificate renewal processes and better caching mechanisms.