Designing Azure Virtual Desktop: Principles That Actually Matter
Why build AVD, and what problems does it actually solve?
Azure Virtual Desktop has matured significantly since its initial release. What was once a complex, sprawling service requiring deep Azure networking expertise has evolved into a genuinely viable enterprise VDI platform. But maturity in the product doesn't guarantee maturity in implementations. I've seen AVD deployments that replicate every anti-pattern from legacy Citrix environments, and others that embrace cloud-native principles to deliver something genuinely transformative.
This post distils the architectural principles I apply when designing AVD solutions. It's not a step-by-step deployment guide — Microsoft's documentation covers that adequately. Instead, it focuses on the decisions that determine whether your AVD platform becomes an asset or a liability.
The Fundamental Shift: Cloud-Native, Not Lift-and-Shift
The most common mistake in AVD design is treating it as "Citrix in Azure." Organisations take their existing VDI architecture — ExpressRoute backhaul, domain-joined hosts, on-premises file servers for profiles — and recreate it in Azure. This approach delivers the worst of both worlds: cloud costs with on-premises constraints.
A well-designed AVD platform should be cloud-native by default. This means Entra ID-joined session hosts (not hybrid-joined), Azure Files for profile storage (not on-premises file servers accessed over ExpressRoute), and connectivity that doesn't depend on your corporate WAN.
The cloud-native approach eliminates dependencies that create fragility. When your AVD platform doesn't need ExpressRoute to function, you've removed a significant failure domain. When authentication happens against Entra ID rather than domain controllers in a distant data centre, you've improved both resilience and user experience.
This doesn't mean zero connectivity to on-premises resources. Legacy applications still exist, and users need to access them. But that access should be application-level, through services like Zscaler Private Access or Azure Private Link, rather than network-level through broad VPN or ExpressRoute connectivity. The session host itself remains cloud-native; legacy access is an overlay, not a foundation.
Landing Zone Design: Standardisation Over Flexibility
AVD landing zones often suffer from organic growth. Department A gets a subscription with their preferred configuration. Department B gets another with slightly different policies. Before long, you have a fragmented estate where every deployment is a special case.
The solution is to treat AVD landing zones as standardised patterns, enforced through Azure Policy at the management group level. Departments and regions may have separate subscriptions for billing and RBAC purposes, but the technical implementation should be identical and portable.
Security baselines, tagging standards, and RBAC are defined once and enforced at the management group level, so every AVD subscription inherits the same guardrails rather than reinventing them.
The management subscription hosts shared services: orchestration tooling, monitoring, secrets management, and image pipelines. AVD subscriptions consume these services but don't define their own patterns. This separation ensures that when you need to deploy AVD in a new region or for a new business unit, you're instantiating a known pattern rather than designing from scratch.
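As a simple illustration, the pattern itself can be captured as data and each subscription checked against it. The tag names, policy initiative names, and data shape below are placeholders, not a recommended baseline:

```python
# Hypothetical sketch: the landing-zone "pattern" captured as data, so any AVD
# subscription's effective configuration can be diffed against it. All names
# and values are illustrative only.
STANDARD_PATTERN = {
    "required_tags": {"costCentre", "environment", "owner"},
    "policy_initiatives": {"avd-security-baseline", "avd-diagnostics", "avd-allowed-skus"},
}

def deviations(subscription: dict) -> list[str]:
    """Return the ways a subscription deviates from the standard pattern."""
    issues = []
    missing_tags = STANDARD_PATTERN["required_tags"] - set(subscription.get("tags", []))
    if missing_tags:
        issues.append(f"missing tags: {sorted(missing_tags)}")
    missing_policies = STANDARD_PATTERN["policy_initiatives"] - set(subscription.get("policies", []))
    if missing_policies:
        issues.append(f"missing policy assignments: {sorted(missing_policies)}")
    return issues

print(deviations({"tags": ["owner"], "policies": ["avd-security-baseline"]}))
```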
Sizing: The PCU Model
Capacity planning for VDI is notoriously difficult. Traditional approaches often result in either over-provisioning (wasting money) or under-provisioning (degraded user experience). The Peak Concurrent User (PCU) model provides a structured framework for getting it right.
The core formula is straightforward:
PCU = Headcount × ActivityFactor × OverlapFactor × PeakUplift
- **ActivityFactor**: the proportion of users expected to use AVD on any given day. Not everyone logs in every day; contractors might use the platform twice a week, while analysts use it daily.
- **OverlapFactor**: the proportion of daily users online simultaneously at peak. Even if 500 people use the platform in a day, they're not all online at 9:15am.
- **PeakUplift**: typically 1.15–1.25, accounting for short-term spikes beyond normal peaks.
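A worked example with purely illustrative inputs (a 500-person division, 80% daily activity, 60% overlap, 1.2 uplift):

```python
# PCU = Headcount x ActivityFactor x OverlapFactor x PeakUplift
headcount = 500
activity_factor = 0.8    # proportion of users active on a given day
overlap_factor = 0.6     # proportion of daily users online simultaneously at peak
peak_uplift = 1.2        # headroom for short-term spikes beyond normal peaks

pcu = headcount * activity_factor * overlap_factor * peak_uplift
print(f"Peak concurrent users: {pcu:.0f}")   # 288
```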
Once you have PCU, you need to map it to host pool capacity. This requires defining user personas based on workload intensity:
| Persona | Typical Workload | VM SKU | Users per Host |
|---|---|---|---|
| Occasional | Basic productivity, web apps | D8s_v5 (8 vCPU / 32 GB) | 10-12 |
| Regular | Office apps, light data work | D8s_v5 (8 vCPU / 32 GB) | 8-10 |
| High | Analytics, development, heavy apps | D16s_v5 (16 vCPU / 64 GB) | 4-6 |
With personas defined, calculate the base host count for each persona, add a 15% buffer for drain and failover scenarios, and configure autoscale bands accordingly. The goal is three bands: minimum (off-peak), core (matches PCU), and peak (PCU plus buffer).
Critically, pre-warm capacity 30-45 minutes before expected peaks. Users arriving at 9am shouldn't wait for VMs to spin up — the capacity should already be warm and ready.
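A rough sketch of turning PCU into host counts and bands, using the densities from the table above; the persona split and off-peak floor are assumptions you would replace with your own data:

```python
import math

# Illustrative only: convert a PCU figure into per-persona host counts and
# autoscale bands. Shares and densities are assumptions, not recommendations.
pcu = 288
personas = {               # share of PCU, users per host (from the table above)
    "occasional": (0.40, 11),
    "regular":    (0.45, 9),
    "high":       (0.15, 5),
}
buffer = 0.15              # drain/failover headroom

core_hosts = sum(math.ceil(pcu * share / density) for share, density in personas.values())
peak_hosts = math.ceil(core_hosts * (1 + buffer))
minimum_hosts = max(2, math.ceil(core_hosts * 0.2))   # off-peak floor, an assumption

print(f"core={core_hosts}, peak={peak_hosts}, minimum={minimum_hosts}")
```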
Profile Management: FSLogix Done Right
FSLogix profile containers are the standard approach for non-persistent AVD desktops, but implementation details matter enormously. Poor FSLogix design manifests as slow logins, profile corruption, and frustrated users.
Storage selection is the first decision. Azure Files Premium with Entra Kerberos authentication is the baseline recommendation. It supports Entra-joined hosts without domain controllers, delivers predictable performance, and keeps traffic on the Azure backbone via Private Endpoints. Azure NetApp Files offers better raw performance but requires AD DS integration, which conflicts with a cloud-native identity strategy.
Sizing for logon storms is critical. The worst-case scenario for profile storage isn't steady-state usage — it's 8:55am when everyone logs in simultaneously. Profile attach times should meet a p95 target of under 15 seconds and p99 of under 25 seconds. If your storage can't sustain this during peak logon, users will notice.
Container sizing varies by user type. Standard users typically need 10–20 GB; heavy Outlook or reporting users may need 20–40 GB. Enable OneDrive Files On-Demand to reduce storage requirements, and configure exclusions for caches that don't need to roam.
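A back-of-the-envelope capacity calculation for the profile share might look like the following; the user mix and growth factor are assumptions to validate against real profile data:

```python
import math

# Rough capacity sketch for the profile share, using the container sizes above.
# User counts and the growth factor are illustrative assumptions.
standard_users, heavy_users = 400, 100
standard_gb, heavy_gb = 20, 40           # upper end of the per-user ranges
growth_factor = 1.3                      # headroom for profile growth over time

required_gb = (standard_users * standard_gb + heavy_users * heavy_gb) * growth_factor
print(f"Provision roughly {math.ceil(required_gb):,} GiB of premium file share capacity")
```

On premium file shares, provisioned capacity also drives baseline IOPS, so sizing with headroom helps absorb the logon storm as well as raw growth.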
Resilience requires multiple layers. Enable soft delete on storage accounts to protect against accidental or malicious deletion. Configure Azure Backup with both snapshots (for fast operational recovery) and vaulted backups (for retention and regional DR). For mission-critical deployments, consider warm standby shares in a paired region with scheduled AzCopy synchronisation.
Application Delivery: MSIX App Attach as Default
Application delivery strategy determines how clean your golden images stay and how quickly you can respond to application changes. MSIX App Attach should be the default model for AVD.
The principle is separation: the golden image contains the operating system, core productivity tools, and mandatory agents. Everything else — line-of-business applications — is delivered dynamically through App Attach. Applications are stored in containers (VHD/VHDX/CIM format) on Azure Files, mounted at user logon, and presented to users based on Entra ID group membership.
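Conceptually, that assignment is just a catalogue of packages keyed by Entra ID group, and it is worth keeping as reviewable data alongside the rest of the platform configuration. A minimal sketch with made-up package, share, and group names:

```python
# Hypothetical catalogue mapping app attach packages to Entra ID groups.
# Package names, share paths, and group names are illustrative only.
APP_CATALOGUE = [
    {"package": "ContosoERP_3.2.1.cim", "share": "\\\\storage\\appattach", "entra_group": "app-erp-users"},
    {"package": "ReportBuilder_11.cim", "share": "\\\\storage\\appattach", "entra_group": "app-reporting-users"},
]

def packages_for(user_groups: set[str]) -> list[str]:
    """Resolve which packages a user should see, based on group membership."""
    return [a["package"] for a in APP_CATALOGUE if a["entra_group"] in user_groups]

print(packages_for({"app-erp-users", "baseline-avd-users"}))
```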
The benefits compound over time. Image updates no longer require coordination with application owners — the OS and apps have independent lifecycles. Application updates can be staged alongside existing versions and cut over without host redeployment. The same MSIX packages created for Intune deployment to physical devices work in AVD, enabling genuine "build once, deploy anywhere" consistency.
Not every application converts cleanly to MSIX. Applications with kernel drivers, complex licensing, or deep system integration may need to be baked into the golden image as exceptions. But these should be documented exceptions with justification, not the default approach.
One critical limitation: user-driven installs from Intune Company Portal don't work in pooled AVD. Company Portal assumes a 1:1 user-to-device relationship, which doesn't exist in multi-session environments. All application delivery must be either machine-assigned (in the image) or dynamically attached (MSIX App Attach).
Image Lifecycle: Rings and Immutability
Golden images should be immutable and versioned. An image, once built and validated, should never be modified in place. Changes result in new versions, and those versions flow through validation rings before reaching production.
A practical ring structure for AVD images:
| Ring | Purpose | Population | Promotion Criteria |
|---|---|---|---|
| Insider | Early validation of upcoming changes | IT volunteers, nominated testers | Build completes, basic smoke tests pass |
| Pilot | Broader validation with real workloads | Representatives from each business function | 1 week stable, no blocking issues |
| Production | General availability | All users | Pilot sign-off, CAB approval |
Each month, a new image is built with cumulative updates, driver updates, and agent updates. This image enters the Insider ring for initial validation. If no blocking issues emerge after a week, it promotes to Pilot. After another week of stable operation with broader user representation, it promotes to Production.
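The promotion gate itself is simple enough to express in a few lines. This sketch assumes the soak periods from the table above and a hypothetical count of blocking issues per ring:

```python
from datetime import date, timedelta

# Sketch of the promotion gate between rings; soak periods mirror the table
# above, and the inputs are hypothetical.
SOAK_DAYS = {"insider": 7, "pilot": 7}

def ready_to_promote(ring: str, deployed_on: date, blocking_issues: int, today: date) -> bool:
    """An image version may promote once it has soaked with no blocking issues."""
    soaked = today - deployed_on >= timedelta(days=SOAK_DAYS[ring])
    return soaked and blocking_issues == 0

print(ready_to_promote("insider", date(2024, 6, 3), 0, date(2024, 6, 11)))  # True
```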
The Insider ring deserves particular attention. Access should be controlled through Entra ID Access Packages, ensuring only nominated testers can access Insider hosts. This prevents accidental assignment and provides auditable governance. Each user group added to AVD should nominate at least two members for Insider testing — this ensures coverage while keeping the blast radius contained.
Azure Compute Gallery provides the versioning infrastructure. Images are stored with version numbers, and host pools reference specific versions. Rollback is straightforward: point the host pool back to the previous version and reimage hosts.
Orchestration: Nerdio as the Operational Layer
Native Azure tooling for AVD is comprehensive but fragmented. Managing images, host pools, autoscaling, and day-to-day operations across the Azure portal, PowerShell, and various APIs creates operational overhead and inconsistency.
Nerdio Manager for Enterprise consolidates these operations into a single interface while keeping Azure as the system of record. It doesn't replace Azure constructs — it orchestrates them. Host pools are still Azure resources; images still live in Azure Compute Gallery; autoscaling still uses Azure's native capabilities. Nerdio provides the workflow layer that makes operating these components manageable at scale.
Key capabilities to leverage:
- **Image lifecycle management**: build, test, and promote images through defined rings with approval gates. Automate the monthly build process while retaining manual promotion control.
- **Autoscaling**: configure scale bands (minimum/core/peak) aligned to the PCU model, pre-warm capacity before peaks, and drain hosts gracefully before shutdown.
- **Host operations**: drain, reimage, resize, and rebuild hosts through standardised workflows with audit trails, rather than ad-hoc PowerShell scripts with inconsistent outcomes.
- **Cost visibility**: track cost per concurrent user, surface right-sizing recommendations, and integrate with FinOps dashboards.
The critical principle is that Nerdio configuration should be treated as code. Export configurations regularly, store them in source control, and document the native Azure equivalents. This prevents lock-in and ensures you can rebuild the orchestration layer if needed.
Security: Zero Trust as Foundation
AVD security should assume breach. Every component — session hosts, storage, management plane — should be secured independently, with no implicit trust based on network location.
Session hosts should have no inbound ports exposed. AVD uses reverse connect — the host initiates the connection to Microsoft's control plane, and user traffic flows back through that established channel. There's no technical reason to allow RDP inbound to session hosts, and doing so creates unnecessary attack surface.
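The corresponding control is easy to audit. A generic sketch that flags inbound RDP allows in a set of NSG rules, represented here as plain dicts rather than SDK objects:

```python
# Flag any rule that allows inbound RDP to session hosts. The rule shape is a
# simplified illustration, not the Azure SDK's object model.
def risky_rdp_rules(rules: list[dict]) -> list[str]:
    return [
        r["name"] for r in rules
        if r["direction"] == "Inbound"
        and r["access"] == "Allow"
        and r["port"] in ("3389", "*")
    ]

rules = [
    {"name": "AllowRDPFromCorp", "direction": "Inbound", "access": "Allow", "port": "3389"},
    {"name": "AllowAVDOutbound", "direction": "Outbound", "access": "Allow", "port": "443"},
]
print(risky_rdp_rules(rules))   # ['AllowRDPFromCorp'] should be removed
```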
Storage should be accessed exclusively through Private Endpoints. This keeps traffic on the Azure backbone and eliminates exposure to the public internet. Authentication uses Entra Kerberos, removing the need for domain controllers in Azure.
Secrets belong in Key Vault with Private Endpoints. Automation should use Managed Identities rather than service principals with stored credentials. No credentials should be embedded in scripts, images, or configuration.
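With the Azure SDK for Python, that pattern is only a few lines; the vault URL and secret name below are placeholders:

```python
# DefaultAzureCredential picks up a managed identity on the session host or
# automation runner, so no credential is stored anywhere.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()
client = SecretClient(vault_url="https://contoso-avd-kv.vault.azure.net", credential=credential)
secret = client.get_secret("example-secret-name")   # retrieved at runtime, never embedded
```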
Conditional Access gates entry to the platform. Require MFA, enforce device compliance, block legacy authentication, and consider session controls for sensitive workloads.
Privileged access should follow least-privilege principles. Helpdesk staff don't need local admin on session hosts — they need scoped permissions to drain hosts and restart sessions. Administrative access to the management plane should require PIM activation with justification and time limits.
Monitoring: Observability That Drives Action
Monitoring should answer two questions: "Is the platform healthy?" and "Are users having a good experience?" The first is necessary but insufficient; the second is what actually matters.
Platform health metrics include host availability, autoscale events, storage performance, and agent status. These belong in Azure Monitor with alerts configured for anomalies.
User experience metrics require more nuance. Track login times (target p95 under 30 seconds to interactive desktop), FSLogix attach times (target p95 under 15 seconds), and session reconnect success rates (target above 99%). These metrics should be visualised in dashboards that operations teams review daily.
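Those targets are easy to encode as a recurring check. A minimal sketch using nearest-rank percentiles over a sample window (the numbers are invented):

```python
import math

# Compute p95/p99 from a window of FSLogix attach-time samples and compare
# against the targets above. Sample values are illustrative.
def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[idx]

attach_seconds = [6.2, 7.1, 5.8, 9.4, 14.0, 8.3, 13.5, 7.7, 6.9, 11.2]
p95, p99 = percentile(attach_seconds, 95), percentile(attach_seconds, 99)
print(f"FSLogix attach p95={p95}s (target <15s), p99={p99}s (target <25s)")
```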
The goal is proactive identification of degradation before users report it. If FSLogix attach times are trending upward over several days, you want to investigate before they breach SLA. If autoscale is consistently hitting peak capacity, you want to increase headroom before users experience session queuing.
Conclusion
AVD design is fundamentally about making deliberate choices rather than accepting defaults. Cloud-native identity over domain join. Standardised landing zones over organic growth. Immutable images with ring-based promotion over ad-hoc updates. MSIX App Attach over image bloat. These choices compound — each one makes the platform more manageable, more resilient, and more cost-effective.
The principles in this post aren't theoretical. They emerge from implementations where the alternative approaches created pain: fragmented estates that couldn't be governed consistently, login storms that brought platforms to their knees, image updates that required weekends of coordination, and security postures that assumed a network perimeter would compensate for inadequate host configuration.
AVD done well is genuinely transformative — a flexible, scalable, secure desktop platform that adapts to business needs rather than constraining them. AVD done poorly is Citrix with a different logo and a larger bill. The difference lies in the design decisions made before the first host is deployed.
If you're planning an AVD implementation or migrating from legacy VDI, I'd be interested to hear about your constraints and challenges. The principles are consistent, but the application always depends on context.