Last week (7th September 2022) I spotted a post from EUC and Cloud Veteran Marius Sandbu alerting his professional and social media networks to an outage of Azure Front Door that in turn meant a significant number of Azure services and resources were unavailable.
Marius posted:
“How much do you trust the cloud providers?
About 1 hour ago, Microsoft started to have issues with their Front door service which affects many of their global services including Azure Portal, Azure Virtual Desktop, Azure AD, GitHub, and of course services that are placed behind Azure Front door.
While outages like this can happen, ensure that you have some proper monitoring that monitors the cloud providers and services from an external point of view.”
And of course, as ever, he is correct. It is best practice to monitor the availability of cloud services for yourself. Whilst Microsoft do maintain an Azure Status page that notifies users of known outages, it can take a very long time for outages to actually show up on this site. Microsoft spend a lot of time verifying, reproducing and confirming issues, establishing their regional extent, working out which other services are impacted, and so on. As a result, an organization with end-users using AVD (Azure Virtual Desktop) can find its own help desk bombarded by support calls, emails, and tickets long before Microsoft publish and confirm an outage. Official information about slowdowns and more subtle performance issues can take even longer to appear.
Organizations that proactively verify Azure availability and performance themselves have an advantage. If an Azure service is found to be down, IT can proactively broadcast that information to their users and warn help desk staff – a simple “Sorry, Microsoft Azure is down, we’ve notified Microsoft and have to wait for them to fix it” can pre-empt help desk tickets.
Receiving early notification of outages and their extent also allows IT to direct users to alternatives that may be available, perhaps on-premises, in a different cloud or a different cloud region.
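As a rough illustration of what monitoring “from an external point of view” can mean at its simplest, the sketch below fetches a URL from outside the cloud and records whether it answered and how long it took. It is a minimal Python sketch and not how eG Enterprise or any other product implements its checks; the `probe` function, the `ProbeResult` type and the example URL in the comment are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    ok: bool                 # True if the service answered with a 2xx/3xx status
    status: Optional[int]    # HTTP status code, or None if no response arrived
    latency_s: float         # Time taken by the attempt, in seconds
    error: Optional[str] = None


def probe(url: str, timeout: float = 10.0,
          opener: Callable = urllib.request.urlopen) -> ProbeResult:
    """Fetch `url` once and report reachability and response time.

    `opener` is injectable (an assumption of this sketch) so the function
    can be exercised without touching the network.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
        return ProbeResult(url, 200 <= status < 400, status,
                           time.monotonic() - start)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return ProbeResult(url, False, exc.code,
                           time.monotonic() - start, str(exc))
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, TLS error and similar.
        return ProbeResult(url, False, None,
                           time.monotonic() - start, str(exc))


# Example (performs a real network request, so it is left commented out):
# print(probe("https://portal.azure.com/"))
```

Run on a schedule from a location outside the cloud being monitored, a check like this gives an independent view of availability; recording the latency as well as the status is what makes slowdowns visible in addition to outright outages.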
Azure Outage: What happened on the 7th September 2022?
Microsoft have already published their initial analysis, Preliminary Post Incident Review (PIR) – Azure Front Door – Connectivity Issues (Tracking ID YV8C-DT0), which states:
Between 16:10 and 19:55 UTC on 07 Sep 2022, subset of customers using Azure Front Door might have experienced connectivity issues. This could also be impacting customers’ ability to access other Azure services that leverage Azure Front Door, this includes the Azure Management Portal and Azure CDN.
The report includes some details of what went wrong and how they attempted to resolve the issue and bring services back online.
How eG Enterprise Proactively Monitors Azure Outages
At eG Enterprise there is no doubt as to whether we “might” or “could” have been affected by this issue because our proactive monitoring of our own Azure deployments detected the issue within minutes and allowed us to assess the impact.
When we explored the AVD Logon Simulator alert, raised at 16:20, the detailed diagnosis confirmed the issue. A simple, intuitive GUI suitable for L1/L2 help desk staff showed that the problem clearly lay with Web URL Access to Azure and not with the authentication process.
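The value of this kind of diagnosis comes from checking a logon as a sequence of steps and reporting the first one that fails. The Python sketch below shows that general idea only; it is not the AVD Logon Simulator’s implementation, and the `first_failing_step` function and the step names in the comment are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A step is a name plus an action that raises an exception on failure.
Step = Tuple[str, Callable[[], None]]


def first_failing_step(steps: List[Step]) -> Optional[Tuple[str, str]]:
    """Run steps in order and stop at the first failure.

    Returns (step name, error text) for the first failing step,
    or None if every step succeeds.
    """
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return name, str(exc)
    return None


# Hypothetical usage, with check functions supplied by the caller:
# first_failing_step([("Web URL access", check_url),
#                     ("Authentication", check_login)])
```

Because later steps are not attempted once one fails, a failure at the first step points at reachability rather than at authentication.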
Over the duration of the incident, eG Enterprise continued to probe Azure availability.
As you can see from the above graph of availability, it was not a continual outage; services went up and down as Microsoft attempted to stabilize the system.
This type of data is extremely useful when debating outage credits with cloud providers, especially smaller clouds; Microsoft are generally very good at crediting customers after outages. It can also be extremely useful when working with a cloud vendor’s support team, as it allows you to show the impact of any changes they make to help you, for example by correlating a change such as moving a server with end-user availability.
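To make that kind of evidence concrete, the sketch below turns a series of probe results into an availability percentage and a list of outage windows. It is a minimal Python sketch under stated assumptions: the `availability_percent` and `outage_windows` helpers are hypothetical, and the sample data is synthetic, not measurements from the actual incident.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, bool]  # (probe time, service was reachable)


def availability_percent(samples: List[Sample]) -> float:
    """Share of probes that succeeded, as a percentage."""
    if not samples:
        raise ValueError("no samples")
    up = sum(1 for _, ok in samples if ok)
    return 100.0 * up / len(samples)


def outage_windows(samples: List[Sample]) -> List[Tuple[datetime, datetime]]:
    """Return (first failed probe, next successful probe) pairs.

    An outage still in progress at the last sample is closed at that
    sample's timestamp.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    windows = []
    start = None
    for ts, ok in ordered:
        if not ok and start is None:
            start = ts                    # outage begins
        elif ok and start is not None:
            windows.append((start, ts))   # outage ends
            start = None
    if start is not None:
        windows.append((start, ordered[-1][0]))
    return windows


# Synthetic example data (not measurements from the real incident):
t0 = datetime(2022, 9, 7, 16, 0)
flags = [True, True, False, False, True, False, True, True]
samples = [(t0 + timedelta(minutes=5 * i), ok) for i, ok in enumerate(flags)]

print(f"{availability_percent(samples):.1f}% available")
for began, ended in outage_windows(samples):
    print(f"down from {began:%H:%M} until {ended:%H:%M}")
```

Timestamped windows like these can be lined up against a provider’s published incident timeline or against the times at which support made changes.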
It is also worth remembering that when services go down, what help desk and IT staff see and what the end user experiences may differ. When authentication services (e.g., Azure Active Directory) fail, users already logged in will have a different experience from those attempting to log on afresh, and many failures will have specific geographical extents. A tool like eG Enterprise allows an administrator to pinpoint what is going on if a user claims “AVD is down”, even if the administrator is not experiencing the issue themselves.
With eG Enterprise and its best-in-class synthetic testing, administrators receive proactive alerts about not only outages but slowdowns with enough information to make decisions to avert support calls. eG Enterprise gives administrators the confidence to proactively notify users and avert calls to help and support desks.
To ensure you are alerted to problems with Azure as soon as possible (and wherever you are), eG Enterprise not only raises alerts on its alert console; you can also have alerts raised directly in your browser, sent as SMS or email, or integrated with ITSM systems such as ServiceNow and with Microsoft Teams, or you can use the eG Enterprise Mobile apps for iOS or Android. The eG Enterprise Mobile apps offer a mobile view of the monitoring system so you can see why alerts were raised, along with key real-time and historical metrics, dashboards and more, beyond simple alert messages.
Or, if you don’t want to follow Marius’ best-practice advice and externally monitor Azure services, you might want to follow him on Twitter @msandbu or on LinkedIn, where he is likely to share early warnings of the next outage.
If you are an AVD user you may like to try our free logon simulator for AVD to get proactive notifications of logon problems including Azure outages, see: Free AVD logon simulator for Azure Virtual Desktop | eG Innovations.
eG Enterprise is an Observability solution for Modern IT. Monitor digital workspaces, web applications, SaaS services, cloud and containers from a single pane of glass.
Further Reading:
- If you enjoyed this postmortem blog post – you may enjoy this similar one, Troubleshooting Web Application Performance & SSL (Secure Socket Layer) Issues
- Or a more complex performance issue postmortem on Amazon’s AWS (Amazon Web Service) cloud, Application Performance Troubleshooting on AWS Cloud: A Case Study (eginnovations.com).
- Learn more about synthetic monitoring tools: Synthetic Monitoring & Synthetic Transaction Monitoring Tools (eginnovations.com)
- Learn about end-to-end observability for Azure: Azure Cloud Monitoring Tools & Solutions | eG Innovations