Network issues at one datacenter
Incident Report for Aptoma AS
Postmortem

An AWS availability zone (AZ) in Frankfurt got shut down due to overheating. Some features in our technology services (AMP services) still rely on specific AZs.

While most affected AMP services recovered fully within 20 minutes, with only partial outages during this short period, two features in DrPublish and one in DrEdition Print Automation did not recover until approx 01:34 CEST on Friday 11th June: DrPublish Newsroom search, DrPublish image and graphics uploads, and the DrEdition Print Automation content inbox.

Some customers were not affected because their setup depends on a separate AZ. For the customers whose setup for these features is on the affected AZ, the outage severely limited production during this timeframe (approx Thursday 2021-06-10 22:20 CEST to Friday 2021-06-11 01:30 CEST), a duration of about 3h 10min.

We have been working for some time to remove the weakness of depending on a single AZ at AWS Frankfurt for these specific features, but we are not yet finished. This work will continue as a top priority. We expect to be finished during 2021.

We are sorry for the problems caused, and will do what we can to make sure similar outages at AWS cannot affect our services this way in the future.

Posted Jun 11, 2021 - 13:25 CEST

Resolved
We have been monitoring and double checking, and we conclude that the incident has been resolved.

You should expect everything to work as normal.
Posted Jun 11, 2021 - 02:22 CEST
Monitoring
All services have recovered.
We are monitoring and double checking.
Posted Jun 11, 2021 - 01:37 CEST
Update
We have received reports that AWS engineers are still locked out of their affected datacenters due to environmental conditions, and we are initiating our own fallback plans in parallel with their ongoing efforts.

We expect our plans to take several hours to complete. We will abort them if AWS engineers succeed before our own fallback plan completes.

We will post an update at 06:00 at the latest.
Posted Jun 11, 2021 - 01:28 CEST
Update
AWS reports that "temperatures continue to return to normal levels" but that their engineers are still not able to physically access the affected region to restore normal services.

The same subset of customers is still affected by the following issues:
* Can't search for articles in DrP Newsroom
* Can't upload and insert assets into DrPublish articles

We are currently making plans for what to do if AWS cannot re-establish normal services in their affected datacenters.
Posted Jun 11, 2021 - 00:49 CEST
Update
We have received reports from AWS that temperatures at the affected datacenter(s) are returning to normal levels, but not all of our services have returned to normal operations yet.

The following issues still persist for some of our customers:
* Can't search for articles in DrP Newsroom
* Can't upload and insert assets into DrPublish articles
Posted Jun 11, 2021 - 00:18 CEST
Update
We noticed intermittent image editing issues in DrEdition Front Page, but these issues are now resolved.

The following issues still persist for some customers:
* Can't search for articles in DrP Newsroom
* Can't upload and insert assets into DrPublish articles

Please note that for these customers it's still possible to create and publish DrPublish articles.
We are still investigating.
Posted Jun 10, 2021 - 23:43 CEST
Update
We are continuing to work on a fix for this issue.
Posted Jun 10, 2021 - 23:14 CEST
Update
We have received the following update from our cloud provider AWS:
> 1:55 PM PDT We can confirm increased API error rates and latencies for the EC2 APIs and connectivity issues for instances within a single Availability Zone (euc1-az1) within the EU-CENTRAL-1 Region, caused by an increase in ambient temperature within a subsection of the affected Availability Zone. Other Availability Zones within the EU-CENTRAL-1 Region are not affected by the issue and we continue to work towards resolving the issue.

The consequences for some of our customers currently are:
* Certain customers can't search for articles in DrP Newsroom
* The same customers can't upload and insert assets into DrPublish articles
Posted Jun 10, 2021 - 23:12 CEST
Identified
Services are in the process of transitioning onto alternate datacenters. Some customers may experience aborted requests or connectivity issues while the services are recovering. We expect services to recover within 30 minutes.
Posted Jun 10, 2021 - 22:43 CEST
This incident affected: DrPublish v5 - GUI, DrPublish v5 - API /io, DrPublish v4 - GUI, DrPublish v4 - API /io, DrEdition - GUI, DrEdition - Front Page Renderer (Sphynx, DrE4F), and DrEdition - LayoutPreview (DrE4P).