Data Processing Pipeline Paused

Incident Report for MachineMetrics

Postmortem

On Wednesday, March 27, from 1:40PM to 6:00PM EDT, our cloud services provider had a system outage in the region where the MachineMetrics system runs. During this time, our data ingestion pipeline stopped processing new events. Our edge devices, which collect information from the manufacturing equipment at our customers' facilities, were still able to send this data to our cloud, where it was buffered for future processing and archival.

At approximately 3:30PM, some systems recovered; however, because of the earlier outage, our systems had scaled down their resources. When our pipeline began processing the backlog of data, those resources attempted to scale back up, but another issue within our provider’s infrastructure prevented all automatic scaling. This lasted until approximately 6PM, when we were able to disable automatic scaling for our systems and tune these values manually.

Certain failures in our pipeline, amplified by the outages in our provider’s infrastructure, caused data to be processed incorrectly during this outage. Data points (including part counts and utilization) are not accurate for this time period, which will affect historical reporting. We have found the issue that caused the failure, will be deploying a fix soon, and are currently working to reprocess the batch of data that was received during that time window. After this is complete, all reports will be accurate and up to date.

Posted Mar 29, 2019 - 09:57 EDT

Resolved

All systems are reporting. We will continue to monitor the situation and will provide a more detailed update about the event in the coming days.
Posted Mar 27, 2019 - 21:25 EDT

Update

The issue has been resolved and nearly all data is caught up. We are currently verifying data integrity and will post back with more information.
Posted Mar 27, 2019 - 18:37 EDT

Monitoring

Amazon Web Services (our cloud service provider) has fixed the issue that was preventing us from updating our data store's write capacity. Our data processing pipeline is starting to get through the backlog. We are monitoring the situation and will update as we get closer to having all data processed. No data loss is expected during this event.
Posted Mar 27, 2019 - 17:53 EDT

Update

Amazon Web Services is still reporting issues with provisioning one of our data stores but is working to address the problem. We will update with more information when available.
Posted Mar 27, 2019 - 16:59 EDT

Update

We are still monitoring this issue and working with Amazon Web Services to get it resolved. We will update as more information is available.
Posted Mar 27, 2019 - 15:39 EDT

Update

Amazon Web Services has reported issues with some of their internal systems. We have reached out and are waiting for more information about the cause. We will post again soon.
Posted Mar 27, 2019 - 14:45 EDT

Investigating

We are currently investigating an issue which has caused our data processing pipeline to pause. We will update with more information when available.
Posted Mar 27, 2019 - 14:26 EDT
This incident affected: Data Processing Services.