Data Processing Pipeline Paused
Incident Report for MachineMetrics
Postmortem

On Wednesday, March 27, from 1:40PM to 6:00PM EDT, our cloud services provider had a system outage in the region where the MachineMetrics system runs. During this time, our data ingestion pipeline stopped processing new events. Our edge devices, which collect information from the manufacturing equipment at our customers' facilities, were still able to send this data to our cloud, where it was buffered for future processing and archival.

At approximately 3:30PM, some systems recovered; however, because of the earlier outage, our systems had scaled down their resources. When our pipeline began processing the backlog of data, those resources attempted to scale back up, but another issue within our provider’s infrastructure prevented all automatic scaling. This lasted until approximately 6PM, when we were able to disable automatic scaling for our systems and tune these values manually.

During this outage, certain failures in our pipeline (amplified by the outages in our provider’s infrastructure) caused data to be processed incorrectly. Data points recorded during this time period (including part counts and utilization) are not accurate, which will affect historical reporting. We have found the issue that caused the failure, will be deploying a fix soon, and are currently working to reprocess the batch of data that was received during that time window. After this is complete, all reports will be accurate and up to date.

Posted 5 months ago. Mar 29, 2019 - 09:57 EDT

Resolved
All systems are reporting. We will continue to monitor the situation and will provide a more detailed update about the event in the coming days.
Posted 5 months ago. Mar 27, 2019 - 21:25 EDT
Update
The issue has been resolved and nearly all data is caught up. We are currently verifying data integrity and will post back with more information.
Posted 5 months ago. Mar 27, 2019 - 18:37 EDT
Monitoring
Amazon Web Services (our cloud service provider) has fixed the issue that was preventing us from updating our data store's write capacity. Our data processing pipeline is starting to get through the backlog. We are monitoring the situation and will update as we get closer to having all data processed. No data loss is expected during this event.
Posted 5 months ago. Mar 27, 2019 - 17:53 EDT
Update
Amazon Web Services is still reporting issues with provisioning one of our data stores but is working to address the problem. We will update with more information when available.
Posted 5 months ago. Mar 27, 2019 - 16:59 EDT
Update
We are still monitoring this issue and working with Amazon Web Services to get it resolved. We will update as more information is available.
Posted 5 months ago. Mar 27, 2019 - 15:39 EDT
Update
Amazon Web Services has reported issues with some of their internal systems. We have reached out and are waiting for more information about the cause. We will post again soon.
Posted 5 months ago. Mar 27, 2019 - 14:45 EDT
Investigating
We are currently investigating an issue which has caused our data processing pipeline to pause. We will update with more information when available.
Posted 5 months ago. Mar 27, 2019 - 14:26 EDT
This incident affected: Data Processing Services.