On Wednesday, March 27, from 1:40PM to 6:00PM EDT, our cloud services provider had a system outage in the region where the MachineMetrics system runs. During this time, our data ingestion pipeline stopped processing new events. Our edge devices, which collect information from the manufacturing equipment at our customers' facilities, were still able to send this data to our cloud, where it was buffered for future processing and archival.
At approximately 3:30PM, some systems recovered; however, because of the preceding outage, our systems had scaled down their resources. When our pipeline began processing the backlog of data, those resources attempted to scale back up, but another issue within our provider’s infrastructure prevented all automatic scaling. This lasted until approximately 6:00PM, when we were able to disable automatic scaling for our systems and manually tune resource levels.
During the outage, certain failures in our pipeline (amplified by the outages in our provider’s infrastructure) caused data to be processed incorrectly. Data points (including part counts and utilization) for this time period are not accurate, which will affect historical reporting. We have found the issue that caused the failure, will be deploying a fix soon, and are currently working to reprocess the batch of data that was received during that time window. After this is complete, all reports will be accurate and up to date.