Timetrix Predictive Alerting System
Anomaly detection combined with dynamic rules
Open source core

Virtualization, containerization and orchestration frameworks abstract away the infrastructure level. Monitoring solutions are often distributed across multiple locations and architecture types. Independent teams running their own home-grown solutions further complicate the overall picture of operations, orchestration and change management. The common result: anti-patterns creeping into your infrastructure environment. While such an approach may work for a while, there is always a tipping point when the system starts crashing down on you.

Modern stack
  • Decreasing visibility into overall monitoring coverage. It is hard to get an overall picture of operations, to orchestrate and to make changes
  • The number of metrics in monitoring grows exponentially. It is almost impossible to operate on billions of metrics
  • The proliferating number of metrics leads to unusable dashboards. Nobody can digest such a huge amount of information
  • Recording everything "just in case" leads to missing context and no visibility into root causes
  • Manual configuration of alerts. Inflexible thresholds that do not adjust to temporal patterns. An overwhelming number of false positives.

And while your system becomes more and more vulnerable, incidents caused by changes, bugs, network or hardware failures and human error start affecting your performance. Incidents are expensive and carry a credibility cost. Any reduction of the outage/incident timeline yields a significant positive financial impact, and your DevOps teams feel less pain and toil along the way.

  • Average cost per minute for service outages
  • Increase in processed metrics, events and alerts over the last 12 months
  • Average mean time to resolution per incident: 3.5 hours
  • Percent of outages that require 6 or more IT FTEs to resolve
Source: Digital Enterprise Journal
We combine metrics, tracing and event logs to build a topology of your stack, reducing the investigation phase on the alerting side by effectively pinpointing the root cause
  • Finding anomalies in metrics
  • Narrowing down the number of metrics required to define KPIs
  • Finding higher-level regularities across metrics, tracing and event logs
  • Combined push/pull model (local pull, central push)
  • Combining events from inside the organization (changes, deployments)
  • ML-based system that learns your metrics' behaviour
  • Focus on KPI metrics
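As a rough illustration of what "learning a metric's behaviour" can mean in its simplest form, the sketch below flags points that deviate strongly from a rolling baseline. This is not the Timetrix model itself; the function name, window size and threshold are illustrative assumptions.

```python
from collections import deque
from math import sqrt

def rolling_zscore_anomalies(values, window=60, threshold=3.0):
    """Flag indices whose value deviates more than `threshold` standard
    deviations from the mean of the preceding `window` samples.
    Illustrative only; real engines also model seasonality and trends."""
    history = deque(maxlen=window)
    anomalies = []
    for i, v in enumerate(values):
        if len(history) == window:
            mean = sum(history) / window
            std = sqrt(sum((x - mean) ** 2 for x in history) / window)
            if std > 0 and abs(v - mean) / std > threshold:
                anomalies.append(i)
        history.append(v)
    return anomalies
```

Because the baseline is recomputed over a sliding window, the threshold adapts to each metric's own level and noise instead of being a fixed, manually configured number.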

It is almost impossible to operate on millions or billions of metrics. Even under normal system behaviour there will always be outliers in real production data, and not all outliers should be flagged as anomalous incidents. We do not want to process everything: many events are only needed on demand, and it is acceptable to lose some signals in favour of performance.
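One simple way to separate harmless outliers from anomalous incidents, sketched below under assumed names, is a persistence check: a deviation only becomes an incident if it lasts for several consecutive samples.

```python
def persistent_anomalies(flags, min_run=3):
    """Given per-sample boolean anomaly flags, keep only runs of at
    least `min_run` consecutive flagged samples and return them as
    (start_index, end_index) pairs. Isolated spikes are dropped."""
    incidents = []
    run = []  # indices of the current consecutive flagged run
    for i, flagged in enumerate(flags):
        if flagged:
            run.append(i)
        else:
            if len(run) >= min_run:
                incidents.append((run[0], run[-1]))
            run = []
    if len(run) >= min_run:  # close a run that reaches the end
        incidents.append((run[0], run[-1]))
    return incidents
```

A single noisy sample never produces an incident, which directly cuts the false-positive volume that static thresholds generate.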

A Machine Learning Engine handles the detection of anomalies and the categorisation of business incidents. The engine collects and transforms the data, performs analytical computations, categorises business incidents and predicts their probabilities.
We use time-series-based multi-source joining to handle operations on a worldwide basis and unpredictable network delays. This makes it possible to combine metrics, event logs and traces, transform data based on relevant conditions and easily test hypotheses. There is no mismatch between meta-querying and querying.
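The core of multi-source joining can be sketched as an "as-of" join: each metric sample is matched to the most recent event within a tolerance window, so late-arriving events still line up by time rather than by arrival order. The function and parameter names below are illustrative, not the product's API.

```python
from bisect import bisect_right

def join_asof(metric_points, events, tolerance=300):
    """Attach to each (timestamp, value) metric sample the payload of
    the most recent event at most `tolerance` seconds before it.
    Both inputs must be sorted by timestamp (seconds)."""
    event_times = [t for t, _ in events]
    joined = []
    for ts, value in metric_points:
        idx = bisect_right(event_times, ts) - 1  # latest event <= ts
        if idx >= 0 and ts - event_times[idx] <= tolerance:
            joined.append((ts, value, events[idx][1]))
        else:
            joined.append((ts, value, None))
    return joined
```

Joining by timestamp rather than by stream position is what makes the model robust to network delays: an event that arrives late still attaches to the right metric samples once it is sorted in.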
The Rules Matching Engine (RME) is the main component where all rules are created and matched and anomalies are detected. All RME computations are performed in memory, so the average time to check one violation message against the rules is under 1 ms. Each RME process is independent, which makes sharding very easy.
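The two properties claimed above, in-memory matching and independent shards, can be sketched as follows. The rule set, message shape and hashing scheme here are assumptions for illustration, not the RME's actual internals.

```python
import hashlib

# Hypothetical in-memory rule set: metric name -> violation predicate.
RULES = {
    "cpu.load": lambda v: v > 0.9,
    "http.error_rate": lambda v: v > 0.05,
}

def shard_for(metric_name, num_shards):
    """Route a message to one of `num_shards` independent RME processes
    by hashing its key; shards need no coordination with each other."""
    digest = hashlib.md5(metric_name.encode()).hexdigest()
    return int(digest, 16) % num_shards

def match(message):
    """Check one violation message against the in-memory rules."""
    rule = RULES.get(message["metric"])
    return bool(rule and rule(message["value"]))
```

Because each lookup is a dictionary access plus a predicate call, with no I/O on the hot path, sub-millisecond per-message matching is plausible, and hashing on the metric key keeps all messages for one metric on the same shard.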
Dashboards are very useful when you know where and when to look. It is acceptable to lose some signals in favour of performance. By narrowing the set of metrics down to certain defined KPIs, we let you track critical events on demand without missing any critical incidents.
Simple to model
Cheap to run
Lightning fast