Service Update: How We’re Improving Crelate

Recent Service Outages

Over the past three weeks, a few of our customers may have experienced challenges accessing Crelate services. As our 99.95%+ uptime demonstrates, these incidents are rare. However, when several small incidents happen over a short period of time, we believe it is important to be transparent and share what happened.

This post addresses each of these incidents and explains what we’re doing to learn from them and improve our systems and processes. With all incidents, our approach is the same: assess, restore service, investigate the root cause, learn, and improve. The security, reliability, and speed of our service are our top priorities.

There are two categories of service disruption to address.

  1. Issues that result from the code we write, host, or manage. These items are generally under our complete control.
  2. Issues related to third-party services and vendors on which we rely.

The first two service disruptions were a combination of both categories.

For context, due to the redundancy of our system design, not all issues affect all customers. Often only a small percentage of customers are impacted.

 

Issue #1 – Cloudflare / Verizon Outage – June 24, 2019

Issue: Slow connections / timeouts
Affected users: ~20-100%
Duration: ~10-25 minutes
Current Status: Resolved
Notes: This issue was caused by a problem with Verizon that cascaded to Cloudflare and affected large portions of the Internet. Not all customers were affected, as the impact depended on where you were located and the underlying route your Internet traffic took. You can read more about this issue here.

Issue #2 – Cloudflare Outage – July 2, 2019

Issue: Slow connections / timeouts / 502 Errors
Affected users: ~20-100%
Duration: ~20-30 minutes
Current Status: Resolved
Notes: This issue was caused by a global issue with our cloud firewall and content distribution vendor, Cloudflare. Crelate was not the only SaaS product affected; Discord, Nest, Dropbox, and many others were affected as well. The issue was intermittent for some, but ultimately affected a wide range of customers.

 

Our Response

Since implementing Cloudflare in December of 2018, our customers have enjoyed significant performance gains and improved security. With the exception of recent events, we’ve been very happy with Cloudflare. Cloudflare is one of the largest, most respected web application firewall and CDN providers in the world. Some estimate that as much as 20% of all Internet traffic flows through Cloudflare. We have reviewed our implementation and confirmed we are following best practices. We are evaluating our Service Level Agreements with Cloudflare and assessing whether another provider could serve as a replacement or secondary backup.

 

 

Issue #3 – Crelate Partial Outage – July 9, 2019

Issue: Slow connections / timeouts
Affected users: ~30-80%
Duration: ~5-10 minutes
Current Status: Service restored; root cause under investigation

Our Response

We are still investigating the underlying cause of this issue. Our alert system functioned properly, and our engineers were able to partially restore service within 3 minutes of being alerted. Service was fully restored within 10 minutes. In addition to investigating the root cause, we are looking at ways to further improve our monitoring to shorten the time it takes to engage our engineering team.

 

Our goal is to make Crelate the fastest and most reliable recruiting solution on the market. We will continually learn and invest to achieve this goal. We take pride in our transparency and approach to continuous improvement. Customers can always see the status of our services and details of past incidents at https://status.crelate.com.

As a company, we’ve recently added to our development, test, and operations teams, with more expansion planned in the coming months. These new team members will help us accelerate our roadmap and continue to improve the speed and quality of all Crelate offerings.

 

Sincerely,

 

Aaron Elder
