[SBG/RBX/GRA/BRU][Core Network] - Call routing

Incident Report for Web Cloud

Postmortem

Dear all,

The incident has now been resolved, and the details follow.

On Wednesday 07 June 2023 at 06:32 UTC, our infrastructure monitoring alerted us that calls were not being routed correctly.

We started investigating and checked several things:

=> infrastructure health
=> recent VoIP interventions
=> network interventions

We located the malfunction on the call routing infrastructure. This infrastructure is strategic, as it handles the routing of your calls from our class 5 subscriber infrastructure (Centrex / Trunk).
It is therefore redundant: 6 servers spread across 2 geographical zones and 4 datacentres.

Our findings and the behaviour of the machines pointed to a network malfunction, so the network team was called in as backup at 07:00 UTC.

From there, a long investigation began to pinpoint the equipment causing the problem.

The investigation unfortunately took time, because we could not take drastic action on the infrastructure without risking making the situation worse.

By the end of the morning, we had narrowed down the possible causes of the outage.

From 12:30 UTC, we were able to locate a duplicate of the IP address of one of our routers on a virtual machine in our Public Cloud project. Once this machine was isolated, the service returned to normal.

This IP duplication was possible and had a significant impact because:
・ Our Public Cloud project is attached to the VoIP service vRack
・ Our private network uses the vRack, which spans several DCs and several types of OVH services/infrastructure
・ The duplicated IP is the one used by a router on our vRack

It took us some time to detect this IP duplication, because the address was assigned automatically, taking the last IP of the subnet used on our vRack.
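
To illustrate how an automatic "last IP of the subnet" assignment can collide with a statically configured router, here is a minimal sketch using Python's standard ipaddress module. The subnet 10.0.0.0/24, the router address and the function names are assumptions made for the example; they are not the values or the tooling involved in this incident.

```python
import ipaddress


def last_usable_ip(cidr):
    """Return the last host address of an IPv4 subnet (prefix shorter than /31)."""
    network = ipaddress.ip_network(cidr)
    # The broadcast address is the very last address; the last host sits just before it.
    return network.broadcast_address - 1


def collides_with_reserved(cidr, reserved_addresses):
    """Check whether the auto-assigned last IP is already used by known equipment."""
    reserved = {ipaddress.ip_address(addr) for addr in reserved_addresses}
    return last_usable_ip(cidr) in reserved


# Hypothetical values: a router statically configured on the last host address.
subnet = "10.0.0.0/24"
router_ips = ["10.0.0.254"]

print(last_usable_ip(subnet))                      # candidate picked automatically
print(collides_with_reserved(subnet, router_ips))  # True means a duplicate would be created
```

A check of this kind, run before an address is handed out, would flag the conflict before the two machines start answering for the same IP.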

To prevent this type of incident from recurring and to detect it more quickly, we will:
・ Enhance network monitoring so that we are alerted when an ARP entry changes too frequently
・ Reduce the scope of our vRack to better segment and isolate our different services/infrastructures
・ Investigate changing the subnet used on the private network of our Public Cloud so that it is no longer shared with other infrastructures
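
The first action above, alerting when an ARP entry changes too frequently, could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name ArpFlapDetector, the 60-second window and the threshold of 3 changes are hypothetical, and the (timestamp, IP, MAC) observations are assumed to come from whatever source already collects ARP tables.

```python
from collections import defaultdict, deque


class ArpFlapDetector:
    """Flag an IP whose MAC address changes too often within a sliding time window."""

    def __init__(self, window_seconds=60.0, max_changes=3):
        self.window_seconds = window_seconds  # hypothetical window length
        self.max_changes = max_changes        # hypothetical alert threshold
        self._last_mac = {}                   # ip -> last MAC seen
        self._changes = defaultdict(deque)    # ip -> timestamps of MAC changes

    def observe(self, timestamp, ip, mac):
        """Record one ARP sighting; return True if the IP is currently flapping."""
        mac = mac.lower()
        previous = self._last_mac.get(ip)
        self._last_mac[ip] = mac

        changes = self._changes[ip]
        if previous is not None and previous != mac:
            changes.append(timestamp)

        # Forget changes that fell out of the sliding window.
        while changes and timestamp - changes[0] > self.window_seconds:
            changes.popleft()

        return len(changes) >= self.max_changes


# Two machines answering for the same IP make its ARP entry alternate.
detector = ArpFlapDetector(window_seconds=60.0, max_changes=3)
sightings = [
    (0.0, "10.0.0.254", "aa:aa:aa:aa:aa:aa"),
    (1.0, "10.0.0.254", "bb:bb:bb:bb:bb:bb"),
    (2.0, "10.0.0.254", "aa:aa:aa:aa:aa:aa"),
    (3.0, "10.0.0.254", "bb:bb:bb:bb:bb:bb"),
]
for ts, ip, mac in sightings:
    if detector.observe(ts, ip, mac):
        print(f"ALERT: ARP entry for {ip} is changing too frequently")
```

An alternating MAC address for a single IP is the typical symptom of a duplicated address, so a counter of this kind targets the situation described in this report.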

Once again, we apologise for this exceptional incident, as we are committed to providing you and your customers with the best possible service.

Posted Jun 09, 2023 - 15:55 UTC

Resolved

Start time : 07/06/2023 06:32 UTC
End time : 07/06/2023 15:11 UTC

On Wednesday June 7th at 8:27 AM (CET), a network malfunction impacted the VoIP services of some of our customers.

OVHcloud technical teams immediately intervened to identify the cause of the incident and resolve it. Impacted services were restored on 07/06/2023 at 15:11 UTC and are now operational.

You can contact our customer service on 1007 or open a support ticket from your Help Center.

We apologize for any inconvenience this may have caused. We will update our support pages with new information as soon as it is available.

Posted Jun 08, 2023 - 16:25 UTC

Update

Impacted services are operational.
We will continue to monitor the situation for the time being.
Further explanation will be posted here.
We apologize for the inconvenience.

Posted Jun 07, 2023 - 15:11 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jun 07, 2023 - 13:34 UTC

Update

All technical teams are working on the impacted services.
We are doing our very best to resolve the issue.

Posted Jun 07, 2023 - 12:16 UTC

Update

Our technicians are still working to resolve this situation.

Posted Jun 07, 2023 - 09:46 UTC

Update

Our teams are still investigating the cause of the service degradation.

Posted Jun 07, 2023 - 08:47 UTC

Update

Our technicians are currently working on the issue in order to fix it.

Posted Jun 07, 2023 - 08:46 UTC

Investigating

Start time : 07/06/2023 06:32 UTC
Service impact : We noticed call routing errors on our core infrastructure
Ongoing actions : Our technical teams are investigating this issue. They are checking our network.
Updates will be posted as significant progress is made.

Posted Jun 07, 2023 - 06:59 UTC

This incident affected: VoIP || Core Network.