Dear all,
The incident has now been resolved, and the details follow.
On Wednesday 07 June 2023 at 06:32 UTC, our infrastructure monitoring alerted us that calls were not being routed correctly.
We began investigating and checked several things:
=> infrastructure health
=> recent VoIP interventions
=> recent network interventions
We located the malfunction on the call routing infrastructure. This infrastructure is strategic, as it manages the routing of your calls from our class 5 subscriber infrastructure (Centrex / Trunk).
It is therefore redundant: 6 servers spread across 2 geographical zones and 4 datacentres.
Our findings and the behaviour of the machines pointed to a network malfunction, so the network team was called in as backup at 07:00 UTC.
From there, a long investigation began to pinpoint the equipment causing the problem.
Unfortunately, this investigation took time, as we could not take drastic action on the infrastructure without risking making the situation worse.
By the end of the morning, we had narrowed down the possible causes of the outage.
At 12:30 UTC, we found that the IP address of one of our routers was duplicated on a virtual machine in our Public Cloud project. Once this machine was isolated, the service returned to normal.
This IP duplication was possible and had a significant impact because:
・ Our Public Cloud project is attached to the VoIP service vRack
・ For our private network, we use the vRack, which spans several datacentres and several types of OVH services and infrastructure
・ The duplicated IP is the one used by a router on our vRack
It took us some time to detect this IP duplication, as the address was assigned automatically by taking the last IP of the subnet used on our vRack.
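For readers who want to see how a "last IP of the subnet" allocation can collide with a manually configured router address, here is a minimal sketch using Python's standard ipaddress module. The subnet, the router names and the addresses are hypothetical values chosen for illustration; they are not our real addressing plan, and this is not a description of the allocation mechanism itself.

```python
import ipaddress

# Hypothetical values for illustration only; not the real addressing plan.
SHARED_SUBNET = "10.0.0.0/24"
RESERVED = {"router-a": "10.0.0.254", "router-b": "10.0.0.253"}


def last_usable_ip(cidr):
    """Return the highest host address of an IPv4 subnet (broadcast excluded).

    Assumes a prefix of /30 or shorter, where a broadcast address exists.
    """
    network = ipaddress.IPv4Network(cidr)
    return network.broadcast_address - 1


def conflicting_owner(candidate, reserved):
    """Return the name of the device already using `candidate`, or None."""
    for name, addr in reserved.items():
        if ipaddress.IPv4Address(addr) == candidate:
            return name
    return None


# An allocator that always picks the last address of the shared subnet
# lands on the address statically configured for router-a.
auto_assigned = last_usable_ip(SHARED_SUBNET)
owner = conflicting_owner(auto_assigned, RESERVED)
print(f"{auto_assigned} is already used by: {owner}")
```

A check of this kind, run before a new machine is attached to a shared private network, would flag the conflict before any traffic is affected.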
To prevent this type of incident from recurring and to detect it more quickly, we will:
・ Enhance network monitoring so that we are alerted when an ARP entry changes too frequently
・ Reduce the scope of our vRack to better segment and isolate our different services/infrastructures
・ Investigate changing the subnet used on the private network of our Public Cloud so that it is no longer shared with other infrastructures
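To illustrate the first action, the sketch below shows one possible way to detect an ARP entry that changes too frequently: it counts how often the MAC address seen for a given IP changes within a sliding time window. The class name, the thresholds and the sample addresses are all assumptions made for this example, and the code takes observations as plain values rather than reading them from any particular monitoring tool.

```python
from collections import defaultdict, deque


class ArpFlapDetector:
    """Flag IPs whose MAC address changes too often within a time window.

    Hypothetical helper for illustration; the thresholds are assumptions.
    """

    def __init__(self, max_changes=3, window_seconds=60.0):
        self.max_changes = max_changes
        self.window_seconds = window_seconds
        self._last_mac = {}                   # ip -> last MAC seen
        self._changes = defaultdict(deque)    # ip -> timestamps of MAC changes

    def observe(self, timestamp, ip, mac):
        """Record one ARP observation; return True if an alert is warranted."""
        mac = mac.lower()
        previous = self._last_mac.get(ip)
        self._last_mac[ip] = mac

        changes = self._changes[ip]
        if previous is not None and previous != mac:
            changes.append(timestamp)

        # Forget changes that fall outside the sliding window.
        while changes and timestamp - changes[0] > self.window_seconds:
            changes.popleft()

        return len(changes) >= self.max_changes


# Two machines answering for the same IP make the entry flip back and forth.
detector = ArpFlapDetector(max_changes=3, window_seconds=60.0)
events = [
    (0.0, "10.0.0.254", "aa:aa:aa:aa:aa:aa"),
    (1.0, "10.0.0.254", "bb:bb:bb:bb:bb:bb"),
    (2.0, "10.0.0.254", "aa:aa:aa:aa:aa:aa"),
    (3.0, "10.0.0.254", "bb:bb:bb:bb:bb:bb"),
]
alerts = [detector.observe(*event) for event in events]
print(alerts)
```

A stable entry never accumulates changes, so only an address claimed by more than one machine (or a genuinely unstable failover) would cross the threshold.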
Once again, we apologise for this exceptional incident. We remain committed to providing you and your customers with the best possible service.