FS#5793 — Authentication problem on the "collect" network

Incident Report for Web Cloud

Resolved

One of our providers, the "collect" operator, currently has an authentication problem on its network.
As a result, ADSL connections using this network have been interrupted since approximately 19:45.

We are in touch with their team to find out when the problem will be fixed.

Update(s):

Date: 2011-09-20 23:18:51 UTC
SFR has added 5 new BAS to strengthen the existing 8.
In total, all traffic is now spread over 13 BAS (a comfortable number that should cause no problems), so we should no longer see packet loss or bandwidth problems.

We will remove the blacklisting of MIT-1 and MIT-2.

Date: 2011-09-20 22:57:29 UTC
SFR has identified the problem with Ericsson.

An operator client sent a malformed packet over its interconnection with SFR, and this packet crashed the BAS.
This operator (we do not yet know which one) kept sending the packet until Sunday noon. After that there was no trace of the packet, and after the hard reboot the situation returned to normal.

Ericsson has patched the BAS software and is currently checking that it works properly by sending the packet through a Redback in the lab.

There will certainly be a software update in the coming hours, days or weeks.

Date: 2011-09-20 22:50:00 UTC
SFR has reportedly made remarks about how transparent we are with our customers. They did not appreciate that the link was published.
We have had nothing to hide for 12 years, and transparency is part of our values that we will always keep.

We censored emotional paragraphs that were enthusiastically written by people who slept little during the problem.
We think that it was not enough to understand what really happened.

Date: 2011-09-19 22:08:54 UTC
We applied the same configuration on all of our LNS. We no longer accept connections from BAS that are not working properly at SFR, such as MIT-1 and MIT-2.

When the problem is fixed at SFR, we will restore these two BAS.

Date: 2011-09-19 22:05:41 UTC
We put an invalid BAS configuration in place for MIT-1 and MIT-2, then killed all MIT-1 and MIT-2 sessions several times so that this configuration would be taken into account. Result: customers can no longer use MIT-1 and MIT-2; they reconnect automatically through the BAS that are considered functional.

Date: 2011-09-19 22:00:23 UTC
We are identifying the customers who are on these two BAS.
After the session clear, the customer moves to another BAS and no longer has the problem.

We are looking at how to avoid using these two BAS.

Date: 2011-09-19 18:27:25 UTC
According to our investigations with customers, the problem
concerns 2 BAS:

SE800-MIT-1 >> packet loss
SE800-MIT-2 >> packet loss + high latency

Date: 2011-09-19 18:26:12 UTC
Throughput seems correct, but many customers have been experiencing high latency
since this morning.

Example:
http://twitpic.com/6ncjga/full

We are trying to find out how to fix this problem.

Date: 2011-09-19 12:33:42 UTC
Impact: loss of L2TP sessions affecting the Option 1 and Option 3 traffic of the POP customers.

Equipment concerned: 8 BAS SE800 (concentration)

Investigations and analyses went on all night.

Root cause: still not identified.

Actions completed:

- Monitoring of the BAS equipment (checking CPU load levels) and manual restart of the L2TP process when necessary
- Replacement of the SUP card on one device (expected at 9:00 a.m.)
- Router reload
- Isolation of a node (it was causing packet drops)
- Isolation of an unstable interface card on 1 router (card replacement in progress)
- Restart of each BAS (reduced impact, partial traffic recovery confirmed by STC, but the sites are down)
- To date (yesterday 17:40), 90% of traffic OK for the operator customers. Quantification in progress on the Company side.

Actions completed last night:
- Action plan for the night (complete or partial reboot of the BAS, per incident or general; impact duration to be anticipated)
- Log analysis to determine the root cause

Other completed actions: protection measures put in place after the card changes (BAS and router)

Date: 2011-09-19 11:48:56 UTC
Since 3:30, everything has been stable according to SFR.
We are going to check case by case whether any problem remains.

Date: 2011-09-19 06:44:04 UTC
SFR is rebooting CBV1-1

Sep 19 02:06:00: %L2TP-6-TUNNEL: SE800-CBV1-1:756 Max retransmits on packet 238. Closing tunnel
Sep 19 02:06:00: %L2TP-6-PEER: Marking peer SE800-CBV1-1 dead for 120 seconds
Sep 19 02:06:00: %L2TP-6-TUNNEL: SE800-CBV1-1:756 remote abort: Reached max retransmits
Sep 19 02:19:04: %L2TP-6-TUNNEL: SE800-CBV1-1:1073 Max retransmits on packet 2. Closing tunnel
Sep 19 02:19:04: %L2TP-6-PEER: Marking peer SE800-CBV1-1 dead for 120 seconds
Sep 19 02:19:04: %L2TP-6-TUNNEL: SE800-CBV1-1:1073 remote abort: Reached max retransmits
Sep 19 02:20:33: %L2TP-6-PEER: SE800-CBV1-1 marked alive

Date: 2011-09-19 06:07:40 UTC
We hard-rebooted an SE1200 (bigger than the SE800), and it took 4 minutes to come back up.

Sep 19 00:46:16 10g.lyo-1-6k.routers.ovh.net 3105: Sep 18 23:45:53 GMT: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet3/1, changed state to down
[...]
Sep 19 00:50:05 10g.lyo-1-6k.routers.ovh.net 3129: Sep 18 23:49:42 GMT: %DTP-SP-5-TRUNKPORTON: Port Te3/1 has become dot1q trunk

Date: 2011-09-19 06:05:36 UTC
Here is something interesting. Our Lyon LNS is also a Redback.
And we see that four hours before the failure of the "collect" network at SFR, something happened on SFR's side that made our Lyon LNS crash.
Same software, same bug.

Sep 17 14:12:51: %LOG-6-PRI_STANDBY: Sep 17 14:12:51: %AAA-6-INFO: AAA: session sync completed
Sep 17 14:19:35: %LOG-6-PRI_STANDBY: Sep 17 14:19:35: %AAA-6-INFO: AAA: start session sync
Sep 17 14:33:35: %LOG-6-PRI_STANDBY: Sep 17 14:33:35: %AAA-6-INFO: AAA: session sync completed
Sep 17 15:38:06: %LOG-6-PRI_STANDBY: Sep 17 15:38:06: %AAA-6-INFO: AAA: start session sync
Sep 17 15:38:33: [0003]: %OSPF-6-INFO: OSPF-16276: Full neighbor xx.xx.xx.xx event: 1 Way
Sep 17 15:38:33: [0003]: %OSPF-6-INFO: OSPF-16276: Neighbor xx.xx.xx.xx fell from Full state to Init state
Sep 17 15:38:38: [0001]: %OSPF-6-INFO: OSPF-16276: Full neighbor xx.xx.xx.xx event: 1 Way
Sep 17 15:38:38: [0001]: %OSPF-6-INFO: OSPF-16276: Neighbor xx.xx.xx.xx fell from Full state to Init state
Sep 17 15:38:40: %L2TP-6-TUNNEL: SE800-ABV-1:15467 Max retransmits on packet 8803. Closing tunnel
Sep 17 15:38:40: %L2TP-6-PEER: Marking peer SE800-ABV-1 dead for 120 seconds
Sep 17 15:38:40: %L2TP-6-TUNNEL: SE800-ABV-1:15467 remote abort: Reached max retransmits
Sep 17 15:38:40: %L2TP-6-TUNNEL: SE800-MIT-1:15451 Max retransmits on packet 9846. Closing tunnel
Sep 17 15:38:40: %L2TP-6-PEER: Marking peer SE800-MIT-1 dead for 120 seconds
Sep 17 15:38:40: %L2TP-6-TUNNEL: SE800-MIT-1:15451 remote abort: Reached max retransmits
Sep 17 15:38:44: [0002]: %OSPF-6-INFO: OSPF-16276: Full neighbor xx.xx.xx.xx event: 1 Way
Sep 17 15:38:44: [0002]: %OSPF-6-INFO: OSPF-16276: Neighbor xx.xx.xx.xx fell from Full state to Init state
Sep 17 15:38:55: %L2TP-6-TUNNEL: SE800-CBV1-1:15461 Max retransmits on packet 9630. Closing tunnel
Sep 17 15:38:55: %L2TP-6-PEER: Marking peer SE800-CBV1-1 dead for 120 seconds
Sep 17 15:38:55: %L2TP-6-TUNNEL: SE800-CBV1-1:15461 remote abort: Reached max retransmits
Sep 17 15:39:06: %L2TP-6-TUNNEL: SE800-MIT-2:15463 Max retransmits on packet 9604. Closing tunnel
Sep 17 15:39:06: %L2TP-6-PEER: Marking peer SE800-MIT-2 dead for 120 seconds
Sep 17 15:39:06: %L2TP-6-TUNNEL: SE800-MIT-2:15463 remote abort: Reached max retransmits
Sep 17 15:39:08: %L2TP-6-TUNNEL: SE800-VEL-1:15455 Max retransmits on packet 8060. Closing tunnel
Sep 17 15:39:08: %L2TP-6-PEER: Marking peer SE800-VEL-1 dead for 120 seconds
Sep 17 15:39:08: %L2TP-6-TUNNEL: SE800-VEL-1:15455 remote abort: Reached max retransmits
Sep 17 15:39:45: %L2TP-6-TUNNEL: SE800-ABV-2:15457 Max retransmits on packet 9278. Closing tunnel
Sep 17 15:39:45: %L2TP-6-PEER: Marking peer SE800-ABV-2 dead for 120 seconds
Sep 17 15:39:45: %L2TP-6-TUNNEL: SE800-ABV-2:15457 remote abort: Reached max retransmits
Sep 17 15:40:00: %L2TP-6-TUNNEL: SE800-MAS-1:15453 Max retransmits on packet 8414. Closing tunnel
Sep 17 15:40:00: %L2TP-6-PEER: Marking peer SE800-MAS-1 dead for 120 seconds
Sep 17 15:40:38: [0003]: %BGP-6-INFO: xx.xx.xx.xx DOWN - Notification sent
Sep 17 15:40:38: [0003]: %BGP-6-INFO: xx.xx.xx.xx send NOTIFICATION: 4/0 (hold time expired) with 0 byte data. mxReadMs=60857
Sep 17 15:40:45: [0003]: %BGP-6-INFO: xx.xx.xx.xx DOWN - Notification sent
Sep 17 15:40:45: [0003]: %BGP-6-INFO: xx.xx.xx.xx end NOTIFICATION: 4/0 (hold time expired) with 0 byte data. mxReadMs=61013
Sep 17 15:40:49: [0002]: %BGP-6-INFO: xx.xx.xx.xx DOWN - Notification sent
Sep 17 15:40:49: [0002]: %BGP-6-INFO: xx.xx.xx.xx send NOTIFICATION: 4/0 (hold time expired) with 0 byte data. mxReadMs=61152
Sep 17 15:41:07: [0002]: %BGP-6-INFO: xx.xx.xx.xx DOWN - Notification sent
Sep 17 15:41:07: [0002]: %BGP-6-INFO: xx.xx.xx.xx send NOTIFICATION: 4/0 (hold time expired) with 0 byte data. mxReadMs=60853
Sep 17 15:41:07: [0003]: %BGP-6-INFO: xx.xx.xx.xx DOWN - Notification sent
Sep 17 15:41:07: [0003]: %BGP-6-INFO: xx.xx.xx.xx send NOTIFICATION: 4/0 (hold time expired) with 0 byte data. mxReadMs=60966
Sep 17 15:41:34: [0002]: %BGP-6-INFO: xx.xx.xx.xx DOWN - Notification sent
Sep 17 15:41:34: [0002]: %BGP-6-INFO: xx.xx.xx.xx send NOTIFICATION: 4/0 (hold time expired) with 0 byte data. mxReadMs=61148
Sep 17 15:45:06: %LOG-6-PRI_STANDBY: Sep 17 15:45:06: %AAA-6-INFO: AAA: session sync completed

And at 19:00 came the general breakdown. So the problem comes from somewhere in the SFR network and is, more precisely, a software bug on the Redback platform.
In any case, this is our own analysis.
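
For readers who want to reproduce this kind of reading of the logs, here is a minimal sketch, assuming the log format is exactly the one shown in the excerpt above. The function name dead_peers and the SAMPLE variable are illustrative and not part of any tool used during the incident; the sample lines are copied from the excerpt.

```python
import re

# Matches the "Marking peer ... dead" lines as they appear in the excerpt.
DEAD_PEER = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}): %L2TP-6-PEER: "
    r"Marking peer (?P<peer>\S+) dead for (?P<secs>\d+) seconds"
)


def dead_peers(lines):
    """Return (timestamp, peer) pairs, in log order, for peers marked dead."""
    found = []
    for line in lines:
        match = DEAD_PEER.match(line.strip())
        if match:
            found.append((match.group("ts"), match.group("peer")))
    return found


# A few lines copied from the log excerpt above.
SAMPLE = """\
Sep 17 15:38:40: %L2TP-6-TUNNEL: SE800-ABV-1:15467 Max retransmits on packet 8803. Closing tunnel
Sep 17 15:38:40: %L2TP-6-PEER: Marking peer SE800-ABV-1 dead for 120 seconds
Sep 17 15:38:40: %L2TP-6-PEER: Marking peer SE800-MIT-1 dead for 120 seconds
Sep 17 15:38:55: %L2TP-6-PEER: Marking peer SE800-CBV1-1 dead for 120 seconds
""".splitlines()

for timestamp, peer in dead_peers(SAMPLE):
    print(timestamp, peer)
```

Listing the peers in the order they were marked dead makes it easier to see that the tunnels to several SFR BAS dropped within about a minute of each other.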

Date: 2011-09-19 05:44:00 UTC
The reboot is done; VEL-1 is up and taking new subscribers. As it is 0:10, there are not many new subscribers.

SFR has given itself until 2:30 to review the logs and work on VEL-1; then, at 3:00 am, they should decide whether to reboot all the others. If they do not reboot the rest and simply hope for a normal Monday, it will certainly be like hitting a wall at 200 km/h.

Too many business customers (companies) are impacted. At a rough glance, the SE800 has 8 cards of 4x10G. Each card can carry 8,000 to 15,000 business ADSL lines (55,000 for general-public lines). So that makes around 80,000-100,000 ADSL lines per BAS.
There are 8 BAS, so between 500K and 800K business ADSL lines run on these 8 BAS. That is quite a few people who will start work tomorrow at 7:00...

Well, a decision has to be made, but there are less than 7 hours to fix everything. If it takes 30 minutes per BAS and there are 7 remaining, that is about four hours of non-stop work, which means the decision has to be taken before 3:00 and acted on.
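
The back-of-the-envelope figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constants are the numbers quoted in this update, and the variable and function names are illustrative.

```python
# Figures quoted in the update above.
CARDS_PER_BAS = 8
LINES_PER_CARD = (8_000, 15_000)            # business ADSL lines per card (low, high)
LINES_PER_BAS_ESTIMATE = (80_000, 100_000)  # the update's own per-BAS estimate
BAS_COUNT = 8
BAS_LEFT_TO_REBOOT = 7
MINUTES_PER_REBOOT = 30


def scale(bounds, factor):
    """Multiply a (low, high) range by an integer factor."""
    low, high = bounds
    return (low * factor, high * factor)


per_bas_from_cards = scale(LINES_PER_CARD, CARDS_PER_BAS)
total_lines = scale(LINES_PER_BAS_ESTIMATE, BAS_COUNT)
reboot_hours = BAS_LEFT_TO_REBOOT * MINUTES_PER_REBOOT / 60

print("lines per BAS from card capacity:", per_bas_from_cards)
print("lines on all BAS:", total_lines)
print("hours of sequential reboots:", reboot_hours)
```

The card capacity gives 64,000 to 120,000 lines per BAS, which brackets the 80,000-100,000 estimate; 8 BAS at that estimate give 640,000 to 800,000 lines; and 7 sequential reboots of 30 minutes each take 3.5 hours, which the text rounds up to four.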

Date: 2011-09-19 00:29:21 UTC
It is back:

SE800-VEL-1 ovh LNS Local 1 1 Unnamed

lns-1-par-se1200#sh l2tp peer SE800-VEL-1
Peer Name: SE800-VEL-1
Admin State: Up
Local Name: ovh
Vendor: RedBack Networks (Firmware: 0x0600)
Hello Timer: 300 Preference: 10
Maximum Tunnels: 32767 Maximum Ses/Tunnel: 65535
Control Timeout: 3 Retry: 10
Tunnel Count: 1 Session Count: 7
Active Sessions: 7

We were asking ourselves why PPP sessions going through some BAS have lower quality than those going through others. This is probably because subscribers are not evenly distributed across the BAS, owing to the reloads performed during the day. When a BAS is reloaded, all of its subscribers switch to the other BAS. Each of the N-1 remaining BAS then carries X + (X / (N-1)) subscribers, which is apparently beyond its limit.
Basically, the infrastructure is overloaded and cannot withstand a failover or a failure.
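
The redistribution described above can be sketched as follows. This is a simplified model that assumes every BAS starts with the same number of subscribers X and that the subscribers of the reloaded BAS spread evenly over the others; the function name and the figure of 100,000 subscribers are illustrative, not measured values.

```python
def load_after_failover(subscribers_per_bas, bas_count):
    """Subscribers carried by each remaining BAS when one of them is reloaded.

    Implements X + X / (N - 1): each survivor keeps its own X subscribers
    and takes an equal share of the X subscribers from the reloaded BAS.
    """
    if bas_count < 2:
        raise ValueError("at least two BAS are needed for a failover")
    return subscribers_per_bas + subscribers_per_bas / (bas_count - 1)


# Illustrative values: 8 BAS, 100,000 subscribers each.
before = 100_000
after = load_after_failover(before, 8)
print(f"per-BAS load goes from {before} to {after:.0f} "
      f"(+{(after / before - 1) * 100:.1f}%)")
```

With 8 BAS, a single reload adds one seventh (about 14%) to the load of each remaining BAS, which is enough to push equipment that is already near its limit over the edge.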

Date: 2011-09-19 00:18:06 UTC
VEL-1 reboot in progress
#sh l2tp peer
Conf. Tun Ses
Peer Name Local Name Role Source Count Count
-------------------- -------------------- ---- ------ ----- -----
SE800-VEL-1 ovh LNS Local 0 0 Unnamed DOWN

Sep 18 22:29:08: %L2TP-6-TUNNEL: SE800-VEL-1:508 Max retransmits on packet 277. Closing tunnel
Sep 18 22:29:08: %L2TP-6-PEER: Marking peer SE800-VEL-1 dead for 120 seconds
Sep 18 22:29:08: %L2TP-6-TUNNEL: SE800-VEL-1:508 remote abort: Reached max retransmits

Date: 2011-09-19 00:17:35 UTC
Many customers are again reporting bandwidth problems, even after the session kills. Clearly the problem is not resolved.

And if it were me, I would already have gone to the datacentre to cut the power to the washing machines and see what happens after a proper hardware power cycle.

Date: 2011-09-18 23:50:34 UTC
At 23:30, SFR plans to perform a hardware reboot of one of the 8 BAS: VEL-1, which was the first one to be software-reloaded this afternoon.

It takes about 25 minutes to reboot a Redback BAS. Customers using this BAS (in our case fewer than 500 subscribers) will be cut off and should reconnect through another BAS.

Once it is rebooted, SFR will monitor it for 4 hours. We assume this will mean checking graphs and judging at a glance whether the rebooted hardware is working well or badly.

At 04:00 am, a short meeting is planned to decide whether the other BAS will be restarted with the same method or a faster one.

A prayer is in order.

Date: 2011-09-18 23:34:34 UTC
In pictures, the disaster of the weekend.
This is the traffic curve of our Paris LNS:
http://yfrog.com/z/h6ojpihj

The likelihood that the problem continues tomorrow is very high!

Date: 2011-09-18 23:30:26 UTC
Marseille, Bordeaux, Lille, Lyon done.
Sessions are back up. Customers have recovered their sessions and their bandwidth.

Date: 2011-09-18 23:27:51 UTC
Paris done.

Date: 2011-09-18 23:27:36 UTC
Kill in progress...

Date: 2011-09-18 23:27:06 UTC
Some customers are currently losing packets and having quality problems.
Apparently, after the session is killed, quality is good again.

We will kill all sessions in the next 10 minutes.

This means that the underlying problem is not fixed. And tomorrow we will end up with a black day to manage.

Date: 2011-09-18 23:22:10 UTC
No session has crashed, and since there is no visible problem, SFR and Ericsson are saying that the problem is fixed.

But the problem is not fixed at all; it has gone away only because it is Sunday 22:00.

Tomorrow is Monday, and it will crash like this morning at 11 am, and we will all cry because we will have to wait until 22:00 for some customers to stop having problems.

This is driving us crazy!

Date: 2011-09-18 23:01:49 UTC
So far, SFR has performed many software reboots of their BAS, 1G cards, 10G cards and supervision cards.
No hardware reboot has been performed.

Ericsson refuses to do anything unless a hardware reboot of all BAS is performed.

Since some customers are still working through the BAS, SFR refuses to perform a hardware reboot before 22:00.

So until then we will tinker: we will monitor the tunnels between the BAS and our LNS. If the tunnels go down, SFR will restart them manually.

That's it.

Unfortunately, no one at SFR seems able to make the decisions that have been needed for the past 24 hours, so we will have to wait and probably spend a sleepless night reassuring our customers and following the news.
It is the weekend, fine, but we have to carry on and reduce the downtime without seeking stupid compromises.
Tomorrow is Monday and everyone wants to use the internet at the office without faults!
Not to mention all the individual customers who have had trouble one hour out of two this weekend.

In short, we will have no news before 22:00. In the meantime we watch our tunnels, SFR does too, and the DIY (Do It Yourself) continues.

At 22:00, the hardware reboots will start. We do not yet know whether it will go quickly; we will keep posting details.

Once the hardware reboot is done, there are 2 possible outcomes:
- Either the problem is fixed and we will never identify its origin, and then we will probably have it again another day under the same conditions
- Or it is not fixed, and Ericsson will start working to find the origin of the problem, i.e. where the software bug in the Redback LNS / BAS that they sell comes from.

We feel that what is missing is a licensed, up-to-date driver who can drive this car.

Date: 2011-09-18 22:22:00 UTC
VEL-1 crashed again, only the L2TP tunnel.

Date: 2011-09-18 22:20:35 UTC
All the BAS have been reloaded.
But one of the Mitry BAS has crashed again.

There is a conference call between SFR and Ericsson, which started at 19:30.

The reboots performed so far were only soft ones... just a configuration sync with a reload!
A hard reboot of all BAS is expected at 22:00.

We hope to have more information after the conference call.

Date: 2011-09-18 22:15:39 UTC
5 of SFR's 8 BAS have been rebooted.
Reboots on SFR's side continue.
It takes so long because it is done sequentially, and SFR takes 30-45 minutes to get everything stable before moving on to the next BAS.

Date: 2011-09-18 22:10:11 UTC
We asked SFR to shut down the interconnection port between our Lyon LNS and SFR.

Date: 2011-09-18 21:46:44 UTC
Who is down?
50% of the "SFR company" customers
All the "Bouygues company" customers and SFR's operator clients such as OVH.

Who is not impacted?
The SFR / Bouygues general-public customers, who go through another infrastructure.
Our customers who are on our own DSLAMs.

Date: 2011-09-18 21:40:08 UTC
SFR rebooted one of the BAS; the service is restarting.
Ericsson (the manufacturer that supplies the Redback LNS) is working on the problem for SFR.

Date: 2011-09-18 21:37:57 UTC
The service seems to be gradually restarting.
We don't have any official information yet.

Date: 2011-09-18 21:36:37 UTC
The origin of this problem is still unknown.
The manufacturer's support is currently working on the problem with SFR to find a solution.

Date: 2011-09-18 14:07:44 UTC
We will have an assessment from SFR in about 20 minutes.
For the time being, a level +1 "on-call support" engineer is in
the SFR datacenter and is trying to get the service working
by changing the cards.

Date: 2011-09-18 13:56:05 UTC
Some BAS still have the problem. The PPP session comes up and then drops.
SFR thought that it was because of the Aubervilliers BAS. They replaced the cards
on this BAS, but the problem remains. They are still searching for the origin.

We still have no estimated time for the restoration of the service.

Some SFR business customers and other customers of SFR are impacted.

Date: 2011-09-18 13:46:46 UTC
There are 3 BAS working instead of 7.

It seems that SFR is switching traffic onto these 3. We are contacting them
to ask why.

Date: 2011-09-18 13:45:40 UTC
There are still instabilities on the SFR BAS side.
A card at the Aubervilliers site will be replaced at 13:00.

Customers connecting through BAS other than Aubervilliers do not
have this problem.

Date: 2011-09-18 13:43:34 UTC
A new interruption.

Date: 2011-09-18 13:43:02 UTC
The situation is still not stable. According to the logs, we had
2 very short interruptions at 9:02 and 11:10. We are forwarding the information
to SFR to see how to fix this problem for good.

Date: 2011-09-18 13:41:09 UTC
The source of the problem was the BAS of our "collect" supplier.
It is now corrected.

Date: 2011-09-18 13:40:03 UTC
Everything is now in order, all the ppp sessions are up and stable.

Date: 2011-09-18 13:39:16 UTC
For the third hour running, everything seems to be stable; we are
still monitoring the network.

Date: 2011-09-18 13:37:53 UTC
For the second hour running, everything seems to be stable; we are
still monitoring the network.

Date: 2011-09-18 03:39:36 UTC
During the last hour, everything has looked stable; we are monitoring the network.

Date: 2011-09-18 03:37:56 UTC
The origin of this problem has been identified: the "collect" provider is restarting its backbone routers, and PPP sessions seem to be very stable.
We still need to do some work to make sure that everything is going well.

Date: 2011-09-18 01:20:46 UTC
Most sessions are up but still not stable; we are working with SFR to resolve it.

Date: 2011-09-18 01:18:41 UTC
Ok, it is almost fixed, but we're not sure yet if it's really ok, we are still checking with SFR to make sure that everything is fine.

Date: 2011-09-18 01:16:02 UTC
Since 2:03 we have noticed that traffic has increased on the SFR LNS "interco".

Date: 2011-09-18 01:12:12 UTC
It seems to be difficult to identify the origin of this problem.
We again provided them with all the information from our side to help resolve this problem.

According to SFR the whole network is impacted.
But we don't see any decrease in traffic between SFR and OVH on hosting. Weird.

This is the first time such a problem has affected all the "BAS".
Until now it affected one "BAS" at most. SFR thinks that the problem could be located between their "BAS" and proxy-radius-sfr.

Date: 2011-09-18 01:02:25 UTC
We are working with their team and exchanging information to try to fix the problem.
We still don't know when this service will be restored.

Date: 2011-09-18 00:59:54 UTC
We are still investigating with SFR.
We still don't know when this service will be restored.

Date: 2011-09-18 00:57:34 UTC
On our side, 100% of our ADSL customers using SFR's "collect" network are impacted by this outage.

There is no impact on our DSLAMs.

At 22:45 and 23:15, some customers recovered their service for 15 minutes.

Date: 2011-09-18 00:42:06 UTC
According to the traffic graphs from the SFR network, SFR's direct customers do not seem to be impacted by this outage.
Traffic is stable on the SFR / OVH peering.

Bouygues, which is another SFR client for ADSL collection, does not seem to be impacted by this problem.

Nerim seems to be impacted.

Date: 2011-09-18 00:37:17 UTC
Apparently they have a general problem on their network; many of their customers are impacted.
This problem is not related to OVH services.

Posted Sep 17, 2011 - 21:32 UTC