UK Ping times gone to pot again


Ryan
11-19-2003, 14:16
This started at 7pm GMT [about +5hours], pings 600ms+
Screenshot: http://www.fatboy.me.uk/burst191103.gif


Any ideas? I take it from the network status this isn't related as that was 2 hours ago.

Ryan

mbarb
11-19-2003, 17:03
I'm seeing intermittent packet loss and intermittent high ping times on that hop (66.191.191.45) as well. I was just about to open a support ticket as it's been going on for a few days.

Ryan
11-19-2003, 17:05
Been going on a few weeks - here is me original post:
http://forums.burst.net/showthread.php?s=&threadid=1714

Still nothing definite been said....

Ryan
11-20-2003, 17:59
Well whoopee bloody doo - here we go again, same S**t different day.
Today's UK pings are hitting 500ms+

Anyone care to explain why this is happening every night (UK, +5 EST).

Oversold bandwidth, throttling it back????
Anyone want to give me another guess?
So far we have had my ISP doing strange things, cables broken, routers, DDoS....and today's excuse????

Proof: http://www.fatboy.me.uk/burst201103.gif

Now looking elsewhere as it seems you cannot be arsed to give us a straight answer what is going on.

Ryan

Paul
11-20-2003, 19:33
I'd look at that image, but it's taking ages to resolve :(

mbarb
11-20-2003, 19:42
Target Name: fatboy.me.uk
IP: 64.191.59.105
Date/Time: 11/20/2003 6:27:30 PM

5 63 ms sl-bb23-chi-4-2.sprintlink.net [144.232.26.45]
6 62 ms sl-st21-chi-14-1.sprintlink.net [144.232.20.86]
7 65 ms [144.232.9.22]
8 99 ms p16-1-1-2.r20.nycmny01.us.bb.verio.net [129.250.2.123]
9 84 ms p16-6-0-0.r00.nwrknj01.us.bb.verio.net [129.250.2.216]
10 86 ms p16-3-0-0.r01.nwrknj01.us.bb.verio.net [129.250.4.22]
11 82 ms p4-0-1.a03.phlapa01.us.ra.verio.net [129.250.16.121]
12 88 ms fa-1-0.a05.phlapa01.us.ra.verio.net [129.250.116.197]
13 96 ms ge-1-2.a01.phlapa04.us.ra.verio.net [129.250.116.213]
14 89 ms ge-1-2.a01.phlapa04.us.ce.verio.net [130.94.0.166]
15 82 ms [66.197.191.45]
16 95 ms enterprise.vsdns4u.com [64.191.59.105]

Ping statistics for fatboy.me.uk
Packets: Sent = 1, Received = 1, Lost = 0 (0.0%)
Round Trip Times: Minimum = 95ms, Maximum = 95ms, Average = 95ms
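
For anyone wanting to compare traces like the one above, here is a minimal Python sketch that picks out the hop where latency rises the most. The function names `parse_hops` and `largest_jump` are made up for illustration, and the line format is assumed to match the "hop, ms, optional hostname, [IP]" layout in this post. A single probe per hop is noisy (routers often give low priority to answering these probes), so treat the result as a hint, not proof.

```python
import re

# Matches lines such as
#   "8 99 ms p16-1-1-2.r20.nycmny01.us.bb.verio.net [129.250.2.123]"
# and lines with no hostname, such as "7 65 ms [144.232.9.22]".
HOP_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s*ms\s+(?:(\S+)\s+)?\[([\d.]+)\]\s*$")


def parse_hops(text):
    """Return a list of (hop_number, rtt_ms, hostname_or_None, ip) tuples."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append((int(m.group(1)), int(m.group(2)), m.group(3), m.group(4)))
    return hops


def largest_jump(hops):
    """Return (delta_ms, hop) for the biggest RTT increase over the previous hop.

    Returns None when there are fewer than two hops to compare.
    """
    best = None
    for prev, cur in zip(hops, hops[1:]):
        delta = cur[1] - prev[1]
        if best is None or delta > best[0]:
            best = (delta, cur)
    return best


# Three hops copied from the trace above.
sample = """\
7 65 ms [144.232.9.22]
8 99 ms p16-1-1-2.r20.nycmny01.us.bb.verio.net [129.250.2.123]
9 84 ms p16-6-0-0.r00.nwrknj01.us.bb.verio.net [129.250.2.216]
"""

delta, hop = largest_jump(parse_hops(sample))
print(f"hop {hop[0]} ({hop[3]}): +{delta} ms")  # hop 8 (129.250.2.123): +34 ms
```

Running the same comparison on a trace taken during the slow period and one taken when things are normal would show whether the extra delay always appears at the same hop.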

Ryan
11-21-2003, 02:35
Strange that it is - for the last 2 days this happens, I stick up a moan and a picture showing the crappy response times and 20 or so minutes later everything is back to normal, 125ms pings etc.....
mmmmmmmm.

Both days that happens......

Ryan
11-21-2003, 03:35
Glad to see the reasons so forthcoming to the public forums, piles of confidence spilling my way.....ahem.

Only 12 hours since I asked the question have gone by.

Ryan

BurstAlex
11-21-2003, 08:39
Since in either event nothing was changed on our side, I do not see the point of saying "Aliens did it".

Ryan
11-21-2003, 10:04
Alex.

Thank you for answering - so if it's nothing on your side, how come all the traceroutes that have appeared in these here forums point at the NOC, or show problems at the hop just before it?
It's the same time every night - the pings go stupid, someone moans here and then they are right again.

Something is not right if UK people on the whole are reporting problems. No doubt we are all coming in via different routes, going to different servers etc, so the only other common factor in the thing is Burst.

Ryan

BurstAlex
11-21-2003, 12:05
The only difference between the way the network was operating before the massive problem two days ago and now is that we still have the Cogent link out of service. If I recall correctly, the UK connected customers were the ones complaining about presence of Cogent to begin with anyway.

We will be working on trying to carefully restore full service, including Cogent, however, I must state that I am not being overly optimistic at this time.

I would like to stress that those who claim that certain things do not happen with a network that has redundancy are full of it - the reason that BurstNET clients had connectivity during the meltdown is a testament that this design works and works rather well.

It is not that I would not prefer to simply rip it apart and redo it to fix the problem - it is that trying to do that may cause extensive total outage that I am sure no one wants to see.

Ryan
11-21-2003, 18:14
Now that is the best answer we have had to date - thank you.

BurstAlex
11-21-2003, 19:24
Originally posted by BurstAlex
The only difference between the way the network was operating before the massive problem two days ago and now is that we still have the Cogent link out of service. If I recall correctly, the UK connected customers were the ones complaining about presence of Cogent to begin with anyway.

We will be working on trying to carefully restore full service, including Cogent, however, I must state that I am not being overly optimistic at this time.

I would like to stress that those who claim that certain things do not happen with a network that has redundancy are full of it - the reason that BurstNET clients had connectivity during the meltdown is a testament that this design works and works rather well.

It is not that I would not prefer to simply rip it apart and redo it to fix the problem - it is that trying to do that may cause extensive total outage that I am sure no one wants to see.

Now the giant clusterf!ck of the outage that everyone experienced in Philadelphia _is_ what happens when something that is not supposed to happen does happen.

While I will post the full explanation later, what happened, according to all the textbooks, documentation, vendor claims, vendor sales people and vendor literature, has a probability of less than one in a few trillion. A transit switch (trsw-1.phl) decided not just to die, but to die to a point where nothing but the fans works, _and_ damage the GBIC modules of the router that it is attached to via _fiber_ and _only_ fiber. While we had a person on site within minutes of the outage starting, he was not able to either recover or bypass the affected equipment, even though it was _supposed_ to work fine, and _did_ work fine when we simulated equipment failure several months ago.

sightz
11-21-2003, 20:04
Thank you for the excellent response BurstAlex!

I am gratified to know that you had spares available, and had tested a simulated failure!

From the outside looking in, it seemed that if a single switch could kill the whole network, perhaps there wasn't any redundancy in place.

When will Burst be big enough to clone another one of you so one can stay at each NOC?

Has the issue of backups/alternates between Philly and Scranton been discussed? Not knowing your network topology, that looks like a (dreaded) single point of failure.

I don't know how you sleep at night with the pressures of keeping this machine going! I know you haven't slept much this month and we appreciate the hard work and detailed reports. Things will get better, right?

Domenico
11-21-2003, 20:29
What equipment was damaged (brand, type) Alex?

BurstAlex
11-21-2003, 20:33
Two SX GBICs, one SM to MM gige converter, one Extreme Summit4

GNGNetwork
11-21-2003, 21:31
Originally posted by BurstAlex
Two SX GBICs, one SM to MM gige converter, one Extreme Summit4

Unfortunately, even the highest-end, highest-quality, finest equipment can die exactly when you least expect it. That's why an Emergency Response Team for these rare incidents should always be ready and well-trained for all cases...

Because it's Murphy's Law that applies exactly when you least expect it...

It's a matter of luck, but I am glad that I now know exactly what happened, so that I will be able to explain to my clients that even when something is set up with redundancy and high availability, sh*t happens and no-one can be 100% sure that everything will go right for the rest of time... :confused: I just hope that something like this won't happen again in the near future, because it could be very difficult to explain to my clients...

BurstAlex
11-22-2003, 00:30
And in this case, the highly trained Emergency Response Team happened to be driving one of these rental cars with a spare Summit4 in a trunk that was fetched from storage (even though at that time we were sure it was not the Summit that failed) and three GBICs in my pocket obtained from Ebay the previous day to be plugged into my personal network at home.

sightz
11-22-2003, 11:10
Originally posted by BurstAlex
... three GBICs in my pocket obtained from Ebay the previous day to be plugged into my personal network at home.

Alex, you have just been elevated even higher in my "ranking of the world's nerds" :-)

Your home network must be quite interesting if it needs 3 GBICs. I didn't even know you could get fiber into a private home yet.

GBIC = Gigabit Interface Converter:- an interface module which converts between the light signal on a fibre cable and the electrical signals used by the network device it plugs into.


I thought I was cool just because I have cat5 in every room and a patch panel in the basement that scared the Cable Guy. :-)


--He who has the most bandwidth when he dies, wins.

Ryan
11-22-2003, 14:58
And sorry to pop the feel good bubble, but yet again - on cue as always, the pings have gone stupid:

http://www.fatboy.me.uk/burst221103.gif

*sigh*

Best go move some more sites away if I can.......

burstSalman
11-22-2003, 15:15
How is it now? Just pulled a server that was flooding.

I really wish people would audit their code before putting out php scripts.

Ryan
11-22-2003, 15:22
Back to normal.
Very many thanks Salman.

Now if these scripts are the cause over the last few weeks I will be a very happy man - shame no one else could find them though ;)

BurstHan
11-22-2003, 15:40
Originally posted by burstSalman
How is it now? Just pulled a server that was flooding.

I really wish people would audit their code before putting out php scripts.

Or cap every port to 10mbps like other datacenters... but I don't think many people will be happy.

Ryan
11-22-2003, 15:43
ooo we have a staff difference of opinion.......

burstSalman
11-22-2003, 16:22
It's not really a difference of opinion, rather a different approach to the same issue.

The majority of the floods that we come across occur because a php script got exploited and a user (or just anyone) was able to upload a backdoor telnet daemon or some other type of shell-script to the server.

Once that's done, that person can log onto the server, get a shell interface and place anything they'd like in the world-writable/executable directories, e.g. flood scripts.

Capping ports at 10mbps would prevent floods that rise up to 98-100mbps.
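
As a rough illustration of the kind of audit meant above, this Python sketch lists directories that any user on the box can write to, which is where uploaded flood scripts would most easily end up. It is only a sketch under assumptions: the function name `world_writable_dirs` and the example path are made up, it assumes a POSIX-style filesystem, and it does nothing about the exploited script itself.

```python
import os
import stat


def world_writable_dirs(root):
    """Yield directories under root that have the 'other users may write' bit set."""
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are inspected themselves, not followed.
                mode = os.lstat(path).st_mode
            except OSError:
                # Directory vanished or is unreadable; skip it.
                continue
            if stat.S_ISDIR(mode) and mode & stat.S_IWOTH:
                yield path


if __name__ == "__main__":
    # Hypothetical web root; change it to wherever your sites actually live.
    for path in world_writable_dirs("/var/www"):
        print(path)
```

Any directory it prints is worth a second look: if the web server does not need everyone to have write access there, tightening the permissions removes one easy hiding place.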

Ryan
11-22-2003, 17:00
ooo so close, but alas, the network has again developed a nasty side effect of huge pings: another picture:

http://www.fatboy.me.uk/burst221103b.gif

Back to the drawing board guys....

sightz
11-22-2003, 17:54
Originally posted by burstSalman
Just pulled a server that was flooding.

I really wish people would audit their code before putting out php scripts.

Can you explain what was happening? My sites run lots of PHP written by others and I don't even know what "flooding" is. I don't wanna be the one who affects other people's sites!

Thanks

Ryan
11-23-2003, 18:28
Right, off we go again.
11.10pm GMT here and the pings have gone crap again.

www.justahost.co.uk/burst231103.gif

Am I blowing this all out my ass for no reason or does someone actually care what people think in the UK??

Ryan

BurstAlex
11-23-2003, 18:37
Originally posted by Ryan
This started at 7pm GMT [about +5hours], pings 600ms+
Screenshot: http://www.fatboy.me.uk/burst191103.gif


Any ideas? I take it from the network status this isn't related as that was 2 hours ago.

Ryan

This is related to the post that I just made - basically, working around one of the issues that still exists sent traffic that sits behind Sprint the wrong way...

I believe it has been fixed now.

Ryan
11-23-2003, 18:50
Alex - that quote is from the 19th of the month, the post I made above you is from tonight.

The pings are now at 140ms and any sites hosted on the Burst network are taking bloody ages to load, whilst on another DC in the States they are flying along...
Ryan