
21 Nov 2013

Amazon: Increased Launch Error Rates

Posted by iwgcr. No Comments

Between 5:21 PM and 5:42 PM PST, Amazon experienced increased error rates for new EBS-backed instance launches in the US-EAST-1 Region.

Date | Service | Duration | Critical Data Lost
2013-11-21 | Amazon Cloud | 21 minutes | no

Reference:

http://status.aws.amazon.com/


20 Nov 2013

HostGator data center issue

Posted by iwgcr. No Comments

HostGator experienced an issue on November 20th at one of its data centers. The issue was discovered at 11:37 am ET; HostGator isolated the affected servers and began restoring services shortly after, said Joshua Martin, HostGator director of customer service.

By 2 pm, HostGator reported that full service had been restored for the majority of its affected shared and VPS customers. At 3:10 pm ET, HostGator said most affected customers using dedicated servers were also back to normal. Customers on Twitter were still having issues around 4 pm.

Since HostGator moved away from SoftLayer data centers and was acquired by Endurance International Group, many of its customers have reported degraded service. In August, HostGator customers in Ace Data Centers’ Provo, UT facility experienced an outage due to network issues.

Date | Service | Duration | Critical Data Lost
2013-11-20 | HostGator | 4 hours 23 minutes | no

Resources:

http://www.thewhir.com/web-hosting-news/hostgator-restores-hosting-services-data-center-outage

19 Nov 2013

Amazon: Increased API Error Rates and Latencies

Posted by iwgcr. No Comments

Between 10:37 PM and 11:45 PM PST on November 18, Amazon experienced increased API error rates and latencies in the US-EAST-1 Region.

Date | Service | Duration | Critical Data Lost
2013-11-19 | Amazon Cloud | 38 minutes | no

Reference:

http://status.aws.amazon.com/


18 Nov 2013

Google’s YouTube experiences worldwide outage

Posted by iwgcr. No Comments

A brief YouTube outage on Monday was one of the biggest recent glitches for the popular video site, according to a company that uses complaints on Twitter and other sources to measure the impact of online outages.

Around 3 p.m. Pacific Time, the site started displaying a plain-text page with a “500 internal service error” message that read, “Sorry, something went wrong. A team of highly trained monkeys has been dispatched to deal with this situation.”

Google’s YouTube division issued a statement but didn’t immediately give a reason for the outage, which lasted about 10 minutes, or detail how many people it affected.

“Some people encountered errors, or a slower than normal experience on YouTube today,” the statement said. “We worked quickly to address the issue and fixed the problem. We’re sorry for any inconvenience this caused.”

Downdetector’s YouTube index quickly spiked to 19,986 reports of problems on the site. Of the 10 previous YouTube outages that Downdetector recorded, which date back to Aug. 16, none generated even 1,000 reports. YouTube says more than 1 billion unique visitors use the site every month.

Date | Service | Duration | Critical Data Lost
2013-11-18 | YouTube | 10 minutes | no

Resources:

http://www.pcworld.com/article/2064720/youtube-returns-after-a-short-widely-seen-outage.html

 


18 Nov 2013

Rackspace Inc’s server hosting outage affects Pipedrive Inc

Posted by iwgcr. No Comments

Pipedrive Inc, a maker of CRM and pipeline management software, experienced an outage on November 18th 2013. An update was posted on Pipedrive Inc’s blog stating:

“We had an outage of our core services (application and REST API) between Mon 00:35 AM – 2:35 AM UTC / Sun 4:35 PM – 6:35 PM PST time. Customers who tried to use Pipedrive received a 502 error message. The service is now restored, and I am very sorry for the trouble this caused.

We worked hard on restoring the service, and also got back to those of you reaching out to us on customer support. In case you have any questions or concerns, feel free to get in touch.

There is no loss or damage to your data caused by this outage. This is because the problem occurred in one of our central accounts databases which contains only basic information about the accounts — but none of the actual content. Plus, this database is thrice backed up, as is all of our data.

For those interested in tech details — the issue started with a seemingly complete crash of the database server process. However none of the open connections got dropped, so not all of the alarms started ringing the moment the problem occurred. It took about the amount of connection timeout to start the alarm bells, and only after that we identified the problem had occurred somewhere much deeper than just the regular database service layer as the VM stopped responding after forcing a restart — in fact, the problems seem to have occurred as deep as on the KVM virtualization layer inside OpenStack. The exact cause is unknown for us at the moment but we escalated the issue to our service provider at Rackspace, and their senior technicians are investigating this. Meanwhile, we spinned up our services using the real-time backup that we have of this database, and things are working properly again.”
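The detection gap described in the quote (open connections were never dropped, so alarms only fired after a connection timeout) is the usual argument for an active probe with its own short deadline. The following is a minimal sketch of that idea, not Pipedrive's or Rackspace's monitoring; the `probe` function and the two sample checks are hypothetical names introduced here for illustration.

```python
import concurrent.futures
import time


def probe(check, timeout_s):
    """Return True only if `check` answers True within `timeout_s` seconds.

    `check` is any callable that exercises the service end to end,
    for example by running a trivial query. A hung process that keeps
    its sockets open never answers, so the deadline is what catches it.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        return bool(future.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        return False  # no answer before the deadline: treat as down
    except Exception:
        return False  # the check itself failed
    finally:
        # Do not block on a hung check; let the worker finish on its own.
        pool.shutdown(wait=False)


def fast_query():
    """Stand-in for a healthy database answering immediately."""
    return True


def hung_query():
    """Stand-in for a crashed process that still holds connections open."""
    time.sleep(0.3)
    return True


if __name__ == "__main__":
    print("healthy service:", probe(fast_query, timeout_s=0.1))
    print("hung service:", probe(hung_query, timeout_s=0.05))
```

The probe deadline is chosen independently of the client connection timeout, so detection time no longer depends on how long ordinary connections are allowed to wait.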

Date | Service | Duration | Critical Data Lost
2013-11-18 | Rackspace | 2 hours | no

Resources:

http://blog.pipedrive.com/2013/11/update-about-the-recent-outage/

 

18 Nov 2013

Rockstar’s cloud outage disrupts GTA V online game once again

Posted by iwgcr. No Comments

Rockstar Games posted a new update regarding recent character save issues in “GTA Online.” According to a post from the Rockstar Support page on Nov. 18, a representative from the company disclosed that the Rockstar North development team is still looking at several reports from players who were affected by the glitch.

The bug apparently occurred during a cloud server outage on Nov. 16. Although the connection issues have been resolved, the company confirmed that a few problems are still lingering.

Rockstar update 11/18/13 3:00 PM ET: “We are continuing to look into some reports today about various character save issues following the resolution of the Cloud Server outage in Saturday, November 16th.”

Several fans reported that their characters completely changed during the cloud server outage while others complained that their avatars were completely deleted. Rockstar Games previously had to deal with the missing character issues when “GTA Online” was launched last month for “GTA 5.”

Stimulus packages were sent to all players who logged into the online multiplayer mode of the open-world action-adventure video game during the month of October due to the numerous issues. Another batch was sent out recently to compensate select players that lost their cars and vehicle mods. Rockstar Games may have to end up sending out more deposits due to the latest character saving problems.

Date | Service | Duration | Critical Data Lost
2013-11-16 | Rockstar | 1 hour 30 minutes | yes

Resources:

http://www.examiner.com/article/gta-online-character-save-issues-discussed-by-rockstar-games

http://gtaforums.com/topic/650516-rockstar-cloud-server-unavailable-right-now/

 


14 Nov 2013

Salesforce Goes Down in North America and Europe

Posted by iwgcr. No Comments

A system update that went awry appears to be the cause of a failure that took Salesforce.com down across much of North America and most of Europe. In some cases, the outage lasted for as long as three hours.

Salesforce acknowledged the outage on its System Status page, which shows that seven out of 17 instances in North America were affected, as were two out of four in the Europe, Middle East and Africa region. Two instances in the Asia Pacific region were unaffected.

According to a Salesforce statement, the problem began a little before 9 pm ET on November 14th 2013. Users reported having trouble signing on.

The company said the preliminary findings point to planned maintenance on networking equipment that clearly didn’t go as planned.

Salesforce issued a statement, referring specifically to its NA2 instance in North America, but the message was the same across all the others:

“Time: 11/15/13 01:51 AM UTC

Detail: On November 15, 2013 the salesforce.com Technology Team resolved a service disruption affecting the NA2 instance.

The problem began at 01:51 UTC and was resolved by 04:51 UTC. During this time, customers may have experienced an inability to access or intermittent errors to Salesforce application.

Root Cause: The salesforce.com Team is investigating the root cause of this issue. The preliminary findings point it to planned maintenance in the network tier.”

Date | Service | Duration | Critical Data Lost
2013-11-14 | Salesforce | 3 hours | no

Resources:

http://allthingsd.com/20131115/salesforce-went-down-for-about-three-hours-today-in-north-america-and-europe/

 

13 Nov 2013

Microsoft hit by second Office 365 email outage in five days

Posted by iwgcr. No Comments

On November 13th, some Microsoft Office 365 customers in North America were reporting (via Twitter and email) that they were experiencing email problems — just like they were five days before.

A Microsoft spokesperson provided the following update around 2 pm ET November 13th:

“On Tuesday, Nov. 13, some customers served from our North America data centers are experiencing intermittent access to e-mail services. Customers are being updated regularly via our normal communication channels. We sincerely apologize to our customers for any inconvenience.”

The problem was solved the same day and an update was posted on Office 365’s blog:

“From 9:08AM to 2:10PM PST today, November 13th, some customers in North and South America were unable to access email services. The service incident resulted from a combination of issues related to maintenance, network element failures, and increased load on the service. This morning, the Office 365 team was performing planned non-impacting network maintenance by shifting some load out of the datacenters under maintenance. In combination with this standard process, we experienced a ‘gray’ failure of some active network elements; the elements failed, but did not alert us to their failure. Additionally, we have an increasing load of customers on-boarding to the service. These three issues in combination caused customer access to email services to be degraded for an extended period of time. By 10:42am PST, remediation work was underway to balance users to healthy sites, broaden the service access points and remediate the failed network devices. At 2:10PM PST all services were fully restored. Significant capacity increase has already been well underway, but we are also adding automated handling on these gray failures to speed recovery time. Across the organization, we are executing a full review of our processes to proactively identify further actions needed to avoid these situations.”
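A “gray” failure in the sense used above is an element that still reports itself healthy while failing real traffic. One common way to catch this is to compare each element's self-reported status with the error rate observed through it. The sketch below illustrates that comparison only; it is not Microsoft's automated handling, and `ElementStats`, `find_gray_failures`, and the thresholds are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ElementStats:
    name: str
    reports_healthy: bool  # what the element says about itself
    requests: int          # requests observed through it in the window
    errors: int            # how many of those requests failed


def find_gray_failures(stats, max_error_rate=0.05, min_requests=100):
    """Return names of elements that claim health but fail real traffic."""
    suspects = []
    for s in stats:
        if not s.reports_healthy or s.requests < min_requests:
            # Either it is already alarming, or there is too little
            # traffic in the window to judge it fairly.
            continue
        if s.errors / s.requests > max_error_rate:
            suspects.append(s.name)
    return suspects


if __name__ == "__main__":
    window = [
        ElementStats("edge-a", True, 1000, 3),
        ElementStats("edge-b", True, 1000, 400),   # silent failure
        ElementStats("edge-c", False, 1000, 900),  # failing and alerting
    ]
    print(find_gray_failures(window))
```

Elements flagged this way could then be drained automatically, which is the kind of recovery speed-up the statement alludes to.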

Date | Service | Duration | Critical Data Lost
2013-11-13 | Microsoft Office 365 | 5 hours 2 minutes | no

Resources:

http://www.zdnet.com/microsoft-hit-by-second-office-365-email-outage-in-five-days-7000007342/

http://blogs.office.com/2012/11/13/update-on-recent-customer-issues/

 

11 Nov 2013

Twitter mobile outage

Posted by iwgcr. No Comments

Twitter experienced a brief mobile service issue on November 11th 2013 that prevented some users from loading timelines on their mobile devices. The issue lasted 14 minutes.

Date | Service | Duration | Critical Data Lost
2013-11-11 | Twitter | 14 minutes | no

References:

http://status.twitter.com/post/66717974698/brief-mobile-service-issue

8 Nov 2013

Microsoft’s Office 365 experiences mail delays

Posted by iwgcr. No Comments

Office 365 experienced an issue on November 8th 2013 that resulted in prolonged mail flow delays. A post was published on Office 365’s blog explaining the problem:

“The first event occurred on November 8th from 11:24AM to 7:25PM PST.  This service incident resulted in prolonged mail flow delays for many of our customers in North and South America.  Office 365 utilizes multiple anti-virus engines to identify and clean virus messages from our customers’ inboxes. One of these multiple engines identified a virus being sent to customers, but the engine started to exhibit a lot of latency even as it handled the messages.  To compound the issue, our service was configured to allow too many retries and provide too long of a timeout for these messages.  Given the flood of these specific emails to some of our service capacity, this improper handling caused a significant backlog of valid email message throughput in these units.  We resolved the issue by deploying an interceptor fix to deal with the offending messages and send them directly to quarantine.  Going forward, we are instituting multiple further levels of defense. In addition to fixing the engine handling, we now have instituted more aggressive thresholds for deferring problem messages.  We have also built and implemented better recovery tools that allow us to remediate these situations much faster, and we are also adding some additional architectural safeguards that automatically remediate issues of this general nature.”
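The statement attributes the backlog to too many retries and overly long timeouts for messages that one scanning engine handled slowly, and describes the fix as sending the offending messages straight to quarantine. A minimal sketch of that policy follows. It is an illustration under stated assumptions, not the Office 365 implementation: `process_queue` is a hypothetical name, and `scan` stands for any function that takes a message and a timeout and raises `TimeoutError` when the engine is too slow.

```python
def process_queue(messages, scan, max_attempts=2, timeout_s=1.0):
    """Scan each message with a bounded number of short attempts.

    Returns (delivered, quarantined). A message that never gets a
    verdict within `max_attempts` tries is quarantined instead of
    being retried indefinitely, so it cannot hold up valid mail.
    """
    delivered, quarantined = [], []
    for msg in messages:
        for _ in range(max_attempts):
            try:
                verdict = scan(msg, timeout_s)
            except TimeoutError:
                continue  # engine too slow on this attempt; try again
            if verdict == "clean":
                delivered.append(msg)
            else:
                quarantined.append(msg)
            break
        else:
            # Loop ended without a verdict: retries exhausted.
            quarantined.append(msg)
    return delivered, quarantined


if __name__ == "__main__":
    def demo_scan(msg, timeout_s):
        # Hypothetical engine: stalls on messages marked "slow".
        if "slow" in msg:
            raise TimeoutError
        return "infected" if "virus" in msg else "clean"

    print(process_queue(["hello", "slow-1", "virus", "bye"], demo_scan))
```

With small values for `max_attempts` and `timeout_s`, the worst-case time spent on any single message is bounded by their product, which keeps a flood of problem messages from starving the rest of the queue.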

Date | Service | Duration | Critical Data Lost
2013-11-08 | Microsoft Office 365 | 8 hours 1 minute | no

Resources:

http://blogs.office.com/2012/11/13/update-on-recent-customer-issues/