Amazon Web Services storm outages serve as a warning of cloud risk to businesses

Australian businesses have been warned they need to spread the risk in their cloud computing operations across different regions after the Sydney storms on Sunday knocked out the operations of numerous Amazon Web Services customers.

The ferocious storms that hit NSW left AWS clients including Domino’s Pizza, Foxtel, The Iconic, Stan and Domain without websites or key systems for hours.

It served as a warning that sending systems to the cloud, rather than hosting them on-premises, did not remove the risk of costly failures.

The failure represents a major embarrassment for the company, which generated $US2.57 billion revenue in the latest quarter, based largely on the fact that it is perceived as being hugely reliable.

It meant that customers that had committed all of their systems to its care were unable to trade from mid-afternoon until as late as Monday morning.

On Monday AWS refused to discuss the reasons behind the outage, instead referring inquiries to an online site, which showed the status of its data centres.

A spokesman also declined to comment on whether affected businesses would be entitled to compensation, and what damage the outages had caused for its own business locally.

It is understood that power was lost in a number of systems in its data centre, before being restored about 90 minutes later. It then took hours for some systems to be rebooted and brought back online.

Experts said that customers who insisted all their systems and data remained in Australia had been impacted, as other customers simply had their systems clicked over to Singapore and continued to trade as normal.

IBRS analyst Joe Sweeney said many would use the outage as an example of why businesses should not adopt cloud computing.

However, he said it was more of a wake-up call for businesses to structure their technology in such a way that one data centre failing would not be fatal.

“In short, I think the outage is damaging for the entire market – not just AWS,” Mr Sweeney said.

“I think this event is a great opportunity for organisations to rethink what they were trying to get from moving to the cloud. It is also a cautionary tale – ensure your critical applications are built to survive failures of any data centre.”

The outages came as more than 226,000 homes and businesses lost power during the weekend storms. Roads, bridges and public transport were also affected.

The bodies of three men were also discovered in cars caught in floods in separate incidents in the ACT, the NSW Southern Highlands and Sydney’s south-west.

So far, $30 million worth of insurance claims have been made.

Co-founder of event software company Proxima, Dan Nolan, said his company was an AWS client, but was saved from the outage because it had decided to take a multi-availability zone failover approach.

“When we were setting up our infrastructure a bit over a year ago, we took a punt on multi-availability zone failover,” he said.

“It was literally for if this situation ever occurred … the cost otherwise though would have been quite substantial to the business. Having multi-availability zone is a fraction of the cost to having a site go down.”
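As a rough illustration of the failover idea Mr Nolan describes, the sketch below picks the first zone that passes a health check. It is a minimal sketch under stated assumptions, not a description of Proxima's or AWS's actual setup: the zone names, the `pick_healthy_zone` function and the stubbed health check are all hypothetical, and in practice this job is usually delegated to a provider's load balancing or DNS health checks rather than hand-written code.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_zone(
    zones: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first zone whose health check passes, or None if all fail."""
    for zone in zones:
        try:
            if is_healthy(zone):
                return zone
        except Exception:
            # A health check that raises is treated the same as a failed zone.
            continue
    return None


# Hypothetical zone names, listed in order of preference.
ZONES = ["zone-a", "zone-b", "zone-c"]

# Assumption for the example: the primary zone has lost power.
DOWN = {"zone-a"}


def stub_health_check(zone: str) -> bool:
    """Stand-in for a real probe such as an HTTP request to a status endpoint."""
    return zone not in DOWN


active = pick_healthy_zone(ZONES, stub_health_check)
print(active)  # zone-b
```

The point of the design is that traffic has somewhere to go when one zone fails; a business with a single zone gets `None` back and stays offline until that zone recovers.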

Mr Nolan said the company had taken heed of lessons learnt in the United States where businesses such as Instagram had suffered major outages in the past.

Proxima was not the only company with a backup strategy in place.

Mobile ordering and payment technology company AirService uses AWS as its main provider, but had another ready to go in case of an outage.

“As amazing as AWS is, and we’ve been using it for a long time, we do understand that things can sometimes go wrong,” AirService chief executive Dominic Bressan said.

Telsyte analyst Rodney Gedda said Sunday’s storm showed that cloud computing wasn’t infallible, and that organisations needed to factor in risks when weighing up their strategy. Despite the failure on Sunday, AWS systems are likely to be much more robust than those run by individual organisations.

“No outage is ‘good’, but it’s unlikely this event will damage AWS’s brand too much. All the main cloud players have experienced some form of downtime so Amazon is not alone and events like these are generally accepted by customers,” Mr Gedda said.

“AWS will no doubt perform a post mortem on this event and do what it can to prevent it from happening again … Cloud services won’t magically protect your business from downtime or data loss. You still need to be proactive.”

 

Talking innovation and disruption with Telstra’s CRO

Kate Hughes talks to StrategicRISK about how risk management is helping Australia’s largest telecommunications provider become a global technology player

It’s 6.20am and Kate Hughes’s phone goes off. The chief risk officer for Australia’s largest telecommunications provider, Telstra, has been called to activate the crisis management team to deal with a major outage affecting thousands of customers.

By 7am, an action plan is in place and Hughes can begin her day. But an hour later she receives a report from a whistleblower alleging bad behaviour of a senior executive, which sees her launch an immediate internal investigation through her fraud team. Then, a few hours later, Hughes is alerted to a customer privacy breach, so it’s then a call to the regulators to alert them of the incident.

It’s not even lunchtime, and Hughes has already fielded more incidents than most chief risk officers would see in a month.

Hughes has agreed to an interview with StrategicRISK to discuss how risk management is helping Telstra navigate a strategic business model change from a traditional domestic telecommunications provider to a global technology company.

But first, a history lesson.

Telstra is one of Australia’s most well-known companies. The country’s largest telecommunications provider builds and operates networks around Australia and markets mobile, internet access, pay television and other entertainment products and services.

But the pace of digital change has not been kind to traditional telcos, forcing Telstra, and most of its competitors, to pivot from their historic business model.

Today the company has its sights on being a global technology company.

Last year the company invested almost $1.2b in acquisitions, including a controlling stake in 15 new businesses. It also expanded its reach in Asia through acquiring Pacnet in Singapore and launching TelkomTelstra in Indonesia, and activated new business units such as Telstra Health.

This pace of change, coupled with the profound shift in the way people connect and communicate, means Telstra faces a challenging set of business risks that threaten it achieving its growth ambitions and financial targets.

This is where Hughes comes in.

“Most people say to me, I’ve got one of the most interesting jobs in the company and I would agree that I do. There’s very little that I’m not across, or not involved in, or not able to add value to,” Hughes says. “I get to make decisions about the kind of ladders we use in the field, I get to talk about the risks of having handbrake alarms in some of our cars, and I also get to talk about the risks of technology disruption as it will impact on our strategy to be a world-class technology company.”

The risk function at Telstra has evolved significantly over the past three-and-a-half years under Hughes’s leadership. The 160-strong risk office now looks after the group’s risk management, compliance and privacy functions, as well as its law enforcement capabilities, fraud investigations, enterprise resilience, security, and health, safety and environment arms.

Hughes, who reports into chief financial officer Warwick Bray, admits she is lucky to work for an executive team who take risk management seriously.

“It’s a privilege to be involved in something that helps our executives make better decisions,” she says.

And with the pace of change that Telstra is facing, that decision-making needs to happen quickly.

“We can be disruptive or we can be disrupted and we’ll probably be both. That’s not necessarily a bad thing. I think disruption creates solid incentive to be more innovative and that’s good,” she says.

Telstra is undergoing a major internal simplification process, driven by the risk of not being able to keep up with younger, more agile, tech start-ups.

“I’m in a meeting every Tuesday morning on this to see what am I doing to help us get there,” Hughes says, adding that she sees the company’s simplification and disruption impetus as an opportunity to show the benefits of risk-based decision making.

“Everything we do requires us to do a risk assessment and that shouldn’t be seen as an onerous, bureaucratic thing, but actually built in to our processes every day.

“Part of the business case is doing a risk management assessment. You don’t tack it on the end, it’s not done at five minutes to midnight, it’s not done once we’ve agreed to everything else … it’s part of the process.

“That is the evolution of risk management – to take it out of the academic, out of the process, and make it much more part of the business conversation so that it actually adds value to the commercial decision-making challenge that your leader has,” she says.

Hughes cites an example with the head of Telstra property, who had to decide how to allocate his spending when it came to upgrade work on the group’s exchange sites. By applying a safety rating to every exchange, Hughes’s team was able to prioritise which sites should be worked on first.

Back to where it started

In some ways Hughes has come full circle to her role at Telstra.

After graduating with a commerce degree with majors in economics and finance, she took up a role at the NSW Treasury. One of the first companies she audited was Telstra, sitting in the very same Melbourne offices that she does today.

She then moved to the Sydney Futures Exchange where she was responsible for surveying the open trading floor for rogue or illegal trades during its final year of operation.

“I was one of about four women in a room of 400 men that had some pretty bad behaviours,” Hughes recalls.

From there, she moved to the Australian Securities and Investments Commission (ASIC), the country’s corporate, markets and financial services regulator. And it’s this insider experience which has proved invaluable to Hughes at Telstra – one of the country’s most highly regulated companies.

“One of our big risks is going to be a rapidly changing regulatory environment,” she says. “It will go to things like how we regulate data ownership and data sovereignty in the long term.”

Regulators around the world are struggling to keep up with the implications of new technology – and most are doing so at different paces, not to mention with vastly different strengths of legislative iron fists.

For a company with global expansion plans, this adds a huge layer of complexity.

“How do you grow in those countries where your company’s cloud strategies aren’t going to fit with theirs, for example,” she says.

“[Regulation] has the potential to certainly change how we develop and market products. It’s one of the material risks that we talk to the board about. What you have to get very good at doing is staring over the horizon beyond your normal two to three-year period, out to five to eight years and start to think about what regulation will matter then.”

In a disruptive environment, Hughes also sees the potential for corporates to challenge existing regulation.

“If you look at Uber and Airbnb as two business model challenges, everybody talks about those as being challenging at a business model level, but what for me was most interesting is that they challenged existing regulator models as well. Uber drivers never stopped and said ‘I need a taxi license’. So what would happen to us if we fundamentally changed [current] regulation? We do a lot of black swan thinking about some of those risks,” she says.

Cyber and security challenges

In the nearer term, Australia is set to bring in data loss notification laws which will force companies to advise customers when their details have been unlawfully accessed.

“It’s not going to be a huge issue for us because we’ve always thought long and hard about who we should tell when we’ve had a breach of some kind,” Hughes says.

This stance was put to the test last year. Just two weeks before Telstra’s $697m acquisition of Pacnet was finalised, the Asian telecommunications business was hacked by an unknown third party which gained complete access to the company’s network including emails and other administrative systems.

Telstra said it wasn’t told about the breach until after the deal’s completion on 16 April.

In that instance, Hughes says Telstra voluntarily went to eight different regulators about the breach.

“Each one had different expectations about whether or not we would or should tell them,” she says. “We’ve always felt better to be upfront and honest. The worst thing you can do is look like you’re hiding it.”

But Hughes fears that the new breach notification laws could result in consumers getting “notification fatigue”, where they fail to act on important data breaches because they are being alerted of them so frequently.

Instead, when it comes to cyber security, Hughes is turning the lens to the company’s employees, who are often considered the weakest link in any cyber security programme.

“We run drills to see if we can trick our employees into doing something that they shouldn’t have,” she says, such as clicking on a link or opening a suspect attachment.

In the first drill, 30% of employees failed. That dropped to 18% in the second round.

What’s in a name?

Managing major reputation crises is also something that Hughes is well versed in.

In 2005, she was asked to join a company in the midst of a major corruption scandal that saw it on the front page of the papers for more than 400 consecutive days, and its shareholder value slashed by almost $1bn overnight. That company was the Australian Wheat Board (AWB), which was accused of paying millions of dollars in bribes to Saddam Hussein’s regime in Iraq in exchange for lucrative wheat contracts.

“Part of my job was to build the right internal controls, the right risk processes and the right compliance controls to ensure we never ever did that again,” she says.

For four years, Hughes worked with a new management board to help turn the business around.

“Leadership in good times is always a pleasure. The hardest job you will ever do is lead in tough times when there’s bad news on the front page of the paper and your employees feel embarrassed to work for you,” she says.

Hughes believes reputation isn’t a risk as such, but an “outcome of other things you didn’t do very well”.

Regardless, when you’re an organisation the size of Telstra, reputation is incredibly important.

“This year we have put in place much more formal metrics to measure the impact of our resilience on reputation,” Hughes says.

For example, during network outages, Telstra can map social media mentions against the network issues to give an indication on the importance of resilience to its customers.

“It’s also a really good predictor of consumer behaviour, so how many of these [incidents] does it take before a consumer, one, rings up and complains, two, gives us a negative rating, or three, possibly changes services. That’s critical insightful data that we work with marketing, media and communications teams on,” she says.

Hughes is one of the most passionate advocates for strategic risk management that you will meet. But she’s far from traditional.

“The one thing I rarely say to people is that I’m the chief risk officer; what I often say is I’m an executive at Telstra, because part of my job is not just talking about the risks, but talking about the opportunities. At the end of the day my real job is to make sure that our executives know how to make decisions.

“Helping people consciously choose to take risks is good because it means that they’re doing it utterly informed.”

Hughes says that risk managers must move from talking about the “what” – the list of risks and risk registers – to talking about the “now what”.

“Being the person who forces people to sit through three-hour long risk workshops so we can satisfy ourselves that we’ve got 25 pages of risk registers is an academic exercise that has never sat well with me,” she says.

“Doing [risk management] for the sake of governance, whilst necessary, is not necessarily always valuable. Doing it because it helps [the company] make a better decision, save money, spend it more wisely … and potentially be a disruptor yourself because you’ve found a hole in the market that no one else has, that’s where the real value comes from.”

How to Ruin Your Company with One Bad Process

 

Ben Horowitz, Andreessen Horowitz

 

I am a giant advocate for technical founders running their own companies, but one consistent way that technical founders deeply harm their businesses is by screwing up the budgeting process. Yes, the budgeting process. How ridiculous is that? How does it happen and why is it particularly problematic for engineers?

I’ll begin by describing how I screwed it up in my company. Our sales were growing so fast that the biggest problem that we faced was that we literally could not handle all the customers that wanted to sign up for Loudcloud. To combat this and enable us to grow, I worked diligently with my team to plan all the activities that we needed to accomplish to expand our capacity and capture the market before the competition. Next, I assigned sub-goals and activities to each functional head. In conjunction with my leadership team, I made sure that each goal was measurable and supported by paired metrics as well as lagging and leading indicators. I then told the team to figure out what it would take to accomplish those goals and return with their requirements for headcount and program dollars. Finally, I made adjustments to their requests based on industry benchmarks (mostly reductions) to get to a plan that I thought made sense.

Here’s the basic process:

  • Set goals that will enable us to grow
  • Break the goals down so that there is clear ownership and accountability for each goal by a specific team
  • Refine goals into measurable targets
  • Figure out how many new people are required to hit the targets
  • Estimate the cost of the effort
  • Benchmark against the industry
  • Make global optimizations
  • Execute

Unless you are an experienced manager, you may not even see what’s wrong with this process, but it very nearly led to my company’s demise. In fact, the above process is completely upside-down and should only be followed if you wish to bloat your company to the brink of bankruptcy and create a culture of chaos.

When I asked my managers what they needed, I unknowingly gamified the budgeting process. The game worked as follows: The objective was for each manager to build the largest organization possible and thereby expand the importance of his function. Through the transitive property of status, he could increase his own importance as well. Now you may be thinking, “That wouldn’t happen in my company. Most of my staff would never play that game.” Well, that’s the beauty of the game. It only takes one player to opt in, because once someone starts playing, everybody is going in — and they are going in hard.

Gameplay quickly becomes sophisticated as managers develop clever strategies and tactics to improve their chances for winning. One common game technique is to dramatically expand the scope of the goals: “When you said that you wanted to increase our market presence, I naturally assumed that you meant globally. Surely, you wouldn’t want me to take a U.S.-centric view.” To really motivate the CEO, another great technique involves claiming dire circumstances if the company fails to achieve its metrics: “If we don’t increase sales by 500% and our top competitor does, we will fall behind. If we fall behind, we will no longer be No. 1. If we’re not No. 1, then we won’t be able to hire the best people, command the best prices, or build the best product, and we will spin into a death spiral.” Never mind the fact that there is almost no chance that your competitor will grow 500% this year.

Another subtle problem with this process is that when I asked my team what they needed to achieve their goals, they naturally assumed they would get it. As a result, my team deeply socialized their ideas and newly found money with their teams. This has the added gaming benefit of inextricably tying their demands to company morale. When the VP of marketing asked me for 10 headcount and $5M in program expenses, then shared that plan with his team, it changed the conversation. Now a major cutback to his plan would alarm his team because they had just spent two weeks planning for a much more positive scenario. “Wow, Ben greatly reduced the plan. Should I be looking for a job?” This kind of dynamic put pressure on me to create a more expansive expense plan than was wise. Multiply this by all my managers and I was on my way to burning up all my cash and destroying my culture.

My core problem was that my budgeting process did not have any real constraints. We were private and did not have a specific profit target that we needed to hit and we had plenty of cash in the bank. Drawing the line on expenses seemed rather arbitrary. In the absence of a hard constraint, I had no constraint.

An excellent constraining principle when planning your budget is the preservation of cultural cohesion. The enemy of cultural cohesion is super-fast headcount growth. Companies that grow faster than doubling their headcount annually tend to have serious cultural drift, even if they do a great job of onboarding new employees and training them. Sometimes this kind of growth is necessary and manageable in certain functions like sales, but is usually counterproductive in other areas where internal communication is critical like engineering and marketing. If you quadruple your engineering headcount in a year, you will likely have less absolute throughput than if you doubled headcount. As an added bonus, you will burn way more cash. Even worse, you will lose cultural consistency as new people with little guidance will come in with their own way of doing things that doesn’t match your way of doing things. Note that this does not apply to you if you have very small numbers. It’s fine to grow engineering from one to four people or from two to eight. However, if you try to grow from 50 to 200, you will cause major issues if you are not extremely careful.

Starting with the cultural cohesion principle, a far better way to run the budgeting process is to start with the constraints. Some useful constraints are:

  • Run rate increase – Note that I say “run rate increase” and not “spend increase”. You should set a limit on the amount by which you are willing to increase what you are spending in the last month of the coming year vs. the previous year.
  • Earnings/Loss – If you have revenue, another great constraint is your targeted earnings or loss for the year.
  • Engineering growth rate – Unless you are making an acquisition and running it separately or sub-dividing engineering in some novel way, you should strive not to more than double a monolithic engineering organization in a 12-month period.
  • Ratio of engineering to other functions – Once you have constrained engineering, then you can set ratios between engineering and other functions to constrain them as well.
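To make the last two constraints concrete, here is a minimal sketch of how an engineering growth cap and a set of ratios could be turned into headcount ceilings. The function name `headcount_caps`, the example ratios and the starting headcount are illustrative assumptions, not figures from Loudcloud; only the "no more than double" rule comes from the text above.

```python
def headcount_caps(current_engineering, engineers_per_head, max_growth_multiple=2):
    """Derive year-end headcount ceilings from the engineering growth constraint.

    current_engineering: engineers on staff today.
    engineers_per_head: for each other function, how many engineers the
        company wants per one person in that function (hypothetical ratios).
    max_growth_multiple: 2 encodes "do not more than double in 12 months".
    """
    engineering_cap = current_engineering * max_growth_multiple
    caps = {"engineering": engineering_cap}
    for function, engineers_per_person in engineers_per_head.items():
        # Integer division rounds down, so a ratio never overshoots the cap.
        caps[function] = engineering_cap // engineers_per_person
    return caps


# Illustrative numbers only: 50 engineers today, one marketer per
# five engineers and one salesperson per two engineers.
caps = headcount_caps(50, {"marketing": 5, "sales": 2})
print(caps)  # {'engineering': 100, 'marketing': 20, 'sales': 50}
```

Because every other function is derived from the engineering ceiling, tightening or loosening that one number moves the whole plan, which is what makes it a useful global constraint.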

After applying the global constraints, the following steps will provide a better process:

  1. Take the constrained number that you created and reduce it by 10-25% to give yourself room for expansion, if necessary.
  2. Divide the budget created above in the ratios that you believe are appropriate across the team.
  3. Communicate the budgets to the team.
  4. Run your goal-setting exercise and encourage your managers to demonstrate their skill by achieving great things within their budgets.
  5. If you believe that more can legitimately be achieved in a group with more money, then allocate that manager extra budget out of the slush fund you created with the 10-25%.
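The five steps above can be sketched as a small calculation. This is an illustration under stated assumptions rather than a prescribed tool: the function names `allocate_budget` and `grant_extra`, the team weights and the dollar amounts are all hypothetical, and amounts are kept as whole dollars so the arithmetic stays exact.

```python
def allocate_budget(constrained_total, reserve_percent, weights):
    """Steps 1 and 2: hold back a reserve, then split the rest by ratio."""
    if not 10 <= reserve_percent <= 25:
        raise ValueError("reserve_percent should be between 10 and 25")
    reserve = constrained_total * reserve_percent // 100
    distributable = constrained_total - reserve
    total_weight = sum(weights.values())
    budgets = {
        team: distributable * weight // total_weight
        for team, weight in weights.items()
    }
    # Any dollars lost to rounding go back into the reserve.
    reserve += distributable - sum(budgets.values())
    return budgets, reserve


def grant_extra(budgets, reserve, team, amount):
    """Step 5: top up one team, but only out of the reserve."""
    if amount > reserve:
        raise ValueError("request exceeds the remaining reserve")
    updated = dict(budgets)
    updated[team] += amount
    return updated, reserve - amount


# Illustrative numbers: a $10M constrained budget with a 20% reserve.
budgets, reserve = allocate_budget(
    10_000_000, 20, {"engineering": 5, "sales": 2, "marketing": 1}
)
print(budgets)  # {'engineering': 5000000, 'sales': 2000000, 'marketing': 1000000}
print(reserve)  # 2000000

# A manager makes a convincing case for an extra $500,000.
budgets, reserve = grant_extra(budgets, reserve, "engineering", 500_000)
print(budgets["engineering"], reserve)  # 5500000 1500000
```

The reserve is the only source of extra money, so the total can never exceed the constrained number no matter how persuasive individual requests are.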

At this point, some readers may think that I’ve lost my mind. As a technologist, you know that the worst thing that you can do is over-constrain the problem before you start. You’ll kill creativity and prevent yourself from getting a truly great outcome. That’s precisely why I, as an engineer, struggled with this process: the human factors muck up the logic. Specifically, local incentives, if not properly managed, will sharply motivate human behavior and defeat the global goals.

It’s critical to recognize this so that you don’t turn your agile, small company into a slow, big company before its time.

Scandal in fantasy sports underscores the importance of internal process and controls

Details of the scandal engulfing the online fantasy sports company DraftKings should be common knowledge by now: a DraftKings employee admitted to the early release of data not generally available to the public and won US$350,000 on a rival site, FanDuel.

The comparisons to insider trading quickly – and logically – followed.

Questions abound as to why a major player in the largely unregulated, multibillion-dollar (US$3.7B annually) fantasy sports industry didn’t have stronger controls in place to restrict access to protected information or ban its employees from participating in fantasy games elsewhere.

Without the structure and processes, the mess was totally predictable and only a matter of time. While DraftKings and FanDuel announced permanent bans on employees participating in fantasy leagues within days of the scandal breaking, the damage had already been done to their brands and reputations:

  • ESPN initially announced it would end DraftKings’ sponsorship and later said it would stop its ads.
  • The New York attorney general’s office announced an investigation.
  • A Kentucky man has filed a lawsuit, for which he is seeking class-action status, accusing DraftKings and FanDuel of negligence, fraud, and false advertising.

The lesson from this latest corporate blunder should be crystal clear: A well-designed system of internal controls is fundamental to reducing business risks.

Should DraftKings executives be accountable for not anticipating such problems? Considering the industry is largely unregulated, has seen remarkably rapid growth, and handles huge sums of capital on a weekly basis, the answer is an unequivocal “yes.”

While DraftKings is not currently publicly traded, it is a textbook example of how limited or poorly designed internal controls can quickly be overwhelmed by the pressures of rapid business success. One of the criticisms of a mandatory internal process and controls regime is that start-ups lack the resources to support it. But without such an investment, organizations are at greater risk of making much costlier mistakes in the future.

It’s all about the old expression, “Pay me now, or pay me later.”