Developing and managing Hotmail

This interview
has already been picked up and commented on
(and /.’ed),
but if you have not yet taken a look, I recommend reading this ACM piece on
Hotmail, and what it means to manage one of the largest services of the web.
Hotmail runs on 10,000 servers, involves several petabytes of storage (i.e., millions of gigabytes) and serves, according to this Wikipedia article, 221M users who generate billions of e-mail transactions daily. It is
operated by 100 sysadmins, which is not that large a team.

Phil Smoot, the PM in charge of Hotmail product development out of the
Microsoft Silicon Valley campus, shares a number of insights – from which I
noted the following points regarding automation, versioning, capacity planning, impact analysis and QA:

  • QA is a challenge in the sense that mimicking Internet loads on our QA
    lab machines is a hard engineering problem. The production site
    consists of hundreds of services deployed over multiple years, and the
    QA lab is relatively small, so re-creating a part of the environment or
    a particular issue in the QA lab in a timely fashion is a hard problem.
    Manageability is a challenge in that you want to keep your
    administrative headcount flat as you scale out the number of machines.
  • […] if you can manage five servers you should be able to manage tens of
    thousands of servers and hundreds of thousands of servers just by
    having everything fully automated—and that all the automation hooks
    need to be built in the service from the get-go. Deployment of bits is
    an example of code that needs to be automated. You don’t want your
    administrators touching individual boxes making manual changes. But on
    the other side, we have roll-out plans for deployment that smaller
    services probably would not have to consider. For example, when we roll
    out a new version of a service to the site, we don’t flip the whole
    site at once.
  • We do some staging, where we’ll validate the new version on a server
    and then roll it out to 10 servers and then to 100 servers and then to
    1,000 servers—until we get it across the site. This leads to another
    interesting problem, which is versioning: the notion that you have to
    have multiple versions of software running across the sites at the same
    time. That is, version N and N+1 clients need to be able to talk to
    version N and N+1 servers and N and N+1 data formats. That problem
    arises as you roll out new versions or as you try different
    configurations or tunings across the site.
  • The big thing you think about is cost. How much is this new feature
    going to cost? A penny per user over hundreds of millions of users gets
    expensive fast. Migration is something you spend more time thinking
    about over lots of servers versus a few servers. For example, migrating
    terabytes worth of data takes a long time and involves complex capacity
    planning and data-center floor and power consumption issues. You also
    do more up-front planning around how to go backwards if the new version
  • We strive to build tools that can replay live-site transactions and
    real-type live-site loads against single nodes. The notion is that the
    application itself is logging this data on the live site so that it can
    be easily consumed in our QA labs. Then as applications bring in new
    functionalities, we want to add these new transactions to the existing
    test beds.
  • The notion of tape backups is probably no longer feasible. Building
    systems where we’re just backing up changes—and backing them up to
    cheap disks—is probably much more where we’re headed. How you can do
    this in a disconnected fashion is an interesting problem. That is, how
    are you going to protect the system from viruses and software and
    administrative scripting bugs? What you’ll start to see is the emergence of the use of data
    replicas and applying changes to those replicas, and ultimately the
    requirement that these replicas be disconnected and reattached over
  • As you go to, let’s say, a commodity model, you have to assume that
    everything is going to fail underneath you, that you have to deal with
    these failures, that all the data has to be replicated, and that the
    system essentially self-heals. For example, if you are writing out
    files, you put a checksum in place that you can verify when the file is
    read. If it wasn’t correct, then go get the file somewhere else and
    repair the old file.
  • Last word: If you rely on scale up, you’ll probably get killed. You should always be relying on scale out.
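
The checksum point above is easy to picture in code. What follows is only a minimal sketch of the idea, not Hotmail's actual implementation: the function names, the choice of SHA-256, the ".sha256" sidecar file and the single-replica layout are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def _digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of data (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def _sidecar(path: Path) -> Path:
    """Hypothetical convention: the checksum lives next to the file."""
    return Path(str(path) + ".sha256")


def write_with_checksum(path: Path, data: bytes) -> None:
    """Write the file and record its checksum at write time."""
    path.write_bytes(data)
    _sidecar(path).write_text(_digest(data))


def read_with_repair(path: Path, replica: Path) -> bytes:
    """Verify the checksum on read; on mismatch, fetch the replica and repair."""
    expected = _sidecar(path).read_text().strip()
    data = path.read_bytes()
    if _digest(data) == expected:
        return data

    # The local copy is corrupt: go get the file somewhere else.
    good = replica.read_bytes()
    if _digest(good) != expected:
        raise IOError(f"{path} and its replica both fail the checksum")

    # Repair the old file so the next read succeeds locally.
    path.write_bytes(good)
    return good
```

A real service would have several replicas on different machines and would also protect the checksum itself, but the read path keeps the same shape: verify, fall back, repair.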

  • lambright

    Jeff, thanks for sharing this interview. Wow, 10,000 servers and 100 system administrators: it’s hard not to have a profound appreciation for the complexity and scale of Hotmail. I wonder how much of their energy goes into fighting spam?
    Wayne Lambright

  • Rodrigo Sepúlveda Schulz

    I have been interested in this issue of scaling your infrastructure for some time now (

  • Jeff

    Very interesting. The funny thing about Hotmail is that new users are treated better than old users. My account created in 1997 still offers only 2 MB of storage, the one created in 2005 offers 250 MB.

  • manu

    Excellent article, there is some food for everyone here!