Friday, June 26, 2015

Changing Out a Zookeeper Instance - Rolling restart

We were recently tasked with replacing 3 of 7 Zookeeper instances without disrupting the customer's environment. I set up a test bed of 5 independent Zookeeper instances, with a simulated game running on a separate instance.

After stopping the old zookeeper instance (see step 8), I updated the connection properties on my db servers and the game simulation. That made no difference to the running game simulation, which loads its connection data only at startup, but the change was in place for the next time the app was restarted.
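
If your clients take the usual comma-separated host:port connection string, it can be derived from the server.N lines of zoo.cfg rather than typed by hand. A minimal sketch, assuming the client port is 2181 (the port used in the status check in step 7); the function name zk_connect_string is purely illustrative and not part of Zookeeper:

```shell
#!/bin/sh
# Build a comma-separated host:port list from the server.N lines of a
# zoo.cfg read on stdin. The port argument defaults to 2181, which is an
# assumption; use whatever clientPort is set to in your own zoo.cfg.
zk_connect_string() {
  port="${1:-2181}"
  awk -F'[=:]' -v port="$port" '
    # server.<id>=<host>:<peer port>:<election port>  ->  $2 is the host
    /^server\.[0-9]+=/ { hosts = hosts (hosts ? "," : "") $2 ":" port }
    END { print hosts }
  '
}

# Usage (path is hypothetical):
#   zk_connect_string < /path/to/zoo.cfg
```

Lines that are commented out in zoo.cfg do not match, so a retired server drops out of the string as soon as it is commented out.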

These are the steps that worked for me:

  1. Check the status of every running zookeeper and note which one is the leader. Replace the leader last!
  2. Spin up a new zk instance.
  3. Give the new instance a myid higher than any in the current cluster.
  4. Copy the zoo.cfg from a running instance to the new one.
  5. Add the new instance to the zoo.cfg.
    e.g.
    server.9999=10.0.0.100:2888:3888
    ...
    server.9995=10.0.0.242:2888:3888
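    # New zookeeper instance (keep the comment on its own line; zoo.cfg is a Java properties file, so a trailing comment becomes part of the value):
    server.10000=10.0.0.154:2888:3888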
  6. Start zookeeper on the new machine.
  7. Ensure that it joins the quorum, e.g. run "echo stat | nc 127.0.0.1 2181" on the new machine and look for the Mode line (there are other ways to check).
  8. Stop the instance to be retired, or just stop the zk process on it.
  9. On the new instance, comment out the retired instance in zoo.cfg and restart zookeeper.
  10. Ensure that it is accepting connections (see #7 above), and check the same on all of the running servers.
  11. On the next existing instance, remove the retired zookeeper from zoo.cfg, add the new zookeeper, and restart. Repeat for each remaining instance, finishing with the leader.
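
The status checks (steps 1, 7 and 10) and the zoo.cfg edits (steps 9 and 11) can be scripted. A rough sketch, assuming nc is installed, the client port is 2181, and the stat four-letter command is enabled (newer Zookeeper releases may need it whitelisted first). The function names zk_mode, zk_stat_mode and swap_server are illustrative only, not part of Zookeeper:

```shell
#!/bin/sh
# Extract the "Mode:" value (e.g. leader or follower) from `stat` output on stdin.
zk_mode() {
  awk -F': ' '/^Mode:/ { print $2 }'
}

# Query one server over the client port. Prints its mode, or nothing if the
# server is unreachable or not currently serving requests.
zk_stat_mode() {
  echo stat | nc -w 2 "$1" "${2:-2181}" 2>/dev/null | zk_mode
}

# Read a zoo.cfg on stdin and write a copy to stdout with server.<old id>
# commented out and the new server line appended if it is not already there.
swap_server() {
  awk -v old="server.$1=" -v new="$2" '
    index($0, old) == 1 { print "#" $0; next }  # retire the old entry
    $0 == new           { seen = 1 }            # new entry already present
                        { print }
    END                 { if (!seen) print new }
  '
}

# Usage (hosts and paths are hypothetical):
#   for h in 10.0.0.100 10.0.0.242 10.0.0.154; do
#     echo "$h: $(zk_stat_mode "$h")"
#   done
#   swap_server 9995 'server.10000=10.0.0.154:2888:3888' < zoo.cfg > zoo.cfg.new
```

swap_server writes to stdout rather than editing in place, so the new file can be diffed against the old one before it is swapped in and zookeeper is restarted.
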
Hope that works for you.