Friday 30 November 2012

Master Master Replication: Unintentional shutdown

This morning I got a rude awakening, a staff call me and told me one of the database server is down. Since I'm not on site, I got her to check if the server was on. The server was dead quite some, instantly I think of power failure. I so got her to turn on the machine beep beep and it fire up like a hurricane.

 It was silly of me to think of power failure, we got a UPS and three power supply for the database server. Once the machine turn on I did some checking and there was nothing wrong with the server at 11:49:03 suddenly all logs just stop.

The only explanation is some one turn off the power via the UPS or unplug all three power supply. I'll investigate when I get to the office.

The most disturbing event is when I check the mysql log and saw the following:
Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave
The message to me means
MySQL server wasn't shutdown properly and your our relay log are corrupted. Please flush your relay cache.

I did a few google search and I saw a really disturbing solution. The solution are to skip all the activity since the issue arise, you never want to do this especially if your replication are SQL base.

The proper solution are to reset the relay log and start replication.

  1. run `stop slave;`
  2. in console run `cat /var/lib/mysql/master.info`
  3. note down the follow master log file, master log position, username and password.
    The list below show the format of the files
    1. dont know dont care
    2. master log file
    3. master log position
    4. slave username
    5. slave password
    6. run `show slave status \G` to check if you master log file and master log position are correct
    7. run `reset slave;`
    8. run `CHANGE MASTER TO MASTER_USER="your user name", MASTER_PASSWORD = "password" , MASTER_LOG_FILE = "the master log file", MASTER_LOG_POS = "the master log position";`
    9. run `start slave;`
    10. run `show slave status \G` everything should work now

If you have a master-master... configuration, the server replicating to the above server will failed. 
With the follow error
Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position'
To fix simply change the master log file to the master log file and change the master log position to 4.
The good news is when the above happen the master log file is incremented by 1 and the first position of a log file is always 4.

Good Luck