Friday, 10 January 2014

On MySQL's Data Dictionaries

All MySQL versions share the same basic architecture: there's a "upper" part of the system, responsible for accepting connections, forking Server Threads, parsing (most of the) SQL, enforcing privileges, authenticating users, etc. The MySQL Developers refer to this component as the "SQL Engine".

On the other side of the counter, there's also a "lower" set of components we know as "Storage Engines" – MyISAM, InnoDB, TokuDB, Memory, Black-Hole, to cite a few examples. This components are responsible for storing and retrieving data from whatever storage media they use and support. This is a radically different architecture from any other implementation of Relational Database Management System (RDBMS) I've seen in my professional life. Usually RDBMSs want to have full control of their storage components, so they can easily implement tricks and shortcuts for profit and performance.

MySQL is still the only RDBMS that I've heard of with this kind of architecture. There are advantages (other storage engines can be included to support exotic data storage systems, and MySQL can easily be extended to become an SQL interface to other types of database.

Of course this architecture choice has consequences. Some of this consequences are… interesting. The one that caught my eye recently was the fact that InnoDB, in order to implement certain features, maintains it's own Data-Dictionary. InnoDB Storage Engine and the SQL Engine Data Dictionaries share nothing due to the very rigid interface between these two major components.

Also, MySQL's SQL Engine doesn't support Transactional DDL (see "13.3.2. Statements That Cannot Be Rolled Back", on MySQL documentation).

Now let's put it all together: there are two separated architecture components of interest in any MySQL database, the SQL Engine and the Storage Engine. InnoDB storage engine implements it's own Data-Dictionary, overlapping and complementing the SQL Engine Data Dictionary (to which it has no direct access due to the rigid interface between the SQL and the Storage Engine components); the SQL Engine DDL statements cannot be rolledback (and therefore are not transactional); InnoDB won't have an opportunity to change it's own Data Dictionary until the SQL Engine is done executing it. Therefore, if the RDBMS process crashes during the time a DDL change was executing, there's a fair chance that we'll end up with inconsistencies between the SQL Engine data-dictionary and the InnoDB Data Dictionary. 

What a neat and nice Theory. 

Now, how to prove this is true? Well, for starters, we need access to both the SQL Engine Data Dictionary, and the InnoDB Storage Engine Data Dictionary. This is not exposed in all available versions of MySQL. Specifically, we're looking for the INFORMATION_SCHEMA.INNODB_SYS_TABLES system table, and we want to compare the information there with the SQL Engine Data Dictionary exposed through the INFORMATION_SCHEMA.TABLES table.

For my Percona Server 5.1.69, I wrote the query below. You may have to adapt your query a bit as column names on those tables change a bit from version to version.

SELECT COUNT(*)
FROM (
    SELECT t.TABLE_SCHEMA AS "SCHEMA"
         , t.TABLE_NAME AS "TABLE"
    FROM INFORMATION_SCHEMA.TABLES t
    WHERE ENGINE = "InnoDB"
      AND NOT EXISTS (
            SELECT 1
            FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES ist
            WHERE t.TABLE_SCHEMA = ist.SCHEMA
              AND t.TABLE_NAME = ist.NAME
    )
    UNION ALL
    SELECT ist.SCHEMA AS "SCHEMA"
         , ist.NAME AS "TABLE"
    FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES ist
    WHERE ( ist.SCHEMA, ist.NAME ) NOT IN
        ( ( "", "SYS_FOREIGN" ), ( "", "SYS_FOREIGN_COLS" ) )
      AND NOT EXISTS (
            SELECT 1
            FROM INFORMATION_SCHEMA.TABLES t
            WHERE ENGINE = "InnoDB"
              AND t.TABLE_SCHEMA = ist.SCHEMA
              AND t.TABLE_NAME = ist.NAME
    )
) AS ddstats

This is a very expensive query (it will force statistics fresh of pretty much all your tables, and will acquire a lot of very expensive locks in the process). I advise planning your execution a little bit, so you're not forced to kill your query before you get some interesting results.

I was able to get inconsistencies between my tables in the SQL Engine and InnoDB Storage Engine Data Dictionary. How about you? Would you care to share some war stories with me?

Friday, 13 December 2013

Why Statement Based Replication Diverges?

MySQL Statement-Based Replication Services has plenty of opportunities to diverge. I will attempt to list the most common here, in no particular order.

Changes Applied Directly to a Replica

If you write to the Replica, it’s data will diverge from the Master’s data. Yes, I know, it sounds stupid. Depending on how your developers are accessing the database, though, it might be quite difficult to decide if a given SQL statement is being pushed to the Master or to the Replicas by just looking at the code. Take for example a Java Framework-based Data Access Layer relying on Transaction Annotations to tell if a given transaction is read-only or read-write. Transactions can span multiple methods, sometimes located far from each other in the code. If all information you have is the code non-trivial to tell when a transaction is read-only or read-write, and sometimes misplaced Annotations make it to production, damaging whole database clusters at one, many times not in immediately noticeable ways.

Misconfiguration at Replication Setup

Setting up Replication (at least in MySQL 5.1 and 5.5 series) is a manual process and can be quite error-prone depending on the DBA or Operations Engineer experience and how reliable is the source of the information you're using to determine what should be the next transaction for a given MySQL Replica. Missing a digit on the SQL statement that sets up Replication is enough to make a Database Replica diverge by a considerable amount of transactions.

Incorrect Transaction Serialisation

MySQL is multi-threaded. This means that at any given point-in-time there might be as many transactions changing the database as there are Processor Cores available on the Master Host. In order to enable the replication implementation to work, MySQL serialises the transactions in a file (the Binary Log File). This is done automatically, by the Database Server, on-the-fly. Deciding the order in which to write transactions to the Binary Log File is a non-trivial task, specially in very busy database servers. I won't be surprised if MySQL sometimes gets the serialisation order wrong, causing the Replicas to diverge from the master on a given statement, specially for statement-based replication set-ups. If you are using some sort of high-availability supervisor and maintain the database cluster by failing over form one master to another (I do), or if you add with fresh clones from arbitrarily chosen Database Replicas, constantly shuffling the cracks in the data, this might even make this kind of divergences harder to spot or track.

Database Replica Crash

Unfortunately, every now and again a Database Server Crashes, for several different reasons. Depending on the workload and local instance configuration, it is possible that up to a second of transactions might be lost after a crash. It is standard practice to lower ACID compliant restrictions of a Database Replicas to allow it to catch up to the master faster in special, controlled situations (e.g., right after adding a local index). This technique is specially useful on write-intensive workloads where you're reaching the limits of your hardware, where the disproportional number of writes makes it really though for a Replica Database to catch up with the Master during peak hours.

Non-Deterministic SQL Statements

SQL is a Fourth-Generation Declarative Language. This implies certain SQL statement interpretation liberties to both the Database Server and the Developer. This leads to a lot of apparently harmless mistakes, like attempting to throttle the rate at which certain changes happen with “LIMIT” clauses, for example by writing statements like the ones below:

UPDATE some_table SET a_column = X WHERE other_col = “some-value” LIMIT 10;

DELETE FROM another_table WHERE some_column = Y LIMIT 100;

Both statements select rows to be changed upfront, to satisfy their respective LIMIT clauses. As the ANSI-SQL (and most implementations, including MySQL) don't place any guarantees on result set order (or if the rows to be chosen come from the beginning or the end of the result set), it's quite possible that the rows selected to be changed on the master are different from the rows selected by each and every Database Replica. In MySQL case, this will be dictated mostly by the presence of rows matching the conditions in memory, which in turn is a function of the queries answered by that specific MySQL instance which in turn is a round-robin-balanced share of the traffic that has been hitting this system for the last few hours… to put it short, this is probably a good Random Number Generator Implementation, with very high entropy.

Unnecessary to say, if the same valid SQL statement applied to the Master and to the Replicas has the ability to change different sets of rows in both servers, we have to account for the fact that Statement-Based Replication as implemented by MySQL will diverge. It's not a matter of "if", it's just a matter of "when".

Thursday, 14 November 2013

Slides of My Talk at MySQL Percona Live London 2013

For those asking me about my talk at the Percona Live London Conference 2013, I'm making it available as PDF. Please feel free to contact me about it directly if you like.

Thank you all for showing up for the talk and for asking me so many questions. I was utterly pleased to answer all of your questions.

Wednesday, 30 October 2013

See you at London

Percona Live London, November 11-12, 2013
A short reminder that I will deliver a talk called MySQL, the ebay Classifieds Way at the Percona Live London Conference 2013 in London. I'm looking forward to see you all there.

Tuesday, 24 September 2013

A Guide to Organising MySQL Files and Directories

I've been a Professional Database Administrator since 2001. In this time, I had the opportunity to manage Oracle, PostgreSQL, MySQL, and MS SQL Server databases. From all those, I believe Oracle was the only one with documented best practices about how to organise your files and directories.

While not crucial for performance in small deployments, having a properly organised database file-system architecture can play a big part on your performance.

In this article I will explain my preferred database file-systems for MySQL databases, how to configure your MySQL for this purpose, and what benefits you can crop from it. I will assume here you're using some sort of Linux system, because this is, at least from my point of view, the most common case. I believe extrapolating this organisation to other Operating Systems is a simple and straightforward exercise.

Base Directory

Everything starts by choosing a base directory for the database files. The Filesystem Hierarchy Standard (FHS) dictates Application State information should be stored under /var/lib/mysql. I strongly recommend you to comply to this standard, as it makes your life easy and is also the default supported for most Linux Distributions.

I classify the database files into six categories: Data Files, InnoDB Journal Files, Binary Logs, Relay Logs, Temporary Files, and Log Files.

Run-Time Files

Complying wiht the FHS and with the Policies of most of the modern Linux distributions, you should deploy your MySQL Socket file (used to establish Inter-Process-Communication (IPC) -based connections to your MySQL Server at /var/run/mysqld/mysqld.sock. I recommend you set the configuration variable socket to /var/run/mysqld/mysql.sock in your MySQL configuration file for this effect.

Also in compliance with the aforementioned standards, I recommend you deploy your Process-ID file in the same directory, by setting pid-file to /var/run/mysqld/mysqld.pid to your configuration file.

Data Files

Data files are the files that contain your data. MySQL usually keep this files in per-database-schema directories. This is actually how MySQL implements the notion of database-schemas: the first-level directories under your database datadir are considered to be all schema names.

The InnoDB System Tablespace (also known as ibdata1 file) also belongs to this category (mostly because it makes little sense to split it away).

The Data Files location is governed by a two options in MySQL: datadir, and innodb_file_per_table

Hum. Well, that's actually a white-lie. You can also set your InnoDB System Tablespace Location with innodb_data_home_dir and innodb_data_file_path although if left unset those variables will by default place the InnoDB System Tablespace file in MySQL's datadir.

Nowadays, running a MySQL database without innodb_file_per_table to "ON" is a must. If you're not keeping your tablespaces separated, I strongly recommend you to do so.

I recommend setting datadir to /var/lib/mysql/data, and arranging for this directory to be mounted on a separated RAID set from where your Operating System is, if possible. Due to the random nature of the writes to the Data Files, this is probably the place where you can use SSD disks with better performance gains.

InnoDB Journal Files

InnoDB by default creates two fixed-size journal files named ib_logfile0 and ib_logfile1. You can change how many journal files will be used by setting innodb_log_files_in_group, although I never witnessed a situation where it would be beneficial to have a different number from the default.

InnoDB Journal Files are automatically created by the InnoDB Storage Engine upon startup, and will be placed in the directory defined by innodb_log_group_home_dir (or by default in the same directory as the datadir).

This is also an excellent location where to place the InnoDB Double-Write file, where InnoDB will write data pages before copying them to their correct positions in the data files, in order to avoid the risk of corrupting data files with half-written pages. To place the InnoDB Double-Write file here, I recommend setting innodb_doublewrite_file to /var/lib/mysql/journal/ib_doublewrite.

I recommend setting innodb_log_group_home_dir to /var/lib/mysql/journal, and arranging for this directory to be mounted on a separated set of RAID disks, if possible controlled by a different RAID controller than where your Data Files are, and also separated from your Operating System disks. Due to the sequential nature of the writes to the InnoDB Journal Files, a good set of small SAS disks and a good sized RAID Controller write-cache can provide you with the best performance gains here.

MySQL Binary Logs

MySQL Replication relays on Binary Logs, a sequential record of all the transactions applied to this MySQL instance. Binary Log recording can be enabled and disabled with the log-bin option. This option also takes one argument, which is the "prefix" we should add to the name of the individual binary log files. I recommend setting this variable to /var/lib/mysql/binary/binlog. Due to the sequential nature of the writes to the binary log files, this directory can share a RAID set with the Journal Files.

Together with the Binary Log files, MySQL maintains an index file with the known Binary Log files. This file location defaults to the log-bin prefix. It can be changed by setting the variable log-bin-index. For completeness, I recommend setting this variable to /var/lib/mysql/binary/binlog.index.

MySQL Relay Logs

The MySQL Replication I/O (Server) Thread will copy Binary Logs from the master database as (local) Relay Log files, which will be in turn read sequentially by the Replication SQL (Server) Thread and applied to the local database. The I/O Thread will use the file name prefix configured in relay-log to create the files in disk. I recommend setting this option to /var/lib/mysql/relay/relay-bin. Due to the sequential nature of reads and writes to this file, they can share the same RAID set as the Journal and Binary log files without much trouble. If you're really concerned about replication performance, you can separate this files in a different RAID set, and even a different RAID Controller.

Together with the Relay Logs files, MySQL maintains an index file with the known Relay Logs downloaded from the master. This file location defaults to the relay-log prefix. It can be changed by setting the variable relay-log-index. For completeness, I recommend setting this variable to /var/lib/mysql/relay/relay-log.index.

MySQL also stores replication state (position, master file name, etc) in a file named Relay Log Info File. This file location also defaults to the location of the relay logs. Also for completeness, I recommend explicitly setting the location of this file using the variable relay-log-info-file to /var/lib/mysql/relay/relay-log.info.

MySQL Temporary Directory

Although a lot of System Engineers set this configuration variable to point to the Operating System temporary directory. I recommend against this. MySQL Temporary Files are also part of the Application State, and accordingly to the FHS they belong under /var/lib/mysql/. Besides, MySQL Temporary Files might quite quickly fill up a typical /tmp directory.

Also, depending on your distribution settings, there is the risk that your temporary directory will be cleaned upon restart, which might destroy state MySQL requires for replication (MySQL Replication requires that it's temporary files survive system restarts, in order to be able to continue replicating LOAD DATA INFILE statements).

You can estimate the size requirements of your Temporary Directory with a simple calculation:

tmpdir size = (server threads) * (row length + sizeof(row pointer)) * (matched rows) * 2

You might want to adjust this formula to take into consideration more than one temporary table per thread.

To set the location MySQL will use for temporary files, use the tmpdir option. I recommend setting your temporary directory location to /var/lib/mysql/tmp/.

Due to the nature of disk accesses to the temporary directory, it can probably share the same RAID controller with your Data Files.

MySQL Log Files

Last but not least important, MySQL produces at the very least error logs while servicing requests. Ideally, you should also log Slow Queries (which is a great and valuable source of information about how your database is performing).

MySQL Error Log

MySQL log error messages either to console or to a log file in the disk. If you're running MySQL as a service for a professional application, I recommend using a log file. Set the location of the log file with log-error. I recommend logging to /var/lib/mysql/log/${HOSTNAME}-err.log.

By default only errors are logged. I recommend setting log-warnings to 2 so MySQL will log all the information it can. More logs usually means less debugging time.

MySQL Slow Query Log

MySQL logs queries that run for longer than the time set in the long_query_time configuration variable. Those queries are written (along with some statistics) to a log file on the disk.

I recommend setting the location of this file with slow-query-log-file to /var/lib/mysql/log/${HOSTNAME}-slow.log.

Due to the sequential nature of both the Slow Query Log file and Error Log file writes, those can safely share the same RAID controller and disk-set with the Journal, Relay Logs, and Binary Log files. Another good alternatives is to share the RAID set with the Operating System. I personally like the first option. There is usually no need to use SSD disks for this files, as SAS disks usually perform better for sequential writes.

Backup Directory

This is not actually required to run MySQL, and won't change the performance of your service. It is more of a convenience for your Database Administrator, and I personally consider it good practice to have disk space and a standardized location where to place data and log files while the DBA is backing up, restoring, or just servicing the MySQL instance.

I create and use /var/lib/mysql/backup for that. Due to the random nature of the reads and writes here, this directory can share the same RAID Controller and RAID Disk set as your Data Files (ideally you should have it on a completely separated Raid Controller and Disk set, but one can only hope for so many disk controllers on a host).


RAID
Disk Set
Mount Point Contents Configuration Variables
Operating System /var/run/mysqld/ Process-ID file, IPC Socket pid-file
socket
RAID Set #1 /var/lib/mysql/data MySQL Data Files datadir
innodb_data_file_path
innodb_data_home_dir
innodb_file_per_table
RAID Set #2 /var/lib/mysql/journal InnoDB Journal Files
InnoDB Double-Write File
datadir
innodb_log_group_home_dir
innodb_log_files_in_group
innodb_doublewrite_file
RAID Set #2 /var/lib/mysql/binary MySQL Binary Logs log-bin
log-bin-index
RAID Set #2 /var/lib/mysql/relay MySQL Relay Logs relay-log
relay-log-index
relay-log-info-file
RAID Set #1 /var/lib/mysql/tmp Temporary MySQL Files tmpdir
Operating System /var/lib/mysql/log MySQL Error, Slow Query Logs log-error
log-warnings
slow-query-log-file
RAID Set #3 /var/lib/mysql/backup DBA-owned data while database is in maintenance. N/A

Thursday, 29 August 2013

Which MySQL Version to Use, and Why?

Quite often in my career as a Database Administrator I was faced with the responsibility of choosing which MySQL version for a project, or asked to upgrade an existing MySQL instance to this-and-that version.

Professionally, I tend to search for a balance between stability (nobody likes a database that keeps crashing and losing data) and performance (if you're running a website, a faster database will have a positive impact on your profit).

The balance can be tricky to achieve, as performance is something that usually comes in newer versions (as any Agile Development practitioner would be delighted to confirm you), with the development team receiving feedback and getting a better grip on the problem; while stability comes with age (as Eric S. Raymond wisely stated: "given enough eyeballs, all bugs are shallow").

To get the best of both worlds I choose to run the oldest MySQL branch still being maintained (as of this writing, MySQL 5.1 is my MySQL release of choice) while trying my best to keep the minor version as up-to-date as possible, in an attempt to still benefit from performance improvements and bug-fixes back-ported to the release (which in turn takes me to 5.1.69).

Please keep in mind that upgrading is always a risk and will by definition cause instability on your production systems. I strongly advise sticking to your MySQL version for as long as possible, and only upgrading when forced out of your comfortable position by discontinuation of maintenance (which I consider a bigger risk than not upgrading, in the long-run) or when hitting bugs for which there are no work-arounds.

The main advantage of this strategy is that most (if not all) the bugs you will find are known, well-documented, and will have known fixes or work-arounds. The main disadvantage is that you will be missing quite a bit of novelty, and your life will sound kind of boring. Your developers will hate you for the missing features, but you will enjoy a comfortable position with management for the added stability.

What MySQL version are you running, and why?

Wednesday, 22 February 2012

Why "Deaf by Salami"?

To put it simply, "Deaf by Salami" is an anagram for "My Life as a DBA". It's also a funny, mnemonic three-word sentence that ensures virtual uniqueness at Google.

My intention is to make this a place where I can share MySQL hints, tips, and tricks that I use on a daily basis on the high-volume, high-traffic environment where I work as a full-time Database Administrator.