MySQL Left Join

Even if you think you know everything about LEFT JOIN, I bet you will learn something or two in this post!

A reminder about “A LEFT JOIN B ON conditional_expr”

The ON condition (in the expression “A LEFT JOIN B ON conditional_expr”) is used to decide how to retrieve rows from table B (Matching-Stage).
If there is no row in B that matches the ON condition, an extra B row is generated with all columns set to NULL.
In the Matching-Stage any condition in the WHERE clause is not used. Only after the Matching-Stage, the condition in the WHERE clause will be used. It will filter out rows retrieved from the Matching-Stage.

Lets see a LEFT JOIN example:

mysql> CREATE TABLE `product` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `amount` int(10) unsigned default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=5 DEFAULT CHARSET=latin1
 
mysql> CREATE TABLE `product_details` (
  `id` int(10) unsigned NOT NULL,
  `weight` int(10) unsigned default NULL,
  `exist` int(10) unsigned default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
 
mysql> INSERT INTO product (id,amount)
       VALUES (1,100),(2,200),(3,300),(4,400);
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0
 
mysql> INSERT INTO product_details (id,weight,exist)
       VALUES (2,22,0),(4,44,1),(5,55,0),(6,66,1);
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0
 
mysql> SELECT * FROM product;
+----+--------+
| id | amount |
+----+--------+
|  1 |    100 |
|  2 |    200 |
|  3 |    300 |
|  4 |    400 |
+----+--------+
4 rows in set (0.00 sec)
 
mysql> SELECT * FROM product_details;
+----+--------+-------+
| id | weight | exist |
+----+--------+-------+
|  2 |     22 |     0 |
|  4 |     44 |     1 |
|  5 |     55 |     0 |
|  6 |     66 |     1 |
+----+--------+-------+
4 rows in set (0.00 sec)
 
mysql> SELECT * FROM product LEFT JOIN product_details
       ON (product.id = product_details.id);
+----+--------+------+--------+-------+
| id | amount | id   | weight | exist |
+----+--------+------+--------+-------+
|  1 |    100 | NULL |   NULL |  NULL |
|  2 |    200 |    2 |     22 |     0 |
|  3 |    300 | NULL |   NULL |  NULL |
|  4 |    400 |    4 |     44 |     1 |
+----+--------+------+--------+-------+
4 rows in set (0.00 sec)

Is there a difference between the ON clause and the WHERE clause?

A question: Is there a difference in the result set of the following two queries?

1. SELECT * FROM product LEFT JOIN product_details
         ON (product.id = product_details.id)
         AND   product_details.id=2;
2. SELECT * FROM product LEFT JOIN product_details
         ON (product.id = product_details.id)
         WHERE product_details.id=2;

It is best to understand by example:

mysql> SELECT * FROM product LEFT JOIN product_details
       ON (product.id = product_details.id)
       AND product_details.id=2;
+----+--------+------+--------+-------+
| id | amount | id   | weight | exist |
+----+--------+------+--------+-------+
|  1 |    100 | NULL |   NULL |  NULL |
|  2 |    200 |    2 |     22 |     0 |
|  3 |    300 | NULL |   NULL |  NULL |
|  4 |    400 | NULL |   NULL |  NULL |
+----+--------+------+--------+-------+
4 rows in set (0.00 sec)
 
mysql> SELECT * FROM product LEFT JOIN product_details
       ON (product.id = product_details.id)
       WHERE product_details.id=2;
+----+--------+----+--------+-------+
| id | amount | id | weight | exist |
+----+--------+----+--------+-------+
|  2 |    200 |  2 |     22 |     0 |
+----+--------+----+--------+-------+
1 row in set (0.01 sec)

The first query retrieve all rows from product table while using the ON condition to decide which rows to retrieve from the Left joined product_details table.
The second query is doing simple Left-Join. It filters out rows not matching the WHERE clause conditions.

See more examples:

mysql>
mysql> SELECT * FROM product LEFT JOIN product_details
       ON product.id = product_details.id
       AND product.amount=100;
+----+--------+------+--------+-------+
| id | amount | id   | weight | exist |
+----+--------+------+--------+-------+
|  1 |    100 | NULL |   NULL |  NULL |
|  2 |    200 | NULL |   NULL |  NULL |
|  3 |    300 | NULL |   NULL |  NULL |
|  4 |    400 | NULL |   NULL |  NULL |
+----+--------+------+--------+-------+
4 rows in set (0.00 sec)

All the rows from the product table are retrieved. However, no matching found at product_details table (there is no row that matches the condition product.id = product_details.id AND product.amount=100).

mysql> SELECT * FROM product LEFT JOIN product_details
       ON (product.id = product_details.id)
       AND product.amount=200;
+----+--------+------+--------+-------+
| id | amount | id   | weight | exist |
+----+--------+------+--------+-------+
|  1 |    100 | NULL |   NULL |  NULL |
|  2 |    200 |    2 |     22 |     0 |
|  3 |    300 | NULL |   NULL |  NULL |
|  4 |    400 | NULL |   NULL |  NULL |
+----+--------+------+--------+-------+
4 rows in set (0.01 sec)

All the rows from the product table are retrieved. However, only one matching found at product_details table.

LEFT JOIN with WHERE … IS NULL clause

What is happening if you use the WHERE …. IS NULL clause?
As stated before, the Where-condition stage is happening after the Matching-Stage. This means that the WHERE IS NULL clause will filter out, from the Matching-Stage result, rows that didn’t satisfy the Matching-Condition.
All that is fine on the paper but if you are using more than one condition in the ON clause it is starting to be confusing.

I have developed a simple way to better understand a complex Matching-Condition with WHERE … IS NULL clause by:

  • looking at the IS NULL clause as a negation of the Matching-Condition.
  • using the Logical rule: !(A and B) == !A OR !B

Look at the following example:

mysql> SELECT a.* FROM product a LEFT JOIN product_details b
       ON a.id=b.id AND b.weight!=44 AND b.exist=0
       WHERE b.id IS NULL;
+----+--------+
| id | amount |
+----+--------+
|  1 |    100 |
|  3 |    300 |
|  4 |    400 |
+----+--------+
3 rows in set (0.00 sec)

Lets examine the Matching clause (ON clause):

(a.id=b.id) AND (b.weight!=44) AND (b.exist=0)

Remember that we can think of the IS NULL clause as a negation of the Matching-Condition.
This means we will retrieve the following rows:

!( exist(b.id that equals to a.id) AND b.weight !=44 AND b.exist=0 )
!exist(b.id that equals to a.id) || !(b.weight !=44) || !(b.exist=0)
!exist(b.id that equals to a.id) || b.weight =44 || b.exist=1

Like in a C programing language, the operands of logical-AND and logical-OR expressions are evaluated from left to right. If the value of the first operand is sufficient to determine the result of the operation, the second operand is not evaluated (short-circuit evaluation).
In our case, this means, retrieve all rows in A that don’t have matching id in B PLUS, for the rows that do have matching id in B, retrieve only the ones that have b.weight =44 OR b.exist=1

See another example:

mysql> SELECT a.* FROM product a LEFT JOIN product_details b
       ON a.id=b.id AND b.weight!=44 AND b.exist=1
       WHERE b.id IS NULL;
+----+--------+
| id | amount |
+----+--------+
|  1 |    100 |
|  2 |    200 |
|  3 |    300 |
|  4 |    400 |
+----+--------+
4 rows in set (0.00 sec)

Explanation:

! ( exist(bid that equals to aid) AND b.weight !=44 AND b.exist=1 )
!exist(bid that equals to aid) || !(b.weight !=44) || !(b.exist=1)
!exist(bid that equals to aid) || b.weight =44 || b.exist=0

The battle between the Matching-Conditions and the Where-conditions

You can get the same results (A.*) if you put only the basic matching condition in the ON clause and the negation of the rest in the Where condition clause.
For example,
Instead of writing:

SELECT a.* FROM product a LEFT JOIN product_details b
ON a.id=b.id AND b.weight!=44 AND b.exist=0
WHERE b.id IS NULL;

You can write:

SELECT a.* FROM product a LEFT JOIN product_details b
ON a.id=b.id
WHERE b.id is null OR b.weight=44 OR b.exist=1;
mysql> SELECT a.* FROM product a LEFT JOIN product_details b
       ON a.id=b.id
       WHERE b.id is null OR b.weight=44 OR b.exist=1;
+----+--------+
| id | amount |
+----+--------+
|  1 |    100 |
|  3 |    300 |
|  4 |    400 |
+----+--------+
3 rows in set (0.00 sec)

Instead of writing:

SELECT a.* FROM product a LEFT JOIN product_details b
ON a.id=b.id AND b.weight!=44 AND b.exist!=0
WHERE b.id IS NULL;

You can write:

SELECT a.* FROM product a LEFT JOIN product_details b
ON a.id=b.id
WHERE b.id is null OR b.weight=44 OR b.exist=0;
mysql> SELECT a.* FROM product a LEFT JOIN product_details b
       ON a.id=b.id
       WHERE b.id is null OR b.weight=44 OR b.exist=0;
+----+--------+
| id | amount |
+----+--------+
|  1 |    100 |
|  2 |    200 |
|  3 |    300 |
|  4 |    400 |
+----+--------+
4 rows in set (0.00 sec)

Does these queries really the same?
These queries retrieve the same result set as long as you need only the values from the first table (e.g. A.*). In a case that you are retrieving values from the LEFT JOINed table, the results values are not the same.
As stated before, the condition in the WHERE clause filters out rows retrieved from the Matching-Stage.

For example:

mysql> SELECT * FROM product a LEFT JOIN product_details b
       ON a.id=b.id AND b.weight!=44 AND b.exist=1
       WHERE b.id is null;
+----+--------+------+--------+-------+
| id | amount | id   | weight | exist |
+----+--------+------+--------+-------+
|  1 |    100 | NULL |   NULL |  NULL |
|  2 |    200 | NULL |   NULL |  NULL |
|  3 |    300 | NULL |   NULL |  NULL |
|  4 |    400 | NULL |   NULL |  NULL |
+----+--------+------+--------+-------+
4 rows in set (0.00 sec)
 
mysql> SELECT * FROM product a LEFT JOIN product_details b
       ON a.id=b.id
       WHERE b.id IS NULL OR b.weight=44 OR b.exist=0;
+----+--------+------+--------+-------+
| id | amount | id   | weight | exist |
+----+--------+------+--------+-------+
|  1 |    100 | NULL |   NULL |  NULL |
|  2 |    200 |    2 |     22 |     0 |
|  3 |    300 | NULL |   NULL |  NULL |
|  4 |    400 |    4 |     44 |     1 |
+----+--------+------+--------+-------+
4 rows in set (0.00 sec)

General Note: If you use LEFT JOIN to find rows that do not exist in some table and you have the following test: col_name IS NULL in the WHERE part,
where col_name is a column that is declared as NOT NULL, MySQL stops searching for more rows (for a particular key combination) after
it has found one row that matches the LEFT JOIN condition.

As restoring a dump table into the MySQL master – you better get some sleep

Restoring a dump table into the MySQL master server can lead to serious replication delay. The massive inserts commands cause the Master and slaves to use most of their resources for replication. As a result, replication lag may increase dramatically (linear to the table size).
To overcome the replication delay, caused by restoring the dump table on the master, there is a need to widespread the massive inserts. This can be done by the MySQL SLEEP command:

Mysql> SELECT SLEEP(1);

Each MySQL INSERT command (actually, each INSERT command consist of many values), should be follow by a SLEEP command, which will give the replication, time to recover (releasing the resources taken by the replication).
The question that can be asked here is how much SLEEP time should be written. There is no absolute answer for this question and it depends on the system properties and the table size.
In my experiments, one second was enough.

Let’s see how easily it can be done:

Bash> mysqldump  -h xxx -P xxx –u xxx -pxxx my_db my_table > /tmp/my_dump.mysql_dump
Bash> cat /tmp/ my_dump.mysql_dump | sed "s/^\(INSERT INTO.*\)$/\1\nSELECT SLEEP(1);/" > /tmp/my_dump.mysql_dump_with_sleep

You can detect the sleep time inserted by:

Bash> grep SLEEP /tmp/my_dump.mysql_dump_with_sleep | wc –l

The new dump file should look like:

Bash>  cat /tmp/my_dump.mysql_dump_with_sleep
DROP TABLE IF EXISTS `testsleep`;
SET @saved_cs_client     = @@character_set_client;
SET character_set_client = utf8;
CREATE TABLE `testsleep` (
  `col1` int(10) unsigned NOT NULL auto_increment,
  `col2` int(10) unsigned NOT NULL,
  PRIMARY KEY  (`col1`)
) ENGINE=MyISAM AUTO_INCREMENT=11 DEFAULT CHARSET=latin1;
SET character_set_client = @saved_cs_client;
 
--
-- Dumping data for table `testsleep`
--
 
LOCK TABLES `testsleep` WRITE;
/*!40000 ALTER TABLE `testsleep` DISABLE KEYS */;
INSERT INTO `testsleep` VALUES (1,1),(2,3),(3,65),(4,388),(5,408),(6,412),(7,413),(8,420),(9,431),(10,432);
SELECT SLEEP(1);
INSERT INTO `testsleep` VALUES (11,1),(12,3),(13,65),(14,388),(15,408),(16,412),(17,413),(18,420),(19,431),(20,432);
SELECT SLEEP(1);
INSERT INTO `testsleep` VALUES (21,1),(22,3),(23,65),(24,388),(25,408),(26,412),(27,413),(28,420),(29,431),(30,432);
SELECT SLEEP(1);
/*!40000 ALTER TABLE `testsleep` ENABLE KEYS */;
UNLOCK TABLES;

The output of applying the dump file with the sleeps will look like:

[user@master]$  nohup mysql --h xxx -P xxx –u xxx –pxxx  my_db  < /tmp/my_dump.mysql_dump_with_sleep
SLEEP(1)
0
SLEEP(1)
0
SLEEP(1)
0

The SLEEP effect can be seen at the master binlog:

[user@master]$  sudo mysqlbinlog mysql-binlog.009127 | less
SET TIMESTAMP=1298561696/*!*/;
INSERT INTO `testsleep` VALUES (1,1),(2,3),(3,65),(4,388),(5,408),(6,412),(7,413
# at 730
#110224 15:34:57 server id 1012  end_log_pos 908        Query   thread_id=394402
SET TIMESTAMP=1298561697/*!*/;
INSERT INTO `testsleep` VALUES (11,1),(12,3),(13,65),(14,388),(15,408),(16,412),
# at 908
#110224 15:34:58 server id 1012  end_log_pos 1086       Query   thread_id=394402
SET TIMESTAMP=1298561698/*!*/;
INSERT INTO `testsleep` VALUES (21,1),(22,3),(23,65),(24,388),(25,408),(26,412),
# at 1086
#110224 15:34:59 server id 1012  end_log_pos 1196       Query   thread_id=394402
SET TIMESTAMP=1298561699/*!*/;
/*!40000 ALTER TABLE `testsleep` ENABLE KEYS *//*!*/;

The SLEEP effect can be seen at the slave binlog:

[user@slave]$  sudo mysqlbinlog mysqld-relay-bin.022076 | less
SET TIMESTAMP=1298561696/*!*/;
INSERT INTO `testsleep` VALUES (1,1),(2,3),(3,65),(4,388),(5,408),(6,412),(7,413),(8,420),(9,431),(10,432)/*!*/;
# at 870
#110224 15:34:57 server id 1012  end_log_pos 908        Query   thread_id=3944028       exec_time=0     error_code=0
SET TIMESTAMP=1298561697/*!*/;
INSERT INTO `testsleep` VALUES (11,1),(12,3),(13,65),(14,388),(15,408),(16,412),(17,413),(18,420),(19,431),(20,432)/*!*/;
# at 1048
#110224 15:34:58 server id 1012  end_log_pos 1086       Query   thread_id=3944028       exec_time=0     error_code=0
SET TIMESTAMP=1298561698/*!*/;
INSERT INTO `testsleep` VALUES (21,1),(22,3),(23,65),(24,388),(25,408),(26,412),(27,413),(28,420),(29,431),(30,432)/*!*/;
# at 1226
#110224 15:34:59 server id 1012  end_log_pos 1196       Query   thread_id=3944028       exec_time=0     error_code=0
SET TIMESTAMP=1298561699/*!*/;

As always, I will be happy to receive comments

If you copy a MyISAM table with PRIMARY key, don’t forget to order the rows first

In a case that you copy (INSERT INTO .. SELECT ..) a MyISAM table that have a PRIMARY key, it is much faster to insert the new rows in the primary key order:

  • The insertion will be faster: because the primary key cannot disabled, every insertion will also updates the primary key. In a case the rows are inserted in the primary key order, there will be no redundant disk scans.
  • The queries that uses the primary key will perform better: it make sorting easier for MySQL if the table is in order by the column that you want to order it by later.

Let’s see some examples for the theory:

mysql> show create table test_table\G
*************************** 1. row ***************************
       Table: test_table
Create Table: CREATE TABLE ` test_table ` (
  `user_id` int(10) unsigned NOT NULL,
  `auto_id` int(10) unsigned NOT NULL auto_increment,
  `col_time` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `col1` int(10) unsigned NOT NULL default '0',
  `col2` float NOT NULL,
  `col3` float NOT NULL,
  `col4` int(10) unsigned default NULL,
  `col5` int(10) unsigned default NULL,
  `col6` varchar(255) default NULL,
  PRIMARY KEY  (`user_id`,`auto_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
 
 
mysql> CREATE TABLE test_table_ordered LIKE test_table;
Query OK, 0 rows affected (0.00 sec)
 
mysql> INSERT INTO test_table_ordered  SELECT * FROM test_table ORDER BY user_id,auto_id;
Query OK, 22391869 rows affected (10 min 37.33 sec)
Records: 22391869  Duplicates: 0  Warnings: 0
 
mysql>
mysql>
mysql>
mysql> CREATE TABLE test_table_NOTordered LIKE test_table;
Query OK, 0 rows affected (0.00 sec)
 
mysql> INSERT INTO test_table_NOTordered  SELECT * FROM test_table;
Query OK, 22391869 rows affected (1 hour 18 min 38.50 sec)
Records: 22391869  Duplicates: 0  Warnings: 0
 
 
mysql>
mysql>
mysql> CREATE TABLE test_table_orderedCOPY LIKE  test_table_ordered;
Query OK, 0 rows affected (0.00 sec)
 
mysql> INSERT INTO test_table_orderedCOPY  SELECT * FROM test_table_ordered;
Query OK, 22391869 rows affected (3 min 19.96 sec)
Records: 22391869  Duplicates: 0  Warnings: 0
 
mysql>
mysql>
mysql> CREATE TABLE test_table_NOTorderedCOPY LIKE  test_table_NOTordered;
Query OK, 0 rows affected (0.01 sec)
 
mysql> INSERT INTO test_table_NOTorderedCOPY  SELECT * FROM test_table_NOTordered;
Query OK, 22391869 rows affected (1 hour 19 min 19.63 sec)
Records: 22391869  Duplicates: 0  Warnings: 0

The non-ordered insertion into table test_table_NOTordered took ~1 hour 18 min. However, the ordered insertion into the table test_table_ordered was much faster and took only ~10 min.
The non-ordered insertion into test_table_NOTorderedCOPY took the same time as the non-ordered insertion into table test_table_NOTordered.
The most faster insertion was the ordered insertion into table test_table_orderedCOPY. It took only ~3 min. This is because the table that we selects from was already ordered, hence no time spent on ordering the table before the insertion.

A note: ORDER BY does not make sense for InnoDB tables that contain a user-defined clustered index (PRIMARY KEY or NOT NULL UNIQUE index). InnoDB always orders table rows according to such an index if one is present. The same is true for BDB tables that contain a user-defined PRIMARY KEY. (see http://dev.mysql.com/doc/refman/5.0/en/alter-table.html)

I will be happy to receive comments.

You Must Have A Killer In Your System

Every system can have a slow query that randomly seems to emerge from nowhere. The main threat from these slow queries is that it can unexpectedly bring down your site. Even though the query is not called often (less than one percent), it can definitely harm your site performance.

For example:
Assume a system with 5 MySQL servers handles lots of read/write queries. In addition, let’s assume that each slow query uses the main table, runs for 15 seconds, and that there are 10 such slow queries running simultaneously. It can be assumed that each MySQL server will get at least one of these slow queries. If a write request is called after the slow query, it will block and lock all the consecutive read requests until the slow query and the write request are completed. The system will not be able to respond for at least 15 seconds. The more the slow queries are called, the less responsive the system will be.

It is a good idea to have a script killer that can kill these slow queries when necessary. Since a script killer will sacrifice the page that called the slow query, this concept works well where the slow query is called from very few and less trafficked pages. Killing these queries will allow more productive and more frequently called queries to execute while the slow query is being investigated.

A good kill script command is mk-kill (see http://www.maatkit.org/doc/mk-kill.html). I recommend configuring the script to kill only select queries and not insert/update queries. Killing an insert/update query can result in a corrupted table.

Here’s a configuration suggestion:

/usr/bin/mk-kill
--print
--daemonize
--interval 5
--busy-time 20
--ignore-info '(?i-smx:^insert|^update|^delete)'
--match-info '(?i-xsm:select)'
--log /var/jlog/admin/mk-kill.log
--execute-command '(echo "Subject: mk-kill query found on `hostname`"; tail -1 /var/log/admin/mk-kill.log)|sendmail -t tech@yoursite.com,boss@yoursite.com'
--kill-query
--pid ${pidfile}
-S /var/lib/mysql/mysql.sock

Notes:

  • The –ignore-info and –match-info accept Perl regular expressions (see http://www.ryerson.ca/perl/manual/pod/perlre.html).
  • The –kill-query will terminate the query but will not kill the connection. This is useful if you are using connection pooling.

A many to many relationship table – solving the exclude relation problem

I will start by defining the MySQL Many to Many relationship (Experts can skip to the next paragraph)

What is A MySQL many to many relationship

A MySQL many to many relationship is a relationship that is multi-valued in both directions.

This type of relationship is helped by the use of a linking table.
For example, a Question can have more than one Category, and a Category can have more than one Question.

 
CREATE TABLE link (
   Question_id int unsigned NOT NULL,
   Category varchar(20),
   PRIMARY KEY  (`Question_id `,` Category`),
   KEY `category_question` (`Category`,`Question_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
 
+-------------+---------------+
| Question_id | Category      |
+-------------+---------------+
|  1          | Animals       |
|  1          | Business      |
|  1          | Vehicles      |
|  2          | Animals       |
|  3          | Business      |
|  3          | Vehicles      |
|  4          | Animals       |
|  4          | Vehicles      |
|  5          | Business      |
+-------------+---------------+

This is a linking table called link which help us join relationships between the two tables, Questions and Categories.
As we can see, this table is reflecting a many to many relationship (Question 1 has categories  Animals,  Business and Vehicles. In addition, Category Animals has questions 1, 2 and 4).

What is the exclude relation problem?

Given a many to many relationship table, how can you select rows that don’t have a specific relationship? In terms of the question-category example, this question is translated to the following:
How can you select  (say, the first 1,000) questions that do not have a category of Business. In our example, we only want Question_id 2 and 4 returned.

A bad solution

A bad solution would be:

SELECT
    DISTINCT Question_id
FROM
    link
WHERE
    Category <> "Business"
ORDER BY
    Question_id DESC
LIMIT
    1000;

This query will work if a question is only allowed one category. In our case, since a question has at least one category, as long as a question is associated with another category other than Business, it will return.  Therefore, the result from this query is Question_id: 1, 2, 3, and 4 which is not satisfied the exclude relation.

First solution

SELECT
    DISTINCT Question_id
FROM
    link
WHERE
    Question_id NOT IN (SELECT Question_id FROM link WHERE Category="Business")  
ORDER BY
    Question_id DESC
LIMIT
    1000;

The dependent inner query (SELECT Question_id FROM link WHERE Category<>”Business”) has a performance issue – it will be executed at least 1000 times, and if it is not fast, the delays are multiplied by a number that is at least 1000. When the access method (the type field at the EXPLAIN statement output) is index_subquery, an index lookup optimization is used and a lot of overhead is avoided. When it is range, the sub query is being re-executed as is for each row. You must check your EXPLAIN details.

Second solution

CREATE TEMPORARY TABLE not_wanted_Question_id (
     UNIQUE KEY(Question_id)
) ENGINE=MEMORY
SELECT
    DISTINCT Question_id
FROM
    link
WHERE
    Category="Business";
 
SELECT
    DISTINCT link.Question_id
FROM
    link
LEFT JOIN
    not_wanted_Question_id nw ON (link.Question_id = nw.Question_id)
WHERE
    nw.Question_id is NULL
ORDER BY
    link.Question_id DESC
LIMIT
    1000;
 
DROP TABLE not_wanted_Question_id;

The multiple inner query execution problem is reduced to a lookup on a unique key inside an in-memory table, which is basically a hash lookup and is very fast. However, the building of the unwanted Questions can be a bit expensive for large tables. Note that the creating table, selecting and then dropping should be run together for every time you need the results.

Third solution (the best performed)

This is my best solution for that problem and the most performed for heavy tables:

SELECT
     Question_id ,
     SUM(IF (Category="Business", 1, 0)) AS exclude
FROM
     link
GROUP BY
     Question_id DESC
HAVING
     exclude=0
LIMIT
     10000;

This works great in our case because we have an index on the Question_id and Category columns. The SUM function is counting the number of the unwanted category occurrences for every question_id in the order and the Having statement is filtering them.

The LIMIT terminates the calculations at the points it gets to the limit number.

I will be happy to receive your comments

The battle between FORCE INDEX and the QUERY OPTIMIZER

Query Optimizer is a part of the server that takes a parsed SQL query and produces a query execution plan.

MySQL Query Optimizer uses (as one of its parameters) the stored key distribution (Cardinality) mechanism to determine the order in which tables should be joined, when you perform a join on something other than a constant and also for determining which indexes to use for a specific table within a query.

However, BEWARE! The query analyzer is not perfect. One of the many reasons is that the key distributions can be far from accurate. If the table has been heavily modified (INSERT or DELETE for example), over time the key distributions (Cardinality) diverge from the true state. (To solve this issue one must run ANALYZE TABLE once in a while).

An Optimizer that picks the wrong plan can cause severe performance issues. This means that your server can hit a performance problem at the most unexpected time.
Lucky for us, we can force the Optimizer to pick a particular plan (index) with the “Force Index” hint.
Does this mean that we should always use a Force Index hint?

NO!
Using FORCE INDEX() indiscriminately can also easily backfire. The distribution of the data or the values of query parameters can change, and the key that you thought was best for the query can, over time, become a less preferred plan. FORCE KEY should be used when you understand exactly why you are using it, and are 100% sure that there is no way the key that is being forced could become bad.

Note, that most often the need for FORCE KEY arises when the query is not very straightforward and there is no “awesome” winner key that the optimizer can easily spot. Such needs can often be addressed by re-writing the query and/or adding better keys.

Notes:
1. By default, ANALYZE TABLE statements are written to the binary log so that they will be replicated to replication slaves.
2. Another contributing factor to the MySQL Optimizers calculation of optimal query execution plan is the number of predicted disk IO/Seeks.
3. Use the ‘SHOW INDEX FROM table‘ statement in order to see the key distribution (Cardinality) for each index.
4. MyISAM data and indexes suffer from fragmentation. Running OPTIMIZE TABLE  under MyISAM will sort and recreate the index and data files.

Thanks to Sasha from Percona for opening my eyes to this issue.

Optimizing the MySQL IN() Comparison Operations Which Include the Indexed Field

The MySQL IN() Comparison Operator is said to be very quick if all the values are constants (the values are then evaluated and sorted first, and the search is done using a binary search).
However, what if the field which the IN clause refers to, is part of the index used to execute the query?

For example, lets create a table t with columns a to e

Create table t (
a int unsigned NOT NULL,
b int unsigned NOT NULL,
c int unsigned NOT NULL,
d int unsigned NOT NULL,
e int unsigned NOT NULL,
PRIMARY KEY  (`a`),
KEY `test` (`b`,`c`,`a`));

Now, let’s evaluate the following query:

SELECT a FROM t
FORCE INDEX (test)
WHERE b=4 AND c IN (1,3,5,7,9)
ORDER BY a DESC LIMIT 100;

In order to execute this query MySQL will need to use the ‘test’ index.

At this case mysql will first evaluate a full and unlimited sub query for each value in the IN clause, then, at completion, will merge the results and limit the returned result set.
This full and unlimited evaluation can be very expensive for large tables.
In order to work around this performance caveat we can force a limited sub query evaluation for each value in the IN clause before the merge. This can be done by using a “union all” clause instead of the IN clause, as in the example below.

This will run much faster:

SELECT a FROM (
(SELECT a from t FORCE INDEX (test) where b=4 AND c =1 ORDER BY a DESC LIMIT 100)
UNION ALL
(SELECT a from t FORCE INDEX (test) where b=4 AND c =3 ORDER BY a DESC LIMIT 100)
UNION ALL
(SELECT a from t FORCE INDEX (test) where b=4 AND c =5 ORDER BY a DESC LIMIT 100)
UNION ALL
(SELECT a from t FORCE INDEX (test) where b=4 AND c =7 ORDER BY a DESC LIMIT 100)
UNION ALL
(SELECT a from t FORCE INDEX (test) where b=4 AND c =9 ORDER BY a DESC LIMIT 100)
) temp
ORDER BY a DESC LIMIT 100;

Some notes:
1. For selecting more than a very few fields, it is much faster to select only the primary field and then to join the result with the same table in order to retrieve the rest of the fields (since it is much faster to sort the primary field without the extra selected fields).
2. Substituting the IN clause with OR clauses does not improve the performance of the query.

Thank you Jeffrey Doak for your comments!

MySQL User Defined Variables

When I have discovered MySQL User-defined variables at the first time, I didn’t pay much attention to them and didn’t realize their strength. At first, one can think that they are just one of the many unnoticed MySQL features. Well, I was wrong. Here, I will try to summarize their strength, though I think there is much more to explore.

By using User-defined variables, one can add some kind of procedural logic to the MySQL’s relational logic. They are connection specific variables, that means that they persist as long as the connection to the database persists and are specific to a connection (the variable value is not accessible via different connections). User variables can be assigned a value with a SET or SELECT statements and are accessible written as @var_name.

Before we dive into the fascinating uses and examples I would like to point some cautions with MySQL User Defined Variables:

  • Case sensitivity: Currently they are case sensitive (they are case insensitive at older versions >5.0).
  • They prevent query cache.
  • If you refer to a variable that has not been initialized, it has a value of NULL and a type of string.
  • The order of user defined evaluation is undefined.

So lets see some examples:

Example 1:
The first example is the simplest one. We would like to numerate the MySQL query result. Lets say we have a table called users and we would like to numerate some result from it:

SELECT
    @counter:=@counter+1 as counter,
    users.*
FROM
    (SELECT @counter:=0) v,
    users;

Example 2:
How can you produce only the 100th rows? Can you do it in a pure relational logic? I don’t think so. However, this is possible and easy to do with the User defined variables:
Lets say we have a table called users.

SELECT
    users.*
FROM
    (SELECT @counter:=0) v, 
    users
HAVING
    (@counter:=@counter+1)%100=0
ORDER BY
    user_id;

This query will print only the 100th users from that table. The statement (select @counter:=0) initialize the @counter variable to zero. This is not the only way to do it, we could wrote a prior statement to set the variable to zero:

SET @counter:=0;
SELECT
    users.*
FROM
    users
HAVING
    (@counter:=@counter+1)%100=0
ORDER BY
    user_id;

The statement having (@counter:=@counter+1)%2=0 has two operation: the first is to increment the variable and the second is to return true value for the 100th records.

Example 3:

It is quit simple to produce an accumulative totals with User-defined variable:

Say we have a table of bank account transactions by year and month.  To produce a bank account transaction report by months we can use:

SELECT
   t.year,
   t.month,
   @x:=@x+t.c
FROM
   (SELECT @x:=0) a,
   (SELECT
        year,month,sum(amount) AS c
    FROM
        account_transactions
    GROUP BY year,month) t

Example 4:

Numerate the MySQL query result per user (restarts the numbering every time user_id changes)

SELECT
   user_id,
   user_time,
   user_total_value,
   @x:=IF(@same_value=user_id,@x+1,1) as numerate,
   @same_value:=user_id as dummy
FROM
   users,
   (SELECT  @x:=0, @same_value:='') t
ORDER BY user_id,user_time;

Accumulative totals per user (restarts the numbering every time user_id changes)

SELECT
   user_id,
   user_time,
   user_total_value,
   @x:=IF(@same_value=user_id,@x+user_total_value,1) as numerate,
   @same_value:=user_id as dummy
FROM
   users,
   (SELECT  @x:=0, @same_value:='') t
ORDER BY user_id,user_time;

That is it for now.
I will be glad to receive more interesting examples that uses MySQL’s User Defined variables.

Tip: If you liked this post I am recommending reading also the following great post: Advanced MySQL user variable techniques

Limited SELECT count(*)

A SELECT statement with COUNT returns the number of rows retrieved by the SELECT statement (see mysql select count).

For performance reasons, the desired result is to limit that count.
Including a LIMIT clause in the SELECT statement will not work since it only restricts the number of rows returned, which is always one.

The solution, what I call “Limited-Count”, is done by limiting a non-count SELECT statement and wrapping it in COUNT(*).

For example:
If your count statement looks like:

SELECT COUNT(*) FROM a_table WHERE some_conditions;

You can limit the results by moving the query into:

SELECT COUNT(*) FROM (SELECT 1 FROM a_table WHERE some_conditions LIMIT x) t;

Notes:
1. COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause (this is true only for MyISAM tables. For InnoDB tables, it’s slow – thanks for Roland Bouman’s comment)
2. If you need to know how many rows your non-count SELECT statement would have returned without the LIMIT, you could consider using the information function, FOUND_ROWS(). It will fetch the total rows number without running the statement again.

Conditional Joins in MySQL

One way to do a “Conditional Join” in MySQL is by using a “LEFT JOIN”.  Create a “LEFT JOIN” for each condition and combine the results into one column using an “IF” statement by the “SELECT” expression.  Here’s an example:

Suppose you have three tables:

  1. questions: a table consisting of question ids, timestamps, and whether or not the question has been answered.
  2. answered_questions_title: a table of answered question titles.
  3. unanswered_questions_title: a table of unanswered question titles.

(yeah, I know the example isn’t 100% realistic, but it’s only for demonstration purposes ;-) )

 Now, suppose you need to find the 50 latest arrival questions with their titles.  We will need a conditional join that combines either answered or unanswered titles with the questions table.

 CREATE TABLE questions(
                q_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
                q_is_answered TINYINT(1) UNSIGNED,
                q_timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
 
 CREATE TABLE answered_questions_title(
                q_id INT UNSIGNED,
                q_title VARCHAR(255)
);
 
 CREATE TABLE unanswered_questions_title(
                q_id INT UNSIGNED,
                q_title VARCHAR(255)
);

The query would be:

SELECT
       q.q_id,
       q.q_timestamp,
       IF(q.q_is_answered, a.q_title, u.q_title) as q_title
FROM
      questions q
      LEFT JOIN answered_questions_title a
              ON (q.q_id = a.q_id AND q.q_is_answered = 1)
      LEFT JOIN unanswered_questions_title u
              ON (q.q_id = u.q_id AND q.q_is_answered = 0);

Explanation:
By inserting the “is_answered” condition to the ON clause, we restrict the “LEFT JOIN” to this condition.  The result contains the desired values with the matching condition, and null values elsewhere.  The title is now spread between two columns: a.q_title, and u.q_title.  We use the IF clause to combine them into one column, q_title.

Note:
Another possible way to do a “Conditional Join” using MySQL is with a “UNION” (a “UNION” is the result of two or more conditions results).  However, using “Union” is bad for performance when you need only top results with an “ORDER BY” and “LIMIT”.

I’m always happy to receive your comments,
Ilan Hazan