A many to many relationship table – solving the exclude relation problem

A MySQL many to many relationship is a relationship that is multi-valued in both directions. Given that, how can you select rows that don’t have a specific relationship (the exclude relation problem)?
 
I will start by defining the MySQL Many to Many relationship (Experts can skip to the next paragraph)

What is A MySQL many to many relationship

A MySQL many to many relationship is a relationship that is multi-valued in both directions.

This type of relationship is helped by the use of a linking table.
For example, a Question can have more than one Category, and a Category can have more than one Question.

 
CREATE TABLE link (
   Question_id int unsigned NOT NULL,
   Category varchar(20),
   PRIMARY KEY  (`Question_id `,` Category`),
   KEY `category_question` (`Category`,`Question_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
 
+-------------+---------------+
| Question_id | Category      |
+-------------+---------------+
|  1          | Animals       |
|  1          | Business      |
|  1          | Vehicles      |
|  2          | Animals       |
|  3          | Business      |
|  3          | Vehicles      |
|  4          | Animals       |
|  4          | Vehicles      |
|  5          | Business      |
+-------------+---------------+

This is a linking table called link which help us join relationships between the two tables, Questions and Categories.
As we can see, this table is reflecting a many to many relationship (Question 1 has categories  Animals,  Business and Vehicles. In addition, Category Animals has questions 1, 2 and 4).

What is the exclude relation problem?

Given a many to many relationship table, how can you select rows that don’t have a specific relationship? In terms of the question-category example, this question is translated to the following:
How can you select  (say, the first 1,000) questions that do not have a category of Business. In our example, we only want Question_id 2 and 4 returned.

A bad solution

A bad solution would be:

SELECT
    DISTINCT Question_id
FROM
    link
WHERE
    Category <> "Business"
ORDER BY
    Question_id DESC
LIMIT
    1000;

This query will work if a question is only allowed one category. In our case, since a question has at least one category, as long as a question is associated with another category other than Business, it will return.  Therefore, the result from this query is Question_id: 1, 2, 3, and 4 which is not satisfied the exclude relation.

First solution

SELECT
    DISTINCT Question_id
FROM
    link
WHERE
    Question_id NOT IN (SELECT Question_id FROM link WHERE Category="Business")  
ORDER BY
    Question_id DESC
LIMIT
    1000;

The dependent inner query (SELECT Question_id FROM link WHERE Category<>”Business”) has a performance issue – it will be executed at least 1000 times, and if it is not fast, the delays are multiplied by a number that is at least 1000. When the access method (the type field at the EXPLAIN statement output) is index_subquery, an index lookup optimization is used and a lot of overhead is avoided. When it is range, the sub query is being re-executed as is for each row. You must check your EXPLAIN details.

Second solution

CREATE TEMPORARY TABLE not_wanted_Question_id (
     UNIQUE KEY(Question_id)
) ENGINE=MEMORY
SELECT
    DISTINCT Question_id
FROM
    link
WHERE
    Category="Business";
 
SELECT
    DISTINCT link.Question_id
FROM
    link
LEFT JOIN
    not_wanted_Question_id nw ON (link.Question_id = nw.Question_id)
WHERE
    nw.Question_id is NULL
ORDER BY
    link.Question_id DESC
LIMIT
    1000;
 
DROP TABLE not_wanted_Question_id;

The multiple inner query execution problem is reduced to a lookup on a unique key inside an in-memory table, which is basically a hash lookup and is very fast. However, the building of the unwanted Questions can be a bit expensive for large tables. Note that the creating table, selecting and then dropping should be run together for every time you need the results.

Third solution (the best performed)

This is my best solution for that problem and the most performed for heavy tables:

SELECT
     Question_id ,
     SUM(IF (Category="Business", 1, 0)) AS exclude
FROM
     link
GROUP BY
     Question_id DESC
HAVING
     exclude=0
LIMIT
     10000;

This works great in our case because we have an index on the Question_id and Category columns. The SUM function is counting the number of the unwanted category occurrences for every question_id in the order and the Having statement is filtering them.

The LIMIT terminates the calculations at the points it gets to the limit number.

I will be happy to receive your comments

MySQL Quiz
9 Comments
  1. Scott says:

    How does the independent subquery perform for your data set?

    SELECT
    DISTINCT Question_id
    FROM
    link
    LEFT JOIN (SELECT Question_id FROM link WHERE Category=”Business”) dt USING (Question_id)
    WHERE
    dt.Question_id IS NULL
    LIMIT
    1000;

  2. rudy says:

    whenever a standard sql construct exists, always use that instead of the proprietary alternative

    in this instance, use CASE instead of IF

    makes the sql a lot more portable

    ;o)

  3. Hazan Ilan says:

    Hi Scott,
    This is actually the same query as in the Second solution.
    The Third solution gives much faster results.

  4. Pablo says:

    What’s about a question_id that exist on questions table but NO have category related?

    For example if exist a question_id = 6 without relations on link table

    This question_id no have a category of Bussiness and it will not appear on result, however, match the condition.

  5. Hazan Ilan says:

    Pablo,
    The best to do with a question that don’t have a related category, is to attach it a dummy category (can be called Uncategorized).
    That way this question will be in the link table.
    Please note that the link table is the best place to select lists of questions related to a given category.

  6. Pablo says:

    This query may solve this problem including the case of questions without category, but I’m not compared if this is a fastest solution, but is well

    SELECT
    DISTINCT Questions.Question_id
    FROM
    Questions LEFT OUTER JOIN link ON(Questions.Question_id = link.Question_id and Category <> "Business" )
    LIMIT
    1000;
    
  7. Hazan Ilan says:

    Pablo,
    Your query will not solve the Exclude Relation problem.
    For example, in the link table described in the post example, your query will return the question ids: 1, 2, 3 and 4.

  8. Brett says:

    First off, your “NOT IN” solution (First Solution, that has a performance issue) is silly, you should be using NOT EXISTS, ala:

    SELECT
    DISTINCT Question_id
    FROM
    Question
    WHERE
    NOT EXISTS (SELECT 1 FROM link WHERE (link.Question_id = Question.Question_ID) AND (Category = “Business”))
    ORDER BY
    Question_id DESC
    LIMIT
    1000;

    NOT EXISTS is far cheaper than NOT IN. ‘course, this is still not ideal, as it is an RBAR solution.

    As for Pablo, he’s close. What you want is a left outer join for matches, and a null check. eg:

    SELECT
    DISTINCT Questions.Question_id
    FROM
    Questions LEFT OUTER
    JOIN link ON (Questions.Question_id = link.Question_id and Category = “Business” )
    WHERE
    link.Question_ID IS NULL
    LIMIT
    1000;

    That said, I haven’t tested the performance of this, and left outer joins aren’t as cheap as inner joins. This is also very similar, semantically, to the sub-select approach.

  9. Hazan Ilan says:

    Brett,
    Your solution is good. You have been using the Question table for it.
    I haven’t tested your query performance yet.
    Thanks for your comment.

    CREATE TABLE link (
    Question_id int unsigned NOT NULL,
    Category varchar(20),
    PRIMARY KEY (`Question_id`,`Category`),
    KEY `category_question` (`Category`,`Question_id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

    INSERT INTO link (Question_id,Category)
    VALUES (1,’Animals’),(1,’Business’),(1,’Vehicles’),(2,’Animals’),(3,’Business’),(3,’Vehicles’),(4,’Animals’),(4,’Vehicles’),(5,’Business’);

    CREATE TABLE Question (
    Question_id int unsigned NOT NULL,
    PRIMARY KEY (`Question_id`));

    INSERT INTO Question
    VALUES (1),(2),(3),(4),(5);

    mysql> SELECT DISTINCT Question.Question_id
    FROM Question LEFT OUTER JOIN link ON (Question.Question_id = link.Question_id and Category = “Business” )
    WHERE link.Question_ID IS NULL
    LIMIT 1000;

    +—————-+
    | Question_id |
    +—————-+
    | 2 |
    | 4 |
    +—————-+
    2 rows in set (0.00 sec)

Leave a Reply