A many to many relationship table – solving the exclude relation problem
A MySQL many to many relationship is a relationship that is multi-valued in both directions. Given that, how can you select rows that don’t have a specific relationship (the exclude relation problem)?
I will start by defining the MySQL Many to Many relationship (Experts can skip to the next paragraph)
What is A MySQL many to many relationship
A MySQL many to many relationship is a relationship that is multi-valued in both directions.
This type of relationship is helped by the use of a linking table.
For example, a Question can have more than one Category, and a Category can have more than one Question.
CREATE TABLE link ( Question_id int unsigned NOT NULL, Category varchar(20), PRIMARY KEY (`Question_id `,` Category`), KEY `category_question` (`Category`,`Question_id`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1 +-------------+---------------+ | Question_id | Category | +-------------+---------------+ | 1 | Animals | | 1 | Business | | 1 | Vehicles | | 2 | Animals | | 3 | Business | | 3 | Vehicles | | 4 | Animals | | 4 | Vehicles | | 5 | Business | +-------------+---------------+
This is a linking table called link which help us join relationships between the two tables, Questions and Categories.
As we can see, this table is reflecting a many to many relationship (Question 1 has categories Animals, Business and Vehicles. In addition, Category Animals has questions 1, 2 and 4).
What is the exclude relation problem?
Given a many to many relationship table, how can you select rows that don’t have a specific relationship? In terms of the question-category example, this question is translated to the following:
How can you select (say, the first 1,000) questions that do not have a category of Business. In our example, we only want Question_id 2 and 4 returned.
A bad solution
A bad solution would be:
SELECT DISTINCT Question_id FROM link WHERE Category <> "Business" ORDER BY Question_id DESC LIMIT 1000;
This query will work if a question is only allowed one category. In our case, since a question has at least one category, as long as a question is associated with another category other than Business, it will return. Therefore, the result from this query is Question_id: 1, 2, 3, and 4 which is not satisfied the exclude relation.
SELECT DISTINCT Question_id FROM link WHERE Question_id NOT IN (SELECT Question_id FROM link WHERE Category="Business") ORDER BY Question_id DESC LIMIT 1000;
The dependent inner query (SELECT Question_id FROM link WHERE Category<>”Business”) has a performance issue – it will be executed at least 1000 times, and if it is not fast, the delays are multiplied by a number that is at least 1000. When the access method (the type field at the EXPLAIN statement output) is index_subquery, an index lookup optimization is used and a lot of overhead is avoided. When it is range, the sub query is being re-executed as is for each row. You must check your EXPLAIN details.
CREATE TEMPORARY TABLE not_wanted_Question_id ( UNIQUE KEY(Question_id) ) ENGINE=MEMORY SELECT DISTINCT Question_id FROM link WHERE Category="Business"; SELECT DISTINCT link.Question_id FROM link LEFT JOIN not_wanted_Question_id nw ON (link.Question_id = nw.Question_id) WHERE nw.Question_id is NULL ORDER BY link.Question_id DESC LIMIT 1000; DROP TABLE not_wanted_Question_id;
The multiple inner query execution problem is reduced to a lookup on a unique key inside an in-memory table, which is basically a hash lookup and is very fast. However, the building of the unwanted Questions can be a bit expensive for large tables. Note that the creating table, selecting and then dropping should be run together for every time you need the results.
Third solution (the best performed)
This is my best solution for that problem and the most performed for heavy tables:
SELECT Question_id , SUM(IF (Category="Business", 1, 0)) AS exclude FROM link GROUP BY Question_id DESC HAVING exclude=0 LIMIT 10000;
This works great in our case because we have an index on the Question_id and Category columns. The SUM function is counting the number of the unwanted category occurrences for every question_id in the order and the Having statement is filtering them.
The LIMIT terminates the calculations at the points it gets to the limit number.
I will be happy to receive your comments