A couple of weeks back I posted how to find records in one table that are not in another with MySQL and received an email from Greg Jorgensen with a more efficient way of approaching the same problem and therefore revise my original post with his suggestion.
Example tables
The examples below use three tables as follows:
content: contains the content pages for a website. The primary key is content_id.
tags: the "tags" that a page is tagged with. The primary key is tag_id.
content_to_tags: a table that creates a many-to-many relationship between the above two tables; a page can belong to multiple tags.
Example 1
The first example finds records in content_to_tags that have no associated record in content. This could have happened if the application deleted a record from content but didn’t delete the associated records from content_to_tags:
SELECT * FROM content_to_tags c2t WHERE NOT EXISTS ( SELECT * FROM content c WHERE c.content_id = c2t.content_id )
Example 2
This is the same as for the first example but comparing tags with content_to_tags. Again, this might have happened if records were deleted from tags but not their associated records from content_to_tags:
SELECT * FROM content_to_tags c2t WHERE NOT EXISTS ( SELECT * FROM tags t WHERE t.tag_id = c2t.tag_id )
Benchmarking Examples 1 and 2
In my orginal post I used a LEFT JOIN from content_to_tags to content. I benchmarked my original query compared to this query on a small content table with 1000 records, a content_to_tags table with 2000 records and a tags table with 100 records.
Using NOT EXISTS was just over two times faster, and I would assume that as the tables get populated with more records the difference exponential.
A note about ACID compliance and foreign keys
Note that in a properly ACID compliant database with foreign key constraints the first two examples shouldn’t actually return any data, because it wouldn’t be possible to delete records from tags/content if there are associated records present in the content_to_tags table.
Examples 3 and 4 are looking at something slightly different: finding content that’s not tagged at all, and finding tags that are not tagged to any content.
Example 3
The next example looks for records in content where there are no associated records in content_to_tags. This is useful if you want to find any untagged posts:
SELECT * FROM content c WHERE NOT EXISTS ( SELECT * FROM content_to_tags c2t WHERE c.content_id = c2t.content_id )
Example 4
The final example is the same as the above but to find tags that have no associated records in content_to_tags. This is useful for finding tags that have no content, or in another context (with different table names etc) categories that have no products in them:
SELECT * FROM tags t WHERE NOT EXISTS ( SELECT * FROM content_to_tags c2t WHERE t.tag_id = c2t.tag_id )
Benchmarking Examples 3 and 4
The difference between my original LEFT JOIN and the NOT EXISTS queries here was minimal for examples 3 and 4 on the example tables although I would imagine as they become bigger the difference would grow. NOT EXISTS was once again faster that the LEFT JOIN syntax.