A couple of weeks back I posted how to find records in one table that are not in another with MySQL and received an email from Greg Jorgensen with a more efficient way of approaching the same problem and therefore revise my original post with his suggestion.
The examples below use three tables as follows:
content: contains the content pages for a website. The primary key is content_id.
tags: the "tags" that a page is tagged with. The primary key is tag_id.
content_to_tags: a table that creates a many-to-many relationship between the above two tables; a page can belong to multiple tags.
The first example finds records in content_to_tags that have no associated record in content. This could have happened if the application deleted a record from content but didn’t delete the associated records from content_to_tags:
SELECT * FROM content_to_tags c2t WHERE NOT EXISTS ( SELECT * FROM content c WHERE c.content_id = c2t.content_id )
This is the same as for the first example but comparing tags with content_to_tags. Again, this might have happened if records were deleted from tags but not their associated records from content_to_tags:
SELECT * FROM content_to_tags c2t WHERE NOT EXISTS ( SELECT * FROM tags t WHERE t.tag_id = c2t.tag_id )
Benchmarking Examples 1 and 2
In my orginal post I used a LEFT JOIN from content_to_tags to content. I benchmarked my original query compared to this query on a small content table with 1000 records, a content_to_tags table with 2000 records and a tags table with 100 records.
Using NOT EXISTS was just over two times faster, and I would assume that as the tables get populated with more records the difference exponential.
A note about ACID compliance and foreign keys
Note that in a properly ACID compliant database with foreign key constraints the first two examples shouldn’t actually return any data, because it wouldn’t be possible to delete records from tags/content if there are associated records present in the content_to_tags table.
Examples 3 and 4 are looking at something slightly different: finding content that’s not tagged at all, and finding tags that are not tagged to any content.
The next example looks for records in content where there are no associated records in content_to_tags. This is useful if you want to find any untagged posts:
SELECT * FROM content c WHERE NOT EXISTS ( SELECT * FROM content_to_tags c2t WHERE c.content_id = c2t.content_id )
The final example is the same as the above but to find tags that have no associated records in content_to_tags. This is useful for finding tags that have no content, or in another context (with different table names etc) categories that have no products in them:
SELECT * FROM tags t WHERE NOT EXISTS ( SELECT * FROM content_to_tags c2t WHERE t.tag_id = c2t.tag_id )
Benchmarking Examples 3 and 4
The difference between my original LEFT JOIN and the NOT EXISTS queries here was minimal for examples 3 and 4 on the example tables although I would imagine as they become bigger the difference would grow. NOT EXISTS was once again faster that the LEFT JOIN syntax.