I sell Linux on CD on my Linux CD Mall website, and it has a page showing the most popular Linux and BSD distributions that it sells. For some reason, I’d never set this up to query the information from the database, instead manually updating it periodically. Today I decided to actually get around to making it read the information from the database and have decided to share the SQL queries etc used to get the information for top sellers by category.
A summary of the relevant fields and tables are as follows (I’ve modified the tablenames and fieldnames from the actual ones to make them more generic to a regular ecommerce solution):
orders_header order_id - the primary key status - the order's status order_date - the date the order was placed orders_detail order_id - value to join this table to orders_header product_id - value to join this to products_to_categories products_to_categories product_id - the product's id, used to join to orders_detail category_id the category's id, used to join to categories categories category_id - the primary key name - the name of the category url - the url for the category page
The end result required is to have a list of category names sorted in order from the most popular to the least popular. In my case, I want to show just the top 10 but the SQL query can be easily adjusted to show more. The actual database structure for my Linux CD Mall website allows a product (in this case a CD set or DVD) to belong to one category (a Linux or BSD distribution) so a category can only be counted once per orders_detail line. If a product can belong to more than one category then a category may be counted multiple times per orders_detail line, but this may be acceptable depending on the circumstances.
The query used to get the data for the previous month is as follows (assuming the current date is 11 December 2007, making last month November 2007):
SELECT c.name, c.url, COUNT(*) FROM orders_header h INNER JOIN orders_detail d ON h.order_id = d.order_id INNER JOIN products_to_categories p2c ON d.product_id = p2c.product_id INNER JOIN product p ON p2c.product_id = p.product_id WHERE h.status >= 30 AND h.order_date >= '2007-11-01' AND h.order_date <= '2007-11-30' GROUP BY c.name, c.url ORDER BY COUNT(*) DESC LIMIT 10
In my orders_header table, a status of 30 means it has been paid and a status of 40 means it has been sent; we only want to include orders that have been paid or sent, hence the "WHERE h.status >= 30" condition.
When doing a query like this, you need to make sure all the columns in the joins and where conditions are indexed, otherwise the query may take a considerable amount of time. Adding indexes should make the queries run a lot faster. The actual query I use on my CD site is slightly more complex than the above, and was taking around 4 seconds to run, despite a lot of optimization. In the end I added a query cache to MySQL which fixed the issue for subsequent queries. You can read about this in the next post.