One thing you might not think of when watching a large WordPress site grow is unnecessary data in the database. Since the introduction of the block editor years ago, the number of revisions a post generates while being edited has increased significantly, and those revisions pile up fast if you’re not setting a limit.
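If you do want a limit, the usual lever is the WP_POST_REVISIONS constant in wp-config.php. For example, to keep at most 20 revisions per post (the number here is just an illustration):

// In wp-config.php. true (the default) keeps every revision;
// an integer caps how many are stored per post.
define( 'WP_POST_REVISIONS', 20 );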
If you’d like to do a quick and dirty audit of your revision data, you can use a very ugly SQL query along the lines of the one below (this version assumes the default wp_ table prefix):
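-- Count all revisions and total the byte length of their content.
SELECT COUNT(*) AS revision_count,
       SUM(LENGTH(post_content)) AS post_size
FROM wp_posts
WHERE post_type = 'revision';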
This will give you the number of revisions you have and the approximate amount of space they’re using in the database:
+----------------+------------+
| revision_count | post_size  |
+----------------+------------+
|         441419 | 2842450412 |
+----------------+------------+
As you can see, a very large WordPress site can amass a lot of unnecessary data in its database over the years. 2.8 gigabytes of revisions is a lot of data if you’re never going to use it again.
Recently I’ve been working with a client whose site we’ll soon be migrating. To help minimize downtime, we’ve been looking at reducing their database size, which is somewhere around 50-60 gigabytes. After looking through the database, one easy win was purging as many post revisions as possible, since their wp_posts table was about 50% revisions 😱
We couldn’t just mass-delete everything with the revision post type, though, because the client had some specific needs:
Keep all revisions for posts from the last 12 months.
Keep all revisions that were made after the post was published.
Keep a backup of all revisions deleted.
To do this, I crafted a custom WP-CLI command to purge these revisions. I’ve since accidentally deleted the final version of the script, as it was only a one-run thing, but here’s an earlier version that could be a good starting point for anyone else who has a similar need to prune revisions:
/**
 * WP-CLI command that deletes pre-publish post revisions for posts older than 12 months.
 *
 * @subcommand prune-revisions [--live] [--verbose]
 */
public function prune_revisions( $args, $assoc_args ) {
	global $wpdb;

	$live    = (bool) \WP_CLI\Utils\get_flag_value( $assoc_args, 'live', false );
	$verbose = (bool) \WP_CLI\Utils\get_flag_value( $assoc_args, 'verbose', false );
	$offset  = 0;
	$limit   = 500;
	$count   = 0;
	$deletes = 0;

	if ( $live ) {
		$output_file = sanitize_file_name( sprintf( 'prune-revisions-backup_%s_%d.csv', get_bloginfo( 'name' ), current_time( 'timestamp', true ) ) );
		$handle      = fopen( $output_file, 'wb' );

		if ( false === $handle ) {
			WP_CLI::error( sprintf( 'Error opening %s for writing!', $output_file ) );
		}

		// CSV headers, matching the columns of wp_posts.
		$csv_headers = array(
			'ID',
			'post_author',
			'post_date',
			'post_date_gmt',
			'post_content',
			'post_title',
			'post_excerpt',
			'post_status',
			'comment_status',
			'ping_status',
			'post_password',
			'post_name',
			'to_ping',
			'pinged',
			'post_modified',
			'post_modified_gmt',
			'post_content_filtered',
			'post_parent',
			'guid',
			'menu_order',
			'post_type',
			'post_mime_type',
			'comment_count',
			'filter',
		);

		fputcsv( $handle, $csv_headers );
	}

	$count_sql      = "SELECT COUNT(ID) FROM {$wpdb->posts} WHERE post_type = 'revision'";
	$revision_count = (int) $wpdb->get_var( $count_sql );

	$progress = \WP_CLI\Utils\make_progress_bar( sprintf( 'Checking %s revisions', number_format( $revision_count ) ), $revision_count );

	do {
		$sql       = $wpdb->prepare( "SELECT ID FROM {$wpdb->posts} WHERE post_type = 'revision' LIMIT %d,%d", $offset, $limit );
		$revisions = $wpdb->get_results( $sql );

		foreach ( $revisions as $revision ) {
			$count++;

			$post_parent_id = wp_get_post_parent_id( $revision->ID );

			// Fail on either 0 or false.
			if ( false == $post_parent_id ) {
				WP_CLI::warning( sprintf( 'Revision %d does not have a parent! Skipping!', $revision->ID ) );
				continue;
			}

			$revision_modified   = get_post_modified_time( 'U', false, $revision->ID );
			$parent_publish_time = get_post_time( 'U', false, $post_parent_id );

			if ( $parent_publish_time < current_time( 'timestamp' ) - ( MONTH_IN_SECONDS * 12 ) ) {
				// Post is older than 12 months, so it's safe to delete pre-publish revisions.
				if ( $revision_modified < $parent_publish_time ) {
					if ( $live ) {
						// We're doing it live!
						WP_CLI::log( sprintf( 'Deleting revision %d for post %d. (%d%% done)', $revision->ID, $post_parent_id, ( $count / $revision_count ) * 100 ) );

						// Back up the revision before deleting it.
						$output = array();
						$data   = get_post( $revision->ID );

						// Validate the field is set, just in case. IDK how it couldn't be.
						foreach ( $csv_headers as $field ) {
							$output[] = isset( $data->$field ) ? $data->$field : '';
						}

						fputcsv( $handle, $output );

						$did_delete = wp_delete_post_revision( $revision->ID );

						// Something went wrong while deleting the revision?
						if ( false === $did_delete || is_wp_error( $did_delete ) ) {
							WP_CLI::warning( sprintf( 'Revision %d for post %d DID NOT DELETE!', $revision->ID, $post_parent_id ) );
						} else {
							$deletes++;
						}

						// Pause after lots of db modifications.
						if ( 0 === $deletes % 50 ) {
							if ( $verbose ) {
								WP_CLI::log( sprintf( 'Current Deletes: %d', $deletes ) );
							}
							sleep( 1 );
						}
					} else {
						// Not live, just output info.
						WP_CLI::log( sprintf( 'Will delete revision %d for post %d.', $revision->ID, $post_parent_id ) );
					}
				} else {
					// Revision was made after the post was published; keep it.
					if ( $verbose ) {
						WP_CLI::log( sprintf( 'Post-Publish: Will NOT delete Revision %d for post %d.', $revision->ID, $post_parent_id ) );
					}
				}
			} else {
				// Post is too new to prune.
				if ( $verbose ) {
					WP_CLI::log( sprintf( 'Too-New: Will NOT delete Revision %d for post %d.', $revision->ID, $post_parent_id ) );
				}
			}
		}

		// Pause after lots of db reads.
		if ( 0 === $count % 5000 ) {
			if ( $verbose ) {
				WP_CLI::log( sprintf( 'Current Count: %d', $count ) );
			}
			sleep( 1 );
		}

		// Free up memory.
		$this->stop_the_insanity();

		// Paginate.
		if ( count( $revisions ) ) {
			$offset += $limit;
			$progress->tick( $limit );
		} else {
			WP_CLI::warning( 'Possible MySQL Error, retrying in 10 seconds!' );
			sleep( 10 );
		}
	} while ( $count < $revision_count );

	$progress->finish();

	if ( $live ) {
		fclose( $handle );
		WP_CLI::success( sprintf( 'Deleted %d revisions', $deletes ) );
	} else {
		WP_CLI::success( sprintf( 'Processed %d revisions', $revision_count ) );
	}
}
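Without --live, the command is a dry run that only logs what it would delete. Assuming the class is registered under a hypothetical top-level command named revision-tools, running it would look something like this:

# Dry run: report what would be deleted without touching the database.
wp revision-tools prune-revisions --verbose

# The real thing: back up each revision to a CSV, then delete it.
wp revision-tools prune-revisions --live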
Of course, you could go through each post, check its meta for duplicates, and delete them one at a time, but…

I think I’ve figured out a beast of an SQL query that will give you the meta IDs of any duplicate post meta so that you can delete them with delete_metadata_by_mid() (there’s a sketch of that cleanup loop after the query):
SELECT * FROM (
    SELECT meta_id FROM (
        SELECT meta_table_1.* FROM wp_postmeta meta_table_1, wp_postmeta meta_table_2 WHERE (
            meta_table_1.post_id = meta_table_2.post_id AND meta_table_1.meta_value = meta_table_2.meta_value AND meta_table_1.meta_key = meta_table_2.meta_key
        ) ORDER BY meta_table_2.post_id
    ) Table_All_Duplicates GROUP BY meta_id HAVING COUNT(*) > 1
) Unique_Dupes WHERE meta_id NOT IN (
    SELECT min(meta_id) AS min_meta_id FROM wp_postmeta GROUP BY post_id, meta_key, meta_value HAVING COUNT(post_id) > 1 AND COUNT(meta_key) > 1 AND COUNT(meta_value) > 1 AND COUNT(meta_key >= 1)
);
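Once you have those meta IDs, the deletion itself is the easy part. Here’s a minimal sketch, assuming the query above is stored in $duplicates_sql; delete_metadata_by_mid() takes the meta type ('post' here) and a single meta ID, and handles cache invalidation for you:

global $wpdb;

// Fetch the duplicate meta IDs found by the query above, then delete
// each one through the WordPress API so caches stay consistent.
$duplicate_meta_ids = $wpdb->get_col( $duplicates_sql );

foreach ( $duplicate_meta_ids as $meta_id ) {
	delete_metadata_by_mid( 'post', (int) $meta_id );
}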
An example:
mysql> SELECT * FROM wp_postmeta WHERE post_id = 919 AND meta_key = '_totally_duped';
+---------+---------+----------------+------------+
| meta_id | post_id | meta_key       | meta_value |
+---------+---------+----------------+------------+
|    2123 |     919 | _totally_duped | 1          |
|    2124 |     919 | _totally_duped | 1          |
|    2125 |     919 | _totally_duped | 1          |
+---------+---------+----------------+------------+
3 rows in set (0.01 sec)
mysql> SELECT * FROM (SELECT meta_id FROM (SELECT meta_table_1.* FROM wp_postmeta meta_table_1, wp_postmeta meta_table_2 WHERE ( meta_table_1.post_id = meta_table_2.post_id AND meta_table_1.meta_value = meta_table_2.meta_value AND meta_table_1.meta_key = meta_table_2.meta_key ) ORDER BY meta_table_2.post_id) Table_All_Duplicates GROUP BY meta_id HAVING COUNT(*)>1) Unique_Dupes WHERE meta_id NOT IN (SELECT min(meta_id) AS min_meta_id FROM wp_postmeta GROUP BY post_id,meta_key,meta_value HAVING COUNT(post_id) > 1 AND COUNT(meta_key) > 1 AND COUNT(meta_value) > 1 AND COUNT(meta_key >= 1));
+---------+
| meta_id |
+---------+
|    2124 |
|    2125 |
+---------+
2 rows in set (0.03 sec)
I just want to note, though: this is VERY slow. I did this because I’m currently working on cleaning up a site that has about 24 million rows of postmeta, and I really hope there are some duplicates to delete. This one query has been running for about 12 hours now, and I have no idea when it will finish.
So I guess I do have time for that after all ¯\_(ツ)_/¯
UPDATE:
I’ve discovered that it might be good to have more data for forensic investigation… so I’ve adjusted the query to also give me the post ID, meta key, meta value length, and the original meta ID:
mysql> SELECT post_id, group_concat(distinct meta_id) AS meta_ids, min(distinct meta_id) AS primary_meta_id, meta_key, char_length(meta_value) AS meta_length, count(distinct meta_id) * char_length(meta_value) AS total_size FROM ( SELECT meta_table_1.* FROM wp_postmeta meta_table_1, wp_postmeta meta_table_2 WHERE (
    -> meta_table_1.post_id = meta_table_2.post_id AND meta_table_1.meta_value = meta_table_2.meta_value AND meta_table_1.meta_key = meta_table_2.meta_key ) ORDER BY meta_table_2.post_id ) Table_All_Duplicates GROUP BY post_id, meta_key HAVING COUNT(*) > 1;
+---------+----------------+-----------------+------------------+-------------+------------+
| post_id | meta_ids       | primary_meta_id | meta_key         | meta_length | total_size |
+---------+----------------+-----------------+------------------+-------------+------------+
|     915 | 2176,2177,2178 |            2176 | _totally_duped_2 |           1 |          3 |
|     919 | 2123,2124,2125 |            2123 | _totally_duped   |           1 |          3 |
+---------+----------------+-----------------+------------------+-------------+------------+
2 rows in set (0.03 sec)