Converting Quotes and Removing Other Unicode Garbage in WordPress 4.*

September 10th, 2016

The WordPress ecosystem has many odd small blind spots in which good functionality simply cannot be found. One of those blind spots is the conversion of the various types of quote character, such as ‘item’ and “item”, to the standard ASCII single and double quotes 'item' and "item". Another is the related matter of clearing garbage Unicode, such as unprintable characters, out of submitted content. The former is just one of those things that annoys some people into taking action. The latter is fairly important, because the presence of some Unicode characters can cause common plugins to fail in fairly catastrophic ways, such as returning empty strings when processing post content in the administrative interface.

Outside of the established frameworks, the PHP development ecosystem doesn't handle character encoding well. PHP is by now an older language, and it did not start from a good basis of support for multibyte string encodings. Most of the common and familiar string functions are not safe for use with multibyte strings, correct management of strings is somewhat fiddly and painful, and naive developers are still writing PHP code that fails in all sorts of interesting ways when processing the now-standard UTF-8 encoding. This last point becomes evident quite quickly when searching for a WordPress plugin to replace the many and varied quotation marks in use nowadays with the standard ASCII single or double quote characters. I was eventually forced to write my own code for that task.
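As a small illustration of the point about byte-oriented functions, the sketch below compares strlen() and substr() with their mbstring counterparts on a string wrapped in curly double quotes. It assumes the mbstring extension is loaded, which is common but not guaranteed, and the variable names are only for illustration.

```php
<?php
// "item" wrapped in U+201C and U+201D, written as raw UTF-8 bytes so the
// example does not depend on the encoding of this source file.
$quoted = "\xE2\x80\x9Citem\xE2\x80\x9D";

// strlen() counts bytes: 3 + 4 + 3.
$byte_length = strlen($quoted);

// mb_strlen() counts characters when told the encoding (assumes mbstring).
$char_length = mb_strlen($quoted, 'UTF-8');

// substr() cuts in the middle of the three-byte opening quote, leaving a
// fragment that is no longer valid UTF-8.
$broken = substr($quoted, 0, 1);
$broken_is_valid = mb_check_encoding($broken, 'UTF-8');

// mb_substr() returns the whole first character.
$first_char = mb_substr($quoted, 0, 1, 'UTF-8');

printf("bytes: %d, characters: %d\n", $byte_length, $char_length);
printf("substr fragment valid UTF-8: %s\n", $broken_is_valid ? 'yes' : 'no');
printf("mb_substr first character is U+201C: %s\n",
  $first_char === "\xE2\x80\x9C" ? 'yes' : 'no');
```

The byte-oriented calls do not raise any error here; they simply return results that are wrong for text, which is why this class of bug tends to go unnoticed until non-ASCII content turns up.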

The example below is intended for inclusion in a theme. It only scratches the surface of the topic, but works well enough for my uses. I only need the commonly used English language quote and dash characters that turn up in cut-and-paste content to be replaced by the standard ASCII quote and dash characters, along with the removal of some of the more esoteric Unicode junk that has in the past caused problems with the plugins that I use. Using an interface to the PCRE engine makes this all slightly easier than would otherwise be the case, but one could certainly use any of the regular expression engines and list all of the characters explicitly rather than using built-in classes where they are available.

/**
 * Remove or convert unwanted unicode characters when content is saved.
 * This can be applied to posts, comments, or really any text content.
 */
function example_filter_pre_save($str) {
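  // Skip empty values, and leave invalid UTF-8 untouched: with the /u
  // modifier preg_replace() returns NULL for such input, which would
  // otherwise wipe the content.
  if (!$str || !is_string($str) || preg_match('//u', $str) !== 1) {
    return $str;
  }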

  // Using preg_replace for the PCRE notation, which mb_ereg_replace doesn't
  // have. Using it makes things a lot easier.
  //
  // \p{Pd} = any dash punctuation character.
  // \x{2212} is the minus sign, which Unicode classes as a math symbol
  // (Sm) rather than dash punctuation, so it is listed explicitly.
  $str = preg_replace(
    '/[\p{Pd}\x{2212}]/u',
    '-',
    $str
  );

  // https://en.wikipedia.org/wiki/Quotation_mark#Curved_quotes_and_Unicode
  //
  // Single quotes; not a complete list, just the common ones.
  $str = preg_replace(
    '/[\x{2018}\x{2019}\x{201A}\x{201B}\x{FF07}]/u',
    '\'',
    $str
  );
  // Double quotes; again, not a complete list.
  $str = preg_replace(
    '/[\x{201C}\x{201D}\x{201E}\x{201F}\x{301D}\x{301E}\x{301F}\x{FF02}]/u',
    '"',
    $str
  );

  // Strip out non-printable C0 and C1 control characters, keeping tab,
  // line feed, and carriage return.
  $str = preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/u',
    '',
    $str
  );

  return $str;
}

// Apply the filter to post and comment fields.
add_filter('content_save_pre', 'example_filter_pre_save');
add_filter('title_save_pre', 'example_filter_pre_save');
add_filter('pre_comment_author_email', 'example_filter_pre_save');
add_filter('pre_comment_author_name', 'example_filter_pre_save');
add_filter('pre_comment_author_url', 'example_filter_pre_save');
add_filter('pre_comment_content', 'example_filter_pre_save');

// Next, stop WordPress from converting plain ASCII quotes and dashes
// into the more decorative versions, which it does in a wide variety
// of places.
$example_texturized_text = array(
  'comment_author',
  'term_name',
  'link_name',
  'link_description',
  'link_notes',
  'bloginfo',
  'wp_title',
  'widget_title',
  'single_post_title',
  'single_cat_title',
  'single_tag_title',
  'single_month_title',
  'nav_menu_attr_title',
  'nav_menu_description',
  'term_description',
  'the_title',
  'the_content',
  'the_excerpt',
  'comment_text',
  'list_cats'
);

foreach ($example_texturized_text as $text) {
  remove_filter($text, 'wptexturize');
}
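
Two details of the PCRE patterns used in example_filter_pre_save() are easy to check in isolation. The sketch below uses only preg_match(), preg_replace(), and preg_last_error(), and needs no WordPress functions. It shows that the minus sign U+2212 belongs to the math symbol category \p{Sm} rather than \p{Pd}, and that preg_replace() with the /u modifier returns NULL when handed invalid UTF-8. The helper name example_in_category() is hypothetical and exists only for this illustration.

```php
<?php
// Hypothetical helper: true when $char is a single character in the
// given Unicode general category (for example 'Pd' or 'Sm').
function example_in_category($char, $category) {
  return preg_match('/^\p{' . $category . '}$/u', $char) === 1;
}

// Raw UTF-8 bytes, so the example does not depend on file encoding.
$en_dash = "\xE2\x80\x93"; // U+2013 EN DASH
$em_dash = "\xE2\x80\x94"; // U+2014 EM DASH
$minus   = "\xE2\x88\x92"; // U+2212 MINUS SIGN

$en_is_pd    = example_in_category($en_dash, 'Pd');
$em_is_pd    = example_in_category($em_dash, 'Pd');
$minus_is_pd = example_in_category($minus, 'Pd');
$minus_is_sm = example_in_category($minus, 'Sm');

// The combined class from the filter normalizes all three to "-".
$normalized = preg_replace(
  '/[\p{Pd}\x{2212}]/u',
  '-',
  "a" . $en_dash . "b" . $em_dash . "c" . $minus . "d"
);

// A 0xC3 lead byte followed by "(" is not valid UTF-8.
$invalid = "abc\xC3\x28";
$result  = preg_replace('/[\p{Pd}\x{2212}]/u', '-', $invalid);
$error   = preg_last_error();

var_dump($en_is_pd, $em_is_pd, $minus_is_pd, $minus_is_sm);
var_dump($normalized);
var_dump($result === null, $error === PREG_BAD_UTF8_ERROR);
```

A NULL returned from a save filter replaces the submitted content, so checking the input before running the replacements, for instance with preg_match('//u', $str), is one way to guard against that.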