Paulund

Sanitize Special Word Characters With PHP

Anybody who has worked with user content on a website will know the problem you get when a user writes content in Word and then copies and pastes it into a textarea so that you can store it in the database. When you display this content on the front-end, you get a lot of strange black question marks. This is because some of the characters that Word uses are not recognised by the browser as valid characters, typically because of a character encoding mismatch. Things like curly single quotes, curly double quotes, en and em dashes and ellipses are all different characters in Word. Therefore you need to sanitize these values before you insert them into the database so that the content displays correctly on the front-end.

The following PHP code snippet first searches for the Microsoft Word special characters and replaces them with standard ASCII characters. Next it uses preg_replace to remove any remaining characters outside the printable ASCII range. This should clean up the Word characters in the content, allowing you to display it in the browser.


function sanitize_from_word( $content )
{
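    // Convert Microsoft Word special characters to their ASCII equivalents
    $replace = array(
        "‘" => "'",   // left single quote
        "’" => "'",   // right single quote
        "”" => '"',   // right double quote
        "“" => '"',   // left double quote
        "–" => "-",   // en dash
        "—" => "-",   // em dash
        "…" => "..."  // ellipsis becomes three full stops
    );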
        
    // str_replace() accepts arrays, so all pairs are handled in one call
    $content = str_replace(array_keys($replace), array_values($replace), $content);
        
    // Remove anything outside printable ASCII (this also strips tabs and newlines)
    $content = preg_replace('/[^\x20-\x7E]+/', '', $content);

    return $content;
}
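
One thing to watch with the function above is that the curly characters are typed directly into the source file, so the replacements only match if the file and the submitted content use the same encoding. The following is a minimal sketch of a variant that spells out each character as its UTF-8 byte sequence instead. The name sanitize_from_word_utf8 is hypothetical, and the sketch assumes the content arrives as UTF-8.

```php
<?php
// Hypothetical variant of sanitize_from_word(); assumes $content is UTF-8.
function sanitize_from_word_utf8( $content )
{
    // Keys are UTF-8 byte sequences, so the map does not depend on
    // the encoding of this source file
    $replace = array(
        "\xE2\x80\x98" => "'",   // U+2018 left single quote
        "\xE2\x80\x99" => "'",   // U+2019 right single quote
        "\xE2\x80\x9C" => '"',   // U+201C left double quote
        "\xE2\x80\x9D" => '"',   // U+201D right double quote
        "\xE2\x80\x93" => "-",   // U+2013 en dash
        "\xE2\x80\x94" => "-",   // U+2014 em dash
        "\xE2\x80\xA6" => "..."  // U+2026 ellipsis
    );

    // strtr() applies every pair in a single pass over the string
    $content = strtr($content, $replace);

    // Drop whatever is still outside printable ASCII
    return preg_replace('/[^\x20-\x7E]+/', '', $content);
}

// Sample input: curly quotes, an em dash and an ellipsis
$pasted = "\xE2\x80\x9CIt\xE2\x80\x99s done\xE2\x80\x9D \xE2\x80\x94 wait\xE2\x80\xA6";
echo sanitize_from_word_utf8($pasted), "\n";
// prints: "It's done" - wait...
```

Like the original, this throws away every character outside printable ASCII, including accented letters, so it only suits content that is expected to be plain English text.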