Quick and dirty script to fix “weird characters” in VBScript

If you have ever used VBScript to pull information out of an XML file or a database, you may have noticed some strange characters showing up on your pages. From what I’ve read online, this happens when the XML file or the database is read with the wrong character set. However, I have not found a decent fix for the problem.

Everything I find online mentions using UTF-8 as your character set, but I haven’t found a concrete method for doing this. In my real-life example, I tried setting the character set within my XML file to UTF-8, but that didn’t solve the problem.

So, I got to the root of the problem myself. I figured out which characters were not displaying properly, and I decided to fix them on the back end.


Introduction

Before going any further, I will provide a little background on the real-world application I was working with.

Before I started my job, someone wrote a nifty little script for our public relations people. To post a new news release, our PR people load a Web-based form, type in the headline and the date they want the release issued, and then paste the text of the release into a textarea. The back-end VBScript processor then takes all of that information and adds it to an XML file that acts like a one-table flat-file database. Essentially, the XML file is laid out like:

<headline><title>This is my news release title</title><date>This is the release date</date><body>This is the body of the news release</body></headline>

The initial problem

A problem arose, however, when we noticed how long our list of “current” news releases was getting. The original scripter had never really built in any sort of archiving capability. The PR people had been using this script for a little over a year, and the releases just kept piling on top of each other.

Therefore, I decided to build in a little script that would pull out all of the old news releases, cut them out of the “current” XML file and generate a new XML file for monthly digests of archived releases.
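
The archiving step described above can be sketched roughly as follows. This is only an illustration of the idea, not the script I actually wrote (that one was VBScript working directly on the XML file). The function name splitReleases, the { title, date, body } object shape, the "YYYY-MM-DD" date format and the sample data are all assumptions made up for the example.

```javascript
// Hypothetical sketch: separate "current" releases from older ones and
// group the older ones by month, one bucket per monthly digest file.
// Assumes each release is { title, date, body } with date as "YYYY-MM-DD".
function splitReleases(releases, cutoffDate) {
  const current = [];
  const archives = {}; // keyed by "YYYY-MM"

  for (const release of releases) {
    // ISO-style dates compare correctly as plain strings
    if (release.date >= cutoffDate) {
      current.push(release);
    } else {
      const month = release.date.slice(0, 7);
      if (!archives[month]) archives[month] = [];
      archives[month].push(release);
    }
  }
  return { current, archives };
}

// Made-up sample data, purely for illustration
const sampleReleases = [
  { title: "Release A", date: "2020-01-15", body: "..." },
  { title: "Release B", date: "2020-01-20", body: "..." },
  { title: "Release C", date: "2020-03-01", body: "..." },
];
const split = splitReleases(sampleReleases, "2020-02-01");
console.log(split.current.length);        // 1
console.log(Object.keys(split.archives)); // [ '2020-01' ]
```

Each key in archives would then become its own digest file, and current would be written back to the main XML file.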

Disappearing text

While working on this, I decided to test the posting script that allows the PR people to post their releases. In doing so, I noticed that all of our special characters were getting stripped out by either the textarea itself, the VBScript that processes the form or the XML file that stores the information. Since our PR people are writing most of their releases in MS Word, there are a lot of special characters that get thrown in (left and right double quotes, left and right single quotes, emphasis dashes, etc.). If I continued to let the script strip all of those out, then things would really get messed up.

The first solution

I did a little testing and a little searching, and I found out that all of those characters were disappearing when the content of the textarea was added to the Request.Form collection. Therefore, I was going to have to fix the problem before the VBScript got a hold of the information. So, I wrote a quick little JavaScript function to stop those characters from getting stripped out. That script is shown below:
<script type="text/javascript" language="javascript">
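// Convert the punctuation MS Word inserts into HTML entities so it
// survives the form post. "what" is the textarea element.
function htmlEnt(what) {
	var str = what.value;
	// The g flag replaces every occurrence, not just the first one
	str = str.replace(/“/g,"&ldquo;");
	str = str.replace(/”/g,"&rdquo;");
	str = str.replace(/‘/g,"&lsquo;");
	str = str.replace(/’/g,"&rsquo;");
	str = str.replace(/–/g,"&ndash;");
	str = str.replace(/—/g,"&mdash;");
	str = str.replace(/-/g,"&#45;");
	what.value = str;
}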
</script>

That fixed it. Now, none of my special characters were getting stripped out. Huzzah!

The resulting problem

The bad news, however, is that those special characters were not being translated properly when I pulled them back out of the XML file. Sometimes they would work properly, but most of the time each of them was displayed on the screen as a string of very strange characters. The characters affected included:

Double-quotes (that should say “Ski”)

Short emphasis dash (the one that’s created with &ndash;)

The right single quote that MS Office generates when you type an apostrophe
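
For what it’s worth, one common way to end up with output like this is for text stored as UTF-8 to be read back as Windows-1252. I never confirmed that this is exactly what happened in my case, so treat the following as a sketch of that assumption only. The function name misreadUtf8AsCp1252 and the lookup string are made up for the example.

```javascript
// Windows-1252 characters for bytes 0x80-0x9F. The five unassigned
// slots (0x81, 0x8D, 0x8F, 0x90, 0x9D) keep their control code.
const CP1252_HIGH =
  "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021" +
  "\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F" +
  "\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
  "\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178";

// Encode text as UTF-8, then decode each byte as if it were Windows-1252.
function misreadUtf8AsCp1252(text) {
  const bytes = new TextEncoder().encode(text);
  let out = "";
  for (const b of bytes) {
    out += (b >= 0x80 && b <= 0x9f)
      ? CP1252_HIGH[b - 0x80]
      : String.fromCharCode(b); // other bytes map straight across
  }
  return out;
}

console.log(misreadUtf8AsCp1252("\u2019")); // right single quote -> â€™
console.log(misreadUtf8AsCp1252("\u2013")); // en dash -> â€“
```

Under that assumption, every curly quote or dash turns into three odd-looking characters, because each one takes three bytes in UTF-8.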

The final solution
No, I’m not making any sort of reference to death, etc.

So, I decided to fix that problem the same way I fixed my original problem. This time, however, I didn’t have to rely on JavaScript. I could use VBScript to perform the translation before the page loaded. So, I wrote the following quick little function.

Function htmlEnt(ByVal strReplace)
	' ByVal keeps the caller's variable unchanged (VBScript defaults to ByRef)
	' Swap each MS Office character for its HTML entity
	strReplace = Replace(strReplace,"“","&ldquo;")
	strReplace = Replace(strReplace,"”","&rdquo;")
	strReplace = Replace(strReplace,"’","&rsquo;")
	strReplace = Replace(strReplace,"—","&mdash;")
	strReplace = Replace(strReplace,"–","&ndash;")
	strReplace = Replace(strReplace,"-","&#45;")
	strReplace = Replace(strReplace,"…","&hellip;")
	' Undo double-escaped numeric entities such as &amp;#45;
	strReplace = Replace(strReplace,"&amp;#","&#")
	htmlEnt = strReplace
End Function

That function picks through the text you pass to it and replaces the special characters that MS Office applications generate automatically with their HTML entity equivalents.

Conclusion

I’m sure there’s probably a more efficient way to do this (through regular expressions, etc.). Heck, there might even be a pre-built VBScript function/subroutine to do this for you (like URLEncode, etc.). I was unable to find anything, though, so I rolled my own. Hopefully, if you come across this problem in the future, it will help you, too.
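
As a rough idea of what the regular-expression approach might look like, here is a minimal sketch in JavaScript (not VBScript). It uses one lookup table and a single pass over the text. The names ENTITIES and encodeWordPunctuation are made up for this example, and the sketch deliberately leaves out the hyphen and &amp;# replacements from my function above.

```javascript
// Lookup table: MS Office punctuation -> HTML entity
const ENTITIES = {
  "\u201C": "&ldquo;",  // left double quote
  "\u201D": "&rdquo;",  // right double quote
  "\u2018": "&lsquo;",  // left single quote
  "\u2019": "&rsquo;",  // right single quote / apostrophe
  "\u2013": "&ndash;",  // en dash
  "\u2014": "&mdash;",  // em dash
  "\u2026": "&hellip;", // ellipsis
};

// One character class built from the table keys, matched globally
const PATTERN = new RegExp("[" + Object.keys(ENTITIES).join("") + "]", "g");

function encodeWordPunctuation(text) {
  // Each match is looked up in the table, so the text is scanned once
  return text.replace(PATTERN, (ch) => ENTITIES[ch]);
}

console.log(encodeWordPunctuation("\u201CSki\u201D \u2013 it\u2019s on\u2026"));
// &ldquo;Ski&rdquo; &ndash; it&rsquo;s on&hellip;
```

Adding another character then only means adding one line to the table.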

If you think of any special characters I might have missed, please let me know. If you come across a more efficient way to do this, let me know. Thanks.


3 Responses

  • Peter

    Thanks for the function — so simple, but useful. Here’s a character that you missed:

    strReplace = Replace(strReplace,"‘","&lsquo;")

  • Shoggie

    What also REALLY makes a file, for example, UTF-8 is to also SAVE IT in that format.
    Easy method is notepad!

    • True, but if your server and/or content management system isn’t set up to serve files in UTF-8, saving the file in UTF-8 (or, in this case, I was more referring to databases than flat files) won’t do you much good. You have to work with what you have.