Removing Odd Characters from Strings.
Okay, so I ran into an issue on a project at work: an internal web service was returning driving directions as one large string of text that had apparently been copied and pasted out of a Word document. The problem was that non-ASCII Unicode characters (a tiny rectangle standing in for a list-item bullet, for example) were scattered all through the text. Just imagine a blurb of about 2000 characters without a single bit of formatting in it. I figured someone had to have had this problem at some point or another, so I set out to find some code, and I ran into exactly what I was looking for:
The blog post entitled "A .NET Unicode Puzzle (Revised)" had the answers I sought. Below is an example of the method that I ended up using in my solution.
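// Requires these using directives; the method itself goes inside a class.
using System;
using System.Text;

public static string RemoveUnicode(string s)
{
    if (s == null)
    {
        return s;
    }
    try
    {
        // FormKD splits accented letters and ligatures into base characters
        // plus combining marks, so the base characters survive the ASCII pass.
        string normalized = s.Normalize(NormalizationForm.FormKD);
        // An empty replacement fallback drops anything ASCII cannot represent
        // instead of substituting '?'.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));
        byte[] encodedBytes = ascii.GetBytes(normalized);
        return ascii.GetString(encodedBytes);
    }
    catch (ArgumentException)
    {
        // Normalize rejects invalid Unicode (e.g. an unpaired surrogate);
        // in that case hand back the input unchanged.
        return s;
    }
}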
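To see what the method actually does to different kinds of characters, here is a small test program I would sketch out. The class name RemoveUnicodeDemo, the helper name ToAscii, and the sample strings are all made up for illustration; ToAscii is just a compact stand-in for the RemoveUnicode method above, without the error handling.

```csharp
using System;
using System.Text;

public static class RemoveUnicodeDemo
{
    // Hypothetical compact helper: normalize to FormKD, then round-trip
    // through ASCII with an empty replacement fallback.
    public static string ToAscii(string s)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));
        string normalized = s.Normalize(NormalizationForm.FormKD);
        return ascii.GetString(ascii.GetBytes(normalized));
    }

    public static void Main()
    {
        // U+00E9 decomposes to 'e' plus a combining accent; only the accent is dropped.
        Console.WriteLine(ToAscii("caf\u00E9"));        // cafe
        // U+2022 (bullet) has no decomposition, so it is removed entirely.
        Console.WriteLine(ToAscii("\u2022 Turn left")); // " Turn left" (leading space stays)
        // U+FB01 (the "fi" ligature) expands to the two letters 'f' and 'i'.
        Console.WriteLine(ToAscii("\uFB01rst exit"));   // first exit
    }
}
```

The thing to notice is that the normalization step matters: without it, the accented letter and the ligature would disappear completely instead of falling back to their plain ASCII letters.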