non-ASCII characters

I do a lot of web scraping. Sometimes I need to send the data to other people, and the fun non-ASCII characters I scrape will really freak other applications out. I needed a quick and dirty way to screen out non-ASCII characters. Enter regular expressions.
I’ve had a fondness for regexp since I first learned Perl. And my current language of choice implements all the goodness of Perl regexp.
The pattern is this simple:
/[\x20-\x7E]/ # printable ASCII range, space through tilde
To match every character outside this range instead, negate the character class by putting a ^ right after the opening bracket.

My filter function simply states:
text = text.gsub(/[^\x20-\x7E]/, '?') # I like ruby
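
Here is a minimal sketch of that one-liner wrapped in a reusable method. The name ascii_only and the replacement parameter are my own, for illustration only. The scrub call is an extra step I am assuming is useful for scraped data, since gsub raises an error on strings containing invalid byte sequences. Note that the range starts at \x20, so tabs and newlines get replaced too.

```ruby
# Hypothetical helper, not part of any library.
# Replaces every character outside printable ASCII (space through tilde).
def ascii_only(text, replacement = '?')
  # Assumption: scraped input may contain invalid bytes; scrub swaps them
  # for the replacement first so gsub does not raise.
  clean = text.scrub(replacement)
  clean.gsub(/[^\x20-\x7E]/, replacement)
end

puts ascii_only("caf\u00E9 au lait")   # prints "caf? au lait"
p ascii_only("na\u00EFve\n")           # prints "na?ve?" (the newline is replaced as well)
```

If you want to keep line breaks and tabs, one option is to add them to the character class, for example /[^\x20-\x7E\t\n\r]/.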

This trick was found on another Rails blog post which talked about using regex to enforce good passwords.
