I do a lot of web scraping. Sometimes I need to send the data to other people, and the fun non-ASCII characters I scrape will really freak other applications out. I needed a quick and dirty way to just screen out non-ASCII characters. Enter Regular Expressions.
I’ve had a fondness for regexp since I first learned Perl. And my current language of choice implements all the goodness of Perl regexp.
The pattern is this simple:
\x20-\x7E # printable ASCII range, space through tilde
To filter out all the characters outside this range, simply put the range inside brackets and negate it with a leading ^:
[^\x20-\x7E]
My filter function simply states:
text = text.gsub(/[^\x20-\x7E]/, '?') # I like ruby
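Here is a minimal sketch of that one-liner wrapped in a reusable method. The method name `ascii_only` and the `replacement` parameter are my own placeholders for illustration, not something from the original code. It assumes the input is a validly encoded string (UTF-8 in this example).

```ruby
# Hypothetical helper: the name and the optional replacement argument
# are assumptions made for this example.
def ascii_only(text, replacement = '?')
  # Swap every character outside printable ASCII (space..tilde)
  # for the replacement string.
  text.gsub(/[^\x20-\x7E]/, replacement)
end

# "\u00E9" is e-acute and "\u2603" is a snowman, written as escapes
# so the example does not depend on the file's encoding.
puts ascii_only("caf\u00E9 \u2603")   # prints "caf? ?"

# Control characters sit below \x20, so newlines and tabs get replaced too.
puts ascii_only("line1\nline2")       # prints "line1?line2"
```

Note that the range starts at the space character, so tabs and line breaks are screened out along with everything else. If you want to keep them, one option is to add them to the class, for example /[^\x20-\x7E\t\r\n]/.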
I found this trick in another Rails blog post, which talked about using regex to enforce good passwords.