About character encoding: why do we need encoding?
Classic character sets such as ASCII (often loosely called ANSI) cover only a small set of standard English characters. Other characters need a larger encoding, which in effect extends ASCII by using more than one byte to represent non-ASCII characters such as the copyright sign.
Different encoding standards
UTF-8 is supposed to be the standard encoding, but for Chinese text GBK/GB2312 is still widely used because it is more compact. UTF-8 uses 1 to 4 bytes per character (the original design allowed up to 6). Most Chinese characters need 3 bytes in UTF-8, while GBK takes only 2.
Given a bit stream like '111000111000', GBK would treat it as 11|10|00|11|10|00 while UTF-8 would treat it as 111|000|111|000 (a simplified picture, real encodings work on whole bytes). Knowing the right encoding of a string is a must to avoid garbled display.
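To make the size difference concrete, here is a minimal sketch, assuming Ruby 1.9 or later (where strings carry an encoding and String#encode is available). It encodes the same two characters both ways and counts the bytes; the variable names are only for illustration.

```ruby
# "\u4e2d\u6587" is the two-character string used in the examples of this post,
# stored as UTF-8.
utf8 = "\u4e2d\u6587"
gbk  = utf8.encode("GBK")

utf8_size = utf8.bytesize   # 3 bytes per character in UTF-8
gbk_size  = gbk.bytesize    # 2 bytes per character in GBK

# Reading the GBK bytes as if they were UTF-8 does not give valid text.
misread = gbk.dup.force_encoding("UTF-8")
misread_valid = misread.valid_encoding?

puts "UTF-8: #{utf8_size} bytes, GBK: #{gbk_size} bytes"
puts "GBK bytes valid as UTF-8? #{misread_valid}"
```

The last check is the "messy display" case: the bytes are fine, they are just being read with the wrong encoding.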
Unicode
When developers write a character by its code point in source code, they usually use the prefix \u (or U+) followed by the hex code. This is what a Unicode escape looks like.
For example, "中文" and "\u4e2d\u6587" are the same thing; the latter just uses the Unicode escape representation.
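A quick way to see the code points behind a string, sketched for Ruby 1.9 or later, where double-quoted strings understand \u escapes:

```ruby
text = "\u4e2d\u6587"   # the two-character example string

# unpack("U*") lists the Unicode code point of each character.
code_points = text.unpack("U*")
hex_points  = code_points.map { |c| "%04x" % c }

puts hex_points.inspect
puts text.length     # number of characters
puts text.bytesize   # number of bytes when stored as UTF-8
```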
Conversion
Ruby 1.9 supports Unicode to UTF-8 conversion with a single Iconv call:
Iconv.iconv("utf-8","unicode",escaped)
In Ruby 1.8, there are a few solutions.
Solution 1) Use the JSON library:
escaped = "\\u4e2d\\u6587"
JSON.parse( %Q{["#{escaped}"]} )[0].should == "中文"
Solution 2) Convert manually:
escaped = "\\u4e2d\\u6587"
unicode_utf8(escaped).should == "中文"
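require 'cgi'

# Converts \uXXXX escapes to UTF-8 by assembling each byte's bit pattern by hand.
def unicode_utf8(unicode_string)
  unicode_string.gsub(/\\u[0-9a-fA-F]{4}/) do |s|
    str = s.sub(/\\u/, "").hex.to_s(2)  # code point as a binary string
    if str.length < 8
      # ASCII range: a single byte
      CGI.unescape("%%%02x" % str.to_i(2))
    else
      # Split into 6-bit groups from the right; the leftover bits go in the lead byte
      arr = str.reverse.scan(/\w{1,6}/).reverse.map { |b| b.reverse }
      # The lead byte of an n-byte sequence has room for only 7 - n payload bits
      arr.unshift("") if arr.first.length > 7 - arr.length
      hex = lambda do |i|
        byte = i.zero? ? "1" * arr.length + "0" * (8 - arr.length - arr[i].length) + arr[i] : "10" + arr[i]
        "%%%02x" % byte.to_i(2)
      end
      CGI.unescape((0...arr.length).map(&hex).join)
    end
  end
end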
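For comparison, the same conversion can lean on Array#pack, whose "U" directive turns a code point into its UTF-8 bytes. This is only a sketch: unescape_unicode is a made-up name, and it does not combine surrogate pairs.

```ruby
# Hypothetical helper: replace each \uXXXX escape with the UTF-8 character
# for that code point. Surrogate pairs are not combined.
def unescape_unicode(escaped)
  escaped.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
end

puts unescape_unicode("\\u4e2d\\u6587")
```

Text without escapes passes through unchanged, so the helper can be applied to mixed input.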
Encoding in JSON
JSON has no HEAD section, so there is nowhere to put a charset meta declaration. Writing non-ASCII characters as Unicode escapes is therefore recommended; otherwise the client may not know how to display the text. With the JSON library for Ruby, this is done by turning on the ascii_only option.
json_string = JSON.fast_generate(@sut,
:ascii_only => true
)
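A self-contained sketch of the same option, using a plain Hash in place of @sut (the record variable is an assumption made for the example):

```ruby
require 'json'

record = { "title" => "\u4e2d\u6587" }   # stands in for @sut

# With :ascii_only every non-ASCII character is written as a \u escape,
# so the output no longer depends on the charset the client assumes.
json_string = JSON.generate(record, :ascii_only => true)

puts json_string
puts JSON.parse(json_string)["title"] == record["title"]
```

Parsing the escaped output gives back the original UTF-8 string, so nothing is lost by escaping.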
The other way (without JSON) to get Unicode escapes from a UTF-8 string in Ruby 1.8:
p "\\u"+@sut.title.unpack("U*").map{|c|"%04x" %c}.join("\\u")
Complete code demo:
it "should convert different encoding" do
  @sut.title = "中文"
  unicoded_title = "\\u4e2d\\u6587"

  # UTF-8 to Unicode escapes
  utf8_to_unicode(@sut.title).should == unicoded_title

  # The JSON library escapes non-ASCII characters when ascii_only is on
  json_string = JSON.fast_generate(@sut,
    :ascii_only => true
  )
  JSON.parse(json_string)['title'].should == @sut.title

  # Unicode escapes back to UTF-8, via JSON and via the manual method
  JSON.parse( %Q{["#{unicoded_title}"]} )[0].should == @sut.title
  unicode_to_utf8(unicoded_title).should == @sut.title
end
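require 'cgi'

# Converts \uXXXX escapes to UTF-8 by assembling each byte's bit pattern by hand.
def unicode_to_utf8(unicode_string)
  unicode_string.gsub(/\\u[0-9a-fA-F]{4}/) do |s|
    str = s.sub(/\\u/, "").hex.to_s(2)  # code point as a binary string
    if str.length < 8
      # ASCII range: a single byte
      CGI.unescape("%%%02x" % str.to_i(2))
    else
      # Split into 6-bit groups from the right; the leftover bits go in the lead byte
      arr = str.reverse.scan(/\w{1,6}/).reverse.map { |b| b.reverse }
      # The lead byte of an n-byte sequence has room for only 7 - n payload bits
      arr.unshift("") if arr.first.length > 7 - arr.length
      hex = lambda do |i|
        byte = i.zero? ? "1" * arr.length + "0" * (8 - arr.length - arr[i].length) + arr[i] : "10" + arr[i]
        "%%%02x" % byte.to_i(2)
      end
      CGI.unescape((0...arr.length).map(&hex).join)
    end
  end
end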
def utf8_to_unicode(string) # :nodoc:
  '\\u' + string.unpack("U*").map { |c| "%04x" % c }.join('\\u')
end
About GBK
When fetching a GBK-encoded web page in Ruby with net/http, the characters sometimes come out garbled. The same happens with curl.
Switching to wget to save the page to a file, then parsing the file, works fine.
I don't know why wget deals better with different encodings.
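One possible explanation, and it is only a guess, is that the response body arrives as raw GBK bytes and is then displayed as if it were UTF-8. If that is the cause, a sketch like the following (Ruby 1.9 or later; gbk_to_utf8 is a made-up helper name) re-labels the bytes as GBK and transcodes them:

```ruby
# Hypothetical helper: treat raw bytes as GBK and transcode them to UTF-8.
# Bytes that are not valid GBK are replaced with "?" instead of raising.
def gbk_to_utf8(raw_bytes)
  raw_bytes.dup.force_encoding("GBK").encode(
    "UTF-8", :invalid => :replace, :undef => :replace, :replace => "?"
  )
end

# D6 D0 CE C4 is the example string in GBK; a body fetched with net/http
# would go here instead.
body = "\xD6\xD0\xCE\xC4"
puts gbk_to_utf8(body)
```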