Universal Encoding Detector - 編碼偵測(Python)

什麼時候我們需要做「編碼偵測」的動作呢？最明顯的例子不外乎就是「瀏覽器」~ 假設我們的網頁沒附上「<meta http-equiv="Content-Type" content="text/html;charset=utf-8">」這樣的字句~ 那Browser還能有足夠的能力偵測此網頁是用何種編碼的嗎？

再舉另一個例子~ 當我們寫了一個Crawler來爬行網頁的同時~ 在下載這些網頁之後~ 我們又該如何得知這些網頁的編碼呢？

所以~ 「編碼偵測」算是處理文字資訊前的必要動作~ 而「Universal Encoding Detector」就提供了一個這麼好的工具~ 當然也是給它Open Source的嚕~ 不過這是針對Python語言的~ 當然也還有其它的解決方案~ 就請參考相關資源!

Universal Encoding Detector

「Universal Encoding Detector」目前的版本是1.0.1版~ 而在使用它之前必須先安裝在你的電腦~

下載：chardet-1.0.1.tgz

安裝過程如下：

tar zxvf chardet-1.0.1.tgz
cd chardet-1.0.1
setup.py build
setup.py install

接著就給它寫一個簡單的測試程式：

import urllib2, chardet

if __name__ == "__main__":
	urlread = lambda url: urllib2.urlopen(url).read()
	running = True
	while running:
		str = raw_input('Please enter a url: ')
		if str == 'q':
			running = False
		else:
			print chardet.detect(urlread(str))
	else:
		print 'Done'

測試結果：

Please enter a url: http://www.google.com.cn
{'confidence': 0.98999999999999999, 'encoding': 'GB2312'}
Please enter a url: http://blog.ring.idv.tw
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}
Please enter a url: http://www.cnn.com
{'confidence': 1.0, 'encoding': 'ascii'}
Please enter a url: q
Done

相關資源

．中文編碼偵測 || William's Blog

．A composite approach to language/encoding detection

．大步向前走: Programming 自動偵測編碼

．Shared Development: Character encoding detection

．Java port of Mozilla charset detector

．cpdetector, free java code page detection.

2008-09-29 03:11:15

Universal Encoding Detector - 編碼偵測(Python)

Universal Encoding Detector - 編碼偵測(Python)

Leave a Comment

::: 搜尋 :::

::: 分類 :::

::: 最新文章 :::

::: 最新回應 :::

::: 訂閱 :::