Category Archives: python

reading utf-8 file in Python

import codecs
fp = codecs.open(fileName, "r", "utf-8")
fp.read()
* http://evanjones.ca/python-utf8.html
* http://www.jorendorff.com/articles/unicode/python.html
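
On Python 3 the built-in open() takes an encoding argument directly, so the same read can be sketched without codecs. The helper name read_utf8 and the temporary sample file are assumptions made for illustration only.

```python
import os
import tempfile


def read_utf8(file_name):
    """Return the whole file decoded as UTF-8 text."""
    # The context manager closes the file even if read() raises.
    with open(file_name, "r", encoding="utf-8") as fp:
        return fp.read()


# Write a small sample file so the example runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as out:
    out.write("caf\u00e9 \u4e2d\u6587")

text = read_utf8(sample_path)
print(len(text))
```

The result is a str of code points, not bytes, so len() counts characters rather than encoded bytes.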

feedparser.text content type

I need to change this line from
true_encoding = http_encoding or 'us-ascii'
to
true_encoding = http_encoding or xml_encoding or 'us-ascii'
to handle buggy sites that don’t obey the standard: they set the content type to text/* without a charset and declare the encoding only in the XML file.
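
The fallback order can be tried in isolation. This is only a sketch of the `or` chain, not feedparser's code; the function name pick_encoding and its default parameter are hypothetical.

```python
def pick_encoding(http_encoding, xml_encoding, default="us-ascii"):
    """Return the first encoding that is set, else the default."""
    # `or` skips None and empty strings, so the HTTP charset wins when
    # present, then the XML declaration, then the default.
    return http_encoding or xml_encoding or default


print(pick_encoding(None, "utf-8"))
print(pick_encoding("iso-8859-1", "utf-8"))
print(pick_encoding(None, None))
```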

feedparser.whitespace

According to the XML spec (http://www.w3.org/TR/REC-xml/#NT-EncodingDecl), whitespace is allowed around the equals sign of the encoding declaration. Here is a simple patch:
--- /usr/ports/textproc/py-feedparser/work/feedparser/feedparser.py.old Sat Jul  2 16:17:11 2005
+++ /usr/ports/textproc/py-feedparser/work/feedparser/feedparser.py     Sat Jul  2 16:18:25 2005
@@ -2101,7 +2101,7 @@
         else:
             # ASCII-compatible
             pass
-        xml_encoding_match = re.compile('^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
+        xml_encoding_match = re.compile('^<\?.*encoding\s*=\s*[\'"](.*?)[\'"].*\?>').match(xml_data)
     except:
         xml_encoding_match = None
     if xml_encoding_match:
I’ve sent this patch to the […]
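
A quick way to check a pattern like this is to run it against declarations with and without spaces around the equals sign. The sketch below uses \s* (zero or more whitespace characters) so that both forms match; the function name declared_encoding and the sample strings are made up for illustration.

```python
import re

# \s* allows zero or more whitespace characters on each side of "=".
XML_ENCODING = re.compile(r"""^<\?.*encoding\s*=\s*['"](.*?)['"].*\?>""")


def declared_encoding(xml_data):
    """Return the encoding named in the XML declaration, or None."""
    match = XML_ENCODING.match(xml_data)
    return match.group(1) if match else None


print(declared_encoding('<?xml version="1.0" encoding="utf-8"?>'))
print(declared_encoding('<?xml version="1.0" encoding = "gb2312" ?>'))
print(declared_encoding('<?xml version="1.0"?>'))
```

Note that \s on its own matches exactly one whitespace character, so a pattern written with \s=\s would stop matching the common form without spaces.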

feedparser.encoding

It looks like feedparser was written for Python 2.3. As of Python 2.4, CJKCodecs is included in the official release, so the line
import cjkcodecs.aliases
should be changed to
import encodings.aliases
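
To see that the CJK codecs and their aliases ship with the standard library, a small check like the following can help. It is a Python 3 sketch; the particular codec names tried here are just examples.

```python
import codecs
import encodings.aliases

# Alias table from the standard library: alias -> codec module name.
aliases = encodings.aliases.aliases
print(aliases["ujis"])

# CJK codecs can be looked up without any third-party package.
for name in ("gb2312", "big5", "euc_jp", "euc_kr"):
    print(name, "->", codecs.lookup(name).name)

# Round-trip a character through one of them.
round_trip = "\u4e2d".encode("gb2312").decode("gb2312")
```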

quixote session

http://darcs.idyll.org/~t/projects/quixote2-tutorial/advanced/sessions.html
http://quixote.ca/qx/StoringSessionsInDatabase
http://ksenia.nl/code/quixote_sql_sessions_06.tgz

Challenge 5

Well, this is not an easy one.
The unpickling is as easy as it should be:
import pickle
f = open("banner.p", "r")
x = pickle.load(f)
But after that I was lost. x is a list of 23 lists; each item is made up of one or more tuples, and each tuple has two elements: one " " or "#", and one […]

Challenge 6

Well, first I need to figure out what to zip. I took another look at the HTML source and found something unusual. The hint looks like this:
<html> <!-- <-- zip -->
There’s an extra "<--" in the HTML comment. I believe this is pointing me to what to zip. But after I tried zip "html", "<html", the […]

Challenge 4

Following yesterday’s idea, here is my first script:
import urllib
import re

def followNothing(url):
    page = urllib.urlopen(url)
    pagecontent = page.read()
    print pagecontent
    result = re.search("[0-9]{5}", pagecontent)
    if result:
        num = pagecontent[result.start():result.end()]
        nexturl = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=" + num
        followNothing(nexturl)
    else:
        print "fix the script!"
And I got:
In [12]: followNothing("http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345")
and the next nothing is 92512
and the next nothing is 64505
<font color=red>Your hands are getting tired </font>and the next nothing is 50010
and the next […]
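
Since every hop adds a stack frame, a long chain could run into Python's recursion limit. Below is a loop-based Python 3 sketch of the same idea. It keeps the five-digit pattern from the script; the fetch parameter, max_hops, and the fake_pages dictionary are assumptions added so the example runs offline.

```python
import re

BASE_URL = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="


def follow_nothing(start, fetch, max_hops=500):
    """Follow the chain from `start`; return (last_number, last_page)."""
    num = start
    for _ in range(max_hops):
        page = fetch(BASE_URL + num)
        # Same pattern as the script: the first run of five digits.
        result = re.search("[0-9]{5}", page)
        if not result:
            # No number on this page: stop and let a human read it.
            return num, page
        num = result.group(0)
    raise RuntimeError("gave up after %d hops" % max_hops)


# Stand-in for the network: a hypothetical three-page chain.
fake_pages = {
    BASE_URL + "12345": "and the next nothing is 92512",
    BASE_URL + "92512": "and the next nothing is 64505",
    BASE_URL + "64505": "no number on this page",
}
last_num, last_page = follow_nothing("12345", fake_pages.__getitem__)
print(last_num, last_page)
```

For a real run, fetch could be a small wrapper around urllib.request.urlopen(url).read().decode(); that part is left out here so nothing touches the network.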

Challenge 3

Well, it said EXACTLY three big bodyguards, so the first try is like this:
import re
target = re.compile(r"[!A-Z][A-Z]{3}([a-z])[A-Z]{3}[!A-Z]")
f = open("mess.txt", "r")
myout = target.findall(f.read())
And I got this:
In [7]: print myout
['e', 'i', 'z', 'r', 'v', 'g', 'b', 'm', 's', 'z', 'u', 'o', 'w', 'i', 'b', 'z', 'b', 'c', 'j', 'g', 'd', 'o', 'h', 'l', 'z', 'x', 'j', 'v', […]
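
One thing worth noting about the pattern: inside a character class, ! is a literal exclamation mark, while ^ at the start negates the class. So [!A-Z] matches "!" or a capital letter, which asks for four capitals on each side rather than exactly three. The sketch below compares the two forms on a made-up string; the sample text and variable names are assumptions, not taken from mess.txt.

```python
import re

# As written in the first try: "!" is just another allowed character.
first_try = re.compile(r"[!A-Z][A-Z]{3}([a-z])[A-Z]{3}[!A-Z]")
# With "^" the class means "anything except a capital letter".
negated = re.compile(r"[^A-Z][A-Z]{3}([a-z])[A-Z]{3}[^A-Z]")

# "d" has exactly three capitals on each side, "e" has four.
sample = "xABCdEFGy QABCeFGHQ"

print(first_try.findall(sample))
print(negated.findall(sample))
```

The negated form needs one non-capital character on each side, so a match at the very start or end of the text would be missed; lookarounds would be one way around that.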

Challenge 2

First of all I checked the library to see if there’s an OCR module. It looks like there isn’t, at least not in the standard distribution.
OK, it says to find the rare chars, so here is my first try:
f = open("mess.txt", "r")
ss = {}
for c in f.read():
    if ss.has_key(c):
        ss[c] += 1
    else:
        ss[c] = 1
And here is the output:
In [6]: ss
Out[6]:
{'\n': […]
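
The same counting can be sketched in Python 3 with collections.Counter, which replaces the has_key bookkeeping. The function name rare_chars, the max_count threshold, and the sample string are assumptions for illustration; what counts as "rare" in mess.txt is not decided here.

```python
from collections import Counter


def rare_chars(text, max_count=1):
    """Return the characters occurring at most `max_count` times, in order."""
    counts = Counter(text)
    # Walk the text again so the rare characters keep their original order.
    return "".join(c for c in text if counts[c] <= max_count)


sample = "##a#%%b%##c%"
print(Counter(sample).most_common())
print(rare_chars(sample))
```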
