I want to parse my XML documents, which I have stored using the model below:
    class XMLdocs(db.Expando):
        id = db.IntegerProperty()
        name = db.StringProperty()
        content = db.BlobProperty()
Below is my code:
    parser = make_parser()
    curHandler = BasketBallHandler()
    parser.setContentHandler(curHandler)
    for q in XMLdocs.all():
        parser.parse(StringIO.StringIO(q.content))
I am getting the error below:
    'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
    Traceback (most recent call last):
      File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 517, in __call__
        handler.post(*groups)
      File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/base_handler.py", line 59, in post
        self.handle()
      File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 168, in handle
        scan_aborted = not self.process_entity(entity, ctx)
      File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 233, in process_entity
        handler(entity)
      File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 71, in process
        parser.parse(StringIO.StringIO(q.content))
      File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
        xmlreader.IncrementalParser.parse(self, source)
      File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
        self.feed(buffer)
      File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
        self._parser.Parse(data, isFinal)
      File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 136, in characters
        print ch
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
The actual best answer for this problem depends on your environment, specifically what encoding your terminal expects.
The quickest one-line solution is to encode everything you print to ASCII, which your terminal is almost certain to accept, while discarding characters that you cannot print:
    print ch  # fails
    print ch.encode('ascii', 'ignore')
The better solution is to change your terminal’s encoding to utf-8, and encode everything as utf-8 before printing. You should get in the habit of thinking about your unicode encoding EVERY time you print or read a string.
Calling .encode('utf-8') on the object before printing it will do the job in recent versions of Python.
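As a rough illustration of how these options differ, the sketch below encodes a short sample string containing the character from the traceback. The sample word and the variable names are made up for the example, and the code is written so that it also runs on Python 3.

```python
# Sketch: how encode() treats a character that ASCII cannot represent.
# 'sample' is a made-up string; u'\xef' is the character from the traceback.
sample = u'na' + u'\xef' + u've'

# 'ignore' silently drops the character.
ascii_ignored = sample.encode('ascii', 'ignore')

# 'replace' substitutes a question mark, so the loss stays visible.
ascii_replaced = sample.encode('ascii', 'replace')

# UTF-8 can represent the character, so nothing is lost.
utf8_bytes = sample.encode('utf-8')

# Without an error handler, the ASCII codec raises the error from the question.
try:
    sample.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The 'ignore' handler is the quickest fix but hides data loss, which is why encoding to UTF-8 is the better choice whenever the output destination accepts it.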
It seems you are hitting a UTF-8 byte order mark (BOM). Try parsing a unicode string with the BOM stripped out:
    import codecs
    content = unicode(q.content.strip(codecs.BOM_UTF8), 'utf-8')
    parser.parse(StringIO.StringIO(content))
Note that I used strip instead of lstrip because in your case you had multiple occurrences of the BOM, possibly due to concatenated file contents.
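One thing to keep in mind is that strip treats its argument as a set of bytes to remove from both ends, not as a sequence. A minimal sketch of a stricter alternative follows, assuming the BOMs only appear at the start of the content. The helper decode_without_bom and the sample payload are hypothetical names for illustration, not part of the original code.

```python
import codecs

def decode_without_bom(raw):
    """Decode UTF-8 bytes after dropping every BOM at the start.

    Hypothetical helper: only whole BOM sequences are removed, and only
    from the beginning of the data.
    """
    while raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode('utf-8')

# Made-up payload standing in for q.content.
payload = u'<team>Lakers</team>'.encode('utf-8')

# Two BOMs in a row, as can happen when files are concatenated.
text = decode_without_bom(codecs.BOM_UTF8 * 2 + payload)

# For a single leading BOM, the standard 'utf-8-sig' codec does the same job.
single = (codecs.BOM_UTF8 + payload).decode('utf-8-sig')
```

Neither approach removes a BOM that sits in the middle of the data, so concatenated documents may still need to be split before parsing.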
This worked for me:
    from django.utils.encoding import smart_str
    content = smart_str(content)
The problem, according to your traceback, is the print statement on line 136 of parseXML.py. Unfortunately you didn’t see fit to post that part of your code, but I’m going to guess it is just there for debugging. If you change it to:
    print repr(ch)
then you should at least see what you are trying to print.
The problem is that you’re trying to print a unicode character to a possibly non-unicode terminal. You need to encode it with the 'replace' option before printing it, e.g. print ch.encode(sys.stdout.encoding, 'replace').
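A small sketch of that idea, with one added assumption: sys.stdout.encoding can be None or missing (for example when output is piped or replaced by another stream object), so the code falls back to ASCII in that case. The helper name printable is hypothetical.

```python
import sys

def printable(text, encoding=None):
    """Encode text for output, replacing characters the encoding lacks.

    Hypothetical helper: when no encoding is passed and sys.stdout does
    not report one, ASCII is assumed.
    """
    if encoding is None:
        encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'replace')

# The character from the traceback, encoded for two different targets.
ascii_safe = printable(u'\xef', 'ascii')  # a question mark stands in for it
utf8_safe = printable(u'\xef', 'utf-8')   # the character survives as UTF-8 bytes
```

On Python 2 the returned byte string can be passed straight to print, for example print printable(ch).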
An easy way to overcome this problem is to set your default encoding to utf8. The following is an example:
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')