[Solved] Use lxml to parse text file with bad header in Python

I would like to parse text files (stored locally) with lxml’s etree. But all of my files (thousands) have headers, such as:

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: [email protected]
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 AHxm/u6lqdt8X6gebNqy9afC2kLXg+GVIOlG/Vrrw/dTCPGwM15+hT6AZMfDSvFZ
 YVPEaPjyiqB4rV/GS2lj6A==

<SEC-DOCUMENT>0001193125-07-200376.txt : 20070913
<SEC-HEADER>0001193125-07-200376.hdr.sgml : 20070913
<ACCEPTANCE-DATETIME>20070913115715
ACCESSION NUMBER:       0001193125-07-200376
CONFORMED SUBMISSION TYPE:  10-K
PUBLIC DOCUMENT COUNT:      7
CONFORMED PERIOD OF REPORT: 20070630
FILED AS OF DATE:       20070913
DATE AS OF CHANGE:      20070913

and the first < isn’t until line 51 in this case (and it isn’t always line 51). The XML portion starts as follows:

</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
<HTML><HEAD>
<TITLE>Form 10-K</TITLE>
</HEAD>
 <BODY BGCOLOR="WHITE">
<h5 align="left"><a href="#toc">Table of Contents</a></h5>

Can I handle this on-the-fly with lxml? Or should I use a stream editor to omit each file’s header? Thanks!

Here is my current code and error.

from lxml import etree
f = etree.parse('temp.txt')

XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Edit:

FWIW, here is a link to the file.

Solution #1:

Given that there’s a standard for these files, it’s possible to write a proper parser rather than guessing at things, or hoping beautifulsoup gets things right. That doesn’t mean it’s the best answer for you, but it’s certainly worth looking at.

According to the standard at http://www.sec.gov/info/edgar/pdsdissemspec910.pdf what you’ve got (inside the PEM enclosure) is an SGML document defined by the provided DTD. So, first go to pages 48-55, extract the text there, and save it as, say, “edgar.dtd”.

The first thing I’d do is install SP and use its tools to make sure that the documents really are valid and parseable by that DTD, to make sure you don’t waste a bunch of time on something that isn’t going to pan out.

Python 2 comes with an SGML parser, sgmllib. Unfortunately, it was never quite finished (it does not validate against a DTD and only handles as much SGML as HTML uses), and it’s deprecated in 2.6-2.7 (and removed in 3.x). But that doesn’t mean it won’t work. So, try it and see if it works.
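Since sgmllib is gone in Python 3, one rough stand-in is the standard library’s html.parser.HTMLParser. This is a swapped-in technique, not what the answer above proposes, and it is not a validating SGML parser: it ignores the DTD entirely. The sketch below only shows that an event-based parser copes with the unclosed tags in the document header. The FieldCollector class and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class FieldCollector(HTMLParser):
    """Record the first non-blank text chunk after each tag of interest."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)  # lowercase tag names to capture
        self.fields = {}
        self._pending = None       # tag whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.wanted and tag not in self.fields:
            self._pending = tag
        else:
            self._pending = None

    def handle_data(self, data):
        if self._pending and data.strip():
            self.fields[self._pending] = data.strip()
            self._pending = None


# Hypothetical sample modelled on the document header shown in the question.
sample = (
    "<DOCUMENT>\n<TYPE>10-K\n<SEQUENCE>1\n"
    "<FILENAME>d10k.htm\n<DESCRIPTION>FORM 10-K\n<TEXT>\n"
)

collector = FieldCollector(["type", "sequence", "filename"])
collector.feed(sample)
collector.close()
fields = collector.fields
print(fields)
```

Because nothing is checked against the DTD, a malformed filing would pass through silently, so this is only suitable for pulling out a few known fields.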

If not, I don’t know of any good alternatives in Python; most of the SGML code out there is in C, C++, or Perl. But you can wrap up any C or C++ library (I’d start with SP) pretty easily, as long as you’re comfortable writing your own wrapper in C/Cython/boost-python/whatever or using ctypes. You only need to wrap up the top-level functions, not build a complete set of bindings. But if you’ve never done anything like this before, it’s probably not the best time to learn.

Alternatively, you can wrap up a command-line tool. SP comes with nsgmls. There’s another good tool written in perl with the same name (I think part of http://savannah.nongnu.org/projects/perlsgml/ but I’m not positive.) And dozens of other tools.
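As a rough sketch of what wrapping nsgmls might look like: it writes a line-oriented event stream (ESIS) to standard output, where a line starting with "(" opens an element, ")" closes one, "-" carries data, and a final "C" means the document was conforming. The helper names below are hypothetical, the bare nsgmls invocation assumes the tool is on your PATH and can locate the DTD (which in practice needs a DOCTYPE declaration or catalog setup not shown here), and escape sequences inside data lines are left undecoded.

```python
import subprocess


def parse_esis(lines):
    """Turn nsgmls ESIS output lines into (event, value) tuples."""
    events = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "(":
            events.append(("start", rest))
        elif code == ")":
            events.append(("end", rest))
        elif code == "-":
            # Data escapes such as \n are kept as-is in this sketch.
            events.append(("data", rest))
        elif code == "C":
            events.append(("conforming", ""))
        # Other commands (attributes, processing instructions, ...) are skipped.
    return events


def run_nsgmls(path):
    """Hypothetical helper: run nsgmls on one file and parse its output."""
    result = subprocess.run(["nsgmls", path], capture_output=True, text=True)
    return parse_esis(result.stdout.splitlines())


# Hand-written ESIS lines, so the parsing step can be tried without nsgmls.
sample_output = ["(TYPE", "-10-K", ")TYPE", "C"]
events = parse_esis(sample_output)
print(events)
```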

Or, of course, you could write the whole thing, or just the parsing layer, in perl (or C++) instead of Python.

Respondent: Richard Herron

Solution #2:

You can easily get to the encapsulated text of the PEM (Privacy-Enhanced Message, specified in RFC 1421) by stripping the encapsulation boundaries and separating everything in between into header and encapsulated text at the first blank line.

The SGML parsing is much more difficult. Here’s an attempt that seems to work with a document from EDGAR:

from lxml import html

PRE_EB = "-----BEGIN PRIVACY-ENHANCED MESSAGE-----"
POST_EB = "-----END PRIVACY-ENHANCED MESSAGE-----"

def unpack_pem(pem_string):
    """Takes a PEM encapsulated message and returns a tuple
    consisting of the header and encapsulated text.  
    """

    if not pem_string.startswith(PRE_EB):
        raise ValueError("Invalid PEM encoding; must start with %s"
                         % PRE_EB)
    if not pem_string.strip().endswith(POST_EB):
        raise ValueError("Invalid PEM encoding; must end with %s"
                         % POST_EB)
    msg = pem_string.strip()[len(PRE_EB):-len(POST_EB)]
    # The header ends at the first blank line (two consecutive newlines).
    header, encapsulated_text = msg.split('\n\n', 1)
    return (header, encapsulated_text)


filename = 'secdoc_htm.txt'
data = open(filename, 'r').read()

header, encapsulated_text = unpack_pem(data)

# Now parse the SGML
root = html.fromstring(encapsulated_text)
document = root.xpath('//document')[0]

metadata = {}
metadata['type'] = document.xpath('//type')[0].text.strip()
metadata['sequence'] = document.xpath('//sequence')[0].text.strip()
metadata['filename'] = document.xpath('//filename')[0].text.strip()

inner_html = document.xpath('//text')[0]

print(metadata)
print(inner_html)

Result:

{'filename': 'd371464d10q.htm', 'type': '10-Q', 'sequence': '1'}

<Element text at 80d250c>

Respondent: abarnert

Solution #3:

You could use BeautifulSoup for this:

>>> from BeautifulSoup import BeautifulStoneSoup
>>> soup = BeautifulStoneSoup(xmldata)
>>> print soup.prettify()
-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: [email protected]
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 AHxm/u6lqdt8X6gebNqy9afC2kLXg+GVIOlG/Vrrw/dTCPGwM15+hT6AZMfDSvFZ
 YVPEaPjyiqB4rV/GS2lj6A==
<sec-document>
 0001193125-07-200376.txt : 20070913
 <sec-header>
  0001193125-07-200376.hdr.sgml : 20070913
  <acceptance-datetime>
   20070913115715
ACCESSION NUMBER:       0001193125-07-200376
CONFORMED SUBMISSION TYPE:  10-K
PUBLIC DOCUMENT COUNT:      7
CONFORMED PERIOD OF REPORT: 20070630
FILED AS OF DATE:       20070913
DATE AS OF CHANGE:      20070913
  </acceptance-datetime>
 </sec-header>
 <document>
  <type>
   10-K
   <sequence>
    1
    <filename>
     d10k.htm
     <description>
      FORM 10-K
      <text>
       <html>
        <head>
         <title>
          Form 10-K
         </title>
        </head>
        <body bgcolor="WHITE">
         <h5 align="left">
          <a href="#toc">
           Table of Contents
          </a>
         </h5>
        </body>
       </html>
      </text>
     </description>
    </filename>
   </sequence>
  </type>
 </document>
</sec-document>

Respondent: Lukas Graf

Solution #4:

Although the problem definition implies you want to start parsing at the first ‘<’, I don’t think this is a good idea. Those look like PEM headers (if not, they’re something else derived from RFC(2)822), and they could have ‘<’ characters in them. For example, you might find Originator-Name: "Foo Bar" <[email protected]> one day. It’s possible that the particular files you’re looking at never will, but unless you can know that for sure, it’s better not to rely on it.

If you want to actually parse this as an RFC822 message with an XML body, that’s pretty easy:

import rfc822  # Python 2 only; removed in Python 3
from lxml import etree

with open('temp.txt') as f:
  rfc822.Message(f).rewindbody()
  x = etree.parse(f)

But technically this isn’t valid for PEM (because PEM’s header-body format is effectively a fork of RFC822 rather than incorporating it by reference). And it may not be even practically valid for various other similar not-quite-RFC822 formats. And really, all you care about is how headers and bodies are separated, which is a very simple rule:

from lxml import etree

with open('temp.txt') as f:
  # Skip header lines up to and including the first blank line.
  while f.readline().strip():
    pass
  x = etree.parse(f)
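To see why the blank-line rule is safer than hunting for the first ‘<’, here is a small self-contained comparison on an in-memory string. The sample header and its address (foo@example.com) are invented for illustration, and both function names are hypothetical.

```python
def body_after_first_angle(text):
    """Naive approach: assume the body starts at the first '<'."""
    return text[text.index("<"):]


def body_after_blank_line(text):
    """Header/body rule: the body starts after the first blank line."""
    header, sep, body = text.partition("\n\n")
    if not sep:
        raise ValueError("no blank line separating header and body")
    return body


# Made-up sample: one header value happens to contain a '<'.
sample = (
    "-----BEGIN PRIVACY-ENHANCED MESSAGE-----\n"
    "Proc-Type: 2001,MIC-CLEAR\n"
    'Originator-Name: "Foo Bar" <foo@example.com>\n'
    "\n"
    "<SEC-DOCUMENT>0001193125-07-200376.txt : 20070913\n"
)

naive = body_after_first_angle(sample)
robust = body_after_blank_line(sample)
print(naive.splitlines()[0])
print(robust.splitlines()[0])
```

The naive version starts in the middle of the header, while the blank-line version starts at the SEC-DOCUMENT tag.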

The other alternative is to rely on the (apparent) fact that the body is always a SEC-DOCUMENT node:

from lxml import etree

with open('temp.txt') as f:
  text = f.read()
# Keep everything from the first <SEC-DOCUMENT> tag onward.
body = '<SEC-DOCUMENT>' + text.split('<SEC-DOCUMENT>', 1)[1]
x = etree.fromstring(body)

One last note: Generally, once you see RFC822 headers, that raises the question of whether the format is actually full RFC2822 + optional MIME. The fact that there are no content headers anywhere implies that you’re probably safe here, but you might want to grep a large collection of them (or, if there’s a definition of the file format somewhere, skim it over).
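If you’d rather do that check from Python than with grep, something like the following sketch would do. It only looks at the header block (everything before the first blank line) and reports whether any MIME-style header shows up. The function name, the list of header names to look for, and the temporary demo files are all assumptions made for illustration.

```python
import os
import tempfile

# Header names that would suggest MIME content (an assumed, non-exhaustive list).
MIME_HEADERS = ("mime-version:", "content-type:", "content-transfer-encoding:")


def has_mime_headers(path):
    """Return True if the header block of the file contains a MIME header."""
    with open(path, encoding="latin-1") as f:
        for line in f:
            if not line.strip():
                return False  # reached the blank line that ends the header
            if line.lower().startswith(MIME_HEADERS):
                return True
    return False


# Demo on two throwaway files; real use would loop over the downloaded filings.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "plain.txt": "Proc-Type: 2001,MIC-CLEAR\n\n<SEC-DOCUMENT>\n",
        "mime.txt": "Proc-Type: 2001,MIC-CLEAR\nContent-Type: text/plain\n\nbody\n",
    }
    results = {}
    for name, content in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="latin-1") as f:
            f.write(content)
        results[name] = has_mime_headers(path)

print(results)
```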

Respondent: jterrace

The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.
