PARSING ... or how to best digest a web-page ...

Hi,



Where should I look for Python classes that support parsing web pages?



Here is what I have: a few web pages that give me the LOCATION / THUMBS / DESCRIPTION of other web pages, which finally contain the streaming URLs.



Here is what I think I need:


  • something to capture / download the web page
  • if possible have a few tools to “easily” digest / extract the needed information



    Any help / pointers are highly appreciated.



    Thanks

XPath works really well for this stuff, as it allows you to drill down by specifying tags (including id and class attributes) and get lists of items from the HTML that way.

If you have a look at some of the existing plugins you’ll find a lot of them use it and it should give you a good idea of how it works. Personally I found looking at actual code from other plugins a lot easier than googling for it too.
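
A minimal sketch of that kind of drill-down, assuming a hypothetical, well-formed listing page (the markup, the `videos` id, the `item` and `desc` classes, and the `extract_items` helper are all invented for illustration; real sites will differ). It uses the import fallback shown in the lxml tutorial, so it runs with lxml if installed and with the standard library's ElementTree otherwise; both understand the simple path expressions used here.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, smaller XPath subset

# Hypothetical listing page: each "item" links to another page and
# carries a thumbnail and a description.
PAGE = """
<html>
  <body>
    <div id="videos">
      <div class="item">
        <a href="http://example.com/watch/1"><img src="http://example.com/thumb1.jpg"/></a>
        <p class="desc">First clip</p>
      </div>
      <div class="item">
        <a href="http://example.com/watch/2"><img src="http://example.com/thumb2.jpg"/></a>
        <p class="desc">Second clip</p>
      </div>
    </div>
  </body>
</html>
""".strip()


def extract_items(markup):
    """Return one dict per item with its location, thumb and description."""
    root = etree.fromstring(markup)
    items = []
    # Drill down by tag plus id/class attribute, as described above.
    for node in root.findall(".//div[@id='videos']/div[@class='item']"):
        items.append({
            "location": node.find("a").get("href"),
            "thumb": node.find("a/img").get("src"),
            "description": node.find("p[@class='desc']").text,
        })
    return items


items = extract_items(PAGE)
for item in items:
    print(item["location"], item["thumb"], item["description"])
```

The path expression does the selection, so the loop body only has to read attributes and text from each matched node.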

Thanks for the pointer. I will check it out.



I just started “learning” to make my own plug-in exactly the way you suggested … looking at existing code. On the way I have questions and try to find answers here :)



I will start a thread where I just write what I discover … this should hopefully help get more folks interested in creating new plug-ins.

Just an UPDATE:



To better understand how to extract the data / links from within HTML / XML sources I will have to start learning Python … a good starter is here:



http://docs.python.org/tutorial/index.html



Then there is a great library out there, which is even included in Plex and will help to do most of the heavy lifting:



lxml is the most feature-rich and easy-to-use library for working with XML and HTML in the Python language.



lxml.etree is the part of it that I will use.



The tutorial is here: http://codespeak.net/lxml/tutorial.html
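
The two needs from the first post (download a page, then extract the links) could be sketched roughly as below. This is written for Python 3 and is only an illustration: the `download` and `stream_urls` helpers, the sample markup, and the example.com URLs are assumptions, not anything taken from Plex or an existing plug-in. The parsing step expects well-formed markup; real-world HTML often is not, which is why lxml also ships a separate, more forgiving `lxml.html` module.

```python
import urllib.request

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with the same basic API


def download(url, timeout=10):
    """Return the raw bytes of the page at `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def stream_urls(markup):
    """Collect the href of every <a> element, in document order."""
    root = etree.fromstring(markup)
    return [a.get("href") for a in root.iter("a") if a.get("href")]


# Hypothetical page content, used here so the sketch runs without network access.
SAMPLE = b"""<html><body>
<a href="http://example.com/stream/1.m3u8">One</a>
<a name="anchor-without-href">skip me</a>
<a href="http://example.com/stream/2.m3u8">Two</a>
</body></html>"""

links = stream_urls(SAMPLE)
print(links)
```

With a real page the two steps chain together as `stream_urls(download(url))`, where `url` is whatever listing page you are working with.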



BTW, it also comes out as the fastest of the XML parsers compared here:



http://blog.ianbicking.org/2008/03/30/pyth…er-performance/





Happy coding …

I think you may find that using lxml.etree will be pretty tedious compared to just going with XPath, not that I want to talk you out of going with lxml if you have a good reason for doing so. I would note that the added performance of lxml is kind of irrelevant in this case, because you are only working with a single page (or perhaps a couple, depending on the context) on a desktop machine. Performance with this kind of thing only really becomes an issue when you’re thinking about handling thousands of documents for, say, data mining or in a server context, so I wouldn’t make performance too high a priority.
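
To make the "tedious" point concrete, here is a small comparison under assumed markup (the `menu` and `clips` ids and both helper functions are hypothetical). It walks the tree by hand and then does the same selection with a single path expression. Note that the two approaches are not mutually exclusive: lxml elements accept path expressions through `findall()`, and lxml additionally offers full XPath through its `xpath()` method.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback

DOC = etree.fromstring(
    "<html><body>"
    "<ul id='menu'><li><a href='/home'>Home</a></li></ul>"
    "<ul id='clips'>"
    "<li><a href='/clip/1'>Clip 1</a></li>"
    "<li><a href='/clip/2'>Clip 2</a></li>"
    "</ul>"
    "</body></html>"
)


def clip_links_by_hand(root):
    """Walk the tree element by element, checking tags and ids manually."""
    links = []
    for ul in root.iter("ul"):
        if ul.get("id") != "clips":
            continue
        for li in ul:
            if li.tag != "li":
                continue
            for a in li:
                if a.tag == "a":
                    links.append(a.get("href"))
    return links


def clip_links_by_path(root):
    """Same selection expressed as one path expression."""
    return [a.get("href") for a in root.findall(".//ul[@id='clips']/li/a")]


by_hand = clip_links_by_hand(DOC)
by_path = clip_links_by_path(DOC)
print(by_hand == by_path, by_path)
```

Both functions return the same list; the path version simply moves the tag and attribute checks into the expression.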

I am a ROOKIE when it comes to Python and any of the web-based languages / tools … so I am still learning :)



I found a good / simple tutorial for XPath:



http://www.zvon.org/xxl/XPathTutorial/General/examples.html