PARSING ... or how to best digest a web-page ...

rnio · April 19, 2009, 2:30pm

Hi,

Where should I look for Python classes that support parsing web pages?

Here is what I have: a few web pages that give me the LOCATION / THUMBS / DESCRIPTION of other web pages, which finally have the streaming URLs.

Here is what I think I need:

something to capture / download the web page
if possible, a few tools to “easily” digest / extract the needed information
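
A minimal sketch of those two steps, assuming Python 3 and only the standard library. The names `fetch_html`, `LinkCollector` and `extract_links` are made up for illustration and are not part of any plug-in framework:

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its source as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(page_source):
    """Return all link targets found in the given HTML text."""
    collector = LinkCollector()
    collector.feed(page_source)
    collector.close()
    return collector.links
```

With a real page this would be called as `extract_links(fetch_html(url))`, where `url` is whatever index page lists the streams.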

Any help / pointers are highly appreciated.

Thanks

riviera · April 19, 2009, 6:26pm

XPath works really well for this stuff, as it lets you drill down by specifying tags (including id and class attributes) and get lists of items from the HTML that way.
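
As a rough illustration of that drill-down, here is a sketch using the `lxml.html` module. The sample markup, the `videos` id, the `item` and `desc` classes, and the function name `extract_items` are all invented for the example; a real page needs its own expressions:

```python
from lxml import html

# Invented sample page: each item carries a link, a thumbnail and a description.
SAMPLE = """
<html><body>
  <div id="videos">
    <div class="item">
      <a href="/watch/1"><img src="/thumbs/1.jpg"/></a>
      <p class="desc">First clip</p>
    </div>
    <div class="item">
      <a href="/watch/2"><img src="/thumbs/2.jpg"/></a>
      <p class="desc">Second clip</p>
    </div>
  </div>
</body></html>
"""


def extract_items(page_source):
    """Return one dict per item with its location, thumb and description."""
    tree = html.fromstring(page_source)
    items = []
    # Select every item block inside the container with id="videos".
    for node in tree.xpath('//div[@id="videos"]/div[@class="item"]'):
        items.append({
            "location": node.xpath("./a/@href")[0],
            "thumb": node.xpath("./a/img/@src")[0],
            "description": node.xpath('./p[@class="desc"]/text()')[0].strip(),
        })
    return items
```

Each `xpath()` call returns a list, so the `[0]` picks the first match and would raise an `IndexError` if an item lacked that element.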

If you have a look at some of the existing plugins you’ll find a lot of them use it and it should give you a good idea of how it works. Personally I found looking at actual code from other plugins a lot easier than googling for it too.

rnio · April 19, 2009, 7:45pm

Thanks for the pointer. I will check it out.

I just started “learning” to make my own plug-in exactly the way you suggested … by looking at existing code. Along the way I have questions and try to find answers here.

I will start a thread where I just write down what I discover … this should hopefully help get more folks interested in creating new plug-ins.

rnio · April 20, 2009, 2:39pm

Just an UPDATE:

To better understand how to extract the data / links from HTML / XML sources, I will have to start learning Python … a good starting point is here:

http://docs.python.org/tutorial/index.html

Then there is a great library out there, even included in Plex, which will help with most of the heavy lifting:

lxml is the most feature-rich and easy-to-use library for working with XML and HTML in the Python language.

lxml.etree is the part of it that I will use.
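
A small sketch of what working with `lxml.etree` can look like, assuming a made-up XML feed (the `feed`, `entry`, `title` and `link` names and the example.com URLs are placeholders, not a real format):

```python
from lxml import etree

# Invented sample document standing in for a real feed.
XML = b"""<feed>
  <entry><title>Clip one</title><link href="http://example.com/stream/1"/></entry>
  <entry><title>Clip two</title><link href="http://example.com/stream/2"/></entry>
</feed>"""

root = etree.fromstring(XML)

# ElementTree-style navigation: walk the entries and read a child's text.
titles = [entry.findtext("title") for entry in root.iter("entry")]

# The same tree also answers XPath queries directly.
links = root.xpath("//entry/link/@href")
```

Both styles work on the same parsed tree, so they can be mixed freely.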

The tutorial is here: http://codespeak.net/lxml/tutorial.html

BTW, it is also among the fastest XML parsers available for Python, according to the benchmarks here:

http://blog.ianbicking.org/2008/03/30/pyth…er-performance/

Happy coding …

riviera · April 20, 2009, 3:00pm

I think you may find that using lxml.etree will be pretty tedious compared to just going with XPath, not that I want to talk you out of lxml if you have a good reason for using it. I would note that lxml’s added performance is largely irrelevant here, since you are only working with a single page (or perhaps a couple, depending on the context) on a desktop machine. Performance only really becomes an issue when you are handling thousands of documents, say for data mining or in a server context, so I wouldn’t make it too high a priority.

rnio · April 20, 2009, 4:02pm

I am a ROOKIE when it comes to Python and any of the web-based languages / tools … so I am still learning.

I found a good / simple tutorial for XPath:

http://www.zvon.org/xxl/XPathTutorial/General/examples.html
