xpath coding

Been looking at the xpath and went through some tutorials, but still not understanding the format vs how it is called. Let’s take the function below that is calling an RSS feed.



    def MainMenu():
      oc = ObjectContainer()

      for video in XML.ElementFromURL(RSS_FEED).xpath('//item'):
        url = video.xpath('./link')[0].text
        title = video.xpath('./title')[0].text
        date = video.xpath('./pubDate')[0].text

        oc.add(VideoClipObject(
          url = url,
          title = title,
          originally_available_at = date))





I am having trouble with how the above code and its xpath queries are interpreted against the original XML document. I can look at the source of the RSS feed's XML document and see that there is a list of elements called item, each of which has child elements called link, title, and pubDate that contain the data I am trying to extract from the RSS feed's XML page. I also get that //item tells it to look in the parent element named item and that ./link, ./title, and ./pubDate tell it to fetch the data held in the child elements named link, title, and pubDate. I would like to know exactly what value would be returned to the variables url, title, and date every time the function loops through and performs the video.xpath commands used above. But I cannot figure out what query string I would enter into Xpath Finder in Firefox to see, for example, the data that would be returned to the variable url based on the code used above.

Also in general, what is video.xpath? Is that a python command? Is it a combination of API and python? I ask because it would be nice to know the syntax for it and why it is usually followed by [0].text. I have seen some that add variables in the video.xpath command along with the xpath query and also others that use .get() at the end instead. So it would be helpful to understand all of its parts and parameters. I tried to search for it within python documentation and though xpath is talked about, I do not see anything related to video.xpath.

Lastly, how do you deal with extracting data from elements that contain CDATA sections, especially when they contain multiple pieces of information (ex. thumbnail, title, url)? As far as the multiple pieces of info, I guess you could do a couple of inner loops to pull out the individual parts of the data you need, but since I have not figured out Xpath Finder well enough to see the actual data that is being pulled, I was just curious how, if at all, CDATA affects your code's ability to get the info?

In the sample code you included above, each iteration of the for-loop assigns a new value to a variable called "video". The expression XML.ElementFromURL(RSS_FEED).xpath('//item') returns a list of every "item" element in the XML document, and on each pass through the loop 'video' refers to the next "item" in that list. Each of the statements inside the loop applies a further xpath expression to 'video'. Each of those expressions puts './' before the name of the referenced element. That './' means the expression is evaluated relative to the current element and matches only its direct children. Even though we are working with one specific "item", we must be this specific because a looser expression (for example one starting with '//', which searches the whole document) can easily match elements elsewhere in the XML that are not what we want.

Xpath expressions always return a list, even if there is only one element in the list. In order to carry out any further manipulation of the desired element, we must first extract it from the list. We do this by referencing the list index of the element we want, which is almost always the first element, [0].

Xpath elements can have many attributes. Often when parsing HTML using xpath, we are interested in the value of the 'href' attribute of an <a> tag. For example, HTML.ElementFromURL(some_url).xpath('//a')[0].get('href') would grab the href value for the first <a> tag in the HTML document returned by "some_url". Often, we're just interested in the text contained between a matching pair of tags (<tag>…</tag>), in which case we use .text as the reference. Fortunately, when using the .text reference, the CDATA wrapper is automatically removed from the returned value.



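Putting it all together, the first iteration of the loop grabs the first "item", in shorthand //item[0]. Then it grabs the text from the first 'link' element of the first item, in shorthand //item[0]/link[0]/text. And so on for title and pubDate. (The [0] here is a Python list index applied to the result of the xpath query, not part of the xpath expression itself.)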
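For anyone who wants to watch the values come back outside of a channel, here is a minimal sketch of the same loop. It is not the Plex Framework: it assumes Python's standard xml.etree.ElementTree as a stand-in, where findall() plays the role of .xpath() and './/item' plays the role of '//item'. The feed text and the read_items() helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed contents, modelled on the item structure discussed above.
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <link>http://some_url.blah/1</link>
      <title>First item</title>
      <pubDate>Oct 22, 2012</pubDate>
    </item>
    <item>
      <link>http://some_url.blah/2</link>
      <title>Second item</title>
      <pubDate>Oct 23, 2012</pubDate>
    </item>
  </channel>
</rss>"""


def read_items(xml_text):
    """Return one (url, title, date) tuple per <item> element."""
    root = ET.fromstring(xml_text)
    results = []
    # './/item' stands in for '//item': every item anywhere below the root.
    for video in root.findall('.//item'):
        # Each query returns a list, so [0] picks the first match.
        url = video.findall('./link')[0].text
        title = video.findall('./title')[0].text
        date = video.findall('./pubDate')[0].text
        results.append((url, title, date))
    return results


items = read_items(RSS_SAMPLE)
for url, title, date in items:
    print(url, title, date)
```

Note that the feed-level title ("Sample feed") never shows up, because './title' only looks at the direct children of the current item.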

Ok I get that .get will return the exact data contained within an element vs .text will just return the text or data within the element without any of its surrounding code. Now as far as the format of .get(), when you see .get(url) or .get(href), is the content of the () the element tag (like title or pubDate are child elements in the example above) or is url or href a predefined variable?



If [0] refers to grabbing the first child element named text, link, pubDate of parent "item" elements, then that means that when it grabs the second "item" it would be //item[1], and the third would be //item[2], etc? And the reason for the [0] is that we are always grabbing the first occurrence of the children elements and pulling all the parent "item" elements, so there is no need for a counter? (ex: //item[2]/link[0]/text would pull the first link child element in the second parent item element?) The problem I have with this is that it does not look like xpath occurrences start with [0] but with [1]. For example, the first child title element in the third parent item element would be //item[3]/title[1]. So why is [0] used and not [1]? Is [0] just the default for only pulling the first sibling element?


And HTML documents are confusing me more. Let's say with the example above, instead of just getting the first url on an html page, you wanted to get all the information available from each link on the page. So instead of just using .get(), you set up a for video loop to go through the document and get the link and the text for each anchor tag.


    def MainMenu():
      oc = ObjectContainer()

      for video in HTML.ElementFromURL(some_url).xpath('//a'):
        # I have not explored a lot of the xpath commands for retrieving data from html documents, so below is just a guess
        # at how you would simply call the link and the text between anchor tags
        url = video.xpath('./@href')[0]  # not even sure what code you would use here maybe a .get('href') would work better but not sure of its syntax
        title = video.xpath('./text()')[0]  # again not sure of the actual format for this

        oc.add(VideoClipObject(
          url = url,
          title = title))





With the XML example, you had //item[1], //item[2], etc. I cannot see how to pull up each anchor individually with a similar command like //a[1], //a[2], etc. If each time through the loop, video picks up the information or data specific to one anchor at a time, why can you not just type in //a[3] in an xpath finder tool and see the third anchor on the page?

What you choose to grab with the .get() command depends on what data is available and what data you want. For example the RSS_FEED referenced by your previous example would probably include items that look like this:

    <item>
        <link>http://some_url.blah</link>
        <title>Title of this Item</title>
        <pubDate>Oct 22, 2012</pubDate>
    </item>



Now, with that example you should use .text because the relevant data is contained as text between two enclosing tags. However, if you're using xpath on a sample of HTML instead, it might look like this:

    <a href="http://some_url.blah" title="Title of this Item" pubDate="Oct 22, 2012">
        <img src="http://some_image_url.jpg"/>
    </a>




Note that the relevant data is contained **within** the tag as opposed to between two tags. In this case, you should use .get(). When using .get(), you should put quotes around the name of the attribute you want to get, i.e. .get('title') or .get("href") as opposed to .get(pubDate).
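To make the difference concrete, here is a small sketch that runs both of the samples above through Python's standard xml.etree.ElementTree. That library is only a stand-in I am assuming behaves like the framework's elements for .text and .get(); findall() takes the place of .xpath(), and the variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

# The XML-style sample: data sits between the tags.
item = ET.fromstring(
    '<item>'
    '<link>http://some_url.blah</link>'
    '<title>Title of this Item</title>'
    '<pubDate>Oct 22, 2012</pubDate>'
    '</item>'
)

# The HTML-style sample: data sits inside the opening tag as attributes.
anchor = ET.fromstring(
    '<a href="http://some_url.blah" title="Title of this Item" pubDate="Oct 22, 2012">'
    '<img src="http://some_image_url.jpg"/>'
    '</a>'
)

# .text reads what is between the opening and closing tags.
text_title = item.findall('./title')[0].text

# .get('name') reads an attribute; the name is a quoted string.
attr_title = anchor.get('title')
attr_href = anchor.get('href')
img_src = anchor.findall('./img')[0].get('src')

# Asking for an attribute that is not there gives None rather than an error.
missing = anchor.get('no_such_attribute')

# The anchor holds only a child element and no text of its own.
anchor_text = anchor.text

print(text_title, attr_title, attr_href, img_src, missing, anchor_text)
```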


Some programming languages may differ, but Python, which is the primary language used for channel development, starts counting at list index 0. So the first item in a list is always [0] and the second is always [1]. NB. the last item in a list is always [-1] and the second last is always [-2]. That tidbit can come in handy from time to time. Xpath itself counts positions from 1, so if the xpath checker you are using starts counting at 1 rather than 0, that can be confusing. Just remember that our Python lists always start at [0]. Since we tend to use loops for grabbing each consecutive item in the reference document (whether it's XML or HTML, or other), the iteration is generally taken care of by the loop so that we do not need to specify an explicit counter. You could if you want to but it's not necessary and tends to clutter the code. If you wanted to change your previous example to use an explicit counter, it might look something like this:

    def MainMenu():
        oc = ObjectContainer()

        # for clarity let's separate some of the steps #
        # first - grab the XML #
        data = XML.ElementFromURL(RSS_FEED)

        # run the query once and keep the resulting list of "item" elements #
        items = data.xpath('//item')

        # setup our counter, remember to start at 0 #
        i = 0

        # create the loop. In this case I'll use a WHILE loop instead of a FOR loop #
        while i < len(items):
            # our loop will keep iterating until it reaches the last "item" #
            # grab the "item" at index i so that we can gather more info about it #
            item = items[i]
            title = item.xpath('./title')[0].text
            url = item.xpath('./link')[0].text
            date = item.xpath('./pubDate')[0].text

            oc.add(VideoClipObject(
                url = url,
                title = title,
                originally_available_at = date))

            # move on to the next "item" #
            i += 1

        return oc
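The 0-versus-1 confusion is easier to see side by side. The sketch below is only an illustration using Python's standard xml.etree.ElementTree (findall() standing in for .xpath()) and a made-up three-item channel. The index in square brackets after the call is a Python list index, while a number inside the query string is an xpath position.

```python
import xml.etree.ElementTree as ET

# Made-up sample data with three items.
channel = ET.fromstring(
    '<channel>'
    '<item><title>A</title></item>'
    '<item><title>B</title></item>'
    '<item><title>C</title></item>'
    '</channel>'
)

# The query returns a Python list, and Python lists count from 0.
items = channel.findall('./item')
first_by_list = items[0].findall('./title')[0].text
second_last_by_list = items[-2].findall('./title')[0].text
last_by_list = items[-1].findall('./title')[0].text

# A position written inside the query string counts from 1.
first_by_xpath = channel.findall('./item[1]/title')[0].text
third_by_xpath = channel.findall('./item[3]/title')[0].text

print(first_by_list, second_last_by_list, last_by_list)
print(first_by_xpath, third_by_xpath)
```

So //item[3]/title[1] typed into an xpath tool and xpath('//item')[2].xpath('./title')[0] written in Python point at the same element.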






You are conceptually close with your sample code above. HTML definitely tends to be more complex than basic structured XML like that from an RSS feed. The xpath expressions necessary to grab data from a webpage's HTML can vary quite a bit. If the HTML is written well, it can be pretty easy to grab the data you want. If it's poorly written, it becomes a real PITA to build nice xpath expressions. There are usually a lot of different ways to grab the same data using xpath. As long as it does what you need, there is no wrong xpath but there are certainly some ways that are more right. Generally, when using xpath for HTML parsing, the recommendation is to try to use the attributes of the tags you want as guides rather than traversing numerous levels. For example:

    <div id="video_block">
        <div>
            <div id="video">
                <a href="some_url">Title 1</a>
            </div>
        </div>
    </div>





You could say **url=blah.xpath('//div/div/div/a')[0].get('href')** but that's not very concise and will break easily if the webpage changes. I would recommend that instead, you use something like **url=blah.xpath('//div[@id="video"]/a')[0].get('href')**. The first option essentially says that you want the "href" value of the first element "a", which is a child of a "div" element, which is a child of a "div" element, which is itself a child of a "div" element. The second option simply says that you want the "href" value of the first "a" element which is the child of a "div" element that has the "id" attribute equal to "video". I hope that makes sense and you can see the difference. If there's a specific bit of HTML that's really giving you grief, post a sample of it and I'll try to help you work through it.
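Here is a rough sketch of why the attribute-based expression holds up better. It uses Python's standard xml.etree.ElementTree as a stand-in (findall() instead of .xpath(), './/' instead of '//'), which only copes with well-formed markup, so treat it as an illustration rather than how the framework parses real pages. The REDESIGNED page and both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample structure from above, wrapped in a minimal page.
ORIGINAL = """<html><body>
<div id="video_block">
  <div>
    <div id="video">
      <a href="some_url">Title 1</a>
    </div>
  </div>
</div>
</body></html>"""

# Hypothetical redesign: the anonymous middle div has been dropped.
REDESIGNED = """<html><body>
<div id="video_block">
  <div id="video">
    <a href="some_url">Title 1</a>
  </div>
</div>
</body></html>"""


def href_by_path(page):
    """Walk down level by level; depends on the exact nesting."""
    matches = page.findall('./body/div/div/div/a')
    return matches[0].get('href') if matches else None


def href_by_id(page):
    """Find the div by its id attribute, wherever it sits."""
    matches = page.findall('.//div[@id="video"]/a')
    return matches[0].get('href') if matches else None


original = ET.fromstring(ORIGINAL)
redesigned = ET.fromstring(REDESIGNED)

print(href_by_path(original), href_by_id(original))
print(href_by_path(redesigned), href_by_id(redesigned))
```

The level-by-level path stops matching as soon as one wrapper div disappears, while the id-based query keeps working.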

Thank you so much. That helps so much and makes great sense. And thanks for the ID tip, I had looked at some html code and was thinking it would definitely be better to pull directly from the id tags if I could figure it out. I will probably just use the RSS service for this first project, if it works since it is easier to draw from. But just having some more knowledge about the code and understanding how xpath works is gonna help me out in figuring out how to write the URL service.



And I do appreciate your help. Even though every time you answer a question, I follow up with five more questions. It is just because once you explain it and I get it, I am able to look back at documentation and other examples of channels and I get a little excited that I finally understand what I am looking at, so I want to understand all the little details. Hopefully now I have a good enough handle on the basic concepts of this stuff and what I am looking at that I should be able to do some trial and error at this point and not bug you guys so much. Though that just means I will end up bugging you guys with some questions on the debugging process.



I'm glad you're starting to get a handle on things. I agree that parsing an RSS feed is generally easier than trying to scrape a website's HTML. Your choice to start with the RSS seems like a good one to me. I'm already looking forward to seeing your first plugin. Keep the questions coming as you get further along.

I know this was posted forever ago but I want to say thanks for your explanation, it really helped me out :slight_smile:



You're quite welcome. There is also now a dev-blog post with helpful pointers about xpath, thanks to Gerk. Check it out [here](http://devblog.plexapp.com/2012/11/14/xpath-for-channels-the-good-the-bad-and-the-fugly/).

Yes it was quite helpful.  Thank you again Mikedm139.

I had to take a break from looking into writing a channel, but I am trying to get back into it again.  I have been reviewing what I have already asked just to make sure I remember what you have helped me figure out to this point, so I will not ask the same question twice. Also, I want to create some template documents for the different python files needed for channel development and want to fill them in with a lot of notes for all the possibilities of coding and syntax that I figure out as I go along.

I am still having trouble with using the Xpath Checker App in Firefox.  I can get it to work with html pages, but I cannot figure out how to get it to work for xml RSS feed pages. And I realize that the RSS feed pages are designed to make it very easy to just look at the source code and see the parent and children elements.  So it is easier to determine what it is going to return, and testing it with Xpath Checker is not strictly necessary. But it would still be nice to enter the xpath expression for an RSS feed page in the Xpath Checker and see the list that will be returned so I know I have it right.

Here is the example that I am using:
The RSS feed page is http://www.lstudio.com/rss/lstudiorss-web-therapy.xml
I open the page, right click on the Xpath Checker app. I have tried entering the xpath expressions several ways and cannot get it to return any results from the elements on the page.  Here are some examples that I have tried:
/item/title
//title
/item./title
./title
//item[2]/title[1]
The only expression I can enter and get any results for is //* and that returns everything on the page.

Can anyone explain what I am doing wrong or what format I am supposed to use on these RSS feeds that is different from regular html pages to make Xpath Checker show me results?

Also, is it possible through the Xpath Checker to add the text or the get() command so I can see the exact data being returned to channel?
 

Hi!

The XPath Checker doesn't really work great on .xml files, because the Firefox browser is not showing the raw files. Instead it tries to display them as nicely formatted HTML documents. I usually use this online tool to test xpaths on xml document: http://chris.photobooks.com/xml/default.htm

Also,
is it possible through the Xpath Checker to add the text or the get()
command so I can see the exact data being returned to channel?
 

You can use text() to get the text value of an element:

//item/title/text()

And you can use @attribute-name to get the value of an attribute:

Example HTML:

<img alt="Company name" />

Example xpath to get the 'alt' value from the image:

//img/@alt

Thank you so much.  I thought I was going crazy with the Xpath Checker and RSS feed pages.  I will definitely bookmark the link you mentioned.

XPath Checker can lead you down some bad paths.  I think Mike mentioned this earlier in the thread, but worth mentioning again to check out the dev article I posted about XPath (and why XPath Checker type stuff can lead you down bad paths!)

http://devblog.plexapp.com/2012/11/14/xpath-for-channels-the-good-the-bad-and-the-fugly/

Also in that example if you wanted to go to, say, the very first "item", the actual xpath would be (assuming a standard RSS 2.0 feed, where channel sits inside the root rss element):

/rss/channel/item[1]

A single forward slash (/) will take you "top down" to that item, it cannot be nested within something else unless you explicitly give it the full top down path like my previous example.  Two forward slashes (//) will search top down for all instances.  So another way to get the first item would be:

//item[1]

Hope this helps.  That page Sander gave you the link for is very helpful for testing your xpaths.  Not many browsers at all will show you the raw output of an RSS feed so I don't think any of the XPath addons will be of much help there.
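If it helps, here is a small sketch of the single-slash versus double-slash idea on a made-up feed. It assumes Python's standard xml.etree.ElementTree as a stand-in: that library has no leading '/', so './channel/item' evaluated from the root rss element is the nearest equivalent of spelling out the full top-down path, and './/title' is the nearest equivalent of '//title'.

```python
import xml.etree.ElementTree as ET

# Made-up feed: note there is a title on the channel and one on each item.
rss = ET.fromstring(
    '<rss><channel>'
    '<title>Feed title</title>'
    '<item><title>Episode 1</title></item>'
    '<item><title>Episode 2</title></item>'
    '</channel></rss>'
)

# Spelling out every level: only titles that sit directly inside an item.
step_by_step = [t.text for t in rss.findall('./channel/item/title')]

# Searching at any depth: picks up the channel title as well.
anywhere = [t.text for t in rss.findall('.//title')]

# Skipping a level finds nothing, because item is not a direct child of rss.
skipped_level = rss.findall('./item/title')

# The first item, using an xpath position (counted from 1).
first_item_title = rss.findall('./channel/item[1]/title')[0].text

print(step_by_step)
print(anywhere)
print(skipped_level)
print(first_item_title)
```

This is roughly why /item/title came back empty in the checker while a search-everywhere expression returned results.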

Thanks again for that xml checker, it has been very helpful.  But I just wanted to confirm that my syntax is correct for xml pages, since what I enter in the xml xpath checker website appears to be a little different from the actual xpath command I put in my python code.

First off this will be for a URL service, so there will be no looping through several items.  It will just be picking up data from one video on one xml page.  Though I do not think it makes any difference since the first of 1 or 20 entries should be zero, but I thought I would mention it just in case.

So in order to highlight the data in the video field of this xml video page:

    <video>
        <videoHi>http://videos.lstudio.com/high/vogue_ep2_alejandro_HI.f4v</videoHi>
    </video>

I entered

//video/videoHi/text()

in the xml xpath checker website Sanders mentioned.  So to translate that to xpath coding and put that URL in my variable, I would write it as

videoHi = page.xpath("//video/videoHi")[0].text

Is that correct?  I know I am nitpicking and asking a lot of questions on this subject, especially since xpath is so much easier to write for xml documents. But because it is so much easier, there is not as much documentation, detail, or examples of the xml syntax and format, so I just wanted to be sure of the translation.

You could also do :

videoHi = page.xpath("//video/videoHi/text()")[0]

You should get the same result either way.  I personally prefer the latter because it's more consistent with xpath as you noted above.

Thank you Gerk, that is very helpful and it is good to know that I can use the exact coding I put into the XML xpath utility to make sure I have it accurate. I think it is probably good practice to use the second option you showed also.

I am having some trouble getting the proper commands to pull data from fields that have CData tags surrounding the data using the XML xpath utility sanders suggested. I am going to just delete some of the CData tags out of the code as a workaround, since my understanding is that xpath ignores these CDATA tags. But I just want to make sure that CData tags are always ignored by xpath.

Also, I am still not sure I am returning the right values for image tags. When I get the xpath command set up properly to get into the field and to the image tag, I added /img/@src to the end of my xpath command.   It returns the list of values like this:

1. src="http://www.domainname.com/image1.jpg"

2. src="http://www.domainname.com/image2.jpg"

3. src="http://www.domainname.com/image3.jpg"

instead of just listing the http address for the images like this:

1. http://www.domainname.com/image1.jpg

2. http://www.domainname.com/image2.jpg

3. http://www.domainname.com/image3.jpg

I guess the most important question is what format do you want to return for the value of images? When you are returning the value of the metadata object attributes thumb and art, do you want it to return that value as src="http://www.domainname.com/image1.jpg" or as just http://www.domainname.com/image1.jpg? Would this also apply to links?  Do you want the value for the metadata object attribute url returned as href="http://www.domainname.com/page.html" or as just http://www.domainname.com/page.html?

I am having some trouble getting the proper commands to pull data from fields that have CData tags surrounding the data using the XML xpath utility sanders suggested. I am going to just delete some of the CData tags out of the code as a workaround, since my understanding is that xpath ignores these CDATA tags.

You shouldn't have to manually remove any CDATA tag from the code, Xpath ignores it. If something fails I think the problem lies elsewhere, maybe broken xml or namespaces. Can you post a link to one of the XML files you're having trouble with processing?

Also, I am still not sure I am returning the right values for image tags. When I get the xpath command set up properly to get into the field and to the image tag, I added /img/@src to the end of my xpath command.   It returns the list of values like this:

1. src="http://www.domainname.com/image1.jpg"

2. src="http://www.domainname.com/image2.jpg"

3. src="http://www.domainname.com/image3.jpg"

instead of just listing the http address for the images like this:

1. http://www.domainname.com/image1.jpg

2. http://www.domainname.com/image2.jpg

3. http://www.domainname.com/image3.jpg

This is with the online utility I linked to? If so, then that's normal, that utility marks the attribute and its value. If you were to use this xpath query in your Python code you would just get the value back.

Thanks for the clarification about the XML utility and its output list.  That makes me feel better.

You shouldn't have to manually remove any CDATA tag from the code, Xpath ignores it. If something fails I think the problem lies elsewhere, maybe broken xml or namespaces. Can you post a link to one of the XML files you're having trouble with processing?

No the problem is not with the xpath I use in my actual channel code. Sorry if I was confusing.  The problem is with using the website at this link (http://chris.photobooks.com/xml/default.htm) to determine the proper xpath to use in my channel coding. When I use that site to determine the xpath commands and I enter the xml document I am trying to extract data from into the "XML Input" block, I have to then delete the CDATA tags from that code that I entered in the "XML Input" field for the site to show me the proper list of results for the xpath I have entered in the "Xpath Input" field.

It is just a quirk of that online utility.  It doesn't ignore CDATA tags.  Like you pointed out the utility also shows the attribute name in the output list.

xpath in itself doesn't ignore CDATA tags, but the Plex Framework does.  So you don't have to worry about that for inside your channels.
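As a rough illustration of what "ignoring CDATA" looks like in practice, here is a sketch using Python's standard xml.etree.ElementTree. That is my stand-in, not the Plex Framework, so take it as an assumption that the framework behaves similarly. The item contents are made up. It also shows one way to deal with a CDATA block that holds several pieces of information: the block comes back as one string, which can then be parsed a second time.

```python
import xml.etree.ElementTree as ET

# Made-up item: the description holds markup wrapped in CDATA.
item = ET.fromstring(
    '<item>'
    '<title><![CDATA[Tom & Jerry <Episode 1>]]></title>'
    '<description><![CDATA['
    '<img src="http://www.domainname.com/image1.jpg"/><p>Summary</p>'
    ']]></description>'
    '</item>'
)

# .text hands back the content without the CDATA markers.
title = item.findall('./title')[0].text

# The markup inside the CDATA is plain text, not child elements.
description_element = item.findall('./description')[0]
description = description_element.text
child_count = len(list(description_element))

# To pick pieces out of it, parse the string again. The wrapping div is
# only there to give the fragment a single root element.
fragment = ET.fromstring('<div>' + description + '</div>')
thumb = fragment.findall('./img')[0].get('src')
summary = fragment.findall('./p')[0].text

print(title)
print(child_count)
print(thumb, summary)
```

The second parse only works here because the made-up fragment is well-formed. Real feed descriptions often are not, and would need a more forgiving HTML parser.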

OK, that is good to know. 

You could also do :

videoHi = page.xpath("//video/videoHi/text()")[0]

You should get the same result either way.  I personally prefer the latter because it's more consistent with xpath as you noted above.

I created a channel bundle locally on my computer and entered some xpath commands using this second method shown above in a channel __init__.py file and my Roku would not recognize the channel bundle until I went back in and changed the xpath coding in the __init__.py to the first method. 

Why would this happen?