XPATH for Dummies

shopgirl284 · October 2, 2013, 4:16pm

When I first started learning xpath, I had lots of issues grasping the concept, so I went back and looked at my numerous questions on the subject and created this document below. (And thank you to Mike and Sander for answering those many many questions for me)

I found that the w3 xpath tutorial (http://www.w3schools.com/xpath/default.asp) for how to create xpath commands is the best place to start and figure out how the xpath works with the elements within a web page. And the Plex dev blog on xpath (http://devblog.plexapp.com/2012/11/14/xpath-for-channels-the-good-the-bad-and-the-fugly/) is great at showing you how to tweak your xpath for it to work best in Plex code and withstand site changes, once you understand the basic concepts of xpath.

This document is more of a step in between those two that bridges the gap for those like myself who had trouble grasping how to add xpath to channel code. It is meant to explain the basics of how to incorporate those xpath commands and use them in combination with python code and the Plex API in your Plex channel.

If you see any errors or ways the document can be improved, please let me know.

shopgirl284 · October 2, 2013, 4:31pm

Using Xpath in Plex Channel Development

Introduction

When creating a channel, you will need to pull information from the website. This includes information about the different sections of available videos, like their titles and thumbnails, as well as assorted data about the individual videos, like the thumbnail, title, description, duration, and date. This information, which is usually referred to as metadata, is used to make your channel easier to navigate and to give the end user as much information as possible about the videos you have made available in your channel.

Before beginning, it is helpful to review the xpath tutorial below to help you understand xpath and the relationship of nodes within a document and parent and child elements. http://www.w3schools.com/xpath/

This document is not intended to replace the above tutorial or explain all the possible xpath commands, their structure, or how they correspond to the web page code. This document just explains the basics of how to add these xpath commands within your Plex plugin code along with the Plex API and python code.

Pulling the Elements from the Web Page

When using xpath commands, you want to pull data that is located within specific elements or tags that appear in the source code of a web page. You first have to use Plex parsing API commands to put that page source in the right format. Since xpath commands reference the elements or tags of a web page structure, the data has to be pulled in an element tree structure in order to use xpath commands.

You will usually use the ElementFromURL Plex parsing APIs. The format of the page you are pulling the data or code from determines which parsing API you use (Ex: If it is an HTML page, use HTML.ElementFromURL(url), or if it is an XML page, use XML.ElementFromURL(url)). The ElementFromURL APIs pull the page source from the website directly into an element tree structure.

page = HTML.ElementFromURL(url)

In the example above, the variable “page” holds the element tree structure of the webpage “url” as an object value. Later when pulling particular elements and the data in those elements with xpath commands, each command will start with page.xpath to show you want it to come from elements in the URL you pulled with the Plex parsing API you used above.

Sometimes you may also need to access the contents of a page as a string as well as an element tree structure to use with xpath commands. For example, you may want to use RegEx to pull an id number or other specific piece of data out of a script in the web page and also pull an ordered list of names and descriptions that are in an <ol> element. In that case you would first call HTTP.Request, which just pulls the source code from the URL as a string. Then you would use ElementFromString to put that raw data string into an element tree structure.

content = HTTP.Request(url).content
page = HTML.ElementFromString(content)

In the example above, the variable “content” holds all the data from the web page as a string. The variable “page” holds the element tree structure of that string “content” as an object value. Later in your code, you can search the variable content for particular strings as well as pulling particular elements and the data in those elements with xpath commands, by using page.xpath.
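As a rough sketch of the same two-step idea outside of Plex, the snippet below uses Python's standard re and xml.etree.ElementTree modules as stand-ins for HTTP.Request and HTML.ElementFromString. The page source, the videoId script variable, and the links are all made up for illustration, and ElementTree only supports a small subset of xpath, so the text is read with .text rather than text().

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page source, standing in for what HTTP.Request would return.
content = """<html><body>
<script>var videoId = "12345";</script>
<ol>
  <li><a href="/show/1">First show</a></li>
  <li><a href="/show/2">Second show</a></li>
</ol>
</body></html>"""

# 1) Treat the source as a plain string and pull the id out of the script with RegEx.
video_id = re.search(r'videoId = "(\d+)"', content).group(1)

# 2) Parse the same string into an element tree (the job HTML.ElementFromString does in Plex).
page = ET.fromstring(content)
names = [a.text for a in page.findall('.//ol/li/a')]
```

The point is only that one download can feed both a string search and an element tree, so the page does not have to be requested twice.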

Pulling Elements with Xpath

The Plex parsing API in the example above is then followed by variables that use xpath commands to get the specific data you want from inside a specific element or group of elements on the page at the url used in the parsing API. A call to xpath without [0] at the end returns a list of everything that matches your xpath command. To pull just the first match and put it in a single variable, put [0] at the end of your xpath call.

title = page.xpath('//div/a/text()')

In the example above, it looks through the elements of the object you stored in the variable page. Whenever it finds a <div> tag with a child <a> (anchor) tag that has text in it, it stores that text, so the variable title ends up holding a list of strings.

title = page.xpath('//div/a/text()')[0]

In the example above, it looks through the same elements, but the [0] keeps only the first occurrence of a <div> tag with a child <a> (anchor) tag that has text in it, so the variable title holds a single string. If there are no matches at all, the [0] raises a "list index out of range" error.
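To see the difference between the list and the single value, here is a small sketch that runs without Plex. It assumes a made-up two-link page and uses the standard library's xml.etree.ElementTree as a stand-in for the Plex parser, so findall plays the role of page.xpath and .text plays the role of text().

```python
import xml.etree.ElementTree as ET

# Made-up page with two <div><a> pairs.
page = ET.fromstring(
    '<body>'
    '<div><a href="/one">One</a></div>'
    '<div><a href="/two">Two</a></div>'
    '</body>'
)

# Without an index you get a list holding every match.
titles = [a.text for a in page.findall('.//div/a')]

# With [0] you get only the first match as a single string.
title = titles[0]

# A search with no matches still returns a list, just an empty one,
# which is why [0] on it fails with "list index out of range".
missing = page.findall('.//span')
```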

Looping through a List of Elements with Xpath to Pull Single String Variables Within It

Usually, you will want to pull a list of elements that contain data and then loop through that list to get the individual strings within each one.

for item in page.xpath('//ol/li'):
    link = item.xpath('./a/@href')[0]

So in the example above, a list is created of all the <li> elements that are children of an <ol> tag. If that occurs five times within a document, the for loop will go through all five of those occurrences. In each occurrence, it executes the xpath code for the link variable by looking for the first <a> tag with an href value that is a direct child of that particular <li> element.
Sometimes you may want to pull variables that are not a direct child element of your loop item. In that case, the xpath code for your variable starts with ".//" instead of "./" to show it can be anywhere inside the loop item. Keep the leading dot: a path that starts with a bare "//" searches the whole document again, so every pass through the loop would return the first link on the page.
```
for item in page.xpath('//ol/li'):
    link = item.xpath('.//a/@href')[0]
```
In the example above, a list is created of all the <li> elements that are children of an <ol> tag. If that occurs five times within a document, the for loop will go through all five of those occurrences. In each occurrence, it executes the xpath code for the link variable by looking for the first <a> tag with an href value that is anywhere within that particular <li> element, whether or not it is a direct child of that <li> element.
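The sketch below compares the two relative forms inside a loop. In standard xpath, "./a" only looks at direct children of the current item, ".//a" looks at any depth inside it, and a path beginning with a bare "//" goes back to the top of the document. The page is invented for the example, and xml.etree.ElementTree from the standard library stands in for the Plex parser (it does not accept a bare "//" on an element at all, so only the two relative forms are shown).

```python
import xml.etree.ElementTree as ET

# Made-up page: one link outside the list, one direct link, one nested link.
page = ET.fromstring(
    '<body><a href="/home">Home</a><ol>'
    '<li><a href="/direct">direct</a></li>'
    '<li><p><a href="/nested">nested</a></p></li>'
    '</ol></body>'
)

direct_links = []
nested_links = []
for item in page.findall('.//ol/li'):
    direct = item.findall('./a')      # direct children of this <li> only
    anywhere = item.findall('.//a')   # any depth, but still inside this <li>
    direct_links.append([a.get('href') for a in direct])
    nested_links.append([a.get('href') for a in anywhere])
```

Neither form picks up the /home link, because both paths start from the current list item rather than from the top of the page.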
Testing your xpath

The best way to make sure you are getting the elements you want from a document is to put the xpath commands you want to use in an xpath tester. Firefox has a built in xpath tester for HTML documents, but this tester does not work with XML. To test XML xpath, you need to use an alternate method like this XML utility: http://chris.photobooks.com/xml/default.htm. You only enter the data that occurs between the single quotes, and if you are pulling from a list, you need to enter the xpath for the list and the individual item together. In the last example above, you would enter //ol/li//a/@href in the xpath tester. Also, positions inside an xpath start at 1, while a Python list index starts at zero. So if you are trying to find the third <div> element in a document, in the xpath checker that would be (//div)[3], but on the list Python gets back you would write page.xpath('//div')[2]. (Written as //div[3] without the parentheses, it means any <div> that is the third <div> inside its own parent.)
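A quick way to convince yourself of the two counting schemes is the sketch below. It assumes a made-up page with three <div> tags and uses the standard library's xml.etree.ElementTree in place of the Plex parser. The number inside the xpath string counts from 1, while the index applied to the returned Python list counts from 0.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    '<body><div>first</div><div>second</div><div>third</div></body>'
)

# Inside the xpath string, positions are 1-based: div[3] is the third <div> child.
third_by_xpath = page.find('./div[3]').text

# On the Python list that comes back, indexes are 0-based: [2] is the third match.
third_by_python = page.findall('./div')[2].text
```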
Important Note for List

When you are pulling a list of elements from a page, you need to make sure you are pulling that list in a way that gets all the occurrences or matches, so it will go through all the individual items you want to pull. So if you want to find five anchors within a particular area of code, you have to make sure the list portion of the xpath finds at least five matches. Let's use this as an example:

for item in page.xpath('//ol/li'):
    link = item.xpath('./div/a/@href')[0]

Say the document has only one <ol> tag with a single child <li> tag, and that <li> has five <div> tag elements inside of it that each have an <a> tag element. In the example above, your loop will not work properly, because it will only go through the one list item of //ol/li and pick up the first occurrence of ./div/a within it. If you want to pick up all of the <a> tag elements that occur within that list item, you have to change your code so that the list has five occurrences by including the <div> tag as a child element in the list portion of the code. Like this:

for item in page.xpath('//ol/li/div'):
    link = item.xpath('./a/@href')[0]

The best way to ensure this is to use your xpath checker to make sure the list portion of the xpath returns at least as many matches as the number of occurrences you want returned.
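Here is a small sketch of that situation, assuming an invented page with one <li> that holds five <div> tags. It uses the standard library's xml.etree.ElementTree instead of the Plex parser, so findall stands in for page.xpath and .get('href') stands in for @href.

```python
import xml.etree.ElementTree as ET

# Made-up page: a single <li> containing five <div><a> pairs.
page = ET.fromstring(
    '<body><ol><li>'
    + ''.join('<div><a href="/video/%d">Video %d</a></div>' % (n, n) for n in range(1, 6))
    + '</li></ol></body>'
)

# Looping over ol/li: only one list item, so the loop runs once
# and [0] keeps just the first link inside it.
coarse = [item.findall('./div/a')[0].get('href') for item in page.findall('.//ol/li')]

# Looping over ol/li/div: five matches, and each one yields its own link.
fine = [item.findall('./a')[0].get('href') for item in page.findall('.//ol/li/div')]
```

Comparing the length of each result with the number of videos you can see on the page is the same check the xpath tester gives you.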
  
There can also be errors if you choose xpath for your list that loops through more elements than you want to pull, or elements that do not contain the child elements in your variable. Using the example above again:

for item in page.xpath('//ol/li/div'):
    link = item.xpath('./a/@href')[0]

Let's say there are six occurrences of an <ol> tag with a child <li> tag that has a child <div> tag. The code will loop through all six occurrences. If even one of those <div> tags does not have a child <a> tag, your code will return an error that the list index is out of range, because [0] is being applied to an empty list.

If you are unable to change the xpath you use for your loop to only loop through elements that contain the child elements you want to pull, you can add a try/except option to your code.
  
for item in page.xpath('//ol/li/div'):
    try:
        link = item.xpath('./a/@href')[0]
    except IndexError:
        # No <a> child in this <div>, so skip to the next item
        continue
    title = item.xpath('./a/text()')[0]

In the example above, if on the third loop that particular <div> element does not have a child <a> element, the continue command tells it to skip any further xpath variable pulls below it for that particular <div> element and go directly to the fourth loop. If you replace the word "continue" with "pass", it would not pick up the link variable for that loop, but would go on to the title line below instead of skipping ahead.
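The sketch below runs the same try/except pattern on an invented page where the middle <div> is an advert with no link. It uses the standard library's xml.etree.ElementTree as a stand-in for the Plex parser, and the titles and links are made up.

```python
import xml.etree.ElementTree as ET

# Made-up page: the second <div> has no <a> child.
page = ET.fromstring(
    '<body><ol><li>'
    '<div><a href="/one">One</a></div>'
    '<div>Advert with no link</div>'
    '<div><a href="/three">Three</a></div>'
    '</li></ol></body>'
)

videos = []
for item in page.findall('.//ol/li/div'):
    try:
        link = item.findall('./a')[0].get('href')
    except IndexError:
        # Empty result list: nothing to pull here, move on to the next <div>.
        continue
    title = item.findall('./a')[0].text
    videos.append((title, link))
```

With pass in place of continue, the loop would carry on to the title line for the advert, hit the same empty list there, and stop with an error.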
  Options for Writing Xpath Code:
  
  If you look at examples of Plex channels, you will see two different options for how to word your xpath code.
  
Example 1:
title = item.xpath('./title')[0].text
url = item.xpath('./a')[0].get('href')

Example 2:
title = item.xpath('./title/text()')[0]
url = item.xpath('./a/@href')[0]

I tend to use the second method, because it can be entered directly into the xpath checker and give you the proper results, so when you have the right code, you can just cut it from the xpath checker and paste it directly into your channel code. This also gives you the option to use some of the variations of this code. For example, if an element has text both directly inside it and inside its child tags, using /text() at the end will return only the text that sits directly inside the element, whereas using //text() will also return the text held in its child tags. See the post further down in this topic for more information on this.
  
  When a document contains CDATA:
  
  If a document contains CDATA tags around the text that you are trying to pull, the text() portion of the xpath command automatically removes the CDATA tag from the data returned. Note: Xpath does not ignore these CDATA tags, but the Plex Framework does, so when using an xpath checker, these CDATA tags will not be ignored.
  
  Pulling data within data:
  
Every once in a while you will encounter a document where a field you pull with xpath contains more elements. For example, you may have an XML document that has a description element that contains an image element inside it like:

<description>Here is the description for the video that also contains an image <image>http://wwwwebsite.com/image.jpg</image></description>

In that case, once you pull the outer data in as a string, you would then use HTML.ElementFromString(str) to create elements from that string and then use xpath to pull the inner element contained in the string.

summary = page.xpath('//description/text()')[0]
data = HTML.ElementFromString(summary)
image = data.xpath('//image/text()')[0]
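A minimal sketch of the same parse-twice idea, using the standard library's xml.etree.ElementTree rather than the Plex parser. It assumes the inner markup arrives wrapped in CDATA (so it reaches you as plain text), uses a made-up feed and URL, and adds a hypothetical <root> wrapper because ElementTree needs a single outer element.

```python
import xml.etree.ElementTree as ET

# Made-up feed item whose description holds markup as text.
feed = ET.fromstring(
    '<item><description><![CDATA['
    'Here is the description for the video '
    '<image>http://example.com/image.jpg</image>'
    ']]></description></item>'
)

# First pass: the description comes back as one plain string.
summary = feed.find('./description').text

# Second pass: parse that string as its own small document.
data = ET.fromstring('<root>%s</root>' % summary)
image = data.find('./image').text
```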
  
  When an XML document contains namespace data:
  
Some XML documents contain data within the items or entries that require namespace info. You will recognize these lines because they contain a colon within the entry field name. To pull data that has a namespace, you first have to find the namespace associated with that piece of data. This is usually defined at the top of the XML document. Then include a variable for that namespace in your code, and pass that variable as the namespaces argument at the end of the xpath command for that pull.

NAMESPACES = {'media': 'http://search.yahoo.com/mrss/'}
media = page.xpath('./media:content//@url', namespaces=NAMESPACES)[0]
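For a version you can run without Plex, the sketch below applies the same namespace dictionary with the standard library's xml.etree.ElementTree. The feed is invented, and the attribute is read with .get('url') because ElementTree does not support @url as a path step.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {'media': 'http://search.yahoo.com/mrss/'}

# Made-up feed that declares the media namespace at the top.
feed = ET.fromstring(
    '<rss xmlns:media="http://search.yahoo.com/mrss/"><channel><item>'
    '<title>Episode 1</title>'
    '<media:content url="http://example.com/ep1.mp4" />'
    '</item></channel></rss>'
)

urls = []
for item in feed.findall('./channel/item'):
    # The prefix before the colon is looked up in the NAMESPACES dictionary.
    content = item.find('./media:content', NAMESPACES)
    urls.append(content.get('url'))
```

The prefix in your code only has to match the key in your own dictionary. What must match the document is the namespace URL.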
  
  Using the contain option in xpath:
  
  Sometimes you will come across HTML documents where you want to specify the class or id but you want more than one variation for this identifier to be included in the matches. For those situations you can use the contains option to make the class or id for your xpath pull a little more vague.
  
for items in page.xpath('//ol/li[contains(@id,"carousel")]'):

You can also use the contains option to look for multiple strings within the attribute of a tag using and/or.
  
  url = html.xpath('//li[contains(@class,"navigation") and contains(@class,"right")]/a/@href')[0]
  
  For more info on choosing the best format for your xpath code and ensure you are using the most reliable path to the data, see the following tutorial: http://devblog.plexapp.com/2012/11/14/xpath-for-channels-the-good-the-bad-and-the-fugly/

kurnazbahadir · October 3, 2013, 5:33pm

Hi,

Thank you very much for your effort. I hope it will continue.

BR

shopgirl284 · November 9, 2013, 12:09am

I just found a new trick for using the contains command in xpath. When you want to find two separate strings that occur in an attribute of a tag, you just write two different contains and use and/or in between.

Example:

url = html.xpath('//li[contains(@class,"-navigation") and contains(@class,"-right")]/a//@href')[0]

The example above will look for an <li> tag where class contains the strings '-navigation' and '-right'. This also works with or.

One thing I found is that there can be no overlap in the characters used in each string. In the example above, I first tried to use:

'-navigation-'

but found that it did not work because in some cases within the code of the web pages I was trying to access, the two strings overlapped and the class attribute of the <li> tag contained:

'-navigation-right'

It did not work in my code until I changed the first contains statement to '-navigation' to avoid any overlaps.

shopgirl284 · December 7, 2013, 11:45pm

Another item that may be helpful is when to use a single or double slash between the parent and child tags of your xpath. It is mentioned in the tutorials, but it doesn't go into much detail of why the different methods are used.

In xpath, if you use a double slash between two tags, it means the tag that follows the double slash can be anywhere within the parent tag that appears right before it (meaning it could be a child of a child of a child...). Using a single slash between two tags means the tag that follows the slash must be a direct child of the parent tag that appears right before it.

For example, if you are trying to pull the image from the following code:

<div class="frame"><div><a href="..."><img src="..."></a></div></div>

and you need to include that main parent div tag (with class="frame") in your xpath, there are two ways you could write the xpath:

img = data.xpath('//div[@class="frame"]/div/a/img/@src')[0]
OR
img = data.xpath('//div[@class="frame"]//img/@src')[0]

This would also apply to a loop. Using the example above, there are two ways you could write the xpath to pull the image in a loop:

for image in data.xpath('//div[@class="frame"]'):
    img = image.xpath('./div/a/img/@src')[0]
    # OR
    img = image.xpath('.//img/@src')[0]

Which method to use depends on the coding of the page. Using the double slash can shorten your xpath and make your code more resistant to changes on a web page, but you will need to make sure there are not multiple matches for the child tags that could cause issues and require your code to be more specific.
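To make the trade-off concrete, here is a sketch that runs both styles against an invented page and then against a changed version of it where the designer has added a <span> wrapper. It uses the standard library's xml.etree.ElementTree in place of the Plex parser, so the src value is read with .get('src') rather than @src.

```python
import xml.etree.ElementTree as ET

# Made-up page as it looks today.
data = ET.fromstring(
    '<body><div class="frame"><div>'
    '<a href="/video"><img src="/thumb.jpg" /></a>'
    '</div></div></body>'
)
exact = data.find(".//div[@class='frame']/div/a/img").get('src')
loose = data.find(".//div[@class='frame']//img").get('src')

# The same page after a redesign wraps the link in a <span>.
changed = ET.fromstring(
    '<body><div class="frame"><div><span>'
    '<a href="/video"><img src="/thumb.jpg" /></a>'
    '</span></div></div></body>'
)
exact_after = changed.find(".//div[@class='frame']/div/a/img")      # no match any more
loose_after = changed.find(".//div[@class='frame']//img").get('src')
```

The single-slash path stops matching after the redesign, while the double-slash path keeps working. The price is that the double-slash path would also match any other image that later appears inside the frame.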

Whether to use a single or double slash as the default format for text is debatable (Ex. //div//text() vs //div/text()). Many tend to use the .text option at the end of the xpath call instead of choosing a single or double slash (Ex. html.xpath('//div')[0].text). I like being able to type the whole xpath phrase in an Xpath Checker like Firefox's, so I do not use these extensions. I tend to use the double slash before text as my default so if the website designer later adds bolding or another child tag within the tag I pull the text from, I will still get all the text held in that tag. Others say you should always use the single slash with text as your default in case the website designer later adds more text in child tags that you do not want included in your xpath pull for that tag.

Whichever method you use, knowing that different data may be returned based on the method you use for returning text with xpath and testing those methods in your code can often be helpful when writing your Plex channel.
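The difference is easiest to see on a tag that mixes plain text with a bolded child. The sketch below is only an approximation, since the standard library's xml.etree.ElementTree has no text() step: .text is used as a rough counterpart of taking the first result of /text(), and itertext() as a rough counterpart of joining the results of //text(). The page content is made up.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    '<body><div>Plain text <b>bold text</b> more text</div></body>'
)
div = page.find('./div')

# Text sitting directly inside the <div>, before its first child tag.
direct_text = div.text

# All text inside the <div>, including what is held in child tags.
all_text = ''.join(div.itertext())
```

In real xpath, //div/text() on this page would return both pieces of direct text ('Plain text ' and ' more text') as separate list items, so [0] would still give only the first piece.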

eetjtl · April 4, 2016, 10:04pm

ok probably a dumb question…
but in trying find a fix for cbc.ca channel I have come upon an issue
I would like to start with 2nd item in
for item in page.xpath('//li[contains(@class,"medialist-item")]'):

If I do for item in page.xpath('//li[contains(@class,"medialist-item")]')[2]: it only selects 2nd.

Thanks trying to learn a little python.

Twoure · April 4, 2016, 10:15pm

Not sure what you mean but you could try to enumerate the list, and then add an if statement to check the list position.

for i, item in enumerate(page.xpath('//li[contains(@class, "medialist-item")]')):
    if i > 0:
        pass  # do all the things

In a python list, the 0 is the first element, 1 is 2nd, and so on. Example:

test = ['yo', 'hi', 'wut?']
print test[0]
#result = yo
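Another option, if the goal is simply to start from the second item, may be a list slice. The sketch below uses a plain list of made-up strings as a stand-in for the list that page.xpath(...) returns. A single index such as [2] picks out one element, while a slice such as [1:] gives back a shorter list you can still loop over.

```python
# Stand-in for the list of elements that page.xpath(...) would return.
items = ['first', 'second', 'third']

# A single index returns one element, not a list to loop over.
single = items[2]

# A slice starting at 1 skips the first item and keeps the rest.
kept = []
for item in items[1:]:
    kept.append(item)
```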

eetjtl · April 4, 2016, 11:25pm

that’s the ticket thanks

eetjtl · April 4, 2016, 11:49pm

but it adds a lot of time to do the count of items.

shopgirl284 · April 6, 2016, 3:20pm

Sometimes you can find a value, like a class or id, in the item you want to skip or a value that is missing in those items you want to skip, like if you need to skip ad items. Then you can add either a “and contains” or a “and not contains” code to your xpath to narrow your list, like I show in this post - forums.plex.tv/discussion/comment/495325/#Comment_495325

Otherwise, you can also check for something that is not in the items you do not want to include as soon as you enter the loop, like a try on a url xpath pull with except continue, as in the try/except example in the main post.

eetjtl · April 7, 2016, 3:30am

Thanks solution was changing to a double slash and an if not statement to filter out the unwanted one.

system · December 21, 2019, 12:11am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
