XPATH for Dummies

When I first started learning xpath, I had lots of issues grasping the concept, so I went back and looked at my numerous questions on the subject and created the document below. (And thank you to Mike and Sander for answering those many, many questions for me.)

 

I found that the W3Schools xpath tutorial (http://www.w3schools.com/xpath/default.asp) is the best place to start for learning how to create xpath commands and figuring out how xpath works with the elements within a web page. Once you understand the basic concepts of xpath, the Plex dev blog post on xpath (http://devblog.plexapp.com/2012/11/14/xpath-for-channels-the-good-the-bad-and-the-fugly/) is great at showing you how to tweak your xpath so it works best in Plex code and withstands site changes.

 

This document is more of a step in between those two. It bridges the gap for those who, like me, had trouble grasping how to add xpath to their channel code. It is meant to explain the basics of how to incorporate those xpath commands and use them in combination with Python code and the Plex API in your Plex channel.

 

If you see any errors or ways the document can be improved, please let me know.

 

 


Using Xpath in Plex Channel Development

Introduction

When creating a channel, you will need to pull information from the website. This includes information about the different sections of available videos, like their titles and thumbnails, as well as assorted data about the individual videos, like the thumbnail, title, description, duration, and date. This information, which is usually referred to as metadata, is used to make your channel easier to navigate and to give the end user as much information as possible about the videos you have made available in your channel.

Before beginning, it is helpful to review the xpath tutorial below to help you understand xpath and the relationship of nodes within a document and parent and child elements. http://www.w3schools.com/xpath/

This document is not intended to replace the above tutorial or explain all the possible xpath commands, their structure, or how they correspond to the web page code. This document just explains the basics of how to add these xpath commands within your Plex plugin code along with the Plex API and python code.

Pulling the Elements from the Web Page

When using xpath commands, you want to pull data that is located within specific elements or tags that appear in the source code of a web page. You first have to use Plex parsing API commands to put that page source in the right format. Since xpath commands reference the elements or tags of a web page structure, the data has to be pulled in an element tree structure in order to use xpath commands. 

You will usually use the ElementFromURL Plex parsing APIs. The format of the page you are pulling the data or code from will determine whether you use the XML, HTML, JSON, etc. parsing APIs. (Ex: If it is an HTML page, use HTML.ElementFromURL(url), or if it is an XML page, use XML.ElementFromURL(url).) The ElementFromURL APIs pull the page source from the website directly into an element tree structure.

page = HTML.ElementFromURL(url)

In the example above, the variable “page” holds the element tree structure of the webpage “url” as an object value. Later when pulling particular elements and the data in those elements with xpath commands, each command will start with page.xpath to show you want it to come from elements in the URL you pulled with the Plex parsing API you used above.

Sometimes you may need to access the contents of a page as a string as well as an element tree structure. For example, you may want to use RegEx to pull an id number or other specific piece of data out of a script in the web page and also pull an ordered list of names and descriptions that are in an <ol> element. In that case you would first call HTTP.Request, which just pulls the source code from the URL as a string. You would then use ElementFromString to put that raw data string into an element tree structure.

content = HTTP.Request(url).content
page = HTML.ElementFromString(content)

In the example above, the variable “content” holds all the data from the web page as a string. The variable “page” holds the element tree structure of that string “content” as an object value. Later in your code, you can search the variable content for particular strings, and you can also pull particular elements and the data in those elements with xpath commands by using page.xpath.
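Here is a rough sketch of that string-plus-tree idea that you can run outside of Plex. The Plex API is not available in plain Python, so this uses the standard library's xml.etree.ElementTree in place of HTML.ElementFromString, and the page source and the videoId script are made up for illustration. ElementTree only understands a small subset of xpath, so treat it as a stand-in and not as how Plex does it.

```python
import re
import xml.etree.ElementTree as ET

# Made-up page source standing in for what HTTP.Request(url).content would hold.
content = """<html><body>
<script>var videoId = 12345;</script>
<ol>
<li><a href="/video/1">First video</a></li>
<li><a href="/video/2">Second video</a></li>
</ol>
</body></html>"""

# 1) Search the raw string with a regular expression.
match = re.search(r'videoId = (\d+)', content)
video_id = match.group(1) if match else None

# 2) Parse the same string into an element tree and query it.
page = ET.fromstring(content)
names = [a.text for a in page.findall('.//ol/li/a')]

print(video_id)
print(names)
```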

Pulling Elements with Xpath

The Plex parsing API call in the example above is then followed by variables that use xpath commands to get the specific data you want from inside a specific element or group of elements in that page. An xpath command always returns a list of everything that matches. To pull just the first match and put it in a single variable, put [0] at the end of your xpath command. If you use the command without [0] at the end, the variable holds a list of all the matches for your xpath command.

title = page.xpath('//div/a/text()')

In the example above, it looks through the elements of the object you stored in the variable page. Whenever it finds a <div> tag with a child <a> (anchor) tag that has text in it, it stores that text, so the variable called title ends up as a list of strings.

title = page.xpath('//div/a/text()')[0]

In the example above, it looks through the elements of the object you stored in the variable page. When it finds the first occurrence of a <div> tag with a child <a> (anchor) tag that has text in it, it stores that value in the variable title as a single string.
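To see the difference between the list and the single value, here is a small sketch you can run in plain Python. It assumes the standard library's xml.etree.ElementTree as a stand-in for the Plex parser (findall with a path starting with .// plays the role of page.xpath('//...'), and .text plays the role of /text()). The sample page is made up.

```python
import xml.etree.ElementTree as ET

# Made-up page with two <div> tags that each hold an anchor.
page = ET.fromstring(
    '<body>'
    '<div><a href="/one">First title</a></div>'
    '<div><a href="/two">Second title</a></div>'
    '</body>'
)

# Without an index you get a list of every match.
titles = [a.text for a in page.findall('.//div/a')]

# With [0] you get only the first match as a single value.
title = page.findall('.//div/a')[0].text

# If nothing matches, the list is empty, so adding [0] would raise IndexError.
missing = page.findall('.//span/a')

print(titles, title, missing)
```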

Looping through a List of Elements with Xpath to Pull Single String Variables Within It

Usually, you will want to pull a list of elements that contain data and then loop through that list to get each occurrence of an individual string within it.

for item in page.xpath('//ol/li'):
    link = item.xpath('./a/@href')[0]

So in the example above, a list is created of all the elements that start with the <ol> tag and contain a child element of a <li> tag. If that occurs five times within a document, the for loop will go through all five of those occurrences. In each occurrence, it executes the xpath code for the link variable by looking for the first occurrence of an <a> tag with an href value that is a child element of that particular <li> element.
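The loop can be tried outside of Plex with something like the sketch below. It assumes the standard library's xml.etree.ElementTree instead of the Plex parser, so .get('href') is used where the Plex code uses /@href, and the sample list is invented.

```python
import xml.etree.ElementTree as ET

# Made-up ordered list with three items.
page = ET.fromstring(
    '<body><ol>'
    '<li><a href="/video/1">One</a></li>'
    '<li><a href="/video/2">Two</a></li>'
    '<li><a href="/video/3">Three</a></li>'
    '</ol></body>'
)

links = []
for item in page.findall('.//ol/li'):
    # "./a" is relative to the current <li>, so each pass only sees its own anchor.
    link = item.findall('./a')[0].get('href')
    links.append(link)

print(links)
```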

Sometimes you may want to pull variables that are not a direct child element of your loop. In that case, the xpath code for your variable will start with “.//” instead of “./” to show it is somewhere inside the loop element but not a direct child of it. Keep the period at the front. An xpath that starts with just two slashes searches from the top of the whole document, so every loop would return the same first match from the page.

for item in page.xpath('//ol/li'):
    link = item.xpath('.//a/@href')[0]

In the example above, a list is created of all the elements that start with the <ol> tag and contain a child element of a <li> tag. If that occurs five times within a document, the for loop will go through all five of those occurrences. In each occurrence, it executes the xpath code for the link variable by looking for the first occurrence of an <a> tag with an href value that is within that particular <li> element but is not necessarily a direct child of that <li> element.
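The sketch below shows the difference between a direct child path and a path that looks anywhere inside the loop item. It assumes the standard library's xml.etree.ElementTree as a stand-in for the Plex parser, and the sample page (with each anchor wrapped in a <div>) is made up.

```python
import xml.etree.ElementTree as ET

# Made-up list where each anchor sits inside a <div>, not directly in the <li>.
page = ET.fromstring(
    '<body><ol>'
    '<li><div><a href="/video/1">One</a></div></li>'
    '<li><div><a href="/video/2">Two</a></div></li>'
    '</ol></body>'
)

direct_counts = []
nested_links = []
for item in page.findall('.//ol/li'):
    # "./a" only looks at direct children of the <li>, so it finds nothing here.
    direct_counts.append(len(item.findall('./a')))
    # ".//a" looks anywhere below the current <li>.
    nested_links.append(item.findall('.//a')[0].get('href'))

print(direct_counts, nested_links)
```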

Testing your xpath

The best way to make sure you are getting the elements you want from a document is to put the xpath commands you want to use in an xpath tester. Firefox has a built in xpath tester for HTML documents, but this tester does not work with XML. To test XML xpath, you need to use an alternate method like this XML utility: http://chris.photobooks.com/xml/default.htm. You only enter the data that occurs between the single quotes, and if you are pulling from a list, you need to enter the xpath for the list and the individual item together. In the last example above, you would enter //ol/li//a/@href in the xpath tester.

Also, keep in mind that positions inside an xpath command start at 1, while a Python list index starts at zero. So if you want the third <div> element that a command matches, inside the xpath that would be (//div)[3], but if you index the list that Python gets back, you would write page.xpath('//div')[2].
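A quick way to convince yourself of the counting difference is the sketch below. It assumes the standard library's xml.etree.ElementTree in place of the Plex parser, and the three <div> tags are made up. A position written inside the path counts from 1, while an index on the Python list that comes back counts from 0.

```python
import xml.etree.ElementTree as ET

# Made-up page with three sibling <div> tags.
page = ET.fromstring(
    '<body><div>first</div><div>second</div><div>third</div></body>'
)

# Inside the path, positions are counted from 1.
from_path = page.findall('./div[3]')[0].text

# On the Python list that comes back, positions are counted from 0.
from_list = page.findall('./div')[2].text

print(from_path, from_list)
```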

Important Note for List

When you are pulling a list of elements from a page, you need to make sure you are pulling that list in a way that gets all the occurrences or matches, so it will go through all the individual items you want to pull. So if you want to find five anchors within a particular area of code, you have to make sure the list portion of the xpath finds at least five matches for the xpath code. Let's use this as an example:

for item in page.xpath('//ol/li'):
    link = item.xpath('./div/a/@href')[0]

If there is only one occurrence of an <ol> tag with a child element of a <li> tag in a document, and that <li> has five <div> tag elements inside of it that each have an <a> tag element, your list loop in the example above will not work properly. It will only go through the one loop or list item of //ol/li and pick up the first occurrence of ./div/a that occurs in that one list item. If you want to pick up all of the <a> tag elements that occur within that list, you have to change your code so that the list has five occurrences by including the <div> tag as a child element in the list portion of the code. Like this:

for item in page.xpath('//ol/li/div'):
    link = item.xpath('./a/@href')[0]

The best way to ensure this is to use your xpath checker to make sure the list portion of the xpath returns at least as many matches as the number of occurrences you want returned.
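The sketch below counts the matches for both versions of the loop. It assumes the standard library's xml.etree.ElementTree as a stand-in for the Plex parser, and the page with one <li> holding five <div> tags is made up to match the situation described above.

```python
import xml.etree.ElementTree as ET

# Made-up page: one <li> that holds five <div> tags, each with an anchor.
page = ET.fromstring(
    '<body><ol><li>'
    '<div><a href="/video/1">One</a></div>'
    '<div><a href="/video/2">Two</a></div>'
    '<div><a href="/video/3">Three</a></div>'
    '<div><a href="/video/4">Four</a></div>'
    '<div><a href="/video/5">Five</a></div>'
    '</li></ol></body>'
)

# Looping over the <li> gives one pass, so only the first link is picked up.
too_few = [item.findall('./div/a')[0].get('href')
           for item in page.findall('.//ol/li')]

# Looping over the <div> tags gives five passes, one link each.
all_links = [item.findall('./a')[0].get('href')
             for item in page.findall('.//ol/li/div')]

print(too_few)
print(all_links)
```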

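There can also be errors if you choose xpath for your list that loops through more elements than you want to pull, or through elements that do not contain the child elements named in your variables. Using the example above again:

for item in page.xpath('//ol/li/div'):
    link = item.xpath('./a/@href')[0]

Let's say there are six occurrences of an <ol> tag with a child <li> tag that has a <div> child element, but the third of those elements does not have a child <a> element. On the third loop the xpath for the link variable finds nothing, so the [0] at the end raises an error. To handle this, wrap the pull for the link variable in a try/except block and put the continue command in the except portion. When the third element does not have a child <a> element, the continue command tells it to skip any further xpath variable pulls below it for that particular element and go directly to the fourth loop. If you replace the word “continue” with “pass”, it would not pick up the link variable for that loop, but would go on to pick up any variables below it, such as a title variable.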
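Here is one way the try/except idea could look, as a sketch you can run outside of Plex. It assumes the standard library's xml.etree.ElementTree as a stand-in for the Plex parser, and the page, the <span> titles, and the videos list are all made up for illustration. The third <div> has no anchor, so that loop pass is skipped.

```python
import xml.etree.ElementTree as ET

# Made-up page: the third <div> is an advert with no anchor in it.
page = ET.fromstring(
    '<body><ol><li>'
    '<div><span>One</span><a href="/video/1">watch</a></div>'
    '<div><span>Two</span><a href="/video/2">watch</a></div>'
    '<div><span>Advert</span></div>'
    '<div><span>Four</span><a href="/video/4">watch</a></div>'
    '</li></ol></body>'
)

videos = []
for item in page.findall('.//ol/li/div'):
    try:
        link = item.findall('./a')[0].get('href')
    except IndexError:
        # No <a> in this item, so skip everything below and go to the next item.
        # With "pass" here instead, the title pull below would still run.
        continue
    title = item.findall('./span')[0].text
    videos.append((title, link))

print(videos)
```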

Options for Writing Xpath Code:

If you look at examples of Plex channels, you will see two different options for how to word your xpath code.

Example 1:  title = item.xpath('./title')[0].text
            url = item.xpath('./a')[0].get('href')
Example 2:  title = item.xpath('./title/text()')[0]
            url = item.xpath('./a/@href')[0]

I tend to use the second method, because that can be entered directly into the xpath checker and give you the proper results. When you have the right code, you can just cut it from the xpath checker and paste it directly into your channel code. This also gives you the option to use some of the variations of this code. For example, if there are multiple pieces of text within the element, using /text() at the end returns only the text that sits directly inside the element (and with [0], only the first piece of it), whereas using //text() returns all the text associated with the element, including text inside its child elements. See last post in topic for more information on this.
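The Example 1 style only needs element paths plus .text and .get(), so it can be tried in plain Python with the standard library's xml.etree.ElementTree. That is an assumption made for this sketch, not the Plex parser, and the sample item is made up. ElementTree has no text() step, so itertext() is used here only as a rough stand-in for what //text() collects.

```python
import xml.etree.ElementTree as ET

# Made-up item whose title has text both directly inside it and inside a <b> tag.
item = ET.fromstring(
    '<item>'
    '<title>Watch <b>this</b> video</title>'
    '<a href="/video/1">link</a>'
    '</item>'
)

title_el = item.findall('./title')[0]

# .text only holds the first piece of text directly inside the element.
first_text = title_el.text

# itertext() walks the element and its children and yields every piece of text.
all_text = ''.join(title_el.itertext())

# .get('href') reads the attribute, like /@href does in the second style.
url = item.findall('./a')[0].get('href')

print(repr(first_text), repr(all_text), url)
```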

When a document contains CDATA:

If a document contains CDATA tags around the text that you are trying to pull, the text() portion of the xpath command automatically removes the CDATA tags from the data returned. Note: Xpath itself does not ignore these CDATA tags, but the Plex Framework does, so when you use an xpath checker, these CDATA tags will not be ignored.
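As a small illustration of a parser dropping the CDATA wrapper, here is a sketch using the standard library's xml.etree.ElementTree. This is an assumption for the sake of a runnable example and says nothing about how the Plex Framework handles it internally. The sample item is made up.

```python
import xml.etree.ElementTree as ET

# Made-up item with a CDATA section around the title text.
doc = ET.fromstring(
    '<item><title><![CDATA[Tom & Jerry <Episode 1>]]></title></item>'
)

# The parser hands back the text inside the CDATA section without the wrapper.
title = doc.findall('./title')[0].text

print(title)
```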

Pulling data within data:

Every once in a while you will encounter a document where a field you pull with xpath contains more elements. For example, you may have an XML document that has a description element that contains an image element inside it, like:

<description>Here is the description for the video that also contains an image <image>http://wwwwebsite.com/image.jpg</image></description>

In that case, once you pull the outer data in as a string, you would then use HTML.ElementFromString(str) to create elements from that string and then use xpath to pull the inner element contained in the string.

summary = page.xpath('//description/text()')[0]
data = HTML.ElementFromString(summary)
image = data.xpath('//image/text()')[0]
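Below is a sketch of the two-step pull that runs in plain Python. It assumes the inner markup arrives as text (here inside a CDATA section), which is the situation where a second parse is needed. It uses the standard library's xml.etree.ElementTree as a stand-in for the Plex parser, and because ElementTree needs a single outer element, the string is wrapped in a made-up <root> tag before it is parsed again.

```python
import xml.etree.ElementTree as ET

# Made-up feed item whose description text holds an <image> tag as a string.
feed = ET.fromstring(
    '<item><description><![CDATA['
    'Here is the description for the video that also contains an image '
    '<image>http://wwwwebsite.com/image.jpg</image>'
    ']]></description></item>'
)

# Step 1: pull the outer field as a plain string.
summary = feed.findall('./description')[0].text

# Step 2: parse that string again (wrapped in a hypothetical <root> tag)
# and pull the inner element out of it.
data = ET.fromstring('<root>' + summary + '</root>')
image = data.findall('.//image')[0].text

print(image)
```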

When an XML document contains namespace data:

Some XML documents contain data within the items or entries that require namespace info. You will recognize these lines because they contain a colon within the entry field name. To pull data that has a namespace, you first have to find the namespace associated with that piece of data. This is usually defined at the top of the XML document. Then include a variable for that namespace in your code, and add a reference to that variable to the end of the xpath command for that pull.

NAMESPACES = {'media': 'http://search.yahoo.com/mrss/'}

media = page.xpath('./media:content//@url', namespaces=NAMESPACES)[0]
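The namespace lookup can be tried outside of Plex with the sketch below. It assumes the standard library's xml.etree.ElementTree, which also accepts a prefix-to-namespace dictionary, in place of the Plex parser. The feed and its video URL are made up.

```python
import xml.etree.ElementTree as ET

# Same prefix-to-namespace mapping as in the channel code.
NAMESPACES = {'media': 'http://search.yahoo.com/mrss/'}

# Made-up feed that declares the media namespace at the top.
feed = ET.fromstring(
    '<rss xmlns:media="http://search.yahoo.com/mrss/"><channel><item>'
    '<title>Video one</title>'
    '<media:content url="http://example.com/video1.mp4" />'
    '</item></channel></rss>'
)

item = feed.findall('.//item')[0]

# The prefix in the path is resolved through the NAMESPACES dictionary.
media = item.findall('./media:content', NAMESPACES)[0].get('url')

print(media)
```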

Using the contains option in xpath:

Sometimes you will come across HTML documents where you want to specify the class or id, but you want more than one variation of this identifier to be included in the matches. For those situations you can use the contains option to make the class or id for your xpath pull a little more vague.

for items in page.xpath('//ol/li[contains(@id,"carousel")]'):

You can also use the contains option to look for multiple strings within the attribute of a tag using and/or.

url = page.xpath('//li[contains(@class,"navigation") and contains(@class,"right")]/a/@href')[0]
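To make it clearer what contains is matching, here is the same filtering written out in plain Python. This is only an equivalent for illustration and not xpath itself: the standard library's xml.etree.ElementTree (used here instead of the Plex parser) does not support contains(), so the substring checks are done with Python's "in". The sample list is made up.

```python
import xml.etree.ElementTree as ET

# Made-up list with varying id and class values.
page = ET.fromstring(
    '<body><ol>'
    '<li id="carousel-1" class="navigation left"><a href="/prev">Prev</a></li>'
    '<li id="carousel-2" class="navigation right"><a href="/next">Next</a></li>'
    '<li id="footer" class="links"><a href="/about">About</a></li>'
    '</ol></body>'
)

# Same idea as //ol/li[contains(@id,"carousel")]: keep items whose id has the substring.
carousel = [li for li in page.findall('.//ol/li')
            if 'carousel' in li.get('id', '')]

# Same idea as contains(@class,"navigation") and contains(@class,"right").
right_nav = [li for li in page.findall('.//li')
             if 'navigation' in li.get('class', '')
             and 'right' in li.get('class', '')]
url = right_nav[0].findall('./a')[0].get('href')

print(len(carousel), url)
```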
            

For more info on choosing the best format for your xpath code and ensuring you are using the most reliable path to the data, see the following tutorial: http://devblog.plexapp.com/2012/11/14/xpath-for-channels-the-good-the-bad-and-the-fugly/