Parsing HTML with xpath

ptath · September 18, 2010, 7:47pm

noob questions

I finished [this](http://forums.plexapp.com/index.php?/topic/18002-plugin-for-non-imdb-site/) and am now trying to parse an HTML page for metadata.

This works well (using http://www.kinopoisk.ru/level/1/film/251733/ as an example):


kinopoiskHtml = HTML.ElementFromURL(kinopoiskUrl)
metadata.summary = str(kinopoiskHtml.xpath("//span[@class='_reachbanner_']")[0].text)
metadata.tagline = str(kinopoiskHtml.xpath("//td[@style='color: #555']")[0].text)

But I'm a complete noob with other cases, especially when regex is needed. I cannot find good examples of how to parse something like this:


<tr><td class="type">год</td><td class=""><a href="/level/10/m_act%5Byear%5D/2009/">2009</a></td></tr>
<tr><td class="type">жанр</td><td><a href="/level/10/m_act%5Bgenre%5D/2/">фантастика</a>, <a href="/level/10/m_act%5Bgenre%5D/3/">боевик</a>, <a href="/level/10/m_act%5Bgenre%5D/8/">драма</a>, <a href="/level/10/m_act%5Bgenre%5D/10/">приключения</a>, <a href="/level/92/film/251733/">...</a></td></tr>

The year is inside an *a* tag and is also part of the *href* attribute. Please help me with the XPath.

sander1 · September 18, 2010, 7:59pm

Hi! You probably don't need regex in this case (phew ;)). Using the [*contains* function](http://www.w3schools.com/Xpath/xpath_functions.asp) in your XPath can help you find the right *a* tag, like so:


kinopoiskHtml = HTML.ElementFromURL(kinopoiskUrl)
...
...
year = int(kinopoiskHtml.xpath('//a[contains(@href, "year")]')[0].text)

This matches every *a* tag whose *href* attribute contains the string "year".
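For anyone who wants to try the `contains()` lookup outside of Plex, here is a minimal standalone sketch. It assumes the third-party `lxml` package is installed and parses the year row from the question directly, instead of calling `HTML.ElementFromURL`. The names `SAMPLE`, `page`, and `links` are made up for illustration.

```python
from lxml import html  # third-party package, assumed to be installed

# The year row from the question, wrapped in a <table> so it parses cleanly.
SAMPLE = (
    '<table>'
    '<tr><td class="type">год</td><td class="">'
    '<a href="/level/10/m_act%5Byear%5D/2009/">2009</a></td></tr>'
    '</table>'
)

page = html.fromstring(SAMPLE)

# contains() is a plain substring test on the attribute value.
links = page.xpath('//a[contains(@href, "year")]')

# Check for an empty result before indexing, so a layout change on the
# site does not raise an IndexError.
year = int(links[0].text) if links else None
print(year)
```

The empty-result check is a design choice, not something the original snippet does: indexing `[0]` directly works as long as the page layout stays the same.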

ptath · September 18, 2010, 8:05pm

Oh thank you, it works =)

Any hints on how to deal with lists (genres, actors, etc., as in the HTML above)?

sander1 · September 18, 2010, 8:16pm

I haven't worked with "genres" yet, but by looking at the Cine-Passion agent, it should be something like this:


metadata.genres.clear()
genres = kinopoiskHtml.xpath('//a[contains(@href, "genre")]')

for genre in genres:
  metadata.genres.add(genre.text.strip())
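The same loop can be checked against the genre row from the question without the Plex framework. This is only a sketch: it assumes `lxml` is installed, and the plain list `genres` stands in for `metadata.genres`, which exists only inside an agent.

```python
from lxml import html  # third-party package, assumed to be installed

# The genre row from the question, wrapped in a <table>.
SAMPLE = (
    '<table><tr><td class="type">жанр</td><td>'
    '<a href="/level/10/m_act%5Bgenre%5D/2/">фантастика</a>, '
    '<a href="/level/10/m_act%5Bgenre%5D/3/">боевик</a>, '
    '<a href="/level/10/m_act%5Bgenre%5D/8/">драма</a>, '
    '<a href="/level/10/m_act%5Bgenre%5D/10/">приключения</a>, '
    '<a href="/level/92/film/251733/">...</a>'
    '</td></tr></table>'
)

page = html.fromstring(SAMPLE)

genres = []  # stand-in for metadata.genres
for link in page.xpath('//a[contains(@href, "genre")]'):
    # .text is None for an empty tag, so fall back to an empty string.
    name = (link.text or "").strip()
    if name and name not in genres:
        genres.append(name)

print(genres)
```

On this sample the expression matches all four genre links. The trailing "..." link is skipped because its href does not contain "genre".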

ptath · September 19, 2010, 10:20am

Thank you, it works, but it shows only the first genre. The Cine-Passion agent is a good source for examples like this.

I still have a problem with directors, actors, and so on, because there is no identifying info inside the tag:


<tr><td class="type">режиссер</td><td><a href="/level/4/people/27977/">Джеймс Кэмерон</a></td></tr>
<tr><td class="type">DIRECTOR</td><td><a href="/level/4/people/27977/">JAMES CAMERON</a></td></tr>

Any ideas please?

hrcolb0 · September 19, 2010, 12:15pm

Use .get('href') for the link and .text for the Russian text. Is that an XBMC nfo file?
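A small sketch of what that looks like on the director row, assuming `lxml` is installed. Splitting the href to get a numeric id is only an illustration of what the link can be used for; `person_id` is a hypothetical name, not something the site or Plex defines.

```python
from lxml import html  # third-party package, assumed to be installed

# The director row from the question, wrapped in a <table>.
SAMPLE = (
    '<table><tr><td class="type">режиссер</td><td>'
    '<a href="/level/4/people/27977/">Джеймс Кэмерон</a>'
    '</td></tr></table>'
)

page = html.fromstring(SAMPLE)
link = page.xpath('//a[contains(@href, "people")]')[0]

href = link.get("href")  # attribute value: the link target
name = link.text         # text node inside the tag: the displayed name

# Hypothetical extra step: take the last path segment as an id.
person_id = href.rstrip("/").split("/")[-1]

print(href, name, person_id)
```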

sander1 · September 19, 2010, 2:46pm

Find the *td* tag that contains a text node with value "DIRECTOR", get its parent node *tr* and find all *a* tags that are descendants of this node that contain the string "people" inside their href attribute:


//td[text()="DIRECTOR"]/parent::tr//a[contains(@href,"people")]
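To see the difference between this scoped expression and a plain `contains()` search, here is a standalone sketch run against the two rows from the question. It assumes `lxml` is installed; `SCOPED`, `directors`, and `all_people` are illustrative names.

```python
from lxml import html  # third-party package, assumed to be installed

# Both rows from the question, wrapped in a <table>.
SAMPLE = (
    '<table>'
    '<tr><td class="type">режиссер</td><td>'
    '<a href="/level/4/people/27977/">Джеймс Кэмерон</a></td></tr>'
    '<tr><td class="type">DIRECTOR</td><td>'
    '<a href="/level/4/people/27977/">JAMES CAMERON</a></td></tr>'
    '</table>'
)

page = html.fromstring(SAMPLE)

# Label cell -> its parent row -> "people" links anywhere inside that row.
SCOPED = '//td[text()="DIRECTOR"]/parent::tr//a[contains(@href, "people")]'
directors = [a.text for a in page.xpath(SCOPED)]

# Without the label, every "people" link on the page matches.
all_people = page.xpath('//a[contains(@href, "people")]')

print(directors, len(all_people))
```

The label in the expression has to match the cell text exactly, so on the live page it would presumably need whatever string the site actually puts in that cell.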

system · December 20, 2019, 8:46pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
