Parsing HTML with xpath

noob questions
I finished [this](http://forums.plexapp.com/index.php?/topic/18002-plugin-for-non-imdb-site/), and now I'm trying to parse an HTML page for metadata.

This works well (with http://www.kinopoisk.ru/level/1/film/251733/ as the example page):


kinopoiskHtml = HTML.ElementFromURL(kinopoiskUrl)
metadata.summary = str(kinopoiskHtml.xpath("//span[@class='_reachbanner_']")[0].text)
metadata.tagline = str(kinopoiskHtml.xpath("//td[@style='color: #555']")[0].text)




But I'm a complete noob with other cases, especially when a regex seems to be needed. I cannot find good examples of how to parse something like this:


<tr><td class="type">год</td><td class=""><a href="/level/10/m_act%5Byear%5D/2009/">2009</a></td></tr>
<tr><td class="type">жанр</td><td><a href="/level/10/m_act%5Bgenre%5D/2/">фантастика</a>, <a href="/level/10/m_act%5Bgenre%5D/3/">боевик</a>, <a href="/level/10/m_act%5Bgenre%5D/8/">драма</a>, <a href="/level/10/m_act%5Bgenre%5D/10/">приключения</a>, <a href="/level/92/film/251733/">...</a></td></tr>




The year is inside an *a* tag and is also part of its *href* attribute. Please help me with the XPath.

Hi! You probably don't need a regex in this case (phew ;)). Using the contains() function in your XPath can help you find the right *a* tag, like so:


kinopoiskHtml = HTML.ElementFromURL(kinopoiskUrl)
...
...
year = int(kinopoiskHtml.xpath('//a[contains(@href, "year")]')[0].text)



This matches every *a* tag whose href attribute contains the string "year"; the [0] then takes the first match.
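If you want to try the expression outside Plex, here is a minimal sketch using the third-party lxml package under Python 3. I'm assuming the elements returned by HTML.ElementFromURL behave like lxml elements, so the same XPath should carry over. The names SAMPLE, page and year_links are just mine, and the markup is the snippet from the question wrapped in a table tag.

```python
# Standalone sketch; needs the third-party lxml package (pip install lxml).
from lxml import html

# Markup copied from the question, wrapped in <table> so the rows parse cleanly.
SAMPLE = """
<table>
<tr><td class="type">год</td><td class=""><a href="/level/10/m_act%5Byear%5D/2009/">2009</a></td></tr>
<tr><td class="type">жанр</td><td><a href="/level/10/m_act%5Bgenre%5D/2/">фантастика</a>, <a href="/level/10/m_act%5Bgenre%5D/3/">боевик</a></td></tr>
</table>
"""

page = html.fromstring(SAMPLE)

# contains(@href, "year") keeps only the <a> tags whose href includes "year".
year_links = page.xpath('//a[contains(@href, "year")]')

# Check for an empty result rather than indexing [0] blindly.
year = int(year_links[0].text) if year_links else None
print(year)
```

The empty-list check is optional, but it avoids an IndexError on pages that have no year row.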


Oh thank you, it works =)

Any hints on how to deal with lists (genres, actors, etc., as in the HTML above)?


I haven't worked with "genres" yet, but by looking at the Cine-Passion agent, it should be something like this:

metadata.genres.clear()
genres = kinopoiskHtml.xpath('//a[contains(@href, "genre")]')

for genre in genres:
    metadata.genres.add(genre.text.strip())
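To see what that expression actually returns for the genre row, here is a rough standalone check with lxml (third-party, Python 3). It is not Plex code: SAMPLE, page and genre_links are names I made up, and a plain list stands in for metadata.genres.

```python
# Standalone sketch; needs the third-party lxml package (pip install lxml).
from lxml import html

# Genre row copied from the question, wrapped in <table> so it parses cleanly.
SAMPLE = """<table>
<tr><td class="type">жанр</td><td><a href="/level/10/m_act%5Bgenre%5D/2/">фантастика</a>, <a href="/level/10/m_act%5Bgenre%5D/3/">боевик</a>, <a href="/level/10/m_act%5Bgenre%5D/8/">драма</a>, <a href="/level/10/m_act%5Bgenre%5D/10/">приключения</a>, <a href="/level/92/film/251733/">...</a></td></tr>
</table>"""

page = html.fromstring(SAMPLE)

# Each genre link has "genre" in its href; the trailing "..." link does not, so it is skipped.
genre_links = page.xpath('//a[contains(@href, "genre")]')

# A plain list stands in for metadata.genres here.
genres = [a.text.strip() for a in genre_links if a.text]
print(len(genres))
```

xpath() returns a list of every matching element, so the loop should run once per genre link.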




Thank you, it works, but it shows only the first genre. The Cine-Passion agent is a good source of examples for this.

I still have a problem with directors, actors, and so on, because there is no identifying info inside the tag:


<tr><td class="type">режиссер</td><td><a href="/level/4/people/27977/">Джеймс Кэмерон</a></td></tr>
<tr><td class="type">DIRECTOR</td><td><a href="/level/4/people/27977/">JAMES CAMERON</a></td></tr>



Any ideas please?

Use .get('href') for the link and .text for the Russian. Is that an XBMC .nfo file?
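A small sketch of those two calls, using lxml directly (third-party, Python 3) on the director row from the question. ROW, page and link are my own names, and I'm assuming the elements you get inside Plex expose the same .get() and .text.

```python
# Standalone sketch; needs the third-party lxml package (pip install lxml).
from lxml import html

# Director row copied from the question, wrapped in <table> so it parses cleanly.
ROW = """<table>
<tr><td class="type">режиссер</td><td><a href="/level/4/people/27977/">Джеймс Кэмерон</a></td></tr>
</table>"""

page = html.fromstring(ROW)
link = page.xpath('//a[contains(@href, "people")]')[0]

href = link.get('href')  # value of the href attribute
name = link.text         # text between <a> and </a>
print(href)
```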


Find the *td* tag that contains a text node with value "DIRECTOR", get its parent node *tr* and find all *a* tags that are descendants of this node that contain the string "people" inside their href attribute:

//td[text()="DIRECTOR"]/parent::tr//a[contains(@href,"people")]
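Here is that expression run with lxml (third-party, Python 3) as a quick sanity check. The ACTOR row is made up by me to show that other "people" links are left out; SAMPLE, DIRECTOR_XPATH and directors are also my names, not anything from Plex.

```python
# Standalone sketch; needs the third-party lxml package (pip install lxml).
from lxml import html

# DIRECTOR row is from the question; the ACTOR row is hypothetical, added for contrast.
SAMPLE = """<table>
<tr><td class="type">DIRECTOR</td><td><a href="/level/4/people/27977/">JAMES CAMERON</a></td></tr>
<tr><td class="type">ACTOR</td><td><a href="/level/4/people/1/">SOME ACTOR</a></td></tr>
</table>"""

DIRECTOR_XPATH = '//td[text()="DIRECTOR"]/parent::tr//a[contains(@href,"people")]'

page = html.fromstring(SAMPLE)

# Only links inside the row whose label cell reads "DIRECTOR" are returned.
directors = [a.text.strip() for a in page.xpath(DIRECTOR_XPATH)]
print(directors)
```

The same pattern should work for other rows by swapping the label text, for example "режиссер" for the Russian director row.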
