Parsing JavaScript with XPath

Hi,



Is it possible in Plex to use XPath to parse code that was generated with JavaScript?



For example, in the script below I only want to extract the string value of sGlobalFileName='what-i-think-of-tv-news', so that I end up with 'what-i-think-of-tv-news'. Is this possible using XPath? If yes, how?


Short answer: no. XPath is for parsing XML documents or fragments, not general text.



For that, extract the text of the script tag (which you can do using XPath), then use either a regular expression or Python substring operations to pull out the part you need. Messy, but there is no other way around it.


Hmm, could anyone show me what that would look like? I am also searching the internet for regex, but it is not very clear to me…

If you've never used regular expressions before, now is probably not the time to learn. They are the spawn of the devil (but very powerful).



You need a good Python reference. String objects (and all others) have a number of very useful methods on them. You need to use find. Something like (pseudocode):


marker = "sGlobalFileName='"
start = text.find(marker) + len(marker)  # skip past the opening quote
end = text.find("'", start)              # stop at the closing quote
substring = text[start:end]
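
A slightly more defensive sketch of the same find-based idea, in case it helps. The helper name extract_value and the sample string are made up for illustration; the sample is a shortened stand-in for the real script text, and the helper assumes the value is always wrapped in single quotes.

```python
def extract_value(text, name):
    """Return the single-quoted value assigned to `name`, or None if absent."""
    marker = name + "='"
    start = text.find(marker)
    if start == -1:
        # find() returns -1 when the marker is missing
        return None
    start += len(marker)          # move past the opening quote
    end = text.find("'", start)   # position of the closing quote
    if end == -1:
        return None
    return text[start:end]


# Hypothetical, shortened stand-in for the script text on the page.
sample = "sGlobalFileName='what-i-think-of-tv-news';sGlobalContentID='355403';"
print(extract_value(sample, "sGlobalFileName"))  # what-i-think-of-tv-news
```

Checking for -1 matters because find() does not raise when the marker is missing, so without the check you would silently slice from the wrong position.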





honestly, that js example looks like a primo candidate for rolling into a nice neat JSON object. as for exactly how to do that, someone with more braincells than me will have to take a look.

Although regexes are a bit more difficult, I think they are the best way to extract the data you want (in this case).



import re

webpage_content = HTTP.Request('http://www.example.com/pagecontainingthejavascript.html')
title = re.search("sGlobalFileName='(.+?)';", webpage_content).group(1)




A little explanation about the regex:
(...) = a group within a regex. You can have multiple groups within one regex, and you retrieve them with the group function. The above expression could also be written like this:

result = re.search("sGlobalFileName='(.+?)';", webpage_content)
title = result.group(1)


. = matches any character except a newline
+ = match 1 or more repetitions of the preceding expression
? = (placed after + or *) make the preceding repetition ungreedy (= grab as few characters as possible)

Without the "?" the result of this regular expression would be:
what-i-think-of-tv-news';EmbedSEOLinkURL='http://www.break.com/';EmbedSEOLinkKeywords='Funny Videos';sGlobalContentID='355403';sGlobalContentTitle=document.getElementById("vid_title").getAttribute("content");sGlobalCategoryID='7';sGlobalContentFilePath='2007/8';sGlobalContentUrl='http://www.break.com/index/what-i-think-of-tv-news.html';sGlobalContentIDEncoded='boEGsDRmcc5fL4GHC%2bhfyA%3d%3d';sSubmittedBY='rbbtcee';sKeywordTitle='Flashes,man,News,reporter,What I Think of TV News';sKeywordString='Flashes,man,News,reporter
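
To see the difference the "?" makes, here is a small sketch you can run in a plain Python shell. The sample string is a shortened, made-up stand-in for the real script text, and the third pattern (a negated character class) is just an alternative I am adding for comparison, not something from the posts above.

```python
import re

# Hypothetical, shortened stand-in for the script text on the page.
sample = ("sGlobalFileName='what-i-think-of-tv-news';"
          "sGlobalContentID='355403';"
          "sGlobalCategoryID='7';")

# Ungreedy: stops at the first "';" after the value.
lazy = re.search(r"sGlobalFileName='(.+?)';", sample).group(1)

# Greedy: runs on to the last "';" in the text.
greedy = re.search(r"sGlobalFileName='(.+)';", sample).group(1)

# Alternative: match anything that is not a single quote.
no_quotes = re.search(r"sGlobalFileName='([^']+)'", sample).group(1)

print(lazy)       # what-i-think-of-tv-news
print(greedy)     # what-i-think-of-tv-news';sGlobalContentID='355403';sGlobalCategoryID='7
print(no_quotes)  # what-i-think-of-tv-news
```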

Wow thanks, This really helped me! Thanks for the explanation.

I created the following code using regex but there must be something that I forgot because it does not work.



Does anyone have any suggestions?



def Video(sender, url):

  dir = MediaContainer(title3=sender.itemTitle, art=R(ART), viewGroup="InfoList")

  videos = XML.ElementFromURL(url, isHTML=True, errors='ignore').xpath(XPATH_VIDEOS)
  for content in videos:
    title = content.xpath("./a/span")[0].text
    thumb = content.xpath("./a/img")[0].get('src')
    summary = content.xpath("./a")[0].get('title')
    url = content.xpath("./a")[0].get('href')
    dir.Append(Function(VideoItem(PlayVideo, title=title, summary=summary, thumb=thumb), url=url))

  return dir

####################################################################################################

def PlayVideo(sender, url):
	
	video_link = HTTP.Request(url)
	file_name_link = re.search("sGlobalFileName='(.+?)';", video_link)
	file_path_link = re.search("sGlobalContentFilePath='(.+?)';", video_link)
	file_path = file_path_link.group(1)
	file_name = file_name_link.group(1)
	
	total_video_link = 'http://media1.break.com/dnet/media/' + file_path '/' + file_name '.flv'
	
	dir.Append(VideoItem(total_video_link))

See my response to the other topic you started.



http://forums.plexapp.com/index.php?/topic/11941-play-video-using-regex/page__view__findpost__p__70717

Your indentation may be wrong, but you also need to change the PlayVideo function.

This:


dir.Append(VideoItem(total_video_link))


needs to be replaced by this:

return Redirect(total_video_link)




You're also missing two plus signs in the string concatenation. It should read:
total_video_link = 'http://media1.break.com/dnet/media/' + file_path + '/' + file_name + '.flv'

You can also do the regex stuff once and grab everything you need. Here is an “optimized” (shorter) PlayVideo function:



def PlayVideo(sender, url):

  video_link = HTTP.Request(url)
  # group(1) = sGlobalFileName, group(2) = sGlobalContentFilePath
  link = re.search("sGlobalFileName='(.+?)';.+sGlobalContentFilePath='(.+?)';", video_link, re.DOTALL)
  total_video_link = 'http://media1.break.com/dnet/media/' + link.group(2) + '/' + link.group(1) + '.flv'

  return Redirect(total_video_link)
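
If you want to test the single-regex idea outside of Plex, something like the sketch below might help. It uses named groups so the file name and file path cannot be mixed up when building the URL. The function name build_video_url and the sample text are made up for illustration, and it assumes (as in the page discussed here) that sGlobalFileName appears before sGlobalContentFilePath.

```python
import re

# One pattern, two named groups. re.DOTALL lets "." also match newlines,
# in case the two assignments sit on different lines.
PATTERN = re.compile(
    r"sGlobalFileName='(?P<name>.+?)';"
    r".+?"
    r"sGlobalContentFilePath='(?P<path>.+?)';",
    re.DOTALL,
)


def build_video_url(page_text):
    """Return the .flv URL built from the script variables, or None."""
    match = PATTERN.search(page_text)
    if match is None:
        return None
    return "http://media1.break.com/dnet/media/%s/%s.flv" % (
        match.group("path"),
        match.group("name"),
    )


# Hypothetical, shortened stand-in for the page source.
sample = ("sGlobalFileName='what-i-think-of-tv-news';\n"
          "sGlobalContentID='355403';"
          "sGlobalContentFilePath='2007/8';")
print(build_video_url(sample))
# http://media1.break.com/dnet/media/2007/8/what-i-think-of-tv-news.flv
```

Returning None when nothing matches avoids the AttributeError you get from calling .group() on a failed search.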


Wow… I feel so stupid. I left the computer alone for a couple of hours to get my concentration back, and now that I have returned and read your messages it all seems so clear.



Thank you for your support!
