I'm scraping some icon JPEGs for the Freecaster plugin I'm working on. The problem is that the icons are on the destination pages, so I have to scrape 10 different pages to extract the 10 icons. While this works, it is slow. Now, I know these icons are cached after the first time they're used, so is there a way for me to determine that in my code and avoid scraping the child pages every time that section is accessed?
My current code is:
dir = MediaContainer(title2=breadcrumb)
for item in XML.ElementFromURL(FREECASTER_URL, True, encoding="iso-8859-1").xpath('//div[@id="navigation_menu"]/ul/li//a'):
    title = item.get('title').replace('channel', '')
    url = item.get('href')
    if url != '/live':
        thumb = None
        # xpath() returns a list of matching elements, so take the first match
        thumbElements = XML.ElementFromURL(FREECASTER_URL + url, True, encoding="iso-8859-1").xpath('//link[@rel="image_src"]')
        if len(thumbElements) > 0:
            thumb = thumbElements[0].get('href')
        dir.Append(Function(DirectoryItem(SortedChannel, title=title, summary=None, thumb=thumb), breadcrumb=title, url=url))
What I'm trying to avoid is repeating the XML.ElementFromURL(FREECASTER_URL + url, True, encoding="iso-8859-1").xpath('//link[@rel="image_src"]') call once the images have been cached.
Maybe I could create the DirectoryItem without the thumb, check to see if the thumb is set, and scrape only if it isn't? I'm not sure of the code needed to do that.
The easiest way is to use the cacheTime argument in XML.ElementFromURL(). Just set it to something really high & the page will be fetched from the HTTP cache instead of being downloaded every time.
One thing you can do is implement the UpdateCache() method in your plug-in. This gets called periodically by the framework (and immediately after your plug-in starts). You can use this to go out and fetch information that would usually take a long time to gather, then store it somewhere. You can cache pages simply by calling HTTP.Request() for each page in UpdateCache(), or you can do something a little more complicated and split the logic up a bit - use UpdateCache() to fetch the thumb URLs and store them in the plug-in's dictionary, then fetch the URLs back from the dictionary when responding to requests. It takes a bit more effort to implement, but when it's done it makes browsing within plug-ins pretty instantaneous.
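The split described above can be sketched in plain Python. In this generic sketch, prefetch_thumbs() plays the role of UpdateCache(), the THUMBS dict plays the role of the plug-in's dictionary, and scrape_thumb() is a hypothetical stand-in for the slow per-page scrape.

```python
# Generic sketch of "do the slow work in the background, read from a
# dictionary when responding to requests".

THUMBS = {}  # channel url -> thumb url

def scrape_thumb(url):
    # Placeholder for the slow child-page scrape
    return "http://example.com%s/thumb.jpg" % url

def prefetch_thumbs(channel_urls):
    # Called periodically (like UpdateCache); does all the slow scraping
    # up front and stores the results.
    for url in channel_urls:
        THUMBS[url] = scrape_thumb(url)

def build_directory(channel_urls):
    # Called when responding to a browse request; only dictionary lookups,
    # so it returns immediately.
    return [(url, THUMBS.get(url)) for url in channel_urls]
```

The request path never touches the network, which is why browsing feels instantaneous once the prefetch has run at least once.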
Thanks Jam, that helps a lot. I'm going back and forth on actually using the thumbs from the site (they are not really a good size), but from your explanation I can already see other places where caching will help.