[REL] TGC (The Great Courses) Metadata Agent

scanner-agent-dev

#6

Niice! That code is pretty dense and might take me a little while to go through it all. However, at first glance there is a lot I can use in here that is really high quality code. UpdateMeta and other things in common.py should provide me with the tools for future releases. Not to mention your GetMetadata functions. This is a treasure for anyone developing a PLEX agent.


#7

First off, I can't thank you enough for this. I even signed up for Reddit just to thank you THERE. Bravo, Kudos, seriously, geebus this is great.

I've been going through this all day against a library and this is what I have found out so far.

I don't have urllib/urllib2 and BeautifulSoup and if I do I am unaware of it. Just a standard PLEX server install.
*It works just fine.*

The files themselves can be named:
S01E01 - text
S01E02 - text
You don't need the name of the lecture in front of the S0xE0x and once you see below you will be glad.
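As a rough illustration of the naming pattern above, the season and episode numbers can be picked out of a file name with a regular expression. This is only a sketch: `parse_episode` and `EPISODE_RE` are made-up names for illustration, not anything taken from TGC.bundle or the PLEX scanner, and the real scanner may be stricter or looser about what it accepts.

```python
import re

# Hypothetical helper, not part of TGC.bundle: finds an SxxEyy marker
# anywhere in a file name, ignoring upper/lower case.
EPISODE_RE = re.compile(r"S(\d{1,2})E(\d{1,3})", re.IGNORECASE)


def parse_episode(filename):
    """Return (season, episode) as ints, or None when no marker is found."""
    match = EPISODE_RE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

With this sketch, a name such as `S01E02 - text` yields season 1, episode 2, whether or not a course title precedes the marker.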

You DO need the name of the lecture as the folder name and it needs to be the exact text as it reads in the URL, not the lecture name. For example...

African Experience from "Lucy" to Mandela - won't work
African Experience from Lucy to Mandela - won't work
African Experience from quot Lucy quot to Mandela - works just fine. This is how the lecture is written out in the URL.

http://www.thegreatcourses.com/courses/**african-experience-from-quot-lucy-quot-to-mandela**.html

Other examples include:
Experiencing America A Smithsonian Tour through American History - won't work despite the fact that it is the title
Experiencing America A Smithsonian Tour through History - does work because that is what the URL is.

Some lectures have words like "World's" spelled "worlds" in the URL. Remove the apostrophe and it works.
Upper and lower case words don't seem to matter. But if you put all lower case as the URL says, you will get a lower case name in PLEX and will need to either change the folder name or edit the metadata title. I prefer the folder. Do it once and that's it.
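To check a folder name before scanning, something like the sketch below could compare it against the course URL. It assumes the rule described above (case is ignored, words in the folder name line up with the hyphen-separated words in the URL); `folder_matches_url` is a hypothetical helper written for this post, not code from the agent.

```python
import re


def folder_matches_url(folder_name, course_url):
    """True when the folder name, lowercased and hyphenated, equals the URL slug.

    Assumption: the slug is the last path segment of the URL without ".html".
    """
    slug = course_url.rsplit("/", 1)[-1]
    if slug.endswith(".html"):
        slug = slug[:-len(".html")]
    # Any run of characters other than letters/digits becomes one hyphen.
    normalized = re.sub(r"[^a-z0-9]+", "-", folder_name.lower()).strip("-")
    return normalized == slug
```

Under these assumptions the "quot Lucy quot" folder name matches its URL, while the version with real quotation marks does not, which agrees with what I saw in practice.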

Best suggestion is to give it a go and see what it can't find. They will fall into four types.

  1. Audio lectures. If you have any OLD lectures that either came out before TTC started producing video versions (yeah, I've been collecting them that long), or audio versions because you haven't been able to track down the video versions, you will have to load these in as Music and add your own art and descriptions for now, until bubonic314 in his amazing and generous wisdom decides to do a version for music as well. Not that big a need since a LOT of the audio-only lectures no longer exist.
  2. Discontinued Video lectures. If you have any of the old VHS based lectures that TTC discontinued years ago, you will have to add your own art and metadata.
  3. Single lecture demos. TTC sends out demos from time to time to judge the marketability of a topic and the skill of the professor. Some eventually become real lectures. Most don't. Same as above, you will have to come up with your own art and metadata for these. Or don't include them. I don't honestly know why I do.
  4. Stuff that has the wrong name for the lecture title in the folder. I have yet to find a lecture that it won't locate as long as you give it the specific text of the title in the URL. Really.

If you have recent stuff - stuff that is all currently for sale and on their website - you can get it loaded up just fine if you do as I did above.

Again, thanks to bubonic314, whoever you are. So far I have loaded in history, art, music, and better living, and I am working on literature/language, and it is going just fine.


#8

@bubonic314 it will replace my normal agent code on GitHub soon. I went overkill, but I didn't want to update a field unless needed, and I am now coding a log per series in the agent data folder that includes the scanner logs... I loved the dict functions that don't throw errors if the fields don't exist... Thanks for the kind words, I really appreciate that as a coder... It took ages to get it to this point, but somebody could just create a module file and slightly modify __init__.py to get started and have great logs :D
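For anyone wondering what a dict function that doesn't throw looks like, here is a minimal sketch of the idea. `safe_get` is a hypothetical name used only for illustration; it is not the actual function from common.py, whose signature isn't shown in this thread.

```python
def safe_get(data, *keys, **kwargs):
    """Walk nested dicts key by key; return a default instead of raising.

    Hypothetical illustration only. Pass default=... to choose the
    fallback value (None when omitted).
    """
    default = kwargs.get("default")
    current = data
    for key in keys:
        # Stop as soon as the path leaves a dict or a key is missing.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current
```

The point of the pattern is that metadata code can ask for a deeply nested field without wrapping every lookup in try/except.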


#9

Updated Code

I have updated TGC.bundle in the git repository. The new update includes:

  • Course description is now the full course description found on TGC website and formatted accordingly.
  • Fuzzy lecture course matching via the SearchCourse() method when urllib receives a 404 Not Found from the initial course URL formatting.

As mentioned by @tramp78 if you name your directories (or files) specific to what shows up in the course URL, it will always match the lectures exactly and pull all relevant data for that course. However, I like the presentation of the course names found on the TGC website to show up on PLEX, which contains commas, colons and quotes and the like. The new update allows for you to name your directories or files with these course names as it will filter out the unnecessary data and find a match to the course URL on TGC website.
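The overall flow (try the formatted URL first, fall back to a search on a 404) can be sketched roughly as below. This is not the code in TGC.bundle: `resolve_course_url`, `url_exists`, and `search_course` are hypothetical names, and the two callables are injected stand-ins so the flow can be shown without touching the network.

```python
def resolve_course_url(title, url_exists, search_course):
    """Try the direct course URL; fall back to a search when it is missing.

    Assumptions for this sketch:
      url_exists(url)      -> True/False (e.g. False on a 404)
      search_course(title) -> a course URL, or None when nothing matches
    """
    slug = "-".join(title.lower().split())
    direct = "http://www.thegreatcourses.com/courses/%s.html" % slug
    if url_exists(direct):
        return direct
    # The direct URL was not found, so hand the title to the search step.
    return search_course(title)
```

Keeping the two lookups as separate callables also makes it easy to log which path produced the final URL.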

My new SearchCourse() method allows for fuzzy matching of courses. For an example consider the course:

Conquest of the Americas

If your directory or lecture files are named, for whatever reason, Conquest of America S01E01.mkv, the following logs show what the new method does:

2017-05-12 20:42:53,373 (7f1f89ffb700) : INFO (__init__:386) - metadata.title: Conquest of America
2017-05-12 20:42:53,373 (7f1f89ffb700) : INFO (__init__:398) - update() CourseURL: http://www.thegreatcourses.com/courses/conquest-of-america.html
2017-05-12 20:42:54,420 (7f1f89ffb700) : INFO (__init__:417) - courseURL not found... Searching for related courses: Conquest of America
2017-05-12 20:43:00,253 (7f1f89ffb700) : INFO (__init__:324) - Title: 1066: The Year That Changed Everything
2017-05-12 20:43:00,254 (7f1f89ffb700) : INFO (__init__:324) - Title: Conquest of the Americas
2017-05-12 20:43:00,254 (7f1f89ffb700) : INFO (__init__:327) - Match found for: Conquest of America
2017-05-12 20:43:00,254 (7f1f89ffb700) : INFO (__init__:328) - Title found is: Conquest of the Americas
2017-05-12 20:43:00,255 (7f1f89ffb700) : INFO (__init__:329) - Link is: http://www.thegreatcourses.com/courses/conquest-of-the-americas.html
2017-05-12 20:43:00,279 (7f1f89ffb700) : INFO (__init__:334) - Finding best match...
2017-05-12 20:43:00,279 (7f1f89ffb700) : INFO (__init__:337) - Span length for is: 23
2017-05-12 20:43:00,279 (7f1f89ffb700) : INFO (__init__:342) - CourseTitle is: Conquest of the Americas
2017-05-12 20:43:00,280 (7f1f89ffb700) : INFO (__init__:343) - CourseURL is: http://www.thegreatcourses.com/courses/conquest-of-the-americas.html
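For readers who want to experiment with this kind of matching outside the agent, here is a small stand-in using Python's difflib. To be clear, this is a swapped-in technique and not the algorithm in SearchCourse(), which (judging by the "Span length" log line) works differently; `best_match` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def best_match(query, titles):
    """Return the title most similar to query, or None for an empty list.

    Similarity is difflib's ratio on lowercased strings; this is an
    illustrative choice, not the scoring used by TGC.bundle.
    """
    def score(title):
        return SequenceMatcher(None, query.lower(), title.lower()).ratio()

    return max(titles, key=score) if titles else None
```

Given the two titles listed in the log above, this sketch also picks "Conquest of the Americas" for the query "Conquest of America".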

Or, in an example by @tramp78, something that contains an apostrophe (and is dear to me >:) ), i.e.,

The Black Death: The World's Most Devastating Plague S01e17 Plague Saints And Popular Religion.m4v

The course URL is: http://www.thegreatcourses.com/courses/the-black-death-the-worlds-most-devastating-plague.html
A lot of times the course URL will place World's as world-s- and my formatting usually edits it that way. Let's see what the log looks like:

2017-05-13 00:51:59,642 (7f7f93fff700) : INFO (__init__:386) - metadata.title: The Black Death: The World's Most Devastating Plague
2017-05-13 00:51:59,642 (7f7f93fff700) : INFO (__init__:398) - update() CourseURL: http://www.thegreatcourses.com/courses/the-black-death-the-world-s-most-devastating-plague.html
2017-05-13 00:52:02,322 (7f7f93fff700) : INFO (__init__:417) - courseURL not found... Searching for related courses: The Black Death: The World's Most Devastating Plague
2017-05-13 00:52:13,928 (7f7f93fff700) : INFO (__init__:320) - Locating search results...
2017-05-13 00:52:13,928 (7f7f93fff700) : INFO (__init__:324) - Title: The Black Death: The World's Most Devastating Plague
2017-05-13 00:52:13,928 (7f7f93fff700) : INFO (__init__:327) - Match found for: The Black Death: The World's Most Devastating Plague
2017-05-13 00:52:13,929 (7f7f93fff700) : INFO (__init__:328) - Title found is: The Black Death: The World's Most Devastating Plague
2017-05-13 00:52:13,929 (7f7f93fff700) : INFO (__init__:329) - Link is: http://www.thegreatcourses.com/courses/the-black-death-the-worlds-most-devastating-plague.html
2017-05-13 00:52:13,929 (7f7f93fff700) : INFO (__init__:324) - Title: An Economic History of the World since 1400
2017-05-13 00:52:13,930 (7f7f93fff700) : INFO (__init__:324) - Title: (Set) The Black Death & Late Middle Ages
2017-05-13 00:52:13,930 (7f7f93fff700) : INFO (__init__:324) - Title: (Set) The Black Death & Medieval World
2017-05-13 00:52:13,931 (7f7f93fff700) : INFO (__init__:324) - Title: (Set) The Black Death & The Guide to Essential Italy
2017-05-13 00:52:13,931 (7f7f93fff700) : INFO (__init__:324) - Title: (Set) The Black Death & The Great Tours: Experiencing Medieval Europe
2017-05-13 00:52:13,931 (7f7f93fff700) : INFO (__init__:324) - Title: (Set) The Black Death & Story of Medieval England
2017-05-13 00:52:13,932 (7f7f93fff700) : INFO (__init__:334) - Finding best match...
2017-05-13 00:52:13,932 (7f7f93fff700) : INFO (__init__:337) - Span length for is: 52
2017-05-13 00:52:13,932 (7f7f93fff700) : INFO (__init__:342) - CourseTitle is: The Black Death: The World's Most Devastating Plague
2017-05-13 00:52:13,932 (7f7f93fff700) : INFO (__init__:343) - CourseURL is: http://www.thegreatcourses.com/courses/the-black-death-the-worlds-most-devastating-plague.html

As you can see it found the correct URL.
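Since the site seems to handle apostrophes both ways ("worlds" and "world-s"), one cheap alternative to searching is to generate both URL forms and try each. The sketch below only illustrates that idea; `candidate_slugs` is a hypothetical helper and is not how the agent currently does it.

```python
import re


def candidate_slugs(title):
    """Return URL slugs to try, covering both apostrophe treatments."""
    def slugify(text):
        # Any run of characters other than letters/digits becomes one hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    dropped = slugify(title.replace("'", ""))  # World's -> worlds
    hyphenated = slugify(title)                # World's -> world-s
    slugs = [dropped]
    if hyphenated != dropped:
        slugs.append(hyphenated)
    return slugs
```

For titles without an apostrophe the function returns a single slug, so no extra request would be needed.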

In short, you don't have to follow the naming scheme of the course URL anymore; TGC.bundle now does all that based on the course lecture name. Thanks to @tramp78 for recognizing that if you follow the URL naming scheme all is well. That is the foolproof way of making sure this works, but it is nice to have the formatting of the full course lecture name as your "Show" name in PLEX.

Thanks to @tramp78 and @ZeroQI for their input and help.

Download

TGC.bundle


#10

Seems to work fine. Analyze each library to re-download the newly formatted Course Descriptions. Haven't tried changing the folder names, but what I have done is leave the folder names the same as the URL and edit the name in PLEX and lock it. Kind of defeats the purpose of having your agent do all the work, but over time I'm sure it will all get worked out. Any way to locate backgrounds and banner art? Not sure where on the website that is.


#11

Strange thing. Some lectures pick up the formatting on the course description and some just refuse to do so. Specifically:
Classical Mythology
Books That Have Made History Books That Can Change Your Life

There may be more but I just can't get these guys to update.


#12

@tramp78 It does pull background and banner art. The Agent uses BeautifulSoup to pull those links into the metadata object. If you're not seeing art, it could be that you don't have BeautifulSoup installed. Here is a snippet of the code:

    Log("Downloading Art")
    @parallelize
    def DownloadArt(html=html):
        Log("DownloadArt()")
        art = []
        Art = {}
        soup = BeautifulSoup(html)
        # Collect the href of every gallery link on the course page.
        for link in soup.findAll("a", "cloud-zoom-gallery lightbox-group"):
            art.append(link.get('href'))
        # The first gallery link is used as fanart, the second as the poster.
        Art['fanart'] = art[0]
        Art['poster'] = art[1]
        Log("Fanart URL: %s" % Art['fanart'])
        Log("Poster URL: %s" % Art['poster'])
        if Art['poster'] not in metadata.posters:
            # ... (rest of the function not shown in this snippet)

and here is a screenshot from my server:


#13

My bad, I do see the banner and background stuff show up. It pops up between the time you hit play and the video starts.


#14

New Update

The TGCAgent now pulls the Professor/Lecturer photo into the cast section of the series:

Also, I have added, at the suggestion of a user on GIT, a method for using the Course Number found on the TGC website. This has to be added to the file naming scheme in the following manner:

The Black Death: The World's Most Devastating Plague (TGC8241) S01E01.mp4
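For anyone scripting their renames, the course number can be pulled back out of a name like the one above with a short regular expression. This is a sketch under the assumption that the number always appears as `(TGC####)` in parentheses; `extract_course_number` is a hypothetical helper, not the function used inside TGC.bundle.

```python
import re

# Assumed pattern: "(TGC" followed by digits and a closing parenthesis,
# matched without regard to case.
COURSE_NUMBER_RE = re.compile(r"\(TGC\s*(\d+)\)", re.IGNORECASE)


def extract_course_number(name):
    """Return the course number as a string, or None when it is absent."""
    match = COURSE_NUMBER_RE.search(name)
    return match.group(1) if match else None
```

Returning the number as a string keeps any leading zeros intact in case a course number has them.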

Even in the poorest naming of the files (or directories), if the TGC#### is included, it should always find an exact match of the course now. This should resolve any discrepancies of courses not being matched. Here is an example where the file names are not quite the name of the course:

And the Log Results:

2017-06-10 14:18:15,054 (7f6d32e7d700) : INFO (__init__:325) - Title: (Set) How to Listen to and Understand Great Music, 3rd Edition & Concerto
2017-06-10 14:18:15,054 (7f6d32e7d700) : INFO (__init__:325) - Title: (Set) How to Listen to and Understand Great Music, 3rd Edition; The Symphony & The Concerto
2017-06-10 14:18:15,054 (7f6d32e7d700) : INFO (__init__:325) - Title: (Set) How to Listen to Great Music & Great Masters: Mozart
2017-06-10 14:18:15,055 (7f6d32e7d700) : INFO (__init__:325) - Title: (Set) 30 Greatest Orchestral Works & Great Masters: Mozart
2017-06-10 14:18:15,055 (7f6d32e7d700) : INFO (__init__:325) - Title: (Set) Best of Robert Greenberg
2017-06-10 14:18:15,055 (7f6d32e7d700) : INFO (__init__:325) - Title: (Set) The Everyday Guide to Wine & How to Listen to and Understand Great Music
2017-06-10 14:18:15,056 (7f6d32e7d700) : INFO (__init__:335) - Finding best match...
2017-06-10 14:18:15,056 (7f6d32e7d700) : INFO (__init__:338) - Span length for is: 39
2017-06-10 14:18:15,056 (7f6d32e7d700) : INFO (__init__:345) - CourseTitle is: Great Masters: Haydn-His Life and Music
2017-06-10 14:18:15,061 (7f6d32e7d700) : INFO (__init__:346) - CourseURL is: http://www.thegreatcourses.com/courses/great-masters-haydn-his-life-and-music.html
2017-06-10 14:18:20,483 (7f6d32e7d700) : INFO (__init__:357) - Course Number Search: 751
2017-06-10 14:18:20,483 (7f6d32e7d700) : INFO (__init__:358) - Course Number Found: 751
2017-06-10 14:18:20,483 (7f6d32e7d700) : INFO (__init__:452) - Course found, URL: http://www.thegreatcourses.com/courses/great-masters-haydn-his-life-and-music.html
2017-06-10 14:18:23,667 (7f6d32e7d700) : INFO (__init__:462) - Adding metadata summary
2017-06-10 14:18:23,668 (7f6d32e7d700) : INFO (__init__:467) - Calling MyLDESCParser()

Enjoy!
-bubonic

TGC.bundle


#15

Life just keeps getting better and better. Any chance you can port this to Music so I can aim it at the audio lectures? If it will take too long, don't bother. I'm not sure how many audio lectures I have that are still on their website, so I'm not sure how much good it would do.


#16

The goal is to eventually get the audio lectures supported too. As of right now, I'm not sure how to get the agent to support multiple types of media. It might be easier to create a separate Agent for the audio lectures; but that could actually turn out to be more work than what is needed. Just stay tuned for updates and hopefully one day you'll see it support audio lectures as well.

Thanks for your ongoing interest.


#17

I agree it might not be worth the time. I think it would take a second agent, and I'm not sure how many of the audio lectures I have are still around as audio-only lectures. I might have to check some day. But until then, thanks again for what you have done.


#18

Thought I would let you know of a few things that I'm seeing. Two lectures keep wanting to be labeled as a set that they are part of instead of as themselves. I can't see a need to ever label a lecture as a set. The two are
The Long 19th Century European History from 1789 to 1917
History of the Bible: The Making of the New Testament Canon

I've tried putting (TGCXXXX) in the folder name & the episode name
I've done the rename the folder to the text in the URL

But it seems to see the Set description which throws off the art and descriptions.

Just an FYI since you are such a wizard at keeping this thing up to date.


#19

You can add "Apocalypse Controversies and Meaning in Western History" to the courses that pull up as a set. Specifically "(Set) Apocalypse: Controversies and Meaning in Western History & History of Christian Theology" Not sure how to make this see the right one.


#20

Sorry it's been a while since I checked this. I was just thinking today that I should remove the (Set) listings from the search result as my algorithm might match the set instead of the individual course. I've been a little busy with some other projects, but I should be able to get to this tomorrow. I'll let you know when it's all updated and ready to go.


#21

Update

There have been some changes to TGC.bundle:

  • Excludes (Sets) from matching when searching for course.
  • Adds a rating value for the course extracted from the course website.
  • Adds roles for multiple lecturers, i.e., Professor 2, Professor 3, etc.
  • Adds metadata studio as TGC
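The set exclusion boils down to dropping search results whose title carries the "(Set)" prefix before any matching happens. A minimal sketch of that idea follows; `exclude_sets` is a hypothetical name and the real code in the bundle may filter differently.

```python
def exclude_sets(titles):
    """Drop search results whose title starts with the "(Set)" marker."""
    return [title for title in titles
            if not title.strip().startswith("(Set)")]
```

Filtering first means a set can never outscore the individual course, no matter how similar its title is.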

Screenshots

TODO

Well, the thing that has been eating up my night has been adding genres to the courses. This has proved to be rather elusive based on the primary_subject in the course's webpage. Courses are filed under multiple genres, and I've been trying to decode the primary_subject code into the category it represents. There is a category section in the course web pages, but unfortunately it's not populated unless the referrer page is a search or category listing. So far this is what I think I've decoded, but I'm not entirely sure and the code is not yet complete:

product_category = {'901': 'Economics & Finance', '902': 'High School', '904': 'Fine-Arts', '905': 'Literature & Language', '907': 'Philosophy, Intellectual History', '909': 'Religion', '910': 'Mathematics', '918': 'History', '926': 'Science', '927': 'Better Living'}

Based on what I've seen so far, this list might be pretty accurate. Of course, there is the brute-force way: pull every course along with the primary_subject code from its web page, and assign the categories accordingly. This might be the only option to get an accurate genre. Anyway, it's on the TODO list.
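Once the table is trusted, the lookup itself is simple. The sketch below shows one way it might be wired up, returning a fallback for codes that haven't been decoded yet. The codes are the tentative ones listed above and may be wrong or incomplete; `genre_for` is a hypothetical helper and not yet part of the bundle.

```python
# Tentative mapping copied from the list above; unverified.
PRODUCT_CATEGORY = {
    '901': 'Economics & Finance',
    '902': 'High School',
    '904': 'Fine-Arts',
    '905': 'Literature & Language',
    '907': 'Philosophy, Intellectual History',
    '909': 'Religion',
    '910': 'Mathematics',
    '918': 'History',
    '926': 'Science',
    '927': 'Better Living',
}


def genre_for(primary_subject, default=None):
    """Map a primary_subject code to a genre name, or return the default."""
    return PRODUCT_CATEGORY.get(str(primary_subject), default)
```

Using a default for unknown codes means a course with an undecoded primary_subject simply gets no genre rather than raising an error.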

Enjoy!
-bubonic

TGC.bundle


#22

VERY cool. Yup, that fixed the Set issue.
Regarding the Audio lectures, I looked through my library and compared it to the website and it looks like there are only 40 courses that are still for sale that are audio only. Not that many. Here is the list if you are interested. LOL Not sure this is worth it compared to other things you are working on.

1066 (The Conquest)
20th Century American Fiction
36 Big Ideas
A Day’s Read
Abraham Lincoln: In His Own Words
Aeneid Of Virgil
American Religion History
Americas in the Revolutionary Era
Business Law - Contracts
Business Law - Negligence and Torts
China, India, and the United States: The Future of Economic Supremacy
Espionage And Covert Operations: A Global History
Ethics of Aristotle
European Thought and Culture in the 20th Century
Explaining Social Deviance
Exploring Metaphysics
Francis of Assisi
History of the US Economy in the 20th Century
How the Crusades Changed History
How to Read and Understand Poetry
Language A to Z
Legacies of Great Economists
Life and Writings of C. S. Lewis
Life and Writings of Geoffrey Chaucer
Literary Modernism
Modern British Drama
Moral Decision Making: How to Approach Everyday Ethics
Plato, Socrates and the Dialogues
Practical Philosophy: The Greco-Roman Moralists
Quest for Meaning: Values, Ethics, and the Modern Experience
Rights of Man: Great Thinkers and Great Movements
Skepticism 101: How to Think like a Scientist
The Art of War
The First Amendment and You
The Greatest Controversies of Early Christian History
The Skeptic's Guide To The Great Books
The Soul and the City: Art, Literature, and Urban Living
Turning Points in Medieval History
Understanding Literature and Life
Lives and Works of the English Romantic Poets


#23

Update

(v 0.4.1)

TGC.bundle

A few updates have been added:
  • Adds Lecture thumbnails from available courses on the TGC+ website
  • Cleaned up some of the code

We have a new addition. For courses found on the TGC+ (https://www.thegreatcoursesplus.com/) website, the TGC Agent now pulls the Lecture thumbnails from the respective course webpage on TGC+. This was a suggestion from a friendly user of github.

Screenshot

It should find most of the courses from TGC that have a counterpart TGC+ course site. Not all courses are available on TGC+. There might be a few straggling courses that are on TGC+ that TGC.bundle doesn't quite pull, so if you find any, PLEASE LET ME KNOW and I'll make adjustments to the code to find the TGC+ course page. Unfortunately I had to use a brute force method to find the courses on TGC+ because the search results are all in javascript and TGC.bundle primarily parses HTML.

TODO

  • Make course description identical to TGC course site. (lots of html parsing! - half way done)
  • Add genres.
  • Clean up some of the code.
  • Add more try/except error checking.
  • Add checks for existing metadata so it's not updating the metadata every time.
  • Make the code less demanding on the PLEX server.
  • Add compatibility for audio lectures.

Download

TGC.bundle

Enjoy!
-bubonic


#24

Quick question on your latest TTC plug in. I'm trying to get the episode thumbnails to work and I need to know if there is a naming convention. Most of my lectures are named
S01E01 - name
S01E02 - name
And none of those actually get the thumbs.
But I just added one that had the following names

Tgc course# - S01E01 - name
Tgc course# - S01E02 - name
And that one did. Is the key putting in the "TGC course number" before the episode numbers? If so, I guess I may need to track all those down somewhere.....


#25

Followup - just adding those values didn't do it after a rescan. But if you unmatch it and rematch it (or just fix match) it works regardless of whether you put the course number in.