This American Life is America's favorite radio show. It has been on the air for almost twenty years, and the TAL staff diligently post transcripts of each week's show. I think those transcripts could be a great source for people looking for some pop culture data to play with and visualize. The data includes both structured fields (timestamps, speaker names, act names, dates) and unstructured transcripts of spoken words.
Out of respect for This American Life, this repo doesn't include the full data set (~50 MB, 120,000 lines). You've got to download the repo, then fetch and parse the data yourself. Instructions are included below. The final CSV file breaks down the show by each paragraph of spoken content.
I'm a huge fan of the show. If anyone at the show has a problem with this little project, contact me, and I'll gladly take it down. Thanks!
series_counter: An index of the number of paragraphs spoken for the entire history of the show. Begins at 1.
episode_number: The TAL episode number. Begins at 1.
episode_date: Date of original airing. Formatted as 01.23.2003.
episode_title: Title of the episode.
episode_link: Link to the thisamericanlife.org archive page for the episode.
act_name: Name of the act. Acts include "Prologue" and "Credits".
act_number: Number of the act. Begins at 0 for the prologue.
speaker_name: Name of the speaker of the paragraph.
paragraph_number_per_speaker: Index of the paragraph within that speaker's particular section.
time_stamp: A timestamp of the beginning time of the paragraph.
content: The paragraph's content.
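As a rough illustration of working with these fields, here is a minimal sketch that lists the column names and turns an `episode_date` value into a JavaScript `Date`. It assumes the columns appear in the order listed above and that the date is month.day.year, which the example `01.23.2003` suggests. `FIELDS` and `parseEpisodeDate` are hypothetical names for this sketch, not part of the repo.

```javascript
// Column names, assumed to be in the order documented above.
const FIELDS = [
  "series_counter",
  "episode_number",
  "episode_date",
  "episode_title",
  "episode_link",
  "act_name",
  "act_number",
  "speaker_name",
  "paragraph_number_per_speaker",
  "time_stamp",
  "content",
];

// Parse an episode_date string such as "01.23.2003" into a UTC Date.
// Assumption: the format is MM.DD.YYYY.
function parseEpisodeDate(text) {
  const match = /^(\d{2})\.(\d{2})\.(\d{4})$/.exec(text.trim());
  if (!match) {
    throw new Error(`Unexpected episode_date format: ${text}`);
  }
  const month = Number(match[1]);
  const day = Number(match[2]);
  const year = Number(match[3]);
  // JavaScript months are zero-based, so subtract 1.
  return new Date(Date.UTC(year, month - 1, day));
}
```

Parsing dates up front makes it easier to group or sort paragraphs by year when you visualize the data.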
You need to have Node and npm installed. If you don't have them yet, install Node first (npm ships with it). It's really simple.
Download this repository to your computer, navigate to it in your Terminal, and run:
npm install
Next run node get-html-files.js. This will take a little while, because the script waits between requests to avoid bombarding This American Life's servers. You can go into get-html-files.js and alter the wait time between requests if you'd like.
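To show what waiting between requests can look like, here is a minimal sketch of a sequential runner that pauses between tasks. It is not the code in get-html-files.js. The names `sleep`, `runWithDelay`, and `waitMs` are hypothetical and chosen only for this illustration.

```javascript
// Resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Run `task` on each item one at a time, pausing `waitMs` between calls
// so the remote server never sees a burst of requests.
async function runWithDelay(items, task, waitMs) {
  const results = [];
  for (let i = 0; i < items.length; i += 1) {
    results.push(await task(items[i]));
    // No need to wait after the final item.
    if (i < items.length - 1) {
      await sleep(waitMs);
    }
  }
  return results;
}
```

With a list of page URLs, the task could be a function that downloads one page and returns its HTML. A larger `waitMs` is gentler on the server at the cost of a longer total run.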
Next run node create-tal-csv.js. This will produce the CSV file in ./output.
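Once you have the CSV, a first experiment might be counting paragraphs per speaker. The sketch below assumes the file starts with a header row containing `speaker_name` and uses standard CSV quoting (double quotes around fields, `""` for a literal quote). I have not confirmed either against the generated file. `parseCsv` and `countParagraphsBySpeaker` are hypothetical helpers, not part of the repo.

```javascript
// Minimal CSV parser: handles quoted fields, escaped quotes (""), and
// commas or newlines inside quotes. Returns an array of rows (string arrays).
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < text.length; i += 1) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // escaped quote inside a quoted field
        i += 1;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n") {
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else if (ch !== "\r") {
      field += ch;
    }
  }
  // Flush the last row if the text does not end with a newline.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Count paragraphs per speaker. Assumes the first row is a header
// that includes a speaker_name column.
function countParagraphsBySpeaker(csvText) {
  const [header, ...records] = parseCsv(csvText);
  const speakerIndex = header.indexOf("speaker_name");
  if (speakerIndex === -1) {
    throw new Error("No speaker_name column found in header");
  }
  const counts = new Map();
  for (const record of records) {
    const speaker = record[speakerIndex];
    counts.set(speaker, (counts.get(speaker) || 0) + 1);
  }
  return counts;
}
```

To try it, read the generated file from ./output with `fs.readFileSync(path, "utf8")` and pass the text to `countParagraphsBySpeaker`. For a file this size, a streaming CSV library would likely be a better fit than loading everything into memory.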
I'm not connected to This American Life in any way. All copyright rests with Chicago Public Media & Ira Glass (2003). This American Life comments on their archive page:
Note: This American Life is produced for the ear and designed to be heard, not read. We strongly encourage you to listen to the audio, which includes emotion and emphasis that's not on the page. Transcripts are generated using a combination of speech recognition software and human transcribers, and may contain errors. Please check the corresponding audio before quoting in print.