If you missed the notes from Day 1, see this post.
It may take me a week or so to finish putting some general thoughts and additional resources together based on the two day conference so that I might give a more thorough accounting of my opinions as well as next steps. Until then, I hope that the details and mini-archive of content below may help others who attended, or provide a resource for those who couldn’t make the conference.
Overall, it was an incredibly well programmed and run conference, so kudos to all those involved who kept things moving along. I’m now certainly much more aware at the gaping memory hole the internet is facing despite the heroic efforts of a small handful of people and institutions attempting to improve the situation. I’ll try to go into more detail later about a handful of specific topics and next steps as well as a listing of resources I came across which may provide to be useful tools for both those in the archiving/preserving and IndieWeb communities.
Archive of materials for Day 2
Below are the recorded audio files embedded in .m4a format (using a Livescribe Pulse Pen) for several sessions held throughout the day. To my knowledge, none of the breakout sessions were recorded except for the one which appears below.
Summarizing archival collections using storytelling techniques
Presentation: Summarizing archival collections using storytelling techniques by Michael Nelson, Ph.D., Old Dominion University
Saving the first draft of history
Special guest speaker: Saving the first draft of history: The unlikely rescue of the AP’s Vietnam War files by Peter Arnett, winner of the Pulitzer Prize for journalism
Kiss your app goodbye: the fragility of data journalism
Panel: Kiss your app goodbye: the fragility of data journalism
Featuring Meredith Broussard, New York University; Regina Lee Roberts, Stanford University; Ben Welsh, The Los Angeles Times; moderator Martin Klein, Ph.D., Los Alamos National Laboratory
The future of the past: modernizing The New York Times archive
Panel: The future of the past: modernizing The New York Times archive
Featuring The New York Times Technology Team: Evan Sandhaus, Jane Cotler and Sophia Van Valkenburg; moderated by Edward McCain, RJI and MU Libraries
Lightning Rounds: Six Presenters
Lightning rounds (in two parts)
Six + one presenters: Jefferson Bailey, Terry Britt, Katherine Boss (and team), Cynthia Joyce, Mark Graham, Jennifer Younger and Kalev Leetaru
1: Jefferson Bailey, Internet Archive, “Supporting Data-Driven Research using News-Related Web Archives” 2: Terry Britt, University of Missouri, “News archives as cornerstones of collective memory” 3: Katherine Boss, Meredith Broussard and Eva Revear, New York University: “Challenges facing preservation of born-digital news applications” 4: Cynthia Joyce, University of Mississippi, “Keyword ‘Katrina’: Re-collecting the unsearchable past” 5: Mark Graham, Internet Archive/The Wayback Machine, “Archiving news at the Internet Archive” 6: Jennifer Younger, Catholic Research Resources Alliance: “Digital Preservation, Aggregated, Collaborative, Catholic” 7. Kalev Leetaru, senior fellow, The George Washington University and founder of the GDELT Project: A Look Inside The World’s Largest Initiative To Understand And Archive The World’s News
Technology and Community
Presentation: Technology and community: Why we need partners, collaborators, and friends by Kate Zwaard, Library of Congress
Breakout: Working with CMS
Working with CMS, led by Eric Weig, University of Kentucky
Alignment and reciprocity
Alignment & reciprocity by Katherine Skinner, Ph.D., executive director, the Educopia Institute
Closing remarks by Edward McCain, RJI and MU Libraries and Todd Grappone, associate university librarian, UCLA
Live Tweet Archive
Reminder: In many cases my tweets don’t reflect direct quotes of the attributed speaker, but are often slightly modified for clarity and length for posting to Twitter. I have made a reasonable attempt in all cases to capture the overall sentiment of individual statements while using as many original words of the participant as possible. Typically, for speed, there wasn’t much editing of these notes. Below I’ve changed the attribution of one or two tweets to reflect the proper person(s). Fore convenience, I’ve also added a few hyperlinks to useful resources after the fact that didn’t have time to make the original tweets. I’ve attached .m4a audio files of most of the audio for the day (apologies for shaky quality as it’s unedited) which can be used for more direct attribution if desired. The Reynolds Journalism Institute videotaped the entire day and livestreamed it. Presumably they will release the video on their website for a more immersive experience.
Condoms were required issue in Vietnam–we used them to waterproof film containers in the field.
Do not stay close to the head of a column, medics, or radiomen. #warreportingadvice
I told the AP I would undertake the task of destroying all the reporters’ files from the war.
Instead the AP files moved around with me.
Eventually the 10 trunks of material went back to the AP when they hired a brilliant archivist.
“The negatives can outweigh the positives when you’re in trouble.”
Our first panel:Kiss your app goodbye: the fragility of data jornalism
I teach data journalism at NYU
A news app is not what you’d install on your phone
Dollars for Docs is a good example of a news app
A news app is something that allows the user to put themself into the story.
Often there are three CMSs: web, print, and video.
News apps don’t live in any of the CMSs. They’re bespoke and live on a separate data server.
This has implications for crawlers which can’t handle them well.
Then how do we save news apps? We’re looking at examples and then generalizing.
Everyblock.com was a good example based on chicagocrime and later bought by NBC and shut down.
What?! The internet isn’t forever? Databases need to be save differently than web pages.
Reprozip was developed by NYU Center for Data and we’re using it to save the code, data, and environment.
We make apps that serve our audience.
We also make internal tools that empower the newsroom.
We also use our nerdy skills to do cool things.
Most of us aren’t good programmers, we “cheat” by using frameworks.
Frameworks do a lot of basic things for you, so you don’t have to know how to do it yourself.
Archiving tools often aren’t built into these frameworks.
Instagram, Pinterest, Mozilla, and the LA Times use django as our framework.
Memento for WordPress is a great way to archive pages.
We must do more. We need archiving baked into the systems from the start.
Slides at http://bit.ly/frameworkfix
Got data? I’m a librarian at Stanford University.
I’ll mention Christine Borgman’s book Big Data, Little Data, No data.
Journalists are great data liberators: FOIA requests, cleaning data, visualizing, getting stories out of data.
But what happens to the data once the story is published?
BLDR: Big Local Digital Repository, an open repository for sharing open data.
For metadata: www.ddialliance.org, RDF, International Image Interoperability Framework (iiif) and MODS
We’ll open up for questions.
What’s more important: obey copyright laws or preserving the content?
The new creative commons licenses are very helpful, but we have to be attentive to many issues.
Perhaps archiving it and embargoing for later?
Saving the published work is more important to me, and the rest of the byproduct is gravy.
I work for the New York Times, you may have heard of it…
Talking about modernizing the born-digital legacy content.
Our problem was how to make an article from 2004 look like it had been published today.
There were 100’s of thousands of articles missing.
There was no one definitive list of missing articles.
Outlining the workflow for reconciling the archive XML and the definitive list of URLs for conversion.
It’s important to use more than one source for building an archive.
I’m going to talk about all of “the little things” that came up along the way..
Article Matching: Fusion – How to convert print XML with web HTML that was scraped.
Primarily, we looked at common phrases between the corpus of the two different data sets.
We prioritized the print data over the digital data.
We maintain a system called switchboard that redirects from old URLs to the new ones to prevent link rot.
The case of the missing sections: some sections of the content were blank and not transcribed.
We made the decision of taking out data we had in lieu of making a better user experience for missing sections.
In the future, we’d also like to put photos back into the articles.
Can you discuss the decision to go with a more modern interface rather than a traditional archive of how it looked?
Some of the decision was to get the data into an accessible format for modern users.
We do need to continue work on preserving the original experience.
Is there a way to distinguish between the print version and the online versions in the archive?
Could a researcher do work on the entire corpora? Is it available for subscription?
We do have a sub-section of data availalbe, but don’t have it prior to 1960.
Have you documented the process you’ve used on this preservation project?
We did save all of the code for the project within GitHub.
We do have meeting notes which provide some documentation, though they’re not thorough.