Kevin Clair: Alright, cool. Thanks for coming, everybody. Ben Goldman and I will be discussing past, current, and future activities related to archiving websites and web content here at the Penn State University Libraries, as part of the Digital Curation Community of Practice. I'm Kevin Clair. I'm the Digital Collections Librarian, and I work in the Eberly Family Special Collections Library here at University Park. Ben, would you like to introduce yourself?

Benjamin Matthew Goldman: Sure, thanks, Kevin. Ben Goldman, University Archivist, interim co-head of Special Collections as well. Formerly digital archivist, which is, I think, why I'm here.

Kevin Clair: Yes, Ben formerly was the facilitator, manager, coordinator of these activities. And then we had a bit of a lull. And now I am the manager, facilitator, coordinator of these activities. So we'll talk about what we've done in the past, what we hope to do in the future, how we do some of these things, and the handoffs between the curatorial side of web archiving at the libraries and the technical collection management side. And at the end we want to have a little bit of a conversation about how you don't necessarily need the imprimatur of the Special Collections Library to do your own web archiving activity. So we want to talk a little bit about how you can do this kind of on your own, and how we might build a little sub-community of practice around web archiving and curation activity here at the libraries.

There we go. So this is our outline of what we're gonna talk about today.
Kevin Clair: Ben's gonna talk for the first half or so about why we do this and how we select the things that we select, and we'll talk a little bit about what we've archived so far here in the Special Collections Library using our various tools. Then I'll talk about collection management and providing access. And then we'll both kind of trade off on what some of the challenges are related to web archiving. I'll do a little demo of how we set up some things. We use Archive-It, which is an Internet Archive tool; I'll talk more about that in a bit. And then, hopefully, if there's time, we'll have a little bit of discussion at the end. So at this point I'm going to hand things off to Ben to get us started.

Benjamin Matthew Goldman: Alright, thanks, Kevin. Well, thank you all for letting me talk about this. It's been kind of a while since I've really thought a lot about web archiving. I'll just start by saying web archiving is one of the most fun things I've done here as a librarian at Penn State. It's a great community. There are really cool tools, cool things happening, great ideas, great conferences. It's a fun curatorial activity to engage in. It's also probably the hardest format for us to archive properly, in a way that you can actually view it as it was intended later down the line.

So web archiving has a lot of flavors to it. It is large-scale copying of websites. The crawler, the tool that Kevin's gonna tell you a little bit more about, goes out and basically grabs file after file within a domain or a set of parameters that you point it to. So websites are complex objects.
You're preserving a lot of different types of content when you're preserving a website: text, image, audio, video. But also how all of those things interact and behave together. So there's a lot of, I guess, connectivity between files and pages that you're trying to capture when you preserve websites.

It's also important to think of this as snapshots, which is the word that I like to use. When you're web archiving, you're not capturing every website's existence day to day. With the tools we have, the approaches we have, and the scale of content that's out there, we're looking at trying to capture drop-ins. We'll do it every 4 months or every year, and the tools we have give us different regularities at which to schedule crawls of websites.

And probably the most important thing when it comes to web archives is being able to access it later and to play it back. So if anyone's used the Wayback Machine, you've had that experience where you pop in a URL and you receive a calendar page. You select a date from that page, and you can see what the website looked like. And in theory you can interact with it the same way: you can follow links, and the design and layout look more or less like they did when it was captured. Of course, all of these things are dependent on a lot of different technological issues that we're going to get into, too.

Yeah, you can move ahead, Kevin. Back up one, really quick. It's very jumpy.

So there are a lot of reasons why you might want to archive the web. I thought I would just kind of share 2 example stories.
So I think this one, and there are multiple examples like this if you just do some searching on news sites or even academic publications about web archiving: ephemerality is a big reason for archiving websites. When I started doing this, the thing that I always heard at conferences was that the average lifespan of a website was 100 days. I think there's kind of a long tail to that, where it's a lot of smaller websites that disappear on a frequent basis, and then you've got the behemoths who are around forever. But it's that ephemerality that, I think, really drives a lot of web archiving, and it certainly has done that with some of our past collections.

I think this is a really interesting story. This one is not too old, and I actually found it referenced in a very recent article, I want to say from the Washington Post. But given how much Penn State has done around preserving newspapers and making sure that the archives of the news are accessible through various platforms, through, I know, the microforms library, I just feel like this is a very relevant example to Penn State. In fact, now that I'm thinking about it, the story that referenced this particular report was about how the New York Times was preventing the Internet Archive from accessing its pages. Just kind of an interesting twist.

If you want to go to the next one, Kevin.

I think the other reason that is really strong on my mind is accountability, and I think this is a great example. Some of you may have seen Spotlight PA, which is doing a lot of coverage of Penn State and, I would say, even doing some muckraking.
And I mean that in the best possible sense, around university governance and administration. They have this article, among many that they published recently, which I noticed when I was reading actually referenced looking at the web archives for Penn State, specifically for the Office of Ethics and Compliance, to see how many, I guess, personnel had turned over during a certain period of time. So I think a lot of people, and we've had this experience with our researchers, are using web archives to try to reconstruct information about the university that is no longer out there or available, and we have often referred people to some of the University websites that have been archived through our services. Those are just 2 reasons. There are many other potential reasons, but those are the ones that I think are first and foremost in a lot of people's minds.

Alright. Go ahead, Kevin. Thank you.

So how do we decide what to archive? The thing I've always tried to reinforce is that web archiving is really a curatorial tool. What we have tried to do is approach our use of this tool through a lens of curating collections rather than doing one-offs. Over the years, when I was more involved in this, I've gotten the occasional request: "Hey, I came across this website. I'm not sure it'll be around. Could you crawl it?" In some cases that is something we've done, because it does fit within a curated collection, especially Penn State archives. But for most random one-offs, the Wayback Machine is your best friend. The Internet Archive actually takes submissions through their website.
If anyone from the general public just wants a page to be archived beyond what their kind of service activity is, it can be submitted straight to them for that purpose.

Really, here in the University Libraries, and in most institutions that are doing this kind of work in an organized way, it's about curating collections. And so the big thing that has informed a lot of our work in Special Collections is trying to align that with what our collecting strengths and goals have been and are growing into. So, as we do with any collection that we're evaluating for possible acquisition, we're looking for things like assessing research value: the potential for this material to be used by researchers, especially those at Penn State, or in courses or programs that we're working with closely.

We also consider what others are already collecting. Last I checked, there were something like 600 institutions, at least, that subscribe to the Internet Archive tools and are doing their own web collecting. So there is a good chance that, with a lot of topical projects that we might think of, somebody else has already thought of it. And I can give some examples of that later on, actually.

And then we also have to think about access conditions and restrictions, other things that could prevent us from being able to capture websites.

Beyond our own collecting goals, and I think this is something that Kevin is interested in, too, I think there's an opportunity, and we've kind of explored this in the past, for web archiving to support academic research within the Penn State community. And some of that we've done
And some of that we've done Benjamin Matthew Goldman: in the past here in in special collections, in collaboration with some of our subject liaisons in the libraries. and then the other big one, I would say, that kind of drives, some Benjamin Matthew Goldman: curatorial decisions. This is kind of a big one across. Actually, the profession is, there's a lot of event based collecting. So something happens. Hurricane Katrina. I remember that was something that had a lot of web archiving done after it. Ferguson, when that happened. Benjamin Matthew Goldman: So a lot of collecting that kind of happens to try to get sort of. I guess the immediate public reaction and online Benjamin Matthew Goldman: commentary that happens right after something happens. And it's kind of fresh on everyone's minds. Benjamin Matthew Goldman: Next slide, please. Benjamin Matthew Goldman: and I think this is my last slide. Benjamin Matthew Goldman: What else? II thought it'd be useful to share some of the things we've done up till now. Benjamin Matthew Goldman: So Benjamin Matthew Goldman: As I think Kevin mentioned pretty much. All Web archiving has been done mainly through, led by special collections. There, there have been efforts we've made in the past to kind of broaden that out, though. But it has mostly been pace within our unit. Benjamin Matthew Goldman: We do it using the Internet archives, hosted tools. So that's important to note, because a lot of the things that this group might be interested in things like digital preservation, we offload that to another entity. When we do it through this tool. Benjamin Matthew Goldman: So there's no kind of local preservation activity Benjamin Matthew Goldman: happening. That's not to say it can't. Maybe Kevin will want to get into that. I don't know but really, it's a tool for curating collections, and that's how we've approached it. Benjamin Matthew Goldman: And then some of the collecting highlights, I guess. 
harkening back to some of the criteria for selecting. We have done a lot of institutional collecting related to the University Archives. The first project I actually worked on, once we got this up and running in 2012, was trying to capture, not as many as possible, because we weren't comprehensive about this, but a good amount of material related to the aftermath of the Sandusky scandal here at Penn State. And since then we've done pretty comprehensive crawling of websites in the psu.edu domain.

COVID-19 is something my predecessor, Angel Diaz, initiated right after the pandemic started and we had the lockdown, and we captured a lot of the University web pages and responses related to everything that was happening then.

We've also done some crawling that supports our existing collections. There is a collection on our web archive landing page in Archive-It that's all about the United Steelworkers. There are a lot of different locals and a lot of different websites that encompass, I guess, that very large organization, and so we did a lot of collecting of those websites to support that collection here in Special Collections. But sometimes it's also been just one website that supports or correlates to a collection we've acquired. So when we acquired the Chip Kidd papers several years ago, we also started crawling his website, which we considered part of his papers.

And then, as I mentioned, we've done some collaborations and partnerships over the years, and these have been pretty wide-ranging. The first partnership we did was actually with Cornell
They were doing a project looking at fracking in New York, and we did a project looking at fracking in Pennsylvania, and those were 2 very different state experiences with with this natural gas development industry. So Benjamin Matthew Goldman: we did a collaborative project, working with them, selecting with them, really thinking about how these 2 projects would speak to each other. Benjamin Matthew Goldman: And then I also did a few projects collaborating with other folks in the library around topics like financial literacy or Pennsylvania waterways. We did 2 election archiving projects, one in 2016 and one in 2018. These were both based on Pennsylvania alone. Benjamin Matthew Goldman: And these were collaborations with several other librarians. But also a good example where Benjamin Matthew Goldman: we realize there could be a lot of overlap with other institutions collecting because it seems like everyone is doing web archives every election. So which is probably a good thing. Benjamin Matthew Goldman: And then we also did a student. We've we've done. A couple of student led Benjamin Matthew Goldman: projects. But the first one that I had the opportunity to do this with was a student who was interested in doing something related to the development of the Us. National Trail system. Following the fiftieth anniversary of the National Trails act a couple years ago. Benjamin Matthew Goldman: So he, you know, proposed this idea. We got him started with Benjamin Matthew Goldman: developing and curating what this collection would look like, and then we did that crawling for him. Benjamin Matthew Goldman: and finally everything we've done up to this point. It's all accessible on this landing page through archive it. Every institution kinda has one of these, and you can go there and you can do some browsing. You can do some searching. 
Benjamin Matthew Goldman: Well, I shouldn't say everything, because there are situations where we've tested things and not made those publicly accessible. But essentially everything we've done is publicly accessible through this page. I think that's all I really had to say, Kevin.

Kevin Clair: Pardon me for one moment while I do a little screen share shuffle here, so I can show you what that page looks like. I'll show the back end, what we see when we're doing collection management in Archive-It, later on. But when you go to that link in the last slide, it will take you here, and you can see the collectors here for this 2018 Pennsylvania elections archive: our colleagues Eric, Jeff, and Andrew, who collaborated with us to build that out.

So you can see a lot of what Ben was just talking about is out here, publicly available. And if I click through and go to 2016, you can see all of the different, Archive-It calls them seeds. Every web archive collection that we build has individual URLs that we add to the archive that get crawled. When we do a crawl, we can crawl any or all of the websites that we include within the collection, and it will look for updates. If there are updates, it will pull down different versions of the same site, so you can kind of see how it evolves over time when you look at these on the Wayback Machine.

And like Ben was saying, the Internet Archive will crawl some of these sites anyway. There's just a really basic level of archiving that they do across the web all the time, at kind of a low, ambient level.
Our adding a site to a web archive collection gives us the opportunity to crawl at a little bit greater depth, using different tools than the Internet Archive might necessarily use on its own when it's doing its own base level of crawling. So when I do a demo in a little bit, I'll show what that looks like. Let me make sure this is in... there's the link. Excellent. So let me go back to my slides.

So yes, we manage archives in a lot of different ways. Primarily, we use Archive-It, which, like I said, is a tool that the Internet Archive provides. We subscribe to it. There are different levels of subscription that get you more or less storage space for the crawls that you do. And that's the main tool that most libraries and archives who are doing this activity use. There are some other applications that we use when Archive-It doesn't quite do the job that we want it to do.

This is just a screenshot of what the management side of Archive-It looks like. So this is ours. Along the top here, the main things are collections and crawls. A collection is something that we build: a set of websites that are built around a theme or reflect an existing record group that we have within Special Collections and archives. We might have a web archive for athletics, or we might have a web archive for the United Steelworkers that captures different chapters of the union. I don't know if that's a real example, but it's an example of how we might go about building a collection.

And then crawls are events wherein we either crawl...
We either ask Archive-It to go get an entire collection's worth of websites for us, or we can pick and choose which websites we want to crawl. It will download down to a particular level and get a particular set of file types. So it gets websites, but we can also have it select, to some degree, images or multimedia, depending on where they're hosted and things like that. That's what happens with a crawl.

There are other tools. This is one called Conifer, which has an interesting story. It started as a tool called Webrecorder, which came out of Internet art communities. And the reason is, Ben talked earlier about the difference between content and context when we talk about web archiving. Lots of websites anymore are dynamic. They're built on top of relational databases, so as the database content changes, the actual text that you see on a website will be different, or the imagery that you see on the website will be different. Or they're built using web development frameworks like jQuery, where, as those frameworks evolve, the website, if it doesn't keep up, will not function the way it used to. Things will start to break, and you won't be able to see stuff or click the things that you used to click.

And that's particularly true when people are using the web as a medium for art. They're really on the edge of some of that stuff, and so those things are really brittle, and they break easily. So Webrecorder was built so that, instead of crawling a website, you initiate a recording with this application, and then you yourself actually do the clicking through the website, and you manually go and get the things that you want to preserve. So it's not quite as interactive as an Archive-It crawl would be.
But you can see how some of these websites behaved and acted over time.

And it kind of forked off in different directions at some point. So Conifer is the web application that's managed by this arts collective called Rhizome, in New York City. And then there's Webrecorder itself, which was, and apparently, as I found out last week, still is, a standalone desktop application that you can use. That's maintained by some people who were sad to see Webrecorder go, so they were like, "Let's keep it going." And they did. It's kind of like OpenRefine, if any of you use that; the same thing happened.

So those are some tools that we use. If you'll bear with me for a moment: sometimes I'm a nerd, and people just want PDFs, and I'm like, I can write a Python script to crawl through the website and just get your PDFs for you. And I do that sometimes. But I don't want to make a habit of that. Pretend I said nothing.

How do we provide access to web archives? Primarily we do that through Archive-It. The main way that you're going to access the collections of web archives that we have created over time is going to be through our Penn State collection page on archive-it.org. We also provide access through our finding aids in Penn State Archival Collections, which is the public catalog for the ArchivesSpace collection management system that we maintain for the Special Collections Library and for the various campuses that have archival collections in it.

Oh, I didn't show this. So this is a screenshot of the collection page for the Pennsylvania Shale Energy web archive that Ben built back in 2013. I am going to stop sharing and do my shuffle again, so you can actually see what this is. That's the wrong page. This is the right page.
So yeah, you can see, here's the collection. In terms of metadata we have the basics. We've never really extended it much; I don't know if it's possible to extend it. Archive-It uses a very Dublin Core-influenced metadata schema that you have access to when you're doing collection management in Archive-It, and you can kind of see description, subject, creator. That link is broken. I'll fix it after this meeting.

And you can see that Ben curated it. 4061, that's our ID for this collection. And then you can see all of the individual websites that Ben has set up as seeds that we've crawled over the years to build this collection. It's a lot of websites and blogs that people have maintained about fracking and about shale development, environmental websites, things about pipeline projects. So a very wide-ranging, very complete archive of all the different aspects that relate to development of the Marcellus in Pennsylvania, and of different projects that have happened related to shale over the years.

We will shuffle again, because I don't know how Zoom works. The next slide I want to show... oops. I don't know how Zoom works. Bear with me for one moment. That's the one. Okay.

So I'm gonna stay here. This is what it looks like in the ArchivesSpace public user interface, Penn State Archival Collections. It's the public catalog through which we provide access to our finding aids. We're still kind of developing the archives catalog in a way that makes sense to people. But the way we have it set up is that we consider this to be like an access digital object.
So anything digital has a record that looks like this in ArchivesSpace. Whether it's a web archive collection, or an image or a video or an audio file that we provide access to through CONTENTdm, it all has a record that looks like this. And the big blue link that you see there, that takes up most of the page, will take you to that site that I just showed you in Archive-It. So this is another way of both providing access to our web archives and positioning them within the greater context of their collection, whether it's a collection that we maintain that is a web archive collection, like this, or, as we also have instances of in the archives, a web archive collection that is just one component of a larger collection or group of records from the University Archives.

So this site gives us a way to preserve that context. Oftentimes, as records that we maintain move from physical form to something we can only access through the web, we're able to present them alongside one another, and you can see the entire collection in all of its different formats that way.

I am going to let Ben talk about the ethical challenges of web archiving, if Ben does not mind.

Benjamin Matthew Goldman: I do not mind. I might be a little out of date on the current conversations, too, but for many years this was a big topic within web archiving discourse. It's interesting, because I know the Internet Archive has had a pretty big copyright decision handed down on them for the National Emergency Library, which was all about ebooks, but they kind of notoriously take the same approach to websites.
But, as far as I can tell, the entire web archiving community is organized around a couple of loose concepts for how they approach things like rights and permissions. One is that you can honor the sort of code snippets that are in there. So when I said the New York Times prevented the Internet Archive from crawling its site, they did that by updating this little piece of code that specifically blocked their technology from doing it. Some collecting institutions choose to honor what those say. Some will ignore them, but send a notification to websites saying that they're doing this.

Others, and I think I've seen this in very, very limited cases, will seek permission. I've especially seen this around more community-type websites: community archives, community memory, community groups, where people are taking a little bit more care with their interactions. So there's a bullet point there, "building relationships with content creators"; I think that's reflecting that idea.

Another way you can do this: there are some websites that have specific restrictions written into their website content somewhere, like in a footer. With a lot of the Penn State websites, obviously, we own this material. In the same way that we consider ourselves the holders of copyright in the University Archives, we would approach websites in the psu.edu domain the same way.

I think there have also been a lot of discussions within the community archiving profession about collecting around incidents where there was great trauma involved.
So a lot of this event-based collecting is sometimes touching on some really, really difficult subjects, and involving people who might not want their trauma, or their participation in an event, necessarily archived. You can contrast that with a lot of the web archiving that's been done around government websites, where I think most people feel that those are all public websites, and there's a very civic-duty approach to web archiving in some of those cases.

And I'm really curious how the conversation is continuing these days, when you think about how the web is evolving. I don't know what kind of conversations might be happening around artificial intelligence. But I know there's been dialogue around how misinformation plays into choices about web archiving. There are discussions of the now well-known ways in which algorithms can foster oppression that exists institutionally within society. And I think all of these are issues that people doing web archives have thought a lot about.

Kevin Clair: Yeah, as a way to bridge over to technical challenges, I think one of the things with preserving AI-generated materials in particular is that it's kind of impossible to really do it completely. You have to know what the model was that generated that content, and what its seed content was, and all of that stuff that nobody can share, or would share if they could. And the same model will lead to completely different outcomes depending on what you seed it with, because that's how LLMs work. So it will not...
We'll never be able to do it in the way that we could for different types of content. So a lot of the conversations that I've been in around AI have to do with Kevin Clair: what the best that we can do looks like in that context. And I think that applies to things like AI art, and Kevin Clair: deepfakes and things like that: Kevin Clair: what does it look like to do the best that we can? We're never going to get it perfect. We're never going to preserve all of the context that we would want to. Kevin Clair: So, technical challenges. Kevin Clair: Yeah, I think the big thing that I've talked about already is, are we able to preserve all of the content of a site, even? Because Archive-It has Kevin Clair: limits that you can adjust, that dictate how far down within the hierarchy, like how many levels deep within a website, they'll crawl. Kevin Clair: And websites now are database-driven for the most part, and may not have that clear hierarchy that you see. Even Kevin Clair: ArchivesSpace is almost exceptional: you can see the slashes. There's nothing that's really Kevin Clair: a query in the same way that you would see if you went to, like, Amazon. And so, Kevin Clair: how do we get as much as we can? Or Kevin Clair: how much do we actually want to get from some of these websites, too, is another question. And then the context: are we able to preserve all the functionality of a site? Kevin Clair: Will things start to break as some of these Kevin Clair: frameworks that websites are built on upgrade over time, and functionality starts to get lost if you don't keep up? Kevin Clair: Are we able to both preserve the context right now and then maintain that context as technology evolves and time passes?
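To make the point about hierarchy and depth limits concrete, here is a minimal Python sketch. It is not how Archive-It scopes its crawls: it assumes "depth" simply means the number of path segments in a URL (real crawlers often count link hops from the seed instead), and the function names and example URLs are hypothetical. It shows why a URL with visible slashes has a depth that a limit can act on, while a database-driven URL keeps its state in query parameters.

```python
from urllib.parse import parse_qs, urlsplit


def path_depth(url: str) -> int:
    """Count the non-empty path segments in a URL, e.g. /a/b/c gives 3."""
    return sum(1 for segment in urlsplit(url).path.split("/") if segment)


def within_depth(url: str, max_depth: int) -> bool:
    """True if the URL sits at or above a (hypothetical) depth limit."""
    return path_depth(url) <= max_depth


def is_query_driven(url: str) -> bool:
    """True if the URL carries its state in query parameters."""
    return bool(parse_qs(urlsplit(url).query))


# Hypothetical URLs: one hierarchical, one database-driven.
HIERARCHICAL = "https://archives.example.edu/repositories/3/resources/1234"
QUERY_DRIVEN = "https://shop.example.com/s?k=yearbooks&page=2"

for url in (HIERARCHICAL, QUERY_DRIVEN):
    print(url, path_depth(url), within_depth(url, 3), is_query_driven(url))
```

Under this simplification, the hierarchical URL is four levels deep and would fall outside a limit of three, while the query-driven URL is only one level deep no matter how many result pages sit behind it, which is one reason a depth limit alone says little about how much of such a site gets captured.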
Kevin Clair: One of the things that we haven't talked about yet is how we go about preservation of web archives. We are not, to my knowledge, keeping any Kevin Clair: WARCs. WARC is the format, so I'll say WARCs from now on. I think it literally just means web archive. But these .warc files are the outputs of web archiving activity, and I don't think we've put anything in LIBSAFE yet. So Kevin Clair: what will that look like going forward? Kevin Clair: How do we provide meaningful access to web archive files? It may be enough just to show people Archive-It, or point people to the Wayback Machine, where they can see the timeline of all these different website crawls and things. Kevin Clair: Or do people want to view these websites differently? Instead of that big blue link in ArchivesSpace, would they want to have a little window that displays what the website looked like, where they can kind of, Kevin Clair: you know, change the view over time like you can do in Google Maps or Street View? Kevin Clair: What are the different ways that we can provide access beyond what we're doing right now? And are they worth the time that they would take for us to develop? And then, are we documenting what we do? Kevin Clair: Whether it's Kevin Clair: us talking to you about what we do right now, or documenting some of the collection management decisions that we're making in Archive-It, or in a system like ArchivesSpace, where we Kevin Clair: talk about the collection management that we do. How are we sharing our work and making it repeatable for other institutions who want to collaborate with us, or do what we're doing, or Kevin Clair: for us to learn from other institutions who are dealing with some of the same challenges that we are? Kevin Clair: So right now, I'm just gonna switch over again and show you all really quickly how we do some of this work.
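For readers unfamiliar with the web archive (WARC) files mentioned above, the following Python sketch shows roughly what one looks like inside. It is an illustration under stated assumptions, not the tooling used at Penn State or by Archive-It: `make_record` and `read_warc_records` are hypothetical helper names, the example URLs are made up, and the reader handles only uncompressed records (in practice WARC files are commonly gzip-compressed and read with a dedicated library). Each record is a version line, a block of named header fields, a blank line, and then exactly `Content-Length` bytes of captured content.

```python
import io
import uuid


def make_record(uri: str, date: str, body: bytes) -> bytes:
    """Build one simplified, uncompressed WARC 'resource' record."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {date}\r\n"
        "Content-Type: text/html\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    # Each record ends with two CRLF pairs after the content block.
    return header.encode("utf-8") + body + b"\r\n\r\n"


def read_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines between records
        headers = {"version": version.strip().decode("utf-8")}
        while True:
            line = stream.readline().strip()
            if not line:
                break  # blank line closes the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


# Two made-up captures, written and then read back.
sample = make_record(
    "https://example.edu/", "2023-09-26T12:00:00Z", b"<p>home</p>"
) + make_record(
    "https://example.edu/athletics", "2023-09-26T12:00:05Z", b"<p>athletics</p>"
)

for headers, body in read_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Date"], headers["WARC-Target-URI"], len(body))
```

Because every record carries its own target URI and capture date, a single file can hold many captures, and an access tool can use those two fields to build the kind of timeline described above.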
Kevin Clair: I'm going to demonstrate, based off of Kevin Clair: a little web archive that I made a while ago. So this is the management side of Archive-It. You can see Kevin Clair: all of our different collections. There are many more than 16 here. Some of these are what Ben was talking about, where we have a lot of test crawls that are not quite ready to be made Kevin Clair: publicly accessible yet. So some of those are in here. There are 16 that are public, but many more that are not. One of them is this one, because it's just me messing around, remembering how the site works. So you can see the overview. There's nothing here, because I've only done these test crawls to make sure everything is working, and that it's crawling all the stuff that I wanted to crawl, and capturing all the pages that I wanted to capture. Kevin Clair: So I haven't really done any metadata for this yet, as you can see. Kevin Clair: I'm not gonna show any more about this. So yeah, you can edit. And this is Dublin Core metadata. If you're familiar with Dublin Core, Kevin Clair: these 15 fields will look familiar to you. Kevin Clair: The main thing I want to show is the seeds. So there they are. These are the URLs that I would like for this particular collection to include. Kevin Clair: So this is the top-level URL, and it'll crawl as much as it can within this particular domain. So Kevin Clair: here I've got all of the Kevin Clair: current, I've got me and I've got, yeah, the up site, all of the current or soon-to-be Division III programs at Penn State University. And then I added Beaver, because I think I did this because Kevin Clair: Amy asked about it, which is how this collection came to exist. Kevin Clair: So if I wanted to add A. C to this, I would go to Google, and I will pick Greater Allegheny Kevin Clair: athletics.
Kevin Clair: And I will go to here. Kevin Clair: And I'm just gonna do the, yeah, URL part, Kevin Clair: just in case. It automatically takes you to landing index, but sometimes it doesn't. And so I wanna make sure that we include anything else that it might direct us to. Kevin Clair: If I click on Add Seed, Kevin Clair: I'll just add the one; you can add many if you want. I'm gonna ignore that. Kevin Clair: Let's see. Kevin Clair: Yes, this is okay. So these are just different options. You can decide whether or not the public should have access to your seed. I'll just set this to one time. If we were doing this in production, I would set a frequency Kevin Clair: that would automate over time. We have some monthly, quarterly, and annual crawls that we do. You can make a curatorial decision there, based on what you know about how much the site changes and whatnot. Kevin Clair: We'll add this seed, Kevin Clair: and then, if I wanted to run a crawl, Kevin Clair: you can click any combination of these seeds individually, but I'm just gonna click on all of them, and then we will do Edit Settings. Kevin Clair: Wait, no, we won't. Run Crawl. Kevin Clair: So when you go to run a crawl, we'll do a test. This won't save; a test crawl will eventually have its data deleted after 30 days, unless you decide that you like it and you want to keep it, and then it becomes permanently part of your collection. Well, until you delete it. Kevin Clair: I will leave the time limit the way it is. I wanted to do this mainly to talk about the crawling technology here. There are 2 things. One is Standard, which is the standard option. There's one called Brozzler. I don't know why they have such silly names, but Brozzler is now part of Archive-It proper. It used to be sort of a separate application that you could run, Kevin Clair: and it was kind of created.
Kevin Clair: One of the reasons it was created was because university athletics websites are notoriously difficult to crawl, because Kevin Clair: they're all run by the same 2 or 3 CMS providers, and they all have super dynamic content. It's a lot of multimedia, audiovisual stuff, and Kevin Clair: they kind of don't want you to crawl their websites. Not to the same Kevin Clair: antagonistic degree that something like Twitter or Facebook or Instagram would, but they definitely don't want you crawling their data, and they make it difficult to get everything that you would want to get. And so this tool, called Brozzler, is able to drill a little bit deeper into those dynamically generated multimedia aspects of an athletics website that Archive-It by itself Kevin Clair: was not really designed to handle. So anytime we're doing anything sports-related, like athletics department websites, we use this Brozzler tool, and it helps us to get Kevin Clair: more things than we could get otherwise. So I will let that run. It will take Kevin Clair: a long time, so we're not gonna stay here. But that's kind of how we set these up. Kevin Clair: I'm gonna go back to Kevin Clair: my main slide deck, Kevin Clair: because we are at Kevin Clair: the end. And so I wanted to Kevin Clair: open it up to everyone who's here, because one of the things that I'm interested in is that Kevin Clair: web archiving is certainly not an activity that is necessarily exclusive to what we're doing in special collections and archives. Kevin Clair: I've heard of some different projects that people have outside of the libraries, where they would want to potentially make use of Archive-It to collect information about different disciplines and how they've evolved over time, or different Kevin Clair: Michael.
Kevin Clair: different, just different projects that people have. And so I'm curious to hear from others who are on the meeting today about what's happening at your campuses, or what's happening in your subject libraries, that potentially could be a web archiving project that we could Kevin Clair: build some possible collaborations around. So at this point, for the next 10 minutes, I think I'd like to open it up for conversation. Jennifer Gilley: Something that I'm interested in doing is trying to archive our student Jennifer Gilley: newspaper, which is online now. It's on a WordPress site. I'm looking at it right now. And then the newsletter itself Jennifer Gilley: is in a piece of software called Flipsnack. Jennifer Gilley: Yeah, I don't know how to archive that or how to save it. Kevin Clair: I've never heard of it. What's it called? It's kind of strange: when you click on it, it opens like a book. It's like you're reading a book, and the pages actually flip digitally. Jennifer Gilley: So it's just not easy. We can't print it out. I don't know how to archive it. Kevin Clair: I see that it is on a website. So I wonder if Web can be? It would work. Yeah, I don't know. Ben, did you do any archiving of student newspapers? We've had conversations in the digital projects team, but not about web archiving per se. Benjamin Matthew Goldman: So I know this came up with Amy, too. Benjamin Matthew Goldman: I mean, Sites is just WordPress, and WordPress is pretty straightforward to archive. So it's not technically difficult. Benjamin Matthew Goldman: What came up with Amy, and what I would want to be mindful of, and this isn't really even a web archiving issue, is just the ethics of archiving Benjamin Matthew Goldman: student curricular work.
Benjamin Matthew Goldman: Yeah, I think in the case that we were looking at, it was possible that the student newspaper was something that was done as part of a class. And I know that most university archives, by policy, don't collect any student Benjamin Matthew Goldman: curricular work. Of course we have tons of this that shows up Benjamin Matthew Goldman: in our collections. But we definitely don't try to add to it if we can help it. Benjamin Matthew Goldman: That being said, I guess the correlation here at University Park campus is the Daily Collegian. But that's privately owned. Benjamin Matthew Goldman: You know, students who are involved in that are employed by the company that owns it, so it's a slightly different setup. And I guess it just depends. This is where I think the curatorial Benjamin Matthew Goldman: effort is required, to look at some of these details and determine whether it's, you know, archivable. Kevin Clair: Yeah, I think it's easy for me, the Archive-It product owner, to be like, yes, let's go get it right now and figure it out. But then you have to position it within wherever we want to Kevin Clair: locate it, to Kevin Clair: preserve its context, and within all of the different Kevin Clair: predecessor formats of the title. I'm sounding like a cataloger more than an archivist, but you know, just maintaining it within the continuum of all the other newspapers that preceded it. And that's a curatorial decision. Kevin Clair: Any other ideas? Kevin Clair: I'll say that one thing I've thought about a lot, Kevin Clair: just from other conversations that I've had with other Kevin Clair: digital archivists and digital humanities adjacent, like digital scholarship, Kevin Clair: librarians, is Kevin Clair: we have plenty of examples of faculty papers in the archives that are in paper form. And anymore.
Kevin Clair: Labs have websites, and departments have websites, and different research projects have websites, and how do we go about Kevin Clair: working with Kevin Clair: those groups? Kevin Clair: Or do we work with those groups at all? What does it look like to preserve those materials over time in a way that Kevin Clair: preserves them as Kevin Clair: archives, but also potentially scholarly resources? Because I think the line between those 2 things is blurry Kevin Clair: when working in digital formats. And Kevin Clair: do institutional archives even make sense in that context? Does it make sense for Penn State to collect the papers of a particular Kevin Clair: entity when that entity is by nature cross-disciplinary and cross-institutional? Does anything have to live in one place anymore when it's on the Internet? Just things like that. Kevin Clair: I'd be curious what the curator thinks, Kevin Clair: the curators, all of you. Benjamin Matthew Goldman: I feel like you're asking me, but I'm happy to defer to anyone else. I can only see you. I can see you and Jennifer. But yes, anybody who's on the call whose face I cannot see Kevin Clair: is welcome to answer that for me, Kevin Clair: or Ben can answer, either way. Benjamin Matthew Goldman: I'll just say, Benjamin Matthew Goldman: when it comes to collecting faculty papers, our approach now is to seek a deed of gift from that faculty member, and so I don't think we'd be in a position to do any Benjamin Matthew Goldman: widespread crawling of faculty members' Benjamin Matthew Goldman: personal websites, or other presences on the web outside of what appears within whatever unit or department they're a part of. Benjamin Matthew Goldman: For all those at Commonwealth Campuses,
I can tell you that all of your main Commonwealth campus websites are crawled on a regular basis as part of the Benjamin Matthew Goldman: University websites crawl. So if you're ever interested, that material is out there, going back, Benjamin Matthew Goldman: actually, it goes back to the early 2000s, probably, for a lot of those. Benjamin Matthew Goldman: Lauren, yes, we crawl. Benjamin Matthew Goldman: I try to get all the administrative units of the university, the academic units, Benjamin Matthew Goldman: the campuses. There are all sorts of weird little sites that crop up here and there. So occasionally I'll Benjamin Matthew Goldman: come across Benjamin Matthew Goldman: something in a Penn State news story where I'm like, oh, I should get that. I think there's a free speech website now, because of all the controversial speakers we have here. So as soon as I saw that pop up, I was like, I'm getting that. Kevin Clair: For the recording, Lauren, I don't know if it'll be saved, but Lauren's question was, do we preserve top-level psu.edu Kevin Clair: domain sites that pop up, like ops or things like that. Kevin Clair: We have a couple more minutes, so we have time for one more question, if anybody has a question or an idea that they want to throw out there. Kevin Clair: If not, I will say, just to wrap up, I don't have a conclusion slide. But I will say that the next digital curation community of practice public event will be on November 29th. Nathan Tallman will be convening that one. It'll be about some working definitions that we've been talking about for different areas of the digital curation Kevin Clair: activities that are happening at Penn State. So I will look forward to seeing some of you on the 29th of November, at 1:00 PM. And thanks for coming to our presentation today. Derek Gideon: Thank you.

Digital Curation CoP - Web Archiving

From Bethann Rea September 26th, 2023


A brief presentation and discussion on web archiving at Penn State University Libraries. Our hosts, Kevin Clair and Ben Goldman, will provide an overview of current web archiving activities managed by the Eberly Family Special Collections Library.

Tags: digital curation, web archiving, born digital, university libraries, archives

Usage: Zoom Recording
Creative Commons: CC-BY
Appears In: Penn State Libraries Digital Curation Community of Practice
