This episode is the “Can AI propel cultural heritage institutions through their digital transformation” panel from the Generative AI & the Creativity Cycle Symposium hosted by Creative Commons at the Engelberg Center. It was recorded on September 13, 2023. This symposium is part of Creative Commons’ broader consultation with the cultural heritage, creative, and tech communities to support sharing knowledge and culture thoughtfully and in the public interest, in the age of generative AI. You can find the video recordings for all panels on Creative Commons’ YouTube channel, licensed openly via CC BY.
Mike Kemezis (Connecticut Humanities) moderating a conversation with Mike Trizna (Smithsonian Institution), Garvita Kapur (The New York Public Library), Abbey Potter (Library of Congress), and Amanda Figueroa (Curationist)
Announcer 0:00
Welcome to Engelberg Center Live, a collection of audio from events held by the Engelberg Center on Innovation Law and Policy at NYU Law. This episode is the "Can AI propel cultural heritage institutions through their digital transformation" panel from the Generative AI & the Creativity Cycle Symposium hosted by Creative Commons at the Engelberg Center. It was recorded on September 13, 2023. This symposium is part of Creative Commons' broader consultation with the cultural heritage, creative, and tech communities to support sharing knowledge and culture thoughtfully and in the public interest, in the age of generative AI. You can find the video recordings for all panels on Creative Commons' YouTube channel, licensed via CC BY.
Mike Kemezis 0:51
Good morning. Glad to be here. So I'm going to introduce our panel; we only have three or four up here, but that's fine. I'm Mike Kemezis. I work for Connecticut Humanities, which is the humanities council for the state of Connecticut, working on digital infrastructure projects along with consulting with cultural heritage institutions about their different collections. So I'm really honored to moderate this panel. Up here with me is Garvita Kapur from the New York Public Library. She's the senior director of digital technology at the library and leads the software development, QA, and product management of all digital products and services. I'm also happy to introduce Abbey Potter from the Library of Congress. She's a senior innovation specialist and founding member of LC Labs, the digital innovation team in the Office of the Chief Information Officer. And right next to me is Amanda Figueroa from Curationist. She is the platform director at Curationist, where she brings 12 years of experience in arts and culture, community engagement, and ethical, sustainable strategy to open access. It says here on the program that everyone has two to three minutes to talk about themselves, so maybe we'll just take a minute each.
Amanda Figueroa 2:10
Sure, I'd love to talk about myself. Good morning, everyone. I'm absolutely thrilled to be here. Thank you so much to Creative Commons for organizing and keeping us all on schedule; that's very easy to do at the beginning of the day, and I know it will get harder. I'm here today with a project called Curationist, which, every time I talk about it, I always say has become my dream job. I'll just talk briefly about what it is we do and then identify a couple of ways that we're beginning to think about AI for ourselves, but also for the ecosystem of open access cultural heritage broadly. Curationist is at heart a search aggregator. A few years ago, we set a dev team with the task of pulling together all of the open access APIs from large museum institutions around the world, with the idea that, currently, all of the digitized collections in places like the Met or the Reykjavik museum are siloed to that organization: if you want to find something, you have to search the Reykjavik Museum, and then you have to search at the British Museum, and then you have to search at the Metropolitan Museum, et cetera, et cetera. Our dev team are superstars, it must be said, because they were able to do this. We have a platform now at curationist.org where you can go type in your search terms and be delivered digitized, open access art images matching your search from 14 of our institutional partners, which have sourced us with 4.4 million art records in our database. This is incredible. It feels crazy that I get to say that it works, but it does work. And even more special: in the pursuit of making this work, we realized we actually have an opportunity not simply to normalize metadata to make all of these 14 institutional partner systems talk to each other and talk to us; we actually have an opportunity to contribute metadata, and to do research on individual works, individual creators, or sets of works to augment these digital metadata records, and in the pursuit of doing that make them more findable and therefore more accessible. I'm happy to share some of the work that our amazing metadata and digital archivists have done on individual works and collections of works. I will also say we run programs for fellows and for art writers, or critics more broadly, to come in, share their subject matter expertise, and continue doing that work. It's incredible, it's important, it is deeply vital decolonial work that intervenes on issues of knowledge parity, and when we're at our best it gets to elevate indigenous knowledge to a position of parity with the institutions that hold indigenous objects whose provenance is under question. That being said, we have 4.4 million object records in our database; that's an insane number. Most, if not all, of those records have more than one image attached to them. So it's truly an incredible amount of content. We could have all the digital archivists in the world and we still would not be able to do that vital work on every single individual record. I know this because these works are held in places like the Smithsonian, and they don't even have the resources to do all of that individual work. So we're at a place now where our platform has launched and we're very excited about it. We want to continue supporting open access, digitization, and licensing for even smaller museums who don't have the same resources to create an API, manage their own web hosting, or have the staff capacity to consider this, let alone launch it.
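To make the aggregation idea concrete, here is a minimal sketch of how a query might be fanned out to several partner APIs and the results normalized into one shared schema. The endpoints, field names, and mappings below are hypothetical placeholders, not Curationist's actual partner integrations.

```python
# Illustrative sketch only: the endpoint URLs and field names are invented,
# not the real partner APIs or Curationist's implementation.
import requests

def normalize(raw, mapping, source):
    """Map one partner's raw record onto a shared schema."""
    record = {common_field: raw.get(partner_field, "")
              for common_field, partner_field in mapping.items()}
    record["source"] = source
    return record

def search_partner(base_url, query, mapping, source):
    """Query one partner API and normalize each returned record."""
    resp = requests.get(base_url, params={"q": query}, timeout=10)
    resp.raise_for_status()
    return [normalize(item, mapping, source) for item in resp.json().get("records", [])]

def aggregate_search(query):
    """Fan one query out to every partner and merge the normalized results."""
    partners = [
        # (base_url, partner-field -> common-field mapping, label) — all hypothetical
        ("https://example-museum-a.org/api/search",
         {"title": "objectTitle", "creator": "artistName",
          "date": "objectDate", "image_url": "primaryImage"}, "Museum A"),
        ("https://example-museum-b.org/api/v2/objects",
         {"title": "name", "creator": "maker",
          "date": "created", "image_url": "thumbnail"}, "Museum B"),
    ]
    results = []
    for base_url, mapping, source in partners:
        results.extend(search_partner(base_url, query, mapping, source))
    return results

if __name__ == "__main__":
    for record in aggregate_search("wampum belt"):
        print(record)
```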
But something that's increasingly on our mind is: how can we use artificial intelligence to help speed up this process of metadata review? We're considering a couple of things. First and foremost is natural language search, which would help us build links between metadata terms that are related but not necessarily overlapping. The example I gave to some of you yesterday was: if you come and type into Curationist "indigenous cultural heritage from Canada," you'll get things that are tagged indigenous, culture, heritage, Canada, but what you really should also be receiving are things that are tagged with related terms but perhaps don't have all those other words. So we're developing natural language search that can understand the linkage between these terms and deliver results that people are seeking, even if they don't have the "right" terms in the search box. Of course, that means we also have to be very considerate of some things that Bridgette mentioned earlier as major, major sticking points, which are issues of, certainly, bad actors, of racial bias, colonialist bias, white supremacy more broadly, that infiltrate AI systems. But even setting that aside, it's also an issue of how quickly cultural heritage terms shift, change, and are uncovered. So the question for us has become: how do we build into whatever system for natural language processing we implement a review process where our metadata and digital archivist experts can continue to make sure that we're using the most accurate, the most relevant, and the most culturally sensitive terms? And that, of course, led me very quickly to the question of, well, is that actually improving our process? Or is that just changing the type of work that we ask those teams to do? Is it relieving their labor in a joyful, thoughtful, utopian way? Or is it just asking them to learn a different task? I'm not sure yet, but I hope to find out. So that's the perspective that I'm bringing here today, and these are topics that we consider deeply at Curationist, both in a technological sense and from a perspective of prioritizing indigenous data sovereignty. So really, really excited to be here and to talk to you more.
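A minimal sketch of the related-term idea Amanda describes, assuming an off-the-shelf sentence-embedding model: candidate links between a query and existing tags are scored by similarity and routed to a review queue rather than applied automatically, so archivists keep the final say. The tag list, threshold, and model choice are illustrative assumptions, not Curationist's system.

```python
# Sketch: surface related metadata terms with embeddings, then queue them for
# human review. All data below is invented for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

existing_tags = ["First Nations", "beadwork", "Canada", "wampum", "landscape painting"]
query = "indigenous cultural heritage from Canada"

tag_vectors = model.encode(existing_tags, convert_to_tensor=True)
query_vector = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every existing tag.
scores = util.cos_sim(query_vector, tag_vectors)[0]

# Candidate expansions go to a review queue, not straight into search results,
# so metadata experts can reject terms that are outdated or culturally insensitive.
review_queue = [
    {"tag": tag, "score": round(float(score), 3), "status": "pending_review"}
    for tag, score in zip(existing_tags, scores)
    if float(score) > 0.4  # threshold is a tunable assumption
]
print(review_queue)
```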
Garvita Kapur 8:51
It's sort of hard to introduce an institution like the New York Public Library. The mission of the library is to inspire lifelong learning, advance knowledge, and strengthen communities. There is a deep commitment to strengthening communities; humans are at the center of libraries, and for that we rely on our staff, our collections, and our physical and digital spaces. In terms of digital at the library, it's a fairly new department; it's been around for about seven years. In digital, we have our website, which is nypl.org. We also do preservation of digital objects. That means that we have tons and tons of hard drives, petabytes of data sitting on hard drives, and we have the challenge of preserving them and making them accessible. That's definitely been challenging given the amount of metadata needed to describe them and make them accessible. Being the library, we have a couple of catalogs, for research and for the branch libraries, and the research catalog has millions of records in it. And last but not least, we also have e-reading, providing digital books through our apps. If you haven't dug into the e-books, check out the audiobooks. We also have a platform that provides free books to students in schools, where any child can log in with their school login and get access to library books at whatever reading level they are comfortable with. There is a lot we do, and there are a lot of questions we're dealing with, especially when it comes to AI. One of the biggest ones is: where do people fit in with AI? The way a lot of leaders are looking at AI today is to find ways of eliminating humans and using AI instead. Do you want AI to complement your staff rather than replace them? That's the question. Because, like I said, with libraries it's all about human connection, it's about community. So that's the most important thing: that we make all of our diverse collections available to all our patrons and do the best that we can, bringing AI into the mix to serve them better.
Abbey Potter 11:54
Thanks. I'm Abbey Potter, I'm at the Library of Congress. The Library of Congress is a big, giant place. It includes the Copyright Office, the Congressional Research Service, the Law Library, the National Library Service for the Blind and Print Disabled, and I think that's it. And then there's the library part of the library, which is like nine different libraries inside that library, and archives and special collections. It's everything. And I work in the OCIO, the Office of the Chief Information Officer, so we serve all of those units, and in LC Labs we work in a lab environment. This is where we experiment with different new technologies that are coming around to see how they could help us fulfill our mission and serve our users, which are very diverse. Our Librarian of Congress, Carla Hayden, who is not that new anymore, her vision for the library is to connect to all Americans. And we don't circulate our collection; it all stays at the library. So digital is really the way that we're looking to do that. We've been exploring AI and looking at it since 2018, and we came to it very practically and, honestly, kind of before this big hype. But as many of you know, AI has been around a long time; that term has been around for a long time. What we were first doing was looking at computational access to collections. We've put a lot of effort into digitizing our collections, mostly public domain materials, and putting them online, but then we don't have the item-level metadata or paragraph-level metadata that helps people find what they're looking for. So we were investigating ways to get more use out of these digital collections and make them more valuable to our users. And then AI became like, oh, this could solve our problem, this is really exciting, great. So we started looking at it. And unfortunately, it didn't really work the way that we needed it to work, and there were lots of different reasons why. A lot of our content is historic; it's noisy, it's messy, it's not structured, it's described at different levels. It's just a very complex set of data. And most AI models are trained on contemporary data, especially in research, on well-formed, well-known data. And that's what we don't know about our data: we don't know how it's formed, we don't know what's in it, generally. So that became a mismatch. We sort of think about AI as the next big technical revolution, or the next wave that's crashing down on us. But we, as librarians and archivists, know this has happened to us before, with digitization, with things going online; we know what to do here. And that, basically, is the short version. I've worked on ways to share results of what we're doing and come up with quality standards for our specific formats, for our specific use cases. And so that's what we're trying to build now: this framework that really considers the people, like other folks on this panel talked about, the data that we have, and then the models that we're using, to try to understand what works and what we could use, together, for the benefit of our users. I'll stop there. No more.
Mike Kemezis 15:56
Great, thank you. Listening to some of the things that you've all mentioned about what you're doing with AI, or thinking about, really brings to my mind: how can this help institutions fulfill their mission? We're talking about cultural heritage institutions, we're talking about libraries, from the Library of Congress to NYPL, down to the local historical society, or the local art museum, or the community groups that I've worked with, and that we've probably all worked with in our careers. What's the elevator pitch? If someone comes to you and says, well, how would this help me? How would this help my historical society, help my museum, fulfill my mission? What would you be telling people? I get this question all the time, and I find myself asking it at conferences. We all see it, right? We all can understand this is powerful and this could work, but how do we articulate that? I guess that's the question I have. Yeah, okay.
Abbey Potter 16:56
So I think first, being really clear on what your values and principles are, as an institution or even as a department, and what the clear goal and problem statement is, and starting there, because that will help inform how you make decisions along the way. So getting very clear on what problem you're trying to solve. You're not just trying to use this new toy; you're actually using it to solve a problem. What is that problem? And really dig into who the people involved are. Who are the stakeholders? Who are the people making decisions? Who are the people who will be using the system or tool, the staff, the decision makers, the managers? Who is depicted in this data, and how can we bring out their voices? And, often with AI, if you're going to train on data, who has been labeling the data? Is that information clear? What was the incentive structure for that labeling of data? Then, once you've looked at why you're doing it and who's involved in it, what I try to do is get down to making it very, very boring. You have this very big, exciting idea, but try to make it very, very boring and do a risk-benefit analysis on each single little part of it. And understand that, with what we've done, it's not always on the first try that you get the results that you want. But I think it's about having the stakeholders in mind, and then what we've tried to do is focus on specific tasks. For example, we have an experiment where we're trying to make MARC records, which are library catalog records, from e-books. Can we use AI to do this? We thought this should be easy. And for some fields it is easy; for other fields it's not. We're getting, like, 98% accuracy for ISBN and LCCN numbers, but only about 25% of the subject terms are matching what catalogers are doing. So overall we're getting about 75% matching what catalogers do, and that's probably not good enough. That's why we're trying to think through what the standards are, and can we agree on these standards? And keeping in mind who the people are: for this cataloging experiment, one of the design principles we set out is that the goal is to help catalogers, not to replace catalogers, which we brought up before. So I think, yeah: the principles, the stakeholders, and getting a notion of what quality could look like. Do you have an existing quality standard that you could go to? Or do you need to develop a new one? And how would you do that?
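The per-field accuracy figures Abbey cites suggest a simple way to frame that evaluation: compare each generated field against the cataloger's value and report a match rate per field. The sketch below shows that shape; the sample records and field names are invented for illustration and are not the Library of Congress experiment's actual data or code.

```python
# Sketch: field-level scoring of AI-generated catalog records against
# cataloger-created records. Sample data is invented.
def field_accuracy(generated_records, cataloger_records, fields):
    """Return, per field, the fraction of records matching the cataloger's value."""
    scores = {}
    for field in fields:
        matches = sum(
            1 for gen, truth in zip(generated_records, cataloger_records)
            if gen.get(field) == truth.get(field)
        )
        scores[field] = matches / len(cataloger_records)
    return scores

generated = [{"isbn": "9780306406157", "subjects": ["Dogs"]},
             {"isbn": "9781861972712", "subjects": ["Astronomy"]}]
cataloged = [{"isbn": "9780306406157", "subjects": ["Dogs -- Behavior"]},
             {"isbn": "9781861972712", "subjects": ["Astronomy"]}]

print(field_accuracy(generated, cataloged, fields=["isbn", "subjects"]))
# e.g. {'isbn': 1.0, 'subjects': 0.5}: identifiers match well, subject terms lag,
# mirroring the pattern described in the panel.
```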
Garvita Kapur 20:13
That's a tough question. Thank you, Abbey. We've been asking ourselves the same questions: what is the right thing for us to do that drives mission value in terms of using AI? There are plenty of things we could do. We could use it for staff, or we could use it for patrons. In terms of staff, we could use it for quality assurance in software, we could use it for helping procurement with their contracts, we could use it for just improving productivity in general. We're also looking at patrons, at how patrons use the website, to see the gaps and provide better support and better services to our patrons by understanding their needs. In terms of pacing, the patron-facing side seems a little risky right now. It feels a little bit safer to do this with staff; we have a control group, a captive audience, who we can work with to learn and improve. Public-facing things are a bit more risky right now. But there's an opportunity here to use AI for accessibility, providing multilingual support, transcriptions of our vast collections. We can't do all of that, so we have created an AI working group, and the role of that working group is to look at all of these things and make recommendations. But the very first thing that we are doing in that working group is to learn about AI and make sure we are all on the same page about what AI is. At the library, we have staff who come from branches, research, and technology. How do we all speak the same language when we talk about AI? So that's the first thing we're doing: get our AI 101 and speak the same language, then do some investigation in terms of what staff-facing tool we should try and what public-facing tool we should try, then educate the rest of the library and our stakeholders, then make recommendations on what the right things to do are, and then build up by implementing AI. I think it's important that we pay attention to how AI is used in an ethical, responsible way as an organization, in a way that aligns with our mission and values. With the fast adoption of ChatGPT, it's sort of become this thing where everybody wants to get on the bandwagon; things which are not even AI get called AI, just because it's the new buzzword. So I think it's important to take a moment instead of jumping in too easily. Is this a problem that we need to solve with AI? Or is this a regular digital problem that doesn't need an AI solution? I mean, if you had to go pick up milk from your corner store, do you need to buy a Ferrari to go there, or can you just walk there? Just asking those questions of ourselves, instead of hyping, is what we're doing.
Amanda Figueroa 23:46
Yeah, I really appreciate both of these answers; you both really identified some of the best possible outcomes. So I'd love to sweep in like Maleficent at the end and give you the problems. But please understand, I endorse all of that good stuff, too. When it comes to Curationist, seemingly our mission is access to cultural heritage through digital spaces. And so in that context, it's very easy to say, oh yeah, great, more searchable, more findable equals more access. And that doesn't even get into my problems with the loss of, like, Boolean search knowledge. But I think just saying your mission is access, so AI is good, is such an undercooked idea. Because, yeah, broadly our mission is access, but more specifically we also want compassionate, ethically minded sovereignty for indigenous knowledge in that access. And so in that framework, we've really had to go very slowly with AI in terms of thinking about whether it fits our mission, because this sort of rampant, all-access, full-distribution, anyone-can-download-at-any-time idea that AI unchecked would give our collection is certainly in direct contradiction to a lot of indigenous best thinking around their cultural heritage. Not everything is for everyone. And I think latent in AI as a model is this idea of infinite access, infinite scale, infinite growth, and a lot of the ugly things about capitalism more broadly: just as we try to accumulate wealth, we also try to accumulate knowledge. And so in that sense, AI can be deeply antithetical to our mission. And it takes a really thoughtful understanding of who it is that we're trying to serve. Are we trying to serve an audience that just craves infinite access, and infinite surveillance of all human-created cultural heritage? Or are we trying to serve an audience of people who have already deeply thought through the use and value and dissemination of the objects they've created, and simply elevate that perspective, which still, in my mind, belongs under access, but is a very different frame of access? So considering AI has really called our organization to reconsider our mission in many ways, and that, I think, has been the real advantage of doing this work so far.
Mike Kemezis 26:26
Great, thank you. That kind of gets us started on my next point here, one of the final questions, about threats and risks and challenges of artificial intelligence for cultural heritage institutions. We've talked about things I'd never thought about in that light, about traditional knowledge and what unfettered access to all of that content means. Because, you know, I deal with Connecticut institutions, right, and when you think of Connecticut, you think of old white people; that's what we deal with. But there's more to it than that, and we work with those groups, and we're trying to be more sensitive to what they desire for their collections and for their institutions. So I just want to hear from the other two people here: what other risks or threats do you foresee, or have you encountered, or challenges? One of the things that I think about is this expectation that AI needs to be perfect, right, that the machine needs to be perfect all the time. I worked on a project for handwritten text recognition, building a neural network for that, and it got up to 95% and it wasn't good enough. But if you talk to a transcriber, it's really difficult; they're doing about the same, right? So one of the challenges there is understanding that we expect these systems to be perfect, and they might not be, and is that okay? So I don't know if you have any other threats or challenges that you've come across. Like you said, the public-facing side might not be ready yet. Do you want to elaborate a little bit? Sure. Absolutely.
Garvita Kapur 27:50
I mean, besides the basic bias and racism inherent in the algorithms, I think it's important to think about the cost of AI. There are so many opportunities, but it's not cheap to bring to this community. So for cultural heritage institutions, cost is always something to think about: how can we do the best with what we have? That said, it's important to invest in the upskilling of staff, making sure they have the right skill set to interact with AI, or to build AI, or to prompt AI, and use it in the best possible way. With generative AI, one of the challenges is the walled gardens it creates. Once you're locked into that walled garden, it's very hard to move to a different product. There's a lack of data interoperability between these tools, so you can't unlock them. There is also the question of how the power structures will change based on how things evolve. Then, as Amanda mentioned, there's the deepening of the digital divide that is possible, and again, the shifting of power structures. When social media was first introduced, the promise of connecting and bringing people together was the big dream, but what it really ended up doing was alienating and polarizing people, not really bringing people together. So what are the things that AI might bring that we can't foresee today? Taking the time to learn about that, to educate ourselves. And one more thing I'd like to say: this generative AI, this frenzy of AI, it's not the first big technology and it won't be the last. It's only a matter of time before quantum computing comes into play and takes AI to a whole new level, and then what? So I believe, as cultural heritage institutions, we have to have our finger on the pulse, see what's going on, make sure we are advocating for the right things, that we are having these conversations, and just be responsible in the use of AI. So: having a digital mindset, being lifelong learners of technology, seeing how that affects our work, and just doing the best that we can.
Abbey Potter 31:11
Yeah, I love that. I think that's totally true. And I also agree that the biggest risk right now is that we pay for big, costly systems and they don't work. I think that is actually the biggest risk. I mean, there are other deeper, big, cyber, existential risks, but that's the most immediate one. So I think slowing down is the biggest thing we have to do now: just sort of wait, or experiment. Learn about these technologies while trying them out, so you get that experience, the staff get experience, you get to know more about your data, what these models are actually capable of, what the review process will be like. So I think just taking a breath, because I think the marketplace now with AI is very crazy, and there are lots of unverified claims, and there's no regulation, so there's no baseline either. We don't have any sort of evidence that any of these claims are going to work; it's like, oh, it's going to be so much better next year, but we haven't even seen this year do what we need, you know. So I think taking that breath and understanding that, like all acquisition of IT services in the federal government, there's a long process to do this, and relying on the privacy and security terms that are in there, which these systems don't meet yet. So that's a blocker too. So I think there are practical reasons to wait before anything is implemented at a large scale. But I think another risk, thinking as a library and archives person, is further down the road, when these generative AI tools make the cost of creating things very cheap. Like we saw with the internet, there's just going to be an explosion of content that we will have to acquire, steward, and learn even more about. So the job gets even bigger and more complex when the process of creating this content is easier and faster. How are we going to select for our collections? How are we going to manage these things? How are we going to determine authorship? That's going to be really complicated. So those are all other things, beyond just the practical implementation right now, that I feel like we have to look at, in addition to what the other panelists said, which I totally agree with.
Mike Kemezis 33:57
One of the things I was thinking about last night before this was how AI could help in digital preservation. It's kind of like going in circles: you make things with AI, and then AI comes in and helps you preserve them, and it just keeps going and going. But I think, coming from a background in digital preservation, everyone is scared of born-digital, including the bigger state institutions. And I'm wondering if there's an opportunity there. I don't know what it would look like, for that selection, for kind of like —
Abbey Potter 34:21
You know, even getting PII out of born-digital archives, web archives — you just can't do it. There's —
Mike Kemezis 34:28
Like a million-dollar grant in Connecticut. Yeah. Great. So we have about 10 minutes left, and we do have some time for Q&A. I don't know if you want to run a microphone, or — yes.
Speaker 6 34:46
Thanks so much for this. My question is mostly for Garvita and Abbey. Something that you brought up is that libraries and cultural institutions are highly dependent on vended solutions. In particular, I think of the top 15 service providers for libraries, like two are nonprofit, almost none are open source, and one is OCLC. So there we go. My question is: you are institutions that are in many ways dependent on vended solutions. Larger institutions might be able to make LLMs and neural networks, but often the resources don't exist for smaller institutions to do that. As leaders in the field, I'm wondering, what is the role of large institutions like NYPL and the Library of Congress to make and support solutions that are useful for all sorts of institutions, large or small? And also, even within these large institutions, what are you thinking about when it comes to vended solutions, so that there isn't almost a race to get into this market, which is hundreds of millions of dollars for libraries, like we saw, for example, with e-books? Thanks.
Mike Kemezis 36:05
I'll just talk for one second, because I'm in a different position: I'm a funder. I work at a funder, and I work with a lot of people who work in technology. I see our role as funding great projects, and I think that humanities councils, funders, and nonprofits should be funding this work so it is open, to bring it to the smaller places. In Connecticut, we want to build infrastructure that's available to everyone and get people where they need to go through workflows and that kind of stuff. And I think there's a layer of AI there so that people can get the benefits but still do the work they need to get done. So that's just my view on it, being kind of adjacent to libraries and archives and having some funding power. I'm trying to push my organization to fund more technology projects, but we're all about public humanities, so I'm trying to finagle that
Garvita Kapur 36:49
right now. So I'll address the vendors first. I think the thing we are grappling with right now is: how do we even assess these vendors? Are they truly using AI? If they are, what are the implications for PII? What does that mean for our data? Right now, the very first thing we're trying to do is come up with a rubric to assess vendors and say, okay, this is a vendor we are okay working with, and this one is not. So that's one part: being clear how the rubric matches our value system, what we are okay with and what we're not okay with, drawing that very clear line, and making sure everybody at the library is using the same tools and speaking the same language. In terms of creating LLMs or providing them as open source, we do a lot of open source work, so that's definitely something we would do in the future. But again, we're at this point of thinking: what is the first thing to tackle here? What is the problem that we address? What's the most important thing for us to do? And then finding the metrics of success: is this accurate enough? Or can this be solved with a different tool — it doesn't need to be AI — and doing that measurement. So we are taking it a little bit slow; we're a bit risk averse in that sense. We're taking the time to make sure that we are doing the right things and addressing the right problems and not rushing into it. Because once you build the thing, you kind of have to support it for the rest of its life, and that can be costly, that can be resource-intensive, and —
Abbey Potter 38:59
Yeah. At the Library of Congress, before I started on this team, I was part of NDIIPP, the National Digital Information Infrastructure and Preservation Program. It was a big digital preservation program where the government gave us money to build a partnership network to develop tools and standards and services around digital preservation. And I think we need a similar type of thing — from, I don't know, NSF; this one came from Congress — but I think we need a similar thing, so that we can facilitate partnerships with research universities and computer science departments, so we can develop tools and just learn about how the tools work. Out of that digital preservation program came this thing called FADGI, the Federal Agencies Digitization Guidelines Initiative — a very bad acronym — where different federal agencies and different universities came together and clearly defined digitization standards that they could all give to vendors. That's what I would love to also see for AI, where we have, like: okay, we're going to use large language models for a public-facing chatbot, and these are the baseline quality metrics. And maybe we can define very high quality metrics for tasks that are mission critical, that touch certain types of users, but then we can also define quality for internal processes that maybe aren't as sensitive, and move forward that way. Because I think if we had a united voice on standards and quality, then we could get the vendors to work with us in a better way. Yeah.
Mike Kemezis 41:01
I think we have time for one more question.
Speaker 7 41:07
Thank you all for your comments. As you all know, there's a huge space race for data right now for AI model training. What you're working on — in the case of Curationist or the Library of Congress — creating better metadata and curating these large lists of open access content, could be construed as extremely valuable for commercial entities that want to train models. And maybe some of these commercial entities are actually well meaning; they don't want to repeat the mistakes of Stability AI and the LAION dataset. But I think there are also a lot of misconceptions around what exactly open access means. So how are you thinking about commercial entities that might want to train on your datasets and your metadata?
Abbey Potter 41:53
I think that's already happened. Anything that's open, I think, has already been used. But I think there are models for doing it in a way where the most people can learn from it. The Nordic national libraries all got together and created a training set of Nordic historic newspapers, to train models that they can use on their content. The National Library of Norway digitized their entire collections — they have a lot of oil money, and that's what they spend it on — so they have a lot of scale issues, and they have been leading the way in experimenting and trying to figure out how to use AI for big, giant collections. But I think it's difficult, because even in digitization you saw collections that are produced, and then vendors get their hands on them, and they're enhanced with lots of metadata, and then they sell them back to libraries. I can definitely see something like that happening again. But I think it's hard, unless there's a license created that stipulates how training data can be used. In federal contracting, there's government-provided data and government-furnished property, and you're not allowed to reuse that for other contracts. So if we ever work with vendors to create training data, we would try to make that clear, so they can't use it for enhancing their other products. But I think, still, if your data is out there, it's been used already. So —
Amanda Figueroa 43:59
Yeah, I agree completely. Curationist doesn't own any of the items in our database. It's, frankly, none of my business if people have been using this to train AI models to do whatever they want. I will say the difference would be that certainly someone could come in and just scrape the Curationist site and pull down all of the object records that we're sourcing from these institutions, and possibly even pull down all of the metadata that we've contributed directly to these listings. But what they can't train for is the Curationist perspective on knowledge parity and data sovereignty. And that's a perspective that will deeply affect whatever AI model we come up with internally, just as it has affected all of the metadata and art writing that we've generated already. And frankly, that's the most interesting part, I think: the care with which we apply this data. And that's something that you can't just rip from the internet; that's really something that only comes from thoughtful, dedicated practice and dedication to a standard of care in cultural heritage.
Mike Kemezis 45:09
Great. Thank you, everybody. Really appreciate the conversation. I'm looking forward to the other panels today, so really appreciate it. Really appreciate your time. Thank you.
Announcer 45:26
The Engelberg Center Live podcast is a production of the Engelberg Center on Innovation Law and Policy at NYU Law and is released under a Creative Commons Attribution 4.0 International license. Our theme music is by Jessica Batke and is licensed under a Creative Commons Attribution 4.0 International license.