Commons talk:Digital Public Library of America/Modeling
File captions
In addition to the statements, I'd also like to start a discussion about how to format the file captions, which could be added programmatically across all DPLA uploads. My initial thought is that it should be a string formed with some combination of the title, creator, institution and ID (and date?) fields, with some basic punctuation. For example: "{title}" by {creator}, from {institution} (DPLA ID {id}). Is there a good standard to follow here? Dominic (talk) 14:45, 16 June 2021 (UTC)
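A minimal sketch of the caption pattern proposed above, assuming each record exposes title, creator, institution and DPLA ID as plain strings. The function name `build_caption` and its parameter names are hypothetical and not part of any existing DPLA tooling; the creator is treated as optional because not every record has one.

```python
from typing import Optional


def build_caption(title: str, institution: str, dpla_id: str,
                  creator: Optional[str] = None) -> str:
    """Format: "{title}" by {creator}, from {institution} (DPLA ID {id})."""
    caption = f'"{title.strip()}"'
    if creator:
        # Only add the "by ..." clause when a creator string is present.
        caption += f" by {creator.strip()}"
    return f"{caption}, from {institution.strip()} (DPLA ID {dpla_id})"


# "Example Library" and "abc123" are placeholder values for illustration.
print(build_caption("First successful dirigible, 1883", "Example Library", "abc123"))
```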
- FRomeo (WMF) (talk) 12:49, 15 July 2021 (UTC): I would prefer that file captions on Commons be descriptive of the content of the image. Some items have a description; for example, this item's description is "Two sections of a cream-colored silk valance with scalloped edge and silk fringe border. Embroidered with delicate naturalistic flowers and tendrils tied into bows. In pale greens, blues, golds, and pinks." Are you able to use descriptions when they're available? Looking at the other metadata on DPLA, the most descriptive fields other than description are title, format, and subjects. Given that subjects will probably be used for depicts, would some combination of title and format work? For example, for this item, the title is "First successful dirigible, 1883" and the format is "Trade cards. Cigarette cards." Taken together, that's a reasonable description of the image. But I understand that the quality and extent of metadata varies by partner.
- @FRomeo (WMF): We have many (most?) DPLA items that lack the description field, and even subjects are not universal, but I take your point, so maybe we can have one approach for those that do have these types of more descriptive fields, and then do something like I said above just as a fallback when needed. One question, though. Something Carly said that stuck with me is that the MediaSearch is favoring specific fields (namely: file caption, depicts, and unstructured text), but adding more granular SDC statements might end up hurting the discoverability if certain fields (e.g. title) aren't factored into the weighting. My takeaway then was that if we want someone who is searching on basic terms found in the title or creator (but not depiction), then we should put those terms in the caption. Do we lose any discoverability if we use the narrative description like the one in your example? Dominic (talk) 13:45, 15 July 2021 (UTC)
- @Dominic: It is true that depicts and file caption are weighted more heavily in search results. However, adding more granular depicts statements shouldn't hurt discoverability in any way, as long as broader depicts statements are also included. Unstructured text is still searched, and that unstructured text includes title and creator. So that information still turns up in search results - just ranked lower than matching information in depicts or file captions. There's no reason that using narrative description would hurt discoverability. Hope that helps - please let me know if you have a follow up question! CBogen (WMF) (talk) 16:57, 13 September 2021 (UTC)
- @CBogen (WMF): Thanks! So just to clarify your clarification, what we are actually hoping to do is (1) put all fielded metadata into structured data statements for easier maintenance (not just depicts statements), (2) modify existing templates so they display structured data and end users see the same metadata, and (3) remove the plain text from the wiki page that duplicates structured data, so we only have to maintain data in one place. Are other structured data fields outside of depicts and caption searched? As long as this data is searched when it's imported by a template using parser functions or Lua, I guess that works. (Though, is it searching all labels in all languages if it's a Wikidata item?) It still seems like you might have issues. For example, if you put the actual title in a P1476 statement but not in the caption, that file would rank below a less relevant file where the words of the title appear in the caption without being the title of the work, even though a search term found in a title field should be highly relevant. Dominic (talk) 18:32, 15 September 2021 (UTC)
- @Dominic: You have it right and this sounds like a good plan to me. The content that you put into templates via SDC will then be indexed as full text and become searchable via MediaSearch, even though SDC fields other than "depicts", "caption", and "digital representation of" are not directly indexed right now. If you'd like to see what's indexed on any given file page, you can append ?action=cirrusDump to file pages to see what's indexed - the "auxiliary_text" field contains the parsed wikitext contents. CBogen (WMF) (talk) 13:20, 29 September 2021 (UTC)
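The cirrusDump check mentioned above can be sketched as follows. This only builds the URL and searches an already-parsed dump for the "auxiliary_text" field; it performs no network request. The helper names are hypothetical, and because the exact nesting of the dump JSON is an assumption here, the lookup walks the structure recursively instead of relying on a fixed path.

```python
from urllib.parse import quote

COMMONS_BASE = "https://commons.wikimedia.org/wiki/"


def cirrus_dump_url(file_title: str) -> str:
    """Append ?action=cirrusDump to a Commons file page URL."""
    page = file_title.replace(" ", "_")
    return f"{COMMONS_BASE}{quote(page, safe=':')}?action=cirrusDump"


def find_auxiliary_text(dump):
    """Return the first "auxiliary_text" value found in parsed dump JSON, else None."""
    if isinstance(dump, dict):
        if "auxiliary_text" in dump:
            return dump["auxiliary_text"]
        children = list(dump.values())
    elif isinstance(dump, list):
        children = dump
    else:
        return None
    for child in children:
        found = find_auxiliary_text(child)
        if found is not None:
            return found
    return None


# "File:Example image.jpg" is a placeholder title for illustration.
print(cirrus_dump_url("File:Example image.jpg"))
```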
Modeling feedback
- The catalog records, of course, describe the whole item, and not necessarily the specific digital asset, which may only be one page of a larger work, to which the statement is applied. Is this acceptable, or can we clarify the level of description somehow (with qualifier)?
- All DPLA items will necessarily have a DPLA record and a source record. Should these different URLs be qualified somehow to distinguish?
- The described at URL (P973) property is similar to source of file (P7482). Is there a reason to choose described at URL (P973) over source of file (P7482)? If it's about adding DPLA, I think we could add source of file (P7482) separately with the source catalog record URL and use described at URL (P973) with both URLs. In this case, I think it would be ok to duplicate the information.
- It might be a good idea to add a qualifier to specify each institution: maybe operator (P137) for DPLA? In any case, one more level of specificity might be good.
- There's also one more possibility: add source of file (P7482) > file available on the internet (Q74228490) > described at URL (P973) + operator (P137). As in this example.
- @GFontenelle (WMF): Thanks for pointing me to source of file (P7482). I agree with you there, and I think it makes sense actually to put URL (P2699) and IIIF manifest URL (P6108) as qualifiers under that as well. I posted this question to Commons talk:Structured_data/Modeling/Source#Additional URL types. And then I think it makes sense as you said to add described at URL (P973) at the top level. I like the idea of the operator (P137) qualifier, but it may not be feasible within the data. One issue we have to contend with is that not every contributing institution is contributing content hosted on their own sites. Some of the catalogs linked might be hosted by the DPLA hub, or by an intermediary institution (e.g. large university hosts local institutions' digital collections). Maybe it is still okay to use "operator" in that case, though—they are still responsible for the metadata, just not the site itself. Dominic (talk) 17:55, 13 July 2021 (UTC)
- Is the basic "URL" property correct here for specifying the direct link to the image itself, or is there a more specific property?
- As this property is being used only for the image, some other possibilities are: Commons compatible image available at URL (P4765) or full work available at URL (P953). Not sure if this second option is just for text, though.
- full work available at URL (P953) feels better to me. Ainali (talk) 08:24, 18 July 2021 (UTC)
- As with "described at", the IIIF manifest is for a whole item. We have this data, but is there a preferred way to add this at the asset-level?
- It seems just fine. Maybe just add a qualifier with the name of the institution?
- As above, the DPLA ID is an identifier for the whole item, not the specific image that has been uploaded. How can we add it in SDC at the asset-level without confusing the two?
- Maybe add to the qualifier that this image is part of the item. Something like: DPLA ID (P760) > 62377331b08262c5f79a1be52f7fc757 > part of (P361) > file unit (Q59221146). Or just add part of (P361) as another property separately.
- The vast majority of items in DPLA are in English, but language is not specifically spelled out as a field in the data. Is it better to use language code "en" and accept a small error rate, or to apply "und" to all?
- It should be just in English, so it might be better to use the "en" code.
- Should this specify in some way that it is the title of the depicted work, and not the digital asset?
- I think it should only be the title of the item, as it is official. In the Commons' file name, we have the information that it is a page from the document already.
- Question: How is CC0 handled? Copyrighted or not, and is CC0 a "license"?
- CC0 is considered a license and, I believe, it should be applied if it was chosen as a license by the institution. I think the same should apply to any other open license, such as No Copyright - United States (Q47530911).
- Hi, we have already had quite some discussion about CC0. The community felt that it is best modelled as in Commons:Structured_data/Modeling/Copyright#Cc-zero_license. In short, have a special copyright flag for this, as it is not technically PD but also not considered copyrighted. This model was adopted by both Multichill and myself. Please consider the same so we remain consistent. Otherwise, let's continue the discussion on Commons_talk:Structured_data/Modeling --Schlurcher (talk) 07:35, 25 August 2021 (UTC)
- If adding all organizations in the chain, do we use a qualifier to distinguish the source institution from the aggregators (and do we describe DPLA and its hubs differently)?
- As all institutions are collectively responsible for the item to be on Commons, it might be good to have them on the chain and with qualifiers informing their roles in it.
- I am not sure this is the right property, as it is mostly used in Wikidata references, but it seems like an important concept because many of our uploads are only single pages of larger works.
- Often, the page number in the sequence of files uploaded and the original page number of the scanned page are not the same. For example, the first upload in a sequence for a book could be the cover, while the actual page "1" might be the 5th file, after the title page, acknowledgements, copyright page, etc.
- Can we also represent in SDC the number of pages in the work, in addition to the page number of the current file?
- I'm not totally sure this is completely right either, but I think a solution would be to add page(s) (P304) for the page number in the item and file page (P7668) for the number of the digital file.
- Here it is important to remember that we are describing the digital uploaded file with SDC. So page(s) (P304) is probably not useful at all, unless as a qualifier on depicts (P180). Likewise, file page (P7668) would probably only be fitting on a file if it was extracted from another file (like a single jpg extracted from a pdf) and then as a qualifier of the source. Ainali (talk) 08:33, 18 July 2021 (UTC)
- @Ainali: Thanks for this comment! So, it almost seems as if there is not really a good property for exactly this element. It's not file page (P7668), which is about the page within a large file, but rather about the sequence of a particular page (or digital asset) within a larger collection of files that constitute an item. Do you have any recommendation here, or do we need to propose a property (hoping to avoid that!)? Dominic (talk) 19:05, 20 July 2021 (UTC)
- Perhaps series ordinal (P1545) can be used with a qualifier with the set? Ainali (talk)
- DPLA does not have a controlled vocabulary around creator entities, so we can only use text strings here. It will be a difficult task to decide how to match to Wikidata items, but we can use the author name string property to start with.
- This is a solution to the problem, yes. However, I don't think this is the best scenario. The ideal situation would be to work with the text strings to have them as proper Wikidata items, which would ask for a metadata reconciliation and "wikidatification" process.
- "creator" is used very broadly and variably across DPLA's institutions, in ways that may not match expectations of this property's scope. One particular issue with the National Archives is that "creator" is typically the agency that preserved the record, but not the person that created it (which is sometimes the employee of the agency, but also sometimes a citizen who submitted documents to the government, gave testimony in court/Congress, correspondence/clippings saved by an agency, etc.)
- In cases like these, as exceptions, it might be interesting to separate from the rest and do an upload that moves, for example, the National Archives from creator to the right field/property.
I was not able to answer all of the questions, as we do not have solutions to all of those problems yet. And, of course, better answers and solutions to the questions might appear. Therefore, I'm really looking forward to reading the community feedback.
Question: Is there an intention to include information in the Captions field? If yes, do you have any idea on how to model and add that information? GFontenelle (WMF) (talk) 04:08, 24 June 2021 (UTC)
Sourcing
My current plan is for all of these statements, where applicable, to use determination method or standard (P459): determined by GLAM institution and stated at its website (Q61848113) (e.g. [1]) as a qualifier. When references are (hopefully) added for SDC, we would also add a reference to the catalog URL. But for now, I think this works well enough, since we will already have the described at URL statement somewhere in the data as well.
This will distinguish the statements added by DPLA from those added by the community, which means it would also be what we use for synchronization. We can design the bot to strictly change statements with this qualifier, and not the others–since changes made by DPLA would only be to make it match the source, this seems like fair game to update any statement with that qualifier. Dominic (talk) 16:18, 13 July 2021 (UTC)
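The synchronization rule described above can be sketched as a filter over a file's statements. This assumes the statements are available in the standard Wikibase JSON layout (a mapping from property ID to a list of statement dicts, each with an optional "qualifiers" mapping); the function names are hypothetical and this is not the actual bot code.

```python
DETERMINATION_METHOD = "P459"   # determination method or standard
STATED_BY_GLAM = "Q61848113"    # determined by GLAM institution and stated at its website


def is_source_managed(statement: dict) -> bool:
    """True if the statement carries the P459 = Q61848113 qualifier."""
    for snak in statement.get("qualifiers", {}).get(DETERMINATION_METHOD, []):
        if snak.get("snaktype") != "value":
            continue
        value = snak.get("datavalue", {}).get("value")
        if isinstance(value, dict) and value.get("id") == STATED_BY_GLAM:
            return True
    return False


def statements_to_sync(statements: dict) -> list:
    """Select only the statements the bot may overwrite; community edits are skipped."""
    return [s for claims in statements.values() for s in claims if is_source_managed(s)]
```

Statements without the qualifier are never returned, so a bot built this way would leave community-added statements untouched.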
Next phase of statements
I am wrapping up our first bot run across the entire DPLA set of uploads. As of now, we have added about 4.8 million statements across 1.6 million media files (out of about 2.4 million). The first run added a set of very simple properties: DPLA ID (P760), copyright status (P6216), and RightsStatements.org statement according to source website (P6426), in edits such as this one. For the next phase, I'd like to propose the modeling for the following statements. These encompass what I think of as "medium difficulty" modeling questions, since these are all ones I think are fairly safe to begin based on discussion so far: more complicated than copyright status, but less complicated than ones that require more entity matching (creator and subject, particularly). Please see below:
Statement: source of file (P7482) -> file available on the internet (Q74228490)
Comment: This format using P7482 -> Q74228490, with qualifiers for URL types seems to be favored, per Commons:Structured_data/Modeling/Source. See the talk page there for additional DPLA-specific discussion about this format.

Statement: title = "A Rill from the Town Pump" essay by Sarah (Sallie) M. Field, Abbot Academy, class of 1904
Qualifier: determination method or standard = determined by GLAM institution and stated at its website
Comment: Will apply "en" lang code, per above discussion.

Statement: copyright license = Creative Commons Attribution-ShareAlike 4.0 International
Qualifier: determination method or standard = determined by GLAM institution and stated at its website
Comment: This type of statement is already very commonly established on Commons.

Statement: collection = Toledo-Lucas County Public Library
Qualifier: determination method or standard = determined by GLAM institution and stated at its website
Comment: This is specifically for the source institution.

Statement: Commons media contributed by = Digital Public Library of America (object of statement has role = aggregator); Ohio Digital Network (object of statement has role = aggregator); Toledo-Lucas County Public Library (object of statement has role = repository)
Comment: This is how we model this type of statement for the situation where DPLA uploads material that is provided to DPLA from what we call a "service hub", a regional aggregator that harvests from local institutions in an area. In this case, three institutions are listed with their roles. repository is a new Wikidata item created to describe this role (the other "repository" item is in reference to a storage site, but not an organization).

Statement: Commons media contributed by = Digital Public Library of America (object of statement has role = aggregator); National Archives and Records Administration (object of statement has role = repository); National Archives at College Park - Still Pictures (object of statement has role = custodial unit)
Comment: This is how we model this type of statement for a different type of situation, where DPLA uploads material that is provided to DPLA from what we call a "content hub", a large institution that DPLA harvests directly, such as the National Archives or Smithsonian. In this case, there are three values listed with their roles, but the last one is not an independent organization, but the unit which maintains the collection. custodial unit is a new Wikidata item created to describe this role (the general "department" item was not specific to this meaning). Note that in these types of content hub situations, the content hub (e.g. National Archives) would also be used for P195, and not the custodial unit.

Statement: author name string = Department of State. Agency for International Development. 1961-10/1/1979
Qualifier: determination method or standard = determined by GLAM institution and stated at its website
Comment: This is how we will add all creators in our first pass. In later updates, we would replace a P2093 statement with a specific creator property linking to the Wikidata item for the creator, if it is identified.

Please feel free to discuss if anyone has feedback on any of these proposals. In addition, I expect to add file captions at the same time, if we can get clarity in the discussion at #File captions. I'll post this discussion to a few different places and give people a few days to see if anyone has comments. Dominic (talk) 16:12, 27 August 2021 (UTC)
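A minimal sketch of how the two "Commons media contributed by" cases described above (service hub and content hub) could be derived from a record. The `Record` type and its field names are hypothetical, and institutions and roles are represented by their labels, not by Wikidata item IDs, since the role items are not identified in this discussion.

```python
from typing import List, NamedTuple, Optional, Tuple

DPLA = "Digital Public Library of America"


class Record(NamedTuple):
    source_institution: str               # local institution, or the content hub itself
    service_hub: Optional[str] = None     # regional aggregator, if the item came via one
    custodial_unit: Optional[str] = None  # unit within a content hub, if any


def contributed_by(record: Record) -> List[Tuple[str, str]]:
    """Return (institution, role) pairs in the order used in the examples above."""
    chain = [(DPLA, "aggregator")]
    if record.service_hub:
        chain.append((record.service_hub, "aggregator"))
    chain.append((record.source_institution, "repository"))
    if record.custodial_unit:
        chain.append((record.custodial_unit, "custodial unit"))
    return chain


def collection(record: Record) -> str:
    """Value for collection (P195): the source institution, never the custodial unit."""
    return record.source_institution


service_hub_item = Record("Toledo-Lucas County Public Library",
                          service_hub="Ohio Digital Network")
content_hub_item = Record("National Archives and Records Administration",
                          custodial_unit="National Archives at College Park - Still Pictures")
print(contributed_by(service_hub_item))
print(contributed_by(content_hub_item))
```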
- Thanks for sharing. My only comment is on author name string (P2093), which we currently do not use as a direct statement. I understand you later want to update to a creator property. How about mapping it directly to creator with a qualifier? This is also what is done for Commons contributors without a Wikidata item. It would also make it easier to query these non-Wikidata creators. See Commons:Structured data/Modeling/Author and for example File:Maxent_(35)_Église_09.JPG. The mapping could look like this:
Statement: creator = somevalue
Qualifier: author name string = Department of State. Agency for International Development. 1961-10/1/1979
Qualifier: determination method or standard = determined by GLAM institution and stated at its website
- Best regards, Schlurcher (talk) 18:05, 30 August 2021 (UTC)
- @Schlurcher: Thanks, will implement it this way. I'm hoping to begin making these edits soon. Dominic (talk) 18:10, 15 September 2021 (UTC)
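The agreed mapping (creator set to "some value", with the name carried in an author name string qualifier) can be sketched as a Wikibase-style statement. The dict layout follows the general Wikibase JSON format as an assumption, and `creator_name_statement` is a hypothetical helper name; P170 is the creator property, while P2093, P459 and Q61848113 are the IDs named in this discussion.

```python
def creator_name_statement(name: str) -> dict:
    """Build a creator (P170) statement with an unknown value and a name-string qualifier."""
    return {
        "type": "statement",
        "rank": "normal",
        # "somevalue" means a creator exists but is not linked to a Wikidata item.
        "mainsnak": {"snaktype": "somevalue", "property": "P170"},
        "qualifiers": {
            # author name string: the creator text as given by the source
            "P2093": [{
                "snaktype": "value",
                "property": "P2093",
                "datavalue": {"value": name, "type": "string"},
            }],
            # determination method: marks the statement as sourced from the institution
            "P459": [{
                "snaktype": "value",
                "property": "P459",
                "datavalue": {
                    "value": {"entity-type": "item", "id": "Q61848113"},
                    "type": "wikibase-entityid",
                },
            }],
        },
    }


statement = creator_name_statement(
    "Department of State. Agency for International Development. 1961-10/1/1979"
)
```

If a creator is later matched to a Wikidata item, only the main value would need to change from "somevalue" to that item.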
Structured data claims added to redirect
User:DPLA bot has added structured data claims to File:STS099-734-043 - STS-099 - Earth observation views of Pheonix,Arizona taken from OV-105 during STS-99 - DPLA - 9cb6ac83e74a3aeec7dada30c065bf3a.jpg instead of File:STS099-734-043 - STS-099 - Earth observation views of Phoenix, Arizona taken from OV-105 during STS-99 - DPLA - 9cb6ac83e74a3aeec7dada30c065bf3a.jpg. --Mirokado (talk) 07:52, 15 April 2023 (UTC)