After a brief overview of the elements of modern grid computing, a number of common use cases of natural language processing tasks running on the grid are presented, notably corpus annotation with morpho-syntactic tagging (a 600+ million-word corpus in one day), n-gram statistics processing of a corpus, and web-accessible services, with annotation and term extraction as examples. Implementation considerations and common problems of using the grid for this type of task are laid out. Finally, a simple action plan for evolving the infrastructure created for these experiments into a fully functional Human Language Technology grid Virtual Organization is given, with the goal of making the power of European grid infrastructure available to the linguistic community.
The paper deals with the issues of digital curation, including storage formats, presentation, and access rights, framed as a case study of two projects on the digitisation of written materials: the Scholarly digital editions of Slovenian literature and the Slovenian biographical lexicon. Three basic aspects of digital curation, including preservation, are discussed. The first aspect is the sustainability of our digital format, i.e. of our text encoding, where international standards and best practices are used, particularly the TEI Guidelines for Electronic Text Encoding and Interchange as a self-documenting, transparent and widely adopted de facto standard. The second aspect is the presentation of and search over the materials, where static HTML pages and the Fedora Commons repository with Solr full-text search are used. The last aspect deals with the access rights to the materials: in what format, for whom, and under what conditions the texts are made available. The arg...
An examination of dramatic technique in two of Cankar's plays has revealed the technical influence of the Belgian playwright Maurice Maeterlinck. Until now it was believed that Maeterlinck influenced Cankar's drama chiefly through his ideas and his distinctive linguistic style. However, an examination of Jakob Ruda, an early play by Cankar marked by a strong Ibsenian influence, showed that Cankar combined the dramatic techniques of both authors. He thereby achieved a contrasting presentation of a corrupt society, conveyed through the predominant Ibsenian analytic technique, and of hypocritical individuals, portrayed by means of Maeterlinck's "technique of reagents", with which Cankar could suggest the truth hidden behind society's lies. An examination of Lepa Vida, Cankar's most symbolist dramatic work, showed that the structure of the play, as it can be determined using Anne Ubersfeld's actantial models, is remarkably similar to the structure of Maeterlinck's early plays, although in its dramatic construction and in the structure of ...
EXTENDED ABSTRACT: The paper gives a thorough examination of the Register of Slovenian-language manuscripts from the 17th and 18th centuries from different points of view: it is presented as a searchable digital repository for the humanities (a digital library) and as a methodological framework for further scholarly research and discoveries in the field. Manuscripts, especially the manuscripts of Slovenian literature, have not been sufficiently taken into consideration so far. They have only ever been given a sketchy treatment, serving merely to illustrate the general outlines of the nation’s literary and cultural development. They have rarely been dealt with in specialised studies or scientific publications. This is the reason why they have not been registered and recorded in archival and library collections. Different guides to manuscripts offer only basic and limited information from which it is often impossible to identify the language, the content, and the hist...
In this article we analyze common human language technology requirements and the possibility of implementing them using Grid infrastructure. Different possibilities for the setup of an execution environment are treated and the standard PKI-based Grid security approach is explained, with an emphasis on securing data access in a potentially untrustworthy environment. Two examples of running unmodified NLP applications are presented.
The article discusses the possibilities of using the Grid platform for Natural Language Processing tasks. Legal problems concerning the distribution of copyrighted texts are described, and possible solutions, including encryption of data, are outlined.
The paper presents the Register of Slovenian manuscripts from the Baroque and Enlightenment periods, i.e. from the 17th and 18th centuries. The Register comprises digital images, manuscript descriptions and associated bibliography. We outline the motivation for producing this register and elaborate on its encoding, which uses the TEI Guidelines, especially its module for manuscript description. The manuscripts in the Register are described, giving details about their content and origin, physical characteristics, and classifications along several dimensions. The paper then introduces the presentation of the Register via a Web portal built on the Fedora Commons repository software, which enables viewing manuscript descriptions with TEI element glosses localised to Slovenian, searching over the Register and browsing the facsimile digital images. The portal also supports export of Dublin Core metadata as well as the source TEI encoding, making it suitable for harvesting. Finally, the paper discusses some more challenging aspects of analysis for such digital resources, in particular the formalisation of the locations and dates of manuscript origin, and concludes with directions for future work.
We present the digitization of the Slovenian Biographical Lexicon (SBL), an extensive publication that used bio-bibliographical methods to provide synthetic assessments of the work and significance of historical figures on the basis of primary sources. SBL has been out of print for a long time, but the publication has been seen as an important resource for encyclopaedic and reference editions and for research in the Slovenian humanities, social sciences and history of the natural sciences. Therefore, the Slovenian Academy of Sciences and Arts (SASA) and the Scientific Research Centre of the SASA decided to produce a freely available online digital re-edition of SBL. In the process of digitization, manually corrected OCR output was semi-automatically converted to the XML-based Text Encoding Initiative format (TEI P5). Its extensive annotation vocabulary, notably from the biographical and prosopographical modules, was used to mark up as much data as possible. The resulting XML document has become the data resource of an online digital repository based on the Fedora Commons platform, where we implemented an infrastructure of XML processing methods on top of native relationships and a Lucene/Solr-based search engine to produce a full-fledged web application and search engine with browser, metadata and web application interfaces.
Papers by Jan Javoršek