Domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. However, the traditional adaptation by further training on in-domain data rapidly weakens the model's ability to generalize to other domains, making open-ended deployments of the adapted models prone to errors. This work introduces novel training objectives built upon the semantic similarity of the predicted tokens to the reference. Our results show that (1) avoiding the common assumption of a single correct prediction by constructing the training target from tokens' semantic similarity can largely mitigate the catastrophic forgetting caused by adaptation, while (2) preserving the in-domain quality of the adaptation, (3) with negligible additions to compute costs. In the broader context, objectives grounded in a continuous token similarity pioneer the exploration of the middle ground between the efficient but naïve exact-match token-level objectives and the expressive but computationally and resource-intensive sequential objectives.
Progress in natural language processing research is catalyzed by the possibilities offered by widespread software frameworks. This paper introduces the AdaptOr library, which transposes the traditional model-centric approach, composed of pre-training and fine-tuning steps, into an objective-centric approach that composes the training process from applications of selected objectives. We survey research directions that can benefit from enhanced objective-centric experimentation: multi-task training, custom objective development, dynamic training curricula, and domain adaptation. AdaptOr aims to ease the reproducibility of these research directions in practice. Finally, we demonstrate the practical applicability of AdaptOr in selected unsupervised domain adaptation scenarios.
E-learning is often perceived as a contemporary topic with great potential for the future. Viewed from a pedagogical perspective, however, the past and the tradition of education cannot simply be overlooked, as they can be a source of interesting insights or inspiration for reflection. In the first part of the paper, the author therefore turns to the past and presents some pedagogical views and theories that, in their own way, help shape this tradition. The next part focuses on the present, concentrating on the key actors of e-learning: the teacher and the student. Drawing on Czech and international research, it presents the issues of e-learning from a somewhat different angle than usual. The final part addresses the future, which can hardly be predicted precisely. It is nevertheless possible to at least outline future trends in e-learning, some of which can be summarized as the so-called disappearing "e".
Mathematics information retrieval (MIR) is a domain-specific branch of information retrieval. MIR aims at searching for information in documents with a significant amount of mathematical content in the form of expressions and formulae. Judging by the newly established international MIR evaluation forum and by the number of MIR-related research groups around the world, the field is on the rise. In this work I summarize and compare different approaches to math-aware search systems. A more detailed description is given of Math Indexer and Searcher (MIaS), the system created at the Faculty of Informatics, Masaryk University, and primarily designed and developed by me. MIaS is currently reported as the best-performing MIR system in terms of effectiveness. I also propose several topics that are the main research interests of my studies. The topics correspond to possible features that can improve the effectiveness of MIR systems: math formula subtree unification, integration of algebraic computational power into both the indexing and the searching phase, query expansion as a way of increasing recall, query variables, combination of several approaches within one system, and combined text and math search. One topic that spans all the others is evaluation, which is a necessity in the process of continuous improvement of effectiveness.
This thesis deals with the problem of searching in mathematical texts. It analyzes several existing solutions for mathematics search and tries to draw from them the important insights used in the design of its own solution. The design presents the ideas and rationale behind the proposed components for mathematics search, such as suitable tokenization, modification, and scoring of formulae. The part devoted to the implementation of this design describes how the final solution was achieved using the Lucene indexing core. The conclusion evaluates the project and proposes directions for further development.
The article outlines the philosophy of the TeX typesetting system, its capabilities, its purpose, the problems connected with its Czech localization, and the prospects for its further spread in our country.
The conference program included oral presentations and two special sessions: a Poster session (for parallel poster presentations) and a Demonstration session, where authors were invited to present current projects, developed software, or interesting material relevant to the topics of the conference. The demonstrations presented did not appear in the printed version of the Proceedings of GWC 2004. The authors of the demonstrations provided an abstract that appears in the CD version of the Proceedings. The Proceedings were published by Masaryk University in both paper and CD format.
Hyphenation, or the algorithmic segmentation of a large set of strings of some language, is a more common problem than it might seem at first glance. For freely distributable Slovak hyphenation, the only solution so far is one based on the definition of the syllable in Slovak, without extensive coverage of exceptions. From more than a million collected and hyphenated words, new freely distributable patterns were generated with the PATGEN program; they cope with the irregularities of the language better than the solutions available so far. The result is usable not only in TeX distributions but also in other systems such as OpenOffice.org. The bootstrapping, stratification, and pattern generation techniques used and discussed here are applicable to a wide range of other "segmentation" applications.
Attitude prediction strives to determine whether an opinion holder is positive or negative towards a given target. We cast this problem as a lexicon engineering task in the context of deep linguistic grammar formalisms such as LFG or HPSG. Moreover, we demonstrate that attitude prediction can be accomplished solely through unification of lexical feature structures. It is thus possible to use our model without altering existing grammars; only the lexicon needs to be adapted. In this paper, we also show how our model can be combined with dependency parsers. This makes our model independent of the availability of deep grammars; only unification as a processing mechanism is needed.
... Ben Hamadou. Use of a Weighted Topic Hierarchy for Document Classification (Alexander Gelbukh, Grigori Sidorov, Adolfo Guzman ...) ... Inge Gavat. Speech Analysis and Recognition Synchronised by One-Quasiperiodical Segmentation (Tares K. Vintsiuk, Mykola M. Sazhok ...)
Papers by Petr Sojka