In 2015, the HSMS began working on the Old Spanish Textual Archive (OSTA), a lemmatized and morphologically tagged linguistic corpus of about 35,000,000 words, based on more than 400 semi-palaeographic transcriptions of medieval texts written in Spanish, Asturian, Leonese, Navarro-Aragonese and Aragonese carried out by the collaborators of the HSMS. The project can be accessed at http://osta.oldspanishtextualarchive.org.
The origins of OSTA date back to 1978, when John J. Nitti, one of the editors of the Dictionary of the Old Spanish Language (DOSL) and co-founder of the Hispanic Seminary of Medieval Studies (HSMS), described a long-term project in an article titled "Computers and the Old Spanish Dictionary":
the creation of the Old Spanish Archive (OSA), which is to be a repository … of all the machine-readable manuscripts and concordances of those works represented in DOSL… OSA will be established as a research archive open to any interested scholars wishing to make use of its facilities … eventually … information retrieval will be carried out via the computer … linking the magnetically-stored … machine-readable text transcriptions and concordances (43-52)
At the time of its conception, this project exceeded the available computing capabilities, so the HSMS's medium-term objective became the creation and dissemination of a vast database of electronic transcriptions of manuscripts and incunabula written in Spanish between the years 1000 and 1600, distributed on microfiche, on CD-ROM and, as of 2011, over the internet.
After an initial phase when the textual corpus was delimited—analyzing the codices and their content—we began the process of lemmatization and morphological tagging, for which we used FreeLing, a Natural Language Processing tool, and HSMS-app, a textual analysis tool developed specifically for this project.
Starting in 2017, we began to expand FreeLing's lexical resources, working on the recognition of named entities (place names and anthroponyms), medieval spelling variants, and words not identified by any of the rules developed. To do this, we processed several of the dictionaries of the HSMS's Dictionary of the Old Spanish Language project: Diccionario español de textos médicos antiguos (Herrera 1996), Diccionario español de documentos alfonsíes (Sánchez 2000), Vocabulario militar castellano (siglos XIII-XV) (Gago Jover 2002), Diccionario de la prosa castellana del Rey Alfonso X (Kasten and Nitti 2002), and Diccionario herbario de textos antiguos y premodernos (Capuano 2017).
In early 2019 we started work on the query interface, as well as on improving FreeLing's affixation rules, reviewing the FreeLing dictionary of forms, and defining unidentified forms.
The Old Spanish Textual Archive of the Hispanic Seminary of Medieval Studies is a free electronic resource with the following usage conditions:
- Users will cite the Digital Old Spanish Textual Archive in all of the research that uses its data. The citation format should follow this model (or a similar one that includes the same bibliographical information):
Gago Jover, Francisco and F. Javier Pueyo Mena. 2020. Old Spanish Textual Archive. Hispanic Seminary of Medieval Studies. Online at http://osta.oldspanishtextualarchive.org. [date of search]
- Individual works must be cited using the OSTA code that appears under the Obra column in the results and in the metadata. In this code [HSMS-0286-0001], the first four digits correspond to the codex and the last four digits correspond to the work within the codex.
- To allow other researchers to verify the results, it is recommended to include the query as it appears in the results, including not only the term, terms, or expressions searched, but also the filters used:
- Q = [(lemma='perro'%cd)]within text sort by yearobra
- Q = [(lemma='aceite'%cd)]:: match.text_materia3 = "medicina" & match.text_sigloobra = "14" within text sort by word
- Q =
- Users are kindly asked to inform the editors of the Old Spanish Textual Archive of any relevant scientific finding that results from consulting the data. Finally, users are asked to let the editors know about any transcription or program errors.
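To illustrate the structure of the OSTA code described in the list above, the following Python sketch splits a work code into its codex and work parts. The pattern is inferred only from the single example given (HSMS-0286-0001), and the names `OstaCode`, `parse_osta_code`, and `group_by_codex` are hypothetical helpers for illustration, not part of OSTA or its interface.

```python
import re
from typing import NamedTuple

# Pattern inferred from the one example code shown above (HSMS-0286-0001).
# This is an assumption: other prefixes or digit counts are not handled.
OSTA_CODE = re.compile(r"^HSMS-(\d{4})-(\d{4})$")


class OstaCode(NamedTuple):
    codex: str  # first four digits: the codex
    work: str   # last four digits: the work within the codex


def parse_osta_code(code: str) -> OstaCode:
    """Split an OSTA work code into its codex and work parts."""
    # Tolerate surrounding whitespace and the square brackets used above.
    match = OSTA_CODE.match(code.strip().strip("[]"))
    if match is None:
        raise ValueError(f"not a recognized OSTA code: {code!r}")
    return OstaCode(codex=match.group(1), work=match.group(2))


def group_by_codex(codes):
    """Group work codes by the codex they belong to."""
    groups = {}
    for code in codes:
        groups.setdefault(parse_osta_code(code).codex, []).append(code)
    return groups


print(parse_osta_code("[HSMS-0286-0001]"))
# prints: OstaCode(codex='0286', work='0001')
```

Keeping the digits as strings preserves the leading zeros, so the parts can be reassembled into the exact code required for citation.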
The available version of the Old Spanish Textual Archive has the following limitations:
- The download of the results in TSV format is limited to the first 250,000 examples.
- There are some 370,000 unknown forms in the entire corpus (1.1% of the corpus).
- The lemmatization and morphological analysis of a small number of forms is not correct; this will be corrected in a future revision of the FreeLing dictionary of forms.
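Because a TSV download stops at the first 250,000 examples, it can be useful to check whether an export reached that cap before counting results. The Python sketch below shows one possible way to do this. It assumes, hypothetically, that the exported file has a header row and a column named `Obra` holding the work code. The sample rows and the `Texto` column are invented stand-ins for a real download, and `summarize_export` is an illustrative name.

```python
import csv
import io
from collections import Counter

DOWNLOAD_LIMIT = 250_000  # cap on downloadable examples stated above


def summarize_export(tsv_file, work_column="Obra"):
    """Count rows per work code and flag a possibly truncated export.

    Assumes (hypothetically) a header row and a column named `work_column`.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    per_work = Counter(row[work_column] for row in reader)
    total = sum(per_work.values())
    # Reaching the cap suggests the export may not contain every match.
    return per_work, total, total >= DOWNLOAD_LIMIT


# Invented sample standing in for a downloaded file (illustrative only).
sample = io.StringIO(
    "Obra\tTexto\n"
    "HSMS-0286-0001\tejemplo uno\n"
    "HSMS-0286-0001\tejemplo dos\n"
)
per_work, total, maybe_truncated = summarize_export(sample)
print(per_work, total, maybe_truncated)
```

For a real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, with the column name adjusted to match the actual export.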
- Capuano, Thomas M. 2017. Diccionario herbario de textos antiguos y premodernos. New York: Hispanic Seminary of Medieval Studies.
- Carreras, Xavier, Isaac Chao, Lluís Padró and Muntsa Padró. 2004. “FreeLing: An Open-Source Suite of Language Analyzers.” Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC’04).
- Gago Jover, Francisco and F. Javier Pueyo Mena. 2018. “El Old Spanish Textual Archive, diseño y desarrollo de un corpus de textos medievales: lematización y etiquetado gramatical.” Scriptum Digital, 7, pp. 25-35.
- Gago Jover, Francisco and F. Javier Pueyo Mena. 2018. “El Old Spanish Textual Archive, diseño y desarrollo de un corpus de textos medievales: el corpus textual.” Cuadernos del Instituto Historia de la Lengua, 11, pp. 165-209.
- Gago Jover, Francisco. 2002. Vocabulario militar castellano (siglos XIII-XV). Granada: Universidad de Granada.
- Herrera, María Teresa. 1996. Diccionario español de textos médicos antiguos. Madrid: Arco/Libros.
- Kasten, Lloyd A. and John Nitti. 2002. Diccionario de la prosa castellana del Rey Alfonso X. New York: Hispanic Seminary of Medieval Studies.
- Nitti, John. 1978. “Computers and the Old Spanish Dictionary.” Computers and the Humanities, 12, pp. 43-52.
- Sánchez, María Nieves, et al. 2000. Diccionario español de documentos alfonsíes. Madrid: Arco/Libros.
- Sánchez Marco, Cristina, Gemma Boleda and Lluís Padró. 2011. “Extending the tool, or how to annotate historical language varieties.” Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 1-9. Portland, OR, USA, 24 June 2011.