Old Spanish Textual Archive

In 2015, the HSMS began working on the Old Spanish Textual Archive (OSTA), a lemmatized and morphologically tagged linguistic corpus of about 35,000,000 words, based on more than 400 semi-palaeographic transcriptions of medieval texts written in Spanish, Asturian, Leonese, Navarro-Aragonese and Aragonese carried out by the collaborators of the HSMS. The project can be accessed through the following link:

http://osta.oldspanishtextualarchive.org

Project description

The origins of OSTA date back to 1978, when John J. Nitti, one of the editors of the Dictionary of the Old Spanish Language (DOSL) and co-founder of the Hispanic Seminary of Medieval Studies (HSMS), described a long-term project in an article titled “Computers and the Old Spanish Dictionary”:

the creation of the Old Spanish Archive (OSA), which is to be a repository … of all the machine-readable manuscripts and concordances of those works represented in DOSL… OSA will be established as a research archive open to any interested scholars wishing to make use of its facilities … eventually … information retrieval will be carried out via the computer … linking the magnetically-stored … machine-readable text transcriptions and concordances (43-52)

At the time of its conception, this project exceeded the available computing capabilities, so the medium-term objective of the HSMS became the creation and dissemination of a vast database of electronic transcriptions of manuscripts and incunabula written in Spanish between the years 1000 and 1600, using microfiche, CD-ROM and, as of 2011, the internet.

After an initial phase in which we delimited the textual corpus by analyzing the codices and their content, we began the process of lemmatization and morphological tagging, for which we used FreeLing, a Natural Language Processing tool, and HSMS-app, a textual analysis tool developed specifically for this project.

Starting in 2017, we began to expand FreeLing's lexical resources, working on the recognition of named entities (place names and anthroponyms), medieval spelling variants, and words not identified by any of the rules developed. To do this, we processed several of the dictionaries of the HSMS's Dictionary of the Old Spanish Language project: Diccionario español de textos médicos antiguos (Herrera 1996), Diccionario español de documentos alfonsíes (Sánchez 2000), Vocabulario militar castellano (siglos XIII-XV) (Gago Jover 2002), Diccionario de la prosa castellana del Rey Alfonso X (Kasten and Nitti 2002), and Diccionario herbario de textos antiguos y premodernos (Capuano 2017).

In early 2019 we started work on the query interface, improving FreeLing's affixation rules, reviewing the FreeLing dictionary of forms, and defining unidentified forms.

Additional Resources

	Manual de consulta: To take full advantage of all the possibilities that OSTA offers, it is recommended to read the Manual de consulta (in Spanish), which describes in detail the query interface, the types of queries, the filtering and the ordering of results.
	Tabla códices: Collects the metadata of each of the codices included in OSTA. It consists of the following fields: HSMS-ID (codex identifier), abreviatura HSMS (alphanumeric sequence used by HSMS to identify each of the transcriptions), BETA manid (registration number assigned by PhiloBiblon to each of the manuscripts or printed works where a work appears), BETA copid (record number assigned by PhiloBiblon to a specific copy of a printed book), biblioteca (current location of the manuscript or printed edition), signatura (shelfmark of the manuscript or printed edition), SPDT-inicio (specific production date, corresponding to the earliest date of the copy of a manuscript or the printing of an edition), SPDT-fin (specific production date, corresponding to the latest date of the copy of a manuscript or the printing of an edition), lugar específico (name of the place where the codex was written or printed), productor específico (name of the copyist or printer, when known), formato (format of the codex: handwritten or printed), número de folios (total number of folios in the codex), PhiloBiblon (direct link to PhiloBiblon), facsímil digital (direct link to the digital facsimile of the codex, when one exists).
	Tabla obras: Collects the metadata of each of the works included in OSTA. It consists of the following fields: abreviatura HSMS (alphanumeric sequence used by HSMS to identify each of the transcriptions), BETA manid (registration number assigned by PhiloBiblon to each of the manuscripts or printed works where a work appears), BETA copid (record number assigned by PhiloBiblon to a specific copy of a printed book), HSMS-ID (codex identifier), Obra ID (work identifier), BETA cnum (control number for each entry), Autor (name of the author when known; otherwise "desconocido"), Traductor (name of the translator when known; otherwise "desconocido"), Título (general or standardized title, following the rules established in PhiloBiblon), folio (the sequence of folios that each work occupies within the codex), OPDT inicio (original production date, corresponding to the earliest known or supposed date of writing of the original of each work), OPDT fin (original production date, corresponding to the latest known or supposed date of writing of the original of each work), lengua-1, lengua-2 (language or languages used in a given work), tipo textual (basic typology of the work: verse or prose), materia-1, materia-2, materia-3 (taxonomic classification of works by subject matter).
	Frequency table (word_lemma_AbsFreq_RelFreq): This table contains the frequency list of the whole corpus. The table is organized as follows: rank - word (token) - lemma - absolute frequency (total number of tokens) - relative frequency (%).
	Frequency table (word_lemma_PoS_AbsFreq_RelFreq): This table contains the frequency list of the whole corpus. The table is organized as follows: rank - word (token) - lemma - PoS - absolute frequency (total number of tokens) - relative frequency (%).
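
As a rough illustration of how the first frequency table could be processed, the following Python sketch reads rows in the column order given above (rank, word, lemma, absolute frequency, relative frequency) and adds up the absolute frequencies of all word forms that share a lemma. The tab-separated layout, the absence of a header row, and the function names are assumptions, and the sample rows are invented toy values, not OSTA data.

```python
# Toy rows in the assumed layout: rank, word, lemma, absolute freq, relative freq (%).
# These values are invented for illustration; they are NOT taken from OSTA.
SAMPLE = "1\tde\tde\t50\t50.0\n2\tla\tel\t30\t30.0\n3\tel\tel\t20\t20.0\n"


def read_frequency_table(text):
    """Parse tab-separated rows (assumed format, no header) into dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rank, word, lemma, abs_freq, rel_freq = line.split("\t")
        rows.append({
            "rank": int(rank),
            "word": word,
            "lemma": lemma,
            "abs_freq": int(abs_freq),
            "rel_freq": float(rel_freq),
        })
    return rows


def lemma_totals(rows):
    """Sum the absolute frequencies of all word forms that share a lemma."""
    totals = {}
    for row in rows:
        totals[row["lemma"]] = totals.get(row["lemma"], 0) + row["abs_freq"]
    return totals


def relative_frequency(abs_freq, corpus_size):
    """Relative frequency as a percentage of the corpus size."""
    return 100.0 * abs_freq / corpus_size


rows = read_frequency_table(SAMPLE)
totals = lemma_totals(rows)
corpus_size = sum(row["abs_freq"] for row in rows)
for lemma, count in totals.items():
    print(lemma, count, relative_frequency(count, corpus_size))
```

The same approach would apply to the second table after adding a PoS field between the lemma and the absolute frequency.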

Usage Conditions

The Old Spanish Textual Archive of the Hispanic Seminary of Medieval Studies is a free electronic resource with the following usage conditions:

Users will cite the Old Spanish Textual Archive in all research that uses its data. The citation format should follow this model (or a similar one that includes the same bibliographical information):

Gago Jover, Francisco and F. Javier Pueyo Mena. 2020. Old Spanish Textual Archive. Hispanic Seminary of Medieval Studies. On line at http://osta.oldspanishtextualarchive.org. [date of search]

Individual works must be cited using the OSTA code that appears under the Obra column in the results and in the metadata. In the code [HSMS-0286-0001], the first four digits (0286) identify the codex and the last four digits (0001) identify the work within the codex.
To allow other researchers to verify the results, it is recommended to reproduce the query exactly as it appears in the results, with both the terms or expressions searched and the filters used:
- Q = [(lemma='perro'%cd)] within text sort by yearobra
- Q = [(lemma='aceite'%cd)] :: match.text_materia3 = "medicina" & match.text_sigloobra = "14" within text sort by word
Users are kindly asked to inform the editors of the Old Spanish Textual Archive of any relevant scientific finding that results from consulting the data. Finally, users are asked to let the editors know about any transcription or program errors.
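
For researchers who handle many citations, the work code described above can be split into its codex and work parts programmatically. The following Python sketch is only an illustration: the pattern is inferred from the single example [HSMS-0286-0001], and the function name `parse_osta_code` is hypothetical, not part of OSTA.

```python
import re

# Pattern inferred from the one example given in the text, [HSMS-0286-0001].
# Whether every OSTA code follows exactly this shape is an assumption.
OSTA_CODE = re.compile(r"HSMS-(\d{4})-(\d{4})")


def parse_osta_code(code):
    """Split an OSTA work code into its codex and work identifiers."""
    cleaned = code.strip().strip("[]")  # accept the code with or without brackets
    match = OSTA_CODE.fullmatch(cleaned)
    if match is None:
        raise ValueError(f"not an OSTA work code: {code!r}")
    codex, work = match.groups()
    return {"codex": codex, "work": work}


parts = parse_osta_code("[HSMS-0286-0001]")
print(parts["codex"], parts["work"])
```

The identifiers are kept as strings so that leading zeros are preserved when the code is written back into a citation.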

Limitations in the available version

The available version of the Old Spanish Textual Archive has the following limitations:

The download of the results in TSV format is limited to the first 250,000 examples.
There are some 370,000 unknown forms in the entire corpus (1.1% of the corpus).
The lemmatization and morphological analysis of a small number of forms are incorrect; these will be corrected in a future revision of the FreeLing dictionary of forms.

Bibliography

Capuano, Thomas M. 2017. Diccionario herbario de textos antiguos y premodernos. New York: Hispanic Seminary of Medieval Studies.
Carreras, Xavier, Isaac Chao, Lluís Padró and Muntsa Padró. 2004. “FreeLing: An Open-Source Suite of Language Analyzers.” Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC’04).
Gago Jover, Francisco and F. Javier Pueyo Mena. 2018. “El Old Spanish Textual Archive, diseño y desarrollo de un corpus de textos medievales: lematización y etiquetado gramatical.” Scriptum Digital, 7: pp. 25-35.
Gago Jover, Francisco and F. Javier Pueyo Mena. 2018. “El Old Spanish Textual Archive, diseño y desarrollo de un corpus de textos medievales: el corpus textual.” Cuadernos del Instituto Historia de la Lengua, 11: pp. 165-209.
Gago Jover, Francisco. 2002. Vocabulario militar castellano (siglos XIII-XV). Granada: Universidad de Granada.
Herrera, María Teresa. 1996. Diccionario español de textos médicos antiguos. Madrid: Arco/Libros.
Kasten, Lloyd A. and John Nitti. 2002. Diccionario de la prosa castellana del Rey Alfonso X. New York: Hispanic Seminary of Medieval Studies.
Nitti, John. 1978. “Computers and the Old Spanish Dictionary.” Computers and the Humanities, 12: pp. 43-52.
Sánchez, María Nieves, et al. 2000. Diccionario español de documentos alfonsíes. Madrid: Arco/Libros.
Sánchez Marco, Cristina, Gemma Boleda and Lluís Padró. 2011. “Extending the tool, or how to annotate historical language varieties.” Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 1-9. Portland, OR, USA, 24 June 2011.