A conference organized by Patrick J. Burns, David Ratzan, and Sebastian Heath and sponsored by the Institute for the Study of the Ancient World, the ISAW Library, the NYU Center for the Humanities, NYU Division of Libraries, NYU Center for Ancient Studies, and NYU Department of Classics.
Institute for the Study of the Ancient World
15 E. 84th Street
New York, NY
Friday, April 20, 2018
9am-5pm
Future Philologies will provide a forum for historical-language projects with strong text analysis and NLP components to present their work across language-specific barriers, in an effort to foster cross-linguistic, comparative feedback, recommendations, and criticism between projects. Moreover, it is meant to embrace the scope of ancient-world languages represented at ISAW, whose mission statement includes the goal of offering “an unshuttered view of antiquity across vast stretches of time and place.” The format will be presentations on the state of corpus, text-analysis, and NLP work for each language, coupled with recent successes and immediate challenges to be addressed in the near future. Projects will represent Latin, Greek, Coptic, Arabic, Classical Chinese, and cuneiform languages, among others. Researchers in computer science and information science who can offer different perspectives on philological and corpus-based language work have also been included.
This paper describes four cumulative topics. First, it provides an inclusive model of philology, based on an 1822 statement by August Boeckh, in which philology encompasses any and all activities by which we mine the linguistic record of humanity to understand the past in its entirety, including everything that happens in the human mind and in the world around us. We are experiencing radical, perhaps revolutionary, change in how we realize this goal, but the Latin description that Boeckh delivered before the King of Prussia remains as true today as it was then. All that has changed are the ways in which we can imagine this goal. Second, in Digital Philology 1.0, philologists valued digital methods only insofar as they were useful for composing traditional publications for traditional, almost exclusively specialist, audiences. Scanned pages (as in JSTOR), PDFs, and e-books provided the most advanced digital models, each carefully tracking the formats of print culture. New features, such as search, digital distribution, and (occasionally) supplementary materials (such as extra images or sound), offered important, but fundamentally incremental, advances. Digital Philology 2.0 recognizes that we are experiencing a profound media shift. We no longer simply use digital methods to compose traditional articles. The shift to a digital world transforms how we imagine the forms, functions, audiences, and authors of publications, and the ways in which historical languages can advance the intellectual life of humanity as a whole. Third, “smart editions” are an emergent phenomenon in which networks of born-digital annotations combine to create a qualitatively new reading environment that subsumes and challenges every function of printed editions: the critical edition becomes one visualization of a wider textual tradition; the bilingual edition includes a dense array of annotations linking the source text to words and phrases in the modern-language translation, to exhaustive and expanding classes of linguistic analyses, and to a growing class of machine-actionable annotations that can be rapidly localized into multiple languages. Customization technologies allow readers to configure a smart edition to mimic its print predecessors or to present annotations by which readers interact directly with sources in languages that they do not know. Personalization technologies allow the system to adapt to the needs of each reader, supporting new methods by which readers can internalize historical languages and identifying the vocabulary, grammar, and other features most relevant to their interests. Smart editions open up opportunities for wholly new classes of contributions by experts, by citizen scientists, and by machine learning. Finally, this new hybrid mode of intellectual production reflects the evolving nature of work in society as a whole. Students of philology have an opportunity to acquire new configurations of skills that can prepare them for the most dynamic areas of the workplace. A new liberal arts education is emerging in which students can master languages such as Ancient Greek or Latin while developing skills for which they do not need to apologize in the wider world.
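As a purely illustrative sketch of the kind of machine-actionable, standoff annotation a smart edition might aggregate, the following Python fragment models a single annotation linking a Greek token to its lemma, morphological analysis, and aligned translation. The class, field names, and CTS-style URN are assumptions for illustration, not a schema proposed by the paper.

```python
# Illustrative sketch only: one possible shape for a standoff annotation in a
# "smart edition". All field names and the URN are assumptions, not a standard.
from dataclasses import dataclass

@dataclass
class Annotation:
    target: str                  # citation of the annotated span in the source text
    text: str                    # the token as printed in the edition
    lemma: str                   # dictionary headword
    morphology: dict             # machine-actionable linguistic analysis
    aligned_translation: str     # aligned word or phrase in the modern-language version
    annotator: str = "unknown"   # expert, citizen scientist, or machine-learning model

ann = Annotation(
    target="urn:cts:greekLit:tlg0012.tlg001:1.1@μῆνιν",
    text="μῆνιν",
    lemma="μῆνις",
    morphology={"pos": "noun", "case": "accusative", "number": "singular"},
    aligned_translation="wrath",
    annotator="expert",
)
print(ann.lemma, "->", ann.aligned_translation)
```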
Arabic is a widely spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Studying the history of the language has therefore so far been mostly limited to manual analyses on a small scale. In this talk, we present our work on a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, as well as other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development. We also discuss ongoing challenges in the use of this corpus for historical linguistic research.
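The abstract does not spell out the periodization algorithm. As a minimal, purely illustrative sketch of the general idea, the Python fragment below bins dated documents into fixed time slices and starts a new candidate period wherever the lexical profile of adjacent slices diverges; the bin width, similarity measure, threshold, and toy data are assumptions, not the authors' method.

```python
# Illustrative periodization sketch (NOT the algorithm described in the talk).
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def periodize(docs, bin_width=50, threshold=0.85):
    """Group (year, tokens) documents into candidate periods.

    Adjacent time bins are merged while their lexical profiles stay similar;
    a drop below `threshold` starts a new period.
    """
    bins = {}
    for year, tokens in docs:
        bins.setdefault(year // bin_width, Counter()).update(tokens)

    periods, current = [], []
    for b in sorted(bins):
        if current and cosine(bins[current[-1]], bins[b]) < threshold:
            periods.append(current)      # similarity dropped: close the period
            current = []
        current.append(b)
    if current:
        periods.append(current)
    # Report period boundaries in calendar years.
    return [(p[0] * bin_width, (p[-1] + 1) * bin_width) for p in periods]

# Toy corpus with a vocabulary shift around year 700.
docs = [(650, ["qala", "kitab"]), (680, ["qala", "kitab"]),
        (720, ["hadith", "sharh"]), (760, ["hadith", "sharh"])]
print(periodize(docs))  # -> [(650, 700), (700, 800)]
```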
Despite a growing interest in digital humanities as a field of study and focus of specialization, significant barriers to the adoption of digital techniques remain in research and teaching practice in many humanities disciplines. While an increasing number of humanities scholars have demonstrated willingness to invest time and effort in cultivating the necessary technical skills, many more are in practice prevented from experimenting with digital methods by perceived high barriers to entry. One approach to accelerating the adoption of digital techniques is to reduce the prerequisite technical skills required to apply them to research data, through the creation of platforms and tools able to bridge technical gaps for some of the most powerful and generally applicable use cases. With this goal in mind, this talk introduces a suite of browser-based text analysis tools designed for pre-modern Chinese materials and intended to integrate easily into scholarly workflows, including in particular those common in Chinese literature, philosophy, and history departments. Major goals include accessibility of the tools themselves, as well as transparency of how they work and the ability to inspect the mechanisms underlying the results and visualizations produced. By enabling rapid exploration of arbitrarily chosen textual materials while also providing insight into the algorithms used, these tools have pedagogical applications in addition to research uses, and are already in use for teaching at several institutions.
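One building block that browser-based tools for pre-modern Chinese commonly expose is similarity or text-reuse detection over unsegmented text. The sketch below is an illustrative assumption of how a minimal version might work, using character n-gram overlap (classical Chinese needs no word segmentation); it is not code from the tools described in the talk.

```python
# Minimal, illustrative sketch of character n-gram overlap for pre-modern
# Chinese passages. Parameters and example passages are assumptions.
def ngrams(text: str, n: int = 4) -> set:
    """Character n-grams; no word segmentation required."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def reuse_score(a: str, b: str, n: int = 4) -> float:
    """Jaccard overlap of character n-grams as a cheap text-reuse signal."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

passage_a = "學而時習之不亦說乎"
passage_b = "學而時習之不亦樂乎"
print(round(reuse_score(passage_a, passage_b), 3))  # -> 0.5
```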
This paper discusses a workflow for the automated translation of the roughly 69,000 administrative texts of the Ur III period (~2100-2000 BC) from Southern Mesopotamia (ancient Iraq), written in the Sumerian language. The morphological complexity of Sumerian leads to data sparsity, which makes it difficult to employ off-the-shelf tools for machine translation directly. To remedy this problem we perform a morphological analysis (lemmatization, morphological parsing), partly manually and partly automatically, in order to increase the rate of repetition in the data. It is then possible to apply both statistical and neural algorithms for machine translation. This paper focuses on the specific challenges posed by the idiosyncrasies of the language and how we address them in our translation workflow, which can be divided into three stages: prerequisites (e.g., lemmatization), analysis (e.g., annotation), and the machine translation proper. We discuss the algorithms, techniques, and tools we use, including annotation projection based on annotated English translations, word embeddings used for neural machine translation, and word- and phrase-based statistical machine translation. In addition, we perform syntactic annotation in line with Universal Dependencies, which maps grammatical features across languages, with the future goal of integrating this information into the translation workflow. This workflow is part of our work under the Machine Translation and Automated Analysis of Cuneiform Languages project, a collaboration of the University of Frankfurt, the University of Toronto, and the University of California, Los Angeles, funded through the 2016 Digging into Data challenge by the DFG, SSHRC, and NEH.
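By way of illustration, the sketch below shows the general kind of lemmatization step described above, in which inflected Sumerian surface forms are mapped to lemma-plus-gloss tokens so that repeated administrative formulae become identical strings for the downstream translation models. The toy lexicon, glosses, and example line are invented for demonstration and are not project data or tooling.

```python
# Hedged sketch of lemmatization as a sparsity-reducing preprocessing step.
# The lexicon entries below are simplified, illustrative assumptions.
lexicon = {
    # surface form -> (lemma, morphological gloss)
    "lugal-e":   ("lugal", "ERG"),      # "king" + ergative case
    "e2-a":      ("e2", "LOC"),         # "house" + locative
    "mu-na-du3": ("du3", "3SG.PFV"),    # "to build", finite verb form
}

def lemmatize(line: str) -> str:
    """Replace each token with its lemma and gloss, keeping unknowns as-is."""
    out = []
    for tok in line.split():
        lemma, gloss = lexicon.get(tok, (tok, "UNK"))
        out.append(f"{lemma}[{gloss}]")
    return " ".join(out)

print(lemmatize("lugal-e e2-a mu-na-du3"))
# -> lugal[ERG] e2[LOC] du3[3SG.PFV]
```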
The era of mass digitization seems to provide a mountain of source material for digital scholarship, but its foundations are constantly shifting. Selective archiving and digitization obscure data provenance, metadata fails to capture the presence of texts of mutable genres and uncertain authorship embedded within the archive, and automatic optical character recognition (OCR) transcripts contain word error rates above 30% even for eighteenth-century English. Beyond these issues, even the identity of any given transcription might change due to improved image processing or upgraded OCR. The condition of the mass-digitized text is thus closer to the manuscript sources of an edition than to a scholarly publication. In this talk, I will discuss several aspects of our work on "speculative bibliography" in the Viral Texts and Oceanic Exchanges projects (http://viraltexts.org), as applied to multi-authored, generically hybrid nineteenth-century newspapers specifically and to mass-digitized collections generally. After summarizing the low-level text-reuse techniques used to organize the data, I discuss methods for exploiting the redundancy and cluster structure in these archives to improve OCR accuracy using multi-input attention models and unsupervised learning. I then discuss methods for inferring network structure and mapping information propagation among texts and publications.
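As a rough illustration of the low-level text-reuse step mentioned above, the sketch below indexes word n-gram shingles and reports document pairs that share enough of them to count as reuse candidates. The projects' actual pipeline is considerably more sophisticated; the function names, n-gram size, and threshold here are assumptions for demonstration only.

```python
# Minimal, illustrative n-gram shingling sketch for text-reuse candidate pairs.
from collections import defaultdict
from itertools import combinations

def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles of a document."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def candidate_pairs(docs: dict, min_shared: int = 2):
    """Return document-id pairs sharing at least `min_shared` shingles."""
    index = defaultdict(set)                 # shingle -> set of doc ids
    for doc_id, text in docs.items():
        for s in shingles(text):
            index[s].add(doc_id)
    counts = defaultdict(int)                # (doc_a, doc_b) -> shared shingles
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return [pair for pair, c in counts.items() if c >= min_shared]

docs = {
    1: "the quick brown fox jumps over the lazy dog tonight",
    2: "a quick brown fox jumps over the lazy dog again today",
    3: "completely unrelated text about steam engines and rail",
}
print(candidate_pairs(docs, min_shared=2))  # -> [(1, 2)]
```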
Speaker presentations will be followed by a panel discussion.
We are now fully booked for this event and are only accepting names for the wait-list. Visit this page for more information.