On October 30, 2014, a presentation titled “Tibetan in Digital Communication: Corpus Linguistics and Lexicography: Using an annotated corpus to facilitate the philological study of Tibetan texts” was given at Columbia University’s Butler Library. Dr. Nathan W. Hill of SOAS (School of Oriental and African Studies), University of London, spoke on the goals, processes, and potential impacts of “Tibetan in Digital Communication: Corpus Linguistics and Lexicography”, a three-year research project funded by the UK’s Arts and Humanities Research Council (AHRC), on which he serves as a co-investigator. The presentation was co-sponsored by the Digital Humanities Center and Columbia Libraries Digital Program Division (CUL/IS), the C.V. Starr East Asian Library, and the Modern Tibetan Studies Program, Columbia University. Roughly two dozen people attended, including representatives from Latse Library, an institution in New York City that holds modern publications on Tibetan culture, society, history and literature.
The event was initiated by Barbara Rockenbach, Director of the Humanities and History Libraries at Columbia University, who is also charged with developing the Digital Humanities Center. She spoke of the importance of this kind of seminar, which exposes people from Columbia and beyond to new methodologies and tools for the digital humanities. After these short remarks, the Tibetan Studies Librarian, Lauran Hartley, introduced Dr. Nathan W. Hill, focusing on his previous contributions to Tibetan language studies.
To frame his talk, Dr. Hill used Maslow’s hierarchy of needs, a psychological construct in the shape of a pyramid. It posits that humans have several levels of needs and that the more basic needs (starting with physiological needs such as food and shelter) must be met before more sophisticated needs can be pursued, until the top level of the pyramid, self-actualization, is reached. In the field of corpus linguistics, Dr. Hill argued, a similar pyramid can be constructed, in which each level depends on the one below. The two most basic levels are the availability of the script in Unicode and the availability of some texts in digital format. Once these are in place, it is possible to create a segmenter that divides a text into words, then a tagger that distinguishes words by part of speech (for instance, ‘chair’, which may be either a verb or a noun), followed by a lemmatizer that associates different forms of a word with one another (Dr. Hill’s example was ‘sing’ and ‘sang’), and finally a parser capable of a higher order of syntactic analysis. For English, all of these tools and more are available, but for Tibetan, only the first two levels are. To begin rectifying this situation, the SOAS project has three goals: a one-million-syllable Part-of-Speech (POS) tagged corpus of Tibetan texts spanning the language’s history, an automatic word-segmenter, and an automatic POS tagger.
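The layering Dr. Hill described can be illustrated with a minimal Python sketch. It is not the project's software: the tiny English word lists, the function names, and the whitespace-based segmenter are all assumptions made for illustration, and splitting on spaces is only a stand-in for the much harder task of segmenting Tibetan. The point is simply that each stage consumes the output of the one below it.

```python
# Hypothetical toy data, invented for this illustration only.
TAGS = {"she": "pron", "sang": "v", "the": "det", "chairs": "n"}
LEMMAS = {"sang": "sing", "chairs": "chair"}


def segment(text):
    """Level 3: divide a text into words (whitespace is an English stand-in)."""
    return text.lower().split()


def tag(words):
    """Level 4: attach a part-of-speech tag to each segmented word."""
    return [(word, TAGS.get(word, "unknown")) for word in words]


def lemmatize(tagged):
    """Level 5: link each tagged form to its dictionary form."""
    return [(word, pos, LEMMAS.get(word, word)) for word, pos in tagged]


def analyse(text):
    """Each level depends on the one below: segment, then tag, then lemmatize."""
    return lemmatize(tag(segment(text)))


if __name__ == "__main__":
    for row in analyse("She sang"):
        print(row)
```

Without a working `segment` step there is nothing for `tag` to operate on, which mirrors the argument that tools lower in the pyramid must exist before the higher ones can be built.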
Dr. Hill next outlined the project’s workflow, specifically the process for tagging. In this highly iterative process, a text is first run through the computer, which assigns every possible tag to each word based on the lexicographical tools available to it as well as its memory of previous texts. The text is then pre-tagged by a rule-based tagger in order to provide the fewest possible choices to the student who will hand-tag the text. Once the text has been hand-tagged by a human being, it is run through the rule-based tagger again in order to see if the student has made any mistakes or if the computer has suggestions for changes to rules that may have been written incorrectly. The thoroughly tagged text can then be deposited in the project’s corpus. Dr. Hill concluded his talk with some examples of rule revisions that have been required, and a brief tour of the project’s website.
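The tagging loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the project's implementation: the lexicon, the single rule, the tag names, and every function name here are hypothetical, and English words are used in place of Tibetan.

```python
# Hypothetical lexicon: every tag each word form could carry.
LEXICON = {
    "we": {"pron"},
    "chair": {"n", "v"},
    "the": {"det"},
    "meeting": {"n", "v"},
}


def assign_all_tags(words):
    """Step 1: give each word every tag the lexicon allows."""
    return [set(LEXICON.get(word, {"unknown"})) for word in words]


def no_verb_after_determiner(candidates, i):
    """Example rule (an assumption): a word right after a determiner is not a verb."""
    if i > 0 and candidates[i - 1] == {"det"}:
        remaining = candidates[i] - {"v"}
        return remaining or candidates[i]  # never leave a word with no tag
    return candidates[i]


RULES = [no_verb_after_determiner]


def pre_tag(words):
    """Step 2: apply the rules to narrow the choices shown to the human tagger."""
    candidates = assign_all_tags(words)
    for i in range(len(words)):
        for rule in RULES:
            candidates[i] = rule(candidates, i)
    return candidates


def check_hand_tags(words, hand_tags):
    """Step 4: re-run the rules and report hand tags that they exclude."""
    allowed = pre_tag(words)
    return [
        (i, words[i], hand_tags[i])
        for i in range(len(words))
        if hand_tags[i] not in allowed[i]
    ]


if __name__ == "__main__":
    words = ["we", "chair", "the", "meeting"]
    print(pre_tag(words))
    print(check_hand_tags(words, ["pron", "v", "det", "v"]))
```

A conflict reported by `check_hand_tags` is ambiguous by design: either the hand tag is a slip or the rule itself is wrong and needs revising, which is the decision the workflow leaves to a person.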
In response to a question from one of the attendees, Dr. Hill gave some suggestions for practical application of the project’s outputs. These included the ability to note changes in prominent keywords in digital newspapers and tie results to particular events, track multiple web sources and notify a user when a new word emerges, and discover whether a long-ago drought was mentioned in the literature of that time period.
The project embraces many of the tenets of digital humanities work. It is highly open and accessible: it was deliberately designed to be useful to researchers from many different disciplines (history, literature, theology, etc.), and all the tools developed thus far are freely available online. It is also without a true finish line. Though the project has its own specific end goals, Dr. Hill acknowledged that it is only one step in a larger process; he welcomed thoughts and contributions from others and envisioned work to be done in the future.
Though they were not presented as such, the talk also provided two excellent counter-arguments to a major criticism of digital humanities work: that it is all about tool use and excludes human reflection. First, the use of Maslow’s hierarchy of needs puts the need for tool development in perspective. Just as words must be separated before they can be tagged, tools must be created before other projects can proceed. Second, the use of digital tools does not exclude humans from the process; rather, it enables them to engage at a higher level. Despite the project’s use of computing, every word in the corpus has been examined by a human being. As Dr. Hill said, a person is good at making hard decisions, but significantly less adept at making consistent decisions. By using the computer to ensure that rules are followed, and people to ensure that the rules make sense, the project draws on the best qualities of both the digital and the humanist.
As DH both values diversity and has been criticized for not being diverse enough, I would like to have heard more about the empowerment aspect of the project. As acknowledged in the project documentation, Tibetan is a much-neglected language in the digital realm, and work of this kind has the potential to change how Tibetans use and study their own language.
Overall, Dr. Hill thoroughly justified the need for this project, which seems to hold a great deal of promise for Tibetan studies. For more information and access to these linguistic tools as they stand, please visit the project website at: http://larkpie.net/tibetancorpus/.