Digital Humanities
@ Pratt

Inquiries into culture, meaning, and human value meet emerging technologies and cutting-edge skills at Pratt Institute's School of Information

Topic Modeling Cryptome’s Archive Over Time

Introduction

For 19 years, the nonprofit website Cryptome has collected and published a wide range of materials primarily related to domestic and international governmental affairs which have otherwise faced obstacles to traditional publication. Founded and solely maintained by the architects John Young and Deborah Natsios, Cryptome openly “welcomes documents for publication that are prohibited by governments worldwide, in particular material on freedom of expression, privacy, cryptology, dual-use technologies, national security, intelligence, and secret governance – open, secret, and classified documents.”(1)

Within this same epigraphical mission statement, the Cryptome home page states that documents are removed only in cases of direct US court order, and there are instances within the archive where Young and Natsios have posted their legal discourse over the inclusion of a document rather than remove it completely. Fascinating issues of representation and information freedom abound in the 100,000+ files that Cryptome has amassed since June 1996, though the archive’s size and site presentation give the impression of impenetrable opacity. Young and Natsios remain open with the contents of their archive, offering flash drives with rolling updates of their holdings for $100 donations on their website, in addition to hosting every document freely online. Even so, my studies in the Digital Humanities I course led me to believe there were more thorough tools available for evaluating what the Cryptome archive might offer a casual user.

Developing a Research Question

After spending some time browsing Cryptome’s online catalog and reading through interviews its founders have given, I developed a central research question. Given the breadth of the collection, not to mention the ambition of Cryptome’s welcoming documents “that are prohibited by governments worldwide”, I was curious about a way to examine the collection that might produce a set of topics which could be considered “prohibited.” In other words, I wanted to analyze the breadth of content within Cryptome’s massive archive in relation to Young and Natsios’s own informal epigraph. If readily penetrable, I imagined the data set might give linguistic and thematic definition to the genre of “open, secret, and classified” security and intelligence documents.

In light of our in-class digital humanities discussions and an exploratory methodology meeting with Professor Sula, I consulted literature on topic modeling with the intention of solidifying my research approach. In his article “Topic Modeling and Digital Humanities”, Blei defines a “topic” for probabilistic text models as “a probability distribution over terms.”(2) I concluded that this would be well-suited to analyzing the Cryptome archive primarily due to the exceptional number of terms to be considered. Because of the nature of the archive, though, I wanted to go deeper than mere word frequency or occurrences within the whole of the corpus. Young and Natsios had already outlined, albeit in broad strokes, the kinds of words and ideas they were interested in: freedom of expression, cryptology, classified documents, and so on. Their mission statement, though, does little to define the kinds of word or thematic patterns that make such documents the kind of work Cryptome sets out to publish. Neuhaus writes that in topic modeling, “documents are considered to be bags of words, and words are considered to be independent given the topics (i.e., word order is irrelevant).”(3) The topic modeling approach might be more likely to answer my research question of what this genre of document tends to look like, sound like, or even questions about the geography of these kinds of security issues.
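The two definitions quoted above can be made concrete with a small sketch. It is only an illustration: the function names, the sample sentences, and the topic weights are invented here and do not come from the Cryptome models.

```python
from collections import Counter


def bag_of_words(text):
    """Reduce a document to word counts, discarding word order."""
    return Counter(text.lower().split())


# A toy "topic" in Blei's sense: a probability distribution over terms.
# The terms and weights are invented for illustration only.
toy_topic = {"key": 0.40, "encryption": 0.35, "crypto": 0.25}


def top_terms(topic, n=10):
    """Return the n most probable terms of a topic, most probable first."""
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Two documents with the same words in a different order
# produce the same bag of words.
doc_a = "encryption key and crypto key"
doc_b = "key crypto and key encryption"
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
```

The "topic words" reported later in this post correspond to what `top_terms` returns: the highest-weighted terms of each distribution.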

Methods

I began my experiment with the question: what types of documents are in the Cryptome corpus, and what topic patterns can be identified? I would be using a Mac for my experiment, and my primary software for this task was to be the open-source Topic Modeling Tool distributed through Google Code, a Java application which would require that all documents analyzed first be converted into .txt files. A cursory look through the unsorted files on Cryptome’s flash drive revealed that the vast majority of Cryptome files fit into three categories of file type: .txt, .htm / .html, and .pdf. Rather than pull directly from the Cryptome flash drive, an endeavor which would have necessitated a mass clean-up of working folders, I opted to use Cryptome’s own indices. These indices were collected in a flash drive folder as .htm files that were seemingly identical to the pages hosted on Cryptome’s public website. The primary advantage of using these indices was that they were sorted into 40 separate pages, by and large dividing each year since the site’s 1996 inauguration into halves: “January-June 1998”, “July-December 1998”, etc. Although my research question did not explicitly involve time, I concluded that downloading and analyzing 20,000+ documents could only benefit from such implicit organization.
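For readers who would rather script this step, the sketch below shows one way an index page could be read and its links grouped by the three core file types. I did not do this myself; the class and function names are hypothetical, and the sketch assumes the index pages link to documents through ordinary anchor tags.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Map file extensions to the three core types discussed above.
CORE_TYPES = {".txt": "txt", ".htm": "html", ".html": "html", ".pdf": "pdf"}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def classify_links(index_html):
    """Group the links of one index page by core file type."""
    parser = LinkCollector()
    parser.feed(index_html)
    groups = {"txt": [], "html": [], "pdf": [], "other": []}
    for href in parser.links:
        suffix = PurePosixPath(urlparse(href).path).suffix.lower()
        groups[CORE_TYPES.get(suffix, "other")].append(href)
    return groups
```

Anything that is not a .txt, .htm / .html, or .pdf link, such as an image, lands in the "other" group.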

Having established my approach to the corpus, I set about downloading the entirety of the files to an external hard drive. To do this, I used the Firefox add-on DownThemAll. Although the limits of my software and expertise ultimately precluded the analysis of images, I chose to download every file within each individual section. The only exception was the two 2016 sections, titled “Current Listings” and “Recent Listings”, which accounted for various documents from 2016 and were both actively updating over the course of my data collection. Throughout the process, a fluctuating number of links were broken or files were not properly downloaded by the add-on, with the majority of these occurring in the 1990s listings and for files filed as “Offsite.” Adhering to Cryptome’s own half-year structure, I sorted the files I had by my three core types, leaving me with a corpus still totaling close to 20,000 .txt, .htm / .html, and .pdf files.

Because the Topic Modeling Tool analyzes only .txt files, the native .txt documents within each of my 38 sections needed no additional conversion. I used Terminal’s “textutil -convert txt” command to batch convert .htm and .html files into readable .txt files and copied these converted .txts into a separate folder alongside the native .txt files for each section. The final component of my data collection used the open-source Java application Drop 2 Read to output Unicode plain text files from the .pdfs via optical character recognition. In the hopes of accelerating this automated process, I copied the .pdfs from every section into one folder and set it as Drop 2 Read’s input. Similarly, I directed the .txt output to a single folder. This proved a problem when, either due to the demand of converting 6,000 .pdfs or my computer’s subpar processing power, the conversion had scarcely reached 20% completion over the course of three days. At this point I realized I had been wrong to think of the three file groups (.txt, .htm / .html, .pdf) as rough thirds of the Cryptome corpus. Sorting my .pdf folder by file size, I found that this file type included several very large documents that outweighed most of the .txt and .htm files: full-book scans, more than a few collected document series, and other such sizable items. In the interest of moving forward towards the actual topic modeling, I ultimately sorted the remaining unconverted .pdfs by file size and used Drop 2 Read to process every file sized up to 1 MB. I was disappointed to have unintentionally compromised the dataset, but this decision would ideally still ensure a dynamic if disproportionate range of topics.
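The size cutoff described above was done by hand in Finder, but it could be sketched as follows. The function name is hypothetical, and treating 1 MB as 1,000,000 bytes is an assumption, since I did not record whether my cutoff was decimal or binary.

```python
from pathlib import Path

# Assumption: 1 MB is taken as 10**6 bytes; a binary cutoff would be 2**20.
ONE_MB = 1_000_000


def small_pdfs(folder, max_bytes=ONE_MB):
    """Return the PDFs in `folder` no larger than `max_bytes`, smallest first."""
    candidates = [
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    ]
    within_limit = [p for p in candidates if p.stat().st_size <= max_bytes]
    return sorted(within_limit, key=lambda p: p.stat().st_size)
```

Processing the smallest files first means that an interrupted conversion run still covers the largest possible number of documents.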

Now with 38 nearly complete sections of .txt files of Cryptome’s documents, I copied all 20,000 .txt files into one folder for the Topic Modeling Tool. To determine my number of topics, I used the formula √(n/2), where n is the total number of documents. Upon initializing the topic modeling, however, my computer returned an error reporting insufficient Java memory. I first tried adjusting the number of iterations, then manually increased the Java memory allotment within my preferences, and, with no success, attempted to run the model on various Pratt computers. Unfortunately, I was unable to properly download the tool on any public terminal due to restrictions on unauthorized Java applications.
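As a quick worked example of the √(n/2) rule of thumb, here is a minimal sketch. The function name is hypothetical, and rounding to the nearest whole number is an assumption, since the formula itself does not say how to handle fractional results.

```python
import math


def suggested_topic_count(n_documents):
    """Apply the sqrt(n/2) rule of thumb for choosing a number of topics."""
    if n_documents < 1:
        raise ValueError("a corpus needs at least one document")
    # Assumption: round to the nearest integer, never going below one topic.
    return max(1, round(math.sqrt(n_documents / 2)))


# For a corpus of 20,000 documents: sqrt(20000 / 2) = sqrt(10000) = 100 topics.
full_corpus_topics = suggested_topic_count(20_000)
```

Smaller corpora yield proportionally fewer topics, so each yearly model later in this post received its own topic count.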

Assuming my computer’s failure was the result of the corpus size (20,000+ documents, approximately 12 GB), I would need to reframe my research question. I decided that randomly limiting the corpus simply so that I could use the Topic Modeling Tool would too drastically compromise the question of what the Cryptome corpus looked like. Fortunately, throughout this process I had preserved each section’s chronologically filed .txt files and converted .htm and .html files. To sort the .txt files for the successfully converted .pdfs chronologically, I used Terminal’s copy command to move each .txt file whose name matched one of a section’s .pdfs into a composite folder that also included that section’s previously sorted .txt and converted .htm / .html files. I could now run topic models for each section of the Cryptome archive and address questions about the makeup of their holdings over segments of time. To answer this question more concisely, I decided to combine the 38 half-year sections into 19 distinct corpora according to year. The exception would be a two-year hybrid combining the first section, technically a collection of 1996 documents as well as the first half of 1997, with the second section, which solely comprised latter-1997 documents. Finally, I ran each year of Cryptome’s holdings through the Topic Modeling Tool, again using the √(n/2) formula and standardizing each model at 200 iterations and 10 topic words.
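The re-sorting and regrouping described above can be sketched in a few lines. This is not the Terminal procedure I actually used: the function names and folder layout are hypothetical, and the sketch assumes the 38 sections are numbered 1 through 38 in chronological order.

```python
import shutil
from pathlib import Path


def corpus_label(section_number):
    """Map a half-year section (1-38) to one of the 19 yearly corpora."""
    if not 1 <= section_number <= 38:
        raise ValueError("expected a section number from 1 to 38")
    if section_number <= 2:
        # Sections 1 and 2 form the 1996-1997 hybrid corpus.
        return "1996-1997"
    # Sections 3 and 4 are 1998, sections 5 and 6 are 1999, and so on.
    return str(1998 + (section_number - 3) // 2)


def copy_matching_ocr_text(section_dir, ocr_dir, dest_dir):
    """Copy <name>.txt from ocr_dir for every <name>.pdf found in section_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(Path(section_dir).iterdir()):
        if pdf.suffix.lower() != ".pdf":
            continue
        candidate = Path(ocr_dir) / (pdf.stem + ".txt")
        if candidate.is_file():  # skip PDFs that were never converted
            shutil.copy2(candidate, dest / candidate.name)
            copied.append(candidate.name)
    return copied
```

Applying `corpus_label` to every section number yields 19 distinct labels: the 1996-1997 hybrid plus one for each year from 1998 through 2015.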

Results

My change in scope for the topic modeling ultimately yielded a set of results that seems more appropriate for analyzing broad discrepancies than for interrogating topic occurrence within specific documents or vice versa. Looking at 19 years’ worth of topics, certain trends emerge. One example is the gradual globalization of Cryptome’s archive. For the first few years, topics with reference to international affairs either err on the general side (“international criminal crime countries states united groups drug world organized” in 2000) or seem filtered through an American perspective (“nuclear china weapons united states satellite satellites . . .” in 1998). Perhaps unsurprisingly given the focus of American media at the time, the topics for 2001 and 2002 display a newfound preponderance of international geography. Countries such as Afghanistan and Israel are present alongside non-U.S.-centric topic words like “food”, and these years return non-English topic words throughout. Not every one of these patterns holds across Cryptome’s lifetime, but it was enlightening to identify how these kinds of pivot points arise from the topic modeling method.

Another approach I took to analyzing the archive over time was tracking particular topic words across various annual topics. Certain topic words, such as “key”, are consistently paired with the same topic words (“encryption” and “crypto” for “key”) no matter the year. The “key” topics in particular occur in both 1997 and 2015, not to mention several years in between. Similarly, over half of all the topics including the word “military” include “weapons”, “war”, or more specific words like “drone” that can be placed in similar context. These examples demonstrate the extent to which the vocabulary of Cryptome’s content seems to consistently add up to a bigger picture of how certain words are used in “classified documents.” The redefinition of diction, then, speaks to the possible understanding of “classified” as a genre with a distinct linguistic character. Still, this method is indirect in this comparison, and it seems that modeling across time like this essentially answers a different question than my initial question to the full Cryptome corpus.
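Tracking a topic word across years, as described above, amounts to a simple search over each year’s topic strings. The sketch below shows the idea; the function names are hypothetical, and the sample data contains only the two complete topics quoted elsewhere in this post, not the full model output.

```python
def topics_containing(topics_by_year, word):
    """Return (year, topic) pairs for every topic that includes `word`."""
    return [
        (year, topic)
        for year, topics in sorted(topics_by_year.items())
        for topic in topics
        if word in topic.split()
    ]


def cooccurrence_share(topics_by_year, word, companions):
    """Share of the topics containing `word` that also contain any companion."""
    hits = topics_containing(topics_by_year, word)
    if not hits:
        return 0.0
    paired = [t for _, t in hits if set(t.split()) & set(companions)]
    return len(paired) / len(hits)


# Only the two complete topics quoted in this post; not the full results.
sample_topics = {
    2000: ["international criminal crime countries states united groups drug world organized"],
    2009: ["president obama war barack house america washington http secret national"],
}
```

Run against all 19 yearly models, `cooccurrence_share(topics, "military", {"weapons", "war", "drone"})` would give the kind of proportion reported above for “military”.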

More than the topic modeling’s pivot points or varying word occurrence, though, carrying out the analysis by year provides a kind of survey of Cryptome as a public information resource. Topics in the first few years of the archive tend to revolve around matters of encryption, both as it pertains to laws and government and with regard to software licenses, copyright, and public users. Other words in these early topics tend towards what might now be thought of as tech jargon, such as “data algorithm packet” and so on. Even beyond the specific “encryption” topics, Cryptome’s early years exhibit a kind of niche interest that includes little reference to the events or even politics that are now historically known to have been occurring simultaneously, such as issues regarding Presidents or, as previously stated, foreign affairs. It isn’t until 2006 that topic words which now seem commonplace in the discussion of information security, such as “nsa”, begin occurring. Curiously, “nsa” becomes a fixture in the annual topics around the same time that “encryption” disappears for several years.

Furthermore, 2009 features a topic which, as a reflection of the public record, seems quite out of place with the way early Cryptome documents seemed to operate: “president obama war barack house america washington http secret national”. There are through lines in the way that politics are still correlated with “war” and possibly subjective terms such as “secret”. Still, it marks a shift in Cryptome’s potential role as a decidedly non-niche public source of reference. This trend continues: from 2010 to 2013, at least one topic per year includes the topic word “wikileaks” or “assange”, sometimes both, and both share topics with words like “security”, “nsa”, and “file”. This stretch, even more than Cryptome’s acknowledgement of 2008-09’s public discussion of Obama, reflects how Cryptome’s canon has now become aligned with the historical record’s own burgeoning interest in information security rights. Because Cryptome had been publishing and possibly soliciting or searching for documents of encryption and the like as far back as 1997, it is difficult to say that Cryptome’s 2010-13 topics exhibit a concerted effort at following the same trends and sensational figures as the mainstream American media. In order to answer a question like that, it would be necessary to gather data on Young’s and Natsios’s acquisitions process. Nonetheless, by the time “assange” shares a 2013 topic with “snowden” and then disappears while the latter continues through 2015, it is clear that the topic modeling of Cryptome’s chronology has resulted in possible evidence that their archive has in some sense transcended the fringe pursuits of its origins. Another potential interpretation, of course, is that the fringe’s concern with information leaks has instead moved towards the mainstream. It is worth noting that 2015’s topics include both “pdf nsa snowden update . . .” and “encryption security crypto software . . .” By examining the corpus in this piecemeal chronological manner, despite my best efforts not to, I stumbled upon one possible manifestation of Cryptome’s otherwise opaque identity complex.

Future Directions

Given more time and resources, the clearest direction forward would be to find a way to topic model my original corpus of 20,000+ documents. It is possible that processing through MALLET directly could compensate for the apparent performance issues with the Topic Modeling Tool. As indicated above, I believe that the analysis I was able to carry out with topic modeling annually across Cryptome’s archive was a promising place to start in terms of considering how co-occurring topic words reveal the semblance of genre in consistency of language and theme. Ultimately, though, my necessary shift in scope answered my initial question on this matter only indirectly, and more directly answered the separate question of Cryptome’s relationship to the historical record.

I am also hopeful that further work with this data might include the image files that gradually carry more weight within Cryptome’s holdings as the years progress. Although it was beyond my scope for this project, I would think that data on the points in time when these various file types begin to appear would also be worth including.


References

  1. Cryptome. http://cryptome.org/
  2. Blei, David (2012). Topic Modeling and Digital Humanities. Journal of Digital Humanities 2(1).
  3. Neuhaus, Stephan and Zimmermann, Thomas (2010). Security Trend Analysis with CVE Topic Models. http://research.microsoft.com/pubs/136976/neuhaus-issre-2010.pdf