Project Details


Funding will provide graduate student support for two Web Archiving Data Workshops in 2017. The workshops focus on analyzing archived Web data. Such data sources provide unique opportunities to examine social science phenomena that have evolved over time, such as how news media is presented on the Web or how individuals interact in online discussions. On the other hand, such data are often difficult to access and challenging to analyze for three reasons: the data were often captured using sporadic or random approaches, the data formats are opaque and not always standardized, and the collections are large (mid-sized collections can be on the terabyte scale). The workshops will teach graduate students about combinations of code and research questions that can be used to design appropriate research studies and then to access and analyze this type of data. The focus of the workshops is interdisciplinary, bringing together research interests from disciplines as diverse as computer science, social science, and the humanities. Participants will work hands-on with data, learn new programming skills, and develop pilot projects for new research that they will be able to continue at their home institutions. For example, workshop participants will learn about the Warcbase software package (based on Spark and Scala), and will also have the opportunity to analyze data using in-browser Python notebooks. The first workshop will take place in February 2017 at the Internet Archive in San Francisco, CA, and the second workshop will take place in June 2017 at the British Library in London, UK. The first workshop will focus on the use of application programming interfaces for interoperability between collections and institutions.
The second workshop will focus on establishing international collaborations between North American and European researchers, and will include a discussion of how to combine archived Web data with other types of data, such as socioeconomic variables. Data used and projects developed during the workshops will be published to the workshop website and made available via a GitHub repository. A final report will summarize the results of the workshops and review the current state-of-the-art research pertaining to archived Web data. This will be a significant contribution to the field as it is beginning to coalesce.
Effective start/end date: 2/15/17 → 1/31/18


  • National Science Foundation (NSF)

