Jordan Monts was an inaugural research intern at CSET during the summer of 2024 and is a junior at Florida A&M University in Tallahassee, Florida.
Collection and Extraction
Data collection is one of the most important steps of research. To work productively with data, it is essential to collect it accurately and organize it well. During my CSET internship this summer, I developed code to collect data from an external website to help inform CSET products. While working on this project, I learned to leverage software libraries, navigate application programming interfaces (APIs), and refine several other skills that built my proficiency as a developer.
In any complex software project, leveraging libraries helps streamline your work. For my summer project, I used Requests and BeautifulSoup. I used Requests to fetch the external site’s HTML content. Other libraries can also do this, but Requests has extensive documentation and tutorials that made debugging easy. After fetching the site’s content through Requests, I used BeautifulSoup to navigate through it, allowing for efficient information extraction from the HTML files that define what the user sees when they load a web page in their browser.
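The combination described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the URL and HTML structure here are hypothetical stand-ins, and the sample page is inlined so the parsing step is visible without a live request.

```python
import requests
from bs4 import BeautifulSoup


def fetch_page(url: str) -> str:
    """Fetch a page's raw HTML with Requests."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.text


# Sample HTML standing in for a fetched page.
sample_html = """
<html><body>
  <h1 class="title">An Example Article</h1>
  <span class="author">Jane Doe</span>
</body></html>
"""

# BeautifulSoup navigates the parsed HTML tree by tag and attribute.
soup = BeautifulSoup(sample_html, "html.parser")
title = soup.find("h1", class_="title").get_text(strip=True)
author = soup.find("span", class_="author").get_text(strip=True)
```

In practice, the string passed to `BeautifulSoup` would be the result of `fetch_page(...)` rather than a literal.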
The script I wrote has two components: a collector and a parser. The collector’s role is to retrieve and store the contents of external pages of interest in a local directory. The parser’s role is to read the content of those pages and extract useful metadata, such as author, title, publication date, and full text. These steps could all be performed by a single script, but splitting them up allowed me to debug the parser without making repeated requests to the external site and parse additional information from the collected files at a later time.
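The collector/parser split might look something like the following sketch. The directory name, file-naming scheme, and metadata fields are my own illustrative assumptions, not the script's real design; the key point is that the collector only writes raw HTML to disk, and the parser only reads from disk, so the parser can be re-run and debugged without re-requesting anything.

```python
import hashlib
from pathlib import Path

from bs4 import BeautifulSoup

RAW_DIR = Path("raw_pages")  # hypothetical local directory for collected pages


def collect(url: str, html: str) -> Path:
    """Collector: store a fetched page locally so it can be re-parsed later
    without making another request to the external site."""
    RAW_DIR.mkdir(exist_ok=True)
    # Derive a stable filename from the URL.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = RAW_DIR / name
    path.write_text(html, encoding="utf-8")
    return path


def parse(path: Path) -> dict:
    """Parser: read a stored page and extract metadata fields."""
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    heading = soup.find("h1")
    return {
        "title": heading.get_text(strip=True) if heading else None,
        "full_text": soup.get_text(" ", strip=True),
    }
```

Because the two halves share only the files on disk, new extraction logic can be added to `parse` months later and applied to every page already collected.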
Python libraries like BeautifulSoup can parse and extract information from collected web data. (Image source: Pexels)
Nearly every step in the project raised unexpected challenges, and I developed new skills and insights as I figured out how to address them. Installing libraries and verifying they worked were new experiences for me. At first, the libraries did not work because my local version of Python was not compatible with my version of pip, a necessary library management tool. To fix this, I had to carefully verify the versions and use pyenv and a virtual environment. Another hurdle emerged when I started writing the collector component. When I tried to collect multiple pages of data at once from the external site’s undocumented API, data was duplicated across different pages. To work around the issue, I decided to run the collector multiple times daily, which let me collect all of the content I needed without reading multiple pages at once. Finally, I learned why proxies are useful and how to implement them as I integrated a proxy into my code.
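For the last of those hurdles, Requests makes proxy integration straightforward: each request can be routed through a proxy via the `proxies` argument. The address and credentials below are placeholders, not anything from the actual project; in practice they would come from configuration rather than being hard-coded.

```python
import requests

# Hypothetical proxy endpoint; real credentials would be loaded from
# configuration or environment variables, never hard-coded.
PROXIES = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}


def fetch_via_proxy(url: str) -> str:
    """Route a request through the proxy so the collector's traffic does not
    all originate directly from one local IP address."""
    response = requests.get(url, proxies=PROXIES, timeout=30)
    response.raise_for_status()
    return response.text
```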
I was given several choices of projects this summer, mostly traditional research papers. However, as a computer science student, the option to develop a data collection pipeline immediately captured my interest. I knew the code I designed would benefit CSET even after my internship ended. That’s why computer science is so meaningful to me: it gives me the chance to develop lasting resources from my skills that can be shared and used beyond my own involvement.
Although my program is already generating useful data for CSET’s research, it has limitations. Most importantly, the API issue prevents data collection past the external site’s first page, so the pipeline cannot reach data published before it began running. This limits the historical scope of the dataset. However, the pipeline remains a robust tool for processing current and future data.
This internship experience not only enhanced my technical skills but also highlighted the importance of adaptability and problem-solving in software development. The challenges I overcame have prepared me for future projects and reinforced my commitment to creating policy-relevant technology. By building tools that support the research efforts of others, I have seen how computer science can drive innovation in diverse sectors, even long-standing fields such as policy, an intersection I had not previously envisioned. This project demonstrates how technical expertise can translate into long-lasting contributions, deepening my appreciation for addressing complex problems and creating valuable resources. With this newfound appreciation, I plan to focus on building more tools that lead to impactful solutions to real-world issues.