Jordan Monts was an inaugural research intern at CSET during the summer of 2024 and is a junior at Florida A&M University in Tallahassee, Florida.
Collection and Extraction
Data collection is one of the most important steps of research. To work productively with data, it is essential to collect it accurately and organize it well. During my CSET internship this summer, I developed code to collect data from an external website to help inform CSET products. While working on this project, I learned to leverage software libraries, navigate application programming interfaces (APIs), and refine several other skills that built my proficiency as a developer.
In any complex software project, leveraging libraries helps streamline your work. For my summer project, I used Requests and BeautifulSoup. I used Requests to fetch the external site’s HTML content. Other libraries can also do this, but Requests has extensive documentation and tutorials that made debugging easy. After fetching the site’s content through Requests, I used BeautifulSoup to navigate through it, allowing for efficient information extraction from the HTML files that define what the user sees when they load a web page in their browser.
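The combination described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the URL and HTML structure here are hypothetical stand-ins, and the sample page is inlined so the parsing step is visible without a live request.

```python
import requests
from bs4 import BeautifulSoup


def fetch_page(url: str) -> str:
    """Fetch a page's raw HTML with Requests."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.text


# Sample HTML standing in for a fetched page.
sample_html = """
<html><body>
  <h1 class="title">An Example Article</h1>
  <span class="author">Jane Doe</span>
</body></html>
"""

# BeautifulSoup navigates the parsed HTML tree by tag and attribute.
soup = BeautifulSoup(sample_html, "html.parser")
title = soup.find("h1", class_="title").get_text(strip=True)
author = soup.find("span", class_="author").get_text(strip=True)
```

In practice, the string passed to `BeautifulSoup` would be the result of `fetch_page(...)` rather than a literal.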
The script I wrote has two components: a collector and a parser. The collector’s role is to retrieve and store the contents of external pages of interest in a local directory. The parser’s role is to read the content of those pages and extract useful metadata, such as author, title, publication date, and full text. These steps could all be performed by a single script, but splitting them up allowed me to debug the parser without making repeated requests to the external site and parse additional information from the collected files at a later time.
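The collector/parser split might look something like the following sketch. The directory name, file-naming scheme, and metadata fields are my own illustrative assumptions, not the script's real design; the key point is that the collector only writes raw HTML to disk, and the parser only reads from disk, so the parser can be re-run and debugged without re-requesting anything.

```python
import hashlib
from pathlib import Path

from bs4 import BeautifulSoup

RAW_DIR = Path("raw_pages")  # hypothetical local directory for collected pages


def collect(url: str, html: str) -> Path:
    """Collector: store a fetched page locally so it can be re-parsed later
    without making another request to the external site."""
    RAW_DIR.mkdir(exist_ok=True)
    # Derive a stable filename from the URL.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = RAW_DIR / name
    path.write_text(html, encoding="utf-8")
    return path


def parse(path: Path) -> dict:
    """Parser: read a stored page and extract metadata fields."""
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    heading = soup.find("h1")
    return {
        "title": heading.get_text(strip=True) if heading else None,
        "full_text": soup.get_text(" ", strip=True),
    }
```

Because the two halves share only the files on disk, new extraction logic can be added to `parse` months later and applied to every page already collected.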
Python libraries like BeautifulSoup can parse and extract information from collected web data. (Image source: Pexels)
Nearly every step in the project raised unexpected challenges, and I developed new skills and insights as I figured out how to address them. Installing libraries and verifying they worked were new experiences for me. At first, the libraries did not work because my local version of Python was not compatible with my version of pip, a necessary library management tool. To fix this, I had to carefully verify the versions and use pyenv and a virtual environment. Another hurdle emerged when I started writing the collector component. When I tried to collect multiple pages of data at once from the external site’s undocumented API, data was duplicated across different pages. To work around the issue, I decided to run the collector multiple times daily, which let me collect all of the content I needed without reading multiple pages at once. Finally, I learned why proxies are useful and how to implement them as I integrated a proxy into my code.
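For the last of those hurdles, Requests makes proxy integration straightforward: each request can be routed through a proxy via the `proxies` argument. The address and credentials below are placeholders, not anything from the actual project; in practice they would come from configuration rather than being hard-coded.

```python
import requests

# Hypothetical proxy endpoint; real credentials would be loaded from
# configuration or environment variables, never hard-coded.
PROXIES = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}


def fetch_via_proxy(url: str) -> str:
    """Route a request through the proxy so the collector's traffic does not
    all originate directly from one local IP address."""
    response = requests.get(url, proxies=PROXIES, timeout=30)
    response.raise_for_status()
    return response.text
```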
I was given several choices of projects this summer, mostly traditional research papers. However, as a computer science student, the option to develop a data collection pipeline immediately captured my interest. I knew the code I designed would benefit CSET even after my internship ended. That’s why computer science is so meaningful to me: it gives me the chance to develop lasting resources from my skills that can be shared and used beyond my own involvement.
Although my program is already generating useful data for CSET’s research, it has limitations. Most importantly, the API issue prevents data collection past the external site’s first page, so the pipeline cannot reach data published before it began running. This limits the historical scope of the dataset. However, the pipeline remains a robust tool for processing current and future data.
This internship experience not only enhanced my technical skills but also highlighted the importance of adaptability and problem-solving in software development. The challenges I overcame have prepared me for future projects and reinforced my commitment to creating policy-relevant technology. By building tools that support the research efforts of others, I have seen how computer science can drive innovation in diverse sectors, even long-standing fields such as policy, an intersection I had not previously envisioned. This project demonstrates how technical expertise can translate into long-lasting contributions, deepening my appreciation for addressing complex problems and creating valuable resources. With this newfound appreciation, I plan to focus on building more tools that lead to impactful solutions to real-world issues.