Post

The Worst Project I Ever Finished

Tangled wires. McKay Savage from London, UK, CC BY 2.0 https://creativecommons.org/licenses/by/2.0, via Wikimedia Commons

When I was 15 and new to the internet, I made the mistake of making a Reddit account. Reddit features a “saved posts” category, meant to be a way to bookmark posts to revisit them later. I quickly accumulated everything from movie discussion threads to news articles to self-help to video game fan art, and realized I wasn’t going to be able to sort through them all without some kind of help. (Which has become a recurring theme in my life.) I was just getting the hang of Python after reading Automate the Boring Stuff, so when I saw a blog post from the author sketching out a Python program to scrape post info and download images from a user’s saved category, I decided to start a solo Python project with that as inspiration. Of course, being the ambitious and incredibly talented 15 year old I was, I set much sights much larger. It could only end well.

That was about nine years ago. What started off as a simple time-saver rapidly became an over-scoped, ever-evolving mess that I struggled to wrangle on and off over the course of my education. It grew from just a basic Python project to include pytest, requests/httpx, beautifulsoup, pre-commit, and many other packages, with poetry to help manage them all. Whether it’s the same project or not after rescoping, renaming, and rewriting every line of code at least once depends on your outlook on the ship of Theseus riddle, but I’ve always considered it a single project. My aim here, now that I think it’s gotten to a (mostly) finished state and my unit are all passing, is to deliver a post-mortem on the problems I made for myself, how I eventually discovered them, and how I solved them.

xkcd 1319, "Automation"xkcd 1319, “Automation”

Problem 1: Overscoping

The first problem was that I wanted this program to do everything: it would know to bookmark news articles, reddit text posts, and Wikipedia articles in different folders my internet browser. Any posts linking to YouTube videos would be saved to my YouTube watch later playlist. I would even search entire post comment sections from comments made by the original poster. This was all on top of the original plan to download and sort gifs, jpgs, and pngs linked in posts and comments, which itself would require some dedicated work with webscraping and APIs. The problem was that I wanted too much functionality.

Another scoping mistake, and one that took me until midway through college to fix, was that I wanted to deliver my program in two totally different ways before it was off the ground: I wanted a script that could be run locally and a web version. I first started the website with Flask, then moved to FastAPI, then considered WebAssembly. I’d also need a JavaScript frontend to handle all the complicated UI elements I wanted. Was it too much to build a block-based UI for users to custom format their file titles?

I ended up learning a lot about how the internet, APIs, and webscraping all worked, but I had nothing even close to a working prototype after months. This shouldn’t be a surprise: because I overscoped, I had no chance of delivering a minimum viable product in any reasonable timeframe; even worse, if I ever got close to finishing a feature, I’d decide to add two more and redesign to accommodate. Unfortunately, it wasn’t even problem #1 that made me realize I had to scale back: it was problem #2.

Problem 2: Ill-Defined Problems

So, when your program has one input (a reddit post) and wants to jump into many possible different outputs (interfacing with Google Chrome bookmarks, YouTube playlists, webscraping, and more), you need a way to determine which inputs trigger what functionality. The goal of my program was unfortunately too poorly defined for me to achieve this: an article linking to CNN or the Associated Press should certainly go into a news folder, but what about all the random blogs and websites that also posted articles? Where should they go? Should there be an age threshhold to filter old content?

Sometimes I couldn’t know the answers to these questions even on an individual, case-by-case basis without actually reading or engaging with whatever media I had bookmarked in the first place. Sometimes you can’t tell whether an article is worth reading until after you’ve read it, or an image if worth making your wallpaper until after you see it on your screen. My problem was that in too many cases, I didn’t have a clear idea of how I wanted the program to behave. So even when I had the core control flow worked out, I kept going back and forth on which posts should trigger which if clauses. I ended up just going in circles.

Eventually I realized this problem wasn’t solvable: I needed to scope down and focus on the cases I did know what to do with. I eventually returned to the original idea of webscraping because it applied in clear cases and I knew what should happen in each case: find all scrape-able pictures in a post and download them to a folder. (Hence “PaperScraper”.)

Problem 3: Complexity

Even within just webscraping, I had some issues: some websites like reddit, imgur, and artstation often consisted of just top-level images, and could be scraped directly… sometimes. Most of them had alternate display pages for single images, which usually contained a direct link to the content I actually wanted. Other sites like gfycat or flickr seemed much harder to scrape and seemed to railroad me into using their API. Some sites like gfycat seemed to have both a complicated page layout and a complicated or restrictive API. (Funnily enough, gfycat shut down before I ever added support, but I’m always happy to close a #TODO either way.)

So I had a single problem that had a homogenous interface – I just wanted to take a url and pass it to some kind of download function – but actually involved a whole bunch of heterogenous behavior under the hood. Unfortunately, I didn’t have the design experience to know how to organize my program.

I broke everything down into a chain of ifs and elifs, which I rewrote and reorganized constantly. I eventually heard the rule of thumb that if one part of a project is being constantly changed, that means it’s probably a “pain point” in need of more fundamental redesign. I approached that chunk of code with an open mind, breaking it down into smaller functions. After noticing some commonalities between each of those functions, I refactored and simplified them until they all had the same signature. Now instead of having a single gargantuan function with a massive chain of semi-unrelated elifs, I just had one list of parser functions; all I had to do was try each one until one of the functions returned a positive result, or I exhausted the list. Each parsing technique was neatly self-contained, and I could easily add, remove, or refactor support for websites on an individual level much more easily because they were now all decoupled from each other and the control flow. I later discovered that this design is called the strategy pattern.

By changing my perspective and using a design pattern, I was able to find an elegant solution to my problem. I’ve now spent quite a bit of time studying design patterns to save time in the future (mostly from gang of four and game programming patterns).

Problem 4: Technology

It took me until the end of highschool to discover Git. In the meantime, I got by with the only tool I had: the Windows file system. I would copy my code to a new folder and rename that folder the current date every time I started working on a new revision. Even with a seemingly simple system, I managed to screw that up, and would sometimes mix up code and then need to merge old and new code by hand. I also didn’t have a clear system for tracking what was done, in progress, or remained to be done.

I would also take breaks and return as school and motivation permitted. Sometimes I would come back to my (finally!) working prototype and discover that it had somehow broken. It wasn’t long before I realized the why: a package that I relied on, the Python Reddit API Wrapper library (or praw for short), had pushed out a lot of breaking changes when my development was most active: I started using it right at the end of version 3 in 2015/2016. It was then fully rewritten into praw 4 in 2016, and then had three more major releases culminating in the now current version 7 in 2020.

Both of these factors meant I would have to wrangle a lot of old versions of my project without knowing where I had left off, what I had left to do, or if my program was even in a working state. On top of all that, restarting development on a new machine meant running pip install requirements.txt with unversioned dependencies, meaning a new batch of compatibility bugs would be mixed in!

I ended up fixing this by learning Git just before my freshman year of college to handle version control, and eventually using pyenv and poetry to manage my python environments and packages. My reward was a full online backup of my project complete with version history, commit messages annotating each major change, and an elegant way to track and install versioned packages. By putting in effort to do project overhead, I was able to save myself a huge amount of frustration and confusion.

Problem 5: Consistency and Tests

My biggest fear when making this project was that I was going to mishandle and lose the data I was trying to organize. Sooner or later I realized I was going to need a lot of unit tests. Fortunately, Python provides a great built-in testing library with mocking features like the @patch decorator and MagicMock. The nice part is that I can easily mock functions that would require me to make many API calls all at once, each with potentially undesirable side effects. PyTest also offers a lot of plugins, so I could install a testing library like vcrpy to cache requests and run the whole test suite offline while using cached responses to mock network calls.

The problem was just that writing these tests was a lot of work that I had left to the end of my development cycle. I sometimes realized that a single function was difficult to test because it was doing more work than it should be, and that I had to break it down further to save myself time and effort while testing. If I had written my tests before writing so much code (or at the very least sprinkled them in to my development), I could have saved a lot of time and energy. Unfortunately, testing, refactoring, and fixing bugs all at the end of a project, rather than consistently during development, caused me to burn out.

I learned that I should break down work that I don’t enjoy and distribute it across the entire development cycle, rather than pushing it off to the very end. It’s a bit demoralizing to have a project that I believe works and is 99% done, but requires a lot of difficult work to safely verify.

Conclusion

While it’s true that developing PaperScraper was difficult and pretty painful at times, I want to provide a disclaimer: I think working on a self-managed project like this is one of the best ways to learn software development as a beginner. Having a one-person project with many different parts gave me the freedom (and ability) to do large-scale refactors and architecting the same project multiple different ways; solo projects give essential space for experimentation and iteration that large projects don’t. I’ve come back to PaperScraper so many times over the past nine years not because I believed that it would actually save me time anymore — cutting back on reddit in the first place accomplished that — but because I enjoyed working on it, and I learned something new about software development and my programming habits every time I came back.

The development cycle was long and unwieldy, but that’s mostly because I did this as an exercise in self-teaching rather than out of a desire to get the end product out the door: the journey was the destination. Some parts of development also went very well: I got a lot of solid experience with Git, GitHub, common Python libraries, webscraping, and I finally understood asynchronous programming after replacing the requests library with httpx for a performance boost. At some point in development, I think I graduated to an “advanced” Python programmer: I’m very experienced with the standard library, various language features and quirks, and a solid sampling of third party tooling. (Especially stuff like pre-commit, which I love.)

I still have a lot to learn about software development and maintenance, and I’ll probably keep coming back to PaperScraper — it wouldn’t be the first time I got it in an actually working state and then insisted on a rewrite. There’s always more to be done. (And I still haven’t given up on a web implementation…) But I’m nine years wiser, and I’m confident that I have the experience and wisdom to avoid making the same mistakes, or at least make those mistakes in new and more creative ways.

This post is licensed under CC BY 4.0 by the author.