Plotting the trail for Django Cairn

This high-level post will go over how I design Django projects. The goal is to consider every action the application will need and determine what is needed to implement those actions. We’ll encounter some hand-waving around the details, but that’s fine. We can’t know everything. And if we did, why are we bothering with this?

The benefit of this exercise is a more complete understanding of the project and an outline of what needs to be implemented. An outline makes it much easier to split the work into chunks and determine the dependencies and priorities. On a team, that makes it possible to delegate work to others and maximize project efficiency. Even if you’re working alone, it’s nice to know how far you need to go. Personally, if I chug along on a project without knowing where it ends, I’m likely to become demoralized and quit. Plus, I simply find this exercise fun.

My general approach is to ask myself, “how do I want this to work?” That answer is probably a list of features or requirements. Next, I go through that list and ask the question again. “How do I want this to work?” I continue this process until I’m satisfied with the level of details included.

Step 1: Define the purpose of the project

The project I’ll be working on is Django Cairn (read more about what it is here). The purpose is to create an index of Django knowledge around the community and to guide folks to particularly useful resources.

Step 2: How do I want Django Cairn to work

Now that I have the two purposes for the project, I can move on to defining how the project should work to accomplish those goals. First, the project will need to collect and store knowledge from sources. This will achieve the purpose of being an index of knowledge. The second purpose is a little more subtle. My current interpretation of “being a guide for community resources” is that there will need to be curation and reviews of content.

This all means I want Django Cairn to do the following:

Step 3: Skeleton data model

Some of the previous requirements are straightforward. Some hand-wave the details. I like to start by defining the interfaces of the project, then move directly to the data model.

When I use the term interface here, it doesn’t necessarily mean the user interface. Instead, it means the set of interactions between the system and the user. Step 2 outlined the interface as a series of statements about what the system should do and how users will interact with it.

As more of the interface gets defined, the data model will require updates. At this phase of the project changing the data model has a low cost. However, as the project gets closer to completion the cost to change the data model grows significantly. Django Cairn is almost certainly missing definitions on how it should work. This process should help me discover them.

So far, here’s what I have for model definitions to accomplish the goals as I currently understand them:
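As a rough illustration of the shape such a skeleton could take, here is a minimal sketch using plain Python dataclasses rather than real Django models. The three class names follow the terms used in this post (sources, content, reviews), but every field name is an assumption made for the example; in a Django project these would be `models.Model` subclasses with foreign keys in place of the object references.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Source:
    """Somewhere knowledge comes from, such as a blog."""
    name: str
    url: str


@dataclass
class Content:
    """A single item published by a Source."""
    source: Source
    title: str
    url: str
    published: Optional[datetime] = None


@dataclass
class Review:
    """Curated commentary on one piece of Content."""
    content: Content
    commentary: str
    recommended: bool = False   # hypothetical flag for particularly good content
    experience_level: str = ""  # hypothetical, e.g. "beginner"
```

Keeping the first draft this small is deliberate: it is cheap to change now, and the later steps are expected to add fields to it.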

Step 4: How does the data model get populated

With the basic data model established, I can focus on the next phase. Namely, how to get data into the data model.

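Going back to the list of desired actions, there are four related to content creation:

  1. Support adding new content sources
  2. Fetch new content from known sources
  3. Refetch content periodically to check staleness
  4. Support creating reviews of content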

Let’s consider them individually.

Support adding new content sources

This sounds fairly straightforward. There needs to be a way to add new content sources. However, I never clarified how exactly those content sources should be added. Will it scour the web searching for any blog using Django, or will it be a minimalistic form? If you’re working with others, here’s when you need to collaborate with them.

Since I’m building this alone, I get to decide. I want to get something off the ground as quickly as possible, but still support adding known content types in the future. The known content type for me will be any blog that publishes an RSS feed.

I don’t plan on automatically ingesting content from conferences such as DjangoCon US because there’s no guarantee that the data will be the same from one year to the next. And there’s especially no guarantee that DjangoCon US, DjangoCon EU, and Python Web Conference will all use the same data file structure. While I’d love to automatically pull that data in every year, it’s simply too much work. If I find that there are commonalities in the future, I can build it out then.

Hidden in that decision is a change to my data model. I’ve started talking about different types of sources. Now the data model needs to support that difference. The new model will be:
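One way to picture that change is a type field on the source. The sketch below is an assumption for illustration only: the enum, its two members, and the `auto_fetch` property are hypothetical names, and a Django model would more likely express this as a `CharField` with choices.

```python
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    BLOG = "blog"              # publishes an RSS feed, so content can be fetched
    CONFERENCE = "conference"  # no stable data format, so content is added by hand


@dataclass
class Source:
    name: str
    url: str
    source_type: SourceType

    @property
    def auto_fetch(self) -> bool:
        """Only blog sources are ingested automatically in this sketch."""
        return self.source_type is SourceType.BLOG
```

Adding another known content type later would then mean adding an enum member and deciding whether it can be fetched automatically.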

Let’s get back to actually creating a source. Who is performing this action? To start, it will largely be only myself. However, the purpose of this application is to be a catalog of Django knowledge and I certainly will never know all the sources. At some level, the application needs to support accepting outside submissions for new sources. Going further along that process, after a source has been requested, I’ll need to review something. Automatically ingesting content from an unknown or anonymous source and then rendering it is an easy way to host content that violates the Django Code of Conduct—and my values. I think a sufficient requirement for a new source request is an email address to contact for more information, and a text-based reason upon submission.

Now, let’s consider this specific action again: “a new source is requested via a form”. How does that work? Should it create an entry in a table that sends a notification to be reviewed, or should that table be the Source model or an entirely different model? Seeing as my driving force is speed and functionality, I think adding a few fields to the Source model and relying on email to manage source requests is an effective solution. After all, the email can contain a link to the Django administration page to modify a Source instance’s properties. Yes, I lose the historical record of reasons for why a Source instance should be created, but I don’t envision that data being useful, either now or in the future.

Getting back to our model, I’m going to store the contact. Hey, it might be nice to know who to contact if an issue were to arise in the future!

We also now have information on how the view should work for requesting a new source.

If a Source instance already exists, don’t re-create it or change its content; only send an email. In every case, the email should contain all submitted fields in its body.
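Those rules can be sketched without any web framework. In the hypothetical function below, a dict keyed by URL stands in for the Source table and `send_email` is an injected callable standing in for whatever mail backend is used; the `approved` flag and all parameter names are assumptions for the example, not part of the actual project.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Source:
    url: str
    contact_email: str
    approved: bool = False  # hypothetical flag: stays False until reviewed


def request_source(
    sources: Dict[str, Source],
    url: str,
    email: str,
    reason: str,
    send_email: Callable[[str, str], None],
) -> Source:
    """Handle a source request: create the Source only if it is new, always email."""
    source = sources.get(url)
    if source is None:
        # New source: store it unapproved so nothing is shown before review.
        source = Source(url=url, contact_email=email)
        sources[url] = source
    # Existing or not, the email carries every submitted field.
    body = f"URL: {url}\nContact: {email}\nReason: {reason}"
    send_email("New source request", body)
    return source
```

Note that the reason only ever lives in the email body, which matches the decision above not to keep a historical record of it.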

Phew, done with that one!

Fetch new content from known sources.

I’ve decided that blog sources should have content automatically pulled in. This implies a need to periodically run some logic to fetch data. The alternative would be to have the content creators ping the site to have the new post(s) fetched, but that goes against industry standards for search engines. Plus, they’re focused on creating their own stuff and I’d rather they keep doing that.

As soon as I say the phrase “periodically run”, I know it means background jobs. For now, I can probably get away with cron and management commands. That said, this approach means I now need to store information about the last time the source was checked for new content. That can be handled with some new fields on the Source model.

The field last_checked will be used to identify when a check was last performed on the source so the system can skip it until the next period. The following is the logic of the background task that will need to be created to fetch new content.

  1. Identify any Source instances that have not been checked in the last X hours
  2. Request posts from source URL
  3. Create new Post instances
  4. Update existing Post instances if needed
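The four steps above can be sketched in plain Python. This is only an outline under stated assumptions: `fetch` is a hypothetical callable that returns `(url, title)` pairs for a feed URL (the real task would request and parse the RSS feed), a dict keyed by post URL stands in for the Post table, and the `every` interval plays the role of “X hours”.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass
class Source:
    url: str
    last_checked: Optional[datetime] = None


@dataclass
class Post:
    url: str
    title: str


def due_sources(sources: Iterable[Source], now: datetime, every: timedelta) -> List[Source]:
    """Step 1: sources never checked, or not checked within the interval."""
    return [s for s in sources if s.last_checked is None or now - s.last_checked >= every]


def fetch_new_content(
    sources: Iterable[Source],
    posts: Dict[str, Post],
    fetch: Callable[[str], Iterable[Tuple[str, str]]],
    now: datetime,
    every: timedelta,
) -> None:
    for source in due_sources(sources, now, every):
        for url, title in fetch(source.url):      # Step 2: request posts
            existing = posts.get(url)
            if existing is None:
                posts[url] = Post(url=url, title=title)  # Step 3: create
            elif existing.title != title:
                existing.title = title                    # Step 4: update if needed
        source.last_checked = now  # so the source is skipped until the next period
```

In a Django project, the body of `fetch_new_content` would sit inside a management command that cron invokes on a schedule.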

Refetch content periodically to check staleness.

The main goal here is to not link to material that is no longer accessible. “Inaccessible” will be defined as a non-200 response after three tries over three days. This means the Content model will require additional fields.

The field active will be used to identify when a Content instance should no longer be displayed on the site. The field last_checked will be used to identify the last time a check was performed on the content so the system can skip it until the next period. However, this will only be useful when the content is valid. If the content isn’t valid, it should be checked sooner. Right now, it makes the most sense to have an override datetime field to identify the next time a check should be performed (next_check), regardless of last_checked.

The last new field is staleness_count. Frankly, naming is hard for me. This one will probably change in the future. The purpose of this field is to identify how many times the content was fetched but failed to return valid content.

The logic of the task should be:

  1. Identify any active content that has not been checked in the last X days, or any active content with a next_check in the past
  2. Request content from Content.url
  3. If content is valid, clear next_check, staleness_count, and update content fields as necessary
  4. If content is invalid, increment staleness_count and set next_check to a future datetime
  5. If staleness_count exceeds Y, set active = False
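The steps above map onto two small functions. The sketch below uses the field names from this section on a plain dataclass; the three constants are assumptions chosen to match “three tries over three days” (a limit of 2, so the third consecutive failure deactivates the content, and a one-day retry delay), and the 30-day regular interval is a placeholder for “X days”.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

STALENESS_LIMIT = 2               # "Y": exceeding it means three failed tries
RETRY_DELAY = timedelta(days=1)   # assumption: retry a failed check the next day
CHECK_EVERY = timedelta(days=30)  # assumption: placeholder for "X days"


@dataclass
class Content:
    url: str
    active: bool = True
    last_checked: Optional[datetime] = None
    next_check: Optional[datetime] = None
    staleness_count: int = 0


def is_due(content: Content, now: datetime) -> bool:
    """Step 1: active and either unchecked for X days or past its next_check."""
    if not content.active:
        return False
    unchecked = content.last_checked is None or now - content.last_checked >= CHECK_EVERY
    overdue = content.next_check is not None and content.next_check <= now
    return unchecked or overdue


def record_check(content: Content, status_code: int, now: datetime) -> None:
    """Steps 3 to 5: update the bookkeeping fields after one request."""
    content.last_checked = now
    if status_code == 200:
        content.next_check = None      # valid again: back to the regular schedule
        content.staleness_count = 0
        return
    content.staleness_count += 1
    content.next_check = now + RETRY_DELAY
    if content.staleness_count > STALENESS_LIMIT:
        content.active = False         # stop displaying it on the site
```

A successful response resets the counter, so only consecutive failures lead to deactivation in this sketch.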

Support creating reviews of content.

A goal for Django Cairn is to provide some guidance to other developers. This will be accomplished by providing commentary on what a reader can expect from a given piece of content. I’d also like to highlight particularly good content and the Djangonaut experience level that the content is meant for.

For now, I plan on being the only reviewer, but in the future this may open up to others. As far as I can tell, the initial draft of the model suffices for the given requirements. Since I will be the only person reviewing content, I’m making the executive decision to use the Django Administration site to manage those reviews. After all, the goal is to deliver functionality, not build the perfect web app.

Step 5: Ask what’s missing

The application now has sources, content, and reviews. It has a way to add new sources, fetch new content, and create reviews for the content. So what’s missing?

How will Djangonauts find content that’s relevant to them!?

Welp, that’s a pretty egregious oversight. Nonetheless, it’s pretty typical for the process of planning a project. My next step is to identify the key actions enabling users to find content that’s relevant to them. I’ve received some good ideas from the general community and I settled on the following:

Okay, almost there… Except not quite. Sorry!

The next part of the process is to go back to Step 2. “How do I want this to work?” This time, it will be in the context of the above features, but I need to be careful. While the more time spent planning and designing the application the better, there are diminishing returns. I could potentially spend an infinite amount of time stuck on this phase. And keeping that theme in mind, I’m going to skip a more detailed explanation in favor of concluding this blog post.

Step N-1: Cut scope.

Okay, almost there! I promise.

The last step is to re-review all the features and work that have been identified, then eliminate the excess. Every project is susceptible to feature creep, and it’s especially easy to let it slip in during the planning phase. All ideas sound great until you hit that 80% done mark and the last 80% of the project finally begins! That’s why it’s imperative to keep a tight leash on the project’s scope; it’s always better to ship something on time and add more later than to never ship at all.

That’s it. Thanks for following along to the bitter end. It means a lot to me. Have questions? Shoot me an email or reach out to me on the Fediverse.