Building The Architecture Of An Article Page For BMJ Journals

In our previous articles, we delved into the collaborative journey of revamping BMJ's digital presence. We explored the initial decoupling of BMJ's website architecture and the subsequent migration of non-article content pages across their vast library of journals. We also detailed the process of optimising article pages for the BMJ Public Health journal, achieving a remarkable Lighthouse score of 100.

One of the main challenges in this process was finding a way to transform the HTML document we receive from our data source to fit the new design of the article page. Today, we’re taking a closer look at the architecture we built to minimise the performance impact of large pages.

Large articles inherently impact load times. Since page performance remains paramount, we needed a fast and lightweight solution. While we couldn’t control article size, we could optimise our system’s architecture.

Luckily, the introduction of Next.js’s App Router with React Server Components during this project came at the perfect time. Because Server Components render on the server and their JavaScript never ships to the browser, they kept our frontend bundle size to a minimum. This shift to a React-based solution proved crucial for maintaining peak performance.

Let’s look at the technical details of how we tackled this particular challenge.

HTML to React conversion

Several packages can parse the API-provided HTML into React components; we tested a few before finally landing on html-react-parser. We chose this package for two main reasons. First, while parsing and converting to React components, it lets us replace the default output for certain elements with custom components that add the functionality we needed. Second, under the hood html-react-parser uses htmlparser2 and domhandler, packages from the same ecosystem as cheerio and css-select, which we also used in our process.
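As a minimal sketch of that replacement mechanism, the replace option is where the swap happens; ArticleTable is a hypothetical stand-in for the custom components we actually inject.

```tsx
import parse, { Element, domToReact, type DOMNode, type HTMLReactParserOptions } from 'html-react-parser';

// Hypothetical custom component standing in for the real ones.
import { ArticleTable } from './components/ArticleTable';

const options: HTMLReactParserOptions = {
  replace: (domNode) => {
    // Swap every <table> from the source HTML for our own component.
    if (domNode instanceof Element && domNode.name === 'table') {
      return <ArticleTable>{domToReact(domNode.children as DOMNode[], options)}</ArticleTable>;
    }
    // Returning nothing keeps the parser's default conversion for everything else.
  },
};

export function ArticleBody({ html }: { html: string }) {
  return <>{parse(html, options)}</>;
}
```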

The last step was determining which HTML elements needed custom handling and which could be converted efficiently with readily available React components. html-react-parser has a simple API: for each DOM node it processes, it calls the functions the user provides. With this in mind, we used the Strategy Pattern and created distinct modules, each targeting specific elements. For every DOM node we run our strategies to check whether it is a node of interest and, if so, replace it with one of our custom React Server Components, or a Client Component when something more dynamic is needed, such as an accordion. Matching is done with CSS selectors via the css-select module, and for a few special strategies that need to process the data inside the matched node we use cheerio. Being part of the same ecosystem, these packages integrated seamlessly into our process and helped us deliver the page without major setbacks.
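Sketched under assumptions (the SupplementaryAccordion component and the selector are illustrative, not our real module names), a strategy boils down to a selector plus a replacer, and the parser options simply run the list:

```tsx
import { Element, domToReact, type DOMNode, type HTMLReactParserOptions } from 'html-react-parser';
import { is } from 'css-select';
import type { ReactElement } from 'react';

// Illustrative Client Component; the real strategies live in their own modules.
import { SupplementaryAccordion } from './components/SupplementaryAccordion';

interface NodeStrategy {
  selector: string;
  replace: (node: Element, options: HTMLReactParserOptions) => ReactElement;
}

const strategies: NodeStrategy[] = [
  {
    selector: 'section.supplementary-materials',
    replace: (node, options) => (
      <SupplementaryAccordion>
        {domToReact(node.children as DOMNode[], options)}
      </SupplementaryAccordion>
    ),
  },
  // ...one module per area of interest (figures, tables, references, and so on)
];

export const parserOptions: HTMLReactParserOptions = {
  replace: (domNode) => {
    if (!(domNode instanceof Element)) return;
    // css-select's `is` works directly on the domhandler nodes the parser hands us.
    const match = strategies.find((strategy) => is(domNode, strategy.selector));
    return match?.replace(domNode, parserOptions);
  },
};
```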

Splitting the processing into areas of interest also allowed us to profile the performance of the page and identify possible bottlenecks. In doing so, we could observe how specific features of the article page affect overall performance and how much time is spent parsing and replacing elements.

Finalized Function Execution Time Analysis
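One way to gather timings like the ones summarised above is to wrap each strategy’s replacer with a timer. The sketch below reuses the NodeStrategy shape from the previous snippet and is illustrative rather than our exact instrumentation.

```ts
import type { Element, HTMLReactParserOptions } from 'html-react-parser';
import type { NodeStrategy } from './strategies';

// Accumulated time per selector, so heavy strategies stand out after a full parse.
const timings = new Map<string, number>();

export function timedReplace(
  strategy: NodeStrategy,
  node: Element,
  options: HTMLReactParserOptions,
) {
  const start = performance.now();
  const result = strategy.replace(node, options);
  timings.set(
    strategy.selector,
    (timings.get(strategy.selector) ?? 0) + (performance.now() - start),
  );
  return result;
}
```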

In summary, our process relies on html-react-parser to go through the DOM nodes one by one; the nodes that match our CSS selectors are then processed and replaced with custom components. In the end, the whole processed tree is passed to a skeleton component, which is in turn passed to the Next.js App Router to do its magic.
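A rough sketch of how the pieces fit together in an App Router page; getArticleHtml and ArticleSkeleton are illustrative names, and parser-options refers to the strategies sketch above.

```tsx
import parse from 'html-react-parser';

// Illustrative module names; the real ones differ.
import { parserOptions } from './parser-options';
import { ArticleSkeleton } from './components/ArticleSkeleton';
import { getArticleHtml } from './lib/api';

// Rendered entirely as React Server Components unless a strategy opts into a Client Component.
export default async function ArticlePage({ params }: { params: { id: string } }) {
  const html = await getArticleHtml(params.id);
  return <ArticleSkeleton>{parse(html, parserOptions)}</ArticleSkeleton>;
}
```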

The result of all this is a Lighthouse performance score of 100 for pages that, in general, have a large DOM and include images, tables, PDFs, and other types of files.

Cloudflare Workers

Some of an article page’s features require us to create API endpoints, especially the ones around PDFs. The article PDFs and supplementary files are stored statically, but we needed to timestamp them at the moment of download.

For this, we created an API endpoint in Next.js that annotates the PDF with the date and time of download and its source. We then found that we had to add some redirects from the old routes to the new ones, and realised we would need that API endpoint for all the journals. We could share the code between the journals, but we would still need at least one file for each of the 64 journals, which would open up a possible maintenance problem in the long run.
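The shape of that first route handler was roughly the following. This is a sketch: the use of pdf-lib, the PDF_ORIGIN variable, and the route path are assumptions for illustration.

```ts
// app/api/pdf/[id]/route.ts (illustrative path)
import { NextRequest, NextResponse } from 'next/server';
import { PDFDocument, StandardFonts } from 'pdf-lib';

export async function GET(req: NextRequest, { params }: { params: { id: string } }) {
  // PDF_ORIGIN stands in for wherever the static PDFs are stored.
  const upstream = await fetch(`${process.env.PDF_ORIGIN}/${params.id}.pdf`);
  const pdf = await PDFDocument.load(await upstream.arrayBuffer());

  // Stamp the first page with the download time and the source of the download.
  const font = await pdf.embedFont(StandardFonts.Helvetica);
  pdf.getPages()[0].drawText(
    `Downloaded from ${req.nextUrl.host} on ${new Date().toUTCString()}`,
    { x: 40, y: 20, size: 8, font },
  );

  return new NextResponse(Buffer.from(await pdf.save()), {
    headers: { 'Content-Type': 'application/pdf' },
  });
}
```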

We were already using Cloudflare Workers to serve pages from both sources, the Next.js setup and the old Highwire setup, so we decided to implement the PDF download API endpoint as a Cloudflare Worker that could be parameterised to serve all the journals. Since Cloudflare Workers supports TypeScript, this was a matter of moving the code from the Next.js API endpoint we created into the new Worker and adjusting it to be agnostic of the journal it serves.
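A sketch of the resulting journal-agnostic Worker, in module syntax; the PDF_ORIGIN binding and the hostname-based journal lookup are illustrative.

```ts
import { PDFDocument, StandardFonts } from 'pdf-lib';

interface Env {
  PDF_ORIGIN: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    // One deployment serves all journals: derive the journal from the hostname.
    const journal = url.hostname.split('.')[0];

    const upstream = await fetch(`${env.PDF_ORIGIN}/${journal}${url.pathname}`);
    const pdf = await PDFDocument.load(await upstream.arrayBuffer());

    const font = await pdf.embedFont(StandardFonts.Helvetica);
    pdf.getPages()[0].drawText(
      `Downloaded from ${url.hostname} on ${new Date().toUTCString()}`,
      { x: 40, y: 20, size: 8, font },
    );

    return new Response(await pdf.save(), {
      headers: { 'Content-Type': 'application/pdf' },
    });
  },
};
```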

We also use Cloudflare Workers to hide the origin URLs of our static assets and to protect our data retrieval APIs from robots and crawlers.
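In its simplest form, such a Worker is a proxy that keeps the origin URL in an environment binding and turns away obvious crawlers. The sketch below is simplified; the ASSET_ORIGIN binding and the user-agent check are illustrative rather than our exact rules.

```ts
interface Env {
  ASSET_ORIGIN: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Turn away obvious robots and crawlers before touching the origin.
    const userAgent = request.headers.get('user-agent') ?? '';
    if (/bot|crawler|spider/i.test(userAgent)) {
      return new Response('Forbidden', { status: 403 });
    }

    // Proxy to the real origin, which clients never see directly.
    const url = new URL(request.url);
    return fetch(new Request(`${env.ASSET_ORIGIN}${url.pathname}${url.search}`, request));
  },
};
```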

Conclusion

At the beginning of the project, the goal of a page that is performant, accessible, search engine optimised, and abides by best practices seemed extremely ambitious. By splitting the work into smaller problems and taking the time to test hypotheses, we were able to move quickly and cleanly once we had decided which solution to apply.

At the time of writing this article, we are preparing to launch the article pages for the second journal on our list. This presents some unique challenges that we look forward to talking about in one of our next articles.
