Motivation and Concept

I recently began encountering carnivorous plants in the wild again, after a few years of not being actively engaged in the topic. Living in Oaxaca, I am finding Pinguicula specifically.

pinguicula.org has long been the go-to resource for information about members of the genus Pinguicula, as one might assume from the domain name. Well over a decade ago, Noah Elhardt and I contributed a few pages of field trip reports to the website, so it also has something of a personal connection.

Eric Partrat has done a wonderful job building and maintaining the website but hasn’t been able to keep up with it recently. While researching plants, I have found two things abundantly evident:

  1. There are no other equivalent data repositories online, at least for Pinguicula.
  2. The old version of pinguicula.org has become increasingly difficult to navigate, and much of the data is a bit antiquated.

I genuinely don’t know if anyone else uses or wants to use pinguicula.org. However, I do, so I contacted Eric to see if he would be open to collaboration in updating and modernizing the website.

Screenshot: A field trip page on the old website, written in Microsoft FrontPage. Lots of HTML tables, and no heading tags!

Tools

Here are the tools that I used to make the new website. I used them in part because I believe they are well suited to the project, but also because I use them elsewhere and find they produce solid results via a convenient process.

Page Format/Language

  • Markdown

Static Website Conversion / Content Management System

  • Jekyll

Jekyll Theme

  • Minimal Mistakes

Old HTML Conversion

  • Pandoc to convert the HTML to GitHub Flavored Markdown
  • cwebp for WebP image conversion
  • Atom editor for working with the Markdown and config files. I should really switch to Kate or similar since Atom is no longer maintained.

Hosting

  • Gitlab repository for version control
  • Netlify to build and deploy the website

Goals

The primary goals of converting pinguicula.org to Markdown on Jekyll via Minimal Mistakes are the following:

Improve On-Page Website Use

  • Make it easier to update. The current website is written in a version of Microsoft FrontPage from the early 2000s with, as far as I can tell, manually placed links between pages. It hasn’t been updated since 2010 - 12 years ago - which certainly indicates that it isn’t working well in that regard.
  • Make the pages internally accessible. The current website doesn’t have even the most basic on-page accessibility features like menus or headings, let alone less obvious things like image alt text. Immediate changes include the following:
    • Place all content within consistent categories.
    • Provide a main menu, present on all pages, that represents the main categories of content presented on the website.
    • Create submenus on some category pages where the complexity may benefit from the additional depth.
  • Enable functionality on phones and other non-desktop devices. The current code is not mobile-friendly, to put it mildly.
  • Make the presentation more visually appealing. I genuinely find the current website charming and would see no reason to intervene in its existence if it were still actively updated. However, the current combination of lack of updates and extremely period styling makes the data look much less relevant and authoritative than it truly is.

Improve Accessibility via Search Engines

  • Make the pages easy to find externally. The current website is relatively visible in specific searches but should generally rank higher for many keywords. It could also enjoy a broader reach. The following will help in that regard:
    • Hosting with full SSL support. In 2022, websites are definitely penalized in search results when they lack functioning SSL implementation.
    • Editing the existing content to conform with web standards.
      • Converting extant page delineators to headings (h2, h3, etc)
      • Restructure URLs (with redirects from the old URLs!) so that they have a logical structure that conforms with web standards. Current URLs are extremely variable and include characters like _ that are not ideal in an SEO context.
      • Convert all images to the WebP format. The images on the old website are presented as small JPEGs that are already quite compressed, but the size can be further reduced by converting to WebP without incurring any visible loss in quality. The process requires little effort.
      • Establish structure and practices for new image files so that they may be stored and presented in a manner that is more accessible. The current images have completely non-descriptive titles that often include characters that reduce search visibility, like _. They are also found in a mishmash of directories that span the entire website, often mixed in with HTML files.
        • Move all current images into one directory (/assets/images/old/), while maintaining the extant substructure as much as possible.
        • Create new directories that reflect use (and size) of images under /assets/images/. This will make the images much easier to replace with different resolutions or formats when necessary.
          • /assets/images/header/
          • /assets/images/feature-row/
          • /assets/images/post/

Enable Maintenance and Updates

  • Store the individual pages in a format that is portable and human readable.
    • Writing and editing content for the web in Markdown is ideal for me. It is a free and open format that is read and processed to HTML by many static and dynamic content management systems. It is also highly readable even without conversion to HTML.
    • Conversion from Markdown to anything else is supported by a wide variety of tools, should it be required.
  • Minimize hosting cost
    • Hosting/building/deploying static content via Gitlab, Github, and Netlify is all free at this scale. Likewise, should those companies all change their policies, hosting static content either via a VPS running a flavor of Git or simply receiving the static files would be extremely inexpensive and performant.
  • Not depending on a database or dynamic CMS. I have worked extensively with Drupal and Wordpress hosted on both VPS and shared hosting. Updates can work well or can turn into involved migration projects that require substantial manual intervention to regain basic functionality. Migration into those systems can also be fraught.
  • Enable collaboration
    • Building the website from an online distributed version control system makes it possible for others to directly collaborate in the page development and maintenance, should they be so inclined. My experience building, hosting and maintaining the Los Angeles Carnivorous Plant Society website points to this sort of collaboration being unlikely. Despite running the site for many years on a Drupal instance, I was not once able to convince anyone else to even log into the system, let alone directly edit pages. I am hosting the new pinguicula.org codebase on Gitlab, but that is largely for redundancy as well as my convenience.

Accommodate Philosophical Preferences

  • Use free and open-source software (FOSS) and formats. This can:
    • Reduce cost
    • Increase portability - open formats are consistently the easiest to translate, should the need arise.
    • Promote open, transparent models of development.
  • Avoid collection of visitor data. No one needs to be spied on by Google Analytics tracking or the like, and certainly not while visiting a plant website.
    • Netlify and their partners may spy on website viewers, but probably not appreciably more than any other third-party host. Hosting the website on a private server, which could enhance visitor privacy, is not economically feasible.

Before and After Screenshots

Here are a few screenshots of pages on the website before and after the migration.

As you can see in the individual pages, there are some funky HTML image tables that I have preserved from the old pages. I don’t have the inclination to replace the image files right now and their small physical size makes different display modalities fraught.

Desktop

Figure: Screenshots of the new and old homepages of the pinguicula.org website.

Figure: New and old category / collection pages.

Figure: New and old individual pages (a species page and a page with photos).

Mobile

Figure: Screenshots of the new and old homepages of the pinguicula.org website on mobile.

Figure: New and old category / collection pages on mobile.

Figure: New and old individual pages on mobile (a species page and a page with photos).

The Conversion Process

Below are my notes from the website migration. They are partial and mostly unedited; I wrote them for my own reference. If they are useful to you too, well, great!

Gathering the Materials

Scrape the entirety of the pinguicula.org website using wget, with the examples from the wget manpage for reference.

wget -prkH -Dpinguicula.org -l 0 -t 45 -nc --show-progress http://www.pinguicula.org/ -o gnulog

Some resources are coded into the pages using an old base URL that no longer resolves.

http://perso.club-internet.fr/cpartrat/Fernando/P_gigantea_Ayautla_16(HR).jpg

Those links need to be found, rewritten, and downloaded from the correct location.

http://www.pinguicula.org/Fernando/P_gigantea_Ayautla_16(HR).jpg
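
One way to do the rewriting is a sed substitution over the scraped pages. This is only a sketch: the function name fix_base_url is made up for illustration, and it assumes that every affected resource lives at the same path under www.pinguicula.org as it did under the old perso.club-internet.fr/cpartrat/ prefix, which is what the example pair above suggests.

```shell
# Hypothetical helper: rewrite the dead base URL in the given files.
# Assumes paths after the old prefix are unchanged on the new domain.
fix_base_url() {
  old='http://perso.club-internet.fr/cpartrat/'
  new='http://www.pinguicula.org/'
  # Escape the dots so they match literally instead of "any character".
  old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
  for f in "$@"; do
    # % is the delimiter so the slashes in the URLs need no escaping.
    # Writing to a temporary file avoids depending on GNU sed -i.
    sed "s%$old_re%$new%g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
  done
}
```

Called as fix_base_url *.htm in the directory that holds the scraped pages.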

Check for broken image links:

wget -l 0 -nv -t 1 -r -H -A jpg,jpeg,gif,png --spider http://www.pinguicula.org/ -o imagelog

Code conversion

After moving the image files to /assets/images/* and correcting the image links in the HTML files, it is time to convert the HTML to Markdown.

Modified from Converting HTML to markdown? #49:

pandoc --from html --to markdown doc.html -o doc.md

Use at your own risk: create convert.sh in the _posts directory and run it.

#!/bin/bash
# Convert every .htm file in the current directory to Markdown.
for f in *.htm
do
  echo "Processing $f file..."
  pandoc --from html --to markdown "$f" -o "${f%.*}.md"
done

Then just rm -f *.htm

This works but now I want to add YAML frontmatter to each markdown document.

Modified from Sed Insert Multiple Lines:

#!/bin/bash
FILES=*.md
for f in $FILES
do
sed -i -f - "$f" <<EOF
1i\\
--- \\
title: \\
header: \\
  image: \\
  teaser: \\
--- \\

EOF
done
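
A heredoc and cat can do the same job without sed’s line-continuation escapes. The sketch below is an alternative to the sed approach rather than the script used for the migration; add_front_matter is a hypothetical name, and the keys are the same empty placeholders as above.

```shell
# Hypothetical helper: prepend an empty Jekyll front matter block to each file.
add_front_matter() {
  for f in "$@"; do
    {
      # The quoted 'EOF' keeps the block literal (no variable expansion).
      cat <<'EOF'
---
title:
header:
  image:
  teaser:
---

EOF
      cat "$f"
    } > "$f.tmp" && mv "$f.tmp" "$f"
  done
}
```

Called as add_front_matter *.md. Because the output goes to a temporary file first, a failed run leaves the original page untouched.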

Some of the pages had weird character sets passed through Microsoft FrontPage header metadata. Pandoc wasn’t able to process pages with this string specifically:

<META http-equiv=Content-Type content="text/html; charset=windows-1252">

This certainly could have been resolved in bash in some way, but I did a batch find and replace action in the text editor.

Image conversion

The image files already sported small dimensions and relatively high compression, but some additional space could be saved by converting the .jpg files to the .webp format.

A bash file with the following did the trick.

for file in *; do cwebp -q 50 "$file" -o "${file%.*}.webp"; done

I am not sure how to make that operation recursive, so I ended up re-running it a few times, each time adding another directory level to the glob with another /*.
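
find can handle the recursion. A minimal sketch, assuming cwebp is on the PATH and that only files ending in .jpg or .jpeg (in any case) should be converted; the function name convert_jpgs is invented here.

```shell
# Hypothetical helper: convert every JPEG under a directory tree to WebP.
# $1 is the root directory (defaults to the current directory).
convert_jpgs() {
  find "${1:-.}" -type f \( -iname '*.jpg' -o -iname '*.jpeg' \) \
    -exec sh -c '
      for file in "$@"; do
        # Same quality setting as the one-liner above.
        cwebp -q 50 "$file" -o "${file%.*}.webp"
      done
    ' sh {} +
}
```

Called as convert_jpgs /path/to/assets/images. Each .webp file lands next to its source image, so the directory structure is preserved.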

Then recursively delete the .jpg files:

find . -type f -name '*.jpg' -delete

Although it is probably very premature, I have begun testing the results. For me this is easiest online, via a Gitlab repository that feeds a Netlify account that builds and hosts the website. The results are not great. There are way too many weird tables that are translated into Markdown in a nonfunctional manner, headings are simply passed to *** rather than ###, and there are other issues that require manual intervention.

At this point, it has become overwhelmingly obvious that the Markdown generated by the straight pandoc --from html --to markdown command is horribly littered with non-functional tables and other cruft that takes ages to toss out. pandoc --from html --to gfm seems like a better solution, but perhaps there are other options I can pass? Perhaps markdown_strict will be needed to toss extra things? Creating and populating YAML front matter automatically could also enhance the process.

Here is attempt #2, with a few extra options: --wrap=none discards column wrapping; --strip-comments discards HTML comments (which may not be necessary); --no-highlight may also not really apply but can’t hurt (I hope); and --ascii will hopefully kill some weird, unreadable Microsoft substitutions for accented and unusual characters.

--include-in-header inc.tex is also a more elegant way of inserting Jekyll YAML header data.

This script from this page gave me the idea of also converting _ in filenames to -.

Converting uppercase characters to lowercase in filenames is also appealing: tr '[:upper:]' '[:lower:]'. I prefer the look of this rename command though: rename -f 'y/A-Z/a-z/' *.
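
Where the Perl rename utility is not available, the tr approach can be wrapped in a loop. A sketch under that assumption: normalize_names is a hypothetical name, it only touches regular files directly inside one directory, and unlike rename -f it refuses to overwrite an existing file.

```shell
# Hypothetical helper: lowercase filenames and turn _ into - in one directory.
# $1 is the directory (defaults to the current directory).
normalize_names() {
  for f in "${1:-.}"/*; do
    [ -f "$f" ] || continue
    dir=$(dirname "$f")
    base=$(basename "$f")
    new=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]' | tr '_' '-')
    # Skip names that are already normalized or would clobber another file.
    if [ "$base" != "$new" ] && [ ! -e "$dir/$new" ]; then
      mv "$f" "$dir/$new"
    fi
  done
}
```

Links inside the pages still point at the old names after a rename like this, so the same substitutions have to be applied to the page content as well.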

Also, remove charset lines that interfere with pandoc: sed -i -- '/charset/'d * as here, here, and here.

After some difficulties, I discovered that the � character was blocking pandoc from working on some of the files, with the error that The input must be a UTF-8 encoded text, despite stripping weird Microsoft headers first. So sed -i -- 's/�/°/g' *.

Altogether:

#!/bin/bash

sed -i -- '/http-equiv/'d *
sed -i -- '/FrontPage/'d *
sed -i -- '/ProgId/'d *
sed -i -- '/charset/'d *
sed -i -- 's/&nbsp;/ /g' *
sed -i -- 's/�/°/g' *

# convert htm to markdown
for f in *.htm
do
  echo "Processing $f file..."
  pandoc --from html --to gfm --wrap=none --strip-comments --no-highlight --ascii --include-in-header yaml-header.tex "$f" -o "${f%.*}.md"
done

rename -f 'y/A-Z/a-z/' *
rename -f 'y/_/-/' *

At this point, the discovery of more weird HTML tags has led to a much larger cleanup file:

#!/bin/bash

sed -i -- 's/\.jpg/.webp/g' *
sed -i -- '/http-equiv/'d *
sed -i -- '/FrontPage/'d *
sed -i -- '/ProgId/'d *
sed -i -- '/charset/'d *
sed -i -- 's/&nbsp;/ /g' *
sed -i -- 's%\.\./images/%/assets/images/%g' *
sed -i -- 's%\.\./\.\./pages/plantes/%/plants/%g' *
sed -i -- 's%\.\./Fernando/%/assets/images/Fernando/%g' *
sed -i -- 's%</span>%%g' *
sed -i -- 's%<span class="underline">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-weight: bold">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-FAMILY: Arial">%%g' *
sed -i -- 's%<span style="TEXT-TRANSFORM: uppercase">%%g' *
sed -i -- 's%<span style="text-transform: uppercase">%%g' *
sed -i -- 's%<span lang="EN-GB">%%g' *
sed -i -- 's%<div data-align="center">%%g' *
sed -i -- 's%</div>%%g' *
sed -i -- 's%<div data-align="justify">%%g' *
sed -i -- 's%<span style="mso-ansi-language:EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-WEIGHT: bold; FONT-SIZE: 12pt; FONT-FAMILY: Arial">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-SIZE: 12pt; FONT-FAMILY: Arial">%%g' *
sed -i -- 's%<div data-align="left">%%g' *
sed -i -- 's%<span style="FONT-WEIGHT: bold">%%g' *
sed -i -- 's%<span style="font-size: 12pt">%%g' *
sed -i -- 's%<span style="font-family: Arial" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-ansi-language:EN-GB">%%g' *
sed -i -- 's%<span style="text-transform: capitalize">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-US">%%g' *
sed -i -- 's%<span style="mso-spacerun: yes; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="font-family: Arial; mso-ansi-language: EN-GB; mso-bidi-font-family: Times New Roman" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-US" style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span>%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-FAMILY: Arial; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-ansi-language: EN-GB; mso-bidi-font-weight: bold">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-ansi-language: EN-GB; font-family: Arial">%%g' *
sed -i -- 's%<span style="font-size: 12pt; font-family: Arial" lang="EN-GB">%%g' *
sed -i -- 's%<span style="mso-ansi-language: EN-GB" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="PT-BR" style="mso-ansi-language:PT-BR">%%g' *
sed -i -- 's%<span style="mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA; layout-grid-mode: line">%%g' *
sed -i -- 's%<span class="394421507-10072003">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-family:Arial;mso-ansi-language:EN-GB">%%g' *
sed -i -- 's%<span lang="EN-US">%%g' *
sed -i -- 's%<span style="font-size: 12.0pt; font-family: Arial">%%g' *
sed -i -- 's%<span style="font-family: Arial; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="FR">%%g' *
sed -i -- 's%> mso-fareast-language:FR;mso-bidi-language:AR-SA">%>%g' *
sed -i -- 's%<span style="font-family:Arial;mso-ansi-language:EN-GB">%%g' *
sed -i -- 's%<span class="meinStil">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-ansi-language: EN-GB; mso-fareast-font-family: Times New Roman; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-FAMILY: Arial; mso-bidi-font-family: &#39;Times New Roman&#39;; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-SIZE: 12pt; FONT-FAMILY: Arial; mso-ansi-language: EN-GB; mso-fareast-font-family: &#39;Times New Roman&#39;; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB; mso-bidi-font-family: Arial">%%g' *
sed -i -- 's%<span style="mso-bidi-font-style: normal; mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB; mso-bidi-font-family: Arial">%%g' *
sed -i -- 's%<span style="mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB; mso-bidi-font-family: Arial" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-size: 12pt; font-family: Arial">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-family: Arial; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span style="font-family: Arial; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB; mso-bidi-font-family: Arial; mso-fareast-font-family: Times New Roman; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span lang="EN-US" style="FONT-FAMILY: Arial; mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-US">%%g' *
sed -i -- 's%<span style="font-family: Arial; mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-US" lang="EN-US">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-FAMILY: Arial; mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Arial; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-US" style="mso-ansi-language: EN-US">%%g' *
sed -i -- 's%<span lang="EN-US" style="FONT-FAMILY: Arial; mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">%%g' *
sed -i -- 's%<span style="color:black">%%g' *
sed -i -- 's%<span style="mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">%%g' *
sed -i -- 's%<span style="font-size: 12.0pt; mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA" lang="EN-US">%%g' *
sed -i -- 's%<span style="font-size: 12.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt" lang="EN-US">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="mso-char-type: symbol; mso-symbol-font-family: Symbol; mso-ascii-font-family: Arial; mso-fareast-font-family: Times New Roman; mso-hansi-font-family: Arial; mso-bidi-font-family: Arial; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="mso-char-type: symbol; mso-symbol-font-family: Symbol; mso-ascii-font-family: Arial; mso-hansi-font-family: Arial; mso-bidi-font-family: Arial; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span style="mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-GB">%%g' *
sed -i -- 's%<span style="mso-spacerun: yes">%%g' *
sed -i -- 's%<span style="FONT-FAMILY: Arial">%%g' *
sed -i -- 's%<span lang="FR" style="mso-ansi-language: FR">%%g' *
sed -i -- 's%<span id="d26629">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-size: 10.0pt; mso-bidi-font-family: Times New Roman; color: black; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-size: 10.0pt; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-family: Arial">%%g' *
sed -i -- 's%<span lang="EN-GB" style="layout-grid-mode: line; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span style="font-family: Arial; mso-ansi-language: EN-GB" lang="EN-GB">%%g' *
sed -i -- 's%<span style="font-size: 12.0pt; font-family: Arial; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-GB">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; color: black; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-US" style="font-size: 12.0pt; mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span lang="EN-US" style="mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span lang="EN-US" style="FONT-FAMILY: Arial">%%g' *
sed -i -- 's%<span style="font-family: Arial">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-family: Arial; color: black">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-family: Arial; font-size: 12pt">%%g' *
sed -i -- 's%<span style="font-family:Arial">%%g' *
sed -i -- 's%<span lang="EN-GB" style="LAYOUT-GRID-MODE: line; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-font-family: Times New Roman; mso-fareast-language: FR; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">%%g' *
sed -i -- 's%<span lang="ES-TRAD" style="mso-fareast-font-family: Times New Roman; mso-ansi-language: ES-TRAD; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span class="postdetails1">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA; mso-bidi-font-weight: bold">%%g' *
sed -i -- 's%<span style="mso-bidi-font-size: 10.0pt; mso-bidi-font-family: Arial">%%g' *
sed -i -- 's%<span lang="EN-US" style="mso-bidi-font-size:10.0pt;font-family:Arial;mso-ansi-language:EN-US">%%g' *
sed -i -- 's%<span style="line-height: normal">%%g' *

Character Encoding Issues

Character encoding, specifically the � character, was still causing issues and was not being stripped out by other means, so I found a way to batch change to UTF-8 encoding. I ended up running file -i * in the directory, then manually sorting the files into directories for the ISO-8859-1 encoded files and the US-ASCII files.

#!/bin/bash
#enter input encoding here
FROM_ENCODING="ISO-8859-1"
#output encoding(UTF-8)
TO_ENCODING="UTF-8"
#convert
CONVERT=" iconv  -f   $FROM_ENCODING  -t   $TO_ENCODING"
#loop to convert multiple files
for  file  in  *.htm; do
     $CONVERT   "$file"   -o  "${file%.txt}"
done
exit 0

#!/bin/bash
#enter input encoding here
FROM_ENCODING="US-ASCII"
#output encoding(UTF-8)
TO_ENCODING="UTF-8"
#convert
CONVERT=" iconv  -f   $FROM_ENCODING  -t   $TO_ENCODING"
#loop to convert multiple files
for  file  in  *.htm; do
     $CONVERT   "$file"   -o  "${file%.txt}"
done
exit 0

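Edit: the above commands ended up truncating some file output. The likely cause is that "${file%.txt}" leaves a .htm filename unchanged, so iconv writes its output over the same file it is still reading. The below, modified from this page, writes to a temporary file first, works correctly, and looks a whole lot more elegant.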

#!/bin/bash
for file in *.htm
do
    iconv -f ISO-8859-1 -t UTF-8 -o "$file.new" "$file" &&
    mv -f "$file.new" "$file"
done

#!/bin/bash
for file in *.htm
do
    iconv -f US-ASCII -t UTF-8 -o "$file.new" "$file" &&
    mv -f "$file.new" "$file"
done
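
The manual sorting step can probably be skipped. US-ASCII is a subset of UTF-8, so pure ASCII files need no conversion at all, and iconv itself can tell whether a file is already valid UTF-8. The sketch below assumes that every file failing that check is in the single source encoding passed as the first argument; the FrontPage meta tag quoted earlier says windows-1252, so that may be the better choice than ISO-8859-1 for some pages. to_utf8 is a hypothetical name.

```shell
# Hypothetical helper: convert files to UTF-8 in place, skipping files
# that are already valid UTF-8 (which includes plain US-ASCII).
# $1 is the assumed source encoding for everything else.
to_utf8() {
  from=$1
  shift
  for file in "$@"; do
    # iconv exits non-zero if the input is not valid UTF-8.
    if iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1; then
      continue
    fi
    # Write to a temporary file first so the input is never truncated.
    iconv -f "$from" -t UTF-8 "$file" > "$file.new" &&
      mv -f "$file.new" "$file"
  done
}
```

Called as to_utf8 ISO-8859-1 *.htm. Running it twice is harmless, since converted files pass the validity check and are skipped.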

Here is the Markdown conversion script without the other things mixed in:

#!/bin/bash

# convert htm to markdown
for f in *.htm
do
  echo "Processing $f file..."
  pandoc --from html --to gfm --wrap=none --strip-comments --no-highlight --ascii --include-in-header yaml-header.tex "$f" -o "${f%.*}.md"
done

rename -f 'y/A-Z/a-z/' *
rename -f 'y/_/-/' *

Code Cleanup

Despite all that, I still ended up with weird code blocks like the below:

The flower of *Pinguicula heterophylla*<span style="font-size: 12.0pt; mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA" lang="EN-US">.<span style="font-size: 12.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt" lang="EN-US"> <span style="mso-fareast-font-family: Times New Roman; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt" lang="EN-US"><span style="mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"><span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">T<span style="mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">he corolla is whitish, with a yellow-greenish blot on the throat<span style="mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">.<span style="font-size: 12.0pt; mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA" lang="EN-US">

I mostly edited out those strings manually using the search and replace function in the Atom editor. There is probably a better method, but I didn’t find or think of it.
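
A single regular expression can stand in for most of the long sed list above, since nearly every one of those patterns is a span or div tag whose attributes don’t matter. A minimal sketch, with the assumptions that no span or div in the converted files is worth keeping and that each tag sits on one line (sed works line by line, so a tag broken across lines would be missed). strip_spans is a hypothetical name.

```shell
# Hypothetical helper: delete every opening and closing span/div tag,
# whatever attributes it carries, while keeping the text between them.
strip_spans() {
  for f in "$@"; do
    sed -E 's%</?(span|div)([[:space:]][^>]*)?>%%g' "$f" > "$f.tmp" &&
      mv "$f.tmp" "$f"
  done
}
```

Called as strip_spans *.md. Trying it on a copy of the files first seems wise, since the pattern is deliberately broad.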

Likewise, the resultant pages lacked headings, which should be of no surprise, as the original HTML doesn’t have them either. Again, lots of search and replace was employed to turn those into headings, mostly ## and ### in Markdown, which outputs as h2 and h3 in HTML.
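
Part of that search and replace could be scripted. The sketch below rests on a guess about the input: that a former heading shows up as a line consisting of nothing but text wrapped in ***, which is one reading of the *** output mentioned earlier. promote_headings is a hypothetical name, and every match becomes a level-2 heading, so h3 candidates would still need a manual pass.

```shell
# Hypothetical helper: turn lines that consist solely of ***text*** into
# level-2 Markdown headings. The ***text*** shape is an assumption.
promote_headings() {
  for f in "$@"; do
    sed -E 's/^\*{3}([^*]+)\*{3}[[:space:]]*$/## \1/' "$f" > "$f.tmp" &&
      mv "$f.tmp" "$f"
  done
}
```

Inline bold-italic text in the middle of a sentence is left alone, because the pattern is anchored to the start and end of the line.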

Perhaps the most important step was editing the page YAML front matter. These are the keys that I used on almost every page:

---
title:
excerpt:
redirect_from:
header:
  teaser:
---

The URL structure on the old website was inconsistent, at best, so all pages ended up with redirects. Most collection pages also have teaser images and excerpts so that they display well in archive pages (the excerpts also help with SEO). Titles and page URLs are sometimes generated directly by the collections function in Jekyll, as I formatted the filenames and directory structure with that in mind.

The linkchecker application may be familiar to everyone else, but I hadn’t used it before this occasion. Even after visually checking through the website, there were still hundreds of dead internal links. Hundreds.

The basic usage is simply linkchecker plus the domain - i.e., linkchecker https://www.pinguicula.org/. The --check-extern option is also helpful for auditing external links. The full command with that option is linkchecker --check-extern https://www.pinguicula.org/.