Motivation and Concept
I recently began encountering carnivorous plants in the wild again, after a few years of not being actively engaged in the topic. Living in Oaxaca, I am finding Pinguicula specifically.
pinguicula.org has long been the go-to resource for information about members of the genus Pinguicula, as one might assume from the domain name. Well over a decade ago, Noah Elhardt and I contributed a few pages of field trip reports to the website, so it also has something of a personal connection.
Eric Partrat has done a wonderful job building and maintaining the website but hasn’t been able to keep up with it recently. While researching plants, two things have become abundantly evident:
- There are no other equivalent data repositories online, at least for Pinguicula
- The old version of pinguicula.org has become increasingly difficult to navigate and much of the data a bit antiquated.
I genuinely don’t know if anyone else uses or wants to use pinguicula.org. However, I do, so I contacted Eric to see if he would be open to collaboration in updating and modernizing the website.
A field trip page on the old website. Lots of HTML tables, and no heading tags!
Tools
Here are the tools that I used to make the new website. I used them in part because I believe they are well suited to the project, but also because I use them elsewhere and find they produce solid results via a convenient process.
Page Format/Language
- Markdown, specifically GitHub Flavored Markdown, with YAML data headers.
Static Website Conversion / Content Management System
- Jekyll
Jekyll Theme
- Minimal Mistakes
Old HTML Conversion
- Pandoc to convert the HTML to GitHub Flavored Markdown
- cwebp for WebP image conversion
- Atom editor for working with the Markdown and config files. I should really switch to Kate or similar since Atom is no longer maintained.
Hosting
- Netlify, building from a Gitlab repository
Goals
The primary goals of converting pinguicula.org to Markdown on Jekyll via Minimal Mistakes are the following:
Improve On-Page Website Use
- Make it easier to update. The current website is written in a version of Microsoft FrontPage from the early 2000s with, as far as I can tell, manually placed links between pages. It hasn’t been updated since 2010 - 12 years ago - which certainly indicates that it isn’t working well in that regard.
- Make the pages internally accessible. The current website doesn’t have even the most basic on-page accessibility features like menus or headings, let alone less obvious things like image alt text. Immediate changes include the following:
- Place all content within consistent categories.
- Provide a main menu, present on all pages, that represents the main categories of content presented on the website.
- Create submenus on some category pages where the complexity may benefit from the additional depth.
- Enable functionality on phones and other non-desktop devices. The current code is not mobile-friendly, to put it mildly.
- Make the presentation more visually appealing. I genuinely find the current website charming and would see no reason to intervene in its existence if it were still actively updated. However, the current combination of lack of updates and extremely period styling makes the data look much less relevant and authoritative than it truly is.
Improve Accessibility via Search Engines
- Make the pages easy to find externally. The current website is relatively visible in specific searches but should generally rank higher for many keywords. It could also enjoy a broader reach. The following will help in that regard:
- Hosting with full SSL support. In 2022, websites are definitely penalized in search results when they lack functioning SSL implementation.
- Editing the existing content to conform with web standards.
- Converting extant page delineators to headings (h2, h3, etc)
- Restructure URLs (with redirects from the old URLs!) so that they have a logical structure that conforms with web standards. Current URLs are extremely variable and include characters like _ that are not ideal in an SEO context.
- Convert all images to the WebP format. The images on the old website are presented as small JPEGs that are already quite compressed, but the size can be further reduced by converting to WebP without incurring any visible loss in quality. The process requires little effort.
- Establish structure and practices for new image files so that they may be stored and presented in a manner that is more accessible. The current images have completely non-descriptive titles that often include characters that reduce search visibility, like _. They are also found in a mishmash of directories that span the entire website, often mixed in with HTML files.
  - Move all current images into one directory (/assets/images/old/), while maintaining the extant substructure as much as possible.
  - Create new directories that reflect use (and size) of images under /assets/images/. This will make the images much easier to replace with different resolutions or formats when necessary.
    - /assets/images/header/
    - /assets/images/feature-row/
    - /assets/images/post/
Enable Maintenance and Updates
- Store the individual pages in a format that is portable and human readable.
- Writing and editing content for the web in Markdown is ideal for me. It is a free and open format that is read and processed to HTML by many static and dynamic content management systems. It is also highly readable even without conversion to HTML.
- Conversion from Markdown to anything else is supported by a wide variety of tools, should it be required.
- Minimize hosting cost
- Hosting/building/deploying static content via Gitlab, Github, and Netlify is all free at this scale. Likewise, should those companies all change their policies, hosting static content either via a VPS running a flavor of Git or simply receiving the static files would be extremely inexpensive and performant.
- Not depending on a database or dynamic CMS. I have worked extensively with Drupal and Wordpress hosted on both VPS and shared hosting. Updates can work well or can turn into involved migration projects that require substantial manual intervention to regain basic functionality. Migration into those systems can also be fraught.
- Enable collaboration
- Building the website from an online distributed version control system makes it possible for others to directly collaborate in the page development and maintenance, should they be so inclined. My experience building, hosting and maintaining the Los Angeles Carnivorous Plant Society website points to this sort of collaboration being unlikely. Despite running the site for many years on a Drupal instance, I was not once able to convince anyone else to even log into the system, let alone directly edit pages. I am hosting the new pinguicula.org codebase on Gitlab, but that is largely for redundancy as well as my convenience.
Accommodate Philosophical Preferences
- Use free and open-source software (FOSS) and formats. This can:
- Reduce cost
- Increase portability - open formats are consistently the easiest to translate, should the need arise.
- Promote open, transparent models of development.
- Avoid collection of visitor data. No one needs to be spied on by Google Analytics tracking or the like, and certainly not while visiting a plant website.
- Netlify and their partners may spy on website viewers, but probably not appreciably more than any other third-party host. Hosting the website on a private server, which could enhance visitor privacy, is not economically feasible.
Before and After Screenshots
Here are a few screenshots of pages on the website before and after the migration.
As you can see in the individual pages, there are some funky HTML image tables that I have preserved from the old pages. I don’t have the inclination to replace the image files right now and their small physical size makes different display modalities fraught.
Desktop










Mobile










The Conversion Process
The below are my notes from the website migration. They are partial, incomplete, and mostly unedited. I wrote them for my own reference. If they are useful to you too, well, great!
Gathering the Materials
Scrape the entirety of the pinguicula.org website with wget, using the examples from the wget manpage for reference.
wget -prkH -Dpinguicula.org -l 0 -t 45 -nc --show-progress http://www.pinguicula.org/ -o gnulog
Some resources are coded into the pages using an old base url that can’t be found.
http://perso.club-internet.fr/cpartrat/Fernando/P_gigantea_Ayautla_16(HR).jpg
Those links need to be found, converted, and downloaded using the correct links.
http://www.pinguicula.org/Fernando/P_gigantea_Ayautla_16(HR).jpg
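The find-and-convert step can be sketched with grep and sed. This is a minimal sketch, assuming GNU grep and GNU sed, and assuming that every dead link follows the same pattern as the single example above (the old base http://perso.club-internet.fr/cpartrat/ maps onto http://www.pinguicula.org/). The function name fix_old_base_urls is illustrative and not part of the original notes.

```shell
#!/bin/bash
# Rewrite the dead base URL to the live one in every scraped .htm page.
# Assumption: all old links share the base shown in the example above.
fix_old_base_urls() {
  local root="${1:-.}"
  # List only the pages that still mention the old host, then edit them in place.
  grep -rlF --include='*.htm' 'http://perso.club-internet.fr/cpartrat/' "$root" |
    while IFS= read -r page; do
      sed -i 's%http://perso\.club-internet\.fr/cpartrat/%http://www.pinguicula.org/%g' "$page"
    done
}
```

Calling fix_old_base_urls www.pinguicula.org would process the scraped copy; the corrected image URLs still have to be downloaded afterwards.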
Check for broken image links:
wget -l 0 -nv -t 1 -r -H -A jpg,jpeg,gif,png --spider http://www.pinguicula.org/ -o imagelog
Code conversion
After moving image files to /assets/images/* and correcting image links in html files, it is time to convert the html to markdown.
Modified from Converting HTML to markdown? #49:
pandoc --from html --to markdown doc.html -o doc.md
Use at your own risk: create convert.sh in the _posts directory and run it.
#!/bin/bash
# Convert every .htm file in the current directory to Markdown.
for f in *.htm
do
  echo "Processing $f file..."
  pandoc --from html --to markdown "$f" -o "${f%.*}.md"
done
Then just remove the converted originals:
rm -f -- *.htm
This works but now I want to add YAML frontmatter to each markdown document.
Modified from Sed Insert Multiple Lines:
#!/bin/bash
FILES=*.md
for f in $FILES
do
sed -i -f - "$f" <<EOF
1i\\
--- \\
title: \\
header: \\
image: \\
teaser: \\
--- \\
EOF
done
Some of the pages had weird character sets passed through Microsoft FrontPage header metadata. Pandoc wasn’t able to process pages with this string specifically:
<META http-equiv=Content-Type content="text/html; charset=windows-1252">
This certainly could have been resolved in bash in some way, but I did a batch find and replace action in the text editor.
Image conversion
The image files already sported small dimensions and relatively high compression, but some additional space could be saved by converting the .jpg files to the .webp format.
A bash file with the following did the trick.
for file in *; do cwebp -q 50 "$file" -o "${file%.*}.webp"; done
I am not sure how to make that operation recursive, so I ended up re-running it a few times, each time adding another directory level to the glob with another /*.
Then recursively delete the .jpg files:
find . -type f -name '*.jpg' -delete
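The repeated re-runs can probably be avoided by letting find walk the directory tree, the same way the delete step does. A minimal sketch, assuming a find that supports -iname (GNU and BSD do) and cwebp on the PATH. The function name convert_tree_to_webp and the CWEBP_BIN override are illustrative and not part of the original notes.

```shell
#!/bin/bash
# Convert every .jpg/.jpeg below a directory to WebP in a single pass.
# CWEBP_BIN is a hypothetical override for the converter; it defaults to cwebp.
convert_tree_to_webp() {
  local root="${1:-.}"
  CWEBP_BIN="${CWEBP_BIN:-cwebp}" find "$root" -type f \
    \( -iname '*.jpg' -o -iname '*.jpeg' \) \
    -exec sh -c 'for f in "$@"; do "$CWEBP_BIN" -q 50 "$f" -o "${f%.*}.webp"; done' sh {} +
}
```

Running convert_tree_to_webp assets/images would write a .webp next to each JPEG at any depth, with the same -q 50 setting as the one-liner above.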
Although probably very premature, I have begun testing the results. For me this is easiest online, via a Gitlab repository that feeds a Netlify account that builds and hosts the website. The results are not great. There are way too many weird tables that are translated into Markdown in a nonfunctional manner, headings are simply passed to *** rather than ###, and other issues require manual intervention.
At this point, it has become overwhelmingly obvious that the Markdown generated by the straight pandoc --from html --to markdown command is horribly littered with non-functional tables and other cruft that takes ages to toss out. pandoc --from html --to gfm seems like a better solution, but perhaps there are other options I can pass? Perhaps markdown_strict will be needed to toss extra things? Creating and populating YAML frontmatter automatically could also enhance the process.
Here is attempt #2, where column wrap is discarded via --wrap=none, HTML comments discarded via --strip-comments (which may not be necessary), --no-highlight, which may also not really apply but can’t hurt (I hope), and --ascii to hopefully kill some weird, unreadable Microsoft substitutions for accented and unusual characters. --include-in-header inc.tex is also a more elegant way of inserting Jekyll YAML header data.
This script from this page gave me the idea of also converting _ in filenames to -.
Converting uppercase characters to lowercase in filenames is also appealing: tr '[:upper:]' '[:lower:]'. I prefer the look of this rename command though: rename -f 'y/A-Z/a-z/' *.
Also, remove charset lines that interfere with pandoc: sed -i -- '/charset/'d * as here, here, and here.
After some difficulties, I discovered that the � character was blocking pandoc from working on some of the files, with the error that The input must be a UTF-8 encoded text, despite stripping weird Microsoft headers first. So sed -i -- 's/�/°/g' *.
Altogether:
#!/bin/bash
# Strip FrontPage/Office metadata lines that trip up pandoc.
# Limited to *.htm so the script does not edit itself or the YAML header file.
sed -i -- '/http-equiv/d' *.htm
sed -i -- '/FrontPage/d' *.htm
sed -i -- '/ProgId/d' *.htm
sed -i -- '/charset/d' *.htm
sed -i -- 's/ / /g' *.htm
sed -i -- 's/�/°/g' *.htm
# convert htm to markdown
# echo "pandoc -f htm -t gfm \"$line\" -o $the_filename.md"
for f in *.htm
do
  echo "Processing $f file..."
  pandoc --from html --to gfm --wrap=none --strip-comments --no-highlight --ascii --include-in-header yaml-header.tex "$f" -o "${f%.*}.md"
done
# Lowercase all filenames, then swap underscores for hyphens.
rename -f 'y/A-Z/a-z/' *
rename -f 'y/_/-/' *
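The two rename calls above use the Perl flavor of rename. Some distributions ship the util-linux rename instead, which takes different arguments, so a plain tr and mv loop may be a safer fallback. A minimal sketch, where the function name normalize_names is illustrative and not part of the original notes:

```shell
#!/bin/bash
# Lowercase each filename and replace underscores with hyphens,
# without depending on either variant of the rename command.
normalize_names() {
  local f dir base new
  for f in "$@"; do
    dir=$(dirname -- "$f")
    base=$(basename -- "$f")
    new=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]' | tr '_' '-')
    if [ "$base" != "$new" ]; then
      # -n refuses to overwrite a file that already has the target name.
      mv -n -- "$f" "$dir/$new"
    fi
  done
}
```

Running normalize_names * in the working directory covers the same files as the two rename lines. It is worth recording the old names first, since they are needed later for the redirects.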
At this point, the discovery of more weird HTML tags has led to a much larger cleanup file:
#!/bin/bash
sed -i -- 's/\.jpg/.webp/g' *
sed -i -- '/http-equiv/'d *
sed -i -- '/FrontPage/'d *
sed -i -- '/ProgId/'d *
sed -i -- '/charset/'d *
sed -i -- 's/ / /g' *
sed -i -- 's%\.\./images/%/assets/images/%g' *
sed -i -- 's%\.\./\.\./pages/plantes/%/plants/%g' *
sed -i -- 's%\.\./Fernando/%/assets/images/Fernando/%g' *
sed -i -- 's%</span>%%g' *
sed -i -- 's%<span class="underline">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-weight: bold">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-FAMILY: Arial">%%g' *
sed -i -- 's%<span style="TEXT-TRANSFORM: uppercase">%%g' *
sed -i -- 's%<span style="text-transform: uppercase">%%g' *
sed -i -- 's%<span lang="EN-GB">%%g' *
sed -i -- 's%<div data-align="center">%%g' *
sed -i -- 's%</div>%%g' *
sed -i -- 's%<div data-align="justify">%%g' *
sed -i -- 's%<span style="mso-ansi-language:EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-WEIGHT: bold; FONT-SIZE: 12pt; FONT-FAMILY: Arial">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-SIZE: 12pt; FONT-FAMILY: Arial">%%g' *
sed -i -- 's%<div data-align="left">%%g' *
sed -i -- 's%<span style="FONT-WEIGHT: bold">%%g' *
sed -i -- 's%<span style="font-size: 12pt">%%g' *
sed -i -- 's%<span style="font-family: Arial" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-ansi-language:EN-GB">%%g' *
sed -i -- 's%<span style="text-transform: capitalize">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-US">%%g' *
sed -i -- 's%<span style="mso-spacerun: yes; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="font-family: Arial; mso-ansi-language: EN-GB; mso-bidi-font-family: Times New Roman" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-US" style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span>%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-FAMILY: Arial; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-ansi-language: EN-GB; mso-bidi-font-weight: bold">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-ansi-language: EN-GB; font-family: Arial">%%g' *
sed -i -- 's%<span style="font-size: 12pt; font-family: Arial" lang="EN-GB">%%g' *
sed -i -- 's%<span style="mso-ansi-language: EN-GB" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="PT-BR" style="mso-ansi-language:PT-BR">%%g' *
sed -i -- 's%<span style="mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA; layout-grid-mode: line">%%g' *
sed -i -- 's%<span class="394421507-10072003">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-family:Arial;mso-ansi-language:EN-GB">%%g' *
sed -i -- 's%<span lang="EN-US">%%g' *
sed -i -- 's%<span style="font-size: 12.0pt; font-family: Arial">%%g' *
sed -i -- 's%<span style="font-family: Arial; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="FR">%%g' *
sed -i -- 's%> mso-fareast-language:FR;mso-bidi-language:AR-SA">%>%g' *
sed -i -- 's%<span style="font-family:Arial;mso-ansi-language:EN-GB">%%g' *
sed -i -- 's%<span class="meinStil">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-ansi-language: EN-GB; mso-fareast-font-family: Times New Roman; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-FAMILY: Arial; mso-bidi-font-family: '\''Times New Roman'\''; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-SIZE: 12pt; FONT-FAMILY: Arial; mso-ansi-language: EN-GB; mso-fareast-font-family: '\''Times New Roman'\''; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB; mso-bidi-font-family: Arial">%%g' *
sed -i -- 's%<span style="mso-bidi-font-style: normal; mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB; mso-bidi-font-family: Arial">%%g' *
sed -i -- 's%<span style="mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB; mso-bidi-font-family: Arial" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-size: 12pt; font-family: Arial">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-family: Arial; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span style="font-family: Arial; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB; mso-bidi-font-family: Arial; mso-fareast-font-family: Times New Roman; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span lang="EN-US" style="FONT-FAMILY: Arial; mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-US">%%g' *
sed -i -- 's%<span style="font-family: Arial; mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-US" lang="EN-US">%%g' *
sed -i -- 's%<span lang="EN-GB" style="FONT-FAMILY: Arial; mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Arial; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-US" style="mso-ansi-language: EN-US">%%g' *
sed -i -- 's%<span lang="EN-US" style="FONT-FAMILY: Arial; mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">%%g' *
sed -i -- 's%<span style="color:black">%%g' *
sed -i -- 's%<span style="mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">%%g' *
sed -i -- 's%<span style="font-size: 12.0pt; mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA" lang="EN-US">%%g' *
sed -i -- 's%<span style="font-size: 12.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt" lang="EN-US">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="mso-char-type: symbol; mso-symbol-font-family: Symbol; mso-ascii-font-family: Arial; mso-fareast-font-family: Times New Roman; mso-hansi-font-family: Arial; mso-bidi-font-family: Arial; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span style="mso-char-type: symbol; mso-symbol-font-family: Symbol; mso-ascii-font-family: Arial; mso-hansi-font-family: Arial; mso-bidi-font-family: Arial; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span style="mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-GB">%%g' *
sed -i -- 's%<span style="mso-spacerun: yes">%%g' *
sed -i -- 's%<span style="FONT-FAMILY: Arial">%%g' *
sed -i -- 's%<span lang="FR" style="mso-ansi-language: FR">%%g' *
sed -i -- 's%<span id="d26629">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-size: 10.0pt; mso-bidi-font-family: Times New Roman; color: black; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-bidi-font-size: 10.0pt; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-family: Arial">%%g' *
sed -i -- 's%<span lang="EN-GB" style="layout-grid-mode: line; mso-ansi-language: EN-GB">%%g' *
sed -i -- 's%<span style="font-family: Arial; mso-ansi-language: EN-GB" lang="EN-GB">%%g' *
sed -i -- 's%<span style="font-size: 12.0pt; font-family: Arial; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-GB">%%g' *
sed -i -- 's%<span style="mso-fareast-font-family: Times New Roman; color: black; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA" lang="EN-GB">%%g' *
sed -i -- 's%<span lang="EN-US" style="font-size: 12.0pt; mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span lang="EN-US" style="mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span lang="EN-US" style="FONT-FAMILY: Arial">%%g' *
sed -i -- 's%<span style="font-family: Arial">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-family: Arial; color: black">%%g' *
sed -i -- 's%<span lang="EN-GB" style="font-family: Arial; font-size: 12pt">%%g' *
sed -i -- 's%<span style="font-family:Arial">%%g' *
sed -i -- 's%<span lang="EN-GB" style="LAYOUT-GRID-MODE: line; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-font-family: Times New Roman; mso-fareast-language: FR; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">%%g' *
sed -i -- 's%<span lang="ES-TRAD" style="mso-fareast-font-family: Times New Roman; mso-ansi-language: ES-TRAD; mso-fareast-language: FR; mso-bidi-language: AR-SA">%%g' *
sed -i -- 's%<span class="postdetails1">%%g' *
sed -i -- 's%<span lang="EN-GB" style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA; mso-bidi-font-weight: bold">%%g' *
sed -i -- 's%<span style="mso-bidi-font-size: 10.0pt; mso-bidi-font-family: Arial">%%g' *
sed -i -- 's%<span lang="EN-US" style="mso-bidi-font-size:10.0pt;font-family:Arial;mso-ansi-language:EN-US">%%g' *
sed -i -- 's%<span style="line-height: normal">%%g' *
Character Encoding Issues
Character encoding, specifically the � character, was still causing issues and not being stripped out by other means, so I found a way to batch change to UTF-8 encoding. I ended up running file -i * in the directory, then manually sorting the files into directories for the ISO-8859-1 encoded files and the US-ASCII files.
#!/bin/bash
#enter input encoding here
FROM_ENCODING="ISO-8859-1"
#output encoding(UTF-8)
TO_ENCODING="UTF-8"
#convert
CONVERT=" iconv -f $FROM_ENCODING -t $TO_ENCODING"
#loop to convert multiple files
for file in *.htm; do
$CONVERT "$file" -o "${file%.txt}"
done
exit 0
#!/bin/bash
#enter input encoding here
FROM_ENCODING="US-ASCII"
#output encoding(UTF-8)
TO_ENCODING="UTF-8"
#convert
CONVERT=" iconv -f $FROM_ENCODING -t $TO_ENCODING"
#loop to convert multiple files
for file in *.htm; do
$CONVERT "$file" -o "${file%.txt}"
done
exit 0
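Edit: the above scripts ended up truncating some of the output files. The likely cause is that ${file%.txt} leaves a .htm filename unchanged, so iconv was writing each result over the file it was still reading. The below, modified from this page, writes to a temporary file first, works correctly, and looks a whole lot more elegant.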
#!/bin/bash
for file in *.htm
do
iconv -f ISO-8859-1 -t UTF-8 -o "$file.new" "$file" &&
mv -f "$file.new" "$file"
done
#!/bin/bash
for file in *.htm
do
iconv -f US-ASCII -t UTF-8 -o "$file.new" "$file" &&
mv -f "$file.new" "$file"
done
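The manual sorting into two directories can probably be skipped as well. A US-ASCII file is already valid UTF-8, so only the files that fail to decode as UTF-8 need converting. A minimal sketch, assuming every such file really is in the single legacy encoding passed in; the function name to_utf8 is illustrative and not part of the original notes.

```shell
#!/bin/bash
# Convert legacy-encoded files to UTF-8 in place.
# Usage: to_utf8 LEGACY_ENCODING FILE...
to_utf8() {
  local from="$1" file
  shift
  for file in "$@"; do
    # Files that already decode cleanly as UTF-8 (including plain ASCII) are left alone.
    if iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
      continue
    fi
    # Write to a temporary file first so the input is never truncated mid-read.
    iconv -f "$from" -t UTF-8 "$file" >"$file.new" && mv -f "$file.new" "$file"
  done
}
```

For this site the call would be to_utf8 ISO-8859-1 *.htm. Running it a second time changes nothing, because every file is valid UTF-8 by then.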
Here is the Markdown conversion script without the other things mixed in:
#!/bin/bash
# convert htm to markdown
# echo "pandoc -f htm -t gfm \"$line\" -o $the_filename.md"
for f in *.htm
do
  echo "Processing $f file..."
  pandoc --from html --to gfm --wrap=none --strip-comments --no-highlight --ascii --include-in-header yaml-header.tex "$f" -o "${f%.*}.md"
done
# Lowercase all filenames, then swap underscores for hyphens.
rename -f 'y/A-Z/a-z/' *
rename -f 'y/_/-/' *
Code Cleanup
Despite all that, I still ended up with weird code blocks like the below:
The flower of *Pinguicula heterophylla*<span style="font-size: 12.0pt; mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA" lang="EN-US">.<span style="font-size: 12.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt" lang="EN-US"> <span style="mso-fareast-font-family: Times New Roman; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt" lang="EN-US"><span style="mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"><span style="mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">T<span style="mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-GB; mso-fareast-language: FR; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">he corolla is whitish, with a yellow-greenish blot on the throat<span style="mso-fareast-font-family: Times New Roman; mso-bidi-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; mso-bidi-font-size: 10.0pt">.<span style="font-size: 12.0pt; mso-bidi-font-size: 10.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA" lang="EN-US">
I mostly edited out those strings manually using the search and replace function in the Atom editor. There is probably a better method, but I didn’t find or think of it.
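One possible better method is a single regular expression that drops every span and div tag regardless of its attributes, instead of one sed line per variant. A minimal sketch, assuming GNU sed and that each tag sits on one line (which --wrap=none should mostly guarantee); the function name strip_inline_markup is illustrative and not part of the original notes.

```shell
#!/bin/bash
# Remove every opening and closing <span> and <div> tag, whatever its attributes.
# The text between the tags is kept.
strip_inline_markup() {
  sed -E -i 's%</?(span|div)([[:space:]][^>]*)?>%%g' "$@"
}
```

Running strip_inline_markup *.md would replace most of the long cleanup list above. It is blunt, so any span or div worth keeping would need to be excluded by hand.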
Likewise, the resultant pages lacked headings, which should be of no surprise, as the original HTML doesn’t have them either. Again, lots of search and replace was employed to turn those into headings, mostly ## and ### in Markdown, which output as h2 and h3 in HTML.
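Part of that search and replace could be scripted. The sketch below rests on an assumption that will not hold for every page: that a line consisting only of bold or bold-italic text was meant as a section title. It turns such lines into ## headings with GNU sed; the function name promote_bold_lines is illustrative and not part of the original notes.

```shell
#!/bin/bash
# Turn lines that contain nothing but **bold** or ***bold-italic*** text into h2 headings.
# Bold text inside a longer sentence is left untouched.
promote_bold_lines() {
  sed -E -i 's/^\*{2,3}([^*]+)\*{2,3}[[:space:]]*$/## \1/' "$@"
}
```

The output still needs a manual pass to demote some of the new headings to ###.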
Perhaps the most important step was in editing the page YAML. These are the headings that I used on almost every page:
---
title:
excerpt:
redirect_from:
header:
  teaser:
---
The URL structure on the old website was inconsistent, at best, so all pages ended up with redirects. Most collection pages also have teaser images so that they display well in archive pages, as well as excerpts for the same reason (and for SEO). Titles and page URLs are sometimes generated directly by the collections function in Jekyll, as I formatted the filenames and directory structure with that in mind.
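Since every page needs a redirect, the title and redirect_from lines could be filled in by script rather than by hand. A minimal sketch, assuming the old URL path of each page is known and passed in; the function name add_front_matter and the example path are hypothetical. The redirect_from key is the one read by the jekyll-redirect-from plugin.

```shell
#!/bin/bash
# Prepend YAML front matter with a title guessed from the filename
# and a redirect from the page's old URL path.
# Usage: add_front_matter PAGE.md OLD_URL_PATH
add_front_matter() {
  local page="$1" old_path="$2" title
  # Guess a title: drop the extension, turn hyphens and underscores into spaces.
  title=$(basename -- "$page" .md | tr '_-' '  ')
  {
    printf '%s\n' '---'
    printf 'title: "%s"\n' "$title"
    printf 'redirect_from:\n  - %s\n' "$old_path"
    printf '%s\n' '---'
    cat -- "$page"
  } >"$page.new" && mv -f -- "$page.new" "$page"
}
```

For example, add_front_matter p-gigantea.md /P_gigantea.htm would add a title of "p gigantea" and one redirect. The guessed titles are rough and still need editing, and excerpts and teaser images remain manual.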
Fixing Broken Links
The linkchecker application may be familiar to everyone else, but I hadn’t used it before this occasion. Even after visually checking through the website, there were still hundreds of dead internal links. Hundreds.
The basic usage is simply linkchecker plus the domain, i.e., linkchecker https://www.pinguicula.org/. The --check-extern option is also helpful for auditing external links. The full command with that option is linkchecker --check-extern https://www.pinguicula.org/.