Construction Finished

After eight months, my blog has finally reached a place where I feel comfortable taking down the "under heavy construction" notice on my home page. Instead of outright deleting the site road map though, I'm stashing it into a blog post.

Site Road Map

  • Find new hosting location. Currently using DigitalOcean.
  • ☑ Install Arch Linux on server.
  • Search for WP replacement. Hugo is pretty good.
  • Find a suitable theme. Currently using hugo-xmin, may consider forking it and writing my own (soresu).
  • ☑ Server side config, like post-receive for git auto deploy.
  • ☑ Language switcher that does more than redirecting to home page.
  • ☑ Enable Disqus.
  • ☑ Support \(\LaTeX\) expressions via KaTeX (originally MathJax).
  • ☑ Copy-paste fixed page contents from old site (and translate them).
  • ☑ Enable https.
  • ☑ Backup old WP site.
  • ☑ Transfer domain to Google Domains and ensure DNS works as intended.
  • ☑ Find out how to write with org-mode or R markdown.
  • ☑ Configure multilingual support, including footer text, title, etc.
  • ☑ Find out how to make emacs work with fcitx.
  • ☑ Use Oxygen Sans (originally Google's Noto Sans font) and Iosevka (originally Source Code Pro) for code.
  • ☑ Find a suitable icon/favicon.
  • ☑ Improve templates for posts to display tags and categories.
  • ☑ Cosmetic changes, e.g. no underlines for hyperlinks.
  • ☑ Deal with some nuances in using org-mode with hugo, like how to get syntax highlighting to work properly.
  • ☑ Host my own email.
  • ☑ Customize hugo new to make it more useful, e.g. create multilingual versions directly.
  • ☑ Self-host commenting system as a replacement for Disqus.
  • ☑ Use Let's Encrypt's wildcard certificate.
  • ☑ Restore/rewrite and translate some of the more valuable old posts.

What's on Home Page Now?

I already have an about page and a contact page for whatever I think people might be interested in knowing about myself, so I have no clue what I should put on the home page. Since I found the old site road map to be a great way of reminding myself of the stuff I need to get done, I'll replace the road map with another to-do list: my goals for 2018. I am definitely not the most motivated kind of person, but seeing an unfinished to-do list every once in a while does get on my nerves. Let's see how well this is gonna work.

Fun With Fonts in Emacs

I finally took some time to look at my font configuration in Emacs and cleaned it up as much as possible. This dive into the rabbit hole has been tiring yet fruitful, revealing caveats of typesetting that I didn't know before, especially for CJK characters.

I primarily use Emacs by running a daemon and connecting to it via a graphical emacsclient frame, and I am attempting to tackle three major problems: I don't have granular control over font mapping, glyph widths are sometimes inconsistent with character widths, and emoji show up as weird blocks. Terminal Emacs doesn't suffer as much from these problems, yet I don't want to give up the nice perks like system clipboard access and greater key binding options, so here goes nothing.

Font Fallback Using Fontsets

Ideally, I want to specify two sets of fonts, a default monospace font and a CJK-specific font. Here's how I originally specified the font in Emacs:

(setq default-frame-alist '((font . "Iosevka-13")))

The method above obviously leaves no room for fallback fonts. However, it turns out I can specify the font to be a fontset instead of an individual font. According to the Emacs Manual, a fontset is essentially a mapping from Unicode ranges to a font or hierarchy of fonts, and I can modify one with relative ease.

Sounds like an easy job now? Not so fast. I don't really know which fontset to modify: fontset behavior is quirky in that the fontset Emacs ends up using seems to differ between emacsclient and normal emacs, between terminal and graphical frames, and even between different locales. While there is a way to get the current active fontset ((frame-parameter nil 'font)), this method is unreliable and may cause errors like this one.

After all kinds of attempts and DuckDuckGoing (that really rolled right off the tongue, and no, I am not the first one), I finally found the answer: just define a new fontset instead of modifying existing ones.

(defvar user/standard-fontset
  (create-fontset-from-fontset-spec standard-fontset-spec)
  "Standard fontset for user.")

;; Ensure user/standard-fontset gets used for new frames.
(add-to-list 'default-frame-alist (cons 'font user/standard-fontset))
(add-to-list 'initial-frame-alist (cons 'font user/standard-fontset))

I won't bore you with the exact logic just yet, as I also made other changes to the fontset.

Displaying Emoji

The solution to emoji display is similar (just specify a fallback font with emoji support), or so I thought. I tried to use Noto Color Emoji as my emoji font, only to find that Emacs does not yet support colored emoji fonts. Emacs used to support colored emoji on macOS, but this functionality was later removed.

I ended up using Symbola as my emoji fallback font (actually I used it as a fallback for all Unicode characters), which provides comprehensive coverage of all the emoji and special characters. Also note that since Emacs 25, customizing the symbols charset, which contains punctuation marks, emoji, etc., requires some extra work:

(setq use-default-font-for-symbols nil)

There does exist a workaround for colored emoji though, not with fancy fonts, but by replacing Unicode characters with images. emacs-emojify is a package that provides this functionality. I ultimately decided against it as it does slow down Emacs quite noticeably and the colored emoji image library is not as comprehensive.

Quotation Marks

I've always used full-width directional curly quotation marks ("“”" and "‘’") when typing in Chinese, and ASCII style ambidextrous straight quotation marks (""" and "'") when typing in English. Little did I know there really is no such thing as full-width curly quotation marks: there is only one set of curly quotation mark codepoints in Unicode (U+2018, U+2019, U+201C, and U+201D) and the difference between alleged full-width and half-width curly quotation marks is caused solely by fonts. There have been proposals to standardize the two distinct representations, but for now I'm stuck with this ambiguous mess.
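A minimal sketch of how to see this ambiguity, assuming Python 3 and its standard unicodedata module: Unicode's East Asian Width property puts all four curly quotation marks in the "A" (ambiguous) class, which leaves their display width up to the font or context. The two comparison characters are picked purely for illustration.

```python
import unicodedata


def width_class(ch):
    """Return the East Asian Width class of a single character."""
    return unicodedata.east_asian_width(ch)


# The four curly quotation mark codepoints mentioned above.
curly_quotes = "\u2018\u2019\u201c\u201d"
quote_classes = {f"U+{ord(ch):04X}": width_class(ch) for ch in curly_quotes}

for codepoint, cls in quote_classes.items():
    print(codepoint, cls)

# Comparison characters (illustrative only): an ASCII straight quote and a
# CJK ideograph, which do have unambiguous width classes.
print('"', width_class('"'))
print("云", width_class("云"))
```

Here "Na" means narrow and "W" means wide, so only the curly quotation marks are left for the font to decide.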

It came as no surprise that these curly quotation marks are listed under the symbols charset instead of a CJK one, and thus use the normal monospace font despite the fact that I want them to show up as full-width characters. I don't have a true solution for this. Being consistent is the only thing I can do, so I forced curly quotation marks to display as full-width characters by overriding these exact Unicode codepoints in my fontset. I'm not really sure how I felt when I then realized that ASCII style quotation marks also suffer from confusion; maybe we are just really bad at this.

My fallback font configurations can be found on both GitHub and Trantor Holocron, and I'll list them here just for the sake of completeness:

(defvar user/cjk-font "Noto Sans CJK SC"
  "Default font for CJK characters.")

(defvar user/latin-font "Iosevka Term"
  "Default font for Latin characters.")

(defvar user/unicode-font "Symbola"
  "Default font for Unicode characters, including emojis.")

(defvar user/font-size 17
  "Default font size in px.")

(defun user/set-font ()
  "Set Unicode, Latin and CJK font for user/standard-fontset."
  ;; Unicode font.
  (set-fontset-font user/standard-fontset 'unicode
                    (font-spec :family user/unicode-font)
                    nil 'prepend)
  ;; Latin font.
  ;; Only specify size here to allow text-scale-adjust to work on other fonts.
  (set-fontset-font user/standard-fontset 'latin
                    (font-spec :family user/latin-font :size user/font-size)
                    nil 'prepend)
  ;; CJK font.
  (dolist (charset '(kana han cjk-misc hangul kanbun bopomofo))
    (set-fontset-font user/standard-fontset charset
                      (font-spec :family user/cjk-font)
                      nil 'prepend))
  ;; Special settings for certain CJK punctuation marks.
  ;; These are full-width characters but by default use half-width glyphs.
  (dolist (charset '((#x2018 . #x2019)    ;; Curly single quotes "‘’"
                     (#x201c . #x201d)))  ;; Curly double quotes "“”"
    (set-fontset-font user/standard-fontset charset
                      (font-spec :family user/cjk-font)
                      nil 'prepend)))

;; Apply changes.
(user/set-font)
;; For emacsclient.
(add-hook 'before-make-frame-hook #'user/set-font)

CJK Font Scaling

My other gripe is that the width of CJK fonts does not always match up with that of the monospace font. Theoretically, full-width CJK characters should be exactly twice as wide as half-width characters, but this is not the case, at least not at all font sizes. It seems that CJK fonts provide less granularity in size, e.g. the 16px and 17px versions of CJK characters in Noto Sans CJK SC are exactly the same, and the size does not increase until it is bumped up to 18px, while Latin characters always display the expected size increase. This discrepancy means the sizes match every couple of sizes, but differ in between, with CJK fonts being a bit too small.

One solution is to specify a slightly larger default size for CJK fonts in the fontset. However, this method would render text-scale-adjust (normally bound to C-x C-= and C-x C--) ineffective against CJK fonts for some reason. A better way that preserves this functionality is to scale the CJK fonts up by customizing face-font-rescale-alist:

(defvar user/cjk-font "Noto Sans CJK SC"
  "Default font for CJK characters.")

(defvar user/font-size 17
  "Default font size in px.")

(defvar user/cjk-font-scale
  '((16 . 1.0)
    (17 . 1.1)
    (18 . 1.0))
  "Scaling factor to use for cjk font of given size.")

;; Specify scaling factor for CJK font.
;; Fall back to 1.0 (no scaling) for sizes missing from user/cjk-font-scale.
(setq face-font-rescale-alist
      (list (cons user/cjk-font
                  (or (cdr (assoc user/font-size user/cjk-font-scale))
                      1.0))))

While the font sizes might still go out of sync after text-scale-adjust, I am not too bothered. The exact scaling factor took me some trial and error to find out. I just kept adjusting the factor until these lined up (I found this table really useful):

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
云云云云云云云云云云云云云云云云云云云云云云云云云云云云云云云云云云云云云云云云
雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲雲
ㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞㄞ
ああああああああああああああああああああああああああああああああああああああああ
가가가가가가가가가가가가가가가가가가가가가가가가가가가가가가가가가가가가가가가가

Unfortunately, the CJK font I used has narrower Hangul than other full-width CJK characters, so this is still not perfect—the solution would be to specify a Hangul specific font and scaling factor—but good enough for me.

It took me quite some effort to fix what may seem like a minor annoyance, but at least Emacs did offer the appropriate tools. By the way, I certainly wish I had found this article on Emacs Wiki sooner, as it also provides a neat write up of similar workarounds.

2018 in Review

Before anything, happy New Year!

It's an interesting feeling when the time span of one year gradually becomes shorter relative to the time that has already passed in one's life. If only the actual length of one year also scaled with one's age, perhaps we would feel more excitement instead of anxiety during the New Year countdown. That being said, 2018 was a lot of fun for me, even without ray-tracing graphics cards.

The Amazing 2018

To quote my 2017 self:

If I've learned anything from my past failed plans, it would be to always underestimate my own capabilities when planning...

Yeah, it's totally just that my estimates about the amount of free time I would have were off, as can be seen from the status of my 2018 goals.

  • ☒ Run 1000 miles. [405/1000]
  • ☒ Finish a marathon.
  • ☒ Write 20 blog posts. [10/20]
  • ☒ Get the first signature for my PGP key.
  • ☒ Install Gentoo.

Knowing that I can always change the 'publish date' of blog entries (thanks to hugo), I grew into the bad habit of starting an article and then just shelving it for months to come. When I finally remember that one unfinished article, I frequently dismiss the idea as not really worth elaborating. Now that I think about it, maybe this is exactly what blogs are for: providing a snapshot of myself that I can look back on later, whether or not my future self finds it silly or 'not really worth elaborating'.

The number of movie theater visits I had in 2018 probably accounts for 50% of my lifetime total, with a double dose of disappointment from Star Wars: The Last Jedi and Incredibles 2. By the way, 2018 also saw 90% of my lifetime popcorn consumption. I never realized it could be so addictive.

Although not a marathon, I did run my first trail half marathon in May. It was the first time I've ever hit the wall while running, due to bad pacing and unpreparedness for the weather. The race started mid-afternoon on a scorchingly hot day. After witnessing quite a few people stop to walk in the first 2 miles, I started off quite a bit faster than my intended pace, fueled by a stupid sense of superiority, and hit the wall right at the 4-mile mark. Fortunately the feeling faded away as I walked the next half of the race, gulping ice-cold Gatorade at every hydration point. However, the ice-cold Gatorade was another trap: the temperature dropped rapidly as the sun started to set, and my stomach started to complain about all the chilly liquid. As the finish line came into sight less than 400 meters away, my legs were hit by the strongest cramps I've ever had. After barely making it through while being passed by 3 people right before the finish line, I could only be happy to learn that I was still not the last one: actually, I was even the first one in my age group (whose size was one). The somewhat illegitimate feeling of compliment, mixed with a bit of salt and guilt, made the race a wondrous experience.

The Spectacular 2019

Since Google is deprecating Inbox this coming March, I've lost my last excuse for clinging to Gmail. I'll try to gradually phase out my Gmail usage in favor of my own email server.

On the front of searching for the best solution for blog comments, quite a few bloggers I follow have started embracing IndieWeb and Webmention. In a lot of ways, Webmention is the exact thing I wanted: federated blog comments, posts, and more. Yet I'm reluctant to move further away from a static site, not to mention that most easy-to-follow Webmention solutions I have found rely heavily on third-party services. The IndieWeb movement itself though is fairly intriguing. I've never had much use for Keybase aside from it being a hub linking most of my online presences (decryption and encryption do not work without uploading PGP private keys, and I have no one to securely chat with), so perhaps I should just replace it with rel=me links.

Diving into C++17 was fairly enjoyable during the past year, so I'm looking into learning other new programming languages. Rust and Julia have been on my radar for a while, especially Rust. Having a full suite of officially supported tools makes writing Rust a smooth and deeply satisfying experience. I'll try to dive deeper into both languages and hopefully put them to some use.

As for running and blog posts, I'll try to match 2018's numbers. On top of those, I'm thinking about keeping a record of the books, music, and shows I've read/listened to/watched on this blog, along with my thoughts. I actually attempted something similar during this blog's WordPress days: I once set up a MediaWiki instance for similar purposes, but lacked the motivation to continue maintaining the entries. I'll keep it simple this time, and I should come up with a rating system.

What should I do with the remaining 2018 goals? A separate wishlist is a pretty good idea—let's go with that. As a stretch goal, I should probably clean my desktop computer, which is stuffed with four-year-old dirt, cat hair, and dead skin cells.

Here's to another spectacular 2.9e+17 radiation periods of Caesium-133!
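For the curious, the figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it relies on the SI definition of the second (9,192,631,770 periods of the caesium-133 hyperfine transition radiation) and assumes a plain 365-day year for 2019.

```python
# SI definition: one second lasts 9,192,631,770 periods of the radiation
# from the caesium-133 hyperfine transition.
PERIODS_PER_SECOND = 9_192_631_770
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_IN_2019 = 365  # assumption: 2019 is not a leap year

periods_in_2019 = PERIODS_PER_SECOND * SECONDS_PER_DAY * DAYS_IN_2019
print(f"{periods_in_2019:.1e}")
```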

Installing Gentoo

I finally bit the bullet and installed Gentoo on VirtualBox (totally not motivated by the front page wishlist), thereby achieving my ultimate digital @5c3n510n (or descent according to DistroWatch).

Jokes aside, the installation process is surprisingly pleasant: the Gentoo Handbook is wonderfully written, and seems to have a plan for everything that might go wrong. I like the Handbook more than ArchWiki's Installation Guide as it also details the rationale behind each step I took, which is often a fun read in its own right. I would go as far as saying the Gentoo Handbook is actually more beginner friendly, as it carefully assembles bits of information that are normally scattered all over the place, providing a great starting point for learning how to tame the operating system. Besides, Gentoo Handbook covers more than installation: it also contains other necessary setup processes to set up a usable system. I will be gradually replicating my current desktop setup to decide if a migration is worth the time.

My very first encounter with GNU/Linux operating systems was Ubuntu 12.04: one of my classmates (vacuuny/A2Clef) was installing it in the school's computer labs. There was a time when I would switch between various Ubuntu variants every few days. I dual booted Windows and Ubuntu for a while before switching entirely to Ubuntu in 2014. Much annoyed by the Amazon ads, I tried out Arch Linux as part of my New Year's resolution in 2015. Even with a second computer to look up instructions, it still took me quite a while to adapt to the new system. I ranted "maybe I still haven't gotten the Arch way" in my old blog, but never looked back once I got the knack of it.

I still try out other distributions from time to time in VirtualBox, but never find them to offer much improvement over Arch beyond the setup process, and even more so when considering the excellent documentation on ArchWiki (well, now we have a contender). Once I have my desktop environment set up, the experience between distributions is not that different, but the distinctions kick in when problems occur and I search online for troubleshooting tips. Having more up-to-date packages is another charm Arch has. More recently, the systemd controversy caused me to start shopping around for a new distribution to try out, not so much because of the actual security concerns, but just to see what it is like to use a different init system: my time in Ubuntu was spent mostly in GUIs (apt-get and nano were probably the only commands I knew for the longest time) without knowing about init systems, and Arch was already using systemd when I switched. Aside from Gentoo, the candidates include Void Linux and the BSDs. Void Linux was easy to set up with its installer wizard, yet I didn't feel compelled to move to it. Let's see if Gentoo will change my mind.

Trackpad and Swollen Batteries

For the last few weeks, the right click on my Dell XPS 13's trackpad has been getting less responsive: the entire right half of the trackpad sank around 2mm beneath the palm rest, making clicks hard to register. At first I dismissed it as normal wear, but it turned out that the swollen batteries had lifted the left half of the trackpad to such a degree that the trackpad warped. I immediately ordered an OEM replacement (Dell JD25G) and swapped out the swollen batteries. The XPS 13 (9343) was a breeze to service. The screws that hold the bottom panel (a quite hefty hunk of aluminum) in place are all clearly visible and the component layout allows the battery to be swapped with minimal disassembly. I also swapped out the WLAN card (Dell DW1560) for an Intel AC9560, whose drivers are in the mainline Linux kernel.

The trackpad felt normal after the battery swap, of course. However, the fact that the average laptop battery starts to degrade around 18 months surprised me quite a bit. Mine lasting nearly four years is probably quite decent. Newer laptops use prismatic cells (those slab shaped batteries also found in phones) instead of cylindrical ones, as can be found in my first laptop, a Dell Vostro 3750. Roughly speaking, prismatic cells trade lifespan for size by omitting the external casing and gas vents found on cylindrical cells. The battery swell is caused by gas buildup, which might have been avoided in cylindrical cells with vents. It's interesting that (easily) removable batteries have largely disappeared from consumer laptops, even the large desktop replacements (to be fair, those spend most of the time plugged in anyway). The only consumer electronics I can think of that still almost always have removable batteries are cameras.

After the incident, I started to browse current laptops on the market as the new quad/hex core laptop CPUs are quite a tempting upgrade (my XPS 13 has an i5-5200U). I was not a huge fan of the latest XPS 13 (9380) mostly because of the port selection: I just don't have any USB Type-C devices, so the 1 Type-C plus 2 Type-A combination found on the XPS 13 (9360) is superior in my opinion. Besides ports, the onboard WLAN card and the removal of the full-sized SD card slot also make the latest model less appealing.

I also came across the Let's Note line of laptops from Panasonic, which are reliable, lightweight business laptops that often come with removable batteries and a wide spectrum of ports. If only they weren't so prohibitively expensive, didn't have those ugly "Wheel Pads", and came with a US keyboard layout, they would be quite the ideal laptops. I like the aesthetics of the 2016 CF-MX5 series the most, but that wouldn't make much of an upgrade.

More realistic choices include HP's EliteBook, Lenovo's ThinkPad T series, and Dell's Latitude/Precision lines. I vetoed the EliteBook because all of them had a huge glaring proprietary docking port that I might never use. The Latitude 5491 seems to have cooling issues due to the 45W TDP CPUs, while the Latitude 7390 and 7490 both seem quite decent, with options to disable Intel ME and official Linux support. The ThinkPad T480 pretty much ticks everything on my list, but it seems that the next iteration, the T490, will no longer have the bridge battery system and will have only one SODIMM slot, pretty much like the T480s.

Hunting for second-hand machines is also an option, but it defeats the purpose of the upgrade since my primary motivation is the new quad core CPUs. Some may argue our laptops are overpowered already, and indeed my XPS 13 still feels pretty snappy, so I'm not in urgent need of an upgrade. However, I did come up with a list of what I want in a laptop in case the ideal candidate shows up someday.

  • Good Linux driver support.
  • Below 15 inches in size and low travel weight. The XPS 13 converted me from a DTR enthusiast to an Ultrabook follower: it does feel nice to be able to carry a laptop all day without feeling it.
  • Non-Nvidia graphics. Both AMD and Intel have better open source driver support and I use my desktop for tasks heavily reliant on GPU.
  • Reasonable battery life (6 hours or more) and removable battery.
  • Not-too-radical port selections, not until all mice and flash drives default to USB Type-C at least.
  • Standard components and easy to upgrade, e.g. SODIMM slot for memory, PCIe for WLAN card/SSD.
  • A nice trackpad. I'm rather insensitive to quality of laptop keyboards so anything marginally decent would do. It would be really cool to have an ErgoDox laptop though.
  • Not-super-high-resolution display. I'm not too picky about screens either, but 4K feels like utter overkill for laptops this size: it provides marginal improvements while draining more power. I've always used 16:9 displays, but I'm open to trying out different ones.

enumerate() with C++

Quite a few programming languages provide ways to iterate through a container while keeping count of the number of steps taken, such as enumerate() in Python:

for i, elem in enumerate(v):
    print(i, elem)

and enumerate() under std::iter::Iterator trait in Rust:

for (i, elem) in v.iter().enumerate() {
    println!("{}, {}", i, elem);
}

This is just a quick note about how to do similar things in C++17 and later without declaring extra variables out of the for loop's scope.

The first way is to use a mutable lambda:

std::for_each(v.begin(), v.end(),
              [i = 0](auto elem) mutable {
                  std::cout << i << ", " << elem << std::endl;
                  ++i;
              });

This could be used with all the algorithms that guarantee in-order application of the lambda, but I don't like the dangling ++i that could get mixed up with other logic.

The second way utilizes structured binding in for loops:

for (auto [i, elem_it] = std::tuple{0, v.begin()}; elem_it != v.end();
     ++i, ++elem_it) {
    std::cout << i << ", " << *elem_it << std::endl;
}

We have to throw in std::tuple as otherwise the compiler would try to create a std::initializer_list, which does not allow heterogeneous contents.

The third, least fancy method is to just calculate the distance every time:

for (auto elem_it = v.begin(); elem_it != v.end(); ++elem_it) {
    auto i = std::distance(v.begin(), elem_it);
    std::cout << i << ", " << *elem_it << std::endl;
}

Since we have to copy paste the starting point twice, I like other counter based approaches better.

In C++20, we have the ability to add an init-statement in range-based for loops, so we can write something like

for (auto i = 0; auto elem : v) {
    std::cout << i << ", " << elem << std::endl;
    i++;
}

Meh, not that impressive. The new <ranges> library provides a more appealing way to achieve this:

for (auto [i, elem] : v | std::views::transform(
         [i = 0](auto elem) mutable { return std::tuple{i++, elem}; })) {
    std::cout << i << ", " << elem << std::endl;
}

I like the structured binding method and the <ranges> based method the most. It would be even better though if we could get a std::views::enumerate to solve this problem once and for all.

Hello Darkness, My Old Friend

With system-wide dark modes becoming commonplace, I took the effort to tweak the color scheme of my blog and added a dark mode specific one using prefers-color-scheme in CSS. I also toyed with the idea of adding a user toggle using JavaScript per instructions here, but ultimately decided against it because of my (totally unjustified and groundless) distaste towards the language.

Color Usage      Light Theme  Dark Theme
Accent           #700000      #8fffff
Background       #f7f3e3      #080c1c
Text             #2e2d2b      #d1d2d4
Code Background  #e3dacb      #1c2534
Border 1         #e7e3d3      #181c2c
Border 2         #d7d3c3      #282c3c

Writing CSS is such a tiring endeavor, but on the bright side, picking colors is a surprisingly relaxing activity. The light mode color scheme now has reduced contrast, and I updated the isso style sheets with matching colors. Yes, I only inverted the colors in dark mode and did not reduce the font weights because of the peculiar way in which human vision works. Part of me already screams heresy when I look at the color codes formed by three numbers that seem to have no connection whatsoever (they are like dissonant chords that cause itches in the brain), so I need them to at least sum up to a nice number.
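The inversion can be sketched in a few lines of Python. The helper name invert is made up for illustration; the light theme values are the ones from the table, and each dark theme color is simply #ffffff minus its light counterpart, so every pair sums to the "nice number" #ffffff.

```python
def invert(hex_color):
    """Invert a #rrggbb color by subtracting it from #ffffff.

    Every channel is at most 0xff, so subtracting the whole 24-bit value
    never borrows across channels and equals a per-channel inversion.
    """
    value = int(hex_color.lstrip("#"), 16)
    return f"#{0xFFFFFF - value:06x}"


# Light theme colors, copied from the table above.
light_theme = {
    "Accent": "#700000",
    "Background": "#f7f3e3",
    "Text": "#2e2d2b",
    "Code Background": "#e3dacb",
    "Border 1": "#e7e3d3",
    "Border 2": "#d7d3c3",
}

dark_theme = {usage: invert(color) for usage, color in light_theme.items()}

for usage, color in dark_theme.items():
    print(f"{usage}: {color}")
```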

Wissen ist Nacht!

Fun with Fonts on the Web

A more accurate version of the title probably should be "Fun with Fonts in Web Browsers", but oh well, it sounds cooler that way. Text rendering is hard, and it certainly doesn't help that we have a plethora of different writing systems (blame the Tower of Babel for that, I guess) which cannot be elegantly fitted into a uniform system. Running a bilingual blog doubles the trouble in font picking, and here's a compilation of the various problems I encountered.

Space Invaders

Most browsers join consecutive lines of text in HTML to a single one with an added space in between, so

<html>Line one and
line two.</html>

renders to

Line one and line two.

Such a simplistic rule doesn't work for CJK languages where no separator is used between words. The solution is to specify the lang attribute for the page (or any specific element on the page) like so:

<html lang="zh">第一行和
第二行。</html>

If your browser is smart enough (like Firefox), it will join the lines correctly. All the Blink based browsers, however, still stubbornly shove in the extra space, so it looks like I will be stuck with unwrapped source files like a barbarian for a bit longer. While not a cure-all solution, specifying the lang attribute still has the added benefit of enabling language-specific CSS rules, which comes in handy later.
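The desired joining behavior can be sketched roughly as below. This is not the algorithm any browser actually implements, only a simplified illustration: the function names are made up, and "wide" is approximated with the East Asian Width classes W and F from Python's standard unicodedata module.

```python
import unicodedata


def is_wide(ch):
    """Approximate 'CJK character' as East Asian Width class W or F."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_lines(text):
    """Join wrapped lines, adding a space unless both sides are wide."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return ""
    joined = lines[0]
    for line in lines[1:]:
        if is_wide(joined[-1]) and is_wide(line[0]):
            joined += line  # no separator between two CJK characters
        else:
            joined += " " + line
    return joined


print(join_lines("Line one and\nline two."))
print(join_lines("第一行和\n第二行。"))
```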

Return of the Quotation Marks

As mentioned in a previous post, CJK fonts would render quotation marks as full-width characters, different from Latin fonts. This won't be a problem as long as a web page doesn't try to mix and match fonts: just use language-specific font stacks.

body:lang(en) {
    font-family: "Oxygen Sans", sans-serif;
}

body:lang(zh) {
    font-family: "Noto Sans SC", sans-serif;
}

Coupled with matching lang attributes, the story would have ended here. Firefox even allows you to specify default fonts on a per language basis, so you can actually get away with just the fallback values, like sans-serif or serif, and not even bother writing language specific CSS.

However, what if I want to use Oxygen Sans for Latin characters and Noto Sans SC for CJK characters? While seemingly a sensible solution, specifying the font stack like so,

body:lang(zh) {
    font-family: "Oxygen Sans", "Noto Sans SC", sans-serif;
}

would cause the quotation marks to be rendered using Oxygen Sans, which displays them as half-width characters. The solution I found is to declare an override font with a specified unicode-range that covers the quotation marks,

@font-face {
    font-family: "Noto Sans SC Override";
    unicode-range: U+2018-2019, U+201C-201D;
    src: local("NotoSansCJKsc-Regular");
}

and revise the font stack as

body:lang(zh) {
    font-family: "Noto Sans SC Override", "Oxygen Sans", "Noto Sans SC", sans-serif;
}

Now we can enjoy the quotation marks in their full-width glory!

Font Ninja

Font files are quite significant in size, and even more so for CJK ones: the Noto Sans SC font just mentioned is over 8MB in size. No matter how determined I am to serve everything from my own server, this seems like utter overkill considering the average HTML file size on my site is probably closer to 8KB. How do all the web font services handle this then?

Most web font services work by adding a bunch of @font-face definitions into a website's style sheet, which pull font files from dedicated servers. To reduce the size of the files being served, Google Fonts slices the font file into smaller chunks, and declares the corresponding unicode-range for each chunk under @font-face blocks (this is exactly how they handle CJK fonts). They also compress the font files into WOFF2, further reducing file size. On the other hand, Adobe Fonts (previously known as Typekit) seems to have some JavaScript wizardry that dynamically determines which glyphs to load from a font file.
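To make the unicode-range idea concrete, here is a small Python sketch that turns a set of characters into a unicode-range value by merging consecutive codepoints into ranges. The function name to_unicode_range is hypothetical, and this says nothing about how Google Fonts actually picks its chunks.

```python
def to_unicode_range(chars):
    """Format characters as a CSS unicode-range value.

    Consecutive codepoints are merged into "U+XXXX-YYYY" ranges.
    """
    codepoints = sorted({ord(ch) for ch in chars})
    ranges = []
    for cp in codepoints:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return ", ".join(
        f"U+{lo:04X}" if lo == hi else f"U+{lo:04X}-{hi:04X}"
        for lo, hi in ranges
    )


# The four curly quotation marks from the override font above.
print(to_unicode_range("\u2018\u2019\u201c\u201d"))
```

Fed the four curly quotation marks, it reproduces the unicode-range used for the override font earlier.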

Combining best of both worlds, and thanks to the fact that this is a static site, it is easy to gather all the used characters and serve a font file containing just that. The tools of choice here would be pyftsubset (available as a component of fonttools) and GNU AWK. Compressing font files into WOFF2 also requires Brotli, a compression library. Under Arch Linux, the required packages are python-fonttools, gawk, brotli, and python-brotli.

Here's a shell one-liner to collect all the used glyphs from generated HTML files:

find . -type f -name "*.html" -printf "%h/%f " | xargs -l awk 'BEGIN{FS="";ORS=""} {for(i=1;i<=NF;i++){chars[$(i)]=$(i);}} END{for(c in chars){print c;} }' > glyphs.txt

You may need to export LANG=en_US.UTF-8 (or any other UTF-8 locale) for certain glyphs to be handled correctly. With the list of glyphs, we can extract the useful part of font files and compress them:

pyftsubset NotoSansSC-Regular.otf --text-file=glyphs.txt --flavor=woff2 --output-file=NotoSansSC-Regular.woff2

Specifying --no-hinting and --desubroutinize can further reduce size of generated file at the cost of some aesthetic fine-tuning. A similar technique can be used to shrink down Latin fonts to include only ASCII characters (or keep the extended ASCII range with U+0000-00FF):

pyftsubset Oxygen-Sans.ttf --unicodes="U+0000-007F" --flavor=woff2 --output-file=Oxygen-Sans.woff2

Once this is done, available glyphs can be checked using most font manager software, or this online checker (no support for WOFF2 though, but you can convert into other formats first, such as WOFF).

I also played around with the idea of dividing the glyphs into further chunks by popularity, so here's another one-liner to get a list of glyphs sorted by number of appearances:

find . -type f -name "*.html" -printf "%h/%f " | xargs -l awk 'BEGIN{FS=""} {for(i=1;i<=NF;i++){chars[$(i)]++;}} END{for(c in chars){printf "%06d %s\n", chars[c], c;}}' | sort -r > glyph-by-freq.txt

It turns out my blog has around 1000 different Chinese characters, with roughly 400 of them appearing more than 10 times. Since the file sizes I get from a single round of subsetting are already good enough, I didn't bother proceeding with another split.
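For comparison, the same counting can be sketched in Python with collections.Counter. This is an illustrative alternative to the AWK one-liners, not what I actually ran: the function names and the sample strings are made up, and read_html_files assumes the generated files are UTF-8 encoded.

```python
from collections import Counter
from pathlib import Path


def count_glyphs(texts):
    """Count every character in the given strings, ignoring line breaks."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if ch not in "\r\n")
    return counts


def read_html_files(root):
    """Yield the contents of all *.html files under root (assumed UTF-8)."""
    for path in Path(root).rglob("*.html"):
        yield path.read_text(encoding="utf-8")


# In-memory sample so the sketch runs without a generated site; for real
# use, pass read_html_files("public") or wherever the HTML output lives.
sample_pages = ["<p>云云 a</p>", "<p>云 b</p>"]
counts = count_glyphs(sample_pages)

unique_glyphs = "".join(counts)  # would go into glyphs.txt
frequent = [ch for ch, n in counts.most_common() if n > 2]
print(len(counts), frequent)
```

The keys of the counter correspond to the glyph list from the first one-liner, and most_common() gives the by-frequency ordering of the second.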

For Your Browsers Only

With all the tricks in my bag, I was able to cut down the combined font file size to around 250KB, still far larger than an HTML file though. While it is nice to see my site appearing the same across all devices and screens, I feel the benefit is out of proportion compared to the 100-fold increase in page size.

Maybe it is just not worth it to force the choice of fonts. In case you want to see my site as I would like to see it, here are my go-to fonts: