FOSDEM 2020

Saturday, February 1, 2020

This is the 20 year anniversary edition of FOSDEM, and it's big.

The convergence of the geeks.

General

There are live streams for all talks, which is nice because there are also quite long lines when waiting to enter a room.

As an aside, I noticed that various places had links to outdated Facebook and Twitter entries. A while ago, I dropped both Facebook and Twitter, and I reluctantly came back, but could not get my old handles. So on Twitter, it's @zanyware now and not @descubes, and on Facebook, it's christophe.dedinechin.18 (I doubt there are 18 of us, but oh well...)

A bit concerned with the battery lifetime of my laptop. Yesterday, it stopped at around 50%. Hopefully I won't have the problem today.

Keynote: We have to finish that thing one day

Keynote given by Thorsten Leemhuis. As usual, it's crisp, very detailed, very informed.

The key topic of the talk is how Linux wins by solving big problems little by little. "Solve big problems in small steps". Interestingly, one of the examples he took is containers, and he retraced some of the history that I outlined in my DevConf.cz 2020 talk.

Talks about BPF, cBPF vs eBPF, and mentions it replacing the Linux kernel someday. This is not as outlandish as it sounds, IMO. Talks about DTrace, etc.

Also gives interesting counter-examples, e.g. BTRFS vs. ZFS. Does not however mention that ZFS is cross-platform, whereas BTRFS is not. BTRFS was initially overhyped. Will Linux some day get something that competes credibly with ZFS? Probably yes, but will take 10-20 years. Will it be bcachefs? Not submitted upstream yet.

Problems of Linux kernel development: no central forge, everything done through mail, long unstable development phases. Reminds that initially we did not even have a version control system. But then got git in 2005, the second world-changing project by Linus. Got a mostly predictable release cycle. Got stable and long term kernels. Hundreds of mailing lists. Still no automated central code checking in a central place (is that a bad thing?) The amazing thing is that he could say such things without any reaction from the audience. No booohs, no aaahs.

Should the Linux foundation help more? "Not sure about that". Linux development runs at the usual pace. "Famous last words, but the patch volume has to drop off one day" (Andrew Morton).

Gave a link to Brendan D. Gregg's page, which seems to be a raw collection of links. At some point, I need to spend the time reading all that.

Opinion I truly believe this guy should be part of the teams building "whole Linux" documentation packages. He knows what changes, he knows how to explain it, and he knows how to make sense of a large pile of somewhat unrelated topics.

Kata Containers on openSUSE

Talk about Kata Containers on openSUSE. Curious to see if I will learn anything interesting.

Starts with the very basics, i.e. "running containers in virtual machines". "If you want to escape the container, you need to escape two layers, so that's improved isolation". I don't think this is necessarily true given the number of technologies designed to bypass overhead, e.g. you could end up controlling a network card virtual function (VF) directly when you use DPDK.

They are using a smaller kernel in the container, called "KVM small". They want to use QEMU microVM.

OCI compatible.

Mention replacement of 9pfs (slow) with virtiofs. Did not know exactly when it was merged into the kernel. I think it's 5.3.15, definitely there in 5.4, though the qemu part of it was only merged last week.

Mention that a small change has to be made to be able to run rootless. Need to add runtimes in libpod.conf, because they use a non-standard path. So you can add that in the kata-runtime section of libpod.conf.

Evolution of kube-proxy

Datadog has tens of thousands of hosts in their infra, and they were hitting scalability limits.

kube-proxy running on each Kubernetes node. Implements the Kubernetes service abstraction.

Initial proxy implementation was from user-space. "Proxy mode = userspace". An iptables rule redirects traffic to the proxy, which will do the load-balancing between nodes. Prerouting sends to portal containers. Limitations: performance, and source IP cannot be kept. Since Kubernetes 1.2, default is iptables.

Another limitation is that iptables was not designed for load balancing. It's hard to debug with 10K rules. Performance impact:

  • Control plane: syncing rules.
  • Data plane: going through rules.

20K services = 160K rules = 5 hours to load them.
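A rough way to picture the data-plane cost: a rule list is scanned in order until something matches, whereas a hash table keyed on the service address goes straight to the entry. The sketch below is a toy Python model, not how the kernel implements either approach; the rule layout and function names are made up for illustration. The 8 rules per service is simply 160K / 20K from the figures above.

```python
# Toy model only: real iptables rules and kernel data structures look nothing like this.
RULES_PER_SERVICE = 160_000 // 20_000  # = 8, derived from the figures in the talk


def build_rule_list(num_services):
    """Flat, ordered rule list: each service contributes several rules."""
    rules = []
    for svc in range(num_services):
        for i in range(RULES_PER_SERVICE):
            rules.append((f"10.0.{svc // 256}.{svc % 256}", i))
    return rules


def linear_match(rules, dest_ip):
    """Sequential first-match scan; returns how many rules were inspected."""
    for inspected, (ip, _) in enumerate(rules, start=1):
        if ip == dest_ip:
            return inspected
    return len(rules)


def hashed_match(table, dest_ip):
    """Hash lookup: the cost does not grow with the number of services."""
    return table.get(dest_ip)


rules = build_rule_list(2_000)
# Index of the first rule of each service, keyed by service address.
table = {ip: idx for idx, (ip, i) in enumerate(rules) if i == 0}

last_service = "10.0.7.207"  # address given to service number 1999 in this model
print(linear_match(rules, last_service))             # rules inspected before a match
print(hashed_match(table, last_service) is not None)  # found with a single lookup
```

The further down the list a service sits, the more rules a packet has to walk through, which is why the cost grows with the number of services.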

Proxy mode = ipvs (they only started talking about it halfway through the talk, which I believe was a bit late). Service with 2K endpoints, ~100 B per endpoint, 5K nodes. Each node gets 2K x 100 B = 200 KB.

Addressed recently in Kubernetes with endpoint slices. Maximum 100 endpoints in each slice. Much more efficient for services with many endpoints. Beta in Kubernetes 1.17.
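The arithmetic behind endpoint slices can be spelled out as follows. This is a back-of-the-envelope sketch using the numbers from the talk; the assumption that a single endpoint change only re-sends the one slice containing it is mine, and the constant names are made up.

```python
ENDPOINT_BYTES = 100      # ~100 B per endpoint (from the talk)
ENDPOINTS = 2_000         # endpoints in the service
NODES = 5_000             # nodes receiving updates
SLICE_SIZE = 100          # maximum endpoints per slice


def bytes_per_node_full_object(endpoints=ENDPOINTS):
    """Every node receives the whole endpoints object on each change."""
    return endpoints * ENDPOINT_BYTES


def bytes_per_node_one_slice(slice_size=SLICE_SIZE):
    """Assumption: one endpoint change only re-sends the slice containing it."""
    return slice_size * ENDPOINT_BYTES


full = bytes_per_node_full_object()
sliced = bytes_per_node_one_slice()
print(f"full object : {full / 1e3:.0f} kB per node, {full * NODES / 1e9:.1f} GB cluster-wide")
print(f"single slice: {sliced / 1e3:.0f} kB per node, {sliced * NODES / 1e6:.0f} MB cluster-wide")
```

Under these assumptions, each change costs 20 times less traffic per node, which is the ratio between the service size and the slice size.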

Containers live migration

A talk by Adrian Reber about how to transfer a running container around.

CRIU: Checkpoint Restore in User Space

  • First step: checkpoint a container using ptrace().
  • Generate parasite code, injected into the process
  • Then restart the process with the parasite code, daemon waiting for commands.
  • Then checkpoint continues.

That all sounds perfectly reasonable to me

"If you run with podman, you probably have SELinux, and CRIU does things that SELinux does not really approve of". NSS (No Shit Sherlock, said in a work-safe place).

There is another talk about this that is probably interesting.

Use clone3() for each PID/TID, which might be better.

One user of this is Google, whose container runtime Borg live-migrates processes in production a lot. Apparently happy with how it works. LXC/LXD has a long history of CRIU integration. For Docker, you need an experimental mode to use it, and it is unmaintained.

Useful commands:

podman container checkpoint
podman container restore
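For an actual migration, the checkpoint has to be written to a file that can be copied to the destination host. A minimal sketch of how the two commands fit together, assuming podman's `--export` and `--import` options; the container name and archive path are hypothetical, and the code only builds and prints the command lines rather than running them (checkpointing typically requires root):

```python
import shlex


def checkpoint_cmd(container, archive):
    """argv for checkpointing a container into a tar archive."""
    return ["podman", "container", "checkpoint", container, "--export", archive]


def restore_cmd(archive):
    """argv for restoring a container from a previously exported archive."""
    return ["podman", "container", "restore", "--import", archive]


# Hypothetical names, for illustration only.
container = "webserver"
archive = "/tmp/webserver-checkpoint.tar.gz"

# Source host: checkpoint. Destination host (after copying the archive): restore.
for argv in (checkpoint_cmd(container, archive), restore_cmd(archive)):
    print(shlex.join(argv))
```

Copying the archive between the two hosts (scp, rsync, shared storage) is left out on purpose.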

Q: Stuff from Borg tends to flow into Kubernetes. Will this happen in Kubernetes?

A: No sign of this happening. The problem is that containers are stateless, so why would you want to migrate them?

Tried to migrate a database, but the database shut down after migration, which might be caused by time differences. The time namespace has been accepted in Linux and might help.

Supervising and emulating syscalls

Talk about how to intercept syscalls.

Seccomp runs before the syscall. Seccomp never blocks. It asks userspace for return value and errno. Execution does not continue in the kernel, userspace must do the work.
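To make the division of labour concrete, here is a pure-Python model of the supervisor side. It is a simulation only: nothing here talks to the kernel, the class and field names are loosely modelled on the idea of a notification and a response, and the policy is a made-up example.

```python
import errno
from dataclasses import dataclass


@dataclass
class Notification:
    """What the supervisor is told: who made which syscall, with which arguments."""
    id: int
    pid: int
    syscall: str
    args: tuple


@dataclass
class Response:
    """What the supervisor answers: the values the caller will observe."""
    id: int
    val: int = 0
    error: int = 0


def supervise(req: Notification) -> Response:
    # Hypothetical policy: emulate mknod for character device 5:1 only.
    if req.syscall == "mknod":
        path, kind, major, minor = req.args
        if (kind, major, minor) == ("c", 5, 1):
            # A real supervisor would create the node on the caller's behalf here.
            return Response(id=req.id, val=0)
        return Response(id=req.id, val=-1, error=errno.EPERM)
    # Anything else is refused in this toy policy.
    return Response(id=req.id, val=-1, error=errno.ENOSYS)


ok = supervise(Notification(1, 4242, "mknod", ("bbb", "c", 5, 1)))
denied = supervise(Notification(2, 4242, "mknod", ("sda", "b", 8, 0)))
print(ok)
print(denied)
```

The point of the model is that the supervisor, not the kernel, decides both the return value and the errno, and has to perform any real work itself.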

Slides are a bit weak on content compared to what is being said, which is very dense. So this is the typical case where not listening to the talk for 30 seconds gets you totally lost and you cannot recover by looking at the slides. Christian Brauner clearly knows what he's talking about, but there is really much (too much?) more than what is on the slides.

Uses lxc for the demo. Demo starts with cat /proc/self/uid_map and mknod bbb c 5 1 trying to create a device. The amazing thing is that he is explaining what is happening under the hood, so not as obscure as it might first seem.

lxc config set f1 security.syscalls.intercept.mknod true
lxc restart --force f1
lxc shell f1

Then can do an mknod and then stat ./bbb (the device node just created).

Showed that the policies allow controlling mount and the associated filesystem type, e.g. mount.ext4.

Below Kubernetes: Demystifying container runtimes

Talk about what is happening below Kubernetes. More specifically, the space between Kubernetes and the Linux kernel.

It's a "mess of overlapping projects and products". (glad it's not just me). "How many different meanings can container runtime have?"

OCI established circa 2015 to try and unify things.

Container runtime interface established Dec 2016. Primitives to manage pods of containers. A single interface for rkt and docker.

Thierry Carrez is creating diagrams that look way too similar to what I showed at DevConf.cz (i.e. increasing number of boxes showing up as time goes by). Current state looks like this:

I will clearly need to link to the slides or video if they are posted, because the evolution is funny.

Time for me to play with mermaid:

graph TD
    Kubernetes[Kubernetes] --> ccd[cri-containerd]
    ccd --> cd[containerd]
    cli[docker CLI] --> cd
    ccd --> runc[runC]

"The dirty secret of containers: they are not very good at containing". In the real world, they run in VMs.

Firecracker is a "highly opinionated runtime".

"That is when the diagram began to become too complex", e.g. directly connect containerd to firecracker. Also the case for Kata Containers to "leverage advanced features", i.e. things that are not in the OCI runtime interface.

Alpha Waves

Since it takes a lot of time to switch rooms, that was the time I decided to leave the "Containers" track and join the "Retrocomputing" track. I almost had a major accident, splashing water over my keyboard minutes before the talk, but fortunately no damage. I was concerned because regular Apple keyboards are notoriously sensitive to liquids. I already lost at least 3 keyboards to a single splash of liquid by one of my kids. Apparently, it's better with the PowerBook.

BASICODE

Learned about something called BASICODE, which I had never heard of. A way to send BASIC programs over the airwaves, with an API made of GOSUB subroutines with pre-determined line numbers. So something like GOSUB 100 would do a "clear screen" whether the program ran on Apple II or Sinclair Spectrum. Super weird. One or two people in the room had actually used it to download software from radio programs.
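The idea is essentially a dispatch table keyed by line number. Here is a small sketch of it: only "GOSUB 100 = clear screen" comes from my notes, the rest (function names, the way machines are modelled) is invented for illustration, with HOME and CLS being the native clear-screen statements on the two machines.

```python
# Hypothetical sketch: only "GOSUB 100 = clear screen" comes from the talk notes.
def make_machine(clear_statement):
    """Model one computer: native routines installed at the agreed line numbers."""
    output = []

    def clear_screen():
        output.append(clear_statement)

    # Platform-specific routines, all reachable at the same agreed line numbers.
    subroutines = {100: clear_screen}

    def gosub(line):
        subroutines[line]()

    return gosub, output


def portable_program(gosub):
    """The broadcast program: it only knows line numbers, not machines."""
    gosub(100)


apple_gosub, apple_out = make_machine("HOME")       # Apple II
spectrum_gosub, spectrum_out = make_machine("CLS")  # Sinclair Spectrum

portable_program(apple_gosub)
portable_program(spectrum_gosub)
print(apple_out, spectrum_out)
```

The same program text runs on both machines because each one supplies its own routine behind the shared line number.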

Retro music - Open Cubic Player

An interesting talk about music in the old days (Amiga .mod files if you can remember that).

Nostalgia for adlib sound.

"OpenSource was a real eye opener for learning how to program"

"Your multimedia program is an operating system in itself, except file system control"

Reviving le MINITEL

(I briefly mentioned Minitel during my talk, there was a "3615 Infogrames" sign on one of the Alpha Waves boxes.)

In the late 60s, France had only 15% phone lines, 3 years installation delay (vs 90% and 3 days for the US). Last manual switch was decommissioned in 1978, first automated switch had been tested in 1912.

France had a plan. Packet switched network "Transpac", from 50 bits/s for telex up to 64 kbit/s. Heterogeneous. New pricing, depending on rate and connection time, not on distance.

Transpac had B2B applications (banks, etc), and B2C applications (Minitel).

Minitel = Videotex screen with a keyboard connected with a modem. 1974: BBC's Ceefax. 1979, CCETT's Antiope.

In France, microcomputing was hard to grasp, no network. Very late on modems. Free terminal. One Minitel cost 260 euros (1000F) to build. 6.5M Minitels installed. 750M euros of installed Minitels. Paid by the French state, which needed to get its money back.

40 columns x 25 rows, could display 8 colors shown as shades of gray. Rate 1200 baud download, upload at 75 bits/s. Videotex offered 64 mosaic characters, which divided characters in 2x3 pixels, so could have 80x72 "graphic" resolution.
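The numbers fit together: a 2x3 block has 6 pixels, hence 2^6 = 64 mosaic characters, and 40 x 2 = 80 pixels across. The 72 is 24 x 3, which suggests 24 of the 25 rows were usable for this (that reading is an assumption). A small sketch of the packing, where the bit order is an arbitrary choice and not the actual Videotex encoding:

```python
COLS, ROWS = 40, 24          # text cells; 24 rows is what yields 72 pixel rows
CELL_W, CELL_H = 2, 3        # each mosaic character is a 2x3 block


def mosaic_index(block):
    """Pack a 2x3 block of 0/1 pixels (3 rows of 2) into a number 0..63.

    Bit order is an arbitrary choice for this sketch, not the Videotex encoding.
    """
    bits = [pixel for row in block for pixel in row]
    assert len(bits) == CELL_W * CELL_H
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


print(2 ** (CELL_W * CELL_H))        # number of distinct mosaic characters
print(COLS * CELL_W, ROWS * CELL_H)  # "graphic" resolution
print(mosaic_index([[1, 0], [0, 1], [1, 1]]))
```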

Very complex set of attributes, each character is coded as 16 bits in the Minitel memory. Could encode 449 characters.

They even went as far as reconstructing Minitel services, and you could see the display reconstruction in its full 1200 baud glory, including a famous "Minitel Rose", 3615 ULLA, which was plastering all the walls in France for years with little stickers advertising their service.

Gate project

Portable execution state, i.e. compiles stuff to wasm, and then can suspend on one machine (x86) and resume execution on a different machine (arm64).

Reminded me a lot of TAOS (or at least the promises it had), but done with modern technologies. It only took 25 years.

The room is almost empty. Lightning talks are not very successful. Also, running late (10 minutes behind schedule). Had I known, I would probably have watched the VIC-20 cartridge reverse engineering talk, and missed the talk about the Gate project. That is one of the talks I had marked on my calendar, so I'm happy it happened that way.

The pool next to the ocean

Subtitle: How to bring open source skills to more people.

Talk about contribution.

Tracking local storage configuration on linux

Using continuous recording of disk events in order to be able to recreate a configuration after the fact.

Command is lsblkj, which filters stuff from the journal (the j in the command name) and extracts stuff that matters wrt. storage. Behaves like lsblk, but misses filesystem info because that is not logged yet.

The user interface reminds me of Patrick Duverger's "open data map". Need to look at Skydive later.

(Rant) The projector has a lot of echoing. I'm glad I'm using a pastel background instead of straight black and white. We'll see how it plays "for real".

In other news (not FOSDEM): Blogmax improvements

Still working on improving the BlogMax package to fit my needs. So you will sometimes see incorrectly formatted output on this page. You can always refer to the source code for reference.

As I write this, for example, this is the case for tt formatting, and the root cause seems to be that Emacs Lisp replace-regexp-in-string does not seem to recognize the \b form (word boundary). It does work in re-search-forward, so that's weird. Also, simple experiments with the *scratch* buffer in Emacs show that this is not really the problem.

(replace-regexp-in-string "\btoto\b" "tata\&" "this is toto and atoto")

"this is tatatoto and atoto"

But then:

(replace-regexp-in-string "\b_\b" "\1" "This is an example")

"This is _an example"

(replace-regexp-in-string "\b_\b" "\&" "This is an example")

"This is an example"

WAT? How is e even seen as a word boundary? Something is definitely wrong with this function. Ah, no. The _ itself makes it a word boundary. Let me check that idea:

(replace-regexp-in-string "\b." "[\&]" "This is an example")

"[T]his[ ][i]s[ ]_[a]n[ ][e]xample[_]"

OK, so then the correct regexp syntax is to use \s- (whitespace) in the regexp and reinject whatever was matched, i.e.:

    ("\(\s-\)_\(.*?\)_" "\1\2")
    ("\(\s-\)\*\(.*?\)\*" "\1\2")
    ("\(\s-\)`\(.*?\)`" "\1\2")

That seems to work.
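For comparison, the same whitespace-anchored approach can be sketched in Python. This only mirrors the final rules, not the \b experiments: Python's \b treats _ as a word character, unlike what Emacs showed above. The HTML tags in the replacements are assumptions chosen for the example.

```python
import re

# Assumed mapping from markup character to HTML tag; the tag names are guesses.
RULES = [
    (r"(\s)_(.*?)_", r"\1<i>\2</i>"),
    (r"(\s)\*(.*?)\*", r"\1<b>\2</b>"),
    (r"(\s)`(.*?)`", r"\1<tt>\2</tt>"),
]


def render(text):
    """Apply each rule in turn; markup must be preceded by whitespace to match."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


print(render("This is _an_ example with *bold* and `code`, but not snake_case_names."))
```

Requiring whitespace before the opening character is what keeps identifiers such as snake_case_names untouched; the price is that markup at the very start of the text is not recognized.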