<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://yggdrasil-network.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://yggdrasil-network.github.io/" rel="alternate" type="text/html" /><updated>2026-03-19T09:50:05+00:00</updated><id>https://yggdrasil-network.github.io/feed.xml</id><title type="html">Yggdrasil Network</title><subtitle>End-to-end encrypted IPv6 networking to connect worlds
</subtitle><entry><title type="html">Upcoming v0.5 Release</title><link href="https://yggdrasil-network.github.io/2023/10/22/upcoming-v05-release.html" rel="alternate" type="text/html" title="Upcoming v0.5 Release" /><published>2023-10-22T00:00:00+00:00</published><updated>2023-10-22T00:00:00+00:00</updated><id>https://yggdrasil-network.github.io/2023/10/22/upcoming-v05-release</id><content type="html" xml:base="https://yggdrasil-network.github.io/2023/10/22/upcoming-v05-release.html"><![CDATA[<h3 id="introduction">Introduction</h3>

<p>With the v0.5.0 release coming soon, now seems like a good time to explain what we’ve been working on for the past couple of years. While we’ve generally been pretty happy with v0.4.X, there are a few problems with that design which can cause the network to behave in ways we do not like. This blog post is meant to give a short review of how v0.4 works, explain the problems with this approach, and describe the changes we’ve made in v0.5 to try to address them.</p>

<h3 id="background">Background</h3>

<p>The v0.4.X design has 3 major components to the routing scheme:</p>

<ol>
  <li>A DHT-based routing scheme, used to route traffic when no route to the destination is known.</li>
  <li>A greedy treespace routing scheme, used to route certain protocol traffic in the DHT and the “pathfinder” for source routing.</li>
  <li>A source routing scheme, which encodes a path found through treespace in packet headers, so traffic can take a more direct route than what the DHT offers (and keep routing while the state of the tree is changing).</li>
</ol>

<p>The life cycle of a connection walks through those three stages in sequence. During the initial key exchange, no path to the destination is known, so traffic is routed over the DHT. When nodes receive traffic that was routed over the DHT, they initiate pathfinding, to find a more efficient route through treespace. When pathfinding finishes, they switch to source routing, which encapsulates the DHT-routable packet inside a source routed packet. If a source routed packet ever hits a dead end, the source routing header is removed and the packet finishes routing via the DHT. Receiving this DHT routed packet triggers pathfinding in the background, so a new path can be found.</p>
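<p>The life cycle above can be sketched as a small state machine for one node's view of a single remote node. This is only an illustration: the names <code class="language-plaintext highlighter-rouge">Route</code>, <code class="language-plaintext highlighter-rouge">PeerSession</code> and the event methods are hypothetical and are not taken from the Yggdrasil code. The fallback itself happens inside the network (the node at the dead end strips the source routing header), so the sketch only tracks when pathfinding is triggered and when source routing begins.</p>

```python
from enum import Enum


class Route(Enum):
    DHT = "dht"        # no better path known, traffic is routed over the DHT
    SOURCE = "source"  # a path through treespace is known and used


class PeerSession:
    """One node's view of how it currently reaches a single remote node (hypothetical)."""

    def __init__(self) -> None:
        self.route = Route.DHT    # during initial key exchange no path is known
        self.pathfinding = False

    def on_dht_packet_received(self) -> None:
        # Receiving DHT-routed traffic triggers pathfinding in the background.
        self.pathfinding = True

    def on_path_found(self) -> None:
        # Pathfinding finished: switch to source routing for outgoing traffic.
        self.pathfinding = False
        self.route = Route.SOURCE
```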

<p>Overall, this design works well. Nodes can begin communicating (in our case, sending key exchange traffic) before needing to look up any routes, and things fall back gracefully. No special protocol traffic is needed to detect broken paths, since the DHT fallback takes care of signaling that a path has failed.</p>

<h3 id="problems">Problems</h3>

<p>While I don’t have any concerns with the overall design of v0.4, the individual components all have issues.</p>

<p>First and foremost, the DHT design used in v0.4 does not scale as well as we had hoped. Nodes need to keep track of not only the paths to their keyspace neighbors, but also any such paths that go through that node. This means some fraction of nodes are stuck knowing a large percentage of all paths through their node. That leads to high memory costs and potentially high bandwidth. The v0.4 network’s DHT bandwidth use is relatively low, since the DHT is predominantly hard state, but attempts at a more secure DHT all led to soft state designs where the bandwidth costs can become significant. Without securing the DHT, it would remain vulnerable to some attacks (or behave badly in the presence of misconfigured nodes, such as accidental anycast nodes). The more insidious issue is DHT convergence time: it takes <code class="language-plaintext highlighter-rouge">O(n)</code> “steps” to converge in the worst case, and we have good reason to believe that some typical use cases experience this. Additionally, the hard state design required actively monitoring each peer link, to quickly detect when a link is dead. This leads to a lot more idle traffic between peers than what we’d like to see.</p>

<p>Secondly, the tree can produce inconsistent views of the network, depending on which peer’s information a node pays attention to. This leads to “flapping”, when a non-parent ancestral link fails, as nodes tend to switch to a new parent that used the same (now broken) link, but which hasn’t had time to advertise the link failure yet. So nodes tend to switch from their parent to an alternative, and then back to the original parent, when the alternative eventually advertises the same failure. That flapping causes down-tree (child) nodes to flap, which can cascade through the network. There are mechanisms in place to throttle how fast things flap in v0.4, but that’s a bandaid fix to an underlying problem in the design.</p>

<p>Lastly, source routing is good in principle, but the packet format we used for this is not. It’s too easy for a malicious node to insert multiple redundant hops to produce a (finite) loop, which can waste bandwidth on a targeted set of links.</p>

<h3 id="changes">Changes</h3>

<p>Quite a number of changes have been made to the design of Yggdrasil in an effort to combat the above issues. The new approaches are not necessarily how we want the network to function long term, but rather they are alternatives that we wanted to test to better explore the solution space. Generally speaking, these are not user-facing, outside of some changes to the information available in <code class="language-plaintext highlighter-rouge">yggdrasilctl</code>’s API.</p>

<h4 id="destination-lookups">Destination Lookups</h4>

<p>The most significant change is the removal of the DHT-based routing scheme used to initially set up routes through treespace. We now use a simpler YggIP/key-&gt;coord lookup protocol which resembles ARP/NDP lookups in an ethernet broadcast network (but without broadcast traffic through the full network). Nodes keep track of which peers are reachable by an on-tree link (that is, the node’s parent and children) along with a bloom filter of the keys of all nodes reachable by that link (with keys truncated to the parts used in /64 prefixes, to allow for IP/prefix lookups). A lookup packet received from an on-tree link is forwarded to any other on-tree link where the destination is found in the bloom filter.</p>
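<p>As a rough sketch of the forwarding rule described above, assume each on-tree link has a bloom filter of the keys reachable through it. The hashing scheme (SHA-256 with a one-byte index prefix), the link names and the function names are assumptions made for illustration, not the scheme Yggdrasil uses. The filter size and hash count are the figures quoted in this post.</p>

```python
import hashlib

FILTER_BITS = 8192  # filter size quoted in this post
NUM_HASHES = 8      # number of hash functions quoted in this post


def bit_positions(key: bytes) -> list[int]:
    """Derive NUM_HASHES bit positions from a key (illustrative hashing scheme)."""
    return [
        int.from_bytes(hashlib.sha256(bytes([i]) + key).digest()[:4], "big") % FILTER_BITS
        for i in range(NUM_HASHES)
    ]


class BloomFilter:
    def __init__(self) -> None:
        self.bits = 0  # a Python int used as a bit set

    def add(self, key: bytes) -> None:
        for pos in bit_positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: bytes) -> bool:
        # True for every added key, and occasionally for keys never added.
        return all((self.bits >> pos) & 1 for pos in bit_positions(key))


def forward_lookup(filters: dict[str, BloomFilter], arrived_on: str, dest_key: bytes) -> list[str]:
    """Return the other on-tree links whose filter matches the destination key."""
    return [
        link
        for link, bloom in filters.items()
        if link != arrived_on and bloom.might_contain(dest_key)
    ]
```

<p>A lookup arriving from the parent is then passed only to children whose filters match, and never back out of the link it arrived on.</p>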

<p>While there are downsides to this approach, it has a number of advantages. First, accidental anycast configurations (using the same key from multiple nodes) will not break any network-wide data structures; they simply cause lookup traffic to arrive at more than one node. Subsequent steps will generally fail (route lookup, key exchange, etc), but there is no collateral damage to the rest of the network. Second, this requires very little idle maintenance traffic, and only needs a <em>constant</em> amount of state per peer. This means nodes in the core of the network are not responsible for maintaining a view of anything more than their immediate neighborhood, and are not hammered with idle DHT maintenance traffic originating at distant nodes. Similarly, nodes at the edge of the network do not need to send any regular DHT keepalive traffic, which may help with bandwidth use and power consumption on mobile devices. Third, this structure converges asynchronously and in time proportional to the depth of the tree, rather than sequentially and in time proportional to the size of the network, so the very poor worst-case-scenario convergence times of the DHT are avoided.</p>

<p>The major downside to this approach is that bloom filters can and will generate false positives as they fill. In practice, we would expect filters in the “core” of the network to saturate, where every node appears to be reachable by every path. This in turn means that a node’s route to the “core” of the network (generally via their parent) will take on the role of a “default route” and receive a copy of every lookup sent by the node. We expect lookup traffic will reach the core of the network, effectively act like broadcast traffic within the core, and then be culled by the bloom filters as it approaches the edges (such that a leaf node is unlikely to receive any traffic for which they are not the intended recipient). In short, the nodes in the core will see lower memory use and less bandwidth used by idle maintenance traffic, but active network use will consume more bandwidth. It remains to be seen whether or not this is a worthwhile trade-off.</p>

<p>Just to put some hard numbers on things: we use 8192-bit bloom filters with 8 hash functions. If there is a node that acts as a gateway to a subnet with 200 nodes in it, then that has a false positive rate of about 1 in a million (that is, we expect that network needs about a million nodes before the gateway sees <em>any</em> false positive lookup traffic). A majority of lookup traffic is true positives up to a gateway to a 500 node subnet in a 1 million node network.</p>
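<p>These figures can be reproduced with the usual bloom filter approximation, where a filter of <code class="language-plaintext highlighter-rouge">m</code> bits with <code class="language-plaintext highlighter-rouge">k</code> hash functions holding <code class="language-plaintext highlighter-rouge">n</code> keys has a false positive rate of roughly <code class="language-plaintext highlighter-rouge">(1 - e^(-k*n/m))^k</code>. The sketch below assumes lookups are spread uniformly over all nodes in the network, which is a simplification, and the function names are made up for this example.</p>

```python
import math

FILTER_BITS = 8192  # m, as quoted above
NUM_HASHES = 8      # k, as quoted above


def false_positive_rate(num_keys: int, bits: int = FILTER_BITS, hashes: int = NUM_HASHES) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k of a bloom filter's false positive rate."""
    return (1.0 - math.exp(-hashes * num_keys / bits)) ** hashes


def expected_false_positives(num_keys: int, network_size: int) -> float:
    """Expected number of nodes outside the subnet whose keys falsely match the filter."""
    return false_positive_rate(num_keys) * (network_size - num_keys)
```

<p>With 200 keys the rate comes out close to one in a million. With 500 keys in a network of one million nodes, the expected number of falsely matching nodes is just under 500, so true and false positives are roughly balanced at that point.</p>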

<p>So in practice, most nodes should not see any meaningful number of false positives, unless they are acting as the gateway to a very large subnet (or are in a network many orders of magnitude larger than the current v0.4 network). In our current network, a handful of nodes may find themselves in the “core” region, where they receive false positive lookup traffic from most lookups. We hope this is still preferable to constant idle DHT maintenance traffic and potentially very high memory requirements.</p>

<h4 id="crdt-tree">CRDT Tree</h4>

<p>Previously, each node’s location in the tree was verified by a chain of signatures (each referencing their parent) from the node back to the root. This can lead to inconsistencies where different nodes have mutually incompatible views of the same ancestor (e.g. node A says parent P has grandparent G, but node B says the same parent P has grandparent G’), which complicates parent selection in response to changes in network state. To address this, we have broken up the tree information into separate per-link information, which is gossiped between nodes and merged into a CRDT structure. This forces nodes to have a locally consistent view of the network, which prevents unnecessary “flapping” in some cases where a node’s route to the root has broken. This also reduces the amount of information which must be sent over the wire, as a node does not need to send information back to a peer when it knows the peer has already seen it.</p>
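<p>To illustrate why merging per-link information gives a consistent view, here is a minimal state-based CRDT sketch. This post does not describe the actual record layout, so the <code class="language-plaintext highlighter-rouge">ParentRecord</code> fields, the sequence counter and the tie-break rule are all assumptions. The point is only that a merge which is commutative, associative and idempotent yields the same result regardless of the order in which gossip arrives.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParentRecord:
    parent: str  # hypothetical: identifier of the parent a node has chosen
    seq: int     # hypothetical: counter the node bumps on every update


def merge(local: dict[str, ParentRecord], remote: dict[str, ParentRecord]) -> dict[str, ParentRecord]:
    """Merge two views of the tree, keeping the newest record for each node."""
    merged = dict(local)
    for node, record in remote.items():
        current = merged.get(node)
        # Higher seq wins; ties are broken on the parent name so the result
        # does not depend on which side is called "local".
        if current is None or (record.seq, record.parent) > (current.seq, current.parent):
            merged[node] = record
    return merged
```

<p>Under these assumptions, two nodes that have seen the same set of records always agree on who the parent of P is, which avoids the disagreement described above.</p>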

<h4 id="greedy-routing">Greedy Routing</h4>

<p>Source routing (from v0.4) has been removed in favor of greedy routing (as was done in v0.3.X). In a stable network, this has no effect on the route that packets take, only on how the decision to take that route is made. We may move back to a source routed approach in the future, but the approach used in v0.4 had some issues that would need to be addressed first. Source routing is a nice performance optimization to have, if it can be done securely, but it’s not an explicit goal of this project. While I have ideas on how to do this, it isn’t a high priority in the short term. Since the source routed scheme would presumably still depend on greedy routing for pathfinding, I think it’s useful to focus on stress testing the greedy routing part of the network in this release, and leave source routing for when other parts of the stack are closer to stable.</p>

<h4 id="per-peer-keepalive-removed">Per-peer Keepalive Removed</h4>

<p>We no longer spam peer links with keepalive traffic every few seconds. Instead, when traffic is sent, we require an acknowledgement within a few seconds (unless the traffic we sent was an ack). This means we do not detect link failures as quickly in an idle network (we need to wait for user traffic or protocol traffic to use the link), but it should reduce idle bandwidth consumption (and likely reduce power consumption for mobile devices). Note that this is separate from e.g. TCP’s own keepalive mechanisms, which are left enabled.</p>
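<p>A minimal sketch of this acknowledgement rule, with time passed in explicitly so it is easy to follow. The class name, method names and the 3 second timeout are assumptions (the text above only says “a few seconds”), and the real implementation may differ in which traffic counts as an acknowledgement.</p>

```python
from typing import Optional

ACK_TIMEOUT = 3.0  # seconds; assumed value standing in for "a few seconds"


class LinkMonitor:
    """Tracks whether a peer link still owes us an acknowledgement (hypothetical)."""

    def __init__(self) -> None:
        self.ack_deadline: Optional[float] = None

    def on_send(self, now: float, is_ack: bool) -> None:
        # Acks are not themselves acknowledged, so they never start the timer.
        if not is_ack and self.ack_deadline is None:
            self.ack_deadline = now + ACK_TIMEOUT

    def on_receive_ack(self) -> None:
        self.ack_deadline = None

    def is_dead(self, now: float) -> bool:
        # An idle link has no deadline, so it is never declared dead.
        return self.ack_deadline is not None and now > self.ack_deadline
```

<p>Note how an idle link is never flagged, which matches the trade-off described above: failures are only noticed once something is sent.</p>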

<h3 id="new-features">New Features</h3>

<p>There are also a few new features added in v0.5. It is now possible to restrict peers with a <code class="language-plaintext highlighter-rouge">?password=X</code> argument to the listen and connecting strings (and multicast configuration). This requires nodes to agree on the password before they will peer. Note that this does not allow for network isolation: nodes can still peer with the rest of the network if they wish, and reachability is still transitive. This does make it easier to restrict who can automatically connect within a subnet, or to set up a node that’s public-facing without allowing connections from everyone who finds it. There’s also support for <code class="language-plaintext highlighter-rouge">quic://</code> connections. Peering over QUIC will only use a single stream of traffic, so it’s largely the same semantics as peering over TCP/TLS, but it may be useful in cases where UDP packets have an easier time punching through a NAT or firewall. We generally expect it to perform worse than TCP/TLS, so we do not recommend using it when it’s not needed.</p>

<h3 id="summary">Summary</h3>

<p>Barring any unforeseen delays, Yggdrasil v0.5 should be out within the next few weeks. We’ve hopefully addressed the most significant issues with stability and scaling in v0.4, and significantly reduced the memory footprint and idle bandwidth consumption for some nodes. Some aspects of the new design are radically different from v0.4, so it remains to be seen how well these changes will work in the real world. Preliminary tests (and lots of simulation work) have us optimistic that v0.5 will give us a stable foundation to build on for the immediate future, as we study any limitations of this new approach and work on the inevitable redesign for v0.6.</p>]]></content><author><name>Arceliar</name></author><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">v0.4 Pre-release Benchmarks</title><link href="https://yggdrasil-network.github.io/2021/06/26/v0-4-prerelease-benchmarks.html" rel="alternate" type="text/html" title="v0.4 Pre-release Benchmarks" /><published>2021-06-26T21:00:00+00:00</published><updated>2021-06-26T21:00:00+00:00</updated><id>https://yggdrasil-network.github.io/2021/06/26/v0-4-prerelease-benchmarks</id><content type="html" xml:base="https://yggdrasil-network.github.io/2021/06/26/v0-4-prerelease-benchmarks.html"><![CDATA[<h3 id="revisiting-v03">Revisiting v0.3</h3>

<p>In the current stable release of Yggdrasil, <code class="language-plaintext highlighter-rouge">v0.3.16</code>, routing works basically the same way that it has always worked since release. Traffic is forwarded by greedy routing in a metric space. In essence, each node has a “distance label”, and given the distance labels of any two nodes, you can calculate the distance of some path between them. In the code, this label is usually called <code class="language-plaintext highlighter-rouge">coords</code>, as it represents a position in the tree, but technically we don’t care about the position itself; we only care that it works as a distance label. Traffic is forwarded to whichever peer minimizes that distance to the destination. This has been discussed in an <a href="2018-07-17-world-tree.md">earlier blog post</a>, so let’s not worry about the details of how it works for now. Instead, we’ll focus on what happens when it <em>doesn’t</em> work.</p>
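<p>For readers who want something concrete, here is a small sketch of greedy routing on a tree. It assumes <code class="language-plaintext highlighter-rouge">coords</code> is the path from the root to the node, written as a tuple of integers, which is an encoding chosen for this example and not necessarily the wire format. The distance between two nodes is then the number of tree hops between them, and a packet goes to whichever peer is strictly closer to the destination than the current node.</p>

```python
from typing import Optional


def tree_distance(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """Hops between two tree positions: up to the common ancestor, then down."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return len(a) + len(b) - 2 * common


def next_hop(own: tuple[int, ...], peers: dict[str, tuple[int, ...]], dest: tuple[int, ...]) -> Optional[str]:
    """Pick the peer closest to dest, or None if no peer improves on our own distance."""
    best, best_dist = None, tree_distance(own, dest)
    for name, coords in sorted(peers.items()):
        dist = tree_distance(coords, dest)
        if dist < best_dist:
            best, best_dist = name, dist
    return best
```

<p>Returning <code class="language-plaintext highlighter-rouge">None</code> corresponds to a dead end: no peer is closer than the current node, which is what happens when the destination's label is out of date.</p>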

<p>To be able to send traffic to a destination <code class="language-plaintext highlighter-rouge">D</code>, the sender <code class="language-plaintext highlighter-rouge">S</code> must look up the node’s distance label and key in the DHT. This happens just before session setup, where ephemeral keys are exchanged. You can think of it a bit like a DNS lookup: it maps some known static information (the node’s Yggdrasil IPv6 address) onto some unknown or dynamic information (the node’s static key and dynamic distance label). If anything happens to the network that causes the destination node <code class="language-plaintext highlighter-rouge">D</code>’s distance label to change, then all traffic to <code class="language-plaintext highlighter-rouge">D</code> will be dropped until <code class="language-plaintext highlighter-rouge">S</code> can look up <code class="language-plaintext highlighter-rouge">D</code>’s new distance label. However, that lookup depends on the DHT, and the DHT <em>also</em> uses distance labels for communication, so DHT lookups for <code class="language-plaintext highlighter-rouge">D</code> will fail for some amount of time, until the out-of-date information about <code class="language-plaintext highlighter-rouge">D</code> times out or is replaced. While that’s happening, <code class="language-plaintext highlighter-rouge">S</code> cannot communicate with <code class="language-plaintext highlighter-rouge">D</code>, even if the path between <code class="language-plaintext highlighter-rouge">S</code> and <code class="language-plaintext highlighter-rouge">D</code> is unaffected. Further exacerbating the problem, the DHT search is an iterative process, which requires round-trip communication with multiple nodes. These nodes are, for the most part, randomly distributed across the physical network, meaning most of them are likely to be near the edge of the network, where connections are comparatively unreliable and costly to use.
If any part of the lookup fails, then this delays search progress (if it doesn’t cause the search to fail entirely).</p>

<p>The network tries to combat these problems by having <code class="language-plaintext highlighter-rouge">D</code> refresh itself in the DHT and send a notification to <code class="language-plaintext highlighter-rouge">S</code> when <code class="language-plaintext highlighter-rouge">D</code>’s distance label changes. However, there is no guarantee that <code class="language-plaintext highlighter-rouge">D</code> knows every node which is tracking it in the DHT, and these notifications will hit a dead end and be dropped if the distance labels of the recipients have also changed. This often happens if <code class="language-plaintext highlighter-rouge">S</code> and <code class="language-plaintext highlighter-rouge">D</code> share a common ancestor in the tree.</p>

<p>To give a concrete example, if <code class="language-plaintext highlighter-rouge">S</code> and <code class="language-plaintext highlighter-rouge">D</code> are in a LAN with gateway <code class="language-plaintext highlighter-rouge">G</code>, and <code class="language-plaintext highlighter-rouge">G</code>’s connection to the outside world dies, then this disrupts the traffic flow between <code class="language-plaintext highlighter-rouge">S</code> and <code class="language-plaintext highlighter-rouge">D</code>. That happens even when the path between them in their own network is unaffected. It also causes various issues in the DHT, which hurt performance for the network in general, and prevents <code class="language-plaintext highlighter-rouge">S</code> and <code class="language-plaintext highlighter-rouge">D</code> in particular from being able to resume communication.</p>

<h3 id="improvements-in-v04">Improvements in v0.4</h3>

<p>As noted in a <a href="2021-06-19-preparing-for-v0-4.md">recent post</a>, the upcoming v0.4 release will include a number of major changes to how Yggdrasil routes traffic.
Most of these changes aim to improve performance in dynamic networks and reduce bandwidth consumption from protocol traffic.
Without repeating too much from that earlier blog post, the basic goal here is to insulate the routing from changes to distance labels.
This happens through a mix of reactive opportunistic source routing and a fallback to proactive DHT-based routing, both of which use distance labels for path setup, but neither of which is broken when the distance labels change (provided that the links in the path still work).</p>

<p>Since it may take a while to see how this affects performance in a live network, and because it’s a bit difficult to actually measure these things in a real network, it seems like it would be useful to look at some results from benchmarks on simulated networks.</p>

<h3 id="mesh-network-lab">Mesh Network Lab</h3>

<p>All of the results shown here are from <a href="https://github.com/mwarning/meshnet-lab">meshnet-lab</a>. You should probably just read the documentation if you want to know more, but to summarize: meshnet-lab simulates mesh networks using network namespaces on Linux. Each node is given a network namespace, which can be linked to other namespaces to simulate an arbitrary topology. Links are added and removed as needed to e.g. simulate movement in a mobile ad hoc network.</p>

<p>Although meshnet-lab supports many other mesh networking protocols, this post will focus on comparing Yggdrasil <code class="language-plaintext highlighter-rouge">v0.3.16</code> (the latest stable release) with <code class="language-plaintext highlighter-rouge">v0.4rc3</code> (the most recent release candidate). Comparisons with other mesh routers would be interesting, but it would be best if those were done by an unbiased 3rd party (and using a stable <code class="language-plaintext highlighter-rouge">v0.4.X</code> release instead of a release candidate). Instead, this post will try to highlight (qualitatively) what sort of performance changes we expect to see in the new release.</p>

<h4 id="mobility1">Mobility1</h4>

<p>The <code class="language-plaintext highlighter-rouge">mobility1</code> benchmark simulates a dynamic <a href="https://en.wikipedia.org/wiki/Unit_disk_graph">unit disc graph</a>. Nodes are simulated within a two-dimensional Euclidean plane, with each node having connections to other nodes that fall within a certain radius. The network periodically moves all nodes a random distance between 0 and X (X=10,30,60m) in a 1km x 1km virtual space, then waits some amount of time (10s or 30s) before pinging 200 random paths. The paths are limited to source/destination pairs that are in the same connected component, so it only tests paths that plausibly could work.</p>
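<p>The following sketch shows the kind of model this benchmark describes: nodes on a plane, links between nodes within a given radius, a random move per step, and a check that two nodes are in the same connected component. It is not meshnet-lab code. The function names, the uniformly random direction and the clamping at the edges of the area are assumptions for illustration, and the radius is left as a parameter since it is not stated here.</p>

```python
import math
import random


def unit_disc_links(positions: list[tuple[float, float]], radius: float) -> set[tuple[int, int]]:
    """Return index pairs (i, j) with i before j for nodes within radius of each other."""
    links = set()
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= radius:
                links.add((i, j))
    return links


def move_all(positions: list[tuple[float, float]], max_step: float, size: float,
             rng: random.Random) -> list[tuple[float, float]]:
    """Move every node a random distance in [0, max_step], clamped to the square area."""
    moved = []
    for x, y in positions:
        step = rng.uniform(0, max_step)
        angle = rng.uniform(0, 2 * math.pi)
        nx = min(max(x + step * math.cos(angle), 0.0), size)
        ny = min(max(y + step * math.sin(angle), 0.0), size)
        moved.append((nx, ny))
    return moved


def same_component(links: set[tuple[int, int]], a: int, b: int) -> bool:
    """Check whether nodes a and b are connected through the given links."""
    neighbours: dict[int, set[int]] = {}
    for i, j in links:
        neighbours.setdefault(i, set()).add(j)
        neighbours.setdefault(j, set()).add(i)
    seen, stack = {a}, [a]
    while stack:
        node = stack.pop()
        for other in neighbours.get(node, ()):
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return b in seen
```

<p>Only source/destination pairs for which <code class="language-plaintext highlighter-rouge">same_component</code> holds would be pinged, which is what limits the test to paths that could plausibly work.</p>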

<p><img src="/assets/images/2021-06-26/mobility1-10-10_arrival_progress.svg" alt="mobility1-10-10_arrival_progress" />
<img src="/assets/images/2021-06-26/mobility1-10-30_arrival_progress.svg" alt="mobility1-10-30_arrival_progress" />
<img src="/assets/images/2021-06-26/mobility1-10-60_arrival_progress.svg" alt="mobility1-10-60_arrival_progress" /></p>

<p><img src="/assets/images/2021-06-26/mobility1-30-10_arrival_progress.svg" alt="mobility1-30-10_arrival_progress" />
<img src="/assets/images/2021-06-26/mobility1-30-30_arrival_progress.svg" alt="mobility1-30-30_arrival_progress" />
<img src="/assets/images/2021-06-26/mobility1-30-60_arrival_progress.svg" alt="mobility1-30-60_arrival_progress" /></p>

<p>These mobility tests are an area where Yggdrasil has struggled up to now, as seen in the <code class="language-plaintext highlighter-rouge">v0.3.16</code> results. Basically, when a node moves, this can affect the coords of other nodes in the network. With the changes in <code class="language-plaintext highlighter-rouge">v0.4rc3</code>, the 30s tests are generally in good shape. The 10s tests see some loss, due to the time it takes to detect failed links before we can route around them.</p>

<h4 id="mobility2">Mobility2</h4>

<p>The <code class="language-plaintext highlighter-rouge">mobility2</code> test is essentially a much more aggressive variation of the above. Nodes periodically move a random (increasing) step size with a 15s delay before testing 200 random paths. This test also monitors bandwidth usage.</p>

<p><img src="/assets/images/2021-06-26/mobility2_arrival_progress.svg" alt="mobility2_arrival_progress" />
<img src="/assets/images/2021-06-26/mobility2_traffic_progress.svg" alt="mobility2_traffic_progress" /></p>

<p>The main feature to note is that, aside from having terrible reliability in this test, <code class="language-plaintext highlighter-rouge">v0.3.16</code> uses a ridiculous amount of bandwidth when mobility is involved. With <code class="language-plaintext highlighter-rouge">v0.4rc3</code>, the bandwidth use drops to around 10KBps or below, depending on how mobile things are. I’m fairly certain that most of this bandwidth is still a reaction to mobility events in the network, because (as we’re about to see) the bandwidth use is pretty low in static networks.</p>

<h4 id="scalability1">Scalability1</h4>

<p>The <code class="language-plaintext highlighter-rouge">scalability1</code> test set involves running the network over line, tree, or square grid networks. The line and tree networks start at 50 nodes and increase to 300. The grid network starts at 49 nodes (7x7) and increases the side length by 1 at each step, up to 289 nodes (17x17). This test waits for about 5 minutes before pinging 200 paths (slowly, over an additional 5 minutes), and measures both packet delivery rate and network utilization.</p>

<p><img src="/assets/images/2021-06-26/scalability1-line.svg" alt="scalability1-line" />
<img src="/assets/images/2021-06-26/scalability1-rtree.svg" alt="scalability1-rtree" />
<img src="/assets/images/2021-06-26/scalability1-grid4.svg" alt="scalability1-grid" /></p>

<p>There’s not a whole lot to say here: <code class="language-plaintext highlighter-rouge">v0.4rc3</code> is just an improvement across the board. Note that it’s a little surprising how the bandwidth use <em>decreases</em> as the network grows. This may be an artifact of how the test works, since a fixed number of pings may represent proportionally more traffic in a small network, but that’s speculation.</p>

<h3 id="conclusion">Conclusion</h3>

<p>The upcoming v0.4 release changes how packets are routed through the network. While it’s hard to say exactly how things will behave in the real world, the performance gains in the simulated networks give us reason to be optimistic.</p>

<p>If things go according to plan, then these changes should improve the user experience and overall usefulness of the network. Changes to the network state should no longer affect existing traffic flows, as long as the path the flow is using is unaffected. In cases where the path <em>is</em> affected, it should take much less time for the network to detect this and route around the damage (when it’s possible to do so). With or without disruptive changes in the network, there should be reduced bandwidth from protocol traffic, leading to lower data use and longer battery life in energy constrained environments (e.g. mobile phones).</p>]]></content><author><name>Arceliar</name></author><summary type="html"><![CDATA[Revisiting v0.3]]></summary></entry><entry><title type="html">Preparing for Yggdrasil v0.4</title><link href="https://yggdrasil-network.github.io/2021/06/19/preparing-for-v0-4.html" rel="alternate" type="text/html" title="Preparing for Yggdrasil v0.4" /><published>2021-06-19T21:00:00+00:00</published><updated>2021-06-19T21:00:00+00:00</updated><id>https://yggdrasil-network.github.io/2021/06/19/preparing-for-v0-4</id><content type="html" xml:base="https://yggdrasil-network.github.io/2021/06/19/preparing-for-v0-4.html"><![CDATA[<h3 id="version-04-is-coming-soon">Version 0.4 is coming soon</h3>

<p>In the coming weeks, we will be preparing to release Yggdrasil v0.4. This is a significant change from the v0.3 branch with an all-new protocol implementing an improved routing scheme.</p>

<p>This release brings some new and significant benefits:</p>

<ul>
  <li><strong>Improved mobility performance</strong> — For nodes that move around or change peerings frequently. This was largely impractical with v0.3 as the sessions would have to time out and a new search repeated.</li>
  <li><strong>Spanning tree changes are now less disruptive</strong> — Previously it was common for sessions to fail or for traffic to be dropped if the root or parent coordinates changed. This is no longer the case as tree routing is largely only used for bootstrapping DHT paths and determining source routes.</li>
  <li><strong>Opportunistic source routing</strong> — Session traffic will now use source routing if available, to ensure that the overall connection quality of sessions is preserved. If a source-routed path fails, the traffic will revert to DHT forwarding seamlessly.</li>
</ul>

<p>However, there are also a number of user-impacting changes coming in this release to be aware of, as we have worked to simplify the codebase and reduce complexity.</p>

<h4 id="protocol-changes">Protocol changes</h4>

<p>Yggdrasil v0.4 contains a number of breaking changes to the protocol. That means that v0.4 nodes <strong>will not</strong> peer with v0.3 nodes. We will be wiping the public peers list around the time of release as a result and asking users to re-submit their information once they have upgraded their public nodes.</p>

<h4 id="ipv6-address-changes">IPv6 address changes</h4>

<p>In v0.3, IPv6 addresses on the network were generated as a hash of your curve25519 keys (the <code class="language-plaintext highlighter-rouge">EncryptionPublicKey</code> configuration option). This was made possible due to the iterative search nature of the DHT. In v0.4, the new DHT is based on ed25519 keys instead and therefore we have had to switch to generating IPv6 addresses from the ed25519 keys instead.</p>

<p>This is sadly unavoidable. We understand that this is a rather disruptive change, especially for those who operate public services. However, we believe that the added robustness of the new routing scheme and DHT is more than worth the disruption. We will also be clearing the public services list and asking service operators to re-submit their details after upgrading.</p>

<h4 id="session-firewall-deprecated">Session firewall deprecated</h4>

<p>We decided to remove the session firewall from Yggdrasil v0.4. It’s no longer straightforward to implement in the new codebase and we believe that it often lulled users into a false sense of security. While it may have given the impression of being stateful, it was much more rudimentary. For example, if the user allowed only outbound connections, it would still be possible for the remote side to send traffic back to the user for the length of the session. It was also possible to extend this window just by sending more session traffic.</p>

<p>With the new version, the <code class="language-plaintext highlighter-rouge">SessionFirewall</code> options are no longer present in the configuration and will not take effect. You should look to use your operating system firewall instead if you need to control traffic coming to your node. If you intend to operate a node solely as an Yggdrasil router and do not need to send/receive Yggdrasil traffic from that node directly, you can also disable the TUN adapter by setting <code class="language-plaintext highlighter-rouge">IfName</code> to <code class="language-plaintext highlighter-rouge">"none"</code> in the configuration file.</p>

<h4 id="tunnel-routing-deprecated">Tunnel routing deprecated</h4>

<p>We also took the decision to remove tunnel routing from v0.4. We know that this has been a somewhat popular feature with some users, but it ultimately was the source of a significant number of bugs within v0.3. It increased the complexity of the TUN module substantially and often also didn’t behave in the way that users expected, particularly those who were used to configuring WireGuard already.</p>

<p>With the new version, the <code class="language-plaintext highlighter-rouge">TunnelRouting</code> configuration options are no longer present and will not take effect either. It’s still possible to tunnel over Yggdrasil by using a number of other technologies: GRE, IPIP, WireGuard and others, using the Yggdrasil IPs as endpoint addresses for the tunnels. We recommend tunnelling one of these protocols over Yggdrasil instead.</p>

<h4 id="release-candidates">Release candidates</h4>

<p>There are <a href="https://github.com/yggdrasil-network/yggdrasil-go/releases">release candidate builds available</a> if you want to try out v0.4 today. We recommend, though, that you take a backup of your configuration before upgrading or installing any packages, and be aware that you will not be able to peer with or access services from the v0.3 network.</p>

<p>At this point the release candidates are using a developmental protocol number. We will bump the protocol version on the final release candidate, but until then, any nodes running v0.4 release candidates should be considered to be experimental only.</p>

<h4 id="technical-details">Technical Details</h4>

<h5 id="routing">Routing</h5>

<p>The core routing logic has been redesigned and written into a <a href="https://github.com/Arceliar/ironwood">separate library</a>. This began as a toy hard-state reimplementation of Yggdrasil’s routing logic to test a new DHT design, but it eventually became a soft-state implementation using generational hard state – basically, nodes periodically set up a new network and throw away the old one, but everything acts like a hard-state protocol within the life of any one generation of the network.</p>

<p>Similar tree and DHT structures were reimplemented in <a href="https://github.com/matrix-org/pinecone">pinecone</a>, so if you <a href="https://en.wikipedia.org/wiki/Grok">grok</a> the <a href="https://matrix.org/blog/2021/05/06/introducing-the-pinecone-overlay-network">SNEK</a> then this should seem very familiar.</p>

<h6 id="treespace">Treespace</h6>

<p>The spanning tree works largely the same way as before. The only significant differences are with the root selection and updates: the root is the node with the lowest ed25519 public key, rather than the highest sha512sum hash of the public key, and the root updates the timestamp for its spanning tree announcements every 30 <em>minutes</em> (previously 30 seconds) with a timeout after 60 <em>minutes</em> (previously 60 seconds). Parent selection uses whatever non-looping path has advertised the best root &amp; timestamp combination the longest, i.e. the path that sent the update the fastest, unless that path was unstable, in which case any flapping should push the network towards the fastest stable path.</p>
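<p>As a rough illustration of the root selection rule described above, here is a minimal sketch in Go. The <code class="language-plaintext highlighter-rouge">publicKey</code> type and the <code class="language-plaintext highlighter-rouge">electRoot</code> function are hypothetical names invented for this example, and it assumes that “lowest” means first in byte-wise lexicographic order. It ignores timestamps and timeouts entirely.</p>

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"fmt"
)

// publicKey is a fixed-size copy of an ed25519 public key (hypothetical type,
// used here so that keys are directly comparable).
type publicKey [ed25519.PublicKeySize]byte

// electRoot returns the lowest key among the candidates.
// Assumption: "lowest" means first in byte-wise lexicographic order.
func electRoot(candidates []publicKey) (root publicKey, ok bool) {
	for i, k := range candidates {
		if i == 0 || bytes.Compare(k[:], root[:]) < 0 {
			root, ok = k, true
		}
	}
	return root, ok
}

func main() {
	a := publicKey{0: 0x7f}
	b := publicKey{0: 0x03}
	c := publicKey{0: 0x03, 1: 0x01}
	root, _ := electRoot([]publicKey{a, b, c})
	fmt.Printf("root key starts with %x\n", root[:2])
}
```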

<p>As before, each node uses the path from the root to itself as the node’s distance label. Given the distance labels of two nodes, the distance between them can easily be calculated (it’s the sum of the distance from each of them to their last common ancestor in the tree). This is used to do greedy routing in a metric space, but this is only used to find paths for protocol traffic. User traffic uses one of the following two routing schemes.</p>
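<p>The distance calculation above can be sketched in a few lines of Go. This is only an illustration: the <code class="language-plaintext highlighter-rouge">label</code> type (a list of port numbers from the root down to the node) and the <code class="language-plaintext highlighter-rouge">treeDist</code> function are assumptions made for the example, not names taken from the real code.</p>

```go
package main

import "fmt"

// label is a hypothetical distance label: the sequence of port numbers on the
// path from the root to the node.
type label []uint64

// treeDist returns the number of tree hops between two nodes: the hops from
// each node up to their last common ancestor, added together.
func treeDist(a, b label) int {
	common := 0
	for common < len(a) && common < len(b) && a[common] == b[common] {
		common++
	}
	return (len(a) - common) + (len(b) - common)
}

func main() {
	a := label{1, 4, 2}    // root -> 1 -> 4 -> 2
	b := label{1, 4, 7, 3} // root -> 1 -> 4 -> 7 -> 3
	fmt.Println(treeDist(a, b))
}
```

<p>In this example the two labels share the prefix <code class="language-plaintext highlighter-rouge">1, 4</code>, so the first node is one hop from the common ancestor and the second is two hops from it, giving a distance of 3. A greedy router would forward to whichever peer has the smallest such distance to the destination.</p>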

<h6 id="keyspace">Keyspace</h6>

<p>Each node needs to be able to contact any other node given only the node’s IPv6 address (in the Yggdrasil address range). To do that, we use a distributed hash table (DHT). The new DHT for v0.4 is very different from the old DHT (which was based on Chord).</p>

<p>The new DHT takes advantage of the fact that node identifiers are simply ed25519 public keys, and that the root of the tree is the node with the lowest key. Since every node knows a path to the root, and the root is at one of the edges of keyspace, we don’t need to wrap keyspace to form a Chord-like ring. So the new DHT is simply a line of nodes, ordered from the lowest key to the highest key, beginning with the root of the tree.</p>

<p>Each node is responsible for setting up a path from itself to its predecessor in the line. The predecessor uses that path to route traffic to the node. In addition to this, intermediate nodes store a routing table entry for the path. So if node B sets up a path to node A, which node A uses to forward traffic to B, then every node in the path A-&gt;B also has a routing table entry for a path that routes traffic towards B. If a link in the path times out, then the nodes on either end of the broken link send an explicit notification that the path is broken, which is what allows the DHT to quickly detect and react to mobility events.</p>

<p>Packets are forwarded towards the key that’s highest without being higher than the destination (“The Price Is Right” rules). These routing decisions are made by any node along a path, not only the nodes at the endpoints of a path. So if A is the predecessor of B, then A need not handle traffic to B — traffic which happens to cross any node on the path A-&gt;B will flow towards B without reaching A. The root, as well as all peers and all ancestors of peers, act as additional DHT paths that nodes know about “for free” (since they were learned by necessity as part of the spanning tree setup).</p>
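<p>The “Price Is Right” rule above can be sketched as follows. The <code class="language-plaintext highlighter-rouge">key</code> type, the <code class="language-plaintext highlighter-rouge">keyOf</code> helper and the <code class="language-plaintext highlighter-rouge">nextHop</code> function are hypothetical, and the sketch assumes byte-wise ordering of keys. It only picks a target key from the set a node knows about. It does not model the paths themselves.</p>

```go
package main

import (
	"bytes"
	"fmt"
)

// key stands in for an ed25519 public key (hypothetical type for this sketch).
type key [32]byte

// keyOf builds a key whose first byte is b, which is enough to order test keys.
func keyOf(b byte) key {
	var k key
	k[0] = b
	return k
}

func less(a, b key) bool { return bytes.Compare(a[:], b[:]) < 0 }

// nextHop returns the known key that is highest without being higher than
// dest. ok is false when every known key overshoots the destination.
func nextHop(known []key, dest key) (best key, ok bool) {
	for _, k := range known {
		if less(dest, k) {
			continue // k is higher than the destination, so it is not eligible
		}
		if !ok || less(best, k) {
			best, ok = k, true
		}
	}
	return best, ok
}

func main() {
	known := []key{keyOf(10), keyOf(40), keyOf(70)}
	best, ok := nextHop(known, keyOf(50))
	fmt.Println(best[0], ok)
}
```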

<p>To discover their keyspace neighbors, nodes with no predecessor (or with a predecessor based on an outdated version of the tree) periodically send a bootstrap packet. The bootstrap packet is routed via the DHT until it hits a dead end, which happens at the node’s predecessor. The bootstrap contains the sender’s treespace distance label, which the predecessor uses to send a bootstrap acknowledgement message. The acknowledgement includes the predecessor’s treespace distance label, which the original node uses to set up a path for the DHT.</p>

<p>Because the new DHT rules are based around forwarding, rather than lookups, DHT searches (and the crawling operations that come with them) are no longer part of the network. Instead of looking up a path to a node, nodes simply forward traffic towards the destination key via the DHT. While establishing a session, the nodes set up a source route and transparently switch to it in the background.</p>

<p>To figure out how many DHT entries are needed in the network, consider the following:</p>

<ol>
  <li>Each node sets up at most 1 DHT path to other nodes (its predecessor).</li>
  <li>Each path has 1 DHT entry in the routing table of each node in the path.</li>
  <li>The longest possible path between two nodes is one which goes through the root.</li>
  <li>In a stable network, nodes select a parent which minimizes the latency of the path from the root.</li>
  <li>If we assume latency is proportional to hop count, then this minimizes the hop count.</li>
  <li>This means the longest path we expect on the tree is equal to at most the diameter of the network <code class="language-plaintext highlighter-rouge">d</code>.</li>
  <li>This means the longest possible path between two nodes via treespace is <code class="language-plaintext highlighter-rouge">2d</code>.</li>
  <li>Therefore, in an <code class="language-plaintext highlighter-rouge">n</code>-node network, there are <code class="language-plaintext highlighter-rouge">O(nd)</code> DHT entries across all nodes, or an average of <code class="language-plaintext highlighter-rouge">O(d)</code> entries per node (but with no bounds on how that’s distributed across nodes).</li>
  <li>Internet-like (scale-free) graphs are observed to have a diameter that scales slowly with network size, most likely <code class="language-plaintext highlighter-rouge">d~logn</code> (or possibly <code class="language-plaintext highlighter-rouge">d~loglogn</code>).</li>
  <li>Therefore, we expect <code class="language-plaintext highlighter-rouge">O(nlogn)</code> total DHT routing table entries in a large internet-like network, or <code class="language-plaintext highlighter-rouge">O(logn)</code> average state per node (with no bounds on the variance).</li>
</ol>

<p>That works out to the same average state per node as in most popular DHT implementations, so this may scale OK in practice. However, we expect that distribution to be skewed, with a large number of nodes having very few entries, which plausibly could mean that some small fraction of nodes have <code class="language-plaintext highlighter-rouge">O(n)</code> routing table entries in the worst case. These nodes are the same nodes that would need to carry per-path keep-alive traffic if intermediate nodes did not store routing state, so in reality we’re trading one resource for another (possibly higher memory use in exchange for lower bandwidth consumption).</p>

<h6 id="source-routing">Source routing</h6>

<p>The DHT seems quite reliable in benchmarks (using e.g. <a href="https://github.com/mwarning/meshnet-lab">meshnet-lab</a>), but it is not efficient: paths through the DHT keyspace typically have higher stretch than paths through treespace. In addition, while coord flapping and node join/leave events cause less disruption for the new DHT, that’s not the same as no disruption. To combat this, nodes transparently switch to source routing, and fall back to routing on the DHT only when no source route is known or when a source route reaches a dead end.</p>

<p>To be specific, when a node A sends traffic to node B, A adds B to a list of nodes it cares about. If B also cares about A, then when B receives traffic from A, B sends a notification packet back to A (containing B’s treespace distance label). If A receives a notification from a node it cares about (in this case, B), then A sends a pathfinding packet via the tree. At each hop along the way, the forwarding node appends the port number that leads back to the previous node to a reverse route at the end of the packet. When B receives the pathfinding packet, it sends an acknowledgement back to A, using the reverse route back to A as a source route. The acknowledgement builds up its own reverse route, which represents a path back to B. When A receives the acknowledgement, A stores the reverse route as its source route to B. Subsequent traffic is sent to B via that source route, with infrequent (once per minute) checks for a new source route. The same process occurs in reverse, with A sending a notification to B, when A receives DHT-routed traffic from B.</p>
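<p>To make the reverse route idea above more concrete, here is a small simulation. Everything in it is hypothetical: the <code class="language-plaintext highlighter-rouge">node</code> type, the port numbers, and in particular the assumption that the accumulated list has to be read back to front when it is used as a source route. The real wire format may order things differently.</p>

```go
package main

import "fmt"

// node is a hypothetical stand-in for a router on the path.
type node struct {
	name     string
	backPort uint64 // local port that leads to the previous node on the path
}

// pathfind simulates a pathfinding packet crossing the given nodes in order
// (every node after the sender). Each node appends its own backPort.
func pathfind(hops []node) []uint64 {
	reverse := make([]uint64, 0, len(hops))
	for _, h := range hops {
		reverse = append(reverse, h.backPort)
	}
	return reverse
}

// sourceRoute converts an accumulated reverse route into the order in which
// the ports are consumed on the way back: the receiver's own port first.
func sourceRoute(reverse []uint64) []uint64 {
	route := make([]uint64, len(reverse))
	for i, p := range reverse {
		route[len(reverse)-1-i] = p
	}
	return route
}

func main() {
	// The packet travels A -> X -> Y -> B.
	hops := []node{{"X", 3}, {"Y", 1}, {"B", 5}}
	reverse := pathfind(hops)
	fmt.Println("accumulated:", reverse)
	fmt.Println("route B -> A:", sourceRoute(reverse))
}
```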

<p>If node A’s source routed traffic to B hits a dead end, due to e.g. a link failure in the network, then the source route is stripped from the packet, and the packet is routed the rest of the way over the DHT. When B receives the DHT-routed packet from A, this immediately and automatically causes B to send a new notify (possibly with B’s new treespace distance label) back to A, which leads to A discovering a new source route to B.</p>
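<p>The fallback behaviour above might look something like the following per-hop decision. This is a sketch under assumptions: the <code class="language-plaintext highlighter-rouge">packet</code> type, the <code class="language-plaintext highlighter-rouge">step</code> function and the <code class="language-plaintext highlighter-rouge">linkUp</code> callback are all made up for illustration, and the notification that B sends back to A is left out.</p>

```go
package main

import "fmt"

// packet carries the part of the source route that has not been used yet.
// An empty route means "route via the DHT" (hypothetical representation).
type packet struct {
	route []uint64
}

// step processes a packet at one node. It returns the port to send on when
// the source route can be followed, or viaDHT=true when the node must fall
// back to DHT routing.
func step(p *packet, linkUp func(port uint64) bool) (port uint64, viaDHT bool) {
	if len(p.route) > 0 && linkUp(p.route[0]) {
		port = p.route[0]
		p.route = p.route[1:]
		return port, false
	}
	p.route = nil // dead end or no route: strip it and let the DHT take over
	return 0, true
}

func main() {
	p := &packet{route: []uint64{2, 7}}
	linkUp := func(port uint64) bool { return port != 7 } // pretend port 7 has failed
	for i := 0; i < 2; i++ {
		port, viaDHT := step(p, linkUp)
		fmt.Println(port, viaDHT)
	}
}
```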

<p>This gives us the best of both worlds. In a stable network, traffic is source routed along the familiar treespace routes from v0.3.X and earlier. If the path between A and B is stable, but the network is not, then the source route continues to work while the tree and DHT try to catch up to changes in network state. If the destination node is mobile, then source routed traffic gets as far as it can before falling back to the DHT, which often puts the packet close to the destination (minimizing the stretch added by the DHT fallback).</p>

<h5 id="encryption">Encryption</h5>

<p>The encryption and session logic has seen some minor changes as well. Since node IDs are now based on ed25519 keys, we no longer have permanent curve25519 keys to use for the Diffie–Hellman key exchange. Also, in older versions, we would perform one ephemeral key exchange, and then keep using the same key pair for the life of the session. The new version uses a ratcheting system, where keys are rotated after each round trip (or whenever a sender nonce overflows). This should offer better forward secrecy than the previous code, though it’s still subject to change in future versions (if we find something off-the-shelf that we’re happier with).</p>

<h4 id="conclusion">Conclusion</h4>

<p>We are looking forward to releasing Yggdrasil v0.4 and are optimistic that the benefits will significantly outweigh any disruption caused at this stage. We’ve also made a number of other fixes and developed both iOS and Android apps, which we will talk more about soon.</p>

<p>We will be continuing to perfect the release candidates and will make announcements both in our <strike>Matrix channel</strike> and on the blog around the time of the release. Please stay tuned for updates!</p>

<p>As always, please bear in mind that Yggdrasil is not production-grade software and we ask you to continue to <a href="https://github.com/yggdrasil-network/yggdrasil-go/issues">report problems to us on GitHub</a>.</p>]]></content><author><name>Neil Alexander, Arceliar</name></author><summary type="html"><![CDATA[Version 0.4 is coming soon]]></summary></entry><entry><title type="html">Release v0.3.13</title><link href="https://yggdrasil-network.github.io/2020/02/21/release-v0-3-13.html" rel="alternate" type="text/html" title="Release v0.3.13" /><published>2020-02-21T09:00:00+00:00</published><updated>2020-02-21T09:00:00+00:00</updated><id>https://yggdrasil-network.github.io/2020/02/21/release-v0-3-13</id><content type="html" xml:base="https://yggdrasil-network.github.io/2020/02/21/release-v0-3-13.html"><![CDATA[<h3 id="release-time">Release time!</h3>

<p>Our last Yggdrasil release, v0.3.12, was merged a couple of months ago at the
end of November. For the most part we have seen good stability with the v0.3.12
builds, not to mention good adoption (with the crawler showing over 500 nodes
running it). Today we are releasing our next version, v0.3.13.</p>

<p>Many of our releases tend not to warrant blog post entries, especially given
that the changelog documents the changes. However, there are some fairly big
news points associated with this version, so this post aims to discuss them in
a bit more detail.</p>

<h4 id="tun-adapter-changes">TUN adapter changes</h4>

<p>The first big talking point is that this is the first Yggdrasil release that
departs entirely from the Water library and replaces it with the Wireguard TUN
library. There are a few reasons why we decided to switch from Water to the
Wireguard library, but one of the most prominent is that it gives us better TUN
support across all platforms and allows us to finally remove TAP support
altogether.</p>

<p>At a high-level, TUN interfaces are effectively emulating “Layer 3” interfaces -
they deal only in IP packets - whereas TAP interfaces are emulating “Layer 2”
full-fat Ethernet interfaces.</p>

<p>To run in TAP mode, Yggdrasil not only had to add and remove Ethernet headers
for each packet, but it also had to include a complete NDP implementation and
track MAC addresses in order to trick the host operating system into believing
that there was a real Ethernet domain on the other end of the adapter. Needless
to say, the amount of boilerplate code needed to make TAP mode work correctly
was significant and much of that code was very fragile.</p>
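<p>To give a feel for the per-packet work involved, here is a minimal sketch of adding and stripping an Ethernet II header around an IPv6 packet. It is not the code that was removed from Yggdrasil; the function names are made up for this example, and it leaves out NDP and MAC address tracking, which were the much larger part of the problem.</p>

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	ethHeaderLen  = 14     // 6 bytes dst MAC + 6 bytes src MAC + 2 bytes EtherType
	etherTypeIPv6 = 0x86DD // EtherType value for IPv6
)

// addEthernet wraps an IPv6 packet in an Ethernet II header.
func addEthernet(dst, src [6]byte, ipPacket []byte) []byte {
	frame := make([]byte, ethHeaderLen+len(ipPacket))
	copy(frame[0:6], dst[:])
	copy(frame[6:12], src[:])
	binary.BigEndian.PutUint16(frame[12:14], etherTypeIPv6)
	copy(frame[ethHeaderLen:], ipPacket)
	return frame
}

// stripEthernet removes the Ethernet II header and returns the IPv6 packet.
func stripEthernet(frame []byte) ([]byte, error) {
	if len(frame) < ethHeaderLen {
		return nil, errors.New("frame shorter than an Ethernet header")
	}
	if binary.BigEndian.Uint16(frame[12:14]) != etherTypeIPv6 {
		return nil, errors.New("not an IPv6 frame")
	}
	return frame[ethHeaderLen:], nil
}

func main() {
	dst := [6]byte{0x02, 0, 0, 0, 0, 0x01}
	src := [6]byte{0x02, 0, 0, 0, 0, 0x02}
	frame := addEthernet(dst, src, []byte("ipv6 packet bytes"))
	packet, err := stripEthernet(frame)
	fmt.Println(len(frame), string(packet), err)
}
```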

<p>Although we implemented NDP, we did not ever get around to implementing ARP,
which also meant that sending tunnel-routed IPv4 traffic over TAP interfaces
invariably did not work either. We have now been able to remove much of this
code and simplify the TUN code massively, closing the gaps between some of our
supported platforms.</p>

<p>There is one platform that is negatively impacted by this change and that’s
NetBSD. The Wireguard TUN package that we are using currently has <strong>no support
for NetBSD</strong>, so we are also removing NetBSD as a supported target until the
necessary code appears upstream. To our knowledge, we don’t have a base of
NetBSD users anyway, but we will aim to re-add this soon.</p>

<p>The <code class="language-plaintext highlighter-rouge">IfTAPMode</code> configuration option has now been removed from Yggdrasil
entirely and it will be ignored if specified. <strong>If you are using TAP mode today,
then this will affect you</strong>. Please make sure to check your Yggdrasil
configuration since this may result in interface naming changes and you may have
to update network settings in your host operating system.</p>

<p>Initially we added TAP support into Yggdrasil as it was the only way to support
Windows, since the OpenVPN driver that we used at the time only supported TAP
mode. Thankfully, this is no longer a problem, as the Wireguard project have
also released <a href="https://wintun.net">Wintun</a>, which is supported by the Wireguard
TUN library. The net result is that we gain TUN support on Windows and the
performance is <em>far</em> better than the buggy OpenVPN driver, which is a nice segue
into…</p>

<h4 id="windows-installer-and-performance">Windows installer and performance</h4>

<p>We have spent a lot of time trying to improve the installation and setup
experience on Windows. This mostly falls into two areas.</p>

<p>The first is that using the Wintun driver has <em>massively</em> improved performance,
in some cases by hundreds of MB/s, and starting the Yggdrasil process is now
much more reliable too - it should no longer be necessary to restart Yggdrasil
due to cases of the TAP adapter not being set up or configured correctly.</p>

<p>The second is that we now automatically generate Windows <code class="language-plaintext highlighter-rouge">.msi</code> installers using
Appveyor, which means that installing or upgrading Yggdrasil is now simpler than
ever. It is no longer necessary to create directories, copy files and register
Windows services by hand - a marked improvement!</p>

<p>The installer also bundles the Wintun driver and it is installed automatically
if required, therefore there is no longer a need to hunt down and install the
OpenVPN TAP driver separately. We hope that these changes will help to encourage
adoption of Yggdrasil on Windows platforms by significantly reducing the barrier
to entry.</p>

<p>As in the previous section, Yggdrasil on Windows has gone from supporting TAP
mode only to now supporting TUN mode only. <strong>This may mean that you need to
review your configuration</strong>. If you no longer need the OpenVPN TAP driver on
your system, it is best to entirely uninstall it. It is also important to make
sure that the <code class="language-plaintext highlighter-rouge">IfName</code> configuration option in your <code class="language-plaintext highlighter-rouge">yggdrasil.conf</code> does not
specify the same name as an existing OpenVPN TAP interface or Yggdrasil may fail
to start.</p>

<h4 id="end-of-the-v03-release-cycle">End of the v0.3 release cycle</h4>

<p>Generally we try, where possible, to avoid making any changes which would damage
backward compatibility with previous versions. The last version that had
breaking changes was v0.2.1 - over a year and a half ago. However, maintaining
backward compatibility so tightly also prevents us from improving the Yggdrasil
design in various ways.</p>

<p>Therefore, unless any serious bugs or security vulnerabilities appear, it is
very likely that this version will be the last in the v0.3 release cycle.
Instead, we will start working on the v0.4 release, which is likely to include a
number of breaking protocol changes and will be incompatible with v0.3 releases
as a result.</p>

<p>More information will be announced on the types of changes in v0.4 as they
happen - expect to see more blog posts and chatter in the <strike>Matrix channel</strike> on this
subject - but we will aim to give as much notice as possible before releases
occur that contain breaking changes.</p>

<h4 id="final-mentions">Final mentions</h4>

<p>In addition to the release notes above, I’d like to relay the message that
<a href="https://github.com/mwarning">@mwarning</a> has a proposal open for a Google Summer
of Code (GSoC) project under the Freifunk umbrella, comparing a number of mesh
routing protocols including Yggdrasil. More information about the proposal is
available <a href="https://projects.freifunk.net/#/projects?project=freifunk_meshnet_protocol_evaluation&amp;lang=en">here</a>.
If you are interested, please reach out!</p>]]></content><author><name>Neil Alexander</name></author><summary type="html"><![CDATA[Release time!]]></summary></entry><entry><title type="html">Acting out</title><link href="https://yggdrasil-network.github.io/2019/09/01/actors.html" rel="alternate" type="text/html" title="Acting out" /><published>2019-09-01T21:00:00+00:00</published><updated>2019-09-01T21:00:00+00:00</updated><id>https://yggdrasil-network.github.io/2019/09/01/actors</id><content type="html" xml:base="https://yggdrasil-network.github.io/2019/09/01/actors.html"><![CDATA[<h3 id="overture">Overture</h3>

<p>We’ve recently rewritten much of Yggdrasil’s internals to change from Go’s native <a href="https://en.wikipedia.org/wiki/Communicating_sequential_processes">communicating sequential processes</a> (goroutine+channel) style to using an asynchronous <a href="https://en.wikipedia.org/wiki/Actor_model">actor model</a> approach to concurrency. While this change should be invisible to the average user, it dramatically changes what we developers need to think about when working on the code. I thought it would be useful to explain a little about the motivation for rewriting things this way, and what the consequences are.</p>

<p>Caution: theatre puns and references throughout, because <code class="language-plaintext highlighter-rouge">Actor</code>s.</p>

<h3 id="exposition">Exposition</h3>

<p>Yggdrasil is written in the Go programming language. Go makes it easy to start a function running concurrently, and gives developers the tools they need to make concurrently executing functions communicate, but it’s not always easy to use them correctly. To be clear, the things I’m about to rant about are all fixable. Working around them is a normal thing to do in Go. More importantly, it’s a case where doing things the obvious way (which is sometimes even safe in isolation) leads to <em>wrong</em> behavior in a larger program. I prefer models where the obvious thing is still correct, and non-obvious things are only needed as a performance optimization.</p>

<h4 id="composition">Composition</h4>

<p>There’s a common pattern that has emerged many times in the Yggdrasil code base. We’ll have a <code class="language-plaintext highlighter-rouge">struct</code> with some mutable fields that need reading or updating, such as information about a particular cryptographic session, or the switch’s table of idle peers and buffered traffic. Since shared mutable state is hard, and Go is all about “<a href="https://blog.golang.org/share-memory-by-communicating">Share Memory By Communicating</a>”, we’ll have packets get passed to a dedicated worker goroutine that “owns” that particular <code class="language-plaintext highlighter-rouge">struct</code>. The worker uses information from the packet and the owned <code class="language-plaintext highlighter-rouge">struct</code> to do whatever it needs to do, updates these things accordingly, and passes the packet along to the next goroutine in the pipeline.</p>

<p>This often results in a “<code class="language-plaintext highlighter-rouge">for select</code>” pattern, where goroutines sit in an infinite <code class="language-plaintext highlighter-rouge">for</code> loop and <code class="language-plaintext highlighter-rouge">select</code> on several channels, to wait for packets to process or various types of signals from other goroutines. There are a few ways around it (with heavy use of <code class="language-plaintext highlighter-rouge">reflect</code> or <code class="language-plaintext highlighter-rouge">chan interface{}</code>, for example), but in most cases, every <code class="language-plaintext highlighter-rouge">select</code> statement needs to fully enumerate every behavior that the goroutine may need to engage in at that point in the code. If there’s a common set of <code class="language-plaintext highlighter-rouge">case</code>s that always need to be handled, and then a few exceptional <code class="language-plaintext highlighter-rouge">case</code>s that may or may not matter (possibly when the associated <code class="language-plaintext highlighter-rouge">struct</code>s the workers are using are similar but not exactly the same types, or as the state of a <code class="language-plaintext highlighter-rouge">struct</code>’s fields change), then that typically involves multiple <code class="language-plaintext highlighter-rouge">select</code> statements with only the addition or modification of one or two <code class="language-plaintext highlighter-rouge">case</code>s.</p>
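<p>For readers who haven’t run into it, the pattern described above looks roughly like this. The <code class="language-plaintext highlighter-rouge">session</code> type and the channel names are invented for the example and are much simpler than anything in the real code base.</p>

```go
package main

import "fmt"

// session is a hypothetical struct owned by exactly one worker goroutine.
type session struct {
	packets int
}

// worker owns s and is the only goroutine that touches it. Every behavior it
// can engage in has to appear as a case in this one select statement.
func worker(s *session, in <-chan []byte, reset <-chan struct{}, stop <-chan struct{}, done chan<- int) {
	for {
		select {
		case <-in:
			s.packets++
		case <-reset:
			s.packets = 0
		case <-stop:
			done <- s.packets
			return
		}
	}
}

func main() {
	in := make(chan []byte)
	reset := make(chan struct{})
	stop := make(chan struct{})
	done := make(chan int)
	go worker(&session{}, in, reset, stop, done)
	for i := 0; i < 3; i++ {
		in <- []byte("packet")
	}
	close(stop)
	fmt.Println("packets handled:", <-done)
}
```

<p>A second worker that needs the same cases plus one more cannot reuse this <code class="language-plaintext highlighter-rouge">select</code>. It has to be written out again in full.</p>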

<p>Go embraces composition in its type system, but <code class="language-plaintext highlighter-rouge">select</code> statements (and channel operations in general) make execution resistant to composition.</p>

<h4 id="deadlocks">Deadlocks</h4>

<p>The “<code class="language-plaintext highlighter-rouge">for select</code>” pattern is safe, as far as I know, if the flow of messages through the program forms a directed acyclic graph. However, in our case, cycles emerge if we try to handle things in the obvious way. For example, a cryptographic session needs to somehow get outbound encrypted traffic to the switch, but incoming encrypted traffic also needs to make it from the switch to the sessions for decryption (via the router, which is responsible for, among other things, identifying which session is associated with the traffic).</p>

<p>When cycles of goroutines naively pass messages over channels, deadlocks are all but inevitable. There are a few ways to address this, but they’re not always appropriate. Ideally, we would change the design to remove cycles, but this is not always possible, and may require significant changes to the workflow in cases where it is possible. In practice, what we’d actually do is either buffer messages (having some dedicated reader goroutine to take the message, add it to a slice, and then pass it to the real destination ASAP) or drop messages entirely (with a <code class="language-plaintext highlighter-rouge">select</code> statement that aborts and does cleanup in a <code class="language-plaintext highlighter-rouge">default</code> case, or by having a dedicated reader that drops messages more intelligently, such as from the front of the queue, under the assumption that older messages are less useful).</p>
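<p>As a sketch of the two “drop” strategies mentioned above: a non-blocking send that gives up via a <code class="language-plaintext highlighter-rouge">default</code> case, and a bounded queue that discards from the front. Both helper names are made up for this example, and the queue helper assumes a limit of at least 1.</p>

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether msg was queued.
func trySend(ch chan<- []byte, msg []byte) bool {
	select {
	case ch <- msg:
		return true
	default:
		return false // the receiver is not keeping up: drop rather than block
	}
}

// pushDropOldest appends msg to a bounded queue, discarding the oldest entry
// when the queue is already at its limit (limit is assumed to be >= 1).
func pushDropOldest(queue [][]byte, msg []byte, limit int) [][]byte {
	if len(queue) >= limit && len(queue) > 0 {
		queue = queue[1:]
	}
	return append(queue, msg)
}

func main() {
	ch := make(chan []byte, 1)
	fmt.Println(trySend(ch, []byte("first")), trySend(ch, []byte("second")))

	var queue [][]byte
	for _, m := range []string{"a", "b", "c"} {
		queue = pushDropOldest(queue, []byte(m), 2)
	}
	fmt.Println(len(queue), string(queue[0]), string(queue[1]))
}
```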

<h4 id="leaks">Leaks</h4>

<p>Typically, when a goroutine is started, it continues to run until either the function returns or the program exits. For this reason, if a goroutine executes any statements which can block (such as a channel operation), it’s important to include some <code class="language-plaintext highlighter-rouge">case</code> which signals that it’s time to return. Forgetting to do this can result in goroutine leaks. <a href="https://dave.cheney.net/2016/12/22/never-start-a-goroutine-without-knowing-how-it-will-stop">Never start a goroutine without knowing how it will stop</a>, or so the experts say.</p>

<p>This is sometimes harder than it needs to be. To be blunt, the single producer N consumer cases are fine, you just close the channel and have all the consumers take this as a signal to exit. Anything involving multiple producers requires some sort of signaling to indicate that all producers have exited. Since you’re using a channel already, the obvious option is a <code class="language-plaintext highlighter-rouge">select</code> statement with another channel that closes to signal shutdown, and then something like e.g. a <a href="https://golang.org/pkg/sync/#WaitGroup"><code class="language-plaintext highlighter-rouge">sync.WaitGroup</code></a> to wait for all producers to exit before closing the channel. Until your number of producers needs to change at runtime, and you realize that this races if you start to <code class="language-plaintext highlighter-rouge">Wait</code> before <code class="language-plaintext highlighter-rouge">Add</code>ing everything to the group, so you need to implement a custom counter, and be careful that additions and subtractions can also race and cause it to shut down early. And have fun solving it, because with how much <code class="language-plaintext highlighter-rouge">select</code> resists composition and code reuse, you’re going to be implementing the same patterns over, and over, and over, and over…</p>
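<p>For reference, the “easy” version of the multi-producer case described above (a fixed number of producers, known up front) can be sketched like this. The <code class="language-plaintext highlighter-rouge">runProducers</code> function is hypothetical. Note that it only works because every <code class="language-plaintext highlighter-rouge">Add</code> happens before <code class="language-plaintext highlighter-rouge">Wait</code> can start, which is exactly the property that breaks once producers come and go at runtime.</p>

```go
package main

import (
	"fmt"
	"sync"
)

// runProducers starts n producers that each send perProducer values, and
// returns the sum of everything the consumer received. Closing stop tells the
// producers to exit early.
func runProducers(n, perProducer int, stop <-chan struct{}) int {
	out := make(chan int)
	var wg sync.WaitGroup
	wg.Add(n) // register every producer before anything can call Wait
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			for j := 0; j < perProducer; j++ {
				select {
				case out <- 1:
				case <-stop:
					return
				}
			}
		}()
	}
	// Only close the channel once every producer has exited.
	go func() {
		wg.Wait()
		close(out)
	}()
	total := 0
	for v := range out {
		total += v
	}
	return total
}

func main() {
	stop := make(chan struct{}) // never closed here, so all values arrive
	fmt.Println(runProducers(4, 25, stop))
}
```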

<p>It’s not that this is some impossible problem to solve, it’s just that Go’s take on the <a href="https://en.wikipedia.org/wiki/Communicating_sequential_processes">CSP</a>, combined with the rest of the tools the language gives you, makes it easy and concise to run things the <em>wrong</em> way, and leads to comparatively complex and delicate code when trying to run it the right way. At least, that’s my personal view of it based on my experience so far, but it probably varies some based on the problem the code is trying to solve.</p>

<h3 id="rising-action">Rising action</h3>

<p>The <a href="https://en.wikipedia.org/wiki/Actor_model">actor model</a> is another programming paradigm that embraces concurrency with a “share memory by communicating” philosophy.</p>

<p>For our purposes, an actor is basically a data type with a few special properties:</p>
<ol>
  <li>It has an inbox where messages to the actor are placed.</li>
  <li>It has an associated unit of execution, such as a thread, which processes messages from the inbox one at a time.</li>
  <li>Rather than exposing ordinary functions for other code to call, the actor exposes <em>behaviors</em>. A behavior is a function which has no return value, and is executed only for its side effects. When an actor <code class="language-plaintext highlighter-rouge">A</code> calls a behavior of an actor <code class="language-plaintext highlighter-rouge">B</code>, what really happens is that <code class="language-plaintext highlighter-rouge">A</code> places a message in <code class="language-plaintext highlighter-rouge">B</code>’s inbox, and <code class="language-plaintext highlighter-rouge">B</code> processes that message by executing some code.</li>
</ol>

<p>Different implementations differ on details after that, such as what order messages are processed in, if actors are allowed to wait for a particular type of message before continuing, whether actors run locally or are distributed across a cluster, etc., but they tend to all include some version of the broad strokes above.</p>
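<p>To make those three properties concrete, here is a deliberately naive actor built from a mutex-protected queue. It is only a sketch of the general idea: the <code class="language-plaintext highlighter-rouge">actor</code> and <code class="language-plaintext highlighter-rouge">counter</code> types are invented for this example, there is no backpressure, and real actor libraries are implemented quite differently.</p>

```go
package main

import (
	"fmt"
	"sync"
)

// actor holds an inbox of pending messages. A goroutine exists only while
// there are messages to process.
type actor struct {
	mu      sync.Mutex
	inbox   []func()
	running bool
}

// act places a message in the inbox and starts a goroutine to drain it if
// none is running. It never blocks on the receiver.
func (a *actor) act(msg func()) {
	a.mu.Lock()
	a.inbox = append(a.inbox, msg)
	start := !a.running
	a.running = true
	a.mu.Unlock()
	if start {
		go a.run()
	}
}

// run processes messages one at a time, in order, until the inbox is empty.
func (a *actor) run() {
	for {
		a.mu.Lock()
		if len(a.inbox) == 0 {
			a.running = false
			a.mu.Unlock()
			return
		}
		msg := a.inbox[0]
		a.inbox = a.inbox[1:]
		a.mu.Unlock()
		msg()
	}
}

// counter is an example actor. Its field n is only touched from messages.
type counter struct {
	actor
	n int
}

// add is a behavior: no return value, executed only for its side effect.
func (c *counter) add(delta int) { c.act(func() { c.n += delta }) }

// read blocks until the actor reports its value. Blocking like this is for
// demonstration only and is not something one actor should do to another.
func (c *counter) read() int {
	out := make(chan int)
	c.act(func() { out <- c.n })
	return <-out
}

func main() {
	c := &counter{}
	for i := 0; i < 100; i++ {
		c.add(1)
	}
	fmt.Println(c.read())
}
```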

<h3 id="turing-point">Turing point</h3>

<!-- a play on "Turning point", aka the Climax of a classic 5-act play structure, which is what this post's structure is modeled after if you hadn't figured it out by this point -->

<p>I’m particularly fond of the <a href="https://ponylang.io">pony</a> programming language’s take on the actor model. I really can’t say enough nice things about their approach, and fully describing it is beyond the scope of this blog post, but if you come out of here with an interest in the actor model, then I highly recommend checking out that language. Maybe watch a few of the talks from the developers that have been posted to YouTube, or read their papers about what is <em>easily</em> the most promising approach to garbage collection I’ve ever come across.</p>

<p>Anyway, I don’t actually work on anything written in pony, but I like their version of the actor model so much that I decided to see if I could trick Go’s runtime into faking it. The result is <a href="https://github.com/Arceliar/phony"><code class="language-plaintext highlighter-rouge">phony</code></a>, which manages to do most of what I want in under 70 lines of code. When we write code using this asynchronous message passing style, instead of ordinary goroutines+channels, the implications are pretty significant:</p>

<ol>
  <li>There are no deadlocks. Message sends always succeed, and are quite fast (it doesn’t even require <a href="https://en.wikipedia.org/wiki/Compare-and-swap">CAS</a> instructions in the normal case).</li>
  <li>Inbox sizes stay small due to backpressure: if the sender sees that the receiver’s inbox has too many pending messages, it will schedule itself to stop at some deadlock-free safe point in the future, to wait until the receiver signals that it’s handled the message.</li>
  <li><code class="language-plaintext highlighter-rouge">Actor</code>s are <em>shockingly</em> lightweight: on a modern 64-bit processor, an idle <code class="language-plaintext highlighter-rouge">Actor</code>’s only resources are 24 bytes for an empty <code class="language-plaintext highlighter-rouge">Inbox</code>, some of which is padding that may not apply if embedded into a struct. In particular, an idle <code class="language-plaintext highlighter-rouge">Actor</code> with an empty <code class="language-plaintext highlighter-rouge">Inbox</code> has no associated goroutine, so it requires no stack.</li>
  <li>The lack of a goroutine also means that idle <code class="language-plaintext highlighter-rouge">Actor</code>s, even cycles of <code class="language-plaintext highlighter-rouge">Actor</code>s, can be garbage collected automatically.</li>
  <li>Any <code class="language-plaintext highlighter-rouge">struct</code> that embeds an <code class="language-plaintext highlighter-rouge">Inbox</code> satisfies the <code class="language-plaintext highlighter-rouge">Actor</code> interface. Since <code class="language-plaintext highlighter-rouge">Actor</code>s encapsulate their own unit of execution, the range of behaviors that unit of execution can engage in is encoded into the type system and can even be abstracted through <code class="language-plaintext highlighter-rouge">interface</code> types. In my opinion, the resulting code is cleaner, easier to read and understand, and far easier to reuse or extend than the <code class="language-plaintext highlighter-rouge">for select</code> pattern from goroutine+channel use.</li>
</ol>

<h3 id="falling-action">Falling action</h3>

<p>I’m happy enough with the current state of <code class="language-plaintext highlighter-rouge">phony</code> that I decided to start migrating the <code class="language-plaintext highlighter-rouge">yggdrasil-go</code> code base to use it. This is still a work in progress (there are some non-<code class="language-plaintext highlighter-rouge">Actor</code> goroutines around the edges of the code, mostly in main <code class="language-plaintext highlighter-rouge">Accept</code> loops and that sort of thing), but the hot paths are now <code class="language-plaintext highlighter-rouge">Actor</code>-based.</p>

<p>Most of this was done in a weekend and came together with surprisingly little pain. I had exactly 2 crashes the entire time (1 accidental <code class="language-plaintext highlighter-rouge">nil</code> pointer dereference and 1 legitimate bug I needed to fix in <code class="language-plaintext highlighter-rouge">phony</code>), and more importantly, 0 deadlocks. Most things just worked as intended the first time they compiled. There were a few bugs to work out when I was rewriting the <code class="language-plaintext highlighter-rouge">link</code> code, but nothing compared to the mess I had to deal with when writing the old code (which was a couple of horrifying interdependent <code class="language-plaintext highlighter-rouge">for select</code> loops to build a state machine).</p>

<p>So by now you’re probably wondering what any of this looks like in practice. Just to give a generic example, suppose we have some struct with an exported function that needs to run code on a worker goroutine. We could end up with something like the following when writing Go in the CSP style:</p>

<pre><code class="language-Go">
// This is the function we want the worker to run.
func (n *NonActorStruct) theFunction(arg1 Type1, arg2 Type2) {
    // this is where the code we actually care about goes, the rest is basically boilerplate
}

// This is the struct that we want the worker to own and manipulate.
type NonActorStruct struct {
    inputForTheFunction chan argsForTheFunction
    // fields we care about, plus maybe more channels for other things
}

// Needed to initialize the channel to a working state
func NewNonActorStruct() *NonActorStruct {
    n := NonActorStruct{
        inputForTheFunction: make(chan argsForTheFunction),
    }
    return &amp;n
}

// This is just a helper struct to carry arguments for the function.
type argsForTheFunction struct {
    Arg1 Type1
    Arg2 Type2
}

// This is the function we export.
func (n *NonActorStruct) RunTheFunction(arg1 Type1, arg2 Type2) {
    n.inputForTheFunction &lt;- argsForTheFunction{arg1, arg2}
}

// This is needed to start the worker, otherwise things block.
func (n *NonActorStruct) Start() {
    go func() {
        for {
            select {
            // cases for other things we may need to do would also be here
            // presumably at least one is involved in safely shutting down
            case args := &lt;-n.inputForTheFunction:
                // We could possibly have a switch statement here
                // Then switch on the arg type to pick which function to run
                n.theFunction(args.Arg1, args.Arg2)
            }
        }
    }()
}

// This is needed to stop the worker when we're done.
func (n *NonActorStruct) Stop() {
    // Actual implementation depends on what else the worker does in its loop,
    // but it probably just sends a specific message and/or closes some channel.
}

// Then to use the code, we have something like:
myStruct := NewNonActorStruct()
myStruct.Start()
defer myStruct.Stop() // Or arrange this to happen somewhere else
myStruct.RunTheFunction(arg1, arg2)
</code></pre>

<p>When migrating to the actor model, the basic pattern that emerged was to embed a <code class="language-plaintext highlighter-rouge">phony.Inbox</code> into any <code class="language-plaintext highlighter-rouge">struct</code> we wanted to make into a <code class="language-plaintext highlighter-rouge">phony.Actor</code>, and then define functions of the struct like so:</p>

<pre><code class="language-Go">
// This is the function we want the worker to run.
func (a *ActorStruct) theFunction(arg1 Type1, arg2 Type2) {
    // this is where the code we actually care about goes, the rest is basically boilerplate
}

// This is the struct that we want the worker to own and manipulate.
type ActorStruct struct {
    phony.Inbox // This defines the Act function, satisfying the Actor interface
    // fields we care about
}

// This is the function we export.
func (a *ActorStruct) RunTheFunction(from phony.Actor, arg1 Type1, arg2 Type2) {
    a.Act(from, func() {
        a.theFunction(arg1, arg2)
    })
}

// And then to use it, an Actor x would run something like:
myActor := new(ActorStruct)
myActor.RunTheFunction(x, arg1, arg2)
</code></pre>

<p>And that’s about it. The first argument to <code class="language-plaintext highlighter-rouge">myActor.RunTheFunction</code> is also <code class="language-plaintext highlighter-rouge">nil</code>able: non-<code class="language-plaintext highlighter-rouge">Actor</code> code that needs to send a message can pass <code class="language-plaintext highlighter-rouge">nil</code>, which just means there’s no backpressure to slow down the non-<code class="language-plaintext highlighter-rouge">Actor</code> code if it’s sending messages faster than the <code class="language-plaintext highlighter-rouge">Actor</code> can handle them. A <code class="language-plaintext highlighter-rouge">phony.Block</code> function exists to help non-<code class="language-plaintext highlighter-rouge">Actor</code>s wait for an <code class="language-plaintext highlighter-rouge">Actor</code> to process a message before continuing, since this seems like a common enough use case (especially when a package wants to export a non-<code class="language-plaintext highlighter-rouge">Actor</code> interface that uses <code class="language-plaintext highlighter-rouge">Actor</code> code internally).</p>

<p>What’s great is that we don’t need to think about starting or stopping workers, deadlocks and leaks are not possible outside of blocking operations (e.g. I/O), and we can add or reuse behaviors just as easily as any function. I find the code easier to read and reason about too.</p>

<p>I/O is one rough spot, since an <code class="language-plaintext highlighter-rouge">Actor</code> can block on a <code class="language-plaintext highlighter-rouge">Read</code> or a <code class="language-plaintext highlighter-rouge">Write</code> and not process incoming messages as a result. This isn’t really any worse than working with normal Go code, and the pattern we’ve adopted is to have separate <code class="language-plaintext highlighter-rouge">Actor</code>s for <code class="language-plaintext highlighter-rouge">Read</code> and <code class="language-plaintext highlighter-rouge">Write</code>, where one mostly just sits in a <code class="language-plaintext highlighter-rouge">Read</code> loop and sends the results (and/or error) somewhere whenever a <code class="language-plaintext highlighter-rouge">Read</code> finishes. These two workers can be children of some parent <code class="language-plaintext highlighter-rouge">Actor</code>, which is the only one the rest of the code needs to know about, and then all we need to remember to do is close the <code class="language-plaintext highlighter-rouge">ReadWriteCloser</code> (e.g. socket) at some point when we’re done. This is the sort of thing that we’ll eventually want to write a standard <code class="language-plaintext highlighter-rouge">struct</code> for, update our code everywhere to use it, and then never have to think about it again. In the meantime, we have a couple of very similar implementations for working with sockets or the tun/tap device.</p>
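
<p>The read half of that pattern might look roughly like the sketch below. It uses only the standard library, so the <code class="language-plaintext highlighter-rouge">parent</code> type here is a mutex-protected stand-in for a real parent <code class="language-plaintext highlighter-rouge">Actor</code>, and every name in it (<code class="language-plaintext highlighter-rouge">parent</code>, <code class="language-plaintext highlighter-rouge">handleRead</code>, <code class="language-plaintext highlighter-rouge">readLoop</code>, <code class="language-plaintext highlighter-rouge">collect</code>) is hypothetical rather than taken from our code.</p>

<pre><code class="language-Go">
package main

import (
	"fmt"
	"io"
	"strings"
	"sync"
)

// parent stands in for the parent Actor: it owns the state and is the only
// thing the rest of the code would talk to.
type parent struct {
	mu     sync.Mutex
	chunks []string
	err    error
	done   sync.WaitGroup
}

// handleRead is what the reader calls whenever a Read finishes, passing
// along whatever data and/or error that Read produced.
func (p *parent) handleRead(data []byte, err error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(data) != 0 {
		p.chunks = append(p.chunks, string(data))
	}
	if err != nil {
		p.err = err
	}
}

// readLoop is the reader worker. It sits in a Read loop, forwards each
// result to the parent, and exits on the first error, which is also what
// happens when the underlying connection is closed.
func readLoop(r io.Reader, p *parent, bufSize int) {
	defer p.done.Done()
	for {
		buf := make([]byte, bufSize)
		n, err := r.Read(buf)
		p.handleRead(buf[:n], err)
		if err != nil {
			return
		}
	}
}

// collect runs a reader over the input and returns everything the parent
// received once the reader has stopped.
func collect(input string, bufSize int) ([]string, error) {
	p := new(parent)
	p.done.Add(1)
	go readLoop(strings.NewReader(input), p, bufSize)
	p.done.Wait()
	return p.chunks, p.err
}

func main() {
	chunks, err := collect("hello world", 4)
	fmt.Println(chunks, err)
}
</code></pre>

<p>Because the reader does nothing but block in <code class="language-plaintext highlighter-rouge">Read</code> and forward results, the parent stays free to process other messages, and closing the underlying connection is enough to make the loop end.</p>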

<h3 id="dénouement">Dénouement</h3>

<p>The Go language makes concurrency easy, but for some problems it can be difficult to do safely out-of-the-box. However, the language provides the tools needed to implement an actor model approach very easily. While I won’t claim that the actor model is a panacea for all development woes, Yggdrasil by its very nature requires us to think about networks of nodes communicating asynchronously, so it makes sense to use a programming paradigm that lets us model that approach more explicitly in our code base. Outside of a couple of corner cases (namely blocking I/O for the network sockets and the tun/tap device), we expect this to obviate any need to even think about deadlocks, make development easier moving forward, and generally lead to a better user experience as a result. The code migration is still a work in progress, but <code class="language-plaintext highlighter-rouge">Actor</code>s have replaced <code class="language-plaintext highlighter-rouge">for select</code> workers along the hot paths through the code (minus 1 crypto worker pool in the session code) and will slowly replace synchronization primitives in the remaining code base.
The current code has been merged into our <code class="language-plaintext highlighter-rouge">develop</code> branch, and I’m quite excited to see it land in Yggdrasil <code class="language-plaintext highlighter-rouge">v0.3.9</code>, along with the usual bug fixes and incremental improvements, which we plan to release in the near future.</p>]]></content><author><name>Arceliar</name></author><summary type="html"><![CDATA[Overture]]></summary></entry><entry><title type="html">Meshing using Apple Wireless Direct Link (AWDL)</title><link href="https://yggdrasil-network.github.io/2019/08/19/awdl.html" rel="alternate" type="text/html" title="Meshing using Apple Wireless Direct Link (AWDL)" /><published>2019-08-19T08:00:00+00:00</published><updated>2019-08-19T08:00:00+00:00</updated><id>https://yggdrasil-network.github.io/2019/08/19/awdl</id><content type="html" xml:base="https://yggdrasil-network.github.io/2019/08/19/awdl.html"><![CDATA[<h3 id="wireless-without-borders">Wireless without borders</h3>

<p>I was mostly prompted to write this post in response to a <a href="https://news.ycombinator.com/item?id=20735462">Hacker News
thread</a> recently, which announced
the release of an open-source AirDrop implementation called
<a href="https://github.com/seemoo-lab/opendrop">OpenDrop</a>, from the same team at Seemoo
Lab who produced an open-source implementation of the Apple Wireless Direct Link
(AWDL) protocol called <a href="https://github.com/seemoo-lab/owl">OWL</a>. AWDL is the
secret sauce behind AirDrop, peer-to-peer AirPlay and some other Apple wireless
technologies. Even though everything covered in this post was done some time
ago, I have never spent the time to document it.</p>

<p>With a few exceptions, most wireless networks in the world operate in
“infrastructure mode” which is where a wireless access point serves one or more
wireless clients. Think of your Wi-Fi at home, at work or in a coffee shop.
However, as implied by the name, reliable and usable infrastructure Wi-Fi is
often only available in certain physical locations with “good infrastructure”.
If you wanted to connect some devices together anywhere not served by an
infrastructure Wi-Fi network, or in a location where you can’t suddenly plug in
a wireless access point, you may not have many options (Bluetooth aside).</p>

<p>AWDL is designed to avoid this problem by extending the 802.11 wireless standard
to allow client devices to communicate directly with each other, without the
help of the central wireless access point. You can walk out into a field with a
couple of iPhones or Macs and they can use AWDL to discover each other and
exchange data, peer-to-peer. Even better is that nearby devices that are
connected to different infrastructure Wi-Fi networks can still communicate with
each other using AWDL!</p>

<h3 id="the-science">The science</h3>

<p>Normally, when connected to a wireless access point, wireless clients remain
locked to the specific radio channel that the AP is using. AWDL works by
instructing the wireless adapter in the device to “hop” between channels so that
it can not only remain connected to the wireless access point, but can also
listen to other nearby devices.</p>

<p>Devices announce their presence and information about their services on a
“social channel” for other devices to hear, effectively creating peer-to-peer
service discovery. Once two devices have decided that they want to communicate
directly, they agree to jump to another channel for real data exchange so that
they don’t interrupt existing Wi-Fi networks or, indeed, the social channel.
These “hops” between wireless channels happen so quickly that there’s very
little disruption to what the user is doing with their Wi-Fi connection already
(except for some minor wireless performance degradation - to be covered later).</p>

<p>A number of papers have been published by the OWLink team on the inner workings
of the AWDL protocol, which can be <a href="https://owlink.org/publications/">found
here</a>. In particular, <a href="https://arxiv.org/pdf/1808.03156.pdf">this
paper</a> from Mobicom 2018 contains a
significant amount of detail about the AWDL protocol itself, channel hopping
techniques and security considerations, amongst other things.</p>

<h3 id="mesh-opportunities">Mesh opportunities</h3>

<p>Yggdrasil is designed to create a mesh network automatically out of
interconnected nodes - the idea being that all nodes can route to all other
nodes on the mesh network by routing through other nodes.</p>

<p>Today, many of these connections happen between nodes across the Internet, since
the community is still relatively small and geographically dispersed. A node
joining the Yggdrasil network only needs to peer with a single device that is
already connected to the wider network in order to participate in the
fully-routable mesh.</p>

<p>However, it’s not the goal of Yggdrasil to remain something that we just toy
with over the Internet. We want to build a protocol that can scale globally and
work ad-hoc, even in places where infrastructure might not be particularly
strong otherwise. We think that one of Yggdrasil’s greatest strengths is that it
is very close to zero-configuration, beyond giving it a very small number of
configuration options, and it should scale well too in principle.</p>

<p>Yggdrasil can already discover potential peers on the same network segment by
using multicast service discovery, which sounds a lot like what AWDL does on the
social channel. You can configure which interfaces Yggdrasil beacons on with the
<code class="language-plaintext highlighter-rouge">MulticastInterfaces</code> configuration directive.</p>
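
<p>As a rough sketch of how such a list could select interfaces, assuming each
entry is treated as a regular expression matched against the interface name
(that assumption, the function name and the whole-name anchoring are all choices
made for this example rather than a description of Yggdrasil’s code):</p>

<pre><code class="language-Go">
package main

import (
	"fmt"
	"regexp"
)

// matchInterfaces returns the interface names matched by at least one
// pattern. Each pattern is anchored so that it must match the whole name.
func matchInterfaces(patterns, names []string) ([]string, error) {
	var selected []string
	for _, name := range names {
		for _, pattern := range patterns {
			re, err := regexp.Compile("^(?:" + pattern + ")$")
			if err != nil {
				return nil, err
			}
			if re.MatchString(name) {
				selected = append(selected, name)
				break
			}
		}
	}
	return selected, nil
}

func main() {
	patterns := []string{"en.*", "awdl0"}
	names := []string{"lo0", "en0", "en1", "awdl0", "utun0"}
	selected, err := matchInterfaces(patterns, names)
	fmt.Println(selected, err)
}
</code></pre>

<p>With those two patterns, <code class="language-plaintext highlighter-rouge">en0</code>, <code class="language-plaintext highlighter-rouge">en1</code> and <code class="language-plaintext highlighter-rouge">awdl0</code> are selected, while
<code class="language-plaintext highlighter-rouge">lo0</code> and <code class="language-plaintext highlighter-rouge">utun0</code> are left alone.</p>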

<p>I wanted to know if we could blend the two so that Yggdrasil could automatically
discover other nearby devices and initiate peering connections with them using
AWDL.</p>

<h3 id="getting-started">Getting started</h3>

<p>Macs are a good target for developing and testing AWDL-aware applications as
AWDL is exposed to userspace through a network adapter called <code class="language-plaintext highlighter-rouge">awdl0</code>. It sits
there with a link-local IPv6 address. You can run <code class="language-plaintext highlighter-rouge">tcpdump</code> or Wireshark on it
to listen to AWDL traffic, and you can even ping multicast group addresses on the
interface and get responses from other nearby devices, e.g. using <code class="language-plaintext highlighter-rouge">ping6
ff02::1%awdl0</code>! However, Apple devices don’t always keep AWDL alive and
listening all of the time.</p>

<p>On macOS, the AWDL driver is only woken up when either AirDrop is being
actively used in Finder, or when a <code class="language-plaintext highlighter-rouge">NetService</code> has been created (usually
through Objective-C or Swift) which requests peer-to-peer networking. AWDL is
normally kept alive long enough to satisfy connectivity for these sessions and
then will be sent back to sleep after a period of idleness.</p>

<p>On iOS, the story is somewhat similar to above, except that AWDL is often woken
up as soon as the device is unlocked if AirDrop is enabled. The <code class="language-plaintext highlighter-rouge">NetService</code> API
otherwise functions the same way.</p>

<p>tvOS is the outlier in that it seems to wake up and listen to AWDL randomly,
even when the device is otherwise asleep, presumably because it is advertising
the ability to receive incoming AirPlay sessions to nearby devices.</p>

<p>From a user perspective, the <code class="language-plaintext highlighter-rouge">awdl0</code> interface looks entirely unremarkable. It
behaves largely like any other ethernet interface, carrying regular IPv6
traffic. In the background it’s a bit more complicated, as the AWDL driver
performs traffic filtering for security reasons, namely, to stop someone sat
next to you in the airport from browsing your file shares. Regular listening
sockets won’t accept connections over AWDL unless a specific socket option was
configured on the socket before it started listening.</p>

<p>Multicast traffic, however, does largely get passed through the filter
untouched. Bingo.</p>

<h3 id="waking-up-awdl">Waking up AWDL</h3>

<p>The <code class="language-plaintext highlighter-rouge">NetService</code> API is effectively a wrapper around multicast DNS-SD, which in
Apple’s colourful language, is affectionately known as Bonjour. The API has the
added benefit of being able to tell the operating system to wake up the AWDL
driver pretty much on demand on behalf of “peer-to-peer” services.</p>

<p>So all we would need to do to wake up AWDL is to call the <code class="language-plaintext highlighter-rouge">NetService</code> API,
publish a service that requests peer-to-peer functionality and let the operating
system do the hard work for us. Yggdrasil, being written in Go, didn’t have any
concept of <code class="language-plaintext highlighter-rouge">NetService</code> but thankfully we were able to use Cgo to do this
instead.</p>

<p>We wrote a Cgo function which calls the NetService API and advertises our new
fake service, <code class="language-plaintext highlighter-rouge">_yggdrasil._tcp</code>, which causes the operating system to wake up
the AWDL driver. Amazingly this worked.</p>

<p>Yggdrasil doesn’t actually use DNS-SD - we currently use a custom-formatted
multicast beacon on a different multicast group. It is planned to eventually
migrate to something more standard, like DNS-SD, for service discovery. However,
in this instance, registering a fake DNS-SD service was just enough to wake up
AWDL.</p>

<h3 id="peering-automatically">Peering automatically</h3>

<p>Once the driver is active, the regular Yggdrasil multicast beacons on the
<code class="language-plaintext highlighter-rouge">ff02::114</code> multicast group address seem to be passed through to the driver
normally and the Yggdrasil nodes running on each machine start to hear each
other’s calls.</p>
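
<p>Because <code class="language-plaintext highlighter-rouge">ff02::114</code> is a link-local multicast address, it only means
something relative to a particular interface, which is why addresses such as the
<code class="language-plaintext highlighter-rouge">ff02::1%awdl0</code> used earlier carry a zone suffix. A small sketch of that
idea using only Go’s standard library (the helper name <code class="language-plaintext highlighter-rouge">scopedGroup</code> is
made up for this example):</p>

<pre><code class="language-Go">
package main

import (
	"fmt"
	"net"
)

// beaconGroup is the multicast group address quoted in the text.
const beaconGroup = "ff02::114"

// scopedGroup returns the group address scoped to one interface (zone),
// or an error if the address is not a link-local multicast address.
func scopedGroup(group, zone string) (string, error) {
	ip := net.ParseIP(group)
	if ip == nil || !ip.IsLinkLocalMulticast() {
		return "", fmt.Errorf("%q is not a link-local multicast address", group)
	}
	addr := net.IPAddr{IP: ip, Zone: zone}
	return addr.String(), nil
}

func main() {
	scoped, err := scopedGroup(beaconGroup, "awdl0")
	fmt.Println(scoped, err)
}
</code></pre>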

<p>The only thing that remained to be done was to configure the sockets with the
aforementioned socket option to allow them to communicate over the AWDL
interface. This socket option is called <code class="language-plaintext highlighter-rouge">SO_RECV_ANYIF</code> and is defined in
<code class="language-plaintext highlighter-rouge">sys/socket.h</code> on Darwin as <code class="language-plaintext highlighter-rouge">0x1104</code>.</p>

<p>We configure the socket option on our TCP peering socket:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// 0x1104 is SO_RECV_ANYIF on Darwin (see above)
err = unix.SetsockoptInt(int(fd), syscall.SOL_SOCKET, 0x1104, 1)
if err != nil {
  ...
}
</code></pre></div></div>
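
<p>For context, one way to apply an option like this in Go is through the
<code class="language-plaintext highlighter-rouge">Control</code> hook of <code class="language-plaintext highlighter-rouge">net.ListenConfig</code>, which runs after the socket is
created but before it is bound. The sketch below is an assumption about how this
could be wired up rather than a copy of Yggdrasil’s code: it uses the standard
<code class="language-plaintext highlighter-rouge">syscall</code> package instead of <code class="language-plaintext highlighter-rouge">unix</code>, the helper names are hypothetical,
and it only builds on Unix-like systems.</p>

<pre><code class="language-Go">
package main

import (
	"context"
	"fmt"
	"net"
	"runtime"
	"syscall"
)

// soRecvAnyIf is the SO_RECV_ANYIF value quoted in the text (Darwin only).
const soRecvAnyIf = 0x1104

// setRecvAnyIf is a Control hook that sets the option on a new socket.
func setRecvAnyIf(network, address string, c syscall.RawConn) error {
	var sockErr error
	err := c.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, soRecvAnyIf, 1)
	})
	if err != nil {
		return err
	}
	return sockErr
}

// newListenConfig installs the hook on Darwin and leaves other platforms
// untouched, since the option value is not meaningful elsewhere.
func newListenConfig(goos string) *net.ListenConfig {
	lc := new(net.ListenConfig)
	if goos == "darwin" {
		lc.Control = setRecvAnyIf
	}
	return lc
}

func main() {
	lc := newListenConfig(runtime.GOOS)
	ln, err := lc.Listen(context.Background(), "tcp", "[::1]:0")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()
	fmt.Println("listening on", ln.Addr())
}
</code></pre>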

<p>Now that the Yggdrasil nodes can hear each other’s advertisements over the
<code class="language-plaintext highlighter-rouge">awdl0</code> interface, the regular automatic peering process kicks in and a TCP
session is opened between the two devices, creating a peering. The net result?
AWDL peerings!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo yggdrasilctl getSwitchPeers
   bytes_recvd   bytes_sent  coords       endpoint                         ip                                      port  proto
1  244278        313907      [3 5 5 2 1]  fe80::xxxx:xxxx:xxxx:xxxx%awdl0  xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx  1     tcp
</code></pre></div></div>

<p>To further cement the experiment, we can actually disconnect the two devices
from each other, or connect to different Wi-Fi networks automatically, and the
peering over the <code class="language-plaintext highlighter-rouge">awdl0</code> interface still continues to function!</p>

<p>An <code class="language-plaintext highlighter-rouge">iperf3</code> test over Yggdrasil using the new AWDL link looks fairly good - the
devices are sat next to each other:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-1.00   sec  15.4 MBytes   129 Mbits/sec
[  5]   1.00-2.00   sec  16.9 MBytes   141 Mbits/sec
[  5]   2.00-3.00   sec  15.9 MBytes   133 Mbits/sec
[  5]   3.00-4.00   sec  17.6 MBytes   147 Mbits/sec
[  5]   4.00-5.00   sec  16.8 MBytes   141 Mbits/sec
[  5]   5.00-6.00   sec  16.2 MBytes   136 Mbits/sec
[  5]   6.00-7.00   sec  12.5 MBytes   105 Mbits/sec
[  5]   7.00-8.00   sec  12.7 MBytes   106 Mbits/sec
[  5]   8.00-9.00   sec  14.9 MBytes   125 Mbits/sec
[  5]   9.00-10.00  sec  13.5 MBytes   113 Mbits/sec
</code></pre></div></div>

<h3 id="observations-and-ios">Observations and iOS</h3>

<p>As the <code class="language-plaintext highlighter-rouge">iperf3</code> test above shows, the link performance is actually quite good!
It routinely exceeds 100 Mbit/s, although this is between only two devices. I have
not been able to test this with Yggdrasil nodes running over AWDL in any
particular density due to only having a limited number of Macs to hand.</p>
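
<p>Taking the ten one-second samples above at face value, a few lines of Go give
the spread and the mean (the function name <code class="language-plaintext highlighter-rouge">summarise</code> is just for this
example):</p>

<pre><code class="language-Go">
package main

import (
	"fmt"
	"sort"
)

// mbits holds the per-second bandwidth figures from the iperf3 output
// above, in Mbits/sec.
var mbits = []int{129, 141, 133, 147, 141, 136, 105, 106, 125, 113}

// summarise returns the minimum, maximum and mean of the samples.
// The slice must not be empty.
func summarise(samples []int) (lo, hi int, mean float64) {
	sorted := append([]int(nil), samples...)
	sort.Ints(sorted)
	sum := 0
	for _, s := range sorted {
		sum += s
	}
	return sorted[0], sorted[len(sorted)-1], float64(sum) / float64(len(sorted))
}

func main() {
	lo, hi, mean := summarise(mbits)
	fmt.Printf("min %d, max %d, mean %.1f Mbits/sec\n", lo, hi, mean)
}
</code></pre>

<p>For these samples that works out to a minimum of 105, a maximum of 147 and a
mean of 127.6 Mbits/sec.</p>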

<p>One thing that I did notice though is that, while AWDL is active, my wireless
connection to my home Wi-Fi network does reduce in speed somewhat. This is to be
expected, given that the wireless chipset is hopping between channels rather
than spending all of its time on a single channel.</p>

<p>Sadly we weren’t able to reproduce this test using iOS TestFlight builds of
Yggdrasil. On iOS, we implement Yggdrasil as a VPN service which is subject to a
number of probably reasonable restrictions imposed by the OS, which presumably
exist to stop VPN extensions from spying on you.</p>

<p>We were able to create a <code class="language-plaintext highlighter-rouge">NetService</code> from within the VPN extension and the
service beacons were advertised as expected, however, we weren’t able to
initiate any other kind of connections over the <code class="language-plaintext highlighter-rouge">awdl0</code> interface. After a chat
with an engineer at Apple, it turns out that the <code class="language-plaintext highlighter-rouge">awdl0</code> interface isn’t scoped
for use within a VPN extension, thus squashing our hopes and dreams of being
able to sprinkle this kind of magic onto our iOS port of Yggdrasil. We have a
feature request radar open with Apple in the hope that they may be able to
change this restriction in the future.</p>

<p>But we were able to get this to work on macOS and that, itself, is quite
awesome.</p>

<h3 id="conclusion">Conclusion</h3>

<p>Yggdrasil doesn’t enable AWDL by default because of the reduction in wireless
performance that AWDL being active can cause. Therefore, to enable AWDL peering,
you must add the <code class="language-plaintext highlighter-rouge">awdl0</code> interface specifically into the <code class="language-plaintext highlighter-rouge">MulticastInterfaces</code>
configuration option in <code class="language-plaintext highlighter-rouge">yggdrasil.conf</code>. However, we do have working support
for connecting Macs together and meshing automatically using AWDL, and you can
enable it very easily if you wish to experiment!</p>

<p>We’d love to hear if you are peering Yggdrasil nodes using AWDL, or have
performed any more extensive tests of how it performs in real-world scenarios -
join us on our <strike>Matrix channel</strike>.</p>]]></content><author><name>Neil Alexander</name></author><summary type="html"><![CDATA[Wireless without borders]]></summary></entry><entry><title type="html">Version 0.3.6</title><link href="https://yggdrasil-network.github.io/2019/08/03/release-v0-3-6.html" rel="alternate" type="text/html" title="Version 0.3.6" /><published>2019-08-03T08:00:00+00:00</published><updated>2019-08-03T08:00:00+00:00</updated><id>https://yggdrasil-network.github.io/2019/08/03/release-v0-3-6</id><content type="html" xml:base="https://yggdrasil-network.github.io/2019/08/03/release-v0-3-6.html"><![CDATA[<h3 id="new-release">New release!</h3>

<p>It’s been nearly five months since we released version 0.3.5 of Yggdrasil. In
that time we’ve seen the node count rise to over 400 nodes on the public network
at times (over 80% of which are running the latest released version) and we’ve
gained valuable insight into the kinds of challenges that our users have. We’ve
worked to fix a number of bugs and to improve Yggdrasil.</p>

<p>In terms of lines of code changed, version 0.3.6 is the biggest release of
Yggdrasil to date, with several thousand lines of code affected. It
represents a massive refactoring exercise in which we’ve broken up and
modularised the code, dividing core Yggdrasil functionality, TUN/TAP, admin
socket and multicast features into their own respective Go packages.</p>

<h3 id="fixes">Fixes</h3>

<p>Most of the user-facing changes in this release are fairly minimal, although some
bugs have been corrected. A complete list is available in the <a href="/changelog.html">changelog</a>.</p>

<p>Highlights include peers now being added correctly even when one or more
configured peers are unavailable or unreachable. Multicast interfaces are also
being evaluated more frequently now, which can help if an interface becomes
available or goes down after Yggdrasil has already started.</p>

<p>A number of bugs have been fixed in the TUN/TAP and IP-specific code, including
problems that affected ICMPv6 and Neighbour Discovery in TAP mode specifically.
This helps reliability on platforms where TAP mode is used more commonly, e.g.
on BSD platforms or on Windows, although it improves TAP support on Linux
too.</p>

<h3 id="refactoring-and-api">Refactoring and API</h3>

<p>Around the previous release, it became obvious to us that our codebase was
turning into a monolith. We had pretty much all of the necessary behaviour in
a single <code class="language-plaintext highlighter-rouge">yggdrasil</code> package to run a single node, but this made our codebase
inflexible and difficult to maintain and extend. It also meant that Yggdrasil
was virtually impossible to integrate into other applications.</p>

<p>Our refactoring efforts in version 0.3.6 mean that our codebase is now easier to
manage and to understand. It also includes the first taste of our API! The
API makes it possible to take the Yggdrasil core, drop it into your own Go
application and use the Yggdrasil network as a fully end-to-end encrypted and
distributed transport layer. We’ve also moved all of the IP-specific code into
the TUN/TAP module, which means that Yggdrasil’s core now provides a completely
protocol-agnostic transport.</p>

<p>Documentation on how to use the API to integrate Yggdrasil into your own
applications will follow soon—watch this space! In the meantime, <a href="https://godoc.org/github.com/yggdrasil-network/yggdrasil-go/src/yggdrasil"><code class="language-plaintext highlighter-rouge">godoc</code> can be
used to examine our new API functions</a>.</p>

<p>Please note though that <strong>API functions are not yet finalised and may be subject
to change</strong> in future versions. Yggdrasil is still alpha-grade software at this
point so all of the usual warnings apply.</p>

<h3 id="platform-support">Platform Support</h3>

<p>We enjoy great support from our community in porting and packaging Yggdrasil for
new platforms. Since the release of version 0.3.5, the following third-party
packages have cropped up, and we are very grateful to the maintainers:</p>

<ul>
  <li>A <a href="https://copr.fedorainfracloud.org/coprs/leisteth/yggdrasil/">new RPM build</a> for Red Hat, Fedora, CentOS etc.</li>
  <li>An <a href="https://aur.archlinux.org/packages/yggdrasil-git/">AUR package</a> for Arch Linux</li>
  <li>A <a href="https://github.com/void-linux/void-packages/tree/master/srcpkgs/yggdrasil">Void package</a> for Void Linux</li>
  <li>A <a href="https://github.com/macports/macports-ports/blob/master/net/yggdrasil-go/Portfile">MacPorts package</a> for macOS</li>
</ul>

<p>We expect that any third-party packages which have not yet been updated for
v0.3.6 will be updated soon!</p>

<p>We are aware of a few outstanding issues with Windows, which are largely related
to one or two bugs in the <a href="https://github.com/songgao/water">Water</a> library
which we use for TUN/TAP support. We are hoping to address these problems with
the maintainer of this library soon. Using Yggdrasil in router-only mode does
work as expected, but some bugs when using the TAP adapter still remain. In the
meantime, we’d certainly welcome any assistance in maintaining the Windows port
of Yggdrasil.</p>

<p>The iOS build has been largely neglected due to API changes, although hopefully
a new TestFlight build for version 0.3.6 will be available before too long.</p>

<h3 id="upgrading">Upgrading</h3>

<p>We recommend that all Yggdrasil users always run the latest version of the code
wherever possible, so please upgrade as soon as it is convenient. New downloads
are available from our <a href="/builds.html">Builds</a> page and
<a href="https://github.com/neilalexander">Neil</a>’s S3 repositories are up-to-date for
Debian and EdgeRouter installs.</p>

<p>If you have installed through a package manager, you should be able to upgrade
in-place as soon as the new packages are available. On macOS, you can simply
install the new <code class="language-plaintext highlighter-rouge">.pkg</code> from the builds page over the top of the old one. On
Windows, and on any installation where the binary was installed by hand, you can
simply replace the <code class="language-plaintext highlighter-rouge">yggdrasil</code> and <code class="language-plaintext highlighter-rouge">yggdrasilctl</code> binaries with the newly
released builds.</p>

<p>Building from source is simple if you have Git and Go 1.11 or later installed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/yggdrasil-network/yggdrasil-go
cd yggdrasil-go
./build
</code></pre></div></div>

<h3 id="feedback">Feedback</h3>

<p>We always welcome feedback, so please do feel free to join us either in our
<a href="https://riot.im/app/#/room/%23yggdrasil:matrix.org">Matrix channel</a> or on IRC
in <code class="language-plaintext highlighter-rouge">#yggdrasil</code> on Freenode. You can also raise bug reports and issues in <a href="https://github.com/yggdrasil-network/yggdrasil-go/issues">our
GitHub repository</a>.</p>]]></content><author><name>Neil Alexander</name></author><summary type="html"><![CDATA[New release!]]></summary></entry><entry><title type="html">Practical peering</title><link href="https://yggdrasil-network.github.io/2019/03/25/peering.html" rel="alternate" type="text/html" title="Practical peering" /><published>2019-03-25T04:00:00+00:00</published><updated>2019-03-25T04:00:00+00:00</updated><id>https://yggdrasil-network.github.io/2019/03/25/peering</id><content type="html" xml:base="https://yggdrasil-network.github.io/2019/03/25/peering.html"><![CDATA[<h3 id="how-many-peers-do-i-need-and-which-ones">How many peers do I need, and which ones?</h3>

<p>Perhaps the most common questions we receive are about peering. If you’re not familiar with how Yggdrasil works, or even if you are but you haven’t tested things carefully, then it’s sometimes easy to do things which seem like they should work right, but lead to higher latency and lower bandwidth for you or nodes that depend on you in the network. This post is meant to explain what happens when the wrong peers are selected, and what you can do to avoid it.</p>

<h3 id="the-problem">The problem</h3>

<p>When building a physical network, the cost of adding a link between two nodes, as well as the benefits that having that link would give, play a role in deciding which nodes are ultimately linked. That cost often correlates with the cost of using the link – long links are more expensive to create and have higher latency than short links of the same type, for example, and there’s no point in adding a link to another node unless it’s worth the cost. Yggdrasil is designed to work well on the kinds of networks we see in the real world, and makes implicit assumptions which benefit from this relationship: a long link costs more both to create and to use.</p>

<p>However, when peering Yggdrasil nodes over the internet, the performance difference between two links can be dramatic, but the cost of creating them is always the same: there is no difference between adding a link over the internet to a node whether it’s 1 km away or 1000 km away. As a result, it’s easy to add links over the internet which would make no sense if deploying dedicated infrastructure, and which can violate some of Yggdrasil’s assumptions. This can lead to worse performance for not only the two linked nodes, but other nodes in their area.</p>

<h3 id="rules-of-thumb">Rules of thumb</h3>

<p>In an effort to clarify how nodes should connect to public peers, and how public peers should connect to each other, I think it’s helpful if we establish some rules of thumb:</p>

<ol>
  <li>When deciding whether to connect to another node, you should only connect to the ones that are “good enough” to be worth the effort. Here, “good enough” means that they have (approximately) at least as much bandwidth as your own. A fast node shouldn’t decide to connect to a slow node; instead, the slow node should decide if it wants to connect to the fast one.</li>
  <li>When connecting to nodes, start with the “closest” (lowest latency) nodes, subject to the above constraint, and work your way out. Try not to skip over (equal or better) nodes if there’s no reason to.</li>
</ol>
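
<p>As a rough illustration, the two rules can be sketched as a filter followed by a sort. This is only a sketch under stated assumptions: the <code class="language-plaintext highlighter-rouge">Candidate</code> type, the <code class="language-plaintext highlighter-rouge">choose_peers</code> function and all of the numbers are hypothetical, and Yggdrasil itself does not pick peers for you.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    latency_ms: float      # measured round-trip time to the candidate
    bandwidth_mbps: float  # estimated usable bandwidth of the candidate


def choose_peers(candidates, own_bandwidth_mbps, max_peers):
    """Apply the two rules of thumb to a list of candidate peers."""
    # Rule 1: only consider nodes with at least as much bandwidth as our own.
    good_enough = [c for c in candidates if c.bandwidth_mbps &gt;= own_bandwidth_mbps]
    # Rule 2: start with the lowest-latency nodes and work outwards.
    good_enough.sort(key=lambda c: c.latency_ms)
    return good_enough[:max_peers]


# Hypothetical measurements, for illustration only.
candidates = [
    Candidate("far-fast", latency_ms=140.0, bandwidth_mbps=1000.0),
    Candidate("near-slow", latency_ms=5.0, bandwidth_mbps=10.0),
    Candidate("near-fast", latency_ms=8.0, bandwidth_mbps=500.0),
    Candidate("mid-fast", latency_ms=35.0, bandwidth_mbps=200.0),
]
chosen = choose_peers(candidates, own_bandwidth_mbps=100.0, max_peers=2)
print([c.name for c in chosen])  # ['near-fast', 'mid-fast']
</code></pre></div></div>

<p>Note that the nearest node is skipped here only because it is slower than we are; it remains free to decide to connect to us instead.</p>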

<p>While this may not be the only way to fix the problem, following these rules of thumb should approximate the kinds of constraints that real networks need to deal with. Nodes tend to connect to whoever is closest, and better nodes tend to skip over worse ones to establish a long range “backbone” connection between remote points.</p>

<p>In addition, the number of peers you want to add depends on what you want to do. If you only want to connect to the network, then 1 (better connected) peer is technically enough, but this acts as a single point of failure. Two to four peers add some redundancy, but keep in mind that you may end up routing traffic between these peers if that ends up being the best route they can find. If your goal is to set up a public peer that can route traffic for the network, and you have enough bandwidth to spare, then keep adding peers. Generally speaking, an asymmetric home internet connection shouldn’t try to route traffic. And, wherever possible, replace internet links with real connections over directional wifi or similar – to avoid having multiple peers share bandwidth over a shared link.</p>

<h3 id="what-happens-when-things-go-wrong">What happens when things go wrong</h3>

<p>Let’s imagine we have some nodes in New York, and initially they follow the peering rules outlined above. Now suppose that two of these nodes decide that they want to add connections to London. In Yggdrasil, nodes tend to select parents that minimize latency to the root, which happens to be a node in Paris at the time I’m writing this. As a result, both of the NY nodes are likely to select their respective London peers as their parents. If the nodes are following the peering rules, then at least one of them has also decided to peer with the other, so they have a shortcut they can use to talk to each other (or any descendants in the tree).</p>

<p>However, if they ignore the peering rules and <em>don’t</em> peer with each other, then they are likely to route through London instead of communicating over their local mesh network. A shorter path exists, through their local mesh network, but it’s not one that the network <em>must</em> know about for routing to work, so they won’t necessarily know about it. As a result, the latency between these two nodes (or descendants thereof) will likely be an order of magnitude higher than it needs to be (and the bandwidth will probably be lower as well).</p>

<h3 id="conclusion">Conclusion</h3>

<p>Yggdrasil was designed with scalability in mind, and to that end, it makes some assumptions about how nodes in the network are connected to avoid communicating unnecessary information. Peering over the internet allows you to violate these assumptions. When this happens, it’s possible for network performance to suffer unintended consequences when adding new links. If you prioritize adding new links the same way as you would when building physical links, you can expect lower latency and, in many cases, higher bandwidth, compared to adding peers at random.</p>]]></content><author><name>Arceliar</name></author><summary type="html"><![CDATA[How many peers do I need, and which ones?]]></summary></entry><entry><title type="html">History of the World Tree, Part I</title><link href="https://yggdrasil-network.github.io/2019/01/09/history.html" rel="alternate" type="text/html" title="History of the World Tree, Part I" /><published>2019-01-09T05:00:00+00:00</published><updated>2019-01-09T05:00:00+00:00</updated><id>https://yggdrasil-network.github.io/2019/01/09/history</id><content type="html" xml:base="https://yggdrasil-network.github.io/2019/01/09/history.html"><![CDATA[<h3 id="how-did-yggdrasil-get-started">How did Yggdrasil get started?</h3>

<p>On a few occasions I’ve been asked about how Yggdrasil was started, or what motivated certain things about the design.
I’ve talked about the motivation and technical details in other blog posts, but I haven’t talked about the history before, so I thought it’s about time.</p>

<h3 id="batman-begins">B.A.T.M.A.N. begins</h3>

<p>The first time I can recall hearing about mesh networks, as a concept, was some time in late 2010 or early 2011, when <a href="https://www.open-mesh.org">B.A.T.M.A.N.</a> <a href="https://kernelnewbies.org/Linux_2_6_38#B.A.T.M.A.N._mesh_protocol">reached the mainline Linux kernel</a>.
I liked the idea, but since I obsess over how things scale, I was worried about the network’s ability to cope with an internet-like number of users.
In B.A.T.M.A.N., as in most other protocols, nodes must either rely on some externally configured (and coordinated) subnetting, or else every node in a network must know about every other node in the network.
In particular, each node periodically sends a broadcast packet through the network, which allows the rest of the network to find a path back to the originating node.
Other approaches, such as <a href="https://en.wikipedia.org/wiki/Ad_hoc_On-Demand_Distance_Vector_Routing">AODV</a>, only search for routes when they’re needed, but the same <code class="language-plaintext highlighter-rouge">~O(n)</code> cost applies for each node in a network with <code class="language-plaintext highlighter-rouge">n</code> nodes.
At a certain point, particularly in a shared medium wireless network, the cost of protocol traffic can become larger than the resources available to the network, and so the network no longer has room to route any traffic for the user.</p>

<h3 id="cjdns">CJDNS</h3>

<p>I came across <a href="https://github.com/cjdelisle/cjdns">cjdns</a> in the summer of 2012.
The thing about cjdns that caught my attention was how it used a <a href="https://en.wikipedia.org/wiki/Distributed_hash_table">Distributed Hash Table</a> to allow each node to look up a path to any other node, instead of relying on broadcast traffic.
The idea being, if you can use a DHT instead of broadcast traffic, then you can just throw the whole network into one large subnet, with “flat” identifiers (IP addresses) that have nothing to do with the position of a node in the network.
Then, since you still need some way to assign addresses, you can derive them from a hash of a node’s public encryption key.
That simultaneously addresses the protocol overhead issue, address assignment, and lets you do end-to-end encryption without depending on public key infrastructure.</p>
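
<p>To make the address idea concrete, here is a minimal sketch of deriving a “flat” IPv6 address from a hash of a public key. It is not the exact derivation used by cjdns or by Yggdrasil: the choice of SHA-512, the <code class="language-plaintext highlighter-rouge">0xFD</code> prefix byte and the function name are all assumptions made for illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib
import ipaddress


def address_from_key(public_key, prefix=0xFD):
    """Map a public key to an IPv6 address: one prefix byte plus 15 hash bytes."""
    digest = hashlib.sha512(public_key).digest()
    return ipaddress.IPv6Address(bytes([prefix]) + digest[:15])


# A stand-in for a real 32-byte public key.
key = bytes(range(32))
addr = address_from_key(key)
print(addr)
</code></pre></div></div>

<p>Because the address is a function of the key alone, it says nothing about where the node sits in the network, and anyone who learns the key can check that it matches the address.</p>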

<p>What could go wrong?
Well, the best short example I can give is to imagine that Alice wants to deliver a package to Carol, and they live in a world without maps or addresses, and where you can’t rely on directions like “go North by any route until you reach X”, so everyone needs to memorize any roads or routes that they care about.
Alice doesn’t know where Carol lives, but she knows where Bob lives, and she has reason to believe that Bob knows where Carol lives.
So, Alice visits Bob and asks for directions to Carol.
Bob tells Alice how to get from Bob’s house to Carol’s house, and Alice memorizes this.
Now, any time Alice wants to deliver a package to Carol, she travels from her house to Bob’s house, and then from Bob’s house to Carol’s house.
If anyone asks Alice for a path to Carol, she will give them the path from herself to Carol, including the unnecessary detour past Bob.
If someone knows enough about the layout of the streets to recognize the detours, or otherwise know that there’s a shorter path between two points somewhere on the route, then they could improve upon this path, but <em>in general</em> this doesn’t happen, because nobody knows enough about the layout of things to see the big picture of where everything is.</p>

<p>That’s basically how cjdns routing worked before supernodes were introduced.
Supernodes keep a (centralized) view of the full network, and then other nodes can ask a supernode (instead of doing DHT lookups) for a path.
Ignoring any technical complaints I may have about that approach, it sidesteps the problem I’m interested in solving, so I stopped actively contributing to cjdns once the decision was made to go that route, and started looking for other ways to solve the routing problems cjdns had faced.</p>

<h3 id="just-like-the-simulations">Just like the simulations</h3>

<p>By around the middle of 2015, I had thrown together a basic skeleton of a network simulator in python, so I could compare the paths that different routing schemes find to the shortest paths through the same networks.
Having studied up on the latest and greatest academic works at the time, I had initially been thinking that something resembling Thorup and Zwick’s universal compact routing scheme made the most sense, but I had issues finding a way to implement that <em>securely</em> as a distributed algorithm running on a dynamic network.</p>

<p>To make a long story short, I ultimately took the most inspiration from <a href="https://en.wikipedia.org/wiki/Robert_Kleinberg">Robert Kleinberg</a>’s approach, which is to use a <a href="https://en.wikipedia.org/wiki/Greedy_embedding#Hyperbolic_and_succinct_embeddings">greedy embedding</a>.
Here’s the thing: the Kleinberg approach grows a spanning tree of a (static) network, embeds the tree in the hyperbolic plane, and then proves that this embedding is always greedy (meaning, if you just forward to the point in the metric space closest to the destination, you’ll never hit a dead end).
The only real difference is that Yggdrasil doesn’t bother to embed the tree in the hyperbolic plane.
Instead, each node remembers the path from the root to itself, and we use these paths to calculate distance apart on the tree.
This saves us the trouble of embedding, and we’d need to know the per-hop tree information <em>anyway</em> to securely build the tree, so this saves us some complexity.</p>
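
<p>The distance calculation can be sketched in a few lines: count how many hops the two root paths share, then add up what is left on each side. The list-of-labels representation and the node names below are hypothetical, chosen only to keep the example readable.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def tree_distance(path_a, path_b):
    """Hops between two nodes, given each node's path from the root."""
    common = 0
    for hop_a, hop_b in zip(path_a, path_b):
        if hop_a != hop_b:
            break
        common += 1
    # Walk up from a to the last shared ancestor, then down to b.
    return (len(path_a) - common) + (len(path_b) - common)


# Hypothetical paths: each entry is one node on the way down from the root.
alice = ["root", "r1", "r2", "alice"]
carol = ["root", "r1", "r3", "r4", "carol"]
print(tree_distance(alice, carol))  # 5
</code></pre></div></div>

<p>Here the two paths share <code class="language-plaintext highlighter-rouge">root</code> and <code class="language-plaintext highlighter-rouge">r1</code>, so the tree route climbs two hops from Alice to <code class="language-plaintext highlighter-rouge">r1</code> and descends three hops to Carol.</p>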

<p>Using a DHT, we can look up <em>who</em> we want to talk to (specified by an IPv6 “address”, which is a flat identifier / hash of a key, as in cjdns) and learn <em>where</em> they are on the spanning tree.
Then, when a node needs to forward a packet, it checks the tree location of each of its peers and forwards to whichever one is closest to the destination (give or take a few caveats about congestion control).
This is explained in more detail in earlier blog posts, if you’re not familiar with how Yggdrasil routes and care to read more.</p>
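
<p>A minimal sketch of that forwarding decision, assuming each node knows the root path of every peer and ignoring the congestion control caveats entirely. The function names, the dictionary of peers and the example topology are all hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def tree_distance(path_a, path_b):
    """Hops between two nodes, given each node's path from the root."""
    common = 0
    for hop_a, hop_b in zip(path_a, path_b):
        if hop_a != hop_b:
            break
        common += 1
    return len(path_a) + len(path_b) - 2 * common


def next_hop(own_path, peer_paths, dest_path):
    """Return the peer closest to the destination, or None if no peer is closer than we are."""
    if not peer_paths:
        return None
    best = min(peer_paths, key=lambda name: tree_distance(peer_paths[name], dest_path))
    if tree_distance(peer_paths[best], dest_path) &lt; tree_distance(own_path, dest_path):
        return best
    return None


own = ["root", "a", "b"]
peers = {
    "parent": ["root", "a"],         # our parent on the tree
    "shortcut": ["root", "c", "d"],  # a peer in a different branch (off-tree link)
}
dest = ["root", "c", "d", "e"]
print(next_hop(own, peers, dest))  # shortcut
</code></pre></div></div>

<p>The parent is 4 hops from the destination on the tree, while the off-tree peer is only 1 hop away, so the packet takes the shortcut rather than climbing towards the root.</p>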

<p>In our package delivery example, imagine if the streets in Alice’s town were laid out in a grid, and then named and numbered systematically by blocks, with street signs to label where any off-grid bypasses go.
Alice and friends still haven’t bought maps, but they know each other’s <em>addresses</em> instead.
So, if Alice wants to contact Carol, she first travels to Bob’s house and asks him for Carol’s address.
Now, when she wants to deliver a package to Carol, she can simply follow the block structure of the town until she arrives on Carol’s block, and she has the option to take any bypass she happens to come across if it brings her closer to Carol’s place.
That’s basically how routing on the tree, or taking an off-tree shortcut, works in Yggdrasil’s greedy routing scheme, except with a tree instead of a grid (which, in addition to working everywhere, seems to work <em>well</em> in the places we care about).</p>

<p>I had most of the important parts of this working, in simulations, by mid September of 2015.
Initially, I also included off-tree distance-vector like routes to nodes where the on-tree path would be too long, but I abandoned this once I saw that it added relatively little (except protocol overhead) for the kinds of networks that tend to show up in practice, including some internet topology maps from CAIDA and DIMES.
In particular, it seems to work well any time the network diameter is small and the number of triangles in the network is large, since the former limits the worst case scenario paths that the network can use, and the latter adds many opportunities for off-tree shortcuts.</p>

<h3 id="going-public">Going public</h3>

<p>Having (mostly) finished simulation tests by about spring of 2016, I sat on the idea for a while, trying to work up the motivation to do anything with it.
I eventually sat down one weekend and worked through <a href="https://gobyexample.com/">gobyexample</a>.
The language seemed fast enough for a reasonable prototype, easy enough to learn/read that other people could pick it up quickly if they want to contribute, and generally made multithreading/multiprocessing bearable for me.
Since I wanted to continue playing with the language, and I’d been meaning to implement my routing scheme for a while, I ultimately resolved to rewrite my sim in Go, refactor the important parts into a library, and then add the missing pieces to make it more-or-less a cjdns clone with different routing.
Most of the work happened over a couple of long weekends, and I released the first working prototype on GitHub just before the end of 2017.</p>

<p>Changes since then are mostly documented in the <code class="language-plaintext highlighter-rouge">git log</code>, GitHub issues and pull requests, and discussions in our public matrix channel.
Neil joined and started adding support for other platforms, and we started to roll out public nodes and attract more users.
As of writing, a year or so after the first public release, there are around 130-140 nodes in the network, depending on the time of day, with maybe half of them having joined in the last few months.</p>]]></content><author><name>Arceliar</name></author><summary type="html"><![CDATA[How did Yggdrasil get started?]]></summary></entry><entry><title type="html">Announcing Yggdrasil Network v0.3</title><link href="https://yggdrasil-network.github.io/2018/12/12/announcing-v0-3.html" rel="alternate" type="text/html" title="Announcing Yggdrasil Network v0.3" /><published>2018-12-12T00:00:00+00:00</published><updated>2018-12-12T00:00:00+00:00</updated><id>https://yggdrasil-network.github.io/2018/12/12/announcing-v0-3</id><content type="html" xml:base="https://yggdrasil-network.github.io/2018/12/12/announcing-v0-3.html"><![CDATA[<h3 id="its-finally-here">It’s finally here</h3>

<p>At the end of 2017, Yggdrasil’s first commit was uploaded to GitHub - a project
to explore whether it was possible to build a decentralised, end-to-end
encrypted and scalable compact routing scheme modelled around the concept of a
global spanning tree. Many concept routing schemes that we have seen to date
seem to have problems with scalability - after the network exceeds a certain
size, they either fail to perform or they start to rely on centralised points in
order to consolidate routing information. We want to figure out how to build
something that would not be subject to these limitations, and to maintain
decentralisation as far as possible, and the best way to test our ideas is to
build that network. To our knowledge, this hasn’t quite been achieved before.</p>

<p>Throughout the course of 2018, Yggdrasil has gone from being a very early-stage
project supporting only a single platform to a feature-strong and relatively
stable project which now runs on many supported platforms. Although we currently
still haven’t advanced from the “alpha” label, our network has grown to exceed
70 nodes across the world (and growing slowly but steadily), with a good portion
of these users coming on-board and contributing their own
<a href="https://yggdrasil-network.github.io/services.html">services</a> to the network and using the network for their own
purposes. We’ve even had a small amount of publicity - <a href="https://tomesh.net">Toronto
Mesh</a> have been exploring using Yggdrasil on their city-wide
mesh net, and even <a href="https://www.nuug.no/aktiviteter/20181009-mesh/">presented some Yggdrasil
fundamentals</a> to the Norwegian
Unix User Group (NUUG) back in October.</p>

<p>So far, we believe that Yggdrasil is well on track to delivering on its promises
to build a fully end-to-end encrypted, self-arranging IPv6 network. We also
believe that Yggdrasil should be scalable on paper; we have somewhat proven
this in simulations, but the real proof will come in how the Yggdrasil Network
scales up in the real world, on real hardware, across real links. Having users
helping us to test brings us closer to our goal and enhances our understanding
of how our software will behave on large-scale network graphs.</p>

<p>Version 0.3 has been quite some time coming - we released version 0.2.7 on the
13th October and we have been working since then on what will make it into this
release. Even though it feels in some ways that version 0.3 is a relatively
small evolutionary release, it’s actually by far our biggest release yet. We’ve
included quite a large list of fixes, changes and even new features and over
2000 lines of code changed. We’ve taken a lot of feedback from our users about
their use-cases and pain points, and we’ve collected topological data from
various contributor nodes to try and get a good view of what the network looks
like. We’ve even experienced some rather large topology changes and enjoyed
relatively good network stability throughout.</p>

<p>For much of the time that we were developing v0.3, we had thought that there
would end up being protocol-breaking changes and that this would render v0.3
incompatible with nodes running previous versions. I am happy to announce that
we have <strong>not</strong> needed to introduce breaking changes at this stage and currently
the network has been running a mix of both older and newer developmental nodes
without any particular issues.</p>

<h3 id="features">Features</h3>

<p>You can see the full list of modifications that have been made in our
<a href="https://yggdrasil-network.github.io/changelog.html">changelog</a>.</p>

<p>Perhaps the largest user-visible change is the introduction of Crypto-Key
Routing for traffic tunnelling, allowing you to effectively use Yggdrasil as a
VPN for both IPv4 and IPv6 traffic between any two given points on the network.
This tunnelled traffic enjoys the same benefits as regular Yggdrasil IPv6
traffic in that it is end-to-end encrypted and our many optimisations assist in
preventing TCP-over-TCP anomalies that often arise in other solutions. I wrote
an introductory <a href="https://yggdrasil-network.github.io/2018/11/06/crypto-key-routing.html">blog post</a> back at the
beginning of November about CKR, which explains some more about how to configure
it and how it works.</p>

<p>In the background, we’ve made a substantial change from using a Kademlia-based
DHT to a Chord-based DHT. The Chord-based approach allows us to do lookups with
<code class="language-plaintext highlighter-rouge">O(1)</code> (constant) state, and only depends on additional (<code class="language-plaintext highlighter-rouge">O(log n)</code>) state as a
performance optimisation, which allows us to bootstrap more quickly after
changes. We also believe that using Chord can help us to reduce some idle DHT
chatter on the network in the future, which will save a little bandwidth, and
may be helpful on battery-powered devices.</p>
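
<p>To show why constant state is enough for correct lookups, here is a sketch of a textbook Chord-style ring in which each node remembers only its successor. It is not Yggdrasil’s DHT code: the integer IDs, the function names and the rule that a key belongs to the first node at or after it on the ring are assumptions for illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def in_half_open_interval(x, start, end):
    """True if x lies in the ring interval (start, end], wrapping past zero."""
    if start &lt; end:
        return start &lt; x &lt;= end
    return x &gt; start or x &lt;= end


def build_ring(node_ids):
    """Each node keeps O(1) state: only the ID of its successor on the ring."""
    ordered = sorted(node_ids)
    return {n: ordered[(i + 1) % len(ordered)] for i, n in enumerate(ordered)}


def lookup(successor, start, key):
    """Walk successors from `start` until reaching the node responsible for `key`."""
    node, hops = start, 0
    while not in_half_open_interval(key, node, successor[node]):
        node = successor[node]
        hops += 1
    return successor[node], hops


ring = build_ring([10, 40, 90, 200])
print(lookup(ring, 10, 100))  # (200, 2)
print(lookup(ring, 10, 5))    # (10, 3)
</code></pre></div></div>

<p>The walk always terminates, but it can take a number of hops proportional to the size of the ring. The extra <code class="language-plaintext highlighter-rouge">O(log n)</code> state lets a node skip ahead, which makes lookups faster without being needed for correctness.</p>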

<p>The spanning tree is now constructed a little differently. Previously, in a
stable network, each node would select a new parent only if this reduced the
length of the path to the root of the tree, measured by the number of other
Yggdrasil nodes in the path. This has the virtue of simplicity, but it sometimes
leads to poor performance when a node replaces a few low-latency/high-bandwidth
local links with a comparatively high-latency/low-bandwidth link over the
internet (or an anonymous overlay like <a href="https://github.com/yggdrasil-network/public-peers/blob/master/other/tor.md">Tor</a>
or <a href="https://github.com/yggdrasil-network/public-peers/blob/master/other/i2p.md">I2P</a>).
Starting with this release, nodes will switch to a new parent if it provides a
consistently lower-latency path to the root, and are less eager to immediately
switch again after having just changed parents. This should lead to lower
latency in stable networks, and better reliability in unstable ones.</p>
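
<p>One way to picture this behaviour is a selector that only switches after a candidate has beaten the current parent several rounds in a row, and then waits a while before switching again. This is a sketch under stated assumptions, not the logic in the release: the <code class="language-plaintext highlighter-rouge">ParentSelector</code> class, the <code class="language-plaintext highlighter-rouge">required_wins</code> and <code class="language-plaintext highlighter-rouge">cooldown</code> thresholds and the latency figures are all hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class ParentSelector:
    def __init__(self, parent, required_wins=3, cooldown=5):
        self.parent = parent
        self.required_wins = required_wins    # hypothetical: rounds a peer must win in a row
        self.cooldown = cooldown              # hypothetical: rounds to wait after a switch
        self.wins = {}                        # consecutive rounds each peer beat the parent
        self.rounds_since_switch = cooldown   # allow a switch straight away at start-up

    def observe(self, latency_to_root):
        """Feed one round of measured latencies (ms) to the root via each peer."""
        self.rounds_since_switch += 1
        current = latency_to_root[self.parent]
        for peer, latency in latency_to_root.items():
            if peer != self.parent and latency &lt; current:
                self.wins[peer] = self.wins.get(peer, 0) + 1
            else:
                self.wins[peer] = 0
        if self.rounds_since_switch &lt; self.cooldown:
            return self.parent  # changed parents recently, so stay put
        best = min(latency_to_root, key=latency_to_root.get)
        if self.wins.get(best, 0) &gt;= self.required_wins:
            self.parent = best
            self.wins = {}
            self.rounds_since_switch = 0
        return self.parent


sel = ParentSelector("a")
history = [sel.observe({"a": 50, "b": 20}) for _ in range(3)]
history += [sel.observe({"a": 10, "b": 20}) for _ in range(5)]
print(history)  # ['a', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
</code></pre></div></div>

<p>Peer <code class="language-plaintext highlighter-rouge">b</code> has to be faster for three rounds before it is chosen, and even though <code class="language-plaintext highlighter-rouge">a</code> becomes faster immediately afterwards, the node does not switch back until the cooldown has passed.</p>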

<p>We’ve fixed a reasonable number of bugs and crashes, including in the DHT,
switch and ICMPv6 code, and have made a number of additions to the admin socket
in order to support new functionality and to make parameter naming more
consistent throughout.</p>

<h3 id="upgrading">Upgrading</h3>

<p>Our CI pipeline automatically produces builds for all supported platforms and
these will become available on our <a href="https://yggdrasil-network.github.io/builds.html">Builds page</a>. In addition, our S3
repository for Debian and RPM-based distributions will also be updated with the
new package releases.</p>

<p>New macOS .pkg installers are now available as a part of the v0.3 release too,
so installing and upgrading on macOS is now significantly easier than before.
You can find these installers on the <a href="https://yggdrasil-network.github.io/builds.html">Builds page</a> also.</p>

<p>On other platforms, simply download the latest binary for your platform and drop
it into place. Remember to take a backup of your configuration and normalise it,
which will add any new options for features in v0.3:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cp /path/to/yggdrasil.conf /var/backups/yggdrasil.conf
yggdrasil -useconffile /var/backups/yggdrasil.conf -normaliseconf &gt; /path/to/yggdrasil.conf
</code></pre></div></div>

<h3 id="whats-next">What’s next?</h3>

<p>Our work is far from over. We still have a list of things that can potentially
be rolled into future releases and we will be looking to see what we should
prioritise for our next version.</p>

<p>A big thanks to our contributors, particularly those who have worked on creating
packages for Yggdrasil and bringing it to their distributions of choice, and to
all of the users who use Yggdrasil, contributing services and providing feedback
to us on a regular basis!</p>]]></content><author><name>Neil Alexander, Arceliar</name></author><summary type="html"><![CDATA[It’s finally here]]></summary></entry></feed>