KMC Blog

Creating a BitTorrent client in Kotlin — Bencoding

June 25, 2020

BitTorrent

Introducing Corbit

Distributed computing has interested me for a long time — there’s just something about orchestrating tasks across a vast network of systems that scratches that itch. One of the best examples of this I can think of is BitTorrent. To better understand how torrents work under the hood, I’ve been working on implementing my own client in Kotlin — Corbit

This is the first post in a series where I hope to write about my process creating this client as I go. Let’s go ahead and set some expectations for myself from the start.

Realistic Goals

  • Corbit will be written in Kotlin
  • Corbit will be asynchronous, getting as close to completely non-blocking as possible
  • Corbit will be multiplatform, exposing APIs for the JVM, Native, and JS

Lofty Goals

  • Corbit will be interoperable with the WebTorrent protocol
  • Corbit will be split up into artifacts, which will include a web server and a desktop application

A BitTorrent Primer

In the typical transfer of data, a client connects to a server and requests something. It is then the responsibility of the server to serve that data to the client. In the case of downloading a file, a client will receive the data as a sequential stream from the server and maybe write it to disk. In this model of data distribution, there are two points of failure — either the client or the server are susceptible to dropping the connection.

Client Server Animation

However, BitTorrent does things a bit differently. Rather than setting up a client-server relation to exchange data, a client connects to a swarm of peers to request data — it’s a peer-to-peer protocol. A peer is another client who can request data from the swarm, provide data to the swam, or both.

In this model of data distribution, the swarm is resilient to data disruption. Clients typically request data non-sequentially from the swarm. Different distribution algorithms apply, but one of the most popular prioritizes distributing the rarest pieces to make the swarm more fault tolerant. Therefore, if a peer decides to exit a swarm, there are likely plenty of other peers in the swarm which provide data redundancy.

In a normal client-server scenario, the more connections requesting data the likelier the server is to fail. In a BitTorrent swarm, the the opposite is true. The more connected peers, the more resilient the swarm.

There are plenty of use cases for BitTorrent which are, um, legitimate.

If you’ve ever used BitTorrent before, you might know that you typically download something called a .torrent file. This is a small file, usually a couple of kilobytes, that can be used in conjunction with a client to download some data. If by chance you’ve accidentally opened a torrent file in a text editor, you’ll see something like the following.

torrent file text

To the naked eye, this information looks pretty useless. How does a torrent client make sense of the countless errors shown in that text file? Answering that question begins our journey.

Decoding a torrent file

A torrent file includes the information we need in order to connect to a tracker. In the BitTorrent world, a tracker is the entity which hosts the set of of connection capable peers from which to download data. Trackers are normally just an http server — for example the tracker for Ubuntu is https://torrent.ubuntu.com. The torrent file also includes other valuable information like piece length, pieces, name, and length which is all information that describes the files hosted by other clients in the swarm.

Unfortunately, this information isn’t in a simple JSON format for us to read. BitTorrent is a binary protocol where information is encoded in a format called Bencoding (pronounced B-encoding).

Understanding Bencoding

Bencoding is a way to specify and organize data in a terse format. It supports byte strings, integers, lists, and dictionaries.

Bencoded Integers

Bencoded integers are delimited by the tokens i at the beginning and an e at the end. The data housed between those two tokens are the valid base 10 ASCII representation of an integer.

For example, the data i3e represents the integer 3.

Bencoded Byte Strings

In contrast with the other bencoded data types, byte strings do not have a beginning and end delimiter. Instead, a byte string is specified by a base 10 ASCII representation of the data length, a separator :, and the data. Byte strings are specifically useful when housing binary data which should not be interpreted as encoded text.

For example, the data 4:spam represents the byte string spam.

Bencoded Lists

Bencoded lists begin with a l delimiter while ending with an e delimiter. Contained within those bounds are any bencoded data types. Lists, like bencoded dictionaries, are recursive data types. They are bencoded data as well as contain bencoded data. This will be important later on.

For example, the data li123e4:spame represents the list [123, "spam"]

Bencoded Dictionaries

Bencoded dictionaries begin with a d delimiter and end with an e delimiter. Within a dictionary, a bencoded byte string is associated with the subsequent generic bencoded data as a key-value pair.

For example, the data d4:spami123ee represents the dictionary {"spam": 123}

With the above specification, arbitrary data can be decoded into a structured format and encoded into a binary format. Modeling this data in Corbit may seem challenging given Kotlin’s strict type system, but I think it will be easier than we expect given a couple of tools.

I found this resource, which is the entirety of the data model in Augmented BNF syntax.

<BE>    ::= <DICT> | <LIST> | <INT> | <STR>

<DICT>  ::= "d" 1 * (<STR> <BE>) "e"
<LIST>  ::= "l" 1 * <BE>         "e"
<INT>   ::= "i"     <SNUM>       "e"
<STR>   ::= <NUM> ":" n * <CHAR>; where n equals the <NUM>

<SNUM>  ::= "-" <NUM> / <NUM>
<NUM>   ::= 1 * <DIGIT>
<CHAR>  ::= %
<DIGIT> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Those unfamiliar with ABNF syntax (like me) probably don’t know what to make of the above snippet. With a little bit of research though, we can gather that this method for modeling data is called a sum type.

Sum Types

In computer science, a sum type is a data structure used to hold a value that could take on several different, but fixed, types.

In Kotlin, the preferred method for creating a sum type is a sealed class. You may have worked with sealed classes before, a common example is a Result class.

sealed class Result<T>

class Success<T>(val payload: T) : Result<T>()

class Error(val message: String) : Result<Nothing>()

Any instance of Result must either be a Success or Error. Applying the same school of thought to our Bencoded data model allows us to create a simple structure to work with.

sealed class BencodedData

data class BencodedInt(val value: Long) : BencodedData()

data class BencodedString(val value: ByteArray) : BencodedData()

data class BencodedList(val value: List<BencodedData>) : BencodedData()

data class BencodedDictionary(val value: Map<BencodedString, BencodedData>) : BencodedData()

The data model in Corbit differs slightly from the one above. Most notably, the ByteArray in BencodedString alone would be clunky when dealing with binary data. Unfortunately, Kotlin does not have a built in ByteString type like many other languages.

Despite the existence of third party implementations of byte string in Kotlin (specifically Okio’s), I decided to roll my own implementation of byte string for fun and (no) profit. This is something I plan to revisit once I select an I/O library.

The BinaryString implementation can be found within the corbit-binary module. Similar to String, BinaryString is just a wrapper around a sequence of bytes with convenience functionality baked in.

Thus, our BencodedString looks more like this

data class BencodedString(
    val value: BinaryString
) : BencodedData()

Decoding Bencoded Data

Our API for decoding binary data into structured bencoded data will be simple. Given a BinaryString, please return an instance of BencodedData.

fun BinaryString.decode(): BencodedData = Decoder(this).decode()

internal class Decoder(private val source: BinaryString) {
  fun decode(): BencodedData = TODO()
}

Implementing that logic isn’t actually too difficult. With a couple of convenience functions which handle reading to tokens and reading a fixed amount of bytes, this is simple.

To maintain clarity while demonstrating decoding, we’ll be referring to the following example data.

d4:spamli123eee

This example will touch on each data type that we’ll need to decode.

Decoding Primitives

Let’s pluck off the integer from the example above.

d4:spamli123eee

In order to decode this, we first identify what type we should be decoding to by looking at the next upcoming byte.

fun decode(): BencodedData {
    return when (BencoderToken.fromToken(nextByte)) {
      BencoderToken.INTEGER -> decodeInteger()
      else -> error("Tried to parse from $nextByte")
    }
}

Once we know we’re decoding an integer, the decoder reads the subsequent bytes up to and including the end token, e. Everything between the first and last byte of the read data is of interest to us. We can discard everything else.

private fun decodeInteger(): BencodedInt {
    val value: Long = readToNextToken(BencoderToken.END).run {
        slice(1, size - 2).utf8.toLong()
    }

    return BencodedInt(value)
}

The bencoding specification states that the data in a bencoded integer is base 10 ASCII text. Therefore, we can decode the binary data to unicode (which ASCII is a subset of) and convert that text to a number.

Next up, we’ll decode a bencoded string. Let’s shift our focus to that.

d4:spamli123eee

As mentioned above, bencoded strings don’t have normal delimiters like the other values. Instead, we’ll key our decoding off of any digit that specifies the string length. Our fromToken function maps this to BencoderToken.STRING.

fun decode(): BencodedData {
    return when (BencoderToken.fromToken(nextByte)) {
      BencoderToken.INTEGER -> decodeInteger()
      BencoderToken.STRING -> decodeString()
      else -> error("Tried to parse from $nextByte")
    }
}

Similar to decoding a bencoded integer, we read up to and including the string separator token, :. We discard the separator itself and convert that data to a number. Then, all we have to do is read the amount of bytes specified by the length information.

private fun decodeString(): BencodedString {
    val stringLength: Int = readToNextToken(BencoderToken.STRING_SEPARATOR).run {
        slice(endIndex = size - 2).utf8.toInt()
    }

    return BencodedString(read(stringLength))
}

Recursive Decoding

I mentioned above that bencoded lists and dictionaries are recursive data types. To decode those values, we’re going to perform some recursion. Let’s start with a list.

d4:spamli123eee

First up, more of the same token delimiter checks.

fun decode(): BencodedData {
    return when (BencoderToken.fromToken(nextByte)) {
      BencoderToken.INTEGER -> decodeInteger()
      BencoderToken.STRING -> decodeString()
      BencoderToken.LIST -> decodeList()
      else -> error("Tried to parse from $nextByte")
    }
}

We read and discard the list delimiter and instantiate an empty list. Afterwards, until nextByte is an end token, we recursively add the next decoded BencodedData to our list. When we encounter an end token, we can be sure that it encloses the list token because our calls to decode discard end tokens as part of their respective data types.

private fun decodeList(): BencodedList {
  readToNextToken(BencoderToken.LIST) // read and discard

  val list: MutableList<BencodedData> = mutableListOf()
  while (BencoderToken.fromToken(nextByte) != BencoderToken.END) {
      list.add(decode()) // <-- recurse
  }

  readToNextToken(BencoderToken.END) // read and discard
  return BencodedList(list)
}

Let’s trace the execution of our decoder on the above example.

  1. read and discard the l token
  2. instanciate an empty list
  3. recursively call decode
  4. check nextByte and dispatch to decodeInteger
  5. read to the next end token, which corresponds to the integer
  6. return the BencodedInteger
  7. add the integer to the list
  8. discard the end token
  9. return the BencodedList

Dictionaries are the last data type needed to complete our decoder.

d4:spamli123eee

Decoding dictionaries and lists are a nearly identical process. In contrast to decoding a list, dictionaries reads two values at a time. First a string via decodeString, and then a generic BencodedData via a recursive call to decode.

private fun decodeDictionary(): BencodedDict {
  readToNextToken(BencoderToken.DICT) // read and discard

  val dict: MutableMap<BencodedString, BencodedData> = mutableMapOf()
  while (BencoderToken.fromToken(nextByte) != BencoderToken.END) {
      dict[decodeString()] = decode()
  }

  readToNextToken(BencoderToken.END) // read and discard
  return BencodedDict(dict)
}

And with that, we can decode any arbitrary properly structured bencoded data! The full implementation is here along with some tests here.

Encoding

Encoding Bencoded data is more or less the same process shown above, just in reverse.

The one noteworthy implementation detail of encoding is how the BinaryString instances are constructed. Similar to String in Kotlin, BinaryString instances are immutable. To mutably construct a binary string which represents the BencodedData we’ve chosen to encode, corbit-binary includes a buildBinaryString utility that functions very similarly to Kotlin’s buildString utility.

You can view the implementation and its corresponding tests if you’d like.

Conclusion

corbit-bencoding will be available as its own standalone package with no dependencies other than corbit-binary in the near future. In the next post in this series, we will decode a .torrent file and use the information contained within the connect to a tracker. Happy coding :)


Written by Kevin Cianfarini