about summary refs log tree commit diff
path: root/content/posts
diff options
context:
space:
mode:
Diffstat (limited to 'content/posts')
-rw-r--r--content/posts/2023-04-11-dict-server.md171
1 files changed, 171 insertions, 0 deletions
diff --git a/content/posts/2023-04-11-dict-server.md b/content/posts/2023-04-11-dict-server.md
new file mode 100644
index 0000000..6b0e851
--- /dev/null
+++ b/content/posts/2023-04-11-dict-server.md
@@ -0,0 +1,171 @@
+---
+title: "Writing a DICT (RFC 2229) server"
+date: 2023-04-11
+lang: en
+categories: [ blog, development, hacking ]
+tags: [dict]
+translationKey: "2023-04-11-dict-server"
+---
+
+In last few weeks, I've implemented a minimal, barely compliant[^1]
+[DICT][rfc2229] server called ExTra (also stylized ex.tra).  The server
+implements the protocol as described in the linked specification, as well as
+reading from existing text-databases used by other servers ([dictd][dictd] and
+[GNU dico][dico]). I've done this as an exercise for learning Elixir, mainly,
+and it lacks many features in comparison with the two other existing
+implementations.
+
+[rfc2229]: https://www.rfc-editor.org/rfc/rfc2229
+[dictd]: https://github.com/cheusov/dictd
+[dico]: https://puszcza.gnu.org.ua/software/dico/
+
+## Spec summary
+
+DICT is a simple protocol for looking up dictionaries over
+<abbr>TCP</abbr>/<abbr>IP</abbr>.  It include a handful (well, more than
+handful, if you include the optional authentication) commands like `MATCH` or
+`DEFINE` to retrieve entries from a dictionary database.  The database format
+is not defined, so technically, one can make it to work with, say, an SQL
+database, but it's customary to use the format the reference implementation
+(dictd) uses.
+
+I largely built this based on [Elixir's guide for building a key-value
+server][kv]
+
+[kv]: https://elixir-lang.org/getting-started/mix-otp/introduction-to-mix.html
+
+## Architecture
+
+This diagram shows a very rough and simplified architecture overview of ExTra.
+Yea, I know, it looks ugly, but organizing a diagram is hard, so I'll describe
+it in details below.
+
+![Architecture overview of ExTra](/images/extra-arch.png)
+
+The processes[^2] are supervised by the application supervisor, and are
+respawned once they crash. There are three main processes concerned here:
+
+- `ExTra.Server`: This one accepts connections from the dedicated port
+- `Task.Supervisor`: This one is a supervisor that will spawn new processes
+    that will bind to clients until disconnected.
+- `ExTra.Dict` is a [GenServer][gensrv] that executes commands sent to the
+    server, and reads from dictionaries to generate response.  Note that
+    GenServer is not a TCP/IP server.
+
+[gensrv]: https://hexdocs.pm/elixir/GenServer.html
+
+## TCP server
+
+The <abbr>TCP</abbr> server is implemented using erlang's `:gen_tcp`:
+
+```elixir
+{:ok, client} = :gen_tcp.accept(socket)
+```
+
+Here the `client` is a connection to client. We send and receive data via this
+process.  To read from this until disconnection, we would put it on an infinite
+loop (`acceptor` in the diagram).  However, that also means the server is
+locked to that client and cannot accept another---you probably have learned
+this from a network class implementing an echo server in C.  This is where the
+task supervisor comes in: instead of running directly into the infinite loop,
+we spawn a `Task` that does it:
+
+```elixir
+Task.Supervisor.start_child(ExTra.ConnSupervisor, fn -> serve_first(client, host) end)
+```
+
+In the loop, commands are parsed and run, something like
+`ExTra.Command.parse(data)` and `ExTra.Command.run(commands)`
+The `ExTra.Command` module then sends these commands to `ExTra.Dict` GenServer
+to execute it, something like `ExTra.Dict.command(ExTra.Dict, command)`.
+
+## GenServer
+
+The commands are sent to the `GenServer` and handled by some other modules:
+
+```elixir
+@impl true
+def handle_call({:define, dictionary, word}, _from, state) do
+  {:reply, ExTra.Dict.Define.define(dictionary, word), state}
+end
+
+def handle_call({:match, dictionary, strategy, word}, _from, state) do
+  {:reply, ExTra.Dict.Match.match(dictionary, strategy, word), state}
+end
+```
+
+This level of abstraction may seems a bit convoluted, but using GenServer here
+would allow for caching matches and definitions, and separating matches and
+definitions to a separate modules allow for different search modules depending
+on config.  Not a necessary thing, just for educational purpose.
+
+## Matching definitions
+
+The `.dict` file stores entries as well as metadata as plain text, while the
+`.index` file store positions of the entries as:
+
+```
+<entry> <start> <length>
+```
+
+Where `<start>` and `<length>` are in quartosexagesimal or base 64.  Numeral
+base 64, not base 64 encoding that is implemented in the standard library.
+After a few shortening, the conversion is done in less than 20 lines:
+
+```elixir
+def base64num(num) do
+  alphabet =
+    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
+    |> Stream.with_index()
+    |> Enum.into(%{})
+
+  len = String.length(num)
+
+  num
+  |> String.to_charlist()
+  |> Stream.with_index()
+  |> Stream.map(fn {c, i} -> {alphabet[c], len - i - 1} end)
+  # Left-shift 6 * power bits is equal to multiply by (2^6)^power, but faster
+  |> Stream.map(fn {digit, power} -> Bitwise.bsl(digit, power * 6) end)
+  |> Enum.sum()
+end
+```
+
+Firstly, I mapped the digits in the alphabet to its respective values, which
+are also their indices in the char list.  The digits in the input string are
+then mapped to their values based on this map, while their indices are mapped
+to the power.  Finally, these values are powered and summed as the answer.
+
+Upon reading these values, the definition from the `dict` files can be
+retrieved as simply as:
+
+```elixir
+# the content that comes before what we need
+_ = IO.read(file, start)
+# has to `binread` to interpret the UTF-8 encoded characters
+IO.binread(file, length)
+```
+
+## Further
+
+The full implementation can be found on [SourceHut][src].  As this is the first
+application I've written in Elixir, I'm sure there's a lot of stuff I've
+written here isn't recommended, so if you have some suggestion for improvement,
+please send me.
+
+As of writing, there are still several features I'd like to implement that I
+haven't, such as:
+
+- Proper response for `STATUS` and `OPTION MIME` commands
+- Implement more matching strategies
+- Make uses of the GenServer's state for caching matches
+
+[src]: https://git.sr.ht/~huyngo/ex.tra
+
+[^1]: The `OPTION MIME` command is not yet compliant, actually. It currently is
+  a no-op command while it should check `00-database-mime-header` in the file
+  and respond an empty line if not present.
+  However, none of the clients (that is, only `dict` and `dico`, as far as I
+  know) does anything with that field, so it doesn't cause any trouble.
+[^2]: To be understood as <abbr title="erlang virtual machine">BEAM</abbr>
+  processes and not <abbr title="operating system">OS</abbr> processes