diff options
Diffstat (limited to 'content/posts')
-rw-r--r-- | content/posts/2023-04-11-dict-server.md | 171 |
1 files changed, 171 insertions, 0 deletions
diff --git a/content/posts/2023-04-11-dict-server.md b/content/posts/2023-04-11-dict-server.md new file mode 100644 index 0000000..6b0e851 --- /dev/null +++ b/content/posts/2023-04-11-dict-server.md @@ -0,0 +1,171 @@ +--- +title: "Writing a DICT (RFC 2229) server" +date: 2023-04-11 +lang: en +categories: [ blog, development, hacking ] +tags: [dict] +translationKey: "2023-04-11-dict-server" +--- + +In last few weeks, I've implemented a minimal, barely compliant[^1] +[DICT][rfc2229] server called ExTra (also stylized ex.tra). The server +implements the protocol as described in the linked specification, as well as +reading from existing text-databases used by other servers ([dictd][dictd] and +[GNU dico][dico]). I've done this as an exercise for learning Elixir, mainly, +and it lacks many features in comparison with the two other existing +implementations. + +[rfc2229]: https://www.rfc-editor.org/rfc/rfc2229 +[dictd]: https://github.com/cheusov/dictd +[dico]: https://puszcza.gnu.org.ua/software/dico/ + +## Spec summary + +DICT is a simple protocol for looking up dictionaries over +<abbr>TCP</abbr>/<abbr>IP</abbr>. It include a handful (well, more than +handful, if you include the optional authentication) commands like `MATCH` or +`DEFINE` to retrieve entries from a dictionary database. The database format +is not defined, so technically, one can make it to work with, say, an SQL +database, but it's customary to use the format the reference implementation +(dictd) uses. + +I largely built this based on [Elixir's guide for building a key-value +server][kv] + +[kv]: https://elixir-lang.org/getting-started/mix-otp/introduction-to-mix.html + +## Architecture + +This diagram shows a very rough and simplified architecture overview of ExTra. +Yea, I know, it looks ugly, but organizing a diagram is hard, so I'll describe +it in details below. + +![Architecture overview of ExTra](/images/extra-arch.png) + +The processes[^2] are supervised by the application supervisor, and are +respawned once they crash. There are three main processes concerned here: + +- `ExTra.Server`: This one accepts connections from the dedicated port +- `Task.Supervisor`: This one is a supervisor that will spawn new processes + that will bind to clients until disconnected. +- `ExTra.Dict` is a [GenServer][gensrv] that executes commands sent to the + server, and reads from dictionaries to generate response. Note that + GenServer is not a TCP/IP server. + +[gensrv]: https://hexdocs.pm/elixir/GenServer.html + +## TCP server + +The <abbr>TCP</abbr> server is implemented using erlang's `:gen_tcp`: + +```elixir +{:ok, client} = :gen_tcp.accept(socket) +``` + +Here the `client` is a connection to client. We send and receive data via this +process. To read from this until disconnection, we would put it on an infinite +loop (`acceptor` in the diagram). However, that also means the server is +locked to that client and cannot accept another---you probably have learned +this from a network class implementing an echo server in C. This is where the +task supervisor comes in: instead of running directly into the infinite loop, +we spawn a `Task` that does it: + +```elixir +Task.Supervisor.start_child(ExTra.ConnSupervisor, fn -> serve_first(client, host) end) +``` + +In the loop, commands are parsed and run, something like +`ExTra.Command.parse(data)` and `ExTra.Command.run(commands)` +The `ExTra.Command` module then sends these commands to `ExTra.Dict` GenServer +to execute it, something like `ExTra.Dict.command(ExTra.Dict, command)`. + +## GenServer + +The commands are sent to the `GenServer` and handled by some other modules: + +```elixir +@impl true +def handle_call({:define, dictionary, word}, _from, state) do + {:reply, ExTra.Dict.Define.define(dictionary, word), state} +end + +def handle_call({:match, dictionary, strategy, word}, _from, state) do + {:reply, ExTra.Dict.Match.match(dictionary, strategy, word), state} +end +``` + +This level of abstraction may seems a bit convoluted, but using GenServer here +would allow for caching matches and definitions, and separating matches and +definitions to a separate modules allow for different search modules depending +on config. Not a necessary thing, just for educational purpose. + +## Matching definitions + +The `.dict` file stores entries as well as metadata as plain text, while the +`.index` file store positions of the entries as: + +``` +<entry> <start> <length> +``` + +Where `<start>` and `<length>` are in quartosexagesimal or base 64. Numeral +base 64, not base 64 encoding that is implemented in the standard library. +After a few shortening, the conversion is done in less than 20 lines: + +```elixir +def base64num(num) do + alphabet = + 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/' + |> Stream.with_index() + |> Enum.into(%{}) + + len = String.length(num) + + num + |> String.to_charlist() + |> Stream.with_index() + |> Stream.map(fn {c, i} -> {alphabet[c], len - i - 1} end) + # Left-shift 6 * power bits is equal to multiply by (2^6)^power, but faster + |> Stream.map(fn {digit, power} -> Bitwise.bsl(digit, power * 6) end) + |> Enum.sum() +end +``` + +Firstly, I mapped the digits in the alphabet to its respective values, which +are also their indices in the char list. The digits in the input string are +then mapped to their values based on this map, while their indices are mapped +to the power. Finally, these values are powered and summed as the answer. + +Upon reading these values, the definition from the `dict` files can be +retrieved as simply as: + +```elixir +# the content that comes before what we need +_ = IO.read(file, start) +# has to `binread` to interpret the UTF-8 encoded characters +IO.binread(file, length) +``` + +## Further + +The full implementation can be found on [SourceHut][src]. As this is the first +application I've written in Elixir, I'm sure there's a lot of stuff I've +written here isn't recommended, so if you have some suggestion for improvement, +please send me. + +As of writing, there are still several features I'd like to implement that I +haven't, such as: + +- Proper response for `STATUS` and `OPTION MIME` commands +- Implement more matching strategies +- Make uses of the GenServer's state for caching matches + +[src]: https://git.sr.ht/~huyngo/ex.tra + +[^1]: The `OPTION MIME` command is not yet compliant, actually. It currently is + a no-op command while it should check `00-database-mime-header` in the file + and respond an empty line if not present. + However, none of the clients (that is, only `dict` and `dico`, as far as I + know) does anything with that field, so it doesn't cause any trouble. +[^2]: To be understood as <abbr title="erlang virtual machine">BEAM</abbr> + processes and not <abbr title="operating system">OS</abbr> processes |