From 2c93df3d11bf8ceeb5c203416a2533cf32275e1a Mon Sep 17 00:00:00 2001
From: "B. Wilson"
Date: Tue, 20 Apr 2021 11:49:27 +0900
Subject: services: Add a service for rasdaemon.

* gnu/services/linux.scm (rasdaemon-configuration, rasdaemon-configuration?,
rasdaemon-configuration-record?, rasdaemon-service-type): New variables.
* doc/guix.texi (Linux Services): Document it.

Signed-off-by: Leo Famulari
---
 doc/guix.texi | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

(limited to 'doc')

diff --git a/doc/guix.texi b/doc/guix.texi
index a6c1556977..14e502da84 100644
--- a/doc/guix.texi
+++ b/doc/guix.texi
@@ -88,6 +88,7 @@ Copyright @copyright{} 2020 John Soo@*
 Copyright @copyright{} 2020 Jonathan Brielmaier@*
 Copyright @copyright{} 2020 Edgar Vincent@*
 Copyright @copyright{} 2021 Maxime Devos@*
+Copyright @copyright{} 2021 B. Wilson@*
 
 Permission is granted to copy, distribute and/or modify this document
 under the terms of the GNU Free Documentation License, Version 1.3 or
@@ -31442,6 +31443,86 @@ parameters, can be done as follow:
 @end lisp
 @end deffn
 
+@cindex rasdaemon
+@cindex Platform Reliability, Availability and Serviceability daemon
+@subsubheading Rasdaemon Service
+
+The Rasdaemon service provides a daemon which monitors the platform Reliability,
+Availability and Serviceability (RAS) reports from the Linux kernel trace
+events, logging them to syslogd.
+
+Reliability, Availability and Serviceability is a concept used on servers meant
+to measure their robustness.
+
+@strong{Reliability} is the probability that a system will produce correct
+outputs:
+
+@itemize @bullet
+@item Generally measured as Mean Time Between Failures (MTBF), and
+@item Enhanced by features that help to avoid, detect and repair hardware
+faults.
+@end itemize
+
+@strong{Availability} is the probability that a system is operational at a
+given time:
+
+@itemize @bullet
+@item Generally measured as a percentage of downtime over a period of time, and
+@item Often uses mechanisms to detect and correct hardware faults at runtime.
+@end itemize
+
+@strong{Serviceability} is the simplicity and speed with which a system can be
+repaired or maintained:
+
+@itemize @bullet
+@item Generally measured as Mean Time Between Repair (MTBR).
+@end itemize
+
+
+The most common monitoring measures include:
+
+@itemize @bullet
+@item CPU – detect errors at instruction execution and at L1/L2/L3 caches;
+@item Memory – add error correction logic (ECC) to detect and correct errors;
+@item I/O – add CRC checksums for transferred data;
+@item Storage – RAID, journal file systems, checksums, Self-Monitoring,
+Analysis and Reporting Technology (SMART).
+@end itemize
+
+By monitoring the number of detected errors, it is possible to
+identify whether the probability of hardware errors is increasing and, in that
+case, perform preventive maintenance to replace a degraded component while those
+errors are still correctable.
+
+For detailed information about the types of error events gathered and how to
+make sense of them, see the kernel administrator's guide at
+@url{https://www.kernel.org/doc/html/latest/admin-guide/ras.html}.
+
+@defvr {Scheme Variable} rasdaemon-service-type
+Service type for the @command{rasdaemon} service. It accepts a
+@code{rasdaemon-configuration} object. Instantiating it like
+
+@lisp
+(service rasdaemon-service-type)
+@end lisp
+
+loads the default configuration, which monitors all events and logs to
+syslogd.
+@end defvr
+
+@deftp {Data Type} rasdaemon-configuration
+The data type representing the configuration of @command{rasdaemon}.
+
+@table @asis
+@item @code{record?} (default: @code{#f})
+
+A boolean indicating whether to record the events in an SQLite database. This
+provides more structured access to the information contained in the log file.
+The database location is hard-coded to @file{/var/lib/rasdaemon/ras-mc_event.db}.
+
+@end table
+@end deftp
+
 @cindex zram
 @cindex compressed swap
 @cindex Compressed RAM-based block devices
-- 
cgit 1.4.1
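
As an illustration of the record? field documented in the patch above, the following sketch shows how the service might be instantiated with SQLite recording enabled. It uses only names the patch itself introduces (rasdaemon-service-type, rasdaemon-configuration, record?). The enclosing services field of an operating-system declaration is assumed and not shown, as is the import of (gnu services linux), which is inferred from the file named in the commit message.

```scheme
;; Sketch only: enable rasdaemon with event recording turned on.
;; Assumption: (gnu services linux) is imported by the system
;; configuration, since the commit message places these names in
;; gnu/services/linux.scm.
(service rasdaemon-service-type
         (rasdaemon-configuration
          (record? #t)))  ; also record events in the SQLite database
```

With record? left at its documented default of #f, events are only logged to syslogd.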