Author: Eiko

Tags: Haskell, domain specific language, Bash, script, code generation, embedded domain specific language, edsl, GADTs, abstract syntax tree

Time: 2024-09-21 22:09:00 - 2024-09-22 00:50:13 (UTC)

Motivation

I have an idea of writing a Haskell library that generates bash scripts. The library will be called GHB, which stands for Generate (from) Haskell (to) Bash (actually it is a joke name, because it differs only 1 from GHC).

Because writing bash script is error-prone and the conditioning / casing / loops are not very pleasant to write, I want to write a library that will be able to generate complicated bash scripts in a type-safe manner.

The library, as I imagine, will provide a set of combinators that will allow the user to generate bash scripts in a type-safe manner. The library will enforce best practices for writing bash scripts and will help the user avoid common pitfalls and errors.

Why not Turtle?

There are some haskell libraries like Turtle that allows you to do scripting directly in haskell, but that means you must have haskell compiler installed in the system. The idea of GHB is to generate bash scripts that can be run on any system.

Design

We want to avoid the following errors:

  1. Using a variable before it is defined, or using a variable that is not defined.

  2. Using a variable that is defined in a different scope, for example, using a variable defined in a function in the global scope.

  3. Enforce safe variable names

    local variables use lowercase letters global variables use uppercase letters

  4. Expand long parameter notation whenever possible

    like rm --recursive --force instead of rm -rf

  5. Use the correct quoting style

    • Use single quotes for strings that do not contain variables

    • Use double quotes for strings that contain variables

  6. Always Enforce

    set -o errexit  # Exit on error
    set -o nounset  # Exit if an unset variable is used
    set -o pipefail # Exit if any command in a pipeline fails

    Or we can let users to handle errors in a more systematic way.

  7. Automatically examine if the command is available in the system

  8. potentially: automatically install the command if it is not available in the system,

    for this we need to know the package manager of the system.

  9. Automatically add the shebang line to the script

  10. Safe conditionals, and control structures, casing, loops, etc.

  11. Variables will have a type, and the type will be checked at compile time, and in bash script, the type info will be erased, but will be inserted in the comments.

A command or a function has a type, represented in our type system.

common functions should be included in the library, their types will be checked and eventually rendered as a command in the bash script.

Understanding Bash Functions and Commands

Let’s consider what types of commands we can have in bash:

Input and Output

  • A command can have some input values, strings, numbers, file paths or directory paths, etc.

  • A command may produce output in stdout, stderr, or files.

  • A command can have exit codes, with 0 meaning success and non-zero meaning failure.

Types of commands we are considering

  1. A command taking no input or one string input, printing string in stdout or strerr, with return codes.

  2. A command that can take flags with no associated arguments, or flags with associated arguments.

    For example

    • ls -l or ls --long or ls --long --all are examples of flags with no associated arguments. It has one input as a directory path.

    • find . -name "*.hs" is an example of a flag with an associated argument, the associated argument to the flag -name is *.hs and the input to the command is a file path ..

  3. A command that reads input in stdin, and produces output in stdout or stderr.

    For example, sort command reads input from stdin and produces output in stdout or read from a file.

  4. Commands that produce side effect only, for example modifying enviroment

    export PATH=$PATH:/usr/local/bin

  5. Bash script functions

    • Functions can take arguments $1, $2, etc.

    • Functions can produce output in stdout or stderr.

    • Functions can have return codes using return statement.

Command Input and Output Mechanisms

  1. Arguments and option flags

    • flags can be optional or required,

    • flags can have associated arguments or not, depending on the flag.

  2. Stdin and Stdout

    • A command can read input from stdin, like

    command < file.txt

    command <<EOF
    (...)
    EOF
    • A command can produce output in stdout or stderr.

      • write to stdout command > file.txt

      • append command >> file.txt

      • write to stderr command 2> file.txt

      • write both stdout and stderr to the same file command > file.txt 2>&1

  3. Pipes

    A command can pass its output to another command using pipes.

    command1 | command2

Understanding Return Codes

Typically return code 0 meaning success and non-zero meaning failure. Other return codes can depend on the command. Our library should encode some of the return types in its abstract syntax for common commands, and at the same time allow directly manipulating the return codes.

The return code is stored in the special variable $? immediately after the command is executed, to read it we can do like

command
status=$? # can be omitted, you can use "if [ $? -eq 0 ]; then" instead.
if [ $status -eq 0 ]; then
  echo "Success"
else
  echo "Failure"
fi

# or something like this
if [ $? -ne 0 ]; then
  echo "Failure"
fi

GADT and Monadic Design

  • We want to write things monadically, and move the job of assigning variables to the library. We want to write things like

    do
      str <- ls "-l" -- which should be a command that returns a string
      echo str
  • We want to have a GADT that represents the type of the command, and the type of the command will be checked at compile time.

  • Command can have a parsable type class, which can help us to parse the output of the command into structured data. The type class will have a type family inside that will represent the type of the output of the command.

    class Parsable cmd where
      type Parsed cmd
      parse :: Stdout cmd String -> Maybe (Parsed cmd)
    
    instance Parsable Ls where
      type Parsed Ls = LinesOf FilePath
      parse = undefined

    Unfortunately these parsers will need to be compiled into bash as well, I wonder if there is a way to compile any monadic parser into a bash one.

Design of GADT representations

Let’s review the above simplistic example to get an idea of what we are looking for.

The output of ls is stdout, the output should be given this type ‘Stdout String’

ls :: Params Ls -> Maybe DirPath -> Command (Stdout Ls String)

if we use cat str, it can be done by piping ls -l | cat, but not with echo, because echo does not take input from stdin. This time we need to use variables, or xargs. We will consider using variables as a more systematic way.

Consider the following types for ls and cat:

echo :: String -> Command (Stdout Echo String)

cat :: Params Cat 
    -> [FilePath] 
    -> Maybe (Stdin String)
    -> Command (Stdout Cat String)

(|) :: Command (Stdout cmda a) 
    -> (Stdin a -> Command (Stdout cmdb b)) 
    -> Command (Stdout cmdb b)

Here Params a is a type family or just some plain algebraic data type (change to ParamsLS and ParamsCAT for this design), that represents different type of parameters that a command can take.

Definition

  • A value is a string, array or number or some form of data that can be accessed and manipulated in bash. It could be literals Lit a or variables Var name.

    data Value a
      = Lit a
      | Var String -- variable
  • A command is a piece of code (doesn’t have to be one command, can be multiple lines) that already filled all its parameters, when you run it, it will do something and produce the desired output prescribed by the type wrapped by the Command type.

    data Command o where
        Ls   :: Params Ls -> Value (Maybe DirPath) -> Command (Stdout LS String)
    
        Echo :: Value String -> Command (Stdout ECHO String)
    
        Cat  :: Params CAT 
             -> Value [FilePath] 
             -> Maybe (Stdin a)
             -> Command (Stdout CAT a)
    
        CmdStdin :: (HasStdin cmd) 
             => Params cmd
             -> Stdin cmd -> Command (Stdout cmd a)
    
        CmdNoStdin  
             :: Params cmd
             -> Command (Stdout cmd a)
    
        (|>) :: Command (Stdout cmda a) 
             -> (Stdin a -> Command (Stdout cmdb b)) 
             -> Command (Stdout cmdb b)

    Command should be a ‘functor’, but should only work on functions that is implementable in bash.

  • A bash is a bash program, piece of code, function that can run or compose.

    data Bash a where

Implementation

The following is my first trial on implementing the GHB library. The code is not complete, and the design is probably not optimal, but it gives an idea of what I am trying to achieve owo!

{-# LANGUAGE GADTs, MultiParamTypeClasses, PolyKinds, TypeOperators, TypeFamilies, OverloadedStrings, FlexibleInstances, TypeApplications, DataKinds, ScopedTypeVariables, AllowAmbiguousTypes, DerivingVia #-}
-- | GHB is an EDSL for Haskell -> Bash scripting.
-- Author: Eiko
module GHB where

import Control.Monad.State
import Data.String
import qualified Data.Map as M
import Data.Functor.Identity
import Data.HList hiding (get)

data Value a where
  Lit     :: a -> Value a                      -- ^ a literal holding a value
  Var     :: VarName a -> Value a                -- ^ variable name
  FromCmd :: (KnownCommand cmd) => Command (Stdout cmd a) -> Value a -- ^ value from a command

newtype VarName a = VarName String 
  deriving (IsString, Ord, Eq) via String

newtype Path = Path String
  deriving (IsString, Ord, Eq) via String

data Command o where
  CmdStdin :: (HasStdin cmd i, EmbedValue i) => Params cmd -> Stdin i -> Command (Stdout cmd o)
  CmdNoStdin :: Params cmd -> Command (Stdout cmd o)

  Pipe :: (KnownCommand cmda) => Command (Stdout cmda a) -> (Stdin a -> Command (Stdout cmdb b)) -> Command (Stdout cmdb b)

(|>) :: (KnownCommand cmda) => Command (Stdout cmda a) -> (Stdin a -> Command (Stdout cmdb b)) -> Command (Stdout cmdb b)
(|>) = Pipe

data Stdin a      
  = Stdin 
  | FromValue (Value a) -- should be 'a'
  | FromFile (Value Path)

data Stdout cmd b = Stdout

data Output (a :: k)

data Or (a :: k) (b :: k) -- = Or (Output a) (Output b)

data Many (a :: k)

data Func (inp :: [k]) (out :: [k])

data BashStruct a where
  Sequential  :: BashStruct as -> BashStruct bs -> BashStruct (as ++ bs)
  IfElse      :: Value Bool    -> BashStruct as -> BashStruct bs -> BashStruct '[Or as bs]
  If          :: Value Bool    -> BashStruct as -> BashStruct '[Or as '[]]
  While       :: Value Bool    -> BashStruct ls -> BashStruct '[Many ls]
  Command     :: KnownCommand cmd => Command (Stdout cmd a) -> BashStruct '[a]
  Assign      :: EmbedValue a => VarName a  -> Value a  -> BashStruct '[]
  FunctionDec :: VarName (Func input output)  -> BashStruct output -> BashStruct '[]
  FunctionApp :: VarName (Func input output)  -> FList Value input -> BashStruct output

functionCall :: Monad m 
  => VarName (Func input output) 
  -> FList Value input 
  -> GHB m (Command (Stdout (Func input output) output))
functionCall = undefined

trimString :: BashStruct '[]
trimString = FunctionDec "trimString" $ Command (echo [Var "1"] |> cat nop)

(>>>) :: BashStruct as -> BashStruct bs -> BashStruct (as ++ bs)
(>>>) = Sequential
infixr 5 >>>

class Assignable a where
  type AssignedType a
  (@=) :: VarName (AssignedType a) -> a -> BashStruct '[]

instance (EmbedValue a, KnownCommand cmd) => Assignable (Command (Stdout cmd a)) where
  type AssignedType (Command (Stdout cmd a)) = a
  vname @= cmd = Assign vname $ FromCmd cmd

type PureGHB a = GHB Identity a

type BashLine = String

ghb :: BashStruct a -> PureGHB [BashLine]
ghb (Sequential s1 s2) = (++) <$> ghb s1 <*> ghb s2
ghb (If cond sif) = do
  ifcondthen <- insertIndent $ "if [" ++ embedValue cond ++ "]; then" 
  blockif <- increaseIndent >> ghb sif
  endif <- decreaseIndent >> insertIndent "fi"
  return $ ifcondthen : blockif ++ [endif]
ghb (IfElse cond sif selse) = do
  ifcondthen <- insertIndent $ "if [" ++ embedValue cond ++ "]; then" 
  blockif <- increaseIndent >> ghb sif
  else' <- decreaseIndent >> insertIndent "else"
  blockelse <- increaseIndent >> ghb selse
  endif <- decreaseIndent >> insertIndent "fi"
  return $ ifcondthen : blockif ++ else' : blockelse ++ [endif]
ghb (While cond s) = do
  while <- insertIndent $ "while [" ++ embedValue cond ++ "]; do"
  block <- increaseIndent >> ghb s
  endwhile <- decreaseIndent >> insertIndent "done"
  return $ while : block ++ [endwhile]
ghb (Command cmd) = pure <$> insertIndent (protoCompiler cmd)
ghb (Assign vname val) = pure <$> insertIndent (embedVarAtAssignment vname ++ "=" ++ embedValue val)
ghb (FunctionDec fname body) = do
  let funcname = embedVarAtAssignment fname
  funcdec <- insertIndent $ "function " ++ funcname ++ " {"
  block <- increaseIndent >> ghb body
  endfunc <- decreaseIndent >> insertIndent "}"
  return $ funcdec : block ++ [endfunc]

getIndent :: Monad m => GHB m Indent
getIndent = gets _indent

increaseIndent :: Monad m => GHB m ()
increaseIndent = modify $ \bstate -> bstate { _indent = Indent $ unIndent (_indent bstate) + 1 }

decreaseIndent :: Monad m => GHB m ()
decreaseIndent = modify $ \bstate -> bstate { _indent = Indent $ unIndent (_indent bstate) - 1 }

insertIndent :: Monad m => BashLine -> GHB m BashLine
insertIndent str = do
  Indent i <- getIndent
  return $ replicate (indentUnit * i) ' ' ++ str
  where indentUnit = 4

-- | This function declares a new variable name
newName :: Monad m => String -> GHB m (VarName a)
newName vname = do
  BState currentMap position <- get
  case M.lookup vname currentMap of
    Just idmap -> 
      let mmaxid = M.lookupMax idmap in
      case mmaxid of
        Nothing -> do
          let newvarName = vname ++ "_0"
          put $ BState (M.insert vname (M.insert 0 True idmap) currentMap) position
          return $ VarName newvarName
        Just (maxid, _) -> do
          let newvarName = vname ++ "_" ++ show (maxid + 1)
          put $ BState (M.insert vname (M.insert (maxid + 1) True idmap) currentMap) position
          return $ VarName newvarName
    Nothing -> do
      let newvarName = vname ++ "_0"
      put $ BState (M.insert vname (M.singleton 0 True) currentMap) position
      return $ VarName newvarName

type GHB m a = StateT BState m a 
-- this is a monad where we can keep track of the state of the bash script, i.e. declarations of variables and functions

type (#>) = M.Map
infixr 5 #>

type InScope = Bool
data BState = BState { _vars :: String #> Int #> InScope, _indent :: Indent }

newtype Indent = Indent { unIndent :: Int }
data Position = Position { row :: Int, col :: Int }


instance IsString (Value String) where fromString = Lit
instance IsString (Value (Maybe String)) where 
  fromString [] = Lit Nothing
  fromString xs = Lit $ Just xs

class HasStdin cmd i 

instance HasStdin CAT String

data CAT
data LS
data ECHO a
data RSYNC
  
type family Params cmd
type instance Params CAT = Value (Maybe String)
type instance Params LS  = Value (Maybe String)
type instance Params (ECHO a) = [Value a] -- echo has a mandatory argument
--type instance Params RSYNC = 

ls :: Params LS -> Command (Stdout LS String)
ls = CmdNoStdin @LS

cat :: Params CAT -> Stdin String -> Command (Stdout CAT String)
cat = CmdStdin @CAT

echo :: forall a. Params (ECHO a) -> Command (Stdout (ECHO a) a)
echo = CmdNoStdin @(ECHO a)

class KnownCommand cmd where
  applyCommand :: Params cmd -> String

instance KnownCommand LS where
  applyCommand (Lit Nothing) = "ls"
  applyCommand v = "ls " ++ embedValue v

instance KnownCommand CAT where
  applyCommand (Lit Nothing) = "cat"
  applyCommand v = "cat " ++ embedValue v

instance EmbedValue a => KnownCommand (ECHO a) where
  applyCommand v = "echo " ++ unwords (embedValue <$> v)

class EmbedLit a where
  embedLit :: a -> String

instance EmbedLit String where
  embedLit str = "\"" ++ str ++ "\""

class EmbedValue a where
  embedValue :: Value a -> String

embedVarAtApplication :: VarName a -> String
embedVarAtApplication (VarName vname) = "${" ++ vname ++ "}"

embedVarAtAssignment :: VarName a -> String
embedVarAtAssignment (VarName vname) = vname

instance EmbedValue String where
  embedValue (Lit a) = embedLit a
  embedValue (Var vname) = embedVarAtApplication vname
  embedValue (FromCmd cmd) = slice $ protoCompiler cmd

instance EmbedValue Path where
  embedValue (Lit (Path path)) = embedLit path
  embedValue (Var vname) = embedVarAtApplication vname
  embedValue (FromCmd cmd) = slice $ protoCompiler cmd

instance EmbedValue (Maybe Path) where
  embedValue (Lit (Just (Path path))) = embedLit path
  embedValue (Lit Nothing) = ""
  embedValue (Var vname) = embedVarAtApplication vname
  embedValue (FromCmd cmd) = slice $ protoCompiler cmd

instance EmbedValue (Maybe String) where
  embedValue (Lit (Just str)) = embedLit str
  embedValue (Lit Nothing) = ""
  embedValue (Var vname) = embedVarAtApplication vname
  embedValue (FromCmd cmd) = slice $ protoCompiler cmd

instance EmbedValue Bool where
  embedValue (Lit True) = "true"
  embedValue (Lit False) = "false"
  embedValue (Var vname) = embedVarAtApplication vname
  embedValue (FromCmd cmd) = slice $ protoCompiler cmd

slice :: String -> String
slice str = "$(" ++ str ++ ")"

nop :: Value (Maybe a)
nop = Lit Nothing

nos :: Stdin a
nos = Stdin

protoCompiler :: forall cmd a. (KnownCommand cmd) => Command (Stdout cmd a) -> String
protoCompiler (CmdStdin p Stdin)                      = applyCommand @cmd p
protoCompiler (CmdStdin p (FromValue vstdin@(Lit _))) = applyCommand @cmd p ++ " <<EOF\n" ++ embedValue vstdin ++ "\nEOF"
protoCompiler (CmdStdin p (FromValue (Var vn)))       = protoCompiler (echo [Var vn] |> CmdStdin @cmd p)
protoCompiler (CmdStdin p (FromValue (FromCmd cmd)))  = protoCompiler (cmd |> CmdStdin @cmd p)
protoCompiler (CmdStdin p (FromFile vfilepath))       = applyCommand @cmd p ++ " < " ++ embedValue vfilepath
protoCompiler (CmdNoStdin p)                          = applyCommand @cmd p
protoCompiler (Pipe c1 c2)                            = protoCompiler c1 ++ " | " ++ protoCompiler (c2 Stdin)

home = Var "HOME"

program :: Command (Stdout CAT String)
program = ls home |> cat nop |> cat nop

test = toBashLines programb

toBash :: PureGHB (BashStruct a) -> String
toBash = unlines . toBashLines

toBashLines :: PureGHB (BashStruct a) -> [BashLine]
toBashLines prog = runIdentity $ evalStateT (ghb =<< prog) (BState M.empty (Indent 0))

varJust :: VarName a -> Value (Maybe a)
varJust (VarName vname) = Var (VarName vname)

programb :: PureGHB (BashStruct '[String])
programb = do
  lshome    <- replicateM 10 $ newName "lshome"
  catlshome <- replicateM 10 $ newName "catlshome"
  return $   foldr1 (>>>) (map (@= ls (Var "HOME")) lshome)
         >>> foldr1 (>>>) (zipWith (\v a -> v @= echo @String [Var a]) catlshome lshome) 
         >>> Command (echo $ Var <$> catlshome)