
2020-10-08 12:00:00

Llama preview 0.1.2

So you may have seen in previous blog entries that I've been working on tools to compile LLVM bitcode into a .NET assembly. A lot of this effort has been focused on Rust, but the approach can be used for other languages as well. I've started referring to this overall project as "Llama".

Today I published the core functionality of Llama as a dotnet tool on nuget.org. This tool is called "bc2cil", and its only feature is to compile a bitcode file into a .NET assembly.

dotnet tool install -g sourcegear.llama.bc2cil

In this blog entry I'll walk through a couple of examples to show how this tool is used.

Coder Beware

This preview release is simply for interested folks who want to fiddle around with LLVM bitcode and .NET.

Llama.bc2cil

To install the Llama bc2cil tool:

dotnet tool install -g sourcegear.llama.bc2cil

Now if you have an LLVM bitcode file called foo.bc, you can compile it into a .NET assembly like this:

bc2cil foo.bc

This will produce foo.dll, which you can use like any other .NET assembly. With a few caveats.

But where did that foo.bc file come from?

Bitcode files

The LLVM system is built around an Intermediate Representation (IR), which is basically a quasi-portable assembly language. The textual form of IR is usually stored in a file with a .ll extension. This form looks like this:

  %127 = getelementptr i8, i8* %94, i64 %106
  %128 = add i64 %123, %122
  %129 = icmp eq i64 %103, %106
  br i1 %129, label %143, label %130

In its binary form, an IR file has a .bc extension, and is often referred to as "bitcode". These two representations are equivalent, and the llvm-as and llvm-dis tools convert between them (llvm-as goes from .ll to .bc, and llvm-dis goes the other way).

When a developer is using an LLVM-based compiler, most of this stuff is usually hidden. The compiler parses the language and generates IR, then it manipulates the IR, then it translates the IR into machine code for a specific CPU, and the developer never actually sees the IR.

But there is [usually] a way to tell the compiler to stop the pipeline early and emit bitcode instead. For example, when using clang, this is done by passing the -emit-llvm option on the command line.

(Llama could have been implemented in C++ as an LLVM target and backend. In fact, there is archeological evidence that LLVM once had a backend for CIL, but it was abandoned and removed a long time ago. Instead, my implementation is a separate tool that reads a bitcode file.)

Hello World in C

Let's take a look at the following C program:

int puts(const char*);

int main()
{
    puts("Hello World");
}

Note that I've intentionally added my own C prototype declaration for puts() instead of #include <stdio.h>, because I don't want the system header files bringing in any surprises. For this trivial example, I just want the code to depend on one externally defined function called puts(), nothing more.

A look at IR

Let's compile this hello program into a bitcode file:

clang -emit-llvm -c hello.c

This will give me the hello.bc I wanted, but I also want to see it in text form, so I use the LLVM disassembler:

llvm-dis hello.bc

Now I get hello.ll, which contains a bit too much noise to be suitable for a blog post. Instead of showing you the whole thing, here's the essence of it:

@"??_C@_0M@KPLPPDAC@Hello?5World?$AA@" = constant [12 x i8] c"Hello World\00"

define dso_local i32 @main() #0 {
  %1 = call i32 @puts(i8* @"??_C@_0M@KPLPPDAC@Hello?5World?$AA@")
  ret i32 0
}

declare dso_local i32 @puts(i8*) #1

There are 3 global symbols here: the constant holding the "Hello World" string (with its name mangled by the compiler), the definition of main(), and the declaration of puts(), which is expected to be defined somewhere else.

From bitcode to CIL

So let's compile that bitcode file with Llama:

$ bc2cil hello.bc
import method not found: puts
Call Missing method b: puts

Unsurprisingly, we're getting complaints about the missing puts() function.

Typically, when a compiler toolchain finds an unresolved symbol, the build process is halted with a fatal error. Currently Llama behaves differently, replacing the method call with an exception throw. When Llama is ready for production use, it won't do that, but for now, I often find it handy to have it work this way.

So despite the missing symbol, I do get an assembly. Let's see what's in it, using ildasm to output the DLL to CIL in textual form:

dotnet ildasm hello.dll > hello.cil

Just as with the .ll file, the textual form of CIL is a bit much for a blog post, so again I'll just show the essence:

  .method public static default int32 main() cil managed
  {
    // Method begins at Relative Virtual Address (RVA) 0x2050
    // Code size 17 (0x11)
    .maxstack 8
    IL_0000: ldstr "Call Missing method b: puts"
    IL_0005: newobj instance void class [System.Private.CoreLib]System.Exception::.ctor(string)
    IL_000a: throw
    IL_000b: ldc.i4 0
    IL_0010: ret
  } // End of method System.Int32 foo::main()

Yup, that's what I expected. The puts() function wasn't available, so that function call has been replaced with an exception.

And then my code generator blithely goes on to return 0 even though that's unreachable after the throw.

The point here is to illustrate that compiling bitcode to CIL is just one piece of the story. Most code has lots of external dependencies. For Llama to be useful, we need those external dependencies to be provided somehow, and that can be a big problem.

But for this situation, I'm only missing one thing. We can push through this.

The world's most pathetic libc

All we need for this case is puts(). It accepts a string and prints it on stdout. How hard could that be?

The only slightly complicated thing here is that C and .NET have differing notions of what a string is. Our C code will provide a pointer to a zero-terminated "C string" of unspecified encoding, whereas .NET strings are managed objects encoded as UTF-16.

But we can deal with that. Let's implement puts() using C#.

Create a class library project:

mkdir libc
cd libc
dotnet new classlib

And here's what Class1.cs needs:

public static class pathetic
{
    // Count the bytes up to (but not including) the zero terminator.
    static unsafe int strlen(byte* p)
    {
        var n = 0;
        while (*p != 0)
        {
            n++;
            p++;
        }
        return n;
    }
    public static unsafe int puts(byte* s)
    {
        // Treat the C string as UTF-8 and convert it to a .NET string.
        var str = System.Text.Encoding.UTF8.GetString(s, strlen(s));
        System.Console.WriteLine(str);
        return 0;
    }
}

Mostly we just need to convert the C string to a .NET string and call Console.WriteLine(). For the purpose of converting to a .NET string, I'm assuming the C string to be UTF8.

And then there's the fact that C deals with pointers, so this code needs to be compiled with unsafe turned on. Here's my libc.csproj file:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFramework>netcoreapp3.1</TargetFramework>
    <AllowUnsafeBlocks>True</AllowUnsafeBlocks>
  </PropertyGroup>

</Project>

Fulfilling the external dependency

So how do we make our pathetic libc available for hello to use?

Llama's bc2cil tool has a command line option to provide assemblies to be referenced. The argument value for the --ref option is (1) the path to the assembly, and (2) the name of the class in which to search, separated by a comma. So I copied libc.dll into my work directory and did this:

bc2cil --ref=libc.dll,pathetic hello.bc

Now we aren't getting those missing function errors that we saw before. When Llama encounters that reference to puts(), it looks in libc.dll, in the class called pathetic, and finds it there. And the resulting CIL (from ildasm again) looks better:

  .method public static default int32 main() cil managed
  {
    // Method begins at Relative Virtual Address (RVA) 0x2050
    // Code size 20 (0x14)
    .maxstack 1
    IL_0000: ldsfld [mscorlib]System.IntPtr foo::??_C@_0M@KPLPPDAC@Hello?5World?$AA@
    IL_0005: call int32 class [libc]pathetic::puts(pointer)
    IL_000a: stloc class V_0
    IL_000e: ldc.i4 0
    IL_0013: ret
  } // End of method System.Int32 foo::main()

Again, you may notice that my code generator isn't going to win any awards. It stores the integer result of the puts() call in a local that never gets used.

Running the program

How do we run this? We're using .NET Core, so we need a hello.runtimeconfig.json file:

{
  "runtimeOptions": {
    "tfm": "netcoreapp3.1",
    "framework": {
      "name": "Microsoft.NETCore.App",
      "version": "3.1.8"
    }
  }
}

Now we can try:

$ dotnet hello.dll
Unhandled exception. System.MissingMethodException: Entry point not found in assembly 'hello...

Oops. We never told .NET about an entry point. Llama.bc2cil has another option for that, called --exe. When this option is true, it looks for a main() and provides startup code to call it.

bc2cil --ref=libc.dll,pathetic --exe=true hello.bc

Finally:

$ dotnet hello.dll
Hello World

WELL then. That was a lot of work just for Hello World.

Let's walk through one more example.

Swift

Swift's compiler is based on LLVM (which is unsurprising, as Swift and LLVM were developed by the same folks). Can we do .NET development with Swift?

Warning: I have VERY little actual experience with Swift, so let's not be surprised if I do or say something stupid here.

The first thing is to figure out if the Swift compiler can give me a .bc file. Looks like the swiftc option I need is -emit-bc.

(Some experimentation and digging suggests that I want the -parse-as-library option as well. Without this flag, swiftc seems to assume the source file is a script, putting its contents into an implicit main().)

Now I need a bit of Swift code to compile. But I don't want to be ambitious at all. People tend to think of "Hello World" as simple, but writing text to stdout can require all kinds of stuff in terms of library dependencies. For this first test, I just want a snippet of code that has no dependencies at all, if that's possible. Like maybe just a function that multiplies two integers.

func mul(_ a : Int, b : Int) -> Int {
    return a * b;
}

That might work. My goal is to compile this to a .NET assembly and then call it from C#. First the bitcode file:

swiftc -parse-as-library -emit-bc mul.swift

That gives me mul.bc, so this looks promising so far. But let's run llvm-dis and look at the textual IR:

define hidden swiftcc i64 @"$s3mulAA_1bS2i_SitF"(i64 %0, i64 %1) #0 {
  %3 = alloca i64, align 8
  %4 = bitcast i64* %3 to i8*
  call void @llvm.memset.p0i8.i64(i8* align 8 %4, i8 0, i64 8, i1 false)
  %5 = alloca i64, align 8
  %6 = bitcast i64* %5 to i8*
  call void @llvm.memset.p0i8.i64(i8* align 8 %6, i8 0, i64 8, i1 false)
  store i64 %0, i64* %3, align 8
  store i64 %1, i64* %5, align 8
  %7 = call { i64, i1 } @llvm.smul.with.overflow.i64(i64 %0, i64 %1)
  %8 = extractvalue { i64, i1 } %7, 0
  %9 = extractvalue { i64, i1 } %7, 1
  %10 = call i1 @llvm.expect.i1(i1 %9, i1 false)
  br i1 %10, label %12, label %11

11:                                               ; preds = %2
  ret i64 %8

12:                                               ; preds = %2
  call void @llvm.trap()
  unreachable
}

WHOA -- that's a lot of code. I was expecting something much shorter. What I had in mind was something more like this:

define hidden swiftcc i64 @mul(i64 %0, i64 %1) #0 {
  %2 = mul i64 %0, %1
  ret i64 %2
}

Looking more closely at what Swift gave me, a lot of what is happening is overflow checking on the integer multiplication (llvm.smul.with.overflow.i64). Llama does have some support for those features, so we can keep going.
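
For what it's worth, llvm.smul.with.overflow.i64 just multiplies two 64-bit integers and also reports whether the result overflowed; the branch to llvm.trap() fires when it did. Here's a rough C# equivalent of that intrinsic, purely as an illustration (this is not how Llama actually lowers it):

static class OverflowDemo
{
    // Rough C# analogue of llvm.smul.with.overflow.i64: returns the
    // wrapped product plus a flag saying whether the multiply overflowed.
    public static (long value, bool overflow) SMulWithOverflow(long a, long b)
    {
        try
        {
            return (checked(a * b), false);
        }
        catch (System.OverflowException)
        {
            return (unchecked(a * b), true);
        }
    }
}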

But the other problem is the name of the function. The Swift compiler mangled it to $s3mulAA_1bS2i_SitF. That's not a valid C# identifier. How am I supposed to call that? There's gotta be a way to tell Swift not to mangle that name.

After a bit of digging, I found the @_cdecl attribute. People describe it as rather unofficial, but for now I'll take it. So now mul.swift looks like this.

@_cdecl("mul")
func mul(_ a : Int, b : Int) -> Int {
  return a * b;
}

I re-run the Swift compiler and then try Llama on the resulting .bc file:

$ swiftc -parse-as-library -emit-bc mul.swift
$ bc2cil mul.bc

This does produce a mul.dll assembly. Now I need a little C# console app to call Swift mul().

mkdir main
cd main
dotnet new console

One thing I haven't explained yet is that when Llama compiles a bitcode file, it outputs each function as a static method, putting all of them in a static class. The name of that class is configurable, but its default name is foo, for no particular reason. So from the perspective of C#, the Swift mul() function is named foo.mul():

using System;

namespace main
{
    class Program
    {
        static void Main(string[] args)
        {
            var x = foo.mul(7, 6);
            Console.WriteLine($"Hello {x}!");
        }
    }
}
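
Incidentally, for that call to compile, all C# needs to see in mul.dll is a static class named foo with a static mul method. The surface looks roughly like the sketch below. To be clear, this is hand-written, not ildasm output; the parameter names are mine, and I'm assuming Swift's 64-bit Int comes through as long:

// Approximate shape of what bc2cil produces for mul.bc (my sketch, not actual output).
public static class foo
{
    public static long mul(long a, long b)
    {
        // The real method body is compiled from the bitcode,
        // including the overflow check that Swift inserted.
        return a * b;
    }
}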

And of course I need to reference the mul.dll assembly by adding the following to main.csproj:

  <ItemGroup>
    <Reference Include="..\mul.dll" />
  </ItemGroup>

So hopefully now I can run this and get Swift to multiply 7 times 6:

$ dotnet run
Hello 42!

Yay!

If you want to see something truly dreadful, run ildasm/ilspy/dnspy on hello.dll and look in the cctor() at how that string literal is initialized. (Sooner or later I'll write the code to do that properly.)

Comments or questions? Find me on Twitter at @eric_sink.