1 Mar 2017

Languages That Let You Assemble Software from Components

I was reading Fred Brooks's classic paper, No Silver Bullet, and the enthusiasm about component marketplaces was striking. In the late 80s and 90s, at the peak of the excitement about object-oriented programming, many people envisioned a new model of building software: instead of coding it yourself, you'd build it from pre-made modules obtained from a component marketplace.

This marketplace hasn't materialised. We still code a lot from scratch. Too much, in fact. Ideally, when you build a new app, you should be coding only the things that make your app different and unique. But, in reality, most of our time goes into building the basics that all apps must have. This doesn't add value to users or to the business. Put differently, software development is highly wasteful in time, money and opportunity cost. How many more or better services might we have had if we spent our time better? How much more could an independent app developer accomplish over his career if he didn't waste most of his time reinventing the wheel?

What can we do to realise these benefits? I think the key is reducing the risk of integrating a third-party library into your app: risk to privacy, security, reliability and correctness. If you had a language that guaranteed that a third-party library can't cause your app to crash, or leak your private data, or cause bugs or security issues in the rest of your app, it would lower the barrier to adoption, hopefully letting you use a third-party component in a situation where you might not otherwise have.

Some communities, like Node.js, have a different culture. They've embraced components wholeheartedly. You use some libraries, and they use others, and so on, and now you're shipping hundreds of packages from people you don't necessarily trust. The Node community doesn't need encouragement to adopt components. They already have. But the question still remains: can we reduce the risk of using random npm packages from people you don't know or necessarily trust?

Even if you're not using third-party libraries, you can factor some of your own code out into an internal library for ease of understanding and maintainability, or because a different person or team is responsible for it. The same techniques that let you confidently use third-party libraries can be applied within a company or project to help you write better software.

Suppose we were to design a language from scratch to reduce the barriers and the risks to assembling software from components. What form would this language take?

Memory safety

... prevents a third-party library from crashing your app, or corrupting your data. A library should be able to corrupt only its own data. This means, as in Java: no raw pointers, and garbage collection, bounds-checked arrays, checked casts, compulsorily initialised references, and so on.

Memory safety is foundational for all other guarantees. If you don't have memory safety, all bets are off.

To verify these properties at runtime, libraries should be distributed as bytecode, not machine code. Or source code. In this post, when I say "bytecode", I mean "bytecode or source".

This doesn't apply to apps (things with a main() function). Apps could be distributed in binary, using an AOT compiler, like Ngen. Or they could be distributed as bytecode and compiled to native code during installation. Or JIT'ted. Any of these is fine. It doesn't matter how apps are distributed, since they're not reused; it matters that libraries are distributed as bytecode.


Exceptions

Even modern, memory-safe languages like Swift let third-party libraries crash your app, say if they dereference null. It's a controlled crash, not random memory corruption, but that still shouldn't be the case. If things go wrong in a third-party library, it should produce an exception, like in Java. Or, equivalently, return an error code. Not crash. Crashing should be a thing of the past, no matter how carelessly the library was coded.
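
As a rough illustration in today's Java, a caller can already turn a failure inside a library call into a value it has to handle. This is only a sketch: Contained, callLibrary and flakyParse are hypothetical names made up for this example, and flakyParse stands in for a carelessly written third-party function.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: runs a library call and converts any runtime
// failure inside it into an empty result instead of letting it propagate.
final class Contained {
    static <T> Optional<T> callLibrary(Supplier<T> libraryCall) {
        try {
            return Optional.ofNullable(libraryCall.get());
        } catch (RuntimeException e) {
            // The failure stays inside the call; the app decides what to do.
            return Optional.empty();
        }
    }

    // Stand-in for a carelessly written third-party function.
    static int flakyParse(String s) {
        return s.trim().length(); // throws NullPointerException if s is null
    }
}
```

Calling Contained.callLibrary(() -> Contained.flakyParse(null)) yields an empty Optional rather than taking the app down. Note that this catches only RuntimeException; Java's Error types, such as StackOverflowError, would still escape, which is part of why the guarantee would need to come from the language.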


No static variables

We should look at getting rid of static variables (including global and local static variables). Pass a reference explicitly. That way, a library can modify only objects it's given a reference to. You know at the point of invocation what objects can be affected.

You don't have to worry that a library modified some global state, either by corrupting it, or by modifying it in a way that's sensible on its own but conflicts with how your app or another library modifies it, so that they step on each other's toes [0].
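
A minimal Java sketch of the idea, with made-up names (EventRecorder, NoStaticsDemo): the library function keeps no state of its own and can touch only the object the caller hands it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical library class with no static state: it can only touch
// the log object the caller hands to it.
final class EventRecorder {
    static void record(List<String> log, String event) {
        log.add(event);
    }
}

final class NoStaticsDemo {
    public static void main(String[] args) {
        List<String> appLog = new ArrayList<>();
        List<String> auditLog = new ArrayList<>();

        // The call site names the one object that may change.
        EventRecorder.record(appLog, "started");

        System.out.println(appLog);   // [started]
        System.out.println(auditLog); // []
    }
}
```

Java itself doesn't stop EventRecorder from declaring a static field; the sketch only shows what call sites would look like if the language did.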


Side-effect-free functions

A language designed for assembly of software from pre-made components should support side-effect-free functions [1]. These don't modify their arguments, or the object they're invoked on (this). A side-effect-free function can call only other side-effect-free functions.

You can confidently invoke a side-effect-free library function anywhere, and in any situation.

The language should let us mark functions as having a side effect. Better, it should let us annotate which argument a function modifies. Here's an example of a function that copies one List to another:

void copy(mutating List destination, List source)

This way, you know not just that a function modifies state, but which argument it modifies. In the absence of this keyword, functions are side-effect-free.

The mutating keyword also applies to methods:

class Person {
  int getAge();
  mutating void setAge(int age);
}

This lets you invoke library functions with confidence, knowing what objects might change as a result, and which won't.
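
Java has no mutating keyword, but the effect can be roughly approximated with a read-only interface. The sketch below is an assumption-laden stand-in, not the proposed feature: PersonView, MutablePerson and AgeLibrary are hypothetical names, and nothing stops a library from downcasting the view back to the mutable type.

```java
// Read-only view: exposes only the methods the post would leave unmarked.
interface PersonView {
    int getAge();
}

// Full type: adds the method the post would mark `mutating`.
final class MutablePerson implements PersonView {
    private int age;

    MutablePerson(int age) { this.age = age; }

    @Override public int getAge() { return age; }

    void setAge(int age) { this.age = age; }
}

final class AgeLibrary {
    // Accepts only the view, so setAge is not visible through this parameter.
    static boolean isAdult(PersonView person) {
        return person.getAge() >= 18;
    }

    // Accepts the full type, signalling that the argument may change.
    static void haveBirthday(MutablePerson person) {
        person.setAge(person.getAge() + 1);
    }
}
```

The signatures carry the same information as the mutating annotation, but only by convention. A language-level keyword would make the compiler check it.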


Immutable Classes

Immutable objects help reduce unintended side-effects, especially when working with a third-party library built by someone else. But immutable objects are very hard to implement in languages that don't have support for them [2]. So the language should have support for immutable classes, perhaps via an immutable keyword that you'd use when declaring a class:

immutable class Person {

The compiler will do whatever it takes to make the class immutable. It will make the class final. It will make all fields final, and if they're of class type, ensure that they're themselves immutable, or that they're defensively copied and only side-effect-free functions are invoked on them. These are all implementation details. You don't have to be concerned about how to implement an immutable object, just that it is immutable.

The compiler will make sure that objects remain immutable even in the presence of race conditions, that the state of the object won't appear to have changed when accessed by multiple threads without synchronisation. Maybe you haven't thought of multithreading. Or you don't understand the memory model of the language. Or don't know what a memory model is. The compiler will still guarantee that an immutable object is, in fact, immutable.

Even highly-skilled programmers make mistakes, so it's good to have the compiler do the hard work for us [3], so that we can focus on the high-level properties of our classes, not on the mechanics of their implementation.


Sandboxing

You can load a library into a sandbox, or invoke a function in a sandbox. Code in the sandbox won't be allowed to do I/O and invoke dangerous OS or standard library functions. Such calls would be guarded by a sentry, which is an object that inspects and allows or disallows them. The default sentry denies everything, but you can define your own sentry that has a different policy [4].
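
Here's a rough Java sketch of what a sentry might look like. Everything in it is hypothetical: the Sentry interface, the operation name "file.read", and GuardedFiles, which stands in for a guarded standard-library entry point and does no real I/O.

```java
import java.util.Set;

// Hypothetical sentry: decides whether sandboxed code may perform an operation.
interface Sentry {
    boolean allows(String operation);
}

// The default policy from the post: deny everything.
final class DenyAllSentry implements Sentry {
    @Override public boolean allows(String operation) {
        return false;
    }
}

// A custom policy that permits only an explicit allow-list.
final class AllowListSentry implements Sentry {
    private final Set<String> allowed;

    AllowListSentry(Set<String> allowed) {
        this.allowed = Set.copyOf(allowed); // copy, so the policy can't change later
    }

    @Override public boolean allows(String operation) {
        return allowed.contains(operation);
    }
}

// Stand-in for a guarded standard-library entry point.
final class GuardedFiles {
    static String read(Sentry sentry, String path) {
        if (!sentry.allows("file.read")) {
            throw new SecurityException("file.read denied by sentry");
        }
        return "contents of " + path; // placeholder; no real I/O happens here
    }
}
```

In this sketch the sentry is passed explicitly, which fits the no-statics rule above. In a real design the runtime would presumably attach the sentry to the sandbox, so that library code couldn't substitute its own.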

The compiler automatically makes defensive copies of everything going into or out of the sandbox, except for immutable classes [5]. This way, there are no references across the sandbox boundary, which would defeat the point of the sandbox [6].

When a language is designed from scratch with sandboxing in mind, sandboxing can be implemented with zero or close to zero overhead, unlike retrofitting it after the fact onto a language like C++, as with Google's Native Client.

In addition to limiting what API calls sandboxed code can make, we could also limit the memory used by the sandbox. This would probably require the sandbox to have a separate heap.

The language could support timeouts for calls into the sandbox, to prevent the library running in the sandbox from hanging your app. Or use async/await, so that your main thread isn't unresponsive.
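
In current Java, a timeout on a library call can be approximated with an executor. This is a sketch under assumptions: TimedCall and callWithTimeout are made-up names, and unlike a real sandbox, Java can only interrupt the worker thread, not forcibly stop it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedCall {
    // Runs a library call on a worker thread and gives up after timeoutMillis.
    // Returns fallback if the call times out or fails.
    static <T> T callWithTimeout(Callable<T> libraryCall, long timeoutMillis, T fallback) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<T> result = worker.submit(libraryCall);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return fallback;
        } finally {
            // Interrupts the worker; code that ignores interrupts keeps running.
            worker.shutdownNow();
        }
    }
}
```

A call such as TimedCall.callWithTimeout(() -> slowLibraryCall(), 100, defaultValue), where slowLibraryCall is whatever library function you're guarding, returns the fallback once 100 ms have passed. The caller is unblocked, but a library that ignores interrupts would keep burning CPU in the background, which is the gap a language-level sandbox would close.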

You could restrict background execution, which will let the sandbox run only when there's a pending call into it, and for a grace period of 5 seconds afterward. This prevents a third-party library from draining battery or slowing down your app when it's not being used.

If that's too strict, you could let it create threads, but limit their priority, so that they don't interfere with higher priority threads, like the UI thread.

You could force libraries to use efficient abstractions like Grand Central Dispatch or goroutines instead of kernel threads. You can create tons of GCD tasks or goroutines without overloading the system, which may be especially important in constrained environments like phones. Tasks and goroutines also consume far less memory than a kernel thread.

With a sandbox, an ad library in a mobile app that happens to have contacts permission won't be able to abuse that permission to upload the user's contact list to the ad network's servers, for example. You could require it to make HTTPS calls, not HTTP, if that's what you want. You could stop it from hanging your app or making it unresponsive, or from consuming too much memory and crashing it. And so on.

At a high level, when you sandbox a library, you can prevent it from hurting the privacy, security, reliability or performance of your app.


No extensions

Some languages like Swift let you modify classes, including system classes. This would be a mis-feature in a language designed to aid assembly of software from components. A library you include in your app shouldn't be able to modify classes it doesn't own, interfering with your app or other libraries.


Portable

A language designed to aid assembly of software from components should be portable, in many different ways.

First, code should be hardware-independent, like Java. A library should work on all devices, and in exactly the same way. It should be distributed as bytecode or source code, not machine code.

Second, languages like C leave some decisions to the compiler, like whether a char is signed or unsigned. This is different from hardware-dependence, because two compilers on the same hardware could implement things differently, making it harder for you to use a library built with one compiler in an app built with another compiler. This is again a mis-feature in a language designed for component reuse. Any two compilers should generate compatible code.

This means ABI stability as well. That mostly comes free when you distribute bytecode, but not always. For example, in Java, if a third-party library defines a public static final int constant that you use, the compiler inlines its value into your code. If the next version of the library changes the value of the constant, your code is stuck with the old value until you recompile, which can cause bugs or even exceptions. This is a mis-feature of Java. Recompiling should never change the behavior of your app.
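
To make the Java case concrete, here's a sketch of a hypothetical library class (LibraryLimits and its fields are made-up names). It shows a constant that gets inlined next to two common ways a library author can avoid that.

```java
// Hypothetical library class, version 1.
final class LibraryLimits {
    // Compile-time constant: javac copies the value 10 into every class
    // that reads it, so clients keep 10 until they are recompiled.
    public static final int MAX_ITEMS = 10;

    // Not a constant expression (it is a method call), so clients read the
    // field at run time and see a new value as soon as the library is swapped.
    public static final int MAX_USERS = Integer.parseInt("100");

    // A plain accessor has the same effect: the value is fetched at run time.
    static int maxItems() {
        return MAX_ITEMS;
    }
}
```

Both workarounds rely on the library author knowing the rule, which is the kind of burden the language should carry instead.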


Conclusion

When we use a third-party library, the language should contain as many problems as possible within that library, instead of letting them put the privacy, security, reliability or correctness of our software at risk. Then, we'll be able to confidently use third-party libraries without knowing or caring as much who's built them. This frees us to implement only what's different and unique in our software, relying on premade components for things that are the same, letting us be more productive, both individually and as an industry.


[0] Eliminating static will be painful, requiring references to be passed around. Perhaps we can ease the pain by allowing dynamic scoping. Dynamic scoping is a way for a function to declare a local variable and have it be visible to all the functions it calls, recursively, without having to pass it explicitly. But dynamic scoping should be allowed only within the same module. A module is, as in Swift, a bunch of files that are compiled together, that don't make sense to distribute or use without the other files in it. Each library is a separate module, and your main app, another module. That way, you won't have the pain of passing around tons of references within your module, but at the same time, calls to a library explicitly identify what objects the library can modify.

[1] We don't need pure functions. It's okay for a function to do some computation based on the current time, for example. Or print something to the console. The thing we want to prevent is changing the values of variables and objects you have.

[2] For example, do you think this Java class is immutable?

class Person {
    private int age;
    static Person lastCreatedPerson = null;

    Person(int age) {
        lastCreatedPerson = this;
        this.age = age;
    }

    int getAge() {
        return age;
    }
}

This isn't immutable, because a reference to the object can be accessed by another thread before the field is initialised. So let's fix that, by eliminating the static variable, and make the field final too, to be safe:

class Person {
    private final int age;

    Person(int age) {
        this.age = age;
    }

    int getAge() {
        return age;
    }
}

Is this immutable? Again, no:

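class MischievousPerson extends Person {
    public int age2;

    // Person has no no-argument constructor, so the subclass must call super.
    MischievousPerson(int age) {
        super(age);
        this.age2 = age;
    }

    @Override int getAge() {
        return age2;
    }
}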

The subclass overrides getAge() to return a public, mutable field, so something you hold as a Person can appear to change its age. Suppose we fix this by making the class final. Consider this example:

final class Person {
    private final List<Person> friends;

    Person(List<Person> friends) {
        this.friends = friends;
    }

    List<Person> getFriends() {
        return friends;
    }
}

Is this immutable? No, because you could pass in a reference to a mutable list:

List<Person> friends = ...;
Person kartick = new Person(friends);
friends.add(...);

So let's say we change the constructor to make a defensive copy:

Person(List<Person> friends) {
    this.friends = new ArrayList<>(friends);
}

Is this now immutable? No:

Person kartick = ...;
kartick.getFriends().add(...);

See how hard it is to define an immutable class?
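
For comparison, here's one way the last hole might be closed in current Java (10 or later, for List.copyOf). It's a sketch, not a recommendation, and FrozenPerson is a made-up name to keep it apart from the examples above. It works only because every rule from this footnote is applied by hand at once.

```java
import java.util.List;

// final: no MischievousPerson-style subclass can override getFriends().
final class FrozenPerson {
    private final List<FrozenPerson> friends;

    FrozenPerson(List<FrozenPerson> friends) {
        // List.copyOf returns an unmodifiable copy, so later changes to the
        // caller's list are not seen here, and callers of getFriends()
        // cannot add or remove elements.
        this.friends = List.copyOf(friends);
    }

    List<FrozenPerson> getFriends() {
        return friends;
    }
}
```

Calling getFriends().add(...) on this class throws UnsupportedOperationException at run time. Nothing at compile time tells you the class is immutable, and the guarantee silently breaks if someone later adds a field of a mutable type.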

[3] It's more helpful to have the compiler validate such high-level properties, as opposed to the administrivia that today's statically-typed languages distract us with.

[4] A sentry could theoretically re-implement calls, in addition to just allowing or denying them, for example by serving file calls from an in-memory filesystem.

[5] A class that's neither immutable nor copyable can't be passed into or out of the sandbox.

[6] You'll be able to add an annotation to tell the compiler not to make a defensive copy of an argument or a return value. Imagine an image-processing library. A 12-megapixel, 16-bits-per-channel bitmap is 72MB. We may not want to copy that in the name of sandboxing. The sandbox can't be too rigid. Defer to the programmer.

4 comments:

  1. Another factor to consider: indirect dependencies. That is, application A depends on modules B and C; module B depends on module D version 1.0; module C depends on module D version 2.0. So D 1.0 and D 2.0 need to be able to exist side by side. This gets particularly problematic if B or C expose types from D as part of their public interface - should the language prevent that?

    1. Good question. I don't think side by side makes sense, for the reason you gave — what if the app takes the return value from module B and gives it to C? So, there should be only one version of each library.

      I suppose each module should declare a minimum required version for each dependency, like B requires at least D 1.0, and C requires D 2.0. Then the package manager would pick a version that satisfies all requirements, like 2.0 in this example. Module authors would be required to support the newest version of each dependency — they won't be allowed to hold others back from upgrading.

      There are unanswered questions — if D version 3 is out, should the authors of B and C be required to test out their libraries with D 3 and mark it as tested, so that clients of B and C can confidently update without testing it themselves? If they don't use D directly, they would consider it a waste of time to do this testing themselves.

      I'm sure we can think of other unanswered questions :)

  2. With regards to "We don't need pure functions. It's okay for a function to do some computation based on the current time, for example. Or print something to the console. The thing we want to prevent is changing anything the values of variables and objects you have." is this any different from immutable and/or pass by value arguments?

    1. It's the same if
      - ALL arguments are immutable (or passed by value).
      - The method doesn't change the object it's called on.
      - There are no static or global variables.
