WARNING: nothing in this post has been implemented or tested, except
maybe at a minimal proof-of-concept level ("it compiles!"). It's just some musings on how you can make the computer do your work for you.
I was reading a
defense of Hungarian notation when I started to get a familiar notion tickling the back of my brain. The author was explaining that the original idea behind Hungarian was not to replicate type information in variable names, but rather to include some sort of
usage information: one example would be giving a string value holding a value that is untrusted the name
utName. While it's more reasonable than
lpschwzMyVariable, this suggesion made me suspicious that the author was really hacking around suckiness in the C language.
Tagging values with usWhatever and then verifying that you never mix usBlah with Blah isn't just tedious and error-prone, it smells a lot like doing work by hand that the computer ought to do for you. It seemed to me that it ought to be possible to "tag" unsafe values at the type level and let the language's type-checker sort things out. The simple approach is of course to define something like this:
newtype Unsafe t = Unsafe t
NB: I'm using Haskell because it was the first language that sprang to mind when I thought "type system hack".
This
newtype declaration lets you attach an
Unsafe tag to any type. Presumably the type signatures of routines that return untrusted data (such as a hypothetical
getHTMLFormValue or a routine to retrieve previously entered form values from a database, if they aren't sanitized first) should be modified; for instance, if the signature of
getHTMLFormValue was
getHTMLFormValue :: HTMLForm -> String -> String
it should instead read
getHTMLFormValue :: HTMLForm -> String -> Unsafe String
To create a new
Unsafe value you simply apply the
Unsafe constructor to the value:
getHTMLFormValue = Unsafe (oldGetHTMLFormValue)
To unwrap
Unsafes you use pattern-matching:
let (Unsafe unwrapped) = wrapped in ...
So, this sort of works, but it's unsatisfying on several levels:
- It allows the creation of types such as Unsafe Unsafe Integer and even Unsafe [Unsafe Integer]. If a value is already unsafe, how can it be "more unsafe"? And what in the world is an unsafe list, or an unsafe list of unsafe values?
Unfortunately, this is probably a limitation that any scheme that relies on type-tagging will have. You can get around it by either explicitly defining each new unsafe type:
newtype UnsafeString = Unsafe String
This is arguably a good idea anyway, since virtually all unsafe values (in the sense the previously linked article mentiones) are likely to be strings, and as I note later, it's not clear how to "sanitize" a non-string. OTOH, placing unnecessary type restrictions feels un-Haskellish to me, and it seems likely that you can trust the programmer not to pile up Unsafes.
- This scheme makes it very tempting to strip off the
unsafeness
of a value and lose the fact that you did so.
Since the only way to strip off unsafeness is via pattern-matching, it seems to me that it would be tempting to pattern-match away the unsafeness in function declarations:
foo (Unsafe x) = ...
This may be just fine, but my preference would be to forbid pattern-matching altogether (by not exporting the constructor directly to other modules; just export a normal function that constructs an Unsafe value) and instead providing two routines to remove unsafety:
escapeData :: Unsafe t -> t
unsafeExtractRawData :: Unsafe t -> t
The first routine encodes the data into a "safe" form, while the second simply fetches the wrapped value:
unsafeExtractRawData (Unsafe x) = x
foo x = doSomeStuffWith (unsafeExtractRawData x)
The rationale here is twofold: first, by giving the dangerous extraction routine a scary and hard-to-type name, you remind yourself to think twice before invoking it (cf unsafePerformIO); second, you can easily audit all the locations where the unsafety is "stripped away" with a textual search.
There are some issues involving the safe encoding -- you probably ought to make a typeclass of things that can be safely encoded from an Unsafe value; not too hard and this entry is already too long.
- The final issue involves manipulation of Unsafe values. So far you can pack and unpack Unsafes; however, this seems like it would get a bit annoying if you had to do any significant operations on them. For instance, searching in an unsafe string for a substring seems like it would be a pain and a bit error-prone (since you'd have to remember to wrap the string back up in an Unsafe). You might think that ideally, all normal String operations should work directly on an Unsafe String or pairs of Unsafe Strings -- you just shouldn't be able to convert between the two.
Unfortunately, I can't see any way of doing everything that you'd want to here. For some types you can. For instance, the following definitions let you compare values wrapped in Unsafe in the usual way:
instance (Eq t) => Eq (Unsafe t) where
(==) (Unsafe v1) (Unsafe v2) = v1 == v2
(/=) (Unsafe v1) (Unsafe v2) = v1 /= v2
This admits (Unsafe t) to the Eq class, and looks reasonable enough (assuming that it's safe to compare unsafe
values, of course). However, for probably the most important Unsafe values -- Unsafe Strings -- I can't see how to do this. The problem is that String is not a typeclass, it's a synonym for [Char], a list of characters. The important string operations are either defined directly on the String type, or generically on [t]; i.e., "a list of anything". Because there is no typeclass, you simply can't overload these operations to apply to unsafe strings as well as regular strings.
One way around this is to provide an "unwrapping combinator":
liftUnsafe :: (t -> t) -> Unsafe t -> Unsafe t
liftUnsafe f (Unsafe v) = Unsafe (f v)
This would automatically convert string operations to unsafe-string operations; not ideal, but reasonable enough. Operators are still a little annoying, but you can mangle their names:
(++u) :: Unsafe [t] -> Unsafe [t] -> Unsafe [t]
(++u) = liftUnsafe (++)
So you have to use ++u instead of ++ to concatenate unsafe lists; not too bad.
While it may not be perfect -- it only makes sure that you don't accidentally violate safety, and doesn't stop you from deliberately doing something dumb -- this approach feels a lot safer to me than ad-hoc mangling of variable names. I used Haskell because you can do type hacks fairly easily in it, but I don't see why you couldn't play a similar game (albeit much less conveniently) in C++, or even in a dynamic language like Python. Of course, Python would give you runtime errors about safety violations instead of static ones, but that still beats getting runtime security holes.
I would add as a hypothesis that it should be possible to express just about any variable-name-mangling technique whose goal is to "make wrong code look wrong" in the type system, provided your type system doesn't suck; and that doing so is preferable to doing it manually, since it means that a lot of repetitive error-checking can be handled by the computer. And since the computer is better than you are at verifying formal invariants, forcing it to handle them whenever possible is always a good idea.