Sunday, June 26, 2011

RObjectTables are AWESOME

Why isn't everyone using the RObjectTables package? This is the best thing ever!

Here's the basic idea of RObjectTables: An environment is an object where you can lookup names and associate them with values. And in particular its where you look up variables used in an expression. But there's no reason you can't take any other object that associates names with values (data.frame, list, SQL database, CSV file, filesystem, ...) and use that to lookup your variables.

R already has sortof this because you can use eval() and friends on lists and data.frames. But it's not extensible. The purpose of RObjectTables is to make it extensible.

In its current version RObjectTables is somewhat limited, because you can only attach() the created environments, not pass them to with() and such. But with extremely minor modifications this becomes possible. You can find a version with these changes on my GitHub site.

Now let me demonstrate how unbelievably useful this is. So naturally I will start with a useless example: an environment of only strings!

> library(RObjectTables)

> db = newRClosureTable(list(

> assign = function(name, value) {

> # Not used

> },

>

> get = function(name) {

> name

> },

>

> exists = function(name) {

> T

> },

>

> remove = function(name) {

> # Not used

> },

>

> objects = function() {

> # Not used

> }

> ))

> with(db, x)

[1] "x"

Muahahaha.

This environment, as you can see, is not terribly useful:

> try(

> with(db, x + y),

> silent=T

> )

> geterrmessage()

[1] "Error in x + y : non-numeric argument to binary operator\n"

Yes... using "+" (character) as a function... not so good. Also we don't have to feel so bad about not implementing assign() because with `<-` turning into "<-" there's not much we could assign.

But now let's do something really useful.

> reality = function() {

> parent = parent.frame()

> outer = new.env(parent = parent)

>

> formulas = list()

> valid = list()

> values = list()

>

> make.ref = function(name) {

> ref = list(

> db = self,

> name = name

> )

> class(ref) = 'ref'

> ref

> }

>

> dep.table = list()

> rdep.table = list()

>

> propagate = function(name) {

> if (is.null(valid[[name]]) || !valid[[name]]) {

> }

> else {

> for (ref in rdep.table[[name]]) {

> ref$db$reset(ref$name)

> }

> }

> }

>

> reset = function(name) {

> propagate(name)

> valid[[name]] <<- F

> }

>

> add.dep = function(name, ref) {

> dep.table[[name]] <<- c(list(ref), dep.table[[name]])

> }

>

> add.rdep = function(name, ref) {

> rdep.table[[name]] <<- c(list(ref), rdep.table[[name]])

> }

>

> del.rdep = function(name, ref) {

> rdep.table[[name]] <<- setdiff(rdep.table[[name]], list(ref))

> }

>

> ptr = newRClosureTable(list(

> assign = function(name, value) {

> force(value)

> formulas[[name]] <<- function() value

> reset(name)

> },

> get = function(name) {

> if (is.null(formulas[[name]])) {

> return(tryCatch(

> get(name, outer),

> error = function(e) getUnbound()

> ))

> }

>

> this.ref = make.ref(name)

> for (ref in working.refs) {

> ref$db$add.dep(ref$name, this.ref)

> }

>

> if (is.null(valid[[name]]) || !valid[[name]]) {

> old.deps <<- dep.table[[name]]

> dep.table[[name]] <<- list()

>

> working.refs <<- c(list(this.ref), working.refs)

>

> values[[name]] <<- formulas[[name]]()

>

> working.refs <<- working.refs[-1L]

>

> lost.deps = setdiff(old.deps, dep.table[[name]])

> for (ref in lost.deps) {

> ref$db$del.rdep(ref$name, this.ref)

> }

>

> gained.deps = setdiff(dep.table[[name]], old.deps)

> for (ref in gained.deps) {

> ref$db$add.rdep(ref$name, this.ref)

> }

>

> valid[[name]] <<- T

> }

>

> values[[name]]

> },

> exists = function(name) {

> !is.null(valid[[name]]) || exists(name, parent)

> },

> remove = function(name) {

> # TODO: this

> },

> objects = function(name) {

> names(valid)

> }

> ))

> class(ptr) = c('reality', class(ptr))

>

> attr(ptr, 'delayedAssign') = (

> function(name, promise) {

> formulas[[name]] <<- promise

> reset(name)

> promise

> }

> )

>

> self = list(

> ptr = ptr,

> reset = reset,

> add.dep = add.dep,

> add.rdep = add.rdep,

> del.rdep = del.rdep

> )

> class(self) = 'realptr'

>

> ptr

> }

> working.refs = list()

Why did I call it a "reality"? Well... uh... all the good names are taken. So think of it as storing the current version of... reality. What it does is keep track of a set of variables that depend on each other. Oh yeah and we need some way to assign into it:

> `$.reality` = function(r, name) {

> get(name, envir=r)

> }

> `$<-.reality` = function(r, name, value) {

> expr = substitute(value)

> env = parent.frame()

>

> attr(r, 'delayedAssign')(

> name, function() eval(expr, env)

> )

>

> r

> }

Now we can write:

> r = reality()

> r$x = 10

> r$y = sqrt(r$x)

> r$y

[1] 3.162278

> r$x = 20

> r$y

[1] 4.472136

It is the feature I have always been wishing for. Now if only we could get .GlobalEnv to do this... You might think of attach()ing it: don't do that. Trust me. I can't show you because it will crash Sweave and not produce any output, but that doesn't work yet (it would be nice if it did).

Luckily R has a useful feature that was never intended for this purpose (I find it is this way with most of R's finest features): the debugging browser. So let us...

> with.db = function(env, func1, func2) {

> if ('reality' %in% class(env)) {

> func1(env)

> }

> else {

> func2()

> }

> }

> reality.assign = function(name, rvalue) {

> env <- parent.frame()

> name <- substitute(name)

> srvalue <- substitute(rvalue)

>

> with.db(env,

> function(db) {

> name <- as.character(deparse(name))

> promise <- function() eval(srvalue, env)

>

> attr(db, 'delayedAssign')(name, promise)

> alist(x=)$x

> },

> function() {

> do.call(`<-`, list(name, rvalue), envir=env)

> }

> )

> }

> with(r, `=` <- reality.assign)

> run.interpreter = function(base.env) {

> options(browserNLdisabled = T)

> with(base.env, browser())

> }

Now I don't know how to feed an interactive debugging session into Sweave, so here is copy-pasted this feature in use:

Browse[1]> y = sqrt(x)

Browse[1]> for (i in 1:10) {
+ x = i
+ print(y)
+ }
[1] 1
[1] 1.414214
[1] 1.732051
[1] 2
[1] 2.236068
[1] 2.449490
[1] 2.645751
[1] 2.828427
[1] 3
[1] 3.162278
Browse[1]> b = a

Browse[1]> c = b

Browse[1]> a = 1

Browse[1]> c
> run.interpreter(r)
Called from: eval(expr, envir, enclos)
Browse[1]> y = sqrt(x)

Browse[1]> for (i in 1:10) {
+ x = i
+ print(y)
+ }
[1] 1
[1] 1.414214
[1] 1.732051
[1] 2
[1] 2.236068
[1] 2.449490
[1] 2.645751
[1] 2.828427
[1] 3
[1] 3.162278
Browse[1]> z = y**2

Browse[1]> z
[1] 10
Browse[1]> x = 1:3

Browse[1]> z
[1] 1 2 3
Browse[1]> b = a

Browse[1]> c = b

Browse[1]> a = 1

Browse[1]> c
> 

Haha "c" closed the debugger ;)

So far this actually seems stable enough for general use. I wouldn't trust it with data you care about, but I intend to make great use of it.

Thursday, June 23, 2011

Crashed!!

This crashes R (sometimes):

    1 
    2 f = function(
    3     a = {x <- y <- z <- w <- t <- u <- 7}
    4 ) environment()
    5 
    6 for (i in 1:1000) {
    7     try(as.list(f()), silent=T)
    8 }
    9 
   10 

The problem is the same as I described in the last post

Inheriting in Items

What if we want to extend the behavior of one item with another? Or, to put it another way, what if we want one item to be able to grab the functionality of another?

It turns out we can (almost) get this behavior without modifying the item constructor at all. Here's how you do the grabbing:

> grab = function(src, obj=parent.frame()) {

> for (name in ls(src)) {

> (function(name) {

> if (!exists(name, obj, inh=F)) {

> delayedAssign(name, src[[name]], assign.env=obj)

> }

> })(name)

> }

> }

Now let's use that:

> A = item(

> x =,

> y =,

>

> sum = x + y

> )

> B = item(

> x =,

> y =,

>

> init = grab(A(x, y))

> )

> b = B(2, 3)

> as.list(b)

$sum
[1] 2

$x
[1] 3

$y
NULL

What is going on here?? This one is really tricky; it took me 1/2 hour or so to figure it out.

So, clearly, the problem starts with the fact that init is evaluated too late. We could easily fix the problem like this:

> b = B(2, 3)

> force(b$init)

NULL

> as.list(b)

$sum
[1] 5

$x
[1] 2

$y
[1] 3

$init
NULL

But what's with the argument values appearing with the wrong names? Here's the C function that implements as.list.environment:

    1 SEXP attribute_hidden do_env2list(SEXP call, SEXP op, SEXP args, SEXP rho)
    2 {
    3     SEXP env, ans, names;
    4     int k, all;
    5 
    6     checkArity(op, args);
    7 
    8     env = CAR(args);
    9     if (ISNULL(env))
   10     error(_("use of NULL environment is defunct"));
   11     if( !isEnvironment(env) ) {
   12         SEXP xdata;
   13     if( IS_S4_OBJECT(env) && TYPEOF(env) == S4SXP &&
   14         (xdata = R_getS4DataSlot(env, ENVSXP)) != R_NilValue)
   15         env = xdata;
   16     else
   17         error(_("argument must be an environment"));
   18     }
   19 
   20     all = asLogical(CADR(args));
   21     if (all == NA_LOGICAL) all = 0;
   22 
   23     if (env == R_BaseEnv || env == R_BaseNamespace)
   24     k = BuiltinSize(all, 0);
   25     else if (HASHTAB(env) != R_NilValue)
   26     k = HashTableSize(HASHTAB(env), all);
   27     else
   28     k = FrameSize(FRAME(env), all);
   29 
   30     PROTECT(names = allocVector(STRSXP, k));
   31     PROTECT(ans = allocVector(VECSXP, k));
   32 
   33     k = 0;
   34     if (env == R_BaseEnv || env == R_BaseNamespace)
   35     BuiltinValues(all, 0, ans, &k);
   36     else if (HASHTAB(env) != R_NilValue)
   37     HashTableValues(HASHTAB(env), all, ans, &k);
   38     else
   39     FrameValues(FRAME(env), all, ans, &k);
   40 
   41     k = 0;
   42     if (env == R_BaseEnv || env == R_BaseNamespace)
   43     BuiltinNames(all, 0, names, &k);
   44     else if (HASHTAB(env) != R_NilValue)
   45     HashTableNames(HASHTAB(env), all, names, &k);
   46     else
   47     FrameNames(FRAME(env), all, names, &k);
   48 
   49     setAttrib(ans, R_NamesSymbol, names);
   50     UNPROTECT(2);
   51     return(ans);
   52 }
   53 

Notice that it grabs the values first, and the names second. So when it goes to grab the values, they are 2, 3, and a promise to evaluate grab(A(x, y)). But wait... while it's getting those values, won't the promise be evaluated, and change the structure R is working on?

Yes... and that could be a problem, but it isn't:

    1 static void FrameValues(SEXP frame, int all, SEXP values, int *indx)
    2 {
    3     while (frame != R_NilValue) {
    4     if ((all || CHAR(PRINTNAME(TAG(frame)))[0] != '.') &&
    5                       CAR(frame) != R_UnboundValue) {
    6         SEXP value = CAR(frame);
    7         if (TYPEOF(value) == PROMSXP) {
    8         PROTECT(value);
    9         value = eval(value, R_GlobalEnv);
   10         UNPROTECT(1);
   11         }
   12         SET_VECTOR_ELT(values, *indx, duplicate(value));
   13         (*indx)++;
   14     }
   15     frame = CDR(frame);
   16     }
   17 }
   18 

Values are stored in a pair list, which is like a linked list in that modifications to the front of the list don't affect someone already iterating further down. So grab() sticks sum at the beginning, where it won't show up in the values. And the result of grab() is the NULL that you see in the 3rd value.

Then it goes to get the names. But it's only allocated space for 3 names, because it already checked the size. So the first 3 names are sum, x, and y. And that's what you see. From looking at how FrameNames works, it looks like this should cause a buffer overrun, but that doesn't show up.

We could have an item automatically check for the existence of init at instantiation time and run it if it exists, but I prefer a more general solution:

> item = function(...) {

> foo = function() {

> E = environment()

>

> for (i in seq_along(args)) {

> name = args.names[[i]]

> if (substr(name, 1, 1) == '.') {

> force(E[[name]])

> }

> }

>

> E

> }

>

> args = process.args(as.list(substitute(list(...)))[-1L])

> args.names = names(args)

> formals(foo) = args

>

> foo

> }

That is, we force all arguments starting with a '.'. Then we can modify process.args to give unnamed arguments a name with a '.':

> process.args = function(args) {

> arg.names = names(args)

> if (is.null(arg.names)) {

> arg.names = replicate(length(args), NULL)

> }

>

> for (i in seq_along(args)) {

> if (is.null(arg.names[[i]]) || arg.names[[i]] == '') {

> arg.names[[i]] = paste('.', as.character(i), sep='')

> }

> }

>

> names(args) = arg.names

>

> args

> }

Now we can write:

> B = item(

> x =,

> y =,

>

> grab(A(x, y))

> )

> b = B(2, 3)

> as.list(b)

$sum
[1] 5

$name
[1] ".3"

$i
[1] 3

$E


$x
[1] 2

$y
[1] 3

Oh yeah we have some extra junk in there. We could hide it with '.' names, or we could hide everything by nesting:

> item = function(...) {

> foo = function() {

> (function() {

> E = parent.frame()

>

> for (i in seq_along(args)) {

> name = args.names[[i]]

> if (substr(name, 1, 1) == '.') {

> force(E[[name]])

> }

> }

> })()

>

> environment()

> }

>

> args = process.args(as.list(substitute(list(...)))[-1L])

> args.names = names(args)

> formals(foo) = args

>

> foo

> }

> B = item(

> x =,

> y =,

>

> grab(A(x, y))

> )

> b = B(2, 3)

> as.list(b)

$sum
[1] 5

$x
[1] 2

$y
[1] 3

Perfect.

We are joining R Bloggers

So that we may spread the R. It's a great resource -- lot's of interesting people on there. http://www.r-bloggers.com/

Wednesday, June 22, 2011

Items

Let's define a new pattern!

I do a lot of programming for research, and part of what this involves is turning calculations, usually expressed as verbs, into nouns. That is, I need to keep the calculation and all it's intermediate steps around so that I can inspect them.

Alas, I am programming in Python most of the time and end up writing a lot of code like this:

class Foo:
    
    def __init__(self, tol, angle):
        self.tol   = tol
        self.angle = angle
        
        ...
        
        self.a    = a
        self.b    = b
        self.step = step

ie manually saving all the steps. But in R we can do better. Imagine we could write code like:

> Center.Find = item(

> # These define the parameters

> t =,

> x =,

>

> # These give the calculation

> theta = Arg(x),

> dtheta = diff(theta) / diff(t),

> center = median(dtheta[dethat>0])

> )

where all the variables defined in the item become named members of a list, and we can inspect all of them.

We could try something like this:

> item = function(...) {

> foo = function() {

> environment()

> }

>

> args = process.args(as.list(substitute(list(...)))[-1L])

> formals(foo) = args

>

> foo

> }

That is, the names defined in the item become function parameters, which get included in the environment. process.args is an unfortunate necessity that flips the argument list.

> process.args = function(args) {

> arg.names = names(args)

> if (is.null(arg.names)) {

> arg.names = replicate(length(args), NULL)

> }

>

> names(args) = arg.names

>

> args

> }

> A = item(

> x =,

> y =,

>

> sum = x + y

> )

No error yet...

> a = A(2, 3)

> as.list(a)

$x
[1] 2

$y
[1] 3

$sum
[1] 5

Excellent.

Notice that we get a few other features for free:

> A = item(

> sum = x + y,

>

> x =,

> y =

> )

> a = A(x=2, y=3)

> as.list(a)

$sum
[1] 5

$x
[1] 2

$y
[1] 3

Lazy evaluation means we can define variables in any order; also ones we don't use will never be evaluated. This saves me from a common dilemma: I want to include lots of diagnostic information just in case I need it, but I don't want to wait for it to be calculated. Now I can add in any extra info I can imagine using, and it won't be calculated unless I need it.

Another nice feature is you can "reach in" to the item and change its behavior -- not terribly good style, but this is research code.

> a = A(x=2, y=3, sum=8)

> as.list(a)

$sum
[1] 8

$x
[1] 2

$y
[1] 3

Notice, however, that this (which would really be the only useful purpose of "reaching in"), fails:

> a = A(x=2, y=3, sum = x - 1)

> try(as.list(a), silent=T)

> geterrmessage()

[1] "Error in as.list.environment(a) : object 'x' not found\n"

So, what if we wanted to defy common sense and make it possible to modify the behiavor of an item at instantiation time? Well remember how the result of calling an item is an environmenth? Uh oh...

> a = A(

> x = 2,

> y = 3,

> sum = with(a, x - 1)

> )

> as.list(a)

$sum
[1] 1

$x
[1] 2

$y
[1] 3

Of course this means we cannot modify the behavior of an anonymous item. I couldn't think of a way to do that...