Monday, August 22, 2011

Haskell Buzzing in my Ear

Unfortunately GHC will not have Type Directed Name Resolution (TDNR) for the foreseeable future. TDNR would let the compiler distinguish identically named functions and variables based on some portion of their types.

The reason TDNR is unlikely to appear in Haskell is that it is hard to show why you need it. This is because you never actually need it. Any time you run into a problem with naming conflicts, you can always rename them. And if you complain, someone is sure to show you how they would have named their variables differently and there really is no problem.

The moment you need TDNR is the moment you hit a name conflict and have to interrupt your thought process to go mangle some names. And you can't show that moment to people. It's like going to StackOverflow with "I'm having trouble writing this function because there's a fly buzzing in my ear" and someone responding "Just write this, this, and this -- what's your problem?". But the problem isn't the code -- the problem is the fly buzzing in your ear.

You don't need TDNR to make your code work, you need TDNR to make you work, to save you from being interrupted by random unrelated declarations. To keep your thoughts on the puzzle at hand. Haskell takes a lot of thinking. It's that kind of language.

I've often heard it said that Haskell doesn't need TDNR because it has typeclasses. But typeclasses say the wrong thing -- semantically, they just mean something different. A typeclass says "here are some operations on a similar pattern that can be specialized in various ways". TDNR says "here are some unrelated functions that, because we are using English, just happen to have the same name". Typeclasses are about how things are similar; TDNR is about how, if you're using one thing, you just don't care about the other.
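Object-oriented languages get that second behavior nearly for free, because method lookup is directed by the receiver's type. A quick illustration of the distinction in Python (my analogy, not anything from the TDNR proposal -- the names are hypothetical):

    class Cowboy:
        def draw(self):
            return 'pistol out'

    class Canvas:
        def draw(self):
            return 'pixels painted'

    # Two unrelated draw()s that merely share an English word. The name
    # is resolved per-object, so neither definition interferes with the
    # other -- the situation TDNR would recover at compile time.
    def demo(thing):
        return thing.draw()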

Because that's just how language works -- names take on their meaning based on the context. You couldn't have a conversation if this were not the case. Context encompasses a lot of things -- it happens that types provide a lot of it. But types aren't the only way -- I'd be happy to see Haskell add any kind of context to name resolution. (Well, it has two -- modules and local scope. But these aren't enough: Haskell functions are "tiny", and you tend to have a lot of them at global scope.)

I know of no language that fuels name collisions like Haskell. Every other language I can think of has at least one feature Haskell doesn't that helps prevent conflicts. Even OCaml lets you "open" a module locally. In any sufficiently large Haskell program, the name conflicts start building up, and you start adding qualifying information to your function names. From the programmer's perspective that's just as bad as adding an explicit type annotation every time you call the function. Type inference makes Haskell terser, but name collisions blow it back up again.

Of course, if I really wanted to be constructive, I'd learn to hack GHC and add this feature myself. You can get so close with typeclasses that I don't think it would be too hard to add -- but I've never touched the GHC source code. Some day...

Thursday, August 18, 2011

Do older SOers use fewer words?

On StackOverflow, do posters with more experience ask their questions in fewer words?

No. There's no visible difference:

Chars of non-code:

[plot: so.rst-weaver/description.png -- reputation vs. characters of non-code text]

Chars of code:

[plot: so.rst-weaver/code.png -- reputation vs. characters of code]

The data comes from the super-handy StackOverflow API; it was retrieved with wget and then parsed with the rjson and XML packages.
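(The fetching itself was just a few wget calls against the old 1.1 API -- the endpoint and parameters here are from memory, so treat them as an approximation:

$ wget 'http://api.stackoverflow.com/1.1/questions?page=1&pagesize=100' -O page-1.json

and likewise for pages 2 and 3.)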

First read in and parse the JSON:

     so.R 

  1  library(rjson)
  2  library(XML)
  3  library(ggplot2)
  4  library(plyr)
  5  
  6  read.qs = function(path) {
  7      fromJSON(file = path)$questions
  8  }
  9  
 10  questions = do.call(c,
 11      lapply(c('page-1.json', 'page-2.json', 'page-3.json'),
 12          read.qs
 13      )
 14  )

Then a small helper that totals the length of the text matched by an XPath query (it has to come before its first use, since the script runs top to bottom):

     so.R (cont)

 15  tot.length.of = function(doc, query) {
 16      parts = xpathApply(doc, query, xmlValue)
 17      text = paste(parts, collapse='')
 18      nchar(text)
 19  }

Then for each question, parse the HTML and look for <pre> and <p> tags:

     so.R (cont)

 20  Table = ldply(questions, function(q) {
 21      body.text = sprintf('<body>%s</body>', q$body)
 22      body = htmlParse(body.text)
 23  
 24      description = tot.length.of(body, '//p//text()')
 25      code = tot.length.of(body, '//pre//text()')
 26  
 27      rep = q$owner$reputation
 28  
 29      data.frame(
 30          rep, description, code
 31      )
 32  })

Then make the plots:

     so.R (cont)

 33  png('description.png')
 34  print(ggplot(data=Table)
 35      + geom_point(aes(rep, description))
 36      + scale_x_log10()
 37      + scale_y_log10()
 38      + xlab('Rep')
 39      + ylab('Verbosity')
 40  )
 41  dev.off()
 42  
 43  png('code.png')
 44  print(ggplot(data=Table)
 45      + geom_point(aes(rep, code))
 46      + scale_x_log10()
 47      + scale_y_log10()
 48      + xlab('Rep')
 49      + ylab('Verbosity')
 50  )
 51  dev.off()

$ Rscript so.R >/dev/null 2>&1

Tuesday, August 16, 2011

New best thing ever: pyinotify

What could be better than pyinotify?

You can track accesses. Accesses!

Is that not the coolest?

Say we're building a program like

     a.c 

  1  #include "a.h"

     a.h 

  1  #include "b.h"

     b.h 

  1  

     c.h 

  1  

So in total we have

$ ls *.[ch]
a.c
a.h
b.h
c.h

So the dependencies we have are

  1  a.o: a.c a.h b.h

If we compile under pyinotify's watch, we'll see exactly those dependencies:

     main5e2d.py 

  1  from treewatcher import run_watch_files
  2  from subprocess import Popen
  3  import os
  4  
  5  def compile():
  6      Popen(['gcc', '-c', 'a.c', '-o', 'a.o']).wait()
  7  
  8  _, accesses = run_watch_files(compile, '.')
  9  print('Accessed:')
 10  for path in accesses.accessed:
 11      print('    %s' % os.path.relpath(path))
 12  print('Modified:')
 13  for path in accesses.modified:
 14      print('    %s' % os.path.relpath(path))

Accessed:
    a.h
    a.c
    a.o
    b.h
Modified:
    a.o

(treewatcher)

No need for any language-dependent tool such as gcc -M. No need to even have a clue what kind of build is taking place -- you know the compiler looked at b.h so it probably made a decision based on it.
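In fact the accessed/modified split is already enough to print a Makefile-style rule. A sketch reusing run_watch_files from above (assuming, as the unpacking suggests, that accessed and modified are collections of paths):

    from treewatcher import run_watch_files
    from subprocess import Popen
    import os

    def compile():
        Popen(['gcc', '-c', 'a.c', '-o', 'a.o']).wait()

    _, accesses = run_watch_files(compile, '.')

    # Whatever was only read is an input; whatever changed is an output.
    inputs = set(accesses.accessed) - set(accesses.modified)
    for out in accesses.modified:
        print('%s: %s' % (os.path.relpath(out),
                          ' '.join(sorted(os.path.relpath(p) for p in inputs))))

For the build above this prints exactly the rule we wrote by hand: a.o: a.c a.h b.h.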

But it's tracking the wrong thing

What if we had

     d.c 

  1  #include "not-exist.h"

And build

     main3b70.py 

  1  from treewatcher import run_watch_files
  2  from subprocess import Popen
  3  import os
  4  
  5  def compile():
  6      Popen(['gcc', '-c', 'd.c', '-o', 'd.o']).wait()
  7  
  8  _, accesses = run_watch_files(compile, '.')
  9  print('Accessed:')
 10  for path in accesses.accessed:
 11      print('    %s' % os.path.relpath(path))
 12  print('Modified:')
 13  for path in accesses.modified:
 14      print('    %s' % os.path.relpath(path))

d.c:1:23: fatal error: not-exist.h: No such file or directory
compilation terminated.
Accessed:
    d.c
Modified:

Which of course fails, and a build tool would report failure and give up at this point. But when I was writing rstweaver I didn't think that would be appropriate for literate programming -- error messages are part of the product, and you want those to show up in the output just like anything else.

With this mindset, the output of the process is the error message that was produced -- that's the part you want to see. And the input to the process is the files in the directory.

The problem is that pyinotify sees this operation as depending on only one file, d.c. It doesn't see the dependence on the existence (in this case nonexistence) of not-exist.h -- but in reality, changing that existence changes the output of the process from an error message to success.

So are we back to needing an understanding of the language?

What did gcc do that might have clued us in to the missing dependency? How did it know not-exist.h wasn't there?

  1. It may have opendir()'d the parent directory and stepped through the contents, looking for not-exist.h.

    If this is the case, then there's nothing we can do to spot not-exist.h without understanding something about how C works. We could see that the operation "depends on the contents of the directory", which would admit some extraneous dependencies but at least prevent us from getting stuck on a bad result.

  2. It may have attempted to open() not-exist.h and failed. In this case, you'd think that some information might show up, but I never got anything like this from pyinotify.

The fact that gcc returned a nonzero exit code is some clue that adding a new file might change the result, but that fact is specific knowledge of gcc.
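Still, you can combine the two clues without understanding any C: if the command exits nonzero, fall back to the coarse dependency "the contents of the directory" and retry when anything new appears. A sketch with plain pyinotify:

    import pyinotify
    from subprocess import Popen

    def build():
        return Popen(['gcc', '-c', 'd.c', '-o', 'd.o']).wait()

    if build() != 0:
        # The failed build may depend on files that don't exist yet, so
        # watch the whole tree and retry whenever a file shows up.
        class Retry(pyinotify.ProcessEvent):
            def process_IN_CREATE(self, event):
                print('new file %s -- retrying' % event.pathname)
                build()

        wm = pyinotify.WatchManager()
        wm.add_watch('.', pyinotify.IN_CREATE, rec=True)
        pyinotify.Notifier(wm, Retry()).loop()

This admits plenty of extraneous dependencies, but at least it can't get stuck on a stale failure.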

So I'd say yes, to be really thorough we still need an understanding of the language. But just barely.

Monday, August 15, 2011

Welcome to my github world

Well, I did it. Against all reason and my better judgment, I went and made rstweaver depend on my fork of docutils.

This is not how we're supposed to do things. We're supposed to submit patches when we can and find workarounds while we have to, which is usually. This keeps everyone on the same libraries, and makes dependencies navigable by mere humans.

But go look at the most recently updated python-language repos on github. Right now the top ones are:

Repo                          Lines of code
dqwiki/lisabot                        1,369
benjschiller/checkmyclones              124
nolar/shortener                       1,458
eteq/astropysics                     43,195
codebrainz/geany-sphinx                 216

With one exception these are little projects. And what's the point of having such tiny projects if you can't bend them to your needs? And who's going to notice the extra space consumed by 3 slightly incompatible versions of lisabot?

Go look at the Python Package Index and see how small most of those projects are. Software is tiny these days. I think a large part of this is due to Python being easy enough that a lot more people can throw together one thing in Python than could in C++. And Python interfaces better with itself, so you have no incentive to build massive sprawling projects.

So why?

Well, because there were some bugs I wanted to fix and a feature I wanted to add. Not compelling reasons, but that's the experiment: how low can you set the threshold?

Avery Pennarun makes an interesting argument on a not-quite-the-same topic. (Linking to that might make it sound like I have some problems with docutils. I don't. Docutils is awesome).

With Python (or Ruby or Perl or ...) + Git we can do anything

I hope. No one really knows what the programming future holds. Wasn't Prolog supposed to be the future?

Why are all the "easy" languages dynamically typed?

Actually, that one's not hard to answer; a harder one is "Why do I think I still prefer static types?". The evidence is right there, both in my own experience and in the numbers around me, that dynamically-typed programming is just easier. You get more done. Why do I resist admitting this?

"Languages of the future" always make me feel uneasy. Well I guess some day we'll all use Prolog and I can stop worrying.

(I want to like Scala. I really do. It's got everything I think I want in a language. But why do I have this uneasy feeling about it?).

Sunday, August 7, 2011

Sweeping changes to rstweaver

Improvements:

  • New languages: C++, Python
  • Caching
  • Pure docutils (so conversion to LaTeX works; conversion to ODT, which was one of my primary motivations for removing the HTML-specific code, still fails -- I think that may be due to a bug in docutils, but I'm still figuring it out)

And it's sitting on my github site with some examples.

Implementing new features doubled as a chance to encounter newer and yet more frustrating problems. It is an eternal fact of programming that reality is smarter than you are: the ideas that you design from your own creativity will simply never measure up to the horrors nature throws at you haphazardly. She makes it look so easy.

An example would be the case of Python decorators when the decorator function also happens to have class scope.

     decor.py 

  1  class A:
  2  
  3      def fivify(func):
  4          def handler():
  5              return func(5)
  6          return handler
  7  
  8      @fivify
  9      def a(x):
 10          return x
 11  

We are now left to wonder what transformations may have been applied to fivify (such as making it a bound method) before it is used as a decorator. These thoughts give us an error message with the approximate clarity of

     decor-thoughts.py 

  1  class A:
  2  
  3      @classmethod
  4      def fivify(func):
  5          def handler():
  6              return func(5)
  7          return handler
  8  
  9      @fivify
 10      def a(x):
 11          return x
 12  
 13  a = A()
 14  
Traceback (most recent call last):
  File "decor-thoughts.py", line 1, in <module>
    class A:
  File "decor-thoughts.py", line 9, in A
    @fivify
TypeError: 'classmethod' object is not callable

So we are led to see that decoration must happen while the methods are still sitting in the uninstantiated class (this can cause other surprises).
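(One of those other surprises: fivify never gets cleaned out of the class namespace, so once the class body finishes, the decorator hangs around posing as a method:

    a = A()
    print(a.fivify)   # <bound method A.fivify of ...> -- the decorator,
                      # now an instance method nobody ever meant to expose

)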

A fair amount of frustration comes from the docutils library itself. Docutils is actually quite wonderful, but a conspicuous need for polishing renders it a capable adversary. Most of the snags can be traced to rather boring implementation bugs, but one of its more architectural flaws helps remind us why, in programming Python, we stick to the pythonic.

At the heart of docutils' extensibility lie directives, which serve much the same role as TeX commands, except that they are written in Python and manipulate docutils rather than TeX. So I set out to register a directive:

     directive.py 

  1  from docutils.parsers.rst import Directive, directives
  2  from docutils.core import publish_parts
  3  
  4  class MyDirective(Directive):
  5  
  6      def __init__(self, *a, **b):
  7          Directive.__init__(self, *a, **b)
  8  
  9      def run(self):
 10          return []
 11  
 12  directives.register_directive('mydir', MyDirective)
 13  publish_parts('\n.. mydir::\n\n')
 14  

From this we learn that register_directive() appears to take a factory, which it will instantiate, passing some arguments I apparently don't need to care about.

I was almost content with this arrangement -- I just wanted to pass some context to MyDirective whenever it was created. Seeing as register_directive() appeared to take just a callable factory, I added some context:

     directive.py (cont)

  4  class MyDirective(Directive):
  5  
  6      def __init__(self, ctx, *a, **b):
  7          Directive.__init__(self, *a, **b)
  8          self.ctx = ctx
  9  
 10      def run(self):
 11          return []
 12  

and some factory:

     directive.py (cont)

 13  def create(*a, **b):
 14      return MyDirective(None, *a, **b)
 15  
 16  directives.register_directive('mydir', create)
 17  publish_parts('\n.. mydir::\n\n')
 18  
Traceback (most recent call last):
  File "directive.py", line 17, in <module>
    publish_parts('\n.. mydir::\n\n')
  File "/usr/lib/pymodules/python2.7/docutils/core.py", line 427, in publish_parts
    enable_exit_status=enable_exit_status)
  File "/usr/lib/pymodules/python2.7/docutils/core.py", line 641, in publish_programmatically
    output = pub.publish(enable_exit_status=enable_exit_status)
  File "/usr/lib/pymodules/python2.7/docutils/core.py", line 203, in publish
    self.settings)
  File "/usr/lib/pymodules/python2.7/docutils/readers/__init__.py", line 69, in read
    self.parse()
  File "/usr/lib/pymodules/python2.7/docutils/readers/__init__.py", line 75, in parse
    self.parser.parse(self.input, document)
  File "/usr/lib/pymodules/python2.7/docutils/parsers/rst/__init__.py", line 157, in parse
    self.statemachine.run(inputlines, document, inliner=self.inliner)
  File "/usr/lib/pymodules/python2.7/docutils/parsers/rst/states.py", line 170, in run
    input_source=document['source'])
  File "/usr/lib/pymodules/python2.7/docutils/statemachine.py", line 233, in run
    context, state, transitions)
  File "/usr/lib/pymodules/python2.7/docutils/statemachine.py", line 454, in check_line
    return method(match, context, next_state)
  File "/usr/lib/pymodules/python2.7/docutils/parsers/rst/states.py", line 2281, in explicit_markup
    nodelist, blank_finish = self.explicit_construct(match)
  File "/usr/lib/pymodules/python2.7/docutils/parsers/rst/states.py", line 2293, in explicit_construct
    return method(self, expmatch)
  File "/usr/lib/pymodules/python2.7/docutils/parsers/rst/states.py", line 2035, in directive
    directive_class, match, type_name, option_presets)
  File "/usr/lib/pymodules/python2.7/docutils/parsers/rst/states.py", line 2093, in run_directive
    'Directive "%s" must return a list of nodes.' % type_name
AssertionError: Directive "mydir" must return a list of nodes.

Oh docutils. This error message had me for... I'd say near twenty minutes, because darn it I am returning a list of nodes.

So I was a bit surprised to find out that register_directive() doesn't actually take a callable object -- and it's not thinking of it as a "factory" either. It's expecting either a class, which it will instantiate, or a function, which it will leave as is until it gets called to handle a directive. And there's the problem.
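The way out, once you know that (a sketch -- not necessarily what rstweaver ended up doing): since register_directive() branches on being handed a class, hand it a class, manufactured on the spot to close over the context:

    def with_context(ctx):
        # register_directive wants an honest class, so build one that
        # smuggles ctx into MyDirective's constructor.
        class BoundDirective(MyDirective):
            def __init__(self, *a, **b):
                MyDirective.__init__(self, ctx, *a, **b)
        return BoundDirective

    directives.register_directive('mydir', with_context(None))
    publish_parts('\n.. mydir::\n\n')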

Now, nothing against docutils -- it's actually a very nice library and for the most part well designed -- but checking the type of an object and then branching based on the result is decidedly unpythonic.

And that's how I learned that "pythonic" is something that actually matters and isn't just something people say.

(It's not pythonic because it doesn't respect duck typing -- it's not supposed to matter what the type of the object is so long as it has the right properties.)

But then it was later in the same project that I... I found myself wanting to commit the same error! You see I had an "interface" like

     weaver.py 

  1  class WeaverLanguage:
  2  
  3      def run(self, code, args):
  4          '''
  5          Returns content to be added to the document.
  6          '''
  7          raise NotImplementedError
  8  

I could always make run() return a docutils node -- because that would cover all cases. You want just plain text? Stick the text in a node. HTML? Make a raw HTML node. So that would solve all my problems.
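(Both cases really are one-liners with docutils nodes:

    from docutils import nodes

    text_node = nodes.inline('hello', 'hello')                # plain text
    html_node = nodes.raw('', '<b>hello</b>', format='html')  # raw HTML

)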

But I just didn't want to do that. I wanted to make the most common case of returning raw text easy, and not require looking at docutils (sorry again docutils, there's nothing wrong with you, really). I could break it up into two stages:

     weaver-stages.py 

  1  class WeaverLanguage:
  2  
  3      def run_text(self, code, args):
  4          '''
  5          Returns content to be added to the document.
  6          '''
  7          raise NotImplementedError
  8  
  9      def run_node(self, code, args):
 10          text = self.run_text(code, args)
 11          return nodes.inline(text, text)
 12  

Except that I don't want to do that either, because it puts the output type in the name of the function, making it look way more important than it actually is (solution: type inference and typeclasses -- also not pythonic).
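The remaining option is the unpythonic one: check what run() returned and branch on its type. A sketch of the obvious implementation (my guess at the shape of it, not the actual rstweaver code):

    from docutils import nodes

    def normalize(result):
        # The unpythonic part: inspect the runtime type and branch.
        # Strings get wrapped in a node; nodes pass through untouched.
        if isinstance(result, basestring):   # Python 2
            return nodes.inline(result, result)
        return result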

So I went with the unpythonic hack. And maybe some day someone'll hate me for it.