Thursday, August 18, 2011

Do older SOers use fewer words?

On StackOverflow, to posters with more experience ask their questions in fewer words?

No. There's no visible difference:

Chars of non-code:

so.rst-weaver/description.png

Chars of code:

so.rst-weaver/code.png

The data comes from the super-handy StackOverflow API, which was retrieved using wget and then parsed using rjson and XML.

First read in and parse the JSON:

     so.R 

  1  library(rjson)
  2  library(XML)
  3  library(ggplot2)
  4  library(plyr)
  5  
  6  read.qs = function(path) {
  7      fromJSON(file = path)$questions
  8  }
  9  
 10  questions = do.call(c,
 11      lapply(c('page-1.json', 'page-2.json', 'page-3.json'),
 12          read.qs
 13      )
 14  )

Then for each one parse the HTML and look for <pre> and <p> tags:

     so.R (cont)

 15  Table = ldply(questions, function(q) {
 16      body.text = sprintf('<body>%s</body>', q$body)
 17      body = htmlParse(body.text)
 18  
 19      description = tot.length.of(body, '//p//text()')
 20      code = tot.length.of(body, '//pre//text()')
 21  
 22      rep = q$owner$reputation
 23  
 24      data.frame(
 25          rep, description, code
 26      )
 27  })

(where tot.length.of is:

     so.R (cont)

 28  tot.length.of = function(doc, query) {
 29      parts = xpathApply(doc, query, xmlValue)
 30      text = paste(parts, collapse='')
 31      nchar(text)
 32  }

)

Then make the plots:

     so.R (cont)

 33  png('description.png')
 34  print(ggplot(data=Table)
 35      + geom_point(aes(rep, description))
 36      + scale_x_log10()
 37      + scale_y_log10()
 38      + xlab('Rep')
 39      + ylab('Verbosity')
 40  )
 41  dev.off()
 42  
 43  png('code.png')
 44  print(ggplot(data=Table)
 45      + geom_point(aes(rep, code))
 46      + scale_x_log10()
 47      + scale_y_log10()
 48      + xlab('Rep')
 49      + ylab('Verbosity')
 50  )
 51  dev.off()

$ Rscript so.R >/dev/null 2>&1

3 comments:

  1. I'm confused. Where is there a comparison between more and less experienced?

    The axes are "verbosity" vs "rep" (whatever that is - repetition maybe?) and the two plots are code vs non code. So where does experience come in??

    ReplyDelete
  2. Oh.. sorry ;( "Rep" is "Reputation", which is a score that grows as you ask and answer questions; so people with higher "Rep" are more experienced.

    ReplyDelete
  3. Ah. Perfectly clear now thanks.

    ReplyDelete