What is Statistics ?

The American Statistical Association launched a new website thisisstatistics with general information on what being a statistician really means today. I often meet fearful looks when I meet people and present myself as a “biostatistician” (the key word being statistician here). Up to the point that I now generally say I work in “applied mathematics in medicine”, before I drop the magical “big data” keyword in the necessary explanation following this title.

This website seems very easy to navigate, and I hope it will participate in the heat statistics are getting, being the sexiest job this days !

Check out the 2 short videos below:

Heating homes via intensive statistical computing !

At Bordeaux University, we are quite lucky. I mean as computational consumers.

Indeed, we have access to a big CPU cluster, a mesocenter that has been build for all the researchers in the Aquitaine area (in the south west of France). And it’s a big one. It’s named avakas, and it has brought my Ph.D. computational projects to an other scale !

But for a few month now, I have also been granted access (for free as a I work in a national research agency) to a new kind of big computer: a net of heaters. That’s right. This Qarnot company has developped electrical radiators that encompass CPUs and are connected to internet. This is so cool (and it works) !

foreach, here I come, bringing the heat !

Statistics horizons from French statistical community

Yesterday I attended a conference on the Horizons of Statistics at the Henri Poincaré Institute in Paris, organized by the French Statistical Association. As it was broadcasted on youtube, this reminded me of the Future Of Statistics unconference organized by the simply statistics blog earlier in 2013 fall. By the way I really enjoyed Daniela Witten talk from this unconference: check it out !

The Horizons of Statistics turned out very interesting, and I really enjoyed Emmanuel Candès talk on randomized computing algorithms, as well as Emmanuel Todd talk which was very refreshing at the end of the day. Unfortunately those talks are only in French. But if you are an English speaker, you can watch Robert N. Rodriguez, who is very hopeful for young statistcians, such as myself, and for our profession in general !

All in all, it seems that the Horizons of Statistics are many, and all of them are looking bright !

Ph.D. students lecture club: Open versus Closed peer review

At ISPED, the research institute where I work, we have a weekly Ph.D. students seminar. It is an informal meeting of (more or less) all the Ph.D. students of the institute, bringing together people from epidemiology, medical informatics, biostats, etc. Each student gets 20 minutes sharp to talk either about his/her research or any article of his/her choosing (possibly a little bit outside of our respective research domains), followed by 10 minutes of questions.

Last wednesday, I presented the following article (slides):
Leek JT, Taub MA, Pineda FJ (2011). Cooperation between Referees and Authors Increases Peer Review Accuracy. PLoS ONE 6(11):e26895.
It is about a game model for peer review. As a newbie to research and all, I found it quite an interesting read. Open peer review (where you know who is assessing the quality of your work) might not be as bad as one could intuitively think, quite  the opposite actually. Check it out (it’s open access)!

But remember:

All models are wrong, but some are useful.

George E. P. Box

Young Statisticians Meeting

At the end of august 2013, I was lucky enough to attend the Rencontres des jeunes statisticiens (young statisticians meeting). It is organised every other year by the Société Française de Statistique (French Statistical Society). It was thrilling to meet all these fellow statisticians in the making!

A lot of the talks were very interesting. In particular, one made me think of a recent post from Roger Peng on the famous Simply Statistics blog. Benjamin Guedj developed a nonlinear aggregation strategy, an approach aggregating different solutions (from different modeling) to a regression problem. It is implemented as an R package, COBRA, and it seems to performs quite good (and fast). Even though I suspect it might choke a little on difficult datasets (n<<p anyone?), I find it quite clever. And, relating to R. Peng post, it has the advantage of reducing the “researcher degrees of freedom”. Could it be a first step towards a deterministic statistical machine?

Journalists and statistical thinking

Earlier this month, results of the baccalauréat came out. It is the French examination you have to pass to graduate from high school. B. Coulmont, a professor in sociology at Université Paris 8, made this plot representing the number of mention Très Bien (highest honors) according to the first name. He insisted that first name do not determine academic success, but instead are a good proxy to one’s social class. But soon all the french media were predicting one’s success according to one’s first name (“Tell me your first name, and I’ll tell you your grade”).
This reminded me of the French presidential election last year. On the same day (13/03/2012), two different polls put each of the two major candidates ahead. But how could that be? Who was right? One had to be wrong! France (and its journalists) had just discovered confidence bounds.

H.G. Wells once said:

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.

I believe this is especially true for journalists, who spread information. And of course it also applies to the public that must be able to do critical thinking. Unfortunately, we are not there yet.

First names and mention Très Bien at Baccalauréat 2013 — by B. Coulmont

First names and mention Très Bien at Baccalauréat 2013 — by B. Coulmont

My R package’s worldmap of downloads!

Last week, a colleague draw my attention on this new log files from the Rstudio cloud CRAN mirror, through a post from Tal Galili. This CRAN mirror is a little different, as it uses Amazon CloudFront to deliver the downloads rapidly from a server near you, wherever that is. But what’s really great about it, is the availability of those log files, that have been recording every package download since October 2012, daily! As R is primarily intended for statisticians, it didn’t take long before we start playing with the data. Below is my bit to this effort, a function to plot the world map of one’s package downloads from the Rstudio “0-Cloud” CRAN mirror. It relies on Tal Galili’s functions for downloading and formating the data. Of course, the Rstudio CRAN mirror is only 1 mirror among all the CRAN mirrors around the world, and is not representative due to its link with the Rstudio IDE. However, in research, there is quite a delay between one’s hard work (i.e. implementing a package) and the reward (i.e. publication). Encouragements such as download stats are welcomed!

pkgDNLs.worldmapcolor <- function(pkgname, rmvdupips=TRUE, date.start=Sys.Date()-8, date.stop=Sys.Date()-1, shp.file.repos){

if(date.stop>=Sys.Date() | date.start>=Sys.Date()){
stop("date not available yet")

cat('downloading the RStudio CRAN data...\n')
dnldata_folder <- download_RStudio_CRAN_data(START = date.start, END = date.stop)
cat('data downloaded!\n')
dnldata <- read_RStudio_CRAN_data(dnldata_folder)

data <- dnldata[which(dnldata$package == pkgname),]
data <- data[!duplicated(data$ip_id),]

counts <- cbind.data.frame(table(data$country))
names(counts) <- c("country", "count")

# you need to download a shapefile of the world map from Natural Earth (for instance)
# http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_0_countries.zip
# and unzip it in the 'shp.file.repos' repository
world<-readShapePoly(fn=paste(shp.file.repos, "ne_110m_admin_0_countries", sep="/"))
ISO_full <- as.character(world@data$iso_a2)
ISO_full[146] <- "SOM"  # The iso identifier for the Republic of Somaliland is missing
ISO_full[89]  <- "KV" # as for the Republic of Kosovo
ISO_full[39]  <- "CYP" # as for Cyprus

colcode <- numeric(length(ISO_full))
names(colcode) <- ISO_full
dnl_places <- names(colcode[which(names(colcode) %in% as.character(counts$country))])
rownames(counts) <- counts$country
colcode[dnl_places] <- counts[dnl_places, "count"]

world@data$id <- rownames(world@data)
world.points <- fortify(world, by="id")
names(colcode) <- rownames(world@data)
world.points$dnls <- colcode[world.points$id]

world.map <-  ggplot(data=world.points) +
geom_polygon(aes(x = long, y = lat, group=group, fill=dnls), color="black") +
coord_equal() + #theme_minimal() +
scale_fill_gradient(low="white", high="#56B1F7", name="Downloads") +
labs(title=paste(pkgname, " downloads from the '0-Cloud' CRAN mirror by country\nfrom ", date.start, " to ", date.stop,"\n(Total downloads: ", sum(counts$count), ")", sep=""))

wm <- pkgDNLs.worldmapcolor(pkgname="timeROC",  shp.file.rep="~/shapefileRepository")

Here is an exemple of the result on a friend‘s package: timeROC timeROC worldmap