I’ve seen this too many times: > data<-read.table('filename.txt')
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line X did not have Y elements

Every time I see and solve this problem, I’m going to keep a note of what it was and put it here. Hopefully this will not turn out to be an exhaustive list…

Most recently, the reason for this was this default in read.table (read.csv is different): comment.char = "#"
One of the strings in my file had a “#” in it, which resulted in the rest of the line being “commented out” and the above error.

It is important to watch one’s figure. To obsess over it, inspecting every detail, carefully erasing unpleasant angles, making sure all elements are perfectly aligned. The hours I spend caring about my figures may one day be my downfall, but until that time I shall continue to worry about things like inner margins, font embedding and what shade of cornflower blue to use.*

A few weeks ago** I decided to use R for some data analysis. It transpires that the R I thought I knew was in fact a semi-incomprehensible amalgamation of gnuplot and Octave, so I was forced to go back to the beginning. A lot of language-learning boils down to reading documentation (still looking for the official documentation for Mandarin, though) but that’s inefficient when you have a specific task to do. I’ve waded around in the help files, and here’s what I dredged up for basic (but not ugly) plotting. Note: as far as I can tell, ggplot2 is the main R package for plotting. This post is not about that. That comes later.

This’ll suck your datafile into R. The sep option should contain whatever separates fields in your file. Default is whitespace. The as.is=TRUE option stops R from turning fields containing text into ‘factors’, which may not be what you want. Note: Try to name your variables informatively, unlike me.

Let’s suppose we want to plot some function of the fourth column of the data frame (data[,4]) against the first column (data[,1]). Note: if you wanted the fifth row for example, you would write data[5,]. The i,jth element is just data[i,j].

The first line is figuring out how many rows are in our data frame in two equivalent ways. Note that length(data) just gives you the number of columns. The second line creates a new column whose values are a function of the (standard normal) cumulative distribution function (pnorm) of the absolute value of whatever is in the fourth column of the data frame. The third line appends (columnwise, with cbind) it to the data frame. Note: I think this might mess up the column names of the data-frame, so there must be a better way to do this. The last line is just defining a variable we’ll use later.

Now let’s use some other functions to create (useless***) vectors.

x is just a vector of real numbers starting at min(data[,1) and ending with max(data[,1]), with evenly spaced intervals so its length is 1000 (seq creates a sequence). y is also a vector of length 1000. runif(100) creates a vector of 100 numbers between 0 and 1 taken from a uniform distribution. rep then repeats this 10 times. If you wanted the repetition to be elementwise (eg [a,a,b,b] instead of [a,b,a,b]) just write each=10.

We’re creating the plot environment here. What is going on is as follows: we’ve got three plots to arrange. Each one of these corresponds to a number in the matrix, which determines where it will go. Note: you can clearly try to give nonsensical layouts here, like demanding the first plot exist in both upper left and bottom right corners. This produces… interesting results.

The matrix command takes a list (c(1,1,2,3)) and matrix…es it. Here we’re telling it to have 2 rows and 2 columns, and to fill the matrix hull with the elements of the list row by row (byrow=TRUE – the default is by column).

We then inform layout that we want the columns of the matrix to have relative widths c(1,3) (note the second plot on the second line is 3 times as wide as the first) and the rows to have relative heights c(1.75,1). Many joyous minutes can be spent tweaking these values.

par lets you globally set graphical parameters. A lot of its options are the same as for plot or hist, but it saves you having to write them for every individual plot. Here we’re setting the font family and bty, which determines the box style for the plots (eg the presence of axis lines). The options for these are depictions of the results: L, 7, C, U, ], etc. Note: I love Avenir, but using ‘strange’ fonts like this can cause problems if/when you go to save these plots directly to pdf.

I thoroughly recommend perusing the options of par. They will open your eyes to the scope of basic R plotting. In particular mar and omi lend themselves well to tweaking, and mfrow is like a poor-man’s layout.

plot(data[,1]/(1e07), data[,4], type="h", col=ifelse(abs(data[,4])>2.5, "darkorange1","black"), ylim=c(1.03*min(data[,4], na.rm=TRUE), 1.03*max(data[,4], na.rm=TRUE)), main=paste("Some data on chromosome", chro), xlab="Physical Distance (e+07)", ylab="Test Statistic")

I’m plotting data[,4] against data[,1]/(1e07). In my case, the elements of data[,1] are all of the order 10 million, so the division just rescales the x-axis to make it easier on the eye. R understands scientific notation like this, which is nice.

type: "p" plots points, "h" plots vertical bars (as in a histogram).

pch: (see later) is point style (if we had them). Tiny dots are ".", bigger dots are "20", see a diagram for more options.

col: (colour) you could write something simple like col="cornflowerblue" to make all of your points (or vertical lines as the case is here) that colour. But that’s boring. What we have here is "darkorange1" if abs(data[,4])>2.5, and "black" otherwise. Note: you may have noticed that my data[,4] column consists of z-scores, and I’m highlighting the ‘unusual’ ones.

ylim: we’re manually setting the limits on the y axis here, so it’s ylim = c(ymin, ymax). I’m going a little bit beyond the actual range of the data, and making sure to remove “nan“s in the min/max functions. Note: if your data is nice unlike mine, you shouldn’t need to do this.

paste: lets me combine text with other variables in the environment. In this case, it’s chro, which we defined earlier.

abline: this just adds a line to the plot. In this case, because a y-position was specified, it’s horizontal (actually, I added two lines, because I specified two positions). Alternately, abline(v=a) for a vertical line.

The first two options given to legend are the location of the upper-left corner of its box. cex is an overall scaling factor. See the options to par again.

I told layout to expect 3 plots, so this goes in the first place. I need to supply two more plots. The first one will be a histogram of our allegedly-normally-distributed dataset, column 4.

The second two lines produce a normal distribution and overlay them onto the hist plot. (Both points and lines will do this, if you need to add more data to an existing plot). breaks is a histogram option saying how many individual bars we want, so you can adjust the ‘resolution’ of the histogram (so tweakable!). dnorm gives the probability density function for a standard normal, which we compare on the histogram.

And now for the final plot. Assuming the values in data[,4] come from a random variable with (standard) normal distribution, then p_col = 2*(1-pnorm(abs(data[,4])) is the probability of getting a more extreme value. (pnorm(abs(data[,4])) is just the cumulative distribution function of the absolute value of the variable, so 1-pnorm is the probability of getting a higher value. The multiplication by 2 compensates for our having taken an absolute value). We can then take the -log (base 10) of this, as is the style. Plotting is simple enough:

Here I’ve used cex to make the plot points a bit smaller.

Statistical note: depending on the nature of the analysis you’re doing, a ‘p-value’ of 0.05 may not be significant. You may be doing multiple hypothesis testing and will need to account for this. The example shown here is purely for demonstrative purposes and has no scientific legitimacy.

And we are done! For this particular example, I would be inclined to spend additional time tweaking the various inner/outer/inter-plot margins to reduce the amount of whitespace in the figure, but that’s just me.

*technical point: R only has one shade of cornflower blue.

**It seems I started this post over a month ago. Oops!

***I was originally planning on including the full complexity of my plotting script here, which involved vectors like this. Now they serve only as examples of what is possible.