It is important to watch one’s figure. To obsess over it, inspecting every detail, carefully erasing unpleasant angles, making sure all elements are perfectly aligned. The hours I spend caring about my figures may one day be my downfall, but until that time I shall continue to worry about things like inner margins, font embedding and what shade of cornflower blue to use.*
A few weeks ago** I decided to use R for some data analysis. It transpires that the R I thought I knew was in fact a semi-incomprehensible amalgamation of gnuplot and Octave, so I was forced to go back to the beginning. A lot of language-learning boils down to reading documentation (still looking for the official documentation for Mandarin, though) but that’s inefficient when you have a specific task to do. I’ve waded around in the help files, and here’s what I dredged up for basic (but not ugly) plotting. Note: as far as I can tell, ggplot2 is the main R package for plotting. This post is not about that. That comes later.
data<-read.table("file_path_goes_here", sep=",", as.is=TRUE)
This’ll suck your datafile into R. The
sep option should contain whatever separates fields in your file. Default is whitespace. The
as.is=TRUE option stops R from turning fields containing text into ‘factors’, which may not be what you want. Note: Try to name your variables informatively, unlike me.
Let’s suppose we want to plot some function of the fourth column of the data frame (
data[,4]) against the first column (
data[,1]). Note: if you wanted the fifth row for example, you would write
i,jth element is just
The first line is figuring out how many rows are in our data frame in two equivalent ways. Note that length(data) just gives you the number of columns. The second line creates a new column whose values are a function of the (standard normal) cumulative distribution function (pnorm) of the absolute value of whatever is in the fourth column of the data frame. The third line appends (columnwise, with cbind) it to the data frame. Note: I think this might mess up the column names of the data-frame, so there must be a better way to do this. The last line is just defining a variable we’ll use later.
Now let’s use some other functions to create (useless***) vectors.
x is just a vector of real numbers starting at
min(data[,1) and ending with
max(data[,1]), with evenly spaced intervals so its length is
seq creates a sequence).
y is also a vector of length
runif(100) creates a vector of 100 numbers between 0 and 1 taken from a uniform distribution.
rep then repeats this 10 times. If you wanted the repetition to be elementwise (eg
[a,a,b,b] instead of
[a,b,a,b]) just write
plot(x, y, xlab="X Axis Name", ylab="Y Axis Name", main="Boring Plot Title")
So boring and default. How about
hist(y, freq=FALSE, main="Hist takes similar options")
Surprisingly enough this makes a histogram.
freq=FALSE makes the y-axis into densities (eg fractions) rather than frequencies.
Now the basics are covered, let’s go crazy and make something that looks like this:
We’re creating the plot environment here. What is going on is as follows: we’ve got three plots to arrange. Each one of these corresponds to a number in the matrix, which determines where it will go. Note: you can clearly try to give nonsensical layouts here, like demanding the first plot exist in both upper left and bottom right corners. This produces… interesting results.
matrix command takes a list (
c(1,1,2,3)) and matrix…es it. Here we’re telling it to have 2 rows and 2 columns, and to fill the matrix hull with the elements of the list row by row (
byrow=TRUE – the default is by column).
We then inform
layout that we want the columns of the matrix to have relative widths
c(1,3) (note the second plot on the second line is 3 times as wide as the first) and the rows to have relative heights
c(1.75,1). Many joyous minutes can be spent tweaking these values.
par lets you globally set graphical parameters. A lot of its options are the same as for plot or hist, but it saves you having to write them for every individual plot. Here we’re setting the font
bty, which determines the box style for the plots (eg the presence of axis lines). The options for these are depictions of the results:
L, 7, C, U, ], etc. Note: I love Avenir, but using ‘strange’ fonts like this can cause problems if/when you go to save these plots directly to pdf.
I thoroughly recommend perusing the options of
par. They will open your eyes to the scope of basic R plotting. In particular
omi lend themselves well to tweaking, and
mfrow is like a poor-man’s
plot(data[,1]/(1e07), data[,4], type="h", col=ifelse(abs(data[,4])>2.5, "darkorange1","black"), ylim=c(1.03*min(data[,4], na.rm=TRUE), 1.03*max(data[,4], na.rm=TRUE)), main=paste("Some data on chromosome", chro), xlab="Physical Distance (e+07)", ylab="Test Statistic")
Right, so. What’s going on here:
- I’m plotting
data[,1]/(1e07). In my case, the elements of
data[,1]are all of the order 10 million, so the division just rescales the x-axis to make it easier on the eye. R understands scientific notation like this, which is nice.
"h"plots vertical bars (as in a histogram).
pch:(see later) is point style (if we had them). Tiny dots are
".", bigger dots are
"20", see a diagram for more options.
col: (colour) you could write something simple like
col="cornflowerblue"to make all of your points (or vertical lines as the case is here) that colour. But that’s boring. What we have here is
"black"otherwise. Note: you may have noticed that my
data[,4]column consists of z-scores, and I’m highlighting the ‘unusual’ ones.
ylim: we’re manually setting the limits on the y axis here, so it’s
ylim = c(ymin, ymax). I’m going a little bit beyond the actual range of the data, and making sure to remove “
nan“s in the
min/maxfunctions. Note: if your data is nice unlike mine, you shouldn’t need to do this.
paste: lets me combine text with other variables in the environment. In this case, it’s
chro, which we defined earlier.
abline: this just adds a line to the plot. In this case, because a y-position was specified, it’s horizontal (actually, I added two lines, because I specified two positions). Alternately,
abline(v=a)for a vertical line.
Let’s give it a legend.
legend(0,1.05*max(data[,4],na.rm=TRUE),"Orange : |Score| >2.5",bg="cornflowerblue",cex=0.9)
The first two options given to
legend are the location of the upper-left corner of its box.
cex is an overall scaling factor. See the options to
layout to expect 3 plots, so this goes in the first place. I need to supply two more plots. The first one will be a histogram of our allegedly-normally-distributed dataset, column 4.
The second two lines produce a normal distribution and overlay them onto the
hist plot. (Both
lines will do this, if you need to add more data to an existing plot).
breaks is a histogram option saying how many individual bars we want, so you can adjust the ‘resolution’ of the histogram (so tweakable!).
dnorm gives the probability density function for a standard normal, which we compare on the histogram.
And now for the final plot. Assuming the values in
data[,4] come from a random variable with (standard) normal distribution, then
p_col = 2*(1-pnorm(abs(data[,4])) is the probability of getting a more extreme value. (
pnorm(abs(data[,4])) is just the cumulative distribution function of the absolute value of the variable, so
1-pnorm is the probability of getting a higher value. The multiplication by 2 compensates for our having taken an absolute value). We can then take the
-log (base 10) of this, as is the style. Plotting is simple enough:
plot(data[,1]/(1e07), -log10(p_col), type="p", pch=20, cex=0.65 col=ifelse(p_col<0.05, "darkorange1","black"), ylim=c(0,max(-log10(p_col),na.rm=TRUE)), main=paste("Homebrew Manhattan plot for chromosome", chro), xlab="Physical Distance (e+07)", ylab="-log10(p)")
Here I’ve used
cex to make the plot points a bit smaller.
Statistical note: depending on the nature of the analysis you’re doing, a ‘p-value’ of 0.05 may not be significant. You may be doing multiple hypothesis testing and will need to account for this. The example shown here is purely for demonstrative purposes and has no scientific legitimacy.
And we are done! For this particular example, I would be inclined to spend additional time tweaking the various inner/outer/inter-plot margins to reduce the amount of whitespace in the figure, but that’s just me.
*technical point: R only has one shade of cornflower blue.
**It seems I started this post over a month ago. Oops!
***I was originally planning on including the full complexity of my plotting script here, which involved vectors like this. Now they serve only as examples of what is possible.