does not scan

I’ve seen this too many times:
> data<-read.table('filename.txt')
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line X did not have Y elements

Every time I see and solve this problem, I’m going to keep a note of what it was and put it here. Hopefully this will not turn out to be an exhaustive list…

Most recently, the reason for this was this default in read.table (read.csv is different):
comment.char = "#"
One of the strings in my file had a “#” in it, which resulted in the rest of the line being “commented out” and the above error.

Advertisements

Simple plotting with R

cornflowerblue

It is important to watch one’s figure. To obsess over it, inspecting every detail, carefully erasing unpleasant angles, making sure all elements are perfectly aligned. The hours I spend caring about my figures may one day be my downfall, but until that time I shall continue to worry about things like inner margins, font embedding and what shade of cornflower blue to use.*

A few weeks ago** I decided to use R for some data analysis. It transpires that the R I thought I knew was in fact a semi-incomprehensible amalgamation of gnuplot and Octave, so I was forced to go back to the beginning. A lot of language-learning boils down to reading documentation (still looking for the official documentation for Mandarin, though) but that’s inefficient when you have a specific task to do. I’ve waded around in the help files, and here’s what I dredged up for basic (but not ugly) plotting. Note: as far as I can tell, ggplot2 is the main R package for plotting. This post is not about that. That comes later.

Basics

data<-read.table("file_path_goes_here", sep=",", as.is=TRUE)

This’ll suck your datafile into R. The sep option should contain whatever separates fields in your file. Default is whitespace. The as.is=TRUE option stops R from turning fields containing text into ‘factors’, which may not be what you want. Note: Try to name your variables informatively, unlike me.

Let’s suppose we want to plot some function of the fourth column of the data frame  (data[,4]) against the first column (data[,1]).  Note: if you wanted the fifth row for example, you would write data[5,]. The i,jth element is just data[i,j].

len<-length(data[,1]) == nrow(data)
p_col<-2*(1-pnorm(abs(data[,4])))
data<-cbind(data,abs_col)
chro<-17

The first line is figuring out how many rows are in our data frame in two equivalent ways. Note that length(data) just gives you the number of columns. The second line creates a new column whose values are a function of the (standard normal) cumulative distribution function (pnorm) of the absolute value of whatever is in the fourth column of the data frame. The third line appends (columnwise, with cbind) it to the data frame. Note: I think this might mess up the column names of the data-frame, so there must be a better way to do this. The last line is just defining a variable we’ll use later.

Now let’s use some other functions to create (useless***) vectors.

x<-seq(min(data[,1]), max(data[,1]), length.out=1000)
y<-rep(runif(100), 10)

x is just a vector of real numbers starting at min(data[,1) and ending with max(data[,1]), with evenly spaced intervals so its length is 1000 (seq creates a sequence). y is also a vector of length 1000. runif(100) creates a vector of 100 numbers between 0 and 1 taken from a uniform distribution. rep then repeats this 10 times. If you wanted the repetition to be elementwise (eg [a,a,b,b] instead of [a,b,a,b]) just write each=10.

Basic plot

plot(x, y, xlab="X Axis Name", ylab="Y Axis Name", main="Boring Plot Title")

So boring and default. How about

hist(y, freq=FALSE, main="Hist takes similar options")

Surprisingly enough this makes a histogram. freq=FALSE makes the y-axis into densities (eg fractions) rather than frequencies.

Cool plots

Now the basics are covered, let’s go crazy and make something that looks like this:

A collection of meaningless plots.
A collection of meaningless plots.

layout(matrix(c(1,1,2,3), 2, 2, byrow=TRUE), widths=c(1,3),heights=c(1.75,1))

We’re creating the plot environment here. What is going on is as follows: we’ve got three plots to arrange. Each one of these corresponds to a number in the matrix, which determines where it will go. Note: you can clearly try to give nonsensical layouts here, like demanding the first plot exist in both upper left and bottom right corners. This produces… interesting results.

The matrix command takes a list (c(1,1,2,3)) and matrix…es it. Here we’re telling it to have 2 rows and 2 columns, and to fill the matrix hull with the elements of the list row by row (byrow=TRUE – the default is by column).

We then inform layout that we want the columns of the matrix to have relative widths c(1,3) (note the second plot on the second line is 3 times as wide as the first) and the rows to have relative heights c(1.75,1). Many joyous minutes can be spent tweaking these values.

par(family="Avenir",bty="l")

par lets you globally set graphical parameters. A lot of its options are the same as for plot or hist, but it saves you having to write them for every individual plot. Here we’re setting the font family and bty, which determines the box style for the plots (eg the presence of axis lines). The options for these are depictions of the results: L, 7, C, U, ], etc. Note: I love Avenir, but using ‘strange’ fonts like this can cause problems if/when you go to save these plots directly to pdf.

I thoroughly recommend perusing the options of par. They will open your eyes to the scope of basic R plotting. In particular mar and omi lend themselves well to tweaking, and mfrow is like a poor-man’s layout.

plot(data[,1]/(1e07), data[,4], type="h", col=ifelse(abs(data[,4])>2.5, "darkorange1","black"), ylim=c(1.03*min(data[,4], na.rm=TRUE), 1.03*max(data[,4], na.rm=TRUE)), main=paste("Some data on chromosome", chro), xlab="Physical Distance (e+07)", ylab="Test Statistic")

abline(h=c(-2.5,2.5),col="darkorange1")

Right, so. What’s going on here:

  • I’m plotting data[,4] against data[,1]/(1e07). In my case, the elements of data[,1] are all of the order 10 million, so the division just rescales the x-axis to make it easier on the eye. R understands scientific notation like this, which is nice.
  • type: "p" plots points, "h" plots vertical bars (as in a histogram).
  • pch: (see later) is point style (if we had them). Tiny dots are ".", bigger dots are "20", see a diagram for more options.
  • col: (colour) you could write something simple like col="cornflowerblue" to make all of your points (or vertical lines as the case is here) that colour. But that’s boring. What we have here is "darkorange1" if abs(data[,4])>2.5, and "black" otherwise. Note: you may have noticed that my data[,4] column consists of z-scores, and I’m highlighting the ‘unusual’ ones.
  • ylim: we’re manually setting the limits on the y axis here, so it’s ylim = c(ymin, ymax). I’m going a little bit beyond the actual range of the data, and making sure to remove “nan“s in the min/max functions. Note: if your data is nice unlike mine, you shouldn’t need to do this.
  • paste: lets me combine text with other variables in the environment. In this case, it’s chro, which we defined earlier.
  • abline: this just adds a line to the plot. In this case, because a y-position was specified, it’s horizontal (actually, I added two lines, because I specified two positions). Alternately, abline(v=a) for a vertical line.

Let’s give it a legend.

legend(0,1.05*max(data[,4],na.rm=TRUE),"Orange : |Score| >2.5",bg="cornflowerblue",cex=0.9)

The first two options given to legend are the location of the upper-left corner of its box. cex is an overall scaling factor. See the options to par again.

I told layout to expect 3 plots, so this goes in the first place. I need to supply two more plots. The first one will be a histogram of our allegedly-normally-distributed dataset, column 4.

hist(data[,4],breaks=50,freq=FALSE,main="Scores",xlab="Test Statistic")
z<-seq(-4,4,length=200)
lines(z,dnorm(z),col="darkorange1")

The second two lines produce a normal distribution and overlay them onto the hist plot. (Both points and lines will do this, if you need to add more data to an existing plot). breaks is a histogram option saying how many individual bars we want, so you can adjust the ‘resolution’ of the histogram (so tweakable!). dnorm gives the probability density function for a standard normal, which we compare on the histogram.

And now for the final plot. Assuming the values in data[,4] come from a random variable with (standard) normal distribution, then p_col = 2*(1-pnorm(abs(data[,4])) is the probability of getting a more extreme value. (pnorm(abs(data[,4])) is just the cumulative distribution function of the absolute value of the variable, so 1-pnorm is the probability of getting a higher value. The multiplication by 2 compensates for our having taken an absolute value). We can then take the -log (base 10) of this, as is the style. Plotting is simple enough:

plot(data[,1]/(1e07), -log10(p_col), type="p", pch=20, cex=0.65 col=ifelse(p_col<0.05, "darkorange1","black"), ylim=c(0,max(-log10(p_col),na.rm=TRUE)), main=paste("Homebrew Manhattan plot for chromosome", chro), xlab="Physical Distance (e+07)", ylab="-log10(p)")

Here I’ve used cex to make the plot points a bit smaller.

Statistical note: depending on the nature of the analysis you’re doing, a ‘p-value’ of 0.05 may not be significant. You may be doing multiple hypothesis testing and will need to account for this. The example shown here is purely for demonstrative purposes and has no scientific legitimacy.

And we are done! For this particular example, I would be inclined to spend additional time tweaking the various inner/outer/inter-plot margins to reduce the amount of whitespace in the figure, but that’s just me.

*technical point: R only has one shade of cornflower blue.

**It seems I started this post over a month ago. Oops!

***I was originally planning on including the full complexity of my plotting script here, which involved vectors like this. Now they serve only as examples of what is possible.

Learning Hangul(한글)

Hang-Cool 1

The Korean alphabet (Hangul) is – so far – my favourite writing system. It is logical and efficient. It pleases my sense of style. Since starting this post over a month ago I took up learning Mandarin so my feelings towards Hanzi are liable to threaten Hangul’s dominance in the future, but for now I side with space-robot alphabet. Because that’s what Hangul is.

See? Space robots.
See? Space robots.

At first glance one may assume that Hangul consists of logograms – characters representing words rather than phonemes, but this is not the case. The alphabet is very much phonetic. Each “block” is a single syllable, so for example Hangul(한글) is Han(한)+gul(글).

Since syllables are made of phonemes, it is not surprising that the blocks consist of sub-components representing these phonemes. (It was surprising the first time I learned of this, because such an elegant solution to written language had not occurred to me – though upon further reflection, the trick is just “writing words more compactly” so it’s not as novel as it is aesthetically pleasing.) Some insane person wrote a Wikipedia page documenting every possible syllabic block in Korean, so all you need to do is memorise all ten thousand of these (give or take a few thousand) and reading Korean will become trivial. End of post. If this idea is appealing to you, I might suggest going to Cambridge to do Part III of the Mathematical Tripos.

The more elegant solution is to learn the alphabet. Each letter is called a “jamo”, but they only occur inside blocks, sort of like quarks. Unlike quarks, we can still look at them individually. I’ll include the IPA in [], and a ‘translation’ of IPA into my accent (mileage may vary). For pronunciation purposes, text is no replacement for audio, so I would suggest finding some videos, like this one, for example.

Simple vowels: Simple vowels are made of horizontal or vertical lines and short strokes.

ㅣ [i] (“ee” in “tree”)
ㅏ [a] (“a” in “mad”)
ㅓ [ʌ] (“u” in “mud”)

ㅡ [ɯ] (somewhere between the “oo” in “cool” and the “eu” in “eugh” – I have a really hard time differentiating this from ㅜ)
ㅗ [o] (“o” in “bowl”)
ㅜ [u] (“oo” in “too”)

Complex vowels: Combinations of simple vowels (including diphthongs). I’m not going to include all combinations because many of them are self-evident given the simple vowels.

These ones are less obvious:
ㅐ [ɛ] (“e” in “bed”)
ㅔ [e] (“e” in “grey”)

Generally, ㅗ or ㅜ combined with another vowel gives a “w-” sound, so for example ㅘ is “wah”, ㅙ is “weh”, and ㅟ is “wee”.

There’s no letter for “y” in Korean, so if you want to “y” up a vowel, double up on short strokes (I believe this process is called ‘iotation’. You can do something similar in Slavic languages with ь – Cyrillic comes a close second in the space-robot race.) This produces
ㅑ [ja] (“yah”)
ㅕ [] (“yuh”)
ㅛ [jo] (“yoh”)
ㅠ [ju] (“yoo”)

We can extend this to the complex vowels, to get ㅒ for “yeh” and ㅖ for a slightly different “yeh”.

Consonants:

Syllables are usually a consonant-vowel sandwich, so consonants can be “initial”, “medial”, or “final” (I’ll write [i/m/f]), and the placement makes a (small) difference to the pronunciation of the letter.

ㄱ [k/g/k̚] (“k” as in “Kant”, “g” as in “gravity”, “k̚” as in “quark”)
ㄴ [n/n/n] (“n” as in “neutron”)
ㄷ [t/d/t̚] (“t” as in “tachyon”, “d” as in “down”, “t̚” as in “cat”)
ㅅ [s/s/t̚] (“s” as in “strange”)
ㅁ [m/m/m] (“m” as in “mass”)
ㅂ [p/b/p̚] (“p” as in “point”, “b” as in “baryon”, “p̚” as in “top”)
ㅇ [-/ŋ/ŋ] (This is just a silent placeholder in the initial position. In all others it’s “ng”, as in “ping”)
ㄹ [ɾ/ɾ/l] (“ɾ” as in “alveolar tap”, a sound which is neither “r” nor “l”)

Some consonants are obtained from others by aspiration. Aspiration is basically just adding air to the sound – so imagine trying to sneak a “h-” sound in after the consonant. In Hangul, the addition of a horizontal line seems to denote this aspiration, or a general ‘softening’ or alteration of the sound (in the case of the letter I like to think of as “j”). This produces:
ㄱ > ㅋ [kʰ/kʰ/k̚] (“kʰ” is an aspirated “k”, oddly enough)
ㄷ > ㅌ [tʰ/tʰ/t̚]
ㅅ > ㅈ [tɕ/dʑ/t̚] (“tɕ” as in “charm”, “dʑ” as in “jam”)
ㅈ > ㅊ [tɕʰ/tɕʰ/t̚] (“tɕʰ” as in “oh god send help”)
ㅂ > ㅍ [pʰ/pʰ/p̚] (“pʰ” as in *strangling noises*)
ㅇ > ㅎ [h/ɦ/-] (“h” as in “hello”, “ɦ” as in “cool whip”)

There are also “double letters”: ㄲ, ㄸ, ㅃ, ㅆ, ㅉ which are “tense”, so they’re pronounced a bit like you’re after spending the last hour reading articles about phonetics and just realised it’s too late to watch Breaking Bad. “Damn it!” ~ “땀읻!”

I should stress that this entire post has very little to do with the Korean language. I don’t know any Korean, but transliteration can be fun, and this article was largely about IPA. Trying to cram English into a foreign language really makes you appreciate phonetic differences.

치샔시챐파에대시추러…

피탤퍼이팰챜앧아퍀어퍀랟페펤… (curse you lack of “ɘ”!)

Printing a range of bytes from a specific line in a file

To read the first B bytes from the Nth line of a file using pipes in bash:

cat filename | head -N | tail -1 | head -c B

The first pipe takes the output of cat filename (which is simply a printout of filename) and feeds it as input to head -N which produces the first N lines of filename, the last line of which is produced by tail -1, and then the first B bytes are pulled out by head -c B.

So, to read from byte B1 to B2 (inclusive) on the Nth line:

cat filename | head -N | tail -1 | tail -c +B1 | head -c B2-B1

This is cheating a bit because you have to work out B2-B1 first. If you type it as-is you will not succeed. The +B1 bit in the second tail command tells it to work from the start of the file (normally tail counts backwards).

Or you could do it much more “easily” using sed and cut. To start, we get the Nth line of the file:

sed 'Nq;d' filename

Knowing nothing about sed, this took a while to understand. I’m still not entirely sure if my intuition is correct, but it goes as follows: sed scans the file line by line. The q command is triggered only on line N. On every other line, instead of printing the file, it doesn’t print (thanks to d), so we see nothing. When it gets to line N it quits, after printing the line, but before triggering d. Another way to print only line N is with

sed -n 'Np' filename

What happens here is that -n tells said not to print anything, while p says ‘print line N’. The difference between this and the previous one is that sed will continue to the end of the file, quietly not printing anything. That’s sort of a waste of time, so – as per usual – the less comprehensible version is faster.

And now we can simply do

sed 'Nq;d' filename | cut -b B1-B2

Where here B1-B2 can can be written literally, because cut takes a range as its argument.

We could have also used awk to extract a line, instead of sed:

awk 'NR==N' filename

I am particularly interested in learning sed though, so I’ll try to stick to sed solutions where possible.