vcf confusion


The contacts in the address book on your phone can be imported and exported using the vCard format. This is basically a text file with a .vcf file extension (for virtual contact file). It is surprisingly readable. Here is an example (slightly redacted from here):

BEGIN:VCARD
VERSION:3.0
N:Doe;John;;;
FN:John Doe
EMAIL;type=INTERNET;type=WORK;type=pref:johnDoe@example.org
TEL;type=WORK;type=pref:+1 617 555 1212
TEL;type=WORK:+1 (617) 555-1234
TEL;type=CELL:+1 781 555 1212
TEL;type=HOME:+1 202 555 1212
NOTE:John Doe has a long and varied history\, being documented on more police files that anyone else. Reports of his death are alas numerous.
CATEGORIES:Work,Test group
END:VCARD

I mean, what is there even to explain? You can just read it. And if anything is unclear (perhaps why the name is there twice) you can just read all about it in the official standard (RFC 6350). And even that one is nice and readable.
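To show how little structure there is to a single line, here is a minimal sketch in Python. It assumes the simple shape seen in the example above: a property name, optional `;key=value` parameters, then a colon and the value. The function name `parse_vcard_line` is my own invention, and the sketch deliberately ignores things the standard also allows, such as folded (continued) lines, quoted parameter values and escaped characters.

```python
def parse_vcard_line(line):
    """Split one vCard content line into (name, params, value).

    Simplified on purpose: no line unfolding, no quoted parameter
    values, no unescaping of sequences like "\\,".
    """
    # Everything before the first colon is the name plus parameters.
    head, _, value = line.partition(":")
    name, *raw_params = head.split(";")
    params = []
    for raw in raw_params:
        key, _, val = raw.partition("=")
        params.append((key.lower(), val))
    return name.upper(), params, value


sample = """BEGIN:VCARD
VERSION:3.0
N:Doe;John;;;
FN:John Doe
TEL;type=WORK;type=pref:+1 617 555 1212
TEL;type=CELL:+1 781 555 1212
END:VCARD"""

parsed = [parse_vcard_line(line) for line in sample.splitlines()]
for entry in parsed:
    print(entry)
```

The parameters are kept as a list of pairs rather than a dictionary, because the same key (here `type`) can appear more than once on a line.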

What an awesome format.

I want to analyze my address book

I wanted to do some data analysis on my address book.

It was tempting to write a parser for it myself. But one of the things I learned over the last few years is to not reinvent the wheel. Especially if the alternative is just an import and reading a bit of documentation.

So I looked on PyPI and was pleased to find a lot of vcf packages. 445, perhaps a bit more than you would expect.

I just tried out one or two, but I got a strange error message.

vcfpy.exceptions.IncorrectVCFFormat: Missing line starting with "#CHROM"

Hmm, that’s right, there is no line starting with #CHROM in my .vcf-file. Is my export broken? Is the package outdated perhaps?

So I tried another package, but I kept getting these error messages. Strange… What does #CHROM even mean? Should I just add it to my file if the package wants it so desperately?

I looked up the standard - there wasn’t anything about #CHROM. I looked up the error message on the internet - there definitely were people talking about #CHROM in their .vcf-files. I looked up the documentation of the package - no really useful information.

I kept reading and finally found it: There are multiple file formats named VCF. There is the one about contacts, but there is also the “Variant Call Format” used for genomic data in bioinformatics. No wonder there are so many vcf packages out there. And #CHROM stands for chromosome of course.
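In hindsight, a quick look at the first line of the file would have told the two apart. Here is a small sketch of such a check, assuming that a contact file opens with BEGIN:VCARD (as in the example above) and that a Variant Call Format file opens with a `##fileformat=VCF...` meta line or, at the latest, the `#CHROM` header line. The function name `guess_vcf_flavour` and its return values are made up for this post.

```python
def guess_vcf_flavour(text):
    """Guess which kind of ".vcf" a piece of text is.

    Returns "vcard", "variant-call" or "unknown", judging only by
    the first non-empty line.
    """
    for line in text.splitlines():
        # Drop surrounding whitespace and a possible byte order mark.
        stripped = line.strip().lstrip("\ufeff")
        if not stripped:
            continue
        if stripped.upper().startswith("BEGIN:VCARD"):
            return "vcard"
        if stripped.startswith(("##fileformat=VCF", "#CHROM")):
            return "variant-call"
        return "unknown"
    return "unknown"


print(guess_vcf_flavour("BEGIN:VCARD\nVERSION:3.0\nEND:VCARD"))
print(guess_vcf_flavour("##fileformat=VCFv4.2\n#CHROM\tPOS\tID"))
```

It is only a heuristic, but it would have been enough to notice that the package and my file were talking about different things.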

I had been reading the wrong documentation all along. Guess I should have just reinvented the wheel.