Read multiples of same columns in a fixed width file

Read multiples of same columns in a fixed width file

Consider the following few lines from a Stata .dct file which defines for Stata how to read this fixed width ASCII file (can be decompressed with any ZIP software on any platform):

start             type                            varname width  description
_column(24)       long                               rfv1   %5f  Patient's Reason for Visit #1            
_column(29)       long                               rfv2   %5f  Patient's Reason for Visit #2             
_column(34)       long                               rfv3   %5f  Patient's Reason for Visit #3             
_column(24)       long                             rfv13d   %4f  Patient's Reason for Visit #1 - broad     
_column(29)       long                             rfv23d   %4f  Patient's Reason for Visit #2 - broad     
_column(34)       long                             rfv33d   %4f  Patient's Reason for Visit #3 - broad     

Basically the 24th through 39th characters in every row of this ASCII file look like this:

AAAAaBBBBbCCCCc

Where the first broad code is AAAA, the narrower code for that same reason is AAAAa, etc.

In other words, because the codes themselves have a heirarchical structure, the same characters in every row are read twice to create two different variables.

read.fwf, by contrast, just takes a widths argument, which precludes this type of double-reading.

Is there a standard way of handling this, without recreating the wheel from scratch by scanning in the entire file and parsing it by hand?

The background here is that I’m writing a function to parse these .DCT files, in the style of SAScii, and my job would be much simpler if I could specify (start, width) pairs for every variable rather than just widths.

I had started working on a .DCT parser but lost steam. My intended usage scenario was to actually simply parse the file and create a csvkit schema file to allow me to use csvkit to convert the file from fixed width to csv. To that end, the package was successful, but it is very un-refined and only very minimally tested.

A couple of the problems to look out for include (1) not all DCT files have the same columns; (2) some DCT files have instructions for implicit decimal places, and I never got around to coming up with a method for dealing with those types of files.

You can find the initial work on the package here.

The main functions are:

  • dct.parser — Does what you would expect. There is a “preview” argument that reads in the first few lines to let you determine whether the DCT file has all the columns you expect or not.
  • csvkit.schema — Using the info extracted from dct.parser, a csv file is created with the relevant columns required by in2csv from csvkit.
  • csvkit.fwf2csv — Basically a system call to csvkit. Can also be done outside of R.

For your particular example, I successfully read it using:

## The extracted data file and the DCT file are in my downloads directory
setwd("~/Downloads/") 
dct.parser("ed02.dct", preview=TRUE) ## It seems that everything is there
temp <- dct.parser("ed02.dct")       ## Can be used as a lookup table later

## The next line automatically creates a csv schema file in your 
##   working directory named, by default, "your-dct-filename.csv"
csvkit.schema(temp) 
csvkit.fwf2csv(datafile = "ED02", schema="ed02.dct.csv", output="ED02.csv")

## I haven't set up any mechanism to check on progress...
## Just check the directory and see when the file stops growing :)
ED02 <- read.csv("ED02.csv")

Another alternative that I had intended to work on (but never did) was to use paste to construct substr commands that could be used by sqldf to read in data where the columns contain overlapping data. See this blog post for an example to get started.


Update: An sqldf example

As mentioned above, sqldf can make good use of the output of dct.parser and read in your data by using substr. Here’s an example of how you would do this:

## The extracted data file and the DCT file are in my downloads directory
setwd("~/Downloads/") 
temp <- dct.parser("ed02.dct")       ## Can be used as a lookup table later

## Construct your "substr" command
GetMe <- paste("select", 
               paste("substr(V1, ", temp$StartPos, ", ",
                     temp$ColWidth, ") `", temp$ColName, "`", 
                     sep = "", collapse = ", "), 
               "from fixed", sep = " ")

## Load "sqldf"
library(sqldf)

fixed <- file("ED02")
ED02 <- sqldf(GetMe, file.format = list(sep = "_"))
dim(ED02)
# [1] 37337   260

As can be seen, a little modification was necessary in the sqldf line. In particular, since sqldf uses read.csv.sql, it saw any comma characters in your data as a delimiter. You can just change that to something you don’t expect in the data.

.
.
.
.