What I done learned about REDCap

2015-07-15

A few surprises

For those not in the know, REDCap is a platform for creating and editing databases through the web. And by and large, it works fine. It saves a lot of development effort. It provides good reporting tools for users. It's secure and robust. But there are some things to be aware of, perhaps because of being insufficiently documented, perhaps because of expectations.

Note

REDCap jargon is a little odd. To avoid confusion, I'll use the REDCap terms but just to remind you how REDCap-ese maps to some common terms:

an instrument:: is a form
a field:: is a field or column
a label:: is a field title
a name:: is an field ID, identifier or internal ID
a record:: is a record or row
a record ID:: is a unique ID or key

Note

Most of these experiences are with REDCap version 5.12 through to 6.4.4. Of course, later versions may have fixed or adjusted some of this behaviour.

Field labels

You may be tempted to have fields with the same labels (i.e. the visible title). For example, you have a instrument for a subject's initial assessment. Within it, you record the date of that visit as "Date". Then you have a later instrument for a later visit, which you again equip with a field "Date". Don't do this. When data is exported in the "labels" format, you'll end up multiple columns with similar titles and confuse people. Give each field a unique label: "Date of initial visit", "Date of second visit" ...

26 character or less IDs

REDCap warns you if you create a field with an name that is more than 26 characters long, but lets you do it anyway. Don't. I naively assumed that as it let me, the problems wouldn't be great and if there were any problems, they would be flagged-up.

Foolish me.

Eventually I discovered cases where the fields with the long names weren't appearing in downloaded data and reports, silent errors that caused huge problems. This may be dependent upon the format of exported data (e.g. R, CSV), but every field's name must be no more than 26 characters. Be very aware of this for checkboxes and radio buttons (see elsewhere). But also see notes on the "complete" fields below.

Checkboxes

I used checkboxes for a number of REDCap fields, which on reflection may have been better or more simply implemented otherwise. Don't mistake me, they work, but there were some unobvious implications to using them.

Automatic formation of choice values:

It's hard to see how REDCap could do this better, but when you give it a list of choices (say for a multiple-choice checkbox), it just assigns them consecutive numbers. Bar and Baz will be represented by the coded internal values 0 and 1. Which isn't very meaningful. There's no way around this but to insert more meaningful names yourself. Remember they are limited to an alphanumeric string with underscores. (No spaces!)

Name formation:

If a REDCap checkbox (multichoice) field has the label Foo (name foo) and the choice values bar, Bar and baz, Baz, this will actually be represented as multiple fields: Foo (Bar) (name foo___bar) and Foo (Baz) ``(name ``foo___baz). Users can sometimes find these a bit confusing.

A far greater problem is that while REDCap warns you if you create a field with an name greater than 26 characters (see above), it does not alert you if a checkbox or radio button will result in a similar long name. (For example: longfieldname___longchoicename). Watch out for these, they will result in the same problems as other long names.

Data export:

If you export checkbox data in the "labels" format, it is rendered as the values Checked and Unchecked. Which is logical but a little surprising and frightens some users. (I have some easily scared users.)

Blank values:

REDCap is generally very tolerate of blank or missing values but checkboxes is one place where it is not. Say you have a checkbox name foo with choices bar and baz and thus the columns foo___bar and foo___baz. If either is checked, they will have the value 1, otherwise 0. If you import data with those values, it will be set accordingly. However, there's the (logical) temptation to set unchecked fields as blank. Don't do this. I found cases where these blank checkbox fields acquired a random value. As near as I could tell, while imported blank fields are usually recorded faithfully (e.g. an empty field in the input leads to an empty text field in the database), if an imported checkbox field is blank, its value is not recorded and it seems like it might acquire a random value. Unchecked checkboxes and radiobuttons must be explicitly set as unchecked (i.e. recorded as '0')

Complete fields

Each instrument automatically acquires the field foo_complete, which captures whether the form has been marked as complete. Users can be surprised by these fields when they "mysteriously" appear in exports, bearing values like "unverified". For reference (and if you are importing data and wish to mark "completeness"), the possible values of this field are "Incomplete" (0), "Unverified" (1) and "Complete" (2)

Importing data

REDCap import "just works" but it's worth underlining how it works.

Record IDs:: If you import a record with the same ID as a pre-existing record, the previous record is over-written. There is no exception or way of getting around this, it's a fundamental and largely useful behaviour of REDCap. See "checkboxes" for a delightful exception to this.
Data import tool:: "Manually" filling in a data import template and uploading works fine. However, REDCap is fairly slow at importing data and the process consumes a lot of memory, such as to limit the size of file you can upload. (It will depend on how your server is set up but as a rule of thumb, watch out as you tend towards megabyte-sized uploads.) So then you have to split your upload into multiple files, while retaining the header. There's a number of command-line and online tools for doing this, but the upload task gets to be fairly tedious. Select file, click upload, wait, confirm you want to commit records, select file ... If you have to a lot of data often, you can instead script something to use the REST interface. See "PyCap".

Exporting data

Exporting is another thing that largely just works, but there are a few surprises on the system size:

The process takes a huge amount of memory. Apparently the files are created entirely in-memory, and I would frequently see report and export fails because of running out of memory. Nothing for it but to boost the memory specs.
If you look in the file storage area, you'll see lots and lots of files in lots of formats. It seems that when you generate a report, REDCap generates it in every possible format (.R, .sps, .csv, etc.) and then leaves it in the file storage area forever. I had a system run out of space due to a huge number of reports building up. It may help to periodically purge this area.

PyCap

REDCap has a REST interface, which for those of you with a life means that there's a way to "talk" to the database over the web with other programs. This is very useful for writing scripts that automatically upload or download data. You could write these scripts direct to the interface, but there are a few libraries to help you out by wrapping the complexities of the interface. R has REDCapR and redcapAPI but I work largely in Python using PyCap. It's a great module, but as always there are a few surprises.

Not py, red:

Remember, the module is called PyCap, but it is imported under the name redcap.

Import scripts:

One of the most com on uses of the REST interface is to automate the (tedious) process of uploading a very large dataset. (See "importing data".) But note that this doesn't get you around the issue of REDCap's memory limits. You may have to break up the dataset into smaller chunks. Even despite this I've often found the CPU of the REDCap server racing as it is bombarded with successive uploads. So you may want to pause between "chunks". A piece of code like this will help:

def upload_recs (proj, recs, chunk_sz, sleep=0):
   total_len = len (recs)
   for x in range (0, total_len, chunk_sz):
      start = x
      stop = min (total_len, x+chunk_sz)
      print ("Uploading records %s-%s of %s" % (start, stop-1, total_len))
      print (proj.import_records (recs[start:stop], overwrite='overwrite'))
      if sleep and (stop != total_len):
        time.sleep (sleep)

Downloading data:

Once, when using PyCap to download data from a series of REdCap databases, for some reason it keep flaking out on a particular db. This was made even stranger by the fact that it had previously worked. The error was:

...
File "/Users/pagapow/anaconda/lib/python3.4/site-packages/redcap/project.py",
   line 271, in export_records
   pl[key] = ','.join(data)
TypeError: sequence item 0: expected str instance, int found

What seemed to have happened here is that new records were added with a different naming scheme, i.e. what was KDG102301 was joined by 1510061. PyCap was forcing new identifiers to be interpreted as integers, but later expects them to be strings, tries to join them together into a single comma delimited string and fails.

The solution is to hack on PyCap, inserting a line like this around line 269 of project.py just before it does the join:

data = [str(x) for x in data]

My redcaphelper module does this for you.

Much like the upload "chunking" problem, sometimes you have to break up downloads into smaller bits to avoid overloading REDCap. Code like this will do the download in parts and join it all together:

def chunked_export (project, chunk_size=200):
   """
   Download data in chunks to avoid memory errors.
   """
   def chunks(l, n):
      """Yield successive n-sized chunks from list l"""
      for i in range (0, len(l), n):
         yield l[i:i+n]

   record_list = project.export_records(fields=[project.def_field])
   records = [r[project.def_field] for r in record_list]
   try:
      response = []
      for record_chunk in chunks(records, chunk_size):
         chunked_response = project.export_records(records=record_chunk)
         response.extend(chunked_response)
   except redcap.RedcapError:
      msg = "Chunked export failed for chunk_size={:d}".format(chunk_size)
      raise ValueError (msg)
   else:
      return response

Downloading the data dictionary:

Is a logical and useful thing to do. I make it part of my regular backup of REDCap. The data within isn't as useful if you don't have the so I the schema and layout of each database. PyCap actually captures the schema within it's Project object, although in a way that's different to the data dictionary layout. Use a bit of script like this to record it:

csv_txt = proj.export_metadata (format='csv')
csv_rdr = csv.DictReader (io.StringIO (csv_txt))
csv_recs = [r for r in csv_rdr]
SCHEMA_FLD_ORDER = [
  'field_name',
  'form_name',
  'section_header',
  'field_type',
  'field_label',
  'select_choices_or_calculations',
  'field_note',
  'text_validation_type_or_show_slider_number',
  'text_validation_min',
  'text_validation_max',
  'identifier',
  'branching_logic',
  'required_field',
  'custom_alignment',
  'question_number',
  'matrix_group_name',
  'matrix_ranking',
]
out_hndl = open ('datadict.csv', 'w')
wrtr = csv.DictWriter (out_hndl, fieldnames=SCHEMA_FLD_ORDER)
wrtr.writerows (csv_recs)

S3 storage

Good news! You can use Amazon (AWS) to store all the files generated and uploaded to REDCap, allowing you to load balance REDCap across multiple machines and abstract the service to things like Beanstalk or Docker images. Bad news! It's not as simple as it seems.

The first (and easy) step is that you have to create an S3 bucket. Then you have to create a user - with access key and secret key - and give them access to that bucket. This is where it gets tricky: the permissions needed by REDCap are a little exotic. Here's the security policy I used, where redcap-file-storage is the name of the bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::redcap-file-storage"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::redcap-file-storage/*"
            ]
        }
    ]
}

So note: the user list the bucket, they can put-get-delete objects in the bucket and they can put ACLs on objects in the bucket. (This last one gave me the greatest trouble - it's not obvious unless you read the REDCap code.) (Also note: the version field in AWS policies is under-explained in documentation. It's the version number for the format, and so it is fixed. It cannot be any other value.)

Tools and code like boto and s3cmd are useful for debugging S3 connectivity.

References

REDCap
PyCap