A Crash Course

In Research Data Management

Scout Calvert, PhD
Data Librarian
calvert4@msu.edu
http://bit.ly/RDM-Crash-Course

Map of Talk

  • The End
  • The Means
  • Sharing
  • Resources and tools
  • Questions

Attend a workshop

Crash Course in Research Data Management
March 25

Writing Your Data Management Plan
April 1

Prepare Your Data for Upload to a General Repository
April 8

Worksheet

  • no full sentences
  • quick jots
  • crib sheet for later
  • blank spots are okay

The End

Gold Standard for RDM

A well-described data package that can be passed to another researcher in the same discipline and meaningfully reused (reproduced, replicated, reanalyzed) without additional communication.

The Data Package

  • Raw data
  • Processed, cleaned data
  • Metadata, documentation, description
  • Code, scripts, analysis
  • README, file manifest, codebooks

  • . . . with human-readable file names,
    in an organized file structure,
    in open, long-lived file formats.
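
One way to produce the file manifest mentioned above is sketched here. This is only an illustration, not a required tool: the function names, the MANIFEST.tsv file name, and the choice of SHA-256 checksums are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir: Path) -> list[tuple[str, int, str]]:
    """List (relative path, size in bytes, checksum) for every file in the package."""
    rows = []
    for path in package_dir.rglob("*"):
        if path.is_file():
            rel = path.relative_to(package_dir).as_posix()
            rows.append((rel, path.stat().st_size, sha256_of(path)))
    rows.sort()  # sort by relative path so the manifest is stable between runs
    return rows


def write_manifest(package_dir: Path, manifest_name: str = "MANIFEST.tsv") -> Path:
    """Write the manifest as a tab-separated file inside the package (hypothetical name)."""
    rows = [row for row in build_manifest(package_dir) if row[0] != manifest_name]
    lines = ["path\tbytes\tsha256"] + [f"{p}\t{s}\t{h}" for p, s, h in rows]
    out = package_dir / manifest_name
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

A recipient can rerun build_manifest on their copy and compare checksums to confirm that nothing was lost or altered in transit.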

Your data is endangered

Open Data and Open Science

Optimism that:

  • data sharing will improve knowledge production and discovery
  • data interoperability will lead to stronger knowledge claims (e.g., climate science)
  • data openness will encourage the discovery of errors and discourage fraud
  • data sharing will promote reproduction and replication in science

Incentives for Good Data Practice

RDM is a gift to your future self

  • Return to a project easily after time off or a set-back
  • Avoid catastrophic data loss
  • Improve collaborations
  • Write a DMP quickly and easily
  • Stay ready for sharing and archiving
  • Comply with funder and campus policy worry-free
  • Safeguard against data and context loss due to team changes

Data curation

Managing data through the entire research lifecycle

Your Data Management Plan

Your DMP should demonstrate not just that you can keep data safe from loss, keep confidential data secure, or share data on request, but that you are equipped to proactively prepare your data for archiving and sharing (via deposit) at the end of the grant.

The Means

What is Data?

  • Observations of
  • Evidence for

Metadata

They say metadata is a love note to the future.
I say without metadata, there's no data.

Science Friction

Just as with data themselves, creating, handling, and managing metadata products always exacts a cost in time, energy, and attention: metadata friction (Edwards et al. 2011).

What's Your Data?

  • Primary data your project will produce
  • Secondary data or sources your project may use
  • Code, markup, protocols, procedures, instruments
  • Metadata
  • Publications for lit review or background

How Valuable Is Your Data?

  • During the project
  • At project's end
  • 10, 15, or 20 years later?

Where do you get your data?

  • What software, instruments, or machines generate or collect your data?
  • What software do you use to clean, analyze, transform, store, or extract your data?
  • Services or devices that record, collect, or transcribe?
  • What touches your data?
  • Data from other sources?

Who can touch your data?

  • Who has day-to-day responsibility for the data?
  • Who has ultimate responsibility for the data?
  • Who is on the project team and/or can access the data?

Special Handling

  • Is your data about people?
  • Does your data have privacy or confidentiality considerations?
  • What did you tell your IRB about data sharing?
  • Data Privacy Lab: https://aboutmyinfo.org/

NB: Archive your consent forms forever.

Your data as computer data

  • How many files?
  • What file types?
  • How much space on disk?

File types

Who has it now?!?

Where is your data now?

and where is it backed up to?

Backup Principles

Your data as computer data, redux

  • File naming convention
  • Folder structure
  • Version control

File naming

  • Memorable key words for searching
  • Dates and alphabet for sorting
  • Distinguish similar items
  • Avoid spaces and keep shortish
  • Human-readable
  • Consistency is key!
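
The principles above can be turned into a small helper so names are built the same way every time. The pattern used here (ISO date, project, description, two-digit version) is one hypothetical convention chosen for illustration, not a standard; adapt the pieces to your own project.

```python
import re
from datetime import date

# Assumed convention: YYYY-MM-DD_project_description_vNN.ext
NAME_PATTERN = re.compile(
    r"^\d{4}-\d{2}-\d{2}_[a-z0-9-]+_[a-z0-9-]+_v\d{2}\.[a-z0-9]+$"
)


def slug(text: str) -> str:
    """Lower-case the text and replace spaces and punctuation with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def make_filename(project: str, description: str, when: date,
                  version: int, extension: str) -> str:
    """Build a sortable, searchable, space-free file name."""
    ext = extension.lstrip(".").lower()
    return f"{when.isoformat()}_{slug(project)}_{slug(description)}_v{version:02d}.{ext}"


def follows_convention(name: str) -> bool:
    """Check whether an existing file name matches the assumed convention."""
    return NAME_PATTERN.match(name) is not None


if __name__ == "__main__":
    print(make_filename("Lake Survey", "water temp raw", date(2020, 1, 15), 1, ".CSV"))
```

Because the date comes first in ISO format, an alphabetical sort of these names is also a chronological sort.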

Folder structure

  • Project
  • Data
  • Lit
  • Drafts
  • Posted
  • Templates
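
A folder layout like the one above can be created consistently for every new project with a few lines of code. This sketch assumes that Project is the top-level folder and that the remaining items are its subfolders; the function name and the README.txt placeholder are likewise assumptions for illustration.

```python
from pathlib import Path

# Subfolders taken from the list above; "Project" is assumed to be the root.
SUBFOLDERS = ["Data", "Lit", "Drafts", "Posted", "Templates"]


def create_project_skeleton(parent: Path, project_name: str) -> Path:
    """Create the project root and its standard subfolders, if not already present."""
    root = parent / project_name
    for name in SUBFOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():
        # Placeholder text only; never overwrite a README that already exists.
        readme.write_text(
            f"Project: {project_name}\nDescribe the folders and files here.\n",
            encoding="utf-8",
        )
    return root
```

The function is safe to run more than once: existing folders and an existing README are left untouched.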

Version control

  • Software
  • File naming
  • Workflows
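
When versions are tracked through file naming rather than software, a helper can pick the next version number so no one overwrites an earlier file. This is a minimal sketch under an assumed naming pattern (stem, then _vNN, then the extension); the function name is hypothetical.

```python
import re

# Assumed pattern: <stem>_v<number><extension>, e.g. analysis_v02.R
VERSION_RE = re.compile(r"^(?P<stem>.+)_v(?P<num>\d+)(?P<ext>\.[^.]+)$")


def next_version_name(existing_names: list[str], stem: str, ext: str) -> str:
    """Return the next unused versioned file name for the given stem and extension."""
    highest = 0
    for name in existing_names:
        match = VERSION_RE.match(name)
        if match and match.group("stem") == stem and match.group("ext") == ext:
            highest = max(highest, int(match.group("num")))
    return f"{stem}_v{highest + 1:02d}{ext}"
```

For anything beyond a handful of files, dedicated versioning software is usually less error-prone than naming alone.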

Versioning software

Data documentation

  • Codebook
  • Data dictionary
  • README
  • Electronic lab notebook
  • Research diary or log
  • Other metadata

Documentation Resources

  • Colectica for Excel
  • Sumatra for Python
  • Built-in functions (Stata, SPSS)
  • Scott Long, The Workflow of Data Analysis

Metadata

  • All data about data
  • Documentation, codebooks, README, etc.
  • Metadata schemas
  • Descriptions at variable or dataset level
  • Automatically generated or applied
  • Powers discovery tools and human recognition
  • Packaged with data, entered at upload

Metadata schemas

  • Lists of needed attributes
  • Describes data, powers searches (see FAIR)
  • Typically entered in a form
  • But encoded in XML, HTML, JSON, MARC
  • Generalist schemas like Dublin Core or LCSH
  • Disciplinary schemas like EML
  • Often specify a thesaurus, like Getty or MeSH
  • If needed, librarians/curators can help
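
To make the idea of a schema as a list of attributes concrete, the sketch below encodes a dataset-level record as JSON using the fifteen Dublin Core elements as keys. The record values are placeholders, and the set of required elements is an assumption for this example (Dublin Core itself makes every element optional); a real repository will have its own form and rules.

```python
import json

# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical project rule, not part of the Dublin Core standard.
REQUIRED_FOR_THIS_PROJECT = {"title", "creator", "date", "identifier", "rights"}


def check_record(record: dict) -> dict:
    """Report keys that are not Dublin Core elements and required keys left empty."""
    filled = {key for key, value in record.items() if value}
    return {
        "unknown": sorted(set(record) - DUBLIN_CORE_ELEMENTS),
        "missing": sorted(REQUIRED_FOR_THIS_PROJECT - filled),
    }


# Placeholder values for illustration only.
record = {
    "title": "Example dataset title",
    "creator": "Example Researcher",
    "date": "2020-01-15",
    "identifier": "PLACEHOLDER-IDENTIFIER",
    "rights": "PLACEHOLDER-LICENSE",
    "format": "text/csv",
}

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
    print(check_record(record))
```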

Metadata: dataset level

Metadata: variable level

  • Codebook documentation
  • Column names, variable definitions
  • Measurement units, decimal places
  • Expected values
  • How is missing data encoded?
  • Anything else to explain variables, codes
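
A codebook is easier to start when the column names and observed values are filled in automatically, leaving only definitions and units to write by hand. The sketch below reads CSV text and produces one entry per variable. The missing-data codes listed in it are assumptions for illustration; replace them with whatever your project actually uses.

```python
import csv
import io

# Assumed missing-data codes; adjust to match your own dataset.
DEFAULT_MISSING_CODES = frozenset({"", "NA", "-99"})


def codebook_skeleton(csv_text: str,
                      missing_codes=DEFAULT_MISSING_CODES,
                      max_values: int = 10) -> list[dict]:
    """Return one codebook entry per column, with blanks left for human description."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = list(reader.fieldnames or [])
    observed = {name: set() for name in columns}
    n_missing = {name: 0 for name in columns}
    for row in reader:
        for name in columns:
            value = (row.get(name) or "").strip()
            if value in missing_codes:
                n_missing[name] += 1
            else:
                observed[name].add(value)
    return [
        {
            "variable": name,
            "definition": "",  # to be written by the researcher
            "units": "",       # to be written by the researcher
            "missing_codes": sorted(missing_codes),
            "n_missing": n_missing[name],
            "example_values": sorted(observed[name])[:max_values],
        }
        for name in columns
    ]
```

The empty definition and units fields are deliberate: those are the parts only someone who knows the project can supply.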

Citation Managers

  • Zotero
  • Mendeley
  • EndNote

Identity Management

Claim your work and disambiguate from similar names

Here's how it looks:

Zip, Compress, Archive

  • .tar .zip .rar
  • Archive Utility (Mac)
  • 7-Zip
  • lossless file compression
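
Packaging can also be scripted. The sketch below uses Python's standard zipfile module to compress a folder with DEFLATE, a lossless method, and then checks that every file in the archive is byte-for-byte identical to the original. The function names are assumptions for this example, and the archive is assumed to be written outside the folder being packaged.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir: Path, archive_path: Path) -> Path:
    """Compress every file under package_dir into a .zip archive.

    archive_path should live outside package_dir so the archive
    does not try to include itself.
    """
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.rglob("*")):
            if path.is_file():
                zf.write(path, arcname=path.relative_to(package_dir).as_posix())
    return archive_path


def verify_archive(package_dir: Path, archive_path: Path) -> bool:
    """Confirm the archive is intact and matches the original files exactly."""
    with zipfile.ZipFile(archive_path) as zf:
        if zf.testzip() is not None:  # testzip returns the first corrupt member, if any
            return False
        for name in zf.namelist():
            if zf.read(name) != (package_dir / name).read_bytes():
                return False
    return True
```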

YOU DID IT!

You did everything right and have a neat data package. What next?

Sharing

  • Who (are you willing to share with?)
  • What (portions of your dataset will you share?)
  • How (will you make sharing frictionless?)
  • Where (will you put data to share or archive?)
  • When (will you share it?)

FAIR Data Principles

  • Findable: metadata, persistent ID, indexing
  • Accessible: retrievable, open protocols and metadata
  • Interoperable: metadata, vocabularies
  • Reusable: license, provenance, and disciplinary metadata

The Repository Question

  • Disciplinary
  • Multidisciplinary/General
  • Institutional
  • Associated costs

Some Repositories

Repository requirements

  • Self-deposit and self-curation
  • Mediated deposit and full curation

NACJD at ICPSR

  • Data
  • Cleaned data with analysis variables
  • Project documents
  • Syntax, code, scripts for transformed variables
  • Variable labels and codes, record counts, missing data
  • Documentation, including questionnaires, instruments, etc.
  • Codebooks, glossaries, technical documentation, etc.

Licensing

Resources and tools

More resources

Questions?


Scout Calvert
calvert4@msu.edu