The End
Gold Standard for RDM
A well-described data package that can be passed to another researcher in the same discipline and meaningfully reused (reproduced, replicated, reanalyzed) without additional communication.
The Data Package
- Raw data
- Processed, cleaned data
- Metadata, documentation, description
- Code, scripts, analysis
- README, file manifest, codebooks
. . . with human-readable file names,
in an organized file structure,
in open, long-lived file formats.
Open Data and Open Science
Optimism that:
- data sharing will improve knowledge production and discovery
- data interoperability will lead to stronger knowledge claims (e.g., climate science)
- data openness will encourage the discovery of errors and discourage fraud
- data sharing will promote reproduction and replication in science
Incentives for Good Data Practice
RDM is a gift to your future self
- Return to a project easily after time off or a set-back
- Avoid catastrophic data loss
- Improve collaborations
- Write a DMP quickly and easily
- Stay ready for sharing and archiving
- Worry-free compliance with funder and campus policy
- Safeguard against data and context loss due to team changes
Data curation
Managing data through the entire research lifecycle
Your Data Management Plan
Your DMP should demonstrate not just that you can keep data safe from loss, keep confidential data secure, and share data on request, but that you are equipped to proactively prepare your data for archiving and sharing (via deposit) at the end of the grant.
The Means
What is Data?
- Observations of …
- Evidence for …
Metadata
They say metadata is a love note to the future.
I say without metadata, there's no data.
Science Friction
Just as with data themselves, creating, handling, and managing metadata products always exacts a cost in time, energy, and attention: metadata friction (Edwards et al. 2011).
What's Your Data?
- Primary data your project will produce
- Secondary data or sources your project may use
- Code, markup, protocols, procedures, instruments
- Metadata
- Publications for lit review or background
How Valuable Is Your Data?
- During the project
- At project's end
- 10, 15, or 20 years later?
Where do you get your data?
- What software, instruments, or machines generate or collect your data?
- What software do you use to clean, analyze, transform, store, or extract your data?
- Services or devices that record, collect, or transcribe?
- What touches your data?
- Data from other sources?
Who can touch your data?
- Who has day-to-day responsibility for the data?
- Who has ultimate responsibility for the data?
- Who is on the project team and/or can access the data?
Special Handling
- Is your data about people?
- Does your data have privacy or confidentiality considerations?
- What did you tell IRB about data sharing?
- Data Privacy Lab: https://aboutmyinfo.org/
NB: Archive your consent forms forever.
Your data as computer data
- How many files?
- What file types?
- How much space on disk?
Who has it now?!?
Where is your data now?
and where is it backed up?
Your data as computer data, redux
- File naming convention
- Folder structure
- Version control
File naming
- Memorable key words for searching
- Dates and alphabetical order for sorting
- Distinguish similar items
- Avoid spaces and keep names short
- Human-readable
- Consistency is key!
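The naming rules above can be enforced with a small helper. This is a minimal sketch, not a standard: the project nickname, the `YYYY-MM-DD_project_description_vNN.ext` pattern, and the example values are all hypothetical.

```python
from datetime import date

def make_filename(project, description, version, ext):
    """Build a sortable, human-readable file name:
    YYYY-MM-DD_project_description_vNN.ext (no spaces).
    ISO dates at the front make files sort chronologically."""
    stamp = date.today().isoformat()
    slug = description.lower().replace(" ", "-")  # no spaces, searchable key words
    return f"{stamp}_{project}_{slug}_v{version:02d}.{ext}"

# Hypothetical example: raw survey data for a project nicknamed "pollinators"
print(make_filename("pollinators", "survey raw", 1, "csv"))
```

A shared helper like this is one way to get the consistency the slide calls for: everyone on the team produces identical names for identical things.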
Folder structure
- Project
- Data
- Lit
- Drafts
- Posted
- Templates
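The skeleton above can be scaffolded in a couple of lines. Treating Data, Lit, Drafts, Posted, and Templates as siblings under one project root is an assumption (the slide does not show nesting); adjust to suit your project.

```python
from pathlib import Path

# Create the slide's folder skeleton under a project root.
# The flat sibling layout is an assumed reading of the slide.
def scaffold(root):
    for sub in ["data", "lit", "drafts", "posted", "templates"]:
        Path(root, sub).mkdir(parents=True, exist_ok=True)
    return sorted(p.as_posix() for p in Path(root).rglob("*"))
```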
Version control
- Software
- File naming
- Workflows
Data documentation
- Codebook
- Data dictionary
- README
- Electronic lab notebook
- Research diary or log
- Other metadata
Documentation Resources
- Colectica for Excel
- Sumatra for Python
- Built-in functions (Stata, SPSS)
- Scott Long, The Workflow of Data Analysis
Metadata
- All data about data
- Documentation, codebooks, README, etc.
- Metadata schemas
- Descriptions at variable or dataset level
- Automatically generated or applied
- Powers discovery tools and human recognition
- Packaged with data, entered at upload
Metadata schemas
- Lists of needed attributes
- Describes data, powers searches (see FAIR)
- Typically entered in a form
- But encoded in XML, HTML, JSON, MARC
- Generalist schemas like Dublin Core or LCSH
- Disciplinary schemas like EML
- Often specifies a thesaurus, like Getty or MeSH
- If needed, librarians/curators can help
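A schema record is usually just a set of named attributes. The sketch below uses real Dublin Core element names (title, creator, subject, …) serialized as JSON, but every value is invented for illustration, including the placeholder identifier.

```python
import json

# A minimal Dublin Core-style description for a hypothetical dataset.
# Element names are real Dublin Core terms; the values are made up.
record = {
    "title": "Pollinator survey data, 2023",
    "creator": "Lastname, Firstname",
    "subject": ["pollinators", "ecology"],  # terms could come from a thesaurus
    "description": "Raw and cleaned field observations from a sample project.",
    "date": "2023-06-01",
    "type": "Dataset",
    "format": "text/csv",
    "identifier": "doi:10.xxxx/xxxxx",      # placeholder, not a real DOI
    "rights": "CC BY 4.0",
}
print(json.dumps(record, indent=2))
```

In practice a repository's deposit form collects these fields and encodes them for you; seeing the underlying structure helps explain what the form is asking for.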
Metadata: variable level
- Codebook documentation
- Column names, variable definitions
- Measurement units, decimal places
- Expected values
- How is missing data encoded?
- Anything else to explain variables, codes
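Variable-level documentation can double as a validation tool. This is a sketch with hypothetical variable names, ranges, and missing-data codes; the point is that units, expected values, and missing codes live in one machine-readable place.

```python
# A codebook as a dict: label, type, expected range, and missing-data code
# per variable. All names and codes here are hypothetical.
codebook = {
    "age": {"label": "Respondent age in years", "type": "int",
            "range": (18, 99), "missing": -9},
    "income": {"label": "Household income, USD, 0 decimal places",
               "type": "int", "range": (0, 10_000_000), "missing": -9},
}

def check(value, var):
    """True if a value is in range or is the documented missing code."""
    meta = codebook[var]
    lo, hi = meta["range"]
    return value == meta["missing"] or lo <= value <= hi

print(check(-9, "age"))  # prints True: -9 is the documented missing code
```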
Identity Management
Claim your work and disambiguate from similar names
Zip, Compress, Archive
- .tar .zip .rar
- Archive Utility (Mac)
- 7zip
- lossless file compression
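Bundling a finished package is a one-function job in most languages. A minimal sketch using Python's standard-library `zipfile`; the directory and archive names are hypothetical.

```python
import zipfile
from pathlib import Path

# Bundle a data package directory into one lossless .zip archive,
# preserving the folder structure via paths relative to the package root.
def pack(package_dir, archive="data-package.zip"):
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in Path(package_dir).rglob("*"):
            if path.is_file():
                zf.write(path, path.relative_to(package_dir))
    return archive
```

DEFLATE compression is lossless, so the extracted files are byte-for-byte identical to the originals.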
YOU DID IT!
You did everything right and have a neat data package. What next?
Sharing
Sharing
- Who (are you willing to share with?)
- What (portions of your dataset will you share?)
- How (will you make sharing frictionless?)
- Where (will you put data to share or archive?)
- When (will you share it?)
The FAIR Principles
- Findable: metadata, persistent ID, indexing
- Accessible: retrievable, open protocols and metadata
- Interoperable: metadata, vocabularies
- Reusable: license, provenance, and disciplinary metadata
The Repository Question
- Disciplinary
- Multidisciplinary/General
- Institutional
- Associated costs
Repository requirements
- Self deposit and self curation
- Mediated deposit and full curation
- Data
  - Cleaned data with analysis variables
- Project documents
  - Syntax, code, scripts for transformed variables
  - Variable labels and codes, record counts, missing data
  - Documentation, including questionnaires, instruments, etc.
  - Codebooks, glossaries, technical documentation, etc.
Resources and tools
Questions?
Scout Calvert
calvert4@msu.edu