Principle 10 Use Appropriate and Tidy Data

Use the right data structures for the job. Programming languages offer many different ways to work with the same data. Using the right one will make a task easier, and decrease your chance of getting it wrong.

You Must - Know what 'Tidy Data' is, and understand why it is valuable.

You Should - Be familiar with the data types and structures available to you and ensure that you use the right ones.

You Could - Think about relationships between datasets, design schemas and store data in an efficient way.

Related Areas: Sensible Defaults

10.1 Tidy Data?

A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation:

  • A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units.
  • An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

The majority of data we work with comes in rectangles. For this data to be tidy, ensure that:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

For more see the section on Tidy Data in R for Data Science or the original paper.

Use tidy data structures in your work. Convert incoming data into tidy format as early as possible. Any output data that may be used in other projects should be provided in tidy format, in addition to any other required formats.
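
As a rough illustration of converting incoming data to tidy format, the sketch below reshapes a small "wide" table (one column per year) into one row per observation. It uses only the Python standard library. The data, the column names and the to_tidy helper are all hypothetical and chosen purely for illustration.

```python
# Hypothetical "wide" input: one row per person, one column per year.
# The year is a variable, but here it is hidden in the column names.
wide_rows = [
    {"name": "Ana", "height_2020": 150.0, "height_2021": 153.5},
    {"name": "Ben", "height_2020": 142.0, "height_2021": 147.0},
]


def to_tidy(rows, id_key, prefix, value_key):
    """Reshape wide rows so that each output row is one observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix):
                # Recover the variable stored in the column name.
                year = int(column[len(prefix):])
                tidy.append({id_key: row[id_key], "year": year, value_key: value})
    return tidy


# Each variable (name, year, height) is now a column,
# and each observation (one person in one year) is a row.
tidy_rows = to_tidy(wide_rows, id_key="name", prefix="height_", value_key="height")
for tidy_row in tidy_rows:
    print(tidy_row)
```

In practice you would more likely use an existing tool for this reshaping, such as pivot_longer in the tidyr package for R or DataFrame.melt in pandas for Python.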

10.2 Data Types and Structures

Data types are the basic units your language uses to store data, such as integers, doubles, strings and logical values. Data structures hold multiple items of data together. Typically you will be working with data frames, arrays, matrices or lists.

Different types and structures are used for different things, and have different capabilities. To be effective, know about the data types and structures available to you and use the right ones for the job!
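
A minimal sketch of why this matters, written in Python with made-up data: the same characters behave differently depending on their type, and the same values support different operations depending on the structure that holds them. The variable names and site names are hypothetical.

```python
from collections import Counter

# Data types: "10" and 10 look alike but behave differently.
as_text = "10" + "5"              # strings are concatenated
as_number = int("10") + int("5")  # integers are added

# Data structures: hypothetical list of site visits, in the order they happened.
site_visits = ["leeds", "york", "leeds", "hull", "york", "leeds"]

first_visit = site_visits[0]         # list: keeps order and duplicates
unique_sites = set(site_visits)      # set: unique values, quick membership tests
visit_counts = Counter(site_visits)  # dict-like: maps each site to its count

print(as_text, as_number)
print(first_visit)
print(sorted(unique_sites))
print(visit_counts["leeds"])
```

Choosing the set for "which sites were visited?" and the Counter for "how often?" avoids hand-written loops, which is where mistakes tend to creep in.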

10.2.1 R

The R Programming for Data Science book has a good section on the 'Nuts and Bolts' of R which covers types and structures. For more about the different data structures a good resource is the Advanced R book.

10.2.2 Python

For a list of Python datatypes see the:

10.3 Schema

The R for Data Science book has a nice section on relational data.
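
As a hedged sketch of what a simple schema can look like, the example below uses Python's built-in sqlite3 module to store two types of observational unit in two tables, linked by a key. The table names, column names and data are hypothetical. It is one possible design, not a recommended standard.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the relationship

# One table per type of observational unit (hypothetical schema).
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    person_id      INTEGER NOT NULL REFERENCES person(person_id),
    year           INTEGER NOT NULL,
    height         REAL NOT NULL
);
""")

conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?, ?)",
    [(10, 1, 2020, 150.0), (11, 1, 2021, 153.5), (12, 2, 2020, 142.0)],
)

# Join the two tables on their shared key to count measurements per person.
rows = conn.execute("""
SELECT p.name, COUNT(*) AS n_measurements
FROM person AS p
JOIN measurement AS m ON m.person_id = p.person_id
GROUP BY p.person_id
ORDER BY p.name
""").fetchall()
print(rows)
conn.close()
```

Each person's name is stored once rather than repeated on every measurement row, so a correction to a name only has to be made in one place.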