retype - auto data type conversion

October 19, 2018    hablar r retype

Can the data be simpler?

retype quick-starts your analysis

Getting data into R can be a hassle. And once you do, it often has incorrect data types/classes. For instance, it is not uncommon for numeric variables or dates to be classed as characters.

Data conversion is cumbersome, and small coding mistakes can produce large issues. The hablar package helps you correct all data types directly after you import the data into R, so that you can avoid dangerous operations at later stages!

What does retype do?

retype provides an easy approach for quick and dirty data type conversion. It follows a strict simplification hierarchy for each column of your data frame. It only converts a column if it can assume that no important information is lost in the process. For example, the character vector c("1", "2") should rather be an integer vector. Similarly, the character "2010-06-04" should be a date. Factors have advantages, but they are never the simplest solution, so they are always converted to character, at least.

Usage

retype(x, ...)

where x is a data frame and ... are the names of the columns you want to apply retype to. x can also be a single vector.
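As a minimal sketch of the column selection described above, the following applies retype to a single column of a small data frame. The data frame `raw` and its column names are made up for illustration; if the selection works as described, only `id` should be simplified while `code` is left as it was.

```r
library(hablar)

# Hypothetical example data: both columns are stored as character
raw <- data.frame(
  id   = c("1", "2", "3"),
  code = c("10", "20", "30"),
  stringsAsFactors = FALSE
)

# Apply retype to the id column only; code is not passed to retype
res <- retype(raw, id)

# Inspect the resulting class of each column
sapply(res, class)
```

Calling `retype(raw)` without any column names would instead apply the conversion to every column.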

Simple example: numeric

x <- as.numeric(3)
retype(x)
#> [1] 3
class(retype(x))
#> [1] "integer"

Simple example: character

x <- as.character("2017-03-02")
retype(x)
#> [1] "2017-03-02"
class(retype(x))
#> [1] "Date"

Simple example: character that stays character

x <- as.character(c("3,56", "0,78"))
retype(x)
#> [1] "3,56" "0,78"
class(retype(x))
#> [1] "character"

Simple example: factor

x <- as.factor(c(3, 4))
retype(x)
#> [1] 3 4
class(retype(x))
#> [1] "integer"

The simplification hierarchy

Some things are simpler than others

retype uses a procedure to determine which data type is the simplest, without losing any vital information in your data.

  • The first thing to know about retype is that it always converts factors to character.

  • The second thing to know is that all logical columns are converted to integers.

  • Thirdly, complex and list columns are left unchanged.

  • From there it will test if the data could be coded as numeric. If true it converts the column to numeric.

  • If it is numeric it tests if it could be an integer instead. If true, it converts the column to integer.

  • If it is a character it tests if it could be a date time column. If true, it converts it to a date time (POSIXct) column.

  • If it is a date time column it tests if it could be a date. If true, it converts it to a date column.
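The steps above can be sketched in base R for a single vector. This is only an illustration of the idea, not hablar's implementation: `simplify_sketch` is a hypothetical helper, it only recognises dates written as YYYY-MM-DD, and it skips the intermediate date time (POSIXct) step by going straight from character to Date.

```r
# Hypothetical helper that mimics the hierarchy with base R only
simplify_sketch <- function(x) {
  # Factors are never the simplest type: start from character
  if (is.factor(x)) x <- as.character(x)

  # Logical vectors become integers
  if (is.logical(x)) return(as.integer(x))

  # Complex and list vectors are left unchanged
  if (is.list(x) || is.complex(x)) return(x)

  if (is.character(x)) {
    num <- suppressWarnings(as.numeric(x))
    if (all(is.na(num) == is.na(x))) {
      # Every non-missing value parsed as a number
      x <- num
    } else if (all(grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x[!is.na(x)]))) {
      # Every non-missing value looks like YYYY-MM-DD
      d <- as.Date(x, format = "%Y-%m-%d")
      if (all(is.na(d) == is.na(x))) x <- d
    }
  }

  # Numeric vectors with only whole numbers (within integer range) become integers
  if (is.numeric(x) && !is.integer(x)) {
    whole <- is.na(x) | (x == round(x) & abs(x) <= .Machine$integer.max)
    if (all(whole)) x <- as.integer(x)
  }

  x
}

class(simplify_sketch(c("1", "2")))
class(simplify_sketch("2010-06-04"))
class(simplify_sketch(c("3,56", "0,78")))
class(simplify_sketch(factor(c(3, 4))))
```

Each test only accepts a conversion when no non-missing value would turn into NA, which is one simple way to approximate "no important information is lost".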

A visualization of the hierarchy

The procedure above can be described more intuitively as a diagram. Each arrow represents a test of whether a column can be converted to another type without losing information in your data. The procedure continues until the column cannot be simplified further.

Example on a data frame

Examine the dataset starwars from the package dplyr. First, we use convert to change some columns to new data types.

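library(dplyr)   # provides starwars, select and %>%
library(hablar)  # provides convert, retype, fct and chr

df <- starwars %>% 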
  select(1:4) %>% 
  convert(fct(name),
           chr(height:mass),
           fct(hair_color)) %>% 
  print()
#> # A tibble: 87 x 4
#>    name               height mass  hair_color   
#>    <fct>              <chr>  <chr> <fct>        
#>  1 Luke Skywalker     172    77    blond        
#>  2 C-3PO              167    75    <NA>         
#>  3 R2-D2              96     32    <NA>         
#>  4 Darth Vader        202    136   none         
#>  5 Leia Organa        150    49    brown        
#>  6 Owen Lars          178    120   brown, grey  
#>  7 Beru Whitesun lars 165    75    brown        
#>  8 R5-D4              97     32    <NA>         
#>  9 Biggs Darklighter  183    84    black        
#> 10 Obi-Wan Kenobi     182    77    auburn, white
#> # … with 77 more rows

We then apply retype on df:

df %>% 
  retype()
#> # A tibble: 87 x 4
#>    name               height  mass hair_color   
#>    <chr>               <int> <dbl> <chr>        
#>  1 Luke Skywalker        172    77 blond        
#>  2 C-3PO                 167    75 <NA>         
#>  3 R2-D2                  96    32 <NA>         
#>  4 Darth Vader           202   136 none         
#>  5 Leia Organa           150    49 brown        
#>  6 Owen Lars             178   120 brown, grey  
#>  7 Beru Whitesun lars    165    75 brown        
#>  8 R5-D4                  97    32 <NA>         
#>  9 Biggs Darklighter     183    84 black        
#> 10 Obi-Wan Kenobi        182    77 auburn, white
#> # … with 77 more rows

retype correctly guessed that height should preferably be an integer vector and that mass works better as a numeric column. The factors were converted to character columns.

Final notes

retype in production code

Never use retype when you need your scripts to work in exactly the same way the next time they run. retype may change over time, it could guess wrong, and your data may change. Use hablar::convert instead, where you explicitly state which data type each column should have.
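A minimal sketch of the explicit alternative, using a made-up data frame `raw` with hypothetical column names. Each column is assigned its target type by name, so the result does not depend on what retype would guess.

```r
library(hablar)

# Hypothetical example data: everything arrives as character
raw <- data.frame(
  id    = c("1", "2"),
  price = c("3.5", "4"),
  label = c("a", "b"),
  stringsAsFactors = FALSE
)

# State the intended type of every column explicitly
clean <- convert(raw,
                 int(id),
                 num(price),
                 fct(label))

sapply(clean, class)
```

If a later data delivery contains a value that does not fit the declared type, the problem surfaces at this step instead of silently producing a different column type.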


