s - short and simple summarizing

October 20, 2018    hablar r s

Overview

Description

The s function is a simple function that helps you get intuitive results when summarizing data. It is made to be used in conjuction with summarize functions, for example min , sum and mean. s takes a vector and mutates it in the following ways:

  • It replaces all non-rational numbers from numeric vectors and replace them with NA. Non-rational numbers are Inf, -Inf and NaN.

  • It removes NA from the vector by default

  • If the vector has length zero or only consists of NA it returns a single NA.

Usage

s(..., ignore_na = T)

where … is one or more vector(s). If missing values should not be omitted use ignore_na = F.

Simple examples

Removing NA:

x <- c(NA, 1, 2)
s(x)
#> [1] 1 2

Replacing non-rational numbers with NA and then removes NA:

x <- c(NaN, 1, Inf)
s(x)
#> [1] 1

Empty vectors return a single NA:

x <- c()
s(x)
#> [1] NA

In conjuction with a summary function:

x <- c(NaN, Inf, 3, 4)
median(s(x))
#> [1] 3.5

The problems it solvs

Principle of least astonishment

All programming languages have their special cases when you get non-intuitive results that you did not expect. This is also true for R. The s-function provides intuitive outcomes of some of the most basic commands in R. In the next parts of the vignette some problems it solves are explained in greater detail.

Missing values

When learning R users might be surprised when creating suprised when using simple summary function. A summary function is a function that takes a vector and returns a single one value. For example, min(x) , sum(x) and mean(x). A simple example:

x <- c(1, 2, 3, 4, 5)
sum(x)
#> [1] 15

In this example the output of sum() was, which is expected since all entries in x sums to 15. However, in more messy data, the output is oftentimes less intuitive. New users to R might be confused that the next example results in NA (a missing value):

x <- c(1, 2, 3, NA, 4)
mean(x)
#> [1] NA

Since the vector above have an a missing value R does not know how to find the mean of the vector. The missing value could be anything, and thus R thus returns the output NA. However, since missing values are common when working with real data, it is also a common practise to ignore missing values. Usually the user tells R to ignore the missing value and return the mean of the vector that have values that could be averaged. The error in the previous example could be fixed by adding na.rm = TRUE that drops all missing values before calculating the mean:

x <- c(1, 2, 3, NA, 4)
mean(x, na.rm = TRUE)
#> [1] 2.5

Generally, R is strict about missing values so that you do not miss them, which often is helpful rather than harsh! However, often the programmer want R to return a ‘real’ value from the data, if there is one, even if it ignores missing values.

The s function helps you with this. Since it by default removes missing values you can simply enter:

x <- c(1, 2, 3, NA, 4)
mean(s(x))
#> [1] 2.5

Sometimes R removes too much

Adding an argument to remove all missing is common practise when summarizing data. However, it is not uncommon that some vectors only have missing values. Imagine an example where Amanda, David and Viktor sold sodas by the beach for three days. If someone did not show up they get a missing value.

#> # A tibble: 9 x 3
#>     day name   sold_sodas
#>   <dbl> <chr>       <dbl>
#> 1     1 Amanda          3
#> 2     2 Amanda         NA
#> 3     3 Amanda          8
#> 4     1 David          NA
#> 5     2 David          NA
#> 6     3 David          NA
#> 7     1 Viktor          2
#> 8     2 Viktor          1
#> 9     3 Viktor          4

Now we want to see the maximum number of sodas each person sold on a single day. The above data frame if saved as df.

df %>% 
  group_by(name) %>% 
  summarize(n_sodas_best_day = max(sold_sodas, na.rm = T))
#> # A tibble: 3 x 2
#>   name   n_sodas_best_day
#>   <chr>             <dbl>
#> 1 Amanda                8
#> 2 David              -Inf
#> 3 Viktor                4

Amanda sold the most sodas in a single day. However, David who was absent on all days, got the output -Inf. This means that negative infinity was the number of sodas he sold during his most productive day. That is astonishing! One would perhaps think that the more intuitive output would be NA.

The reason for result is that we told R to remove all missing values before calculating the maximal value. It is equivalent to:

x <- c()
max(x)
#> [1] -Inf

We could try to remove the na.rm = TRUE argument from max().

df %>% 
  group_by(name) %>% 
  summarize(n_sodas_best_day = max(sold_sodas))
#> # A tibble: 3 x 2
#>   name   n_sodas_best_day
#>   <chr>             <dbl>
#> 1 Amanda               NA
#> 2 David                NA
#> 3 Viktor                4

Suddenly R tells us that Viktor had the best day and Amanda, who was absent the second day, got NA because R doesn’t not know how to find the maximum value. However, David also got NA this time, which makes sense.

Sometimes, calculating simple descriptive statistics can be a cumbersome task. The s function is there to support you! Since it returns NA if the vector is empty we get:

df %>% 
  group_by(name) %>% 
  summarize(n_sodas_best_day = max(s(sold_sodas)))
#> # A tibble: 3 x 2
#>   name   n_sodas_best_day
#>   <chr>             <dbl>
#> 1 Amanda                8
#> 2 David                NA
#> 3 Viktor                4

Getting answers when there is none

Another astonishing result one might encounter occurs when R tries to return a value when there is none. Take this extract df from the starwars dataset from the R package dplyr.

df %>% head(10)
#> # A tibble: 10 x 4
#>    name               homeworld species height
#>    <chr>              <chr>     <chr>    <int>
#>  1 Luke Skywalker     Tatooine  Human      172
#>  2 C-3PO              Tatooine  Droid      167
#>  3 R2-D2              Naboo     Droid       96
#>  4 Darth Vader        Tatooine  Human      202
#>  5 Leia Organa        Alderaan  Human      150
#>  6 Owen Lars          Tatooine  Human      178
#>  7 Beru Whitesun lars Tatooine  Human      165
#>  8 R5-D4              Tatooine  Droid       97
#>  9 Biggs Darklighter  Tatooine  Human      183
#> 10 Obi-Wan Kenobi     Stewjon   Human      182

Say that we want to calculate find the height of the tallest human from each homeworld. For precautionary reasons, we drop all rows with missing values from the height column so that we do not get the same problem as before.

df %>% 
  filter(!is.na(height)) %>% 
  group_by(homeworld) %>% 
  summarize(tallest_human = max(height[species == "Human"]))
#> # A tibble: 49 x 2
#>    homeworld      tallest_human
#>    <chr>                  <dbl>
#>  1 <NA>                      NA
#>  2 Alderaan                 191
#>  3 Aleen Minor             -Inf
#>  4 Bespin                   175
#>  5 Bestine IV               180
#>  6 Cato Neimoidia          -Inf
#>  7 Cerea                   -Inf
#>  8 Champala                -Inf
#>  9 Chandrila                150
#> 10 Concord Dawn             183
#> # … with 39 more rows

We got negative infinity -Inf again. How could this be?

This is because some homeworld have no humans, e.g. Cerea. R tries to calculate the maximum value of nothing. The s function can help you out! Since it returns NA if the vector is empty we get:

df %>% 
  filter(!is.na(height)) %>% 
  group_by(homeworld) %>% 
  summarize(tallest_human = max(s(height[species == "Human"])))
#> # A tibble: 49 x 2
#>    homeworld      tallest_human
#>    <chr>                  <int>
#>  1 <NA>                     193
#>  2 Alderaan                 191
#>  3 Aleen Minor               NA
#>  4 Bespin                   175
#>  5 Bestine IV               180
#>  6 Cato Neimoidia            NA
#>  7 Cerea                     NA
#>  8 Champala                  NA
#>  9 Chandrila                150
#> 10 Concord Dawn             183
#> # … with 39 more rows

Now we get missing values for the homeworlds that does not have any humans. Makes sense.

Summarizing weird numbers

Numerical vectors in R can include more than numbers and missing values NA. They can also include infinite numbers Inf and -Inf as shown in the examples above. Furthermore, numerical vectors can include NaN‘s which means ’not-a-number’. If the data frame you are using have NaN or Inf it may cause you problems when summarizing your data. Some examples:

x <- c(NaN, 1)
min(x)
#> [1] NaN
x <- c(Inf, 3, 4)
mean(x)
#> [1] Inf
x <- c(5, -Inf, 2)
sum(x)
#> [1] -Inf

Often when you summarize vectors that have NaN or Inf you want to treat them as a missing value. Maybe they have appeared as a mistake when you accidentally divided a value by zero since 1/0 = Inf in R. The s function solves this for you be replacing them with NA.

x <- c(NaN, 1)
min(s(x))
#> [1] 1
x <- c(Inf, 3, 4)
mean(s(x))
#> [1] 3.5
x <- c(5, -Inf, 2)
sum(s(x))
#> [1] 7

Wrappers for s and summary functions

If things get too messy with an extra function you might prefer the wrapper functions of s. All major summary functions have an s wrapped alternative in hablar. These are accessed by adding an underscore to the name of the summary function, i.e. min_(x) and is equal to min(s(x)). Repeating the previous exercises using wrappers for s would look like:

x <- c(NaN, 1)
min_(x)
#> [1] 1
x <- c(Inf, 3, 4)
mean_(x)
#> [1] 3.5
x <- c(5, -Inf, 2)
sum_(x)
#> [1] 7

Final note

To summarize, s can help you to get results when you summarize your data, if there is an sensible answer in the vector. If not, you will get NA.



comments powered by Disqus