The s function is a simple function that helps you get intuitive results when summarizing data. It is made to be used in conjunction with summary functions, for example min, sum and mean. s takes a vector and modifies it in the following ways:
It replaces all non-rational numbers in numeric vectors with NA. Non-rational numbers are Inf, -Inf and NaN.
It removes NA from the vector by default.
If the vector has length zero or consists only of NA, it returns a single NA.
s(..., ignore_na = T)
where ... is one or more vectors. If missing values should not be omitted, use ignore_na = FALSE.
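The three rules above can be sketched in a few lines of base R. This is an illustrative stand-in, not hablar's actual source: the name s_sketch is hypothetical, and only the behaviour described in this vignette is reproduced.

```r
# Hypothetical base-R sketch of the rules described above.
# Not hablar's implementation; s_sketch is an illustrative name.
s_sketch <- function(..., ignore_na = TRUE) {
  x <- c(...)

  # Rule 1: Inf, -Inf and NaN become NA in numeric vectors
  if (is.numeric(x)) {
    x[!is.finite(x)] <- NA
  }

  # Rule 3: empty or all-NA input collapses to a single NA
  if (length(x) == 0 || all(is.na(x))) {
    return(NA)
  }

  # Rule 2: drop NA unless the caller asks to keep it
  if (ignore_na) {
    x <- x[!is.na(x)]
  }
  x
}

s_sketch(c(NaN, 1, Inf))                  # only the 1 survives
s_sketch(c(NA, 1, 2), ignore_na = FALSE)  # NA is kept
s_sketch(c())                             # single NA
```

Checking for the empty case before dropping NA keeps the all-NA and zero-length inputs on the same code path.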
Removing NA:
x <- c(NA, 1, 2)
s(x)
#> [1] 1 2
Replacing non-rational numbers with NA and then removing NA:
x <- c(NaN, 1, Inf)
s(x)
#> [1] 1
Empty vectors return a single NA:
x <- c()
s(x)
#> [1] NA
In conjunction with a summary function:
x <- c(NaN, Inf, 3, 4)
median(s(x))
#> [1] 3.5
All programming languages have special cases where you get non-intuitive results. This is also true for R. The s function provides intuitive outcomes for some of the most basic commands in R. The next parts of the vignette explain some of the problems it solves in greater detail.
When learning R, users might be surprised when using simple summary functions. A summary function is a function that takes a vector and returns a single value, for example min(x), sum(x) and mean(x). A simple example:
x <- c(1, 2, 3, 4, 5)
sum(x)
#> [1] 15
In this example the output of sum() was 15, which is expected since all entries in x sum to 15. However, with messier data, the output is often less intuitive. New users of R might be confused that the next example results in NA (a missing value):
x <- c(1, 2, 3, NA, 4)
mean(x)
#> [1] NA
Since the vector above has a missing value, R does not know how to find its mean. The missing value could be anything, and thus R returns NA. However, since missing values are common in real data, it is also common practice to ignore them. Usually the user tells R to ignore the missing values and return the mean of the values that are present. The NA in the previous example can be avoided by adding na.rm = TRUE, which drops all missing values before calculating the mean:
x <- c(1, 2, 3, NA, 4)
mean(x, na.rm = TRUE)
#> [1] 2.5
Generally, R is strict about missing values so that you do not miss them, which is often helpful rather than harsh! However, the programmer often wants R to return a ‘real’ value from the data, if there is one, even if that means ignoring missing values.
The s function helps you with this. Since it removes missing values by default, you can simply enter:
x <- c(1, 2, 3, NA, 4)
mean(s(x))
#> [1] 2.5
Adding an argument to remove all missing values is common practice when summarizing data. However, it is not uncommon that some vectors only have missing values. Imagine an example where Amanda, David and Viktor sold sodas by the beach for three days. Anyone who did not show up gets a missing value for that day.
#> # A tibble: 9 x 3
#>     day name   sold_sodas
#>   <dbl> <chr>       <dbl>
#> 1     1 Amanda          3
#> 2     2 Amanda         NA
#> 3     3 Amanda          8
#> 4     1 David          NA
#> 5     2 David          NA
#> 6     3 David          NA
#> 7     1 Viktor          2
#> 8     2 Viktor          1
#> 9     3 Viktor          4
Now we want to see the maximum number of sodas each person sold on a single day. The data frame above is saved as df.
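For readers who want to follow along, df can be rebuilt by hand. The sketch below is an assumption about how the data could be entered, using a plain base-R data.frame rather than a tibble, so the printed header will look different from the printout above. The values are copied from that table.

```r
# Rebuild the soda data by hand (base R; the vignette's printout is a tibble).
df <- data.frame(
  day        = rep(c(1, 2, 3), times = 3),
  name       = rep(c("Amanda", "David", "Viktor"), each = 3),
  sold_sodas = c(3, NA, 8,     # Amanda missed day 2
                 NA, NA, NA,   # David never showed up
                 2, 1, 4),     # Viktor worked every day
  stringsAsFactors = FALSE
)

df
```

A data.frame works with the dplyr verbs used in this vignette just as a tibble does.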
df %>%
  group_by(name) %>%
  summarize(n_sodas_best_day = max(sold_sodas, na.rm = TRUE))
#> # A tibble: 3 x 2
#>   name   n_sodas_best_day
#>   <chr>             <dbl>
#> 1 Amanda                8
#> 2 David              -Inf
#> 3 Viktor                4
Amanda sold the most sodas in a single day. However, David, who was absent on all days, got the output -Inf. Taken literally, this means that negative infinity was the number of sodas he sold on his most productive day. That is astonishing! One would perhaps think that the more intuitive output would be NA.
The reason for this result is that we told R to remove all missing values before calculating the maximum, which left nothing for David. It is equivalent to:
x <- c()
max(x)
#> [1] -Inf
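The same thing can be reproduced on David's three values alone. The sketch below also shows the kind of manual guard people often write to get NA instead; the helper name safe_max is hypothetical and is not part of base R or hablar.

```r
# David's three days: every value is missing.
david <- c(NA_real_, NA_real_, NA_real_)

# With na.rm = TRUE nothing is left to compare, so R warns and returns -Inf.
res <- suppressWarnings(max(david, na.rm = TRUE))
res

# Hypothetical manual guard: return NA when there is nothing to summarize.
safe_max <- function(v) {
  if (all(is.na(v))) NA_real_ else max(v, na.rm = TRUE)
}

safe_max(david)        # NA
safe_max(c(3, NA, 8))  # Amanda's values still give 8
```

Writing such a guard for every summary function quickly becomes repetitive.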
We could try to remove the na.rm = TRUE argument from max().
df %>%
  group_by(name) %>%
  summarize(n_sodas_best_day = max(sold_sodas))
#> # A tibble: 3 x 2
#>   name   n_sodas_best_day
#>   <chr>             <dbl>
#> 1 Amanda               NA
#> 2 David                NA
#> 3 Viktor                4
Now R tells us that Viktor had the best day, while Amanda, who was absent on the second day, got NA because R does not know how to find the maximum of a vector with a missing value. David also got NA this time, which makes sense.
Sometimes, calculating simple descriptive statistics can be a cumbersome task. The s function is there to support you! Since it removes NA by default and returns a single NA if the vector is empty or contains only NA, we get:
df %>%
  group_by(name) %>%
  summarize(n_sodas_best_day = max(s(sold_sodas)))
#> # A tibble: 3 x 2
#>   name   n_sodas_best_day
#>   <chr>             <dbl>
#> 1 Amanda                8
#> 2 David                NA
#> 3 Viktor                4
Another astonishing result one might encounter occurs when R tries to return a value when there is none. Take this extract df from the starwars dataset from the R package dplyr.
df %>% head(10)
#> # A tibble: 10 x 4
#>    name               homeworld species height
#>    <chr>              <chr>     <chr>    <int>
#>  1 Luke Skywalker     Tatooine  Human      172
#>  2 C-3PO              Tatooine  Droid      167
#>  3 R2-D2              Naboo     Droid       96
#>  4 Darth Vader        Tatooine  Human      202
#>  5 Leia Organa        Alderaan  Human      150
#>  6 Owen Lars          Tatooine  Human      178
#>  7 Beru Whitesun lars Tatooine  Human      165
#>  8 R5-D4              Tatooine  Droid       97
#>  9 Biggs Darklighter  Tatooine  Human      183
#> 10 Obi-Wan Kenobi     Stewjon   Human      182
Say that we want to find the height of the tallest human from each homeworld. As a precaution, we drop all rows with missing values in the height column so that we do not get the same problem as before.
df %>%
  filter(!is.na(height)) %>%
  group_by(homeworld) %>%
  summarize(tallest_human = max(height[species == "Human"]))
#> # A tibble: 49 x 2
#>    homeworld      tallest_human
#>    <chr>                  <dbl>
#>  1 <NA>                      NA
#>  2 Alderaan                 191
#>  3 Aleen Minor             -Inf
#>  4 Bespin                   175
#>  5 Bestine IV               180
#>  6 Cato Neimoidia          -Inf
#>  7 Cerea                   -Inf
#>  8 Champala                -Inf
#>  9 Chandrila                150
#> 10 Concord Dawn             183
#> # … with 39 more rows
We got negative infinity -Inf again. How could this be?
This is because some homeworlds have no humans, e.g. Cerea. For those groups R tries to calculate the maximum value of nothing. The s function can help you out! Since it returns NA if the vector is empty, we get:
df %>%
  filter(!is.na(height)) %>%
  group_by(homeworld) %>%
  summarize(tallest_human = max(s(height[species == "Human"])))
#> # A tibble: 49 x 2
#>    homeworld      tallest_human
#>    <chr>                  <int>
#>  1 <NA>                     193
#>  2 Alderaan                 191
#>  3 Aleen Minor               NA
#>  4 Bespin                   175
#>  5 Bestine IV               180
#>  6 Cato Neimoidia            NA
#>  7 Cerea                     NA
#>  8 Champala                  NA
#>  9 Chandrila                150
#> 10 Concord Dawn             183
#> # … with 39 more rows
Now we get missing values for the homeworlds that do not have any humans. Makes sense.
Numerical vectors in R can include more than numbers and missing values (NA). They can also include the infinite numbers Inf and -Inf, as shown in the examples above. Furthermore, numerical vectors can include NaN, which means ‘not a number’. If the data frame you are using has NaN or Inf values, it may cause you problems when summarizing your data. Some examples:
x <- c(NaN, 1)
min(x)
#> [1] NaN
x <- c(Inf, 3, 4)
mean(x)
#> [1] Inf
x <- c(5, -Inf, 2)
sum(x)
#> [1] -Inf
Often when you summarize vectors that have NaN or Inf, you want to treat them as missing values. Maybe they appeared by mistake when you accidentally divided a value by zero, since 1/0 = Inf in R. The s function solves this for you by replacing them with NA.
x <- c(NaN, 1)
min(s(x))
#> [1] 1
x <- c(Inf, 3, 4)
mean(s(x))
#> [1] 3.5
x <- c(5, -Inf, 2)
sum(s(x))
#> [1] 7
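To make the division-by-zero case concrete, here is a small base-R sketch. The variable names sold, hours and rate are made up for illustration. It shows how Inf and NaN can slip into a vector and how is.finite() can be used to turn them into NA by hand, which is the replacement step described above.

```r
# Made-up data: sodas sold and hours worked on three days.
sold  <- c(10, 0, 5)
hours <- c(2, 0, 0)

# Dividing by zero does not fail in R: 0/0 is NaN and 5/0 is Inf.
rate <- sold / hours
rate

# is.finite() is FALSE for NaN, Inf, -Inf and NA.
is.finite(rate)

# Replace every non-finite entry with NA, then summarize as usual.
rate[!is.finite(rate)] <- NA
mean(rate, na.rm = TRUE)
```

Only the first day has a usable rate, so the mean is simply that day's value.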
s and summary functions
If an extra function call makes things too messy, you might prefer the wrapper functions of s. All major summary functions have an s-wrapped alternative in hablar. These are accessed by adding an underscore to the name of the summary function, e.g. min_(x), which is equal to min(s(x)). Repeating the previous exercises using wrappers for s would look like:
x <- c(NaN, 1)
min_(x)
#> [1] 1
x <- c(Inf, 3, 4)
mean_(x)
#> [1] 3.5
x <- c(5, -Inf, 2)
sum_(x)
#> [1] 7
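The wrapper idea itself is plain function composition. The sketch below is not how hablar defines its wrappers; it only illustrates the pattern with hypothetical names (clean, wrap_summary, min_sketch, mean_sketch) and a compact base-R stand-in for s.

```r
# Hypothetical stand-in for s(): non-finite values become NA, NA is dropped,
# and empty or all-NA input collapses to a single NA.
clean <- function(x) {
  if (is.numeric(x)) x[!is.finite(x)] <- NA
  if (length(x) == 0 || all(is.na(x))) return(NA)
  x[!is.na(x)]
}

# Build an underscore-style wrapper from any summary function.
wrap_summary <- function(f) {
  force(f)  # evaluate f now so the closure captures the intended function
  function(x, ...) f(clean(x), ...)
}

min_sketch  <- wrap_summary(min)
mean_sketch <- wrap_summary(mean)

min_sketch(c(NaN, 1))
mean_sketch(c(Inf, 3, 4))
```

Building the wrappers from a single factory keeps the cleaning logic in one place.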
To summarize, s can help you get results when you summarize your data, if there is a sensible answer in the vector. If not, you will get NA.