#Why the tidyverse?
There are a few reasons to use R and the Tidyverse for data processing. Excel is opaque – if you make a change to a cell/row/column, unless you write down exactly what you did, you have no way of knowing how you processed your data. Using R and documenting your work in Rmarkdown will allow you to understand exactly what you are doing, which is helpful in future iterations of using the data.
Chances are, you are not going to get data from an experiment in a form that is ready to go for an analysis. Therefore, it is important to first figure out what shape you want your data to be in before you start processing. That way, you have a clear way forward towards tidy data. Tidying and transforming are together called ‘wrangling’, a term you will see often in the field of data science.
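##Tidy data
There are three interrelated rules which make a dataset tidy:
* Each variable must have its own column.
* Each observation must have its own row.
* Each value must have its own cell.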
Data doesn’t often come in this form. For example, sometimes we have data spread over various files (e.g., a file for each speaker/participant). Other untidy datasets store several variables in a pair of columns: for example, the variable name is in one column, and the value is in a second column. Using the tidyverse, we can take this untidy data and make it tidy!
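##Components of the tidyverse
require(tidyverse) loads the following packages (and some others):
* ggplot2: data visualization
* dplyr: data manipulation
* tidyr: tidying data
* readr: importing data
* purrr: functional programming
* tibble: a modern take on data frames
* stringr: working with strings
* forcats: working with factors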
Other packages that are part of the tidyverse that you have to require individually:
* readxl: imports Excel files
* magrittr: helps with programming, including different pipes
* broom: formats models into tidy data
* modelr: helper functions for modeling data
##Importing data
Importing data is essential – if you can’t import your data, you can’t analyze it! Make sure to include any code relating to importing in your Rmarkdown document, so if you make a mistake you don’t have to go back to square one.
Use the read.table(), read.csv(), or read_excel() (from readxl) functions to import your data. Remember to name your data something descriptive, and not too long. If you want to totally immerse yourself in the tidyverse, you can use readr::read_csv() or readr::read_table(), which will import your data as a tibble.
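As a minimal sketch of the tidyverse import route, the snippet below writes a tiny CSV to a temporary file and reads it back with readr::read_csv(). The file contents and the object name `formants` are made up for illustration; they are not the dissertation data.

```r
library(readr)

# Hypothetical toy file, written to a temporary location for illustration only
tmp <- tempfile(fileext = ".csv")
writeLines(c("Speaker,Vowel,F1",
             "BP02,a,700",
             "BP02,i,300"), tmp)

# read_csv() returns a tibble; spelling out col_types keeps the import explicit
formants <- read_csv(tmp, col_types = cols(
  Speaker = col_character(),
  Vowel   = col_character(),
  F1      = col_double()
))

formants
```

Keeping a chunk like this in your Rmarkdown document means the import step is rerun, unchanged, every time you knit.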
##The pipe operator
With base R, we run various data wrangling techniques on the dataset one at a time. In each of these cases, we would either overwrite the original dataset, or save the intermediate steps as new variables. This makes your workspace messy very quickly. The pipe operator (from the tidyverse package magrittr) allows multiple steps of data wrangling or analysis to occur in succession, without creating multiple intermediate data frames. It works by dropping the left-hand side into the first argument slot of the function on the right-hand side.
x %>% f is equivalent to f(x)
x %>% f(y) is equivalent to f(x, y)
x %>% f %>% g %>% h is equivalent to h(g(f(x)))
So, for example:
nasality %>% head() is equivalent to head(nasality)
nasality %>% str() is equivalent to str(nasality)
nasality %>% filter(freq_f3>2400) is equivalent to filter(nasality, freq_f3 > 2400)
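However, this doesn’t work: nasality %>% sd(freq_f1). It is equivalent to sd(nasality, freq_f1), and sd() is a base R function that expects a numeric vector as its first argument, not a data frame; it does not know to look for freq_f1 inside nasality.
If you want to put the left-hand-side object into a subsequent argument slot using the pipe, use the placeholder . to denote the location where you want the left-hand-side argument to go: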
2 %>% round(nasality$F1, digits = .)
This translates to round(nasality$F1, digits = 2)
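The equivalences above can be checked directly. This is a small sketch on made-up numbers; the vector `x` and the data frame `toy` are hypothetical stand-ins, not the tutorial's data. The last lines show one way around the sd() case: extract the column first, so that the pipe hands sd() a vector.

```r
library(magrittr)

x <- c(3.14159, 2.71828)

# x %>% f(y) is f(x, y): x lands in the first argument slot
a <- x %>% round(2)

# The placeholder . moves the left-hand side to a different slot
b <- 2 %>% round(x, digits = .)

identical(a, round(x, 2))
identical(b, round(x, digits = 2))

# A toy data frame (made up) standing in for the nasality data
toy <- data.frame(freq_f1 = c(300, 400, 500))

# sd() needs a numeric vector, so pull the column out before piping
toy$freq_f1 %>% sd()
```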
What dataset are we using? This is my dissertation data. I have processed this a bit, without showing you the code. I will show you the code I used to process it at the end.
The data is as follows: I have formant data (F1-3, and the difference between F2 and F3) at ten normalized time points, for various vowels. The vowel categories are /a/, /i/, and /u/, spoken by 13 Brazilian Portuguese speakers, in the following nasality conditions: oral, nasalized, and nasal (word-final and word-medial). Therefore, there are 12 (vowel*nasality) combinations for each speaker: 3 vowels by 4 nasality conditions.
Note that in this file you will only see the first 10 rows of each tibble. This is on purpose, in order to make the file size smaller. The full dataset has 182,150 rows, as the str() output below shows.
head(normtimedata)
## # A tibble: 6 x 13
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.1 clean un 3.39 569. 2372.
## 2 BP02 abun… u nasal 1 0.2 clean un 3.41 364. 2297.
## 3 BP02 abun… u nasal 1 0.3 clean un 3.43 373. 2258.
## 4 BP02 abun… u nasal 1 0.4 clean un 3.46 303. 2165.
## 5 BP02 abun… u nasal 1 0.5 clean un 3.48 415. 2247.
## 6 BP02 abun… u nasal 1 0.6 clean un 3.51 336. 2098.
## # … with 2 more variables: F3 <dbl>, F2_3 <dbl>
str(normtimedata)
## tibble [182,150 × 13] (S3: tbl_df/tbl/data.frame)
## $ Speaker : chr [1:182150] "BP02" "BP02" "BP02" "BP02" ...
## $ Word : chr [1:182150] "abunda" "abunda" "abunda" "abunda" ...
## $ Vowel : chr [1:182150] "u" "u" "u" "u" ...
## $ Nasality: chr [1:182150] "nasal" "nasal" "nasal" "nasal" ...
## $ RepNo : int [1:182150] 1 1 1 1 1 1 1 1 1 1 ...
## $ NormTime: num [1:182150] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 ...
## $ Type : chr [1:182150] "clean" "clean" "clean" "clean" ...
## $ label : chr [1:182150] "un" "un" "un" "un" ...
## $ Time : num [1:182150] 3.39 3.41 3.43 3.46 3.48 ...
## $ F1 : num [1:182150] 569 364 373 303 415 ...
## $ F2 : num [1:182150] 2372 2297 2258 2165 2247 ...
## $ F3 : num [1:182150] 3130 2692 2882 2918 2893 ...
## $ F2_3 : num [1:182150] 379 197 312 376 323 ...
There are five basic functions that come with dplyr for data manipulation:
* filter()
* arrange()
* select()
* mutate()
* summarise()
filter() keeps only the rows that meet specific criteria. For example, I might want to look only at the midpoint of the vowel, or at specific vowel subsets.
normtimedata %>% filter(F1<800)
## # A tibble: 175,011 x 13
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.1 clean un 3.39 569. 2372.
## 2 BP02 abun… u nasal 1 0.2 clean un 3.41 364. 2297.
## 3 BP02 abun… u nasal 1 0.3 clean un 3.43 373. 2258.
## 4 BP02 abun… u nasal 1 0.4 clean un 3.46 303. 2165.
## 5 BP02 abun… u nasal 1 0.5 clean un 3.48 415. 2247.
## 6 BP02 abun… u nasal 1 0.6 clean un 3.51 336. 2098.
## 7 BP02 abun… u nasal 1 0.7 clean un 3.53 336. 2272.
## 8 BP02 abun… u nasal 1 0.8 clean un 3.56 330. 1719.
## 9 BP02 abun… u nasal 1 0.9 clean un 3.58 302. 2337.
## 10 BP02 abun… u nasal 1 1 clean un 3.60 280. 966.
## # … with 175,001 more rows, and 2 more variables: F3 <dbl>, F2_3 <dbl>
normtimedata %>% filter(NormTime == 0.5)
## # A tibble: 18,215 x 13
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.5 clean un 3.48 415. 2247.
## 2 BP02 abun… u nasal 2 0.5 clean un 5.90 367. 2301.
## 3 BP02 abun… u nasal 3 0.5 clean un 8.51 315. 1508.
## 4 BP02 abun… u nasal 4 0.5 clean un 11.2 301. 1369.
## 5 BP02 abun… u nasal 5 0.5 clean un 14.0 336. 1252.
## 6 BP02 abun… u nasal 6 0.5 clean un 16.4 310. 2077.
## 7 BP02 abun… u nasal 7 0.5 clean un 19.0 323. 806.
## 8 BP02 abun… u nasal 8 0.5 clean un 21.8 321. 1374.
## 9 BP02 abun… u nasal 9 0.5 clean un 24.5 285. 2208.
## 10 BP02 abun… u nasal 10 0.5 clean un 27.3 357. 1703.
## # … with 18,205 more rows, and 2 more variables: F3 <dbl>, F2_3 <dbl>
normtimedata %>% filter(Vowel == "a", Nasality == "nasal") %>% head()
## # A tibble: 6 x 13
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 tapa… a nasal 1 0.1 clean an 3.59 460. 1271.
## 2 BP02 tapa… a nasal 1 0.2 clean an 3.62 464. 1379.
## 3 BP02 tapa… a nasal 1 0.3 clean an 3.64 433. 1393.
## 4 BP02 tapa… a nasal 1 0.4 clean an 3.67 337. 1734.
## 5 BP02 tapa… a nasal 1 0.5 clean an 3.69 335. 2246.
## 6 BP02 tapa… a nasal 1 0.6 clean an 3.71 326. 2231.
## # … with 2 more variables: F3 <dbl>, F2_3 <dbl>
meanF1 = mean(normtimedata$F1, na.rm = TRUE)
normtimedata %>% filter(F1 < meanF1)
## # A tibble: 111,772 x 13
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.2 clean un 3.41 364. 2297.
## 2 BP02 abun… u nasal 1 0.3 clean un 3.43 373. 2258.
## 3 BP02 abun… u nasal 1 0.4 clean un 3.46 303. 2165.
## 4 BP02 abun… u nasal 1 0.5 clean un 3.48 415. 2247.
## 5 BP02 abun… u nasal 1 0.6 clean un 3.51 336. 2098.
## 6 BP02 abun… u nasal 1 0.7 clean un 3.53 336. 2272.
## 7 BP02 abun… u nasal 1 0.8 clean un 3.56 330. 1719.
## 8 BP02 abun… u nasal 1 0.9 clean un 3.58 302. 2337.
## 9 BP02 abun… u nasal 1 1 clean un 3.60 280. 966.
## 10 BP02 abun… u nasal 2 0.2 clean un 5.83 326. 2185.
## # … with 111,762 more rows, and 2 more variables: F3 <dbl>, F2_3 <dbl>
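# Keep rows where each formant is less than one SD above its mean
# (one-sided: values far below the mean are kept)
normtimedata_filtered = normtimedata %>% filter(
  (F1 - mean(F1, na.rm = TRUE)) < sd(F1, na.rm = TRUE),
  (F2 - mean(F2, na.rm = TRUE)) < sd(F2, na.rm = TRUE),
  (F3 - mean(F3, na.rm = TRUE)) < sd(F3, na.rm = TRUE)
)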
head(normtimedata_filtered)
## # A tibble: 6 x 13
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.8 clean un 3.56 330. 1719.
## 2 BP02 abun… u nasal 1 1 clean un 3.60 280. 966.
## 3 BP02 abun… u nasal 2 0.3 clean un 5.86 287. 1767.
## 4 BP02 abun… u nasal 2 0.4 clean un 5.88 334. 1471.
## 5 BP02 abun… u nasal 2 0.6 clean un 5.93 310. 1371.
## 6 BP02 abun… u nasal 2 0.8 clean un 5.97 306. 1308.
## # … with 2 more variables: F3 <dbl>, F2_3 <dbl>
Arrange changes the order of rows based on a column or a set of columns. The default is ascending order. To use descending order, use desc().
normtimedata %>% arrange(F1)
## # A tibble: 182,150 x 13
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 baba… a oral 65 0.4 inou… ao 89.2 0 0
## 2 BP02 baba… a oral 65 0.5 inou… ao 89.2 0 0
## 3 BP02 baba… a oral 65 0.6 inou… ao 89.2 0 0
## 4 BP02 baba… a oral 65 0.7 inou… ao 89.3 0 0
## 5 BP02 baba… a oral 65 0.8 inou… ao 89.3 0 0
## 6 BP02 baba… a oral 65 0.9 inou… ao 89.3 0 0
## 7 BP02 baba… a oral 65 1 inou… ao 89.3 0 0
## 8 BP05 prop… a nasaliz… 81 1 inou… a~ 89.3 0 0
## 9 BP07 prop… a nasaliz… 81 0.9 inou… a~ 89.3 0 0
## 10 BP07 prop… a nasaliz… 81 1 inou… a~ 89.3 0 0
## # … with 182,140 more rows, and 2 more variables: F3 <dbl>, F2_3 <dbl>
normtimedata %>% arrange(F1, desc(F2))
## # A tibble: 182,150 x 13
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 baba… a oral 65 0.4 inou… ao 89.2 0 0
## 2 BP02 baba… a oral 65 0.5 inou… ao 89.2 0 0
## 3 BP02 baba… a oral 65 0.6 inou… ao 89.2 0 0
## 4 BP02 baba… a oral 65 0.7 inou… ao 89.3 0 0
## 5 BP02 baba… a oral 65 0.8 inou… ao 89.3 0 0
## 6 BP02 baba… a oral 65 0.9 inou… ao 89.3 0 0
## 7 BP02 baba… a oral 65 1 inou… ao 89.3 0 0
## 8 BP05 prop… a nasaliz… 81 1 inou… a~ 89.3 0 0
## 9 BP07 prop… a nasaliz… 81 0.9 inou… a~ 89.3 0 0
## 10 BP07 prop… a nasaliz… 81 1 inou… a~ 89.3 0 0
## # … with 182,140 more rows, and 2 more variables: F3 <dbl>, F2_3 <dbl>
normtimedata %>% arrange(Speaker, Vowel, desc(Nasality))
## # A tibble: 182,150 x 13
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 baba… a oral 1 0.1 clean ao 5.67 602. 1157.
## 2 BP02 baba… a oral 1 0.2 clean ao 5.69 709. 1181.
## 3 BP02 baba… a oral 1 0.3 clean ao 5.71 744. 1195.
## 4 BP02 baba… a oral 1 0.4 clean ao 5.73 724. 1194.
## 5 BP02 baba… a oral 1 0.5 clean ao 5.75 673. 1228.
## 6 BP02 baba… a oral 1 0.6 clean ao 5.77 691. 1238.
## 7 BP02 baba… a oral 1 0.7 clean ao 5.78 685. 1270.
## 8 BP02 baba… a oral 1 0.8 clean ao 5.80 673. 1311.
## 9 BP02 baba… a oral 1 0.9 clean ao 5.82 616. 1304.
## 10 BP02 baba… a oral 1 1 clean ao 5.84 414. 1358.
## # … with 182,140 more rows, and 2 more variables: F3 <dbl>, F2_3 <dbl>
Select allows you to choose one or more variables from your dataset. This is important because you may have columns that you simply don’t care about, and you don’t want to accidentally use them in your analysis.
normtimedata %>% select(Nasality, Vowel, Speaker)
## # A tibble: 182,150 x 3
## Nasality Vowel Speaker
## <chr> <chr> <chr>
## 1 nasal u BP02
## 2 nasal u BP02
## 3 nasal u BP02
## 4 nasal u BP02
## 5 nasal u BP02
## 6 nasal u BP02
## 7 nasal u BP02
## 8 nasal u BP02
## 9 nasal u BP02
## 10 nasal u BP02
## # … with 182,140 more rows
normtimedata %>% select(Time:Speaker)
## # A tibble: 182,150 x 9
## Time label Type NormTime RepNo Nasality Vowel Word Speaker
## <dbl> <chr> <chr> <dbl> <int> <chr> <chr> <chr> <chr>
## 1 3.39 un clean 0.1 1 nasal u abunda BP02
## 2 3.41 un clean 0.2 1 nasal u abunda BP02
## 3 3.43 un clean 0.3 1 nasal u abunda BP02
## 4 3.46 un clean 0.4 1 nasal u abunda BP02
## 5 3.48 un clean 0.5 1 nasal u abunda BP02
## 6 3.51 un clean 0.6 1 nasal u abunda BP02
## 7 3.53 un clean 0.7 1 nasal u abunda BP02
## 8 3.56 un clean 0.8 1 nasal u abunda BP02
## 9 3.58 un clean 0.9 1 nasal u abunda BP02
## 10 3.60 un clean 1 1 nasal u abunda BP02
## # … with 182,140 more rows
normtimedata %>% select(-Type)
## # A tibble: 182,150 x 12
## Speaker Word Vowel Nasality RepNo NormTime label Time F1 F2 F3
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.1 un 3.39 569. 2372. 3130.
## 2 BP02 abun… u nasal 1 0.2 un 3.41 364. 2297. 2692.
## 3 BP02 abun… u nasal 1 0.3 un 3.43 373. 2258. 2882.
## 4 BP02 abun… u nasal 1 0.4 un 3.46 303. 2165. 2918.
## 5 BP02 abun… u nasal 1 0.5 un 3.48 415. 2247. 2893.
## 6 BP02 abun… u nasal 1 0.6 un 3.51 336. 2098. 2623.
## 7 BP02 abun… u nasal 1 0.7 un 3.53 336. 2272. 2869.
## 8 BP02 abun… u nasal 1 0.8 un 3.56 330. 1719. 2435.
## 9 BP02 abun… u nasal 1 0.9 un 3.58 302. 2337. 2391.
## 10 BP02 abun… u nasal 1 1 un 3.60 280. 966. 2435.
## # … with 182,140 more rows, and 1 more variable: F2_3 <dbl>
What happens if you want to rename a particular variable while doing this? There are many tedious ways to do this, such as creating an identical variable with a different name, or using the colnames() function in base R. The easiest way is to use the rename() function, a variant of select() that keeps every column not specifically mentioned.
normtimedata %>% rename(Normalized_Time = NormTime)
## # A tibble: 182,150 x 13
## Speaker Word Vowel Nasality RepNo Normalized_Time Type label Time F1
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.1 clean un 3.39 569.
## 2 BP02 abun… u nasal 1 0.2 clean un 3.41 364.
## 3 BP02 abun… u nasal 1 0.3 clean un 3.43 373.
## 4 BP02 abun… u nasal 1 0.4 clean un 3.46 303.
## 5 BP02 abun… u nasal 1 0.5 clean un 3.48 415.
## 6 BP02 abun… u nasal 1 0.6 clean un 3.51 336.
## 7 BP02 abun… u nasal 1 0.7 clean un 3.53 336.
## 8 BP02 abun… u nasal 1 0.8 clean un 3.56 330.
## 9 BP02 abun… u nasal 1 0.9 clean un 3.58 302.
## 10 BP02 abun… u nasal 1 1 clean un 3.60 280.
## # … with 182,140 more rows, and 3 more variables: F2 <dbl>, F3 <dbl>,
## # F2_3 <dbl>
normtimedata %>% rename(Normalized_Time = NormTime) %>%
  select(-Type)
## # A tibble: 182,150 x 12
## Speaker Word Vowel Nasality RepNo Normalized_Time label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.1 un 3.39 569. 2372.
## 2 BP02 abun… u nasal 1 0.2 un 3.41 364. 2297.
## 3 BP02 abun… u nasal 1 0.3 un 3.43 373. 2258.
## 4 BP02 abun… u nasal 1 0.4 un 3.46 303. 2165.
## 5 BP02 abun… u nasal 1 0.5 un 3.48 415. 2247.
## 6 BP02 abun… u nasal 1 0.6 un 3.51 336. 2098.
## 7 BP02 abun… u nasal 1 0.7 un 3.53 336. 2272.
## 8 BP02 abun… u nasal 1 0.8 un 3.56 330. 1719.
## 9 BP02 abun… u nasal 1 0.9 un 3.58 302. 2337.
## 10 BP02 abun… u nasal 1 1 un 3.60 280. 966.
## # … with 182,140 more rows, and 2 more variables: F3 <dbl>, F2_3 <dbl>
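To make the difference between renaming with select() and with rename() concrete, here is a small sketch on a made-up tibble (`toy` and its values are hypothetical; only the column names echo the data above).

```r
library(dplyr)

# Toy tibble with made-up values
toy <- tibble(
  Speaker  = c("BP02", "BP04"),
  NormTime = c(0.1, 0.2),
  F1       = c(569, 335)
)

# select() can rename too, but it drops every column you do not mention
only_renamed <- toy %>% select(Normalized_Time = NormTime)

# rename() changes the name and keeps all other columns in place
all_kept <- toy %>% rename(Normalized_Time = NormTime)

names(only_renamed)
names(all_kept)
```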
Finally, you can use the everything() function to move something to the beginning of the data frame:
normtimedata %>% rename(Normalized_Time = NormTime) %>%
  select(Speaker, Nasality, everything()) %>%
  select(-Type)
## # A tibble: 182,150 x 12
## Speaker Nasality Word Vowel RepNo Normalized_Time label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 BP02 nasal abun… u 1 0.1 un 3.39 569. 2372.
## 2 BP02 nasal abun… u 1 0.2 un 3.41 364. 2297.
## 3 BP02 nasal abun… u 1 0.3 un 3.43 373. 2258.
## 4 BP02 nasal abun… u 1 0.4 un 3.46 303. 2165.
## 5 BP02 nasal abun… u 1 0.5 un 3.48 415. 2247.
## 6 BP02 nasal abun… u 1 0.6 un 3.51 336. 2098.
## 7 BP02 nasal abun… u 1 0.7 un 3.53 336. 2272.
## 8 BP02 nasal abun… u 1 0.8 un 3.56 330. 1719.
## 9 BP02 nasal abun… u 1 0.9 un 3.58 302. 2337.
## 10 BP02 nasal abun… u 1 1 un 3.60 280. 966.
## # … with 182,140 more rows, and 2 more variables: F3 <dbl>, F2_3 <dbl>
Other things that might help when selecting a variable can be found in the select help page.
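A few of the selection helpers documented on that help page are sketched below. The tibble `toy` is made up for illustration; starts_with(), num_range(), and the minus sign for dropping columns are standard helpers available inside select().

```r
library(dplyr)

# Toy tibble with made-up values; column names mimic the formant data
toy <- tibble(
  Speaker = c("BP02", "BP04"),
  Vowel   = c("a", "i"),
  F1      = c(700, 300),
  F2      = c(1200, 2300),
  F3      = c(2400, 3000)
)

# starts_with() keeps every column whose name begins with "F"
formants_only <- toy %>% select(Speaker, starts_with("F"))

# num_range() matches a prefix plus a numeric range: F1 and F2 here
first_two <- toy %>% select(num_range("F", 1:2))

# A minus sign drops the matching columns instead of keeping them
no_formants <- toy %>% select(-starts_with("F"))

names(formants_only)
names(first_two)
names(no_formants)
```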
Sometimes, you want to create a new variable that is a function of an old variable. mutate() does this, adding the new variable to the end of your dataset.
normtimedata %>% mutate(F1_2= F2-F1)
## # A tibble: 182,150 x 14
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.1 clean un 3.39 569. 2372.
## 2 BP02 abun… u nasal 1 0.2 clean un 3.41 364. 2297.
## 3 BP02 abun… u nasal 1 0.3 clean un 3.43 373. 2258.
## 4 BP02 abun… u nasal 1 0.4 clean un 3.46 303. 2165.
## 5 BP02 abun… u nasal 1 0.5 clean un 3.48 415. 2247.
## 6 BP02 abun… u nasal 1 0.6 clean un 3.51 336. 2098.
## 7 BP02 abun… u nasal 1 0.7 clean un 3.53 336. 2272.
## 8 BP02 abun… u nasal 1 0.8 clean un 3.56 330. 1719.
## 9 BP02 abun… u nasal 1 0.9 clean un 3.58 302. 2337.
## 10 BP02 abun… u nasal 1 1 clean un 3.60 280. 966.
## # … with 182,140 more rows, and 3 more variables: F3 <dbl>, F2_3 <dbl>,
## # F1_2 <dbl>
normtimedata %>% mutate(SpeakerNo = substr(Speaker, 3,4))
## # A tibble: 182,150 x 14
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.1 clean un 3.39 569. 2372.
## 2 BP02 abun… u nasal 1 0.2 clean un 3.41 364. 2297.
## 3 BP02 abun… u nasal 1 0.3 clean un 3.43 373. 2258.
## 4 BP02 abun… u nasal 1 0.4 clean un 3.46 303. 2165.
## 5 BP02 abun… u nasal 1 0.5 clean un 3.48 415. 2247.
## 6 BP02 abun… u nasal 1 0.6 clean un 3.51 336. 2098.
## 7 BP02 abun… u nasal 1 0.7 clean un 3.53 336. 2272.
## 8 BP02 abun… u nasal 1 0.8 clean un 3.56 330. 1719.
## 9 BP02 abun… u nasal 1 0.9 clean un 3.58 302. 2337.
## 10 BP02 abun… u nasal 1 1 clean un 3.60 280. 966.
## # … with 182,140 more rows, and 3 more variables: F3 <dbl>, F2_3 <dbl>,
## # SpeakerNo <chr>
#substr is a base R function. What does it do?
#?substr
I know I said there were five main functions. I kind of lied. Summarise (the next function) is not very useful unless we use it in conjunction with group_by. This is because we want to see summaries within different groups, not just across the whole dataset.
normtimedata %>% group_by(Speaker, Vowel, Nasality)
## # A tibble: 182,150 x 13
## # Groups: Speaker, Vowel, Nasality [156]
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.1 clean un 3.39 569. 2372.
## 2 BP02 abun… u nasal 1 0.2 clean un 3.41 364. 2297.
## 3 BP02 abun… u nasal 1 0.3 clean un 3.43 373. 2258.
## 4 BP02 abun… u nasal 1 0.4 clean un 3.46 303. 2165.
## 5 BP02 abun… u nasal 1 0.5 clean un 3.48 415. 2247.
## 6 BP02 abun… u nasal 1 0.6 clean un 3.51 336. 2098.
## 7 BP02 abun… u nasal 1 0.7 clean un 3.53 336. 2272.
## 8 BP02 abun… u nasal 1 0.8 clean un 3.56 330. 1719.
## 9 BP02 abun… u nasal 1 0.9 clean un 3.58 302. 2337.
## 10 BP02 abun… u nasal 1 1 clean un 3.60 280. 966.
## # … with 182,140 more rows, and 2 more variables: F3 <dbl>, F2_3 <dbl>
#this only looks a little different than the average tibble - it has groups listed on the top! Now what happens when we use summarise:
Summarise allows you to collapse a data frame into a single row that tells you some sort of summary about that data. As you can see below, it isn’t useful unless you use it with group_by.
normtimedata %>% summarise(F1mean = mean(F1))
## # A tibble: 1 x 1
## F1mean
## <dbl>
## 1 460.
normtimedata %>% group_by(Speaker, Vowel, Nasality) %>% summarise(F1mean = mean(F1))
## `summarise()` regrouping output by 'Speaker', 'Vowel' (override with `.groups` argument)
## # A tibble: 156 x 4
## # Groups: Speaker, Vowel [39]
## Speaker Vowel Nasality F1mean
## <chr> <chr> <chr> <dbl>
## 1 BP02 a nasal 560.
## 2 BP02 a nasal_final 600.
## 3 BP02 a nasalized 592.
## 4 BP02 a oral 700.
## 5 BP02 i nasal 544.
## 6 BP02 i nasal_final 892.
## 7 BP02 i nasalized 675.
## 8 BP02 i oral 527.
## 9 BP02 u nasal 442.
## 10 BP02 u nasal_final 440.
## # … with 146 more rows
normtimedata %>% group_by(Speaker, Vowel, Nasality) %>% summarise(F1mean = mean(F1), F1sd = sd(F1), F2mean = mean(F2),F2sd = sd(F2), F3mean = mean(F3),F3sd = sd(F3))
## `summarise()` regrouping output by 'Speaker', 'Vowel' (override with `.groups` argument)
## # A tibble: 156 x 9
## # Groups: Speaker, Vowel [39]
## Speaker Vowel Nasality F1mean F1sd F2mean F2sd F3mean F3sd
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP02 a nasal 560. 316. 1643. 408. 2665. 271.
## 2 BP02 a nasal_final 600. 132. 1293. 222. 2682. 251.
## 3 BP02 a nasalized 592. 115. 1383. 196. 2642. 165.
## 4 BP02 a oral 700. 113. 1244. 158. 2366. 246.
## 5 BP02 i nasal 544. 570. 2300. 345. 2887. 263.
## 6 BP02 i nasal_final 892. 827. 2391. 385. 2921. 276.
## 7 BP02 i nasalized 675. 707. 2452. 231. 2955. 182.
## 8 BP02 i oral 527. 591. 2424. 264. 3003. 153.
## 9 BP02 u nasal 442. 251. 1746. 481. 2744. 259.
## 10 BP02 u nasal_final 440. 116. 1842. 516. 2789. 244.
## # … with 146 more rows
There is one more thing to think about: what if we need to know the sample size in each of our groups? We can do this using n() and n_distinct(). The first gives the number of rows in each group, and the second gives the number of distinct values of a variable in each group.
normtimedata %>% group_by(Speaker, Vowel, Nasality) %>% summarise(F1mean = mean(F1), number = n())
## `summarise()` regrouping output by 'Speaker', 'Vowel' (override with `.groups` argument)
## # A tibble: 156 x 5
## # Groups: Speaker, Vowel [39]
## Speaker Vowel Nasality F1mean number
## <chr> <chr> <chr> <dbl> <int>
## 1 BP02 a nasal 560. 980
## 2 BP02 a nasal_final 600. 1010
## 3 BP02 a nasalized 592. 950
## 4 BP02 a oral 700. 990
## 5 BP02 i nasal 544. 950
## 6 BP02 i nasal_final 892. 1000
## 7 BP02 i nasalized 675. 1030
## 8 BP02 i oral 527. 980
## 9 BP02 u nasal 442. 1040
## 10 BP02 u nasal_final 440. 970
## # … with 146 more rows
normtimedata %>% group_by(Vowel, Nasality) %>% summarise(F1mean = mean(F1), number = n_distinct(Speaker))
## `summarise()` regrouping output by 'Vowel' (override with `.groups` argument)
## # A tibble: 12 x 4
## # Groups: Vowel [3]
## Vowel Nasality F1mean number
## <chr> <chr> <dbl> <int>
## 1 a nasal 490. 13
## 2 a nasal_final 521. 13
## 3 a nasalized 536. 13
## 4 a oral 655. 13
## 5 i nasal 422. 13
## 6 i nasal_final 434. 13
## 7 i nasalized 395. 13
## 8 i oral 364. 13
## 9 u nasal 435. 13
## 10 u nasal_final 433. 13
## 11 u nasalized 430. 13
## 12 u oral 403. 13
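The `summarise()` message in the outputs above mentions a `.groups` argument. The sketch below, on a made-up tibble, shows one way to use it: `.groups = "drop"` returns an ungrouped result and silences the message. The `na.rm = TRUE` is an addition for illustration (the toy data contains a missing value); whether the real data needs it is an assumption.

```r
library(dplyr)

# Toy tibble with made-up values, including one missing F1
toy <- tibble(
  Speaker = c("BP02", "BP02", "BP04", "BP04"),
  Vowel   = c("a", "a", "a", "i"),
  F1      = c(700, NA, 650, 300)
)

res <- toy %>%
  group_by(Speaker, Vowel) %>%
  summarise(
    F1mean  = mean(F1, na.rm = TRUE),  # ignore missing values
    number  = n(),                     # rows per group, NAs included
    .groups = "drop"                   # return an ungrouped tibble
  )

res
```

Dropping the grouping at the end avoids surprises when a later mutate() or filter() would otherwise be applied group by group.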
#Tidying data with tidyr
We will continue to use my dissertation data for this section. I have reformatted the data for some sections, in order to show you these functions.
Recall that tidy data is based on the principles that every variable is saved in a column, and each observation is saved in a row. Sometimes you have to change your data to a different format to achieve the tidy data shape. Keep in mind that there are times when you might want to violate these principles. If those come up, you need to know how to reshape your data.
##Shapes of data
Data comes in one of two shapes: wide and long. The easiest way to demonstrate the difference between the two is with a picture.
Wide data is presented with each different data variable in a separate column, whereas long data is presented with one column containing all the values and another column listing the context of the value. There are intermediate shapes between the extremes of long and wide. Most modeling and visualization packages prefer long data to wide data.
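The same small set of measurements can be written in both shapes. The two tibbles below are a hand-built sketch with made-up values; the column names `variable` and `value` are chosen to match the gather() example later in this section.

```r
library(tibble)

# Wide: one row per speaker, one column per formant
wide <- tibble(
  Speaker = c("BP02", "BP04"),
  F1      = c(569, 335),
  F2      = c(2372, 2100)
)

# Long: one row per measurement; `variable` says which formant `value` is
long <- tibble(
  Speaker  = c("BP02", "BP02", "BP04", "BP04"),
  variable = c("F1", "F2", "F1", "F2"),
  value    = c(569, 2372, 335, 2100)
)

wide
long
```

Both objects hold the same four numbers; only the layout differs.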
There are four basic functions to reshape data in tidyr:
* gather()
* spread()
* separate()
* unite()
##Gather
Gather takes wide data and makes it long. It does this by turning column names into the values of a new variable (called a key), with a second new variable (the value) holding the cell contents.
normtimedata %>%
  select(Speaker, Vowel, Nasality, NormTime, RepNo, F1, F2, F3) %>%
  gather(variable, value, c(F1:F3))
## # A tibble: 546,450 x 7
## Speaker Vowel Nasality NormTime RepNo variable value
## <chr> <chr> <chr> <dbl> <int> <chr> <dbl>
## 1 BP02 u nasal 0.1 1 F1 569.
## 2 BP02 u nasal 0.2 1 F1 364.
## 3 BP02 u nasal 0.3 1 F1 373.
## 4 BP02 u nasal 0.4 1 F1 303.
## 5 BP02 u nasal 0.5 1 F1 415.
## 6 BP02 u nasal 0.6 1 F1 336.
## 7 BP02 u nasal 0.7 1 F1 336.
## 8 BP02 u nasal 0.8 1 F1 330.
## 9 BP02 u nasal 0.9 1 F1 302.
## 10 BP02 u nasal 1 1 F1 280.
## # … with 546,440 more rows
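Newer versions of tidyr (1.0.0 and later) also provide pivot_longer(), which does the same job as gather(). The sketch below runs both on a made-up tibble to show that they produce the same long data; the only difference you may notice is row order, which is why the comparison sorts first.

```r
library(dplyr)
library(tidyr)

# Toy wide tibble with made-up values
toy <- tibble(
  Speaker = c("BP02", "BP04"),
  F1      = c(569, 335),
  F2      = c(2372, 2100),
  F3      = c(3130, 2900)
)

# gather(): key column name, value column name, then the columns to stack
with_gather <- toy %>%
  gather(variable, value, F1:F3) %>%
  arrange(Speaker, variable)

# pivot_longer(): the same reshaping with named arguments
with_pivot <- toy %>%
  pivot_longer(F1:F3, names_to = "variable", values_to = "value") %>%
  arrange(Speaker, variable)

with_gather
with_pivot
```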
##Spread
Spread takes long data and makes it wide. Be careful using this! Spread takes each value of its second argument (the first argument being the dataset, which is implicitly included if you are using the pipe) and makes it into a column, with the values in that column taken from the column specified in the third argument.
If you have a lot of dependent variables, spread will leave you with a bunch of NAs and your data will look awful. In these cases it’s better to maintain a long format. Another way of dealing with this is to create a new variable that includes the information for both variables.
normtimedata %>% select(c(Speaker, Word:NormTime, F1)) %>% spread(Speaker, F1) %>% arrange(RepNo)
## # A tibble: 18,660 x 18
## Word Vowel Nasality RepNo NormTime BP02 BP04 BP05 BP06 BP07 BP09 BP10
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 abun… u nasal 1 0.1 569. 335. 206. 354. 459. 670. 361.
## 2 abun… u nasal 1 0.2 364. 345. 395. 311. 461. 682. 308.
## 3 abun… u nasal 1 0.3 373. 302. 249. 336. 468. 753. 368.
## 4 abun… u nasal 1 0.4 303. 291. 409. 297. 457. 773. 382.
## 5 abun… u nasal 1 0.5 415. 289. 304. 304. 441. 765. 312.
## 6 abun… u nasal 1 0.6 336. 341. 334. 275. 454. 758. 315.
## 7 abun… u nasal 1 0.7 336. 372. 302. 285. 491. 762. 303.
## 8 abun… u nasal 1 0.8 330. 338. 307. 288. 359. 781. 291.
## 9 abun… u nasal 1 0.9 302. 304. 263. 270. 365. 775. 288.
## 10 abun… u nasal 1 1 280. 285. 287. 269. 291. 756. 276.
## # … with 18,650 more rows, and 6 more variables: BP14 <dbl>, BP17 <dbl>,
## # BP18 <dbl>, BP19 <dbl>, BP20 <dbl>, BP21 <dbl>
normtimedata %>% filter(NormTime==0.5) %>% select(c(Speaker, Word:NormTime, F1)) %>% spread(Speaker, F1) %>% arrange(RepNo)
## # A tibble: 1,866 x 18
## Word Vowel Nasality RepNo NormTime BP02 BP04 BP05 BP06 BP07 BP09 BP10
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 abun… u nasal 1 0.5 415. 289. 304. 304. 441. 765. 312.
## 2 baba… a oral 1 0.5 673. 735. 535. 560. 676. 699. 682.
## 3 bebum u nasal_f… 1 0.5 368. 334. 273. 254. 349. 309. 370.
## 4 cabi… i oral 1 0.5 248. 299. 276. 311. 414. 367. 278.
## 5 cabi… i nasaliz… 1 0.5 312. 326. 279. 261. 327. 371. 277.
## 6 cupim i nasal_f… 1 0.5 311. 361. 280. 275. 391. 1847. 313.
## 7 prop… a nasaliz… 1 0.5 534. 489. 522. 458. 537. 605. 457.
## 8 subi… i nasal 1 0.5 289. 349. 268. 278. 400. 292. 381.
## 9 tapa… a nasal 1 0.5 335. 462. 389. 395. 487. 577. 345.
## 10 trib… u nasaliz… 1 0.5 573. 339. 298. 298. 376. 652. 353.
## # … with 1,856 more rows, and 6 more variables: BP14 <dbl>, BP17 <dbl>,
## # BP18 <dbl>, BP19 <dbl>, BP20 <dbl>, BP21 <dbl>
normtimedata %>%
  select(Speaker, Vowel, Nasality, NormTime, RepNo, F1, F2, F3) %>%
  gather(variable, value, c(F1:F3)) %>%
  unite(temp, Speaker, variable) %>%
  spread(NormTime, value)
## # A tibble: 54,645 x 14
## temp Vowel Nasality RepNo `0.1` `0.2` `0.3` `0.4` `0.5` `0.6` `0.7` `0.8`
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP02… a nasal 1 460. 464. 433. 337. 335. 326. 331. 345.
## 2 BP02… a nasal 2 471. 473. 433. 443. 363. 345. 341. 332.
## 3 BP02… a nasal 3 496. 491. 466. 357. 209. 273. 253. 306.
## 4 BP02… a nasal 4 521. 480. 483. 374. 1182. 245. 274. 252.
## 5 BP02… a nasal 5 490. 483. 470. 400. 283. 243. 282. 269.
## 6 BP02… a nasal 6 443. 455. 456. 409. 374. 303. 327. 343.
## 7 BP02… a nasal 7 446. 460. 426. 432. 346. 273. 285. 302.
## 8 BP02… a nasal 8 491. 487. 477. 427. 398. 280. 283. 287.
## 9 BP02… a nasal 9 470. 481. 418. 328. 319. 289. 288. 308.
## 10 BP02… a nasal 10 425. 439. 445. 420. 192. 225. 241. 297.
## # … with 54,635 more rows, and 2 more variables: `0.9` <dbl>, `1` <dbl>
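The NA problem mentioned earlier in this section is easy to reproduce on a small scale. In this made-up example, one speaker has no F2 measurement, so spread() has nothing to put in that cell.

```r
library(dplyr)
library(tidyr)

# Toy long data with made-up values; BP04 has no F2 row
long <- tibble(
  Speaker  = c("BP02", "BP02", "BP04"),
  variable = c("F1", "F2", "F1"),
  value    = c(569, 2372, 335)
)

# Each distinct value of `variable` becomes a column;
# combinations that never occur in the long data are filled with NA
wide <- long %>% spread(variable, value)

wide
```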
Let’s say you want to take information out of a column and make it into multiple columns (similar to the ‘text to columns’ command in Excel). You can use the separate() function to do this.
By default, separate will split at any non-alphanumeric character. You can also specify the separator with sep = "____". Alternatively, give a number to split at that character position, as in the example below.
normtimedata %>% separate(Speaker, into = c("Language", "ID"), 2)
## # A tibble: 182,150 x 14
## Language ID Word Vowel Nasality RepNo NormTime Type label Time F1
## <chr> <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl>
## 1 BP 02 abun… u nasal 1 0.1 clean un 3.39 569.
## 2 BP 02 abun… u nasal 1 0.2 clean un 3.41 364.
## 3 BP 02 abun… u nasal 1 0.3 clean un 3.43 373.
## 4 BP 02 abun… u nasal 1 0.4 clean un 3.46 303.
## 5 BP 02 abun… u nasal 1 0.5 clean un 3.48 415.
## 6 BP 02 abun… u nasal 1 0.6 clean un 3.51 336.
## 7 BP 02 abun… u nasal 1 0.7 clean un 3.53 336.
## 8 BP 02 abun… u nasal 1 0.8 clean un 3.56 330.
## 9 BP 02 abun… u nasal 1 0.9 clean un 3.58 302.
## 10 BP 02 abun… u nasal 1 1 clean un 3.60 280.
## # … with 182,140 more rows, and 3 more variables: F2 <dbl>, F3 <dbl>,
## # F2_3 <dbl>
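The three ways of telling separate() where to split can be compared side by side. The tibble `toy` and its `VowNas` column are made up for this sketch.

```r
library(dplyr)
library(tidyr)

# Toy tibble with made-up values
toy <- tibble(
  VowNas  = c("a_oral", "i_nasalized"),
  Speaker = c("BP02", "BP04")
)

# Default: split at any run of non-alphanumeric characters (here "_")
by_default <- toy %>% separate(VowNas, into = c("Vowel", "Nasality"))

# Explicit separator
by_sep <- toy %>% separate(VowNas, into = c("Vowel", "Nasality"), sep = "_")

# Numeric sep: split after the second character
by_position <- toy %>% separate(Speaker, into = c("Language", "ID"), sep = 2)

by_default
by_position
```

Note that a value with more than one separator, such as one containing two underscores, would give more pieces than there are names in `into`, and separate() would warn about it.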
The opposite of separate is ‘unite’, which will make a new variable out of two or more variables. Note that for both separate and unite, you end up getting rid of the old variables.
normtimedata %>% unite(VowNas, Vowel, Nasality, sep = "_")
## # A tibble: 182,150 x 12
## Speaker Word VowNas RepNo NormTime Type label Time F1 F2 F3 F2_3
## <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP02 abun… u_nas… 1 0.1 clean un 3.39 569. 2372. 3130. 379.
## 2 BP02 abun… u_nas… 1 0.2 clean un 3.41 364. 2297. 2692. 197.
## 3 BP02 abun… u_nas… 1 0.3 clean un 3.43 373. 2258. 2882. 312.
## 4 BP02 abun… u_nas… 1 0.4 clean un 3.46 303. 2165. 2918. 376.
## 5 BP02 abun… u_nas… 1 0.5 clean un 3.48 415. 2247. 2893. 323.
## 6 BP02 abun… u_nas… 1 0.6 clean un 3.51 336. 2098. 2623. 263.
## 7 BP02 abun… u_nas… 1 0.7 clean un 3.53 336. 2272. 2869. 299.
## 8 BP02 abun… u_nas… 1 0.8 clean un 3.56 330. 1719. 2435. 358.
## 9 BP02 abun… u_nas… 1 0.9 clean un 3.58 302. 2337. 2391. 26.7
## 10 BP02 abun… u_nas… 1 1 clean un 3.60 280. 966. 2435. 734.
## # … with 182,140 more rows
normtimedata %>% unite(MERP, Speaker, Vowel, Nasality, sep = "!")
## # A tibble: 182,150 x 11
## MERP Word RepNo NormTime Type label Time F1 F2 F3 F2_3
## <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP02!u!nasal abunda 1 0.1 clean un 3.39 569. 2372. 3130. 379.
## 2 BP02!u!nasal abunda 1 0.2 clean un 3.41 364. 2297. 2692. 197.
## 3 BP02!u!nasal abunda 1 0.3 clean un 3.43 373. 2258. 2882. 312.
## 4 BP02!u!nasal abunda 1 0.4 clean un 3.46 303. 2165. 2918. 376.
## 5 BP02!u!nasal abunda 1 0.5 clean un 3.48 415. 2247. 2893. 323.
## 6 BP02!u!nasal abunda 1 0.6 clean un 3.51 336. 2098. 2623. 263.
## 7 BP02!u!nasal abunda 1 0.7 clean un 3.53 336. 2272. 2869. 299.
## 8 BP02!u!nasal abunda 1 0.8 clean un 3.56 330. 1719. 2435. 358.
## 9 BP02!u!nasal abunda 1 0.9 clean un 3.58 302. 2337. 2391. 26.7
## 10 BP02!u!nasal abunda 1 1 clean un 3.60 280. 966. 2435. 734.
## # … with 182,140 more rows
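The behavior of unite and separate is easier to see on a table small enough to read in full. The following is a minimal sketch on invented toy data (the values are made up; only the column names echo the real dataset). It also shows the `remove = FALSE` argument, which keeps the original columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: values are invented for illustration
toy <- tibble(
  Speaker  = c("BP02", "BP02", "BP04"),
  Vowel    = c("u", "a", "i"),
  Nasality = c("nasal", "oral", "nasal")
)

# unite() pastes the columns together and drops the originals by default
united <- toy %>% unite(VowNas, Vowel, Nasality, sep = "_")

# remove = FALSE keeps Vowel and Nasality next to the new column
kept <- toy %>% unite(VowNas, Vowel, Nasality, sep = "_", remove = FALSE)

# separate() reverses the unite and recovers the two original columns
roundtrip <- united %>%
  separate(VowNas, into = c("Vowel", "Nasality"), sep = "_")

united
kept
roundtrip
```

If you need both the combined and the original variables later on, `remove = FALSE` saves you from having to recreate them.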
#Relating different data sets with dplyr
Many datasets, especially once you get into larger experiments, are spread across multiple tables. For these datasets, we need a set of tools that relates the tables to one another. There is a great cheat sheet found here to further explain these joins. You can also find more information in R4DS.
In order to show you how different types of joins work, I have included another set of data. It is output from a second set of Praat scripts, which were run to gather further acoustic measures related to nasality (see Styler 2017 for more information). This set of scripts was run at 5 normalized time points, rather than 10, throughout the vowels’ durations.
nasalitydata
## # A tibble: 32,145 x 34
## Speaker Word Vowel Nasality RepNo NormTime freq_f1 amp_f1 width_f1 freq_f2
## <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.2 352 13.9 680 557
## 2 BP02 abun… u nasal 1 0.4 469 13.1 133 2232
## 3 BP02 abun… u nasal 1 0.6 258 17.6 222 1253
## 4 BP02 abun… u nasal 1 0.8 281 15.7 103 1395
## 5 BP02 abun… u nasal 1 1 141 16.7 53 1220
## 6 BP02 abun… u nasal 2 0.2 422 13.1 120 2196
## 7 BP02 abun… u nasal 2 0.4 211 18.0 139 1324
## 8 BP02 abun… u nasal 2 0.6 234 18.2 140 1156
## 9 BP02 abun… u nasal 2 0.8 258 17.0 143 1609
## 10 BP02 abun… u nasal 2 1 258 17.0 57 860
## # … with 32,135 more rows, and 24 more variables: amp_f2 <dbl>, width_f2 <dbl>,
## # freq_f3 <dbl>, amp_f3 <dbl>, width_f3 <dbl>, freq_h1 <dbl>, amp_h1 <dbl>,
## # freq_h2 <dbl>, amp_h2 <dbl>, amp_h3 <dbl>, amp_p0 <dbl>, freq_p0 <dbl>,
## # p0_id <chr>, p0prominence <dbl>, a1p0_h1 <dbl>, a1p0_h2 <dbl>,
## # a1p0_h3 <dbl>, a1p0 <dbl>, a1p0_compensated <dbl>, freq_p1 <dbl>,
## # amp_p1 <dbl>, a1p1 <dbl>, a1p1_compensated <dbl>, a3p0 <dbl>
I will be making subsets of the datasets, which I will show you, in order to demonstrate these joins.
There are two types of joins, each with a few different options:
##Mutating Joins
First, we have an inner join. An inner join retains only the rows whose key values appear in both datasets.
nasalitydata %>% filter(Speaker =="BP02") %>% inner_join(normtimedata)
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 2,025 x 41
## Speaker Word Vowel Nasality RepNo NormTime freq_f1 amp_f1 width_f1 freq_f2
## <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.2 352 13.9 680 557
## 2 BP02 abun… u nasal 1 0.4 469 13.1 133 2232
## 3 BP02 abun… u nasal 1 0.6 258 17.6 222 1253
## 4 BP02 abun… u nasal 1 0.8 281 15.7 103 1395
## 5 BP02 abun… u nasal 1 1 141 16.7 53 1220
## 6 BP02 abun… u nasal 2 0.2 422 13.1 120 2196
## 7 BP02 abun… u nasal 2 0.4 211 18.0 139 1324
## 8 BP02 abun… u nasal 2 0.6 234 18.2 140 1156
## 9 BP02 abun… u nasal 2 0.8 258 17.0 143 1609
## 10 BP02 abun… u nasal 2 1 258 17.0 57 860
## # … with 2,015 more rows, and 31 more variables: amp_f2 <dbl>, width_f2 <dbl>,
## # freq_f3 <dbl>, amp_f3 <dbl>, width_f3 <dbl>, freq_h1 <dbl>, amp_h1 <dbl>,
## # freq_h2 <dbl>, amp_h2 <dbl>, amp_h3 <dbl>, amp_p0 <dbl>, freq_p0 <dbl>,
## # p0_id <chr>, p0prominence <dbl>, a1p0_h1 <dbl>, a1p0_h2 <dbl>,
## # a1p0_h3 <dbl>, a1p0 <dbl>, a1p0_compensated <dbl>, freq_p1 <dbl>,
## # amp_p1 <dbl>, a1p1 <dbl>, a1p1_compensated <dbl>, a3p0 <dbl>, Type <chr>,
## # label <chr>, Time <dbl>, F1 <dbl>, F2 <dbl>, F3 <dbl>, F2_3 <dbl>
nasalitydata %>% filter(Speaker =="BP21", RepNo < 20) %>% inner_join(filter(normtimedata, RepNo > 10))
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 540 x 41
## Speaker Word Vowel Nasality RepNo NormTime freq_f1 amp_f1 width_f1 freq_f2
## <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP21 abun… u nasal 11 0.2 211 25.2 142 906
## 2 BP21 abun… u nasal 11 0.4 422 24.0 79 846
## 3 BP21 abun… u nasal 11 0.6 234 24.7 129 833
## 4 BP21 abun… u nasal 11 0.8 234 25.6 158 897
## 5 BP21 abun… u nasal 11 1 211 26.1 398 530
## 6 BP21 abun… u nasal 12 0.2 211 22.0 120 850
## 7 BP21 abun… u nasal 12 0.4 211 22.4 103 829
## 8 BP21 abun… u nasal 12 0.6 211 24.3 125 752
## 9 BP21 abun… u nasal 12 0.8 211 24.2 131 872
## 10 BP21 abun… u nasal 12 1 211 24.3 242 427
## # … with 530 more rows, and 31 more variables: amp_f2 <dbl>, width_f2 <dbl>,
## # freq_f3 <dbl>, amp_f3 <dbl>, width_f3 <dbl>, freq_h1 <dbl>, amp_h1 <dbl>,
## # freq_h2 <dbl>, amp_h2 <dbl>, amp_h3 <dbl>, amp_p0 <dbl>, freq_p0 <dbl>,
## # p0_id <chr>, p0prominence <dbl>, a1p0_h1 <dbl>, a1p0_h2 <dbl>,
## # a1p0_h3 <dbl>, a1p0 <dbl>, a1p0_compensated <dbl>, freq_p1 <dbl>,
## # amp_p1 <dbl>, a1p1 <dbl>, a1p1_compensated <dbl>, a3p0 <dbl>, Type <chr>,
## # label <chr>, Time <dbl>, F1 <dbl>, F2 <dbl>, F3 <dbl>, F2_3 <dbl>
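The "Joining, by = ..." message in the output appears because no key was specified, so dplyr joined on every column name the two tables share. As a minimal sketch, here is an inner join on two invented toy tables with the key stated explicitly through the `by` argument. The table names `formants` and `nasality` and all values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy tables: values are invented for illustration
formants <- tibble(
  Speaker  = c("BP02", "BP02", "BP02", "BP04"),
  NormTime = c(0.2, 0.3, 0.4, 0.2),
  F1       = c(364, 373, 303, 410)
)
nasality <- tibble(
  Speaker  = c("BP02", "BP02", "BP21"),
  NormTime = c(0.2, 0.4, 0.2),
  a1p0     = c(1.5, 2.1, 0.7)
)

# Only the key combinations present in BOTH tables survive:
# (BP02, 0.2) and (BP02, 0.4)
inner <- inner_join(formants, nasality, by = c("Speaker", "NormTime"))
inner
```

Stating `by` explicitly silences the message and protects you if the two tables happen to share a column name that is not meant to be part of the key.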
Whereas an inner join keeps observations that appear in both datasets, outer joins keep observations that appear in at least one dataset. There are three types of outer joins: left joins, right joins, and full joins.
A left join will retain all observations that are in the first dataset. Conversely, a right join will retain all observations that are in the second dataset. These joins work by adding an additional “virtual” observation to each table. This observation has a key that always matches (if no other key matches), and a value filled with NA. Left joins are the ones you will use most often in the “wild.”
dim(nasalitydata)
## [1] 32145 34
dim(normtimedata)
## [1] 182150 13
nasalitydata %>% left_join(normtimedata)
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 32,145 x 41
## Speaker Word Vowel Nasality RepNo NormTime freq_f1 amp_f1 width_f1 freq_f2
## <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.2 352 13.9 680 557
## 2 BP02 abun… u nasal 1 0.4 469 13.1 133 2232
## 3 BP02 abun… u nasal 1 0.6 258 17.6 222 1253
## 4 BP02 abun… u nasal 1 0.8 281 15.7 103 1395
## 5 BP02 abun… u nasal 1 1 141 16.7 53 1220
## 6 BP02 abun… u nasal 2 0.2 422 13.1 120 2196
## 7 BP02 abun… u nasal 2 0.4 211 18.0 139 1324
## 8 BP02 abun… u nasal 2 0.6 234 18.2 140 1156
## 9 BP02 abun… u nasal 2 0.8 258 17.0 143 1609
## 10 BP02 abun… u nasal 2 1 258 17.0 57 860
## # … with 32,135 more rows, and 31 more variables: amp_f2 <dbl>, width_f2 <dbl>,
## # freq_f3 <dbl>, amp_f3 <dbl>, width_f3 <dbl>, freq_h1 <dbl>, amp_h1 <dbl>,
## # freq_h2 <dbl>, amp_h2 <dbl>, amp_h3 <dbl>, amp_p0 <dbl>, freq_p0 <dbl>,
## # p0_id <chr>, p0prominence <dbl>, a1p0_h1 <dbl>, a1p0_h2 <dbl>,
## # a1p0_h3 <dbl>, a1p0 <dbl>, a1p0_compensated <dbl>, freq_p1 <dbl>,
## # amp_p1 <dbl>, a1p1 <dbl>, a1p1_compensated <dbl>, a3p0 <dbl>, Type <chr>,
## # label <chr>, Time <dbl>, F1 <dbl>, F2 <dbl>, F3 <dbl>, F2_3 <dbl>
normtimedata %>% filter(NormTime ==0.3) %>% left_join(nasalitydata)
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 18,215 x 41
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.3 clean un 3.43 373. 2258.
## 2 BP02 abun… u nasal 2 0.3 clean un 5.86 287. 1767.
## 3 BP02 abun… u nasal 3 0.3 clean un 8.46 321. 2249.
## 4 BP02 abun… u nasal 4 0.3 clean un 11.2 347. 977.
## 5 BP02 abun… u nasal 5 0.3 clean un 14.0 369. 1342.
## 6 BP02 abun… u nasal 6 0.3 clean un 16.3 352. 2245.
## 7 BP02 abun… u nasal 7 0.3 clean un 18.9 278. 1958.
## 8 BP02 abun… u nasal 8 0.3 clean un 21.7 272. 1696.
## 9 BP02 abun… u nasal 9 0.3 clean un 24.5 399. 2166.
## 10 BP02 abun… u nasal 10 0.3 clean un 27.2 343. 1389.
## # … with 18,205 more rows, and 30 more variables: F3 <dbl>, F2_3 <dbl>,
## # freq_f1 <dbl>, amp_f1 <dbl>, width_f1 <dbl>, freq_f2 <dbl>, amp_f2 <dbl>,
## # width_f2 <dbl>, freq_f3 <dbl>, amp_f3 <dbl>, width_f3 <dbl>, freq_h1 <dbl>,
## # amp_h1 <dbl>, freq_h2 <dbl>, amp_h2 <dbl>, amp_h3 <dbl>, amp_p0 <dbl>,
## # freq_p0 <dbl>, p0_id <chr>, p0prominence <dbl>, a1p0_h1 <dbl>,
## # a1p0_h2 <dbl>, a1p0_h3 <dbl>, a1p0 <dbl>, a1p0_compensated <dbl>,
## # freq_p1 <dbl>, amp_p1 <dbl>, a1p1 <dbl>, a1p1_compensated <dbl>, a3p0 <dbl>
nasalitydata %>% filter(NormTime ==0.2, Vowel =="a") %>% left_join(normtimedata)
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 2,155 x 41
## Speaker Word Vowel Nasality RepNo NormTime freq_f1 amp_f1 width_f1 freq_f2
## <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP02 baba… a oral 1 0.2 539 16.2 100 1078
## 2 BP02 baba… a oral 2 0.2 586 20.5 63 1148
## 3 BP02 baba… a oral 3 0.2 633 18.0 64 1128
## 4 BP02 baba… a oral 4 0.2 633 20.3 81 1154
## 5 BP02 baba… a oral 5 0.2 633 19.9 45 1101
## 6 BP02 baba… a oral 6 0.2 633 20.2 46 1107
## 7 BP02 baba… a oral 7 0.2 633 19.9 53 1133
## 8 BP02 baba… a oral 8 0.2 586 18.7 58 1089
## 9 BP02 baba… a oral 9 0.2 609 15.7 63 1055
## 10 BP02 baba… a oral 10 0.2 586 18.9 49 1117
## # … with 2,145 more rows, and 31 more variables: amp_f2 <dbl>, width_f2 <dbl>,
## # freq_f3 <dbl>, amp_f3 <dbl>, width_f3 <dbl>, freq_h1 <dbl>, amp_h1 <dbl>,
## # freq_h2 <dbl>, amp_h2 <dbl>, amp_h3 <dbl>, amp_p0 <dbl>, freq_p0 <dbl>,
## # p0_id <chr>, p0prominence <dbl>, a1p0_h1 <dbl>, a1p0_h2 <dbl>,
## # a1p0_h3 <dbl>, a1p0 <dbl>, a1p0_compensated <dbl>, freq_p1 <dbl>,
## # amp_p1 <dbl>, a1p1 <dbl>, a1p1_compensated <dbl>, a3p0 <dbl>, Type <chr>,
## # label <chr>, Time <dbl>, F1 <dbl>, F2 <dbl>, F3 <dbl>, F2_3 <dbl>
nasalitydata %>% filter(NormTime == 0.2, Vowel =="u") %>% right_join(normtimedata)
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 182,150 x 41
## Speaker Word Vowel Nasality RepNo NormTime freq_f1 amp_f1 width_f1 freq_f2
## <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.2 352 13.9 680 557
## 2 BP02 abun… u nasal 2 0.2 422 13.1 120 2196
## 3 BP02 abun… u nasal 3 0.2 398 12.6 121 2268
## 4 BP02 abun… u nasal 4 0.2 422 13.9 1311 504
## 5 BP02 abun… u nasal 5 0.2 445 13.1 131 1754
## 6 BP02 abun… u nasal 6 0.2 398 14.9 186 551
## 7 BP02 abun… u nasal 7 0.2 305 9.55 303 618
## 8 BP02 abun… u nasal 8 0.2 422 10.7 656 530
## 9 BP02 abun… u nasal 9 0.2 211 16.1 212 674
## 10 BP02 abun… u nasal 10 0.2 398 14.0 166 679
## # … with 182,140 more rows, and 31 more variables: amp_f2 <dbl>,
## # width_f2 <dbl>, freq_f3 <dbl>, amp_f3 <dbl>, width_f3 <dbl>, freq_h1 <dbl>,
## # amp_h1 <dbl>, freq_h2 <dbl>, amp_h2 <dbl>, amp_h3 <dbl>, amp_p0 <dbl>,
## # freq_p0 <dbl>, p0_id <chr>, p0prominence <dbl>, a1p0_h1 <dbl>,
## # a1p0_h2 <dbl>, a1p0_h3 <dbl>, a1p0 <dbl>, a1p0_compensated <dbl>,
## # freq_p1 <dbl>, amp_p1 <dbl>, a1p1 <dbl>, a1p1_compensated <dbl>,
## # a3p0 <dbl>, Type <chr>, label <chr>, Time <dbl>, F1 <dbl>, F2 <dbl>,
## # F3 <dbl>, F2_3 <dbl>
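To see where the NA values in a left or right join come from, it helps to try both on tables small enough to inspect by eye. This is a sketch on invented toy data (the table names and values are made up for illustration).

```r
library(dplyr)

# Hypothetical toy tables: values are invented for illustration
formants <- tibble(
  Speaker  = c("BP02", "BP02", "BP02", "BP04"),
  NormTime = c(0.2, 0.3, 0.4, 0.2),
  F1       = c(364, 373, 303, 410)
)
nasality <- tibble(
  Speaker  = c("BP02", "BP02", "BP21"),
  NormTime = c(0.2, 0.4, 0.2),
  a1p0     = c(1.5, 2.1, 0.7)
)

# Left join: every row of formants is kept; a1p0 is NA where nasality has no match
left <- left_join(formants, nasality, by = c("Speaker", "NormTime"))

# Right join: every row of nasality is kept; F1 is NA where formants has no match
right <- right_join(formants, nasality, by = c("Speaker", "NormTime"))

left
right
```

Because the keys here are unique in both tables, the left join has as many rows as `formants` and the right join has as many rows as `nasality`.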
The last type of outer join is a full join. It keeps all observations from both datasets. Remember that when you use the pipe, the dataset on the left-hand side of the pipe becomes the first argument of the full_join() function, so any filtering you do before the join applies only to that dataset. Keep that in mind!
normtimedata %>% full_join(nasalitydata)
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 182,155 x 41
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.1 clean un 3.39 569. 2372.
## 2 BP02 abun… u nasal 1 0.2 clean un 3.41 364. 2297.
## 3 BP02 abun… u nasal 1 0.3 clean un 3.43 373. 2258.
## 4 BP02 abun… u nasal 1 0.4 clean un 3.46 303. 2165.
## 5 BP02 abun… u nasal 1 0.5 clean un 3.48 415. 2247.
## 6 BP02 abun… u nasal 1 0.6 clean un 3.51 336. 2098.
## 7 BP02 abun… u nasal 1 0.7 clean un 3.53 336. 2272.
## 8 BP02 abun… u nasal 1 0.8 clean un 3.56 330. 1719.
## 9 BP02 abun… u nasal 1 0.9 clean un 3.58 302. 2337.
## 10 BP02 abun… u nasal 1 1 clean un 3.60 280. 966.
## # … with 182,145 more rows, and 30 more variables: F3 <dbl>, F2_3 <dbl>,
## # freq_f1 <dbl>, amp_f1 <dbl>, width_f1 <dbl>, freq_f2 <dbl>, amp_f2 <dbl>,
## # width_f2 <dbl>, freq_f3 <dbl>, amp_f3 <dbl>, width_f3 <dbl>, freq_h1 <dbl>,
## # amp_h1 <dbl>, freq_h2 <dbl>, amp_h2 <dbl>, amp_h3 <dbl>, amp_p0 <dbl>,
## # freq_p0 <dbl>, p0_id <chr>, p0prominence <dbl>, a1p0_h1 <dbl>,
## # a1p0_h2 <dbl>, a1p0_h3 <dbl>, a1p0 <dbl>, a1p0_compensated <dbl>,
## # freq_p1 <dbl>, amp_p1 <dbl>, a1p1 <dbl>, a1p1_compensated <dbl>, a3p0 <dbl>
normtimedata %>% filter(NormTime == 0.4) %>% full_join(nasalitydata) %>% arrange(Speaker, Vowel, Nasality, RepNo, NormTime)
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 43,932 x 41
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 tapa… a nasal 1 0.2 <NA> <NA> NA NA NA
## 2 BP02 tapa… a nasal 1 0.4 clean an 3.67 337. 1734.
## 3 BP02 tapa… a nasal 1 0.6 <NA> <NA> NA NA NA
## 4 BP02 tapa… a nasal 1 0.8 <NA> <NA> NA NA NA
## 5 BP02 tapa… a nasal 1 1 <NA> <NA> NA NA NA
## 6 BP02 tapa… a nasal 2 0.2 <NA> <NA> NA NA NA
## 7 BP02 tapa… a nasal 2 0.4 clean an 6.22 443. 1966.
## 8 BP02 tapa… a nasal 2 0.6 <NA> <NA> NA NA NA
## 9 BP02 tapa… a nasal 2 0.8 <NA> <NA> NA NA NA
## 10 BP02 tapa… a nasal 2 1 <NA> <NA> NA NA NA
## # … with 43,922 more rows, and 30 more variables: F3 <dbl>, F2_3 <dbl>,
## # freq_f1 <dbl>, amp_f1 <dbl>, width_f1 <dbl>, freq_f2 <dbl>, amp_f2 <dbl>,
## # width_f2 <dbl>, freq_f3 <dbl>, amp_f3 <dbl>, width_f3 <dbl>, freq_h1 <dbl>,
## # amp_h1 <dbl>, freq_h2 <dbl>, amp_h2 <dbl>, amp_h3 <dbl>, amp_p0 <dbl>,
## # freq_p0 <dbl>, p0_id <chr>, p0prominence <dbl>, a1p0_h1 <dbl>,
## # a1p0_h2 <dbl>, a1p0_h3 <dbl>, a1p0 <dbl>, a1p0_compensated <dbl>,
## # freq_p1 <dbl>, amp_p1 <dbl>, a1p1 <dbl>, a1p1_compensated <dbl>, a3p0 <dbl>
Why does the second example here only have NormTime values of 0.2, 0.4, 0.6, 0.8, and 1?
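One way to approach a question like this is to repeat the same pattern on tables small enough to read in full. The following sketch uses invented toy data (names and values are made up for illustration) and compares a full join with and without filtering the first table beforehand.

```r
library(dplyr)

# Hypothetical toy tables: values are invented for illustration
formants <- tibble(
  Speaker  = "BP02",
  NormTime = c(0.1, 0.2, 0.3, 0.4),
  F1       = c(569, 364, 373, 303)
)
nasality <- tibble(
  Speaker  = "BP02",
  NormTime = c(0.2, 0.4),
  a1p0     = c(1.5, 2.1)
)

# No filtering: the result covers every time point found in either table
full_all <- formants %>%
  full_join(nasality, by = c("Speaker", "NormTime"))

# Filtering first: only the 0.4 row of formants enters the join,
# so every other time point in the result has to come from nasality
full_filtered <- formants %>%
  filter(NormTime == 0.4) %>%
  full_join(nasality, by = c("Speaker", "NormTime")) %>%
  arrange(Speaker, NormTime)

full_all
full_filtered
```

Compare which NormTime values appear in each result, and which rows have NA in F1.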
##Filtering joins
Unlike mutating joins, filtering joins do not combine multiple datasets. They only affect the observations from the first dataset. (You can think of it as using data that is or isn’t in the second set to filter the first set, based on a particular key.)
A semi join will keep all observations in the first dataset which have a match in the second dataset. An anti join will discard these observations, and will only keep the observations in the first dataset that don’t have a match in the second set.
normtimedata %>% semi_join(nasalitydata)
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 32,140 x 13
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.2 clean un 3.41 364. 2297.
## 2 BP02 abun… u nasal 1 0.4 clean un 3.46 303. 2165.
## 3 BP02 abun… u nasal 1 0.6 clean un 3.51 336. 2098.
## 4 BP02 abun… u nasal 1 0.8 clean un 3.56 330. 1719.
## 5 BP02 abun… u nasal 1 1 clean un 3.60 280. 966.
## 6 BP02 abun… u nasal 2 0.2 clean un 5.83 326. 2185.
## 7 BP02 abun… u nasal 2 0.4 clean un 5.88 334. 1471.
## 8 BP02 abun… u nasal 2 0.6 clean un 5.93 310. 1371.
## 9 BP02 abun… u nasal 2 0.8 clean un 5.97 306. 1308.
## 10 BP02 abun… u nasal 2 1 clean un 6.02 305. 2169.
## # … with 32,130 more rows, and 2 more variables: F3 <dbl>, F2_3 <dbl>
nasalitydata %>% semi_join(normtimedata)
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 32,140 x 34
## Speaker Word Vowel Nasality RepNo NormTime freq_f1 amp_f1 width_f1 freq_f2
## <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.2 352 13.9 680 557
## 2 BP02 abun… u nasal 1 0.4 469 13.1 133 2232
## 3 BP02 abun… u nasal 1 0.6 258 17.6 222 1253
## 4 BP02 abun… u nasal 1 0.8 281 15.7 103 1395
## 5 BP02 abun… u nasal 1 1 141 16.7 53 1220
## 6 BP02 abun… u nasal 2 0.2 422 13.1 120 2196
## 7 BP02 abun… u nasal 2 0.4 211 18.0 139 1324
## 8 BP02 abun… u nasal 2 0.6 234 18.2 140 1156
## 9 BP02 abun… u nasal 2 0.8 258 17.0 143 1609
## 10 BP02 abun… u nasal 2 1 258 17.0 57 860
## # … with 32,130 more rows, and 24 more variables: amp_f2 <dbl>, width_f2 <dbl>,
## # freq_f3 <dbl>, amp_f3 <dbl>, width_f3 <dbl>, freq_h1 <dbl>, amp_h1 <dbl>,
## # freq_h2 <dbl>, amp_h2 <dbl>, amp_h3 <dbl>, amp_p0 <dbl>, freq_p0 <dbl>,
## # p0_id <chr>, p0prominence <dbl>, a1p0_h1 <dbl>, a1p0_h2 <dbl>,
## # a1p0_h3 <dbl>, a1p0 <dbl>, a1p0_compensated <dbl>, freq_p1 <dbl>,
## # amp_p1 <dbl>, a1p1 <dbl>, a1p1_compensated <dbl>, a3p0 <dbl>
normtimedata %>% anti_join(nasalitydata)
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 150,010 x 13
## Speaker Word Vowel Nasality RepNo NormTime Type label Time F1 F2
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BP02 abun… u nasal 1 0.1 clean un 3.39 569. 2372.
## 2 BP02 abun… u nasal 1 0.3 clean un 3.43 373. 2258.
## 3 BP02 abun… u nasal 1 0.5 clean un 3.48 415. 2247.
## 4 BP02 abun… u nasal 1 0.7 clean un 3.53 336. 2272.
## 5 BP02 abun… u nasal 1 0.9 clean un 3.58 302. 2337.
## 6 BP02 abun… u nasal 2 0.1 clean un 5.81 492. 2277.
## 7 BP02 abun… u nasal 2 0.3 clean un 5.86 287. 1767.
## 8 BP02 abun… u nasal 2 0.5 clean un 5.90 367. 2301.
## 9 BP02 abun… u nasal 2 0.7 clean un 5.95 385. 2114.
## 10 BP02 abun… u nasal 2 0.9 clean un 6.00 285. 1020.
## # … with 150,000 more rows, and 2 more variables: F3 <dbl>, F2_3 <dbl>
nasalitydata %>% anti_join(normtimedata)
## Joining, by = c("Speaker", "Word", "Vowel", "Nasality", "RepNo", "NormTime")
## # A tibble: 5 x 34
## Speaker Word Vowel Nasality RepNo NormTime freq_f1 amp_f1 width_f1 freq_f2
## <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BP04 trib… <NA> <NA> 44 0.2 164 23.9 77 1110
## 2 BP04 trib… <NA> <NA> 44 0.4 188 18.9 33 1319
## 3 BP04 trib… <NA> <NA> 44 0.6 117 2.71 461 1660
## 4 BP04 trib… <NA> <NA> 44 0.8 234 24.0 36 1210
## 5 BP04 trib… <NA> <NA> 44 1 164 23.2 137 828
## # … with 24 more variables: amp_f2 <dbl>, width_f2 <dbl>, freq_f3 <dbl>,
## # amp_f3 <dbl>, width_f3 <dbl>, freq_h1 <dbl>, amp_h1 <dbl>, freq_h2 <dbl>,
## # amp_h2 <dbl>, amp_h3 <dbl>, amp_p0 <dbl>, freq_p0 <dbl>, p0_id <chr>,
## # p0prominence <dbl>, a1p0_h1 <dbl>, a1p0_h2 <dbl>, a1p0_h3 <dbl>,
## # a1p0 <dbl>, a1p0_compensated <dbl>, freq_p1 <dbl>, amp_p1 <dbl>,
## # a1p1 <dbl>, a1p1_compensated <dbl>, a3p0 <dbl>
As it turns out, there was one token in the nasality data (speaker BP04, repetition 44) that, for some reason, was not calculated in the normtime dataset.
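Filtering joins can be sketched on invented toy data as well (table names and values are made up for illustration). Note that the result only ever contains columns from the first table, and that the semi join and anti join together account for every row of it.

```r
library(dplyr)

# Hypothetical toy tables: values are invented for illustration
formants <- tibble(
  Speaker  = c("BP02", "BP02", "BP02", "BP04"),
  NormTime = c(0.2, 0.3, 0.4, 0.2),
  F1       = c(364, 373, 303, 410)
)
nasality <- tibble(
  Speaker  = c("BP02", "BP02", "BP21"),
  NormTime = c(0.2, 0.4, 0.2),
  a1p0     = c(1.5, 2.1, 0.7)
)

# Rows of formants that have a match in nasality (no a1p0 column is added)
matched <- semi_join(formants, nasality, by = c("Speaker", "NormTime"))

# Rows of formants that have no match in nasality
unmatched <- anti_join(formants, nasality, by = c("Speaker", "NormTime"))

matched
unmatched
```

An anti join is a handy diagnostic before a mutating join: it lists exactly which keys will fail to match, as in the BP04 case above.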
#My data processing code
This is the processing code I used to get the data from my dissertation out of .txt form and into R. Note that it is not the final version of the data. Rather, it is processed to the point where it can be worked with in this module. Further processing was done, specifically using the select() and filter() verbs, to create my final dataset, which is not shown here.
Some of this is based on methods found on this website.
normtimewd = "/Users/marissabarlaz/Desktop/Work/LING/LING 490/Data/All NormTime Formant"
#dir() expects a regular expression, so convert the glob with glob2rx()
allfiles = dir(normtimewd, pattern = glob2rx("*.normtime_formant"))
normtimedata <- tibble(filename = allfiles) %>%
  mutate(file_contents = map(filename, ~ read_tsv(file.path(normtimewd, .), skip = 1, col_names = c("label", "Time", "F1", "F2", "F3", "F2_3")))) %>%
  #skip = 1 because there's a problem with the column names
  unnest(file_contents) %>%
  separate(filename, into = c("Type", "Speaker", "Position", "Word"), extra = "drop") %>%
  #number the 10 time points of each token as 0.1, 0.2, ..., 0.9, 0 (the 0 is recoded to 1 below)
  mutate(NormTime = (1:length(Time) %% 10) * 0.1,
         label1 = label) %>%
  #sep = 1 splits after the first character: vowel first, then the nasality code
  separate(label1, into = c("Vowel", "Nasality"), sep = 1) %>%
  select(-c(Position))
normtimedata$NormTime[normtimedata$NormTime == 0.0] = 1
#converting to text and back has the effect of stripping floating point noise, so NormTime can match exactly in joins
normtimedata$NormTime = as.numeric(trimws(normtimedata$NormTime))
normtimedata$Nasality = plyr::mapvalues(normtimedata$Nasality, from = c("~", "m", "n", "o", "s"), to = c("nasalized", "nasal_final", "nasal", "oral", "nasalized"))
#number the repetitions of each word within speaker and time point
normtimedata = normtimedata %>%
  group_by(Speaker, Word, NormTime) %>%
  mutate(RepNo = seq(n())) %>%
  ungroup() %>%
  select(Speaker, Word, Vowel, Nasality, RepNo, NormTime, everything())
#write_csv(normtimedata, file.path(normtimewd, "normtimedata.csv"))
### NASALITY DATA PROCESSING
nasalitywd = "/Users/marissabarlaz/Desktop/Work/LING/LING 490/Data/Acoustic Nasality Data"
#dir() expects a regular expression, so convert the glob with glob2rx()
allnasalfiles = dir(nasalitywd, pattern = glob2rx("*.txt"))
nasalitydata <- tibble(filename = allnasalfiles) %>%
  mutate(file_contents = map(filename, ~ read_tsv(file.path(nasalitywd, .), skip = 0))) %>%
  #unnest() is left without arguments: the later steps depend on the column names it produces (e.g. filename1)
  unnest() %>%
  rename(label = vowel) %>%
  separate(filename1, into = c("Speaker", "Position", "Word")) %>%
  #timepoint runs from 1 to 5, so dividing by 5 gives 0.2, 0.4, ..., 1
  mutate(label1 = label,
         NormTime = timepoint/5) %>%
  separate(label1, into = c("Vowel", "Nasality"), sep = 1) %>%
  select(-c("filename", "word", "label", "timepoint", "point_time")) %>%
  select(-c(Position, vwl_amp_rms:errorflag))
nasalitydata$Nasality = plyr::mapvalues(nasalitydata$Nasality, from = c("~", "m", "n", "o", "s"), to = c("nasalized", "nasal_final", "nasal", "oral", "nasalized"))
#number the repetitions of each word within speaker and time point
nasalitydata = nasalitydata %>%
  group_by(Speaker, Word, NormTime) %>%
  mutate(RepNo = seq(n())) %>%
  ungroup() %>%
  select(Speaker, Word, Vowel, Nasality, RepNo, NormTime, everything())
#write_csv(nasalitydata, file.path(normtimewd, "nasalitydata_styler_all.csv"))
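Since the code above depends on files that are not distributed with this module, here is a small self-contained sketch of the same read-many-files pattern. It writes two invented files to a temporary folder, reads them with map(), unnests them into one table, and recovers the metadata from the file names. The `.toy` extension, the file names, and all values are hypothetical stand-ins, not the real data.

```r
library(dplyr)
library(purrr)
library(readr)
library(tidyr)

# Hypothetical stand-in files, written to a temporary folder
tmp <- file.path(tempdir(), "toy_formants")
dir.create(tmp, showWarnings = FALSE)
write_tsv(tibble(label = c("un", "un"), F1 = c(569, 364)),
          file.path(tmp, "clean_BP02_1_abunda.toy"))
write_tsv(tibble(label = c("ao", "ao"), F1 = c(650, 700)),
          file.path(tmp, "clean_BP04_1_babado.toy"))

# List the files; glob2rx() turns the glob into the regex that dir() expects
files <- dir(tmp, pattern = utils::glob2rx("*.toy"))

combined <- tibble(filename = files) %>%
  # One list-column entry per file, each holding that file's contents
  mutate(file_contents = map(filename,
                             ~ read_tsv(file.path(tmp, .x), col_types = cols()))) %>%
  # Stack the per-file tables into one long table
  unnest(file_contents) %>%
  # Split the file name on non-alphanumeric characters; drop the extension
  separate(filename, into = c("Type", "Speaker", "Position", "Word"),
           extra = "drop") %>%
  # Split the label after its first character, as in the code above
  separate(label, into = c("Vowel", "Nasality"), sep = 1)

combined
```

Keeping the file name as a column until after unnest() is what lets each row remember which speaker and word it came from.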