
Identifying R Functions & Packages in Github Gists (funspotr part 2)

Tags: dplyr purrr stringr funspotr readr DT fs rstudioapi

This post is part two in a series of posts introducing funspotr. See also:

This post shows how funspotr can also be applied to parse gists:

A problem I bumped into was that most of Chelsea’s gists don’t actually have .R or .Rmd extensions, so my approach skipped most of her snippets. I wanted to parse my own gists but ran into a related problem: most of my GitHub gist code snippets are saved as .md files [1], so knitr::purl() won’t work [2].

In this post I…

  1. create a function to extract code chunks from simple .md files
  2. parse the functions and packages in my code using funspotr [3].

This post was updated on 2023-10-11 to make it consistent with updated {funspotr} code. Tables were also updated to reflect brshallo gists at this time. The following post on network plots however was not updated.

Parsing code

First I used funspotr to get a table with all of my gists.

library(dplyr)
library(purrr)
library(stringr)
library(funspotr)

brshallo_gists <- funspotr::list_files_github_gists("brshallo", pattern = ".")

brshallo_gists
## # A tibble: 112 × 2
##    relative_paths                               absolute_paths                  
##    <chr>                                        <chr>                           
##  1 find_in_files.R                              https://gist.githubusercontent.…
##  2 permits-issued.md                            https://gist.githubusercontent.…
##  3 seattle-units-added-new-permits.md           https://gist.githubusercontent.…
##  4 rolling-mean-conditioned-date.R              https://gist.githubusercontent.…
##  5 rolling-mean-conditioned-on-iteration-date.R https://gist.githubusercontent.…
##  6 lag-multiple-across.md                       https://gist.githubusercontent.…
##  7 log-log-transform-example.md                 https://gist.githubusercontent.…
##  8 convert-currencies.R                         https://gist.githubusercontent.…
##  9 unique-set-speed-test.R                      https://gist.githubusercontent.…
## 10 unique-combinations.R                        https://gist.githubusercontent.…
## # ℹ 102 more rows

Parsing R files

funspotr is already set up to parse all the unique functions and packages from R or Rmd files.

r_gists <- brshallo_gists %>% 
  filter(funspotr:::str_detect_r_docs(relative_paths))

r_gists_parsed <- funspotr::spot_funs_files(r_gists)

r_gists_unnested <- r_gists_parsed %>% 
  funspotr::unnest_results()

A warning message (hidden from this post) indicates that a couple of files did not parse correctly. In this case those files were created from reprexes rendered as .md output but saved as .R files, so they failed parsing.
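One way to find such files ahead of time is to check whether each one parses as R at all. The following is a minimal sketch using base R's parse(); parses_as_r() is a hypothetical helper (not part of funspotr), and the two temporary files are made-up stand-ins for a real script and a reprex saved with the wrong extension.

```r
# Hypothetical helper: TRUE when the file can be parsed as R code, FALSE otherwise
parses_as_r <- function(path) {
  tryCatch({
    parse(file = path, keep.source = FALSE)
    TRUE
  }, error = function(e) FALSE)
}

# Build the markdown fence programmatically to keep this example easy to copy
fence <- strrep("`", 3)

# A made-up file containing plain R code
good <- tempfile(fileext = ".R")
writeLines(c("library(stats)", "x <- rnorm(10)", "mean(x)"), good)

# A made-up file with reprex-style markdown saved under an .R extension
bad <- tempfile(fileext = ".R")
writeLines(c(paste(fence, "r"), "x <- 1", fence, "", "Some prose here"), bad)

# Named logical vector flagging which files would parse
vapply(c(good = good, bad = bad), parses_as_r, logical(1))
```

Files flagged FALSE could then be routed through the markdown approach instead of being dropped.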

r_gists_unnested
## # A tibble: 701 × 4
##    funs      pkgs  relative_paths                  absolute_paths               
##    <chr>     <chr> <chr>                           <chr>                        
##  1 library   base  find_in_files.R                 https://gist.githubuserconte…
##  2 dir_ls    fs    find_in_files.R                 https://gist.githubuserconte…
##  3 map       purrr find_in_files.R                 https://gist.githubuserconte…
##  4 grep      base  find_in_files.R                 https://gist.githubuserconte…
##  5 readLines base  find_in_files.R                 https://gist.githubuserconte…
##  6 keep      purrr find_in_files.R                 https://gist.githubuserconte…
##  7 length    base  find_in_files.R                 https://gist.githubuserconte…
##  8 library   base  rolling-mean-conditioned-date.R https://gist.githubuserconte…
##  9 seq       base  rolling-mean-conditioned-date.R https://gist.githubuserconte…
## 10 map       purrr rolling-mean-conditioned-date.R https://gist.githubuserconte…
## # ℹ 691 more rows

Parsing markdown files

To parse my .md files, I wrote a function, extract_code_md(), that…

  • reads in a file
  • extracts the text in code chunks [4]
  • saves it to a temporary file
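  • returns the file path of the temporary file

# keep the even-indexed elements of x
subset_even <- function(x) x[seq_along(x) %% 2 == 0]

# extract the code between fences in a .md file and write it to a temporary .R file
extract_code_md <- function(file_path){
  
  lines <- readr::read_file(file_path) %>% 
    # split on each fence line; text between an opening and a closing fence
    # lands at the even positions
    stringr::str_split("```.*", simplify = TRUE) %>%
    subset_even() %>% 
    stringr::str_flatten("\n## new chunk \n")
  
  file_output <- tempfile(fileext = ".R")
  writeLines(lines, file_output)
  file_output
}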
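To see why keeping the even-indexed pieces works, here is a minimal sketch on a made-up markdown string (the sample text is invented for illustration and is not one of the actual gists). It swaps in base R's strsplit() for stringr::str_split() so that it runs without extra packages; the splitting idea is the same.

```r
# Build the markdown fence programmatically to keep this example easy to copy
fence <- strrep("`", 3)

# Hypothetical reprex-style markdown with two code chunks
md_text <- paste(
  "Some intro text.",
  "",
  paste(fence, "r"),
  "x <- c(1, 2, 3)",
  fence,
  "",
  "More prose.",
  "",
  paste(fence, "r"),
  "mean(x)",
  fence,
  sep = "\n"
)

# Split on every fence line; "[^\n]*" consumes the rest of the fence line
pieces <- strsplit(md_text, paste0(fence, "[^\n]*"), perl = TRUE)[[1]]

# Pieces alternate prose, code, prose, code, so even positions hold the code
subset_even <- function(x) x[seq_along(x) %% 2 == 0]
code_pieces <- subset_even(pieces)

# Join the chunks the same way extract_code_md() does
cat(paste(code_pieces, collapse = "\n## new chunk \n"))
```

This only holds when fences are balanced and prose never contains a stray triple backtick, which is part of why the approach is less reliable than knitr::purl().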

I map extract_code_md() on all the .md gists and then parse the files using funspotr.

md_gists <- brshallo_gists %>% 
  filter(!funspotr:::str_detect_r_docs(relative_paths))

md_gists_local <- md_gists %>% 
  rename(urls = absolute_paths) %>% 
  # name absolute_paths because that's what funspotr::spot_funs_files() expects
  mutate(absolute_paths = map_chr(urls, extract_code_md))

md_gists_parsed <- funspotr::spot_funs_files(md_gists_local) %>% 
  mutate(absolute_paths = urls) %>% 
  select(-urls)
  
md_gists_unnested <- md_gists_parsed %>% 
  funspotr::unnest_results()

Some of these files also failed to parse. The warnings are hidden because the code chunks in this post set warning = FALSE, and the failed files are simply absent from the unnested output.

md_gists_unnested
## # A tibble: 1,061 x 5
##    funs           pkgs    in_multiple_pkgs contents                 urls        
##    <chr>          <chr>   <lgl>            <chr>                    <chr>       
##  1 library        base    FALSE            grouped-nested-t-test.md "C:\\Users\~
##  2 require        base    FALSE            grouped-nested-t-test.md "C:\\Users\~
##  3 install_github remotes FALSE            grouped-nested-t-test.md "C:\\Users\~
##  4 na.omit        stats   FALSE            grouped-nested-t-test.md "C:\\Users\~
##  5 t.test         stats   FALSE            grouped-nested-t-test.md "C:\\Users\~
##  6 tidy           broom   FALSE            grouped-nested-t-test.md "C:\\Users\~
##  7 pull           dplyr   FALSE            grouped-nested-t-test.md "C:\\Users\~
##  8 group_by       dplyr   FALSE            grouped-nested-t-test.md "C:\\Users\~
##  9 summarise      dplyr   FALSE            grouped-nested-t-test.md "C:\\Users\~
## 10 list           base    FALSE            grouped-nested-t-test.md "C:\\Users\~
## # ... with 1,051 more rows

Note that I’m assuming all the code snippets are R code [5].
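If some gists mixed languages, one option would be to keep only the chunks whose opening fence is tagged as R. The following is a rough sketch under that assumption; extract_r_chunks() is a hypothetical helper (not part of funspotr), and it assumes reprex-style fences where the text after the backticks is r or {r}.

```r
# Hypothetical helper: return only the lines inside fences tagged as R
extract_r_chunks <- function(md_text) {
  fence <- strrep("`", 3)
  lines <- strsplit(md_text, "\n", fixed = TRUE)[[1]]
  in_chunk <- FALSE
  keep_chunk <- FALSE
  out <- character()
  for (line in lines) {
    if (startsWith(line, fence)) {
      if (!in_chunk) {
        # Opening fence: read the language tag that follows the backticks
        info <- tolower(trimws(sub("^`+", "", line)))
        keep_chunk <- info %in% c("r", "{r}")
        in_chunk <- TRUE
      } else {
        # Closing fence: reset the state
        in_chunk <- FALSE
        keep_chunk <- FALSE
      }
    } else if (in_chunk && keep_chunk) {
      out <- c(out, line)
    }
  }
  out
}

# Made-up markdown with one R chunk and one Python chunk
fence <- strrep("`", 3)
md_text <- paste(
  paste(fence, "r"), "x <- 1", fence,
  "",
  paste(fence, "python"), "print(1)", fence,
  sep = "\n"
)

extract_r_chunks(md_text)
```

Untagged fences are dropped by this sketch, so it would only suit gists where every R chunk carries a language tag.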

Binding files together

I bind these files together and then arrange them based on the initial order in brshallo_gists [6].

gists_unnested <- bind_rows(
  r_gists_unnested,
  md_gists_unnested
) %>% 
  # got this arranging by a vector trick from SO:
  # https://stackoverflow.com/questions/52216341/how-to-sort-rows-of-a-data-frame-based-on-a-vector-using-dplyr-pipe
  arrange(match(relative_paths, brshallo_gists$relative_paths)) %>% 
  # swap absolute_paths for the gist URLs in brshallo_gists so every row
  # links to where the file lives
  select(-absolute_paths) %>% 
  left_join(brshallo_gists, by = "relative_paths")
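The sorting step relies on match() returning each row's position in a reference vector. Here is a small base R sketch of that idea on made-up data (the file names and functions are hypothetical); dplyr::arrange(match(...)) applies the same logic inside a pipe.

```r
# Hypothetical stand-in for brshallo_gists$relative_paths (the desired order)
reference_order <- c("b.md", "a.R", "c.md")

# Hypothetical stand-in for the bound, unnested results
parsed <- data.frame(
  funs = c("map", "library", "mean", "filter"),
  relative_paths = c("a.R", "a.R", "c.md", "b.md"),
  stringsAsFactors = FALSE
)

# match() gives each row's position in the reference vector;
# order() on those positions sorts the rows into the reference order
positions <- match(parsed$relative_paths, reference_order)
parsed_sorted <- parsed[order(positions), ]

parsed_sorted$relative_paths
```

Rows from the same file stay together, and files appear in the order they were originally listed.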

gists_unnested %>% 
  DT::datatable(rownames = FALSE,
            class = 'cell-border stripe',
            filter = 'top',
            escape = FALSE,
            options = list(pageLength = 20))

Organizing snippets

Perhaps I’ll do a follow-up and show some ways the relationships between the resulting parsed code snippets may be visualized in a network or organized in some other way.

Mentioned in the initial thread, Obsidian seems to be a product that does some things along these lines:

Appendix

Interactively save the current gists to a folder so they can be read from another file if needed.

post_path <- fs::path_dir(rstudioapi::getSourceEditorContext()$path)

fs::dir_create(post_path, "data")

readr::write_csv(gists_unnested, fs::path(post_path, "data", paste0("brshallo-gists-", format(Sys.Date(), "%Y%m%d"), ".csv")))
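Because rstudioapi::getSourceEditorContext() only works inside RStudio, a fallback may be useful when running the script elsewhere. The following is a sketch under that assumption; resolve_post_dir() is a hypothetical helper, and falling back to the working directory is my own choice rather than anything the packages prescribe.

```r
# Hypothetical helper: directory of the active RStudio document,
# or the working directory when RStudio is not available
resolve_post_dir <- function() {
  if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
    path <- rstudioapi::getSourceEditorContext()$path
    # An unsaved or missing document gives an empty or NULL path
    if (!is.null(path) && nzchar(path)) {
      return(dirname(path))
    }
  }
  getwd()
}

# Build the dated output path without writing anything to disk
out_file <- file.path(
  resolve_post_dir(),
  "data",
  paste0("brshallo-gists-", format(Sys.Date(), "%Y%m%d"), ".csv")
)

out_file
```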

  1. I often use this output type when creating a reprex.↩︎

  2. knitr::purl() is used in functions within funspotr to parse R markdown files.↩︎

  3. In the future I may do a follow-up that passes the parsed functions and packages through a network analysis or some other approach to better visualize the relationships between code snippets.↩︎

  4. Based on what exists between the backtick fences. This is like a less reliable version of knitr::purl(), but for .md files. I also posted the function on an SO question.↩︎

  5. Otherwise the R code parsing steps in funspotr will fail.↩︎

  6. Note that this returns only the unique functions in each file. To see every use of a function, I would pass show_each_use = TRUE to funspotr::spot_funs_files().↩︎