Let’s start with defining the question - today, I’m tasked with showing the patterns within the modules relating to where there are high levels of staff commits (i.e people employed by Red Hat within Ansible itself), and everyone else (which we’ll label ‘community’). Note that ‘community’ will still include some Red Hat folks, as we’re a big company and not everyone works on Ansible for the $day-job (there’s a caveat to this, see Appendix 1)

The resulting graphic, as you’ll see, largely speaks for itself, so here it is! Sometimes, a picture really is worth a thousand words…

This is a tree-map - each small circle is a single file in /lib/ansible/modules, and then the larger circles around them are the containing directory, and so on up the tree. The size of the file-circles only is determined from the total number of commits to that file.

I really like this result - apart from the non-colourblind-safe palette of red/blue (I was finding it hard to make a clearer one). It’s a lovely graphic that tells a high-level story at a glance (lots of entirely community-created modules), and some more niche tales as you dive in (e.g. look closely at vmware…).

So, done? Nah, I don’t write short posts - so, for a change, I’m going to dive into the R code I used to create it, so you can mess about with it yourself :)

Part 0 - Libraries

We need some packages, so I load them here. My greglib package has my staff list in it, so you’ll need to create one of those for yourself if you’re following along :P

library(tidyverse)  # data manipulation
library(greglib)    # staff list
library(ggraph)     # graphing tools
library(data.tree)  # tree mapping tools
library(git2r)      # git cloning
# remotes::install_github('lorenzwalthert/gitsum')
library(gitsum)     # nice log parsing tool

Part 1 - The Data

We’ll need some data, of course. We could use a variety of metrics to determine “contributions” or “content”, but I’m going to keep it simple. We’ll git clone Ansible, and look at the raw Git log itself. We’re going to restrict this to the last 2 years of data - if less, we don’t get more than 1 commit to many files; if more, it’s not representative of the current community.

# Clone if we need to
if (!dir.exists('/tmp/ansible')) {
  # Get detailed data on each commit - takes a few min :)
  gitsum::init_gitsum('/tmp/ansible/', over_write = T)

logs <- gitsum::parse_log_detailed('/tmp/ansible/')

# Filter to commits within date & touching /lib/ansible/modules
commits <- logs %>%
  filter(date > Sys.Date() - lubridate::years(2)) %>%                # 2 year date filter
  unnest(nested) %>%                                                 # unpack commits to one-row-per-file-changed
  filter(str_starts(changed_file,'lib/ansible/modules/')) %>%        # detect the commits to modules
  filter(!str_detect(changed_file,'=>')) %>%                         # drop rows with are just renames
  mutate(changed_file = str_remove(changed_file,'lib/ansible/')) %>% # tidy up the filename, everything is in lib/ansible
  mutate(staff = author_name %in% greglib::staff$gitlog.name)        # note staff vs community commits

# Preview it
head(commits) %>%
  arrange(short_hash) %>%
  select(short_hash,date,staff,changed_file) %>% # just picking a few columns to display
short_hash date staff changed_file
9f86 2017-12-19 12:22:00 TRUE modules/network/nxos/nxos_aaa_server_host.py
9f86 2017-12-19 12:22:00 TRUE modules/network/nxos/nxos_bgp.py
9f86 2017-12-19 12:22:00 TRUE modules/network/nxos/nxos_bgp_neighbor.py
9f86 2017-12-19 12:22:00 TRUE modules/network/nxos/nxos_bgp_neighbor_af.py
9f86 2017-12-19 12:22:00 TRUE modules/network/nxos/nxos_facts.py
9f86 2017-12-19 12:22:00 TRUE modules/network/nxos/nxos_gir.py

OK, so we have a nice data frame, one row per changed file with supporting data. Lovely.

Part 2 - Tree Maps

The best way to work with this data is to use a tree map, since it is actually a tree - a directory tree to be precise. Here, we’ll turn our data frame into a tree, and add the data we know about to the leaf (file) nodes.

tree <- commits %>%
  group_by(changed_file) %>% 
  summarise(n = n(),                      # total commits to this file
            staff = sum(staff),           # commits from staff
            perc_staff = staff/n          # percentage of staff commits
  ) %>%
  rename(pathString = changed_file) %>%   # renaming makes later code cleaner
  FromDataFrameTable()                    # create the tree

print(tree,'staff', limit = 7)
##                            levelName staff
## 1 modules                               NA
## 2  ¦--cloud                             NA
## 3  ¦   ¦--alicloud                      NA
## 4  ¦   ¦   ¦--__init__.py                0
## 5  ¦   ¦   ¦--_ali_instance_facts.py     0
## 6  ¦   ¦   ¦--ali_instance_facts.py      2
## 7  ¦   ¦   °--... 2 nodes w/ 0 sub      NA
## 8  ¦   °--... 41 nodes w/ 1680 sub      NA
## 9  °--... 20 nodes w/ 4561 sub          NA

OK, we have leaf data, but notice that the parents (directories) are NA - we need to aggregate the per-file data. We can traverse the tree from the bottom upwards to get at this.

tree$Do(function(x) {
  if (!isLeaf(x)) { # no need to act on leaf nodes, it's defined there already
    x$n     <- sum(Get(x$children, "n"))       # sum of commits in the dir
    x$staff <- sum(Get(x$children, "staff"))   # sum of staff commits
    x$perc_staff <- x$staff / x$n              # percentage for this dir
}, traversal = "post-order")                   # post-order means leaf-first

print(tree,'staff', limit = 7)
##                            levelName staff
## 1 modules                             4693
## 2  ¦--cloud                           1515
## 3  ¦   ¦--alicloud                       5
## 4  ¦   ¦   ¦--__init__.py                0
## 5  ¦   ¦   ¦--_ali_instance_facts.py     0
## 6  ¦   ¦   ¦--ali_instance_facts.py      2
## 7  ¦   ¦   °--... 2 nodes w/ 0 sub      NA
## 8  ¦   °--... 41 nodes w/ 1680 sub      NA
## 9  °--... 20 nodes w/ 4561 sub          NA

Boom, we have have all the data in the right format. But we will want an extra column for the display side of things - I want the directory labels to be sized based on their tree depth (i.e. amazon should be smaller font than cloud)

tree$Do(function(x) {
  x$label_size <- case_when(
    x$level == 2 ~ 6,             # size 6 for top-dirs
    x$level == 3 ~ 4,             # size 4 for module dirs
    TRUE ~ NA_real_               # don't label "modules" or filenames (level 1 & 4)
}, traversal = 'post-order')

OK, we have a tree. Let’s plot it!

Part 3 - Edges & Nodes

To plot this, we’ll need to consume it as a network - that is, a list of points(vertices) and the connections between them (edges). For a tree map, that’s just the list of directories & files, and the name of the parent (e.g. there is an edge from modules to modules/cloud, and so on).

Happily, data.tree has functions for that! The edges are trivial.

edges <- ToDataFrameNetwork(tree,'pathString')

The vertices are a little trickier, as we want to preserve all the attributes like staff_perc and so on. Also, the data.tree method for vertices would only give the leaf vertices, so we have to be a little clever - we use the list of to names in edges and merge data to that. It’s gnarly but it works:

# get data from the tree in a data.frame format
myverts <- tibble(name       = tree$Get('pathString'),
                  n          = tree$Get('n'),
                  perc_staff = tree$Get('perc_staff'),
                  # Style
                  label_text = tree$Get('name'),
                  label_size = tree$Get('label_size'))

vertices <- edges %>% 
  select(-from) %>%                                      # use the `to` column for names
  add_row(to = 'modules', pathString = 'modules') %>%    # nothing can point *to* "modules", so add it back
  distinct(pathString, .keep_all = T) %>%                # deduplicate the list
  left_join(., myverts, by=c("pathString" = "name")) %>% # join it to the tree data
  rename(name = to) %>%                                  # tidy up and handle NAs introduced for "modules"
  mutate(perc_staff = if_else(name == 'modules',0.5,perc_staff),
         label_text = if_else(name == 'modules',NA_character_,label_text))

head(vertices) %>% knitr::kable()
name pathString n perc_staff label_text label_size
modules/cloud modules/cloud 8182 0.1851626 cloud 6
modules/clustering modules/clustering 114 0.3859649 clustering 6
modules/commands modules/commands 66 0.3030303 commands 6
modules/crypto modules/crypto 352 0.0625000 crypto 6
modules/database modules/database 520 0.1384615 database 6
modules/files modules/files 278 0.3525180 files 6

OK, we have a list of vertices and edges, so we can finally plot it!

Part 4 - The Plot

First, the easy bit - we’ll create a graph object, and cache a few values for ease of use in the main call.

mygraph <- graph_from_data_frame( edges, vertices=vertices )
total_perc = tree$Get('perc_staff')[1] # row 1 == 'modules', or *all the data*
staff_filter = 50                      # These filters are arbitrary, and prevent crowding
nonstaff_filter = 200                  # the plot with labels. Adjust to suit yourself!

Right, let’s do this. One immense ggplot2 invocation coming up! This is huge, so I’ll break it down piece by piece :)

plot <- ggraph(mygraph, layout = 'circlepack', weight=n)
## Non-leaf weights ignored

ggraph is our main function, it’s a ggplot2 extension and is amazing. See the GGraph website for inspiration!

The layout obviously sets us up for circle-packing, and the weight tells the algorithm to size the circles by number of commits, n.

plot <- plot + 
  geom_node_circle(aes(fill = perc_staff, colour = as.factor(depth)), size = 0.1)

Add the circles (the nodes), and use aesthetics (aes) that fill the circle by staff_perc, colour the boundary line by depth, and make that line really thin.

plot <- plot +
                       mid = 'white', midpoint = 0.5,
                       name='Staff Commits',
                       labels = scales::percent)

This sets up the colour scale from blue to red with white at 0.5 - that is, white is where there is no strong bias towards staff or community.

plot <- plot +
  scale_color_manual(guide=F,values=c("0" = "white",
                                      "1" = "black",
                                      "2" = "black",
                                      "3" = "black",
                                      "4" = "black") )

Recall the colour is mapped to depth - this just says “map depth 1 to white, and the rest to black” which means we don’t see a circle for all modules. I think that looks nicer, but you can change the first entry to “black” to compare.

plot <- plot +
  geom_node_label(aes(label = label_text,
                      size  = I(label_size),
                      filter = (perc_staff < 0.5) & (n > nonstaff_filter),
                      fill   = perc_staff),
                  repel = T) 

This plots the “community” labels (perc < 0.5), using the filter of staff_perc < 0.5, and the commit filter defined above.

plot <- plot +
  geom_node_label(aes(label  = label_text,
                      size   = I(label_size),
                      filter = (perc_staff >= 0.5) & (n > staff_filter),
                      fill   = perc_staff),
                  repel = T)

And this does the “staff” labels. They’re separate label calls because they use different filters - if you wanted a single n_filter object for both types of label, you could simplify this bit!

plot <- plot +
  theme_void() + 
  theme(plot.margin = unit(rep(0.5,4), "cm")) +
  labs(title    = paste0('Staff vs non-staff commits, per module file. Total staff: ', scales::percent(total_perc)),
       caption  = 'Source: git-log, last 2 years, commits expanded by files touched')

This final bit is just styling - setting up the look & feel, the title, etc.

OK, this plot this baby! Be warned, if you followed along, this step takes a good few minutes to process. Get a beverage :)

# save to file so we *definitely* get a big image
## png 
##   2


And there you have it, one step-by-step Ansible community bubble-plot (or circle-pack) based on commits and who wrote them. Hope you enjoyed it as much as I enjoyed making it! If you did, stop by and let me know, I’d love to chat about it :)

Appendix 1 - The Identity Caveat

A caveat arises, then. To do this, you have to identify the staff members to categorize, and identity is a hard problem. To get around this, I had my excellent colleague Gundalow give me the list of folks he considered “staff”. No, I’m not going to share it with you, you can ask him :P

This does mean that if a name in the git logs doesn’t match my list, it’ll go in the community group. That’s fine in the vast majority of cases, but if someone updates their .gitconfig file to have a new name, I’ll lose track of them. It’s not a big impact, I think, but something to be aware of.