Git: Code as data

🛠Tool
git

The company source{d} offers a tool called srcd. I attended a workshop on it given by Francesco … at Galvanize in San Francisco in April 2019.

See https://github.com/src-d/engine-analyses for some Jupyter notebooks which demo the tool.

Facts

  • Uses git repositories as the data source for analysis.
  • Runs SQL queries against all the information stored in the git repo.
  • Implements a "Code as Data" philosophy.

Run

  1. Download srcd from the source{d} website and place the executable somewhere in your path (perhaps in ~/bin/).
  2. Navigate to the git repo you want to analyze.
  3. Run srcd sql for a CLI prompt, or srcd web sql (recommended) to open a nice browser UI.
  4. Run SQL commands (see examples below).

Commands

Examples of what might be interesting to query in a git repository.

Total number of commits

SELECT COUNT(*) FROM commits;

Most productive devs in a project (by number of commits)

SELECT commit_author_name, COUNT(*) as n
FROM commits
GROUP BY commit_author_name
ORDER BY n DESC;
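
To get a feel for what the two commit queries above return without installing anything, here is a minimal sketch that runs them against an in-memory SQLite table. SQLite is only a stand-in for the srcd SQL engine, and the table contents (hashes and author names) are made up for illustration. Only the commit_author_name column comes from the queries above; the commit_hash column is an assumption added to make the rows distinguishable.

```python
import sqlite3

# Hypothetical stand-in for the commits table. Only commit_author_name
# is taken from the queries above; everything else is invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (commit_hash TEXT, commit_author_name TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?)",
    [
        ("a1", "Ada"),
        ("b2", "Linus"),
        ("c3", "Ada"),
        ("d4", "Grace"),
        ("e5", "Ada"),
        ("f6", "Linus"),
    ],
)

# Total number of commits.
total = conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0]

# Commits per author, most active first.
per_author = conn.execute(
    """
    SELECT commit_author_name, COUNT(*) AS n
    FROM commits
    GROUP BY commit_author_name
    ORDER BY n DESC
    """
).fetchall()

print(total)
print(per_author)
```

Commit count is only a rough proxy for productivity: one author may squash work into a few large commits while another commits every small change.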

Most used programming languages in a project

SELECT LANGUAGE(te.tree_entry_name) AS lang, COUNT(*) AS n
FROM refs r
NATURAL JOIN commit_trees ct
NATURAL JOIN tree_entries te
WHERE r.ref_name = 'HEAD'
    AND te.tree_entry_mode != 40000
GROUP BY lang
ORDER BY n DESC;
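
The join in the query above is easier to follow on a toy dataset. The sketch below rebuilds three tiny tables in SQLite and runs the same query against them. Everything about the mock is an assumption: the tables carry only the columns the query needs plus the join keys (repository_id, commit_hash, tree_hash), the mode is stored as an integer, and LANGUAGE is replaced by a hypothetical extension lookup that is far cruder than the real function.

```python
import os
import sqlite3

# Hypothetical extension-to-language table standing in for LANGUAGE().
EXTENSIONS = {".js": "JavaScript", ".py": "Python"}


def language(entry_name):
    """Guess a language from the file extension (illustrative only)."""
    ext = os.path.splitext(entry_name)[1].lower()
    return EXTENSIONS.get(ext, "Other")


conn = sqlite3.connect(":memory:")
conn.create_function("LANGUAGE", 1, language)

# Mock schema: only the columns the query and its natural joins rely on.
conn.executescript(
    """
    CREATE TABLE refs (repository_id TEXT, ref_name TEXT, commit_hash TEXT);
    CREATE TABLE commit_trees (repository_id TEXT, commit_hash TEXT, tree_hash TEXT);
    CREATE TABLE tree_entries (repository_id TEXT, tree_hash TEXT,
                               tree_entry_name TEXT, tree_entry_mode INTEGER);

    INSERT INTO refs VALUES ('demo', 'HEAD', 'c1'), ('demo', 'refs/heads/old', 'c0');
    INSERT INTO commit_trees VALUES ('demo', 'c1', 't1'), ('demo', 'c0', 't0');
    INSERT INTO tree_entries VALUES
        ('demo', 't1', 'app.js', 100644),
        ('demo', 't1', 'util.js', 100644),
        ('demo', 't1', 'build.py', 100644),
        ('demo', 't1', 'src', 40000),       -- directory entry, filtered out by mode
        ('demo', 't0', 'legacy.py', 100644); -- not reachable from HEAD
    """
)

languages = conn.execute(
    """
    SELECT LANGUAGE(te.tree_entry_name) AS lang, COUNT(*) AS n
    FROM refs r
    NATURAL JOIN commit_trees ct
    NATURAL JOIN tree_entries te
    WHERE r.ref_name = 'HEAD'
        AND te.tree_entry_mode != 40000
    GROUP BY lang
    ORDER BY n DESC
    """
).fetchall()

print(languages)
```

The natural joins walk from the HEAD ref to its commit, from the commit to its trees, and from each tree to its entries. The mode filter drops directory entries so that only files are counted.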

String literals used in files of a given programming language

SELECT LANGUAGE(file_path, blob_content) AS lang,
    file_path,
    blob_content,
    UAST(blob_content,
         LANGUAGE(file_path, blob_content),
         '//uast:String/Value') AS strings
FROM files
WHERE lang = 'JavaScript'
LIMIT 10;

List up to 10 repositories in a project

SELECT *
FROM refs
WHERE ref_name = 'HEAD'
LIMIT 10;

Explain Programming

André Kovac builds products, creates software, teaches coding, communicates science and speaks at events.