Git: Code as data
🛠Tool
git
The company source{d} offers a tool called srcd.
I attended a workshop by Francesco … at Galvanize in San Francisco in April 2019.
See https://github.com/src-d/engine-analyses for some Jupyter notebooks that demo the tool.
Facts
- Use git repositories as a data source to analyze.
- Run SQL queries on all the information in the Git repo.
- Implements a Code as Data philosophy
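To make the Code as Data idea concrete, here is a minimal sketch in Python that treats commit history as a table and queries it with SQL. It is not how srcd works internally; it uses an in-memory SQLite database and a few hypothetical sample rows. The column name commit_author_name mirrors the queries in this post, while commit_hash, commit_message, and the helper names are assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample rows standing in for real history: (hash, author, message).
SAMPLE_COMMITS = [
    ("a1", "Alice", "Initial commit"),
    ("b2", "Bob", "Add README"),
    ("c3", "Alice", "Fix typo"),
]


def build_commits_table(rows):
    """Load (hash, author, message) rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE commits ("
        "commit_hash TEXT PRIMARY KEY, "
        "commit_author_name TEXT, "
        "commit_message TEXT)"
    )
    conn.executemany("INSERT INTO commits VALUES (?, ?, ?)", rows)
    return conn


def commits_per_author(conn):
    """Count commits per author, most active first (ties broken by name)."""
    return conn.execute(
        "SELECT commit_author_name, COUNT(*) AS n "
        "FROM commits "
        "GROUP BY commit_author_name "
        "ORDER BY n DESC, commit_author_name"
    ).fetchall()


if __name__ == "__main__":
    conn = build_commits_table(SAMPLE_COMMITS)
    print(conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0])  # 3
    print(commits_per_author(conn))  # [('Alice', 2), ('Bob', 1)]
```

With real data, the rows could come from parsing the output of git log with a tab-separated pretty format instead of the hard-coded list; srcd removes that manual step by exposing the repository as SQL tables directly.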
Run
- Download srcd from the source{d} website and place the executable somewhere in your path (perhaps into ~/bin/).
- Navigate to the git repo you want to analyze.
- Run srcd sql for a CLI tool or srcd web sql (recommended) to open a nice browser UI.
- Run SQL commands (see examples below)
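Before starting a session, it can help to confirm that the first two steps actually worked. The following is a small preflight sketch, not part of srcd itself: the function names are hypothetical, and it only checks that an executable named srcd is on the PATH and that a directory is inside a git working tree.

```python
import shutil
import subprocess


def find_tool(name="srcd"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def is_git_repo(path="."):
    """Return True if `path` is inside a git working tree."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--is-inside-work-tree"],
            cwd=path,
            capture_output=True,
            text=True,
        )
    except OSError:
        # git is not installed, or the directory does not exist
        return False
    return result.returncode == 0 and result.stdout.strip() == "true"


if __name__ == "__main__":
    tool = find_tool()
    print("srcd found at:", tool if tool else "not on PATH")
    print("current directory is a git repo:", is_git_repo("."))
```

If both checks pass, running srcd sql or srcd web sql from that directory should be the next step.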
Commands
Examples of what might be interesting to query in a git repository.
Total number of commits
SELECT COUNT(*) FROM commits;
Most productive devs in a project
SELECT commit_author_name, COUNT(*) as n
FROM commits
GROUP BY commit_author_name
ORDER BY n DESC;
Most used programming language in project
SELECT LANGUAGE(te.tree_entry_name) as lang, COUNT(*) as n
FROM refs r
NATURAL JOIN commit_trees ct
NATURAL JOIN tree_entries te
WHERE r.ref_name = 'HEAD'
AND te.tree_entry_mode != 40000
GROUP BY lang
ORDER BY n DESC;
Most used words given a programming language
SELECT LANGUAGE(file_path, blob_content) AS lang,
file_path,
blob_content,
UAST(blob_content,
LANGUAGE(file_path, blob_content),
'//uast:String/Value') as strings
FROM files
WHERE lang = 'JavaScript'
LIMIT 10;
List up to 10 repositories in a project
SELECT *
FROM refs
WHERE ref_name="HEAD"
LIMIT 10;