Git Data Mining

Disclaimer: This is not the main repository. It is a fork that I use to fix issues and develop features that I need. I would certainly like to see them land upstream. However, master is a pristine clone of <git://git.lwn.net/gitdm.git>, and I develop on my own branches, which are also available here.

I usually run gitdm as:

$ (cd repo; git log -M -C --numstat) | gitdm/gitdm -c gitdm.config -o output/repo.txt -s -u

Data Mining Utils

Requirements

The database processing requires the following Python modules:

  • chardet. Used when inserting data (developer names) into a table fails because of a wrong charset. It helps detect the charset so that the data can be converted to UTF-8. It is available at http://chardet.feedparser.org/.
  • Either psycopg2 (PostgreSQL) or sqlite3 (SQLite). PostgreSQL is the recommended option.
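
As a rough illustration of the charset handling described for chardet, the following sketch decodes a developer name and only asks chardet for a guess when plain UTF-8 decoding fails. The function name to_utf8_text and the latin-1 fallback are assumptions made for this example, not code taken from dmgnome-utils, and the sketch targets Python 3.

```python
def to_utf8_text(raw):
    """Decode raw bytes to str, guessing the charset only when UTF-8 fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Third-party module; imported lazily so clean UTF-8 input never needs it.
    import chardet

    # chardet.detect() returns a dict with "encoding" and "confidence" keys;
    # "encoding" can be None when no guess is possible.
    guess = chardet.detect(raw)
    encoding = guess.get("encoding") or "latin-1"  # assumed fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The guessed name is not a codec Python knows about.
        return raw.decode("latin-1")
```

Calling to_utf8_text(name_bytes).encode("utf-8") then yields bytes that are safe to insert into a UTF-8 table. Detection on very short strings is only a guess, so the result may still need a manual check.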

Structure of directories:

  • repositories: The place where the source code (git repositories) is mirrored.
  • csv: Stores the output for each git repository processed.
  • output: Temporary storage for the output of git log. Saving the log first and then processing it is faster than processing git log’s output through pipes.
  • gitdm-config: Contains the gitdm configuration file.
  • gitdm: A clone of gitdm. This program is invoked by dmgnome-utils.
  • dmgnome-utils: Utilities that parse the output of git log and move the results into CSV files and a database.
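
The two-step approach behind the output directory (save the log, then process it) could look roughly like the sketch below. The git log and gitdm flags are the ones from the command shown earlier in this README; the helper names dump_git_log, gitdm_command and run_gitdm are hypothetical and are not the actual gitlogger.py implementation.

```python
import subprocess
from pathlib import Path

# Flags copied from the gitdm invocation shown earlier in this README.
GIT_LOG_ARGS = ["log", "-M", "-C", "--numstat"]


def gitdm_command(config_path, report_path, gitdm="gitdm/gitdm"):
    """Build the gitdm command line; the git log itself arrives on stdin."""
    return [str(gitdm), "-c", str(config_path), "-o", str(report_path), "-s", "-u"]


def dump_git_log(repo_dir, log_path):
    """Write the git log of repo_dir to log_path and return that path."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("wb") as out:
        # Run git inside the repository and send its output straight to the file.
        subprocess.run(["git", *GIT_LOG_ARGS], cwd=repo_dir, stdout=out, check=True)
    return log_path


def run_gitdm(log_path, config_path, report_path, gitdm="gitdm/gitdm"):
    """Feed a previously saved log file to gitdm on standard input."""
    with open(log_path, "rb") as log:
        subprocess.run(
            gitdm_command(config_path, report_path, gitdm), stdin=log, check=True
        )
```

For a repository mirrored under a hypothetical name such as repositories/myrepo, run_gitdm(dump_git_log("repositories/myrepo", "output/myrepo.log"), "gitdm-config/gitdm.config", "output/myrepo.txt") would perform both steps. Keeping the log on disk also means it can be reprocessed without running git log again.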

The configuration files that depend on paths are gitdm-config/gitdm.config and dmgnome-utils/settings.py. Edit them to match your own environment.

The file that lists the names of the repositories, versions, and tags resides in dmgnome-tools/data; its name is defined in settings.py (the default is gnome-platform.csv).

From the dmgnome-utils directory, run:

$ ./update-git.sh
$ python gitlogger.py
$ python gitdb.py

If all three steps finish without errors, you can work with the data stored in the database.