GitDataMining
Git Data Mining
Disclaimer: This is not the main repository. It is a fork that I use to fix issues and develop features that I need, and I would certainly like to see those changes merged upstream. The master branch is a pristine clone of <git://git.lwn.net/gitdm.git>; I develop on my own branches, which are also available here.
I usually run gitdm as:
$ (cd repo; git log -M -C --numstat) | gitdm/gitdm -c gitdm.config -o output/repo.txt -s -u
Data Mining Utils
Requirements
The database processing requires the following external Python modules:
- chardet. This is used when inserting data (developer names) into a table fails because of a wrong charset. It detects the charset so that the text can be converted to UTF-8. It is available at http://chardet.feedparser.org/.
- Either psycopg2 (PostgreSQL) or sqlite3. PostgreSQL is the recommended option.
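The charset fallback described for chardet can be sketched as follows. This is a minimal illustration, not the code used in dmgnome-utils: the function name decode_name and the Latin-1 last resort are assumptions, and chardet is treated as optional so that the sketch also runs without it.

```python
try:
    import chardet  # external module, optional for this sketch
except ImportError:
    chardet = None


def decode_name(raw):
    """Return developer-name bytes as text, guessing the charset if needed.

    Hypothetical helper, not part of gitdm or dmgnome-utils.
    """
    try:
        # Most names are already valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    if chardet is not None:
        # chardet.detect returns a dict whose "encoding" entry may be None.
        guess = chardet.detect(raw).get("encoding")
        if guess:
            try:
                return raw.decode(guess)
            except (UnicodeDecodeError, LookupError):
                pass

    # Assumed last resort: Latin-1 maps every byte to a character.
    return raw.decode("latin-1")


name = decode_name(b"Jos\xe9 Garc\xeda")  # these bytes are not valid UTF-8
utf8_bytes = name.encode("utf-8")         # now safe to store as UTF-8
```

Trying UTF-8 first keeps the common case cheap and avoids relying on a guess for short strings, where charset detection tends to be least reliable.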
Directory structure:
- repositories: This is the place where the source code (git repositories) will be mirrored.
- csv: The output of each processed git repository is stored here.
- output: Temporary storage for the output of git log. It is faster to save the log first and then process it than to process the output of git log through pipes.
- gitdm-config: Contains the gitdm configuration file.
- gitdm: A clone of gitdm. This program is invoked by dmgnome-utils.
- dmgnome-utils: Utilities to parse the output of git log and move the result into CSV and a database.
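The two-step approach behind the output directory (save the log first, then feed it to gitdm) might look like this in Python. The git and gitdm options are copied from the command line shown earlier in this README; the function names are assumptions made for illustration, and this is not necessarily how dmgnome-utils does it.

```python
import subprocess

# Options taken from the command line shown earlier in this README.
GIT_LOG_CMD = ["git", "log", "-M", "-C", "--numstat"]


def gitdm_command(config, report):
    """Build the gitdm invocation (hypothetical helper)."""
    return ["gitdm/gitdm", "-c", config, "-o", report, "-s", "-u"]


def save_log(repo_dir, log_path):
    """Step 1: write the git log of repo_dir to log_path."""
    with open(log_path, "wb") as log_file:
        subprocess.run(GIT_LOG_CMD, cwd=repo_dir, stdout=log_file, check=True)


def process_log(log_path, config, report):
    """Step 2: feed the saved log to gitdm on standard input."""
    with open(log_path, "rb") as log_file:
        subprocess.run(gitdm_command(config, report), stdin=log_file, check=True)
```

Calling save_log for a repository and then process_log on the saved file reproduces the piped command in two separate steps. The intermediate log file name is up to the caller.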
The configuration files that depend on paths are gitdm-config/gitdm.config and dmgnome-utils/settings.py. You may change the settings there to match your own environment.
The file that lists the names of the repositories, versions and tags resides in dmgnome-tools/data, and its name is defined in settings.py (the default is gnome-platform.csv).
In dmgnome-utils, run:
$ ./update-git.sh
$ python gitlogger.py
$ python gitdb.py
If everything goes fine, you can work with the data stored in the database.
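The three steps can also be chained from a single Python script that stops at the first failure. This is only a sketch: the script names come from the commands above, while run_pipeline, the runner parameter, the use of sys.executable and the ./ prefix on the shell script are assumptions.

```python
import subprocess
import sys

# Steps in the order they are meant to run. Assumes the shell script is
# invoked as ./update-git.sh from inside dmgnome-utils.
STEPS = [
    ["./update-git.sh"],
    [sys.executable, "gitlogger.py"],
    [sys.executable, "gitdb.py"],
]


def run_pipeline(steps=STEPS, cwd="dmgnome-utils", runner=subprocess.run):
    """Run each step in cwd and return the list of completed steps.

    Hypothetical helper. With the default runner, check=True raises
    CalledProcessError as soon as a step exits with a non-zero status,
    so later steps never run on top of a failed one.
    """
    completed = []
    for step in steps:
        runner(step, cwd=cwd, check=True)
        completed.append(step)
    return completed


# Dry run: print the commands instead of executing them.
run_pipeline(runner=lambda cmd, **kwargs: print(" ".join(cmd)))
```

Passing a different runner, as in the dry run above, makes it possible to inspect the sequence without touching the repositories or the database.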

