Searching git repositories for secrets

Getting .git content inadvertently exposed on websites

There's a good article about this here.

The simple approach is:

wget --mirror -I .git http://website/.git/

Manually searching for secrets

Clone the repo to get a local copy of all the refs

git clone <address>

If retrieving content from git directly, mirror the repo to get a local copy of all the refs. This gives you a raw (bare) copy of the repository, which can contain additional things not seen in a normal clone as shown above

git clone --mirror <address>

Show the commit history

git log

List all the changed files in a commit

git diff-tree --no-commit-id --name-only -r [commit_id]

List all the files present in the repository at the time of a commit

git ls-tree --name-only -r [commit_id]

Show files changed between commits

git diff --name-only [commit_id_1]..[commit_id_2]

Show the contents of a file from a particular commit

git show [commit_id]:[filename]

Search for file contents across all commits (add --no-pager after git to not pipe through less or similar)

git grep -n 'search pattern' $(git rev-list --all)

If the above is run on a repository with too many commits to pass in one command, you can search a subset, such as the first 10000 commits listed, like so

git grep -n 'search pattern' $(git rev-list --all | awk 'NR >=1 && NR <= 10000')
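
Rather than only searching the first 10000 commits, the commit list can be split into batches so that every commit still gets searched. The following is a minimal Python sketch of that idea, not part of the helper code mentioned later; the function names batched and grep_all_commits and the batch_size default are made up for illustration, and it assumes git is on the PATH.

```python
import subprocess
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def grep_all_commits(pattern, repo=".", batch_size=1000):
    """Run `git grep -n` over every commit, `batch_size` commits per call."""
    revs = subprocess.run(
        ["git", "-C", repo, "rev-list", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    lines = []
    for chunk in batched(revs, batch_size):
        # git grep exits with status 1 when nothing matches, so the
        # return code is deliberately not checked here.
        result = subprocess.run(
            ["git", "-C", repo, "--no-pager", "grep", "-n", "-e", pattern, *chunk],
            capture_output=True, text=True, errors="replace",
        )
        lines.extend(result.stdout.splitlines())
    return lines
```

Calling grep_all_commits('search pattern') from inside a repository should return the same lines as the single command above, just collected over several smaller git invocations.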

Searching across the repository for a file with a particular name

git rev-list --all | xargs -I '{}' git ls-tree --full-tree -r '{}' | grep target_filename.txt | sort -u
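
The same filename search can be sketched in Python, which makes it easier to keep the results for later use. This is an illustrative sketch only; matching_entries and find_file are hypothetical names, and it assumes git is on the PATH. It returns the object ID alongside each path, so every distinct version of the file shows up once.

```python
import subprocess


def matching_entries(ls_tree_output, name):
    """Return unique (object_id, path) pairs whose path contains `name`."""
    found = set()
    for line in ls_tree_output.splitlines():
        # Each `git ls-tree -r` line is "<mode> <type> <object>\t<path>".
        meta, sep, path = line.partition("\t")
        fields = meta.split()
        if sep and len(fields) == 3 and name in path:
            found.add((fields[2], path))
    return sorted(found)


def find_file(name, repo="."):
    """Look for `name` in the tree of every commit reachable from any ref."""
    revs = subprocess.run(
        ["git", "-C", repo, "rev-list", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    found = set()
    for rev in revs:
        tree = subprocess.run(
            ["git", "-C", repo, "ls-tree", "--full-tree", "-r", rev],
            capture_output=True, text=True, check=True,
        ).stdout
        found.update(matching_entries(tree, name))
    return sorted(found)
```

Like the shell pipeline, this runs one git ls-tree per commit, so it can be slow on repositories with a long history.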

Helper code

Git repositories can be analysed from a security perspective to find secrets in older commits, if you can obtain a locally cloned copy of the repository.

There is some helper code to automate some of the commands needed to do this here.

The code does not do all the work for you and is meant to be used as part of a workflow - it's intended to be imported (e.g. with exec) and used from an interactive Python session (e.g. IPython).

The code essentially provides a wrapper around the following commands, making it easier to manage the output that's generated, find the bits you’re interested in and present the information in a usable fashion:

Search the local git repository in the present working directory for a given pattern across all commits and branches: git grep -in 'search_string' $(git rev-list --all)
Show the contents of a given file at a given revision: git show <revision>:<file>

At a high level the code runs the git command for you (so git needs to be installed and its path set in the “git” global variable), and searches all the branches and commits for a list of strings as specified in the global “words” variable. The results are placed in a structure that you can search through to manually eliminate entries that don't interest you. There are default settings for both of these variables, but you can change the values to suit.

The analysis process that the code is meant to enable looks like this:

1. Run git_secrets_grabber('PATH'). This returns a hash with an entry for each repository discovered under the given path (you can supply a parent path that contains multiple local git repositories as subdirectories if you wish), and each repository entry contains a child entry for each search term. Beneath each search word are the keys raw, parsed and search, which hold different views of the same data. raw is the raw output from the “git grep” command, parsed is the data split by line and field, including information about commits and filenames, and search is the unique matching lines only.
2. Look at the “search” output for each search word, identify the lines you are interested in, pick a unique part of each line of interest, and add those to a list in a new “results” key for each repo:word hash entry. This is essentially eliminating lines that are not of any security concern - false positives. There will likely be a few.
3. Once you’re done, process the modified results hash with the gitsearch_results_parser function - it creates a new output structure with the relevant information about each of the lines you are interested in - which line number in which file in which commit. This is mainly intended to be a storage format; there are helper functions to output the results in a few useful ways.
4. The results at any stage of this process can be saved out to disk with either of the json_file_write_gz or json_file_write functions (and read back in with the read equivalent). The “gz” functions make a smaller (compressed) output file, which is harder to check in a text editor (if you were so inclined to view it that way).
5. The output from gitsearch_results_parser in step 3 can be further parsed with the git_results_file_data function or other code to display the results in various ways.
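
To make the raw, parsed and search views and the later filtering step more concrete, here is a rough Python sketch of how "git grep -n" output could be split up and then narrowed down. This is not the helper code itself: the function names parse_grep_output and select_results and the record fields commit, file, line and content are assumptions made for illustration; only the key names raw, parsed and search come from the description above.

```python
def parse_grep_output(raw):
    """Split `git grep -n <pattern> <commits>` output into three views."""
    parsed = []
    for line in raw.splitlines():
        # Expected shape: <commit>:<path>:<line number>:<content>
        # (a path that itself contains ':' would confuse this simple split).
        parts = line.split(":", 3)
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # skips e.g. "Binary file ... matches" lines
        commit, path, lineno, content = parts
        parsed.append(
            {"commit": commit, "file": path, "line": int(lineno), "content": content}
        )
    # Unique matching lines only, which is what gets reviewed by hand.
    search = sorted({rec["content"].strip() for rec in parsed})
    return {"raw": raw, "parsed": parsed, "search": search}


def select_results(entry, fragments):
    """Keep the parsed records containing any fragment picked by the analyst."""
    return [
        rec for rec in entry["parsed"]
        if any(fragment in rec["content"] for fragment in fragments)
    ]
```

The idea is that the analyst reads the short search list, picks a distinctive fragment from each interesting line, and select_results then recovers every commit, file and line number where that fragment appears.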

Some examples, where variable gr contains the output from gitsearch_results_parser:

Show files for each repo containing secrets: print('\n'.join(sorted(gr[repo].keys())))
Show matching lines and line numbers with content matching the search terms, to check you are outputting correct results: git_results_file_data(gr)
Show a breakdown of matches and commits: git_results_file_data(gr, contents=['matches', 'commits'])