Is there a way to get access to the data in the “Repositories contributed to” module on GitHub profile pages via the GitHub API? Ideally the entire list, not just the top five, which are all you can get on the web apparently.
I didn't see any way of doing it in the API. The closest I could find was to get the latest 300 events for a public user (300 is the limit, unfortunately) and then filter those for contributions to others' repositories.
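A minimal sketch of that filtering step, assuming the events were already fetched from the public events endpoint (`GET /users/{username}/events/public`) and decoded from JSON. The function name `repos_contributed_to` and the choice of which event types count as contributions are my own assumptions for illustration, not something the API defines.

```python
# Event types treated as "contributions" here; this selection is an assumption.
CONTRIBUTION_EVENTS = {"PushEvent", "PullRequestEvent", "IssuesEvent"}


def repos_contributed_to(events, username):
    """Return sorted 'owner/name' strings for repos the user touched but does not own."""
    found = set()
    for event in events:
        if event.get("type") not in CONTRIBUTION_EVENTS:
            continue
        # Each event carries the repository as {"name": "owner/name", ...}.
        full_name = event.get("repo", {}).get("name", "")
        owner = full_name.split("/", 1)[0]
        # Skip the user's own repositories (case-insensitive owner comparison).
        if full_name and owner.lower() != username.lower():
            found.add(full_name)
    return sorted(found)
```

Since only the most recent 300 events are available, this can only ever see recent activity.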
Search for the last 100 closed pull requests the user submitted. If the first page is full, you can request the second page to get even older PRs.
If the user is listed as a contributor to any of the repos there, we add the repo to the list (same step as above).
This misses repos where the user has submitted no pull requests but has been added as a contributor. We can increase our odds of finding these repos by searching for:
1) any issue opened (not just closed pull requests)
2) repos the user has starred
Clearly, this requires many more requests than we would like to make but what can you do when they make you fudge features \o/
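The search step above could be sketched roughly like this. It only builds the `/search/issues` URL and pulls repo names out of the decoded `items` array; the helper names `search_url` and `candidate_repos` are made up for this example, and the actual HTTP request and the contributor check are left to the caller.

```python
from urllib.parse import urlencode

API_PREFIX = "https://api.github.com/repos/"


def search_url(username, qualifiers="type:pr state:closed", page=1):
    """Build a /search/issues URL for items authored by username."""
    query = f"author:{username} {qualifiers}".strip()
    params = {"q": query, "per_page": 100, "page": page}
    return "https://api.github.com/search/issues?" + urlencode(params)


def candidate_repos(items):
    """Collect unique 'owner/name' strings from search result items."""
    repos = []
    for item in items:
        # Each item has a repository_url such as .../repos/owner/name
        url = item.get("repository_url", "")
        if url.startswith(API_PREFIX):
            name = url[len(API_PREFIX):]
            if name not in repos:
                repos.append(name)
    return repos
```

Passing `qualifiers=""` widens the search to any issue or pull request the user opened, which matches option 1) above.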
One actual hack I've found is a project called GitHub Archive: http://www.githubarchive.org/
They log all public events starting from 2011. Not ideal, but it can be helpful.
So, for example, in your case:
SELECT payload_pull_request_head_repo_clone_url
FROM [githubarchive:github.timeline]
WHERE payload_pull_request_base_user_login='outoftime'
GROUP BY payload_pull_request_head_repo_clone_url;
Gives, if I'm not mistaken, the list of repos you've pull requested to:
SELECT repository_url
FROM [githubarchive:github.timeline]
WHERE payload_pull_request_user_login ='rgbkrk'
GROUP BY repository_url;
You can use similar semantics to pull out just the quantities of repositories you contributed to as well as the languages they were in:
SELECT COUNT(DISTINCT repository_url) AS count_repositories_contributed_to,
COUNT(DISTINCT repository_language) AS count_languages_in
FROM [githubarchive:github.timeline]
WHERE payload_pull_request_user_login ='rgbkrk';
If you're looking for overall contributions, which include issues reported, use:
SELECT COUNT(DISTINCT repository_url) AS count_repositories_contributed_to,
COUNT(DISTINCT repository_language) AS count_languages_in
FROM [githubarchive:github.timeline]
WHERE actor_attributes_login = 'rgbkrk';
The difference there is actor_attributes_login which comes from the Issue Events API.
You may also want to capture your own repos, which may not have issues or PRs filed by yourself.
Setting the fork parameter to true ensures that you query all of the user's repos, forks included.
However, if you want to make sure the user not only forked a repository but also contributed to it, you have to iterate through every repo returned by the search request and check whether the user is among its contributors. That is tedious, because GitHub returns at most 100 contributors per request, so repositories with more contributors need several paged requests.
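Here is a rough sketch of that per-repo check. It assumes you already have some way to fetch the decoded contributor list for a repo (`GET /repos/{owner}/{repo}/contributors`); the `fetch_contributors` callable is a hypothetical placeholder for that, so the snippet itself makes no network calls.

```python
def is_contributor(contributors, username):
    """True if username appears in a decoded contributors response."""
    wanted = username.lower()
    return any(c.get("login", "").lower() == wanted for c in contributors)


def filter_contributed(repos, username, fetch_contributors):
    """Keep only the repos whose contributor list includes username.

    fetch_contributors(repo) is supplied by the caller (hypothetical helper);
    it should return the list of contributor objects for 'owner/name'.
    """
    return [
        repo for repo in repos
        if is_contributor(fetch_contributors(repo), username)
    ]
```

Keeping the fetch separate makes it easy to plug in paging or caching later without touching the membership test.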
"""
Get all your repos contributed to for the past year.
This uses Selenium and Chrome to login to github as your user, go through
your contributions page, and grab the repo from each day's contribution page.
Requires python3, selenium, and Chrome with chromedriver installed.
Change the username variable, and run like this:
GITHUB_PASS="mypassword" python3 github_contributions.py
"""
import os
import sys
import time
from pprint import pprint as pp
from urllib.parse import urlsplit
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
username = 'jessejoe'
password = os.environ['GITHUB_PASS']
repos = []
driver = webdriver.Chrome()
driver.get('https://github.com/login')
driver.find_element_by_id('login_field').send_keys(username)
password_elem = driver.find_element_by_id('password')
password_elem.send_keys(password)
password_elem.submit()
# Wait indefinitely for 2-factor code
if 'two-factor' in driver.current_url:
    print('2-factor code required, go enter it')
while 'two-factor' in driver.current_url:
    time.sleep(1)
driver.get('https://github.com/{}'.format(username))
# Get all days that aren't colored gray (no contributions)
contrib_days = driver.find_elements_by_xpath(
    "//*[@class='day' and @fill!='#eeeeee']")
for day in contrib_days:
    day.click()
    # Wait until done loading
    WebDriverWait(driver, 10).until(
        lambda driver: 'loading' not in driver.find_element_by_css_selector('.contribution-activity').get_attribute('class'))
    # Get all contribution URLs
    contribs = driver.find_elements_by_css_selector('.contribution-activity a')
    for contrib in contribs:
        url = contrib.get_attribute('href')
        # Only care about repo owner and name from URL
        repo_path = urlsplit(url).path
        repo = '/'.join(repo_path.split('/')[0:3])
        if repo not in repos:
            repos.append(repo)
    # Have to click something else to remove pop-up on current day
    driver.find_element_by_css_selector('.vcard-fullname').click()
driver.quit()
pp(repos)
It uses Python and Selenium to automate a Chrome browser: it logs in to GitHub, goes to your contributions page, clicks each day, and grabs the repo name from any contributions. Since this page only shows one year's worth of activity, that's all you can get with this script.
If you have contributed to more than 100 repos (including your own), you will have to paginate by specifying after: "END_CURSOR_VALUE" in repositoriesContributedTo for the next request.
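The cursor loop might look something like the sketch below. The query text is my own reconstruction around `repositoriesContributedTo`, and `run_query` is a hypothetical caller-supplied function that POSTs the query and variables to https://api.github.com/graphql with a token and returns the decoded `data` object, so treat this as an outline rather than a drop-in client.

```python
QUERY = """
query($login: String!, $after: String) {
  user(login: $login) {
    repositoriesContributedTo(first: 100, after: $after,
                              includeUserRepositories: true) {
      nodes { nameWithOwner }
      pageInfo { hasNextPage endCursor }
    }
  }
}
"""


def all_contributed_repos(login, run_query):
    """Follow endCursor page by page until hasNextPage is false."""
    names = []
    after = None  # None requests the first page
    while True:
        data = run_query(QUERY, {"login": login, "after": after})
        conn = data["user"]["repositoriesContributedTo"]
        names.extend(node["nameWithOwner"] for node in conn["nodes"])
        if not conn["pageInfo"]["hasNextPage"]:
            return names
        # Feed the cursor of this page into the next request.
        after = conn["pageInfo"]["endCursor"]
```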
You'll probably get the last year or so via GitHub's GraphQL API, as shown in Bertrand Martel's answer.
Everything that happened back to 2011 can be found in GitHub Archive, as stated in Kyle Kelley's answer.
However, BigQuery's syntax and GitHub's API seem to have changed, and the examples shown there no longer work as of 08/2020.
So here's how I found all repos I contributed to:
SELECT DISTINCT repo.name
FROM (
  SELECT * FROM `githubarchive.year.2011` UNION ALL
  SELECT * FROM `githubarchive.year.2012` UNION ALL
  SELECT * FROM `githubarchive.year.2013` UNION ALL
  SELECT * FROM `githubarchive.year.2014` UNION ALL
  SELECT * FROM `githubarchive.year.2015` UNION ALL
  SELECT * FROM `githubarchive.year.2016` UNION ALL
  SELECT * FROM `githubarchive.year.2017` UNION ALL
  SELECT * FROM `githubarchive.year.2018`
)
WHERE (type = 'PushEvent'
  OR type = 'PullRequestEvent')
  AND actor.login = 'YOUR_USER'
Some of the repos returned have only a name, with no user or org, but I had to process the result manually afterwards anyway.
You can take a look at https://github.com/casperdcl/ghstat which automates counting lines of code written in all visible repositories. Extracting the relevant code and tidying it up:
#!/bin/bash
ghjq() { # <endpoint> <filter>
  # filter all pages of authenticated requests to https://api.github.com
  gh api --paginate "$1" | jq -r "$2"
}
# candidate repos: owned, merged PRs, subscriptions, and org repos
repos="$(
  ghjq "users/$GH_USER/repos" '.[].full_name'
  ghjq "search/issues?q=is:pr+author:$GH_USER+is:merged" \
    '.items[].repository_url | sub(".*github.com/repos/"; "")'
  ghjq "users/$GH_USER/subscriptions" '.[].full_name'
  # left unquoted so that each org login becomes its own loop iteration
  for org in $(ghjq "users/$GH_USER/orgs" '.[].login'); do
    ghjq "orgs/$org/repos" '.[].full_name'
  done
)"
repos="$(echo "$repos" | sort -u)"
# print repo if user is a contributor
for repo in $repos; do
  # --paginate yields one true/false per page, so accept a match on any page
  if ghjq "repos/$repo/contributors" "[.[].login | test(\"$GH_USER\")] | any" | grep -qx true; then
    echo "$repo"
  fi
done