Querying the GitHub API for users with a README that matches text

August 15, 2014

I want to retrieve users whose repositories contain a README file that contains text matching a string passed in the query. Is this possible using the GitHub API?

In addition, I would like to include location and language in the query.

Thanks.

This is not straightforward using the API as it is available now. However, you can still use the API to get what you want.

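Be warned that there are over 10 million repositories on GitHub, so this will take a long time. You can retrieve only a list of 100 repositories per query, so you need to use pagination, which means more than 100,000 requests just to list all repositories. A user is limited to 5000 requests per hour and is then "banned" for the rest of the hour. So this will take more than 40 hours if you're using the credentials of a single user.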
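The arithmetic behind that warning can be sketched as follows. This is only a back-of-the-envelope estimate using the figures quoted above (10 million repositories, 100 per page, 5000 requests per hour); the function name `estimate` and the assumption of exactly one extra README request per repository are mine, not part of the GitHub API.

```python
import math


def estimate(total_repos, per_page=100, requests_per_hour=5000):
    """Rough request and time budget for crawling every public repository."""
    # One request per page of the repository listing.
    list_requests = math.ceil(total_repos / per_page)
    # Assumption: one additional request per repository to fetch its README.
    readme_requests = total_repos
    return {
        "list_requests": list_requests,
        "list_hours": list_requests / requests_per_hour,
        "total_hours": (list_requests + readme_requests) / requests_per_hour,
    }


print(estimate(10_000_000))
# {'list_requests': 100000, 'list_hours': 20.0, 'total_hours': 2020.0}
```

Under these assumptions the listing alone costs about 20 hours of rate limit, and fetching every README through the API pushes the total far higher, so the figure above is best read as a lower bound.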

Steps:

1. Get the JSON with all repositories (https://developer.github.com/v3/repos/#list-all-public-repositories).
2. Use pagination to fetch 100 objects per query (https://developer.github.com/v3/#link-header).
3. Decode the JSON and retrieve the list of repositories.
4. For each repository, take the repository url from its JSON object; it gives you the link to the repository.
5. Now you need the README contents. There are two ways:
   a) Use the GitHub API: using the repo url, send a request to https://api.github.com/repos/:owner/:repo/readme (https://developer.github.com/v3/repos/contents/#get-the-readme), and either decode the file (it is encoded using base64) or follow the html property of the JSON, e.g. "html": "https://github.com/pengwynn/octokit/blob/master/readme.md". If there is no README, you get a 404 Not Found code, so you can proceed to the next repository.
   b) Build the URL of the README from the url that step 4 gives you, e.g. https://api.github.com/repos/octocat/hello-world; parse it and transform it to https://github.com/octocat/hello-world/readme.md. This is more complicated, especially in the case where there is no README.
6. Search through the file for the specific text, and record whether or not you have found it.
7. Iterate until you have gone through all repositories.
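The steps above can be sketched roughly like this, using option (a) for the README. It is a minimal sketch, not a finished crawler: the helper names (`http_get`, `next_page_url`, `readme_text`, `find_matching_repos`) and the User-Agent string are my own, it sends no credentials, and it does not handle rate limiting. The `fetch` parameter exists only so the HTTP layer can be swapped out.

```python
import base64
import json
import re
import urllib.error
import urllib.request

API = "https://api.github.com"


def http_get(url):
    """Return (status, Link header, decoded JSON); a 404 yields (404, "", None)."""
    req = urllib.request.Request(url, headers={
        "Accept": "application/vnd.github.v3+json",
        "User-Agent": "readme-search-sketch",  # hypothetical client name
    })
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.headers.get("Link", ""), json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return 404, "", None
        raise


def next_page_url(link_header):
    """Extract the rel="next" URL from a Link header, or None on the last page."""
    match = re.search(r'<([^>]+)>;\s*rel="next"', link_header or "")
    return match.group(1) if match else None


def readme_text(readme_json):
    """Decode the base64 'content' field of a README response."""
    raw = base64.b64decode(readme_json["content"])
    return raw.decode("utf-8", errors="replace")


def find_matching_repos(needle, fetch=http_get, start_url=API + "/repositories"):
    """Yield the full_name of every repository whose README contains needle."""
    url = start_url
    while url:
        _, link, repos = fetch(url)            # steps 1-3: one page of repositories
        for repo in repos or []:
            status, _, readme = fetch(repo["url"] + "/readme")  # steps 4 and 5a
            if status == 404:                  # no README: move on
                continue
            if needle.lower() in readme_text(readme).lower():   # step 6
                yield repo["full_name"]
        url = next_page_url(link)              # step 7: follow pagination
```

Calling `find_matching_repos("some text")` and iterating over the result would start a real crawl, so in practice you would want to add authentication and stop or sleep when the rate limit is reached.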

Advanced things: if you plan on running this more often, I recommend that you use caching with conditional requests (https://developer.github.com/v3/#conditional-requests). Store the date and time when you did the query, and use it later to see whether anything has changed in the repository. This will eliminate many of your subsequent queries if you need to keep your information up to date. You will still have to retrieve the whole list of repositories, though (but you then only search the updated repositories).
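A conditional request of this kind might look like the sketch below. It assumes the server returns a Last-Modified header and answers 304 Not Modified to a matching If-Modified-Since header, as the linked documentation describes; the function names, the cache layout (a plain dict keyed by URL) and the User-Agent string are hypothetical choices of mine.

```python
import json
import urllib.error
import urllib.request


def send_request(url, headers):
    """Return (status, Last-Modified header, decoded JSON); a 304 yields (304, None, None)."""
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.headers.get("Last-Modified"), json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 Not Modified as an HTTPError
            return 304, None, None
        raise


def fetch_if_changed(url, cache, send=send_request):
    """Return (body, changed), reusing the cached body when the server says 304."""
    headers = {"User-Agent": "readme-search-sketch"}  # hypothetical client name
    cached = cache.get(url)
    if cached:
        # Ask the server to reply only if the resource changed since last time.
        headers["If-Modified-Since"] = cached["last_modified"]
    status, last_modified, body = send(url, headers)
    if status == 304:
        return cached["body"], False
    if last_modified:
        # Remember the timestamp and body for the next run.
        cache[url] = {"last_modified": last_modified, "body": body}
    return body, True
```

On a second run with the same `cache`, unchanged READMEs come back as `(cached_body, False)`, so only repositories reported as changed need to be searched again. To keep the cache between runs you would have to persist the dict yourself, for example as a JSON file.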

Of course, to make it faster, you can improve the algorithm and make it parallel: retrieve 100 repositories, then proceed to retrieve the next 100 and, in the meantime, check whether the first 100 repositories contain a README file and whether the README has what you are searching for, and so on. This will certainly make things faster. You will need to use some sort of buffer, because you do not know which part finishes faster (getting the repositories list or searching through it).
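One way to arrange such a buffer is a bounded queue between a thread that lists repositories and a few threads that check READMEs. The sketch below is only an illustration of that idea: `pipeline`, `pages` and `has_match` are hypothetical names, where `pages` is any iterable of repository lists and `has_match` is whatever function fetches and searches a README.

```python
import queue
import threading

_DONE = object()  # sentinel telling a worker that no more repositories will arrive


def pipeline(pages, has_match, workers=4, buffer_size=200):
    """List repositories in one thread while worker threads search their READMEs."""
    buf = queue.Queue(maxsize=buffer_size)  # the buffer between both stages
    matches = []
    lock = threading.Lock()

    def producer():
        for page in pages:           # e.g. 100 repositories per page
            for repo in page:
                buf.put(repo)        # blocks while the buffer is full
        for _ in range(workers):
            buf.put(_DONE)           # one sentinel per worker

    def consumer():
        while True:
            repo = buf.get()         # blocks while the buffer is empty
            if repo is _DONE:
                return
            try:
                found = has_match(repo)
            except Exception:
                continue             # sketch only: skip repositories that fail
            if found:
                with lock:
                    matches.append(repo)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return matches
```

Because the queue is bounded, whichever side is faster simply waits for the other, so it does not matter in advance whether listing or searching finishes first. The order of the returned matches is not deterministic, and running requests in parallel does not raise the hourly request limit.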

Hope this helps.
