Querying the GitHub API for users with a README that matches text


I want to retrieve users whose repositories contain a README file that contains text matching a string passed in the query. Is this possible using the GitHub API?

In addition, I would like to include location and language in the query.

Thanks.

This is not straightforward using the API as it is available now. However, you can still use the API to get what you want.

Be warned that there are over 10 million repositories on GitHub, so this will take a long time. You can retrieve a list of only 100 repositories per query, so you need to use pagination, which means more than 100,000 requests just to list the repositories. A user is limited to 5,000 requests per hour and is then "banned" for an hour. It will take more than 40 hours if you're using only one user's credentials.
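As a rough check of those numbers, here is a back-of-the-envelope sketch. The function name `estimate_listing` is made up for illustration; the inputs are the figures quoted above (10 million repositories, 100 per page, 5,000 requests per hour).

```python
import math

def estimate_listing(total_repos, per_page, requests_per_hour):
    """Return (list_requests, hours_of_quota) needed to page through every repository.

    Hypothetical helper: it only counts the listing requests, not the
    per-repository README requests or time spent waiting for the limit to reset.
    """
    list_requests = math.ceil(total_repos / per_page)
    hours_of_quota = list_requests / requests_per_hour
    return list_requests, hours_of_quota

requests_needed, hours = estimate_listing(10_000_000, 100, 5_000)
print(requests_needed, hours)
```

This gives 100,000 listing requests and 20 hours of request quota, which is only a lower bound: waiting out the rate limit and fetching one README per repository push the real total well beyond that.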

Steps:

  1. Get the JSON for all public repositories (https://developer.github.com/v3/repos/#list-all-public-repositories).

  2. Use pagination to fetch 100 objects per query (https://developer.github.com/v3/#link-header).

  3. Decode the JSON and retrieve the list of repositories.

  4. For each repository you need the repository url field from the JSON object, which gives you the API link to that repository.

  5. Now you need the README contents. There are 2 ways: a) Use the GitHub API: take the repo url and send a request to https://api.github.com/repos/:owner/:repo/readme (https://developer.github.com/v3/repos/contents/#get-the-readme), then either decode the file content (it is encoded using Base64) or follow the html property of the JSON, e.g. "html": "https://github.com/pengwynn/octokit/blob/master/readme.md". If there is no README, you get a 404 Not Found status code and can proceed to the next repository.

    b) Build the README URL yourself from the url that step 4 gives you, e.g. https://api.github.com/repos/octocat/hello-world; parse it and transform it into https://github.com/octocat/hello-world/readme.md. This is more complicated, especially in the case where there is no README.

  6. Search through the file for the specific text, and record whether or not you found it.

  7. Iterate until you have gone through all the repositories.
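The steps above can be sketched roughly as follows, using only the Python standard library and option a) for the README. This is a minimal sketch, not a complete crawler: the helper names (`next_link`, `get_json`, `readme_text`, `repos_with_matching_readme`), the `max_pages` cap, the `User-Agent` string and the case-insensitive match are all assumptions made for illustration, and there is no rate-limit handling.

```python
import base64
import json
import re
import urllib.error
import urllib.request

API = "https://api.github.com"

def next_link(link_header):
    """Step 2: extract the rel="next" URL from a Link header, or None if absent."""
    if not link_header:
        return None
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>\s*;\s*rel="next"', part)
        if match:
            return match.group(1)
    return None

def get_json(url, token=None):
    """Steps 1 and 3: GET a URL and return (decoded JSON body, response headers)."""
    request = urllib.request.Request(url, headers={"User-Agent": "readme-search-sketch"})
    if token:
        request.add_header("Authorization", "token " + token)
    with urllib.request.urlopen(request) as response:
        return json.load(response), response.headers

def readme_text(readme_json):
    """Step 5a: decode the Base64 'content' field of a README response."""
    return base64.b64decode(readme_json["content"]).decode("utf-8", errors="replace")

def repos_with_matching_readme(needle, token=None, max_pages=1):
    """Steps 4, 6 and 7: return full names of repositories whose README contains needle."""
    matches = []
    url = API + "/repositories"
    pages = 0
    while url and pages < max_pages:  # max_pages is an assumed safety cap
        repos, headers = get_json(url, token)
        for repo in repos:
            try:
                readme, _ = get_json(repo["url"] + "/readme", token)
            except urllib.error.HTTPError as error:
                if error.code == 404:  # no README: move on to the next repository
                    continue
                raise
            if needle.lower() in readme_text(readme).lower():
                matches.append(repo["full_name"])
        url = next_link(headers.get("Link"))
        pages += 1
    return matches
```

Calling `repos_with_matching_readme("some text", token="...")` would make one request per page plus one per repository, so the request quota is used up quickly; a real run needs to watch the rate limit and pause accordingly.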

Advanced things: if you plan on running this more often, I can recommend using caching (https://developer.github.com/v3/#conditional-requests). Store the date and time when you did the query, and use it later to see whether anything has changed in the repository. This will eliminate many of the subsequent queries if you need to have up-to-date information. You still have to retrieve the whole list of repositories though (but then you search only the updated repositories).

Of course, to make it faster you can improve the algorithm by making it parallel: retrieve 100 repositories, then proceed to retrieve the next 100, and in the meanwhile check whether the first 100 repositories contain a README file and whether the README has what you are searching for, and so on. This will certainly make things faster. You will need to use some sort of buffer, as you do not know which part finishes faster (getting the repositories list, or searching through them).

Hope this helps.
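A conditional request can be sketched like this with the standard library. The function names `conditional_headers` and `fetch_if_changed` are made up for illustration; the `If-None-Match` and `If-Modified-Since` headers and the 304 Not Modified response are standard HTTP, which the linked GitHub documentation builds on.

```python
import urllib.error
import urllib.request

def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on a previous run."""
    headers = {"User-Agent": "readme-search-sketch"}  # assumed client name
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when nothing changed."""
    request = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as error:
        if error.code == 304:  # urllib reports 304 Not Modified as an HTTPError
            return None, etag, last_modified
        raise
```

The idea is to store the returned `etag` and `last_modified` next to each repository's result and pass them back in on the next run, so that unchanged READMEs are skipped without downloading them again.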
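One possible shape for such a buffered, parallel version is a bounded queue between a thread that lists repositories and several threads that check READMEs. This is only a sketch: `run_pipeline`, `fetch_pages` and `check_repo` are hypothetical names, the buffer and worker counts are arbitrary, and error handling is left out.

```python
import queue
import threading

def run_pipeline(fetch_pages, check_repo, buffer_size=4, workers=4):
    """fetch_pages: iterable yielding lists of repositories (one list per page).
    check_repo: callable returning True when a repository's README matches."""
    buffer = queue.Queue(maxsize=buffer_size)  # the buffer between both stages
    matches = []
    lock = threading.Lock()
    done = object()  # sentinel telling a consumer to stop

    def producer():
        for page in fetch_pages:
            buffer.put(page)  # blocks while the buffer is full
        for _ in range(workers):
            buffer.put(done)

    def consumer():
        while True:
            page = buffer.get()  # blocks while the buffer is empty
            if page is done:
                return
            found = [repo for repo in page if check_repo(repo)]
            with lock:
                matches.extend(found)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return matches
```

Whichever side is faster simply waits on the queue, so it does not matter which one finishes first. Note that running in parallel does not raise the hourly request limit; it only keeps both stages busy while requests are in flight.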

