

A while ago I was inspired by Simon Willison's git scraping technique, which led me to maintain my own public scraper.

My scraper is a little different though. The original technique scrapes the data and overwrites the previous copy on every run. I append each new result to the existing files instead, so the files themselves hold the full history. That also lets me serve a public dashboard straight from the git repo!
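To make the difference concrete, here is a minimal Python sketch of the two write styles. It is an illustration rather than the code my scraper runs: the function names, the `timestamp` field and the record layout are all assumptions made for the example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def overwrite_snapshot(path, record):
    """Overwrite style: the file only ever holds the latest scrape."""
    Path(path).write_text(json.dumps(record) + "\n")


def append_record(path, record, now=None):
    """Incremental style: add one timestamped JSON line per scrape.

    The "timestamp" field is an assumption for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    row = {"timestamp": now.isoformat(), **record}
    with open(path, "a") as f:
        f.write(json.dumps(row) + "\n")
```

With the first function you need `git log` to recover older values. With the second, the newest version of the file can be charted as-is.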

Here's a workflow demo that collects download statistics for my Python projects. A command line script prints each result to stdout, which is then appended to the appropriate file. At the end I concatenate all the results into a single file before committing the changes back to master.

name: Kollekt Pepy

on:
  schedule:
    - cron:  '0 10 * * *'

jobs:
  # The job id is not shown in the original; "scheduled" is a placeholder.
  scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
      uses: actions/checkout@v2
    - name: Set up Python 3.7
      uses: actions/setup-python@v1
      with:
        python-version: 3.7
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Fetch latest data
      run: |-
        python download/ scikit-lego >> data/pepy/scikit-lego.jsonl
        python download/ human-learn >> data/pepy/human-learn.jsonl
        python download/ whatlies >> data/pepy/whatlies.jsonl
        python download/ drawdata >> data/pepy/drawdata.jsonl
        python download/ tokenwiser >> data/pepy/tokenwiser.jsonl
        python download/ memo >> data/pepy/memo.jsonl
        python download/ clumper >> data/pepy/clumper.jsonl
        python download/ mktestdocs >> data/pepy/mktestdocs.jsonl
    - name: Concatenate it all
      run: |-
        python common/ data/pepy/*.jsonl data/pepy/downloads.csv
    - name: Commit and push if it changed
      run: |-
        git config user.name "Automated"
        git config user.email ""
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push

You can find the project on GitHub.
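The "Concatenate it all" step can be sketched roughly as follows. This is a hedged illustration, not the script in the repo: the function name `concat_jsonl_to_csv`, the `project` column derived from the file name, and the `date`/`downloads` fields used in the usage note are all assumptions.

```python
import csv
import json
from pathlib import Path


def concat_jsonl_to_csv(jsonl_paths, csv_path):
    """Merge several JSON Lines files into one CSV and return the row count."""
    rows = []
    for path in map(Path, jsonl_paths):
        with path.open() as f:
            for line in f:
                if line.strip():  # skip blank lines
                    # Assumed convention: memo.jsonl -> project "memo".
                    # A "project" key inside the record takes precedence.
                    rows.append({"project": path.stem, **json.loads(line)})
    # Union of keys across all rows, in first-seen order, for the header.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Called the way the workflow passes its arguments, that would be something like `concat_jsonl_to_csv(sys.argv[1:-1], sys.argv[-1])`, with the last argument as the output file. For records shaped like `{"date": ..., "downloads": ...}` the CSV gets the columns `project`, `date` and `downloads`.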


It's a pretty powerful technique that you can easily combine with my justcharts library. I'm using the repo's GitHub Pages to host a dashboard here, but the data can also be viewed via flatgithub. The flatgithub project is part of the Flat Data effort at GitHub.