
Kolektor

A while ago I was inspired by Simon Willison's git scraping technique, which led me to maintain my own public scraper.

My scraper is a little bit different though. The original technique scrapes the data and overwrites the previous version. Mine appends to the existing data and commits the incremental changes instead. The idea is that this allows me to have a public git dashboard as well!
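
To make the difference concrete, here is a minimal sketch of the two write strategies. The function names and the `collected_at` field are hypothetical and chosen for illustration; they are not taken from the actual project.

```python
import json
from datetime import datetime, timezone


def overwrite_snapshot(path, record):
    """Overwrite style: the file only ever holds the latest snapshot."""
    with open(path, "w") as f:
        f.write(json.dumps(record) + "\n")


def append_record(path, record):
    """Incremental style: every run adds one timestamped JSON line."""
    # "collected_at" is an assumed field name, not part of the original scripts.
    stamped = {"collected_at": datetime.now(timezone.utc).isoformat(), **record}
    with open(path, "a") as f:
        f.write(json.dumps(stamped) + "\n")
```

With the append style the full history sits in the file itself, so a chart can read one file instead of walking through the git log.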

Here's a demo workflow that collects download statistics for my Python projects. A command line script prints to stdout, and that output is appended to the appropriate file. At the end I concatenate all the results into a single file before committing the changes back to master.

name: Kollekt Pepy

on:
  workflow_dispatch:
  schedule:
    - cron:  '0 10 * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
      uses: actions/checkout@v2
    - name: Set up Python 3.7
      uses: actions/setup-python@v1
      with:
        python-version: 3.7
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Fetch latest data
      run: |-
        python download/pepy.py scikit-lego >> data/pepy/scikit-lego.jsonl
        python download/pepy.py human-learn >> data/pepy/human-learn.jsonl
        python download/pepy.py whatlies >> data/pepy/whatlies.jsonl
        python download/pepy.py drawdata >> data/pepy/drawdata.jsonl
        python download/pepy.py tokenwiser >> data/pepy/tokenwiser.jsonl
        python download/pepy.py memo >> data/pepy/memo.jsonl
        python download/pepy.py clumper >> data/pepy/clumper.jsonl
        python download/pepy.py mktestdocs >> data/pepy/mktestdocs.jsonl
    - name: Concatenate it all
      run: |-
        python common/concat.py data/pepy/*.jsonl data/pepy/downloads.csv
    - name: Commit and push if it changed
      run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push

You can find the project on GitHub.
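
The concatenation step in the workflow above takes a list of `.jsonl` files and an output path. The real `common/concat.py` isn't shown here, so the following is only a sketch of what such a step could look like, assuming each input line is a flat JSON object. The function name `concat_jsonl_to_csv` is hypothetical.

```python
import csv
import json


def concat_jsonl_to_csv(jsonl_paths, csv_path):
    """Merge JSON-lines files into one CSV and return the number of rows."""
    rows = []
    for path in sorted(jsonl_paths):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    rows.append(json.loads(line))
    # Use the union of keys in first-seen order so no field is dropped.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # missing keys are written as empty cells
    return len(rows)
```

A command line wrapper could treat the last argument as the output path and everything before it as inputs, which matches how the workflow calls the script.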

Visuals

It's a pretty powerful technique that you can easily combine with my justcharts library. I'm using the repository's GitHub Pages site to host a dashboard here, but the data can also be viewed via flatgithub. The flatgithub project is part of the Flat Data effort on GitHub.