Filling a Wikibase instance with millions of items

As more and more Wikibase instances crop up, we are seeing attempts to bootstrap them with masses of data from existing databases whose owners want to switch to the new software.

I experimented to find a faster way to insert a huge number of items into a Wikibase instance. Using the ‘official’ tools, such as QuickStatements or the WDI library, I have not been able to insert more than two or three statements per second.

Therefore, I am inserting the data directly into the MySQL database used by Wikibase.

The process consists of these steps:

  • generate the data for an item in JSON
  • determine the next Q number and update the JSON item data accordingly
  • insert the data into the various database tables (the sketch after this list illustrates the first two steps)
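
For illustration, here is a minimal Python sketch of the first two steps. It assumes the mysql-connector client, and the function names are my own; the repository linked at the end does the full job, including step three. Wikibase keeps the last assigned item id in the wb_id_counters table, so the next Q number can be claimed by bumping that counter:

    import json
    import mysql.connector  # assumed client library; any MySQL driver works

    def next_q_number(cur):
        # Wikibase stores the last assigned item id in wb_id_counters;
        # lock the row, increment it, and use the new value as our Q number.
        cur.execute(
            "SELECT id_value FROM wb_id_counters"
            " WHERE id_type = 'wikibase-item' FOR UPDATE"
        )
        (last,) = cur.fetchone()
        cur.execute(
            "UPDATE wb_id_counters SET id_value = %s"
            " WHERE id_type = 'wikibase-item'",
            (last + 1,),
        )
        return last + 1

    def make_item_json(qid, label):
        # Minimal item document; a real item would also carry
        # statements, descriptions, and sitelinks.
        return json.dumps({
            "type": "item",
            "id": f"Q{qid}",
            "labels": {"en": {"language": "en", "value": label}},
        })

The item JSON becomes the content of a regular wiki page in the Item namespace, which is why step three has to touch the standard MediaWiki tables (page, revision, text) in addition to the Wikibase-specific ones.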

However, if you do this without a transaction it is still terribly slow: in my setup, only about 120 items per minute. Once I wrapped the inserts in a single transaction, I was able to insert 33,000 items per minute.
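
The difference comes almost entirely from commit overhead: with autocommit enabled, every single INSERT is flushed as its own transaction, while one explicit transaction amortizes that cost over the whole batch. A minimal sketch of the pattern, again assuming the mysql-connector client; the user, password, and database name below are assumptions based on typical wikibase-docker defaults, and insert_item() stands in for the real per-table INSERTs:

    import mysql.connector  # assumed client library

    def insert_item(cur, item_json):
        # Placeholder for the real per-item INSERTs into page, revision,
        # text, etc.; shown with a single representative statement.
        cur.execute(
            "INSERT INTO text (old_text, old_flags) VALUES (%s, 'utf-8')",
            (item_json,),
        )

    conn = mysql.connector.connect(
        host="127.0.0.1", port=3306,          # port published by docker-compose
        user="wikiuser", password="sqlpass",  # assumed defaults, adjust as needed
        database="my_wiki",
    )
    conn.autocommit = False  # do not commit after every single INSERT
    cur = conn.cursor()

    items = []  # the JSON documents generated in the first step
    for item_json in items:
        insert_item(cur, item_json)

    conn.commit()  # one commit for the whole batch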

Steps to run the experiment

  • Make sure the MySQL service in docker-compose.yml publishes port 3306, for example:

      mysql:
        image: mariadb:10.3
        restart: unless-stopped
        ports:
          - "3306:3306"
        volumes:

  • Start the containers: docker-compose up and wait until you see lines ending like:

      [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got no real changes
      [main] INFO  org.wikidata.query.rdf.tool.Updater - Sleeping for 10 secs

For me, it took a minute to insert 100 items without a transaction, but only 25 seconds to insert 10,000 items with one.


First published at https://github.com/jze/wikibase-insert/
