Contents

Postgres - JSON

I became curious about the json and jsonb types in Postgres, so today I want to check what they are and how to use them.

Today I’m going to:

  • define a table with a json or jsonb column
  • create some data (manually or randomly generated)
  • apply some json-manipulating functions
  • try to create and use a database index on such a column
Documentation

These are the two most important resources about json and jsonb in Postgres:

And RFC 7159, which defines what JSON actually is

Context

Why does Postgres have json-like types at all? Isn’t it true that the world of data(bases) has split between heavy, corporate RDBMSes with fossilized table structures and lightweight, startuppy, almost-schemaless, eventually-consistent NoSQL databases?

To be honest, this is still my mental model of database systems.

Usecases

I’ve found some, but I’m under the impression that the json datatype in Postgres is not particularly popular on the net.

So a column of type json or jsonb can be used to:

  • store data displayed in graphical user interfaces of web applications
  • store event data from multiple sources, where the content (aka payload) differs depending on the source it was generated in - I present such a hypothetical situation in the example below
  • store the state of a single-player game, a state which changes as the game advances (and the json schema changes as well)
  • in tools for data integration between systems characterized by different data schemas
  • for data synchronization between different systems

Sources: When to use.., SO - what are the proper…, Using JSONB

Since when?

The json type has existed in Postgres for almost ten years already, jsonb for seven, and jsonpath as well as support for the SQL/JSON standard (mainly functions operating on and generating json values) since version 12, i.e. since 2019.

year   version   scope
2012   9.2       JSON support
2014   9.4       JSONB support
2019   12        SQL/JSON and jsonpath support

Standard SQL:2016:

And maybe mongo?

I’ve read about the dilemma of choosing between Postgres and MongoDB in the article “Postgres vs mongo for storing json data”:

  • if the stored data is static, more often read than written, or there is no need to modify it at all - use Postgres
  • if the usage patterns are such that record modification will be frequent - use Mongo, since it has really powerful tools for doing this

The article concludes that the choice should be balanced by considering the scaling options, availability of specialists etc.

Hm, the article was written in 2016, when Postgres had no functions for creating or processing json, nor for searching with jsonpath or filtering.

The types: json, jsonb and jsonpath

What are those three types and how do they differ?

json:

  • stores an exact copy of the input text
  • processing functions need to re-parse the value each time they are used
  • adding/inserting data is fast - no parsing needed
  • keeps spaces, key ordering, and all values stored under the same key

jsonb:

  • stored in a binary format
  • processing functions are fast
  • inserting takes longer - the document is parsed on input
  • does not keep unnecessary spaces or key ordering; deduplicates keys - stores the last value with a given key

jsonpath:

  • used to search for data inside json/jsonb values
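The normalization done by jsonb can be seen directly in psql; a quick sketch (the literal below is illustrative):

```sql
-- json keeps the input verbatim; jsonb strips extra whitespace,
-- reorders keys, and keeps only the last value for a duplicate key
select '{"b": 2,  "a": 1, "a": 3}'::json  as as_json,
       '{"b": 2,  "a": 1, "a": 3}'::jsonb as as_jsonb;
--           as_json            |     as_jsonb
-- -----------------------------+------------------
--  {"b": 2,  "a": 1, "a": 3}   | {"a": 3, "b": 2}
```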

Example

Let’s assume I use a database to store events (in an events table) that come from different sources: from an application running on a server, from a google service, or from a nas server.

events table

Each source (stored in a sources table) generates events in a different format; however, each payload contains a userid value that makes it possible to correlate events for a single user coming from different systems.

sources table

The sample data suggests there are three users who

  • perform some click actions in the user interface of the server application (source server)
  • listen to music stored on another server (source nas) and
  • search for some information on the internet (source google bot).
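The screenshots of the two tables are not reproduced here, so here is a minimal sketch of the schema I’m assuming; the column names src_id and body come from the queries later in the post, the rest is my guess:

```sql
create table sources (
  id   serial primary key,
  name text not null            -- 'server', 'nas', 'google bot'
);

create table events (
  id     serial primary key,
  src_id int references sources (id),
  body   jsonb not null         -- the payload; its shape depends on the source
);

insert into sources (name) values ('server'), ('nas'), ('google bot');
```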

Searching based on user id

How to search for events specific to the user with id 12? Regardless of the source of the event, each event contains a userid key on the top level of the payload object.

In order to retrieve data for user 12 I use the ->> operator in the WHERE clause (this operator returns the text type, which I can compare with another text value, while the -> operator returns the json or jsonb type).

searching a json object using a key
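The query from the screenshot could look roughly like this:

```sql
-- ->> returns text, so the comparison is against the text value '12'
select * from events
where body ->> 'userid' = '12';
```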

Searching and sorting by timestamp

Each event has its timestamp value; however, each source places the timestamp in a different part of the payload:

  • sources 2 (nas) and 3 (google bot) place the timestamp under the “ts” key on the top level of the object, next to the “userid” key
  • source 1 (server) also places the timestamp under a “ts” key, however the key is nested inside an object placed under the “action” key on the top level

In order to search for the timestamp I’m going to use the #>> operator, which allows providing a “path” to the desired data in the form of an array of subsequent keys. Those keys are used to access elements in order until the data is found.

  • for sources 2 and 3 the path is {'ts'}
  • and for source 1 the path is {'action', 'ts'}

In order for Postgres to “know” that the {'action', 'ts'} text means an array of texts, I need to cast the value returned by CASE to text[], which is a Postgres array type.

In order to sort by timestamp (or in general apply any timestamp-related functions) I need to cast the result to timestamptz. This is why I use #>>, which returns text, and not #>, which returns jsonb: so that the text result can be cast into another data type.

searching in json object using a list of keys
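Putting the pieces together, the query behind the screenshot could be sketched like this (the source-to-path mapping follows the list above):

```sql
-- pick the right path per source, cast the CASE result to text[],
-- then cast the extracted text to timestamptz so it sorts correctly
select e.*,
       (e.body #>> (case e.src_id
                      when 1 then '{action,ts}'
                      else        '{ts}'
                    end)::text[])::timestamptz as event_ts
from events e
order by event_ts;
```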

Searching using jsonpath

Another way of pointing to or extracting an element from a complex object’s structure is using a dedicated Postgres type: jsonpath. Such a path can be passed to the jsonb_path_query(jsonb, jsonpath) function, which returns a set (of possibly only one element) of jsonb values.

Here, again, one needs to differentiate between event sources in order to use the proper jsonpath value:

  • for sources 2 and 3 the jsonpath is just $.ts
  • for source 1 the path is $.action.ts

searching in a json object using jsonpath
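The jsonpath variant of the query could be sketched as follows:

```sql
-- both CASE branches are cast to jsonpath so the CASE has a single type
select e.*,
       jsonb_path_query(e.body,
                        case e.src_id
                          when 1 then '$.action.ts'::jsonpath
                          else        '$.ts'::jsonpath
                        end) as event_ts
from events e;
```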

Small optimization

Using a CASE statement every time a different event source has to be considered is not efficient, especially if new event sources may be added to the database. In that case we might end up fixing every query containing the CASE statement, making them all more complex. Therefore it is much better to store the jsonpaths as part of the source data (in the sources table). Writing queries requiring source distinction then becomes much more convenient, for example:

searching in json object using jsonpath without CASE
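A sketch of this idea, assuming a ts_path column (the name is my invention) is added to the sources table:

```sql
-- store each source's timestamp path next to the source itself
alter table sources add column ts_path jsonpath;
update sources set ts_path = '$.action.ts' where id = 1;
update sources set ts_path = '$.ts'        where id in (2, 3);

-- no CASE needed any more - a new source just brings its own path
select e.*, jsonb_path_query(e.body, s.ts_path) as event_ts
from events e
join sources s on s.id = e.src_id;
```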

Generating data

I will try to make the events table a bit bigger: I’ll generate mock events for source 1 (server) using several Postgres functions:

  • random() to generate a single random value
  • generate_series(start, stop, [step]) to generate a series of timestamps between two arbitrarily chosen dates (with a random step):
insert into events(body, src_id)
  select json_build_object(
      'action', json_build_object('ts', ts, 'click_pos', click_pos),
      'userid', b.userid) body,
    1 src_id
  from (
    select
      s.t ts,
      array[ round(random() * 1600)::int4,
             round(random() * 1200)::int4 ] click_pos,
      round(random() * 100)::int userid
    from (
      select generate_series(
        '2012-02-12'::timestamptz,
        '2013-01-23'::timestamptz,
        ((random() * 20.0)::text || ' minutes')::interval) as t) s
    ) b;

Over 60 thousand inserts took almost 3 seconds, but this happened because I had created an index on the body column, and it is probably the index update that took most of the time.

INSERT 0 62985
Query returned successfully in 2 secs 740 msec.

Out of curiosity I deleted the index:

drop index ginbody;

and inserting over twice as many rows took less than two seconds:

INSERT 0 113379

Query returned successfully in 1 secs 876 msec.

The different number of generated rows is the result of using a randomized step in the timestamp generation function.

Generated data

Here is what the generated data looks like. The shape of the json values is the same as in the sample data I invented at the beginning of the example:

generating random data

Indexes

Postgres allows creating several types of indexes on a column of jsonb type. Here I’m going to use GIN (Generalized Inverted Index).

GIN index

A GIN index is used to index objects with a complex internal structure. It was designed to support queries about values inside such a structure. The index has two modes of operation:

  • jsonb_ops (default) indexes each key and each value in the complex data and allows the use of the following operators: ?, ?|, ?&, @>, @@, @?
  • jsonb_path_ops indexes only the values (together with the paths leading to them) and allows the use of the following operators: @>, @@, @?

The syntax for creating an index is as follows:

CREATE INDEX userginidx ON events USING gin (body);
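For the containment-style queries used below, the smaller jsonb_path_ops variant would work as well; a sketch (the index name is my invention):

```sql
-- indexes only paths+values; supports @>, @@ and @?,
-- but not the key-existence operators ? / ?| / ?&
CREATE INDEX userginpathidx ON events USING gin (body jsonb_path_ops);
```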

Searching without an index

What is the predicted execution time reported by the Postgres planner for a simple query about the data of the user with id 12?

select * from events
where body @> '{"userid": 12}'

Here is the analysis of the above query when there is no index. “Sequential scan” means there is a need to go through each row sequentially and remove non-matching rows from the result. The search takes 85.9 ms:

searching without index
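The plan shown in the screenshot can be reproduced with EXPLAIN ANALYZE:

```sql
explain analyze
select * from events
where body @> '{"userid": 12}';
-- without an index the plan starts with: Seq Scan on events ...
-- with the GIN index in place it becomes: Bitmap Heap Scan on events ...
```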

Searching after creating a GIN index

After the index is created, the query planner will use it for the query above (“Bitmap heap scan” in the plan means exactly that), and the predicted time of execution drops to 3.1 ms:

searching with gin index

This means that… wait a sec…

bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
85.9-3.1
82.8
last/85.9
.96391152502910360884

Creating the GIN index improved the execution time by 96%!

Such a difference makes a difference, doesn’t it?

Sources: Using JSONB, JSONB type perf

The End

That’s all for today. I encourage all interested readers to read about and study the other types of indexes themselves 😂

The food will knock to our door soon, so I’m closing my laptop now.

Till the next time!


This post is part of the sql series.

All posts in this series: