Postgres - JSON
I became curious about the `json` and `jsonb` types in Postgres, so today I want to check what they are and how to use them.
Today I’m going to:
- define a table with a `json` or `jsonb` column
- create some data (manually or randomly generated)
- apply some JSON-manipulating functions
- try to create and use a database index on such a column
Context
Why does Postgres have JSON-like types at all? Isn't it true that the world of data(bases) has split between heavy, corporate RDBMSes with fossilized table structures and lightweight, startuppy, almost-schemaless, eventually-consistent NoSQL databases?
To be honest, this is still my mental model of database systems.
Usecases
I've found some, but I'm under the impression that the json datatype in Postgres is not particularly popular on the net.
So a column of type `json` or `jsonb` can be used to:
- store data displayed in graphical user interfaces of web applications
- store event data from multiple sources, where the content (aka payload) differs depending on the source it was generated in - I present such a hypothetical situation in the example below
- store the state of a single-player game, a state that changes as the game advances (and the JSON schema changes with it)
- integrate data between systems characterized by different data schemas
- synchronize data between different systems
Sources: When to use.., SO - what are the proper…, Using JSONB
Since when?
The `json` type has existed in Postgres for almost ten years already, `jsonb` for seven, and `jsonpath` as well as support for the SQL/JSON standard (mainly functions operating on and generating JSON values) since version 12, i.e. since 2019.
| year | version | scope |
|---|---|---|
| 2012 | 9.2 | JSON support |
| 2014 | 9.4 | JSONB support |
| 2019 | 12 | SQL/JSON and JSONPATH support |
Standard SQL:2016:
And maybe mongo?
I've read about the dilemma of choosing between Postgres and MongoDB in the article “Postgres vs mongo for storing json data”:
- if the stored data is static, more often read than written, or there is no need to modify it at all - use Postgres
- if the usage patterns are such that record modification will be frequent - use Mongo, since it has really powerful tools for doing this
The article concludes that the choice should also be balanced by considering scaling options, availability of specialists, etc.
Hm, the article was written in 2016, when Postgres had no functions for creating or processing JSON, searching with jsonpath, or filtering.
The types: json, jsonb and jsonpath
What are those three types and how do they differ?
| json | jsonb | jsonpath |
|---|---|---|
| exact copy of the input text | stored in a binary format | used to search for data |
| processing functions need to re-parse the value each time they are used | processing functions are fast | |
| adding/inserting data is fast, no parsing needed | inserting takes longer, the document is parsed on input | |
| keeps whitespace | does not keep unnecessary whitespace | |
| keeps key ordering | does not keep key ordering | |
| keeps all values stored under the same key | deduplicates keys - stores the last value with a given key | |
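The duplicate-key and whitespace differences are easy to observe by casting the same literal to both types; a minimal sketch:

```sql
-- json keeps the input verbatim, duplicate keys and spaces included
SELECT '{"a": 1,  "a": 2}'::json;
-- → {"a": 1,  "a": 2}

-- jsonb normalizes it: the last duplicate key wins, whitespace is dropped
SELECT '{"a": 1,  "a": 2}'::jsonb;
-- → {"a": 2}
```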
Example
Let's assume I use the database to store events (in the `events` table) that come from different sources: from an application running on a server, from a Google service, or from a NAS server.
Each source (stored in the `sources` table) generates events in a different format; however, each payload contains a user id value that allows correlating events for a single user coming from different systems.
The sample data suggests there are three users who
- perform some clickable actions in the user interface of the server application (source server)
- listen to music stored on another server (source nas) and
- search for some information on the internet (source google bot).
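The original post shows the tables and sample rows at this point; here is a minimal sketch of what they might look like (all column names except `body`, which the post mentions later, are my assumptions):

```sql
CREATE TABLE sources (
  id   int  PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE events (
  id        serial PRIMARY KEY,
  source_id int REFERENCES sources (id),
  body      jsonb NOT NULL              -- the event payload
);

INSERT INTO sources VALUES (1, 'server'), (2, 'nas'), (3, 'google bot');

-- each source nests its payload differently
INSERT INTO events (source_id, body) VALUES
  (1, '{"user_id": "12", "action": {"type": "click", "ts": "2021-04-10T10:00:00"}}'),
  (2, '{"user_id": "12", "ts": "2021-04-10T10:01:00", "track": "song.mp3"}'),
  (3, '{"user_id": "12", "ts": "2021-04-10T10:02:00", "query": "cats"}');
```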
Searching based on user id
How to search for events specific to the user with id 12? Regardless of the source of the event, each event contains the user_id key at the top level of the payload object.
In order to retrieve data for user 12 I use the `->>` operator in the WHERE clause (this operator returns the `text` type, which I can compare with another `text` value, while the `->` operator returns the `json` or `jsonb` type).
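A sketch of such a query, assuming the payload column is called `body` and the key is `user_id`:

```sql
SELECT *
FROM events
WHERE body ->> 'user_id' = '12';  -- ->> yields text, so compare with a text literal
```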
Searching and sorting by timestamp
Each event has its timestamp value; however, each source places the timestamp in a different part of the payload:
- sources 2 (nas) and 3 (google bot) place the timestamp under the “ts” key at the top level of the object, next to the “user_id” key
- source 1 (server) also places the timestamp under a “ts” key, however the key is nested inside an object stored under the “action” key at the top level
In order to search for the timestamp I'm going to use the `#>>` operator, which allows providing a “path” to the desired data in the form of an array of subsequent keys. Those keys are used to access elements in order until the data is found.
- for sources 2 and 3 the path is `{'ts'}`
- and for source 1 the path is `{'action', 'ts'}`
In order for Postgres to “know” that the `{'action', 'ts'}` text means an array of texts, I need to cast the value returned by `CASE` to `text[]`, which is the Postgres array type.
In order to sort by timestamp (or, in general, apply any timestamp-related functions) I need to cast the result to `timestamptz`. This is why I use `#>>`, returning text, and not `#>`, returning jsonb: so that the text can be cast to another data type.
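Put together, the query could look like this (a sketch assuming `body` and `source_id` columns):

```sql
SELECT e.body ->> 'user_id' AS user_id,
       (e.body #>> (CASE e.source_id
                      WHEN 1 THEN '{action,ts}'
                      ELSE        '{ts}'
                    END)::text[])::timestamptz AS ts
FROM events e
ORDER BY ts;
```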
Searching using jsonpath
Another way of pointing to or extracting an element from a complex object's structure is using a dedicated Postgres type: `jsonpath`. Such a path can be passed to the `jsonb_path_query(jsonb, jsonpath)` function, which returns a set of jsonb values (possibly with only one element).
Here, again, one needs to differentiate between event sources in order to use the proper `jsonpath` value:
- for sources 2 and 3 the jsonpath is just `$.ts`
- for source 1 the path is `$.action.ts`
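The jsonpath variant of the previous query might look like this (again assuming `body` and `source_id` columns):

```sql
SELECT jsonb_path_query(e.body,
         CASE e.source_id
           WHEN 1 THEN '$.action.ts'::jsonpath
           ELSE        '$.ts'::jsonpath
         END) AS ts
FROM events e;
```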
Small optimization
Using a `CASE` expression each time a different event source needs to be considered is not efficient, especially if it is possible that new event sources will be added to the database. In such a case we might end up fixing all queries containing a `CASE` expression, making them more complex. Therefore it is much better to store the `jsonpath`s as part of the source data (in the `sources` table). Writing queries requiring source distinction will be much more convenient, for example:
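A sketch of such a query, assuming the `sources` table gained a `ts_path` column of type `jsonpath` (the column name is my invention):

```sql
-- ALTER TABLE sources ADD COLUMN ts_path jsonpath;
-- UPDATE sources SET ts_path = '$.action.ts' WHERE id = 1;
-- UPDATE sources SET ts_path = '$.ts'        WHERE id IN (2, 3);

SELECT jsonb_path_query(e.body, s.ts_path) AS ts
FROM events e
JOIN sources s ON s.id = e.source_id;
```

The `CASE` logic is gone from the query: adding a fourth source now only means inserting one row into `sources`.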
Generating data
I will try to make the `events` table a bit bigger: I'll generate mock events for source 1 (server) using several Postgres functions:
- `random()` - to generate a single random value
- `generate_series(start, stop, [step])` - to generate a series of timestamps between two arbitrarily chosen dates (with a random step):
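The generating insert could look roughly like this (the dates, the interval and the payload shape are my assumptions; note that the random step is evaluated once per statement, so each run yields a different row count):

```sql
INSERT INTO events (source_id, body)
SELECT 1,
       jsonb_build_object(
         'user_id', (1 + floor(random() * 3))::int::text,  -- one of three users
         'action',  jsonb_build_object(
                      'type', 'click',
                      'ts',   ts))
FROM generate_series('2021-01-01'::timestamptz,
                     '2021-03-01'::timestamptz,
                     random() * interval '2 minutes') AS ts;
```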
Over 60 thousand inserts took almost 3 seconds, but this happened because I had created an index on the `body` column, and it is probably the index update that took most of the time.
Out of curiosity I deleted the index:
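Something along these lines (the index name is an assumption):

```sql
DROP INDEX IF EXISTS events_body_idx;
```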
and inserting over twice as many rows took less than two seconds:
The different number of generated rows is a result of the randomized step in the timestamp-generating function.
Generated data
Here is what the generated data looks like. The shape of the JSON values is the same as in the sample data I invented at the beginning of the example:
Indexes
Postgres allows creating several types of indexes on a column of type `jsonb`. Here I'm going to use GIN.
GIN index
The GIN index is used to index objects with a complex internal structure. It was designed to support queries about values inside such a structure. The index has two operator classes:
- `jsonb_ops` (default) - indexes each key and each value in the complex data and supports the following operators: `?`, `?|`, `?&`, `@>`, `@@`, `@?`
- `jsonb_path_ops` - indexes only the values of the object and supports the following operators: `@>`, `@@`, `@?`
The syntax for creating an index is as follows:
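For the `events` table it could be (the index names are assumptions):

```sql
CREATE INDEX events_body_idx ON events USING gin (body);

-- or, with the narrower operator class:
CREATE INDEX events_body_path_idx ON events USING gin (body jsonb_path_ops);
```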
Searching without an index
What is the predicted execution time reported by the Postgres planner/analyzer for a simple query about the data for the user with id 12?
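The original query is not shown here; since a GIN index serves containment queries (the earlier `->>` comparison would not use it), I assume it used the `@>` operator, roughly:

```sql
EXPLAIN ANALYZE
SELECT *
FROM events
WHERE body @> '{"user_id": "12"}';
```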
Here is the analysis of the above query when there is no index. “Sequential scan” means there is a need to go through each row sequentially and remove non-matching rows from the result. The search takes 85.9 ms:
Searching after creating a GIN index
After the index is created, the query planner will use it for the query above (“Bitmap heap scan” means exactly that), and the predicted time of execution drops to 3.1 ms:
This means that… wait a sec…
Creating the GIN index improved the execution time by 96%!
Such a difference makes a difference, doesn’t it?
Sources: Using JSONB, JSONB type perf
The End
That's all for today. I encourage all interested readers to read and study the other types of indexes themselves 😂
The food will knock to our door soon, so I’m closing my laptop now.
Till the next time!
This post is part of the sql series.
- 2021-04-11 - Postgres - dates
- 2021-04-10 - Postgres - JSON
- 2021-04-10 - Postgres - exercises
- 2021-09-30 - Postgres - installation