Define a reusable and lazy pipeline

The problem
I have the idea of a pipeline: it is defined once, it can be reused, and when invoked it advances its state but does not evaluate eagerly, because it could be “fed” from an infinite source of data.
For example:
- all integers which are multiples of 13 and have more ones than zeroes in their binary representation
- indices of bytes read from /dev/random which are ASCII characters
Perhaps there are two separate issues I’m thinking about here:
- Does the pipeline create intermediate collections when processing items? For example, in Scala every intermediate operation (each map and filter) creates a collection
- When exactly is the pipeline execution triggered?
In Scala
Note: see this insightful comment on SO: https://stackoverflow.com/questions/31664918/does-scala-has-intermediate-terminal-ops-as-java8-has
By default, intermediate operations are performed eagerly, unless you use .iterator.
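As a sketch of what this looks like (assuming a small list l and a side-effecting report function that prints its argument and returns it):

```scala
def report(i: Int): Int = { println(s"processing $i"); i }

val l = List(1, 2, 3, 4, 5)

// Eager: this line already prints "processing" for every element
// and builds a new List.
val mapped = l.map(report)
```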
Now you can either trigger the traversal or create an iterator which will evaluate/traverse elements one by one:
Note how the first value went through the map, even though we haven’t explicitly triggered it. However, the first element is not actually missing; we will get it eventually when we call next or use other terminal operations that cause traversal and evaluation of the pipeline operations:
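A sketch of the iterator variant, under the same assumed report function:

```scala
def report(i: Int): Int = { println(s"processing $i"); i }
val l = List(1, 2, 3, 4, 5)

// Lazy: building the pipeline on an iterator does not traverse anything yet.
val it = l.iterator.map(report)

// A terminal operation evaluates elements one by one...
it.next()            // evaluates (and prints) only what is needed

// ...or triggers traversal of everything that is left:
it.foreach(println)
```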
The iterator approach allows us to build a pipeline and evaluate elements when needed. However, we are not able to reuse the pipeline: once the iterator is exhausted, we’re done. We need to create a new iterator, repeating the pipeline code. It seems the pipeline needs to have a collection attached to work on. Or, in other words, there is no such thing as a standalone pipeline here, I guess.
Reusability
According to the SO answer, if we want to reuse the pipeline, we can use a view.
I’m not sure if we would really reuse anything. Let’s check.
First, I will change the side-effecting function report so that it doubles the value when called:
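A sketch of that function (the name reportDouble is assumed here):

```scala
// Print the value, then double it, so we can see when
// (and how often) the pipeline body actually runs.
def reportDouble(i: Int): Int = { println(s"processing $i"); i * 2 }
```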
Then, I will try to create a pipeline and reuse it by filtering values twice.
For a list, it evaluates eagerly: it traverses the list and runs the pipeline once - it (implicitly) creates a list (List is a Functor) and applies the pipeline once; all subsequent filters/maps are done on the resulting list.
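A sketch of this List behavior, with my assumed reportDouble and input:

```scala
def reportDouble(i: Int): Int = { println(s"processing $i"); i * 2 }
val l = List(1, 2, 3, 4, 5)

// On a List the map runs once, eagerly, and materializes a new List;
// both filters below work on that already-evaluated result, so
// reportDouble is not called again.
val xs = l.map(reportDouble)   // prints "processing" five times
xs.filter(_ % 4 == 0)
xs.filter(_ < 6)
```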
If I create a view instead, the map is still eagerly traversed, but then - here’s the difference - I get a SeqView which re-runs the pipeline each time it is triggered by a terminal operation (foreach):
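Roughly (same assumed reportDouble; xv is the name the text refers to):

```scala
def reportDouble(i: Int): Int = { println(s"processing $i"); i * 2 }
val l = List(1, 2, 3, 4, 5)

// With a view bound to a val, every terminal operation re-runs the
// mapped pipeline: reportDouble prints again on each foreach.
val xv = l.view.map(reportDouble)
xv.filter(_ % 4 == 0).foreach(println)
xv.filter(_ < 6).foreach(println)
```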
xv is a collection of evaluated elements. Well, I should have created a view and then piped it through the operations without sticking the collection to a val, so to speak:
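That is, something along these lines (same assumed reportDouble):

```scala
def reportDouble(i: Int): Int = { println(s"processing $i"); i * 2 }
val l = List(1, 2, 3, 4, 5)

// The whole pipeline on a view, never bound to a val: nothing is
// materialized, and take(2) limits how many elements go through
// reportDouble at all.
l.view.map(reportDouble).filter(_ % 4 == 0).take(2).foreach(println)
```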
It seems the map did not force the creation of the collection in this case. Filtering and the limit imposed by take caused a minimal number of operations.
LazyList
Docs: LazyList
LazyLists can be infinite, and their values are memoized. What does that mean? The pipeline is evaluated in such a way that elements that were previously accessed will not be evaluated again.
If we create a lazy list from list l and map reportDouble over it, the mapping operation by itself will initiate traversal; but then, as those values have been calculated, filtering does not trigger re-calculation:
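A sketch of the memoization behavior (reportDouble and l are my assumed names from earlier):

```scala
def reportDouble(i: Int): Int = { println(s"processing $i"); i * 2 }
val l = List(1, 2, 3, 4, 5)

val ll = LazyList.from(l).map(reportDouble)
ll.toList                // forces evaluation; "processing" is printed per element
ll.filter(_ > 4).toList  // values are memoized: no further "processing" output
```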
So, it seems there is no difference from simply mapping over a List… wait, perhaps I should place map before filter and then take just a few items out of the resulting collection. What happens?
Ah, here is the difference: elements are only calculated when required - let’s compare. The lazy list only calculates the required 3 elements, but List always calculates all of them, no matter that we require only three:
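A sketch of the comparison (inputs are my own choice):

```scala
def reportDouble(i: Int): Int = { println(s"processing $i"); i * 2 }

// LazyList: take(3) pulls only three elements through the map...
LazyList.from(1).map(reportDouble).take(3).toList        // three "processing" lines

// ...while a plain List maps every element before take runs:
List(1, 2, 3, 4, 5, 6, 7, 8).map(reportDouble).take(3)   // eight "processing" lines
```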
Notion of terminal operation
I still don’t have a good mental model of Scala collections: perhaps there is no “terminal operation” notion at all? It seems to me that map is a terminal operation: it implicitly creates a target container of the proper type (monoid and semigroup properties?) unless it is part of a bigger pipeline, in which case it doesn’t.
And I should not have high expectations for LazyList in this regard, unless - again - I hide the map inside:
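For instance (same assumed reportDouble), keeping the map inside one unforced pipeline expression stays lazy:

```scala
def reportDouble(i: Int): Int = { println(s"processing $i"); i * 2 }

// Nothing is bound to an intermediate val; only the elements needed
// by take(2) go through reportDouble when toList forces the pipeline.
LazyList.from(1).map(reportDouble).filter(_ > 4).take(2).toList
```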
Python
In Python, mapping and filtering operations leave you with something like an iterator, which by itself does not trigger anything.
One needs to explicitly create a list out of such a map or filter. The filter composes over the map lazily rather than forcing it; we have to force evaluation of the filter result ourselves:
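A sketch of this behavior (function and variable names are my own): map and filter return lazy iterators in Python 3, and nothing runs until something consumes them.

```python
# Track calls so we can observe when the pipeline body actually runs.
log = []

def report_double(i):
    log.append(i)
    return i * 2

m = map(report_double, [1, 2, 3, 4, 5])
f = filter(lambda x: x > 4, m)
assert log == []               # building the pipeline evaluated nothing

result = list(f)               # forcing the filter pulls items through the map
assert result == [6, 8, 10]
assert log == [1, 2, 3, 4, 5]
```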
Clojure
Perhaps what I want is the transducer concept I encountered in Clojure. It seems like a definition or composition of operations which does not require a collection and can exist by itself. The documentation states that operations such as map, filter or take have an arity that produces a transducer (for others, see also the cheatsheet).
What the docs say is:
Most sequence functions included in Clojure have an arity that produces a transducer. This arity omits the input collection; the inputs will be supplied by the process applying the transducer. Note: this reduced arity is not currying or partial application.
Deriving from the documentation example, I can create an xf transducer which is a composition of other transducers (comp composes right-to-left, so the mental model for the final transducer is that it executes left-to-right: first filtering, then mapping and finally taking two elements). The final reducing operation is “+” here; the transduced sequence before addition can be retrieved by using the (into [] ...) form:
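A sketch along the lines of the documentation example (the input collection is my own choice):

```clojure
;; A transducer composed with comp; it executes left-to-right:
;; filter, then map, then take.
(def xf (comp (filter odd?) (map inc) (take 2)))

;; Apply with a final reducing operation...
(transduce xf + (range 10))   ;; => 6   (2 + 4)

;; ...or materialize the transduced sequence:
(into [] xf (range 10))       ;; => [2 4]
```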
What happens when I define xf such that it prints an element very early, let’s say before filter? Would it then print everything? At least, it will not at the time the transducer is defined. I expect it to fire for all vec elements, however. I also expect that if I place the print after take, it will print the two filtered and inc’ed values, 4 and 6.
Let’s check. Define a: it takes a value, prints it and returns it.
Transducer xf is now composed with (map a) as the first element of the pipeline. Wow, see what happened: not only did it not call map on each element of the sequence - it executed exactly 3 times, the minimal number of times needed to actually retrieve the two elements required by the last element of the pipeline…:
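A sketch of the experiment (with my own input; the number of calls to a depends on the data, but it is always the minimum the downstream (take 2) demands):

```clojure
;; a: print the value, return it unchanged.
(defn a [x] (println x) x)

;; (map a) placed first: a runs only for as many input elements as
;; (take 2) at the end of the pipeline actually needs.
(def xf (comp (map a) (filter odd?) (map inc) (take 2)))
(into [] xf (range 10))   ;; prints 0 1 2 3 (one per line), returns [2 4]
```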
Interesting.
In Rust
This behavior is similar to what we get in Rust - there are iterators everywhere :) For example, the program below constructs an infinite collection (wait! it is an iterator!) and traverses it:
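A sketch of such a program, using a simplified variant of the “multiples of 13” example from the introduction (the function name is my own):

```rust
// An "infinite collection": a lazy iterator over all integers from 0.
// Nothing is computed until the iterator is consumed by collect().
fn multiples_of_13(n: usize) -> Vec<i32> {
    (0..).filter(|x| x % 13 == 0).take(n).collect()
}

fn main() {
    println!("{:?}", multiples_of_13(3)); // [0, 13, 26]
}
```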
So the example with a simple collection behaves nicely - it does not go through the whole collection and gives exactly what I expected: three elements:
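For instance (input and names are my own assumptions):

```rust
// The closure inside map runs only for as many elements as take(3) pulls.
fn first_three_doubled(v: &[i32]) -> Vec<i32> {
    v.iter()
        .map(|x| {
            println!("processing {}", x);
            x * 2
        })
        .take(3)
        .collect()
}

fn main() {
    let v = vec![1, 2, 3, 4, 5, 6, 7, 8];
    // Only three "processing" lines are printed.
    println!("{:?}", first_three_doubled(&v)); // [2, 4, 6]
}
```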
Making the pipeline more generic can be done by making the collection’s element type generic, or by creating a function that takes an Iterator<Item=i32> - which allows the pipeline to be used on both a vec and a set (note the non-deterministic output):
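A sketch of that function (names and the concrete pipeline steps are my own):

```rust
use std::collections::HashSet;

// The pipeline as a function taking any Iterator<Item = i32>, so it
// works on both a Vec and a HashSet (whose iteration order is
// non-deterministic).
fn run_pipeline(it: impl Iterator<Item = i32>) -> Vec<i32> {
    it.map(|x| x * 2).filter(|x| *x > 4).take(3).collect()
}

fn main() {
    let v = vec![1, 2, 3, 4, 5];
    let s: HashSet<i32> = v.iter().copied().collect();
    println!("{:?}", run_pipeline(v.into_iter())); // [6, 8, 10]
    println!("{:?}", run_pipeline(s.into_iter())); // same values, arbitrary order
}
```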
or by also returning an iterator - in this case I need to force a collection (note the call to .collect::<Vec<i32>>()) to be able to print the results, as the iterator itself does not implement Display:
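Roughly like this (again, names and pipeline steps are my own assumptions):

```rust
use std::collections::HashSet;

// Returning an iterator: nothing runs until the caller forces it,
// here with collect(), which is also needed for printing.
fn pipeline(it: impl Iterator<Item = i32>) -> impl Iterator<Item = i32> {
    it.map(|x| x * 2).filter(|x| *x > 4).take(3)
}

fn main() {
    let v = vec![1, 2, 3, 4, 5];
    let s: HashSet<i32> = v.iter().copied().collect();

    println!("{:?}", pipeline(v.into_iter()).collect::<Vec<i32>>()); // [6, 8, 10]

    let mut from_set: Vec<i32> = pipeline(s.into_iter()).collect();
    from_set.sort(); // HashSet iteration order is non-deterministic
    println!("{:?}", from_set); // [6, 8, 10]
}
```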
Obviously, duplicates disappear on HashSet construction, and the set output changes on each execution of the program.
Conclusions
- for Scala, use views or iterators; remember that map creates a target collection
- learn LazyList use cases; it is non-trivial
- Rust behaves intuitively
- Python seems a bit old-school with its map/filter and the need for
[i for i in map/filter...]
Personal note
I just remembered that I like Clojure, a lot. I also like Rust (I ❤️ 🦀) because of my earlier exposure to Scala - that’s for sure. Rust is the first modern language where I could do proper matching on ADTs (I remember data classes in Kotlin but don’t have them vivid in memory) and where I didn’t have to sit inside classes to do anything - I could just use top-level functions. Perhaps in Scala it was not possible before 3.0? Not sure.
Next
Just keep learning.