The PaSh project gives your POSIX scripts superpowers, using parallelization to speed up execution. That means faster results for data scientists, engineers, biologists, economists, administrators, and programmers.
I remember the days when the saying was “Learn Perl so you don’t have to learn the Shell and its hundreds of utilities.”
Fast forward a few decades and the use of shell scripts has still not been eradicated. Rather, their use has increased due to the rise of containers, virtual machines, cloud administration, and Linux itself.
It also serves as a lesson for those who hasten to denounce technologies as “dead”. There comes a time when a new use case revitalizes old technology.
So what do we mean by "Unix philosophy"? It's about taking simple, high-quality components and combining them intelligently to achieve a complex result. An example that encapsulates this notion comes straight from the PaSh documentation and shows how you can combine many utilities with pipes and redirects to filter data and reach the desired result:
Consider the following spell check script, applied to two large markdown files, f1.md and f2.md (dict.txt is a sorted dictionary file; note that comm requires sorted input, hence the sort -u):
cat f1.md f2.md |
tr A-Z a-z |
tr -cs a-z '\n' |
sort -u |
comm -13 dict.txt - > out
cat out | wc -l | sed 's/$/ misspelled words!/'
The speed of an operation like this depends on the size of the two files. It may take from a few seconds to a few minutes. What if you could speed it up by breaking it down into chunks that would work in parallel, and then combine their results? You can.
PaSh is one such POSIX shell script parallelization system, and it can improve performance by orders of magnitude. Given a shell script, PaSh converts it to a data flow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the data flow graph back to a POSIX script. The new parallel script has POSIX constructs added to explicitly guide parallelism, coupled with Unix runtime primitives provided by PaSh to address performance and correctness issues.
For example, running the above script through PaSh with -w 2, i.e. 2x-parallelism, would create two pipelines which it would then execute in parallel, splitting each stage of the data flow graph in two and merging the results at the end.
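To get a feel for the idea, here is a hand-written sketch of that two-way split, not actual PaSh output: each input file is processed by its own background pipeline, and the sorted partial results are merged before the dictionary filter. The tiny sample files and dictionary below are invented for the demonstration.

```shell
# Create throwaway inputs and a sorted dictionary (illustrative only).
printf 'Hello world\n' > f1.md
printf 'Some foo text\n' > f2.md
printf 'hello\nsome\ntext\nworld\n' > dict.txt

# One background pipeline per input file: lowercase, one word per line,
# sort and deduplicate each partial result.
(tr A-Z a-z < f1.md | tr -cs a-z '\n' | sort -u > part1) &
(tr A-Z a-z < f2.md | tr -cs a-z '\n' | sort -u > part2) &
wait

# Merge the already-sorted partials, then keep words not in the dictionary.
sort -mu part1 part2 | comm -13 dict.txt - > out
wc -l < out | sed 's/$/ misspelled words!/'
```

With this toy data it finds a single out-of-dictionary word, foo. The merge step matters: comm expects sorted input, so the partial results are combined with sort -m rather than plain cat.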
You could say that there is also GNU Parallel for that. The problem with Parallel is that it doesn't know the semantics of commands like grep, so the user must write a carefully parameterized command for it to parallelize a job correctly. Some commands also have ad hoc custom parallel flags like -j, --jobs, or --parallel, but they are all different, difficult to use, and difficult to compose.
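For instance, a roughly equivalent GNU Parallel invocation uses --pipe to split stdin into blocks, and it is up to the user to pick a block size and make sure the per-block command is order-insensitive. The sample files are invented for illustration:

```shell
printf 'Hello world\n' > f1.md
printf 'Some foo text\n' > f2.md
printf 'hello\nsome\ntext\nworld\n' > dict.txt   # sorted, as comm requires

# --pipe splits stdin at newline boundaries into blocks (here up to 1 MB)
# and runs one copy of the quoted pipeline per block, in parallel.
cat f1.md f2.md |
  parallel --pipe --block 1M "tr A-Z a-z | tr -cs a-z '\n'" |
  sort -u |
  comm -13 dict.txt - > out
wc -l < out | sed 's/$/ misspelled words!/'
```

This works only because lowercasing and word-splitting do not care about block boundaries or ordering; for a command where they do, the parameterization becomes much trickier.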
PaSh instead has a compiler that works like this:
- Takes a shell script and command annotations as input
- Constructs a data flow graph
- Applies graph transformations that expose parallelism
- Emits a new shell script with explicit low-level parallelism (& and wait)
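As a rough sketch of that last step, here is a minimal hand-written example of the kind of output such a pipeline could produce for a single grep stage; the file names are invented and GNU split is assumed:

```shell
printf 'alpha\nbeta\nalpha beta\ngamma\n' > in.txt   # toy input

# Split the input into two chunks without breaking lines (GNU split).
split -n l/2 in.txt chunk_

# Explicit low-level parallelism: run each chunk in the background ...
grep 'alpha' chunk_aa > out_aa &
grep 'alpha' chunk_ab > out_ab &
wait        # ... and wait for both before merging.

# Concatenating in chunk order preserves the sequential output order.
cat out_aa out_ab > out.txt
```

Because the chunks are concatenated in their original order, out.txt matches what a sequential grep 'alpha' in.txt would produce.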
Because PaSh is a source-to-source compiler, it allows the optimized shell script to be inspected and executed using the same tools, in the same environment, and with the same data as the original script.
The other two main components of PaSh are annotations, a lightweight annotation language that lets command developers express key parallelizability properties of their commands, and a small runtime library that provides the PaSh compiler with high-performance primitives supporting its key functions.
Benchmarks on common Unix one-liners show performance improvements of up to 60x.
PaSh can be run on Ubuntu, Fedora, Debian, and Arch, as well as on Windows under WSL. Use one of the following methods to set it up:
- Run curl up.binpa.sh | sh from your terminal
- Clone the GitHub repository and run the setup scripts it provides
- Pull a prebuilt Docker container by running docker pull binpash/pash-18.04
- Build a Docker container from scratch
PaSh: Light-touch Data-Parallel Shell Processing
PaSh on GitHub
The Linux Perfection Challenge
Three Tips for the Linux Shell Addict