The PaSh project – Taking the Unix philosophy one step further

The PaSh project gives your POSIX shell scripts superpowers: it parallelizes them to cut execution time. That means faster results for data scientists, engineers, biologists, economists, administrators and programmers.

I remember the days when the saying was “Learn Perl so you don’t have to learn the Shell and its hundreds of utilities.”
Fast forward a few decades and the use of shell scripts has still not been eradicated. Rather, their use has increased due to the rise of containers, virtual machines, cloud administration, and Linux itself.

It also serves as a lesson for those who hasten to denounce technologies as “dead”. There comes a time when a new use case revitalizes old technology.

So what do we mean by “Unix philosophy”? It’s about taking simple, high-quality components and combining them intelligently to achieve a complex result. An example that encapsulates this notion comes straight from the PaSh documentation. It shows how several utilities can be chained with pipes and redirects, each one filtering the output of the previous one, to achieve the desired result:

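Consider the following spell check script, applied to two large markdown files, f1.md and f2.md:

cat f1.md f2.md |               # concatenate both files
tr A-Z a-z |                    # convert everything to lowercase
tr -cs A-Za-z '\n' |            # put each word on its own line
sort |                          # sort the words
uniq |                          # drop duplicates
comm -13 dict.txt - > out       # keep words missing from the sorted dictionary
cat out | wc -l | sed 's/$/ mispelled words!/'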

The speed of an operation like this depends on the size of the two files. It may take from a few seconds to a few minutes. What if you could speed it up by breaking it down into chunks that would work in parallel, and then combine their results? You can.

PaSh is a system for parallelizing POSIX shell scripts, and it can improve performance by orders of magnitude. Given a shell script, PaSh converts it to a data flow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the data flow graph back to a POSIX script. The new parallel script has POSIX constructs added to explicitly guide parallelism, coupled with Unix-aware runtime primitives provided by PaSh to address performance and correctness issues.

For example, running the above script through PaSh with -w 2, i.e. 2x parallelism, would split it into two pipelines that execute in parallel.

[Figure: data flow graph of the spell check script at 2x parallelism]
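To make the idea concrete, here is a hand-written sketch of what 2x parallelism could look like for the spell check script. It is an illustration only, not the script PaSh emits: the function names words and spellcheck_2x are made up for this example, and temporary files stand in for whatever plumbing PaSh uses internally. Like the original script, it assumes the dictionary file is already sorted.

```shell
# Hand-written 2x-parallel sketch of the spell check pipeline.
# Illustration only; this is NOT the output of PaSh.

# Hypothetical helper: turn one input file into a sorted word list.
words() {
    tr A-Z a-z < "$1" | tr -cs A-Za-z '\n' | sort
}

# Hypothetical entry point. Usage: spellcheck_2x FILE1 FILE2 SORTED_DICT
spellcheck_2x() {
    tmp=$(mktemp -d)    # mktemp is widely available, though not in POSIX

    # One branch per input file, both running in the background.
    words "$1" > "$tmp/a" &
    words "$2" > "$tmp/b" &
    wait                # block until both branches have finished

    # Merge the two sorted word lists without re-sorting, drop
    # duplicates, and keep only words missing from the dictionary.
    sort -m "$tmp/a" "$tmp/b" | uniq | comm -13 "$3" -

    rm -rf "$tmp"
}
```

Calling spellcheck_2x f1.md f2.md dict.txt should print the same word list that the original pipeline writes to out. The split works because tr processes its input independently of what came before, and two sorted halves can be combined with sort -m, which merges rather than sorts.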

You might object that GNU Parallel already exists for this. The problem with Parallel is that it doesn’t know the semantics of commands like grep, so it’s hard to use: the user must write a carefully parameterized command to parallelize a job. Some commands also have their own ad hoc parallel flags, such as -j, --jobs, or --parallel. These are all different, difficult to use and difficult to compose.

PaSh instead has a compiler that works like this:

  • Take a shell script and command annotations as input
  • Construct a data flow graph
  • Apply graph transformations
  • Produce a new shell script that expresses low-level parallelism with & and wait
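The phrase "low-level parallelism with & and wait" can be pictured with a small hand-written example. In the sketch below, each command is a node of a tiny data flow graph and each named pipe (FIFO) is an edge between nodes. Using FIFOs as edges, and the function name sort_2x, are assumptions of this example; it is not taken from a script generated by PaSh.

```shell
# Hand-written sketch of a two-branch data flow graph.
# Nodes are commands, edges are named pipes. Illustration only.

# Hypothetical function. Usage: sort_2x FILE1 FILE2
sort_2x() {
    dir=$(mktemp -d)
    mkfifo "$dir/e1" "$dir/e2"      # one edge per parallel branch

    # Two sorter nodes, started in the background with &.
    sort "$1" > "$dir/e1" &
    sort "$2" > "$dir/e2" &

    # Aggregator node: opens both edges and merges the sorted streams.
    sort -m "$dir/e1" "$dir/e2"

    wait                            # reap both background sorters
    rm -rf "$dir"
}
```

For two files a and b, sort_2x a b should print the same lines as sort a b. The aggregator matters as much as the branches: sort -m reads from both edges, so neither sorter is left waiting for the other to finish first.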

Because PaSh is a source-to-source compiler, it allows the optimized shell script to be inspected and executed using the same tools, in the same environment, and with the same data as the original script.

The other two main components of PaSh are its annotations, a lightweight language that lets command developers express key parallelizability properties of their commands, and a small runtime library that supplies the PaSh compiler with high-performance primitives and supports its key functions.

Various benchmarks on common Unix one-liners show speedups of up to around 60x.

PaSh runs on Ubuntu, Fedora, Debian, and Arch. Use one of the following methods to set it up:

  • Run curl up.binpa.sh | sh from your terminal,
  • Clone the repository and run ./scripts/distro-deps.sh; ./scripts/setup-pash.sh,
  • Fetch a Docker container by running docker pull binpash/pash-18.04, or
  • Build a Docker container from scratch.

It also runs on Windows under WSL.

More information

PaSh: Light-touch Data-Parallel Shell Processing

PaSh on GitHub

Related Articles

The Linux Perfection Challenge

Three Tips for the Linux Shell Addict


About Leslie Schwartz
