scala - Gary Sieling

Apache Zeppelin is a JVM-based notebook product. It runs as a web application inside Jetty, and allows paragraphs within notebooks to be written in two dozen different languages, e.g. Scala, Python, R, Markdown, and SQL.

Notebooks in general are an interesting form of developer tooling. While less fully featured than traditional IDEs, they allow for rapid iteration in a project.

In many data gathering projects, I often find it convenient to switch languages: a bit of bash to pull some files from a website, then a bit of Python from Stack Overflow to do data processing. I often convert all of this to Scala as the project requirements become more solid, to take advantage of Scala’s rich type system.

While paragraphs in a notebook can be run in any order, the linear layout of a notebook is a natural fit for data gathering, compared to the hierarchical project structure you work with when building a library or product in an IDE.

Zeppelin and Spark

One of the sweet spots for Zeppelin appears to be working with Spark. For instance, Amazon pushes Zeppelin as a UI entry point for Spark.

In Zeppelin, Spark paragraphs (Scala or Python) can communicate with each other through a global variable called “z” that is injected into each of those interpreters. This allows you to shuffle both simple values and dataframes back and forth (I’m not certain, but suspect it serializes dataframes to Parquet binary data).

Visualization Tools

Apache Zeppelin comes with a handy tool to visualize data. This mode is triggered if the first text a paragraph outputs is “%table”; Zeppelin then treats the subsequent text as tab-separated data. You can trigger similar display modes with “%md” (Markdown) and “%html”.
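As a rough sketch of the “%table” convention described above, the following Python helper builds the text a paragraph would print. The function name and the sample data are made up for illustration, and replacing embedded tabs and newlines with spaces is my own assumption about how to keep columns aligned, not something Zeppelin requires.

```python
def to_zeppelin_table(header, rows):
    """Render a header and rows as '%table' text (hypothetical helper)."""

    def clean(value):
        # Tabs separate columns and newlines separate rows, so strip them
        # from cell values (an assumption made to keep the table aligned).
        return str(value).replace("\t", " ").replace("\n", " ")

    lines = ["\t".join(clean(h) for h in header)]
    lines += ["\t".join(clean(cell) for cell in row) for row in rows]
    # The magic word comes first; everything after it is tab-separated data.
    return "%table " + "\n".join(lines)


# Made-up sample data; in a notebook paragraph you would print the result.
print(to_zeppelin_table(["tag", "count"], [["cat", 3], ["dog", 5]]))
```

Printing the returned string from a Python paragraph should bring up the table view, from which the chart types can be selected.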

The data viewer supports a variety of visualizations (bar, pie, scatter charts), and can do some basic operations that you might do in SQL (e.g. group by, count, max, sum).

There are numerous additional UI controls – comboboxes, radio buttons, and the like. When you create one of these with the “z” object, it shows up in the Zeppelin UI, and posts back to your notebook when the user makes a selection. This allows you to build your own UI tools.

For example, I built a paragraph where you choose a tag from a machine learning dataset, and 25 randomly selected images with that tag are shown, by base64-encoding them into HTML output from the notebook.
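A minimal sketch of the base64 approach might look like the following. It is not the original notebook code: the function name, the (label, bytes) input shape, the PNG MIME type, and the thumbnail width are all assumptions made for illustration.

```python
import base64
import html
import random


def image_gallery_html(images, count=25, seed=None):
    """Build '%html' output with up to `count` random images inlined.

    `images` is a list of (label, png_bytes) pairs, a hypothetical shape;
    in practice the bytes would be read from files matching the chosen tag.
    """
    rng = random.Random(seed)
    chosen = rng.sample(images, min(count, len(images)))
    tags = []
    for label, data in chosen:
        # Inline the image as a data URI so no separate file server is needed.
        encoded = base64.b64encode(data).decode("ascii")
        tags.append(
            '<img alt="{}" width="128" src="data:image/png;base64,{}"/>'.format(
                html.escape(label, quote=True), encoded
            )
        )
    return "%html " + "".join(tags)
```

Printing the result from a paragraph would render the images inline. Because the image data is stored in the paragraph output, large galleries make the notebook file itself large.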

As you might expect, you can also display charts from libraries like matplotlib, which offers tremendous power (I have run into some glitchy situations like tooltips not working, but I imagine this sort of thing will improve with time).

Customizing Your Tools

One of the great things I’ve enjoyed about Zeppelin is the ease with which you can customize your own work environment. If you were using a traditional IDE, you’d go through a complex process to write a plugin, but in Zeppelin you can add new debugging tools with simple paragraphs that create UI controls or use the Zeppelin APIs.

Unsurprisingly, Zeppelin notebooks are stored as JSON. You can use tools like jq to manipulate them from within the notebook itself, which gives you a powerful way to transform your own development environment as you’re working.

For instance, I wrote a notebook that pulls a second notebook from GitHub, then imports it through the Zeppelin API.
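The import step could be sketched roughly as follows. This is not the original notebook: it assumes Zeppelin's REST endpoint for importing a note (POST to /api/notebook/import) and a server at http://localhost:8080, and the function name is hypothetical. The request is only built here, not sent, so the snippet runs without a Zeppelin server.

```python
import json
import urllib.request


def build_import_request(note, zeppelin_url="http://localhost:8080"):
    """Build (but do not send) a POST that imports a note into Zeppelin.

    `note` is the parsed JSON of a notebook file; the URL default and the
    endpoint path are assumptions to check against your Zeppelin version.
    """
    body = json.dumps(note).encode("utf-8")
    return urllib.request.Request(
        zeppelin_url.rstrip("/") + "/api/notebook/import",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up minimal note; a real one would be downloaded from the repository.
request = build_import_request({"name": "imported-demo", "paragraphs": []})
print(request.get_method(), request.full_url)
```

To actually perform the import you would pass the request to urllib.request.urlopen, and add whatever authentication your Zeppelin instance requires.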

Data Privacy

It’s worth noting that because notebooks store the paragraph output, you should be really careful about what you check in – you could accidentally reveal sensitive information like AWS API keys or PII.
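One way to reduce that risk is to strip stored output before committing. The sketch below assumes each paragraph in the notebook JSON keeps its output under a "results" key (or "result" in older files); check this against your own notebook files, since the layout varies between Zeppelin versions. The function name is hypothetical.

```python
import copy

# Assumed names of the keys that hold paragraph output in the note JSON.
OUTPUT_KEYS = ("results", "result")


def strip_outputs(note):
    """Return a copy of a parsed note with every paragraph's output removed."""
    cleaned = copy.deepcopy(note)  # leave the caller's dict untouched
    for paragraph in cleaned.get("paragraphs", []):
        for key in OUTPUT_KEYS:
            paragraph.pop(key, None)
    return cleaned
```

This only removes output. Secrets typed directly into a paragraph's source text would still be present, so it complements a scanning tool rather than replacing one.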

For AWS specifically, you should use git-secrets, which scans commits for credentials before they reach the repository.

Running Zeppelin

For my own purposes, I’ve chosen to run Zeppelin in a Docker container, with additional Python libraries installed (e.g. OpenCV, MXNet). I mount git repositories and data as volumes on the container, so that the container itself is disposable, but the data that is hard to recover is retained.

In a more realistic production environment, you’d likely use Zeppelin attached to a Spark cluster, or running on a higher-cost EC2 instance.

Jupyter

In comparison to Zeppelin, Jupyter is a much older and more mature tool, with a larger community and support for more languages and macros. Zeppelin appears to be targeting enterprise environments by offering multi-user access to notebooks and permissions to control that access, as well as providing some built-in visualizations and an entry point into Spark development.

Conclusion

Notebooks are most commonly used for variations on “data science,” but there are other use cases as well. For instance, in API-first projects they make great places to do end-of-sprint demos. The ability to mix Markdown and code makes them well-suited to interactive developer documentation.

They are also good candidates for situations where you’re developing a continuous-integration-like process, but need a lot of room to experiment with or vary the process (e.g. kicking off jobs in SageMaker with different tuning options).

Ultimately this is an interesting space to watch, and because Zeppelin seems to have some corporate backing, it will be interesting to see how it evolves.


My experience with Apache Zeppelin