ScalaPB with SparkSQL
Setting up your project
Make sure that you are using ScalaPB 0.5.23 or later.
We are going to use sbt-assembly to deploy a fat JAR containing ScalaPB and your compiled protos. Make sure project/plugins.sbt has a line that adds sbt-assembly:
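A minimal sketch of such a line, assuming the usual `com.eed3si9n` coordinates of the plugin. The version number is illustrative only; pick the release that matches your sbt version.

```scala
// project/plugins.sbt
// The version below is an assumption, not a requirement of ScalaPB.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
```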
In build.sbt, add a dependency on the ScalaPB SparkSQL support library:
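A sketch of the dependency line, assuming the library is published as `sparksql-scalapb` under the `com.trueaccord.scalapb` organization. Both the coordinates and the version are assumptions that the text above does not state, so check them against the release you intend to use.

```scala
// build.sbt
// Assumed coordinates and an illustrative version; verify both before use.
libraryDependencies += "com.trueaccord.scalapb" %% "sparksql-scalapb" % "0.1.2"
```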
The container your Spark job runs in ships an old version of Google’s Protocol Buffers runtime that is not compatible with the version ScalaPB needs. Therefore, we need to shade our copy of the Protocol Buffers runtime so that it does not clash with the older one. Add this to your build.sbt:
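A sketch of a shade rule that does this with sbt-assembly's `ShadeRule` API. It renames every class under `com.google.protobuf` inside the fat JAR, so your code and ScalaPB link against the renamed copy rather than the one already on the cluster. The target prefix `shadeproto` is an arbitrary name chosen for illustration.

```scala
// build.sbt
// Move the bundled protobuf runtime to a private package inside the fat JAR.
// "shadeproto" is a hypothetical prefix; any unused package name works.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shadeproto.@1").inAll
)
```

The `@1` placeholder stands for whatever the `**` wildcard matched, so `com.google.protobuf.Message` becomes `shadeproto.Message`.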
Running SQL queries on protos
Assuming you have an RDD of ScalaPB protos:
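One way to end up with such an RDD, sketched under assumptions: `Person` stands for a ScalaPB-generated message class (its package here is hypothetical), and `serialized` is an RDD of messages you already hold as byte arrays. `parseFrom` is the parser that ScalaPB generates on each message's companion object.

```scala
import org.apache.spark.rdd.RDD
// Hypothetical import: the real package depends on your .proto file.
import com.example.protos.Person

// Parse an RDD of serialized messages into an RDD of ScalaPB protos.
def parsePersons(serialized: RDD[Array[Byte]]): RDD[Person] =
  serialized.map(bytes => Person.parseFrom(bytes))
```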
You can convert it to a DataFrame and register it as a SparkSQL table:
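A minimal sketch, assuming `sqlContext` is an existing `SQLContext`, `persons` is the RDD from above, and the implicits live in the `com.trueaccord.scalapb.spark` package (the package name is an assumption, so check it against your version of the library). The table name `persons` is illustrative.

```scala
// Assumed package for the ScalaPB SparkSQL implicits.
import com.trueaccord.scalapb.spark._

// Convert the RDD of protos to a DataFrame and expose it to SQL as "persons".
sqlContext.protoToDF(persons).registerTempTable("persons")
```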
The first import line adds an implicit conversion for SQLContext that supplies
protoToDF. An equivalent alternative you can use is:
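A sketch of the RDD-side form, assuming the same import also adds a `toDataFrame` method to RDDs of protos. That method name is not given in the text above and is an assumption here, so confirm it exists in the version you depend on.

```scala
// Assumed package for the ScalaPB SparkSQL implicits.
import com.trueaccord.scalapb.spark._

// Same result as protoToDF, but called on the RDD instead of the SQLContext.
// toDataFrame is an assumed method name; verify it in your library version.
persons.toDataFrame(sqlContext).registerTempTable("persons")
```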
Now you can run code like this:
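For instance, assuming the `Person` message has `name` and `age` fields (hypothetical field names) and the table was registered as `persons`:

```scala
// Query the registered table with plain SQL and print each resulting row.
sqlContext.sql("SELECT name, age FROM persons WHERE age > 30")
  .collect()
  .foreach(println)
```

Note that `collect()` brings all matching rows to the driver, so it is only suitable for small result sets.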
Check out a complete example here.