Hey folks, welcome to the digest!
I'll be in SF & LA in a few weeks and hosting some Browsertech events. On May 22 in SF we're co-hosting a happy hour with Krea. Then on May 23 we're hosting the first Browsertech LA. If you're in the SF or LA area, I hope to see you there!
On the podcast, I talked with Kyle Barron of Development Seed about his work in open-source geospatial tooling in the browser. Kyle is a prolific developer who has worked on bindings for Apache Arrow, Parquet, GeoArrow, and Deck.gl, culminating in Lonboard.
We talked about why he needed to look beyond GeoJSON and Leaflet to scale to millions of points, where Arrow and Parquet fit into the picture, and optimizing the path of data from a server-side Python DataFrame to the GPU.
Something that stood out as I was editing this episode is that the sibling Apache projects Parquet and Arrow keep coming up on the podcast.
We get into them on the episode with Kyle, but the tl;dr (tl;dl?) is that they are both data formats for generic, typed tabular data. Parquet optimizes for space, so it is often used on disk and on the wire. Arrow is an in-memory format that optimizes for fast random access. They share a nearly equivalent data model, so a common approach is to read Parquet from disk or the network directly into Arrow in memory.
Episode 2 guests Andrew and Eric also talked about how they use Arrow in Prospective. In episode 3, Ben Schmidt of Nomic talked about his use of Arrow as well.
A common theme across these episodes is eliminating transformations as data travels through the system. A naive approach in browser-based apps is to serialize an in-memory representation on the server to JSON, send it over the wire, parse it into in-memory JavaScript objects, and then pack the relevant attributes into attribute buffers on the GPU (often with one draw call per object).
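Here's a sketch of that naive pipeline, with Python standing in for both the server and the browser side (hypothetical point data; each numbered step is a separate linear pass over the data):

```python
import json
from array import array

# Server side: in-memory objects serialized to JSON (pass 1).
points = [{"lon": -122.4, "lat": 37.8, "name": "SF"},
          {"lon": -118.2, "lat": 34.1, "name": "LA"}]
wire = json.dumps(points)

# Client side: parse JSON back into objects (pass 2).
parsed = json.loads(wire)

# Pack the relevant attributes into a flat typed buffer,
# analogous to a GPU attribute buffer (pass 3).
positions = array("f")  # float32, like a WebGL Float32Array
for p in parsed:
    positions.append(p["lon"])
    positions.append(p["lat"])

print(len(positions))  # 4 floats: interleaved lon/lat pairs
```

Three passes is fine for two points; at millions of points, every pass shows up in the frame budget.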
To be clear, this works for a lot of things! If you just need to annotate a map with a few hundred points, you won't notice the overhead.
At scale, though, each of those transformations takes linear time, so as your data grows, so does the compute time. The Arrow approach of using a single data representation (with Parquet as a cheap compression step across the wire) means apps can work with orders of magnitude more data without sacrificing responsiveness.
Until next time,
Paul