I spend my days trying not to embarrass myself in front of the world-class team at LightStep.
A few short years ago, we built software in a world where monitoring technology designed for a single process could still tell coherent stories about the larger application. Now that distributed systems are swallowing the infrastructure space (Kubernetes, microservices, FaaS, etc.), those previous-generation monitoring tools break down: they cannot answer the most important questions about the distributed application or the business that depends on it. LightStep focuses on application-level visibility in modern distributed systems.
People sometimes assume that the team at LightStep is essentially commercializing ideas that the core team developed years ago at Google. While it's true that our team's collective experiences at large technology companies inform our thinking, I often remark that our LightStep work is more "a reaction to" than "an imitation of" those older systems we built 10+ years ago. LightStep's approach requires a subtle reset of expectations about the fundamentals of monitoring systems (how things are centralized, how things are sampled, how things are summarized, and how much better workflows can be if we throw away ingrained assumptions about monitoring architecture). This means that we spend a lot of time thinking and researching, and it also means that we have a lot of fun. Thankfully the product speaks for itself, and I have never been as happy professionally.
If you want to chat about the above (or the below), I enjoy receiving emails out of the blue: email@example.com.
I graduated from Brown in 2003 with a combined degree in Math and Computer Science. While in school, my studies focused on computer graphics, machine vision, and operating systems, and I worked most closely with Michael Black and Andries van Dam. I accepted a software engineering job offer from Google in my final semester at Brown, and I joined them immediately after a three-month summer internship at Microsoft Research in Silicon Valley (where I worked under the direction of Michael Isard).
I spent my first two years at Google alternating between a state of awe regarding my new colleagues and a state of disappointment about my actual project work. To make a long story short, we were trying to solve too many problems at once, and though I learned a great deal about software engineering, technical leadership, and product positioning, I eventually decided to set sail for the land of large-scale distributed infrastructure.
From 2005 to 2008, I built an always-on distributed tracing project called Dapper; it began life as a proof-of-concept prototype, but over the years I fleshed out the technology stack and built a team of twelve (you can read the paper some colleagues and I later wrote about it at bit.ly/google_dapper). Dapper became — and still is — a core technology at Google; it helps developers make sense of distributed systems that often involve 10K-100K distinct processes.
In 2009, I brought together a small core group to build Monarch, a large-scale (~100K process), high-availability time-series collection, storage, and query system, as well as a configuration and console toolchain layered on top. That group eventually grew into a 20-30 person engineering team that I led, and the technology we developed is now Google's company-wide multi-tenant monitoring service. I worked on Monarch through the end of my time at Google. It was fun engineering work, but...
When I left school I had intended to build products that could improve people's lives on a large scale, and yet I found myself building gigantic systems that, despite their importance to the company, were mostly invisible outside of Google. In late 2012 I took a position at a startup founded by an old colleague of mine. I treated my experience as a startup employee as a continuous learning process, mostly in non-technical areas (marketing, PR, business development, recruiting), though I also valued the opportunity to catch up on non-Google technology stacks.
More than anything, leaving Google helped me realize two things: (1) how much I enjoy the dynamism of working in the open market; and (2) how, while Google's scale certainly makes for challenging engineering problems wherever you look, the solution space outside of the Googles and Facebooks of the world is in many ways more interesting: when there are 10-100x fewer transactions per second, the level of detail in monitoring workflows can increase accordingly. In this way, taking Google technology and selling it to the rest of the world actually sells the rest of the world short. We can build much more featureful and flexible things when we don't need to satisfy the world's 2-3 largest, outlier software deployments.