Data in Early-stage VC

Data machines

Mar 26, 2025

How can we use data in venture?

As a former quant researcher in the trading world, tackling this question has been tons of fun.

Yes, there are tons of challenges in handling private markets data. Yes, data availability and transparency can be limited. But, despite that, I think there’s tons of room for early-stage VCs to better leverage data.

Early thoughts:

1. There’s a lot of low-hanging fruit VCs can tackle to make their lives easier. Hours and hours of manual work can be automated with relatively simple tools.
2. AI/ML shouldn’t be used to decide whether you invest in or pass on a company. Early-stage investing is very founder-centric, so leave the last mile of decision-making with humans.
3. Despite #2, data tools can be very powerful. VCs can build systems that improve their deal flow funnels and increase the number of deals they see. While the last-mile decision remains with the VC, they get more opportunities to make great investments.

#1: Low-hanging fruit

Let’s start with an example to put this in context. VCs use CRMs (HubSpot, Pipedrive, etc.) to manage deal flow. They’re the beating heart of day-to-day operations. Life’s pretty easy once your data is in these platforms, but the legwork to get it there is, well, another story.

Data processing and cleaning is a manual slog, and the icing on the cake is that data comes from countless sources in non-standard formats. For a firm with dozens, maybe hundreds, of employees, manual data processing is something people may just accept as part of the job; however, if you’re a firm with a smaller headcount, hours spent on mind-numbing data cleaning are expensive.

Imagine you went to a hackathon and the organizers shared a list of attendees like this:

Looks good at first glance, but imagine your CRM needs location separated into city, province/state, and country. You may need to split the founder’s name into first and last names too. You can do this with built-in spreadsheet formulas, but handling all the edge cases is annoying, so there’ll be tons of manual work left over.

But, if you’re willing to learn a language like Python, building these types of data cleaning/enhancement workflows is straightforward. Plus, it’s a more robust, flexible, and scalable solution.

You really don’t need much experience in Python to make a tool like this. You’ll have to learn the basics of a library like pandas but that’s it. There are so many tutorials, YouTube videos, and LLMs out there that the learning curve is as flat as it’s going to be. Your computer isn’t a constraint either: use Google Colab for an amazing free out-of-the-box coding environment that doesn’t need any set-up.

If you’ve never coded anything in your life I genuinely believe you can have a basic data cleaning tool up in a few hours.
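To make this concrete, here is a minimal sketch of what such a cleaning tool could look like with pandas. Everything specific in it is an assumption for illustration: the column names (`Founder`, `Company`, `Location`), the sample rows, and the splitting rules (first word is the first name; a three-part location is city, province/state, country; a two-part location is city, country). Real attendee lists will have edge cases these rules don't cover.

```python
import pandas as pd

# Hypothetical attendee list. The column names and rows are made up for illustration.
raw = pd.DataFrame({
    "Founder": ["Jane Doe", "  John Q. Public ", None],
    "Company": ["Acme Robotics", "Globex AI", "Initech Labs"],
    "Location": ["Toronto, ON, Canada", "Paris, France", "Berlin"],
})


def split_name(full_name):
    """Split a full name on the first space (assumed rule, not a universal one)."""
    if pd.isna(full_name):
        return pd.Series({"first_name": "", "last_name": ""})
    parts = str(full_name).strip().split(maxsplit=1)
    first = parts[0] if parts else ""
    last = parts[1] if len(parts) > 1 else ""
    return pd.Series({"first_name": first, "last_name": last})


def split_location(location):
    """Split 'City, Region, Country' on commas; missing pieces become empty strings."""
    if pd.isna(location):
        return pd.Series({"city": "", "region": "", "country": ""})
    parts = [p.strip() for p in str(location).split(",") if p.strip()]
    if len(parts) >= 3:
        city, region, country = parts[0], parts[1], parts[-1]
    elif len(parts) == 2:
        # Assumption: a two-part location is "City, Country".
        city, region, country = parts[0], "", parts[1]
    elif len(parts) == 1:
        city, region, country = parts[0], "", ""
    else:
        city = region = country = ""
    return pd.Series({"city": city, "region": region, "country": country})


def clean_attendees(df):
    """Return a copy of df with name and location split into CRM-friendly columns."""
    names = df["Founder"].apply(split_name)
    places = df["Location"].apply(split_location)
    return pd.concat([df.drop(columns=["Founder", "Location"]), names, places], axis=1)


cleaned = clean_attendees(raw)
print(cleaned)
```

From here, something like `cleaned.to_csv("attendees_clean.csv", index=False)` would produce a file to import into a CRM (the filename is arbitrary). Keeping the rules in small functions means each new edge case is a one-line change rather than another round of manual fixes.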

#2: Last-mile decision making

When I first started in venture my vision for using data was broader than just creating automations. In my eyes an ideal data model would tell me what companies to invest in. It’d be a magic black box where I could input all the information I knew about a company and out the other end there would be an investment recommendation.

That vision was wrong—at least for now.

Early-stage venture (pre-seed/seed) is founder-focused: VCs look at companies that are pre-revenue or have just launched products. There aren’t many metrics to go on. It’s all about gauging a founding team’s vision, ability to execute, and unique insight into their problem space.

Just because it’s qualitative doesn’t mean it’s easy. Making these assessments is very challenging and I don’t think the data is there yet; you can throw compute at the problem, but there isn’t enough data to train a model to pick up the nuances. Unless VCs start recording all of their calls and uploading them into models (with permission, of course) to extract signals that identify great founders, this last mile of decision-making is out of reach. And, remember, you won’t have high confidence in the signals you extract because you’ll need to wait 5+ years to see how your investments pan out.

I’m not saying it’s impossible. I’m personally in the camp that anything can be quantified—I wouldn’t have worked in quant trading if I didn’t believe that. It’s just that lack of data combined with long feedback cycles creates a unique challenge.

That being said, I’m sure there are firms working on this which is exciting! My guess is that if it works well it’ll have little to do with the models but everything to do with the data.

#3: Useful data tools

If you’re not using data tools to make investment decisions then what’s the point of using data in venture?

Simple: improve deal sourcing.

VCs live in an information-abundant world and it’s difficult to parse through everything.

Data-driven deal sourcing engines can help expand deal flow beyond warm intros. Web scraping, alternative data sources, and intelligent ranking systems can help VCs surface under-the-radar companies.

As an example, VCs can:

Automate inbound filtering: Use NLP to scan cold inbound emails and prioritize based on criteria like traction signals, sector fit, or founder background.
Track internet signals: Follow trends on social media and identify high-potential leads online.
Deal tracking: Synthesize publicly available signals to detect early start-up momentum.
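As a rough illustration of the inbound filtering idea, here is a minimal sketch that scores cold emails against a few criteria and sorts them. It uses plain keyword matching as a stand-in for real NLP, and the criteria, keywords, weights, and sample emails are all hypothetical; a firm would replace them with terms from its own thesis.

```python
import re

# Hypothetical criteria: (keywords, weight). These are illustrative, not a recommended set.
CRITERIA = {
    "traction": (["revenue", "paying customers", "waitlist", "mrr"], 2.0),
    "sector_fit": (["fintech", "developer tools", "climate"], 1.5),
    "founder_background": (["phd", "previously founded", "second-time founder"], 1.0),
}


def score_email(text):
    """Add a criterion's weight once if any of its keywords appears in the text."""
    lowered = text.lower()
    matched = {}
    score = 0.0
    for name, (keywords, weight) in CRITERIA.items():
        hits = [
            kw for kw in keywords
            if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
        ]
        if hits:
            matched[name] = hits
            score += weight
    return score, matched


def rank_inbound(emails):
    """Return (score, subject) pairs sorted from highest to lowest score."""
    scored = [(score_email(body)[0], subject) for subject, body in emails]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up (subject, body) examples.
emails = [
    ("Quick intro", "Would love to chat about our new app."),
    ("Seed round", "We are a fintech startup with 40 paying customers and growing MRR. "
                   "I previously founded a payments company."),
    ("Climate analytics", "Climate analytics tool with 2,000 people on the waitlist."),
]

ranked = rank_inbound(emails)
for score, subject in ranked:
    print(f"{score:.1f}  {subject}")
```

The ranking only decides which emails get read first; every email is still there for a human to review, which keeps the last-mile decision where the post argues it belongs.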

Small improvements in sourcing can compound into better outcomes. The marginal k% of new companies you see is a measurable, direct improvement to your deal sourcing. Even if your schedule is completely packed and you can’t take more meetings, you can still broaden your start-up universe and choose to meet the companies that best fit your investment thesis.

The key is not replacing judgment but augmenting it.

Data systems shouldn’t make your decisions but they can put you in situations to make better ones.

Achintya's Substack
