Crunchbase News examines the intersection of money and startups, often through the lens of data. Especially in our more analysis-heavy work, we do our best to explain how we arrived at a certain set of numbers and why we chose that approach.
That’s what we want to do here, but more broadly. This page serves as a resource that explains where we get our data, how it’s aggregated, and what we and our readers must take into consideration when thinking about private company data.
For more general information about Crunchbase News, its policies, staff, and its relationship with Crunchbase, check out the About page for Crunchbase News.
What Data Do We Use?
In our reporting on various happenings at the intersection of money and technology, the Crunchbase News team primarily, though not exclusively, uses data and information from Crunchbase.
Data From Crunchbase
Here are some types of Crunchbase data we frequently use in our reporting:
- Funding rounds
- Investors (i.e. the set of institutional and individual investors—categorized by type—which lead and participate in funding rounds)
- Funds (the discrete pools of capital raised by institutional investment firms)
From time to time, the News team will also use aggregated public “people” data in Crunchbase, including information about educational backgrounds, current professional status, past jobs, and locations.
Whenever it’s possible, we also check our findings against available industry reports and news coverage, which we will typically cite in any resulting reporting.
About Reported & Projected Private Company Data From Crunchbase
Crunchbase News uses two types of Crunchbase data in its news and industry coverage: reported data and projected data. In this section, we’ll explain what both mean.
In the vast majority of our coverage, the Crunchbase News team uses only “reported” data from Crunchbase. In slightly more technical terms, we’re capturing a snapshot of the current state of Crunchbase data at the time of reporting. In practice, this means that our coverage is based on the best data available at that time. Keep in mind that new information is continuously added as it becomes publicly available.
The News team typically uses projected data only in its quarterly and annual reporting.
Projections—tabulated by Crunchbase in collaboration with the Crunchbase News team—are based on historical patterns in delayed reporting, which are most pronounced at the earliest stages of the venture lifecycle. Using projected data helps prevent undercounting or reporting skewed trends that correct over time.
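The exact projection method is not spelled out here, so the following is only a minimal sketch of the general idea: if history suggests that some share of a period's deals is on record at a given point after the period closes, a reported count can be scaled up by that share. The function name, the completeness rates, and the deal counts are all hypothetical and are not Crunchbase figures.

```python
def project_count(reported_count, historical_completeness):
    """Scale a reported deal count by the share of deals that has
    historically been on record at the same point after period close."""
    if not 0 < historical_completeness <= 1:
        raise ValueError("historical_completeness must be in (0, 1]")
    return reported_count / historical_completeness


# Hypothetical completeness rates (assumed for illustration only).
# Reporting delays are most pronounced at the earliest stages, so the
# seed-stage rate is the lowest.
HYPOTHETICAL_COMPLETENESS = {"seed": 0.60, "early": 0.80, "late": 0.95}

# Hypothetical reported deal counts for one quarter.
reported = {"seed": 1200, "early": 800, "late": 190}

projected = {
    stage: round(project_count(count, HYPOTHETICAL_COMPLETENESS[stage]))
    for stage, count in reported.items()
}
```

In this sketch the seed-stage count receives the largest upward adjustment, which mirrors the point that delayed reporting matters most for the earliest rounds.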
Private Company Data
Remember that we primarily cover private companies and the various flavors of private equity (venture capital chief among them). No database of private companies or their funding rounds is 100 percent complete or up to date, not even Crunchbase.
In our coverage, we strive to remain mindful of the biases introduced by using reported data, and in the interest of transparency, we’d like to discuss some of the known issues with private company data. These are challenges shared among all aggregators of private company data.
- Reporting delays. Oftentimes, venture capital deals are not disclosed at the precise time they’re signed. This is important to remember when looking at “snapshots” of private company funding data. There’s often a gap of several weeks between finalizing paperwork and public disclosure, even for big rounds raised by late-stage ventures. Smaller rounds might be disclosed publicly, but not get picked up by the press or an automated data collection system. Especially at the seed stage, when sums raised are often too small to trigger a regulatory filing and many companies opt to remain stealthy, many rounds are added to the public record long after they were finalized. Unless these deals are manually added to a dataset later on, such as when a company raises its first institutional round and updates its angel and seed deal history, they may not otherwise be included.
- Voluntary or selective reporting bias. Crunchbase gathers data from many sources, and some contributions to its database come directly from its users and site visitors. Some of these individual contributors represent the companies and investment firms they’re contributing information about. Accordingly, this may paint a rosier picture of a company or industry sector. You could easily understand why a failed company would want to withhold how much investor capital was lost on their venture, or why an entrepreneur wouldn’t create a Crunchbase profile for a startup that fizzled out prematurely.
- Sector and business under-coverage. Because there are so many companies out there, most private company data aggregators focus on a particular subset of companies, industries, or transaction types, which necessarily limits the scope of their coverage. A dataset built around venture capital deals is therefore likely to have a fairly comprehensive record of high-growth technology and life sciences companies and the transactions they’re involved in. Small, local businesses and sole proprietorships might be under-represented in such a dataset.
- Geographic under-coverage. The quality and quantity of information a private company data aggregator collects is not geographically uniform. Typically, datasets of private companies have the most robust coverage in the market that’s local to the aggregator. All things being equal, a dataset of American startups compiled by an American company is probably going to be more comprehensive than a South African data company’s collection of U.S. startup information. These sampling biases are more pronounced when there’s a language barrier. To mitigate this, many aggregators of global startup data partner with local data collectors and employ multilingual staff to enrich and verify data.
How We Access Crunchbase Data
Although members of the News team often use Crunchbase Pro while researching and reporting stories, the majority of our “large sample size” analyses are based on daily snapshots of the Crunchbase dataset.
These snapshots are available to some Crunchbase users. The daily exports are useful for obtaining and working with relatively large datasets.
Over time, the News team intends to rely more heavily on the Crunchbase API. For smaller queries, or where up-to-date data is absolutely required, using the API or Crunchbase Pro is the better option. For bigger queries (over the 1,000-row export limit currently placed on Crunchbase Pro) or queries that we’d want to execute periodically in the background, consuming data through the API is required.
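One possible reading of these access rules can be sketched as a small helper. The function and source names are hypothetical; only the 1,000-row export limit comes from the text above, and the assumption that a daily snapshot is unsuitable when current data is absolutely required is our own interpretation.

```python
PRO_EXPORT_ROW_LIMIT = 1_000  # export limit on Crunchbase Pro cited above


def viable_sources(expected_rows, needs_current_data=False, runs_periodically=False):
    """Return the access routes that fit a query, under one reading of the text."""
    # The API is described as suitable for small and large queries alike.
    sources = ["api"]
    # Pro suits interactive queries that fit under its export limit.
    if expected_rows <= PRO_EXPORT_ROW_LIMIT and not runs_periodically:
        sources.append("crunchbase_pro")
    # Assumption: snapshots are taken daily, so they are set aside when
    # up-to-date data is absolutely required.
    if not needs_current_data:
        sources.append("daily_snapshot")
    return sources


small_fresh_query = viable_sources(500, needs_current_data=True)
large_sample_query = viable_sources(50_000)
background_job = viable_sources(200, needs_current_data=True, runs_periodically=True)
```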
How We Use And Define Industry Categories
The Crunchbase News team will often base its analysis of various industries on one or more of the industry categories defined by Crunchbase. However, we may augment these lists by including companies that use related keywords in their descriptions.
For example, in coverage about the cannabis industry, we may start with the set of companies in Crunchbase’s Cannabis category, but also include companies that use terms like “cannabinoids,” “cannabis extracts,” and “cannabidiol” in their descriptions.
We do this to ensure we’re deriving as complete a survey as possible of a given sector. Whether a company or investor is included in a given analysis is ultimately at the reporter’s discretion, but we disclose all significant inclusions and exclusions.
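The category-plus-keyword approach described above can be sketched roughly as follows. The record layout (`categories`, `description`), the function name, and the sample companies are hypothetical; only the Cannabis category and the three keywords come from the example in the text.

```python
def in_sector(company, category, keywords):
    """True if the company is tagged with the category or its description
    mentions any of the related keywords (case-insensitive)."""
    if category in company.get("categories", []):
        return True
    description = company.get("description", "").lower()
    return any(keyword.lower() in description for keyword in keywords)


CANNABIS_KEYWORDS = ["cannabinoids", "cannabis extracts", "cannabidiol"]

# Hypothetical company records, for illustration only.
companies = [
    {"name": "Alpha Farms", "categories": ["Cannabis"],
     "description": "Licensed grower."},
    {"name": "Beta Labs", "categories": ["Biotechnology"],
     "description": "Develops cannabidiol therapeutics."},
    {"name": "Gamma Pay", "categories": ["FinTech"],
     "description": "Payments for retailers."},
]

selected = [c["name"] for c in companies if in_sector(c, "Cannabis", CANNABIS_KEYWORDS)]
```

Plain substring matching like this can pull in false positives, which is one reason the final call on inclusion would still rest with the reporter.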
How We Use And Define VC Investment Stages
The stages defined below cover the types of funding events that are collectively referred to as “venture” rounds.
In the reports we produce at the end of each quarter, we use the following heuristics for categorizing funding rounds by stage:
- Angel & Seed-stage comprises seed, pre-seed, and angel rounds. Crunchbase also includes venture rounds of unknown series, transactions of undisclosed type, and convertible notes totaling $1 million (USD or as-converted USD equivalent) or less. Equity crowdfunding rounds with no listed dollar value, as well as those totaling less than $5 million, are also counted as seed-stage.
- Early stage comprises Series A and Series B rounds, as well as other round types. Crunchbase also includes venture rounds of unknown series, transactions of undisclosed type, and convertible notes totaling between $1,000,001 and $15,000,000. Convertible note rounds with missing dollar values are also counted as early-stage.
- Late stage comprises Series C, Series D, Series E, and later-lettered venture rounds following the “Series [Letter]” naming convention. Also included are venture rounds of unknown series, transactions of undisclosed type, and convertible notes of $15,000,001 or more.
- Technology growth is a private equity round raised by a company that has previously raised a “venture” round. (So, basically, any round from the previously-defined stages.)
We do not count private equity rounds in non-venture-backed startups, secondary market transactions, post-IPO transactions, debt financings, grants, non-equity assistance, or ICOs as venture funding rounds. Crunchbase also excludes “corporate rounds,” like the $12.8 billion deal Altria Group struck with e-cigarette maker JUUL, from the “venture” category.
Note that the definitions Crunchbase News uses for different funding stages are expansive by design.
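The stage heuristics above can be written down as a small classifier. This is a sketch under stated assumptions, not Crunchbase's implementation: the round-type strings (`pre_seed`, `series_unknown`, and so on) and the function signature are hypothetical labels chosen for illustration, and cases the text does not specify return `None` rather than a guess.

```python
import re

SEED_TYPES = {"seed", "pre_seed", "angel"}
# Round types that are assigned to a stage by dollar amount.
AMOUNT_BUCKETED_TYPES = {"series_unknown", "undisclosed", "convertible_note"}


def classify_round(round_type, amount_usd=None, company_has_venture_history=False):
    """Return 'seed', 'early', 'late', 'technology_growth', or None."""
    if round_type in SEED_TYPES:
        return "seed"
    lettered = re.fullmatch(r"series_([a-z])", round_type)
    if lettered:
        # Series A and B are early stage; Series C and later are late stage.
        return "early" if lettered.group(1) in "ab" else "late"
    if round_type == "equity_crowdfunding":
        if amount_usd is None or amount_usd < 5_000_000:
            return "seed"
        return None  # $5 million or more is not specified in the text
    if round_type in AMOUNT_BUCKETED_TYPES:
        if amount_usd is None:
            # Only convertible notes with missing values are specified.
            return "early" if round_type == "convertible_note" else None
        if amount_usd <= 1_000_000:
            return "seed"
        if amount_usd <= 15_000_000:
            return "early"
        return "late"
    if round_type == "private_equity":
        # Counts only when the company previously raised a venture round.
        return "technology_growth" if company_has_venture_history else None
    # Debt, grants, ICOs, post-IPO, secondary, and corporate rounds are excluded.
    return None


examples = {
    "pre_seed": classify_round("pre_seed"),
    "note_no_amount": classify_round("convertible_note"),
    "unknown_20m": classify_round("series_unknown", 20_000_000),
    "pe_after_venture": classify_round("private_equity", 50_000_000, True),
    "debt": classify_round("debt_financing", 2_000_000),
}
```

Returning `None` for both excluded and unspecified cases keeps the sketch short; a fuller version might distinguish the two.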
Regarding External Data Sources
Quite often, we want to use data for a story that isn’t tracked by Crunchbase.
In these cases, we rely on external, authoritative data and information sources. Here are some of the types of data we typically need to source from elsewhere:
- Quarterly and annual financial reporting by public companies.
- Regulatory filings.
  - Documents filed with the U.S. Securities and Exchange Commission, typically available through the main company search feature of the SEC’s website.
  - Investment Adviser Public Disclosures (IAPDs) by investment firms and individual advisors can be looked up through the SEC’s advisor search engine; we’ve used these disclosures to discover and traverse the legal structure of various investment firms.
  - We may also cover filings made at the state or municipal level, which are often publicly available if not always accessible online.
- Time-series market pricing data (i.e. the minute-to-minute, hour-to-hour, and day-to-day price fluctuations of assets traded on open markets, like stocks and cryptocurrencies).
  - Coarser-resolution data is available publicly through information aggregators like Yahoo Finance and most online brokerages.
- Various statistics related to cryptocurrencies (e.g. network transaction volume, proof-of-work difficulty, transaction backlog size, various mining pools’ share of network hashrates, etc.), which we obtain from various sources.
  - For the price of bitcoin and other major cryptos, we’ve used Coindesk’s exchange-weighted indices.
  - Network statistics are openly available on sites like Blockchain.com.
  - Cryptocurrency market capitalization information is available on CoinMarketCap. (See the graphic above, from October 2018, as an example.)
A List Of Software Tools We’ve Used For Data Analysis, Processing, And Visualization
The Crunchbase News team uses a broad yet fairly basic toolkit for performing data analysis and visualization tasks. Chances are you’ve used (or at least heard of) most of the software we use to derive insights and suss out trends from the venture capital and startup data we have access to through our relationship with Crunchbase.
Data Analysis & Processing
- Microsoft Excel and Google Sheets. We use Excel and Sheets pretty interchangeably for filtering, sorting, and viewing tabular data exported from Crunchbase. Additionally, we use these tools to make the majority of the pivot tables behind the charts and graphs on Crunchbase News.
- Python. For working with very big datasets or performing more complex operations and transformations, we use open source tools. We’d like to thank the creators of and contributors to NumPy, Pandas, Requests, JupyterLab, Anaconda, and others for supporting and maintaining open source software.
- Other open source tools. Gephi is an open source software package designed for network analysis and visualization. Crunchbase News has used geographic layouts in Gephi to visualize intra-regional investor networks in the United States (pictured above), and we’ve used it to map out Chinese transportation giant Didi Chuxing’s direct investments in ride-hailing startups.
Data Visualization
- Charts. Apple Numbers has a surprisingly customizable and very accessible charting system we use to make most of the bar, line, and pie charts you see on Crunchbase News.
- Interactive Maps. A few of our older pieces featured interactive maps made using the community version of Tableau.
- Network Visualizations. Our network visualizations are primarily made with Gephi.
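As a rough illustration of the pivot-table step mentioned above, the sketch below groups funding rounds by quarter and stage using only the Python standard library. The field names and sample rows are hypothetical; the same grouping could be done in a spreadsheet or with a Pandas pivot table.

```python
from collections import defaultdict
from datetime import date

# Hypothetical funding rounds, for illustration only.
rounds = [
    {"announced_on": date(2018, 1, 15), "stage": "seed", "raised_usd": 1_500_000},
    {"announced_on": date(2018, 2, 3), "stage": "early", "raised_usd": 12_000_000},
    {"announced_on": date(2018, 5, 20), "stage": "seed", "raised_usd": 2_000_000},
    {"announced_on": date(2018, 3, 9), "stage": "seed", "raised_usd": 500_000},
]


def quarter_label(day):
    """Map a date to a label such as '2018 Q1'."""
    return f"{day.year} Q{(day.month - 1) // 3 + 1}"


def pivot_by_quarter_and_stage(rows):
    """Count deals and sum dollars for each (quarter, stage) pair."""
    table = defaultdict(lambda: {"deals": 0, "raised_usd": 0})
    for row in rows:
        cell = table[(quarter_label(row["announced_on"]), row["stage"])]
        cell["deals"] += 1
        cell["raised_usd"] += row["raised_usd"]
    return dict(table)


pivot = pivot_by_quarter_and_stage(rounds)
```

Each key of `pivot` is a (quarter, stage) pair, which maps directly onto the rows and columns of a chart-ready table.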
The Crunchbase News team will update this document over time as our use of data changes. Major alterations to this document will be recorded below: