High-quality data is the fuel that drives AI algorithms. Without a constant flow of labeled data, bottlenecks arise and the algorithm can gradually get worse, adding complexity to the program.
That is why labeled data is so important for companies like Zoox, Cruise, and Waymo, which use it to train machine learning models to build and deploy autonomous vehicles. That need is what led to the development of Scale AI, a startup that uses software and people to process and label image, Lidar, and map data for businesses developing machine learning algorithms.
Companies focusing on autonomous vehicle technologies make up a wide swath of Scale’s user base, although Airbnb, Pinterest, and OpenAI, among others, also use its platform.
The COVID-19 pandemic has slowed, or even halted, that flow of data as AV companies suspended testing on public roads, the means of collecting billions of images. Scale hopes to turn the tap back on, and at no cost.
This week the company, in collaboration with Lidar manufacturer Hesai, launched an open-source data set called PandaSet that can be used to train machine learning models for autonomous driving. The data set, which is free and available for academic and commercial use, contains data collected with Hesai’s forward-facing PandarGT Lidar, which has image-like resolution, as well as its mechanical spinning Lidar, known as Pandar64. The data was collected while driving in urban areas of San Francisco and Silicon Valley before officials issued stay-at-home orders in the area, according to the company.
“AI and machine learning are fantastic inventions with massive potential for impact, but also a considerable pain in the butt,” said Scale CEO and co-founder Alexandr Wang in a recent interview.
The data set contains more than 48,000 camera images and 16,000 Lidar sweeps, spread across more than 100 scenes of 8 seconds each, according to the company. It also includes 28 annotation classes for each scene and 37 semantic segmentation labels for most scenes. Conventional cuboid labeling, in which little boxes are placed around a bike or vehicle, for example, cannot adequately capture all of the Lidar data. So Scale uses a point cloud segmentation tool to precisely annotate complex objects such as rain.
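The difference between the two labeling styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the points, the box coordinates, the `Cuboid` class, and the "rain" class name are all hypothetical, the box is axis-aligned for simplicity, and none of it reflects PandaSet’s actual file format or Scale’s tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in meters, hypothetical values


@dataclass
class Cuboid:
    """Axis-aligned box label (a simplification made for this sketch)."""
    label: str
    min_corner: Point
    max_corner: Point

    def contains(self, p: Point) -> bool:
        # A point is inside when every coordinate lies between the two corners.
        return all(lo <= v <= hi
                   for lo, v, hi in zip(self.min_corner, p, self.max_corner))


def labels_from_cuboids(points: List[Point], cuboids: List[Cuboid]) -> List[str]:
    """Give each point the label of the first cuboid containing it, else 'unlabeled'."""
    return [
        next((c.label for c in cuboids if c.contains(p)), "unlabeled")
        for p in points
    ]


# Hypothetical sweep: three returns on a car, three scattered returns from rain.
points = [
    (5.0, 1.0, 0.5), (5.5, 1.2, 0.8), (6.0, 0.9, 1.0),
    (2.0, -3.0, 1.5), (9.0, 4.0, 2.5), (12.0, -1.0, 0.7),
]

# Cuboid labeling: one box around the car; diffuse returns fall outside any box.
cuboids = [Cuboid("car", (4.5, 0.5, 0.0), (6.5, 1.5, 1.5))]
cuboid_labels = labels_from_cuboids(points, cuboids)

# Point-level segmentation: one class per point, no box required.
segmentation_labels = ["car", "car", "car", "rain", "rain", "rain"]

unlabeled_by_cuboids = cuboid_labels.count("unlabeled")
print(cuboid_labels)
print(segmentation_labels)
```

In this toy sweep the box covers the car but leaves the scattered returns without a label, while the per-point list assigns every return a class. That is the gap point cloud segmentation is meant to close.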
Open-sourcing AV data is not entirely new. Last year, Aptiv and Scale launched nuScenes, a large-scale data set from an autonomous vehicle sensor suite. Argo AI published curated data along with high-definition maps, while Cruise shared Webviz, a data visualization tool it created that takes raw data gathered from all the sensors on a robot and turns the binary data into visuals.
Scale’s effort is a bit different; for example, Wang said the license to use this data set carries no restrictions.
The goal of this Lidar data set was to provide free access to a comprehensive, content-rich data set, which Wang said was accomplished by using two types of Lidar in complex urban environments packed with cars, motorcycles, traffic lights, and pedestrians.