Select experiences in testing AV perception systems 

GoMentum Station and AAA Northern California, Nevada & Utah partner with University of Waterloo Center for Automotive Research (WatCAR) to validate AV safety performance and compare results across virtual and physical scenario tests.

Authored by Paul Wells

A key challenge in validating autonomous vehicles — and, more broadly, safety-critical applications of AI — is the brittleness of machine learning (ML) based perception algorithms [1] [2]. This challenge is significant because errors in sensing & perception affect all downstream modules; if a system cannot properly detect its environment, it will have a diminished likelihood of completing its goal within this same environment. Although focused on full-vehicle testing via structured scenarios, our work with University of Waterloo highlighted this key issue. 

Our research subjected Waterloo’s automated research vehicle, Autonomoose, to several predefined scenarios: roundabout with road debris, traffic jam assist, standard pedestrian crossing, intersection pedestrian crossing, and stop sign occlusion.

These scenarios were tested first in simulation using the University’s WiseSim platform and the GeoScenario SDL, then re-tested physically on a closed course. Although intended as a broad exploration of the utility of controlled, physical testing for autonomous vehicle validation, this project nonetheless surfaced findings which, albeit specific to our given tech stack, reinforce the otherwise well-documented challenges in validating autonomous systems.

A few highlights of our experience paper, specifically as related to test result discrepancies due to the perception system, can be summarized as follows:

Simulation architecture in this case provided “perfect perception”. As such, our virtual tests assumed that all actors in the scene were perceived and understood by the system, in this case a modified instance of Baidu’s Apollo. This assumption led to large discrepancies between virtual and physical tests, especially in scenarios containing pedestrians. Once the vehicle was introduced to the physical environment, perception-related deficiencies resulted in a large number of inconsistent or completely missed detections. During the pedestrian crossing, for instance, the AV struggled with intermittent loss of pedestrian perception due to issues with the object detection module. In simulation, however, the pedestrian was readily detected. Our video at the top of the fold shows performance on the physical track, while the video below shows behavior in simulation.

Sensor fidelity in simulation was limited. Further highlighting the importance of closely tracked sensor model fidelity, the modeled lidar beam pattern used in simulation did not match the actual specifications of the on-vehicle sensor. This issue was uncovered through conflicting virtual and physical test results in a road debris scenario designed to assess the vehicle’s ability to detect small, stationary objects. As described in our paper, “The scenario exposed some lack of fidelity in the simulated lidar which caused inconsistent road debris detection behavior in simulation compared to the closed course. The lidar mounted on the roof of the SV had a non-linear vertical beam distribution, whereby the beams were grouped more densely around the horizon, and were too sparse to reliably detect the debris on the ground. In contrast, the simulated lidar had a linear vertical beam distribution, i.e., the beams were spaced out evenly. Consequently, implementing the non-linear beam distribution in the simulated lidar resulted in SV behavior in simulation consistent with the SV behavior on the closed course.” Pictured below, the Autonomoose fails to detect a cardboard box on the road during physical testing.
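To illustrate why the vertical beam distribution matters for small objects on the ground, here is a minimal Python sketch. It is not the WiseSim sensor model: the beam count, field of view, mounting height, box size, and the squared spacing used to crowd beams toward the horizon are all assumed values chosen for illustration, and the function names are hypothetical.

```python
import math


def linear_beam_angles(n, min_deg, max_deg):
    """Evenly spaced vertical beam angles (the 'linear' simulated lidar)."""
    step = (max_deg - min_deg) / (n - 1)
    return [min_deg + k * step for k in range(n)]


def horizon_weighted_beam_angles(n, min_deg, max_deg, power=2.0):
    """Beam angles crowded toward 0 degrees (the horizon).

    Assumption: an evenly spaced parameter u in [-1, 1] is raised to
    `power`, which compresses the spacing near u = 0. min_deg must be
    negative (downward) and max_deg positive (upward).
    """
    angles = []
    for k in range(n):
        u = -1.0 + 2.0 * k / (n - 1)
        limit = max_deg if u >= 0 else min_deg
        angles.append(limit * abs(u) ** power)
    return angles


def beams_hitting_object(angles_deg, sensor_height_m, distance_m, object_height_m):
    """Count beams whose height at `distance_m` lies on the object's front face."""
    hits = 0
    for angle in angles_deg:
        z = sensor_height_m + distance_m * math.tan(math.radians(angle))
        if 0.0 <= z <= object_height_m:
            hits += 1
    return hits


# Hypothetical setup: 32 beams, -25 to +15 degrees, roof lidar at 2.0 m,
# a 0.3 m tall box placed 6 m ahead of the sensor.
linear = linear_beam_angles(32, -25.0, 15.0)
weighted = horizon_weighted_beam_angles(32, -25.0, 15.0)
for name, pattern in (("linear", linear), ("horizon-weighted", weighted)):
    hits = beams_hitting_object(pattern, sensor_height_m=2.0,
                                distance_m=6.0, object_height_m=0.3)
    print(f"{name}: {hits} beam(s) on the box")
```

With these assumed values, the evenly spaced pattern places more beams on the box than the horizon-weighted one, which is the kind of mismatch that can make debris detection look more reliable in simulation than on the track.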

Environmental fidelity was limited. Finally, our use of WiseSim primarily involved a re-creation of the road network, but not the visual scene, present during closed-course testing. This created small complications when Autonomoose was ultimately brought onto the physical track. Principally, uneven road slopes at the physical track created unexpected lidar returns and false detections onboard the vehicle. Because WiseSim did not recreate ground plane measurements, some debugging was required at the track. This reiterates the need for close tracking of virtual model fidelity when using simulation to prepare for track tests and, more broadly, when performing correlations between virtual & physical tests.

Although these findings may not all generalize to the broader domain of AV testing, they nonetheless provide concrete instances of theoretical challenges facing the field. We will continue exploring these challenges with industry partners and sharing results. We also welcome inquiries about the scenarios used, technical assets created, and data sharing.

The full report is available for purchase and download on the SAE website.

For further info about the GoMentum testing team, please email us!


Understanding Unsettled Challenges in Autonomous Driving Systems

The recently published SAE EDGE Reports, co-authored by the test and research team at AAA Northern California, Nevada and Utah, highlight the key issues that the autonomous vehicle industry continues to face.

Authored by Atul Acharya

SAE EDGE Reports 2019
SAE EDGE Reports on Simulation and Balancing ADS Testing

The promise of highly automated vehicles (HAVs) has long been in reducing crashes, easing traffic congestion, ferrying passengers and delivering goods safely, all while providing more accessible and affordable mobility and enabling new business models.

However, the development of these highly complex, safety-critical systems is fraught with extremely technical, and often obscure, challenges. These systems are typically composed of four key modules, namely:

(i) the perception module, which understands the environment around the automated vehicle using sensors like cameras, lidars, and radars,

(ii) the prediction module, which predicts where all other dynamic actors and agents (such as pedestrians, bicyclists, vehicles, etc.) will be moving in the next 2-10 seconds,

(iii) the planning module, which plans the AV’s own path, taking into account the scene and dynamic constraints, and

(iv) the control module, which executes the trajectory by sending commands to the steering wheel and the motors.
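The data flow between these four modules can be sketched in a few lines of Python. This is a toy illustration only: the one-dimensional world, the constant-velocity prediction, the thresholds, and all function names are assumptions made for the example, not a description of any production ADS. Its only purpose is to show how a dropped detection in the perception module propagates to every module downstream.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Actor:
    """A road user, simplified to one dimension (hypothetical model)."""
    position_m: float  # distance ahead of the AV along its lane
    speed_mps: float   # speed along the lane; negative means approaching


def perceive(ground_truth: List[Actor], detected: List[bool]) -> List[Actor]:
    """Perception: keep only the actors the sensors actually detected."""
    return [actor for actor, seen in zip(ground_truth, detected) if seen]


def predict(actors: List[Actor], horizon_s: float = 2.0) -> List[float]:
    """Prediction: constant-velocity estimate of each actor's future position."""
    return [a.position_m + a.speed_mps * horizon_s for a in actors]


def plan(predicted_positions_m: List[float], safe_gap_m: float = 10.0) -> str:
    """Planning: brake if any predicted position falls inside the safety gap."""
    if any(p < safe_gap_m for p in predicted_positions_m):
        return "brake"
    return "cruise"


def control(decision: str) -> float:
    """Control: map the plan to a longitudinal acceleration command (m/s^2)."""
    return -3.0 if decision == "brake" else 0.0


def drive_one_step(ground_truth: List[Actor], detected: List[bool]) -> float:
    """Run perception -> prediction -> planning -> control once."""
    return control(plan(predict(perceive(ground_truth, detected))))
```

For a pedestrian 12 m ahead and approaching at 2 m/s, `drive_one_step` returns -3.0 when the pedestrian is detected and 0.0 when the detection is dropped: the planner and controller behave correctly on the inputs they receive, yet the vehicle does not brake.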

If you are an AV developer working on automated driving systems (ADS), or perhaps even advanced driver assistance systems (ADAS), you are already using various tools to make your job easier. These tools include simulators of various kinds — such as scenario designers, test coverage analyzers, sensor models, vehicle models — and their simulated environments. These tools are critical in accelerating the development of ADS. However, it is equally important to understand the challenges and limitations in using, deploying and developing such tools for advancing the benefits of ADS.

We in the AV Testing team at GoMentum actively conduct research in AV validation, safety, and metrics. Recently, we had an opportunity to collaborate with leading industry and academic partners to highlight these key challenges. Two workshops, organized by SAE International and convened by Sven Beiker, PhD, founder of Silicon Valley Mobility, were held in late 2019 to better understand various AV testing tools. The workshop participants included Robert Siedl (Motus Ventures), Chad Partridge (CEO, Metamoto), Prof. Krzysztof Czarnecki and Michał Antkewicz (both of University of Waterloo), Thomas Bock (Samsung), David Barry (Multek), Eric Paul Dennis (Center for Automotive Research), Cameron Gieda (AutonomouStuff), Peter-Nicholas Gronerth (fka), Qiang Hong (Center for Automotive Research), Stefan Merkl (TUV SUD America), John Suh (Hyundai CRADLE), and John Tintinalli (SAE International), along with AAA’s Atul Acharya and Paul Wells.

Simulation Challenges

One of the key challenges encountered in developing ADS is in developing accurate, realistic, reliable and predictable models for various sensors (cameras, lidars, radars), and actors and agents (such as vehicles of various types, pedestrians, etc.) and the world environment around them. These models are used for verification and validation (V&V) of advanced features. Balancing the model fidelity (“realism”) of key sensors and sub-systems while developing the product is a key challenge. The workshop addressed these important questions:

  1. How do we make sure simulation models (such as for sensors, vehicles, humans, environment) represent real-world counterparts and their behavior?
  2. What are the benefits of a universal simulation model interface and language, and how do we get to it?
  3. What characteristics and requirements apply to models at various levels, namely, sensors, sub-systems, vehicles, environments, and human drivers?

To learn more about these and related issues, check out the SAE EDGE Research Report EPR2019007:

Balancing ADS Testing in Simulation Environments, Test Tracks, and Public Roads

If you are a more-than-curious consumer of AV news, you might be familiar with various AV developing companies stating proudly that they have “tested with millions or billions of miles” in simulation environments, or “millions of miles” on public roads. How do simulation miles translate into real-world miles? Which matters more? And why?

If you have thought of these questions, then the second SAE EDGE report might be of interest.

The second workshop focused on a broader theme: How should AV developers allocate their limited testing resources across different modes of testing, namely simulation environments, test tracks, and public roads? Each mode of testing has its own benefits and limitations and can accelerate or hinder ADS development accordingly, so striking the right balance with limited resources is of paramount importance.

This report seeks to address the three most critical questions:

  1. What determines how to test an ADS?
  2. What is the current, optimal, and realistic balance of simulation testing and real-world testing?
  3. How can data be shared in the industry to encourage and optimize ADS development?

Additionally, it touches upon other challenges such as:

  • How might one compare virtual and real miles?
  • How often should vehicles (and their subsystems) be tested? And in what modes?
  • How might (repeat) testing be made more efficient?
  • How should companies share testing scenarios, data, methodologies, etc. across industry to obtain the maximum possible benefits?

To learn more about these and related challenges, check out the SAE EDGE Research Report EPR2019011:

Physical Test Efficiency via Virtual Results Analysis

AAA Northern California, UC Berkeley and LG Silicon Valley Labs partner to examine the use of digital twin and parameterized testing to drive efficiency in closed-course testing.

Authored by Paul Wells

Physical test execution at GoMentum Station. Pedestrian scenario along Pearl St. in “Downtown” zone.

Simulation and physical testing are widely known to be complementary modes of automated vehicle verification & validation (V&V). While simulation excels in scalable testing of software modules, physical testing provides the ability for full-vehicle, everything-in-the-loop testing and “real life” evaluations of sensing stacks or downstream perception / sensor fusion layers. These tradeoffs, in part, contribute to the test distribution approach put forth by Safety First For Automated Driving (SaFAD).

Matrix that maps test modes to use cases

“Test Platform and Test Item”. Safety First for Automated Driving, Chapter 3: Verification & Validation (2019).

Due to the ability of simulation to scale rapidly, much work is currently underway to resolve its core limitation (fidelity). Developments in modeling (sensors, environment, vehicle dynamics, human drivers, etc.) and fast, photorealistic scene rendering all stand to expand the scope of validation exercises that can be performed virtually. Much less studied, however, are the means by which physical testing can improve upon its core limitation (efficiency). Whereas the speed of virtual testing allows for near constant re-submission and re-design of test plans, the physical world is much less forgiving. Being strategic about time spent on a track is therefore vital to maximize the efficiency of physical testing. Although many inputs into efficient physical testing are fixed (e.g. the inherent slowness in working with hardware), it is unclear whether the utility of physical test data also scales linearly alongside the volume of test execution. In other words, are the results of all physical tests equally valuable, or are there certain parameters within a given scenario which, if realized concretely at the track, would result in more valuable insights? If so, how might one discover these higher-value parameters using a virtual test environment?

These questions were central to our three-way research partnership between GoMentum Station (owned and operated by AAA Northern California), UC Berkeley VeHiCaL, and LG Silicon Valley Lab (LG SVL). Accordingly, we modeled a digital twin of GoMentum in the LG Simulator and leveraged an integration between Berkeley’s Scenic scenario description language (Scenic) and LG’s simulation platform. We selected a pedestrian crossing scenario and used the Scenic-LG integration to parameterize the scenario along several relevant vectors, executing nearly a thousand different virtual test cases. The results of one such test set were as follows, where rho is a proxy for the minimum distance between the ego vehicle and pedestrian target. Within this plot, we found the clustering patterns along rho to be most interesting. As such, we selected eight cases for physical test execution: two failures (F), three successes (S), and three marginal (M) cases, where failure cases exhibited minimum distance values of less than or equal to 0.4 meters. In short, the results from physical testing established safe/marginal/unsafe parity across test modes.

3-D plot depicting simulation test results

Each point represents a test case. X = pedestrian start delay (s), Y = pedestrian walk distance (m), Z = pedestrian hesitate time (s). Rho = proxy for minimum distance.
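The step from a cloud of virtual results to a short physical test plan can be sketched as follows. This is a hedged illustration, not the Scenic-LG tooling: the 0.4 m failure threshold comes from the text above, while the marginal threshold, the `VirtualCase` structure, and the rule of taking the lowest-rho cases in each class are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

FAILURE_RHO_M = 0.4   # from the text: failures had minimum distance <= 0.4 m
MARGINAL_RHO_M = 1.0  # assumption: upper bound of the "marginal" band


@dataclass
class VirtualCase:
    start_delay_s: float    # X: pedestrian start delay
    walk_distance_m: float  # Y: pedestrian walk distance
    hesitate_time_s: float  # Z: pedestrian hesitate time
    rho_m: float            # proxy for minimum ego-pedestrian distance


def classify(case: VirtualCase) -> str:
    """Label a virtual result as failure (F), marginal (M), or success (S)."""
    if case.rho_m <= FAILURE_RHO_M:
        return "F"
    if case.rho_m <= MARGINAL_RHO_M:
        return "M"
    return "S"


def select_for_track(cases: List[VirtualCase],
                     quota: Dict[str, int]) -> List[VirtualCase]:
    """Pick up to quota[label] cases per class, lowest rho first (assumed rule)."""
    selected: List[VirtualCase] = []
    for label, count in quota.items():
        in_class = sorted((c for c in cases if classify(c) == label),
                          key=lambda c: c.rho_m)
        selected.extend(in_class[:count])
    return selected
```

Calling `select_for_track(cases, {"F": 2, "M": 3, "S": 3})` on a list of simulated results yields at most eight concrete parameter sets to recreate at the track, mirroring the two-failure, three-marginal, three-success split described above.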

While the concept of correlating virtual and physical tests is not in itself novel, our results provide evidence to suggest that parametrization and analysis of virtual test results can be used to inform physical test plans. Specifically, the framework of recreating “low-rho” failure cases physically and within a reasonable degree of conformance to virtual runs allows for the capture of rich ground-truth data pertaining to Vehicle Under Test (VUT)-specific critical cases, all without the guesswork involved in manually tuning scenario parameters at the physical test site. Because we were able to tease out deficiencies of the AV stack after running only ten-odd test cases, the utility of data captured relative to time spent onsite was significant. As compared to relatively unstructured physical test exercises or arbitrarily assigned scenario parameters, this framework represented an efficient means of both discovering and recreating physical critical cases.

Stepping outside this framework and the original research scope, our results also provide evidence of several challenges in validating highly automated vehicles (HAV). Even within our relatively low volume of tests, one test case (M2) produced a false negative when transitioning from virtual to physical testing — underscoring the importance of testing virtually and physically, as well as the difficulty in interpreting simulation results or using simulation results as a proxy for overall vehicle competence. We were also surprised by the exceptionally sensitive tolerances within each parameter. In certain cases the difference between a collision / no collision was a matter of milliseconds in pedestrian hesitation time, for instance. This underscores both the brittleness and, to a lesser extent, the non-determinism of machine learning algorithms — two of the broader challenges in developing AI systems for safety-critical applications.

Physical test execution at GoMentum Station. Pedestrian scenario along Pearl St. in “Urban Zone”.

Importantly, these challenges face industry and independent assessors alike. Current test protocols for agencies like EuroNCAP are very rigid. Vehicle speed, a primary test parameter in an AEB VRU test protocol for instance, varies along a step function with a range from ~20-60 kph and increments of 5 kph. While perhaps suitable for L2 systems where drivers remain the primary line of defense, this approach clearly contradicts the parameter sensitivities exhibited above. If independent assessors hope to draw meaningful conclusions from the results of physical test exercises, these exercises will need to be highly contrived: not only will the scenarios need to be chosen according to a particular operational design domain (ODD), but the parameters used to concretize physical tests should in fact be assigned by inquiry, perhaps even VUT-specific inquiry, instead of by top-down, industry-wide mandate. This could necessitate an assessment framework built around coverage and safety case development rather than test-by-test scoring and comparison, an approach encouraged by UL 4600 and by groups like Foretellix.
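A small Python sketch makes the mismatch between a fixed speed grid and narrow parameter sensitivities concrete. The grid below approximates the 20-60 kph range and 5 kph increment mentioned above; the failure band is purely hypothetical and both function names are invented for the example.

```python
def step_speeds_kph(start=20, stop=60, step=5):
    """Test speeds on a fixed step grid, inclusive of both endpoints."""
    return list(range(start, stop + 1, step))


def grid_misses_band(speeds_kph, band_lo_kph, band_hi_kph):
    """True if no grid speed falls inside the closed band [band_lo, band_hi]."""
    return not any(band_lo_kph <= s <= band_hi_kph for s in speeds_kph)


speeds = step_speeds_kph()
# Hypothetical: suppose a given VUT only fails between 31 and 34 kph.
print(grid_misses_band(speeds, 31.0, 34.0))
```

Any failure band narrower than the grid increment can fall entirely between two test speeds, which is why parameters assigned by inquiry (for example, from virtual results) may expose cases that a fixed protocol never exercises.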

Many open questions remain in examining HAV test efficiency and overall HAV validation processes. We look forward to using the insights above in 2020 to delve deeper into research and to continue forging relationships with industry. We will be engaging in subsequent projects within the SAE IAMTS coalition to further explore both the toolchain and process used for correlating virtual and physical results.

This work also generated a number of outputs that we look forward to sharing with the industry. All videos from our track testing at GoMentum are available here, with recorded simulation runs here. The underlying datasets — both from ground truth at the track and outputs of the simulation — are also being used for further analysis and safety metric development.

Introducing The GoMentum Blog

The Bay Area has been the epicenter of the self-driving vehicle industry since Waymo’s Chauffeur project in 2013. Over the last seven-odd years the industry has evolved tremendously; sixty-six companies are currently permitted to test vehicles on California public roads. For Waymo and startups alike, public road testing is but one environment in a broader testing regime. Simulation, of course, is another dominant mode, with industry leaders citing millions or billions of virtual miles driven. Controlled testing on private roads is a further piece of a holistic approach to validation, and one that has been employed in the auto industry for decades (GM first bought Milford in 1936).

When the AV industry began, Bay Area teams were equipped primarily with research vehicles and needed only parking lots for controlled, full-vehicle testing before moving to public road driving. Especially over the last two years, now with fleets of vehicles driving on public roads and a focus on scale, most of the industry has outgrown parking lots. GoMentum, a military base turned proving ground since 2014, serves as a more robust tool for the growing needs of validation teams. AAA Northern California, Nevada & Utah became formally involved in 2018 and has since invested in infrastructure and operations to support efficient use of the site. We’re proud to partner with multiple companies.

As the use of GoMentum as a development tool grew, we also began to think about other opportunities – beyond the physical facility – that could advance the industry’s understanding of safety. Guided by many questions that have crystalized in the field – most notably, how safe is safe enough? – GoMentum created a research agenda in 2019. This agenda sought to complement the work of ISO 21448, UL 4600, Pegasus, and Safety First for Automated Driving by focusing on three key areas: methods for full-vehicle validation, the relationship between physical and virtual testing, and safety metric development.

Having wrapped several of these projects, we’re now excited to share our results. Please stay tuned as we continue to release more about our work and offer insights into questions like: How can AV safety be measured? What are the challenges introduced when distributing validation across physical and virtual environments? What is the state of the current validation toolchain? Which requirements are important for vehicle safety? What needs to be done by legislators and policy makers to ensure the safety of the public?