Physical Test Efficiency via Virtual Results Analysis

AAA Northern California, UC Berkeley, and LG Silicon Valley Lab partner to examine the use of digital twin and parameterized testing to drive efficiency in closed-course testing. For a summary of the study, download the AAA-UCB-LG AV Testing Project Whitepaper.

Authored by Paul Wells

Physical test execution at GoMentum Station. Pedestrian scenario along Pearl St. in “Downtown” zone.

UPDATE: On August 20, 2020 UC Berkeley hosted a webinar on the research outlined in this article and tools leveraged in the work. Visit the YouTube link for an in-depth overview.

Simulation and physical testing are widely known to be complementary modes of automated vehicle verification & validation (V&V). While simulation excels in scalable testing of software modules, physical testing provides the ability for full-vehicle, everything-in-the-loop testing and “real life” evaluations of sensing stacks or downstream perception / sensor fusion layers. These tradeoffs, in part, contribute to the test distribution approach put forth by Safety First For Automated Driving (SaFAD).

Matrix that maps test modes to use cases

“Test Platform and Test Item”. Safety First for Automated Driving, Chapter 3: Verification & Validation (2019).

Due to the ability of simulation to scale rapidly, much work is currently underway to resolve its core limitation (fidelity). Developments in modeling (sensors, environment, vehicle dynamics, human drivers, etc.) and fast, photorealistic scene rendering all stand to expand the scope of validation exercises that can be performed virtually. Much less studied, however, are the means by which physical testing can improve upon its core limitation (efficiency). Whereas the speed of virtual testing allows for near constant re-submission and re-design of test plans, the physical world is much less forgiving. Being strategic about time spent on a track is therefore vital to maximize the efficiency of physical testing. Although many inputs into efficient physical testing are fixed (e.g. the inherent slowness in working with hardware), it is unclear whether the utility of physical test data also scales linearly alongside the volume of test execution. In other words, are the results of all physical tests equally valuable, or are there certain parameters within a given scenario which, if realized concretely at the track, would result in more valuable insights? If so, how might one discover these higher-value parameters using a virtual test environment?

These questions were central to our three-way research partnership between GoMentum Station (owned and operated by AAA Northern California), UC Berkeley VeHiCaL, and LG Silicon Valley Lab (LG SVL). Accordingly, we modeled a digital twin of GoMentum in the LG Simulator and leveraged an integration between Berkeley’s Scenic scenario description language (Scenic) and LG’s simulation platform. We selected a pedestrian crossing scenario and used the Scenic-LG integration to parameterize the scenario along several relevant vectors, executing nearly a thousand different virtual test cases. The results of one such test set were as follows, where rho is a proxy for the minimum distance between the ego vehicle and pedestrian target. Within this plot, we found the clustering patterns along rho to be most interesting. As such, we selected eight cases for physical test execution: two failures (F), three successes (S), and three marginal (M) cases, where failure cases exhibited minimum distance values less than or equal to 0.4 meters. In short, the results from physical testing established safe/marginal/unsafe parity across test modes.

3-D plot depicting simulation test results

Each point represents a test case. X = pedestrian start delay (s), Y = pedestrian walk distance (m), Z = pedestrian hesitate time (s). Rho = proxy for minimum distance.
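The selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling used in the study: the `VirtualResult` record, the `classify` and `select_for_track` helpers, and the 1.0 m upper bound for "marginal" cases are all hypothetical. Only the 0.4 m failure threshold and the three scenario parameters come from the text, and the sketch treats rho directly as a minimum distance in meters.

```python
from dataclasses import dataclass
from typing import Dict, List

FAILURE_MAX_M = 0.4   # from the article: failure if minimum distance <= 0.4 m
MARGINAL_MAX_M = 1.0  # ASSUMPTION: hypothetical cutoff, not given in the article


@dataclass(frozen=True)
class VirtualResult:
    """One simulated test case (hypothetical record layout)."""
    start_delay_s: float     # pedestrian start delay (s)
    walk_distance_m: float   # pedestrian walk distance (m)
    hesitate_time_s: float   # pedestrian hesitate time (s)
    min_distance_m: float    # rho, treated here as minimum distance (m)


def classify(result: VirtualResult) -> str:
    """Label a case as failure (F), marginal (M), or success (S)."""
    if result.min_distance_m <= FAILURE_MAX_M:
        return "F"
    if result.min_distance_m <= MARGINAL_MAX_M:
        return "M"
    return "S"


def select_for_track(
    results: List[VirtualResult], quota: Dict[str, int]
) -> Dict[str, List[VirtualResult]]:
    """Pick up to quota[label] cases per label, lowest minimum distance first."""
    picked: Dict[str, List[VirtualResult]] = {label: [] for label in quota}
    for result in sorted(results, key=lambda r: r.min_distance_m):
        label = classify(result)
        if label in picked and len(picked[label]) < quota[label]:
            picked[label].append(result)
    return picked
```

Calling `select_for_track(results, {"F": 2, "M": 3, "S": 3})` would mirror the two/three/three split used for the track tests. Sorting by minimum distance is one possible design choice; it favors the cases closest to a collision within each class.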

While the concept of correlating virtual and physical tests is not in itself novel, our results provide evidence to suggest that parameterization and analysis of virtual test results can be used to inform physical test plans. Specifically, the framework of recreating “low-rho” failure cases physically and within a reasonable degree of conformance to virtual runs allows for the capture of rich ground-truth data pertaining to Vehicle Under Test (VUT)-specific critical cases, all without the guesswork involved in manually tuning scenario parameters at the physical test site. Because we were able to tease out deficiencies of the AV stack after running only ten-odd test cases, the utility of data captured relative to time spent onsite was significant. As compared to relatively unstructured physical test exercises or arbitrarily assigned scenario parameters, this framework represented an efficient means of both discovering and recreating physical critical cases.

Stepping outside this framework and the original research scope, our results also provide evidence of several challenges in validating highly automated vehicles (HAV). Even within our relatively low volume of tests, one test case (M2) produced a false negative when transitioning from virtual to physical testing. This underscores the importance of testing both virtually and physically, as well as the difficulty in interpreting simulation results or using simulation results as a proxy for overall vehicle competence. We were also surprised by the exceptionally sensitive tolerances within each parameter. In certain cases, the difference between a collision and no collision was a matter of milliseconds in pedestrian hesitation time, for instance. This underscores both the brittleness and, to a lesser extent, the non-determinism of machine learning algorithms, two of the broader challenges in developing AI systems for safety-critical applications.

Physical test execution at GoMentum Station. Pedestrian scenario along Pearl St. in “Urban Zone”.

Importantly, these challenges face industry and independent assessors alike. Current test protocols for agencies like EuroNCAP are very rigid. Vehicle speed, a primary test parameter in an AEB VRU test protocol for instance, varies along a step function with a range from ~20-60 kph and increments of 5 kph. While perhaps suitable for L2 systems where drivers remain the primary line of defense, this approach clearly contradicts the parameter sensitivities exhibited above. If independent assessors hope to draw meaningful conclusions from the results of physical test exercises, these exercises will need to be highly contrived. That is, not only will the scenarios need to be chosen according to a particular operational design domain (ODD), but the parameters used to concretize physical tests should in fact be assigned by inquiry, perhaps even VUT-specific inquiry, instead of by top-down, industry-wide mandate. This could necessitate an assessment framework built around coverage and safety case development rather than test-by-test scoring and comparison, an approach encouraged by the UL 4600 standard and by groups like Foretellix.
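To make the mismatch between a fixed step function and narrow parameter sensitivities concrete, here is a small Python sketch. It is illustrative only: the helper names and the example failure window are hypothetical, and the default range simply reuses the approximate 20-60 kph, 5 kph figures quoted above rather than any official protocol definition.

```python
from typing import List


def step_speeds_kph(low: int = 20, high: int = 60, step: int = 5) -> List[int]:
    """Enumerate the test speeds of a fixed step-function protocol."""
    return list(range(low, high + 1, step))


def grid_samples_window(grid: List[int], window_low: float, window_high: float) -> bool:
    """Return True if at least one grid speed falls inside the given window."""
    return any(window_low <= speed <= window_high for speed in grid)


if __name__ == "__main__":
    grid = step_speeds_kph()
    # ASSUMPTION: a made-up failure window narrower than the 5 kph step.
    print(grid_samples_window(grid, 41.0, 44.0))
```

With the hypothetical 41-44 kph window, none of the nine grid speeds lands inside it, so a protocol fixed to that grid would never exercise the failure. Parameters assigned by VUT-specific inquiry, for example from virtual results, could target such a window directly.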

Many open questions remain in examining HAV test efficiency and overall HAV validation processes. We look forward to using the insights above in 2020 to delve deeper into research and to continue forging relationships with industry. We will be engaging in subsequent projects within the SAE IAMTS coalition to further explore both the toolchain and process used for correlating virtual and physical results.

This work also generated a number of outputs that we look forward to sharing with the industry. All videos from our track testing at GoMentum are available here, with recorded simulation runs here. The underlying datasets — both from ground truth at the track and outputs of the simulation — are also being used for further analysis and safety metric development.
