https://github.com/google-research-datasets/granola-eq

Even when answers at different levels of granularity are present, there is no sense in which matching a more specific answer is "better". As a result, standard QA evaluation may significantly underestimate the knowledge encapsulated in LMs, a phenomenon which we refer to as the knowledge evaluation gap. Indeed, recent human evaluation suggests that such granularity disparities account for approximately 10-15% of the disagreements between lexical matching and human evaluation.
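To make the gap concrete, below is a minimal sketch of granularity-aware answer matching in Python. The helper names and the two-level gold list are illustrative assumptions, not part of the released dataset code; the normalization follows standard exact-match (EM) conventions.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, as in standard EM scoring."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def multi_granularity_match(prediction: str, gold_answers: list[str]) -> bool:
    """Credit a prediction that matches the gold answer at *any* granularity."""
    return any(exact_match(prediction, ans) for ans in gold_answers)


# Standard EM misses a coarser-but-correct answer; granularity-aware
# matching does not -- this discrepancy is the knowledge evaluation gap.
gold = ["Tel Aviv", "Israel"]                     # most to least specific
print(exact_match("Israel", gold[0]))             # False
print(multi_granularity_match("Israel", gold))    # True
```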

Given a gold answer, we generate coarser versions of it using an external knowledge graph (KG).
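The following is a minimal sketch of this coarsening step, assuming a toy in-memory KG and hand-picked generalizing relations (`located_in`, `part_of`); these names are illustrative stand-ins for querying an external KG such as Wikidata, not a reproduction of the actual pipeline.

```python
# Each edge maps an entity to a coarser entity via a generalizing relation.
KG_EDGES = {
    ("Tel Aviv", "located_in"): "Israel",
    ("Israel", "part_of"): "Middle East",
}
GENERALIZING_RELATIONS = ("located_in", "part_of")


def coarser_answers(entity: str) -> list[str]:
    """Walk generalizing relations to collect coarser versions of an answer."""
    coarser: list[str] = []
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for relation in GENERALIZING_RELATIONS:
            target = KG_EDGES.get((current, relation))
            if target is not None and target not in coarser:
                coarser.append(target)
                frontier.append(target)
    return coarser


print(coarser_answers("Tel Aviv"))  # ['Israel', 'Middle East']
```

Chaining the original answer with its coarser versions yields the ordered multi-granularity answer list consumed by the matching sketch above.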