Getting a Distinct Count in Solr

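This article looks at how to count the total number of deduplicated records across several tables in Solr. Because of a limit in Solr's Facet feature, it introduces Solr Count Distinct and the unique facet function for counting distinct values, and explains how the count is estimated when a shard holds more than 100 unique values.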


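Today I ran into a problem: when the data was first indexed into Solr, no record counts were computed, and now I need the total number of distinct values across several tables.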

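Solr's ISearch has no Distinct feature. One idea was to use Solr faceting as a GROUP BY, but since a facet returns only 100 entries (by default) and the data certainly has more than 100 groups, I dropped that approach.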

Later I found Solr Count Distinct online. It is something Solr has already published (under Solr Search Requests), and it offers this kind of functionality:

A 100% accurate count of distinct values (count distinct) is not generally possible without actually observing all of the values together. However there are a number of ways to estimate the count. 
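To illustrate why the values have to be observed together, here is a minimal Python sketch. The two shard contents are made-up values for illustration only, not data from the original post. Summing the per-shard distinct counts overcounts whenever the shards share values.

```python
# Hypothetical distinct manufacturer values held by two shards (illustrative only).
shard_a = {"apple", "dell", "asus"}
shard_b = {"dell", "asus", "canon", "maxtor"}

# Naive approach: add up the per-shard distinct counts.
naive_count = len(shard_a) + len(shard_b)

# Exact approach: merge all values first, then count.
exact_count = len(shard_a | shard_b)

print(naive_count, exact_count)  # 7 5
```

The naive sum counts "dell" and "asus" twice, which is why a distributed count needs either all values or an estimate.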

“unique” Facet Function 

The unique facet function is Solr’s fastest implementation to calculate the number of distinct values. 

It always provides exact counts on a single Solr node. For distributed search over multiple nodes, it provides exact counts when the number of values per node does not exceed 100 (by default). 

When the number of unique values does exceed 100 in any given shard, the following algorithm is used: 

It estimates the count by sending the top 100 results from each shard along with the total exact “unique” count for each shard. 

totalSeen is the number of actual results we saw from all shards (i.e. not deduped yet).

uniqueSeen is the number of unique values we saw from all shards (i.e. deduped).

notSeen is the number of unique values from each shard that were not sent (because of the 100 cutoff).

factor = uniqueSeen / totalSeen (i.e. what fraction of values that we saw were unique)

estimate = uniqueSeen + ( notSeen * factor ) (i.e. we simply apply the factor to the number of values we didn’t see)

Example use:

$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
  x : "unique(manu_exact)"  // manu_exact is the manufacturer indexed as a single string
}'

For more facet functions, adding facet functions to each facet bucket, or sorting by facet function, see Solr Facet Functions 
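The distributed estimate quoted above (totalSeen, uniqueSeen, notSeen, factor) can be sketched in Python as follows. This is only an illustration of the formula, not Solr's actual implementation: the function names are hypothetical, and the way each shard picks which 100 values to send (sorted order here) is an assumption.

```python
from typing import Iterable, List, Tuple

CUTOFF = 100  # per-shard cutoff quoted in the text (the default)


def shard_summary(values: Iterable[int], cutoff: int = CUTOFF) -> Tuple[List[int], int]:
    """Return (values sent to the coordinator, exact unique count on this shard)."""
    distinct = sorted(set(values))
    # Assumption: the shard sends its first `cutoff` distinct values in sorted order.
    return distinct[:cutoff], len(distinct)


def estimate_unique(shards: Iterable[Iterable[int]], cutoff: int = CUTOFF) -> float:
    """Apply the quoted formula: estimate = uniqueSeen + notSeen * factor."""
    summaries = [shard_summary(values, cutoff) for values in shards]

    # totalSeen: values received from all shards, not deduplicated yet.
    total_seen = sum(len(sent) for sent, _ in summaries)
    # uniqueSeen: the same values after deduplication across shards.
    unique_seen = len(set().union(*(sent for sent, _ in summaries)))
    # notSeen: unique values each shard had but did not send because of the cutoff.
    not_seen = sum(exact - len(sent) for sent, exact in summaries)

    if total_seen == 0:
        return 0.0
    factor = unique_seen / total_seen
    return unique_seen + not_seen * factor


# Below the cutoff nothing is withheld, so the result is exact.
print(estimate_unique([range(50), range(25, 75)]))  # 75.0
```

When every shard stays under the cutoff, notSeen is 0 and the estimate equals the exact deduplicated count; once a shard exceeds it, the result depends on how representative the values that were sent are.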

Aggregation Functions 

Faceting involves breaking up the domain into multiple buckets and providing information about each bucket. 

There are multiple aggregation functions / statistics that can be used: 

 


Here is an example I wrote:

curl 'http://192.168.1.1:8080/solr/xxshard/query?q=*:*' -d '
json.facet={
  x : "unique(RB040002)"
}'
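The same request can also be sent from a program. Below is a minimal Python sketch using only the standard library. The helper names (build_unique_request, parse_unique_count, fetch_unique_count) are hypothetical, and the host, core, and field are the ones from the curl example above. It assumes the response carries the result under "facets", keyed by the label used in json.facet.

```python
import json
import urllib.parse
import urllib.request


def build_unique_request(base_url: str, core: str, field: str, q: str = "*:*"):
    """Build the URL and form-encoded body for a unique() facet request."""
    url = f"{base_url}/{core}/query"
    facet = {"x": f"unique({field})"}
    body = urllib.parse.urlencode({"q": q, "json.facet": json.dumps(facet)})
    return url, body.encode("utf-8")


def parse_unique_count(response_text: str, key: str = "x") -> int:
    """Read the distinct count from the "facets" section of the response."""
    facets = json.loads(response_text).get("facets", {})
    # Assumption: the key may be absent when no documents match, so default to 0.
    return facets.get(key, 0)


def fetch_unique_count(base_url: str, core: str, field: str) -> int:
    """Send the request to a running Solr instance and return the count."""
    url, body = build_unique_request(base_url, core, field)
    with urllib.request.urlopen(url, data=body, timeout=10) as resp:
        return parse_unique_count(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a reachable Solr instance; the address is the one from the example.
    print(fetch_unique_count("http://192.168.1.1:8080/solr", "xxshard", "RB040002"))
```

Keeping request building and response parsing apart from the network call makes both easy to check without a running Solr server.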



Reposted from: https://juejin.im/post/5d05b47cf265da1b8b2b5cbc
