31 Commits

Author SHA1 Message Date
honeymoose 49c3930deb 删除不需要的文件 2021-08-03 13:41:42 -04:00
honeymoose bf74617057 修改错别字 2021-08-02 15:57:21 -04:00
honeymoose 72f026cac7 Merge pull request #30 from cwiki-us-docs/feature/tutorial-update-data
完成 数据更新 页面的翻译
2021-08-02 15:21:25 -04:00
honeymoose c171635482 完成 追加数据 内容的翻译 2021-08-02 15:20:14 -04:00
honeymoose 3c5404eea7 完成 将新数据和老数据合并后进行覆盖 中的内容 2021-08-02 15:12:36 -04:00
honeymoose 67f5094fac 对数据更新的内容进行初始化,并且创建翻译的分支 2021-08-02 14:50:33 -04:00
honeymoose b661d141ef Merge pull request #29 from cwiki-us-docs/feature/tutorial-retention.md
Feature/tutorial retention.md
2021-08-02 14:47:44 -04:00
honeymoose 1926fc1783 完成对数据保留时间的中文内容进行翻译 2021-08-02 14:45:58 -04:00
honeymoose 608e5746ff 对数据加载规则进行说明 2021-08-02 14:43:23 -04:00
honeymoose 5508a1e203 处理翻译不完全的使用 Druid 进行查询的内容 2021-08-02 14:08:23 -04:00
honeymoose 9cea283a77 tutorial-retention 载入示例数据 2021-08-01 08:21:10 -04:00
honeymoose e730545948 tutorial-retention 分支对数据保留规则进行翻译 2021-08-01 08:07:13 -04:00
honeymoose fec5fb7511 Merge pull request #28 from cwiki-us-docs/feature/roll-up
提交到 Roll-up 查询汇总数据并且进行解释说明
2021-08-01 08:04:08 -04:00
honeymoose d512d84286 提交到 Roll-up 查询汇总数据并且进行解释说明 2021-08-01 08:03:13 -04:00
honeymoose 9f13516241 Merge pull request #27 from cwiki-us-docs/feature/roll-up
roll up 载入示例数据
2021-07-31 18:04:03 -04:00
honeymoose 08cc99a7a5 提交到 Roll-up 载入示例数据 2021-07-31 18:03:20 -04:00
honeymoose 5a5aa9e1c9 提交到 Roll-up 示例数据 内容翻译 2021-07-31 18:02:00 -04:00
honeymoose d994362f95 Merge remote-tracking branch 'origin/feature/roll-up' into feature/roll-up
# Conflicts:
#	tutorials/tutorial-rollup.md
2021-07-31 18:00:47 -04:00
honeymoose 683145ef58 提交到 Roll-up 示例数据 内容翻译 2021-07-31 18:00:00 -04:00
honeymoose c41b4795d2 提交到 Roll-up 分支 2021-07-31 17:52:10 -04:00
honeymoose 9db8616086 添加查询需要的文档 2021-07-31 17:52:10 -04:00
honeymoose 6318c33333 准备对查询方面的文档进行处理 2021-07-31 17:52:10 -04:00
honeymoose 3e6d21241e 准备 Roll-up 部分的翻译 2021-07-31 17:52:10 -04:00
honeymoose 222f234eb1 提交到 Roll-up 分支 2021-07-31 17:52:01 -04:00
honeymoose d6d5fbfa40 Merge remote-tracking branch 'origin/feature/querying' into feature/querying 2021-07-31 17:44:39 -04:00
honeymoose dcca822ecb 添加查询需要的文档 2021-07-31 17:44:26 -04:00
honeymoose a125246e3c 准备对查询方面的文档进行处理 2021-07-31 17:44:26 -04:00
honeymoose f5cfe18c9c Merge pull request #26 from cwiki-us-docs/feature/getting_started
查询文档完成
2021-07-31 17:39:43 -04:00
honeymoose 1b6300a7a6 Merge pull request #25 from cwiki-us-docs/feature/getting_started
对查询部分的内容进行翻译
2021-07-31 11:24:49 -04:00
honeymoose 8cb62cd32b 添加查询需要的文档 2021-07-29 12:47:06 -04:00
honeymoose 039822b649 准备对查询方面的文档进行处理 2021-07-29 12:46:48 -04:00
14 changed files with 2385 additions and 619 deletions
+12
View File
@@ -34,7 +34,19 @@
- 摄取(Ingestion
- [面试问题和经验](interview/index.md)
- [算法题](algorithm/index.md)
- 查询(Querying
- [Druid SQL](querying/sql.md)
- [原生查询](querying/querying.md)
- [查询执行](querying/query-execution.md)
- 概念
- [数据源](querying/datasource.md)
- [连接(joins](querying/joins.md)
- 原生查询类型
- [Timeseries 查询](querying/timeseriesquery.md)
- [TopN 查询](querying/topnquery.md)
- [GroupBy 查询](querying/groupbyquery.md)
- 开发(Development
- [在 Druid 中进行开发](development/index.md)
+16 -58
View File
@@ -1,5 +1,6 @@
## 独立服务器部署
Druid includes a set of reference configurations and launch scripts for single-machine deployments:
Druid 包含有一组可用的参考配置和用于单机部署的启动脚本:
- `nano-quickstart`
- `micro-quickstart`
@@ -8,7 +9,7 @@ Druid includes a set of reference configurations and launch scripts for single-m
- `large`
- `xlarge`
The `micro-quickstart` is sized for small machines like laptops and is intended for quick evaluation use-cases.
`micro-quickstart` 适合于笔记本电脑等小型计算机,主要用于能够快速评估 Druid 的使用场景。
The `nano-quickstart` is an even smaller configuration, targeting a machine with 1 CPU and 4GiB memory. It is meant for limited evaluations in resource constrained environments, such as small Docker containers.
@@ -20,50 +21,39 @@ The example configurations run the Druid Coordinator and Overlord together in a
While example configurations are provided for very large single machines, at higher scales we recommend running Druid in a [clustered deployment](../tutorials/cluster.md), for fault-tolerance and reduced resource contention.
## Single server reference configurations
### Nano-Quickstart: 1 CPU, 4GiB RAM
- Launch command: `bin/start-nano-quickstart`
- Configuration directory: `conf/druid/single-server/nano-quickstart`
- 启动命令: `bin/start-nano-quickstart`
- 配置目录: `conf/druid/single-server/nano-quickstart`
### Micro-Quickstart: 4 CPU, 16GiB RAM
- Launch command: `bin/start-micro-quickstart`
- Configuration directory: `conf/druid/single-server/micro-quickstart`
- 启动命令: `bin/start-micro-quickstart`
- 配置目录: `conf/druid/single-server/micro-quickstart`
### Small: 8 CPU, 64GiB RAM (~i3.2xlarge)
- Launch command: `bin/start-small`
- Configuration directory: `conf/druid/single-server/small`
- 启动命令: `bin/start-small`
- 配置目录: `conf/druid/single-server/small`
### Medium: 16 CPU, 128GiB RAM (~i3.4xlarge)
- Launch command: `bin/start-medium`
- Configuration directory: `conf/druid/single-server/medium`
- 启动命令: `bin/start-medium`
- 配置目录: `conf/druid/single-server/medium`
### Large: 32 CPU, 256GiB RAM (~i3.8xlarge)
- Launch command: `bin/start-large`
- Configuration directory: `conf/druid/single-server/large`
- 启动命令: `bin/start-large`
- 配置目录: `conf/druid/single-server/large`
### X-Large: 64 CPU, 512GiB RAM (~i3.16xlarge)
- Launch command: `bin/start-xlarge`
- Configuration directory: `conf/druid/single-server/xlarge`
- 启动命令: `bin/start-xlarge`
- 配置目录: `conf/druid/single-server/xlarge`
### 单服务器部署
Druid包括一组参考配置和用于单机部署的启动脚本:
* `nano-quickstart`
* `micro-quickstart`
* `small`
* `medium`
* `large`
* `large`
* `xlarge`
`micro-quickstart`适合于笔记本电脑等小型机器,旨在用于快速评估测试使用场景。
@@ -76,35 +66,3 @@ Druid包括一组参考配置和用于单机部署的启动脚本:
通过[Coordinator配置文档](../../Configuration/configuration.md#Coordinator)中描述的可选配置`druid.coordinator.asOverlord.enabled = true`可以在单个进程中同时运行Druid Coordinator和Overlord。
虽然为大型单台计算机提供了示例配置,但在更高规模下,我们建议在集群部署中运行Druid,以实现容错和减少资源争用。
#### 单服务器参考配置
##### Nano-Quickstart: 1 CPU, 4GB 内存
* 启动命令: `bin/start-nano-quickstart`
* 配置目录: `conf/druid/single-server/nano-quickstart`
##### Micro-Quickstart: 4 CPU, 16GB 内存
* 启动命令: `bin/start-micro-quickstart`
* 配置目录: `conf/druid/single-server/micro-quickstart`
##### Small: 8 CPU, 64GB 内存 (~i3.2xlarge)
* 启动命令: `bin/start-small`
* 配置目录: `conf/druid/single-server/small`
##### Medium: 16 CPU, 128GB 内存 (~i3.4xlarge)
* 启动命令: `bin/start-medium`
* 配置目录: `conf/druid/single-server/medium`
##### Large: 32 CPU, 256GB 内存 (~i3.8xlarge)
* 启动命令: `bin/start-large`
* 配置目录: `conf/druid/single-server/large`
##### X-Large: 64 CPU, 512GB 内存 (~i3.16xlarge)
* 启动命令: `bin/start-xlarge`
* 配置目录: `conf/druid/single-server/xlarge`
+449
View File
@@ -0,0 +1,449 @@
# GroupBy 查询
> Apache Druid supports two query languages: [Druid SQL](sql.md) and [native queries](querying.md).
> This document describes a query
> type in the native language. For information about when Druid SQL will use this query type, refer to the
> [SQL documentation](sql.md#query-types).
These types of Apache Druid queries take a groupBy query object and return an array of JSON objects where each object represents a
grouping asked for by the query.
> Note: If you are doing aggregations with time as your only grouping, or an ordered groupBy over a single dimension,
> consider [Timeseries](timeseriesquery.md) and [TopN](topnquery.md) queries as well as
> groupBy. Their performance may be better in some cases. See [Alternatives](#alternatives) below for more details.
一个分组查询(groupBy query)对象的查询脚本如下示例:
``` json
{
"queryType": "groupBy",
"dataSource": "sample_datasource",
"granularity": "day",
"dimensions": ["country", "device"],
"limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
"filter": {
"type": "and",
"fields": [
{ "type": "selector", "dimension": "carrier", "value": "AT&T" },
{ "type": "or",
"fields": [
{ "type": "selector", "dimension": "make", "value": "Apple" },
{ "type": "selector", "dimension": "make", "value": "Samsung" }
]
}
]
},
"aggregations": [
{ "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
{ "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
],
"postAggregations": [
{ "type": "arithmetic",
"name": "avg_usage",
"fn": "/",
"fields": [
{ "type": "fieldAccess", "fieldName": "data_transfer" },
{ "type": "fieldAccess", "fieldName": "total_usage" }
]
}
],
"intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
"having": {
"type": "greaterThan",
"aggregation": "total_usage",
"value": 100
}
}
```
Following are main parts to a groupBy query:
|property|description|required?|
|--------|-----------|---------|
|queryType|This String should always be "groupBy"; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String or Object defining the data source to query, very similar to a table in a relational database. See [DataSource](../querying/datasource.md) for more information.|yes|
|dimensions|A JSON list of dimensions to do the groupBy over; or see [DimensionSpec](../querying/dimensionspecs.md) for ways to extract dimensions. |yes|
|limitSpec|See [LimitSpec](../querying/limitspec.md).|no|
|having|See [Having](../querying/having.md).|no|
|granularity|Defines the granularity of the query. See [Granularities](../querying/granularities.md)|yes|
|filter|See [Filters](../querying/filters.md)|no|
|aggregations|See [Aggregations](../querying/aggregations.md)|no|
|postAggregations|See [Post Aggregations](../querying/post-aggregations.md)|no|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|subtotalsSpec| A JSON array of arrays to return additional result sets for groupings of subsets of top level `dimensions`. It is [described later](groupbyquery.md#more-on-subtotalsspec) in more detail.|no|
|context|An additional JSON Object which can be used to specify certain flags.|no|
To pull it all together, the above query would return *n\*m* data points, up to a maximum of 5000 points, where n is the cardinality of the `country` dimension, m is the cardinality of the `device` dimension, each day between 2012-01-01 and 2012-01-03, from the `sample_datasource` table. Each data point contains the (long) sum of `total_usage` if the value of the data point is greater than 100, the (double) sum of `data_transfer` and the (double) result of `total_usage` divided by `data_transfer` for the filter set for a particular grouping of `country` and `device`. The output looks like this:
```json
[
{
"version" : "v1",
"timestamp" : "2012-01-01T00:00:00.000Z",
"event" : {
"country" : <some_dim_value_one>,
"device" : <some_dim_value_two>,
"total_usage" : <some_value_one>,
"data_transfer" :<some_value_two>,
"avg_usage" : <some_avg_usage_value>
}
},
{
"version" : "v1",
"timestamp" : "2012-01-01T00:00:12.000Z",
"event" : {
"dim1" : <some_other_dim_value_one>,
"dim2" : <some_other_dim_value_two>,
"sample_name1" : <some_other_value_one>,
"sample_name2" :<some_other_value_two>,
"avg_usage" : <some_other_avg_usage_value>
}
},
...
]
```
## Behavior on multi-value dimensions
groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
from matching rows will be used to generate one group per value. It's possible for a query to return more groups than
there are rows. For example, a groupBy on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and
generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
improve performance.
See [Multi-value dimensions](multi-value-dimensions.md) for more details.
## More on subtotalsSpec
The subtotals feature allows computation of multiple sub-groupings in a single query. To use this feature, add a "subtotalsSpec" to your query as a list of subgroup dimension sets. It should contain the `outputName` from dimensions in your `dimensions` attribute, in the same order as they appear in the `dimensions` attribute (although, of course, you may skip some).
For example, consider a groupBy query like this one:
```json
{
"type": "groupBy",
...
...
"dimensions": [
{
"type" : "default",
"dimension" : "d1col",
"outputName": "D1"
},
{
"type" : "extraction",
"dimension" : "d2col",
"outputName" : "D2",
"extractionFn" : extraction_func
},
{
"type":"lookup",
"dimension":"d3col",
"outputName":"D3",
"name":"my_lookup"
}
],
...
...
"subtotalsSpec":[ ["D1", "D2", D3"], ["D1", "D3"], ["D3"]],
..
}
```
The result of the subtotalsSpec would be equivalent to concatenating the result of three groupBy queries, with the "dimensions" field being `["D1", "D2", D3"]`, `["D1", "D3"]` and `["D3"]`, given the `DimensionSpec` shown above.
The response for the query above would look something like:
```json
[
{
"version" : "v1",
"timestamp" : "t1",
"event" : { "D1": "..", "D2": "..", "D3": ".." }
}
},
{
"version" : "v1",
"timestamp" : "t2",
"event" : { "D1": "..", "D2": "..", "D3": ".." }
}
},
...
...
{
"version" : "v1",
"timestamp" : "t1",
"event" : { "D1": "..", "D2": null, "D3": ".." }
}
},
{
"version" : "v1",
"timestamp" : "t2",
"event" : { "D1": "..", "D2": null, "D3": ".." }
}
},
...
...
{
"version" : "v1",
"timestamp" : "t1",
"event" : { "D1": null, "D2": null, "D3": ".." }
}
},
{
"version" : "v1",
"timestamp" : "t2",
"event" : { "D1": null, "D2": null, "D3": ".." }
}
},
...
]
```
> Notice that dimensions that are not included in an individual subtotalsSpec grouping are returned with a `null` value. This response format represents a behavior change as of Apache Druid 0.18.0.
> In release 0.17.0 and earlier, such dimensions were entirely excluded from the result. If you were relying on this old behavior to determine whether a particular dimension was not part of
> a subtotal grouping, you can now use [Grouping aggregator](aggregations.md#grouping-aggregator) instead.
## Implementation details
### Strategies
GroupBy queries can be executed using two different strategies. The default strategy for a cluster is determined by the
"druid.query.groupBy.defaultStrategy" runtime property on the Broker. This can be overridden using "groupByStrategy" in
the query context. If neither the context field nor the property is set, the "v2" strategy will be used.
- "v2", the default, is designed to offer better performance and memory management. This strategy generates
per-segment results using a fully off-heap map. Data processes merge the per-segment results using a fully off-heap
concurrent facts map combined with an on-heap string dictionary. This may optionally involve spilling to disk. Data
processes return sorted results to the Broker, which merges result streams using an N-way merge. The broker materializes
the results if necessary (e.g. if the query sorts on columns other than its dimensions). Otherwise, it streams results
back as they are merged.
- "v1", a legacy engine, generates per-segment results on data processes (Historical, realtime, MiddleManager) using a map which
is partially on-heap (dimension keys and the map itself) and partially off-heap (the aggregated values). Data processes then
merge the per-segment results using Druid's indexing mechanism. This merging is multi-threaded by default, but can
optionally be single-threaded. The Broker merges the final result set using Druid's indexing mechanism again. The broker
merging is always single-threaded. Because the Broker merges results using the indexing mechanism, it must materialize
the full result set before returning any results. On both the data processes and the Broker, the merging index is fully
on-heap by default, but it can optionally store aggregated values off-heap.
### Differences between v1 and v2
Query API and results are compatible between the two engines; however, there are some differences from a cluster
configuration perspective:
- groupBy v1 controls resource usage using a row-based limit (maxResults) whereas groupBy v2 uses bytes-based limits.
In addition, groupBy v1 merges results on-heap, whereas groupBy v2 merges results off-heap. These factors mean that
memory tuning and resource limits behave differently between v1 and v2. In particular, due to this, some queries
that can complete successfully in one engine may exceed resource limits and fail with the other engine. See the
"Memory tuning and resource limits" section for more details.
- groupBy v1 imposes no limit on the number of concurrently running queries, whereas groupBy v2 controls memory usage
by using a finite-sized merge buffer pool. By default, the number of merge buffers is 1/4 the number of processing
threads. You can adjust this as necessary to balance concurrency and memory usage.
- groupBy v1 supports caching on either the Broker or Historical processes, whereas groupBy v2 only supports caching on
Historical processes.
- groupBy v2 supports both array-based aggregation and hash-based aggregation. The array-based aggregation is used only
when the grouping key is a single indexed string column. In array-based aggregation, the dictionary-encoded value is used
as the index, so the aggregated values in the array can be accessed directly without finding buckets based on hashing.
### Memory tuning and resource limits
When using groupBy v2, three parameters control resource usage and limits:
- `druid.processing.buffer.sizeBytes`: size of the off-heap hash table used for aggregation, per query, in bytes. At
most `druid.processing.numMergeBuffers` of these will be created at once, which also serves as an upper limit on the
number of concurrently running groupBy queries.
- `druid.query.groupBy.maxMergingDictionarySize`: size of the on-heap dictionary used when grouping on strings, per query,
in bytes. Note that this is based on a rough estimate of the dictionary size, not the actual size.
- `druid.query.groupBy.maxOnDiskStorage`: amount of space on disk used for aggregation, per query, in bytes. By default,
this is 0, which means aggregation will not use disk.
If `maxOnDiskStorage` is 0 (the default) then a query that exceeds either the on-heap dictionary limit, or the off-heap
aggregation table limit, will fail with a "Resource limit exceeded" error describing the limit that was exceeded.
If `maxOnDiskStorage` is greater than 0, queries that exceed the in-memory limits will start using disk for aggregation.
In this case, when either the on-heap dictionary or off-heap hash table fills up, partially aggregated records will be
sorted and flushed to disk. Then, both in-memory structures will be cleared out for further aggregation. Queries that
then go on to exceed `maxOnDiskStorage` will fail with a "Resource limit exceeded" error indicating that they ran out of
disk space.
With groupBy v2, cluster operators should make sure that the off-heap hash tables and on-heap merging dictionaries
will not exceed available memory for the maximum possible concurrent query load (given by
`druid.processing.numMergeBuffers`). See the [basic cluster tuning guide](../operations/basic-cluster-tuning.md)
for more details about direct memory usage, organized by Druid process type.
Brokers do not need merge buffers for basic groupBy queries. Queries with subqueries (using a `query` dataSource) require one merge buffer if there is a single subquery, or two merge buffers if there is more than one layer of nested subqueries. Queries with [subtotals](groupbyquery.md#more-on-subtotalsspec) need one merge buffer. These can stack on top of each other: a groupBy query with multiple layers of nested subqueries, and that also uses subtotals, will need three merge buffers.
Historicals and ingestion tasks need one merge buffer for each groupBy query, unless [parallel combination](groupbyquery.md#parallel-combine) is enabled, in which case they need two merge buffers per query.
When using groupBy v1, all aggregation is done on-heap, and resource limits are done through the parameter
`druid.query.groupBy.maxResults`. This is a cap on the maximum number of results in a result set. Queries that exceed
this limit will fail with a "Resource limit exceeded" error indicating they exceeded their row limit. Cluster
operators should make sure that the on-heap aggregations will not exceed available JVM heap space for the expected
concurrent query load.
### Performance tuning for groupBy v2
#### Limit pushdown optimization
Druid pushes down the `limit` spec in groupBy queries to the segments on Historicals wherever possible to early prune unnecessary intermediate results and minimize the amount of data transferred to Brokers. By default, this technique is applied only when all fields in the `orderBy` spec is a subset of the grouping keys. This is because the `limitPushDown` doesn't guarantee the exact results if the `orderBy` spec includes any fields that are not in the grouping keys. However, you can enable this technique even in such cases if you can sacrifice some accuracy for fast query processing like in topN queries. See `forceLimitPushDown` in [advanced groupBy v2 configurations](#groupby-v2-configurations).
#### Optimizing hash table
The groupBy v2 engine uses an open addressing hash table for aggregation. The hash table is initialized with a given initial bucket number and gradually grows on buffer full. On hash collisions, the linear probing technique is used.
The default number of initial buckets is 1024 and the default max load factor of the hash table is 0.7. If you can see too many collisions in the hash table, you can adjust these numbers. See `bufferGrouperInitialBuckets` and `bufferGrouperMaxLoadFactor` in [Advanced groupBy v2 configurations](#groupby-v2-configurations).
#### Parallel combine
Once a Historical finishes aggregation using the hash table, it sorts the aggregated results and merges them before sending to the
Broker for N-way merge aggregation in the broker. By default, Historicals use all their available processing threads
(configured by `druid.processing.numThreads`) for aggregation, but use a single thread for sorting and merging
aggregates which is an http thread to send data to Brokers.
This is to prevent some heavy groupBy queries from blocking other queries. In Druid, the processing threads are shared
between all submitted queries and they are _not interruptible_. It means, if a heavy query takes all available
processing threads, all other queries might be blocked until the heavy query is finished. GroupBy queries usually take
longer time than timeseries or topN queries, they should release processing threads as soon as possible.
However, you might care about the performance of some really heavy groupBy queries. Usually, the performance bottleneck
of heavy groupBy queries is merging sorted aggregates. In such cases, you can use processing threads for it as well.
This is called _parallel combine_. To enable parallel combine, see `numParallelCombineThreads` in
[Advanced groupBy v2 configurations](#groupby-v2-configurations). Note that parallel combine can be enabled only when
data is actually spilled (see [Memory tuning and resource limits](#memory-tuning-and-resource-limits)).
Once parallel combine is enabled, the groupBy v2 engine can create a combining tree for merging sorted aggregates. Each
intermediate node of the tree is a thread merging aggregates from the child nodes. The leaf node threads read and merge
aggregates from hash tables including spilled ones. Usually, leaf processes are slower than intermediate nodes because they
need to read data from disk. As a result, less threads are used for intermediate nodes by default. You can change the
degree of intermediate nodes. See `intermediateCombineDegree` in [Advanced groupBy v2 configurations](#groupby-v2-configurations).
Please note that each Historical needs two merge buffers to process a groupBy v2 query with parallel combine: one for
computing intermediate aggregates from each segment and another for combining intermediate aggregates in parallel.
### Alternatives
There are some situations where other query types may be a better choice than groupBy.
- For queries with no "dimensions" (i.e. grouping by time only) the [Timeseries query](timeseriesquery.md) will
generally be faster than groupBy. The major differences are that it is implemented in a fully streaming manner (taking
advantage of the fact that segments are already sorted on time) and does not need to use a hash table for merging.
- For queries with a single "dimensions" element (i.e. grouping by one string dimension), the [TopN query](topnquery.md)
will sometimes be faster than groupBy. This is especially true if you are ordering by a metric and find approximate
results acceptable.
### Nested groupBys
Nested groupBys (dataSource of type "query") are performed differently for "v1" and "v2". The Broker first runs the
inner groupBy query in the usual way. "v1" strategy then materializes the inner query's results on-heap with Druid's
indexing mechanism, and runs the outer query on these materialized results. "v2" strategy runs the outer query on the
inner query's results stream with off-heap fact map and on-heap string dictionary that can spill to disk. Both
strategy perform the outer query on the Broker in a single-threaded fashion.
### Configurations
This section describes the configurations for groupBy queries. You can set the runtime properties in the `runtime.properties` file on Broker, Historical, and MiddleManager processes. You can set the query context parameters through the [query context](query-context.md).
#### Configurations for groupBy v2
Supported runtime properties:
|Property|Description|Default|
|--------|-----------|-------|
|`druid.query.groupBy.maxMergingDictionarySize`|Maximum amount of heap space (approximately) to use for the string dictionary during merging. When the dictionary exceeds this size, a spill to disk will be triggered.|100000000|
|`druid.query.groupBy.maxOnDiskStorage`|Maximum amount of disk space to use, per-query, for spilling result sets to disk when either the merging buffer or the dictionary fills up. Queries that exceed this limit will fail. Set to zero to disable disk spilling.|0 (disabled)|
Supported query contexts:
|Key|Description|
|---|-----------|
|`maxMergingDictionarySize`|Can be used to lower the value of `druid.query.groupBy.maxMergingDictionarySize` for this query.|
|`maxOnDiskStorage`|Can be used to lower the value of `druid.query.groupBy.maxOnDiskStorage` for this query.|
### Advanced configurations
#### Common configurations for all groupBy strategies
Supported runtime properties:
|Property|Description|Default|
|--------|-----------|-------|
|`druid.query.groupBy.defaultStrategy`|Default groupBy query strategy.|v2|
|`druid.query.groupBy.singleThreaded`|Merge results using a single thread.|false|
Supported query contexts:
|Key|Description|
|---|-----------|
|`groupByStrategy`|Overrides the value of `druid.query.groupBy.defaultStrategy` for this query.|
|`groupByIsSingleThreaded`|Overrides the value of `druid.query.groupBy.singleThreaded` for this query.|
#### GroupBy v2 configurations
Supported runtime properties:
|Property|Description|Default|
|--------|-----------|-------|
|`druid.query.groupBy.bufferGrouperInitialBuckets`|Initial number of buckets in the off-heap hash table used for grouping results. Set to 0 to use a reasonable default (1024).|0|
|`druid.query.groupBy.bufferGrouperMaxLoadFactor`|Maximum load factor of the off-heap hash table used for grouping results. When the load factor exceeds this size, the table will be grown or spilled to disk. Set to 0 to use a reasonable default (0.7).|0|
|`druid.query.groupBy.forceHashAggregation`|Force to use hash-based aggregation.|false|
|`druid.query.groupBy.intermediateCombineDegree`|Number of intermediate nodes combined together in the combining tree. Higher degrees will need less threads which might be helpful to improve the query performance by reducing the overhead of too many threads if the server has sufficiently powerful cpu cores.|8|
|`druid.query.groupBy.numParallelCombineThreads`|Hint for the number of parallel combining threads. This should be larger than 1 to turn on the parallel combining feature. The actual number of threads used for parallel combining is min(`druid.query.groupBy.numParallelCombineThreads`, `druid.processing.numThreads`).|1 (disabled)|
|`druid.query.groupBy.applyLimitPushDownToSegment`|If Broker pushes limit down to queryable data server (historicals, peons) then limit results during segment scan. If typically there are a large number of segments taking part in a query on a data server, this setting may counterintuitively reduce performance if enabled.|false (disabled)|
Supported query contexts:
|Key|Description|Default|
|---|-----------|-------|
|`bufferGrouperInitialBuckets`|Overrides the value of `druid.query.groupBy.bufferGrouperInitialBuckets` for this query.|None|
|`bufferGrouperMaxLoadFactor`|Overrides the value of `druid.query.groupBy.bufferGrouperMaxLoadFactor` for this query.|None|
|`forceHashAggregation`|Overrides the value of `druid.query.groupBy.forceHashAggregation`|None|
|`intermediateCombineDegree`|Overrides the value of `druid.query.groupBy.intermediateCombineDegree`|None|
|`numParallelCombineThreads`|Overrides the value of `druid.query.groupBy.numParallelCombineThreads`|None|
|`sortByDimsFirst`|Sort the results first by dimension values and then by timestamp.|false|
|`forceLimitPushDown`|When all fields in the orderby are part of the grouping key, the Broker will push limit application down to the Historical processes. When the sorting order uses fields that are not in the grouping key, applying this optimization can result in approximate results with unknown accuracy, so this optimization is disabled by default in that case. Enabling this context flag turns on limit push down for limit/orderbys that contain non-grouping key columns.|false|
|`applyLimitPushDownToSegment`|If Broker pushes limit down to queryable nodes (historicals, peons) then limit results during segment scan. This context value can be used to override `druid.query.groupBy.applyLimitPushDownToSegment`.|true|
#### GroupBy v1 configurations
Supported runtime properties:
|Property|Description|Default|
|--------|-----------|-------|
|`druid.query.groupBy.maxIntermediateRows`|Maximum number of intermediate rows for the per-segment grouping engine. This is a tuning parameter that does not impose a hard limit; rather, it potentially shifts merging work from the per-segment engine to the overall merging index. Queries that exceed this limit will not fail.|50000|
|`druid.query.groupBy.maxResults`|Maximum number of results. Queries that exceed this limit will fail.|500000|
Supported query contexts:
|Key|Description|Default|
|---|-----------|-------|
|`maxIntermediateRows`|Can be used to lower the value of `druid.query.groupBy.maxIntermediateRows` for this query.|None|
|`maxResults`|Can be used to lower the value of `druid.query.groupBy.maxResults` for this query.|None|
|`useOffheap`|Set to true to store aggregations off-heap when merging results.|false|
#### Array based result rows
Internally Druid always uses an array based representation of groupBy result rows, but by default this is translated
into a map based result format at the Broker. To reduce the overhead of this translation, results may also be returned
from the Broker directly in the array based format if `resultAsArray` is set to `true` on the query context.
Each row is positional, and has the following fields, in order:
* Timestamp (optional; only if granularity != ALL)
* Dimensions (in order)
* Aggregators (in order)
* Post-aggregators (optional; in order, if present)
This schema is not available on the response, so it must be computed from the issued query in order to properly read
the results.
+118
View File
@@ -0,0 +1,118 @@
---
id: query-execution
title: "Query execution"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
> This document describes how Druid executes [native queries](querying.md), but since [Druid SQL](sql.md) queries
> are translated to native queries, this document applies to the SQL runtime as well. Refer to the SQL
> [Query translation](sql.md#query-translation) page for information about how SQL queries are translated to native
> queries.
Druid's approach to query execution varies depending on the kind of [datasource](datasource.md) you are querying.
## Datasource type
### `table`
Queries that operate directly on [table datasources](datasource.md#table) are executed using a scatter-gather approach
led by the Broker process. The process looks like this:
1. The Broker identifies which [segments](../design/segments.md) are relevant to the query based on the `"intervals"`
parameter. Segments are always partitioned by time, so any segment whose interval overlaps the query interval is
potentially relevant.
2. The Broker may additionally further prune the segment list based on the `"filter"`, if the input data was partitioned
by range using the [`single_dim` partitionsSpec](../ingestion/native-batch.md#partitionsspec), and if the filter matches
the dimension used for partitioning.
3. The Broker, having pruned the list of segments for the query, forwards the query to data servers (like Historicals
and tasks running on MiddleManagers) that are currently serving those segments.
4. For all query types except [Scan](scan-query.md), data servers process each segment in parallel and generate partial
results for each segment. The specific processing that is done depends on the query type. These partial results may be
cached if [query caching](caching.md) is enabled. For Scan queries, segments are processed in order by a single thread,
and results are not cached.
5. The Broker receives partial results from each data server, merges them into the final result set, and returns them
to the caller. For Timeseries and Scan queries, and for GroupBy queries where there is no sorting, the Broker is able to
do this in a streaming fashion. Otherwise, the Broker fully computes the result set before returning anything.
### `lookup`
Queries that operate directly on [lookup datasources](datasource.md#lookup) (without a join) are executed on the Broker
that received the query, using its local copy of the lookup. All registered lookup tables are preloaded in-memory on the
Broker. The query runs single-threaded.
Execution of queries that use lookups as right-hand inputs to a join are executed in a way that depends on their
"base" (bottom-leftmost) datasource, as described in the [join](#join) section below.
### `union`
Queries that operate directly on [union datasources](datasource.md#union) are split up on the Broker into a separate
query for each table that is part of the union. Each of these queries runs separately, and the Broker merges their
results together.
### `inline`
Queries that operate directly on [inline datasources](datasource.md#inline) are executed on the Broker that received the
query. The query runs single-threaded.
Execution of queries that use inline datasources as right-hand inputs to a join are executed in a way that depends on
their "base" (bottom-leftmost) datasource, as described in the [join](#join) section below.
### `query`
[Query datasources](datasource.md#query) are subqueries. Each subquery is executed as if it was its own query and
the results are brought back to the Broker. Then, the Broker continues on with the rest of the query as if the subquery
was replaced with an inline datasource.
In most cases, subquery results are fully buffered in memory on the Broker before the rest of the query proceeds,
meaning subqueries execute sequentially. The total number of rows buffered across all subqueries of a given query
in this way cannot exceed the [`druid.server.http.maxSubqueryRows` property](../configuration/index.md).
There is one exception: if the outer query and all subqueries are the [groupBy](groupbyquery.md) type, then subquery
results can be processed in a streaming fashion and the `druid.server.http.maxSubqueryRows` limit does not apply.
### `join`
[Join datasources](datasource.md#join) are handled using a broadcast hash-join approach.
1. The Broker executes any subqueries that are inputs the join, as described in the [query](#query) section, and
replaces them with inline datasources.
2. The Broker flattens a join tree, if present, into a "base" datasource (the bottom-leftmost one) and other leaf
datasources (the rest).
3. Query execution proceeds using the same structure that the base datasource would use on its own. If the base
datasource is a [table](#table), segments are pruned based on `"intervals"` as usual, and the query is executed on the
cluster by forwarding it to all relevant data servers in parallel. If the base datasource is a [lookup](#lookup) or
[inline](#inline) datasource (including an inline datasource that was the result of inlining a subquery), the query is
executed on the Broker itself. The base query cannot be a union, because unions are not currently supported as inputs to
a join.
4. Before beginning to process the base datasource, the server(s) that will execute the query first inspect all the
non-base leaf datasources to determine if a new hash table needs to be built for the upcoming hash join. Currently,
lookups do not require new hash tables to be built (because they are preloaded), but inline datasources do.
5. Query execution proceeds again using the same structure that the base datasource would use on its own, with one
addition: while processing the base datasource, Druid servers will use the hash tables built from the other join inputs
to produce the join result row-by-row, and query engines will operate on the joined rows rather than the base rows.
+125
View File
@@ -0,0 +1,125 @@
# 原生查询
> Apache Druid supports two query languages: [Druid SQL](../querying/sql.md) and [native queries](../querying/querying.md).
> This document describes the
> native query language. For information about how Druid SQL chooses which native query types to use when
> it runs a SQL query, refer to the [SQL documentation](../querying/sql.md#query-types).
Native queries in Druid are JSON objects and are typically issued to the Broker or Router processes. Queries can be
posted like this:
```bash
curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -H 'Accept:application/json' -d @<query_json_file>
```
> Replace `<queryable_host>:<port>` with the appropriate address and port for your system. For example, if running the quickstart configuration, replace `<queryable_host>:<port>` with localhost:8888.
You can also enter them directly in the Druid console's Query view. Simply pasting a native query into the console switches the editor into JSON mode.
![Native query](../assets/native-queries-01.png "Native query")
Druid's native query language is JSON over HTTP, although many members of the community have contributed different
[client libraries](https://druid.apache.org/libraries.html) in other languages to query Druid.
The Content-Type/Accept Headers can also take 'application/x-jackson-smile'.
```bash
curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -H 'Accept:application/x-jackson-smile' -d @<query_json_file>
```
> If the Accept header is not provided, it defaults to the value of 'Content-Type' header.
Druid's native query is relatively low level, mapping closely to how computations are performed internally. Druid queries
are designed to be lightweight and complete very quickly. This means that for more complex analysis, or to build
more complex visualizations, multiple Druid queries may be required.
Even though queries are typically made to Brokers or Routers, they can also be accepted by
[Historical](../design/historical.md) processes and by [Peons (task JVMs)](../design/peons.md)) that are running
stream ingestion tasks. This may be valuable if you want to query results for specific segments that are served by
specific processes.
## Available queries
Druid has numerous query types for various use cases. Queries are composed of various JSON properties and Druid has different types of queries for different use cases. The documentation for the various query types describe all the JSON properties that can be set.
### Aggregation queries
* [Timeseries](../querying/timeseriesquery.md)
* [TopN](../querying/topnquery.md)
* [GroupBy](../querying/groupbyquery.md)
### Metadata queries
* [TimeBoundary](../querying/timeboundaryquery.md)
* [SegmentMetadata](../querying/segmentmetadataquery.md)
* [DatasourceMetadata](../querying/datasourcemetadataquery.md)
### Other queries
* [Scan](../querying/scan-query.md)
* [Search](../querying/searchquery.md)
## Which query type should I use?
For aggregation queries, if more than one would satisfy your needs, we generally recommend using Timeseries or TopN
whenever possible, as they are specifically optimized for their use cases. If neither is a good fit, you should use
the GroupBy query, which is the most flexible.
## Query cancellation
Queries can be cancelled explicitly using their unique identifier. If the
query identifier is set at the time of query, or is otherwise known, the following
endpoint can be used on the Broker or Router to cancel the query.
```sh
DELETE /druid/v2/{queryId}
```
For example, if the query ID is `abc123`, the query can be cancelled as follows:
```sh
curl -X DELETE "http://host:port/druid/v2/abc123"
```
## Query errors
### Authentication and authorization failures
For [secured](../design/auth.md) Druid clusters, query requests respond with an HTTP 401 response code in case of an authentication failure. For authorization failures, an HTTP 403 response code is returned.
### Query execution failures
If a query fails, Druid returns a response with an HTTP response code and a JSON object with the following structure:
```json
{
"error" : "Query timeout",
"errorMessage" : "Timeout waiting for task.",
"errorClass" : "java.util.concurrent.TimeoutException",
"host" : "druid1.example.com:8083"
}
```
The fields in the response are:
|field|description|
|-----|-----------|
|error|A well-defined error code (see below).|
|errorMessage|A free-form message with more information about the error. May be null.|
|errorClass|The class of the exception that caused this error. May be null.|
|host|The host on which this error occurred. May be null.|
Possible Druid error codes for the `error` field include:
|Error code|HTTP response code|description|
|----|-----------|-----------|
|`SQL parse failed`|400|Only for SQL queries. The SQL query failed to parse.|
|`Plan validation failed`|400|Only for SQL queries. The SQL query failed to validate.|
|`Resource limit exceeded`|400|The query exceeded a configured resource limit (e.g. groupBy maxResults).|
|`Query capacity exceeded`|429|The query failed to execute because of the lack of resources available at the time when the query was submitted. The resources could be any runtime resources such as [query scheduler lane capacity](../configuration/index.md#query-prioritization-and-laning), merge buffers, and so on. The error message should have more details about the failure.|
|`Unsupported operation`|501|The query attempted to perform an unsupported operation. This may occur when using undocumented features or when using an incompletely implemented extension.|
|`Query timeout`|504|The query timed out.|
|`Query interrupted`|500|The query was interrupted, possibly due to JVM shutdown.|
|`Query cancelled`|500|The query was cancelled through the query cancellation API.|
|`Truncated response context`|500|An intermediate response context for the query exceeded the built-in limit of 7KiB.<br/><br/>The response context is an internal data structure that Druid servers use to share out-of-band information when sending query results to each other. It is serialized in an HTTP header with a maximum length of 7KiB. This error occurs when an intermediate response context sent from a data server (like a Historical) to the Broker exceeds this limit.<br/><br/>The response context is used for a variety of purposes, but the one most likely to generate a large context is sharing details about segments that move during a query. That means this error can potentially indicate that a very large number of segments moved in between the time a Broker issued a query and the time it was processed on Historicals. This should rarely, if ever, occur during normal operation.|
|`Unknown exception`|500|Some other exception occurred. Check errorMessage and errorClass for details, although keep in mind that the contents of those fields are free-form and may change from release to release.|
+1270
View File
File diff suppressed because it is too large Load Diff
+261
View File
@@ -0,0 +1,261 @@
---
id: topnquery
title: "TopN queries"
sidebar_label: "TopN"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
> Apache Druid supports two query languages: [Druid SQL](sql.md) and [native queries](querying.md).
> This document describes a query
> type in the native language. For information about when Druid SQL will use this query type, refer to the
> [SQL documentation](sql.md#query-types).
Apache Druid TopN queries return a sorted set of results for the values in a given dimension according to some criteria. Conceptually, they can be thought of as an approximate [GroupByQuery](../querying/groupbyquery.md) over a single dimension with an [Ordering](../querying/limitspec.md) spec. TopNs are much faster and resource efficient than GroupBys for this use case. These types of queries take a topN query object and return an array of JSON objects where each object represents a value asked for by the topN query.
TopNs are approximate in that each data process will rank their top K results and only return those top K results to the Broker. K, by default in Druid, is `max(1000, threshold)`. In practice, this means that if you ask for the top 1000 items ordered, the correctness of the first ~900 items will be 100%, and the ordering of the results after that is not guaranteed. TopNs can be made more accurate by increasing the threshold.
A topN query object looks like:
```json
{
"queryType": "topN",
"dataSource": "sample_data",
"dimension": "sample_dim",
"threshold": 5,
"metric": "count",
"granularity": "all",
"filter": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "dim1",
"value": "some_value"
},
{
"type": "selector",
"dimension": "dim2",
"value": "some_other_val"
}
]
},
"aggregations": [
{
"type": "longSum",
"name": "count",
"fieldName": "count"
},
{
"type": "doubleSum",
"name": "some_metric",
"fieldName": "some_metric"
}
],
"postAggregations": [
{
"type": "arithmetic",
"name": "average",
"fn": "/",
"fields": [
{
"type": "fieldAccess",
"name": "some_metric",
"fieldName": "some_metric"
},
{
"type": "fieldAccess",
"name": "count",
"fieldName": "count"
}
]
}
],
"intervals": [
"2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
]
}
```
There are 11 parts to a topN query.
|property|description|required?|
|--------|-----------|---------|
|queryType|This String should always be "topN"; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String or Object defining the data source to query, very similar to a table in a relational database. See [DataSource](../querying/datasource.md) for more information.|yes|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|granularity|Defines the granularity to bucket query results. See [Granularities](../querying/granularities.md)|yes|
|filter|See [Filters](../querying/filters.md)|no|
|aggregations|See [Aggregations](../querying/aggregations.md)|for numeric metricSpec, aggregations or postAggregations should be specified. Otherwise no.|
|postAggregations|See [Post Aggregations](../querying/post-aggregations.md)|for numeric metricSpec, aggregations or postAggregations should be specified. Otherwise no.|
|dimension|A String or JSON object defining the dimension that you want the top taken for. For more info, see [DimensionSpecs](../querying/dimensionspecs.md)|yes|
|threshold|An integer defining the N in the topN (i.e. how many results you want in the top list)|yes|
|metric|A String or JSON object specifying the metric to sort by for the top list. For more info, see [TopNMetricSpec](../querying/topnmetricspec.md).|yes|
|context|See [Context](../querying/query-context.md)|no|
Please note the context JSON object is also available for topN queries and should be used with the same caution as the timeseries case.
The format of the results would look like so:
```json
[
{
"timestamp": "2013-08-31T00:00:00.000Z",
"result": [
{
"dim1": "dim1_val",
"count": 111,
"some_metrics": 10669,
"average": 96.11711711711712
},
{
"dim1": "another_dim1_val",
"count": 88,
"some_metrics": 28344,
"average": 322.09090909090907
},
{
"dim1": "dim1_val3",
"count": 70,
"some_metrics": 871,
"average": 12.442857142857143
},
{
"dim1": "dim1_val4",
"count": 62,
"some_metrics": 815,
"average": 13.14516129032258
},
{
"dim1": "dim1_val5",
"count": 60,
"some_metrics": 2787,
"average": 46.45
}
]
}
]
```
## Behavior on multi-value dimensions
topN queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
from matching rows will be used to generate one group per value. It's possible for a query to return more groups than
there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and
generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
improve performance.
See [Multi-value dimensions](multi-value-dimensions.md) for more details.
## Aliasing
The current TopN algorithm is an approximate algorithm. The top 1000 local results from each segment are returned for merging to determine the global topN. As such, the topN algorithm is approximate in both rank and results. Approximate results *ONLY APPLY WHEN THERE ARE MORE THAN 1000 DIM VALUES*. A topN over a dimension with fewer than 1000 unique dimension values can be considered accurate in rank and accurate in aggregates.
The threshold can be modified from its default 1000 via the server parameter `druid.query.topN.minTopNThreshold`, which needs a restart of the servers to take effect, or via `minTopNThreshold` in the query context, which takes effect per query.
If you are wanting the top 100 of a high cardinality, uniformly distributed dimension ordered by some low-cardinality, uniformly distributed dimension, you are potentially going to get aggregates back that are missing data.
To put it another way, the best use cases for topN are when you can have confidence that the overall results are uniformly in the top. For example, if a particular site ID is in the top 10 for some metric for every hour of every day, then it will probably be accurate in the topN over multiple days. But if a site is barely in the top 1000 for any given hour, but over the whole query granularity is in the top 500 (example: a site which gets highly uniform traffic co-mingling in the dataset with sites with highly periodic data), then a top500 query may not have that particular site at the exact rank, and may not be accurate for that particular site's aggregates.
Before continuing in this section, please consider if you really need exact results. Getting exact results is a very resource intensive process. For the vast majority of "useful" data results, an approximate topN algorithm supplies plenty of accuracy.
Users wishing to get an *exact rank and exact aggregates* topN over a dimension with greater than 1000 unique values should issue a groupBy query and sort the results themselves. This is very computationally expensive for high-cardinality dimensions.
Users who can tolerate *approximate rank* topN over a dimension with greater than 1000 unique values, but require *exact aggregates* can issue two queries. One to get the approximate topN dimension values, and another topN with dimension selection filters which only use the topN results of the first.
### Example First query
```json
{
"aggregations": [
{
"fieldName": "L_QUANTITY_longSum",
"name": "L_QUANTITY_",
"type": "longSum"
}
],
"dataSource": "tpch_year",
"dimension":"l_orderkey",
"granularity": "all",
"intervals": [
"1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
],
"metric": "L_QUANTITY_",
"queryType": "topN",
"threshold": 2
}
```
### Example second query
```json
{
"aggregations": [
{
"fieldName": "L_TAX_doubleSum",
"name": "L_TAX_",
"type": "doubleSum"
},
{
"fieldName": "L_DISCOUNT_doubleSum",
"name": "L_DISCOUNT_",
"type": "doubleSum"
},
{
"fieldName": "L_EXTENDEDPRICE_doubleSum",
"name": "L_EXTENDEDPRICE_",
"type": "doubleSum"
},
{
"fieldName": "L_QUANTITY_longSum",
"name": "L_QUANTITY_",
"type": "longSum"
},
{
"name": "count",
"type": "count"
}
],
"dataSource": "tpch_year",
"dimension":"l_orderkey",
"filter": {
"fields": [
{
"dimension": "l_orderkey",
"type": "selector",
"value": "103136"
},
{
"dimension": "l_orderkey",
"type": "selector",
"value": "1648672"
}
],
"type": "or"
},
"granularity": "all",
"intervals": [
"1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
],
"metric": "L_QUANTITY_",
"queryType": "topN",
"threshold": 2
}
```
-99
View File
@@ -1,99 +0,0 @@
<!-- toc -->
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<ins class="adsbygoogle"
style="display:block; text-align:center;"
data-ad-layout="in-article"
data-ad-format="fluid"
data-ad-client="ca-pub-8828078415045620"
data-ad-slot="7586680510"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
## 配置数据保留规则
本教程演示如何在数据源上配置保留规则,以设置要保留或删除的数据的时间间隔
本教程我们假设您已经按照[单服务器部署](../GettingStarted/chapter-3.md)中描述下载了Druid,并运行在本地机器上。
完成[加载本地文件](tutorial-batch.md)和[数据查询](./chapter-4.md)两部分内容也是非常有帮助的。
### 加载示例数据
在本教程中,我们将使用Wikipedia编辑的示例数据,其中包含一个摄取任务规范,它将为输入数据每个小时创建一个单独的段
数据摄取规范位于 `quickstart/tutorial/retention-index.json`, 提交这个规范,将创建一个名称为 `retention-tutorial` 的数据源
```json
bin/post-index-task --file quickstart/tutorial/retention-index.json --url http://localhost:8081
```
摄取完成后,在浏览器中转到[http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources)以访问Druid控制台的datasource视图
此视图显示可用的数据源以及每个数据源的保留规则摘要
![](img-6/tutorial-retention-01.png)
当前没有为 `retention-tutorial` 数据源设置规则。请注意,集群有默认规则:在 `_default_tier` 中永久加载2个副本
这意味着无论时间戳如何,所有数据都将加载,并且每个段将复制到两个Historical进程的 `_default_tier`
在本教程中,我们将暂时忽略分层和冗余概念
让我们通过单击"Fully Available"旁边的"24 Segments"链接来查看 `retention-tutorial` 数据源的段
[Segment视图](http://localhost:8888/unified-console.html#segments) 提供了一个数据源包括的segment信息,本页显示有24个段,每一个段包括了2015-09-12特定小时的数据
![](img-6/tutorial-retention-02.png)
### 设置数据保留规则
假设我们想删除2015年9月12日前12小时的数据,保留2015年9月12日后12小时的数据。
进入到Datasources视图,点击 `retention-tutorial` 数据源的蓝色铅笔的图标 `Cluster default: loadForever`
一个规则配置窗口出现了:
![](img-6/tutorial-retention-03.png)
现在点击 `+ New rule` 按钮两次
在上边的规则框中,选择 `Load``by Interval` 然后输入在 `by Interval` 旁边的输入框中输入 `2015-09-12T12:00:00.000Z/2015-09-13T00:00:00.000Z`, 副本可以选择保持2,在 `_default_tier`
在下边的规则框中,选择 `Drop``forever`
规则看上去是这样的:
![](img-6/tutorial-retention-04.png)
现在点击 `Next`, 规则配置过程将要求提供用户名和注释,以便进行更改日志记录。您可以同时输入教程。
现在点击 `Save`, 可以在Datasources视图中看到新的规则
![](img-6/tutorial-retention-05.png)
给集群几分钟时间应用规则更改,然后转到Druid控制台中的segments视图。2015年9月12日前12小时的段文件现已消失
![](img-6/tutorial-retention-06.png)
生成的保留规则链如下:
1. loadByInterval 2015-09-12T12/2015-09-13 (12 hours)
2. dropForever
3. loadForever (默认规则)
规则链是自上而下计算的,默认规则链始终添加在底部
我们刚刚创建的教程规则链在指定的12小时间隔内加载数据
如果数据不在12小时的间隔内,则规则链下一步将计算 `dropForever`,这将删除任何数据
`dropForever` 终止了规则链,有效地覆盖了默认的 `loadForever` 规则,在这个规则链中永远不会到达该规则
注意,在本教程中,我们定义了一个特定间隔的加载规则
相反,如果希望根据数据的生命周期保留数据(例如,保留从过去3个月到现在3个月的数据),则应定义一个周期性加载规则(Period Load Rule)。
### 进一步阅读
[加载规则](../operations/retainingOrDropData.md)
-145
View File
@@ -1,145 +0,0 @@
<!-- toc -->
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<ins class="adsbygoogle"
style="display:block; text-align:center;"
data-ad-layout="in-article"
data-ad-format="fluid"
data-ad-client="ca-pub-8828078415045620"
data-ad-slot="7586680510"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
## 数据更新
本教程演示如何更新现有数据,同时展示覆盖(Overwrite)和追加(append)的两个方式。
本教程我们假设您已经按照[单服务器部署](../GettingStarted/chapter-3.md)中描述下载了Druid,并运行在本地机器上。
完成[加载本地文件](tutorial-batch.md)、[数据查询](./chapter-4.md)和[roll-up](./chapter-5.md)部分内容也是非常有帮助的
### 数据覆盖
本节教程将介绍如何覆盖现有的指定间隔的数据
#### 加载初始数据
本节教程使用的任务摄取规范位于 `quickstart/tutorial/updates-init-index.json`, 本规范从 `quickstart/tutorial/updates-data.json` 输入文件创建一个名称为 `updates-tutorial` 的数据源
提交任务:
```json
bin/post-index-task --file quickstart/tutorial/updates-init-index.json --url http://localhost:8081
```
我们有三个包含"动物"维度和"数字"指标的初始行:
```json
dsql> select * from "updates-tutorial";
__time animal count number
2018-01-01T01:01:00.000Z tiger 1 100
2018-01-01T03:01:00.000Z aardvark 1 42
2018-01-01T03:01:00.000Z giraffe 1 14124
Retrieved 3 rows in 1.42s.
```
#### 覆盖初始数据
为了覆盖这些数据,我们可以在相同的时间间隔内提交另一个任务,但是使用不同的输入数据。
`quickstart/tutorial/updates-overwrite-index.json` 规范将会对 `updates-tutorial` 数据进行数据重写
注意,此任务从 `quickstart/tutorial/updates-data2.json` 读取输入,`appendToExisting` 设置为false(表示这是一个覆盖)
提交任务:
```json
bin/post-index-task --file quickstart/tutorial/updates-overwrite-index.json --url http://localhost:8081
```
当Druid从这个覆盖任务加载完新的段时,"tiger"行现在有了值"lion""aardvark"行有了不同的编号,"giraffe"行已经被替换。更改可能需要几分钟才能生效:
```json
dsql> select * from "updates-tutorial";
__time animal count number
2018-01-01T01:01:00.000Z lion 1 100
2018-01-01T03:01:00.000Z aardvark 1 9999
2018-01-01T04:01:00.000Z bear 1 111
Retrieved 3 rows in 0.02s.
```
### 将旧数据与新数据合并并覆盖
现在我们尝试在 `updates-tutorial` 数据源追加一些新的数据,我们将从 `quickstart/tutorial/updates-data3.json` 增加新的数据
`quickstart/tutorial/updates-append-index.json` 任务规范配置为从现有的 `updates-tutorial` 数据源和 `quickstart/tutorial/updates-data3.json` 文件读取数据,该任务将组合来自两个输入源的数据,然后用新的组合数据覆盖原始数据。
提交任务:
```json
bin/post-index-task --file quickstart/tutorial/updates-append-index.json --url http://localhost:8081
```
当Druid完成从这个覆盖任务加载新段时,新行将被添加到数据源中。请注意,“Lion”行发生了roll up
```json
dsql> select * from "updates-tutorial";
__time animal count number
2018-01-01T01:01:00.000Z lion 2 400
2018-01-01T03:01:00.000Z aardvark 1 9999
2018-01-01T04:01:00.000Z bear 1 111
2018-01-01T05:01:00.000Z mongoose 1 737
2018-01-01T06:01:00.000Z snake 1 1234
2018-01-01T07:01:00.000Z octopus 1 115
Retrieved 6 rows in 0.02s.
```
### 追加数据
现在尝试另一种追加数据的方式
`quickstart/tutorial/updates-append-index2.json` 任务规范从 `quickstart/tutorial/updates-data4.json` 文件读取数据,然后追加到 `updates-tutorial` 数据源。注意到在规范中 `appendToExisting` 设置为 `true`
提交任务:
```json
bin/post-index-task --file quickstart/tutorial/updates-append-index2.json --url http://localhost:8081
```
加载新数据后,我们可以看到"octopus"后面额外的两行。请注意,编号为222的新"bear"行尚未与现有的bear-111行合并,因为新数据保存在单独的段中。
```json
dsql> select * from "updates-tutorial";
__time animal count number
2018-01-01T01:01:00.000Z lion 2 400
2018-01-01T03:01:00.000Z aardvark 1 9999
2018-01-01T04:01:00.000Z bear 1 111
2018-01-01T05:01:00.000Z mongoose 1 737
2018-01-01T06:01:00.000Z snake 1 1234
2018-01-01T07:01:00.000Z octopus 1 115
2018-01-01T04:01:00.000Z bear 1 222
2018-01-01T09:01:00.000Z falcon 1 1241
Retrieved 8 rows in 0.02s.
```
当我们执行一个GroupBy查询而非 `Select *`, 我们看到"beer"行将在查询时聚合在一起:
```json
dsql> select __time, animal, SUM("count"), SUM("number") from "updates-tutorial" group by __time, animal;
__time animal EXPR$2 EXPR$3
2018-01-01T01:01:00.000Z lion 2 400
2018-01-01T03:01:00.000Z aardvark 1 9999
2018-01-01T04:01:00.000Z bear 2 333
2018-01-01T05:01:00.000Z mongoose 1 737
2018-01-01T06:01:00.000Z snake 1 1234
2018-01-01T07:01:00.000Z octopus 1 115
2018-01-01T09:01:00.000Z falcon 1 1241
Retrieved 7 rows in 0.23s.
```
+5 -5
View File
@@ -208,7 +208,7 @@ Druid 是通过读取和存储有关导入数据的摘要(schema)来完成
2. 但一个数据源显示为可用的时候,针对这个数据源打开 Actions (![Actions](../assets/datasources-action-button.png)) 菜单,然后选择 **使用 SQL 进行查询(Query with SQL**。
![Datasource view](../assets/tutorial-batch-data-loader-10.png "Datasource view")
![Datasource view](../assets/tutorial-batch-data-loader-10.png ':size=690')
> 请注意,你还可以对数据源进行一些其他的操作,包括有配置,保留时间规则,压缩等。
@@ -222,13 +222,13 @@ Druid 是通过读取和存储有关导入数据的摘要(schema)来完成
## 下一步
在完成上面步骤中的快速导航后,请查看 [query 教程](./tutorial-query.md) 页面中的内容来了解如何在 Druid 的控制台中使用查询语句。
在完成上面步骤中的快速导航后,请查看 [query 教程](../tutorials/tutorial-query.md) 页面中的内容来了解如何在 Druid 的控制台中使用查询语句。
还有,如果你还希望从其他的数据导入方式中导入数据到 Druid,请参考下面的页面链接:
- [从 Apache Kafka 中加载流式数据](./tutorial-kafka.md) 如何从 Kafka 的主题中加载流式数据。
- [使用 Apache Hadoop 载入一个文件](./tutorial-batch-hadoop.md) 如何使用远程 Hadoop 集群执行批处理文件加载
- [编写一个你自己的数据导入规范](./tutorial-ingestion-spec.md) – 如何编写新的数据导入规范并使用它来加载数据
- [从 Apache Kafka 中加载流式数据](../tutorials/tutorial-kafka.md) 如何从 Kafka 的主题中加载流式数据。
- [使用 Apache Hadoop 载入一个文件](../tutorials/tutorial-batch-hadoop.md) 如何使用远程 Hadoop 集群执行批处理文件加载
- [编写一个你自己的数据导入规范](../tutorials/tutorial-ingestion-spec.md) – 如何编写新的数据导入规范并使用它来加载数据
请注意,当你停止了 Druid 的服务后,可以通过删除 Druid 根目录下的 `var` 目录,并且再次运行 `bin/start-micro-quickstart` 脚本来让 Druid 启动一个完全新的实例 。
+4 -5
View File
@@ -2,8 +2,8 @@
本教程文档主要为了对如何在 Apache Druid 使用 SQL 进行查询进行说明。
假设你已经完成了 [快速开始](../tutorials/index.md) 页面中的内容或者下面页面中有关的内容的内容。因为在 Apache Druid 中进行查询之前,
你需要将注入导入到 Druid 后才能够让进行下一步的操作:
假设你已经完成了 [快速开始](../tutorials/index.md) 页面中的内容或者下面页面中有关的内容。因为在 Apache Druid 中进行查询之前,
你需要将数据导入到 Druid 后才能够让进行下一步的操作:
* [教程:载入一个文件](../tutorials/tutorial-batch.md)
* [教程:从 Kafka 中载入流数据](../tutorials/tutorial-kafka.md)
@@ -93,7 +93,7 @@ WHERE 语句将会显示在你的查询中。
![Explain query](../assets/tutorial-query-06.png "Explain query")
> Another way to view the explain plan is by adding EXPLAIN PLAN FOR to the front of your query, as follows:
> 另外一种通过纯文本 JSON 格式查看 SQL 脚本的办法就是在查询脚本前面添加 EXPLAIN PLAN FOR, 如下所示:
>
>```sql
>EXPLAIN PLAN FOR
@@ -106,8 +106,7 @@ WHERE 语句将会显示在你的查询中。
>GROUP BY 1, 2
>ORDER BY "Edits" DESC
>```
>This is particularly useful when running queries
from the command line or over HTTP.
>这种方式针对在控制台工具上运行查询脚本的时候非常有用。
11. 最后,单击 `...` 然后选择 **Edit context** 来查看你可以添加的其他参数来控制查询的执行。
+50 -62
View File
@@ -1,96 +1,85 @@
---
id: tutorial-retention
title: "Tutorial: Configuring data retention"
sidebar_label: "Configuring data retention"
---
# 数据保留规则
本教程对如何在数据源上配置数据保留规则进行了说明,数据保留规则主要定义为数据的保留(retained)或者卸载(dropped)的时间。
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
!> 请注意,dropped 我们使用了中文 `卸载` 来进行翻译。但是 Druid 对卸载的数据是会从段里面删除掉的,如果你还需要这些数据的话,你需要将数据重新导入。
假设你已经完成了 [快速开始](../tutorials/index.md) 页面中的内容或者下面页面中有关的内容,并且你的 Druid 实例已经在你的本地的计算机上运行了。
同时,如果你已经完成了下面内容的阅读的话将会更好的帮助你理解 Roll-up 的相关内容。
* [教程:载入一个文件](../tutorials/tutorial-batch.md)
* [教程:查询数据](../tutorials/tutorial-query.md)
This tutorial demonstrates how to configure retention rules on a datasource to set the time intervals of data that will be retained or dropped.
## 载入示例数据
For this tutorial, we'll assume you've already downloaded Apache Druid as described in
the [single-machine quickstart](index.html) and have it running on your local machine.
在本教程中,我们将使用W Wikipedia 编辑的示例数据,其中包含一个摄取任务规范,它将为输入数据每个小时创建一个单独的段。
It will also be helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md) and [Tutorial: Querying data](../tutorials/tutorial-query.md).
## Load the example data
For this tutorial, we'll be using the Wikipedia edits sample data, with an ingestion task spec that will create a separate segment for each hour in the input data.
The ingestion spec can be found at `quickstart/tutorial/retention-index.json`. Let's submit that spec, which will create a datasource called `retention-tutorial`:
数据摄取导入规范位于 `quickstart/tutorial/retention-index.json` 文件中。让我们提交这个规范,将创建一个名称为 `retention-tutorial` 的数据源。
```bash
bin/post-index-task --file quickstart/tutorial/retention-index.json --url http://localhost:8081
```
After the ingestion completes, go to [http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources) in a browser to access the Druid Console's datasource view.
摄取完成后,在浏览器中访问 http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources)
然后访问 Druid 的控制台数据源视图。
此视图显示可用的数据源以及每个数据源定义的数据保留规则摘要。
This view shows the available datasources and a summary of the retention rules for each datasource:
![Summary](../assets/tutorial-retention-01.png "Summary")
Currently there are no rules set for the `retention-tutorial` datasource. Note that there are default rules for the cluster: load forever with 2 replicas in `_default_tier`.
当前,针对 `retention-tutorial` 数据源还没有设置数据保留规则。
This means that all data will be loaded regardless of timestamp, and each segment will be replicated to two Historical processes in the default tier.
需要注意的是,针对集群部署方式会配置一个默认的数据保留规则:永久载入 2 个副本并且替换进 `_default_tier`load forever with 2 replicas in `_default_tier`)。ith 2 replicas in `_default_tier`.
In this tutorial, we will ignore the tiering and redundancy concepts for now.
这意味着无论时间戳如何,所有数据都将加载,并且每个段将复制到两个 Historical 进程的默认层(default tier)中。
Let's view the segments for the `retention-tutorial` datasource by clicking the "24 Segments" link next to "Fully Available".
在本教程中,我们将暂时忽略分层(tiering)和冗余(redundancy)的概念。
The segments view ([http://localhost:8888/unified-console.html#segments](http://localhost:8888/unified-console.html#segments)) provides information about what segments a datasource contains. The page shows that there are 24 segments, each one containing data for a specific hour of 2015-09-12:
通过单击 `retention-tutorial` 数据源 "Fully Available" 链接边上的 "24 Segments" 来查看段(segments)信息。
段视图 ([http://localhost:8888/unified-console.html#segments](http://localhost:8888/unified-console.html#segments)) p
[Segment视图](http://localhost:8888/unified-console.html#segments) 提供了一个数据源的段(segment)信息。
本页显示了有 24 个段,每个段包括有 2015-09-12 每一个小时的数据。
![Original segments](../assets/tutorial-retention-02.png "Original segments")
## Set retention rules
## 设置保留规则
Suppose we want to drop data for the first 12 hours of 2015-09-12 and keep data for the later 12 hours of 2015-09-12.
假设我们想卸载 2015年9月12日 前 12 小时的数据,保留 2015年9月12日后 12 小时的数据。
Go to the [datasources view](http://localhost:8888/unified-console.html#datasources) and click the blue pencil icon next to `Cluster default: loadForever` for the `retention-tutorial` datasource.
进入 [datasources view](http://localhost:8888/unified-console.html#datasources) 页面,然后单击 `Cluster default: loadForever`
边上的的蓝色铅笔,然后为数据源选择 `retention-tutorial`
A rule configuration window will appear:
一个针对当前数据源的数据保留策略窗口将会显示出来:
![Rule configuration](../assets/tutorial-retention-03.png "Rule configuration")
Now click the `+ New rule` button twice.
单击 `+ New rule` 按钮 2 次。
In the upper rule box, select `Load` and `by interval`, and then enter `2015-09-12T12:00:00.000Z/2015-09-13T00:00:00.000Z` in field next to `by interval`. Replicas can remain at 2 in the `_default_tier`.
在上层的输入框中输入 `Load` 然后选择 `by interval`,然后输入 在 `by interval` 边上的对话框中输入 `2015-09-12T12:00:00.000Z/2015-09-13T00:00:00.000Z`
副本(Replicas)在 `_default_tier` 中可以设置为默认为 2。
In the lower rule box, select `Drop` and `forever`.
然后在下侧的对话框中选择 `Drop` `forever`
The rules should look like this:
设置的规则应该看起来和下面这样是一样的:
![Set rules](../assets/tutorial-retention-04.png "Set rules")
Now click `Next`. The rule configuration process will ask for a user name and comment, for change logging purposes. You can enter `tutorial` for both.
单击 `Next`。 规则配置进程将要求提供用户名和注释,以及修改的日志以便于记录。你可以同时输入字符 `tutorial`,当然你也可以用自己的字符。
Now click `Save`. You can see the new rules in the datasources view:
单击 `Save`, 随后你就可以在 datasources 视图中看到设置的新的规则了。
![New rules](../assets/tutorial-retention-05.png "New rules")
Give the cluster a few minutes to apply the rule change, and go to the [segments view](http://localhost:8888/unified-console.html#segments) in the Druid Console.
The segments for the first 12 hours of 2015-09-12 are now gone:
给集群几分钟时间来应用修改的保留规则。然后在 Druid 控制台中进入 [segments view](http://localhost:8888/unified-console.html#segments)
这时候你应该发现 2015-09-12 前 12 小时的段已经消失了。
![New segments](../assets/tutorial-retention-06.png "New segments")
The resulting retention rule chain is the following:
针对上面的修改,新生成的保留规则链如下:
1. loadByInterval 2015-09-12T12/2015-09-13 (12 hours)
@@ -98,18 +87,17 @@ The resulting retention rule chain is the following:
3. loadForever (default rule)
The rule chain is evaluated from top to bottom, with the default rule chain always added at the bottom.
规则链是自上而下计算的,默认规则链始终添加在规则链的最底部。
The tutorial rule chain we just created loads data if it is within the specified 12 hour interval.
根据我们刚才教程使用的规则创建的内容,链在指定的12小时间隔内加载数据。
If data is not within the 12 hour interval, the rule chain evaluates `dropForever` next, which will drop any data.
如果数据不在 12 小时内的话,那么规则链将会随后对 `dropForever` 进行评估 —— 评估的结果就是卸载所有的数据。
The `dropForever` terminates the rule chain, effectively overriding the default `loadForever` rule, which will never be reached in this rule chain.
`dropForever` 终止了规则链,并且覆盖了默认的 `loadForever` 规则,因此最后的 `loadForever` 在这个规则链中永远不会实现到。
Note that in this tutorial we defined a load rule on a specific interval.
请注意,在本教程中,我们定义了一个特定间隔的加载规则。
If instead you want to retain data based on how old it is (e.g., retain data that ranges from 3 months in the past to the present time), you would define a Period load rule instead.
如果希望根据数据的生命周期来保留保留数据(例如,保留从过去到现在 3 个月以内的数据),那么你应该定义一个周期性加载规则(Period Load Rule)。
## Further reading
* [Load rules](../operations/rule-configuration.md)
## 延伸阅读
* [载入规则(Load rules](../operations/rule-configuration.md)
+31 -190
View File
@@ -1,16 +1,21 @@
# Roll-up
Apache Druid can summarize raw data at ingestion time using a process we refer to as "roll-up". Roll-up is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.
This tutorial will demonstrate the effects of roll-up on an example dataset.
Apache Druid 可以在数据摄取阶段对原始数据进行汇总,这个过程我们称为 "roll-up"。
Roll-up 是第一级对选定列集的一级聚合操作,通过这个操作我们能够减少存储数据的大小。
For this tutorial, we'll assume you've already downloaded Druid as described in
the [single-machine quickstart](index.html) and have it running on your local machine.
本教程中将讨论在一个示例数据集上进行 roll-up 的示例。
It will also be helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md) and [Tutorial: Querying data](../tutorials/tutorial-query.md).
假设你已经完成了 [快速开始](../tutorials/index.md) 页面中的内容或者下面页面中有关的内容,并且你的 Druid 实例已经在你的本地的计算机上运行了。
## Example data
For this tutorial, we'll use a small sample of network flow event data, representing packet and byte counts for traffic from a source to a destination IP address that occurred within a particular second.
同时,如果你已经完成了下面内容的阅读的话将会更好的帮助你理解 Roll-up 的相关内容
* [教程:载入一个文件](../tutorials/tutorial-batch.md)
* [教程:查询数据](../tutorials/tutorial-query.md)
## 示例数据
针对对于本教程,我们将使用一个网络事件流数据的小样本。如下面表格中使用的数据,这个数据是在特定时间内从源到目标 IP 地址的流量的数据包和字节的事件。
```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
@@ -24,9 +29,9 @@ For this tutorial, we'll use a small sample of network flow event data, represen
{"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
```
A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`.
包含有这个样本数据的 JSON 文件位于 `quickstart/tutorial/rollup-data.json`
We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`.
我们将使用下面描述的数据导入任务描述规范,将上面的 JSON 数据导入到 Druid 中,有关这个任务描述配置位于 `quickstart/tutorial/rollup-index.json` 中。
```json
{
@@ -78,25 +83,25 @@ We'll ingest this data using the following ingestion task spec, located at `quic
}
```
Roll-up has been enabled by setting `"rollup" : true` in the `granularitySpec`.
通过在 `granularitySpec` 选项中设置 `rollup : true` 来启用 Roll-up。
Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`.
请注意,我们将 `srcIP` `dstIP` 定义为 **维度(dimensions**,将 `packets` `bytes` 列定义为了 longSum 类型的**指标(metric**,并将 `queryGranularity` 配置定义为 `minute`
We will see how these definitions are used after we load this data.
加载这些数据后,我们将看到如何使用这些定义。
## Load the example data
## 载入示例数据
From the apache-druid-apache-druid-0.21.1 package root, run the following command:
在 Druid 包 的apache-druid-apache-druid-0.21.1 根目录下运行以下命令:
```bash
bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
```
After the script completes, we will query the data.
当上面的脚本运行完成后,我们将会开始查询数据。
## Query the example data
## 查询示例数据
Let's run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see what data was ingested.
让我们运行 `bin/dsql` 命令行工具,然后执行 `select * from "rollup-tutorial";` 脚本,来查看 Druid 系统中导入的数据。
```bash
$ bin/dsql
@@ -117,7 +122,7 @@ Retrieved 5 rows in 1.18s.
dsql>
```
Let's look at the three events in the original input data that occurred during `2018-01-01T01:01`:
让我们查看在 `2018-01-01T01:01` 导入的 3 条原始数据:
```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
@@ -125,7 +130,7 @@ Let's look at the three events in the original input data that occurred during `
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
```
These three rows have been "rolled up" into the following row:
上面的 3 条原始数据使用 "rolled up" 后将会合并成下面 1 条数据进行导入:
```bash
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
@@ -134,8 +139,12 @@ These three rows have been "rolled up" into the following row:
│ 2018-01-01T01:01:00.000Z │ 359373 │ 2.2.2.2 │ 286 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```
这输入的数据将会按按照时间列(timestamp)和维度列(dimension columns) `{timestamp, srcIP, dstIP}` 进行分组(Group By),同时在指标列(metric columns `{packages, bytes}` 上进行聚合。
在进行分组之前,原始输入数据的时间戳按分钟进行标记和记录的,这是由于摄取规范中的 `"queryGranularity""minute"` 配置中决定的。
因此,记录中的 `2018-01-01T01:02` 期间发生的时间也被聚合后汇总。
The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`.
Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the `"queryGranularity":"minute"` setting in the ingestion spec.
@@ -154,7 +163,7 @@ Likewise, these two events that occurred during `2018-01-01T01:02` have been rol
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```
For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no roll-up took place, because this was the only event that occurred during `2018-01-01T01:03`:
针对最后的记录 1.1.1.1 2.2.2.2 之间流量事件没有被 roll-up 进行合并汇总, 这是因为这些事件是 `2018-01-01T01:03` 期间发生的唯一事件。nt that occurred during `2018-01-01T01:03`:
```json
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
@@ -168,172 +177,4 @@ For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no roll-up too
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```
Note that the `count` metric shows how many rows in the original input data contributed to the final "rolled up" row.
## Roll-up
Apache Druid可以通过roll-up在数据摄取阶段对原始数据进行汇总。 Roll-up是对选定列集的一级聚合操作,它可以减小存储数据的大小。
本教程中将讨论在一个示例数据集上进行roll-up的结果。
本教程我们假设您已经按照[单服务器部署](../GettingStarted/chapter-3.md)中描述下载了Druid,并运行在本地机器上。
完成[加载本地文件](tutorial-batch.md)和[数据查询](./chapter-4.md)两部分内容也是非常有帮助的。
### 示例数据
对于本教程,我们将使用一个网络流事件数据的小样本,表示在特定时间内从源到目标IP地址的流量的数据包和字节计数。
```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
{"timestamp":"2018-01-02T21:33:14Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":38,"bytes":6289}
{"timestamp":"2018-01-02T21:33:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":123,"bytes":93999}
{"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
```
位于 `quickstart/tutorial/rollup-data.json` 的文件包含了样例输入数据
我们将使用 `quickstart/tutorial/rollup-index.json` 的摄入数据规范来摄取数据
```json
{
"type" : "index_parallel",
"spec" : {
"dataSchema" : {
"dataSource" : "rollup-tutorial",
"dimensionsSpec" : {
"dimensions" : [
"srcIP",
"dstIP"
]
},
"timestampSpec": {
"column": "timestamp",
"format": "iso"
},
"metricsSpec" : [
{ "type" : "count", "name" : "count" },
{ "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
{ "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "week",
"queryGranularity" : "minute",
"intervals" : ["2018-01-01/2018-01-03"],
"rollup" : true
}
},
"ioConfig" : {
"type" : "index_parallel",
"inputSource" : {
"type" : "local",
"baseDir" : "quickstart/tutorial",
"filter" : "rollup-data.json"
},
"inputFormat" : {
"type" : "json"
},
"appendToExisting" : false
},
"tuningConfig" : {
"type" : "index_parallel",
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000
}
}
}
```
通过在 `granularitySpec` 选项中设置 `rollup : true` 来启用Roll-up
注意,我们将`srcIP``dstIP`定义为**维度**,将`packets``bytes`列定义为了`longSum`类型的**指标**,并将 `queryGranularity` 配置定义为 `minute`
加载这些数据后,我们将看到如何使用这些定义。
### 加载示例数据
在Druid的根目录下运行以下命令:
```json
bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
```
脚本运行完成以后,我们将查询数据。
### 查询示例数据
现在运行 `bin/dsql` 然后执行查询 `select * from "rollup-tutorial";` 来查看已经被摄入的数据。
```json
$ bin/dsql
Welcome to dsql, the command-line client for Druid SQL.
Type "\h" for help.
dsql> select * from "rollup-tutorial";
__time bytes count dstIP packets srcIP
2018-01-01T01:01:00.000Z 35937 3 2.2.2.2 286 1.1.1.1
2018-01-01T01:02:00.000Z 366260 2 2.2.2.2 415 1.1.1.1
2018-01-01T01:03:00.000Z 10204 1 2.2.2.2 49 1.1.1.1
2018-01-02T21:33:00.000Z 100288 2 8.8.8.8 161 7.7.7.7
2018-01-02T21:35:00.000Z 2818 1 8.8.8.8 12 7.7.7.7
Retrieved 5 rows in 1.18s.
dsql>
```
我们来看发生在 `2018-01-01T01:01` 的三条原始数据:
```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
```
这三条数据已经被roll up为以下一行数据:
```json
__time bytes count dstIP packets srcIP
2018-01-01T01:01:00.000Z 35937 3 2.2.2.2 286 1.1.1.1
```
这输入的数据行已经被按照时间列和维度列 `{timestamp, srcIP, dstIP}` 在指标列 `{packages, bytes}` 上做求和聚合
在进行分组之前,原始输入数据的时间戳按分钟进行标记/布局,这是由于摄取规范中的 `"queryGranularity""minute"` 设置造成的。
同样,`2018-01-01T01:02` 期间发生的这两起事件也已经汇总。
```json
{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
```
```json
__time bytes count dstIP packets srcIP
2018-01-01T01:02:00.000Z 366260 2 2.2.2.2 415 1.1.1.1
```
对于记录1.1.1.1和2.2.2.2之间流量的最后一个事件没有发生汇总,因为这是 `2018-01-01T01:03` 期间发生的唯一事件
```json
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
```
```json
__time bytes count dstIP packets srcIP
2018-01-01T01:03:00.000Z 10204 1 2.2.2.2 49 1.1.1.1
```
请注意,`计数指标 count` 显示原始输入数据中有多少行贡献给最终的"roll up"行。
`计数指标(count)` 显示的是原始数据中有多少条记录最后被合并汇总(roll up)了。
+44 -55
View File
@@ -1,53 +1,30 @@
---
id: tutorial-update-data
title: "Tutorial: Updating existing data"
sidebar_label: "Updating existing data"
---
# 数据更新
被页面将会对如何对现有数据进行更新进行说明,同时演示覆盖(overwrites)和追加(appends2 种更新方式。
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
假设你已经完成了 [快速开始](../tutorials/index.md) 页面中的内容或者下面页面中有关的内容,并且你的 Druid 实例已经在你的本地的计算机上运行了。
同时,如果你已经完成了下面内容的阅读的话将会更好的帮助你理解有关数据更新的内容。
This tutorial demonstrates how to update existing data, showing both overwrites and appends.
* [教程:载入一个文件](../tutorials/tutorial-batch.md)
* [教程:查询数据](../tutorials/tutorial-query.md)
* [教程:Rollup](../tutorials/tutorial-rollup.md)
## 覆盖(Overwrite
本教程的这部分内容将会展示如何覆盖已经存在的时序间隔数据。
For this tutorial, we'll assume you've already downloaded Apache Druid as described in
the [single-machine quickstart](index.html) and have it running on your local machine.
### 载入初始化数据
It will also be helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md), [Tutorial: Querying data](../tutorials/tutorial-query.md), and [Tutorial: Rollup](../tutorials/tutorial-rollup.md).
让我们先载入一部分原始数据来作为初始化的数据,随后我们将会对这些数据进行覆盖和追加。
## Overwrite
本指南使用的数据导入规范位于 `quickstart/tutorial/updates-init-index.json` 文件。这个数据导入规范将会从 `quickstart/tutorial/updates-data.json` 中导入数据文件,并且创建一个称为`updates-tutorial` 的数据源。
This section of the tutorial will cover how to overwrite an existing interval of data.
### Load initial data
Let's load an initial data set which we will overwrite and append to.
The spec we'll use for this tutorial is located at `quickstart/tutorial/updates-init-index.json`. This spec creates a datasource called `updates-tutorial` from the `quickstart/tutorial/updates-data.json` input file.
Let's submit that task:
让我们提交这个任务:
```bash
bin/post-index-task --file quickstart/tutorial/updates-init-index.json --url http://localhost:8081
```
We have three initial rows containing an "animal" dimension and "number" metric:
在任务完成后,我将会初始化看到一个 "animal" 的维度(dimension)和 "number" 的指标(metric):
```bash
dsql> select * from "updates-tutorial";
@@ -61,21 +38,23 @@ dsql> select * from "updates-tutorial";
Retrieved 3 rows in 1.42s.
```
### Overwrite the initial data
### 覆盖初始化数据
To overwrite this data, we can submit another task for the same interval, but with different input data.
为了覆盖这些初始化的原始数据,我们可以提交另外一个任务,在这个任务中我们会设置有相同的时间间隔,但是输入数据是不同的。
The `quickstart/tutorial/updates-overwrite-index.json` spec will perform an overwrite on the `updates-tutorial` datasource.
`quickstart/tutorial/updates-overwrite-index.json` 规范将定义如何覆盖 `updates-tutorial` 数据源。
Note that this task reads input from `quickstart/tutorial/updates-data2.json`, and `appendToExisting` is set to `false` (indicating this is an overwrite).
请注意,上面定义的导入规范是从 `quickstart/tutorial/updates-data2.json` 数据文件中读取数据的,并且规范中的 `appendToExisting` 设置为 `false`
(在规范中的这个设置决定了数据采取的是覆盖导入方式)。
Let's submit that task:
然后让我们提交这个任务:
```bash
bin/post-index-task --file quickstart/tutorial/updates-overwrite-index.json --url http://localhost:8081
```
When Druid finishes loading the new segment from this overwrite task, the "tiger" row now has the value "lion", the "aardvark" row has a different number, and the "giraffe" row has been replaced. It may take a couple of minutes for the changes to take effect:
Druid 从覆盖任务中完成导入新的段后,我们会看到原来的 "tiger" 行中对应当前的值为 "lion" "aardvark" 行中有了不同的数字; "giraffe" 行被完全替换了。
针对不同的环境,上面的配置需要等几分钟后才能生效:
```bash
dsql> select * from "updates-tutorial";
@@ -89,19 +68,23 @@ dsql> select * from "updates-tutorial";
Retrieved 3 rows in 0.02s.
```
## Combine old data with new data and overwrite
## 将新数据和老数据合并后进行覆盖
Let's try appending some new data to the `updates-tutorial` datasource now. We will add the data from `quickstart/tutorial/updates-data3.json`.
让我们现在将新的数据追加到 `updates-tutorial` 数据源。我们将会使用名为 `quickstart/tutorial/updates-data3.json` 的数据文件。
The `quickstart/tutorial/updates-append-index.json` task spec has been configured to read from the existing `updates-tutorial` datasource and the `quickstart/tutorial/updates-data3.json` file. The task will combine data from the two input sources, and then overwrite the original data with the new combined data.
`quickstart/tutorial/updates-append-index.json` 任务规范将会被配置从已经存在的 `quickstart/tutorial/updates-data3.json` 数据文件
`updates-tutorial` 数据源同属兑取数据后更新 `updates-tutorial` 数据源。
Let's submit that task:
这个任务将会对 2 个数据源中读取的数据进行合并,然后将合并后的数据重新写回到数据源。
然后让我们提交这个任务:
```bash
bin/post-index-task --file quickstart/tutorial/updates-append-index.json --url http://localhost:8081
```
When Druid finishes loading the new segment from this overwrite task, the new rows will have been added to the datasource. Note that roll-up occurred for the "lion" row:
Druid 完成这个任务并且创建新段后,新的行将会被添加到数据源中。
需要注意的是 "lion" 行进行了合并(roll-up)操作:
```bash
dsql> select * from "updates-tutorial";
@@ -118,19 +101,25 @@ dsql> select * from "updates-tutorial";
Retrieved 6 rows in 0.02s.
```
## Append to the data
## 追加数据
Let's try another way of appending data.
让我们尝试使用另外一种方法来对数据进行追加。
The `quickstart/tutorial/updates-append-index2.json` task spec reads input from `quickstart/tutorial/updates-data4.json` and will append its data to the `updates-tutorial` datasource. Note that `appendToExisting` is set to `true` in this spec.
`quickstart/tutorial/updates-append-index2.json` 任务规范将会被配置从已经存在的 `quickstart/tutorial/updates-data4.json` 文件中读取数据,
在数据读取后将数据追加到 `updates-tutorial` 数据源中。
Let's submit that task:
请注意,规范中的 `appendToExisting` 设置为 `true`
然后让我们提交这个任务:
```bash
bin/post-index-task --file quickstart/tutorial/updates-append-index2.json --url http://localhost:8081
```
When the new data is loaded, we can see two additional rows after "octopus". Note that the new "bear" row with number 222 has not been rolled up with the existing bear-111 row, because the new data is held in a separate segment.
当新的数据被载入后,我们会看到 octopus 中添加了 2 条新的行。
请注意,新添加的行 "bear" 中的值为 222 针对已经存在的 "bear" 行中的数据 111Druid 并没有针对数据进行了 rolled-up 操作。
这是因为新增加的数据保存在不同的段中。
```bash
dsql> select * from "updates-tutorial";
@@ -150,7 +139,7 @@ Retrieved 8 rows in 0.02s.
```
If we run a GroupBy query instead of a `select *`, we can see that the "bear" rows will group together at query time:
如果我们运行 GroupBy 查询来替代 `select *` 查询的话,我们会看到 "bear" 这一行将在 group By 查询后再合并在一起的:
```bash
dsql> select __time, animal, SUM("count"), SUM("number") from "updates-tutorial" group by __time, animal;