Compare commits
26 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 49c3930deb | |
| | bf74617057 | |
| | 72f026cac7 | |
| | c171635482 | |
| | 3c5404eea7 | |
| | 67f5094fac | |
| | b661d141ef | |
| | 1926fc1783 | |
| | 608e5746ff | |
| | 5508a1e203 | |
| | 9cea283a77 | |
| | e730545948 | |
| | fec5fb7511 | |
| | d512d84286 | |
| | 9f13516241 | |
| | 08cc99a7a5 | |
| | 5a5aa9e1c9 | |
| | d994362f95 | |
| | 683145ef58 | |
| | c41b4795d2 | |
| | 9db8616086 | |
| | 6318c33333 | |
| | 3e6d21241e | |
| | 222f234eb1 | |
| | f5cfe18c9c | |
| | 1b6300a7a6 | |
@@ -1,99 +0,0 @@
|
||||
<!-- toc -->
|
||||
|
||||
|
||||
|
||||
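## Configuring data retention rules

This tutorial demonstrates how to configure retention rules on a datasource to set the time intervals of data that will be retained or dropped.

This tutorial assumes that you have already downloaded Druid as described in [Single-server deployment](../GettingStarted/chapter-3.md) and have it running on your local machine.

It will also be helpful to have completed [Loading a local file](tutorial-batch.md) and [Querying data](./chapter-4.md).

### Loading the example data

In this tutorial, we will use the Wikipedia edits sample data, with an ingestion task spec that creates a separate segment for each hour of the input data.

The ingestion spec is located at `quickstart/tutorial/retention-index.json`. Submitting this spec creates a datasource named `retention-tutorial`:

```bash
bin/post-index-task --file quickstart/tutorial/retention-index.json --url http://localhost:8081
```

After the ingestion completes, go to [http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources) in a browser to access the datasource view of the Druid console.

This view shows the available datasources and a summary of the retention rules for each datasource: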
|
||||
|
||||

|
||||
|
||||
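Currently there are no rules set for the `retention-tutorial` datasource. Note that the cluster has a default rule: load forever with 2 replicas in `_default_tier`.

This means that all data will be loaded regardless of timestamp, and each segment will be replicated to two Historical processes in `_default_tier`.

In this tutorial, we will ignore the tiering and redundancy concepts for now.

Let's view the segments of the `retention-tutorial` datasource by clicking the "24 Segments" link next to "Fully Available".

The [segments view](http://localhost:8888/unified-console.html#segments) provides information about the segments a datasource contains. The page shows that there are 24 segments, each containing data for a specific hour of 2015-09-12: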
|
||||
|
||||

|
||||
|
||||
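### Setting retention rules

Suppose we want to drop the data for the first 12 hours of 2015-09-12 and keep the data for the last 12 hours of 2015-09-12.

Go to the datasources view and click the blue pencil icon next to `Cluster default: loadForever` for the `retention-tutorial` datasource.

A rule configuration window will appear: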
|
||||
|
||||

|
||||
|
||||
Now click the `+ New rule` button twice.

In the upper rule box, select `Load` and `by Interval`, then enter `2015-09-12T12:00:00.000Z/2015-09-13T00:00:00.000Z` in the field next to `by Interval`. Replicas can remain at 2 in `_default_tier`.

In the lower rule box, select `Drop` and `forever`.

The rules should look like this:
|
||||
|
||||

|
||||
|
||||
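Now click `Next`. The rule configuration process will ask for a user name and a comment for change-logging purposes. You can enter `tutorial` for both.

Now click `Save`. You can see the new rules in the datasources view: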
|
||||
|
||||

|
||||
|
||||
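Give the cluster a few minutes to apply the rule change, then go to the segments view in the Druid console. The segments for the first 12 hours of 2015-09-12 are now gone: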
|
||||
|
||||

|
||||
|
||||
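The resulting retention rule chain is as follows:

1. loadByInterval 2015-09-12T12/2015-09-13 (12 hours)
2. dropForever
3. loadForever (default rule)

The rule chain is evaluated from top to bottom, with the default rule chain always added at the bottom.

The tutorial rule chain we just created loads data if it falls within the specified 12-hour interval.

If data is not within the 12-hour interval, the rule chain evaluates `dropForever` next, which drops any data.

`dropForever` terminates the rule chain, effectively overriding the default `loadForever` rule, which is never reached in this rule chain.

Note that in this tutorial we defined a load rule on a specific interval.

If you instead want to retain data based on how old it is (for example, retain data ranging from 3 months in the past to the present), you should define a Period Load Rule.

### Further reading

[Load rules](../operations/retainingOrDropData.md)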
|
||||
@@ -1,145 +0,0 @@
|
||||
<!-- toc -->
|
||||
|
||||
|
||||
|
||||
## Updating existing data

This tutorial demonstrates how to update existing data, showing both overwrites and appends.

This tutorial assumes that you have already downloaded Druid as described in [Single-server deployment](../GettingStarted/chapter-3.md) and have it running on your local machine.

It will also be helpful to have completed [Loading a local file](tutorial-batch.md), [Querying data](./chapter-4.md), and [Roll-up](./chapter-5.md).

### Overwriting data

This section shows how to overwrite the existing data for a specific interval.

#### Loading the initial data

The ingestion task spec for this section is located at `quickstart/tutorial/updates-init-index.json`. This spec creates a datasource named `updates-tutorial` from the `quickstart/tutorial/updates-data.json` input file.

Submit the task:

```bash
bin/post-index-task --file quickstart/tutorial/updates-init-index.json --url http://localhost:8081
```
|
||||
|
||||
We have three initial rows containing an "animal" dimension and a "number" metric:

```bash
dsql> select * from "updates-tutorial";
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time                   │ animal   │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T01:01:00.000Z │ tiger    │     1 │    100 │
│ 2018-01-01T03:01:00.000Z │ aardvark │     1 │     42 │
│ 2018-01-01T03:01:00.000Z │ giraffe  │     1 │  14124 │
└──────────────────────────┴──────────┴───────┴────────┘
Retrieved 3 rows in 1.42s.
```
|
||||
#### Overwriting the initial data

To overwrite this data, we can submit another task for the same interval, but with different input data.

The `quickstart/tutorial/updates-overwrite-index.json` spec overwrites the data in the `updates-tutorial` datasource.

Note that this task reads its input from `quickstart/tutorial/updates-data2.json`, and `appendToExisting` is set to `false` (indicating that this is an overwrite).

Submit the task:

```bash
bin/post-index-task --file quickstart/tutorial/updates-overwrite-index.json --url http://localhost:8081
```
|
||||
|
||||
When Druid finishes loading the new segment from this overwrite task, the "tiger" row now has the value "lion", the "aardvark" row has a different number, and the "giraffe" row has been replaced. It may take a few minutes for the changes to take effect:

```bash
dsql> select * from "updates-tutorial";
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time                   │ animal   │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T01:01:00.000Z │ lion     │     1 │    100 │
│ 2018-01-01T03:01:00.000Z │ aardvark │     1 │   9999 │
│ 2018-01-01T04:01:00.000Z │ bear     │     1 │    111 │
└──────────────────────────┴──────────┴───────┴────────┘
Retrieved 3 rows in 0.02s.
```
|
||||
|
||||
### Combining old data with new data and overwriting

Now let's try appending some new data to the `updates-tutorial` datasource. We will add the new data from `quickstart/tutorial/updates-data3.json`.

The `quickstart/tutorial/updates-append-index.json` task spec is configured to read from the existing `updates-tutorial` datasource and from the `quickstart/tutorial/updates-data3.json` file. The task combines the data from both input sources and then overwrites the original data with the new combined data.

Submit the task:

```bash
bin/post-index-task --file quickstart/tutorial/updates-append-index.json --url http://localhost:8081
```
|
||||
|
||||
When Druid finishes loading the new segment from this overwrite task, the new rows will have been added to the datasource. Note that roll-up occurred for the "lion" row:

```bash
dsql> select * from "updates-tutorial";
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time                   │ animal   │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T01:01:00.000Z │ lion     │     2 │    400 │
│ 2018-01-01T03:01:00.000Z │ aardvark │     1 │   9999 │
│ 2018-01-01T04:01:00.000Z │ bear     │     1 │    111 │
│ 2018-01-01T05:01:00.000Z │ mongoose │     1 │    737 │
│ 2018-01-01T06:01:00.000Z │ snake    │     1 │   1234 │
│ 2018-01-01T07:01:00.000Z │ octopus  │     1 │    115 │
└──────────────────────────┴──────────┴───────┴────────┘
Retrieved 6 rows in 0.02s.
```
|
||||
|
||||
### Appending to the data

Now let's try another way of appending data.

The `quickstart/tutorial/updates-append-index2.json` task spec reads data from the `quickstart/tutorial/updates-data4.json` file and appends it to the `updates-tutorial` datasource. Note that `appendToExisting` is set to `true` in this spec.

Submit the task:

```bash
bin/post-index-task --file quickstart/tutorial/updates-append-index2.json --url http://localhost:8081
```
|
||||
|
||||
After the new data is loaded, we can see two additional rows after "octopus". Note that the new "bear" row with number 222 has not been rolled up with the existing bear-111 row, because the new data is held in a separate segment.

```bash
dsql> select * from "updates-tutorial";
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time                   │ animal   │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T01:01:00.000Z │ lion     │     2 │    400 │
│ 2018-01-01T03:01:00.000Z │ aardvark │     1 │   9999 │
│ 2018-01-01T04:01:00.000Z │ bear     │     1 │    111 │
│ 2018-01-01T05:01:00.000Z │ mongoose │     1 │    737 │
│ 2018-01-01T06:01:00.000Z │ snake    │     1 │   1234 │
│ 2018-01-01T07:01:00.000Z │ octopus  │     1 │    115 │
│ 2018-01-01T04:01:00.000Z │ bear     │     1 │    222 │
│ 2018-01-01T09:01:00.000Z │ falcon   │     1 │   1241 │
└──────────────────────────┴──────────┴───────┴────────┘
Retrieved 8 rows in 0.02s.
```
|
||||
|
||||
When we run a GroupBy query instead of a `select *`, we can see that the "bear" rows are grouped together at query time:

```bash
dsql> select __time, animal, SUM("count"), SUM("number") from "updates-tutorial" group by __time, animal;
┌──────────────────────────┬──────────┬────────┬────────┐
│ __time                   │ animal   │ EXPR$2 │ EXPR$3 │
├──────────────────────────┼──────────┼────────┼────────┤
│ 2018-01-01T01:01:00.000Z │ lion     │      2 │    400 │
│ 2018-01-01T03:01:00.000Z │ aardvark │      1 │   9999 │
│ 2018-01-01T04:01:00.000Z │ bear     │      2 │    333 │
│ 2018-01-01T05:01:00.000Z │ mongoose │      1 │    737 │
│ 2018-01-01T06:01:00.000Z │ snake    │      1 │   1234 │
│ 2018-01-01T07:01:00.000Z │ octopus  │      1 │    115 │
│ 2018-01-01T09:01:00.000Z │ falcon   │      1 │   1241 │
└──────────────────────────┴──────────┴────────┴────────┘
Retrieved 7 rows in 0.23s.
```
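The query-time grouping can be illustrated outside Druid with a minimal Python sketch. It is not how Druid executes the query; it only reproduces the arithmetic of the GroupBy on the eight stored rows listed in the `select *` output. The names `ROWS` and `group_by_time_and_animal` are hypothetical and exist only for this illustration.

```python
# Stored rows copied from the select * output: (__time, animal, count, number).
ROWS = [
    ("2018-01-01T01:01:00.000Z", "lion", 2, 400),
    ("2018-01-01T03:01:00.000Z", "aardvark", 1, 9999),
    ("2018-01-01T04:01:00.000Z", "bear", 1, 111),
    ("2018-01-01T05:01:00.000Z", "mongoose", 1, 737),
    ("2018-01-01T06:01:00.000Z", "snake", 1, 1234),
    ("2018-01-01T07:01:00.000Z", "octopus", 1, 115),
    ("2018-01-01T04:01:00.000Z", "bear", 1, 222),  # appended, lives in a separate segment
    ("2018-01-01T09:01:00.000Z", "falcon", 1, 1241),
]


def group_by_time_and_animal(rows):
    """Sum count and number per (__time, animal), like the GroupBy query."""
    totals = {}
    for time, animal, count, number in rows:
        prev_count, prev_number = totals.get((time, animal), (0, 0))
        totals[(time, animal)] = (prev_count + count, prev_number + number)
    return totals


if __name__ == "__main__":
    for key, value in sorted(group_by_time_and_animal(ROWS).items()):
        print(key, value)
```

The two stored "bear" rows stay separate on disk, but they share the same `__time` and `animal`, so the grouping collapses them into one result row.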
|
||||
@@ -2,8 +2,8 @@
|
||||
|
||||
This tutorial explains how to query data in Apache Druid using SQL.
|
||||
|
||||
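It assumes that you have completed the [Quickstart](../tutorials/index.md) or the related content of the content in the pages below. Before you can run queries in Apache Druid,
you need to import the injection into Druid before you can proceed to the next step: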
|
||||
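It assumes that you have completed the [Quickstart](../tutorials/index.md) or the related content in the pages below. Before you can run queries in Apache Druid,
you need to import data into Druid before you can proceed to the next step: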
|
||||
|
||||
* [Tutorial: Loading a file](../tutorials/tutorial-batch.md)
* [Tutorial: Loading stream data from Kafka](../tutorials/tutorial-kafka.md)
|
||||
@@ -93,7 +93,7 @@ The WHERE clause will be shown in your query.
|
||||
|
||||

|
||||
|
||||
> Another way to view the explain plan is by adding EXPLAIN PLAN FOR to the front of your query, as follows:
|
||||
> Another way to view the query plan as plain-text JSON is to add EXPLAIN PLAN FOR to the front of your query, as follows:
|
||||
>
|
||||
>```sql
|
||||
>EXPLAIN PLAN FOR
|
||||
@@ -106,8 +106,7 @@ The WHERE clause will be shown in your query.
|
||||
>GROUP BY 1, 2
|
||||
>ORDER BY "Edits" DESC
|
||||
>```
|
||||
>This is particularly useful when running queries
|
||||
from the command line or over HTTP.
|
||||
>This is particularly useful when running queries from a command-line tool.
|
||||
|
||||
|
||||
11. Finally, click `...` and select **Edit context** to see the additional parameters you can add to control the execution of the query.
|
||||
|
||||
@@ -1,96 +1,85 @@
|
||||
---
|
||||
id: tutorial-retention
|
||||
title: "Tutorial: Configuring data retention"
|
||||
sidebar_label: "Configuring data retention"
|
||||
---
|
||||
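# Data retention rules
This tutorial explains how to configure data retention rules on a datasource. Retention rules define the time intervals of data that will be retained or dropped.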
|
||||
|
||||
<!--
|
||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
||||
~ or more contributor license agreements. See the NOTICE file
|
||||
~ distributed with this work for additional information
|
||||
~ regarding copyright ownership. The ASF licenses this file
|
||||
~ to you under the Apache License, Version 2.0 (the
|
||||
~ "License"); you may not use this file except in compliance
|
||||
~ with the License. You may obtain a copy of the License at
|
||||
~
|
||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
||||
~
|
||||
~ Unless required by applicable law or agreed to in writing,
|
||||
~ software distributed under the License is distributed on an
|
||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
~ KIND, either express or implied. See the License for the
|
||||
~ specific language governing permissions and limitations
|
||||
~ under the License.
|
||||
-->
|
||||
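!> Note that the Chinese edition of this page translates "dropped" as `卸载` (unloaded). Druid removes dropped data from the loaded segments, so if you still need that data you will have to load it again.

This tutorial assumes that you have completed the [Quickstart](../tutorials/index.md) or the related content in the pages below, and that your Druid instance is running on your local machine.

It will also help you follow this tutorial if you have read the following pages first:

* [Tutorial: Loading a file](../tutorials/tutorial-batch.md)
* [Tutorial: Querying data](../tutorials/tutorial-query.md)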
|
||||
|
||||
|
||||
This tutorial demonstrates how to configure retention rules on a datasource to set the time intervals of data that will be retained or dropped.
|
||||
## Load the example data
|
||||
|
||||
For this tutorial, we'll assume you've already downloaded Apache Druid as described in
|
||||
the [single-machine quickstart](index.html) and have it running on your local machine.
|
||||
In this tutorial, we will use the Wikipedia edits sample data, with an ingestion task spec that creates a separate segment for each hour of the input data.
|
||||
|
||||
It will also be helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md) and [Tutorial: Querying data](../tutorials/tutorial-query.md).
|
||||
|
||||
## Load the example data
|
||||
|
||||
For this tutorial, we'll be using the Wikipedia edits sample data, with an ingestion task spec that will create a separate segment for each hour in the input data.
|
||||
|
||||
The ingestion spec can be found at `quickstart/tutorial/retention-index.json`. Let's submit that spec, which will create a datasource called `retention-tutorial`:
|
||||
The ingestion spec is located at `quickstart/tutorial/retention-index.json`. Let's submit this spec, which will create a datasource named `retention-tutorial`:
|
||||
|
||||
```bash
|
||||
bin/post-index-task --file quickstart/tutorial/retention-index.json --url http://localhost:8081
|
||||
```
|
||||
|
||||
After the ingestion completes, go to [http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources) in a browser to access the Druid Console's datasource view.
|
||||
After the ingestion completes, go to [http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources) in a browser to access the datasource view of the Druid console.
|
||||
|
||||
This view shows the available datasources and a summary of the retention rules defined for each datasource:
|
||||
|
||||
This view shows the available datasources and a summary of the retention rules for each datasource:
|
||||
|
||||

|
||||
|
||||
Currently there are no rules set for the `retention-tutorial` datasource. Note that there are default rules for the cluster: load forever with 2 replicas in `_default_tier`.
|
||||
Currently, no retention rules are set for the `retention-tutorial` datasource.
|
||||
|
||||
This means that all data will be loaded regardless of timestamp, and each segment will be replicated to two Historical processes in the default tier.
|
||||
Note that the cluster has a default retention rule: load forever with 2 replicas in `_default_tier`.
|
||||
|
||||
In this tutorial, we will ignore the tiering and redundancy concepts for now.
|
||||
This means that all data will be loaded regardless of timestamp, and each segment will be replicated to two Historical processes in the default tier.
|
||||
|
||||
Let's view the segments for the `retention-tutorial` datasource by clicking the "24 Segments" link next to "Fully Available".
|
||||
In this tutorial, we will ignore the tiering and redundancy concepts for now.
|
||||
|
||||
The segments view ([http://localhost:8888/unified-console.html#segments](http://localhost:8888/unified-console.html#segments)) provides information about what segments a datasource contains. The page shows that there are 24 segments, each one containing data for a specific hour of 2015-09-12:
|
||||
Let's view the segments of the `retention-tutorial` datasource by clicking the "24 Segments" link next to "Fully Available".
|
||||
|
||||
The [segments view](http://localhost:8888/unified-console.html#segments) provides information about the segments a datasource contains.
The page shows that there are 24 segments, each containing data for a specific hour of 2015-09-12:
|
||||
|
||||

|
||||
|
||||
## Set retention rules
|
||||
## Set retention rules
|
||||
|
||||
Suppose we want to drop data for the first 12 hours of 2015-09-12 and keep data for the later 12 hours of 2015-09-12.
|
||||
Suppose we want to drop the data for the first 12 hours of 2015-09-12 and keep the data for the last 12 hours of 2015-09-12.
|
||||
|
||||
Go to the [datasources view](http://localhost:8888/unified-console.html#datasources) and click the blue pencil icon next to `Cluster default: loadForever` for the `retention-tutorial` datasource.
|
||||
Go to the [datasources view](http://localhost:8888/unified-console.html#datasources) and click the blue pencil icon next to `Cluster default: loadForever` for the `retention-tutorial` datasource.
|
||||
|
||||
A rule configuration window will appear:
|
||||
A rule configuration window for the current datasource will appear:
|
||||
|
||||

|
||||
|
||||
Now click the `+ New rule` button twice.
|
||||
Click the `+ New rule` button twice.
|
||||
|
||||
In the upper rule box, select `Load` and `by interval`, and then enter `2015-09-12T12:00:00.000Z/2015-09-13T00:00:00.000Z` in field next to `by interval`. Replicas can remain at 2 in the `_default_tier`.
|
||||
In the upper rule box, select `Load` and `by interval`, then enter `2015-09-12T12:00:00.000Z/2015-09-13T00:00:00.000Z` in the field next to `by interval`.
Replicas can remain at the default of 2 in `_default_tier`.
|
||||
|
||||
In the lower rule box, select `Drop` and `forever`.
|
||||
In the lower rule box, select `Drop` and `forever`.
|
||||
|
||||
The rules should look like this:
|
||||
The rules should look like this:
|
||||
|
||||

|
||||
|
||||
Now click `Next`. The rule configuration process will ask for a user name and comment, for change logging purposes. You can enter `tutorial` for both.
|
||||
Click `Next`. The rule configuration process will ask for a user name and a comment for change-logging purposes. You can enter `tutorial` for both, or use values of your own.
|
||||
|
||||
Now click `Save`. You can see the new rules in the datasources view:
|
||||
Click `Save`. You can then see the new rules in the datasources view:
|
||||
|
||||

|
||||
|
||||
Give the cluster a few minutes to apply the rule change, and go to the [segments view](http://localhost:8888/unified-console.html#segments) in the Druid Console.
|
||||
The segments for the first 12 hours of 2015-09-12 are now gone:
|
||||
Give the cluster a few minutes to apply the rule change, then go to the [segments view](http://localhost:8888/unified-console.html#segments) in the Druid console.
The segments for the first 12 hours of 2015-09-12 should now be gone:
|
||||
|
||||

|
||||
|
||||
The resulting retention rule chain is the following:
|
||||
After the changes above, the resulting retention rule chain is the following:
|
||||
|
||||
1. loadByInterval 2015-09-12T12/2015-09-13 (12 hours)
|
||||
|
||||
@@ -98,18 +87,17 @@ The resulting retention rule chain is the following:
|
||||
|
||||
3. loadForever (default rule)
|
||||
|
||||
The rule chain is evaluated from top to bottom, with the default rule chain always added at the bottom.
|
||||
The rule chain is evaluated from top to bottom, with the default rule chain always added at the bottom.
|
||||
|
||||
The tutorial rule chain we just created loads data if it is within the specified 12 hour interval.
|
||||
The tutorial rule chain we just created loads data if it falls within the specified 12-hour interval.
|
||||
|
||||
If data is not within the 12 hour interval, the rule chain evaluates `dropForever` next, which will drop any data.
|
||||
If data is not within the 12-hour interval, the rule chain evaluates `dropForever` next, which drops any data.
|
||||
|
||||
The `dropForever` terminates the rule chain, effectively overriding the default `loadForever` rule, which will never be reached in this rule chain.
|
||||
`dropForever` terminates the rule chain, effectively overriding the default `loadForever` rule, which is never reached in this rule chain.
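The top-to-bottom, first-match evaluation can be sketched in Python. This is a simplified, hypothetical model rather than Druid's implementation: it matches on a segment's start time only, and the names `RULES` and `evaluate` are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Simplified, hypothetical model of the tutorial's rule chain.
RULES = [
    {"type": "loadByInterval",
     "start": datetime(2015, 9, 12, 12, tzinfo=UTC),
     "end": datetime(2015, 9, 13, 0, tzinfo=UTC)},
    {"type": "dropForever"},
    {"type": "loadForever"},  # cluster default, always appended last
]


def evaluate(segment_start, rules=RULES):
    """Return (action, rule_index) for the first rule that matches."""
    for index, rule in enumerate(rules):
        if rule["type"] == "loadByInterval":
            # Half-open interval: start <= t < end.
            if rule["start"] <= segment_start < rule["end"]:
                return "load", index
        elif rule["type"] == "dropForever":
            return "drop", index
        elif rule["type"] == "loadForever":
            return "load", index
    return "unmatched", None


if __name__ == "__main__":
    day = datetime(2015, 9, 12, tzinfo=UTC)
    for hour in range(24):
        print(hour, evaluate(day + timedelta(hours=hour)))
```

In this model the twelve hourly segments before 12:00 fall through to `dropForever`, the twelve from 12:00 onward match `loadByInterval`, and the rule at index 2 is never returned.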
|
||||
|
||||
Note that in this tutorial we defined a load rule on a specific interval.
|
||||
Note that in this tutorial we defined a load rule on a specific interval.
|
||||
|
||||
If instead you want to retain data based on how old it is (e.g., retain data that ranges from 3 months in the past to the present time), you would define a Period load rule instead.
|
||||
If you instead want to retain data based on how old it is (for example, retain data from the past 3 months up to the present), you should define a Period load rule.
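The difference from a fixed interval is that a period is measured back from the current time, so the retained window moves as time passes. A minimal Python sketch of that idea follows, under loud assumptions: 3 months is approximated as 90 days, only the segment's start time is checked, and `within_period` is a hypothetical helper, not a Druid API.

```python
from datetime import datetime, timedelta, timezone


def within_period(segment_start, now, period=timedelta(days=90)):
    """True when segment_start lies in the window [now - period, now].

    Assumption: 3 months is approximated as 90 days for illustration.
    """
    return now - period <= segment_start <= now


if __name__ == "__main__":
    now = datetime(2015, 12, 1, tzinfo=timezone.utc)
    print(within_period(datetime(2015, 9, 12, tzinfo=timezone.utc), now))
    print(within_period(datetime(2015, 6, 1, tzinfo=timezone.utc), now))
```

Evaluating the same segment with a later `now` can change the answer, which is the behavior a fixed-interval rule cannot express.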
|
||||
|
||||
## Further reading
|
||||
|
||||
* [Load rules](../operations/rule-configuration.md)
|
||||
## Further reading

* [Load rules](../operations/rule-configuration.md)
|
||||
+31
-190
@@ -1,16 +1,21 @@
|
||||
# Roll-up
|
||||
Apache Druid can summarize raw data at ingestion time using a process we refer to as "roll-up". Roll-up is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.
|
||||
|
||||
This tutorial will demonstrate the effects of roll-up on an example dataset.
|
||||
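Apache Druid can summarize raw data at ingestion time using a process we refer to as "roll-up".
Roll-up is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.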
|
||||
|
||||
For this tutorial, we'll assume you've already downloaded Druid as described in
|
||||
the [single-machine quickstart](index.html) and have it running on your local machine.
|
||||
This tutorial demonstrates the effects of roll-up on an example dataset.
|
||||
|
||||
It will also be helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md) and [Tutorial: Querying data](../tutorials/tutorial-query.md).
|
||||
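This tutorial assumes that you have completed the [Quickstart](../tutorials/index.md) or the related content in the pages below, and that your Druid instance is running on your local machine.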
|
||||
|
||||
## Example data
|
||||
|
||||
For this tutorial, we'll use a small sample of network flow event data, representing packet and byte counts for traffic from a source to a destination IP address that occurred within a particular second.
|
||||
It will also help you understand roll-up if you have read the following pages first:

* [Tutorial: Loading a file](../tutorials/tutorial-batch.md)
* [Tutorial: Querying data](../tutorials/tutorial-query.md)
|
||||
|
||||
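## Example data

For this tutorial, we'll use a small sample of network flow event data. As shown in the data below, each event records the packet and byte counts for traffic from a source to a destination IP address that occurred within a particular time.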
|
||||
|
||||
```json
|
||||
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
|
||||
@@ -24,9 +29,9 @@ For this tutorial, we'll use a small sample of network flow event data, represen
|
||||
{"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
|
||||
```
|
||||
|
||||
A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`.
|
||||
The JSON file containing this sample data is located at `quickstart/tutorial/rollup-data.json`.
|
||||
|
||||
We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`.
|
||||
We'll ingest the JSON data above into Druid using the ingestion task spec shown below, which is located at `quickstart/tutorial/rollup-index.json`.
|
||||
|
||||
```json
|
||||
{
|
||||
@@ -78,25 +83,25 @@ We'll ingest this data using the following ingestion task spec, located at `quic
|
||||
}
|
||||
```
|
||||
|
||||
Roll-up has been enabled by setting `"rollup" : true` in the `granularitySpec`.
|
||||
Roll-up is enabled by setting `"rollup" : true` in the `granularitySpec`.
|
||||
|
||||
Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`.
|
||||
Note that `srcIP` and `dstIP` are defined as **dimensions**, a longSum **metric** is defined for the `packets` and `bytes` columns, and `queryGranularity` is set to `minute`.
|
||||
|
||||
We will see how these definitions are used after we load this data.
|
||||
We will see how these definitions are used after we load this data.
|
||||
|
||||
## Load the example data
|
||||
## Load the example data
|
||||
|
||||
From the apache-druid-0.21.1 package root, run the following command:
From the root directory of the apache-druid-0.21.1 package, run the following command:
|
||||
|
||||
```bash
|
||||
bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
|
||||
```
|
||||
|
||||
After the script completes, we will query the data.
|
||||
After the script above completes, we will query the data.
|
||||
|
||||
## Query the example data
|
||||
## Query the example data
|
||||
|
||||
Let's run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see what data was ingested.
|
||||
Let's run the `bin/dsql` command-line tool and issue a `select * from "rollup-tutorial";` query to see what data was ingested into Druid.
|
||||
|
||||
```bash
|
||||
$ bin/dsql
|
||||
@@ -117,7 +122,7 @@ Retrieved 5 rows in 1.18s.
|
||||
dsql>
|
||||
```
|
||||
|
||||
Let's look at the three events in the original input data that occurred during `2018-01-01T01:01`:
|
||||
Let's look at the three events in the original input data that occurred during `2018-01-01T01:01`:
|
||||
|
||||
```json
|
||||
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
|
||||
@@ -125,7 +130,7 @@ Let's look at the three events in the original input data that occurred during `
|
||||
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
|
||||
```
|
||||
|
||||
These three rows have been "rolled up" into the following row:
|
||||
The three original rows above have been "rolled up" into the following single row:
|
||||
|
||||
```bash
|
||||
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
|
||||
@@ -134,8 +139,12 @@ These three rows have been "rolled up" into the following row:
|
||||
│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │
|
||||
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
|
||||
```
|
||||
The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}`, with sum aggregations on the metric columns `{packets, bytes}`.

Before the grouping occurs, the timestamps of the original input data are bucketed (floored) by minute, due to the `"queryGranularity":"minute"` setting in the ingestion spec.

Likewise, the events that occurred during `2018-01-01T01:02` have been rolled up.
|
||||
|
||||
The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`.
|
||||
|
||||
Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the `"queryGranularity":"minute"` setting in the ingestion spec.
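The floor-then-group behavior can be reproduced with a short Python sketch. It is an illustration of the arithmetic only, not Druid's ingestion code: the nine events are the tutorial's sample data, while `EVENTS` and `roll_up` are hypothetical names used for this example.

```python
from collections import defaultdict

# Sample events from rollup-data.json: (timestamp, srcIP, dstIP, packets, bytes).
EVENTS = [
    ("2018-01-01T01:01:35Z", "1.1.1.1", "2.2.2.2", 20, 9024),
    ("2018-01-01T01:01:51Z", "1.1.1.1", "2.2.2.2", 255, 21133),
    ("2018-01-01T01:01:59Z", "1.1.1.1", "2.2.2.2", 11, 5780),
    ("2018-01-01T01:02:14Z", "1.1.1.1", "2.2.2.2", 38, 6289),
    ("2018-01-01T01:02:29Z", "1.1.1.1", "2.2.2.2", 377, 359971),
    ("2018-01-01T01:03:29Z", "1.1.1.1", "2.2.2.2", 49, 10204),
    ("2018-01-02T21:33:14Z", "7.7.7.7", "8.8.8.8", 38, 6289),
    ("2018-01-02T21:33:45Z", "7.7.7.7", "8.8.8.8", 123, 93999),
    ("2018-01-02T21:35:45Z", "7.7.7.7", "8.8.8.8", 12, 2818),
]


def roll_up(events):
    """Floor each timestamp to the minute, then sum metrics per group."""
    totals = defaultdict(lambda: {"count": 0, "packets": 0, "bytes": 0})
    for timestamp, src_ip, dst_ip, packets, byte_count in events:
        # "2018-01-01T01:01:35Z" -> "2018-01-01T01:01:00.000Z"
        minute = timestamp[:16] + ":00.000Z"
        row = totals[(minute, src_ip, dst_ip)]
        row["count"] += 1
        row["packets"] += packets
        row["bytes"] += byte_count
    return dict(totals)


if __name__ == "__main__":
    for key, row in sorted(roll_up(EVENTS).items()):
        print(key, row)
```

The nine input events collapse into five groups, and `count` records how many input rows fed each group.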
|
||||
|
||||
@@ -154,7 +163,7 @@ Likewise, these two events that occurred during `2018-01-01T01:02` have been rol
|
||||
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
|
||||
```
|
||||
|
||||
For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no roll-up took place, because this was the only event that occurred during `2018-01-01T01:03`:
|
||||
For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no roll-up took place, because this was the only event that occurred during `2018-01-01T01:03`:
|
||||
|
||||
```json
|
||||
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
|
||||
@@ -168,172 +177,4 @@ For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no roll-up too
|
||||
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
|
||||
```
|
||||
|
||||
Note that the `count` metric shows how many rows in the original input data contributed to the final "rolled up" row.
|
||||
|
||||
|
||||
## Roll-up
|
||||
|
||||
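Apache Druid can summarize raw data at ingestion time through roll-up. Roll-up is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.

This tutorial demonstrates the effects of roll-up on an example dataset.

This tutorial assumes that you have already downloaded Druid as described in [Single-server deployment](../GettingStarted/chapter-3.md) and have it running on your local machine.

It will also be helpful to have completed [Loading a local file](tutorial-batch.md) and [Querying data](./chapter-4.md).

### Example data

For this tutorial, we'll use a small sample of network flow event data, representing packet and byte counts for traffic from a source to a destination IP address that occurred within a particular time.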
|
||||
|
||||
```json
|
||||
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
|
||||
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
|
||||
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
|
||||
{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
|
||||
{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
|
||||
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
|
||||
{"timestamp":"2018-01-02T21:33:14Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":38,"bytes":6289}
|
||||
{"timestamp":"2018-01-02T21:33:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":123,"bytes":93999}
|
||||
{"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
|
||||
```
|
||||
A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`.

We'll ingest this data using the ingestion task spec located at `quickstart/tutorial/rollup-index.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"type" : "index_parallel",
|
||||
"spec" : {
|
||||
"dataSchema" : {
|
||||
"dataSource" : "rollup-tutorial",
|
||||
"dimensionsSpec" : {
|
||||
"dimensions" : [
|
||||
"srcIP",
|
||||
"dstIP"
|
||||
]
|
||||
},
|
||||
"timestampSpec": {
|
||||
"column": "timestamp",
|
||||
"format": "iso"
|
||||
},
|
||||
"metricsSpec" : [
|
||||
{ "type" : "count", "name" : "count" },
|
||||
{ "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
|
||||
{ "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
|
||||
],
|
||||
"granularitySpec" : {
|
||||
"type" : "uniform",
|
||||
"segmentGranularity" : "week",
|
||||
"queryGranularity" : "minute",
|
||||
"intervals" : ["2018-01-01/2018-01-03"],
|
||||
"rollup" : true
|
||||
}
|
||||
},
|
||||
"ioConfig" : {
|
||||
"type" : "index_parallel",
|
||||
"inputSource" : {
|
||||
"type" : "local",
|
||||
"baseDir" : "quickstart/tutorial",
|
||||
"filter" : "rollup-data.json"
|
||||
},
|
||||
"inputFormat" : {
|
||||
"type" : "json"
|
||||
},
|
||||
"appendToExisting" : false
|
||||
},
|
||||
"tuningConfig" : {
|
||||
"type" : "index_parallel",
|
||||
"maxRowsPerSegment" : 5000000,
|
||||
"maxRowsInMemory" : 25000
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Roll-up is enabled by setting `"rollup" : true` in the `granularitySpec`.

Note that `srcIP` and `dstIP` are defined as **dimensions**, a `longSum` **metric** is defined for the `packets` and `bytes` columns, and `queryGranularity` is set to `minute`.

We will see how these definitions are used after we load this data.

### Load the example data

From the Druid package root, run the following command:
|
||||
|
||||
```bash
|
||||
bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
|
||||
```
|
||||
|
||||
After the script completes, we will query the data.

### Query the example data

Now run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see what data was ingested.
|
||||
|
||||
```bash
|
||||
$ bin/dsql
|
||||
Welcome to dsql, the command-line client for Druid SQL.
|
||||
Type "\h" for help.
|
||||
dsql> select * from "rollup-tutorial";
|
||||
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
|
||||
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
|
||||
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
|
||||
│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │
|
||||
│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │
|
||||
│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │
|
||||
│ 2018-01-02T21:33:00.000Z │ 100288 │ 2 │ 8.8.8.8 │ 161 │ 7.7.7.7 │
|
||||
│ 2018-01-02T21:35:00.000Z │ 2818 │ 1 │ 8.8.8.8 │ 12 │ 7.7.7.7 │
|
||||
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
|
||||
Retrieved 5 rows in 1.18s.
|
||||
|
||||
dsql>
|
||||
```
|
||||
|
||||
Let's look at the three events in the original input data that occurred during `2018-01-01T01:01`:
|
||||
|
||||
```json
|
||||
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
|
||||
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
|
||||
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
|
||||
```
|
||||
These three rows have been rolled up into the following row:
|
||||
|
||||
```bash
|
||||
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
|
||||
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
|
||||
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
|
||||
│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │
|
||||
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
|
||||
```
|
||||
|
||||
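The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}`, with sum aggregations on the metric columns `{packets, bytes}`.

Before the grouping occurs, the timestamps of the original input data are bucketed (floored) by minute, due to the `"queryGranularity":"minute"` setting in the ingestion spec.

Likewise, the two events that occurred during `2018-01-01T01:02` have been rolled up: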
|
||||
|
||||
```json
|
||||
{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
|
||||
{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
|
||||
```
|
||||
```bash
|
||||
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
|
||||
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
|
||||
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
|
||||
│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │
|
||||
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
|
||||
```
|
||||
|
||||
The last event recording traffic between 1.1.1.1 and 2.2.2.2 was not rolled up, because it was the only event that occurred during `2018-01-01T01:03`.
|
||||
|
||||
```json
|
||||
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
|
||||
```
|
||||
```json
|
||||
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
|
||||
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
|
||||
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
|
||||
│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │
|
||||
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
|
||||
```
|
||||
|
||||
Note that the `count` metric shows how many rows in the raw input data contributed to the final "rolled up" row.
|
||||
The `count` metric column shows how many records in the raw data were merged (rolled up) into each final row.
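
The roll-up described above can be reproduced outside Druid as a minimal Python sketch. It uses the six raw events quoted in this section; the helper names `floor_to_minute` and `roll_up` are illustrative assumptions and are not part of Druid.

```python
from collections import defaultdict
from datetime import datetime

# The six raw events quoted in this section.
RAW_EVENTS = [
    {"timestamp": "2018-01-01T01:01:35Z", "srcIP": "1.1.1.1", "dstIP": "2.2.2.2", "packets": 20, "bytes": 9024},
    {"timestamp": "2018-01-01T01:01:51Z", "srcIP": "1.1.1.1", "dstIP": "2.2.2.2", "packets": 255, "bytes": 21133},
    {"timestamp": "2018-01-01T01:01:59Z", "srcIP": "1.1.1.1", "dstIP": "2.2.2.2", "packets": 11, "bytes": 5780},
    {"timestamp": "2018-01-01T01:02:14Z", "srcIP": "1.1.1.1", "dstIP": "2.2.2.2", "packets": 38, "bytes": 6289},
    {"timestamp": "2018-01-01T01:02:29Z", "srcIP": "1.1.1.1", "dstIP": "2.2.2.2", "packets": 377, "bytes": 359971},
    {"timestamp": "2018-01-01T01:03:29Z", "srcIP": "1.1.1.1", "dstIP": "2.2.2.2", "packets": 49, "bytes": 10204},
]


def floor_to_minute(timestamp):
    """Mimic "queryGranularity": "minute" by dropping the seconds (hypothetical helper)."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(second=0).strftime("%Y-%m-%dT%H:%M:%S.000Z")


def roll_up(events):
    """Group by (floored time, srcIP, dstIP); sum the metrics and count the input rows."""
    groups = defaultdict(lambda: {"count": 0, "packets": 0, "bytes": 0})
    for event in events:
        key = (floor_to_minute(event["timestamp"]), event["srcIP"], event["dstIP"])
        groups[key]["count"] += 1
        groups[key]["packets"] += event["packets"]
        groups[key]["bytes"] += event["bytes"]
    return [
        {"__time": key[0], "srcIP": key[1], "dstIP": key[2], **metrics}
        for key, metrics in sorted(groups.items())
    ]


for row in roll_up(RAW_EVENTS):
    print(row)
```

The first row printed carries `count` 3, `packets` 286 and `bytes` 35937, matching the rolled-up row for `2018-01-01T01:01` shown in the query output.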
|
||||
|
||||
@@ -1,53 +1,30 @@
|
||||
---
|
||||
id: tutorial-update-data
|
||||
title: "Tutorial: Updating existing data"
|
||||
sidebar_label: "Updating existing data"
|
||||
---
|
||||
# Updating data
|
||||
This page explains how to update existing data, demonstrating two kinds of update: overwrites and appends.
|
||||
|
||||
<!--
|
||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
||||
~ or more contributor license agreements. See the NOTICE file
|
||||
~ distributed with this work for additional information
|
||||
~ regarding copyright ownership. The ASF licenses this file
|
||||
~ to you under the Apache License, Version 2.0 (the
|
||||
~ "License"); you may not use this file except in compliance
|
||||
~ with the License. You may obtain a copy of the License at
|
||||
~
|
||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
||||
~
|
||||
~ Unless required by applicable law or agreed to in writing,
|
||||
~ software distributed under the License is distributed on an
|
||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
~ KIND, either express or implied. See the License for the
|
||||
~ specific language governing permissions and limitations
|
||||
~ under the License.
|
||||
-->
|
||||
This tutorial assumes you have completed the [quickstart](../tutorials/index.md) or the related pages listed below, and that your Druid instance is running on your local machine.
|
||||
|
||||
It will also be easier to follow this material on updating data if you have finished the following tutorials:
|
||||
|
||||
This tutorial demonstrates how to update existing data, showing both overwrites and appends.
|
||||
* [Tutorial: Loading a file](../tutorials/tutorial-batch.md)
|
||||
* [Tutorial: Querying data](../tutorials/tutorial-query.md)
|
||||
* [Tutorial: Rollup](../tutorials/tutorial-rollup.md)
|
||||
|
||||
## Overwrite
|
||||
This section of the tutorial shows how to overwrite an existing interval of data.
|
||||
|
||||
For this tutorial, we'll assume you've already downloaded Apache Druid as described in
|
||||
the [single-machine quickstart](index.html) and have it running on your local machine.
|
||||
### Load initial data
|
||||
|
||||
It will also be helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md), [Tutorial: Querying data](../tutorials/tutorial-query.md), and [Tutorial: Rollup](../tutorials/tutorial-rollup.md).
|
||||
Let's first load some raw data as the initial data set, which we will then overwrite and append to.
|
||||
|
||||
## Overwrite
|
||||
The ingestion spec used in this tutorial is located at `quickstart/tutorial/updates-init-index.json`. This spec ingests the `quickstart/tutorial/updates-data.json` input file and creates a datasource called `updates-tutorial`.
|
||||
|
||||
This section of the tutorial will cover how to overwrite an existing interval of data.
|
||||
|
||||
### Load initial data
|
||||
|
||||
Let's load an initial data set which we will overwrite and append to.
|
||||
|
||||
The spec we'll use for this tutorial is located at `quickstart/tutorial/updates-init-index.json`. This spec creates a datasource called `updates-tutorial` from the `quickstart/tutorial/updates-data.json` input file.
|
||||
|
||||
Let's submit that task:
|
||||
Let's submit this task:
|
||||
|
||||
```bash
|
||||
bin/post-index-task --file quickstart/tutorial/updates-init-index.json --url http://localhost:8081
|
||||
```
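
As an alternative to the `bin/post-index-task` script, a task spec can in principle be submitted over HTTP directly. The following is a minimal Python sketch, assuming the quickstart's combined Coordinator-Overlord listens on `http://localhost:8081` and accepts task specs at the Overlord endpoint `/druid/indexer/v1/task`. The helper names `build_task_request` and `submit_task` are hypothetical, and unlike the script this sketch does not wait for the task to finish.

```python
import json
import urllib.request


def build_task_request(spec, base_url="http://localhost:8081"):
    """Build (but do not send) a POST request carrying the task spec as JSON."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/indexer/v1/task",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(spec_path, base_url="http://localhost:8081"):
    """Read a spec file, submit it, and return the parsed JSON response."""
    with open(spec_path, encoding="utf-8") as spec_file:
        spec = json.load(spec_file)
    with urllib.request.urlopen(build_task_request(spec, base_url)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running Druid instance; the path is the one used in this tutorial.
    print(submit_task("quickstart/tutorial/updates-init-index.json"))
```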
|
||||
|
||||
We have three initial rows containing an "animal" dimension and "number" metric:
|
||||
Once the task completes, we can see the initial rows, which contain an "animal" dimension and a "number" metric:
|
||||
|
||||
```bash
|
||||
dsql> select * from "updates-tutorial";
|
||||
@@ -61,21 +38,23 @@ dsql> select * from "updates-tutorial";
|
||||
Retrieved 3 rows in 1.42s.
|
||||
```
|
||||
|
||||
### Overwrite the initial data
|
||||
### Overwrite the initial data
|
||||
|
||||
To overwrite this data, we can submit another task for the same interval, but with different input data.
|
||||
To overwrite the initial data, we can submit another task for the same interval, but with different input data.
|
||||
|
||||
The `quickstart/tutorial/updates-overwrite-index.json` spec will perform an overwrite on the `updates-tutorial` datasource.
|
||||
The `quickstart/tutorial/updates-overwrite-index.json` spec defines how to overwrite the `updates-tutorial` datasource.
|
||||
|
||||
Note that this task reads input from `quickstart/tutorial/updates-data2.json`, and `appendToExisting` is set to `false` (indicating this is an overwrite).
|
||||
Note that this ingestion spec reads its input from the `quickstart/tutorial/updates-data2.json` data file, and that `appendToExisting` is set to `false` in the spec
|
||||
(this setting is what makes the ingestion an overwrite).
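
The difference between the two values of `appendToExisting` can be pictured with a small Python model. This is a deliberately simplified sketch, not how Druid manages segments internally: the function `apply_ingestion`, the interval string, and all numbers except the "bear" value of 111 are assumptions made for illustration.

```python
def apply_ingestion(segments_by_interval, interval, rows, append_to_existing):
    """Return a new interval -> segments mapping after one ingestion task.

    Overwrite (append_to_existing=False) replaces every segment in the interval;
    append (append_to_existing=True) adds one more segment next to the old ones.
    """
    updated = {key: [list(seg) for seg in segs] for key, segs in segments_by_interval.items()}
    if append_to_existing:
        updated.setdefault(interval, []).append(list(rows))
    else:
        updated[interval] = [list(rows)]
    return updated


INTERVAL = "2018-01-01/2018-01-03"  # hypothetical interval

# Hypothetical numbers; only the animal names and bear=111 come from the tutorial text.
initial = apply_ingestion({}, INTERVAL, [("tiger", 1), ("aardvark", 2), ("giraffe", 3)], False)
overwritten = apply_ingestion(initial, INTERVAL, [("lion", 1), ("aardvark", 20), ("bear", 111)], False)
appended = apply_ingestion(overwritten, INTERVAL, [("bear", 222)], True)

print(len(overwritten[INTERVAL]), "segment after overwrite")
print(len(appended[INTERVAL]), "segments after append")
```

In this model the overwrite leaves a single segment holding only the new rows, while the append keeps the old segment and adds a second one beside it.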
|
||||
|
||||
Let's submit that task:
|
||||
Then let's submit this task:
|
||||
|
||||
```bash
|
||||
bin/post-index-task --file quickstart/tutorial/updates-overwrite-index.json --url http://localhost:8081
|
||||
```
|
||||
|
||||
When Druid finishes loading the new segment from this overwrite task, the "tiger" row now has the value "lion", the "aardvark" row has a different number, and the "giraffe" row has been replaced. It may take a couple of minutes for the changes to take effect:
|
||||
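When Druid finishes loading the new segment from the overwrite task, the former "tiger" row now has the value "lion", the "aardvark" row has a different number, and the "giraffe" row has been completely replaced.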
|
||||
Depending on the environment, these changes may take a few minutes to take effect:
|
||||
|
||||
```bash
|
||||
dsql> select * from "updates-tutorial";
|
||||
@@ -89,19 +68,23 @@ dsql> select * from "updates-tutorial";
|
||||
Retrieved 3 rows in 0.02s.
|
||||
```
|
||||
|
||||
## Combine old data with new data and overwrite
|
||||
## Combine new data with old data and overwrite
|
||||
|
||||
Let's try appending some new data to the `updates-tutorial` datasource now. We will add the data from `quickstart/tutorial/updates-data3.json`.
|
||||
Let's now append new data to the `updates-tutorial` datasource. We will use the data file `quickstart/tutorial/updates-data3.json`.
|
||||
|
||||
The `quickstart/tutorial/updates-append-index.json` task spec has been configured to read from the existing `updates-tutorial` datasource and the `quickstart/tutorial/updates-data3.json` file. The task will combine data from the two input sources, and then overwrite the original data with the new combined data.
|
||||
The `quickstart/tutorial/updates-append-index.json` task spec is configured to read from the `quickstart/tutorial/updates-data3.json` data file
|
||||
and from the existing `updates-tutorial` datasource at the same time, and then to update the `updates-tutorial` datasource.
|
||||
|
||||
Let's submit that task:
|
||||
This task combines the data read from the two input sources and then writes the combined data back to the datasource.
|
||||
|
||||
Then let's submit this task:
|
||||
|
||||
```bash
|
||||
bin/post-index-task --file quickstart/tutorial/updates-append-index.json --url http://localhost:8081
|
||||
```
|
||||
|
||||
When Druid finishes loading the new segment from this overwrite task, the new rows will have been added to the datasource. Note that roll-up occurred for the "lion" row:
|
||||
When Druid completes this task and creates the new segment, the new rows will have been added to the datasource.
|
||||
Note that roll-up occurred for the "lion" row:
|
||||
|
||||
```bash
|
||||
dsql> select * from "updates-tutorial";
|
||||
@@ -118,19 +101,25 @@ dsql> select * from "updates-tutorial";
|
||||
Retrieved 6 rows in 0.02s.
|
||||
```
|
||||
|
||||
## Append to the data
|
||||
## Append to the data
|
||||
|
||||
Let's try another way of appending data.
|
||||
Let's try another way of appending data.
|
||||
|
||||
The `quickstart/tutorial/updates-append-index2.json` task spec reads input from `quickstart/tutorial/updates-data4.json` and will append its data to the `updates-tutorial` datasource. Note that `appendToExisting` is set to `true` in this spec.
|
||||
The `quickstart/tutorial/updates-append-index2.json` task spec is configured to read data from the `quickstart/tutorial/updates-data4.json` file,
|
||||
and then to append that data to the `updates-tutorial` datasource.
|
||||
|
||||
Let's submit that task:
|
||||
Note that `appendToExisting` is set to `true` in this spec.
|
||||
|
||||
Then let's submit this task:
|
||||
|
||||
```bash
|
||||
bin/post-index-task --file quickstart/tutorial/updates-append-index2.json --url http://localhost:8081
|
||||
```
|
||||
|
||||
When the new data is loaded, we can see two additional rows after "octopus". Note that the new "bear" row with number 222 has not been rolled up with the existing bear-111 row, because the new data is held in a separate segment.
|
||||
Once the new data is loaded, we can see two additional rows after "octopus".
|
||||
|
||||
Note that the newly added "bear" row has the value 222, and Druid has not rolled it up with the existing "bear" row whose value is 111.
|
||||
This is because the newly added data is stored in a separate segment.
|
||||
|
||||
```bash
|
||||
dsql> select * from "updates-tutorial";
|
||||
@@ -150,7 +139,7 @@ Retrieved 8 rows in 0.02s.
|
||||
|
||||
```
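
To illustrate why the two "bear" rows stay separate in the `select *` output while an aggregating query can still combine them, here is a minimal Python sketch. The values 111 and 222 come from the text above; the shared timestamp, the `count` of 1 per row, and the helper names `select_all` and `group_by_time_animal` are assumptions for illustration only.

```python
from collections import defaultdict

ASSUMED_TIME = "2018-01-01T04:01:00.000Z"  # hypothetical timestamp shared by both rows

# Each ingestion task produced its own segment; roll-up happens only within a segment.
segment_existing = [{"__time": ASSUMED_TIME, "animal": "bear", "count": 1, "number": 111}]
segment_appended = [{"__time": ASSUMED_TIME, "animal": "bear", "count": 1, "number": 222}]
segments = [segment_existing, segment_appended]


def select_all(segments):
    """Like `select *`: return the stored rows of every segment unchanged."""
    return [row for segment in segments for row in segment]


def group_by_time_animal(segments):
    """Like a GroupBy on (__time, animal) with SUM("count") and SUM("number")."""
    totals = defaultdict(lambda: {"count": 0, "number": 0})
    for row in select_all(segments):
        key = (row["__time"], row["animal"])
        totals[key]["count"] += row["count"]
        totals[key]["number"] += row["number"]
    return [
        {"__time": key[0], "animal": key[1], **sums}
        for key, sums in sorted(totals.items())
    ]


print(select_all(segments))            # two separate "bear" rows
print(group_by_time_animal(segments))  # one "bear" row with summed metrics
```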
|
||||
|
||||
If we run a GroupBy query instead of a `select *`, we can see that the "bear" rows will group together at query time:
|
||||
If we run a GroupBy query instead of `select *`, we can see that the "bear" rows are combined at query time:
|
||||
|
||||
```bash
|
||||
dsql> select __time, animal, SUM("count"), SUM("number") from "updates-tutorial" group by __time, animal;
|
||||
|
||||