My data can’t be date partitioned, how do I use clustering?

Currently I am using the following query:
SELECT
ID,
Key
FROM
mydataset.mytable
WHERE ID = 100077113 AND Key = '06019'

mydataset.mytable

ID - unique

If the key is known, the lookup by ID only needs to scan ~10,000 rows, so it should run much faster and process much less data.

How can I use the new clustering capabilities in BigQuery to partition on the field Key?

3 Answers

(I'm going to summarize and expand on what Mikhail, Pentium10, and Pavan said)

I have a table with 12M rows and 76 GB of data. This table has no timestamp column.

This is how to cluster said table - while creating a fake date column for fake partitioning:

CREATE TABLE `fh-bigquery.public_dump.github_java_clustered` (
  id STRING,
  size INT64,
  content STRING,
  binary BOOL,
  copies INT64,
  sample_repo_name STRING,
  sample_path STRING,
  fake_date DATE
)
PARTITION BY fake_date
CLUSTER BY id AS (
  SELECT *, DATE('1980-01-01') fake_date
  FROM `fh-bigquery.github_extracts.contents_java`
)

Did it work?

# original table
SELECT *
FROM `fh-bigquery.github_extracts.contents_java`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(3.3s elapsed, 72.1 GB processed)

# clustered table
SELECT *
FROM `fh-bigquery.public_dump.github_java_clustered2`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(2.4s elapsed, 232 MB processed)
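To put the two "processed" figures side by side, here is a small arithmetic sketch. It only uses the numbers reported above; the conversion factor between GB and MB is an assumption (1024 here), and the conclusion is about the same with 1000.

```python
# Figures reported for the two queries above.
ORIGINAL_GB = 72.1     # data processed on the unclustered table
CLUSTERED_MB = 232.0   # data processed on the clustered table


def reduction_percent(original_gb: float, clustered_mb: float,
                      mb_per_gb: float = 1024.0) -> float:
    """Return the percentage drop in data processed.

    mb_per_gb is an assumption about how the sizes are reported.
    """
    original_mb = original_gb * mb_per_gb
    return 100.0 * (1.0 - clustered_mb / original_mb)


pct = reduction_percent(ORIGINAL_GB, CLUSTERED_MB)
print(f"{pct:.1f}% less data processed")
```

Either way the clustered lookup processes roughly 99.7% less data, even though the elapsed time barely changes.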

What I learned here:

Read more: https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b

I did as you recommended and created my table with four clustering columns, all of which I use in the WHERE clause. The results are very impressive. Table size: 13.28 GB. Number of rows: 12,693,413. Unclustered with LIMIT 10000: query complete (2.1s elapsed, 1.05 GB processed). Clustered with LIMIT 10000: query complete (2.8s elapsed, 91.2 MB processed). BUT with no LIMIT, over the full table: unclustered, query complete (77.5s elapsed, 1.05 GB processed); clustered, query complete (65.6s elapsed, 912 MB processed), only ~100 MB less. The huge difference shows up only with LIMIT, which would not be used in practice.
– thstart
Aug 14 at 7:24

I tried with LIMIT 12693413 but there is no difference. Now I have another problem: I tried a JOIN with another table and got an "Error: Query exceeded resource limits. 301984.67488667317 CPU seconds were used, and this query must use less than 147900.0 CPU seconds." This was one reason to test with clustered tables: to see whether the query could also finish more quickly. Using less memory would be great if it worked on the full table, but the resource limits are a problem now.
– thstart
Aug 14 at 7:31

If you don't publish your data, or at least the clustering strategy you used, it will be very hard to debug here. The question "how to cluster without a date" has been answered here. Please post new questions with additional context.
– Felipe Hoffa
Aug 14 at 12:42

Felipe - Thank you for your help! I cannot publish the data because it is from an alpha test. Is there perhaps some other way?
– thstart
Aug 14 at 18:26

My tests also stress the importance of "using a fake date instead of a null date". In my first tests I used NULL and the results were horrible (it took nearly 1 hour to create the table instead of 5 minutes when using a fake date, and then the queries barely reduced the costs)
– Sourygna
Aug 15 at 14:13

You can have one field of type DATE with a NULL value, so you will be able to partition by that field, and since the table is then partitioned you will be able to enjoy clustering.

The question was how to do it NOT on a DATE field.
– thstart
Aug 11 at 21:22

Nowadays, partitioning can be done only on the DATE type and clustering can be done only for a partitioned table, period. So I thought you were looking for a workaround! Per the documentation, Google is working on supporting clustering for non-partitioned tables, though.
– Mikhail Berlyant
Aug 11 at 23:09

medium.com/@hoffa/… Felipe Hoffa, "My data can’t be date partitioned, how do I use clustering?" Two alternatives: 1. Use an ingestion-time partitioned table with clustering on the fields of your interest. This is the preferred mechanism if you have > ~10GB of data/day. 2. If you have smaller amounts of data per day, use a column-partitioned table with clustering, partitioned on a “fake” optional date column. Just use the value NULL for it (or leave it unspecified, and BigQuery will assume it is NULL). Specify the clustering columns of interest.
– thstart
Aug 12 at 20:08

You need to recreate your table with an additional date column with all rows having NULL values. And then you set partition to the date column. This way your table is partitioned.

Once that is done, add clustering on the columns you filter by in your query. Clustering will improve processing time and reduce query costs.
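As a rough sketch of those two steps, the helper below builds a CREATE TABLE ... AS SELECT statement that adds a constant date column, partitions on it, and clusters on the chosen columns. The function name, the destination table name, and the fake_date column name are hypothetical. It uses a fixed date rather than NULL, following the advice elsewhere on this page that a fake date behaved better than NULL. It only produces the SQL text; running it is left to the BigQuery console or the bq tool.

```python
from typing import Sequence


def build_clustered_ddl(source: str, destination: str,
                        cluster_by: Sequence[str],
                        fake_date: str = "1980-01-01") -> str:
    """Build a CTAS statement that copies `source` into a clustered table.

    Every row gets the same value in a hypothetical `fake_date` column,
    so the table has a single partition and clustering does the real work.
    """
    if not 1 <= len(cluster_by) <= 4:
        # BigQuery accepts at most four clustering columns.
        raise ValueError("specify between 1 and 4 clustering columns")
    columns = ", ".join(cluster_by)
    return (
        f"CREATE TABLE `{destination}`\n"
        f"PARTITION BY fake_date\n"
        f"CLUSTER BY {columns} AS\n"
        f"SELECT *, DATE('{fake_date}') AS fake_date\n"
        f"FROM `{source}`"
    )


# Table from the question; the destination name is made up for illustration.
ddl = build_clustered_ddl("mydataset.mytable",
                          "mydataset.mytable_clustered", ["Key"])
print(ddl)
```

With the table from the question this clusters on Key, so the original query with Key='06019' in its WHERE clause should scan only the matching blocks.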

An example assuming your table has fields a INTEGER, b STRING and c STRING and you want to cluster by b. bq query --nouse_legacy_sql 'CREATE TABLE mydataset.mytable(d DATE, a INT64, b STRING, c STRING) PARTITION BY d CLUSTER BY b AS (SELECT NULL as d, 1 as a, "2" as b, "3" as c)'
– Pavan Edara
Aug 12 at 15:51

PARTITION BY d - DATE. I don't have DATE, I have integer
– thstart
Aug 12 at 20:04

Have you read the first line of my answer? You need to add a date column, even if you don't use it.
– Pentium10
Aug 13 at 6:33
