From a0228df5a289770654ef2d789298c039d52e40fe Mon Sep 17 00:00:00 2001 From: Shaun Struwig <41984034+Blargian@users.noreply.github.com> Date: Wed, 29 Oct 2025 19:07:10 +0100 Subject: [PATCH 1/8] initial draft --- .../{ => starter_guides}/creating-tables.md | 0 .../generating-test-data.md | 0 .../{ => starter_guides}/inserting-data.md | 0 .../starter_guides/working_with_arrays.md | 261 ++++++++++++++++++ .../{ => starter_guides}/writing-queries.md | 0 sidebars.js | 11 +- 6 files changed, 266 insertions(+), 6 deletions(-) rename docs/guides/{ => starter_guides}/creating-tables.md (100%) rename docs/guides/{ => starter_guides}/generating-test-data.md (100%) rename docs/guides/{ => starter_guides}/inserting-data.md (100%) create mode 100644 docs/guides/starter_guides/working_with_arrays.md rename docs/guides/{ => starter_guides}/writing-queries.md (100%) diff --git a/docs/guides/creating-tables.md b/docs/guides/starter_guides/creating-tables.md similarity index 100% rename from docs/guides/creating-tables.md rename to docs/guides/starter_guides/creating-tables.md diff --git a/docs/guides/generating-test-data.md b/docs/guides/starter_guides/generating-test-data.md similarity index 100% rename from docs/guides/generating-test-data.md rename to docs/guides/starter_guides/generating-test-data.md diff --git a/docs/guides/inserting-data.md b/docs/guides/starter_guides/inserting-data.md similarity index 100% rename from docs/guides/inserting-data.md rename to docs/guides/starter_guides/inserting-data.md diff --git a/docs/guides/starter_guides/working_with_arrays.md b/docs/guides/starter_guides/working_with_arrays.md new file mode 100644 index 00000000000..5851606c914 --- /dev/null +++ b/docs/guides/starter_guides/working_with_arrays.md @@ -0,0 +1,261 @@ +--- +title: 'Working with arrays in ClickHouse' +description: 'Starter guide on how to use arrays in ClickHouse' +keywords: ['Arrays'] +sidebar_label: 'Working with arrays in ClickHouse' +slug: /guides/working-with-arrays +doc_type: 'guide' +--- + +> In this guide, you'll learn how to use arrays in ClickHouse along with some of the most common array functions + +## Array basics {#array-basics} + +An array is a data structure which groups together values. We call these elements, and each element can be referred to by what is known as an index which indicates the position of the element in this grouping. + +Arrays in ClickHouse can be formed using the `array` function: + +``` +array(T) +``` + +Or alternatively, using square brackets: + +``` +[] +``` + +For example, you can create an array of numbers: + +``` +SELECT array(1, 2, 3) AS numeric_array + +┌─numeric_array─┐ +│ [1,2,3] │ +└───────────────┘ +``` + +Or an array of strings: + +``` +SELECT array('hello', 'world') AS string_array + +┌─string_array──────┐ +│ ['hello','world'] │ +└───────────────────┘ +``` + +Or as an array of nested types such as tuples: + +``` +SELECT array(tuple(1, 2), tuple(3, 4)) + +┌─[(1, 2), (3, 4)]─┐ +│ [(1,2),(3,4)] │ +└──────────────────┘ +``` + +You might be tempted to make an array of different types like this: + +``` +SELECT array('Hello', 'world', 1, 2, 3) +``` + +However, array elements should always have a common super type, which is the smallest data type that can represent values from two or more different types without loss, allowing them to be used together. +If there is no common super type an exception will be raised: + +``` +Received exception: +Code: 386. DB::Exception: There is no supertype for types String, String, UInt8, UInt8, UInt8 because some of them are String/FixedString/Enum and some of them are not: In scope SELECT ['Hello', 'world', 1, 2, 3]. (NO_COMMON_TYPE) +``` + +when creating arrays on the fly, ClickHouse picks the narrowest type that fits all elements. +For example, if you create an array of ints and floats, a super type of float is chosen: + +``` +SELECT [1::UInt8, 2.5::Float32, 3::UInt8] AS mixed_array, toTypeName([1, 2.5, 3]) AS array_type; + +┌─mixed_array─┬─array_type─────┐ +│ [1,2.5,3] │ Array(Float64) │ +└─────────────┴────────────────┘ +``` + +Use of the index with square brackets provides a convenient way to access array elements. +In ClickHouse it's important to know that the array index always starts from **1**. +This may be different from other programming languages you're used to where arrays are zero-indexed. + +For example, given an array, you can select the first element of an array by writing: + +``` +WITH array('hello', 'world') AS string_array +SELECT string_array[1]; + +┌─arrayElement⋯g_array, 1)─┐ +│ hello │ +└──────────────────────────┘ +``` + +It is also possible to use negative indexes. +In this way you can select elements relative to the last element: + +``` +WITH array('hello', 'world', 'arrays are great aren\'t they?') AS string_array +SELECT string_array[-1]; + +┌─arrayElement(string_array, -1)─┐ +│ arrays are great aren't they? │ +└────────────────────────────────┘ +``` + +Despite arrays being 1 based-indexed, you can still access elements at position 0. +The returned value will be the default value of the array type. In the example +below an empty string is returned as this is the default value for the string data type: + +```sql +WITH ['hello', 'world', 'arrays are great aren\'t they?'] AS string_array +SELECT string_array[0] + +┌─arrayElement⋯g_array, 0)─┐ +│ │ +└──────────────────────────┘ +``` +## Array functions + +ClickHouse offers a host of useful functions which operate on arrays. +In this section, we'll look at some of the most useful ones, starting from the simplest and working up in complexity. + +### length, arrayEnumerate, indexOf functions + +The `length` function is used to return the number of elements in the array: + +``` +WITH array('learning', 'ClickHouse', 'arrays') AS string_array +SELECT length(string_array); + +┌─length(string_array)─┐ +│ 3 │ +└──────────────────────┘ +``` + +You can also use the [`arrayEnumerate`](/sql-reference/functions/array-functions#arrayEnumerate) function to return an array of indexes of the elements: + +``` +WITH array('learning', 'ClickHouse', 'arrays') AS string_array +SELECT arrayEnumerate(string_array); + +┌─arrayEnumerate(string_array)─┐ +│ [1,2,3] │ +└──────────────────────────────┘ +``` + +If you want to find the index of a particular value, you can use the `indexOf` function: + +```sql +SELECT indexOf([4, 2, 8, 8, 9], 8); + +┌─indexOf([4, 2, 8, 8, 9], 8)─┐ +│ 3 │ +└─────────────────────────────┘ +``` + +Notice that this function will return the first index it encounters if there are multiple identical values in the array. +If your array elements are sorted in ascending order then you can use the [`indexOfAssumeSorted`](/sql-reference/functions/array-functions#indexOfAssumeSorted) function. + +### Exploring stock market data with array functions + +So far, the examples have been pretty simple. +Let's take a real-world dataset and write some queries to learn how some of the more common array functions work. + +We will be using the [ontime dataset](/getting-started/example-datasets/ontime), which contains flight data from Bureau of Transportation Statistics. +You can find this dataset on the [SQL playground](https://sql.clickhouse.com/?query_id=M4FSVBVMSHY98NKCQP8N4K). + +We've selected this dataset as arrays are often well suited to working with time-series data and can assist in simplifying +otherwise complex queries. + +:::tip +Click the "play" button below to run the queries directly in the docs +::: + +### groupArray + +There are many columns in this dataset, but we will focus on a subset of the columns. +Run the query below to see what our data looks like: + +```sql runnable +-- SELECT +-- * +-- FROM ontime.ontime LIMIT 100 + +SELECT + FlightDate, + Origin, + OriginCityName, + Dest, + DestCityName, + DepTime, + DepDelayMinutes, + ArrTime, + ArrDelayMinutes +FROM ontime.ontime LIMIT 5 +``` + +Let's take a look at the top 10 busiest airports in the US on a particular day chosen at random, say 2024-01-01. +We're interested in understanding how many flights depart from each airport. +Our data contains one row per flight, but it would be convenient if we could group the data by the origin airport and roll the destinations into an array. + +To acheive this we can use the [`groupArray`](/sql-reference/aggregate-functions/reference/grouparray) aggregate function, which takes values of the specified column from each row and groups them in an array. + +Run the query below to see how it works: + +```sql runnable +SELECT + FlightDate, + Origin, + groupArray(toStringCutToZero(Dest)) AS Destinations +FROM ontime.ontime +WHERE Origin IN ('ATL', 'ORD', 'DFW', 'DEN', 'LAX', 'JFK', 'LAS', 'CLT', 'SFO', 'SEA') AND FlightDate='2024-01-01' +GROUP BY FlightDate, Origin +ORDER BY length(Destinations) +``` + +The `toStringCutToZero` in the query above is used to remove null characters which appear after some of the airport's 3 letter designation. + +Wth the data in this form we can easily find the order of the busiest airports by finding the length of the rolled up "Destinations" arrays: + +```sql +WITH + '2024-01-01' AS date, + busy_airports AS ( + SELECT + FlightDate, + Origin, + groupArray(toStringCutToZero(Dest)) AS Destinations + FROM ontime.ontime + WHERE Origin IN ('ATL', 'ORD', 'DFW', 'DEN', 'LAX', 'JFK', 'LAS', 'CLT', 'SFO', 'SEA') + AND FlightDate = date + GROUP BY FlightDate, Origin + ORDER BY length(Destinations) + ) +SELECT + Origin, + length(Destinations) AS outward_flights +FROM busy_airports +ORDER BY outward_flights DESC +``` + +### arrayFilter + +Many of the array functions in ClickHouse are so-called ["higher-order functions"](/sql-reference/functions/overview#higher-order-functions) and accept a lambda function as the first parameter. +Let's take a look at how this works by finding days when stocks gained significantly, say +5%. + +Run the query below: + +```sql runnable + +``` +In the query below we pass a lambda function as the first argument to the [`arrayFilter`](/sql-reference/functions/array-functions#arrayFilter) function: +itself taking the opening and closing price: + +```sql +(p, o) -> p > 0 * 1.05 +``` \ No newline at end of file diff --git a/docs/guides/writing-queries.md b/docs/guides/starter_guides/writing-queries.md similarity index 100% rename from docs/guides/writing-queries.md rename to docs/guides/starter_guides/writing-queries.md diff --git a/sidebars.js b/sidebars.js index 84494802054..b7bff9fac5a 100644 --- a/sidebars.js +++ b/sidebars.js @@ -80,12 +80,11 @@ const sidebars = { collapsible: false, link: { type: "doc", id: "starter-guides/index" }, items: [ - "guides/creating-tables", - "guides/inserting-data", - "guides/writing-queries", - "guides/developer/mutations", - "guides/generating-test-data" - ], + { + type: "autogenerated", + dirName: "guides/starter_guides", + } + ] }, { type: "category", From bf74558ba428bf1234be0528465f818dd0620bdb Mon Sep 17 00:00:00 2001 From: Shaun Struwig <41984034+Blargian@users.noreply.github.com> Date: Mon, 3 Nov 2025 20:24:42 +0100 Subject: [PATCH 2/8] add starter guide on arrays --- .../starter_guides/working_with_arrays.md | 290 +++++++++++++++--- 1 file changed, 254 insertions(+), 36 deletions(-) diff --git a/docs/guides/starter_guides/working_with_arrays.md b/docs/guides/starter_guides/working_with_arrays.md index 5851606c914..d4eb2acf629 100644 --- a/docs/guides/starter_guides/working_with_arrays.md +++ b/docs/guides/starter_guides/working_with_arrays.md @@ -7,13 +7,14 @@ slug: /guides/working-with-arrays doc_type: 'guide' --- -> In this guide, you'll learn how to use arrays in ClickHouse along with some of the most common array functions +> In this guide, you'll learn how to use arrays in ClickHouse along with some of the most commonly used [array functions](/sql-reference/functions/array-functions). -## Array basics {#array-basics} +## Introduction to arrays {#array-basics} -An array is a data structure which groups together values. We call these elements, and each element can be referred to by what is known as an index which indicates the position of the element in this grouping. +An array is an in-memory data structure which groups together values. +We call these _elements_ of the array, and each element can be referred to by an index, which indicates the position of the element in this grouping. -Arrays in ClickHouse can be formed using the `array` function: +Arrays in ClickHouse can be formed using the [`array`](/sql-reference/data-types/array) function: ``` array(T) @@ -45,7 +46,7 @@ SELECT array('hello', 'world') AS string_array └───────────────────┘ ``` -Or as an array of nested types such as tuples: +Or an array of nested types such as [tuples](/sql-reference/data-types/tuple): ``` SELECT array(tuple(1, 2), tuple(3, 4)) @@ -61,8 +62,8 @@ You might be tempted to make an array of different types like this: SELECT array('Hello', 'world', 1, 2, 3) ``` -However, array elements should always have a common super type, which is the smallest data type that can represent values from two or more different types without loss, allowing them to be used together. -If there is no common super type an exception will be raised: +However, array elements should always have a common super-type, which is the smallest data type that can represent values from two or more different types without loss, allowing them to be used together. +If there is no common super-type, you'll get an exception when you try to form the array: ``` Received exception: @@ -70,7 +71,7 @@ Code: 386. DB::Exception: There is no supertype for types String, String, UInt8, ``` when creating arrays on the fly, ClickHouse picks the narrowest type that fits all elements. -For example, if you create an array of ints and floats, a super type of float is chosen: +For example, if you create an array of ints and floats, a super-type of float is chosen: ``` SELECT [1::UInt8, 2.5::Float32, 3::UInt8] AS mixed_array, toTypeName([1, 2.5, 3]) AS array_type; @@ -81,7 +82,7 @@ SELECT [1::UInt8, 2.5::Float32, 3::UInt8] AS mixed_array, toTypeName([1, 2.5, 3] ``` Use of the index with square brackets provides a convenient way to access array elements. -In ClickHouse it's important to know that the array index always starts from **1**. +In ClickHouse, it's important to know that the array index always starts from **1**. This may be different from other programming languages you're used to where arrays are zero-indexed. For example, given an array, you can select the first element of an array by writing: @@ -96,20 +97,20 @@ SELECT string_array[1]; ``` It is also possible to use negative indexes. -In this way you can select elements relative to the last element: +In this way, you can select elements relative to the last element: ``` -WITH array('hello', 'world', 'arrays are great aren\'t they?') AS string_array +WITH array('hello', 'world') AS string_array SELECT string_array[-1]; -┌─arrayElement(string_array, -1)─┐ -│ arrays are great aren't they? │ -└────────────────────────────────┘ +┌─arrayElement⋯g_array, -1)─┐ +│ world │ +└───────────────────────────┘ ``` Despite arrays being 1 based-indexed, you can still access elements at position 0. -The returned value will be the default value of the array type. In the example -below an empty string is returned as this is the default value for the string data type: +The returned value will be the _default value_ of the array type. +In the example below, an empty string is returned as this is the default value for the string data type: ```sql WITH ['hello', 'world', 'arrays are great aren\'t they?'] AS string_array @@ -119,12 +120,13 @@ SELECT string_array[0] │ │ └──────────────────────────┘ ``` -## Array functions + +## Array functions {#array-functions} ClickHouse offers a host of useful functions which operate on arrays. In this section, we'll look at some of the most useful ones, starting from the simplest and working up in complexity. -### length, arrayEnumerate, indexOf functions +### length, arrayEnumerate, indexOf, has* functions {#length-arrayEnumerate-indexOf-has-functions} The `length` function is used to return the number of elements in the array: @@ -161,22 +163,46 @@ SELECT indexOf([4, 2, 8, 8, 9], 8); Notice that this function will return the first index it encounters if there are multiple identical values in the array. If your array elements are sorted in ascending order then you can use the [`indexOfAssumeSorted`](/sql-reference/functions/array-functions#indexOfAssumeSorted) function. -### Exploring stock market data with array functions +The functions `has`, `hasAll` and `hasAny` are useful for determining whether an array contains a given value. +Consider the following example: + +```sql +WITH ['Airbus A380', 'Airbus A350', 'Airbus A220', 'Boeing 737', 'Boeing 747-400'] AS airplanes +SELECT + has(airplanes, 'Airbus A350') AS has_true, + has(airplanes, 'Lockheed Martin F-22 Raptor') AS has_false, + hasAny(airplanes, ['Boeing 737', 'Eurofighter Typhoon']) AS hasAny_true, + hasAny(airplanes, ['Lockheed Martin F-22 Raptor', 'Eurofighter Typhoon']) AS hasAny_false, + hasAll(airplanes, ['Boeing 737', 'Boeing 747-400']) AS hasAll_true, + hasAll(airplanes, ['Boeing 737', 'Eurofighter Typhoon']) AS hasAll_false +FORMAT Vertical; +``` + +```response +has_true: 1 +has_false: 0 +hasAny_true: 1 +hasAny_false: 0 +hasAll_true: 1 +hasAll_false: 0 +``` + +## Exploring flight data with array functions {#exploring-flight-data-with-array-functions} So far, the examples have been pretty simple. -Let's take a real-world dataset and write some queries to learn how some of the more common array functions work. +The utility of arrays really shows itself when used on a real-world dataset. -We will be using the [ontime dataset](/getting-started/example-datasets/ontime), which contains flight data from Bureau of Transportation Statistics. +We will be using the [ontime dataset](/getting-started/example-datasets/ontime), which contains flight data from the Bureau of Transportation Statistics. You can find this dataset on the [SQL playground](https://sql.clickhouse.com/?query_id=M4FSVBVMSHY98NKCQP8N4K). We've selected this dataset as arrays are often well suited to working with time-series data and can assist in simplifying otherwise complex queries. :::tip -Click the "play" button below to run the queries directly in the docs +Click the "play" button below to run the queries directly in the docs and see the result live. ::: -### groupArray +### groupArray {#grouparray} There are many columns in this dataset, but we will focus on a subset of the columns. Run the query below to see what our data looks like: @@ -199,11 +225,11 @@ SELECT FROM ontime.ontime LIMIT 5 ``` -Let's take a look at the top 10 busiest airports in the US on a particular day chosen at random, say 2024-01-01. +Let's take a look at the top 10 busiest airports in the US on a particular day chosen at random, say '2024-01-01'. We're interested in understanding how many flights depart from each airport. Our data contains one row per flight, but it would be convenient if we could group the data by the origin airport and roll the destinations into an array. -To acheive this we can use the [`groupArray`](/sql-reference/aggregate-functions/reference/grouparray) aggregate function, which takes values of the specified column from each row and groups them in an array. +To achieve this we can use the [`groupArray`](/sql-reference/aggregate-functions/reference/grouparray) aggregate function, which takes values of the specified column from each row and groups them in an array. Run the query below to see how it works: @@ -218,17 +244,18 @@ GROUP BY FlightDate, Origin ORDER BY length(Destinations) ``` -The `toStringCutToZero` in the query above is used to remove null characters which appear after some of the airport's 3 letter designation. +The [`toStringCutToZero`](/sql-reference/functions/type-conversion-functions#tostringcuttozero) in the query above is used to remove null characters which appear after some of the airport's 3 letter designation. -Wth the data in this form we can easily find the order of the busiest airports by finding the length of the rolled up "Destinations" arrays: +With the data in this format, we can easily find the order of the busiest airports by finding the length of the rolled up "Destinations" arrays: -```sql +```sql runnable WITH '2024-01-01' AS date, busy_airports AS ( SELECT FlightDate, Origin, +--highlight-next-line groupArray(toStringCutToZero(Dest)) AS Destinations FROM ontime.ontime WHERE Origin IN ('ATL', 'ORD', 'DFW', 'DEN', 'LAX', 'JFK', 'LAS', 'CLT', 'SFO', 'SEA') @@ -243,19 +270,210 @@ FROM busy_airports ORDER BY outward_flights DESC ``` -### arrayFilter +### arrayMap {#arraymap} + +We saw in the previous query that Denver International Airport was the airport with the most outward flights for our particular chosen day. +Let's take a look at how many of those flights were on-time, delayed by 15-30 minutes or delayed by more than 30 minutes. Many of the array functions in ClickHouse are so-called ["higher-order functions"](/sql-reference/functions/overview#higher-order-functions) and accept a lambda function as the first parameter. -Let's take a look at how this works by finding days when stocks gained significantly, say +5%. +The [`arrayMap`](/sql-reference/functions/array-functions#arrayMap) function is an example of one such higher-order functon +and returns a new array from the provided array by applying a lambda function to each element of the original array. -Run the query below: +Run the query below which uses the `arrayMap` function to see which flights were delayed or on-time: -```sql runnable +```sql +SELECT + Origin, + toStringCutToZero(Dest) AS Destination, + Tail_Number, + DepDelayMinutes AS delay_minutes, +--highlight-start + arrayMap(d -> if(d >= 30, 'DELAYED', + if(d >= 15, 'WARNING', 'ON-TIME')), + [DepDelayMinutes])[1] AS status +--highlight-end +FROM ontime.ontime +WHERE Origin = 'DEN' + AND FlightDate = '2024-01-01' + AND DepTime IS NOT NULL + AND DepDelayMinutes IS NOT NULL +ORDER BY DepDelayMinutes DESC +``` + +In the above query, the `arrayMap` function takes a single-element array `[DepDelayMinutes]` and applies the lambda function `d -> if(d >= 30, 'DELAYED', if(d >= 15, 'WARNING', 'ON-TIME'` to categorize it. +Then the first element of the resulting array is extracted with `[DepDelayMinutes][1]`. + +### arrayFilter {#arrayfilter} +Next we'll look only at the number of flights that were delayed by 30 minutes or more, for airports `DEN`, `ATL` and `DFW`: + +```sql runnable +SELECT + Origin, + OriginCityName, +--highlight-next-line + length(arrayFilter(d -> d >= 30, groupArray(ArrDelayMinutes))) AS num_delays_30_min_or_more +FROM ontime.ontime +WHERE Origin IN ('DEN', 'ATL', 'DFW') + AND FlightDate = '2024-01-01' +GROUP BY Origin, OriginCityName +ORDER BY num_on_time DESC ``` -In the query below we pass a lambda function as the first argument to the [`arrayFilter`](/sql-reference/functions/array-functions#arrayFilter) function: -itself taking the opening and closing price: + +In the query above we pass a lambda function as the first argument to the [`arrayFilter`](/sql-reference/functions/array-functions#arrayFilter) function. +This lambda function itself takes the delay in minutes (d) and returns `1` if the condition is met, else `0`. ```sql -(p, o) -> p > 0 * 1.05 -``` \ No newline at end of file +d -> d >= 30 +``` + +### arraySort and arrayIntersect {#arraysort-and-arrayintersect} + +Next, we'll figure out which pairs of major US airports serve the most common destinations with the help of the [`arraySort`](/sql-reference/functions/array-functions#arraySort) and [`arrayIntersect`](/sql-reference/functions/array-functions#arrayIntersect) functions. +`arraySort` takes an array and sorts the elements in ascending order by default, although you can also pass a lambda function to it to define the sorting order. +`arrayIntersect` takes multiple arrays and returns an array which contains elements present in all the arrays. + +Run the query below to see these two array functions in action: + +```sql runnable +WITH airport_routes AS ( + SELECT + Origin, +--highlight-next-line + arraySort(groupArray(DISTINCT toStringCutToZero(Dest))) AS destinations + FROM ontime.ontime + WHERE FlightDate = '2024-01-01' + GROUP BY Origin +) +SELECT + a1.Origin AS airport1, + a2.Origin AS airport2, +--highlight-next-line + length(arrayIntersect(a1.destinations, a2.destinations)) AS common_destinations +FROM airport_routes a1 +CROSS JOIN airport_routes a2 +WHERE a1.Origin < a2.Origin + AND a1.Origin IN ('DEN', 'ATL', 'DFW', 'ORD', 'LAS') + AND a2.Origin IN ('DEN', 'ATL', 'DFW', 'ORD', 'LAS') +ORDER BY common_destinations DESC +LIMIT 10 +``` + +The query works in two main stages. +First, it creates a temporary dataset called `airport_routes` using a Common Table Expression (CTE) that looks at all flights on January 1, 2024, and for each origin airport, builds a sorted list of every unique destination which that airport serves. +In the `airport_routes` result set, for example, DEN might have an array containing all the cities it flies to, like `['ATL', 'BOS', 'LAX', 'MIA', ...]` and so on. + +In the second stage, the query takes five major US hub airports (`DEN`, `ATL`, `DFW`, `ORD`, and `LAS`) and compares every possible pair of them. +It does this using a cross join, which creates all combinations of these airports. +Then, for each pair, it uses the `arrayIntersect` function to find which destinations appear in both airports' lists. +The length function counts how many destinations they have in common. + +The condition `a1.Origin < a2.Origin`, ensures that each pair only appears once. +Without this, you'd get both JFK-LAX and LAX-JFK as separate results, which would be redundant since they represent the same comparison. +Finally, the query sorts the results to show which airport pairs have the highest number of shared destinations and returns just the top 10. +This reveals which major hubs have the most overlapping route networks, which could indicate competitive markets where multiple airlines are serving the same city pairs, or hubs that serve similar geographic regions and could potentially be used as alternative connection points for travelers. + +### arrayReduce {#arrayReduce} + +While we're looking at delays, let's use yet another higher-order array function, `arrayReduce`, to find the average and maximum delay +for each route from Denver International Airport: + +```sql runnable +SELECT + Origin, + toStringCutToZero(Dest) AS Destination, + groupArray(DepDelayMinutes) AS delays, +--highlight-start + round(arrayReduce('avg', groupArray(DepDelayMinutes)), 2) AS avg_delay, + round(arrayReduce('max', groupArray(DepDelayMinutes)), 2) AS worst_delay +--highlight-end +FROM ontime.ontime +WHERE Origin = 'DEN' + AND FlightDate = '2024-01-01' + AND DepDelayMinutes IS NOT NULL +GROUP BY Origin, Destination +ORDER BY avg_delay DESC +``` + +In the example above, we used `arrayReduce` to find the average and maximum delays for various outward flights from `DEN`. +`arrayReduce` applies an aggregate function, specified in the first parameter to the function, to the elements of the provided array, specified in the second parameter of the function. + +### arrayJoin {#arrayJoin} + +Regular functions in ClickHouse have the property that they return the same number of rows than they receive. +There is however, one interesting and unique function that breaks this rule worth learning about - the `arrayJoin` function. + +`arrayJoin` "explodes" an array - it takes an array and creates a separate row for each element. +This is similar to SQL's `UNNEST` or `EXPLODE` functions in other databases. + +Unlike most array functions that return arrays or scalar values, `arrayJoin` fundamentally changes the result set by multiplying the number of rows. + +Consider the query below which returns a single array of destinations for a given origin: + +```sql runnable +SELECT + Origin, + groupArray(toStringCutToZero(Dest)) AS Destinations +FROM ontime.ontime +WHERE Origin = 'DEN' + AND FlightDate = '2024-01-01' +GROUP BY Origin +LIMIT 1 +``` + +With `arrayJoin`, each destination becomes it's own row: + +```sql runnable +SELECT + Origin, +--highlight-next-line + arrayJoin(groupArray(toStringCutToZero(Dest))) AS Destination +FROM ontime.ontime +WHERE Origin = 'DEN' + AND FlightDate = '2024-01-01' +GROUP BY Origin +LIMIT 10 +``` + +Let's use this function to figure out how many flights depart from `DEN` at different times of the day: + +```sql runnable +-- Create time-of-day categories for each flight, then explode to count flights per category +WITH flight_categories AS ( + SELECT + FlightDate, + Origin, + Dest, + -- Categorize the flight into multiple relevant buckets + arrayFilter(x -> x != '', [ + if(DepTime < 600, 'early-morning', ''), + if(DepTime >= 600 AND DepTime < 1200, 'morning', ''), + if(DepTime >= 1200 AND DepTime < 1800, 'afternoon', ''), + if(DepTime >= 1800, 'evening', '') + ]) AS categories + FROM ontime.ontime + WHERE FlightDate = '2024-01-01' +) +SELECT +--highlight-next-line + arrayJoin(categories) AS category, + count() AS flight_count +FROM flight_categories +GROUP BY category +ORDER BY flight_count DESC +``` + +The query above categorizes each flight by its departure time into various buckets of the time of day: early-morning, morning, afternoon, evening with the help of the now familiar `arrayFilter` function. +It then uses the `arrayJoin` function to explode the array so each category becomes a separate row, allowing us to count how many total flights departed during each time period on January 1st, 2024. + +## Next steps {#next-steps} + +Congratulations! You've learned how to work with arrays in ClickHouse, from basic array creation and indexing to powerful functions like `groupArray`, `arrayFilter`, `arrayMap`, `arrayReduce`, and `arrayJoin`. +To continue your learning journey, explore the complete array functions reference to discover additional functions like `arrayFlatten`, `arrayReverse`, and `arrayDistinct`. +You might also want to learn about related data structures such as [`tuples`](/sql-reference/data-types/tuple#creating-tuples), [JSON](/sql-reference/data-types/newjson), and [Map](/sql-reference/data-types/map) types which work well alongside arrays. +Practice applying these concepts to your own datasets, and experiment with different queries on the SQL playground or other example datasets. + +Arrays are a fundamental feature in ClickHouse that enable, efficient analytical queries - as you become more comfortable with array functions, you'll find they can dramatically simplify complex aggregations and time-series analysis. +For more array fun we recommend the Youtube video below: + + \ No newline at end of file From a429589061be6827a635cfe34cf18146e00781c0 Mon Sep 17 00:00:00 2001 From: Shaun Struwig <41984034+Blargian@users.noreply.github.com> Date: Mon, 3 Nov 2025 20:28:34 +0100 Subject: [PATCH 3/8] fix spelling --- docs/guides/starter_guides/working_with_arrays.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/docs/guides/starter_guides/working_with_arrays.md b/docs/guides/starter_guides/working_with_arrays.md index d4eb2acf629..e557ba69207 100644 --- a/docs/guides/starter_guides/working_with_arrays.md +++ b/docs/guides/starter_guides/working_with_arrays.md @@ -71,7 +71,7 @@ Code: 386. DB::Exception: There is no supertype for types String, String, UInt8, ``` when creating arrays on the fly, ClickHouse picks the narrowest type that fits all elements. -For example, if you create an array of ints and floats, a super-type of float is chosen: +For example, if you create an array of integers and floats, a super-type of float is chosen: ``` SELECT [1::UInt8, 2.5::Float32, 3::UInt8] AS mixed_array, toTypeName([1, 2.5, 3]) AS array_type; @@ -276,8 +276,7 @@ We saw in the previous query that Denver International Airport was the airport w Let's take a look at how many of those flights were on-time, delayed by 15-30 minutes or delayed by more than 30 minutes. Many of the array functions in ClickHouse are so-called ["higher-order functions"](/sql-reference/functions/overview#higher-order-functions) and accept a lambda function as the first parameter. -The [`arrayMap`](/sql-reference/functions/array-functions#arrayMap) function is an example of one such higher-order functon -and returns a new array from the provided array by applying a lambda function to each element of the original array. +The [`arrayMap`](/sql-reference/functions/array-functions#arrayMap) function is an example of one such higher-order function and returns a new array from the provided array by applying a lambda function to each element of the original array. Run the query below which uses the `arrayMap` function to see which flights were delayed or on-time: @@ -404,7 +403,7 @@ Regular functions in ClickHouse have the property that they return the same numb There is however, one interesting and unique function that breaks this rule worth learning about - the `arrayJoin` function. `arrayJoin` "explodes" an array - it takes an array and creates a separate row for each element. -This is similar to SQL's `UNNEST` or `EXPLODE` functions in other databases. +This is similar to the `UNNEST` or `EXPLODE` SQL functions in other databases. Unlike most array functions that return arrays or scalar values, `arrayJoin` fundamentally changes the result set by multiplying the number of rows. From 68e581b8d51ae17ad576149164e2c1418d7e2d21 Mon Sep 17 00:00:00 2001 From: Shaun Struwig <41984034+Blargian@users.noreply.github.com> Date: Mon, 3 Nov 2025 20:33:01 +0100 Subject: [PATCH 4/8] fix markdown lint --- .../starter_guides/working_with_arrays.md | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/guides/starter_guides/working_with_arrays.md b/docs/guides/starter_guides/working_with_arrays.md index e557ba69207..7af56e7549f 100644 --- a/docs/guides/starter_guides/working_with_arrays.md +++ b/docs/guides/starter_guides/working_with_arrays.md @@ -16,19 +16,19 @@ We call these _elements_ of the array, and each element can be referred to by an Arrays in ClickHouse can be formed using the [`array`](/sql-reference/data-types/array) function: -``` +```sql array(T) ``` Or alternatively, using square brackets: -``` +```sql [] ``` For example, you can create an array of numbers: -``` +```sql SELECT array(1, 2, 3) AS numeric_array ┌─numeric_array─┐ @@ -38,7 +38,7 @@ SELECT array(1, 2, 3) AS numeric_array Or an array of strings: -``` +```sql SELECT array('hello', 'world') AS string_array ┌─string_array──────┐ @@ -48,7 +48,7 @@ SELECT array('hello', 'world') AS string_array Or an array of nested types such as [tuples](/sql-reference/data-types/tuple): -``` +```sql SELECT array(tuple(1, 2), tuple(3, 4)) ┌─[(1, 2), (3, 4)]─┐ @@ -58,14 +58,14 @@ SELECT array(tuple(1, 2), tuple(3, 4)) You might be tempted to make an array of different types like this: -``` +```sql SELECT array('Hello', 'world', 1, 2, 3) ``` However, array elements should always have a common super-type, which is the smallest data type that can represent values from two or more different types without loss, allowing them to be used together. If there is no common super-type, you'll get an exception when you try to form the array: -``` +```sql Received exception: Code: 386. DB::Exception: There is no supertype for types String, String, UInt8, UInt8, UInt8 because some of them are String/FixedString/Enum and some of them are not: In scope SELECT ['Hello', 'world', 1, 2, 3]. (NO_COMMON_TYPE) ``` @@ -73,7 +73,7 @@ Code: 386. DB::Exception: There is no supertype for types String, String, UInt8, when creating arrays on the fly, ClickHouse picks the narrowest type that fits all elements. For example, if you create an array of integers and floats, a super-type of float is chosen: -``` +```sql SELECT [1::UInt8, 2.5::Float32, 3::UInt8] AS mixed_array, toTypeName([1, 2.5, 3]) AS array_type; ┌─mixed_array─┬─array_type─────┐ @@ -87,7 +87,7 @@ This may be different from other programming languages you're used to where arra For example, given an array, you can select the first element of an array by writing: -``` +```sql WITH array('hello', 'world') AS string_array SELECT string_array[1]; @@ -99,7 +99,7 @@ SELECT string_array[1]; It is also possible to use negative indexes. In this way, you can select elements relative to the last element: -``` +```sql WITH array('hello', 'world') AS string_array SELECT string_array[-1]; @@ -130,7 +130,7 @@ In this section, we'll look at some of the most useful ones, starting from the s The `length` function is used to return the number of elements in the array: -``` +```sql WITH array('learning', 'ClickHouse', 'arrays') AS string_array SELECT length(string_array); @@ -141,7 +141,7 @@ SELECT length(string_array); You can also use the [`arrayEnumerate`](/sql-reference/functions/array-functions#arrayEnumerate) function to return an array of indexes of the elements: -``` +```sql WITH array('learning', 'ClickHouse', 'arrays') AS string_array SELECT arrayEnumerate(string_array); From 1ebff7ed5769b632e270b168a7c4ae71b286d65a Mon Sep 17 00:00:00 2001 From: Shaun Struwig <41984034+Blargian@users.noreply.github.com> Date: Tue, 4 Nov 2025 20:23:48 +0100 Subject: [PATCH 5/8] move mutations to correct folder --- docs/guides/starter_guides/mutations.md | 107 ++++++++++++++++++++++++ 1 file changed, 107 insertions(+) create mode 100644 docs/guides/starter_guides/mutations.md diff --git a/docs/guides/starter_guides/mutations.md b/docs/guides/starter_guides/mutations.md new file mode 100644 index 00000000000..b30b2d16107 --- /dev/null +++ b/docs/guides/starter_guides/mutations.md @@ -0,0 +1,107 @@ +--- +slug: /guides/developer/mutations +sidebar_label: 'Updating and deleting data' +sidebar_position: 1 +keywords: ['UPDATE', 'DELETE', 'mutations'] +title: 'Updating and deleting ClickHouse data' +description: 'Describes how to perform update and delete operations in ClickHouse' +show_related_blogs: false +doc_type: 'guide' +--- + +# Updating and deleting ClickHouse data with mutations + +Although ClickHouse is geared toward high volume analytic workloads, it is possible in some situations to modify or +delete existing data. These operations are labeled "mutations" and are executed using the `ALTER TABLE` command. + +:::tip +If you need to perform frequent updates, consider using [deduplication](../developer/deduplication.md) in ClickHouse, which allows you to update +and/or delete rows without generating a mutation event. Alternatively, use [lightweight updates](/docs/sql-reference/statements/update) +or [lightweight deletes](/guides/developer/lightweight-delete) +::: + +## Updating data {#updating-data} + +Use the `ALTER TABLE...UPDATE` command to update rows in a table: + +```sql +ALTER TABLE [.] UPDATE = WHERE +``` + +`` is the new value for the column where the `` is satisfied. The `` must be the same datatype as the column or be convertible to the same datatype using the `CAST` operator. The `` should return a `UInt8` (zero or non-zero) value for each row of the data. Multiple `UPDATE ` statements can be combined in a single `ALTER TABLE` command separated by commas. + +**Examples**: + + 1. A mutation like this allows updating replacing `visitor_ids` with new ones using a dictionary lookup: + + ```sql + ALTER TABLE website.clicks + UPDATE visitor_id = getDict('visitors', 'new_visitor_id', visitor_id) + WHERE visit_date < '2022-01-01' + ``` + +2. Modifying multiple values in one command can be more efficient than multiple commands: + + ```sql + ALTER TABLE website.clicks + UPDATE url = substring(url, position(url, '://') + 3), visitor_id = new_visit_id + WHERE visit_date < '2022-01-01' + ``` + +3. Mutations can be executed `ON CLUSTER` for sharded tables: + + ```sql + ALTER TABLE clicks ON CLUSTER main_cluster + UPDATE click_count = click_count / 2 + WHERE visitor_id ILIKE '%robot%' + ``` + +:::note +It is not possible to update columns that are part of the primary or sorting key. +::: + +## Deleting data {#deleting-data} + +Use the `ALTER TABLE` command to delete rows: + +```sql +ALTER TABLE [.]
DELETE WHERE +``` + +The `` should return a UInt8 value for each row of data. + +**Examples** + +1. Delete any records where a column is in an array of values: + ```sql + ALTER TABLE website.clicks DELETE WHERE visitor_id in (253, 1002, 4277) + ``` + +2. What does this query alter? + ```sql + ALTER TABLE clicks ON CLUSTER main_cluster DELETE WHERE visit_date < '2022-01-02 15:00:00' AND page_id = '573' + ``` + +:::note +To delete all of the data in a table, it is more efficient to use the command `TRUNCATE TABLE [` command. This command can also be executed `ON CLUSTER`. +::: + +View the [`DELETE` statement](/sql-reference/statements/delete.md) docs page for more details. + +## Lightweight deletes {#lightweight-deletes} + +Another option for deleting rows is to use the `DELETE FROM` command, which is referred to as a **lightweight delete**. The deleted rows are marked as deleted immediately and will be automatically filtered out of all subsequent queries, so you do not have to wait for a merging of parts or use the `FINAL` keyword. Cleanup of data happens asynchronously in the background. + +``` sql +DELETE FROM [db.]table [ON CLUSTER cluster] [WHERE expr] +``` + +For example, the following query deletes all rows from the `hits` table where the `Title` column contains the text `hello`: + +```sql +DELETE FROM hits WHERE Title LIKE '%hello%'; +``` + +A few notes about lightweight deletes: +- This feature is only available for the `MergeTree` table engine family. +- Lightweight deletes are asynchronous by default. Set `mutations_sync` equal to 1 to wait for one replica to process the statement, and set `mutations_sync` to 2 to wait for all replicas. From a42efb740a18c0a7d9aaebbc4cf9c8c31b9ac341 Mon Sep 17 00:00:00 2001 From: Shaun Struwig <41984034+Blargian@users.noreply.github.com> Date: Wed, 5 Nov 2025 11:04:44 +0100 Subject: [PATCH 6/8] fix links --- docs/faq/operations/index.md | 2 +- docs/guides/developer/mutations.md | 107 ------------------ docs/guides/starter_guides/index.md | 18 +++ docs/guides/starter_guides/writing-queries.md | 6 +- docs/starter-guides/index.md | 18 --- docs/whats-new/changelog/index.md | 4 - sidebars.js | 2 +- 7 files changed, 23 insertions(+), 134 deletions(-) delete mode 100644 docs/guides/developer/mutations.md create mode 100644 docs/guides/starter_guides/index.md delete mode 100644 docs/starter-guides/index.md diff --git a/docs/faq/operations/index.md b/docs/faq/operations/index.md index ba0dd509f55..faaaecd22b8 100644 --- a/docs/faq/operations/index.md +++ b/docs/faq/operations/index.md @@ -16,7 +16,7 @@ keywords: ['operations', 'administration', 'deployment', 'cluster management', ' - [How do I configure ClickHouse Keeper?](/guides/sre/keeper/index.md) - [Can ClickHouse integrate with LDAP?](/guides/sre/user-management/configuring-ldap.md) - [How do I configure users, roles and permissions in ClickHouse?](/guides/sre/user-management/index.md) -- [Can you update or delete rows in ClickHouse?](/guides/developer/mutations.md) +- [Can you update or delete rows in ClickHouse?](/guides/starter_guides/mutations.md) - [Does ClickHouse support multi-region replication?](/faq/operations/multi-region-replication.md) :::info Don't see what you're looking for? diff --git a/docs/guides/developer/mutations.md b/docs/guides/developer/mutations.md deleted file mode 100644 index b30b2d16107..00000000000 --- a/docs/guides/developer/mutations.md +++ /dev/null @@ -1,107 +0,0 @@ ---- -slug: /guides/developer/mutations -sidebar_label: 'Updating and deleting data' -sidebar_position: 1 -keywords: ['UPDATE', 'DELETE', 'mutations'] -title: 'Updating and deleting ClickHouse data' -description: 'Describes how to perform update and delete operations in ClickHouse' -show_related_blogs: false -doc_type: 'guide' ---- - -# Updating and deleting ClickHouse data with mutations - -Although ClickHouse is geared toward high volume analytic workloads, it is possible in some situations to modify or -delete existing data. These operations are labeled "mutations" and are executed using the `ALTER TABLE` command. - -:::tip -If you need to perform frequent updates, consider using [deduplication](../developer/deduplication.md) in ClickHouse, which allows you to update -and/or delete rows without generating a mutation event. Alternatively, use [lightweight updates](/docs/sql-reference/statements/update) -or [lightweight deletes](/guides/developer/lightweight-delete) -::: - -## Updating data {#updating-data} - -Use the `ALTER TABLE...UPDATE` command to update rows in a table: - -```sql -ALTER TABLE [.]
UPDATE = WHERE -``` - -`` is the new value for the column where the `` is satisfied. The `` must be the same datatype as the column or be convertible to the same datatype using the `CAST` operator. The `` should return a `UInt8` (zero or non-zero) value for each row of the data. Multiple `UPDATE ` statements can be combined in a single `ALTER TABLE` command separated by commas. - -**Examples**: - - 1. A mutation like this allows updating replacing `visitor_ids` with new ones using a dictionary lookup: - - ```sql - ALTER TABLE website.clicks - UPDATE visitor_id = getDict('visitors', 'new_visitor_id', visitor_id) - WHERE visit_date < '2022-01-01' - ``` - -2. Modifying multiple values in one command can be more efficient than multiple commands: - - ```sql - ALTER TABLE website.clicks - UPDATE url = substring(url, position(url, '://') + 3), visitor_id = new_visit_id - WHERE visit_date < '2022-01-01' - ``` - -3. Mutations can be executed `ON CLUSTER` for sharded tables: - - ```sql - ALTER TABLE clicks ON CLUSTER main_cluster - UPDATE click_count = click_count / 2 - WHERE visitor_id ILIKE '%robot%' - ``` - -:::note -It is not possible to update columns that are part of the primary or sorting key. -::: - -## Deleting data {#deleting-data} - -Use the `ALTER TABLE` command to delete rows: - -```sql -ALTER TABLE [.]
DELETE WHERE -``` - -The `` should return a UInt8 value for each row of data. - -**Examples** - -1. Delete any records where a column is in an array of values: - ```sql - ALTER TABLE website.clicks DELETE WHERE visitor_id in (253, 1002, 4277) - ``` - -2. What does this query alter? - ```sql - ALTER TABLE clicks ON CLUSTER main_cluster DELETE WHERE visit_date < '2022-01-02 15:00:00' AND page_id = '573' - ``` - -:::note -To delete all of the data in a table, it is more efficient to use the command `TRUNCATE TABLE [` command. This command can also be executed `ON CLUSTER`. -::: - -View the [`DELETE` statement](/sql-reference/statements/delete.md) docs page for more details. - -## Lightweight deletes {#lightweight-deletes} - -Another option for deleting rows is to use the `DELETE FROM` command, which is referred to as a **lightweight delete**. The deleted rows are marked as deleted immediately and will be automatically filtered out of all subsequent queries, so you do not have to wait for a merging of parts or use the `FINAL` keyword. Cleanup of data happens asynchronously in the background. - -``` sql -DELETE FROM [db.]table [ON CLUSTER cluster] [WHERE expr] -``` - -For example, the following query deletes all rows from the `hits` table where the `Title` column contains the text `hello`: - -```sql -DELETE FROM hits WHERE Title LIKE '%hello%'; -``` - -A few notes about lightweight deletes: -- This feature is only available for the `MergeTree` table engine family. -- Lightweight deletes are asynchronous by default. Set `mutations_sync` equal to 1 to wait for one replica to process the statement, and set `mutations_sync` to 2 to wait for all replicas. diff --git a/docs/guides/starter_guides/index.md b/docs/guides/starter_guides/index.md new file mode 100644 index 00000000000..b0960ed09d4 --- /dev/null +++ b/docs/guides/starter_guides/index.md @@ -0,0 +1,18 @@ +--- +slug: /starter-guides +title: 'Starter Guides' +description: 'Landing page for starter guides' +pagination_prev: null +pagination_next: null +doc_type: 'landing-page' +keywords: ['beginner', 'tutorial', 'create table', 'insert data', 'select data', 'update data', 'delete data'] +--- + +In this section of the docs you'll find starter guides for common SQL queries: `CREATE`, `INSERT`, `SELECT`, and mutations `UPDATE` and `DELETE`. + +| Page | Description | +|-----------------------------------------------------|------------------------------------------------------------------------| +| [Create Tables](/guides/creating-tables) | Starter guide on how to create a table. | +| [Insert Data](/guides/inserting-data) | Starter guide on how to insert data into a table. | +| [Select Data](/guides/writing-queries) | Starter guide on how to select data from a table. | +| [Update and Delete Data](/guides/developer/mutations) | Starter guide on mutations - updating and deleting data in ClickHouse. | diff --git a/docs/guides/starter_guides/writing-queries.md b/docs/guides/starter_guides/writing-queries.md index 79a868c6ba7..5816f63fde3 100644 --- a/docs/guides/starter_guides/writing-queries.md +++ b/docs/guides/starter_guides/writing-queries.md @@ -18,7 +18,7 @@ ORDER BY timestamp ``` :::note -View the [SQL Reference](../sql-reference/statements/select/index.md) for more details on the syntax and available clauses and options. +View the [SQL Reference](/sql-reference/statements/select) for more details on the syntax and available clauses and options. ::: Notice the response comes back in a nice table format: @@ -34,7 +34,7 @@ Notice the response comes back in a nice table format: 4 rows in set. Elapsed: 0.008 sec. ``` -Add a `FORMAT` clause to specify one of the [many supported output formats of ClickHouse](../interfaces/formats.md): +Add a `FORMAT` clause to specify one of the [many supported output formats of ClickHouse](/interfaces/formats#formats-overview): ```sql SELECT * FROM helloworld.my_first_table @@ -57,5 +57,5 @@ Query id: 3604df1c-acfd-4117-9c56-f86c69721121 :::note ClickHouse supports over 70 input and output formats, so between the thousands of functions and all the data formats, you can use ClickHouse to perform some impressive and fast ETL-like data transformations. In fact, you don't even -need a ClickHouse server up and running to transform data - you can use the `clickhouse-local` tool. View the [docs page of `clickhouse-local`](../operations/utilities/clickhouse-local.md) for details. +need a ClickHouse server up and running to transform data - you can use the `clickhouse-local` tool. View the [docs page of `clickhouse-local`](/interfaces/cli) for details. ::: diff --git a/docs/starter-guides/index.md b/docs/starter-guides/index.md deleted file mode 100644 index 85f8284fa8f..00000000000 --- a/docs/starter-guides/index.md +++ /dev/null @@ -1,18 +0,0 @@ ---- -slug: /starter-guides -title: 'Starter Guides' -description: 'Landing page for starter guides' -pagination_prev: null -pagination_next: null -doc_type: 'landing-page' -keywords: ['beginner', 'tutorial', 'create table', 'insert data', 'select data', 'update data', 'delete data'] ---- - -In this section of the docs you'll find starter guides for common SQL queries: `CREATE`, `INSERT`, `SELECT`, and mutations `UPDATE` and `DELETE`. - -| Page | Description | -|------------------------------------------------------------|------------------------------------------------------------------------| -| [Create Tables](../guides/creating-tables.md) | Starter guide on how to create a table. | -| [Insert Data](../guides/inserting-data.md) | Starter guide on how to insert data into a table. | -| [Select Data](../guides/writing-queries.md) | Starter guide on how to select data from a table. | -| [Update and Delete Data](../guides/developer/mutations.md) | Starter guide on mutations - updating and deleting data in ClickHouse. | diff --git a/docs/whats-new/changelog/index.md b/docs/whats-new/changelog/index.md index 8721e2d50fe..b832e36d5da 100644 --- a/docs/whats-new/changelog/index.md +++ b/docs/whats-new/changelog/index.md @@ -2145,7 +2145,3 @@ doc_type: 'changelog' #### Build/Testing/Packaging Improvement * The universal installation script will propose installation even on macOS. [#74339](https://github.com/ClickHouse/ClickHouse/pull/74339) ([Alexey Milovidov](https://github.com/alexey-milovidov)). -:::note -There have been no new releases yet for 2025. - View changelog for the year [2024](/docs/whats-new/changelog/2024). -::: diff --git a/sidebars.js b/sidebars.js index 98a750c81b1..2eca76e057a 100644 --- a/sidebars.js +++ b/sidebars.js @@ -78,7 +78,7 @@ const sidebars = { label: "Starter guides", collapsed: false, collapsible: false, - link: { type: "doc", id: "starter-guides/index" }, + link: { type: "doc", id: "guides/starter_guides/index" }, items: [ { type: "autogenerated", From 22e2580d95e4a694d4808ba4d0bdc266e24034fa Mon Sep 17 00:00:00 2001 From: Shaun Struwig <41984034+Blargian@users.noreply.github.com> Date: Thu, 6 Nov 2025 09:39:36 +0100 Subject: [PATCH 7/8] Incorporate review feedback --- .../starter_guides/working_with_arrays.md | 156 +++++++++++------- 1 file changed, 94 insertions(+), 62 deletions(-) diff --git a/docs/guides/starter_guides/working_with_arrays.md b/docs/guides/starter_guides/working_with_arrays.md index 7af56e7549f..e52f7bbc8b0 100644 --- a/docs/guides/starter_guides/working_with_arrays.md +++ b/docs/guides/starter_guides/working_with_arrays.md @@ -81,6 +81,46 @@ SELECT [1::UInt8, 2.5::Float32, 3::UInt8] AS mixed_array, toTypeName([1, 2.5, 3] └─────────────┴────────────────┘ ``` +
+Creating arrays of different types + +You can use the `use_variant_as_common_type` setting to change the default behavior described above. +This allows you to use the [Variant](/sql-reference/data-types/variant) type as a result type for `if`/`multiIf`/`array`/`map` functions when there is no common type for argument types. + +For example: + +```sql +SELECT + [1, 'ClickHouse', ['Another', 'Array']] AS array, + toTypeName(array) +SETTINGS use_variant_as_common_type = 1; +``` + +```response +┌─array────────────────────────────────┬─toTypeName(array)────────────────────────────┐ +│ [1,'ClickHouse',['Another','Array']] │ Array(Variant(Array(String), String, UInt8)) │ +└──────────────────────────────────────┴──────────────────────────────────────────────┘ +``` + +You can then also read the types from the array by type name: + +```sql +SELECT + [1, 'ClickHouse', ['Another', 'Array']] AS array, + array.UInt8, + array.String, + array.`Array(String)` +SETTINGS use_variant_as_common_type = 1; +``` + +```response +┌─array────────────────────────────────┬─array.UInt8───┬─array.String─────────────┬─array.Array(String)─────────┐ +│ [1,'ClickHouse',['Another','Array']] │ [1,NULL,NULL] │ [NULL,'ClickHouse',NULL] │ [[],[],['Another','Array']] │ +└──────────────────────────────────────┴───────────────┴──────────────────────────┴─────────────────────────────┘ +``` + +
+ Use of the index with square brackets provides a convenient way to access array elements. In ClickHouse, it's important to know that the array index always starts from **1**. This may be different from other programming languages you're used to where arrays are zero-indexed. @@ -270,7 +310,7 @@ FROM busy_airports ORDER BY outward_flights DESC ``` -### arrayMap {#arraymap} +### arrayMap and arrayZip {#arraymap} We saw in the previous query that Denver International Airport was the airport with the most outward flights for our particular chosen day. Let's take a look at how many of those flights were on-time, delayed by 15-30 minutes or delayed by more than 30 minutes. @@ -278,25 +318,25 @@ Let's take a look at how many of those flights were on-time, delayed by 15-30 mi Many of the array functions in ClickHouse are so-called ["higher-order functions"](/sql-reference/functions/overview#higher-order-functions) and accept a lambda function as the first parameter. The [`arrayMap`](/sql-reference/functions/array-functions#arrayMap) function is an example of one such higher-order function and returns a new array from the provided array by applying a lambda function to each element of the original array. -Run the query below which uses the `arrayMap` function to see which flights were delayed or on-time: +Run the query below which uses the `arrayMap` function to see which flights were delayed or on-time. +For pairs of origin/destinations, it shows the tail number and status for every flight: + +```sql runnable +WITH arrayMap( + d -> if(d >= 30, 'DELAYED', if(d >= 15, 'WARNING', 'ON-TIME')), + groupArray(DepDelayMinutes) + ) AS statuses -```sql SELECT Origin, toStringCutToZero(Dest) AS Destination, - Tail_Number, - DepDelayMinutes AS delay_minutes, ---highlight-start - arrayMap(d -> if(d >= 30, 'DELAYED', - if(d >= 15, 'WARNING', 'ON-TIME')), - [DepDelayMinutes])[1] AS status ---highlight-end + arrayZip(groupArray(Tail_Number), statuses) as tailNumberStatuses FROM ontime.ontime WHERE Origin = 'DEN' - AND FlightDate = '2024-01-01' - AND DepTime IS NOT NULL - AND DepDelayMinutes IS NOT NULL -ORDER BY DepDelayMinutes DESC + AND FlightDate = '2024-01-01' + AND DepTime IS NOT NULL + AND DepDelayMinutes IS NOT NULL +GROUP BY ALL ``` In the above query, the `arrayMap` function takes a single-element array `[DepDelayMinutes]` and applies the lambda function `d -> if(d >= 30, 'DELAYED', if(d >= 15, 'WARNING', 'ON-TIME'` to categorize it. @@ -316,7 +356,7 @@ FROM ontime.ontime WHERE Origin IN ('DEN', 'ATL', 'DFW') AND FlightDate = '2024-01-01' GROUP BY Origin, OriginCityName -ORDER BY num_on_time DESC +ORDER BY num_delays_30_min_or_more DESC ``` In the query above we pass a lambda function as the first argument to the [`arrayFilter`](/sql-reference/functions/array-functions#arrayFilter) function. @@ -400,70 +440,62 @@ In the example above, we used `arrayReduce` to find the average and maximum dela ### arrayJoin {#arrayJoin} Regular functions in ClickHouse have the property that they return the same number of rows than they receive. -There is however, one interesting and unique function that breaks this rule worth learning about - the `arrayJoin` function. +There is however, one interesting and unique function that breaks this rule, which is worth learning about - the `arrayJoin` function. -`arrayJoin` "explodes" an array - it takes an array and creates a separate row for each element. +`arrayJoin` "explodes" an array by taking it and creating a separate row for each element. This is similar to the `UNNEST` or `EXPLODE` SQL functions in other databases. Unlike most array functions that return arrays or scalar values, `arrayJoin` fundamentally changes the result set by multiplying the number of rows. -Consider the query below which returns a single array of destinations for a given origin: +Consider the query below which returns an array of values from 0 to 100 in steps of 10. +We could consider the array to be different delay times: 0 minutes, 10 minutes, 20 minutes, and so on. ```sql runnable -SELECT - Origin, - groupArray(toStringCutToZero(Dest)) AS Destinations -FROM ontime.ontime -WHERE Origin = 'DEN' - AND FlightDate = '2024-01-01' -GROUP BY Origin -LIMIT 1 +WITH range(0, 100, 10) AS delay +SELECT delay ``` -With `arrayJoin`, each destination becomes it's own row: +We can write a query using `arrayJoin` to work out how many delays there were of up to that number of minutes between two airports. +The query below creates a histogram showing the distribution of flight delays from Denver (DEN) to Miami (MIA) on January 1, 2024, using cumulative delay buckets: ```sql runnable +WITH range(0, 100, 10) AS delay, + toStringCutToZero(Dest) AS Destination + SELECT - Origin, ---highlight-next-line - arrayJoin(groupArray(toStringCutToZero(Dest))) AS Destination + 'Up to ' || arrayJoin(delay) || ' minutes' AS delayTime, + countIf(DepDelayMinutes >= arrayJoin(delay)) AS flightsDelayed FROM ontime.ontime -WHERE Origin = 'DEN' - AND FlightDate = '2024-01-01' -GROUP BY Origin -LIMIT 10 +WHERE Origin = 'DEN' AND Destination = 'MIA' AND FlightDate = '2024-01-01' +GROUP BY delayTime +ORDER BY flightsDelayed DESC ``` -Let's use this function to figure out how many flights depart from `DEN` at different times of the day: +In the query above we return an array of delays using a CTE clause (`WITH` clause). +`Destination` converts the destination code to a string. + +We use `arrayJoin` to explode the delay array into separate rows. +Each value from the `delay` array becomes its own row with alias `del`, +and we get 10 rows: one for `del=0`, one for `del=10`, one for `del=20`, etc. +For each delay threshold (`del`), the query counts how many flights had delays greater than or equal to that threshold +using `countIf(DepDelayMinutes >= del)`. + +`arrayJoin` also has a SQL command equivalent `ARRAY JOIN`. +The query above is reproduced below with the SQL command equivalent for comparison: ```sql runnable --- Create time-of-day categories for each flight, then explode to count flights per category -WITH flight_categories AS ( - SELECT - FlightDate, - Origin, - Dest, - -- Categorize the flight into multiple relevant buckets - arrayFilter(x -> x != '', [ - if(DepTime < 600, 'early-morning', ''), - if(DepTime >= 600 AND DepTime < 1200, 'morning', ''), - if(DepTime >= 1200 AND DepTime < 1800, 'afternoon', ''), - if(DepTime >= 1800, 'evening', '') - ]) AS categories - FROM ontime.ontime - WHERE FlightDate = '2024-01-01' -) -SELECT ---highlight-next-line - arrayJoin(categories) AS category, - count() AS flight_count -FROM flight_categories -GROUP BY category -ORDER BY flight_count DESC -``` +WITH range(0, 100, 10) AS delay, + toStringCutToZero(Dest) AS Destination -The query above categorizes each flight by its departure time into various buckets of the time of day: early-morning, morning, afternoon, evening with the help of the now familiar `arrayFilter` function. -It then uses the `arrayJoin` function to explode the array so each category becomes a separate row, allowing us to count how many total flights departed during each time period on January 1st, 2024. +SELECT + 'Up to ' || del || ' minutes' AS delayTime, + countIf(DepDelayMinutes >= del) flightsDelayed +FROM ontime.ontime +ARRAY JOIN delay AS del +WHERE Origin = 'DEN' AND Destination = 'MIA' AND FlightDate = '2024-01-01' +GROUP BY ALL +ORDER BY flightsDelayed DESC +``` ## Next steps {#next-steps} @@ -473,6 +505,6 @@ You might also want to learn about related data structures such as [`tuples`](/s Practice applying these concepts to your own datasets, and experiment with different queries on the SQL playground or other example datasets. Arrays are a fundamental feature in ClickHouse that enable, efficient analytical queries - as you become more comfortable with array functions, you'll find they can dramatically simplify complex aggregations and time-series analysis. -For more array fun we recommend the Youtube video below: +For more array fun, we recommend the YouTube video below from Mark, our resident data expert: \ No newline at end of file From 41becf9174ff08f851f447abde7739a39f09d4d8 Mon Sep 17 00:00:00 2001 From: Shaun Struwig <41984034+Blargian@users.noreply.github.com> Date: Thu, 6 Nov 2025 09:43:56 +0100 Subject: [PATCH 8/8] add a note on what arrayZip is doing --- docs/guides/starter_guides/working_with_arrays.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/guides/starter_guides/working_with_arrays.md b/docs/guides/starter_guides/working_with_arrays.md index e52f7bbc8b0..93d005b84e2 100644 --- a/docs/guides/starter_guides/working_with_arrays.md +++ b/docs/guides/starter_guides/working_with_arrays.md @@ -341,6 +341,7 @@ GROUP BY ALL In the above query, the `arrayMap` function takes a single-element array `[DepDelayMinutes]` and applies the lambda function `d -> if(d >= 30, 'DELAYED', if(d >= 15, 'WARNING', 'ON-TIME'` to categorize it. Then the first element of the resulting array is extracted with `[DepDelayMinutes][1]`. +The [`arrayZip`](/sql-reference/functions/array-functions#arrayZip) function combines the `Tail_Number` array and the `statuses` array into a single array. ### arrayFilter {#arrayfilter}