# Extracting the content of a publication

:warning: The described feature is still experimental and its implementation is incomplete.

Many high-level features require access to the raw content (text, media, etc.) of a publication, such as:

* Text-to-speech
* Accessibility reader
* Basic search
* Full-text search indexing
* Image or audio indexes

The `ContentService` provides a way to iterate through a publication's content, extracted as semantic elements.

First, request the publication's `Content`, starting from a given `Locator`. If no locator is provided, the `Content` will be extracted from the beginning of the publication.

```swift
guard let content = publication.content(from: startLocator) else {
    // Abort as the content cannot be extracted
    return
}
```

## Extracting the raw text content

Getting the whole raw text of a publication is such a common use case that a helper is available on `Content`:

```swift
let wholeText = content.text()
```

This is an expensive operation. Proceed with caution, and cache the result if you need to reuse it.

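As a minimal sketch of the caching advice above, the wrapper below runs an extraction closure at most once and reuses the result. `CachedText` and `extractionCount` are hypothetical names, not part of the toolkit, and the closure stands in for a call such as `content.text()`, assumed here to return an optional string.

```swift
// Hypothetical helper, not part of the toolkit: runs an expensive
// extraction at most once and reuses the result afterwards.
final class CachedText {
    private let extract: () -> String?
    private var cached: String?
    private var didExtract = false

    /// Number of times the extraction closure was actually run.
    private(set) var extractionCount = 0

    init(extract: @escaping () -> String?) {
        self.extract = extract
    }

    /// Returns the cached text, extracting it on first access only.
    func text() -> String? {
        if !didExtract {
            cached = extract()
            didExtract = true
            extractionCount += 1
        }
        return cached
    }
}

// Stand-in for `content.text()`; a real closure would capture `content`.
let cache = CachedText { "Chapter 1\nIt was a dark and stormy night." }
let first = cache.text()
let second = cache.text() // Served from the cache, no second extraction
```

This sketch is not thread-safe. If the text can be requested from several threads, the access to the cached value would need to be synchronized.
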
## Iterating through the content

You can iterate through the individual `Content` elements with a regular `for` loop by converting the `Content` to a sequence:

```swift
for element in content.sequence() {
    // Process element
}
```

Alternatively, you can get the whole list of elements with `content.elements()`, or use the lower-level APIs to iterate through the content manually:

```swift
let iterator = content.iterator()
// `next()` can throw, so call it from a throwing context or inside a `do`/`catch` block.
while let element = try iterator.next() {
    print(element)
}
```

Some `Content` implementations support bidirectional iteration. To iterate backwards, use:

```swift
let iterator = content.iterator()
while let element = try iterator.previous() {
    print(element)
}
```

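To illustrate how a throwing iterator can be exposed as a regular sequence, here is one plausible bridge that skips the elements which fail to extract. This is a sketch under assumptions, not necessarily how the toolkit implements `sequence()`: `DemoIterator`, `ExtractionError` and `skippingErrors(_:)` are hypothetical stand-ins.

```swift
struct ExtractionError: Error {}

// Hypothetical stand-in for a content iterator: it yields strings and
// throws when it reaches an item stored as a failure.
final class DemoIterator {
    private var remaining: ArraySlice<Result<String, ExtractionError>>

    init(_ items: [Result<String, ExtractionError>]) {
        remaining = items[...]
    }

    func next() throws -> String? {
        guard let item = remaining.popFirst() else { return nil }
        return try item.get()
    }
}

// Bridges the throwing iterator to a regular `Sequence`, skipping the
// elements that failed instead of stopping at the first error.
func skippingErrors(_ iterator: DemoIterator) -> AnySequence<String> {
    AnySequence {
        AnyIterator<String> {
            while true {
                do {
                    // A `nil` result ends the sequence.
                    return try iterator.next()
                } catch {
                    // Ignore the failed element and try the following one.
                    continue
                }
            }
        }
    }
}

let demo = DemoIterator([
    .success("Heading"),
    .failure(ExtractionError()),
    .success("Paragraph"),
])
let texts = Array(skippingErrors(demo))
```

The resulting sequence is single-pass because it consumes the underlying iterator. Whether to skip, log or rethrow errors is a design choice that depends on the feature being built.
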
## Processing the elements

The `Content` iterator yields `ContentElement` objects, each representing a single semantic portion of the publication, such as a heading, a paragraph or an embedded image.

Every element has a `locator` property targeting it in the publication. You can use the locator, for example, to navigate to the element or to draw a `Decoration` on top of it.

```swift
navigator.go(to: element.locator)
```

### Types of elements

Depending on the actual implementation of `ContentElement`, more properties are available to access the actual data. The toolkit ships with a number of default implementations for common types of elements.

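Accessing the type-specific properties usually means downcasting each element. The sketch below shows the pattern with a `switch` and `as` patterns. All the `Demo…` types are hypothetical stand-ins defined only to keep the example self-contained; with the toolkit you would match on its own element types instead.

```swift
// Hypothetical stand-ins for element types, for illustration only.
protocol DemoElement {}
struct DemoText: DemoElement { let text: String }
struct DemoImage: DemoElement { let href: String; let caption: String? }
struct DemoAudio: DemoElement { let href: String }

/// Describes an element by downcasting it to the types we care about.
func describe(_ element: DemoElement) -> String {
    switch element {
    case let text as DemoText:
        return "text: \(text.text)"
    case let image as DemoImage:
        let caption = image.caption ?? "no caption"
        return "image: \(image.href) (\(caption))"
    default:
        // Types we do not handle explicitly fall through here.
        return "unsupported element"
    }
}

let elements: [DemoElement] = [
    DemoText(text: "Hello"),
    DemoImage(href: "cover.jpg", caption: nil),
    DemoAudio(href: "intro.mp3"),
]
let descriptions = elements.map(describe)
```

Keeping a `default` case means the code keeps working when it meets an element type it does not know about.
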
#### Embedded media

The `EmbeddedContentElement` protocol is implemented by any element referencing an external resource. It has an `embeddedLink` property you can use to get the actual content of the resource.

```swift
if let element = element as? EmbeddedContentElement {
    // Read the raw bytes of the referenced resource.
    let bytes = try publication
        .get(element.embeddedLink)
        .read().get()
}
```

Here are the default implementations available:

* `AudioContentElement` - audio clips
* `VideoContentElement` - video clips
* `ImageContentElement` - bitmap images, with the additional property:
    * `caption: String?` - figure caption, when available

#### Text

##### Textual elements

The `TextualContentElement` protocol is implemented by any element which can be represented as human-readable text. This is useful when you want to extract the text content of a publication without caring about each individual type of element.

```swift
// `content()` returns an optional, so fall back to an empty list of elements.
let wholeText = (publication.content()?.elements() ?? [])
    // Keep only the elements that can be represented as text.
    .compactMap { ($0 as? TextualContentElement)?.text }
    // Skip the elements without any text.
    .filter { !$0.isEmpty }
    .joined(separator: "\n")
```

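Textual elements are also a natural basis for a basic search. The following sketch filters elements whose text contains a query and returns their locators. `DemoTextual` is a hypothetical stand-in whose locator is reduced to a plain string, and `search(_:in:)` is an illustrative helper, not a toolkit API.

```swift
import Foundation

// Hypothetical stand-in for a textual element. The locator is reduced
// to a plain string to keep the example self-contained.
struct DemoTextual {
    let locator: String
    let text: String
}

/// Returns the locators of the elements whose text contains `query`,
/// ignoring case and diacritics.
func search(_ query: String, in elements: [DemoTextual]) -> [String] {
    elements
        .filter {
            $0.text.range(
                of: query,
                options: [.caseInsensitive, .diacriticInsensitive]
            ) != nil
        }
        .map(\.locator)
}

let searchable = [
    DemoTextual(locator: "chapter1#p1", text: "A buttery Croissant"),
    DemoTextual(locator: "chapter1#p2", text: "Nothing to see here"),
    DemoTextual(locator: "chapter2#p1", text: "Crème brûlée"),
]
let croissantHits = search("croissant", in: searchable)
let cremeHits = search("creme", in: searchable)
```

Each hit keeps the locator of its element, so a result can be passed to the navigator to jump to the match.
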
##### Text elements

Actual text elements are instances of `TextContentElement`, which represents a single block of text such as a heading, a paragraph or a list item. It is composed of a `role` and a list of `segments`.

The `role` is the nature of the text element in the document, for example a heading, body, footnote or quote. It can be used to reconstruct part of the structure of the original document.

A text element is composed of individual segments, each with its own `locator` and `attributes`. Segments are useful to associate attributes with a portion of a text element. For example, given the HTML paragraph:

```html
<p>It is pronounced <span lang="fr">croissant</span>.</p>
```

The following `TextContentElement` will be produced:

```swift
// Simplified representation: the locators are not shown.
TextContentElement(
    role: .body,
    segments: [
        TextContentElement.Segment(text: "It is pronounced "),
        TextContentElement.Segment(text: "croissant", attributes: [ContentAttribute(key: .language, value: "fr")]),
        TextContentElement.Segment(text: ".")
    ]
)
```

If you are not interested in the segment attributes, you can also use `element.text` to get the concatenated raw text.

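To make the role of segments concrete, here is a small sketch that pairs each segment of the paragraph above with the language a speech synthesizer could use. `DemoSegment` is a hypothetical, simplified stand-in holding only the text and an optional language, and the `"en"` default is an assumption. In practice the default would more likely come from the publication metadata.

```swift
// Hypothetical, simplified stand-in for a text segment: only the text
// and an optional language attribute are modeled.
struct DemoSegment {
    let text: String
    var language: String? = nil
}

let segments = [
    DemoSegment(text: "It is pronounced "),
    DemoSegment(text: "croissant", language: "fr"),
    DemoSegment(text: "."),
]

// Concatenated raw text, ignoring the attributes.
let rawText = segments.map(\.text).joined()

// Pair each segment with the language to use, falling back to a
// default when the segment does not declare one.
let defaultLanguage = "en" // Assumed default for this example
let utterances = segments.map {
    (text: $0.text, language: $0.language ?? defaultLanguage)
}
let languages = utterances.map { $0.language }
```
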
### Element attributes

All types of `ContentElement` can have associated attributes. Custom `ContentService` implementations can use this as an extensibility point.

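As an illustration of attributes as an extensibility point, the sketch below models a list of key/value attributes with a typed lookup. Every name in it (`DemoAttributeKey`, `DemoAttribute`, `value(for:as:)`, the `readingTime` key) is hypothetical and only loosely modeled on the `ContentAttribute(key:value:)` shape used earlier. It is not the toolkit's API.

```swift
// Hypothetical stand-ins, for illustration only.
struct DemoAttributeKey: Hashable {
    let rawValue: String

    static let language = DemoAttributeKey(rawValue: "language")
    // A custom key that a custom service could define.
    static let readingTime = DemoAttributeKey(rawValue: "readingTime")
}

struct DemoAttribute {
    let key: DemoAttributeKey
    let value: Any
}

extension Array where Element == DemoAttribute {
    /// Returns the first value stored under `key`, provided it has the
    /// expected type.
    func value<T>(for key: DemoAttributeKey, as type: T.Type = T.self) -> T? {
        guard let attribute = first(where: { $0.key == key }) else {
            return nil
        }
        return attribute.value as? T
    }
}

let attributes = [
    DemoAttribute(key: .language, value: "fr"),
    DemoAttribute(key: .readingTime, value: 3),
]
let language: String? = attributes.value(for: .language)
let minutes = attributes.value(for: .readingTime, as: Int.self)
```

Looking up a key with the wrong type returns `nil` instead of crashing, which keeps consumers safe when a custom service stores unexpected values.
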
## Use cases

### An index of all images embedded in the publication

This example extracts all the embedded images in the publication and displays them in a SwiftUI list. Tapping an image jumps to its location in the publication.

```swift
import SwiftUI
// Also import the Readium modules providing `Publication`, `Navigator` and `Locator`.

struct ImageIndex: View {
    struct Item: Hashable {
        let locator: Locator
        let text: String?
        let image: UIImage
    }

    let publication: Publication
    let navigator: Navigator
    @State private var items: [Item] = []

    init(publication: Publication, navigator: Navigator) {
        self.publication = publication
        self.navigator = navigator
    }

    var body: some View {
        ScrollView {
            LazyVStack {
                ForEach(items, id: \.self) { item in
                    VStack {
                        Image(uiImage: item.image)
                        Text(item.text ?? "No caption")
                    }
                    .onTapGesture {
                        navigator.go(to: item.locator)
                    }
                }
            }
        }
        .onAppear {
            // Collect every image element whose resource can be read and decoded.
            items = publication.content()?
                .elements()
                .compactMap { element in
                    guard
                        let element = element as? ImageContentElement,
                        let image = try? publication.get(element.embeddedLink)
                            .read().map(UIImage.init(data:)).get()
                    else {
                        return nil
                    }

                    return Item(
                        locator: element.locator,
                        // Prefer the caption, falling back to the accessibility label.
                        text: element.caption ?? element.accessibilityLabel,
                        image: image
                    )
                }
                ?? []
        }
    }
}
```

## References

* [Content Iterator proposal](https://github.com/readium/architecture/pull/177)