Start_block misalignment caused by SIGSTOP/SIGCONT #554

@amissael95

Describe the bug
When a SIGSTOP/SIGCONT signal is received while LTFS is writing data to tape through the sg driver, the tape position held in the LE cache can diverge from the position reported by the drive. When that happens, the index is not written at the position described by its self-pointer, and the start_block of subsequent files does not reflect the position where the file was actually written. Reading those files then fails with LTFS error LTFS11089E.

To Reproduce

This issue is most evident when SIGSTOP/SIGCONT is sent while writing files whose length is not a multiple of 512KiB, because such files do not fill their last block. Reading those files afterwards fails with LTFS11089E, since the start_block in the index does not point to the real start block of the file. The steps to reproduce the issue are:

Writing files whose size is not a multiple of 512KiB:

  1. Mount a tape using the sg driver.
  2. Write some files whose size is not a multiple of 512KiB to the LTFS mount point.
  3. While writing, send SIGSTOP and SIGCONT:
2025-09-05T20:42:42.698101-04:00 rocket ltfs[884015]: d86dc LTFS30397D Backend locate: (1, 345590) 607B800268.
2025-09-05T20:42:42.706302-04:00 rocket ltfs[884015]: d86dc LTFS30398D Backend readpos: (1, 345590) FM = 89 607B800268.
--
2025-09-05T20:43:06.989991-04:00 rocket ltfs[884015]: d7d31 LTFS14037D FUSE flush '/DV0102L8/.LTFSEE_DATA/1700499909257325709-3958063085212139788-1298898597-1231942-0'.
2025-09-05T20:43:06.990220-04:00 rocket ltfs[884015]: d7d30 LTFS14035D FUSE release file '/DV0102L8/.LTFSEE_DATA/1700499909257325709-3958063085212139788-1298898597-1231942-0'.
2025-09-05T20:43:06.990617-04:00 rocket ltfs[884015]: d7d30 LTFS11601D _fsops_dproxy_put: 1700499909257325709-3958063085212139788-1298898597-1231942-0, 1.
2025-09-05T20:43:06.992063-04:00 rocket ltfs[884015]: d7d30 LTFS14032D FUSE open file '/DV0102L8/.LTFSEE_DATA/1700499909257325709-3958063085212139788-1298898597-1231942-0' (write-only).
2025-09-05T20:43:06.992590-04:00 rocket ltfs[884015]: d7d31 LTFS14040D FUSE create file '/DV0102L8/.LTFSEE_DATA/1700499909257325709-3958063085212139788-1853539870-1231922-0'.
2025-09-05T20:43:06.992753-04:00 rocket ltfs[884015]: d7d30 LTFS11601D _fsops_dproxy_get: 1700499909257325709-3958063085212139788-1298898597-1231942-0, 0.
2025-09-05T20:43:06.993130-04:00 rocket ltfs[884015]: d7d31 LTFS11601D _fsops_dproxy_get: 1700499909257325709-3958063085212139788-1853539870-1231922-0, 1.
2025-09-05T20:43:06.993563-04:00 rocket ltfs[884015]: d877c LTFS14053D FUSE removexattr '/DV0102L8/.LTFSEE_DATA/1700499909257325709-3958063085212139788-1853539870-1231922-0' (name='user.ltfs.hash.md5sum').

Note: for this example LTFS was started with verbose=303

[root@rocket ~]# date;kill -19 $(pidof ltfs); sleep 5; date;kill -19 $(pidof ltfs); sleep 5; date;kill -19 $(pidof ltfs); sleep 5; date;kill -19 $(pidof ltfs); kill -18 $(pidof ltfs); sleep 2; kill -19 $(pidof ltfs); sleep 2; kill -18 $(pidof ltfs);
Fri Sep  5 20:42:47 EDT 2025
Fri Sep  5 20:42:52 EDT 2025
Fri Sep  5 20:42:57 EDT 2025
Fri Sep  5 20:43:02 EDT 2025
  4. Read some of the files written to confirm that error LTFS11089E occurs:
2025-09-05T20:49:04.768732-04:00 rocket ltfs[884015]: d7d30 LTFS30204D READ (0x08) expected error -20000.
2025-09-05T20:49:04.768799-04:00 rocket ltfs[884015]: d7d30 LTFS30218D Read block: underrun in illegal length. residual = 518132, actual = 6156.
2025-09-05T20:49:04.768845-04:00 rocket ltfs[884015]: d7d30 LTFS11089E Cannot read: expected 524288 bytes from the medium, but received 6156 bytes.
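
The figures in the LTFS30218D and LTFS11089E messages can be checked with a little block arithmetic. The sketch below is only an illustration: the helper names are hypothetical (they are not LTFS functions), and it assumes that the 6156-byte block returned by the drive is the short final block of some file, which the log does not state explicitly.

```c
#include <stdint.h>

/* Block size used in this report: 512 KiB = 524288 bytes. */
#define BLOCK_SIZE UINT32_C(524288)

/* Number of tape blocks a file of `size` bytes occupies. */
uint64_t blocks_for_file(uint64_t size)
{
    return (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Length of a file's final block: a full block when `size` is an exact
 * multiple of BLOCK_SIZE, a short block otherwise, 0 for an empty file. */
uint32_t last_block_len(uint64_t size)
{
    if (size == 0)
        return 0;
    uint32_t rem = (uint32_t)(size % BLOCK_SIZE);
    return rem ? rem : BLOCK_SIZE;
}

/* Residual when `requested` bytes are asked for but the block on tape
 * holds only `actual` bytes. */
uint32_t read_residual(uint32_t requested, uint32_t actual)
{
    return requested - actual;
}
```

Calling `read_residual(524288, 6156)` gives 518132, the residual shown in the log. A start_block that is off by even one block can therefore land a full-block read on a short block, which is why files whose size is not a multiple of 512KiB expose the problem on read.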

Writing files whose size is a multiple of 512KiB:

To make this scenario clearer, I added the following check, similar to PR #552, to detect a mismatch between the cached tape position and the position returned by the drive:

+
+
+       /* Get the tape position from the tape drive by using the SCSI command READPOS*/
+       ret = tape_get_position_from_drive(vol->device, &current_position);
+       if (ret < 0) {
+               /* Return error since the current tape position was unable to be determined, so there could be an undetected position mismatch */
+               ltfsmsg(LTFS_ERR, 11081E, ret);
+       }
+
+       /* Prior to writing the index, compare the current location of the head position to the head location 
+       that is kept in the cache of ltfs (physical_selfptr). If they are different return error (-1) */
+       diff = ((unsigned long long)physical_selfptr.block - (unsigned long long)current_position.block);
+       if (diff) {
+               /* Position mismatch, diff not equal zero */
+               ltfsmsg(LTFS_INFO, 17293I, (unsigned long long)physical_selfptr.block, (unsigned long long)current_position.block);
+       }
+
        old_selfptr = vol->index->selfptr;
        vol->index->selfptr.partition = partition;
  1. Mounting tape with sg driver
  2. Writing a file whose size is a multiple of 512KiB:
cp 7GFile_* /ltfs/
  3. Sending SIGSTOP/SIGCONT while writing:
[root@rocket SDE_ltfs]# date; pkill -SIGCONT ltfs
Tue Nov  4 22:14:25 EST 2025
[root@rocket SDE_ltfs]# date; pkill -SIGCONT ltfs
Tue Nov  4 22:14:27 EST 2025
[root@rocket SDE_ltfs]# date; pkill -SIGCONT ltfs
Tue Nov  4 22:14:27 EST 2025
[root@rocket SDE_ltfs]# date; pkill -SIGCONT ltfs
Tue Nov  4 22:14:28 EST 2025
[root@rocket SDE_ltfs]# date; pkill -SIGSTOP ltfs
Tue Nov  4 22:14:29 EST 2025
[root@rocket SDE_ltfs]# date; pkill -SIGSTOP ltfs
Tue Nov  4 22:14:29 EST 2025
[root@rocket SDE_ltfs]# date; pkill -SIGCONT ltfs
Tue Nov  4 22:14:31 EST 2025
[root@rocket SDE_ltfs]# date; pkill -SIGSTOP ltfs
Tue Nov  4 22:15:25 EST 2025
[root@rocket SDE_ltfs]# date; pkill -SIGCONT ltfs
Tue Nov  4 22:15:47 EST 2025
  4. Position mismatch detected when writing the index!
2025-11-04T22:15:18.001267-05:00 rocket ltfs[293270]: 4799a LTFS17068I Syncing index of F31219 (Reason: Periodic Sync) 000783B48B.
2025-11-04T22:15:18.003714-05:00 rocket ltfs[293270]: 4799a LTFS17293I Position mismatch. Cached tape position = 318956. Current tape position = 318958.
2025-11-04T22:15:18.003800-05:00 rocket ltfs[293270]: 4799a LTFS17235I Writing index of F31219 to b (Reason: Periodic Sync, 4 files) 000783B48B.
2025-11-04T22:15:24.843400-05:00 rocket ltfs[293270]: 4799a LTFS17236I Wrote index of F31219 (Gen = 27, Part = b, Pos = 318957, 000783B48B).
2025-11-04T22:15:24.844189-05:00 rocket ltfs[293270]: 4799a LTFS11337I Update index-dirty flag (0) - F31219 (0x0x5587f92fdc10).
2025-11-04T22:15:24.844300-05:00 rocket ltfs[293270]: 4799a LTFS17070I Synced index of F31219 (0) 000783B48B.
2025-11-04T22:15:47.071259-05:00 rocket ltfs[293270]: 47999 LTFS11337I Update index-dirty flag (1) - F31219 (0x0x5587f92fdc10).
2025-11-04T22:20:24.000518-05:00 rocket ltfs[293270]: 4799a LTFS11338I Syncing index of F31219 000783B48B.
2025-11-04T22:20:24.001195-05:00 rocket ltfs[293270]: 4799a LTFS17068I Syncing index of F31219 (Reason: Periodic Sync) 000783B48B.
2025-11-04T22:20:24.003380-05:00 rocket ltfs[293270]: 4799a LTFS17293I Position mismatch. Cached tape position = 333761. Current tape position = 333762.
2025-11-04T22:20:24.003483-05:00 rocket ltfs[293270]: 4799a LTFS17235I Writing index of F31219 to b (Reason: Periodic Sync, 5 files) 000783B48B.
2025-11-04T22:20:38.695398-05:00 rocket ltfs[293270]: 4799a LTFS17236I Wrote index of F31219 (Gen = 28, Part = b, Pos = 333762, 000783B48B).

Expected behavior
We need to prevent this error from occurring. LTFS is not aware of the problem: data appears to be written correctly, but the start position recorded in the index points to the previous file. One quick approach is to not write the index when this position mismatch is detected.
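
A minimal sketch of such a guard follows. It is not the LTFS implementation: `struct pos`, `position_drift` and `check_before_index_write` are hypothetical names used only for illustration, and the real code would get the drive position from the backend (for example through a READ POSITION query, as in the check shown earlier).

```c
#include <stdint.h>

/* Hypothetical stand-in for a tape position; not the LTFS structure. */
struct pos {
    uint32_t partition;
    uint64_t block;
};

/* Signed drift in blocks: positive when the drive is ahead of the cache. */
int64_t position_drift(const struct pos *cached, const struct pos *drive)
{
    if (drive->block >= cached->block)
        return (int64_t)(drive->block - cached->block);
    return -(int64_t)(cached->block - drive->block);
}

/* Returns 0 when the index may be written, -1 when the cached position and
 * the drive position disagree and the index write should be refused. */
int check_before_index_write(const struct pos *cached, const struct pos *drive)
{
    if (cached->partition != drive->partition)
        return -1;
    return position_drift(cached, drive) == 0 ? 0 : -1;
}
```

With the values from the first LTFS17293I message (cached 318956, drive 318958), `position_drift` reports 2 and the guard returns -1, so the index would not be written. Computing a signed drift rather than an unsigned difference keeps the direction of the mismatch available for logging.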

Additional context

This problem occurs because the sg driver's sg_ioctl(SG_IO) syscall is interruptible: if a signal such as SIGSTOP arrives before the SCSI command completes, it returns -ERESTARTSYS (not -EINTR). The syscall is then re-executed from userspace, so the last block is sent to the drive a second time while the position tracked by LTFS is incremented only once.
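
The effect on the two positions can be illustrated with a toy model. This is a sketch under simplifying assumptions, not kernel or LTFS code: it uses two plain counters, all names are hypothetical, and it assumes every restarted write stores exactly one extra block on tape.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: two counters standing in for the drive's real position
 * and the position LTFS keeps in its cache. */
struct tape_model {
    uint64_t drive_block;   /* where the drive actually is */
    uint64_t cached_block;  /* where LTFS believes it is */
};

/* Write one block. When `restarted` is true, the write command is assumed
 * to have been issued twice, so the drive advances by two blocks while the
 * caller still accounts for a single successful write. */
void model_write_block(struct tape_model *t, bool restarted)
{
    t->drive_block += restarted ? 2 : 1;
    t->cached_block += 1;
}

/* Write `count` blocks; the block at index `restart_at` is restarted.
 * Pass a value >= count for a write with no restart. */
void model_write_file(struct tape_model *t, uint64_t count, uint64_t restart_at)
{
    for (uint64_t i = 0; i < count; i++)
        model_write_block(t, i == restart_at);
}

/* How many blocks the drive is ahead of the cached position. */
uint64_t model_mismatch(const struct tape_model *t)
{
    return t->drive_block - t->cached_block;
}
```

Starting both counters at an arbitrary block such as 318900 and writing two 28-block files with one restart each leaves the cache at 318956 and the drive at 318958, the same two-block gap as in the first LTFS17293I message. In this model the next file's start_block would be taken from `cached_block`, while its data actually begins at `drive_block`.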
