# Scan Performance

This page provides reference scan times for each supported file type and size, to help you estimate how long a scan job will take.

Benchmarks were run on Windows 11 over a high-speed broadband network, on a machine meeting the recommended specification (8-core CPU, 16 GB RAM). Scan time estimates are calculated as: scan time (s) = file size (MB) × averaged scan rate (s/MB), where the scan rate is derived from benchmarks across seven connectors and averaged per file type.
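The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of DISCOVER: the function name is hypothetical, and the example rate is derived from a single XML row on this page (\~48 MB in \~30 sec with OCR off) rather than from the averaged rates used to produce the tables.

```python
def estimate_scan_seconds(file_size_mb: float, rate_s_per_mb: float) -> float:
    """Estimated scan time in seconds: file size (MB) x scan rate (s/MB)."""
    if file_size_mb < 0 or rate_s_per_mb < 0:
        raise ValueError("file size and scan rate must be non-negative")
    return file_size_mb * rate_s_per_mb


# Illustrative rate only (assumption): taken from one XML table row,
# ~48 MB scanned in ~30 sec with OCR off.
xml_rate = 30 / 48  # 0.625 s/MB

# Estimate for a hypothetical 100 MB XML file.
print(f"{estimate_scan_seconds(100, xml_rate):.1f} s")  # prints "62.5 s"
```

Treat the result as a rough planning figure; actual times depend on system load, file content, and network conditions.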

These figures are intended as a general reference for a standard enterprise environment. Scan times will vary in practice depending on system load, file content, and network conditions.

### File Count vs File Size

Scan duration is affected by both the size and the number of files in scope. A single file will scan faster than multiple smaller files that add up to the same total size, due to the overhead of initiating a scan for each file. When planning a scan job, keep this in mind if your target directories contain a large number of small files.
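To see why file count matters, the effect can be modelled by adding a fixed cost per file to the size-based estimate. The sketch below is an illustration only: the per-file overhead and the scan rate are hypothetical values chosen to show the shape of the effect, not measured DISCOVER figures.

```python
def total_scan_seconds(file_sizes_mb, rate_s_per_mb, per_file_overhead_s):
    """Sum of (fixed per-file overhead + size-based scan time) over all files."""
    return sum(per_file_overhead_s + size * rate_s_per_mb for size in file_sizes_mb)


RATE_S_PER_MB = 0.5       # hypothetical scan rate
OVERHEAD_S_PER_FILE = 0.2  # hypothetical cost of initiating a scan for one file

# Same total size (100 MB), very different file counts.
one_large = total_scan_seconds([100.0], RATE_S_PER_MB, OVERHEAD_S_PER_FILE)
many_small = total_scan_seconds([0.1] * 1000, RATE_S_PER_MB, OVERHEAD_S_PER_FILE)

print(f"1 x 100 MB:    {one_large:.0f} s")
print(f"1000 x 0.1 MB: {many_small:.0f} s")
```

With these assumed values, the per-file overhead dominates the total for the 1,000 small files, even though the amount of data is the same.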

### OCR Impact by File Type

Enabling OCR increases processing time to varying degrees depending on file type:

**No meaningful impact:** TXT, XML, DOC, RTF, ODT, PPTX, XLSX

**Small impact:** HTML/HTM/MD (\~6% increase in processing time)

**Moderate impact (10–15%):** DOCX (\~14%), XLS (\~15%), ODS (\~11%)

**Pronounced impact:** PPT (\~26% increase in processing time)

**Severe impact:** PDF (\~410% increase in processing time on average when OCR is enabled; impact is significantly higher for PDFs containing image-based pages such as scanned documents)
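The average increases listed above can be applied to an OCR-off estimate as a simple multiplier. The following is a minimal sketch that assumes the averages on this page; the dictionary and function names are illustrative and not part of DISCOVER. For PDFs with image-based pages, expect the real figure to be considerably higher.

```python
# Average increase in processing time with OCR enabled, in percent,
# as listed on this page. Types with "no meaningful impact" are set to 0.
OCR_INCREASE_PCT = {
    "txt": 0, "xml": 0, "doc": 0, "rtf": 0, "odt": 0, "pptx": 0, "xlsx": 0,
    "html": 6, "htm": 6, "md": 6,
    "ods": 11, "docx": 14, "xls": 15,
    "ppt": 26,
    "pdf": 410,
}


def estimate_with_ocr(seconds_ocr_off: float, file_type: str) -> float:
    """Scale an OCR-off estimate by the average OCR increase for the file type."""
    key = file_type.lower().lstrip(".")
    if key not in OCR_INCREASE_PCT:
        raise ValueError(f"no OCR figure listed for file type: {file_type!r}")
    return seconds_ocr_off * (1 + OCR_INCREASE_PCT[key] / 100)


# A PDF estimated at 2 minutes with OCR off: a 410% increase is a 5.1x multiplier.
print(f"{estimate_with_ocr(120, 'pdf') / 60:.1f} min")
```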

### Quick Reference: Relative Scan Speed

From fastest to slowest (OCR off): TXT, XLS, XML, DOC, RTF, ODS, PPTX, ODT, DOCX, XLSX, PPT, PDF, HTML/HTM/MD.

PDF and HTML/HTM/MD take significantly longer to scan than other formats, and their scan times vary considerably depending on file content. For details on what to expect, see the sections on PDF and on HTML, HTM, and MD below.

### Office documents

#### DOCX

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~51 KB   | \~0.1 sec | \~0.1 sec |
| \~737 KB  | \~1 sec   | \~2 sec   |
| \~3 MB    | \~6 sec   | \~6 sec   |
| \~6 MB    | \~11 sec  | \~13 sec  |
| \~22 MB   | \~42 sec  | \~47 sec  |

#### DOC (legacy Word)

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~133 KB  | \~0.1 sec | \~0.1 sec |
| \~2.5 MB  | \~2 sec   | \~2 sec   |
| \~8.5 MB  | \~6 sec   | \~5 sec   |

#### ODT

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~30 KB   | \~0.1 sec | \~0.1 sec |
| \~563 KB  | \~1 sec   | \~1 sec   |
| \~3 MB    | \~6 sec   | \~6 sec   |
| \~6 MB    | \~12 sec  | \~13 sec  |

### Spreadsheets

#### XLSX

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~25 KB   | \~0.1 sec | \~0.1 sec |
| \~1 MB    | \~2 sec   | \~3 sec   |
| \~4 MB    | \~10 sec  | \~10 sec  |
| \~6 MB    | \~15 sec  | \~15 sec  |
| \~33 MB   | \~1.3 min | \~1.4 min |

#### XLS (legacy Excel)

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~51 KB   | <0.1 sec  | <0.1 sec  |
| \~2.5 MB  | \~2 sec   | \~2 sec   |
| \~12.5 MB | \~8 sec   | \~10 sec  |
| \~70 MB   | \~47 sec  | \~54 sec  |

#### ODS

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~15 KB   | <0.1 sec  | <0.1 sec  |
| \~1 MB    | \~2 sec   | \~2 sec   |
| \~4 MB    | \~7 sec   | \~8 sec   |
| \~6 MB    | \~11 sec  | \~12 sec  |
| \~34 MB   | \~1.0 min | \~1.1 min |

### Presentations

#### PPTX

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~87 KB   | \~0.1 sec | \~0.1 sec |
| \~1.2 MB  | \~2 sec   | \~2 sec   |
| \~3 MB    | \~5 sec   | \~5 sec   |
| \~5 MB    | \~8 sec   | \~8 sec   |
| \~21 MB   | \~34 sec  | \~35 sec  |

#### PPT (legacy PowerPoint)

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~256 KB  | \~0.7 sec | \~0.9 sec |
| \~2 MB    | \~6 sec   | \~7 sec   |
| \~4.5 MB  | \~13 sec  | \~17 sec  |
| \~5 MB    | \~15 sec  | \~18 sec  |
| \~18.5 MB | \~54 sec  | \~1.1 min |

PPT is slower than PPTX and shows more variability than any other format in this category. Enabling OCR adds roughly 26% to processing time on average, though individual results varied considerably across test runs.

### Plain text and markup

#### TXT

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~56 KB   | <0.1 sec  | <0.1 sec  |
| \~1 MB    | \~0.5 sec | \~0.5 sec |
| \~3 MB    | \~1 sec   | \~2 sec   |
| \~5 MB    | \~2 sec   | \~3 sec   |
| \~20 MB   | \~10 sec  | \~11 sec  |

#### XML

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~1.4 MB  | \~0.9 sec | \~0.9 sec |
| \~12 MB   | \~8 sec   | \~8 sec   |
| \~14 MB   | \~9 sec   | \~9 sec   |
| \~23 MB   | \~15 sec  | \~15 sec  |
| \~48 MB   | \~30 sec  | \~31 sec  |

#### RTF

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~133 KB  | \~0.2 sec | \~0.2 sec |
| \~6 MB    | \~8 sec   | \~8 sec   |
| \~8 MB    | \~11 sec  | \~11 sec  |
| \~10 MB   | \~14 sec  | \~14 sec  |
| \~24 MB   | \~33 sec  | \~34 sec  |

OCR has no meaningful impact on TXT, XML, or RTF. These formats are already plain text, so there is no text extraction step required, and they contain no embedded images for OCR to process.

### PDF

PDF behaviour is fundamentally different from all other formats. Scan time depends heavily on whether the PDF is text-native (created digitally) or image-based (scanned pages). Two PDFs of the same size can differ in scan time by an order of magnitude. This variability exists with OCR off, and becomes extreme with OCR on.

#### PDF (OCR off)

| File size | Scan time (approx.) |
| --------- | ------------------- |
| \~177 KB  | \~1 sec\*           |
| \~2.5 MB  | \~16 sec\*          |
| \~21 MB   | \~2 min\*           |

#### PDF (OCR on)

| File size | Scan time (approx.) |
| --------- | ------------------- |
| \~177 KB  | \~6 sec\*           |
| \~1.5 MB  | \~50 sec\*          |
| \~2.8 MB  | \~1.6 min\*         |
| \~3.2 MB  | \~1.8 min\*         |
| \~21 MB   | \~12 min\*          |

\*PDF scan times vary widely with file content, so treat the figures in both tables as approximate. With OCR enabled, processing time increases by \~410% on average over the OCR-off rate, and considerably more for PDFs containing image-based pages.

**Why PDF is different**

Of all supported file types, PDF is where OCR has the most significant and least predictable impact on scan performance.

With OCR disabled, DISCOVER extracts text directly from the PDF and skips all other content. Scan times at this setting are generally predictable, though PDFs tend to scan more slowly than other formats of a similar size due to the complexity of the format.

With OCR enabled, DISCOVER passes each page through optical character recognition to attempt text extraction, including from images. For PDFs that contain image-based pages, such as scanned documents or photographs saved as PDF, this is computationally expensive. A single image-heavy PDF can take several minutes to scan, even at a relatively small file size.

If scan performance is a concern, consider the following:

* If your PDFs are primarily text-native (digitally created) documents, OCR will add limited overhead and can generally be left enabled.
* If your PDFs contain scanned documents or image-heavy content, only enable OCR if extracting text from those pages is a requirement for your use case.
* If you are unsure of the composition of your PDF files, run a test scan on a representative sample with OCR on and off before scoping a full scan job.
* If unexpectedly long scan times are reported, check whether PDFs with OCR enabled are in scope. This is the most common cause.
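For the test scan suggested above, one way to pick a sample is to draw PDFs at random from the target directory. This is a sketch under stated assumptions: the function name, sample size, and seed are illustrative, and a random draw is only representative if the directory itself is. Running the scan on the sample is done in DISCOVER and is not shown here.

```python
import random
from pathlib import Path


def sample_pdfs(root, k=20, seed=0):
    """Return up to k PDF paths chosen at random from under root."""
    pdfs = sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if len(pdfs) <= k:
        return pdfs
    # A fixed seed makes the sample repeatable, so the OCR-on and OCR-off
    # test scans can be run against the same files.
    return random.Random(seed).sample(pdfs, k)


# Example usage (hypothetical path):
# for path in sample_pdfs("/data/shared/contracts", k=20):
#     print(path)
```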

### HTML, HTM, and MD

These are consistently the slowest file types to scan, regardless of OCR setting or connector. Unlike plain text formats, HTML and MD files contain markup, nested tags, inline elements, and links that must be parsed to extract the underlying text content.

This parsing overhead is inherent to how these formats are processed and scales with file size, which is why a single large file can take several minutes to scan. Enabling OCR adds only a small amount on top of this (\~6% on average).

| File size | OCR off   | OCR on    |
| --------- | --------- | --------- |
| \~501 KB  | \~11 sec  | \~12 sec  |
| \~5.4 MB  | \~2 min   | \~2.2 min |
| \~6.0 MB  | \~2.3 min | \~2.4 min |

If your target directories contain many HTML or Markdown files, the total scan duration will be significantly longer than the file count or total data size would suggest.

#### Managing scan time for HTML, HTM, and MD

If these file types are present in your scan scope and performance is a concern, consider the following:

* If HTML, HTM, or MD files do not contain sensitive data relevant to your scan objectives, exclude these extensions from the scan job entirely.
* If only specific directories are known to contain these file types, scope your scan to avoid those directories where possible.
* When planning scan jobs that include large numbers of these files, factor in the extended scan times shown above and adjust scheduling expectations accordingly.
* If a scan is taking longer than expected, check whether HTML, HTM, or MD files are in scope and in significant volume. Alongside PDFs with OCR enabled, these are the most common causes of unexpectedly long scan durations.
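To check whether the slow file types are in scope and in what volume, a quick inventory of the target directory can help before you schedule the job. The sketch below is illustrative and runs outside DISCOVER; the function name and the example path are assumptions.

```python
from collections import defaultdict
from pathlib import Path

# File types this page identifies as the slowest to scan.
SLOW_EXTENSIONS = {".pdf", ".html", ".htm", ".md"}


def inventory_slow_types(root):
    """Count files and total size in MB per slow extension under root."""
    stats = defaultdict(lambda: {"files": 0, "mb": 0.0})
    for path in Path(root).rglob("*"):
        ext = path.suffix.lower()
        if path.is_file() and ext in SLOW_EXTENSIONS:
            stats[ext]["files"] += 1
            stats[ext]["mb"] += path.stat().st_size / (1024 * 1024)
    return dict(stats)


# Example usage (hypothetical path):
# for ext, s in sorted(inventory_slow_types("/data/shared").items()):
#     print(f"{ext}: {s['files']} files, {s['mb']:.1f} MB")
```

A high file count or large total size for any of these extensions is a signal to exclude them, narrow the scan scope, or allow extra time.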

