mirror of
https://github.com/stashapp/stash.git
synced 2025-12-17 20:34:37 +03:00
Add cdp support for xpath scrapers (#625)
Co-authored-by: WithoutPants <53250216+WithoutPants@users.noreply.github.com>
@@ -34,10 +34,18 @@ exclude:
_a useful [link](https://regex101.com/) to experiment with regexps_
## Scraping User Agent string
## Scraping
### User Agent string
Some websites reject requests that do not carry a legitimate User-Agent string. If entered, this string is sent as the `User-Agent` header value in HTTP scrape requests.
### Chrome CDP path
Some scrapers require a Chrome instance to function correctly. If this setting is left empty, stash will attempt to find the Chrome executable on the `PATH` environment variable, and will fail if it cannot find one.
`Chrome CDP path` can be set to the path of the Chrome executable, or to the http(s) address of a remote Chrome instance (for example: `http://localhost:9222/json/version`).
## Authentication
By default, stash is not configured with any sort of password protection. To enable password protection, both `Username` and `Password` must be populated. Note that when entering a new username and password where none was set previously, the system will immediately request these credentials to log you in.
@@ -283,6 +283,22 @@ For backwards compatibility, `regex`, `subscraper` and `parseDate` are also allo
Attribute post-processing is applied in the following order: `concat`, `regex`, `subscraper`, `parseDate` and then `split`.
##### CDP support
Some websites deliver content that cannot be scraped from the raw HTML alone because they use JavaScript to load it dynamically, so direct xpath scraping does not work on them. As an option, stash can use the Chrome DevTools Protocol (CDP) to load the page in an instance of Chrome and then scrape the rendered result.
Chrome CDP support can be enabled for a specific scraping configuration by adding the following to the root of the yml configuration:
```
driver:
  useCDP: true
```
Optionally, you can add a `sleep` value under the `driver` section. This specifies how long (in seconds) the scraper waits after loading the page before performing the scrape. Some sites need this extra time for their scripts to finish loading. If unset, this value defaults to 2 seconds.
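A minimal sketch of a `driver` section that also sets `sleep` might look like the following. The value `5` is an illustrative assumption, not a recommended setting, and should be tuned per site:

```yaml
driver:
  useCDP: true
  # Assumed example value: wait 5 seconds after the page loads before scraping.
  # Omit this key to fall back to the 2-second default.
  sleep: 5
```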
When `useCDP` is set to true, stash will launch or connect to an instance of Chrome. The behaviour is dictated by the `Chrome CDP path` setting in the user configuration. If that setting is left empty, stash will attempt to find the Chrome executable on the `PATH` environment variable, and will fail if it cannot find one.
`Chrome CDP path` can be set to the path of the Chrome executable, or to the http(s) address of a remote Chrome instance (for example: `http://localhost:9222/json/version`).
##### Example
A performer and scene xpath scraper is shown as an example below: