In-app help manual (#628)

2025-12-17 20:34:37 +03:00 · 2020-06-23 10:15:02 +10:00
parent 534d47500b
commit 83f8bc0832
19 changed files with 1555 additions and 48 deletions
--- a/ui/v2.5/src/docs/en/Scraping.md
+++ b/ui/v2.5/src/docs/en/Scraping.md
@@ -0,0 +1,395 @@
+# Metadata Scraping
+
+Stash supports scraping of performer and scene details.
+
+Stash includes a freeones.xxx performer scraper built in.
+
+# Adding custom scrapers
+
+By default, Stash looks for scraper configurations in the `scrapers` sub-directory of the directory where the stash `config.yml` is read. This will either be the `$HOME/.stash` directory or the current working directory.
+
+Custom scrapers are added by adding configuration yaml files (format: `scrapername.yml`) to the `scrapers` directory.
+
+After scrapers are added, removed or edited while stash is running, they can be reloaded by clicking the `Scrape With...` button in New/Edit Performer or Scene page and clicking `Reload Scrapers`.
+
+# Using custom scrapers
+
+Scrapers support a number of different scraping types.
+
+Performer details can be scraped from the new/edit Performer page in two different ways:
+
+* click on the `Scrape With...` button and select the scraper to scrape with. You will be presented with a search dialog to search for the performer by their name
+* enter the URL containing the Performer's details in the URL field. If the URL matches a pattern known to one of the scrapers, then a button will appear to scrape the details.
+
+Scene details can be scraped using URL as above, or via the `Scrape With...` button, which scrapes using the current scene metadata.
+
+# Community Scrapers
+The stash community maintains a number of custom scraper configuration files that can be found [here](https://github.com/stashapp/CommunityScrapers).
+
+# Scraper configuration file format
+
+## Basic scraper configuration file structure
+
+```
+name: <site>
+performerByName:
+  <single scraper config>
+performerByFragment:
+  <single scraper config>
+performerByURL:
+  <multiple scraper URL configs>
+sceneByFragment:
+  <single scraper config>
+sceneByURL:
+  <multiple scraper URL configs>
+<other configurations>
+```
+
+`name` is mandatory, all other top-level fields are optional. The inclusion of each top-level field determines what capabilities the scraper has.
+
+A scraper configuration in any of the top-level fields must at least have an `action` field. The other fields are required based on the value of the `action` field.
+
+The scraping types and their required fields are outlined in the following table:
+
+| Behaviour | Required configuration |
+|-----------|------------------------|
+| Scraper in `Scrape...` dropdown button in Performer Edit page | Valid `performerByName` and `performerByFragment` configurations. |
+| Scrape performer from URL | Valid `performerByURL` configuration with matching URL. |
+| Scraper in `Scrape...` dropdown button in Scene Edit page | Valid `sceneByFragment` configuration. |
+| Scrape scene from URL | Valid `sceneByURL` configuration with matching URL. |
+
+URL-based scraping accepts multiple scrape configurations, and each configuration requires a `url` field. stash iterates through these configurations, attempting to match the entered URL against the `url` fields in the configuration. It executes the first scraping configuration where the entered URL contains the value of the `url` field. 
+
+## Scraper Actions
+
+### Script
+
+Executes a script to perform the scrape. The `script` field is required for this action and accepts a list of string arguments. For example:
+
+```
+action: script
+script:
+  - python
+  - iafdScrape.py
+  - query
+```
+
+This configuration would execute `python iafdScrape.py query`.
+
+Stash sends data to the script process's `stdin` stream and expects the output to be streamed to the `stdout` stream. Any errors and progress messages should be output to `stderr`.
+
+The script is sent input and expects output based on the scraping type, as detailed in the following table:
+
+| Scrape type | Input | Output |
+|-------------|-------|--------|
+| `performerByName` | `{"name": "<performer query string>"}` | Array of JSON-encoded performer fragments (including at least `name`) |
+| `performerByFragment` | JSON-encoded performer fragment | JSON-encoded performer fragment |
+| `performerByURL` | `{"url": "<url>"}` | JSON-encoded performer fragment |
+| `sceneByFragment` | JSON-encoded scene fragment | JSON-encoded scene fragment |
+| `sceneByURL` | `{"url": "<url>"}` | JSON-encoded scene fragment |
+
+For `performerByName`, only `name` is required in the returned performer fragments. One entire object is sent back to `performerByFragment` to scrape a specific performer, so the other fields may be included to assist in scraping a performer. For example, the `url` field may be filled in for the specific performer page, then `performerByFragment` can extract by using its value.
+
+As an example, the following python code snippet can be used to scrape a performer:
+
+```
+import json
+import sys
+import string
+
+def readJSONInput():
+	input = sys.stdin.read()
+	return json.loads(input)
+
+def searchPerformer(name):
+    # perform scraping here - using name for the query
+
+    # fill in the output
+    ret = []
+    
+    # example shown for a single found performer 
+    p = {}
+    p['name'] = "some name"
+    p['url'] = "performer url"
+    ret.append(p)
+    
+    return ret
+
+def scrapePerformer(input):
+    ret = []
+    # get the url from the input
+    url = input['url']
+    return scrapePerformerURL(url)
+
+def debugPrint(t):
+    sys.stderr.write(t + "\n")
+
+def scrapePerformerURL(url):
+    debugPrint("Reading url...")
+    debugPrint("Parsing html...")
+    
+    # parse html
+
+    # fill in performer details - single object
+    ret = {}
+
+    ret['name'] = "fred"
+    ret['aliases'] = "freddy"
+    ret['ethnicity'] = ""
+    # and so on
+
+    return ret
+
+# read the input 
+i = readJSONInput()
+
+if sys.argv[1] == "query":
+    ret = searchPerformer(i['name'])
+    print(json.dumps(ret))
+elif sys.argv[1] == "scrape":
+    ret = scrapePerformer(i)
+    print(json.dumps(ret))
+elif sys.argv[1] == "scrapeURL":
+    ret = scrapePerformerURL(i['url'])
+    print(json.dumps(ret))
+```
+
+### scrapeXPath
+
+This action scrapes a web page using an xpath configuration to parse. This action is valid for `performerByName`, `performerByURL` and `sceneByURL` only.
+
+This action requires that the top-level `xPathScrapers` configuration is populated. The `scraper` field is required and must match the name of a scraper name configured in `xPathScrapers`. For example:
+
+```
+sceneByURL:
+- action: scrapeXPath
+  url: 
+    - pornhub.com/view_video.php
+  scraper: sceneScraper
+```
+
+The above configuration requires that `sceneScraper` exists in the `xPathScrapers` configuration.
+
+#### Use with `performerByName`
+
+For `performerByName`, the `queryURL` field must be present also. This field is used to perform a search query URL for performer names. The placeholder string sequence `{}` is replaced with the performer name search string. For the subsequent performer scrape to work, the `URL` field must be filled in with the URL of the performer page that matches a URL given in a `performerByURL` scraping configuration. For example:
+
+```
+name: Boobpedia
+performerByName:
+  action: scrapeXPath
+  queryURL: http://www.boobpedia.com/wiki/index.php?title=Special%3ASearch&search={}&fulltext=Search
+  scraper: performerSearch
+performerByURL:
+  - action: scrapeXPath
+    url: 
+      - boobpedia.com/boobs/
+    scraper: performerScraper
+xPathScrapers:
+  performerSearch:
+    performer:
+      Name: # name element
+      URL: # URL element that matches the boobpedia.com/boobs/ URL above
+  performerScraper:
+    # ... performer scraper details ...
+```
+
+#### XPath scrapers configuration
+
+The top-level `xPathScrapers` field contains xpath scraping configurations, freely named. The scraping configuration may contain a `common` field, and must contain `performer` or `scene` depending on the scraping type it is configured for. 
+
+Within the `performer`/`scene` field are key/value pairs corresponding to the golang fields (see below) on the performer/scene object. These fields are case-sensitive. 
+
+The values of these may be either a simple xpath value, which tells the system where to get the value of the field from, or a more advanced configuration (see below). For example:
+
+```
+performer:
+  Name: //h1[@itemprop="name"]
+```
+
+This will set the `Name` attribute of the returned performer to the text content of the element that matches `<h1 itemprop="name">...`.
+
+The value may also be a sub-object, indicating that post-processing is required. If it is a sub-object, then the xpath must be set to the `selector` key of the sub-object. For example, using the same xpath as above:
+
+```
+performer:
+  Name: 
+    selector: //h1[@itemprop="name"]
+    # post-processing config values
+```
+
+##### Common fragments
+
+The `common` field is used to configure xpath fragments that can be referenced in the xpath strings. These are key-value pairs where the key is the string to reference the fragment, and the value is the string that the fragment will be replaced with. For example:
+
+```
+common:
+  $infoPiece: //div[@class="infoPiece"]/span
+performer:
+  Measurements: $infoPiece[text() = 'Measurements:']/../span[@class="smallInfo"]  
+```
+
+The `Measurements` xpath string will replace `$infoPiece` with `//div[@class="infoPiece"]/span`, resulting in: `//div[@class="infoPiece"]/span[text() = 'Measurements:']/../span[@class="smallInfo"]`.
+
+##### Post-processing options
+
+The following post-processing keys are available:
+* `concat`: if an xpath matches multiple elements, and `concat` is present, then all of the elements will be concatenated together
+* `replace`: contains an array of sub-objects. Each sub-object must have a `regex` and `with` field. The `regex` field is the regex pattern to replace, and `with` is the string to replace it with. `$` is used to reference capture groups - `` is the first capture group, `` the second and so on. Replacements are performed in order of the array.
+* `subScraper`: if present, the sub-scraper will be executed after all other post-processes are complete and before parseDate. It then takes the value and performs an http request, using the value as the URL. Within the `subScraper` config is a nested scraping configuration. This allows you to traverse to other webpages to get the attribute value you are after. For more info and examples have a look at [#370](https://github.com/stashapp/stash/pull/370), [#606](https://github.com/stashapp/stash/pull/606)
+* `parseDate`: if present, the value is the date format using go's reference date (2006-01-02). For example, if an example date was `14-Mar-2003`, then the date format would be `02-Jan-2006`. See the [time.Parse documentation](https://golang.org/pkg/time/#Parse) for details. When present, the scraper will convert the input string into a date, then convert it to the string format used by stash (`YYYY-MM-DD`).
+* `split`: Its the inverse of `concat`. Splits a string to more elements using the separator given. For more info and examples have a look at PR [#579](https://github.com/stashapp/stash/pull/579)
+
+Post-processing is done in order of the fields above - `concat`, `regex`, `subscraper`, `parseDate` and then `split`.
+
+##### Example
+
+A performer and scene xpath scraper is shown as an example below:
+
+```
+name: Pornhub
+performerByURL:
+  - action: scrapeXPath
+    url: 
+      - pornhub.com
+    scraper: performerScraper
+sceneByURL:
+  - action: scrapeXPath
+    url: 
+      - pornhub.com/view_video.php
+    scraper: sceneScraper
+xPathScrapers:
+  performerScraper:
+    common:
+      $infoPiece: //div[@class="infoPiece"]/span
+    performer:
+      Name: //h1[@itemprop="name"]
+      Birthdate: 
+        selector: //span[@itemprop="birthDate"]
+        parseDate: Jan 2, 2006
+      Twitter: //span[text() = 'Twitter']/../@href
+      Instagram: //span[text() = 'Instagram']/../@href
+      Measurements: $infoPiece[text() = 'Measurements:']/../span[@class="smallInfo"]
+      Height: 
+        selector: $infoPiece[text() = 'Height:']/../span[@class="smallInfo"]
+        replace: 
+          - regex: .*\((\d+) cm\)
+            with: $1
+      Ethnicity: $infoPiece[text() = 'Ethnicity:']/../span[@class="smallInfo"]
+      FakeTits: $infoPiece[text() = 'Fake Boobs:']/../span[@class="smallInfo"]
+      Piercings: $infoPiece[text() = 'Piercings:']/../span[@class="smallInfo"]
+      Tattoos: $infoPiece[text() = 'Tattoos:']/../span[@class="smallInfo"]
+      CareerLength: 
+        selector: $infoPiece[text() = 'Career Start and End:']/../span[@class="smallInfo"]
+        replace:
+          - regex: \s+to\s+
+            with: "-"
+  sceneScraper:
+    common:
+      $performer: //div[@class="pornstarsWrapper"]/a[@data-mxptype="Pornstar"]
+      $studio: //div[@data-type="channel"]/a
+    scene:
+      Title: //div[@id="main-container"]/@data-video-title
+      Tags: 
+        Name: //div[@class="categoriesWrapper"]//a[not(@class="add-btn-small ")]
+      Performers:
+        Name: $performer/@data-mxptext
+        URL: $performer/@href
+      Studio:
+        Name: $studio
+        URL: $studio/@href    
+```
+
+See also [#333](https://github.com/stashapp/stash/pull/333) for more examples.
+
+#### XPath resources:
+
+- Test XPaths in Firefox: https://addons.mozilla.org/en-US/firefox/addon/try-xpath/
+- XPath cheatsheet: https://devhints.io/xpath
+
+#### Object fields
+##### Performer
+
+```
+Name
+Gender
+URL
+Twitter
+Instagram
+Birthdate
+Ethnicity
+Country
+EyeColor
+Height
+Measurements
+FakeTits
+CareerLength
+Tattoos
+Piercings
+Aliases
+Image
+```
+
+*Note:*  - `Gender` must be one of `male`, `female`, `transgender_male`, `transgender_female` (case insensitive).
+
+##### Scene
+```
+Title
+Details
+URL
+Date
+Image
+Studio (see Studio Fields)
+Movies (see Movie Fields)
+Tags (see Tag fields)
+Performers (list of Performer fields)
+```
+##### Studio
+```
+Name
+URL
+```
+
+##### Tag
+```
+Name
+```
+
+##### Movie
+```
+Name
+Aliases
+Duration
+Date
+Rating
+Director
+Synopsis
+URL
+```
+
+### Stash
+
+A different stash server can be configured as a scraping source. This action applies only to `performerByName`, `performerByFragment`, and `sceneByFragment` types. This action requires that the top-level `stashServer` field is configured.
+
+`stashServer` contains a single `url` field for the remote stash server. The username and password can be embedded in this string using `username:password@host`.
+
+An example stash scrape configuration is below:
+
+```
+name: stash
+performerByName:
+  action: stash
+performerByFragment:
+  action: stash
+sceneByFragment:
+  - action: stash
+stashServer:
+  url: http://stashserver.com:9999
+```
+
+### Debugging support
+To print the received html from a scraper request to the log file, add the following to your scraper yml file:
+```
+debug:
+  printHTML: true
+```