easy-scraper 0.1.0

# easy-scraper

An HTML scraping library focused on ease of use.

In this library, matching patterns are written as HTML DOM trees.
You can write patterns intuitively and extract the content you want easily.

# Usage

Add this line to your `Cargo.toml`:

```toml
[dependencies]
easy-scraper = "0.1"
```

# Example

```rust
use easy_scraper::Pattern;

let pat = Pattern::new(r#"
<ul>
    <li>{{foo}}</li>
</ul>
"#).unwrap();

let ms = pat.matches(r#"
<!DOCTYPE html>
<html lang="en">
    <body>
        <ul>
            <li>1</li>
            <li>2</li>
            <li>3</li>
        </ul>
    </body>
</html>
"#);

assert_eq!(ms.len(), 3);
assert_eq!(ms[0]["foo"], "1");
assert_eq!(ms[1]["foo"], "2");
assert_eq!(ms[2]["foo"], "3");
```

# Syntax

## DOM Tree

Any DOM tree is a valid pattern. You can write placeholders such as `{{foo}}` inside it.

```html
<ul>
    <li>{{foo}}</li>
</ul>
```

A pattern matches when it is a subset of the document.

If the document is:

```html
<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>
```

these trees are subsets of it:

```html
<ul>
    <li>1</li>
</ul>
```

```html
<ul>
    <li>2</li>
</ul>
```

```html
<ul>
    <li>3</li>
</ul>
```

So the match result is:

```json
[
    { "foo": "1" },
    { "foo": "2" },
    { "foo": "3" }
]
```

## Child

Child nodes in a pattern match any descendants in the document,
because of the subset rule.

For example, this pattern

```html
<div>
    <li>{{id}}</li>
</div>
```

matches this document:

```html
<div>
    <ul>
        <li>1</li>
    </ul>
</div>
```
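
The descendant rule can be pictured with a small tree walk. The following is only a sketch of the idea, not the library's implementation: the `Element` struct and the `descendant_texts` and `example_document` functions are hypothetical names made up for this illustration.

```rust
/// A minimal element tree, used only for this illustration.
#[derive(Debug)]
struct Element {
    tag: &'static str,
    text: &'static str,
    children: Vec<Element>,
}

/// Collects the text of every descendant of `node`, at any depth,
/// whose tag equals `tag`.
fn descendant_texts(node: &Element, tag: &str) -> Vec<&'static str> {
    let mut found = Vec::new();
    for child in &node.children {
        if child.tag == tag {
            found.push(child.text);
        }
        // Keep descending: a match may sit below intermediate elements.
        found.extend(descendant_texts(child, tag));
    }
    found
}

/// Builds the document from the example above: <div><ul><li>1</li></ul></div>.
fn example_document() -> Element {
    let li = Element { tag: "li", text: "1", children: vec![] };
    let ul = Element { tag: "ul", text: "", children: vec![li] };
    Element { tag: "div", text: "", children: vec![ul] }
}
```

Calling `descendant_texts(&example_document(), "li")` finds the `<li>` even though a `<ul>` sits between it and the `<div>`.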

## Siblings

To avoid useless matches,
siblings in a pattern only match
consecutive children of the same parent.

For example, this pattern

```html
<ul>
    <li>{{foo}}</li>
    <li>{{bar}}</li>
</ul>
```

does not match this document:

```html
<ul>
    <li>123</li>
    <div>
        <li>456</li>
    </div>
</ul>
```

And for this document,

```html
<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>
```

match results are:

```json
[
    { "foo": "1", "bar": "2" },
    { "foo": "2", "bar": "3" }
]
```

`{ "foo": "1", "bar": "3" }` is not included, because those two items are not consecutive children.

You can allow other nodes between siblings by writing `...` in the pattern.

```html
<ul>
    <li>{{foo}}</li>
    ...
    <li>{{bar}}</li>
</ul>
```

The match result for this pattern is:

```json
[
    { "foo": "1", "bar": "2" },
    { "foo": "1", "bar": "3" },
    { "foo": "2", "bar": "3" }
]
```
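
The difference between the two sibling rules can be sketched on a plain list of item texts. This is an illustration of the rules as described above, not the library's code: `consecutive_pairs` and `gapped_pairs` are hypothetical helper names.

```rust
/// Pairs of adjacent items: the default sibling rule.
fn consecutive_pairs<'a>(items: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    items.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Every pair (i, j) with i < j: the rule when `...` separates the two siblings.
fn gapped_pairs<'a>(items: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut pairs = Vec::new();
    for i in 0..items.len() {
        for j in (i + 1)..items.len() {
            pairs.push((items[i], items[j]));
        }
    }
    pairs
}
```

For the items `["1", "2", "3"]`, `consecutive_pairs` yields `("1", "2")` and `("2", "3")`, while `gapped_pairs` additionally yields `("1", "3")`.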

## Attribute

You can specify attributes in patterns.
An attribute pattern matches when the pattern's attributes are a subset of the document's attributes.

This pattern

```html
<div class="attr1">
    {{foo}}
</div>
```

matches this document:

```html
<div class="attr1 attr2">
    Hello
</div>
```
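
One plausible reading of the attribute subset rule is sketched below. It assumes that attribute values are compared as sets of whitespace-separated tokens, which is consistent with the `class` example above but is an assumption, not something confirmed for every attribute. `tokens` and `attrs_match` are hypothetical names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Splits an attribute value into its whitespace-separated tokens.
fn tokens(value: &str) -> BTreeSet<&str> {
    value.split_whitespace().collect()
}

/// Returns true when every pattern attribute exists on the document element
/// and each of its tokens also appears in the document's value.
/// Assumption: token-wise comparison, modeled on the `class` example.
fn attrs_match(pattern: &BTreeMap<&str, &str>, document: &BTreeMap<&str, &str>) -> bool {
    pattern.iter().all(|(name, pat_value)| match document.get(name) {
        Some(doc_value) => tokens(pat_value).is_subset(&tokens(doc_value)),
        None => false,
    })
}
```

Under this reading, a pattern with `class="attr1"` matches an element with `class="attr1 attr2"`, but not the other way around.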

You can also write placeholders in attributes.

```html
<a href="{{url}}">{{title}}</a>
```

For this document,

```html
<a href="https://www.google.com">Google</a>
<a href="https://www.yahoo.com">Yahoo</a>
```

the match result is:

```json
[
    { "url": "https://www.google.com", "title": "Google" },
    { "url": "https://www.yahoo.com", "title": "Yahoo" }
]
```

## Partial text-node pattern

You can write placeholders at arbitrary positions in a text node.

```html
<ul>
    <li>A: {{a}}, B: {{b}}</li>
</ul>
```

For this document,

```html
<ul>
    <li>A: 1, B: 2</li>
    <li>A: 3, B: 4</li>
    <li>A: 5, B: 6</li>
</ul>
```

the match result is:

```json
[
    { "a": "1",  "b": "2" },
    { "a": "3",  "b": "4" },
    { "a": "5",  "b": "6" }
]
```
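
To make the idea of a partial text pattern concrete, here is a minimal sketch of matching a text template against a single string. It is not the library's implementation: `Segment`, `parse_template`, and `match_text` are hypothetical names, and the matcher is deliberately simple (each placeholder captures up to the first occurrence of the literal text that follows it).

```rust
use std::collections::BTreeMap;

/// One piece of a text template: fixed text or a named placeholder.
enum Segment {
    Literal(String),
    Placeholder(String),
}

/// Splits a template such as "A: {{a}}, B: {{b}}" into segments.
fn parse_template(template: &str) -> Vec<Segment> {
    let mut segments = Vec::new();
    let mut rest = template;
    while let Some(start) = rest.find("{{") {
        let len = match rest[start..].find("}}") {
            Some(len) => len,
            None => break, // unterminated placeholder: keep the rest as literal text
        };
        if start > 0 {
            segments.push(Segment::Literal(rest[..start].to_string()));
        }
        let name = rest[start + 2..start + len].trim().to_string();
        segments.push(Segment::Placeholder(name));
        rest = &rest[start + len + 2..];
    }
    if !rest.is_empty() {
        segments.push(Segment::Literal(rest.to_string()));
    }
    segments
}

/// Matches `text` against `template` and returns the captured values by name,
/// or `None` when the literal parts do not line up.
fn match_text(template: &str, text: &str) -> Option<BTreeMap<String, String>> {
    let segments = parse_template(template);
    let mut captures = BTreeMap::new();
    let mut rest = text;
    for (i, segment) in segments.iter().enumerate() {
        match segment {
            Segment::Literal(lit) => rest = rest.strip_prefix(lit.as_str())?,
            Segment::Placeholder(name) => {
                // Capture up to the next literal, or to the end of the text.
                let end = match segments.get(i + 1) {
                    Some(Segment::Literal(lit)) => rest.find(lit.as_str())?,
                    _ => rest.len(),
                };
                captures.insert(name.clone(), rest[..end].trim().to_string());
                rest = &rest[end..];
            }
        }
    }
    if rest.is_empty() { Some(captures) } else { None }
}
```

With this sketch, `match_text("A: {{a}}, B: {{b}}", "A: 1, B: 2")` captures `a` as `"1"` and `b` as `"2"`, and a text that does not start with `A: ` yields `None`.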

## Whole subtree pattern

The pattern `{{var:*}}` matches a whole subtree and captures it as a string.

```html
<div>{{body:*}}</div>
```

For this document,

```html
<div>
    Hello
    <span>hoge</span>
    World
</div>
```

the match result is:

```json
[
    { "body": "Hello<span>hoge</span>World" }
]
```
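
The shape of that captured string can be reproduced with a small serializer. This is a sketch under stated assumptions, not the library's code: the `Node` enum and `serialize` function are hypothetical, attributes are left out, and surrounding whitespace in text nodes is simply trimmed.

```rust
/// A minimal DOM node for this illustration (attributes omitted).
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Serializes a list of sibling nodes back to markup,
/// trimming whitespace around each text node.
fn serialize(nodes: &[Node]) -> String {
    let mut out = String::new();
    for node in nodes {
        match node {
            Node::Text(text) => out.push_str(text.trim()),
            Node::Element { tag, children } => {
                out.push_str(&format!("<{}>{}</{}>", tag, serialize(children), tag));
            }
        }
    }
    out
}
```

Serializing the three children of the matched element above (a text node, a `<span>`, and another text node) gives `Hello<span>hoge</span>World`.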

## White-space

Whitespace is ignored in almost all positions.

# Restrictions

* A whole-subtree pattern must be the only child of its parent node.

This is valid:

```html
<div>
    {{foo:*}}
</div>
```

These are invalid:

```html
<div>
    hoge {{foo:*}}
</div>
```

```html
<ul>
    <li></li>
    {{foo:*}}
    <li></li>
</ul>
```