Parsing Markdown

!!! info "🚧 work in progress" (TODO: add more examples)

I created some helpers that split Markdown documents into better chunks.

  • ParseMarkdownWithHierarchy chunks a markdown document while maintaining semantic meaning and preserving the relationship between sections.
chunks := content.ParseMarkdownWithHierarchy(document)

func ParseMarkdownWithHierarchy(document string) []Chunk

You will get the following data:

chunk := Chunk{
    Level:        level,
    Prefix:       prefix,
    Header:       header,
    Content:      strings.TrimSpace(content),
    ParentPrefix: parent.Prefix,
    ParentLevel:  parent.Level,
    ParentHeader: parent.Header,
}
You can then use the ParentPrefix, ParentLevel, and ParentHeader fields to add metadata when creating the vectors.
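
As an illustration of using the parent fields as metadata, here is a minimal sketch that prepends the parent and section titles to the chunk text before it is embedded. The local `Chunk` type only mirrors the fields listed above (the field types are assumptions), and `embeddingText` is a hypothetical helper, not part of the library.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk is a local stand-in that mirrors the fields shown above.
// The field types are assumptions, not the library's definition.
type Chunk struct {
	Level        int
	Prefix       string
	Header       string
	Content      string
	ParentPrefix string
	ParentLevel  int
	ParentHeader string
}

// embeddingText is a hypothetical helper: it puts the parent and section
// titles in front of the chunk content so the embedded text carries its context.
func embeddingText(c Chunk) string {
	var b strings.Builder
	if c.ParentHeader != "" {
		fmt.Fprintf(&b, "Parent section: %s\n", c.ParentHeader)
	}
	fmt.Fprintf(&b, "Section: %s\n\n", c.Header)
	b.WriteString(c.Content)
	return b.String()
}

func main() {
	c := Chunk{
		Level:        2,
		Prefix:       "##",
		Header:       "Professional Development and Education",
		Content:      "... some text ...",
		ParentPrefix: "#",
		ParentLevel:  1,
		ParentHeader: "Tiefling Species in Fantasy Realms: A Comprehensive Analysis",
	}
	fmt.Println(embeddingText(c))
}
```

The same fields could instead be stored as separate metadata attributes next to the vector, depending on what your vector store supports.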

  • ParseMarkdownWithLineage parses the given markdown content and returns a slice of Chunk structs. Each Chunk represents a header and its associated content, along with its hierarchical lineage.
chunks := content.ParseMarkdownWithLineage(document)

func ParseMarkdownWithLineage(document string) []Chunk

You will get the following data:

chunk := Chunk{
    Level:        level,
    Prefix:       prefix,
    Header:       header,
    Content:      strings.TrimSpace(content),
    ParentPrefix: parent.Prefix,
    ParentLevel:  parent.Level,
    ParentHeader: parent.Header,
    Lineage:      lineage,
}
You can then use the Lineage field to add metadata when creating the vectors.

Lineage holds the path of section headers leading to the chunk. For example, with this document:

# Tiefling Species in Fantasy Realms: A Comprehensive Analysis

... some text ...

## Professional Development and Education

... some text ...

The Lineage value of the second section's chunk will be:

Tiefling Species in Fantasy Realms: A Comprehensive Analysis > Professional Development and Education

Note

👀 You will find a complete example in: