Getting Started

After reading this document you’ll have a good idea of what you need to do to get your first emit graph running, both in local memory and with Celery.

Installing

Install Emit with pip:

pip install emit

Quickstart

For a sampler, we’re going to make a simple command-line application that takes a document and counts all of its words, giving you the top 5.

Put the following into graph.py:

from __future__ import print_function
from collections import Counter
from emit import Router
import sys

router = Router()


def prefix(name):
    return '%s.%s' % (__name__, name)


@router.node(('word',), entry_point=True)
def words(msg):
    print('got document')
    for word in msg.document.strip().split(' '):
        yield word


WORDS = Counter()


@router.node(('word', 'count'), prefix('words'))
def count_word(msg):
    print('got word (%s)' % msg.word)

    global WORDS
    WORDS.update([msg.word])

    return msg.word, WORDS[msg.word]

if __name__ == '__main__':
    router(document=sys.stdin.read())

    print()
    print('Top 5 words:')
    for word, count in WORDS.most_common(5):
        print('    %s: %s' % (word, count))

(Incidentally, this file is available in the project directory as examples/simple/graph.py.)

Now, on the command line, run echo "the rain in spain falls mainly on the plain" | python graph.py. You should get output similar to the following:

got document
got word (the)
got word (rain)
got word (in)
got word (spain)
got word (falls)
got word (mainly)
got word (on)
got word (the)
got word (plain)

Top 5 words:
    the: 2
    on: 1
    plain: 1
    mainly: 1
    rain: 1
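The sample input is a single line with single spaces between words, which is the case that split(' ') in graph.py handles cleanly. As a small aside, here is a sketch using only Python’s built-in string methods (the variable names are illustrative and not part of the example file) that shows how splitting on a single space differs from splitting on any whitespace when the input is messier:

```python
# Illustrative input: contains a newline and a double space.
document = 'the rain\nin  spain\n'

# Splitting on a single space, as graph.py does: the newline stays inside a
# "word" and the double space produces an empty string.
on_spaces = document.strip().split(' ')

# Splitting with no argument splits on any run of whitespace.
on_whitespace = document.split()

print(on_spaces)
print(on_whitespace)
```

If you plan to feed multi-line documents into the graph, splitting on any whitespace is probably what you want.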

Breaking it Down

First, we need to construct a router:

router = Router()

Since we’re keeping everything in-memory, we don’t need to specify anything to get this to work properly. It should “Just Work(TM)”.

Next, we define a function to split apart a document on spaces to get words:

@router.node(('word',), entry_point=True)
def words(msg):
    print('got document')
    for word in msg.document.strip().split(' '):
        yield word

Router provides a decorator, router.node. Its first argument names the fields that the decorated function returns. The returned values are wrapped in a message and passed around between functions.

We don’t specify any subscriptions on this function, since it doesn’t need any. Instead, we mark it as an entry point, which means that if you call the router directly it will delegate to this function. There can be multiple functions with entry_point set to True on a given Router.

If the decorated function is a generator, each yielded value is treated as a separate input into the next nodes in the graph.
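To make the generator behavior concrete, here is a plain-Python sketch that does not use Emit at all. The fan_out helper is hypothetical and exists only for this illustration; it is not how Emit is implemented, just a picture of "each yielded value becomes a separate downstream call":

```python
from collections import Counter
import inspect


def words(document):
    """Generator node: yields one word at a time."""
    for word in document.strip().split(' '):
        yield word


counts = Counter()


def count_word(word):
    """Downstream node: counts the word and returns (word, running count)."""
    counts.update([word])
    return word, counts[word]


def fan_out(node, downstream, **fields):
    """Hypothetical helper: call `node`, then pass each result downstream.

    If the node returned a generator, every yielded value gets its own
    downstream call; otherwise the single return value is passed on once.
    """
    result = node(**fields)
    values = result if inspect.isgenerator(result) else [result]
    return [downstream(value) for value in values]


results = fan_out(words, count_word, document='the rain in the plain')
print(results)
```

One call into words turns into five calls into count_word here, one per yielded word.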

Splitting the document into parts is only as useful as what we can do with the words, so let’s count them now:

WORDS = Counter()
@router.node(('word', 'count'), prefix('words'))
def count_word(msg):
    print('got word (%s)' % msg.word)

    global WORDS
    WORDS.update([msg.word])

    return msg.word, WORDS[msg.word]

There’s a little less going on in this function. We just update a Counter (from the standard library’s collections module), and then return the word and the count to be passed down the graph. In real life, you’d probably persist this value in a database to allow multiple workers to process different parts of the stream.
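One detail worth noticing in count_word is that the word is wrapped in a list before being passed to update. The short sketch below uses only collections.Counter (the variable names are illustrative) to show why: update treats its argument as an iterable of items to count, and a bare string is an iterable of characters.

```python
from collections import Counter

# Wrapped in a list: the whole word is counted once.
wrapped = Counter()
wrapped.update(['the'])

# Passed as a bare string: each character is counted instead.
unwrapped = Counter()
unwrapped.update('the')

print(wrapped)
print(unwrapped)
```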

In non-entry nodes, the second argument of router.node is a string or list naming the functions to subscribe to. These names need to be fully qualified when you’re using Celery; the prefix helper in graph.py builds them from __name__, which is fine for now.
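The name that the prefix helper produces depends on how the file is loaded, because Python sets __name__ to '__main__' when a file is run as a script and to the module name when it is imported. The sketch below uses a hypothetical two-argument variant, qualified_name, so that both cases can be shown side by side; how Emit resolves these names internally is not covered here.

```python
def qualified_name(module_name, name):
    """Hypothetical variant of prefix() that takes the module name explicitly."""
    return '%s.%s' % (module_name, name)


# Run as `python graph.py`, __name__ is '__main__'.
as_script = qualified_name('__main__', 'words')

# Imported as `import graph`, __name__ is 'graph'.
as_module = qualified_name('graph', 'words')

print(as_script)
print(as_module)
```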

Now that we’ve defined both functions, it’s time to send some data into our graph:

    router(document=sys.stdin.read())

Calling this graph is easy, since we defined a function as an entry point. You can call any of the functions (or the router itself) by using keyword arguments or passing a dictionary.
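Whichever calling style you use, the node sees the same thing: a message whose fields are read as attributes, such as msg.document. The Message class below is a hypothetical stand-in written for this illustration, not Emit’s own class; it only sketches how keyword arguments and a dictionary can end up as the same attribute-style message.

```python
class Message(object):
    """Hypothetical stand-in for a message: fields are readable as attributes."""

    def __init__(self, *args, **kwargs):
        # dict() accepts a mapping, keyword arguments, or both.
        self.__dict__.update(dict(*args, **kwargs))


from_kwargs = Message(document='the rain in spain')
from_dict = Message({'document': 'the rain in spain'})

print(from_kwargs.document)
print(from_dict.document)
```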

In the end, data flows through the graph like this:

[Figure: data flow through the graph (_images/graph2.png)]
