57 comments

  • swyx a day ago

    (unsolicited review) we've been happy adopters of LangFuse at AINews (https://smol.ai/news). i've been tracking the llm ops landscape (https://www.latent.space/p/braintrust) for a while and it's very nice to have an open source solution that is so comprehensive and intuitive!

    reflections/thoughts on where this field goes next:

    1. i wonder if there are new ops solutions for the realtime apis popping up

    2. retries for instructor-like structured outputs mess up the traces, i wonder if they can be tracked and made collapsible

    3. chatgpt canvas-like "drafting" workflows are on the rise (https://www.latent.space/p/inference-fast-and-slow) and again it's noisy to see in a chat flow

    4. how often do people actually use the feedback tagging and then subsequently fine-tune? i always feel guilty that i don't do it yet and wonder when and where i should.

    • marcklingen a day ago

      appreciate your constructive feedback!

      > i wonder if there are new ops solutions for the realtime apis popping up

      This is something we have spent quite some time on already, both on designs internally and talking to teams using Langfuse with realtime applications. IMO the usage patterns are still developing and the data capturing/visualization needs across teams are not aligned. What matters: (1) capture streams, (2) for non-text provide timestamped transcripts/labels, (3) capture the difference between user-time and api-level-time (e.g. when catching up on a stream after having categorized the input first).
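
      To make (1)-(3) concrete, here is a rough sketch of how this could be captured with the current low-level Python SDK. Method and parameter names follow the v2 SDK; the realtime-specific parts (transcript chunk events, the separate user-perceived timestamp, the audio_transcript_chunks helper) are assumptions for illustration, not a final design:

          # rough sketch, not a finalized realtime integration
          from datetime import datetime, timezone
          from langfuse import Langfuse

          langfuse = Langfuse()
          trace = langfuse.trace(name="realtime-session", user_id="user_123")

          # (3) keep user-perceived time and api-level time separately
          user_started_at = datetime.now(timezone.utc)
          generation = trace.generation(
              name="realtime-response",
              model="gpt-4o-realtime-preview",        # example model name
              start_time=datetime.now(timezone.utc),  # api-level start
              metadata={"user_started_at": user_started_at.isoformat()},
          )

          # (1)/(2) capture the stream as timestamped transcript chunks
          for chunk in audio_transcript_chunks():     # hypothetical helper
              trace.event(
                  name="transcript_chunk",
                  metadata={"text": chunk.text, "t": chunk.timestamp},
              )

          generation.end(output="final transcript", end_time=datetime.now(timezone.utc))
          langfuse.flush()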

      We are excited to build support for this, if you or others have ideas or a wishlist, please add them to this thread: https://github.com/orgs/langfuse/discussions/4757

      > retries for instructor like structured outputs mess up the traces, i wonder if they can be tracked and collapsible

      Great feedback. Being able to retroactively downrank llm calls to be `debug` level in order to collapse/hide them by default would be interesting. Added thread for this here: https://github.com/orgs/langfuse/discussions/4758

      > chatgpt canvas like "drafting" workflows are on the rise (https://www.latent.space/p/inference-fast-and-slow) and again its noisy to see in a chat flow

      Can you share an example trace for this or open a thread on github? Would love to understand this in more detail as I have seen different trace-representations of it -- the best yet was a _git diff_ on a wrapper span for every iteration.

      > how often do people actually use the feedback tagging and then subsequently finetuning? i always feel guilty that i dont do it yet and wonder when and where i should.

      I have not seen fine-tuning based on user feedback a lot, as the feedback can be noisy and low in frequency (unless there is a very clear feedback loop built into the product). A more common workflow that I have seen: identify new problems via user feedback -> review them manually -> create llm-as-a-judge or other automated evals for this problem -> select "good" examples for fine-tuning based on a mix of different evals that currently run on production data -> sanitize the dataset (e.g. remove PII).

      Fine-tuning has been more popular for structured output and SQL generation (clear feedback loop / retries at run-time if the output does not work). More teams fine-tune on all the output that has passed this initial run-time gate for model distillation, without further quality controls on the training dataset. They usually then run evals on a test dataset in order to verify whether the fine-tuned model hits their quality bar.

  • gmays an hour ago

    Awesome improvements. How does this compare to Braintrust? I've played with it a bit and we're gearing up to implement a solution during the Christmas lull.

    We use various LLMs as a core part of our app but I'm looking for ways to more quickly iterate on our prompts, test different LLM outputs against each other, etc. ideally while minimizing deploys. Would Langfuse serve that purpose?

  • dcreater 10 hours ago

    Thread is filled with positive reviews.. Little odd

    • Maxious 7 hours ago

      > Make sure your friends don't post booster comments. That's not allowed on HN. Our readers have a nose for this, and will sniff them out and flame you. That will damage your reputation—and ours—and we may have to bury your thread.

      https://news.ycombinator.com/yli.html

  • mfdupuis a day ago

    This is actually one of the more interesting LLM observability platforms I've seen. Beyond addressing scaling issues, where do you see yourself going next?

    • marcklingen a day ago

      Positioning/roadmap differs between the different projects in the space.

      We summarized what we strongly believe in here: https://langfuse.com/why (TL;DR: open APIs, self-hostable, LLM/cloud/model/framework-agnostic, API-first, unopinionated building blocks for sophisticated teams, and simple yet scalable instrumentation that is incrementally adoptable).

      Regarding roadmap, this is the near-term view: https://langfuse.com/roadmap

      We work closely with the community, and the roadmap can change frequently based on feedback. GitHub Discussions is very active, so feel free to join the conversation if you want to suggest or contribute a feature: https://langfuse.com/ideas

    • mathiasn a day ago

      What are other potential platforms?

      • marcklingen a day ago

        This is a good long list of projects, although it is not narrowly scoped to tracing/evals/prompt-management: https://github.com/tensorchord/Awesome-LLMOps?tab=readme-ov-...

      • resiros 20 hours ago

        One missing from the list below is Agenta (https://github.com/agenta-ai/agenta).

        We're OSS and OTel-compliant, with a stronger focus on evals and on enabling collaboration between subject-matter experts and devs.

      • suninsight a day ago

        A bunch of them: Langsmith, Lunary, Arize Phoenix, Portkey, Datadog, and Helicone.

        We also picked Langfuse - more details here: https://www.nonbios.ai/post/the-nonbios-llm-observability-pi...

        • ianbicking an hour ago

          "Langsmith appeared popular, but we had encountered challenges with Langchain from the same company, finding it overly complex for previous NonBioS tooling. We rewrote our systems to remove dependencies on Langchain and chose not to proceed with Langsmith as it seemed strongly coupled with Langchain."

          I've never really used Langchain, but set up Langsmith with my own project quite quickly. It's very similar to setting up Langfuse, activated with a wrapper around the OpenAI library. (Though I haven't looked into the metadata and tracing yet.)

          Functionally the two seem very similar. I'm looking at both and am having a hard time figuring out differences.

        • unnikrishnan_r a day ago

          Thanks, this post was insightful. I laughed at the reason you rejected Arize Phoenix; I had similar thoughts while going through their site! =)

          > "Another notable feature of Langfuse is the use of a model as a judge ... this is not enabled in the free version/self-hosted version"

          I think you can add LLM-as-judge to the self-hosted version of Langfuse by defining your own evaluation pipeline: https://langfuse.com/docs/scores/external-evaluation-pipelin...
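
          Roughly, a minimal sketch of such a pipeline, assuming the v2 Python SDK's fetch_traces/score methods and an OpenAI model as the judge (prompt, score name, and model are placeholders):

              # external LLM-as-judge evaluation pipeline (sketch)
              from langfuse import Langfuse
              from openai import OpenAI

              langfuse = Langfuse()
              judge = OpenAI()

              traces = langfuse.fetch_traces(limit=50).data  # recent production traces
              for trace in traces:
                  verdict = judge.chat.completions.create(
                      model="gpt-4o-mini",
                      messages=[{
                          "role": "user",
                          "content": "Rate the helpfulness of this answer from 0 to 1. "
                                     f"Question: {trace.input} Answer: {trace.output} "
                                     "Reply with only the number.",
                      }],
                  ).choices[0].message.content

                  # attach the judge's verdict as a score on the trace
                  # (assumes the judge complies; add parsing/retries in practice)
                  langfuse.score(trace_id=trace.id, name="helpfulness-judge", value=float(verdict))

              langfuse.flush()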

          • suninsight 12 hours ago

            Thanks for the pointer!

            We are actually toying with building out a prompt evaluation platform and were considering extending Langfuse. Maybe we'll just use this instead.

        • barefeg a day ago

          Thanks for sharing your blogpost. We had a similar journey. I installed and tried both Langfuse and Phoenix and ended up choosing Langfuse due to some versioning conflicts on the Python dependency. I'm curious whether your thoughts change after v3? I also liked that Langfuse only depended on Postgres, but the scalable version requires other dependencies.

          The thing I liked about Phoenix is that it uses OpenTelemetry. In the end we’re building our Agents SDK in a way that the observability platform can be swapped (https://github.com/zetaalphavector/platform/tree/master/agen...) and the abstraction is OpenTelemetry-inspired.

          • marcklingen 20 hours ago

            As you mentioned, this was a significant trade-off. We faced two choices:

            (1) Stick with a single Docker container and Postgres. This option is simple to self-host, operate, and iterate on, but it suffers from poor performance at scale, especially for analytical queries that become crucial as the project grows. Additionally, as more features emerged, we needed a queue and benefited from caching and asynchronous processing, which required splitting into a second container and adding Redis. These features would have been blocked had we stuck with this setup.

            (2) Switch to a scalable setup with a robust infrastructure that enables us to develop features that interest the majority of our community. We have chosen this path and prioritized templates and Helm charts to simplify self-hosting. Please let us know if you have any questions or feedback as we transition to v3. We aim to make this process as easy as possible.

            Regarding OTel, we are considering adding a collector to Langfuse as the OTel semantic conventions are now developing well. The needs of the Langfuse community are evolving rapidly, and starting with our own instrumentation allowed us to move quickly while the semantic conventions were still immature. We are tracking this here and would greatly appreciate your feedback, upvotes, or any comments you have on this thread: https://github.com/orgs/langfuse/discussions/2509

          • suninsight 8 hours ago

            So we are still on v2.7 - it works pretty well for us. Haven't tried v3 yet, and we're not looking to upgrade. I think the next big feature set we are looking for is a prompt evaluation system.

            But we are coming around to the view that it is a big enough problem to warrant a dedicated SaaS, rather than piggybacking on observability SaaS. At NonBioS, we have very complex requirements - so we might just end up building it from the ground up.

        • skull8888888 20 hours ago

          We launched Laminar a couple of months ago: https://www.lmnr.ai. Extremely fast, great DX, and written in Rust. Definitely worth a look.

      • calebkaiser a day ago

        I'm a maintainer of Opik, an open source LLM evaluation and observability platform. We only launched a few months ago, but we're growing rapidly: https://github.com/comet-ml/opik

  • kappamax a day ago

    Congrats, Marc! We've been using Langfuse for about 6 months for our LLMOps tooling. While its SDKs are limited to Python and TypeScript, the OpenAPI specification is pretty easy to implement in any language.

    The team behind it is amazing, and their product being OSS is one of the reasons we chose it. But it just keeps getting better.

    We're incidentally only using part of the product because we've implemented most of these new features (prompt caching, execution, etc.) in our app. But with the API you can decide what parts are core to your business logic and outsource the parts you don't want to deal with to Langfuse.

    I appreciate that it's not an opinionated product.

    • marcklingen a day ago

      Thanks for the feedback.

      Being unopinionated and API-first has been a core design decision. We want to build the building blocks that everyone needs while acknowledging that most Langfuse users are very sophisticated teams that have a clear idea of what they want to achieve. Over time we will build more abstractions for common workflows to make it easier to get started but new features will always start API-first.

      More on this here: https://langfuse.com/why

  • lunarcave 21 hours ago

    A happy Langfuse customer here!

    We've been building an agent platform, and some of our customers wanted some way to exfil OTel traces to their own setup. Initially we tried building our own, but then realised Langfuse does exactly what we needed. So we offered it as a first-class integration (and started using it internally).

    Great product, and hope you guys continue to improve it!

  • ddtaylor a day ago

    You guys just saved me a lot of trouble. Amazing work everyone wow.

  • extr a day ago

    Very timely post/update, was just checking out your product. IMO it is one of the best solutions I've looked at. Appreciate your dedication to self-hosting; for us it's not really practical to have traces with potentially sensitive customer data sitting around on some external company's server somewhere (no offense).

    • marcklingen a day ago

      Thank you for the kind words! Let us know if you have any questions or feedback regarding the self-hosting documentation and experience. We collaborate with many teams that have diverse security needs, including HIPAA, PCI, and on-premises deployments on bare metal without internet access.

  • tmshapland a day ago

    Seems like Langfuse is becoming the standard. Whenever I talk to other builders, they're using Langfuse.

    • mdeichmann a day ago

      Thank you! If these builders have some feedback to share, ask them to reach out to us :)

  • punkpeye 21 hours ago

    Been using it. Happy customer. It brought sanity to an otherwise very complex LLM infrastructure. We spend 60k+ every month on LLM calls, so having the backbone to debug when things go haywire has helped a lot.

  • robrenaud a day ago

    I've been using self-hosted Langfuse via LiteLLM in a Jupyter notebook for a few weeks for some synthetic data experiments. It's been a nice/useful tool.

    I've liked having the traces and scores in a unified browser-based UI; it made sanity-checking experiments way easier than doing the same thing inside the notebook.

    The trace/generation retrieval API was brutally slow for bulk scanning operations, so I bypassed it and just queried the DB directly. But that is the beauty of open-source/self-hosted code.
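
    In case it's useful to anyone, the direct query can be as simple as something like this (a rough sketch; table/column names are assumed from the v2 Postgres schema, so verify them against your own instance):

        # bulk scan of generations straight from Postgres, bypassing the API (sketch)
        import psycopg2

        conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/langfuse")
        with conn, conn.cursor() as cur:
            cur.execute(
                """
                SELECT o.id, o.input, o.output, o.start_time, o.end_time
                FROM observations o
                JOIN traces t ON t.id = o.trace_id
                WHERE o.type = 'GENERATION' AND t.name = %s
                ORDER BY o.start_time
                """,
                ("synthetic-data-run",),  # hypothetical trace name for the experiment
            )
            rows = cur.fetchall()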

    • marcklingen a day ago

      Thanks for the feedback, glad that you find Langfuse useful!

      Can you create an issue with more details on the API performance problems? We monitor strict SLOs on the public API for Langfuse Cloud and are not aware of any ongoing issues, would love to learn more.

  • arjunram77 a day ago

    Congratulations @Marc. Been using this product for 5ish months, love the iteration and how the team reacts to feedback. The prompt versioning has been immensely valuable!

    • marcklingen a day ago

      Thanks AJ! Feedback on GitHub/Discord (like yours) has been very helpful in evolving prompt management from a quick addition to the core platform into one of the most-used features -- for which we then actually needed to change a lot of infrastructure to make it reliable and fast (see the blog post linked in the original post).

  • TripleChecker a day ago

    Looks cool! I’d love to see a simple embedding/sharing tool for an LLM playground to share with the non-tech team so they can try it. Is that something Langfuse could do?

    Also, some typos you want to review on the site: https://triplechecker.com/s/655511/langfuse.com

  • jondwillis a day ago

    I promise this isn’t astroturfing ;)

    I happened to have been triaging LLM observability, dataset, and eval solutions yesterday at the day job, and congratulations, Langfuse was the second solution that I tried, and simple enough to get set up locally with my existing stack for me to stop looking (ye olde time constraints, and I know good-enough when I see it!)

    Thanks for your and your team’s work.

    • clemo_ra a day ago

      thank you, that is genuinely nice to hear and motivating for our team.

      we're available if you ever run into any issues (github, email etc.)

  • bewestphal a day ago

    Congrats on the launch :) happy users @ Samsara.

    Key to our LLM customer feedback flywheel and dataset building.

    • marcklingen a day ago

      Thank you! Working with your team has been great. I love seeing you ship LLM-powered features and appreciate the feedback you have shared along the way.

  • nextworddev 18 hours ago

    Been using Langfuse OSS for almost 15 months, from the start. By far the best solution, with none of the dark patterns found in other projects such as Portkey.

    • marcklingen 17 hours ago

      All core features are fully open-source and identical to those in Langfuse Cloud, with no limitations on capabilities or scalability (e.g. all v3 infrastructure changes).

      We also offer some optional commercial add-on features that can help iterate faster or support very large teams using Langfuse. However, these features are entirely optional and we do our best to be transparent about this across our docs.

  • lvkleist a day ago

    Have been a very happy Langfuse user since March - dead simple to use and has helped us a lot with LLM observability and debugging - great work guys :))

  • david1542 a day ago

    Looks awesome! Been using it for over a year now and it's a great product :) The new improvements seem exciting.

  • mritchie712 a day ago

    In this example:

        from langfuse.openai import openai # OpenAI integration
    
    Why do I need to import openai from langfuse?

    • marcklingen a day ago

      This is an optional instrumentation of the OpenAI SDK which simplifies getting started and tracks token counts, model parameters, and streaming latencies.

      Langfuse is not in the critical path, this just helps with instrumentation.
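
      For context, a minimal sketch of the drop-in usage (the model name is just an example):

          # same client surface as the OpenAI SDK; traces, token counts and latencies
          # are reported to Langfuse in the background
          from langfuse.openai import openai  # instead of `import openai`

          completion = openai.chat.completions.create(
              model="gpt-4o-mini",
              messages=[{"role": "user", "content": "Hello"}],
          )
          print(completion.choices[0].message.content)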

      You can use the Langfuse Python SDK / Decorator to track any LLM (with some instrumentation code) or use one of the framework integrations.

      Here is a fully-featured example using the Amazon Bedrock SDK: https://langfuse.com/docs/integrations/amazon-bedrock

    • priompteng a day ago

      Nice work, but sorry, I don't feel comfortable either proxying my LLM calls through a 3rd party (unless the 3rd party is an LLM gateway like LiteLLM or Arch) or storing my prompts in a SaaS. For tracing, I use OTel libraries, which are more than sufficient for my use case.

  • fiehtle 16 hours ago

    great to see how you guys worked with the community on discord over the last year to build Langfuse

    • marcklingen 15 hours ago

      Thanks! IMO, Discord is good, but GitHub Discussions is the better option for building a growing open-source community. It is indexed and makes it easier to revisit conversations weeks later. Currently we use both but have a strong preference for GitHub Discussions.

  • aantti a day ago

    great product & great team, kudos & congrats! :)

  • matthewolfe a day ago

    Great work, guys!

  • krb0 21 hours ago

    Great work! Easy to integrate :)

  • sebselassie a day ago

    great product, so easy to use. love it.

  • tucnak a day ago

    > YC > OSS

    Nice try